
Jan 23, 2023


Unlock the Secrets of R

G.Janacek

Contents

1 What is R?
1.0.1 Getting up and running
1.0.2 Getting R
1.0.3 A first R session
1.0.4 An example
1.0.5 Quitting R

2 Commands, objects and functions
2.0.6 Basics
2.0.7 Objects
2.0.8 Basic objects
2.0.9 Factors
2.0.10 Control
2.0.11 Repetitive execution: for loops, repeat and while
2.0.12 Functions
2.1 Vectors, Matrices and Data Frames
2.1.1 Vectors
2.1.2 Matrices
2.1.3 The Dataframe

3 Input and Output

4 R help and documentation
4.0.4 The Plot Command
4.1 Saving Graphs
4.2 Adding Packages
4.2.1 Windows
4.2.2 OS X
4.3 Some examples
4.3.1 Packages you already have

5 Appendix 1: Vocabulary

6 Appendix 2: Numerical Types

7 Regression

8 Multivariate Data

1 What is R?

The S programming language was created by John Chambers for doing statistical analysis and was later implemented in the S-PLUS system. Later Ross Ihaka and Robert Gentleman created R, named partly as a play on the name S from which it is derived.

The source code for the R software environment is written primarily in C, Fortran and R itself, and is freely available under the GNU General Public License. Free pre-compiled binary versions are provided for various operating systems. R development is now mostly done by the Core Development Team, see

http://www.rproject.org/contributors.

R is a powerful statistical programming language and has a broad set of facilities for doing statistical analyses. Because it is open source, new packages become available as new statistical techniques are developed. Consequently there is R code available to do pretty much any sort of analysis that you can think of, and if you do manage to think of a new one you can always write the code yourself.

In short, R will carry out analyses that are difficult or impossible in many other packages.

R has a broad range of graph-drawing tools, which make it very easy to produce publication-standard graphs. Because R is produced by people who know about data presentation, the default options for R graphs are simple, elegant and sensible.

On top of this base graphics system there are a number of additional graph packages available that give you whole new graphics systems. For examples of the graphics have a look at

http://rgraphgallery.blogspot.co.uk.

1.0.1 Getting up and running

Learning R takes some effort. However, just as with any new natural language, useful things can be done before achieving fluency. I think that the process of learning R can be broken down into the following five stages:

1. Understand something of the environment in which the R programming language is maintained. Become familiar with the resources available. Install R on your computer and run a test script.

2. Read csv files into data frames and use R functions to perform statistical analyses in a familiar area.

3. Use the basic control structures of the R language to write simple programs. Write your own functions, become familiar with the data structures included in R and begin to explore the rich features of the language. Interface with databases, web pages and other external data sources.

4. Write complex programs in the language. Develop an understanding of the deep structure of the language: S3 and S4 objects, closures etc.

5. Develop programs for production use. Write an R package.

Completing Stage 2, with a bit of Stage 3, is normally all that most people need. Once you become familiar with the libraries of R functions that are important to your field, this is usually sufficient.

1.0.2 Getting R

You will find R on most university machines, but be aware that if you do not have administrative privileges on the machine you are using then your use will be limited. If you have your own machine then R is freely downloadable from

http://cran.r-project.org/,

as is a wide range of documentation. If you are using Windows or OS X you can download it and run the installer. It is simple and easy to get running - honest. If you use Linux then it also poses few problems.

Bear in mind everything is to be found at

http://www.r-project.org/

and

http://cran.r-project.org/

1.0.3 A first R session

So we have downloaded a copy of R. Starting the application, we see something like

R version 3.0.3 (2014-03-06) -- "Warm Puppy"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[R.app GUI 1.63 (6660) x86_64-apple-darwin10.8.0]

[Workspace restored from /Users/jan/.RData]
[History restored from /Users/jan/.Rapp.history]

>

and you are faced with the command prompt without any nice buttons or menus to help you.

1.0.4 An example

This is the point where things usually start going pear-shaped for many people, so we look at a simple analysis to see what is in store.

Here we have the birth weights of babies from the excellent book by Annette Dobson.

Age  Boy   Age  Girl
40   2968  40   3317
38   2795  36   2729
40   3163  40   2935
35   2925  38   2754
36   2625  42   3210
37   2847  39   2817
41   3292  40   3126
40   3473  37   2539
37   2628  36   2412
38   3176  38   2991
40   3421  39   2875
38   2975  40   3231

Table 1: Birth weight and age in weeks

Before doing any sort of analysis we need to get our data loaded into the programme. For a big data set you would do this by entering the data into a spreadsheet or text file and importing it to R (see the section on importing data later), but with a small dataset you can enter the data directly at the command prompt.


Being lazy I will just take the first two columns. We input the age as follows:

> age<-scan()

1: 40 38 40 35 36 37 41 40 37 38 40 38

13:

Read 12 items

> boy<-scan()

1: 2968 2795 3163 2925 2625 2847 3292 3473 2628 3176 3421 2975

13:

Read 12 items

>

The command scan() reads data into the system until it gets a blank line. You can cut and paste! Thus we have created two sets of numbers (vectors) in R called age and boy. The left arrow

<-

means take what's to the right of the arrow and make an object with the name that's to the left of the arrow.

If you prefer, use =. You can check that an object is correct by just typing its name and pressing enter.

> age

[1] 40 38 40 35 36 37 41 40 37 38 40 38

>
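If you would rather not type at the scan() prompt, the same two vectors can be built with the c() function, which is introduced properly later on. A minimal sketch, which should give the same objects as the scan() session above:

```r
# Same data as entered with scan() above, built with c() instead
age <- c(40, 38, 40, 35, 36, 37, 41, 40, 37, 38, 40, 38)
boy <- c(2968, 2795, 3163, 2925, 2625, 2847, 3292, 3473, 2628, 3176, 3421, 2975)

length(age)   # number of values entered
length(boy)
mean(boy)     # average birth weight of the boys
```

This form is handy in a script, because the data and the commands then live in one file that can be re-run.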

Now your data are entered into R you can take a look at them. It's always a good idea to visualise your data before doing any analysis, and you can do this by asking R to plot the data out for you. Type

plot(age,boy)

and get a scatterplot. Not quite publication quality but not bad.

While we are at it we can try a linear regression of Boy on Age. I know that regression is lm in R speak so using lm we get

> rbaby=lm(boy~age) # Note there is no response!

>

> summary(rbaby)

Call:
lm(formula = boy ~ age)

Residuals:
    Min      1Q  Median      3Q     Max
-246.69 -151.20  -29.16  194.59  274.28

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1268.67    1239.97  -1.023  0.33035
age           111.98      32.31   3.466  0.00606 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 200.9 on 10 degrees of freedom
Multiple R-squared:  0.5457,  Adjusted R-squared:  0.5003
F-statistic: 12.01 on 1 and 10 DF,  p-value: 0.006065

# more plots are available

> plot(rbaby)

This has everything you need. It tells you exactly what sort of analysis has been done, and the names of the variables you asked it to analyse.

You get a significance test and a p-value, telling you that there is a statistically significant coefficient.

That's it. Data has been input, we have drawn a graph and carried out a statistical test.
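The fitted object can also be queried directly rather than read off the printed summary. The sketch below refits the model so that it runs on its own; coef() and predict() are standard R functions, and the age of 39 weeks is just an illustrative value that does not come from the text.

```r
age <- c(40, 38, 40, 35, 36, 37, 41, 40, 37, 38, 40, 38)
boy <- c(2968, 2795, 3163, 2925, 2625, 2847, 3292, 3473, 2628, 3176, 3421, 2975)

rbaby <- lm(boy ~ age)

coef(rbaby)   # intercept and slope, as in the Coefficients table

# Predicted birth weight at 39 weeks (illustrative value only)
predict(rbaby, newdata = data.frame(age = 39))
```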

This has illustrated some important points about how R works. You do things by typing commands and pressing enter. We are using a statistical language so

boy~age

means that boy is related to age. You can create data objects such as rbaby which contain your results, and give commands that make R do things like carry out tests and draw graphs. Some of these commands have self-evident names like plot but be assured many do not.

You will find out that if you make mistakes in what you are typing the error messages are rarely helpful.

1.0.5 Quitting R

To quit an R session use the function call q(). When you use R you create a workspace in which all the objects you have created are stored. When you try and quit you will be asked if you wish to save your (workspace) session.

2 Commands, objects and functions

We now introduce some of the fundamental concepts that you need to know to be a confident user of R.

2.0.6 Basics

You interact with R by typing something into the console at the command prompt and pressing enter. If what you have typed makes sense to R it will do what it's told to do; if not then you will get an error message. The simplest things you can get R to do are straightforward sums. Just type in a calculation and press enter.

> 1+2+3

[1] 6

> 27*12

[1] 324

> 3/7

[1] 0.4285714

> 5^5

[1] 3125

> (1/4+2+3)*2.7*12*(-0.003^7)

[1] -3.720087e-16

>

Why does the answer have a [1] in front? We shall see later. When you're asking R to do calculations that are more complex than just a single arithmetic operation it will carry out the calculations in the usual PEDMAS order (parentheses, exponents, division and multiplication, addition and subtraction), working from left to right among operators of equal rank.

Most sensible people use brackets!
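A few one-line checks make the precedence rules concrete. This is only a sketch; note in particular that the power operator binds more tightly than a leading minus sign, which is why the calculation above with -0.003^7 is negative, and that repeated powers are the one common case evaluated from right to left.

```r
2 + 3 * 4      # multiplication first: 14
(2 + 3) * 4    # brackets first: 20
-2^2           # read as -(2^2): -4
(-2)^2         # 4
2^3^2          # powers go right to left: 2^(3^2) = 512
```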

These simple calculations do not produce any kind of output that is remembered by R: the answers just appear in the console window. If you want to do further calculations with the answer to a calculation you need to give it a name and tell R to store it as an object.

We might wish to store the answer to a calculation, say add 1 and 2 and store the answer in an object called example1. To find out what is stored in example1, just type the name of the object:

> example1=1+2

> example1

[1] 3

> example2<- 1+2

> example2


[1] 3

>

Take particular note of the middle symbol in the instruction above, <-. This is the assign symbol, formed of a less-than sign and a hyphen, and it looks like a left arrow. It means make the object on the left into the output of the command on the right. It also works the other way around: 2+2->answer. You can also use an equals sign = for assignment but it can cause confusion with other uses of the equals sign.

It's quite common when using R to type in a command and see nothing happen, except for the command prompt popping up again. When nothing seems to happen that means that there have been no errors, at least as far as R is concerned, which implies that everything has gone smoothly.

2.0.7 Objects

We have created objects like example1 and example2, which are just variables. Note that when we did the regression above using lm we created an object rbaby which contained the results. To see the results contained in an object we used the command summary. To examine what is inside an object we can also use attributes() and then look inside using the dollar symbol

> attributes(rbaby)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
 [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"
[11] "terms"         "model"

$class
[1] "lm"

> # Now we can look at each element - hash is comment
> rbaby$qr

$qr

(Intercept) age

1 -3.4641016 -132.7905619

2 0.2886751 6.2182527

3 0.2886751 -0.2079874

4 0.2886751 0.5960970

5 0.2886751 0.4352802

6 0.2886751 0.2744633

7 0.2886751 -0.3688042

8 0.2886751 -0.2079874


9 0.2886751 0.2744633

10 0.2886751 0.1136464

11 0.2886751 -0.2079874

12 0.2886751 0.1136464

attr(,"assign")

[1] 0 1

$qraux

[1] 1.288675 1.113646

$pivot

[1] 1 2

$tol

[1] 1e-07

$rank

[1] 2

attr(,"class")

[1] "qr"

> rbaby$coefficients

(Intercept) age

-1268.6724 111.9828

>

You can use objects in calculations in exactly the same way as the numbers used above. You can also store the results of a calculation done with objects as another object.

When you first open R, there are no objects stored, but after a while you might have lots. You can get a list of what's there by using the command ls().

You can remove an object from R's memory by using the rm() function. Notice that when you type this it does not ask you if you are sure, or give you any other sort of warning, nor does it let you know whether it's done as you asked. The object you asked it to remove has just gone: you can confirm this by using ls() again. If you try to delete a non-existent object you get a warning message.
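As a small sketch of this workflow, the lines below create an object, check that it is in the workspace, remove it and check again. The name tmp_example is made up for the illustration.

```r
tmp_example <- 1 + 2          # hypothetical object, used only here
"tmp_example" %in% ls()       # TRUE: it is listed in the workspace

rm(tmp_example)               # removed silently, no confirmation asked
"tmp_example" %in% ls()       # FALSE: it has gone
```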

2.0.8 Basic objects.

Internally R works with collections of values, so the basic numerical object is a sequence of numbers (a vector) and a single number is a vector of length 1. Hence the [1]'s appearing above when we print a number: the [1] is the index of the first element shown on that line.

There are six basic (atomic) vector types: logical, integer, real, complex, string (or character) and raw. The modes and storage modes for the different vector types are given in the table in the appendix.

Single numbers, such as 4.2, and strings, such as "four point two", are still vectors, of length 1; there are no more basic types. Vectors with length zero are possible (and useful). R can use strings of characters as objects. These have to be entered with quote marks around them because otherwise R will think that they're the names of objects and return an error when it can't find them. So

> name=c("brodwin","blodwin","r")

> name

[1] "brodwin" "blodwin" "r"

>

> w="four point 2"

> w

[1] "four point 2"

> w=c("A","B","C","D")

> w

[1] "A" "B" "C" "D"

Note: I have got ahead of myself and used the concatenate command c. This sticks elements together as above.

String vectors have mode and storage mode "character" and a single element of a character vector is often referred to as a character string.

You can also have TRUE, FALSE, NA and NaN. TRUE and FALSE are the obvious logical values, NA marks a missing value, and NaN is "not a number", the result of an undefined calculation such as 0/0.
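The difference between a missing value and "not a number" is easiest to see by trying both. A short sketch using only base R functions:

```r
x <- c(1, NA, 3)         # NA marks a missing value
is.na(x)                 # FALSE TRUE FALSE
mean(x)                  # NA: missing values propagate through calculations
mean(x, na.rm = TRUE)    # 2: drop the missing value first

0/0                      # NaN: an undefined numerical result
is.nan(0/0)              # TRUE
is.na(0/0)               # TRUE as well: NaN also counts as missing
is.nan(NA)               # FALSE: a plain NA is not NaN
```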

2.0.9 Factors

A special type of data in R is a factor. When we're collecting data we might record whether a subject is male or female, or whether a cricket is winged, wingless or intermediate. This type of data, where things are divided into classes, is called categorical or nominal data, and in R it is stored as a factor. We can input nominal data into R as numbers if we assign a number to each category, such as 1=red, 2=green and 3=blue, and then tell R to make it a factor with the factor() function, but this can lead to confusion. Usually it's better to input data like this as the words themselves as character data and then tell R to make it a factor. So

> gender =c("female","female","female","female","male","male","male")


> gender=factor(gender)

> gender

[1] female female female female male male male

Levels: female male
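Once the data are a factor, a few standard functions show what R has stored. The sketch below rebuilds the gender factor so that it runs on its own.

```r
gender <- factor(c("female", "female", "female", "female", "male", "male", "male"))

levels(gender)       # the categories, in alphabetical order by default
table(gender)        # counts per category: 4 female, 3 male
as.integer(gender)   # the underlying integer codes: 1 1 1 1 2 2 2
```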

2.0.10 Control

If statements: The language has available a conditional construction of the form

> if (expr_1) expr_2 else expr_3

where expr_1 must evaluate to a single logical value and the result of the entire expression is then evident.

> x=3

> if (x>0)(sqrt(x))else(NA)

[1] 1.732051

> x=-2

> if (x>0)(sqrt(x))else(NA)

[1] NA

>

The short-circuit operators && ("and") and || ("or") are often used as part of the condition in an if statement. Whereas & and | apply element-wise to vectors, && and || apply to vectors of length one, and only evaluate their second argument if necessary.
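A short sketch of the difference; the labels returned by the if statement are made up for the illustration.

```r
x <- -2
# && stops as soon as the answer is known, so sqrt(x) is never evaluated here
res <- if (x > 0 && sqrt(x) < 2) "small positive" else "something else"
res

v <- c(-1, 4, 9)
v > 0 & sqrt(abs(v)) < 3    # element-wise: FALSE TRUE FALSE
```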

There is a vectorized version of the if/else construct, the ifelse function. This has the form

ifelse(condition, a, b)

and returns a vector of the length of its longest argument, with elements a[i] if condition[i] is true, otherwise b[i].

> x=rnorm(20)

> x

[1] 1.07946911 -0.38368774 0.59394725 -0.59558286 1.14503714 0.85581439 0.31156696 -0.85308259

[9] 0.86578634 -0.87393923 -1.06192024 -0.61750575 0.05840533 -0.92554621 2.16806721 0.40747540

[17] -0.04350849 -0.63534693 0.61456348 -0.30498590

> ifelse(x>0 ,sqrt(x),sqrt(-x))

[1] 1.0389750 0.6194253 0.7706797 0.7717401 1.0700641 0.9251024 0.5581818 0.9236247 0.9304764

[10] 0.9348472 1.0304951 0.7858153 0.2416720 0.9620531 1.4724358 0.6383380 0.2085869 0.7970865

[19] 0.7839410 0.5522553

2.0.11 Repetitive execution:for loops, repeat and while

There is also a for loop construction which has the structure

> for (name in expr_1) expr_2

where name is the loop variable, expr_1 is a vector expression (perhaps a sequence like 1:20), and expr_2 is often a grouped expression with its sub-expressions written in terms of the dummy name. expr_2 is repeatedly evaluated as name ranges through the values in the vector result of expr_1.

> for (i in 1:10){

+ x=i^2

+ print(x)}

[1] 1

[1] 4

[1] 9

[1] 16

[1] 25

[1] 36

[1] 49

[1] 64

[1] 81

[1] 100

Warning: for() loops are used in R code much less often than in compiled languages. Code that takes a whole object view is likely to be both clearer and faster in R. Thus

> x=1:10

> x

[1] 1 2 3 4 5 6 7 8 9 10

> sum(x)

[1] 55

Functions such as lapply(), tapply() and sapply() take the same whole object view.

Other looping facilities include the repeat expr statement and the while (condition) expr statement. The break statement can be used to terminate any loop, possibly abnormally. This is the only way to terminate repeat loops.
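The constructs mentioned above can be sketched in a few lines. The stopping values 50 and 20 are arbitrary choices for the illustration.

```r
# Whole-object alternatives to the for loop that printed the squares
(1:10)^2
squares <- sapply(1:10, function(i) i^2)

# while: keep going as long as the condition holds
i <- 1
while (i^2 < 50) i <- i + 1   # stops at the first i with i^2 >= 50
i

# repeat: runs until break is reached
total <- 0
k <- 0
repeat {
  k <- k + 1
  total <- total + k
  if (total > 20) break       # leave once the running sum exceeds 20
}
c(k, total)
```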

2.0.12 Functions

You can get so far by typing in calculations, but that is not much use for most statistical analyses. Remember that while R is really a programming language it comes with a huge variety of (mostly) short ready-made pieces of code that will do things like manage your data, do complex mathematical operations on your data, draw graphs and carry out statistical analyses ranging from the simple and straightforward to the eye-wateringly complex.

These ready-made pieces of code are called functions. A function is called by following its name with a pair of brackets, e.g. lm(), and for many of the more straightforward functions you just type the name of the function and put the name of the object you would like the procedure carried out on in the brackets.

You can carry out more complex calculations by making the argument of the function (the bit between the brackets) a calculation itself:

> log(3)

[1] 1.098612

> log(3*5/13)

[1] 0.1431008

> log(sin(3))

[1] -1.958145

You can use functions in creating new objects. In our example above the function lm (linear model) was used to create an object rbaby while plot provided a plot.

One of the problems with R is learning the names of the functions. It is a bit like magic: you need the name of the spell! You need a dictionary or a crib to get you going, so you need to do some reading to acquire a vocabulary.

A function is defined by an assignment of the form

name <- function(arg_1, arg_2, ...) expression

The expression is an R expression (usually a grouped expression, i.e. one enclosed in braces { ... }) that uses the arguments to calculate a value. The value of the expression is the value returned for the function. A call to the function then usually takes the form

name(expr_1, expr_2, ...)

and may occur anywhere a function call is legitimate.

A nice thing about most R functions is that they have default values specified for most of their arguments, and if nothing is specified the function will just use the default value.

As an example, consider a function to calculate the two sample t-statistic, showing all the steps. This is an artificial example, of course, since there are other, simpler ways of achieving the same end. The function is defined as follows:

> twosam <- function(y1, y2) {

n1 <- length(y1); n2 <- length(y2)

yb1 <- mean(y1); yb2 <- mean(y2)

s1 <- var(y1); s2 <- var(y2)

s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)

tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2))

tst

}


With this function defined, you could perform two sample t-tests using a call such as

tstat <- twosam(data$male, data$female); tstat
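As a check, the sketch below applies twosam to the boys' and girls' birth weights from Table 1 and compares the result with the built-in t.test(). The vectors boy and girl stand in for data$male and data$female, which are not defined in these notes; var.equal = TRUE asks t.test() for the pooled-variance statistic that twosam computes.

```r
twosam <- function(y1, y2) {
  n1 <- length(y1); n2 <- length(y2)
  yb1 <- mean(y1); yb2 <- mean(y2)
  s1 <- var(y1); s2 <- var(y2)
  s <- ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)   # pooled variance
  (yb1 - yb2) / sqrt(s * (1 / n1 + 1 / n2))
}

boy  <- c(2968, 2795, 3163, 2925, 2625, 2847, 3292, 3473, 2628, 3176, 3421, 2975)
girl <- c(3317, 2729, 2935, 2754, 3210, 2817, 3126, 2539, 2412, 2991, 2875, 3231)

tstat <- twosam(boy, girl)
tstat

# Built-in equivalent: the statistic should agree with tstat
builtin <- t.test(boy, girl, var.equal = TRUE)$statistic
builtin
```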

Note that any ordinary assignments done within the function are local and temporary and are lost after exit from the function. Thus the assignment

yb1<-mean(y1)

does not create or change any object called yb1 in the calling program.

To understand completely the rules governing the scope of R assignments the reader needs to be familiar with the notion of an evaluation frame. This is a somewhat advanced, though hardly difficult, topic and is not covered further here.
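Both points, default argument values and local assignment, can be seen in one small sketch. The function centre and its digits argument are made up for the illustration.

```r
yb1 <- "unchanged"                     # an object in the calling environment

centre <- function(y, digits = 1) {    # digits has a default value
  yb1 <- mean(y)                       # local to the function
  round(y - yb1, digits)
}

centre(c(1, 2, 6))                     # uses the default digits = 1
centre(c(1, 2, 6), digits = 0)         # overrides the default
yb1                                    # still "unchanged"
```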

2.1 Vectors, Matrices and Data Frames

When we're analysing experimental data, of course, we are likely to be working with lots of numbers, and R is especially good at dealing with objects that are groups of numbers, or groups of character or logical data. In the case of numbers these groups can be organised as sequences (vectors) or as two dimensional tables of numbers (matrices). R can also deal with tables that have some columns of numbers and some columns with other kinds of data: these are called data frames.

2.1.1 Vectors

We have already used the concatenate function c

> x=c(1,3,5,7,9)

> x

[1] 1 3 5 7 9

This creates a new object called x (a vector in this case) containing the odd numbers from 1 to 9. To see what x is just type its name. There are other ways to set up vectors. One of the most important uses the function

seq(from,to,by)

which produces sequences of numbers. We could write out the command in full, with names for all the arguments, as seq(from=1, to=10, by=1), but because R knows that the first argument between the brackets corresponds to the from= argument, the second one to the to= argument, and the default value for by= is 1, we can write a much shorter instruction.

> x=seq(1,10)
> x
[1] 1 2 3 4 5 6 7 8 9 10

The full set of arguments is

seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)

For the examples that follow we keep the sequence 1 to 10 as y and set x back to the odd numbers, this time using the by argument:

> y=seq(1,10)
> x=seq(1,10,2)
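A few more calls show how the arguments of seq() interact. This is just a sketch of common uses; all the argument names are those of the standard seq() function.

```r
seq(1, 10)                  # from and to; by defaults to 1
seq(1, 10, by = 2)          # 1 3 5 7 9
seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)         # 10 7 4 1
1:10                        # the colon operator is shorthand for seq(1, 10)
```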

We can refer to elements or slices of a vector

> x[3]

[1] 5

> x[2:4]

[1] 3 5 7

> x[5:1]

[1] 9 7 5 3 1

> x[11]

[1] NA

Notice the NA for an element that does not exist.

> x*2

[1] 2 6 10 14 18

> x-y[1:5]

[1] 0 1 2 3 4

> x^3

[1] 1 27 125 343 729

>

2.1.2 Matrices

When we have data that are arranged in two dimensions rather than one we have a matrix. We can set one up using the function matrix

> m1=matrix(data=seq(1:20),nrow=5,ncol=4,dimnames=list(c("A","B","C","D","E")))

> m1

[,1] [,2] [,3] [,4]

A 1 6 11 16

B 2 7 12 17

C 3 8 13 18

D 4 9 14 19

E 5 10 15 20

>

One thing to notice about this is that the default option in R is to fill a matrix up in column order rather than row order. This can be reversed by using the byrow=TRUE argument. Be careful about this when setting matrices up because it's easy to make a mistake. We can make life shorter as

> m2=matrix(1:20,5,4)

> m2

[,1] [,2] [,3] [,4]

[1,] 1 6 11 16

[2,] 2 7 12 17

[3,] 3 8 13 18

[4,] 4 9 14 19

[5,] 5 10 15 20

>

> m1[1,2]

A

6

> m1[2:4,1:3]

[,1] [,2] [,3]

B 2 7 12

C 3 8 13

D 4 9 14

> m1[,1]

A B C D E

1 2 3 4 5

> m1[1,]

[1] 1 6 11 16

>
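Returning to the fill order mentioned earlier, the effect of byrow=TRUE is easiest to see side by side. A small sketch; the object names by_col and by_row are made up for the illustration.

```r
by_col <- matrix(1:6, nrow = 2, ncol = 3)                 # default: filled down the columns
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # filled along the rows

by_col
by_row
```

The first row of by_col is 1 3 5, whereas the first row of by_row is 1 2 3.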


We can carry out simple arithmetic with our matrix just as we can with a vector.

> m1=matrix(1:4,2,2)

> m1

[,1] [,2]

[1,] 1 3

[2,] 2 4

> m2=matrix(1:6,2,3)

> m2

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6

> m3=matrix(c(-2,0,11,3),2,2)

> m3

[,1] [,2]

[1,] -2 11

[2,] 0 3

> m1+m1

[,1] [,2]

[1,] 2 6

[2,] 4 8

> m1-m2

Error in m1 - m2 : non-conformable arrays

> m1-m3

[,1] [,2]

[1,] 3 -8

[2,] 2 1

> m1%*%m3

[,1] [,2]

[1,] -2 20

[2,] -4 34

> m1%*%m2

[,1] [,2] [,3]

[1,] 7 15 23

[2,] 10 22 34

> solve(m1)

[,1] [,2]

[1,] -2 1.5

[2,] 1 -0.5

> m1%*%solve(m1)

[,1] [,2]


[1,] 1 0

[2,] 0 1

We can use matrices as arguments for functions.
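For instance, the following standard functions all take a matrix as their argument. The sketch uses the 2 by 2 matrix m1 from above.

```r
m1 <- matrix(1:4, 2, 2)

dim(m1)              # number of rows and columns: 2 2
t(m1)                # transpose
det(m1)              # determinant: -2
apply(m1, 1, sum)    # row sums: 4 6
apply(m1, 2, sum)    # column sums: 3 7
```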

2.1.3 The Dataframe

Very often we have data in a tabular form, for example

Age  Boy   Age  Girl
40   2968  40   3317
38   2795  36   2729
40   3163  40   2935
35   2925  38   2754
36   2625  42   3210
37   2847  39   2817
41   3292  40   3126
40   3473  37   2539
37   2628  36   2412
38   3176  38   2991
40   3421  39   2875
38   2975  40   3231

Table 2: Birth weight and age in weeks (the data of Table 1)

If this is in a spreadsheet, for example Excel, we can save it as a .csv file and then read the file into a data frame. Data frames are tightly coupled collections of variables which share many of the properties of matrices and of lists, and are the fundamental data structure used by most R functions. Less formally, a data frame is a type of table where the typical use employs the rows as observations and the columns as variables.

Gender  Age  Weight
M       40   2968
M       38   2795
M       40   3163
M       35   2925
M       36   2625
M       37   2847
M       41   3292
M       40   3473
M       37   2628
M       38   3176
M       40   3421
M       38   2975
F       40   3317
F       36   2729
F       40   2935
F       38   2754
F       42   3210
F       39   2817
F       40   3126
F       37   2539
F       36   2412
F       38   2991
F       39   2875
F       40   3231

Table 3: The data of Table 2 arranged with one row per baby

> baby=read.csv(file.choose(),header=TRUE)

> baby

Gender Age Weight

1 M 40 2968

2 M 38 2795

3 M 40 3163

4 M 35 2925

5 M 36 2625

6 M 37 2847

7 M 41 3292

8 M 40 3473

9 M 37 2628

10 M 38 3176

11 M 40 3421

12 M 38 2975

13 F 40 3317

14 F 36 2729

15 F 40 2935

16 F 38 2754

17 F 42 3210

18 F 39 2817

19 F 40 3126

20 F 37 2539

21 F 36 2412

22 F 38 2991

23 F 39 2875

24 F 40 3231

>

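Note

• that R assumes that the character vectors that are going into the new data frame are factors and makes them so. (This was the default in the R 3.x versions used here; from R 4.0.0 character columns stay as character unless you ask for stringsAsFactors=TRUE.)

• The file.choose() command is for people like me who cannot remember which directory contains the data file. Clever people use the path.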
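If you do not have the .csv file to hand, the same data frame can be built directly with data.frame(). This is a sketch that assumes the column names Gender, Age and Weight shown in the printout above; the values are those of Table 1.

```r
baby <- data.frame(
  Gender = factor(rep(c("M", "F"), each = 12)),
  Age    = c(40, 38, 40, 35, 36, 37, 41, 40, 37, 38, 40, 38,
             40, 36, 40, 38, 42, 39, 40, 37, 36, 38, 39, 40),
  Weight = c(2968, 2795, 3163, 2925, 2625, 2847, 3292, 3473, 2628, 3176, 3421, 2975,
             3317, 2729, 2935, 2754, 3210, 2817, 3126, 2539, 2412, 2991, 2875, 3231)
)

str(baby)    # one factor column and two numeric columns
nrow(baby)   # 24 babies
```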

We can select part of the data frame by using indices

> baby[2:3,]

Gender Age Weight

2 M 38 2795

3 M 40 3163

> baby[1,]

Gender Age Weight

1 M 40 2968

> baby[,2]

[1] 40 38 40 35 36 37 41 40 37 38 40 38 40 36 40 38 42 39 40 37 36 38 39 40

>

or

> baby$Age

[1] 40 38 40 35 36 37 41 40 37 38 40 38 40 36 40 38 42 39 40 37 36 38 39 40


or

> subset(baby,Weight>3000)

Gender Age Weight

3 M 40 3163

7 M 41 3292

8 M 40 3473

10 M 38 3176

11 M 40 3421

13 F 40 3317

17 F 42 3210

19 F 40 3126

24 F 40 3231

> subset(baby,Gender=="F")

Gender Age Weight

13 F 40 3317

14 F 36 2729

15 F 40 2935

16 F 38 2754

17 F 42 3210

18 F 39 2817

19 F 40 3126

20 F 37 2539

21 F 36 2412

22 F 38 2991

23 F 39 2875

24 F 40 3231

Now we have a data frame we find that we cannot use the variables inside the frame directly by name. To access the data we need to do one of two things

1. attach the data frame. This tells R what is in the frame and it can be used

> attach(baby)

> Weight

[1] 2968 2795 3163 2925 2625 2847 3292 3473 2628 3176 3421 2975 3317 2729 2935 2754 3210

[18] 2817 3126 2539 2412 2991 2875 3231

2. The alternative is to be explicit and use commands like

> baby$Weight

[1] 2968 2795 3163 2925 2625 2847 3292 3473 2628 3176 3421 2975 3317 2729 2935 2754 3210

[18] 2817 3126 2539 2412 2991 2875 3231

> baby[3]

Weight

1 2968

2 2795

3 3163

4 2925

5 2625

6 2847

7 3292

8 3473

9 2628

10 3176

11 3421

12 2975

13 3317

14 2729

15 2935

16 2754

17 3210

18 2817

19 3126

20 2539

21 2412

22 2991

23 2875

24 3231
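A third possibility, not described above, is the standard function with(), which makes the columns visible for a single expression without attaching anything. The sketch uses a cut-down frame called small (a made-up name) holding the first three boys and the first three girls.

```r
small <- data.frame(
  Gender = c("M", "M", "M", "F", "F", "F"),
  Age    = c(40, 38, 40, 40, 36, 40),
  Weight = c(2968, 2795, 3163, 3317, 2729, 2935)
)

overall <- with(small, mean(Weight))                   # columns visible only inside with()
by_sex  <- with(small, tapply(Weight, Gender, mean))   # mean weight for each gender

overall
by_sex
```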

3 Input and Output

As you might expect there are lots of complex ways to get data. We have seen how to use scan but a more useful approach is to use .csv files as above. The command is

read.csv(file, header = TRUE, sep = ",", quote = "\"",

dec = ".", fill = TRUE, comment.char = "", ...)

I have restricted discussion of input but there are several other possibilities, as the foreign package allows one to import data in several other formats, e.g. STATA. For example

> require(foreign)

Loading required package: foreign

> require(MASS)

Loading required package: MASS

> cdata <- read.dta("http://www.ats.ucla.edu/stat/data/crime.dta")

> # This loads a STATA formatted data set from UCLA

You can write data using

write(x, file = "data",

ncolumns = if(is.character(x)) 1 else 5,

append = FALSE, sep = " ")

Beware you may have to transpose your data matrix.
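A round trip through a temporary file shows both points. This is a sketch: tempfile() supplies a throwaway path, write.csv() is the standard counterpart of read.csv(), and the small data frame df is made up from the first three rows of the boys' data.

```r
df <- data.frame(Age = c(40, 38, 40), Weight = c(2968, 2795, 3163))

f <- tempfile(fileext = ".csv")        # temporary file, path chosen by R
write.csv(df, f, row.names = FALSE)    # leave out the row-number column
back <- read.csv(f, header = TRUE)
back
unlink(f)

# write() outputs values in storage order (column by column for a matrix),
# so transpose first if the file should show the rows of the matrix
m <- matrix(1:6, nrow = 2)
f2 <- tempfile()
write(t(m), file = f2, ncolumns = ncol(m))
file_lines <- readLines(f2)
file_lines
unlink(f2)
```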

4 R help and documentation

You are probably beginning to see a problem in using R: you have to know the name of the function you would like to use. While there are some prototype GUI interfaces you will have to resign yourself to the UNIX command-driven world. The appendices contain copies of some of the CRAN crib sheets, but in addition the help system can be useful.

The apropos() command is convenient when you are not sure that you know the nameof a function. For example if you were after a stem and leaf function but were not sure ifthe name was stem or stemandleaf. Try

> apropos("stem")

[1] "stem" "system" "system.file" "system.time"

The help system may be used in several ways:

• Type help.start() at the R command line. This brings up an html version of the help system. (The Windows and Macintosh versions of the help system also contain information specific to those environments.) Within the help system, in particular, "An Introduction to R" is the definitive, quite advanced, reference manual intended for those with fairly substantial statistical knowledge.

• The R base and ctest packages document all the main R functions.

• Help on individual functions and datasets is also available from the R command line, so for help on the function plot, type ?plot or help(plot).

• help.search() or ?? for finding help pages on a vague topic;

• library() for listing available packages and the help objects they contain.

• data() for listing available data sets;

• Under R for Windows, the entire help system is also available from the Help menu. This also has pdf versions of “An Introduction to R” and other manuals.

• Don’t forget Google etc.

The latest version of the entire help system in printable form, together with further contributed documentation and tutorials, is also available from CRAN.
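As a small sketch of how these lookup tools fit together when you only half remember a name: the stem-and-leaf search from above is reused here, and the interactive calls are commented out so that the script also runs unattended.

```r
# apropos() takes a character string (a regular expression) and returns
# the names of matching objects on the search path.
stem_matches <- apropos("stem")
print(stem_matches)

# Anchoring the pattern narrows the search to the exact name "stem".
stem_exact <- apropos("^stem$")
print(stem_exact)

# exists() confirms that the name is a function before you ask for its help page.
has_stem <- exists("stem", mode = "function")
print(has_stem)

# Interactive lookups (left commented so the script runs non-interactively):
# ?stem
# help.search("stem and leaf")
# example(stem)
```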

We can look at some samples using inbuilt datasets. Try help(data). We start with a histogram, so try help(hist)

> # Simple Histogram

> hist(mtcars$mpg)

> # Colored Histogram with Different Number of Bins

> hist(mtcars$mpg, breaks=20, col="green")

hist(mtcars$mpg, breaks=20, col="green",xlab="mpg")

hist(mtcars$mpg, breaks=20, col="green",xlab="mpg",main="MPG")

You may prefer density plots, so help(density)

> # Kernel Density Plot

> d <- density(mtcars$mpg) # returns the density data

> plot(d) # plots the results

# Filled Density Plot

d <- density(mtcars$mpg)

plot(d, main="Kernel Density of Miles Per Gallon")

polygon(d, col="red", border="blue")

Try help(barplot) and help(boxplot)

> attach(baby)

> boxplot(Weight~Gender)

Weight ~ Gender is an example of a formula: a boxplot of Weight is generated for each value of Gender.
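The same formula idea can be tried without the baby data. The sketch below uses the built-in mtcars data set instead, which is a substitution made purely so that the example is self-contained.

```r
# mtcars ships with R: mpg is numeric, cyl takes the values 4, 6 and 8.
# "mpg ~ cyl" reads "mpg split by cyl": one box is drawn per value of cyl.
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Number of cylinders", ylab = "Miles per gallon")

# boxplot() invisibly returns the statistics it plotted.
print(bp$names)   # group labels, one per box
print(bp$n)       # number of observations in each group
```

Passing data = mtcars avoids the need for attach(), which can leave stray names on the search path.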


4.0.4 The Plot Command

By default, the plot( ) function plots the (x,y) points, using the names x and y to label the axes. However, if you try the help system for plot you will find it is very flexible. You can

• choose points or lines by setting type="p" or type="l"

• choose the point symbol by setting pch= to a number or a character.

• label axes

• give a title

type can take the following values:

type    description

p       points
l       lines
o       overplotted points and lines
b, c    points (empty if "c") joined by lines
s, S    stair steps
h       histogram-like vertical lines
n       does not produce any points or lines

The commands lines and points have a similar effect, but they will only overplot on a graph which already exists. They CANNOT produce a plot ab initio.
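A short sketch of that rule, using made-up sine and cosine curves: plot() opens the graph, lines() and points() add to it, and calling lines() with no graph open fails.

```r
x <- seq(0, 2 * pi, length.out = 50)

# plot() must come first: it opens the graph and draws the axes.
plot(x, sin(x), type = "l", ylab = "value", main = "sin and cos")

# lines() and points() then add to the graph that already exists.
lines(x, cos(x), lty = 2)
points(c(0, pi), c(0, 0), pch = 20)

# Calling lines() with no graph open is an error; try() lets us see that
# without stopping the script.
graphics.off()                       # close every open device
res <- try(lines(x, cos(x)), silent = TRUE)
failed <- inherits(res, "try-error")
print(failed)
```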

You will find the par() command useful. We only point out one property: par(mfrow=c(r,s)) sets up an r × s array, and the next rs graphs become elements of the array.

> plot(Age,Weight)

# try with filled points

> plot(Age,Weight,pch=20)

# use colour to differentiate gender

> plot(Age,Weight,col=as.integer(Gender),pch=20)

# Or perhaps Letters

> plot(Age,Weight,pch=as.character(Gender))

# Add colour

> plot(Age,Weight,pch=as.character(Gender),col= as.integer(Gender))

# Try to simplify

> flag=as.integer(Gender)

> flag


[1] 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1

> plot(Age,Weight,pch=20,col=flag)

# Add a legend

> legend("topleft",legend=c("Female","Male"),pch=20,col=c(1,2))

# And if your journal is Black and white

> plot(Age,Weight,pch=flag+19)

> legend("topleft",legend=c("Female","Male"),pch=c(20,21))

Of course, you can have multiple graphs on one page:

par(mfrow=c(2,2))

plot(Age,Weight,pch=21)

plot(Age,Weight,col=as.integer(Gender),pch=20)

plot(Age,Weight,pch=as.character(Gender))

plot(Age,Weight,col=as.integer(Gender),pch=as.character(Gender))

par(mfrow=c(1,1))

4.1 Saving Graphs

You can save the graph in a variety of formats from the menu File -> Save As. With functions, you can send output to

1. pdf("mygraph.pdf") pdf file

2. win.metafile("mygraph.wmf") windows metafile

3. png("mygraph.png") png file

4. jpeg("mygraph.jpg") jpeg file

5. bmp("mygraph.bmp") bmp file

6. postscript("mygraph.ps") postscript file
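Each of these functions opens a file device; the graph only reaches the file once the device is closed with dev.off(). A minimal sketch with pdf() follows, where the temporary output path is an assumption chosen so the example leaves no files behind.

```r
out <- file.path(tempdir(), "mygraph.pdf")   # hypothetical output location

pdf(out, width = 6, height = 4)   # 1. open the file device (size in inches)
hist(mtcars$mpg, col = "green")   # 2. draw as usual; nothing appears on screen
dev.off()                         # 3. close the device so the file is written

saved <- file.exists(out) && file.info(out)$size > 0
print(saved)
```

The other devices in the list follow the same open, draw, close pattern.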

4.2 Adding Packages

There are add-ons to the basic R system called packages. Typically researchers bundle up their new tools in a package and place it with CRAN. If you want a particular analysis, there is probably a package available to do what you want.

Do look at

• http://cran.r-project.org/web/packages/available_packages_by_name.html


• http://blog.yhathq.com/posts/10RpackagesIwishIknewaboutearlier.html

• http://crantastic.org/

To add a package to your machine and use it:

1. Download and install the package. It will then be available for use in future sessions as well as the current one.

2. To use the package, invoke the library(package) command to load it into the current session. You need to do this each session (unless you customize your environment to automatically load it each time).
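The two steps can also be done from the command line. The helper below is a hypothetical convenience function, not part of R; it installs a package only when it is missing and then loads it. It assumes a CRAN mirror has been configured if a download is needed.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)                # step 1: one-off download from CRAN
  }
  library(pkg, character.only = TRUE)    # step 2: needed in every new session
  invisible(pkg %in% loadedNamespaces())
}

# "stats" always ships with R, so this call never triggers a download.
ok <- load_or_install("stats")
print(ok)
```

Calling load_or_install("boot") would work the same way for a package such as boot.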

4.2.1 Windows

• Choose Install Packages from the Packages menu.

• Select a CRAN Mirror. (I find Switzerland reliable)

• Select a package e.g. boot.

• Then use the library(package) function to load it for use, either library(boot) or the drop-down menu.

4.2.2 OS X

• Choose Install Packages from the Packages and Data Menu.

• Select a CRAN Mirror. (I find Switzerland reliable)

• Select a package e.g. boot.

• Then use the library(package) function to load it for use, either library(boot) or Package Installer from the drop-down menu.

4.3 Some examples

• The boxplot.matrix( ) function in the sfsmisc package draws a boxplot for each column (row) in a matrix.

• The boxplot.n( ) function in the gplots package annotates each boxplot with its sample size.

• The bplot( ) function in the Rlab package offers many more options controlling the positioning and labeling of boxes in the output.


• A violin plot is a combination of a boxplot and a kernel density plot. They can be created using the vioplot( ) function from the vioplot package.

Creating a new package is reasonably (?) straightforward, and because R is now so widely used in academia, the majority of authors of publications describing new analysis techniques release an R package when they publish their new ideas.

4.3.1 Packages you already have

Some packages come with the base installation of R but are not automatically loaded when you start the software. To see what they are, either look at the drop-down menu or type library(). Examples include lattice, which includes functions for a range of advanced graphics; MASS, which is a package associated with the book Modern Applied Statistics with S by Venables and Ripley (2002, Springer-Verlag) and contains a variety of useful functions to do things like fit generalized linear models with negative binomial errors; nlme, which lets you fit linear and non-linear mixed-effects models; cluster, which brings a range of functions for cluster analysis; and survival, which has functions for survival analysis (surprise!).

These packages will probably already be installed on your computer, but to make sure you can use the library() function. Just type it in with nothing between the brackets and you will get more information about what is there than you really need. If you want to use one of them you can load it into R by using

library(package name)

or the drop-down menu, and it will be loaded. If you want to know a bit more about what's in a particular package you can type

library(help = splines)

where splines is the name of the package. This will get you some information about the package and a list of the various functions that are included in the package. If you want to know more, as we said, one of the easiest ways of finding out more information is to go to

http://cran.r-project.org/web/packages

which lists all 4000 odd packages currently available for R. If you click on the name of a package you will be able to navigate to a link for the package manual, which should tell you everything you might ever want to know. It might be difficult to follow because it's likely to be written for consumption by clever people, but you just need to persevere and use Google. I find

http://crantastic.org/ a wonderful resource. If you look on CRAN you can find the web page for vegan at


http://cran.r-project.org/web/packages/vegan/index.html

which lets you look at the manual for the package and also provides links to a number of vignettes: documents giving details of how to carry out specific analyses using the package. This can be very useful once the package is installed. When it is the wee small hours and you would sell your cat’s soul to Satan just to get that analysis done, those vignettes can preserve your sanity.
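Vignettes for installed packages can also be listed from within R. The sketch below uses the grid package only because it ships with R; vegan would work the same way once installed.

```r
# List the vignettes that come with an installed package.
# "grid" is used here only because it ships with every R installation.
v <- vignette(package = "grid")
print(v$results[, c("Item", "Title"), drop = FALSE])

# Opening one of them is interactive, so it is left commented out:
# vignette("grid")
```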

5 Appendix 1: Vocabulary

The first functions to learn

?

str

# Important operators and assignment

%in%, match

=, <-, <<-

$, [, [[, head, tail, subset

with

assign, get

# Comparison

all.equal, identical

!=, ==, >, >=, <, <=

is.na, complete.cases

is.finite

# Basic math

*, +, -, /, ^, %%, %/%

abs, sign

acos, asin, atan, atan2

sin, cos, tan

ceiling, floor, round, trunc, signif

exp, log, log10, log2, sqrt

max, min, prod, sum

cummax, cummin, cumprod, cumsum, diff

pmax, pmin


range

mean, median, cor, sd, var

rle

# Functions to do with functions

function

missing

on.exit

return, invisible

# Logical & sets

&, |, !, xor

all, any

intersect, union, setdiff, setequal

which

# Vectors and matrices

c, matrix

# automatic coercion rules character > numeric > logical

length, dim, ncol, nrow

cbind, rbind

names, colnames, rownames

t

diag

sweep

as.matrix, data.matrix

# Making vectors

c

rep, rep_len

seq, seq_len, seq_along

rev

sample

choose, factorial, combn

(is/as).(character/numeric/logical/...)

# Lists and data.frames

list, unlist

data.frame, as.data.frame

split

expand.grid


# Control flow

if, &&, || (short circuiting)

for, while

next, break

switch

ifelse

# Apply & friends

lapply, sapply, vapply

apply

tapply

replicate

#Common data structures

# Date time

ISOdate, ISOdatetime, strftime, strptime, date

difftime

julian, months, quarters, weekdays

library(lubridate)

# Character manipulation

grep, agrep

gsub

strsplit

chartr

nchar

tolower, toupper

substr

paste

library(stringr)

# Factors

factor, levels, nlevels

reorder, relevel

cut, findInterval

interaction

options(stringsAsFactors = FALSE)

# Array manipulation

array


dim

dimnames

aperm

library(abind)

#Statistics

# Ordering and tabulating

duplicated, unique

merge

order, rank, quantile

sort

table, ftable

# Linear models

fitted, predict, resid, rstandard

lm, glm

hat, influence.measures

logLik, df, deviance

formula, ~, I

anova, coef, confint, vcov

contrasts

# Miscellaneous tests

apropos("\\.test$")

# Random variables

(q, p, d, r) * (beta, binom, cauchy, chisq, exp, f, gamma, geom,

hyper, lnorm, logis, multinom, nbinom, norm, pois, signrank, t,

unif, weibull, wilcox, birthday, tukey)

# Matrix algebra

crossprod, tcrossprod

eigen, qr, svd

%*%, %o%, outer

rcond

solve

#Working with R

# Workspace

ls, exists, rm

getwd, setwd


q

source

install.packages, library, require

# Help

help, ?

help.search

apropos

RSiteSearch

citation

demo

example

vignette

# Debugging

traceback

browser

recover

options(error = )

stop, warning, message

tryCatch, try

#I/O

# Output

print, cat

message, warning

dput

format

sink, capture.output

# Reading and writing data

data

count.fields

read.csv, write.csv

read.delim, write.table

read.fwf

readLines, writeLines

readRDS, saveRDS

load, save

library(foreign)


# Files and directories

dir

basename, dirname, tools::file_ext

file.path

path.expand, normalizePath

file.choose

file.copy, file.create, file.remove, file.rename, dir.create

file.exists, file.info

tempdir, tempfile

download.file, library(downloader)
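The "Random variables" entry in the list above is compact: each prefix (d, p, q, r) combines with a distribution name. A brief sketch for the normal distribution, with values chosen purely for illustration:

```r
# The four prefixes applied to the normal distribution ("norm"):
dens <- dnorm(0)            # d: density at 0
prob <- pnorm(1.96)         # p: cumulative probability P(X <= 1.96)
quan <- qnorm(0.975)        # q: quantile, the inverse of pnorm
set.seed(1)                 # make the random draw reproducible
draw <- rnorm(5)            # r: five random values

# q and p undo each other.
round_trip <- pnorm(qnorm(0.975))
print(c(dens = dens, prob = prob, quan = quan, round_trip = round_trip))
```

The same pattern gives dbinom, ppois, qt, runif and so on.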

6 Appendix 2 : Numerical Types

type       mode       storage.mode  example

logical    logical    logical       TRUE or FALSE
integer    numeric    integer       4L
double     numeric    double        4.0000
complex    complex    complex       3+5i
character  character  character     "word"
raw        raw        raw           The raw type is intended to hold raw bytes
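The table can be checked directly with typeof(), mode() and storage.mode(). In this sketch the example values are arbitrary; note that a bare 4 is stored as a double, and the L suffix is what makes it an integer.

```r
vals <- list(TRUE, 4L, 4.0, 3+5i, "word", as.raw(255))

types <- data.frame(
  type         = sapply(vals, typeof),
  mode         = sapply(vals, mode),
  storage.mode = sapply(vals, storage.mode),
  stringsAsFactors = FALSE
)
print(types)

print(typeof(4))    # a bare 4 is a double
print(typeof(4L))   # 4L is an integer
```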

7 Regression

1. Linear Regression

fit <- lm(y ~ x1 + x2 + x3, data=mydata)

summary(fit) # show results

2. Other useful functions

coefficients(fit) # model coefficients

confint(fit, level=0.95) # CIs for model parameters

fitted(fit) # predicted values

residuals(fit) # residuals

anova(fit) # anova table

vcov(fit) # covariance matrix for model parameters

influence(fit) # regression diagnostics


# diagnostic plots

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page

plot(fit)

3. compare models

fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)

fit2 <- lm(y ~ x1 + x2, data=mydata)

anova(fit1, fit2)

4. K-fold cross-validation

library(DAAG)

cv.lm(df=mydata, fit, m=3) # 3 fold cross-validation

5. Assessing R2 shrinkage using 10-Fold Cross-Validation

fit <- lm(y~x1+x2+x3,data=mydata)

library(bootstrap)

# define functions

theta.fit <- function(x,y){lsfit(x,y)}

theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}

# matrix of predictors

X <- as.matrix(mydata[c("x1","x2","x3")])

# vector of observed responses

y <- as.matrix(mydata[c("y")])

results <- crossval(X,y,theta.fit,theta.predict,ngroup=10)

cor(y, fit$fitted.values)**2 # raw R2

cor(y,results$cv.fit)**2 # cross-validated R2

6. Stepwise Regression

library(MASS)

fit <- lm(y~x1+x2+x3,data=mydata)

step <- stepAIC(fit, direction="both")

step$anova # display results


7. All Subsets Regression

library(leaps)

attach(mydata)

leaps<-regsubsets(y~x1+x2+x3+x4,data=mydata,nbest=10)

# view results

summary(leaps)

# plot a table of models showing variables in each model.

# models are ordered by the selection statistic.

plot(leaps,scale="r2")

# plot statistic by subset size

library(car)

subsets(leaps, statistic="rsq")

8. The relaimpo package provides measures of relative importance for each of the predictors in the model. See help(calc.relimp) for details on the four measures of relative importance provided.

9. Graphic Enhancements

The car package offers a wide variety of plots for regression, including added variable plots, and enhanced diagnostic and scatter plots.

10. Robust Regression

There are many functions in R to aid with robust regression. For example, you can perform robust regression with the rlm( ) function in the MASS package. The UCLA Statistical Computing website has robust regression examples.

The robust package provides a comprehensive library of robust methods, including regression. The robustbase package also provides basic robust statistics including model selection methods. And David Olive has provided a detailed online review of Applied Robust Statistics with sample R code.

11. GLMs. Generalized linear models are just as easy to fit in R as ordinary linear models. In fact, they require only an additional parameter to specify the variance and link functions.

> glm(formula, family, data, weights, subset, ...)

where ... stands for more esoteric options. The only parameter that we have not encountered before is family, which is a simple way of specifying a choice of variance and link functions. There are six choices of family:


Family            Variance          Link

gaussian          gaussian          identity
binomial          binomial          logit, probit or cloglog
poisson           poisson           log, identity or sqrt
Gamma             Gamma             inverse, identity or log
inverse.gaussian  inverse.gaussian  1/mu^2
quasi             user-defined      user-defined

As can be seen, each of the first five choices has an associated variance function (for binomial the binomial variance m(1-m)), and one or more choices of link functions (for binomial the logit, probit or complementary log-log).

As long as you want the default link, all you have to specify is the family name. If you want an alternative link, you must add a link argument. For example, to do probits you use

> glm( formula, family=binomial(link=probit))
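To see the family and link arguments in action, here is a small sketch on the built-in mtcars data. The choice of am as response and wt as predictor is arbitrary and made only for illustration.

```r
# am (0 = automatic, 1 = manual) is a binary response; wt is car weight.
# Default link for binomial is the logit.
logit_fit <- glm(am ~ wt, family = binomial, data = mtcars)

# An alternative link is requested through the link argument.
probit_fit <- glm(am ~ wt, family = binomial(link = "probit"), data = mtcars)

print(family(logit_fit)$link)
print(family(probit_fit)$link)
print(coef(probit_fit))
```

Everything that works on an lm fit (summary, coef, predict, anova) also works on these glm fits.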

8 Multivariate Data

See http://cran.r-project.org/web/views/Multivariate.html

• Visualising multivariate data

A range of base graphics (e.g. pairs() and coplot()) and lattice functions (e.g. xyplot() and splom()) are useful for visualising pairwise arrays of 2-dimensional scatterplots, clouds and 3-dimensional densities. scatterplot.matrix in the car package provides usefully enhanced pairwise scatterplots. Beyond this, scatterplot3d provides 3 dimensional scatterplots, aplpack provides bagplots and spin3R(), a function for rotating 3d clouds. misc3d, dependent upon rgl, provides animated functions within R useful for visualising densities. YaleToolkit provides a range of useful visualisation techniques for multivariate data.

More specialised multivariate plots include the following:

– faces() in aplpack provides Chernoff’s faces;

– parcoord() from MASS provides parallel coordinate plots;

– stars() in graphics provides a choice of star, radar and cobweb plots respectively.

– mstree() in ade4 and spantree() in vegan provide minimum spanning tree functionality.


– calibrate supports biplot and scatterplot axis labelling.

– geometry, which provides an interface to the qhull library, gives indices to the relevant points via convexhulln().

– ellipse draws ellipses for two parameters, and provides plotcorr(), visual display of a correlation matrix.

– denpro provides level set trees for multivariate visualisation.

– Mosaic plots are available via mosaicplot() in graphics and mosaic() in vcd that also contains other visualization techniques for multivariate categorical data.

– gclus provides a number of cluster specific graphical enhancements for scatterplots and parallel coordinate plots. See the links for a reference to GGobi.

– rggobi interfaces with GGobi.

– xgobi interfaces to the XGobi and XGvis programs which allow linked, dynamic multivariate plots as well as projection pursuit.

– Finally, iplots allows particularly powerful dynamic interactive graphics, of which interactive parallel co-ordinate plots and mosaic plots may be of great interest. Seriation methods are provided by seriation which can reorder matrices and dendrograms.

• Data Preprocessing:

– summarize() and summary.formula() in Hmisc assist with descriptive functions; from the same package varclus() offers variable clustering while dataRep() and find.matches() assist in exploring a given dataset in terms of representativeness and finding matches.

– dist() in base and daisy() in cluster provide a wide range of distance measures, proxy provides a framework for more distance measures, including measures between matrices.

– simba provides functions for dealing with presence / absence data including similarity matrices and reshaping.

• Linear models

9 Regression

9.0.2 Model Formulae for ANOVA and regression

R functions such as aov( ), lm( ), and glm( ) use a formula interface to specify the variables to be included in the analysis. The formula determines the model that will be built (and tested) by the R procedure. The basic format of such a formula is...


response variable ~ explanatory variables

The tilde should be read "is modeled by" or "is modeled as a function of." A basic regression analysis would be formulated this way.

y ~ x

where "x" is the explanatory variable, and "y" is the response variable. Additional explanatory variables would be added in as follows...

y ~ x + z

which would make this a multiple regression with two predictors. This raises a critical issue that must be understood to get model formulae correct. Symbols used as mathematical operators in other contexts do not have their usual mathematical meaning inside model formulae. The following table lists the meaning of these symbols when used in a formula.

symbol  example      meaning

+       +x           include this variable
-       -x           delete this variable
:       x:z          include the interaction between these variables
*       x*z          include these variables and the interactions between them
/       x/z          nesting: include z nested within x
|       x|z          conditioning: include x given z
^       (u+v+w)^3    include these variables and all interactions up to three way
poly    poly(x,3)    polynomial regression: orthogonal polynomials
Error   Error(a/b)   specify the error term
I       I(x*z)       as is: include a new variable consisting of these variables multiplied
1       -1           intercept: delete the intercept (regress through the origin)

Some formula structures can be specified in more than one way...

y ~ u + v + w + u:v + u:w + v:w + u:v:w

y ~ u * v * w

y ~ (u + v + w)^3

All three of these specify a model in which the variables "u", "v", "w", and all the interactions between them are included. Any of these formats...


y ~ u + v + w + u:v + u:w + v:w

y ~ u * v * w - u:v:w

y ~ (u + v + w)^2

would delete the three way interaction.
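These equivalences can be checked without any data, because terms() expands a formula symbolically. In this sketch labels_of is a hypothetical helper name introduced only for the comparison.

```r
# terms() expands a formula without needing any data.
labels_of <- function(f) sort(attr(terms(f), "term.labels"))

# Three ways of writing the full model
full_a <- labels_of(y ~ u + v + w + u:v + u:w + v:w + u:v:w)
full_b <- labels_of(y ~ u * v * w)
full_c <- labels_of(y ~ (u + v + w)^3)

# Two ways of dropping the three-way interaction
two_a <- labels_of(y ~ u * v * w - u:v:w)
two_b <- labels_of(y ~ (u + v + w)^2)

print(full_b)
same_full <- identical(full_a, full_b) && identical(full_b, full_c)
same_two  <- identical(two_a, two_b)
print(c(same_full = same_full, same_two = same_two))
```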

The nature of the variables (binary, categorical (factors), numerical) will determine the nature of the analysis. For example, if "u" and "v" are factors...

y ~ u + v

dictates an analysis of variance (without the interaction term). If "u" and "v" are numerical, the same formula would dictate a multiple regression. If "u" is numerical and "v" is a factor, then an analysis of covariance is dictated.
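A sketch of the three cases on the built-in mtcars data follows. The choice of variables, and treating cyl and gear as factors, are assumptions made only for illustration.

```r
# Same formula shape, different analyses, decided by the variable types.
mt <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

anova_fit  <- lm(mpg ~ cyl + gear, data = mt)   # two factors: analysis of variance
reg_fit    <- lm(mpg ~ wt + hp,    data = mt)   # two numeric: multiple regression
ancova_fit <- lm(mpg ~ wt + cyl,   data = mt)   # numeric + factor: analysis of covariance

# A factor with k levels contributes k - 1 coefficients; a numeric variable one.
print(names(coef(ancova_fit)))
print(anova(anova_fit))
```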

– From stats, lm() (with a matrix specified as the dependent variable) offers multivariate linear models, anova.mlm() provides comparison of multivariate linear models. manova() offers MANOVA. sn provides msn.mle() and mst.mle() which fit multivariate skew normal and multivariate skew t models.

– pls provides partial least squares regression (PLSR) and principal component regression, ppls provides penalized partial least squares, dr provides dimension reduction regression options such as "sir" (sliced inverse regression), "save" (sliced average variance estimation).

– plsgenomics provides partial least squares analyses for genomics. relaimpo provides functions to investigate the relative importance of regression parameters.

– Principal components can be fitted with prcomp() (based on svd(), preferred) as well as princomp() (based on eigen() for compatibility with S-PLUS) from stats.

– sca provides simple components. pc1() in Hmisc provides the first principal component and gives coefficients for unscaled data.

– Additional support for an assessment of the scree plot can be found in nFactors, whereas paran provides routines for Horn’s evaluation of the number of dimensions to retain.

– For wide matrices, gmodels provides fast.prcomp() and fast.svd().

Further options for principal components in an ecological setting are available within ade4 and in a sensory setting in SensoMineR. psy provides a variety of routines useful in psychometry; in this context these include sphpca(), which maps onto a sphere, and fpca(), where some variables may be considered as dependent, as well as scree.plot(), which has the option of adding simulation results to help assess the observed data. PTAk provides principal tensor analysis analogous to both PCA and correspondence analysis. smatr provides standardised major axis estimation.
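For the base-R route mentioned above, a minimal principal components sketch with prcomp() might look like this. The built-in USArrests data set is an arbitrary choice for illustration.

```r
# USArrests ships with R: 50 states, 4 numeric variables.
pca <- prcomp(USArrests, scale. = TRUE)   # scale. = TRUE standardises each variable

print(summary(pca))                       # proportion of variance per component

# The same variances are the eigenvalues of the correlation matrix,
# which is the eigen()-based route that princomp() takes.
eig <- eigen(cor(USArrests))$values
print(all.equal(pca$sdev^2, eig))

screeplot(pca, type = "lines")            # base-R scree plot
```

Standardising matters here because the four variables are measured on very different scales.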

• Latent variable approaches

– factanal() in stats provides factor analysis by maximum likelihood, Bayesian factor analysis is provided for Gaussian, ordinal and mixed variables in MCMCpack.

– GPArotation offers GPA (gradient projection algorithm) factor rotation. FAiR provides factor analysis solved using genetic algorithms.

– sem fits linear structural equation models and ltm provides latent trait models under item response theory and a range of extensions to Rasch models can be found in eRm.

– FactoMineR provides a wide range of Factor Analysis methods, including MFA() and HMFA() for multiple and hierarchical multiple factor analysis as well as ADFM() for multiple factor analysis of quantitative and qualitative data.

– tsfa provides factor analysis for time series. poLCA provides latent class and latent class regression models for a variety of outcome variables.
