Rau, R. (2005) - Introducing R to Demographers with applications for Survival Analysis

Introducing R to Demographers
with applications for Survival Analysis

Roland Rau¹

Max Planck Institute for Demographic Research
Konrad Zuse Str. 1
D–18057 Rostock
Germany

1st February 2005

¹ Tel. +49-381-2081109; email: [email protected]

Copyright © 2004 Roland Rau

Contents

I Learning the Basics of R by Examples

0 Preliminaries

1 Aim of this Document

2 What is R?
  2.1 Is it worth learning a new language / stats package?
  2.2 The pros of R:
  2.3 The cons of R:

3 First Interactive Steps
  3.1 A Powerful Pocket Calculator
  3.2 Assignments
  3.3 Functions
  3.4 Graphs
  3.5 Getting help
  3.6 Using an Editor
  3.7 Exercises

4 The Representation of Data in R
  4.1 Basic Units: Vectors
    4.1.1 Generating Data & Types of Data
    4.1.2 Referencing Elements of A Vector
  4.2 Matrices, Arrays, Dataframes, Lists
    4.2.1 Matrices
    4.2.2 Arrays
    4.2.3 Dataframes
    4.2.4 Lists
  4.3 Missing values and other odd numerical values
  4.4 Requesting Information About Data Structures
  4.5 Exercises

5 Distributions in R
  5.1 Introduction
  5.2 (Cumulative) Distribution Function p
  5.3 Density Function d
  5.4 Quantile Function q
  5.5 Generation of Random Numbers r
    5.5.1 Digression
  5.6 Overview of Functions
  5.7 The Gompertz Distribution
    5.7.1 Introduction
    5.7.2 Hazard Function
    5.7.3 Survivor Function
    5.7.4 Cumulative Hazard Function
    5.7.5 Density Function
    5.7.6 Cumulative Distribution Function
    5.7.7 Quantile Function
    5.7.8 Generation of Random Numbers
  5.8 Generating Random Numbers from Given Data - sample
  5.9 Further Reading
  5.10 Exercises

6 Data Input/Output
  6.1 Reading Data
    6.1.1 Introduction
    6.1.2 Setting the Working Directory
    6.1.3 Reading Text Data
    6.1.4 Reading Binary Data
  6.2 Data Editor
  6.3 Writing Data
  6.4 Further Reading
  6.5 Exercises

7 Programming with R
  7.1 Introduction
  7.2 Grouping Expressions
  7.3 Flow-Control
    7.3.1 Conditional Execution
    7.3.2 Repetitive Execution
  7.4 Writing Functions
  7.5 Problems with loops
  7.6 Further Reading
  7.7 Exercises

8 Graphing Data
  8.1 Introduction
  8.2 Basic plotting function: plot
  8.3 Histograms - hist
  8.4 Barplot - barplot
  8.5 Making boxplots - boxplot
  8.6 QQ-Plots qqplot / qqnorm
  8.7 Further Plotting Commands
  8.8 Further Reading
  8.9 Exercises

9 Simple Statistical Models
  9.1 Introduction
  9.2 Plotting the Data
  9.3 Estimating the Linear Model and Accessing the Components of the Result
  9.4 Regression Diagnostics
  9.5 Technical Digression: using matrix language to estimate the coefficients
  9.6 Further Reading
  9.7 Exercises

10 More Useful Functions
  10.1 Crosstabulations using table
  10.2 How many different values exist? - unique
  10.3 Splitting Data - split and cut
  10.4 Sorting Data - sort and order
    10.4.1 Sorting by One Variable
    10.4.2 Sorting by More Than One Variable

11 Further Reading:
  11.1 Background on R
  11.2 Introductions
  11.3 Graphics
  11.4 Programming in R/S
  11.5

II Using R for Survival Analysis

List of Figures

3.1 A first (nonsense) plot of points
3.2 An example of a lineplot
3.3 An example of a histogram with ∼100 bins for X ∼ N(µ = 0, σ = 1)
3.4 How to access the R manuals and other helpful tools

4.1 Understanding and Referencing an Array

5.1 Density of A Standard Normal Distribution
5.2 Overlapping Gompertz Random Numbers with the Corresponding Density

8.1 Constructing a Plot Piece By Piece
8.2 How to plot a histogram
8.3 Barplot for Women and Men Dying From Lung Diseases in the UK
8.4 Constructing a Boxplot for Weights of Chickens (left) and Ages at Death of the Swedish Birth Cohort from 1890 (right)
8.5 An Example of a QQ-Plot
8.6 Displaying Data in Histogram-like Plot
8.7 Boxplots
8.8 Relating Histograms with Boxplots

9.1 Is there a linear relationship between height and weight? - A first glance at the data
9.2 Regression Diagnostics

List of Tables

5.1 Overview of Built-In Distributions

Part I

Learning the Basics of R by Examples

Chapter 0

Preliminaries

The current version of this document represents a first draft; the introduction for the first part still has to be written and the (existing) notes for Survival Analysis (second part) have to be put together and included. In addition, I also plan to improve the existing chapters.

I am grateful for any suggestions, comments, or other feedback. You can reach me via email at [email protected]

Please note that the electronic version of this document has been prepared in a way that one can navigate through chapters, figures and tables just by clicking them with the mouse. The extra visualization is only present in the electronic document; when printed, the clickable cross-references cannot be distinguished from other parts.

This document has been prepared using:

> version

         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    2
minor    0.1
year     2004
month    11
day      15
language R

Chapter 1

Aim of this Document

The document is divided into two parts.

• The first part presents a smooth introduction to R where the main features of the R language are presented. It starts with some first interactive steps to get a first impression of the language. Then, each subsequent chapter gives an overview of a specific topic: representation of data, distributions in R, reading and writing data, basic programming, and graphing data. The last chapter of the first part introduces how statistical analyses are performed in R using simple examples and linear regression.

• The second part still has to be written. It will rely heavily on the weekly lab sessions I have given during the Winter Semester 2004/2005 at the “International Max Planck Research School for Demography” to supplement the course “Survival Analysis” taught by Jutta Gampe and Francesco Lagona. For each lab session, I have already written roughly 8–15 pages. The topics which will be covered are (not the final order):

– defining survival objects

– setting up the likelihood-functions

– non-parametric estimation (Kaplan-Meier)

– comparing survival curves (logrank-test)

– the Cox-Model

– time-varying covariates

– parametric survival models (Weibull and Gompertz; incl. writing down the likelihood-function for data which are left-truncated and right-censored; optimizing the function; finding standard errors for the parameter estimates and estimating confidence bands)

– Regression diagnostics: Cox-Snell Residuals, Martingale Residuals; ⇒ checking the fit of the model, checking the functional form of the covariates, checking proportionality assumption

– discrete-time survival models (logistic regression, embedded into generalized linear models (GLMs))

– (something still has to be written about accelerated failure time models (AFT))

Chapter 2

What is R?

2.1 Is it worth learning a new language / stats package?

2.2 The pros of R:

2.3 The cons of R:

Chapter 3

First Interactive Steps

3.1 A Powerful Pocket Calculator

R is most likely a new environment for most of you, so we want to make the start relatively smooth. You can think of R as a powerful pocket calculator: you simply enter arithmetic expressions and R returns the result.

> 2 + 3

[1] 5

> 4 * 15

[1] 60

> 2 - 3

[1] -1

> 4^5

[1] 1024

> 1e+06 - 4^3

[1] 1e+06

> (log(27) - (23^75))/(-34.6)

[1] 3.895e+100

R also allows you to work with complex numbers. You enter them as shown in the following example. Working with complex numbers will not be required in our course, however.

> 27 + (0+5i)

[1] 27+5i

3.2 Assignments

So far, R is just an overgrown calculator. If this were all, R would be far from being the powerful program it actually is. We will therefore introduce more and more features in the next sections.

One important feature of any programming language is the possibility to assign values to variables. Of course, this is also possible in R. Indeed, there are two syntactically correct ways to assign a value to a variable, which are equivalent when typed at the prompt as in this tutorial: either you use = like in most other programming languages or you type <-. If you then enter the name of the variable, R returns its value. Unlike in SPSS or SAS, there is no limit on the length of variable names. It is recommended that you use only alphanumeric symbols for your variable names (A–Z, a–z, 0–9) to avoid confusion with symbols which have a special meaning in R like ! or $. Unlike most other programs on Windows, R is case sensitive: it distinguishes between “Hello”, “hello”, and “HELLO”. A few examples will clarify the use of assignments.

> n = 10
> N <- 100
> n

[1] 10

> N

[1] 100

> m = 81
> c = 299792458
> E = m * c^2
> E

[1] 7.28e+18
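The points made above can be tried out directly. The following lines are only a small sketch; the variable names hello, Hello and HELLO are made up for this illustration and do not appear elsewhere in the course. They show that both assignment operators store a value and that R treats differently capitalized names as different variables.

```r
# Both operators assign a value when typed at the prompt.
n = 10
N <- 100

# Names are case sensitive: these are three distinct variables.
hello <- 1
Hello <- 2
HELLO <- 3

hello + Hello + HELLO   # adds the three separate values 1, 2 and 3

# exists() reports whether a variable of exactly that name is defined.
exists("hello")
exists("hELLO")
```

Because n and N are different variables, a typing error in the capitalization of a name usually shows up as an "object not found" error rather than as a silently wrong result.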

A note for all “immigrants” from S-Plus: the assignment operator _ does not work anymore in R (since release 1.8.0). The assignment myvalue_4 is hence invalid.

3.3 Functions

R is strongly related to the “functional programming” paradigm. This means that the basic building blocks of the language are functions which are evaluated. The basic idea is that you give the name of a function followed by the respective arguments in parentheses. An introductory example is the usage of the runif function. This function returns as many uniformly distributed random numbers between 0 and 1 as you request with the given argument.

> runif(5)

[1] 0.9363 0.3241 0.3239 0.9250 0.7919

> runif(10)

 [1] 0.86220 0.37548 0.58685 0.03471 0.24584 0.14294
 [7] 0.55058 0.91449 0.52005 0.17364

Functions are not restricted to one argument; many arguments can be passed to a function. Although the order of the arguments is enough for the computer to find out which argument has which meaning, it is better for the user to give the names of the arguments, especially when you look at the code again at a later point in time. If you do that, you can even change the order of the arguments.

> runif(10, 1, 100)

 [1] 72.021  1.091 40.617 18.549  5.403 52.899 48.013
 [8] 45.676 95.145 45.116

> runif(10, min = 1, max = 50)

 [1] 33.919 16.578 47.236 23.250 13.579 12.450 16.515
 [8] 11.909  5.535 16.857

> runif(10, max = 50, min = 1)

 [1] 13.781 26.049 28.122 26.058  2.002 14.964  1.857
 [8] 11.002 20.158 25.070

In all three examples above we asked for 10 uniformly distributed random numbers. In the first case, the minimum is 1 and the maximum is 100. In the second case, the range is from 1 to 50, as indicated by the two arguments min= and max=. The third example requests the same as the second; by giving the names of the arguments, we were able to change their order.

Please note that R allows functions to be nested, as shown in the following example, where 100,000 uniformly distributed random numbers are generated and their mean is calculated.

> mean(runif(1e+05))

[1] 0.4999
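Because runif returns different numbers on every call, the three runif examples with minimum and maximum cannot be compared directly. The following sketch is an addition to the original text: it uses set.seed, which has not been introduced so far, to restart the random number generator from the same (arbitrarily chosen) state before each call. Under that assumption, matching arguments by position and by name can be shown to give exactly the same result.

```r
# set.seed() fixes the random number stream so that calls can be compared.
set.seed(2005)
a <- runif(10, 1, 50)               # arguments matched by position
set.seed(2005)
b <- runif(10, min = 1, max = 50)   # arguments matched by name
set.seed(2005)
d <- runif(10, max = 50, min = 1)   # names given in a different order

identical(a, b)
identical(b, d)

# args() lists the argument names and defaults of a function.
args(runif)
```

Naming the arguments therefore only changes how a call is written, not what it computes.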

3.4 Graphs

One of the major advantages of R is its outstanding capability of displaying data in graphs. Basically, you have control over every little tick mark you set. In these first few steps only some very basic plotting commands are introduced. They should simply show how you can make plots of your data. Figure 3.1 shows a first plot of uniformly distributed random numbers. This plot of points has been produced by the following code. Admittedly, it does not make much sense.

> plot(runif(100))

The next plot is not too meaningful either (Fig. 3.2). Instead of a point plot we produced a line plot in red by adding two arguments.

> plot(runif(100), type = "l", col = "red")

A histogram is probably more meaningful for such data. Besides the new hist function, another unknown function is introduced: rnorm. Without any additional arguments, it simply draws as many random numbers as requested from a standard normal distribution (X ∼ N(µ = 0, σ = 1)).

> hist(rnorm(1e+05), breaks = 100)

Figure 3.1: A first (nonsense) plot of points [scatter plot; x-axis: Index (0 to 100), y-axis: runif(100) (0.0 to 1.0)]

Figure 3.2: An example of a lineplot [line plot; x-axis: Index (0 to 100), y-axis: runif(100) (0.0 to 1.0)]

Figure 3.3: An example of a histogram with ∼100 bins for X ∼ N(µ = 0, σ = 1) [histogram; x-axis: rnorm(1e+05) (−4 to 4), y-axis: Frequency (0 to 4000)]
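The histogram was requested with breaks = 100, yet its caption speaks only of roughly 100 bins. The reason is that hist treats a single number given as breaks as a suggestion and chooses "pretty" break points close to it. The sketch below is an addition to the original text; the seed is arbitrary, and plot = FALSE is used so that the bins are computed without drawing anything.

```r
set.seed(1)                                 # arbitrary seed, for repeatability
x <- rnorm(1e+05)

# plot = FALSE returns the computed histogram instead of drawing it.
h <- hist(x, breaks = 100, plot = FALSE)

length(h$counts)             # number of bins actually used
sum(h$counts) == length(x)   # every observation falls into exactly one bin
```

The object returned by hist also contains the break points themselves (h$breaks), which is useful if you want to check exactly where the bins start and end.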

3.5 Getting help

When I was younger, so much younger than today.
I never needed anybody’s help in any way. . . .

This small tutorial cannot cover all aspects of the R language. Hence, it is necessary for all of you to be able to find the things you are looking for yourselves. It should be stressed that R is extremely user friendly - it just may seem in the beginning that it is very selective about who its friends are. After a while you will agree with the statement that hardly any statistical package provides as much help as R does. The next few paragraphs show how you should proceed if you are looking for something in R.

• If you already know the name of the command (which is a function in R), you simply ask for help by typing:

> help(mean)

• If you do not know the respective function, you can search across all functions.

> help.search("median")

With this output you know that there is a function called median in the package stats which calculates the median value. You can then proceed by loading the package (actually, the stats package is loaded automatically) and using the help function again.

> library(stats)
> help(median)

Figure 3.4: How to access the R manuals and other helpful tools [figure not reproduced]

• An important source of information is the included documentation. R ships with several manuals which are extremely useful. You can access them if you click in RGui (the R Graphical User Interface) on Help and then on Manuals as indicated in Figure 3.4. Among them, the manual “An Introduction to R” by Venables, Smith and the R Development Core Team is especially useful for beginners.

• I would like to point at two more valuable sources of information in the same Help menu: FAQ on R and FAQ on R for Windows answer “Frequently Asked Questions”.

• R also has a help mailing list. You can subscribe to that list via: https://stat.ethz.ch/mailman/listinfo/r-help
This mailing list is very fast and also very helpful. But it can also be very unfriendly if you do not read the posting guide first. You can read the posting guide at http://www.r-project.org/posting-guide.html. The main point is: “Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response [. . . ].

– Do help.search("keyword") with different keywords (type this at the R prompt)

– Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt)

– If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it.

– Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html)

– Read at least the relevant section in An Introduction to R

– If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting.”

Very helpful in that respect is also the essay “How To Ask Questions The Smart Way” written by Eric S. Raymond. It is available online at http://www.catb.org/~esr/faqs/smart-questions.html.

• One point which is not included but turned out to be very useful for me is the searchable help archives. They are located at http://maths.newcastle.edu.au/~rking/R/. The main lesson I’ve learned there was: my problem is not original. In at least 9 out of 10 cases, somebody else has already asked the same question before.

• In addition, there is, of course, also the possibility to ask Jutta Gampe, Francesco Lagona, me or other R users at the MPIDR. But please go through the recommended help, help.search, ... procedures before you ask us, too. The best way is generally (also if you write to the help mailing list) to provide a little piece of code as a generalized example of your problem.
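The help facilities described in this section can be summarized in a few lines. This is only a sketch of a typical search; apropos and args are base R functions that are not mentioned in the text above and are added here as a suggestion. The calls that open a help page are shown as comments because they are meant for interactive use.

```r
# Exact name known: either form opens the help page (interactive use).
# help(median)
# ?median

# Name only partly known: apropos() lists all objects whose names
# contain the given pattern.
hits <- apropos("median")
hits

# args() shows the arguments of a function without opening the help page.
args(median)
```

If neither the name nor a part of it is known, help.search("keyword") as described above remains the tool of choice, since it also searches the titles and keywords of the help pages.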

3.6 Using an Editor

Code for R usually does not consist of only one line. It is therefore a good choice to write your code in a text editor and then run the whole file or parts of it in the R interpreter. We do not want to impose any editor on you. If you have experience with the Emacs family, then you are most welcome to use it. The editor which has been installed for the course is WinEdt with a special plugin for R. WinEdt is the editor many people use here at the institute to write their R code. It has several features which make it easy to learn and easy to use, for example syntax highlighting, submitting code directly from the editor to R, . . . . The Crimson editor also offers syntax highlighting, as do several other editors. Although they also work, we do not recommend the editors WordPad and Notepad which are part of every Windows distribution.¹ One possibility to run the code you have written is copy & paste. The other possibility is to “source” the file.

An important command for that approach is setwd() which sets your working directory. Please note that on Windows platforms you need the double backslash (\\) as in setwd("u:\\mypathto\\myworkingdirectory\\"). You can then load the file with the corresponding R code by typing source("mycodefile.r").² This lets R run the given file line by line until the end of the file. If it encounters an error, it stops and gives you an error message.
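The workflow of setting a working directory and sourcing a file can be tried without touching your own files. The following sketch is an illustration only: it uses R's temporary directory instead of a real project directory, and the file name mycodefile.r and its two lines of content are made up for this example.

```r
# Hypothetical directory and file, used only for this illustration.
workdir <- tempdir()           # a scratch directory that always exists
olddir <- setwd(workdir)       # setwd() returns the previous directory

# Write a two-line script into the working directory and run it.
writeLines(c("x <- 1:10", "m <- mean(x)"), "mycodefile.r")
source("mycodefile.r")
m                              # m was created by the sourced file

setwd(olddir)                  # go back to where we started
```

On Windows, a path can also be written with single forward slashes, as in setwd("u:/mypathto/myworkingdirectory"), which avoids the doubled backslashes.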

3.7 Exercises

¹ No, Microsoft Word is not an editor.

² It is not necessary to give the code file the extension .r. Nevertheless, several editors detect which kind of file (R, SAS, Stata, C, C++, . . . ) is used for syntax highlighting and other useful tools by checking the extension.

Chapter 4

The Representation of Data in R

4.1 Basic Units: Vectors

After these appetizers, we will now dive a bit deeper into the R language. It has already been mentioned that the basic units for manipulating data are functions. The basic units of the data themselves are vectors. Don’t worry: it does not hurt. Actually, you will soon find out that using vectors makes working with data a lot easier.

4.1.1 Generating Data & Types of Data

The first thing you will learn in this section is how to generate data as a vector. R has very powerful built-in functions to generate (regular) sequences of data. The easiest way is to concatenate the various elements one after another by using the c command.

> c(1, 2, 3, 4, 5, 6, 7, 8, 9)

[1] 1 2 3 4 5 6 7 8 9

> myfirstdata = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
> mynextdata = c(234, 456.56, 435, 435, 56,
+     547)
> options(digits = 3)
> moredata = c(myfirstdata, mynextdata, runif(6))
> moredata

 [1]   1.0000   2.0000   3.0000   4.0000   5.0000
 [6]   6.0000   7.0000   8.0000   9.0000 234.0000
[11] 456.5600 435.0000 435.0000  56.0000 547.0000
[16]   0.0733   0.0403   0.7021   0.0127   0.9119
[21]   0.8008

Numbers are internally stored as floating point numbers (numbers with a decimal point). If you want n significant digits in your output, use the command options(digits=n); this does not affect the number of digits stored.¹
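That the digits option changes only the display and not the stored value can be checked with a few lines. This is a small sketch added for illustration; the value 2/3 is an arbitrary choice.

```r
x <- 2/3
old <- options(digits = 3)   # options() returns the previous settings
print(x)                     # displayed with 3 significant digits
x == 2/3                     # the stored value is unchanged
options(old)                 # restore the previous setting

# More digits are still available on request.
format(x, digits = 10)
```

Saving the value returned by options and passing it back afterwards is a convenient way to change a setting only temporarily.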

Data can be measured on three different scale levels - just as you know it from your stats lessons. By default, numbers are represented on a metric scale. Categorical data, like religion, sex, . . . are represented as factors in R.

> sex <- factor(c(1, 2, 2, 1, 2, 1))
> sex

[1] 1 2 2 1 2 1
Levels: 1 2

If you enter factors into a regression model (which will be shown later), R knows that they cannot be treated like typical numerical values. Data measured on an ordinal scale can be entered into R in two different ways:

> ratings <- factor(c(1, 2, 2, 5, 3, 1, 5, 6),
+     ordered = TRUE)
> ratings

[1] 1 2 2 5 3 1 5 6
Levels: 1 < 2 < 3 < 5 < 6

> ratings <- ordered(c(1, 2, 2, 5, 3, 1, 5,
+     6))
> ratings

[1] 1 2 2 5 3 1 5 6
Levels: 1 < 2 < 3 < 5 < 6
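Two properties of factors are worth seeing in action. The sketch below is an addition to the original text: the labels "male" and "female" are chosen here for illustration, and the functions table, levels, as.integer and as.character have not been introduced so far. It shows how to attach readable labels to the codes and why a factor should not be converted to numbers directly.

```r
# Readable labels for the codes 1 and 2 (labels assumed for illustration).
sex <- factor(c(1, 2, 2, 1, 2, 1), levels = c(1, 2),
              labels = c("male", "female"))
table(sex)                           # counts per category

ratings <- factor(c(1, 2, 2, 5, 3, 1, 5, 6), ordered = TRUE)
levels(ratings)                      # the distinct values, stored as text
as.integer(ratings)                  # internal codes, NOT the ratings
as.numeric(as.character(ratings))    # the original ratings
```

The internal codes simply number the levels from 1 upwards, so the ratings 5 and 6 are stored as the codes 4 and 5; converting via as.character first recovers the original values.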

¹ For the technically interested folks: The core part of R is coded in C. The default numeric type of any stored number is double.

Besides numeric data, R also allows you to use vectors which consist of character strings. You simply have to enclose each character string in quotation marks like in the following example.

> simpsons <- c("Marge", "Homer", "Bart", "Lisa",
+     "Maggie")
> simpsons

[1] "Marge"  "Homer"  "Bart"   "Lisa"   "Maggie"

The vector myfirstdata was very tedious to enter by hand, at least if you want to have a longer regular sequence of integers. In that case the function seq is quite handy. Some examples will show you its usage.

> 1:10

[1] 1 2 3 4 5 6 7 8 9 10

> seq(1:10)

[1] 1 2 3 4 5 6 7 8 9 10

> seq(from = 1, to = 10, by = 1)

[1] 1 2 3 4 5 6 7 8 9 10

> seq(from = 1, to = 10, by = 2)

[1] 1 3 5 7 9

> seq(from = 10, to = 1, by = -3)

[1] 10 7 4 1

> seq(from = 1, to = 20, length = 14)

 [1]  1.00  2.46  3.92  5.38  6.85  8.31  9.77 11.23
 [9] 12.69 14.15 15.62 17.08 18.54 20.00

Another command is the function rep(x, n) which repeats the data x as often as indicated by the integer n. You will see in the following examples that x does not have to be a scalar; it can also be a vector. If n is not a scalar, it needs to have as many elements as x.

> rep(1, 5)

[1] 1 1 1 1 1

> rep(c(1, 2), 3)

[1] 1 2 1 2 1 2

> rep(c(runif(4), c(235.32, 325.432), seq(from = 1,
+     to = 9999, length = 3)), 3)

 [1] 3.56e-01 3.84e-01 8.51e-02 5.46e-01 2.35e+02
 [6] 3.25e+02 1.00e+00 5.00e+03 1.00e+04 3.56e-01
[11] 3.84e-01 8.51e-02 5.46e-01 2.35e+02 3.25e+02
[16] 1.00e+00 5.00e+03 1.00e+04 3.56e-01 3.84e-01
[21] 8.51e-02 5.46e-01 2.35e+02 3.25e+02 1.00e+00
[26] 5.00e+03 1.00e+04

> rep(1:3, 1:3)

[1] 1 2 2 3 3 3
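The two ways of repeating shown above, the whole vector several times or each element a given number of times, can be made explicit by naming the arguments. The following sketch is an addition: the argument names times, each and length.out belong to rep and seq in base R but are not used in the text above.

```r
r1 <- rep(1:3, times = 2)                   # whole vector twice: 1 2 3 1 2 3
r2 <- rep(1:3, each = 2)                    # each element twice: 1 1 2 2 3 3
r3 <- rep(c("a", "b"), times = c(2, 1))     # one count per element: "a" "a" "b"

# seq() with a requested number of elements instead of a step width.
s <- seq(from = 0, to = 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00
```

As the third line shows, rep works on character vectors in the same way as on numbers.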

4.1.2 Referencing Elements of A Vector

A very useful feature of R is the possibility to reference elements of a vector. You may be familiar with this concept if you have used the commands vector in SPSS or array in SAS. Suppose we have a vector generated as in the previous section:

> mydata <- rep(1:10, 2)
> mydata

 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7
[18]  8  9 10

By using square brackets [], R allows us to reference parts of this vector. The 15th element of this vector is returned when you enter

> mydata[15]

[1] 5

You can also do more sophisticated selections. Imagine you want to select all elements which are larger than 0 from a vector consisting of 20 random numbers drawn from a standard normal distribution.

> mynewdata <- rnorm(20)
> mynewdata

 [1] -1.0636 -0.5084 -0.2901 -0.3252  1.0546  1.0711
 [7]  1.4737  0.6861 -0.1877  0.5322  0.6785 -0.8432
[13]  0.0514  1.3047  0.5731 -0.2192  1.4456 -0.5853
[19]  1.6863 -1.2310

An intuitive way would probably be to write mynewdata[> 0]. But you will see that this produces a syntax error. The solution lies in what is actually evaluated inside the brackets: either positions, given as numbers like 12, 35, or 5, or a logical expression, in which case the elements for which the expression is TRUE are selected. We have written a comparison by saying greater than zero. But what should be greater than zero? Yes, correct, our given data should be greater than zero. Thus, the solution is:

> mynewdata[mynewdata > 0]

 [1] 1.0546 1.0711 1.4737 0.6861 0.5322 0.6785 0.0514
 [8] 1.3047 0.5731 1.4456 1.6863

The trick is that R evaluates the comparison for all values of mynewdata and returns the value TRUE or FALSE for each element of this vector.

> mynewdata > 0

 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
 [9] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
[17]  TRUE FALSE  TRUE FALSE

Each TRUE element is then selected. This seems to require more work than actually necessary. That is true in such a simple example. But it also gives you great flexibility. Suppose we have two vectors: one is the sex of an individual (1=male, 2=female) and the other is the age at death. Now we want to know the mean age at death for women and for men separately. As you will see in the following code example, this is easy, because the selection criterion can also be based on another variable.

> sex <- c(1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1)
> age <- seq(from = 75, to = 95, length = length(sex))
> mean(age[sex == 1])

[1] 87.3

> mean(age[sex == 2])

[1] 81
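The selection by a logical vector can be sketched a bit further. The block below reuses the two vectors from the text; the helper names (women, women.pos, mean.by.sex) are made up for this illustration, and which() and tapply() are base R functions that have not been introduced above.

```r
# Same illustrative data as in the text above
sex <- c(1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1)
age <- seq(from = 75, to = 95, length = length(sex))

# A logical index keeps the elements where the condition is TRUE
women <- age[sex == 2]

# which() turns the logical vector into positions
women.pos <- which(sex == 2)

# tapply() computes the mean of age within each value of sex in one call
mean.by.sex <- tapply(age, sex, mean)
print(mean.by.sex)
```

Indexing with the positions returned by which() selects the same elements as the logical index, so both forms can be used interchangeably here.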


4.2 Matrices, Arrays, Dataframes, Lists

Vectors are the basic building blocks for data in R. Vectors can be combined into a matrix, an array, dataframes and/or lists. The following sections will outline their respective characteristics.

4.2.1 Matrices

A matrix can be regarded as a collection of vectors of the same length (i.e. the same number of elements) and of the same type (numeric values vs. character values). Several ways exist to construct a matrix. Here, I would like to address only two possibilities using the previously defined vectors sex and age (footnote 2).

> myfirstmatrix <- as.matrix(cbind(sex, age))
> myfirstmatrix

      sex age
 [1,]   1  75
 [2,]   2  77
 [3,]   2  79
 [4,]   2  81
 [5,]   1  83
 [6,]   1  85
 [7,]   2  87
 [8,]   1  89
 [9,]   1  91
[10,]   1  93
[11,]   1  95

> mysecondmatrix <- as.matrix(rbind(sex, age))
> mysecondmatrix

    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
sex    1    2    2    2    1    1    2    1    1     1
age   75   77   79   81   83   85   87   89   91    93
    [,11]
sex     1
age    95

Footnote 2: Yes, there are three examples. But the first two differ only by the functions cbind and rbind: cbind binds vectors together as columns, rbind binds them together as rows.


> mythirdmatrix <- matrix(c(sex, age), nrow = length(sex),
+     ncol = 2, byrow = FALSE)
> mythirdmatrix

      [,1] [,2]
 [1,]    1   75
 [2,]    2   77
 [3,]    2   79
 [4,]    2   81
 [5,]    1   83
 [6,]    1   85
 [7,]    2   87
 [8,]    1   89
 [9,]    1   91
[10,]    1   93
[11,]    1   95

Having data defined as a matrix facilitates the analysis in several ways. For example, in a regression setting (y = Xβ + ε) you can enter a matrix of covariates at once without referring to each single vector/variable. In addition, R also has built-in functions which work on matrices. These features will be briefly introduced in Chapter 9 starting on page 95.

Referencing elements in matrices works similarly to referencing elements of vectors. The only difference is that we have data not only in one but in two dimensions. Consequently, two indices have to be given. The following rule applies if you want to access a single element of a matrix: mymatrix[row-number, column-number]. For example, if we want to have the element in the second column and the fourth row we have to tell R:

> mythirdmatrix[4, 2]

[1] 81

If you only want to have the second row or first column you have to write:

> mythirdmatrix[2, ]

[1] 2 77

> mythirdmatrix[, 1]

[1] 1 2 2 2 1 1 2 1 1 1 1
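As a small sketch of the same indexing rules, the block below rebuilds the matrix with cbind (so the columns carry the names sex and age) and selects by column name and by condition. The object name m is only used for this illustration; drop = FALSE is a standard argument of the [ operator that keeps a single row in matrix form.

```r
sex <- c(1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1)
age <- seq(from = 75, to = 95, length = length(sex))
m <- cbind(sex, age)               # 11 rows, 2 named columns

print(dim(m))                      # number of rows and columns
print(m[4, "age"])                 # single element, column chosen by name
print(m[m[, "sex"] == 2, ])        # all rows belonging to women
print(m[2, , drop = FALSE])        # second row, kept as a 1 x 2 matrix
```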


4.2.2 Arrays

The aforementioned matrices can be considered as a special case of an array. While matrices represent data in a two-dimensional way, arrays allow you to have data in n dimensions (footnote 3). An example of a three-dimensional array could be a collection of data where death rates are gathered by period and age (i.e. a Lexis diagram) for several countries.

Arrays can be constructed with the array(data, dimensionvector) function. An example should clarify this approach. 32 uniform random numbers are put into a three-dimensional array with 2 rows, 8 columns and 2 “layers”.

> myarray <- array(runif(32), dim = c(2, 8,
+     2))
> myarray

, , 1

      [,1]  [,2]   [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
[1,] 0.425 0.238 0.3571 0.559 0.099 0.223 0.788 0.391
[2,] 0.231 0.961 0.0784 0.333 0.362 0.296 0.761 0.565

, , 2

       [,1]  [,2]  [,3]   [,4]  [,5]   [,6]  [,7]
[1,] 0.6538 0.824 0.134 0.0759 0.475 0.0763 0.943
[2,] 0.0871 0.939 0.312 0.3185 0.815 0.0768 0.553
      [,8]
[1,] 0.247
[2,] 0.308

You can already see how you can reference the elements of an array by looking at the output from the previously entered code lines. Figure 4.1 on page 39 should make referencing and understanding arrays a bit simpler. While matrices need two indices to reference elements, arrays need as many indices as there are dimensions. In Figure 4.1 we have three dimensions. As previously shown, the first and second index refer to the rows and columns, respectively. The third index indicates which “layer” is meant.

Footnote 3: R does not impose a small fixed limit on the number of dimensions; for example, array(0, dim = rep(2, 8)) creates an array with eight dimensions.

Figure 4.1: Understanding and Referencing an Array (a 4 × 4 × 4 array drawn as four layers; the corner elements of each layer are labelled [row, column, layer], for example [1,1,1], [1,4,1], [4,1,1] and [4,4,1] on the first layer)


The “upper, right” element on the second “layer” in such a scenario could therefore be referenced as:

> myarray <- array(runif(64), c(4, 4, 4))
> myarray[1, 4, 2]

[1] 0.83
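To connect this with the Lexis-type example mentioned earlier, here is a minimal sketch of a three-dimensional array with labelled dimensions. The labels (age groups, periods, countries A and B) and the numbers are invented for illustration only; the dimnames argument of array() is standard R.

```r
# Invented rates: 3 age groups x 2 periods x 2 countries
rates <- array(1:12 / 1000,
               dim = c(3, 2, 2),
               dimnames = list(age = c("0-14", "15-64", "65+"),
                               period = c("1990", "2000"),
                               country = c("A", "B")))

print(rates[3, 2, 1])               # by position: row 3, column 2, layer 1
print(rates["65+", "2000", "A"])    # the same element, selected by name
print(rates[, , "B"])               # one whole layer is a matrix
print(dim(rates[, , "B"]))
```

Leaving an index empty selects everything along that dimension, just as it does for rows and columns of a matrix.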

4.2.3 Dataframes

Dataframes represent data in a two-dimensional way like matrices. They are the closest representation in R of data as you may know it from packages like SPSS or STATA. That means that one row typically represents the values of several variables for one individual/unit. In addition, it is also possible to assign names to each row. The following example shows how to construct a dataframe:

> simpsons <- c("Marge", "Homer", "Bart", "Lisa",
+     "Maggie")
> ages <- c(34, 38, 10, 8, 0)
> iq <- c(100, 70, 80, 140, NA)
> mydataframe <- data.frame(cbind(ages, iq),
+     row.names = simpsons)
> mydataframe

       ages  iq
Marge    34 100
Homer    38  70
Bart     10  80
Lisa      8 140
Maggie    0  NA

You can reference vectors and elements in dataframes in the same way as you do for matrices.

> mydataframe[3, 2]

[1] 80

> mydataframe[, 1]

[1] 34 38 10 8 0

There are also two other ways to reference columns. With the $ sign or with [[]] you can specify which variable you select.

> mydataframe$ages

[1] 34 38 10 8 0

> mydataframe$iq

[1] 100 70 80 140 NA

> mydataframe[[2]]

[1] 100 70 80 140 NA

If you are not sure what the variable names of your dataframe are, you can simply ask for them:

> names(mydataframe)

[1] "ages" "iq"
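A brief sketch of an alternative construction: cbind first builds a matrix, which works above because both columns are numeric, whereas passing the vectors directly to data.frame lets every column keep its own type. The object name df is only used for this illustration.

```r
simpsons <- c("Marge", "Homer", "Bart", "Lisa", "Maggie")
ages <- c(34, 38, 10, 8, 0)
iq <- c(100, 70, 80, 140, NA)

# Passing the vectors directly keeps each column's own type
df <- data.frame(ages = ages, iq = iq, row.names = simpsons)

print(df["Lisa", "iq"])        # row and column chosen by name
print(df[df$ages < 18, ])      # all rows for the children
print(sapply(df, mode))        # storage mode of each column
```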

4.2.4 Lists

The most flexible way to store data in R is a list. It does not put any restrictions on the length or type of data. The elements of lists can be vectors, matrices, dataframes and even lists. This kind of construction is sometimes useful if you want to collect all relevant information about one entity in one object. The following example shows how you construct a list, how you reference its elements and how you can ask for the names of the list elements.

> simpsons <- c("Marge", "Homer", "Bart", "Lisa",
+     "Maggie")
> ages <- c(34, 38, 10, 8, 0)
> sex <- c(2, 1, 1, 2, 2)
> address <- "743 Evergreen Terrace"
> pets <- c("Snowball II", "Santa's Little Helper")
> sisters <- c("Thelma", "Betty")
> mylist <- list(familynames = simpsons, demographics = data.frame(cbind(ages,
+     sex)), location = address, pets = pets,
+     relatives = sisters)
> mylist

$familynames
[1] "Marge"  "Homer"  "Bart"   "Lisa"   "Maggie"

$demographics
  ages sex
1   34   2
2   38   1
3   10   1
4    8   2
5    0   2

$location
[1] "743 Evergreen Terrace"

$pets
[1] "Snowball II"           "Santa's Little Helper"

$relatives
[1] "Thelma" "Betty"

> mylist$demographics

  ages sex
1   34   2
2   38   1
3   10   1
4    8   2
5    0   2

> mylist[[4]]

[1] "Snowball II"           "Santa's Little Helper"

> names(mylist)

[1] "familynames"  "demographics" "location"
[4] "pets"         "relatives"

Maybe you think right now that lists are a somewhat odd way to store information. But you will see later on that a lot of data in R is represented as a list and that it is very convenient to use that construction. For example, regression results are stored as a list in which one element is the actual function call, another contains the estimated coefficients, a third consists of the fitted values, . . .
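The remark about regression results can be illustrated with a short sketch. The data below are invented (an exactly linear relationship), and lm() is the base R function for linear models, which has not been introduced in the text so far.

```r
# Illustrative data only: an invented, exactly linear outcome
x <- c(34, 38, 10, 8, 0)
y <- 2 + 0.5 * x

fit <- lm(y ~ x)

print(names(fit))           # components stored in the fitted object
print(fit$coefficients)     # estimated intercept and slope
print(fit$call)             # the function call that produced the fit
```

The components are extracted with $ or [[]] exactly as for the list mylist above.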

4.3 Missing values and other odd numerical values

Missing values in R are labelled NA, which stands for Not Available. This value has already been briefly introduced in the vector iq. The typical rule is that any operation involving NAs results in NA:

> mydataframe[5, 2]

[1] NA

> mydataframe[5, 2] + 4

[1] NA

> mean(iq)

[1] NA

Many built-in functions, though, give you the possibility to operate on data even if they contain missing values. One example is the mean, where you can pass the argument na.rm=TRUE, which stands for Not Available ReMove = TRUE. There is also a somewhat trickier way using the negation operator ! and the function is.na, which returns a logical value of TRUE or FALSE for each element.

> mean(iq, na.rm = TRUE)

[1] 97.5

> mean(iq[!is.na(iq) == TRUE])

[1] 97.5
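As a short sketch of the second approach: is.na already returns logical values, so the comparison with TRUE is not needed. The name iq.complete is only used for this illustration.

```r
iq <- c(100, 70, 80, 140, NA)

print(is.na(iq))                 # TRUE marks the missing entry
print(sum(is.na(iq)))            # number of missing values

iq.complete <- iq[!is.na(iq)]    # keep only the observed values
print(mean(iq.complete))         # same result as mean(iq, na.rm = TRUE)
```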

Besides missing values indicated as NA, R also knows three other “odd numerical values”: NaN, Inf and -Inf. NaN means Not a Number. You get this result when you take, for example, the square root of a negative number or when you divide zero by zero:

> 0/0

[1] NaN

> sqrt(-1)


[1] NaN

As you may guess from its abbreviation, Inf stands for Infinity. As there is no actual limit to numbers, you may wonder when you actually reach infinity on a finite machine like a computer. You can ask R for its numerical limits by using the variable .Machine, which returns various values. What we are interested in are

> .Machine$double.xmax

[1] 1.80e+308

> .Machine$double.xmin

[1] 2.23e-308

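If you exceed the upper limit, you reach infinity in R (footnote 4). Numbers smaller in magnitude than the lower limit do not become infinite; they lose precision and eventually underflow to 0.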

> .Machine$double.xmax * 1.00000000000001

[1] Inf
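The behaviour at both limits can be checked with a few lines. This is a minimal sketch that assumes a platform with standard IEEE 754 double precision; the names big and small are only used here, while is.finite, is.infinite and is.nan are base R functions.

```r
big <- .Machine$double.xmax
small <- .Machine$double.xmin

print(is.finite(big))           # TRUE: still a representable number
print(is.infinite(big * 2))     # TRUE: overflow gives Inf
print(small / 2 > 0)            # TRUE: below the lower limit, but not yet 0
print(small * 1e-20 == 0)       # TRUE: far below the limit the value is 0
print(is.nan(0 / 0))            # TRUE
print(is.na(NaN))               # TRUE: NaN also counts as missing
```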

4.4 Requesting Information About Data Structures

With the following functions you can retrieve information about R data objects, which can be very useful. Besides the already introduced names (which is actually an extraction function, as you see below), you can use the following commands:

> dimnames(mydataframe)

Footnote 4: Just in case you wonder why the value has an exponent of 308: it follows the IEEE Standard 754 (1985) for binary floating point operations. This standard is (or should be) supported by all C compilers, C being the language in which the core part of R was written. If you want to compile R yourself at home and you know that your C compiler supports higher values than the one specified by IEEE, you simply change the constant DBL_MAX in file (I forgot where it is). You can obtain more information about this interesting subject at http://grouper.ieee.org/groups/754/ or at http://www.validlab.com/goldberg/paper.ps. Interesting reading material about those limits is also “The Ariane 5 explosion as seen by a software engineer”, located at http://www.cas.mcmaster.ca/~baber/TechnicalReports/Ariane5/Ariane5.htm.

[[1]]
[1] "Marge"  "Homer"  "Bart"   "Lisa"   "Maggie"

[[2]]
[1] "ages" "iq"

> attributes(mylist)

$names
[1] "familynames"  "demographics" "location"
[4] "pets"         "relatives"

> mode(iq)

[1] "numeric"

> mode(simpsons)

[1] "character"

> mode(mylist)

[1] "list"

> mode(mydataframe)

[1] "list"

> typeof(mylist)

[1] "list"

> typeof(ages)

[1] "double"

> typeof(simpsons)

[1] "character"

> length(simpsons)

[1] 5

> length(mylist)

[1] 5

4.5 Exercises


Chapter 5

Distributions in R

5.1 Introduction

R has many built-in distributions which are useful for many purposes. For example, you may want to simulate an experiment and require random numbers from a certain distribution. Or the test statistic t of a test yields a certain value and you would like to know how probable this value is if t follows a certain distribution under the null hypothesis.

For each distribution, R offers 4 specific functions: the density, the distribution function, the quantile function and the generation of random numbers. You call one of these functions for a certain distribution by giving the name.in.R for the distribution preceded by one of the following letters: d (for the density), p (for the distribution function), q (for the quantile function) or r (for the generation of random numbers).

How these functions can be used will be shown by examples in the following sections based on the normal distribution.

5.2 (Cumulative) Distribution Function p

The cumulative distribution function (often abbreviated as cdf) of a random variate X, denoted by $F_X(x)$, is defined by (Casella and Berger, 1990):

$$F_X(x) = P_X(X \le x) \quad \text{for all } x \tag{5.1}$$

$F_X(x)$ thus denotes the probability, for a specified distribution, that a value is realized which is smaller than or equal to “your” x. In R, you get this probability by preceding the name of your distribution with a p. We demonstrate this using the standard normal distribution with mean µ = 0 and standard deviation σ = 1. If x = 0, pnorm(0) should return the value 0.5, as this distribution is symmetric around µ, and it actually does.

> pnorm(0)

[1] 0.5

You probably remember from your statistics lessons that you always took plus/minus two times the standard deviation to construct a confidence interval on the 95% level for normally distributed data. Let’s check whether this results in the values 0.025 and 0.975.

> pnorm(-2)

[1] 0.0228

> pnorm(2)

[1] 0.977

Well, not exactly. But it should be close enough. If you want to take “better values”, use 1.96 as a factor in the future.

> options(digits = 6)
> pnorm(-1.96)

[1] 0.0249979

> pnorm(1.96)

[1] 0.975002
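The two tail probabilities can be combined into the probability of falling inside the interval. The sketch below uses only pnorm; the variable names and the example with mean 100 and standard deviation 15 are chosen for illustration.

```r
# Probability that a standard normal value lies within +/- 1.96
inside.196 <- pnorm(1.96) - pnorm(-1.96)

# The same for +/- 2 standard deviations
inside.2 <- pnorm(2) - pnorm(-2)

# Other means and standard deviations are passed as arguments
p.shifted <- pnorm(100, mean = 100, sd = 15)

print(c(inside.196, inside.2, p.shifted))
```

The first value is very close to 0.95, the second is slightly larger, which is why 1.96 is the “better” factor for a 95% interval.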

5.3 Density Function d

For continuous distributions, the density function is defined as (Vogel, 1995):

$$f_X(x) = \frac{dF_X(x)}{dx} \quad \text{for all } x. \tag{5.2}$$

For discrete distributions, this function is often also called probability distribution function and is defined simply as (Casella and Berger, 1990):

$$f_X(x) = P(X = x) \quad \text{for all } x. \tag{5.3}$$

$f_X(x)$ thus denotes the probability (for discrete distributions) or the density (for continuous distributions) of your realization x for a given and specified distribution. In the case of tossing a (fair) die, the result is easy: as every number is equally probable, $f_X(x) = P(X = x) = \frac{1}{6}$ for all x. But how can a value from a continuous distribution such as:

> dnorm(0.275)

[1] 0.384139

be interpreted? Plotting the density of the corresponding (standard) normal distribution with reference lines at x = 0.275 and $f_X(0.275)$ should facilitate the understanding of that concept.

Figure 5.1: Density of A Standard Normal Distribution (plot of $f_X(x)$ against x for x from −4 to 4, with reference lines at x = 0.275 and $f_X(0.275) \approx 0.384$)
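One way to read a density value: it is the height of the curve at x, not a probability; probabilities correspond to areas under the curve. The following sketch checks this numerically with the base R function integrate(), which has not been introduced in the text; the names x0, area and cdf are only used here.

```r
x0 <- 0.275

# Area under the standard normal density to the left of x0 ...
area <- integrate(dnorm, lower = -Inf, upper = x0)$value

# ... agrees with the distribution function at x0
cdf <- pnorm(x0)

print(c(area = area, cdf = cdf))
```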

5.4 Quantile Function q

The quantile function, which is requested by preceding the name of the distribution in R with q, can be seen as the inverse of the cdf. For a given probability $F_X(x)$ (bounded, of course, within [0, 1]), what is the corresponding value of x? Please see the following examples:

> qnorm(0.5)

[1] 0

> qnorm(0.975)

[1] 1.96

> qnorm(0.999)

[1] 3.09
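That qnorm is the inverse of pnorm can be verified by a round trip. This is only a small sketch; the probabilities in p are chosen for illustration.

```r
p <- c(0.025, 0.5, 0.975)

q <- qnorm(p)             # quantiles belonging to the probabilities
back <- pnorm(q)          # applying the cdf returns the probabilities

print(round(q, 2))
print(back)
```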

5.5 Generation of Random Numbers r

To obtain random numbers from a specified distribution, you typically have to precede the name of the distribution in R with an r. To generate 10 random numbers from a standard normal distribution, you simply enter:

> rnorm(10)

 [1]  1.6937  1.0485 -0.9687 -0.0609  1.1327 -0.8732
 [7]  1.5037  0.1354  0.1451 -1.1404

5.5.1 Digression

It is actually not required to use this idea, but I want to briefly show you how you can obtain random numbers without using the function rdistributionname. It has been shown (e.g. Dagpunar, 1988) that if you want to generate a random variate X of a distribution with cdf $F_X(\cdot)$, you simply need to use the inverse of the cdf

$$X = F_X^{-1}(R) \tag{5.4}$$

where R is a uniformly distributed random number between 0 and 1 ($R \sim U(0, 1)$). We use the inverse of the cdf, which is equivalent to the quantile function described before. Now let’s try it:

> qnorm(runif(1))

[1] -0.0533

Seems to work. Let’s make a check:

> mean(rnorm(10^4))

[1] 0.00963

> sd(rnorm(10^4))

[1] 1.01

> mean(qnorm(runif(10^4)))

[1] -0.0029

> sd(qnorm(runif(10^4)))

[1] 1
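The same idea can be written out by hand for a distribution whose cdf is easy to invert. The sketch below uses the exponential distribution with cdf F(x) = 1 − exp(−rate · x); the rate of 2 and the seed are arbitrary choices for this illustration.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
rate <- 2                        # illustrative rate parameter
u <- runif(10^4)

# Inverse of the exponential cdf F(x) = 1 - exp(-rate * x)
by.hand <- -log(1 - u) / rate

# The built-in quantile function does the same
built.in <- qexp(u, rate = rate)

print(c(mean(by.hand), 1 / rate))   # sample mean vs. theoretical mean
```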

5.6 Overview of Functions

Table 5.1 gives an overview of the built-in distributions (in package stats) and their respective names in R. Please check for yourself, using ?name.in.R (e.g. ?rbinom or ?dpois), which parameters are required for the specific distribution you are interested in.

5.7 The Gompertz Distribution

5.7.1 Introduction

Unfortunately, the Gompertz distribution is not built in. However, we will need it quite often in survival analysis. Therefore, I have defined several useful functions for the Gompertz distribution. Please check Section 7.4 starting on page 68 to get to know how you can define functions yourself.

The Gompertz distribution always requires two parameters:

• the scale parameter α and

• the shape parameter β.

Table 5.1: Overview of Built-In Distributions in R

Real Name         R Name†     Real Name                        R Name†
Beta              beta        Logistic                         logis
Binomial          binom       Multinomial                      multinom
Cauchy            cauchy      Negative Binomial‡               nbinom
χ² (Chi-Square)   chisq       Normal                           norm
Exponential       exp         Poisson                          pois
F                 f           Wilcoxon Signed Rank Statistic   signrank
Γ (Gamma)         gamma       Student’s t                      t
Geometric         geom        Uniform                          unif
Hypergeometric    hyper       Weibull                          weibull
Log-Normal        lnorm

† The R name has to be preceded by r, d, p or q to obtain a random number from this distribution, the density, the distribution function, or quantiles from the respective distribution.
‡ Another version of the random number generator exists in package MASS; use it via: library(MASS); rnegbin(...)


5.7.2 Hazard Function

The hazard function h(x) of the Gompertz distribution is defined as:

$$h(x) = \alpha e^{\beta x} \tag{5.5}$$

And in R:

> hgompertz <- function(x, alpha, beta) {
+     return(alpha * exp(beta * x))
+ }

5.7.3 Survivor Function

The survivor function S(x) of the Gompertz distribution is defined as:

$$S(x) = e^{\frac{\alpha}{\beta}\left(1 - e^{\beta x}\right)} \tag{5.6}$$

And in R:

> Sgompertz <- function(x, alpha, beta) {
+     return(exp((alpha/beta) * (1 - exp(beta *
+         x))))
+ }

5.7.4 Cumulative Hazard Function

The cumulative hazard H(x) is defined as:

$$H(x) = -\ln S(x) \tag{5.7}$$

And in R:

> Hgompertz <- function(x, alpha, beta) {
+     return(-log(Sgompertz(x, alpha, beta)))
+ }

5.7.5 Density Function

The density function of the Gompertz distribution is defined as:

$$f_X(x) = \alpha e^{\beta x} e^{\frac{\alpha}{\beta}\left(1 - e^{\beta x}\right)} \tag{5.8}$$

And in R:

> dgompertz <- function(x, alpha, beta) {
+     return(alpha * exp(beta * x) * exp((alpha/beta) *
+         (1 - exp(beta * x))))
+ }

5.7.6 Cumulative Distribution Function

The cumulative distribution function $F_X(x)$ of the Gompertz distribution is defined as:

$$F_X(x) = 1 - S(x) \tag{5.9}$$

> pgompertz <- function(x, alpha, beta) {
+     return(1 - Sgompertz(x, alpha, beta))
+ }

5.7.7 Quantile Function

To obtain the quantile function of the Gompertz distribution we simply have to invert the cdf.

$$F_X^{-1}(p) = \beta^{-1}\left(\ln\left(-\ln(1 - p) + \frac{\alpha}{\beta}\right) - \ln\left(\frac{\alpha}{\beta}\right)\right) \tag{5.10}$$

> qgompertz <- function(p, alpha, beta) {
+     return(beta^(-1) * (log(-log(1 - p) +
+         (alpha/beta)) - log(alpha/beta)))
+ }

5.7.8 Generation of Random Numbers

To obtain random numbers from the Gompertz distribution, we simply take the quantile function $F_X^{-1}$ and use random numbers from a uniform distribution ($R \sim U(0, 1)$).

$$F_X^{-1}(R) = \beta^{-1}\left(\ln\left(-\ln(1 - R) + \frac{\alpha}{\beta}\right) - \ln\left(\frac{\alpha}{\beta}\right)\right) \tag{5.11}$$

In R, you use:

> rgompertz <- function(n, alpha, beta) {
+     return(beta^(-1) * (log(-log(1 - runif(n)) +
+         (alpha/beta)) - log(alpha/beta)))
+ }

Typical parameters for a human population whose mortality follows a Gompertz distribution are:

> alpha = 1e-04
> beta = 0.1

To test whether our formulas are working correctly, we generate 100,000 random numbers and overlay the resulting histogram with the density (see Fig. 5.2 on page 57).

> mygompis <- rgompertz(10^5, alpha, beta)
> hist(mygompis, breaks = 50, xlim = c(0, 100),
+     freq = FALSE, ylim = c(0, 0.05))
> lines(x = 0:100, y = dgompertz(0:100, alpha,
+     beta), lwd = 3, col = "red")
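Besides the visual check, the formulas can be tested numerically. The sketch below repeats the definitions from this section in compact form so that it runs on its own, and uses the base R function integrate(), which is an addition not used in the text; the upper bound of 150 is an arbitrary age far beyond the bulk of the distribution.

```r
# Definitions repeated from this section so the block runs on its own
Sgompertz <- function(x, alpha, beta) exp((alpha / beta) * (1 - exp(beta * x)))
dgompertz <- function(x, alpha, beta) alpha * exp(beta * x) * Sgompertz(x, alpha, beta)
pgompertz <- function(x, alpha, beta) 1 - Sgompertz(x, alpha, beta)
qgompertz <- function(p, alpha, beta) {
    (log(-log(1 - p) + alpha / beta) - log(alpha / beta)) / beta
}

alpha <- 1e-04
beta <- 0.1

# 1. The quantile function should invert the cdf
p <- c(0.1, 0.5, 0.9)
round.trip <- pgompertz(qgompertz(p, alpha, beta), alpha, beta)

# 2. The density should integrate to (almost) 1 over the age range
total <- integrate(dgompertz, lower = 0, upper = 150,
                   alpha = alpha, beta = beta)$value

print(round.trip)
print(total)
```

If both checks hold, the quantile function, the cdf and the density are at least consistent with one another.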

5.8 Generating Random Numbers from Given Data: sample

Suppose you have some data and you want to pick one or more of those data at random; then you use the function sample. In its simplest form, you have to feed this function only two arguments: the data from which you want to sample (x) and how many times you want to “pick” from this urn (size).

> sample(x = c("blue", "red", "green"), size = 2)

[1] "green" "blue"

Automatically, R assigns equal probabilities to each element of x and does not replace an element which has been picked. You can change these settings using the additional arguments prob and replace.

> sample(x = c("blue", "red", "green"), size = 10,
+     replace = TRUE, prob = c(0.5, 0.3, 0.2))

 [1] "red"   "green" "blue"  "blue"  "blue"  "blue"
 [7] "red"   "blue"  "red"   "red"

The function table will be shown later on. It gives you simple frequency tables. Let’s check whether we approach the given probabilities when we increase the sample size.

Figure 5.2: Overlapping Gompertz Random Numbers with the Corresponding Density (histogram of mygompis, x axis from 0 to 100, density axis from 0.00 to 0.05, with the Gompertz density drawn on top)


> (table(sample(x = c("blue", "red", "green"),
+     size = 10, replace = TRUE, prob = c(0.5,
+         0.3, 0.2))))/10

 blue green   red
  0.3   0.1   0.6

> (table(sample(x = c("blue", "red", "green"),
+     size = 100, replace = TRUE, prob = c(0.5,
+         0.3, 0.2))))/100

 blue green   red
 0.59  0.13  0.28

> (table(sample(x = c("blue", "red", "green"),
+     size = 1000, replace = TRUE, prob = c(0.5,
+         0.3, 0.2))))/1000

 blue green   red
0.505 0.172 0.323

> (table(sample(x = c("blue", "red", "green"),
+     size = 10000, replace = TRUE, prob = c(0.5,
+         0.3, 0.2))))/10000

 blue green   red
0.500 0.203 0.297
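Because sample draws at random, the numbers above change with every run. A minimal sketch of a reproducible version follows; set.seed() and prop.table() are base R functions not used in the text, and the seed value is an arbitrary choice.

```r
set.seed(2005)                       # arbitrary seed chosen for this sketch
colours <- c("blue", "red", "green")
probs <- c(0.5, 0.3, 0.2)

draws <- sample(x = colours, size = 10^5, replace = TRUE, prob = probs)

# prop.table() converts the counts into relative frequencies
freq <- prop.table(table(draws))
print(freq)
```

Note that table sorts the categories alphabetically, so the order of the output (blue, green, red) differs from the order in which the probabilities were given.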

5.9 Further Reading

• Dagpunar (1988)

• Casella and Berger (1990)

• Abramowitz and Stegun (1972)

5.10 Exercises


Chapter 6

Data Input/Output

6.1 Reading Data

6.1.1 Introduction

So far, we have used only data which we generated ourselves. One of the major tasks of any statistical package is, though, to be able to analyze given data. How to read in data will be explained in the following subsections.

6.1.2 Setting the Working Directory

Here I want to briefly repeat the function setwd which sets the current workingdirectory to the path and directory you want to read data from.

> setwd("n:\\Survive\\Datasets\\")

Please note that the argument given (n:\\Survive\\Datasets\\) has to be quoted ("...") and the various directories need to be separated by double backslashes (\\) if you are using a Windows platform. On a Unix/Linux platform, you simply use one ordinary slash (/). You can let R tell you what the current working directory is by using the getwd() function. The content of the current working directory is returned by the function dir(). Please see the following code example:

> getwd()

[1] "n:/Survive/Datasets"

> dir()

[1] "asthm98.sav"
[2] "Italy.cohort1890.txt"
[3] "Rintro.tex"
[4] "Sweden.cohort1890.txt"
[5] "Switzerland.cohort1890.txt"

6.1.3 Reading Text Data

Any statistical package is able to read pure text data (ASCII format). This format thus allows data to be exchanged regardless of which software is used. In R this operation is performed using the function read.table(). It reads in the specified file and transforms it into a dataframe. Several arguments are often used for this function, which are explained now (if you are missing some explanation, please use ?read.table). The only argument required in any case is the name of the file. The argument header is a logical variable indicating whether variable names are included at the top of each column or not. The default setting is FALSE; R therefore assumes no variable names if header is not set to TRUE. The argument sep specifies which delimiter has been used to separate the columns/variables. If you are using a *.csv file you need to tell R sep=",", if it is a space you write sep=" ", and in case of a TAB-separated file you need to specify the delimiter in C style: sep="\t". With these arguments, you should be able to read almost any text data. Sometimes, the arguments skip and nrows are useful to specify whether you want to skip several lines at the beginning and/or if you only want to read a certain number of lines (=rows). Let’s show this via an example:

> dir()

[1] "asthm98.sav"
[2] "Italy.cohort1890.txt"
[3] "Rintro.tex"
[4] "Sweden.cohort1890.txt"
[5] "Switzerland.cohort1890.txt"

> sweden1890 <- as.data.frame(read.table("Sweden.cohort1890.txt",
+     sep = " ", header = TRUE))
> names(sweden1890)

[1] "ageatdeath" "no.females" "no.males" "no.total"

> sum(sweden1890$no.females)

[1] 60786
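The arguments skip and nrows are not used in the example above, so here is a small self-contained sketch. The file content is invented and written to a temporary file first (tempfile(), writeLines() and unlink() are base R functions); only the column names are borrowed from the Swedish data set.

```r
# Write a tiny, invented file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("invented example data",
             "ageatdeath no.females no.males",
             "0 10 12",
             "1 3 4",
             "2 1 2"),
           con = tmp)

# skip = 1 jumps over the title line, nrows = 2 reads two data rows
small <- read.table(tmp, sep = " ", header = TRUE, skip = 1, nrows = 2)
print(small)

unlink(tmp)                          # remove the temporary file
```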


6.1.4 Reading Binary Data

R is also able to read binary data from other software packages. The package foreign can read data from the following programs:

• Epi Info

• Minitab

• SAS XPORT

• SPSS

• Stata: My own experience has shown me that R (version 1.8.0) had problems reading new Stata data. Thus it is recommended to save your data in Stata using the flag old, e.g. save filename, old replace

• S3: The “old” binary format used in S-Plus for “S3 classes”. Modern versions of S-Plus now use S4 classes. Please use the command data.dump(..., oldstyle=TRUE) in S-Plus to enable R to read S-Plus binary data sets.

• Excel Worksheets (provided by the package gregmisc).

As usual, let’s show how to do it via an example. The specified data set contains all deaths due to asthma in the United States in the year 1998 by sex, age, and month of death. Please note that read.spss does not automatically read the data in as a dataframe. Thus, we used the as.data.frame function to enforce this conversion.

> library(foreign)
> asthma <- as.data.frame(read.spss("asthm98.sav"))
> names(asthma)

[1] "SEX"      "DY"       "DM"       "AGE"
[5] "AGECLASS" "COUNT_1"

For other “foreign” data formats, please read the help file for the foreign package (click on Help, then Html Help, and then choose the package foreign). For the Excel format: library(gregmisc), ?read.xls.

6.2 Data Editor

Fiddling around manually in your data, as you may know it from SPSS or Excel, is also possible in R via the Data Editor. You can access the data editor using either fix(dataobject) or edit(dataobject) or via data.entry(dataobject), e.g. data.entry(asthma) or fix(asthma) or edit(asthma).


6.3 Writing Data

Sometimes you do not only want to read data but also want to write data to disk. The standard method to write data is the write.table function. Not only its name but also its usage is very close to the reading command read.table. Suppose you want to write the dataframe mydataframe to disk including the “variable” names and you want to separate the columns using a comma; then you have to do the following (please note how col.names=TRUE in write.table corresponds to header=TRUE in read.table):

> mydataframe

       ages  iq
Marge    34 100
Homer    38  70
Bart     10  80
Lisa      8 140
Maggie    0  NA

> write.table(x = mydataframe, file = "simpsonsdata.txt",
+     sep = ",", row.names = TRUE, col.names = TRUE)
> mynewdataframe <- read.table("simpsonsdata.txt",
+     header = TRUE, sep = ",")
> mynewdataframe

       ages  iq
Marge    34 100
Homer    38  70
Bart     10  80
Lisa      8 140
Maggie    0  NA

6.4 Further Reading

Besides the aforementioned help files on ?read.table, ?read.spss, . . . , the included manual “R Data Import/Export”, written by the R development core team, provides the most comprehensive overview of how to read and write data in R. You can access it via Help, Manuals (in PDF).

6.5 Exercises


Chapter 7

Programming with R

7.1 Introduction

So far, you might have the impression that R is nothing but just another statistical package like SPSS or STATA. R is, however, much more: it is a high-level programming language (with many pre-built possibilities to manipulate, analyse and display data) which is capable of doing anything other programming languages such as C, Fortran, or Pascal can do. The trade-off for its high level of abstraction is the relative complexity of the language, which seems to be overwhelming for the novice user (footnote 1). The following few sections should introduce you to the main concepts of programming in R.

7.2 Grouping Expressions

If you have some experience in programming, you know that any language providesa special construct for grouping expressions. If you have never heard of that. Don’tworry, you will grasp its usefulness within the next 10 minutes. In R,2 you can usecurly braces to group your expressions. These expressions can either be separatedby a semicolon or by a new line.

{ expression1; expression2; ...; expressionn }

{
    expression1
    expression2
    expressionn
}

¹ Especially compared to C, which is built upon 32 keywords (Kernighan and Ritchie, 2000).
² Just like in C or in LaTeX.
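As a minimal sketch (the object names are illustrative and not part of the original text), a group of expressions in curly braces is itself an expression. Its value is the value of the last expression inside it, which is why a whole group can be assigned to a name:

```r
# A braced group is evaluated top to bottom; its value is that of
# the last expression, so the whole group can be assigned to a name.
result <- {
    a <- 2          # first expression
    b <- 3          # second expression
    a * b           # last expression: value of the whole group
}
print(result)

# The same group written on one line, separated by semicolons.
same.result <- { a <- 2; b <- 3; a * b }
print(same.result)
```

This is the same mechanism that later lets the body of an if statement, a loop, or a function contain several lines.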

7.3 Flow-Control

7.3.1 Conditional Execution

Branching via if

Conditional execution is typically performed in any programming language using the command/function if. This is also the case for R. The main syntax is:

if ( condition is TRUE ) do_this

> simpsons

[1] "Marge" "Homer" "Bart" "Lisa" "Maggie"

> if (ages[2] == max(ages)) print("Homer is the oldest")

[1] "Homer is the oldest"

Using curly braces, which serve as the grouping symbol, we can introduce more expressions.

> if (ages[2] == max(ages)) {
+     another.expression <- 3
+     print("Homer is the oldest")
+     34 * 435
+ }

[1] "Homer is the oldest"
[1] 14790

As in almost any programming language, R also allows you to tell the interpreter what should happen if the condition is not TRUE but FALSE, via the else statement.

> myrandom <- rnorm(1)
> myrandom

[1] 1.42

> if (myrandom < 0) {
+     print("My Random Number is Smaller than 0.")
+ } else {
+     print("My Random Number is 0 or Greater.")
+ }

[1] "My Random Number is 0 or Greater."

When working with if constructions, one sometimes wants to combine several conditions. For this purpose R provides the operators &, &&, |, and ||. & and && are needed for AND constructions; | and || are their OR counterparts. The main difference is that the single operators apply the conditions element by element to vectors, whereas the double operators are meant for vectors of length one and return a single value. In the R version used here, && silently evaluates only the first element of longer vectors; since R 4.3.0 it signals an error instead. See the following example for clarification:

> myrandom <- rnorm(10)
> myrandom

 [1] -0.847 -1.237  0.238 -0.521 -1.245 -1.241 -0.775
 [8]  0.175  0.797  0.735

> (myrandom < 0) && (myrandom < -1)

[1] FALSE

> (myrandom < 0) & (myrandom < -1)

 [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
 [9] FALSE FALSE
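A minimal sketch of reducing an element-wise comparison to the single TRUE or FALSE that if expects, using any and all (the vector x is illustrative, not from the text). In R 4.3.0 and later, && and || signal an error when given vectors longer than one, so this reduction is the safer habit:

```r
# Illustrative data, fixed so that the results are reproducible.
x <- c(-0.8, -1.2, 0.2, -0.5)

# Element-wise AND: one logical value per element of x.
elementwise <- (x < 0) & (x < -1)
print(elementwise)

# Collapse to a single value before using it inside if ().
if (any(elementwise)) {
    print("At least one value is smaller than -1.")
}
if (!all(x < 0)) {
    print("Not all values are negative.")
}

# && is meant for conditions of length one, e.g. single elements.
first.ok <- (x[1] < 0) && (x[1] > -1)
print(first.ok)
```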

If you are working with vectors (which is usually the case), the function ifelse comes in handy. The syntax of ifelse is: ifelse(condition, true, false).

> iq

[1] 100 70 80 140 NA

> ifelse(iq < 100, "Stupid", "Clever")

[1] "Clever" "Stupid" "Stupid" "Clever" NA


Branching via switch

Less often used than if, but present in many programming languages, is the switch function. Meaningful applications are hard to think of outside of new, user-defined functions. This branching option is therefore introduced in Section 7.4, where it is explained how to develop your own functions.

7.3.2 Repetitive Execution

Besides the branching command if, there are several commands available in R for repetitive execution.

for-loops

The most common command for repetitive execution is the for loop. The syntax is for (parameter in parameterspace) dosomething. The parameter takes on the first value of parameterspace in the first pass and dosomething is executed. In the second pass the second value is taken from parameterspace, and so on, until the last value has been taken from parameterspace and dosomething is executed for the last time. R accepts numeric as well as character vectors for parameterspace. See the following examples for clarification (they also show, for the first time in this booklet, the possibility of nesting these control-flow statements):

> for (i in simpsons) {
+     print(i)
+ }

[1] "Marge"
[1] "Homer"
[1] "Bart"
[1] "Lisa"
[1] "Maggie"

> for (i in 1:5) {
+     print(simpsons[i])
+ }

[1] "Marge"
[1] "Homer"
[1] "Bart"
[1] "Lisa"
[1] "Maggie"

> for (i in (1:(length(simpsons)))) {
+     if (i < 2) {
+         print("Who is clever and who is stupid among the Simpsons?")
+     }
+     if (!is.na(iq[i])) {
+         if (iq[i] >= 100) {
+             print(c("Clever: ", simpsons[i]))
+         }
+         else {
+             print(c("Stupid: ", simpsons[i]))
+         }
+     }
+ }

[1] "Who is clever and who is stupid among the Simpsons?"
[1] "Clever: " "Marge"
[1] "Stupid: " "Homer"
[1] "Stupid: " "Bart"
[1] "Clever: " "Lisa"
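One caveat about looping over 1:length(x), sketched here under the assumption that the vector may be empty (the object names are illustrative): the colon operator then counts down from 1 to 0, so the loop body would run twice with invalid indices. The base function seq_along avoids this:

```r
# With an empty vector, 1:length(x) counts down from 1 to 0 ...
empty <- character(0)
print(1:length(empty))

# ... whereas seq_along(x) yields no indices, so the loop body is
# never executed.
visited <- 0
for (i in seq_along(empty)) {
    visited <- visited + 1
}
print(visited)
```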

while-loops

Less often used is the while loop. The syntax is a bit simpler than in the case of the for loop: while (condition is TRUE) dosomething, as you can see in this piece of code:

> i <- 10
> while (i > 0) {
+     print(i * i)
+     i <- i - 1
+ }

[1] 100
[1] 81
[1] 64
[1] 49
[1] 36
[1] 25
[1] 16
[1] 9
[1] 4
[1] 1

As you can see, the while loop needs an increment/decrement (or some other change to the tested condition) in the dosomething part (the body). Otherwise the program would end up in an infinite loop. The while construction is often used in numerical optimization to repeat a step until the error has fallen below a certain tolerance.
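To illustrate the remark on error tolerances, here is a small sketch that approximates a square root by Newton's iteration. The starting value, the tolerance, and all object names are assumptions made for this example; it is not a method used elsewhere in this booklet:

```r
# Newton iteration for the square root of a number, repeated until
# the change between two steps falls below an error tolerance.
target <- 2
estimate <- 1            # starting value
tolerance <- 1e-08       # stop once the step is smaller than this
change <- Inf            # ensures the loop body runs at least once
iterations <- 0
while (change > tolerance) {
    new.estimate <- 0.5 * (estimate + target/estimate)
    change <- abs(new.estimate - estimate)
    estimate <- new.estimate
    iterations <- iterations + 1
}
print(estimate)
print(iterations)
```

Here the condition depends on the size of the last step rather than on a counter, so the number of passes is not known in advance.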

repeat-loops

The syntax for the third loop construction, the repeat loop, is simply: repeat dosomething. In contrast to the while loop, there is no condition which has to be fulfilled. Therefore one has to include a statement telling R when to stop iterating. In R, this stop signal is the command break. Another difference is that while first tests the condition, which has to be true, whereas repeat immediately starts with the execution of the dosomething part. An example is shown here:

> i <- 5
> repeat {
+     print(simpsons[i])
+     i = i - 1
+     if (i < 1) {
+         break
+     }
+ }

[1] "Maggie"
[1] "Lisa"
[1] "Bart"
[1] "Homer"
[1] "Marge"

7.4 Writing Functions

One of the biggest advantages of R is its immense flexibility: if you are unhappy with the way something is implemented, or if you think something is missing, you can simply define your own function. Many functions in R have also been programmed that way (e.g. the standard deviation sd or the median median). The syntax is:

nameoffunction <- function(argument1, argument2, ...) {
    function-body
    return(myresult)
}

You assign a new function to a name (object) just like you assign values to variables, via the <- operator. You need to tell the new function what kind of arguments it requires (argument1, argument2, ... for the function-body).³ Inside the function-body your calculations are done. If you want to have a certain value returned, you need to give the return statement. This command is optional. If you omit it, the value of the last evaluated expression will be returned. See the following simple example, which squares the entered data (vector).

> mysquare <- function(x) {
+     myresult <- x * x
+     return(myresult)
+ }
> mysquare(34)

[1] 1156

> mysquare(1:10)

[1] 1 4 9 16 25 36 49 64 81 100

You can also give default settings for the arguments. If you want to rewrite the mean function so that it automatically excludes missing values (NAs), you simply do:

> mymean <- function(x, missings = TRUE) {
+     return(mean(x, na.rm = missings))
+ }
> iq

[1] 100 70 80 140 NA

> mean(iq)

[1] NA

> mymean(iq)

[1] 97.5

> mymean(iq, FALSE)

[1] NA

³ Unlike in many programming languages, there is no need to tell R of what data type these arguments are.

R is not restricted to returning just one value. The following code shows an extended example for calculating descriptive statistics of a given data set.

> mydesc <- function(x) {
+     rsltmin <- min(x)
+     rsltmax <- max(x)
+     rsltmed <- median(x)
+     rsltmean <- mean(x)
+     rsltsd <- sd(x)
+     rsltvar <- var(x)
+     myresult <- list(min = rsltmin, max = rsltmax,
+         median = rsltmed, mean = rsltmean,
+         stdev = rsltsd, var = rsltvar)
+     return(myresult)
+ }
> exampledata <- rnorm(1e+05)
> mydesc(exampledata)

$min
[1] -4.23

$max
[1] 4.31

$median
[1] 0.00419

$mean
[1] -5.38e-05

$stdev
[1] 1.00

$var
[1] 1.01

> (mydesc(exampledata))[[6]]

[1] 1.01


> (mydesc(exampledata))$var

[1] 1.01

R also allows you to define functions within other functions. Please see the following code for a hypothetical example.

> average <- function(x) {
+     sumofelements <- function(data) {
+         return(sum(data))
+     }
+     numberofelements <- function(data) {
+         return(length(data))
+     }
+     core.procedure <- function(mydata) {
+         the.result <- sumofelements(mydata)/numberofelements(mydata)
+         return(the.result)
+     }
+     my.final.result <- core.procedure(x)
+     return(my.final.result)
+ }
> average(exampledata)

[1] -5.38e-05

> mydata

 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7
[18]  8  9 10

Please note that the original data set mydata has not been changed, although we used an argument called mydata. The reason is that R can distinguish whether an object is just needed within a function or whether it is globally defined.⁴
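The following sketch makes the same point in isolation. The names myvalues and double.it are hypothetical and chosen so that they do not overwrite any object used in this chapter:

```r
# A globally defined object.
myvalues <- c(1, 2, 3)

double.it <- function(myvalues) {
    # This myvalues is the function's own argument; changing it
    # does not touch the global object of the same name.
    myvalues <- myvalues * 2
    return(myvalues)
}

doubled <- double.it(myvalues)
print(doubled)    # the modified copy returned by the function
print(myvalues)   # the global object, still unchanged
```

If the global object should change, the result has to be assigned back explicitly, for example via myvalues <- double.it(myvalues).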

As promised in the section on conditional execution, the switch statement mainly makes sense within a new, user-defined function. Please look at the following example to see how R can switch between several branches. We construct a function which calculates various descriptive statistics in a similar way as before. First, we give an example without the switch statement. Afterwards, we show that switch helps to write code which is more elegant.

⁴ If you want to learn more about it: this feature is called the scope of functions.

> descnoswitch <- function(x, type = "mean") {
+     if (type == "mean") {
+         myresult <- mean(x)
+     }
+     if (type == "median") {
+         myresult <- median(x)
+     }
+     if (type == "var") {
+         myresult <- var(x)
+     }
+     return(myresult)
+ }
> descswitch <- function(x, type = "mean") {
+     myresult <- switch(type, mean = mean(x),
+         var = var(x), median = median(x))
+     return(myresult)
+ }
> descswitch(mydata, "median")

[1] 5.5

> descnoswitch(mydata, "median")

[1] 5.5
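Both functions above behave awkwardly when type matches none of the branches: descswitch returns NULL, and descnoswitch typically stops because myresult was never created. As a hedged sketch (descswitch2 is a hypothetical variant, not part of the original booklet), an unnamed final argument to switch can serve as a default branch:

```r
descswitch2 <- function(x, type = "mean") {
    # An unnamed last argument of switch acts as the default branch.
    myresult <- switch(type,
        mean = mean(x),
        median = median(x),
        var = var(x),
        stop("unknown type: ", type))
    return(myresult)
}

descswitch2(c(1, 2, 3, 10), "median")

# An unsupported type now gives an informative error instead of
# returning NULL; try() lets the script continue afterwards.
outcome <- try(descswitch2(c(1, 2, 3, 10), "range"), silent = TRUE)
print(class(outcome))
```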

7.5 Problems with loops

Compared to compiled languages, R is slower at executing loop structures. It is therefore recommended to avoid loops as much as possible. Thanks to the vectorized design of R, this is easier (in most cases) than it may seem. It should be noted, however, that it is not always possible to vectorize a loop: some procedures are intrinsically repetitive. Vectorization is usually performed by using one of the following functions: apply, sapply, tapply, or lapply. We will only show how to use lapply, sapply, and apply.

Consider 5 populations which are combined into one list:

> pop1 <- rnorm(10^3, mean = 12, sd = 32)
> pop2 <- rnorm(10^3, mean = 34, sd = 12)
> pop3 <- rnorm(10^3, mean = 76, sd = 65)
> pop4 <- rnorm(10^3, mean = 32, sd = 2)
> pop5 <- rnorm(10^3, mean = 324, sd = 32)
> populations <- list(p1 = pop1, p2 = pop2,
+     p3 = pop3, p4 = pop4, p5 = pop5)

Now we would like to calculate the mean and the standard deviation of each of these populations.

One could use a for-loop like this:

> no.of.pops <- length(populations)
> mymeans1 <- as.list(numeric(no.of.pops))
> for (i in 1:no.of.pops) {
+     mymeans1[[i]] = mean(populations[[i]])
+ }
> mymeans1

[[1]]
[1] 14.175

[[2]]
[1] 34.123

[[3]]
[1] 73.173

[[4]]
[1] 31.99

[[5]]
[1] 326.17

Easier, more elegant, and often faster for huge datasets is the function lapply (and likewise sapply, which returns the result in a more user-friendly form), which applies a certain function to all elements of a list:

> mymeans2 <- lapply(populations, mean)
> mymeans3 <- sapply(populations, mean)
> mymeans2

$p1
[1] 14.175

$p2
[1] 34.123

$p3
[1] 73.173

$p4
[1] 31.99

$p5
[1] 326.17

> mymeans3

     p1      p2      p3      p4      p5
 14.175  34.123  73.173  31.990 326.174
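The listings above compute only the means. As a sketch of how the standard deviations could be obtained in the same call (the list pops and the fixed seed are assumptions made so that the example is self-contained), sapply can be given a small anonymous function that returns both statistics:

```r
# Illustrative populations; set.seed makes the draws reproducible.
set.seed(1)
pops <- list(p1 = rnorm(1000, mean = 12, sd = 32),
             p2 = rnorm(1000, mean = 34, sd = 12))

# The anonymous function returns a named vector of length two, so
# sapply simplifies the result to a matrix: one column per
# population, one row per statistic.
mystats <- sapply(pops, function(x) c(mean = mean(x), sd = sd(x)))
print(dim(mystats))
print(round(mystats, 1))
```

With lapply instead of sapply, the same call would return a list with one two-element vector per population.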

While the two previous commands (lapply and sapply) work element-wise on a list, apply, as ?apply tells you, returns a vector or array or list of values obtained by applying a function to margins of an array. What is meant by this becomes clearer with an example. Consider a matrix or a dataframe of values:

> mydataframe

       ages  iq
Marge    34 100
Homer    38  70
Bart     10  80
Lisa      8 140
Maggie    0  NA

Now you want to know the mean of the ages and of the iq, i.e. the means of each column. This is done simply by using apply, which has the syntax apply(data, dimension, function):

> apply(mydataframe, 2, mean, na.rm = TRUE)

ages   iq
18.0 97.5
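A short sketch of two alternatives for column means, rebuilt on a small dataframe so that it runs on its own (simpsons.df is a hypothetical stand-in for mydataframe). Note that apply converts a dataframe to a matrix first, which only works well when all columns are numeric:

```r
# A small dataframe shaped like mydataframe in the text.
simpsons.df <- data.frame(ages = c(34, 38, 10, 8, 0),
    iq = c(100, 70, 80, 140, NA),
    row.names = c("Marge", "Homer", "Bart", "Lisa", "Maggie"))

# apply first converts the dataframe to a matrix.
via.apply <- apply(simpsons.df, 2, mean, na.rm = TRUE)

# colMeans is a shortcut for column means; sapply treats the
# dataframe as a list of columns.
via.colmeans <- colMeans(simpsons.df, na.rm = TRUE)
via.sapply <- sapply(simpsons.df, mean, na.rm = TRUE)

print(via.apply)
print(all.equal(via.apply, via.colmeans))
print(all.equal(via.apply, via.sapply))
```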

The value 2 has been chosen for the dimension as this indicates the columns. Remember: the first dimension in R is the rows, and the columns come second. Individual “layers” of three-dimensional arrays are accessed via the third index. This is also reflected in apply if you, for example, want to calculate the median in each “layer” of a given data array:

> apply(myarray, 3, median)

[1] 0.518 0.784 0.505 0.428
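The object myarray is not constructed in this section. The sketch below therefore builds a small hypothetical array (example.array, filled with the integers 1 to 24) to show what the third margin refers to:

```r
# A hypothetical 2 x 3 x 4 array: 2 rows, 3 columns, 4 layers.
example.array <- array(1:24, dim = c(2, 3, 4))

# Margin 3 selects the layers, so one median is returned per layer.
layer.medians <- apply(example.array, 3, median)
print(layer.medians)

# Margins can be combined: c(1, 2) gives one value per row/column
# cell, computed across the layers.
cell.sums <- apply(example.array, c(1, 2), sum)
print(dim(cell.sums))
```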

7.6 Further Reading

• Venables and Ripley (2000)

7.7 Exercises


Chapter 8

Graphing Data

8.1 Introduction

As pointed out in the introduction, R has graphical capabilities which are hard to match by any other statistical package or language. The following few sections will introduce you to some of the most important graphing commands. For that purpose, we use the data set which contains the ages at death of the Swedish birth cohort of 1890.

> sweden1890 <- as.data.frame(read.table("Sweden.cohort1890.txt",
+     sep = " ", header = TRUE))

8.2 Basic plotting function: plot

The simplest plotting command has already been introduced. For a point plot, just write:

> plot(sweden1890$no.females)

The following code and its result in Figure 8.1 show how you can develop a plot piece by piece. For that purpose, the new function par is introduced, which sets graphical parameters. In our case we inform R that we want to construct a multi-panel graph consisting of 2 rows and 2 columns. If you want to get more information, check the respective help pages like ?par, ?plot, . . .

> par(mfrow = c(2, 2))
> plot(sweden1890$no.females)
> plot(x = sweden1890$ageatdeath, y = sweden1890$no.females,
+     type = "l", col = "red")
> plot(x = sweden1890$ageatdeath, y = sweden1890$no.females,
+     type = "l", col = "red", lty = 1, lwd = 2)
> lines(x = sweden1890$ageatdeath, y = sweden1890$no.males,
+     col = "blue", lty = 2, lwd = 1)
> plot(x = sweden1890$ageatdeath, y = sweden1890$no.females,
+     type = "l", col = "red", lty = 1, lwd = 2,
+     xlim = c(0, 110), ylim = c(0, 8000), xlab = "Age at Death",
+     ylab = "No. of Deaths", axes = FALSE)
> lines(x = sweden1890$ageatdeath, y = sweden1890$no.males,
+     col = "blue", lty = 2, lwd = 1)
> axis(side = 1, at = seq(from = 0, to = 110,
+     by = 10), labels = TRUE, tick = TRUE)

NULL

> axis(side = 2, at = seq(from = 0, to = 8000,
+     by = 2000), labels = TRUE, tick = TRUE)

NULL

> legend(x = 30, y = 6000, legend = c("Women",
+     "Men"), col = c("red", "blue"), lty = c(1,
+     2), lwd = c(2, 1))

Figure 8.1: Constructing a Plot Piece By Piece (four panels; the first three plot sweden1890$no.females against the index or against sweden1890$ageatdeath, the fourth shows No. of Deaths against Age at Death with a legend for Women and Men)

8.3 Histograms - hist

Histograms are a useful tool to display the distribution of univariate data. The easiest way to accomplish this is to use the hist function. The following code and its result in Figure 8.2 show its usage.

> par(mfrow = c(1, 2))
> swedish.women <- rep(sweden1890$ageatdeath,
+     sweden1890$no.females)
> hist(swedish.women, breaks = 110, xlim = c(0,
+     110), ylim = c(0, 8000))
> swedish.men <- rep(sweden1890$ageatdeath,
+     sweden1890$no.males)
> hist(swedish.men, breaks = 110, xlim = c(0,
+     110), ylim = c(0, 8000))

Figure 8.2: How to plot a histogram (two panels, “Histogram of swedish.women” and “Histogram of swedish.men”, each showing Frequency by age)

You can also make histogram-like plots using the argument type="h" for the command plot. Try it for yourself!

8.4 Barplot - barplot

A useful tool for the visual display of data is the barplot. Look at Figure 8.3 to see a stacked barplot as you may know it from Excel. The plot has been produced using the following code:¹

> data(UKLungDeaths)
> uklung <- aggregate(ts.union(mdeaths, fdeaths),
+     1)
> barplot(t(uklung), names = 1974:1979, col = c("blue",
+     "red"))
> legend(x = 3, y = 15000, legend = c("Women",
+     "Men"), bg = "white", fill = c("red",
+     "blue"))
> title(main = "Death from Lung Disease in the UK")

NULL

¹ Example taken from: Venables and Ripley (1999), slightly modified.

Figure 8.3: Barplot for Women and Men Dying From Lung Diseases in the UK (stacked bars for the years 1974 to 1979, titled “Death from Lung Disease in the UK”, with a legend for Women and Men)

8.5 Making boxplots - boxplot

Univariate data can be plotted by using a histogram. A histogram does not give, however, much information about the descriptive statistics of the underlying data. One way to obtain these statistics is to use the command summary.

> swedish.men <- rep(sweden1890$ageatdeath,
+     sweden1890$no.males)
> summary(swedish.men)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    16.0    65.0    50.8    78.0   109.0

> summary(swedish.women)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    25.0    70.0    55.6    82.0   107.0

Using boxplots facilitates this approach considerably.

> par(mfrow = c(1, 2))
> data(ChickWeight)
> boxplot(ChickWeight$weight, xlab = "Chicken Weights",
+     ylab = "Weight in Grams")
> boxplot(list(Women = swedish.women, Men = swedish.men),
+     ylab = "Age at Death")

Figure 8.4 shows the standard boxplots as R draws them. The boxplot of the chicken weights has been included because the Swedish cohort data have no outliers. The line in the middle of each box shows the median of the respective dataset. The box around this median is bounded by the 25% quantile at the bottom and by the 75% quantile at the top. The distance between the 25% quantile and the 75% quantile is called the interquartile range (IQR). The whiskers extend to the most extreme observations that lie no more than 1.5×IQR above the 75% quantile and no more than 1.5×IQR below the 25% quantile. Values that fall outside the range of the whiskers are plotted individually as small circles, as we can see in the left panel of Figure 8.4.
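The numbers behind a boxplot can be inspected without drawing anything. The following sketch uses boxplot.stats from the grDevices package (attached by default) on a small made-up vector; the data and the object names are illustrative assumptions:

```r
# Illustrative data with one clearly extreme value.
weights <- c(40, 42, 45, 47, 50, 52, 55, 58, 60, 120)

# boxplot.stats returns what boxplot draws: $stats holds the lower
# whisker, lower hinge, median, upper hinge and upper whisker;
# $out holds the points drawn individually.
bs <- boxplot.stats(weights)
print(bs$stats)
print(bs$out)

# Fences: 1.5 times the box height beyond the hinges. The whiskers
# stop at the most extreme observations inside these fences.
box.height <- bs$stats[4] - bs$stats[2]
upper.fence <- bs$stats[4] + 1.5 * box.height
lower.fence <- bs$stats[2] - 1.5 * box.height
print(c(lower.fence, upper.fence))
```

Strictly speaking, R uses the hinges returned by fivenum for the box, which are very close to, but not always identical with, the 25% and 75% quantiles returned by quantile.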

Figure 8.4: Constructing a Boxplot for Weights of Chickens (left) and Ages at Death of the Swedish Birth Cohort from 1890 (right)

8.6 QQ-Plots - qqplot / qqnorm

The last kind of plot which will be introduced here is the so-called QQ-plot. It is a useful tool for detecting irregularities in your data. QQ-plots are usually employed in two different situations:

• you want to check whether two empirical distributions are identical.

• you want to check if one empirical distribution follows a certain distribution (e.g. normal distribution)

QQ-plots sort the given data and produce a plot where the quantiles from one data set are plotted either against those of the other data set or against those of the “theoretical” distribution. This helps a great deal to detect irregularities in your data, especially if we include an extra line via the abline command with intercept 0 and slope 1. If one distribution matched the other, all points would lie on the added line. Figure 8.5 will help to understand the usefulness of these plots.

> par(mfrow = c(2, 1))
> qqplot(swedish.women, swedish.men)
> abline(0, 1, col = "red", lwd = 3)
> womnew <- quantile(swedish.women, probs = seq(from = 0,
+     to = 1, length = 500))
> mennew = quantile(swedish.men, probs = seq(from = 0,
+     to = 1, length = 500))
> plot(womnew, mennew)
> abline(0, 1, col = "red", lwd = 3)

As you can see from the second part of the code, it is not so difficult to construct such plots yourself, simply using some data transformations and then the basic plot command.
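The second situation, comparing one sample with a theoretical normal distribution, is handled by qqnorm. The sketch below uses a simulated sample (an assumption made for illustration) and shows that the coordinates returned by qqnorm can also be constructed by hand with ppoints and qnorm:

```r
# Illustrative sample; set.seed makes it reproducible.
set.seed(42)
mysample <- rnorm(200, mean = 50, sd = 10)

# qqnorm pairs the sample with theoretical normal quantiles.
# With plot.it = FALSE it only returns the coordinates.
qq <- qqnorm(mysample, plot.it = FALSE)

# The same coordinates constructed by hand: ppoints gives the
# probabilities, qnorm turns them into standard normal quantiles.
theoretical <- qnorm(ppoints(length(mysample)))
empirical <- sort(mysample)
print(all.equal(sort(qq$x), theoretical))
print(all.equal(sort(qq$y), empirical))

# To draw the plot with a reference line, one would call:
# qqnorm(mysample); qqline(mysample, col = "red", lwd = 3)
```

Unlike abline(0, 1), qqline draws a line through the first and third quartiles, which is appropriate when the sample is not standardized.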

8.7 Further Plotting Commands

The following pieces of code and graphs may repeat several things you already know by now. The best thing is to look at the graphs and, if you want to find out how you can construct such a plot, just check the corresponding code.

Besides the already known par command, which sets the graphic parameters in general, you can also use the less commonly used screen function to partition your screen.

Figure 8.5: An Example of a QQ-Plot (upper panel: swedish.men against swedish.women; lower panel: mennew against womnew)

> split.screen(c(1, 2))

[1] 1 2

> split.screen(c(2, 1), screen = 2)

[1] 3 4

> screen(1)
> plot(x = sweden1890$ageatdeath, y = sweden1890$no.females,
+     type = "h", xlim = c(0, 110), ylim = c(0,
+         6000))
> text(x = 20, y = 5000, labels = "a)")

NULL

> screen(3)
> barplot(sweden1890$no.females, names.arg = sweden1890$ageatdeath,
+     xlim = c(0, 110), ylim = c(0, 6000))
> title(main = "Women")

NULL

> text(x = 20, y = 5000, labels = "b)")

NULL

> screen(4)
> barplot(sweden1890$no.males, names.arg = sweden1890$ageatdeath,
+     xlim = c(0, 110), ylim = c(0, 6000))
> title(main = "Men")

NULL

> text(x = 20, y = 5000, labels = "c)")

NULL

> close.screen(all = TRUE)

Figure 8.6: Displaying Data in Histogram-like Plot (panel a: sweden1890$no.females against sweden1890$ageatdeath; panel b: barplot “Women”; panel c: barplot “Men”)

While histograms might be appropriate for detecting oddities like extreme outliers, they are less useful in other respects. Figure 8.6 b) and c) may suggest that the distribution of age at death among Swedish women and men born in 1890 is very similar. Another plot type, the so-called boxplot, can usually shed further light on such an assessment.

> women <- rep(sweden1890$ageatdeath, sweden1890$no.females)
> men <- rep(sweden1890$ageatdeath, sweden1890$no.males)
> boxplot(list(Women = women, Men = men), range = 0)

Figure 8.7: Boxplots (Women and Men, ages at death)

Figure 8.8 shows how you can relate histograms and boxplots.

> split.screen(c(2, 1))

[1] 1 2

> split.screen(c(1, 2), screen = 1)

[1] 3 4

> split.screen(c(1, 2), screen = 2)

[1] 5 6

> screen(3)
> plot(x = sweden1890$ageatdeath, y = sweden1890$no.females,
+     type = "h", xlim = c(0, 110), ylim = c(0,
+         6000), xlab = "Age at Death", ylab = "No. of Deaths")
> title("Women, Histogram")

NULL

> abline(v = (quantile(women))[[2]], col = "green")
> text(x = 35, y = 2000, labels = "25% Quantile",
+     adj = 0, col = "green")

NULL

> abline(v = (quantile(women))[[4]], col = "blue")
> text(x = 35, y = 5000, labels = "75% Quantile",
+     adj = 0, col = "blue")
> abline(v = (quantile(women))[[3]], col = "red")
> text(x = 35, y = 3500, labels = "Median",
+     adj = 0, col = "red")
> screen(5)
> plot(x = sweden1890$ageatdeath, y = sweden1890$no.males,
+     type = "h", xlim = c(0, 110), ylim = c(0,
+         6000), xlab = "Age at Death", ylab = "No. of Deaths")
> abline(v = (quantile(men))[[2]], col = "green")
> text(x = 35, y = 2000, labels = "25% Quantile",
+     adj = 0, col = "green")

NULL

> abline(v = (quantile(men))[[4]], col = "blue")
> text(x = 35, y = 5000, labels = "75% Quantile",
+     adj = 0, col = "blue")
> abline(v = (quantile(men))[[3]], col = "red")
> text(x = 35, y = 3500, labels = "Median",
+     adj = 0, col = "red")
> title("Men, Histogram")

NULL

> screen(4)
> women <- rep(sweden1890$ageatdeath, sweden1890$no.females)
> boxplot(women, range = 0, ylim = c(0, 110),
+     ylab = "Age at Death")
> title("Women, Boxplot")

NULL

> screen(6)
> men <- rep(sweden1890$ageatdeath, sweden1890$no.males)
> boxplot(men, range = 0, ylim = c(0, 110),
+     ylab = "Age at Death")
> title("Men, Boxplot")

NULL

> screen(1)
> plot(1:10, ylim = c(0, 110), xlab = "", ylab = "",
+     axes = FALSE, type = "n")
> abline(h = (quantile(women))[[3]], col = "red")
> abline(h = (quantile(women))[[2]], col = "green")
> abline(h = (quantile(women))[[4]], col = "blue")
> screen(2)
> plot(1:10, ylim = c(0, 110), xlab = "", ylab = "",
+     axes = FALSE, type = "n")
> abline(h = (quantile(men))[[3]], col = "red")
> abline(h = (quantile(men))[[2]], col = "green")
> abline(h = (quantile(men))[[4]], col = "blue")
> close.screen(all = TRUE)

This code looks fairly complicated. But if you have a closer look, you will see that the building blocks are mostly already known to you.

Figure 8.8: Relating Histograms with Boxplots (panels “Women, Histogram” and “Men, Histogram” showing No. of Deaths by Age at Death with the 25% quantile, median and 75% quantile marked; panels “Women, Boxplot” and “Men, Boxplot” showing Age at Death)

8.8 Further Reading

8.9 Exercises

Chapter 9

Simple Statistical Models

9.1 Introduction

This chapter introduces the basics of how to use R to estimate statistical models. Our course focusses on Survival Analysis. Nevertheless, I want to introduce the basic concepts of regression modeling using simple linear models.

For that purpose we use the data set Davis in the package car, which has information on the weight and height of 200 individuals, broken down by sex.¹

> library(car)
> data(Davis)
> names(Davis)

[1] "sex"    "weight" "height" "repwt"  "repht"

> summary(Davis)

 sex         weight          height        repwt           repht
 F:112   Min.   : 39.0   Min.   : 57   Min.   : 41.0   Min.   :148
 M: 88   1st Qu.: 55.0   1st Qu.:164   1st Qu.: 55.0   1st Qu.:161
         Median : 63.0   Median :170   Median : 63.0   Median :168
         Mean   : 65.8   Mean   :170   Mean   : 65.6   Mean   :168
         3rd Qu.: 74.0   3rd Qu.:177   3rd Qu.: 73.5   3rd Qu.:175
         Max.   :166.0   Max.   :197   Max.   :124.0   Max.   :200
                                       NA's   : 17.0   NA's   : 17

¹ The variables repwt and repht, on self-reported weight and self-reported height, are not used here.

> Davis2 <- subset(Davis, height/weight > 0.4)
> Daviswomen <- subset(Davis2, sex == "F")
> Davismen <- subset(Davis2, sex == "M")

In the line Davis2..., we removed only one case, who had a height of 57 cm and a weight of 166 kg, most probably an error during data entry. Note that inside subset the column names refer to the dataframe given as the first argument.

9.2 Plotting the Data

We want to analyze whether there is a linear relationship which is able to predict a person's weight from his or her height. Even though we are planning to do some regression analysis, it is always recommended to plot your data first. We use different colors for women and men as it is possible that the relationship is different for either sex.

> plot(x = Davismen$height, y = Davismen$weight,
+     col = "blue", xlim = c(150, 200), ylim = c(40,
+         120), xlab = "Height", ylab = "Weight")
> points(x = Daviswomen$height, y = Daviswomen$weight,
+     col = "red")
> legend(x = 150, y = 120, legend = c("Women",
+     "Men"), pch = 1, col = c("red", "blue"))

The point plot in Figure 9.1 suggests two things: (1) Overall, the assumption of linearity seems to be okay. (2) This linear relationship is probably the same for women and for men.

Figure 9.1: Is there a linear relationship between height and weight? - A first glance at the data (Weight against Height, with a legend for Women and Men)

9.3 Estimating the Linear Model and Accessing the Components of the Result

R allows a relatively intuitive way of writing down the formula of the equation to be estimated. In our case, we would like to predict the weight of a person from his or her height. We express this in R as (lm stands for linear model):

> mylinmod <- lm(formula = Davis2$weight ~ Davis2$height,
+     data = Davis2)

The components of the estimation result can be listed via the well-known function names. Not surprisingly, summary gives a summary of the most important values, and coef extracts the regression coefficients.

> names(mylinmod)

 [1] "coefficients"  "residuals"     "effects"
 [4] "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "xlevels"
[10] "call"          "terms"         "model"

> summary(mylinmod)

Call:
lm(formula = Davis2$weight ~ Davis2$height, data = Davis2)

Residuals:
    Min      1Q  Median      3Q     Max
-19.650  -5.419  -0.576   4.857  42.887

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -130.7470    11.5627   -11.3   <2e-16 ***
Davis2$height    1.1492     0.0677    17.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.52 on 197 degrees of freedom
Multiple R-Squared: 0.594,     Adjusted R-squared: 0.592
F-statistic:  288 on 1 and 197 DF,  p-value: <2e-16

> coef(mylinmod)

  (Intercept) Davis2$height
      -130.75          1.15

This informs us that the adjusted R-squared is 0.592 and that, on average, weight increases by 1.15 kg for each additional centimeter in height. The height covariate is highly significant (p < 2 × 10^-16).
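A hedged sketch of how such a fit can be used for prediction. Since the car package may not be installed, the example uses the data set women that ships with R (heights in inches, weights in pounds) as a stand-in for Davis2; the object names are illustrative. The formula is written with bare column names, because predict can only substitute new values when the variables are looked up in data rather than referenced as Davis2$height:

```r
# 'women' ships with R (package datasets): heights in inches and
# weights in pounds of 15 women. It stands in for Davis2 here.
data(women)

# Variable names in the formula are looked up in 'data'.
fit <- lm(weight ~ height, data = women)
print(coef(fit))

# Adjusted R-squared and the slope's p-value from the summary.
fit.summary <- summary(fit)
print(fit.summary$adj.r.squared)
print(coef(fit.summary)["height", "Pr(>|t|)"])

# Predicted weights for two new heights.
newpeople <- data.frame(height = c(60, 70))
print(predict(fit, newdata = newpeople))
```

The same pattern would apply to the model in the text if it were specified as lm(weight ~ height, data = Davis2).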

9.4 Regression Diagnostics

Now, let’s run several (graphical) diagnostics to check whether our model meetsalso its requirements. We do this by a 2× 2-panel as shown in Figure 9.2 on 101:The upper left graph has been produced by the following code:

> par(mfrow = c(2, 2))
> plot(x = Davismen$height, y = Davismen$weight,
+     col = "blue", xlim = c(150, 200), ylim = c(40,
+         120), xlab = "Height", ylab = "Weight")
> points(x = Daviswomen$height, y = Daviswomen$weight,
+     col = "red")
> legend(x = 150, y = 120, legend = c("Women",
+     "Men"), pch = 1, col = c("red", "blue"))
> abline(mylinmod)

The function abline knows which components have to be extracted from the fitted model to draw the fitted line correctly (the intercept coef(mylinmod)[[1]] and the slope coef(mylinmod)[[2]]). Now we also add fitted lines for women and men to check whether we would have obtained more or less the same parameters if we had estimated two separate models.

> mylinmod.women <- lm(formula = Daviswomen$weight ~
+     Daviswomen$height)
> mylinmod.men <- lm(formula = Davismen$weight ~
+     Davismen$height)
> abline(mylinmod.women, col = "red")
> abline(mylinmod.men, col = "blue")

Our plot suggests that it was okay to estimate the model for women and men at the same time.

The plot in the upper right corner checks whether the mean of the residuals is zero, as required by the linear model.

> plot(fitted(mylinmod), resid(mylinmod))
> lines(lowess(fitted(mylinmod), resid(mylinmod)),
+     col = "red", lwd = 3)
> abline(h = 0, col = "blue", lwd = 3)

For that purpose, we plot the residuals against the fitted values. We superimpose a scatterplot smoother via the lowess function and compare it with a reference line drawn at zero. Despite small deviations, this plot gives no indication that our linear model is inappropriate.

The remaining two panels check graphically whether the assumption of normally distributed residuals is met. The lower left panel displays a histogram of the residuals.

> hist(resid(mylinmod), xlim = c(-40, 40), breaks = 25)

This diagnostic, too, gives a picture that resembles a normal distribution relatively well. The last plot we show to check the fit of the model is the Q-Q plot. Remember that it plots the quantiles of your (empirical) data against theoretical quantiles.

> qqnorm(resid(mylinmod))
> qqline(resid(mylinmod))

First, we plot the quantiles of the residuals of our model against the theoretical quantiles of the normal distribution using qqnorm. To facilitate the check, we add a reference line with qqline. Looking at this plot in the lower right corner, we can state that the residuals follow a normal distribution remarkably well and, thus, the assumptions of our linear model have not been violated.
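For readers without the Davis data at hand, the four panels can be reproduced in one go on a built-in data set. This is a sketch under assumptions: the women data set replaces Davis2, there is no split by sex, and pdf(NULL) is used to open a graphics device that draws nowhere so that the code also runs non-interactively (remove that line and the dev.off() call to see the plots on screen).

```r
# Sketch: a 2 x 2 diagnostic panel for a model fitted to built-in data.
# Assumption: 'women' stands in for Davis2.
fit <- lm(weight ~ height, data = women)
res <- resid(fit)

pdf(NULL)                      # null device, nothing is written to disk
par(mfrow = c(2, 2))

# Upper left: data with the fitted line
plot(women$height, women$weight, xlab = "Height", ylab = "Weight")
abline(fit)

# Upper right: residuals against fitted values, smoother and zero line
plot(fitted(fit), res)
lines(lowess(fitted(fit), res), col = "red", lwd = 3)
abline(h = 0, col = "blue", lwd = 3)

# Lower left: histogram of the residuals
hist(res)

# Lower right: normal Q-Q plot with reference line
qqnorm(res)
qqline(res)

invisible(dev.off())

# With an intercept in the model, the least-squares residuals average to zero
mean_residual <- mean(res)
```

Note that the overall mean of the residuals is zero by construction whenever the model contains an intercept; what the smoother in the upper right panel helps to judge is whether the residuals also stay close to zero across the whole range of fitted values.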

9.5 Technical Digression: using matrix language to estimate the coefficients

This is just a short digression which should show how you can estimate a linear model “yourself” using the built-in matrix language. Assume a linear model which can be denoted as $y = X\beta + \varepsilon$. In our case, the design matrix $X$ consists of only one vector, the height of a person. The response vector $y$ is the weight of the person. As is well known from many textbooks, you can solve for $\beta$ using the equation $\beta = (X'X)^{-1}X'y$.
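For readers who want the intermediate step: the formula results from minimising the residual sum of squares with respect to β. The derivation below is the standard textbook argument and assumes that X'X is invertible.

```latex
\begin{aligned}
S(\beta) &= (y - X\beta)'(y - X\beta) \\
\frac{\partial S(\beta)}{\partial \beta} &= -2\,X'(y - X\beta) = 0 \\
X'X\,\beta &= X'y \\
\beta &= (X'X)^{-1}X'y
\end{aligned}
```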

Figure 9.2: Regression Diagnostics
[Figure not reproduced. Four panels: (upper left) Weight against Height with legend Women/Men; (upper right) resid(mylinmod) against fitted(mylinmod); (lower left) Histogram of resid(mylinmod), with Frequency on the vertical axis; (lower right) Normal Q-Q Plot, Sample Quantiles against Theoretical Quantiles.]

How can we do this using the matrix language in R? It is relatively simple, as you can see here. We only need to add another column to the design matrix, consisting only of ones, since lm also estimated a model with an intercept.

> options(digits = 12)
> x = Davis2$height
> xmatrix <- as.matrix(cbind(rep(1, length(x)),
+     x))
> y = Davis2$weight
> coef(lm(y ~ x))

     (Intercept)                x
-130.74698436261    1.14922231385

> coefficients(lsfit(x = x, y = y))

       Intercept                X
-130.74698436261    1.14922231385

> solve(t(xmatrix) %*% xmatrix) %*% t(xmatrix) %*%
+     y

              [,1]
  -130.74698436262
x    1.14922231385
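Forming the inverse of X'X explicitly mirrors the textbook formula, but it is usually not how the computation is carried out in practice; lm itself works with a QR decomposition. As a hedged sketch (assuming the built-in women data in place of Davis2, with illustrative object names), the same coefficients can be obtained by solving the normal equations directly with crossprod, or by a QR-based least-squares solve with qr.solve:

```r
# Assumption: 'women' stands in for Davis2; all names are illustrative.
x <- women$height
y <- women$weight
X <- cbind(Intercept = 1, height = x)   # design matrix with a column of ones

# Textbook formula with an explicit inverse
beta_inverse <- solve(t(X) %*% X) %*% t(X) %*% y

# Solve the normal equations X'X b = X'y without forming the inverse
beta_normal <- solve(crossprod(X), crossprod(X, y))

# Least-squares solution via a QR decomposition
beta_qr <- qr.solve(X, y)

# Reference values from lm
beta_lm <- coef(lm(y ~ x))

print(cbind(inverse = drop(beta_inverse), normal = drop(beta_normal),
            qr = beta_qr, lm = unname(beta_lm)))
```

All four columns should agree up to rounding error. The QR route avoids squaring the condition number of the design matrix, which matters when covariates are nearly collinear.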

9.6 Further Reading

9.7 Exercises

Chapter 10

More Useful Functions

10.1 Crosstabulations using table

If you would like to do cross-tabulations like CROSSTABS in SPSS or PROC FREQ ... in SAS, you simply use the command table in R. A rather silly but illuminating example would be a table of mydataframe.

> mydataframe

       ages  iq
Marge    34 100
Homer    38  70
Bart     10  80
Lisa      8 140
Maggie    0  NA

> table(mydataframe)

    iq
ages 70 80 100 140
  0   0  0   0   0
  8   0  0   0   1
  10  0  1   0   0
  34  0  0   1   0
  38  1  0   0   0

If you want the relative frequencies of those entries (multiply by 100 for percentages), simply do:

> table(mydataframe)/(sum(table(mydataframe)))

    iq
ages   70   80  100  140
  0  0.00 0.00 0.00 0.00
  8  0.00 0.00 0.00 0.25
  10 0.00 0.25 0.00 0.00
  34 0.00 0.00 0.25 0.00
  38 0.25 0.00 0.00 0.00
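Base R also offers prop.table for this step, which additionally allows proportions within rows or columns. The sketch below rebuilds mydataframe from the values printed above; this reconstruction is an assumption, since the original construction is not shown in this section.

```r
# Assumption: mydataframe re-entered from the printed values
mydataframe <- data.frame(
  ages = c(34, 38, 10, 8, 0),
  iq = c(100, 70, 80, 140, NA),
  row.names = c("Marge", "Homer", "Bart", "Lisa", "Maggie")
)

counts <- table(mydataframe)   # the row with iq = NA is not counted

# Overall proportions: same result as counts / sum(counts)
overall <- prop.table(counts)

# Proportions within each row (margin = 1); the empty row for
# age 0 becomes NaN because it is divided by a row total of zero
by_row <- prop.table(counts, margin = 1)

# Multiply by 100 to obtain percentages
percentages <- 100 * overall
print(percentages)
```

Because Maggie's iq is missing, only four observations enter the table, which is why each non-zero cell equals 0.25.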

The table command is also quite useful for converting raw data into aggregated form. Do you remember how we expanded the data set of Swedish women into individuals? We did it like this:

> swedish.women <- rep(sweden1890$ageatdeath,
+     sweden1890$no.females)

To reverse this process we do:

> new.swedish.women <- table(swedish.women)
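The round trip can be checked on a small example. The counts below are made up for illustration and are not the Swedish data; the sketch also shows one pitfall, namely that ages with a count of zero vanish unless the levels are stated explicitly.

```r
# Hypothetical aggregated data: number of deaths by age
ageatdeath <- c(0, 1, 2, 3)
no.females <- c(5, 2, 0, 3)

# Expand to one entry per individual
individuals <- rep(ageatdeath, no.females)

# Aggregate again
aggregated <- table(individuals)

# table() stores the values as character names; convert them back to numbers
ages_back <- as.numeric(names(aggregated))
counts_back <- as.vector(aggregated)

# Age 2 had zero deaths and is missing above; giving the levels keeps it
aggregated_full <- table(factor(individuals, levels = ageatdeath))
print(aggregated_full)
```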

10.2 How many different values exist? Using unique

If you want to know which values a certain variable takes on, you can use the function unique, which returns the unique values of its argument. Duplicates are automatically removed.

> unique(swedish.women)

 [1]   0   1   2   3   4   5   6   7   8   9  10  11
[13]  12  13  14  15  16  17  18  19  20  21  22  23
[25]  24  25  26  27  28  29  30  31  32  33  34  35
[37]  36  37  38  39  40  41  42  43  44  45  46  47
[49]  48  49  50  51  52  53  54  55  56  57  58  59
[61]  60  61  62  63  64  65  66  67  68  69  70  71
[73]  72  73  74  75  76  77  78  79  80  81  82  83
[85]  84  85  86  87  88  89  90  91  92  93  94  95
[97]  96  97  98  99 100 101 102 103 104 105 106 107

This is much faster than the equivalent:

> as.numeric(as.character(levels(as.factor(swedish.women))))

 [1]   0   1   2   3   4   5   6   7   8   9  10  11
[13]  12  13  14  15  16  17  18  19  20  21  22  23
[25]  24  25  26  27  28  29  30  31  32  33  34  35
[37]  36  37  38  39  40  41  42  43  44  45  46  47
[49]  48  49  50  51  52  53  54  55  56  57  58  59
[61]  60  61  62  63  64  65  66  67  68  69  70  71
[73]  72  73  74  75  76  77  78  79  80  81  82  83
[85]  84  85  86  87  88  89  90  91  92  93  94  95
[97]  96  97  98  99 100 101 102 103 104 105 106 107
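One caveat: the two expressions agree here presumably only because swedish.women is already stored in ascending order of age. unique keeps the values in order of first appearance, whereas factor levels are sorted. A small sketch with made-up values shows the difference:

```r
x <- c(30, 10, 20, 10, 30)   # hypothetical unsorted values

# Order of first appearance
u <- unique(x)

# Sorted, because factor levels are sorted
l <- as.numeric(levels(as.factor(x)))

# Sorting the unique values reproduces the factor-based result
u_sorted <- sort(unique(x))

# Number of distinct values
n_distinct <- length(unique(x))
print(list(unique = u, levels = l, n = n_distinct))
```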

10.3 Splitting Data: split and cut

Sometimes you want to split your data into distinct groups, for example women and men. If you would like to obtain the weight in the Davis data set separately for women and men, you simply do:

> sex.separated.weight <- (split(Davis$weight,
+     Davis$sex))
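split returns a list with one element per group, which is convenient to pass on to sapply. A minimal sketch with made-up weights (the Davis data are not reproduced here, and all names are illustrative):

```r
# Hypothetical data
weight <- c(77, 58, 53, 68, 59, 76)
sex <- factor(c("M", "F", "F", "M", "F", "M"))

by_sex <- split(weight, sex)

# Elements are accessed by group name
women_weights <- by_sex$F
men_weights <- by_sex[["M"]]

# Apply a function to every group
group_means <- sapply(by_sex, mean)
print(group_means)
```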

A similar function is cut(x, breaks). It assigns each value in x a factor level specified by breaks. To count how many values fall into each bin, you pass the result to table. Let's show this by putting 10^5 uniformly distributed random numbers into 10 bins of the same size. We should obtain more or less the same number of elements in each bin.

> bindata <- runif(10^5)
> bins1 <- table(cut(x = bindata, breaks = seq(from = 0,
+     to = 1, length = 11)))
> bins1

  (0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5]
    10033      9962     10058     10051     10047
(0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1]
    10027      9949     10061      9904      9908

For exactly this operation, it is recommended to use the hist function instead, since it is less memory hungry (as written in the help page ?cut). As you can see, the result is exactly the same.

> bins2 <- (hist(x = bindata, breaks = seq(from = 0,
+     to = 1, length = 11), plot = FALSE))$counts
> bins2

 [1] 10033  9962 10058 10051 10047 10027  9949 10061
 [9]  9904  9908
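By default, cut uses intervals that are open on the left and closed on the right, which matters for a value lying exactly on the lowest break. The following sketch uses a deterministic grid instead of random numbers so that the result is reproducible; the variable names are illustrative.

```r
breaks <- seq(from = 0, to = 1, length.out = 11)

# 1000 evenly spaced values strictly inside (0, 1): 0.0005, 0.0015, ...
x <- (seq_len(1000) - 0.5) / 1000

# Bin counts obtained in the two ways described in the text
counts_cut <- as.vector(table(cut(x, breaks = breaks)))
counts_hist <- hist(x, breaks = breaks, plot = FALSE)$counts

# A value equal to the lowest break is outside (0, 0.1] and becomes NA ...
at_zero_default <- cut(0, breaks = breaks)
# ... unless include.lowest = TRUE is supplied
at_zero_included <- cut(0, breaks = breaks, include.lowest = TRUE)

print(counts_cut)
```

hist uses include.lowest = TRUE by default, so the two approaches can differ for data that contain the lowest break itself. With runif this is not a concern in practice, since according to its help page the extreme values are not generated.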

10.4 Sorting Data: sort and order

Sorting is a bit tricky in R. So I want to show how you can sort data, separately for single vectors and for dataframes.

10.4.1 Sorting by One Variable

Sorting Vectors

If you want to sort a vector, you simply use the command sort. The default is ascending order. For descending order, you have to supply the argument decreasing = TRUE.

> ages

[1] 34 38 10 8 0

> sort(ages)

[1] 0 8 10 34 38

> sort(ages, decreasing = TRUE)

[1] 38 34 10 8 0

Sorting Dataframes

Sorting dataframes according to one variable is still relatively easy.

> mydataframe[order(mydataframe$ages), ]

       ages  iq
Maggie    0  NA
Lisa      8 140
Bart     10  80
Marge    34 100
Homer    38  70
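order does not return the sorted values themselves but the permutation of row positions that would sort them, which is why it is used inside the square brackets. A short sketch, again with mydataframe re-entered from the printed values (an assumption):

```r
# Assumption: mydataframe re-entered from the printed values
mydataframe <- data.frame(
  ages = c(34, 38, 10, 8, 0),
  iq = c(100, 70, 80, 140, NA),
  row.names = c("Marge", "Homer", "Bart", "Lisa", "Maggie")
)

# Positions of the rows in ascending order of age
idx <- order(mydataframe$ages)
sorted_asc <- mydataframe[idx, ]

# Descending order
sorted_desc <- mydataframe[order(mydataframe$ages, decreasing = TRUE), ]

# Missing values are placed last by default (na.last = TRUE)
sorted_iq <- mydataframe[order(mydataframe$iq), ]
print(sorted_iq)
```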

10.4.2 Sorting by More Than One Variable

Sometimes it is possible to calculate an index variable made up of several other variables. In that case, sorting is still not too difficult. See the following example.

> years <- sample(1980:1989, size = 10, replace = FALSE)
> months <- sample(1:12, size = 10, replace = TRUE)
> gdp <- sample(1:1000, size = 10)
> mysortindex <- years + ((months - 1)/12)
> gdpframe <- as.data.frame(cbind(years, months,
+     gdp, mysortindex))
> gdpframe

   years months gdp   mysortindex
1   1982     12 697 1982.91666667
2   1980      5 498 1980.33333333
3   1987      4 327 1987.25000000
4   1983     12 802 1983.91666667
5   1984     12 781 1984.91666667
6   1985      8 896 1985.58333333
7   1989      4 451 1989.25000000
8   1986     10 719 1986.75000000
9   1988      9 576 1988.66666667
10  1981     10 151 1981.75000000

> gdpframe[order(gdpframe$mysortindex), ]

   years months gdp   mysortindex
2   1980      5 498 1980.33333333
10  1981     10 151 1981.75000000
1   1982     12 697 1982.91666667
4   1983     12 802 1983.91666667
5   1984     12 781 1984.91666667
6   1985      8 896 1985.58333333
8   1986     10 719 1986.75000000
3   1987      4 327 1987.25000000
9   1988      9 576 1988.66666667
7   1989      4 451 1989.25000000

If you have more than one variable by which the data (vector or dataframe) should be ordered, you proceed as follows:

> newweight <- sample(x = c(80, 100, 120), size = 10,
+     replace = TRUE)
> newheight <- sample(x = 150:200, size = 10,
+     replace = TRUE)
> newsex <- sample(x = c(1, 2), size = 10, replace = TRUE)
> newdataframe <- as.data.frame(cbind(newheight,
+     newweight, newsex))
> newdataframe

   newheight newweight newsex
1        154        80      2
2        198       100      2
3        151       120      2
4        154       120      2
5        197       120      2
6        188       120      1
7        191       120      1
8        199       120      1
9        154       100      1
10       174       120      2

> newdataframe[order(newdataframe$newsex, newdataframe$newweight),
+     ]

   newheight newweight newsex
9        154       100      1
6        188       120      1
7        191       120      1
8        199       120      1
1        154        80      2
2        198       100      2
3        151       120      2
4        154       120      2
5        197       120      2
10       174       120      2
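When the sort keys should run in different directions, a common idiom for numeric keys is to negate the one that should be descending. The sketch below re-enters the values printed above by hand (the original data were drawn at random, so this is an assumption about the underlying object) and adds height as a third key to break ties:

```r
# Assumption: values copied from the printed output
newdataframe <- data.frame(
  newheight = c(154, 198, 151, 154, 197, 188, 191, 199, 154, 174),
  newweight = c(80, 100, 120, 120, 120, 120, 120, 120, 100, 120),
  newsex = c(2, 2, 2, 2, 2, 1, 1, 1, 1, 2)
)

# Ascending by sex, descending by weight, ascending by height
idx <- order(newdataframe$newsex,
             -newdataframe$newweight,
             newdataframe$newheight)
sorted <- newdataframe[idx, ]
print(sorted)
```

Ties that remain after all keys have been used keep their original relative order.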

Chapter 11

Further Reading:

11.1 Background on R

• Ihaka and Gentleman (1996)

• R Development Core Team (2003)

11.2 Introductions

• Venables and Ripley (1999)

• Dalgaard (2002)

• Krause and Olson (2000)

11.3 Graphics

11.4 Programming in R/S

11.5 . . .


Part II

Using R for Survival Analysis


Bibliography

Abramowitz, M. and I. Stegun (1972). Handbook of Mathematical Functions (10th printing ed.). Applied Mathematics Series - 55. Washington, D.C.: National Bureau of Standards.

Casella, G. and R. L. Berger (1990). Statistical Inference. Belmont, CA: Duxbury Press.

Dagpunar, J. (1988). Principles of Random Variate Generation. Oxford, UK: Clarendon Press.

Dalgaard, P. (2002). Introductory Statistics with R. Statistics and Computing. New York, NY: Springer.

Ihaka, R. and R. Gentleman (1996). R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 5 (3), 299–314.

Kernighan, B. W. and D. M. Ritchie (2000). The C Programming Language (2nd ed.). Prentice Hall Software Series. Englewood Cliffs, NJ: Prentice Hall.

Krause, A. and M. Olson (2000). The Basics of S and S-PLUS. Statistics and Computing. New York, NY: Springer.

R Development Core Team (2003). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-00-3.

Venables, W. and B. Ripley (1999). Modern Applied Statistics with S-PLUS (3rd ed.). New York, NY: Springer.

Venables, W. and B. Ripley (2000). S Programming. Statistics and Computing. New York, NY: Springer.

Vogel, F. (1995). Beschreibende und schließende Statistik. Formeln, Definitionen, Erläuterungen, Stichwörter und Tabellen (8th ed.). München: Oldenbourg.