A gentle introduction to R – how to load in data and produce summary statistics BRC MH Bioinformatics group
Mar 28, 2015

Dylan Rollins
Tutorial outline

• How to install R on your own computers
  – It's free
  – But it's already installed on these computers
• Loading data from Excel
• Plotting
• Summary statistics

Files

• Data and slides on:
  http://core.brc.iop.kcl.ac.uk/brc-bioinformatics-workshop-october-2012

Show file extensions

• Uncheck ‘hide extensions for known file types’

• Click ‘Apply’

Installing R – skip as already installed

Then follow the installation instructions for your operating system

Starting R on these computers

Help files

Loading help files

• A useful function is read.table()
  – It allows you to read data from spreadsheets into R
• To see its help file you can use

?read.table

• You can use ?function_name for any function to see a help file

Loading data into R from Excel

From Excel

Open testdata.xls

From Excel

• You need to save it as a comma separated value file (.csv): go to File > Save As > Other Formats

R working directory

• To open a file you will need to point R towards the folder that contains it.

• You can do this with setwd(), but we’ll do it using the mouse

• Suppose you have the file in My Documents
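For anyone who prefers typing to clicking, here is a minimal sketch of the setwd() route mentioned above. The folder used here is tempdir(), which is only a stand-in so the example runs anywhere; in practice you would put the path to the folder that holds your data.

```r
# Remember where R is currently pointing so we can go back afterwards
old.dir <- getwd()

# tempdir() is only a stand-in; replace it with the folder holding your data
setwd(tempdir())

# Confirm that the working directory has changed
getwd()

# Return to the original working directory
setwd(old.dir)
```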

Browsing folders

• To check that you are in the right folder type

getwd()

• To see files in this folder you can type

list.files()

• To list the current variables type

ls()

• Nothing should be loaded yet

Loading data

To follow along with this section, make sure your R working directory is the folder that contains the tutorial data

• Read the contents of the file testdata.csv into an R variable my.data with:

my.data <- read.csv('testdata.csv')

• read.csv is a wrapper for read.table; calling read.table directly lets you specify more details about your file, e.g.:

my.data <- read.table('testdata.csv', sep=',', header=TRUE)
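The two calls can be tried side by side without the workshop files. The sketch below writes a tiny CSV to a temporary file first; the file contents and the names csv.file, from.csv and from.table are made up for illustration.

```r
# Write a tiny made-up CSV to a temporary file so the example runs anywhere
csv.file <- tempfile(fileext = ".csv")
writeLines(c("height,weight,gender",
             "170,70,male",
             "165,60,female",
             "180,85,male"), csv.file)

# read.csv() assumes comma-separated values with a header row
from.csv <- read.csv(csv.file)

# read.table() needs the separator and the header spelled out
from.table <- read.table(csv.file, sep = ",", header = TRUE)

# For a simple file like this one the two results match
all.equal(from.csv, from.table)
```

On messier files the two can differ, because read.csv also changes a few other defaults (quoting and comment handling, for example).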

read.table()

• sep : Column separator
• header : Does the first row of the file contain column headers?
• skip : Number of rows to skip at the top of the file
• ?read.table for other useful parameters

Looking at loaded data

• Check your new variable is in the R environment:

ls()

• Take a look at the first few lines:

head(my.data)

• Generate some basic summary stats:

summary(my.data)

• Check the dimensions of your dataset:

dim(my.data)

• Number of rows and columns

nrow(my.data)
ncol(my.data)

• Row and column names

rownames(my.data)
colnames(my.data)
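To try the inspection functions without the workshop file, the sketch below builds a small stand-in for my.data. The column names height, weight and gender follow the tutorial data, but the six rows of values are invented for illustration.

```r
# A made-up stand-in for the tutorial data (values are invented)
my.data <- data.frame(
  height = c(170, 165, 180, 158, 175, 162),
  weight = c(70, 60, 85, 52, 78, 58),
  gender = c("male", "female", "male", "female", "male", "female")
)

head(my.data, n = 3)   # first three rows
summary(my.data)       # per-column summary statistics

dim(my.data)           # number of rows, then number of columns
nrow(my.data)
ncol(my.data)
colnames(my.data)
```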

Subsetting Data

• Look at the first row:

my.data[1,]

• Look at the first column:

my.data[,1]

• Look at the third column of row 10:

my.data[10,3]

• Look at the first column for rows 100 to 110

my.data[100:110,1]

• Same as above, but save to a variable

my.subset <- my.data[100:110,1]

• Look at rows 30, 40, 50 and 60

my.data[c(30,40,50,60),]

• Same as above but pre-defining the index vector

my.indices <- c(30, 40, 50, 60)
my.data[my.indices,]
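The same [row, column] patterns can be tried on a much smaller table. This is a sketch on an invented six-row data frame, so the row numbers are smaller than those used with the tutorial data.

```r
# A made-up stand-in for the tutorial data (values are invented)
my.data <- data.frame(
  height = c(170, 165, 180, 158, 175, 162),
  weight = c(70, 60, 85, 52, 78, 58),
  gender = c("male", "female", "male", "female", "male", "female")
)

my.data[1, ]      # first row, all columns
my.data[, 1]      # first column, all rows
my.data[2, 3]     # third column of row 2

# A range of rows from the first column, saved to a variable
my.subset <- my.data[2:4, 1]

# A pre-defined vector of row numbers
my.indices <- c(1, 3, 5)
my.data[my.indices, ]
```

Leaving the space before or after the comma empty means "everything" along that dimension.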

You can subset on names instead of indices:

• Look at the column named 'weight' for row 1:

my.data[1,'weight']

• Look at the columns named 'weight' and 'height' for row 1:

my.data[1,c('weight','height')]

• Same as above but pre-define the colnames vector

cols <- c('weight','height')
my.data[1,cols]

Negative indices exclude elements:

• Look at all columns except the second for row 1

my.data[1,-2]

• Extract all rows except 1-100

my.new.data <- my.data[-1:-100,]

• Extract all rows except 35, 67, 101

my.indices <- -1 * c(35, 67, 101)
my.new.data <- my.data[my.indices,]
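Name-based and negative indexing can be checked on a small table too. The sketch below uses an invented six-row data frame; the variable names by.name, no.second and my.other.data are only for illustration.

```r
# A made-up stand-in for the tutorial data (values are invented)
my.data <- data.frame(
  height = c(170, 165, 180, 158, 175, 162),
  weight = c(70, 60, 85, 52, 78, 58),
  gender = c("male", "female", "male", "female", "male", "female")
)

# Select columns by name, in the order given
cols <- c("weight", "height")
by.name <- my.data[1, cols]

# Drop the second column for row 1
no.second <- my.data[1, -2]

# Drop rows 1 to 2
my.new.data <- my.data[-1:-2, ]

# Drop rows 2 and 4 using a pre-defined vector of negative indices
my.indices <- -1 * c(2, 4)
my.other.data <- my.data[my.indices, ]
```

Positive and negative indices cannot be mixed in the same vector; R reports an error if you try.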

Quiz!

• How tall is the person in the 7th row?

• What gender is the person in the 300th row?

• For the people in rows 20-30, who is the heaviest?

• For the people in rows 110, 350, 219, 74, who is the tallest?

• Save all rows except 500-600 in a variable my.new.data

• How many males and females are in this new dataset?

Formatting problems

Data isn't comma-separated?

• Specify the separator in read.table

• Tab-delimited text is another common format, for which you can use sep="\t"

Load "testdata.txt", a tab-delimited version of the data
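As a rough sketch of what reading a tab-delimited file looks like, the example below writes its own small file to a temporary location rather than relying on testdata.txt; the contents are invented.

```r
# Write a tiny made-up tab-delimited file so the example runs anywhere
tab.file <- tempfile(fileext = ".txt")
writeLines(c("height\tweight\tgender",
             "170\t70\tmale",
             "165\t60\tfemale"), tab.file)

# Tell read.table() that columns are separated by tabs
my.data <- read.table(tab.file, sep = "\t", header = TRUE)
my.data
```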

Data has extra header information at the top?

• Either delete this data in Excel before exporting to csv

• Or, use the skip=N argument to read.table

Have a look at "testdata_1.csv" in Excel and then load it into R using read.table
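The skip argument can be tried on a small file with junk lines at the top. The two extra header lines below are invented; the layout of the real testdata_1.csv may differ, so check it in a text editor or Excel before choosing N.

```r
# A made-up CSV with two lines of extra information above the real header
csv.file <- tempfile(fileext = ".csv")
writeLines(c("Exported by: someone",
             "Date: unknown",
             "height,weight,gender",
             "170,70,male",
             "165,60,female"), csv.file)

# skip = 2 discards the first two lines, then the header row is read as usual
my.data <- read.table(csv.file, sep = ",", header = TRUE, skip = 2)
my.data
```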

Factors are inconsistently named

• R will just read in the data you give it.

• If you aren't consistent in naming the levels of your factors, R will treat them as different levels

• R is case sensitive. 'MyLevel' != 'mylevel'

Load the data from testdata_2.csv and have a look at the gender variable.

Try and fix the problems in Excel and reload.
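The exercise asks for the fix to be made in Excel, but the same clean-up can also be done in R. The sketch below is one possible approach, not the workshop's own solution: the inconsistent labels are invented, and tolower() and trimws() are used to standardise case and strip stray spaces.

```r
# Made-up gender labels with inconsistent case and a trailing space
gender <- c("male", "Male", "female", "FEMALE", "female ")

# R sees five different labels here
table(gender)

# Standardise case, remove surrounding whitespace, then convert to a factor
gender.clean <- factor(tolower(trimws(gender)))
levels(gender.clean)
table(gender.clean)
```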

Measurements and units in a single column

• If you store values like 10kg, R will not interpret this as a numeric column

Try loading file 'testdata_3.csv' - what has happened to the weights and heights information?

Try loading again so that the two are loaded as character vectors.

Have a look at the sub() function and see if you can fix the problem
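One way the sub() fix might look is sketched below on an invented file. The "kg" and "cm" suffixes are assumptions about how the units are written; adjust the patterns to whatever testdata_3.csv actually contains.

```r
# A made-up CSV where units are stored together with the measurements
csv.file <- tempfile(fileext = ".csv")
writeLines(c("height,weight",
             "170cm,70kg",
             "165cm,60kg"), csv.file)

# stringsAsFactors = FALSE keeps the columns as character vectors
my.data <- read.csv(csv.file, stringsAsFactors = FALSE)

# Remove the unit suffix, then convert the remaining text to numbers
my.data$weight <- as.numeric(sub("kg", "", my.data$weight, fixed = TRUE))
my.data$height <- as.numeric(sub("cm", "", my.data$height, fixed = TRUE))

summary(my.data)
```

sub() replaces only the first match in each element, which is enough when each value carries a single unit suffix.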

Excel has just screwed up your data

• Older versions of Excel have a limit of 65536 rows. If you open a larger dataset in Excel it will be truncated. If you then save this dataset you will be saving the truncated version.

Avoid opening large datasets in Excel; use R instead

• Excel tries to be helpful by formatting elements for you. Try the following and then open in Excel, save as csv and reload into R. What has happened?

my.genes <- c('MASH1','SOX2','OCT4')
write.csv(my.genes, file='mygenes.csv')

Plotting

Drawing histograms

Optional exercises –

1) Try drawing a histogram of height

2) Try and label the x axis [hint: read the help file]
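If you get stuck on the exercises, the sketch below shows one way they could be approached, using an invented six-row data frame in place of the tutorial data. The axis label and title text are arbitrary choices.

```r
# A made-up stand-in for the tutorial data (values are invented)
my.data <- data.frame(
  height = c(170, 165, 180, 158, 175, 162),
  weight = c(70, 60, 85, 52, 78, 58)
)

# xlab sets the x axis label and main sets the plot title
h <- hist(my.data$height, xlab = "Height", main = "Histogram of height")

# hist() also returns the bin counts, which can be inspected directly
h$counts
```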

Drawing normal QQ plots

qqnorm(my.data$weight)
qqline(my.data$weight)

Drawing scatterplots

Optional exercises: try these. Do you understand the second plot?

plot(height~weight,data=my.data)

plot(height~weight,data=my.data,col=as.numeric(as.factor(gender)))

Drawing boxplots

boxplot(height~gender,data=my.data)

Saving plots

JPEGs

jpeg("boxplot.jpg")
boxplot(height~gender,data=my.data)
dev.off()

PDFs

pdf("boxplot.pdf")
boxplot(height~gender,data=my.data)
dev.off()
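The open-device, plot, close-device pattern can be checked end to end with the sketch below. It uses an invented data frame and writes to a temporary file (out.file is a made-up name) so nothing is left in your working directory.

```r
# A made-up stand-in for the tutorial data (values are invented)
my.data <- data.frame(
  height = c(170, 165, 180, 158, 175, 162),
  gender = c("male", "female", "male", "female", "male", "female")
)

# Open a PDF device, draw the plot into it, then close the device
out.file <- tempfile(fileext = ".pdf")
pdf(out.file)
boxplot(height ~ gender, data = my.data)
dev.off()

# The file only becomes complete once dev.off() has been called
file.exists(out.file)
```

Forgetting dev.off() is a common reason for empty or unreadable plot files.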

Summary statistics

Functions Covered

http://www.statmethods.net/index.html

Writing tables
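The content of this slide did not survive in the transcript, so what follows is only a sketch of one common way to write a table, using write.csv() on an invented data frame. It is not necessarily the code the slide showed.

```r
# A made-up stand-in for the tutorial data (values are invented)
my.data <- data.frame(
  height = c(170, 165, 180),
  weight = c(70, 60, 85),
  gender = c("male", "female", "male")
)

# row.names = FALSE stops R adding an extra column of row numbers
out.file <- tempfile(fileext = ".csv")
write.csv(my.data, file = out.file, row.names = FALSE)

# Read the file back in to check what was written
reloaded <- read.csv(out.file)
reloaded
```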

Calculate Mean and SD
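The slide's own code is missing from the transcript. As a sketch under that caveat, mean() and sd() can be applied to a single column, and tapply() is one way to get a mean per group; the data frame below is invented.

```r
# A made-up stand-in for the tutorial data (values are invented)
my.data <- data.frame(
  height = c(170, 165, 180, 158, 175, 162),
  gender = c("male", "female", "male", "female", "male", "female")
)

# Mean and standard deviation of one column
height.mean <- mean(my.data$height)
height.sd <- sd(my.data$height)

# Mean height within each gender group
group.means <- tapply(my.data$height, my.data$gender, mean)
group.means
```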

Correlate phenotypes and test for group differences
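The transcript does not record which tests the slide used. The sketch below assumes a Pearson correlation via cor.test() and a two-sample t-test via t.test(), run on an invented data frame far too small for real inference.

```r
# A made-up stand-in for the tutorial data (values are invented)
my.data <- data.frame(
  height = c(170, 165, 180, 158, 175, 162),
  weight = c(70, 60, 85, 52, 78, 58),
  gender = c("male", "female", "male", "female", "male", "female")
)

# Correlation between two phenotypes, with a test of zero correlation
cor.result <- cor.test(my.data$height, my.data$weight)
cor.result$estimate

# Difference in mean height between the two gender groups
tt <- t.test(height ~ gender, data = my.data)
tt$p.value
```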

It is always important to check model assumptions before making statistical inferences

Linear regression
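No code for this slide survives in the transcript. As a minimal sketch, lm() fits a linear regression from a formula, and a QQ plot of the residuals is one simple check of the normality assumption. The choice of height as the response and the data values are invented for illustration.

```r
# A made-up stand-in for the tutorial data (values are invented)
my.data <- data.frame(
  height = c(170, 165, 180, 158, 175, 162),
  weight = c(70, 60, 85, 52, 78, 58)
)

# Regress height on weight
fit <- lm(height ~ weight, data = my.data)
summary(fit)

# Intercept and slope of the fitted line
coef(fit)

# Check whether the residuals look roughly normal
qqnorm(residuals(fit))
qqline(residuals(fit))
```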