Welcome to lecture 6: An introduction to data analysis with R IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of Chemistry & Biochemistry, UCLA
Mar 30, 2015

Page 1

Welcome to lecture 6: An introduction to data analysis with R

IGERT – Sponsored Bioinformatics Workshop Series
Michael Janis and Max Kopelevich, Ph.D.

Dept. of Chemistry & Biochemistry, UCLA

Page 2

We're moving ahead a bit...

• The majority of the class does some type of microarray analysis
  – Microarray analysis utilizes the same programmatic concepts we've been exploring
• We’ve covered variables, control structures, data structures, and functions in Perl
  – Now we’ll introduce a new language called R
    • particularly suited for numerical analysis
  – We’ll learn about exploratory data analysis
  – We’ll write our own functionality in this new language

Page 3

R (A data-structure focused introduction)

– Every R intro I’ve read tends towards statistical tools first, with data structures / programmatic concepts as an afterthought

I don’t think this is the best way to learn a language
– After all, we’ll be doing complex data analysis
– Going beyond one-off biostatistics learning

• I’m going to introduce R the same way I introduced Perl

• I hope to show you that R is just as friendly…

Page 4

What is R?

From the R-project webpage (www.r-project.org):

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

Page 5

Where to get R?

Available for a wide variety of platforms ...

(handled well in Windows too!)

http://www.r-project.org

Libraries available via CRAN (like CPAN we used before)

The “bioperl” version of R is “bioconductor”

- used extensively for routine and experimental microarray analysis

Page 6

Let’s begin

The default mode for R is an interactive CLI (with emacs keybindings!)
A query-response line mode

– An overgrown calculator
  • R evaluates commands through function calls
  • > 2+2
  • [1] 4
– A programming language
  • Complex data structures
  • Control blocks
  • Objects!
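
The "evaluates commands through function calls" point can be made concrete: even arithmetic operators are ordinary functions in R. A minimal sketch (the variable names are only illustrative):

```r
# An infix expression and its explicit function-call form give the same result
a <- 2 + 2
b <- `+`(2, 2)                   # the operator called like any other function

# Because operators are functions, they can be passed to other functions
doubled <- sapply(1:3, `*`, 2)   # multiplies each of 1, 2, 3 by 2
```

Typing `a`, `b`, or `doubled` at the prompt prints the stored value, just as with `2+2` above.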

Page 7

Symbolic variables

• Variable assignment is handled via arrow notation <-
• Variables can be examined by simply typing the variable name
• The index of the first element on each output line is given in brackets, e.g. [1]
• Scalar elements can be numerical, character, or boolean

> x <- 2
> x
[1] 2
> x + x
[1] 4
> x <- "ACTCGATCGACT"
> x
[1] "ACTCGATCGACT"
> x <- T
> x
[1] TRUE

Page 8

Vectors

• R handles vectors as single objects
• R has several vector types; the three we will use most are:
  – numerical vectors
  – character vectors
  – logical vectors
• Vectors are created (and treated) as concatenations of scalar elements:

> x <- c(1,2,3,4,5)
> x
[1] 1 2 3 4 5
> x <- c("ACT","TCA","GGA","CCG")
> x
[1] "ACT" "TCA" "GGA" "CCG"
> x <- c(T,T,F,T)
> x
[1]  TRUE  TRUE FALSE  TRUE

Page 9

Vector element access

• Very similar to Perl array element access
• Access by index
  – The index itself can be a vector (numerical or logical)
  – Can be an expression
  – Negative indices denote exclusion

> x <- seq(1,20,2)
> x[5]
[1] 9
> x[c(1,3,4)]
[1] 1 5 7
> x[x>10]
[1] 11 13 15 17 19
> x[c(-1,-2,-3,-4,-5)]
[1] 11 13 15 17 19
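
One difference from Perl is worth a short sketch: R indices start at 1, and a negative index excludes rather than counting from the end. The variable names below are only illustrative:

```r
x <- seq(1, 20, 2)          # 1 3 5 ... 19

first <- x[1]               # R indices start at 1, not 0 as in Perl
last  <- x[length(x)]       # last element (x[-1] would DROP the first element instead)
none  <- x[0]               # index 0 selects nothing: a zero-length vector

# A comparison yields a logical vector, which can itself be used as the index
is_big <- x > 10
big    <- x[is_big]
```

This is why `x[x>10]` works: the expression inside the brackets is evaluated first, and its logical result picks the elements.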

Page 10

Vector functions: seq (“sequence”)

Creates a range of values in a vector

> x <- seq(1,10,1)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> x <- 4:12
> x
[1]  4  5  6  7  8  9 10 11 12
> x <- LETTERS[1:3]
> x
[1] "A" "B" "C"

Page 11

Vector functions: rep (“replicate”)

Generates repeated values

- Can be used to generate complex patterns

- Can be used to generate data grouping codes

> x<-c(10,100,1000)

> rep(x,3)

[1] 10 100 1000 10 100 1000 10 100 1000

> rep(x,1:3)

[1] 10 100 100 1000 1000 1000

> rep(1:2,c(5,10))

[1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
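
As a sketch of the "data grouping codes" use, assuming a made-up design of three control and three treated samples (the labels are hypothetical):

```r
# Six samples: three controls followed by three treated
group_codes <- rep(c("control", "treated"), each = 3)

# Alternating pattern instead: control, treated, control, ...
alternating <- rep(c("control", "treated"), times = 3)

# Grouping codes are usually stored as a factor
groups <- factor(group_codes)
counts <- table(groups)      # number of samples per group
```

The `each` and `times` arguments are part of base R's rep(); which one fits depends on how the samples are ordered in the data.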

Page 12

Vector functions: sort

Returns a sorted copy of a vector (the original is not modified)

> x<-c(10000,10,1000)

> sort(x)

[1] 10 1000 10000
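
Note that sort() hands back a new vector and leaves x untouched unless the result is assigned. A small sketch (variable names are illustrative):

```r
x <- c(10000, 10, 1000)

sorted     <- sort(x)                       # ascending copy
descending <- sort(x, decreasing = TRUE)    # descending copy
positions  <- order(x)                      # indices that would sort x

# x keeps its original order unless the result is assigned back, e.g. x <- sort(x)
x_unchanged <- identical(x, c(10000, 10, 1000))
```

order() is handy when one vector should be sorted according to another, since `x[order(x)]` is the same as `sort(x)`.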

Page 13

Vector functions: factor

Grouping for categorical data

> x <- c(0,1,2,1,2)
> fx <- factor(x,levels=0:2)
> levels(fx) <- c("low","middle","grande")
> fx
[1] low    middle grande middle grande
Levels: low middle grande

Page 14

Matrices

• Simply two-dimensional arrays (arrays in general can be n-dimensional)
  – in R, most everything is an array
• Elements can be of any type
• Can dynamically set and change dimensions
  – by default, a matrix is filled by columns

> x <- seq(1,12)
> dim(x) <- c(3,4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> matrix(1:12,nrow=3,byrow=T)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

Page 15

Matrix functions: t (“transpose”)

Swaps rows and columns

> x <- matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> t(x)
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

Page 16

Matrix functions: rownames

Assigns names to the row indices (like hash keys)

> x <- matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> rownames(x) <- c("one","two","three")
> x
      [,1] [,2] [,3] [,4]
one      1    2    3    4
two      5    6    7    8
three    9   10   11   12

Page 17

Matrix functions: colnames

Assigns names to the column indices (like hash keys)

> x <- matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> colnames(x) <- c("one","two","three","four")
> x
     one two three four
[1,]   1   2     3    4
[2,]   5   6     7    8
[3,]   9  10    11   12
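
The hash comparison pays off once names are used for lookup. A minimal sketch using the same matrix with both sets of names assigned (the result variable names are illustrative):

```r
x <- matrix(1:12, nrow = 3, byrow = TRUE)
rownames(x) <- c("one", "two", "three")
colnames(x) <- c("one", "two", "three", "four")

row_two   <- x["two", ]            # whole row, returned as a named vector
cell      <- x["two", "three"]     # single element by row and column name
same_cell <- x[2, 3]               # positional access still works
```

Names and positions can be mixed freely, e.g. `x["two", 3]`.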

Page 18

Matrix functions: cbind

Joins (in the agglomerative sense) columns together, like Excel

> x <- matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> cbind(x,c(111,222,333))
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4  111
[2,]    5    6    7    8  222
[3,]    9   10   11   12  333

Page 19

Matrix functions: rbind

Joins (in the agglomerative sense) rows together, like Excel

> x <- matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> rbind(x,c(111,222,333,444))
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]  111  222  333  444

Page 20

Object functions: list

Combines collections into composite objects

- Objects are treated as vectors in R, plus methods
- Matrices are collections of vectors

> x <- matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> y <- c(1:5)
> z <- list(matrix=x,vector=y)
> z
$matrix
     [,1] [,2] [,3] [,4]
...

$vector
[1] 1 2 3 4 5

Page 21

list (object) functions: indexing

Since a list is itself a (generic) vector, we can index into its elements

> z$vector
[1] 1 2 3 4 5
> z$vector[5]
[1] 5
> z$matrix[1,3]
[1] 3
> z$vector[z$vector>3]
[1] 4 5
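
Besides `$`, list elements can be reached with double brackets. A short sketch of the difference between `[[ ]]` and `[ ]`, rebuilding the same list (result variable names are illustrative):

```r
z <- list(matrix = matrix(1:12, nrow = 3, byrow = TRUE), vector = 1:5)

by_name     <- z$vector        # element by name
by_name2    <- z[["vector"]]   # the same element; useful when the name is held in a variable
by_position <- z[[2]]          # the same element by position

sub_list <- z["vector"]        # single brackets return a LIST of length one, not the vector
```

A common stumbling block is calling a numeric function on `z["vector"]` and getting an error because it is still a list; `z[["vector"]]` is usually what is wanted.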

Page 22

list (object) functions: data.frame

If the vectors are the same length, we can agglomerate them in a special data matrix

- the data is paired, and has unique row names

> x <- c(1:5)
> y <- c(6:10)
> z <- data.frame(x,y)
> z
  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10

Page 23

Reading data from files: data.frame / read.table / read.csv / read.delim

> myData <- read.table("example.txt", header=T)
> myData
  field_one field_two
1      10.3      1.08
2      11.2      0.97
3         …         …

– The data frame is ideal for handling delimited files
  • read.csv and read.delim assume a header is present
    – (read.table assumes one only when the header line has n-1 entries; the first column then supplies the row names)
• Can handle a wide variety of interfaces with outputs
  – Tab, comma delimited txt files
  – SPSS, SAS, Stata, Minitab, S-PLUS v3 files (via the foreign package)
  – Works well with DB interface calls as well
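
A self-contained way to try this without an existing example.txt is to write a tiny tab-delimited file first. This is only a sketch; the temporary file and its two rows are made up for illustration:

```r
# Write a small tab-delimited file to a temporary location (hypothetical data)
example_file <- tempfile(fileext = ".txt")
writeLines(c("field_one\tfield_two",
             "10.3\t1.08",
             "11.2\t0.97"),
           example_file)

# header = TRUE tells read.table() to take column names from the first line
myData <- read.table(example_file, header = TRUE, sep = "\t")

# read.delim() is read.table() with tab-delimited defaults (header = TRUE, sep = "\t")
myData2 <- read.delim(example_file)

column_types <- sapply(myData, class)   # both columns are read as numeric
```

Checking `str(myData)` right after reading is a good habit, since a stray character in a numeric column changes the column type.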

Page 24

Persistence: save.image() / .RData / ls()

– The workspace is dynamic
  • Variables and functions are created or loaded

– objects() or ls() shows availability of both

– Can be saved to a local .RData file using save.image()

– .RData loaded by default upon startup

– Can specify the .RData (or whatever you name it) workspace using load() (may have to specify pathname!)
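
A minimal round trip, sketched with save() on a single object so it does not touch your real workspace; the temporary file name and the `scores` vector are made up for illustration:

```r
ws_file <- tempfile(fileext = ".RData")   # hypothetical location for the saved objects

scores <- c(10.3, 11.2, 9.8)
save(scores, file = ws_file)    # save.image(file = ws_file) would save the whole workspace instead

rm(scores)                      # the object is gone from the workspace...
gone <- !exists("scores")

load(ws_file)                   # ...and restored, under its original name, by load()
```

load() silently overwrites any existing object of the same name, so it is worth running ls() first in a workspace you care about.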

Page 25

data frame (object) functions: subset

Allows extraction of a portion of a data frame

> x <- c(1:5)
> y <- c(6:10)
> z <- data.frame(x,y)
> subset(z,x>2)
  x  y
3 3  8
4 4  9
5 5 10

Page 26

data frame (object) functions: transform

Allows extension of a data frame

> x <- c(1:5)
> y <- c(6:10)
> z <- data.frame(x,y)
> transform(z,x.log=log(x))
  x  y     x.log
1 1  6 0.0000000
2 2  7 0.6931472
3 3  8 1.0986123
4 4  9 1.3862944
5 5 10 1.6094379

Page 27

data frame (object) functions: split

Splits a vector into a list of vectors according to group

> x <- c(1:5)
> y <- c(6:10)
> z <- data.frame(x,y)
> h <- split(z$y,z$x)
> h
$`1`
[1] 6

$`2`
[1] 7

$`3`
[1] 8

$`4`
[1] 9

$`5`
[1] 10

Page 28

data frame (object) functions: lapply

Implicit looping over group members

> x <- c(1:5)
> y <- c(6:10)
> z <- data.frame(x,y)
> lapply(z, mean)
$x
[1] 3

$y
[1] 8
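
split and lapply are most useful together, when each group holds more than one value. A sketch with made-up measurements for two hypothetical groups:

```r
# Hypothetical measurements for two groups of three samples each
values <- c(1.0, 2.0, 3.0, 10.0, 20.0, 30.0)
groups <- rep(c("a", "b"), each = 3)

by_group <- split(values, groups)        # list with one vector per group

means_list <- lapply(by_group, mean)     # always returns a list
means_vec  <- sapply(by_group, mean)     # simplifies to a named numeric vector
```

sapply() is the same implicit loop as lapply(); it just tries to collapse the result into a vector or matrix when every element has the same length.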

Page 29

Functions in R: very similar to what we've seen in Perl!

Blocks are the same

- Takes arguments
- Uses control structures (for, if, while loops, ...)
- Returns the value of the last expression evaluated

> x <- c(1:5)
> my.function <- function(x)
+ {
+   u <- mean(x)
+   u
+ }
> y <- my.function(x)
> y
[1] 3

Page 30

Control structures: for loop

Loops over a set range

> myfunction <- function(x) {
    for (i in 1:10) {
      # do something here
    }
  }

The variable i will take the values of the sequence in turn

The range is specified by the sequence

Page 31

A stupid function example

Just to illustrate passing args back and forth…

> myfun <- function(x)
+ {
+   X <- x
+   for (i in 1:10)
+   {
+     X <- c(X,i)
+   }
+   X
+ }
> myfun(0)
 [1]  0  1  2  3  4  5  6  7  8  9 10

Page 32

A better function example

A function to calculate the two sample t-statistic, showing “all the steps”. (From http://cran.r-project.org/doc/manuals/R-intro.html)

> twosam <- function(y1, y2) {
    n1  <- length(y1); n2  <- length(y2)
    yb1 <- mean(y1);   yb2 <- mean(y2)
    s1  <- var(y1);    s2  <- var(y2)
    s   <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)
    tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2))
    tst
  }

With this function defined, you could perform two sample t-tests using a call such as:

> tstat <- twosam(data$male, data$female); tstat
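
One way to convince yourself that twosam() does what it claims is to compare it with R's built-in t.test() using pooled variance. This is a sketch on simulated data; the `male` and `female` vectors and their means and spreads are invented for illustration:

```r
# twosam() as defined above (from the R-intro manual)
twosam <- function(y1, y2) {
  n1 <- length(y1); n2 <- length(y2)
  yb1 <- mean(y1);  yb2 <- mean(y2)
  s1 <- var(y1);    s2 <- var(y2)
  s <- ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)   # pooled variance
  (yb1 - yb2) / sqrt(s * (1 / n1 + 1 / n2))
}

set.seed(1)                                  # simulated data, for illustration only
male   <- rnorm(20, mean = 175, sd = 7)
female <- rnorm(25, mean = 165, sd = 6)

tstat   <- twosam(male, female)
builtin <- unname(t.test(male, female, var.equal = TRUE)$statistic)
```

The two values agree because `var.equal = TRUE` makes t.test() use the same pooled-variance formula; its default (Welch) test would give a slightly different statistic.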

Page 33

Control structures: while loop

Loops while an evaluation returns boolean TRUE

> myfunction <- function(x) {
    while (x > 10) {
      x <- x - 1   # do something here
    }
  }

The evaluation is tested at the beginning of the loop;
note that in this case, the block may never be executed

Page 34

Control structures: repeat loop

Loops until told to stop by break

> myfunction <- function(x) {
    repeat {
      x <- x + 1   # do something here
      if (x > 10) break
    }
  }

Uses a conditional if statement;
break is called as soon as the boolean evaluation is TRUE, and the block is exited
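
To see the three loop forms side by side, here is the same small task (summing the integers 1 to 10) written each way. A sketch only; the variable names are illustrative:

```r
# for: the sequence fixes the number of passes
total_for <- 0
for (i in 1:10) {
  total_for <- total_for + i
}

# while: condition checked BEFORE each pass
total_while <- 0
i <- 1
while (i <= 10) {
  total_while <- total_while + i
  i <- i + 1
}

# repeat: no condition of its own, so the body must break explicitly
total_repeat <- 0
i <- 1
repeat {
  total_repeat <- total_repeat + i
  i <- i + 1
  if (i > 10) break
}
```

In practice a vectorized call such as `sum(1:10)` replaces all three; explicit loops are for cases where each pass depends on the previous one.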

Page 35

Descriptive statistics: summary()

Summary statistics related to a numeric variable

> x <- rnorm(100)
> summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-3.08800 -0.70850 -0.11480 -0.04413  0.76510  2.89100

Page 36

Descriptive statistics: plot()

Simple x vs. y diagram (scatterplot)

> x <- rnorm(100)
> y <- rnorm(100)
> plot(x,y)
> plot(rnorm(500))
> lines(rnorm(500))

Page 37

Descriptive statistics heatmap generation (image)

Scatterplot grid color weighted by intensities…

- very useful for microarray analysis (we’ll see next time…)

- can be used with dendrogram generation

Page 38

Statistics of populations

The equations so far are for sample statistics
– a statistic is a single number estimated from a sample

We use the sample to make inferences about the population.

• a parameter is a single number that summarizes some quality of a variable in a population.
• the term for the population mean is μ (mu), and Ybar is a sample estimator of μ.
• the term for the population standard deviation is σ (sigma), and s is a sample estimator of σ.

Note that μ and σ are both parameters of the normal probability curve.

Source: http://www.bsos.umd.edu/socy/smartin/601/

Page 39

Measuring probabilities under the normal curve

• We can make transformations by scaling everything with respect to the mean and standard deviation.

• Let z = the number of standard deviations above or below the population mean, i.e. z = (y - μ) / σ.
  – z = 0: y = μ
  – z = 1: y = μ +/- σ (p=0.68)
  – z = 2: y = μ +/- 2σ (p=0.95)
  – z = 3: y = μ +/- 3σ (p=0.997)
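
These probabilities can be checked directly with pnorm(), R's cumulative normal distribution function. A small sketch; `prob_within` is a helper name made up here, not a built-in:

```r
# Probability mass within z standard deviations of the mean of a normal curve
prob_within <- function(z) pnorm(z) - pnorm(-z)

p1 <- prob_within(1)    # about 0.68
p2 <- prob_within(2)    # about 0.95
p3 <- prob_within(3)    # about 0.997

# Standardizing an observation: a made-up value y from a population with mean mu and sd sigma
mu <- 100; sigma <- 15; y <- 130
z  <- (y - mu) / sigma  # y lies 2 standard deviations above the mean
```

Because of the scaling, the same three probabilities apply to any normal curve, whatever its μ and σ.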

Page 40

Plotting using hist() and curve()

> y <- hist(x,plot=F)
> ylim <- range(0,y$density,dnorm(0))
> hist(x,freq=F,ylim=ylim)
> curve(dnorm(x),add=T)

Page 41

Difficult to integrate… but probabilities have been mapped out for this curve. Transformations from other curves are possible…

Page 42

Plotting using qqnorm()

> qqnorm(x)

Page 43

Box plots (box and whiskers plots, Tukey, 1977)

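[Figure: annotated box plot. Labels, from top to bottom:]
– Outliers
– Upper fence / whisker: min(Q3 + 1.5(IQR), largest X)
– Q3
– Median
– Q1
– Lower fence / whisker: max(Q1 - 1.5(IQR), smallest X)
– IQR = Q3 - Q1 (the height of the box)

Plotting using boxplot()

> boxplot(x)
> boxplot(log(x))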
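
The fence formulas can be worked through by hand with quantile() and IQR(). A sketch on made-up data with one extreme value; note that boxplot() itself computes the box from Tukey's hinges, which can differ slightly from quantile() on some data sets:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)    # made-up data with one extreme value

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- IQR(x)                            # same as q3 - q1

upper_fence <- min(q3 + 1.5 * iqr, max(x))
lower_fence <- max(q1 - 1.5 * iqr, min(x))

# Points beyond 1.5 IQR from the box are flagged as outliers
outliers <- x[x > q3 + 1.5 * iqr | x < q1 - 1.5 * iqr]
```

Here only the value 50 falls outside the fences, so it would be drawn as a separate point above the upper whisker.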

Page 44

My advice

First learn to program in R. Then use the R libraries.

Everything in R can be built up piecewise

– The data is made of component parts
  • It’s extremely useful to know how to handle the objects
– The graphics are made of component parts
  • This allows extreme fine-tuning of your visualization!
  • Go beyond scatterplots and barplots to describe complex data well and visualize hidden trends

• A good reference is Data Visualization by Edward Tufte.

Page 45

Homework

A simple problem, but one we may use frequently

Use lapply (or sapply) to simulate the result of taking the mean of 100 random numbers from the normal distribution for 10 independent samples.