BIOINFORMATICS LAB Episode V – Introduction to R Federico M. Giorgi, PhD Department of Pharmacy and Biotechnology First Cycle Degree in Genomics

Oct 12, 2020


BIOINFORMATICS LAB

Episode V – Introduction to R

Federico M. Giorgi, PhD

Department of Pharmacy and Biotechnology

First Cycle Degree in Genomics


The R programming Language


• R was created in the early 1990s by Ross Ihaka and Robert Gentleman (version 1.0.0 was released in 2000)

• It is Free Forever

• Open source alternative to commercial statistical software (Excel, Matlab, S, SAS, SPSS)

• R is a procedural, interpreted language (relatively slow), but it can call code written in other languages (C, C++, Fortran, Java, Python)

• It runs on every Operating System

Intro to R

[Figure: R shown together with C, C++, Java, Fortran and Python]


1. Basic Statistical Analysis (e.g. Average, SD, T-test, Wilcoxon test)

2. I/O operations (load dataset, modify it, print to Excel or text file)

3. Graphics/Data Visualization

4. Higher Stats (Machine Learning)

5. Document Generation (HTML, PDF, md)

6. Web Sites

Purposes of R


An R website

https://shiny.rstudio.com/gallery/movie-explorer.html


Another R website

https://shiny.rstudio.com/gallery/superzip-example.html


A Genomics R website

https://shiny.rstudio.com/gallery/genome-browser.html


1. Find the top grossing ($$$) movie with Harrison Ford

2. Find the highest Tomato Meter movie of the period 2000-2005 which won at least one Oscar

3. Find the top grossing Horror movie of all time

4. Find the longest (minutes) movie of all time which also won at least 1 Oscar

5. Find the most densely populated area of the U.S.A. (people/square mile)

6. Find the ZIP code of the northernmost town in the U.S.A.

7. From the Pancreatic Cancer Genome Browser, one of the donors has a lot of genomic rearrangements in Chromosome 12. Which one?

8. For each donor, find the genomic region with the highest number of transcripts affected by splice variants

Exercises!


Running R (the old Way)

Reminders:

• userid student

• password 4genomics4

Terminal

$ cd

$ pwd

/home/student

$ R


Running R (the old way)

►The R “Hello World”

message("Hello, World!")

►The R graphics “Hello World”

demo(graphics)


Some Basic R commands

►Assign a value to a variable

a<-42

a

►Compare variables

if(a==42) {

message("The variable is 42")

} else {

message("Goodbye")

q(save="no")

}

►Boolean comparisons > < >= <= ==

a>10 # returns TRUE


Let’s open RStudio!

Reminders:

• userid student

• password 4genomics4

RStudio


• IDE: Integrated Development Environment

– What's the equivalent for Python?

• Create a New File (File->New File->R script or SHIFT+CTRL+N)

RStudio


The Golden Programming Rule

• CTRL-S (create a "code" folder and pick any file name, such as 001_intro.R)

• Prevent data loss

• Organize your work, make it REPRODUCIBLE

• At home, Keep it backed-up (OneDrive, GoogleDrive, Dropbox, iCloud, Github, Sourceforge)

• CTRL-S every time you change something

• The code is your life

• If you see an asterisk, your code is NOT saved and you will NOT graduate


RStudio

• The Code window


RStudio

• R Console / Terminal window


• The Environment/History window

RStudio


RStudio

• The Files/Plot/Help window


• Type code at the top

• CTRL+ENTER executes the highlighted code (or click on Run)

RStudio


Basic Functions with RStudio

►Assign a numeric value to a variable

x<-5

x

►Assign a character value to a variable

y<-"jack"

y

►Assign a boolean value to a variable

z<-FALSE

z

►Concatenate

xyz<-c(x,y,z)

xyz
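A vector holds a single data type, so concatenating a number, a string and a boolean with c() converts all of them to character. A minimal sketch, reusing the x, y and z values from this slide:

```r
x <- 5
y <- "jack"
z <- FALSE

# c() coerces mixed inputs to the most general type (here: character)
xyz <- c(x, y, z)
class(xyz)   # "character"
xyz          # "5" "jack" "FALSE"
```

This is why the numbers in xyz can no longer be used in arithmetic without converting them back.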


Data Type Conversion

► To numeric

as.numeric(y)

as.numeric(z)

as.numeric("42")

► To character

as.character(x)

as.character(z)

►Assign the result to the variable to overwrite it!

x<-as.character(x)

x
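Not every conversion succeeds. The sketch below shows what the conversions on this slide are expected to return; the variable names num_z, num_42 and num_y are introduced here only for illustration:

```r
y <- "jack"
z <- FALSE

num_z  <- as.numeric(z)      # FALSE becomes 0 (TRUE would become 1)
num_42 <- as.numeric("42")   # a string holding a number becomes 42

# "jack" is not a number: the result is NA and R raises a warning,
# which is silenced here with suppressWarnings()
num_y <- suppressWarnings(as.numeric(y))

is.na(num_y)   # TRUE
```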


More R commands

►Simple Operations

5+4

►Less Simple Operations

sqrt(144)

2^4

log10(1000)

►Vectors and Vectors operation

vec<-c(3,12,15,20)

mean(vec)

sd(vec)
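To see what mean() and sd() compute, the same numbers can be worked out by hand. This is only a sketch; vec_mean and vec_sd are names made up for the example:

```r
vec <- c(3, 12, 15, 20)
n <- length(vec)

vec_mean <- sum(vec) / n                              # 12.5
# The sample standard deviation divides by n - 1, not by n
vec_sd <- sqrt(sum((vec - vec_mean)^2) / (n - 1))     # sqrt(51)

all.equal(vec_mean, mean(vec))   # TRUE
all.equal(vec_sd, sd(vec))       # TRUE
```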


R Code Aesthetics

►Comment Lines

# This is a comment line

a<-5

# a<-4

a<-a+1 # I can write comments also here

►Change Template

Tools->Global Options->Appearance->Editor Theme->Cobalt->Apply


R Quality of Life

► Show all objects in workspace

ls()

► Show current work directory

getwd()

► Remove an object from workspace

rm(a)

► Remove ALL objects from workspace

rm(list=ls())

► Change current work directory

setwd(".")

► Help!

?rnorm


The Normal Distribution

►Generate 1000 random values from a normal dist

gauss<-rnorm(1000)

►Calculate the mean

mean(gauss)

►Calculate the standard deviation

sd(gauss)

►Plot the values

plot(gauss)

►Plot the density of the values

plot(density(gauss))


Plotting Stuff

►Generate a Vector

gauss<-rnorm(1000)

►Plot the values

plot(density(gauss))

hist(gauss)

►Function Arguments

hist(gauss,xlab="My Numbers",ylab="How Often")

hist(gauss,n=30)

hist(gauss,n=30,col="orange")


Plotting Multiple Stuff

►Generate two Vectors

gauss1<-rnorm(1000,mean=1,sd=1)

gauss2<-rnorm(1000,mean=1,sd=3)

►Plot the density of both

plot(density(gauss1),lwd=1,main="Gauss Title")

lines(density(gauss2),lwd=3,lty=2)

►Give a Legend

legend("topright",legend=c("SD1","SD3"),lwd=c(1,3),lty=c(1,2))

►Add a grid

grid(col="black",lwd=0.5,lty=3)

► Increase plot area with xlim

plot(density(gauss1),xlim=c(-3,4))

lines(density(gauss2))


Recap of some plotting parameters

• col defines the plot track colors

– number (1 is black, 2 is red and so on)

– name ("black", "red", "cornflowerblue", "magenta")

• lty defines the line type

– 1 is solid line, 2 is dashed line, and so on

• lwd defines the line width

– 1 is standard, 2 is twice the standard, and so on

• More: main, xlab, ylab, xlim, ylim


Exercises!

• Try to reproduce every detail of this plot


Solution

# Exercise

set.seed(1) # sets the random seed to make the example fully reproducible

# Generate vectors

gauss1<-rnorm(1000,mean=0,sd=0.5)

gauss2<-rnorm(1000,mean=0,sd=1)

gauss3<-rnorm(1000,mean=3,sd=1)

gauss4<-rnorm(1000,mean=1,sd=3)

# Plot them

plot(density(gauss1),lwd=3,lty=1,col="black",main="Normal Distributions",xlab="Value",ylab="Frequency",xlim=c(-5,5))

lines(density(gauss2),lwd=3,lty=2,col="red")

lines(density(gauss3),lwd=3,lty=3,col="blue")

lines(density(gauss4),lwd=3,lty=4,col="green")

grid(col="grey",lty=2)

legend("topleft",col=c("black","red","blue","green"),lty=c(1,2,3,4),lwd=3,

legend=c("Mean=0,SD=0.5","Mean=0,SD=1","Mean=3,SD=1","Mean=1,SD=3")

)


Increasing Dimensions

►Point

1

►Vector

c(2,3,4,5,6,7)

c(2:7)

c(7:2,15,0:3/10)

►Matrix

matrix("empty",nrow=10,ncol=5)

matrix(rnorm(50),nrow=10,ncol=5)

►Array

array(1:24,dim=c(4,3,2))
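A quick way to tell these objects apart is to ask for their length and dimensions. The sketch below reuses the slide's examples; the object names v, m and a are chosen here for illustration:

```r
v <- c(7:2, 15, 0:3/10)   # ":" binds tighter than "/", so 0:3/10 is (0:3)/10
m <- matrix(rnorm(50), nrow = 10, ncol = 5)
a <- array(1:24, dim = c(4, 3, 2))

length(v)    # 11 elements: 7 down to 2, then 15, then 0, 0.1, 0.2, 0.3
dim(v)       # NULL: a plain vector has no dimensions
dim(m)       # 10 5
dim(a)       # 4 3 2
a[2, 3, 1]   # row 2, column 3 of the first layer
```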


Loops

►For loop

v<-(1:5)^2

length(v)

for (i in 1:length(v)){

message("Row:",i," value:", v[i])

}

►Another example

v<-1:5

v2<-v

for (i in 1:length(v)){

v2[i]<-v[i]+10

}

v2
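Many loops over a vector can be replaced by a single vectorized operation. The sketch below compares both approaches; v3 is a name introduced for the comparison, and seq_along(v) is used in place of 1:length(v) because it also behaves correctly when v is empty:

```r
v <- 1:5

# Loop version
v2 <- numeric(length(v))
for (i in seq_along(v)) {
  v2[i] <- v[i] + 10
}

# Vectorized version: the addition is applied to every element at once
v3 <- v + 10

identical(v2, v3)   # TRUE
```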


Extracting Matrix Portions

►Row name and Column names

m<-matrix(rnorm(50),nrow=10,ncol=5)

rownames(m)<-LETTERS[1:10]

colnames(m)<-LETTERS[11:15]

m

View(m) # This is dangerous for big matrices (>10k cells)

►Nested Loop within a matrix

for(i in 1:nrow(m)){

for(j in 1:ncol(m)){

message(i,",",j," ",m[i,j])

}

}


Extracting Matrix Portions (2)

►Getting an entire row

m[1,]

m["E",]

►Getting an entire column

m[,5]

m[,"O"]

►Combine pieces

cbind(m[,5],m[,2])

rbind(m[2:3,],m[c(1,4),])

►Size commands

nrow(m)

ncol(m)
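It helps to check what shape each extraction returns. A minimal sketch, assuming the same 10 x 5 matrix with letter names as on the previous slide; the result names (row_e, one_row, two_col, four_row) are made up for the example:

```r
set.seed(1)
m <- matrix(rnorm(50), nrow = 10, ncol = 5)
rownames(m) <- LETTERS[1:10]
colnames(m) <- LETTERS[11:15]

row_e    <- m["E", ]                        # named vector of length 5
one_row  <- m["E", , drop = FALSE]          # drop = FALSE keeps a 1 x 5 matrix
two_col  <- cbind(m[, 5], m[, 2])           # 10 x 2 matrix
four_row <- rbind(m[2:3, ], m[c(1, 4), ])   # 4 x 5 matrix, rows B, C, A, D

dim(one_row)
dim(two_col)
rownames(four_row)
```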


Exercises!

• Generate a matrix with random values

– 1000 rows, 20 columns

– No need to set rownames() and colnames()

• Calculate the mean and standard deviation of each row

• Generate a second matrix where each value of the first matrix is squared

• Plot the distributions of the values of both matrices


Solution

# Exercise to calculate mean and standard deviation of each row

set.seed(1)

m<-matrix(rnorm(20*1000),nrow=1000,ncol=20)

for(i in 1:nrow(m)){

thisrow<-m[i,]

imean<-mean(thisrow)

message("Mean of row ",i,": ",imean)

isd<-sd(thisrow)

message("Standard Deviation of row ",i,": ",isd)

}


Solution

# Exercise to square each value

set.seed(1)

m<-matrix(rnorm(20*1000),nrow=1000,ncol=20)

m2<-matrix(NA,nrow=1000,ncol=20)

for(i in 1:nrow(m)){

for(j in 1:ncol(m)){

m2[i,j]<-m[i,j]^2

}

}

m2

# Plot values

plot(density(m2),col="salmon",lwd=3,lty=2)

lines(density(m),col="navy",lwd=3,lty=1)


apply

►Calculating mean of each row of a big matrix:

m<-matrix(rnorm(100*2000),nrow=2000,ncol=100)

row_means<-c() # Empty vector

for(i in 1:nrow(m)){

newmean<-mean(m[i,])

row_means<-c(row_means,newmean)

}

row_means[1:10] # Print the first 10 row means

►A faster and more elegant way to do the same is to use apply

row_means<-apply(m,1,mean)

row_means[1:10]

column_sds<-apply(m,2,sd)

apply(matrix,dimension,function_name)
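The second argument of apply selects the dimension: 1 for rows, 2 for columns. As a sanity check, the sketch below compares apply with the base R shortcut rowMeans(); the names row_means_apply, row_means_fast and col_sds are chosen here for illustration:

```r
set.seed(1)
m <- matrix(rnorm(100 * 2000), nrow = 2000, ncol = 100)

row_means_apply <- apply(m, 1, mean)   # 1 = one result per row
col_sds         <- apply(m, 2, sd)     # 2 = one result per column

# rowMeans() is a base R shortcut for the row-wise mean
row_means_fast <- rowMeans(m)

all.equal(row_means_apply, row_means_fast)   # TRUE
length(col_sds)                              # 100
```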


Functions in R

►R function structure

function_name<-function(input1,input2){

command1

command2

return(output)

}

►A custom-made function to calculate the mean

mymean<-function(input1){

res<-sum(input1)

res<-res/length(input1)

return(res)

}

mymean(1:10)
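Functions can also take optional arguments with default values. The sketch below is a hypothetical variant of mymean (the name mymean2 and the remove_na argument are not part of the slides) that can skip missing values:

```r
# mymean2 and remove_na are hypothetical names used only for this example
mymean2 <- function(input1, remove_na = FALSE) {
  if (remove_na) {
    input1 <- input1[!is.na(input1)]
  }
  sum(input1) / length(input1)   # the last evaluated value is returned
}

mymean2(1:10)                               # 5.5
mymean2(c(1, 2, NA, 3))                     # NA
mymean2(c(1, 2, NA, 3), remove_na = TRUE)   # 2
```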


Plotting Multiple Variables

►We have two vectors of the same length

x<-runif(100,min=1830,max=1900)

y<-x/mean(x)+rnorm(100,sd=0.01)+60

plot(x,y) # a scatterplot


Plotting Multiple Variables

►Beautify the Scatterplot

plot(x,y,cex=4)

plot(x,y,pch=20)

plot(x,y,pch=20,xlab="Year",ylab="Weight", main="Queen Victoria",col="red")

►Regression line

lm1<-lm(y~x)

lm1

abline(lm1)


Exercises!

• Write a function to calculate the Geometric Mean of a set of numbers

– Hint: ?prod and the exponent ^

• Generate a random matrix (using rnorm with mean=1) with 300 rows and

200 columns

• Using apply, calculate the mean and geometric mean of each row

• Compare the calculated means and geometric means using a

scatterplot

• Run a red regression line through that scatterplot

• Repeat the scatterplot using rnorm with mean=0, 1, 10 and 100


Solutions

# Geometric mean function

geomean<-function(a){

product<-prod(a)

exponent<-1/length(a)

res<-product^exponent # the n-th root of the product

return(res)

}

geomean(1:10)

# Arithmetic vs Geometric mean

variable<-1 # Try 0, 1, 10, 100, 1000

m<-matrix(rnorm(300*200,mean=variable),nrow=300,ncol=200)

means<-apply(m,1,mean)

geomeans<-apply(m,1,geomean)

plot(means,geomeans,pch=20,xlab="Arithmetic Mean",ylab="Geometric mean",main=paste0("rnorm_mean=",variable))

lm1<-lm(geomeans~means)

abline(lm1,col="red")
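A product of many values can overflow to Inf (or turn negative when some values are below zero), which is one reason the scatterplot changes so much with the rnorm mean. A common alternative, sketched here under the assumption that all values are positive, averages the logarithms instead; geomean_log is a hypothetical name, not part of the slides:

```r
# geomean_log is a hypothetical helper; it assumes all values are positive
geomean_log <- function(a) {
  exp(mean(log(a)))
}

geomean_log(c(1, 10, 100))      # 10

big <- rep(100, 200)
prod(big)^(1 / length(big))     # Inf: the product overflows
geomean_log(big)                # 100
```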


More plotting tricks

►Histogram

hist(rnorm(1000))

hist(runif(1000))

hist(rpois(1000,lambda=1))

►Quantile plots

qqnorm(rpois(1000,lambda=1))

qqnorm(rpois(1000,lambda=5))

qqnorm(rpois(1000,lambda=100))

qqnorm(rnorm(1000))

►Boxplots

x<-rnorm(1000,mean=1)

y<-rnorm(1000,mean=2)

z<-rnorm(1000,mean=2,sd=10)

boxplot(x,y,z)


Saving Plots to a file

►PDF

x<-(1:10)^2

pdf("testplot.pdf")

plot(x)

dev.off()

►PNG

png("testplot.png")

plot(x,type="l")

dev.off()

png("testplot.png",width=2000,height=1000,pointsize=30)

plot(x,type="l")

dev.off()


Remember Linux?

►We can still use it from within R! Using the system function

system("echo hello!")

►Let's get an expression file

command<-"wget https://www.dropbox.com/s/zo6j3cgby2l41q7/geneExpressionV2.csv"

system(command)

►Everything works the same way (except tools requiring interaction)

system("ls")

system("wc -l geneExpressionV2.csv")

system("head -n2 geneExpressionV2.csv")

system("less -S geneExpressionV2.csv")


Input/Output

► On any OS, you can download a file from the internet using download.file()

link<-"https://www.dropbox.com/s/zo6j3cgby2l41q7/geneExpressionV2.csv?dl=1"

download.file(link,destfile="geneExpressionV2.csv")

►Some read functions

mydata<-read.csv("geneExpressionV2.csv") # specific CSV function

head(mydata)

mydata<-read.delim("geneExpressionV2.csv",sep=",")

mydata[1:5,1:5]

►More options

file<-"geneExpressionV2.csv"

mydata<-read.delim(file,sep=",",row.names=1,header=TRUE)

rownames(mydata)

colnames(mydata)

dim(mydata)
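The same read options can be tried without an internet connection. The sketch below writes a tiny made-up table (the gene and sample names are invented for illustration) to a temporary file and reads it back with the options shown above:

```r
# Toy expression table with made-up values, written to a temporary file
toy <- data.frame(S1 = c(5.1, 7.3), S2 = c(4.8, 9.0),
                  row.names = c("GeneA", "GeneB"))
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp)

# Read it back, using the first column as row names
back <- read.delim(tmp, sep = ",", row.names = 1, header = TRUE)
dim(back)         # 2 2
rownames(back)    # "GeneA" "GeneB"

back_matrix <- as.matrix(back)   # numeric matrix instead of a data.frame
unlink(tmp)                      # remove the temporary file
```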


Conditions

► In R, conditional instructions can be used

if(condition1) {

instruction1

}else if(condition2){

instruction2

}else{

instruction3

}

►Example

age<-300

if(age>100) { # age>100 returns TRUE or FALSE

message("You are old")

} else {

message("You are young")

}


Exercises!

• The geneExpressionV2.csv (mydata) is a Gene Expression Dataset

mydata<-read.delim("geneExpressionV2.csv",sep=",",row.names=1,header=TRUE)

mydata<-as.matrix(mydata) # Otherwise it's a data.frame (we will talk about that in the future)

– Already normalized (the higher the number, the more expressed the gene)

– columns 1 to 32 are breast cancer samples

– columns 33 to 64 are healthy (control) samples

• How many genes do you have? (Rows)

• Using a for loop and conditionals, get the highest value in the matrix, plus

its row and column locations

• Print in a nice looking PNG the gene expression profile of

– the gene with the highest Standard Deviation

– the gene with the lowest Standard Deviation


Solutions

# How many genes do you have? (Rows)

?dim

nrow(mydata)

# Using a for loop and conditionals, get the highest value in the matrix, plus its row and column locations

currentmax<- -Inf # start below any possible value, so negative-only data also works

currentmaxrow<-NA

currentmaxcol<-NA

for(i in 1:nrow(mydata)){

for(j in 1:ncol(mydata)){

herevalue<-mydata[i,j]

if(herevalue>currentmax){

currentmax<-herevalue

currentmaxrow<-i

currentmaxcol<-j

}

}

}

message("Max value: ",currentmax)

message("Row: ",currentmaxrow)

message("Row name: ",rownames(mydata)[currentmaxrow])

message("Col: ",currentmaxcol)

message("Col name: ",colnames(mydata)[currentmaxcol])
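Once the loop logic is clear, the same answer can be obtained with built-in functions. The sketch below uses a small random matrix as a stand-in for mydata; the object names toy and pos are made up for the example:

```r
set.seed(1)
toy <- matrix(rnorm(12), nrow = 3, ncol = 4)   # stand-in for mydata
rownames(toy) <- paste0("Gene_", 1:3)
colnames(toy) <- paste0("Sample_", 1:4)

max(toy)                                        # highest value
pos <- which(toy == max(toy), arr.ind = TRUE)   # its row and column index
pos
rownames(toy)[pos[1, "row"]]
colnames(toy)[pos[1, "col"]]
```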


Solutions

# Print in a nice looking PNG the gene expression profile of the genes with the highest and the lowest Standard Deviation

sds<-apply(mydata,1,sd)

mymax<-which.max(sds)

mymin<-which.min(sds)

maxname<-rownames(mydata)[mymax]

minname<-rownames(mydata)[mymin]

png("lineplots.png",width=2000,height=1000,pointsize=40)

plot(mydata[mymax,],type="l",col="red",lwd=3,ylab="Gene Expression")

lines(mydata[mymin,],col="cornflowerblue",lwd=3)

legend("topright",legend=c(maxname,minname),lwd=3,col=c("red","cornflowerblue"))

dev.off()


• R contains some functions from base packages

• There are more R packages out there

• Three main repositories

– CRAN, The Comprehensive R Archive Network (For Standard Statistical Analysis)

– Bioconductor (Focused on Bioinformatics)

– R-Forge (for prototype packages)

Installing Packages


Using the CRAN Task Views

Finding packages in CRAN


https://www.bioconductor.org/packages/release/BiocViews.html

Finding packages in Bioconductor


Installing and Loading Packages

► Installation from CRAN

library("randomForest") # fails with an error if the package is not installed yet

install.packages("randomForest")

library("randomForest")

► Installation from Bioconductor

install.packages("BiocManager") # biocLite() has been replaced by the BiocManager package

BiocManager::install("affy")

► This will also recursively install all dependencies

►Load Library

library("affy")


Package Vignettes

►A "vignette" is common R slang for a colorful R package tutorial

► It should be informative, easy to load and run, with small input data to play with

install.packages("BiocManager") # biocLite() has been replaced by the BiocManager package

BiocManager::install("ComplexHeatmap")

library("ComplexHeatmap")

browseVignettes("ComplexHeatmap")

►You can always ask for help for any function or package:

?Heatmap

??Heatmap


Most Downloaded Bioconductor Packages

The vast majority are for Genome Annotation, RNA-Seq and microarray analysis



Enrichment Analysis

[Figure: genes you get from an analysis (TP53, KRAS, GATA3, RUNX2, MAPK, MYC, MYCN, ESR1, CCND1, CCTNB1) vs. The Entire World of Gene Sets]

Part 1: Over-Representation Analysis


Enrichment Analysis

[Figure: our gene list (TP53, KRAS, GATA3, RUNX2, MAPK, MYC, MYCN, ESR1, CCND1, CCTNB1) vs. List 1, List 2, List 3, ..., List n]

There are at least 17,810 curated gene lists in the MSigDB collection alone

[Figure: Venn diagram of "My list" and "Another list" (e.g. cell cycle genes), with their intersection]

To test whether our gene list is enriched for a particular group of genes, we must perform a test that takes into consideration

• The size of the intersection

• The size of our list

• The size of the compared list

• The total number of lists we are testing (False Discovery Rate)

Part 1: Over-Representation Analysis


Contingency Tables and Fisher's Exact Test

From the comparison, we derive a Contingency Table

                        In the Other Set     Not in the Other Set
In Our Gene Set         Intersection size    Ours-only size
Not in Our Gene Set     Other-only size      Rest of the genome (everything else that is there)

From a Contingency Table we can run a Fisher's Exact Test to determine the significance of the intersection.

Devised by Sir Ronald Fisher to test the claim of a woman who said she could tell whether tea or milk had been poured into a cup first.


Do your own Fisher's Exact Test!

►Let's create two partially overlapping groups

g1<-paste0("Gene_",LETTERS[1:10])

g2<-paste0("Gene_",LETTERS[7:11])

►Assess Intersections

setdiff(g1,g2)

intersect(g1,g2)

setdiff(g2,g1)

► The Fisher's Exact Test Function

contingency<-rbind(c(4,6),c(1,15))

fisher.test(contingency)

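[Figure: Venn diagram of g1 and g2 with 4 shared genes, 6 genes only in g1 and 1 gene only in g2]

Contingency table:

4    6
1    26-4-6-1

26 is the "Universe", the total number of elements (even those outside the two groups)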
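The four cells of the contingency table can also be counted directly from the two groups instead of being typed by hand. A minimal sketch, assuming the universe is the 26 genes Gene_A to Gene_Z; the names universe, both, only_g1, only_g2 and neither are introduced here for illustration:

```r
universe <- paste0("Gene_", LETTERS)   # assumed universe: 26 genes
g1 <- paste0("Gene_", LETTERS[1:10])
g2 <- paste0("Gene_", LETTERS[7:11])

both    <- length(intersect(g1, g2))   # 4
only_g1 <- length(setdiff(g1, g2))     # 6
only_g2 <- length(setdiff(g2, g1))     # 1
neither <- length(universe) - both - only_g1 - only_g2   # 15

contingency <- rbind(c(both, only_g1), c(only_g2, neither))
result <- fisher.test(contingency)
result$p.value
```

Counting from the sets avoids arithmetic slips when the groups or the universe change.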


Exercises Part 1

• Download a list of all dead characters (up to Book 5) of a certain TV series: https://www.dropbox.com/s/rhsdyl2gkpzjar9/fisher.zip?dl=0

• Unzip it in your R folder

• Load the files as separate vectors in R. In order to have them as character vectors, I suggest you use this code:

dead<-read.delim("dead.txt",as.is=TRUE)

dead<-dead[,1]

• Question: is the dead names list enriched for a particular House?

• Calculate the Fisher's Exact Test p-value for each Allegiance (Arryn, Stark, Baratheon, Lannister, Greyjoy and Night's Watch)


Exercises Part 2

• A useful way to visually compare lists to find multiple overlaps is to use

Venn Diagrams. Learn how to use Venn Diagrams by reading this R

documentation and plot four groups: https://www.rdocumentation.org/packages/VennDiagram/versions/1.6.20/topics/venn.diagram


Exercises Part 3

• The way we are running the Fisher's Exact Test looks for imbalances in our intersection in either direction (i.e. it tests whether the intersection is greater or smaller than expected by chance). Play with the parameters of the fisher.test() function to calculate the significance of an intersection being greater than expected

Example 1, contingency table:

3      600
150    100

p=0, pless=0, pgreater=1

In this case, the Intersection is significantly smaller than expected

Example 2, contingency table:

30    5
35    200

p=0, pless=1, pgreater=0

In this case, the Intersection is significantly greater than expected

Example 3, contingency table:

31    50
40    100

p=0.18, pless=0.94, pgreater=0.09

In this case, the Intersection is greater than (but not significantly different from) what is expected by chance
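The parameter that controls the direction of the test is alternative. The sketch below applies it to the first contingency table above; the names tab, p_two_sided, p_less and p_greater are made up for the example:

```r
tab <- rbind(c(3, 600), c(150, 100))

p_two_sided <- fisher.test(tab)$p.value                          # default: "two.sided"
p_less      <- fisher.test(tab, alternative = "less")$p.value    # smaller than expected
p_greater   <- fisher.test(tab, alternative = "greater")$p.value # greater than expected

round(c(p = p_two_sided, pless = p_less, pgreater = p_greater), 2)
```

The same three calls can be repeated on the other two tables to compare with the values reported on the slide.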


Exercises Part 4

• What is the Allegiance with the lowest p-value?

• Can you name all intersections between the groups (except all.txt and dead.txt)?

• Since you calculated multiple tests, you need to correct your p-values (the more tests you do, the more likely you are to get a significant result by chance). Use the function p.adjust(). This is called multiple testing correction; a popular form is the False Discovery Rate correction, which you can think of as the p-value (significance) of the p-value.

There are several methods to correct the p-value:

• Bonferroni correction: multiply each p-value by the number of p-values calculated (capping the result at 1), e.g. p1=0.007, p2=0.1, p3=0.4 become p1=0.021, p2=0.3, p3=1

• Benjamini-Hochberg correction: less stringent, it "punishes" p-values according to their ranking (the lower the p-value, the more it is corrected): each p-value is multiplied by the number of tests divided by its rank, e.g. p1=0.007, p2=0.1, p3=0.4 become p1=0.021, p2=0.15, p3=0.4
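The two corrections above can be reproduced with p.adjust(). This is a minimal sketch using the three example p-values from this slide; the object names (p, p_bonf, p_bh) are only illustrative.

```r
# Example p-values from this slide
p <- c(p1 = 0.007, p2 = 0.1, p3 = 0.4)

# Bonferroni: p * number of tests, capped at 1
p_bonf <- p.adjust(p, method = "bonferroni")

# Benjamini-Hochberg: p * number of tests / rank, kept monotonic
p_bh <- p.adjust(p, method = "BH")

# Compare raw and corrected p-values
print(rbind(raw = p, bonferroni = p_bonf, BH = p_bh))
```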


Enrichment Analysis

1. Over Representation Analysis (ORA)

Sometimes, what you have is simply a list of genes, generated in multiple ways (e.g. mutated/not mutated, or which genes are upregulated by a drug more than two-fold)

[Figure: overlap between "My list" and "Another list" (e.g. cell cycle genes)]

In the previous statement, we had to define our gene list with an arbitrary cutoff (two-fold), but in that way we may lose a lot of genes that are just below the cutoff. A way to avoid this is…

2. Gene Set Enrichment Analysis (GSEA)

A list of genes is overlapped over a Signature, i.e. a transcriptome-wide vector with weights (e.g. up/down regulation score) assigned to each gene

[Figure: a Gene Set mapped by GSEA onto our Gene Signature (e.g. 20,000 genes, Log2(Cancer/Normal) Expression)]
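To make the ORA idea concrete, here is a small sketch on made-up data. The gene names, list sizes and object names are all hypothetical; the point is only that the significance of an overlap follows a hypergeometric distribution, which is the same thing a one-sided Fisher's Exact Test computes.

```r
# Toy universe of 100 hypothetical genes (names are made up)
universe <- paste0("gene", 1:100)
my_list <- universe[1:20]     # e.g. genes upregulated by a drug
gene_set <- universe[11:30]   # e.g. cell cycle genes

# Observed overlap between the two lists
overlap <- length(intersect(my_list, gene_set))

# P(overlap >= observed) when drawing length(my_list) genes from a
# universe that contains length(gene_set) genes of the set
p_ora <- phyper(overlap - 1, length(gene_set),
                length(universe) - length(gene_set),
                length(my_list), lower.tail = FALSE)

# Same question asked with a one-sided Fisher's Exact Test
tab <- rbind(c(overlap, length(my_list) - overlap),
             c(length(gene_set) - overlap,
               length(universe) - length(my_list) - length(gene_set) + overlap))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```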


Enrichment: Concept

Part 2: Gene Set Enrichment Analysis

GSEA has become a very popular tool in Genomics and Transcriptomics analyses, partially thanks to its sensitivity

Look at case S3: no ORA could find significance there
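The intuition behind GSEA can be sketched with a simplified, unweighted running sum over a ranked signature. This is an illustration on simulated data, not the exact algorithm used by the GSEA packages; the signature, the gene names and the chosen gene set are all made up.

```r
set.seed(1)

# Simulated signature: 1000 hypothetical genes ranked from most
# upregulated to most downregulated
signature <- sort(rnorm(1000), decreasing = TRUE)
names(signature) <- paste0("gene", 1:1000)

# Hypothetical gene set whose members all sit near the top
gene_set <- names(signature)[c(3, 10, 25, 40, 80)]

# Walk down the ranking: step up at every hit, down at every miss
hits <- names(signature) %in% gene_set
steps <- ifelse(hits, 1 / sum(hits), -1 / sum(!hits))
running <- cumsum(steps)

# Enrichment score: largest deviation of the running sum from zero
es <- running[which.max(abs(running))]
```

No cutoff on the signature is needed here: every gene contributes to the running sum, which is why this kind of approach can detect gene sets that an ORA would miss.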


Exercises!

• Your future job requires you to familiarize yourself with the enrichplot package

• Install it

• Read the vignette: browseVignettes("enrichplot")

Preferably use the updated one, dated 2018-10-30:

https://bioconductor.org/packages/release/bioc/vignettes/enrichplot/inst/doc/enrichplot.html

• Run the vignette, but use this gene list (targets of the MYCN oncogene in neuroblastoma, from the Cancer Gene Census):

url<-"https://www.dropbox.com/s/ld2s58jfbv70kha/geneList.rda?dl=1"

download.file(url,destfile="geneList.rda",mode="wb")

load("geneList.rda")

plot(density(geneList),xlab="Log2 Fold Change")

Hints:

• Plots will look much better if you use the png() function with high width, height and res


Installation and gene set definition

# Update Bioconductor (optional and long, but it will guarantee you have the latest packages)

# BiocManager::install()

## Install library

# biocLite() is no longer available for R >= 3.5: use BiocManager instead

if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")

BiocManager::install("enrichplot")

library("enrichplot")

library("DOSE")

# Generate a list of differentially expressed genes (absolute Log2FC above 2)

de <- names(geneList)[abs(geneList) > 2]


Visualize the Data

# Gene Fold Change Visualization

barplot(geneList,ylab="Fold Change")

# Visualize only de genes

barplot(geneList[de])

# Convert entrez genes into gene symbols

library('org.Hs.eg.db')

# Symbol list (the keys are Entrez IDs, so the keytype is 'ENTREZID' and the returned column is 'SYMBOL')

desymbols<-mapIds(org.Hs.eg.db, de, 'SYMBOL', 'ENTREZID')

symbols<-geneList

names(symbols)<-mapIds(org.Hs.eg.db, names(geneList), 'SYMBOL', 'ENTREZID')

# Plot

sublist<-symbols[desymbols]

cols<-rep("red3",length(sublist))

cols[sublist<0]<-"cornflowerblue"

png("degenes.png",w=18000,h=1500,res=200)

par(las=2)

barplot(sublist,col=cols,ylab="Fold Change")

dev.off()

[Figure: barplot of the differentially expressed genes. Y axis: Log2(Tumor/Normal). Labels: Tumor Suppressor Genes, Oncogenes]


Enrichment Analysis

# Over Representation Analysis

edo <- enrichDGN(de)

# Gene Set Enrichment Analysis

edo2 <- gseNCG(geneList, nPerm=10000)

Do these genes belong to particular groups more often than expected by chance?


Enrichment Barplot

## Visualization of Analysis

# Bar plot

png("barplot.png",w=2000,h=1500,res=100)

barplot(edo, showCategory=20)

dev.off()


Dot Plot

# Dot plot

library(ggplot2)

p1 <- dotplot(edo, showCategory=30) + ggtitle("dotplot for ORA")

p2 <- dotplot(edo2, showCategory=30) + ggtitle("dotplot for GSEA")

library(cowplot)

png("dotplots.png",w=2000,h=1500,res=200)

plot_grid(p1, p2, ncol=2)

dev.off()


Gene Network

## Gene-Concept Network

# Convert Entrez ids to gene symbols

edox <- setReadable(edo, 'org.Hs.eg.db', 'ENTREZID')

# Radial

png("cnet.png",w=2000,h=1500,res=100)

cnetplot(edox, foldChange=geneList)

dev.off()


Gene Network

# Circular

png("cnet_circular.png",w=2500,h=1500,res=100)

cnetplot(edox, foldChange=geneList, circular = TRUE,

colorEdge = TRUE)

dev.off()


Genes belonging to Multiple Groups

## UpSet plot

png("upset.png",w=2500,h=1500,res=200)

upsetplot(edo)

dev.off()


Heatmap

## Heatmap-like

png("heatmap.png",w=2000,h=1500,p=60)

heatplot(edox)

dev.off()


Heatmap

# Heatmap with score

png("heatmap_score.png",w=3000,h=1500,res=300)

heatplot(edox, foldChange=geneList)

dev.off()


Overlapping Gene Sets

# Enrichment Map

png("emapplot.png",w=3000,h=1500,res=300)

emapplot(edo)

dev.off()


Significance of up/down regulated

# Ridgeline plot

png("ridgeplot.png",w=3000,h=1500,res=300)

ridgeplot(edo2)

dev.off()


Visualization of a gene group over our signature

# GSEA plot

png("gseaplot.png",w=3000,h=1500,res=300)

i<-1

gseaplot2(edo2, geneSetID = i, title = edo2$Description[i])

dev.off()


How the gene groups are cited in Pubmed

# Pubmed trend plot

png("pubmed.png",w=3000,h=1500,res=300)

terms <- edo$Description[1:20]

pmcplot(terms, 2010:2018, proportion=FALSE)

dev.off()


www.giorgilab.org

Federico M. Giorgi, PhD

Department of Pharmacy and Biotechnology


to EF, bringer of salted breakfasts


Tricks

# Create a contingency table

universe<-length(all)

ul<-intersect(dead,wildling)

ur<-setdiff(dead,ul)

dl<-setdiff(wildling,ul)

ul<-length(ul)

ur<-length(ur)

dl<-length(dl)

dr<-universe-ul-ur-dl

ctable<-rbind(c(ul,ur),c(dl,dr))

# Create a 3-way Venn Diagram

library(VennDiagram)

# Start on a clean page (calling dev.off() here fails if no graphics device is open)

grid.newpage()

vennlist<-list(dead,stark,greyjoy)

venn.plot<-

venn.diagram(vennlist,NULL,fill=c("black","grey","orange"),alpha=0.5,

cex=2,cat.fontface=4,

category.names=c("Dead","Stark","Greyjoy"),

main="War of Five Kings")

grid.draw(venn.plot)
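The contingency table recipe above can be wrapped in a small helper and passed to fisher.test(). This is only a sketch: overlap_test is a hypothetical function name, and the toy vectors are made up for illustration (they are not the course files).

```r
# Hypothetical helper: Fisher's Exact Test for the overlap of two
# name lists, given the full list of names (the universe)
overlap_test <- function(a, b, universe) {
  ul <- length(intersect(a, b))            # in both lists
  ur <- length(setdiff(a, b))              # only in a
  dl <- length(setdiff(b, a))              # only in b
  dr <- length(universe) - ul - ur - dl    # in neither list
  ctable <- rbind(c(ul, ur), c(dl, dr))
  fisher.test(ctable)$p.value
}

# Toy data, made up for illustration
everyone <- paste0("character", 1:40)
dead_toy <- everyone[1:10]
house_toy <- everyone[6:15]

p_toy <- overlap_test(dead_toy, house_toy, everyone)
```

Calling the helper once per Allegiance, for instance with sapply() over a list of name vectors, gives one p-value per House, ready for p.adjust().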


Exercises!

• Do something


Solutions

# Some solution

s<-5