BIOINFORMATICS LAB Episode V Introduction to R · Episode V –Introduction to R Federico M. Giorgi, PhD Department of Pharmacy and Biotechnology First Cycle Degree in Genomics. 2/60

BIOINFORMATICS LAB

Episode V – Introduction to R

Federico M. Giorgi, PhD

Department of Pharmacy and Biotechnology

First Cycle Degree in Genomics

2/60

The R programming Language

3/60

• R was created in the early 1990s by Ross Ihaka and Robert Gentleman

(version 1.0.0 was released in 2000)

• It is Free Forever

• Open source alternative to commercial statistical

software (Excel, Matlab, S, SAS, SPSS)

• R is an interpreted language (relatively slow), but it can call code

written in other languages (C, C++, Fortran, Java, Python)

• It runs on every Operating System

Intro to R

[Figure: R surrounded by the languages it interfaces with: C, C++, Java, Fortran, Python]

4/60

1. Basic Statistical Analysis (e.g. Average, SD, T-test, Wilcoxon test)

2. I/O operations (load dataset, modify it, print to Excel or text file)

3. Graphics/Data Visualization

4. Higher Stats (Machine Learning)

5. Document Generation (HTML, PDF, md)

6. Web Sites

Purposes of R

5/60

An R website

https://shiny.rstudio.com/gallery/movie-explorer.html


6/60

Another R website

https://shiny.rstudio.com/gallery/superzip-example.html


7/60

A Genomics R website

https://shiny.rstudio.com/gallery/genome-browser.html


8/60

1. Find the top grossing ($$$) movie with Harrison Ford

2. Find the highest Tomato Meter movie of the period 2000-2005 which won at least one Oscar

3. Find the top grossing Horror movie of all time

4. Find the longest (minutes) movie of all time which also won at least 1 Oscar

5. Find the most densely populated area of the U.S.A. (people/square mile)

6. Find the ZIP code of the northernmost town in the U.S.A.

7. From the Pancreatic Cancer Genome Browser, one of the donors has a lot of genomic rearrangements in Chromosome 12. Which one?

8. For each donor, find the genomic region with the highest number of transcripts affected by splice variants

Exercises!

9/60

Running R (the old Way)

Reminders:

• userid student

• password 4genomics4

Terminal

$ cd

$ pwd

/home/student

$ R

10/60

Running R (the old way)

►The R “Hello World”

message("Hello, World!")

►The R graphics “Hello World”

demo(graphics)

11/60

Some Basic R commands

►Assign a value to a variable

a<-42

a

►Compare variables

if(a==42) {

message("The variable is 42")

} else {

message("Goodbye")

q(save="no")

}

►Boolean comparisons > < >= <= ==

a>10

TRUE

12/60

Let’s open RStudio!

Reminders:

• userid student

• password 4genomics4

RStudio

13/60

• IDE: Integrated Development Environment

– What's the equivalent for Python?

• Create a New File (File->New File->R script or SHIFT+CTRL+N)

RStudio

14/60

The Golden Programming Rule

• CTRL-S (create a "code" folder and pick any file name, such as 001_intro.R)

• Prevent data loss

• Organize your work, make it REPRODUCIBLE

• At home, keep it backed up (OneDrive, Google Drive, Dropbox, iCloud, GitHub, SourceForge)

• CTRL-S every time you change something

• The code is your life

• If you see an asterisk, your code is NOT saved and you will NOT graduate

15/60

RStudio

• The Code window

16/60

RStudio

• R Console / Terminal window

17/60

• The Environment/History window

RStudio

18/60

RStudio

• The Files/Plot/Help window

19/60

• Type code at the top

• CTRL+ENTER executes the highlighted code (or click on Run)

RStudio

20/60

Basic Functions with RStudio

►Assign a numeric value to a variable

x<-5

x

►Assign a character value to a variable

y<-"jack"

y

►Assign a boolean value to a variable

z<-FALSE

z

►Concatenate

xyz<-c(x,y,z)

xyz
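A vector can hold only one data type, so c() converts its elements to the most general type present. A minimal sketch, reusing the x, y and z values from this slide (the nums variable is added only for illustration):

```r
x <- 5
y <- "jack"
z <- FALSE
xyz <- c(x, y, z)
class(xyz)       # the whole vector is now of type character
xyz              # the number and the boolean are stored as text
nums <- c(x, z)  # without a character element...
class(nums)      # ...the boolean is converted to a number instead
```

This is why the Data Type Conversion functions on the next slide matter: after concatenation, the 5 inside xyz is text, not a number.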

21/60

Data Type Conversion

► To numeric

as.numeric(y)

as.numeric(z)

as.numeric("42")

► To character

as.character(x)

as.character(z)

►Assign the result to the variable (with <-) to overwrite it!

x<-as.character(x)

x

22/60

More R commands

►Simple Operations

5+4

►Less Simple Operations

sqrt(144)

2^4

log10(1000)

►Vectors and Vectors operation

vec<-c(3,12,15,20)

mean(vec)

sd(vec)

23/60

R Code Aesthetics

►Comment Lines

# This is a comment line

a<-5

# a<-4

a<-a+1 # I can write comments also here

►Change Template

Tools->Global Options->Appearance->Editor Theme->Cobalt->Apply

24/60

R Quality of Life

► Show all objects in workspace

ls()

► Show current work directory

getwd()

► Remove an object from workspace

rm(a)

► Remove ALL objects from workspace

rm(list=ls())

► Change current work directory

setwd(".")

► Help!

?rnorm

25/60

The Normal Distribution

►Generate 1000 random values from a normal dist

gauss<-rnorm(1000)

►Calculate the mean

mean(gauss)

►Calculate the standard deviation

sd(gauss)

►Plot the values

plot(gauss)

►Plot the density of the values

plot(density(gauss))

26/60

Plotting Stuff

►Generate a Vector

gauss<-rnorm(1000)

►Plot the values

plot(density(gauss))

hist(gauss)

►Function Arguments

hist(gauss,xlab="My Numbers",ylab="How Often")

hist(gauss,n=30)

hist(gauss,n=30,col="orange")

27/60

Plotting Multiple Stuff

►Generate two Vectors

gauss1<-rnorm(1000,mean=1,sd=1)

gauss2<-rnorm(1000,mean=1,sd=3)


►Plot the density of both

plot(density(gauss1),lwd=1,main="Gauss Title")

lines(density(gauss2),lwd=3,lty=2)

►Give a Legend

legend("topright",legend=c("SD1","SD3"),lwd=c(1,3),lty=c(1,2))

►Add a grid

grid(col="black",lwd=0.5,lty=3)

► Increase plot area with xlim

plot(density(gauss1),xlim=c(-3,4))

lines(density(gauss2))

28/60

Recap of some plotting parameters

• col defines the plot track colors

– number (1 is black, 2 is red and so on)

– name ("black", "red", "cornflowerblue", "magenta")

• lty defines the line type

– 1 is solid line, 2 is dashed line, and so on

• lwd defines the line width

– 1 is standard, 2 is twice the standard, and so on

• More: main, xlab, ylab, xlim, ylim

29/60

Exercises!

• Try to reproduce every detail of this plot

30/60

Solution

# Exercise

set.seed(1) # sets the random seed to make the example fully reproducible

# Generate vectors

gauss1<-rnorm(1000,mean=0,sd=0.5)

gauss2<-rnorm(1000,mean=0,sd=1)

gauss3<-rnorm(1000,mean=3,sd=1)

gauss4<-rnorm(1000,mean=1,sd=3)

# Plot them

plot(density(gauss1),lwd=3,lty=1,col="black",main="Normal Distributions",xlab="Value",ylab="Frequency",xlim=c(-5,5))

lines(density(gauss2),lwd=3,lty=2,col="red")

lines(density(gauss3),lwd=3,lty=3,col="blue")

lines(density(gauss4),lwd=3,lty=4,col="green")

grid(col="grey",lty=2)

legend("topleft",col=c("black","red","blue","green"),lty=c(1,2,3,4),lwd=3,

legend=c("Mean=0,SD=0.5","Mean=0,SD=1","Mean=3,SD=1","Mean=1,SD=3")

)

31/60

Increasing Dimensions

►Point

1

►Vector

c(2,3,4,5,6,7)

c(2:7)

c(7:2,15,0:3/10)

►Matrix

matrix("empty",nrow=10,ncol=5)

matrix(rnorm(50),nrow=10,ncol=5)

►Array

array(1:24,dim=c(4,3,2))

32/60

Loops

►For loop

v<-(1:5)^2

length(v)

for (i in 1:length(v)){

message("Row:",i," value:", v[i])

}

►Another example

v<-1:5

v2<-v

for (i in 1:length(v)){

v2[i]<-v[i]+10

}

v2
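Most arithmetic in R is vectorized, so the second loop above can also be written without a loop. A minimal sketch comparing the two approaches (the names v2_loop and v2_vec are made up for this example):

```r
v <- 1:5
# Loop version, as on the slide
v2_loop <- v
for (i in seq_along(v)) {
  v2_loop[i] <- v[i] + 10
}
# Vectorized version: the addition is applied to every element at once
v2_vec <- v + 10
v2_loop
v2_vec
```

Loops remain useful when each step depends on the previous one or has side effects (such as printing a message per row).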

33/60

Extracting Matrix Portions

►Row name and Column names

m<-matrix(rnorm(50),nrow=10,ncol=5)

rownames(m)<-LETTERS[1:10]

colnames(m)<-LETTERS[11:15]

m

View(m) # This is dangerous for big matrices (>10k cells)

►Nested Loop within a matrix

for(i in 1:nrow(m)){

for(j in 1:ncol(m)){

message(i,",",j," ",m[i,j])

}

}

34/60

Extracting Matrix Portions (2)

►Getting an entire row

m[1,]

m["E",]

►Getting an entire column

m[,5]

m[,"O"]

►Combine pieces

cbind(m[,5],m[,2])

rbind(m[2:3,],m[c(1,4),])

►Size commands

nrow(m)

ncol(m)

35/60

Exercises!

• Generate a matrix with random values

– 1000 rows, 20 columns

– No need to set rownames() and colnames()

• Calculate the mean and standard deviation of each row

• Generate a second matrix where each value of the first matrix is squared

• Plot the distributions of the values of both matrices

36/60

Solution

# Exercise to calculate mean and standard deviation of each row

set.seed(1)

m<-matrix(rnorm(20*1000),nrow=1000,ncol=20)

for(i in 1:nrow(m)){

thisrow<-m[i,]

imean<-mean(thisrow)

message("Mean of row ",i,": ",imean)

isd<-sd(thisrow)

message("Standard Deviation of row ",i,": ",isd)

}

37/60

Solution

# Exercise to square each value

set.seed(1)

m<-matrix(rnorm(20*1000),nrow=1000,ncol=20)

m2<-matrix(NA,nrow=1000,ncol=20)

for(i in 1:nrow(m)){

for(j in 1:ncol(m)){

m2[i,j]<-m[i,j]^2

}

}

m2

# Plot values

plot(density(m2),col="salmon",lwd=3,lty=2)

lines(density(m),col="navy",lwd=3,lty=1)

38/60

apply

►Calculating mean of each row of a big matrix:

row_means<-c() # Empty vector

for(i in 1:nrow(m)){

newmean<-mean(m[i,])

row_means<-c(row_means,newmean)

}

row_means[1:10] # Print the first 10 row means

►A faster and more elegant way to do the same is to use apply

row_means<-apply(m,1,mean)

row_means[1:10]

column_sds<-apply(m,2,sd)

apply(matrix,dimension,function_name)
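The dimension argument is what decides between rows and columns. A minimal sketch on a freshly generated matrix (the matrix size and the variable names are assumptions made for this example); it also checks the apply result against rowMeans(), a base R shortcut for the same calculation:

```r
set.seed(1)
m <- matrix(rnorm(20 * 1000), nrow = 1000, ncol = 20)
row_means_apply <- apply(m, 1, mean)  # 1 = apply the function to each row
col_sds_apply   <- apply(m, 2, sd)    # 2 = apply the function to each column
# Base R also ships dedicated shortcuts for means and sums
row_means_fast <- rowMeans(m)
length(row_means_apply)  # one value per row
length(col_sds_apply)    # one value per column
all.equal(row_means_apply, row_means_fast)
```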

39/60

Functions in R

►R function structure

function_name<-function(input1,input2){

command1

command2

return(output)

}

►A custom-made function to calculate the mean

mymean<-function(input1){

res<-sum(input1)

res<-res/length(input1)

return(res)

}

mymean(1:10)

40/60

Plotting Multiple Variables

►We have two vectors of the same length

x<-runif(100,min=1830,max=1900)

y<-x/mean(x)+rnorm(100,sd=0.01)+60

plot(x,y) # a scatterplot

41/60

Plotting Multiple Variables

►Beautify the Scatterplot

plot(x,y,cex=4)

plot(x,y,pch=20)

plot(x,y,pch=20,xlab="Year",ylab="Weight", main="Queen Victoria",col="red")

►Regression line

lm1<-lm(y~x)

lm1

abline(lm1)
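The object returned by lm() holds more than the line drawn by abline(). A minimal sketch of how the slope and the R squared could be extracted, using the same x and y as above (set.seed is added here only to make the numbers repeatable):

```r
set.seed(1)
x <- runif(100, min = 1830, max = 1900)
y <- x / mean(x) + rnorm(100, sd = 0.01) + 60
lm1 <- lm(y ~ x)
coefs <- coef(lm1)                 # named vector: "(Intercept)" and "x"
slope <- coefs[["x"]]
r2 <- summary(lm1)$r.squared       # fraction of variance explained by the line
message("Slope: ", signif(slope, 3), "  R squared: ", signif(r2, 3))
```

Since y was built as x/mean(x) plus noise, the fitted slope should land close to 1/mean(x).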

42/60

Exercises!

• Write a function to calculate the Geometric Mean of a set of numbers

– Hint: ?prod and the exponent ^

• Generate a random matrix (using rnorm with mean=1) with 300 rows and

200 columns

• Using apply, calculate the mean and geometric mean of each row

• Compare the calculated means and geometric means using a

scatterplot

• Run a red regression line through that scatterplot

• Repeat the scatterplot using rnorm with mean=0, 1, 10 and 100

43/60

Solutions

# Geometric mean function

geomean<-function(a){

product<-prod(a)

exponent<-1/length(a)

res<-product^exponent

return(res)

}

geomean(1:10)

# Arithmetic vs Geometric mean

variable<-1 # Try 0, 1, 10, 100, 1000

m<-matrix(rnorm(300*200,mean=variable),nrow=300,ncol=200)

means<-apply(m,1,mean)

geomeans<-apply(m,1,geomean)

plot(means,geomeans,pch=20,xlab="Arithmetic Mean",ylab="Geometric mean",main=paste0("rnorm_mean=",variable))

lm1<-lm(geomeans~means)

abline(lm1,col="red")
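When repeating the exercise with large means, the product of 200 numbers can exceed what R can store. A possible workaround, sketched here under the assumption that all inputs are strictly positive, is to average the logarithms instead (geomean_log and geomean_prod are names made up for this example):

```r
# Geometric mean via logarithms; only defined for strictly positive numbers
geomean_log <- function(a) {
  stopifnot(all(a > 0))
  exp(mean(log(a)))
}
geomean_prod <- function(a) prod(a)^(1 / length(a))  # the product-based version

small <- 1:10
big <- rep(100, 200)   # the product 100^200 exceeds the largest double
geomean_prod(small)
geomean_log(small)     # same result as the product-based version
geomean_prod(big)      # Inf, because the product overflows
geomean_log(big)       # still finite
```

With mean=0 about half of the random values are negative, so neither version gives a meaningful geometric mean there.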

44/60

More plotting tricks

►Histogram

hist(rnorm(1000))

hist(runif(1000))

hist(rpois(1000,lambda=1))

►Quantile plots

qqnorm(rpois(1000,lambda=1))

qqnorm(rpois(1000,lambda=5))

qqnorm(rpois(1000,lambda=100))

qqnorm(rnorm(1000))

►Boxplots

x<-rnorm(1000,mean=1)

y<-rnorm(1000,mean=2)

z<-rnorm(1000,mean=2,sd=10)

boxplot(x,y,z)

45/60

Saving Plots to a file

►PDF

x<-(1:10)^2

pdf("testplot.pdf")

plot(x)

dev.off()

►PNG

png("testplot.png")

plot(x,type="l")

dev.off()

png("testplot.png",width=2000,height=1000,pointsize=30)

plot(x,type="l")

dev.off()

46/60

Remember Linux?

►We can still use it from within R! Using the system function

system("echo hello!")

►Let's get an expression file

command<-"wget https://www.dropbox.com/s/zo6j3cgby2l41q7/geneExpressionV2.csv"

system(command)

►Everything works the same way (except tools requiring interaction)

system("ls")

system("wc -l geneExpressionV2.csv")

system("head -n2 geneExpressionV2.csv")

system("less -S geneExpressionV2.csv")

47/60

Input/Output

► On any OS, you can download a file from the internet using download.file()

link<-"https://www.dropbox.com/s/zo6j3cgby2l41q7/geneExpressionV2.csv?dl=1"

download.file(link,destfile="geneExpressionV2.csv")

►Some read functions

mydata<-read.csv("geneExpressionV2.csv") # specific CSV function

head(mydata)

mydata<-read.delim("geneExpressionV2.csv",sep=",")

mydata[1:5,1:5]

►More options

file<-"geneExpressionV2.csv"

mydata<-read.delim(file,sep=",",row.names=1,header=TRUE)

rownames(mydata)

colnames(mydata)

dim(mydata)
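Reading is only half of Input/Output. A minimal sketch of the opposite direction, writing a table to a CSV file and reading it back; the toy matrix and the temporary file are assumptions made so that the example does not touch geneExpressionV2.csv:

```r
# A small made-up matrix standing in for a real dataset
toy <- matrix(1:6, nrow = 2,
              dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3")))
outfile <- tempfile(fileext = ".csv")   # temporary file, so nothing is overwritten
write.csv(toy, file = outfile)          # row names go into the first column
back <- read.csv(outfile, row.names = 1)
back <- as.matrix(back)                 # read.csv returns a data.frame
dim(back)
back
```

The resulting CSV file can be opened directly in Excel.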

48/60

Conditions

► In R, conditional instructions can be used

if(condition1) {

instruction1

}else if(condition2){

instruction2

}else{

instruction3

}

►Example

age<-300

if(age>100) { # age>100 returns TRUE or FALSE

message("You are old")

} else {

message("You are young")

}
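if() evaluates a single TRUE or FALSE. To apply the same test to every element of a vector, base R offers ifelse(). A minimal sketch with made-up ages:

```r
ages <- c(300, 25, 101, 100)
# if() handles a single TRUE/FALSE; ifelse() tests every element of a vector
labels <- ifelse(ages > 100, "old", "young")
labels
```

Note that 100 itself is labeled "young", because the comparison is strictly greater than.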

49/60

Exercises!

• The geneExpressionV2.csv (mydata) is a Gene Expression Dataset

mydata<-read.delim("geneExpressionV2.csv",sep=",",row.names=1,header=TRUE)

mydata<-as.matrix(mydata) # Otherwise it's a data.frame (we will talk about that in the future)

– Already normalized (the higher the number, the more expressed the gene)

– columns 1 to 32 are breast cancer samples

– columns 33 to 64 are healthy (control) samples

• How many genes do you have? (Rows)

• Using a for loop and conditionals, get the highest value in the matrix, plus

its row and column locations

• Print in a nice looking PNG the gene expression profile of

– the gene with the highest Standard Deviation

– the gene with the lowest Standard Deviation

50/60

Solutions

# How many genes do you have? (Rows)

?dim

nrow(mydata)

# Using a for loop and conditionals, get the highest value in the matrix, plus its row and column locations

currentmax<-(-Inf) # start below any possible value

currentmaxrow<-NA

currentmaxcol<-NA

for(i in 1:nrow(mydata)){

for(j in 1:ncol(mydata)){

herevalue<-mydata[i,j]

if(herevalue>currentmax){

currentmax<-herevalue

currentmaxrow<-i

currentmaxcol<-j

}

}

}

message("Max value: ",currentmax)

message("Row: ",currentmaxrow)

message("Row name: ",rownames(mydata)[currentmaxrow])

message("Col: ",currentmaxcol)

message("Col name: ",colnames(mydata)[currentmaxcol])
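Outside of the exercise, the same search can be done without loops. A minimal sketch on a small made-up matrix (toy stands in for mydata), using max() and which() with arr.ind=TRUE:

```r
# A small made-up matrix; the maximum (9) sits in row 2, column 2
toy <- matrix(c(1, 5, 3, 9, 2, 4), nrow = 2)
maxvalue <- max(toy)
# arr.ind=TRUE returns row and column indices instead of a single position
pos <- which(toy == maxvalue, arr.ind = TRUE)
maxrow <- pos[1, "row"]
maxcol <- pos[1, "col"]
message("Max value: ", maxvalue, " at row ", maxrow, ", column ", maxcol)
```

If the maximum occurs more than once, pos has one row per occurrence; the sketch keeps only the first.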

51/60

Solutions

# Print in a nice looking PNG the gene expression profile of the genes with the highest (and lowest) Standard Deviation

sds<-apply(mydata,1,sd)

mymax<-which.max(sds)

mymin<-which.min(sds)

maxname<-rownames(mydata)[mymax]

minname<-rownames(mydata)[mymin]

png("lineplots.png",width=2000,height=1000,pointsize=40)

plot(mydata[mymax,],type="l",col="red",lwd=3,ylab="Gene Expression")

lines(mydata[mymin,],col="cornflowerblue",lwd=3)

legend("topright",legend=c(maxname,minname),lwd=3,col=c("red","cornflowerblue"))

dev.off()

52/60

• R contains some functions from base packages

• There are more R packages out there

• Three main repositories

– CRAN, The Comprehensive R Archive Network (For Standard Statistical

Analysis)

– Bioconductor (Focused on Bioinformatics)

– R-Forge (for prototype packages)

Installing Packages

53/60

Using the CRAN Task Views

Finding packages in CRAN

54/60

https://www.bioconductor.org/packages/release/BiocViews.html

Finding packages in BioConductor


55/60

Installing and Loading Packages

► Installation from CRAN

library("randomForest")

install.packages("randomForest")

library("randomForest")

► Installation from Bioconductor

install.packages("BiocManager")

BiocManager::install("affy")

► This will also recursively install all dependencies

►Load Library

library("affy")
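Calling install.packages() every time a script runs is slow. One common pattern, sketched here with a hypothetical helper named load_or_install, is to install a CRAN package only when it is missing:

```r
# Hypothetical helper: install a CRAN package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}
# "stats" ships with R, so this call never triggers an installation;
# replace it with e.g. "randomForest" to see the install step
load_or_install("stats")
```

For Bioconductor packages the install step would need BiocManager::install() instead.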

56/60

Package Vignettes

►A "vignette" is common R slang for a colorful R package tutorial

► It should be informative, easy to load and run, with

small input data to play with


BiocManager::install("ComplexHeatmap")

library("ComplexHeatmap")

browseVignettes("ComplexHeatmap")

►You can always ask for help for any function or package:

?Heatmap

??Heatmap

57/60

Most Downloaded Bioconductor Packages

The vast majority are for Genome Annotation, RNA-Seq and microarray analysis


59/60

Enrichment Analysis

Genes you get from an analysis (TP53, KRAS, GATA3, RUNX2, MAPK, MYC, MYCN, ESR1, CCND1, CTNNB1)

vs.

The Entire World of Gene Sets

Part 1: Over-Representation Analysis

60/60

Enrichment Analysis

Our gene list (TP53, KRAS, GATA3, RUNX2, MAPK, MYC, MYCN, ESR1, CCND1, CTNNB1) vs. List 1, List 2, List 3, …, List n

There are at least 17,810 curated gene lists in the MSigDB collection alone

[Figure: Venn diagram of "My list" and "Another list" (e.g. cell cycle genes), with their intersection]

To test whether our gene list is enriched for a particular group of genes, we must perform a test that takes into consideration

• The size of the intersection

• The size of our list

• The size of the compared list

• The total number of lists we are testing (False Discovery Rate)

Part 1: Over-Representation Analysis

61/60

Contingency Tables and Fisher's Exact Test

From the comparison, we derive a Contingency Table

                        In Our Gene Set      Not in Our Gene Set

In the Other Set        Intersection size    Other-only size

Not in the Other Set    Ours-only size       Rest of the genome (genes in neither set)

From a Contingency Table we can run a Fisher's Exact Test to determine the significance of the intersection.

Devised by Sir Ronald Fisher at the Rothamsted Experimental Station to test the claim of a woman who said she could tell by taste whether tea or milk was poured into a cup first.

62/60

Do your own Fisher's Exact Test!

►Let's create two partially overlapping groups

g1<-paste0("Gene_",LETTERS[1:10])

g2<-paste0("Gene_",LETTERS[7:11])

►Assess Intersections

setdiff(g1,g2)

intersect(g1,g2)

setdiff(g2,g1)

► The Fisher's Exact Test Function

contingency<-rbind(c(4,6),c(1,15))

fisher.test(contingency)

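[Figure: Venn diagram with 6 genes only in g1, 4 in the intersection and 1 only in g2, next to the contingency table]

4 6

1 26-4-6-1

26 is the "Universe", the total number of elements (even those outside the two groups)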
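The four numbers in the contingency table do not have to be typed by hand. A minimal sketch that derives them from g1 and g2, assuming the universe is the 26 genes Gene_A to Gene_Z (the variable names are made up for this example):

```r
universe <- paste0("Gene_", LETTERS)      # all 26 possible genes
g1 <- paste0("Gene_", LETTERS[1:10])
g2 <- paste0("Gene_", LETTERS[7:11])
both    <- length(intersect(g1, g2))      # in both groups
only_g1 <- length(setdiff(g1, g2))        # in g1 only
only_g2 <- length(setdiff(g2, g1))        # in g2 only
rest    <- length(universe) - both - only_g1 - only_g2
contingency <- rbind(c(both, only_g1), c(only_g2, rest))
contingency
ft <- fisher.test(contingency)
ft$p.value
```

The same steps work for any pair of character vectors, as long as the universe is chosen sensibly.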

63/60

Exercises Part 1

• Download a list of all dead characters (up to Book 5) of a certain TV series: https://www.dropbox.com/s/rhsdyl2gkpzjar9/fisher.zip?dl=0

• Unzip it in your R folder

• Load the files as separate vectors in R. In order to have them as character vectors, I suggest you use this code:

dead<-read.delim("dead.txt",as.is=TRUE)

dead<-dead[,1]

• Question: is the dead names list enriched for a particular House?

• Calculate the Fisher's Exact Test p-value for each Allegiance (Arryn, Stark, Baratheon, Lannister, Greyjoy and Night's Watch)

64/60

Exercises Part 2

• A useful way to visually compare lists to find multiple overlaps is to use

Venn Diagrams. Learn how to use Venn Diagrams by reading this R

documentation and plot four groups: https://www.rdocumentation.org/packages/VennDiagram/versions/1.6.20/topics/venn.diagram


65/60

Exercises Part 3

• The way we are running the Fisher's Exact Test looks for imbalances in our intersection (i.e. testing if it's greater than or less than the one expected by chance). Play with the parameters in the fisher.test() function to calculate the significance of an intersection being greater than expected

Example 1:

3 600

150 100

p=0, pless=0, pgreater=1

In this case, the Intersection is significantly smaller than expected

Example 2:

30 5

35 200

p=0, pless=1, pgreater=0

In this case, the Intersection is significantly greater than expected

Example 3:

31 50

40 100

p=0.18, pless=0.94, pgreater=0.09

In this case, the Intersection is greater than (but not significantly different from) what is expected by chance
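The direction of the test is set with the alternative argument of fisher.test(). A minimal sketch on the first two tables of this slide (the variable names are made up for this example):

```r
tab_small <- rbind(c(3, 600), c(150, 100))   # first example above
tab_large <- rbind(c(30, 5), c(35, 200))     # second example above
# alternative="greater" asks: is the intersection larger than expected?
p_small_greater <- fisher.test(tab_small, alternative = "greater")$p.value
p_small_less    <- fisher.test(tab_small, alternative = "less")$p.value
p_large_greater <- fisher.test(tab_large, alternative = "greater")$p.value
p_large_less    <- fisher.test(tab_large, alternative = "less")$p.value
signif(c(p_small_greater, p_small_less, p_large_greater, p_large_less), 2)
```

Leaving alternative at its default ("two.sided") gives the p value shown as p on the slide.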

66/60

Exercises Part 4

• What is the Allegiance with the lowest p-value?

• Can you name all intersections between the groups (except all.txt and dead.txt)?

• Since you calculated multiple tests, you need to correct your p-values (the more tests you do, the more likely you are to get a significant result by chance). Use the function p.adjust(). This is called multiple testing correction; False Discovery Rate correction is one flavor of it, and you can think of it as the p-value (significance) of the p-value.

There are several methods to correct the P-value

• Bonferroni correction: multiply the p-value by the number of p-values calculated (capping the result at 1), e.g. p1=0.007, p2=0.1, p3=0.4 become p1=0.021, p2=0.3, p3=1

• Benjamini-Hochberg correction: less stringent, it "punishes" p-values according to

their ranking (the lower, the more it is corrected) e.g. p1=0.007, p2=0.1, p3=0.4

become p1=0.021, p2=0.15, p3=0.4
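Both corrections are available through the method argument of p.adjust(). A minimal sketch that reproduces the two examples above (the names p, p_bonf and p_bh are made up for this example):

```r
p <- c(p1 = 0.007, p2 = 0.1, p3 = 0.4)
p_bonf <- p.adjust(p, method = "bonferroni")  # multiplies by 3, capped at 1
p_bh   <- p.adjust(p, method = "BH")          # Benjamini-Hochberg
p_bonf
p_bh
```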

67/60

Enrichment Analysis

1. Over Representation Analysis (ORA)

[Figure: Venn diagram of "My list" and "Another list" (e.g. cell cycle genes)]

Sometimes, what you have is simply a list of genes, generated in multiple ways (e.g. mutated/not mutated, or which genes are upregulated by a drug more than two-fold)

In that previous statement, we had to define our gene list with an arbitrary cutoff (two-fold), but in that way we may lose a lot of genes that are just below the cutoff. A way to avoid this is…

2. Gene Set Enrichment Analysis (GSEA)

A list of genes is overlapped over a Signature, i.e. a transcriptome-wide vector with weights (e.g. up/down regulation score) assigned to each gene

[Figure: GSEA of a Gene Set (genes 1 to 5) over Our Gene Signature (e.g. 20,000 genes, Log2(Cancer/Normal) Expression)]

68/60

Enrichment: Concept

Part 2: Gene Set Enrichment Analysis

GSEA has become a very popular tool in

Genomics and Transcriptomics analyses,

partially thanks to its sensitivity

Look at case S3: no ORA could find a

significance there

69/60

Exercises!

• Your future job requires you to familiarize yourself with the enrichplot package

• Install it

• Read the vignette: browseVignettes("enrichplot")

Preferably use the updated one, dated 2018-10-30: https://bioconductor.org/packages/release/bioc/vignettes/enrichplot/inst/doc/enrichplot.html

• Run the vignette, but use this gene list (targets of the MYCN oncogene in neuroblastoma, from the Cancer Gene Census)

url<-"https://www.dropbox.com/s/ld2s58jfbv70kha/geneList.rda?dl=1"

download.file(url,destfile="geneList.rda",mode="wb")

load("geneList.rda")

plot(density(geneList),xlab="Log2 Fold Change")

Hints:

• Plots will look way better if using the png() function with high width, height and res

70/60

Installation and gene set definition

# Update Bioconductor (optional and long, but it guarantees you have the latest packages)

# BiocManager::install()

## Install library

install.packages("BiocManager")

BiocManager::install("enrichplot")

library("enrichplot")

library("DOSE")

# Generate a list of differentially expressed genes (absolute Log2FC above 2)

de <- names(geneList)[abs(geneList) > 2]

71/60

Visualize the Data

# Gene Fold Change Visualization

barplot(geneList,ylab="Fold Change")

# Visualize only de genes

barplot(geneList[de])

# Convert entrez genes into gene symbols

library('org.Hs.eg.db')

# Symbol list

desymbols<-mapIds(org.Hs.eg.db, de, 'SYMBOL', 'ENTREZID')

symbols<-geneList

names(symbols)<-mapIds(org.Hs.eg.db, names(geneList), 'SYMBOL', 'ENTREZID')

# Plot

sublist<-symbols[desymbols]

cols<-rep("red3",length(sublist))

cols[sublist<0]<-"cornflowerblue"

png("degenes.png",w=18000,h=1500,res=200)

par(las=2)

barplot(sublist,col=cols,ylab="Fold Change")

dev.off()

[Figure: barplot of Log2(Tumor/Normal) for the differentially expressed genes, with Tumor Suppressor Genes and Oncogenes labeled]

72/60

Enrichment Analysis

# Over Representation Analysis

edo <- enrichDGN(de)

# Gene Set Enrichment Analysis

edo2 <- gseNCG(geneList, nPerm=10000)

Do these genes belong to particular groups more often than expected by chance?

73/60

Enrichment Barplot

## Visualization of Analysis

# Bar plot

png("barplot.png",w=2000,h=1500,res=100)

barplot(edo, showCategory=20)

dev.off()

74/60

Dot Plot

# Dot plot

library(ggplot2)

p1 <- dotplot(edo, showCategory=30) + ggtitle("dotplot for ORA")

p2 <- dotplot(edo2, showCategory=30) + ggtitle("dotplot for GSEA")

library(cowplot)

png("dotplots.png",w=2000,h=1500,res=200)

plot_grid(p1, p2, ncol=2)

dev.off()

75/60

Gene Network

## Gene-Concept Network

# Convert Entrez ids to gene symbols

edox <- setReadable(edo, 'org.Hs.eg.db', 'ENTREZID')

# Radial

png("cnet.png",w=2000,h=1500,res=100)

cnetplot(edox, foldChange=geneList)

dev.off()

76/60

Gene Network

# Circular

png("cnet_circular.png",w=2500,h=1500,res=100)

cnetplot(edox, foldChange=geneList, circular = TRUE,

colorEdge = TRUE)

dev.off()

77/60

Genes belonging to Multiple Groups

## UpSet plot

png("upset.png",w=2500,h=1500,res=200)

upsetplot(edo)

dev.off()

78/60

Heatmap

## Heatmap-like

png("heatmap.png",w=2000,h=1500,p=60)

heatplot(edox)

dev.off()

79/60

Heatmap

# Heatmap with score

png("heatmap_score.png",w=3000,h=1500,res=300)

heatplot(edox, foldChange=geneList)

dev.off()

80/60

Overlapping Gene Sets

# Enrichment Map

png("emapplot.png",w=3000,h=1500,res=300)

emapplot(edo)

dev.off()

81/60

Significance of up/down regulated

# Ridgeline plot

png("ridgeplot.png",w=3000,h=1500,res=300)

ridgeplot(edo2)

dev.off()

82/60

Visualization of a gene group over our signature

# GSEA plot

png("gseaplot.png",w=3000,h=1500,res=300)

i<-1

gseaplot2(edo2, geneSetID = i, title = edo2$Description[i])

dev.off()

83/60

How the gene groups are cited in Pubmed

# Pubmed trend plot

png("pubmed.png",w=3000,h=1500,res=300)

terms <- edo$Description[1:20]

pmcplot(terms, 2010:2018, proportion=FALSE)

dev.off()

www.giorgilab.org

Federico M. Giorgi, PhD

Department of Pharmacy and Biotechnology


to EF, bringer of salted breakfasts

85/60

Tricks

# Create a contingency table

universe<-length(all)

ul<-intersect(dead,wildling)

ur<-setdiff(dead,ul)

dl<-setdiff(wildling,ul)

ul<-length(ul)

ur<-length(ur)

dl<-length(dl)

dr<-universe-ul-ur-dl

ctable<-rbind(c(ul,ur),c(dl,dr))

# Create a 3-way Venn Diagram

library(VennDiagram)

grid.newpage() # start from a blank page (dev.off() fails if no plot is open)

vennlist<-list(dead,stark,greyjoy)

venn.plot<-

venn.diagram(vennlist,NULL,fill=c("black","grey","orange"),alpha=0.5,

cex=2,cat.fontface=4,

category.names=c("Dead","Stark","Greyjoy"),

main="War of Five Kings")

grid.draw(venn.plot)
