BIOINFORMATICS LAB
Episode V – Introduction to R
Federico M. Giorgi, PhD
Department of Pharmacy and Biotechnology
First Cycle Degree in Genomics
2/60
The R programming Language
3/60
• R was created in the early 1990s by Ross Ihaka
and Robert Gentleman (version 1.0.0 was released in 2000)
• It is Free Forever
• Open source alternative to commercial statistical
software (Excel, Matlab, S, SAS, SPSS)
• R is a procedural interpreted language (relatively
slow), but it can call code written in other
languages (C, C++, Fortran, Java, Python)
• It runs on all major operating systems (Linux, macOS, Windows)
Intro to R
[Diagram: R alongside C, C++, Java, Fortran and Python]
4/60
1. Basic Statistical Analysis (e.g. Average, SD, T-test, Wilcoxon test)
2. I/O operations (load dataset, modify it, print to Excel or text file)
3. Graphics/Data Visualization
4. Higher Stats (Machine Learning)
5. Document Generation (HTML, PDF, md)
6. Web Sites
Purposes of R
5/60
An R website
https://shiny.rstudio.com/gallery/movie-explorer.html
6/60
Another R website
https://shiny.rstudio.com/gallery/superzip-example.html
7/60
A Genomics R website
https://shiny.rstudio.com/gallery/genome-browser.html
8/60
1. Find the top grossing ($$$) movie with Harrison Ford
2. Find the highest Tomato Meter movie of the period
2000-2005 which won at least one Oscar
3. Find the top grossing Horror movie of all times
4. Find the longest (minutes) movie of all times which won
also at least 1 Oscar
5. Find the most populated area of U.S.A. (people/square
mile)
6. Find the ZIP code of northernmost town in U.S.A.
7. From the Pancreatic Cancer Genome Browser, one of
the donors has a lot of genomic rearrangements in
Chromosome 12. Which one?
8. For each donor, find the genomic region with the
highest number of transcripts affected by splice
variants
Exercises!
9/60
Running R (the old Way)
Reminders:
• userid student
• password 4genomics4
Terminal
$ cd
$ pwd
/home/student
$ R
10/60
Running R (the old way)
►The R “Hello World”
message("Hello, World!")
►The R graphics “Hello World”
demo(graphics)
11/60
Some Basic R commands
►Assign a value to a variable
a<-42
a
►Compare variables
if(a==42) {
message("The variable is 42")
} else {
message("Goodbye")
q(save="no") # quits R without saving the workspace
}
►Boolean comparisons > < >= <= ==
a>10
TRUE
12/60
Let’s open RStudio!
Reminders:
• userid student
• password 4genomics4
RStudio
13/60
• IDE: Integrated Development Environment
– What's the equivalent for Python?
• Create a New File (File->New File->R script or SHIFT+CTRL+N)
RStudio
14/60
The Golden Programming Rule
• CTRL-S (create a "code" folder and pick any file name, such as 001_intro.R)
• Prevent data loss
• Organize your work, make it REPRODUCIBLE
• At home, keep it backed up (OneDrive, Google Drive, Dropbox, iCloud, GitHub, SourceForge)
• CTRL-S every time you change something
• The code is your life
• If you see an asterisk, your code is NOT saved and you will NOT graduate
15/60
RStudio
• The Code window
16/60
RStudio
• R Console / Terminal window
17/60
• The Environment/History window
RStudio
18/60
RStudio
• The Files/Plot/Help window
19/60
• Type code at the top
• CTRL+ENTER executes the highlighted code (or click on Run)
RStudio
20/60
Basic Functions with RStudio
►Assign a numeric value to a variable
x<-5
x
►Assign a character value to a variable
y<-"jack"
y
►Assign a boolean value to a variable
z<-FALSE
z
►Concatenate
xyz<-c(x,y,z)
xyz
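A note on the concatenation above: a vector holds a single data type, so mixing numeric, character and boolean values makes c() convert everything to character. A minimal sketch, reusing the x, y and z values from this slide:

```r
x <- 5
y <- "jack"
z <- FALSE
xyz <- c(x, y, z)
xyz              # all three elements are now character strings
class(xyz)       # "character"
class(c(x, z))   # "numeric": without a character value, FALSE is converted to 0
```

This is why the conversion functions on the next slide matter: once a number has become a character string, arithmetic on it no longer works until it is converted back.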
21/60
Data Type Conversion
► To numeric
as.numeric(y) # NA with a warning: "jack" cannot be read as a number
as.numeric(z) # 0 (FALSE becomes 0, TRUE becomes 1)
as.numeric("42") # 42
► To character
as.character(x)
as.character(z)
►Assign the result to the variable to overwrite it!
x<-as.character(x)
x
22/60
More R commands
►Simple Operations
5+4
►Less Simple Operations
sqrt(144)
2^4
log10(1000)
►Vectors and Vectors operation
vec<-c(3,12,15,20)
mean(vec)
sd(vec)
23/60
R Code Aesthetics
►Comment Lines
# This is a comment line
a<-5
# a<-4
a<-a+1 # I can write comments also here
►Change Editor Theme
Tools > Global Options > Appearance > Editor Theme > Cobalt > Apply
24/60
R Quality of Life
► Show all objects in workspace
ls()
► Show current work directory
getwd()
► Remove an object from workspace
rm(a)
► Remove ALL objects from workspace
rm(list=ls())
► Change current work directory
setwd(".")
► Help!
?rnorm
25/60
The Normal Distribution
►Generate 1000 random values from a normal dist
gauss<-rnorm(1000)
►Calculate the mean
mean(gauss)
►Calculate the standard deviation
sd(gauss)
►Plot the values
plot(gauss)
►Plot the density of the values
plot(density(gauss))
26/60
Plotting Stuff
►Generate a Vector
gauss<-rnorm(1000)
►Plot the values
plot(density(gauss))
hist(gauss)
►Function Arguments
hist(gauss,xlab="My Numbers",ylab="How Often")
hist(gauss,n=30)
hist(gauss,n=30,col="orange")
27/60
Plotting Multiple Stuff
►Generate two Vectors
gauss1<-rnorm(1000,mean=1,sd=1)
gauss2<-rnorm(1000,mean=1,sd=3)
►Plot the density of both
plot(density(gauss1),lwd=1,main="Gauss Title")
lines(density(gauss2),lwd=3,lty=2)
►Give a Legend
legend("topright",legend=c("SD1","SD3"),lwd=c(1,3),lty=c(1,2))
►Add a grid
grid(col="black",lwd=0.5,lty=3)
► Increase plot area with xlim
plot(density(gauss1),xlim=c(-3,4))
lines(density(gauss2))
28/60
Recap of some plotting parameters
• col defines the plot track colors
– number (1 is black, 2 is red and so on)
– name ("black", "red", "cornflowerblue", "magenta")
• lty defines the line type
– 1 is solid line, 2 is dashed line, and so on
• lwd defines the line width
– 1 is standard, 2 is twice the standard, and so on
• More: main, xlab, ylab, xlim, ylim
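The parameters recapped above can be combined in a single call. The following is a small sketch rather than a reference plot: the data, the title and the temporary PDF file are assumptions made for illustration.

```r
set.seed(1)                        # reproducible random values
values <- rnorm(200)               # hypothetical example data
out <- tempfile(fileext = ".pdf")  # temporary file; use a real name to keep the plot
pdf(out)
plot(density(values),
     col = "cornflowerblue",       # colour given by name
     lty = 2,                      # dashed line
     lwd = 3,                      # three times the standard width
     main = "Plotting parameters", xlab = "Value", ylab = "Density",
     xlim = c(-6, 6))
lines(density(values * 2), col = 2, lty = 1, lwd = 1)  # colour given by number (2 is red)
dev.off()
```

In RStudio the pdf() and dev.off() lines can be dropped to draw directly in the Plots window.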
29/60
Exercises!
• Try to reproduce every detail of this plot
30/60
Solution
# Exercise
set.seed(1) # sets the random seed to make the example fully reproducible
# Generate vectors
gauss1<-rnorm(1000,mean=0,sd=0.5)
gauss2<-rnorm(1000,mean=0,sd=1)
gauss3<-rnorm(1000,mean=3,sd=1)
gauss4<-rnorm(1000,mean=1,sd=3)
# Plot them
plot(density(gauss1),lwd=3,lty=1,col="black",main="Normal Distributions",
xlab="Value",ylab="Frequency",xlim=c(-5,5))
lines(density(gauss2),lwd=3,lty=2,col="red")
lines(density(gauss3),lwd=3,lty=3,col="blue")
lines(density(gauss4),lwd=3,lty=4,col="green")
grid(col="grey",lty=2)
legend("topleft",col=c("black","red","blue","green"),lty=c(1,2,3,4),lwd=3,
legend=c("Mean=0,SD=0.5","Mean=0,SD=1","Mean=3,SD=1","Mean=1,SD=3")
)
31/60
Increasing Dimensions
►Point
1
►Vector
c(2,3,4,5,6,7)
c(2:7)
c(7:2,15,0:3/10)
►Matrix
matrix("empty",nrow=10,ncol=5)
matrix(rnorm(50),nrow=10,ncol=5)
►Array
array(1:24,dim=c(4,3,2))
32/60
Loops
►For loop
v<-(1:5)^2
length(v)
for (i in 1:length(v)){
message("Row:",i," value:", v[i])
}
►Another example
v<-1:5
v2<-v
for (i in 1:length(v)){
v2[i]<-v[i]+10
}
v2
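The second loop above can also be written without a loop, because arithmetic in R works element by element on whole vectors. A short sketch comparing the two approaches (the name v3 is introduced here only for illustration):

```r
v <- 1:5
# Loop version, as on the slide
v2 <- v
for (i in seq_along(v)) {
  v2[i] <- v[i] + 10
}
# Vectorized version: 10 is added to every element at once
v3 <- v + 10
identical(v2, v3)  # TRUE
```

The loop is still worth knowing, since not every operation can be vectorized, but the vectorized form is usually shorter and faster.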
33/60
Extracting Matrix Portions
►Row name and Column names
m<-matrix(rnorm(50),nrow=10,ncol=5)
rownames(m)<-LETTERS[1:10]
colnames(m)<-LETTERS[11:15]
m
View(m) # This is dangerous for big matrices (>10k cells)
►Nested Loop within a matrix
for(i in 1:nrow(m)){
for(j in 1:ncol(m)){
message(i,",",j," ",m[i,j])
}
}
34/60
Extracting Matrix Portions (2)
►Getting an entire row
m[1,]
m["E",]
►Getting an entire column
m[,5]
m[,"O"]
►Combine pieces
cbind(m[,5],m[,2])
rbind(m[2:3,],m[c(1,4),])
►Size commands
nrow(m)
ncol(m)
35/60
Exercises!
• Generate a matrix with random values
– 1000 rows, 20 columns
– No need to set rownames() and colnames()
• Calculate the mean and standard deviation of each row
• Generate a second matrix where each value of the first matrix is squared
• Plot the distributions of the values of both matrices
36/60
Solution
# Exercise to calculate mean and standard deviation of each row
set.seed(1)
m<-matrix(rnorm(20*1000),nrow=1000,ncol=20)
for(i in 1:nrow(m)){
thisrow<-m[i,]
imean<-mean(thisrow)
message("Mean of row ",i,": ",imean)
isd<-sd(thisrow)
message("Standard Deviation of row ",i,": ",isd)
}
37/60
Solution
# Exercise to square each value
set.seed(1)
m<-matrix(rnorm(20*1000),nrow=1000,ncol=20)
m2<-matrix(NA,nrow=1000,ncol=20)
for(i in 1:nrow(m)){
for(j in 1:ncol(m)){
m2[i,j]<-m[i,j]^2
}
}
m2
# Plot values
plot(density(m2),col="salmon",lwd=3,lty=2)
lines(density(m),col="navy",lwd=3,lty=1)
38/60
apply
►Calculating mean of each row of a big matrix:
m<-matrix(rnorm(100*2000),nrow=2000,ncol=100)
row_means<-c() # Empty vector
for(i in 1:nrow(m)){
newmean<-mean(m[i,])
row_means<-c(row_means,newmean)
}
row_means[1:10] # Print the first 10 row means
►A faster and more elegant way to do the same is to use apply
row_means<-apply(m,1,mean)
row_means[1:10]
column_sds<-apply(m,2,sd)
# General form: apply(matrix,dimension,function_name), where dimension 1 means rows and 2 means columns
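To see what the dimension argument of apply does, it helps to try it on a matrix small enough to check by hand. A minimal sketch (the matrix small is made up for this example):

```r
small <- matrix(1:6, nrow = 2, ncol = 3)  # filled column by column
small
apply(small, 1, sum)   # one value per row:    9 12
apply(small, 2, sum)   # one value per column: 3 7 11
# For sums and means, base R also provides dedicated shortcuts
rowMeans(small)        # same result as apply(small, 1, mean)
```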
39/60
Functions in R
►R function structure
function_name<-function(input1,input2){
command1
command2
return(output)
}
►A custom-made function to calculate the mean
mymean<-function(input1){
res<-sum(input1)
res<-res/length(input1)
return(res)
}
mymean(1:10)
40/60
Plotting Multiple Variables
►We have two vectors of the same length
x<-runif(100,min=1830,max=1900)
y<-x/mean(x)+rnorm(100,sd=0.01)+60
plot(x,y) # a scatterplot
41/60
Plotting Multiple Variables
►Beautify the Scatterplot
plot(x,y,cex=4)
plot(x,y,pch=20)
plot(x,y,pch=20,xlab="Year",ylab="Weight", main="Queen Victoria",col="red")
►Regression line
lm1<-lm(y~x)
lm1
abline(lm1)
42/60
Exercises!
• Write a function to calculate the Geometric Mean of a set of numbers
– Hint: ?prod and the exponent ^
• Generate a random matrix (using rnorm with mean=1) with 300 rows and
200 columns
• Using apply, calculate the mean and geometric mean of each row
• Compare the calculated means and geometric means using a
scatterplot
• Run a red regression line through that scatterplot
• Repeat the scatterplot using rnorm with mean=0, 1, 10 and 100
43/60
Solutions
# Geometric mean function
geomean<-function(a){
product<-prod(a)
exponent<-1/length(a)
res<-product^exponent
return(res)
}
geomean(1:10)
# Arithmetic vs Geometric mean
variable<-1 # Try 0, 1, 10, 100, 1000
m<-matrix(rnorm(300*200,mean=variable),nrow=300,ncol=200)
means<-apply(m,1,mean)
geomeans<-apply(m,1,geomean)
plot(means,geomeans,pch=20,xlab="Arithmetic Mean",ylab="Geometric mean",
main=paste0("rnorm_mean=",variable))
lm1<-lm(geomeans~means)
abline(lm1,col="red")
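One practical caveat with a product-based geometric mean: the product of many numbers can overflow to Inf, and a negative product (which rnorm with a small mean can easily produce) gives NaN once the fractional exponent is applied. A common alternative, sketched here under the assumption that all inputs are positive, averages the logarithms instead. The names geomean_prod and geomean_log are hypothetical.

```r
geomean_prod <- function(a) prod(a)^(1 / length(a))   # product-based version
geomean_log  <- function(a) exp(mean(log(a)))         # log-based version, positive inputs only
geomean_prod(c(1, 10, 100))   # approximately 10
geomean_log(c(1, 10, 100))    # approximately 10
big <- 1:200
geomean_prod(big)  # Inf, because prod(1:200) overflows
geomean_log(big)   # a finite value
```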
44/60
More plotting tricks
►Histogram
hist(rnorm(1000))
hist(runif(1000))
hist(rpois(1000,lambda=1))
►Quantile plots
qqnorm(rpois(1000,lambda=1))
qqnorm(rpois(1000,lambda=5))
qqnorm(rpois(1000,lambda=100))
qqnorm(rnorm(1000))
►Boxplots
x<-rnorm(1000,mean=1)
y<-rnorm(1000,mean=2)
z<-rnorm(1000,mean=2,sd=10)
boxplot(x,y,z)
45/60
Saving Plots to a file
x<-(1:10)^2
pdf("testplot.pdf")
plot(x)
dev.off()
►PNG
png("testplot.png")
plot(x,type="l") # "l" draws a line; longer names such as "line" trigger a warning
dev.off()
png("testplot.png",width=2000,height=1000,pointsize=30)
plot(x,type="l")
dev.off()
46/60
Remember Linux?
►We can still use it from within R! Using the system function
system("echo hello!")
►Let's get an expression file
command<-"wget https://www.dropbox.com/s/zo6j3cgby2l41q7/geneExpressionV2.csv"
system(command)
►Everything works the same way (except tools requiring interaction)
system("ls")
system("wc -l geneExpressionV2.csv")
system("head -n2 geneExpressionV2.csv")
system("less -S geneExpressionV2.csv")
47/60
Input/Output
► On any OS, you can download a file from the internet using download.file()
link<-"https://www.dropbox.com/s/zo6j3cgby2l41q7/geneExpressionV2.csv?dl=1"
download.file(link,destfile="geneExpressionV2.csv")
►Some read functions
mydata<-read.csv("geneExpressionV2.csv") # specific CSV function
head(mydata)
mydata<-read.delim("geneExpressionV2.csv",sep=",")
mydata[1:5,1:5]
►More options
file<-"geneExpressionV2.csv"
mydata<-read.delim(file,sep=",",row.names=1,header=TRUE)
rownames(mydata)
colnames(mydata)
dim(mydata)
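The slide covers reading; writing follows the same pattern. A minimal sketch using a small made-up data frame (the object toy and the temporary file are assumptions for illustration, not part of the course data):

```r
toy <- data.frame(gene = c("TP53", "MYC"), expression = c(7.2, 9.1))
outfile <- tempfile(fileext = ".csv")   # swap in a real file name to keep the file
write.csv(toy, outfile, row.names = FALSE)
# Read it back to check the round trip
back <- read.csv(outfile)
back
dim(back)   # 2 rows, 2 columns
```

write.table() works the same way and lets you choose the separator, for example sep="\t" for tab-delimited text.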
48/60
Conditions
► In R, conditional instructions can be used
if(condition1) {
instruction1
}else if(condition2){
instruction2
}else{
instruction3
}
►Example
age<-300
if(age>100) { # age>100 returns TRUE or FALSE
message("You are old")
} else {
message("You are young")
}
49/60
Exercises!
• The geneExpressionV2.csv (mydata) is a Gene Expression Dataset
mydata<-read.delim("geneExpressionV2.csv",sep=",",row.names=1,header=TRUE)
mydata<-as.matrix(mydata) # Otherwise it's a data.frame (we will talk about that in the future)
– Already normalized (the higher the number, the more expressed the gene)
– columns 1 to 32 are breast cancer samples
– columns 33 to 64 are healthy (control) samples
• How many genes do you have? (Rows)
• Using a for loop and conditionals, get the highest value in the matrix, plus
its row and column locations
• Print in a nice looking PNG the gene expression profile of
– the gene with the highest Standard Deviation
– the gene with the lowest Standard Deviation
50/60
Solutions
# How many genes do you have? (Rows)
?dim
nrow(mydata)
# Using a for loop and conditionals, get the highest value in the matrix,
# plus its row and column locations
currentmax<- -Inf # start below any possible value, so all-negative data also work
currentmaxrow<-NA
currentmaxcol<-NA
for(i in 1:nrow(mydata)){
for(j in 1:ncol(mydata)){
herevalue<-mydata[i,j]
if(herevalue>currentmax){
currentmax<-herevalue
currentmaxrow<-i
currentmaxcol<-j
}
}
}
message("Max value: ",currentmax)
message("Row: ",currentmaxrow)
message("Row name: ",rownames(mydata)[currentmaxrow])
message("Col: ",currentmaxcol)
message("Col name: ",colnames(mydata)[currentmaxcol])
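The exercise asks for a loop, but the same answer can be cross-checked with built-in functions. A minimal sketch on a small made-up matrix (toy and its row and column names are stand-ins for mydata, chosen for illustration):

```r
set.seed(1)
toy <- matrix(rnorm(12), nrow = 3, ncol = 4)   # hypothetical stand-in for mydata
rownames(toy) <- paste0("gene", 1:3)
colnames(toy) <- paste0("sample", 1:4)
max(toy)                                        # highest value in the matrix
pos <- which(toy == max(toy), arr.ind = TRUE)   # row and column index of that value
pos
rownames(toy)[pos[1, "row"]]                    # row name of the maximum
colnames(toy)[pos[1, "col"]]                    # column name of the maximum
```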
51/60
Solutions
# Print in a nice looking PNG the gene expression profile of the genes with the
# highest and lowest Standard Deviation
sds<-apply(mydata,1,sd)
mymax<-which.max(sds)
mymin<-which.min(sds)
maxname<-rownames(mydata)[mymax]
minname<-rownames(mydata)[mymin]
png("lineplots.png",width=2000,height=1000,pointsize=40)
plot(mydata[mymax,],type="l",col="red",lwd=3,ylab="Gene Expression")
lines(mydata[mymin,],col="cornflowerblue",lwd=3)
legend("topright",legend=c(maxname,minname),lwd=3,col=c("red","cornflowerblue"))
dev.off()
52/60
• R contains some functions from base packages
• There are more R packages out there
• Three main repositories
– CRAN, The Comprehensive R Archive Network (For Standard Statistical
Analysis)
– Bioconductor (Focused on Bioinformatics)
– R-Forge (for prototype packages)
Installing Packages
53/60
Using the CRAN Task Views
Finding packages in CRAN
54/60
https://www.bioconductor.org/packages/release/BiocViews.html
Finding packages in Bioconductor
55/60
Installing and Loading Packages
► Installation from CRAN
library("randomForest") # fails if the package is not installed yet
install.packages("randomForest")
library("randomForest")
► Installation from Bioconductor
install.packages("BiocManager") # BiocManager replaced the older biocLite() installer
BiocManager::install("affy")
► This will also recursively install all dependencies
►Load Library
library("affy")
56/60
Package Vignettes
►A "vignette" is common R slang for a colorful R package tutorial
► It should be informative, easy to load and run, with
small input data to play with
install.packages("BiocManager") # if not installed already; it replaced the older biocLite() installer
BiocManager::install("ComplexHeatmap")
library("ComplexHeatmap")
browseVignettes("ComplexHeatmap")
►You can always ask for help for any function or package:
?Heatmap
??Heatmap
57/60
Most Downloaded Bioconductor Packages
The vast majority are for Genome Annotation, RNA-Seq and microarray analysis
58/60
Most Downloaded Bioconductor Packages
The vast majority are for Genome Annotation, RNA-Seq and microarray analysis
59/60
Enrichment Analysis
Genes you get from an analysis:
TP53
KRAS
GATA3
RUNX2
MAPK
MYC
MYCN
ESR1
CCND1
CTNNB1
vs.
The Entire World of Gene Sets
Part 1: Over-Representation Analysis
60/60
Enrichment Analysis
My list:
TP53
KRAS
GATA3
RUNX2
MAPK
MYC
MYCN
ESR1
CCND1
CTNNB1
vs.
List 1, List 2, List 3, … List n
There are at least 17,810 curated gene
lists in the MSigDB
collection alone
[Venn diagram: My list, Another list (e.g. cell cycle genes), and their intersection]
To test whether our gene list is enriched for
a particular group of genes, we must
perform a test that takes into consideration
• The size of the intersection
• The size of our list
• The size of the compared list
• The total number of lists we are testing
(False Discovery Rate)
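The four ingredients listed above can be put together in a few lines. This is a rough sketch under stated assumptions, not the method of any specific package: the universe, the gene list and the three gene sets are all made up, and a one-sided Fisher's Exact Test followed by Benjamini-Hochberg correction is just one reasonable choice.

```r
universe <- paste0("Gene_", 1:100)   # hypothetical background of 100 genes
mylist   <- universe[1:20]           # hypothetical list obtained from an analysis
genesets <- list(                    # hypothetical gene sets to compare against
  setA = universe[1:15],             # strong overlap with mylist
  setB = universe[15:40],            # partial overlap
  setC = universe[60:90]             # no overlap
)
pvalues <- sapply(genesets, function(set) {
  inter    <- length(intersect(mylist, set))   # size of the intersection
  onlymine <- length(mylist) - inter           # genes in our list only
  onlyset  <- length(set) - inter              # genes in the compared list only
  rest     <- length(universe) - inter - onlymine - onlyset
  ctable   <- rbind(c(inter, onlymine), c(onlyset, rest))
  fisher.test(ctable, alternative = "greater")$p.value
})
pvalues
p.adjust(pvalues, method = "BH")     # accounts for the number of lists tested
```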
Part 1: Over-Representation Analysis
61/60
Contingency Tables and Fisher's Exact Test
                       In the Other Set     Not in the Other Set
In Our Gene Set        Intersection Size    Ours Only Size
Not in Our Gene Set    Other Only Size      Rest of the Genome
From the comparison, we derive a
Contingency Table
From a Contingency Table we can run a Fisher's Exact Test to
determine the significance of the intersection.
Devised by Sir Ronald Fisher to test the claim of a
lady who said she could tell whether tea or milk had been
poured into a cup first (the "lady tasting tea" experiment).
62/60
Do your own Fisher's Exact Test!
►Let's create two partially overlapping groups
g1<-paste0("Gene_",LETTERS[1:10])
g2<-paste0("Gene_",LETTERS[7:11])
►Assess Intersections
setdiff(g1,g2)
intersect(g1,g2)
setdiff(g2,g1)
► The Fisher's Exact Test Function
contingency<-rbind(c(4,6),c(1,15))
fisher.test(contingency)
Venn diagram counts: 6 (only in g1), 4 (intersection), 1 (only in g2)
Contingency table:
4   6
1   26-4-6-1
26 is the "Universe", the total
number of elements (even
those outside the two groups)
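The numbers in the contingency table do not have to be typed by hand: they can be derived from the groups themselves. A small sketch, assuming the universe consists of one gene per letter of the alphabet (26 elements); the variable names both, onlyg1, onlyg2 and rest are introduced here for illustration.

```r
g1 <- paste0("Gene_", LETTERS[1:10])
g2 <- paste0("Gene_", LETTERS[7:11])
universe <- paste0("Gene_", LETTERS)                  # assumption: 26 genes in total
both   <- length(intersect(g1, g2))                   # 4
onlyg1 <- length(setdiff(g1, g2))                     # 6
onlyg2 <- length(setdiff(g2, g1))                     # 1
rest   <- length(universe) - both - onlyg1 - onlyg2   # 15
contingency <- rbind(c(both, onlyg1), c(onlyg2, rest))
contingency
fisher.test(contingency)$p.value
```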
63/60
Exercises Part 1
• Download a list of all dead characters (up to Book 5) of a certain
TV series https://www.dropbox.com/s/rhsdyl2gkpzjar9/fisher.zip?dl=0
• Unzip it in your R folder
• Load the files as separate vectors in R. In order to have them as
character vectors, I suggest you use this code:
dead<-read.delim("dead.txt",as.is=TRUE)
dead<-dead[,1]
• Question: is the dead names list enriched for a particular House?
• Calculate the Fisher's Exact Test p-value for each Allegiance (Arryn,
Stark, Baratheon, Lannister, Greyjoy and Night's Watch)
64/60
Exercises Part 2
• A useful way to visually compare lists to find multiple overlaps is to use
Venn Diagrams. Learn how to use Venn Diagrams by reading this R
documentation and plot four groups: https://www.rdocumentation.org/packages/VennDiagram/versions/1.6.20/topics/venn.diagram
65/60
Exercises Part 3
• By default, fisher.test() runs a two-sided test: it looks for imbalances in our intersection in either direction (i.e. it tests
whether the intersection is greater or smaller than expected by chance). Play with the parameters in the fisher.test()
function to calculate the significance of an intersection being greater than expected
Example 1:
3     600
150   100
p=0, pless=0, pgreater=1
In this case, the Intersection is
significantly smaller than expected
Example 2:
30    5
35    200
p=0, pless=1, pgreater=0
In this case, the Intersection is
significantly greater than expected
Example 3:
31    50
40    100
p=0.18, pless=0.94, pgreater=0.09
In this case, the Intersection is greater than
(but not significantly different from) what is
expected by chance
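The parameter in question is alternative. A minimal sketch using the third table above (31, 50, 40, 100); the variable names are introduced for illustration, and the values printed should be compared with the ones reported on the slide rather than taken from this note.

```r
ctable <- rbind(c(31, 50), c(40, 100))
p_two     <- fisher.test(ctable)$p.value                           # two-sided (default)
p_greater <- fisher.test(ctable, alternative = "greater")$p.value  # intersection larger than expected
p_less    <- fisher.test(ctable, alternative = "less")$p.value     # intersection smaller than expected
round(c(two.sided = p_two, greater = p_greater, less = p_less), 2)
```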
66/60
Exercises Part 4
• What is the Allegiance with the lowest p-value?
• Can you name all intersections between the groups (except all.txt and
dead.txt)?
• Since you calculated multiple tests, you need to correct your p-values (the
more tests you do, the more likely you are to get a significant result by chance).
Use the function p.adjust(). This is called multiple testing correction (False Discovery Rate
correction when the Benjamini-Hochberg method is used) and you can think of it as the p-value (significance) of the p-value.
There are several methods to correct the p-value
• Bonferroni correction: multiply the p-value by the number of p-values calculated (capping the result at 1),
e.g. p1=0.007, p2=0.1, p3=0.4 become p1=0.021, p2=0.3, p3=1
• Benjamini-Hochberg correction: less stringent, it "punishes" p-values according to
their ranking (the lower, the more it is corrected) e.g. p1=0.007, p2=0.1, p3=0.4
become p1=0.021, p2=0.15, p3=0.4
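The two worked examples above can be reproduced directly with p.adjust(). A minimal sketch using the same three p-values (the names p1, p2 and p3 are kept only as labels):

```r
p <- c(p1 = 0.007, p2 = 0.1, p3 = 0.4)
p.adjust(p, method = "bonferroni")   # 0.021 0.300 1.000
p.adjust(p, method = "BH")           # 0.021 0.150 0.400
```

The input order does not matter: p.adjust() ranks the values internally and returns the corrected values in the original order.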
67/60
Enrichment Analysis
1. Over Representation Analysis (ORA)
Sometimes, what you have is simply a list of
genes, generated in multiple ways
(e.g. mutated/not mutated, or which genes are
upregulated by a drug more than two-fold)
[Venn diagram: My list, Another list (e.g. cell cycle genes)]
In that previous statement, we had to define our gene list with an arbitrary cutoff (two-fold),
but in that way we may lose a lot of genes that are just below the cutoff. A way to avoid this
is…
2. Gene Set Enrichment Analysis (GSEA)
A list of genes is overlapped over a Signature, i.e. a transcriptome-wide vector with weights
(e.g. up/down regulation score) assigned to each gene
[Figure: GSEA of a Gene Set over Our Gene Signature (e.g. 20,000 genes, Log2(Cancer/Normal) Expression)]
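The idea of overlapping a gene set over a ranked signature can be illustrated with a simplified running score. This is a rough sketch of the unweighted (Kolmogorov-Smirnov-like) version only: real GSEA implementations typically weight each step by the gene score and estimate significance by permutation. The signature, the gene names and the gene set below are all made up.

```r
set.seed(1)
# Hypothetical signature: 1000 genes with one score each (e.g. a log2 fold change)
signature <- rnorm(1000)
names(signature) <- paste0("Gene_", 1:1000)
signature <- sort(signature, decreasing = TRUE)   # rank from most up to most down
# Hypothetical gene set: 30 genes drawn from the 100 most upregulated ones
geneset <- sample(names(signature)[1:100], 30)
hit <- names(signature) %in% geneset
# Walk down the ranking: step up at each gene of the set, step down otherwise
step <- ifelse(hit, 1 / sum(hit), -1 / sum(!hit))
running <- cumsum(step)
es <- running[which.max(abs(running))]            # largest deviation from zero
es
```

Calling plot(running, type="l") draws the familiar enrichment curve; a set concentrated at the top of the ranking, as here, gives a score close to 1, while a set scattered at random stays near 0.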
68/60
Enrichment: Concept
Part 2: Gene Set Enrichment Analysis
GSEA has become a very popular tool in
Genomics and Transcriptomics analyses,
partially thanks to its sensitivity
Look at case S3: no ORA could find a
significance there
69/60
Exercises!
• Your future job requires you to familiarize yourself with the enrichplot package
• Install it
• Read the vignette
browseVignettes("enrichplot")
Preferably use the updated one, dated 2018-10-30
https://bioconductor.org/packages/release/bioc/vignettes/enrichplot/inst/doc/enrichplot.html
url<-"https://www.dropbox.com/s/ld2s58jfbv70kha/geneList.rda?dl=1"
download.file(url,destfile="geneList.rda",mode="wb")
load("geneList.rda")
plot(density(geneList),xlab="Log2 Fold Change")
• Run the vignette, but use this gene list (targets of the MYCN
oncogene in neuroblastoma, from the Cancer Gene Census)
Hints:
• Plots will look way better if using the png() function with high width, height and res
70/60
Installation and gene set definition
# Update Bioconductor (optional and long, but it will guarantee you have the latest packages)
# BiocManager::install()
## Install library
install.packages("BiocManager") # if not installed already; it replaced the older biocLite() installer
BiocManager::install("enrichplot")
library("enrichplot")
library("DOSE")
# Generate a list of differentially expressed genes (absolute Log2FC above 2)
de <- names(geneList)[abs(geneList) > 2]
71/60
Visualize the Data
# Gene Fold Change Visualization
barplot(geneList,ylab="Fold Change")
# Visualize only de genes
barplot(geneList[de])
# Convert entrez genes into gene symbols
library('org.Hs.eg.db')
# Symbol list
desymbols<- mapIds(org.Hs.eg.db, de, 'SYMBOL', 'ENTREZID')
symbols<-geneList
names(symbols)<-mapIds(org.Hs.eg.db, names(geneList), 'SYMBOL', 'ENTREZID') # column to return first, then the type of the input keys
# Plot
sublist<-symbols[desymbols]
cols<-rep("red3",length(sublist))
cols[sublist<0]<-"cornflowerblue"
png("degenes.png",w=18000,h=1500,res=200)
par(las=2)
barplot(sublist,col=cols,ylab="Fold Change")
dev.off()
[Plot: bar chart of Log2(Tumor/Normal) per gene, with Oncogenes and Tumor Suppressor Genes labeled]
72/60
Enrichment Analysis
# Over Representation Analysis
edo <- enrichDGN(de)
# Gene Set Enrichment Analysis
edo2 <- gseNCG(geneList, nPerm=10000)
Do these genes belong to particular groups more often than expected by chance?
73/60
Enrichment Barplot
## Visualization of Analysis
# Bar plot
png("barplot.png",w=2000,h=1500,res=100)
barplot(edo, showCategory=20)
dev.off()
74/60
Dot Plot
# Dot plot
library(ggplot2)
p1 <- dotplot(edo, showCategory=30) + ggtitle("dotplot for ORA")
p2 <- dotplot(edo2, showCategory=30) + ggtitle("dotplot for GSEA")
library(cowplot)
png("dotplots.png",w=2000,h=1500,res=200)
plot_grid(p1, p2, ncol=2)
dev.off()
75/60
Gene Network
## Gene-Concept Network
# Convert Entrez ids to gene symbols
edox <- setReadable(edo, 'org.Hs.eg.db', 'ENTREZID')
# Radial
png("cnet.png",w=2000,h=1500,res=100)
cnetplot(edox, foldChange=geneList)
dev.off()
76/60
Gene Network
# Circular
png("cnet_circular.png",w=2500,h=1500,res=100)
cnetplot(edox, foldChange=geneList, circular = TRUE,
colorEdge = TRUE)
dev.off()
77/60
Genes belonging to Multiple Groups
## UpSet plot
png("upset.png",w=2500,h=1500,res=200)
upsetplot(edo)
dev.off()
78/60
Heatmap
## Heatmap-like
png("heatmap.png",w=2000,h=1500,p=60)
heatplot(edox)
dev.off()
79/60
Heatmap
# Heatmap with score
png("heatmap_score.png",w=3000,h=1500,res=300)
heatplot(edox, foldChange=geneList)
dev.off()
80/60
Overlapping Gene Sets
# Enrichment Map
png("emapplot.png",w=3000,h=1500,res=300)
emapplot(edo)
dev.off()
81/60
Significance of up/down regulated
# Ridgeline plot
png("ridgeplot.png",w=3000,h=1500,res=300)
ridgeplot(edo2)
dev.off()
82/60
Visualization of a gene group over our signature
# GSEA plot
png("gseaplot.png",w=3000,h=1500,res=300)
i<-1
gseaplot2(edo2, geneSetID = i, title = edo2$Description[i])
dev.off()
83/60
How the gene groups are cited in Pubmed
# Pubmed trend plot
png("pubmed.png",w=3000,h=1500,res=300)
terms <- edo$Description[1:20]
pmcplot(terms, 2010:2018, proportion=FALSE)
dev.off()
www.giorgilab.org
Federico M. Giorgi, PhD
Department of Pharmacy and Biotechnology
to EF, bringer of salted breakfasts
85/60
Tricks
# Create a contingency table
universe<-length(all)
ul<-intersect(dead,wildling)
ur<-setdiff(dead,ul)
dl<-setdiff(wildling,ul)
ul<-length(ul)
ur<-length(ur)
dl<-length(dl)
dr<-universe-ul-ur-dl
ctable<-rbind(c(ul,ur),c(dl,dr))
# Create a 3-way Venn Diagram
library(VennDiagram)
grid.newpage() # start from an empty page (dev.off() fails when no plot device is open)
vennlist<-list(dead,stark,greyjoy)
venn.plot<-
venn.diagram(vennlist,NULL,fill=c("black","grey","orange"),alpha=0.5,
cex=2,cat.fontface=4,
category.names=c("Dead","Stark","Greyjoy"),
main="War of Five Kings")
grid.draw(venn.plot)