Canadian Bioinformatics Workshops www.bioinformatics.ca
Dec 18, 2015
Lecture 7: ML & Data Visualization & Microarrays
MBP1010
Dr. Paul C. Boutros
Winter 2015
Department of Medical Biophysics
This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others
Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Machine-Learning
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Sequence Analysis
• Final Exam (written)
House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions
Topics For This Week
• Machine-learning 101 (Briefly)
• Data visualization 101
• Attendance
• Microarrays 101
# Cell-cycle expression data: first 50 genes, expression columns only (genes x time points)
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
# Euclidean distances between gene expression profiles
D.cho <- dist(cho.data, method = "euclidean")
# Hierarchical clustering with single linkage
hc.single <- hclust(D.cho, method = "single", members = NULL)
Example: cell cycle data
plot(hc.single)
Single linkage
Example: cell cycle data
Be careful when interpreting dendrograms: the ordering of the leaves suggests a proximity between elements that does not necessarily reflect their actual distance (cf. genes #1 and #47).
Example: cell cycle data
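Adjacent leaves in a dendrogram are not necessarily close in the original metric. A minimal check in R, assuming the objects defined above (the gene indices are just the pair mentioned on the slide):

# Convert the 'dist' object to a full matrix so individual pairs can be looked up
D.mat <- as.matrix(D.cho)
# Actual Euclidean distance between two genes that sit near each other in the dendrogram
D.mat[1, 47]
# Compare with the cophenetic distance implied by the dendrogram,
# i.e. the height at which the two leaves are first joined
coph <- cophenetic(hc.single)
as.matrix(coph)[1, 47]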
Single linkage, k=2
# Draw boxes around the k = 2 clusters on the dendrogram plotted above
rect.hclust(hc.single, k = 2)
Example: cell cycle data
Single linkage, k=3
rect.hclust(hc.single,k=3)
Example: cell cycle data
Single linkage, k=4
rect.hclust(hc.single,k=4)
Example: cell cycle data
Single linkage, k=5
rect.hclust(hc.single,k=5)
Example: cell cycle data
Single linkage, k=25
rect.hclust(hc.single,k=25)
Example: cell cycle data
# Cut the single-linkage tree into k = 4 clusters
class.single <- cutree(hc.single, k = 4)
# One panel per cluster: time-course of every gene assigned to that cluster
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l",
        xlab = "time", ylab = "log expression value")
# as.matrix() (rather than t()) is used here because this cluster appears to contain a
# single gene: the subset drops to a vector, and as.matrix() keeps it as one column
# so matplot() still draws one line
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l",
        xlab = "time", ylab = "log expression value")
Properties of cluster members, single linkage, k=4
Example: cell cycle data
[Figure: time-course expression profiles of clusters 1-4, single linkage, k = 4]
Example: cell cycle data
[Figures: time-course expression profiles of clusters 1-4, complete linkage k = 4 vs. single linkage k = 4]
Example: cell cycle data
Hierarchical clustering analyzed

Advantages:
• Small clusters nested inside large ones are captured by the hierarchy
• No need to specify the number of groups ahead of time
• Flexible choice of linkage methods

Disadvantages:
• Clusters might not be naturally represented by a hierarchical structure
• It is necessary to 'cut' the dendrogram to produce discrete clusters
• Bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be 'undone'
Partitioning methods
• Anatomy of a partitioning-based method:
  • data matrix
  • distance function
  • number of groups
• Output:
  • group assignment of every object
Partitioning-based methods
• Choose K groups
• Initialise group centers (a.k.a. centroids or medoids)
• Assign each object to the nearest centroid according to the distance metric
• Reassign (or recompute) the centroids
• Repeat the last two steps until the assignment stabilizes (see the sketch below)
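As a concrete illustration of these ingredients (distance matrix and number of groups in, group assignment out), here is a minimal sketch using pam() from the cluster package on the cell-cycle data loaded earlier; the choice of k = 4 is purely illustrative:

library(cluster)
# Inputs: a distance matrix and the number of groups
pam.cho <- pam(D.cho, k = 4, diss = TRUE)
# Output: a group assignment for every object
table(pam.cho$clustering)
# The medoids are actual objects (indices of rows in cho.data)
pam.cho$id.med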
K-means vs. K-medoids

K-means:
• Centroids are the 'mean' of the clusters
• Centroids need to be recomputed every iteration
• Initialisation is difficult, as the notion of a centroid may be unclear before beginning
• R function: kmeans()

K-medoids:
• Centroids (medoids) are actual objects that minimize the total within-cluster distance
• The medoid can be determined by a quick look-up in the distance matrix
• Initialisation is simply K randomly selected objects
• R function: pam()
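A small sketch contrasting the two functions in R (using the cell-cycle matrix from earlier; k = 4 is illustrative). kmeans() works on the raw data matrix, while pam() can also work directly from a distance matrix:

library(cluster)
set.seed(100)
km <- kmeans(cho.data, centers = 4)   # centroids = per-cluster mean profiles
pm <- pam(cho.data, k = 4)            # medoids  = actual rows of the data
# kmeans centroids are averaged profiles, generally not rows of cho.data
km$centers[, 1:5]
# pam medoids are real observations; their row indices:
pm$id.med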
Partitioning-based methods

Advantages:
• The number of groups is well defined
• A clear, deterministic assignment of each object to a group
• Simple algorithms for inference

Disadvantages:
• Have to choose the number of groups
• Sometimes objects do not fit well into any cluster
• Can converge on locally optimal solutions and often require multiple restarts with random initializations (see the sketch below)
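Because k-means can converge to a local optimum, a common practice is to run it from many random initializations and keep the best solution. A minimal sketch; the value nstart = 25 is an arbitrary illustrative choice:

set.seed(100)
# One random start: risk of a poor local optimum
km.1  <- kmeans(cho.data, centers = 4, nstart = 1)
# 25 random starts: kmeans() keeps the solution with the smallest
# total within-cluster sum of squares
km.25 <- kmeans(cho.data, centers = 4, nstart = 25)
c(one.start = km.1$tot.withinss, many.starts = km.25$tot.withinss)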
Given N items and K clusters, the goal is to minimize the total within-cluster sum of squares

$$\sum_{k=1}^{K} \sum_{i\,:\,C(i)=k} \lVert x_i - m_k \rVert^2$$

over the possible assignments $C$ and centroids $m_1, \ldots, m_K$, where $m_k$ represents the location of the $k$-th cluster.
K-means
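A quick numerical check of this objective against R's kmeans() output (a sketch using the cell-cycle matrix loaded earlier; k = 4 is illustrative):

set.seed(100)
km <- kmeans(cho.data, centers = 4)
# Recompute the objective by hand: squared distance of each gene
# to its assigned cluster centroid, summed over all genes
obj <- sum((cho.data - km$centers[km$cluster, ])^2)
# Should match the value reported by kmeans()
c(by.hand = obj, tot.withinss = km$tot.withinss)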
1. Divide the data into K clusters; initialize the centroids with the mean of each cluster
2. Assign each item to the cluster with the closest centroid
3. When all objects have been assigned, recalculate the centroids (means)
4. Repeat steps 2-3 until the centroids no longer move
K-means
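To make the four steps concrete, here is a bare-bones from-scratch sketch of the loop in R (for teaching only; in practice use kmeans(), which implements this far more efficiently). The function name and its internals are invented for illustration, and empty clusters are not handled:

simple.kmeans <- function(x, k, max.iter = 100) {
    # Step 1: random balanced initial assignment; centroids are the cluster means
    assign <- sample(rep(1:k, length.out = nrow(x)))
    for (iter in 1:max.iter) {
        centers <- t(sapply(1:k, function(j) colMeans(x[assign == j, , drop = FALSE])))
        # Step 2: assign each item to the closest centroid (Euclidean distance)
        d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
        new.assign <- apply(d, 1, which.min)
        # Step 4: stop when assignments (and hence centroids) no longer change
        if (all(new.assign == assign)) break
        assign <- new.assign   # Step 3: centroids are recomputed at the top of the loop
    }
    list(cluster = assign, centers = centers)
}

# Example usage on the cell-cycle data
set.seed(100)
simple.kmeans(cho.data, k = 4)$cluster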
# Simulated data: two Gaussian clouds in 2D
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Run k-means from the same random starting centers, stopping after 1, 2 and 3
# iterations to watch the centroids converge (stars mark the centers)
set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = 1)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = 2)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = 3)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)
K-means
K-means, k=4
# K-means with k = 4 on the cell-cycle data
set.seed(100)
km.cho <- kmeans(cho.data, 4)

# One panel per cluster: time-course of every gene assigned to that cluster
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l",
        xlab = "time", ylab = "log expression value")
K-means
[Figures: time-course expression profiles of clusters 1-4, K-means k = 4 vs. single linkage k = 4]
K-means
K-means and hierarchical clustering are simple, fast and useful techniques.
Beware of the memory requirements of hierarchical clustering (it needs the full pairwise distance matrix).
Both are a bit "ad hoc":
• How many clusters?
• Which distance metric?
• What counts as a good clustering?
Summary
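One common way to address "how many clusters?" and "is this a good clustering?" is the average silhouette width. A sketch using cluster::silhouette() on the objects defined above; the range of k values is illustrative:

library(cluster)
# Average silhouette width for k = 2..8 clusters (higher is better)
sil.width <- sapply(2:8, function(k) {
    set.seed(100)
    km <- kmeans(cho.data, centers = k)
    mean(silhouette(km$cluster, D.cho)[, "sil_width"])
})
names(sil.width) <- 2:8
sil.width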
Meta-Analysis
• Combining the results of multiple studies that address related hypotheses
• Often used to merge data from different microarray platforms
• Very challenging: it is unclear what the best approaches are, or how they should be adapted to the peculiarities of microarray data
Why Do Meta-Analysis?
• Can identify publication biases
• Appropriately weights diverse studies (illustrated in the sketch below) by:
  • sample size
  • experimental reliability
  • similarity of the study-specific hypotheses to the overall one
• Increases statistical power
• Condenses the information: a single meta-analysis vs. five large studies
• Provides clearer guidance
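To illustrate how studies can be weighted by their reliability, here is a minimal sketch of a fixed-effect (inverse-variance) meta-analysis of per-study effect sizes; the effect estimates and standard errors are made-up numbers for illustration:

# Hypothetical per-study effect estimates (e.g. log fold-changes) and standard errors
effect <- c(0.80, 0.45, 1.10, 0.60)
se     <- c(0.40, 0.15, 0.50, 0.20)
# Inverse-variance weights: more precise studies count for more
w <- 1 / se^2
# Pooled fixed-effect estimate and its standard error
pooled    <- sum(w * effect) / sum(w)
pooled.se <- sqrt(1 / sum(w))
c(estimate = pooled, se = pooled.se,
  z = pooled / pooled.se, p = 2 * pnorm(-abs(pooled / pooled.se)))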
Challenges of Meta-Analysis
• No control for bias
  • What happens if most studies are poorly designed?
• The file-drawer problem
  • Publication bias can be detected, but not explicitly controlled for
• How homogeneous are the data? Can they be fairly grouped?
  • Simpson's Paradox
Simpson’s Paradox
A trend that holds within each group can disappear or even reverse when the groups are merged, so group-wise correlations can be inverted in the pooled data. A cautionary note for all meta-analyses!
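A small simulated illustration (all numbers are invented): within each of two groups the correlation between x and y is positive, but because the group means are offset, the pooled correlation is negative:

set.seed(1)
# Two groups with a positive x-y relationship within each group,
# but group 2 is shifted to higher x and lower y
x1 <- rnorm(50);      y1 <- x1 + rnorm(50, sd = 0.5)
x2 <- rnorm(50) + 4;  y2 <- x2 + rnorm(50, sd = 0.5) - 8
c(within.group.1 = cor(x1, y1),
  within.group.2 = cor(x2, y2),
  pooled         = cor(c(x1, x2), c(y1, y2)))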
Topics For This Week
• Machine-learning 101 (Focus: Unsupervised)
• Data visualization 101