Canadian Bioinformatics Workshops www.bioinformatics.ca
Dec 18, 2015
Lecture 7: ML & Data Visualization & Microarrays
MBP1010
Dr. Paul C. Boutros
Winter 2015
Department of Medical Biophysics
This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others
Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Machine-Learning
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Sequence Analysis
• Final Exam (written)
House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions
Topics For This Week
• Machine-learning 101 (Briefly)
• Data visualization 101
• Attendance
• Microarrays 101
# Cell-cycle expression data: first 50 genes, expression columns only (genes x time points)
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
# Euclidean distances between gene expression profiles
D.cho <- dist(cho.data, method = "euclidean")
# Hierarchical clustering with single linkage
hc.single <- hclust(D.cho, method = "single", members = NULL)
Example: cell cycle data
plot(hc.single)
Single linkage
Example: cell cycle data
Be careful when interpreting dendrograms: the ordering of the leaves suggests a proximity between elements that does not necessarily reflect their actual distance (cf. genes #1 and #47).
Example: cell cycle data
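Adjacent leaves in a dendrogram are not necessarily close in the original metric. A minimal check in R, assuming the objects defined above (the gene indices are just the pair mentioned on the slide):

# Convert the 'dist' object to a full matrix so individual pairs can be looked up
D.mat <- as.matrix(D.cho)
# Actual Euclidean distance between two genes that sit near each other in the dendrogram
D.mat[1, 47]
# Compare with the cophenetic distance implied by the dendrogram,
# i.e. the height at which the two leaves are first joined
coph <- cophenetic(hc.single)
as.matrix(coph)[1, 47]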
Single linkage, k=2
# Draw boxes around the k = 2 clusters on the dendrogram plotted above
rect.hclust(hc.single, k = 2)
Example: cell cycle data
Single linkage, k=3
rect.hclust(hc.single,k=3)
Example: cell cycle data
Single linkage, k=4
rect.hclust(hc.single,k=4)
Example: cell cycle data
Single linkage, k=5
rect.hclust(hc.single,k=5)
Example: cell cycle data
Single linkage, k=25
rect.hclust(hc.single,k=25)
Example: cell cycle data
# Cut the single-linkage tree into k = 4 clusters
class.single <- cutree(hc.single, k = 4)
# One panel per cluster: time-course of every gene assigned to that cluster
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l",
        xlab = "time", ylab = "log expression value")
# as.matrix() (rather than t()) is used here because this cluster appears to contain a
# single gene: the subset drops to a vector, and as.matrix() keeps it as one column
# so matplot() still draws one line
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l",
        xlab = "time", ylab = "log expression value")
Properties of cluster members, single linkage, k=4
Example: cell cycle data
[Figure: time-course expression profiles of clusters 1-4, single linkage, k = 4]
Example: cell cycle data
[Figures: time-course expression profiles of clusters 1-4, complete linkage k = 4 vs. single linkage k = 4]
Example: cell cycle data
Hierarchical clustering analyzed

Advantages:
• Small clusters nested inside large ones are captured by the hierarchy
• No need to specify the number of groups ahead of time
• Flexible choice of linkage methods

Disadvantages:
• Clusters might not be naturally represented by a hierarchical structure
• It is necessary to 'cut' the dendrogram to produce discrete clusters
• Bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be 'undone'
Partitioning methods
• Anatomy of a partitioning-based method:
  • data matrix
  • distance function
  • number of groups
• Output:
  • group assignment of every object
Partitioning-based methods
• Choose K groups
• Initialise group centers (a.k.a. centroids or medoids)
• Assign each object to the nearest centroid according to the distance metric
• Reassign (or recompute) the centroids
• Repeat the last two steps until the assignment stabilizes (see the sketch below)
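As a concrete illustration of these ingredients (distance matrix and number of groups in, group assignment out), here is a minimal sketch using pam() from the cluster package on the cell-cycle data loaded earlier; the choice of k = 4 is purely illustrative:

library(cluster)
# Inputs: a distance matrix and the number of groups
pam.cho <- pam(D.cho, k = 4, diss = TRUE)
# Output: a group assignment for every object
table(pam.cho$clustering)
# The medoids are actual objects (indices of rows in cho.data)
pam.cho$id.med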
K-means vs. K-medoids

K-means:
• Centroids are the 'mean' of the clusters
• Centroids need to be recomputed every iteration
• Initialisation is difficult, as the notion of a centroid may be unclear before beginning
• R function: kmeans()

K-medoids:
• Centroids (medoids) are actual objects that minimize the total within-cluster distance
• The medoid can be determined by a quick look-up in the distance matrix
• Initialisation is simply K randomly selected objects
• R function: pam()
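A small sketch contrasting the two functions in R (using the cell-cycle matrix from earlier; k = 4 is illustrative). kmeans() works on the raw data matrix, while pam() can also work directly from a distance matrix:

library(cluster)
set.seed(100)
km <- kmeans(cho.data, centers = 4)   # centroids = per-cluster mean profiles
pm <- pam(cho.data, k = 4)            # medoids  = actual rows of the data
# kmeans centroids are averaged profiles, generally not rows of cho.data
km$centers[, 1:5]
# pam medoids are real observations; their row indices:
pm$id.med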
Partitioning-based methods

Advantages:
• The number of groups is well defined
• A clear, deterministic assignment of each object to a group
• Simple algorithms for inference

Disadvantages:
• Have to choose the number of groups
• Sometimes objects do not fit well into any cluster
• Can converge on locally optimal solutions and often require multiple restarts with random initializations (see the sketch below)
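Because k-means can converge to a local optimum, a common practice is to run it from many random initializations and keep the best solution. A minimal sketch; the value nstart = 25 is an arbitrary illustrative choice:

set.seed(100)
# One random start: risk of a poor local optimum
km.1  <- kmeans(cho.data, centers = 4, nstart = 1)
# 25 random starts: kmeans() keeps the solution with the smallest
# total within-cluster sum of squares
km.25 <- kmeans(cho.data, centers = 4, nstart = 25)
c(one.start = km.1$tot.withinss, many.starts = km.25$tot.withinss)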
Given N items and K clusters, the goal is to minimize the total within-cluster sum of squares

$$\sum_{k=1}^{K} \sum_{i\,:\,C(i)=k} \lVert x_i - m_k \rVert^2$$

over the possible assignments $C$ and centroids $m_1, \ldots, m_K$, where $m_k$ represents the location of the $k$-th cluster.
K-means
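A quick numerical check of this objective against R's kmeans() output (a sketch using the cell-cycle matrix loaded earlier; k = 4 is illustrative):

set.seed(100)
km <- kmeans(cho.data, centers = 4)
# Recompute the objective by hand: squared distance of each gene
# to its assigned cluster centroid, summed over all genes
obj <- sum((cho.data - km$centers[km$cluster, ])^2)
# Should match the value reported by kmeans()
c(by.hand = obj, tot.withinss = km$tot.withinss)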
1. Divide the data into K clusters; initialize the centroids with the mean of each cluster
2. Assign each item to the cluster with the closest centroid
3. When all objects have been assigned, recalculate the centroids (means)
4. Repeat steps 2-3 until the centroids no longer move
K-means
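To make the four steps concrete, here is a bare-bones from-scratch sketch of the loop in R (for teaching only; in practice use kmeans(), which implements this far more efficiently). The function name and its internals are invented for illustration, and empty clusters are not handled:

simple.kmeans <- function(x, k, max.iter = 100) {
    # Step 1: random balanced initial assignment; centroids are the cluster means
    assign <- sample(rep(1:k, length.out = nrow(x)))
    for (iter in 1:max.iter) {
        centers <- t(sapply(1:k, function(j) colMeans(x[assign == j, , drop = FALSE])))
        # Step 2: assign each item to the closest centroid (Euclidean distance)
        d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
        new.assign <- apply(d, 1, which.min)
        # Step 4: stop when assignments (and hence centroids) no longer change
        if (all(new.assign == assign)) break
        assign <- new.assign   # Step 3: centroids are recomputed at the top of the loop
    }
    list(cluster = assign, centers = centers)
}

# Example usage on the cell-cycle data
set.seed(100)
simple.kmeans(cho.data, k = 4)$cluster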
# Simulated data: two Gaussian clouds in 2D
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Run k-means from the same random starting centers, stopping after 1, 2 and 3
# iterations to watch the centroids converge (stars mark the centers)
set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = 1)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = 2)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = 3)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)
K-means
K-means, k=4
# K-means with k = 4 on the cell-cycle data
set.seed(100)
km.cho <- kmeans(cho.data, 4)

# One panel per cluster: time-course of every gene assigned to that cluster
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l",
        xlab = "time", ylab = "log expression value")
K-means
[Figures: time-course expression profiles of clusters 1-4, K-means k = 4 vs. single linkage k = 4]
K-means
K-means and hierarchical clustering are simple, fast and useful techniques.
Beware of the memory requirements of hierarchical clustering (it needs the full pairwise distance matrix).
Both are a bit "ad hoc":
• How many clusters?
• Which distance metric?
• What counts as a good clustering?
Summary
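One common way to address "how many clusters?" and "is this a good clustering?" is the average silhouette width. A sketch using cluster::silhouette() on the objects defined above; the range of k values is illustrative:

library(cluster)
# Average silhouette width for k = 2..8 clusters (higher is better)
sil.width <- sapply(2:8, function(k) {
    set.seed(100)
    km <- kmeans(cho.data, centers = k)
    mean(silhouette(km$cluster, D.cho)[, "sil_width"])
})
names(sil.width) <- 2:8
sil.width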
Meta-Analysis
• Combining the results of multiple studies that address related hypotheses
• Often used to merge data from different microarray platforms
• Very challenging: it is unclear what the best approaches are, or how they should be adapted to the peculiarities of microarray data
Why Do Meta-Analysis?
• Can identify publication biases
• Appropriately weights diverse studies (illustrated in the sketch below) by:
  • sample size
  • experimental reliability
  • similarity of the study-specific hypotheses to the overall one
• Increases statistical power
• Condenses the information: a single meta-analysis vs. five large studies
• Provides clearer guidance
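To illustrate how studies can be weighted by their reliability, here is a minimal sketch of a fixed-effect (inverse-variance) meta-analysis of per-study effect sizes; the effect estimates and standard errors are made-up numbers for illustration:

# Hypothetical per-study effect estimates (e.g. log fold-changes) and standard errors
effect <- c(0.80, 0.45, 1.10, 0.60)
se     <- c(0.40, 0.15, 0.50, 0.20)
# Inverse-variance weights: more precise studies count for more
w <- 1 / se^2
# Pooled fixed-effect estimate and its standard error
pooled    <- sum(w * effect) / sum(w)
pooled.se <- sqrt(1 / sum(w))
c(estimate = pooled, se = pooled.se,
  z = pooled / pooled.se, p = 2 * pnorm(-abs(pooled / pooled.se)))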
Challenges of Meta-Analysis
• No control for bias
  • What happens if most studies are poorly designed?
• The file-drawer problem
  • Publication bias can be detected, but not explicitly controlled for
• How homogeneous are the data? Can they be fairly grouped?
  • Simpson's Paradox
Simpson’s Paradox
A trend that holds within each group can disappear or even reverse when the groups are merged, so group-wise correlations can be inverted in the pooled data. A cautionary note for all meta-analyses!
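A small simulated illustration (all numbers are invented): within each of two groups the correlation between x and y is positive, but because the group means are offset, the pooled correlation is negative:

set.seed(1)
# Two groups with a positive x-y relationship within each group,
# but group 2 is shifted to higher x and lower y
x1 <- rnorm(50);      y1 <- x1 + rnorm(50, sd = 0.5)
x2 <- rnorm(50) + 4;  y2 <- x2 + rnorm(50, sd = 0.5) - 8
c(within.group.1 = cor(x1, y1),
  within.group.2 = cor(x2, y2),
  pooled         = cor(c(x1, x2), c(y1, y2)))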
Topics For This Week
• Machine-learning 101 (Focus: Unsupervised)
• Data visualization 101