Top Banner
Cluster analysis for microarray data Anja von Heydebreck
26
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Cluster analysis for microarray data Anja von Heydebreck.

Cluster analysis for microarray data

Anja von Heydebreck

Page 2: Cluster analysis for microarray data Anja von Heydebreck.

Aim of clustering: Group objects according to their similarity

Cluster:a set of objectsthat are similar to each otherand separatedfrom the otherobjects.

Example: green/red data pointswere generatedfrom two differentnormal distributions

Page 3: Cluster analysis for microarray data Anja von Heydebreck.

Clustering microarray data

• Genes and experiments/samples are given as the row and column vectors of a gene expression data matrix.

• Clustering may be applied either to genes or experiments (regarded as vectors in Rp or Rn).

n experiments

p ge

nes

gene expression data matrix

Page 4: Cluster analysis for microarray data Anja von Heydebreck.

Why cluster genes?

• Identify groups of possibly co-regulated genes (e.g. in conjunction with sequence data).

• Identify typical temporal or spatial gene expression patterns (e.g. cell cycle data).

• Arrange a set of genes in a linear order that is at least not totally meaningless.

Page 5: Cluster analysis for microarray data Anja von Heydebreck.

Why cluster experiments/samples?

• Quality control: Detect experimental artifacts/bad hybridizations

• Check whether samples are grouped according to known categories (though this might be better addressed using a supervised approach: statistical tests, classification)

• Identify new classes of biological samples (e.g. tumor subtypes)

Page 6: Cluster analysis for microarray data Anja von Heydebreck.

Alizadeh et al., Nature 403:503-11, 2000

Page 7: Cluster analysis for microarray data Anja von Heydebreck.

Cluster analysis

Generally, cluster analysis is based on two ingredients:

• Distance measure: Quantification of (dis-)similarity of objects.

• Cluster algorithm: A procedure to group objects. Aim: small within-cluster distances, large between-cluster distances.

Page 8: Cluster analysis for microarray data Anja von Heydebreck.

Some distance measures

Given vectors x = (x1, …, xn), y = (y1, …, yn)

• Euclidean distance:

• Manhattan distance:

• Correlation

distance:

n

iiiE yxyxd

1

2)(),(

.),(1

n

iiiM yxyxd

.)()(

))((1),(

1

2

1

2

1

ii

ii

iii

Cyyxx

yyxxyxd

Page 9: Cluster analysis for microarray data Anja von Heydebreck.

Which distance measure to use?

1.0 1.5 2.0 2.5 3.0 3.5 4.0

12

34

5Index

b

x = (1, 1, 1.5, 1.5)y = (2.5, 2.5, 3.5, 3.5) = 2x + 0.5z = (1.5, 1.5, 1, 1)

dc(x, y) = 0, dc(x, z) = 2.dE(x, z) = 1, dE(x, y) ~ 3.54.

• The choice of distance measure should be based on the application area. What sort of similarities would you like to detect?• Correlation distance dc measures trends/relative differences:

dc(x, y)= dc(ax+b, y) if a > 0.

Page 10: Cluster analysis for microarray data Anja von Heydebreck.

Which distance measure to use?

• Euclidean and Manhattan distance both measure absolute differences between vectors. Manhattan distance is more robust against outliers.

• May apply standardization to the observations: Subtract mean and divide by standard deviation:

• After standardization, Euclidean and correlation distance are equivalent:

ˆ x

x xx

21 2 1 2( , ) 2 ( , ).E Cd x x nd x x

Page 11: Cluster analysis for microarray data Anja von Heydebreck.

K-means clustering• Input: N objects given as data points in Rp

• Specify the number k of clusters.• Initialize k cluster centers. Iterate until convergence: - Assign each object to the cluster with the closest center

(wrt Euclidean distance). - The centroids/mean vectors of the obtained clusters are

taken as new cluster centers.• K-means can be seen as an optimization problem: Minimize the sum of squared within-cluster distances,

• Results depend on the initialization. Use several starting points and choose the “best” solution (with minimal W(C)).

2

1 ( ) ( )

1( ) ( , )

2

K

E i jk C i C j k

W C d x x

Page 12: Cluster analysis for microarray data Anja von Heydebreck.

Partitioning around medoids (PAM)• K-means clustering is based on Euclidean

distance. • Partitioning around medoids (PAM) generalizes

the idea and can be used with any distance measure d (objects xi need not be vectors).

• The cluster centers/prototypes are required to be observations:

• Try to minimize the sum of distances of the objects to their cluster centers,

using an iterative procedure analogous to the one in K-means clustering.

( )1

( , ),n

i j ii

d x m

, 1,..., .jj im x j K

Page 13: Cluster analysis for microarray data Anja von Heydebreck.

K-means/PAM: How to choose K (the number of clusters)?

• There is no easy answer.

• Many heuristic approaches try to compare the quality of clustering results for different values of K (for an overview see Dudoit/Fridlyand 2002).

• The problem can be better addressed in model-based clustering, where each cluster represents a probability distribution, and a likelihood-based framework can be used.

Page 14: Cluster analysis for microarray data Anja von Heydebreck.

Self-organizing maps• K = r*s clusters are arranged as

nodes of a two-dimensional grid. Nodes represent cluster centers/prototype vectors.

• This allows to represent similarity between clusters.

• Algorithm: Initialize nodes at random

positions. Iterate: - Randomly pick one data point

(gene) x. - Move nodes towards x, the

closest node most, remote nodes (in terms of the grid) less. Decrease amount of movements with no. of iterations.

from Tamayo et al. 1999

Page 15: Cluster analysis for microarray data Anja von Heydebreck.

Self-organizing maps

from Tamayoet al. 1999(yeast cell cycle data)

Page 16: Cluster analysis for microarray data Anja von Heydebreck.

Hierarchical clustering• Similarity of objects

is represented in a tree structure (dendrogram).

• Advantage: no need to specify the number of clusters in advance. Nested clusters can be represented.

Golub data: different types of leukemia. Clustering based on the 150 genes with highest variance across all samples.

Page 17: Cluster analysis for microarray data Anja von Heydebreck.

Agglomerative hierarchical clustering

• Bottom-up algorithm (top-down (divisive) methods are less common).

• Start with the objects as clusters.• In each iteration, merge the two clusters

with the minimal distance from each other - until you are left with a single cluster comprising all objects.

• But what is the distance between two clusters?

Page 18: Cluster analysis for microarray data Anja von Heydebreck.

Distances between clusters used for hierarchical clustering

Calculation of the distance between two clusters is based on the pairwise distances between members of the clusters.

• Complete linkage: largest distance

• Average linkage: average distance

• Single linkage: smallest distance

Complete linkage gives preference to compact/spherical clusters. Single linkage can produce long stretched clusters.

Page 19: Cluster analysis for microarray data Anja von Heydebreck.

Hierarchical clustering

• The height of a node in the dendrogram represents the distance of the two children clusters.

• Loss of information: n objects have n(n-1)/2 pairwise distances, tree has n-1 inner nodes.

• The ordering of the leaves is not uniquely defined by the dendrogram: 2n-2 possible choices.

Golub data: different types of leukemia. Clustering based on the 150 genes with highest variance across all samples.

Page 20: Cluster analysis for microarray data Anja von Heydebreck.

Alternative: direct visualization of similarity/distance matrices

Useful if one wants to investigate a specific factor (advantage: no loss of information). Sort experiments according to that factor.

Array batch 1

Array batch 2

Page 21: Cluster analysis for microarray data Anja von Heydebreck.

The role of feature selection

• Sometimes, people first select genes that appear to be differentially expressed between groups of samples. Then they cluster the samples based on the expression levels of these genes. Is it remarkable if the samples then cluster into the two groups?

• No, this doesn’t prove anything, because the genes were selected with respect to the two groups! Such effects can even be obtained with a matrix of i.i.d. random numbers.

Page 22: Cluster analysis for microarray data Anja von Heydebreck.

Clustering of time course data

• Suppose we have expression data from different time points t1, …, tn, and want to identify typical temporal expression profiles by clustering the genes.

• Usual clustering methods/distance measures don’t take the ordering of the time points into account – the result would be the same if the time points were permuted.

• Simple modification: Consider the difference xi(j+1) – xij between consecutive timepoints as an additional observation yij. Then apply a clustering algorithm such as K-means to the augmented data matrix.

Page 23: Cluster analysis for microarray data Anja von Heydebreck.

Biclustering• Usual clustering algorithms are

based on global similarities of rows or columns of an expression data matrix.

• But the similarity of the expression profiles of a group of genes may be restricted to certain experimental conditions.

• Goal of biclustering: identify “homogeneous” submatrices.

• Difficulties: computational complexity, assessing the statistical significance of results

• Example: Tanay et al. 2002.

Page 24: Cluster analysis for microarray data Anja von Heydebreck.

Conclusions

• Clustering has been very popular in the analysis of microarray data. However, many typical questions in microarray studies (e.g. identifying expression signatures of tumor types) are better addressed by other methods (classification, statistical tests).

• Clustering algorithms are easy to apply (you cannot really do it wrong), and they are useful for exploratory analysis. But it is difficult to assess the validity/significance of the results. Even “random” data with no structure can yield clusters or exhibit interesting looking patterns.

Page 25: Cluster analysis for microarray data Anja von Heydebreck.

Clustering in R

• library mva:

- Hierarchical clustering: hclust, heatmap

- k-means: kmeans

• library class:

- Self-organizing maps: SOM

• library cluster:

- pam and other functions

Page 26: Cluster analysis for microarray data Anja von Heydebreck.

References

• T. Hastie, R. Tibshirani, J. Friedman: The elements of statistical learning (Chapter 14). Springer 2001.

• M. Eisen et al.: Cluster analysis and display of genome-wide expression patterns. Proc.Natl.Acad.Sci.USA 95, 14863-8, 1998.

• P. Tamayo et al.: Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc.Natl.Acad.Sci.USA 96, 2907-12, 1999.

• S. Dudoit, J. Fridlyand: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7), 2002.

• A. Tanay, R. Sharan, R. Shamir: Discovering statistically significant biclusters in gene expression data. Bioinformatics 18, Suppl.1, 136-44, 2002.