Microarray Data Analysis (Lecture for CS397-CXZ Algorithms in Bioinformatics) March 19, 2004 ChengXiang Zhai Department of Computer Science University.

Microarray Data Analysis

(Lecture for CS397-CXZ Algorithms in Bioinformatics)

March 19, 2004

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

Gene Expression Data (Microarray)

p genes on n samples

Genes

mRNA samples

Gene expression level of gene i in mRNA sample j

log(treated-exp-value / control-exp-value)

Gene  sample1  sample2  sample3  sample4  sample5  …
1      0.46     0.30     0.80     1.51     0.90    ...
2     -0.10     0.49     0.24     0.06     0.46    ...
3      0.15     0.74     0.04     0.10     0.20    ...
4     -0.45    -1.03    -0.79    -0.56    -0.32    ...
5     -0.06     1.06     1.35     1.09    -1.09    ...
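As a minimal sketch of how one entry of such a matrix could be computed, the snippet below takes the log ratio of a treated value to a control value. The raw intensities are hypothetical, and base 2 is an assumption (the slide only says "log").

```python
import math

def log_ratio(treated, control):
    """Return log2(treated / control) for one gene in one sample.

    Base 2 is an assumption; the slide does not name the base.
    """
    if treated <= 0 or control <= 0:
        raise ValueError("expression values must be positive")
    return math.log2(treated / control)

# Hypothetical raw intensities (not from the lecture):
# rows are genes, columns are mRNA samples.
treated = [[200.0, 150.0], [80.0, 120.0]]
control = [[100.0, 150.0], [160.0, 60.0]]

# Build the gene-by-sample expression matrix entry by entry.
expression = [
    [log_ratio(t, c) for t, c in zip(t_row, c_row)]
    for t_row, c_row in zip(treated, control)
]
```

A positive entry means the gene is more expressed in the treated sample than in the control, a negative entry means less, and zero means no change.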

Some possible applications

Sample from specific organ to show which genes are expressed

Compare samples from healthy and sick host to find gene-disease connection

Discover co-regulated genes

Discover promoters

Major Analysis Techniques

Single gene analysis: Compare the expression levels of the same gene under different conditions

Main techniques: Significance test (e.g., t-test)

Gene group analysis: Find genes that are expressed similarly across many different conditions

Main techniques: Clustering (many possibilities)

Gene network analysis: Analyze gene regulation relationships at a large scale

Main techniques: Bayesian networks
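For the single gene analysis case, a significance test could be sketched as follows. This uses Welch's unequal-variance form of the t statistic, which is one possible choice (the slides only say "t-test"), and the expression values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical log-ratio values of one gene under two conditions.
healthy = [0.1, -0.2, 0.0, 0.3, -0.1]
diseased = [1.2, 0.9, 1.5, 1.1, 1.3]

t, df = welch_t(healthy, diseased)
```

A large absolute value of `t` suggests the gene is expressed differently under the two conditions. Turning `t` and `df` into a p-value needs the t distribution, which is left out here.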

Clustering Methods

Similarity-based (need a similarity function)

Construct a partition

Agglomerative, bottom up

Searching for an optimal partition

Typically “hard” clustering

Model-based (latent models, probabilistic or algebraic)

First compute the model

Clusters are obtained easily after having a model

Typically “soft” clustering

Similarity-based Clustering

Define a similarity function to measure similarity between two objects

Common criteria: Find a partition that maximizes intra-cluster similarity and minimizes inter-cluster similarity

Two ways to construct the partition

Hierarchical (e.g., Agglomerative Hierarchical Clustering)

Search by starting at a random partition (e.g., K-means)

Method 1 (Similarity-based):

Agglomerative Hierarchical Clustering

Agglomerative Hierarchical Clustering

Given a similarity function to measure similarity between two objects

Gradually group similar objects together in a bottom-up fashion

Stop when some stopping criterion is met

Variations: different ways to compute group similarity based on individual object similarity
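The bottom-up procedure can be sketched as below. The similarity function (negative absolute difference), the group similarity (most similar pair), and the stopping threshold `min_sim` are all hypothetical choices made for illustration.

```python
def agglomerative(objects, sim, group_sim, min_sim):
    """Repeatedly merge the two most similar groups.

    Stops when only one group is left or when the best pair of groups
    is less similar than min_sim (the stopping criterion).
    """
    groups = [[o] for o in objects]  # every object starts in its own group
    while len(groups) > 1:
        best, pair = None, None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                s = group_sim(groups[i], groups[j], sim)
                if best is None or s > best:
                    best, pair = s, (i, j)
        if best < min_sim:
            break
        i, j = pair
        groups[i] = groups[i] + groups[j]  # merge group j into group i
        del groups[j]
    return groups

def sim(x, y):
    """Hypothetical similarity between two numbers: closer means more similar."""
    return -abs(x - y)

def closest_pair(g1, g2, sim):
    """Group similarity taken as the similarity of the most similar pair."""
    return max(sim(a, b) for a in g1 for b in g2)

clusters = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], sim, closest_pair, min_sim=-1.0)
# clusters == [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Swapping `closest_pair` for another group similarity function changes which groups get merged, which is the source of the variations mentioned above.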

Similarity Measure: Pearson CC

The most popular correlation coefficient is the Pearson correlation coefficient (1892)

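Correlation between X = {X1, X2, …, Xn} and Y = {Y1, Y2, …, Yn}:

$$r = \frac{1}{n} \sum_{k=1}^{n} \left(\frac{X_k - \bar{X}}{\sigma_X}\right) \left(\frac{Y_k - \bar{Y}}{\sigma_Y}\right)$$

where

$$\sigma_G = \sqrt{\frac{\sum_{k=1}^{n} \left(G_k - \bar{G}\right)^2}{n}}$$

for G = X or Y, and $\bar{G}$ is the mean of G.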

(Adapted from a Slide by Shin-Mu Tseng)

s_XY is the similarity between X & Y

Better measures focus on a subset of values…
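A minimal sketch of the Pearson correlation coefficient as a similarity between two expression profiles follows. It divides by n in both the covariance and the standard deviations. Dividing by n - 1 in both places would give the same r. The two profiles are rows 1 and 4 of the expression table shown earlier (first five samples).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Standard deviations, dividing by n.
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    # Average product of the standardized values.
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n * sd_x * sd_y)

gene1 = [0.46, 0.30, 0.80, 1.51, 0.90]
gene4 = [-0.45, -1.03, -0.79, -0.56, -0.32]

r14 = pearson(gene1, gene4)
```

The result always lies between -1 and 1, so it can be used directly as the similarity function in the clustering methods that follow.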

Similarity-induced Structure

How to Compute Group Similarity?

Three Popular Methods:

Given two groups g1 and g2,

Single-link algorithm: s(g1,g2) = similarity of the closest pair

Complete-link algorithm: s(g1,g2) = similarity of the farthest pair

Average-link algorithm: s(g1,g2) = average similarity of all pairs
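The three group similarity definitions can be sketched as small functions. Because s is a similarity, the closest pair is the pair with the highest similarity and the farthest pair is the one with the lowest. The similarity function and the two groups below are hypothetical.

```python
def single_link(g1, g2, sim):
    """Similarity of the closest (most similar) pair across the two groups."""
    return max(sim(a, b) for a in g1 for b in g2)

def complete_link(g1, g2, sim):
    """Similarity of the farthest (least similar) pair across the two groups."""
    return min(sim(a, b) for a in g1 for b in g2)

def average_link(g1, g2, sim):
    """Average similarity over all cross-group pairs."""
    return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))

def sim(x, y):
    """Hypothetical similarity: negative absolute difference."""
    return -abs(x - y)

g1 = [1.0, 2.0]
g2 = [4.0, 7.0]

s_single = single_link(g1, g2, sim)      # -2.0, from the pair (2.0, 4.0)
s_complete = complete_link(g1, g2, sim)  # -6.0, from the pair (1.0, 7.0)
s_average = average_link(g1, g2, sim)    # -4.0, mean of -3, -6, -2, -5
```

Single-link and complete-link each depend on one pair only, while average-link takes every pair into account.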

Three Methods Illustrated

[Figure: two groups, g1 and g2, with the pairs used by the single-link, complete-link, and average-link algorithms]

Comparison of the Three Methods

Single-link: “Loose” clusters

Individual decision, sensitive to outliers

Complete-link: “Tight” clusters

Individual decision, sensitive to outliers

Average-link: “In between”

Group decision, insensitive to outliers

Which one is the best? Depends on what you need!

Method 2 (similarity-based):

K-Means

K-Means Clustering

Given a similarity function

Start with k randomly selected data points

Assume they are the centroids of k clusters

Assign every data point to a cluster whose centroid is the closest to the data point

Recompute the centroid for each cluster

Repeat this process until the similarity-based objective function converges
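The steps above can be sketched as follows. Two choices here are assumptions: "closest" is measured with squared Euclidean distance, and the loop stops when the cluster assignments no longer change (at which point the objective no longer changes either). The data points are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k randomly selected data points
    labels = None
    for _ in range(max_iter):
        # Assign every point to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical 2-D data with two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

labels, centroids = kmeans(points, k=2)
```

The result depends on the random starting points in general, so in practice the algorithm is often run several times with different seeds.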

Method 3 (model-based):

Mixture Models

Mixture Model for Clustering

P(X|Cluster1)

P(X|Cluster2)

P(X|Cluster3)

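$$P(X) = \lambda_1 P(X \mid \text{Cluster}_1) + \lambda_2 P(X \mid \text{Cluster}_2) + \lambda_3 P(X \mid \text{Cluster}_3)$$

$$X \mid \text{Cluster}_i \sim N(\mu_i, \sigma_i^2)$$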

Mixture Model Estimation

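Likelihood function

$$p(x) = \sum_{i=1}^{k} \lambda_i \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x - \mu_i)^2}{2\sigma_i^2}\right)$$

Parameters: $\lambda_i$, $\mu_i$, $\sigma_i$

Using EM algorithm

Similar to “soft” K-means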
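A minimal sketch of EM for a one-dimensional mixture of Gaussians is given below, assuming each cluster has its own mixing weight, mean, and standard deviation. The data, the starting values (means at the smallest and largest data point), the fixed number of iterations, and the floor on the standard deviation are all illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def em_gmm_1d(xs, mus, sigmas, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, mus, sigmas)."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: soft assignment of every point to every cluster.
        resp = []
        for x in xs:
            joint = [weights[i] * normal_pdf(x, mus[i], sigmas[i]) for i in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate parameters from the soft assignments.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            weights[i] = n_i / len(xs)
            mus[i] = sum(r[i] * x for r, x in zip(resp, xs)) / n_i
            var = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, xs)) / n_i
            sigmas[i] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed cluster
    return weights, mus, sigmas

# Hypothetical data drawn around -1 and +1.
xs = [-1.1, -0.9, -1.0, -0.8, -1.2, 1.0, 1.2, 0.9, 1.1, 0.8]

weights, mus, sigmas = em_gmm_1d(
    xs, mus=[min(xs), max(xs)], sigmas=[1.0, 1.0], weights=[0.5, 0.5]
)
```

The E-step is where this differs from K-means: each point gets a probability of belonging to each cluster instead of a single hard label.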

Method 4 (model-based) [If we have time]

Singular Value Decomposition (SVD)

Also called “Latent Semantic Indexing” (LSI)

Example of “Semantic Concepts”

(Slide from C. Faloutsos’s talk)

Singular Value Decomposition (SVD)

A[n x m] = U[n x r] Λ[r x r] (V[m x r])^T

A: n x m matrix (n documents, m terms)

U: n x r matrix (n documents, r concepts)

Λ: r x r diagonal matrix (strength of each ‘concept’) (r : rank of the matrix)

V: m x r matrix (m terms, r concepts)

(Slide from C. Faloutsos’s talk)

Example of SVD

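Matrix A (columns: data, inf, retrieval, brain, lung; rows 1 to 4: CS documents, rows 5 to 7: MD documents) and its decomposition:

$$
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 3 & 3 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
0.18 & 0 \\
0.36 & 0 \\
0.18 & 0 \\
0.90 & 0 \\
0 & 0.53 \\
0 & 0.80 \\
0 & 0.27
\end{bmatrix}
\times
\begin{bmatrix}
9.64 & 0 \\
0 & 5.29
\end{bmatrix}
\times
\begin{bmatrix}
0.58 & 0.58 & 0.58 & 0 & 0 \\
0 & 0 & 0 & 0.71 & 0.71
\end{bmatrix}
$$

Columns of U: CS-concept, MD-concept

9.64 in the diagonal matrix: Strength of CS-concept

Rows of V^T: Term rep of concept

(Slide adapted from C. Faloutsos’s talk)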

Dim. Reduction

A = U Λ V^T
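As a rough illustration of dimensionality reduction on the example matrix, the sketch below finds only the strongest concept, using power iteration on A^T A (one simple technique, chosen here to avoid external libraries; it is not taken from the slides). Keeping just that concept gives a rank-1 approximation of A. The function name and the iteration count are hypothetical.

```python
import math

# The 7 x 5 example matrix from the SVD example.
A = [
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
]

def top_singular_triple(A, n_iter=200):
    """Return (u, sigma, v) for the largest singular value of A."""
    n, m = len(A), len(A[0])
    v = [1.0] * m  # starting vector; must not be orthogonal to the answer
    for _ in range(n_iter):
        Av = [sum(a * b for a, b in zip(row, v)) for row in A]            # A v
        AtAv = [sum(A[i][j] * Av[i] for i in range(n)) for j in range(m)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in AtAv))
        v = [x / norm for x in AtAv]
    Av = [sum(a * b for a, b in zip(row, v)) for row in A]
    sigma = math.sqrt(sum(x * x for x in Av))  # strength of the concept
    u = [x / sigma for x in Av]
    return u, sigma, v

u, sigma, v = top_singular_triple(A)

# Rank-1 approximation: keep only the strongest concept.
rank1 = [[sigma * ui * vj for vj in v] for ui in u]
```

Here `sigma` comes out close to the 9.64 shown in the example, and `rank1` reproduces the first four rows of A while the last three rows shrink to zero, since they belong to the weaker concept that was dropped.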

More clustering methods and software

Partitioning: K-Means, K-Medoids, PAM, CLARA …

Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK

Density-based: CAST, DBSCAN, OPTICS, CLIQUE…

Grid-based: STING, CLIQUE, WaveCluster…

Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass…

Two-way Clustering

Block clustering
