Microarray Data Analysis (Lecture for CS397-CXZ Algorithms in Bioinformatics) March 19, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign
Dec 14, 2015

Amari Overland

Microarray Data Analysis

(Lecture for CS397-CXZ Algorithms in Bioinformatics)

March 19, 2004

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

Gene Expression Data (Microarray)

p genes on n samples

Genes

mRNA samples

Gene expression level of gene i in mRNA sample j

Log(treated-exp-value / controlled-exp-value)

Gene   sample1  sample2  sample3  sample4  sample5  …
1       0.46     0.30     0.80     1.51     0.90    ...
2      -0.10     0.49     0.24     0.06     0.46    ...
3       0.15     0.74     0.04     0.10     0.20    ...
4      -0.45    -1.03    -0.79    -0.56    -0.32    ...
5      -0.06     1.06     1.35     1.09    -1.09    ...
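
As a rough sketch of how such a matrix of log ratios might be computed: the slide does not state the base of the logarithm, so base 2 is an assumption here, the intensity values are made up, and `log_ratio_matrix` is a hypothetical helper name.

```python
import math

def log_ratio_matrix(treated, control):
    """Return log2(treated / control) for every gene (row) and sample (column)."""
    return [
        [math.log2(t / c) for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(treated, control)
    ]

# Made-up raw intensities: 2 genes (rows) x 3 samples (columns).
treated = [[200.0, 100.0, 50.0],
           [80.0, 160.0, 40.0]]
control = [[100.0, 100.0, 100.0],
           [80.0, 40.0, 80.0]]

ratios = log_ratio_matrix(treated, control)
print(ratios)
```

A positive entry means the gene is expressed more strongly in the treated sample than in the control, a negative entry means less strongly, and zero means no change.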

Some possible applications

Sample from specific organ to show which genes are expressed

Compare samples from healthy and sick hosts to find gene-disease connections

Discover co-regulated genes

Discover promoters

Major Analysis Techniques

Single gene analysis: Compare the expression levels of the same gene under different conditions

Main techniques: Significance test (e.g., t-test)

Gene group analysis: Find genes that are expressed similarly across many different conditions

Main techniques: Clustering (many possibilities)

Gene network analysis: Analyze gene regulation relationship at a large scale

Main techniques: Bayesian networks
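
For single gene analysis, a significance test can be sketched as follows. The slide only says "t-test"; the Welch (unequal-variance) variant is an assumption made here, the expression values are made up, and `welch_t` is a hypothetical helper. Turning the statistic into a p-value would additionally need the t distribution, which is left out.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = statistics.variance(a) / len(a)   # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)   # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up expression levels of one gene under two conditions.
healthy = [0.46, 0.30, 0.80, 0.51, 0.90]
sick = [-0.45, -1.03, -0.79, -0.56, -0.32]

t, df = welch_t(healthy, sick)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large absolute value of t suggests that the gene's mean expression differs between the two conditions.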

Clustering Methods

Similarity-based (need a similarity function)

Construct a partition

Agglomerative, bottom up

Searching for an optimal partition

Typically “hard” clustering

Model-based (latent models, probabilistic or algebraic)

First compute the model

Clusters are obtained easily after having a model

Typically “soft” clustering

Similarity-based Clustering

Define a similarity function to measure similarity between two objects

Common criteria: Find a partition to

Maximize intra-cluster similarity

Minimize inter-cluster similarity

Two ways to construct the partition:

Hierarchical (e.g., Agglomerative Hierarchical Clustering)

Search by starting at a random partition (e.g., K-means)
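
The two criteria can be made concrete with a small sketch. The similarity function (negative Euclidean distance), the averaging over pairs, and the toy points are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from itertools import combinations

def similarity(a, b):
    """Assumed similarity: negative Euclidean distance (larger means more alike)."""
    return -math.dist(a, b)

def intra_cluster_similarity(clusters):
    """Average similarity over all pairs of objects that share a cluster."""
    pairs = [similarity(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(pairs) / len(pairs)

def inter_cluster_similarity(clusters):
    """Average similarity over all pairs of objects in different clusters."""
    pairs = [similarity(a, b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1 for b in c2]
    return sum(pairs) / len(pairs)

# Two candidate partitions of the same four points.
good = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)]]
bad = [[(0.0, 0.0), (5.0, 5.0)], [(0.0, 1.0), (5.0, 6.0)]]

for name, partition in (("good", good), ("bad", bad)):
    print(name,
          round(intra_cluster_similarity(partition), 2),
          round(inter_cluster_similarity(partition), 2))
```

The partition that groups nearby points scores higher on intra-cluster similarity and lower on inter-cluster similarity than the one that mixes them.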

Method 1 (Similarity-based):

Agglomerative Hierarchical Clustering

Agglomerative Hierarchical Clustering

Given a similarity function to measure similarity between two objects

Gradually group similar objects together in a bottom-up fashion

Stop when some stopping criterion is met

Variations: different ways to compute group similarity based on individual object similarity
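
A minimal sketch of the bottom-up procedure, under several assumptions: similarity is negative Euclidean distance, the stopping criterion is a target number of groups, and group similarity defaults to the closest pair. The names `agglomerative` and `single_link` are hypothetical, and the quadratic search for the best pair is written for clarity rather than speed.

```python
import math

def similarity(a, b):
    """Assumed similarity: negative Euclidean distance."""
    return -math.dist(a, b)

def single_link(g1, g2):
    """Group similarity taken as the similarity of the closest pair."""
    return max(similarity(a, b) for a in g1 for b in g2)

def agglomerative(points, num_clusters, group_similarity=single_link):
    """Merge the two most similar groups until num_clusters groups remain."""
    groups = [[p] for p in points]         # every object starts in its own group
    while len(groups) > num_clusters:      # assumed stopping criterion
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                s = group_similarity(groups[i], groups[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]  # merge group j into group i
        del groups[j]
    return groups

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.0)]
clusters = agglomerative(points, 3)
print(clusters)
```

Passing a different `group_similarity` function changes how groups are compared without touching the merging loop.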

Similarity Measure: Pearson CC

The most popular correlation coefficient is the Pearson correlation coefficient (1892)

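Correlation between X = {X1, X2, …, Xn} and Y = {Y1, Y2, …, Yn}:

$$r = \frac{1}{n}\sum_{k=1}^{n}\left(\frac{X_k-\bar{X}}{\sigma_X}\right)\left(\frac{Y_k-\bar{Y}}{\sigma_Y}\right)$$

where

$$\sigma_G = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(G_k-\bar{G}\right)^2}$$

(Adapted from a Slide by Shin-Mu Tseng)

sXY = r is the similarity between X & Y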

Better measures focus on a subset of values…
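
The Pearson correlation coefficient can be sketched in a few lines. This version uses the 1/n normalization for both the mean product and the standard deviations; the two expression profiles are rows 1 and 4 of the example matrix earlier in the lecture, truncated to the five samples shown, and `pearson` is a hypothetical helper name.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (1/n normalization).

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    return sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)) / n

gene1 = [0.46, 0.30, 0.80, 1.51, 0.90]
gene4 = [-0.45, -1.03, -0.79, -0.56, -0.32]

print(round(pearson(gene1, gene4), 3))
```

The result always lies between -1 and 1, with 1 for profiles that rise and fall together and -1 for profiles that move in exactly opposite directions.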

Similarity-induced Structure

How to Compute Group Similarity?

Given two groups g1 and g2, three popular methods:

Single-link algorithm: s(g1,g2) = similarity of the closest pair

Complete-link algorithm: s(g1,g2) = similarity of the farthest pair

Average-link algorithm: s(g1,g2) = average of similarity of all pairs
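
The three group-similarity definitions translate directly into code. In this sketch the object similarity is assumed to be negative Euclidean distance, so the closest pair is the one with the largest similarity and the farthest pair the one with the smallest; the function names and points are made up.

```python
import math

def similarity(a, b):
    """Assumed similarity: negative Euclidean distance."""
    return -math.dist(a, b)

def single_link(g1, g2):
    """Similarity of the closest (most similar) pair."""
    return max(similarity(a, b) for a in g1 for b in g2)

def complete_link(g1, g2):
    """Similarity of the farthest (least similar) pair."""
    return min(similarity(a, b) for a in g1 for b in g2)

def average_link(g1, g2):
    """Average similarity over all cross-group pairs."""
    sims = [similarity(a, b) for a in g1 for b in g2]
    return sum(sims) / len(sims)

g1 = [(0.0, 0.0), (1.0, 0.0)]
g2 = [(3.0, 0.0), (6.0, 0.0)]

print(single_link(g1, g2), complete_link(g1, g2), average_link(g1, g2))
```

For any two groups, the single-link value is at least the average-link value, which in turn is at least the complete-link value.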

Three Methods Illustrated

[Figure: two groups g1 and g2, with panels for the single-link algorithm, the complete-link algorithm, and the average-link algorithm]

Comparison of the Three Methods

Single-link: “Loose” clusters

Individual decision, sensitive to outliers

Complete-link: “Tight” clusters

Individual decision, sensitive to outliers

Average-link: “In between”

Group decision, insensitive to outliers

Which one is the best? Depends on what you need!

Method 2 (similarity-based):

K-Means

K-Means Clustering

Given a similarity function

Start with k randomly selected data points

Assume they are the centroids of k clusters

Assign every data point to a cluster whose centroid is the closest to the data point

Recompute the centroid for each cluster

Repeat this process until the similarity-based objective function converges
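
The steps above can be sketched as follows. Several choices are assumptions: closeness is measured by Euclidean distance, convergence is detected when the assignments stop changing, an empty cluster keeps its previous centroid, and the `seed` parameter exists only to make the random start repeatable. `kmeans` is a hypothetical function name and the points are made up.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # k randomly selected data points
    assignment = None
    for _ in range(max_iter):
        # Assign every point to the cluster whose centroid is closest.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # nothing moved: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # an empty cluster keeps its centroid
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignment

points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
centroids, assignment = kmeans(points, 2)
print(assignment)
```

Because the start is random, different seeds can lead to different local optima on harder data; running several restarts and keeping the best objective value is a common remedy.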

Method 3 (model-based):

Mixture Models

Mixture Model for Clustering

P(X|Cluster1)

P(X|Cluster2)

P(X|Cluster3)

$$P(X) = \pi_1 P(X \mid \text{Cluster}_1) + \pi_2 P(X \mid \text{Cluster}_2) + \pi_3 P(X \mid \text{Cluster}_3)$$

$$X \mid \text{Cluster}_i \sim N(\mu_i, \sigma_i^2)$$

Mixture Model Estimation

Likelihood function

$$p(x) = \sum_{i=1}^{k} \pi_i \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)$$

Parameters: $\pi_i$, $\mu_i$, $\sigma_i$

Using EM algorithm

Similar to “soft” K-means
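
A minimal sketch of EM for a one-dimensional Gaussian mixture, to show why it resembles a "soft" K-means: the E-step assigns each point fractionally to every cluster, and the M-step re-estimates each cluster from its weighted points. The initialization (means spread over the sorted data, equal weights, a shared starting spread), the fixed iteration count, the small variance floor, and the synthetic data are all assumptions; `em_gmm` is a hypothetical name.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / math.sqrt(2 * math.pi * sigma ** 2))

def em_gmm(xs, k, iters=50):
    """Fit a 1-D mixture of k Gaussians; returns (weights, means, std devs)."""
    n = len(xs)
    xs_sorted = sorted(xs)
    # Assumed initialization: means at evenly spaced quantiles of the data.
    mus = [xs_sorted[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    mean = sum(xs) / n
    spread = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    sigmas = [spread] * k
    pis = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each cluster for each point.
        resp = []
        for x in xs:
            w = [pis[i] * normal_pdf(x, mus[i], sigmas[i]) for i in range(k)]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: weighted re-estimation of the parameters.
        for i in range(k):
            ni = sum(r[i] for r in resp)
            pis[i] = ni / n
            mus[i] = sum(r[i] * x for r, x in zip(resp, xs)) / ni
            var = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, xs)) / ni
            sigmas[i] = math.sqrt(var) + 1e-6  # floor avoids a zero variance
    return pis, mus, sigmas

# Synthetic data drawn from two well-separated Gaussians.
rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(6.0, 1.0) for _ in range(200)])

pis, mus, sigmas = em_gmm(data, 2)
print([round(m, 2) for m in mus], [round(p, 2) for p in pis])
```

Replacing the fractional responsibilities with a hard 0/1 assignment to the most probable cluster, and fixing equal variances and weights, would turn the loop into ordinary K-means.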

Method 4 (model-based) [If we have time]

Singular Value Decomposition (SVD)

Also called “Latent Semantic Indexing” (LSI)

Example of “Semantic Concepts”

(Slide from C. Faloutsos’s talk)

Singular Value Decomposition (SVD)

A[n x m] = U[n x r] Σ[r x r] (V[m x r])^T

A: n x m matrix (n documents, m terms)

U: n x r matrix (n documents, r concepts)

Σ: r x r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix)

V: m x r matrix (m terms, r concepts)

(Slide from C. Faloutsos’s talk)

Example of SVD

A = U Σ V^T

A (rows: four CS documents, then three MD documents; columns: data, inf., retrieval, brain, lung):

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1

U (columns: CS-concept, MD-concept):

0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27

Σ (9.64 is the strength of the CS-concept):

9.64 0
0 5.29

V^T (term rep of concept):

0.58 0.58 0.58 0 0
0 0 0 0.71 0.71

Dim. Reduction

(Slide adapted from C. Faloutsos’s talk)
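
The example can be checked numerically by multiplying the three factors back together. The sketch below uses the rounded values from the slide, so the product matches A only approximately; `matmul` is a hypothetical helper, and the rank-1 truncation at the end is one simple way to illustrate dimensionality reduction, not necessarily the one intended in the lecture.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Document-term matrix: 4 CS documents, then 3 MD documents.
A = [[1, 1, 1, 0, 0],
     [2, 2, 2, 0, 0],
     [1, 1, 1, 0, 0],
     [5, 5, 5, 0, 0],
     [0, 0, 0, 2, 2],
     [0, 0, 0, 3, 3],
     [0, 0, 0, 1, 1]]

# Rounded factors as given on the slide.
U = [[0.18, 0], [0.36, 0], [0.18, 0], [0.90, 0],
     [0, 0.53], [0, 0.80], [0, 0.27]]
S = [[9.64, 0], [0, 5.29]]
Vt = [[0.58, 0.58, 0.58, 0, 0],
      [0, 0, 0, 0.71, 0.71]]

reconstructed = matmul(matmul(U, S), Vt)
max_error = max(abs(r - a)
                for r_row, a_row in zip(reconstructed, A)
                for r, a in zip(r_row, a_row))
print("largest reconstruction error:", round(max_error, 3))

# Keep only the strongest concept (first column of U, first row of V^T).
rank1 = matmul(matmul([row[:1] for row in U], [[S[0][0]]]), [Vt[0]])
print([round(v, 2) for v in rank1[3]])
```

With only the CS-concept kept, the CS documents are still reproduced closely while the MD documents collapse to zero, which shows what information the dropped concept carried.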

More clustering methods and software

Partitioning: K-Means, K-Medoids, PAM, CLARA, …

Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK

Density-based: CAST, DBSCAN, OPTICS, CLIQUE, …

Grid-based: STING, CLIQUE, WaveCluster, …

Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass, …

Two-way Clustering

Block clustering