Micro Array Data Analysis Clustering and Classification Methods

8/8/2019 Micro Array Data Analysis Clustering and Classification Methods


Copyright Russ B. Altman

Microarray data analysis: clustering and classification methods

Russ B. Altman
BMI 214 / CS 274


Microarrays: DNA Base Pairing

[Figure: two DNA strands with labeled 5’ and 3’ ends, pairing through complementary bases (A with T, C with G)]


Microarrays: Experimental Protocol

[Figure labels: known DNA sequences; glass slide; cells of interest; isolate mRNA; reference sample]


Typical DNA array for yeast


Affymetrix chip technology

Instead of putting down intact genes on the chip, these chips put down N-mers of a certain length (around 20) systematically onto a chip by synthesizing the N-mers on the spots.

Labelled mRNA is then added to the chip and a *pattern* of binding (based on which 20-mers are in the mRNA sequence) is seen.

Bioinformatics is used to deduce the mRNA sequences that are present.


Affymetrix fabrication





Affymetrix chip


Compare two types of chips


REMEMBER: Also control/reference DNA competing (in green)


Reproducibility of data sets

• Preparation of mRNA and experimental design

• Hybridization of RNA to DNA
  – Sequence-specific effects
  – Length-related effects

• Quality of spotted genes on array
  – Proper sequence spotted evenly

• Finding and digitizing spot intensities

• Comparability of experiments on same chip, experiments on different chips


What are expression arrays good for?

• Follow population of (synchronized) cells over time, to see how expression changes (vs. baseline).

• Expose cells to different external stimuli and measure their response (vs. baseline).

• Take cancer cells (or other pathology) and compare to normal cells.

• (Also some non-expression uses, such as assessing presence/absence of sequences in the genome)


Matrix of Expression

[Figure: expression matrix with rows Gene 1, Gene 2, …, Gene N and one column per experiment/condition, built up one column at a time: E1, then E2, then E3]


Reorder Rows for Clustering

[Figure: the Gene 1 … Gene N by E1, E2, E3 matrix shown twice, before and after reordering its rows]


Why do we care about clustering expression data?

If two genes are expressed in the same way, they may be functionally related.

If a gene has unknown function, but clusters with genes of known function, this is a way to assign its general function.

We may be able to look at high resolution measurements of expression and figure out which genes control which other genes.

E.g. peak in cluster 1 always precedes peak in cluster 2 => ?cluster 1 turns cluster 2 on?



Average of clustered waveforms


Typical “waveforms” observed (note: not lots of bumps)


Cluster Analysis Result


Methods for Clustering

• Hierarchical Clustering

• Self Organizing Maps

• K-means

• Trillions of others.


Need a distance metric for two n-dimensional vectors (e.g., for n expression measurements)

1. Euclidean distance

$$D(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$$

2. Correlation coefficient

$$R(X, Y) = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{\sigma_x} \cdot \frac{y_i - \bar{y}}{\sigma_y}$$

where $\sigma_x = \sqrt{E(x^2) - E(x)^2}$ and $\bar{x} = E(x)$ = expected value of x = average of x

3. Other choices for distance too…
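The two distance choices above can be sketched in a few lines of Python. This is only an illustration: the function names and the three toy gene vectors are made up for the example, and the standard deviation uses the slide's definition sqrt(E(x^2) - E(x)^2).

```python
import math

def euclidean_distance(x, y):
    """D(X, Y): straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def std_dev(x):
    """sigma_x = sqrt(E(x^2) - E(x)^2), as defined on the slide."""
    n = len(x)
    mean = sum(x) / n
    return math.sqrt(sum(a * a for a in x) / n - mean ** 2)

def correlation(x, y):
    """R(X, Y): average product of the standardized measurements."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x, sd_y = std_dev(x), std_dev(y)
    return sum((a - mean_x) / sd_x * (b - mean_y) / sd_y
               for a, b in zip(x, y)) / n

# Hypothetical expression profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, twice the amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]   # mirror image of gene_a

print(euclidean_distance(gene_a, gene_b))
print(correlation(gene_a, gene_b))
print(correlation(gene_a, gene_c))
```

gene_a and gene_b are far apart in Euclidean terms yet perfectly correlated, which is one reason correlation is a popular choice when the shape of an expression profile matters more than its magnitude.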


Hierarchical Clustering
Used in Eisen et al

(Nodes = genes or groups of genes. Initially all nodes are rows of data matrix)

1. Compute matrix of all distances (they used correlation coefficient)

2. Find two closest nodes.

3. Merge them by averaging measurements (weighted)

4. Compute distances from merged node to all others

5. Repeat until all nodes merged into a single node
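The five steps above can be sketched as follows. This is a minimal illustration, not the Eisen et al implementation: it assumes distance = 1 - correlation, recomputes all pairwise distances on every pass rather than caching a matrix, and weights the merged average by the number of genes in each node. The toy rows are invented.

```python
import math

def correlation(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)

def hierarchical_cluster(rows):
    """Return the list of merges as (members_a, members_b, distance)."""
    # Initially every row of the data matrix is its own node.
    nodes = [{"vec": list(r), "size": 1, "members": [i]}
             for i, r in enumerate(rows)]
    merges = []
    while len(nodes) > 1:
        # Steps 1-2: find the two closest nodes.
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = 1.0 - correlation(nodes[i]["vec"], nodes[j]["vec"])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        a, b = nodes[i], nodes[j]
        # Step 3: merge by size-weighted averaging of the measurements.
        total = a["size"] + b["size"]
        vec = [(a["size"] * u + b["size"] * v) / total
               for u, v in zip(a["vec"], b["vec"])]
        merges.append((a["members"], b["members"], d))
        merged = {"vec": vec, "size": total,
                  "members": a["members"] + b["members"]}
        # Steps 4-5: keep the merged node and repeat.
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return merges

# Hypothetical data: genes 0 and 1 rise together, genes 2 and 3 fall together.
rows = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [3.0, 2.0, 1.0], [6.0, 4.0, 1.0]]
for members_a, members_b, dist in hierarchical_cluster(rows):
    print(members_a, "+", members_b, "at distance", round(dist, 4))
```

The order of the merges is the tree: early merges are the tight clusters, and cutting the tree at a chosen distance gives the cluster count.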


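Hierarchical Clustering

[Figure: clustered expression matrix, columns E1, E2, E3]


How many clusters?

[Figure: the same clustered matrix (columns E1, E2, E3) with the labels 7 and 4 marked on it]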


Hierarchical Clustering

• Easy to understand & implement
• Can decide how big to make clusters by choosing the “cut” level of the hierarchy
• Can be sensitive to bad data
• Can have problems interpreting the tree
• Can have local minima

Most commonly used method for microarray data.


Can build trees from cluster analysis; groups genes by common patterns of expression.


K-means

(Computationally attractive)

1. Generate random points (“cluster centers”) in n dimensions

2. Compute distance of each data point to each of the cluster centers.

3. Assign each data point to the closest cluster center.

4. Compute new cluster center position as average of points assigned.

5. Loop to (2), stop when cluster centers do not move very much.
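A minimal sketch of this loop is shown below. Several details are assumptions added for the example: random centers are drawn uniformly from the bounding box of the data, an empty cluster keeps its old center, `tol` defines "do not move very much", and the optional `initial_centers` argument exists only so the result can be made reproducible.

```python
import math
import random

def kmeans(points, k, initial_centers=None, max_iter=100, tol=1e-6, seed=0):
    """Return (centers, labels) after iterating steps 2-5."""
    rng = random.Random(seed)
    dim = len(points[0])
    if initial_centers is None:
        # Step 1: random cluster centers inside the data's bounding box.
        lo = [min(p[d] for p in points) for d in range(dim)]
        hi = [max(p[d] for p in points) for d in range(dim)]
        centers = [[rng.uniform(lo[d], hi[d]) for d in range(dim)]
                   for _ in range(k)]
    else:
        centers = [list(c) for c in initial_centers]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Steps 2-3: assign each point to its closest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Step 4: move each center to the average of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(m[d] for m in members) / len(members)
                                    for d in range(dim)])
            else:
                new_centers.append(centers[c])  # empty cluster: leave in place
        # Step 5: stop once no center moved more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, labels

# Hypothetical two-feature data with two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = kmeans(points, k=2, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers, labels)
```

With purely random starting centers the result depends on the seed, and a poorly placed center can end up with no points at all, so in practice the algorithm is usually run from several starts.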


Graphical Representation

A

B

Two features f1 (x-coordinate) and f2 (y-coordinate)


Self Organizing Maps
Used by Tamayo et al (use same idea of nodes)

1. Generate a simple (usually) 2D grid of nodes (x,y)

2. Map the nodes into n-dim expression vectors (initially randomly)

(e.g. (x,y) -> [0 0 0 x 0 0 0 y 0 0 0 0 0])

3. For each data point, P, change all node positions so that they move towards P. Closer nodes move more than far nodes.

4. Iterate for a maximum number of iterations, and then assess position of all nodes.


SOM equations for updating node positions

$$f_{i+1}(N) = f_i(N) + \tau(d(N, N_P), i) \, [P - f_i(N)]$$

f_i(N) = position of node N at iteration i
P = position of current data point
P - f_i(N) = vector from N to P
τ = weighting factor or “learning rate”; dictates how much to move N towards P.

$$\tau(d(N, N_P), i) = \frac{0.02\,T}{T + 100\,i} \text{ for } d(N, N_P) < \text{cutoff radius, else } 0$$

T = maximum number of iterations
τ decreases with iteration and distance of N to P
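The update rule above can be sketched as follows. Assumptions made for the example: N_P is taken to be the node whose current position is closest to P, d(N, N_P) is measured on the 2D grid, one randomly chosen data point is presented per iteration, and the cutoff radius of 1.5 and the toy data are arbitrary.

```python
import math
import random

def learning_rate(grid_distance, i, T, radius):
    """tau(d, i) = 0.02 T / (T + 100 i) inside the cutoff radius, else 0."""
    return 0.02 * T / (T + 100 * i) if grid_distance < radius else 0.0

def train_som(data, grid_w, grid_h, T, radius=1.5, seed=0):
    """Return a dict mapping each grid node (x, y) to its n-dim position."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(x, y) for x in range(grid_w) for y in range(grid_h)]
    # Map grid nodes into expression space, initially at random.
    pos = {g: [rng.random() for _ in range(dim)] for g in grid}
    for i in range(T):
        p = data[rng.randrange(len(data))]            # current data point P
        winner = min(grid, key=lambda g: math.dist(pos[g], p))  # node N_P
        for g in grid:
            tau = learning_rate(math.dist(g, winner), i, T, radius)
            # f_{i+1}(N) = f_i(N) + tau * (P - f_i(N))
            pos[g] = [f + tau * (pc - f) for f, pc in zip(pos[g], p)]
    return pos

# Hypothetical 3-condition expression vectors.
data = [[0.1, 0.2, 0.9], [0.2, 0.1, 0.8], [0.9, 0.8, 0.1], [0.8, 0.9, 0.2]]
node_positions = train_som(data, grid_w=2, grid_h=2, T=500)
for node, vec in sorted(node_positions.items()):
    print(node, [round(v, 3) for v in vec])
```

Because τ never exceeds 0.02, each step is a small nudge, so a useful map needs many iterations relative to the number of nodes.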






A

B




SOMs

• Impose a partial structure on the cluster problem as a start
• Easy to implement
• Pretty fast
• Let the clusters move towards the data
• Easy to visualize results
• Can be sensitive to starting structure
• No guarantee of convergence to good clusters.


Clustering Lymphomas

Works well if we use the appropriate 143 GC specific genes


Clustering vs. Classification

Clustering uses the primary data to group together measurements, with no information from other sources. Often called “unsupervised machine learning.”

Classification uses known groups of interest (from other sources) to learn the features associated with these groups in the primary data, and create rules for associating the data with the groups of interest. Often called “supervised machine learning.”



A

B






Clusters

A

B



Apply external labels for classification

A

B

RED group and BLUE group now labeled


Tradeoffs

Clustering is not biased by previous knowledge, but therefore needs stronger signal to discover clusters.

Classification uses previous knowledge, so can detect weaker signal, but may be biased by WRONG previous knowledge.


Methods for Classification

Linear Models

Logistic Regression

Naïve Bayes

Decision Trees

Support Vector Machines


Linear Model

Each gene, g, has a list of n measurements, one at each condition: [f1 f2 f3 … fn].

Associate each gene with a 1 if in a group of interest, otherwise a 0.

Compute weights to optimize ability to predict whether genes are in group of interest or not.

Predicted group = SUM [ weight(i) * fi ]

If fi always occurs in group 1 genes, then weight is high. If never, then weight is low.

Assumes that weighted combination works.
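As a rough sketch of "compute weights to optimize ability to predict", the example below fits the weights by least squares using plain gradient descent. Least squares, the learning rate, the step count, the 0.5 decision threshold, and the six toy genes are all assumptions for illustration; the slide does not say which optimization is used.

```python
def predict(weights, f):
    """Predicted group score = SUM [ weight(i) * fi ]."""
    return sum(w * x for w, x in zip(weights, f))

def fit_linear(features, targets, lr=0.01, steps=2000):
    """Least-squares weights by gradient descent (no intercept term)."""
    n, dim = len(features), len(features[0])
    weights = [0.0] * dim
    for _ in range(steps):
        gradient = [0.0] * dim
        for f, t in zip(features, targets):
            error = predict(weights, f) - t
            for i in range(dim):
                gradient[i] += 2.0 * error * f[i] / n
        weights = [w - lr * g for w, g in zip(weights, gradient)]
    return weights

# Hypothetical genes measured under two conditions; 1 = in group of interest.
features = [(3.0, 1.0), (4.0, 0.0), (5.0, 1.0),
            (1.0, 3.0), (0.0, 4.0), (1.0, 5.0)]
targets = [1, 1, 1, 0, 0, 0]

weights = fit_linear(features, targets)
print(weights)
print([predict(weights, f) > 0.5 for f in features])
```

On this toy data the first weight comes out positive and the second negative, matching the intuition that a measurement that is high in group members earns a high weight.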

Linear Model

A

B

PREDICT RED if high value for A and low value for B,(high weight on x coordinate, negative weight on y)





Logistic Regression

p = probability of being in group of interest
f = vector of expression measurements

$$\log\frac{p}{1-p} = \beta \cdot f$$

or

$$p = \frac{e^{\beta \cdot f}}{1 + e^{\beta \cdot f}}$$

Use optimization methods to find β that maximizes the difference between two groups. Then, can use equation to estimate membership of a gene in a group.
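The sketch below evaluates the second equation and fits β by gradient ascent on the log-likelihood. Maximum likelihood is an assumption here (the slide only says "optimization methods"), as are the learning rate, the step count, and the toy data. The code writes p as 1/(1 + e^(-β·f)), which is algebraically the same as e^(β·f)/(1 + e^(β·f)).

```python
import math

def probability(beta, f):
    """p = e^(beta.f) / (1 + e^(beta.f)), written in its equivalent form."""
    z = sum(b * x for b, x in zip(beta, f))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(features, labels, lr=0.05, steps=200):
    """Gradient ascent on the log-likelihood (no intercept term)."""
    beta = [0.0] * len(features[0])
    for _ in range(steps):
        gradient = [0.0] * len(beta)
        for f, label in zip(features, labels):
            error = label - probability(beta, f)
            for i, x in enumerate(f):
                gradient[i] += error * x
        beta = [b + lr * g for b, g in zip(beta, gradient)]
    return beta

# Hypothetical genes measured under two conditions; 1 = in group of interest.
features = [(3.0, 1.0), (4.0, 0.0), (5.0, 1.0),
            (1.0, 3.0), (0.0, 4.0), (1.0, 5.0)]
labels = [1, 1, 1, 0, 0, 0]

beta = fit_logistic(features, labels)
for f, label in zip(features, labels):
    print(f, label, round(probability(beta, f), 3))
```

Unlike the linear model's raw score, the output is always between 0 and 1, so it can be read directly as an estimated probability of group membership.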


Classifying Lymphomas


Bayes Rule for Classification

Bayes’ Rule: p(hypothesis|data) = p(data|hypothesis) p(hypothesis) / p(data)

p(group 1|f) = p(f|group 1) p(group 1) / p(f)

p(group 1|f) = probability that gene is in group 1 given the expression data

p(f) = probability of the data

p(f|group 1) = probability of data given that gene is in group 1

p(group 1) = probability of group 1 for a given gene (prior)


Naïve Bayes

Assume all expression measurements for a gene are independent.

Assume p(f) and p(group 1) are constant.

p(f|group 1) = p(f1 & f2 … fn|group 1) = p(f1|group 1) * p(f2|group 1) * … * p(fn|group 1)

Can just multiply these probabilities (or add their logs), which are easy to compute, by counting up frequencies in the set of “known” members of group 1.

Choose a cutoff probability for saying “Group 1 member.”
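The counting step can be sketched as below. The example assumes each measurement has been discretized to "up" or "down", and it adds one pseudo-count per value so that an unseen value does not give probability zero; both choices, and the three "known" genes, are illustrative assumptions.

```python
import math
from collections import Counter

def fit_conditionals(known_members, values=("up", "down")):
    """Estimate p(fi = v | group 1) for each condition i by counting."""
    n = len(known_members)
    tables = []
    for i in range(len(known_members[0])):
        counts = Counter(member[i] for member in known_members)
        # Add-one smoothing (an assumption, not from the slide).
        tables.append({v: (counts[v] + 1) / (n + len(values)) for v in values})
    return tables

def log_likelihood(tables, f):
    """log p(f | group 1) = sum of log p(fi | group 1) under independence."""
    return sum(math.log(table[v]) for table, v in zip(tables, f))

# Hypothetical "known" members of group 1, discretized over three conditions.
known_group1 = [
    ["up", "up", "down"],
    ["up", "up", "down"],
    ["up", "down", "down"],
]
tables = fit_conditionals(known_group1)

typical = ["up", "up", "down"]
atypical = ["down", "down", "up"]
print(math.exp(log_likelihood(tables, typical)))
print(math.exp(log_likelihood(tables, atypical)))
```

A gene would be called a "Group 1 member" when its score clears a chosen cutoff; summing logs rather than multiplying probabilities avoids numerical underflow when there are many conditions.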


Naïve Bayes

A

B

If P(Red | x = A) * P(Red | y = 0) is HIGH, assign to RED


Decision Trees

Consider an n-dimensional graph of all data points (f, gene expression vectors).

Try to learn cutoff values for each fi that separate different groups.





Decision Trees

If x < A and y > B => BLUE

If y < B OR (y > B and x > A) => RED

[Figure: two-feature plot split at the cutoffs x = A and y = B]
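The two rules above translate directly into code, and a single cutoff can be learned by trying the midpoints between observed values. Both functions are illustrative sketches: `best_cutoff` is a hypothetical helper that simply minimizes the number of misclassified training points on one feature, which is one of several ways real decision-tree learners pick a split.

```python
def classify(x, y, a, b):
    """Apply the slide's rules with cutoffs A = a (on x) and B = b (on y)."""
    if x < a and y > b:
        return "BLUE"
    if y < b or (y > b and x > a):
        return "RED"
    return None  # point lies exactly on a cutoff; the rules do not cover it

def best_cutoff(values, labels, low_label, high_label):
    """Pick the cutoff on one feature that misclassifies the fewest points.

    Needs at least two distinct values in `values`.
    """
    ordered = sorted(set(values))
    candidates = [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:])]

    def errors(cut):
        return sum((low_label if v < cut else high_label) != lab
                   for v, lab in zip(values, labels))

    return min(candidates, key=errors)

print(classify(1, 5, a=3, b=3))   # left of A and above B
print(classify(5, 5, a=3, b=3))   # right of A and above B
print(classify(1, 1, a=3, b=3))   # below B

# Hypothetical x-coordinates with their group labels.
xs = [1, 2, 3, 7, 8, 9]
groups = ["BLUE", "BLUE", "BLUE", "RED", "RED", "RED"]
print(best_cutoff(xs, groups, low_label="BLUE", high_label="RED"))
```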


Support Vector Machines

Draw a line that passes close to the members of two different groups that are the most difficult to distinguish.

Label those difficult members the “support vectors.” (Remember, all points are vectors).

For a variety of reasons (discussed in the tutorial, and the Brown et al paper to some degree), this choice of line is a good one for classification, given many choices.


Support Vectors and Decision Line

A

B

(One point left out)


Support Vectors and Decision Line

A

B

(Bad point put back in…) Can penalize boundary line for bad predictions

PENALTY based on distance from line


Choose boundary line that is closest to both support vectors

1/||w||


Notes about SVMs

If the points are not easily separable in n dimensions, can add dimensions (similar to how we mapped low dimensional SOM grid points to expression dimensions).

Dot product is used as measure of distance between two vectors. But can generalize to an arbitrary function of the features (expression measurements) as discussed in Brown and associated Burges tutorial.
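Both notes can be illustrated with a standard textbook example, which is not taken from the slides: squaring the dot product of two 2-dimensional vectors gives exactly the dot product of their images in a 3-dimensional space. The function names and the map `phi` are chosen for this example only.

```python
import math

def dot(x, y):
    """Plain dot product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def phi(x):
    """Explicit map from 2 dimensions to 3: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def quadratic_kernel(x, y):
    """Generalized 'dot product': (x . y)^2, computed in the original space."""
    return dot(x, y) ** 2

x = (1.0, 2.0)
y = (3.0, 4.0)
print(dot(phi(x), phi(y)))      # dot product after adding a dimension
print(quadratic_kernel(x, y))   # same value without building phi explicitly

# Adding a dimension can make points separable: on a line, -2 and 2 cannot be
# split from 0 by one cutoff, but after x -> (x, x^2) the cutoff x^2 > 2 works.
lifted = [(v, v * v) for v in (-2.0, 0.0, 2.0)]
print([second > 2.0 for _, second in lifted])
```

Replacing the dot product with a function like `quadratic_kernel` lets the classifier work as if the extra dimensions had been added, without ever computing them.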





Other informatics issues

• Low level image processing of spots to assess amount of fluorescence.

• Need to deal with missing values (due to experimental artifacts, etc…)

• Need to decide how much of a change is significant (e.g. “2-fold increase” in expression).

• Creation of databases with the info (SMD)


Typical DNA array for yeast


Estimate Missing Values

[Figure panels: complete data set; data set with 30% of entries missing (missing values appear black); data set with missing values estimated by the KNNimpute algorithm]
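The idea behind nearest-neighbor imputation can be sketched as follows. This is a simplified illustration and not the published KNNimpute algorithm: it assumes Euclidean distance over the columns both genes have observed, uses an unweighted average of the k nearest genes, and marks missing entries with `None`. The small matrix is invented.

```python
import math

def knn_impute(matrix, k=2):
    """Fill each missing entry with the mean of that column in the k most
    similar rows (genes) that do have a value there."""
    filled = [list(row) for row in matrix]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_index, other in enumerate(matrix):
                if other_index == r or other[c] is None:
                    continue
                shared = [j for j in range(len(row))
                          if row[j] is not None and other[j] is not None]
                if not shared:
                    continue
                # Root-mean-square difference over the shared columns.
                dist = math.sqrt(sum((row[j] - other[j]) ** 2 for j in shared)
                                 / len(shared))
                candidates.append((dist, other[c]))
            candidates.sort()
            nearest = candidates[:k]
            if nearest:
                filled[r][c] = sum(v for _, v in nearest) / len(nearest)
    return filled

# Hypothetical genes x conditions matrix with one missing entry.
expression = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [5.0, 6.0, 7.0],
    [0.9, 1.9, 2.9],
]
completed = knn_impute(expression, k=2)
print(completed[1])
```

The missing value is filled from the two genes whose profiles most resemble the incomplete one, while the dissimilar third gene is ignored.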


[Figure: normalized RMS error (y-axis, 0.15 to 0.25) versus percent of entries missing (x-axis, 0 to 20) for four methods: filled with zeros, row average, SVDimpute, KNNimpute]


Choosing best “indicator” genes

Which gene or genes “predict” the class the best?

Might be a good candidate for a biomarker.

Allows you to focus attention on small number of genes, instead of large number required to get perfect discrimination between groups.


Principal Component Analysis of Conditions in Microarray Experiment

[Figure: variance (y-axis, 0 to 2.5) for principal components 1 through 7 (x-axis)]
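A plot like this shows how much variance each principal component captures. As a rough sketch of where the first of those numbers comes from, the code below finds the leading component of a small data set by power iteration on the covariance matrix. Power iteration, the starting vector, the n-1 normalization, and the toy data are all assumptions for illustration; library routines normally compute every component at once.

```python
import math

def covariance_matrix(rows):
    """Sample covariance (n - 1 normalization) between the columns of rows."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dim)]
    centered = [[r[d] - means[d] for d in range(dim)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1)
             for b in range(dim)] for a in range(dim)]

def first_principal_component(rows, iterations=200):
    """Return (direction, variance along it) via power iteration.

    Starts from the first axis, so it would fail in the unusual case where
    that axis is exactly orthogonal to the leading component.
    """
    cov = covariance_matrix(rows)
    dim = len(cov)
    v = [1.0] + [0.0] * (dim - 1)
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(dim))
                   for a in range(dim))
    return v, variance

# Hypothetical measurements where the two columns always move together.
rows = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
direction, variance = first_principal_component(rows)
print(direction, variance)
```

Here all of the variation lies along the diagonal, so a single component explains everything and a second one would have variance zero.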


PCA--first two components


PCA and clusters reported in original literature


Independent Components Analysis

Find projection where distribution maximizes a measure of non-normality.

Use kurtosis as measure of non-normality.

[Figure: two example distributions, labeled “more kurtotic” and “less kurtotic”]
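Kurtosis is straightforward to compute from the second and fourth central moments. The sketch below uses excess kurtosis (fourth standardized moment minus 3, so that a normal distribution scores 0); that choice of definition is an assumption, as are the two toy samples.

```python
def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0

# Hypothetical projections of the data onto two candidate directions.
flat = [-1.0, -1.0, 1.0, 1.0]                            # no tails at all
peaked = [-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0]       # mostly 0, rare extremes

print(excess_kurtosis(flat))
print(excess_kurtosis(peaked))
```

The second sample, where most values sit at the center and a few are extreme, scores higher; in an expression setting that corresponds to a projection in which a handful of genes stand out from the rest.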


Look for interesting genes

w'_A * After + w'_B * Before

w_A * After + w_B * Before




“Interesting” Projections

max variance = PCA
max kurtosis
max non-Gaussianity

Normal = not interesting (P.J. Huber ’85, Jones & Sibson ’87)

max information


[Figure labels: ICA; motif finding program; sites; non-sites; 500bp 5’ ORF; remainder]


Discovered MSE Promoter

>YLL004W  ORC3   1  74    ATTTGTGTCAT
>YML066C  x      1  336   TTTTGTGTCAT
>YFR028C  CDC14  1  35    TTCTGTGACTT
>YNL018C  x      2  r444  TTTTGTGGCAC
                    r484  ATTTGTGACGT
>YJL212C  x      1  r3    TTCTGTGACGT
>YNL174W  x      1  283   ATCTGTGACAT
>YBR069C  VAP1   1  350   TTTTGTGGCAT
>YDR191W  HST4   1  384   TTTAGTGACAT
>YHR124W  NDT80  2  414   TTTTGTGTCAC
                    r212  TTTTGTGTCAT

Consensus: TTTTGTGACAT

the regulatory element that recognizes this promoter... multiple elements
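The consensus line on this slide can be reproduced by taking the most frequent base at each position of the eleven listed sites. The sketch below does exactly that; picking the majority base is an assumption about how the consensus was derived, and real motif finders typically report a full position weight matrix instead.

```python
from collections import Counter

# The eleven site sequences listed on the slide.
sites = [
    "ATTTGTGTCAT", "TTTTGTGTCAT", "TTCTGTGACTT", "TTTTGTGGCAC",
    "ATTTGTGACGT", "TTCTGTGACGT", "ATCTGTGACAT", "TTTTGTGGCAT",
    "TTTAGTGACAT", "TTTTGTGTCAC", "TTTTGTGTCAT",
]

def consensus(sequences):
    """Most frequent base at each aligned position."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*sequences))

def column_counts(sequences):
    """Base counts per position, a simple count matrix for the motif."""
    return [dict(Counter(column)) for column in zip(*sequences)]

print(consensus(sites))
print(column_counts(sites)[7])   # the most variable position (8th base)
```

Positions 5 to 7 (GTG) and position 9 (C) are identical in every site, while the eighth position is split among A, T and G, which is why a single consensus string hides some of the variation.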


Imagine other array technologies

Protein chips to assess interaction of proteins (lay down proteins, and then label others, and look for binding events).


Reproducibility of data sets
http://bioinformatics.oupjournals.org/cgi/reprint/18/3/405.pdf

• Preparation of mRNA and experimental design

• Hybridization of RNA to DNA
  – Sequence-specific effects
  – Length-related effects

• Quality of spotted genes on array
  – Proper sequence spotted evenly

• Finding and digitizing spot intensities

• Comparability of experiments on same chip, experiments on different chips
