Molecular diagnosis

Florian Markowetz ([email protected])
Max Planck Institute for Molecular Genetics, Computational Diagnostics Group, Berlin, Germany
Berlin Center for Genome Based Bioinformatics

IPM workshop, Tehran, April 2005
Personalized medicine
Which disease has the patient?
Which treatment should he get?
Will he develop side-effects?
These questions
1. refer to individuals,
2. address predictive problems,
3. directly link to decisions.
We are interested in individuals — not in
gene function (that’s functional genomics).
DNA −→ RNA −→ Protein
Microarray data
www.affymetrix.com
Why use microarrays?

Two major advantages:

1. Bird's eye view: microarrays make it possible to screen thousands of genes without prior knowledge of which genes might be involved.

2. Multivariate signatures: a group of genes taken together may be a more accurate and robust indicator of patient outcome than any single gene.
Overview

1. Classification in high dimensions −→ a fight against overfitting
2. Discriminant analysis −→ Gaussian assumption, feature selection
3. Support vector machines −→ maximal margin hyperplanes, non-linear similarity measures
4. Model selection and assessment −→ traps and pitfalls, or: how to cheat
5. Interpretation of results −→ what do classifiers teach us about biology?
Molecular diagnosis = a classification problem

We measure p genes on N patients. Each microarray is a profile $x^{(i)} \in \mathbb{R}^p$. With each profile comes a label $y_i \in K = \{+1, -1\}$.

Assume a data-generating distribution Pr(X, Y), which is unknown!

What we have are samples from Pr, called a training set:

$$D = \{(x^{(1)}, y_1), \ldots, (x^{(N)}, y_N)\}$$

A classification rule $c : \mathbb{R}^p \to K$ splits $\mathbb{R}^p$ into one subspace for each class.

Challenge: find a c that classifies future patients well.
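To make the notation concrete, here is a minimal NumPy sketch of the setup (all names are illustrative, not from the slides): profiles are the rows of a matrix X, labels live in {+1, -1}, and a classification rule is just a function.

    # Minimal sketch of the classification setup (illustrative names).
    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 40, 1000                      # few patients, many genes
    X = rng.normal(size=(N, p))          # training profiles x^(i) in R^p
    y = rng.choice([-1, +1], size=N)     # class labels y_i

    def clf(x):
        """A deliberately naive rule c: R^p -> {+1, -1}."""
        return 1 if x[:10].mean() > 0 else -1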
How to measure success

A loss function quantifies the loss of classifying x to have label c(x) if the true label is y:

$$\ell(x, c(x), y) : \mathbb{R}^p \times K \times K \longrightarrow [0, \infty)$$

Risk is the expected loss over the whole population:

$$R[c] = \mathrm{E}\,\ell(X, c(X), Y) = \int \ell(x, c(x), y)\, d\mathrm{Pr}(x, y)$$
0/1-loss

The simplest loss function for classification is 0/1-loss:

$$\ell(x, c(x), y) = \begin{cases} 0 & \text{if } c(x) = y \\ 1 & \text{if } c(x) \neq y \end{cases}$$

With this loss function we get

$$R[c] = \mathrm{Pr}(\, c(X) \neq Y \,)$$
A first estimate for the risk

The empirical risk (aka training error) approximates Pr(X, Y) by the empirical distribution $\widehat{\mathrm{Pr}}(X, Y)$ of the training set:

$$\int \ell(x, c(x), y)\, d\mathrm{Pr}(x, y) \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \ell(x^{(i)}, c(x^{(i)}), y_i) \;=:\; R_{\mathrm{emp}}[c]$$

First idea: find a classifier minimizing the empirical risk!

$$\hat{c} = \operatorname*{argmin}_c \; R_{\mathrm{emp}}[c]$$
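As a sketch, the empirical risk under 0/1-loss is simply the fraction of misclassified training samples (the helper name is mine; clf is any rule $\mathbb{R}^p \to \{+1, -1\}$, such as the one above):

    # Empirical risk under 0/1-loss: the average training error.
    import numpy as np

    def empirical_risk(clf, X, y):
        predictions = np.array([clf(x) for x in X])
        return np.mean(predictions != y)   # fraction of mistakes on D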
The trivial solution

A trivial classifier with zero empirical risk is

$$c_{\mathrm{triv}}(x) = \begin{cases} y_i & \text{whenever } x = x^{(i)} \in D \\ 1 & \text{else.} \end{cases}$$

OK, this is a bit artificial. But still: in small-sample situations, learning single data points instead of general features of the data is the main problem. This is called overfitting.
From Under- to Over-fitting
Overfitting: Perfect separation of training data may not generalize
well to future patients.
Bias-variance trade-off
(Figure: bias-variance trade-off, from [4].)
How to measure model complexity

We have to restrict the set of functions to one whose capacity (or complexity) suits the amount of available training data.

A very prominent capacity concept [7, 8]: the Vapnik-Chervonenkis (VC) dimension.

Shattering points: with labels in {+1, −1} and N points, there are at most $2^N$ different labelings. A rich function class may be able to realize all of them; it is then said to shatter the N points.

The VC dimension is defined as the largest N such that there exists a set of N points the function class can shatter, and ∞ if there is no such set.
Shattering p + 1 points in p dimensions
Curse of dimensionality: if $p \gg N$, even linear methods are too complex.
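A standard result (stated here for orientation, not derived on the slides) makes this precise: affine classifiers in $\mathbb{R}^p$ have VC dimension p + 1.

    % Standard result: affine classifiers in R^p have VC dimension p + 1.
    \[
      \mathrm{VCdim}\bigl(\{\, x \mapsto \operatorname{sign}(w^{\top}x + b)
        : w \in \mathbb{R}^p,\ b \in \mathbb{R} \,\}\bigr) = p + 1 .
    \]
    % Consequence: if p >= N, any N points in general position can be
    % shattered, so zero training error alone tells us nothing about
    % generalization.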
Means to fight overfitting in high dimensions
1. Dimension reduction, e.g. principal component analysis: find the directions with highest variance in the data (see the sketch after this list).

2. Feature selection: gene-wise filtering or shrinkage.

3. Regularization: introduce additional constraints into the objective function.
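As a sketch of option 1, principal components can be computed from the singular value decomposition of the centered data matrix (illustrative code, not the speaker's):

    # Minimal PCA via the SVD: project the N x p data onto its q
    # highest-variance directions (q << p).
    import numpy as np

    def pca_reduce(X, q):
        Xc = X - X.mean(axis=0)                           # center each gene
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:q].T                              # scores on first q PCs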
Two roads to classification

1. Model class probabilities −→ the Gaussian assumption leads to discriminant analysis.

2. Model class boundaries directly −→ optimal separating hyperplanes −→ SVM.
Discriminant Analysis
Bayes classifier

Imagine we knew the data-generating distribution. To minimize the risk with 0/1-loss, we would classify a new point to the most likely class:

$$c(x) = \operatorname*{argmax}_k \; \mathrm{Pr}(Y = k \mid X = x)$$

This is known as the Bayes classifier. Its error rate is called the Bayes rate.

In real-world problems, we do not know the data-generating distribution. But we can still make an educated guess . . .
Comparing Gaussian likelihoods

Assumption: each group of patients is well described by a Normal density.

Training: estimate a mean and a covariance matrix for each group.

Prediction: assign a new patient to the group with the higher likelihood.

Constraints on the covariance structure lead to different forms of discriminant analysis.
Gaussian likelihoods

Model each class density as a multivariate Gaussian [2]:

$$f_k(x) = |2\pi\,\Sigma_k|^{-1/2} \exp\left\{ -\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right\}.$$

In comparing two classes k and l, we look at the log-ratio

$$\log \frac{\mathrm{Pr}(Y = k \mid X = x)}{\mathrm{Pr}(Y = l \mid X = x)} = \log \frac{\mathrm{Pr}(X = x \mid Y = k)\,\mathrm{Pr}(Y = k)}{\mathrm{Pr}(X = x \mid Y = l)\,\mathrm{Pr}(Y = l)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l}.$$
Quadratic and Linear Discriminant Analysis

1. Unrestricted {Σ_k} lead to quadratic discriminant analysis.

2. The special case $\Sigma_k = \Sigma$ for all k leads to convenient cancellations in the log-ratio:

$$\log \frac{\mathrm{Pr}(Y = k \mid X = x)}{\mathrm{Pr}(Y = l \mid X = x)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l} = x^T \Sigma^{-1}(\mu_k - \mu_l) - \tfrac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1}(\mu_k - \mu_l) + \log \frac{\pi_k}{\pi_l}.$$

The quadratic parts vanish; the decision boundary is linear.
Discriminant functions

Equivalent description of the decision rule: $c(x) = \operatorname*{argmax}_k \delta_k(x)$.

Quadratic discriminant analysis:

$$\delta_k^{\mathrm{QDA}}(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log \pi_k$$

Linear discriminant analysis:

$$\delta_k^{\mathrm{LDA}}(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$
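A minimal sketch of evaluating the LDA score, assuming the parameters have already been estimated (estimation follows below; the names are mine):

    # LDA discriminant score for one class; a new patient x is then
    # assigned to the class with the largest score.
    import numpy as np

    def delta_lda(x, mu_k, Sigma_inv, pi_k):
        return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)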
More constraints on Σ

Diagonal discriminant analysis constrains Σ_k to diagonal form. This means genes/features are treated as independent. Again there is a linear and a quadratic form.

Nearest centroids classification requires $\Sigma_k = \sigma_k^2 I$, where I is the identity matrix. Genes are not only independent, they also share the same variance (per class).

We will use both in the linear form, i.e. $\Sigma_k = \Sigma$ for all k.
Estimation from data

Prior:

$$\hat{\pi}_k = N_k / N$$

Class means:

$$\hat{\mu}_k = \frac{1}{N_k} \sum_{\{i:\, y_i = k\}} x_i$$

Covariance matrix:

$$\hat{\Sigma} = \frac{1}{N - 2} \sum_{k=1}^{2} \sum_{\{i:\, y_i = k\}} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
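A sketch of these plug-in estimates for the two-class case (function and variable names are mine, not the speaker's):

    # Plug-in estimation of priors, class means, and pooled covariance.
    # X is N x p; y holds labels +1/-1.
    import numpy as np

    def fit_lda(X, y):
        classes = [+1, -1]
        N, p = X.shape
        pi = np.array([np.mean(y == k) for k in classes])          # N_k / N
        mu = np.array([X[y == k].mean(axis=0) for k in classes])   # class means
        Sigma = np.zeros((p, p))
        for i, k in enumerate(classes):
            Xc = X[y == k] - mu[i]
            Sigma += Xc.T @ Xc                                     # within-class scatter
        return pi, mu, Sigma / (N - 2)                             # pooled covariance

Note that for $p \gg N$ this pooled covariance has rank at most N − 2, so it is singular and cannot be inverted; one more argument for the restricted covariance models above.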
Discriminant analysis in a nutshell

(Figure: nested model classes, from QDA and LDA down to DLDA and nearest centroids.)

Characterize each class by its mean and covariance structure:

• Quadratic D.A.: different covariances per class.
• Linear D.A.: requires the same covariance in every class.
• Diagonal linear D.A.: same diagonal covariance.
• Nearest centroids: forces the covariance to $\sigma^2 I$.
Why does discriminant analysis work?

Is it because the Gaussian assumption is always fulfilled? Not likely! The reason is more pragmatic:

1. the data can only support simple decision rules, and
2. the estimates under the Gaussian model are stable.

But we still work in very high dimensions.

Next simplification: base the classification on only a small number of genes.

Feature selection: find the most discriminative genes.
Single feature ranking

Idea: compare the difference in group means, scaled by the variance within the groups.

(Figure: per-class histograms of one gene's expression; axes: gene expression vs. frequency.)
Correlation scores

Three implementations of the mean/variance comparison: the t-statistic, the Fisher score, and the Golub score [1].

$$t = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \qquad f = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} \qquad g = \frac{\mu_1 - \mu_2}{\sigma_1 + \sigma_2}$$

We rank the genes by one of these scores, use the top k for further analysis, and discard the rest.
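All three scores are easy to compute for every gene at once; a vectorized NumPy sketch (names are mine; rows of X are patients, y holds the ±1 labels):

    # Per-gene t, Fisher, and Golub scores for a two-class problem.
    import numpy as np

    def gene_scores(X, y):
        g1, g2 = X[y == +1], X[y == -1]
        m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
        v1, v2 = g1.var(axis=0, ddof=1), g2.var(axis=0, ddof=1)
        t = (m1 - m2) / np.sqrt(v1 / len(g1) + v2 / len(g2))   # t-statistic
        f = (m1 - m2) ** 2 / (v1 + v2)                         # Fisher score
        g = (m1 - m2) / (np.sqrt(v1) + np.sqrt(v2))            # Golub score
        return t, f, g
    # Rank genes by |t| (or f, or |g|) and keep the top k.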
From filters to shrinkage

Filtering involves an arbitrary hard threshold: gene k + 1 is discarded, even if it bears no less information than gene k.

Shrinkage addresses this: continuously shrink gene contributions until only a few have influence on the classification.

Example: nearest shrunken centroids (NSC).
Nearest Shrunken Centroids
NSC: global and class centroids

For gene i: how far is the class centroid $\bar{x}_{ik}$ from the overall centroid $\bar{x}_i$, measured in units of standard deviation?

$$d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k \cdot s_i},$$

where $s_i$ is the pooled within-class standard deviation for gene i and $m_k = \sqrt{1/N_k - 1/N}$.

We can rewrite this as

$$\bar{x}_{ik} = \bar{x}_i + m_k \cdot s_i \cdot d_{ik}.$$
NSC: Shrinkage

Noisy and uninformative $\bar{x}_{ik}$ will be close to the overall mean $\bar{x}_i$. Shrink each $d_{ik}$ toward zero by soft thresholding [5, 6]:

$$d'_{ik} = \mathrm{sign}(d_{ik})\,\bigl(|d_{ik}| - \Delta\bigr)_+$$

This gives new class prototypes

$$\bar{x}'_{ik} = \bar{x}_i + m_k \cdot s_i \cdot d'_{ik}.$$
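A sketch of the shrinkage step (array names are mine): d holds one row per gene and one column per class, and ∆ is the shrinkage parameter.

    # Soft thresholding and shrunken prototypes for NSC.
    import numpy as np

    def soft_threshold(d, Delta):
        return np.sign(d) * np.maximum(np.abs(d) - Delta, 0.0)

    def shrunken_centroids(xbar, m, s, d, Delta):
        # xbar: overall centroids (genes,), m: (classes,), s: (genes,)
        d_prime = soft_threshold(d, Delta)          # shape (genes, classes)
        return xbar[:, None] + m[None, :] * s[:, None] * d_prime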
NSC: how genes vanish from the model

If the shrinkage parameter ∆ is large enough, genes are eliminated from class prediction: if ∆ causes $d_{ik}$ to shrink to zero for all classes k, then every class centroid of gene i coincides with its overall centroid, and gene i no longer contributes to the nearest-centroid computation.
Shrunken Centroids

(Figure: overall centroid and the two class centroids, shown for genes A and B.)
(Figure: scatter plot of expression of gene 1 vs. expression of gene 2.)
NSC: Discriminant scores

The discriminant function for nearest shrunken centroid classification is

$$\delta_k^{\mathrm{NSC}}(x) = x^T \Sigma^{-1} \bar{x}'_k - \tfrac{1}{2}\,\bar{x}'^{T}_{k} \Sigma^{-1} \bar{x}'_k + \log \pi_k,$$

which looks exactly like $\delta_k^{\mathrm{LDA}}(x)$ except for three differences:

1. the within-class covariance matrix Σ is diagonal,
2. shrunken centroids $\bar{x}'_k$ replace the centroids $\bar{x}_k \equiv \mu_k$, and
3. as ∆ increases, more and more genes lose discriminatory power.
Class probabilities from discriminant scores

Using the $\delta_k(x)$ we can construct estimates of the class probabilities $\mathrm{Pr}(Y = k \mid X = x)$:

$$\hat{p}_k(x) = \frac{\exp\left(-\tfrac{1}{2}\delta_k(x)\right)}{\sum_{l=1}^{K} \exp\left(-\tfrac{1}{2}\delta_l(x)\right)}$$

The monotone transformation log[p/(1 − p)] is called the logit transformation.
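A sketch of this normalization, with the usual max-subtraction so the exponentials cannot overflow (the helper name is mine):

    # Class probabilities from discriminant scores delta_1..delta_K.
    import numpy as np

    def class_probabilities(deltas):
        z = -0.5 * np.asarray(deltas, dtype=float)
        z -= z.max()               # stabilize before exponentiating
        w = np.exp(z)
        return w / w.sum()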
Shortcomings of filter and shrinkage methods

1. Highly correlated genes get similar scores but offer no new information. (See Jaeger et al. [3] for a cure.)

2. Filter and shrinkage methods work only on single genes. They do not find interactions between groups of genes.

3. Filter and shrinkage methods are only heuristics. An exhaustive search for the best subset is infeasible for more than about 30 genes.
Differential genes may not be predictive!

(Figure: two genes, each shown as per-class histograms of gene expression vs. frequency.)

The upper one is differential and predictive; the lower one is also differential, but not predictive.
Predictive genes may not be differential!
A first summary
1. Molecular Diagnosis from microarray data is a classification
problem.
2. From training data, find a classifier working well for future patients.
3. Curse of dimensionality leads to easy overfitting.
4. Thus, bias the models to be simple!
5. One example: Gaussian model with restricted covariance and gene
selection.
What’s to come
Part II will deal with
1. Support vector machines−→ Maximal margin hyperplanes, non-linear similarity measures
2. Model selection and assessment−→ Traps and pitfalls, or: How to cheat.
3. Interpretation of results−→ what do classifiers teach us about biology?
Thank you! Questions?
Acknowledgements
Thanks to MIT Press and the authors for making the figures from
Learning with Kernels available at
http://www.learning-with-kernels.org.
Thanks to Springer and the authors for making the figures from The
Elements of Statistical Learning available at
http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/.
References
[1] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537, Oct 1999.

[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2001.

[3] J. Jaeger, R. Sengupta, and W.L. Ruzzo. Improved gene selection for classification of microarrays. Pac Symp Biocomput, pages 53-64, 2003.

[4] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2002.

[5] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A, 99(10):6567-6572, May 2002.

[6] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statist. Sci., 18(1):104-117, 2003.

[7] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[8] Vladimir Vapnik. Statistical Learning Theory. Wiley, New York, 1998.