Molecular diagnosis

Florian Markowetz ([email protected])
Max Planck Institute for Molecular Genetics, Computational Diagnostics Group, Berlin, Germany
Berlin Center for Genome Based Bioinformatics

IPM workshop, Tehran, April 2005
Personalized medicine
Which disease has the patient?
Which treatment should he get?
Will he develop side-effects?
These questions
1. refer to individuals,
2. address predictive problems,
3. directly link to decisions.
We are interested in individuals — not in
gene function (that’s functional genomics).
DNA −→ RNA −→ Protein
Microarray data
www.affymetrix.com
Why use microarrays?

Two major advantages:

1. Bird's eye view: microarrays make it possible to screen thousands of genes without prior knowledge of which genes might be involved.

2. Multivariate signatures: a group of genes taken together may be a more accurate and robust indicator of patient outcome than any single gene.
Overview

1. Classification in high dimensions −→ a fight against overfitting
2. Discriminant analysis −→ Gaussian assumption, feature selection
3. Support vector machines −→ maximal margin hyperplanes, non-linear similarity measures
4. Model selection and assessment −→ traps and pitfalls, or: how to cheat
5. Interpretation of results −→ what do classifiers teach us about biology?
Molecular diagnosis = a classification problem

We measure p genes on N patients. Each microarray is a profile $x^{(i)} \in \mathbb{R}^p$. With each profile comes a label $y_i \in K = \{+1, -1\}$.

Assume a data-generating distribution Pr(X, Y), which is unknown!

What we have are samples from Pr, called a training set:

$$D = \{(x^{(1)}, y_1), \ldots, (x^{(N)}, y_N)\}$$

A classification rule $c : \mathbb{R}^p \to K$ splits $\mathbb{R}^p$ into one subspace for each class.

Challenge: find a c that classifies future patients well.
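To make the notation concrete, here is a minimal NumPy sketch of the setup (all names are illustrative, not from the slides): profiles are the rows of a matrix X, labels live in {+1, -1}, and a classification rule is just a function.

    # Minimal sketch of the classification setup (illustrative names).
    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 40, 1000                      # few patients, many genes
    X = rng.normal(size=(N, p))          # training profiles x^(i) in R^p
    y = rng.choice([-1, +1], size=N)     # class labels y_i

    def clf(x):
        """A deliberately naive rule c: R^p -> {+1, -1}."""
        return 1 if x[:10].mean() > 0 else -1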
How to measure success

A loss function quantifies the loss of classifying x to have label c(x) if the true label is y:

$$\ell(x, c(x), y) : \mathbb{R}^p \times K \times K \longrightarrow [0, \infty)$$

Risk is the expected loss over the whole population:

$$R[c] = \mathrm{E}\,\ell(X, c(X), Y) = \int \ell(x, c(x), y)\, d\mathrm{Pr}(x, y)$$
0/1-loss

The simplest loss function for classification is 0/1-loss:

$$\ell(x, c(x), y) = \begin{cases} 0 & \text{if } c(x) = y \\ 1 & \text{if } c(x) \neq y \end{cases}$$

With this loss function we get

$$R[c] = \mathrm{Pr}(\, c(X) \neq Y \,)$$
A first estimate for the risk

The empirical risk (aka training error) approximates Pr(X, Y) by the empirical distribution $\widehat{\mathrm{Pr}}(X, Y)$ of the training set:

$$\int \ell(x, c(x), y)\, d\mathrm{Pr}(x, y) \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \ell(x^{(i)}, c(x^{(i)}), y_i) \;=:\; R_{\mathrm{emp}}[c]$$

First idea: find a classifier minimizing the empirical risk!

$$\hat{c} = \operatorname*{argmin}_c \; R_{\mathrm{emp}}[c]$$
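As a sketch, the empirical risk under 0/1-loss is simply the fraction of misclassified training samples (the helper name is mine; clf is any rule $\mathbb{R}^p \to \{+1, -1\}$, such as the one above):

    # Empirical risk under 0/1-loss: the average training error.
    import numpy as np

    def empirical_risk(clf, X, y):
        predictions = np.array([clf(x) for x in X])
        return np.mean(predictions != y)   # fraction of mistakes on D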
The trivial solution

A trivial classifier with zero empirical risk is

$$c_{\mathrm{triv}}(x) = \begin{cases} y_i & \text{whenever } x = x^{(i)} \in D \\ 1 & \text{else.} \end{cases}$$

OK, this is a bit artificial. But still: in small-sample situations, learning single data points instead of general features of the data is the main problem. This is called overfitting.
From Under- to Over-fitting
Overfitting: Perfect separation of training data may not generalize
well to future patients.
Bias-variance trade-off
(Figure: bias-variance trade-off, from [4].)
How to measure model complexity

We have to restrict the set of functions to one whose capacity (or complexity) suits the amount of available training data.

A very prominent capacity concept [7, 8]: the Vapnik-Chervonenkis (VC) dimension.

Shattering points: with labels in {+1, −1} and N points, there are at most $2^N$ different labelings. A rich function class may be able to realize all of them; it is then said to shatter the N points.

The VC dimension is defined as the largest N such that there exists a set of N points the function class can shatter, and ∞ if there is no such set.
Shattering p + 1 points in p dimensions
Curse of dimensionality: if $p \gg N$, even linear methods are too complex.
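A standard result (stated here for orientation, not derived on the slides) makes this precise: affine classifiers in $\mathbb{R}^p$ have VC dimension p + 1.

    % Standard result: affine classifiers in R^p have VC dimension p + 1.
    \[
      \mathrm{VCdim}\bigl(\{\, x \mapsto \operatorname{sign}(w^{\top}x + b)
        : w \in \mathbb{R}^p,\ b \in \mathbb{R} \,\}\bigr) = p + 1 .
    \]
    % Consequence: if p >= N, any N points in general position can be
    % shattered, so zero training error alone tells us nothing about
    % generalization.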
Means to fight overfitting in high dimensions
1. Dimension reduction, e.g. principal component analysis: find the directions with highest variance in the data (see the sketch after this list).

2. Feature selection: gene-wise filtering or shrinkage.

3. Regularization: introduce additional constraints into the objective function.
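As a sketch of option 1, principal components can be computed from the singular value decomposition of the centered data matrix (illustrative code, not the speaker's):

    # Minimal PCA via the SVD: project the N x p data onto its q
    # highest-variance directions (q << p).
    import numpy as np

    def pca_reduce(X, q):
        Xc = X - X.mean(axis=0)                           # center each gene
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:q].T                              # scores on first q PCs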
Two roads to classification

1. Model class probabilities −→ the Gaussian assumption leads to discriminant analysis.

2. Model class boundaries directly −→ optimal separating hyperplanes −→ SVM.
Discriminant Analysis
Bayes classifier

Imagine we knew the data-generating distribution. To minimize the risk with 0/1-loss, we would classify a new point to the most likely class:

$$c(x) = \operatorname*{argmax}_k \; \mathrm{Pr}(Y = k \mid X = x)$$

This is known as the Bayes classifier. Its error rate is called the Bayes rate.

In real-world problems, we do not know the data-generating distribution. But we can still make an educated guess . . .
Comparing Gaussian likelihoods

Assumption: each group of patients is well described by a Normal density.

Training: estimate a mean and a covariance matrix for each group.

Prediction: assign a new patient to the group with the higher likelihood.

Constraints on the covariance structure lead to different forms of discriminant analysis.
Gaussian likelihoods

Model each class density as a multivariate Gaussian [2]:

$$f_k(x) = |2\pi\,\Sigma_k|^{-1/2} \exp\left\{ -\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right\}.$$

In comparing two classes k and l, we look at the log-ratio

$$\log \frac{\mathrm{Pr}(Y = k \mid X = x)}{\mathrm{Pr}(Y = l \mid X = x)} = \log \frac{\mathrm{Pr}(X = x \mid Y = k)\,\mathrm{Pr}(Y = k)}{\mathrm{Pr}(X = x \mid Y = l)\,\mathrm{Pr}(Y = l)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l}.$$
Quadratic and Linear Discriminant Analysis

1. Unrestricted {Σ_k} lead to quadratic discriminant analysis.

2. The special case $\Sigma_k = \Sigma$ for all k leads to convenient cancellations in the log-ratio:

$$\log \frac{\mathrm{Pr}(Y = k \mid X = x)}{\mathrm{Pr}(Y = l \mid X = x)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l} = x^T \Sigma^{-1}(\mu_k - \mu_l) - \tfrac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1}(\mu_k - \mu_l) + \log \frac{\pi_k}{\pi_l}.$$

The quadratic parts vanish; the decision boundary is linear.
Discriminant functions

Equivalent description of the decision rule: $c(x) = \operatorname*{argmax}_k \delta_k(x)$.

Quadratic discriminant analysis:

$$\delta_k^{\mathrm{QDA}}(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log \pi_k$$

Linear discriminant analysis:

$$\delta_k^{\mathrm{LDA}}(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$
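A minimal sketch of evaluating the LDA score, assuming the parameters have already been estimated (estimation follows below; the names are mine):

    # LDA discriminant score for one class; a new patient x is then
    # assigned to the class with the largest score.
    import numpy as np

    def delta_lda(x, mu_k, Sigma_inv, pi_k):
        return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)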
More constraints on Σ

Diagonal discriminant analysis constrains Σ_k to diagonal form. This means genes/features are treated as independent. Again there is a linear and a quadratic form.

Nearest centroids classification requires $\Sigma_k = \sigma_k^2 I$, where I is the identity matrix. Genes are not only independent, they also share the same variance (per class).

We will use both in the linear form, i.e. $\Sigma_k = \Sigma$ for all k.
Estimation from data

Prior:

$$\hat{\pi}_k = N_k / N$$

Class means:

$$\hat{\mu}_k = \frac{1}{N_k} \sum_{\{i:\, y_i = k\}} x_i$$

Covariance matrix:

$$\hat{\Sigma} = \frac{1}{N - 2} \sum_{k=1}^{2} \sum_{\{i:\, y_i = k\}} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
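A sketch of these plug-in estimates for the two-class case (function and variable names are mine, not the speaker's):

    # Plug-in estimation of priors, class means, and pooled covariance.
    # X is N x p; y holds labels +1/-1.
    import numpy as np

    def fit_lda(X, y):
        classes = [+1, -1]
        N, p = X.shape
        pi = np.array([np.mean(y == k) for k in classes])          # N_k / N
        mu = np.array([X[y == k].mean(axis=0) for k in classes])   # class means
        Sigma = np.zeros((p, p))
        for i, k in enumerate(classes):
            Xc = X[y == k] - mu[i]
            Sigma += Xc.T @ Xc                                     # within-class scatter
        return pi, mu, Sigma / (N - 2)                             # pooled covariance

Note that for $p \gg N$ this pooled covariance has rank at most N − 2, so it is singular and cannot be inverted; one more argument for the restricted covariance models above.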
Discriminant analysis in a nutshell

(Figure: nested model classes, from QDA and LDA down to DLDA and nearest centroids.)

Characterize each class by its mean and covariance structure:

• Quadratic D.A.: different covariances per class.
• Linear D.A.: requires the same covariance in every class.
• Diagonal linear D.A.: same diagonal covariance.
• Nearest centroids: forces the covariance to $\sigma^2 I$.
Why does discriminant analysis work?

Is it because the Gaussian assumption is always fulfilled? Not likely! The reason is more pragmatic:

1. the data can only support simple decision rules, and
2. the estimates under the Gaussian model are stable.

But we still work in very high dimensions.

Next simplification: base the classification on only a small number of genes.

Feature selection: find the most discriminative genes.
Single feature ranking

Idea: compare the difference in group means, scaled by the variance within the groups.

(Figure: per-class histograms of one gene's expression; axes: gene expression vs. frequency.)
Correlation scores

Three implementations of the mean/variance comparison: the t-statistic, the Fisher score, and the Golub score [1].

$$t = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \qquad f = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} \qquad g = \frac{\mu_1 - \mu_2}{\sigma_1 + \sigma_2}$$

We rank the genes by one of these scores, use the top k for further analysis, and discard the rest.
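All three scores are easy to compute for every gene at once; a vectorized NumPy sketch (names are mine; rows of X are patients, y holds the ±1 labels):

    # Per-gene t, Fisher, and Golub scores for a two-class problem.
    import numpy as np

    def gene_scores(X, y):
        g1, g2 = X[y == +1], X[y == -1]
        m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
        v1, v2 = g1.var(axis=0, ddof=1), g2.var(axis=0, ddof=1)
        t = (m1 - m2) / np.sqrt(v1 / len(g1) + v2 / len(g2))   # t-statistic
        f = (m1 - m2) ** 2 / (v1 + v2)                         # Fisher score
        g = (m1 - m2) / (np.sqrt(v1) + np.sqrt(v2))            # Golub score
        return t, f, g
    # Rank genes by |t| (or f, or |g|) and keep the top k.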
From filters to shrinkage

Filtering involves an arbitrary hard threshold: gene k + 1 is discarded, even if it bears no less information than gene k.

Shrinkage addresses this: continuously shrink gene contributions until only a few have influence on the classification.

Example: nearest shrunken centroids (NSC).
Nearest Shrunken Centroids
NSC: global and class centroids

For gene i: how far is the class centroid $\bar{x}_{ik}$ from the overall centroid $\bar{x}_i$, measured in units of standard deviation?

$$d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k \cdot s_i},$$

where $s_i$ is the pooled within-class standard deviation for gene i and $m_k = \sqrt{1/N_k - 1/N}$.

We can rewrite this as

$$\bar{x}_{ik} = \bar{x}_i + m_k \cdot s_i \cdot d_{ik}.$$
NSC: Shrinkage

Noisy and uninformative $\bar{x}_{ik}$ will be close to the overall mean $\bar{x}_i$. Shrink each $d_{ik}$ toward zero by soft thresholding [5, 6]:

$$d'_{ik} = \mathrm{sign}(d_{ik})\,\bigl(|d_{ik}| - \Delta\bigr)_+$$

This gives new class prototypes

$$\bar{x}'_{ik} = \bar{x}_i + m_k \cdot s_i \cdot d'_{ik}.$$
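A sketch of the shrinkage step (array names are mine): d holds one row per gene and one column per class, and ∆ is the shrinkage parameter.

    # Soft thresholding and shrunken prototypes for NSC.
    import numpy as np

    def soft_threshold(d, Delta):
        return np.sign(d) * np.maximum(np.abs(d) - Delta, 0.0)

    def shrunken_centroids(xbar, m, s, d, Delta):
        # xbar: overall centroids (genes,), m: (classes,), s: (genes,)
        d_prime = soft_threshold(d, Delta)          # shape (genes, classes)
        return xbar[:, None] + m[None, :] * s[:, None] * d_prime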
NSC: how genes vanish from the model

If the shrinkage parameter ∆ is large enough, genes are eliminated from class prediction: if ∆ causes $d_{ik}$ to shrink to zero for all classes k, then every class centroid of gene i coincides with its overall centroid, and gene i no longer contributes to the nearest-centroid computation.
Shrunken Centroids

(Figure: overall centroid and the two class centroids, shown for genes A and B.)
(Figure: scatter plot of expression of gene 1 vs. expression of gene 2.)
NSC: Discriminant scores

The discriminant function for nearest shrunken centroid classification is

$$\delta_k^{\mathrm{NSC}}(x) = x^T \Sigma^{-1} \bar{x}'_k - \tfrac{1}{2}\,\bar{x}'^{T}_{k} \Sigma^{-1} \bar{x}'_k + \log \pi_k,$$

which looks exactly like $\delta_k^{\mathrm{LDA}}(x)$ except for three differences:

1. the within-class covariance matrix Σ is diagonal,
2. shrunken centroids $\bar{x}'_k$ replace the centroids $\bar{x}_k \equiv \mu_k$, and
3. as ∆ increases, more and more genes lose discriminatory power.
Class probabilities from discriminant scores

Using the $\delta_k(x)$ we can construct estimates of the class probabilities $\mathrm{Pr}(Y = k \mid X = x)$:

$$\hat{p}_k(x) = \frac{\exp\left(-\tfrac{1}{2}\delta_k(x)\right)}{\sum_{l=1}^{K} \exp\left(-\tfrac{1}{2}\delta_l(x)\right)}$$

The monotone transformation log[p/(1 − p)] is called the logit transformation.
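A sketch of this normalization, with the usual max-subtraction so the exponentials cannot overflow (the helper name is mine):

    # Class probabilities from discriminant scores delta_1..delta_K.
    import numpy as np

    def class_probabilities(deltas):
        z = -0.5 * np.asarray(deltas, dtype=float)
        z -= z.max()               # stabilize before exponentiating
        w = np.exp(z)
        return w / w.sum()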
Shortcomings of filter and shrinkage methods

1. Highly correlated genes get similar scores but offer no new information. (See Jaeger et al. [3] for a cure.)

2. Filter and shrinkage methods work only on single genes. They do not find interactions between groups of genes.

3. Filter and shrinkage methods are only heuristics. An exhaustive search for the best subset is infeasible for more than about 30 genes.
Differential genes may not be predictive!

(Figure: two genes, each shown as per-class histograms of gene expression vs. frequency.)

The upper one is differential and predictive; the lower one is also differential, but not predictive.
Predictive genes may not be differential!
A first summary
1. Molecular Diagnosis from microarray data is a classification
problem.
2. From training data, find a classifier working well for future patients.
3. Curse of dimensionality leads to easy overfitting.
4. Thus, bias the models to be simple!
5. One example: Gaussian model with restricted covariance and gene
selection.
What’s to come
Part II will deal with
1. Support vector machines−→ Maximal margin hyperplanes, non-linear similarity measures
2. Model selection and assessment−→ Traps and pitfalls, or: How to cheat.
3. Interpretation of results−→ what do classifiers teach us about biology?
Thank you! Questions?
Acknowledgements
Thanks to MIT Press and the authors for making the figures from
Learning with Kernels available at
http://www.learning-with-kernels.org.
Thanks to Springer and the authors for making the figures from The
Elements of Statistical Learning available at
http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/.
References
[1] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537, Oct 1999.

[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2001.

[3] J. Jaeger, R. Sengupta, and W.L. Ruzzo. Improved gene selection for classification of microarrays. Pac Symp Biocomput, pages 53-64, 2003.

[4] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2002.

[5] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A, 99(10):6567-6572, May 2002.

[6] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statist. Sci., 18(1):104-117, 2003.

[7] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[8] Vladimir Vapnik. Statistical Learning Theory. Wiley, New York, 1998.