Central dogma of biology
DNA → pre-mRNA → mRNA → protein
Central dogma
DNA:         CGAACAAACCTCGAACCTGCT
mRNA:        GCU UGU UUA CGA
Polypeptide: Ala Cys Leu Arg
(Transcription: DNA → pre-mRNA/mRNA. Translation: mRNA → polypeptide.)
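The transcription and translation steps above can be sketched as string operations. This is a minimal sketch: the codon table covers only the example's four codons (a real table has 64 entries), and transcription is simplified to a base-by-base complement, ignoring strand direction.

```python
# Minimal codon table covering only the example's codons (a full table has 64).
CODON_TABLE = {"GCU": "Ala", "UGU": "Cys", "UUA": "Leu", "CGA": "Arg"}

def transcribe(dna_template: str) -> str:
    """Transcribe a DNA template strand into mRNA.
    Simplified: straight base-by-base complement (T pairs with A, so A -> U)."""
    complement = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in dna_template)

def translate(mrna: str) -> list[str]:
    """Translate an mRNA string into amino acids, codon by codon."""
    return [CODON_TABLE[mrna[i:i + 3]] for i in range(0, len(mrna), 3)]

print(translate("GCUUGUUUACGA"))  # ['Ala', 'Cys', 'Leu', 'Arg']
```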
Basic molecular biology
Transcription
End modification
Splicing
Transport
Translation
Less basic molecular biology
[Figure: two microarray platforms. Spotted cDNA arrays: cDNA clone (library) → PCR product → spotted onto ARRAY; reference and test samples labeled with Cy3 and Cy5. Oligonucleotide arrays: oligonucleotide synthesis → ARRAY; biological sample → RNA → labeled test sample.]
Ramaswamy and Golub, JCO
Microarray technology
Lockhart and Winzeler 2000
Oligonucleotide vs. cDNA arrays
Microarray technology
Yeast experiment
Microarray experiment
When the science is not well understood, resort to statistics:
Ultimate goal: discover the genetic pathways of cancers
Infer cancer genetics by analyzing microarray data from tumors
Curse of dimensionality: Far too few examples for so many dimensions to predict accurately
Immediate goal: models that discriminate tumor types or treatment outcomes, and determine the genes used in the model
Basic difficulty: few examples (20-100), high dimensionality (7,000-16,000 genes measured for each sample): an ill-posed problem
Analytic challenge
Cancer Diagnosis
Acute Myeloblastic Leukemia (AML) vs.
Acute Lymphoblastic Leukemia (ALL)
38 training examples of myeloid and lymphoblastic leukemias; Affymetrix human 6800 chip (7,128 genes, including control genes)
34 examples to test classifier
Results: 33/34 correct
[Figure: test data plotted against d, the perpendicular distance from the hyperplane.]
Cancer Classification
Coregulation: the expression of two genes must be correlated for a protein to be made, so we need to look at pairwise correlations as well as individual expression levels
Size of feature space: with 7,000 genes, the second-order feature space has about 24 million features, so it is important that the feature space is never computed explicitly
With $x = (e_{sh}, e_{TrkC})$, the second-order feature map is
$$\varphi(x) = \left(e_{sh}^{2},\; e_{TrkC}^{2},\; \sqrt{2}\, e_{sh} e_{TrkC},\; \sqrt{2}\, e_{sh},\; \sqrt{2}\, e_{TrkC},\; 1\right),$$
and the kernel is
$$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j) = (x_i \cdot x_j + 1)^2.$$
Two-gene example: expression of two genes, Sonic Hedgehog and TrkC
Coregulation and kernels
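For a second-order polynomial kernel, K(x, y) = (x · y + 1)² equals the dot product of the explicit second-order features φ(x), which is why the 24-million-dimensional feature space never needs to be computed. This can be checked numerically; the two expression values below are made up, standing in for the Sonic Hedgehog and TrkC measurements:

```python
import numpy as np

def poly_kernel(x, y):
    """Second-order polynomial kernel K(x, y) = (x . y + 1)^2."""
    return (np.dot(x, y) + 1.0) ** 2

def phi(x):
    """Explicit feature map for the two-gene case:
    (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

# Made-up expression levels for the two genes (illustrative only)
x = np.array([0.8, -1.2])
y = np.array([1.5, 0.3])
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))
```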
A nonlinear SVM helps when the most informative genes are removed; informativeness is ranked using signal-to-noise (Golub et al.).

Genes removed   errors, 1st order   2nd order   3rd order polynomial
0               1                   1           1
10              2                   1           1
20              3                   2           1
30              3                   3           2
40              3                   3           2
50              3                   2           2
100             3                   3           2
200             3                   3           3
1500            7                   7           8
Gene coregulation
Golub et al. classified 29 test points correctly and rejected 5, of which 2 were errors, using 50 genes
Need to introduce the concept of rejects to the SVM
[Figure: samples in (g1, g2) gene-expression space: Normal and Cancer regions, with a Reject band near the decision boundary.]
Rejecting samples
Estimating a CDF: the regularized solution
[Figure: estimated P(c = 1 | d) as a function of the distance d; the curve reaches .95 at d = .107, so 95% confidence (p = .05) corresponds to d = .107.]
Rejections for SVMs
Results: 31 correct, 3 rejected of which 1 is an error
[Figure: test data plotted against distance d, with the reject band marked.]
Results with rejections
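The reject rule can be sketched as a threshold on |d|, the distance to the hyperplane; the 0.107 cutoff is the 95%-confidence value from the CDF fit above, and the decision values below are made up for illustration:

```python
D_REJECT = 0.107  # distance threshold from the 95%-confidence (p = .05) fit

def classify_with_reject(d, threshold=D_REJECT):
    """Return +1/-1 by the sign of the SVM decision value d,
    or 'reject' when the sample lies within `threshold` of the hyperplane."""
    if abs(d) < threshold:
        return "reject"
    return 1 if d > 0 else -1

# Made-up decision values for illustration
decisions = [0.9, -0.4, 0.05, -0.02, 0.3]
print([classify_with_reject(d) for d in decisions])
# [1, -1, 'reject', 'reject', 1]
```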
SVMs as stated use all genes/features
Molecular biologists and oncologists are convinced that only a small subset of genes is responsible for a particular biological property, so they want the genes most important in discriminating
Practical reasons: a clinical device measuring thousands of genes is not financially practical
Possible performance improvement
Wrapper method for gene/feature selection
Gene selection
AML vs. ALL: 40 genes, 34/34 correct, 0 rejects. 5 genes, 31/31 correct, 3 rejects of which 1 is an error.
B vs. T cells for ALL: 10 genes, 33/33 correct, 0 rejects.
[Figures: test data plotted against distance d for the two tasks.]
Results with gene selection
Dataset                       Total Samples   Class 0        Class 1
Leukemia Morphology (train)   38              27 ALL         11 AML
Leukemia Morphology (test)    34              20 ALL         14 AML
Leukemia Lineage (ALL)        23              15 B-Cell      8 T-Cell
Leukemia Outcome (AML)        15              8 Low risk     7 High risk
Dataset               Total Samples   Class 0        Class 1
Lymphoma Morphology   77              19 FSC         58 DLCL
Lymphoma Outcome      58              22 Low risk    36 High risk
Brain Morphology      41              14 Glioma      27 MD
Brain Outcome         50              38 Low risk    12 High risk
Hierarchy of difficulty:
1. Histological differences: normal vs. malignant, skin vs. brain
2. Morphologies: different leukemia types, ALL vs. AML
3. Lineage: B-Cell vs. T-Cell, follicular vs. large B-cell lymphoma
4. Outcome: treatment outcome, relapse, or drug sensitivity
Molecular classification of cancer
Dataset                                  Algorithm   Total Samples   Total errors   Class 1 errors   Class 0 errors   Number Genes
Leukemia Morphology (test) AML vs ALL    SVM         35              0/35           0/21             0/14             40
                                         WV          35              2/35           1/21             1/14             50
                                         k-NN        35              3/35           1/21             2/14             10
Leukemia Lineage (ALL) B vs T            SVM         23              0/23           0/15             0/8              10
                                         WV          23              0/23           0/15             0/8              9
                                         k-NN        23              0/23           0/15             0/8              10
Lymphoma FS vs DLCL                      SVM         77              4/77           2/32             2/35             200
                                         WV          77              6/77           1/32             5/35             30
                                         k-NN        77              3/77           1/32             2/35             250
Brain MD vs Glioma                       SVM         41              1/41           1/27             0/14             100
                                         WV          41              1/41           1/27             0/14             3
                                         k-NN        41              0/41           0/27             0/14             5
Morphology classification
Dataset                          Algorithm   Total Samples   Total errors   Class 1 errors   Class 0 errors   Number Genes
Lymphoma LBC treatment outcome   SVM         58              13/58          3/32             10/26            100
                                 WV          58              15/58          5/32             10/26            12
                                 k-NN        58              15/58          8/32             7/26             15
Brain MD treatment outcome       SVM         50              7/50           6/12             1/38             50
                                 WV          50              13/50          6/12             7/38             6
                                 k-NN        50              10/50          6/12             4/38             5
Outcome classification
Error rates ignore temporal information such as when a patient dies. Survival analysis takes temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance.
[Kaplan-Meier survival curves: Lymphoma, p-val = 0.0015; Medulloblastoma, p-val = 0.00039.]
Outcome classification
14 tumor types: Breast, Prostate, Lung, Colorectal, Lymphoma, Bladder, Melanoma, Uterus, Leukemia, Renal, Pancreas, Ovary, Mesothelioma, Brain.

Abbrev   B    P    L    CR   Ly   Bl   M    U    Le   R    PA   Ov   MS   C
Total    11   10   11   13   22   11   10   10   30   11   11   11   11   20
Train    8    8    8    8    16   8    8    8    24   8    8    8    8    16
Test     3    2    3    5    6    3    2    3    6    3    3    3    3    4

Note that most of these tumors came from secondary sources and were not at the tissue of origin.
Multi tumor classification
CNS, Lymphoma, Leukemia tumors separate
Adenocarcinomas do not separate
Clustering is not accurate
[Coding example: four classes (R, Y, G, B) encoded by two classifiers]
Class   G+R   B+R
R       +1    +1
Y       -1    -1
G       +1    -1
B       -1    +1
Combination approaches: all pairs; one versus all (OVA)
Multi tumor classification
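The OVA combination can be sketched as: score a test sample with one classifier per tumor type and call the class whose classifier reports the highest prediction strength. The class names and decision values below are invented for illustration:

```python
def ova_call(decision_values):
    """Final multiclass call: the class whose one-vs-all (OVA) classifier
    outputs the highest decision value (prediction strength)."""
    return max(decision_values, key=decision_values.get)

# Invented OVA decision values for a single test sample
sample_scores = {"Breast": 1.7, "Prostate": -0.9, "CNS": -1.3, "Lymphoma": 0.2}
print(ova_call(sample_scores))  # Breast
```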
[Figure: supervised OVA methodology. A gene expression dataset feeds a bank of OVA classifiers (Breast OVA classifier, Prostate OVA classifier, ..., CNS OVA classifier); the final multiclass call for a test sample is the class with the highest OVA prediction strength.]
[Figure 2: a test sample scored against the breast-tumors vs. all-other-tumors hyperplane; decision values run from -2 to +2, and confidence grows with distance from the hyperplane, e.g. Breast (high confidence).]
Supervised methodology
[Figures: accuracy and fraction of calls vs. confidence; numbers of correct calls and errors at low vs. high confidence; and accuracy of first, top-2, and top-3 prediction calls, shown for train/cross-validation and Test 1.]
Dataset   Sample Type           Validation Method   Sample Number   Total Accuracy   High conf.: Fraction / Accuracy   Low conf.: Fraction / Accuracy
Train     Well Differentiated   Cross-val.          144             78%              80% / 90%                         20% / 28%
Test 1    Well Differentiated   Train/Test          54              78%              78% / 83%                         22% / 58%
Well differentiated tumors
Feature selection hurts performance
[Figures: accuracy and fraction of calls vs. confidence; correct calls and errors at low vs. high confidence; and accuracy of first, top-2, and top-3 prediction calls.]
Dataset   Sample Type             Validation Method   Sample Number   Total Accuracy   High conf.: Fraction / Accuracy   Low conf.: Fraction / Accuracy
Test      Poorly Differentiated   Train/test          20              30%              50% / 50%                         50% / 10%
Poorly differentiated tumors
Morphing
Talking faces
Recursive feature elimination (RFE): based upon perturbation analysis, eliminate genes that perturb the margin the least.
Optimize leave-one-out (LOO): based upon optimization of the leave-one-out error of an SVM; the leave-one-out error is unbiased.
Two feature selection algorithms
1. Solve the SVM problem for the weight vector w.
2. Rank-order the elements of w by absolute value.
3. Discard the input features/genes corresponding to those vector elements with small absolute magnitude (for example the smallest 10%).
4. Retrain the SVM on the reduced gene set and go to step (2).
Recursive feature elimination
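The four RFE steps can be sketched as follows. As an assumption, a ridge-regression classifier stands in for the linear SVM (only the train-rank-discard loop matters here), and the dataset is synthetic, with signal planted in the first two features:

```python
import numpy as np

def rfe(X, y, n_keep, drop_frac=0.1):
    """Recursive feature elimination: train a linear model, rank features
    by |w|, discard the lowest-ranked ~10%, retrain; repeat until n_keep
    features remain. A ridge classifier stands in for the linear SVM."""
    active = np.arange(X.shape[1])
    while len(active) > n_keep:
        A = X[:, active]
        # Ridge weights w = (A^T A + lam I)^-1 A^T y (stand-in for SVM weights)
        w = np.linalg.solve(A.T @ A + 1e-3 * np.eye(len(active)), A.T @ y)
        n_drop = max(1, min(int(len(active) * drop_frac), len(active) - n_keep))
        order = np.argsort(np.abs(w))        # smallest |w| first
        active = np.delete(active, order[:n_drop])
    return sorted(active.tolist())

# Synthetic data: only features 0 and 1 carry signal, the other 28 are noise
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
y = np.sign(X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(200))
kept = rfe(X, y, n_keep=2)
print(kept)
```

Because the least-informative features are discarded in small batches rather than all at once, a feature that looks weak only in the presence of many noise dimensions gets re-ranked on each retraining pass.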
Use leave-one-out (LOO) bounds for SVMs as a criterion to select features, by searching over all possible subsets of n features for the one that minimizes the bound.
When such a search is impossible because of combinatorial explosion, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables.
The rescaling can be done in the input space or in a “Principal Components” space.
Optimizing the LOO
Rescale features to minimize the LOO bound R²/M²
[Figure: two configurations in the (x1, x2) plane. Left: R²/M² > 1, the radius R of the smallest sphere enclosing the data is much larger than the margin M. Right: R²/M² = 1, where M = R.]
Pictorial demonstration
Radius-margin bound: simple to compute, continuous; very loose, but often tracks the LOO error well
Jaakkola-Haussler bound: somewhat tighter; simple to compute; discontinuous, so it needs smoothing; valid only for SVMs with no b term
Span bound: tight; complicated to compute; discontinuous, so it needs smoothing
Three LOO bounds
We add a scaling parameter σ to the SVM, which scales genes; genes corresponding to small σ_j are removed.
The SVM function has the form
$$f(x) = \sum_{i \in SV} \alpha_i y_i K_\sigma(x, x_i) + b,$$
where
$$K_\sigma(x, y) = K(\sigma \ast x, \sigma \ast y), \qquad \sigma \in \mathbb{R}^n,$$
and $u \ast v$ denotes element-by-element multiplication.
Classification function with scaling
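The scaled kernel can be sketched directly. A linear base kernel is assumed here for illustration; setting σ_j = 0 shows how a gene with zero scaling drops out of the kernel entirely:

```python
import numpy as np

def scaled_kernel(K, x, y, sigma):
    """K_sigma(x, y) = K(sigma * x, sigma * y), with '*' elementwise:
    each gene g is rescaled by sigma[g] before the base kernel is applied."""
    return K(sigma * x, sigma * y)

# Linear base kernel, assumed for illustration
linear = lambda a, b: float(np.dot(a, b))

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
sigma = np.array([1.0, 0.0, 1.0])   # sigma_1 = 0: gene 1 effectively removed
# Equals the linear kernel on the remaining genes 0 and 2
assert scaled_kernel(linear, x, y, sigma) == float(np.dot(x[[0, 2]], y[[0, 2]]))
```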
The α's are computed by maximizing the following quadratic form:
$$W^2(\sigma, \alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K_\sigma(x_i, x_j), \qquad \text{subject to } 0 \le \alpha_i \le C.$$
For computing the radius around the data, maximize
$$R^2(\sigma, \beta) = \sum_i \beta_i K_\sigma(x_i, x_i) - \sum_{i,j} \beta_i \beta_j K_\sigma(x_i, x_j), \qquad \text{subject to } \sum_i \beta_i = 1,\ \beta_i \ge 0.$$
For the variance around the data,
$$V(\sigma) = \frac{1}{\ell}\sum_i K_\sigma(x_i, x_i) - \frac{1}{\ell^2}\sum_{i,j} K_\sigma(x_i, x_j).$$
Remember the functionals to be minimized:
$$T(\sigma) = W^2(\alpha^1, \sigma)\, R^2(\beta^1, \sigma) \quad \text{or} \quad T(\sigma) = W^2(\alpha^1, \sigma)\, V(\sigma),$$
where $\alpha^1$ and $\beta^1$ are the maximizers of the corresponding quadratic forms.
SVM and other functionals
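As a sanity check on the variance functional V(σ): for a linear kernel it reduces to the ordinary (biased) total variance of the data around its mean. A sketch with synthetic data:

```python
import numpy as np

def variance_functional(K):
    """V = (1/l) sum_i K(x_i, x_i) - (1/l^2) sum_{i,j} K(x_i, x_j):
    the variance of the data around its mean, from the kernel matrix alone."""
    l = K.shape[0]
    return np.trace(K) / l - K.sum() / l**2

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))    # synthetic data, 40 samples x 5 features
K = X @ X.T                         # linear kernel matrix
# For a linear kernel, V equals the total (biased) variance over all dimensions
assert np.isclose(variance_functional(K), X.var(axis=0, ddof=0).sum())
```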
The following steps are used to compute α and σ:
1. Initialize σ = (1, ..., 1).
2. Solve the standard SVM algorithm: α(σ) = argmax_α W(σ, α).
3. Minimize the estimate of error T(σ) with respect to σ with a gradient step.
4. If a local minimum of T(σ) is not reached, go to step 3.
5. Discard dimensions corresponding to small elements in σ and return to step 2.
Algorithm
Computing gradients
The gradient with respect to the margin:
$$\frac{\partial W}{\partial \sigma_f} = -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \frac{\partial K_\sigma(x_i, x_j)}{\partial \sigma_f}.$$
The gradient with respect to the radius:
$$\frac{\partial R^2}{\partial \sigma_f} = \sum_i \beta_i \frac{\partial K_\sigma(x_i, x_i)}{\partial \sigma_f} - \sum_{i,j} \beta_i \beta_j \frac{\partial K_\sigma(x_i, x_j)}{\partial \sigma_f}.$$
The gradient with respect to the span:
$$\frac{\partial S_p^2}{\partial \sigma_f} = S_p^4 \left(\tilde{K}_{SV}^{-1}\, \frac{\partial \tilde{K}_{SV}}{\partial \sigma_f}\, \tilde{K}_{SV}^{-1}\right)_{pp},
\qquad \text{where } \tilde{K}_{SV} = \begin{pmatrix} K_{SV} & 1 \\ 1^T & 0 \end{pmatrix}
\text{ and } S_p^2 = \frac{1}{(\tilde{K}_{SV}^{-1})_{pp}}.$$
[Figures: error rate vs. number of samples. Left: linear problem with 6 relevant dimensions out of 202. Right: nonlinear problem with 2 relevant dimensions out of 52.]
Toy data