CZ5225: Modeling and Simulation in Biology
Lecture 8: Microarray disease predictor-gene selection by feature selection methods

Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://bidd.nus.edu.sg
Room 07-24, level 8, S17, National University of Singapore
Gene selection?

• All classification methods we have studied so far use all genes/features.
• Molecular biologists/oncologists seem convinced that only a small subset of genes is responsible for particular biological properties, so they want the genes that are most important in discriminating disease types and treatment outcomes.
• For practical reasons: a clinical device measuring thousands of genes is not financially practical.
Disease Example: Childhood Leukemia

• Cancer in the cells of the immune system
• Approx. 35 new cases in Denmark every year
• 50 years ago – all patients died
• Today – approx. 78% are cured
• Risk groups:
  – Standard
  – Intermediate
  – High
  – Very high
  – Extra high
• Treatment:
  – Chemotherapy
  – Bone marrow transplantation
  – Radiation
Prognostic factors:
– Immunophenotype
– Age
– Leukocyte count
– Number of chromosomes
– Translocations
– Treatment response
So, what do we do?

• Reduction of dimensions
  – Principal Component Analysis (PCA)
• Feature selection (gene selection)
  – Significant genes: t-test
  – Selection of a limited number of genes
Principal Component Analysis (PCA)
• Used for visualization of complex data
• Developed to capture as much of the variation in data as possible
• Generic features of principal components (illustrated in the sketch below):
  – summary variables
  – linear combinations of the original variables
  – uncorrelated with each other
  – capture as much of the original variance as possible
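As a minimal illustration of these properties, the sketch below runs scikit-learn's PCA on a randomly generated stand-in for an expression matrix; the 34 × 8973 shape mirrors the leukemia example that follows, but the data and variable names are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in expression matrix: rows = patients, columns = genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(34, 8973))       # 34 patients x 8973 genes

pca = PCA(n_components=2)             # keep the two top components
X2 = pca.fit_transform(X)             # 34 x 2: coordinates along PC1, PC2

# Each PC is a linear combination of the original genes; the PCs are
# mutually uncorrelated and capture as much variance as possible.
print(pca.explained_variance_ratio_)  # fraction of variance per PC
```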
Principal components
1. First principal component (PC1)
   – the direction along which there is the greatest variation
2. Second principal component (PC2)
   – the direction with the maximum variation left in the data, orthogonal to the direction (i.e. vector) of PC1
3. Third principal component (PC3)
   – the direction with the maximal variation left in the data, orthogonal to the directions (i.e. vectors) of PC1 and PC2
PCA on all Genes
Leukemia data, precursor B and T

• Plot of 34 patients, 8973 dimensions (genes) reduced to 2
Ranking of PCs and Gene Selection

[Figure: bar chart of the variance (%) explained by each of PC1–PC10; y-axis from 0 to 25%]
The t-test method

• Compares the means (μ1 and μ2) of two data sets
  – tells us if they can be assumed to be equal
• Can be used to identify significant genes
  – i.e. those that change their expression a lot! (see the sketch below)
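A minimal sketch of this idea, assuming the stand-in 34 × 8973 expression matrix from above and a made-up split into two groups of 20 and 14 patients; scipy's ttest_ind performs the per-gene comparison of means.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
X = rng.normal(size=(34, 8973))       # 34 patients x 8973 genes
y = np.array([0] * 20 + [1] * 14)     # group labels (sizes illustrative)

# Per-gene two-sample t-test comparing the two group means.
t_stat, p_val = ttest_ind(X[y == 0], X[y == 1], axis=0)

# Keep the 100 genes whose means differ most significantly.
top100 = np.argsort(p_val)[:100]
X_top = X[:, top100]                  # 34 patients x 100 genes
```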
PCA on 100 top significant genes based on t-test

• Plot of 34 patients, 100 dimensions (genes) reduced to 2
The next question: Can we classify new patients?

• Plot of 34 patients, 100 dimensions (genes) reduced to 2
• [Figure: the same PC1/PC2 plot with a new patient, P99, of unknown class]
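One way to answer this, sketched below under the assumption that the new patient is measured on the same 100 genes: project the new sample onto the PCs learned from the 34 patients and apply a simple classifier. The nearest-centroid choice is illustrative, not prescribed by the slides.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestCentroid

# Continuing the sketch above: X_top is 34 patients x 100 genes, y the labels.
pca = PCA(n_components=2).fit(X_top)
Z = pca.transform(X_top)              # training patients in PC space

clf = NearestCentroid().fit(Z, y)     # any simple classifier would do here

x_new = X_top[:1]                     # stand-in for a genuinely new patient
z_new = pca.transform(x_new)          # project onto the learned PC1/PC2
print(clf.predict(z_new))             # predicted disease type
```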
Feature Selection Problem Statement
• A process of selecting a minimum subset of features that is sufficient to construct a hypothesis consistent with the training examples (Almuallim and Dietterich, 1991)
• Selecting a minimum subset G such that P(C|G) is equal to, or as close as possible to, P(C|F) (Koller and Sahami, 1996)
– Features (genes) are scored according to their evidence of predictive power and then ranked. The top s genes with the highest scores are selected and used by the classifier; a sketch of this score-and-rank procedure follows below.
– Exhaustive search is computationally infeasible, so greedy algorithms are used instead.
– Confounding can occur in both scenarios. In regression, it is usually recommended not to include highly correlated covariates in the analysis, to avoid confounding; in feature selection for microarray classification, however, confounding is impossible to avoid.
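In scikit-learn terms, this score-rank-select filter approach might look like the sketch below (reusing the X and y stand-ins from earlier; f_classif is the ANOVA F-test, which reduces to the t-test for two classes).

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Score each gene independently, rank, and keep the top s = 100.
selector = SelectKBest(score_func=f_classif, k=100)
X_sel = selector.fit_transform(X, y)  # 34 patients x 100 genes
```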
Feature Selection Strategies

• Problems of wrapper methods:
  – Computationally expensive: for each feature subset considered, the classifier is built and evaluated.
  – Exhaustive search is impossible; greedy search only.
• Embedded methods:
  – Attempt to jointly or simultaneously train both a classifier and a feature subset.
  – Often optimize an objective function that jointly rewards accuracy of classification and penalizes use of more features.
  – Intuitively appealing.
  – Examples: nearest shrunken centroids, CART and other tree-based algorithms (see the sketch below).
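For instance, scikit-learn's NearestCentroid with a shrink_threshold implements nearest shrunken centroids; the threshold value below is an arbitrary illustration.

```python
from sklearn.neighbors import NearestCentroid

# Shrinking each class centroid toward the overall centroid drives the
# contribution of uninformative genes to zero, so feature selection is
# embedded in training the classifier itself.
nsc = NearestCentroid(shrink_threshold=0.5).fit(X, y)
print(nsc.predict(X[:5]))
```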
Feature Selection Strategies

• Example of wrapper methods
  – Recursive Feature Elimination (RFE), sketched below:
    1. Train the classifier with an SVM (or LDA).
    2. Compute the ranking criterion for all features.
    3. Remove the feature with the smallest ranking criterion.
    4. Repeat steps 1–3.
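A minimal SVM-RFE sketch with scikit-learn (the 100-gene target and the 10%-per-round elimination step are illustrative choices).

```python
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# SVM-RFE: train a linear SVM, rank genes by |w_j| (the ranking
# criterion), drop the lowest-ranked genes, and repeat.
svm = SVC(kernel="linear")
rfe = RFE(estimator=svm, n_features_to_select=100, step=0.1)
rfe.fit(X, y)                         # step=0.1: drop 10% of genes per round

X_rfe = X[:, rfe.support_]            # the 100 surviving genes
```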
Feature Ranking

• Weighting and ranking individual features
• Selecting top-ranked ones for feature selection
• Advantages
  – Efficient: O(N) in terms of dimensionality N
  – Easy to implement
• Disadvantages
  – Hard to determine the threshold
  – Unable to consider correlation between features
Leave-one-out method

Leave-one-out procedure: remove one point from the training set, train on the remaining points, and test on the left-out point. This procedure is repeated on all points:

$$L(D) = \frac{1}{\ell}\sum_{i=1}^{\ell} \theta\big(-y_i\, f^i(x_i)\big), \qquad f^i \text{ trained on } D^i = D \setminus \{(x_i, y_i)\}.$$

The leave-one-out estimator is almost unbiased:

$$E\Big[E_{(x,y)}\big[\theta(-y\, f_D(x))\big]\Big] = E\big[L(D)\big].$$
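In code, the LOO estimate might be computed as below (reusing the earlier stand-in arrays; the linear SVM is just one possible classifier).

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Train on l-1 points, test on the held-out point, repeat for all l
# points; the mean error rate is the (almost unbiased) LOO estimate.
scores = cross_val_score(SVC(kernel="linear"), X_rfe, y, cv=LeaveOneOut())
loo_error = 1.0 - scores.mean()
print(loo_error)
```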
Basic idea

• Use the leave-one-out (LOO) criterion, or an upper bound on LOO, to select features by searching over all possible subsets of n features for the ones that minimize the criterion.
• When such a search is impossible because there are too many possibilities, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables, as in the sketch below.
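The sketch below is a deliberately naive rendering of this idea: it approximates R by the largest distance to the data mean rather than the true enclosing-sphere radius, uses finite differences where the actual method computes the gradient of the bound analytically, and would be far too slow for thousands of genes.

```python
import numpy as np
from sklearn.svm import SVC

def radius_margin(X, y, sigma):
    """Approximate R^2/M^2 for a near hard-margin linear SVM on scaled data."""
    Xs = X * sigma                        # scale gene j by sigma_j
    w = SVC(kernel="linear", C=1e6).fit(Xs, y).coef_.ravel()
    M2 = 1.0 / (w @ w)                    # margin^2 = 1 / ||w||^2
    c = Xs.mean(axis=0)                   # crude sphere centre
    R2 = np.max(((Xs - c) ** 2).sum(axis=1))
    return R2 / M2

def scale_features(X, y, steps=20, lr=1e-3, eps=1e-4):
    """Finite-difference gradient descent on the R^2/M^2 bound."""
    sigma = np.ones(X.shape[1])
    for _ in range(steps):
        base = radius_margin(X, y, sigma)
        grad = np.empty_like(sigma)
        for j in range(sigma.size):
            s = sigma.copy()
            s[j] += eps
            grad[j] = (radius_margin(X, y, s) - base) / eps
        sigma = np.clip(sigma - lr * grad, 0.0, None)
    return sigma                          # keep genes with the largest sigma_j
```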
Illustration

Rescale features to minimize the LOO bound R²/M²

[Figure: two plots in the (x1, x2) plane; before rescaling, the radius R of the sphere enclosing the data exceeds the margin M (R²/M² > 1); after rescaling, M = R (R²/M² = 1).]
Three upper bounds on LOO

• Radius-margin bound: simple to compute, continuous; very loose, but often tracks LOO well
• Jaakkola-Haussler bound: somewhat tighter; simple to compute; discontinuous, so it needs to be smoothed; valid only for SVMs with no b term
• Span bound: tight as a Britney Spears outfit; complicated to compute; discontinuous, so it needs to be smoothed
Radius margin bound

Theorem 1: If the training data belong to a sphere of size R and have margin M, the following bound holds:

$$L(D) \le \frac{1}{\ell}\, R_D^2\, W^2(D) = \frac{1}{\ell}\, \frac{R_D^2}{M_D^2}.$$

This bound is derived as an application of Novikoff's theorem.
Jaakkola-Haussler bound

Theorem 2: For SVMs without a bias term b, the following bound holds:

$$L(D) \le \frac{1}{\ell}\sum_{i=1}^{\ell} \theta\big(\alpha_i K(x_i, x_i) - 1\big).$$

This bound is based upon the following inequality:

$$y_i\big(f(x_i) - f^i(x_i)\big) \le \alpha_i\, K(x_i, x_i).$$
Span bound

Theorem 3: Under the assumption that the set of support vectors does not change when removing example i, the following bound holds:

$$L(D) \le \frac{1}{\ell}\sum_{i=1}^{\ell} \theta\big(\alpha_i S_i^2 - 1\big),$$

where S_i is the distance between the point Φ(x_i) and the set Λ_i,

$$\Lambda_i = \Big\{ \sum_{j \in SV,\, j \ne i} \lambda_j\, \Phi(x_j) \;:\; \sum_{j} \lambda_j = 1 \Big\}.$$

This bound is derived in a similar fashion to the Jaakkola-Haussler bound, except that the bias term b is included. It is based upon the inequality

$$y_i\big(f(x_i) - f^i(x_i)\big) \le \alpha_i S_i^2.$$
We add a scaling parameter σ to the SVM, which scales the genes; genes corresponding to small σ_j are removed:

$$f(x) = \sum_{i \in SV} \alpha_i\, y_i\, K_\sigma(x_i, x) + b, \qquad K_\sigma(x, y) = K(\sigma * x,\, \sigma * y), \quad \sigma \in \mathbb{R}^n,$$

where u * v denotes element-by-element multiplication.
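A sketch of such a scaled kernel with scikit-learn, assuming an RBF base kernel; the base kernel and the cut-off on σ_j are illustrative, and in practice σ is obtained by descending a LOO bound as described above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def scaled_kernel(sigma):
    """K_sigma(x, y) = K(sigma * x, sigma * y), '*' element-wise."""
    def K(X, Y):
        return rbf_kernel(X * sigma, Y * sigma)
    return K

sigma = np.ones(X.shape[1])           # stand-in; learned via the LOO bound
svm = SVC(kernel=scaled_kernel(sigma)).fit(X, y)

# Genes whose sigma_j has shrunk toward zero no longer influence the
# kernel and can be removed.
keep = sigma > 1e-3
```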