CZ5225: Modeling and Simulation in Biology
Lecture 8: Microarray disease predictor-gene selection by feature selection methods

Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://bidd.nus.edu.sg
Room 07-24, level 8, S17, National University of Singapore
Gene selection?

• All classification methods we have studied so far use all genes/features.
• Molecular biologists/oncologists seem convinced that only a small subset of genes is responsible for particular biological properties, so they want the genes that are most important in discriminating disease types and treatment outcomes.
• For practical reasons: a clinical device measuring thousands of genes is not financially practical.
Disease Example: Childhood Leukemia

• Cancer in the cells of the immune system
• Approx. 35 new cases in Denmark every year
• 50 years ago – all patients died
• Today – approx. 78% are cured
• Risk groups:
  – Standard
  – Intermediate
  – High
  – Very high
  – Extra high
• Treatment:
  – Chemotherapy
  – Bone marrow transplantation
  – Radiation
Prognostic factors:
– Immunophenotype
– Age
– Leukocyte count
– Number of chromosomes
– Translocations
– Treatment response
So, what do we do?

• Reduction of dimensions
  – Principal Component Analysis (PCA)
• Feature selection (gene selection)
  – Significant genes: t-test
  – Selection of a limited number of genes
Principal Component Analysis (PCA)
• Used for visualization of complex data
• Developed to capture as much of the variation in data as possible
• Generic features of principal components (illustrated in the sketch below):
  – summary variables
  – linear combinations of the original variables
  – uncorrelated with each other
  – capture as much of the original variance as possible
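As a minimal illustration of these properties, the sketch below runs scikit-learn's PCA on a randomly generated stand-in for an expression matrix; the 34 × 8973 shape mirrors the leukemia example that follows, but the data and variable names are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in expression matrix: rows = patients, columns = genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(34, 8973))       # 34 patients x 8973 genes

pca = PCA(n_components=2)             # keep the two top components
X2 = pca.fit_transform(X)             # 34 x 2: coordinates along PC1, PC2

# Each PC is a linear combination of the original genes; the PCs are
# mutually uncorrelated and capture as much variance as possible.
print(pca.explained_variance_ratio_)  # fraction of variance per PC
```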
Principal components
1. First principal component (PC1)
   – the direction along which there is the greatest variation
2. Second principal component (PC2)
   – the direction with the maximum variation left in the data, orthogonal to the direction (i.e. vector) of PC1
3. Third principal component (PC3)
   – the direction with the maximal variation left in the data, orthogonal to the directions (i.e. vectors) of PC1 and PC2
PCA on all Genes
Leukemia data, precursor B and T

• Plot of 34 patients, 8973 dimensions (genes) reduced to 2
Ranking of PCs and Gene Selection

[Figure: bar chart of the variance (%) explained by each of PC1–PC10; y-axis from 0 to 25%]
The t-test method

• Compares the means (μ1 and μ2) of two data sets
  – tells us if they can be assumed to be equal
• Can be used to identify significant genes
  – i.e. those that change their expression a lot! (see the sketch below)
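A minimal sketch of this idea, assuming the stand-in 34 × 8973 expression matrix from above and a made-up split into two groups of 20 and 14 patients; scipy's ttest_ind performs the per-gene comparison of means.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
X = rng.normal(size=(34, 8973))       # 34 patients x 8973 genes
y = np.array([0] * 20 + [1] * 14)     # group labels (sizes illustrative)

# Per-gene two-sample t-test comparing the two group means.
t_stat, p_val = ttest_ind(X[y == 0], X[y == 1], axis=0)

# Keep the 100 genes whose means differ most significantly.
top100 = np.argsort(p_val)[:100]
X_top = X[:, top100]                  # 34 patients x 100 genes
```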
PCA on 100 top significant genes based on t-test

• Plot of 34 patients, 100 dimensions (genes) reduced to 2
The next question: Can we classify new patients?

• Plot of 34 patients, 100 dimensions (genes) reduced to 2
• [Figure: the same PC1/PC2 plot with a new patient, P99, of unknown class]
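One way to answer this, sketched below under the assumption that the new patient is measured on the same 100 genes: project the new sample onto the PCs learned from the 34 patients and apply a simple classifier. The nearest-centroid choice is illustrative, not prescribed by the slides.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestCentroid

# Continuing the sketch above: X_top is 34 patients x 100 genes, y the labels.
pca = PCA(n_components=2).fit(X_top)
Z = pca.transform(X_top)              # training patients in PC space

clf = NearestCentroid().fit(Z, y)     # any simple classifier would do here

x_new = X_top[:1]                     # stand-in for a genuinely new patient
z_new = pca.transform(x_new)          # project onto the learned PC1/PC2
print(clf.predict(z_new))             # predicted disease type
```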
Feature Selection Problem Statement
• A process of selecting a minimum subset of features that is sufficient to construct a hypothesis consistent with the training examples (Almuallim and Dietterich, 1991)
• Selecting a minimum subset G such that P(C|G) is equal to, or as close as possible to, P(C|F) (Koller and Sahami, 1996)
– Features (genes) are scored according to their evidence of predictive power and then ranked. The top s genes with the highest scores are selected and used by the classifier; a sketch of this score-and-rank procedure follows below.
– Exhaustive search is computationally infeasible, so greedy algorithms are used instead.
– Confounding can occur in both scenarios. In regression, it is usually recommended not to include highly correlated covariates in the analysis, to avoid confounding; in feature selection for microarray classification, however, confounding is impossible to avoid.
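In scikit-learn terms, this score-rank-select filter approach might look like the sketch below (reusing the X and y stand-ins from earlier; f_classif is the ANOVA F-test, which reduces to the t-test for two classes).

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Score each gene independently, rank, and keep the top s = 100.
selector = SelectKBest(score_func=f_classif, k=100)
X_sel = selector.fit_transform(X, y)  # 34 patients x 100 genes
```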
Feature Selection Strategies

• Problems of wrapper methods:
  – Computationally expensive: for each feature subset considered, the classifier is built and evaluated.
  – Exhaustive search is impossible; greedy search only.
• Embedded methods:
  – Attempt to jointly or simultaneously train both a classifier and a feature subset.
  – Often optimize an objective function that jointly rewards accuracy of classification and penalizes use of more features.
  – Intuitively appealing.
  – Examples: nearest shrunken centroids, CART and other tree-based algorithms (see the sketch below).
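For instance, scikit-learn's NearestCentroid with a shrink_threshold implements nearest shrunken centroids; the threshold value below is an arbitrary illustration.

```python
from sklearn.neighbors import NearestCentroid

# Shrinking each class centroid toward the overall centroid drives the
# contribution of uninformative genes to zero, so feature selection is
# embedded in training the classifier itself.
nsc = NearestCentroid(shrink_threshold=0.5).fit(X, y)
print(nsc.predict(X[:5]))
```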
Feature Selection Strategies

• Example of wrapper methods
  – Recursive Feature Elimination (RFE), sketched below:
    1. Train the classifier with an SVM (or LDA).
    2. Compute the ranking criterion for all features.
    3. Remove the feature with the smallest ranking criterion.
    4. Repeat steps 1–3.
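A minimal SVM-RFE sketch with scikit-learn (the 100-gene target and the 10%-per-round elimination step are illustrative choices).

```python
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# SVM-RFE: train a linear SVM, rank genes by |w_j| (the ranking
# criterion), drop the lowest-ranked genes, and repeat.
svm = SVC(kernel="linear")
rfe = RFE(estimator=svm, n_features_to_select=100, step=0.1)
rfe.fit(X, y)                         # step=0.1: drop 10% of genes per round

X_rfe = X[:, rfe.support_]            # the 100 surviving genes
```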
Feature Ranking

• Weighting and ranking individual features
• Selecting top-ranked ones for feature selection
• Advantages
  – Efficient: O(N) in terms of dimensionality N
  – Easy to implement
• Disadvantages
  – Hard to determine the threshold
  – Unable to consider correlation between features
Leave-one-out method

Leave-one-out procedure: remove one point from the training set, train on the remaining points, and test on the left-out point. This procedure is repeated on all points:

$$L(D) = \frac{1}{\ell}\sum_{i=1}^{\ell} \theta\big(-y_i\, f^i(x_i)\big), \qquad f^i \text{ trained on } D^i = D \setminus \{(x_i, y_i)\}.$$

The leave-one-out estimator is almost unbiased:

$$E\Big[E_{(x,y)}\big[\theta(-y\, f_D(x))\big]\Big] = E\big[L(D)\big].$$
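In code, the LOO estimate might be computed as below (reusing the earlier stand-in arrays; the linear SVM is just one possible classifier).

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Train on l-1 points, test on the held-out point, repeat for all l
# points; the mean error rate is the (almost unbiased) LOO estimate.
scores = cross_val_score(SVC(kernel="linear"), X_rfe, y, cv=LeaveOneOut())
loo_error = 1.0 - scores.mean()
print(loo_error)
```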
Basic idea

• Use the leave-one-out (LOO) criterion, or an upper bound on LOO, to select features by searching over all possible subsets of n features for the ones that minimize the criterion.
• When such a search is impossible because there are too many possibilities, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables, as in the sketch below.
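The sketch below is a deliberately naive rendering of this idea: it approximates R by the largest distance to the data mean rather than the true enclosing-sphere radius, uses finite differences where the actual method computes the gradient of the bound analytically, and would be far too slow for thousands of genes.

```python
import numpy as np
from sklearn.svm import SVC

def radius_margin(X, y, sigma):
    """Approximate R^2/M^2 for a near hard-margin linear SVM on scaled data."""
    Xs = X * sigma                        # scale gene j by sigma_j
    w = SVC(kernel="linear", C=1e6).fit(Xs, y).coef_.ravel()
    M2 = 1.0 / (w @ w)                    # margin^2 = 1 / ||w||^2
    c = Xs.mean(axis=0)                   # crude sphere centre
    R2 = np.max(((Xs - c) ** 2).sum(axis=1))
    return R2 / M2

def scale_features(X, y, steps=20, lr=1e-3, eps=1e-4):
    """Finite-difference gradient descent on the R^2/M^2 bound."""
    sigma = np.ones(X.shape[1])
    for _ in range(steps):
        base = radius_margin(X, y, sigma)
        grad = np.empty_like(sigma)
        for j in range(sigma.size):
            s = sigma.copy()
            s[j] += eps
            grad[j] = (radius_margin(X, y, s) - base) / eps
        sigma = np.clip(sigma - lr * grad, 0.0, None)
    return sigma                          # keep genes with the largest sigma_j
```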
Illustration

Rescale features to minimize the LOO bound R²/M²

[Figure: two plots in the (x1, x2) plane; before rescaling, the radius R of the sphere enclosing the data exceeds the margin M (R²/M² > 1); after rescaling, M = R (R²/M² = 1).]
Three upper bounds on LOO

• Radius-margin bound: simple to compute, continuous; very loose, but often tracks LOO well
• Jaakkola-Haussler bound: somewhat tighter; simple to compute; discontinuous, so it needs to be smoothed; valid only for SVMs with no b term
• Span bound: tight as a Britney Spears outfit; complicated to compute; discontinuous, so it needs to be smoothed
Radius margin bound

Theorem 1: If the training data belong to a sphere of size R and have margin M, the following bound holds:

$$L(D) \le \frac{1}{\ell}\, R_D^2\, W^2(D) = \frac{1}{\ell}\, \frac{R_D^2}{M_D^2}.$$

This bound is derived as an application of Novikoff's theorem.
Jaakkola-Haussler bound

Theorem 2: For SVMs without a bias term b, the following bound holds:

$$L(D) \le \frac{1}{\ell}\sum_{i=1}^{\ell} \theta\big(\alpha_i K(x_i, x_i) - 1\big).$$

This bound is based upon the following inequality:

$$y_i\big(f(x_i) - f^i(x_i)\big) \le \alpha_i\, K(x_i, x_i).$$
Span bound

Theorem 3: Under the assumption that the set of support vectors does not change when removing example i, the following bound holds:

$$L(D) \le \frac{1}{\ell}\sum_{i=1}^{\ell} \theta\big(\alpha_i S_i^2 - 1\big),$$

where S_i is the distance between the point Φ(x_i) and the set Λ_i,

$$\Lambda_i = \Big\{ \sum_{j \in SV,\, j \ne i} \lambda_j\, \Phi(x_j) \;:\; \sum_{j} \lambda_j = 1 \Big\}.$$

This bound is derived in a similar fashion to the Jaakkola-Haussler bound, except that the bias term b is included. It is based upon the inequality

$$y_i\big(f(x_i) - f^i(x_i)\big) \le \alpha_i S_i^2.$$
We add a scaling parameter σ to the SVM, which scales the genes; genes corresponding to small σ_j are removed:

$$f(x) = \sum_{i \in SV} \alpha_i\, y_i\, K_\sigma(x_i, x) + b, \qquad K_\sigma(x, y) = K(\sigma * x,\, \sigma * y), \quad \sigma \in \mathbb{R}^n,$$

where u * v denotes element-by-element multiplication.
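A sketch of such a scaled kernel with scikit-learn, assuming an RBF base kernel; the base kernel and the cut-off on σ_j are illustrative, and in practice σ is obtained by descending a LOO bound as described above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def scaled_kernel(sigma):
    """K_sigma(x, y) = K(sigma * x, sigma * y), '*' element-wise."""
    def K(X, Y):
        return rbf_kernel(X * sigma, Y * sigma)
    return K

sigma = np.ones(X.shape[1])           # stand-in; learned via the LOO bound
svm = SVC(kernel=scaled_kernel(sigma)).fit(X, y)

# Genes whose sigma_j has shrunk toward zero no longer influence the
# kernel and can be removed.
keep = sigma > 1e-3
```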