NISS Metabolomics Workshop, 2005
Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data
Kwan R. Lee, Ph.D. and Lei A. Zhu, Amit Bhattacharyya, J. Alan Menius
Biomedical Data Sciences, GlaxoSmithKline
Variation explained by each platform
PLS-DA for prediction of 2 experimental groups
Model  Platform  Q2(cum)
1      NO        49%
2      TA        86%
3      MT        45%
4      All       86%
Two groups: HFD, vehicle; HFD, drug treated
Q2(Y) = amount of variation among the 2 groups explained by the model (cross-validated)
The table above is based on 2-component models. If the 4th model (All) uses more components, 91% of the variation in the data can be explained by 4 components.
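For reference, the cross-validated Q2 can be written out as below. This follows the PRESS/TSS definition given later in the talk (Q2 = 1 – PRESS/TSS); the symbols y_i, \hat{y}_{(-i)} (the prediction for sample i from a model fitted without it) and \bar{y} are notation assumed here, not taken from the slides.

```latex
\[
Q^2 = 1 - \frac{\mathrm{PRESS}}{\mathrm{TSS}}, \qquad
\mathrm{PRESS} = \sum_{i} \bigl(y_i - \hat{y}_{(-i)}\bigr)^2, \qquad
\mathrm{TSS} = \sum_{i} \bigl(y_i - \bar{y}\bigr)^2
\]
```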
Challenge #5: Validation of the Prediction Model
• Correct way of doing cross-validation
  – Especially when the variables are selected
• Is your prediction accuracy significant?
Random Noise Data
• Simulate 20,000 marker columns of random noise for 100 patients, plus one additional column of arbitrary class labels.
• Select the 5 marker columns most correlated with the class label.
• Build a prediction model for the class label from these 5 selected markers.
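The experiment in the three bullets above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `noise_experiment`, the Gaussian noise, the balanced labels and the nearest-centroid classifier are all choices made here, since the slides do not say which noise distribution or classifier was used.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def noise_experiment(n_patients=100, n_markers=20000, n_select=5, seed=0):
    """Select markers on pure noise, then score a classifier on the same data."""
    rng = random.Random(seed)
    labels = [0, 1] * (n_patients // 2)  # assumption: even n_patients, balanced classes
    rng.shuffle(labels)

    # Stream the noise columns and keep only the n_select most correlated ones.
    best = []  # (absolute correlation, column) pairs
    for _ in range(n_markers):
        col = [rng.gauss(0, 1) for _ in range(n_patients)]
        best.append((abs(pearson(col, labels)), col))
        best.sort(key=lambda pair: pair[0], reverse=True)
        del best[n_select:]

    # Nearest-centroid classifier fitted on the selected markers (assumed model).
    rows = list(zip(*(col for _, col in best)))  # one tuple per patient
    centroids = {}
    for c in (0, 1):
        members = [r for r, lab in zip(rows, labels) if lab == c]
        centroids[c] = [statistics.fmean(v) for v in zip(*members)]

    def predict(row):
        dist = {c: sum((a - b) ** 2 for a, b in zip(row, centroids[c]))
                for c in centroids}
        return min(dist, key=dist.get)

    # Re-substitution: tested on the very patients used for selection and fitting.
    correct = sum(predict(r) == lab for r, lab in zip(rows, labels))
    return [c for c, _ in best], correct / n_patients
```

Calling `noise_experiment()` returns the absolute correlations of the selected markers and the apparent accuracy. Because the accuracy is measured on the patients that drove the selection, it tends to sit well above 50% even though every marker is noise.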
PCA of Full Markers
[Figure: PCA score plot of the random noise data (random_noise.M12, PCA-X), t[1] vs. t[2], colored by class (Class 1, Class 2); ellipse: Hotelling T2 (0.95). Produced with SIMCA-P+ 10.5.]
PLS-DA on Random Noise Data
• Running a full model in SIMCA does not yield a model: no significant Q2.
  – Multivariate approach is conservative.
  – Q2 measures prediction performance.
• However, when the software is forced to fit a 6-component PLS-DA model, it reports R2 = 1.0 and Q2 = 0.225.
Full marker model: PLS-DA
[Figure: PLS-DA score plot of the random noise data (random_noise.M1, PLS-DA, all data), t[1] vs. t[2], colored by class.]
• When a prediction model is tested on the same data that were used in the first instance to select the markers, selection bias makes the test error overly optimistic.
  – Many publications claimed a small set of selected “genes” is highly predictive.
  – IBI practice is to use a data set to select markers and use the same data set to fit a prediction model based on selected markers.
How to correct for selection bias?
• External validation should be undertaken subsequent to the feature selection process.
1. Independent test data set (hold-out data set) that was never used for feature selection.
2. External cross-validation (ECV).
  • Cross-validation of the prediction model is external to the selection process.
  • In other words, make a new selection for each cross-validation round.
Externally Validated PLS: Model and variable selection
• Divide the data set randomly into d parts.
• Set ecv = 1 (this means hold out one part and use d-1 parts for modeling).
• Set a = 1 (the number of components, do until 10).
• Set k = total number of variables.
• Loop:
• Fit PLS model with given a and k, PLS(a, k).
• Predict hold-out set, compute PRESS(ecv, a, k) and save.
• Choose top half of the variables by appropriate statistics (coeff, VIP, t-ratio, etc.).
• Set k = k/2.
• Go back to Loop until k = 2.
• Set a = a + 1.
• Go back to Loop until a = 10.
• Set ecv = ecv + 1.
• Go back to Loop until ecv = d.
• Compute PRESS(a, k) = sum over ecv of PRESS(ecv, a, k).
• Compute Q2(a, k) = 1 – PRESS(a, k)/TSS.
• Plot Q2(a, k) vs. log2(k).
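The loop above might be sketched in plain Python roughly as follows. Several details are assumptions made for this illustration and are not stated on the slide: a single response (PLS1 fitted by NIPALS), centering without unit-variance scaling, VIP as the ranking statistic, restarting from all variables for each number of components, and fitting the last model at k ≤ 2. All function names (`pls1_fit`, `pls1_predict`, `vip`, `ecv_pls`) are hypothetical.

```python
import random


def pls1_fit(X, y, n_comp):
    """PLS1 via NIPALS on centered data. X is a list of rows, y a list of floats."""
    n, k = len(X), len(X[0])
    x_mean = [sum(r[j] for r in X) / n for j in range(k)]
    y_mean = sum(y) / n
    E = [[r[j] - x_mean[j] for j in range(k)] for r in X]
    f = [v - y_mean for v in y]
    comps = []  # (weights w, loadings p, y-loading q, explained SS of y)
    for _ in range(n_comp):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(k)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(k)) for i in range(n)]
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(k)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        E = [[E[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q, q * q * tt))
    return x_mean, y_mean, comps


def pls1_predict(model, row):
    """Predict one sample by deflating it component by component."""
    x_mean, y_mean, comps = model
    e = [v - m for v, m in zip(row, x_mean)]
    y_hat = y_mean
    for w, p, q, _ in comps:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, p)]
    return y_hat


def vip(model):
    """Variable importance in projection, one value per variable."""
    x_mean, _, comps = model
    k = len(x_mean)
    total = sum(c[3] for c in comps)
    if total == 0:
        return [0.0] * k
    return [(k * sum(c[3] * c[0][j] ** 2 for c in comps) / total) ** 0.5
            for j in range(k)]


def ecv_pls(X, y, d=5, max_comp=3, seed=0):
    """Return {(a, k): Q2}, redoing the variable selection inside every fold."""
    n, k_all = len(X), len(X[0])
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::d] for i in range(d)]  # d random parts
    press = {}
    for hold in folds:
        hold_set = set(hold)
        train = [i for i in idx if i not in hold_set]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for a in range(1, max_comp + 1):
            keep = list(range(k_all))  # assumption: restart from all variables
            while True:
                model = pls1_fit([[r[j] for j in keep] for r in X_tr], y_tr, a)
                err = sum((y[i] - pls1_predict(model, [X[i][j] for j in keep])) ** 2
                          for i in hold)
                key = (a, len(keep))
                press[key] = press.get(key, 0.0) + err
                if len(keep) <= 2:
                    break
                scores = vip(model)  # ranking uses training data only
                order = sorted(range(len(keep)), key=lambda j: scores[j], reverse=True)
                keep = [keep[j] for j in order[:len(keep) // 2]]
    y_bar = sum(y) / n
    tss = sum((v - y_bar) ** 2 for v in y)
    return {key: 1.0 - val / tss for key, val in press.items()}
```

The returned dictionary can be plotted as Q2(a, k) against log2(k), one curve per a. The key design point is that the hold-out part never influences which variables are kept, so the resulting Q2 is not inflated by selection bias.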
Simulation of 2000 Random Data Sets (R. Simon, 2003)
• 20 x 6000 data matrix with a 10/10 split of class labels
• Repeat 2000 times
• Compute 3 different error rates
  – Re-substitution (wrong)
  – Cross validation after selection (wrong)
  – Cross validation before selection (correct)
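A small-scale version of this comparison could look like the sketch below. It is not the procedure from the cited study: the nearest-centroid classifier, selection by absolute class-mean difference, leave-one-out cross-validation, and the names `select`, `classify` and `three_error_rates` are assumptions chosen to keep the example short.

```python
import random


def select(X, y, rows, n_select):
    """Indices of the n_select genes with the largest class-mean gap on `rows`."""
    def gap(j):
        g0 = [X[i][j] for i in rows if y[i] == 0]
        g1 = [X[i][j] for i in rows if y[i] == 1]
        return abs(sum(g0) / len(g0) - sum(g1) / len(g1))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:n_select]


def classify(X, y, rows, genes, test_row):
    """Nearest-centroid prediction for X[test_row], trained on `rows` using `genes`."""
    best, best_dist = None, float("inf")
    for c in (0, 1):
        members = [i for i in rows if y[i] == c]
        dist = 0.0
        for j in genes:
            centre = sum(X[i][j] for i in members) / len(members)
            dist += (X[test_row][j] - centre) ** 2
        if dist < best_dist:
            best, best_dist = c, dist
    return best


def three_error_rates(n=20, n_genes=6000, n_select=10, seed=0):
    """One random data set; returns (re-substitution, CV after, CV before) errors."""
    rng = random.Random(seed)
    y = [0] * (n // 2) + [1] * (n - n // 2)
    X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n)]
    everyone = list(range(n))
    genes_all = select(X, y, everyone, n_select)  # selection sees every sample
    resub = cv_after = cv_before = 0
    for i in everyone:
        rest = [r for r in everyone if r != i]
        # Wrong: fit and test on the same samples.
        resub += classify(X, y, everyone, genes_all, i) != y[i]
        # Wrong: sample i is left out of fitting but not of selection.
        cv_after += classify(X, y, rest, genes_all, i) != y[i]
        # Correct: selection is repeated without sample i.
        genes_fold = select(X, y, rest, n_select)
        cv_before += classify(X, y, rest, genes_fold, i) != y[i]
    return resub / n, cv_after / n, cv_before / n
```

Averaging `three_error_rates(seed=s)` over many seeds plays the role of the repeated simulation. Since the labels carry no information, only the third error rate should hover around 50%; the first two are expected to look far better than chance.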
Results of 2000 Random Data
Permutation testing
• Because of the high dimensionality of gene expression data, it may be possible to achieve relatively small error rates even for random data.
• To assess the significance of the classification results, a permutation test is suggested.
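One way to set up such a permutation test is sketched below, as an illustration only. `permutation_p_value` is a hypothetical helper that accepts any error function; `loocv_centroid_error` is an assumed stand-in classifier. In a real analysis the error function would need to rerun the whole modeling pipeline, including any marker selection, for every permuted label vector.

```python
import random


def permutation_p_value(error_fn, X, y, n_perm=200, seed=0):
    """Share of label permutations whose error is at least as low as the observed one."""
    rng = random.Random(seed)
    observed = error_fn(X, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break any real link between X and the labels
        if error_fn(X, y_perm) <= observed:
            hits += 1
    # The +1 terms count the observed labeling itself, so p is never zero.
    return observed, (hits + 1) / (n_perm + 1)


def loocv_centroid_error(X, y):
    """Leave-one-out error of a nearest-centroid classifier (illustrative stand-in)."""
    n, wrong = len(X), 0
    for i in range(n):
        dist = {}
        for c in set(y):
            members = [X[r] for r in range(n) if r != i and y[r] == c]
            if not members:
                continue
            centre = [sum(col) / len(members) for col in zip(*members)]
            dist[c] = sum((a - b) ** 2 for a, b in zip(X[i], centre))
        wrong += min(dist, key=dist.get) != y[i]
    return wrong / n
```

A call such as `permutation_p_value(loocv_centroid_error, X, y)` returns the observed error together with its permutation p-value. A low error only counts as evidence when few permuted labelings achieve an error that low.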
Challenge #5: Validation of the Prediction Model - summary
• Correct way of doing cross-validation
  – All the steps of the prediction modeling should be cross-validated.
  – Each cross validation step should start from scratch.
• Is your prediction accuracy significant?
  – Random data can give you low prediction error.
  – Permutation tests, bootstrap aggregation
Summary and Discussion
• Recent technological advances present challenging and interesting biological data at the molecular level.
• Statistics and multivariate analysis play an important role in understanding and extracting knowledge from these types of data.
• Integrative analysis is even more challenging, and we presented some solutions to these challenges. There is plenty of room for improvement.
Acknowledgement
GlaxoSmithKline
  – High Throughput Biology
  – Biomedical Data Sciences
  – Genomics and Proteomics Science
  – Pathology, Cellular & Biochemical Toxicology
  – Discovery IT