University of Pittsburgh Department of Biomedical Informatics

The Application of Naive Bayes Model Averaging to Predict Alzheimer’s Disease from Genome-Wide Data

Wei Wei, Shyam Visweswaran and Gregory F. Cooper

Dec 14, 2014
Motivation: Develop methods for using genome-wide information about an individual to inform clinical care
Background
• Genome-wide association studies (GWASs)
• Single-nucleotide polymorphisms (SNPs)
• High-throughput genotyping technologies
• Alzheimer’s disease (AD):
  • AD afflicts about 10% of persons over 65 and almost half of those over 85
  • ~5.5 million cases currently in the U.S.
  • 95% of all AD cases are Late-Onset AD (LOAD)
Background
• Source: TGEN dataset by Reiman et al.*
• Cases
  • 1,411 individuals
  • 861 LOAD cases and 550 controls
• SNPs
  • 312,316 SNPs
  • Two additional SNPs (rs429358 and rs7412) genotyped separately (these determine APOE status)
____________________________________________________________________
* Reiman E, Webster J, Myers A, Hardy J, Dunckley T, Zismann V, et al. GAB2 alleles modify Alzheimer's risk in APOE epsilon4 carriers. Neuron. 2007;54(5):713-20.
Background
• Bayesian model averaging
  • Represents uncertainty about the correctness of any given model
  • Performs inference by weighting the prediction of each model by our uncertainty in that model
• Model-Averaged Naïve Bayes (MANB)
  • MANB efficiently averages over all naive Bayes models (on a given set of variables) in making a prediction for an individual patient case
Methods: Naive Bayes (NB)
[Figure: Naive Bayes structure in which LOAD is the parent of every SNP node: SNP 1, SNP 2, SNP 3, …, SNP 312,318]
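The slides include no code, but as a rough illustration of how an NB model with this structure scores a single case, here is a minimal Python sketch; the function, the toy probability tables, and the 0/1/2 genotype coding are hypothetical and are not the authors' implementation.

```python
import numpy as np

def nb_posterior(prior_load, p_snp_given_load, p_snp_given_no_load, genotypes):
    """Posterior P(LOAD = 1 | all SNP genotypes) for one case under naive Bayes.

    prior_load             : P(LOAD = 1)
    p_snp_given_load[i][g] : P(SNP_i = g | LOAD = 1), genotype g in {0, 1, 2}
    p_snp_given_no_load    : the same, conditioned on LOAD = 0
    genotypes[i]           : observed genotype of SNP_i for this case
    """
    # Accumulate in log space so that hundreds of thousands of factors do not underflow.
    log_pos = np.log(prior_load)
    log_neg = np.log(1.0 - prior_load)
    for i, g in enumerate(genotypes):
        log_pos += np.log(p_snp_given_load[i][g])
        log_neg += np.log(p_snp_given_no_load[i][g])
    # Normalize the two joint scores into a posterior probability.
    m = max(log_pos, log_neg)
    num = np.exp(log_pos - m)
    return num / (num + np.exp(log_neg - m))

# Toy example with 3 SNPs (hypothetical numbers, not from the TGEN data).
p1 = [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]
p0 = [[0.4, 0.4, 0.2], [0.5, 0.35, 0.15], [0.45, 0.35, 0.2]]
print(nb_posterior(0.5, p1, p0, genotypes=[2, 0, 1]))
```

Working in log space matters at this scale: with 312,318 conditional probabilities per class, a direct product would underflow long before the posterior could be normalized.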
Methods: Feature Selection Naive Bayes (FSNB)
Perform feature selection using a greedy, forward-stepping search that optimizes the prediction of LOAD
[Figure: FSNB structure in which LOAD is the parent of only the selected SNPs, e.g., SNP 1,100, SNP 25,920, SNP 104,582, and SNP 276,455]
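For illustration only, here is a minimal sketch of a greedy, forward-stepping search of the kind FSNB performs, assuming the quantity being optimized is the cross-validated AUC of a naive Bayes model for LOAD; the scoring function, stopping rule, and use of scikit-learn's CategoricalNB are assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import CategoricalNB

def greedy_forward_selection(X, y, max_features=10):
    """Greedily add the SNP that most improves cross-validated AUC for predicting LOAD.

    X : (cases x SNPs) genotype matrix coded 0/1/2; y : 0/1 LOAD labels.
    Stops when no remaining SNP improves the score (stopping rule is an assumption).
    """
    X, y = np.asarray(X), np.asarray(y)
    selected, best_score = [], 0.5
    while len(selected) < max_features:
        best_snp, best_snp_score = None, best_score
        for j in range(X.shape[1]):          # rescoring every SNP is what makes this slow
            if j in selected:
                continue
            probs = cross_val_predict(CategoricalNB(min_categories=3),
                                      X[:, selected + [j]], y,
                                      cv=5, method="predict_proba")[:, 1]
            score = roc_auc_score(y, probs)
            if score > best_snp_score:
                best_snp, best_snp_score = j, score
        if best_snp is None:
            break
        selected.append(best_snp)
        best_score = best_snp_score
    return selected, best_score
```

Rescoring all remaining SNPs at every step is why FSNB's training time is orders of magnitude longer than that of NB or MANB (see the run-time results later in the talk).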
Methods: Model-Averaged Naive Bayes (MANB)
[Figure: LOAD with a candidate arc to each of SNP 1, SNP 2, …, SNP 312,318; each arc may or may not be present]
Methods: MANB
MANB averages over the 2^312,318 possible naive Bayes models (Model 1, …, Model i, …, Model 2^312,318), one for each subset of arcs from LOAD to the SNPs, weighting each model’s prediction by its posterior probability:

$$P(\mathrm{LOAD} \mid Ev) \;=\; \sum_{i=1}^{2^{312{,}318}} P(\mathrm{LOAD} \mid Ev, \mathrm{model}_i)\, P(\mathrm{model}_i \mid \text{training data})$$
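To make the sum concrete, the sketch below performs the model averaging by brute-force enumeration for a tiny number of SNPs k; the two probability functions passed in are hypothetical placeholders. This is exactly the quantity MANB computes, but MANB avoids the enumeration.

```python
from itertools import product

def brute_force_model_average(k, p_load_given_ev, p_model_given_data, ev):
    """P(LOAD | Ev) by explicit averaging over all 2**k naive Bayes models.

    A model is a k-tuple of booleans: whether the arc LOAD -> SNP_i is present.
    p_load_given_ev(model, ev) : P(LOAD | Ev, model)          (placeholder callable)
    p_model_given_data(model)  : P(model | training data), summing to 1 over models
    Feasible only for tiny k; with k = 312,318 there are 2**312,318 models.
    """
    return sum(p_load_given_ev(m, ev) * p_model_given_data(m)
               for m in product([False, True], repeat=k))

# Toy example with k = 2, a uniform model posterior, and made-up predictions.
print(brute_force_model_average(
    2,
    lambda model, ev: 0.9 if any(model) else 0.5,   # toy P(LOAD | Ev, model)
    lambda model: 0.25,                             # uniform P(model | data)
    ev=None))                                       # prints 0.8
```

With 312,318 candidate SNPs this enumeration is hopeless, which motivates the computational “trick” on the next slide.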
Methods: MANB
• We can take advantage of the conditional independence relationships in NB models to make it efficient to model average over all of those models.
• The computational “trick” is as follows*:
  • For each SNPi, we construct a model-averaged conditional probability, PMANB(SNPi | LOAD), by averaging over whether or not there is an arc from LOAD to SNPi.
  • This step can be viewed as a “soft” form of feature selection.
  • We use these model-averaged conditional probabilities to define a new NB model M, over which we then perform ordinary NB inference.
  • Performing inference with M is the same as model averaging over the exponential number of NB models discussed previously.
____________________________________________________________________
* Dash D, Cooper G. Exact model averaging with naive Bayesian classifiers. International Conference on Machine Learning (2002): 91-98.
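A minimal sketch of the averaging step, assuming the probability of the arc given the data is obtained by Bayes' rule from the structure prior and the marginal likelihoods of SNPi's data with and without the arc (in the spirit of Dash and Cooper, 2002); the function name and arguments are illustrative, not the authors' code.

```python
def manb_conditional(prior_arc, marg_lik_with_arc, marg_lik_without_arc,
                     p_snp_given_load, p_snp_marginal):
    """Model-averaged P_MANB(SNP_i = g | LOAD = l) for each genotype g, at a fixed l.

    prior_arc            : structure prior p that the arc LOAD -> SNP_i is present
    marg_lik_with_arc    : marginal likelihood of SNP_i's data with the arc present
    marg_lik_without_arc : marginal likelihood of SNP_i's data with the arc absent
    p_snp_given_load[g]  : P(SNP_i = g | LOAD = l) estimated with the arc present
    p_snp_marginal[g]    : P(SNP_i = g) estimated with the arc absent
    """
    # Posterior probability of the arc, by Bayes' rule over the two structures.
    num = prior_arc * marg_lik_with_arc
    post_arc = num / (num + (1.0 - prior_arc) * marg_lik_without_arc)
    # Weighted average of the conditional distribution (arc) and the marginal (no arc).
    return [post_arc * pg + (1.0 - post_arc) * pm
            for pg, pm in zip(p_snp_given_load, p_snp_marginal)]
```

SNPs whose data give little support to the arc end up with PMANB(SNPi | LOAD) close to the marginal P(SNPi), so they contribute almost nothing to the prediction; this is the sense in which the step acts as “soft” feature selection.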
Methods: Prior Probabilities
• Structure priors
  • FSNB and MANB assume each arc is present with some probability p, independent of the status of the other arcs in the model.
  • Informed by the literature, we chose a value of p that yields an expected number of arcs of 20.
• Parameter priors
  • If we think of P(SNPi | LOAD) as defining a table of probabilities, then we assume that every way of filling in that table (consistent with the axioms of probability) is equally likely a priori.
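As a small numeric sketch of these priors: the structure prior follows directly from the expected-arcs choice, and reading “every table filling equally likely” as a uniform Dirichlet gives add-one (Laplace-smoothed) posterior-mean estimates; this reading is standard but is our interpretation, and the genotype counts below are hypothetical.

```python
# Structure prior: each arc LOAD -> SNP_i is present independently with probability p.
# Choosing p so that the expected number of arcs over 312,318 candidate SNPs equals 20:
n_snps = 312318
expected_arcs = 20
p_arc = expected_arcs / n_snps          # about 6.4e-5 per arc

# Parameter prior: a uniform Dirichlet over each column of P(SNP_i | LOAD), i.e. all
# hyperparameters equal to 1. The posterior-mean estimate is then an add-one count.
def posterior_mean(counts, alpha=1.0):
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

print(p_arc)
print(posterior_mean([30, 15, 5]))      # hypothetical genotype counts for one SNP, LOAD=1
```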
Methods: Experimental Design
• Five-fold cross-validation (see the sketch below)
• Performance measures
  • Area under the ROC curve (AUC) as a measure of discrimination
  • Calibration plots and Hosmer-Lemeshow goodness-of-fit statistics
  • Run time
• Control algorithms
  • NB
  • FSNB
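A minimal sketch of this evaluation loop, assuming scikit-learn-style classifiers and the usual ten-group Hosmer-Lemeshow statistic with g - 2 degrees of freedom; these conventions are assumptions rather than details stated on the slide.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def evaluate(model_factory, X, y, n_folds=5, n_bins=10):
    """Five-fold cross-validation reporting AUC and a Hosmer-Lemeshow statistic.

    model_factory() returns an unfitted classifier with fit/predict_proba
    (e.g., an NB, FSNB, or MANB implementation); the details here are illustrative.
    """
    X, y = np.asarray(X), np.asarray(y)
    probs = np.zeros(len(y), dtype=float)
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X, y):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        probs[test_idx] = model.predict_proba(X[test_idx])[:, 1]

    auc = roc_auc_score(y, probs)

    # Hosmer-Lemeshow: group cases into deciles of predicted probability and compare
    # observed versus expected numbers of LOAD cases in each group.
    hl = 0.0
    for group in np.array_split(np.argsort(probs), n_bins):
        n, obs, exp = len(group), y[group].sum(), probs[group].sum()
        p_bar = exp / n
        hl += (obs - exp) ** 2 / (n * p_bar * (1.0 - p_bar) + 1e-12)
    return auc, hl, chi2.sf(hl, df=n_bins - 2)
```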
Results: Run time (in seconds)
Machine parameters: CPU 2.33 GHz, RAM 2 GB. Training time was the average over the five cross-validation folds. Time for loading data into memory is not included, but was about XYZ seconds.
Algorithm   Training time (s)
MANB        16.1
NB          15.6
FSNB        1684.2
Results: Area under the ROC curve (AUC)
Discussion:
• The AUCs of FSNB and MANB are similar (the 95% confidence interval of their AUC difference is -0.008 to 0.029). Their performance is strongly influenced by several APOE SNPs.
• The AUCs of NB and MANB are strongly statistically different (p < 0.00001).
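The slide does not say how the confidence interval and p-value were obtained; one common way to get a confidence interval for the difference between two AUCs computed on the same test cases is a paired bootstrap, sketched below as an assumption rather than the paper's method.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_difference(y, probs_a, probs_b, n_boot=10000, seed=0):
    """Paired-bootstrap 95% CI for AUC(model A) - AUC(model B) on shared test cases."""
    rng = np.random.default_rng(seed)
    y, probs_a, probs_b = map(np.asarray, (y, probs_a, probs_b))
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))       # resample cases with replacement
        if len(np.unique(y[idx])) < 2:              # need both classes to compute an AUC
            continue
        diffs.append(roc_auc_score(y[idx], probs_a[idx]) -
                     roc_auc_score(y[idx], probs_b[idx]))
    return np.percentile(diffs, [2.5, 97.5])
```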
Results: Calibration plot of NB
Discussion:
NB is poorly calibrated, with almost all of the test cases receiving probability predictions near 0 or 1. Such extreme predictions occur because there are so many features in the model (see the illustration below).
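A back-of-the-envelope illustration of why that happens (the per-SNP contribution below is hypothetical):

```python
import numpy as np

# Each SNP that even slightly favors one class adds its log-likelihood ratio to the
# log-odds. With hundreds of thousands of SNPs these small contributions accumulate,
# pushing the posterior to (nearly) 0 or 1.
per_snp_log_lr = 0.01                   # a barely informative SNP
n_snps = 312318
posterior = 1.0 / (1.0 + np.exp(-n_snps * per_snp_log_lr))
print(posterior)                        # ~1.0: an extreme, poorly calibrated prediction
```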
Results: Calibration plot of NB and FSNB
Discussion:
FSNB is the best-calibrated algorithm among the three we evaluated. This result is likely due to the FSNB models containing only a few SNP features (fewer than four).
Results: Calibration plot of NB, FSNB and MANB
Discussion:
MANB is better calibrated than NB.
MANB is not as well calibrated as FSNB. We believe this result may be due to FSNB having such a small number of features in its models.
Summary of Results
              NB     FSNB    MANB
AUC                  +       +
Calibration          ++      +
Run time      ++             ++
Algorithm Availability
• A full description of the MANB algorithm is available in the appendix of our paper.
• It provides all the details needed to readily implement the algorithm.
Future Work Includes the Following
• Apply the MANB algorithm to additional datasets
• Predict additional clinical outcomes
• Use both genomic and clinical data to predict clinical outcomes
• Explore the use of additional genome-wide measurement platforms, including next-generation sequencing data
• Include additional control algorithms in future evaluations
Acknowledgement
• We thank Mr. Kevin Bui for his help in data preparation, software development, and the preparation of the appendix. We thank Dr. Pablo Hennings-Yeomans, Dr. Michael Barmada, and the other members of our research group for helpful discussions.
• The research reported here was funded by NLM grant R01-LM010020 and NSF grant IIS-0911032.
Thank you
Questions?