MAPPIN: Method for Annotating, Predicting Pathogenicity, and mode of Inheritance for Nonsynonymous variants

Nehal Gosalia¹,², Aris N. Economides¹,², Frederick E. Dewey¹, and Suganthi Balasubramanian¹
¹ Regeneron Genetics Center, ² Regeneron Pharmaceuticals, Tarrytown, NY

BACKGROUND
• An average exome can contain 10,500-13,500 nonsynonymous single nucleotide variants (nsSNVs) [1,2], which is lower than expected, suggesting negative selection [3,4]
• A major challenge with whole exome sequencing (WES) is differentiating benign from disease-causing variants
• The HGMD and OMIM® databases show that nsSNVs account for ~45% of disease-causing mutations [5,6], making it critical to identify them
• Many algorithms predict the pathogenicity of nsSNVs; however, none can distinguish dominant from recessive disease-causing mutations [7]
• It is important to differentiate between heterozygous dominant-acting variants and heterozygous carrier variants

Figure 1. Adapted from Li et al., PLoS Genetics, 2013 [7]. Several prediction algorithms and combinations were tested on a dataset composed of known dominant and recessive disease-causing mutations. The figure shows that existing algorithms cannot confidently distinguish dominant from recessive mutations (AUCs ~0.55).

MAPPIN
• Method for Annotating, Predicting Pathogenicity, and mode of Inheritance for Nonsynonymous variants
• Training data sets:
  - Pathogenic variants from UniProt and ExoVar [7], subdivided using known dominant and recessive disease-causing genes from OMIM® [6] and others [8,9]
  - Haploinsufficient genes: a subset of the dominant genes, selected using haploinsufficiency predictions [10]
  - Benign variants from ClinVar [11], restricted to the review statuses (a) criteria provided, multiple submitters, no conflicts; (b) reviewed by expert panel; and (c) practice guideline (★★ to ★★★★)

[Figure 2 workflow diagram. Input: VCF file (chr, pos, ref, alt) → Variant annotation: 99 evolutionary, functional, network, and allele frequency features → Annotation output. Annotated variants → Prediction: random forest classifier, trained on ExoVar (pathogenic) and ClinVar (benign) training sets → Prediction output: scores for the Dominant, Recessive, and Benign classes.]
Figure 2. Workflow for MAPPIN. The user inputs a VCF file, which is annotated with 99 features and then run through a prediction model based on a random forest classifier trained on benign and pathogenic variants.

Table 1. Features annotated within MAPPIN and used for predictions, subdivided into categories based on the type of annotation.
Category           Features
Evolutionary       GERP score [12], paralogs [13], pseudogenes [14] and other gene annotation metrics, dN/dS rates, average heterozygosity of nsSNVs and synonymous SNVs, nonsynonymous and synonymous SNP density
Functional         Transcript length, variant affecting all/some transcript isoforms, single exon gene, protein domain annotations [13], GTEx expression in individual tissues [15]
Network            Protein-protein interactions (BioGRID) [16], number of networks and interfaces, interactions with known dominant or recessive disease-causing genes (OMIM®) [6]
Allele Frequency   1000G [17], ESP6500 [18], ExAC [19], pLI score by gene (measure of haploinsufficiency based on constraint, ExAC) [20]

RESULTS
MAPPIN was trained under two models:
i. Haploinsufficient model, composed of genes causing dominant diseases through haploinsufficiency (multiclass AUC = 0.96)
ii. All dominant model, composed of all dominant disease-causing genes (multiclass AUC = 0.91)

[Results panel titles: FEATURE IMPORTANCE PLOT (Figure 4); PERFORMANCE OF FEATURE SUBSETS (Table 2); VALIDATION ON TWO MENDELIAN DATASETS: CMG AND DDDS (Table 3); DIFFERENTIATING BETWEEN DOMINANT & RECESSIVE DISEASE-CAUSING VARIANTS (Figure 5)]

[P-value annotations from the poster's plots, in extraction order; panel assignment is not recoverable: 2.20e-206, 3.55e-92, 2.95e-07, 2.32e-08, 1.49e-08, 3.74e-23, ns, 4.19e-05, 2.48e-08, 1.53e-06, ns, 2.32e-08]

DOMINANT AND RECESSIVE DISCRIMINATION FOR HGMD VARIANTS
Figure 6. Violin plots of score distributions for HGMD [5] variants in dominant and recessive genes. Training variants are excluded from the comparison, and variants were subset using genes from Berg et al.
[24] MAPPIN dominant (A) and recessive (B) class scores for HGMD variants in dominant and recessive genes. CADD (C) and Eigen (D) phred scores [23] for HGMD variants in dominant and recessive genes.

REFERENCES
1. Levy et al., The diploid genome sequence of an individual human. PLoS Biology, 2007
2. Ng et al., Genetic variation in an individual human exome. PLoS Genetics, 2008
3. Cargill, M. et al., Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genetics, 1999
4. Stephens, J.C. et al., Haplotype variation and linkage disequilibrium in 313 human genes. Science, 2001
5. Stenson et al., The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Human Genetics, 2014
6. Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD). World Wide Web URL: http://omim.org/
7. Li et al., Predicting Mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies. PLoS Genetics, 2013
8. Blekhman, R. et al., Natural selection on genes that underlie human disease susceptibility. Current Biology, 2008
9. Boone, P.M. et al., Deletions of recessive disease genes: CNV contribution to carrier states and disease-causing alleles. Genome Research, 2013
10. Huang et al., Characterising and predicting haploinsufficiency in the human genome. PLoS Genetics, 2010
11. Landrum, M.J. et al., ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Research, 2015
12. Cooper, G.M. et al., Distribution and intensity of constraint in mammalian genomic sequence. Genome Research, 2005
13. Flicek et al., Ensembl 2014. Nucleic Acids Research, 2014
14. GENCODE; Pei, B. et al., The GENCODE pseudogene resource. Genome Biology, 2012
15. Lonsdale et al., The Genotype-Tissue Expression (GTEx) project. Nature Genetics, 2013
16. Stark, C. et al., BioGRID: a general repository for interaction datasets. Nucleic Acids Research, 2006 (version 3.4.128)
17. The 1000 Genomes Project Consortium, An integrated map of genetic variation from 1,092 human genomes. Nature, 2012
18. Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA (URL: http://evs.gs.washington.edu/EVS/), 2015
19. Lek et al., Analysis of protein-coding genetic variation in 60,706 humans. Nature, 2016
20. Samocha et al., A framework for the interpretation of de novo mutation in human disease. Nature Genetics, 2014
21. Chong et al., The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. American Journal of Human Genetics, 2015
22. Deciphering Developmental Disorders Study, Large-scale discovery of novel genetic causes of developmental disorders. Nature, 2015
23. Liu et al., dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs. Human Mutation, 2016
24. Berg et al., An informatics approach to analyzing the incidentalome. Genetics in Medicine, 2013

Figure 3. Precision and recall values for training data, from a 10-fold cross-validation. A. Precision was calculated as true positives over all predicted positives (true positives plus false positives). B. Recall was calculated as true positives over the sum of true positives and false negatives.
[Figure 3 bar charts. X-axis classes: Benign, Recessive, Dominant; y-axis range: 0.5 to 1.0; series: Haploinsufficient Model, All Dominant Model. Bar labels in extraction order (assignment to class and model is not recoverable): Precision 0.88, 0.85, 0.87, 0.79, 0.80, 0.71; Recall 0.94, 0.95, 0.87, 0.79, 0.74, 0.62.]

Figure 4. Feature importance plot. First, the out-of-bag (OOB) prediction error is calculated for each tree. Next, the OOB error is recalculated after permuting each feature. Finally, to derive the mean decrease in accuracy, the difference between the two is averaged across all trees and normalized by the standard deviation of the differences.

Table 2. Precision values using different subsets of features. The precision values were calculated on the training data under the haploinsufficient model using a 10-fold cross-validation.

• 68 variants from the Centers for Mendelian Genomics (CMG) [21], which are working towards identifying the genetic basis of Mendelian diseases
• 158 variants from the Deciphering Developmental Disorders Study (DDDS) [22], which includes 1,133 children presenting with severe, undiagnosed developmental disorders, 28% of whom were found to carry possibly pathogenic variants

Table 3. MAPPIN prediction accuracy for pathogenicity and mode of inheritance on two Mendelian validation datasets, CMG and DDDS. The rows labeled "genes not in training" give the results after excluding all CMG and DDDS genes from the training data.
Dataset                        Pathogenicity Prediction Accuracy   Inheritance Prediction Accuracy
CMG                            68/68 (100%)                        45/64 (70.3%)
DDDS                           138/158 (87.3%)                     124/158 (78.5%)
CMG (genes not in training)    68/68 (100%)                        45/64 (70.3%)
DDDS (genes not in training)   138/158 (87.3%)                     125/158 (79.1%)

Figure 5. Violin plots of score distributions for CMG and DDDS dominant and recessive disease-causing variants. MAPPIN dominant (A) and recessive (B) class scores for CMG and DDDS genes annotated as dominant or recessive. CADD (C) and Eigen (D) phred scores [23] for CMG and DDDS genes annotated as dominant or recessive.
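
The precision and recall definitions given in the Figure 3 caption can be illustrated with a short sketch for a three-class problem. The confusion-matrix counts below are made-up illustrative numbers, not MAPPIN results, and the function name `per_class_precision_recall` is hypothetical.

```python
CLASSES = ["Benign", "Recessive", "Dominant"]


def per_class_precision_recall(confusion):
    """Compute (precision, recall) for each class.

    confusion[i][j] is the number of variants whose true class is
    CLASSES[i] and whose predicted class is CLASSES[j].
    """
    n = len(confusion)
    results = {}
    for k, name in enumerate(CLASSES):
        true_pos = confusion[k][k]
        # Column sum: everything predicted as class k (TP + FP).
        predicted_k = sum(confusion[i][k] for i in range(n))
        # Row sum: everything that truly is class k (TP + FN).
        actual_k = sum(confusion[k])
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        results[name] = (precision, recall)
    return results


# Illustrative counts only (rows: true class, columns: predicted class).
example_confusion = [
    [90, 5, 5],
    [10, 70, 20],
    [0, 5, 95],
]

metrics = per_class_precision_recall(example_confusion)
for name, (precision, recall) in metrics.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

In a cross-validation setting, one plausible approach is to sum the confusion matrices across folds before computing the per-class values; whether MAPPIN pooled or averaged across folds is not stated in the poster.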
CONCLUSIONS & APPLICATIONS
• To our knowledge, MAPPIN is the first nsSNV prediction algorithm that predicts both pathogenicity and mode of inheritance by classifying variants into three groups (dominant, recessive, and benign)
• Mode of inheritance predictions are useful because they allow the genotype to be taken into account when prioritizing variants
• This prevents heterozygous carrier variants from being classified as equally pathogenic as dominant-acting heterozygous mutations
• In Mendelian family-based analysis, MAPPIN would be useful for variant prioritization and interpretation, especially in cases where there is not enough information to identify the inheritance pattern
• For population genetics, MAPPIN annotations and predictions can support interpretation of variant-phenotype associations and variant aggregation for gene burden-based association testing
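
As a rough illustration of how a mode of inheritance prediction lets the genotype inform prioritization, consider the sketch below. It is not MAPPIN's implementation: the function name `prioritize`, the score keys, the genotype labels, the output labels, and the rule of taking the highest of the three class scores are all assumptions made for this example.

```python
def prioritize(scores, genotype):
    """Label one variant using class scores plus the observed genotype.

    scores:   dict with hypothetical keys 'dominant', 'recessive', 'benign',
              standing in for three-class classifier output.
    genotype: 'het' (heterozygous) or 'hom' (homozygous); hypothetical labels.
    """
    if genotype not in ("het", "hom"):
        raise ValueError(f"unknown genotype: {genotype!r}")

    # Assumed rule: the predicted class is the one with the highest score.
    predicted = max(scores, key=scores.get)

    if predicted == "benign":
        return "deprioritize"
    if predicted == "dominant":
        # A single altered copy may be sufficient to cause disease.
        return "candidate"
    # Predicted recessive: one altered copy alone suggests a carrier state.
    if genotype == "hom":
        return "candidate"
    return "possible carrier"


dominant_het = prioritize({"dominant": 0.7, "recessive": 0.2, "benign": 0.1}, "het")
recessive_het = prioritize({"dominant": 0.1, "recessive": 0.8, "benign": 0.1}, "het")
recessive_hom = prioritize({"dominant": 0.1, "recessive": 0.8, "benign": 0.1}, "hom")
print(dominant_het, recessive_het, recessive_hom)
```

A fuller treatment would also look for a second variant in the same gene (compound heterozygosity) before labeling a heterozygous recessive-class variant as a carrier; that step is omitted here for brevity.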