MAPPIN: Method for Annotating, Predicting Pathogenicity ... · and mode of Inheritance for Nonsynonymous variants Nehal Gosalia1,2, Aris N. Economides1,2, Frederick E. Dewey1, ...

[Figure annotations as extracted; panel assignment not recoverable: 2.20e-206, 3.55e-92; 2.95e-07, 2.32e-08; 1.49e-08, 3.74e-23; ns, 4.19e-05; 2.48e-08, 1.53e-06; ns, 2.32e-08]

MAPPIN: Method for Annotating, Predicting Pathogenicity, and mode of Inheritance for Nonsynonymous variants

Nehal Gosalia1,2, Aris N. Economides1,2, Frederick E. Dewey1, and Suganthi Balasubramanian1

1Regeneron Genetics Center, 2Regeneron Pharmaceuticals, Tarrytown, NY

BACKGROUND

• An average exome can contain 10,500-13,500 nonsynonymous single nucleotide variants (nsSNVs) [1,2], which is lower than expected, suggesting negative selection [3,4]

• A major challenge with whole exome sequencing (WES) is differentiating benign from disease-causing variants

• The HGMD and OMIM® databases show that nsSNVs account for ~45% of disease-causing mutations [5,6], making it critical to identify them

• Many algorithms predict the pathogenicity of nsSNVs; however, none of them can distinguish dominant from recessive disease-causing mutations [7]

• It is important to differentiate between heterozygous dominant-acting variants and heterozygous carrier variants

MAPPIN

• Method for Annotating, Predicting Pathogenicity, and mode of Inheritance for Nonsynonymous variants

• Training data sets:
- Pathogenic variants from UniProt, ExoVar [7], subdivided using known dominant and recessive disease-causing genes from OMIM® [6] and others [8,9]
- Haploinsufficient gene subset drawn from the dominant genes using haploinsufficiency predictions [10]
- Benign variants from ClinVar [11], refined to the review statuses a) criteria provided, multiple submitters, no conflict, b) reviewed by expert panel, and c) practice guideline (★★-★★★★)
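The labeling of training variants described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `source` values, the gene identifiers, and the rule of skipping genes that appear on both gene lists (or on neither) are all assumptions made for the example.

```python
from typing import Optional, Set


def assign_training_label(
    gene: str,
    source: str,
    dominant_genes: Set[str],
    recessive_genes: Set[str],
    clinvar_stars: int = 0,
) -> Optional[str]:
    """Return 'Dominant', 'Recessive', 'Benign', or None if the variant is not used.

    `source` is a hypothetical tag: 'pathogenic' for variants from the
    pathogenic set, 'benign' for ClinVar benign variants.
    """
    if source == "benign":
        # Keep only benign variants whose review status is two to four stars.
        return "Benign" if 2 <= clinvar_stars <= 4 else None
    if source == "pathogenic":
        in_dominant = gene in dominant_genes
        in_recessive = gene in recessive_genes
        # Assumption: genes on both lists, or on neither, are ambiguous and skipped.
        if in_dominant and not in_recessive:
            return "Dominant"
        if in_recessive and not in_dominant:
            return "Recessive"
    return None


# Hypothetical gene lists, for illustration only.
dominant = {"GENE_A", "GENE_C"}
recessive = {"GENE_B", "GENE_C"}
print(assign_training_label("GENE_A", "pathogenic", dominant, recessive))
print(assign_training_label("GENE_X", "benign", dominant, recessive, clinvar_stars=3))
```

Restricting `dominant` to genes predicted to be haploinsufficient would give the training labels for the haploinsufficient model under the same assumptions.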

[Figure 2 diagram. Input: VCF file (chr, pos, ref, alt). Variant annotation: 99 evolutionary, functional, network, and allele frequency features (annotation output). Training sets: ExoVar (pathogenic) and ClinVar (benign). Random forest classifier. Prediction output: scores for the Dominant, Recessive, and Benign classes.]

Category | Features
Evolutionary | GERP score [12], paralogs [13], pseudogenes [14] and other gene annotation metrics, dN/dS rates, average heterozygosity of nsSNVs and synonymous SNVs, nonsynonymous and synonymous SNP density
Functional | Transcript length, variant affecting all/some transcript isoforms, single exon gene, protein domain annotations [13], GTEx expression in individual tissues [15]
Network | Protein-protein interactions (BioGRID) [16], number of networks and interfaces, interactions with known dominant or recessive disease-causing genes (OMIM®) [6]
Allele Frequency | 1000G [17], ESP6500 [18], ExAC [19], pLI score by gene (measure of haploinsufficiency based on constraint, ExAC) [20]

Table 1. Features annotated within MAPPIN and used for predictions, subdivided into categories based on the type of annotation.

Figure 1. Adapted from Li et al., PLoS Genetics, 2013 [7]. Several prediction algorithms and combinations were tested on a dataset composed of known dominant and recessive disease-causing mutations. The figure shows that existing algorithms are unable to call dominant or recessive mutations confidently (AUCs ~0.55).

RESULTS

MAPPIN was trained under two models:
i. Haploinsufficient model, composed of genes causing dominant diseases through haploinsufficiency (multiclass AUC = 0.96)
ii. All dominant model, composed of all dominant disease-causing genes (multiclass AUC = 0.91)

FEATURE IMPORTANCE PLOT

PERFORMANCE OF FEATURE SUBSETS

VALIDATION ON TWO MENDELIAN DATASETS: CMG AND DDDS

Figure 2. Workflow for MAPPIN. The user inputs a VCF file, which is annotated with 99 features and then run through a prediction model based on a random forest classifier trained on benign and pathogenic variants.
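The input step of the workflow can be sketched as below: reading the four fields named in the diagram (chr, pos, ref, alt) from VCF data lines. This is an illustrative sketch only. The `Variant` type and `parse_vcf_lines` function are hypothetical names, the example record is made up, and splitting multi-allelic records into one variant per alternate allele is an assumption rather than documented MAPPIN behavior.

```python
from typing import Iterable, Iterator, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_lines(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per alternate allele from tab-delimited VCF text."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, '##' meta lines, and the '#CHROM' header line.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"expected at least 5 VCF columns, got {len(fields)}")
        # The first five fixed VCF columns are CHROM, POS, ID, REF, ALT.
        chrom, pos, _variant_id, ref, alt_field = fields[:5]
        # Assumption: multi-allelic sites are split into separate variants.
        for alt in alt_field.split(","):
            yield Variant(chrom, int(pos), ref, alt)


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG,T\t.\t.\t.",  # made-up record
]
for variant in parse_vcf_lines(example):
    print(variant)
```

Each parsed variant would then be passed to the annotation step, which is not sketched here.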

DIFFERENTIATING BETWEEN DOMINANT & RECESSIVE DISEASE-CAUSING VARIANTS

DOMINANT AND RECESSIVE DISCRIMINATION FOR HGMD VARIANTS

Figure 6. Violin plots of score distributions for HGMD [5] variants in dominant and recessive genes. Training variants are excluded from the comparison, and variants were subset using genes from Berg et al. [24]. MAPPIN dominant (A) and recessive (B) class scores for HGMD variants in dominant and recessive genes. CADD (C) and Eigen (D) phred scores [23] for HGMD variants in dominant and recessive genes.

REFERENCES

1. Levy et al., The diploid genome sequence of an individual human. PLoS Biology, 2007
2. Ng et al., Genetic variation in an individual human exome. PLoS Genetics, 2008
3. Cargill, M. et al., Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genetics, 1999
4. Stephens, J.C. et al., Haplotype variation and linkage disequilibrium in 313 human genes. Science, 2001
5. Stenson et al., The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Human Genetics, 2014
6. Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD). World Wide Web URL: http://omim.org/
7. Li et al., Predicting Mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies. PLoS Genetics, 2013
8. Blekhman, R. et al., Natural selection on genes that underlie human disease susceptibility. Current Biology, 2008
9. Boone, P.M. et al., Deletions of recessive disease genes: CNV contribution to carrier states and disease-causing alleles. Genome Research, 2013
10. Huang et al., Characterising and predicting haploinsufficiency in the human genome. PLoS Genetics, 2010
11. Landrum, M.J. et al., ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Research, 2015
12. Cooper, G.M. et al., Distribution and intensity of constraint in mammalian genomic sequence. Genome Research, 2005
13. Flicek et al., Ensembl 2014. Nucleic Acids Research, 2014
14. GENCODE, Pei, B. et al., The GENCODE pseudogene resource. Genome Biology, 2012
15. Lonsdale et al., The Genotype-Tissue Expression (GTEx) project. Nature Genetics, 2013
16. Stark, C. et al., BioGRID: a general repository for interaction datasets. Nucleic Acids Research, 2006 (version 3.4.128)
17. The 1000 Genomes Project Consortium, An integrated map of genetic variation from 1,092 human genomes. Nature, 2012
18. Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA (URL: http://evs.gs.washington.edu/EVS/), 2015
19. Lek et al., Analysis of protein-coding genetic variation in 60,706 humans. Nature, 2016
20. Samocha et al., A framework for the interpretation of de novo mutation in human disease. Nature Genetics, 2014
21. Chong et al., The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. American Journal of Human Genetics, 2015
22. Deciphering Developmental Disorders Study, Large-scale discovery of novel genetic causes of developmental disorders. Nature, 2015
23. Liu et al., dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs. Human Mutation, 2016
24. Berg et al., An informatics approach to analyzing the incidentalome. Genetics in Medicine, 2013

Figure 3. Precision and recall values for training data. A. Precision was calculated as true positives over all predicted positives (true positives plus false positives) in a 10-fold cross-validation. B. Recall was calculated as true positives over the sum of true positives and false negatives.

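[Figure 3 bar charts. Panel A: precision; panel B: recall; for the Benign, Recessive, and Dominant classes under the Haploinsufficient Model and the All Dominant Model; y-axis 0.5-1.0. Bar labels as extracted (assignment to bars not recoverable): precision 0.88, 0.85, 0.87, 0.79, 0.80, 0.71; recall 0.94, 0.95, 0.87, 0.79, 0.74, 0.62.]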
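The per-class precision and recall definitions from the Figure 3 caption can be written out as a short sketch. The function name and the toy labels are illustrative assumptions. The 10-fold cross-validation loop is omitted, so this only shows the metric computation on one set of predictions.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

CLASSES = ("Benign", "Recessive", "Dominant")


def precision_recall(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    classes: Sequence[str] = CLASSES,
) -> Dict[str, Tuple[float, float]]:
    """Return {class: (precision, recall)} treating each class as the positive class in turn."""
    pairs = Counter(zip(y_true, y_pred))  # counts of (true label, predicted label)
    results = {}
    for c in classes:
        tp = pairs[(c, c)]
        fp = sum(n for (t, p), n in pairs.items() if p == c and t != c)
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        # Precision = TP / (TP + FP); recall = TP / (TP + FN); NaN if undefined.
        precision = tp / (tp + fp) if (tp + fp) else float("nan")
        recall = tp / (tp + fn) if (tp + fn) else float("nan")
        results[c] = (precision, recall)
    return results


# Toy labels, not data from the poster.
truth = ["Benign", "Benign", "Recessive", "Dominant", "Dominant"]
calls = ["Benign", "Recessive", "Recessive", "Dominant", "Benign"]
for label, (p, r) in precision_recall(truth, calls).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```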

Figure 4. Feature importance plot. First, the out-of-bag (OOB) prediction error is calculated for each tree. Next, the OOB error is recalculated after permuting each feature. Finally, to derive the mean decrease in accuracy, the difference between the two is averaged across all trees and normalized by the standard deviation of the differences.
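The final normalization step described in the Figure 4 caption can be sketched as below for a single feature, assuming the per-tree OOB errors before and after permutation are already available. The function name is hypothetical. Using the sample standard deviation and returning the unnormalized mean when the spread is zero are assumptions of this sketch, not details stated on the poster.

```python
from statistics import mean, stdev
from typing import Sequence


def mean_decrease_in_accuracy(
    oob_error: Sequence[float],
    oob_error_permuted: Sequence[float],
) -> float:
    """Importance of one feature from per-tree OOB errors before and after permuting it."""
    if len(oob_error) != len(oob_error_permuted):
        raise ValueError("need one error value per tree in both inputs")
    # Per-tree increase in OOB error caused by permuting the feature.
    diffs = [after - before for before, after in zip(oob_error, oob_error_permuted)]
    average = mean(diffs)
    # Assumption: sample standard deviation; skip normalization if it is zero.
    spread = stdev(diffs) if len(diffs) > 1 else 0.0
    return average / spread if spread > 0 else average


# Toy per-tree errors for three trees, not data from the poster.
baseline = [0.10, 0.20, 0.30]
permuted = [0.20, 0.40, 0.30]
print(round(mean_decrease_in_accuracy(baseline, permuted), 6))
```

A larger value indicates that permuting the feature degrades OOB accuracy more consistently across trees.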

Table 2. Precision values using different subsets of features. The precision values were calculated based on the training data under the haploinsufficient model using a 10-fold cross-validation.

• 68 variants from the Centers for Mendelian Genomics (CMG) [21], which are working towards identifying the genetic basis of Mendelian diseases

• 158 variants from the Deciphering Developmental Disorders Study (DDDS) [22], which includes 1,133 children presenting with severe, undiagnosed developmental disorders, in 28% of whom possibly pathogenic variants were identified

Table 3. MAPPIN prediction accuracy for two Mendelian datasets. Prediction accuracies for pathogenicity and mode of inheritance on the Mendelian validation datasets from CMG and DDDS. The rows CMG (genes not in training) and DDDS (genes not in training) give the pathogenicity and mode of inheritance results after excluding all CMG and DDDS genes from the training data.

Dataset | Pathogenicity Prediction Accuracy | Inheritance Prediction Accuracy
CMG | 68/68 (100%) | 45/64 (70.3%)
DDDS | 138/158 (87.3%) | 124/158 (78.5%)
CMG (genes not in training) | 68/68 (100%) | 45/64 (70.3%)
DDDS (genes not in training) | 138/158 (87.3%) | 125/158 (79.1%)

MAPPIN FEATURES


Figure 5. Violin plots of score distributions for CMG and DDDS dominant and recessive disease-causing variants. MAPPIN dominant (A) and recessive (B) class scores for CMG and DDDS genes annotated as dominant or recessive. CADD (C) and Eigen (D) phred scores [23] for CMG and DDDS genes annotated as dominant or recessive.

CONCLUSIONS & APPLICATIONS

• To our knowledge, this is the first nsSNV prediction algorithm that predicts both pathogenicity and mode of inheritance by classifying variants into three groups

• Mode of inheritance predictions are useful because they allow the genotype to be taken into account when prioritizing variants
- This prevents heterozygous carrier variants from being classified as equally pathogenic as dominant-acting heterozygous mutations

• In Mendelian family-based analysis, MAPPIN would be useful for variant prioritization and interpretation, especially in cases where there is not enough information to identify the inheritance pattern

• For population genetics, MAPPIN annotations and predictions can support interpretation of variant-phenotype associations and variant aggregation for gene burden-based association testing
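One way the genotype could be combined with the three class scores during prioritization is sketched below. The scoring rule is an assumption made purely for illustration and is not taken from the poster: a heterozygous variant is ranked by its Dominant score alone, while a homozygous variant is ranked by the larger of its Dominant and Recessive scores. Compound heterozygosity is ignored, and the function name and score values are hypothetical.

```python
from typing import Dict


def genotype_aware_score(class_scores: Dict[str, float], alt_allele_count: int) -> float:
    """Rank score for one variant given class scores and the number of alternate alleles.

    `class_scores` is expected to hold 'Dominant', 'Recessive', and 'Benign' entries.
    """
    dominant = class_scores["Dominant"]
    recessive = class_scores["Recessive"]
    if alt_allele_count == 1:
        # Heterozygous: a high Recessive score alone suggests carrier status,
        # so only the Dominant score contributes (illustrative rule).
        return dominant
    if alt_allele_count == 2:
        # Homozygous: either mode of inheritance could explain disease.
        return max(dominant, recessive)
    raise ValueError("alt_allele_count must be 1 (heterozygous) or 2 (homozygous)")


# Made-up scores for a variant that looks recessive.
scores = {"Dominant": 0.1, "Recessive": 0.8, "Benign": 0.1}
print(genotype_aware_score(scores, alt_allele_count=1))
print(genotype_aware_score(scores, alt_allele_count=2))
```

Under this rule the same variant ranks low in a heterozygous carrier and high in a homozygous individual, which is the distinction the conclusions describe.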
