Sifting the human genome for functional polymorphisms Pauline C. Ng, PhD
From genotype to phenotype
humans are ~99.9% identical to each other
genetic variation causes different phenotypes
Variation around genes is most likely to contribute to phenotype
Gene regions: upstream, 5’UTR, coding, 3’UTR
Coding nonsynonymous SNPs: variation that causes an amino acid substitution
Change in protein function?
Amino acid substitutions can cause disease
Gene lesions responsible for disease: amino acid substitutions ~50% (Human Mutation 15:45-51)
Hemoglobin E6V sickle-cell anemia
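To make "nonsynonymous" concrete, the sketch below classifies a single-codon change using the standard genetic code. The function name `classify_snp` is hypothetical, and the codon change used for E6V (GAG to GTG) is the textbook sickle-cell change, assumed here rather than taken from the slides.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_snp(ref_codon, alt_codon):
    """Classify a codon change (hypothetical helper, not part of SIFT)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"  # new stop codon
    return "nonsynonymous"  # amino acid substitution


# Assumed codons for hemoglobin E6V: GAG (Glu) -> GTG (Val)
print(CODON_TABLE["GAG"], CODON_TABLE["GTG"], classify_snp("GAG", "GTG"))
```

Only the nonsynonymous class is the input to a substitution predictor such as SIFT; synonymous changes leave the protein sequence unchanged.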
1 SNP / 1000 bp
Protein:
synonymous:nonsynonymous       observed 1:1   expected 1:2
conservative:nonconservative   observed 1:1   expected 1:2
(Nat. Genetics 22:231-238, 22:239-247; Science 293:489-93)
nsSNPs in humans are selected against
some of the observed nsSNPs may be involved in disease
Predicting the effect of an amino acid substitution
Applications
• Nonsynonymous SNPs
• Large-scale random mutagenesis projects
Cheap and quick for suggesting experiments
Computational Tools for Predicting AA Substitution Effects
1) SIFT (Sorts Intolerant From Tolerant) uses sequence
   Genome Research 11:863-874, 12:436-446
2) EMBL uses structure + sequence + annotation
   Human Mol. Gen. 10:591-597
3) Variagenics uses structure + sequence
   J. Mol. Biol. 307:683-706
Sequence conservation correlated with intolerance to substitutions
Conservation $= \log_2 20 + \sum_{aa} f_{aa} \log_2 f_{aa}$
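A minimal sketch of the conservation measure for one alignment column, assuming the column is given as a string of one-letter amino acid codes with no gaps. The function name `conservation` is hypothetical. Amino acids that do not occur contribute nothing to the sum.

```python
import math
from collections import Counter


def conservation(column):
    """log2(20) + sum over amino acids of f * log2(f) for one column."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy_term = sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    return math.log2(20) + entropy_term


# A fully conserved column reaches the maximum, log2(20) (about 4.32 bits);
# a column with all 20 amino acids equally frequent scores 0.
print(conservation("LLLLLLLL"))
print(conservation("ACDEFGHIKLMNPQRSTVWY"))
```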
SIFT
Query protein
1) Choosing sequences
   a) Database search
   b) Choose closely related sequences
2) Obtain alignment with related proteins.
3) For each position, calculate scaled probabilities for each amino acid substitution.
4) Scaled probability < cutoff: affects function; > cutoff: tolerated
SIFT: Choosing sequences
[Figure: x-axis, # of sequences (1 to 14)]
SIFT: Calculating probabilities
[Figure: probability calculation example]
px/pmax < 0.05 => x affects function
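The cutoff rule can be sketched for a single alignment column. This is a simplification under stated assumptions: probabilities are estimated from raw counts plus a small uniform pseudocount, which stands in for whatever estimation SIFT actually uses. The names `scaled_probabilities` and `predict` and the pseudocount value are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def scaled_probabilities(column, pseudocount=0.01):
    """Return p_x / p_max for every amino acid x in one alignment column.

    Assumption: counts plus a uniform pseudocount approximate p_x.
    """
    counts = Counter(column)
    raw = {aa: counts.get(aa, 0) + pseudocount for aa in AMINO_ACIDS}
    p_max = max(raw.values())
    return {aa: value / p_max for aa, value in raw.items()}


def predict(column, substitution, cutoff=0.05):
    """Apply the slide's rule: p_x / p_max < cutoff => affects function."""
    scaled = scaled_probabilities(column)
    return "affects function" if scaled[substitution] < cutoff else "tolerated"


column = "LLLLLLLLIV"  # hypothetical column: mostly Leu, some Ile and Val
for aa in "LIVD":
    print(aa, round(scaled_probabilities(column)[aa], 3), predict(column, aa))
```

Dividing by the most frequent amino acid's probability puts every position on the same 0-to-1 scale, so one cutoff can be applied across the whole protein.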
SIFT output
Substitution   Probability   Prediction        Confidence
M24S           0.04          Affect Function   Low
S82T           0.36          Tolerated         High
V247A          0.03          Affect Function   High
Confidence is determined by the diversity of sequences in the alignment
Ideal case: diverse set of orthologous proteins
Low confidence examples
• many highly identical sequences
• few sequences available
Case Study: LacI
Normal state: LacI present, lac operon repressed
Lactose present: lac operon expressed
4000 single amino acid substitutions assayed:
• throughout entire protein
• both neutral and affected phenotypes
TIBS 22:334-339
Prediction on LacI substitutions
                                             Predicted to affect function   Predicted to be tolerated
Substitutions that affect protein function   63%                            37% (false -)
Substitutions that give no phenotype         28% (false +)                  72%
Total prediction accuracy 68% (2726/4004)
Pr(observe affected phenotype | predicted to be damaging) = 63%
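The LacI figures can be cross-checked with a little arithmetic. The sketch below takes the 63% and 72% rates and the 2726/4004 accuracy from the slide, backs out the fraction of assayed substitutions that affect function, and recomputes Pr(affected | predicted damaging). Because the percentages are rounded, the derived values are approximate; the variable names are illustrative only.

```python
correct, total = 2726, 4004
accuracy = correct / total  # slide: 68%

sensitivity = 0.63  # affecting substitutions predicted to affect function
specificity = 0.72  # no-phenotype substitutions predicted to be tolerated

# accuracy = p * sensitivity + (1 - p) * specificity, solved for p,
# the fraction of assayed substitutions with an affected phenotype.
affected_fraction = (specificity - accuracy) / (specificity - sensitivity)

true_pos = affected_fraction * sensitivity
false_pos = (1 - affected_fraction) * (1 - specificity)
ppv = true_pos / (true_pos + false_pos)  # Pr(affected | predicted damaging)

print(f"accuracy={accuracy:.3f} affected_fraction={affected_fraction:.3f} ppv={ppv:.3f}")
```

Under these rounded inputs the recomputed value comes out close to the 63% quoted on the slide, which suggests the reported figures are mutually consistent.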
False negative error: Positions not conserved among paralogues
dimer & sugarinterface notconserved
False positive error in LacI:surface with unknown function?
SIFTing human variant databases
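Substitutions involved in disease
  7397 subst., 606 proteins from SWISS-PROT; predicted on 76% proteins, 71% subst.
  69% predicted to affect function, 31% predicted to be tolerated
Putative polymorphisms
  5780 nsSNPs, 3005 proteins from dbSNP; predicted on 60% prot., 53% subst.
  25% predicted to affect function, 75% predicted to be tolerated
nsSNPs in normal individuals
  185 nsSNPs, 69 proteins from Whitehead Institute; predicted on 77% prot., 62% subst.
  19% predicted to affect function, 81% predicted to be tolerated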
On functionally neutral substitutions, expected false positive error ~20%
Predicted to affect function:
• dbSNP (putative polymorphisms): 25%
• Whitehead Institute (nsSNPs in normal individuals): 19%
Suggests that most nsSNPs are functionally neutral
What accounts for the 5% difference?
Account for 5% difference in dbSNP
16 genes with a high fraction of dbSNP variants predicted to affect function
1) Substitutions found in patients
2) Substitutions mapped to nonfunctional genes/regions
3) Substitutions detected in error
Supports SIFT as a prediction tool
Mutations in MSHR increase skin cancer
Mutations associated with cutaneous malignant melanoma (CMM) (1)
Substitution   Prediction (Affect function / Tolerated)
R151C
R160W
D294H
Mutations not associated with CMM (1-3)
Substitution   Prediction (Affect function / Tolerated)
L60V
D84E
R163Q
[Figure labels: R151C, L60V]
1 Am. J. Hum. Genet. 66:176-186; 2 J. Invest. Dermatol. 116:224-229; 3 J. Invest. Dermatol. 112:512-513
Mutations in PPAR, a candidate gene for diabetes
Mutations in diabetics (1)
Substitution   Prediction (Affect function / Tolerated)
R127Q
D304N
R409T
Mutations in nondiabetics (1-5)
Substitution   Prediction (Affect function / Tolerated)
V227A
L162V ***
A268V
*** In diabetics and controls, but increases cholesterol levels in diabetics and perhaps nondiabetics (2-4)
SIFT will detect what has been selected against in evolution; an inappropriate assay may fail to detect it
1 Am. J. Hum. Genet. 63:abs997; 2 Diabetologia 43:673-680; 3 Diabetes Metab. 26:393-401; 4 J. Lipid Res. 41:945-952; 5 J. Hum. Genet. 46:285-288
Mutations in MTHFR
Mutations with diminished enzyme activity (1-5)
Substitution   Prediction (Affect function / Tolerated)
A222V
E429A
• Common
• Under balancing selection: increases neural tube defects, reduces risk for some types of leukemia
Unknown effect
Substitution   Prediction (Affect function / Tolerated)
R68Q
Found by contig comparison
1 Nat. Genet. 10:111-113; 2 PNAS 96:12810-12815; 3 PNAS 98:4004-4009; 4 Cancer Res. 57:1098-1102; 5 Mol. Genet. Metab. 64:169-172
dbSNP variants from patients
Can distinguish patients from controls
• Individuals with disease: 18/22 predicted to be damaging
• Control individuals: 9/10 predicted to be functionally neutral
SIFT detects what’s selected against in evolution & is independent of assay
  Example: PPAR
Detects substitutions that are deleterious in the context of the protein, not the organism
– Can detect nsSNPs with minor effects on phenotype
  Genes increase risk of skin cancer, diabetes, cholesterol levels. The protein need not be essential because SIFT predicts on the substitution.
– Can detect nsSNPs under balancing selection
  Example: MTHFR
16 genes with a high fraction of dbSNP variants predicted to affect function
1) Substitutions found in patients: confirms SIFT prediction and its sensitivity
2) Substitutions mapped to nonfunctional genes/regions: unlikely to affect human health
3) Substitutions detected in error: irrelevant to human health
Comparison of Prediction Tools
[Figure: bar charts comparing SIFT, EMBL, and Variagenics on LacI, disease substitutions, and SNP databases (normal individuals, polymorphisms); panels: substitutions that affect function, substitutions that do not affect function]
SIFT has similar prediction accuracy to tools that use structure
http://blocks.fhcrc.org/sift/SIFT.html
SIFT, a prediction tool for the effect of substitutions
Prediction is based only on sequence
Detect damaging nsSNPs on a large scale
Association studies for finding disease loci
Direct approach
• SNPs likely to affect gene function
• Association leads directly to candidate gene
• Fewer SNPs to genotype
Indirect approach using haplotypes
• tagSNPs to identify common haplotypes in a region
• Relies on LD with causal variant
• Genotype 200K-1 million SNPs
Example haplotypes:
AATACGAT
AATACGAT
AATACGAT
GATACAAC
GATACAAC
GATACAAC
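The reliance of the indirect approach on LD can be illustrated with the haplotype string on the slide, read here as six haplotypes of eight bases each (that split is an assumption about the figure). The sketch finds the variable sites and computes the standard r² between two of them. The function names are hypothetical, and biallelic, polymorphic sites are assumed.

```python
# Assumed reading of the slide's sequence: 3 copies each of two haplotypes.
haplotypes = ["AATACGAT"] * 3 + ["GATACAAC"] * 3


def variable_sites(haps):
    """Indices where more than one allele is observed."""
    return [i for i in range(len(haps[0])) if len({h[i] for h in haps}) > 1]


def r_squared(haps, i, j):
    """r^2 between sites i and j (both assumed biallelic and polymorphic)."""
    n = len(haps)
    a, b = haps[0][i], haps[0][j]  # use the first haplotype's alleles as reference
    p_a = sum(h[i] == a for h in haps) / n
    p_b = sum(h[j] == b for h in haps) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haps) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


sites = variable_sites(haplotypes)
print(sites)
print(r_squared(haplotypes, sites[0], sites[1]))
```

When the variable sites are perfectly correlated, as in this example, genotyping any one of them as a tagSNP identifies the haplotype, and therefore the alleles at the others.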
Feasibility of direct approach
Have we identified all the causative variants?
Common variant, common disease hypothesis
• 80% of common SNPs in Europeans are in dbSNP; 50% of common SNPs in Africans (Nat Genet. 33:518-21)
What types of variants are involved in disease?
nsSNPs & splicing variants account for a large proportion of Mendelian disease
Regulatory variation has a role in disease
• ~50% genes show allele-specific expression (Science 297:1143; Hum. Genet. 113:149–153)
• ~1/3 of promoter variants may alter gene expression (Hum. Mol. Genet. 12:2249–2254)
SNPs in and near genes
Category                             Possible effect?
1. nsSNPs                            protein function
2. SNPs near intron/exon boundary    splicing
3. UTRs and promoter region          regulation
4. synonymous SNPs                   In LD with causative variant?
Double-hit and known-frequency SNPs in genes
• covers 20,024 genes
Category                         # of SNPs
Nonsynonymous                    13,514
Synonymous                       8,173
Within 10 bp of exon boundary    2,042
5'UTR                            3,626
3'UTR                            20,857
1000 bp upstream of TSS          14,721
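The category counts can be totaled to see how the set is distributed. This sketch assumes the six categories do not overlap, which the slide does not state; the variable names are illustrative.

```python
snp_counts = {
    "Nonsynonymous": 13514,
    "Synonymous": 8173,
    "Within 10 bp of exon boundary": 2042,
    "5'UTR": 3626,
    "3'UTR": 20857,
    "1000 bp upstream of TSS": 14721,
}
genes_covered = 20024

total = sum(snp_counts.values())  # assumes disjoint categories
shares = {name: count / total for name, count in snp_counts.items()}
snps_per_gene = total / genes_covered

for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:32s}{snp_counts[name]:>8,d}{share:>8.1%}")
print(f"total={total:,d} snps_per_gene={snps_per_gene:.2f}")
```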
Non-genic regions could potentially harbor disease variants
• ~70% of bases in conserved sequences are noncoding (Genome Res. 13:2507-18)
  regulatory elements, noncoding RNAs, unknown genes
• 41,193 SNPs in noncoding conserved regions (>= 80% identity with mouse)
Adding SNPs in conserved regions improves SNP density
[Figure: histogram of # of SNPs / 200 kb (x-axis: 0 to 15, More) against count (y-axis: 0 to 6000); series: SNPs in/near genes, adding noncoding conserved SNPs]
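A density histogram of this kind can be sketched by counting SNPs in fixed 200 kb windows and then tallying how many windows hold 0, 1, 2, ... SNPs. The positions below are made up for illustration, and the function names are hypothetical. Positions are assumed to be 0-based and smaller than the chromosome length.

```python
from collections import Counter

WINDOW = 200_000  # 200 kb, as on the figure's x-axis label


def snps_per_window(positions, chrom_length, window=WINDOW):
    """Count SNPs falling in each consecutive window along one chromosome."""
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window] += 1
    return counts


def density_histogram(window_counts, cap=15):
    """Number of windows with 0..cap SNPs, plus a 'More' bin."""
    hist = Counter(min(c, cap + 1) for c in window_counts)
    labels = [str(i) for i in range(cap + 1)] + ["More"]
    return {label: hist.get(i, 0) for i, label in enumerate(labels)}


# Hypothetical positions on a 1 Mb chromosome
genic = [10_000, 150_000, 450_000]
conserved_noncoding = [250_000, 460_000, 900_000]

before = snps_per_window(genic, 1_000_000)
after = snps_per_window(genic + conserved_noncoding, 1_000_000)
print(before, density_histogram(before)["0"])
print(after, density_histogram(after)["0"])
```

In this toy example, adding the conserved noncoding SNPs reduces the number of empty windows, which is the effect the slide's histogram is meant to show.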
Focusing on variation in functional regions
• Large # of SNPs makes direct approach possible
– If causative variant is not in set, may be in LD with another SNP in the functional region
• Concentrating on functional regions allows interesting experiments
– genotyping
– DNA copy number
– allele-expression differences
• Complementary to the indirect approach using haplotypes
Acknowledgments
FHCRC
Steve Henikoff
Jorja Henikoff
Henikoff Lab