Sifting the human genome for functional polymorphisms Pauline C. Ng, PhD
Page 1

Sifting the human genome for functional polymorphisms

Pauline C. Ng, PhD

Page 2

From genotype to phenotype

humans are ~99.9% identical to each other

genetic variation causes different phenotypes

Page 3

Variation around genes is most likely to contribute to phenotype

[Figure: gene diagram with regions labeled upstream, 5’UTR, 3’UTR]

Coding nonsynonymous SNPs: variation that causes an amino acid substitution

Change in protein function?

Page 4

Amino acid substitutions can cause disease

Gene lesions responsible for disease: amino acid substitutions account for ~50% (Human Mutation 15:45-51)

Example: hemoglobin E6V, sickle-cell anemia

Page 5

1 SNP / 1000 bp

Protein:

synonymous:nonsynonymous      1:1 observed, 1:2 expected
conservative:nonconservative  1:1 observed, 1:2 expected

Nat. Genetics 22:231-238, 22:239-247

Science 293:489-93

nsSNPs in humans are selected against

some of the observed nsSNPs may be involved in disease

Page 6

Predicting the effect of an amino acid substitution

Applications
• Nonsynonymous SNPs
• Large-scale random mutagenesis projects

Cheap and quick for suggesting experiments

Page 7

Computational Tools for Predicting AA Substitution Effects

1) SIFT (sorts intolerant from tolerant) uses sequence

Genome Research 11:863-874, 12:436-446

2) EMBL uses structure + sequence + annotation

Human Mol. Gen. 10:591-597

3) Variagenics uses structure + sequence

J. Mol. Biol. 307:683-706

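Page 8

Sequence conservation correlated with intolerance to substitutions

$$\text{Conservation} = \log_2 20 + \sum_{aa} f_{aa} \log_2 f_{aa}$$

where $f_{aa}$ is the frequency of amino acid $aa$ at the alignment position.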
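As a minimal sketch of the conservation measure on this slide, assuming it is the information content log2(20) plus the sum over amino acids of f·log2(f), the snippet below scores one alignment column. The function name `conservation`, the gap handling, and the use of raw (unsmoothed) frequencies are assumptions for illustration, not details from the talk.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def conservation(column):
    """Return log2(20) + sum_a f_a * log2(f_a) for one alignment column.

    Assumption: f_a is the raw observed frequency; gaps and other
    non-amino-acid characters are ignored.
    """
    residues = [aa for aa in column.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("column contains no amino acids")
    counts = Counter(residues)
    total = len(residues)
    entropy_term = sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(20) + entropy_term


# An invariant column reaches the maximum, log2(20), about 4.32 bits.
invariant = conservation("LLLLLLLL")
# A column with all 20 amino acids equally represented scores about 0.
diverse = conservation(AMINO_ACIDS)
```

Under this reading, high values mark positions where few amino acids are observed, which is where substitutions are expected to be least tolerated.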

Page 9

SIFT

Query protein

Choosing sequences
a) Database search
b) Choose closely related sequences

Obtain alignment with related proteins.

For each position, calculate scaled probabilities for each amino acid substitution.

< cutoff: affects function
> cutoff: tolerated

Page 10

SIFT: Choosing sequences

[Figure: # of sequences, 1 to 14]

Page 11

SIFT: Calculating probabilities

[Figure: grid of numeric values; entries not recoverable from this transcript]

p_x / p_max < 0.05 => x affects function
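The cutoff rule on this slide can be sketched as follows. This is only an illustration: the function name `classify_position` and the example probabilities are hypothetical, and how the per-position probabilities themselves are estimated is not shown here.

```python
def classify_position(probabilities, cutoff=0.05):
    """Scale each amino acid probability by the position's maximum, then apply the cutoff.

    probabilities: dict mapping amino acid -> probability at one position.
    Returns a dict mapping amino acid -> (scaled probability, prediction).
    """
    p_max = max(probabilities.values())
    if p_max <= 0:
        raise ValueError("at least one probability must be positive")
    result = {}
    for aa, p in probabilities.items():
        scaled = p / p_max
        label = "affects function" if scaled < cutoff else "tolerated"
        result[aa] = (scaled, label)
    return result


# Hypothetical probabilities for one position (illustrative values only).
position = {"L": 0.60, "I": 0.25, "V": 0.13, "D": 0.02}
calls = classify_position(position)
```

Scaling by the maximum means the most probable amino acid at a position always has a scaled value of 1, so the 0.05 cutoff is relative to the best-tolerated residue rather than absolute.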

Page 12

SIFT output

Substitution  Probability  Prediction       Confidence
M24S          0.04         Affect Function  Low
S82T          0.36         Tolerated        High
V247A         0.03         Affect Function  High
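Substitutions in the output above are written as original residue, position, replacement residue (for example M24S). A small parser for that notation might look like the sketch below; the `Substitution` type and `parse_substitution` function are hypothetical names, not part of SIFT.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    original: str      # one-letter code of the original amino acid
    position: int      # 1-based position in the protein
    replacement: str   # one-letter code of the new amino acid


_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(code):
    """Parse notation such as 'M24S' into its three parts."""
    match = _PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a substitution code: {code!r}")
    return Substitution(match.group(1), int(match.group(2)), match.group(3))


parsed = [parse_substitution(code) for code in ("M24S", "S82T", "V247A")]
```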

Page 13

Confidence is determined by the diversity of sequences in the alignment

Ideal case: diverse set of orthologous proteins

Low confidence examples
• many highly identical sequences
• few sequences available

Page 14

Case Study: LacI

[Figure: in the normal state, LacI keeps the lac operon repressed; when lactose is present, the lac operon is expressed]

4000 single amino acid substitutions assayed:
• throughout entire protein
• both neutral and affected phenotypes

TIBS 22:334-339

Page 15

Prediction on LacI substitutions

                                            predicted to      predicted to
                                            affect function   be tolerated
Substitutions that affect protein function  63%               37% (false -)
Substitutions that give no phenotype        28% (false +)     72%

Total prediction accuracy 68% (2726/4004)

Pr(observe affected phenotype | predicted to be damaging) = 63%
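The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes one reading of the chart: 63% is the share of function-affecting substitutions predicted to affect function, and 72% is the share of no-phenotype substitutions predicted to be tolerated. Under that assumption, the stated overall accuracy fixes the mix of the two classes, and Bayes' rule then gives the conditional probability quoted on the slide.

```python
correct, total = 2726, 4004
accuracy = correct / total  # about 0.68, as stated on the slide

# Assumed reading of the chart (see the paragraph above).
sensitivity = 0.63  # affected substitutions predicted to affect function
specificity = 0.72  # no-phenotype substitutions predicted to be tolerated

# accuracy = sensitivity * p + specificity * (1 - p); solve for p,
# the fraction of assayed substitutions that affect function.
affected_fraction = (specificity - accuracy) / (specificity - sensitivity)

# Pr(observe affected phenotype | predicted to be damaging) by Bayes' rule.
true_pos = sensitivity * affected_fraction
false_pos = (1 - specificity) * (1 - affected_fraction)
ppv = true_pos / (true_pos + false_pos)
```

With these inputs `ppv` comes out close to the 63% shown on the slide, which suggests the reading is at least internally consistent; the rounded percentages limit how exact the check can be.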

Page 16

False negative error: Positions not conserved among paralogues

dimer & sugar interface not conserved

Page 17

False positive error in LacI: surface with unknown function?

Page 18

SIFTing human variant databases

Substitutions involved in disease
7397 subst., 606 proteins from SWISS-PROT
Predicted on 76% proteins, 71% subst.
69% predicted to affect function, 31% predicted to be tolerated

Putative polymorphisms
5780 nsSNPs, 3005 proteins from dbSNP
Predicted on 60% prot., 53% subst.
25% predicted to affect function, 75% predicted to be tolerated

nsSNPs in normal individuals
185 nsSNPs, 69 proteins from Whitehead Institute
Predicted on 77% prot., 62% subst.
19% predicted to affect function, 81% predicted to be tolerated

Page 19

On functionally neutral substitutions, expected false positive error ~20%

Putative polymorphisms (dbSNP): 25% predicted to affect function

nsSNPs in normal individuals (Whitehead Institute): 19% predicted to affect function

suggests that most nsSNPs are functionally neutral

What accounts for the 5% difference?
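To give the 5% gap a rough scale, the sketch below converts it into a number of substitutions, using the dbSNP figures from the previous slide (5780 nsSNPs, predictions on 53% of them). The percentages are rounded on the slides, so the result is only an order-of-magnitude estimate.

```python
dbsnp_nssnps = 5780                  # nsSNPs from dbSNP (previous slide)
fraction_predicted_on = 0.53         # share of substitutions SIFT could score
observed_rate = 0.25                 # dbSNP variants predicted to affect function
expected_false_positive_rate = 0.20  # expected on neutral substitutions

scored = dbsnp_nssnps * fraction_predicted_on
observed_hits = scored * observed_rate
expected_hits = scored * expected_false_positive_rate
excess = observed_hits - expected_hits  # predictions beyond the expected error
```

On these numbers the excess is on the order of 150 substitutions, which is the set the following slides try to account for.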

Page 20

Account for 5% difference in dbSNP

16 genes with a high fraction of dbSNP variants predicted to affect function

1) Substitutions found in patients

2) Substitutions mapped to nonfunctional genes/regions

3) Substitutions detected in error

Supports SIFT as a prediction tool

Page 22

Mutations in MSHR increase skin cancer

[Table: Substitution vs. Prediction (Affect function / Tolerated); check marks not preserved in this transcript]

Mutations associated with cutaneous malignant melanoma (CMM) (ref. 1): R151C, R160W, D294H

Mutations not associated with CMM (refs. 1-3): L60V, D84E, R163Q

1 Am. J. Hum. Genet. 66:176-186; 2 J. Invest. Dermatol. 116:224-229; 3 J. Invest. Dermatol. 112:512-513

Page 23

Mutations in PPAR, a candidate gene for diabetes

[Table: Substitution vs. Prediction (Affect function / Tolerated); check marks not preserved in this transcript]

Mutations in diabetics (ref. 1): R127Q, D304N, R409T

Mutations in nondiabetics (refs. 1-5): V227A, L162V ***, A268V

*** In diabetics and controls, but increases cholesterol levels in diabetics and perhaps nondiabetics (refs. 2-4)

SIFT will detect what has been selected against in evolution; inappropriate assay may fail to detect

1 Am. J. Hum. Genet. 63:abs997; 2 Diabetologia 43:673-680; 3 Diabetes Metab. 26:393-401; 4 J. Lipid Res. 41:945-952; 5 J. Hum. Genet. 46:285-288

Page 24

Mutations in MTHFR

[Table: Substitution vs. Prediction (Affect function / Tolerated); check marks not preserved in this transcript]

Mutations with diminished enzyme activity (refs. 1-5): A222V, E429A
• Common
• Under balancing selection: increases neural tube defects, reduces risk for some types of leukemia

Unknown effect: R68Q
• Found by contig comparison

1 Nat. Genet. 10:111-113; 2 PNAS 96:12810-12815; 3 PNAS 98:4004-4009; 4 Cancer Res. 57:1098-1102; 5 Mol. Genet. Metab. 64:169-172

Page 25

dbSNP variants from patients
• Can distinguish patients from controls
– Individuals with disease: 18/22 predicted to be damaging
– Control individuals: 9/10 predicted to be functionally neutral

SIFT detects what’s selected against in evolution & is independent of assay
• Example: PPAR

Detect substitutions that are deleterious in the context of the protein, not the organism
– Can detect nsSNPs with minor effects on phenotype: genes increase risk of skin cancer, diabetes, cholesterol levels. The protein need not be essential because SIFT predicts on the substitution.
– Can detect nsSNPs under balancing selection. Example: MTHFR
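The patient and control counts on this slide translate directly into rates. The snippet is plain arithmetic on the slide's numbers; treating "predicted damaging in patients" and "predicted neutral in controls" as the two kinds of correct call is an interpretation, and with only 32 substitutions the rates carry wide uncertainty.

```python
from fractions import Fraction

patients_damaging = Fraction(18, 22)  # individuals with disease, predicted damaging
controls_neutral = Fraction(9, 10)    # control individuals, predicted neutral

# Share of all 32 substitutions whose prediction matches the individual's status.
correct_overall = Fraction(18 + 9, 22 + 10)

summary = {
    "patients predicted damaging": float(patients_damaging),
    "controls predicted neutral": float(controls_neutral),
    "overall agreement": float(correct_overall),
}
```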

Page 26

16 genes with a high fraction of dbSNP variants predicted to affect function

1) Substitutions found in patients

2) Substitutions mapped to nonfunctional genes or regions

3) Substitutions detected in error

Page 28

16 genes with a high fraction of dbSNP variants predicted to affect function

1) Substitutions found in patients: confirms SIFT prediction and its sensitivity

2) Substitutions mapped to nonfunctional genes/regions: unlikely to affect human health

3) Substitutions detected in error: irrelevant to human health

Page 29

Comparison of Prediction Tools

[Figure: bar chart comparing SIFT, EMBL, and Variagenics on three groups: substitutions that affect function (disease substitutions, LacI), substitutions that do not affect function (LacI), and polymorphisms (SNP databases, normal individuals); the per-tool percentages cannot be matched to bars from this transcript]

SIFT has similar prediction accuracy to tools that use structure

Page 30

http://blocks.fhcrc.org/sift/SIFT.html

SIFT, a prediction tool for the effect of substitutions

prediction is based only on sequence

Detect damaging nsSNPs on a large scale

Page 31

Association studies for finding disease loci

Direct approach
• SNPs likely to affect gene function
• Association leads directly to candidate gene
• Fewer SNPs to genotype

Indirect approach using haplotypes
• tagSNPs to identify common haplotypes in a region
• Relies on LD with causal variant
• Genotype 200K-1 million SNPs

AATACGAT
AATACGAT
AATACGAT
GATACAAC
GATACAAC
GATACAAC
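To illustrate the tagSNP idea, the sketch below reads the sequence row on this slide as six 8-base haplotypes (three copies each of two distinct haplotypes), finds the variable sites, and keeps one site per distinct pattern. This grouping is a deliberately simple stand-in for real tagSNP selection, which works on genotype data and LD thresholds; the function names are hypothetical.

```python
haplotypes = [
    "AATACGAT", "AATACGAT", "AATACGAT",
    "GATACAAC", "GATACAAC", "GATACAAC",
]


def variable_sites(haps):
    """Return 0-based positions where the haplotypes carry more than one allele."""
    return [i for i in range(len(haps[0])) if len({h[i] for h in haps}) > 1]


def pick_tag_snps(haps):
    """Keep one variable site per distinct allele pattern across haplotypes.

    Sites that split the haplotypes in exactly the same way are in perfect LD
    in this sample, so genotyping one of them is enough to recover the others.
    """
    tags, seen = [], set()
    for i in variable_sites(haps):
        first_seen = {}
        # Encode alleles by order of first appearance, so that e.g. A/G and
        # T/C sites with the same split produce the same pattern.
        pattern = tuple(first_seen.setdefault(h[i], len(first_seen)) for h in haps)
        if pattern not in seen:
            seen.add(pattern)
            tags.append(i)
    return tags


sites = variable_sites(haplotypes)
tags = pick_tag_snps(haplotypes)
```

In this toy sample three sites vary but all of them separate the same two haplotypes, so a single tagSNP captures the region.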

Page 32

Feasibility of direct approach

Have we identified all the causative variants?

common variant, common disease hypothesis

• 80% of common SNPs in Europeans in dbSNP, 50% of common SNPs in Africans. Nat Genet. 33:518-21

What types of variants are involved in disease?

nsSNPs & splicing variants account for a large proportion of Mendelian disease

regulatory variation has a role in disease
• ~50% genes show allele-specific expression. Science 297:1143; Hum. Genet. 113:149–153
• ~1/3 of promoter variants may alter gene expression. Hum. Mol. Genet. 12:2249–2254

Page 33

SNPs in and near genes

                                     Possible effect?
1. nsSNPs                            protein function
2. SNPs near intron/exon boundary    splicing
3. UTRs and promoter region          regulation
4. synonymous SNPs                   In LD with causative variant?

Page 34

Double-hit and known-frequency SNPs in genes

• covers 20,024 genes

Nonsynonymous, 13,514

Synonymous, 8,173

1000 bp upstream of TSS, 14,721

5'UTR, 3,626

3'UTR, 20,857

Within 10 bp of exon boundary, 2,042
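The category counts on this slide can be tallied as follows. The sketch assumes the categories do not overlap (a SNP is counted once), which the slide does not state, so the total and the per-gene average are approximate.

```python
snp_counts = {
    "Nonsynonymous": 13514,
    "Synonymous": 8173,
    "1000 bp upstream of TSS": 14721,
    "5'UTR": 3626,
    "3'UTR": 20857,
    "Within 10 bp of exon boundary": 2042,
}
genes_covered = 20024

total_snps = sum(snp_counts.values())
shares = {category: n / total_snps for category, n in snp_counts.items()}
snps_per_gene = total_snps / genes_covered

# Print categories from largest to smallest share.
for category, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{category:32s} {snp_counts[category]:6d}  {share:6.1%}")
```

Under that assumption the set averages roughly three SNPs per covered gene, with 3'UTR SNPs forming the largest category.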

Page 35

Non-genic regions could potentially harbor disease variants

• ~70% of bases in conserved sequences are noncoding (Genome Res. 13:2507-18)

regulatory elements, noncoding RNAs, unknown genes

41,193 SNPs in noncoding conserved regions >= 80% identity with mouse
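A minimal sketch of the ">= 80% identity with mouse" criterion, assuming it is applied to a pairwise human/mouse alignment of a region. The function names, the decision to skip gapped columns, and the example sequences are all assumptions; the window size and alignment method used for the actual SNP set are not given on the slide.

```python
def percent_identity(human, mouse):
    """Percent of aligned, ungapped columns where the two sequences match."""
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for h, m in zip(human.upper(), mouse.upper()):
        if h == "-" or m == "-":
            continue  # assumption: gapped columns are excluded
        compared += 1
        matches += h == m
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def is_conserved(human, mouse, threshold=80.0):
    """Apply the slide's >= 80% identity cutoff to one aligned region."""
    return percent_identity(human, mouse) >= threshold


# Hypothetical aligned fragments (illustrative only).
conserved = is_conserved("ACGTACGTAC", "ACGTACGTTT")      # 8 of 10 columns match
not_conserved = is_conserved("ACGTACGTAC", "ACGTAAAATT")  # 5 of 10 columns match
```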

Page 36

Adding SNPs in conserved regions improves SNP density

[Figure: histogram; x-axis # of SNPs / 200 kb (0 to 15, More), y-axis Count (0 to 6000); series: SNPs in/near genes, adding noncoding conserved SNPs]
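The density plot on this slide counts how many 200 kb windows contain 0, 1, 2, ... SNPs. One way such a histogram could be computed is sketched below; the function names, the non-overlapping windows, and the example positions are assumptions for illustration.

```python
from collections import Counter

WINDOW = 200_000  # window size in bp, as on the slide's x-axis label


def snps_per_window(positions, region_length, window=WINDOW):
    """Count SNPs in consecutive non-overlapping windows (0-based positions)."""
    n_windows = -(-region_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        if not 0 <= pos < region_length:
            raise ValueError(f"position {pos} is outside the region")
        counts[pos // window] += 1
    return counts


def density_histogram(counts, cap=15):
    """Number of windows holding 0, 1, ..., cap SNPs, plus a 'More' bin."""
    tally = Counter(min(c, cap + 1) for c in counts)
    labels = list(range(cap + 1)) + ["More"]
    return {label: tally.get(i, 0) for i, label in enumerate(labels)}


# Hypothetical SNP positions in a 1 Mb region (illustrative only).
counts = snps_per_window([10, 150_000, 250_000, 900_000], 1_000_000)
histogram = density_histogram(counts)
```

Running the same tally on the gene-based SNP set alone and then on the set with noncoding conserved SNPs added would show how many sparse windows the added SNPs fill in.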

Page 37

Focusing on variation in functional regions

• Large # of SNPs makes direct approach possible

– If causative variant is not in set, may be in LD with another SNP in the functional region

• Concentrating on functional regions allows interesting experiments

– genotyping

– DNA copy number

– allele-expression differences

• Complementary to the indirect approach using haplotypes

Page 38

Acknowledgments

FHCRC

Steve Henikoff

Jorja Henikoff

Henikoff Lab