A coalescent computational platform to predict strength of association for clinical samples. Gabor T. Marth, Department of Biology, Boston College, marth@bc.edu. MIT Bioinformatics Seminar, April 25, 2005, Cambridge, Massachusetts, USA

Dec 20, 2015

Page 1:

A coalescent computational platform to predict strength of association for clinical samples

Gabor T. Marth
Department of Biology, Boston College
marth@bc.edu

MIT Bioinformatics Seminar
April 25, 2005
Cambridge, Massachusetts, USA

Page 2:

Sequence variations

• cause inherited diseases

• allow tracking ancestral human history

Page 3:

Automated polymorphism discovery tools

$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \; P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\sum\limits_{S_{i_1} \in [A,C,G,T]} \cdots \sum\limits_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \; P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$

Marth et al. Nature Genetics 1999

Page 4:

Genome scale sequence variation resources

[Figure labels: ~ 10 million; EST, WGS, BAC; genome reference]

Sachidanandam et al. Nature 2001

Page 5:

How to use markers to find disease?

• genome-wide, dense SNP marker map

• problem: genotyping cost precludes using millions of markers simultaneously for an association study

• question: how to select from all available markers a subset that captures most mapping information (marker selection, marker prioritization)

• the answer depends on the patterns of allelic association in the human genome

Page 6:

Allelic association

• allelic association is the non-random assortment between alleles, i.e. it measures how well knowledge of the allele state at one site (e.g. a marker site) permits prediction at another site (e.g. a functional site)

• by necessity, the strength of allelic association is measured between markers

• significant allelic association between a marker and a functional site permits localization (mapping) even without having the functional site in our collection

• there are pair-wise and multi-locus measures of association

Page 7:

Pair-wise: linkage disequilibrium (LD)

• LD measures the deviation from random assortment of the alleles at a pair of polymorphic sites

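$$D = f(AB) - f(A)\,f(B)$$

where f(AB) is the frequency of the haplotype carrying allele A at the first site and allele B at the second, and f(A), f(B) are the frequencies of the two alleles on their own

• other measures of LD are derived from D, e.g. by normalizing according to allele frequencies (r²)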
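A minimal sketch of how D and r² can be computed from phased haplotypes. It assumes haplotypes are given as equal-length strings over "0"/"1" and uses the standard normalization r² = D² / (f(A)(1 − f(A)) f(B)(1 − f(B))); the function name `ld_stats` is illustrative only.

```python
def ld_stats(haplotypes, i, j, allele_i="1", allele_j="1"):
    """Return (D, r2) for sites i and j of equal-length haplotype strings."""
    n = len(haplotypes)
    # Allele frequencies at each site and joint (haplotype) frequency.
    f_a = sum(h[i] == allele_i for h in haplotypes) / n
    f_b = sum(h[j] == allele_j for h in haplotypes) / n
    f_ab = sum(h[i] == allele_i and h[j] == allele_j for h in haplotypes) / n
    d = f_ab - f_a * f_b
    denom = f_a * (1 - f_a) * f_b * (1 - f_b)
    # r2 is undefined when either site is monomorphic in the sample.
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Perfect association: the two sites always carry matching alleles.
print(ld_stats(["11", "11", "00", "00"], 0, 1))
# Random assortment: all four allele combinations are equally frequent.
print(ld_stats(["11", "10", "01", "00"], 0, 1))
```

In the first call the alleles are fully predictive of each other (r² = 1); in the second they assort randomly (D = 0).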

Page 8:

Multi-marker: haplotype diversity

• the most useful multi-marker measures of association are related to haplotype diversity

• random assortment of alleles at different sites: n markers give 2^n possible haplotypes

• strong association: most chromosomes carry one of a few common haplotypes (reduced haplotype diversity)
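As a small illustration of haplotype diversity, the sketch below compares the number of possible haplotypes (2^n for n biallelic markers) with the number actually observed in a sample. The diversity score used here, 1 − Σ p_i² over haplotype frequencies, is one common choice and an assumption on our part; the slides do not specify a particular measure.

```python
from collections import Counter


def haplotype_summary(haplotypes):
    """Summarize haplotype diversity for equal-length haplotype strings."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    n_markers = len(haplotypes[0])
    # Probability that two chromosomes drawn with replacement differ.
    diversity = 1.0 - sum((c / n) ** 2 for c in counts.values())
    return {
        "possible": 2 ** n_markers,  # upper bound under random assortment
        "observed": len(counts),     # distinct haplotypes in the sample
        "diversity": diversity,
    }


# Strong association: only two of the eight possible 3-site haplotypes occur.
print(haplotype_summary(["000", "000", "000", "111"]))
```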

Page 9:

The main determinants of allelic association

• recombination: breaks down allelic association by “randomizing” allele combinations

• demographic history of effective population size: bottlenecks increase allelic association by non-uniform re-sampling of allele combinations (haplotypes)

bottleneck

Page 10:

Block-like patterns in the human genome

Wall & Pritchard Nature Rev Gen 2003

Daly et al. Nature Genetics 2001

Page 11:

The promise for medical genetics

CACTACCGACACGACTATTTGGCGTAT

• within blocks a small number of SNPs are sufficient to distinguish the few common haplotypes, so significant marker reduction is possible

• if the block structure is a general feature of human variation structure, whole-genome association studies will be possible at a reduced genotyping cost

• this motivated the HapMap project

Gibbs et al. Nature 2003

Page 12:

The HapMap initiative

• goal: to map out human allele and association structure at the kilobase scale

• deliverables: a set of physical and informational reagents

Page 13:

HapMap physical reagents

• reference samples: 4 world populations, ~100 independent chromosomes from each

• markers: millions of SNPs from the US public variation database (dbSNP)

• genotypes: high-accuracy assays using various platforms; fast public data release

Page 14:

Informational: HapMap annotations

(“HaploView”, Daly lab)

Page 15:

Focal questions about the HapMap

1. Required marker density

2. How to quantify the strength of allelic association in a genome region

3. How to choose tagging SNPs

4. How general the answers to these questions are among different human populations

[Figure panels: CEPH European samples; Yoruban samples]

Page 16:

Across samples from a single population?

(random 60-chromosome subsets of 120 CEPH chromosomes from 60 independent individuals)

Page 17:

Consequence for marker performance

Markers selected based on the allele structure of the HapMap reference samples…

… may not work well in another set of samples such as those used for a clinical study.

Page 18:

How to assess sample-to-sample variability?

1. Understanding intrinsic properties of a given genome region, e.g. estimating local recombination rate from the HapMap data (McVean et al. Science 2004)

2. Experimentally genotype additional sets of samples, and compare association structure across consecutive sets directly

3. A desirable alternative would be to generate such additional sets by computational means

Page 19:

Towards a marker selection tool

1. select markers (tag SNPs) with standard methods

2. generate computational samples for this genome region

3. test the performance of markers across consecutive sets of computational samples
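The select-then-test workflow can be sketched as follows. The selection rule is an assumption: greedy pairwise tagging with an r² threshold of 0.8 is one widely used approach, but the slides only say "standard methods". All function names are hypothetical, and haplotypes are "0"/"1" strings.

```python
def r2(haps, i, j):
    """Pairwise r2 between sites i and j; 0.0 if either site is monomorphic."""
    n = len(haps)
    fa = sum(h[i] == "1" for h in haps) / n
    fb = sum(h[j] == "1" for h in haps) / n
    fab = sum(h[i] == "1" and h[j] == "1" for h in haps) / n
    denom = fa * (1 - fa) * fb * (1 - fb)
    return (fab - fa * fb) ** 2 / denom if denom > 0 else 0.0


def greedy_tags(haps, threshold=0.8):
    """Pick tag SNPs greedily: each pick covers the most still-untagged sites."""
    m = len(haps[0])
    covers = {i: {j for j in range(m) if r2(haps, i, j) >= threshold}
              for i in range(m)}
    untagged, tags = set(range(m)), []
    while untagged:
        best = max(sorted(untagged), key=lambda i: len(covers[i] & untagged))
        tags.append(best)
        untagged -= covers[best] | {best}
    return tags


def fraction_captured(haps, tags, threshold=0.8):
    """Fraction of sites that are a tag or in high r2 with a tag in this sample."""
    m = len(haps[0])
    hit = sum(j in tags or any(r2(haps, t, j) >= threshold for t in tags)
              for j in range(m))
    return hit / m


reference = ["110", "110", "001", "001"]  # stands in for the reference samples
other_set = ["110", "100", "001", "011"]  # stands in for another sample set
tags = greedy_tags(reference)
print(tags, fraction_captured(reference, tags), fraction_captured(other_set, tags))
```

In this toy example a single tag captures every site in the reference set but only two of three sites in the other set, which is the kind of performance drop the tool is meant to detect.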

Page 20:

Modeling variations: the Coalescent

• the Coalescent is a simulation technique

• produces possible genealogies backwards in time, towards the most recent common ancestor (MRCA)

• generates mutations (neutral, non-recurrent)

• used to describe the statistical properties of DNA samples

• these statistical properties depend on model structure and model parameters
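To make the bullet points concrete, here is a minimal sketch of a coalescent simulation under the simplest model: constant population size, no recombination, and neutral, non-recurrent (infinite-sites) mutations. Time is measured in units of 2N generations, and `theta` is the scaled mutation rate 4Nμ for the whole region. The function names are illustrative, and this is not the software used in the talk.

```python
import random


def poisson(rng, mean):
    """Draw a Poisson variate by counting unit-rate arrivals before `mean`."""
    count, total = 0, rng.expovariate(1.0)
    while total < mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def simulate_coalescent(n, theta, rng):
    """Return n haplotypes ("0" ancestral, "1" derived), one column per mutation."""
    # Each lineage is the set of sampled chromosomes that descend from it.
    lineages = [{i} for i in range(n)]
    sites = []  # one entry per mutation: the samples carrying the derived allele
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time to the next coalescence is exponential with rate C(k, 2).
        t = rng.expovariate(k * (k - 1) / 2)
        # Mutations fall on each branch at rate theta / 2 per unit time.
        for lineage in lineages:
            for _ in range(poisson(rng, theta / 2 * t)):
                sites.append(frozenset(lineage))
        # Merge two randomly chosen lineages, stepping back towards the MRCA.
        a, b = rng.sample(range(k), 2)
        merged = lineages[a] | lineages[b]
        lineages = [l for idx, l in enumerate(lineages) if idx not in (a, b)]
        lineages.append(merged)
    return ["".join("1" if i in s else "0" for s in sites) for i in range(n)]


for hap in simulate_coalescent(6, 5.0, random.Random(1)):
    print(hap)
```

Columns appear in the order the mutations were generated, not in positional order; without recombination the positions carry no extra information.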

Page 21:

Two simple statistics

1. marker density (MD): distribution of number of SNPs in pairs of sequences

[Plot: MD histogram; x-axis 0 to 10, y-axis 0 to 0.3]

2. allele frequency spectrum (AFS): distribution of SNPs according to allele frequency in a set of samples

[Plot: AFS histogram; x-axis 1 to 10, from “rare” to “common”; y-axis 0 to 0.1]
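Both statistics are easy to compute from a set of aligned haplotypes. The sketch below is a simplification under stated assumptions: haplotypes are "0"/"1" strings, MD is taken over all pairs within one sample of one region, and the AFS counts the "1" allele at each polymorphic site. Function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def marker_density(haps):
    """Distribution of the number of differing sites over all sequence pairs."""
    diffs = [sum(a != b for a, b in zip(x, y)) for x, y in combinations(haps, 2)]
    counts = Counter(diffs)
    return {k: counts[k] / len(diffs) for k in sorted(counts)}


def allele_frequency_spectrum(haps):
    """Fraction of SNPs at each count of the "1" allele (monomorphic sites skipped)."""
    n = len(haps)
    allele_counts = [col.count("1") for col in zip(*haps)]
    snps = [c for c in allele_counts if 0 < c < n]
    counts = Counter(snps)
    return {k: counts[k] / len(snps) for k in sorted(counts)}


haps = ["0011", "0101", "0001", "0000"]
print(marker_density(haps))
print(allele_frequency_spectrum(haps))
```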

Page 22:

The effects of demographic history

[Figure: MD (simulation) and AFS (direct form) histograms for four demographic histories (stationary, expansion, collapse, bottleneck), each history drawn from past to present]

Page 23:

Generating haplotypes with the Coalescent

Page 24:

Global haplotypes vs. local data relevance

Page 25:

Generating data-relevant haplotypes

1. Generate a pair of haplotype sets with Coalescent genealogies. This “models” that the two sets are “related” to each other by being drawn from a single population.

2. Only accept the pair if the first set reproduces the observed haplotype structure of the HapMap reference samples. This enforces relevance to the observed genotype data in the specific region.

3. Use the second haplotype set, induced by the same mutations, as our computational samples.
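The accept/reject logic of these three steps can be sketched as below. The simulator is passed in as a function; `toy_simulate` is a hypothetical stand-in, not a coalescent: it draws hidden population haplotype frequencies and then samples chromosomes from them, which is just enough to make the two halves of each draw "related". A real application would plug in a genealogy-based simulator instead.

```python
import random
from collections import Counter


def data_relevant_sets(simulate, observed, n_sets, rng, max_tries=100000):
    """Collect second-half sample sets whose first half matches the observed data."""
    target = Counter(observed)
    n = len(observed)
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n_sets:
            break
        both = simulate(2 * n, rng)          # step 1: one draw, two related sets
        first, second = both[:n], both[n:]
        if Counter(first) == target:         # step 2: first set must match the data
            accepted.append(second)          # step 3: keep the second set
    return accepted


def toy_simulate(total, rng):
    """Hypothetical stand-in simulator over 2-site haplotypes (not a coalescent)."""
    haps = ["00", "01", "10", "11"]
    weights = [rng.random() for _ in haps]   # hidden population frequencies
    return rng.choices(haps, weights=weights, k=total)


observed = ["00", "00", "11", "11"]
for sample in data_relevant_sets(toy_simulate, observed, 3, random.Random(0)):
    print(sample)
```

Even in this tiny example only a small fraction of draws is accepted, which hints at why efficiency becomes the central problem for realistic sample sizes and marker counts.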

Page 26:

Generating computational samples

Problem: The efficiency of generating data-relevant genealogies (and therefore additional sample sets) with standard Coalescent tools is very low even for modest sample size (N) and number of markers (M). Despite serious efforts with various approaches (e.g. importance sampling) efficient generation of such genealogies is an unsolved problem.

[Figure labels: N (sample size), M (number of markers)]

We are developing a method to generate “approximative” M-marker haplotypes by composing consecutive, overlapping sets of data-relevant K-site haplotypes (for small K).

Motivation from composite likelihood approaches to recombination rate estimation by Hudson, Clark, Wall, and others.

Page 27:

Approximating M-site haplotypes as composites of overlapping K-site haplotypes

1. generate K-site sets

2. build M-site composites

Page 28:

Piecing together neighboring K-site sets

[Figure: two histograms of 3-site haplotype counts (“000” through “111”, y-axis 0 to 20) for neighboring K-site sets; haplotype orderings shown: 000 100 001 101 010 110 011 111 and 000 001 010 011 100 101 110 111]

This should work to the degree to which the constraint at overlapping markers preserves long-range marker association.

Page 29:

Building composite haplotypes

[Figure: twelve histograms of 3-site haplotype counts (“000” through “111”, y-axis 0 to 20), one per K-site set]

A composite haplotype is built from a complete path through the (M-K+1) K-sites.
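A minimal sketch of the path-building idea, assuming each K-site set is a list of N haplotype strings for K consecutive markers and that neighboring sets agree on haplotype counts at their K−1 shared markers. The name `build_composites` and the random pairing rule within each overlap class are illustrative assumptions, not the author's implementation.

```python
import random
from collections import Counter, defaultdict


def build_composites(windows, k, rng):
    """Chain (M-K+1) overlapping K-site sets into N composite M-site haplotypes."""
    if k < 2:
        raise ValueError("k must be at least 2 so that neighboring sets overlap")
    composites = list(windows[0])
    for window in windows[1:]:
        # Group the next set's haplotypes by the K-1 markers shared with the path.
        pool = defaultdict(list)
        for hap in window:
            pool[hap[:k - 1]].append(hap)
        for bucket in pool.values():
            rng.shuffle(bucket)
        extended = []
        for comp in composites:
            bucket = pool[comp[-(k - 1):]]
            if not bucket:
                raise ValueError("neighboring K-site sets disagree at the overlap")
            # Extend the path by the one new marker this K-site set contributes.
            extended.append(comp + bucket.pop()[-1])
        composites = extended
    return composites


# M = 4 markers, K = 3: two overlapping 3-site sets (markers 0-2 and 1-3).
windows = [["000", "000", "111", "110"],
           ["000", "001", "111", "100"]]
composites = build_composites(windows, 3, random.Random(0))
print(sorted(composites))
```

Each haplotype in a K-site set is used exactly once, so every K-site window of the composite set reproduces the corresponding input set.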

Page 30:

Initial results: 3-site composite haplotypes

a typical 3-site composite

30 CEPH HapMap reference individuals (60 chr)

Hinds et al. Science, 2005

Page 31:

3-site composite vs. data

[Scatter plot: r² (3-site composite) vs. r² (data), both axes 0 to 1]

Page 32:

3-site composites: the “best case”

[Scatter plot: r² (“exact” 3-site composite) vs. r² (data), both axes 0 to 1; regions labeled “short-range” and “long-range”]

1. generate K-site sets

Page 33:

Variability across sets

The purpose of the composite haplotype sets …

… is to model sample variance across consecutive data sets.

But the variability across the composite haplotype sets is compounded by the inherent loss of long-range association when 3-sites are used.

Page 34:

4-site composite haplotypes

[Scatter plot: r² (4-site composite #2) vs. r² (data), both axes 0 to 1]

Page 35:

“Best-case” 4-site composites

Composite of exact 4-site sub-haplotypes

[Scatter plot: r² (“exact” 4-site composite) vs. r² (data), both axes 0 to 1]

Page 36:

Variability across 4-site composites

Page 37:

Variability across 4-site composites

[Scatter plots: r² (data #2) vs. r² (data #1); r² (4-site composite #5) vs. r² (4-site composite #1); all axes 0 to 1]

… is comparable to the variability across data sets.

Page 38:

Software engineering aspects: efficiency

To do larger-scale testing we must first improve the efficiency of generating composite sets. Currently, we run fresh Coalescent runs at each K-site (several hours per region).

Total # genotyped SNPs is ~ 1 million -> 1 million different K-sites to match. Any given Coalescent genealogy is likely to match one or more of these. Computational hap sets can be databased efficiently.

4 HapMap populations × 1 million K-sites × 1,000 comp sets × 50 bytes = 2 × 10^11 bytes, i.e. 200 gigabytes (about 186 GiB)
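The storage estimate can be checked with a few lines of arithmetic. The variable names are ours; the four input figures are the ones quoted on the slide.

```python
populations = 4          # HapMap populations
k_sites = 1_000_000      # one K-site per genotyped SNP
comp_sets = 1_000        # computational sets stored per K-site
bytes_per_set = 50       # storage per computational set

total_bytes = populations * k_sites * comp_sets * bytes_per_set
print(total_bytes)                       # 200000000000
print(total_bytes / 10**9)               # 200.0 (decimal gigabytes)
print(round(total_bytes / 2**30, 1))     # 186.3 (binary gibibytes)
```

Whether the total comes out at or just under 200 depends on whether decimal (10^9) or binary (2^30) gigabytes are meant.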

Page 39:

Un-phased genotypes

(AC)(CG)(AT)(CT)

A G A C
C C T T

http://pga.gs.washington.edu/

• the primary data represent diploid genotypes

• one has the choice to “reconstruct” the haplotypes with statistical methods as shown (e.g. the PHASE program); this may be inaccurate

• or one may account for all possible reconstructions when evaluating data-relevance; this is computationally very expensive
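To see why accounting for all reconstructions is expensive, the sketch below enumerates every haplotype pair consistent with a set of unphased genotypes. With h heterozygous sites there are 2^(h−1) distinct pairs, so the count doubles with each additional heterozygous site. The function name is illustrative; genotypes are two-letter strings as in the slide example.

```python
from itertools import product


def possible_phasings(genotypes):
    """Enumerate all distinct haplotype pairs consistent with unphased genotypes."""
    pairs = set()
    # For each site, choose which allele goes on the first chromosome.
    for orientation in product((0, 1), repeat=len(genotypes)):
        h1 = "".join(g[o] for g, o in zip(genotypes, orientation))
        h2 = "".join(g[1 - o] for g, o in zip(genotypes, orientation))
        pairs.add(tuple(sorted((h1, h2))))  # the two chromosomes are unordered
    return sorted(pairs)


phasings = possible_phasings(["AC", "CG", "AT", "CT"])
print(len(phasings))
for pair in phasings:
    print(pair)
```

The four heterozygous sites of the slide example give eight candidate pairs, one of which is the reconstruction shown (AGAC with CCTT).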

Page 40:

Conclusions

• 3-site composites are unlikely to work

• 4-site composites are very promising

• both the initial results and the expected payoff justify going ahead

• more thorough statistical analyses, performance evaluations, and algorithmic development work ahead

Page 41:

Acknowledgements

Eric Tsung

Aaron Quinlan

Ike Unsal

Eva Czabarka (Dept. Mathematics, William & Mary)