Software tools for the analysis of medically important sequence variations

Gabor T. Marth, D.Sc.
Boston College, Department of Biology
marth@bc.edu
http://bioinformatics.bc.edu/marthlab

Pfizer visit, March 7, 2006

Our lab focuses on three main projects…

1. software tools for clinical case-control association studies

2. software for SNP discovery in clonal and re-sequencing data

3. connecting HapMap and pharmaco-genetic data

1. We are developing computer software to aid tagSNP selection and association testing

[Figure: software overview. Display panels: gene annotations, association statistics, input data views, LD views, user control interface. Data: reference samples, representative computational samples, computational sample database. Workflow: study specification and user input, marker selection, tag evaluation, association testing. Inset plot: LA LD (r2), axis 0 to 1, by marker separation (1-4, 5-9, 10-17, 18-26 Mrk Sep.)]

(discussed in more detail)

2. We build computer tools for SNP discovery

• inherited (germ line) polymorphisms are important as they can predispose to disease

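$$
P(\mathrm{SNP}) = \frac{\sum_{\text{all variable } S} \frac{P(S_1 \mid R_1)}{P_{Prior}(S_1)} \cdots \frac{P(S_N \mid R_N)}{P_{Prior}(S_N)} \, P_{Prior}(S_1, \ldots, S_N)}{\sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \frac{P(S_{i_1} \mid R_1)}{P_{Prior}(S_{i_1})} \cdots \frac{P(S_{i_N} \mid R_N)}{P_{Prior}(S_{i_N})} \, P_{Prior}(S_{i_1}, \ldots, S_{i_N})}
$$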
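A minimal Python sketch of this kind of Bayesian calculation, for illustration only. It assumes N aligned reads at one site, where each read i supplies a per-base posterior P(S_i | R_i), a uniform per-base prior of 0.25, and a joint prior that splits a polymorphism rate `p_poly` evenly over all polymorphic columns. The helper names (`read_posterior`, `joint_prior`, `snp_probability`) and the shape of the joint prior are assumptions made for this sketch and are not taken from PolyBayes.

```python
from itertools import product

BASES = "ACGT"

def read_posterior(called_base, error_prob):
    """Per-base posterior for one read: the called base gets 1 - e, the rest share e."""
    return {b: (1.0 - error_prob) if b == called_base else error_prob / 3.0
            for b in BASES}

def joint_prior(column, p_poly):
    """Hypothetical joint prior over one column of N true bases.

    Monomorphic columns share 1 - p_poly; polymorphic columns share p_poly.
    """
    n_mono = len(BASES)
    n_poly = len(BASES) ** len(column) - n_mono
    if len(set(column)) == 1:
        return (1.0 - p_poly) / n_mono
    return p_poly / n_poly

def snp_probability(read_posteriors, p_poly=0.001, base_prior=0.25):
    """Sum over polymorphic columns divided by the sum over all columns."""
    if len(read_posteriors) < 2:
        raise ValueError("need at least two reads to compare")
    numerator = denominator = 0.0
    for column in product(BASES, repeat=len(read_posteriors)):
        term = joint_prior(column, p_poly)
        for base, posterior in zip(column, read_posteriors):
            term *= posterior[base] / base_prior
        denominator += term
        if len(set(column)) > 1:  # "variable" column: not all bases identical
            numerator += term
    return numerator / denominator

if __name__ == "__main__":
    reads = [read_posterior("C", 0.001), read_posterior("T", 0.001)]
    print(round(snp_probability(reads), 3))
```

The enumeration grows as 4^N, so this form is only practical for a handful of reads; it is meant to show the structure of the numerator and denominator, not to be efficient.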

• we have a 5-year NIH R01 grant to re-develop PolyBayes©, our SNP discovery package, originally developed while the PI was at the Washington University Medical School

Marth et al. Nature Genetics 1999

• looking for SNPs and short INDELs

Apply our tools for genome-scale SNP mining

Sachidanandam et al. Nature 2001

~ 10 million

genome reference

Extend our methods for SNP detection in medical re-sequencing data from traditional Sanger sequencers…

Homozygous T

Homozygous C

Heterozygous C/T

… and in 454 pyrosequence data

454 sequence from the NCBI Trace Archive

• accurate base calling for de novo sequencing

• detection of heterozygotes in medical re-sequencing data

Figure from Nordfors et al. Human Mutation 19:395-401 (2002)

Developing methods to detect somatic mutations (as distinguished from inherited polymorphisms)

• the detection of somatic mutations, and their distinction from inherited polymorphism, will be important to separate pre-disposing variants from mutations that occur during disease progression e.g. in cancer

Process DNA methylation data obtained with sequencing

DNA methylation is important e.g. because hypo- and hypermethylation are consistently present in various cancers

Issa. Nature Reviews Cancer, 4, 2004: 988-993

we are developing methods to interpret DNA methylation data obtained with sequencing, in the presence of methodological artifacts such as incomplete bi-sulfite conversion of un-methylated cytosines

Lewin et al. Bioinformatics, 20:3005-3012, 2004
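As a hedged illustration of why incomplete bisulfite conversion matters: if a fraction `conversion_rate` of unmethylated cytosines is converted, the observed fraction of C signal at a site is roughly m + (1 - m)(1 - conversion_rate) for true methylation level m. The sketch below inverts that relation. The function name and this simple one-parameter model are assumptions for illustration, not the method under development in the lab or the one in the cited paper.

```python
def corrected_methylation(observed_c_fraction, conversion_rate):
    """Estimate the methylation level from the observed C fraction.

    Assumes observed = m + (1 - m) * (1 - conversion_rate) and solves for m,
    clamping the result to the interval [0, 1].
    """
    if not 0.0 < conversion_rate <= 1.0:
        raise ValueError("conversion_rate must be in (0, 1]")
    unconverted = 1.0 - conversion_rate
    estimate = (observed_c_fraction - unconverted) / conversion_rate
    return min(1.0, max(0.0, estimate))

if __name__ == "__main__":
    # With 90% conversion, an observed C fraction of 0.55 corresponds to m of about 0.5.
    print(corrected_methylation(0.55, 0.9))
```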

… and tools to integrate genetic and epigenetic data from varied sources to find “common themes” during cancer development

chromatin structure

gene expression profiles

copy number changes

methylation profiles

chromosome rearrangement

repeat expansions

somatic mutations

3. We are planning a project to connect multi-marker haplotypes to drug metabolic phenotypes

• predicting metabolic phenotypes (ADR) based on haplotype markers

• evolutionary origin of drug metabolizing enzyme polymorphisms

Computer software to aid case-control association studies: tagSNP selection and association testing (details)

[Figure: LA LD (r2), axis 0 to 1, by marker separation (1-4, 5-9, 10-17, 18-26 Mrk Sep.)]

Dr. Eric Tsung

Clinical case-control association studies – concepts

• association studies are designed to find disease-causing genetic variants

• cases and controls are genotyped at various polymorphisms

• we search for “significant” marker allele frequency differences between cases and controls

[Figure: clinical cases vs. clinical controls; axis label: AF(cases)]
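A small sketch of a test for an allele frequency difference between cases and controls at one marker, using a standard 2x2 chi-square test on allele counts (1 degree of freedom). The function name and the example counts are hypothetical. Counts are chromosomes, not individuals.

```python
import math

def allele_frequency_test(case_alt, case_total, ctrl_alt, ctrl_total):
    """Return (delta_AF, chi-square statistic, p-value) for one biallelic marker."""
    table = [[case_alt, case_total - case_alt],
             [ctrl_alt, ctrl_total - ctrl_alt]]
    row_sums = [case_total, ctrl_total]
    col_sums = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("marker is monomorphic or a sample is empty")
    grand = case_total + ctrl_total
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / grand
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    delta_af = case_alt / case_total - ctrl_alt / ctrl_total
    return delta_af, chi2, p_value

if __name__ == "__main__":
    # Hypothetical counts: 300 of 1000 case chromosomes vs. 200 of 1000 control chromosomes.
    print(allele_frequency_test(300, 1000, 200, 1000))
```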

Association study designs

• region(s) interrogated: single gene, list of candidate genes (“candidate gene study”), or entire genome (“genome scan”)

• direct or indirect: a direct study genotypes the causative variant itself; an indirect study genotypes a marker that is co-inherited with the causative variant

• single-SNP marker or multi-SNP haplotype marker

• single-stage or multi-stage

Marker (tag) selection for association studies

for economy, one cannot genotype every SNP in thousands of clinical samples: marker selection is the process where a subset of all available SNPs is chosen

1. hypothesis-driven (i.e. based on gene function)

2. LD-driven: based entirely on the reduction of redundancy presented by the linkage disequilibrium (LD) between SNPs; tags represent other SNPs they are correlated with

[Figure label: causative variant]
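To make LD-driven selection concrete, here is a sketch of one generic greedy approach: compute pairwise r2 from phased haplotypes (alleles coded 0/1) and repeatedly pick the SNP that tags the most still-untagged SNPs at a chosen r2 threshold. This is a textbook-style illustration with hypothetical function names and a hypothetical 0.8 threshold, not the selection algorithm of the software described in this talk.

```python
def r_squared(a, b):
    """r2 between two biallelic SNPs given as equal-length lists of 0/1 alleles."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    d = pab - pa * pb
    denom = pa * (1 - pa) * pb * (1 - pb)
    return 0.0 if denom == 0 else d * d / denom

def greedy_tags(haplotypes, threshold=0.8):
    """Return indices of tag SNPs so that every SNP is a tag or has r2 >= threshold with one."""
    n_snps = len(haplotypes[0])
    columns = [[h[j] for h in haplotypes] for j in range(n_snps)]
    covers = {
        i: {j for j in range(n_snps)
            if i == j or r_squared(columns[i], columns[j]) >= threshold}
        for i in range(n_snps)
    }
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Pick the untagged SNP covering the most untagged SNPs (lowest index on ties).
        best = max(sorted(untagged), key=lambda i: len(covers[i] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags

if __name__ == "__main__":
    haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
    print(greedy_tags(haps))  # SNPs 0 and 1 are perfectly correlated; SNP 2 is independent
```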

The International HapMap project

http://www.hapmap.org

The international HapMap project was designed to provide a set of physical and informational reagents for association studies by mapping out human LD structure

LD varies across samples

African reference (YRI)

there are large differences in LD between different human populations…

European reference (CEU)

… and even between samples from the same population.

Other European samples

Sample-to-sample LD differences make tagSNP selection problematic

groups of SNPs that are in LD in the HapMap reference samples may not be in a future set of clinical samples…

… and tags that were selected based on LD in the HapMap may no longer work (i.e. represent the SNPs they were supposed to) in the clinical samples…

… possibly resulting in missed disease associations.

Natural marker allele frequency differences confound association testing

reference samples: ~ 120 chromosomes

cases: 500-2,000 chromosomes

controls: 500-2,000 chromosomes

• the HapMap reference samples are much smaller than clinical sample sizes

• it is difficult to assess accurately both the marker allele frequency (single-SNP or haplotype frequency) in the clinical samples and the naturally occurring variation in marker allele frequency differences between cases and controls

AF(cases)

• therefore difficult to assess statistical significance of candidate associations

We are developing technology for assessing sample-to-sample variance in silico

we estimate LD differences between HapMap and future clinical samples by generating “computational” samples representing future clinical samples, and we use these computational “proxy” samples for tabulating LD and allele frequency differences.

[Figure labels: reference; “cases”; “controls”; tag selection; tag evaluation; association testing]

Two methods of computational sample generation

[Figure labels: HapMap; “HapMap”; “cases”; “controls”]

Method 1. “Data-relevant Coalescent”. This algorithm uses a population genetic model to connect mutations in the HapMap reference to mutations in future clinical samples. Full model but computationally slow.

Method 2. The PAC method (product of approximate conditionals, Li & Stephens). This method constructs “new” samples as mosaics of existing haplotypes, mimicking the effects of recombination. An approximation but fast.
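The mosaic idea behind Method 2 can be sketched as a simple copying process: walk along the sites, copy alleles from one reference haplotype, and occasionally switch to another to mimic recombination. This is a deliberately simplified illustration. The function names, the constant `switch_prob`, and the optional `mutation_prob` are assumptions; the full PAC model of Li & Stephens also conditions on previously sampled haplotypes and uses recombination-rate-dependent switching, which is not done here.

```python
import random

def mosaic_haplotype(reference, switch_prob=0.05, mutation_prob=0.0, rng=random):
    """Build one new haplotype as a mosaic of the reference haplotypes (0/1 alleles)."""
    n_sites = len(reference[0])
    source = rng.randrange(len(reference))
    new_hap = []
    for site in range(n_sites):
        # Mimic recombination: occasionally jump to a different template.
        if site > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(reference))
        allele = reference[source][site]
        # Optional imperfect copying, mimicking mutation.
        if rng.random() < mutation_prob:
            allele = 1 - allele
        new_hap.append(allele)
    return new_hap

def computational_sample(reference, n_haplotypes, **kwargs):
    """Generate a computational sample of n_haplotypes mosaic haplotypes."""
    return [mosaic_haplotype(reference, **kwargs) for _ in range(n_haplotypes)]

if __name__ == "__main__":
    ref = [[0, 0, 1, 1, 0], [1, 0, 0, 1, 1], [1, 1, 0, 0, 1]]
    for hap in computational_sample(ref, 4, rng=random.Random(7)):
        print(hap)
```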

Computational samples

HapMap (CEU)

Computational (PAC)

Computational (Coalescent)

Extra genotypes (Estonia)

MARKER EVALUATION with computational samples

test if markers selected from the HapMap continue to “tag” other SNPs in their original LD group

MARKER SELECTION with computational samples

selecting tags in multiple consecutive sets of computational samples and choosing for the association study the best-performing tags

ASSOCIATION TESTING with computational samples

[Figure: three consecutive pairs of computational “cases” and “controls” samples]

tabulating ΔAF in “cases” vs. “controls” in multiple consecutive computational pairs of samples provides the natural range of allele frequency differences to decide if a candidate association is statistically significant
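A sketch of how such a tabulation could be turned into a significance estimate: compute ΔAF for each computational “cases”/“controls” pair to obtain a null range, then ask how often the null is at least as extreme as the observed difference. The function names and the two-sided empirical p-value with the +1 correction are assumptions chosen for illustration; samples are lists of haplotypes with 0/1 alleles.

```python
def allele_frequency(sample, marker):
    """Frequency of allele 1 at the given marker index."""
    return sum(h[marker] for h in sample) / len(sample)

def null_delta_af(sample_pairs, marker):
    """Delta AF for each (cases, controls) pair of computational samples."""
    return [allele_frequency(cases, marker) - allele_frequency(controls, marker)
            for cases, controls in sample_pairs]

def empirical_p_value(observed_delta, null_deltas):
    """Two-sided empirical p-value with the usual +1 correction."""
    extreme = sum(1 for d in null_deltas if abs(d) >= abs(observed_delta))
    return (extreme + 1) / (len(null_deltas) + 1)

if __name__ == "__main__":
    pairs = [
        ([[1], [0]], [[1], [0]]),
        ([[1], [1]], [[1], [0]]),
        ([[0], [0]], [[1], [0]]),
    ]
    deltas = null_delta_af(pairs, marker=0)
    print(deltas, empirical_p_value(0.75, deltas))
```

In practice the null would be built from many more than three pairs; the resolution of the empirical p-value is limited to 1 / (number of pairs + 1).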

AF(cases)

Do computational samples represent future clinical genotypes realistically?

we quantify the quality of representation by the correlation of LD between corresponding pairs of markers (i.e. we ask: if two markers are in strong LD in one set of samples, are they ALSO in strong LD in the other set?)
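One way to compute such a comparison, sketched under simple assumptions: calculate r2 for every marker pair in each sample set, then take the Pearson correlation between the two lists of r2 values. Function names are hypothetical, haplotypes are lists of 0/1 alleles over the same markers in both sets, and whether the analysis in this talk used Pearson correlation specifically is not stated in the slides.

```python
import math
from itertools import combinations

def r_squared(a, b):
    """r2 between two biallelic SNPs given as equal-length lists of 0/1 alleles."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    d = pab - pa * pb
    denom = pa * (1 - pa) * pb * (1 - pb)
    return 0.0 if denom == 0 else d * d / denom

def pairwise_ld(haplotypes):
    """r2 for every marker pair (i < j), in a fixed order."""
    n_snps = len(haplotypes[0])
    cols = [[h[j] for h in haplotypes] for j in range(n_snps)]
    return [r_squared(cols[i], cols[j]) for i, j in combinations(range(n_snps), 2)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)

def ld_correlation(sample_a, sample_b):
    """Correlation of pairwise LD between two sample sets typed at the same markers."""
    return pearson(pairwise_ld(sample_a), pairwise_ld(sample_b))

if __name__ == "__main__":
    a = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
    b = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
    print(ld_correlation(a, a), ld_correlation(a, b))
```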

LD difference -- comparison to extra experimental genotypes

0.949 +/- 0.013

0.978 +/- 0.010

0.963 +/- 0.014

• we have analyzed two extra genotype sets collected at the HapMap SNPs in three genome regions, from our clinical collaborators (Prof. Thomas Hudson, McGill; Prof. Stanley Nelson, UCLA)

AF difference -- comparisons to extra experimental genotypes

[Figure: AF difference, Estonian data; axis 0 to 0.06]

• according to our limited initial test, computational samples can represent future clinical samples well for estimating sample-to-sample variability

A new marker selection and association testing software tool

• data visualization

• representative computational sample generation

• advanced tag selection functionality

• gene annotations overlaid on physical map of SNPs (i.e. the human genome sequence)

• advanced association testing functionality

• multi-level user customization, including user conveniences such as tag prioritization based on SNP assay score

[Figure: software screenshot. Panels: reference samples, representative computational samples, gene annotations, LD views, association statistics. Inset plot: LA LD (r2), axis 0 to 1, by marker separation (1-4, 5-9, 10-17, 18-26 Mrk Sep.)]

User community

• companies designing new generations of whole-genome or specialized SNP arrays

• researchers comparing alternative platforms (e.g. the Affymetrix 500K and the Illumina 300K) to find the one most suitable for their study

• clinical researchers designing candidate gene studies

• researchers designing second-stage follow-up studies in specific genome regions after an initial genome scan (our methods can take advantage of first-stage data already available in the clinical samples)

• the association testing features should be useful for analysts regardless of study design

Base calling and SNP detection in sequence traces including 454 data

Aaron Quinlan

Base calling and SNP detection in sequence traces including 454 “pyrogram” data

• PolyBayes was originally written to find SNPs in clonal sequences in large SNP discovery projects

• medical re-sequencing projects require the detection of SNPs in heterozygous diploid sequence traces

Heterozygote detection in sequence traces

[Figure: individual traces for Ind. 1, Ind. 2, Ind. 3, and Ind. 4]

• we use a machine learning method (Support Vector Machine, SVM) to recognize characteristic features of homozygous vs. heterozygous positions

Aggregating information from multiple traces

forward/reverse sequences from same individual: P(GT | Read) = .98 and P(GT | Read) = .87

resultant genotype call: P(GT) = .993

Discovery vs. genotyping

discovery: “uninformed prior”, Prior(CT) = .001

• don’t know if site is polymorphic

• have to test each site

genotyping: “informed prior”, Prior(CT) = 0.34

1. site is known to be polymorphic

2. allele frequency estimate
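The effect of the prior can be illustrated with a two-hypothesis Bayes update (heterozygous CT vs. not), combining evidence from several reads under an independence assumption. The function name and the likelihood-ratio inputs are hypothetical; this is a generic sketch, not the lab's genotype caller, and it is not intended to reproduce the probabilities quoted on the slides.

```python
def posterior_het(prior_het, likelihood_ratios):
    """Posterior probability of a heterozygote.

    likelihood_ratios: one value per read, P(read | het) / P(read | not het).
    Reads are treated as independent, so their ratios multiply.
    """
    if not 0.0 < prior_het < 1.0:
        raise ValueError("prior_het must be strictly between 0 and 1")
    odds = prior_het / (1.0 - prior_het)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

if __name__ == "__main__":
    evidence = [100.0]  # hypothetical: one read favoring a heterozygote 100 to 1
    print("discovery prior :", posterior_het(0.001, evidence))
    print("genotyping prior:", posterior_het(0.34, evidence))
```

With the same evidence, the uninformed discovery prior leaves the call uncertain while the informed genotyping prior makes it confident, which is why discovery needs stronger trace evidence per site.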

Our heterozygote detection works better than other methods

Performance measured on ~1000 alignments covering a 500 kb region of chromosome 4

Method        Fraction of data analyzed   False discovery   Fraction of heterozygotes   Fraction of homozygotes
PolyBayes+    85.1                        0.0375            86.60%                      97.8%
Polyphred 5   86.17                       0.0389            83.16%                      82.63%

Base calling for “pyrograms”

From NCBI Trace Archive

• we have access to standardized data formats

• readout in pyrosequencing is based on instantaneous detection of base incorporation… multiple bases of the same type are incorporated in the same cycle

26 55 24 15 10 7 5 4 2 1 0 0

TCAGGGGGGGGGGGACGACAAGGCGTGGGGA

• the identity of consecutive bases is very reliable but the length of mono-nucleotide runs (base number) is difficult to quantify (great for re-sequencing, but problematic for de novo sequencing)
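A toy sketch of why run length is the hard part: in a flow-based readout, each flow tests one nucleotide, and the signal is roughly proportional to the number of bases incorporated, so the base identity follows from the flow order while the run length must be estimated by rounding the signal. The flow order `"TACG"`, the pre-normalized signals, and the function names are assumptions for illustration.

```python
def call_bases(flow_signals, flow_order="TACG"):
    """Convert normalized flow signals into a base sequence by rounding run lengths."""
    sequence = []
    for i, signal in enumerate(flow_signals):
        base = flow_order[i % len(flow_order)]
        run_length = max(0, int(signal + 0.5))  # nearest integer, never negative
        sequence.append(base * run_length)
    return "".join(sequence)

def run_length_ambiguity(signal):
    """Distance of the signal from the nearest integer (0 = clear, 0.5 = ambiguous)."""
    return abs(signal - int(signal + 0.5))

if __name__ == "__main__":
    # Hypothetical signals: single T, no A, single C, a long G run, no T, single A.
    flows = [1.02, 0.05, 0.97, 11.3, 0.1, 1.1]
    print(call_bases(flows))
    print(round(run_length_ambiguity(11.3), 2))
```

The same absolute noise that is harmless for a signal near 1 can move a signal near 11 across a rounding boundary, which is the difficulty with long mono-nucleotide runs.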

SNP genotyping with pyrosequencers

Nordfors et al. Human Mutation 19:395-401 (2002)

we are in the process of identifying discriminating pyrogram features to use in our machine-learning methods to recognize polymorphic positions within traces

Somatic mutation detection

Michael Stromberg

Somatic mutations

the detection of somatic mutations, and their distinction from inherited polymorphism, is important to separate pre-disposing variants from mutations that occur during disease progression e.g. in cancer

1. detect the mutations

2. classify whether somatic or inherited

Detecting somatic mutations with comparative data

• based on comparison of cancer and normal tissue from the same individual

• cancer tissue is often highly heterogeneous, and the somatic mutant allele may be present at low allele frequency

Detecting somatic mutations with subtraction

• if normal tissue samples are not available, we detect SNPs in cancer tissue against e.g. the human genome reference sequence

• subtract apparent mutations that are present in sequence variation databases

• search for evidence that these mutations are genetic
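The subtraction step can be sketched as a simple filter: variants called in the cancer tissue against the reference are dropped if they already appear in a database of known sequence variation. Representing a variant as a `(chromosome, position, ref, alt)` tuple and the function name are assumptions for illustration; the coordinates in the example are made up.

```python
def subtract_known_variants(tumor_variants, known_variants):
    """Keep tumor variants that are absent from the known-variation set.

    Variants are (chromosome, position, ref_allele, alt_allele) tuples.
    The result preserves the order of tumor_variants.
    """
    known = set(known_variants)
    return [v for v in tumor_variants if v not in known]

if __name__ == "__main__":
    tumor = [("chr4", 1200, "C", "T"), ("chr4", 3400, "G", "A"), ("chr7", 55, "A", "G")]
    database = [("chr4", 1200, "C", "T")]  # hypothetical known polymorphism
    print(subtract_known_variants(tumor, database))
```

Variants that survive the filter are only candidates: an inherited variant that is missing from the database would pass as well, so further evidence is still needed.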

Detecting somatic mutations with subtraction

• we have applied our methods for somatic mutation detection in murine mitochondrial sequences

heteroplasmy homoplasmy

• we will be applying our methods to human nuclear DNA from our collaborators

Using new haplotype resources to connect genotype and clinical outcome in pharmaco-genetic systems

• the HapMap was designed as a tool to detect high-frequency (common) phenotypic (e.g. disease-causing) alleles

• important drug metabolizing enzymes are relatively few in number, well studied, and at known genome locations; many associated phenotypes are well described

• many functional alleles are known, and of high frequency (common)

• multi-SNP alleles are highly predictive of metabolic phenotype

• clinical phenotype (adverse drug reaction) less predictable

• ideal candidate for applying haplotype resources

Multi-marker haplotypes as accurate markers for ADRs?

functional allele (known metabolic polymorphism)

genetic marker (haplotype) in genome regions of drug metabolizing enzyme (DME) genes

molecular phenotype (drug concentration measured in blood plasma)

clinical endpoint (adverse drug reaction)

computational prediction based on haplotype structure

Resources

• specifics of enzyme-drug interactions

• LD and haplotype structure in the HapMap reference samples, based on high-density SNP map

• functional alleles

• existing DME P genotyping chips

Evolutionary questions

• mutation age?

• mutations single-origin or recurrent?

• geographic origin of mutations?

• analysis based on complete local variation structure and haplotype background of functional mutations

• specifics of the selection process that led to specific functional alleles?

Proposed steps of analysis

• haplotypes vs. metabolic phenotype?

• complete polymorphic structure?

• ethnicity?

• additional functional SNPs?

• haplotypes vs. functional alleles?

[Figure labels: haplotype block?; haplotype; functional allele (genotype); metabolic phenotype; clinical phenotype (ADR)]

• haplotypes vs. ADR phenotype?
