TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.

Post on 24-Dec-2015


TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATATTCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCAGAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTTCCCTGTTTCCAGGTTTGTTGTCCCAAAATAGTGACCATTTCATATGTATA

Comparative Genomics

Overview

I. Comparing genome sequences

• Concepts and terminology

• Methods

- Whole-genome alignments

- Quantifying evolutionary conservation (PhastCons, PhyloP)

- Identifying conserved elements

• Available datasets at UCSC

II. Comparative analyses of function

• Evolutionary dynamics of gene regulation

• Case studies

• Insights into regulatory variation within and across species

Distribution of evolutionary constraint in the human genome

Lindblad-Toh et al. Nature 478:476 (2011)

4.2% of genome is putatively constrained

~1 million putative regulatory elements

Goals of comparative genomics

• Infer the course of past evolution using statistical models of sequence evolution

• Identify sequence elements evolving more slowly or more rapidly than neutral

• Evaluate the precise degree of constraint on specific positions

• Predict the functional effects of nucleotide or amino acid mutations in constrained sequences

Vertebrate genomes available for comparative studies

[Figure: phylogeny of the available genomes, with clades labeled Primates, Mammals, Tetrapods, Vertebrates]

Commonly used (and misused) terms

Mutation vs. Substitution

• Mutations occur in individuals, segregate in populations

• Substitutions are mutations that have become fixed

• Mutations = within species; substitutions = between species

Conservation vs. Constraint

• Conservation = an observation of sequence similarity

• Constraint = a hypothesis about the effect of purifying selection

Homology, Orthology and Paralogy

• Homologous sequences = derived from a common ancestor

• Orthologous sequences = homologous sequences separated by a speciation event (e.g., human HOXA and mouse Hoxa)

• Paralogous sequences = homologous sequences separated by gene duplication (e.g., human HOXA and human HOXB)

Basic premises in comparative sequence analysis

Most mutations that affect function are eliminated by purifying selection

• Constrained elements have lower substitution rates than expected from the neutral rate

• Contingent on the effect of the mutation and degree of constraint on the function

• Manifests as sequence conservation, even among distant species

Beneficial mutations may be driven to fixation by positive selection

• May be detected as “faster-than-neutral” substitution rate

• Expected to be rare

Most sequence differences among genomes are neutral

• Involve substitutions with minimal or no functional impact

• Fixed by random genetic drift

• For neutral mutations, the fixation rate is equal to the mutation rate

• Genomes become more dissimilar with greater phylogenetic distance
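
The claim that the neutral fixation rate equals the mutation rate follows from a standard population-genetic argument, sketched here as an illustration. The slide does not show this derivation, and the symbols are assumptions introduced for it: N is the diploid population size, \mu is the neutral mutation rate per site per generation, and k is the substitution rate. Each generation about 2N\mu new neutral mutations arise, and each one eventually fixes by drift with probability 1/(2N):

```latex
k \;=\; \underbrace{2N\mu}_{\text{new neutral mutations per generation}}
   \times
   \underbrace{\frac{1}{2N}}_{\text{fixation probability of each}}
   \;=\; \mu
```

Population size cancels, which is why neutral divergence is expected to accumulate roughly in proportion to time and to the mutation rate.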

Phylogenies

Phylogenetic trees show two things:

• Evolutionary relationships among species or sequences: branching order

• Evolutionary distance (e.g., degree of similarity or divergence): branch length

[Figure: tree diagram labeling an internal node, a terminal node, and a branch]

[Figure panels: Species tree; Gene tree]

Orthologs and paralogs in gene trees

Capra et al. 2013

[Figure, Capra et al. 2013: gene tree of HMGCS1 and HMGCS2, built up over three slides. Labels: Duplication; Orthologs (on each of two clades); Paralogs; 1:1 Orthologs (on each of two clades); Human HMGCS1, Human HMGCS2; 1:2]
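
The ortholog/paralog distinction in a gene tree can be sketched in code: a pair of genes is orthologous if its most recent common ancestor is a speciation node, and paralogous if it is a duplication node. This is only an illustration. The tree below is a hypothetical toy (the mouse genes and the topology are assumptions, not taken from the Capra et al. figure), and the tuple encoding is invented for the example.

```python
# Hypothetical toy gene tree: an internal node is (event, left, right),
# a leaf is a gene name. Topology and mouse genes are assumptions.
GENE_TREE = (
    "duplication",
    ("speciation", "human_HMGCS1", "mouse_Hmgcs1"),
    ("speciation", "human_HMGCS2", "mouse_Hmgcs2"),
)


def leaves(node):
    """Return all gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, result=None):
    """Label each gene pair by the event at its most recent common ancestor."""
    if result is None:
        result = {}
    if isinstance(node, str):
        return result
    event, left, right = node
    relation = "orthologs" if event == "speciation" else "paralogs"
    # Genes on opposite sides of this node have this node as their
    # most recent common ancestor.
    for a in leaves(left):
        for b in leaves(right):
            result[frozenset((a, b))] = relation
    classify_pairs(left, result)
    classify_pairs(right, result)
    return result


relations = classify_pairs(GENE_TREE)
for pair, relation in sorted(relations.items(), key=lambda kv: sorted(kv[0])):
    print(" / ".join(sorted(pair)), "->", relation)
```

In this toy tree, genes on opposite sides of the duplication are paralogs even when they come from different species.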

Ortholog assignments at Ensembl

Steps in sequence comparisons

Sequence alignment

• Global vs. local

• Whole-genome vs. genome segments (e.g., genes)

• Identify sites that are homologous (not necessarily identical)

Measure similarity and divergence of sequences

• Sequence similarity: level of conservation

• Rates of change among sequences: divergence

Infer degree of evolutionary constraint

• Are the sequences more conserved than expected from neutral evolution?
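
The "measure similarity and divergence" step can be illustrated with two common quantities for a pair of already-aligned sequences: the raw proportion of differing sites (p-distance) and the Jukes-Cantor corrected distance, which accounts for multiple substitutions at the same site. This is a minimal sketch, not the method used in the slides. The two sequences are invented toy data, and the function names are assumptions.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with an alignment gap
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differences / compared


def jc69_distance(p):
    """Jukes-Cantor corrected distance in substitutions per site."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Toy aligned sequences, made up for illustration.
seq_x = "TGCAAACTCA-AACTC"
seq_y = "TGCAGACTCATAACTT"
p = p_distance(seq_x, seq_y)
print(f"p-distance = {p:.3f}, JC69 distance = {jc69_distance(p):.3f}")
```

The corrected distance is always at least as large as the p-distance, and the gap between them grows as sequences diverge.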

Rates of sequence change are estimated using models of the substitution process

Transition probabilities:
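
The model shown on the original slide is not recoverable from this transcript. As a stand-in, the sketch below uses the Jukes-Cantor (JC69) model, the simplest substitution model, to show what "transition probabilities" means: the probability that base x has become base y after time t. The function name and the `rate` parameter (expected substitutions per site per unit time) are assumptions made for this example.

```python
import math


def jc69_transition_matrix(rate, time):
    """P[x][y]: probability that base x is base y after `time` under JC69.

    `rate` is the expected number of substitutions per site per unit time.
    """
    nucleotides = "ACGT"
    decay = math.exp(-4.0 * rate * time / 3.0)
    same = 0.25 + 0.75 * decay       # probability of ending on the same base
    different = 0.25 - 0.25 * decay  # probability of each specific other base
    return {
        x: {y: (same if x == y else different) for y in nucleotides}
        for x in nucleotides
    }


# Probabilities after 0.1 expected substitutions per site.
matrix = jc69_transition_matrix(rate=1.0, time=0.1)
for x, row in matrix.items():
    print(x, " ".join(f"{row[y]:.4f}" for y in "ACGT"))
```

Each row sums to 1. As `rate * time` grows, every entry approaches 0.25, meaning the sequence retains no information about its ancestor.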

Phylogeny

Substitution rates are calculated for each lineage in a sequence phylogeny

Conserved sequences identified by local reductions in substitution rate

[Figure: two plots over aligned position (x-axis); labels "local" and "neut"]
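
The idea of finding conserved sequence as a local drop in substitution rate can be sketched with a sliding window. This is a deliberately crude illustration: it uses the fraction of sequences differing from the reference as a stand-in for a model-based substitution rate, and the alignment-wide mean as a stand-in for the neutral rate. The alignment, the window size, the 0.5 threshold and the function names are all assumptions.

```python
def column_variability(alignment):
    """Per column: fraction of non-reference rows that differ from row 0."""
    reference = alignment[0]
    others = alignment[1:]
    scores = []
    for i, ref_base in enumerate(reference):
        mismatches = sum(1 for seq in others if seq[i] != ref_base)
        scores.append(mismatches / len(others))
    return scores


def conserved_windows(scores, window=5, fraction=0.5):
    """Return (start, end) of windows whose mean variability falls below
    `fraction` times the alignment-wide mean (a crude neutral proxy)."""
    background = sum(scores) / len(scores)
    hits = []
    for start in range(len(scores) - window + 1):
        local = sum(scores[start:start + window]) / window
        if local < fraction * background:
            hits.append((start, start + window))
    return hits


# Toy alignment, invented for illustration; row 0 is the reference.
alignment = [
    "ACGTACGGGGGG",
    "TCGAACGGGGGG",
    "AGCTTCGGGGGG",
]
scores = column_variability(alignment)
print(conserved_windows(scores))
```

Real tools replace both proxies with rates estimated under a phylogenetic model, but the comparison of a local rate against a neutral expectation is the same.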

Tools for quantifying evolutionary conservation across genomes

Alignment: Multiz

• Generates multiple species alignment relative to a base genome

• Constructed from pairwise alignment of individual genomes to reference

• 46-way and 100-way alignment to hg19, 30-way to mm9; 60-way to mm10
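
Multiz alignments are distributed in MAF format, where each block starts with an "a" line and each aligned sequence is an "s" line whose fields are source, start, size, strand, source size and aligned text. The sketch below reads such blocks and reports each species' identity to the reference row. It is a minimal illustration: the MAF text is made up (coordinates and sizes are placeholders), and the function names are assumptions.

```python
def parse_maf_blocks(text):
    """Parse MAF text into a list of blocks; each block is a list of rows."""
    blocks = []
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        if fields[0] == "a":
            current = []  # an "a" line opens a new alignment block
            blocks.append(current)
        elif fields[0] == "s" and current is not None:
            src, start, size, strand, _src_size, aligned = fields[1:7]
            current.append({
                "src": src,
                "start": int(start),
                "size": int(size),
                "strand": strand,
                "text": aligned,
            })
    return blocks


def identity_to_reference(block):
    """Fraction of ungapped columns matching the first (reference) row."""
    ref = block[0]["text"].upper()
    result = {}
    for row in block[1:]:
        pairs = [(a, b) for a, b in zip(ref, row["text"].upper())
                 if a != "-" and b != "-"]
        matches = sum(1 for a, b in pairs if a == b)
        result[row["src"]] = matches / len(pairs) if pairs else 0.0
    return result


# Made-up MAF block for illustration; all numbers are placeholders.
EXAMPLE_MAF = """##maf version=1
a score=100.0
s hg19.chr1 100 10 + 1000 ACGTACGTAC
s mm10.chr4 500 9 + 1000 ACGTAC-TAC
s galGal4.chr2 900 10 - 1000 ACTTACGTTC
"""

for block in parse_maf_blocks(EXAMPLE_MAF):
    print(identity_to_reference(block))
```

The first row of each block is the base genome, which is why identity is measured against it.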

100-way Multiz alignment in hg19

Green = level of sequence similarity at each site

Conservation of synteny: “net” alignments

• Conservation of genome segments

• Order and orientation of genes and regulatory sequences

• Synteny is frequently conserved on megabase scales

Tools for quantifying evolutionary conservation across genomes

PhastCons

• Estimates the probability that a nucleotide belongs to a conserved element

• Sensitive to ‘runs’ of conserved sites, so effective for identifying conserved blocks

• For hg19, elements are calculated at three phylogenetic scopes (Vertebrate, Placental Mammal, Primate)

PhyloP

• Measures conservation independently at individual positions

• Provides per-base conservation scores (-log p value under hypothesis of neutrality)

• Positive scores suggest constraint; negative scores suggest accelerated evolution

Identifying conserved elements: PhastCons

PhastCons scores

PhastCons elements

lod score: log probability under conserved model - log probability under neutral model

Score: normalized lod score on 0-1000 scale

Use scores to rank elements by estimated constraint

lod: 882; Score: 694
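
The lod score defined above can be sketched as a sum of per-column log-probability differences, which is then used to rank elements. This is a simplified illustration, not the PhastCons implementation: it assumes per-column probabilities under each model are already available and treats columns as independent. The probabilities and element names are invented, and the 0-1000 normalization is omitted because the slides do not specify it.

```python
import math


def lod_score(conserved_probs, neutral_probs):
    """Sum over columns of log P(col | conserved) - log P(col | neutral)."""
    if len(conserved_probs) != len(neutral_probs):
        raise ValueError("need one probability per column under each model")
    return sum(math.log(c) - math.log(n)
               for c, n in zip(conserved_probs, neutral_probs))


def rank_by_lod(elements):
    """Sort (name, conserved_probs, neutral_probs) tuples, highest lod first."""
    scored = [(name, lod_score(c, n)) for name, c, n in elements]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical per-column probabilities, invented for illustration only.
elements = [
    ("element_B", [0.5, 0.4, 0.5], [0.3, 0.3, 0.3]),
    ("element_A", [0.9, 0.8, 0.9], [0.3, 0.3, 0.3]),
]
for name, lod in rank_by_lod(elements):
    print(name, round(lod, 2))
```

A positive lod means the conserved model explains the columns better than the neutral model. Longer elements can accumulate larger values simply by having more columns.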

PhastCons elements estimated at 3 phylogenetic scopes

Primate

Placental

Vertebrate

Level of conservation decays with increasing evolutionary distance

PhyloP: measuring basewise conservation

PhyloP scores

• Scores are calculated independently for each base

• Scores are -log P values under hypothesis of neutral evolution

• Positive scores = constraint

• Negative scores = acceleration

Per-site phyloP conservation scores

4.49 1.77 -0.96
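
To make the three example scores above concrete, the sketch below converts between a p-value and a signed score. It assumes the logarithm is base 10 and that the sign only records the direction of departure from neutrality, with positive meaning slower than neutral. The function names are assumptions, and the statistical test that produces the p-value is not reproduced here.

```python
import math


def signed_score(p_value, slower_than_neutral):
    """-log10(p): positive for constraint, negative for acceleration."""
    magnitude = -math.log10(p_value)
    return magnitude if slower_than_neutral else -magnitude


def p_value_from_score(score):
    """Recover the p-value implied by the magnitude of a score."""
    return 10.0 ** (-abs(score))


# The three per-site scores shown on the slide.
for score in (4.49, 1.77, -0.96):
    direction = "constraint" if score > 0 else "acceleration"
    p = p_value_from_score(score)
    print(f"score {score:+.2f}: p = {p:.2e} ({direction})")
```

Under this reading, each unit of score corresponds to a tenfold change in the p-value under the neutral hypothesis.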

Use PhastCons to identify conserved elements

Use phyloP to evaluate individual sites within elements

Accessing conservation data

Multiple genome alignments and conservation metrics are calculated independently for each reference genome

Orthologous region in mouse:

30-way multiz alignment

Conservation identifies critical binding sites in regulatory elements

[Figure: tracks labeled "Regulatory info (ENCODE)" and "Conservation"]

Important binding sites and variants that affect function will be here