TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT CCCTGTTTCCAGGTTTGTTGTCCCAAAATAGTGACCATTTCATATGTATA Comparative Genomics

Dec 24, 2015


Ross Fowler
Transcript
Page 1:

TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATATTCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCAGAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTTCCCTGTTTCCAGGTTTGTTGTCCCAAAATAGTGACCATTTCATATGTATA

Comparative Genomics

Page 2:

Overview

I. Comparing genome sequences
• Concepts and terminology
• Methods
  - Whole-genome alignments
  - Quantifying evolutionary conservation (PhastCons, PhyloP)
  - Identifying conserved elements
• Available datasets at UCSC

II. Comparative analyses of function
• Evolutionary dynamics of gene regulation
• Case studies
• Insights into regulatory variation within and across species

Page 3:

Distribution of evolutionary constraint in the human genome

Lindblad-Toh et al. Nature 478:476 (2011)

4.2% of genome is putatively constrained

~1 million putative regulatory elements

Page 4:

Goals of comparative genomics

• Infer the course of past evolution using statistical models of sequence evolution
• Identify sequence elements evolving more slowly or more rapidly than neutral
• Evaluate the precise degree of constraint on specific positions
• Predict the functional effects of nucleotide or amino acid mutations in constrained sequences

Page 5:

Vertebrate genomes available for comparative studies

Figure labels: Primates; Mammals; Tetrapods; Vertebrates

Page 6:

Commonly used (and misused) terms

Mutation vs. Substitution
• Mutations occur in individuals, segregate in populations
• Substitutions are mutations that have become fixed
• Mutations = within species; substitutions = between species

Conservation vs. Constraint
• Conservation = an observation of sequence similarity
• Constraint = a hypothesis about the effect of purifying selection

Homology, Orthology and Paralogy
• Homologous sequences = derived from a common ancestor
• Orthologous sequences = homologous sequences separated by a speciation event (e.g., human HOXA and mouse Hoxa)
• Paralogous sequences = homologous sequences separated by gene duplication (e.g., human HOXA and human HOXB)

Page 7:

Basic premises in comparative sequence analysis

Most mutations that affect function are eliminated by purifying selection
• Constrained elements have lower substitution rates than expected from the neutral rate
• Contingent on the effect of the mutation and degree of constraint on the function
• Manifests as sequence conservation, even among distant species

Beneficial mutations may be driven to fixation by positive selection
• May be detected as “faster-than-neutral” substitution rate
• Expected to be rare

Most sequence differences among genomes are neutral
• Involve substitutions with minimal or no functional impact
• Fixed by random genetic drift
• Fixation rate is equal to mutation rate
• Genomes become more dissimilar with greater phylogenetic distance
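
The bullet "Fixation rate is equal to mutation rate" follows from a standard population-genetic argument for neutral mutations. A brief sketch, assuming a diploid population of size $N$ and a neutral mutation rate $\mu$ per site per generation (the symbols $N$, $\mu$ and $k$ are notation introduced here, not taken from the slides): each generation about $2N\mu$ new neutral mutations arise at a site, and each one eventually fixes by drift with probability $1/(2N)$, so the substitution rate $k$ is

```latex
k \;=\; \underbrace{2N\mu}_{\text{new neutral mutations per generation}}
\times
\underbrace{\frac{1}{2N}}_{\text{fixation probability of each}}
\;=\; \mu
```

The population size cancels, which is why neutral divergence between genomes is expected to accumulate roughly in proportion to time.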

Page 8:

Phylogenies

Phylogenetic trees show two things:
• Evolutionary relationships among species or sequences: branching order
• Evolutionary distance (e.g., degree of similarity or divergence): branch length

Figure labels: Internal node; Terminal node; Branch
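
To make the two ideas concrete (branching order and branch length), here is a minimal Python sketch that stores a toy tree as parent pointers and sums branch lengths along the path between two terminal nodes. The node names, the branch lengths and the helper functions `path_to_root` and `distance` are all made up for illustration and do not come from the slides.

```python
# Toy tree: each node maps to (parent, branch length to parent).
# Names and branch lengths are illustrative assumptions, not real estimates.
TREE = {
    "root": (None, 0.0),
    "anc_primates": ("root", 0.08),   # internal node
    "human": ("anc_primates", 0.01),  # terminal node
    "chimp": ("anc_primates", 0.01),  # terminal node
    "mouse": ("root", 0.35),          # terminal node
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = TREE[node][0]
    return path


def distance(a, b):
    """Sum branch lengths along the path joining nodes a and b."""
    ancestors_b = set(path_to_root(b))
    # First node on a's path to the root that is also above b:
    # the most recent common ancestor.
    mrca = next(n for n in path_to_root(a) if n in ancestors_b)
    total = 0.0
    for start in (a, b):
        node = start
        while node != mrca:
            total += TREE[node][1]
            node = TREE[node][0]
    return total
```

In this toy tree, `distance("human", "chimp")` is much smaller than `distance("human", "mouse")`, which is what the longer mouse branch encodes.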

Page 9:

Phylogenies

Phylogenetic trees show two things:
• Evolutionary relationships among species or sequences: branching order
• Evolutionary distance (e.g., degree of similarity or divergence): branch length

Figure panels: Species tree; Gene tree

Page 10:

Orthologs and paralogs in gene trees

Capra et al. 2013

HMGCS1

HMGCS2

Page 11:

Orthologs and paralogs in gene trees

Capra et al. 2013

Figure labels: Orthologs; Orthologs; Paralogs; Duplication

Page 12:

Orthologs and paralogs in gene trees

Capra et al. 2013

Figure labels: 1:1 Orthologs; 1:1 Orthologs; Human HMGCS1; Human HMGCS2; 1:2
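
The ortholog/paralog distinction in these gene-tree slides can be sketched in a few lines of Python: two genes are orthologs if their most recent common ancestor in the gene tree is a speciation node, and paralogs if it is a duplication node. The tree below is a hypothetical simplification (one duplication followed by a human/mouse split in each copy); the mouse gene names, the node names and the `classify` helper are assumptions for illustration and are not taken from Capra et al.

```python
# Hypothetical gene tree: node -> (parent, event at that node).
# Internal nodes carry the event that produced their children; leaves carry None.
GENE_TREE = {
    "dup": (None, "duplication"),
    "spec1": ("dup", "speciation"),
    "spec2": ("dup", "speciation"),
    "human_HMGCS1": ("spec1", None),
    "mouse_Hmgcs1": ("spec1", None),
    "human_HMGCS2": ("spec2", None),
    "mouse_Hmgcs2": ("spec2", None),
}


def ancestors(node):
    """Return the ancestors of `node`, nearest first, excluding the node itself."""
    result = []
    parent = GENE_TREE[node][0]
    while parent is not None:
        result.append(parent)
        parent = GENE_TREE[parent][0]
    return result


def classify(gene_a, gene_b):
    """Label a pair of leaves by the event at their most recent common ancestor."""
    above_b = set(ancestors(gene_b))
    mrca = next(n for n in ancestors(gene_a) if n in above_b)
    event = GENE_TREE[mrca][1]
    return "orthologs" if event == "speciation" else "paralogs"
```

For example, `classify("human_HMGCS1", "mouse_Hmgcs1")` gives `"orthologs"`, while `classify("human_HMGCS1", "human_HMGCS2")` gives `"paralogs"`.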

Page 13:

Ortholog assignments at Ensembl

Page 14:

Ortholog assignments at Ensembl

Page 15:

Ortholog assignments at Ensembl

Page 16:

Steps in sequence comparisons

Sequence alignment
• Global vs. local
• Whole-genome vs. genome segments (e.g., genes)
• Identify sites that are homologous (not necessarily identical)

Measure similarity and divergence of sequences
• Sequence similarity: level of conservation
• Rates of change among sequences: divergence

Infer degree of evolutionary constraint
• Are the sequences more conserved than expected from neutral evolution?
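
As a small illustration of the "measure similarity and divergence" step, the Python sketch below takes two rows of an existing pairwise alignment and reports the fraction of identical sites and its complement (the uncorrected proportion of differing sites). The function name `identity_and_divergence` and the choice to skip gapped columns are assumptions made for this example; real pipelines differ in how they treat gaps and ambiguous bases.

```python
def identity_and_divergence(seq_a, seq_b, gap="-"):
    """Return (identity, divergence) over ungapped columns of two aligned rows.

    Both inputs must be rows of the same alignment, so they have equal length.
    Columns where either row has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    identity = matches / compared
    return identity, 1.0 - identity
```

Whether a given identity counts as "more conserved than expected" still depends on a neutral baseline for the same species pair, which this sketch does not estimate.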

Page 17:

Rates of sequence change are estimated using models of the substitution process

Transition probabilities:
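
The slide does not name a specific substitution model at this point. As one common illustration, and purely as an assumption, the Python sketch below uses the Jukes-Cantor (JC69) model, in which all substitutions occur at the same rate. Here `d` is the expected number of substitutions per site along a branch; the function names are hypothetical.

```python
import math


def jc69_transition_probabilities(d):
    """JC69 probabilities after d expected substitutions per site.

    Returns (p_same, p_diff): the probability that a base is unchanged, and the
    probability that it has become one specific different base.
    """
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return p_same, p_diff


def jc69_distance(p):
    """Estimate d from the observed proportion p of differing sites (p < 0.75)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The second function is the inverse of the first: it corrects an observed proportion of differences for multiple substitutions at the same site, which is why estimated rates exceed raw mismatch fractions at larger distances.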

Page 18:

Phylogeny

Substitution rates are calculated for each lineage in a sequence phylogeny

Page 19:

Conserved sequences identified by local reductions in substitution rate

Figure labels: aligned position; aligned position; local; neut
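
The idea of a "local reduction in substitution rate" can be sketched crudely in Python: slide a window along the alignment and flag windows whose observed difference rate falls below a neutral reference rate. This is only a toy proxy. Methods such as PhastCons fit substitution rates on a phylogeny rather than counting mismatches, and the function name, the window size and the 0/1 `diff_flags` encoding are assumptions made for this example.

```python
def conserved_windows(diff_flags, neutral_rate, window=5):
    """Return start indices of windows whose difference rate is below neutral_rate.

    diff_flags: list of 0/1 values, one per aligned position, where 1 means the
    aligned sequences differ at that position.
    """
    hits = []
    for start in range(len(diff_flags) - window + 1):
        local_rate = sum(diff_flags[start:start + window]) / window
        if local_rate < neutral_rate:
            hits.append(start)
    return hits
```

Overlapping hits would normally be merged into a single candidate element before any further scoring.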

Page 20:

Tools for quantifying evolutionary conservation across genomes

Alignment: Multiz
• Generates multiple species alignment relative to a base genome
• Constructed from pairwise alignment of individual genomes to reference
• 46-way and 100-way alignments to hg19; 30-way to mm9; 60-way to mm10

Page 21:

100-way Multiz alignment in hg19

Green = level of sequence similarity at each site

Page 22:

Conservation of synteny: “net” alignments

• Conservation of genome segments
• Order and orientation of genes and regulatory sequences

Page 23:

Conservation of synteny: “net” alignments

• Synteny is frequently conserved on megabase scales

Page 24:

Tools for quantifying evolutionary conservation across genomes

PhastCons
• Estimates the probability that a nucleotide belongs to a conserved element
• Sensitive to ‘runs’ of conserved sites; effective for identifying conserved blocks
• For hg19, elements are calculated at three phylogenetic scopes (Vertebrate, Placental Mammal, Primate)

PhyloP
• Measures conservation independently at individual positions
• Provides per-base conservation scores (-log p-value under the hypothesis of neutrality)
• Positive scores suggest constraint; negative scores suggest accelerated evolution

Alignment: Multiz
• Generates multiple species alignment relative to a base genome
• Constructed from pairwise alignment of individual genomes to reference
• 46-way and 100-way alignments to hg19; 30-way to mm9; 60-way to mm10

Page 25:

Identifying conserved elements: PhastCons

PhastCons scores

PhastCons elements

lod score: log probability under conserved model minus log probability under neutral model
Score: normalized lod score on 0-1000 scale

Use scores to rank elements by estimated constraint

lod: 882; Score: 694
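
The lod definition above is a difference of log probabilities under two models. The Python sketch below shows that arithmetic with a deliberately simplified stand-in: a binomial model in which each aligned site shows a substitution with probability `p_conserved` or `p_neutral`. PhastCons itself uses a phylogenetic hidden Markov model, so this illustrates only the log-odds idea and is not its implementation; the function name, parameters and example rates are all assumptions.

```python
import math


def log_odds(n_sites, n_subs, p_conserved, p_neutral):
    """Log-odds of a segment under a toy 'conserved' vs 'neutral' binomial model.

    n_sites: number of aligned sites in the segment.
    n_subs: number of those sites showing a substitution.
    The binomial coefficient is the same under both models and cancels.
    """
    def log_lik(p):
        return n_subs * math.log(p) + (n_sites - n_subs) * math.log(1.0 - p)

    return log_lik(p_conserved) - log_lik(p_neutral)
```

A segment with few substitutions gets a positive value (the conserved model explains it better) and a segment with many gets a negative one, so sorting segments by this value ranks them by estimated constraint in the same spirit as the slide.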

Page 26:

PhastCons elements estimated at 3 phylogenetic scopes

Primate

Placental

Vertebrate

Page 27:

Level of conservation decays with increasing evolutionary distance

Page 28:

PhyloP: measuring basewise conservation

PhyloP scores

• Scores are calculated independently for each base
• Scores are -log P values under hypothesis of neutral evolution
• Positive scores = constraint
• Negative scores = acceleration

Page 29:

Per-site phyloP conservation scores

4.49 1.77 -0.96

Use PhastCons to identify conserved elements
Use phyloP to evaluate individual sites within elements
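
To read per-site scores such as 4.49, 1.77 and -0.96, it can help to convert them back to p-values. The Python sketch below assumes the logarithm is base 10 (the slides say only "-log P") and that the sign encodes direction, as stated on the earlier PhyloP slide. The function name `interpret_phylop` is hypothetical.

```python
def interpret_phylop(score):
    """Convert a phyloP-style score to (p_value, direction).

    Assumes score = +/- -log10(p): the magnitude gives the p-value under the
    neutral hypothesis, and the sign gives the direction of the departure.
    """
    p_value = 10.0 ** (-abs(score))
    if score > 0:
        direction = "constraint"
    elif score < 0:
        direction = "acceleration"
    else:
        direction = "neutral"
    return p_value, direction
```

Under that assumption a score of 4.49 corresponds to a p-value on the order of 1e-5 in the direction of constraint, while -0.96 is a weak departure (p around 0.1) in the direction of acceleration.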

Page 30:

Accessing conservation data

Page 31:

Multiple genome alignments and conservation metrics are calculated independently for each reference genome

Orthologous region in mouse:

30-way multiz alignment

Page 32:

Conservation identifies critical binding sites in regulatory elements

Figure labels: Regulatory info (ENCODE); Conservation

Important binding sites and variants that affect function will be here