Top Banner
Comparative Genomics Sridhar Hannenhalli [email protected] "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky, 1932
73

Comparative Genomics Sridhar Hannenhalli [email protected] "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Jan 02, 2016

Download

Documents

Aileen Taylor
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Comparative Genomics

Sridhar [email protected]

"Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky, 1932

Page 2: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

A comparative investigation of functional elements ( genes, regulatory elements) and biological networks across multiple species, or evolutionarily related systems.

What is comparative genomics

Page 3: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

The basic premise of comparative genomics is that

Living things are related by evolution

The essential properties of biological entities and processes are likely to be conserved across similar organisms

The organism-specific genomic characteristics might reveal genetic bases for phenotypic differences between organisms.

Page 4: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Contexts of comparative genomics

Protein family Gene sequence Structure

Non-coding regions

Gene networks

Gene order

Page 5: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Some of the questions that comparative genomic may help address

What genomic regions are likely to be serve an essential biological role?

What genomic regions might underlie species-specific phenotypic differences?

What are the core set of proteins essential to a group species?

Despite sequences changes, are genetic interaction conserved?

Which genes or functional elements might have evolved adaptively?

Page 6: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Different Questions Require Different Comparisons

From: Hardison. Plos Biology. Vol 1 (2): 156-160

Page 7: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

A General Framework for Comparative Genomics

• Comparing a biological entity or process between individuals or species

• Why do we want to compare?

• How do we compare? (actual data, or model?)

• How do we assess the significance of comparison score?

• Both similarity and non-similarity can be of interest

Page 8: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Overview of topics

Gene identification in yeastKellis et al. Kellis et al. NatureNature 423, 241-254 (2003). 423, 241-254 (2003).

SelectionClark, A. G. (2006). "Genomics of the evolutionary process." Clark, A. G. (2006). "Genomics of the evolutionary process." Trends Ecol Evol Trends Ecol Evol 2121(6): 316-321.(6): 316-321.

KaKs test for selection in protein coding genes Tree-HMM to identify conserved regions Selection on epigenomic properties

Phylogenetic profilesPellegrini et al. PNAS Pellegrini et al. PNAS 9696, 4285-4288 (1999), 4285-4288 (1999)

Network alignmentKelley et al. (2003). "Conserved pathways within bacteria and yeast as revealed by global protein Kelley et al. (2003). "Conserved pathways within bacteria and yeast as revealed by global protein network alignment." network alignment." PNAS PNAS 100100(20): 11394-11399.(20): 11394-11399.

Page 9: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Sequencing and comparison of yeast species to identify genes and regulatory elements.

Kellis et al. Nature 423, 241-254 (2003).

Page 10: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

CATGTTTCCACTTACAGATGCTTCAAAAAGAGTGTTTCATAACTGCTCTATGAAAAGGAATGTTCAACTCTGTGAGTTAAATAAAAGCATCAAAAAAAAGTTTCTGAGAATGCTTCTGTCTAGTTTTTATGTGAAGATATTTCCATTTTCTCTATAAGCCTCAAAGCTGTCCAAATGTCCACTTGCAGATACTACAAAAAGAGTGTTTCAAAAGTGCTCAATGAAAAGGAATGTTCAGCTCTGTGAGTTAAATGCAAACATCACAAATAAGTTTCTGAGAATGCTTCTGTCTAGTTTTTATGGGAAGATAATTCCGTGTCCAGCGAAGGCTTCAAAGCTTTCAAAATATCCACTTGCAAATTCTACAAAAAGAGTGTTTCAAAGCTGCTTTATCAAAAGAAAGTTTCAACTCTGTGAGTTGAATGTGCACATCACAAAGAAGTTTCTGAGAATGCCTTCAGTCTGGTTTTTATGTGAAGATATTCCCTTTTCCAACGAAAGCCTCGAAGCTGTCCAAATATCCACTTGTAAGTGCTGCAAAAAGAGTGTTTCAAAACTGCTACAGCAAAAGAAAGGTTTATCTCTGTGAGTTGAGTAGACACATCAAGAAGAAATTTCTGAGAATGCTTCTGTCTAGTTTTTATGTGAAGATATTTCCTTTGTCACCATAGGCCTCCAAGCCCTCCAAATGTCCACTTGCAGATGCTACAAAAAGAGTGTTTCAAAACTGCTGTATGAAAAGAAATGCTCAAATCTGTGAGATAAATGCATACATCACAAAGAAGTCTTTGAGAATGCTTCTGTCTAGTTTTTATGTTAAGATATTTCCTATTTCACCATACGTCTCAACGCACACAAAATGTACACTTGCAGATGCTACAAAGAGAGTGTTTCAAAACTTGTAGATCAAAACAAGTGTTCAACTTTGTGAGTTGAGGACACACATCTGAAAGAAGTTTCTGAGAATGCTTCTGTCTAGTTTTTATGTGAAGATATTCCCGTTTCCAGCGAAAGCCCCAAAACTATCCAAATATCCACTTGCACATTCTACAAAAAGAGTGTTTCAAATCTGCTCTATCAAAATAAAGGTTCAACTCTGTGAGTTGACTACACACATCACAAAGAAGTTTCTAAGAATGCTTCTGTCTGGTTTTTATGGGAAGATATTTCCTTTTTCAACATAGGCCTTGCAGCATCTACAAAAAGAGTTTTTCAAAACTCCTCTAAGAAAAGGAATGATCAACTCCATGAGTTTAATGCAAAGATCACAAAGAAGTTTCTGAGAATGCTTCTGTCTAGTTTTAACCTGAAGACAGTTCCGTTTCCAGTGAAGGCTTGAAAGCTGTCCAAATATCCACATGCAAATTCTCAAAACGAGTGTTTAAAAGCTGCTCTATCACTAGAAAGTTTCACCTCTGTGAGCTGAATGCACACAGAAGTTTCTGAGAATGCTTCTGTCTGGTTTTTATGTGAAGATATTCCCGTTTCCAACCAAAGCCTCAAAGCTGTCCAAATATCCATTTGCAGATCCTACAGGGAGAGTGTTTCAAAACTGCTCTATAAAAAGAAAGGTTCTACTCTGTGAGTGGAGTACACACATCACAAAGAAGTTTCTGAGAATGCTTCTGTCTAGTTTTTATGTGAAAATAGTTCGTTCTCCAAAATAGTCCTCAAAGCGCTCCAAATGTCCACTTGCAGATTCTACAAAAAGAGTGTTTCAAAACTGCTCTATGAAAAGGAATGTTCAACT

Ab Initio Gene Prediction

….a bit of digression…..

Page 11: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Markov Model

Q1

Q4Q3

Q2

Finite State Set : Q1, Q2, Q3, Q4

Random variable X takes values from the setX(n) : State of X at time n, X(1) = Q2, X(2) = Q1 and so on….

T(i,j) : Transition probability from State i to State j = P(X(n+1) = Qj | X(n) = Qi)

Output generated by the Model is a State-sequence

Q2 Q1 Q4 Q1 Q1 Q2 Q4 Q3

Page 12: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Hidden Markov Model

The sequence generated by each state is governed by a state-specific probability function

Q1

Q4Q3

Q2

Stochastic Process/Prob. function

ATGCCCGC

Q2 Q1 Q4 Q1 Q1 Q2 Q4 Q3

ACGGATGGCGGGCTGATGATGTAGGGGAAT

Page 13: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

A simple HMM to model intronless genes

INTERGENIC START

STOP CODING

p

1-p

q 1-q

11

GeneratesA,C,G or T

Generatesa non-stop codon

GeneratesATG or GTG

Generatesa stop codon

ATG {a sequence of codons except stop codons} TAAOpen Reading Frame or ORF

Page 14: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Consider a stochastic process where codons are chosen at random among the 64 possibilities

Success is defined as choosing a stop codon => p = 3/64

Define random variable L = number of trial until first success

P(L = u) = (1-p)u-1p (Geometric distribution)

P(L > u) = (1-p)u-1

E(L) = 1/p ~ 20 Var(L) = (1-p)/p2

P(L = 100) = 0.0004

A ‘long’ ORF is highly likely to be a gene

Page 15: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Genome scale alignments

Global Alignment

Local Alignment

Genome Alignment

Establish synteny

Align syntenic regions

Human

Mouse

….back to the ‘comparative’ genomics…..

Page 16: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Main challenges with genome alignment

Computational limitations

Incomplete Data

What is the right parameter – poor evolutionary model

Lack of gold standards

Page 17: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Genome Alignment procedures can be generalized as a hierarchical process

1. Compute anchors (high scoring local alignments)

2. Chain/Link the anchors

3. Refine alignment (e.g. Needleman Wunsch)

Page 18: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

AVID Bray et al., 2003

1. Find all exact matches

2. Filter matches(keep the matches at least half as long as the longest match )

3. Compute optimal chain of matches (Needleman-Wunsch-style)

4. Within each consecutive match pair in the optimal chaina) Back to step 2 and repeat ORb) Apply Needelman-Wunsch

Page 19: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

The LAGAN algorithm (Brudno et al. 2003)

Get local alignments

Optimal Chaining

Gap filling using Needleman-Wunsch

Page 20: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Yeast genome alignment

SC

SB

Identify ORFs, BLAST all versus all

Remove all edges with scores less than 80% of the node max

SC

SB

Remove edges in conflict with synteny

SC

SB

Remove edges not preferred by either node

SC

SB

Page 21: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

All but 211 of the 6,235 SC ORFs were assigned unambiguously

These 211 genes fall in 32 clusters, almost all of which are in telomere

Accuracy of orthology mapping

Page 22: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Gene order differences

Page 23: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Gene order differences between Human and Mouse

Page 24: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

All 20 inversions are flanked by tRNA genes in opposite orientation

7 reciprocal translocations have Ty elements and 3 have pairs of highly similar ribosomal protein genes

A majority of rearrangements are in telomeric regions

Page 25: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

SC

SP

SM

SB

Q: Is this surprisingly conserved?

A: Well it depends (on what?)

Page 26: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

ORF-based tree

Intron-based tree

Distance-based tree reconstruction methods:

UPGMANeighbor-Joining

Page 27: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Species specific calibration

Relative to S. Cerevisiae percent identity inParadoxus 90% coding 80% non-codingMikatae 84% coding 62% non-codingBayanus 80% coding 62% non-coding

Coding versus non-coding discrimination

In 4-way alignmentMutated sites: coding 30% non-coding 58%Indels: coding 1.3% non-coding 14%1,2 indels: coding 0.14% non-coding 10.2%

Page 28: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

An unlikely ORF

Page 29: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Reading Frame Conservation (RFC) Test

Label ref species bases 1,2,3,1,2,3…

Label other species 1,2,3 then 2,3,1.. then 3,2,1.. And use the best

Count the number of aligned positions in frame (same label)

Average of all 100 bps window with 50 jump

Page 30: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Use RFC score of 80% ,75% and 70% as cutoff for the three species. Each species votes +1, 0 (absent) and -1. Sum of votes >= 1 qualifies an ORF as a gene

Negative control

96% percent of the windows were rejected in introns. 3% do contain an ORF based on manual inspection => 1% false positive

Summary of gene list update

~400 genes removed~50 novel short genes

What about the genes with introns?

RFC 80

Page 31: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

In summary

Present an effective way to identify new protein coding genes and correct old annotations

Identify gene order differences between species, and present some evidence for mechanism of inversion through non-homologous recombination

Identify specific genes that are fast evolving or extremely conserved

Page 32: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Overview of topics

Gene identification in yeast

Selection KaKs test for selection in protein coding genes Tree-HMM to identify conserved regions Selection on epigenomic properties

Phylogenetic profiles

Network alignment

Page 33: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Mutations create variants and are critical for evolution

Deleterious mutations are likely to be purged from the populations (purifying or negative selection)

Advantageous variants tend to increase in frequency and eventually replace other variants (adaptive or positive selection)

Study of selection involves testing whether observed evolutionary changes are more consistent with a neutral evolution or with either a purifying or adaptive selection

Page 34: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Test of Neutral evolution of coding genesTest of Neutral evolution of coding genesThe KaKs testThe KaKs test

Thr Leu His SerACG CTC CAT TCTACG CTT CAA AGT Thr Leu Gln Ser

No change

Synonymous substitution

Non-Synonymous substitution

Multiple Non-Synonymous substitution

If Non-synonymous mutation rate > synonymous substitution rate it implies positive selection

If Non-synonymous mutation rate < synonymous substitution rate it implies purifying selection

If Non-synonymous mutation rate = synonymous substitution rate it implies neutral evolution

Page 35: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

HisC A T

Non-synomymous

Non-synomymous

1/3 Synonymous2/3 Non-synomymous

S = 0 + 0 + 1/3 = 1/3N = 1 + 1 + 2/3 = 8/3S + N = 9/3 = 3 (total number of nucleotides)

Number of Synonymous and non-synonymous sites

IleA T T

Non-synomymous

Non-synomymous

2/3 Synonymous1/3 Non-synomymous

S = 0 + 0 + 2/3 = 2/3N = 1 + 1 + 1/3 = 7/3S + N = 9/3 = 3 (total number of nucleotides)

CAT S = (1/3 + 2/3)/2 = 1/2ATT N = (8/3 + 7/3)/2 = 5/2 S + N = 3 (total number of nucleotides)

The number of syn and non-synonymous sites are averaged for the two aligned codons

Page 36: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,
Page 37: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

TyrTAT

SerTCT

SerTCC

1 non-syn change

2 non-syn changes??

SerTCT

TyrTAC

Both paths involve 1 syn substitutionAnd 1 non-syn substitution. This averages out to be 1 syn and 1 non-syn substitutions

TyrTAC

1 syn change

Number of Synonymous and non-synonymous mutations

TAT TAT TATTAC TCT TCC

Codon Sd Nd total1 1 0 12 0 1 13 1 1 2

Sd + Nd equals the number of mismatches between the codon pair

Page 38: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

LeuCTA

IleGTT

2 non-syn changes??

ValGTA

LeuCTT

First path involves 2 non-syn substitutionsSecond path involves 1 syn and 1 non-synonymous substitutionsThis averages out to be 0.5 syn and 1.5 non-syn, for a total of 2 substitutions.

Number of Synonymous and non-synonymous mutations

CTAGTT

Codon Sd Nd total1 0.5 1.5 2

To be fixed

Page 39: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

KKS S is the relative rate of synonymous mutations per synonymous siteis the relative rate of synonymous mutations per synonymous site

KKA A is the relative rate of nonsynonymous mutations per non-is the relative rate of nonsynonymous mutations per non-synonymous sitesynonymous site

= K= KAA/K/KSSo If If = 1, neutral evolution = 1, neutral evolutiono If If < 1, purifying selection < 1, purifying selectiono If If > 1, positive Darwinian selection > 1, positive Darwinian selection

A log-likelihood measure is used to asses the significance of A log-likelihood measure is used to asses the significance of

Often, different sections of a gene will have different KOften, different sections of a gene will have different KAA/K/KSS

Detecting genes with unusual values for KDetecting genes with unusual values for KAA & K & KS S often leads to often leads to interesting discoveriesinteresting discoveries

Page 40: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

A scan for positively selected genes in the genomes of humans and chimpanzees.

Nielsen et al. 2005

13,731 human-chimp orthologs were analyzed for evidence of positive selection.

Genes that show the strongest evidence for positive selection are involved in tumor suppression, apoptosis, and spermatogenesis.

Genes with maximal expression in the testis tend to be enriched with positively selected genes.

Genes on the X chromosome also tend to show an elevated tendency for positive selection.

Page 41: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

www.genome.ucsc.edu/cgi-bin/hgGateway

Page 42: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

PhastCons Score

Page 43: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

PhastCons Siepel et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, 1034-1050 (2005)

Regions of high conservation in human genome

Page 44: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Path 2

Path 1

T C G C G A C A T A T A C G AT T G G G G C A T G T G G G TA G C A G A C G T C C G C A A

HumanMouse

Rat

Conserved

Non-conserved

The goal is to determine the state assignments for each site, i.e., the path, that maximizes the likelihood of observing the data.

TD

S1Si Sn

)...,|( 1...1

nss

n ssTDLMAXLn

Page 45: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

S1Si Sn

n

iiiiin sTDsssTDsssTDL

211111 ),|Pr(*)Pr(),|Pr()Pr()...,|(

T C G C G A C A T A T A C G AT T G G G G C A T G T G G G TA G C A G A C G T C C G C A A

DiD1Dn

Data

States

T

Page 46: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

S1Si Sn

Si-1

)...,|( 1...1

ii

ss

i ssTDLL MAXi

),...,|( 11...

,

11

SsssTDLL iii

ss

Si MAXi

),( ,, NiCii LLMAXL Conserved = C, Non-conserved = N

),|Pr()Pr(),Pr( ,1,1, STDSNLSCLMAXL iNiCiSi

),|Pr()Pr( 1,1 STDSL S

Page 47: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

A C C

Y={A,C,G,T}

X={A,C,G,T}

AA CC GG TT

AA

CC PPCGCG

GG

TT

Substitution Matrix Q (kimura, HKY model etc.)

t1

t2

t3 t4

This can be computed efficiently using the pruning algorithm (dynamic programming)

Page 48: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

A C C

t1

t2

t3 t4

A C C

rt1

rt2

rt3 rt4

Non-conserved State Conserved State

The tree representing the conserved state is a scaled version of the non-conserved tree. Scaling factor – r, is the unknown parameter to be estimated.

T C G C G A C A T A T A C G AT T G G G G C A T G T G G G TA G C A G A C G T C C G C A A

Page 49: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,
Page 50: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Parameters

States s1 …. Sn

State frequencies

State transition probabilities

Conservation scaling factor – r

Branch lengths for un-conserved tree model

Parameters can be estimated using an Expectation-Maximization (EM) method

Page 51: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Genomic regions undergoing species-specific accelerated evolution

hum

an

chim

p

rhes

us

mou

se

rat

chic

ken

hum

an

chim

p

rhes

us

mou

se

rat

chic

ken

compare with

Page 52: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Identification of human-specific substitutions contributing to the gain of function in HACNS1. (Prabhakar et al, Science 2008)

Genomic regions undergoing species-specific accelerated evolution

Page 53: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Selection on nucleosome positioning(Babbit and Kim, MBE 2008)

S. cerevisiae

S. mikatae

NA

NB

= NA - NB

Is significantly greater (adaptive evolution) or smaller (purifying selection) than the expectation?

Simulate neutral evolution to estimate the range of expected

Page 54: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

CYC1 gene

Page 55: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Evidence of Abundant Purifying Selection in Humans for Recently Acquired Regulatory Functions

Ward and Kellis, Science Sep 2012

• A broad range of transcribed and regulatory nonconserved elements show decreased human diversity, suggesting lineage-specific purifying selection.

• Regulatory elements under human constraint in nonconserved regions were found near color vision and nerve-growth genes, consistent with purifying selection for recently evolved functions.

• Conversely, conserved elements lacking activity show increased human diversity, suggesting that some recently became nonfunctional.

Page 56: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Divergent Whole-Genome Methylation Maps of Human and Chimpanzee Brains Reveal Epigenetic Basis of Human Regulatory Evolution

Zeng et al. AJHG, Aug 2012

• Nucleotide-resolution whole-genome methylation maps of the prefrontal cortex of multiple humans and chimpanzees.

• Levels and patterns of DNA methylation vary across individuals within species according to the age and the sex of the individuals.

• Extensive species-level divergence in patterns of DNA methylation and that hundreds of genes exhibit significantly lower levels of promoter methylation in the human brain. A strong relationship between differential methylation and gene expression.

• Differentially methylated genes enriched with loci associated with neurological disorders, psychological disorders, and cancers.

Page 57: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Overview of topics

Gene identification in yeast

Selection KaKs test for selection in protein coding genes Tree-HMM to identify conserved regions Selection on epigenomic properties

Phylogenetic profiles

Network alignment

Page 58: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Predicting interactions using phylogenetic profilePredicting interactions using phylogenetic profile

Pellegrini et al. PNAS 96, 4285-4288 (1999)

Pellegrini et al. PNAS 96, 4285-4288 (1999)

Page 59: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Cokus et al BMC Bioinformatics (2007)

Putting ‘Phylogeny’ in the ‘Phylogenetic profile’

Page 60: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Barker and Pagel, PLoS Comput Biol. (2005)

Page 61: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Comparison of tree likelihood for the co-evolution model (eg. and the independent evolution model (eg. q12 = q34) provides evidence for co-evolution

Continuous-time Markov model for co-evolution

Page 62: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Overview of topics

Gene identification in yeast

Selection KaKs test for selection in protein coding genes Tree-HMM to identify conserved regions Selection on epigenomic properties

Phylogenetic profiles

Network alignment

Page 63: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Sequence Network

Global Alignment

Local Alignment

Page 64: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

InterologInterolog is an is an InterInteraction conserved among action conserved among orthoorthologlog pairs pairs

X

y y

x

X

y

x2

y

x1

If orthology could be unambiguously ascertained then the network alignment would not be a problem.

The term ‘Functional orthology’ is sometime used to express this ambiguity

Page 65: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Often the interaction is ‘transferred’ from one species to other based on the ‘orthology’ of two genes.

While at the same time, the conserved interaction can be used to strengthen the ‘orthology’ inference

Likely to be orthologous

Orhtology of nodes and orthology of edges reinforce each other

Page 66: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

PathBLAST PathBLAST (Kelly et al 2003)(Kelly et al 2003)

Goal: Given two protein networks, identify common (orthologous) pathways between them

Approach: Convert the network pair into a ‘Network Graph’ and identify high-scoring paths in this graph using Dynamic Programming

Page 67: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Each node in G is a composite “A/a” where A in N1 and a in N2 are homologs.

A “Direct” edge exists between “A/a” and “B/b” if protein-protein interactions (A,B) and (a,b) are present in N1 and N2

A “N1-gap” edge exists between “A/a” and “B/b” if (A,B) is present in N1, and the distance between a and b in N2 is 2

A “N2-gap” edge exists between “A/a” and “B/b” if (a,b) is present in N2, and the distance between A and B in N1 is 2

A “Mismatch” edge exists between “A/a” and “B/b” if A,B and a,b are connected at distance 2 in both N1 and N2

From a pair of networks (N1, N2) to a Network Graph G

Page 68: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Scoring a path in Network Graph G

)(

)()|()|()(

v

vv Ep

HpHEpEHpvp

Pe randomPv random q

eq

p

vpPS

)(log

)(log)( 1010

COG is used for True homology e-value distribution and p(H)

ei

ieq )Pr()( Empirically based on number of studies supporting an interaction

Page 69: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

Given a node and edge-weighted Directed Acyclic Network graph, high scoring paths can be efficiently computed in linear time using Dynamic Programming technique.

PathBLAST eliminates cycles by a heuristic edge-removal PathBLAST eliminates cycles by a heuristic edge-removal procedure and then performs Dynamic programming on the procedure and then performs Dynamic programming on the resulting Direct Acyclic Graph.resulting Direct Acyclic Graph.

Significance of high-scoring paths is estimated using Significance of high-scoring paths is estimated using randomized network graphs.randomized network graphs.

Collapse overlapping paths into network.Collapse overlapping paths into network.

Page 70: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,
Page 71: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

A generalization of PathBLAST - MaWISH (koyuturk et al. 2005)

G H

u1 u2

u3 u4

v1v2

v3

v4

{u1,v1} {u4,v2}

{u4,v4} {u2,v1}

{u3,v3}

x

-y

-y

x-z x : Matchy : Interaction loss/gain z : Gene loss/gain

Network Pair

Alignment Graph

Page 72: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

A conserved module between two networks will correspond to a subgraph containing many high scoring edges.

Any procedure for dense subgraph computation provides a heuristic for conserved module detection.

Koyuturk et al. use a greedy procedure starting with a node with maximum overall edges weight and greedy extension.

Page 73: Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky,

hypotheticalancestral

module

descendants

Evolutionary considerations in network alignmentEvolutionary considerations in network alignment

Duplication

Deletion Deletion