Comparative Genome Maps CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu

Dec 21, 2015

Page 1: Comparative Genome Maps CSCI 7000-005: Computational Genomics Debra Goldberg debg@hms.harvard.edu.

Comparative Genome Maps

CSCI 7000-005: Computational Genomics

Debra Goldberg

debg@hms.harvard.edu

Page 2

What is a comparative map?

Page 3

Why construct comparative maps?

Identify & isolate genes
• Crops: drought resistance, yield, nutrition, ...
• Human: disease genes, drug response, ...

Infer ancestral relationships

Discover principles of evolution
• Chromosome
• Gene family

“key to understanding the human genome”

Page 4

Why automate?

Time consuming, laborious
• Needs to be redone frequently

Codify a common set of principles

Nadeau and Sankoff: warn of “arbitrary nature of comparative map construction”

Page 5

Definitions

Marker: identifiable chromosomal locus

Homology: genes with a common ancestor

Homeology: chromosomal regions derived from a common ancestral linkage group

Synteny: loci on the same chromosome

Colinearity: syntenic regions with conserved gene order

Page 6

Input/Output

Input:
• genetic maps of 2 species
• marker/gene correspondences (homologs)

Output:
• a comparative map
• homeologies identified
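
As a rough illustration of the input, each map entry can be held as a marker name plus the base-species locations of its homologs. The sketch below parses entries written in the "name (location)" style used on these slides. The function name `parse_marker` and the tuple layout are assumptions made for illustration, not part of the original method.

```python
import re

def parse_marker(entry):
    """Split 'name (loc1 loc2 ...)' into (name, (loc1, loc2, ...))."""
    match = re.fullmatch(r"\s*(\S+)\s*\(([^)]*)\)\s*", entry)
    if match is None:
        raise ValueError(f"unrecognized marker entry: {entry!r}")
    name, locations = match.groups()
    return name, tuple(locations.split())

# A target map is then an ordered list of parsed entries.
target_map = [parse_marker(e) for e in ["pds1 (3S)", "rz742a (2S)", "isu081b (3S 10L)"]]
```

A marker with homologs in two places, such as isu081b, simply carries two locations in its tuple.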

Page 7

Map construction

[Figure: Maize 1 (target) markers, each annotated with the rice (base) location of its homolog, shown beside the resulting comparative map with rice segments 3S, 8L, 10L, 3L. Source: Wilson et al. Genetics 1999.]

Maize 1 markers in map order, with rice location in parentheses:

pds1 (3S), rz742a (2S), rz103b (2L), cdo1387b (3S), isu040 (3), rz574 (3S), cdo38a (7L), cdo938a (3S), rz585a (3S), rz672a (3S), isu081b (3S 10L), rz323a (8L), cdo344c (12L), rz296a (5L), bcd734b (3S), rz500 (10L), rz421 (10L), isu74 (3S), cdo464a (8L), isu73 (3S), cdo475b (6S), cdo595 (8L), cdo116 (8L), rz28a (8L), cdo99 (8L), rz698a (9L), bcd207a (10L), cdo94b (10L), bcd386a (10L), isu78 (5L), csu77 (10L), cdo98b (10L), rz630e (3L), rz403 (3L), cdo795a (3L), bcd1072c (5C), isu92b (3L), cdo122a (3L), rz912a (3L), bcd808a (11S), cdo246 (3L), adh1 (11S), cdo353b (3L), isu106a (3L), phi1 (3L)

Go from this (the marker list) to this (the labeled segments 3S, 8L, 10L, 3L)

Page 8

Chromosome labeling

[Figure: the Maize 1 (target) marker list from Page 7 beside the rice (base) chromosome labels 3S, 8L, 10L, 3L. Source: Wilson et al. Genetics 1999.]

Page 9

A natural model?

[Figure: the Maize 1 (target) marker list from Page 7 beside the rice (base) chromosome labels 3S, 8L, 10L, 3L. Source: Wilson et al. Genetics 1999.]

Page 10

Scoring

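[Figure: a stretch of the Maize 1 map labeled 10L, then 3L. s marks the penalty at the boundary between the two segments; m marks the penalty for a marker whose homolog lies outside the segment's label.]

Markers shown: bcd207a (10L), cdo94b (10L), bcd386a (10L), isu78 (5L), csu77 (10L), cdo98b (10L), rz630e (3L), rz403 (3L), cdo795a (3L), isu92b (3L)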
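
The scoring idea can be sketched in a few lines: charge m for every marker whose homolog location differs from its label, and s for every change of label between adjacent markers. The function name `linear_score` and the default values m = 1 and s = 2 are assumptions for illustration; this slide does not state numeric values.

```python
def linear_score(locations, labels, m=1, s=2):
    """Penalty of a labeling: m per mismatched marker, s per label change."""
    if len(locations) != len(labels):
        raise ValueError("one label per marker is required")
    mismatches = sum(1 for loc, lab in zip(locations, labels) if loc != lab)
    boundaries = sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)
    return m * mismatches + s * boundaries

# Homolog locations of the ten markers listed on this slide, in map order.
locations = ["10L", "10L", "10L", "5L", "10L", "10L", "3L", "3L", "3L", "3L"]
labels = ["10L"] * 6 + ["3L"] * 4
score = linear_score(locations, labels)  # one mismatch (isu78) plus one boundary
```

With the assumed values, this labeling pays one m and one s.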

Page 11

Assumptions

Accept published marker order

All linkage groups of base are unique

Simplistic homeology criteria

At least one homeologous region

Page 12

A natural model?

Page 13

A natural model?

Page 14

A natural model?

Page 15

A natural model?

Page 16

Dynamic programming

$l_i$ = location of homolog to marker $i$

$S[i,a]$ = penalty (score) for an optimal labeling of the submap from marker $i$ to the end, when labeling begins with label $a$

[Diagram: markers 1 ... i ... n, with label a at marker i.]

Page 17

Recurrence relation

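$S[n,a] = m\,\delta(a, l_n)$

$S[i,a] = m\,\delta(a, l_i) + \min_{b \in L} \bigl( S[i+1,b] + s\,\delta(a,b) \bigr)$

where $\delta(x,y)$ is 0 if $x = y$ and 1 otherwise, and $L$ is the set of labels.

[Diagram: markers ... i, i+1 ... n with homolog locations $l_i$, $l_{i+1}$, ..., $l_n$; marker i is labeled a and marker i+1 is labeled b.]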
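
A minimal sketch of this recurrence, filled in from the last marker back to the first, with a traceback to recover one optimal labeling. The name `linear_labeling`, the traceback table, and the default values m = 1 and s = 2 are assumptions for illustration; each marker is assumed to have exactly one homolog location and the map is assumed non-empty.

```python
def linear_labeling(locations, labels, m=1, s=2):
    """Return (penalty, labeling) minimizing mismatch and boundary penalties."""
    n = len(locations)
    delta = lambda x, y: 0 if x == y else 1
    # S[i][a]: best penalty for markers i..n-1 when marker i is labeled a.
    S = [dict() for _ in range(n)]
    nxt = [dict() for _ in range(n)]  # best label for marker i+1, for traceback
    for a in labels:
        S[n - 1][a] = m * delta(a, locations[n - 1])
    for i in range(n - 2, -1, -1):
        for a in labels:
            b = min(labels, key=lambda b: S[i + 1][b] + s * delta(a, b))
            S[i][a] = m * delta(a, locations[i]) + S[i + 1][b] + s * delta(a, b)
            nxt[i][a] = b
    # The first label is free to choose; follow the stored choices from there.
    a = min(labels, key=lambda a: S[0][a])
    total, path = S[0][a], [a]
    for i in range(n - 1):
        a = nxt[i][a]
        path.append(a)
    return total, path
```

For example, `linear_labeling("aaabbbccc", "abc")` labels the three blocks a, b, c and pays two boundary penalties. The table has one entry per marker and label, and each entry scans all labels once.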

Page 18

Problem with linear model

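s = 2 (and m = 1, as implied by the scores below)

a-b-c motif:
• markers: a a a b b b c c c
• labels: a, b, c
• score: 2s = 4

a-b-a motif:
• markers: a a a b b b a a a
• labels: a
• score: 3m = 3

In the a-b-a case the linear model prefers a single segment a with three mismatches (3) over the segments a, b, a (2s = 4), so the inserted b segment is lost.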
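
The arithmetic for the a-b-a motif can be checked directly by scoring both candidate labelings under the linear model. The helper name `linear_score` is hypothetical, and m = 1 is inferred from "3m = 3" on the slide.

```python
def linear_score(locations, labels, m, s):
    """Linear-model penalty: m per mismatched marker, s per label change."""
    mismatches = sum(1 for loc, lab in zip(locations, labels) if loc != lab)
    boundaries = sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)
    return m * mismatches + s * boundaries

motif = list("aaabbbaaa")
faithful = linear_score(motif, list("aaabbbaaa"), m=1, s=2)   # segments a, b, a
collapsed = linear_score(motif, list("aaaaaaaaa"), m=1, s=2)  # single segment a
```

Because returning to a costs a second boundary penalty, the collapsed labeling scores lower than the faithful one.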

Page 19

The stack model

Segment at top of the stack can be:
• pushed (remembered), later popped
• replaced

Push and replace cost s; pop is free.

[Figure: stack diagrams with segments labeled a through f.]

Page 20

Scoring

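[Figure: a stretch of the target map labeled 7L, 9L, 7L. s is charged once, where the 9L segment begins; the return to 7L is a "free" pop; m is charged for each of three mismatched markers.]

Markers shown: uaz265a (7L), isu136 (2L), isu151 (7L), rz509b (7L), cdo59c (7L), rz698c (9L), bcd1087a (9L), rz206b (9L), bcd1088c (9L), csu40 (3S), cdo786a (9L), csu154 (7L), isu113a (7L), csu17 (7L), cdo337 (3L), rz530a (7L)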

Page 21

Dynamic programming

$S[i,j,a]$ = score for an optimal labeling of:
• submap from marker $i$ to marker $j$
• when labeling begins with label $a$, i.e., marker $i$ is labeled $a$

[Diagram: markers 1 ... i ... j ... n, with label a at marker i.]

Page 22

Recurrence relation

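$S[i,i,a] = m\,\delta(a, l_i)$

$S[i,j,a] = \min \begin{cases} m\,\delta(a, l_i) + \min_{b \in L} \bigl( S[i+1,j,b] + s\,\delta(a,b) \bigr) \\ \min_{i<k<j} \bigl( S[i,k,a] + S[k+1,j,a] \bigr) \end{cases}$

($\delta$ is 0 on a match and 1 otherwise.)

[Diagrams: first case, marker i labeled a followed by marker i+1 labeled b; second case, markers i and k+1 both labeled a, so the label a is resumed at k+1 at no cost.]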
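
The stack recurrence can be sketched with memoized recursion. The function name `stack_score` and the default values m = 1 and s = 2 are assumptions for illustration; the map is assumed non-empty, each marker is assumed to have a single homolog location, and only the optimal score is returned, not the labeling.

```python
from functools import lru_cache

def stack_score(locations, labels, m=1, s=2):
    """Optimal stack-model penalty for the whole map."""
    n = len(locations)

    def delta(x, y):
        return 0 if x == y else 1

    @lru_cache(maxsize=None)
    def S(i, j, a):
        # Base case: a single marker labeled a.
        if i == j:
            return m * delta(a, locations[i])
        # Case 1: label marker i with a, then continue with some label b.
        best = m * delta(a, locations[i]) + min(
            S(i + 1, j, b) + s * delta(a, b) for b in labels
        )
        # Case 2: split so that label a is resumed at k+1 for free (a pop).
        for k in range(i + 1, j):
            best = min(best, S(i, k, a) + S(k + 1, j, a))
        return best

    return min(S(0, n - 1, a) for a in labels)
```

On the a-b-a motif this pays a single s for the b segment, where the linear model would pay three mismatches. The table has one entry per (i, j, a) triple, so the cost grows roughly cubically in the number of markers.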

Page 23

Results: infers evolutionary events

Maize 1 (target)

Rice (base)

Wilson et al.

Stack

Page 24

Problem: Incomplete input

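Gene order not always fully resolved.

Co-located genes can be ordered to give the most parsimonious labeling.

[Figure: ten genes mapped to the same position, 33.0, listed as published and then reordered so that genes with homologs in the same region are adjacent, giving segments 8p and 19p.]

As published (all at 33.0): Atp6b1 (8p), Comp (19), Jak3 (19p), Jund1 (19p), Lpl (8p), Mel (19p), Npy1r (4q), Pde4c (19), Slc18a1 (8p), Srebf1 (17p)

Reordered: Atp6b1 (8p), Lpl (8p), Slc18a1 (8p), Npy1r (4q), Srebf1 (17p), Comp (19), Jak3 (19p), Jund1 (19p), Mel (19p), Pde4c (19)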

Page 25

The reordering algorithm

Uses a compression scheme
• Within a megalocus, group genes by location of related gene.
• Order these groups
• First, last groups interact with nearby genes
• Any ordering of internal groups is equally parsimonious
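
The grouping step can be sketched as follows, using the megalocus at position 33.0 from Page 24 as data. The function name `group_by_location` is hypothetical, and the sketch covers only the grouping, not the ordering of the groups.

```python
from collections import defaultdict

def group_by_location(genes):
    """Group co-located genes by the location of their related gene."""
    groups = defaultdict(list)
    for name, location in genes:
        groups[location].append(name)
    return dict(groups)

megalocus = [
    ("Atp6b1", "8p"), ("Comp", "19"), ("Jak3", "19p"), ("Jund1", "19p"),
    ("Lpl", "8p"), ("Mel", "19p"), ("Npy1r", "4q"), ("Pde4c", "19"),
    ("Slc18a1", "8p"), ("Srebf1", "17p"),
]
groups = group_by_location(megalocus)
```

This simple version keeps "19" and "19p" as separate groups; merging a whole-chromosome location with an arm of that chromosome would need an extra rule.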

Page 26

The reordering algorithm

Page 27

The reordering algorithm

Page 28

Definitions

$\delta$ extended to distance to a set $A$ of labels:

$\delta(a, A) = 0$ if $a \in A$, 1 otherwise

$S$ = the set of indices of supernode start elements

For simplicity, refer to a supernode by its index $i \in S$

Page 29

Definitions

For $i \in S$:

$n_i$ = # markers in $i$

$n_i(a)$ = # markers in $i$ with a homolog on $a$

$l_i$ = set of labels matching markers in $i$

• $l_i = \{a \in L \mid n_i(a) \ge 1\}$
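
These quantities are simple counts over the markers in one supernode. A minimal sketch, assuming each marker contributes exactly one homolog location; the function name `supernode_counts` and the return layout are hypothetical.

```python
from collections import Counter

def supernode_counts(homolog_locations):
    """Return (n_i, n_i(a) as a Counter, l_i) for one supernode."""
    counts = Counter(homolog_locations)            # n_i(a) for each label a
    n_i = sum(counts.values())                     # total markers in the supernode
    l_i = {a for a, k in counts.items() if k >= 1} # labels matched by some marker
    return n_i, counts, l_i

n_i, n_i_of, l_i = supernode_counts(["8p", "19p", "19p", "8p", "4q"])
```

A `Counter` returns 0 for labels that match no marker, which agrees with such labels being left out of the label set.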

Page 30

Definitions

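$p_i(c)$ gives mismatched marker and segment boundary penalties for label $c$

$p_i(c) = \begin{cases} s & \text{if } m\,n_i(c) \ge s \\ m\,n_i(c) & \text{if } m\,n_i(c) \le s \end{cases}$

that is, $p_i(c) = \min\bigl(s,\ m\,n_i(c)\bigr)$.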
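
Read as "pay whichever is cheaper", this penalty is a single minimum: either treat the markers matching label c as mismatches, or open a segment for them. The comparison symbols are missing from the slide text, so the min reading is an assumption, as are the function name `label_penalty` and the default values m = 1 and s = 2.

```python
def label_penalty(n_i_c, m=1, s=2):
    """Cheaper of n_i(c) mismatch penalties or one segment boundary penalty."""
    return min(s, m * n_i_c)

# One marker is cheaper as a mismatch; three markers are cheaper as a segment.
single = label_penalty(1)
triple = label_penalty(3)
```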

Page 31

Definitions

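$p(i,a,b)$ gives the total mismatched marker and segment boundary penalties attributed to "hidden markers"

$p(i,a,b) = \begin{cases} \sum_{c \ne a,b} p_i(c) + m\,\varepsilon_i(a,b) & \text{for } i \in S,\ a \ne b \\ \sum_{c \ne a} m\,n_i(c) + m\,\varepsilon_i(a,b) & \text{for } i \in S,\ a = b \\ 0 & \text{otherwise} \end{cases}$

[$\varepsilon_i$ stands in for a symbol lost in extraction; it is the correction term described on Page 33.]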

Page 32

Definitions

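For $i \in S$:

$\eta_i(a,b)$ = # labels in $\{a,b\}$ without matching marker in $i$

$\eta_i(a,b) = \delta(a, l_i) + \delta(b, l_i)$

$\eta_i(a,b) \in \{0,1,2\}$

[$\eta_i$ stands in for a symbol lost in extraction.]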
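
The count defined above can be sketched with the set form of the 0/1 distance from Page 28. The function names `delta_set` and `unmatched_count` are hypothetical.

```python
def delta_set(a, label_set):
    """0 if label a is in the set, 1 otherwise."""
    return 0 if a in label_set else 1

def unmatched_count(a, b, l_i):
    """Number of labels among a and b with no matching marker in supernode i."""
    return delta_set(a, l_i) + delta_set(b, l_i)

l_i = {"8p", "19p"}
both_present = unmatched_count("8p", "19p", l_i)
one_missing = unmatched_count("8p", "4q", l_i)
both_missing = unmatched_count("4q", "17p", l_i)
```

When a and b are the same label and it is absent from the supernode, the count is 2, since the label is counted once for each argument.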

Page 33

Definitions

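$\varepsilon_i(a,b)$ corrects for mismatched-marker penalties assigned twice for the same marker: once in the recurrence and once in $p(i,a,b)$

For example:
• $\varepsilon_i(a,b) = 0$ if $\eta_i(a,b) = 0$ (if a, b are both represented in supernode)
• $\varepsilon_i(a,a) = -2$ if $\eta_i(a,a) > 0$ (if a is not represented in supernode)

[$\varepsilon_i$ and $\eta_i$ stand in for symbols lost in extraction; $\eta_i$ is the count of unmatched labels defined on Page 32.]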

Page 34

Recurrence relation

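$S[i,i,a] = m\,\delta(a, l_i)$

$S[i,j,a] = \min \begin{cases} m\,\delta(a, l_i) + \min_{b \in L} \bigl( S[i+1,j,b] + s\,\delta(a,b) + p(i,a,b) \bigr) \\ \min_{i<k<j,\ k \notin S} \bigl( S[i,k,a] + S[k+1,j,a] \bigr) \end{cases}$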

Page 35

Results: Fewer mismatches

stack reordering

Mouse 5 (target)

Human (base)

Page 36

Results: Mismatches placed between segments

stack reordering

Mouse 8 (target)

Human (base)

Page 37

Results: Detects new segments

stack reordering

Mouse 13 (target)

Human (base)

Page 38

Summary

Finds optimal comparative map
• Arranges markers in most parsimonious way

First algorithm to use megalocus data

Fast, objective, simple to use

Biologically meaningful results

Page 39

Summary

Global view

Biologically meaningful results
• Provides testable hypotheses

Robust
• not species-specific
• high/low resolution, genetic/physical maps
• stable to errors in marker order

Page 40

Future Directions

Algorithmic extensions
• 3rd species
• polyploidy
• search for ancient duplications

Deduce history of evolutionary events
• makes genome rearrangement measures tractable and robust
• infer common ancestor

Page 41

Future Directions

Block-segmental sequence comparisons
• non-local sequence alignment
• protein domains

2D block-segmental comparisons
• comparison of regulatory networks
• image processing

Page 42

Acknowledgments

Jon Kleinberg

Susan McCouch

Chris Pelkie

Sandra Harrington

Sam Cartinhour

Dave Schneider

NSF

AAUW

David and Lucile Packard Foundation

USDA Cooperative State Research, Education, and Extension Service

ONR