
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA exon intron intergene Find Gene Structures in DNA Intergene State First Exon State Intron State.

Dec 19, 2015


Find Gene Structures in DNA

GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA

[Figure: the sequence annotated with exon, intron, and intergene regions, generated by the Intergene State, First Exon State, and Intron State of the model]


Hidden Markov Model for Gene Finding

• Intron, Exon, Intergenic states

• Exon frame is encoded in the architecture by defining more states

• Exon states have explicit duration density

• Intron states have geometric duration

• Parameters are trained separately for different levels of GC content (GC content correlates with gene density and with the lengths of exons and introns)
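
The difference between geometric and explicit durations can be made concrete with a small sketch. A plain HMM state that loops back to itself with probability p stays for d steps with probability p^(d-1)(1-p), which is why intron lengths come out geometric, while exon states need an explicit length distribution. The function names and the value of p below are illustrative assumptions, not parameters from the slides.

```python
def geometric_duration_pmf(p_stay: float, d: int) -> float:
    """P(duration = d) for a state whose self-transition probability is p_stay."""
    if d < 1:
        return 0.0
    # Stay d - 1 times, then leave once.
    return (p_stay ** (d - 1)) * (1.0 - p_stay)


def expected_duration(p_stay: float) -> float:
    """Mean of the geometric duration distribution."""
    return 1.0 / (1.0 - p_stay)


if __name__ == "__main__":
    p = 0.9  # hypothetical self-transition probability
    print([round(geometric_duration_pmf(p, d), 4) for d in range(1, 6)])
    print(expected_duration(p))
```

The probability is largest at d = 1 and decays monotonically, so a geometric model cannot represent a length distribution that peaks at an intermediate length.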


Comparison-based Methods


Cross-species gene finding

[Figure: a gene with Exon1, Intron1, Exon2, Intron2, Exon3, drawn 5’ to 3’, above a human/mouse alignment]

[human] GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
[mouse] C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-


Comparison of 1196 orthologous genes (Makalowski et al., 1996)

• Sequence identity between genes in human/mouse
  – exons: 84.6%
  – protein: 85.4%
  – introns: 35%
  – 5’ UTRs: 67%
  – 3’ UTRs: 69%

• 27 proteins were 100% identical.


Human-mouse homology

[Figure: human and mouse]


Not always: HoxA human-mouse


Twinscan

• Twinscan is an augmented version of the Genscan HMM.

[Figure: HMM with exon (E) and intron (I) states, annotated with transitions, duration, and emissions, e.g. ACUAUACAGACAUAUAUCAU]


Twinscan Algorithm

1. Align the two sequences (e.g. from human and mouse)

2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )

New “alphabet”: 4 x 3 = 12 letters

Σ = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
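
Steps 1 and 2 can be sketched as a small encoding function. This is a minimal illustration that assumes the alignment has already been computed and is given as two gapped strings of equal length; the names `ALPHABET` and `conservation_symbols` are hypothetical and not part of Twinscan.

```python
# The 12-letter alphabet: each base paired with a conservation mark.
ALPHABET = [base + mark for base in "ACGU" for mark in "-:|"]


def conservation_symbols(target: str, informant: str) -> list[str]:
    """Annotate each target (e.g. human) base with gap, mismatch, or match.

    Both arguments are rows of one alignment, so they must have equal length;
    '-' denotes a gap.
    """
    if len(target) != len(informant):
        raise ValueError("aligned rows must have the same length")
    symbols = []
    for t, q in zip(target, informant):
        if t == "-":
            continue  # no target base in this column, nothing to annotate
        if q == "-":
            mark = "-"  # target base aligned to a gap
        elif t == q:
            mark = "|"  # match
        else:
            mark = ":"  # mismatch
        symbols.append(t + mark)
    return symbols


if __name__ == "__main__":
    print(len(ALPHABET))
    print(" ".join(conservation_symbols("ACGGCGACGUGCACGU", "ACUGUGACGUGCACUU")))
```

Columns in which the target row has a gap are skipped, because the HMM runs along the target sequence only.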


Twinscan Algorithm

3. Run Viterbi using emissions ek(b), where b ∈ { A-, A:, A|, …, U| }

Note:

Emission distributions ek(b) estimated from real genes from human/mouse

eI(x|) < eE(x|): matches favored in exons

eI(x-) > eE(x-): gaps (and mismatches) favored in introns


Example

Human: ACGGCGACGUGCACGU

Mouse: ACUGUGACGUGCACUU

Alignment: ||:|:|||||||||:|

Input to Twinscan HMM: A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|

Recall, eE(A|) > eI(A|)

eE(A-) < eI(A-)

Likely exon
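
The example can be checked with a toy two-state Viterbi decoder. This is only a sketch under strong assumptions: it has just an exon state E and an intron state I, its emissions depend on the conservation mark alone (not on the base), and every probability in `START`, `TRANS`, and `EMIT` is a made-up value chosen so that matches are favored in E and gaps and mismatches in I. None of these numbers come from Twinscan.

```python
import math

STATES = ("E", "I")
# Hypothetical parameters, for illustration only.
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"|": 0.80, ":": 0.15, "-": 0.05},  # matches favored in exons
    "I": {"|": 0.40, ":": 0.30, "-": 0.30},  # gaps and mismatches favored in introns
}


def viterbi(symbols):
    """Return the most probable state path, one letter per input symbol."""
    marks = [s[-1] for s in symbols]  # keep only the conservation mark
    if not marks:
        return ""
    score = {k: math.log(START[k]) + math.log(EMIT[k][marks[0]]) for k in STATES}
    back = []
    for mark in marks[1:]:
        new_score, pointers = {}, {}
        for k in STATES:
            prev = max(STATES, key=lambda j: score[j] + math.log(TRANS[j][k]))
            new_score[k] = score[prev] + math.log(TRANS[prev][k]) + math.log(EMIT[k][mark])
            pointers[k] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    example = "A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|".split()
    print(viterbi(example))
```

With these assumed parameters the three isolated mismatches are not enough to pay for switching states, so the whole example is labeled as exon.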


HMMs for simultaneous alignment and gene finding:

Generalized Pair HMMs


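A Pair HMM for alignments

[Figure: pair HMM with BEGIN and END states and three emitting states. M emits an aligned pair with probability P(xi, yj); I emits xi with probability P(xi); J emits yj with probability P(yj). The edges carry the transition probabilities.]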
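
A three-state pair HMM of this shape can be decoded with a Viterbi recursion over both sequences at once. The sketch below is an assumption-laden illustration: the gap-open probability `delta`, the gap-extend probability `epsilon`, the emission model (`p_match` shared by identical pairs, uniform single-base emissions), and the omission of the END transition are all choices made here, not values taken from the slide.

```python
import math

NEG_INF = float("-inf")


def pair_hmm_viterbi(x, y, delta=0.1, epsilon=0.3, p_match=0.9):
    """Most probable alignment of x and y under a toy M/I/J pair HMM."""

    def log_pair(a, b):
        # P(a, b): the 4 identical pairs share p_match, the 12 others share the rest.
        return math.log(p_match / 4 if a == b else (1 - p_match) / 12)

    log_single = math.log(0.25)  # P(xi) and P(yj): uniform over four bases
    to_m = {"M": math.log(1 - 2 * delta),
            "I": math.log(1 - epsilon),
            "J": math.log(1 - epsilon)}
    open_gap, extend_gap = math.log(delta), math.log(epsilon)

    n, m = len(x), len(y)
    v = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in "MIJ"}
    back = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in "MIJ"}
    v["M"][0][0] = 0.0  # BEGIN is treated like M before anything is emitted

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # M emits the pair (xi, yj)
                prev = max("MIJ", key=lambda s: v[s][i - 1][j - 1] + to_m[s])
                v["M"][i][j] = v[prev][i - 1][j - 1] + to_m[prev] + log_pair(x[i - 1], y[j - 1])
                back["M"][i][j] = prev
            if i > 0:  # I emits xi against a gap
                cands = {"M": v["M"][i - 1][j] + open_gap, "I": v["I"][i - 1][j] + extend_gap}
                prev = max(cands, key=cands.get)
                v["I"][i][j] = cands[prev] + log_single
                back["I"][i][j] = prev
            if j > 0:  # J emits yj against a gap
                cands = {"M": v["M"][i][j - 1] + open_gap, "J": v["J"][i][j - 1] + extend_gap}
                prev = max(cands, key=cands.get)
                v["J"][i][j] = cands[prev] + log_single
                back["J"][i][j] = prev

    # Trace back from the best state at the end of both sequences.
    state = max("MIJ", key=lambda s: v[s][n][m])
    i, j, row_x, row_y = n, m, [], []
    while i > 0 or j > 0:
        prev = back[state][i][j]
        if state == "M":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif state == "I":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
        state = prev
    return "".join(reversed(row_x)), "".join(reversed(row_y))


if __name__ == "__main__":
    print(pair_hmm_viterbi("ACGT", "AGT"))
```

The table has one cell per pair of positions (i, j) for each state, which is where the TU factor in the running time of pair HMMs comes from.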


Generalized Pair HMMs


Exon GPHMM

[Figure: an exon state generating an aligned pair of exons with lengths d and e]

1. Choose exon lengths (d, e).

2. Generate alignment of length d + e.


Cross-species gene finding

[Figure: a gene with Exon1, Intron1, Exon2, Intron2, Exon3, drawn 5’ to 3’, in [human] and [mouse], with three regions marked CNS (conserved non-coding sequence)]


The SLAM hidden Markov model


Computational complexity

Model   Time          Space
HMM     N^2 T         N T
PHMM    N^2 T U       N T U
GHMM    D^2 N^2 T     N T
GPHMM   D^4 N^2 T U   N T U

N = number of states
D = maximum duration
T = length of seq1
U = length of seq2


Approximate alignment

Reduces the TU factor to hT


Measuring Performance


Example: HoxA2 and HoxA3

[Figure: annotation tracks for the HoxA2 and HoxA3 region: SLAM, SGP-2, Twinscan, Genscan, TBLASTX, SLAM CNS, VISTA, RefSeq]


Suffix Trees

(a short break from biology)


Suffix Trees

• Suffix trees are a method to find all maximal matches between two strings (and much more)

Example: x = dabdac

[Figure: suffix tree for x = d a b d a c, with leaves numbered 1 to 6]


Definition of a Suffix Tree

Definition:

For string x = x1…xm, a suffix tree is:

A rooted tree with m leaves

Leaf i corresponds to the suffix xi…xm

Each edge is labeled with a substring of x

No two edges out of a node start with the same letter

It follows that every substring of x corresponds to an initial part of a path from the root to a leaf


Naïve Algorithm to Construct a Suffix Tree

1. Initialize tree T: a single root node r

2. Insert special symbol $ at end of x

3. For j = 1 to m

• Find longest match of xj…xm to T, starting from r

• Split edge where match stops: new node w (if the match stops exactly at a node, that node is w)

• Create edge (w, j) to a new leaf j, and label it with the unmatched portion of xj…xm
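
The three steps above can be sketched directly. This is a minimal quadratic-time illustration, not one of the linear-time constructions; it stores each edge label as a copied substring for readability, and the names `Node`, `build_suffix_tree`, `leaves`, and `contains` are hypothetical helpers introduced here.

```python
class Node:
    def __init__(self):
        self.children = {}  # first letter of edge label -> (edge label, child Node)
        self.leaf = None    # 1-based start of the suffix, set for leaves only


def build_suffix_tree(x: str) -> Node:
    """Naive construction: insert the suffixes of x$ one at a time."""
    x = x + "$"
    root = Node()
    for j in range(len(x)):
        node, suffix = root, x[j:]
        while True:
            first = suffix[0]
            if first not in node.children:
                # Match stops at a node: hang a new leaf with the unmatched rest.
                leaf = Node()
                leaf.leaf = j + 1
                node.children[first] = (suffix, leaf)
                break
            label, child = node.children[first]
            k = 0  # length of the common prefix of the edge label and the suffix
            while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
                k += 1
            if k < len(label):
                # Match stops inside the edge: split it at a new node w.
                w = Node()
                w.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], w)
                child = w
            node, suffix = child, suffix[k:]
    return root


def leaves(node: Node) -> list[int]:
    """Sorted suffix start positions below a node."""
    if node.leaf is not None:
        return [node.leaf]
    return sorted(i for _, child in node.children.values() for i in leaves(child))


def contains(root: Node, s: str) -> bool:
    """True if s is a substring of x$, i.e. a prefix of some root-to-leaf path."""
    node, rest = root, s
    while rest:
        if rest[0] not in node.children:
            return False
        label, child = node.children[rest[0]]
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return False
        node, rest = child, rest[k:]
    return True


if __name__ == "__main__":
    tree = build_suffix_tree("dabda")
    print(leaves(tree), contains(tree, "bda"), contains(tree, "dd"))
```

Because $ occurs only at the end, no suffix is a prefix of another, so every insertion ends by creating exactly one new leaf.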


Example of Suffix Tree Construction

x = d a b d a $

1. Insert d a b d a $

2. Insert a b d a $

3. Insert b d a $

4. Insert d a $

5. Insert a $

6. Insert $

[Figure: the tree after each insertion; the new leaves are numbered 1 to 6 in insertion order]


Memory to Store Suffix Tree

• Can store in O( N ) memory!

• Every edge is labeled with (i, j):

(i,j) denotes xi…xj

• Tree has O( N ) nodes

Proof: 1. # leaves ≥ # internal nodes – 1 (every internal node has at least two children)

2. # leaves = |x|


Faster Construction

Several algorithms

O( N ) time,

O( N ) memory with a big constant ~15 bytes/char

Technical but not deep, outside the scope of this course

Optional: Gusfield, chapter 6


Application: find all matches between x, y

1. Build suffix tree for x, mark nodes with x

2. Insert y in suffix tree, mark all nodes y passes through with y

The path label of every node marked with both x and y is a common substring
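
The marking idea can be illustrated with a short sketch. To keep it small, it swaps the compressed suffix tree for an uncompressed suffix trie (one node per character), which uses quadratic space and is only suitable for short strings; the function name `common_substrings` is a hypothetical helper.

```python
def common_substrings(x: str, y: str) -> set[str]:
    """All non-empty substrings shared by x and y, found by marking trie nodes."""
    root = {"marks": set(), "children": {}}
    for name, s in (("x", x), ("y", y)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(ch, {"marks": set(), "children": {}})
                node["marks"].add(name)  # this node lies on a suffix of `name`
    found = set()
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["children"].items():
            # Only nodes marked with both x and y spell a common substring;
            # descendants of a singly marked node cannot be marked with both.
            if child["marks"] == {"x", "y"}:
                found.add(label + ch)
                stack.append((child, label + ch))
    return found


if __name__ == "__main__":
    print(sorted(common_substrings("dabda", "abada")))
```

In the compressed tree the same test is applied to nodes rather than single characters, and the maximal matches are the doubly marked nodes with no doubly marked child.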


Example of Suffix Tree construction

x = d a b d a $
y = a b a d a $

1. Construct tree for x

2. Insert a b a d a $

3. Insert b a d a $

4. Insert a d a $

5. Insert d a $

6. Insert a $

7. Insert $

[Figure: the suffix tree for x with the suffixes of y inserted; each node is marked x, y, or both]


Application: common substrings of k strings

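To find the longest common substring of s1, s2, …, sn:

1. Build one suffix tree containing all of s1, …, sn

2. All nodes labeled {si1, …, sik} represent a match between si1, …, sik

The deepest node labeled with all n strings gives the longest common substring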
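
A sketch of the k-string case follows. As a simplifying assumption it again uses an uncompressed suffix trie instead of a real generalized suffix tree, labels each node with the set of string indices that pass through it, and breaks ties between equally long answers alphabetically; `longest_common_substring` is a hypothetical name.

```python
def longest_common_substring(strings: list[str]) -> str:
    """Longest substring present in every input string ('' if there is none)."""
    k = len(strings)
    root = {}  # char -> [set of owning string indices, children dict]
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            children = root
            for ch in s[start:]:
                entry = children.setdefault(ch, [set(), {}])
                entry[0].add(idx)
                children = entry[1]
    best = ""
    stack = [(root, "")]
    while stack:
        children, label = stack.pop()
        for ch, (owners, below) in children.items():
            if len(owners) < k:
                continue  # some string never reaches this node
            candidate = label + ch
            if len(candidate) > len(best) or (len(candidate) == len(best) and candidate < best):
                best = candidate
            stack.append((below, candidate))
    return best


if __name__ == "__main__":
    print(longest_common_substring(["banana", "ananas", "nana"]))
```

Relaxing the test `len(owners) < k` to a smaller threshold gives substrings shared by at least that many of the strings, which corresponds to the node labels {si1, …, sik}.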


Suffix Arrays

ABRACADABRA$

11 $
10 A$
 7 ABRA$
 0 ABRACADABRA$
 3 ACADABRA$
 5 ADABRA$
 8 BRA$
 1 BRACADABRA$
 4 CADABRA$
 6 DABRA$
 9 RA$
 2 RACADABRA$

• Fast search for any query string: O(log n) suffix comparisons by binary search

• Used in data compression, e.g. bzip2

• Can be built in O(n) time by first building the suffix tree and then reading off the suffixes in sorted order with a depth-first traversal that visits child edges in lexicographic order
  – Too much memory: ~15n bytes
  – Difficult to implement

• In theory, can be built in O(n log n) time using O(n / sqrt(log n)) extra memory

• How to build them fast in practice is an active research topic
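
For small inputs the suffix array and the binary search over it can be sketched in a few lines. The construction below simply sorts the suffixes, which is far from the linear-time methods mentioned above and is meant only to reproduce the ABRACADABRA$ table; `suffix_array` and `find_occurrences` are hypothetical names.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of pattern in text, via two binary searches on sa."""
    m = len(pattern)
    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "ABRACADABRA$"
    sa = suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ABRA"))
```

Each binary search makes O(log n) comparisons, and each comparison inspects at most as many characters as the query has, so the occurrences of a query form one contiguous block of the array.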