Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003

Dec 20, 2015

Transcript
Page 1: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Large-Scale Global Alignments

Multiple Alignments

Lecture 10, Thursday May 1, 2003

Page 2: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Rapid global alignment

Genomic regions of interest contain ordered islands of similarity

– E.g. genes

1. Find local alignments
2. Chain an optimal subset of them

Page 3: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Suffix Trees

• Suffix trees are a method to find all maximal matches between two strings (and much more)

Example: x = dabdac

[Figure: suffix tree for x = dabdac. Edges carry substring labels; the six leaves are numbered 1 to 6 by the starting position of the suffix each one spells.]

Page 4: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Application: Find all Matches Between x and y

1. Build suffix tree for x, mark nodes with x

2. Insert y in suffix tree, mark all nodes y “passes through” with y

– The path label of every node marked with both x and y is a common substring
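The marking procedure above can be sketched in Python. This is only an illustration under stated assumptions, not the lecture's implementation: it uses a plain, uncompressed suffix trie, which costs quadratic time and space where a true suffix tree is linear. The function names and the dictionary node layout are invented for the sketch.

```python
def build_marked_trie(x, y):
    """Insert every suffix of x and of y into one trie.

    Each node records which of the two strings passes through it.
    (Hypothetical helper; an uncompressed trie stands in for a suffix tree.)
    """
    root = {"children": {}, "marks": set()}
    for label, s in (("x", x), ("y", y)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "marks": set()}
                )
                node["marks"].add(label)
    return root


def common_substrings(x, y):
    """Return the path labels of all nodes marked with both x and y."""
    root = build_marked_trie(x, y)
    found = set()
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["children"].items():
            # If a node is not shared, none of its descendants can be shared.
            if child["marks"] == {"x", "y"}:
                found.add(path + ch)
                stack.append((child, path + ch))
    return found
```

For example, common_substrings("dabda", "abada") returns the set {"a", "b", "d", "ab", "da"}.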

Page 5: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Example of Suffix Tree Construction for x, y

x = d a b d a $
y = a b a d a $

1. Construct tree for x
2. Insert a b a d a $
3. Insert b a d a $
4. Insert a d a $
5. Insert d a $
6. Insert a $
7. Insert $

[Figure: the suffix tree of x, with leaves numbered 1 to 6 by suffix start position, growing as each suffix of y is inserted; every node is tagged x, y, or both, according to which string passes through it.]

Page 6: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Application: Online Search of Strings on a Database

Say a database D = { s1, s2, …, sn } (e.g. proteins)

Question: given new string x, find all matches of x to database

1. Build suffix tree for {s1, …, sn}
2. All new queries x take O( |x| ) time (plus the time to report the matches found)

(somewhat like BLAST)
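A rough sketch of the database lookup follows. It assumes an uncompressed trie over all suffixes of all database strings (quadratic space, unlike a real suffix tree) and stores at each node the (sequence index, start offset) pairs of the suffixes passing through it. The names Node, build_index and find_matches are hypothetical.

```python
class Node:
    """Trie node; `hits` lists (sequence index, start offset) of suffixes through it."""
    __slots__ = ("children", "hits")

    def __init__(self):
        self.children = {}
        self.hits = []


def build_index(database):
    """Index every suffix of every database string (built once, reused per query)."""
    root = Node()
    for sid, s in enumerate(database):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.hits.append((sid, start))
    return root


def find_matches(root, query):
    """Walk |query| characters down the trie; the node reached holds all occurrences."""
    node = root
    for ch in query:
        node = node.children.get(ch)
        if node is None:
            return []
    return list(node.hits) if query else []
```

With the index built over ["dabda", "abada"], the query "da" returns [(0, 0), (0, 3), (1, 3)]: the walk itself touches only |x| nodes, and the rest of the cost is reporting the hits.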

Page 7: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Application: Common Substrings of k Strings

• Say we want to find the longest common substring of s1, s2, …, sn

1. Build suffix tree for s1, …, sn

2. All nodes labeled {si1, …, sik} represent a match between si1, …, sik

3. The deepest node (longest path label) labeled with all n strings gives the longest common substring
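As an illustration of the labeling idea, the sketch below tags each trie node with a bitmask of the strings that pass through it and keeps the longest path whose mask is full. It again assumes an uncompressed trie rather than a suffix tree, and the function name is made up for this example. When several common substrings share the maximum length, it returns whichever one it meets first.

```python
def longest_common_substring(strings):
    """Longest substring present in every input string (naive trie sketch)."""
    full = (1 << len(strings)) - 1       # mask with one bit per input string
    root = {}                            # char -> [mask, children dict]
    best = ""
    for idx, s in enumerate(strings):
        bit = 1 << idx
        for start in range(len(s)):
            children = root
            for end in range(start, len(s)):
                entry = children.setdefault(s[end], [0, {}])
                entry[0] |= bit          # string idx passes through this node
                if entry[0] == full and end + 1 - start > len(best):
                    best = s[start:end + 1]
                children = entry[1]
    return best
```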

Page 8: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Methods to CHAIN Local Alignments

Sparse Dynamic Programming: O(N log N)

Page 9: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

The Problem: Find a Chain of Local Alignments

(x, y) → (x’, y’)

requires

x < x’ and y < y’

Each local alignment has a weight

FIND the chain with highest total weight

Page 10: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Quadratic Time Solution

• Build Directed Acyclic Graph (DAG):
– Nodes: local alignments (xa,xb) (ya,yb)
– Directed Edges: any two local alignments that can be chained

• Each local alignment is a node vi with alignment score si

Page 11: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Quadratic Time Solution (cont’d)

Dynamic programming:

Initialization:
Find each node va s.t. there is no edge (u, va)
Set score of V(a) to be sa

Iteration:
For each vi, optimal path ending in vi has total score:
V(i) = max over all edges (vj, vi) of ( weight(vj, vi) + V(j) )

Termination:
Optimal global chain: j = argmax ( V(j) ); trace chain from vj

Worst case time: quadratic
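The recurrence can be sketched as follows. The tuple layout (x_start, x_end, y_start, y_end, score) and the function name are assumptions for this example, and weight(vj, vi) is taken to be the score si of the alignment being entered, which the slides do not spell out. Sorting by x_start gives a valid topological order, because any alignment that can precede vi ends, and therefore starts, before vi starts.

```python
def chain_quadratic(alignments):
    """alignments: list of (x_start, x_end, y_start, y_end, score).

    Returns (best total score, indices of the chained alignments in order).
    """
    if not alignments:
        return 0, []
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    V, back = {}, {}
    done = []                                 # nodes already scored
    for i in order:
        xs, _, ys, _, score = alignments[i]
        V[i], back[i] = score, None           # chain consisting of v_i alone
        for j in done:                        # all candidate predecessors: O(N) each
            _, xe_j, _, ye_j, _ = alignments[j]
            if xe_j < xs and ye_j < ys and V[j] + score > V[i]:
                V[i], back[i] = V[j] + score, j
        done.append(i)
    node = max(V, key=V.get)                  # termination: argmax V(j)
    best = V[node]
    chain = []
    while node is not None:                   # trace the chain backwards
        chain.append(node)
        node = back[node]
    return best, chain[::-1]
```

On the five alignments (1,3,1,3,5), (4,6,5,8,6), (2,5,9,10,3), (7,9,9,12,4), (8,10,2,4,2), the best chain is alignments 0, 1, 3 with total score 15.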

Page 12: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Faster Solution: Sparse DP

• 1,…, N: rectangles

• (hj, lj): y-coordinates of rectangle j

• w(j): weight of rectangle j

• V(j): optimal score of chain ending in j

• L: list of triplets (lj, V(j), j)

L is sorted by lj

[Figure: a rectangle drawn against the y-axis, with its two y-coordinates h and l marked.]

Page 13: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Sparse DP

Go through rectangle x-coordinates, from lowest to highest:

1. When on the leftmost end of i:
   a. j: rectangle in L, with largest lj < hi
   b. V(i) = w(i) + V(j)

2. When on the rightmost end of i:
   a. j: rectangle in L, with largest lj ≤ li
   b. If V(i) > V(j):
      i. INSERT (li, V(i), i) in L
      ii. REMOVE all (lk, V(k), k) with V(k) ≤ V(i) & lk ≥ li
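The sweep above might look like this in Python. Several things here are assumptions of the sketch rather than statements from the slides: rectangles are tuples (x_left, x_right, h, l, w) with h ≤ l numerically, a rectangle with no predecessor in L gets V(i) = w(i), and left ends are processed before right ends at the same x so that chaining stays strict. L is kept as parallel Python lists searched with bisect; list insertion and deletion are linear, so a balanced search tree would be needed to actually reach O(N log N).

```python
import bisect


def chain_sparse(rects):
    """rects: list of (x_left, x_right, h, l, w), with h <= l.

    Returns (best total weight, rectangle indices in chain order).
    """
    events = []
    for i, (x_left, x_right, _h, _l, _w) in enumerate(rects):
        events.append((x_left, 0, i))    # 0 = left end, sorts first at equal x
        events.append((x_right, 1, i))   # 1 = right end
    events.sort()

    # The list L as parallel arrays, sorted by l; V is strictly increasing along it.
    ls, vs, ids = [], [], []
    V, back = {}, {}

    for _x, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            p = bisect.bisect_left(ls, h) - 1       # entry with largest l_j < h_i
            if p >= 0:
                V[i], back[i] = w + vs[p], ids[p]
            else:
                V[i], back[i] = w, None             # i starts a new chain
        else:
            p = bisect.bisect_right(ls, l) - 1      # entry with largest l_j <= l_i
            if p < 0 or V[i] > vs[p]:
                start = p if (p >= 0 and ls[p] == l) else p + 1
                end = start
                # Dominated entries (l_k >= l_i, V(k) <= V(i)) are consecutive.
                while end < len(ls) and vs[end] <= V[i]:
                    end += 1
                ls[start:end] = [l]
                vs[start:end] = [V[i]]
                ids[start:end] = [i]

    if not V:
        return 0, []
    node = max(V, key=V.get)
    best = V[node]
    chain = []
    while node is not None:
        chain.append(node)
        node = back[node]
    return best, chain[::-1]
```

For the rectangles (1,3,1,3,5), (4,6,5,8,6), (2,5,9,10,3), (7,9,9,12,4), (8,10,2,4,2) this returns (15, [0, 1, 3]).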

Page 14: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Example

[Figure: five rectangles in the x-y plane, labeled rectangle: weight as 1: 5, 2: 6, 3: 3, 4: 4, 5: 2. Axis ticks: 2, 5, 6, 9, 10, 11, 12, 14, 15, 16.]

Page 15: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Time Analysis

1. Sorting the x-coords takes O(N log N)

2. Going through x-coords: N steps

3. Each of N steps requires O(log N) time:
• Searching L takes log N
• Inserting to L takes log N
• All deletions are consecutive, so log N

Page 16: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Putting it All Together: Fast Global Alignment Algorithms

1. FIND local alignments
2. CHAIN local alignments

          FIND          CHAIN
GLASS:    k-mers        hierarchical DP
MUMmer:   Suffix Tree   Sparse DP
Avid:     Suffix Tree   hierarchical DP
LAGAN:    CHAOS         Sparse DP
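To make the FIND column concrete, here is a minimal sketch of exact k-mer seeding, the simplest of the listed strategies. It is not the GLASS implementation; the function name and the hash-index approach are assumptions for illustration. Each seed (i, j) says that x[i:i+k] equals y[j:j+k], and such seeds would then be passed to a chaining step.

```python
from collections import defaultdict


def kmer_seeds(x, y, k):
    """Return all (i, j) with x[i:i+k] == y[j:j+k], ordered by j."""
    index = defaultdict(list)
    for i in range(len(x) - k + 1):
        index[x[i:i + k]].append(i)            # positions of each k-mer in x
    return [
        (i, j)
        for j in range(len(y) - k + 1)
        for i in index.get(y[j:j + k], [])     # .get avoids creating empty entries
    ]
```

For example, kmer_seeds("dabda", "abada", 2) returns [(1, 0), (0, 3), (3, 3)].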

Page 17: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

LAGAN: Pairwise Alignment

1. FIND local alignments

2. CHAIN local alignments

3. DP restricted around chain

Page 18: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

LAGAN

1. Find local alignments

2. Chain: O(N log N) L.I.S. (longest increasing subsequence)

3. Restricted DP
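The L.I.S. step can be illustrated with the standard patience-sorting method. The sketch below is a simplification, not LAGAN's code: it treats each local alignment as a single unweighted point (x, y), and the names lis_indices and chain_points are hypothetical. Sorting by x, with ties broken by decreasing y, guarantees that a strictly increasing run of y values is also strictly increasing in x.

```python
import bisect


def lis_indices(values):
    """Indices of one longest strictly increasing subsequence, in O(N log N)."""
    tails, tail_idx = [], []          # tails[k]: smallest tail of a run of length k+1
    prev = [None] * len(values)
    for i, v in enumerate(values):
        k = bisect.bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
            tail_idx.append(i)
        else:
            tails[k] = v
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else None
    out = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:              # follow back-pointers from the longest run
        out.append(i)
        i = prev[i]
    return out[::-1]


def chain_points(points):
    """Largest set of (x, y) points that increases strictly in both coordinates."""
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    return [ordered[i] for i in lis_indices([y for _, y in ordered])]
```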

Page 19: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

LAGAN: recursive call

• What if a box is too large?
– Recursive application of LAGAN, more sensitive word search

Page 20: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

3. LAGAN: memory

“necks” have tiny tracebacks

…only store tracebacks

Page 21: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Multiple Sequence Alignments

Page 22: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Page 23: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Page 24: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Overview

• Definition

• Scoring Schemes

• Algorithms

Page 25: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Multiple Sequence Alignments

Definition and Motivation

Page 26: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Definition

• Given N sequences x1, x2,…, xN:
– Insert gaps (-) in each sequence xi, such that
  • All sequences have the same length L
  • Score of the global map is maximum

• Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences

• Same for a multiple alignment!

Page 27: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Motivation

• A faint similarity between two sequences becomes very significant if present in many

• Protein domains

• Motifs responsible for gene regulation

Page 28: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Another way to view multiple alignments

• Pairwise alignment:

Good alignment → Evolutionary relationship

• Multiple alignment:

Something in common → Sequence in common

“inverse” to pairwise alignment

Page 29: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Multiple Sequence Alignments

Scoring Function

Page 30: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Scoring Function

• Ideally:
– Find alignment that maximizes probability that sequences evolved from common ancestor

[Figure: phylogenetic tree with leaves x, y, z, w, v and an unknown ancestor marked "?".]

Page 31: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Scoring Function (cont’d)

• Unfortunately: too many parameters

• Compromises:

– Ignore phylogenetic tree

– Statistically independent columns:

$S(m) = G(m) + \sum_i S(m_i)$

m: alignment matrix
$m_i$: column i of the alignment
G: function penalizing gaps

Page 32: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment (keep the two rows and drop every column in which both have a gap)

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C;   x: AC-GCGG-C;   y: AC-GCGAG
y: ACGC-GAG;   z: GCCGC-GAG;   z: GCCGCGAG
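The projection can be written in a few lines. This is a sketch with a made-up function name; it assumes "-" is the gap character and that both rows have equal length, as they do in a multiple alignment.

```python
def induced_pairwise(row_a, row_b):
    """Project two rows of a multiple alignment, dropping gap-gap columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)
```

Applied to the rows x and y of the example, it returns ("ACGCGG-C", "ACGC-GAG"); for y and z it returns ("AC-GCGAG", "GCCGCGAG").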

Page 33: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Sum Of Pairs (cont’d)

• The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments

$S(m) = \sum_{k<l} s(m^k, m^l)$

$s(m^k, m^l)$: score of induced alignment (k,l)
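A small sketch of the sum-of-pairs score follows. The pairwise scoring values (match +1, mismatch -1, linear gap -2) are placeholders chosen for the example, not values from the lecture, and gap-gap columns score 0 because they are dropped from the induced alignment.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one column of an induced pairwise alignment (assumed toy scheme)."""
    if a == "-" and b == "-":
        return 0                      # column absent from the induced alignment
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows):
    """Sum over all row pairs k < l of the induced pairwise alignment score."""
    return sum(
        pair_score(a, b)
        for row_k, row_l in combinations(rows, 2)
        for a, b in zip(row_k, row_l)
    )
```

With these placeholder scores, the three-row example alignment (AC-GCGG-C, AC-GC-GAG, GCCGC-GAG) scores 0 for (x, y), -4 for (x, z) and 3 for (y, z), so sum_of_pairs returns -1.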

Page 34: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Sum Of Pairs (cont’d)

• Heuristic way to incorporate evolution tree:

[Figure: evolutionary tree with leaves Human, Mouse, Chicken, Duck.]

• Weighted SOP:

$S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$

$w_{kl}$: weight decreasing with distance

Page 35: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Consensus

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC

CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Find optimal consensus string m* to maximize

$S(m) = \sum_i s(m^*, m^i)$

$s(m^k, m^l)$: score of pairwise alignment (k,l)

Page 36: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Multiple Sequence Alignments

Algorithms

Page 37: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Multidimensional Dynamic Programming

Generalization of Needleman-Wunsch:

$S(m) = \sum_i S(m_i)$

(sum of column scores)

$F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\mathrm{nbr}) + S(\mathrm{nbr}) \big)$

Page 38: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Multidimensional Dynamic Programming

• Example: in 3D (three sequences):

• 7 neighbors/cell

F(i,j,k) = max{ F(i-1,j-1,k-1) + S(xi, xj, xk),
                F(i-1,j-1,k  ) + S(xi, xj, - ),
                F(i-1,j  ,k-1) + S(xi, - , xk),
                F(i-1,j  ,k  ) + S(xi, - , - ),
                F(i  ,j-1,k-1) + S( -, xj, xk),
                F(i  ,j-1,k  ) + S( -, xj, - ),
                F(i  ,j  ,k-1) + S( -, - , xk) }
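The three-sequence recurrence can be tried out with the sketch below. It returns only the optimal score, with no traceback, and it assumes a sum-of-pairs column score built from placeholder values (match +1, mismatch -1, gap -2) and linear gap costs. These choices and the function name are illustrative, not taken from the lecture.

```python
from itertools import product


def align3_score(x, y, z, match=1, mismatch=-1, gap=-2):
    """Optimal score of a three-way global alignment by 3D dynamic programming."""

    def pair(a, b):
        if a == "-" and b == "-":
            return 0
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch

    def column(a, b, c):              # sum-of-pairs score of one column
        return pair(a, b) + pair(a, c) + pair(b, c)

    # The 7 neighbors of a cell: every nonzero 0/1 step vector.
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    F = {(0, 0, 0): 0}
    # Lexicographic order guarantees every predecessor is filled in first.
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = float("-inf")
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = x[i - 1] if di else "-"
            b = y[j - 1] if dj else "-"
            c = z[k - 1] if dk else "-"
            best = max(best, F[pi, pj, pk] + column(a, b, c))
        F[i, j, k] = best
    return F[len(x), len(y), len(z)]
```

The table has (|x|+1)(|y|+1)(|z|+1) cells and each cell looks at up to 7 neighbors, which is why this only works for short sequences.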

Page 39: Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Multidimensional Dynamic Programming

Running Time:

1. Size of matrix: $L^N$;

Where L = length of each sequence, N = number of sequences

2. Neighbors/cell: $2^N - 1$

Therefore: $O(2^N L^N)$
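To give a feel for the growth, the cell-update count $(2^N - 1)\,L^N$ can be evaluated for a few sizes. The values of N and L below are arbitrary choices for illustration, not figures from the lecture.

```latex
\[
\begin{aligned}
N = 2,\ L = 100 &: \quad (2^2 - 1)\cdot 100^{2} = 3 \cdot 10^{4} \\
N = 3,\ L = 100 &: \quad (2^3 - 1)\cdot 100^{3} = 7 \cdot 10^{6} \\
N = 10,\ L = 100 &: \quad (2^{10} - 1)\cdot 100^{10} = 1023 \cdot 10^{20} \approx 10^{23}
\end{aligned}
\]
```

Even for sequences of only 100 letters, ten sequences are already far out of reach for exact multidimensional DP.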