Computational Molecular Biology Multiple Sequence Alignment
Computational Molecular Biology

Jan 19, 2016

Ricky Lien

Transcript
Page 1: Computational Molecular Biology

Computational Molecular Biology

Multiple Sequence Alignment

Page 2: Computational Molecular Biology

My T. [email protected]

Sequence Alignment

Problem Definition:
    Given: two DNA or protein sequences
    Find: the best match between them

What is an alignment:
    Given: two strings S and S’
    Goal: make the lengths of S and S’ equal by inserting spaces (written --, sometimes denoted ∆) into the strings

    A  -- T  C  -- A
    -- C  T  C  A  A

Page 3: Computational Molecular Biology

Matches, Mismatches and Indels

Match: two aligned, identical characters in an alignment

Mismatch: two aligned, unequal characters

Indel: a character aligned with a space

    A  A  C  T  A  C  T  -- C  C  T  A  A  C  A  C  T  -- --
    -- -- C  T  C  C  T  A  C  C  T  -- -- T  A  C  T  T  T

10 matches, 2 mismatches, 7 indels
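The counts on this slide can be checked mechanically. The sketch below reads the 19-column example as two rows of 19 symbols each (the row split is an assumption recovered from the flattened text) and classifies every column; the function name `classify_columns` is illustrative only.

```python
GAP = "--"  # the slide writes a space as "--"


def classify_columns(row1, row2):
    """Count (matches, mismatches, indels) over the columns of a pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same number of columns")
    matches = mismatches = indels = 0
    for a, b in zip(row1, row2):
        if a == GAP and b == GAP:
            raise ValueError("a column of two spaces is not allowed in a pairwise alignment")
        if a == GAP or b == GAP:
            indels += 1          # a character aligned with a space
        elif a == b:
            matches += 1         # two aligned, identical characters
        else:
            mismatches += 1      # two aligned, unequal characters
    return matches, mismatches, indels


row1 = "A A C T A C T -- C C T A A C A C T -- --".split()
row2 = "-- -- C T C C T A C C T -- -- T A C T T T".split()
print(classify_columns(row1, row2))  # (10, 2, 7)
```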

Page 4: Computational Molecular Biology

Basic Algorithmic Problem

Find the alignment of the two strings that either:
    maximizes m, where m = (# matches − # mismatches − # indels), or
    minimizes m, where m is the SP-score of the alignment

In the first form, m defines the similarity of the two strings; the problem is also called Optimal Global Alignment

Biologically: a mismatch represents a mutation, whereas an indel represents a historical insertion or deletion of a single character
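A minimal sketch of the maximization form, assuming the classical dynamic program for global alignment (Needleman-Wunsch style) with +1 per match and −1 per mismatch or indel. The slides do not name an algorithm at this point, so the choice of dynamic program and the function name are assumptions.

```python
def global_alignment_score(s, t):
    """Maximum of (# matches - # mismatches - # indels) over all alignments of s and t."""
    n, m = len(s), len(t)
    # prev[j] = best score for aligning the empty prefix of s with t[:j] (j indels)
    prev = [-j for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [-i] + [0] * m  # aligning s[:i] with the empty prefix costs i indels
        for j in range(1, m + 1):
            diagonal = prev[j - 1] + (1 if s[i - 1] == t[j - 1] else -1)
            cur[j] = max(diagonal,        # align s[i-1] with t[j-1]
                         prev[j] - 1,     # s[i-1] aligned with a space
                         cur[j - 1] - 1)  # t[j-1] aligned with a space
        prev = cur
    return prev[m]


# The two strings behind the small example alignment earlier in the deck.
print(global_alignment_score("ATCA", "CTCAA"))
```

Only two rows of the table are kept at a time, so the sketch uses O(m) space; recovering the alignment itself would need the full table or a divide-and-conquer traceback.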

Page 5: Computational Molecular Biology

Multiple Sequence Alignment

Problem Definition: Similar to the sequence alignment problem, but the input has more than 2 strings

Challenges:
    NP-hard, MAX-SNP-hard
    Guarantee factor: 2 − 2/k, where k is the number of input sequences
    More work is needed to reduce the time and space complexity

Page 6: Computational Molecular Biology

Sum of Pairs Score (SP-Score)

Given a finite alphabet Σ and the extended alphabet Σ ∪ {∆}, where ∆ denotes a space

Consider k sequences over Σ that we want to align. After an alignment, each sequence has length l

A score d is assigned to each pair of letters.

Page 7: Computational Molecular Biology

SP-Score

The SP-Score of an alignment A is defined as follows.

Consider a matrix of l columns and k rows, where the rows represent the sequences and the columns represent the letters

The SP-Score is the sum of the scores of all columns:
    The score of each column is the sum of the scores of all distinct unordered pairs of letters in the column

Equivalently, it is the sum of the pairwise sequence alignment values.

Find an (optimal) alignment that minimizes the SP-Score value
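The two views of the SP-Score (by columns and by pairs of rows) can be compared with a small sketch. The pair score d is not reproduced in this transcript, so the unit cost used here (0 for identical symbols, including two spaces, and 1 otherwise) is an assumption, as are the function names.

```python
from itertools import combinations

GAP = "-"


def unit_cost(a, b):
    """Assumed pair score: 0 for identical symbols (including two spaces), 1 otherwise."""
    return 0 if a == b else 1


def sp_score_by_columns(rows, d=unit_cost):
    """Sum over columns of the scores of all unordered pairs of letters in the column."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length l")
    return sum(d(a, b) for column in zip(*rows) for a, b in combinations(column, 2))


def sp_score_by_pairs(rows, d=unit_cost):
    """Sum over unordered pairs of rows of the induced pairwise alignment values."""
    return sum(sum(d(a, b) for a, b in zip(r1, r2)) for r1, r2 in combinations(rows, 2))


rows = ["AC-T", "A-GT", "ACGT"]
print(sp_score_by_columns(rows), sp_score_by_pairs(rows))  # 4 4
```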

Page 8: Computational Molecular Biology

Proving that MSA with a Metric SP-Score is NP-hard

Page 9: Computational Molecular Biology

Some Notations

Page 10: Computational Molecular Biology

Some Basic Properties

Lemma 1: Let s1, s2 be two sequences over Σ such that l1=|s1|, l2=|s2|, l2≥l1 and there are m symbols of s1 that are not in s2. Then every alignment of the set {s1,s2} has at least m+l2-l1 mismatches

Page 11: Computational Molecular Biology

Page 12: Computational Molecular Biology

The construction

Reduce vertex cover (also called node cover) to MSA.

Vertex cover:
    Instance: A graph G=(V,E) and an integer k≤|V|
    Question: Is there a vertex cover V1 of G of size k or less?

MSA:
    Instance: A set S={s1, …, sn} of finite sequences over a fixed alphabet Σ, an SP-score and an integer C
    Question: Is there a multiple alignment of the sequences in S that is of value C or less?
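To make the source problem of the reduction concrete, here is a brute-force sketch of the vertex cover decision question. It is exponential in |V| and only meant for tiny instances; the function name and the edge-list representation are assumptions.

```python
from itertools import combinations


def has_vertex_cover(vertices, edges, k):
    """Return True if some set of at most k vertices touches every edge."""
    for size in range(min(k, len(vertices)) + 1):
        for candidate in combinations(vertices, size):
            chosen = set(candidate)
            if all(u in chosen or v in chosen for u, v in edges):
                return True
    return False


# A triangle needs two vertices to cover its three edges.
triangle_vertices = [0, 1, 2]
triangle_edges = [(0, 1), (1, 2), (0, 2)]
print(has_vertex_cover(triangle_vertices, triangle_edges, 1))  # False
print(has_vertex_cover(triangle_vertices, triangle_edges, 2))  # True
```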

Page 13: Computational Molecular Biology

SP-Score (alphabet of size 6)

Page 14: Computational Molecular Biology

The Reduction

So T is a set of C2 sequences t, and X contains C1 sequences x(k), where C1 and C2 will be determined later

Page 15: Computational Molecular Biology

An Example

Page 16: Computational Molecular Biology

Intuition

By the above construction, an optimal alignment A of S is obtained when A satisfies certain properties (such an alignment is called a standard alignment)

The value of a standard alignment is bounded by a given threshold C only when G has a vertex cover of size k

How to obtain this:
    Force the d’s of the test sequences to be aligned with the b’s of the edge sequences
    Only one b of each edge sequence can be aligned to a d
    The number of such alignments determines the value of the alignment

Page 17: Computational Molecular Biology

Standard Alignment

Page 18: Computational Molecular Biology

Page 19: Computational Molecular Biology

Page 20: Computational Molecular Biology

Page 21: Computational Molecular Biology

Page 22: Computational Molecular Biology

Let U_S and U_{S,X} denote the upper bounds of D(A_S) and D(A_{S,X}) respectively

By Corollary 8 and Lemma 9, the standard alignment has value not greater than D_SD + U_S + U_{S,X}

where D_SD = D(A_X) + D(A_T) + D(A_{X,T}) + D(A_{S,T}) over a standard alignment A

Now, letting C1 > U_S and C2 > U_S + U_{S,X}, we can prove that an optimal alignment must be a standard one

Page 23: Computational Molecular Biology

Page 24: Computational Molecular Biology

Page 25: Computational Molecular Biology

Show the NP-hardness of any scoring matrix in a broad class M

Show that there is a scoring matrix M0 such that MSA for M0 is MAX-SNP hard

Page 26: Computational Molecular Biology

Interesting Observation

Via brute force, optimal MSAs are observed to contain very few gaps

This suggests studying gap limitations: impose an upper bound on the number of gaps one can insert during the alignment

Special cases:
    Gap-0: no gaps are allowed, but the strings can be shifted for an alignment (gaps are inserted only at the beginning or at the end of a string)
    Gap-0-1: a gap-0 alignment in which the gap at the beginning or at the end of each string is exactly one space
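A gap-0-1 alignment of equal-length strings is fully described by one bit per string: either a space is appended at the end (the string stays in place) or a space is placed at the front (the string shifts right by one). The sketch below enumerates all 2^k choices and scores them. The unit cost (0 for equal symbols, 1 otherwise), the equal-length restriction and the function names are assumptions made for illustration.

```python
from itertools import combinations, product

GAP = "-"


def gap01_rows(strings, shifts):
    """Build the rows of a gap-0-1 alignment: shift 1 puts the space in front."""
    return [GAP + s if shift else s + GAP for s, shift in zip(strings, shifts)]


def sp_score(rows):
    """SP-score under the assumed unit cost (two spaces in a column cost 0)."""
    return sum(a != b for column in zip(*rows) for a, b in combinations(column, 2))


def best_gap01_alignment(strings):
    """Exhaustively search all 2^k shift vectors; returns (score, shifts)."""
    if len({len(s) for s in strings}) != 1:
        raise ValueError("this sketch assumes strings of equal length")
    best = None
    for shifts in product((0, 1), repeat=len(strings)):
        score = sp_score(gap01_rows(strings, shifts))
        if best is None or score < best[0]:
            best = (score, shifts)
    return best


strings = ["ACGT", "TACG", "ACGT"]
score, shifts = best_gap01_alignment(strings)
print(score, shifts)                 # 4 (1, 0, 1)
print(gap01_rows(strings, shifts))   # ['-ACGT', 'TACG-', '-ACGT']
```

The set of strings with shift 1 and the set with shift 0 form a two-way partition, which is the link to graph cuts used in the hardness proofs that follow.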

Page 27: Computational Molecular Biology

Problem Definition

Given a finite alphabet Σ = {a1, …, aw}

Scoring matrix M = (si,j), of size (w+1) × (w+1), with 0 ≤ i, j ≤ w

For i, j > 0, si,j represents the penalty for aligning ai with aj

For i > 0, s0,i and si,0 are called indel penalties

Gap opening penalties (in addition to the indel penalties) for aligning ai with the first or last ∆ in the string of ∆’s

Page 28: Computational Molecular Biology

Generic Scoring Matrix

Where Σ={A,T}, x, y, z are fixed nonnegative numbers and u > max{0, vA, vT} holds

• Let M2 be the class of all scoring matrices that contain a generic submatrix M

• Let M1 be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix M with z > vT.

• Let M be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix M with y > u and z > vT.

Theorem 1:
(a) The gap-0-1 multiple alignment problem is NP-hard for every scoring matrix M in M2.
(b) The gap-0 multiple alignment problem is NP-hard for every M in M1.
(c) The multiple alignment problem is NP-hard for every M in M.

Note that M is quite broad and covers most scoring schemes used in biological applications.

Page 29: Computational Molecular Biology

Reduction

Reduce from MAX-CUT-B:
    Given: G=(V,E) where k=|V| and each vertex has degree at most B
    Find: a partition of V into two disjoint sets that maximizes the number of edges crossing between the two sets

Given a graph G=(V,E) with k vertices v0, …, vk-1 and l edges e0, …, el-1, we will construct a set of k^2 sequences t0, …, t(k^2 − 1) as follows:

Page 30: Computational Molecular Biology

Reduction

For each vertex vi, construct a sequence ti such that for each edge em={vh, vi} incident at vi, with h < i and n < k^5,

set

where ti,j represents the character at the jth position in ti.

For other j, let ti,j = T

For i ≥ k, set ti = T T T … T with length k^12 l

Page 31: Computational Molecular Biology

An Example

Page 32: Computational Molecular Biology

Proof of Theorem 1(a)

We will show that a gap-0-1 alignment partitions V into two disjoint subsets V0 and V1:
    V0: all vertices vi such that ti remains in place (a space is appended at the end)
    V1: all vertices vi such that ti shifts to the right

Thus, based on the alignment, we can find the cut. Conversely, based on the cut, we can find the alignment

What remains is to prove that if k is sufficiently large, the optimal gap-0-1 alignment yields a partition of V with maximum edge cut.

Page 33: Computational Molecular Biology

Proof of Theorem 1(a)

Let c denote the cut based on the alignment A

Consider all the sequences ti after the alignment A:

    The total indel penalty is of order O(k^4) (it appears in the first and last columns of the SP-score matrix)

    The total number of mismatches before the alignment is 3k^5 l (k^2 − 1). To maximally reduce this number:
        1 A-A match removes 2 A-T mismatches
        For each edge (vh, vi), if its endpoints are in different subsets (of the partition), then a total of k^5 A-A matches between sequences th and ti are created
        No other A-T mismatches can be eliminated

Thus the SP-score is: k^12 l vT k^2 (k^2 − 1)/2 + 3k^5 l (u − vT)(k^2 − 1) − c k^5 (2u − vA − vT) + O(k^4)

Page 34: Computational Molecular Biology

Theorem 2

Consider the following scoring matrix M0 for the alphabet Σ0 = {A,T,C}.

(a) The gap-0-1 MSA problem is MAX-SNP-hard

(b) The gap-0 MSA problem is MAX-SNP-hard

(c) The MSA problem is MAX-SNP-hard

Page 35: Computational Molecular Biology

MAX-SNP-hard Proof

To prove that problem A’ is MAX-SNP-hard, we need to L-reduce a problem A, which is known to be MAX-SNP-hard, to A’

L-reduction: there are two polynomial-time algorithms f, g and constants a, b > 0 such that for each instance I of A:
    f produces an instance I’ = f(I) of A’ such that OPT(I’) ≤ a·OPT(I)
    Given any solution of I’ with cost c’, g produces a solution of I with cost c such that |c − OPT(I)| ≤ b|c’ − OPT(I’)|

Page 36: Computational Molecular Biology

Proof of Theorem 2

To prove that MSA (with M0, the scoring matrix mentioned before) is MAX-SNP-hard:
    L-reduce MAX-CUT-B to another optimization problem, called A’, which in turn L-reduces to a scaled version of MSA

Problem A’:
    Given a graph G=(V,E) with bounded degree B.
    For every partition P={V0, V1}, let cp be the size of the cut determined by P.
    Find the partition P of V that minimizes dp = 3|E| − 2cp

Page 37: Computational Molecular Biology

Show A’ is MAX-SNP-hard

Let f and g be identity functions

Set a = 3B and b = 2. The two properties of the L-reduction then follow easily, since:
    cp ≥ |E|/B and dp = 3|E| − 2cp ≤ 3|E|
    Any increase of cp by 1 decreases dp by 2
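The relation between the cut size cp and the objective dp = 3|E| − 2cp of problem A’ can be explored by brute force on a tiny graph. The sketch below is exponential in |V| and only illustrates that minimizing dp and maximizing cp select the same partition; the graph encoding and function names are assumptions.

```python
from itertools import product


def cut_size(edges, side):
    """Number of edges whose endpoints lie on different sides of the partition."""
    return sum(1 for u, v in edges if side[u] != side[v])


def best_partition(num_vertices, edges):
    """Return (dp, cp, side) minimizing dp = 3|E| - 2*cp over all 2^|V| partitions."""
    best = None
    for side in product((0, 1), repeat=num_vertices):
        cp = cut_size(edges, side)
        dp = 3 * len(edges) - 2 * cp
        if best is None or dp < best[0]:
            best = (dp, cp, side)
    return best


# A 4-cycle: every edge can be cut by alternating the sides.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
dp, cp, side = best_partition(4, cycle_edges)
print(dp, cp, side)  # 4 4 (0, 1, 0, 1)
```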

Page 38: Computational Molecular Biology

Show that A’ L-reduces to scaled MSA

Similar to the above construction, we have:

Page 39: Computational Molecular Biology

Similar to the proof of Theorem 1, we obtain the optimal SP-score.

If the SP-score is scaled by a factor of k^(−5/2) for an MSA of k sequences, then A’ L-reduces to MSA.

Page 40: Computational Molecular Biology

GENETIC ALGORITHMS

Page 41: Computational Molecular Biology

How do GAs work?

Create a population of random solutions

Use natural selection: crossover and mutation to improve the solutions

Stop the operation when some criterion is satisfied, such as:
    No improvement in the fitness function
    The improvement is less than a certain threshold
    The number of iterations exceeds a certain threshold

Page 42: Computational Molecular Biology

Terms and Definitions

Chromosomes: Potential solutions

Population: Collection of chromosomes

Generations: Successive populations

Page 43: Computational Molecular Biology

Terms and Definitions

Crossover: Exchange of genes between two chromosomes

Mutation: Random change of one or more genes in a chromosome

Elitism: Copy the best solutions without doing crossover or mutation.

Page 44: Computational Molecular Biology

Terms and Definitions

Offspring: New chromosome created by crossover between two parent chromosomes

Fitness function: Measures how “good” a chromosome is.

Encoding scheme: How do we represent every chromosome/gene? Binary, combination, syntax trees.

Page 45: Computational Molecular Biology

Why are GAs attractive?

No need for a particular algorithm to solve the given problem. Only the fitness function is required to evaluate the quality of the solutions.

Implicitly a parallel technique that can be implemented efficiently on powerful parallel computers for demanding large-scale problems.

Page 46: Computational Molecular Biology

Basic Outline of a GA

Initial population composed of random chromosomes, called the first generation

Evaluate the fitness of each chromosome in the population

Create a new population:
    Select two parent chromosomes from the population according to their fitness
    Crossover (with some probability) to form a new offspring
    Mutation (with some probability) to mutate the new offspring
    Place the new offspring in the new population

The process is repeated until a satisfactory solution evolves
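The outline above can be sketched as a small program on binary chromosomes. The slides do not fix the selection scheme, the encoding, or any parameter values, so tournament selection, one-point crossover, bit-flip mutation, a single elite copy, a fixed generation count as the stopping rule, and all default parameters below are assumptions chosen for illustration.

```python
import random


def run_ga(fitness, length, pop_size=20, generations=50,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Evolve bit-list chromosomes; returns (best chromosome, best fitness per generation)."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    # First generation: random chromosomes.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        new_population = [ranked[0][:]]  # elitism: copy the best chromosome unchanged
        while len(new_population) < pop_size:
            # Selection according to fitness (tournament of size 3, an assumption).
            parent1 = max(rng.sample(population, 3), key=fitness)
            parent2 = max(rng.sample(population, 3), key=fitness)
            child = parent1[:]
            if rng.random() < crossover_rate:  # one-point crossover
                cut = rng.randint(1, length - 1)
                child = parent1[:cut] + parent2[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            new_population.append(child)
        population = new_population
    return max(population, key=fitness), history


# Toy fitness: the number of ones in the chromosome.
best, history = run_ga(sum, length=16)
print(sum(best), history[0], history[-1])
```

Because one elite copy survives every generation, the recorded best fitness never decreases; without elitism that guarantee is lost.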

Page 47: Computational Molecular Biology
Page 48: Computational Molecular Biology

Operations

Mutation Operation:
• Modify a single parent
• Try to avoid local minima

Page 49: Computational Molecular Biology

Let's see some running examples

Minimum of a function: http://cs.felk.cvut.cz/~xobitko/ga/example_f.html

Elitism: http://cs.felk.cvut.cz/~xobitko/ga/params.html

The travelling salesman problem: http://cs.felk.cvut.cz/~xobitko/ga/tspexample.html

Page 50: Computational Molecular Biology

Multiple Sequence Alignment

The fitness function is used to compare the different alignments
    Based on the number of matching symbols and the number and size of gaps

Also called the cost function
    Different weights for different types of matches
    Gap costs

The fitness function:
    can be simple and count the total matching symbols
    can be complicated and consider the type of matching symbols, location in the sequence, neighboring symbols etc.
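One possible fitness function of the kind described here rewards matching symbols and penalizes gaps by their number and size. The weights `match_weight`, `gap_open` and `gap_extend`, their default values, and the function name are hypothetical; the slides do not specify a formula.

```python
import re
from itertools import combinations

GAP = "-"


def alignment_fitness(rows, match_weight=1.0, gap_open=2.0, gap_extend=0.5):
    """Higher is better: weighted matches minus a penalty per gap run and per gap symbol."""
    # Count matching (non-gap) symbol pairs column by column.
    matches = sum(1 for column in zip(*rows)
                  for a, b in combinations(column, 2)
                  if a == b and a != GAP)
    penalty = 0.0
    for row in rows:
        for run in re.findall(r"-+", row):  # each maximal run of spaces is one gap
            penalty += gap_open + gap_extend * len(run)
    return match_weight * matches - penalty


rows = ["AC-T", "A-GT", "ACGT"]
print(alignment_fitness(rows))  # 3.0
```

A genetic algorithm for MSA would evaluate this function on each candidate alignment in the population; richer variants could weight matches by symbol type or position.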

Page 51: Computational Molecular Biology

Approximation Algorithms

Page 52: Computational Molecular Biology

Scoring method

Score zero for a match or for two opposing spaces

Score one for a mismatch or for a character opposite a space

Page 53: Computational Molecular Biology

Assumptions:

Assume that two opposing spaces have a zero value

Assume that the other values satisfy the triangle inequality:
    s(x,z) ≤ s(x,y) + s(y,z)
    s(x,z) is the cost of transforming character x into character z
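With a score of zero for a match and one for a mismatch or a character opposite a space, the optimal pairwise alignment value D(X,Y) is the classical edit distance. A minimal sketch follows, assuming the standard dynamic program; the function name is illustrative.

```python
def edit_distance(x, y):
    """Optimal pairwise alignment value D(x, y) under the 0/1 scoring."""
    prev = list(range(len(y) + 1))  # aligning the empty prefix of x costs j indels
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j - 1] + (a != b),  # match (0) or mismatch (1)
                           prev[j] + 1,             # a opposite a space
                           cur[j - 1] + 1))         # b opposite a space
        prev = cur
    return prev[-1]


print(edit_distance("ATCA", "CTCAA"))  # 2

# D inherits the triangle inequality from the character costs.
x, y, z = "ATCA", "CTCAA", "TTCA"
print(edit_distance(x, z) <= edit_distance(x, y) + edit_distance(y, z))  # True
```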

Page 54: Computational Molecular Biology

Objective Functions

Two objective functions:

SP
    The sum of the values of the pairwise alignments induced by an alignment A

TA
    Using the topology of the tree, map the strings to the nodes of the tree
    The sum of the selected pairwise alignments is called the tree alignment

Page 55: Computational Molecular Biology

Center Star Method

For a set X of k strings:
    Choose a center string Xc of X which minimizes Σj≠c D(Xc,Xj)
    Let M = min Σj≠c D(Xc,Xj)

The center star is a star tree of k nodes with the center node labeled Xc and each of the k−1 remaining nodes labeled by a distinct string in X \ {Xc}

If Xi and Xj are strings labeling adjacent nodes of tree T, then the alignment of Xi and Xj induced by A(T) has value D(Xi,Xj)

Page 56: Computational Molecular Biology

Center Star Method – Alg Ac

Do an optimal alignment for each pair (Xc, Xj) for all j ≠ c

s0 = max number of spaces placed before the first char of Xc

sf = max number of spaces placed after the last char of Xc

si = max number of spaces placed between Xc(i) and Xc(i+1)

Page 57: Computational Molecular Biology

Center Star Method – Alg Ac

For Xc, insert s0, si, and sf spaces at the beginning, between Xc(i) and Xc(i+1), and at the end of Xc, respectively. Call the result X’c

Then, for each Xj, compute the optimal alignment of Xj with X’c without modifying X’c
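A compact sketch of the center star method under the 0/1 scoring. Instead of first computing s0, si and sf and then realigning every string against X’c, it merges the optimal pairwise alignments one at a time, adding a gap column to all rows whenever a pairwise alignment needs a space in the center that is not there yet. This "once a gap, always a gap" merge is a swapped-in but commonly used formulation; like the slide's version it leaves every induced alignment with the center optimal. All function names are assumptions.

```python
from itertools import combinations

GAP = "-"


def align(x, y):
    """Optimal global alignment under unit cost; returns (cost, aligned x, aligned y)."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ax.append(x[i - 1]); ay.append(GAP); i -= 1
        else:
            ax.append(GAP); ay.append(y[j - 1]); j -= 1
    return D[n][m], "".join(reversed(ax)), "".join(reversed(ay))


def center_star(strings):
    """Return (index of the center string, aligned rows in input order)."""
    k = len(strings)
    cost = [[align(a, b)[0] for b in strings] for a in strings]
    c = min(range(k), key=lambda i: sum(cost[i]))  # minimizes the sum of D(Xc, Xj)
    rows = {c: strings[c]}
    for j in range(k):
        if j == c:
            continue
        _, ac, aj = align(strings[c], strings[j])
        master = rows[c]
        merged = {i: [] for i in rows}
        new_row, p, q = [], 0, 0
        while p < len(master) or q < len(ac):
            if p < len(master) and q < len(ac) and master[p] == ac[q]:
                for i in rows:                 # same center symbol (or same gap)
                    merged[i].append(rows[i][p])
                new_row.append(aj[q]); p += 1; q += 1
            elif p < len(master) and master[p] == GAP:
                for i in rows:                 # gap already present in the center
                    merged[i].append(rows[i][p])
                new_row.append(GAP); p += 1
            else:
                for i in rows:                 # new gap in the center: pad all rows
                    merged[i].append(GAP)
                new_row.append(aj[q]); q += 1
        rows = {i: "".join(v) for i, v in merged.items()}
        rows[j] = "".join(new_row)
    return c, [rows[i] for i in range(k)]


strings = ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"]
c, rows = center_star(strings)
for row in rows:
    print(row)
```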

Page 58: Computational Molecular Biology

Analysis

d(Xi,Xj) ≥ D(Xi,Xj)

V(Ac) = Σi<jd(Xi,Xj)

V(Ac) is at most twice the value of the optimal multiple alignment of X

Page 59: Computational Molecular Biology

Analysis

Lemma 3.1: For any 2 strings Xi,Xj, we have:

d(Xi,Xj) ≤ d(Xi,Xc) + d(Xc,Xj)   (triangle inequality)

         = D(Xi,Xc) + D(Xc,Xj)

Page 60: Computational Molecular Biology

Analysis

Let A* be the optimal multiple alignment of the k strings X

Define: V(A*) = Σi<j d*(Xi,Xj)

Page 61: Computational Molecular Biology

Analysis

Theorem 3.1

V(Ac) / V(A*) ≤ 2(k-1)/ k < 2

Proof:
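The body of the proof is not reproduced in this transcript. The standard argument, sketched here from the definitions on the preceding slides (with M = Σj≠c D(Xc,Xj), and sums over ordered pairs i ≠ j counting each unordered pair twice), runs as follows:

```latex
\begin{align*}
2\,V(A_c) &= \sum_{i \neq j} d(X_i, X_j)
  \;\le\; \sum_{i \neq j} \bigl[ D(X_i, X_c) + D(X_c, X_j) \bigr]
  \;=\; 2(k-1)\,M
  && \text{(Lemma 3.1)} \\
2\,V(A^*) &= \sum_{i \neq j} d^*(X_i, X_j)
  \;\ge\; \sum_{i \neq j} D(X_i, X_j)
  \;=\; \sum_{i} \sum_{j \neq i} D(X_i, X_j)
  \;\ge\; k\,M
  && \text{(choice of } X_c \text{)} \\
\frac{V(A_c)}{V(A^*)} &\le \frac{2(k-1)\,M}{k\,M} \;=\; \frac{2(k-1)}{k} \;<\; 2 .
\end{align*}
```

The first line uses that each D(Xc,Xj) appears k−1 times on each side of the bracket; the second uses that an induced pairwise alignment cannot beat the optimal one and that Xc minimizes the star sum.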

Page 62: Computational Molecular Biology

Disadvantages

Requires all pairwise alignments
    Computationally expensive

Faster, randomized alignments:
    Randomly select a string Xi
    Build a multiple alignment with the star centered at Xi
    Select the best multiple alignment A from p such stars
    At most (k−1)p pairwise alignments need to be computed

Page 63: Computational Molecular Biology

Randomized Alignments

Theorem 3.2: For any r > 1, let e(r) be the expected number of stars that need to be chosen at random before the value of the best resulting alignment is within a factor of 2+1/(r-1) of the optimal alignment. Then e(r) ≤ r.

e(r) is independent of k and the length of the strings.

Page 64: Computational Molecular Biology

Proof of Theorem 3.2

For r = 2, for each string Xi define M(i) = Σj D(Xi,Xj); then M(c) = M

From Theorem 3.1, Σ(i,j) D(Xi,Xj) = Σi M(i) ≤ 2(k-1)M, so the average value of M(i) is < 2M

Since min M(i) = M, the median of M(i) is < 3M

The expected number of centers selected before one with M(i) below the median is chosen is 2

Page 65: Computational Molecular Biology

Proof

Suppose the median is ∂M for 1 ≤ ∂ ≤ 3

Then Σ(i,j) D(Xi,Xj) ≥ kM/2 + k∂M/2

The value of the alignment obtained from any below-median star is ≤ 2(k-1)∂M

Therefore, the error ratio for this star is ≤ 2(k-1)∂M / (kM/2 + k∂M/2) < 2∂ / (1/2 + ∂/2)

When ∂ = 3, the error ratio is 3. So we have e(2) ≤ 2

Page 66: Computational Molecular Biology

Proof

Now generalize this proof for r > 2:
    At least k/r stars have M(i) less than or equal to (2r-1)M/(r-1)
    Minimum M(i) is M
    Mean < 2M

The expected number of stars to pick before one has M(i) < ∂M is r, for 1 ≤ ∂ ≤ (2r-1)/(r-1)

error ratio ≤ 2∂/[1/r + (r-1)∂/r] ≤ (2r-1)/(r-1) = 2 + 1/(r-1)

Page 67: Computational Molecular Biology

Theorem 3.3

Picking p stars at random, the best resulting alignment will have value within a factor of 2 + 1/(r-1) of the optimal with probability at least

1 − [(r-1)/r]^p
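The bound of Theorem 3.3 is easy to tabulate. The sketch below evaluates the probability 1 − [(r−1)/r]^p and finds the smallest number of random stars p that reaches a target confidence; the function names and the example target of 0.95 are illustrative assumptions.

```python
def success_probability(r, p):
    """Lower bound on the probability that p random stars give a (2 + 1/(r-1))-approximation."""
    return 1 - ((r - 1) / r) ** p


def stars_needed(r, target):
    """Smallest p whose bound reaches the target probability (0 < target < 1)."""
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    p = 1
    while success_probability(r, p) < target:
        p += 1
    return p


# r = 2 corresponds to an approximation factor of 2 + 1/(2-1) = 3.
print(success_probability(2, 5))  # 0.96875
print(stars_needed(2, 0.95))      # 5
```

Note that neither quantity depends on k or on the string lengths, in line with the remark after Theorem 3.2.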

Page 68: Computational Molecular Biology

Center Star Method

Proof

From Theorem 3.2, suppose the median value were actually 3M:
    For half the stars M(i) = M, and M(i) = 3M for the other half
    Σ(i,j) D(Xi,Xj) = 2kM
    The optimal SP alignment can be obtained from any center string Xi with M(i) = M
    The probability of selecting such a string is one-half

Page 69: Computational Molecular Biology

Tree Alignment Method

Typical approach: first find a multiple alignment and then build a tree showing the evolutionary derivations

Another approach (called tree alignment):
    First choose the topology of the tree and then map the strings to the nodes of the tree
    The alignment consists of the pairwise alignments of the strings at the ends of the edges of the tree

Page 70: Computational Molecular Biology

Formal Definitions

Let K be an input set of k strings

Let K’ ⊇ K be a set of strings containing K

An evolutionary tree TK’ for K is a tree:
    with at least k nodes
    in which each string in K’ labels exactly one node and each node gets exactly one label in K’

The value of TK’: V(TK’) = Σ D(X,Y), summed over the edges (X,Y) of TK’

The problem is to find a set of strings K’ and a tree TK’ for K which minimizes V(TK’)

Page 71: Computational Molecular Biology

The alignment value D(X,Y ) is interpreted as the minimum “cost" to transform string X to string Y

The sum of the alignment values of the edges gives the evolutionary cost implied by the tree.

Page 72: Computational Molecular Biology

Method

Let G be a complete graph with k nodes, each labeled with a distinct string in K

Each edge (X,Y) has weight D(X,Y)

Find the MST of G. This MST is an evolutionary tree for K
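The method can be sketched with Prim's algorithm on the complete graph of pairwise alignment values. The 0/1 scoring (edit distance) for D and the choice of Prim's algorithm are assumptions; any MST algorithm would serve.

```python
def edit_distance(x, y):
    """D(x, y) under 0/1 scoring (assumed here as the edge weight)."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def mst_evolutionary_tree(strings):
    """Prim's algorithm; returns the tree edges as (i, j, D(Xi, Xj)) index triples."""
    k = len(strings)
    dist = [[edit_distance(a, b) for b in strings] for a in strings]
    in_tree = {0}
    edges = []
    while len(in_tree) < k:
        # Cheapest edge leaving the current tree.
        u, v = min(((u, v) for u in in_tree for v in range(k) if v not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        edges.append((u, v, dist[u][v]))
        in_tree.add(v)
    return edges


strings = ["AAAA", "AAAT", "AATT", "TTTT"]
tree = mst_evolutionary_tree(strings)
print(tree)                        # [(0, 1, 1), (1, 2, 1), (2, 3, 2)]
print(sum(w for _, _, w in tree))  # V(MST) = 4
```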

Page 73: Computational Molecular Biology

Analysis

Let T* denote the optimal evolutionary tree for K.

Prove: V(MST)/V(T*) < 2

Let C be a traversal of the edges of T* which traverses every edge exactly once in each direction

Let C1, …, Ck be the strings of K in the order in which C encounters them

Let V(C) = D(Ck,C1) + Σi<k D(Ci,Ci+1)

Page 74: Computational Molecular Biology

Analysis

Page 75: Computational Molecular Biology

Analysis

Corollary 4.1: V(C) ≤ 2V(T*)

Let D(Ci*,Ci*+1) be the largest distance between any adjacent strings in the traversal C

Lemma 4.2: V(MST) ≤ V(C) − D(Ci*,Ci*+1) ≤ V(C) − V(C)/k

Page 76: Computational Molecular Biology

Analysis

Theorem 4.1: For any set K of k strings, we have V(MST)/V(T*k) ≤ 2(k-1)/k < 2

Theorem 4.2: V(MST)/V(T*k) ≤ (k-1)/k · V(C)/V(T*k) ≤ 2(k-1)/k

Corollary 4.2: V(T*k) ≥ kV(MST)/(2(k-1))

Page 77: Computational Molecular Biology

Constrained MSA

Page 78: Computational Molecular Biology

Motivation

General SP MSA problem:
    NP-completeness has already been established
    Approximation algorithms have been developed
    Heuristics are also available

Constrained MSA:
    Biologists often have additional knowledge of the data (e.g. active site residues)
    Additional knowledge can specify matches at certain locations
    Models allow users to provide additional constraints

Page 79: Computational Molecular Biology

Definition of CMSA Problem

Suppose that P = p1p2 … pα is a common subsequence of S = {S1, S2, …, Sk}

The constrained multiple sequence alignment of S with respect to P is an MSA A with the constraint that there are α columns c1, c2, …, cα in A, with c1 < c2 < … < cα, such that the characters of column ci, 1 ≤ i ≤ α, are all equal to pi.

Page 80: Computational Molecular Biology

Optimal CPSA

Page 81: Computational Molecular Biology

Dynamic Algorithm

Page 82: Computational Molecular Biology

Time and Space Complexities

Page 83: Computational Molecular Biology

CMSA

The improvement of CPSA in turn improves the time and space complexity of Progressive CMSA from O(αkn^4) and O(αn^4) to O(αk^2 n^2) and O(αn^2).

Optimal CMSA

This Optimal CMSA algorithm involves the creation of a matrix with k+1 dimensions. (Assume δ(x,y) is the distance function and satisfies the triangle inequality.)

Let D(i1, …, ik; γ) be the optimal CMSA score for {S1[1..i1], …, Sk[1..ik]} where P[1..γ] is aligned in γ columns. Then the optimal alignment score is D(n1, …, nk; α), where ni = |Si|.

Computing D:
    D({0}^k; 0) = 0
    Let εj = 0 or 1, where εjSj[ij] with εj = 0 represents a space, and δ(x1, …, xk) = Σ1≤i<j≤k δ(xi, xj).
    D(i1, i2, …, ik; γ) is the minimum of:
        if S1[i1] = … = Sk[ik] = P[γ]: D(i1 − 1, …, ik − 1; γ − 1) + δ(S1[i1], …, Sk[ik])
        min over ε ∈ {0,1}^k of (D(i1 − ε1, …, ik − εk; γ) + δ(ε1S1[i1], …, εkSk[ik]))

These values can be computed using dynamic programming.
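For k = 2 the recurrence becomes a three-dimensional table D(i, j; γ). The sketch below specializes it to two sequences with the unit cost as δ (an assumption; any distance function with δ(∆,∆) = 0 could be passed in). The all-zero choice of ε is skipped because it would refer the entry back to itself. The function name is illustrative.

```python
import math

GAP = "-"


def constrained_alignment_cost(s1, s2, p, delta=lambda a, b: 0 if a == b else 1):
    """Optimal cost of aligning s1 and s2 so that p is spelled out by alpha all-equal columns.

    Returns math.inf if p is not a common subsequence of s1 and s2.
    """
    n1, n2, alpha = len(s1), len(s2), len(p)
    INF = math.inf
    # D[i][j][g]: best cost for s1[:i], s2[:j] with p[:g] already placed in g columns.
    D = [[[INF] * (alpha + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for g in range(alpha + 1):
                if i == 0 and j == 0 and g == 0:
                    continue
                best = INF
                if i and j:
                    pair = delta(s1[i - 1], s2[j - 1])
                    best = min(best, D[i - 1][j - 1][g] + pair)          # epsilon = (1, 1)
                    if g and s1[i - 1] == s2[j - 1] == p[g - 1]:
                        best = min(best, D[i - 1][j - 1][g - 1] + pair)  # constrained column
                if i:
                    best = min(best, D[i - 1][j][g] + delta(s1[i - 1], GAP))  # epsilon = (1, 0)
                if j:
                    best = min(best, D[i][j - 1][g] + delta(GAP, s2[j - 1]))  # epsilon = (0, 1)
                D[i][j][g] = best
    return D[n1][n2][alpha]


print(constrained_alignment_cost("AAB", "BAA", ""))   # unconstrained: 2
print(constrained_alignment_cost("AAB", "BAA", "B"))  # forcing the B's together: 4
```

The table has (n1+1)(n2+1)(α+1) entries with constant work per entry, which matches the O(αn^2) bound quoted above for the pairwise case.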

Page 84: Computational Molecular Biology

CMSA (Center Star)

The Center-Star method proposed for the general MSA problem can be modified to apply to the CMSA problem.

Consider each sequence as the center, Sc. Consider each list of positions at which Sc is aligned with P.

Find the minimum star-sum score for Sc.

Create a constrained alignment matrix by merging the constrained pairwise sequence alignments between Sc and Sj.

Page 85: Computational Molecular Biology

CMSA (Center Star)

The recurrence of Thm. 3.1 is only slightly modified:

Page 86: Computational Molecular Biology

Example
