Regular Expression Constrained Sequence Alignment By Abdullah N. Arslan Department of Computer science University of Vermont Presented by Tomer Heber & Raz Nissim


Oct 01, 2020



Regular Expression Constrained Sequence

Alignment

By Abdullah N. Arslan

Department of Computer science

University of Vermont

Presented by Tomer Heber & Raz Nissim


Motivation

• When comparing two proteins, it may be important to take into account a common specific structure.

• Families of similar protein sequences include a conserved region called a motif.

• The PROSITE database is a collection of these motifs, represented as regular expressions.

• Our problem concentrates on sequence alignment of proteins that contain these motifs and need to be aligned according to them.


Formalization of the problem

The Optimal Sequence Alignment Problem:

• String Alignment:

“Given two strings s and t, a global alignment is obtained by inserting spaces into s and t so that the characters of the resulting strings can be put in one-to-one correspondence to each other.”

H - A S K E L L

P A A - C A - L

* Two characters correspond/match each other.

• The “Optimal Sequence Alignment” is a string alignment that has the maximum number of characters that correspond to each other.


Example:

Given two strings:

S1 = TGFPSVGKTKDDA

S2 = TFSVAKDDDGKSA

The optimal sequence alignment score for these two strings is 8.

* By adding gaps to both strings, 8 characters correspond to each other: T, F, S, V, K, D, D, A. No alignment has 9 or more corresponding characters, so this is the optimal solution.
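
The count of 8 can be checked with a short longest-common-subsequence routine. This is a minimal sketch, assuming (as the definition above does) that the score is simply the number of corresponding characters; the function name `lcs_length` is hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    # prev[j] is the LCS length of the already processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                cur.append(prev[j - 1] + 1)            # extend a match
            else:
                cur.append(max(prev[j], cur[j - 1]))   # skip a character (a gap)
        prev = cur
    return prev[-1]


print(lcs_length("TGFPSVGKTKDDA", "TFSVAKDDDGKSA"))  # 8
```

Only two rows of the table are kept, which is enough when only the score (not the alignment itself) is needed.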


Regular Expression Constrained Sequence Alignment (RECSA)

Definition: Given strings S1, S2 and a regular expression R, RECSA is the problem of finding the optimal global sequence alignment between S1 and S2 in which there exist a substring s1 of S1 and a substring s2 of S2 such that the following constraints are satisfied:

1. Substring s1 is aligned with substring s2 (local alignment).

2. Both s1 and s2 match the expression R: $s_1, s_2 \in L(R)$.


Example:

Given two strings and a regular expression:

S1 = TGFPSVGKTKDDA

S2 = TFSVAKDDDGKSA

The optimal global sequence alignment score constrained by the regular expression R is 4:

* s1 = GFPSVGKT is a substring of S1, and s2 = AKDDDGKS is a substring of S2. s1 and s2 are aligned with each other and both match the regular expression R. 4 characters correspond to each other in strings S1 and S2.
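
The regular expression R itself is not reproduced in this transcript. As an illustration only, the sketch below assumes R is the PROSITE-style pattern [AG]-x(4)-G-K-[ST], which both substrings happen to match; the names `MOTIF`, `S1`, `S2`, `s1` and `s2` are chosen for this example.

```python
import re

# Assumed constraint (not taken from the slides): [AG]-x(4)-G-K-[ST] as a Python regex.
MOTIF = re.compile(r"[AG].{4}GK[ST]")

S1 = "TGFPSVGKTKDDA"
S2 = "TFSVAKDDDGKSA"
s1 = "GFPSVGKT"
s2 = "AKDDDGKS"

# Constraint check: s1 and s2 are substrings, and each one matches the motif in full.
print(s1 in S1, s2 in S2)                                      # True True
print(bool(MOTIF.fullmatch(s1)), bool(MOTIF.fullmatch(s2)))    # True True
```

`fullmatch` is used rather than `search` because the whole substring, not just a part of it, has to belong to L(R).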


Needleman–Wunsch algorithm

• This algorithm computes the maximum global alignment score of two strings of lengths n and m in O(nm) time.
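
A minimal sketch of the Needleman–Wunsch dynamic program follows. The scoring parameters `match`, `mismatch` and `gap` and their default values are assumptions for illustration; the slides do not fix concrete scores.

```python
def needleman_wunsch(s1: str, s2: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of s1 and s2, in O(nm) time and space."""
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap              # s1[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap              # s2[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,    # match or mismatch
                score[i - 1][j] + gap,         # gap added to s2
                score[i][j - 1] + gap,         # gap added to s1
            )
    return score[n][m]


# With match = 1 and free mismatches and gaps, the score is the number of
# corresponding characters, which is 8 for the example strings.
print(needleman_wunsch("TGFPSVGKTKDDA", "TFSVAKDDDGKSA", match=1, mismatch=0, gap=0))  # 8
```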


Needleman–Wunsch algorithm – Continued

• $A_{i,j}$ = The optimal global alignment score of the substrings $S_1[1..i]$, $S_2[1..j]$.

• Insertions and deletions are referred to as gaps.

• $\gamma(S_1[i] \to \varepsilon)$ = The price of adding a gap to S2.

• $\gamma(\varepsilon \to S_2[j])$ = The price of adding a gap to S1.

• $\gamma(S_1[i] \to S_2[j])$ = The price of a match or a mismatch of a character.

Question: Can we improve the runtime for finding an optimal global alignment?

Hint: LCS and optimal sequence alignment problems are equivalent.


Reminder

Given two strings S1, S2 and a regular expression R, we want to find two substrings s1 and s2 (of S1 and S2, respectively) that satisfy these constraints:

1. s1 is aligned with s2.

2. Both s1 and s2 match the regular expression R.

Previous example:


Review of Automata (State Machines)

An automaton is represented by the 5-tuple $(Q, \Sigma, \delta, q_0, F)$, where:

• Q - Set of states.

• ∑ - Finite set of symbols, which we will call the alphabet of the language the automaton accepts.

• δ - Transition function, that is, $\delta : Q \times \Sigma \to Q$.

• q0 - Start state, that is, the state in which the automaton is when no input has been processed yet (obviously, q0 ∈ Q).

• F - Set of states of Q (i.e. F⊆Q), called accept states.


Review of Automata (cont.)

• There exists a strong equivalence: for every regular language, there is a finite state automaton (DFA or NFA), and vice versa.

• Example- This automaton accepts the regular language: A (C + G)* (S + T)
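
The automaton drawn on the slide is not reproduced in this transcript. A minimal sketch of one automaton for A (C + G)* (S + T), with hypothetical state numbers 0, 1, 2, could look like this:

```python
# Transition table of a small NFA without epsilon-moves (state numbers are hypothetical).
DELTA = {
    (0, "A"): {1},                   # the leading A
    (1, "C"): {1}, (1, "G"): {1},    # (C + G)*
    (1, "S"): {2}, (1, "T"): {2},    # the closing (S + T)
}
START, ACCEPT = 0, {2}


def accepts(word: str) -> bool:
    """Track the set of states the automaton can be in while reading word."""
    current = {START}
    for symbol in word:
        current = {r for p in current for r in DELTA.get((p, symbol), set())}
    return bool(current & ACCEPT)


print(accepts("ACGGS"), accepts("AT"), accepts("ACG"))  # True True False
```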


Example of N x N “multiplication automaton”:


Constructing our Weighted Finite Automaton

We construct M from a given regular expression R in several steps:

(1)

First, given a regular expression R, we construct a nondeterministic finite automaton N=(Q,Σ,δ,q0,F), with no ε-moves, such that L(N) = L(R).

N accepts only the set of strings described by the regular expression R.


Constructing our Weighted Finite Automaton (cont.)

(2)

We define a weighted N × N automaton as the finite automaton $M = (Q_M, W_M, \Sigma_M, q_{0_M}, F_M)$, which we construct as follows:

• $Q_M = Q \times Q$ is the set of states. Each state of M corresponds to a pair of states in N. M remembers in each state what part of the regular expression has been seen in S1 and S2.

• $W_M : Q_M \to \mathbb{R}$ is a function that assigns real weights to each state in $Q_M$. Initially all weights are -∞.


Step 2 (cont.)

• $\Sigma_M = (\Sigma \cup \{\varepsilon\}) \times (\Sigma \cup \{\varepsilon\}) \setminus \{\varepsilon \to \varepsilon\}$. The alphabet for M is the set of edit operations, which does not include ε→ε.

• $q_{0_M} = (q_0, q_0)$ is the start state, whose weight is 0 and always stays 0.

• $F_M = F \times F$ is the set of final states. If M is in a final state then M has processed an alignment that satisfies the regular expression constraint (there are substrings s1 of S1 and s2 of S2 that are aligned, and both s1 and s2 take N to final states).
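
The components of M listed in step 2 can be sketched directly as sets and dictionaries. The function names and the use of `None` for ε are assumptions of this example; the transition function is left out here.

```python
import math
from itertools import product


def build_weighted_product(states, q0, finals):
    """State set, initial weights, start state and final states of M, built from N."""
    q_m = set(product(states, repeat=2))        # pairs of states of N
    w_m = {pq: -math.inf for pq in q_m}         # initially all weights are -infinity
    start = (q0, q0)
    w_m[start] = 0                              # the start state has weight 0
    f_m = set(product(finals, repeat=2))        # both components must be final
    return q_m, w_m, start, f_m


def edit_operations(alphabet):
    """All pairs a -> b over the alphabet plus epsilon (None), except epsilon -> epsilon."""
    symbols = list(alphabet) + [None]
    return {(a, b) for a in symbols for b in symbols if (a, b) != (None, None)}


q_m, w_m, start, f_m = build_weighted_product({0, 1, 2}, q0=0, finals={2})
print(len(q_m), w_m[start], f_m)     # 9 0 {(2, 2)}
print(len(edit_operations("AC")))    # 8
```

For an NFA with t states, M has t² states, which is what the run-time analysis later relies on.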


Step 2 – The Transition Function

$\delta_M : \Sigma_M \times Q_M \to Q_M$. M moves on edit operations:

• Once an alignment satisfies the regular expression constraint, i.e. once a final state is reached in M, the rest of the alignment does not alter the satisfaction of the constraint. (M has the option of staying in a final state on any input after that final state is reached.)


The main idea


Denotations

• For any given weighted N × N automaton M we denote by $M^{x \to y}$, for any $x \to y \in \Sigma_M$, a copy of the automaton M after making the move on $x \to y$.

• Given two weighted N × N automata M1 and M2, we define a commutative and associative operation max-M such that max-M{M1, M2} is a weighted N × N automaton M with state weights calculated as follows: for all $(p, q) \in Q_M$,

$W_M(p, q) = \max\{ W_{M_1}(p, q),\; W_{M_2}(p, q) \}$.
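
The max-M operation amounts to a state-wise maximum. A minimal sketch, assuming each automaton's weights are stored as a dictionary keyed by state pairs (the name `max_m` is hypothetical):

```python
import math


def max_m(w1: dict, w2: dict) -> dict:
    """State-wise maximum of two weight maps defined over the same state set."""
    if w1.keys() != w2.keys():
        raise ValueError("both automata must share the same state set")
    return {state: max(w1[state], w2[state]) for state in w1}


w1 = {("q0", "q0"): 0, ("q1", "q1"): 5, ("q1", "q3"): -math.inf}
w2 = {("q0", "q0"): 0, ("q1", "q1"): 2, ("q1", "q3"): 7}
print(max_m(w1, w2))  # {('q0', 'q0'): 0, ('q1', 'q1'): 5, ('q1', 'q3'): 7}
```

Because `max` is commutative and associative, so is `max_m`, which is why it extends to three automata without ambiguity.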


Denotations (cont.)

• We denote by S[i..j] the substring of S from positions i to j, i ≤ j. Let S[i] denote the i-th symbol of string S.

• We denote by $M_{i,j}$ the optimally weighted automaton for S1[1…i] and S2[1…j].


The Algorithm

• Let |S1| = n, |S2| = m with n ≥ m, let N be a non-deterministic automaton with no ε-moves equivalent to the regular expression R, and let M be a weighted N × N automaton constructed from N as we described.

• For all i, j, 0 ≤ i ≤ n, 0 ≤ j ≤ m, both $M_{i,0}$ and $M_{0,j}$ are identical weighted N × N automata whose state weights are all −∞ (except for the weight of the start state (q0,q0), which is always 0).


Algorithm (cont.)

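• Optimal automata $M_{i,j}$ are computed with this formula:

$M_{i,j} = \text{max-}M\{\, M_{i-1,j}^{S_1[i] \to \varepsilon},\; M_{i,j-1}^{\varepsilon \to S_2[j]},\; M_{i-1,j-1}^{S_1[i] \to S_2[j]} \,\}$

• $M_{i-1,j}^{S_1[i] \to \varepsilon}$ is the copy of the optimally weighted automaton $M_{i-1,j}$ after making the move $S_1[i] \to \varepsilon$. (We'll see an example later on…)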


Algorithm (cont.)

• After we finish calculating $M_{i,j}$ for all i, j, we'll focus on $M_{n,m}$ (the optimally weighted automaton for the full strings S1 and S2).

• We output the weight:

$\max\{\, W_{M_{n,m}}(p, q) \mid (p, q) \in F_{M_{n,m}} \,\}$

(the maximum of the weights of the accepting states in $M_{n,m}$).
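
The whole procedure can be sketched end to end under several stated assumptions, none of which come from the slides themselves: R is taken to be the motif [AG]-x(4)-G-K-[ST]; M moves component-wise (each component follows N on its symbol and stays put on ε); the start state and the final states have self-loops on every edit operation; and scoring counts 1 per matching pair and 0 otherwise. The sketch also departs from the statement that the start state's weight stays 0: here the start state's self-loop accumulates the score of the columns before s1 and s2, so that those columns count toward the global score. States missing from a dictionary are treated as having weight −∞.

```python
import math

NEG_INF = -math.inf


def motif_nfa(alphabet):
    """Hypothetical NFA N for the assumed motif [AG]-x(4)-G-K-[ST]; states 0..8."""
    delta = {}

    def arc(p, symbols, r):
        for a in symbols:
            delta.setdefault((p, a), set()).add(r)

    arc(0, "AG", 1)
    for p in range(1, 5):
        arc(p, alphabet, p + 1)          # x(4): any four symbols
    arc(5, "G", 6)
    arc(6, "K", 7)
    arc(7, "ST", 8)
    return delta, 0, {8}


def match_count(a, b):
    """Score of the edit operation a -> b (None stands for epsilon): 1 per match, else 0."""
    return 1 if a is not None and a == b else 0


def recsa_score(s1, s2, delta, q0, finals, gamma=match_count):
    def step(p, a):
        # N reads one symbol; on epsilon it stays where it is.
        return {p} if a is None else delta.get((p, a), set())

    def move(weights, a, b):
        """Weights of the copy of M after the edit operation a -> b."""
        out = {}
        for (p, q), w in weights.items():
            targets = {(p2, q2) for p2 in step(p, a) for q2 in step(q, b)}
            if (p, q) == (q0, q0) or (p in finals and q in finals):
                targets.add((p, q))      # assumed self-loops on start and final states
            for t in targets:
                out[t] = max(out.get(t, NEG_INF), w + gamma(a, b))
        return out

    def max_m(automata):
        merged = {}
        for weights in automata:
            for state, w in weights.items():
                merged[state] = max(merged.get(state, NEG_INF), w)
        return merged

    n, m = len(s1), len(s2)
    M = [[{} for _ in range(m + 1)] for _ in range(n + 1)]
    M[0][0] = {(q0, q0): 0}
    for i in range(n + 1):
        for j in range(m + 1):
            parents = []
            if i > 0:
                parents.append(move(M[i - 1][j], s1[i - 1], None))
            if j > 0:
                parents.append(move(M[i][j - 1], None, s2[j - 1]))
            if i > 0 and j > 0:
                parents.append(move(M[i - 1][j - 1], s1[i - 1], s2[j - 1]))
            if parents:
                M[i][j] = max_m(parents)
    # Output: the best weight among the accepting states of M[n][m].
    return max((w for (p, q), w in M[n][m].items()
                if p in finals and q in finals), default=NEG_INF)


S1, S2 = "TGFPSVGKTKDDA", "TFSVAKDDDGKSA"
delta, q0, finals = motif_nfa(set(S1 + S2))
print(recsa_score(S1, S2, delta, q0, finals))  # 4 under these assumptions
```

Rebuilding each weight dictionary from its parents means a state that no finite-weight parent reaches simply does not appear, which plays the role of a −∞ weight.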


Calculating the weights

• Upon making a move, these two steps must be followed:


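Calculating the weights (example)

[Figure: states (q0,q1) with V=80 and (q1,q1) with V=90 each have an arc labeled a into state (q1,q3) with V=?; the score of the move is γ(a) = −20.]

• Step 1:

(q1,q3) gets the value: $\max\{ W_M(q_0, q_1) + \gamma(a),\; W_M(q_1, q_1) + \gamma(a) \}$

In this case, max{80-20, 90-20} = 70.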


Calculating the weights (example cont.)

• Step 2: If (p,q) is reachable with the move b→a only from parents that have a weight of -∞, its weight is updated to -∞ as well.

• It is imperative that this step is done only after step 1 is completed; otherwise some updates could be missed.

[Figure: states (q0,q1) and (q1,q1), both with V=-∞, each have an arc labeled b→a into state (q1,q3), whose weight V=240 is to be updated.]
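
Both steps can be folded into one small routine if the new weights are computed from scratch rather than updated in place. This is a sketch, assuming the arcs enabled by the move are given as (parent, child) pairs; the name `apply_move` and the score of −20 used for the second example are hypothetical.

```python
import math

NEG_INF = -math.inf


def apply_move(weights, arcs, score):
    """New weights of the states reached by one move.

    weights: weight of each state before the move
    arcs:    (parent, child) pairs enabled by the move
    score:   score of the edit operation being made
    """
    new = {}
    for parent, child in arcs:
        candidate = weights.get(parent, NEG_INF) + score    # -inf stays -inf
        new[child] = max(new.get(child, NEG_INF), candidate)
    return new


arcs = [(("q0", "q1"), ("q1", "q3")), (("q1", "q1"), ("q1", "q3"))]

# Step 1 example: max{80 - 20, 90 - 20} = 70.
print(apply_move({("q0", "q1"): 80, ("q1", "q1"): 90}, arcs, -20))

# Step 2 example: every parent is -infinity, so the old value 240 is replaced by -infinity.
before = {("q0", "q1"): NEG_INF, ("q1", "q1"): NEG_INF, ("q1", "q3"): 240}
print(apply_move(before, arcs, -20))
```

Writing into a fresh dictionary sidesteps the ordering concern from the slide: stale values such as 240 are never read back.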


Optimally Weighted - definition

• $M_{i,j}$ is optimally weighted for S1[1..i] and S2[1..j] if the following two properties hold:


Proof of correctness

• Claim:

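For all i, j, $M_{i,j}$ computed by

$M_{i,j} = \text{max-}M\{\, M_{i-1,j}^{S_1[i] \to \varepsilon},\; M_{i,j-1}^{\varepsilon \to S_2[j]},\; M_{i-1,j-1}^{S_1[i] \to S_2[j]} \,\}$

is optimally weighted.

• Proof: By induction on nodes (i,j)…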


Proof of correctness (cont.)

• Base case:

i = 0: for all j the weights of $M_{0,j}$ are -∞ (or 0).

The two properties hold, and the claim is true.

• Assuming the claim is true for $M_{i-1,j}$, $M_{i,j-1}$ and $M_{i-1,j-1}$, we will show that the following automata are optimal:


Proof of correctness (Intuition)

• $M_{i,j}$'s values are computed by taking the maximum of its parents' values for each state and adding to them the score of one edit action.

• Optimality of $M_{i,j}$ follows from the assumed optimality of its parents: an optimal constrained alignment at node (i,j) uses one of these arcs, and we compute maximum scores over all possible optimal alignments.


Run-Time Analysis

• Computing each $M_{i,j}$ takes time O(r), where r is the size of the transition function of M.

* We note that r = O($t^4$), where t is the number of states in the automaton N accepting the language L(R). This is because M has O($t^2$) states and the transition function of a nondeterministic automaton is of size O(states^2).

• We need to compute O(nm) automata, and therefore we have a running time of O(nmr).
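
Substituting the bound on r gives the running time in terms of t. A short derivation, using only the quantities named above:

```latex
\begin{align*}
|Q_M| &= |Q \times Q| = t^2, \\
r &= O\!\left(|Q_M|^2\right) = O(t^4), \\
\text{total time} &= O(nm) \cdot O(r) = O(nm\,t^4).
\end{align*}
```

As an illustration with made-up sizes, an automaton N with t = 9 states yields a product automaton M with 81 states.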


Optimal alignment path reconstruction

• In order to reconstruct the optimal path, we can store for each active state of each automaton the neighboring automaton and the state in that automaton from which the optimal score is obtained.

• This is possible by altering the max-M function.

• Starting at $M_{n,m}$, we generate the alignment path in reverse.


• A detailed example will be uploaded for further reference.

Thanks

Questions??