Regular Expression Constrained Sequence Alignment

Abdullah N. Arslan, Assistant Professor

Computer Science Department

Outline

• Sequence alignment: common framework, DP solution, why constrained?

• RE constrained sequence alignment: algorithm

• Concluding Remarks

Alignment Matrix

Edit Graph

Dynamic Programming Solution

Hi,j: maximum score achieved at (i, j)

where Hi,j = 0 whenever i=0 or j=0,

Hn,m in O(nm) time, O(m) space
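The recurrence for this slide is not shown here, so the following is only a minimal sketch of how H_{n,m} can be computed in O(nm) time and O(m) space. It assumes a standard three-way recurrence with simple match/mismatch scores and a linear gap penalty; the function name and the parameters `match`, `mismatch`, and `gap` are illustrative assumptions. The zero boundary follows the condition stated above.

```python
def global_score(a, b, match=1, mismatch=-1, gap=1):
    """Return H[n][m], keeping only two rows of the DP table."""
    n, m = len(a), len(b)
    prev = [0] * (m + 1)              # row i-1; H[0][j] = 0 as stated above
    for i in range(1, n + 1):
        cur = [0] * (m + 1)           # H[i][0] = 0
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch  # assumed scoring
            cur[j] = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] - gap,     # a[i-1] against a gap
                         cur[j - 1] - gap)  # b[j-1] against a gap
        prev = cur                    # the older row is no longer needed
    return prev[m]
```

Only the score is returned; recovering the alignment itself in linear space would need a divide-and-conquer scheme such as Hirschberg's, which this sketch does not attempt.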

DP Solution: Local Alignment

Hi,j: similarity score achieved at (i, j)

where Hi,j = 0 whenever i=0 or j=0,

max Hi,j in O(nm) time, O(m) space
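As a rough illustration of the local variant, the sketch below assumes the usual Smith-Waterman style recurrence, in which every cell is also compared against 0 and the answer is the maximum over all cells. The scoring parameters and the function name are assumptions made for the example.

```python
def local_score(a, b, match=1, mismatch=-1, gap=1):
    """Return max over all (i, j) of H[i][j], using two rows (O(m) space)."""
    n, m = len(a), len(b)
    prev = [0] * (m + 1)
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch  # assumed scoring
            cur[j] = max(0,                 # start a fresh local alignment
                         prev[j - 1] + s,
                         prev[j] - gap,
                         cur[j - 1] - gap)
            best = max(best, cur[j])        # track the best cell seen so far
        prev = cur
    return best
```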

Dynamic Programming Formulation

Affine gap penalties: the penalty for a gap of length k is (gap-open penalty) + (k-1)·(gap-extension penalty)

where Hi,j = Fi,j = Ei,j = 0 when i=0 or j=0

max Hi,j O(nm) time, O(m) space
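A possible way to realize the affine-gap formulation is the three-table scheme usually attributed to Gotoh, sketched below for local alignment. Here E tracks alignments ending with a gap in the first sequence, F those ending with a gap in the second, and H the best of all cases. The names `gap_open` and `gap_extend` and the default scores are assumptions; a gap of length k costs `gap_open + (k - 1) * gap_extend`.

```python
def local_affine_score(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Return max H[i][j] under affine gap penalties, in O(m) space."""
    n, m = len(a), len(b)
    h_prev = [0] * (m + 1)   # H[i-1][*]
    f_prev = [0] * (m + 1)   # F[i-1][*]
    best = 0
    for i in range(1, n + 1):
        h_cur = [0] * (m + 1)
        f_cur = [0] * (m + 1)
        e = 0                # E[i][0] = 0, as in the boundary condition above
        for j in range(1, m + 1):
            # extend an open horizontal gap, or open a new one
            e = max(e - gap_extend, h_cur[j - 1] - gap_open)
            # extend an open vertical gap, or open a new one
            f_cur[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```

With the defaults, aligning "AAAACCCC" against "AAAAGGCCCC" pays 3 for the single gap of length 2 and keeps all eight matches.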

The Definition of the Constrained LCS Problem

• The constrained LCS (CLCS) problem: given strings S1, S2, and P,

• Find an LCS of S1 and S2 such that P is a subsequence of this LCS

• Motivation: computing the homology of two biological sequences that have a specific part in common
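The CLCS definition above can be made concrete with a three-index dynamic program. The sketch below is one straightforward O(nmr) formulation, where r is the length of P; it is not necessarily the exact recurrence of any of the cited papers. The entry for (i, j, k) holds the length of the longest common subsequence of S1[1..i] and S2[1..j] that contains P[1..k] as a subsequence. The function name and the choice to return None when no solution exists are assumptions.

```python
def clcs_length(s1, s2, p):
    """Length of a longest common subsequence of s1 and s2 that contains p
    as a subsequence, or None if no such common subsequence exists."""
    NEG = float("-inf")
    n, m, r = len(s1), len(s2), len(p)

    def empty_row():
        # with nothing matched yet, only k = 0 is feasible
        return [[0 if k == 0 else NEG for k in range(r + 1)]
                for _ in range(m + 1)]

    prev = empty_row()                      # row i-1
    for i in range(1, n + 1):
        cur = empty_row()                   # row i (column j = 0 stays as is)
        for j in range(1, m + 1):
            for k in range(r + 1):
                best = max(prev[j][k], cur[j - 1][k])   # skip a character
                if s1[i - 1] == s2[j - 1]:
                    # use the match without advancing in p
                    best = max(best, prev[j - 1][k] + 1)
                    if k > 0 and s1[i - 1] == p[k - 1]:
                        # use the match as the k-th character of p
                        best = max(best, prev[j - 1][k - 1] + 1)
                cur[j][k] = best
        prev = cur
    result = prev[m][r]
    return None if result == NEG else result
```

Taking the maximum over every applicable case keeps the sketch easy to check, at the cost of a few redundant comparisons.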

Constrained Sequence Alignment Problems

• Constrained LCS: Tsai 2003, O(n^2 m^2 r) time; Chin et al. 2004 and Arslan and Egecioglu 2004, O(nmr) time

• Edit-distance constrained sequence alignment: Arslan and Egecioglu 2004, O(dnmr) time

• Regular-expression constrained sequence alignment (this paper). Motivation: Comet and Henry, 2002; PROSITE patterns

PROSITE patterns as constraints

• PROSITE patterns are regular expressions with no Kleene closure

• PROSITE database example: [GA]-X(4)-G-K-[ST], the ATP/GTP-binding site motif A (P-loop) (PS00017)

• Comet and Henry reward alignments

• Regular expression constrained sequence alignment: find a maximal alignment that includes a given RE

Example: For [GA]-X(4)-G-K-[ST]
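To experiment with this pattern, it can help to translate it into an ordinary regular expression. The helper below is a hypothetical sketch that handles only a basic subset of PROSITE syntax: elements separated by "-", the wildcard x (or X), bracketed residue sets, braces for excluded residues, and repeat counts in parentheses. Anchors and other PROSITE features are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # split off an optional repeat count such as (4) or (2,4)
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core in ("x", "X"):
            core = "."                          # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # any residue except those listed
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)

regex = prosite_to_regex("[GA]-X(4)-G-K-[ST]")   # "[GA].{4}GK[ST]"
hit = re.search(regex, "MTGAPNSGKSTL")           # made-up sequence
print(regex, hit.group())                        # [GA].{4}GK[ST] GAPNSGKS
```

The sequence in the usage lines is invented for illustration and is not taken from any database.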

Using Edit Graph: e.g. A(C+G)*(S+T)

Automata for A(C+G)*(S+T)

Some Details of Automata Construction

• Build an NFA N equivalent to the given RE R

• Construct from N a new N×N automaton

• It moves on edit operations (or equivalently on alignment columns)

• States have weights

• Interested in the weights of the final states after the alignment is complete

Weighted Automaton

• Initial weights are −∞

• Weight of (q0,q0) is initially 0

• Update new maximum scores at reachable states

• Weights remain −∞ in unreachable states

• What are the maximum weights at the final states?

Computations on Automata

Complexity

• Simulate automata based on DP solution

Each step requires examining the transition functions

Maintain a list of active (reachable) states

Update state weights as alignments are formed

Automaton Mi,j has the optimum weights
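The steps above can be sketched as a DP in which every cell (i, j) carries a small weighted automaton, stored as a dictionary from active states to their best scores (missing states count as −∞). This is a simplified illustration under stated assumptions, not the paper's exact construction: the NFA has no ε-moves and is given as a hypothetical table `delta[(state, symbol)]`; the alignment is global with a linear gap penalty; and the constraint is read as "some run of consecutive alignment columns whose substring in each sequence is accepted by the NFA". The markers `PRE` and `POST` (before and after that run) are helper states introduced for the sketch.

```python
NEG_INF = float("-inf")

def re_constrained_score(s1, s2, delta, start, finals,
                         match=1, mismatch=-1, gap=1):
    """Best alignment score whose constrained region is accepted in both rows."""
    PRE, POST = "pre", "post"

    def step(state, a, b):
        # successors of a state on alignment column (a, b); None means a gap
        if state in (PRE, POST):
            return [state]
        p, q = state
        ps = delta.get((p, a), ()) if a is not None else (p,)
        qs = delta.get((q, b), ()) if b is not None else (q,)
        return [(p2, q2) for p2 in ps for q2 in qs]

    def relax(cell, prev, a, b, gain):
        # push every active state of a predecessor cell across one column
        for state, w in prev.items():
            for nxt in step(state, a, b):
                if w + gain > cell.get(nxt, NEG_INF):
                    cell[nxt] = w + gain

    def close(cell):
        # the region may begin here (PRE -> (start, start)) or end here
        if PRE in cell:
            key = (start, start)
            cell[key] = max(cell.get(key, NEG_INF), cell[PRE])
        done = [w for st, w in cell.items()
                if st not in (PRE, POST) and st[0] in finals and st[1] in finals]
        if done:
            cell[POST] = max(cell.get(POST, NEG_INF), max(done))
        return cell

    n, m = len(s1), len(s2)
    prev_row = None
    for i in range(n + 1):
        row = []
        for j in range(m + 1):
            cell = {PRE: 0} if i == 0 and j == 0 else {}
            if i > 0 and j > 0:
                a, b = s1[i - 1], s2[j - 1]
                relax(cell, prev_row[j - 1], a, b, match if a == b else mismatch)
            if i > 0:
                relax(cell, prev_row[j], s1[i - 1], None, -gap)
            if j > 0:
                relax(cell, row[j - 1], None, s2[j - 1], -gap)
            row.append(close(cell))
        prev_row = row                      # only two rows are kept
    return prev_row[m].get(POST, NEG_INF)   # -inf: constraint cannot be met
```

With the transition table `{(0, "A"): {1}, (1, "C"): {1}, (1, "G"): {1}, (1, "S"): {2}, (1, "T"): {2}}`, start state 0, and final states `{2}` (an assumed encoding of A(C+G)*(S+T)), aligning "AT" with "AS" scores 0: the A/A match gains 1 and the T/S mismatch loses 1, and both rows are accepted. Each cell examines all active state pairs, which is where the extra automaton-dependent factor in the running time comes from.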

Generalizations: Local Alignment & Affine gaps

CONCLUSION

• Introduced the regular expression constrained sequence alignment problem

• Presented an algorithm for the problem

• Future work: generalization of the problem for multiple sequence alignment and for multiple regular expressions as a constraint

Thank You