Regular Expression Constrained Sequence Alignment
Abdullah N. Arslan, Assistant Professor
Computer Science Department
Feb 25, 2016
Outline
• Sequence alignment
  • Common framework
  • DP solution
  • Why constrained?
• RE constrained sequence alignment
  • Algorithm
• Concluding Remarks
Alignment Matrix
Edit Graph
Dynamic Programming Solution
H_{i,j}: maximum score achieved at (i, j),
where H_{i,j} = 0 whenever i = 0 or j = 0.
H_{n,m} is computed in O(nm) time and O(m) space.
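The O(m) space bound comes from keeping only the previous row of the table. A minimal sketch, assuming the usual three-way recurrence (diagonal, up, left) with a linear gap penalty; the function name and the scoring parameters `match`, `mismatch`, and `gap` are illustrative choices, not values from the talk. The zero boundary follows the condition stated above.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return H[n][m] while storing only two rows (O(m) space)."""
    m = len(b)
    prev = [0] * (m + 1)            # row i-1; H[0][j] = 0 on the boundary
    for i in range(1, len(a) + 1):
        cur = [0] * (m + 1)         # H[i][0] = 0 on the boundary
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] + gap,     # a[i-1] against a gap
                         cur[j - 1] + gap)  # b[j-1] against a gap
        prev = cur                  # the older row is no longer needed
    return prev[m]
```

Only the score is returned; recovering the alignment itself in linear space needs an additional divide-and-conquer step that this sketch omits.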
DP Solution: Local Alignment
H_{i,j}: similarity score achieved at (i, j),
where H_{i,j} = 0 whenever i = 0 or j = 0.
max H_{i,j} is computed in O(nm) time and O(m) space.
Dynamic Programming Formulation
Affine gap penalties: the penalty for a gap of length k is the gap-open penalty plus (k-1) times the gap-extension penalty,
where H_{i,j} = F_{i,j} = E_{i,j} = 0 when i = 0 or j = 0.
max H_{i,j} is computed in O(nm) time and O(m) space.
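A minimal sketch of local alignment with affine gaps, assuming the standard three-table formulation in which E tracks alignments ending in a gap along the row, F tracks alignments ending in a gap along the column, and H is the best of both or a diagonal step. The function name and default scores are illustrative assumptions; `gap_open` is the cost of a gap of length 1 and `gap_extend` the cost of each further position.

```python
def local_affine_score(a, b, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Return max H[i][j] for local alignment with affine gap penalties."""
    m = len(b)
    prev_h = [0] * (m + 1)          # H and F are zero on the boundary
    prev_f = [0] * (m + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * (m + 1)
        cur_f = [0] * (m + 1)
        e = 0                       # E[i][0] = 0 on the boundary
        for j in range(1, m + 1):
            # extend an open gap, or open a new one from H
            e = max(e - gap_extend, cur_h[j - 1] - gap_open)
            cur_f[j] = max(prev_f[j] - gap_extend, prev_h[j] - gap_open)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # 0 lets a local alignment start fresh at any cell
            cur_h[j] = max(0, prev_h[j - 1] + s, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best
```

With these defaults a gap of length 2 costs 2 + 1 = 3, so aligning "AAACCC" with "AAAGGCCC" scores 12 - 3 = 9.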
The Definition of the Constrained LCS Problem
• The constrained LCS (CLCS) problem: given strings S1, S2, and P
• Find a longest common subsequence of S1 and S2 such that P is a subsequence of it
• Motivation: computing the homology of two biological sequences that have a specific part in common
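A minimal sketch of a three-index dynamic program for CLCS, in the spirit of the O(nmr) solutions cited in this talk but not copied from any of them. The table entry L[i][j][k] is the length of the longest common subsequence of s1[:i] and s2[:j] that contains p[:k] as a subsequence; the function name and the use of `None` for "no feasible subsequence" are assumptions made for illustration.

```python
NEG = float("-inf")

def clcs_length(s1, s2, p):
    """Length of the longest common subsequence of s1, s2 containing p."""
    n, m, r = len(s1), len(s2), len(p)
    L = [[[NEG] * (r + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        L[i][0][0] = 0              # empty constraint, empty s2 prefix
    for j in range(m + 1):
        L[0][j][0] = 0              # empty constraint, empty s1 prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(r + 1):
                best = max(L[i - 1][j][k], L[i][j - 1][k])
                if s1[i - 1] == s2[j - 1]:
                    # use the common symbol without advancing in p
                    best = max(best, L[i - 1][j - 1][k] + 1)
                    if k > 0 and s1[i - 1] == p[k - 1]:
                        # use the common symbol to cover p[k-1]
                        best = max(best, L[i - 1][j - 1][k - 1] + 1)
                L[i][j][k] = best
    return None if L[n][m][r] == NEG else L[n][m][r]
```

For S1 = "ABCBDAB" and S2 = "BDCABA" the unconstrained LCS has length 4, while requiring P = "AA" forces both A's of each string to be used and the length drops to 3.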
Constrained Sequence Alignment Problems
• Constrained LCS
  • Tsai 2003: O(n^2 m^2 r) time
  • Chin et al. 2004, Arslan and Egecioglu 2004: O(nmr) time
• Edit-distance constrained sequence alignment
  • Arslan and Egecioglu 2004: O(dnmr)
• Regular-expression constrained sequence alignment
  • Motivation: Comet and Henry, 2002; PROSITE patterns
  • This paper
PROSITE patterns as constraints
• PROSITE patterns are regular expressions with no Kleene closure
  • PROSITE database, e.g. [GA]-X(4)-G-K-[ST]
  • ATP/GTP-binding site motif A (P-loop) (PS00017)
• Comet and Henry reward alignments
• Regular expression constrained sequence alignment
  • Find a maximum-score alignment containing a segment in which the aligned substrings of both sequences match a given RE
Example: For [GA]-X(4)-G-K-[ST]
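To make the pattern notation concrete, here is a minimal sketch that rewrites a subset of PROSITE syntax as a Python regular expression. It handles only residue classes `[..]`, exclusions `{..}`, the wildcard `x` (or `X`), and repetition counts such as `(4)` or `(2,3)`; anchors and other features are not handled. The function name `prosite_to_regex` is a hypothetical helper, not part of any PROSITE tool.

```python
import re

def prosite_to_regex(pattern):
    """Translate a subset of PROSITE syntax to a Python regex string."""
    parts = []
    for elem in pattern.rstrip(".").split("-"):
        # split an element into its core and an optional (n) or (n,m) count
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", elem)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core in ("x", "X"):
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        # [..] classes and single letters pass through unchanged
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)

P_LOOP = prosite_to_regex("[GA]-X(4)-G-K-[ST]")   # "[GA].{4}GK[ST]"
```

The translated pattern contains no `*`, which is consistent with the observation that PROSITE patterns need no Kleene closure.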
Using Edit Graph: e.g. A(C+G)*(S+T)
Automata for A(C+G)*(S+T)
Some Details of Automata Construction
• Build an NFA N equivalent to the given RE R
• Construct from N a new N×N automaton
  • Moves on edit operations (or equivalently on alignment columns)
  • States have weights
• Interested in the weights of the final states after the alignment is complete
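As a small illustration of the first step, the following sketch hand-builds an epsilon-free NFA for the running example A(C+G)*(S+T) and simulates it on a single string by tracking the set of active states. The state numbering and the names `START`, `FINALS`, `DELTA`, and `accepts` are illustrative assumptions; a general construction from an arbitrary RE is not shown.

```python
# Epsilon-free NFA for A(C+G)*(S+T).
# State 0: start; state 1: after 'A', loops on 'C' and 'G'; state 2: final.
START = 0
FINALS = {2}
DELTA = {
    0: {"A": {1}},
    1: {"C": {1}, "G": {1}, "S": {2}, "T": {2}},
}

def accepts(text):
    """Simulate the NFA on text, keeping only the reachable states."""
    active = {START}
    for sym in text:
        active = {t for q in active
                  for t in DELTA.get(q, {}).get(sym, set())}
        if not active:              # no state is reachable any more
            return False
    return bool(active & FINALS)
```

The N×N automaton runs two copies of this NFA side by side, one per sequence, advancing one or both copies depending on the alignment column.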
Weighted Automaton
• Initial weights are −∞
• Weight of (q0,q0) is initially 0
• Update new maximum scores at reachable states
• Weights remain −∞ in unreachable states
• What are the maximum weights at the final states?
Computations on Automata
Complexity
• Simulate the automata based on the DP solution
  • Each step requires examining the transition function
  • Maintain a list of active (reachable) states
  • Update state weights as alignments are formed
  • Automaton M_{i,j} holds the optimum weights
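The simulation can be sketched as a DP in which every cell (i, j) stores a dictionary from active states to their best weights; states missing from the dictionary play the role of weight −∞. This is a simplified illustration under several assumptions, not the algorithm as presented in the talk: the NFA is the hand-built one for A(C+G)*(S+T), two extra states `"PRE"` and `"POST"` stand for the unconstrained parts of the alignment before and after the matching segment, scoring is global with a linear gap penalty, and all names and score values are hypothetical.

```python
from itertools import product

NEG = float("-inf")
START, FINALS = 0, {2}               # NFA for A(C+G)*(S+T)
DELTA = {0: {"A": {1}}, 1: {"C": {1}, "G": {1}, "S": {2}, "T": {2}}}

def step(q, sym):
    return DELTA.get(q, {}).get(sym, set())

def closure(w):
    """Weight-preserving moves inside a cell: enter or leave the segment."""
    if "PRE" in w:                   # the matching segment may begin here
        key = (START, START)
        w[key] = max(w.get(key, NEG), w["PRE"])
    for state, score in list(w.items()):
        if isinstance(state, tuple) and state[0] in FINALS and state[1] in FINALS:
            w["POST"] = max(w.get("POST", NEG), score)  # segment may end
    return w

def move(w, a, b, gain, out):
    """Advance all active states of cell w over column (a, b); None is a gap."""
    for state, score in w.items():
        if state in ("PRE", "POST"):
            targets = [state]        # outside the segment: no constraint
        else:
            p, q = state
            targets = product(step(p, a) if a else {p},
                              step(q, b) if b else {q})
        for t in targets:
            out[t] = max(out.get(t, NEG), score + gain)

def re_constrained_score(s1, s2, match=2, mismatch=-1, gap=-1):
    prev = None
    for i in range(len(s1) + 1):
        row = []
        for j in range(len(s2) + 1):
            cell = {"PRE": 0} if i == 0 and j == 0 else {}
            if i > 0 and j > 0:
                gain = match if s1[i - 1] == s2[j - 1] else mismatch
                move(prev[j - 1], s1[i - 1], s2[j - 1], gain, cell)
            if i > 0:
                move(prev[j], s1[i - 1], None, gap, cell)
            if j > 0:
                move(row[j - 1], None, s2[j - 1], gap, cell)
            row.append(closure(cell))
        prev = row                   # only two rows of cells are kept
    return prev[-1].get("POST")      # None if no alignment meets the RE
```

Each cell touches only its active states, which is the point of maintaining a reachable-state list. The work per cell still grows with the number of state pairs and their transitions.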
Generalizations: Local Alignment & Affine gaps
CONCLUSION
• Introduced the regular expression constrained sequence alignment problem
• Presented an algorithm for the problem
• Future work: generalization of the problem for
  • Multiple sequence alignment
  • Multiple regular expressions as a constraint
Thank You