"An Eulerian path approach to global multiple alignment for DNA sequences” by Y. Zhang and M. Waterman * “An Eulerian path approach to local multiple alignment for DNA sequences” by Y. Zhang and M. Waterman ** Presented by Jaehee Jung Mar 4 2005 CPSC 689-604 *Journal of Computational Biology 10-6, pp. 803-819 (2003). ** Proc. National Academy of Science of USA 102-5, pp. 1285-1290 (2005).
Outline
• Motivation
– Hamiltonian & Eulerian path
– Superpath problem
• Global Alignment
– Global Alignment Algorithm
– Probability Analysis
– Complexity
– Discussion
• Local Alignment
– Local Alignment Algorithm
– Significance Estimation
– Complexity
– Discussion
declumping graph
• Until no significant local alignments are left
Significance Estimation
• Estimate the P value of a local multiple alignment
– Remove thin edges formed by random matches
– Rank multiple outputs by statistical significance
• Estimate the minimum multiplicity of a mutation-free edge
– The local case is more complicated than the global case
• Positions and order of the conserved regions in each sequence
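The thin-edge removal step above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the multiplicity of a k-tuple is taken to be the number of input sequences that contain it, and the helper names `edge_multiplicities` and `remove_thin_edges` as well as the threshold value are hypothetical, not taken from the papers.

```python
from collections import Counter

def edge_multiplicities(sequences, k):
    """Count, for every k-tuple, how many sequences contain it at least once.

    Assumption: this per-sequence count stands in for the edge multiplicity
    of the k-tuple graph; the papers may define multiplicity differently.
    """
    counts = Counter()
    for seq in sequences:
        # A set makes each sequence contribute at most 1 per k-tuple.
        tuples_in_seq = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(tuples_in_seq)
    return counts

def remove_thin_edges(counts, min_multiplicity):
    """Keep only k-tuples whose multiplicity reaches the (hypothetical) threshold."""
    return {t: m for t, m in counts.items() if m >= min_multiplicity}

seqs = ["ACGTACGGA", "TTACGTAGC", "GGACGTATT"]
edges = edge_multiplicities(seqs, k=4)
thick = remove_thin_edges(edges, min_multiplicity=3)
print(sorted(thick))  # ['ACGT', 'CGTA']
```

Only the two tuples shared by all three toy sequences survive; tuples seen in one or two sequences are treated as possible random matches and dropped.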
Poisson clumping heuristic
• Pairwise alignment
– H is the optimal clump score
– p(2) is the probability that two letters are identical
– L1, L2 are the adjusted lengths of the two sequences
– L1 L2 p(2)^x is an approximation to the expected number of clumps with score at least x

$$P(H \ge x) \approx 1 - e^{-L_1 L_2 \, p(2)^x}$$

• Multiple alignment
– Denote $\binom{N}{n} \prod_{i=1}^{n} L_i \, p(n)^x$ by $h(n, x)$

$$P(H \ge x) \approx 1 - e^{-h(n, x)}$$
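As a rough numerical illustration of the pairwise case, the sketch below evaluates the Poisson clumping approximation, assuming the slide's formula reads P(H ≥ x) ≈ 1 − exp(−L1 L2 p(2)^x). The function name `pairwise_clump_pvalue` and the example values are hypothetical.

```python
import math

def pairwise_clump_pvalue(x, len1, len2, p2):
    """Approximate P(H >= x) for a pairwise alignment.

    Assumes the Poisson clumping form 1 - exp(-len1 * len2 * p2**x), where
    len1, len2 are the adjusted sequence lengths and p2 is the probability
    that two letters are identical.
    """
    expected_clumps = len1 * len2 * p2 ** x
    # -expm1(-t) equals 1 - exp(-t) but stays accurate for very small t.
    return -math.expm1(-expected_clumps)

# Hypothetical example: two sequences of adjusted length 1000, uniform DNA (p2 = 0.25).
for score in (8, 12, 16):
    print(score, pairwise_clump_pvalue(score, 1000, 1000, 0.25))
```

The p value shrinks geometrically as the clump score grows, which is what allows alignments to be ranked by significance.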
Computation Efficiency
• k : tuple size
l : pattern length found in each iteration
N : number of sequences
L : average sequence length
• Time
– Graph construction and transformation: $O(kNL)$
– Pairwise alignment with declumping: $O(NLl)$
• Space: $O(kNL + l^2)$, where $l^2$ is the size of the alignment matrix
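To get a feel for the magnitudes, the sketch below plugs sample values into the bounds, assuming they read O(kNL) and O(NLl) for time and O(kNL + l^2) for space. Constant factors are ignored, and the function name `cost_estimates` and the sample parameters are hypothetical.

```python
def cost_estimates(k, l, n_seqs, avg_len):
    """Return unit-less operation and memory counts from the assumed bounds."""
    return {
        "graph_time": k * n_seqs * avg_len,       # graph construction and transformation
        "align_time": n_seqs * avg_len * l,       # pairwise alignment with declumping
        "space": k * n_seqs * avg_len + l ** 2,   # graph plus l-by-l alignment matrix
    }

# Hypothetical setting: tuple size 15, pattern length 50, 10 sequences of length 1000.
est = cost_estimates(k=15, l=50, n_seqs=10, avg_len=1000)
print(est)  # {'graph_time': 150000, 'align_time': 500000, 'space': 152500}
```

All three quantities grow linearly in the total input size NL, and the alignment matrix contributes only the small l^2 term.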
Discussion
• Tuple size (10~20)
• How to detect a true pattern rather than a concatenation of different patterns
• The current version focuses on DNA, not protein sequences
Assignment #5
• The de Bruijn graph / Eulerian graph approach is applied only to DNA, whose alphabet consists of the four nucleotides A, C, G, T. Give an efficient graph-based algorithm for multiple sequence alignment of proteins (a 20-letter alphabet).
– Hint: Do not use the de Bruijn graph or the Eulerian graph; the graph structure is embedded in the dynamic programming algorithm.
Reference
• [1] "A new algorithm for DNA sequence assembly" by R. Idury and M. Waterman, Journal of Computational Biology 2, pp. 291-306 (1995).
• [2] "An Eulerian path approach to DNA fragment assembly" by P. A. Pevzner, H. Tang, and M. Waterman, Proc. National Academy of Science of USA 98, pp. 9748-9753 (2001).
• [3] "An Eulerian path approach to global multiple alignment for DNA sequences" by Y. Zhang and M. Waterman, Journal of Computational Biology 10-6, pp. 803-819 (2003).
• [4] "An Eulerian path approach to local multiple alignment for DNA sequences" by Y. Zhang and M. Waterman, Proc. National Academy of Science of USA 102-5, pp. 1285-1290 (2005).