Page 1
Bioinformatics PhD. Course
1. Biological introduction
2. Comparison of short sequences (up to 10.000 bps): Dot Matrix, Pairwise align., Multiple align., Hash alg.
3. Comparison of large sequences (more than 10.000 bps): Data structures, Suffix trees, MUMs
4. String matching: Exact, Extended, Approximate
5. Sequence assembly
6. Projects: PROMO, MREPATT, …
Page 2
Comparison of large sequences
First part:
Alignment of large sequences
Page 3
Dynamic programming
• Short sequences (up to 10.000 bps) can be aligned using dynamic programming.
• Quadratic cost of space and time.
acc.................................agt
|||.................................|xx
acc.................................a--
What about genomes?
accaccacaccacaacgagcata … acctgagcgatat
acc..t
• Quadratic cost of space and time.
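As a rough illustration of the quadratic cost mentioned on this slide, here is a minimal sketch of global alignment scoring by dynamic programming. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example, not values given in the course.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best global alignment score, number of table cells filled).

    The scoring scheme is an illustrative assumption. The table has
    (len(a) + 1) * (len(b) + 1) cells, hence quadratic space and time.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1], rows * cols


if __name__ == "__main__":
    print(global_alignment_score("acct", "act"))
```

For two sequences of 10.000 bps each the table already holds about 100 million cells, which is why this approach is reserved for short sequences.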
Page 4
Genomic sequences
In which cases can dynamic programming be applied?
• Genomic sequences have millions of base pairs.
• The sequences are 1000 times longer.
• The running time is 1.000.000 times higher!
(1 second becomes 11 days) (1 minute becomes 2 years)
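The figures on this slide follow from the quadratic cost: multiplying the length by 1000 multiplies the running time by 1000 squared. A small sketch of the arithmetic (the function name is hypothetical):

```python
def scaled_time_seconds(base_seconds, length_factor):
    """Running time of a quadratic algorithm after the input grows by length_factor."""
    return base_seconds * length_factor ** 2


if __name__ == "__main__":
    one_day = 24 * 60 * 60
    one_year = 365 * one_day
    # 1 second on short sequences, sequences 1000 times longer.
    print(scaled_time_seconds(1, 1000) / one_day, "days")
    # 1 minute on short sequences, sequences 1000 times longer.
    print(scaled_time_seconds(60, 1000) / one_year, "years")
```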
Page 5
First assumption
[Figure: Genome A and Genome B drawn as two parallel sequences, and a plot of Genome A against Genome B]
Page 6
Realistic assumption?
Unrealistic assumption!
More realistic assumption
[Figure: Genome A and Genome B drawn as two parallel sequences under each assumption, and a plot of Genome A against Genome B]
Page 7
Realistic assumptions?
But is this now a real case?
Unrealistic assumption!
More realistic assumption
[Figure: Genome A and Genome B drawn as two parallel sequences under each assumption, and a plot of Genome A against Genome B]
Page 8
Preview in a real case
Chlamydia muridarum: 1.084.689 bps
Chlamydia trachomatis: 1.057.413 bps
Page 9
Preview in a real case
Pyrococcus abyssi: 1.790.334 bps
Pyrococcus horikoshii: 1.763.341 bps
Page 10
Methodology of an alignment
1st: Make a preview. (Linear cost)
2nd: Identify the portions that can be aligned.
3rd: Make the alignment. (Linear cost)
Page 11
Methodology of an alignment
1st: Make a preview. (Linear cost)
2nd: Identify the portions that can be aligned.
3rd: Make the alignment. ?
Page 12
Preview-Revisited
… a a t g….c t g...
… c g t g….c c c ...
Maximal Unique Matching (MUM)
Connect to MALGEN
Page 13
Methodology of an alignment
1st: Make a preview. How can MUMs be found? Linear cost with suffix trees.
2nd: Identify the portions that can be aligned. How can these portions be determined?
3rd: Make the alignment. With CLUSTALW, TCOFFEE, …
Page 14
Comparison of large sequences
M-GCAT
Todd Treangen
Page 15
Homework
1. Javier
2. Dmitry
3. Ana Iris
4. David
5. Patricia
6. Rogeli
7. Atif
8. Aina
9. Isaac
10. Maria Merce
11. Romina
12. Guillem
13. Raul
14. Alexis
15. Ramon
Page 16
Bioinformatics PhD. Course
Second part:
Introducing Suffix trees
Page 17
Suffix trees
Given string ababaas:
Suffixes:
1: ababaas
2: babaas
3: abaas
4: baas
5: aas
6: as
7: s
Suffix tree (edge labels, each leaf followed by its suffix number):
a
  ba
    baas,1
    as,3
  as,5
  s,6
ba
  baas,2
  as,4
s,7
What kind of queries?
Page 18
Applications of Suffix trees
1. Exact string matching
• Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
Suffix tree of ababaas (edge labels, each leaf followed by its suffix number):
a
  ba
    baas,1
    as,3
  as,5
  s,6
ba
  baas,2
  as,4
s,7
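To make the query on this slide concrete, here is a minimal sketch that answers it by walking down from the root. For brevity it builds an uncompressed suffix trie (one character per edge) rather than the compact suffix tree drawn in the slides; that substitution, and the function names, are assumptions of this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (quadratic size)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """A pattern occurs in the text iff it spells a path starting at the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


if __name__ == "__main__":
    trie = build_suffix_trie("ababaas")
    for pattern in ("abab", "aab", "ab"):
        print(pattern, contains(trie, pattern))
```

The query itself takes time proportional to the pattern length, independent of the length of the indexed sequence.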
Page 19
Quadratic insertion algorithm
Given the string …
Invariant Properties:
P1: the leaves of suffixes from … have been inserted
and the suffix-tree …
Page 20
Quadratic insertion algorithm
Given the string ababaabbs
ababaabbs,1
Page 21
Quadratic insertion algorithm
Given the string ababaabbs
babaabbs,2
ababaabbs,1
Page 22
Quadratic insertion algorithm
Given the string ababaabbs
babaabbs,2
ababaabbs,1ababaabbs,1
Page 23
Quadratic insertion algorithm
Given the string ababaabbs
babaabbs,2
ababaabbs,1
abbs,3
Page 24
Quadratic insertion algorithm
Given the string ababaabbs
babaabbs,2
ababaabbs,1
abbs,3
ba
baabbs,2
Page 25
Quadratic insertion algorithm
Given the string ababaabbs
ababaabbs,1
abbs,3
ba
baabbs,2
abbs,4
Page 26
Quadratic insertion algorithm
Given the string ababaabbs
ababaabbs,1
abbs,3
abbs,4ba
baabbs,2
abbs,4
abbs,3ba
a
baabbs,1
Page 27
Quadratic insertion algorithm
Given the string ababaabbs
abbs,4ba
baabbs,2
abbs,4
abbs,3ba
a
baabbs,1
abbs,5
Page 29
Quadratic insertion algorithm
Given the string ababaabbs
abbs,4
ba
ba
baabbs,2
abbs,4
a abbs,5
ba abbs,3
baabbs,1
Page 30
Quadratic insertion algorithm
Given the string ababaabbs
abbs,4ba
baabbs,2
abbs,4
a abbs,5
ba abbs,3
baabbs,1
bs,6
Page 32
Quadratic insertion algorithm
Given the string ababaabbs
a abbs,5
ba abbs,3
baabbs,1
bs,6
a
baabbs,2
b
abbs,4
bs,7
Page 33
Quadratic insertion algorithm
Given the string ababaabbs
a abbs,5
ba abbs,3
baabbs,1
bs,6
a
baabbs,2
b
abbs,4
bs,7
s,8
Page 34
Quadratic insertion algorithm
Given the string ababaabbs
a abbs,5
ba abbs,3
baabbs,1
bs,6
a
baabbs,2
b
abbs,4
bs,7
s,8
s,9
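The insertion steps walked through in these slides can be sketched as follows: each suffix is inserted in turn, following matching edges from the root and splitting an edge where the suffix diverges. This is a minimal sketch, not the course's implementation; the class and function names are hypothetical, and it assumes the last character of the string (here s) occurs nowhere else.

```python
class Node:
    def __init__(self, suffix=None):
        self.children = {}    # first character of edge label -> (edge label, child Node)
        self.suffix = suffix  # 1-based suffix number for a leaf, None for an internal node


def insert_suffix(root, text, start):
    node, rest = root, text[start:]
    while True:
        first = rest[0]
        if first not in node.children:
            # No edge starts with this character: hang a new leaf here.
            node.children[first] = (rest, Node(start + 1))
            return
        label, child = node.children[first]
        k = 0
        while k < len(label) and k < len(rest) and label[k] == rest[k]:
            k += 1
        if k == len(label):
            # Whole edge matched: continue from the child node.
            node, rest = child, rest[k:]
            continue
        # Mismatch inside the edge: split it and add the new leaf.
        middle = Node()
        middle.children[label[k]] = (label[k:], child)
        middle.children[rest[k]] = (rest[k:], Node(start + 1))
        node.children[first] = (label[:k], middle)
        return


def build_suffix_tree(text):
    # Assumption: a unique terminator, so no suffix is a prefix of another.
    assert text.count(text[-1]) == 1
    root = Node()
    for start in range(len(text)):
        insert_suffix(root, text, start)
    return root


def leaves(node, path=""):
    """List (string spelt from the root, suffix number) for every leaf."""
    if node.suffix is not None:
        return [(path, node.suffix)]
    found = []
    for first in sorted(node.children):
        label, child = node.children[first]
        found += leaves(child, path + label)
    return found


if __name__ == "__main__":
    for path, number in leaves(build_suffix_tree("ababaabbs")):
        print(number, path)
```

Each insertion may scan a whole suffix, so the total cost is quadratic in the length of the string, which is what gives the algorithm its name.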
Page 35
Generalized suffix tree
The suffix tree of many strings …
is called the generalized suffix tree …
and it is the suffix tree of the concatenation of strings.
For instance,
the generalized suffix tree of ababaabb and aabaa …
is the suffix tree of ababaabbαaabaaβ.
Page 36
Generalized suffix tree
a abbα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
Given the suffix tree of ababaabbα :
Construction of the suffix tree of ababaabbαaabaaβ :
Page 37
Generalized suffix tree
a abbα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
Construction of the suffix tree of ababaabbαaabaaβ :
Page 38
Generalized suffix tree
Construction of the suffix tree of ababaabbαaabaaβ :
a bα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
abaaβ,1
Page 40
Generalized suffix tree
a bα,5
ba bbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
abaaβ,1
aβ,2
Construction of the suffix tree of ababaabbαaabaaβ :
Page 42
Construction of the suffix tree of ababaabbαaabaaβ :
Generalized suffix tree
a bα,5
ba bbα,3
baabbα,1
bα,6
a
baabbα,2
b
bbα,4
bα,7
α,8
α,9
abaaβ,1
aβ,2
a β,3
Page 44
Generalized suffix tree
a bα,5
ba bbα,3
baabbα,1
bα,6
a
baabbα,2
b
bbα,4
bα,7
α,8
α,9
baaβ,1
aβ,2
a β,3
aβ,4
Construction of the suffix tree of ababaabbαaabaaβ :
Page 46
Generalized suffix tree
a bα,5
ba bbα,3
baabbα,1
bα,6
a
baabbα,2
b
bbα,4
bα,7
α,8
α,9
baaβ,1
aβ,2
a β,3
aβ,4β,5
Construction of the suffix tree of ababaabbαaabaaβ :
Page 48
Generalized suffix tree
a bα,5
ba bbα,3
baabbα,1
bα,6
a
baabbα,2
b
bbα,4
bα,7
α,8
α,9
baaβ,1
aβ,2
a β,3
aβ,4β,5
β,6
Construction of the suffix tree of ababaabbαaabaaβ :
Page 49
Generalized suffix tree
a bα,5b
a bbα,3baabbα,1
bα,6
a
baabbα,2
b
bbα,4
bα,7
α,8
α,9
baaβ,1
aβ,2
a β,3
aβ,4β,5
β,6
Generalized suffix tree of ababaabbαaabaaβ :
Page 50
Applications of Suffix trees
a
babaas,1as,3
ba
baas,2
as,4
s,6
as,5
s,7
1. Exact string matching
…………………………
• Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
Page 51
Applications of Suffix trees
2. The substring problem for a database of strings DB
• Does the DB contain any occurrence of the patterns abab, aab, and ab?
a bα,5b
a bbα,3baabbα,1
bα,6
a
baabbα,2
b
bbα,4
bα,7
α,8
α,9
baaβ,1
aβ,2
a β,3
aβ,4β,5
β,6
Page 52
Applications of Suffix trees
3. The longest common substring of two strings
a bα,5b
a bbα,3baabbα,1
bα,6
a
baabbα,2
b
bbα,4
bα,7
α,8
α,9
baaβ,1
aβ,2
a β,3
aβ,4β,5
β,6
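One way to read this application: in a generalized structure, the longest common substring is spelt by the deepest node that lies on suffixes of both strings. The sketch below illustrates the idea on an uncompressed trie, marking each node with the set of strings that reach it instead of using the terminators α and β of the slides. Both choices, and all names, are assumptions of this example.

```python
def new_node():
    return {"children": {}, "owners": set()}


def longest_common_substring(s1, s2):
    """Deepest trie node reached by suffixes of both strings (quadratic-size sketch)."""
    root = new_node()
    for owner, text in enumerate((s1, s2)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                if ch not in node["children"]:
                    node["children"][ch] = new_node()
                node = node["children"][ch]
                node["owners"].add(owner)
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["children"].items():
            # Only descend into nodes shared by both strings.
            if len(child["owners"]) == 2:
                if len(path) + 1 > len(best):
                    best = path + ch
                stack.append((child, path + ch))
    return best


if __name__ == "__main__":
    print(longest_common_substring("ababaabb", "aabaa"))
```

With a compact generalized suffix tree built in linear time, the same traversal would run in time linear in the total length of the two strings.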
Page 53
Applications of Suffix trees
5. Finding MUMs.
a bα,5b
a bbα,3baabbα,1
bα,6
a
baabbα,2
b
bbα,4
bα,7
α,8
α,9
baaβ,1
aβ,2
a β,3
aβ,4β,5
β,6
Page 54
Bioinformatics PhD. Course
Third part:
Suffix links
Page 55
Suffix links
a abbα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
a
Page 56
Suffix links
a abbα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
a
?
Page 64
Suffix links
a abbα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
a
Page 66
Traversal using Suffix links
a abbα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
Given S2 = a a b a a
Page 68
Traversal using Suffix links
a abbα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
Given S2 = a a b a a
aa in S2[1]
Unique matchings:
Page 69
Traversal using Suffix links
a abbα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
Given S2 = a a b a a
aa in S2[1]
aab in S2[1] = S1[5..6-7] in S2[1]
Unique matchings:
Page 70
Traversal using Suffix links
a abbα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
Given S2 = a a b a a
Unique matchings:
S1[5..6-7] in S2[1]
Page 72
Traversal using Suffix links
a abbα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
Given S2 = a a b a a b b a
Unique matchings:
S1[5..6-7] in S2[1]
S1[3..6-…] in S2[2]
Page 76
Traversal using Suffix links
a abbα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
Given S2 = a a b a a b b a
Unique matchings:
S1[5..6-7] in S2[1]
S1[3..6-8] in S2[2]
S1[4..6-8] in S2[3]
Page 77
Traversal using Suffix links
a abbα,5
ba abbα,3
baabbα,1
bα,6
a
baabbα,2
b
abbα,4
bα,7
α,8
α,9
Given S2 = a a b a a b b a
Unique matchings:
S1[5..8] in S2[4]
S1[3..6-8] in S2[2]
S1[4..6-8] in S2[3]
S1[6..8] in S2[5]
S1[7..8] in S2[6]
Page 78
From UMs to MUMs
Given S2 = a a b a a b b a
and S1 = a b a b a a b b α
Unique matchings:
S1[5..8] in S2[4]
S1[3..6-8] in S2[2]
S1[4..6-8] in S2[3]
S1[6..8] in S2[5]
S1[7..8] in S2[6]
Array of UMs (indexed by position in S1):
1:
2:
3: 6-8
4: 6-8
5: 8
6: 8
7: 8
8:
9:
MUM: S1[3..6-8] in S2[2]
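The result on this slide can be cross-checked with a brute-force sketch that applies the usual definition of a MUM directly: a substring that occurs exactly once in each sequence and cannot be extended to the left or to the right. This is not the suffix-link traversal of the slides, only a slow reference for small inputs; the function names are hypothetical.

```python
from collections import defaultdict


def occurrences(text):
    """Map every substring of text to the list of its 0-based start positions."""
    table = defaultdict(list)
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            table[text[i:j]].append(i)
    return table


def find_mums(s1, s2):
    """Return sorted (start in s1, start in s2, length) triples, 1-based starts."""
    occ1, occ2 = occurrences(s1), occurrences(s2)
    mums = []
    for word, starts1 in occ1.items():
        starts2 = occ2.get(word)
        if starts2 is None or len(starts1) != 1 or len(starts2) != 1:
            continue  # not unique in both sequences
        i, j, n = starts1[0], starts2[0], len(word)
        left_maximal = i == 0 or j == 0 or s1[i - 1] != s2[j - 1]
        right_maximal = i + n == len(s1) or j + n == len(s2) or s1[i + n] != s2[j + n]
        if left_maximal and right_maximal:
            mums.append((i + 1, j + 1, n))
    return sorted(mums)


if __name__ == "__main__":
    # S1 without its terminator α, and S2 as given on the slide.
    print(find_mums("ababaabb", "aabaabba"))
```

On the slide's strings this reports a single match of length 6 starting at S1[3] and S2[2], in agreement with the MUM shown above.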
Page 79
Bioinformatics PhD. Course
Fourth part:
Linear insertion algorithm
Page 80
Quadratic insertion algorithm
Given the string …
Invariant Properties:
P1: the leaves of suffixes from … have been inserted
and the suffix-tree …
Page 81
Linear insertion algorithm
Given the string …
Invariant Properties:
P1: the leaves of suffixes from … have been inserted
and the suffix-tree …
P2: the string … is the longest string that can be spelt through the tree.
Page 82
Linear insertion algorithm: example
Given the string ababaababb...
ba
baababb...,2
a ababb...,5
ba ababb...,3
baababb...,1ababb...,4
a
Page 83
Linear insertion algorithm: example
Given the string ababaababb...
ba
baababb...,2
a ababb...,5
ba ababb...,3
baababb...,1ababb...,4
6 7 8
Page 84
Linear insertion algorithm: example
ba
baababb...,2
a ababb...,5
ba ababb...,3
baababb...,1ababb...,4
6 7 8Given the string ababaababb...
Page 85
Linear insertion algorithm: example
ba
baababb...,2
a ababb...,5
ba ababb...,3
baababb...,1ababb...,4
6 7 89Given the string ababaababb...
Page 86
Linear insertion algorithm: example
a ababb...,5
ba ababb...,3
baababb...,1ba
baababb...,2
ababb...,4
Given the string ababaababb...
6 7 89
baababb...,1b
b...,6
aababb...,1
Page 87
Linear insertion algorithm: example
a ababb...,5
ba ababb...,3
ba
baababb...,2
ababb...,4
Given the string ababaababb...
7 89
b
b...,6
aababb...,1
Page 90
Linear insertion algorithm: example
a ababb...,5
ba ababb...,3
ba
baababb...,2
ababb...,4
Given the string ababaababb...
7 89
b
b...,6
aababb...,1
baababb...,2b aababb...,2
Page 91
Linear insertion algorithm: example
a ababb...,5
ba ababb...,3
ba
baababb...,2
ababb...,4
Given the string ababaababb...
7 8…
b
b...,6
aababb...,1
baababb...,2b
b...,7
aababb...,2
Page 92
Linear insertion algorithm: example
a ababb...,5
ba ababb...,3
ba ababb...,4
Given the string ababaababb...
89
b
b...,6
aababb...,1
b
b...,7
aababb...,2
Page 96
Linear insertion algorithm: example
a ababb...,5
b
ba ababb...,4
Given the string ababaababb...
89
ababb...,3
b
b...,6
aababb...,1
b
b...,7
aababb...,2
a
Page 97
Linear insertion algorithm: example
a ababb...,5
b
ba ababb...,4
Given the string ababaababb...
89
ababb...,3
b
b...,6
aababb...,1
b
b...,7
aababb...,2
a
b...,8
Page 98
Linear insertion algorithm: example
a ababb...,5
b
ba ababb...,4
Given the string ababaababb...
9
ababb...,3
b
b...,6
aababb...,1
b
b...,7
aababb...,2
a
b...,8
Page 100
Linear insertion algorithm: example
a ababb...,5
b
b ababb...,4
Given the string ababaababb... 9
ababb...,3
b
b...,6
aababb...,1
b
b...,7
aababb...,2
a
b...,8a
Page 101
Linear insertion algorithm: example
a ababb...,5
b
b ababb...,4
Given the string ababaababb... 9
ababb...,3
b
b...,6
aababb...,1
b
b...,7
aababb...,2
a
b...,8a
b...,9
Page 102
Linear insertion algorithm: example
a ababb...,5
b
b ababb...,4
Given the string ababaababb... 9
ababb...,3
b
b...,6
ababb...,1
b
b...,7
aababb...,2
a
b...,8a
b...,9
Page 104
Linear insertion algorithm: example
a ababb...,5
b
b ababb...,4
Given the string ababaababb...
9
ababb...,3
b
b...,6
ababb...,1
b
b...,7
aababb...,2
a
b...,8a
b...,9
Page 105
Index
Suffix arrays
"Suffix arrays: a new method for on-line string searches", G. Myers, U. Manber
Page 106
Suffix arrays
Given string ababaa#:
Suffixes:
1: ababaa#
2: babaa#
3: abaa#
4: baa#
5: aa#
6: a#
7: #
… but lexicographically sorted
[Figure: the suffixes placed in an array with positions 1 to 7]
What is the cost? O(n log(n))
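A minimal sketch of the idea: the suffix array is the list of suffix numbers ordered by the suffixes they denote. The function name is hypothetical, and the position of # in the order is an assumption of this example (Python compares by code point, which places # before the letters; the slides may use another convention). Sorting performs O(n log(n)) comparisons, but in this naive version each comparison may itself read up to n characters.

```python
def suffix_array(text):
    """Return the 1-based suffix numbers of text in lexicographic order of the suffixes."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


if __name__ == "__main__":
    text = "ababaa#"
    for position, number in enumerate(suffix_array(text), start=1):
        print(position, number, text[number - 1:])
```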
Page 107
Applications of suffix arrays
1. Exact string matching
• Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
[Figure: the suffixes of ababaa# in an array with positions 1 to 7]
Binary search … what is the cost? O(log(n) |P|)
Can it be improved to O(log(n)+|P|)?
Page 108
Fast search with cost O(log(n)+|P|)
Query:
[Figure: suffix array with positions 1, 2, …, n; α and β mark two positions]
Invariant Properties:
P1: α < query ≤ β
P2: … matches pref(query)
Page 109
Fast search with cost O(log(n)+|P|)
Query:
[Figure: suffix array with positions 1, 2, …, n; α, β and γ mark three positions]
Invariant Properties:
P1: α < query ≤ β
P2: … matches pref(query)
Algorithm:
If suff(γ) < suff(query) then α = γ else β = γ
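The update rule above can be sketched as a plain binary search that maintains invariant P1. Note the assumption: this sketch compares up to |P| characters at every step, so it is the O(log(n) |P|) search, not the accelerated O(log(n)+|P|) variant that also exploits P2. Function and variable names are hypothetical, and positions are 0-based.

```python
def build_suffix_array(text):
    """Naive construction: 0-based suffix starts sorted by the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrence(text, sa, pattern):
    """Return the 0-based start of one occurrence of pattern in text, or None."""
    m = len(pattern)
    # alpha and beta start just outside the array, so that
    # suffix(alpha) < pattern <= suffix(beta) holds with virtual sentinels.
    alpha, beta = -1, len(sa)
    while beta - alpha > 1:
        gamma = (alpha + beta) // 2
        # Comparing the first m characters is enough to decide suffix < pattern.
        if text[sa[gamma]:sa[gamma] + m] < pattern:
            alpha = gamma
        else:
            beta = gamma
    if beta < len(sa) and text.startswith(pattern, sa[beta]):
        return sa[beta]
    return None


if __name__ == "__main__":
    text = "ababaa#"
    sa = build_suffix_array(text)
    for pattern in ("abab", "aab", "ab"):
        print(pattern, find_occurrence(text, sa, pattern))
```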