Text Algorithms (6EAP) Full text indexing Jaak Vilo 2016 fall 1 MTAT.03.190 Text Algorithms Jaak Vilo

Lecture 14 Full Text - ut text indexing Jaak Vilo ... A suffix tree T for a string S ... • Each internal node, other than the root, has at least two children;

May 27, 2018

Page 1

Text Algorithms (6EAP)

Full text indexing

Jaak Vilo, 2016 fall

MTAT.03.190 Text Algorithms, Jaak Vilo

Page 2

Problem

• Given P and S, find all exact or approximate occurrences of P in S

• You are allowed to preprocess S (and P, of course)

• Goal: to speed up the searches

Page 3

E.g. Dictionary problem

• Does P belong to a dictionary D = {d1, …, dn}?
– Build a binary search tree of D
– B-Tree of D
– Hashing
– Sorting + binary search

• Build a keyword trie: search in O(|P|)
– Assuming the alphabet has up to a constant size c
– See the Aho-Corasick algorithm, trie construction


Page 5

Sorted array and binary search

[Figure: sorted array of the 13 dictionary words, indexed 1 to 13: global, happy, he, head, header, hers, his, index, info, informal, search, show, stop]

Search time: O(|P| log n)
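A minimal sketch of the sorted-array lookup, assuming the 13 words of the figure as the dictionary. The names `WORDS` and `in_dictionary` are illustrative only. Each of the O(log n) probes compares up to |P| characters, which gives the O(|P| log n) bound.

```python
from bisect import bisect_left

# The 13 dictionary words from the slide, kept in sorted order.
WORDS = sorted([
    "he", "hers", "his", "global", "index", "happy", "head",
    "header", "info", "informal", "search", "show", "stop",
])

def in_dictionary(words, pattern):
    """Binary search for pattern in a sorted list of words."""
    i = bisect_left(words, pattern)  # first index with words[i] >= pattern
    return i < len(words) and words[i] == pattern
```

For example, `in_dictionary(WORDS, "header")` is true and `in_dictionary(WORDS, "heap")` is false.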

Page 6

Trie for D = {he, hers, his, she}

[Figure: keyword trie with nodes 0 to 9; the arc labels spell the words he, hers, his and she from the root]

Search time: O(|P|)
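The trie lookup can be sketched with nested dictionaries. This is an illustration under assumptions: the function names are made up here, and the key "$" is used as an end-of-word marker, so it must not occur in the alphabet. A lookup follows one arc per pattern symbol, hence O(|P|) for a constant-size alphabet.

```python
def build_trie(words):
    """Build a keyword trie as nested dicts; "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # assumption: "$" is not a symbol of the alphabet
    return root

def trie_contains(root, pattern):
    """Follow one arc per symbol of pattern: O(|P|) dictionary steps."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

def count_nodes(node):
    """Number of trie nodes in the subtree, not counting end markers."""
    return 1 + sum(count_nodes(child) for key, child in node.items() if key != "$")
```

For D = {he, hers, his, she} this yields 10 nodes, matching the node numbers 0 to 9 in the figure.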

Page 7

S ≠ set of words

• S of length n

• How to index?

• Index from every position of a text

• Prefix of every possible suffix is important

Page 8

[Figure: suffix trie Trie(babaab)]

Page 9

Suffix tree

• Definition: A compact representation of a trie corresponding to the suffixes of a given string, where all nodes with one child are merged with their parents.

• Definition (suffix tree). A suffix tree T for a string S (with n = |S|) is a rooted, labeled tree with a leaf for each non-empty suffix of S. Furthermore, a suffix tree satisfies the following properties:
– Each internal node, other than the root, has at least two children;
– Each edge leaving a particular node is labeled with a non-empty substring of S of which the first symbol is unique among all first symbols of the edge labels of the edges leaving this particular node;
– For any leaf in the tree, the concatenation of the edge labels on the path from the root to this leaf exactly spells out a non-empty suffix of S.

• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198.

Page 10

Literature on suffix trees

• http://en.wikipedia.org/wiki/Suffix_tree

• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. (pages: 89--208)

• E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-60, 1995. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751

• Ching-Fung Cheung, Jeffrey Xu Yu, Hongjun Lu. "Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, pp. 90-105, January, 2005. http://www2.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.3

• CPM articles archive: http://www.cs.ucr.edu/~stelo/cpm/

• Mark Nelson. Fast String Searching With Suffix Trees. Dr. Dobb's Journal, August, 1996. http://www.dogma.net/markn/articles/suffixt/suffixt.htm

Page 11

http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english

Page 12

The suffix tree Tree(T) of T

• data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T

• linear size: |Tree(T)| = O(|T|)

• can be constructed in linear time O(|T|)

• has myriad virtues (A. Apostolico)

• is well-known: 366 000 Google hits

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

Page 13

Suffix tree and suffix array techniques for pattern analysis in strings

Esko Ukkonen, Univ Helsinki

Erice School, 30 Oct 2005

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

Partly based on:

High-throughput genome-scale sequence analysis and mapping using compressed data structures

Veli Mäkinen, Department of Computer Science, University of Helsinki

Page 14

ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt

Page 15

Analysis of a string of symbols

• T = hattivatti ’text’

• P = att ’pattern’

• Find the occurrences of P in T: hattivatti

• Pattern synthesis: #(t) = 4, #(atti) = 2, #(t****t) = 2

Page 16

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Solution: backtracking with suffix tree

...ACACATTATCACAGGCATCGGCATTAGCGATCGAGTCG.....

Page 17

Pattern finding & synthesis problems

• T = t1t2…tn, P = p1p2…pn, strings of symbols in a finite alphabet

• Indexing problem: Preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast
– static text, any given pattern P

• Pattern synthesis problem: Learn from T new patterns that occur surprisingly often

• What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, …

Page 18

1. Suffix tree

2. Suffix array

3. Some applications

4. Finding motifs


Page 20

Suffix trie and suffix tree

[Figure: suffix trie Trie(abaab)]

Page 21

Suffix trie and suffix tree

[Figure: Trie(abaab) and the compacted Tree(abaab) side by side]

Page 22

Trie(T) can be large

• |Trie(T)| = O(|T|^2)

• bad example: T = a^n b^n

• Trie(T) can be seen as a DFA: language accepted = the suffixes of T

• minimize the DFA => directed acyclic word graph (’DAWG’)
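A short worked check of the quadratic bad example. The suffix trie has one node per distinct substring of T (the root stands for the empty string). For T = a^n b^n the distinct substrings are a^i b^j with 0 ≤ i, j ≤ n, so the trie has (n+1)^2 nodes for a text of length 2n. The helper name `suffix_trie_size` is illustrative; it counts substrings by brute force rather than building the trie.

```python
def suffix_trie_size(text):
    """Number of nodes of Trie(text): one per distinct substring, root = empty string."""
    n = len(text)
    substrings = {text[i:j] for i in range(n) for j in range(i, n + 1)}
    return len(substrings)

# Text length grows linearly in n, node count grows quadratically.
for n in (1, 2, 3, 10):
    text = "a" * n + "b" * n
    print(len(text), suffix_trie_size(text), (n + 1) ** 2)
```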

Page 23

Tree(T) is of linear size

• only the internal branching nodes and the leaves represented explicitly

• edges labeled by substrings of T

• v = node(α) if the path from root to v spells α

• one-to-one correspondence of leaves and suffixes

• |T| leaves, hence < |T| internal nodes

• |Tree(T)| = O(|T| + size(edge labels))

Page 24

Tree(hattivatti)

[Figure: suffix tree of hattivatti with one leaf for each of the suffixes hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i]

Page 25

Tree(hattivatti)

[Figure: the same tree] Substring labels of edges are represented as pairs of pointers.

Page 26

Tree(hattivatti)

[Figure: the same tree with leaves numbered 1 to 10 and edge labels written as position pairs into the text, e.g. 6,10 for vatti, 2,5 for atti, 4,5 for ti and 3,3 for t]

Page 27

Tree(T) is full text index

[Figure: Tree(T) with a path spelling P that leads to a subtree with leaves 31 and 8]

P occurs in T at locations 8, 31, …

P occurs in T ⇔ P is a prefix of some suffix of T ⇔ Path for P exists in Tree(T)

All occurrences of P in time O(|P| + #occ)
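The equivalence "P occurs in T exactly when P is a prefix of some suffix of T" can be illustrated without any index. The sketch below is a brute-force check, not the O(|P| + #occ) tree search; the function name is an assumption. It tests every suffix of T and reports 1-based start positions as the slides do.

```python
def occurrences(text, pattern):
    """1-based positions i such that pattern is a prefix of the suffix text[i..]."""
    return [i + 1 for i in range(len(text)) if text.startswith(pattern, i)]

print(occurrences("hattivatti", "att"))
```

For T = hattivatti and P = att the qualifying suffixes are attivatti and atti, which start at positions 2 and 7.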

Page 28

Find att from Tree(hattivatti)

[Figure: following the path att from the root of Tree(hattivatti) reaches a subtree with leaves 7 and 2, the two occurrences of att]

Page 29

Linear time construction of Tree(T)

[Figure: the suffixes of hattivatti]

Weiner (1973), ’algorithm of the year’

McCreight (1976)

’on-line’ algorithm (Ukkonen 1992)

Page 30

On-line construction of Trie(T)

• T = t1t2…tn$

• Pi = t1t2…ti, the i:th prefix of T

• on-line idea: update Trie(Pi) to Trie(Pi+1)

• => very simple construction

Page 31

Trie(abaab)

[Figure: Trie(a), Trie(ab) and Trie(aba) built in turn; a chain of links connects the end points of the current suffixes]

Page 32

Trie(abaab)

[Figure: Trie(abaa) added to the sequence]

Page 33

Trie(abaab)

[Figure: Trie(abaa)] Add next symbol = b

Page 34

Trie(abaab)

[Figure: Trie(abaa)] Add next symbol = b. From here on the b-arc already exists.

Page 35

Trie(abaab)

[Figure: the finished Trie(abaab)]

Page 36

What happens in Trie(Pi) => Trie(Pi+1)?

[Figure: the trie before and after adding the symbol ai; new nodes and new suffix links are created along the suffix link chain]

From here on the ai-arc exists already => stop updating here

Page 37

What happens in Trie(Pi) => Trie(Pi+1)?

• time: O(size of Trie(T))

• suffix links: slink(node(aα)) = node(α)

Page 38

On-line procedure for suffix trie

Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix links slink(v) := root and slink(root) := root
for i := 2 to n do begin
    vi-1 := leaf of Trie(t1…ti-1) for string t1…ti-1 (i.e., the deepest leaf)
    v := vi-1; v´ := 0
    while node v has no outgoing arc for ti do begin
        Create a new node v´´ and an arc son(v,ti) = v´´
        if v´ ≠ 0 then slink(v´) := v´´
        v := slink(v); v´ := v´´
    end
    for the node v´´ such that v´´ = son(v,ti) do
        if v´´ = v´ then slink(v´) := root else slink(v´) := v´´
end
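The on-line trie procedure can be sketched in Python as follows. This is an illustration under assumptions, not the lecture's code: the class and function names are invented here, the construction starts from a lone root instead of Trie(t1), and `find` is a helper added only to inspect the result.

```python
class Node:
    def __init__(self):
        self.children = {}  # symbol -> child node
        self.slink = None   # suffix link: node(a + alpha) -> node(alpha)

def build_suffix_trie(text):
    """On-line suffix trie: extend Trie(P_i) to Trie(P_{i+1}) one symbol at a time."""
    root = Node()
    root.slink = root
    deepest = root                  # node spelling the whole prefix read so far
    for ch in text:
        v, prev_new = deepest, None
        # Walk the suffix link chain until a node already has a ch-arc.
        while ch not in v.children:
            new = Node()
            v.children[ch] = new
            if prev_new is not None:
                prev_new.slink = new
            prev_new = new
            v = v.slink
        target = v.children[ch]
        if prev_new is not None:
            # If the last node was created at the root, its suffix is empty.
            prev_new.slink = root if target is prev_new else target
        deepest = deepest.children[ch]
    return root

def find(root, word):
    """Return the node spelling word from the root, or None if absent."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return None
    return node
```

After `root = build_suffix_trie("abaab")`, every substring of abaab can be spelled from the root, and the suffix link of the node for abaa leads to the node for baa.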

Page 39

Suffix trees on-line

• ’compacted version’ of the on-line trie construction: simulate the construction on the linear size tree instead of the trie => time O(|T|)

• all trie nodes are conceptually still needed => implicit and real nodes

Page 40

Implicit and real nodes

• Pair (v,α) is an implicit node in Tree(T) if v is a node of Tree and α is a (proper) prefix of the label of some arc from v. If α is the empty string then (v,α) is a ’real’ node (= v).

• Let v = node(α´) in Tree(T). Then implicit node (v,α) represents node(α´α) of Trie(T)

Page 41

Implicit node

[Figure: the path α´ leads from the root to node v; the implicit node (v, α) lies on an arc leaving v, after reading α]

Page 42

Suffix links and open arcs

[Figure: the root, a node v reached by the path α, its suffix link slink(v), and a leaf w]

label [i,*] instead of [i,j] if w is a leaf and j is the scanned position of T

Page 43

Big picture

[Figure: the suffix link path traversed over all rounds of the construction]

suffix link path traversed: total work O(n)

new arcs and nodes created: total work O(size(Tree(T)))

Page 44

On-line procedure for suffix tree

Input: string T = t1t2…tn$

Output: Tree(T)

Notation: son(v,α) = w iff there is an arc from v to w with label α

son(v,ε) = v

Function Canonize(v, α):
    while son(v, α´) ≠ 0 where α = α´α´´, |α´| > 0 do
        v := son(v, α´); α := α´´
    return (v, α)

Page 45

Suffix-tree on-line: main procedure

Create Tree(t1); slink(root) := root
(v, α) := (root, ε) /* (v, α) is the start node */
for i := 2 to n+1 do
    v´ := 0
    while there is no arc from v with label prefix αti do
        if α ≠ ε then /* divide the arc w = son(v, αη) into two */
            son(v, α) := v´´; son(v´´,ti) := v´´´; son(v´´,η) := w
        else
            son(v,ti) := v´´´; v´´ := v
        if v´ ≠ 0 then slink(v´) := v´´
        v´ := v´´; v := slink(v); (v, α) := Canonize(v, α)
    if v´ ≠ 0 then slink(v´) := v
    (v, α) := Canonize(v, αti) /* (v, α) = start node of the next round */
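For experimenting, here is a compact Python sketch of an on-line suffix tree construction in the spirit of the procedure above. It is not a line-by-line transcription: it uses the common "active point" formulation, where `node` plays the role of v and `pos` counts the symbols of the suffix still to be inserted, and the inner `while True` loop does the job of Canonize. All names are illustrative, and the text is assumed to end with a unique terminator such as `$`.

```python
def build_suffix_tree(text):
    """Return (text, to, fpos, length); node 0 is the root.
    to[v]: first symbol -> child; the arc into v is text[fpos[v] : fpos[v] + length[v]]."""
    INF = 10 ** 18                      # open arc [i,*]: a leaf label runs to the end
    to, fpos, length, link = [{}], [0], [INF], [0]

    def make_node(start, size):
        to.append({}); fpos.append(start); length.append(size); link.append(0)
        return len(to) - 1

    node, pos = 0, 0
    for n in range(1, len(text) + 1):
        c = text[n - 1]
        pos += 1
        last = 0                        # internal node still waiting for its suffix link
        while pos > 0:
            while True:                 # canonize: skip over arcs shorter than pos
                v = to[node].get(text[n - pos], 0)
                if pos > length[v]:
                    node, pos = v, pos - length[v]
                else:
                    break
            edge = text[n - pos]
            v = to[node].get(edge, 0)
            if v == 0:                  # no arc yet: hang a new leaf under node
                to[node][edge] = make_node(n - pos, INF)
                link[last] = node
                last = 0
            elif text[fpos[v] + pos - 1] == c:
                link[last] = node       # c is already in the tree: round ends
                break
            else:                       # divide the arc into two
                t = text[fpos[v] + pos - 1]
                u = make_node(fpos[v], pos - 1)
                to[u][c] = make_node(n - 1, INF)
                to[u][t] = v
                fpos[v] += pos - 1
                length[v] -= pos - 1
                to[node][edge] = u
                link[last] = u
                last = u
            if node == 0:
                pos -= 1
            else:
                node = link[node]
    return text, to, fpos, length

def contains(tree, pattern):
    """True if pattern labels a path from the root, i.e. occurs in the text."""
    text, to, fpos, length = tree
    node, i = 0, 0
    while i < len(pattern):
        v = to[node].get(pattern[i], 0)
        if v == 0:
            return False
        size = min(length[v], len(text) - fpos[v])
        chunk = pattern[i:i + size]
        if not text[fpos[v]:fpos[v] + size].startswith(chunk):
            return False
        node, i = v, i + len(chunk)
    return True

def count_leaves(tree):
    return sum(1 for v in range(1, len(tree[1])) if not tree[1][v])
```

With `tree = build_suffix_tree("hattivatti$")`, `contains(tree, "att")` holds, and the tree has one leaf per suffix of the terminated text.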


Page 47

The actual time and space

• |Tree(T)| is about 20|T| in practice

• brute-force construction is O(|T| log|T|) for random strings as the average depth of internal nodes is O(log|T|)

• difference between linear and brute-force constructions not necessarily large (Giegerich & Kurtz)

• truncated suffix trees: k symbols long prefix of each suffix represented (Na et al. 2003)

• alphabet independent linear time (Farach 1997)

Page 48

[Figure: example string abc]

Page 49

[Figure: example string abcabxabcd]

Page 54

Applications of Suffix Trees

• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. (book)

• APL1: Exact String Matching. Search for P from text S. Solution 1: build STree(S); one achieves the same O(n+m) as Knuth-Morris-Pratt, for example!

• Search from the suffix tree is O(|P|)

• APL2: Exact set matching. Search for a set of patterns P

Page 55

Back to backtracking

[Figure: suffix tree with leaves 4, 2, 1, 5, 3, 6; backtracking search for ACA with 1 mismatch]

The same idea can be used for many other forms of approximate search, like Smith-Waterman, position-restricted scoring matrices, regular expression search, etc.
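A minimal sketch of backtracking with a mismatch budget. Two assumptions are made here: the text is CATACT (the string shown with the same leaf order 4 2 1 5 3 6 later in the slides), and an uncompacted suffix trie stands in for the suffix tree to keep the code short. The class and function names are illustrative. The search tries every arc at each depth and only descends while the number of mismatches stays within k.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.starts = []  # 1-based start positions of suffixes passing through

def build_suffix_trie(text):
    root = TrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, TrieNode())
            node.starts.append(start + 1)
    return root

def search_with_mismatches(root, pattern, k):
    """Start positions where pattern matches with at most k mismatches."""
    hits = []

    def walk(node, depth, budget):
        if depth == len(pattern):
            hits.extend(node.starts)
            return
        for ch, child in node.children.items():
            cost = 0 if ch == pattern[depth] else 1
            if cost <= budget:          # prune branches that exceed the budget
                walk(child, depth + 1, budget - cost)

    walk(root, 0, k)
    return sorted(hits)

print(search_with_mismatches(build_suffix_trie("CATACT"), "ACA", 1))
```

With one mismatch allowed, ACA matches ATA at position 2 and ACT at position 4; with no mismatches allowed there is no occurrence.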

Page 56

Applications of Suffix Trees

• APL3: substring problem for a database of patterns. Given a set of strings S = S1, ..., Sn (a database), find all Si that have P as a substring

• Generalized suffix tree contains all suffixes of all Si

• Query in time O(|P|), and can identify the LONGEST common prefix of P in all Si

Page 57

Applications of Suffix Trees

• APL4: Longest common substring of two strings

• Find the longest common substring of S and T.

• Overall there are potentially O(n^2) such substrings, if n is the length of the shorter of S and T

• Donald Knuth once (1970) conjectured that a linear-time algorithm is impossible.

• Solution: construct the STree(S+T) and find the node deepest in the tree that has suffixes from both S and T in subtree leaves.

• Ex: S = superiorcalifornialives and T = sealiver both have a substring alive.
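The "deepest node with suffixes from both strings" idea can be tried out without building a tree. The sketch below swaps in a simpler technique: it sorts the suffixes of S and T together and compares neighbours from different strings, since the deepest common node always shows up as the longest common prefix of two such neighbours. It is a naive, far from linear-time illustration; the function name and the separator characters `\x00` and `\x01` (assumed not to occur in the inputs) are choices made here.

```python
def longest_common_substring(s, t):
    """Longest common substring via sorted suffixes of s + sep + t + end."""
    sep, end = "\x00", "\x01"           # assumption: neither occurs in s or t
    joined = s + sep + t + end
    order = sorted(range(len(joined)), key=lambda i: joined[i:])
    best = ""
    for a, b in zip(order, order[1:]):
        if (a < len(s)) == (b < len(s)):
            continue                    # both suffixes come from the same string
        lcp = 0
        while joined[a + lcp] == joined[b + lcp] and joined[a + lcp] not in (sep, end):
            lcp += 1
        if lcp > len(best):
            best = joined[a:a + lcp]
    return best

print(longest_common_substring("superiorcalifornialives", "sealiver"))
```

For the slide's example this prints alive.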

Page 58

Simple analysis task: LCSS

• Let LCSS(A,B) denote the longest common substring of two sequences A and B. E.g.:
– LCSS(AGATCTATCT, CGCCTCTATG) = TCTAT.

• A good solution is to build a suffix tree for the shorter sequence and make a descending suffix walk with the other sequence.

Page 59

Suffix link

[Figure: a suffix link points from the node for aX to the node for X]

Page 60

Descending suffix walk

[Figure: suffix tree of A with a node v]

Read B left-to-right, always going down in the tree when possible. If the next symbol of B does not match any edge label on the current position, take the suffix link, and try again. (Suffix link in the root to itself emits a symbol.) The node v encountered with the largest string depth is the solution.

Page 61

Applications of Suffix Trees

• APL5: Recognizing DNA contamination. Related to DNA sequencing, search for longest strings (longer than threshold) that are present in the DB of sequences of other genomes.

• APL6: Common substrings of more than two strings. Generalization of APL4, can be done in linear (in total length of all strings) time

Page 62

Another common tool: Generalized suffix tree

ACCTTA....ACCT#CACATT..CAT#TGTCGT...GTA#TCACCACC...C$

[Figure: generalized suffix tree; the node reached by the path ACC carries node info: subtree size 47813871, sequence count 87]

Page 63

Generalized suffix tree application

...ACC..#...ACC...#...ACC...ACC..ACC..#..ACC..ACC...#...ACC...#...

...#....#...#...#...ACC...#...#...#...#...#...#..#..ACC..ACC...#......#...

[Figure: the node reached by the path ACC carries node info: subtree size 4398, blue sequences 12/15, red sequences 2/62]

Page 64

Case study continued

[Figure: a genome with the regions with ChIP-seq matches marked, and the suffix tree of the genome; a node with 5 blue and 1 red occurrences corresponds to the pattern TAC..........T]

motif?

Page 65

Applications of Suffix Trees

• APL7: Building a directed graph for exact matching: suffix graph, a directed acyclic word graph (DAWG), the smallest finite state automaton recognizing all suffixes of a string S. This automaton can recognize membership, but not tell which suffix was matched.

• Construction: merge isomorphic subtrees.

• Subtrees are isomorphic in the suffix tree when a suffix link path exists between them and they have an equal number of leaves.

Page 66

Applications of Suffix Trees

• APL8: A reverse role for suffix trees, and major space reduction. Index the pattern, not tree...

• Matching statistics.

• APL10: All-pairs suffix-prefix matching. For all pairs Si, Sj, find the longest matching suffix-prefix pair. Used in shortest common superstring generation (e.g. DNA sequence assembly), EST alignment, etc.

Page 67

Applications of Suffix Trees

• APL11: Finding all maximal repetitive structures in linear time

• APL12: Circular string linearization. E.g. circular chemical molecules in the database, one wants to linearize them in a canonical way...

• APL13: Suffix arrays, more space reduction; will touch that separately

Page 68

Applications of Suffix Trees

• APL14: Suffix trees in genome-scale projects

• APL15: A Boyer-Moore approach to exact set matching

• APL16: Ziv-Lempel data compression

• APL17: Minimum length encoding of DNA

Page 69

Applications of Suffix Trees

• Additional applications. Mostly exercises...

• Extra feature: CONSTANT time lowest common ancestor retrieval (LCA)

A data structure that allows finding the lowest common ancestor in constant time (which corresponds to the longest common prefix!) can be built in linear time.

• APL: Longest common extension: a bridge to inexact matching

• APL: Finding all maximal palindromes in linear time

A palindrome reads the same to the left and to the right from its central position. E.g.: kirik, saippuakivikauppias.

• Build the suffix tree of S and inverted S (aabcbad => aabcbad#dabcbaa) and using the LCA one can ask for any position pair (i, 2i-1) the longest common prefix in constant time.

• The whole problem can be solved in O(n).

Page 70

Applications of Suffix Trees

• APL: Exact matching with wildcards

• APL: The k-mismatch problem

• Approximate palindromes and repeats

• Faster methods for tandem repeats

• A linear-time solution to the multiple common substring problem

• And many-many more...

Page 71

Properties of suffix tree

• Suffix tree has n leaves and at most n-1 internal nodes, where n is the total length of all sequences indexed.

• Each node requires a constant number of integers (pointers to first child, sibling, parent, text range of incoming edge, statistics counters, etc.).

• Can be constructed in linear time.

Page 72

Properties of suffix tree... in practice

• Huge overhead due to pointer structure:
– Standard implementation of suffix tree for human genome requires over 200 GB memory!
– A careful implementation (using log n -bit fields for each value and array layout for the tree) still requires over 40 GB.
– Human genome itself takes less than 1 GB using 2 bits per bp.

Page 73

1. Suffix tree

2. Suffix array

3. Some applications

4. Finding motifs

Page 74

Suffixes - sorted

• Sort all suffixes. Allows to perform binary search!

Suffixes in text order: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i, ε

Suffixes in sorted order: ε, atti, attivatti, hattivatti, i, ivatti, ti, tivatti, tti, ttivatti, vatti

Page 75

Suffix array: example

• suffix array = lexicographic order of the suffixes

rank   position   suffix
1      11         ε
2      7          atti
3      2          attivatti
4      1          hattivatti
5      10         i
6      5          ivatti
7      9          ti
8      4          tivatti
9      8          tti
10     3          ttivatti
11     6          vatti

Suffix array: 11 7 2 1 10 5 9 4 8 3 6
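The example can be reproduced by sorting the suffix start positions directly. This is a naive sketch for illustration, not one of the efficient constructions; the function name is an assumption. Positions are 1-based as in the slides, and position n+1 stands for the empty suffix ε.

```python
def suffix_array(text):
    """1-based start positions of all suffixes (including the empty one), sorted."""
    n = len(text)
    return sorted(range(1, n + 2), key=lambda i: text[i - 1:])

print(suffix_array("hattivatti"))
# [11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6]
```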

Page 76

Suffix array construction: sort!

• suffix array = lexicographic order of the suffixes

Suffixes in text order: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i, ε

Suffix array (ranks 1 to 11): 11 7 2 1 10 5 9 4 8 3 6

Page 77

Suffix array

• suffix array SA(T) = an array giving the lexicographic order of the suffixes of T

• space requirement: 5|T|

• practitioners like suffix arrays (simplicity, space efficiency)

• theoreticians like suffix trees (explicit structure)

Page 78

Reducing space: suffix array

[Figure: suffix tree of CATACT (positions 1 to 6) with leaves 4, 2, 1, 5, 3, 6; read left to right, the leaves form the suffix array. Edge labels are shown as position intervals such as [3,3], [2,2], [4,6], [6,6], [2,6], [3,6] and [5,6]]

Page 79

Suffix array

• Many algorithms on suffix tree can be simulated using suffix array...
– ...and a couple of additional arrays...
– ...forming the so-called enhanced suffix array...
– ...leading to a similar space requirement as a careful implementation of suffix tree

• Not a satisfactory solution to the space issue.

Pattern search from suffix array

Sorted suffixes of hattivatti:
ε, atti, attivatti, hattivatti, i, ivatti, ti, tivatti, tti, ttivatti, vatti

Suffix array: 11 7 2 1 10 5 9 4 8 3 6

Pattern att: binary search
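A hedged sketch of the binary search: all suffixes that start with the pattern form one contiguous block of the suffix array, so two binary searches find its two ends. The function name `find_occurrences` is illustrative, positions are 0-based inside the code, and the suffix array is built by plain sorting only to keep the example self-contained.

```python
def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the 0-based start positions of `pattern` in `text`, sorted.

    `sa` lists the suffix start positions in lexicographic order.
    """
    m = len(pattern)

    def prefix(k: int) -> str:
        # First m characters of the k-th smallest suffix.
        return text[sa[k]:sa[k] + m]

    # First suffix whose m-prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "hattivatti"
sa = sorted(range(len(text) + 1), key=lambda i: text[i:])
print([p + 1 for p in find_occurrences(text, sa, "att")])  # 1-based: [2, 7]
```

Each comparison inspects at most |P| characters, so a search costs O(|P| log |T|) character comparisons.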

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

What do we learn today?

• We learn that it is possible to replace suffix trees with compressed suffix trees that take 8.8 GB for the human genome.
• We learn that backtracking can be done using compressed suffix arrays requiring only 2.1 GB for the human genome.
• We learn that discovering interesting motif seeds from the human genome takes 40 hours and requires 9.3 GB of space.

Recent suffix array constructions

• Manber & Myers (1990): O(|T| log |T|)
• linear time via suffix tree
• January/June 2003: direct linear time construction of suffix array
  - Kim, Sim, Park, Park (CPM03)
  - Kärkkäinen & Sanders (ICALP03)
  - Ko & Aluru (CPM03)

Kärkkäinen-Sanders algorithm

1. Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.

2. Construct the suffix array of the remaining suffixes using the result of the first step.

3. Merge the two suffix arrays into one.

Notation

• string T = T[0,n) = t_0 t_1 … t_{n-1}
• suffix S_i = T[i,n) = t_i t_{i+1} … t_{n-1}
• for C ⊂ [0,n]: S_C = {S_i | i ∈ C}
• suffix array SA[0,n] of T is a permutation of [0,n] satisfying S_{SA[0]} < S_{SA[1]} < … < S_{SA[n]}

Running example

• T[0,n) = y a b b a d a b b a d o 0 0 …, with positions 0 1 2 3 4 5 6 7 8 9 10 11 for the twelve letters
• SA = (12,1,6,4,9,3,8,2,7,5,10,11,0)

Step 0: Construct a sample

• for k = 0,1,2: B_k = {i ∈ [0,n] | i mod 3 = k}
• C = B_1 ∪ B_2: sample positions
• S_C: sample suffixes
• Example: B_1 = {1,4,7,10}, B_2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}

Step 1: Sort sample suffixes

• for k = 1,2, construct
  R_k = [t_k t_{k+1} t_{k+2}][t_{k+3} t_{k+4} t_{k+5}] … [t_{max B_k} t_{max B_k + 1} t_{max B_k + 2}]

R = R_1 ^ R_2: concatenation of R_1 and R_2

Suffixes of R correspond to S_C: suffix [t_i t_{i+1} t_{i+2}] … corresponds to S_i; the correspondence is order preserving.

Sort the suffixes of R: radix sort the characters and rename them with ranks to obtain R´. If all characters are different, their order directly gives the order of the suffixes. Otherwise, sort the suffixes of R´ using Kärkkäinen-Sanders. Note: |R´| = 2n/3.

Step 1 (cont.)

• once the sample suffixes are sorted, assign a rank to each: rank(S_i) = the rank of S_i in S_C; rank(S_{n+1}) = rank(S_{n+2}) = 0
• Example:
  R = [abb][ada][bba][do0][bba][dab][bad][o00]
  R´ = (1,2,4,6,4,5,3,7)
  SA_R´ = (8,0,1,6,4,2,5,3,7)

  i          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
  rank(S_i)  -  1  4  -  2  6  -  5  3  -  7  8  -  0  0
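Steps 0 and 1 can be traced on the running example with a small sketch. It makes several simplifying assumptions that are not in the slides: `sorted()` stands in for radix sort, the suffixes of R´ are sorted directly instead of by the recursive call, the padding symbol is the character "0" (assumed smaller than every text character), and the name `sample_ranks` is made up.

```python
def sample_ranks(text: str, pad: str = "0") -> tuple[list[int], dict[int, int]]:
    """Return (R', rank) for the sample suffixes of `text`."""
    n = len(text)
    t = text + pad * 3                   # padding so that every triple is complete
    b1 = list(range(1, n + 1, 3))        # B_1 = {i in [0,n] | i mod 3 = 1}
    b2 = list(range(2, n + 1, 3))        # B_2 = {i in [0,n] | i mod 3 = 2}
    c = b1 + b2                          # sample positions, in the order used by R
    triples = [t[i:i + 3] for i in c]    # the "characters" of R
    # Rename every distinct triple with its rank among the distinct triples.
    name = {tr: k for k, tr in enumerate(sorted(set(triples)), start=1)}
    r_prime = [name[tr] for tr in triples]
    # Sort the suffixes of R' directly (the real algorithm recurses here).
    order = sorted(range(len(r_prime)), key=lambda k: r_prime[k:])
    rank = {c[k]: pos for pos, k in enumerate(order, start=1)}
    rank[n + 1] = rank[n + 2] = 0
    return r_prime, rank


r_prime, rank = sample_ranks("yabbadabbado")
print(r_prime)                # [1, 2, 4, 6, 4, 5, 3, 7]
print(rank[1], rank[2], rank[4], rank[5])  # 1 4 2 6
```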

Step 2: Sort nonsample suffixes

• for each non-sample S_i ∈ S_{B_0} (note that rank(S_{i+1}) is always defined for i ∈ B_0):
  S_i ≤ S_j ↔ (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))
• radix sort the pairs (t_i, rank(S_{i+1})).
• Example: S_12 < S_6 < S_9 < S_3 < S_0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
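Step 2 on the running example, as a short sketch. The ranks are copied from the Step 1 example; `sorted()` replaces the radix sort, the padding character "0" is assumed to be smaller than every letter, and `sort_nonsample` is an illustrative name.

```python
def sort_nonsample(text: str, rank: dict[int, int], pad: str = "0") -> list[int]:
    """Sort the suffixes S_i with i mod 3 = 0 by the pair (t_i, rank(S_{i+1}))."""
    n = len(text)
    t = text + pad                       # t[n] is the padding symbol
    b0 = range(0, n + 1, 3)              # B_0 = {i in [0,n] | i mod 3 = 0}
    return sorted(b0, key=lambda i: (t[i], rank[i + 1]))


# Ranks of the sample suffixes from the running example.
RANK = {1: 1, 2: 4, 4: 2, 5: 6, 7: 5, 8: 3, 10: 7, 11: 8, 13: 0, 14: 0}
print(sort_nonsample("yabbadabbado", RANK))  # [12, 6, 9, 3, 0]
```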

Step 3: Merge

• merge the two sorted sets of suffixes using a standard comparison-based merging:
• to compare S_i ∈ S_C with S_j ∈ S_{B_0}, distinguish two cases:
  • i ∈ B_1: S_i ≤ S_j ↔ (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))
  • i ∈ B_2: S_i ≤ S_j ↔ (t_i, t_{i+1}, rank(S_{i+2})) ≤ (t_j, t_{j+1}, rank(S_{j+2}))
• note that the ranks are defined in all cases!
• S_1 < S_6 as (a,4) < (a,5) and S_3 < S_8 as (b,a,6) < (b,a,7)
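The merge can be sketched as follows, again on the running example. The two sorted lists and the ranks are taken from the examples of Steps 1 and 2; the function name `merge_suffixes` and the padding character "0" are assumptions made for this illustration.

```python
def merge_suffixes(text, rank, sample_sorted, nonsample_sorted, pad="0"):
    """Merge sorted sample and non-sample suffixes into one suffix array."""
    t = text + pad * 3

    def sample_leq(i: int, j: int) -> bool:
        # i is a sample position (B_1 or B_2), j a non-sample position (B_0).
        if i % 3 == 1:
            return (t[i], rank[i + 1]) <= (t[j], rank[j + 1])
        return (t[i], t[i + 1], rank[i + 2]) <= (t[j], t[j + 1], rank[j + 2])

    merged, a, b = [], 0, 0
    while a < len(sample_sorted) and b < len(nonsample_sorted):
        if sample_leq(sample_sorted[a], nonsample_sorted[b]):
            merged.append(sample_sorted[a])
            a += 1
        else:
            merged.append(nonsample_sorted[b])
            b += 1
    # One of the two remainders is empty.
    return merged + sample_sorted[a:] + nonsample_sorted[b:]


RANK = {1: 1, 2: 4, 4: 2, 5: 6, 7: 5, 8: 3, 10: 7, 11: 8, 13: 0, 14: 0}
SAMPLE = [1, 4, 8, 2, 7, 5, 10, 11]      # sample positions by increasing rank
NONSAMPLE = [12, 6, 9, 3, 0]             # result of Step 2
print(merge_suffixes("yabbadabbado", RANK, SAMPLE, NONSAMPLE))
# [12, 1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
```

The printed array agrees with the SA given for the running example.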

Running time O(n)

• excluding the recursive call, everything can be done in linear time
• the recursion is on a string of length 2n/3
• thus the time is given by the recurrence T(n) = T(2n/3) + O(n)
• hence T(n) = O(n)
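The step from the recurrence to the linear bound can be spelled out by unrolling it into a geometric series. The constant c, bounding the non-recursive work per character, is introduced here only for the derivation.

```latex
\begin{align*}
T(n) &\le c\,n + T\!\left(\tfrac{2}{3}\,n\right) \\
     &\le c\,n \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k} + O(1) \\
     &= \frac{c\,n}{1 - \tfrac{2}{3}} + O(1) = 3\,c\,n + O(1) = O(n).
\end{align*}
```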

Implementation

• about 50 lines of C++
• code available e.g. via Juha Kärkkäinen’s home page

LCP table

• Longest Common Prefix of successive elements of suffix array:
• LCP[i] = length of the longest common prefix of suffixes S_{SA[i]} and S_{SA[i+1]}
• build inverse array SA^{-1} from SA in linear time
• then LCP table from SA^{-1} in linear time (Kasai et al, CPM 2001)
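A sketch in the spirit of the Kasai et al. method, adapted to the definition above (each suffix is compared with its successor in the suffix array). It assumes 0-based positions and that `sa` is a permutation of 0..len(sa)-1, possibly including the empty suffix; the name `lcp_table` is illustrative. The key observation is that when moving from the suffix at p to the suffix at p+1, the common prefix length drops by at most one, so the total work stays linear.

```python
def lcp_table(text: str, sa: list[int]) -> list[int]:
    """lcp[r] = length of the longest common prefix of suffixes sa[r] and sa[r+1]."""
    n = len(sa)
    inv = [0] * n                        # inverse suffix array SA^{-1}
    for r, p in enumerate(sa):
        inv[p] = r
    lcp = [0] * max(n - 1, 0)
    h = 0                                # current common prefix length
    for p in range(n):                   # visit suffixes in text order
        r = inv[p]
        if r + 1 == n:                   # last suffix in sorted order: no successor
            h = 0
            continue
        q = sa[r + 1]
        while p + h < len(text) and q + h < len(text) and text[p + h] == text[q + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1                       # the next suffix shares at least h - 1 characters
    return lcp


text = "hattivatti"
sa = sorted(range(len(text) + 1), key=lambda i: text[i:])
print(lcp_table(text, sa))  # [0, 4, 0, 0, 1, 0, 2, 1, 3, 0]
```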

• Oxford English Dictionary http://www.oed.com/
• Example - Word of the Day, Fourth
  http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/L7_SuffixTrees/wotd_fourth.html
  http://www.oed.com/cgi/display/wotd
• PAT index - by Gaston Gonnet (he is also one of the creators of the Maple software and later one of the developers of a molecular biology software package)
• PAT index is essentially a suffix array. To save space, it was indexed only from the first character of every word
• XML-tagging (or SGML, at that time!) also indexed
• To mark certain fields of XML, bit vectors were used.
• Main concern - improve the speed of search on the CD - minimize random accesses.
• For a slow medium even 15-20 accesses is too slow...
• G.H. Gonnet, R.A. Baeza-Yates, and T. Snider, Lexicographical indices for text: Inverted files vs. PAT trees, Technical Report OED-91-01, Centre for the New OED, University of Waterloo, 1991.

Suffix tree vs suffix array

• suffix tree ⇔ suffix array + LCP table


1. Suffix tree

2. Suffix array

3. Some applications

4. Finding motifs

Substring motifs of string T

• string T = t_1 … t_n in alphabet A.
• Problem: what are the frequently occurring (ungapped) substrings of T? Longest substring that occurs at least q times?
• Thm: Suffix tree Tree(T) gives complete occurrence counts of all substring motifs of T in O(n) time (although T may have O(n^2) substrings!)

Counting the substring motifs

• internal nodes of Tree(T) ↔ repeating substrings of T
• number of leaves of the subtree of a node for string P = number of occurrences of P in T
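The leaves below a node occupy a contiguous block of the sorted suffixes, so the question "longest substring that occurs at least q times" can also be answered without building the tree. The sketch below swaps the suffix tree for naively sorted suffixes (a simplification, not the linear-time method of the theorem); `longest_repeat` is a made-up name.

```python
def longest_repeat(text: str, q: int) -> str:
    """Return a longest substring of `text` that occurs at least q times (q >= 1)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    def common_prefix(a: int, b: int) -> int:
        h = 0
        while a + h < len(text) and b + h < len(text) and text[a + h] == text[b + h]:
            h += 1
        return h

    best = ""
    for r in range(len(sa) - q + 1):
        # In sorted order, the prefix shared by q consecutive suffixes is the
        # prefix shared by the first and the last of them.
        h = common_prefix(sa[r], sa[r + q - 1])
        if h > len(best):
            best = text[sa[r]:sa[r] + h]
    return best


print(longest_repeat("hattivatti", 2))  # atti
print(longest_repeat("hattivatti", 4))  # t
```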

Substring motifs of hattivatti

[Figure: suffix tree of hattivatti with the occurrence count written at each internal node: atti 2, i 2, t 4, ti 2, tti 2.]

Counts for the O(n) maximal motifs shown

Finding repeats in DNA

• human chromosome 3
• the first 48999930 bases
• 31 min cpu time (8 processors, 4 GB)
• Human genome: 3x10^9 bases
• Tree(HumanGenome) feasible

Longest repeat?

Occurrences at: 28395980, 28401554r Length: 2559

ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaatgctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcatagtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatcaccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgccattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggcttttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgtaagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaagggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagactttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctctt
ttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagttagg

Ten occurrences?

ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt

Length: 277

Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125
In the reversed complement at: 17858493, 41463059, 42431718, 42580925

Using suffix trees: plagiarism

• find longest common substring of strings X and Y
• build Tree(X$Y) and find the deepest node which has a leaf pointing to X and another pointing to Y
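The same idea can be sketched on the sorted suffixes of X$Y instead of the tree: the deepest node with leaves from both strings corresponds to two neighbouring suffixes, one starting in X and one in Y, with the longest common prefix. This is a simplified stand-in that sorts suffixes naively; the function name and the default separator are assumptions, and the separator must not occur in either string.

```python
def longest_common_substring(x: str, y: str, sep: str = "$") -> str:
    """Return a longest string that occurs in both x and y."""
    if sep in x or sep in y:
        raise ValueError("separator must not occur in the inputs")
    s = x + sep + y
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    best = ""
    for a, b in zip(sa, sa[1:]):
        # Only neighbours that start in different strings are of interest.
        if (a < len(x)) == (b < len(x)):
            continue
        h = 0
        while a + h < len(s) and b + h < len(s) and s[a + h] == s[b + h]:
            h += 1
        if h > len(best):
            best = s[a:a + h]
    return best


print(longest_common_substring("stockholm", "tukholma"))  # kholm
```

Because the separator occurs only once, a common prefix of two different suffixes can never run across it.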

Using suffix trees: approximate matching

• edit distance: insertions, deletions, changes
• STOCKHOLM vs TUKHOLMA

String distance/similarity functions

STOCKHOLM vs TUKHOLMA

STOCKHOLM_
_TU_KHOLMA

=> 2 deletions, 1 insertion, 1 change

Dynamic programming

d_{i,j} = min(if a_i = b_j then d_{i-1,j-1} else ∞, d_{i-1,j} + 1, d_{i,j-1} + 1)
        = distance between the i-prefix of A and the j-prefix of B (substitution excluded)

[Figure: m x n table d for strings A and B; cell d_{i,j} (row a_i, column b_j) is computed from d_{i-1,j-1}, from d_{i-1,j} with +1 and from d_{i,j-1} with +1; the final distance is in cell d_{m,n}.]

A\B     s  t  o  c  k  h  o  l  m
     0  1  2  3  4  5  6  7  8  9
t    1  2  1  2  3  4  5  6  7  8
u    2  3  2  3  4  5  6  7  8  9
k    3  4  3  4  5  4  5  6  7  8
h    4  5  4  5  6  5  4  5  6  7
o    5  6  5  4  5  6  5  4  5  6
l    6  7  6  5  6  7  6  5  4  5
m    7  8  7  6  7  8  7  6  5  4
a    8  9  8  7  8  9  8  7  6  5

d_{i,j} = min(if a_i = b_j then d_{i-1,j-1} else ∞, d_{i-1,j} + 1, d_{i,j-1} + 1)

d_ID(A,B) is the bottom-right cell (here 5); optimal alignment by trace-back
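The recurrence translates almost line by line into code. The sketch below fills the full table for the insertion/deletion distance (no substitutions); the function name `indel_distance` is chosen here for illustration.

```python
import math


def indel_distance(a: str, b: str) -> int:
    """Distance between a and b when only insertions and deletions are allowed."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete the whole i-prefix of a
    for j in range(n + 1):
        d[0][j] = j                      # insert the whole j-prefix of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # The diagonal step is free on a match and forbidden otherwise.
            diagonal = d[i - 1][j - 1] if a[i - 1] == b[j - 1] else math.inf
            d[i][j] = min(diagonal, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n]


print(indel_distance("tukholma", "stockholm"))  # 5
```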

Search problem

• find approximate occurrences of pattern P in text T: substrings P’ of T such that d(P,P’) is small
• dynamic programming with a small modification: O(mn)
• lots of (practical) improvement tricks

[Figure: pattern P aligned against a substring P’ of text T.]
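One common form of the "small modification" is to initialise the row for the empty pattern prefix with zeros, so that an occurrence may start at any text position. The slide does not say which modification is meant, so this is an assumption; the sketch also uses the edit distance with substitutions and keeps only one column of the table at a time. The name `approx_end_positions` and the threshold parameter `k` are introduced here.

```python
def approx_end_positions(pattern: str, text: str, k: int) -> list[int]:
    """Return the 1-based end positions j such that some substring of `text`
    ending at j is within edit distance k of `pattern`."""
    m = len(pattern)
    prev = list(range(m + 1))            # column for the empty text prefix
    ends = []
    for j, c in enumerate(text, start=1):
        cur = [0] * (m + 1)              # cur[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur[i] = min(prev[i - 1] + cost,   # match or change
                         prev[i] + 1,          # extra text character
                         cur[i - 1] + 1)       # pattern character left unmatched
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


print(approx_end_positions("abc", "xxabcxx", 0))  # [5]
print(approx_end_positions("abc", "xxabcxx", 1))  # [4, 5, 6]
```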

Index for approximate searching?

• dynamic programming: P x Tree(T) with backtracking

[Figure: pattern P matched against the paths of Tree(T).]

Burrows-Wheeler Transformation

• BWT for text compression and indexing

Burrows-Wheeler

• See FAQ http://www.faqs.org/faqs/compression-faq/part2/section-9.html
• The method described in the original paper is really a composite of three different algorithms:
  – the block sorting main engine (a lossless, very slightly expansive preprocessor),
  – the move-to-front coder (a byte-for-byte simple, fast, locally adaptive noncompressive coder) and
  – a simple statistical compressor (first order Huffman is mentioned as a candidate) eventually doing the compression.
• Of these three methods only the first two are discussed here as they are what constitutes the heart of the algorithm. These two algorithms combined form a completely reversible (lossless) transformation that - with typical input - skews the first order symbol distributions to make the data more compressible with simple methods. Intuitively speaking, the method transforms slack in the higher order probabilities of the input block (thus making them more even, whitening them) to slack in the lower order statistics. This effect is what is seen in the histogram of the resulting symbol data.
• Please, read the article by Mark Nelson:
• Data Compression with the Burrows-Wheeler Transform, Mark Nelson, Dr. Dobb's Journal, September, 1996. http://marknelson.us/1996/09/01/bwt/

BWT

[Figure: BWT example on the string DRDOBBS (positions 0 1 2 3 4 5 6) with the transformation vector 1 6 4 5 0 2 3.]

CODE:
t: hat acts like this:<13><10><1
t: hat buffer to the constructor
t: hat corrupted the heap, or wo
W: hat goes up must come down<13
t: hat happens, it isn't likely
w: hat if you want to dynamicall
t: hat indicates an error.<13><1
t: hat it removes arguments from
t: hat looks like this:<13><10><
t: hat looks something like this
t: hat looks something like this
t: hat once I detect the mangled

Example

• Decode: errktreteoe.e
• Hint: . is the last character, alphabetically first…
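For checking a hand decoding, here is a small sketch of the transform and its inverse. It assumes the convention suggested by the hint: the text ends with a unique end marker that sorts before every other character, and the transform is the last column of the sorted rotations. The names `bwt` and `bwt_inverse` are made up, and sorting whole rotations is only suitable for short strings.

```python
def bwt(text: str) -> str:
    """Last column of the sorted rotations of `text`."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def bwt_inverse(last: str, end: str = ".") -> str:
    """Invert `bwt`, assuming `end` occurs once and is the smallest character."""
    n = len(last)
    # A stable sort of the last column gives the first column; the k-th
    # occurrence of a character in one column is the k-th in the other.
    order = sorted(range(n), key=lambda i: last[i])
    to_first = [0] * n                   # row of the rotation that starts with last[i]
    for first_row, last_row in enumerate(order):
        to_first[last_row] = first_row
    row = 0                              # row 0 is the rotation starting with `end`
    out = []
    for _ in range(n - 1):
        out.append(last[row])            # the character preceding the current rotation
        row = to_first[row]
    return "".join(reversed(out)) + end


assert bwt_inverse(bwt("banana.")) == "banana."
```

Calling `bwt_inverse("errktreteoe.e")` then reproduces the decoding asked for in the exercise.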
