Text Algorithms (6EAP)
Approximate Matching
Jaak Vilo, 2017 fall
MTAT.03.190 Text Algorithms, Jaak Vilo

Apr 10, 2018


Exact vs approximate search

• In exact search we searched for a string or a set of strings in a long text
• Then we learned how to measure the similarity between sequences
• There are plenty of applications that require approximate search
• Approximate matching, i.e. find those regions in a long text that are similar to the query string
• E.g. find the substrings of S that have edit distance at most k to the query string P

Problem

• Given P and S, find all approximate occurrences of P in S

[Figure: text S and pattern P]

• Reviews
  – P. A. Hall and G. R. Dowling. Approximate string matching. ACM Computing Surveys, 12(4):381--402, 1980.
  – G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31--88, 2001. (Technical Report TR/DCC-99-5, Dept. of Computer Science, Univ. of Chile, 1999.)
• Algorithms
  – S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83--91, 1992.
  – G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46(3):395--415, 1999.
  – A. Amir, M. Lewenstein, and E. Porat. Faster algorithms for string matching with k mismatches. In Proc. 11th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 794--803, 2000.

• Multiple approximate matching
  – R. Muth and U. Manber. Approximate multiple string search. In Proc. CPM'96, pages 75--86, 1996.
  – Kimmo Fredriksson, publications: http://www.cs.uku.fi/~fredriks/publications.html
• Applications
  – Udi Manber. A simple scheme to make passwords based on one-way functions much harder to crack. Computers and Security, 15(2):171--176, 1996. (TR 94-34)
• Tools
  – Webglimpse: glimpse, agrep
  – agrep for Win/DOS
  – Original agrep
• Links
  – Pattern Matching Pointers (Stefano Lonardi)
  – Articles

Problem statement

• Let S = s1s2...sn ∈ Σ* be a text and P = p1p2...pm the pattern. Let k be a pregiven constant.
• Main problems
• k mismatches
  – Find from S all substrings X, |X| = |P|, that differ from P at max k positions (Hamming distance)
• k differences
  – Find from S all substrings X, where D(X,P) ≤ k (Edit distance)
• best match
  – Find from S such substrings X, that D(X,P) is minimal
• Distance D can be defined using one of the ways from previous chapters

Measure edit distance

Find approximate occurrences

Algorithm for approximate search, k edit operations

Input: P, S, k
Output: Approximate occurrences of P in S (with edit distance ≤ k)

    for j = 0 to m do h[j,0] = j                  // Initialize first column
    for i = 1 to n do
        h[0,i] = 0
        for j = 1 to m do
            h[j,i] = min( h[j-1,i-1] + (if p[j] == s[i] then 0 else 1),
                          h[j-1,i] + 1,
                          h[j,i-1] + 1 )
        if h[m,i] ≤ k
            Report match at i
            Trace back and report the minimizing path (from-to)
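A minimal Python sketch of the algorithm above, assuming unit costs and 1-based text positions as in the pseudocode. The function name `approx_search` and the `(start, end, distance)` output format are illustrative choices, not part of the slides. It keeps the full matrix so that the start of each occurrence can be recovered by tracing back one minimizing path.

```python
def approx_search(P, S, k):
    """Return (start, end, distance) triples, 1-based and inclusive, for every
    end position whose best alignment of P has edit distance <= k."""
    m, n = len(P), len(S)
    # h[j][i]: row j = pattern prefix length, column i = text prefix length
    h = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        h[j][0] = j                      # first column; row 0 stays all zeros
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[j][i] = min(h[j - 1][i - 1] + (P[j - 1] != S[i - 1]),
                          h[j - 1][i] + 1,
                          h[j][i - 1] + 1)
    matches = []
    for i in range(1, n + 1):
        if h[m][i] <= k:
            # Trace one minimizing path back to row 0 to find where it starts.
            j, c = m, i
            while j > 0:
                if c > 0 and h[j][c] == h[j - 1][c - 1] + (P[j - 1] != S[c - 1]):
                    j, c = j - 1, c - 1
                elif h[j][c] == h[j - 1][c] + 1:
                    j -= 1
                else:
                    c -= 1
            matches.append((c + 1, i, h[m][i]))
    return matches
```

For example, `approx_search("aste", "lasteaed", 0)` reports the single exact occurrence as `(2, 5, 0)`. Because several minimizing paths may exist, the reported start is only one of the possible choices.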

Example

[Figure: partially filled matrix h for the text abracadabra (columns, row 0 all zeros) and the pattern rada (rows)]

• Theorem: Let us assume that in the matrix h the path that leads to the value h[m,j] in the last row starts from the square h[0,r]. Then the edit distance D(P, s_{r+1}s_{r+2}...s_j) = h[m,j], and h[m,j] is the minimal such distance over all substrings of S ending at the j'th position: h[m,j] = min{ D(P, s_t s_{t+1}...s_j) | t ≤ j }
• Proof by induction
• Every minimizing path starts from some value in row 0
• Since it is possible to reach the same result via multiple paths, the approximate match is not always unique

• Time and space complexity O(mn)
• As n can be large, it is sufficient to keep only the last m+k columns, which can fully fit the full optimal path.
• Space complexity O(m^2)
• Or, one can keep just the single last column and, in case of a match, recalculate the exact path.
• Space complexity O(m)
• If there is no need to find the path, O(m) space is enough
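The single-column variant can be sketched as follows. This is an illustrative Python version under the same unit-cost assumptions; the name `approx_ends` is hypothetical. It reports only the 1-based end positions, so the path itself would have to be recomputed separately when a match is found.

```python
def approx_ends(P, S, k):
    """End positions (1-based) in S where P matches with edit distance <= k,
    using O(m) space: only the current column of the matrix is kept."""
    m = len(P)
    col = list(range(m + 1))          # column 0: col[j] = j
    ends = []
    for i, ch in enumerate(S, start=1):
        diag = col[0]                 # value of row j-1 in the previous column
        col[0] = 0                    # row 0 is always 0
        for j in range(1, m + 1):
            left = col[j]             # row j in the previous column
            col[j] = min(diag + (P[j - 1] != ch),   # match / substitution
                         col[j - 1] + 1,            # row j-1, current column
                         left + 1)                  # row j, previous column
            diag = left
        if col[m] <= k:
            ends.append(i)
    return ends
```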

• Diagonal lemma will hold
• If one needs to find only the regions with at most k edit operations, then one can restrict the depth of the calculations

• It suffices to compute until the k-border
• Modified algorithm (home assignment) will work in average time O(kn)
• There are better methods which work in O(kn) in the worst case.
• Landau & Vishkin (1988), Chang & Lampe (1991).

Improved average case

E. Ukkonen. Finding approximate patterns in strings. Journal of Algorithms, 6(1-3):132-137, 1985.

    // Preprocessing (assumes k < m)
    for j = 0..m do C[j] = j
    lact = k+1                                  // last active row
    // Searching
    for i = 1..n
        pC = 0; nC = 0                          // previous and new column value
        for j = 1..lact
            if S[i] == P[j] then nC = pC        // why?
            else
                if pC < nC then nC = pC
                if C[j] < nC then nC = C[j]
                nC = nC + 1
            pC = C[j]
            C[j] = nC
        while C[lact] > k do lact = lact - 1
        if lact = m then report match at position i
        else lact = lact + 1

Ukkonen 1985; O(kn)
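The cut-off idea can be sketched in Python as below. This is an illustrative transcription, not the author's code: the name `ukkonen_cutoff` is hypothetical, positions are 1-based, and `min(k + 1, m)` is an added guard for the case k ≥ m. Only the rows up to the last active one (the last row whose value is at most k) are recomputed in each column.

```python
def ukkonen_cutoff(P, S, k):
    """End positions (1-based) of approximate occurrences with at most k
    differences, computing each column only down to the last active row."""
    m = len(P)
    C = list(range(m + 1))            # current column, C[j] = j initially
    lact = min(k + 1, m)              # last active row (guarded for k >= m)
    ends = []
    for i, ch in enumerate(S, start=1):
        pC = nC = 0                   # diagonal value and value just above
        for j in range(1, lact + 1):
            if ch == P[j - 1]:
                nC = pC               # a match continues the diagonal for free
            else:
                nC = min(pC, nC, C[j]) + 1
            pC = C[j]
            C[j] = nC
        while C[lact] > k:            # C[0] == 0, so this always stops
            lact -= 1
        if lact == m:
            ends.append(i)
        else:
            lact += 1
    return ends
```

Rows below the last active one may hold stale values, but those values are always greater than k, which is all the recurrence needs in order to decide whether a cell is at most k.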

Four Russians technique

• This is a general technique that can be applied in different contexts
• It improves the speed of matrix multiplications
• Has been used for regular expression and approximate matching
• Let the column vector d*j = (d0j, ..., dmj) present the current state
• Let's preprocess the automaton from each state
• F(X, a) = Y, s.t. column vector X after reading character a becomes column vector Y.
• Example: Let's find P = abc approximate matches when there is at most 1 operation allowed.

Four Russians technique

• There are 13 different possibilities:

      d0:  0 0 0 0 0 0 0 0 0 0 0 0 0
      d1:  0 0 0 0 0 1 1 1 1 1 1 1 1
      d2:  0 0 1 1 1 0 0 1 1 1 2 2 2
      d3:  0 1 0 1 2 0 1 0 1 2 1 2 3

• From each state compute possible next states for all characters a, b, c, and x (x not in P)
• The states with dmj ≤ 1 are final states.
• This can become too large to handle.
• Cut the regions into smaller pieces, use that to reduce the complexity.
• Navarro and Raffinot, Flexible Pattern Matching in Strings (Cambridge University Press, 2002), p. 152, Fig. 6.5.
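The preprocessing step F(X, a) = Y can be sketched as below. The sketch assumes that the candidate states are all column vectors with d0 = 0, non-negative entries, and neighbouring entries differing by at most 1. For m = 3 this enumeration gives 13 columns, although not every such column need be reachable for a particular pattern. The names `column_states`, `step` and `build_table` are hypothetical.

```python
from itertools import product

def column_states(m):
    """All vectors (d0..dm) with d0 = 0, entries >= 0, neighbours differing by <= 1."""
    states = []
    for deltas in product((-1, 0, 1), repeat=m):
        col = [0]
        for d in deltas:
            col.append(col[-1] + d)
        if min(col) >= 0:
            states.append(tuple(col))
    return sorted(states)

def step(X, P, a):
    """F(X, a): the column that follows column X after reading character a."""
    Y = [0]
    for j in range(1, len(X)):
        Y.append(min(X[j - 1] + (P[j - 1] != a), X[j] + 1, Y[j - 1] + 1))
    return tuple(Y)

def build_table(P, alphabet):
    """Precompute F for every candidate column and every character."""
    return {(X, a): step(X, P, a)
            for X in column_states(len(P)) for a in alphabet}

# Usage: P = abc, with 'x' standing for any character not in P.
table = build_table("abc", "abcx")
final_states = [X for X in column_states(3) if X[-1] <= 1]
```

After this preprocessing a search needs one table lookup per text character. The table grows quickly with m, which is why the slides suggest cutting the columns into smaller pieces.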

Four-Russians version

NFA/DFA

• Create an automaton for matching a word approximately
• Allow 0, 1, … n errors

Regular expressions

Filtering techniques

• q-gram (also k-mer, oligomer)
  – a (sub)string of length q
• Let's have a pattern P of length m
• Assume pattern P is rather long and k is small; find occurrences with at most k mismatches
• How long substrings of P must have an exact match?
• If the mismatches are spread most evenly, k mismatches cut P into k+1 pieces of length ~m/(k+1)

K mismatches

• K = 3

[Figure: pattern P with 3 mismatch positions]

• For a 3-mismatch match, at least one substring of length (m-3)/4 must occur exactly.

Filtering techniques with q-grams

• If P occurs in S with at most k mismatches, then S must contain an exact copy of at least one substring of P whose length is at least ⌈(m-k)/(k+1)⌉
• Filter for all possible q-mers where q is carefully selected.
  – Be careful with overlapping and non-overlapping q-grams.
  – If non-overlapping, then how long exact matches can we find?
• Use multiple exact matching O(n) (or sublinear) algorithms
• When an exact match of such a substring is found, there is a possibility for an approximate overall match.
• Check for the actual match
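A small filter-and-verify sketch for the k-mismatch case. It assumes the simplest partition scheme: P is cut into k+1 consecutive pieces, so at least one piece must occur exactly inside any occurrence with at most k mismatches. The function name `k_mismatch_filter` and the use of `str.find` as the exact matcher are illustrative choices; a real implementation would use a multi-pattern exact matching algorithm.

```python
def k_mismatch_filter(P, S, k):
    """0-based start positions where P occurs in S with at most k mismatches."""
    m, n = len(P), len(S)
    # Cut P into k+1 consecutive pieces of (almost) equal length.
    base, extra = divmod(m, k + 1)
    pieces, pos = [], 0
    for t in range(k + 1):
        length = base + (1 if t < extra else 0)
        pieces.append((pos, P[pos:pos + length]))
        pos += length
    # Filter: every exact occurrence of a piece proposes one candidate start.
    candidates = set()
    for offset, piece in pieces:
        if not piece:                          # m < k+1: the filter cannot help
            candidates.update(range(n - m + 1))
            continue
        hit = S.find(piece)
        while hit != -1:
            start = hit - offset
            if 0 <= start <= n - m:
                candidates.add(start)
            hit = S.find(piece, hit + 1)
    # Verify: count the mismatches at each candidate position.
    return [s for s in sorted(candidates)
            if sum(a != b for a, b in zip(P, S[s:s + m])) <= k]
```

The verification step is what keeps the result exact. The filter only decides which positions are worth checking, so it pays off when k/m is small and few candidates survive.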

Filter and verify!

[Figure: pattern P]

Filtering techniques cont.

• Lots of research on approximate matching using q-gram techniques
• The wheel has been reinvented many times in different fields

Indexing using q-grams

• Filtering can also be used for indexing. E.g. index all q-grams and their matches in S.
• If one searches for P, first search for q-grams in the index. If a sufficient number of matches is found, then make the comparison to see if the match is real.
• Filtering should be efficient for cases where a high similarity match for a long pattern is looked for.
• This is like a reverse index for texts:

      word     doc_id:word_id  doc_id:pos_id ...
      word1    1:5  7:9  167:987 ...
      word2    2:5  3:67  8:10  67:3 ...
      word3    3:5  5:67  7:10  16:3 ...
      ...

• Q: where do word1 and word3 occur together?
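A minimal sketch of a q-gram index used as a filter. The names `build_qgram_index` and `candidate_starts`, and the vote threshold `min_hits`, are hypothetical. Each q-gram of P found in the index votes for the text position where P would have to start, and positions with enough votes become candidates that still need to be verified.

```python
from collections import defaultdict

def build_qgram_index(S, q):
    """Map every q-gram of S to the list of its 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(S) - q + 1):
        index[S[i:i + q]].append(i)
    return index

def candidate_starts(index, P, q, min_hits):
    """Start positions of S supported by at least min_hits q-grams of P."""
    votes = defaultdict(int)
    for offset in range(len(P) - q + 1):
        for pos in index.get(P[offset:offset + q], ()):
            votes[pos - offset] += 1
    return sorted(s for s, v in votes.items() if v >= min_hits and s >= 0)

# Usage: index the text once, then filter candidates for each query.
index = build_qgram_index("abracadabra", 2)
starts = candidate_starts(index, "abra", 2, 3)
```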

Bit parallel search

• Can we use bit-parallelism for approximate search?

• T = lasteaed, P = aste

           l  a  s  t  e  a  e  d
      a    0  1  0  0  0  1
      s    0  0  1  0  0  0
      t    0  0  0  1  0  0
      e    0  0  0  0  1  0
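In the matrix above, a 1 in row j marks a text position where the first j pattern characters have just been matched. A sketch of exact bit-parallel search that keeps one such column in the bits of an integer is given below. It uses the Shift-And convention (a set bit means "active", the opposite of SHIFT-OR); the function name `shift_and` is illustrative and a non-empty pattern is assumed.

```python
def shift_and(P, T):
    """1-based end positions of exact occurrences of a non-empty P in T."""
    m = len(P)
    B = {}                                  # B[c]: bit i set iff P[i] == c
    for i, c in enumerate(P):
        B[c] = B.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    D, ends = 0, []
    for pos, c in enumerate(T, start=1):
        # Extend every active prefix by one character and start a new one.
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & accept:
            ends.append(pos)
    return ends
```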

Generalized patterns

• A generalized pattern P = p1p2...pm consists of generalized characters pi such that each pi represents a non-empty subset of the alphabet Σ:
• pi = a, a ∈ Σ
• pi = #, "wildcard" (any number of any symbols)
• pi = [group]; e.g.: [abc], [^abc], [a-h], ...
• pi = ¬C; characters from the set Σ-C.
• Example: [Tt][aeiou][kpt]#[^aeiou][mnr] matches Tekstialgoritm but not the word tekstuur.
• Problem: Search for generalized patterns from text
• Compare to the SHIFT-OR algorithm!
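One possible way to search for such patterns with bit-parallelism is sketched below. It uses the Shift-And convention (a set bit means "active"), and the input format is an assumption of this sketch: the pattern is given as a list of strings, where "#" is the wildcard of arbitrary length, a leading "^" negates a set, and ranges such as [a-h] are not expanded. The wildcard is modelled as a self-loop on the preceding position; a leading "#" is simply ignored because the search is not anchored.

```python
def search_generalized(elements, text):
    """1-based end positions where the generalized pattern matches in text.
    elements: e.g. ["Tt", "aeiou", "kpt", "#", "^aeiou", "mnr"]."""
    classes, loops = [], 0
    for e in elements:
        if e == "#":
            if classes:                       # the previous position may repeat
                loops |= 1 << (len(classes) - 1)
        elif e.startswith("^"):
            classes.append((True, set(e[1:])))    # negated set
        else:
            classes.append((False, set(e)))
    accept = 1 << (len(classes) - 1)
    masks = {}                                # character -> bit mask, built lazily
    D, ends = 0, []
    for pos, c in enumerate(text, start=1):
        if c not in masks:
            masks[c] = sum(1 << i for i, (neg, s) in enumerate(classes)
                           if (c in s) != neg)
        D = (((D << 1) | 1) & masks[c]) | (D & loops)
        if D & accept:
            ends.append(pos)
    return ends
```

With the example from the slide, `search_generalized(["Tt", "aeiou", "kpt", "#", "^aeiou", "mnr"], "Tekstialgoritm")` finds a match ending at the final m, while the same pattern finds nothing in "tekstuur".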

P = a[b-h]a¬a        // agrep a[b-h]a[^a]

                    p  a  g  a  n  a  m  a  a
      a          1  1  0  1  0  1
      [b-h]      2  2  1  0  1  1
      a          3  3  2  1  0  1
      ¬a         4  3  3  2  1  0

zero at last row – exact match!

• What about mismatches?
• Mismatch if the character does not belong to the class defined by the pattern. Unit cost 1.
• SHIFT-ADD: similar to SHIFT-OR, but instead of OR an ADD is used.

• (no insertions/deletions in this example)

P = a[kpt]a¬a        // agrep a[kpt]a[^a]

      paganamaa
      0000000000
      a      110101
      [kpt]  2211 21
      a      33221 3
      ¬a     433221

1 at last pos – match with 1 mismatch!

• Each value of the matrix d_ij can be represented with b bits (4 bits allow values up to 15). Columns can be simple integers.
• $B_j = d_{mj}\,2^{b(m-1)} + d_{m-1,j}\,2^{b(m-2)} + \dots + d_{1j}$ (d_{0j} is always 0 and can be omitted)
• The next column is obtained by shifting and then adding another integer that has 0 at position i if the text character at position j belongs to the set represented by p_i, and 1 otherwise.

        010 001 000 001 011
      + 001 001 000 000 001
      ---------------------
      = 011 010 000 001 100

• One needs to be very careful not to have overflow (111 + 001 = 1000).
• Shift by 3 positions == multiply by 8

      010 001 000 001 011 * 8 = 001 000 001 011 000
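The shift-then-add update can be sketched as follows for plain (non-generalized) patterns. This is an illustrative Python version; the name `shift_add` is hypothetical. To sidestep the overflow problem it chooses the field width b so that a counter can hold any value up to m, which is a simplifying assumption of this sketch rather than the space-saving choice a real implementation would make.

```python
def shift_add(P, T, k):
    """1-based end positions where P matches T with at most k mismatches
    (substitutions only), using one b-bit counter per pattern position."""
    m = len(P)
    b = m.bit_length() + 1               # wide enough for counts up to m
    mask = (1 << (m * b)) - 1            # keep exactly m fields
    # table[c] has a 1 in field i iff P[i] != c (a mismatch costs 1).
    table = {c: sum(1 << (i * b) for i in range(m) if P[i] != c)
             for c in set(P) | set(T)}
    state, ends = 0, []
    for pos, c in enumerate(T, start=1):
        # Shift by one field (multiply by 2**b), then add the mismatch vector.
        state = ((state << b) + table[c]) & mask
        # The top field holds the mismatches of all of P against T[pos-m+1..pos].
        if pos >= m and (state >> ((m - 1) * b)) <= k:
            ends.append(pos)
    return ends
```

With 3-bit fields, as in the arithmetic above, a counter already overflows at 8, so practical implementations either cap the counters or reserve an extra overflow bit per field.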

Use multiple vectors, one for each k value

• One can also use several individual 1-bit vectors, each corresponding to a different k
• Can be extended to mask out regions where mismatches are NOT allowed
• Can introduce wildcards of arbitrary length

Bit-parallelism

• Maintain a list of possible “states”
• Update lists using bit-level operations

Example (note: least significant bit is left in this output)

Pattern = AC#T<GA>[TG]A, length 7, # = .*

      CV[char]
      A 65       11111111111111111111111110101110
      C 67       11111111111111111111111111111101
      G 71       11111111111111111111111111010111
      T 84       11111111111111111111111111011011
      WILDCARD   11111111111111111111111111111101
      ENDMASK    00000000000000000000000001000000
      NO_ERROR   00000000000000000000000000011000
                                          7654321

0 – position is “active”

• R[0] – vector for (so far) 0 mismatches
• R[1] – vector for (so far) 1 mismatch
• R[2] – vector for (so far) 2 mismatches
• “Minimum” by bitwise AND
• If (even) one of the vectors has 0, then bitwise AND produces 0 (which is the smaller of 0 and 1, 1 and 0, 0 and 0)
• If both (or all) of the vectors have 1, then bitwise AND produces 1 (which is the smaller of 1 and 1)

• How to get new values from old ones
• P[0] P[1] ... => R[0] R[1] ...
  – R[i] is the min of three possibilities:

      (P[i] shift 1) bitor CV[textchar]      // previously active, now match with character
      (P[i] bitor WILDCARD)                  // wildcard match – the same position remains active
      (P[i-1] shift 1 bitor NO_ERROR)        // previously 1 error less (unless NO_ERROR forbids errors here)

The algorithm

• R[i] in general is the minimum of 3 possibilities:

      (P[i] shift 1) bitor CV[textchar] &    // match
      (P[i] bitor WILDCARD) &                // wildcard
      (P[i-1] shift 1 bitor NO_ERROR)        // mismatch

Last -- Add one mismatch unless errors not allowed

diktorantuur

BPR (p = p1p2...pm, T = t1t2...tn, k)
    Preprocessing
        for c ∈ Σ Do B[c] <- 0^m
        for j ∈ 1 ... m Do B[pj] <- B[pj] | 0^(m-j) 1 0^(j-1)
    Searching
        for i ∈ 0 ... k Do Ri <- 0^(m-i) 1^i
        for pos ∈ 1 ... n Do
            oldR <- R0
            newR <- ((oldR << 1) | 1) & B[tpos]
            R0 <- newR
            for i ∈ 1 ... k Do
                // "| 1": with at least one error allowed, position 1 is always reachable
                newR <- ((Ri << 1) & B[tpos]) | oldR | ((oldR | newR) << 1) | 1
                oldR <- Ri, Ri <- newR
            end of for
            If newR & 1 0^(m-1) <> 0 Then report an occurrence at pos
        End of for

// Requires: using System;
public static void BPR(string pattern, string text, int errors)
{
    // One bit mask per UTF-16 code unit; C# zero-initializes new arrays.
    int[] B = new int[ushort.MaxValue + 1];
    // Initialize all character positions (patterns of up to 32 characters)
    for (int i = 0; i < pattern.Length; i++)
    {
        B[(ushort)pattern[i]] |= 1 << i;
    }
    // Initialize NFA states: states[i] has its i lowest bits set
    int[] states = new int[errors + 1];
    for (int i = 0; i <= errors; i++)
    {
        states[i] = (i == 0) ? 0 : (1 << (i - 1) | states[i - 1]);
    }
    int oldR, newR;
    int exitCriteria = 1 << (pattern.Length - 1);

    for (int i = 0; i < text.Length; i++)
    {
        oldR = states[0];
        newR = ((oldR << 1) | 1) & B[text[i]];
        states[0] = newR;

        for (int j = 1; j <= errors; j++)
        {
            // "| 1": with at least one error allowed, the first pattern
            // position is always reachable from the initial state.
            newR = ((states[j] << 1) & B[text[i]]) | oldR | ((oldR | newR) << 1) | 1;
            oldR = states[j];
            states[j] = newR;
        }

        if ((newR & exitCriteria) != 0)
            Console.WriteLine("Occurrence at position {0}", i + 1);
    }
}
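For experimenting, the same row-wise bit-parallel scheme can be sketched in Python, where integers are unbounded and the pattern length is not limited by the word size. The function name `bpr` is illustrative. One detail is an assumption of this sketch: a 1 is OR-ed into every row with at least one allowed error, because with h[0,i] = 0 the first pattern position is always reachable by one substitution or deletion. Without it, occurrences whose error falls on the first pattern character (such as "xb" for P = ab) would be missed.

```python
def bpr(P, T, k):
    """1-based end positions of occurrences of a non-empty P in T with at
    most k differences, one bit vector R[i] per number of errors i."""
    m = len(P)
    full = (1 << m) - 1
    B = {}                                   # B[c]: bit j set iff P[j] == c
    for j, c in enumerate(P):
        B[c] = B.get(c, 0) | (1 << j)
    R = [(1 << i) - 1 for i in range(k + 1)] # R[i] starts with its i lowest bits set
    accept = 1 << (m - 1)
    ends = []
    for pos, c in enumerate(T, start=1):
        mask = B.get(c, 0)
        old = R[0]
        new = ((old << 1) | 1) & mask
        R[0] = new
        for i in range(1, k + 1):
            # match | insertion | substitution and deletion | always-reachable bit
            new = (((R[i] << 1) & mask) | old | ((old | new) << 1) | 1) & full
            old, R[i] = R[i], new
        if new & accept:
            ends.append(pos)
    return ends
```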

agrep

• S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83--91, 1992.
• Insertions, deletions
• Wildcards
• Non-uniform costs for substitution, insertion, deletion
• Find best match
• Mask regions for no errors
• Record orientated, not line orientated

Agrep examples (from man agrep)

• agrep -2 -c ABCDEFG foo
  gives the number of lines in file foo that contain ABCDEFG within two errors.
• agrep -1 -D2 -S2 'ABCD#YZ' foo
  outputs the lines containing ABCD followed, within arbitrary distance, by YZ, with up to one additional insertion (-D2 and -S2 make deletions and substitutions too "expensive").
• agrep -5 -p abcdefghij /usr/dict/words
  outputs the list of all words containing at least 5 of the first 10 letters of the alphabet in order. (Try it: any list starting with academia and ending with sacrilegious must mean something!)
• agrep -1 'abc[0-9](de|fg)*[x-z]' foo
  outputs the lines containing, within up to one error, the string that starts with abc followed by one digit, followed by zero or more repetitions of either de or fg, followed by either x, y, or z.
• agrep -d '^From' 'breakdown;internet' mbox
  outputs all mail messages (the pattern '^From' separates mail messages in a mail file) that contain keywords 'breakdown' and 'internet'.
• agrep -d '$$' -1 '' foo
  finds all paragraphs that contain word1 followed by word2 with one error in place of the blank. In particular, if word1 is the last word in a line and word2 is the first word in the next line, then the space will be substituted by a newline symbol and it will match. Thus, this is a way to overcome separation by a newline. Note that -d '$$' (or another delim which spans more than one line) is necessary, because otherwise agrep searches only one line at a time.
• agrep '^agrep'
  outputs all the examples of the use of agrep in this man page.

• Gene Myers: A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM (JACM), Volume 46, Issue 3 (May 1999). http://doi.acm.org/10.1145/316542.316550
• Abstract
• The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences.
• Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep.
• These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in either O(nmk/w) or O(nm log σ/w) time where w is the word size of the machine (e.g., 32 or 64 in practice), and σ is the size of the pattern alphabet.
• Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem.
• Thus, the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m.
• Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al. (1996).
• This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large.
• In practice this new algorithm, that computes a region of the dynamic programming (d.p.) matrix w entries at a time using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, that computes the same region 4 or 5 entries at a time using table lookup.
• This performance improvement yields a code that is either superior or competitive with all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.
• Writing an overview, implementing the algorithm and creating a useful tool could be a big topic for a BSc or MSc thesis.

Multiple approximate string matching

• How to simultaneously find the approximate matches for a set of words, e.g. a dictionary.
• Or a set of regular expressions, generalized patterns, etc.
• One can build automata for sets of words, and then match the automata approximately.
• Filtering approaches: if close enough, test
• Not many (good) methods have been proposed

• Superimpose NFA automata
• Filter on all (necessary) factors