Burrows-Wheeler transform and BWT-index

Burrows-Wheeler transform and BWT-indexkoutcher/lectures/lecture4-1.pdf · Succinct and compressed indexes ! succinct index takes space in bits proportional to that of the text itself

Aug 20, 2020


dariahiddleston

Burrows-Wheeler transform and BWT-index

Succinct and compressed indexes

•  a succinct index takes space, in bits, proportional to that of the text itself

•  the previous indexes are not succinct: they take O(n) computer words, that is O(n·log(n)) bits

•  a compressed index takes space, in bits, proportional to that of the compressed text

•  a self-index does not require storing the text

Burrows-Wheeler transform

T = acatacagatg$

Rotations of T in lexicographic order ($ is smaller than every letter):

$acatacagatg
acagatg$acat
acatacagatg$
agatg$acatac
atacagatg$ac
atg$acatacag
cagatg$acata
catacagatg$a
g$acatacagat
gatg$acataca
tacagatg$aca
tg$acatacaga

Suffix array: SA[i] is the position in T at which the i-th sorted rotation starts.

i    1  2  3  4  5  6  7  8  9 10 11 12
SA  12  5  1  7  3  9  6  2 11  8  4 10


The sorted rotations, split into the first column F, the middle part, and the last column L:

 i   F  middle      L
 1   $  acatacagat  g
 2   a  cagatg$aca  t
 3   a  catacagatg  $
 4   a  gatg$acata  c
 5   a  tacagatg$a  c
 6   a  tg$acataca  g
 7   c  agatg$acat  a
 8   c  atacagatg$  a
 9   g  $acatacaga  t
10   g  atg$acatac  a
11   t  acagatg$ac  a
12   t  g$acatacag  a

F[i] = T[SA[i]] and L is the BWT.

•  BWT[i] = T[SA[i]−1] if SA[i] ≠ 1, otherwise $

•  BWT was defined for the purpose of compression, as the BWT typically compresses better than the input text

•  BWT is reversible!

•  Obs 1: the first column (F) is easy to reconstruct: it is the sorted sequence of the letters of T, so it can be represented by an array C[x] = Σ_{y<x} |occ(y,T)|, the number of letters of T smaller than x, for each letter x

•  Ex: C = [0, 1, 6, 8, 10] for x = $, a, c, g, t

Inverting the BWT: only the columns F and L are needed.

i   1  2  3  4  5  6  7  8  9 10 11 12
F   $  a  a  a  a  a  c  c  g  g  t  t
L   g  t  $  c  c  g  a  a  t  a  a  a

T is rebuilt from right to left. Its last letter is $: T = …$. Row 1 is the rotation starting with $, and its last letter L[1] = g precedes $ in T: T = …g$.

•  Obs 2: for identical characters, their relative order in F and L is the same: the k-th x in L and the k-th x in F are the same occurrence of x in T. If that x is in row i of L, it is in row j = LF[i] of F:

LF[i] = C[BWT[i]] + rank[BWT[i], i]

•  Ex: LF[1] = C[g] + rank[g,1] = 8 + 1 = 9

Repeating the step from row LF[i] rebuilds T one letter at a time, from right to left: T = …tg$ (row 9), T = …atg$ (row 12), T = …gatg$ (row 6), and so on until all of T is recovered.

LF function

LF[i] = C[BWT[i]] + rank[BWT[i], i]

T = acatacagatg$

i    1  2  3  4  5  6  7  8  9 10 11 12
SA  12  5  1  7  3  9  6  2 11  8  4 10

•  LF[i] yields the index (in SA) of the suffix immediately preceding (in T) the i-th suffix (in SA). Formally, SA[LF[i]] = SA[i] − 1 whenever SA[i] ≠ 1.

•  Ex: LF[7] = C[a] + rank[a,7] = 1 + 1 = 2, so SA[LF[7]] = SA[2] = 5 = SA[7] − 1 = 6 − 1
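The identity SA[LF[i]] = SA[i] − 1 can be checked row by row on the example. The sketch below is only a sanity check: it recomputes SA and the BWT naively, and `LF` obtains rank by scanning the BWT, which a real index would avoid.

```python
T = "acatacagatg$"
n = len(T)
SA = sorted(range(1, n + 1), key=lambda i: T[i - 1:])    # 1-based suffix array
BWT = [T[p - 2] if p != 1 else "$" for p in SA]
C = {x: sum(1 for y in T if y < x) for x in set(T)}      # letters smaller than x


def LF(i):
    """LF[i] = C[BWT[i]] + rank(BWT[i], i) with 1-based rows and inclusive rank."""
    x = BWT[i - 1]
    return C[x] + BWT[:i].count(x)


for i in range(1, n + 1):
    if SA[i - 1] != 1:                                   # the row of suffix 1 wraps around to '$'
        assert SA[LF(i) - 1] == SA[i - 1] - 1

print(LF(7), SA[LF(7) - 1])  # 2 5
```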

rank function

LF[i] = C[BWT[i]] + rank[BWT[i], i]

i                      1  2  3  4  5  6  7  8  9 10 11 12
BWT[i]                 g  t  $  c  c  g  a  a  t  a  a  a
occurrences before i   0  0  0  0  1  1  0  1  1  2  3  4

The last row counts the occurrences of BWT[i] strictly before row i; rank[BWT[i], i], which also counts row i itself, is one more (e.g. rank[g,1] = 1, as in LF[1] = 8 + 1).

How about general queries rank[a,i] for any letter a and any position i?

rank/select functions

•  given a string T, efficiently answer queries rank(a,i): the number of occurrences of a in T[1..i]

•  the rank function (on bit vectors) turns out to be a fundamental algorithmic building block for succinct data structures [Jacobson 89]

•  on bit vectors, rank can be supported in O(1) time using o(n) additional bits of memory

•  complementary function select(a,j): output the position of the j-th occurrence of a in T. select can also be supported in O(1) time

•  [Jacobson 89]: using rank/select to represent binary trees so as to support navigation

Implementing rank on bitmaps

•  consider a bitmap B of size n

•  tabulate rank within all blocks of size (log n)/2: there are 2^((log n)/2) = √n different blocks and (log n)/2 possible queries per block, with each result taking log log n bits. Overall space: O(√n·(log n)·(log log n)) = o(n)

•  idea: compute the "cumulative rank" at block borders. This takes 2n/(log n) · (log n) = 2n bits: too much! Trick: introduce two levels of blocks

•  split B into n/(log² n) superblocks of size log² n; compute the cumulative rank at superblock borders. This takes n/(log² n) · (log n) = n/(log n) = o(n) bits

•  split each superblock into blocks of size (log n)/2; compute the cumulative rank inside the superblock; each result takes O(log log n) bits. Therefore we only need 2n/(log n) · O(log log n) = o(n) bits

DONE!
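The two-level layout can be sketched in Python as below. This is an illustration of the layout only, under stated assumptions: the class name `RankBitmap` is hypothetical, the bitmap is a plain list of 0/1 integers, and Python lists do not give the o(n)-bit space bound, which only holds for a bit-packed implementation.

```python
import math


class RankBitmap:
    """rank1(i) = number of 1s among the first i bits: superblocks + blocks + lookup table."""

    def __init__(self, bits):
        bits = list(bits)
        n = len(bits)
        log_n = max(2, math.ceil(math.log2(max(n, 2))))
        self.b = max(1, log_n // 2)        # block size, about (log n)/2
        self.s = self.b * 2 * log_n        # superblock size, about log^2 n, a multiple of b
        self.super_rank = []               # 1s before each superblock
        self.block_rank = []               # 1s between the superblock start and the block start
        self.block_code = []               # the bits of each block packed into an integer
        total = in_super = 0
        for start in range(0, n, self.b):
            if start % self.s == 0:        # a new superblock begins here
                self.super_rank.append(total)
                in_super = 0
            self.block_rank.append(in_super)
            chunk = bits[start:start + self.b]
            chunk = chunk + [0] * (self.b - len(chunk))   # pad the last block with 0s
            self.block_code.append(int("".join(map(str, chunk)), 2))
            ones = sum(chunk)
            total += ones
            in_super += ones
        # table[code][j] = number of 1s among the first j bits of the block 'code'
        self.table = []
        for code in range(1 << self.b):
            row, acc = [0], 0
            for j in range(self.b):
                acc += (code >> (self.b - 1 - j)) & 1
                row.append(acc)
            self.table.append(row)

    def rank1(self, i):
        """Number of 1s among the first i bits (0 <= i <= n): three lookups, no scanning."""
        if i == 0:
            return 0
        blk, off = divmod(i, self.b)
        if off == 0:                       # i falls exactly on a block border
            blk, off = blk - 1, self.b
        return (self.super_rank[blk * self.b // self.s]
                + self.block_rank[blk]
                + self.table[self.block_code[blk]][off])


B = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
rb = RankBitmap(B)
print(rb.rank1(7))  # 4: the 1s among the first seven bits 1011001
```

Each query adds one superblock counter, one block counter and one table entry, which mirrors the three bullets above.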

rank for large alphabets: wavelet trees

[Figure: wavelet tree over the alphabet {$, a, b, c, d, r}. The root separates {$, a, b} (bit 0) from {c, d, r} (bit 1); {$, a, b} is split into {$, a} and {b}, and {c, d, r} into {c, d} and {r}; {$, a} and {c, d} are then split into single letters.]

Space: n·log(σ) bits

rank(a,11) = ?   The code of a is 001, so the query follows the path 0, 0, 1:

rank(0, 11, S_ε) = 5
rank(0, 5, S_0) = 3
rank(1, 3, S_00) = 2

so rank(a,11) = 2; each result is the position used in the next query.

rank is computed in O(log(σ)) time
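A wavelet tree with the same shape as the one in the figure can be sketched as below. The class name `WaveletTree` is hypothetical, and the input string "abracadabra$" is an assumption chosen only because it uses the same alphabet; it is not necessarily the string behind the slide, so the intermediate values differ from the trace above. Plain prefix-sum lists stand in for the O(1) rank bitmaps.

```python
class WaveletTree:
    """Balanced wavelet tree; rank(c, i) counts the occurrences of c among the first i symbols."""

    def __init__(self, s, alphabet=None):
        self.alphabet = sorted(set(s)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return                                   # leaf: a single letter
        mid = (len(self.alphabet) + 1) // 2
        self.zero_set = set(self.alphabet[:mid])     # letters sent to the left child (bit 0)
        # prefix[i] = number of 1-bits among the first i bits of this node's bitmap
        self.prefix = [0]
        for c in s:
            self.prefix.append(self.prefix[-1] + (c not in self.zero_set))
        self.left = WaveletTree([c for c in s if c in self.zero_set], self.alphabet[:mid])
        self.right = WaveletTree([c for c in s if c not in self.zero_set], self.alphabet[mid:])

    def rank(self, c, i):
        """Walk from the root to the leaf of c, translating the position i at each level."""
        node = self
        while node.left is not None:
            ones = node.prefix[i]
            if c in node.zero_set:
                i, node = i - ones, node.left        # rank of bit 0 becomes the new position
            else:
                i, node = ones, node.right           # rank of bit 1 becomes the new position
        return i if node.alphabet == [c] else 0


s = "abracadabra$"                                   # assumed example string
wt = WaveletTree(s)
print(wt.rank("a", 11))  # 5
print(wt.rank("r", 11))  # 2
```

With the alphabet {$, a, b, c, d, r} this construction sends a along the path 0, 0, 1, the same code as on the slide, and a query touches one bitmap per level, hence O(log(σ)) rank operations.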

String matching with BWT

T = acatacagatg$

P = taca

•  S: current string (a suffix of the pattern); [e,f]: current interval, the rows whose rotations start with S; x: the next letter of P, read from right to left

•  compute the new interval for xS, with rank(x,i) counting the occurrences of x in BWT[1..i]:

e := C[x] + rank[x, e−1] + 1
f := C[x] + rank[x, f]

•  if e > f, the string xS does not occur in T

•  Ex: for P = taca the intervals are [1,12] (empty string), [2,6] (a), [7,8] (ca), [2,3] (aca), [11,11] (taca)
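The interval update can be sketched as a backward search over the BWT. This is a minimal sketch, assuming 1-based rows and a rank that counts the occurrences of x in BWT[1..i], so the lower bound uses rank(x, e−1); `build_index` and `backward_search` are illustrative names, and the rank tables are plain prefix counts rather than a succinct structure.

```python
def build_index(bwt):
    """Return (C, occ) with occ[x][i] = number of x in BWT[1..i] (inclusive rank)."""
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for x in alphabet:
        C[x] = total
        total += bwt.count(x)
    occ = {x: [0] for x in alphabet}
    for ch in bwt:
        for x in alphabet:
            occ[x].append(occ[x][-1] + (ch == x))
    return C, occ


def backward_search(pattern, bwt):
    """Return the 1-based row interval (e, f) of rotations starting with pattern, or None."""
    C, occ = build_index(bwt)
    e, f = 1, len(bwt)                     # interval of the empty string: all rows
    for x in reversed(pattern):            # the pattern is read right to left
        if x not in C:
            return None
        e = C[x] + occ[x][e - 1] + 1
        f = C[x] + occ[x][f]
        if e > f:                          # empty interval: no occurrence
            return None
    return e, f


BWT = "gt$ccgaaataa"
print(backward_search("taca", BWT))  # (11, 11)
print(backward_search("a", BWT))     # (2, 6)
print(backward_search("tt", BWT))    # None
```

The number of occurrences is f − e + 1, obtained with two rank queries per pattern letter.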

•  The search for P = taca ends with a single row. What position of T is it?

•  It is position 4 in the 1-based numbering used for SA (SA[11] = 4), i.e. position 3 when counting from 0. Reading it off requires SA, which takes O(n·log(n)) bits.

BWT-index (FM-index)

•  Solution: store only a fraction of the values of SA.

•  Storing one value out of every log(n) leads to O(n·log(n)/log(n)) = O(n) bits

•  Search time becomes O(|P| + occ·log(n))   [Ferragina, Manzini 00]

•  FM-index includes:
   •  BWT
   •  a selection of SA values
   •  auxiliary structures: array C, rank, position marking, …

i              1  2  3  4  5  6  7  8  9 10 11 12
SA            12  5  1  7  3  9  6  2 11  8  4 10
SA (sampled)  12  .  .  .  .  .  6  2  .  8  4 10
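Recovering an SA value from the sample can be sketched as follows: walk LF from the query row until a sampled row is reached, then add the number of steps taken. The sampling rule here (keep text positions that are multiples of `step` = 2) is an assumption chosen to reproduce the sampled values shown above; the slides' analysis samples one value out of every log(n). Names such as `locate` are illustrative.

```python
T = "acatacagatg$"
n = len(T)
SA = sorted(range(1, n + 1), key=lambda i: T[i - 1:])   # full SA, used only to build the sample
BWT = "".join(T[p - 2] if p != 1 else "$" for p in SA)
C = {x: sum(1 for y in T if y < x) for x in set(T)}

# Position marking: keep SA[i] only when the text position is a multiple of `step`.
step = 2
sampled = {i: p for i, p in enumerate(SA, start=1) if p % step == 0}


def LF(i):
    """LF with rank obtained by scanning; a real index answers rank in O(1)."""
    x = BWT[i - 1]
    return C[x] + BWT[:i].count(x)


def locate(i):
    """Recover SA[i]: each LF step moves one position to the left in T."""
    steps = 0
    while i not in sampled:
        i = LF(i)
        steps += 1
    # The walk may wrap from position 1 to position n, hence the modulo.
    return (sampled[i] + steps - 1) % n + 1


print(sorted(sampled.items()))  # [(1, 12), (7, 6), (8, 2), (10, 8), (11, 4), (12, 10)]
print(locate(11))               # 4
print([locate(i) for i in range(1, n + 1)] == SA)  # True
```

With one sample every `step` text positions, a walk takes fewer than `step` LF steps, which is where the occ·log(n) term in the search time comes from.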

BWT-index: practical issues

•  BWT-index can be implemented using ~3 bits/char (!!)

•  BWT-index is now widely used in practical bioinformatics software: BWA, bowtie, SOAP2 (mapping), CGA (assembly)

•  Variant: bi-directional BWT-index

•  other succinct data structures exist (including the compact suffix array) and continue to appear

•  construction may require much more space than the resulting structure

•  external memory algorithms are important

•  algorithms specialized to multi-core or GPU processor architectures

•  dynamic indexes