
The Burrows-Wheeler Transform Sen Zhang. Transform What is the definition for “transform”? –To change the nature, function, or condition of; convert.

Dec 21, 2015


The Burrows-Wheeler Transform

Sen Zhang

Transform

• What is the definition for “transform”?
– To change the nature, function, or condition of; convert.
– To change markedly the appearance or form of.
• Lossless and reversible
– By the way, transforming is simple; a kid can do it.
– Putting things back is the problem.
– Think of a three-year-old, who can pretty much transform anything, disassemble anything, but …
• There exist efficient reverse algorithms that can retrieve the original text from the transformed text.

What is BWT?

• The Burrows and Wheeler transform (BWT) is a block sorting lossless and reversible data transform.

• The BWT can permute a text into a new sequence which is usually more “compressible”.

• Published in 1994 by Michael Burrows and David Wheeler.

• The transformed text can be better compressed with fast locally-adaptive algorithms, such as run-length-encoding (or move-to-front coding) in combination with Huffman coding (or arithmetic coding).
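As a rough illustration of one of the locally-adaptive coders named above, here is a minimal move-to-front sketch in Python. The function name `move_to_front` and the choice to pass the alphabet explicitly are assumptions made for this example, not something the slides specify.

```python
def move_to_front(text, alphabet):
    """Encode each character as its current position in a self-adjusting table."""
    table = list(alphabet)           # current ordering of the symbols
    codes = []
    for ch in text:
        idx = table.index(ch)        # position of the symbol right now
        codes.append(idx)
        table.insert(0, table.pop(idx))  # move the symbol to the front
    return codes

# A run of identical characters turns into a run of zeros,
# which a later Huffman or arithmetic coder can exploit.
print(move_to_front("aaab", "ab"))  # [0, 0, 0, 1]
```

Because a recently seen character sits near the front of the table, text in which identical characters cluster together produces many small codes.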

Outline

• What does BWT stand for?
• Why BWT?
– Data compression algorithms
– RLE
– Huffman coding
– Combine them
– What is left out?
– Bring the reality closer to ideality
• Steps of BWT
• BWT is reversible and lossless
• Steps to inverse
• Variants of BWT
– ST
• When was BWT initially proposed?
• Where are the inventors of the algorithms?
• Your homework!

Why BWT?

• Run-length encoding (RLE)
– Replaces a long series of a repeated character with a count of the repetition, squeezing the run into a number and a character.
• AAAAAAA
• *A7, where * is a flag
– Ideally, the longer the sequence of the same character, the better.
– In reality, however, the input data does not necessarily contain the long runs that RLE expects.
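The run-length idea can be sketched in a few lines of Python. The names `rle_encode` and `rle_decode` are illustrative, and the sketch stores (character, count) pairs rather than the flagged `*A7` form shown on the slide.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of a repeated character into a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Expand (character, count) pairs back into the original text."""
    return "".join(ch * count for ch, count in pairs)

print(rle_encode("AAAAAAA"))  # [('A', 7)]
```

On input with few repeated neighbours, almost every pair has count 1, so the encoding saves nothing. That is the gap between reality and the ideal case.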

Bridge reality and ideality

• BWT can transform a text into a sequence that is easier to compress.
– Closer to ideality (what is expected by RLE).
• Compressing the transformed text instead of the original improves compression performance.

Preliminaries

• Alphabet Σ
– {a,b,c,$}
• We assume
– an order on the alphabet
• a<b<c<$
– A character is available to be used as the sentinel, denoted as $.

How to transform?

• Three steps
– Form an N*N matrix by cyclically rotating (left) the given text to form the rows of the matrix.
– Sort the rows of the matrix in alphabetical order.
– Extract the last column of the matrix.
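The three steps can be sketched directly in Python. This is a naive illustration, not an efficient implementation: `bwt_forward` and `sort_key` are names chosen here, the text is assumed to already end with the sentinel `$`, the sentinel is ranked after every other character as in the preliminaries, and rows are numbered from 0.

```python
SENTINEL = "$"

def sort_key(row):
    # Rank each character as (is_sentinel, character) so that the
    # sentinel sorts after every other character.
    return [(ch == SENTINEL, ch) for ch in row]

def bwt_forward(text):
    """Return (last column L, primary index) for a text ending with the sentinel."""
    n = len(text)
    rotations = [text[i:] + text[:i] for i in range(n)]  # step 1: cyclic left shifts
    rotations.sort(key=sort_key)                         # step 2: sort the rows
    last_column = "".join(row[-1] for row in rotations)  # step 3: take the last column
    return last_column, rotations.index(text)            # row where the original text landed

print(bwt_forward("mississippi$"))  # ('ssmp$pissiii', 4)
```

Building all N rotations costs O(N^2) memory, which is acceptable for a classroom example but not for large blocks.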

One example

• How the BWT transforms mississippi.
• T=mississippi$

Step 1: form the matrix

• The N * N symmetric matrix, OM, is originally constructed from the texts obtained by rotating the text $T$.
– The matrix OM has T as its first row, i.e. OM[1, 1:N]=T.
– The remaining rows of OM are constructed by applying successive cyclic left-shifts to T, i.e. for each of the remaining rows, a new text T_i is obtained by cyclically shifting the previous text T_{i-1} one column to the left.
• The matrix OM obtained is shown below.

• A text T is a sequence of characters drawn from the alphabet.
• Without loss of generality, a text T of length $N$ is denoted as x_1x_2x_3...x_{N-1}$, where every character x_i is in the alphabet, Σ, for i in [1, N-1]. The last character of the text is a sentinel, which is the lexicographically greatest character in the alphabet and occurs exactly once in the text.
• Appending a sentinel to the original text is not a must, but it simplifies understanding and makes any text nonrepeating.
– abcababac$

Step 1: form the matrix

First treat the input string as a cyclic string and construct an N * N matrix from it.

m i s s i s s i p p i $
i s s i s s i p p i $ m
s s i s s i p p i $ m i
s i s s i p p i $ m i s
i s s i p p i $ m i s s
s s i p p i $ m i s s i
s i p p i $ m i s s i s
i p p i $ m i s s i s s
p p i $ m i s s i s s i
p i $ m i s s i s s i p
i $ m i s s i s s i p p
$ m i s s i s s i p p i

Step 2: transform the matrix

• Now, we sort all the rows of the matrix OM in ascending order with the leftmost element of each row being the most significant position.
• Consequently, we obtain the transformed matrix M as given below.

i p p i $ m i s s i s s
i s s i p p i $ m i s s
i s s i s s i p p i $ m
i $ m i s s i s s i p p
m i s s i s s i p p i $
p i $ m i s s i s s i p
p p i $ m i s s i s s i
s i p p i $ m i s s i s
s i s s i p p i $ m i s
s s i p p i $ m i s s i
s s i s s i p p i $ m i
$ m i s s i s s i p p i

Completely sorted from the leftmost column to the rightmost column.

Step 3: get the transformed text

• The Burrows-Wheeler transform is the last column of the sorted matrix, together with the row number where the original string ends up.

Step 3: get the transformed text

• From the above transform, L is easily obtained by taking the transpose of the last column of M together with the primary index. Rows are numbered from 0 in this example.
– Primary index = 4
– L = s s m p $ p i s s i i i

row  M
0    i p p i $ m i s s i s s
1    i s s i p p i $ m i s s
2    i s s i s s i p p i $ m
3    i $ m i s s i s s i p p
4    m i s s i s s i p p i $   (primary index)
5    p i $ m i s s i s s i p
6    p p i $ m i s s i s s i
7    s i p p i $ m i s s i s
8    s i s s i p p i $ m i s
9    s s i p p i $ m i s s i
10   s s i s s i p p i $ m i
11   $ m i s s i s s i p p i

• Notice how there are 3 i's in a row and two separate pairs of consecutive s's. This makes the text easier to compress than the original string “mississippi$”.

What is the benefit?

• The transformed text is more amenable to subsequent compression algorithms.

Any problem?

• It sounds cool, but …

• Is the transformation reversible?

BWT is reversible and lossless

• The remarkable thing about the BWT is not only that it generates a more easily compressible output, but also that it is reversible, i.e. it allows the original text to be regenerated from the last column data and the primary index.

BWT is reversible and lossless

mississippi$ → BWT → index 4 and ssmp$pissiii

index 4 and ssmp$pissiii → Inverse BWT → mississippi$

How to achieve the goal?

The intuition

• Assume you are in a line of 1000 people.
– For some reason, the people are dispersed.
– Now, we need to restore the line.
– What should you (the people in line) do?
– What is your strategy?
• Centralized?
– A bookkeeper or ticket numbers, which requires centralized extra bookkeeping space.
• Distributed?
– Every person can point out who stood immediately in front of him. The bookkeeping space is distributed.

For IBWT

• The order is distributed and hidden in the output itself!

The trick is

• Where to start? Who is the first one to ask?
– The last one.
• Finding the immediately preceding character
– By finding the immediately preceding row of the current row.
• A loop is needed to recover all.
• Each iteration involves two matters
– Recover the current person (by index).
– In addition, point out the next person (by index) to keep the loop running.

• Two matters
• Recover the current person (by index)
– L[currentindex], so what is the currentindex?
• In addition, point out the next person (by index)
– currentindex = new index;
– // How do we update currentindex? We need an updating method.

We want to know where the preceding character of a given character is.

(Figure: the sorted matrix M from Step 2, with row 4, the primary index, marked.)

Based on the already known primary index, 4, we know that L[4], i.e. $, is the first character to retrieve, working backwards. Our question is: which character is the next one to retrieve?

We want to know where the preceding character of a given character is.

(Figure: the sorted matrix M from Step 2, with row 4, the primary index, marked.)

• We know that the next character is going to be ‘i’.
• But L[6] = L[9] = L[10] = L[11] = ‘i’. Which index should be chosen?
• Any of 6, 9, 10, and 11 gives us the right character ‘i’, but the correct strategy also has to determine which index is the next index to continue the restoration.


The solution

• The solution turns out to be very simple:
– Use the LF mapping!
– Read on to see what the LF mapping is.

Inverse BW-Transform

• Assume we know the complete ordered matrix

• Using L and F, construct an LF-mapping LF[1…N] which maps each character in L to the character occurring in F.

• Using LF-mapping and L, then reconstruct T backwards by threading through the LF-mapping and reading the characters off of L.

L and F

F is the first column and L is the last column of the sorted matrix M (rows numbered from 0).

row  F  L
0    i  s
1    i  s
2    i  m
3    i  p
4    m  $   (primary index)
5    p  p
6    p  i
7    s  s
8    s  s
9    s  i
10   s  i
11   $  i

LF mapping

row  L  LF
0    s  7
1    s  8
2    m  4
3    p  5
4    $  11   (primary index)
5    p  6
6    i  0
7    s  9
8    s  10
9    i  1
10   i  2
11   i  3

Inverse BW-Transform: Reconstruction of T

• Start with T[] blank.
• Let u = N.
• Initialize s = the primary index (4 in our case).
• T[u] = L[s]. We know that L[s] is the last character of T because M[the primary index] ends with $.
• For each i = u-1, …, 1 do:
– s = LF[s] (threading backwards)
– T[i] = L[s] (read off the next letter back)

Inverse BW-Transform: Reconstruction of T

• First step: s = 4, T = [.._ _ _ _ _ $]
• Second step: s = LF[4] = 11, T = [.._ _ _ _ i $]
• Third step: s = LF[11] = 3, T = [.._ _ _ p i $]
• Fourth step: s = LF[3] = 5, T = [.._ _ p p i $]
• And so on…
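The threading loop can be sketched in Python using the values from the mississippi example: L, the LF column read off the LF mapping slide, and primary index 4, all with rows numbered from 0. The function name `thread_back` is made up for this illustration.

```python
L = "ssmp$pissiii"
LF = [7, 8, 4, 5, 11, 6, 0, 9, 10, 1, 2, 3]  # LF column from the example
PRIMARY_INDEX = 4

def thread_back(L, LF, primary_index):
    """Rebuild T from its last character to its first by following LF."""
    n = len(L)
    T = [None] * n
    s = primary_index
    T[n - 1] = L[s]                 # first step: the sentinel
    for i in range(n - 2, -1, -1):
        s = LF[s]                   # threading backwards
        T[i] = L[s]                 # read off the next letter back
    return "".join(T)

print(thread_back(L, LF, PRIMARY_INDEX))  # mississippi$
```

The first values taken by `s` are 4, 11, 3, 5, matching the four steps listed above.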

Who can retrieve the data?

• Please complete it!

Why does LF mapping work?

(Figure repeated over five slides: the sorted matrix M with its F, L, and LF columns and primary index 4. The arrows were lost in extraction; the surviving annotations are “Which one?”, “Why not this?”, and “Why this?”.)

row  F  L  LF
0    i  s  7
1    i  s  8
2    i  m  4
3    i  p  5
4    m  $  11   (primary index)
5    p  p  6
6    p  i  0
7    s  s  9
8    s  s  10
9    s  i  1
10   s  i  2
11   $  i  3

The mathematical explanation

• Take two rows of M that end with the same character P, where + denotes concatenation:

• T1=S1+P

• T2=S2+P

• If T1<T2, then S1<S2

• Now, let us swap S and P (rotate each row right by one position)

• P+S1=T1’

• P+S2=T2’

• Since S1<S2, we know T1’<T2’

• The secret is hidden in the sorting strategy of the forward component.
– Sorting preserves the relative order of identical characters in both the last column and the first column.
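This order-preservation property can be checked empirically. The sketch below is only a sanity check on one input, not a proof. The helper names are invented for the example, and the sentinel `$` is again ranked after every other character.

```python
SENTINEL = "$"

def sort_key(row):
    # The sentinel sorts after every other character.
    return [(ch == SENTINEL, ch) for ch in row]

def rotate_right(row):
    # Move the last character to the front: S + P becomes P + S.
    return row[-1] + row[:-1]

def order_is_preserved(text):
    n = len(text)
    rows = sorted((text[i:] + text[:i] for i in range(n)), key=sort_key)
    for c in set(text):
        ending = [row for row in rows if row[-1] == c]    # rows with c in L
        starting = [row for row in rows if row[0] == c]   # rows with c in F
        # The k-th row ending in c, rotated right, must be the k-th row starting with c.
        if [rotate_right(row) for row in ending] != starting:
            return False
    return True

print(order_is_preserved("mississippi$"))  # True
```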

• We had assumed we have the matrix. But actually we don’t.

• Observation: we only need two columns.

• Amazingly, the information contained in the Burrows-Wheeler transform (L) is enough to reconstruct F, and hence the mapping, and hence the original message!


• First, we know all of the characters in the original message, even if they're permuted in the wrong order. This enables us to reconstruct the first column.


• Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters to get the first column.
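In code, recovering F is a single sort of the characters of L. A minimal Python sketch follows; the name `first_column` is illustrative, and the sentinel is ranked last as in the preliminaries.

```python
def first_column(L, sentinel="$"):
    """Sort the characters of L (sentinel last) to obtain the first column F."""
    return "".join(sorted(L, key=lambda ch: (ch == sentinel, ch)))

print(first_column("ssmp$pissiii"))  # iiiimppssss$
```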

Inverse BW-Transform: Construction of C

• Store in C[c] the number of occurrences in T of the characters {1, …, c-1}, i.e. of all characters that sort before c.
• In our example:

T = mississippi$

character  i  m  p  s  $
count      4  1  2  4  1
C          0  4  5  7  11

• Notice that C[c] + m is the position (counted from 1) of the m-th occurrence of c in F (if any).
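A short Python sketch of building C from L, which contains the same characters as T. It keeps C as a dictionary keyed by character rather than an array indexed by character rank. That layout, like the name `build_C`, is a choice made for this example.

```python
from collections import Counter

def build_C(L, sentinel="$"):
    """Map each character to the number of characters in L that sort before it."""
    counts = Counter(L)
    ordered = sorted(counts, key=lambda ch: (ch == sentinel, ch))  # sentinel last
    C, total = {}, 0
    for ch in ordered:
        C[ch] = total          # everything counted so far sorts before ch
        total += counts[ch]
    return C

print(build_C("ssmp$pissiii"))  # {'i': 0, 'm': 4, 'p': 5, 's': 7, '$': 11}
```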

Inverse BW-Transform: Constructing the LF-mapping

• Why and how the LF-mapping?
– Notice that for every row of M, L[i] directly precedes F[i] in the text (thanks to the cyclic shifts).
• Let L[i] = c, let r_i be the number of occurrences of c in the prefix L[1,i], and let M[j] be the r_i-th row of M that starts with c. Then the character in the first column F corresponding to L[i] is located at F[j].
• How to use this fact in the LF-mapping?

Inverse BW-Transform: Constructing the LF-mapping

• So, define LF[1…N] as LF[i] = C[L[i]] + r_i.
• C[L[i]] gets us the proper offset to the zeroth occurrence of L[i], and the addition of r_i gets us the r_i-th row of M that starts with c.
• With rows numbered from 0, as in the mississippi example, the same rule reads LF[i] = C[L[i]] + r_i - 1.
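The definition translates into a single pass over L. The sketch below uses 0-based rows, so it adds the number of earlier occurrences (r_i - 1) instead of r_i. The name `build_LF` is illustrative, and C is passed in as the dictionary from the mississippi example.

```python
def build_LF(L, C):
    """LF[i] = C[L[i]] + (number of earlier occurrences of L[i] in L), 0-based."""
    seen = {}                      # occurrences of each character so far
    LF = []
    for ch in L:
        LF.append(C[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return LF

C = {"i": 0, "m": 4, "p": 5, "s": 7, "$": 11}
print(build_LF("ssmp$pissiii", C))  # [7, 8, 4, 5, 11, 6, 0, 9, 10, 1, 2, 3]
```

The result matches the LF column shown for the example matrix.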

Inverse BW-Transform

• Construct C[1…|Σ|], which stores in C[i] the cumulative number of occurrences in T of the characters that sort before character i.

• Construct an LF-mapping LF[1…N] which maps each character in L to the character occurring in F using only L and C.

• Reconstruct T backwards by threading through the LF-mapping and reading the characters off of L.
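Putting the three steps together gives a compact inverse. This is a sketch under the same assumptions as the worked example: 0-based rows, a sentinel `$` that sorts last, and a function name, `bwt_inverse`, chosen for illustration.

```python
def bwt_inverse(L, primary_index, sentinel="$"):
    """Recover the original text from L and the primary index."""
    def rank(ch):
        return (ch == sentinel, ch)        # sentinel sorts last

    # Step 1: C[c] = number of characters in L that sort before c.
    C, total = {}, 0
    for ch in sorted(set(L), key=rank):
        C[ch] = total
        total += L.count(ch)

    # Step 2: LF-mapping built from L and C only.
    seen, LF = {}, []
    for ch in L:
        LF.append(C[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    # Step 3: thread backwards, reading characters off L.
    out, s = [], primary_index
    for _ in range(len(L)):
        out.append(L[s])
        s = LF[s]
    return "".join(reversed(out))

print(bwt_inverse("ssmp$pissiii", 4))  # mississippi$
```

The `L.count` call keeps the sketch short at the cost of one extra pass per distinct character; a single counting pass would avoid that.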

Another example

1. You are given an input string “ababc”.

(a) Using Burrows-Wheeler, create all cyclic shifts of the string.
(b) Put them in sorted order.
(c) Output L and the primary index.
(d) Given L, determine F and LF (and show how you do it).
(e) Decode the original string using the primary index, L, and LF (and show how you do it).

Pros and cons of BWT

• Pros:
– The transformed text enjoys a compression-favorable property: identical characters tend to be grouped together, so the probability of finding a character close to another instance of the same character is increased substantially.
– More importantly, there exist efficient and smart algorithms to restore the original string from the transformed result.
• Cons:
– The need to sort all the contexts up to their full length of $N$ is the main cause of the super-linear time complexity of BWT.
– Super-linear time algorithms are not hardware friendly.

Block wise

• It works on blocks of a certain typical size.

An improved algorithm - Schindler Transforms

• To address the above drawbacks, a slightly different transform, called ST, was proposed.
– It sorts the texts using only their first $k$ characters (where $k$ can be a value far less than $N$), yet remains reversible.
• The key idea of ST is a two-level priority sorting scheme, which can be easily achieved using radix sort.
– the lexicographical sorting criterion.
– the positional sorting criterion.

ST transform

• Let OM be the same matrix as defined for the BWT.
• Under k-order ST, OM is transformed to M_k by sorting all its rows according to their first k leftmost characters, i.e. k-order contexts, only.
• In case any two k-order contexts are equal, the tie is resolved by their relative positions in the original OM.

i p p i $ m i s s i s s
i s s i s s i p p i $ m
i s s i p p i $ m i s s
i $ m i s s i s s i p p
m i s s i s s i p p i $
p i $ m i s s i s s i p
p p i $ m i s s i s s i
s i s s i p p i $ m i s
s i p p i $ m i s s i s
s s i s s i p p i $ m i
s s i p p i $ m i s s i
$ m i s s i s s i p p i

Only partially sorted, on the leftmost two columns.
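A forward k-order ST can be sketched by sorting the rotations on their first k characters only. The sketch relies on Python's stable sort to break ties by original position, where the slides suggest radix sort. The name `st_forward` is illustrative and the sentinel again sorts last.

```python
def st_forward(text, k, sentinel="$"):
    """Return the last column after sorting rotations on their first k characters."""
    n = len(text)
    rotations = [text[i:] + text[:i] for i in range(n)]   # rows of OM, in original order
    # sorted() is stable, so rows with equal k-order contexts keep
    # their relative positions from OM (the positional criterion).
    rows = sorted(rotations, key=lambda row: [(ch == sentinel, ch) for ch in row[:k]])
    return "".join(row[-1] for row in rows)

print(st_forward("mississippi$", 2))  # smsp$pissiii
```

With k equal to the text length the result coincides with the full BWT output, `ssmp$pissiii`.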

Pros and Cons of ST

• Pros:
– Faster than BWT
– Hardware implementation friendly
• Cons:
– The currently known approach to inverse ST is based on a hashing function.
– The relationship between inverse ST and inverse BWT is not well studied.

An application scheme in a data communication system

Conclusions

• The BW transform makes the text (string) more amenable to compression.
– BWT in itself does not modify the data stream. It just reorders the symbols inside the data blocks.
– Evaluating the performance depends on the information model assumed, which is another topic.
• The transform is lossless and reversible.

BW Transform Summary

• A naïve implementation of the transform can take O(n^3) time in the worst case (for example, a quadratic sort over the N rows, with each row comparison costing up to O(n)).
– The best solution has O(n), which is tricky to implement.
• We can reverse it to reconstruct the original text in O(n) time, using O(n) space.
• Once we obtain L, we can compress L in a provably efficient manner.

Issues left out

• What if all characters in the alphabet appear in the text, i.e. no sentinel can be used?
• Do you need to compare N positions?
• What if the input data is not ASCII encoded, but an image or a biological sequence (DNA, RNA or protein)?
• Why not the first column, but the last column?
– In BWT, the last column, L, of the sorted matrix contains concentrations of identical characters, which is why L is easy to compress. However, the first column, F, of the same matrix is even easier to compress since it is sorted and so consists of runs of identical characters. Why select column L and not column F?

Homework

• The BWT algorithms
– Forward Transform
– Backward Transform
• Either in the Windows environment or the Linux environment

Examples of running your program in the command line

• bwt -f text1 text2
– Transform text1 into text2
• bwt -i text2 text3
– Invert text2 into text3

How to verify the correctness of your algorithms

• Because the BWT is reversible and lossless, if your implementation is correct, text3 should be the same as text1.
– You can manually compare text1 and text3.
– Alternatively, you can run the “diff” command in Linux to report any differences between the two files.

Requirements

• Stage 1: use a fixed string or accept a string from the keyboard to test the correctness of your algorithms. (80 points)

• Stage 2: then expand your solution to read the string from a given file. (20 points) Notice that text2 should be a binary file: the first item stored is the index, followed by the ASCII codes.
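One possible layout for the binary text2 file is sketched below. The slides only say that the index comes first and the ASCII codes follow. The 4-byte little-endian unsigned index and the names `pack_bwt` and `unpack_bwt` are assumptions made for this example.

```python
import struct

INDEX_FORMAT = "<I"   # assumed: 4-byte little-endian unsigned integer
INDEX_SIZE = struct.calcsize(INDEX_FORMAT)

def pack_bwt(primary_index, L):
    """Serialize the primary index followed by the ASCII codes of L."""
    return struct.pack(INDEX_FORMAT, primary_index) + L.encode("ascii")

def unpack_bwt(blob):
    """Split a serialized block back into (primary index, L)."""
    (primary_index,) = struct.unpack_from(INDEX_FORMAT, blob, 0)
    return primary_index, blob[INDEX_SIZE:].decode("ascii")

blob = pack_bwt(4, "ssmp$pissiii")
print(unpack_bwt(blob))  # (4, 'ssmp$pissiii')
```

Writing `blob` to a file opened in binary mode (`"wb"`) and reading it back with `"rb"` would round-trip the same way.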

How to sort the matrix

• 1. The simplest way
– Use whatever sorting algorithm you feel comfortable with.
– Make each row a string, then do string comparison.
• C string: you need to know the functions for string comparison.
• C++ string: you need to know how to use the string class.
• Use whichever way you feel most comfortable with.
• 2. Radix sort
• 3. Suffix array
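As a step from the simplest way toward the suffix-array idea, the rows need not be stored at all: it is enough to sort the starting offsets of the rotations and compare characters on demand. The sketch below is illustrative only and is not a linear-time suffix array construction. The name `bwt_by_offsets` is made up, and the sentinel `$` sorts last.

```python
from functools import cmp_to_key

def bwt_by_offsets(text, sentinel="$"):
    """Compute (L, primary index) by sorting rotation offsets instead of rows."""
    n = len(text)

    def rank(ch):
        return (ch == sentinel, ch)            # sentinel sorts last

    def compare(a, b):
        # Compare the rotations starting at offsets a and b, character by character.
        for j in range(n):
            ca, cb = rank(text[(a + j) % n]), rank(text[(b + j) % n])
            if ca != cb:
                return -1 if ca < cb else 1
        return 0

    order = sorted(range(n), key=cmp_to_key(compare))
    # The last character of the rotation starting at offset i is text[i - 1].
    last_column = "".join(text[i - 1] for i in order)
    return last_column, order.index(0)

print(bwt_by_offsets("mississippi$"))  # ('ssmp$pissiii', 4)
```

Only O(N) extra memory is needed for the offsets, compared with O(N^2) characters for the explicit matrix.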

Knowledge to be practiced for the homework

• Array

• Dynamic memory allocation

• String manipulation

• Sorting

• File operation

• Data compression algorithms