Processing of large document collections Part 4
Text compression
Despite a continuous increase in storage and transmission capacities, more and more effort has been put into using compression to increase the amount of data that can be handled
no matter how much storage space or transmission bandwidth is available, someone always finds ways to fill it with
Text compression
Efficient storage and representation of information is an old problem (it predates the computer era)
Morse code: uses shorter representations for common characters
Braille code for the blind: includes contractions, which represent common words with 2 or 3 characters
Text compression
On a computer: changing the representation of a file so that it takes less space to store or less time to transmit
the original file can be reconstructed exactly from the compressed representation
different from data compression in general
text compression has to be lossless
compare with sound and images: small changes and noise are tolerated
Text compression methods
Huffman coding (in the 50’s)
compressing English: 5 bits/character
Ziv-Lempel compression (in the 70’s)
4 bits/character
arithmetic coding
2 bits/char (more processing power needed)
prediction by partial matching (80’s)
Text compression methods
Since the 80’s, the compression rate has stayed about the same
improvements have been made in processor and memory utilization during compression
also: the amount of compression may increase when more memory (for compression and decompression) is available
Text compression methods
Most text compression methods can be placed in one of two classes:
symbolwise methods
dictionary methods
Symbolwise methods
Work by estimating the probabilities of symbols (often characters)
coding one symbol at a time
using shorter codewords for the most likely symbols (in the same way as Morse code does)
Dictionary methods
Achieve compression by replacing words and other fragments of text with an index to an entry in a ”dictionary”
the Braille code is a dictionary method, since special codes are used to represent whole words
Symbolwise methods
Usually based on Huffman coding or arithmetic coding
variations differ mainly in how they estimate probabilities for symbols
the more accurate these estimates are, the greater the compression that can be achieved
to obtain good compression, the probability estimate is usually based on the context in which a symbol occurs
Symbolwise methods
Modeling
estimating probabilities
there does not appear to be any single ”best” method
Coding
converting the probabilities into a bitstream for transmission
well understood, can be performed effectively
Dictionary methods
Generally use quite simple representations to code references to entries in the dictionary
obtain compression by representing several symbols as one output codeword
Dictionary methods
often based on Ziv-Lempel coding
replaces strings of characters with a reference to a previous occurrence of a string
adaptive
effective: most characters can be coded as part of a string that has occurred earlier in the text
compression is achieved if the reference is stored in fewer bits than the string it replaces
Models
Compression methods obtain high compression by forming good models of the data that is to be coded
the function of a model is to predict symbols
e.g. during the encoding of a text, the ”prediction” for the next symbol might include a probability of 2% for the letter ’u’, based on its relative frequency in a sample of text
Models
The set of all possible symbols is called the alphabet
the probability distribution provides an estimated probability for each symbol in the alphabet
Encoding, decoding
the model provides the probability distribution to the encoder, which uses it to encode the symbol that actually occurs
the decoder uses an identical model together with the output of the encoder to find out what the encoded symbol was
Information content of a symbol
The number of bits in which a symbol s should be coded is called the information content I(s) of the symbol
the information content I(s) is directly related to the symbol’s predicted probability P(s), by the function I(s) = -log P(s) bits, where the logarithm is taken to base 2
Information content of a symbol
The average amount of information per symbol over the whole alphabet is known as the entropy of the probability distribution, denoted by H:
$$H = \sum_{s} P(s)\, I(s) = -\sum_{s} P(s) \log P(s)$$
Information content of a symbol
Provided that the symbols appear independently and with the assumed probabilities, H is a lower bound on compression, measured in bits per symbol, that can be achieved by any coding method
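The entropy formula can be checked numerically. The following is a minimal sketch; the function name `entropy` and the three-symbol distribution are made up for illustration and do not come from the slides.

```python
import math

def entropy(dist):
    """Entropy in bits per symbol of a dict mapping symbol -> probability."""
    # Symbols with probability 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical distribution over a three-symbol alphabet.
example = {"a": 0.5, "b": 0.25, "c": 0.25}
print(entropy(example))  # 1.5
```

Under this distribution no coder can do better than 1.5 bits per symbol on average, as long as the symbols really are independent and follow these probabilities.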
Information content of a symbol
If the probability of symbol ’u’ is estimated to be 2%, the corresponding information content is 5.6 bits
if ’u’ happens to be the next symbol that is to be coded, it should be transmitted in 5.6 bits
Information content of a symbol
predictions can usually be improved by taking account of the previous symbol
if a ’q’ has just occurred, the probability of ’u’ may jump to 95 %, based on how often ’q’ is followed by ’u’ in a sample of text
information content of ’u’ in this case is 0.074 bits
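The two figures for ’u’ can be reproduced directly from I(s) = -log P(s) with a base-2 logarithm. This is a small sketch; the helper name `information_content` is an assumption made for illustration.

```python
import math

def information_content(p):
    """Information content in bits of a symbol with predicted probability p."""
    return -math.log2(p)

print(round(information_content(0.02), 1))  # 5.6   ('u' with no context)
print(round(information_content(0.95), 3))  # 0.074 ('u' after 'q')
```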
Information content of a symbol
Models that take a few immediately preceding symbols into account to make a prediction are called finite-context models of order m
m is the number of previous symbols used to make a prediction
Adaptive models
There are many ways to estimate the probabilities in a model
we could use static modelling: always use the same probabilities for symbols, regardless of what text is being coded
the compression system may not perform well if a different kind of text is received
e.g. a model for English applied to a file of numbers
Adaptive models
One solution is to generate a model specifically for each file that is to be compressed
an initial pass is made through the file to estimate symbol probabilities, and these are transmitted to the decoder before transmitting the encoded symbols
this is called semi-static modelling
Adaptive models
Semi-static modelling has the advantage that the model is invariably better suited to the input than a static one, but the penalty paid is having to transmit the model first, as well as having to make a preliminary pass over the data to accumulate symbol probabilities
Adaptive models
An adaptive model begins with a bland probability distribution and gradually alters it as more symbols are encountered
as an example, assume a zero-order model, i.e., no context is used to predict the next symbol
Adaptive models
Assume that an encoder has already encoded a long text and has come to the sentence fragment: It migh
now the probability that the next character is ’t’ is estimated to be 49,983/768,078 = 6.5 %, since in the previous text, 49,983 characters of the total of 768,078 characters were ’t’
Adaptive models
Using the same system, ’e’ has the probability 9.4 % and ’x’ has probability 0.11 %
the model provides this estimated probability distribution to an encoder
the decoder is able to generate the same model, since it has already decoded the same preceding text and therefore holds the same probability estimates as the encoder
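A zero-order adaptive model can be sketched as a table of counts that is updated after every symbol. The sketch below is only illustrative: the class name `AdaptiveOrder0`, the choice of starting every count at 1 as the "bland" distribution, and the helper `ideal_cost_bits` are all assumptions, not taken from the slides.

```python
import math

class AdaptiveOrder0:
    """Zero-order adaptive model: probabilities are relative counts so far."""

    def __init__(self, alphabet):
        # Assumed "bland" start: every symbol has been seen once.
        self.counts = {s: 1 for s in alphabet}
        self.total = len(self.counts)

    def prob(self, symbol):
        return self.counts[symbol] / self.total

    def update(self, symbol):
        # Encoder and decoder both call this after each symbol,
        # so their models stay identical.
        self.counts[symbol] += 1
        self.total += 1

def ideal_cost_bits(text, alphabet):
    """Sum of -log2 P(s) over the text, with the model adapting as it goes."""
    model = AdaptiveOrder0(alphabet)
    bits = 0.0
    for ch in text:
        bits += -math.log2(model.prob(ch))
        model.update(ch)
    return bits

print(ideal_cost_bits("aaaaaaab", "ab"))
```

Because the update happens only after a symbol has been coded, the decoder can repeat exactly the same steps using the symbols it has already decoded.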
Adaptive models
For a higher-order model, such as a first-order model, the probability is estimated by how often that character has occurred in the current context
in the zero-order model above, the symbol ’t’ occurred in the context ”It migh”, but the model made no use of the characters of the phrase
Adaptive models
A first-order model would use the final ’h’ as a context with which to condition the probability estimates
the letter ’h’ has occurred 37,525 times in the prior text, and 1,133 of these times it was followed by a ’t’
the probability of ’t’ occurring after an ’h’ can be estimated to be 1,133/37,525=3.02 %
Adaptive models
For ’t’, a prediction of 3.02 % is actually worse than in the zero-order model, because ’t’ is rare in this context (’e’ follows ’h’ much more often)
a second-order model would use the relative frequency with which the context ’gh’ is followed by ’t’, which is 64.6 %
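The same counting idea extends to any order m: keep one table of counts per context. The following is a minimal sketch; the function names and the toy string are assumptions for illustration, and the last line only re-checks the first-order arithmetic quoted above.

```python
from collections import Counter, defaultdict

def context_counts(text, order):
    """For each context of `order` symbols, count the symbols that follow it."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts

def estimate(counts, context, symbol):
    """Relative frequency of `symbol` after `context` (0.0 if context unseen)."""
    seen = counts.get(context, Counter())
    total = sum(seen.values())
    return seen[symbol] / total if total else 0.0

toy = "abracadabra"  # toy sample text, not from the slides
print(estimate(context_counts(toy, 1), "a", "b"))   # 0.5
print(estimate(context_counts(toy, 2), "ab", "r"))  # 1.0

# The first-order figure for 't' after 'h', as a percentage:
print(round(100 * 1133 / 37525, 2))  # 3.02
```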
Adaptive models
Good: robust, reliable, flexible
Bad: not suitable for random access to compressed files
a text can be decoded only from the beginning: the model used for coding a particular part of the text is determined from all the preceding text
-> not suitable for full-text retrieval
Coding
Coding is the task of determining the output representation of a symbol, based on a probability distribution supplied by a model
general idea: the coder should output short codewords for likely symbols and long codewords for rare ones
symbolwise methods depend heavily on a good coder to achieve compression
Huffman coding
A phrase is coded by replacing each of its symbols with the codeword given by a table
Huffman coding generates codewords for a set of symbols, given some probability distribution for the symbols
the type of code is called a prefix-free code
no codeword is the prefix of another symbol’s codeword
Huffman coding
The codewords can be stored in a tree (a decoding tree)
Huffman’s algorithm works by constructing the decoding tree from the bottom up
Huffman coding
Algorithm
create for each symbol a leaf node containing the symbol and its probability
two nodes with the smallest probabilities become siblings under a new parent node, which is given a probability equal to the sum of its two children’s probabilities
the combining operation is repeated until there is only one node without a parent
the two branches from every nonleaf node are then labeled 0 and 1
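The steps above can be sketched with a priority queue. This is an illustrative sketch rather than a tuned implementation: the function name `huffman_code` is made up, symbols are assumed to be strings, and which branch gets 0 or 1 is an arbitrary choice.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a prefix-free code from a dict symbol -> probability.

    Assumes symbols are strings; internal nodes are (left, right) tuples.
    """
    tiebreak = count()  # unique counter so the heap never compares nodes
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate alphabet with a single symbol
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # The two nodes with the smallest probabilities become siblings.
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    codes = {}

    def label(node, prefix):
        # Label the two branches of every non-leaf node 0 and 1.
        if isinstance(node, tuple):
            label(node[0], prefix + "0")
            label(node[1], prefix + "1")
        else:
            codes[node] = prefix

    label(heap[0][2], "")
    return codes

codes = huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
print({s: len(c) for s, c in sorted(codes.items())})
```

For this distribution the codeword lengths are 1, 2, 3 and 3 bits, which equal the information content of each symbol, so the code reaches the entropy of 1.75 bits per symbol.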
Huffman coding
Huffman coding is generally fast for both encoding and decoding, provided that the probability distribution is static
adaptive Huffman coding is possible, but it either needs a lot of memory or is slow
coupled with a word-based model (rather than a character-based model), it gives good compression
Canonical Huffman codes
The frequency of each word is counted
the codewords are chosen for each word to minimize the size of the compressed file -> static zero-order word-level model
the codewords are shown in decreasing order of length (and therefore in increasing order of word frequency)
within each block of codes of the same length, words are ordered alphabetically
Canonical Huffman codes
The list begins with the thousands of words (and numbers) that appear only once
when the codewords are sorted in lexical order, they are also in order from the longest to the shortest codeword
Canonical Huffman codes
A word’s encoding can be determined quickly from the length of its codeword, how far through the list it is, and the codeword for the first word of that length
Canonical Huffman codes
It is not necessary to store a decode tree
all that is required is
a list of the symbols ordered according to the lexical order of the codewords
an array storing the first codeword of each distinct length
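A canonical code can be assigned from the codeword lengths alone. The sketch below follows the ordering described above (longest codewords first, alphabetical within a length); the function name `canonical_codes`, the four-word example and its lengths are assumptions for illustration, and the lengths are assumed to come from a Huffman code, so that they fill the code space exactly.

```python
from collections import Counter

def canonical_codes(lengths):
    """Assign canonical codewords from a dict symbol -> codeword length.

    Returns (codes, first), where first[l] is the integer value of the
    first codeword of length l.
    """
    max_len = max(lengths.values())
    num = Counter(lengths.values())  # how many codewords of each length
    first = {max_len: 0}             # the longest codewords start at 00...0
    for l in range(max_len - 1, 0, -1):
        first[l] = (first[l + 1] + num[l + 1]) // 2
    codes = {}
    used = Counter()
    # Longest codewords first; alphabetical order within each length.
    for sym in sorted(lengths, key=lambda s: (-lengths[s], s)):
        l = lengths[sym]
        codes[sym] = format(first[l] + used[l], "0{}b".format(l))
        used[l] += 1
    return codes, first

codes, first = canonical_codes({"the": 1, "of": 2, "cat": 3, "dog": 3})
print(codes)  # {'cat': '000', 'dog': '001', 'of': '01', 'the': '1'}
```

A word’s codeword is then the first codeword of its length plus the word’s position within that block, which is why neither the encoder nor the decoder needs an explicit tree.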
Dictionary models
Dictionary-based compression methods use the principle of replacing substrings in a text with a codeword that identifies that substring in a dictionary
dictionary contains a list of substrings and a codeword for each substring
often fixed codewords are used
reasonable compression is obtained even if coding is simple
Dictionary models
The simplest dictionary compression methods use small dictionaries
for instance, digram coding
selected pairs of letters are replaced with codewords
a dictionary for the ASCII character set might contain the 128 ASCII characters, as well as 128 common letter pairs
Dictionary models
Digram coding…
the output codewords are eight bits each
the presence of the full ASCII character set ensures that any (ASCII) input can be represented
at best, every pair of characters is replaced with a codeword, reducing the input from 7 bits/character to 4 bits/character
at worst, each 7-bit character will be expanded to 8 bits
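Digram coding is simple enough to sketch in a few lines. The example assumes pure ASCII input and a list of at most 128 two-letter pairs; the function names and the tiny pair list are made up for illustration, and a real dictionary would be chosen from pair frequencies.

```python
def digram_encode(text, pairs):
    """Greedy digram coding: codes 0-127 are ASCII, 128+ are letter pairs."""
    pair_code = {p: 128 + i for i, p in enumerate(pairs)}
    out = bytearray()
    i = 0
    while i < len(text):
        two = text[i:i + 2]
        if two in pair_code:
            out.append(pair_code[two])  # one byte for two characters
            i += 2
        else:
            out.append(ord(text[i]))    # one byte for one character
            i += 1
    return bytes(out)

def digram_decode(data, pairs):
    return "".join(pairs[b - 128] if b >= 128 else chr(b) for b in data)

pairs = ["th", "e ", "in"]  # hypothetical dictionary of common pairs
packed = digram_encode("the thin", pairs)
print(len(packed), digram_decode(packed, pairs))  # 4 the thin
```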
Dictionary models
Natural extension: put even larger entries in the dictionary, e.g.
common words like ’and’, ’the’,… or common components of words like ’pre’, ’tion’…
a predefined set of dictionary phrases makes the compression domain-dependent
or very short phrases have to be used -> good compression is not achieved
Dictionary models
One way to avoid the problem of the dictionary being unsuitable for the text at hand is to use a semi-static dictionary scheme
construct a new dictionary for every text that is to be compressed
overhead of transmitting or storing the dictionary is significant
decision of which phrases should be included is a difficult problem
Dictionary models
Solution: use an adaptive dictionary scheme
Ziv-Lempel coders (LZ77 and LZ78)
a substring of text is replaced with a pointer to where it has occurred previously
dictionary: all the text prior to the current position
codewords: pointers
Dictionary models
Ziv-Lempel…
the prior text makes a very good dictionary since it is usually in the same style and language as upcoming text
the dictionary is transmitted implicitly at no extra cost, because the decoder has access to all previously encoded text
LZ77
Key benefits:
relatively easy to implement
decoding can be performed extremely quickly using only a small amount of memory
suitable when the resources required for decoding must be minimized, like when data is distributed or broadcast from a central source to a number of small computers
LZ77
The output of an encoder consists of a sequence of triples, e.g. <3,2,b>
the first component of a triple indicates how far back to look in the previous (decoded) text to find the next phrase
the second component records how long the phrase is
the third component gives the next character from the input
LZ77
Components 1 and 2 constitute a pointer back into the text
the component 3 is actually necessary only when the character to be coded does not occur anywhere in the previous input
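Decoding such triples needs very little code. The sketch below assumes triples of the form (distance back, phrase length, next character); the function name `lz77_decode` and the example sequence are made up for illustration.

```python
def lz77_decode(triples):
    """Rebuild the text from (distance_back, length, next_char) triples."""
    out = []
    for back, length, ch in triples:
        start = len(out) - back
        # Copy one character at a time, so a phrase may overlap
        # the text that is being produced.
        for k in range(length):
            out.append(out[start + k])
        out.append(ch)
    return "".join(out)

print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (2, 1, "a"), (3, 2, "b")]))
# abaabab
```

The last triple is the <3,2,b> from the example: look back 3 characters, copy 2 of them, then append ’b’.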
LZ77
Encoding for the text from the current point ahead:
search for the longest match in the previous text
output a triple that records the position and length of the match
the search for a match may return a length of zero, in which case the position of the match is not relevant
search can be accelerated by indexing the prior text with a suitable data structure
LZ77
limitations on how far back a pointer can refer and the maximum size of the string referred to
e.g. for English text, a window of a few thousand characters
the length of the phrase e.g. maximum of 16 characters
otherwise too much space wasted without benefit
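An encoder with these two limits can be sketched with a brute-force search. This is only an illustration: the function name `lz77_encode` is made up, the defaults of 4096 and 16 are assumed values in the spirit of the figures above, and a practical encoder would index the window instead of scanning it.

```python
def lz77_encode(text, window=4096, max_len=16):
    """Encode text as (distance_back, length, next_char) triples."""
    triples = []
    i = 0
    while i < len(text):
        best_back, best_len = 0, 0
        start = max(0, i - window)  # pointers may not reach further back
        # Keep one character in reserve for the third component.
        limit = min(max_len, len(text) - i - 1)
        for j in range(start, i):
            k = 0
            while k < limit and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_back, best_len = i - j, k
        triples.append((best_back, best_len, text[i + best_len]))
        i += best_len + 1
    return triples

print(lz77_encode("abab"))  # [(0, 0, 'a'), (0, 0, 'b'), (2, 1, 'b')]
```

A match of length zero produces a triple such as (0, 0, 'a'), where the first component carries no information.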
LZ77
The decoding program is very simple, so it can be included with the data at very little cost
in fact, the compressed data is stored as part of the decoder program, which makes the data self-expanding
common way to distribute files
Synchronization
Good compression methods perform best when compressing large files
this tends to preclude random access because the decompression algorithms are sequential by nature
full-text retrieval systems require random access, and special measures need to be taken to facilitate this
Synchronization
There are two reasons that the better compression methods require files to be decoded from the beginning
they use variable-length codes
their models are adaptive
Synchronization
With variable-length codes, it is not possible to begin decoding at an arbitrary position in the file
we cannot be sure of starting on the boundary between two codewords
adaptive modelling: even if the codeword boundary is known, the model required for decoding can be constructed only using all the preceding text
Synchronization
To achieve synchronization with adaptive modelling, large files must be broken up into small sections of a few kilobytes
good compression cannot be obtained by compressing the small sections separately -> the compression model has to be constructed based on a sample text of the whole data
Synchronization
For full-text retrieval, it is usually preferable to use a static model rather than an adaptive one
static models are more appropriate given the static nature of a large textual database
Synchronization
In a full-text retrieval system, the main text usually consists of a number of documents, which represent the smallest unit to which random access is required
in an uncompressed text, a document can be identified simply by specifying how many bytes it is from the start of the file
when a variable-length code is used, a document may not start on a byte boundary
Synchronization
A solution is to insist that documents begin on byte boundaries, which means that the last byte may contain some wasted bits
there has to be some way to tell the decoder that it should not interpret some bits
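One possible way to do this is sketched below: each document’s bits are padded to a whole number of bytes, and an index records the byte offset together with the number of valid bits. Storing the bit length in the index is an assumption made for this sketch (the slides do not say which mechanism is used), and the function names are made up.

```python
def pack_documents(bitstrings):
    """Concatenate coded documents (strings of '0'/'1'), byte-aligning each."""
    data = bytearray()
    index = []  # one (byte_offset, valid_bit_count) entry per document
    for bits in bitstrings:
        index.append((len(data), len(bits)))
        padded = bits + "0" * (-len(bits) % 8)  # wasted bits in the last byte
        for k in range(0, len(padded), 8):
            data.append(int(padded[k:k + 8], 2))
    return bytes(data), index

def unpack_document(data, index, doc_id):
    """Random access: return only the valid bits of one document."""
    offset, nbits = index[doc_id]
    nbytes = (nbits + 7) // 8
    bits = "".join(format(b, "08b") for b in data[offset:offset + nbytes])
    return bits[:nbits]  # the padding bits are never interpreted

data, index = pack_documents(["101", "1111000011"])
print(index, unpack_document(data, index, 1))
```

With a static model, the bits returned for a document can be decoded without looking at any other document.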