Improving the Efficiency of Lossless Text Data Compression Algorithms
A comparison of two reversible transforms
James R. Achuff
Penn State Great Valley
School of Graduate Professional Studies
30 East Swedesford Road, Malvern, PA 19355, USA
Abstract: Over the last decade the amount of textual information available in electronic form has exploded. It is estimated that text data currently comprises nearly half of all Internet traffic, but as of yet, no lossless compression standard for text has been proposed.
A number of lossless text compression algorithms exist, however, none of these methods is able to consistently reach its theoretical best-case compression ratio.
This paper evaluates the performance characteristics of several popular compression algorithms and explores two strategies for improving ratios without significantly impacting computation time.
Key words: Text Compression, Lossless Compression, Reversible Transform
1. INTRODUCTION
Compression means making things smaller by applying pressure. Data compression means reducing the number of bits needed to represent a particular piece of data. Text compression means reducing the number of bits or bytes needed to store textual information. It is necessary that the compressed form can be decompressed to reconstitute the original text, and it is usually important that the original is recreated exactly, not approximately. This differentiates text compression from many other kinds of data reduction, such as voice or picture coding, where some degradation of the signal may be tolerable if the compression achieved is worth the reduction in quality. [Bell, Cleary & Witten, 1990]
The immutable yardstick by which data compression is measured is the “compression ratio”, or ratio of the size of a compressed file to the original uncompressed file. For example, suppose a data file takes up 100 kilobytes (KB). Using data compression software, that file could be reduced in size to, say, 50 KB, making it easier to store on disk and faster to transmit over a network connection. In this specific case, the data compression software reduces the size of the data file by a factor of two, or results in a “compression ratio” of 2:1.
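For illustration, a short Python sketch of this calculation is given below, using the standard zlib module (an implementation of DEFLATE, which combines the LZ77 and Huffman methods discussed later); the sample text and function name are only an example.

```python
import zlib

def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Ratio of original size to compressed size; 2.0 corresponds to 2:1."""
    return len(original) / len(compressed)

text = b"the quick brown fox jumps over the lazy dog " * 100
packed = zlib.compress(text)        # lossless: zlib.decompress(packed) == text
print(f"{compression_ratio(text, packed):.1f}:1")
```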
There are “lossless” and “lossy” forms of data compression. Lossless data compression is used when the data has to be uncompressed exactly as it was before compression. Text files are
stored using lossless techniques, since losing a single character can in the worst case make the text dangerously misleading. Lossless compression ratios are generally in the range of 2:1 to 8:1.
Compression algorithms reduce the redundancy in data to decrease the storage requirements for that data. Data compression offers an attractive approach to reducing communications and storage costs by using available bandwidth effectively. With the trend of increasing amounts of digital data being transmitted over public and private networks expected to continue, it makes sense to pursue research on developing algorithms that can most effectively use available network bandwidth by maximally compressing data. This paper is focused on addressing this problem for lossless compression of text files. It is well known that there are theoretical predictions on how far a source file can be losslessly compressed [Shannon, 1951], but no existing compression approaches consistently attain these bounds over wide classes of text files.
One approach to tackling the problem of developing methods to improve compression is to develop better compression algorithms. However, given the sophistication of existing algorithms such as arithmetic coding, Lempel-Ziv algorithms, Dynamic Markov Coding, Prediction by Partial Match and their variants, it seems unlikely that major new progress will be made in this area.
An alternate approach, which is taken in this paper, is to perform a lossless, reversible transformation on a source file prior to applying an existing compression algorithm. This transformation is designed to make it easier to compress the source file. Figure 1 illustrates this strategy. The original text file is provided as input to the transformation, which outputs the transformed text. This output is provided to an existing, unmodified data compression algorithm, which compresses the transformed text. To decompress, one simply reverses the process by first invoking the appropriate decompression algorithm and then providing the resulting text to the inverse transform.
Figure 1. Text compression process involving a lossless, reversible transform
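As a minimal sketch of this pipeline, the following Python fragment composes a trivial placeholder transform (case swapping, chosen only because it is exactly reversible) with zlib as the unmodified back-end compressor; the *-encoding and LIPT transforms described later would take the place of `transform` and `inverse_transform`.

```python
import zlib

def transform(text: bytes) -> bytes:
    # Placeholder reversible transform; *-encoding or LIPT would go here.
    return text.swapcase()

def inverse_transform(text: bytes) -> bytes:
    return text.swapcase()

def compress(text: bytes) -> bytes:
    return zlib.compress(transform(text))          # back-end compressor is unmodified

def decompress(data: bytes) -> bytes:
    return inverse_transform(zlib.decompress(data))

sample = b"Any text; the round trip must reproduce it exactly."
assert decompress(compress(sample)) == sample
```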
There are several important observations about this strategy. The transformation must be exactly reversible, so that the overall lossless text compression requirement is not compromised. The data compression and decompression algorithms are unmodified, so they do not exploit information about the transformation while compressing. The intent is to use the strategy to improve the overall compression ratio of the text in comparison with that achieved by the compression algorithm alone. A similar strategy has been employed in the compression of
images and video transmissions using the Fourier transform, Discrete Cosine Transform or wavelet transforms. In these cases, however, the transforms are usually lossy, meaning that some data can be lost without compromising the interpretation of the image by a human.
One well-known example of the text compression strategy outlined in Figure 1 is the Burrows-Wheeler Transform (BWT). BWT-based compressors combine the transform with ad-hoc techniques (Run Length Encoding, Move to Front) and Huffman coding to provide some of the best compression ratios available on a wide range of data.
1.1 Lossless Text Compression Algorithms
As stated above, text compression ought to be exact – the reconstructed message should be identical to the original. Exact compression is also called noiseless (because it does not introduce any noise into the signal), lossless (since no information is lost), or reversible (because compression can be reversed to recover the original input exactly).
The task of finding a suitable model for text is an extremely important problem in compression. Data compression is inextricably bound up with prediction. In the extreme case, if one can predict infallibly what is going to come next, one can achieve perfect compression by dispensing with transmission altogether. Even if one can only predict approximately what is coming next, one can get by with transmitting just enough information to disambiguate the prediction. Once predictions are available, they are processed by an encoder that turns them into binary digits to be transmitted.
There are three ways that the encoder and decoder can maintain the same model: static, semiadaptive, and adaptive modelling. In static modelling the encoder and decoder agree on a fixed model, regardless of the text to be encoded. This is the method employed when sending a message via Morse Code. In semiadaptive modelling, a “codebook” of the most frequently used words or phrases is transmitted first and then used to encode and decode the message. Adaptive modelling builds its “codebook” as it progresses according to a predefined method. In this way, both the encoder and decoder use the same codebook without ever having to transmit the codes with the data.
1.1.1 Huffman Coding
In 1952, D. A. Huffman introduced his method for the construction of minimum redundancy codes – now more commonly known as “Huffman Coding”. In Huffman Coding, the characters in a data file are converted to a binary code, where the most common characters in the file have the shortest binary codes, and the least common have the longest. This is accomplished by building a binary tree based upon the frequency with which characters occur in a file.
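A compact sketch of this construction is shown below. It represents codes as a dictionary rather than an explicit tree and is illustrative only, not Huffman's original formulation; a text with a single distinct character would receive an empty code and would need special handling in practice.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Assign shorter bit strings to more frequent characters."""
    # Each heap entry: (frequency, tie-breaker, {char: code-so-far}).
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, low = heapq.heappop(heap)       # two least frequent subtrees
        f2, _, high = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in low.items()}
        merged.update({ch: "1" + code for ch, code in high.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("this is an example of a huffman tree")
print(sorted(codes.items(), key=lambda kv: len(kv[1])))  # frequent chars get short codes
```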
1.1.2 Arithmetic Coding
In arithmetic coding a message is represented by an interval of real numbers between 0 and 1. As the message becomes longer, the interval needed to represent it becomes smaller, and the number of bits needed to specify that interval grows. Successive symbols of the message reduce the size of the interval in accordance with the symbol probabilities generated by the model. The more likely symbols reduce the range by less than the unlikely symbols and hence add fewer bits to the message.
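The interval-narrowing step can be sketched as follows, assuming a fixed (static) two-symbol model; a practical arithmetic coder would also renormalise the interval and emit bits incrementally rather than keeping exact fractions.

```python
from fractions import Fraction

def narrow_interval(message: str, probs: dict):
    """Shrink [low, high) symbol by symbol; likely symbols shrink it least."""
    cum, start = Fraction(0), {}
    for ch, p in probs.items():          # cumulative probability at each symbol's start
        start[ch] = cum
        cum += p
    low, high = Fraction(0), Fraction(1)
    for ch in message:
        width = high - low
        high = low + width * (start[ch] + probs[ch])
        low = low + width * start[ch]
    return low, high                     # any number in this interval identifies the message

model = {"a": Fraction(2, 3), "b": Fraction(1, 3)}
print(narrow_interval("aab", model))     # (8/27, 4/9)
```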
1.1.3 LZ Coding
In 1977, Jacob Ziv and Abraham Lempel described an adaptive dictionary encoder in which they “employ the concept of encoding future segments of the [input] via maximum-length copying from a buffer containing the recent past output.” The essence being that phrases are replaced with a pointer to where they have occurred earlier in the text.
Figure 2 illustrates how well this approach works for a variety of texts by indicating some of many instances where phrases could be replaced in this manner. A phrase might be a word, part of a word, or several words. It can be replaced with a pointer as long as it has occurred once before in the text, so coding adapts quickly to a new topic.
Figure 2. The principle of Ziv-Lempel coding – phrases are coded as pointers to earlier occurrences
Decoding a text that has been compressed in this manner is straightforward; the decoder simply replaces a pointer by the already decoded text to which it points. In practice LZ coding achieves good compression, and an important feature is that decoding can be very fast.
1.1.3.1 LZ77
LZ77 was the first form of LZ coding to be published. In this scheme pointers denote phrases in a fixed-size window that precedes the coding position. There is a maximum length for substrings that may be replaced by a pointer, usually 10 to 20 characters. These restrictions allow LZ77 to be implemented using a “sliding window” of N characters.
Ziv and Lempel showed that LZ77 could give at least as good compression as any semiadaptive dictionary designed specifically for the string being encoded, if N is sufficiently large. The main disadvantage of LZ77 is that although each encoding step requires a constant amount of time, that constant can be large, and a straightforward implementation can require a vast number of character comparisons per character coded. This property of slow encoding and fast decoding is common to many LZ schemes.
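A naive sketch of the sliding-window search follows. The window size, maximum match length and greedy parsing are illustrative choices; the brute-force inner loop is precisely the costly character comparison noted above, and the decoder merely copies `length` characters from `offset` positions back and appends the literal.

```python
def lz77_tokens(data: str, window: int = 4096, max_len: int = 18):
    """Greedy LZ77 parse into (offset, length, next_char) triples."""
    i, tokens = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        # Search the sliding window for the longest match (naive scan of every position).
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        nxt = data[i + best_len] if i + best_len < len(data) else ""
        tokens.append((best_off, best_len, nxt))
        i += best_len + 1
    return tokens

print(lz77_tokens("abcabcabcabd"))   # [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (3, 8, 'd')]
```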
1.1.4 Dynamic Markov Coding
Finite-state probabilistic models are based on finite-state machines. They have a set of states and transition probabilities that give the likelihood of the model moving from one state to another, and each state is labelled uniquely. Figure 3 shows a simple model with two states, 0 and 1.
Finite state-based modelling is typically too slow and too computationally cumbersome to support practical text compression. Dynamic Markov Coding (DMC) however, provides an efficient way of building complex state models that fit a particular sequence and is generally regarded as the only state-based technique that can be applied to text compression. [Bell, Witten & Cleary, 1989]
Figure 3. An order-1 finite state model for 0 and 1
The basic idea of DMC is to maintain frequency counts for each transition in the current finite-state model, and to “clone” a state when a related transition becomes sufficiently popular. Cloning consumes resources by creating an extra state, and should not be performed unless it is likely to be productive. High-frequency transitions have, by definition, been traversed often in the past and are therefore likely to be traversed often in the future. Consequently, they are likely candidates for cloning, since any correlations discovered will be utilised frequently.
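A loose sketch of the cloning idea is given below. It is not the exact DMC bookkeeping (which tracks incoming-transition counts separately and uses tuned thresholds); it only illustrates keeping per-transition frequency counts on a bit-level model and splitting off a clone once one incoming transition becomes sufficiently popular.

```python
class State:
    """One state of a bit-level model: successor and frequency count per bit."""
    def __init__(self):
        self.next = {0: self, 1: self}
        self.count = {0: 1, 1: 1}

def dmc_step(state: State, bit: int, min_this: int = 2, min_rest: int = 2) -> State:
    """Record one transition and clone its target if the transition is popular."""
    target = state.next[bit]
    here = state.count[bit]                          # traffic along this transition
    total = target.count[0] + target.count[1]        # rough proxy for all traffic into target
    if here > min_this and total - here > min_rest:
        clone = State()
        clone.next = dict(target.next)
        share = here / total
        for b in (0, 1):                             # split outgoing counts between the copies
            clone.count[b] = target.count[b] * share
            target.count[b] -= clone.count[b]
        state.next[bit] = clone                      # only this transition is redirected
        target = clone
    state.count[bit] += 1
    return target

s = State()
for b in (0, 1, 1, 0, 1, 1, 0, 1, 1):
    s = dmc_step(s, b)
```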
1.1.5 Prediction by Partial Match
Prediction by Partial Match (PPM) is a statistical, predictive text compression algorithm originally proposed by Cleary and Witten in 1984 and refined by Moffat in 1988. PPM and its derivatives have consistently outperformed dictionary-based methods as well as other statistical methods for text compression. PPM maintains a list of already seen strings, conventionally called contexts. For example, after processing the string ababc, the contexts are {ε, a, b, c, ab, ba, bc, aba, bab, abc, abab, babc, ababc}, where ε denotes the empty context. For each context PPM maintains a list of characters that appeared after the context, and keeps track of how often each of those characters appeared. So in the given example the subsequent characters for, say, ab are a and c, each with a count of one. Normally, efficient implementations of PPM maintain contexts dynamically in a context trie. A context trie is a tree with characters as nodes, where any path from the root to a node represents the context formed by concatenating the characters along this path. The root node does not contain any character and represents the empty context (i.e., no preceding characters). In a context trie, children of a node constitute all characters that have
been seen after its context. In order to keep track of the number of times that a certain character followed a given context, the number of its occurrences is noted along each edge. Based on this information PPM can assign probabilities to potentially subsequent characters. [Cleary and Witten, 1984]
The length of contexts is also called their order. Note that contexts of different order might yield different counts leading to varying predictions.
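The context statistics can be sketched with a flat dictionary keyed by context strings rather than an explicit trie; the counts are the same either way, and the maximum order of 2 is an arbitrary illustrative choice.

```python
from collections import defaultdict

def context_counts(text: str, max_order: int = 2):
    """For every context up to max_order, count which characters follow it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, ch in enumerate(text):
        for order in range(min(i, max_order) + 1):   # order 0 is the empty context
            counts[text[i - order:i]][ch] += 1
    return counts

counts = context_counts("ababc")
print(dict(counts["ab"]))   # {'a': 1, 'c': 1}, matching the example above
```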
1.1.6 Burrows-Wheeler Transform
Burrows and Wheeler released a research report in 1994 entitled “A Block Sorting Lossless Data Compression Algorithm” which presented a data compression algorithm based on Wheeler’s earlier work.
The BWT is an algorithm that takes a block of data and rearranges it using a sorting scheme. The resulting output block contains exactly the same data elements that it started with differing only in their ordering. The transformation is reversible and lossless, meaning that the original ordering of the data elements can be restored with no loss of fidelity.
The BWT is performed on an entire block of data at once, preferably the largest amount possible. Since the BWT operates on data in memory, it must often break files up into smaller pieces and process one piece at a time.
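A naive sketch of the transform and its inverse is shown below; practical implementations use suffix sorting and a linear-time inverse rather than the quadratic methods here, and the explicit end-of-block marker is just one of several ways to make the transform reversible.

```python
def bwt(block: str, eob: str = "\0") -> str:
    """Sort all rotations of the block and keep the last column."""
    s = block + eob                      # unique end-of-block marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str, eob: str = "\0") -> str:
    """Rebuild the rotation table one column at a time, then pick the original row."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(eob))[:-1]

assert inverse_bwt(bwt("banana")) == "banana"
```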
2. LOSSLESS, REVERSIBLE TRANSFORMS
Work done by Awan and Mukherjee and Franceschini et al. details several lossless, reversible transforms that can be applied to text files in order to improve their compressibility by established algorithms. Two have been selected for study in this paper: star encoding (or *-encoding) and length index preserving transform (LIPT).
2.1.1 Star Encoding
The first transform proposed is an algorithm developed by Franceschini et al. Star encoding (or *-encoding) is designed to exploit the natural redundancy of the language. It is possible to replace certain characters in a word by a special placeholder character and retain a few key characters so that the word is still retrievable.
For example, given a set of six-letter words {school, simple, strong, sturdy, supple}, and replacing “unnecessary” characters with a chosen symbol ‘*’, the set can be represented unambiguously as {**h***, **m***, **r***, **u***, **p***}. In *-encoding, such an unambiguous representation of a word, formed from a partial sequence of its original letters with the special character ‘*’ as a placeholder for the rest, is called the signature of the word.
*-encoding utilises an indexed and sorted dictionary containing the natural form and the signature of each word. No word in a 60,000 word English dictionary required the use of more than two unencoded characters in its signature using Franceschini’s scheme. The predominant character in *-encoded text is ‘*’ which occupies more than fifty percent of the space. If a word is not in the dictionary, it is passed to the transformed text unaltered.
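The encode and decode steps can be sketched with the five-word example above; the small table stands in for the shared 60,000-word dictionary, and the sketch ignores the capitalisation and punctuation handling discussed next.

```python
# Shared dictionary mapping words to signatures (built offline and shared by both sides).
SIGNATURES = {"school": "**h***", "simple": "**m***", "strong": "**r***",
              "sturdy": "**u***", "supple": "**p***"}
WORDS = {sig: word for word, sig in SIGNATURES.items()}

def star_encode(text: str) -> str:
    # Words not in the dictionary pass through to the transformed text unaltered.
    return " ".join(SIGNATURES.get(w, w) for w in text.split())

def star_decode(text: str) -> str:
    return " ".join(WORDS.get(w, w) for w in text.split())

original = "a simple but sturdy school"
assert star_decode(star_encode(original)) == original
print(star_encode(original))   # a **m*** but **u*** **h***
```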
The main drawback of *-encoding is that the compressor and decompressor need to share a dictionary. The aforementioned 60,000 word English dictionary requires about one megabyte of storage overhead that must be shared by all users of this transform. Also, special provisions
made to handle capitalisation, punctuation marks and special characters will most likely contribute to a slight increase of the size of the input text in its transformed form.
2.1.2 LIPT
Another method investigated here is the Length Index Preserving Transform or LIPT. Fawzia S. Awan and Amar Mukherjee developed LIPT as part of their project work at the University of Central Florida. LIPT is a dictionary method that replaces words in a text file with a marker character, a dictionary index and a word index.
LIPT is defined as follows: words of length more than four are encoded starting with ‘*’, which allows predictive compression algorithms to strongly predict the space character preceding a ‘*’ character. The last three characters form an encoding of the dictionary offset of the corresponding word. For words of more than four characters, the characters between the initial ‘*’ and the final three-character sequence in the word encoding are constructed using a suffix of the string ‘…nopqrstuvw’. For instance, the first word of length 10 would be encoded as ‘*rstuvwxyzaA’. This method provides a strong local context within each word encoding and its delimiters.
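The sketch below illustrates only the general dictionary-offset idea behind LIPT, namely replacing a word with ‘*’ followed by letters that encode its position in a shared, pre-sorted dictionary. The letter mapping and the toy dictionary are invented for illustration and do not reproduce Awan and Mukherjee's exact length-grouped encoding.

```python
import string

def offset_letters(n: int) -> str:
    """Encode a dictionary offset as lowercase letters (base 26): 0 -> 'a', 25 -> 'z', 26 -> 'ba'."""
    digits = []
    while True:
        n, r = divmod(n, 26)
        digits.append(string.ascii_lowercase[r])
        if n == 0:
            return "".join(reversed(digits))

def lipt_like_encode(text: str, dictionary: list) -> str:
    offsets = {w: i for i, w in enumerate(dictionary)}
    return " ".join(("*" + offset_letters(offsets[w])) if w in offsets else w
                    for w in text.split())

toy_dictionary = ["the", "of", "and", "compression", "transform"]  # hypothetical sorted dictionary
print(lipt_like_encode("the compression transform", toy_dictionary))   # *a *d *e
```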
3. PROCESS
To evaluate these methods, they were applied to the Calgary Corpus, a collection of text files that was originally used by Bell, Witten and Cleary in 1989 to evaluate the practical performance of various text compression schemes. The methods were also applied to three html files in order to supply a more “modern” facet to the test corpus.
3.1 Test Corpus
In the Calgary Corpus, nine different types of text are represented, and to confirm that the performance of schemes is consistent for any given type, many of the types have more than one representative. Normal English, both fiction and non-fiction, is represented by two books and six papers (labelled book1, book2, paper1, paper2, paper3, paper4, paper5, paper6). More unusual styles of English writing are found in a bibliography (bib) and a batch of unedited news articles (news). Three computer programs represent artificial languages (progc, progl, progp), and a transcript of a terminal session (trans) is included to indicate the increase in speed that could be achieved by applying compression to a slow terminal line. All of the above files use ASCII encoding. Some non-ASCII files are also included: two files of executable code (obj1, obj2), some geophysical data (geo), and a bit-map black and white picture (pic). The file geo is particularly difficult to compress because it contains a wide range of data values, while the file pic is highly compressible because of large amounts of white space in the picture, represented by long runs of zeros. [Witten and Bell, 1990]
3.2 Additional Test Files
The additional html files were chosen to be representative of “average” web traffic. One is the front page of an American university (http://www.psu.edu), another is the front page of a popular Internet auction site (http://www.ebay.com) and the third is the main page of a popular multimedia web content company (http://www.real.com). Each contained different types of web content and page structures.
The following charts display the compression ratios for each file, grouped roughly by content type. It is interesting to note that the transforms generally, though not always, provide better compression.
Figure 4. Compression Ratios for bib, book1, book2, and news (each file shown with no transform, with *-encoding and with LIPT, compressed by PK-ZIP 2.50, bzip (BWT), Gzip (LZ77), Arithmetic Coding, DMC, Huffman Coding and PPM with no training)
Figure 5. Compression Ratios for geo, obj1, obj2 (same transforms and compression methods as Figure 4)
Figure 6. Compression Ratios for pic (same transforms and compression methods as Figure 4)
Figure 7. Compression Ratios for paper1, paper2, paper3, paper4, paper5, paper6 (same transforms and compression methods as Figure 4)
Figure 8. Compression Ratios for progc, progl, progp, trans (same transforms and compression methods as Figure 4)
Figure 9. Compression Ratios for html1, html2, html3 (same transforms and compression methods as Figure 4)
It is interesting to note that the transforms typically do not result in increased performance for arithmetic or Huffman coding. In fact, LIPT actually decreases the compression ratio for arithmetic coding by almost a third for the English language text files (bib, book1, book2, news, paper1, paper2, paper3, paper4, paper5, paper6, progc, progl, progp, trans).
*-encoding caused a decrease in compression for bib, book2, news, progl, progp and trans with PPM encoding, and for book1, book2 and progp with BWT encoding. Other than those, the transforms typically offer some increase. *-encoding offered improvements of 11% to nearly 15% of original file size for the books and papers when coupled with Huffman coding, and LIPT offered improvements of up to 8% in combination with Huffman coding.
Overall, PPM with LIPT produced the best compression ratios of the English language text files and was nearly as good as any other method on the other files.
5. CONCLUSION
This paper has shown that it is possible to make textual data more compressible, even if only to a small degree, by applying an intermediate reversible transform to the data prior to compression. Although not specifically measured for this paper, the time impact of applying these transforms to the data was not observed to be significant.
Transform encoding offered improvements of up to 15% for some standard compression methods and, depending on the methods used and the type of text contained in the input file, can yield compression ratios of over 13:1; in general it has a beneficial effect on the compressibility of data relative to the standard compression algorithms alone.
It is recommended that further investigation be made into the applicability of this process to html files in an effort to decrease download times for web information and to conserve Internet bandwidth.
6. REFERENCES
Akman, K. Ibrahim. “A New Text Compression Technique Based on Language Structure.” Journal of Information Science. 21, no. 2 (February 1995): 87-95.
Awan, F. S. and A. Mukherjee, LIPT: A Lossless Text Transform to Improve Compression. [paper on-line] School of Electrical Engineering and Computer Science, University of Central Florida, available from http://vlsi.cs.ucf.edu/listpub.html; Internet; accessed 9 July 2001.
Bell, T. C., J. G. Cleary and I. H. Witten. Text Compression. Englewood Cliffs: Prentice-Hall, 1990.
Bell, Timothy, Ian H. Witten and John G. Cleary. “Modelling for Text Compression.” ACM Computing Surveys. 21, no. 4 (December 1989): 557-591.
Burrows, M. and D. J. Wheeler. “A Block-sorting Lossless Data Compression Algorithm.” SRC Research Report 124, Digital Systems Research Center, Palo Alto, (May 1994) available from http://citeseer.nj.nec.com/76182.html; Internet; accessed 15 July 2001.
Cleary, J. G. and I.H. Witten. “Data Compression Using Adaptive Coding and Partial String Matching.” IEEE Transactions on Communications. 32, no 4 (April 1984): 396-402.
Crochemore, Maxime and Thierry Lecroq. “Pattern-Matching and Text Compression Algorithms.” ACM Computing Surveys. 28, no. 1 (March 1996): 39-41.
Fenwick, P. Symbol Ranking Text Compression with Shannon Recodings. [paper on-line] Department of Computer Science, The University of Auckland, 6 June 1996 available from ftp://ftp.cs.auckland.ac.nz/out/peter-f/TechRep132; Internet; accessed 6 June 2001.
Franceschini, R., H. Kruse, N. Zhang, R. Iqbal and A. Mukherjee. Lossless, Reversible Transformations that Improve Text Compression Ratios. [paper on-line] School of Electrical Engineering and Computer Science, University of Central Florida, available from http://vlsi.cs.ucf.edu/listpub.html; Internet; accessed 9 July 2001.
Goebel, G.V., Data Compression. available from http://vectorsite.tripod.com/ttdcmp0.html; Internet; accessed 14 May, 2001.
Huffman, D. A. “A Method for the Construction of Minimum-Redundancy Codes.” Proceedings of the Institute of Radio Engineers. 40, no 9 (September 1952): 1098-1101.
Moffat, Alistair, Radford M. Neal and Ian H. Witten. “Arithmetic Coding Revisited.” ACM Transactions on Information Systems. 16, no. 3 (July 1998): 256-294.
Motgi, N. and A. Mukherjee, Network Conscious Text Compression System (NCTCSys). [paper on-line] School of Electrical Engineering and Computer Science, University of Central Florida, available from http://vlsi.cs.ucf.edu/listpub.html; Internet; accessed 9 July 2001.
Nelson, M. and J. L. Gailly. The Data Compression Book 2nd Edition. New York: M&T Books, 1996.
Nelson, Mark. “Data Compression with the Burrows-Wheeler Transform.” Dr. Dobb’s Journal. (September 1996) available from http://www.dogma.net/markn/articles/bwt/bwt.htm; Internet; accessed 18 June 2001.
Salomon, D. Data Compression: The Complete Reference 2nd Edition. New York: Springer-Verlag 2000.
Sayood, K. Introduction to Data Compression 2nd Edition. San Diego: Academic Press, 2000.
Shannon, C. E. “Prediction and Entropy of Printed English.” Bell System Technical Journal. 30 (January 1951): 50-64.
Stork, Christian H., Vivek Haldar and Michael Franz. Generic Adaptive Syntax-Directed Compression for Mobile Code. [paper on-line] Department of Information and Computer Science, University of California, Irvine, available from http://www.ics.uci.edu/~franz/pubs-pdf/ICS-TR-00-42.pdf; Internet; accessed 14 July 2001.
Wayner, P. Compression Algorithms for Real Programmers. San Diego: Academic Press, 2000.
Witten, I. and T. Bell, README. included with the Calgary Corpus (May 1990) available from ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus; Internet; accessed 25 June 2001.
Ziv, J. and A. Lempel. “A Universal Algorithm for Sequential Data Compression.” IEEE Transactions on Information Theory. IT-23, no. 3 (May 1977): 337-343.