
Improving the Efficiency of Lossless Text Data Compression Algorithms
A comparison of two reversible transforms

James R. Achuff
Penn State Great Valley, School of Graduate Professional Studies
30 East Swedesford Road, Malvern, PA 19355, USA

Abstract: Over the last decade the amount of textual information available in electronic form has exploded. It is estimated that text data currently comprises nearly half of all Internet traffic, but as of yet, no lossless compression standard for text has been proposed.

A number of lossless text compression algorithms exist, however, none of these methods is able to consistently reach its theoretical best-case compression ratio.

This paper evaluates the performance characteristics of several popular compression algorithms and explores two strategies for improving ratios without significantly impacting computation time.

Key words: Text Compression, Lossless Compression, Reversible Transform

1. INTRODUCTION

Compression means making things smaller by applying pressure. Data compression means reducing the amount of bits needed to represent a particular piece of data. Text compression means reducing the amount of bits or bytes needed to store textual information. It is necessary that the compressed form can be decompressed to reconstitute the original text, and it is usually important that the original is recreated exactly, not approximately. This differentiates text compression from many other kinds of data reduction, such as voice or picture coding, where some degradation of the signal may be tolerable if the compression achieved is worth the reduction in quality. [Bell, Cleary & Witten, 1990]

The immutable yardstick by which data compression is measured is the “compression ratio”, or ratio of the size of a compressed file to the original uncompressed file. For example, suppose a data file takes up 100 kilobytes (KB). Using data compression software, that file could be reduced in size to, say, 50 KB, making it easier to store on disk and faster to transmit over a network connection. In this specific case, the data compression software reduces the size of the data file by a factor of two, or results in a “compression ratio” of 2:1.

There are “lossless” and “lossy” forms of data compression. Lossless data compression is used when the data has to be uncompressed exactly as it was before compression. Text files are stored using lossless techniques, since losing a single character can in the worst case make the text dangerously misleading. Lossless compression ratios are generally in the range of 2:1 to 8:1.

Compression algorithms reduce the redundancy in data to decrease the storage requirements for that data. Data compression offers an attractive approach to reducing communications and storage costs by using available bandwidth effectively. With the trend of increasing amounts of digital data being transmitted over public and private networks expected to continue, it makes sense to pursue research on developing algorithms that can most effectively use available network bandwidth by maximally compressing data. This paper is focused on addressing this problem for lossless compression of text files. It is well known that there are theoretical predictions on how far a source file can be losslessly compressed [Shannon, 1951], but no existing compression approaches consistently attain these bounds over wide classes of text files.
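As a rough illustration of such a bound (order-0 entropy only, which ignores the context information that Shannon's estimates for English exploit), the following Python sketch estimates the smallest size a purely context-free coder could reach for a file. The filename is just a placeholder for any corpus file.

```python
import math
from collections import Counter

def order0_entropy_bits_per_byte(data: bytes) -> float:
    """Order-0 (context-free) entropy: a loose lower bound on bits/byte
    for any coder that treats bytes as independent symbols."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Example: estimate the smallest size a context-free coder could reach.
text = open("book1", "rb").read()          # any Calgary Corpus file
bound = order0_entropy_bits_per_byte(text) * len(text) / 8
print(f"order-0 bound: {bound:.0f} bytes (original {len(text)} bytes)")
```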

One approach to tackling the problem of developing methods to improve compression is to develop better compression algorithms. However, given the sophistication of existing algorithms such as arithmetic coding, Lempel-Ziv algorithms, Dynamic Markov Coding, Prediction by Partial Match and their variants, it seems unlikely that major new progress will be made in this area.

An alternate approach, which is taken in this paper, is to apply a lossless, reversible transformation to a source file prior to applying an existing compression algorithm. This transformation is designed to make it easier to compress the source file. Figure 1 illustrates this strategy. The original text file is provided as input to the transformation, which outputs the transformed text. This output is provided to an existing, unmodified data compression algorithm, which compresses the transformed text. To decompress, one simply reverses the process by first invoking the appropriate decompression algorithm and then providing the resulting text to the inverse transform.

Figure 1. Text compression process involving a lossless, reversible transform

There are several important observations about this strategy. The transformation must be exactly reversible, so that the overall lossless text compression requirement is not compromised. The data compression and decompression algorithms are unmodified, so they do not exploit information about the transformation while compressing. The intent is to use the strategy to improve the overall compression ratio of the text in comparison with that achieved by the compression algorithm alone. A similar strategy has been employed in the compression of images and video transmissions using the Fourier transform, Discrete Cosine Transform or wavelet transforms. In these cases, however, the transforms are usually lossy, meaning that some data can be lost without compromising the interpretation of the image by a human.

One well-known example of the text compression strategy outlined in Figure 1 is the Burrows-Wheeler Transform (BWT). A BWT-based compressor combines the transform with ad-hoc techniques (Run Length Encoding, Move to Front) and Huffman coding to provide some of the best compression ratios available on a wide range of data.

1.1 Lossless Text Compression Algorithms

As stated above, text compression ought to be exact – the reconstructed message should be identical to the original. Exact compression is also called noiseless (because it does not introduce any noise into the signal), lossless (since no information is lost), or reversible (because compression can be reversed to recover the original input exactly).

The task of finding a suitable model for text is an extremely important problem in compression. Data compression is inextricably bound up with prediction. In the extreme case, if one can predict infallibly what is going to come next, one can achieve perfect compression by dispensing with transmission altogether. Even if one can only predict approximately what is coming next, one can get by with transmitting just enough information to disambiguate the prediction. Once predictions are available, they are processed by an encoder that turns them into binary digits to be transmitted.

There are three ways that the encoder and decoder can maintain the same model: static, semiadaptive, and adaptive modelling. In static modelling the encoder and decoder agree on a fixed model, regardless of the text to be encoded. This is the method employed when sending a message via Morse Code. In semiadaptive modelling, a “codebook” of the most frequently used words or phrases is transmitted first and then used to encode and decode the message. Adaptive modelling builds its “codebook” as it progresses according to a predefined method. In this way, both the encoder and decoder use the same codebook without ever having to transmit the codes with the data.

1.1.1 Huffman Coding

In 1952, D. A. Huffman introduced his method for the construction of minimum redundancy codes – now more commonly known as “Huffman Coding”. In Huffman Coding, the characters in a data file are converted to a binary code, where the most common characters in the file have the shortest binary codes, and the least common have the longest. This is accomplished by building a binary tree based upon the frequency with which characters occur in a file.
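As a minimal sketch of that tree-building idea (not the specific Huffman implementation benchmarked later in this paper), the following Python fragment assigns shorter bit strings to more frequent characters by repeatedly merging the two least frequent subtrees:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Build a Huffman code: frequent characters get short bit strings."""
    # Each heap entry: (frequency, tie-breaker, {char: code-so-far}).
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {c: "0" + code for c, code in left.items()}
        merged.update({c: "1" + code for c, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("this is an example of huffman coding")
print(codes[" "], codes["f"])   # the space, being frequent, gets a short code
```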

1.1.2 Arithmetic Coding

In arithmetic coding a message is represented by an interval of real numbers between 0 and 1. As the message becomes longer, the interval needed to represent it becomes smaller, and the number of bits needed to specify that interval grows. Successive symbols of the message reduce the size of the interval in accordance with the symbol probabilities generated by the model. The more likely symbols reduce the range by less than the unlikely symbols and hence add fewer bits to the message.
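A toy sketch of this interval narrowing, using exact fractions and an invented fixed model (practical coders work incrementally with integer arithmetic and adaptive probabilities):

```python
from fractions import Fraction

# An invented fixed model: cumulative probability range for each symbol.
model = {"a": (Fraction(0), Fraction(6, 10)),
         "b": (Fraction(6, 10), Fraction(9, 10)),
         "!": (Fraction(9, 10), Fraction(1))}   # '!' terminates the message

def encode(message: str) -> Fraction:
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        width = high - low
        sym_low, sym_high = model[sym]
        # Likely symbols shrink the interval less, so they cost fewer bits.
        low, high = low + width * sym_low, low + width * sym_high
    return (low + high) / 2          # any number inside the final interval

print(encode("aab!"))   # this single number identifies the message "aab!"
```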


1.1.3 LZ Coding

In 1977, Jacob Ziv and Abraham Lempel described an adaptive dictionary encoder in which they “employ the concept of encoding future segments of the [input] via maximum-length copying from a buffer containing the recent past output.” In essence, phrases are replaced with a pointer to where they have occurred earlier in the text.

Figure 2 illustrates how well this approach works for a variety of texts by indicating some of many instances where phrases could be replaced in this manner. A phrase might be a word, part of a word, or several words. It can be replaced with a pointer as long as it has occurred once before in the text, so coding adapts quickly to a new topic.

Figure 2. The principle of Ziv-Lempel coding – phrases are coded as pointers to earlier occurrences

Decoding a text that has been compressed in this manner is straightforward; the decoder simply replaces a pointer by the already decoded text to which it points. In practice LZ coding achieves good compression, and an important feature is that decoding can be very fast.

1.1.3.1 LZ77

LZ77 was the first form of LZ coding to be published. In this scheme pointers denote phrases in a fixed-size window that precedes the coding position. There is a maximum length for substrings that may be replaced by a pointer, usually 10 to 20 characters. These restrictions allow LZ77 to be implemented using a “sliding window” of N characters.

Ziv and Lempel showed that LZ77 could give at least as good compression as any semiadaptive dictionary designed specifically for the string being encoded, if N is sufficiently large. The main disadvantage of LZ77 is that although each encoding step requires a constant amount of time, that constant can be large, and a straightforward implementation can require a vast number of character comparisons per character coded. This property of slow encoding and fast decoding is common to many LZ schemes.
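The following brute-force sketch illustrates the sliding-window search described above; the window size, match-length limit and (offset, length, next character) output format are illustrative choices rather than those of any particular LZ77 implementation:

```python
def lz77_encode(text: str, window: int = 4096, max_len: int = 18):
    """Emit (offset, length, next_char) triples; offset 0 means a literal."""
    i, out = 0, []
    while i < len(text):
        start = max(0, i - window)
        best_off, best_len = 0, 0
        # Search the window preceding the coding position for the longest
        # match with the upcoming text.
        for j in range(start, i):
            length = 0
            while (length < max_len and i + length < len(text)
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        next_char = text[i + best_len] if i + best_len < len(text) else ""
        out.append((best_off, best_len, next_char))
        i += best_len + 1
    return out

print(lz77_encode("abracadabra abracadabra"))
```

Decoding simply copies `length` characters from `offset` positions back and appends `next_char`, which is why LZ decompression is so fast.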


1.1.4 Dynamic Markov Coding

Finite-state probabilistic models are based on finite-state machines. They have a set of states and transition probabilities that signify the likelihood of the model to transition from one state to another. Also, each state is labelled uniquely. Figure 3 shows a simple model with two states, 0 and 1.

Finite state-based modelling is typically too slow and too computationally cumbersome to support practical text compression. Dynamic Markov Coding (DMC) however, provides an efficient way of building complex state models that fit a particular sequence and is generally regarded as the only state-based technique that can be applied to text compression. [Bell, Witten & Cleary, 1989]

Figure 3. An order-1 finite state model for 0 and 1

The basic idea of DMC is to maintain frequency counts for each transition in the current finite-state model, and to “clone” a state when a related transition becomes sufficiently popular. Cloning consumes resources by creating an extra state, and should not be performed unless it is likely to be productive. High-frequency transitions have, by definition, been traversed often in the past and are therefore likely to be traversed often in the future. Consequently, they are likely candidates for cloning, since any correlations discovered will be utilised frequently.
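A highly simplified sketch of that cloning step is given below; the thresholds and the proportional splitting of counts follow the general DMC recipe rather than the exact variant evaluated here:

```python
MIN_CNT1, MIN_CNT2 = 2, 2   # typical small cloning thresholds (assumed values)

class State:
    def __init__(self):
        self.next = [None, None]    # successor state for bit 0 and bit 1
        self.count = [0.0, 0.0]     # transition frequency counts

def maybe_clone(u: "State", bit: int) -> None:
    """Clone u's successor on `bit` if that transition is popular enough
    and the successor also receives significant traffic from elsewhere."""
    v = u.next[bit]
    other_traffic = sum(v.count) - u.count[bit]
    if u.count[bit] >= MIN_CNT1 and other_traffic >= MIN_CNT2:
        clone = State()
        clone.next = list(v.next)               # same successors as v
        ratio = u.count[bit] / (u.count[bit] + other_traffic)
        # Split v's outgoing counts between v and its clone in proportion
        # to how much of v's traffic arrived through the cloned transition.
        for b in (0, 1):
            clone.count[b] = v.count[b] * ratio
            v.count[b] -= clone.count[b]
        u.next[bit] = clone                     # redirect the hot transition
```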

1.1.5 Prediction by Partial Match

Prediction by Partial Match (PPM) is a statistical, predictive text compression algorithm originally proposed by Cleary and Witten in 1984 and refined by Moffat in 1988. PPM and its derivatives have consistently outperformed dictionary-based methods as well as other statistical methods for text compression. PPM maintains a list of already seen string prefixes, conventionally called contexts. For example, after processing the string ababc, the contexts are the empty string together with a, b, c, ab, ba, bc, aba, bab, abc, abab, babc, and ababc. For each context PPM maintains a list of characters that appeared after the context. PPM also keeps track of how often the subsequent characters appeared. So in the given example the characters seen after, say, ab are a and c, both with a count of one. Normally, efficient implementations of PPM maintain contexts dynamically in a context trie. A context trie is a tree with characters as nodes, where any path from the root to a node represents the context formed by concatenating the characters along this path. The root node does not contain any character and represents the empty context (i.e., no prefix). In a context trie, children of a node constitute all characters that have been seen after its context. In order to keep track of the number of times that a certain character followed a given context, the number of its occurrences is noted along each edge. Based on this information PPM can assign probabilities to potentially subsequent characters. [Cleary and Witten, 1984]

The length of contexts is also called their order. Note that contexts of different order might yield different counts leading to varying predictions.
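The sketch below shows how such per-context counts could be accumulated for all orders up to a small maximum; it omits the escape mechanism and the arithmetic coder that a complete PPM implementation requires:

```python
from collections import defaultdict

def build_context_counts(text: str, max_order: int = 2):
    """Map each context (up to max_order characters) to counts of the
    characters that followed it, as a PPM model would maintain."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, ch in enumerate(text):
        for order in range(0, max_order + 1):
            if i - order < 0:
                break
            context = text[i - order:i]        # "" is the empty context
            counts[context][ch] += 1
    return counts

model = build_context_counts("ababc")
print(dict(model["ab"]))   # {'a': 1, 'c': 1}, matching the example above
```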

1.1.6 Burrows-Wheeler Transform

Burrows and Wheeler released a research report in 1994 entitled “A Block Sorting Lossless Data Compression Algorithm” which presented a data compression algorithm based on Wheeler’s earlier work.

The BWT is an algorithm that takes a block of data and rearranges it using a sorting scheme. The resulting output block contains exactly the same data elements that it started with differing only in their ordering. The transformation is reversible and lossless, meaning that the original ordering of the data elements can be restored with no loss of fidelity.

The BWT is performed on an entire block of data at once, preferably the largest amount possible. Since the BWT operates on data in memory, it must often break files up into smaller pieces and process one piece at a time.
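A naive sketch of the forward transform on a single block appears below; production implementations use suffix sorting rather than materialising every rotation, but the output is the same reordering of the input bytes plus the index needed for the inverse transform:

```python
def bwt(block: bytes) -> tuple[bytes, int]:
    """Return the last column of the sorted rotation matrix plus the
    row index of the original block, which the inverse transform needs."""
    rotations = sorted(block[i:] + block[:i] for i in range(len(block)))
    last_column = bytes(rot[-1] for rot in rotations)
    return last_column, rotations.index(block)

transformed, index = bwt(b"banana")
print(transformed, index)   # b'nnbaaa' 3 -- same bytes, merely reordered
```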

2. LOSSLESS, REVERSIBLE TRANSFORMS

Work done by Awan and Mukherjee and Franceschini et al. details several lossless, reversible transforms that can be applied to text files in order to improve their compressibility by established algorithms. Two have been selected for study in this paper: star encoding (or *-encoding) and length index preserving transform (LIPT).

2.1.1 Star Encoding

The first transform proposed is an algorithm developed by Franceschini et al. Star encoding (or *-encoding) is designed to exploit the natural redundancy of the language. It is possible to replace certain characters in a word by a special placeholder character and retain a few key characters so that the word is still retrievable.

For example, given a set of six-letter words {school, simple, strong, sturdy, supple} and replacing “unnecessary” characters with a chosen symbol ‘*’, the set can be represented unambiguously as {**h***, **m***, **r***, **u***, **p***}. In *-encoding, such an unambiguous representation of a word (a partial sequence of the word’s original letters interposed with ‘*’ placeholder characters) is called the signature of the word.

*-encoding utilises an indexed and sorted dictionary containing the natural form and the signature of each word. No word in a 60,000 word English dictionary required the use of more than two unencoded characters in its signature using Franceschini’s scheme. The predominant character in *-encoded text is ‘*’ which occupies more than fifty percent of the space. If a word is not in the dictionary, it is passed to the transformed text unaltered.
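A toy sketch of the substitution, using the five signatures from the example above as a stand-in for the full shared dictionary (capitalisation and punctuation handling are omitted):

```python
# Tiny stand-in for the shared dictionary: word -> unambiguous signature.
STAR_DICT = {"school": "**h***", "simple": "**m***", "strong": "**r***",
             "sturdy": "**u***", "supple": "**p***"}
INVERSE = {sig: word for word, sig in STAR_DICT.items()}

def star_encode(text: str) -> str:
    # Words not in the dictionary pass through to the output unaltered.
    return " ".join(STAR_DICT.get(w, w) for w in text.split())

def star_decode(text: str) -> str:
    return " ".join(INVERSE.get(w, w) for w in text.split())

encoded = star_encode("a simple strong school")
assert star_decode(encoded) == "a simple strong school"
print(encoded)   # a **m*** **r*** **h***
```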

The main drawback of *-encoding is that the compressor and decompressor need to share a dictionary. The aforementioned 60,000 word English dictionary requires about one megabyte of storage overhead that must be shared by all users of this transform. Also, the special provisions made to handle capitalisation, punctuation marks and special characters will most likely contribute to a slight increase in the size of the input text in its transformed form.

2.1.2 LIPT

Another method investigated here is the Length Index Preserving Transform or LIPT. Fawzia S. Awan and Amar Mukherjee developed LIPT as part of their project work at the University of Central Florida. LIPT is a dictionary method that replaces words in a text file with a marker character, a dictionary index and a word index.

LIPT is defined as follows: words of length more than four are encoded starting with ‘*’, which allows predictive compression algorithms to strongly predict the space character preceding a ‘*’ character. The last three characters form an encoding of the dictionary offset of the corresponding word. For words of more than four characters, the characters between the initial ‘*’ and the final three-character sequence in the word encoding are constructed using a suffix of the string ‘…nopqrstuvw’. For instance, the first word of length 10 would be encoded as ‘*rstuvwxyzaA’. This method provides a strong local context within each word encoding and its delimiters.
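The exact LIPT codeword layout is not reproduced here; the sketch below only illustrates the underlying idea of replacing a dictionary word with a ‘*’ marker followed by a letter-coded dictionary offset, using an invented five-word dictionary and a single offset letter:

```python
import string

# Hypothetical word list standing in for the shared LIPT dictionary.
WORDS = ["compression", "algorithm", "transform", "dictionary", "lossless"]
LETTERS = string.ascii_lowercase

def encode_word(word: str) -> str:
    """Replace a dictionary word with '*' plus a letter-coded offset,
    so the transformed text becomes highly repetitive and predictable."""
    if word not in WORDS:
        return word                      # unknown words pass through
    offset = WORDS.index(word)
    return "*" + LETTERS[offset]         # real LIPT uses up to three letters

print(" ".join(encode_word(w) for w in "lossless compression transform".split()))
# -> *e *a *c
```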

3. PROCESS

To evaluate these methods, they were applied to the Calgary Corpus, a collection of text files that was originally used by Bell, Witten and Cleary in 1989 to evaluate the practical performance of various text compression schemes. The methods were also applied to three html files in order to supply a more “modern” facet to the test corpus.

3.1 Test Corpus

In the Calgary Corpus, nine different types of text are represented, and to confirm that the performance of schemes is consistent for any given type, many of the types have more than one representative. Normal English, both fiction and non-fiction, is represented by two books and six papers (labelled book1, book2, paper1, paper2, paper3, paper4, paper5, paper6). More unusual styles of English writing are found in a bibliography (bib) and a batch of unedited news articles (news). Three computer programs represent artificial languages (progc, progl, progp), and a transcript of a terminal session (trans) is included to indicate the increase in speed that could be achieved by applying compression to a slow terminal line. All of the above files use ASCII encoding. Some non-ASCII files are also included: two files of executable code (obj1, obj2), some geophysical data (geo), and a bit-map black and white picture (pic). The file geo is particularly difficult to compress because it contains a wide range of data values, while the file pic is highly compressible because of large amounts of white space in the picture, represented by long runs of zeros. [Witten and Bell, 1990]

3.2 Additional Test Files

The additional html files were chosen to be representative of “average” web traffic. One is the front page of an American university (http://www.psu.edu), another is the front page of a popular Internet auction site (http://www.ebay.com) and the third is the main page of a popular multimedia web content company (http://www.real.com). Each contained different types of web content and page structures.


4. RESULTS

Table 1 shows the file names, their original sizes and their sizes after being processed by our transforms and by compression algorithms.

Table 1. Compression Results for Untransformed Corpus (all sizes in bytes)

Calgary Corpus
Filename  Original Size  *-encoded  LIPT-encoded  PK-ZIP 2.50  bzip (BWT)  Gzip (LZ77)  Arithmetic Coding  DMC     Huffman Coding  PPM (No Training)
bib        111261   116385   101522    34961    27467    34900    40170    30535    72762    25898
book1      768771   779421   681210   311295   232598   312281   246687   238026   438375   221304
book2      610856   621779   512862   205538   157443   206158   195060   167229   368301   149917
geo        102400    76716    78309    68536    56921   102400    72481    61458    72905    60580
news       377109   386662   350114   144102   118600   144400   150866   130717   246395   110998
obj1        21504    16360    16232    10290    10787    10323    16149    11076    16377    10022
obj2       246814   216798   217291    80948    76441    81631   193703    85291   194505    73374
paper1      53161    54917    45731    18496    16558    18543    20433    18141    33338    15480
paper2      82199    83752    69393    29516    25041    29667    27567    26581    47616    23787
paper3      46526    47328    37655    18027    15837    18074    17511    17089    27276    15015
paper4      13286    13498    10979     5499     5188     5534     5450     5460     7861     4806
paper5      11954    12242    10418     4959     4837     4995     5335     5088     7432     4458
paper6      38105    39372    34500    13271    12292    13213    15068    13412    24024    11488
pic        513216    66937    66948    52531    49759    56442    78010    52394   106757    51016
progc       39611    41057    38605    13317    12544    13261    15405    13637    25915    11700
progl       71646    71863    67133    16098    15579    16164    21319    17796    42983    15023
progp       49379    49995    48434    11171    10710    11186    14972    12318    30215    10466
trans       93695    92525    84753    18996    17899    18862    35100    22453    65219    17182

Additional html files
html1       29830    30819    29592     5689     5368     5788    19305     5955    19357     5021
html2       46893    47721    46205     9846     9607     9961    31753    10887    31882     9084
html3       36323    36921    35901     6317     6304     6460    24026     7323    24118     6064

Table 2 shows the file sizes after the application of the star encoding transform in conjunction with the compression algorithms.

Table 2. Compression Results for *-encoded Corpus (all sizes in bytes)

Calgary Corpus
Filename    Original Size  *-encoded  PK-ZIP 2.50  bzip (BWT)  Gzip (LZ77)  Arithmetic Coding  DMC     Huffman Coding  PPM (No Training)
bib.sta      111261   116385    34084    26825    34051    39821    29160    66487    26811
book1.sta    768771   779421   282605   235559   282778   234839   226290   332042   220750
book2.sta    610856   621779   191004   158070   191494   190827   163358   287140   156474
geo.sta      102400    76716    62250    57753    62294    63932    65503    64327    59957
news.sta     377109   386662   137289   118678   138012   149796   127577   224312   112494
obj1.sta      21504    16360     9570    10190     9591    14005    10380    14203     9399
obj2.sta     246814   216798    77229    73867    77961   179034    81493   179461    70552
paper1.sta    53161    54917    17175    15901    17165    18649    16783    27146    15415
paper2.sta    82199    83752    26758    24211    26751    24541    24129    35564    23176
paper3.sta    46526    47328    16073    14732    16068    14583    14913    20852    14085
paper4.sta    13286    13498     4815     4701     4854     4522     4671     5896     4413
paper5.sta    11954    12242     4507     4500     4531     4775     4527     6079     4222
paper6.sta    38105    39372    12361    11782    12353    14182    12246    19864    11332
pic.sta      513216    66937    38741    38213    38409    43190    38954    43423    38052
progc.sta     39611    41057    12893    12437    12850    15730    13296    24615    11670
progl.sta     71646    71863    15546    15253    15599    22141    17228    39509    15309
progp.sta     49379    49995    11096    10829    11112    16132    12331    29620    10632
trans.sta     93695    92525    18336    17384    18281    34526    21739    59835    18067

Additional html files
html1.sta     29830    30819     5429     5176     5539    19213     5697    19244     4859
html2.sta     46893    47721     9763     9562     9886    31702    10793    31755     9091
html3.sta     36323    36921     6084     6116     6191    23613     7055    23691     5893


Table 3 shows the file sizes after the application of the LIPT transform in conjunction with the compression algorithms.

Table 3. Compression Results for LIPT-encoded Corpus (all sizes in bytes)

Calgary Corpus
Filename    Original Size  LIPT-encoded  PK-ZIP 2.50  bzip (BWT)  Gzip (LZ77)  Arithmetic Coding  DMC     Huffman Coding  PPM (No Training)
bib.lpt      111261   101522    33424    26901    33948    67592    29439    67746    25437
book1.lpt    768771   681210   285332   222398   291973   390421   225414   393507   214509
book2.lpt    610856   512862   189839   151861   192939   321284   162481   323539   145636
geo.lpt      102400    78309    62600    57788    62566    65311    64233    65711    59988
news.lpt     377109   350114   137688   115586   139511   234409   128416   235229   108743
obj1.lpt      21504    16232     9622    10183     9644    14118    10403    14328     9442
obj2.lpt     246814   217291    77410    73820    78135   180224    82347   180896    70588
paper1.lpt    53161    45731    17104    15451    17228    29670    16771    29734    14658
paper2.lpt    82199    69393    26903    23180    27402    41437    24199    41701    22266
paper3.lpt    46526    37655    16058    14259    16211    23400    14944    23496    13786
paper4.lpt    13286    10979     4880     4553     4917     6832     4737     6833     4275
paper5.lpt    11954    10418     4533     4401     4566     6769     4567     6774     4097
paper6.lpt    38105    34500    12513    11450    12585    22073    12373    22134    10885
pic.lpt      513216    66948    38752    38208    38422    43209    38956    43443    38054
progc.lpt     39611    38605    13002    12082    12982    25605    13409    25676    11397
progl.lpt     71646    67133    15610    14868    15760    41524    17361    41760    14409
progp.lpt     49379    48434    11180    10607    11282    30709    12270    30767    10364
trans.lpt     93695    84753    18326    17173    18371    60048    21837    60155    16474

Additional html files
html1.lpt     29830    29592     5413     5146     5496    19497     5643    19549     4777
html2.lpt     46893    46205     9786     9552     9893    31890    10800    32041     8913
html3.lpt     36323    35901     6104     6108     6234    11412     7075    24163     5787

The following charts display the compression ratios for each file, grouped roughly by content type. It is interesting to note that the transforms generally, but not always, provide better compression.

Figure 4. Compression Ratios for bib, book1, book2, and news


Figure 5. Compression Ratios for geo, obj1, obj2

Figure 6. Compression Ratios for pic

Figure 7. Compression Ratios for paper1, paper2, paper3, paper4, paper5, paper6


Figure 8. Compression Ratios for progc, progl, progp, trans

Figure 9. Compression Ratios for html1, html2, html3

It is interesting to note that the transforms typically do not result in increased performance for arithmetic or Huffman coding. In fact, LIPT actually decreases the compression ratio for arithmetic coding by almost a third for the English language text files (bib, book1, book2, news, paper1, paper2, paper3, paper4, paper5, paper6, progc, progl, progp, trans).

*-encoding caused a decrease in compression for bib, book2, news, progl, progp and trans with PPM encoding, and for book1, book2 and progp with BWT encoding. Other than those, the transforms typically offer some increase. *-encoding offered improvements of 11% to nearly 15% of original file size for the books and papers when coupled with Huffman coding and LIPT offered improvements of up to 8% in combination with Huffman coding.

Overall, PPM with LIPT produced the best compression ratios of the English language text files and was nearly as good as any other method on the other files.
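The ratios plotted in Figures 4 to 9 follow directly from the tables; for instance, a quick check of the book1 results quoted in Tables 1 and 3:

```python
def ratio(original: int, compressed: int) -> float:
    return original / compressed

# book1 with PPM: untransformed vs. LIPT-transformed (Tables 1 and 3).
print(round(ratio(768771, 221304), 2))   # 3.47  (PPM alone)
print(round(ratio(768771, 214509), 2))   # 3.58  (LIPT followed by PPM)
```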


5. CONCLUSION

This paper has shown that it is possible to make textual data more compressible, even if only to a small degree, by applying an intermediate reversible transform to the data prior to compression. Although not specifically measured for this paper, the time impact of applying these transforms to the data was not observed to be significant.

Transform encoding offered improvements of up to 15% for some standard compression methods and, depending on the methods used and the type of text contained in the input file, can offer compression ratios of over 13:1. In general, it has a beneficial effect on the compressibility of data compared with the standard compression algorithms alone.

It is recommended that further investigation be made into the applicability of this process to html files in an effort to decrease download times for web information and to conserve Internet bandwidth.

6. REFERENCES

Akman, K. Ibrahim. “A New Text Compression Technique Based on Language Structure.” Journal of Information Science. 21, no. 2 (February 1995): 87-95.

Awan, F. S. and A. Mukherjee, LIPT: A Lossless Text Transform to Improve Compression. [paper on-line] School of Electrical Engineering and Computer Science, University of Central Florida, available from http://vlsi.cs.ucf.edu/listpub.html; Internet; accessed 9 July 2001.

Bell, T. C., J. G. Cleary and I. H. Witten. Text Compression. Englewood Cliffs: Prentice-Hall, 1990.

Bell, Timothy, Ian H. Witten and John G. Cleary. “Modelling for Text Compression.” ACM Computing Surveys. 21, no. 4 (December 1989): 557-591.

Burrows, M. and D. J. Wheeler. “A Block-sorting Lossless Data Compression Algorithm.” SRC Research Report 124, Digital Systems Research Center, Palo Alto, (May 1994) available from http://citeseer.nj.nec.com/76182.html; Internet; accessed 15 July 2001.

Cleary, J. G. and I.H. Witten. “Data Compression Using Adaptive Coding and Partial String Matching.” IEEE Transactions on Communications. 32, no 4 (April 1984): 396-402.

Crochemore, Maxime and Thierry Lecroq. “Pattern-Matching and Text Compression Algorithms.” ACM Computing Surveys. 28, no. 1 (March 1996): 39-41.

Fenwick, P. Symbol Ranking Text Compression with Shannon Recodings. [paper on-line] Department of Computer Science, The University of Auckland, 6 June 1996 available from ftp://ftp.cs.auckland.ac.nz/out/peter-f/TechRep132; Internet; accessed 6 June 2001.

Franceschini, R., H. Kruse, N. Zhang, R. Iqbal and A. Mukherjee. Lossless, Reversible Transformations that Improve Text Compression Ratios. [paper on-line] School of Electrical Engineering and Computer Science, University of Central Florida, available from http://vlsi.cs.ucf.edu/listpub.html; Internet; accessed 9 July 2001.

Goebel, G.V., Data Compression. available from http://vectorsite.tripod.com/ttdcmp0.html; Internet; accessed 14 May, 2001.

Huffman, D. A. “A Method for the Construction of Minimum-Redundancy Codes.” Proceedings of the IRE. 40, no. 9 (September 1952): 1098-1101.

Moffat, Alistair, Radford M. Neal and Ian H. Witten. “Arithmetic Coding Revisited.” ACM Transactions on Information Systems. 16, no. 3 (July 1998): 256-294.


Motgi, N. and A. Mukherjee, Network Conscious Text Compression System (NCTCSys). [paper on-line] School of Electrical Engineering and Computer Science, University of Central Florida, available from http://vlsi.cs.ucf.edu/listpub.html; Internet; accessed 9 July 2001.

Nelson, M. and J. L. Gailly. The Data Compression Book 2nd Edition. New York: M&T Books, 1996.

Nelson, Mark. “Data Compression with the Burrows-Wheeler Transform.” Dr. Dobb’s Journal. (September 1996) available from http://www.dogma.net/markn/articles/bwt/bwt.htm; Internet; accessed 18 June 2001

Salomon, D. Data Compression: The Complete Reference 2nd Edition. New York: Springer-Verlag 2000.

Sayood, K. Introduction to Data Compression 2nd Edition. San Diego: Academic Press, 2000.

Shannon, C. E. “Prediction and Entropy of Printed English.” Bell System Technical Journal. 30 (January 1951): 50-64.

Stork, Christian H., Vivek Haldar and Michael Franz. Generic Adaptive Syntax-Directed Compression for Mobile Code. [paper on-line] Department of Information and Computer Science, University of California, Irvine, available from http://www.ics.uci.edu/~franz/pubs-pdf/ICS-TR-00-42.pdf; Internet; accessed 14 July 2001.

Wayner, P. Compression Algorithms for Real Programmers. San Diego: Academic Press, 2000.

Witten, I. and T. Bell, README. included with the Calgary Corpus (May 1990) available from ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus; Internet; accessed 25 June 2001.

Ziv, J. and A. Lempel. “A Universal Algorithm for Sequential Data Compression.” IEEE Transactions on Information Theory. IT-23, no. 3 (May 1977): 337-343.