2. Text Compression

Lecture notes (week 2)

2

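Why compression is needed

• Reasons
– Hardware progress cannot keep up with the growth in the amount of data required (communication, storage): Parkinson's Law
– Internet home pages
– New applications: multimedia, genome data, digital libraries, e-commerce, intranets
– Compressed data can also be processed faster
  • hard disk access
  • communication speed
• Long-standing examples
– Morse code, Braille, stenography keyboards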

3

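Recent trends

• Trends
– PC clusters and RAID have become commonplace
– Main-memory databases
– Faster communication (Internet, internal networks)
– "Network is computing"
• However
– Multimedia and other large data still need compression
– Compressing large inverted files has become relatively less important, but is still needed in some cases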

4

Types

• Text compression: the original must be restored exactly
• Multimedia: small changes or noise are tolerated

5

History

• 1950s: Huffman coding
• 1970s: Ziv-Lempel (Lempel-Ziv-Welch, used in GIF), arithmetic coding
• English text
– Huffman (5 bits/character)
– Adaptive
  • Ziv-Lempel (4 bits/character): 1970s
  • Arithmetic coding (2 bits/character)
• PPM: Prediction by Partial Matching
– Early 1980s
– Slow and requires a large amount of memory
– No more effective method has appeared since; later methods only reduce time or memory at a small cost in compression
– 0.5 to 1 Mbytes; below 0.1 Mbytes, Ziv-Lempel is effective
• Compression of English text is thought to be limited to about 1 bit per character; going further is expected to require semantic relations or other external knowledge
– Use of grammar, restoring spaces

6

Lecture contents

• Models
• Adaptive models
• Coding
• Symbolwise models
• Dictionary models
• Synchronization
• Performance comparison

7

Broad classification

• Methods
– Symbol-wise methods
– Dictionary methods
• Compression
– Models
  • static, adaptive
– Coding

8

Symbol-wise methods

• Estimating the probabilities of symbols
• Statistical methods
– Huffman coding or arithmetic coding
– Modeling: estimating probabilities
– Coding: converting the probabilities into bitstreams

9

Dictionary methods

• Code references to entries in the dictionary
• Several symbols as one output codeword
• Group symbols: dictionary
• Ziv-Lempel coding: by referencing (pointing to) previous occurrences of strings; adaptive
• Hybrid schemes: compression is not as good as with symbol-wise schemes, but they are faster

10

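Models

• Prediction
- To predict symbols, which amounts to providing a probability distribution for the next symbol to be coded
- Role of the model: used in both coding and decoding

• Information content of a symbol s:

$I(s) = -\log_2 \Pr[s]$ (bits)

• Entropy of the probability distribution (Claude Shannon):

$H = \sum_s \Pr[s] \cdot I(s) = -\sum_s \Pr[s] \cdot \log_2 \Pr[s]$ (a lower bound on compression)

• As the entropy approaches 0, the potential for compression is greatest, yet Huffman coding performs poorly. Why?
• Zero probability: when Pr[s] = 0 the information content I(s) is unbounded, so the symbol cannot be represented by a code.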

11

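Pr[]

• A symbol with probability 1 needs no transmission
• A symbol with probability 0 cannot be coded
• If 'u' has probability 2%, it needs 5.6 bits
• If 'u' follows 'q' with probability 95%, it needs 0.074 bits
• A wrong prediction costs additional bits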
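The information content and entropy formulas can be checked numerically. The following is a minimal sketch; the function names `information_content` and `entropy` are illustrative and do not come from the lecture.

```python
import math


def information_content(probability: float) -> float:
    """I(s) = -log2 Pr[s], in bits. Undefined for probability 0."""
    return -math.log2(probability)


def entropy(distribution: dict) -> float:
    """H = sum over s of Pr[s] * I(s); zero-probability symbols are skipped."""
    return sum(p * information_content(p) for p in distribution.values() if p > 0)


# The two figures quoted on the slide.
bits_for_u = information_content(0.02)          # 'u' with probability 2%
bits_for_u_after_q = information_content(0.95)  # 'u' after 'q' with probability 95%

# A skewed two-symbol source has entropy far below 1 bit per symbol.
skewed_entropy = entropy({"s1": 0.99, "s2": 0.01})
```

Rounded, `bits_for_u` is about 5.6 and `bits_for_u_after_q` about 0.074, matching the slide.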

12

Representing a model

• Finite-context model of order m
- Predicts using the preceding m symbols
• Finite-state model
- [Figure 2.2]
• The decoder works with an identical probability distribution
- Synchronization
- On error, synchronization would be lost
• Formal languages such as C and Java
- Grammars, …

13

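Estimation of probabilities in a model

• Static modeling
- Always uses the same model, whatever the content of the text
- English documents: some documents are rich in one character, others in another (compare Morse code)
- What if the pattern differs even within a single document?
• Semi-static (semi-adaptive) modeling
- The encoder builds a new model for each file and transmits it
- The cost of transmitting the model in advance can be serious when the model is complex
• Adaptive modeling
- Starts from a poor model and changes it according to the text transmitted so far
- The probability distribution changes each time a new symbol is encountered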

14

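Adaptive models

• Zero-order model: character by character
• Zero-frequency problem
- Arises when a character (e.g. 'z') has not occurred at all so far
- Example: 82 of the 128 ASCII characters have occurred and 46 have not, with 768,078 characters seen so far
  - 1/(46 × (768,078 + 1)) → 25.07 bits
  - 1/(768,078 + 128) → 19.6 bits
- Unimportant for large documents, but it matters for small documents, for documents using a wide variety of characters, and when the context changes
• Higher-order models
- Zero probabilities are set aside for now
- First-order model: 'h' occurs 37,526 times and is followed by 't' 1,139 times; 1,139/37,526 ≈ 3.0% → 5.05 bits (worse than the zero-order model; why?)
- Second-order model: 't' after 'gh' (64%, 0.636 bits)
• Many variations are possible, as long as the encoder and the decoder use the same model (synchronization)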
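The two zero-frequency estimates can be reproduced with a few lines of arithmetic. This is a sketch under an assumed reading of the two formulas: the first reserves one extra count for "some novel character" and splits it evenly over the 46 unseen characters, the second starts every one of the 128 characters with a count of 1. The variable names are illustrative.

```python
import math

seen_total = 768_078   # characters coded so far
alphabet_size = 128    # ASCII
unseen = 46            # characters that have not occurred yet

# Assumed reading 1: one extra "novel character" count, shared by the unseen characters.
p_shared_escape = 1 / (unseen * (seen_total + 1))

# Assumed reading 2: every character starts with a count of 1.
p_initial_count = 1 / (seen_total + alphabet_size)

bits_shared_escape = -math.log2(p_shared_escape)
bits_initial_count = -math.log2(p_initial_count)

# First-order example from the slide: 't' after 'h'.
bits_t_after_h = -math.log2(1_139 / 37_526)
```

Either way a novel character is expensive, which is why the choice matters mostly for short or heterogeneous texts.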

15

Adaptive modeling

• Advantages
  • Robust, reliable, flexible
• Disadvantages
  • Random access is impossible
  • Fragile on communication errors
• Good for general compression utilities but not good for full-text retrieval

16

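Coding

• Role of coding
- Decides how each symbol is represented, based on the probability distribution supplied by the model
• Considerations
- Code length
  - short codewords for likely symbols
  - long codewords for rare symbols
  - the probability distribution fixes the minimum average length; the coder should get close to it
- Speed
  - when speed matters, some compression is sacrificed
• Symbolwise schemes depend on the coder, unlike dictionary methods
- Huffman coding: fast
- Arithmetic coding: compression close to the theoretical limit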

17

Huffman Coding

• Encoding and decoding are fast when a static model is used
• Adaptive Huffman coding
- needs a lot of memory or time
• Useful in full-text retrieval applications
- random access is easy

18

Examples

• Code table

Symbol  Codeword  Probability
a       0000      0.05
b       0001      0.05
c       001       0.1
d       01        0.2
e       10        0.3
f       110       0.2
g       111       0.1

• "eefggfed" is coded as 10101101111111101001
• Prefix(-free) code
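Because no codeword is a prefix of another, the bit string can be decoded left to right without separators. A minimal sketch using the code table from the example; the helper names `encode` and `decode` are illustrative.

```python
# Codewords from the example table.
CODE = {"a": "0000", "b": "0001", "c": "001", "d": "01",
        "e": "10", "f": "110", "g": "111"}


def encode(text: str, code: dict = CODE) -> str:
    """Concatenate the codeword of every symbol."""
    return "".join(code[ch] for ch in text)


def decode(bits: str, code: dict = CODE) -> str:
    """Read bits until they match a codeword; the prefix property makes this unambiguous."""
    inverse = {codeword: symbol for symbol, codeword in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


encoded = encode("eefggfed")
```

Decoding `encoded` with `decode` gives back "eefggfed".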

19

Huffman coding: Algorithm

• Explanation of Fig. 2.6
• Fast for both encoding and decoding
• Adaptive Huffman coding exists, but arithmetic coding is the better choice in that setting
– random access is ultimately impossible
– no advantage in memory use or speed
• Gives good results when combined with a word-based approach
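The usual Huffman construction repeatedly merges the two least frequent groups. The sketch below is not the procedure of Fig. 2.6 itself, only a compact variant that tracks code lengths instead of building an explicit tree; `huffman_code_lengths` is a hypothetical helper, and integer weights (probabilities × 100) are used to keep the arithmetic exact.

```python
import heapq
from itertools import count


def huffman_code_lengths(weights: dict) -> dict:
    """Return a code length per symbol by merging the two lightest groups repeatedly."""
    if len(weights) == 1:
        return {next(iter(weights)): 1}
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), [sym]) for sym, w in weights.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in weights}
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        # Every symbol under the new internal node moves one level deeper.
        for sym in group1 + group2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), group1 + group2))
    return lengths


weights = {"a": 5, "b": 5, "c": 10, "d": 20, "e": 30, "f": 20, "g": 10}
lengths = huffman_code_lengths(weights)
weighted_length = sum(weights[s] * lengths[s] for s in weights)
```

Ties between equal weights can be broken in different ways, so individual lengths may differ from a given table, but the weighted total (here 260, i.e. 2.6 bits per symbol) is the same for every Huffman code of this distribution.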

20

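Canonical Huffman Coding I

• A static zero-order word-level canonical Huffman code: Table 2.2
• Uses codewords of the same length as the Huffman code
- Codewords are stored longest first
- Words that occur with the same frequency are in alphabetical order
- Encoding is easy: it needs only the code length, the position relative to the first code of that length, and that first code
- Example: in Table 2.2, 'said' is the 10th of the 7-bit codes and the first 7-bit code is '1010100', so its code is '1010100' + '1001' = '1011101'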

21

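Canonical Huffman Coding II

• Decoding: store the symbols in codeword order, plus the first code of each code length
• Example: for the input 1100000101…, compare with the first 7-bit code ('1010100'), the first 6-bit code ('110001'), …; the codeword is the 7-bit code 12 positions after the first one ('with')
• No decoding tree is used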

22

Canonical Huffman Coding III

• The code is unique once the words and their probabilities are fixed
• See Table 2.3
• A canonical Huffman code may not be one that the Huffman algorithm itself would produce: it is any prefix-free assignment of codewords where the length of each code is equal to the depth of that symbol in a Huffman tree
• The algorithm as Huffman described it therefore has to change: it should compute code lengths
– For n symbols there are $2^{n-1}$ such codes
– One of them is the canonical Huffman code

23

Canonical Huffman code IV

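• Advantages
– No tree has to be built, which saves memory
– No tree has to be traversed, which saves time
• The code lengths are known first; code values are then assigned by computing positions (explanation of the method)
– Start with the longest codes, add 1 for each further code, and truncate when moving to a shorter length
– First code of a length = the last code of the next longer length (its first code + its number of codes − 1), truncated to the new length, plus 1
– Example: four 5-bit codes, one 3-bit code, three 2-bit codes → 00000, 00001, 00010, 00011, 001, 01, 10, 11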
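The assignment rule can be sketched as follows, assuming the code lengths come from a complete Huffman code. The function `canonical_codes` and the symbol names `w1` to `w8` are hypothetical; the first code of each length is derived from the next longer length by adding that length's code count and dropping one bit.

```python
def canonical_codes(lengths: dict) -> dict:
    """Assign canonical codewords, longest codes first, from code lengths alone."""
    max_len = max(lengths.values())
    count = [0] * (max_len + 2)
    for length in lengths.values():
        count[length] += 1
    # first[l]: numeric value of the first codeword of length l.
    first = [0] * (max_len + 2)
    for length in range(max_len - 1, 0, -1):
        first[length] = (first[length + 1] + count[length + 1]) // 2
    next_code = list(first)
    codes = {}
    # Longest codes first; ties broken by symbol order.
    for symbol in sorted(lengths, key=lambda s: (-lengths[s], s)):
        length = lengths[symbol]
        codes[symbol] = format(next_code[length], f"0{length}b")
        next_code[length] += 1
    return codes


# Four 5-bit codes, one 3-bit code and three 2-bit codes, as in the example.
example = canonical_codes({"w1": 5, "w2": 5, "w3": 5, "w4": 5,
                           "w5": 3, "w6": 2, "w7": 2, "w8": 2})

# The 'said' example: 10th code among the 7-bit codes, whose first code is 1010100.
said = format(0b1010100 + 9, "07b")
```

Only the first code per length and the symbol order need to be stored; every codeword is the first code of its length plus an offset.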

24

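Algorithm

• Building the tree naively takes 24n bytes
– each node holds a value and 2 pointers
– internal nodes + leaf nodes: 2n
• An 8n-byte algorithm
– uses a heap
– an array of 2n integers
– the algorithm is explained by writing it out in class
• Computing the code lengths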

25

Arithmetic Coding

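• On average it is impossible to compress below the entropy
• Achieves high compression with complex models
- codes at lengths close to the entropy
• Can represent a symbol in less than 1 bit
- especially advantageous when one symbol occurs with high probability
• Needs little memory because no tree is stored
• Slower than Huffman coding in static or semi-static applications
• Random access is difficult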

26

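Huffman Code and Arithmetic Code

• Model
– Huffman coding: suited to static models
– Arithmetic coding: suited to adaptive models
• Bits per symbol
– Huffman coding: even a very probable symbol cannot be coded in less than one bit; the remedy is blocking, which is hard to implement
– Arithmetic coding: a highly probable symbol can be represented in few bits
• Memory
– Huffman coding: more memory, because the decoding tree is stored
– Arithmetic coding: less memory, because no tree is stored
• Speed
– Huffman coding: fast, with precomputed probabilities and predetermined codewords
– Arithmetic coding: slow, because probabilities and ranges are computed on the fly
• Random access
– Huffman coding: possible
– Arithmetic coding: difficult
• Use in full-text retrieval
– Huffman coding: text compression
– Arithmetic coding: image compression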

27

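Practical example

• Two symbols occurring with probabilities 0.99 and 0.01
– Arithmetic coding: 0.015 bit (for the symbol with probability 0.99)
– Huffman coding: the inefficiency per symbol is bounded by
  $\Pr[s_1] + \log_2(2 \log_2 e / e) \approx \Pr[s_1] + 0.086$ (where $s_1$ is the most frequent symbol): 1.076 bits
• Entropy of English text: 5 bits per character (zero-order, character level)
– Probability of the space character: 0.18 → 0.18 + 0.086 = 0.266
– 0.266 / 5 bits → 5.3% inefficiency
• Images: mostly two symbols → arithmetic coding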
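The constant 0.086 and the two worked figures can be recomputed directly. A small sketch; the function name `huffman_inefficiency_bound` is illustrative.

```python
import math


def huffman_inefficiency_bound(p_most_frequent: float) -> float:
    """Upper bound, in bits per symbol, on Huffman's excess over the entropy."""
    return p_most_frequent + math.log2(2 * math.log2(math.e) / math.e)


skewed_bound = huffman_inefficiency_bound(0.99)   # two-symbol source
english_bound = huffman_inefficiency_bound(0.18)  # space is the most frequent character
english_relative = english_bound / 5              # relative to 5 bits per character
```

For the skewed source the bound exceeds one bit per symbol, while for English text it stays near 5% of the zero-order entropy.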

28

Transmission of output

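• With low = 0.6334 and high = 0.6667
– the common leading digit '6' is output, and the range becomes 0.334 to 0.667
• Working with 32-bit precision does not reduce compression significantly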

29

Arithmetic Coding (Static Model)

Compress "bbaa" over the alphabet { a, b, EOF }, with Pr[a] = 0.4, Pr[b] = 0.5, Pr[EOF] = 0.1.

[Figure: the interval from 0.0 to 1.0 is split at 0.4 and 0.9 into a, b and EOF; at each step the chosen subinterval is split again in the same proportions]

Input   low    high   Output
b       0.4    0.9
b       0.6    0.85
a       0.6    0.7
a       0.6    0.64   6 (common prefix)
EOF     0.36   0.4    36 (range shown after the prefix 6 has been removed)
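The interval narrowing for "bbaa" followed by EOF can be traced in code. This is a minimal sketch with exact fractions and no incremental output or fixed precision, so it illustrates the principle rather than a practical coder; `MODEL`, `narrow` and `decode` are illustrative names.

```python
from fractions import Fraction

# Cumulative ranges for Pr[a] = 0.4, Pr[b] = 0.5, Pr[EOF] = 0.1.
MODEL = {
    "a": (Fraction(0), Fraction(4, 10)),
    "b": (Fraction(4, 10), Fraction(9, 10)),
    "EOF": (Fraction(9, 10), Fraction(1)),
}


def narrow(symbols, model=MODEL):
    """Return the final [low, high) interval after coding all symbols."""
    low, high = Fraction(0), Fraction(1)
    for symbol in symbols:
        width = high - low
        range_low, range_high = model[symbol]
        low, high = low + width * range_low, low + width * range_high
    return low, high


def decode(value, model=MODEL):
    """Recover symbols from any value inside the final interval, stopping at EOF."""
    out = []
    low, high = Fraction(0), Fraction(1)
    while True:
        width = high - low
        for symbol, (range_low, range_high) in model.items():
            sym_low, sym_high = low + width * range_low, low + width * range_high
            if sym_low <= value < sym_high:
                break
        if symbol == "EOF":
            return out
        out.append(symbol)
        low, high = sym_low, sym_high


final_low, final_high = narrow(["b", "b", "a", "a", "EOF"])
```

The final interval runs from 0.636 to 0.64, so the digits 636 identify the message, in line with the outputs 6 and 36 of the example.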

30

Decoding(Static Model)

[Figure: the same interval subdivisions as in encoding]

• Input: 6, then 36
• Output: b, b, a, a

31

Arithmetic Coding (Adaptive Model)

Compress "bbaa" over the alphabet { a, b, EOF }. The initial probabilities are Pr[a] = 0.333, Pr[b] = 0.333, Pr[EOF] = 0.333.

[Figure: interval subdivisions at each step, with boundaries 0.0 / 0.333 / 0.666 / 1.0; 0.333 / 0.4165 / 0.5832 / 0.666; 0.4165 / 0.4498 / 0.5498 / 0.5832; 0.165 / 0.2959 / 0.4425 / 0.498; 0.165 / 0.2211 / 0.2772 / 0.2959]

Input   Probabilities after the input
b       Pr[a] = 1/4, Pr[b] = 2/4, Pr[EOF] = 1/4
b       Pr[a] = 1/5, Pr[b] = 3/5, Pr[EOF] = 1/5
a       Pr[a] = 2/6, Pr[b] = 3/6, Pr[EOF] = 1/6
a       Pr[a] = 3/7, Pr[b] = 3/7, Pr[EOF] = 1/7
EOF

Output: 4, then 28
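The probability updates in the adaptive example follow from simple counting: every symbol starts with a count of 1 and the count of each coded symbol is incremented. A minimal sketch of the model side only; the function name `adaptive_probabilities` is illustrative.

```python
from fractions import Fraction


def adaptive_probabilities(message, alphabet=("a", "b", "EOF")):
    """Return the model's probabilities after each input symbol."""
    counts = {symbol: 1 for symbol in alphabet}  # initial count of 1 avoids zero frequencies
    history = []
    for symbol in message:
        counts[symbol] += 1
        total = sum(counts.values())
        history.append({s: Fraction(c, total) for s, c in counts.items()})
    return history


history = adaptive_probabilities("bbaa")
```

The decoder applies the same update after decoding each symbol, which is what keeps the two models synchronized.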

32

Decoding(Adaptive Model)

[Figure: the same interval subdivisions as in encoding]

• Input: 4, then 28
• Output: b, b, a, a, EOF

33

Cumulative Count Calculation

• Explanation of the method
– Heap
– Encoding: 101101 → 101101, 1011, 101, 1
– Explanation of the rule

34

Symbolwise models

Symbolwise model + coder (arithmetic, Huffman)

Three Approaches

- PPM( Prediction by Partial Matching )

- DMC(Dynamic Markov Compression )

- Word-based compression

35

PPM ( Prediction by Partial Matching )

finite-context models of characters

variable-length code

partial matching against the previously coded text

zero-frequency problem

- Escape symbol

- PPMA: escape method A, which gives the escape symbol a count of 1

36

Escape method

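• Escape method A (PPMA): the escape symbol has a count of 1
• Exclusion: symbols that overlap with an earlier context but can no longer occur are excluded
– Example: lie+s (201, 22); when handling ?lie+s, 179 occurrences of lie remain; lie+r occurs 19 times, so 19/202 → 19/180
• Method C: escape probability r/(n+r), with n the total count and r the number of distinct symbols; symbol probability c_i/(n+r). 2.5 bits per character for Hardy's book.
• Method D: r/(2n)
• Method X: with t_1 the number of symbols of frequency 1, (t_1+1)/(n+t_1+1)
• PPMZ, Swiss Army Knife Data Compression (SAKDC): 1991 and 1997 doctoral theses
• Figure 2.24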
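The escape estimates can be compared side by side for one context. A sketch using the formulas listed on this slide; the function name `escape_probability` and the sample counts are made up for illustration.

```python
from fractions import Fraction


def escape_probability(method: str, n: int, r: int, t1: int = 0) -> Fraction:
    """n: total count in the context, r: distinct symbols seen, t1: symbols seen once."""
    if method == "A":
        return Fraction(1, n + 1)             # escape counted once
    if method == "C":
        return Fraction(r, n + r)
    if method == "D":
        return Fraction(r, 2 * n)
    if method == "X":
        return Fraction(t1 + 1, n + t1 + 1)
    raise ValueError(f"unknown escape method: {method}")


# Hypothetical context: 10 symbols seen, 4 distinct, 2 of them seen exactly once.
estimates = {m: escape_probability(m, n=10, r=4, t1=2) for m in "ACDX"}
```

Methods C, D and X give the escape more weight when the context has produced many distinct or rarely repeated symbols, whereas method A ignores that evidence.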

37

Block-sorting compression

• Introduced in 1994
• Transforms the text so that it becomes easier to compress
• Similar to the discrete cosine transformation and Fourier transformation used in image compression
• The input has to be divided into blocks

38

DMC ( Dynamic Markov Compression )

finite state model

adaptive model - Probabilities and the structure of the finite state machine Figure 2.13

avoid zero-frequency problem

Figure 2.14

Cloning - heuristic - the adaptation of the structure of a DMC

39

Word-based Compression

parse a document into “words” and “nonwords”

Textual and non-textual parts are compressed separately - Textual : zero-order model

suitable for large full-text database

Low-frequency words - inefficient - e.g. runs of digits, page numbers

40

Dictionary Models

Principle of replacing substrings in a text with codeword

Adaptive dictionary compression model : LZ77, LZ78

Approaches

- LZ77

- Gzip

- LZ78

- LZW

41

Dictionary Model - LZ77

adaptive dictionary model

characteristic - easy to implement - quick decoding - using small amount of memory

Figure 2.16

Triples

< offset, length of phrase, character >

42

Dictionary Model - LZ77 (continued)

Improvements

- offset : shorter codewords for recent matches

- match length : variable length code

- character : included only when necessary (raw data is transmitted)

Figure 2.17
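Decoding LZ77 triples is a matter of copying from the text already produced, which is why it is quick. A minimal sketch; `lz77_decode` is an illustrative name and the triples below are a made-up example, not the contents of the figures.

```python
def lz77_decode(triples):
    """Each triple is (offset back into the output, match length, next character)."""
    out = []
    for offset, length, char in triples:
        start = len(out) - offset
        # Copy one character at a time so a match may overlap the text being produced.
        for i in range(length):
            out.append(out[start + i])
        out.append(char)
    return "".join(out)


# Hypothetical input.
triples = [(0, 0, "a"), (0, 0, "b"), (2, 1, "a"), (3, 2, "b"), (5, 3, "b"), (1, 10, "a")]
text = lz77_decode(triples)
```

The last triple shows the overlapping case: an offset of 1 with length 10 repeats the previous character ten times.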

43

Dictionary Model - Gzip

based on LZ77

hash table

Tuples

< offset, matched length >

Using Huffman code

- semi-static / canonical Huffman code

- 64K Blocks

- Code table : at the start of each block

44

Dictionary Model - LZ78

adaptive dictionary model

parsed phrase reference

Tuples

- < phrase number, character >

- phrase 0 : empty string

Figure 2.19

Figure 2.18
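The LZ78 parsing into (phrase number, character) pairs can be sketched as below. The helper names are illustrative, and emitting an empty character for a leftover final phrase is an assumption made here to keep the example simple, not something stated in the lecture.

```python
def lz78_encode(text: str):
    """Parse text into (phrase number, character) pairs; phrase 0 is the empty string."""
    phrases = {"": 0}
    output, current = [], ""
    for ch in text:
        if current + ch in phrases:
            current += ch                      # keep extending the longest known phrase
        else:
            output.append((phrases[current], ch))
            phrases[current + ch] = len(phrases)
            current = ""
    if current:                                # assumption: flush a leftover phrase
        output.append((phrases[current], ""))
    return output


def lz78_decode(pairs) -> str:
    """Rebuild the phrase list while decoding, mirroring the encoder."""
    phrases, out = [""], []
    for number, ch in pairs:
        phrase = phrases[number] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


pairs = lz78_encode("abaaba")
```

For "abaaba" the phrases are a, b, aa, ba, so the output is (0, a), (0, b), (1, a), (2, a).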

45

Dictionary Model - LZ78 (continued)

characteristic

- hash table : simple, fast

- encoding : fast

- decoding : slow

- trie : uses a lot of memory

46

Dictionary Model - LZW

variant of LZ78

encode only the phrase number

does not have explicit characters in the output

appending the first character of the next phrase

Figure 2.20

characteristic

- good compression

- easy to implement

47

Synchronization

random access

- variable-length code

- adaptive model

synchronization point

synchronization with adaptive model

- large file -> break into small sections

random access is impossible

48

Creating synchronization point

main text : consist of a number of documents

- extra bits at the start or end of each document indicate its length

- bit offset

- byte offset

- end of document symbol

- length of each document at its beginning

- end of file

49

Self-synchronizing codes

not useful for full-text retrieval

- decoding from the middle of the compressed text : find the synchronizing cycle, then decode

- part of the text is corrupted, or the beginning is missing

motivation

fixed-length code : cannot self-synchronize Table 2.3

Figure 2.22

50

Performance comparisons

consideration

- compression speed

- compression performance

- computing resource

Table 2.4

51

Compression Performance

Calgary corpus

- English text, program source code, bilevel facsimile image

- geological data, program object code

Figure 2.24

Bits per character

52

Compression speed

speed dependency

- method of implementation

- architecture of machine

- compiler

Better compression means a slower program

Ziv-Lempel based methods : decoding is faster than encoding

Table 2.6

53

Other Performance considerations

memory usage

- adaptive model : uses a lot of memory

- Ziv-Lempel << Symbolwise model

Random access

- synchronization point
