UNIT-I
INFORMATION THEORY
1. INTRODUCTION
Communication theory deals with systems for transmitting information from one point to another.
Fig 1: Communication system
2. UNCERTAINTY, INFORMATION and ENTROPY
– Any information source produces an output that is random in nature, so the source output
is modeled as a discrete random variable S, which takes values in the set of symbols
S = { s0, s1, ..., sK-1 }
with probabilities P(S = sk) = pk, where k = 0, 1, ..., K-1.
– This set of probabilities must satisfy the condition
∑(k=0 to K-1) pk = 1
– The symbols emitted by the source during successive signaling intervals are statistically
independent. The source having this property is known as discrete memoryless source.
– The information associated with an event can be viewed in three ways:
o Uncertainty (before the event S = sk occurs)
o Surprise (when the event occurs)
o Information gain (after the event has occurred)
– The amount of information is related to the inverse of the probability of the occurrence.
– Information Gain or self information: The amount of information gained after observing
the event S = sk, which occurs with the probability pk is termed as,
I (sk) =log(1/ pk) = - log pk
– The Units of information I (sk) are determined by the base of the logarithm, which is
usually selected as 2 or e.
When the base is 2 – units are in bits
When the base is e – units are in nats (natural units)
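As a quick numeric sketch of the definition above (the probabilities are chosen purely for illustration):

```python
import math

def self_information(p):
    """Self-information I(s) = log2(1/p) in bits for an event of probability p."""
    return math.log2(1.0 / p)

print(self_information(1.0))   # a certain event: 0 bits
print(self_information(0.5))   # a fair coin flip: 1 bit
print(self_information(0.25))  # 2 bits: rarer events carry more information
```

Note that the information of two independent events adds, which is exactly why the logarithm appears in the definition.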
2.1 Properties of Information:
– Consider the event S = sk, the emission of symbol sk by the source with probability pk. Then:
1. I (sk) =0 , for pk =1
Outcome of an event is known before it occurs, therefore no information is gained.
2. I (sk) ≥ 0 for 0≤ pk ≤ 1
The occurrence of an event S = sk either provides some or no information.
3. I (sk) > I(si) for pk < pi
(i.e.) the less probable an event is, the more information we gain when it occurs.
4. I (sk, si) = I(sk) + I(si) if sk & si are statistically independent.
2.2 ENTROPY :
– It is a measure of average information content per source symbol.
– Denoted by H(S)
H(S) = E[I(sk)] = − ∑(k=0 to K−1) pk log2 pk bits/symbol
– The quantity H(S) is called the entropy of a discrete memoryless source with source letter S.
2.2.1 Properties of Entropy:
The entropy H(S) of such a source is bounded as follows:
0 ≤ H(S) ≤ log2 K
where K is the radix (number of symbols) of the source alphabet.
1. H(S) = 0, if and only if pk = 1 for some k, and pk = 0 otherwise.
This lower bound on entropy corresponds to no uncertainty.
2. H(S) = log2 K, if & only if pk =1/ K , for all k .
This upper bound on entropy corresponds to maximum uncertainty.
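Both bounds can be verified numerically; a minimal sketch (the example distributions and K = 4 are illustrative):

```python
import math

def entropy(probs):
    """Entropy H(S) = sum(pk * log2(1/pk)) in bits/symbol; zero-probability
    terms are skipped, following the convention 0 * log(0) = 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

K = 4
deterministic = [1.0, 0.0, 0.0, 0.0]   # lower bound: no uncertainty
uniform = [1.0 / K] * K                # upper bound: maximum uncertainty

print(entropy(deterministic))  # 0.0
print(entropy(uniform))        # log2(4) = 2.0
```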
2.2.2 Entropy of binary memoryless source:
– Consider a binary source for which symbol 0 occurs with probability p0 & symbol 1 with probability p1= 1- p0.
– Successive symbols emitted by the source are statistically independent (memoryless).
– The entropy of this source is H(S) = − p0 log2 p0 − (1 − p0) log2 (1 − p0) = H(p0), the entropy function of p0.
6. HUFFMAN CODING
– Each symbol of a given alphabet is assigned a sequence of bits according to the symbol probability.
– The Huffman tree is built by a bottom-up approach.
Procedure:
1. Calculate the probability of each symbol in the list.
2. Source symbols are listed in order of decreasing probability.
3. The two source symbols of lowest probability are assigned a 0 and a 1. This step is referred to as a
splitting stage.
4. These two source symbols are combined into a new source symbol with probability equal to
the sum of the two original probabilities and it is placed in the list according to its new
value.
5. Recursively apply steps 3 and 4, until each symbol has become a corresponding code leaf on
a tree.
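The five steps above can be sketched with a priority queue. This is an illustrative implementation, not the notation of any particular textbook; the tie-breaking counter is a choice of this sketch:

```python
import heapq
from itertools import count

def huffman_codes(prob_map):
    """Build Huffman codewords for a {symbol: probability} map, bottom-up."""
    tick = count()  # tie-breaker so heapq never compares tree nodes directly
    heap = [(p, next(tick), sym) for sym, p in prob_map.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # two lowest-probability entries
        p2, _, right = heapq.heappop(heap)  # (the "splitting stage")
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: branch on 0 / 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                               # leaf: record the codeword
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

probs = {"s0": 0.4, "s1": 0.2, "s2": 0.2, "s3": 0.1, "s4": 0.1}
codes = huffman_codes(probs)
avg_len = sum(probs[s] * len(c) for s, c in codes.items())
print(codes, avg_len)  # average codeword length ≈ 2.2 bits/symbol
```

This alphabet is the one used in the problem below; ties can produce different trees, but every Huffman tree for it has the same average length of 2.2 bits/symbol.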
Problem:
The five symbols of the alphabet of a discrete memoryless source and their probabilities are {s0,
s1, s2, s3, s4} and {0.4, 0.2, 0.2, 0.1, 0.1} respectively. Compute the codewords of Huffman Code.
Also compute the entropy of the source.
Solution:
(The Huffman tree construction gives an average codeword length L = 2.2 bits/symbol and a source entropy H(S) = 2.12193 bits/symbol.)
Coding efficiency:
η = H(S) / L = 2.12193 / 2.2 = 0.96
The average codeword length satisfies the following source coding property,
H(S) ≤ L < H(S) + 1
2.12 ≤ 2.2 < 3.12
6.1 Properties of Huffman Coding
Huffman coding uses longer codewords for symbols with smaller probabilities and shorter codewords for symbols that often occur.
The two longest codewords differ only in the last bit.
The codewords are prefix codes and uniquely decodable.
It should satisfy Shannon's first theorem (source coding theorem), H(S) ≤ L < H(S) + 1.
6.2 Extended Huffman Coding:
We can encode a group of symbols together and get better performance.
The extended code should also satisfy Shannon's first theorem, H(S) ≤ L < H(S) + 1.
Problem:
Consider the source with alphabet A = {a1, a2, a3} and the probabilities p(a1) = 0.8, p(a2) = 0.02, p(a3) = 0.18.
Solution:
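The worked solution is not reproduced in the source. As a sketch of what extending the alphabet buys for this example, the following compares the average Huffman codeword length per symbol when coding single symbols versus pairs; the merge-and-sum shortcut computes the average length without building the codewords:

```python
import heapq
from itertools import product

def huffman_avg_length(probs):
    """Average Huffman codeword length via the merge trick: every time two
    subtrees are merged, each symbol below them gains one bit, so the sum of
    all merged-node probabilities equals the expected codeword length."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

p = [0.8, 0.02, 0.18]
single = huffman_avg_length(p)                                 # bits per symbol
pairs = huffman_avg_length([x * y for x, y in product(p, p)])  # bits per pair
print(single, pairs / 2)  # the extended code's rate is closer to H(S)
```

Coding pairs can never do worse than applying the single-symbol code twice, and for a skewed distribution like this one it does noticeably better.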
7. JOINT AND CONDITIONAL ENTROPY
7.1 Joint Entropy:
– The joint entropy H(X,Y) of a pair of discrete random variables (X,Y) with joint probability distribution p(x,y) is defined as
H(X,Y) = − ∑(x) ∑(y) p(x,y) log2 p(x,y)
7.2 Conditional Entropy:
– It is the amount of uncertainty remaining about the channel input X after the channel output Y has been observed.
– The conditional entropy H(X|Y) is defined as
H(X|Y) = − ∑(j=0 to J−1) ∑(k=0 to K−1) p(xj, yk) log2 p(xj | yk)
– The mutual information, considered next, is the average amount of information gained about X from observing the value of Y.
7.3 MUTUAL INFORMATION (M.I):
– The difference H(X) – H(X|Y) represents the uncertainty about the channel input that is
resolved by observing the channel output.
– Therefore the mutual information is termed as,
I(X;Y) = H(X) – H(X|Y)
Similarly, I(Y;X) = H(Y) – H(Y|X)
H(X) - is the entropy of the channel input X
H(X|Y) – is the conditional entropy of the channel input X after observing the channel
output Y
7.3.1 Properties of Mutual information :
Property 1 : The mutual information of a channel is symmetric; (i.e)
I(X;Y) = I(Y;X)
Proof:
I(X;Y) = H(X) − H(X|Y), where
H(X) = ∑(j=0 to J−1) p(xj) log2 [1 / p(xj)]        (1)
Multiplying the summand of equation 1 by ∑(k=0 to K−1) p(yk | xj) = 1 and using the joint probability p(xj, yk) = p(xj) p(yk | xj), we get
H(X) = ∑(j=0 to J−1) ∑(k=0 to K−1) p(xj, yk) log2 [1 / p(xj)]
Now substituting this and H(X|Y) into I(X;Y) = H(X) − H(X|Y), we obtain
I(X;Y) = ∑(j=0 to J−1) ∑(k=0 to K−1) p(xj, yk) log2 [ p(xj | yk) / p(xj) ]        (2)
From Bayes' rule for conditional probabilities,
p(xj | yk) / p(xj) = p(yk | xj) / p(yk)        (3)
Substituting equation 3 in 2, we get
I(X;Y) = ∑(j=0 to J−1) ∑(k=0 to K−1) p(xj, yk) log2 [ p(yk | xj) / p(yk) ] = I(Y;X)
Hence proved.
Property 2: The mutual information is always nonnegative, (i.e) I(X;Y) ≥ 0
Proof:
From the conditional probability p(xj | yk) = p(xj, yk) / p(yk), substituting this in equation 2, we get
I(X;Y) = ∑(j=0 to J−1) ∑(k=0 to K−1) p(xj, yk) log2 [ p(xj, yk) / (p(xj) p(yk)) ]
By applying the fundamental inequality of the logarithm directly, we obtain
I(X;Y) ≥ 0
I(X;Y) ≥ 0 means we cannot lose information, on the average, by observing the output of a channel.
I(X;Y) = 0 means the channel input and output are statistically independent.
Property 3: The mutual information of a channel is related to the joint entropy of the
channel input and channel output by,
I(X;Y) = H(X) + H(Y) – H(X,Y)
where H(X,Y) is the joint entropy,
H(X,Y) = − ∑(j=0 to J−1) ∑(k=0 to K−1) p(xj, yk) log2 p(xj, yk)
7.4 Chain Rule:
– The relationship between joint and conditional entropy is given as,
H(X,Y) = H(X) + H(Y|X)
H(Y,X) = H(Y) + H(X|Y)
Proof:
H(X,Y) = − ∑(x) ∑(y) p(x,y) log2 p(x,y)
Using p(x,y) = p(x) p(y|x),
H(X,Y) = − ∑(x) ∑(y) p(x,y) [ log2 p(x) + log2 p(y|x) ]
= − ∑(x) p(x) log2 p(x) − ∑(x) ∑(y) p(x,y) log2 p(y|x)
= H(X) + H(Y|X)
The second form, H(Y,X) = H(Y) + H(X|Y), follows in the same way from p(x,y) = p(y) p(x|y).
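The chain rule and the mutual-information identities can be checked numerically; a minimal sketch with a made-up 2x2 joint distribution:

```python
import math

# A small joint distribution p(x, y); rows index x, columns index y.
# (These numbers are invented purely for illustration.)
P = [[0.25, 0.25],
     [0.10, 0.40]]

def H(probs):
    """Entropy in bits of a probability list, skipping zero entries."""
    return sum(q * math.log2(1.0 / q) for q in probs if q > 0)

px = [sum(row) for row in P]                                 # marginal p(x)
py = [sum(row[k] for row in P) for k in range(len(P[0]))]    # marginal p(y)
Hxy = H([q for row in P for q in row])                       # joint entropy H(X,Y)
Hy_given_x = sum(sum(row) * H([q / sum(row) for q in row]) for row in P)
I = H(px) + H(py) - Hxy                                      # mutual information

# Chain rule: H(X,Y) = H(X) + H(Y|X); and I(X;Y) = H(Y) - H(Y|X).
print(Hxy, H(px) + Hy_given_x, I)
```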
8. DISCRETE MEMORYLESS CHANNELS:
– A discrete memoryless channel is a statistical model with an input of X and output of Y
which is a noisy version of X (here both are random variables)
– In each time slot, the channel accepts an input symbol X selected from a given alphabet X = {x0, x1, ..., xJ−1} and it emits an output symbol Y from an alphabet Y = {y0, y1, ..., yK−1}.
– The channel is said to be “discrete” when both of the alphabets have finite sizes.
– It is said to be "memoryless" when the current output symbol depends only on the current input symbol and not on any earlier inputs or outputs.
– Also, the input alphabet X and output alphabet Y need not have the same size.
– A discrete memoryless channel is described by arranging the various transition probabilities of the
channel in the form of a matrix as follows:
P = [ p(yk | xj) ], with rows j = 0, 1, ..., J−1 and columns k = 0, 1, ..., K−1
– The J-by-K matrix P is called channel matrix or transition matrix.
– The fundamental property of the channel matrix P is that the sum of the elements along any row
of the matrix is always equal to 1:
∑(k=0 to K−1) p(yk | xj) = 1, for all j
– The joint probability distribution of the random variables X and Y is given by
p(xj, yk) = P( X = xj, Y = yk )
= P(Y = yk | X = xj) p(xj)
= p(yk | xj) p(xj)        (8.1)
– The marginal probability distribution of the output random variable Y is obtained by
averaging out the dependence of p(xj ,yk) on xj as shown by
p(yk) = P(Y = yk)
= ∑(j=0 to J−1) P(Y = yk | X = xj) p(xj)
= ∑(j=0 to J−1) p(yk | xj) p(xj), for k = 0, 1, ..., K−1        (8.2)
– The probabilities p(xj ) for j= 0, 1,…….J-1, are known as the a priori probabilities of the
various input symbols.
– Equation 8.2 states that, given the a priori probabilities p(xj) and the channel matrix
[ p(yk | xj) ], we may calculate the output probabilities p(yk).
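Equation 8.2 is a matrix-vector product in disguise; a small sketch with a hypothetical 2-input, 3-output channel (the probabilities are invented for illustration):

```python
# Equation 8.2 as code: output probabilities from the priors and the
# channel matrix (J = 2 inputs, K = 3 outputs, values chosen arbitrarily).
P_channel = [[0.7, 0.2, 0.1],   # row j holds p(yk | xj); each row sums to 1
             [0.1, 0.3, 0.6]]
priors = [0.5, 0.5]             # a priori probabilities p(xj)

p_y = [sum(priors[j] * P_channel[j][k] for j in range(len(priors)))
       for k in range(len(P_channel[0]))]
print(p_y)  # ≈ [0.4, 0.25, 0.35]
```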
9. CHANNEL CAPACITY :
– Capacity of a channel is defined as the intrinsic ability of the channel to convey information.
– The channel capacity of a discrete memoryless channel is the maximum of the mutual information
I(X;Y) in any single use of the channel, where the maximization is over all possible input
probability distributions { p(xj) } on X.
– Channel capacity is denoted by C:
C = max over { p(xj) } of I(X;Y)
– The channel capacity C is measured in bits per channel use, or bits per transmission.
9.1 BINARY SYMMETRIC CHANNEL:
– It is the special case of the discrete memoryless channel with J=K=2.
– The channel has two input symbols (x0 = 0, x1 = 1) and two output symbols (y0 = 0, y1 =1 )
– The channel is symmetric because the probability of receiving a 1 if a 0 is sent is the same
as the probability of receiving a 0 if a 1 is sent .
– Conditional probability of error is denoted by p.
Fig 3: Transition Probability diagram of Binary symmetric channel
Channel Capacity for Binary Symmetric channel:
– Consider the binary symmetric channel which is described by the transition probability
diagram fig 3.
– This diagram is defined by the conditional probability of error p.
– The entropy H(X) is maximized when the channel input probability p(x0)=p(x1)= 1/2.
– The mutual information I(X;Y) is maximized by the same input distribution, so the capacity can be written as
C = I(X;Y) evaluated at p(x0) = p(x1) = 1/2        (9.1)
– From fig 3, p(y0 | x1) = p(y1 | x0) = p and
p(y0 | x0) = p(y1 | x1) = 1 − p
– Substituting these channel transition probabilities into the expression for I(X;Y)
with J = K = 2, and setting the input probabilities p(x0) = p(x1) = 1/2 in accordance with
equation 9.1, the capacity of the binary symmetric channel is
C = 1 + p log2 p + (1 − p) log2 (1 − p)        (9.2)
– By using the entropy function
H(p0) = − p0 log2 (p0) − (1 − p0) log2 (1 − p0)
equation 9.2 can be reduced to
C = 1 − H(p)
– Thus the channel capacity for the binary symmetric channel is C = 1 − H(p).
– The Channel capacity C varies with the probability of error p as shown in fig 4.
Observations:
When p=0, the channel is noise free. (i.e) the channel capacity C attains its maximum
value of 1 bit per channel use, which is exactly the information in each channel input. At
this value of p, the entropy function H(p) attains its minimum value of zero.
When p=1/2 due to noise, the channel capacity C attains its minimum value of 0,
whereas H(p) attains its maximum value of one. In such a case the channel is said to be
useless.
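These observations can be checked directly from C = 1 − H(p); a minimal sketch:

```python
import math

def bsc_capacity(p):
    """Channel capacity C = 1 - H(p) of a binary symmetric channel
    with transition (error) probability p."""
    if p in (0.0, 1.0):
        return 1.0  # noise-free (or deterministic bit-flip) channel
    Hp = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return 1.0 - Hp

# C falls from 1 bit/use at p = 0 to 0 at p = 1/2 (a useless channel).
for p in (0.0, 0.1, 0.5):
    print(p, bsc_capacity(p))
```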
9.2 CHANNEL CODING THEOREM ( SHANNON’S SECOND THEOREM):
– The design goal of channel coding is to increase the resistance of the communication
systems to channel noise.
– Channel coding consists of mapping the incoming data sequence into a channel input
sequence and inverse mapping the channel output sequence into an output data sequence,
so that the channel noise of the system is minimized.
– Mapping and inverse mapping operations are performed by encoders and decoders.
– The channel encoder and decoder should be designed to optimize the overall reliability of a
communication system.
– Block Codes: The message sequence is divided into sequential blocks, each k bits long.
– Code Rate: Each k-bit block is mapped into an n-bit block by the channel coder, where
n > k. The ratio r = k/n is called the code rate, where r is less than unity.
– The discrete memoryless source has a source alphabet S and entropy H(S) bits/source symbol.
The source emits one symbol every Ts seconds. Hence the average information rate
of the source is H(S)/Ts bits/second.
– The discrete memoryless channel has a channel capacity equal to C bits per use of the channel.
– The channel is capable of being used once every Tc seconds. Hence the channel capacity
per unit time is C/Tc bits/second, which represents the maximum rate of information
transfer over the channel.
– The channel coding theorem for a discrete memoryless channel is stated in two parts:
1. If H(S)/Ts ≤ C/Tc, where C/Tc is called the critical rate, there exists a coding scheme
such that the source output can be transmitted over the channel and reconstructed with an
arbitrarily small probability of error.
2. If H(S)/Ts > C/Tc, it is not possible to transmit the source output over the channel
with an arbitrarily small probability of error.
C = channel capacity, Ts and Tc = signaling intervals of the source and the channel, H(S) = source entropy.
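A quick numeric check of the two parts of the theorem (all rates here are made-up illustrative values):

```python
def reliable_transmission_possible(H_S, Ts, C, Tc):
    """Part 1 of the channel coding theorem: reliable transmission
    requires the source rate H(S)/Ts not to exceed the critical rate C/Tc."""
    return H_S / Ts <= C / Tc

# Source: 2 bits/symbol, one symbol every 1 ms -> 2000 bits/s.
# Channel: 0.5 bit per use, one use every 0.2 ms -> 2500 bits/s.
print(reliable_transmission_possible(2.0, 1e-3, 0.5, 0.2e-3))  # True
# Slowing the channel to one use every 0.4 ms gives 1250 bits/s < 2000 bits/s.
print(reliable_transmission_possible(2.0, 1e-3, 0.5, 0.4e-3))  # False
```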
Drawbacks:
– It does not show us how to construct a good code.
– It does not give a precise result for the probability of symbol error after decoding the channel
output.
11. SHANNON LIMIT:
– Shannon showed that any communications channel such as a telephone line, a radio band, a
fiber-optic cable could be characterized by two factors:
1. bandwidth 2. noise
– Bandwidth is the range of electronic, optical or electromagnetic frequencies that can be used
to transmit a signal;
– Noise is anything that can disturb that signal.
– Given a channel with particular bandwidth and noise characteristics, Shannon showed how
to calculate the maximum rate at which data can be sent without error.
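The rate Shannon showed how to calculate is the Shannon-Hartley capacity C = B log2(1 + S/N); a small sketch (the bandwidth and SNR values are hypothetical):

```python
import math

def shannon_capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley limit C = B * log2(1 + S/N) in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# A telephone-line-like channel: about 3 kHz of bandwidth at 30 dB SNR.
snr = 10 ** (30 / 10)               # 30 dB -> S/N = 1000
print(shannon_capacity(3000, snr))  # roughly 3.0e4 bits/s
```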