
    Video Fundamentals

    1.1 Introduction

Video is an illusion that makes use of a property of the eye. The eye has a peculiar property: an image sensed by the eye persists for about 1/10th of a second, i.e. the eye cannot notice a change of scene within this period. This is called persistence of vision. We can create the illusion of a continuous scene by capturing just more than 16 pictures of the visual in a second and displaying them on a screen in the same time period. This is the basic principle behind video rendering. Each picture captured is called a frame, so a frame rate of more than 16 fps is required; usually we go for 25 fps or more. In the earlier television systems the image is produced by projecting an electron beam over a phosphor screen. The phosphor screen is divided into 525 lines in the case of the NTSC TV system or 625 lines in the case of the PAL system. The electron beam sequentially scans the 525/625 lines from left to right.

    Figure 2.1 Progressive scan

When the electron beam, intensity modulated with the video signal, is incident on the phosphor screen, the screen starts glowing. The intensity of the glow depends on the strength of the video signal. Scanning in this sequential mode is called progressive scan (Figure 2.1).

In the case of progressive scanning there is a finite delay, 0.04 s, between successive frames. The point on the phosphor screen corresponding to an image should glow for that much time, but usually the screen has a poor response, which results in flicker: the image fades and the screen tends to become black. An ideal solution to the flickering problem is to increase the frame rate, but this increases the bandwidth usage, as the amount of data corresponding to those frames increases. Another solution is interlaced scanning. A frame is divided into two fields, odd and even. The odd field is obtained by grouping the odd lines of the frame together, and the even field by grouping the even lines. Scanning the fields at different instants avoids the flickering problem to an extent and does not introduce the burden of a bandwidth hike, while the screen refresh rate is doubled. This type of scanning is called interlaced scanning, illustrated in Figure 2.2.

    Figure 2.2 Interlaced scan
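To make the odd/even field idea concrete, here is a minimal Python sketch (function names are my own, not from any standard) that splits a frame, stored as a 2-D array of scanlines, into its two fields and weaves them back together:

```python
import numpy as np

def split_fields(frame):
    """Split a frame into its two interlaced fields.

    frame: 2-D array of scanlines (rows) by pixels (columns).
    """
    odd_field = frame[0::2, :]    # lines 1, 3, 5, ... in 1-based counting
    even_field = frame[1::2, :]   # lines 2, 4, 6, ...
    return odd_field, even_field

def weave_fields(odd_field, even_field):
    """Re-interleave two fields into a full frame."""
    rows = odd_field.shape[0] + even_field.shape[0]
    frame = np.empty((rows, odd_field.shape[1]), dtype=odd_field.dtype)
    frame[0::2, :] = odd_field
    frame[1::2, :] = even_field
    return frame

frame = np.arange(6 * 4).reshape(6, 4)          # toy 6-line "frame"
odd, even = split_fields(frame)
assert np.array_equal(weave_fields(odd, even), frame)
```

Displaying the two fields 1/50 s apart doubles the refresh rate while keeping the data rate of a 25 fps progressive signal.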

1.2 Video Coding Concepts

Data compression is achieved by removing redundancy, i.e. components that are not necessary for faithful reproduction of the data. Many types of data contain statistical redundancy and can be effectively compressed using lossless compression, so that the reconstructed data at the output of the decoder is a perfect copy of the original data. Unfortunately, lossless compression of image and video information gives only a moderate amount of compression: the best that can be achieved with current lossless image compression standards is a compression ratio of around 3-4 times. Lossy compression is necessary to achieve higher compression. In a lossy compression system the decompressed data is not identical to the source data, and much higher compression ratios can be achieved at the expense of a loss of visual quality. Lossy video compression systems are based on the principle of removing subjective redundancy: elements of the image or video sequence that can be removed without significantly affecting the viewer's perception of visual quality. Compression comes at the cost of higher computational complexity.

Most video coding methods exploit both temporal and spatial redundancy to achieve compression. In the temporal domain, there is usually a high correlation (similarity) between frames of video that are captured at around the same time, especially if the temporal sampling rate (the frame rate) is high. In the spatial domain, there is usually a high correlation between pixels (samples) that are close to each other, i.e. the values of neighboring samples are often very similar.


Spatial and temporal redundancy are illustrated in Figure 2.3.

    Figure 2.3 Spatial & temporal correlation in a video sequence

1.3 Spatial Model

Due to the temporal correlation between frames, successive frames can be predicted from previous ones. For this we need reference frames. Reference frames are encoded without reference to other frames, using the intra-frame transform coding explained in section 1.6.2.

1.4 Temporal Model

The goal of the temporal model is to reduce redundancy between transmitted frames by forming a predicted frame and subtracting it from the current frame. The output of this process is a residual (difference) frame. The more accurate the prediction process, the less energy is contained in the residual frame. The residual frame is encoded and sent to the decoder, which re-creates the predicted frame, adds the decoded residual and reconstructs the current frame. The predicted frame is created from one or more past or future frames (reference frames). The accuracy of the prediction can usually be improved by compensating for motion between the reference frame(s) and the current frame.

Figure 2.4 Difference of frame 1 and frame 2

The simplest method of temporal prediction is to use the previous frame as the predictor for the current frame. Two successive frames from a video sequence are shown as frame 1 and frame 2 in Figure 2.4. Frame 1 is used as a predictor for frame 2, and the residual formed by subtracting the predictor (frame 1) from the current frame (frame 2) is also shown in Figure 2.4. The obvious problem with this simple prediction is that a lot of energy remains in the residual frame (indicated by the light and dark areas), which means that there is still a significant amount of information to compress after temporal prediction. Much of the residual energy is due to object movements between the two frames, and a better prediction may be formed by compensating for motion between the two frames.
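As a rough illustration of this simplest form of temporal prediction, the sketch below (names and the synthetic frames are purely illustrative) forms the residual between two frames and measures its energy:

```python
import numpy as np

def residual_and_energy(predictor, current):
    """Subtract the predictor frame from the current frame and
    return the residual plus its energy (sum of squared values)."""
    residual = current.astype(np.int32) - predictor.astype(np.int32)
    return residual, float(np.sum(residual.astype(np.int64) ** 2))

rng = np.random.default_rng(0)
frame1 = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
frame2 = np.roll(frame1, shift=2, axis=1)   # simulate a small horizontal motion

residual, energy = residual_and_energy(frame1, frame2)
print("residual energy without motion compensation:", energy)
```

Compensating the prediction for the 2-pixel shift before subtracting would drive this residual energy towards zero, which is exactly the motivation for the motion-compensated methods below.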

1.5 Optic Flow

Changes between video frames may be caused by object motion (rigid object motion, for example a moving car, and deformable object motion, for example a moving arm), camera motion (panning, tilt, zoom, rotation), uncovered regions (for example, a portion of the scene background uncovered by a moving object) and lighting changes. With the exception of uncovered regions and lighting changes, these differences correspond to pixel movements between frames. It is possible to estimate the trajectory of each pixel between successive video frames, producing a field of pixel trajectories known as the optical flow (optic flow). Figure 2.5 shows the optical flow field for frames 1 and 2 of Figure 2.4. The complete field contains a flow vector for every pixel position, but for clarity the field is sub-sampled so that only the vector for every 2nd pixel is shown.

    Figure 2.5 Optical Flow

If the optical flow field is accurately known, it should be possible to form an accurate prediction of most of the pixels of the current frame by moving each pixel from the reference frame along its optical flow vector. However, this is not a practical method of motion compensation, for several reasons. An accurate calculation of optical flow is very computationally intensive (the more accurate methods use an iterative procedure for every pixel), and it would be necessary to send the optical flow vector for every pixel to the decoder.
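For experimentation, a dense flow field of this kind can be computed with an off-the-shelf implementation. The sketch below uses OpenCV's Farnebäck method (assuming OpenCV is installed; the file names are placeholders and the parameters are the commonly quoted defaults, not tuned values):

```python
import cv2
import numpy as np

# Two consecutive frames as 8-bit grayscale images (placeholder file names).
prev_gray = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
next_gray = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow: one (dx, dy) vector per pixel.
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)

# Sub-sample the field for display, as in Figure 2.5 (every 2nd pixel).
step = 2
sampled = flow[::step, ::step]
print(sampled.shape)   # (rows/2, cols/2, 2): one (dx, dy) vector per sample
```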


1.6 Block-Based Motion Estimation and Compensation

A practical and widely-used method of motion compensation is to compensate for movement of rectangular sections or blocks of the current frame. The following procedure is carried out for each block of M × N samples in the current frame:

1. Search an area in the reference frame (a past or future frame, previously coded and transmitted) to find a matching M × N-sample region. This is carried out by comparing the M × N block in the current frame with some or all of the possible M × N regions in the search area (usually a region centered on the current block position) and finding the region that gives the best match. A popular matching criterion is the energy in the residual formed by subtracting the candidate region from the current M × N block, so that the candidate region that minimizes the residual energy is chosen as the best match. This process of finding the best match is known as motion estimation.

2. The chosen candidate region becomes the predictor for the current M × N block and is subtracted from the current block to form a residual M × N block (motion compensation, MC).

3. The residual block is encoded and transmitted, and the offset between the current block and the position of the candidate region (motion vector, MV) is also transmitted. The decoder uses the received motion vector to re-create the predictor region, decodes the residual block, adds it to the predictor and reconstructs a version of the original block.

Block-based motion compensation is popular for a number of reasons. It is relatively straightforward and computationally tractable, it fits well with rectangular video frames and with block-based image transforms (e.g. the Discrete Cosine Transform, see later), and it provides a reasonably effective temporal model for many video sequences. There are however a number of disadvantages: for example, real objects rarely have neat edges that match rectangular boundaries, objects often move by a fractional number of pixel positions between frames, and many types of object motion are hard to compensate for using block-based methods (e.g. deformable objects, rotation and warping, and complex motion such as a cloud of smoke). Despite these disadvantages, block-based motion compensation is the basis of the temporal model used by all current video coding standards.

1.6.1 Motion Compensated Prediction of Macroblocks

The image is divided into macroblocks (MBs) of size N × N. By default, N = 16 for luminance images; for chrominance images N = 8 if 4:2:0 chroma subsampling (explained in section 1.6.2) is adopted. H.264 uses a variant of this. Motion compensation is performed at the macroblock level. Macroblock partitioning is shown in Figure 2.6. The general rule of prediction is as follows:

- The current image frame is referred to as the target frame.

- A match (only for the Y macroblock) is sought between the macroblock in the target frame and the most similar macroblock in previous and/or future frame(s) (referred to as reference frame(s)).

- The displacement of the reference macroblock relative to the target macroblock is called a motion vector, MV.

- A residual is found for the Y macroblock as well as for the Cb and Cr macroblocks.

- Figure 2.7 shows the case of forward prediction, in which the reference frame is taken to be a previous frame.

- The MV search is usually limited to a small immediate neighborhood: both horizontal and vertical displacements in the range [−p, p]. This makes a search window of size (2p + 1) × (2p + 1).

    Figure 2.6 Size of macroblock for prediction

    Figure 2.7 Motion Compensated prediction of Macroblock

    General Rule for searching motion vectors

The parameter MAD (Mean Absolute Difference) is defined as

MAD(i, j) = \frac{1}{N^2} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \left| C(x+k, y+l) - R(x+i+k, y+j+l) \right|   (eqn 2.1)

where

N is the size of the macroblock,

k and l are indices for pixels in the macroblock,

i and j are the horizontal and vertical displacements,

C(x + k, y + l) are pixels in the macroblock in the target frame, and

R(x + i + k, y + j + l) are pixels in the macroblock in the reference frame.

The search is to find a vector (i, j), taken as the motion vector MV = (u, v), such that MAD(i, j) is minimum:

(u, v) = { (i, j) | MAD(i, j) is minimum, i ∈ [−p, p], j ∈ [−p, p] }
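A direct (exhaustive) implementation of this search might look as follows. This is a sketch rather than an encoder routine: it assumes 8-bit single-channel frames held in NumPy arrays, and the function and variable names are my own:

```python
import numpy as np

def mad(C, R):
    """Mean absolute difference between two N x N blocks (eqn 2.1)."""
    return float(np.mean(np.abs(C.astype(np.int32) - R.astype(np.int32))))

def full_search(target, reference, x, y, N=16, p=7):
    """Exhaustively search displacements (i, j) in [-p, p] x [-p, p]
    for the N x N target block whose top-left corner is (x, y),
    returning the motion vector (u, v) that minimizes MAD."""
    C = target[y:y + N, x:x + N]
    best_cost, mv = float("inf"), (0, 0)
    for j in range(-p, p + 1):
        for i in range(-p, p + 1):
            ry, rx = y + j, x + i
            if (ry < 0 or rx < 0 or
                    ry + N > reference.shape[0] or rx + N > reference.shape[1]):
                continue                      # candidate falls outside the frame
            cost = mad(C, reference[ry:ry + N, rx:rx + N])
            if cost < best_cost:
                best_cost, mv = cost, (i, j)
    return mv, best_cost

rng = np.random.default_rng(0)
frame1 = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
frame2 = np.roll(frame1, shift=3, axis=1)       # content moves 3 pixels right
print(full_search(frame2, frame1, x=16, y=16))  # -> ((-3, 0), 0.0)
```

Full search costs (2p + 1)² block comparisons per macroblock, which is why practical encoders replace it with fast patterns such as the logarithmic or three-step search.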

    1.6.2 Frame Coding

Raw video can be viewed as a sequence of frames at a display rate of 25 frames per second or above. The encoder converts these frames into Intra-frames (I-frames), Inter-frames (P-frames) or Bidirectional frames (B-frames). A transform coding method similar to JPEG is applied within each I-frame, hence the name Intra. P-frames are not independent: they are coded by a forward predictive coding method in which prediction from a previous P-frame is allowed, not just from a previous I-frame. A B-frame is predicted from more than one frame, usually from a previous and a future frame.

    Figure 2.8 I-frame Coding

I-frame coding is depicted in Figure 2.8. Here macroblocks are of size 16 × 16 pixels for the Y frame, and 8 × 8 for the Cb and Cr frames, since 4:2:0 chroma subsampling is applied. A macroblock consists of four Y, one Cb, and one Cr 8 × 8 blocks. For each 8 × 8 block a DCT transform is applied, and the DCT coefficients then go through quantization, zigzag scan and entropy coding.
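The 8 × 8 transform stage can be sketched with SciPy's DCT routines. The orthonormal scaling used here is one common convention; the video standards define their own integer-friendly scalings:

```python
import numpy as np
from scipy.fft import dctn, idctn

block = np.random.default_rng(1).integers(0, 256, size=(8, 8)).astype(np.float64)
block -= 128.0                        # level shift, as in JPEG-style coding

coeffs = dctn(block, norm="ortho")    # forward 8 x 8 2-D DCT
recon = idctn(coeffs, norm="ortho")   # inverse transform
assert np.allclose(recon, block)      # the DCT itself is lossless;
                                      # compression comes from quantization
```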

P-frame coding is based on motion compensation. For each macroblock in the target frame, a motion vector is allocated. After the prediction, a difference macroblock is derived to measure the prediction error. Each of its 8 × 8 blocks goes through the DCT, quantization, zigzag scan and entropy coding procedures. P-frame coding thus encodes the difference macroblock, not the target macroblock itself. Sometimes a good match cannot be found, i.e. the prediction error exceeds a certain acceptable level; the MB itself is then encoded (treated as an intra MB), and in this case it is termed a non-motion-compensated MB. For the motion vector, the difference MVD (motion vector difference) is sent for entropy coding. Figure 2.9 gives an overview of P-frame coding.

    Figure 2.9 P-frame coding

The motion-compensation-based B-frame coding method is illustrated in Figure 2.10. Each macroblock from a B-frame will have up to two motion vectors (MVs), one from the forward and one from the backward prediction. If matching succeeds in both directions, then two MVs are sent and the two corresponding matching MBs are averaged (indicated by `%' in Figure 2.10) before being compared to the target MB to generate the prediction error. If an acceptable match can be found in only one of the reference frames, then only one MV and its corresponding MB will be used, from either the forward or the backward prediction.

    Figure 2.10 B-frame Coding
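The averaging of the two matching macroblocks can be expressed in a couple of lines; the rounding convention here is an assumption for illustration:

```python
import numpy as np

def bidirectional_prediction(fwd_mb, bwd_mb):
    """Average the forward and backward matching macroblocks (the `%'
    operation of Figure 2.10) to form the B-frame predictor."""
    # +1 gives round-half-up on the integer average (assumed convention).
    return (fwd_mb.astype(np.int32) + bwd_mb.astype(np.int32) + 1) // 2
```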


    Figure 2.11 4:2:0 Subsampling in the case of interlaced video

    1.6.3 Field Prediction

Field prediction arises in codecs that have extended support for interlaced video. Chroma subsampling in the case of interlaced video is illustrated in Figure 2.11.

In interlaced video each frame consists of two fields, referred to as the top field and the bottom field. In a frame picture, all scanlines from both fields are interleaved to form a single frame, which is then divided into 16 × 16 macroblocks and coded using motion compensation. If each field is treated as a separate picture, it is called a field picture.

Figure 2.12 Frame picture vs field picture

The common modes of prediction in the case of interlaced video are:

1) Frame Prediction for Frame-pictures

Fields are combined to form frames and prediction is performed for macroblocks of size 16 × 16, in a fashion similar to that explained in section 1.6.1.

    2) Field Prediction for Field-pictures


A macroblock of size 16 × 16 from the field pictures is used, as in Figure 2.13.

    Figure 2.13 Field Prediction for field pictures

    3) Field Prediction for Frame-pictures

The top field and bottom field of a frame picture are treated separately. Each 16 × 16 macroblock (MB) from the target frame picture is split into two 16 × 8 parts, each coming from one field. Field prediction is carried out for these 16 × 8 parts in the manner shown in Figure 2.14.

    Figure 2.14 Field Prediction for frame-pictures

4) 16 × 8 MC for Field-pictures

Each 16 × 16 macroblock (MB) from the target field picture is split into top and bottom 16 × 8 halves, and field prediction is performed on each half. This generates two motion vectors for each 16 × 16 MB in the P-field picture, and up to four motion vectors for each MB in the B-field picture. This mode is good for finer motion compensation when motion is rapid and irregular. Figure 2.15 gives an illustration.

Figure 2.15 16 × 8 motion compensation for field-pictures

    5) Dual-Prime for P-pictures

A field prediction is made from each previous field with the same parity (top or bottom). Each motion vector MV is then used to derive a calculated motion vector CV in the field with the opposite parity, taking into account the temporal scaling and the vertical shift between lines in the top and bottom fields. For each MB, the pair MV and CV yields two preliminary predictions. Their prediction errors are averaged and used as the final prediction error. This mode mimics B-picture prediction for P-pictures without adopting backward prediction (and hence with less encoding delay). This is the only mode that can be used for both frame pictures and field pictures.

    1.6.4 Transform Coding

In the case of I-frames, the 8 × 8 blocks of Y, Cb and Cr are transform coded, and in the case of B-frames and P-frames the residual values obtained from block-based motion estimation are transform coded. Figure 2.16 shows the 8 × 8 block division of a residual macroblock for the luma signal in the case of interlaced video. These 8 × 8 blocks are transform coded in the same manner as for I-frames (section 1.6.2). The same applies to the chroma signal.


    Figure 2.16 Frame and field DCT for a Y macroblock in interlaced video
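The difference between the frame-DCT and field-DCT organisation of a 16 × 16 luma macroblock (Figure 2.16) can be sketched as follows; function names are illustrative:

```python
import numpy as np

def frame_dct_blocks(mb):
    """Split a 16 x 16 luma macroblock into four 8 x 8 frame blocks."""
    return [mb[y:y + 8, x:x + 8] for y in (0, 8) for x in (0, 8)]

def field_dct_blocks(mb):
    """Regroup the macroblock's lines by field before splitting, so each
    8 x 8 block holds lines from a single field (field-DCT organisation)."""
    top = mb[0::2, :]       # 8 lines of the top field
    bottom = mb[1::2, :]    # 8 lines of the bottom field
    return [top[:, :8], top[:, 8:], bottom[:, :8], bottom[:, 8:]]

mb = np.arange(16 * 16).reshape(16, 16)
assert all(b.shape == (8, 8) for b in frame_dct_blocks(mb) + field_dct_blocks(mb))
```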

    1.6.5 Quantization

In MPEG-1 & MPEG-2 video coding, two quantization tables are used: one for intra-coding and the other for inter-coding. In addition, a scale parameter called the quantization parameter (QP) is defined, giving scaled variants of each quantization matrix.

Z_{ij} = \mathrm{Round}\left( \frac{8 \times DCT_{ij}}{QP \times Q1_{ij}} \right)   (eqn 2.2)

Z_{ij} = \mathrm{Round}\left( \frac{8 \times DCT_{ij}}{QP \times Q2_{ij}} \right)   (eqn 2.3)

where Z is the quantized matrix, DCT is the matrix of transform coefficients, QP is the quantization parameter, and Q1 and Q2 represent the quantization matrices for intra and inter coding respectively.

The value of QP varies from 1 to 31, so in effect there are many variants of each quantization table. The wide range of quantizer step sizes makes it possible for an encoder to control the tradeoff between bit rate and quality accurately and flexibly.
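A minimal sketch of this quantizer, directly following the form of eqns 2.2 and 2.3 (the actual standards add further rounding rules, notably for the intra DC coefficient):

```python
import numpy as np

def quantize(dct_coeffs, Q, QP):
    """Quantize a block of DCT coefficients with matrix Q scaled by QP."""
    return np.round(8.0 * dct_coeffs / (QP * Q)).astype(np.int32)

def dequantize(Z, Q, QP):
    """Approximate inverse used by the decoder; the rounding loss is
    where the 'lossy' part of lossy compression happens."""
    return Z * (QP * Q) / 8.0
```

A larger QP means larger step sizes, fewer nonzero coefficients, and hence a lower bit rate at lower quality.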

Q1 =

 8 16 19 22 26 27 29 34
16 16 22 24 27 29 34 37
19 22 26 27 29 34 34 38
22 22 26 27 29 34 37 40
22 26 27 29 32 35 40 48
26 27 29 32 35 40 48 58
26 27 29 34 38 46 56 69
27 29 35 38 46 56 69 83

Table 2.1 The (MPEG-1 default) matrix used for intra coding in eqn 2.2


Q2 = the flat default inter (non-intra) matrix, with all 64 entries equal to 16.

Table 2.2 The (MPEG-1 default) matrix used for inter coding in eqn 2.3

    1.6.6 Zig-Zag Scan

Two different types of reordering are applied to the transform coded values: the zig-zag scan, used for progressive video, and an alternate scan, used for interlaced video, both shown in Figure 2.17.

Figure 2.17 Reordering of quantized transform coefficients
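The classic zig-zag order for progressive video can be generated programmatically. The sketch below covers only the zig-zag scan, not the alternate scan used for interlaced video:

```python
import numpy as np

def zigzag_order(n=8):
    """(row, col) visiting order of the zig-zag scan for an n x n block:
    walk the anti-diagonals, alternating direction on each one."""
    order = []
    for s in range(2 * n - 1):                 # s = row + col
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        order.extend(diag if s % 2 == 1 else diag[::-1])
    return order

def zigzag(block):
    """Flatten a 2-D block into its 1-D zig-zag scanned array."""
    return np.array([block[r, c] for r, c in zigzag_order(block.shape[0])])

block = np.arange(64).reshape(8, 8)
print(zigzag(block)[:10])   # [ 0  1  8 16  9  2  3 10 17 24]
```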

1.6.7 Run-Level Coding

The output of the reordering process is an array that typically contains one or more clusters of nonzero coefficients near the start, followed by strings of zero coefficients. The large number of zero values can be encoded more compactly, for example by representing the array as a series of (run, level) pairs, where run indicates the number of zeros preceding a nonzero coefficient and level indicates the signed magnitude of the nonzero coefficient. Higher-frequency DCT coefficients are very often quantized to zero, so a reordered block will usually end in a run of zeros. A special case is required to indicate the final nonzero coefficient in a block: EOB (End of Block). This is called two-dimensional run-level encoding. If three-dimensional run-level encoding is used, each symbol encodes three quantities: run, level and last. Last indicates whether this level is the last nonzero value in the linear array.


    Example:

Input array (zig-zag ordered values): 16, 0, 0, 3, 5, 6, 0, 0, 0, 0, 7, 0, 0, . . .

    2D Coding:

    Output values : (0, 16), (2, 3), (0, 5), (0, 6), (4, 7), (EOB)

Each of these output values (a run-level pair) is encoded as a separate symbol by the entropy encoder.

    3D coded values are:

Output values: (0, 16, 0), (2, 3, 0), (0, 5, 0), (0, 6, 0), (4, 7, 1)

The 1 in the final code indicates that this is the last nonzero coefficient in the block.
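Both variants are easy to reproduce in code. The sketch below implements them and reproduces the example above; the 'EOB' string stands in for the real end-of-block codeword:

```python
def run_level_2d(coeffs):
    """Two-dimensional run-level encoding of a zig-zag ordered array:
    (run, level) pairs followed by an end-of-block marker."""
    last = max((i for i, v in enumerate(coeffs) if v != 0), default=-1)
    pairs, run = [], 0
    for v in coeffs[:last + 1]:        # the trailing zeros become EOB
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")
    return pairs

def run_level_3d(coeffs):
    """Three-dimensional variant: (run, level, last) triples."""
    pairs = run_level_2d(coeffs)[:-1]  # strip the EOB marker
    return [(r, lv, 1 if i == len(pairs) - 1 else 0)
            for i, (r, lv) in enumerate(pairs)]

data = [16, 0, 0, 3, 5, 6, 0, 0, 0, 0, 7, 0, 0]
print(run_level_2d(data))  # [(0, 16), (2, 3), (0, 5), (0, 6), (4, 7), 'EOB']
print(run_level_3d(data))  # [(0, 16, 0), (2, 3, 0), (0, 5, 0), (0, 6, 0), (4, 7, 1)]
```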

    1.6.8 Entropy Encoding

The entropy encoder converts a series of symbols representing elements of the video sequence into a compressed bitstream suitable for transmission and storage. In this section we discuss the widely-used entropy coding techniques.

    1.6.8.1 Variable-length Coding

A variable-length encoder maps input symbols to a series of codewords (variable length codes or VLCs). Each codeword may have a varying length, but each must contain an integral number of bits. Frequently occurring symbols are represented with short VLCs while less common symbols are represented with long VLCs. Over a sufficiently large number of encoded symbols this leads to compression of the data.

    1.6.8.2 Huffman Coding

Huffman coding assigns a VLC to each symbol based on its probability of occurrence. According to the original scheme proposed by Huffman, it is necessary to calculate the probability of occurrence of each symbol and to construct a set of variable length codewords. Based on these codewords we can encode and decode symbols. Since the VLCs formed by the Huffman algorithm constitute a prefix code, the bitstream can easily be decoded back to the corresponding symbols.

Consider the probability distribution of MVD (Motion Vector Difference) values given in Table 2.3.

Vector   Probability   Code
 0       0.80          1
 1       0.08          01
-1       0.07          001
 2       0.03          0001
-2       0.02          0000

Table 2.3 Codewords for MVD values, based on the Huffman algorithm
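A minimal Huffman code constructor, applied to the Table 2.3 probabilities, can be written with Python's heapq. Tie-breaking in the merge order is arbitrary, so the exact bit patterns may differ from Table 2.3, but the code lengths (1, 2, 3, 4 and 4 bits) come out the same:

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a Huffman code from a {symbol: probability} mapping;
    returns {symbol: bitstring}."""
    tiebreak = count()   # keeps heap comparisons away from the symbol dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

mvd_probs = {0: 0.80, 1: 0.08, -1: 0.07, 2: 0.03, -2: 0.02}
code = huffman_code(mvd_probs)
print(sorted((len(bits), sym) for sym, bits in code.items()))
# [(1, 0), (2, 1), (3, -1), (4, -2), (4, 2)]
```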

If the vector sequence is (1, -2, 0, 0, 2), then its binary sequence is 010000110001. In order to decode the data, the decoder must also have a copy of the Huffman code tree (or look-up table). This may be achieved by transmitting the look-up table itself, or by sending the list of symbols and probabilities prior to sending the coded data. Each uniquely decodable code is converted back to the original data, for example:

01 is decoded as (1),

0000 is decoded as (-2),

1 is decoded as (0),

1 is decoded as (0),

0001 is decoded as (2).
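This walkthrough can be reproduced with a few lines that exploit the prefix property (no codeword is a prefix of another), using the codewords of Table 2.3:

```python
def prefix_decode(bits, code_table):
    """Decode a bitstring using a prefix code given as {symbol: codeword}."""
    inverse = {codeword: sym for sym, codeword in code_table.items()}
    symbols, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:        # prefix property: first hit is the symbol
            symbols.append(inverse[buf])
            buf = ""
    return symbols

table = {0: "1", 1: "01", -1: "001", 2: "0001", -2: "0000"}  # Table 2.3
print(prefix_decode("010000110001", table))   # [1, -2, 0, 0, 2]
```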
