Page 1: T325: Technologies for digital media

T325: Technologies for digital media. Second semester – 2011/2012. Tutorial 6 – Video and Audio Coding (2-2)


Page 2: T325: Technologies for digital media


The coding of moving pictures: MPEG

• The JPEG standards were derived for the coding of still pictures.

• JPEG has been used to code series of individual fields in video sequences for transmission over digital links.

• Moving pictures can be coded simply by using a separate JPEG image for each picture (Motion-JPEG).

• Further compression can be introduced by exploiting features of the motion itself.

Page 3: T325: Technologies for digital media

The coding of moving pictures – MPEG

• When pictures are produced at a rate of 25 per second, there cannot be much change between one picture and the next as a scene evolves in time, except, occasionally, when the camera or video editor cuts abruptly from one scene to another.

• There is, therefore, a considerable amount of temporal redundancy, which is exploited in MPEG by, in effect, only transmitting the differences between one picture and the next.

Page 4: T325: Technologies for digital media

Motion compensation

• In most cases, as a scene evolves, many parts of the scene, such as the background, do not change.

• Furthermore, if there is movement, then the objects that do move may not, themselves, change much.

• What is more likely to change is their location in the picture.

• Thus, although movement may destroy correlation between corresponding locations in consecutive pictures, a high degree of correlation is likely to be maintained between successive locations occupied by a moving object.

Page 5: T325: Technologies for digital media

Motion compensation

Page 6: T325: Technologies for digital media

Motion compensation

• In order to take this form of correlation into account, MPEG coding involves estimating the motion of objects between pictures.

• Different objects in a scene are likely to move by different amounts and in different directions. Therefore, motion is estimated for relatively small areas of a picture.

• The 8×8 pixel blocks used for the spatial compression are rather too small for this purpose, so sets of four 8×8 blocks, called macroblocks, are used.

Page 7: T325: Technologies for digital media

Motion compensation

Page 8: T325: Technologies for digital media

Motion compensation

• Motion is estimated by:

1. taking a macroblock in one picture, and

2. comparing it with each of a group of macroblocks around the same region in a subsequent picture.

• This is called block matching. The block most like the reference block is taken to indicate the motion of that reference block.
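
A minimal sketch of block matching in Python, assuming the sum of absolute differences (SAD) as the similarity measure and an exhaustive search over a ±8-pixel window; these are illustrative choices, since the slides do not fix how the search is performed:

```python
import numpy as np

def match_block(ref_pic, cur_pic, x, y, size=16, search=8):
    """Exhaustive block matching: find the displacement (dx, dy) for
    which the macroblock at (x, y) in ref_pic best matches an area of
    cur_pic, using the sum of absolute differences as the criterion."""
    ref_block = ref_pic[y:y + size, x:x + size].astype(int)
    best_vector, best_sad = (0, 0), float('inf')
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            # skip candidate positions that fall outside the picture
            if (yy < 0 or xx < 0 or yy + size > cur_pic.shape[0]
                    or xx + size > cur_pic.shape[1]):
                continue
            candidate = cur_pic[yy:yy + size, xx:xx + size].astype(int)
            sad = int(np.abs(ref_block - candidate).sum())
            if sad < best_sad:
                best_sad, best_vector = sad, (dx, dy)
    return best_vector  # the motion vector (Δx, Δy) for this macroblock
```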

Page 9: T325: Technologies for digital media

Motion compensation

Page 10: T325: Technologies for digital media

Motion compensation

• The motion of the block between the two pictures is shown as a displacement (Δx, Δy).

• This is called a motion vector in MPEG processing.

• Evaluation of motion vectors is known as motion estimation.

• In most cases, there is a gap of several pictures between the pair of pictures used for motion estimation, which saves on processing.

• The motion within the intervening pictures is estimated by interpolation.

Page 11: T325: Technologies for digital media

Picture types and motion compensation

• There are three types of MPEG picture:

• I (Intra) pictures
• Coded independently of other pictures, using the coding techniques described earlier.
• Form the starting point of a series of predicted and interpolated pictures.

• P (Predicted) pictures
• Coded using predictions based on the previous I or P picture, using motion compensation.

• B (Bi-directional) pictures
• Obtained by interpolation between the I or P pictures that precede and follow them.

Page 12: T325: Technologies for digital media

Picture types and motion compensation

Page 13: T325: Technologies for digital media

Picture types and motion compensation

• Motion estimation between two pictures establishes pairs of matching blocks in the two pictures and the motion vectors for each of these blocks.

• From this information it is possible to make an initial estimate of the second picture by using the motion vectors to relocate matching blocks.

• The difference between this initial estimate and the actual pixel values of the second picture can then be evaluated.

• Transmission of the motion vectors for each block in the first picture enables the receiver to derive the initial estimate.

• This is then corrected on reception of the differences between this estimate and the actual second picture values.

Page 14: T325: Technologies for digital media

Picture types and motion compensation

• Because the size of a moving object is often greater than that of the macroblocks, there is often a strong correlation between the motion vectors of contiguous macroblocks: the difference between motion vectors for neighboring macroblocks tends to be small.

• So-called differential coding is used to send just these differences, keeping the number of bits required comparatively low (a toy sketch follows at the end of this list).

• Motion estimation for P pictures is obtained from the preceding I or P picture and, because there may be several B pictures in between, the correction values to be transmitted can be quite large.

• In order to compress this information, it is coded using the same steps as for spatial coding: DCT, requantization, zigzag scan, run-length and Huffman encoding.
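
As an illustration of the differential coding idea (a toy sketch, not the MPEG bitstream syntax; starting from a zero vector is an assumption made here):

```python
def code_vectors_differentially(vectors):
    """Send each motion vector as its difference from the previous
    macroblock's vector; since neighboring vectors tend to be similar,
    the differences are small and cheap to entropy-code."""
    previous = (0, 0)
    differences = []
    for dx, dy in vectors:
        differences.append((dx - previous[0], dy - previous[1]))
        previous = (dx, dy)
    return differences

# e.g. [(5, 2), (5, 3), (6, 3)] -> [(5, 2), (0, 1), (1, 0)]
```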

Page 15: T325: Technologies for digital media


Page 16: T325: Technologies for digital media


http://www.tomshardware.com/reviews/video-guide-part-3,130-6.html

Page 17: T325: Technologies for digital media

Picture types and motion compensation

• In most cases, changes affect relatively few macroblocks, so information is sent only when there are changes in a macroblock.

• The degree of compression for P pictures is significantly higher than that for I pictures.

• B pictures use interpolated values of the motion vectors obtained from the motion estimation for the following P picture.

• Because B pictures do not require separate motion estimation, considerably less processing is needed.

• The interpolation (of B pictures) is carried out in three ways:
• Forward: by applying the interpolated motion vectors to the macroblocks in the previous I or P picture.
• Backward: from the macroblocks of the following I or P picture.
• Both ways: by taking the average between the two directions.
• The option giving the smallest error is retained (a sketch of this choice follows).
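
A sketch of how that choice might be made for one macroblock; the error measure and the averaging of the two predictions are illustrative assumptions:

```python
import numpy as np

def pick_interpolation_mode(actual, forward_pred, backward_pred):
    """Compare the three B-picture options for one macroblock and
    return the one giving the smallest total absolute error."""
    fwd = np.asarray(forward_pred, dtype=int)
    bwd = np.asarray(backward_pred, dtype=int)
    candidates = {
        'forward': fwd,
        'backward': bwd,
        'both': (fwd + bwd) // 2,   # average of the two directions
    }
    target = np.asarray(actual, dtype=int)
    errors = {mode: int(np.abs(target - pred).sum())
              for mode, pred in candidates.items()}
    return min(errors, key=errors.get)
```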

Page 18: T325: Technologies for digital media

Picture types and motion compensation

• The term group of pictures is used in MPEG for a sequence of pictures starting with an I picture and including all subsequent P and B pictures up to the next I picture.

• The structure of a group is specified in terms of two parameters: N, the total number of pictures in the group, and M, the number of adjacent B pictures plus 1. Thus M = 3 and N = 12 for the group illustrated.
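
Given these definitions of N and M, the display-order pattern of picture types in a group can be generated mechanically; a small sketch:

```python
def group_of_pictures(n=12, m=3):
    """Display-order picture types for one group: N pictures in all,
    with an anchor (I or P) every M pictures and B pictures between."""
    return ''.join('I' if i == 0 else
                   'P' if i % m == 0 else
                   'B' for i in range(n))

print(group_of_pictures())      # IBBPBBPBBPBB for N = 12, M = 3
print(group_of_pictures(6, 2))  # IBPBPB
```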

Page 19: T325: Technologies for digital media

Picture types and motion compensation

• MPEG material may be recorded for editing or other purposes which require random access.

• Access points must coincide with the start of a group of pictures because of the interpolation and prediction processes.

• Long groups of pictures could lead to excessive distances between access points.
• With N = 12 and a picture rate of 25 per second, the interval between successive groups of pictures is 12/25 s, just under half a second. More than this would be unacceptable in many cases.

• The number of B pictures between two consecutive P pictures is limited by the accuracy of the interpolation process as well as the processing delays.

• Groups of pictures with N = 12 and M = 3 are commonly used.

Page 20: T325: Technologies for digital media

MPEG-2

• The processing techniques described so far are all available in MPEG-1 systems.

• MPEG-2 systems can carry out all of these, but they can also operate at the higher bit rates and higher resolutions required for broadcast television.

• MPEG-2 has been described as a tool box which allows for the provision of a whole range of picture quality parameters.

• The selection of tools available in any particular implementation depends on what trade-off has been made between complexity (and, hence, cost) and picture quality.

Page 21: T325: Technologies for digital media

MPEG-2 – Encoder

• The digitized video input is reordered into macroblocks for motion estimation.

• The forward path, left to right across the figure, represents the spatial compression process.

• The loop which involves the predictor provides the corrections to the transmitted picture that ensure that a correct picture is reconstructed at the receiver when it applies the motion vectors to the received macroblocks.

Page 22: T325: Technologies for digital media

MPEG-2 – Encoder

• The motion vectors and mode control information are combined in the output multiplexer.

• The multiplexer output is stored in a buffer.

• If the buffer contents grow too fast, then the requantization steps are temporarily made wider to reduce the bit rate of the input to the multiplexer.
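
A toy sketch of that feedback loop; the control law and its constants are invented for illustration, as real encoders use more elaborate rate control:

```python
def next_quantizer_step(step, buffer_fill, target_fill=0.5, gain=4.0):
    """Toy rate control: widen the requantization step when the output
    buffer is fuller than the target (fewer bits out), and narrow it
    when the buffer is emptier. buffer_fill is a fraction in [0, 1]."""
    return max(1.0, step * (1.0 + gain * (buffer_fill - target_fill)))

# A half-full buffer leaves the step unchanged; a nearly full buffer
# (0.9) multiplies it by 1 + 4 * 0.4 = 2.6, cutting the bit rate.
```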

Page 23: T325: Technologies for digital media

MPEG-2 – Decoder

• The spatial compression is reversed and motion estimation data is used to reconstruct interpolated pictures.

• The sizes of the steps used for inverse requantization are controlled using data extracted from the multiplexed coded input.

Page 24: T325: Technologies for digital media

MPEG audio coding

• The basic features of MPEG-1 and MPEG-2 audio coding are the same.

• The main difference is the number of audio channels: MPEG-2 provides additional options, such as 'surround sound', which require more than the two channels used for conventional stereo reproduction.

• Three levels of compression have been specified. They are referred to as coding layers and numbered I, II and III.

Page 25: T325: Technologies for digital media

MPEG audio coding

• Layer I provides the smallest compression and layer III the greatest.

• The layers are upwardly compatible:
• a layer III decoder can decode data compressed using layers I or II,
• a layer II decoder can decode data compressed using layer I.

• The precise details differ from layer to layer.
• The description here is based primarily on MPEG-1/2 layer II, which is part of the terrestrial digital video broadcasting standard (DVB-T) and of many current DAB systems.

Page 26: T325: Technologies for digital media

MPEG audio coding

• The compression of audio messages depends on several aspects of our perception of sounds.

• There are many aspects of an audio signal which we do not perceive, although they carry information in the communications sense of the term.

• We are hardly, if at all, sensitive to the phase of the frequency components that make up an audio signal, so that any phase information may be omitted from the coding process.

• The main feature of the process is that the message is split into 32 equal and adjacent frequency bands, called sub-bands.

• This is done digitally, by means of filters.
• MPEG standards allow for various sampling frequencies to be used.

Page 27: T325: Technologies for digital media

MPEG audio coding – Example: the 48 kHz option

• Each sub-band has a width of 0.75 kHz, giving a total bandwidth of 0.75 × 32 = 24 kHz, half the sampling frequency, this being the assumed bandwidth of the original signal.

• The filtering process reduces the number of samples for each sub-band, so that the effective sampling frequency is 1.5 kHz, that is, 1500 samples per second (twice the bandwidth of each sub-band).

• The total number of samples is 32 × 1500 = 48000 per second, which is the same as for the original signal (these figures are checked in the sketch after this list).

• The filtering process, by itself, does not reduce the number of bits to be processed, but it does allow this to be done in several ways.

• We are more sensitive to frequencies in the range 1 to 5 kHz than we are outside this range!
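
The figures in this example can be checked directly; a plain restatement in code:

```python
fs = 48_000            # sampling frequency of the original signal (Hz)
n_sub_bands = 32       # number of MPEG sub-bands

signal_bandwidth = fs / 2                        # assumed: 24 000 Hz
sub_band_width = signal_bandwidth / n_sub_bands  # 750 Hz per sub-band
sub_band_rate = 2 * sub_band_width               # 1500 samples/s per band
total_rate = n_sub_bands * sub_band_rate         # 48 000 samples/s overall

print(sub_band_width, sub_band_rate, total_rate)  # 750.0 1500.0 48000.0
```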

Page 28: T325: Technologies for digital media

MPEG audio coding

• Relative audio sensitivity of humans: signal levels above the curve are audible.

• The curve itself represents the perception threshold.
• Below this threshold, we do not hear the sounds.

Page 29: T325: Technologies for digital media

MPEG audio coding

• If the signal level in a sub-band is below the threshold for the frequencies covered by that sub-band, then that part of the signal will not be perceived and the samples for that sub-band do not need to be transmitted.

• The next step relies on a perception phenomenon known as noise masking, the result of two effects:
• Frequency masking
• Temporal masking

Page 30: T325: Technologies for digital media

MPEG audio coding

• The frequency masking effect arises because a relatively loud sound at a particular frequency reduces our sensitivity (raises the threshold) for neighboring frequencies.

• The masking effect decreases as we move away from the frequency of the sound that causes the masking.
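
A toy model of the effect; the linear (in dB) fall-off and its slope are illustrative assumptions, not the curve used by the MPEG psycho-acoustic model:

```python
def masked_threshold_db(quiet_threshold_db, masker_hz, masker_db,
                        probe_hz, slope_db_per_khz=10.0):
    """Toy frequency masking: a loud masker raises the hearing
    threshold nearby, and the raised threshold falls off linearly
    (in dB) with the distance in frequency from the masker."""
    distance_khz = abs(probe_hz - masker_hz) / 1000.0
    raised = masker_db - slope_db_per_khz * distance_khz
    return max(quiet_threshold_db, raised)

# A 60 dB masker at 1 kHz raises the threshold at 1.5 kHz to 55 dB;
# far from the masker, the threshold in quiet takes over again.
```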

Page 31: T325: Technologies for digital media

MPEG audio coding

• The temporal masking effect: Sensitivity to sounds in a narrow frequency range is reduced for a short period, of the order of a few milliseconds, before and after the presence of a relatively strong sound in that frequency range.

Page 32: T325: Technologies for digital media

MPEG audio coding

• Within a relatively narrow frequency band, a signal component in that band will mask noise components with adjacent frequencies to an extent that decreases on either side of the masking signal.

• This effect takes place in what are called critical bands whose width increases with frequency.

Noise masking in a critical frequency band

Page 33: T325: Technologies for digital media

[Figure: noise masking – a masker raises the threshold in quiet to a masked threshold, below which the masked sounds are inaudible]

Page 34: T325: Technologies for digital media

MPEG audio coding

• The width of the critical bands ranges from 100 Hz at low audio frequencies to about twice that above 500 Hz.

• Research indicates that there are about 24 critical bands in the audio range.

• They do not coincide exactly with the 32 MPEG sub-bands, but the match is sufficiently good for advantage to be taken of the effect in the following way:
• The number of bits used per sample needs to be sufficient to keep the quantization noise below the noise threshold.
• If the number of bits is reduced, then the quantization noise will increase.
• This noise will be distributed over the whole signal frequency spectrum.

Page 35: T325: Technologies for digital media

MPEG audio coding

• However, if the noise masking effect causes a sufficient increase in the thresholds, then it may be possible to decrease the number of bits needed for the samples in some sub-bands without causing a perceivable increase in noise.

• If noise masking causes the threshold to rise above the sample values in a sub-band, then these samples will no longer need to be transmitted.

• Information about the characteristics of audio perception – the variation of threshold with frequency, and the threshold increases caused by frequency and temporal masking – is codified in the form of a psycho-acoustic model which is incorporated into MPEG encoders.

Page 36: T325: Technologies for digital media

MPEG audio encoder

Page 37: T325: Technologies for digital media

MPEG audio coding

• The output of each of the 32 sub-band filters is requantized under the control of the psycho-acoustic model element, which is fed with the relevant features of the overall signal.

• The psycho-acoustic model generates a masking curve.

• Samples from sub-bands below their respective thresholds represented by the masking curve are suppressed altogether.

• The number of bits per sample from other sub-bands is reduced to the extent allowed by the noise masking effects, for which the relevant information is built into the model (both decisions are sketched below).
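
A compact sketch of these two decisions per sub-band; the dB bookkeeping and the rough 6 dB-per-bit rule are illustrative, not the model's actual tables:

```python
def allocate_sub_band_bits(levels_db, masked_thresholds_db,
                           db_per_bit=6.0, max_bits=16):
    """Per sub-band: transmit nothing if the signal sits below the
    masking curve; otherwise allocate roughly enough bits to keep
    quantization noise under the threshold (~6 dB of SNR per bit)."""
    allocation = []
    for level, threshold in zip(levels_db, masked_thresholds_db):
        if level <= threshold:
            allocation.append(0)      # sub-band suppressed altogether
        else:
            bits = int((level - threshold) / db_per_bit) + 1
            allocation.append(min(bits, max_bits))
    return allocation

# e.g. levels [70, 40, 55] dB against thresholds [30, 45, 50] dB
# give allocations of [7, 0, 1] bits per sample.
```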

Page 38: T325: Technologies for digital media

MPEG audio coding

Page 39: T325: Technologies for digital media

MPEG audio coding

• Further compression is obtained by taking into account the range of levels covered by the signal in each sub-band.

• For instance, if use of the psycho-acoustic model indicated that 6-bit samples in a sub-band would be adequate, and 000001 were used to represent the quantized value of a 0.1 mV sample, then 111111 would represent a sample having 63 times this value; that is, 6.3 mV.

• If 000001 represented a 10 mV sample, then 111111 would represent a 630 mV sample.

Page 40: T325: Technologies for digital media

MPEG audio coding

• In order to allow for the variation in the range of values which signals in different sub-bands may take, further information is needed.

• This is conveyed in the form of scaling factors.

• A 6-bit number is used as the scaling factor for each sub-band, giving 64 different possible values of the factor.

• The magnitude ratio between any two consecutive scaling factor values corresponds to a 2 dB difference in the audio signals, thus covering a 128 dB dynamic range.

• The scaling factor indicates the magnitude of the step sizes for the quantized samples (see the sketch below).
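
A sketch of how a scaling factor could set the quantizer range for a sub-band; the 2 dB spacing and the 6-bit factor come from the slides, while the quantizer details are illustrative:

```python
def quantize_with_scale_factor(samples, scale_index, n_bits):
    """Quantize one sub-band's samples with a step size set by a 6-bit
    scaling factor; consecutive indices differ by 2 dB, so the 64
    values span the quoted 128 dB range. Magnitudes only, for brevity."""
    full_scale = 10 ** (scale_index * 2 / 20)   # 2 dB per index step
    n_levels = 2 ** n_bits - 1
    step = full_scale / n_levels                # quantizer step size
    return [min(n_levels, round(abs(s) / step)) for s in samples]

# The receiver reverses this: sample ≈ code * step, with step derived
# from the transmitted scale_index and the number of bits per sample.
```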

Page 41: T325: Technologies for digital media

MPEG audio coding

• For each sub-band, besides the scaling factor, the receiver needs to know the number of bits used to quantize the samples; this is conveyed as a 4-bit number, which allows up to 16 bits per sample to be used.

• We are less sensitive to level differences at high frequencies than at low ones. So, the number of bits used to quantize samples is decreased for the higher frequency sub-bands.

• At the receiving end, the decoder reverses the requantization process by applying the scaling factors for each sub-band to the samples in that band.

• A bank of digital filters is then used to recombine the samples for all 32 sub-bands into a single decoded audio signal.

Page 42: T325: Technologies for digital media

MPEG audio decoder

• In addition to the audio data itself (including samples and scaling factors), frames carry additional control information (including error control).

• A number of audio, video and synchronization signals are combined in a transport stream.

Page 43: T325: Technologies for digital media

Why so many audio standards?

• The most important characteristics of the three MPEG-1/2 coding layers:

• Layer I is known as 'pre-MUSICAM'. The encoder can operate at one of 14 fixed output bit rates ranging from 32 to 448 kbit/s. The rate required for a hi-fi audio channel is 192 kbit/s. The encoder and decoder are relatively simple.

• Layer II, the standard decoder for the European digital video broadcasting system, uses the algorithm known as MUSICAM with the same psycho-acoustic model as layer I.

• For the same perceived audio quality, its output bit rate is 30–50% of that of the layer I encoder, requiring 128 kbit/s per channel for hi-fi quality.

• Layer II uses a decreasing number of sample quantization bits with increasing sub-band frequency.

• Layer II can encode input streams sampled at 32, 44.1 and 48 kHz. The output bit rate options are 64, 96, 128 and 192 kbit/s.

Page 44: T325: Technologies for digital media

Why so many audio standards?

• Layer III (MP3) uses a different psycho-acoustic model with many more sub-bands than in layers I/II, and Huffman coding for better compression.

• It also uses a DCT in addition to the sub-band coding of the other two layers.

• Compressed hi-fi requires only 64 kbit/s per channel.

• The compression is roughly twice that of the layer II encoder, but its structure is much more complex.

• In MPEG-1, two audio channels are coded. They can be used independently or as a stereo pair.

• MPEG-2 can handle five channels. This allows for the transmission of surround sound.

Page 45: T325: Technologies for digital media

Why so many audio standards?

• Another standard gaining increasing importance is Advanced Audio Coding (AAC).

• AAC was developed from 1994 onwards as an MPEG-2 option providing better compression and quality than even MPEG-1/2 layer III (MP3).

• It has a lot in common with MP3, and is designed to be backwards compatible.

• It includes many features which offer significantly lower bit rates for the same audio quality.

• AAC is used in MPEG-4.

Page 46: T325: Technologies for digital media

MPEG audio coding

• A further development known as AAC+ will replace layer II encoding in the forthcoming new DAB system.

• Other, non-MPEG, audio coding standards at the time of writing are Windows Media Audio and Ogg Vorbis, both of which were also designed to improve on MP3.

• So, why so many different standards?

• One reason is that the continuing improvement in processor performance and lowering of costs has meant that it has become possible to develop more and more complex algorithms or to include additional features.

• There is also a need to maintain 'legacy' techniques.

Page 47: T325: Technologies for digital media

Source multiplexing

• The audio and video bitstreams often need to be combined.

• Timing information must also be provided to control the scanning process in the display device and to synchronise the sound and picture: lip movement with speech in particular!

Page 48: T325: Technologies for digital media

Source multiplexing

• The bitstreams at the output of the video and audio encoders are known as elementary streams (ESs).

• Each stream is segmented into packets to form the video and audio packetized elementary streams (PESs).

Page 49: T325: Technologies for digital media

Source multiplexing

• The information carried by the packet headers includes a leading fixed start code, used for framing the packet, and fields indicating stream identification, packet length and whether the data is scrambled.

• In the case of a single program, the packets are multiplexed to form a program stream.

• In DVB, a group of programs is multiplexed and modulated onto a radio-frequency carrier for transmission over a terrestrial or satellite radio link, or via a cable.

• The multiplexed stream is known as the transport stream.

Page 50: T325: Technologies for digital media

Source multiplexing

Page 51: T325: Technologies for digital media

Source multiplexing

• The packets that make up the PESs have variable lengths and can be relatively long.

• The radio-frequency channels that carry the transport stream tend to have relatively high BERs (higher than 10⁻⁴), so that error control is necessary.

• This puts a limit on the size of transport packets, as retransmitting a long packet of which only a small part is errored would be very wasteful.

• Error control is easier with equal-length packets.

• Transport packets are all 188 bytes long, consisting of a 4-byte header and a 184-byte payload.

Page 52: T325: Technologies for digital media

Source multiplexing

• The PES packets, including the header, are broken up into 184-byte sections, each of which forms the payload of a transport stream packet.

• MPEG-2 specifies that each transport packet may only carry data from one PES packet.

• If the PES data does not fill the last transport packet in the group allocated to the PES packet, then the byte count is made up to 184 by means of an adaptation field.

• The adaptation field is located at the start of the last transport packet of the set (the bookkeeping is sketched below).
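
The bookkeeping can be sketched as follows; only the payload and adaptation-field arithmetic follows the text, and the header syntax is ignored:

```python
import math

def map_pes_to_transport(pes_length, payload_size=184):
    """Split a PES packet of pes_length bytes across 184-byte transport
    payloads; if the data does not fill the last packet, the shortfall
    is made up by an adaptation field whose first byte gives the number
    of bytes that follow it."""
    n_packets = math.ceil(pes_length / payload_size)
    data_in_last = pes_length - (n_packets - 1) * payload_size
    adaptation_size = payload_size - data_in_last  # 0 if it fits exactly
    length_byte = adaptation_size - 1 if adaptation_size > 0 else None
    return n_packets, data_in_last, adaptation_size, length_byte

# e.g. a 400-byte PES maps to 3 transport packets carrying
# 184 + 184 + 32 data bytes; the last packet starts with a
# 152-byte adaptation field whose first byte holds the value 151.
```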

Page 53: T325: Technologies for digital media

Source multiplexing

Mapping PES packets onto transport stream

Page 54: T325: Technologies for digital media

Source multiplexing

• Each transport packet has a 4-byte header which includes:
• a framing byte
• a counter used to number consecutive portions of a segmented PES packet
• a flag to indicate the presence of an adaptation field
• an indication of the packet type.

• If an adaptation field is present, then the first byte of the adaptation field indicates the length of that field.

• Besides being used for stuffing the transport packet, the adaptation field can carry other data needed for the operation of the system, such as the program clock reference (PCR) used for synchronization.

Page 55: T325: Technologies for digital media

Source multiplexing

• When several programs are multiplexed, a great deal of information is needed by the receiver in order for it to be able to select and decode the data stream for the chosen program.

• This information is carried by the transport stream as sets of control data called tables that are split into sections.

• The presence of a table section in a transport packet is indicated by a 1-bit flag in the packet header.

• The main tables are transmitted between 10 and 50 times per second.

Page 56: T325: Technologies for digital media

TEST YOUR KNOWLEDGE


Page 57: T325: Technologies for digital media

Multiple choice questions

• If the input to a layer I MPEG audio encoder is the bit stream for one channel of a standard music CD and the encoder output rate is 192 kbit/s, what is the compression ratio? Assume music CD signals consist of 16-bit samples with a sampling frequency of 44.1 kHz.

a. 0.272

b. 2.72

c. 3.68

d. 5.13

e. 6.77

• In MPEG, part of the audio coding relies on a perception phenomenon known as

a. Noise masking

b. Frequency masking

c. Temporal masking

d. Signal masking

e. Signal to noise ratio.


Page 58: T325: Technologies for digital media

Short questions

• Describe two aspects where human auditory perception is exploited in audio compression. (Final Exam – Fall 2011)

• Explain the three types of MPEG pictures.

• To achieve high levels of compression, MPEG squeezes out as much redundancy as possible. This is done in two stages. Explain what these two stages are.

• Noise masking occurs in two ways in audio coding. Explain both of them.

• Explain motion estimation as found in the MPEG encoder. (MTA – Fall 2011)


Page 59: T325: Technologies for digital media

Exercises

• Final Exam – Fall 2011


Page 60: T325: Technologies for digital media

Exercises

• As you have seen in Block one, chapter five, the Discrete Cosine Transform (DCT), requantization, zig-zag scanning and run-length coding constitute four consecutive stages of the JPEG encoding process, which is also used in MPEG coding. Consider an MPEG encoder whose requantization matrix is specified in matrix 1. If the output of the DCT block is that in matrix 2, what will be the output of the run-length coding, given that the coded DC term of the contiguous block is 110? Detail each step separately.


Page 61: T325: Technologies for digital media

Exercises

• Consider a Packetized Elementary Stream (PES) packet of 500 bytes to be mapped onto a transport stream. Answer the following questions (answers should be detailed; merely stating the final answer is NOT acceptable).

• Calculate the number of transport stream packets, and indicate the number of payload data bytes in each packet

• Is there any adaptation field in the transport stream? If yes, indicate the location and calculate the size of this field

• What will be the value of the first byte of the adaptation field? Give your answer in decimal and binary format.

(Final Exam – Fall 2011)
