
Multiplexing the elementary streams of H.264 video and

MPEG4 HE AAC v2 audio using MPEG2 systems

specification, demultiplexing and achieving lip

synchronization during playback

by

NAVEEN SIDDARAJU

Presented to the Faculty of the Graduate School of

The University of Texas at Arlington in Partial Fulfillment

of the Requirements

for the Degree of

MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

Nov 2010


Copyright © by Naveen Siddaraju 2010

All Rights Reserved


ACKNOWLEDGEMENTS

I am greatly thankful to my supervising professor Dr. K. R. Rao, whose constant encouragement, guidance and support have helped me in the smooth completion of the project. He has always been accessible and helpful throughout. I also thank

him for introducing me to the field of multimedia processing.

I would like to thank Dr. W. Alan Davis and Dr. William E. Dillon for taking an interest in my project and agreeing to be part of my project defense committee.

I am forever grateful to my parents for their unconditional support at each turn of

the road. I thank my brother and sisters, who have always been a source of

inspiration. I would like to thank my friends both in US and in India for their

encouragement and support.

November 22, 2010


ABSTRACT

MULTIPLEXING THE ELEMENTARY STREAMS OF H.264 VIDEO AND

MPEG4 HE AAC v2 AUDIO USING MPEG2 SYSTEMS SPECIFICATION,

DEMULTIPLEXING AND ACHIEVING LIP SYNCHRONIZATION DURING

PLAYBACK

Naveen Siddaraju, MS

The University of Texas at Arlington, 2010

Supervising Professor: Dr. K. R. Rao

Delivering broadcast quality content to mobile customers is one of the

most challenging tasks in the world of digital broadcasting. Limited network

bandwidth and processing capability of the handheld devices are critical factors

that should be considered. Hence selection of the compression schemes for the

media content is very important from both economic and quality points of view.

H.264, which is also known as Advanced Video Coding (AVC) [1], is the latest

and the most advanced video codec available in the market today. The H.264

baseline profile which is used in applications such as mobile television (mobile

DTV) broadcast has one of the best compression ratios among the other profiles

and requires the least processing power at the decoder. The audio MPEG4 HE

AAC v2 [2] which is also known as enhanced aacplus, is the latest audio codec

belonging to the AAC (advanced audio codec) [3] family. In addition to the core

AAC, it uses the latest tools such as Spectral Band Replication (SBR) [2] and

Parametric Stereo (PS) [2] resulting in the best perceived quality for the lowest


bitrates. The audio and video codec standards have been chosen based on ATSC-

M/H (advanced television systems committee – mobile handheld) [17].

For the television broadcasting applications such as ATSC-M/H, DVB [16] the

encoded audio and video streams should be transmitted in a single transport

stream containing fixed sized data packets, which can be easily recognized and

decoded at the receiver. The goal of the project is to implement a multiplexing

scheme for the elementary streams of H.264 baseline and HE AAC v2 using the

MPEG2 systems specifications [4], then demultiplex the transport stream and

playback the decoded elementary stream with lip synchronization or audio-video

synchronization. The multiplexing involves two layers of packetization of the

elementary streams of audio and video. The first level of packetization results in

Packetized Elementary Stream (PES) packets, which are variable size packets and

hence not suitable for transport. MPEG2 defines a transport stream where PES

packets are logically organized into fixed size packets called the Transport Stream

(TS) packets, which are 188 bytes long. These packets are continuously generated

to form a transport stream, which is decoded by the receiver and the original

elementary streams are reconstructed. The PES packets, which are logically encapsulated into the TS packets, contain the time stamp information which is used at the de-multiplexer to achieve synchronization between the audio and video elementary streams.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS……………………………………………………………………………… iii

ABSTRACT…………………………………………………………………………………………………. iv

LIST OF FIGURES…………………………………………………………………………………….... ix

LIST OF TABLES…………………………………………………………………………………………. xi

ACRONYMS AND ABBREVIATIONS…………………………………………………………….. xii

Chapter

1. INTRODUCTION…………………………………………………………………………………… 1

2. OVERVIEW OF H.264……………………………………………………………………………. 2

2.1 H.264/ AVC………………………………………………………………………………. 2

2.2 Coding structure……………………………………………………………………….. 2

2.3 Profiles and levels……………………………………………………………………. 3

2.4 Description of various profiles ………………………………………………… 4

2.4.1 Baseline Profile…………………………………………………………... 4

2.4.2 Extended profile………………………………………………………….. 5

2.4.3 Main Profile………………………………………………………………… 5

2.4.4 High Profiles………………………………………………………………… 5

2.5 H.264 encoder and decoder……………………………………………………. 6

2.5.1 Intra prediction…………………………………………………………… 8

2.5.2 Inter prediction…………………………………………………………… 9

2.5.3 Transform and quantization………………………………………… 10

2.5.4 Entropy coding…………………………………………………………… 10


2.5.5 Deblocking filter………………………………………………………… 11

2.6 H.264 bitstream……………………………………………………………………. 11

3. OVERVIEW OF HE AAC V2…………………………………………………………………. 16

3.1HE AAC v2……………………………………………………………………………… 16

3.2 Spectral Band Replication (SBR)……………………………………………. 18

3.3 Parametric Stereo (PS)………………………………………………………… 19

3.4 Enhanced aacplus encoder………………………………………………….. 20

3.5 Enhanced aacplus decoder………………………………………………….. 22

3.6 Advanced Audio Coding (AAC)……………………………………………. 23

3.6.1 AAC encoder…………………………………………………………. 23

3.7 HE AAC v2 bitstream formats…………………………………………….. 27

4. TRANSPORT PROTOCOLS………………………………………………………………. 30

4.1 Introduction……………………………………………………………………… 30

4.2 Real-Time protocol (RTP)…………………………………………………… 30

4.3 MPEG2 systems layer…………………………………………………………. 31

4.4 Packetized elementary stream (PES)………………………………….. 32

4.4.1 PES encapsulation process………………………………………………. 34

4.5 MPEG Transport stream (MPEG- TS)…………………………………. 35

4.6 Time stamps……………………………………………………………………… 38

5. MULTIPLEXING…………………………………………………………………………. 42

6. DE MULTIPLEXING……………………………………………………………………. 48

6.1 Lip or audio-video synchronization………………………………… 51


7. RESULTS………………………………………………………………………………….. 55

7.1 Buffer fullness…………………………………………………………….. 55

7.2 Synchronization/skew calculation……………………………….. 56

8. CONCLUSIONS …………………………………………………………………………. 59

9. FUTURE WORK………………………………………………………………………… 59

References…………………………………………………………………………………. 60


LIST OF FIGURES

Fig 2.1. Video data organization in H.264 [42].

Fig 2.2. Specific coding parts of the profiles in H.264 [5].

Fig 2.3 Different YUV systems.

Fig 2.4. H.264 encoder [5].

Fig 2.5. H.264 decoder [5].

Fig2.6. Intra prediction modes for 4X4 luma in H.264

Fig2.7. Different layers of JVT coding.

Fig2.8. NAL formatting of VCL and non-VCL data [6].

Fig2.9. NAL unit format [6].

Fig2.10. Relationship between parameter sets and picture slices [24].

Fig3.1: HE AAC audio codec family

Fig 3.2: Typical bitrate ranges of HE-AAC v2, HE-AAC and AAC for stereo [7]

Fig 3.4 Original audio signal [28].

Fig 3.5 High band reconstruction through SBR [28].

Fig3.6: Enhanced aacplus encoder block diagram [9]

Fig3.7: Enhanced aacplus decoder block diagram [9]

Fig 3.8 AAC encoder block diagram [10]

Fig 3.9. ADTS elementary stream

Fig 4.1. RTP packet structure (simplified) [22]


Fig 4.2. MPEG2 transport stream [22]

Fig 4.3. Conversion of an elementary stream into PES packets [29]

Fig4.4. A standard MPEG-TS packet structure [14].

Fig4.5. Transport stream (TS) packet format used in this project.

Fig5.1. Overall multiplexer flow diagram

Fig5.2. Flow chart of video processing block

Fig 5.3. Flow chart of audio processing block.

Fig6.1. Flow chart for the de-multiplexer used


LIST OF TABLES

Table2.1. NAL Unit types.

Table3.1 ADTS header format [2] [3]

Table3.2 Profile bits expansion [2] [3]

Table4.1. PES packet header format used [4].

Table7.1. Video and audio buffer sizes and their respective playback times

Table 7.2. Characteristics of test clips used

Table 7.3: Demultiplexer output.


ACRONYMS AND ABBREVIATIONS

3GPP Third generation partnership project

AAC Advanced audio coding

AAC LTP Advanced audio coding - long term prediction

ADIF Audio data interchange format

ADTS Audio data transport stream

AFC Adaptation field control

ATM Asynchronous transfer mode

ATSC Advanced television systems committee

ATSC-M/H Advanced television systems committee- mobile /handheld

AVC Advanced Video Coding

CABAC Context-based Adaptive Binary Arithmetic Coding

CAVLC Context-based Adaptive Variable Length Coding

CC Continuity counter

CIF Common intermediate format

CRC Cyclic redundancy check

DCT Discrete cosine transform

DPCM Differential pulse code modulation

DVB Digital video broadcasting

DVD Digital video disc

EI Error indicator

ES Elementary stream

FMO Flexible macro block order

GOP Group of pictures

HDTV High definition television

HE AACv2 High efficiency advanced audio codec version 2

IC Inter-channel Coherence

IDR Instantaneous decoder refresh

IID Inter-channel Intensity Differences

IP Internet protocol

IPD Inter-channel Phase Differences

ISDB Integrated Services Digital Broadcasting system


I-slice Intra predictive slice

ISO International Standards Organization

ITU International Telecommunication Union

JM Joint model

JVT Joint Video Team

M4A Moving picture experts group four file format audio only

MB Macro blocks

MC Motion compensation

MDCT Modified discrete cosine transform

ME Motion estimation

MP4 Moving picture experts group four file format

MPEG Moving Picture Experts Group

MPTS Multi program transport stream

NALU Network Abstraction Layer Unit

OPD Overall Phase Difference

PCR Program clock reference

PES Packetized elementary stream

PID Packet identifier

PMT Program map table

PPS Picture parameter set

PS Parametric stereo

P-slice Predictive slice

PTS Presentation time stamp

PUSI Payload unit start Indicator

QCIF Quarter common intermediate format

QMF Quadrature mirror filter banks

RTP Real time protocol

SBR Spectral band replication

SCI Simplified chrominance intra prediction

S-DMB Satellite - Digital multimedia broadcasting

SDTV Standard definition television

SEI Supplemental enhancement information

SPS Sequence parameter set

SPTS Single program transport stream


STC System timing clock

TCP Transmission control protocol

TNS Temporal noise shaping

TS Transport stream

UDP User datagram protocol

VC1 Video Codec1

VCEG Video Coding Experts Group

VCL Video coding layer

VLC Variable length coding

VUI Video usability information

YUV Luminance and chrominance color components


CHAPTER 1

INTRODUCTION

Mobile broadcast systems are increasingly important as cellular phones and

highly efficient digital video compression techniques merge to enable digital TV

and multimedia reception on the move. There are several digital mobile TV broadcast systems in the market. Major ones are DVB-H (digital video broadcast-

handheld) [16] and ATSC-M/H (advanced television systems committee-

mobile/handheld) [17] [18]. Both DVB-H and ATSC-M/H have a relatively smaller

channel bandwidth allocation (~14 Mbps for DVB-H and ~19.6 Mbps for ATSC-

M/H), so the choice of multimedia compression standard and transport protocol

becomes very important. DVB-H specifies the use of either VC1 [19] or H.264 [1]

compression standard for video and AAC [3] audio. ATSC-M/H specifies H.264

baseline profile for video and HEAACv2 [2] for audio. The transport protocol is

usually RTP (real time protocol) [20] or MPEG2 part 1 systems [4]. The MPEG-2

systems specification [4] describes how MPEG-compressed video and audio data

streams may be multiplexed together with other data to form a single data

stream suitable for digital transmission or storage. Two alternative streams are

specified for the MPEG-2 systems layer. The program stream is used for storage of

multimedia content like DVD while the transport stream is intended for the

simultaneous delivery of a number of programs over potentially error-prone

channels.

In this project, the compression standards used are H.264 baseline profile for

video and HEAACv2 for audio, as specified by the ATSC-M/H standard.

Distribution is achieved through the MPEG-2 part 1 systems specifications

transport stream. Chapters 2 and 3 give a brief overview of the H.264 and HEAACv2

compression standards respectively. Chapter 4 explains the transport stream


protocol used in this project. Chapters 5 and 6 explain the multiplexing and de-

multiplexing schemes used in this project.

All the results are tabulated in chapter 7. Chapters 8 and 9 outline the conclusions

of the project and future work respectively.


CHAPTER 2

OVERVIEW OF H.264

2.1 H.264/ AVC

H.264 is the latest and the most advanced video codec available today. It

was jointly developed by the VCEG (video coding experts group) of ITU-T

(international telecommunication union) and the MPEG (moving pictures experts

group) of ISO/IEC (international standards organization). This standard achieves

much greater compression than its predecessors like MPEG-2 video [37], MPEG4

part 2 [38] etc. But the higher coding efficiency comes at the cost of increased

complexity. H.264 has been adopted as the video standard for many

applications around the world including ATSC. The H.264 baseline profile with

some restrictions is the adopted standard for ATSC-M/H or ATSC mobile DTV [40].

2.2 Coding structure:

The basic coding structure of H.264 is similar to that of the earlier standards

(MPEG-1, MPEG-2) and is commonly referred to as motion-compensated

transform coding structure. Coding of video is performed picture by picture. Each

picture is partitioned into a number of slices, each of which is a sequence of macroblocks.

Each slice is coded independently in H.264. It is possible that a picture can have

just one slice. A macroblock (MB) consists of a 16X16 luminance (Y) component and two associated chrominance (Cb, Cr) components. Each macroblock’s 16X16

luminance can be partitioned into 16 X 16, 16 X 8, 8 X 16, and 8 X 8 units, and

further, each 8 X 8 luminance can be sub-partitioned into 8 X 8, 8 X 4, 4 X 8 and 4

X 4. The 4 X 4 sub-macroblock partition is called a block. The hierarchy of video

data organization is shown in figure 2.1.


Fig2.1 Video data organization in H.264 [42]

There are basically three types of slices: I (intra predictive), P (predictive) and B (bipredictive) slices. I slices are strictly intra coded, i.e. macroblocks (MBs) are compressed without using any motion prediction from earlier slices. A special type of picture containing only I-slices is called an instantaneous decoder refresh (IDR)

picture. Any picture following the IDR picture does not use the pictures prior to

IDR for its motion prediction. IDR pictures can be used for random access or as

entry points for a coded sequence [6]. P-slices on the other hand contain

macroblocks which use motion prediction. The MBs of a P-slice use only one frame as reference (either from the past or the future) for their motion prediction.

2.3 Profiles and levels:

Profiles and levels specify restrictions on bit streams and hence limits on

the capabilities needed to decode the bit streams. Profiles and levels may also be

used to indicate interoperability points between individual decoder

implementations. For any given profile, levels generally correspond to decoder

processing load and memory capability. Each level may support a different picture

size– QCIF, CIF, ITU-R 601 (SDTV), HDTV etc. Also each level sets the limits for data

bitrates, frame size, picture buffer size, etc [5].


H.264/AVC is purely a video codec with as many as seven profiles. Three profiles, main, baseline and extended, were included in its first release. Four new profiles were added in subsequent releases, defined in the fidelity range extensions, for applications such as content distribution, content contribution, and studio editing and post-processing [5]. Profiles and their specific tools and

common features are shown in fig 2.2.

Fig 2.2. Specific coding parts of the profiles in H.264 [5].

It can be noted that I-slice, P-slice and CAVLC (Context-based Adaptive Variable

Length Coding) entropy coding are common to all the profiles.

2.4 Description of various profiles:

2.4.1 Baseline Profile:

Baseline profile supports coded sequences containing I and P slices. Apart

from the common features, baseline profile consists of some error resilience tools

such as Flexible Macro Block order (FMO), arbitrary slice order and redundant


slices. It was designed for low delay applications, as well as for applications that

run on platforms with low processing power and in high packet loss environments.

Among the three profiles, it offers the least coding efficiency [6]. The baseline

profile caters to applications such as video conferencing and mobile television

broadcast video. This project uses baseline profile for video encoding since it is

specified by ATSC for mobile digital television.

2.4.2 Extended profile:

The extended profile is a superset of the baseline profile. Besides tools of

the baseline profile it includes B-, SP- and SI-slices, data partitioning, and interlace

coding tools. SP and SI are specially coded P and I slices respectively, which are

used for efficient switching between different bitrates in some streaming

applications. It however does not include CABAC. It is thus more complex but also

provides better coding efficiency. Its intended application is streaming video over the internet [6].

2.4.3 Main Profile:

Other than the common features, the main profile includes tools such as CABAC entropy coding and B-slices. It does not include error resilience tools such as FMO. The main profile is used in broadcast television and high resolution video storage and playback. It also contains interlaced coding tools like the extended profile.

2.4.4 High Profiles:

The high profiles are supersets of the main profile. They include additional tools such as adaptive transform block size and quantization scaling matrices. High profiles are


used for applications such as content-contribution, content-distribution, and

studio editing and post-processing [5]. Four different high profiles are described

below:

High Profile - supports the 8-bit video with 4:2:0 sampling for applications using

high resolution.

High 10 Profile – supports the 4:2:0 sampling with up to 10 bits of representation

accuracy per sample.

High 4:2:2 Profile - supports up to 4:2:2 chroma sampling and up to 10 bits per

sample.

High 4:4:4 Profile – supports up to 4:4:4 chroma sampling, up to 12 bits per

sample, and integer residual color transform for coding RGB signal. Different YUV

formats are shown in fig 2.3.

Fig 2.3 Different YUV systems.


For any given profile, a level corresponds to various data bit rates, frame size,

picture buffer size, etc.

2.5 H.264 encoder and decoder:

The H.264 encoder follows a classic DPCM encoding loop. The encoder may

select between various inter- or intra- prediction modes. Intra- coding uses up to

nine prediction modes to reduce spatial redundancy for the single picture. Inter-

coding is more efficient than intra-coding and is used in B and P frames. Inter-

coding uses motion vectors for block-based inter-prediction to reduce temporal

redundancy among different pictures [5]. The de-blocking filter is used to reduce

the blocking artifacts. The predicted signal is then subtracted from the input sequence to get a residual, which is further compressed by applying an integer transform. This removes the spatial correlation between the pixels. The

resulting signal is given to the quantization block. Finally the quantized transform

coefficients, motion vectors, intra prediction modes, control data etc are given to

the entropy coding block. There are basically two types of entropy encoders in H.264: CAVLC (context adaptive variable length coding) and CABAC

(context adaptive binary arithmetic coding). The encoder and decoder for H.264

are shown in figures 2.4 and 2.5 respectively.


Fig 2.4. H.264 encoder [5].

The decoder works in the reverse direction: the encoded bitstream is entropy decoded and then passed to the inverse quantization and inverse transform blocks.

Fig 2.5. H.264 decoder [5]


2.5.1 Intra prediction:

In the intra-coded mode, a prediction block is formed based on previously

reconstructed (but, unfiltered for deblocking) blocks of the same frame. The

residual signal between the current block and the prediction is finally encoded.

All macroblocks in an I-slice are intra-coded. Macroblocks having unacceptable temporal correlation in P and B slices are also intra coded. Essentially, intra-coded macroblocks introduce a large number of coded bits, which is a bottleneck for reducing the bitrate. For the luma samples, the prediction block may be

formed for each 4 X 4 sub block, each 8 X 8 block, or for a 16 X16 macroblock.

There are a total of 9 prediction modes for 4 X 4 and 8 X 8 luma blocks; 4 modes

for a 16 X16 luma block; and four modes for each chroma block. Figure 2.6 shows

the intra prediction modes for 4X4 luma. There are basically nine different modes.

For mode 0 (vertical) and mode 1 (horizontal), the predicted samples are formed

by extrapolation from upper samples [A, B, C, D] and from left samples [I, J, K, L]

respectively. For mode 2 (DC), all of the predicted samples are formed by the

mean of the upper and left samples [A, B, C, D, I, J, K, L]. For mode 3 (diagonal

down left), mode 4 (diagonal down right), mode 5 (vertical right), mode 6

(horizontal down), mode 7 (vertical left), and mode 8 (horizontal up), the

predicted samples are formed from a weighted average of the prediction samples

A–M.


Fig 2.6. Intra prediction modes for 4X4 luma in H.264 [39].

For prediction of each 8X8 luma block, one mode is selected from the 9 modes, similar to the 4X4 intra-block prediction. For prediction of all 16X16 luma components of a macroblock, four modes are available. For mode 0 (vertical), mode 1 (horizontal) and mode 2 (DC), the predictions are similar to the 4X4 luma case. For mode 3 (plane), a linear plane function is fitted to the upper and left samples.
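To make the mode arithmetic concrete, the following minimal Python sketch (illustrative only, with hypothetical function names; it is not the JM reference software) forms the vertical, horizontal and DC predictions for a 4X4 luma block from the neighbouring samples A-D (above) and I-L (left) labelled in figure 2.6:

```python
import numpy as np

def intra_4x4_vertical(above):
    # Mode 0: each column is a copy of the sample directly above it (A, B, C, D)
    return np.tile(np.asarray(above, dtype=np.int32), (4, 1))

def intra_4x4_horizontal(left):
    # Mode 1: each row is a copy of the sample to its left (I, J, K, L)
    return np.tile(np.asarray(left, dtype=np.int32).reshape(4, 1), (1, 4))

def intra_4x4_dc(above, left):
    # Mode 2: every sample is the rounded mean of A-D and I-L
    mean = (sum(above) + sum(left) + 4) // 8
    return np.full((4, 4), mean, dtype=np.int32)

# Example: predict a 4x4 block and form the residual that is then transformed
above, left = [52, 55, 58, 60], [50, 51, 53, 54]
current = np.array([[53, 55, 57, 60],
                    [51, 54, 56, 59],
                    [52, 53, 55, 58],
                    [50, 52, 54, 57]])
residual = current - intra_4x4_dc(above, left)
```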

2.5.2 Inter prediction:

This block includes both Motion Estimation (ME) and Motion Compensation

(MC). It generates a predicted version of a rectangular array of pixels, by choosing

another similarly sized rectangular array of pixels from a previously decoded

reference picture, translating the reference array to the position of the

current rectangular array [6]. In H.264, the block sizes for motion prediction

include: 4X4, 4X8, 8X4, 8X8, 16X8, 8X16, and 16X16 pixels (shown in figure 2.1).

Inter-prediction of a sample block can also involve the selection of the frames to

be used as the reference pictures from a number of stored previously decoded

pictures. Reference pictures for motion compensation are stored in the picture

buffer. With respect to the current picture, the pictures before and after the


current picture, in the display order are stored into the picture buffer. These are

classified as short-term and long-term reference pictures. Long-term reference

pictures are introduced to extend the motion search range by using multiple

decoded pictures, instead of using just one decoded short-term picture. Memory

management is required to take care of marking some stored pictures as unused

and deciding which pictures to delete from the buffer for efficient memory

management [5].

2.5.3 Transform and quantization:

The residual signal (prediction error) will have a high spatial redundancy.

AVC like its predecessors uses block based transform (Integer DCT) and

quantization to remove/reduce this spatial redundancy. H.264 uses an adaptive

transform block size, 4X4 and 8X8 (for high profile only). The smaller block size

reduces the ringing artifacts. Also, the 4 X4 transform has the additional benefit of

removing the need for multiplications [5]. For improved compression H.264 also

employs 4X4 Hadamard transform for the DC components of the 4X4 (DCT)

transforms in case of luma 16X16 intra mode and 2X2 Hadamard transform for

chroma DC coefficients.
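As an illustration of the multiplier-free 4X4 transform mentioned above, the sketch below applies the core forward 4X4 integer transform of H.264 to a residual block; the normalisation factors are folded into the quantization step in the real codec and are omitted here, so this is a simplified sketch rather than the project's implementation:

```python
import numpy as np

# Core forward 4x4 integer transform matrix of H.264; all entries are
# +/-1 or +/-2, so the transform needs only additions and shifts.
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward_4x4_transform(residual_block):
    # Y = Cf * X * Cf^T; the per-coefficient scaling that completes the
    # DCT approximation is absorbed into the quantizer and omitted here.
    return Cf @ residual_block @ Cf.T

residual = np.array([[ 5, 11,  8, 10],
                     [ 9,  8,  4, 12],
                     [ 1, 10, 11,  4],
                     [19,  6, 15,  7]])
coefficients = forward_4x4_transform(residual)
```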

2.5.4 Entropy coding:

The predecessors of H.264 (MPEG-1 and MPEG-2) used entropy coding based on fixed tables of variable length codes (VLC). H.264 uses different VLCs to match a symbol to a code based on the context characteristics. All syntax elements except for the residual data are encoded with Exp-Golomb codes [5].

For coding the residual data, a more sophisticated method called CAVLC (context based adaptive variable length coding) is employed. CABAC (context based adaptive binary arithmetic coding) is employed in the main and high profiles; it achieves better coding efficiency, but with higher complexity compared to CAVLC [1].
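The Exp-Golomb code mentioned above is simple enough to show directly. The following minimal Python sketch (written for this discussion, not taken from any reference decoder) encodes and decodes the unsigned code ue(v) used for most non-residual syntax elements:

```python
def ue_encode(v: int) -> str:
    # Unsigned Exp-Golomb: M leading zeros, then the M+1 bits of (v + 1)
    info = bin(v + 1)[2:]
    return '0' * (len(info) - 1) + info

def ue_decode(bits: str, pos: int = 0):
    # Return (value, next bit position) for the ue(v) code starting at pos
    zeros = 0
    while bits[pos + zeros] == '0':
        zeros += 1
    value = int(bits[pos + zeros: pos + 2 * zeros + 1], 2) - 1
    return value, pos + 2 * zeros + 1

# codeNum 0..4 map to the codewords 1, 010, 011, 00100, 00101
assert [ue_encode(v) for v in range(5)] == ['1', '010', '011', '00100', '00101']
assert ue_decode('00100') == (3, 5)
```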

2.5.5 Deblocking filter:

H.264 based systems may suffer from blocking artifacts due to block-based

transform in intra and inter-prediction coding, and the quantization of the

transform coefficients. The deblocking filter reduces the blocking artifacts in the

block boundary and prevents the propagation of accumulated coded noise since

the filter is present in the DPCM loop.

For this project, the H.264 baseline profile at level 1.3 is used as specified

for the ATSC- mobile digital television [41]. The resolution of the video sequence

is 416 pixels X240 lines (aspect ratio 16:9).

2.6 H.264 bitstream:

H.264 video syntax can be broken down into two layers: the Video Coding Layer (VCL), which consists of the video data at the slice layer or below, and the Network Abstraction Layer (NAL), which formats the VCL representation of the video and provides the header information. The NAL also carries additional non-VCL information like sequence parameter sets, picture parameter sets, Supplemental Enhancement Information (SEI), etc., so that the bitstream may be used in a variety of transport systems like the MPEG2 transport stream, IP/RTP systems, etc., or storage media like an ISO file. Figure 2.7 shows the different layers of JVT (joint video team) coding. Figure 2.8 shows the NAL layer formatting of VCL and non-VCL data.


Fig 2.7 Different layers of JVT coding [1].

Fig 2.8 NAL formatting of VCL and non-VCL data [6].

The H.264 bit stream is encapsulated into packets called NAL units (NALU). Each NALU is preceded by a 4-byte start code sequence "0x00000001". The byte following this sequence is the NAL header and the rest is a variable length raw byte sequence payload (RBSP). The NAL header/unit format is shown in figure 2.9. The first bit of the NAL header, called the forbidden bit, is always zero. The next two bits indicate whether the NALU consists of a sequence parameter set, a picture parameter set or the slice data of a reference picture. The next 5

bits indicate the type of NALUs (type indicator), depending upon the type of data

being carried by that NALU. There are 32 different types of NALU. These may be

classified into VCL NALUs and non-VCL NALUs, depending upon the type of data

they carry.

Fig 2.9. NAL unit format [6].

If the type indicator is in the range 1 to 5, it is a VCL NALU; otherwise it is a non-VCL NALU. The different types of NALUs are listed in table 2.1.

NALU types 1-5 are VCL NAL units containing coded VCL data. The rest of the NALUs are called non-VCL NAL units and contain information such as SEI, sequence parameter sets, picture parameter sets etc. Of these NALUs, the IDR picture, sequence parameter set and picture parameter set are particularly important.

An instantaneous decoder refresh (IDR) picture is a picture that is placed at

the beginning of the video sequence. When the decoder receives an IDR picture,

all information is refreshed, which indicates a new coded video sequence; frames prior to this IDR frame are not required to decode the new sequence.

The sequence parameter set contains important header information that

applies to all NALUs in the coded sequence. The picture parameter set contains


important header information that is used for decoding one or more frames in the

coded sequence.

Type indicator   NALU type
0                unspecified
1                coded slice
2                data partition A
3                data partition B
4                data partition C
5                IDR (instantaneous decoder refresh)
6                SEI (supplemental enhancement information)
7                sequence parameter set
8                picture parameter set
9                access unit delimiter
10               end of sequence
11               end of stream
12               filler data
13-23            extended
24-31            undefined

Table 2.1. NAL unit types [6].
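The classification in table 2.1 can be applied directly while scanning an Annex-B byte stream. The Python sketch below (an illustration written for this discussion, assuming the 4-byte start code described above) splits a raw H.264 file into NAL units and reads the 5-bit type indicator from each NAL header:

```python
START_CODE = b'\x00\x00\x00\x01'   # some streams also use 3-byte start codes

def iter_nal_units(data: bytes):
    # Yield (type_indicator, nal_unit_bytes) for each NAL unit in the stream
    starts, pos = [], data.find(START_CODE)
    while pos != -1:
        starts.append(pos)
        pos = data.find(START_CODE, pos + 4)
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(data)
        nalu = data[start + 4:end]        # NAL header byte followed by the RBSP
        yield nalu[0] & 0x1F, nalu        # lower 5 bits of the header = type indicator

# Example usage: count VCL NAL units (types 1 to 5) in a coded clip
# with open('clip.264', 'rb') as f:
#     data = f.read()
# vcl_count = sum(1 for t, _ in iter_nal_units(data) if 1 <= t <= 5)
```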

The relationship between the parameter sets and the slice data is shown in

fig 2.10. Each VCL NAL unit contains an identifier that refers to the content of the

relevant Picture Parameter Set (PPS) and each PPS contains an identifier that

refers to the content of the relevant Sequence Parameter Set (SPS). In this

manner, a small amount of data (the identifier) is used to refer to a larger amount

of information (the parameter set) without repeating that information within

each VCL NAL unit. Sequence and picture parameter sets can be sent well ahead

of the VCL NAL units to which they apply, and can be repeated to provide

robustness against data loss. In some applications, parameter sets may be sent

within the channel that carries the VCL NAL units (termed "in-band"

transmission). In other applications, it can be advantageous to convey the


parameter sets "out of band" using a more reliable transport mechanism than the

video channel itself. By using this mechanism, H.264 can transmit multiple video

sequences (with different parameters) in a single bitstream.

Fig 2.10. Relationship between parameter sets and picture slices[24].

The important information carried by the SPS includes the profile/level indicator, decoding or playback order, frame size, number of reference frames, and Video Usability Information (VUI) such as aspect ratio and color space details. The SPS remains the same for the entire coded video sequence. Important information carried by the PPS includes the entropy coding scheme used, macroblock reordering, quantization parameters, and a flag to indicate whether inter predicted MBs can be used for intra prediction. The PPS remains unchanged within a coded picture.

This chapter provided an overview of H.264. Various profiles, encoder,

decoder and the H.264 bit stream format were discussed in detail. An overview of

the HE AAC v2 audio codec is presented in the next chapter.


CHAPTER 3

OVERVIEW OF HE AAC V2

3.1HE AAC v2

High efficiency advanced audio codec version 2, also known as enhanced aacplus, is a low bit rate audio codec defined in the MPEG4 audio profile [2], belonging

to the AAC family. It is specifically designed for low bit rate applications such as

streaming.

HE AAC v2 has been proven to be the most efficient audio compression tool

available today. It comes with a fully featured toolset which enables coding in

mono, stereo and multichannel modes (up to 48 channels). Apart from ATSC [17],

enhanced aacplus is already the audio standard in various applications and

systems around the world. In Asia, HE-AAC v2 is the mandatory audio codec for

the Korean Satellite Digital Multimedia Broadcasting (S-DMB) [25] technology and

is optional for Japan’s terrestrial Integrated Services Digital Broadcasting system

(ISDB) [26]. HE-AAC v2 is also a central element of the 3GPP and 3GPP2 [27]

specifications and is applied in multiple music download services over 2.5 and 3G

mobile communication networks. Others include XM satellite radio (the digital

satellite broadcasting service in the USA), HD Radio (the terrestrial digital

broadcasting system of iBiquity Digital, USA) [7].

HE AAC v2 is a combination of three technologies: AAC (advanced audio

codec), SBR (spectral band replication) and PS (parametric stereo). All the 3

technologies are defined in MPEG4 audio standard [2]. The combination of AAC

and SBR is called HE-AAC or aacplus. AAC is a general audio codec, SBR is a

bandwidth extension technique offering substantial coding gain in combination


with AAC, and Parametric Stereo (PS) enables stereo coding at very low bitrates.

Figure 3.1 shows the family of AAC audio codecs.


Fig 3.1: HE AAC audio codec family [7]

Figure 3.2 shows the typical bitrate ranges for stereo plotted against the

perceptual quality factor for all three forms of the codec. It can be seen that HE-AAC v2 provides the best quality at the lowest bitrates.

Fig 3.2: Typical bitrate ranges of HE-AAC v2, HE-AAC and AAC for stereo [7]


3.2 Spectral Band Replication (SBR):

SBR [2] is a bandwidth extension technique; it has emerged as one of the most important tools in the recent development of audio coding technology.

SBR exploits the correlation that exists between the energy of the audio

signal at high and low frequencies also referred to as high and low bands. It is also

based on the fact that psychoacoustic importance of high band is relatively low.

SBR uses a well guided technique called transposition to predict the energies at

high band from low band. Besides just transposition, the reconstruction of the

high band is conducted by transmitting some guiding information such as spectral

envelope of the original signal, prediction error, etc. These are referred to as SBR

data. The original and the high band reconstructed audio signals are shown in figures 3.4 and 3.5 respectively.

Fig 3.4 original audio signal [28].

Fig 3.5 High band reconstruction through SBR [28].


SBR has enabled high-quality stereo sound at bitrates as low as 48 kbps. SBR was

invented as a bandwidth extension tool when used along with AAC. It was

adopted as an MPEG4 standard in March 2004 [2].

3.3 Parametric Stereo (PS):

Parametric stereo coding is a technique to efficiently code a stereo audio

signal as a monaural signal plus a small amount of stereo parameters. The

monaural signal can be encoded using any audio coder. The stereo parameters

can be embedded in the ancillary part of the mono bit stream creating backwards

mono compatibility. In the decoder, first the monaural signal is decoded after

which the stereo signal is reconstructed from the stereo parameters.

PS coding has led to a high quality stereo sound reconstruction at relatively

low bitrates. In the parametric approach, the audio signal or stereo image is

separated into its transient, sinusoid, and noise components. Next, each

component is re-represented via parameters that drive a model for the signal,

rather than the standard approach of coding the actual signal itself. PS uses three

types of parameters to describe the stereo image:

Inter-channel Intensity Differences (IID): describes the intensity differences

between the channels.

Inter-channel Phase Differences (IPD): describes the phase differences between

the channels and

Inter-channel Coherence (IC): describes the coherence between the channels.

The coherence is measured as the maximum of the cross-correlation as a function

of time or phase.

In principle, these three parameters allow for a high quality reconstruction of the

stereo image. However, the IPD parameters only specify the relative phase


differences between the channels of the stereo input signal. They do not

prescribe the distribution of these phase differences over the left and right

channels. Hence, a fourth type of parameter is introduced, describing an overall

phase offset or Overall Phase Difference (OPD). In order to reconstruct the stereo

image, in the PS decoder a number of operations are performed, consisting of

scaling (IID), phase rotations (IPD/OPD) and decorrelation (IC).
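As a rough numerical illustration of these parameters (a simplified sketch, not the MPEG-4 parametric stereo extraction procedure), IID can be estimated as the energy ratio between the channels in dB and IC as the magnitude of the normalized cross-correlation within one analysis band:

```python
import numpy as np

def stereo_parameters(left, right, eps=1e-12):
    # Return (IID in dB, IC) for one analysis band of a stereo subband pair
    e_l = float(np.sum(left * left)) + eps
    e_r = float(np.sum(right * right)) + eps
    iid = 10.0 * np.log10(e_l / e_r)                             # intensity difference
    ic = abs(float(np.sum(left * right))) / np.sqrt(e_l * e_r)   # coherence estimate
    return iid, ic

# A band where the right channel is a half-amplitude copy of the left:
t = np.linspace(0.0, 1.0, 256)
l = np.sin(2 * np.pi * 5 * t)
r = 0.5 * l
print(stereo_parameters(l, r))   # IID is about +6 dB, IC is about 1.0
```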

3.4 Enhanced aacplus encoder:

Figure 3.6 shows the complete block diagram of the enhanced aacplus

encoder. The input PCM time domain signal (raw audio signal) is first fed to a

stereo-to-mono downmix unit, which is only applied if the input signal is stereo but the chosen audio encoding mode is mono.

The (mono or stereo) input time domain signal is fed to an IIR resampling

filter in order to adjust the input sampling frequency to the best-suited

sampling rate for the encoding process. The usage of the IIR resampler block

is only applied if the input signal sampling rate differs from the encoding sampling

rate. The IIR resampler may either be run as a 3:2 downsampler (e.g. to

downsample from 48 kHz to 32 kHz) or as a 1:2 upsampler (e.g. to upsample from

16 to 32 kHz). The QMF filter bank (part of SBR) is used to derive the spectral

envelope of the original signal. This envelope data along with some other error

information forms the SBR stream.

The enhanced aacplus encoder basically consists of the well-known AAC encoder, the SBR high band reconstruction encoding tool and the Parametric Stereo (PS) encoding tool. The enhanced aacplus encoder is operated in a dual frequency mode: the SBR encoder unit operates at the encoding sampling frequency (as delivered from the IIR resampler) and the AAC encoder unit at half of this sampling rate. Consequently a 2:1 downsampler is present at the input to the AAC encoder. For an efficient implementation an IIR (Infinite Impulse Response) filter is used. The parametric stereo tool is used for low-bitrate stereo coding, i.e. up to and including a bitrate of 44 kbps [4].

Fig3.6: Enhanced aacplus encoder block diagram [9]

The SBR encoder consists of a QMF (Quadrature Mirror Filter) analysis filter

bank, which is used to derive the spectral envelope of the original input signal.

This spectral envelope data along with transposition information forms the SBR

stream.

For stereo bitrates at and below 44 kbps, the parametric stereo encoding

tool in the enhanced aacplus encoder is used. For stereo bitrates above 44 kbps,

normal stereo operation is performed. The parametric stereo encoding tool

estimates parameters characterizing the perceived stereo image of the input

signal. These stereo parameters are embedded in the SBR stream.


The embedding of the SBR stream (including the parametric stereo data) into the

AAC stream is done in a backwards compatible way, i.e. legacy AAC decoders can

parse the enhanced aacplus stream and decode the AAC core part.

3.5 Enhanced aacplus decoder:

Figure 3.7 shows the entire block diagram of an enhanced aacplus decoder.

In the decoder the bitstream is de-multiplexed into the AAC and the SBR stream.

Error concealment, e.g. in the case of frame loss, is achieved by designated

algorithms in the decoder for AAC, SBR and parametric stereo.

The low band AAC time domain signal, sampled at half the encoding sampling rate, is first fed to a 32-channel QMF analysis filter bank. The QMF low-band samples are then used to

generate a high-band signal, whereas the transmitted transposition guidance

information is used to best match the original input signal characteristics.

The transposed high band signal is then adjusted according to the

transmitted spectral envelope signal to best match the original’s spectral

envelope. Also, missing components that could not be reconstructed by the

transposition process are introduced. Finally, the lowband and the reconstructed

highband are combined to obtain the complete output signal in the QMF domain.

In the case of a stream using parametric stereo, the mono output signal

from the underlying aacplus decoder is converted into a stereo signal. This

process is carried out in the QMF domain and is controlled by the parametric

stereo parameters embedded in the SBR stream.

A 64-channel QMF synthesis filter bank is used to obtain the time domain output signal, sampled at the encoding sampling rate. The synthesis filter bank may also be used to apply an implicit downsampling by a factor of 2, resulting in an output sampling rate of half the encoding sampling rate.


Fig 3.7: Enhanced aacplus decoder block diagram [9]

3.6 Advanced Audio Coding (AAC)

3.6.1 AAC encoder:

The AAC encoder acts as the core encoding algorithm of the enhanced

aacplus system encoding at half the sampling rate of aacplus. In the case of SBR

being used, the maximum AAC sampling rate is restricted to 24 kHz whereas if

SBR is not used, the maximum AAC sampling rate is restricted to 48 kHz [9].

Figure 3.8 shows the block diagram of a core AAC encoder. The various blocks in the encoder are explained below.

Stereo preprocessing: In this block, the stereo width of signals that are difficult to encode at low bitrates is reduced (attenuated). Stereo preprocessing is active for bitrates less than 60 kbps. The smaller the bitrate, the more attenuation of the side channel takes place.


Filter bank: The encoder breaks down the raw audio signal into segments known as blocks. The Modified Discrete Cosine Transform (MDCT) is applied to the blocks to maintain a smooth transition from block to block. AAC dynamically switches between two block sizes, 2048 samples and 256 samples, referred to as long blocks and short blocks respectively. AAC also switches between two window shapes for long blocks, sine and Kaiser-Bessel Derived (KBD), according to the complexity of the signal. The MDCT is given by (3.1):

X(k) = 2 \sum_{n=0}^{N-1} x(n) \cos\left[\frac{2\pi}{N}\left(n + \frac{1}{2} + \frac{N}{4}\right)\left(k + \frac{1}{2}\right)\right], \qquad k = 0, 1, \ldots, \frac{N}{2}-1        (3.1)

where N is the block length.
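A direct Python rendering of (3.1) is shown below for checking the block sizes; it is illustrative only, since real encoders apply the analysis window and a fast, overlapped implementation:

```python
import numpy as np

def mdct(x):
    # Direct MDCT of one block x of length N, as in equation (3.1)
    N = len(x)
    n = np.arange(N)
    k = np.arange(N // 2).reshape(-1, 1)
    basis = np.cos(2.0 * np.pi / N * (n + 0.5 + N / 4.0) * (k + 0.5))
    return 2.0 * basis @ x               # N time samples -> N/2 coefficients

long_block = np.random.randn(2048)       # AAC long block
short_block = np.random.randn(256)       # AAC short block
print(mdct(long_block).shape, mdct(short_block).shape)   # (1024,) (128,)
```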

Psychoacoustic model: This is a highly complex block which implements the switching between block sizes, threshold calculation (upper limit of quantization error), spread energy calculation, grouping etc. [10].

Temporal Noise Shaping (TNS) block: This technique does noise shaping in the

time domain by doing an open loop prediction in the frequency domain. The TNS

technique provides enhanced control of the location, in time, of quantization

noise within a filter bank window. TNS proves to be especially successful for the

improvement of speech quality at low bit-rates.

Mid/Side Stereo: M/S stereo coding is another data reduction module based on

channel pair coding. In this case channel pair elements are analyzed as left/right

and sum/difference signals on a block-by-block basis. In cases where the M/S

channel pair can be represented by fewer bits, the spectral coefficients are coded,

and a bit is set to note that the block has used M/S stereo coding. During

decoding, the decoded channel pair is de-matrixed back to its original left/right

state. For normal stereo operation, M/S stereo is only required when operating

the encoder at bitrates at or above 44 kbps. Below 44 kbps the parametric stereo

coding tool is used instead where the AAC core is operated in mono.


Reduction of psychoacoustic requirements block: Usually the requirements of the

psychoacoustic model are too strong for the desired bitrate. Thus a threshold

reduction strategy is necessary, i.e. the strategy reduces the requirements by

increasing the thresholds given by the psychoacoustic model.

Quantization and coding: A majority of the data reduction generally occurs in the

quantization phase after the data has already achieved a certain level of

compression when passed through the previous modules. This module contains

many other blocks such as Scale factor quantization block, Noiseless coding and

Out of Bits Prevention.

Scale factor quantization: This block consists of two additional blocks called scale

factor determination and scale factor difference reduction.

Scale factor determination: The scale factors determine the quantization step size

for each scale factor band. By changing the scale factor, the quantization noise

will be controlled.

Scale factor difference reduction: This block takes into account the difference of

the scale factors which will be encoded. A smaller difference between two

adjacent scale factors requires fewer bits.

Noiseless coding: Coding of the quantized spectral coefficients is done by the

noiseless coding. The encoder uses a so-called greedy merge algorithm to segment the 1024 coefficients of a frame into sections and to find the best Huffman codebook for each section.

Out of Bits Prevention: After noiseless coding, the number of bits actually needed is counted. If this number is too high, the number of bits has to be reduced.


Fig 3.8 AAC encoder block diagram [10]


3.7 HE AAC v2 bitstream formats:

HE AAC v2 encoded data has variable file formats with different extensions,

depending on the implementation and the usage scenario. The most commonly

used file formats are the MPEG-4 file formats MP4 and M4A [15], carrying the

respective extensions .mp4 and .m4a. The “.m4a” extension is used to emphasize

the fact that a file contains audio only. Additionally there are other bit stream

formats, such as MPEG-4 ADTS (audio data transport stream) and ADIF (audio

data interchange format).

The ADIF format has a single header at the beginning of the bit stream followed by raw audio data blocks. It is used mainly for local storage purposes. ADTS has a header before each access unit or audio frame, and the header information remains the same for all the frames in a stream. ADTS is more robust against errors and is suited for communication applications like broadcasting. The ADTS format has been used for this project.

Tables 3.1 and 3.2 describe the ADTS header. This is present before each

access unit (a frame). This is later exploited for packetizing the frames into

packetized elementary stream (PES) packets, which is the first layer of

packetization before transport. Figure 3.9 shows the ADTS elementary stream.

In this chapter, an overview of HE AAC v2 audio codec standard was

presented. The encoder, decoder, SBR, PS, AAC encoder and the bit stream

format were described. Next chapter gives a brief overview of the transport

protocols in particular MPEG 2 systems layer.


Field name                        Bits            Notes
ADTS fixed header
  syncword                        12              always "111111111111"
  ID                              1               0: MPEG-4, 1: MPEG-2
  layer                           2               always "00"
  protection_absent               1
  profile                         2               explained below (table 3.2)
  sampling_frequency_index        4
  private_bit                     1
  channel_configuration           3
  original/copy                   1
  home                            1
ADTS variable header
  copyright_identification_bit    1
  copyright_identification_start  1
  aac_frame_length                13              length of the frame including header (in bytes)
  adts_buffer_fullness            11              0x7FF indicates VBR
  no_raw_data_blocks_in_frame     2
crc_check                         16              only if protection_absent == 0
raw_data_blocks                   variable size

Table 3.1 ADTS header format [2] [3]

profile bits   ID == 1 (MPEG-2 profile)             ID == 0 (MPEG-4 Object type)
00 (0)         Main profile                         AAC MAIN
01 (1)         Low Complexity profile (LC)          AAC LC
10 (2)         Scalable Sample Rate profile (SSR)   AAC SSR
11 (3)         (reserved)                           AAC LTP

Table 3.2 Profile bits expansion [2] [3]
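The fixed-header fields of tables 3.1 and 3.2 can be read with simple bit masking. The sketch below (written for this discussion, not the project's demultiplexer code) extracts the profile, sampling frequency, channel configuration and frame length from the first 7 bytes of an ADTS frame:

```python
SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000, 24000,
                22050, 16000, 12000, 11025, 8000, 7350]

def parse_adts_header(b: bytes):
    # Decode the ADTS fixed header fields of tables 3.1 and 3.2 from 7 header bytes
    if len(b) < 7 or b[0] != 0xFF or (b[1] & 0xF0) != 0xF0:
        raise ValueError('no ADTS syncword at this position')
    return {
        'mpeg_id':      (b[1] >> 3) & 0x1,                          # 0: MPEG-4, 1: MPEG-2
        'profile':      (b[2] >> 6) & 0x3,                          # see table 3.2 (1 = LC)
        'sample_rate':  SAMPLE_RATES[(b[2] >> 2) & 0xF],
        'channels':     ((b[2] & 0x1) << 2) | ((b[3] >> 6) & 0x3),
        'frame_length': ((b[3] & 0x3) << 11) | (b[4] << 3) | ((b[5] >> 5) & 0x7),
    }
```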


Fig 3.9. ADTS elementary stream [3].


CHAPTER 4

TRANSPORT PROTOCOLS

4.1 Introduction

Once the raw video and audio are encoded into their respective elementary streams, they have to be converted into fixed size packets to enable transmission across networks such as IP (internet protocol) networks, wireless mobile networks etc.

H.264 and HE AAC v2 do not define a transport mechanism for the coded data.

There are a number of transport solutions available which can be used depending

on the application. Some of them are discussed briefly below.

4.2 Real-Time Protocol (RTP):

RTP is a packetisation protocol which can be used along with User

Datagram Protocol (UDP) to transmit real time multimedia content across

networks that use the Internet Protocol (IP). The packet structure for RTP real

time data is shown in figure 4.1. The payload type indicates the type of codec

used to generate the coded data. The sequence number is used during playback for reordering

the packets that are transmitted out of order. A time stamp is used for calculating

the presentation time during decoding. Transmission via RTP involves packetizing

each elementary stream into a series of RTP packets and transmitting them across

the IP network using UDP as the basic transport protocol. H.264 NAL has been

designed keeping RTP protocol in mind, since each NALU can be placed in its own

RTP packet.


Fig 4.1. RTP packet structure (simplified) [22]
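The simplified packet of figure 4.1 corresponds to the 12-byte fixed RTP header. The sketch below (illustrative only, not the project's transport code) packs one payload, for example a single H.264 NAL unit, behind such a header, assuming a dynamic payload type of 96:

```python
import struct

def rtp_packet(payload: bytes, seq: int, timestamp: int,
               payload_type: int = 96, ssrc: int = 0x12345678) -> bytes:
    # 12-byte fixed RTP header: version 2, no padding/extension/CSRC, marker clear
    header = struct.pack('!BBHII', 0x80, payload_type & 0x7F,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

# One NAL unit per RTP packet; video timestamps typically use a 90 kHz clock
# pkt = rtp_packet(nalu, seq=0, timestamp=0)
```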

4.3 MPEG2 systems layer:

The MPEG2 part 1 [4] standard describes a method of combining one or

more elementary streams of video and audio, as well as other data, into a single stream which is suitable for storage (DVD) or transmission (digital television,

streaming etc). There are basically two types of system coding, Program Stream

(PS) and Transport Stream (TS). Each one is optimized for different types of

applications. Program stream is designed for reasonably reliable media such as

DVDs, while transport stream is designed for less reliable media such as television

broadcast, mobile networks etc. Irrespective of the scheme used, the transport/program stream is constructed in two layers: the outer layer is the system layer and the innermost layer is the compression layer. The system layer provides the functions necessary for combining one or more compressed streams, such as audio and video, in a single system.

The MPEG2 transport stream system is shown in figure 4.2. The elementary stream, such as coded audio or video, undergoes two layers of packetization. The first layer of packetization results in variable sized packets known as Packetized Elementary Stream (PES) packets. PES packets from different elementary streams undergo one more level of packetization known as multiplexing, where they are broken down into fixed size packets (188 bytes) known as transport stream (TS) packets.

These TS packets are what are actually transmitted across the network using


broadcast techniques such as those used in ATSC and DVB. The TS contains the

actual data (payload) as well as timing and synchronization information and some

error control mechanism. The latter plays a crucial role during the decoding

process. This project is implemented using the MPEG2 transport stream

specification with a few modifications. Transport of H.264 bit stream over MPEG2

systems is covered in amendment 3 of MPEG2 systems [3]. Even though MPEG2

systems support multiple elementary streams for multiplexing, for

implementation purposes only two elementary streams, audio and video are

considered. Additional elementary streams like data streams or different

audio/video streams may be added by following the same method described next.


Fig 4.2. MPEG2 transport stream [22]
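To illustrate the second layer of packetization described above, the sketch below splits one PES packet into 188-byte TS packets carrying the standard 4-byte TS header (sync byte 0x47, payload unit start indicator, PID and continuity counter). It is a simplified illustration: the last short packet is padded directly here, whereas MPEG-2 uses an adaptation field, and the slightly modified TS format actually used in this project is described with figure 4.5 and in chapter 5:

```python
TS_PACKET_SIZE = 188

def ts_packets(pes: bytes, pid: int, continuity: int = 0):
    # Split one PES packet into 188-byte TS packets with a standard 4-byte header
    packets = []
    for offset in range(0, len(pes), TS_PACKET_SIZE - 4):
        chunk = pes[offset:offset + TS_PACKET_SIZE - 4]
        pusi = 1 if offset == 0 else 0            # set on the packet holding the PES header
        header = bytes([
            0x47,                                 # sync byte
            (pusi << 6) | ((pid >> 8) & 0x1F),    # error indicator 0, PUSI, priority 0, PID high bits
            pid & 0xFF,                           # PID low bits
            0x10 | (continuity & 0x0F),           # payload only, continuity counter
        ])
        continuity = (continuity + 1) & 0x0F
        packets.append(header + chunk.ljust(TS_PACKET_SIZE - 4, b'\xff'))
    return packets
```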

4.4 Packetized elementary stream (PES):

PES packets are obtained after the first layer of packetization of the audio and video coded data. This packetization process is carried out by sequentially separating the audio and video elementary streams into access units. The access units for both the audio and video elementary streams are frames. Hence each


PES packet is an encapsulation of one frame of data from either the audio or the video elementary stream. Each PES packet contains a packet header and payload data from only one particular stream. The PES header contains information that distinguishes audio PES packets from video PES packets. Since the number of bits used to represent a frame in the bit stream varies (for both audio and video), the size of the PES packets also varies and depends on the type of frame that is encoded; for example, I frames require more bits than P frames. Figure 4.3 shows how the elementary stream is converted into a PES stream.

Fig 4.3. Conversion of an elementary stream into PES packets [29]

The PES header used is shown in table 4.1. The PES header starts with a 3 byte packet start code prefix, which is always "0x000001", followed by a 1 byte stream id. The stream id is used to uniquely identify a particular stream; the stream id together with the start code prefix is known as the start code (4 bytes). Valid stream ids [30] range from 11000000 to 11011111 for audio streams and from 11100000 to 11101111 for video streams. Stream ids 11000000 and 11100000 are used for audio and video respectively in this implementation.


Name | Size (bytes) | Description
Packet start code prefix | 3 | Always 0x000001
Stream id | 1 | Unique ID to distinguish between audio and video PES packets. Examples: audio streams (0xC0-0xDF), video streams (0xE0-0xEF) [3]. Note: the above 4 bytes together are known as the start code.
PES packet length | 2 | The PES packet can be of any length. A value of zero for the PES packet length can be used only when the PES packet payload is a video elementary stream.
Time stamp | 2 | Frame number

Table 4.1. PES packet header format used [4].

The PES packet length may vary and go up to 65535 bytes. For a longer elementary stream, the packet length may be set as unbounded, i.e. 0, but only in the case of the video stream. The next two bytes in the header are the time stamp field, which carries the playback time information. In this project the frame number is used to calculate the playback time, as explained in detail later.

4.4.1 PES encapsulation process:

As discussed before, PES packets are obtained by encapsulating sequential access units' data bytes from the elementary streams with a PES header. In the case of the audio stream, the HE AAC v2 bitstream is searched for the 12 bit sync word "111111111111", which indicates the start of an ADTS header and hence the start of an audio frame. The frame length is extracted from the ADTS header. The audio frame number is counted from the beginning of the stream and coded as a two byte time stamp. The stream id used for audio is 11000000. An audio PES packet is formed by encapsulating the start code, frame length, time stamp and payload. In the case of the video stream, the H.264 bitstream is searched for the 4 byte start code prefix 0x00000001, which indicates the beginning of a NAL unit. Then the 5 bit NAL unit type is extracted from the NAL header and checked to determine whether it is a video


frame or a parameter set. Parameter sets are very important and are required for the decoding process, so if a parameter set (PPS or SPS) is found, it is packetized separately and transmitted. If the NAL unit contains slice data, the frame number is counted from the beginning of the stream and coded as the time stamp in the PES header. It should be noted that parameter sets are not counted as frames, so when parameter sets are packetized the time stamp field is coded as zero. The stream id used for video is 11100000. The video PES packet is then formed by encapsulating the start code, frame length, time stamp and payload.
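A minimal sketch of this encapsulation step, using the simplified header of table 4.1, is given below. It assumes the access unit (one audio or video frame) is already available as a byte buffer; the function name is illustrative and not part of the project's MATLAB code.

#include <cstdint>
#include <vector>

// Simplified PES header per table 4.1: 3-byte start code prefix, 1-byte
// stream id, 2-byte packet length (frame length), 2-byte time stamp
// (frame number). Stream ids follow the project's choice: 0xC0 audio, 0xE0 video.
std::vector<std::uint8_t> makePesPacket(const std::vector<std::uint8_t>& accessUnit,
                                        bool isVideo, std::uint16_t frameNumber)
{
    std::vector<std::uint8_t> pes;
    pes.push_back(0x00); pes.push_back(0x00); pes.push_back(0x01);        // start code prefix
    pes.push_back(isVideo ? 0xE0 : 0xC0);                                 // stream id
    std::uint16_t len = static_cast<std::uint16_t>(accessUnit.size());    // 0 = unbounded (video only)
    pes.push_back(static_cast<std::uint8_t>(len >> 8));
    pes.push_back(static_cast<std::uint8_t>(len & 0xFF));                 // PES packet length
    pes.push_back(static_cast<std::uint8_t>(frameNumber >> 8));
    pes.push_back(static_cast<std::uint8_t>(frameNumber & 0xFF));         // time stamp (frame number)
    pes.insert(pes.end(), accessUnit.begin(), accessUnit.end());          // one frame of payload
    return pes;
}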

4.5 MPEG Transport Stream (MPEG-TS):

PES packets are of variable size and are difficult to multiplex and transmit over an error prone network. Hence they undergo one more layer of packetization, which results in Transport Stream (TS) packets.

MPEG transport streams (MPEG-TS) use a fixed packet size, and a packet identifier identifies each transport packet within the transport stream. The packet identifier indicates which packetized elementary stream (PES), audio or video, the packet belongs to. Each TS packet is 188 bytes long, which includes the header and the payload data. Each PES packet may be broken down into a number of transport stream (TS) packets, since a PES packet, which represents an access unit (a frame) of the elementary stream, is usually much larger than 188 bytes. Also, a particular TS packet may contain data from only one particular PES.

The standard MPEG TS packet format is shown in fig. 4.4. It consists of a synchronization byte, whose value is 0x47, followed by three one-bit flags and a 13-bit PID (packet identifier). This is followed by a 4-bit continuity counter, which usually increments with each subsequent packet of a frame and can be used to


detect missing packets. Additional optional transport fields, whose presence may

be signaled in the optional adaptation field, may follow. The rest of the packet

consists of payload. Packets are most often 188 bytes in length, but some

transport streams consist of 204-byte packets which end in 16 bytes of Reed-

Solomon error correction data. The 188-byte packet size was originally chosen for

compatibility with ATM systems.

Fig 4.4. A standard MPEG-TS packet structure [14].

The transport error indicator (EI) flag is set by the demodulator if it cannot correct errors in the stream, to tell the demultiplexer that the packet has an uncorrectable error. The Payload Unit Start Indicator (PUSI) flag indicates the start of PES data. The transport priority flag, when set, means the packet has a higher priority than other packets with the same PID. Of the 188 bytes, the header occupies 4 bytes and the remaining 184 bytes carry the payload.


For the purposes of this implementation not all of the flags and fields mentioned above are required, hence a few changes have been made, although the framework and the packet size remain the same. The header information is represented in 3 bytes instead of 4, and the rest of the packet is available for payload data. The modified transport stream (TS) packet is shown in fig. 4.5.

Fig 4.5. Transport stream (TS) packet format used in this project: each packet is 188 bytes long, consisting of the sync byte 0x47 (1 byte), a 2 byte field carrying the PUSI and AFC flags, the 4 bit CC and the 10 bit PID, an optional 8 bit offset byte, and the data payload (up to 185 bytes).

The sync byte (0x47) indicates the start of a new TS packet. It is followed by the payload unit start indicator (PUSI) flag, which, when set, indicates that the data payload contains the start of a new PES packet.

The Adaptation Field Control (AFC) flag, when set, indicates that the 185 bytes allotted for the data payload are not completely occupied by PES data. This occurs when the remaining PES data is smaller than 185 bytes. When this happens, the unoccupied bytes of the data payload are filled with filler data (all zeros in this case), and the length of the filler data is stored in a byte called the offset, placed right after the TS header; the offset is calculated as 185 minus the length of the PES data.


The Continuity Counter (CC) is a 4 bit field which is incremented by the multiplexer for each TS packet sent for a particular stream, i.e. the audio PES or the video PES. This information is used at the demultiplexer side to determine whether any packets are lost, repeated or out of sequence. The Packet ID (PID) is a unique 10 bit

identification to describe a particular stream to which the data payload belongs in

the transport stream (TS) packet. The MPEG2 transport stream has a concept of

broadcast programs. Each single program is described by a Program Map Table

(PMT), and the elementary streams associated with that program have PIDs listed

in the PMT. For instance, a transport stream used in digital television might

contain three programs, to represent three television channels. Suppose each

channel consists of one video stream and one audio stream. A receiver wishing to

decode a particular "channel" merely has to decode the payloads of each PID

associated with its program. It can discard the contents of all other PIDs. A

transport stream with more than one program is referred to as an MPTS (multi program transport stream); similarly, a transport stream with a single program is referred to as a single program transport stream (SPTS). A PMT will have its own

PID and will be transmitted at regular intervals. In this implementation only two streams, audio and video, are used, so the PMT is not required; the PIDs are assumed to be known by the decoder. The PIDs used in this implementation are 0000001110 (14) for the audio stream and 0000001111 (15) for the video stream.

Optional offset byte: as described above, if the adaptation field control flag is set,

this byte is filled with the length of the filler data (zeroes).
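The sketch below assembles one 188 byte TS packet with the modified 3 byte header described above. The exact bit positions of the PUSI and AFC flags, the CC and the PID within the 2 byte field are an assumption, since the thesis fixes only the field widths; it is also assumed that the optional offset byte, when present, is taken out of the 185 byte payload area.

#include <cstdint>
#include <vector>

constexpr std::size_t TS_PACKET_SIZE = 188;
constexpr std::size_t TS_PAYLOAD_MAX = 185;   // 188 bytes minus the 3-byte header

// Build one TS packet of the modified format (fig. 4.5). The bit packing of
// PUSI/AFC/CC/PID inside the 2-byte field is an illustrative assumption.
std::vector<std::uint8_t> makeTsPacket(const std::uint8_t* pesData, std::size_t pesLen,
                                       std::uint16_t pid, std::uint8_t& continuityCounter,
                                       bool pusi)
{
    std::vector<std::uint8_t> ts(TS_PACKET_SIZE, 0x00);          // filler bytes default to zero
    std::size_t dataLen = pesLen < TS_PAYLOAD_MAX ? pesLen : TS_PAYLOAD_MAX;
    bool afc = dataLen < TS_PAYLOAD_MAX;                         // payload area not completely filled

    ts[0] = 0x47;                                                // sync byte
    std::uint16_t flags = (pusi ? 0x8000 : 0) | (afc ? 0x4000 : 0)
                        | ((continuityCounter & 0x0F) << 10) | (pid & 0x03FF);
    ts[1] = static_cast<std::uint8_t>(flags >> 8);
    ts[2] = static_cast<std::uint8_t>(flags & 0xFF);

    std::size_t pos = 3;
    if (afc)                                                     // offset byte: length of the filler data
        ts[pos++] = static_cast<std::uint8_t>(TS_PAYLOAD_MAX - dataLen);
    for (std::size_t i = 0; i < dataLen; ++i)                    // PES bytes; the rest stays zero-filled
        ts[pos + i] = pesData[i];

    continuityCounter = (continuityCounter + 1) & 0x0F;          // 4-bit continuity counter
    return ts;
}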

4.6 Time stamps:

Time stamps indicate where a particular access unit belongs in time. Lip

sync is obtained by incorporating time stamps into the headers in both video and

audio PES packets. When a decoder receives a selected PES packet, it decodes


each access unit and stores the elementary streams in buffers. When the time-

line count reaches the value of the time stamp, the buffer is read out. This

operation has two desirable results. First, effective time base correction is

obtained in each elementary stream. Second, the video and audio elementary

streams can be synchronized together to make a program.

Traditionally, to enable the decoder to maintain synchronization between the audio track and the video frames, a 33 bit encoder clock sample called the Program Clock Reference (PCR) is transmitted in the adaptation field of a TS packet from time to time (every 100 ms). The PCR transport stream (TS) packets have their own PID, which is recognized by the decoder. The PCR is used to generate the system timing clock (STC) in the decoder, which provides an accurate time base. This, along with the presentation time stamp (PTS) field that resides in the PES packet layer of the transport stream, is used to synchronize the audio and video elementary streams.

This project uses the frame numbers of both the audio and the video streams as time stamps to synchronize them. This section explains how frame numbers can be used to synchronize the audio and video streams. As explained in sections 2.6 and 3.7, both the H.264 and the HE AAC v2 bit streams are organized into access units, i.e. frames, separated by their respective sync sequences. A particular video sequence has a fixed frame rate during playback, specified in frames per second (fps). So, assuming that the decoder has prior knowledge of the fps of the stream, the presentation time or playback time of a particular video frame can be calculated using (4.1):

PT_video = video frame number / fps     (4.1)


The AAC compression standard defines each audio frame to contain 1024 samples per channel; this is true for HE AAC v2 as well. The sampling frequency of the audio stream can be extracted from the sampling frequency index field of the ADTS header, and it remains the same for a particular audio stream. Since both the samples per frame and the sampling frequency are fixed, the audio frame rate also remains constant throughout a particular audio stream. Hence the presentation time of a particular audio frame (assuming stereo) can be calculated using (4.2):

PT_audio = audio frame number x 1024 / sampling frequency     (4.2)

The same expression can be expanded for multi-channel audio streams, simply by multiplying by the number of channels.

Hence, by coding the frame numbers as the time stamps, the presentation times of the frames can be calculated using (4.1) and (4.2). Also, once the presentation time of one stream is calculated, the frame number of the second stream that has to be played at that particular time can be calculated. This approach is used at the decoder to achieve audio-video synchronization, or lip synchronization, and is explained in detail in later chapters.
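A small sketch of (4.1) and (4.2) is shown below; the function and variable names are illustrative.

// Presentation time of a video frame per (4.1): frame number divided by
// the frames-per-second value of the sequence.
double videoPresentationTime(unsigned frameNumber, double fps)
{
    return frameNumber / fps;
}

// Presentation time of an audio frame per (4.2): 1024 samples per frame
// divided by the sampling frequency, times the frame number.
double audioPresentationTime(unsigned frameNumber, double samplingFreqHz)
{
    return frameNumber * 1024.0 / samplingFreqHz;
}

// Example: at 24 fps, video frame 48 plays at 2.0 s; at 24000 Hz,
// audio frame 47 plays at about 2.005 s.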

Using frame numbers as time stamps has several advantages over the traditional PCR approach: there is no need to send additional transport stream (TS) packets carrying PCR information, the overall complexity is reduced, clock jitter need not be considered during synchronization, and the time stamp field in the PES packet is smaller, just 16 bits to encode the frame number compared to 33 bits for the Presentation Time Stamp (PTS), which carries a sample of the encoder clock.

The time stamp field in this project is encoded in 2 bytes in the PES header, which implies that it can carry frame numbers only up to 65535. Once the frame number of either stream exceeds this value, which is a possibility for long video and audio sequences, the frame number is reset to 1. The reset is applied simultaneously to both the audio and the video frame numbers as soon as the frame number of either stream crosses the limit. This does not create a frame number conflict at the demultiplexer side during synchronization, because the audio and video buffer sizes are much smaller than the maximum allowed frame number; hence, at no point in time will there be two frames in the buffer with the same time stamp.
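A sketch of this simultaneous reset, assuming simple integer frame counters, is shown below.

#include <cstdint>

// Advance the frame counter of the stream being packetized; reset both
// counters to 1 as soon as either exceeds what the 16-bit time stamp
// field can carry.
void advanceFrameCounters(std::uint32_t& videoFrameNo, std::uint32_t& audioFrameNo,
                          bool currentIsVideo)
{
    if (currentIsVideo) ++videoFrameNo; else ++audioFrameNo;
    if (videoFrameNo > 65535 || audioFrameNo > 65535) {
        videoFrameNo = 1;
        audioFrameNo = 1;
    }
}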

The next chapter addresses the multiplexing scheme used to multiplex the

audio and video elementary streams.


CHAPTER 5

MULTIPLEXING

Multiplexing is a process where Transport Stream (TS) packets are generated and transmitted in such a way that the buffers at the decoder (demultiplexer) do not overflow or underflow. Buffer overflow or underflow of the video and audio elementary streams can cause skips or freeze/mute errors in video and audio playback. Various systems adopt different methods to prevent this at the decoder side; for example, when a potential buffer overflow is detected, null packets (transmitted to maintain a constant bit rate) are deleted, or the presentation time is delayed by a few frames until both buffers have the content to be played back at that particular presentation time.

The flow chart of the multiplexing scheme used in this project is shown in figures 5.1, 5.2 and 5.3. The basic logic relies on both the audio and the video sequences having constant frame rates. For video, the frames per second value remains the same throughout the sequence. For audio, since the sampling frequency remains constant throughout the sequence and the samples per frame are fixed (1024 per channel), the frame duration also remains constant.

For transmission, a PES packet, which represents a frame, is logically broken down into n_TS TS packets of 188 bytes each, where n_TS depends on the PES packet size. The exact presentation time of each TS packet (PT_TS) may be calculated as shown in (5.1), (5.2) and (5.3), where n_TS is the number of TS packets required to represent the corresponding PES packet or frame:

n_TS = ceil(PES packet length / 185)     (5.1)

TS_duration_video = (1 / fps) / n_TS     (5.2)

PT_VIDEO_TS = PT_VIDEO_TS(previous) + TS_duration_video     (5.3)

Similarly for audio, in (5.4), (5.5) and (5.6), where the audio frame duration is 1024 divided by the sampling frequency:

n_TS = ceil(PES packet length / 185)     (5.4)

TS_duration_audio = (1024 / sampling frequency) / n_TS     (5.5)

PT_AUDIO_TS = PT_AUDIO_TS(previous) + TS_duration_audio     (5.6)

From (5.3) and (5.6) it may be observed that the presentation time of the current TS packet is the cumulative sum of the presentation time of the previous TS packet (of the same type) and the current TS duration. The decision to transmit a particular TS packet (audio or video) is made by comparing their respective presentation times; whichever stream has the lower value is scheduled to transmit a TS packet. This ensures that both the audio and the video content get equal priority and are transmitted uniformly. Once the decision about which TS to transmit is made, control passes to one of the processing blocks where the actual generation and transmission of TS and PES packets take place.
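A condensed sketch of this decision loop, corresponding to fig. 5.1 and equations (5.1) to (5.6), is given below; the per-stream state structure and helper function are illustrative assumptions, not the project's MATLAB implementation.

#include <cmath>
#include <cstddef>

// Per-stream multiplexer state used by the scheduling sketch (illustrative).
struct StreamState {
    double ptTs = 0.0;            // presentation time of the last TS packet, (5.3)/(5.6)
    double frameDuration = 0.0;   // 1/fps for video, 1024/fs for audio
    std::size_t pesLength = 0;    // length of the current PES packet in bytes
};

// Number of TS packets for the current frame, per (5.1)/(5.4), and the
// duration covered by each of them, per (5.2)/(5.5).
double tsDuration(const StreamState& s)
{
    double nTs = std::ceil(s.pesLength / 185.0);
    if (nTs < 1.0) nTs = 1.0;                      // guard for an empty PES packet
    return s.frameDuration / nTs;
}

// One step of the fig. 5.1 decision loop: the stream with the lower TS
// presentation time transmits next, then its time is updated cumulatively.
void multiplexStep(StreamState& video, StreamState& audio)
{
    StreamState& s = (video.ptTs < audio.ptTs) ? video : audio;
    // ... generate and transmit one 188-byte TS packet for stream s here ...
    s.ptTs += tsDuration(s);                       // (5.3) for video, (5.6) for audio
}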

In the audio/video processing block, the first step is to check whether the multiplexer is still in the middle of a frame or at the beginning of a new frame. If a new frame is being processed, (5.2) or (5.5) is evaluated as appropriate to find the TS duration. This information is used to update the TS presentation time at a


later stage. Next, data is read from the current PES packet; if the PES packet is larger than 185 bytes, only the first 185 bytes are read out and the PES packet is adjusted accordingly. If the current TS packet is the last packet for that PES packet, a new PES packet for the next frame (of that stream) is generated. Now the 185 bytes of payload data and all the remaining information are ready to generate the transport stream (TS) packet. Once a TS packet is generated, the TS presentation time is updated using (5.3) or (5.6). Control then returns to the presentation time decision block, and the whole process is repeated until all the video and audio frames are transmitted. It has to be noted here that one of the streams, i.e. video or audio, may get transmitted completely before the other. In that case only the processing block of the stream that is still pending transmission is operated.

The next chapter describes the de-multiplexing algorithm used and the

method used to achieve audio-video synchronization.


Fig 5.1. Overall multiplexer flow diagram: the presentation times of the audio and video TS packets are compared; if PT_VIDEO_TS < PT_AUDIO_TS, a video TS packet is transmitted, the next video PES packet is generated if required and PT_VIDEO_TS is updated (video processing block), otherwise the same is done for audio (audio processing block); the loop repeats until all audio and video frames are done.


Fig 5.2. Flow chart of the video processing block: if a new video frame is being processed, the number of TS packets for the frame and the TS duration are calculated; if the current PES length is greater than 185 bytes, only the first 185 bytes are taken for transmission and the PES data and length are updated, otherwise the PES packet for the next video frame is generated; finally a video TS packet is generated (transmitted) and the PT_VIDEO_TS value is updated.


Fig 5.3. Flow chart of the audio processing block: identical in structure to fig 5.2, operating on audio frames, audio PES/TS packets and PT_AUDIO_TS.


CHAPTER 6

DEMULTIPLEXING

The Transport Stream (TS) input to a receiver is separated into a video elementary stream and an audio elementary stream by the demultiplexer. At this time,

the video elementary stream and the audio elementary stream are temporarily

stored in the video and audio buffer, respectively.

The basic flow chart of the demultiplexer is shown in figure 6.1. After a packet is received, it is checked for the sync byte (0x47) to determine whether the packet is valid. If it is invalid, the packet is skipped and demultiplexing continues with the next packet. The header of a valid TS packet is read to extract fields such as the packet ID (PID), the adaptation field control (AFC) flag, the payload unit start (PUS) flag and the 4 bit continuity counter. The payload is then prepared to be read into the appropriate buffer. By checking the AFC flag it is known whether an offset value has to be taken into account or all 185 bytes of the TS packet carry payload data. If the AFC flag is set, the payload is extracted by skipping over the stuffing bytes.

The payload unit start (PUS) bit is checked to see if the present TS packet contains a PES header. If so, the PES header is first checked for the presence of the sync sequence (i.e. 0x000001). If it is absent, the packet is discarded and the next TS packet is processed. If it is valid, the PES header is read and fields such as the stream ID, PES length and frame number are extracted. The PID is then checked to determine whether the packet is an audio or a video TS packet. Once this decision is made, the payload is written into the respective buffer. If the TS packet payload contained a PES header, information such as the frame number, its location in the corresponding buffer and the PES length is stored in a separate array, which is later used for synchronizing the audio and video streams.
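A compact sketch of this per-packet parsing step is shown below, using the same 3 byte header layout assumed in the earlier multiplexer sketch; PES header parsing and the frame-number bookkeeping are reduced to comments.

#include <cstddef>
#include <cstdint>
#include <vector>

// Parse one 188-byte TS packet of the modified format and append its
// payload to the audio or video buffer selected by the PID.
// PIDs follow the project's choice: 14 for audio, 15 for video.
bool demuxTsPacket(const std::uint8_t pkt[188],
                   std::vector<std::uint8_t>& videoBuf,
                   std::vector<std::uint8_t>& audioBuf)
{
    if (pkt[0] != 0x47) return false;                 // invalid packet, skip it

    std::uint16_t flags = (pkt[1] << 8) | pkt[2];
    bool pusi = (flags & 0x8000) != 0;                // payload unit start indicator
    bool afc  = (flags & 0x4000) != 0;                // adaptation field control
    std::uint16_t pid = flags & 0x03FF;               // 10-bit packet ID

    std::size_t pos = 3, payloadLen = 185;
    if (afc) {                                        // skip over the stuffing bytes
        std::uint8_t offset = pkt[pos++];
        payloadLen = 185 - offset;
    }
    if (pusi) {
        // PES header starts here: verify 0x000001, read the stream id, the PES
        // length and the frame-number time stamp, and remember the frame number
        // together with the current write position in the target buffer.
    }
    std::vector<std::uint8_t>& buf = (pid == 15) ? videoBuf : audioBuf;
    buf.insert(buf.end(), pkt + pos, pkt + pos + payloadLen);
    return true;
}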


Fig 6.1. Flow chart of the demultiplexer: read a TS packet and check the sync byte; if valid, get the PID and the AFC flag, adjusting for the offset when AFC = 1, and extract the payload; when PUS = 1, check for the PES sync sequence and read the PES header (PES length, frame number, stream ID); write the payload into the video or audio buffer according to the PID, saving the frame number and buffer pointer location whenever a PES header is present; when the video buffer is full, search for the next IDR frame, calculate the corresponding audio frame, and write both video and audio buffer contents from those frames into their respective bitstream files (.264 and .aac).


After the payload has been written into the audio/video buffer, the video buffer is checked for fullness. Since video files are always much larger than audio files, the video buffer gets filled up first. Once the video buffer is full, the next occurring IDR frame is searched for in the video buffer. Once found, its frame number is noted and used to calculate the corresponding audio frame number that has to be played at that time, which is given by (6.1):

audio frame number = video frame number x (sampling frequency / 1024) / fps     (6.1)

The above equation is used to synchronize the audio and video streams. Once the frame numbers are obtained, the audio and video elementary streams can be constructed by writing the audio and video buffer contents from those frames into their respective elementary streams, i.e. the .aac and .264 files. The streams are then merged into a container format using the freely available mkvmerge software [31]. The resulting container file can be played back by a video player such as VLC media player [32] or GOM media player [33]. In the case of the video sequence, to ensure proper playback, the picture and sequence parameter sets must be inserted before the first IDR frame.
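A small sketch of (6.1), including the rounding discussed in section 6.1, is shown below; the example values are taken from table 7.3.

#include <cmath>

// Given the IDR video frame chosen as the playback start, compute the audio
// frame that plays at (approximately) the same instant, per (6.1):
// audio frame = video frame x (audio frame rate / video frame rate),
// rounded to the nearest integer.
unsigned correspondingAudioFrame(unsigned idrVideoFrame, double videoFps,
                                 double audioSamplingFreqHz)
{
    double audioFrameRate = audioSamplingFreqHz / 1024.0;   // audio frames per second
    double exact = idrVideoFrame * audioFrameRate / videoFps;
    return static_cast<unsigned>(std::lround(exact));
}

// Example: IDR frame 13 at 24 fps with 24 kHz audio gives
// 13 * (24000/1024) / 24 = 12.7, rounded to audio frame 13 (cf. table 7.3).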

The reason the demultiplexing is carried out from an IDR (instantaneous decoder refresh) frame is that the IDR frame breaks the sequence, making sure that later frames such as P frames do not use frames before the IDR frame as references for prediction. This is not true for a normal I frame. So, in a long sequence, the GOPs after the IDR frame are treated as a new sequence by the H.264 decoder. In the case of audio, the HE AAC v2 decoder can play back the sequence from any audio frame.


6.1 Lip or audio-video synchronization:

Synchronization in multimedia systems refers to the temporal relations that

exist between the media objects in a system. A temporal or time relation is the

relation between a video and an audio sequence during the time of recording. If

these objects are presented, the temporal relation during the presentations of

the two media objects must correspond to the temporal relation at the time of

recording.

Once the video buffer is full, the content (audio and video) is ready to be written into its elementary streams and played back. The audio ADTS elementary stream can be played back from any frame; for the H.264 stream, however, decoding can only start from an IDR frame. So the video buffer is searched for the next occurring IDR frame, and this frame number is used to calculate the corresponding audio frame using (6.1).

Once the audio frame number is obtained, the audio buffer content from that frame onwards is written into its elementary stream (.aac file). For the video elementary stream, however, the picture and sequence parameter sets (which were sent as a separate TS packet from the multiplexer) are inserted before writing from the IDR frame in the video buffer, because both the PPS and SPS information is used by the decoder to determine the encoding parameters used.

Since the output of (6.1) may not be a whole number, it is rounded off to the closest integer value. The theoretical maximum rounding error is half the audio frame duration, which depends on the sampling frequency of the audio stream. For example, at a sampling frequency of 24000 Hz the frame duration is 1024/24000, i.e. 42.6 ms, and the maximum latency will be 21.3 ms. For most audio streams the frame duration will not be more than 64 ms, hence the maximum latency will be 32 ms [33]. This latency/time difference is known as a


"skew" [47]. According to research, the "in-sync" region spans a skew of -80 ms (audio behind video) to +80 ms (video behind audio) [47]. In-sync refers to the range of skew values for which the synchronization error is not perceptible. MPEG-2 systems [4] define a skew threshold of ±40 ms. Once the streams are synchronized, the skew remains constant throughout. This possible maximum skew is a limitation of the project; however, the value remains well below the allowed range. Simulation results using the audio and video test sequences are presented in the next chapter.


CHAPTER 7

RESULTS

The multiplexing algorithm was implemented in MATLAB and the demultiplexing algorithm was implemented in C++. The JM (joint model) 16.1 reference software [35] and the 3GPP enhanced aacPlus encoder [36] were used for encoding the raw video and audio sequences respectively. The GOP structure adopted for video encoding was IPPP, using the H.264 baseline profile. For audio encoding the bitrate was set to 32 kbps to enable parametric stereo.

7.1 Buffer fullness:

As stated before, buffer overflow or underflow of the video and audio elementary streams can cause skips or freeze/mute errors in video and audio playback. Table 7.1 shows various video buffer sizes, the corresponding audio buffer size at that moment, and the playback times of the audio and video contents of the buffers. It can be observed that the content playback times differ by only about 20 ms. This means that when the video buffer is full (for any size of video buffer), almost all of the corresponding audio content is present in the audio buffer.


Video frames in buffer | Audio frames in buffer | Video buffer size (KB) | Audio buffer size (KB) | Video content playback time (s) | Audio content playback time (s)
100 | 98 | 771.076 | 17.49 | 4.166 | 4.181
200 | 196 | 1348.359 | 34.889 | 8.333 | 8.362
300 | 293 | 1770.271 | 52.122 | 12.5 | 12.51
400 | 391 | 2238.556 | 69.519 | 16.666 | 16.682
500 | 489 | 2612.134 | 86.949 | 20.833 | 20.864
600 | 586 | 3158.641 | 104.165 | 25 | 25.002
700 | 684 | 3696.039 | 121.627 | 29.166 | 29.184
800 | 782 | 4072.667 | 139.043 | 33.333 | 33.365
900 | 879 | 4500.471 | 156.216 | 37.5 | 37.504
1000 | 977 | 4981.05 | 173.657 | 41.666 | 41.685

Table 7.1. Video and audio buffer sizes and their respective playback times

7.2 Synchronization/skew calculation

Table 7.2 shows the results and various parameters of the test clips used. The results show that the compression ratio achieved by HE AAC v2 is of the order of 45 to 65, which is at least three times better than that achieved by core AAC alone. The H.264 video compression ratio is of the order of 100, which is due to the fact that the baseline profile is used. The net transport stream bitrate requirement is about 50 kBps, which can easily be accommodated in systems such as ATSC-M/H, which has an allocated bandwidth of 19.6 Mbps [17], i.e. 2450 kBps.


Parameter | Test clip 1 | Test clip 2
Clip length (sec) | 30 | 50
Video fps | 24 | 24
Audio sampling frequency (Hz) | 24000 | 24000
Total video frames | 721 | 1199
Total audio frames | 704 | 1173
Video raw file (.yuv) size (kB) | 105447 | 175354
Audio raw file (.wav) size (kB) | 5626 | 9379
H.264 file size (kB) | 1273 | 1365
AAC file size (kB) | 92 | 204
Video compression ratio | 82.82 | 128.4
Audio compression ratio | 61.15 | 45.97
H.264 encoder bitrate (kBps) | 42.43 | 27.3
AAC encoder bitrate (kbps) | 32 | 32
Total TS packets | 8741 | 9858
Transport stream size (kB) | 1605 | 1810
Transport stream bitrate (kBps) | 53.49 | 36.2
Test clip size (kB) | 1376.78 | 1576.6
Reconstructed clip size (kB) | 1312.45 | 1563.22

Table 7.2. Characteristics of the test clips used

Table 7.3 shows the skew for various starting TS packets. The delay column indicates the skew achieved when demultiplexing was started from different TS packet numbers. The maximum theoretical value is 21 ms because the sampling frequency used is 24,000 Hz (audio frame duration of 42 ms). As seen, the worst skew is 13 ms, and in most cases the skew is below 10 ms. This is well below the MPEG2 threshold of 40 ms. Chapter 8 outlines the conclusions, followed by future work, which is described in chapter 9.


Transport stream packet number | Video IDR frame number chosen | Audio frame number chosen | Chosen video frame presentation time (s) | Chosen audio frame presentation time (s) | Delay (ms) | Perceptible?
100 | 13 | 13 | 0.5416 | 0.5546 | 13 | no
300 | 29 | 28 | 1.208 | 1.1946 | 13 | no
400 | 33 | 32 | 1.375 | 1.365 | 9.6 | no
500 | 45 | 44 | 1.875 | 1.877 | 2.3 | no
600 | 53 | 52 | 2.208 | 2.218 | 10.6 | no
800 | 73 | 71 | 3.041 | 3.03 | 11 | no
100 | 89 | 87 | 3.708 | 3.712 | 4 | no

Table 7.3. Demultiplexer output


CHAPTER 8

CONCLUSIONS

This project implemented an effective multiplexing and demultiplexing scheme with synchronization. The latest codecs, H.264 and HE AAC v2, were used. Both encoders achieved very high compression ratios, and as a result the transport stream bitrate requirement was contained to about 50 kBps. Buffer fullness was also handled effectively, with the maximum buffer difference observed being around 20 ms of media content. During decoding, audio-video synchronization was achieved with a maximum skew of 13 ms.

CHAPTER 9

FUTURE WORK

This project implemented a multiplexing/demultiplexing algorithm for one audio and one video stream, i.e. a single program. The same scheme can be expanded to multiplex multiple programs by adding a program map table (PMT). The algorithm can also be modified to multiplex other elementary streams such as VC-1 [44], Dirac video [45], AC-3 [46] etc.

The present project used the standards specified by MPEG2 systems. The same multiplexing scheme can be applied to other transport mechanisms such as RTP/IP, which are used for applications like streaming video over the internet.

Since the transport stream is sent across networks that are prone to errors, an error correction scheme such as Reed-Solomon [43] or a CRC can be added while coding the transport stream (TS) packets.


References:

[1] MPEG-4: ISO/IEC JTC1/SC29 14496-10: Information technology – Coding of audio-visual

objects - Part 10: Advanced Video Coding, ISO/IEC, 2005.

[2] MPEG-4: ISO/IEC JTC1/SC29 14496-3: Information technology – coding of audio-visual

objects – part3: Audio, AMENDMENT 4: Audio lossless coding (ALS), new audio profiles and

BSAC extensions.

[3] MPEG-2: ISO/IEC JTC1/SC29 13818-7, Advanced audio coding, AAC. International Standard IS WG11, 1997.

[4]MPEG-2: ISO/IEC 13818-1 Information technology—generic coding of moving pictures and

associated audio—Part 1: Systems, ISO/IEC: 2005.

[5] Soon-kak Kwon et al, “Overview of H.264 / MPEG-4 Part 10 (pp.186-216)”, Special issue on “

Emerging H.264/AVC video coding standard”, J. Visual Communication and Image

Representation, vol. 17, pp.183-552, April. 2006.

[6] A. Puri et al, “Video coding using the H.264/MPEG-4 AVC compression standard”, Signal

Processing: Image Communication, vol.19, pp. 793-849, Oct. 2004.

[7] MPEG-4 HE-AAC v2 — audio coding for today's digital media world , article in the EBU

technical review (01/2006) giving explanations on HE-AAC. Link:

http://tech.ebu.ch/docs/techreview/trev_305-moser.pdf

[8]ETSI TS 101 154 “Implementation guidelines for the use of video and audio coding in

broadcasting applications based on the MPEG-2 transport stream”.

[9] 3GPP TS 26.401: General Audio Codec audio processing functions; Enhanced aacPlus

General Audio Codec; 2009

[10] 3GPP TS 26.403: Enhanced aacPlus general audio codec; Encoder Specification AAC part.

[11] 3GPP TS 26.404 : Enhanced aacPlus general audio codec; Encoder Specification SBR part.

[12] 3GPP TS 26.405: Enhanced aacPlus general audio codec; Encoder Specification Parametric

Stereo part.


[13] E. Schuijers et al, “Low complexity parametric stereo coding “,Audio engineering society,

May 2004 , Link: http://www.jeroenbreebaart.com/papers/aes/aes116_2.pdf

[14] MPEG Transport Stream. Link:

http://www.iptvdictionary.com/iptv_dictionary_MPEG_Transport_Stream_TS_definition.html

[15] MPEG-4: ISO/IEC JTC1/SC29 14496-14 : Information technology — coding of audio-visual

objects — Part 14 :MP4 file format, 2003

[16] DVB-H : Global mobile TV. Link : http://www.dvb-h.org/

[17] ATSC-M/H. Link : http://www.atsc.org/cms/

[18] Open mobile video coalition. Link : http://www.openmobilevideo.com/about-mobile-

dtv/standards/

[19] VC-1 Compressed Video Bitstream Format and Decoding Process (SMPTE 421M-

2006), SMPTE Standard, 2006 (http://store.smpte.org/category-s/1.htm).

[20] Henning Schulzrinne's RTP page. Link: http://www.cs.columbia.edu/~hgs/rtp/

[21] G.A.Davidson et al, “ATSC video and audio coding”, Proc. IEEE, vol.94, pp. 60-76,

Jan. 2006 (www.atsc.org).

[22] I. E.G.Richardson, “H.264 and MPEG-4 video compression: video coding for next-

generation multimedia”, Wiley, 2003.

[23] European Broadcasting Union, http://www.ebu.ch/

[24] Shintaro Ueda et al, "NAL level stream authentication for H.264/AVC", IPSJ Digital Courier, vol. 3, Feb. 2007.

[25] World DMB: link: http://www.worlddab.org/

[26] ISDB website. Link: http://www.dibeg.org/

[27] 3gpp website. Link: http://www.3gpp.org/

[28] M Modi, “Audio compression gets better and more complex”, link:

http://www.eetimes.com/discussion/other/4025543/Audio-compression-gets-better-and-

more-complex

[29] PA Sarginson,”MPEG-2: Overview of systems layer”, Link:

http://downloads.bbc.co.uk/rd/pubs/reports/1996-02.pdf


[30] MPEG-2 ISO/IEC 13818-1: GENERIC CODING OF MOVING PICTURES AND AUDIO: part 1-

SYSTEMS Amendment 3: Transport of AVC video data over ITU-T Rec H.222.0 |ISO/IEC 13818-1

streams, 2003

[31] MKV merge software. Link: http://www.matroska.org/

[32] VLC media player. Link: http://www.videolan.org/

[33] Gom media player. Link: http://www.gomlab.com/

[34] H. Murugan, "Multiplexing H264 video bit-stream with AAC audio bit-stream,

demultiplexing and achieving lip sync during playback”, M.S.E.E Thesis, University of Texas at

Arlington, TX May 2007.

[35] H.264/AVC JM Software link: http://iphome.hhi.de/suehring/tml/download/.

[36] 3GPP Enhanced aacPlus reference software. Link: http://www.3gpp.org/ftp/

[37] MPEG–2: ISO/IEC JTC1/SC29 13818–2, Information technology -- Generic coding of moving

pictures and associated audio information: Part 2 - Video, ISO/IEC, 2000.

[38] MPEG–4: ISO/IEC JTC1/SC29 14496–2, Information technology – Coding of audio visual

objects: Part 2 - visual, ISO/IEC, 2004.

[39] T. Wiegand et al, ”Overview of the H.264/AVC Video Coding Standard ”, IEEE Trans. CSVT,

Vol. 13, pp. 560-576, July 2003.

[40] ATSC-Mobile DTV Standard, part 7 – AVC and SVC video system characteristics. Link:

http://www.atsc.org/cms/standards/a153/a_153-Part-7-2009.pdf

[41] ATSC-Mobile DTV Standard, part 8 – HE AAC audio system characteristics. Link:

http://www.atsc.org/cms/standards/a153/a_153-Part-8-2009.pdf

[42] H.264 Video Codec - Inter Prediction. Link:

http://mrutyunjayahiremath.blogspot.com/2010/09/h264-inter-predn.html

[43] B.A. Cipra, "The Ubiquitous Reed-Solomon Codes". Link:

http://www.eccpage.com/reed_solomon_codes.html


[44] VC1 technical overview .link:

http://www.microsoft.com/windows/windowsmedia/howto/articles/vc1techoverview.aspx

[45] Dirac video compression website. Link: http://diracvideo.org/

[46] MPEG-2: ISO/IEC JTC1/SC29/WG11 13818-3: Coding of moving pictures and associated audio: Part 3 - Audio, Nov. 1994.

[47] G. Blakowski et al, “A Media Synchronization Survey: Reference Model, Specification, and

Case Studies”, IEEE Journal on selected areas in communications, VOL. 14, NO. 1, Jan 1996.