MPEG-21 bitstream syntax descriptions for scalable video codecs


Multimedia Systems (2006) 11(5): 403–421. DOI 10.1007/s00530-006-0021-5

REGULAR PAPER

Davy De Schrijver · Chris Poppe · Sam Lerouge · Wesley De Neve · Rik Van de Walle

    MPEG-21 bitstream syntax descriptions for scalable video codecs

Published online: 8 February 2006. © Springer-Verlag 2006

Abstract In order to obtain a useful multichannel publication environment, a content producer has to respect the different terminal and network characteristics of the multimedia devices of its target audience. Embedded scalable video bitstreams, together with a complementary content adaptation framework, make it possible to respond to heterogeneous usage environments. In this paper, temporally scalable H.264/MPEG-4 AVC encoded bitstreams and bitstreams encoded by relying on the fully-embedded MC-EZBC wavelet-based codec are used. The MPEG-21 Bitstream Syntax Description Language (BSDL) specification is used to generate high-level XML descriptions of the structure of a bitstream. As such, the adaptation of a scalable video stream can be realized in the XML domain, rather than on the bitstream itself. Different transformation technologies are compared to each other as well. Finally, a practical setup of a video streaming use case is discussed by relying on the MPEG-21 BSDL framework.

Keywords Content adaptation · H.264/MPEG-4 AVC · MPEG-21 BSDL · Scalable video coding · Universal Multimedia Access

    1 Introduction

Since the 1950s, people have been able to watch news bulletins on the television. Now, there is the opportunity to follow digital news feeds on the Internet; and in the near future, people will be able to see the news on mobile devices such as

D. De Schrijver (✉) · S. Lerouge · W. De Neve
Department of Electronics and Information Systems, Multimedia Lab, Ghent University – IBBT, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium
E-mail: {davy.deschrijver, sam.lerouge, wesley.deneve}@ugent.be

C. Poppe · R. Van de Walle
Department of Electronics and Information Systems, Multimedia Lab, Ghent University – IBBT – IMEC, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium
E-mail: {chris.poppe, rik.vandewalle}@ugent.be

cellphones, Personal Digital Assistants (PDAs), etc. These different devices and networks all have dissimilar characteristics. In order to provide the end-users with the highest-quality content, a multichannel publication environment is required.

In a multichannel publication environment, content providers want to give their users the possibility to consume content on all possible devices without having to create a different version (of the content) for each possible device. In other words, a content provider wants to create content once and then publish it on every device. This concept is better known as Universal Multimedia Access (UMA) [1].

In this paper, attention will be paid to one type of content, in particular digital video. The latter plays an important role in our modern data communication and will become even more important in the future. Think, for instance, about the usage of digital video in mobile data communication. Memory, processing power, and often a considerable amount of bandwidth are needed to consume a video fragment. To extract multiple reduced bitstreams from one basic video stream (for example, bitstreams with a lower spatial resolution or lower quality), a scalable video bitstream is needed, as well as a decision-taking mechanism to determine the best possible stream for the targeted usage environment.

How to adapt a bitstream in a codec-independent manner will be discussed in this paper. The Bitstream Syntax Description Language (BSDL), as defined in the MPEG-21 standard [2], is used to achieve this goal. MPEG-21 BSDL generates descriptions in the Extensible Markup Language (XML) of the high-level structure of a bitstream, allowing the adaptation to be realized in the XML domain [3]. Therefore, an XML transformation is needed. In order to achieve the transformations, different approaches will be compared to each other in terms of execution times and memory consumption.

The outline of the paper is as follows: in Sect. 2, we discuss the functioning of the BSDL framework and how BSDL is enclosed in MPEG-21. Further in this section, a small overview is provided of wavelet-based Scalable Video Coding (SVC), together with an explanation of how to exploit temporal scalability in H.264/MPEG-4 part 10


Advanced Video Coding (AVC). The section is closed by an overview of related work on bitstream structure description languages and scalable coding schemes. In Sect. 3, possible Bitstream Syntax (BS) schemata are described for the scalable video codecs of Sect. 2, and it is explained how scalability can be exploited in a theoretical manner. In Sect. 4, we discuss how we have tested the performance of the BSDL framework by using the MPEG-21 reference software and the different implementations of the transformations. In Sect. 5, the performance of the different steps in the BSDL framework is discussed. A practical use case with regard to a format-agnostic video adaptation framework for streaming applications is discussed in Sect. 6. Finally, the conclusions are provided in Sect. 7.

    2 Technologies used

In this section, an overview is given of the workflow of an MPEG-21 BSDL-based adaptation framework. After that, we explain how an embedded wavelet-based scalable video codec works and how temporally scalable H.264/AVC bitstreams can be generated in an elegant manner. We close this section with an overview of related work.

    2.1 Bitstream syntax description language (BSDL)

The Bitstream Syntax Description Language (BSDL) is part of the Bitstream Syntax Description tool, which is in its turn embedded in the Digital Item Adaptation part of MPEG-21 [4]. MPEG-21, also known as ISO/IEC 21000, is the latest standard of the Moving Picture Experts Group (MPEG). In contrast to the other MPEG standards, MPEG-21 is a generic framework for multimedia production and consumption. MPEG-21 is built around one central concept: the Digital Item (DI). All communication within the framework is based on the transaction of DIs. A Digital Item is a structured digital object represented in XML that contains content and metadata. Because of its broad scope, MPEG-21 is divided into different parts. A complete overview of these parts is given in [5]. The part to which BSDL belongs is part 7: Digital Item Adaptation (DIA).

DIA defines mechanisms for the adaptation of a DI and the resources that the DI contains. The framework is developed to achieve interoperable transparent multimedia access, hereby taking into account the network and terminal characteristics. DIA comprises several tools [2], and one of these tools is Bitstream Syntax Description (BSD). This tool consists of two related concepts, in particular BSDL and the more generic approach, the generic Bitstream Syntax Schema.

The idea behind BSDL is to develop a mechanism that is able to automatically generate a description of the (high-level) structure of a bitstream [6]. This description is formatted in XML; hereby, it should be easier to adapt an XML description than to directly modify the bits of the bitstream. The descriptions act as an abstraction of the compressed bitstream since their high-level nature only requires a limited knowledge of the structure of the media resource in question. Moreover, once we have the description of the bitstream in XML, it is possible to add extra information (metadata) about the video fragment or bitstream (for example, semantic information about the stream). These metadata, which are encapsulated in the description, can be used to adapt a bitstream description in a more intelligent way (for example, to extract only the foreign news from a complete news bulletin). This kind of adaptation is not directly possible on the bitstream because a bitstream contains no semantic information. The (adapted) bitstream can be generated automatically from the adapted description once the latter is available.

The complete workflow of BSDL is sketched in Fig. 1 and a simple example is given in Fig. 2. We start from a given bitstream (encoded with a certain codec) and we want to obtain a high-level description of this bitstream in XML. In the example, the bitstream is described not bit-by-bit but frame-by-frame (of course, a frame-based description of a video is much more high-level than a bit-based description). Therefore, a BS schema has to be developed specifically for this codec. This BS schema is a representation of all possible

    Fig. 1 MPEG-21 BSDL framework

    Fig. 2 Simple example of the MPEG-21 BSDL framework


kinds of bitstreams that can be generated by the encoder. A BS schema defines the structure of a possible bitstream just as an XML schema defines the allowable structure of a group of XML documents [7]. Once we know what all possible bitstreams look like and what the given bitstream is, it is possible to generate the bitstream description. Therefore, once we have the BS schema, we can generate a description for each possible bitstream that uses the corresponding codec. This generation can be done automatically with the BintoBSD tool, as represented in the figures. How to implement this engine is completely open: only the structure and the tags that can appear within the BS schema are standardized. As output of the engine, we get an XML description that represents the structure of the original bitstream.

During the next step, we can transform the XML description. The transformation is the XML equivalent of the adaptation of the original bitstream. How to transform the description is also not standardized. For example, one can use Extensible Stylesheet Language Transformations (XSLT, [8]), or an implementation using an API such as the Simple API for XML (SAX) or the Document Object Model (DOM) when a more complex transformation is needed. The only restriction on the transformation is that the adapted description has to be valid with respect to the appropriate BS schema. In our example, we have eliminated the second frame, something that can easily be expressed by an XSLT transformation.
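The frame-removal example above can be sketched in a few lines using a DOM-style library. The description below is a hypothetical frame-level BSD: the element names and the "offset length" payloads are invented for illustration and are not the normative BSDL vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical frame-level bitstream description in the spirit of Fig. 2;
# element names and byte offsets are illustrative only.
bsd = """<bitstream>
  <header>0 100</header>
  <frame>100 4000</frame>
  <frame>4100 3800</frame>
  <frame>7900 3900</frame>
</bitstream>"""

root = ET.fromstring(bsd)
frames = root.findall("frame")
root.remove(frames[1])          # drop the second frame in the XML domain

adapted = ET.tostring(root, encoding="unicode")
```

An XSLT stylesheet consisting of an identity template plus one empty template matching the second frame element would express the same transformation declaratively.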

The last step is the generation of the adapted bitstream; for this, the BSDtoBin tool can be used. This tool takes as input the adapted bitstream description and the BS schema, but usually also the original bitstream, because the description can contain references to data blocks (of bytes) in the original bitstream (for example, copy all the bytes between byte 1100 and byte 1180).

Note that the adapted description is not the bitstream description of the adapted bitstream. In our example, it is clear that the adapted description (after the transformation) is not the same XML fragment as the description of the adapted bitstream that we would get after using the adapted bitstream as input for the BintoBSD engine.
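The byte-range mechanism that BSDtoBin relies on can be illustrated with a minimal sketch. The (offset, length) pair format below is invented for illustration; the real tool reads such references from the adapted XML description together with the BS schema.

```python
# Minimal sketch of the BSDtoBin idea: the adapted description holds
# references into the original bitstream, and the adapted bitstream is
# produced by copying the referenced byte ranges in order.
def bsd_to_bin(original: bytes, ranges):
    """Concatenate the referenced (offset, length) ranges of the original."""
    return b"".join(original[off:off + length] for off, length in ranges)

original = bytes(range(256))      # stand-in for an encoded bitstream
ranges = [(0, 10), (100, 80)]     # e.g. header + "bytes 100 to 179"
adapted = bsd_to_bin(original, ranges)
```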

The BSDL approach differs from current adaptive video frameworks in the fact that the server does not have to know what kind of media bitstreams are stored in its underlying storage system. The server contains BS schemata for the different structures of the stored bitstreams and can generate descriptions and the adapted bitstreams in a format-agnostic way by using the BintoBSD and BSDtoBin tools. The functioning of these tools is standardized and every MPEG-21-compliant terminal should follow this specification.

    2.2 Scalable video coding

    2.2.1 Fully scalable video coding: MC-EZBC

Traditional video coding algorithms as used in standards such as MPEG-1 Video, MPEG-2 Video, and H.264/AVC are not designed to support embedded scalability. In these traditional block-based coding algorithms, a specific bit budget is given to the encoder and the latter has to compress the stream as well as possible considering the targeted bit budget. In most cases, it will not be straightforward to extract a new bitstream at a lower bitrate once we have the (compressed) bitstream.

Embedded scalability refers to the possibility to transcode a given bitstream to another stream containing the same content information but, for instance, at a lower frame rate, resolution, or visual quality, without a complete decode-encode phase. The structure of the bitstream has to be built in such a manner that other versions can be extracted almost immediately and in a very fast way. Such a technology is, for instance, also present in the JPEG2000 specification for still image coding [9]. The transcoding can be seen as simple editing operations on the structure of the bitstream, resulting in a limited usage of resources.

The encoder of a fully scalable video codec needs no information about the target bitrate but generates a (possibly lossless) parent bitstream. Starting from this stream, other streams can be generated with lower bitrates, frame rates, or spatial resolutions by using simple bitstream editing operations. A wavelet-based coding algorithm is considered in this paper, in particular Motion Compensated Embedded Zero Block Coding (MC-EZBC). This codec offers embedded scalability in terms of quality, spatial resolution, and frame rate [10].

The MC-EZBC algorithm uses a t+2D wavelet transform for implementing the three kinds of scalability. The codec in question contains three main parts. The first part is the encoder that generates the near-lossless parent bitstream. This step must be executed once for each video sequence. In the second phase, we can filter out different versions of the (parent) stream having a lower frame rate, lower quality, or lower resolution by using the so-called pull function of the MC-EZBC codec. In general, the new bitstreams are still scalable, meaning that they can be the subject of further adaptations. During the last step, we decode the (filtered) stream.

It is not possible to describe the entire algorithm in detail in this paper, but a limited overview is given of how parent bitstreams can be generated. For more details, the reader is referred to [10]. The encoder goes through a number of steps in order to obtain a parent bitstream.

During the first step, the encoder reads all the frames of the current Group of Pictures (GOP). The length of a GOP is a parameter of the encoder. It is important to use a good value for this parameter: it determines the number of temporal decompositions in the second step and therefore the number of different frame rates that can be offered. Typical values for the length of a GOP are 8, 16, or 32. We will use GOP sizes of 16 frames for the rest of the paper.
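The relation between the GOP length and the number of offered frame rates can be made concrete with a small helper, assuming a dyadic (halving) temporal decomposition as used by MC-EZBC:

```python
import math

# A GOP of length 2**n allows n temporal decompositions, so n + 1 frame
# rates can be offered: the full rate and each halving down to 1/2**n.
def temporal_levels(gop_length: int) -> int:
    assert gop_length & (gop_length - 1) == 0, "GOP length must be a power of two"
    return int(math.log2(gop_length))

for gop in (8, 16, 32):
    n = temporal_levels(gop)
    print(gop, "frames:", n, "decompositions,", n + 1, "frame rates")
```

For the GOP size of 16 used in this paper, four decompositions and thus five frame rates (1, 1/2, 1/4, 1/8, 1/16 of the original) are available.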

In the second step, Motion Compensated Temporal Filtering (MCTF) is applied. The first phase of this step is the motion estimation on the original frames. The motion estimation is necessary to eliminate the temporal correlation between two consecutive frames [11]. During the second phase


Fig. 3 Full temporal decomposition of a GOP structure containing 16 frames

of this step, a (Haar) wavelet filter is used in the temporal domain, utilizing the information from the motion vectors to obtain low- and high-frequency temporal subbands (indicated as L or H frames, respectively). This wavelet filter is applied recursively to the low-frequency subbands (in other words, the decomposition of an L frame gives us LL and LH frames). The structure of a complete decomposition is given in Fig. 3. To reconstruct the 16 original frames, we need one LLLL frame, one LLLH frame, two LLH frames, four LH frames, and eight H frames.
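The recursive decomposition of Fig. 3 can be sketched by tracking only the subband labels and frame counts; motion compensation and the actual filtering are omitted in this sketch.

```python
# Subband label -> frame count for a full dyadic temporal decomposition.
# Each level splits the remaining low band into half H (high-frequency)
# frames and recurses on the low half with an extra "L" prefix.
def mctf_subbands(n, prefix=""):
    if n == 1:
        return {prefix: 1}              # the final low-pass frame, e.g. LLLL
    bands = {prefix + "H": n // 2}      # high-frequency frames at this level
    bands.update(mctf_subbands(n // 2, prefix + "L"))
    return bands

print(mctf_subbands(16))
# A 16-frame GOP yields 8 H, 4 LH, 2 LLH, 1 LLLH, and 1 LLLL frame,
# matching Fig. 3; dropping the 8 H frames halves the frame rate.
```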

In Fig. 3, one can also see how temporal scalability can be realized in a straightforward way. When the eight H frames are eliminated, the eight L frames can still be reconstructed. The latter are a good estimation of the original sequence but at a lower frame rate (in particular, half of the original frame rate).

After the temporal decomposition, a spatial subband decomposition is executed, comparable to the JPEG2000 standard [9]. The Daubechies 9/7 analysis/synthesis filter is used [12] (which is another wavelet filter pair). This filter divides a frame into one low-pass frame and three detail (or high-pass) frames. In Fig. 4, one can see a two-stage decomposition: the picture in the upper left corner is the low-pass picture; the others are the detail pictures. From Fig. 4, it is easy to understand that spatial scalability can be obtained by discarding the detail pictures.
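A one-level 2D decomposition can be sketched with a Haar filter standing in for the Daubechies 9/7 pair; the real codec uses lifting and boundary handling, both omitted here. Rows are filtered first, then columns, giving one low-pass band (LL) and three detail bands (LH, HL, HH); keeping only LL halves the resolution in each dimension.

```python
def haar_1d(row):
    """One Haar step on a 1D signal: averages (low) and differences (high)."""
    low = [(a + b) / 2 for a, b in zip(row[::2], row[1::2])]
    high = [(a - b) / 2 for a, b in zip(row[::2], row[1::2])]
    return low, high

def haar_2d(image):
    """One-level 2D decomposition: filter rows, then columns of each half."""
    rows = [haar_1d(r) for r in image]
    lo = [l for l, _ in rows]
    hi = [h for _, h in rows]

    def filter_columns(block):
        cols = list(map(list, zip(*block)))           # transpose to columns
        filtered = [haar_1d(c) for c in cols]
        low = list(map(list, zip(*[l for l, _ in filtered])))
        high = list(map(list, zip(*[h for _, h in filtered])))
        return low, high

    ll, lh = filter_columns(lo)
    hl, hh = filter_columns(hi)
    return ll, lh, hl, hh

image = [[(x + y) % 7 for x in range(8)] for y in range(8)]
ll, lh, hl, hh = haar_2d(image)     # each band is 4x4: half the resolution
```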

In the last step, the coefficients of the subbands and the motion vectors are entropy encoded and the actual bitstream is generated. The coefficients of the subbands are encoded by using a bitplane encoder, in particular the Embedded Zero Block Coder (EZBC) [13]. The motion vectors are encoded with Differential Pulse Code Modulation (DPCM) and an arithmetic entropy coding scheme.

    Fig. 4 Two-stage spatial wavelet decomposition

For content providers, scalable video coding is an interesting concept because they only have to generate one bitstream in order to meet the requirements of a heterogeneous usage environment. Therefore, only one bitstream for all devices and networks is necessary. This is in contrast with traditional coding techniques, where one bitstream is needed for every possible device and network configuration (simulstore).

    2.2.2 Temporal scalability in H.264/AVC

Currently, ITU-T H.264/MPEG-4 part 10 AVC [14] is the best standardized video specification in terms of compression efficiency. Notwithstanding the fact that the standard is not meant to generate fully scalable bitstreams, it defines a mechanism to exploit temporal scalability. This feature is based on the sub-sequence coding technique, a tool that will be explained further in this section. It is a further development of what is presented in [15].

The elementary unit of processing in the context of the H.264/AVC specification [16] is a slice and not a picture (in contrast with other MPEG and VCEG video standards). There are three types of slices, in particular I, P, and B slices. A picture can contain different types of slices, and macroblocks in slices can be used as references for the prediction of macroblocks in other slices. This means that, in contrast with, for example, MPEG-2, frames that contain only B slices cannot be removed with the certainty that the resulting bitstream can still be decoded. To eliminate these side effects after removing frames from a video sequence, the H.264/AVC standard defines a mechanism, in particular sub-sequences, in order to be able to remove frames in a valid manner. Tian et al. [17] explain the usage of sub-sequences and their positive impact on the compression ratio. We use


Fig. 5 Example of a bitstream with sub-sequences: coding pattern IPpPBbP

these sub-sequences to exploit temporal scalability in an H.264/AVC bitstream. A sub-sequence represents a number of inter-dependent frames that can be removed without affecting the decoding of any other sub-sequence in the same sub-sequence layer or of any sub-sequence in any lower sub-sequence layer. To obtain a layered sub-sequence structure, an explicit GOP structure is defined during the encoding phase. In Fig. 5, an example of a possible explicit GOP structure is given. The pattern of this structure is IPpPBbP (a lowercase letter indicates a non-referenced picture). Removing all sub-sequences of the highest layer will not disturb the decoding process. We can summarize that each frame belongs to exactly one sub-sequence, and that each sub-sequence belongs to exactly one sub-sequence layer. Note that the decoder does not have to know that the bitstream is encoded based on an explicit GOP structure, as long as the decoder receives the slices or frames in the correct decoding order.
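The layered removal rule can be modelled with a toy sketch. The layer assignment for the IPpPBbP pattern below is illustrative, not taken from the standard: the point is only that dropping every frame of the highest layer leaves a stream whose remaining dependencies are intact.

```python
# Each frame is (slice type, sub-sequence layer); lowercase = non-referenced.
# The layer numbers here are an illustrative assignment for IPpPBbP.
frames = [
    ("I", 0), ("P", 0), ("p", 1), ("P", 0),
    ("B", 1), ("b", 2), ("P", 0),
]

def drop_highest_layer(frames):
    """Remove all frames of the highest sub-sequence layer."""
    top = max(layer for _, layer in frames)
    return [(t, l) for t, l in frames if l != top]

reduced = drop_highest_layer(frames)   # the layer-2 'b' frame is removed
```

Applying the function repeatedly peels off one temporal layer at a time, which is exactly how a streaming server would thin the bitstream.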

    2.3 Related work

In this section, an overview is given of the state of the art in bitstream structure description languages and scalable video coding algorithms. We discuss the different technologies and standards.

    2.3.1 Bitstream structure description languages

As already mentioned in the Introduction, the bitstream structure description language that we use in this paper is MPEG-21 BSDL. BSDL is a part of the bigger MPEG-21 DIA standard. This standard also contains another framework to describe bitstreams at a higher level, in particular the generic Bitstream Syntax Schema (gBS Schema) [18]. A gBS Description (gBSD) describes, just as a BSD, a bitstream in XML, but, in contrast with the BSDL framework, the descriptions cannot be generated by a generic tool such as the BintoBSD tool in the BSDL framework. A gBSD validates against a general standardized XML schema which is not codec-specific. To steer the transformation in the XML domain, it is possible to add markers to the description, and based on these markers an adaptation can be executed. The number of different markers defines the granularity of the description and the number of transformations that can be executed. We conclude that a gBS Schema is format-independent (all possible gBSDs validate against the same standardized gBS Schema) but a gBSD is application-dependent (depending on the application, the number of adaptations is defined). This is in contrast with BSDL, which is format-dependent (for every video specification, we need another BS schema, and bitstream descriptions generated for different specifications validate against different BS schemata) but application-independent (all kinds of embedded adaptations can be executed on the generated descriptions).

The Formal Language for Audio-Visual Object Representation (FLAVOR) [19] is another language that can be used for multimedia representation. This language has a number of similarities with the MPEG-21 languages explained above, in particular with gBSD and BSDL. FLAVOR was initially designed to automatically generate parsers for multimedia bitstreams. The FLAVOR code describes the structure of a bitstream, and this code can in its turn be translated into C++ or Java code. The automatically generated C++ or Java code constitutes a parser for the bitstream structure that is described by the FLAVOR code. In contrast with BSDL and gBSD, FLAVOR describes the structures on a bit-per-bit basis and not at a higher level. Another shortcoming of this language is the lack of an XML-based representation of the description. Therefore, FLAVOR was enhanced to support XML features, and this modified framework is called XFLAVOR [20]. Based on an XFLAVOR description, it is not only possible to automatically generate the source code for a parser, but it is also possible to automatically generate a W3C XML schema. The latter can be used to validate the XML descriptions that are produced by the automatically generated parser. Because (X)FLAVOR code is a bit-per-bit description of a bitstream structure, the XML document embeds the complete bitstream data, with the result that the description typically cannot be represented in a compact way (the descriptions can be larger than the original bitstream). That is the reason why XFLAVOR was not chosen as the bitstream structure description language.

MPEG video Markup Language (MPML, [21]) is a description language that has the same functionality as XFLAVOR, in particular a bit-per-bit representation of the bitstream. This language was developed specifically for MPEG-4 Visual content, without any possibility of portability to another video specification. Because we want to obtain a codec-independent adaptation framework, this language is not a feasible candidate as a generic bitstream structure description language.

    2.3.2 Scalable video coding

In Sect. 2.2.1, we have discussed an experimental fully scalable wavelet-based video coding scheme, and in Sect. 2.2.2, we have explained how temporal scalability can be exploited in H.264/AVC. These two coding schemes give us an idea


of how a future fully scalable video codec can work. At the time of writing, a new Scalable Video Coding (SVC) standard based on both discussed technologies is under standardization by the Joint Video Team (JVT), and it will take until 2007 before the standardization is finished. The two approaches are used in this paper to explain how a future adaptation framework can work. Current video adaptation systems are not based on that kind of fully embedded scalable bitstreams. The most intelligent adaptation systems can make use of the scalability features encapsulated in the MPEG-2 Video [22] and MPEG-4 Visual [23] standards. MPEG-2 provides multiple types of scalability, but the objective of these features was not to adapt a bitstream to a given bitrate, but rather to adapt the decoding complexity to the decoding device. As a consequence, these extensions are not used in practice. A scalable MPEG-2 bitstream consists of a number of layers, in particular a base layer and an enhancement layer. The base layer always has to be decoded, and the more bits of the enhancement layer are decoded, the better the video quality will be. The bitstreams have a layered structure and are not embedded scalable.

The Fine Granular Scalability (FGS) feature of MPEG-4 Visual [24] provides the possibility of embedded SNR scalability. An FGS-encoded bitstream also consists of one base layer and one enhancement layer that improves the quality of the decoded video sequence. In contrast to MPEG-2, the streaming server can truncate the enhancement layer at any point and send a smaller bitstream than the original one to a receiver. Nevertheless, MPEG-4 Visual provides no solution to generate fully embedded scalable video bitstreams.

    3 BSDL in the context of scalable bitstreams

In Sect. 2, the MPEG-21 BSDL standard, the MC-EZBC wavelet-based codec, and the possibilities to exploit temporal scalability in H.264/AVC were explained. In this section, we discuss how we have developed a BS schema for the two bitstream structures in question and how the embedded scalabilities can be exploited.

    3.1 BS schema for MC-EZBC bitstreams

3.1.1 Structure of an MC-EZBC bitstream

The structure of a bitstream is codec-dependent in the case of the MC-EZBC compressor. In this paper, we use the implementation of the MC-EZBC codec that was released in September 2003 [25]. The complete structure of an encoded bitstream is given in Fig. 6. The first part of the bitstream contains a header, followed by the payload of the video. The header contains general information about the bitstream. Some information is necessary to play back the video correctly, like the frame rate, the resolution, the number of frames, and so on. Other information will be used by the decoder to decode the bitstream in the correct manner, like

    Fig. 6 Structure of a generated MC-EZBC bitstream

t_level (reduction of the number of temporal levels, resulting in temporal scalability); s_level (reduction of the spatial resolution); and so on. The complete header always has a fixed length of 100 bytes.
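Parsing such a fixed-size header can be sketched as follows. The field order and binary types below are invented for illustration; the paper does not specify the exact byte layout of the MC-EZBC header, only that it is 100 bytes and which kinds of fields it carries.

```python
import struct

# Hypothetical layout in the spirit of Fig. 6: six little-endian fields,
# padded to the fixed 100-byte header length. NOT the actual MC-EZBC format.
HEADER_FMT = "<IIIfII"   # width, height, frame_count, frame_rate, t_level, s_level
HEADER_SIZE = 100

def parse_header(data: bytes):
    width, height, frames, fps, t_level, s_level = struct.unpack_from(HEADER_FMT, data)
    return {"width": width, "height": height, "frames": frames,
            "fps": fps, "t_level": t_level, "s_level": s_level}

raw = struct.pack(HEADER_FMT, 352, 288, 300, 30.0, 0, 0).ljust(HEADER_SIZE, b"\0")
info = parse_header(raw)
```

Because the header has a fixed length, a BS schema can describe it as a single 100-byte block and expose only the fields (such as t_level and s_level) that the adaptation needs.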

After the header, we obtain the actual payload of the video sequence. The first part of the payload contains the sizes of the different GOPs. This is important because the decoder has to know how many bytes to read before the decoding can start. The bitstream contains no start codes (in contrast with other video specifications such as MPEG-4 Visual). This implies that it is impossible to use this kind of stream for live streaming (for example, live television broadcasting).

In the bitstream, the GOPs are stored one after the other. The number of pictures in a GOP structure depends on the number of temporal decompositions. This number has to be entered at the start of the encoding phase and can be found in the header of the bitstream (GOP sizes of 16 frames are used in this paper).

Each GOP structure starts with a GOP header. This header contains the results of the motion estimation. Based on this information, the decoder knows whether motion vectors are available or not. After the GOP header, the content of the GOP follows, ordered by temporal level: starting from the last decomposition step (the fourth decomposition in Fig. 3) up to the first level. Between two decomposition levels, the motion vectors of the current level are included. Each temporal subband is split into a number of spatial subbands; this is necessary to exploit spatial scalability. The number of spatial subbands depends on the parameter s_level (which is also included in the bitstream header).

    3.1.2 Constructing a BS schema

Once we know the structure of the bitstream, it is rather straightforward to generate the corresponding BS schema. After a first implementation of the schema, conforming to the MPEG-21 BSDL standard, version 1.2.1 of the BintoBSD tool (which is part of the MPEG-21 reference software) turned out to be unacceptably slow. The reason for the long execution time was the usage of complex XPath expressions


in the schema definition. To resolve this problem, we have made use of some non-normative extensions to the current BSDL specification, in particular the implementation construction. This extension makes it possible to construct Java classes containing complex computations or datatypes that cannot be described by relying on the normative language constructs. We have used this construction to decrease the calculation time of the XPath expressions. In almost all XPath expressions, we have to calculate the number of spatial levels, which is a complex computation. Therefore, we calculate this number once, save it in the header tag of the bitstream description, and every time we need this number, we refer to the place in the header tag with the correct data (without recalculating the number of spatial levels). Of course, because we are using these data only in our XPath expressions, the data are not included in the bitstream; they are a kind of metadata.

As such, a bitstream syntax description of the scalable bitstream can be generated in an acceptable amount of time by using the BintoBSD tool of the MPEG-21 reference software, i.e., the execution of the BintoBSD process is faster than the actual encoding of the original frames. Once we have the corresponding XML description, we can transform or adapt it in order to exploit the three types of scalability that are supported by MC-EZBC. To understand how the transformations work, we need to know what an XML description looks like. In Fig. 7, the global structure of a BSD is given (which is a simplification of the MC-EZBC-specific BS schema).

Note the similarities between Fig. 7 and the structure of the bitstream in Fig. 6. It is clear that the BS schema can be considered as the XML schema equivalent of the structure of a bitstream.

    In order to extract another bitstream from the parent stream without the usage of BSDL, the composition of the complete bitstream has to be known down to the bit level. Even when the structure of the bitstream is known, such as in Fig. 6, it is not easy to implement a program that generates new bitstreams with, for example, a lower frame rate. It is much more convenient to change the XML description and to make use of this adapted description in order to generate a new bitstream (by relying on the BSDtoBin tool). In the next paragraphs, we will explain how the XML descriptions can be adapted in order to exploit the three kinds of embedded scalability.

    Fig. 7 Structure of a generated bitstream description

    Fig. 8 XML example of temporal scalability

    3.1.3 Temporal scalability

    In the case of temporal scalability, we want to obtain a video with a lower frame rate. Therefore, some frames have to be discarded in such a way that the quality of the video is still acceptable, which means as smooth as possible. The latter can be realized by eliminating the highest content levels in each GOP tag as depicted in Fig. 7. This works because it corresponds to the elimination of the eight H (high frequency) frames in Fig. 3, as explained in Sect. 2.2.1. In Fig. 8, an XML example is given in which the frame rate is reduced by a factor of two.
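The arithmetic behind this is simple repeated halving; a minimal sketch (the method name is ours, not part of any codec API):

```java
public class TemporalLevels {
    // Frames remaining in a GOP after removing the given number of
    // temporal decomposition levels: each removed level discards the
    // high-frequency (H) frames, i.e. half of the remaining frames.
    static int framesPerGop(int gopSize, int levelsRemoved) {
        return gopSize >> levelsRemoved; // gopSize / 2^levelsRemoved
    }

    public static void main(String[] args) {
        // A 16-frame GOP: removing one level drops the eight H frames.
        System.out.println(framesPerGop(16, 1)); // prints 8
        System.out.println(framesPerGop(16, 2)); // prints 4
    }
}
```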

    How these tags are removed is not specified in the standard. It can be done manually (by editing the XML document), by using XSLT, or by implementing a program that uses a SAX or DOM library. In the next section, we discuss the consequences of these choices with respect to the execution times and memory consumption of the transformation.

    3.1.4 Spatial scalability

    After the exploitation of spatial scalability, a video fragment is obtained having a lower resolution than the parent video stream. This can be achieved by eliminating the detail frames or high-frequency bands in Fig. 4. From Fig. 7, we can deduce that we have to discard the highest subbands of each substream (the exact number depends on how many spatial levels have to be eliminated). In Fig. 9, an example is given whereby the resolution is reduced by a factor of two during the adaptation.

    3.1.5 SNR scalability

    This form of scalability is probably the most important one. Hereby, a bitstream with a lower bitrate has to be generated. Normally, the bitrate in question will be given at the start of the transformation. Every subband has to be truncated in such a manner that the given global bitrate is respected, hereby delivering a constant quality that is as high as possible. How to achieve the best quality for a given bitrate is a subjective problem that is not in the scope of this paper [26].


    Fig. 9 XML example of spatial scalability

    As one can see in Fig. 6, the structure of the bitstreams and the corresponding XML descriptions (Fig. 7) contain no information about the bitplanes and on how to truncate the subbands in an efficient way. Therefore, it is impossible to realize SNR scalability with this kind of high-level bitstream descriptions. To resolve this problem, we can observe that the encoding process not only delivers a compressed bitstream, but also a file containing information about the bitplanes of the different subbands. The pull function, i.e., the program that is used to extract bitstreams from the parent stream, uses this file to generate the best possible bitstream for a given bitrate, and it is this function that has to be simulated in the XML domain on our BSDs.

    As just mentioned, SNR scalability has to be implemented with constant quality for a given bitrate based on the XML description. Therefore, the information about the bitplanes, available in the extra generated file, has to be added to the description in order to make SNR scalability possible. BSDL contains a mechanism to add metadata to the description: the bs1:ignore attribute can be placed in the tags that contain metadata. These tags are skipped by the BSDtoBin parser when an adapted bitstream is generated.

    Every temporal layer of a bitstream contains a number of spatial levels, from low-frequency information to high-frequency data. Each of the spatial subbands contains a number of bitplanes, resulting in #spatial levels × #bitplanes possible SNR scales. The information for every SNR scale is added as metadata in the header of the description. For each SNR scale, it contains the number of bytes across the complete sequence that are necessary to obtain this SNR level, and as such a constant quality over the complete sequence. Based on this information, the bitplane and the corresponding spatial level at which a subband has to be truncated can be detected. The number of bytes that every subband of a spatial level contains for a certain bitplane is also added as metadata in every spatial level. So, the new descriptions contain metadata in the header tag about the possible SNR scales, and every spatial subband contains metadata about the length (in bytes) of the bitplanes. In Fig. 10, a numerical example of metadata that can be encapsulated into a description is given.

    Fig. 10 Numerical example of SNR scalability

    This example is constructed for a bitstream containing one temporal level, two spatial levels, and three bitplanes per level. So, the number of SNR levels is six, as indicated in the first column. The first two columns contain the information that is added to the header tag of the description; the other information is added as metadata to the corresponding spatial subband. Suppose that a bitstream of 430 bytes is preferred; then one can see from the first two columns that subplane or SNR level number three is needed. Because the bitstream contains two spatial levels, one knows that the first spatial subband should contain two bitplanes and that the second spatial subband needs only one bitplane. From the last four columns, one can see that the first spatial subband of the first frame should be truncated after 65 bytes and the second subband after 80 bytes. For the second frame, the first spatial subband contains 75 bytes after the adaptation and the second one should be truncated after 200 bytes. Figure 11 shows such an XML description that contains the necessary metadata, and it illustrates the SNR transformation in the XML domain.
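A sketch of the selection logic follows, using hypothetical header metadata: only the level-3 total of 420 bytes (65 + 80 + 75 + 200) follows from the example above; the totals for the other SNR levels are invented for illustration.

```java
public class SnrSelector {
    // Hypothetical header metadata: total sequence size (bytes) per SNR
    // level, from best quality (level 1) to coarsest (level 6). Only the
    // level-3 value of 420 bytes is taken from the worked example in the
    // text; the other totals are made up for illustration.
    static final int[] TOTAL_BYTES = {700, 550, 420, 300, 180, 90};

    // Pick the highest-quality SNR level whose total size fits the budget.
    static int selectLevel(int budgetBytes) {
        for (int level = 1; level <= TOTAL_BYTES.length; level++) {
            if (TOTAL_BYTES[level - 1] <= budgetBytes) return level;
        }
        return TOTAL_BYTES.length; // fall back to the coarsest level
    }

    public static void main(String[] args) {
        // A 430-byte budget selects SNR level 3, as in the text: the four
        // subband truncation points (65 + 80 + 75 + 200 bytes) sum to 420.
        System.out.println(selectLevel(430)); // prints 3
    }
}
```

Once the level is fixed, the per-subband metadata (the last four columns of Fig. 10) directly gives the byte offset at which each subband is truncated in the description.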

    When this new bitstream description is used, the insertion of extra metadata has consequences for the previous temporal and spatial transformations. The elimination of the different tags is the same, but it is now possible that the bitplane metadata, as encapsulated in the header, have to be recalculated. It is clear that when a temporal decomposition level is eliminated for a certain bitrate, more bytes can be used for the remaining subbands. This way, higher bitplanes can be used for the subbands to maintain the same bitrate. To update the bitplane information, we have to adapt the metadata in the header that contain the information necessary to calculate the highest bitplane for a given bitrate. So, after the execution of one of the previous transformations, the metadata in the header have to be recalculated.


    Fig. 11 XML example of SNR scalability

    3.2 BS schema for temporal scalable H.264/AVC bitstreams

    In this section, we describe the structure of H.264/AVC bitstreams [27] and what is needed in order to exploit temporal scalability.

    3.2.1 Structure of an H.264/AVC bitstream

    The structure of an H.264/AVC bitstream is much more complex than that of MC-EZBC bitstreams. The aim of an embedded scalable codec is to adapt the bitstream along a scalability axis by executing elementary editing operations. This means that the bitstream does not have to be parsed too deeply in order to be able to exploit the scalability and to keep the XML-based bitstream descriptions small. In the MC-EZBC bitstreams, this was possible because of the straightforward structure of the bitstreams, as explained in the previous subsection. On the other hand, the structure of a traditional H.264/AVC bitstream is very complicated, and without taking precautions, the bitstreams have to be parsed quite deeply before temporal scalability can be exploited (in particular, up to and including the slice header). We will discuss a novel approach to make temporal scalability possible without knowledge of the complete bitstream structure.

    Fig. 12 Global structure of an H.264/AVC bitstream

    The high-level structure of an H.264/AVC bitstream is given in Fig. 12.

    As one can see, the bitstreams are built as a succession of Nal_Units, each preceded by a start code. The Nal_Units are the building blocks of a stream and can contain different types of data. These types fall into three main categories.
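As an illustration of this layout, the following sketch locates three-byte Annex B start codes (0x000001) in a raw byte stream; a real demultiplexer would also handle the four-byte form (0x00000001) and interpret the NAL unit header that follows each start code.

```java
import java.util.ArrayList;
import java.util.List;

public class NalScanner {
    // Return the byte offsets at which three-byte start codes (0x000001)
    // occur in an Annex B style byte stream.
    static List<Integer> startCodeOffsets(byte[] stream) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + 2 < stream.length; i++) {
            if (stream[i] == 0 && stream[i + 1] == 0 && stream[i + 2] == 1) {
                offsets.add(i);
                i += 2; // skip over the start code itself
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Two NAL units: start code plus one payload byte each (the
        // payload bytes are dummies).
        byte[] stream = {0, 0, 1, 0x42, 0, 0, 1, 0x43};
        System.out.println(startCodeOffsets(stream)); // prints [0, 4]
    }
}
```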


    Fig. 14 Part of a temporal scalable H.264/AVC bitstream structure description

    Table 1 Characteristics of the different video sequences

    Sequence name   Content                       No. of frames   Length (s)
    Sequence_1      Football game                  40              1.30
    Sequence_2      Falling leaves and a flower   121              4.03
    Sequence_3      Sailing boat                  300             10.00
    Sequence_4      Football game                 541             18.03

    framework, is discussed in Sect. 6 and explains the practicability of our approach.

    For the performance analysis, four video sequences have been selected with different characteristics in length, content, and frame resolution. In Table 1, the different characteristics of the selected sequences are summarized. Based on this table, the impact of a certain characteristic on the performance of the BSDL tools and the transformations will be examined.

    The first step in the BSDL framework, after the encoding of the video sequence, is the generation of the description. The execution time for the generation of a description by using the BintoBSD tool is less important because the parent description only has to be generated once. This generation can happen in parallel with the encoding phase or immediately after the creation of the bitstream. In order to do so, we have modified version 1.2.1 of the BintoBSD tool as available in the MPEG-21 reference software package.

    To obtain the descriptions for the MC-EZBC bitstreams containing the necessary metadata about the bitplanes, the standardized BSDL framework of Fig. 1 had to be modified; the new framework used to obtain a bitstream description that has to be sent to the transformer is given in Fig. 15. As one can see in the figure, an additional branch is necessary to add the extra metadata about the bitplanes to the description. This information is generated by the encoder and stored in a separate file. The metadata are needed to realize the SNR scalability as explained in Sect. 3.1.5.

    The adapted BSDL framework necessary to generate the descriptions for the temporal scalable H.264/AVC bitstreams is given in Fig. 16.

    Fig. 15 Adapted BSDL framework for MC-EZBC bitstreams

    Fig. 16 Adapted BSDL framework for temporal scalable H.264/AVC bitstreams

    As already explained, the SEI messages are necessary to exploit the temporal scalability. At the moment, there are no encoders that generate these

    messages; therefore, the messages have to be added automatically to the bitstreams. To add the SEI messages to the bitstream, we again use the BSDL framework because it gives us a flexible way to correct and to change a bitstream. To encode the original sequence, encoding parameters are given, such as the explicit GOP structure, to obtain the sub-sequences and the different layers. From the encoded bitstream, a BSD can be generated without the presence of the SEI messages. Based on the encoding parameters and the already generated description, it is possible to calculate the number of layers, the average bitrate and frame rate, and the layer to which every picture belongs. The generated SEI messages are subsequently encapsulated into the bitstream

    description. By using our BS schema, which contains the syntax of the SEI messages, and the originally generated bitstream, it is possible to generate a new temporal scalable H.264/AVC bitstream that contains the necessary SEI messages to exploit the scalabilities. From that point, the traditional BSDL framework, as given in Fig. 1, can be followed. The final adapted bitstream will also contain the SEI messages, but an H.264/AVC decoder that is not aware of such messages will ignore them (they are not necessary to reconstruct the sample values).
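For a dyadic hierarchical GOP such as the IbBbBbBbP structure used later in this paper (GOP size 8, four sub-sequence layers), the layer of each picture can be derived from its position alone. The mapping below is an assumption on our part, chosen because it reproduces a four-layer dyadic hierarchy; the actual layering depends on the encoder configuration.

```java
public class SubSequenceLayers {
    // Assign each picture in a dyadic GOP to a sub-sequence layer.
    // Picture 0 (the anchor) goes to layer 0; the remaining layers follow
    // from the position's lowest set bit. This dyadic mapping is an
    // assumption, not taken from the JM encoder or the H.264/AVC spec.
    static int layerOf(int picInGop, int gopSize) {
        if (picInGop == 0) return 0;
        int levels = Integer.numberOfTrailingZeros(gopSize);      // log2(gopSize)
        return levels - Integer.numberOfTrailingZeros(picInGop);  // 1..levels
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8; i++) sb.append(layerOf(i, 8));
        System.out.println(sb); // prints 03231323 (four layers, 0..3)
    }
}
```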

    Once the scalable descriptions are available, the time needed to transform the XML descriptions can be measured. For most of the transformations, the adaptation has been implemented using multiple technologies. The easiest way to implement an XML transformation is to make use of XSLT. We have used the Xalan processor in our tests [28]. The main advantage of XSLT is the possibility to implement a transformation with only a small number of lines of code. The disadvantages are the slow execution times and the fact that XSLT is a functional programming language. We have used XSLT to realize the temporal, spatial, and SNR scalability for the MC-EZBC bitstreams. Because of the poor performance of XSLT, we have also implemented the transformations in procedural languages, in particular C++ and Java. To realize an XML transformation in these languages, an XML parser is needed. There exist two main types of XML parsers: parsers built on top of tree-based models and parsers built on top of event-based models.

    In our experiments, we have used implementations based on the SAX and W3C DOM APIs, which make use of an event-based and a tree-based model, respectively. The two models in question have their own advantages and disadvantages. The first advantage of using DOM for XML parsing is that DOM generates an in-memory tree that closely reflects the content of the XML file. For a developer, it is simpler to implement a DOM-based application than to use the alternative SAX approach. Because the tree is persistent in memory, modifications and all kinds of navigation are possible on the internal tree. The tree-based implementations have some disadvantages, certainly in the context of bitstream descriptions. The first step in a DOM implementation is the generation of the internal DOM tree, which can be a slow process, typically characterized by high memory requirements (certainly for large files). It is also only possible to start the transformation once the complete tree is generated. Therefore, it is impossible to use DOM in a streaming scenario (which is the use case that will be discussed in Sect. 6). A DOM tree can be usable in non-time-critical applications, so we have used a DOM implementation to add the SEI messages into the H.264/AVC bitstream descriptions.
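The three DOM steps (parse, transform, serialize) can be sketched with the standard Java APIs; the element names below are hypothetical, and the transformation simply removes the last (highest frequency) subband of each GOP.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.io.StringWriter;

public class DomPipeline {
    static String dropLastSubband(String xml) throws Exception {
        // Step 1: parse the description into an in-memory DOM tree.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

        // Step 2: transform the tree by removing the last subband of
        // every GOP (element names are hypothetical).
        NodeList gops = doc.getElementsByTagName("gop");
        for (int i = 0; i < gops.getLength(); i++) {
            Element gop = (Element) gops.item(i);
            NodeList subbands = gop.getElementsByTagName("subband");
            gop.removeChild(subbands.item(subbands.getLength() - 1));
        }

        // Step 3: serialize the adapted tree back to a flat XML file.
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bitstream><gop><subband/><subband/></gop></bitstream>";
        System.out.println(dropLastSubband(xml).contains("<subband/>")); // prints true
    }
}
```

The parse and serialize steps are exactly the I/O-bound phases that dominate the Java timings discussed below.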

    An event-based parser, such as SAX, works in another way. It generates an event every time the parser encounters an XML token. The events in question have to be caught by the application, and they contain no information about the previous and following tokens. The application must keep track of the state of the parser in the XML file and does not know when an event will be thrown. Therefore, SAX-based applications are often more complex. Another disadvantage of SAX is the fact that only forward navigation operations are possible. It is also impossible to manipulate the content of the XML file. Of course, SAX also has advantages. One of the most important is the small memory footprint, because no XML data have to be kept in memory. This should make a SAX implementation faster than a DOM variant. Another advantage is the fact that the analysis can start immediately, rather than having to wait for all of the data to be processed. That makes SAX very interesting for streaming applications, such as streaming video descriptions [29].
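A minimal SAX sketch illustrating the event-driven style: the handler reacts to start-element events and keeps only a counter, so memory use is independent of document size (the element names are illustrative).

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.ByteArrayInputStream;

public class SaxCounter {
    // Count occurrences of one element while streaming through the
    // document: no tree is built, so memory use stays constant no matter
    // how long the description is.
    static int countElements(String xml, final String name) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if (qName.equals(name)) count[0]++; // one event per start tag
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bitstream><gop><frame/><frame/></gop><gop><frame/></gop></bitstream>";
        System.out.println(countElements(xml, "frame")); // prints 3
    }
}
```

A filtering transformation works the same way, echoing every event to an output writer except the ones belonging to the elements that have to be dropped.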

    The last step in the BSDL adaptation chain is the generation of the adapted bitstream. We used version 1.2.1 of the BSDtoBin tool. A bug had to be fixed such that the software could parse XML descriptions with metadata (by using the bs1:ignore attribute of BSDL). The software was optimized by using buffered output streams instead of byte-based output streams. Improper usage of exceptions was eliminated by using boolean values.

    To perform the measurements, every step was executed five times and the average was calculated over the five runs. The measurements were done on a PC with an Intel Pentium IV CPU, clocked at 2.8 GHz with Hyper-Threading and having 1 GB of RAM at its disposal. The operating system used was Windows XP Pro (Service Pack 2). Sun Microsystems's Java 2 Runtime Environment (Standard Edition version 1.5.0_02-b09) was used as the Java Virtual Machine. The H.264/AVC bitstreams were created by relying on the JM 9.4 reference software and by using an explicit GOP structure of IbBbBbBbP (which implies four sub-sequence layers). The MC-EZBC bitstreams were encoded with five temporal levels (i.e., GOP structures of 16 frames).

    Table 2 Performance analysis of the BintoBSD tool for the MC-EZBC bitstreams

    Measurement                                                Sequence_1   Sequence_2   Sequence_3   Sequence_4
    BintoBSD (s)                                                     2.48        11.04        48.10       108.53
    Bitstream size (Kilobytes)                                      2,425        6,474       14,358       33,349
    Description size, no metadata (bytes)                          30,309       95,943      232,831      375,329
    Adding metadata (s)                                              1.68         4.10         8.00        11.40
    Description size, with all metadata (bytes)                 1,693,313    6,135,085   14,812,199   20,530,685
    Compressed size (bytes)                                        65,672      211,697      508,585      757,453
    Compression factor (%)                                          96.12        96.55        96.57        96.31
    Description size, with metadata for 16 bitplanes (bytes)     245,120      245,120    1,977,526    2,982,961
    Compressed size (bytes)                                        11,964       31,539       72,570      127,522
    Compression factor (%)                                          95.12        96.15        96.33        95.72

    Table 3 Performance analysis of the BintoBSD tool for the temporal scalable H.264/AVC bitstreams

    Measurement                  Sequence_1   Sequence_2   Sequence_3   Sequence_4
    BintoBSD (s)                       4.61        26.62       139.78       399.05
    Bitstream size (Kilobytes)          164          229          629        2,512
    Description size (bytes)         59,986      204,967      495,103      891,448
    Compressed size (bytes)           2,318        3,946        7,033       11,567
    Compression factor (%)            96.14        98.07        98.58        98.70

    5 Performance results

    In this section, we discuss the performance results with respect to the different phases and technologies used, as explained in the previous section.

    5.1 Performance of the BintoBSD tool

    Tables 2 and 3 summarize the execution times and the sizes of the generated XML descriptions as obtained by making use of the modified BintoBSD software. Table 2 contains the results for the MC-EZBC bitstreams. Additionally, the time is shown that is necessary for adding the metadata to the descriptions in order to obtain the adaptable descriptions. Furthermore, the table also contains the size of the descriptions after the addition of the metadata, because these descriptions will be the subject of the transformations. One can see from the table that the addition of the metadata results in a drastic increase of the size of the descriptions. The metadata contain information about approximately 128 bitplanes per spatial subband, which leads to 640 SNR levels in the case of a bitstream that contains five spatial levels. In most practical situations, it is not necessary to have so many SNR levels. Therefore, BSDs were also generated containing fewer bitplanes per spatial subband, in particular 16 bitplanes for each spatial subband, resulting in approximately 80 SNR levels. The sizes of these new descriptions are also recorded in the table and show a strong reduction of the description sizes. As one will see in the next subsection, the impact of the description size cannot be neglected during the transformation and the generation of the adapted bitstream. In Table 3, the execution times are given for the temporal scalable H.264/AVC bitstreams. The BintoBSD execution times are measured during the generation of the descriptions from the bitstreams that already contain the SEI messages. In other words, we have measured the execution time of the second BintoBSD tool in Fig. 16. The table also contains the sizes of the bitstreams and the sizes of the XML descriptions.

    Because the sizes of the descriptions are big (they are almost as large as the bitstreams themselves), they would be unusable in practical situations. Fortunately, in a complete BSDL-based application (as will be explained in Sect. 6), the XML files are likely to be compressed. As an example, we used a simple ZIP algorithm to illustrate that the compression of XML files does make sense in this context. As can be seen in the tables, a ratio of up to 95% is not exceptional in the case of bitstream structure descriptions. These high compression ratios are possible because of the repetitive character of the elements in the bitstream description. In a practical situation, the compression process must not conflict with the real-time requirements of our adaptation framework. Therefore, we suggest using another compression technique than the traditional plain-text algorithms (such as ZIP), in particular the Binary Format for Metadata (BiM) [30]. The latter was developed in the context of MPEG-7 and is now subject to modifications in order to make it suitable for the binarisation of MPEG-21 Digital Items. BiM is a W3C XML Schema aware encoding scheme for XML documents. It uses information from the XML schema to create an efficient representation of the XML data within the binary domain [31]. The resulting compression ratio is comparable to those achieved by traditional plain-text algorithms. The most important advantages of BiM are the support for parsing the XML data in the binary domain; the possibilities for dynamic and partial updating of existing XML trees; and certainly its streaming capabilities, which is a very important requirement for our format-agnostic adaptation framework based on XML descriptions.

    Table 4 Performance analysis of the temporal transformation

    Sequence name   XSLT (s)           DOM-Java (s)    SAX-Java (s)    DOM-C++ (s)    SAX-C++ (s)
                    (a)       (b)      (a)     (b)     (a)     (b)     (a)     (b)    (a)     (b)
    Sequence_1       159.10     3.31    1.90    1.20    0.66    0.37    0.48    0.15   0.27    0.05
    Sequence_2       771.60    14.79    4.32    1.54    1.39    0.49    1.42    0.26   0.94    0.14
    Sequence_3      2005.36    46.39    8.30    2.11    2.73    0.72    3.50    0.53   2.22    0.32
    Sequence_4      2505.80    82.67   11.20    2.50    3.71    0.88    4.92    0.78   3.09    0.45

    (a) Descriptions with all metadata about the bitplanes
    (b) Descriptions with limited metadata for 16 bitplanes per subband
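The effect of the repetitive element structure on compression can be reproduced with the Deflate algorithm (the method behind ZIP) from the Java standard library; the synthetic description below is invented for illustration.

```java
import java.util.zip.Deflater;

public class BsdCompression {
    // Deflate-compress a byte array and return the compressed size.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length];
        int total = 0, n;
        while ((n = deflater.deflate(buf)) > 0) total += n;
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // A synthetic, highly repetitive description: the repeated element
        // structure is what makes ratios above 90% realistic for BSDs.
        StringBuilder sb = new StringBuilder("<bitstream>");
        for (int i = 0; i < 2000; i++)
            sb.append("<gop><subband start=\"0\" length=\"1024\"/></gop>");
        sb.append("</bitstream>");
        byte[] xml = sb.toString().getBytes();
        System.out.println(deflatedSize(xml) < xml.length / 10); // prints true
    }
}
```

Unlike BiM, a Deflate stream cannot be parsed or partially updated in the binary domain, which is exactly why BiM is preferable in the streaming scenario.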

    Finally, we have to note that the BintoBSD execution times are less time-critical because the descriptions can be generated offline. Only for live streaming applications, such as television broadcasting, should the generation happen as fast as the encoding (at least in real time).

    5.2 Performance of the transformations

    5.2.1 Scalability of the MC-EZBC descriptions

    The first scalability that has been tested is temporal scalability. To measure the execution time of the temporal transformation, adapted descriptions have been generated that contain four times fewer frames. This implies that two temporal levels had to be eliminated. The results of the transformations are given in Table 4. In this table, the execution times are shown for the XSLT transformation, followed by the execution times of the Java implementation as well as the times for the implementation in C++. The table in question illustrates that XSLT is unusable for transforming a description that contains metadata. So, a first conclusion is that transformations, in the context of bitstream descriptions, have to be implemented in Java or C++, hereby using a library to parse the XML descriptions and to generate the adapted versions. One can also see that the C++ implementations are faster than the Java implementations, although the same algorithm is used in the two implementations, certainly when an implementation based on a DOM parser is used. A DOM-based transformation of an XML description consists of three elementary steps. First, the XML file has to be read in and an (internal) DOM structure has to be generated. In the next step, the transformation on the DOM structure is executed. Finally, the adapted DOM structure is stored in a flat XML file (a process also known as serialization). To find the differences between the Java and C++ implementations, we have looked at how long the different steps take.

    Table 5 DOM-based implementations in Java and C++

                      DOM-Java (s)   DOM-C++ (s)
    DOM generation         6.8           1.7
    Transformation         1.1           1.4
    Serialization          3.3           1.8
    Total                 11.2           4.9

    In Table 5, the execution times are provided for the different parts of the implementation, as measured for the Sequence_4 stream. From this table, one can notice that the transformation itself is quite fast for both implementations. The difference lies in

    the I/O operations: the generation of the DOM structure and the serialization of the transformed XML file take 90% of the complete execution time when making use of Java. The only part of the implementation that we are able to control and to optimize is the transformation; the generation of the DOM tree and the serialization are the responsibility of the library. In Table 4, we can also see that the SAX implementations are faster than their equivalent DOM implementations (which was expected). Finally, we can derive that in the case of a SAX-based implementation, the influence of the programming language used is less important. When the metadata in the descriptions are limited to 16 bitplanes, the transformation can be executed very fast: less than 1 s for a video sequence of 20 s.

    A last interesting measurement is the memory consumption during the transformation. Profiling the application with JProfiler 3.2¹ resulted in the observation that the Java SAX implementation for the temporal transformation of the description of the Sequence_4 sequence only needs 2 MB of memory. This is in contrast to the DOM implementation: the latter needs about 180 MB of memory (which is also representative for XSLT). As expected, DOM

    1 The Java profiler JProfiler can be found at http://www.ej-technologies.com/


    Table 6 Performance analysis of the spatial transformation

    Sequence name   XSLT (s)          DOM-Java (s)    SAX-Java (s)    DOM-C++ (s)    SAX-C++ (s)
                    (a)      (b)      (a)     (b)     (a)     (b)     (a)     (b)    (a)     (b)
    Sequence_1       100.2     2.3     2.20    1.26    0.65    0.38    0.49    0.15   0.29    0.06
    Sequence_2       587.6    10.2     4.69    1.57    1.44    0.55    1.64    0.32   0.97    0.19
    Sequence_3      1474.1    27.1     8.80    2.18    2.77    0.72    3.81    0.62   2.45    0.43
    Sequence_4      1620.6    34.7    11.72    2.68    3.62    0.88    4.90    0.84   3.20    0.55

    (a) Descriptions with all metadata about the bitplanes
    (b) Descriptions with limited metadata for 16 bitplanes per subband

    Table 7 Performance analysis of the SNR transformation

    Sequence name   XSLT (s)         DOM-Java (s)    SAX-Java (s)    DOM-C++ (s)    SAX-C++ (s)
                    (a)     (b)      (a)     (b)     (a)     (b)     (a)     (b)    (a)     (b)
    Sequence_1        4.5    1.2      2.14    1.28    0.85    0.43    0.97    0.21   0.41    0.08
    Sequence_2       20.2    2.4      4.87    1.51    1.98    0.62    3.35    0.53   1.42    0.21
    Sequence_3       72.6    6.7     10.21    2.35    4.07    0.92    7.96    1.15   3.46    0.48
    Sequence_4      128.7   12.2     13.89    2.83    5.38    1.18   10.87    1.70   4.78    0.69

    (a) Descriptions with all metadata about the bitplanes
    (b) Descriptions with limited metadata for 16 bitplanes per subband

    has high memory requirements, while SAX uses almost no memory. We can conclude that DOM is unusable for long sequences, such as complete news bulletins or movies. SAX is the appropriate way to transform large bitstream syntax descriptions. DOM is only usable when the descriptions are small, for example, when a very high-level bitstream description is used. The latter holds true for applications whereby only temporal scalability has to be exploited (and so no information about the bitplanes is necessary). For non-streaming applications, such as downloading music clips for PDAs, a DOM implementation is a possible solution.

    After exploiting the temporal scalability, the spatial transformation was executed under the same circumstances. The results for the spatial transformations are very similar to the ones obtained for the temporal transformations. Again, the performance was measured by reducing the spatial resolution by a factor of two. In Table 6, the different results of the test runs are shown. The same conclusions can be drawn as for the temporal scalability.

    To realize SNR (or quality) scalability, a target bitrate has to be entered in order to guide the transformation. The original (parent) bitstreams have a bitrate of about 13 Mbit/s. For our test case, we have chosen a target bitrate of 1000 kbit/s. In Table 7, the results of the SNR transformation implementations are shown, as explained in Sect. 3. In this table, one can see that these implementations take more time than the previous two transformations (with the exception of XSLT, but those times are still unacceptably high). From this table, it is clear that a SAX-based transformation can be executed in real time and that the impact of the sizes of the descriptions cannot be ignored. Adapting a description with only 16 bitplanes per subband, resulting in almost the same quality for the given bitrate as for descriptions containing all metadata, needs much less time (in Java approximately 1 s and in C++ less than 1 s).

    Table 8 Performance analysis of the temporal transformation for the H.264/AVC descriptions

    Sequence name   SAX-Java (s)
    Sequence_1          0.13
    Sequence_2          0.17
    Sequence_3          0.22
    Sequence_4          0.25

    5.2.2 Temporal scalability for an H.264/AVC description

    To exploit temporal scalability in the context of H.264/AVCbitstreams, the transformation has only been implementedin SAX. This because SAX is the fastest implementationfor transforming bitstream descriptions, as mentioned in theanalysis for the MC-EZBC bitstreams and it is the onlyone that is usable in streaming use cases. The results of thetransformation are given in Table 8. The transformation canhereby be steered by relying on different adaptation parame-ters. The transformation engine can receive the number ofthe layers that have to be eliminated but it is also possi-ble that the engine has to calculate that number by itselfwhen an average frame rate or bitrate is given. Based onthe information that is encapsulated into the sub-sequencelayer characteristics SEI message, the engine knows howmany and which layers that have to be removed. Whatkind of adaptation parameter the engine receives, has noimpact on the execution times. As one can see in the ta-ble, the execution times are very low in contrast with thetimes for the MC-EZBC descriptions. It is clear from this

  • 7/27/2019 MPEG-21 bitstream syntax descriptions for scalable video codecs

    16/19

    418 D. De Schrijver et al.

Table 9 Performance analysis of the BSDtoBin tool for the generation of MC-EZBC bitstreams

                                   BSDtoBin,                  BSDtoBin,                  BSDtoBin, with metadata
                                   no metadata (s)            with metadata (s)          for 16 bitplanes (s)
Sequence     Sequence              Reference                  Reference                  Reference
name         length (s)            implementation  Optimized  implementation  Optimized  implementation  Optimized
Sequence_1   1.30                  12.2            0.5        12.6            0.7        12.3            0.6
Sequence_2   4.03                  31.9            0.6        33.0            1.2        32.0            0.7
Sequence_3   10.00                 69.8            0.8        71.8            2.1        70.4            1.0
Sequence_4   18.03                 161.4           1.7        163.2           3.1        162.5           1.9

Table 10 Performance analysis of the execution time of the BSDtoBin tool for the generation of H.264/AVC bitstreams

                                          BSDtoBin, optimized version (s)
Sequence name   Sequence length (s)   Loading schema   Without loading time   Total
Sequence_1      1.30                  0.62             0.10                   0.72
Sequence_2      4.03                  0.62             0.16                   0.78
Sequence_3      10.00                 0.62             0.33                   0.95
Sequence_4      18.03                 0.62             0.53                   1.15

table that the transformations in the XML domain are not an obstacle to the use of a description-based adaptation framework.
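When the engine receives an average frame rate instead of an explicit layer count, the number of sub-sequence layers to remove can be derived from the observation that, in a dyadic layering, dropping the highest enhancement layer halves the frame rate. A minimal sketch of that calculation (the dyadic assumption and the method name are ours, not taken from the paper's implementation):

```java
/** Sketch: derive how many sub-sequence layers to drop for a target frame
 *  rate, assuming dyadic layering (removing the highest enhancement layer
 *  halves the frame rate). The base layer is never removed. */
public class LayerSelector {
    public static int layersToDrop(double fullFrameRate, double targetFrameRate,
                                   int numLayers) {
        int drop = 0;
        double rate = fullFrameRate;
        // Drop the highest remaining layer while we are still above the target.
        while (drop < numLayers - 1 && rate / 2.0 >= targetFrameRate) {
            rate /= 2.0;
            drop++;
        }
        return drop;
    }

    public static void main(String[] args) {
        // A 30 fps bitstream with 3 sub-sequence layers (30 -> 15 -> 7.5 fps).
        System.out.println(layersToDrop(30.0, 15.0, 3)); // prints 1
        System.out.println(layersToDrop(30.0, 7.5, 3));  // prints 2
    }
}
```

In the framework described above, the actual frame rate of each layer would be read from the sub-sequence layer characteristics SEI message rather than assumed, but the decision logic remains this simple, which is consistent with the observation that the choice of adaptation parameter does not affect the execution times.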

    5.3 Performance of the BSDtoBin tool

To investigate the performance of the step that involves the generation of adapted bitstreams, the BSDtoBin tool was applied to the generated descriptions of the original bitstreams without performing a transformation. It is clear that after the bitstream generation, a bit-equivalent version of the original bitstream has to be obtained. This can be considered a worst-case scenario in terms of execution times for the BSDtoBin tool, because the largest possible description is used. In practical situations, the execution times will be lower because the descriptions will already have been adapted or transformed by a transformation engine.

In Tables 9 and 10, an overview is given of the execution times of the BSDtoBin tool. Table 9 lists the times needed to generate the MC-EZBC bitstreams, as measured for two software implementations: the reference software version 1.2.1 and our optimized implementation. From this table, we can conclude that the generation of a bitstream from an (adapted) description can be done very fast, certainly when the descriptions are not too long.

In Table 10, the execution times of our optimized implementation of the BSDtoBin tool are given for the generation of H.264/AVC bitstreams. The times are divided into the time needed to load the BS schema and the time needed to generate the bitstream once the schema is loaded. The schema loading time is constant for all executions of the tool. One can see from the table that the generation of an adapted H.264/AVC bitstream from a description can be realized very fast (approximately half a second for a sequence of about 18 s).
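The core of the BSDtoBin step is data copying: each reference in the (adapted) description resolves to a byte range of the original bitstream, and the output is the concatenation of those ranges. The sketch below reduces this to its essence; the real tool also re-encodes literal values that appear directly in the BSD, which is omitted here, and the class and method names are our own.

```java
import java.io.ByteArrayOutputStream;

/** Minimal sketch of BSDtoBin's core step: each entry of a (parsed) adapted
 *  description points into the original bitstream as an (offset, length)
 *  pair, and the adapted bitstream is the concatenation of the referenced
 *  byte ranges. Literal values re-encoded from the BSD itself are omitted. */
public class BinGenerator {
    public static byte[] generate(byte[] original, long[][] ranges) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long[] r : ranges) {
            int offset = (int) r[0], length = (int) r[1];
            out.write(original, offset, length); // copy untouched payload bytes
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] source = {10, 20, 30, 40, 50, 60};
        // Adapted description kept the header (bytes 0-1) and one frame (bytes 4-5).
        byte[] adapted = generate(source, new long[][] {{0, 2}, {4, 2}});
        System.out.println(adapted.length); // prints 4
    }
}
```

Because the dominant cost is sequential byte copying, this also explains why the optimized implementation in Table 9 scales with description length rather than with the amount of video data.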

    6 Use case

In this section, we describe a possible practical implementation of a format-agnostic adaptation framework using the technologies explained in this paper. The discussed use case is a streaming-based application whereby the bitstreams are adapted on the fly, taking into account the changing environment characteristics. The adaptation happens in a format-agnostic way and is executed by relying on XML descriptions, within the standardized MPEG-21 framework. The schema for the use case is given in Fig. 17.

The schema contains three main parts. The first part is the generation of the original (scalable and adaptable) bitstream and the corresponding description. This step has to happen only once and is therefore less time critical. As explained in Sect. 4, it is sometimes necessary to make the generated BSD transformable so that the embedded scalability of the bitstreams can be exploited in the XML domain. The resulting bitstream and corresponding BSD have to be stored on the server that interacts with the client. In case the server does not yet contain the BS schema for the description, this schema also has to be stored on the server.

The second part is the server, which interacts with the third part, namely the clients. The server does not have to know what kind of media bitstreams are stored in its underlying storage system. Every client has its own terminal and network with specific characteristics, and every client has its own preferences. The client requests a bitstream from the server. The request is processed at the server, resulting in a SAX filter for the transformation being sent to the client together with the original BSD. The BSD can be sent under streaming conditions. The moment the client receives the filter and the first tags of the BSD, the transformation of the description can be started. The transformer uses the user preferences and the network and terminal characteristics to steer the adaptation. Because the transformation is executed


    MPEG-21 bitstream syntax descriptions for scalable video codecs 419

    Fig. 17 Format-agnostic streaming use case by using the BSDL framework

under streaming conditions, the transformer can anticipate changing conditions, for example a decreasing bandwidth. The transformed description is streamed back to the server and, by using the BSDtoBin tool of the BSDL framework, the adapted bitstream is streamed to the client. The client can then decode the bitstream and render the video sequence.

From the performance analysis of the previous section, one can see that it is possible to deal with different clients in parallel. The only restricting tool on the server side is the generation of the bitstream, in particular the BSDtoBin tool. From Tables 9 and 10, it is clear that this tool can generate the bitstreams very fast. Up to 40 requests can be processed in parallel in the case of the scalable H.264/AVC bitstreams. Note that the BSDtoBin parser used is a modified version of the non-optimized reference software of MPEG-21 that is implemented in Java; a commercially optimized implementation would run considerably faster.
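The 40-requests figure can be sanity-checked with simple arithmetic: excluding the one-off schema loading, the number of streams a single BSDtoBin instance can sustain in real time is the ratio of sequence duration to generation time. A rough check against the Sequence_4 numbers of Table 10 (the method name is ours):

```java
/** Back-of-the-envelope estimate of how many streams a single BSDtoBin
 *  instance can serve in real time: the ratio of sequence duration to
 *  bitstream generation time (schema loading excluded as a one-off cost). */
public class ThroughputEstimate {
    public static int parallelStreams(double sequenceSeconds, double generationSeconds) {
        return (int) Math.floor(sequenceSeconds / generationSeconds);
    }

    public static void main(String[] args) {
        // Sequence_4 in Table 10: 18.03 s of video regenerated in 0.53 s.
        System.out.println(parallelStreams(18.03, 0.53)); // prints 34
    }
}
```

This is the same order of magnitude as the 40 requests quoted above; the exact figure depends on which sequence is taken as the reference and on how the constant schema loading time is amortized.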

    7 Conclusions

In a universal multimedia access framework, it is necessary that a video bitstream can be adapted in a transparent way. In this paper, a coding-format-independent adaptation framework for binary media resources is discussed. To this end, the high-level structure of a bitstream is described in XML by using the MPEG-21 BSDL standard. BSDL defines a language to express the structure of an encoded bitstream in the same way as an XML schema defines the content of a group of XML documents. To evaluate the performance of the BSDL framework, bitstreams encoded by two different coding schemata were used. The first is a fully embedded scalable wavelet-based algorithm, in particular the MC-EZBC codec, and the second is the non-fully-embedded scalable H.264/AVC standard. The paper describes the usage of the sub-sequence coding tool of the H.264/AVC specification to obtain temporally scalable H.264/AVC bitstreams. The construction of the BS schemata for the codecs in question is discussed, as well as how the different kinds of scalability can be realized in the XML domain. To exploit the different types of scalability in the XML domain, the descriptions have to be extended with metadata. In particular, information about the length of the bitplanes is needed in the case of MC-EZBC bitstreams, and the inclusion of Supplemental Enhancement Information (SEI) messages into the high-level XML descriptions is necessary to execute adaptations on temporally


scalable H.264/AVC bitstreams. This paper describes how the BSDL framework has to be adapted to make it possible to include the necessary metadata in the descriptions. This approach to bitstream adaptation also has a penalty, in particular the extra processing time needed to generate the XML descriptions and to transform them. During the performance analysis, different transformation technologies were compared to each other, resulting in a preference for a transformation based on an implementation of a SAX parser. A possible practical implementation of the approach described in this paper is discussed as well. The implementation scheme results in an elegant and flexible solution for a format-agnostic adaptation framework for streaming applications, taking into account the network, terminal, and user characteristics.

Acknowledgements  The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.



Davy De Schrijver received the master degree in computer science from Ghent University, Belgium, in 2003. He joined the Multimedia Lab in 2003, where he is currently working toward the PhD degree. His research interests include scalable video coding technologies, media content adaptation, fractal coding, interactive digital television, and MPEG standardization.

Chris Poppe received the master degree in industrial sciences from KaHo Sint-Lieven, Belgium, in 2002 and received his master degree in computer science from Ghent University, Belgium, in 2004. He joined the Multimedia Lab in 2004, where he is currently working toward the PhD degree. His research interests include scalable video coding technologies, context collection and processing, content adaptation, and MPEG standardization.

Sam Lerouge received his master degree in computer science from Ghent University, Belgium, in 2001. He then joined the Multimedia Lab to work toward the PhD degree, which he obtained in 2005. His research focuses on applications that use scalable video coding, in particular the maximization of the visual quality in constrained environments.

Wesley De Neve received the master degree in computer science from Ghent University, Belgium, in 2002. He joined the Multimedia Lab in 2002, where he is currently working toward the PhD degree. His research interests include video coding technologies, media content adaptation, and multimedia APIs.

Rik Van de Walle received his M.Sc. and PhD degrees in Engineering from Ghent University, Belgium, in 1994 and 1998, respectively. After a visiting scholarship at the University of Arizona (Tucson, USA), he returned to Ghent University, where he became professor of multimedia systems and applications, and head of the Multimedia Lab. His current research interests include multimedia content delivery, presentation and archiving, coding and description of multimedia data, content adaptation, and interactive (mobile) multimedia applications.