Paper [99] proposes a corresponding pair of efficient streaming schedule and pipeline decoding architectures to deal with the mentioned problems. The proposed method can be applied to the case of streaming stored FGS videos and can benefit FGS-related applications.

Texture coding based on the discrete wavelet transform (DWT) is playing a leading role for its higher performance in terms of signal analysis, multiresolution features, and improved compression compared to existing methods such as the DCT-based compression schemes adopted in the old JPEG standard. This success is attested by the fact that the wavelet transform has now been adopted by MPEG-4 for still-texture coding and by JPEG2000. Indeed, superior performance at low bit rates and transmission of data according to the client display resolution are particularly interesting for mobile applications. The wavelet transform shows better results because it is intrinsically well suited to nonstationary signal analysis, such as images. Although it is a rather simple transform, DWT implementations may lead to critical requirements in terms of memory size and bandwidth, possibly yielding costly implementations. Thus, efficient implementations must be investigated to fit different system scenarios. In other words, the goal is to find different architectures, each of them specifically optimized for a specific system requirement in terms of complexity and memory bandwidth.

To facilitate MPEG-1 and MPEG-2 video compression, many graphics coprocessors provide accelerators for the key function blocks of the compression algorithms, such as inverse DCT and motion compensation, for real-time video decoding. The MPEG-4 multimedia coding standard supports object-based coding and manipulation of natural video and synthetic graphics objects. Therefore, it is desirable to use the graphics coprocessors to accelerate decoding of arbitrarily shaped MPEG-4 video objects as well [100]. It is found that boundary macroblock padding, which is an essential processing step in decoding arbitrarily shaped video objects, could not be efficiently accelerated on the graphics coprocessors due to its complexity. Although such padding can be implemented on the host processor, the frame data processed on the graphics coprocessor then need to be transferred to the host processor for padding. In addition, the padded data on the host processor need to be sent back to the graphics coprocessor to be used as a reference for subsequent frames. To avoid this overhead, two approaches to boundary macroblock padding have been proposed. In the first approach, the boundary macroblock padding is partitioned into two tasks so that the part assigned to the host processor can be performed without the overhead of data transfers. In the second approach, two new instructions are specified and a padding algorithm is proposed for next-generation graphics coprocessors or media processors, which gives a performance improvement of up to a factor of nine compared with a Pentium III implementation [100].
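To give a flavor of the processing involved, the sketch below pads a single boundary macroblock by horizontal and then vertical repetitive padding, in the spirit of MPEG-4 repetitive padding. It is a simplified, hedged illustration (nearest-pixel extension rather than the standard's full averaging rules), not the accelerated algorithm of [100]; the array shapes and function name are assumptions.

```python
import numpy as np

def pad_boundary_macroblock(texture, shape_mask):
    """Simplified repetitive padding of one boundary macroblock (e.g., 16x16).

    texture    : 2-D array of pixel values; samples outside the object are arbitrary.
    shape_mask : 2-D boolean array, True where the pixel belongs to the video object.
    """
    tex = texture.astype(float).copy()
    h, w = tex.shape
    row_has_object = shape_mask.any(axis=1)

    # Horizontal padding: extend the nearest object pixel across each row.
    for y in range(h):
        if not row_has_object[y]:
            continue
        xs = np.flatnonzero(shape_mask[y])
        for x in range(w):
            if not shape_mask[y, x]:
                nearest = xs[np.argmin(np.abs(xs - x))]
                tex[y, x] = tex[y, nearest]

    # Vertical padding: fill rows with no object pixels from the nearest padded row.
    ys = np.flatnonzero(row_has_object)
    if ys.size:
        for y in range(h):
            if not row_has_object[y]:
                nearest = ys[np.argmin(np.abs(ys - y))]
                tex[y, :] = tex[nearest, :]
    return tex
```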

5.4 MEDIA STREAMING

Advances in computers, networking, and communications have created new distribution channels and business opportunities for the dissemination of multimedia content. Streaming audio and video over networks such as the Internet, local area wireless networks, home networks, and commercial cellular phone systems has become a reality, and it is likely that streaming media will become a mainstream means of communication. Despite some initial commercial success, streaming media still faces challenging technical issues, including quality of service (QoS) and cost-effectiveness. For example, deployments of multimedia services over 2.5G and 3G wireless networks have presented significant problems for real-time servers and clients in terms of high variability of network throughput and packet loss due to network buffer overflows and noisy channels. New streaming architectures such as peer-to-peer (P2P) networks and wireless ad hoc networks have also raised many interesting research challenges. This section is intended to address some of the principal technical challenges for streaming media by presenting a collection of the most recent advances in research and development.

5.4.1 MPEG-4 Delivery Framework

The framework is a model that hides from its upper layers the details of the technology being used to access the multimedia content. It supports the various known communication scenarios (e.g., stored files, remotely retrieved files, interactive retrieval from a real-time streaming server, multicast, broadcast, and interpersonal communication). The delivery framework provides, in ISO/OSI terms, a session layer service. This is further referred to as the delivery multimedia integration framework (DMIF) layer, and the modules making use of it are referred to as DMIF users. The DMIF layer manages sessions (associated with overall MPEG-4 presentations) and channels (associated with individual MPEG-4 elementary streams) and allows for the transmission of both user data and commands. The data transmission part is often referred to in the open literature as the user plane, while the management side is referred to as the control plane. The term DMIF instance is used to indicate an implementation of the delivery layer for a specific delivery technology [101].

In the DMIF context, the different protocol stack options are generally named transport multiplexers (TransMux). Specific instances of a TransMux, such as the user datagram protocol (UDP), are named TransMux channels [102]. Within a TransMux channel, several streams can be further multiplexed, and MPEG-4 specifies a suitable multiplexing tool, the FlexMux. The need for an additional multiplexing stage (the FlexMux) derives from the wide variety of potential MPEG-4 applications, in which even huge numbers of MPEG-4 elementary streams (ESs) can be used at once. This is a somewhat new requirement specific to MPEG-4; in the IP world, for example, the real-time transport protocol (RTP) that is often used for streaming applications normally carries one stream per socket. However, in order to more effectively support the whole spectrum of potential MPEG-4 applications, the usage of the FlexMux in combination with RTP is being considered jointly between IETF and MPEG [103].

Figure 5.49 shows some of the possible stacks that can be used within the delivery framework to provide access to MPEG-4 content. Reading the figure as it applies to the transmitter side, ESs are first packetized, and the packets are equipped with the information necessary for synchronization (timing) – SL packets. Within the context of MPEG-4 Systems, the Sync Layer (SL) syntax is used for this purpose. Then the packets are passed through the delivery application interface (DAI). They possibly get multiplexed by the MPEG-4 FlexMux tool, and finally they enter one of the various possible TransMuxes.

In order to control the flow of the ESs, commands such as PLAY, PAUSE, and RESUME, and the related parameters, need to be conveyed as well. Such commands are considered by DMIF as user commands, associated with channels. Such commands are opaquely managed (i.e., not interpreted by DMIF and just evaluated at the peer entity). This allows the stream control protocol(s) to evolve independently from DMIF. When the real-time streaming protocol (RTSP) is used as the actual control protocol, the separation between user commands and signaling messages vanishes, as RTSP deals with both channel setup and stream control. This separation is also void, for example, when directly accessing a file.

The delivery framework is also prepared for QoS management. Each request for creating a new channel might have certain QoS parameters associated with it, and a simple but generic model for monitoring QoS performance has been introduced as well. The infrastructure for QoS handling does not include, however, generic support for QoS negotiation or modification.

Of course, not all the features modeled in the delivery framework are meaningful for all scenarios. For example, it makes little sense to consider QoS when reading content from local files. Still, an application making use of the DMIF service as a whole need not be further concerned with the details of the actually involved scenario.

The approach of making the application running on top of DMIF totally unaware of the delivery stack details works well with MPEG-4. Multimedia presentations can be repurposed with minimal intervention. Repurposing means here that a certain multimedia content can be generated in different forms to suit specific scenarios, for example, a set of files to be locally consumed, or broadcast/multicast, or even interactively served from a remote real-time streaming application. Combinations of these scenarios are also enabled within a single presentation.

Figure 5.49 User plane in an MPEG-4 terminal [104]. (©2000 ISO/IEC.)

The delivery application interface (DAI) represents the boundary between the session layer service offered by the delivery framework and the application making use of it, thus defining the functions offered by DMIF. In ISO/OSI terms, it corresponds to a Session Service Access Point.

The entity that uses the service provided by DMIF is termed the DMIF user and is hidden from the details of the technology used to access the multimedia content. The DAI comprises a simple set of primitives that are defined in the standard in their semantics only. An actual implementation of the DAI needs to assign a precise syntax to each function and its related parameters, as well as to extend the set of primitives to include initialization, reset, statistics monitoring, and any other housekeeping functions.

DAI primitives can be categorized into five families, analyzed as follows:

. service primitives (create or destroy a service)
. channel primitives (create or destroy channels)
. QoS monitoring primitives (set up and control QoS monitoring functions)
. user command primitives (carry user commands)
. data primitives (carry the actual media content).

In general, all the primitives being presented have two different (although similar) signatures (i.e., variations with different sets of parameters): one for communication from the DMIF user to the DMIF layer and another for communication in the reverse direction. The second is distinguished by the Callback suffix and, in a retrieval application, applies only to the remote peer. Moreover, each primitive presents both IN and OUT parameters, meaning that IN parameters are provided when the primitive is called, whereas OUT parameters are made available when the primitive returns. Of course, a specific implementation may choose to use nonblocking calls and to return the OUT parameters through an asynchronous callback. This is the case for the implementation provided in the MPEG-4 reference software.
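To make the five primitive families and the nonblocking-callback convention concrete, the sketch below outlines what a DAI-like interface could look like. It is only an illustration of the structure described above; the class, method, and parameter names are assumptions, not the normative DAI syntax (which the standard deliberately leaves to each implementation).

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DAIResult:
    """OUT parameters of a primitive, possibly delivered via an asynchronous callback."""
    status: int = 0
    payload: dict = field(default_factory=dict)


class DeliveryApplicationInterface:
    """Illustrative DAI-like session-layer interface with the five primitive families."""

    # Service primitives: create or destroy a service.
    def service_attach(self, url: str,
                       callback: Optional[Callable[[DAIResult], None]] = None) -> DAIResult: ...
    def service_detach(self, service_id: int) -> DAIResult: ...

    # Channel primitives: create or destroy channels for elementary streams.
    def channel_add(self, service_id: int, qos_descriptor: dict) -> DAIResult: ...
    def channel_delete(self, channel_id: int) -> DAIResult: ...

    # QoS monitoring primitives: set up and control QoS monitoring.
    def qos_monitor(self, channel_id: int, period_ms: int) -> DAIResult: ...

    # User command primitives: carry stream-control commands opaquely (PLAY, PAUSE, ...).
    def user_command(self, channel_id: int, command: bytes) -> DAIResult: ...

    # Data primitives: carry the actual media content (SL packets).
    def send_data(self, channel_id: int, sl_packet: bytes) -> DAIResult: ...
```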

The MPEG-4 delivery framework is intended to support a variety of communication scenarios while presenting a single, uniform interface to the DMIF user. It is then up to the specific DMIF instance to map the DAI primitives into appropriate actions to access the requested content. In general, each DMIF instance will deal with very specific protocols and technologies, such as the broadcast of MPEG-2 transport streams, or communication with a remote peer. In the latter case, however, a significant number of options exist in the selection of the control plane protocol. This variety justifies the attempt to define a further level of commonality among the various options, making the final mapping to the actual bits on the wire a little more focused. The delivery application interface (DAI) and delivery multimedia integration network interface (DNI) in the DMIF framework architecture are shown in Figure 5.50 [104].

The DNI captures a few generic concepts that are potentially common to peer-to-peer control protocols, for example, the usage of a reduced number of network resources (such as sockets) into which several channels would be multiplexed, and the ability to discriminate between a peer-to-peer relation (network session) and the different services possibly activated within that single session (services). The DNI follows a model that helps determine the correct information to be delivered between peers but by no means defines the bits on the wire. If the concepts of sharing a TransMux channel among several streams or of the separation between network session and services are meaningless in some context, that is fine, and does not contradict the DMIF model as a whole.

The mapping between DAI and DNI primitives has been specified in reference [104]. As a consequence, the actual mapping between the DAI and a concrete protocol can be determined as the concatenation of the mapping between the DAI and the DNI and the mapping between the DNI and the selected protocol. The first mapping determines how to split the service creation process into two elementary steps, and how multiple channels managed at the DAI level can be multiplexed into one TransMux channel (by means of the FlexMux tool). The second, protocol-specific mapping is usually straightforward and consists of placing the semantic information exposed at the DNI in concrete bits in the messages being sent on the wire.

In general, the DNI captures the information elements that need to be exchanged between peers, regardless of the actual control protocol being used. DNI primitives can be categorized into five families, analyzed in the following:

. session primitives (create or destroy a network session)
. service primitives (create or destroy a service)
. TransMux primitives (create or destroy a TransMux channel carrying one or more streams)
. channel primitives (create or destroy a FlexMux channel carrying a single stream)
. user command primitives (carry user commands).

Figure 5.50 DAI and DNI in the DMIF architecture [104]. (©2000 ISO/IEC.)

In general, all the primitives being presented have two different but similar signatures, one for each communication direction. The Callback suffix indicates primitives that are issued by the lower layer. Different from the DAI, for the DNI the signatures of both the normal and the associated Callback primitives are identical. As for the DAI, each primitive presents both IN and OUT parameters, meaning that IN parameters are provided when the primitive is called, whereas OUT parameters are made available when the primitive returns. As for the DAI, the actual implementation may choose to use nonblocking calls and to return the OUT parameters through an asynchronous callback. Also, some primitives use a loop( ) construct within the parameter list. This indicates that multiple tuples of those parameters can be exposed at once (e.g., in an array).

5.4.2 Streaming Video Over the Internet

Recent advances in computing technology, compression technology, high-bandwidth storage devices, and high-speed networks have made it feasible to provide real-time multimedia services over the Internet. Real-time multimedia, as the name implies, has timing constraints. For example, audio and video data must be played out continuously. If the data do not arrive in time, the playout process will pause, which is annoying to human ears and eyes.

Real-time transport of live video or stored video is the predominant part of real-time multimedia. Here, we are concerned with video streaming, which refers to real-time transmission of stored video. There are two modes for transmission of stored video over the Internet, namely the download mode and the streaming mode (i.e., video streaming). In the download mode, a user downloads the entire video file and then plays back the video file. However, full file transfer in the download mode usually suffers long and perhaps unacceptable transfer times. In contrast, in the streaming mode, the video content is played out while it is being received and decoded. Owing to its real-time nature, video streaming typically has bandwidth, delay, and loss requirements. However, the current best-effort Internet does not offer any quality of service (QoS) guarantees to streaming video over the Internet. In addition, it is difficult to efficiently support multicast video while providing service flexibility to meet a wide range of QoS requirements from the users. Thus, designing mechanisms and protocols for Internet streaming video poses many challenges [105]. To address these challenges, extensive research has been conducted. To introduce the necessary background and give the reader a complete picture of this field, we cover some key areas of streaming video, such as video compression, application-layer QoS control, continuous media distribution services, streaming servers, media synchronization mechanisms, and protocols for streaming media [106]. The relations among the basic building blocks are illustrated in Figure 5.51. Raw video and audio data are precompressed, by video and audio compression algorithms, and then saved in storage devices. It can be seen that the areas are closely related and that they are coherent constituents of the video streaming architecture. We will briefly describe these areas. Before that, it must be pointed out that, upon the client's requests, a streaming server retrieves compressed video/audio data from storage devices, and the application-layer QoS control module adapts the video/audio bit streams according to the network status and QoS requirements. After the adaptation, the transport protocols packetize the compressed bit streams and send the video/audio packets to the Internet. Packets may be dropped or may experience excessive delay inside the Internet due to congestion. To improve the quality of video/audio transmission, continuous media distribution services (e.g., caching) are deployed in the Internet. Packets that are successfully delivered to the receiver first pass through the transport layers and are then processed by the application layer before being decoded at the video/audio decoder. To achieve synchronization between video and audio presentations, media synchronization mechanisms are required.

Figure 5.51 Basic building blocks in the architecture for video streaming [105]. (©2001 IEEE.)

Video Compression

Since raw video consumes a large amount of bandwidth, compression is usually employed to achieve transmission efficiency. In this section, we discuss various compression approaches and the requirements imposed by streaming applications on the video encoder and decoder.

In essence, video compression schemes can be classified into two approaches: scalable and nonscalable video coding. We will show the encoder and decoder in intramode and only use the DCT. Intramode coding refers to coding video macroblocks without any reference to previously coded data. Since scalable video is capable of gracefully coping with the bandwidth fluctuations in the Internet [107], we are primarily concerned with scalable video coding techniques.

A nonscalable video encoder, shown in Figure 5.52a, generates one compressed bit stream, while a nonscalable video decoder is presented in Figure 5.52b. In contrast, a scalable video encoder compresses a raw video sequence into multiple substreams, as represented in Figure 5.53a. One of the compressed substreams is the base substream, which can be independently decoded and provides coarse visual quality. The other compressed substreams are enhancement substreams, which can only be decoded together with the base substream and can provide better visual quality. The complete bit stream (i.e., the combination of all substreams) provides the highest quality. An SNR-scalable encoder as well as a scalable decoder are shown in Figure 5.53.

The scalabilities of quality, image size, and frame rate are called SNR, spatial, and temporal scalability, respectively. These three scalabilities are the basic mechanisms, and they can also be combined, for example, into spatiotemporal scalability [108]. To provide more flexibility in meeting different demands of streaming (e.g., different access link bandwidths and different latency requirements), the fine granularity scalability (FGS) coding mechanism is proposed in MPEG-4 [109, 110]. An FGS encoder and FGS decoder are shown in Figure 5.54.

The FGS encoder compresses a raw video sequence into two substreams, that is, a base layer bit stream and an enhancement layer bit stream. Different from an SNR-scalable encoder, an FGS encoder uses bit-plane coding to represent the enhancement stream. Bit-plane coding uses embedded representations. Bit planes of enhancement DCT coefficients are shown in Figure 5.55. With bit-plane coding, an FGS encoder is capable of achieving continuous rate control for the enhancement stream. This is because the enhancement bit stream can be truncated anywhere to achieve the target bit rate.

    Figure 5.52 (a) Nonscalable video encoder, (b) nonscalable video decoder.


Example 5.3. A DCT coefficient can be represented by 7 bits (i.e., its value ranges from 0 to 127). There are 64 DCT coefficients. Each DCT coefficient has a most significant bit (MSB). All the MSBs from the 64 DCT coefficients form Bitplane 0 (Figure 5.55). Similarly, all the second most significant bits form Bitplane 1.
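As a minimal illustration of this idea, the sketch below decomposes a set of coefficient magnitudes into bit planes and shows that keeping only the first few planes (i.e., truncating the enhancement stream early) yields a coarser approximation. It is a toy example of the principle, not the MPEG-4 FGS entropy-coding syntax, and the sample values are made up.

```python
def to_bitplanes(coeffs, num_bits=7):
    """Split non-negative coefficient magnitudes (< 2**num_bits) into bit planes.

    Bitplane 0 holds the most significant bit of every coefficient,
    Bitplane 1 the second most significant bit, and so on.
    """
    planes = []
    for k in range(num_bits):
        shift = num_bits - 1 - k
        planes.append([(c >> shift) & 1 for c in coeffs])
    return planes


def from_bitplanes(planes, num_bits=7):
    """Reconstruct coefficients from however many bit planes were received."""
    coeffs = [0] * len(planes[0])
    for k, plane in enumerate(planes):
        shift = num_bits - 1 - k
        coeffs = [c | (bit << shift) for c, bit in zip(coeffs, plane)]
    return coeffs


# Truncating the enhancement stream after 3 of 7 planes gives a coarse approximation.
coeffs = [93, 5, 0, 127, 42]          # toy enhancement DCT coefficient magnitudes
planes = to_bitplanes(coeffs)
print(from_bitplanes(planes[:3]))      # -> [80, 0, 0, 112, 32]
```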

A variation of FGS is progressive fine granularity scalability (PFGS) [111]. PFGS shares the good features of FGS, such as fine granularity bit rate scalability and error resilience. Unlike FGS, which only has two layers, PFGS could have more than two layers. The essential difference between FGS and PFGS is that FGS uses only the base layer as a reference for prediction, whereas PFGS can also use enhancement layers as references to reduce the prediction error, resulting in higher coding efficiency.

Various Requirements Imposed by Streaming Applications

In what follows, we describe various requirements imposed by streaming applications on the video encoder and decoder. We also briefly discuss some techniques that address these requirements.

Bandwidth. To achieve acceptable perceptual quality, a streaming application typically has a minimum bandwidth requirement. However, the current Internet does not provide bandwidth reservation to support this requirement. In addition, it is desirable for video streaming applications to employ congestion control to avoid congestion, which happens when the network is heavily loaded. For video streaming, congestion control takes the form of rate control, that is, adapting the sending rate to the available bandwidth in the network. Compared with nonscalable video, scalable video is more adaptable to the varying available bandwidth in the network.

Figure 5.53 (a) SNR-scalable video encoder, (b) SNR-scalable video decoder.

Figure 5.54 (a) Fine granularity scalability (FGS) encoder, (b) FGS decoder [105]. (©2001 IEEE.)

Figure 5.55 Bitplanes of enhancement DCT coefficients [105]. (©2001 IEEE.)

Delay. Streaming video requires bounded end-to-end delay so that packets can arrive at the receiver in time to be decoded and displayed. If a video packet does not arrive in time, the playout process will pause, which is annoying to human eyes. A video packet that arrives beyond its delay bound (e.g., its playout time) is useless and can be regarded as lost. Since the Internet introduces time-varying delay, to provide continuous playout, a buffer at the receiver is usually introduced before decoding [112].

Loss. Packet loss is inevitable in the Internet and can damage pictures, which is displeasing to human eyes. Thus, it is desirable that a video stream be robust to packet loss. Multiple description coding is one compression technique that deals with packet loss [113].

Video Cassette Recorder (VCR)-like Functions. Some streaming applications require VCR-like functions such as stop, pause/resume, fast forward, fast backward, and random access. A dual-bit-stream least-cost scheme to efficiently provide VCR-like functionality for MPEG video streaming is proposed in reference [105].

Decoding Complexity. Some devices such as cellular phones and personal digital assistants (PDAs) require low power consumption. Therefore, streaming video applications running on these devices must be simple. In particular, low decoding complexity is desirable.

We here present the application-layer QoS control mechanisms, which adapt the video bit streams according to the network status and QoS requirements.

Application-Layer QoS Control

The objective of application-layer QoS control is to avoid congestion and maximize video quality in the presence of packet loss. To cope with varying network conditions and different presentation quality requested by the users, various application-layer QoS control techniques have been proposed [114, 115]. The application-layer QoS control techniques include congestion control and error control. These techniques are employed by the end systems and do not require any QoS support from the network.

Congestion Control. Bursty loss and excessive delay have a devastating effect on video presentation quality, and they are usually caused by network congestion. Thus, congestion control mechanisms at the end systems are necessary to help reduce packet loss and delay. Typically, for streaming video, congestion control takes the form of rate control. This is a technique used to determine the sending rate of video traffic based on the estimated available bandwidth in the network. Rate control attempts to minimize the possibility of network congestion by matching the rate of the video stream to the available network bandwidth. Existing rate control schemes can be classified into three categories: source-based, receiver-based, and hybrid rate control.

Under source-based rate control, the sender is responsible for adapting the video transmission rate. Feedback is employed by source-based rate control mechanisms.

Based upon the feedback information about the network, the sender can regulate the rate of the video stream. The source-based rate control can be applied to both unicast [116] and multicast [117]. Figure 5.56 represents unicast and multicast video distribution.

For unicast video, existing source-based rate control mechanisms follow two approaches: probe-based and model-based. The probe-based approach is based on probing experiments. In particular, the source probes for the available network bandwidth by adjusting the sending rate in a way that could maintain the packet loss ratio p below a certain threshold Pth. There are two ways to adjust the sending rate: (1) additive increase and multiplicative decrease, and (2) multiplicative increase and multiplicative decrease [118].
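A minimal sketch of the probe-based idea with additive increase and multiplicative decrease follows; the threshold, step sizes, and function name are illustrative assumptions, not values from [118].

```python
def probe_based_rate_update(rate_kbps, loss_ratio,
                            loss_threshold=0.02,
                            additive_step_kbps=50,
                            decrease_factor=0.75,
                            min_rate_kbps=100, max_rate_kbps=4000):
    """One probing step: additive increase while the loss ratio stays below the
    threshold, multiplicative decrease as soon as it exceeds it."""
    if loss_ratio <= loss_threshold:
        rate_kbps += additive_step_kbps          # probe for more bandwidth
    else:
        rate_kbps *= decrease_factor             # back off on congestion
    return max(min_rate_kbps, min(max_rate_kbps, rate_kbps))


# Example: feedback reports 1% loss, then 5% loss.
rate = 1000
rate = probe_based_rate_update(rate, 0.01)   # -> 1050
rate = probe_based_rate_update(rate, 0.05)   # -> 787.5
```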

The model-based approach is based on a throughput model of a transmission control protocol (TCP) connection. Specifically, the throughput of a TCP connection can be characterized by the following formula:

λ = (1.22 × MTU) / (RTT × √p)   (5.1)

where λ is the throughput of a TCP connection, MTU (maximum transmission unit) is the packet size used by the connection, RTT is the round-trip time for the connection, and p is the packet loss ratio experienced by the connection [119].

Under the model-based rate control, equation (5.1) is used to determine the sending rate of the video stream. Thus, the video connection can avoid congestion in a way similar to that of TCP, and it can compete fairly with TCP flows. For this reason, the model-based rate control is also called TCP-friendly rate control.

Figure 5.56 (a) Unicast video distribution using multiple point-to-point connections, (b) multicast video distribution using point-to-multipoint transmission.
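A small sketch of how equation (5.1) could drive model-based (TCP-friendly) rate control is given below; the units and the handling of zero loss are assumptions made for illustration.

```python
from math import sqrt

def tcp_friendly_rate_bps(mtu_bytes, rtt_s, loss_ratio):
    """Equation (5.1): throughput = 1.22 * MTU / (RTT * sqrt(p)), returned in bits/s."""
    if loss_ratio <= 0:
        return float("inf")        # the model needs a nonzero loss estimate
    return 1.22 * (mtu_bytes * 8) / (rtt_s * sqrt(loss_ratio))


# Example: 1500-byte packets, 100 ms RTT, 1% loss -> about 1.46 Mbit/s,
# which the sender would use as the target video sending rate.
print(tcp_friendly_rate_bps(1500, 0.1, 0.01))
```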

For multicast, under the source-based rate control, the sender uses a single channel to transport video to the receivers. Such multicast is called single-channel multicast. For single-channel multicast, only the probe-based rate control can be employed.

Single-channel multicast is efficient since all the receivers share one channel. However, single-channel multicast is unable to provide flexible services to meet the different demands from receivers with various access link bandwidths. In contrast, if multicast video were to be delivered through individual unicast streams, the bandwidth efficiency would be low, but the service could be differentiated since each receiver can negotiate the parameters of the service with the source.

Under the receiver-based rate control, the receiver regulates the receiving rate of video streams by adding/dropping channels, while the sender does not participate in rate control. Typically, receiver-based rate control is used in multicast scalable video, where there are several layers in the scalable video and each layer corresponds to one channel in the multicast tree [105].

Similar to the source-based rate control, the existing receiver-based rate control mechanisms follow two approaches: probe-based and model-based. The basic probe-based rate control consists of two parts [105] (see the sketch below):

. When no congestion is detected, a receiver probes for the available bandwidth by joining a layer/channel, resulting in an increase of its receiving rate. If no congestion is detected after the joining, the joining experiment is successful. Otherwise, the receiver drops the newly added layer.
. When congestion is detected, a receiver drops a layer (i.e., leaves a channel), resulting in a reduction of its receiving rate.

Unlike the probe-based approach, which implicitly estimates the available network bandwidth through probing experiments, the model-based approach uses explicit estimation of the available network bandwidth.
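The following sketch illustrates the receiver-driven join/leave logic for layered multicast described above. The congestion test, probing timer, and group addresses are assumptions; real schemes add refinements such as join-timer backoff and shared learning among receivers.

```python
class LayeredMulticastReceiver:
    """Illustrative receiver-based rate control: one multicast group per video layer."""

    def __init__(self, layer_groups):
        self.layer_groups = layer_groups   # multicast group per layer, base layer first
        self.subscribed = 1                # always keep the base layer

    def on_probe_timer(self, congestion_detected):
        """Join one more layer when no congestion is seen; drop one on congestion."""
        if congestion_detected and self.subscribed > 1:
            self.subscribed -= 1           # leave the highest enhancement layer
        elif not congestion_detected and self.subscribed < len(self.layer_groups):
            self.subscribed += 1           # probing join; dropped again if it causes loss
        return self.layer_groups[:self.subscribed]


recv = LayeredMulticastReceiver(["232.0.0.1", "232.0.0.2", "232.0.0.3"])
print(recv.on_probe_timer(congestion_detected=False))  # joins the second layer
print(recv.on_probe_timer(congestion_detected=True))   # congestion: back to base layer
```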

Under the hybrid rate control, the receiver regulates the receiving rate of video streams by adding/dropping channels, while the sender also adjusts the transmission rate of each channel based on feedback from the receivers. Examples of hybrid rate control include the destination set grouping and layered multicast schemes [120].

The architecture for source-based rate control is shown in Figure 5.57. A technique associated with rate control is rate shaping. The objective of rate shaping is to match the rate of a precompressed video bit stream to the target rate constraints. A rate shaper (or filter), which performs rate shaping, is required for source-based rate control. This is because the stored video may be precompressed at a certain rate, which may not match the available bandwidth in the network. Many types of filters can be used, such as the codec filter, frame-dropping filter, layer-dropping filter, frequency filter, and requantization filter [121].

In some applications, the purpose of congestion control is to avoid congestion. On the other hand, packet loss is inevitable in the Internet and may have significant impact on perceptual quality. This prompts the need to design mechanisms to maximize video presentation quality in the presence of packet loss. Error control is such a mechanism, and will be presented next.

Figure 5.57 Architecture for source-based rate control [105]. (©2001 IEEE.)

Error Control. In the Internet, packets may be dropped due to congestion at routers, they may be misrouted, or they may reach the destination with such a long delay as to be considered useless or lost. Packet loss may severely degrade the visual presentation quality. To enhance the video quality in the presence of packet loss, error-control mechanisms have been proposed.

For certain types of data (such as text), packet loss is intolerable while delay is acceptable. When a packet is lost, there are two ways to recover it: the corrupted data can be corrected by traditional forward error correction (FEC), that is, channel coding, or the packet can be retransmitted. On the other hand, for real-time video, some visual quality degradation is often acceptable while delay must be bounded. This feature of real-time video introduces many new error-control mechanisms, which are applicable to video applications but not applicable to traditional data such as text. In essence, the error-control mechanisms for video applications can be classified into four types, namely, FEC, retransmission, error resilience, and error concealment. FEC, retransmission, and error resilience are performed at both the source and the receiver side, while error concealment is carried out only at the receiver side.

The principle of FEC is to add redundant information so that the original message can be reconstructed in the presence of packet loss. Based on the kind of redundant information to be added, we classify existing FEC schemes into three categories: channel coding, source coding-based FEC, and joint source/channel coding [106]. For Internet applications, channel coding is typically used in the form of block codes. Specifically, a video stream is first chopped into segments, each of which is packetized into k packets; then, for each segment, a block code is applied to the k packets to generate an n-packet block, where n > k. To perfectly recover a segment, a user only needs to receive any k packets in the n-packet block [122].

Source coding-based FEC (SFEC) is a variant of FEC for Internet video [123]. Like channel coding, SFEC also adds redundant information to recover from loss. For example, the nth packet contains the nth group of blocks (GOB) and redundant information about the (n − 1)th GOB, which is a compressed version of the (n − 1)th GOB with a larger quantizer.

Joint source/channel coding is an approach to optimal rate allocation between source coding and channel coding [106].

Delay-Constrained Retransmission. Retransmission is usually dismissed as a method to recover lost packets in real-time video, since a retransmitted packet may miss its playout time. However, if the one-way trip time is short with respect to the maximum allowable delay, a retransmission-based approach (called delay-constrained retransmission) is a viable option for error control.

For unicast, the receiver can perform the following delay-constrained retransmission scheme: when the receiver detects the loss of packet N, it sends the request for packet N to the sender only if Tc + RTT + Da < Td(N), where Tc is the current time, RTT is the estimated round-trip time, Da is a slack term, and Td(N) is the time when packet N is scheduled for display.

The slack time Da may include the tolerance of error in estimating RTT, the sender's response time, and the receiver's decoding delay. The timing diagram for receiver-based control is shown in Figure 5.58, where Da is only the receiver's decoding delay. It is clear that the objective of delay-constrained retransmission is to suppress requests of retransmission that will not arrive in time for display.

Figure 5.58 Timing diagram for receiver-based control [105]. (©2001 IEEE.)
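A small sketch of the receiver-side decision rule just described; the time units and function name are illustrative assumptions.

```python
def should_request_retransmission(t_now, rtt_estimate, slack, t_display):
    """Request packet N again only if the retransmission can still arrive
    before its scheduled display time: Tc + RTT + Da < Td(N)."""
    return t_now + rtt_estimate + slack < t_display


# Packet due for display 400 ms from now, RTT 150 ms, 50 ms slack -> worth requesting.
print(should_request_retransmission(t_now=0.0, rtt_estimate=0.150,
                                    slack=0.050, t_display=0.400))   # True
```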

Error-Resilient Encoding. The objective of error-resilient encoding is to enhance the robustness of compressed video to packet loss. The standardized error-resilient encoding schemes include resynchronization marking, data partitioning, and data recovery [124]. However, resynchronization marking, data partitioning, and data recovery are targeted at error-prone environments like wireless channels and may not be applicable to the Internet environment. For video transmission over the Internet, the boundary of a packet already provides a synchronization point in the variable-length coded bit stream at the receiver side. On the other hand, since a packet loss may cause the loss of all the motion data and its associated shape/texture data, mechanisms such as resynchronization marking, data partitioning, and data recovery may not be useful for Internet video applications. Therefore, we will not present the standardized error-resilient tools. Instead, we present multiple description coding (MDC), which is promising for robust Internet video transmission [125].

With MDC, a raw video sequence is compressed into multiple streams (referred to as descriptions) as follows: each description provides acceptable visual quality, and more combined descriptions provide a better visual quality. The advantages of MDC are:

. Robustness to loss – even if a receiver gets only one description (the other descriptions being lost), it can still reconstruct video with acceptable quality.
. Enhanced quality – if a receiver gets multiple descriptions, it can combine them together to produce a better reconstruction than that produced from any one of them.

However, the advantages come with a cost. To make each description provide acceptable visual quality, each description must carry sufficient information about the original video. This will reduce the compression efficiency compared to conventional single description coding (SDC). In addition, although more combined descriptions provide a better visual quality, a certain degree of correlation between the multiple descriptions has to be embedded in each description, resulting in a further reduction of the compression efficiency. Further investigation is needed to find a good trade-off between compression efficiency and reconstruction quality from any one description.
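As a hedged toy example of the MDC idea (odd/even temporal splitting is one of the simplest multiple-description schemes, not the specific codecs surveyed in [125]), the sketch below splits a frame sequence into two descriptions and reconstructs from whichever descriptions arrive.

```python
def mdc_split(frames):
    """Two descriptions by temporal splitting: even-indexed and odd-indexed frames."""
    return frames[0::2], frames[1::2]

def mdc_reconstruct(even=None, odd=None):
    """Combine whatever descriptions arrived; with one missing, repeat each received
    frame as a crude stand-in for its lost neighbor (lower but acceptable quality)."""
    if even is not None and odd is not None:
        out = []
        for e, o in zip(even, odd):
            out += [e, o]
        return out + list(even[len(odd):])    # handle an odd total frame count
    received = even if even is not None else odd
    out = []
    for f in received:
        out += [f, f]
    return out

frames = ["f0", "f1", "f2", "f3", "f4"]
even, odd = mdc_split(frames)
print(mdc_reconstruct(even, odd))   # full quality: ['f0', 'f1', 'f2', 'f3', 'f4']
print(mdc_reconstruct(even, None))  # degraded: ['f0', 'f0', 'f2', 'f2', 'f4', 'f4']
```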

Error Concealment. Error-resilient encoding is executed by the source to enhance the robustness of compressed video before packet loss actually happens (this is called a preventive approach). On the other hand, error concealment is performed by the receiver when packet loss has already occurred (this is called a reactive approach). Specifically, error concealment is employed by the receiver to conceal the lost data and make the presentation less displeasing to human eyes.

There are two basic approaches for error concealment: spatial and temporal interpolation. In spatial interpolation, missing pixel values are reconstructed using neighboring spatial information. In temporal interpolation, the lost data are reconstructed from data in the previous frames. Typically, spatial interpolation is used to reconstruct the missing data in intracoded frames, while temporal interpolation is used to reconstruct the missing data in intercoded frames [126]. If the network is able to support QoS for video streaming, the performance can be further enhanced.

Continuous Media Distribution Services

In order to provide quality multimedia presentations, adequate support from the network is critical. This is because network support can reduce transport delay and packet loss ratio. Streaming video and audio are classified as continuous media because they consist of a sequence of media quanta (such as audio samples or video frames), which convey meaningful information only when presented in time. Built on top of the Internet (IP protocol), continuous media distribution services are designed with the aim of providing QoS and achieving efficiency for streaming video/audio over the best-effort Internet. Continuous media distribution services include network filtering, application-level multicast, and content replication.

Network Filtering. As a congestion control technique, network filtering aims to maximize video quality during network congestion. The filter at the video server can adapt the rate of video streams according to the network congestion status. Figure 5.59 illustrates an example of placing filters in the network. The nodes labeled R denote routers that have no knowledge of the format of the media streams and may randomly discard packets. The filter nodes receive the client's requests and adapt the stream sent by the server accordingly. This solution allows the service provider to place filters on the nodes that connect to network bottlenecks. Furthermore, multiple filters can be placed along the path from a server to a client.

To illustrate the operations of filters, a system model is depicted in Figure 5.60. The model consists of the server, the client, at least one filter, and two virtual channels between them. Of the virtual channels, one is for control and the other is for data. The same channels exist between any pair of filters. The control channel is bidirectional, which can be realized by TCP connections. The model shown allows the client to communicate with only one host (the last filter), which will either forward the requests or act upon them. The operations of a filter on the data plane include: (1) receiving the video stream from the server or the previous filter, and (2) sending the video to the client or the next filter at the target rate. The operations of a filter on the control plane include: (1) receiving requests from the client or the next filter, (2) acting upon the requests, and (3) forwarding the requests to its previous filter.

Figure 5.59 Filters placed inside the network.

Typically, frame-dropping filters are used as network filters. The receiver can change the bandwidth of the media stream by sending requests to the filter to increase or decrease the frame dropping rate. To facilitate decisions on whether the filter should increase or decrease the bandwidth, the receiver continuously measures the packet loss ratio p. Based on the packet loss ratio, a rate-control mechanism can be designed as follows. If the packet loss ratio is higher than a threshold a, the client will ask the filter to increase the frame dropping rate. If the packet loss ratio is less than another threshold b (b < a), the receiver will ask the filter to reduce the frame dropping rate [105].
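A minimal sketch of that two-threshold decision rule as the client might run it; the threshold values are illustrative assumptions.

```python
def frame_drop_request(loss_ratio, a=0.05, b=0.01):
    """Two-threshold rule: ask the filter to drop more frames above threshold a,
    fewer frames below threshold b (b < a), and leave the rate unchanged in between."""
    if loss_ratio > a:
        return "increase frame dropping rate"
    if loss_ratio < b:
        return "reduce frame dropping rate"
    return "keep current rate"

print(frame_drop_request(0.08))   # increase frame dropping rate
print(frame_drop_request(0.005))  # reduce frame dropping rate
```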

The advantages of using frame-dropping filters inside the network include the following:

. Improved video quality. For example, when a video stream flows from an upstream link with larger available bandwidth to a downstream link with smaller available bandwidth, use of a frame-dropping filter at the connection point between the upstream link and the downstream link can help improve the video quality. This is because the filter understands the format of the media stream and can drop packets in a way that gracefully degrades the stream's quality instead of corrupting the flow outright.
. Bandwidth efficiency. This is because the filtering can help to save network resources by discarding those frames that are late.

Figure 5.60 Systems model of network filtering.

Application-Level Multicast. The Internet's original design, while well suited for point-to-point applications like e-mail, file transfer, and Web browsing, fails to effectively support large-scale content delivery like streaming-media multicast. In an attempt to address this shortcoming, a technology called IP multicast was proposed. As an extension to the IP layer, IP multicast is capable of providing efficient multipoint packet delivery. To be specific, the efficiency is achieved by having one and only one copy of the original IP packet (sent by the multicast source) be transported along any physical path in the IP multicast tree. However, with a decade of research and development, there are still many barriers to deploying IP multicast. These problems include scalability, network management, deployment, and support for higher layer functionality (e.g., error, flow, and congestion control) [127].

Application-level multicast is aimed at building a multicast service on top of the Internet. It enables independent content delivery service providers (CSPs), Internet service providers (ISPs), or enterprises to build their own Internet multicast networks and interconnect them into larger, worldwide media multicast networks. That is, the media multicast network can support peering relationships at the application level or streaming-media/content layer, where content backbones interconnect service providers. Hence, much as the Internet is built from an interconnection of networks enabled through IP-level peering relationships among ISPs, the media multicast networks can be built from an interconnection of content-distribution networks enabled through application-level peering relationships among various sorts of service providers, namely, traditional ISPs, CSPs, and application service providers (ASPs).

The advantage of application-level multicast is that it breaks barriers such as scalability, network management, and support for congestion control, which have prevented Internet service providers from establishing IP multicast peering arrangements.

Content Replication. An important technique for improving the scalability of the media delivery system is content replication. Content replication takes two forms: mirroring and caching, which are deployed by the content delivery service provider (CSP) and Internet service provider (ISP). Both mirroring and caching seek to place content closer to the clients, and both share the following advantages:

. reduced bandwidth consumption on network links
. reduced load on streaming servers
. reduced latency for clients
. increased availability.

Mirroring is to place copies of the original multimedia files on other machines scattered around the Internet. That is, the original multimedia files are stored on the main server while copies of the original multimedia files are placed on duplicate servers. In this way, clients can retrieve multimedia data from the nearest duplicate server, which gives the clients the best performance (e.g., lowest latency). Mirroring has some disadvantages. Currently, the mechanisms for establishing a dedicated mirror on an existing server, while cheaper, are still ad hoc and administratively complex. Finally, there is no standard way to make scripts and server setup easily transferable from one server to another.

Caching, which is based on the belief that different clients will load many of the same contents, makes local copies of the contents that the clients retrieve. Typically, clients in a single organization retrieve all contents from a single local machine, called a cache. The cache retrieves a video file from the origin server, stores a copy locally, and then passes it on to the client who requests it. If a client asks for a video file that the cache has already stored, the cache will return the local copy rather than going all the way to the origin server where the video file resides. In addition, cache sharing and cache hierarchies allow each cache to access files stored at other caches, so that the load on the origin server can be reduced and network bottlenecks can be alleviated [128].
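A minimal sketch of the cache behavior just described (return a local copy on a hit, otherwise fetch from the origin server and keep a copy); the fetch callable and in-memory storage are assumptions made for illustration.

```python
class VideoCache:
    """Illustrative proxy cache: serve local copies, fetch from the origin on a miss."""

    def __init__(self, fetch_from_origin):
        self.fetch_from_origin = fetch_from_origin   # callable: url -> bytes
        self.store = {}

    def get(self, url):
        if url in self.store:                        # cache hit: no trip to the origin
            return self.store[url]
        data = self.fetch_from_origin(url)           # cache miss: retrieve and keep a copy
        self.store[url] = data
        return data


# Example with a stand-in origin fetcher.
cache = VideoCache(fetch_from_origin=lambda url: b"<video bytes for %s>" % url.encode())
cache.get("http://origin.example/clip.mp4")   # miss: fetched from the origin
cache.get("http://origin.example/clip.mp4")   # hit: served from the local copy
```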

    Streaming ServersStreaming servers play a key role in providing streaming services. To offer quality

    streaming services, streaming servers are required to process multimedia data under

    timing constraints in order to prevent artifacts (e.g., jerkiness in video motion and

    pops in audio) during playback at the clients. In addition, streaming servers also

    need to support video cassette recorder (VCR) like control operations, such as

stop, pause/resume, fast forward, and fast backward. Streaming servers have to retrieve media components in a synchronous fashion. For example, retrieving a lec-

    ture presentation requires synchronizing video and audio with lecture slides. A

    streaming server consists of the following three subsystems: communicator (e.g.,

    transport protocol), operating system, and storage system.

    . The communicator involves the application layer and transport protocols

    implemented on the server. Through a communicator the clients can communi-

    cate with a server and retrieve multimedia contents in a continuous and syn-

    chronous manner.

    . The operating system, different from traditional operating systems, needs to

    satisfy real-time requirements for streaming applications.

    . The storage system for streaming services has to support continuous media sto-

    rage and retrieval.

Media Synchronization
Media synchronization is a major feature that distinguishes multimedia applications

    from other traditional data applications. With media synchronization mechanisms,

    the application at the receiver side can present various media streams in the same

    way as they were originally captured. An example of media synchronization is

    that the movements of a speaker’s lips match the played-out audio.

Multimedia applications integrate various media streams that must be presented

    in a synchronized fashion. For example, in distance learning, the presentation of

    slides should be synchronized with the commenting audio stream. Otherwise, the

    current slide being displayed on the screen may not correspond to the lecturer’s

    explanation heard by the students, which is annoying. With media synchronization,

    the application at the receiver side can present the media in the same way as they


were originally captured. Synchronization between the slides and the commenting

    audio stream is shown in Figure 5.61.

    Media synchronization refers to maintaining the temporal relationships within

    one data stream and among various media streams. There are three levels of syn-

    chronization, namely, intrastream, interstream, and interobject synchronization.

    The three levels of synchronization correspond to three semantic layers of multime-

    dia data as follows [129]:

    Intrastream Synchronization. The lowest layer of continuous media or time-dependent data (such as video and audio) is the media layer. The unit of the

media layer is a logical data unit such as a video/audio frame, which adheres to strict temporal constraints to ensure acceptable user perception at playback. Synchroniza-

    tion at this layer is referred to as intrastream synchronization, which maintains the

    continuity of logical data units. Without intrastream synchronization, the presen-

    tation of the stream may be interrupted by pauses or gaps.

    Interstream Synchronization. The second layer of time-dependent data is the

    stream layer. The unit of the stream layer is a whole stream. Synchronization at

    this layer is referred to as interstream synchronization, which maintains temporal

    relationships among different continuous media. Without interstream synchroniza-

    tion, skew between the streams may become intolerable. For example, users could

    be annoyed if they notice that the movements of the lips of a speaker do not corre-

    spond to the presented audio.

Interobject Synchronization. The highest layer of the multimedia document is the object layer, which integrates streams and time-independent data such as text and

    still images. Synchronization at this layer is referred to as interobject synchroniza-

    tion. The objective of interobject synchronization is to start and stop the presentation

    of the time-independent data within a tolerable time interval, if some previously

    defined points of the presentation of a time-dependent media object are reached.

    Without interobject synchronization, for example, the audience of a slide show

    could be annoyed if the audio is commenting on one slide while another slide

is being presented.

The essential part of any media synchronization mechanism is the specification of

    the temporal relations within a medium and between the media. The temporal

    relations can be specified either automatically or manually. In the case of audio/video recording and playback, the relations are specified automatically by the

Figure 5.61 Synchronization between the slides and the commenting audio stream.


recording device. In the case of presentations that are composed of independently cap-

    tured or otherwise created media, the temporal relations have to be specified manually

    (with human support). The manual specification can be illustrated by the design of a

slide show: the designer selects the appropriate slides, creates an audio object, and

    defines the units of the audio stream where the slides have to be presented.

    The methods that are used to specify the temporal relations include interval-

    based, axes-based, control flow-based, and event-based specifications. A widely

used specification method for continuous media is the axes-based specification,

    or time-stamping: at the source, a stream is time-stamped to keep temporal

    information within the stream and with respect to other streams; at the destina-

    tion, the application presents the streams according to their temporal

    relation [130].
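As a rough illustration of the axes-based (time-stamping) specification, the sketch below attaches timestamps to logical data units at the source and, at the destination, presents the units so that their original spacing is preserved after an initial buffering delay. It is a simplified, single-timeline model with hypothetical names, not a description of any particular player.

```python
# Sketch of axes-based specification (time-stamping); a single shared timeline is assumed.
import time

def stamp(units, clock=time.monotonic):
    """Source side: attach a capture timestamp to each logical data unit."""
    return [(clock(), u) for u in units]

def present(stamped_units, buffer_delay=0.5,
            clock=time.monotonic, sleep=time.sleep, render=print):
    """Destination side: reproduce the capture-time spacing of the units
    after an initial buffering delay of buffer_delay seconds."""
    first_ts = stamped_units[0][0]
    start = clock() + buffer_delay
    for ts, unit in stamped_units:
        wait = start + (ts - first_ts) - clock()
        if wait > 0:
            sleep(wait)
        render(unit)

# Example: units captured back-to-back are also presented back-to-back,
# half a second after presentation starts.
present(stamp(["audio unit 0", "audio unit 1", "audio unit 2"]))
```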

    Besides specifying the temporal relations, it is desirable that synchronization be

    supported by each component on the transport path. For example, the servers store

large amounts of data in such a way that retrieval is quick and efficient to reduce

    delay; the network provides sufficient bandwidth, and delay and jitter introduced

    by the network are tolerable to the multimedia applications; the operating systems

    and the applications provide real-time data processing (e.g., retrieval, resynchroni-

    zation, and display). However, real-time support from the network is not available in

    the current Internet. Hence, most synchronization mechanisms are implemented at

    the end systems. The synchronization mechanisms can be either preventive or cor-

    rective [131].

    Preventive mechanisms are designed to minimize synchronization errors as data

    is transported from the server to the user. In other words, preventive mechanisms

    attempt to minimize latencies and jitters. These mechanisms involve disk-reading

    scheduling algorithms, network transport protocols, operating systems, and synchro-

    nization schedulers. Disk-reading scheduling is the process of organizing and coor-

    dinating the retrieval of data from the storage devices. Network transport protocols

    provide means for maintaining synchronization during data transmission over the

    Internet.

    Corrective mechanisms are designed to recover synchronization in the presence

    of synchronization errors. Synchronization errors are unavoidable, since the Inter-

    net introduces random delay, which destroys the continuity of the media stream by

    incurring gaps and jitters during data transmission. Therefore, certain compen-

    sations (i.e., corrective mechanisms) at the receiver are necessary when synchroni-

    zation errors occur. An example of corrective mechanisms is the stream

    synchronization protocol (SSP). In SSP, the concept of an intentional delay is

    used by the various streams in order to adjust their presentation time to recover

    from network delay variations. The operations of SSP are described as follows.

    At the client side, units that control and monitor the client end of the data connec-

    tions compare the real arrival times of data with the ones predicted by the presen-

    tation schedule and notify the scheduler of any discrepancies. These discrepancies

    are then compensated by the scheduler, which delays the display of data that are

    ahead of other data, allowing the late data to catch up. To conclude, media

    synchronization is one of the key issues in the design of media streaming services.
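The intentional-delay idea described for SSP can be illustrated with a small calculation: given the discrepancy between actual and scheduled arrival times reported for each stream, streams that are ahead are delayed so that the most-late stream can catch up. The sketch below is an illustrative reconstruction of that behavior, not the SSP specification itself.

```python
# Illustrative reconstruction of the intentional-delay compensation described for SSP.

def intentional_delays(discrepancies):
    """discrepancies: mapping stream -> (actual arrival time - scheduled time), in seconds.
    Larger values mean the stream is further behind schedule. Streams that are
    ahead are delayed so that presentation realigns with the most-late stream."""
    worst = max(discrepancies.values())           # lateness of the most-delayed stream
    return {s: worst - d for s, d in discrepancies.items()}

# Example: audio is 40 ms late, video 120 ms late, slides on time.
# Result: audio is delayed by ~0.08 s, slides by 0.12 s, video not at all.
print(intentional_delays({"audio": 0.04, "video": 0.12, "slides": 0.0}))
```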


Protocols for Streaming Media
Protocols are designed and standardized for communication between clients and

    streaming servers. Protocols for streaming media provide such services as network

    addressing, transport, and session control. According to their functionalities, the pro-

    tocols can be classified into three categories: network-layer protocol such as Internet

    protocol (IP), transport protocol such as user datagram protocol (UDP), and session

    control protocol such as real-time streaming protocol (RTSP).

    Network-layer protocol provides basic network service support such as network

    addressing. The IP serves as the network-layer protocol for Internet video streaming.

    Transport protocol provides end-to-end network transport functions for stream-

    ing applications. Transport protocols include UDP, TCP, real-time transport proto-

    col (RTP), and real-time control protocol (RTCP). UDP and TCP are lower-layer

    transport protocols while RTP and RTCP [132] are upper-layer transport protocols,

which are implemented on top of UDP/TCP. Protocol stacks for media streaming are shown in Figure 5.62.

    Session control protocol defines the messages and procedures to control the

    delivery of the multimedia data during an established session. The RTSP and the

    session initiation protocol (SIP) are such session control protocols [133, 134].

    To illustrate the relationship among the three types of protocols, we depict the

    protocol stacks for media streaming. For the data plane, at the sending side, the com-

    pressed video/audio data is retrieved and packetized at the RTP layer. The RTP-packetized streams provide timing and synchronization information, as well as

sequence numbers. The RTP-packetized streams are then passed to the UDP/TCP layer and the IP layer. The resulting IP packets are transported over the Internet.

At the receiver side, the media streams are processed in the reverse manner before

Figure 5.62 Protocol stacks for media streaming [105]. (©2001 IEEE.)


their presentations. For the control plane, RTCP packets and RTSP packets are multiplexed at the UDP/TCP layer and move to the IP layer for transmission over the Internet.

    In what follows, we will discuss transport protocols for streaming media. Then,

    we will describe control protocols, that is, real-time streaming protocol (RTSP) and

    session initiation protocol (SIP).

    Transport Protocols. The transport protocol family for media streaming includes

    UDP, TCP, RTP, and RTCP protocols [135]. UDP and TCP provide basic transport

functions, while RTP and RTCP run on top of UDP/TCP. UDP and TCP support such functions as multiplexing, error control, congestion control, and flow

    control. These functions can be briefly described as follows. First, UDP and TCP

    can multiplex data streams for different applications running on the same machine

    with the same IP address. Secondly, for the purpose of error control, TCP and most

    UDP implementations employ the checksum to detect bit errors. If single or multiple

bit errors are detected in the incoming packet, the TCP/UDP layer discards the packet so that the upper layer (e.g., RTP) will not receive the corrupted packet.

    On the other hand, different from UDP, TCP uses retransmission to recover lost

    packets. Therefore, TCP provides reliable transmission while UDP does not.

    Thirdly, TCP employs congestion control to avoid sending too much traffic,

    which may cause network congestion. This is another feature that distinguishes

    TCP from UDP. Lastly, TCP employs flow control to prevent the receiver buffer

    from overflowing while UDP does not have any flow control mechanisms.

    Since TCP retransmission introduces delays that are not acceptable for streaming

    applications with stringent delay requirements, UDP is typically employed as the

    transport protocol for video streams. In addition, since UDP does not guarantee

    packet delivery, the receiver needs to rely on the upper layer (i.e., RTP) to detect

    packet loss.

    RTP is an Internet standard protocol designed to provide end-to-end transport

    functions for supporting real-time applications. RTCP is a companion protocol

    with RTP and is designed to provide QoS feedback to the participants of an RTP

    session. In other words, RTP is a data transfer protocol while RTCP is a control

    protocol.

    RTP does not guarantee QoS or reliable delivery, but rather provides the follow-

ing functions in support of media streaming (a header-parsing sketch follows the list):

    . Time-stamping. RTP provides time-stamping to synchronize different media

    streams. Note that RTP itself is not responsible for the synchronization,

    which is left to the applications.

    . Sequence numbering. Since packets arriving at the receiver may be out of

    sequence (UDP does not deliver packets in sequence), RTP employs sequence

    numbering to place the incoming RTP packets in the correct order. The

    sequence number is also used for packet loss detection.

    . Payload type identification. The type of the payload contained in an RTP packet

    is indicated by an RTP-header field called payload type identifier. The receiver


interprets the contents of the packet based on the payload type identifier. Certain common payload types such as MPEG-1/2 audio and video have been assigned payload type numbers. For other payloads, this assignment can be

    done with session control protocols [136].

    . Source identification. The source of each RTP packet is identified by an RTP-

    header field called Synchronization SouRCe identifier (SSRC), which provides

    a means for the receiver to distinguish different sources.
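The fields listed above (payload type, sequence number, timestamp, SSRC) all live in the 12-byte fixed RTP header defined in RFC 3550. The sketch below unpacks that fixed header and shows how sequence numbers can be used for loss detection; it deliberately ignores the CSRC list, header extensions, and padding.

```python
# Parse the 12-byte fixed RTP header (RFC 3550); CSRC list, extensions, and padding ignored.
import struct

def parse_rtp_header(packet: bytes) -> dict:
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "payload_type": b1 & 0x7F,  # tells the receiver how to interpret the payload
        "marker": (b1 >> 7) & 0x1,
        "sequence": seq,            # reordering and loss detection
        "timestamp": ts,            # media timing, used by the application for synchronization
        "ssrc": ssrc,               # identifies the source of the stream
    }

def lost_between(prev_seq, cur_seq):
    """Number of packets missing between two sequence numbers (16-bit, wraps around)."""
    return (cur_seq - prev_seq - 1) % 65536

# Example: a header for payload type 96, sequence 7, timestamp 90000, SSRC 0x1234.
hdr = struct.pack("!BBHII", 0x80, 96, 7, 90000, 0x1234)
print(parse_rtp_header(hdr))
print(lost_between(7, 10))  # two packets (8 and 9) were lost
```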

    RTCP is the control protocol designed to work in conjunction with RTP. In an

    RTP session, participants periodically send RTCP packets to convey feedback on

    quality of data delivery and information of membership. RTCP essentially provides

    the following services.

    . QoS feedback. This is the primary function of RTCP. RTCP provides feedback

    to an application regarding the quality of data distribution. The feedback is

    in the form of sender reports (sent by the source) and receiver reports (sent

    by the receiver). The reports can contain information on the quality of reception

    such as: (1) fraction of the lost RTP packets, since the last report; (2) cumulat-

    ive number of lost packets, since the beginning of reception; (3) packet inter-

    arrival jitter; and (4) delay since receiving the last sender’s report. The

    control information is useful to the senders, the receivers, and third-party moni-

    tors. Based on the feedback, the sender can adjust its transmission rate, the

    receivers can determine whether congestion is local, regional, or global, and

    the network manager can evaluate the network performance for multicast

    distribution.

    . Participant identification. A source can be identified by the SSRC field in the

    RTP header. Unfortunately, the SSRC identifier is not convenient for the human

    user. To remedy this problem, the RTCP provides human-friendly mechanisms

for source identification. Specifically, the RTCP SDES (source description)

    packet contains textual information called canonical names as globally unique

    identifiers of the session participants. It may include a user’s name, telephone

    number, e-mail address, and other information.

. Control packet scaling. To scale the RTCP control packet transmission with the number of participants, a control mechanism is designed as follows (see the sketch after this list).

The control mechanism limits the total control traffic to 5 percent of the

    total session bandwidth. Among the control packets, 25 percent are allocated

    to the sender reports and 75 percent to the receiver reports. To prevent control

    packet starvation, at least one control packet is sent within 5 s at the sender or

    receiver.

. Intermedia synchronization. The RTCP sender reports contain an indication of

    real time and the corresponding RTP time-stamp. This can be used in inter-

media synchronization, such as lip synchronization between audio and video.

    . Minimal session control information. This optional functionality can be used

    for transporting session information such as names of the participants.
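To make the control packet scaling rule concrete, the sketch below estimates the average interval between RTCP reports from the session bandwidth, the 5 percent control-traffic limit, and the 25/75 percent sender/receiver split described above; the 5 s rule is interpreted here as a minimum reporting interval. It is a simplified illustration; the full RTCP timing rules (e.g., interval randomization) are omitted.

```python
# Simplified sketch of RTCP report-interval scaling (the real rules in RFC 3550
# also randomize the interval and track the actual average packet size).

def rtcp_interval(session_bw_bps, n_senders, n_receivers,
                  avg_rtcp_pkt_bytes=100, is_sender=False):
    rtcp_bw = 0.05 * session_bw_bps / 8.0            # 5% of session bandwidth, in bytes/s
    if is_sender:
        share, members = 0.25 * rtcp_bw, n_senders   # senders share 25% of it
    else:
        share, members = 0.75 * rtcp_bw, n_receivers # receivers share 75% of it
    interval = members * avg_rtcp_pkt_bytes / share  # seconds between this member's reports
    return max(interval, 5.0)                        # 5 s rule, read here as a minimum interval

# Example: a 1 Mbit/s session with 2 senders and 500 receivers.
print(round(rtcp_interval(1_000_000, 2, 500), 1), "s between receiver reports")  # about 10.7
```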


Session Control Protocols (RTSP and SIP). The RTSP is a session control pro-

    tocol for streaming media over the Internet. One of the main functions of RTSP is to

support VCR-like control operations such as stop, pause/resume, fast forward, and fast backward. In addition, RTSP also provides means for choosing delivery chan-

    nels (e.g., UDP, multicast UDP, or TCP), and delivery mechanisms based upon RTP.

    RTSP works for multicast as well as unicast.

    Another main function of RTSP is to establish and control streams of continuous

    audio and video media between the media server and the clients. Specifically, RTSP

    provides the following operations.

    . Media retrieval. The client can request a presentation description, and ask the

    server to set up a session to send the requested media data.

    . Adding media to an existing session. The server or the client can notify each

    other about any additional media becoming available to the established session.

In RTSP, each presentation and media stream is identified by an RTSP uniform resource locator (URL). The overall presentation and the properties of the media

    are defined in a presentation description file, which may include the encoding,

    language, RTSP URLs, destination address, port and other parameters. The presen-

    tation description file can be obtained by the client using HTTP, e-mail, or other

    means.
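Because RTSP is a text-based protocol with an HTTP-like request syntax, the media retrieval operation described above can be illustrated by the requests a client might issue. The sketch below only formats plausible DESCRIBE, SETUP, and PLAY requests (methods defined in RFC 2326); the URL, ports, and session identifier are made-up examples, and no server is actually contacted.

```python
# Sketch of RTSP requests for media retrieval (methods as defined in RFC 2326).
# The URL, ports, and session identifier below are invented for illustration.

def rtsp_request(method, url, cseq, extra_headers=()):
    lines = [f"{method} {url} RTSP/1.0", f"CSeq: {cseq}", *extra_headers, "", ""]
    return "\r\n".join(lines)

url = "rtsp://media.example.com/lecture"
print(rtsp_request("DESCRIBE", url, 1, ["Accept: application/sdp"]))
print(rtsp_request("SETUP", url + "/trackID=1", 2,
                   ["Transport: RTP/AVP;unicast;client_port=5004-5005"]))
print(rtsp_request("PLAY", url, 3, ["Session: 12345678", "Range: npt=0-"]))
```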

    SIP is another session control protocol. Similar to RTSP, SIP can also create and

    terminate sessions with one or more participants. Unlike RTSP, SIP supports user

    mobility by proxying and redirecting requests to the user’s current location.

    To summarize, RTSP and SIP are designed to initiate and direct delivery of

    streaming media data from media servers. RTP is a transport protocol for streaming

    media data while RTCP is a protocol for monitoring delivery of RTP packets. UDP

and TCP are lower-layer transport protocols for RTP/RTCP/RTSP/SIP packets, and IP provides a common platform for delivering UDP/TCP packets over the Internet. The combination of these protocols provides a complete streaming service over the

    Internet.

    Video streaming is an important component of many Internet multimedia appli-

    cations, such as distance learning, digital libraries, home shopping, and video-on-

    demand. The best-effort nature of the current Internet poses many challenges to

    the design of streaming video systems. Our objective is to give the reader a perspec-

    tive on the range of options available and the associated trade-offs among perform-

    ance, functionality, and complexity.

    5.4.3 Challenges for Transporting Real-Time Video Over the Internet

    Transporting video over the Internet is an important component of many multimedia

    applications. Lack of QoS support in the current Internet, and the heterogeneity of

    the networks and end systems pose many challenging problems for designing

    video delivery systems. Four problems for video delivery systems can be identified:


bandwidth, delay, loss, and heterogeneity. Two general approaches address these

    problems: the network-centric approach and the end system-based approach

    [106]. Over the past several years extensive research based on the end system-

    based approach has been conducted and various solutions have been proposed. A

    holistic approach was taken from both transport and compression perspectives. A

    framework for transporting real-time Internet video includes two components,

    namely, congestion control and error control. It is well known that congestion con-

    trol consists of rate control, rate-adaptive coding, and rate shaping. Error control

    consists of forward error correction (FEC), retransmission, error resilience and

    error concealment. As shown in Table 5.12, the approaches in the design space

    can be classified along two dimensions: the transport perspective and the com-

    pression perspective.

    There are three mechanisms for congestion control: rate control, rate adaptive

video encoding, and rate shaping. Rate control schemes can be classified into three categories: source-based, receiver-based, and hybrid. As shown

    in Table 5.13, rate control schemes can follow either the model-based

    approach or probe-based approach. Source-based rate control is primarily targeted

    at unicast and can follow either the model-based approach or the probe-based

    approach.
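As one concrete example of probe-based, source-based rate control, the sketch below adjusts the sending rate from receiver loss feedback using an additive-increase/multiplicative-decrease rule. This is a generic formulation for illustration, not the specific scheme of the cited works, and the thresholds and step sizes are arbitrary.

```python
# One possible probe-based, source-based rate controller: additive increase when the
# reported loss is low, multiplicative decrease when it exceeds a threshold.
# Thresholds and step sizes are arbitrary illustrative values. Rates in kbit/s.

def adjust_rate(rate, loss_ratio, loss_threshold=0.05,
                additive_step=25.0, decrease_factor=0.875,
                min_rate=64.0, max_rate=2000.0):
    if loss_ratio > loss_threshold:   # congestion inferred from receiver feedback
        rate *= decrease_factor       # back off multiplicatively
    else:
        rate += additive_step         # probe for more bandwidth additively
    return min(max(rate, min_rate), max_rate)

rate = 800.0
rate = adjust_rate(rate, 0.01)   # low loss  -> 825.0
rate = adjust_rate(rate, 0.08)   # high loss -> about 721.9
print(rate)
```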

    There have been extensive efforts on the combined transport approach and com-

    pression approach [137]. The synergy of transport and compression can provide bet-

    ter solutions in the design of video delivery systems.

Table 5.12 Taxonomy of the design space

Congestion control
    Transport:    Rate control (source-based, receiver-based, hybrid)
                  Rate shaping (selective frame discard, dynamic rate shaping)
    Compression:  Rate-adaptive encoding (altering quantizer, altering frame rate)

Error control
    Transport:    FEC (channel coding, SFEC, joint channel/source coding)
                  Delay-constrained retransmission (sender-based control,
                  receiver-based control, hybrid control)
    Compression:  Error resilience (optimal mode selection, multiple description coding)
                  Error concealment (EC-1, EC-2, EC-3)


Under the end-to-end approach, three factors are identified to have an impact on the

    video presentation quality at the receiver:

    . the source behavior (e.g., quantization and packetization)

    . the path characteristics

    . the receiver behavior (e.g., error concealment (EC)).

Figure 5.63 shows the factors that have an impact on video presentation quality, that is, source behavior, path characteristics, and receiver behavior. By taking into

    consideration the network congestion status and receiver behavior, the end-to-end

    approach is capable of offering superior performance over the classical approach

    for Internet video applications. A promising future research direction is to combine

    the end system-based control techniques with QoS support from the network.

    Different from the case in circuit-switched networks, in packet-switched net-

    works, flows are statistically multiplexed onto physical links and no flow is isolated.

    To achieve high statistical multiplexing gain or high resource utilization in the net-

    work, occasional violations of hard QoS guarantees (called statistical QoS) are

    allowed. For example, the delay of 95 percent packets is within the delay bound

    while 5 percent packets are not guaranteed to have bounded delays. The percentage

    (e.g., 95 percent) is in an average sense. In other words, a certain flow may have only

    10 percent packets arriving within the delay bound while the average for all flows is

Table 5.13 Rate control

    Source-based:    model-based (unicast); probe-based (unicast/multicast)
    Receiver-based:  model-based (multicast); probe-based (multicast)
    Hybrid:          multicast

Figure 5.63 Factors that have an impact on video presentation quality: source behavior, path characteristics, and receiver behavior [137]. (©2000 IEEE.)


95 percent. The statistical QoS service only guarantees the average performance,

    rather than the performance for each flow. In this case, if the end system-based con-

    trol is employed for each video stream, higher presentation quality can be achieved

    since the end system-based control is capable of adapting to short-term violations.
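The distinction between per-flow and average guarantees can be checked with a few lines of arithmetic, as in the sketch below; the delay samples are invented for illustration.

```python
# Fraction of packets meeting a delay bound, per flow and averaged over flows.
# The delay samples below are invented for illustration.

def within_bound(delays_ms, bound_ms):
    return sum(d <= bound_ms for d in delays_ms) / len(delays_ms)

flows = {
    "flow_a": [40, 55, 48, 300, 260],   # two packets miss the 100 ms bound
    "flow_b": [30, 35, 28, 33, 31],     # all packets meet it
}
per_flow = {name: within_bound(d, 100) for name, d in flows.items()}
average = sum(per_flow.values()) / len(per_flow)
print(per_flow, "average:", average)    # a single flow can fall well below the average
```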

    As a final note, we would like to point out that each scheme has a trade-off

between cost/complexity and performance. Designers can choose a scheme in the design space that meets the specific cost/performance objectives.

    5.4.4 End-to-End Architecture for Transporting

    MPEG-4 Video Over the Internet

    With the success of the Internet and flexibility of MPEG-4, transporting MPEG-4

    video over the Internet is expected to be an important component of many multime-

    dia applications in the near future. Video applications typically have delay and loss

    requirements, which cannot be adequately supported by the current Internet. Thus, it

    is a challenging problem to design an efficient MPEG-4 video delivery system that

    can maximize perceptual quality while achieving high resource utilization.

    MPEG-4 builds on elements from several successful technologies, such as digital

    video, computer graphics, and the World Wide Web, with the aim of providing

    powerful tools in the production, distribution, and display of multimedia contents.

    With the flexibility and efficiency provided by coding a new form of visual data

    called visual object (VO), it is foreseen that MPEG-4 will be capable of addressing

    interactive content-based video services, as well as conventional storage and trans-

    mission video. Internet video applications typically have unique delay and loss

    requirements that differ from other data types. Furthermore, the traffic load con-

    dition over the Internet varies drastically over time, which is detrimental to video

    transmission. Thus, it is a major challenge to design an efficient video delivery sys-

tem that can maximize users’ perceived quality of service (QoS) while achiev-

    ing high resource utilization in the Internet.

    Figure 5.64 shows an end-to-end architecture for transporting MPEG-4 video

    over the Internet. The architecture is applicable to both precompressed video and

    live video.

    If the source is a precompressed video, the bit rate of the stream can be matched

    to the rate enforced by a feedback control protocol through dynamic rate shaping

[138] or selective frame discarding [136]. If the source is live video, the rate enforced by the feedback control protocol is used by the MPEG-4 rate adaptation algorithm to control the output rate of the encoder.
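Selective frame discarding for precompressed video can be sketched as dropping the least important frames first (e.g., B-frames before P-frames, with I-frames kept) until a group of pictures fits the enforced rate budget. The frame sizes and priority order below are illustrative assumptions, not the algorithm of [136].

```python
# Sketch of selective frame discarding for precompressed video (illustrative only).
# Frames are (type, size_bits); B-frames are dropped before P-frames, I-frames are kept.

DISCARD_ORDER = {"B": 0, "P": 1, "I": 2}

def shape_gop(frames, budget_bits):
    total = sum(size for _, size in frames)
    kept = list(frames)
    # Drop the lowest-priority frames first until the group of pictures fits the budget.
    for ftype, size in sorted(frames, key=lambda f: DISCARD_ORDER[f[0]]):
        if total <= budget_bits or ftype == "I":
            break
        kept.remove((ftype, size))
        total -= size
    return kept

gop = [("I", 60000), ("B", 12000), ("B", 11000), ("P", 30000), ("B", 10000)]
print(shape_gop(gop, 100000))   # B-frames removed until the 100 kbit budget is met
```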

On the sender side, the raw bit stream of live video is encoded by an adaptive MPEG-4 encoder. After this stage, the compressed video bit stream is first packetized at the sync layer (SL) and then passed through the RTP/UDP/IP layers before entering the Internet.

Packets may be dropped at a router/switch (due to congestion) or at the destination (due to excess delay). Packets that are successfully delivered to the destination first pass through the RTP/UDP/IP layers in reverse order before being decoded at the MPEG-4 decoder.


Under the architecture, a QoS monitor is kept at the receiver side to infer network

    congestion status based on the behavior of the arriving packets, for example, packet

loss and delay. Such information is conveyed by the feedback control protocol back to the source. Based on this feedback, the sender estimates the

    available network bandwidth and controls the output rate of the MPEG-4 encoder.

    Figure 5.65 shows the protocol stack for transporting MPEG-4 video. The right

    half shows the processing stages at an end system. At the sending side, the com-

    pression layer compresses the visual information and generates elementary streams

    (ESs), which contain the coded representation of the VOs. The ESs are packetized as

    SL-packetized streams at the SL. The SL-packetized streams are multiplexed into

    FlexMux stream at the TransMux Layer, which is then passed to the transport pro-

    tocol stacks composed of RTP, UDP, and IP. The resulting IP packets are trans-

    ported over the Internet. At the receiver side, the video stream is processed in the

reverse manner before its presentation. The left half shows the data format of

    each layer.

    Figure 5.66 shows the structure of MPEG-4 video encoder. Raw video stream is

    first segmented into video objects, then encoded by individual VO encoder. The

encoded VO bit streams are packetized before being multiplexed by the stream multiplex interface. The resulting FlexMux stream is passed to the RTP/UDP/IP module. The structure of an MPEG-4 video decoder is shown in Figure 5.67. Packets from

RTP/UDP/IP are transferred to a stream demultiplex interface and FlexMux buffer. The packets are demultiplexed and put into corresponding decoding buffers. The

    error concealment component will duplicate the previous VOP when packet loss

    is detected. The VO decoders decode the data in the decoding buffer and produce

    composition units (CUs), which are then put into composition memories to be con-

    sumed by the compositor.
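The simple error concealment step mentioned above, duplicating the previous VOP when a loss is detected, can be sketched as follows; the data structures are purely illustrative.

```python
# Sketch of the simple error concealment described above: when a VOP is lost,
# the previously decoded VOP is repeated (data structures are illustrative only).

def conceal(decoded_vops):
    """decoded_vops: list where None marks a VOP lost in transmission."""
    output, last = [], None
    for vop in decoded_vops:
        if vop is None and last is not None:
            output.append(last)      # duplicate the previously decoded VOP
        elif vop is not None:
            output.append(vop)
            last = vop
        # a loss before any VOP has been decoded is simply skipped in this sketch
    return output

print(conceal(["vop0", "vop1", None, "vop3"]))  # ['vop0', 'vop1', 'vop1', 'vop3']
```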

    To conclude, the MPEG-4 video standard has the potential of offering interactive

    content-based video services by using VO-based coding. Transporting MPEG-4

Figure 5.64 An end-to-end architecture for transporting MPEG-4 video [116]. (©2000 IEEE.)


video is foreseen to be an important component of many multimedia applications.

    On the other hand, since the current Internet lacks QoS support, there remain

    many challenging problems in transporting MPEG-4 video with satisfactory video

    quality. For example, one issue is packet loss control and recovery associated

    with transporting MPEG-4 video. Another issue that needs to be addressed is the

    support of multicast for Internet video.

    5.4.5 Broadband Access

    The demand for broadband access has grown steadily as users experience the con-

venience of high-speed response combined with always-on connectivity. There are a

Figure 5.65 Data format at each processing layer at an end system [116]. (©2000 IEEE.)

