8/8/2019 Multimedia Middle Ware
1/100
Helwan University
Faculty of Engineering
Department of Electronics, Communications, and Computers
MULTIMEDIA MIDDLEWARE
by
Nora Abdel Gaffar Naguib El-Morsy
B.Sc. in Telecommunication Engineering, 2005
A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science in Telecommunications Engineering
Supervised by:
Prof. Mohamed I. El-Adawy, Faculty of Engineering, Helwan University
Dr. Hesham A. Keshk, Faculty of Engineering, Helwan University
Dr. Ahmed E. Hussein, Faculty of Engineering, Helwan University
2010
ACKNOWLEDGEMENT
It is a pleasure to thank those who made this thesis possible. I would like to
express my gratitude to Prof. Mohamed I. El-Adawy for his constant support
and most valuable advice. I would like to thank the rest of the supervisory
committee for all their help, and Dr. Ahmed E. Hussein for the suggestion of
reference titles.
I would also like to thank my family for the support they provided me
through my entire life and in particular, I really cannot express my full
gratitude to my brother Yasser Naguib who patiently proofread this entire
thesis. Special thanks go to my brother Wael Naguib without whose
motivation and encouragement I would not have considered a post graduate
degree. Above all, to my mother who stood beside me all the time.
Lastly, I offer my regards to all of those who supported me in any respect
during the completion of the project.
I dedicate this thesis to My Mother
PUBLICATIONS
Nora A. Naguib, Ahmed E. Hussein, Hesham A. Keshk, and Mohamed I. El-Adawy, "Contrast Error Distribution Measurement for Full Reference Image Quality Assessment," The 18th International Conference on Computer Theory and Applications, 2008, Alexandria, Egypt.

Nora A. Naguib, Ahmed E. Hussein, Hesham A. Keshk, and Mohamed I. El-Adawy, "Using PFA in Feature Analysis and Selection for H.264 Adaptation," World Academy of Science, Engineering and Technology, Volume 54, June 2009, Paris, France, ISSN: 2070-3724.
ABSTRACT
In today's world, users have heterogeneous devices connected to a mesh of networks, each
with different capabilities and restrictions. Multimedia content providers need innovative
approaches: keeping not only one version of each video, but also having the capability to
offer different bitstreams for a variety of client capabilities. The previously used
"one size fits all" design cannot apply in the diverse environments present today. A single
bitstream with static parameters cannot satisfy the diversity present on the client side. This is
why researchers in Universal Multimedia Access (UMA) are working on the development
of new techniques for coding multimedia objects with maximum compression efficiency,
along with flexibility in the parameters of the provided video when dealing with client devices.
The transcoding of multimedia objects requires the presence of intermediate systems that are
capable of altering the bitstream on demand. Those systems should be capable of
manipulating different bitstream formats. A large number of adaptation techniques exist
in today's literature, each specialized in altering the video bitstream with respect to only one
dimension, namely temporal (frame rate), spatial (resolution), Signal to Noise Ratio (SNR),
or format conversion. In the real world, adaptation of video sequences should take the form of
multi-dimensional adaptation, allowing the system to apply a combination of reduction processes
to different parameters of the video sequence while providing the best possible quality.
In this thesis, we have focused on the transcoder policy module. While most of the previous
studies in multimedia transcoding focused on the transcoding techniques themselves, the lack
of a control algorithm rendered those techniques useless. The study was directed toward the
creation of an offline data analysis model for the transcoder's policy module.
The results and analysis provided in this thesis help toward the creation of a policy module
that controls the transcoder operation for universal multimedia access.
KEYWORDS: Multimedia Transcoding, Objective Quality Assessment, Universal
Multimedia Access.
4-3-3 Prediction Accuracy
4-3-4 Prediction Monotonicity
4-3-5 Prediction Consistency
4-4 Results
4-4-1 Overall Performance
4-4-2 Cross-Distortion Performance
4-4-3 Logistic Regression Performance
4-4-4 Complexity Performance
Data Analysis
5-1 Introduction
5-2 Offline Data Analysis Model
5-3 H.264 Setup
5-4 Test Sequences
5-5 Features
5-5-1 Feature Definitions
5-5-1-1 Source Domain Features
5-5-1-2 Resources Required
5-5-1-3 Coded Domain Features
5-5-2 Analysis and Selection
5-6 Results
5-7 Transcoder Configuration
5-8 Transcoder Setup
5-9 Clustering
Conclusion and Future Work
6-1 Conclusion
6-2 Future Work
Bibliography
LIST OF FIGURES
Figure 1-1 Multimedia Middleware
Figure 2-1 Multimedia Communications Study Areas (2001 ITU-T)
Figure 2-2 General Architecture of Coding Algorithms
Figure 2-3 Scalable Bitstreams
Figure 3-1 Block Diagram of the Perceptual Distortion Metric (PDM)
Figure 3-2 Block Diagram of the Structural Similarity
Figure 3-3 Block Diagram of the Multi-Scale Structural Similarity. L: low-pass filtering; 2: down-sampling by 2
Figure 3-4 Conceptual Diagram of the VIF
Figure 3-5 Subjective Experiments: viewing modes (on the left), score scale (on the right). (a) Double Stimulus Impairment Scale (DSIS) (b) Double Stimulus Continuous Quality Scale (DSCQS) (c) Single Stimulus Continuous Quality Scale (SSCQS)
Figure 3-6 (a) Video Coding Layer (VCL) and Network Abstraction Layer (NAL) arrangement. (b) NAL unit
Figure 3-7 Block Diagram of the H.264 Encoder
Figure 3-8 Block Diagram of the H.264 Decoder
Figure 3-9 H.264 Profiles
Figure 3-10 Homogeneous Transcoding
Figure 3-11 Transcoder Implementation
Figure 3-12 Utility Model
Figure 3-13 Info-Pyramid Based Control Scheme
Figure 3-14 Three-Dimensional View
Figure 3-15 System Overview
Figure 3-16 Adaptation, Resource, and Utility Spaces
Figure 4-1 Block Diagram of the Contrast Error Distribution (CED)
Figure 4-2 Scatter plot of VQRs against DMOS values (blue), and nonlinear logistic fitting curve (black). Calculated for 6 VQMs: PSNR, SSIM, VIF, PD-VIF, CED, log(CED) respectively
Figure 4-3 Scatter plot of predicted DMOS (VQRs after logistic regression) against DMOS values. Calculated for 6 VQMs: PSNR, SSIM, VIF, PD-VIF, CED, log(CED) respectively
Figure 4-4 Calibration curves for each error domain: JPEG2K (green), JPEG (red), White Noise (blue), Gaussian Blur (magenta), Fast Fading (cyan), and all error domains (black). Calculated for 6 VQMs: PSNR, SSIM, VIF, PD-VIF, CED, log(CED)
Figure 5-1 Block Diagram of Multimedia Middleware
Figure 5-2 Test Sequences Description
Figure 5-3 Standard Transcoder Configuration
Figure 5-4 Adopted Transcoder Configuration
Figure 5-5 Normalized bitrate against different transcoding parameters for all the test sequences
Figure 5-6 Dendrogram of the generated clusters
Figure 5-7 Normalized bitrate after adding the no-transcoding values
LIST OF TABLES
Table 1 Comparison between the PSNR, SSIM, CED, PD-VIF, log(CED), and log(VIF) with respect to CC: Pearson Correlation Coefficient, SROCC: Spearman Rank Order Correlation Coefficient, RMSE: Root Mean Square Error
Table 2 Pearson Correlation Coefficient of the SSIM, CED, PD-VIF, log(CED), log(VIF). Calculated for the distortion domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading
Table 3 Spearman Rank Correlation Coefficient of the SSIM, CED, PD-VIF, log(CED), log(VIF). Calculated for the distortion domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading
Table 4 Root Mean Square Error of the SSIM, CED, PD-VIF, log(CED), log(VIF). Calculated for the distortion domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading
Table 5 Evaluation of the Quality Metrics
Table 6 Source Domain Features
Table 7 Resource Features
Table 8 Coded Domain Features
Table 9 Final Trial
ACRONYMS
ARU Adaptation / Resource / Utility
CED Contrast Error Distribution
CPDT Cascaded Pixel Domain Transcoder
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
DMOS Differential Mean Opinion Score
DSCQS Double Stimulus Continuous Quality Scale
DSIS Double Stimulus Impairment Scale
DWT Discrete Wavelet Transform
FIR Finite Impulse Response
FR QA Full Reference Quality Assessment
HVS Human Visual System
ISO/IEC International Organization for Standardization /
International Electrotechnical Commission
IT Information Technology
ITU-R International Telecommunication Union
Radio Communication
ITU-T International Telecommunication Union
Telecommunications
MM FSA MultiMedia Framework Study Areas
MPEG Moving Picture Experts Group
MSE Mean Square Error
NAL Network Abstraction Layer
NR QA No Reference Quality Assessment
NSS Natural Scene Statistics
PCA Principal Component Analysis
PDM Perceptual Distortion Metric
PFA Principal Feature Analysis
PSNR Peak Signal to Noise Ratio
QoE Quality of Experience
RR QA Reduced Reference Quality Assessment
SDOs Standards Development Organizations
SG Study Group
SNR Signal to Noise Ratio
SSCQS Single Stimulus Continuous Quality Scale
SSIM Structural Similarity
UMA Universal Multimedia Access
VCL Video Coding Layer
VIF Visual Information Fidelity
VQEG Video Quality Experts Group
VQM Video Quality Metric
VQR Video Quality Rating
C h a p t e r 1
I n t r o d u c t i o n
1.
1-1 Motivation
Multimedia plays an important role in our lives. Terms have been
introduced to industry, culture, and leisure that depend solely on the
evolution of the Multimedia Communications field. Working with a
team member overseas through your laptop would never have been possible
if it were not for video conferencing capabilities. The term "webinar" was not
used until a few years ago, when it was found that a web-based seminar would
be more effective in reaching its entire target audience regardless of distance.
Multimedia objects can be described as the most demanding objects
transferred between networks, where the Quality of Experience (QoE) [1] is
the most important thing. The slightest delay or error would heavily affect the
quality and render the multimedia object useless. This, however, does not
change the fact that multimedia is the most popular type of data on the
internet.
The growth in the number of users with access to the internet, along with the
tremendous increase in their network capabilities and mobility, has made way
for an increase in the amount of data accessed and uploaded through the
internet. At least 70% of this data consists of multimedia objects, and those
users spend more than 20% of their time away from their primary workplace.
For a relatively long time now, we have been used to having two types of
networks available to us: telecommunications and IT (Information Technology)
networks. Though we have interconnections between them, we have not yet
reached the combination of the two. To achieve this merge, the ITU-T
(International Telecommunication Union - Telecommunication) is working
on the standardization of what are called Next Generation Networks.
The work of Study Group 16 is focused on providing guidelines for a
"Network of Networks" that unifies the viewpoints of end users, standards
committees, and telecommunication and IT providers. This will allow the
convergence of all services under the umbrella of one network, and the
cooperation of content providers and network service providers to serve end
users better.
This advancement in telecommunications networks and device
interoperability has increased the importance of multimedia objects.
Multimedia communication is expected to dominate the field of
communications in the next 10 years. This makes it crucial for us to
tackle the problem of exchanging multimedia objects seamlessly in these
changing environments. The research presented in this thesis is an attempt to
examine some of the open issues in the field of multimedia communications.
1-2 Problem Statement
Multimedia middleware consists of intermediate systems between the client
and the content server that provide a number of complementary services. The
generalized block diagram of multimedia middleware is illustrated in Figure
1-1. Those servers are used to transcode multimedia objects before delivery
to client devices. This transcoding helps in situations where we do not
want to exhaust network resources or device processing power when users
are just reviewing multimedia objects to select one, or when the client device
does not have a high screen resolution.
Figure 1-1 Multimedia Middleware
Transcoding can be done with respect to numerous domains, none of which
will result in the same combination of resources. The transcoding middleware
should be able to evaluate the client request, analyze the content of the
requested multimedia object, choose a transcoding scheme, then transcode
and deliver it to the user. This middleware server will need to fit within the
existing system and be transparent to both content server and client.
A multimedia middleware should possess the following qualities in order to
be transparent to the client side:
- When adding a new multimedia object to the content server, the time
required for the transcoding server to analyze the content of the
video should be minimized.
- The time from the reception of a client request till the delivery of the
content back to the user should be minimized.
- The transcoding server should not require the presence of any pixel
domain information in any of its processes.
- The server should have the means to assess the quality of the
generated version of the multimedia object and choose between
different transcoding schemes.
The above qualities provide a roadmap for the implementation of transcoding
servers. However, for those servers to function properly, a set of offline data
analysis studies for multimedia objects should be done. In the available
literature, a number of studies have worked on this point, but none has reached
the optimal criteria satisfying the above stated qualities. Our work in
multimedia middleware is focused toward the implementation of the
transcoder policy module. We have divided the analysis into two points: a
quality assessment model has been developed for use in offline data
analysis, along with an overall feature analysis for the selection of transcoding
schemes.
1-3 Objectives and contributions
The middleware server request cycle consists of the following:
- Data analysis of the pre-encoded video stream.
- Policy module: choosing a transcoding scheme that best fits the
client requirements and has the best quality of all possible solutions.
- Transcoding the video stream.
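As a rough illustration, the three stages of this request cycle can be sketched as follows; all names, the feature set, and the fixed scheme list are hypothetical placeholders, not the actual modules developed in this thesis:

```python
def analyze_stream(stream):
    """Stage 1: extract coded-domain features of the pre-encoded stream
    (here just the bitrate; the real analysis is far richer)."""
    return {"bitrate_kbps": stream["bitrate_kbps"]}

def choose_scheme(features, request):
    """Stage 2 (policy module): pick the scheme that satisfies the client's
    bandwidth constraint with the least reduction, as a stand-in for quality."""
    # Hypothetical schemes, ordered from highest to lowest quality
    schemes = [
        {"name": "none", "scale": 1.0},          # no transcoding
        {"name": "snr", "scale": 0.5},           # SNR (bitrate) reduction
        {"name": "spatial+snr", "scale": 0.25},  # resolution + bitrate reduction
    ]
    for scheme in schemes:
        if features["bitrate_kbps"] * scheme["scale"] <= request["max_kbps"]:
            return scheme
    return schemes[-1]  # fall back to the strongest reduction

def handle_request(request, stream):
    """One pass through the three-stage request cycle."""
    features = analyze_stream(stream)
    scheme = choose_scheme(features, request)
    # Stage 3 would transcode here; we only report the chosen scheme
    return {"scheme": scheme["name"],
            "bitrate_kbps": features["bitrate_kbps"] * scheme["scale"]}
```

For example, a 1000 kbps stream requested by a client limited to 600 kbps would be served through the SNR-reduction scheme at 500 kbps.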
The objective of this research is to examine the first two stages. This work
will help toward the practical implementation of the middleware server
control module. The contributions of this research are concentrated in the
following:
- Discovering the features that best serve in clustering multimedia
objects and provide a means of predicting the way those objects would
react to different transcoding schemes.
- Developing a new quality assessment metric for the evaluation and
the choice of the best available transcoding scheme.
1-4 Thesis Outline
This thesis is organized as follows: chapter 2 introduces some of the
multimedia communications concepts used in the discussion presented in this
thesis, chapter 3 provides a review of the related literature, chapter 4
introduces the proposed objective quality assessment model along with the
evaluation of its performance, chapter 5 presents the offline data analysis and
the feature analysis for the implementation of the transcoder policy module,
and chapter 6 presents the conclusion and future work.
C h a p t e r 2
M u l t i m e d i a   C o m m u n i c a t i o n s   B a s i c s
2.
2-1 ITU-T MediaCom2004 project
The advances in multimedia communications depend not only on fields
that study multimedia objects but also on the development of underlying
networks and services that will allow the integration of complex multimedia
objects in resource-limited networks, taking into consideration the quality
received by end users.
ITU-T SG16, the lead Study Group for Multimedia, is working on the
MEDIACOM 2004 (Multimedia Communication 2004) project [2]. The objective of
the MEDIACOM 2004 project is to establish a framework for multimedia
standardization for use both inside and outside the ITU. This framework
will support the harmonized and coordinated development of global
multimedia communication standards across all ITU-T and ITU-R Study
Groups, in close cooperation with other regional and international
standards development organizations (SDOs).
Figure 2-1 presents the Multimedia framework study areas (MM FSA) as
defined by the Mediacom project.
Figure 2-1 Multimedia Communications Study areas (2001 ITU-T)
2-2 MPEG-7 and MPEG-21
Another important segment of research is the semantic annotation of
multimedia content. This annotation provides a bigger-picture view of the
overall information that resides in a webpage. As a result, the content of this
webpage can be classified based on its importance and then delivered.
MPEG-7 and MPEG-21 are two standards developed by the Moving
Picture Experts Group (MPEG) in 2003. Those standards are not intended
for the coding of multimedia objects as the preceding standards were. Instead,
they aim at integration with the other coding algorithms to allow the
transmission of user preference and context information back and forth
between clients and content servers.
2-3 Coding Standards
Multimedia objects are known to contain a large amount of correlated data.
Coding algorithms are designed to decouple these associations in both the
temporal and spatial dimensions, thereby achieving a high compression rate
without losing valuable information. Figure 2-2 illustrates the main
components of coding algorithms.
Figure 2-2 General Architecture of Coding Algorithms
MPEG-4 and H.264 are the newest standards for multimedia coding
developed by the MPEG. They both rely on the same coding principles but
with significantly different visions: MPEG-4 is mainly concerned with
flexibility, whereas H.264 features efficient compression and reliability.
As stated above, the difference between the two standards does not reside in
the theory of the compression module itself, but in how the input is treated.
In MPEG-4, the input of the compression module is a series of multimedia
objects that are contained in video frames, whereas H.264 uses frame-based
compression.
2-4 Transcoding vs. Scalable Coding
Scalable video encoding is the coding of video streams to contain a number
of substreams that can be decoded separately. The bitstream structure is
shown in Figure 2-3. First comes a base substream containing the most basic
information, which allows client devices to render the video with the lowest
obtainable quality. This is usually the case for mobile devices where the
client is connected to a low-bandwidth network. That base substream is
followed by a series of enhancement layers that can be downloaded on
demand; this is usually the case when the client can afford more resources
to increase the quality of the received video.
Figure 2-3 Scalable Bitstreams
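The base-plus-enhancement structure lends itself to a simple on-demand selection rule. The sketch below is only illustrative (the function and its inputs are hypothetical) and assumes the bitrate of each layer is known:

```python
def decodable_layers(layer_rates_kbps, available_kbps):
    """Count how many layers (the base layer first, then enhancement
    layers in order) the client can receive within its bandwidth."""
    total = 0.0
    count = 0
    for rate in layer_rates_kbps:
        total += rate
        if total > available_kbps:
            break  # this layer no longer fits the bandwidth budget
        count += 1
    return count
```

For example, a client on a 550 kbps link offered layers of 300, 200, and 200 kbps would fetch the base layer and one enhancement layer.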
On the other hand, transcoding can be achieved by the presence of
intermediate systems (Multimedia Middleware) between server and client. On
these subsystems the video is re-encoded upon receiving client requests. Those requests will contain the characteristics of the client device along with
the available network resources. In this thesis, the terms transcoding and
adaptation will be used interchangeably.
The most basic form of a transcoder is a back-to-back encoder-decoder
configuration. However, this configuration requires heavy processing power
on the intermediate system. Another form is based on partially decoding the
stream and manipulating the data in its pre-coded form without referring to
the pixel domain data. Those transcoding systems exploit dependencies
between coded domain and pixel domain information, along with a full
understanding of the coding scheme itself.
Scalable coding and transcoding are the two coexisting lines of UMA
research, where each has its advantages and limitations. Scalable coding has
the advantage of processing videos in advance; therefore, it does not require
any intermediate system. However, it means that the video bitstream
resource/quality degradation can be done only in predefined steps, and
therefore it does not comply with the exact client requirements.
In other words, scalable coding leaves an error margin between the provided
bitstream and the requested resource/quality, while transcoding tailors
video bitstreams to the exact device/network requirements provided by the
client requests.
Two other limitations involved in the practical implementation of scalable
coding are as follows:
- The decoder's compliance with the scalable coding format: non-compliant
decoders will only decode the base layer of the bitstream, yielding
low-quality video on clients that can support higher quality.
- The enormous number of single-layer video bitstreams available on
today's networks: in order to accommodate scalable coding techniques,
transcoding would be required for all existing videos.
2-5 Quality Assessment
Quality assessment is an important step in the transcoding/adaptation process.
In a proxy/middleware, the choice of the transcoding dimension and the exact
parameter depends on the quality produced. Although meeting client
requests and resources is the steering wheel of the transcoding middleware,
the QoE on the client side is what this whole system is about.
During the assessment of a reduced bitstream, we should bear in mind that the
quality measurement of multimedia objects is not defined as the fidelity of the
new bitstream to the original. Quality, when it comes to multimedia objects, is
defined as the perceived quality, which means that some errors are more
important than others. The perceived quality is related to the limitations within
the Human Visual System (HVS), where some errors are neutral while others
are severely perceived by it.
Peak Signal to Noise Ratio (PSNR) is considered to be the most recognized
quality metric. This metric calculates the error power within the image.
Consequently, it overlooks the significance of the affected data within the
image, along with the modification in the HVS response due to this variation in data.
The degree to which the alteration of a video bitstream has affected the
perceived quality can be calculated by either subjective experiments or
objective quality metrics. Subjective experiments refer to the viewing of videos
by human observers, where each observer rates the video quality and then a
mean opinion score is calculated for the video. Objective quality metrics
measure the degradation of visual perceptual quality by defining a criterion
for describing the perceptual error.
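The subjective procedure just described reduces to a simple computation: average the observers' ratings into a mean opinion score, and form a differential score against the reference version. This is only a minimal sketch; the exact normalization of differential scores varies between studies:

```python
def mean_opinion_score(ratings):
    """Average of the observers' ratings for one video."""
    return sum(ratings) / len(ratings)

def differential_mos(reference_ratings, distorted_ratings):
    """Drop in mean opinion between the reference video and the
    distorted version (normalization details vary between studies)."""
    return mean_opinion_score(reference_ratings) - mean_opinion_score(distorted_ratings)
```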
on the screen with respect to the original video stream. This clarifies why any
fidelity measure such as the SNR would fail to describe the opinion of the
observer.
Although the HVS is a complex system, it is limited when it comes to error
perception. These limitations are the reason why an error with less power
might contribute in a much more severe way to the degradation of image quality.
Up until now, subjective experiments have been used for the assessment of
multimedia quality. However, those experiments are impractical, expensive,
and time consuming. Hence, they cannot be used for estimating the quality of
multimedia objects during their reproduction. Researchers in the field of
multimedia quality assessment are working on the development of objective
metrics that can predict the observer's opinion about the quality of
multimedia objects.
3-1-2 Simple Quality Metrics
Simple error power models are considered to be the most recognized quality
metrics. These metrics calculate the error power within the image.
Consequently, they overlook the significance of the affected data within the
image, along with the modification in the HVS response due to this variation
in data.
To calculate the PSNR between the original and distorted images, we start by
calculating the MSE (Mean Square Error) of the pixels' grayscale values.
\[
\mathrm{MSE} = \frac{1}{F\,X\,Y} \sum_{f=1}^{F} \sum_{x=1}^{X} \sum_{y=1}^{Y} \left( I_o(f,x,y) - I_d(f,x,y) \right)^2 \qquad [3]
\]
where I_o and I_d denote the original and distorted sequences respectively.
8/8/2019 Multimedia Middle Ware
31/100
P a g e | 17
Where: the images have a width of X pixels and a height of Y pixels, and the
video sequence contains F frames.
\[
\mathrm{PSNR} = 10 \log_{10} \frac{I^2}{\mathrm{MSE}} \qquad [3]
\]
Where: I is the maximum value that a pixel can take.
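The two formulas above translate directly into code. A minimal sketch, assuming the video is stored as NumPy arrays of shape F x Y x X with 8-bit pixels (so I = 255):

```python
import numpy as np

def mse(original, distorted):
    """Mean Square Error over all frames and pixels."""
    diff = original.astype(np.float64) - distorted.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(original, distorted, i_max=255.0):
    """Peak Signal to Noise Ratio in dB; i_max is the maximum pixel value."""
    err = mse(original, distorted)
    if err == 0:
        return float("inf")  # identical sequences
    return float(10.0 * np.log10(i_max ** 2 / err))
```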
From the above we can see that the MSE defines the difference between the
two signals, while the PSNR defines the fidelity of the distorted image to the
original. In [4], the authors illustrate why error power cannot be used as a
metric for perceptual quality. They considered the following cases:
- Different types of visual error with equal power introduced to the
same image.
- Identical error introduced to different images.
In these two cases, although the errors have identical power values, the two
images may have different perceptual quality. In other words, the type of
error should be studied with respect to its effect on the HVS and the image at
hand.
3-1-3 Objective Quality Metrics
The above argument about error power based metrics led researchers to
explore and formulate a definition for the perceived quality. Some of the
metrics were designed to be generic and utilized a basic understanding of
the limitations of the HVS. The metric itself was designed to mimic the
processing done in the human eye and brain. Other metrics were more
specific and relied on prior information about the distortion process that the
multimedia object went through (for example, coding algorithms introduce
blocking artifacts).
Three types of references can be used for quality assessment: Full Reference
(FR), Reduced Reference (RR), and No Reference (NR). In FR QA (Full
Reference Quality Assessment) the original image is compared to the
reproduced image, while in RR QA only some features of the original image
are used in the comparison. NR QA refers to techniques that rely on natural
image features to decide about the quality of the image without referring to
any outside information. Obviously, FR and RR are not very suitable for
the transmission quality problem, due to the need for the original image or
some of its features at the receiver. However, FR and RR are very useful
when developing coding and transcoding techniques. These metrics are
used to judge the quality of the image where the original is already
available.
In the following sections, we are going to present a number of FR QA
metrics that have been developed by researchers in the quality assessment
field, along with the underlying definition of the perceptual quality.
3-1-3-1 USING DCT, DWT, AND DFT
The authors in [5] examined the effect of decoupling inter-pixel dependencies by
using transforms like the Discrete Cosine Transform (DCT), Discrete Wavelet
Transform (DWT), or Discrete Fourier Transform (DFT). Their study shows
that by transforming images to the frequency domain and then performing a
simple pixel difference, the resulting performance surpasses that of complex
quality measures.
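A toy illustration of this transform-then-difference principle, using the DFT via NumPy (the DCT or DWT would play the same role); the weighting and pooling details of [5] are not reproduced, so the function below is only a sketch, not the metric from that study:

```python
import numpy as np

def dft_domain_distance(original, distorted):
    """Transform both images with a 2-D DFT, then take a simple
    element-wise difference of the spectral magnitudes."""
    spec_o = np.abs(np.fft.fft2(original.astype(np.float64)))
    spec_d = np.abs(np.fft.fft2(distorted.astype(np.float64)))
    return float(np.mean(np.abs(spec_o - spec_d)))
```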
3-1-3-2 PERCEPTUAL DISTORTION METRIC (PDM)
Figure 3-1 Block diagram of the Perceptual Distortion Metric (PDM)
In [3], a generic model of the HVS is used as an objective quality assessment
metric. The block diagram of the metric is illustrated in Figure 3-1. The color
space conversion block relies on the fact that the HVS treats colors as nonlinear
color differences (white-black, red-green, and blue-yellow) rather than RGB.
The perceptual decomposition is a set of spatio-temporal filters that mimic the
nonlinearity of the neuron responses in the HVS to different spatio-temporal
patterns. The HVS sensitivity decreases at high spatial frequencies; the contrast
gain control module is used to compensate for this feature.
3-1-3-3 STRUCTURAL SIMILARITY
Figure 3-2 Block diagram of the Structural Similarity
The argument behind this metric is that the human eye is tuned to detect
structural error. Three types of error can be introduced into multimedia
objects: variation of average local luminance, variation of contrast, and
structural error. The first two do not contribute to the degradation of the
perceived quality. Thus, by removing those two error types, we can isolate
the structural error, which defines the amount of degradation in image
quality. The block diagram of the Structural Similarity (SSIM) index is
shown in Figure 3-2.
The definitions of these three error components are as follows:
Luminance error:

l(x, y) = (2 μx μy + C1) / (μx² + μy² + C1)

Contrast error:

c(x, y) = (2 σx σy + C2) / (σx² + σy² + C2)

Structure error:

s(x, y) = (σxy + C3) / (σx σy + C3)

Where:
μx: mean of image X
μy: mean of image Y
σx²: variance of image X
σy²: variance of image Y
σxy: covariance between images X and Y
C1, C2, and C3 are small constants that stabilize the divisions.
Based on the above, the authors in [4], [6], and [7] present the structural
error as the cosine of the angle between the original image vector x and the
distorted image vector y. The logic is that after removing the luminance and
contrast errors, the remaining error vectors can be pictured as lying on a
circle: all have the same error power, but the angle of each determines its
effect on the perceived quality.
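The three terms above can be sketched directly from their definitions. This is a minimal global (whole-image) version; practical SSIM uses local sliding windows, and the constants C1, C2, and C3 = C2/2 used here are the common stabilizers assumed for an 8-bit range:

```python
import numpy as np

def ssim_components(x: np.ndarray, y: np.ndarray,
                    c1: float = 6.5025, c2: float = 58.5225):
    """Global luminance, contrast, and structure terms of SSIM."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    sx, sy = np.sqrt(vx), np.sqrt(vy)
    c3 = c2 / 2
    lum = (2 * mx * my + c1) / (mx**2 + my**2 + c1)     # luminance term
    con = (2 * sx * sy + c2) / (vx + vy + c2)           # contrast term
    st = (cov + c3) / (sx * sy + c3)                    # structure term
    return lum, con, st

def ssim(x, y):
    lum, con, st = ssim_components(np.asarray(x, float), np.asarray(y, float))
    return float(lum * con * st)
```

For identical images all three terms equal 1, so the product is 1; an inverted image keeps the same contrast but has negative covariance, which drives the structure term, and the score, down.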
Figure 3-3 Block diagram of the Multi-scale Structural Similarity. L: Low-pass filtering; 2: Down-sampling by 2
In [8], an improvement of the metric showed that running it on downscaled
versions of the images and combining the results is more effective at
capturing all the structural error in the image, and also compensates for
different viewing distances. A diagram of the Multi-scale SSIM is shown in
Figure 3-3.
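A minimal sketch of that multi-scale loop, assuming power-of-two image sizes, a single-window SSIM per scale, and equal per-scale weights (the published metric uses calibrated per-scale exponents):

```python
import numpy as np

def _ssim(x, y, c1=6.5025, c2=58.5225):
    # Global (single-window) SSIM, enough to illustrate the multi-scale loop.
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (x.var() + y.var() + c2))

def ms_ssim(x, y, scales=3):
    """Score at each scale, then low-pass (2x2 mean) and downsample by 2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    score = 1.0
    for _ in range(scales):
        score *= _ssim(x, y)
        # L then 2: average 2x2 neighbourhoods, keep every second sample
        x = (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4
        y = (y[0::2, 0::2] + y[1::2, 0::2] + y[0::2, 1::2] + y[1::2, 1::2]) / 4
    return float(score)
```

Each pass of the loop corresponds to one L + "2" stage of Figure 3-3, so structural errors that only become visible at coarser resolutions still affect the combined score.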
3-1-3-4 VISUAL INFORMATION FIDELITY AND NATURAL SCENE STATISTICS
Figure 3-4 Conceptual diagram of the VIF
Although at the beginning of this discussion we argued that fidelity measures
do not correlate well with perceived quality, the authors in [9-10] present a
fidelity measure that uses natural scene statistics to calculate the amount
of information conveyed correctly between the original and distorted image to
the observer. The concept is illustrated in Figure 3-4.
Natural Scene Statistics (NSS) rely on the fact that natural scenes occupy a
tiny subspace of all possible permutations of pixel values; consequently,
natural undistorted images can be described by a small number of statistical
features. Visual Information Fidelity (VIF) defines the perceived quality as
the difference in mutual information between the input and output of the HVS
for the no-distortion and distortion channels.
3-2 Subjective Experiments
Subjective experiments [11] are required for the evaluation of Video Quality
Metrics (VQMs). In these experiments, human subjects are asked to review and
rate the quality of the images in a database. The subjects are normally
screened for visual acuity and color blindness, to make sure the quality
scores describe the actually perceived quality of each image. Moreover, a
viewing session should last less than 30 minutes to reduce the effect of
fatigue on the observers.
The output of these experiments is the Differential Mean Opinion Score
(DMOS) of each image in the database. The DMOS values serve as a benchmark
for perceived quality, to be compared with the output values of objective
models when they are evaluated. Generally, the significance of an evaluation
is affected by the size of the database and the different error types it
contains.
There are a number of internationally accepted test methods for performing
subjective experiments. They are illustrated in Figure 3-5 and described
below:
3-2-1 Double Stimulus Impairment Scale (DSIS)
Human subjects review reference/test image sets, then rate the images on a
discrete scale: imperceptible, perceptible, slightly annoying, annoying, and
very annoying.
3-2-2 Double Stimulus Continuous Quality Scale (DSCQS)
In this test method, subjects are blind as to which image is the reference.
Each reference/test set is viewed twice. The images are scored on two
scales, one continuous and one discrete.
3-2-3 Single Stimulus Continuous Quality Scale (SSCQS)
This method differs from DSCQS in the number of times the reference/test
sets are viewed. It is therefore used for longer sequences (several
minutes), whereas DSCQS is only suitable for sequences of about 20-30
seconds. Furthermore, SSCQS resembles real viewing conditions more closely
than DSCQS.
Figure 3-5 Subjective Experiments: Viewing Modes (left) and Score Scale (right). (A) Double Stimulus Impairment Scale (DSIS) (B) Double Stimulus Continuous Quality Scale (DSCQS) (C) Single Stimulus Continuous Quality Scale (SSCQS)
3-3 VQEG
The Video Quality Experts Group (VQEG) was formed in 1997. Its main
objective is to validate and standardize objective quality assessment
models. Moreover, the group works toward the standardization of performance
metrics for validating the objective models. So far, the VQEG has completed
two sets of tests.
Phase I (1998): The subjective experiment used DSCQS. Nine objective
quality assessment models were evaluated. This test showed that 8 out of 9
models gave results that were indistinguishable from PSNR.
at least 20-29 human observers. The single stimulus method was used, and the
database was rated in 7 separate viewing sessions.
The fact that images were reviewed in more than one session led to a scale
mismatch in the scores given to those images. Therefore, an extra round of
review was performed using the double stimulus methodology on 50 randomly
selected images.
3-4-3 Realignment Process
The raw scores of each subject were converted to difference scores (between
the test and the reference), then to Z-scores, and finally scaled and
shifted to the full range (1 to 100). From these, a Differential Mean
Opinion Score (DMOS) value was computed for each distorted image.
For a single image, a score is considered an outlier if it lies outside a
certain interval, defined by the standard deviation, around the mean score
for that image. Such points are removed from the DMOS calculation for that
image.
A subject is rejected if the number of outliers exceeds a specified
acceptance rate; in that case, all ratings by that subject are excluded from
the final dataset.
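The realignment steps above can be sketched as follows. The interval width (two standard deviations here), the direction of the difference scores, and the omitted subject-level rejection are illustrative assumptions:

```python
import numpy as np

def realign_scores(raw, ref, outlier_k=2.0):
    """raw, ref: (subjects x images) ratings of the test images and of
    their corresponding reference images.

    Difference scores -> per-subject Z-scores -> linear rescale to
    [1, 100] -> per-image DMOS, with scores more than outlier_k
    standard deviations from the image mean dropped.
    """
    diff = np.asarray(raw, float) - np.asarray(ref, float)
    z = (diff - diff.mean(axis=1, keepdims=True)) / diff.std(axis=1, keepdims=True)
    scaled = 1 + 99 * (z - z.min()) / (z.max() - z.min())
    dmos = np.empty(scaled.shape[1])
    for i in range(scaled.shape[1]):
        col = scaled[:, i]
        keep = np.abs(col - col.mean()) <= outlier_k * col.std()
        dmos[i] = col[keep].mean()
    return dmos
```

The per-subject Z-scoring is what removes the session-to-session scale mismatch described above, since each subject's scores are re-expressed relative to their own mean and spread.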
3-4-4 Datasets
The database of images is accompanied by a number of datasets that define
the benchmark perceived-quality values for each of the 982 images in the
database.
dmos.mat: contains two arrays of length 982 each: DMOS and orgs.
o orgs(i)==0 for distorted images, and orgs(i)==1 for reference images.
o DMOS(1:227): JP2K, DMOS(228:460): JPEG, DMOS(461:634): White Noise,
DMOS(635:808): Gaussian Blur, DMOS(809:982): Fast Fading.
o The DMOS values corresponding to orgs==1 are zero (they are reference
images).
refnames_all.mat: contains a cell array refnames_all.
o refnames_all{i} is the name of the reference image for image i, whose
DMOS value is given by DMOS(i).
o If orgs(i)==0, then this is a valid DMOS entry. Else, if orgs(i)==1,
image i is a copy of the reference image.
DMOS_realigned.mat: DMOS values after realignment.
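A small helper for slicing these arrays might look like the following. The index ranges mirror the description above (converted to 0-based Python slices); in a real session the arrays would come from loading dmos.mat, and the function name is illustrative:

```python
import numpy as np

# Index ranges of the distortion types inside the 982-entry DMOS array
# (1-based ranges from the dataset description, as Python slices).
RANGES = {
    "jp2k":  (0, 227),
    "jpeg":  (227, 460),
    "wn":    (460, 634),
    "gblur": (634, 808),
    "ff":    (808, 982),
}

def distorted_dmos(dmos, orgs, kind):
    """DMOS values of the *distorted* images of one distortion type.

    orgs == 1 marks reference copies, whose DMOS entries are zero and
    are therefore excluded here.
    """
    lo, hi = RANGES[kind]
    d, o = np.asarray(dmos)[lo:hi], np.asarray(orgs)[lo:hi]
    return d[o == 0]
```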
3-5 H.264 Review
Throughout this study, the H.264 standard was used as the main compression
technique for encoding and transcoding all test sequences. In this section
we review the standard and its new features.
H.264 is the newest standard in its series, also known as ISO/IEC
International Standard 14496-10, or MPEG-4 Part 10 Advanced Video Coding.
The standard was finalized in March 2003 and approved by the ITU-T in May
2003 [14-16].
The encoder-decoder configuration is separated into two stages: the Video
Coding Layer (VCL) and the Network Abstraction Layer (NAL). Figure 3-6
shows the arrangement of both layers.
Figure 3-6 (A) Video Coding Layer (VCL) and Network Abstraction Layer (NAL) arrangement. (B) NAL unit
The VCL is responsible for efficiently coding the video frames and
delivering the coded information to be formatted by the NAL. The main aim of
the NAL is to arrange the coded information in a way that can be understood
by the receiver. All information is sent in what are known as NAL units;
these units act as packets that can be handled separately by the transport
layer for transmission, or stored in a file. Each NAL unit consists of a NAL
header, which specifies the sequencing of the information within the unit,
and the payload data.
The H.264 coding standard falls into the category of block-based
motion-compensated video compression. Figure 3-7 and Figure 3-8 show the
detailed block diagrams of the encoder and decoder.
Figure 3-7 Block diagram of H.264 Encoder
Figure 3-8 Block diagram of the H.264 Decoder
The term slice refers to a set of macroblocks, in raster order, that are
coded with the same type, i.e. I, P, B, SI, or SP. A macroblock is an area
of 16x16 pixels; it is the main building block on which processing occurs.
The slice type is defined by the type of coding applied to the macroblocks
it contains. The different slice types are:
I (Intra) slice: macroblocks are coded through prediction from macroblocks
in the same frame.
P (Predicted) slice: macroblocks are coded with reference to previously
coded frames.
B (Bi-directionally predicted) slice: macroblocks use both previous and
next frames.
SI and SP (Switching) slices: used to switch between different substreams.
The processing in the macroblock layer is divided into two categories: intra
and inter coding. In intra coding, a macroblock is predicted using only
spatial information, i.e., macroblocks from the same frame. In inter coding,
the prediction relies on temporal dependencies: an area from a previously
coded frame is copied and assigned to the macroblock currently being
encoded. The encoder then sends the motion vectors, reference frames, and
the error signal between the predicted and the current macroblock. However,
motion vectors are not sent to the receiver directly. Because motion
prediction in the encoder and decoder is identical, motion vectors are
predicted from the surrounding macroblocks, and only a displacement
(compensation) motion vector is sent to the receiver to correct the
predicted value.
Motion prediction in H.264 supports half- and quarter-pixel accuracy. The
intensity values at fractional pixel positions are determined by
interpolation:
Luma half pixel: 6-tap FIR filter.
Luma quarter pixel: averaging of half- and integer-pixel values.
Chroma: all fractional pixels are computed through averaging.
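The luma half-pel case can be made concrete: H.264 applies the 6-tap filter (1, -5, 20, 20, -5, 1), adds 16, and shifts right by 5. The one-dimensional sketch below handles borders by edge replication, which is a simplification:

```python
import numpy as np

# H.264 6-tap luma half-pel filter taps (sum = 32, hence the >> 5).
TAPS = np.array([1, -5, 20, 20, -5, 1])

def half_pel_row(row):
    """Half-pixel samples between consecutive integer pixels of one row.

    For integer pixels ...E F G H I J..., the half-pel value between
    G and H is clip((E - 5F + 20G + 20H - 5I + J + 16) >> 5).
    """
    padded = np.pad(np.asarray(row, int), (2, 3), mode="edge")
    out = []
    for i in range(len(row)):
        acc = int(np.dot(TAPS, padded[i:i + 6]))     # 6-tap FIR
        out.append(min(255, max(0, (acc + 16) >> 5)))  # round, normalize, clip
    return out
```

On a constant signal the filter reproduces the input exactly (the taps sum to 32), and on a linear ramp it lands on the midpoint between neighbours, which is the behaviour one expects from an interpolator.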
The following is a list of differences between H.264 and earlier standards:
H.264 includes a deblocking filter.
H.264 allows multiple reference frames.
H.264 introduces spatial prediction in intra frames.
H.264 uses a 4x4 integer transform instead of the former 8x8 DCT.
The standard defines a set of profiles in which H.264 can operate: baseline,
main, and extended. Each profile defines the accepted syntax and tools to be
used. The profiles are shown in Figure 3-9. In this study we have used the
Baseline profile.
Figure 3-9 H.264 profiles
H.264 is the most efficient coding algorithm with respect to bit rate
reduction, yet the most complex among its peers. In [17], the authors
performed a number of tests to analyze the complexity-distortion
relationship within H.264. They found that P frames are more efficient with
respect to distortion and complexity, but require more bit rate than
sequences containing B frames. The authors in [18] show that the processing
time of H.264 is dominated by the deblocking filter (49.01%) and fractional
pixel interpolation (19.98%).
3-6 Multimedia Transcoding
Research in multimedia transcoding falls into the following categories:
Transcoding techniques: the design of transcoding techniques that adapt the
video stream to fit fewer resources.
Transcoder analysis: the analysis of resource utilization in transcoders
and its optimization schemes.
Control schemes: controlling the selection of transcoding techniques, along
with the amount of transcoding done by each of them.
Although there is now a large number of studies on the design of transcoding
techniques, the lack of policy modules supporting transcoder implementations
has left those designs largely unused. In the following section we review
the first category, to familiarize the reader with baseline knowledge about
transcoding. The rest of the section provides a review of control schemes.
3-6-1 Transcoding Techniques
Transcoding has different types based on the kind of change induced in the
bitstream [19]:
Homogeneous: the modification of one or more of the resources required by
the bitstream. The different types of resources are shown in Figure 3-10.
Heterogeneous: the change of the bitstream syntax from one standard coding
scheme to another.
Error resilience: the injection of extra bits to increase the bitstream's
robustness to error.
Figure 3-10 Homogeneous transcoding
Transcoding techniques can also be categorized from the implementation point
of view. The simplest implementation is the back-to-back decoder-encoder
configuration, also known as the cascaded pixel domain transcoder (CPDT).
CPDT is the simplest, yet the most time-consuming, transcoder
implementation. As Figure 3-11 demonstrates, the deeper we go into the
structure of the bitstream, the higher the transcoding quality we obtain, at
the cost of transcoder complexity.
Figure 3-11 Transcoder Implementation
3-6-2 Control Schemes
In [20], the authors proposed a utility model based on maximizing utility
under a given amount of resources. The system supported neither dynamic
transcoding nor online transcoding. Three profiles were defined for each
multimedia object, namely: gold, silver, and bronze.
Figure 3-12 Utility Model
Another approach was taken by the authors in [21]: offline-transcoded
objects can be arranged in what is called an info-pyramid. The info-pyramid
is, by definition, a progressive data representation scheme. Objects stored
in the info-pyramid have different resolutions and abstraction levels:
Fidelity: the spatial and temporal resolution, using a lossy compression
technique.
Modality: the selection of either key-frame images, the audio track, or
closed captions.
When the customization and selection module receives a client request, it
assigns the object that best fits the request and sends it back to the user.
The architecture of the system is illustrated in Figure 3-13.
Figure 3-13 Info-pyramid based control scheme
On the other hand, the authors in [22] proposed a model with three
dimensions:
Device modality: display, audio, memory, CPU, and color
Network conditions: bandwidth, latency, and BER
User preferences
The dimensions and the overall system architecture are illustrated in
Figure 3-14 and Figure 3-15, respectively.
For each dimension a number of classes were defined, and offline transcoding
of the multimedia objects was performed. Storage and mapping of the
different bitstreams is done using the MPEG-7 standard. When a user's
request is received, the system chooses the most appropriate class from a
matrix of classes and sends the corresponding object to the user.
Figure 3-14 Three dimensional view
Figure 3-15 System overview
Another type of control scheme was proposed in [23]. The system operates in
real time and uses single-dimensional transcoding to fit videos to the
available bit rate. A buffer-based control scheme was used: the system
utilizes the relation between delay, buffer occupancy, and bit rate. Two
types of transcoding were used: re-quantization and frame dropping. The
number of bits required to encode a frame is estimated using information
gathered from previously encoded frames.
A control scheme can also be simplified to fit a specific application. In
[24], the authors proposed a control scheme for a map-viewing application.
The scheme is user-centric: information about the type of usage is important
in defining the amount of detail to be sent to the user. For example, a
hiker requires finer details than a car driver.
Figure 3-16 Adaptation, Resource, Utility spaces
The curves for these three spaces cannot be developed from a single video
sequence, since each video sequence can react differently to the adaptation
processes. The authors developed a system for generating utility functions
by extracting a set of features from video sequences. Those features are
then used to cluster the sequences into a number of predefined clusters that
are expected to behave in the same way with respect to the different
adaptation processes. The clusters are defined through the analysis of a set
of test sequences.
Chapter 4
Quality Assessment
4-1 Introduction
Our work on objective quality assessment was mainly driven by the need for
an objective model to be used in the policy module of the transcoding
engine. To replace the need for subjective experiments, this FR QA model
should possess the following properties:
High correlation with the output of subjective experiments.
Consistent reaction to different types of visual error and image content.
Inexpensive with respect to computation time.
These features are crucial for the metric to be used in place of human
observers. Research in quality assessment has produced different
perspectives on perceptual error. Although these definitions of perceptual
error make use of high-level image features, none of them has reached the
optimal criteria for providing the metric features described above.
In [28], the authors studied 10 state-of-the-art FR QA metrics. This
extensive evaluation shows that most of these metrics produce results worse
than, or indistinguishable from, PSNR. Although these metrics are based on
high-level visual features, they did not correlate well with the subjective
data.
In this chapter, we present our work on the formulation of an objective
metric that complies with the above criteria, along with the logic behind
its design.
4-2 Proposed Metric
Studies examining how the HVS treats received visual information found that
the HVS does not process images as luminance values but as contrast
differences. Moreover, this contrast-based response varies with viewing
distance. This led HVS-based metrics to apply a contrast sensitivity
function after decomposing the image into spatial and temporal bands.
The metric presented here builds on this fact. If a change in contrast
values is well distributed over the entire image, the HVS will not perceive
this type of error, since the relations between the contrast values are
maintained. Conversely, a contrast change with a large standard deviation
modifies the contrast relations in the image and is perceived as a
distortion.
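This intuition can be illustrated with a toy measure. Note that this is not the CED algorithm of [29]: the block-standard-deviation definition of contrast and the choice of spread statistic below are assumptions made purely for illustration:

```python
import numpy as np

def local_contrast(img, k=4):
    """Standard deviation of non-overlapping k x k blocks: a crude
    stand-in for whatever contrast definition a full metric would use."""
    img = np.asarray(img, float)
    h, w = (img.shape[0] // k) * k, (img.shape[1] // k) * k
    blocks = img[:h, :w].reshape(h // k, k, w // k, k)
    return blocks.std(axis=(1, 3))

def contrast_error_spread(ref, dist, k=4):
    """Spread of the contrast-error distribution: a uniformly shifted
    image (contrast relations preserved) scores near zero, while an
    unevenly distributed contrast change scores high."""
    err = local_contrast(ref, k) - local_contrast(dist, k)
    return float(err.std())
```

Adding a constant brightness offset leaves every block's contrast unchanged, so the spread is essentially zero; doubling the contrast of only half the image concentrates the contrast error there and yields a large spread.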
The proposed algorithm for calculating the Contrast Error Distribution
(CED) metric is as follows [29]:
Calculate the metric.
Figure 4-1 Block Diagram of the Contrast Error Distribution (CED)
4-3 Metric Evaluation Process
The metric evaluation process is not just a simple measurement of the
resemblance between DMOS values and Video Quality Ratings (VQRs). A number
of performance metrics must be applied to the VQRs to confirm that the
metric gives good results regardless of error type, image content, or the
amount of quality degradation.
In short, all of the above comply with a single definition:
generalizability. VQEG defines it as: "the ability of a model to perform
reliably over a very broad set of video content. This is obviously a
critical selection factor given the very wide variety of content found in
real applications. There is no specific metric that is specific to
generalizability so this objective testing procedure requires the selection
of as broad a set of representative test sequences as is possible." [12]
As stated above, to achieve this generalizability we have to run VQM tests
over a wide range of images and use performance tests that describe every
aspect of generalizability. For this reason, the VQEG standardized the
evaluation domains for VQMs as follows:
Prediction accuracy: the ability to predict the subjective quality ratings
with low error.
Prediction monotonicity: the degree to which the model's predictions agree
with the relative magnitudes of the subjective quality ratings.
Prediction consistency: the degree to which the model maintains prediction
accuracy over the range of video test sequences, i.e., its response is
robust to a variety of video impairments.
4-3-1 Subjective Data Rescaling
After realignment, DMOS values might take invalid values, for example
negative ones. Therefore, linear scaling is required to map the values to
the range 0 to 1, with zero being the worst perceived quality.
The scaling function is as follows:

Score = (Raw Difference Score - Minimum Value) / (Maximum Value - Minimum Value)
4-3-2 Nonlinear Regression
The relation between DMOS and VQRs is not linear, so applying performance
metrics directly to the VQM output would lead to inaccurate results. This
nonlinearity is due to the fact that subjective test results tend to be
compressed at the extremes of the test range. Consequently, a nonlinear
regression step is required to compensate for it.
We have used the 5-parameter logistic regression function of [28]:

DMOSp = β1 * logistic(β2, VQR - β3) + β4 * VQR + β5,
where logistic(τ, x) = 1/2 - 1/(1 + exp(τ * x))

The nonlinear regression converts the VQRs into predicted values DMOSp that
can then be compared to the subjective DMOS.
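A sketch of fitting such a function with SciPy follows. The parameterization matches the 5-parameter form used in [28]; the initial guess p0 and the synthetic usage below are illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic5(vqr, b1, b2, b3, b4, b5):
    # DMOSp = b1 * (1/2 - 1/(1 + exp(b2*(vqr - b3)))) + b4*vqr + b5
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (vqr - b3)))) + b4 * vqr + b5

def fit_logistic(vqr, dmos, p0):
    # Nonlinear least-squares fit of the five parameters.
    params, _ = curve_fit(logistic5, vqr, dmos, p0=p0, maxfev=20000)
    return params
```

Once fitted on the evaluation set, logistic5(vqr, *params) gives the predicted DMOS values on which the accuracy, monotonicity, and consistency statistics are then computed.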
4-3-3 Prediction Accuracy
The Pearson linear correlation coefficient:

CC = σxy / (σx σy)

Where σxy, σx, and σy are defined as follows:

σxy = (1/N) Σ xi yi - x̄ ȳ
σx² = (1/N) Σ xi² - x̄²
σy² = (1/N) Σ yi² - ȳ²
4-3-4 Prediction Monotonicity
The Spearman rank-order correlation coefficient is a measure of monotonic
association, used when the distribution of the data makes the Pearson
correlation coefficient undesirable or misleading.
ρ = 1 - (6 Σ di²) / (N (N² - 1))

where di is the difference between the ranks of the i-th data pair.
4-3-5 Prediction Consistency
Outlier ratio:

OR = No / N

Where:
No is the number of outlier points
N is the total number of data points
A point i (1 ≤ i ≤ N), with Qerror[i] = DMOS[i] - DMOSp[i], is considered an
outlier if the following condition is satisfied:

|Qerror[i]| > 2 * DMOS_std[i]

where DMOS_std[i] is the standard deviation of the subjective scores for
image i. The root mean square error (RMSE) is also considered a metric for
consistency.
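The evaluation criteria of sections 4-3-3 through 4-3-5 can be sketched with NumPy as follows (no tie handling in the rank computation, and the outlier threshold of two standard deviations as described above):

```python
import numpy as np

def pearson_cc(x, y):
    # CC = cov(x, y) / (std(x) * std(y)), via the moment formulas above.
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean(x * y) - x.mean() * y.mean()
    return cov / np.sqrt((np.mean(x**2) - x.mean()**2) *
                         (np.mean(y**2) - y.mean()**2))

def spearman_rocc(x, y):
    # Rank the data (double argsort, no ties) and apply
    # rho = 1 - 6 * sum(d_i^2) / (N * (N^2 - 1)).
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    d = (rx - ry).astype(float)
    n = len(rx)
    return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

def outlier_ratio(dmos, dmos_p, dmos_std):
    # Fraction of points where |Qerror| exceeds twice the per-image std.
    err = np.abs(np.asarray(dmos, float) - np.asarray(dmos_p, float))
    return float(np.mean(err > 2 * np.asarray(dmos_std, float)))

def rmse(dmos, dmos_p):
    d = np.asarray(dmos, float) - np.asarray(dmos_p, float)
    return float(np.sqrt(np.mean(d**2)))
```

In the evaluation of Section 4-4, each of these would be applied to the predicted DMOS values produced by the nonlinear regression step.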
4-4 Results
In the evaluation cycle we chose 6 FR QA metrics to compare:
Peak Signal-to-Noise Ratio (PSNR)
Structural Similarity (SSIM) [31]
Visual Information Fidelity (Log(VIF)) [32]
Pixel-Domain Visual Information Fidelity (VIF-PD): a less complex
implementation of VIF [33]
Contrast Error Distribution (CED) [proposed]
Contrast Error Distribution (Log(CED)) [proposed]
4-4-1 Overall Performance
The overall performance was measured by computing the Pearson correlation
coefficient, the Spearman rank-order correlation coefficient, and the root
mean square error of the 6 quality assessment metrics mentioned above. The
results are shown in Table 1; they demonstrate that CED gives results
similar to those of more sophisticated metrics such as VIF.
4-4-2 Cross-Distortion Performance
Table 2 through Table 4 show the detailed values of the above performance
metrics for each distortion domain. The tables show that CED's performance
is consistent across all distortion domains, whereas the other metrics
perform worse in the Fast Fading domain.
Table 1 Comparison between PSNR, SSIM, CED, PD-VIF, Log(CED), and Log(VIF)
with respect to CC: Pearson Correlation Coefficient, SROCC: Spearman Rank
Correlation Coefficient, RMSE: Root Mean Square Error

        PSNR     SSIM     CED (Proposed)  PD-VIF   Log(CED) (Proposed)  Log(VIF)
CC      0.8700   0.8959   0.9369          0.9326   0.9525               0.9544
SROCC   0.8755   0.9075   0.9550          0.9471   0.9550               0.9637
RMSE    13.4713  12.1396  9.9549          9.8798   8.3168               8.1708
Table 2 Pearson Correlation Coefficient of SSIM, CED, PD-VIF, Log(CED), and
Log(VIF), calculated for the distortion domains JPEG2000, JPEG, White Noise,
Gaussian Blur, and Fast Fading

                      JP2K    JPEG    WN      GBlur   FF
SSIM                  0.9311  0.9436  0.9693  0.8622  0.9271
CED (Proposed)        0.9561  0.9688  0.9325  0.9368  0.9466
PD-VIF                0.9702  0.9749  0.9717  0.9538  0.8698
Log(CED) (Proposed)   0.9598  0.9738  0.9716  0.9696  0.9635
Log(VIF)              0.9744  0.9688  0.9804  0.9707  0.9490
Table 3 Spearman Rank Correlation Coefficient of SSIM, CED, PD-VIF,
Log(CED), and Log(VIF), calculated for the distortion domains JPEG2000,
JPEG, White Noise, Gaussian Blur, and Fast Fading

                      JP2K    JPEG    WN      GBlur   FF
SSIM                  0.9331  0.9389  0.9684  0.8827  0.9380
CED (Proposed)        0.9545  0.9712  0.9719  0.9699  0.9658
PD-VIF                0.9717  0.9840  0.9872  0.9695  0.8675
Log(CED) (Proposed)   0.9545  0.9712  0.9719  0.9699  0.9658
Log(VIF)              0.9698  0.9600  0.9856  0.9734  0.9658
Table 4 Root Mean Square Error of SSIM, CED, PD-VIF, Log(CED), and Log(VIF),
calculated for the distortion domains JPEG2000, JPEG, White Noise, Gaussian
Blur, and Fast Fading

                      JP2K    JPEG     WN       GBlur   FF
SSIM                  9.2222  10.5526  6.8789   9.3565  10.6995
CED (Proposed)        7.6804  8.2344   10.4274  6.8455  9.6306
PD-VIF                6.1433  7.1296   6.6276   5.5593  14.0610
Log(CED) (Proposed)   7.0897  7.2565   6.6182   4.5263  7.6321
Log(VIF)              5.6908  7.8561   5.5314   4.4474  9.0253
4-4-3 Complexity Performance
VQEG has not yet standardized a complexity measure for VQMs. However, the
complexity of the metrics was evaluated on a Pentium M 1.86 GHz laptop,
using the time consumed in calculating the quality metric for all the
JPEG2000-distorted images (227 images). The complexity measures are shown in
Table 5.
The results show that CED provides a good tradeoff between performance and
complexity: it runs in about 1.4 seconds per image, whereas the metric with
comparable accuracy (VIF) takes about 12 seconds per image.
Table 5 Complexity Evaluation of the Quality Metrics

                            MSSIM    CED (Proposed)  PD-VIF   VIF
Total time (227 images), s  224.11   310.91          498.26   2768.4
Average time per image, s   0.99     1.37            2.2      12.2
4-4-4 Logistic Regression Performance
Figure 4-2 shows the scatter plots of the VQM outputs against DMOS values,
along with the logistic regression fit of the data. The plot for CED shows
that its VQR points are distributed evenly across the perceived quality
range.
Figure 4-3 shows the scatter plot of DMOS against the predicted DMOS
values; this plot reveals outlier points. For a metric to perform well, the
scatter points should lie near the diagonal of the graph and, moreover, be
distributed evenly across the range of perceived quality.
It can be seen from Figure 4-3 that the metrics have two empty spots, one
near the origin and the other at the far side of the graph, as highlighted
in red. The empty spot near the origin means that the zero point is
translated to a different value in the predicted DMOS. The graph for CED
shows that the empty spots have shrunk significantly, and therefore the
response of CED is improved for error figures located in those areas of the
graph.
Figure 4-4 shows the calibration curves of the 5 distortion domains of the
database used in the experiment. For a VQM's performance to be stable across
different types of distortion, the calibration curves should be
indistinguishable. In the figure, we can see that the calibration curves do
not overlie one another, but they are adjacent to each other. The points of
intersection indicate the amounts of error at which the metric reacts to
different types of error indifferently; elsewhere, the metric is more or
less sensitive to certain types of error.
Figure 4-2 Cont.
Figure 4-3 Scatter plots of predicted DMOS (VQRs after logistic regression) against DMOS values, calculated for the 6 VQMs: PSNR, SSIM, VIF, PD-VIF, CED, and Log(CED), respectively
Figure 4-4 Calibration curves for each error domain: JPEG2000 (green), JPEG (red), White Noise (blue), Gaussian Blur (magenta), Fast Fading (cyan), and all error domains (black), calculated for the 6 VQMs: PSNR, SSIM, VIF, PD-VIF, CED, and Log(CED)
Chapter 5
Data Analysis
5-1 Introduction
Nowadays, a large number of video transcoding schemes exist. These schemes
change a pre-encoded video bitstream into another that exhibits a lower bit
rate or complexity, and therefore lower quality.
Currently, the main problem in video adaptation is the management of the
process itself. More specifically, the problem lies in how to determine the
following:
The transcoding scheme to be used.
The amount of transcoding.
The problem stems from the fact that not all video sequences react in the
same way to transcoding processes. A given amount of transcoding can result
in different amounts of resource reduction in different video sequences, due
to the varied complexity of video content.
5-2 Offline Data Analysis Model
The authors in [34] put together a systematic procedure for designing video
adaptation technologies, they are as follows:
1. Identify the adequate entities for adaptation, e.g. frame, shot,
sequence of shot, etc.
2. Identify the feasible adaptation operators, e.g., de-quantization, frame
dropping, coefficient dropping, etc.
3. Develop models for measuring and estimating resource and utility
values associated with video entities undergoing identified operators.
4. Given user preferences and constraints on resource or utility, develop
strategies to find the optimal adaptation operator(s) satisfying the
constraints.
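Step 4 can be sketched as a simple constrained selection over the operator set. The sketch below is purely illustrative and is not taken from [34]: the operator names, resource costs, and utility values are hypothetical placeholders.

```python
# Sketch: choose the adaptation operator that maximizes utility (quality)
# while keeping the resource requirement within the client's constraint.
# Operator names and (resource, utility) numbers are purely illustrative.

operators = {
    "none":           {"resource": 1.00, "utility": 1.00},
    "drop_1_coeff":   {"resource": 0.94, "utility": 0.97},
    "drop_3_coeffs":  {"resource": 0.81, "utility": 0.90},
    "drop_5_coeffs":  {"resource": 0.69, "utility": 0.82},
    "frame_dropping": {"resource": 0.50, "utility": 0.70},
}

def best_operator(resource_budget):
    """Return the operator with the highest utility whose resource
    requirement fits within the budget (fraction of the original)."""
    feasible = {name: op for name, op in operators.items()
                if op["resource"] <= resource_budget}
    if not feasible:
        return None  # no operator satisfies the constraint
    return max(feasible, key=lambda name: feasible[name]["utility"])

print(best_operator(0.85))  # -> drop_3_coeffs
print(best_operator(0.60))  # -> frame_dropping
```

In a real policy module the (resource, utility) pairs would come from the models developed in step 3 rather than a hard-coded table.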
Figure 5-1 shows a conceptual diagram of the three-stage transcoding process: offline data analysis, policy module, and transcoding engine. The work in this thesis focuses mainly on the offline data analysis module. The policy module decides which transcoding algorithm to use and how much transcoding is needed. This is done by extracting features from pre-encoded videos and mapping them to a certain class. Each of the classes defined in the policy module contains information about the resource-transcoding relations. Those classes are created in the offline data analysis stage.
The main aim of the offline data analysis stage is to define the main classes of multimedia objects. Each class has its own resource-transcoding-quality graph, which contributes to the policy module decision.
Figure 5-1 Block diagram of Multimedia Middleware
The presented study relies mainly on the idea of finding key features that
would characterize the differences between video sequences. Those video
sequences usually reach the transcoding server in a pre-encoded form.
Transcoding servers should therefore identify the class of a sequence using only the information present in the coded domain.
5-3 H.264 Setup
The C++ reference implementation of the H.264 video coding algorithm [35], version JM 13.0, was used. The baseline profile was chosen for encoding the test sequences.
This profile contains the following features:
I slices: Intra-coding, only spatial prediction is allowed.
P slices: Inter-coding, forward temporal prediction.
CAVLC: Context-Adaptive Variable Length Coding
Configuration parameters for the coding algorithm:
Baseline Profile
QP=28
To be coded in IPPP
5-4 Test Sequences
The test video sequences used in this study are presented in [36]. Those video sequences are single-shot video segments. Therefore, each video sequence is encoded with the first frame as an I-frame and the rest of the frames as P-frames. The complexity of each video sequence is described in Figure 5-2.
5-5 Features
By classifying videos based on their content, video bitstreams can be grouped based on their behavior within the transcoding engine. This classification depends mainly on features extracted from the video sequences. A number of studies on transcoding control schemes have adopted the idea of classifying video content based on its complexity. However, the choice of features has been the main point of debate in this approach. In this chapter, the proposed feature analysis is presented. This analysis was done on most of the features used in the available literature [37-40]. The study conducted in this
thesis concluded that many of these features convey the same information and that some of them can be omitted from the proposed model.
Figure 5-2 Test Sequences Description
5-5-1 Feature Definitions
All feature definitions described in this section are calculated on a per-frame basis. In order to obtain a single value for each sequence, the average over all frames was computed. For the source domain features only, the averaged values were also compared against the first frame (I-frame) value.
5-5-1-1 Source Domain Features
Variance: Average variance of the luminance pixels
Pelact: Standard deviation of the luminance pixels
Pelspread: Standard deviation of Pelact
Edgeact: Magnitude of the pixel gradient
Edgespread: Standard deviation of Edgeact
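A minimal sketch of how these per-frame source domain features might be computed, assuming an 8-bit luminance plane held in a NumPy array. The gradient here uses simple finite differences, which may differ from the exact edge operator used in this work.

```python
import numpy as np

def source_features(luma):
    """Compute per-frame source domain features from a luminance plane.

    `luma` is a 2-D array of luminance pixel values. Returns the
    variance, the pixel activity (standard deviation), and the edge
    activity (mean gradient magnitude).
    """
    luma = luma.astype(np.float64)
    variance = luma.var()          # Variance feature
    pelact = luma.std()            # Pelact: std of luminance pixels
    # Edgeact: magnitude of the pixel gradient (finite differences here;
    # the thesis may use a different gradient operator).
    gy, gx = np.gradient(luma)
    edgeact = np.hypot(gx, gy).mean()
    return variance, pelact, edgeact

# Usage on a synthetic frame: a 16x16 horizontal luminance ramp.
frame = np.tile(np.arange(16, dtype=np.uint8), (16, 1))
var, pel, edge = source_features(frame)
```

Pelspread and Edgespread would then be the standard deviations of Pelact and Edgeact taken across blocks or frames.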
5-5-1-2 Resources Required
bitcount: Bit count used for coding each macroblock, accumulated over the whole frame
bitcount Y: Bit count used for coding only the Y component of the frame
ME time: Time consumed in motion estimation
SNR Y: Signal-to-noise ratio calculated on the Y frame
SNR U: Signal-to-noise ratio calculated on the U frame
SNR V: Signal-to-noise ratio calculated on the V frame
Time: Time consumed in coding
5-5-1-3 Coded Domain Features
MV magn: Motion vector magnitude (calculated only for non-static macroblocks)
MV magn var: Motion vector variance (calculated only for non-static macroblocks)
sub MV: Percentage of MVs that require subpixel interpolation (either half-pixel or quarter-pixel)
non zero MV: Percentage of non-static macroblocks
ave energy I: Average energy of AC coefficients in I-frames
ave energy P: Average energy of AC coefficients in P-frames
MV accel: Motion vector acceleration
MV dir: Motion vector change of direction
5-5-2 Analysis and Selection
Using principal component analysis (PCA) [41-42] would only help in projecting the features onto the axes with the highest covariance between features. Therefore, PCA is not suitable here, as the main purpose is to omit some features and to inspect whether source video features are important for differentiating between the video sequences or not. Principal feature analysis (PFA) [43] provides a way to do this: by clustering the features along the high-variance axes and finding the most dominant feature groups, only one feature from each dominant group needs to be chosen. First, this algorithm was used on each of the three feature domains separately
indistinguishable. The three source features selected are Ave variance, Pelspread, and Edgeact. The retained variability is equal to 99.3974%.
In Table 7, the trial on the resource features is presented. This analysis demonstrates that ME time can be used instead of encoding time without any loss of information, and that SNR can be calculated on any of the frame components (Y, U, or V) without any difference. The retained variability of this trial was 99.77155%.
In Table 8, the trial on the coded domain features is presented. The four selected features are MV magn, sub MV, Ave energy I, and Ave energy P.
The final trial is where both source and coded domain features are compared. The results of this trial are illustrated in Table 9. The retained variability for this trial is 99.9966%.
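The PFA selection procedure described above can be sketched as follows. This is a simplified illustration, not the exact implementation of [43]: the number of retained components, the clustering method, and its initialization are assumptions here.

```python
import numpy as np

def principal_feature_analysis(X, q, k, seed=0):
    """Select k representative features from data matrix X (samples x features).

    Project each feature's loadings onto the top-q principal axes, cluster
    the features with a plain k-means on those loading rows, and pick the
    feature closest to each cluster center (cf. the "distance from center"
    columns in the tables above).
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the features
    cov = np.cov(Xs, rowvar=False)
    _, vecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    A = vecs[:, -q:]                            # row i = loadings of feature i
    rng = np.random.default_rng(seed)
    centers = A[rng.choice(len(A), size=k, replace=False)]
    for _ in range(100):                        # plain k-means on the rows of A
        labels = np.argmin(((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        centers = np.array([A[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    selected = []                               # nearest feature to each center
    for j in range(k):
        members = np.where(labels == j)[0]
        if members.size:
            d = ((A[members] - centers[j]) ** 2).sum(axis=1)
            selected.append(int(members[np.argmin(d)]))
    return sorted(selected)

# Usage: features 0 and 1 are nearly duplicates, feature 2 is independent,
# so PFA keeps feature 2 plus one representative of the duplicate pair.
rng = np.random.default_rng(1)
a = rng.standard_normal(200)
b = rng.standard_normal(200)
X = np.column_stack([a, a + 0.01 * rng.standard_normal(200), b])
selected = principal_feature_analysis(X, q=2, k=2)
```

This mirrors how redundant features (e.g. encoding Time vs. ME time) collapse into one cluster, from which a single representative is kept.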
Table 6 Source Domain Features
Cluster Index Feature Distance from center
2 Ave Variance (I-frame) 0.063633
2 Ave Variance (Averaged) 0.063633
3 Pelact (I-frame) 0.0015095
3 Pelact (Averaged) 0.0019788
3 Pelspread (I-frame) 0.00086648
3 Pelspread (Averaged) 0.0013133
1 Edgeact (I-frame) 0.0045721
1 Edgeact (Averaged) 0.0045721
3 Edgespread (I-frame) 0.012588
3 Edgespread (Averaged) 0.0014647
Table 7 Resource Features
Cluster Index Feature Distance from center
3 Bitcount 0.18841
3 Bitcount Y 1.1781
2 ME Time 0.0012094
1 SNR V 0.02584
1 SNR U 0.025837
1 SNR Y 0.02599
2 Time 0.0012094
Table 8 Coded Domain Features
Cluster Index Feature Distance from center
1 MV magn 0
1 MV magn var 0
2 Sub MV 0
2 Non zero MV 0
3 Ave energy I 0
4 Ave energy P 0
1 MV accel 0
1 MV dir 0
Table 9 Final Trial
Cluster Index Feature Distance from center
2 MV magn 0.0016
2 Sub MV 0.0018
3 Ave energy I 0
1 Ave energy P 0
2 Ave variance 0.0481
2 PelSpread 0
2 Edgeact 0.0096
5-7 Transcoder Configuration
Figure 5-3 presents the architecture of the transcoding system. Videos are pre-encoded at the best supported quality, then passed through a transcoder that only decodes the NAL units into a set of VCL information. The transcoder changes some of this information in the coded domain and then re-encodes it into NAL units. The modified bitstream is then sent to the decoder at the client side to retrieve the pixel domain video sequence.
Figure 5-3 Standard Transcoder Configuration
The implementation used for the transcoder is presented in Figure 5-4. This
configuration was adopted to simplify the implementation of the transcoder.
This relies on the fact that the NAL encoder and decoder blocks are identical and can therefore be omitted.
Figure 5-4 Adopted transcoder configuration
5-8 Transcoder Setup
The implementation of the transcoder is based on the coefficient dropping transcoding scheme. This scheme has been applied to all test sequences and the same features were extracted. The details are as follows:
Transcoding parameters and amount of reduction:
Drop one coefficient (6.25% reduction)
Drop 3 coefficients (18.75% reduction)
Drop 5 coefficients (31.25% reduction)
Drop 7 coefficients (43.75% reduction)
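The coefficient dropping operation above can be sketched on a single transform block. The sketch assumes a 4x4 integer transform block, as used in the H.264 baseline profile, where each coefficient is 1/16 = 6.25% of the block; the scan used here is the standard 4x4 zigzag order.

```python
# Sketch: drop the n highest-frequency coefficients of a 4x4 transform
# block, following the standard 4x4 zigzag scan order. Dropping 1 of the
# 16 coefficients corresponds to the 6.25% reduction step above.

ZIGZAG_4x4 = [(0, 0), (0, 1), (1, 0), (2, 0),
              (1, 1), (0, 2), (0, 3), (1, 2),
              (2, 1), (3, 0), (3, 1), (2, 2),
              (1, 3), (2, 3), (3, 2), (3, 3)]

def drop_coefficients(block, n):
    """Zero out the last n coefficients (highest frequencies) in zigzag order."""
    out = [row[:] for row in block]       # copy the 4x4 block
    for r, c in ZIGZAG_4x4[16 - n:]:
        out[r][c] = 0
    return out

# Hypothetical coefficient block; dropping 3 coefficients = 18.75% step.
block = [[52, 12, 3, 1],
         [ 9,  4, 2, 1],
         [ 3,  2, 1, 1],
         [ 1,  1, 1, 1]]
reduced = drop_coefficients(block, 3)
```

Because the dropped coefficients are the highest frequencies in scan order, the operation trades fine detail for bit rate, which is why the reduction steps listed above map directly onto coefficient counts.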
In this experiment we used the features selected by the feature analysis discussed in the previous section. Those features are as follows:
Bitcount
ME time
SNR Y
Sub MV
Ave Energy I
Ave Energy P
MV Magn
Figure 5-5 shows the bit rate relations between the different bitstreams and transcoding parameters. The bit rate values are normalized using the zscore function in MATLAB. This function is defined as:

z = (V - mean(V)) / std(V)

where V is a column vector of D.
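The same normalization is easy to reproduce outside MATLAB; a minimal Python equivalent of zscore, applied column-wise (note that MATLAB's zscore uses the sample standard deviation, i.e. ddof=1):

```python
import numpy as np

def zscore(D):
    """Column-wise z-score normalization, matching MATLAB's zscore:
    each column V of D is replaced by (V - mean(V)) / std(V),
    using the sample standard deviation (ddof=1)."""
    D = np.asarray(D, dtype=np.float64)
    return (D - D.mean(axis=0)) / D.std(axis=0, ddof=1)

D = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Z = zscore(D)  # each column now has mean 0 and sample standard deviation 1
```

This puts bit rates measured at very different scales onto a common scale before clustering.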
Figure 5-6 Dendrogram of the generated clusters
Figure 5-7 Normalized Bitrate after adding the no transcoding values
The cluster analysis done in this study was able to predict the reaction of the test videos to the transcoding process. The dendrogram shows the presence of two clusters in the test sequences: one where the videos' bit rate without transcoding is higher than the transcoded bit rate, and a second where the videos' bit rate without transcoding is lower than some of the transcoded bit rates. Those two clusters are marked in the bit rate graph in Figure 5-7.
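The clustering step behind such a dendrogram can be sketched with a simple agglomerative procedure on the normalized feature vectors. This is a minimal single-linkage implementation; the actual dendrogram was presumably produced with MATLAB's clustering tools, so the linkage rule and Euclidean metric are assumptions here.

```python
import numpy as np

def two_clusters_single_linkage(X):
    """Agglomeratively merge points (single linkage, Euclidean distance)
    until exactly two clusters remain; return a cluster label per point."""
    n = len(X)
    clusters = [{i} for i in range(n)]
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    while len(clusters) > 2:
        # find the pair of clusters with the smallest single-linkage distance
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] |= clusters[b]
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels

# Usage: two well-separated groups of hypothetical normalized feature vectors.
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
              [5.0, 5.1], [5.1, 5.0]])
labels = two_clusters_single_linkage(X)
```

Cutting the hierarchy at two clusters corresponds to reading the dendrogram at its top split, as done for the two bit rate behaviors above.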
Chapter 6
Conclusion and Future Work

6-1 Conclusion
Research in multimedia transcoding has become an essential part of the field of multimedia communications. This is due to the fact that users are turning to multimedia as a key source of information. In reality, most of those users are using devices or networks that cannot yet handle the large amount of resources required for the transmission of multimedia objects. Multimedia middleware servers perform the required transcoding to allow video sequences to be transferred over these networks and devices seamlessly, without any intervention from the user's side. Such a system requires a thorough understanding of the video characteristics, device capabilities, and network resources. The overall objective of this structure is to provide users with exactly the right amount of information, excluding the possibility of requiring more resources than needed.
A large number of transcoding techniques have been developed in the available literature. Those techniques can alter a video sequence through the modification of one or more of its parameters. This leads to a variety of potential transcoded objects that can be transferred to the user. Currently, a management scheme for providing video sequences that best fit the requirements of the client devices and networks continues to be a challenge. The management system for multimedia content adaptation should be capable of providing an efficient use of resources on the client side while keeping the response time to client requests minimal. The concept adopted in this thesis for the implementation of the transcoding system relies mainly on the study of the video content while providing different transcoding plans for different content types.
The transcoding cycle starts with an offline analysis stage that clusters the multimedia objects into categories based on their characteristics. This analysis predicts the behavior of multimedia objects with respect to the transcoding techniques. Next, the best transcoding plan is chosen. This requires the presence of a quality assessment metric to evaluate the result and guarantee the transmission of the best option available given the resources on hand.
In our study we have explored those two points. The work done in this thesis will help toward the implementation of the transcoding server, and more specifically of the policy module in that server.
First, we examined the quality assessment methods in order to define a valid approach to computing the amount of degradation in object quality. We have defined the Contrast Error Distribution (CED)
metric, which provides a good tradeoff between performance and complexity. This makes it suitable for use in transcoders, where real-time response is valued greatly.
The results showed that CED is consistent across different error domains and visual content. This characteristic allows it to be used in the loopback analysis cycle, where both time and generalizability matter most. The proposed metric defines the perceived quality using a simple mathematical model deduced from common knowledge about the HVS. All previously available studies of FR QA models showed that, for a metric to perform well, it has to be based on a complex analysis of the image. However, the CED overcomes this weak point: it showed performance as high as that of the complex metrics and, at the same time, very low computational time.
Secondly, we ran an analytical study of the type of features to be included in
the offline analysis of videos. This study led to a set of features that can be
used in classifying and predicting the behavior of the video with respect to
the change in transcoding parameters.
The analysis showed that pixel domain features can be omitted. This is an important fact, as all the videos in the content servers will be in a pre-encoded form, and therefore the pixel domain features will not be available for use in the transcoding server. As a result, the offline analysis will not require any external information other than the pre-encoded video sequence.
In our study, we ran some preliminary experiments which showed that, using the selected features, a clustering system is able to predict the behavior of a set of video sequences.
6-2 Future Work
The contributions discussed so far have examined the implementation of the offline data analysis and the quality assessment metric. We have examined those two segments of the transcoding server separately. Consequently, the next step would be to integrate both of the proposed structures into the implementation of a transcoding server to validate the whole theory.
Moreover, we need to expand the analysis done in this thesis to include the
following:
Expand the evaluation process of the CED to include a database that contains compound error components instead of a single error component.
Change the CED to use 16x16 windows instead of 8x8, and apply it on DCT coefficients instead of luminance values.
Build a transcoding server that would use multiple transcoding techniques, and validate the ability of the clustering algorithm to detect the most significant clusters.
Criterion for Image Quality Assessment Using Natural Scene Statistics," IEEE Transactions on Image Processing, vol. 14, no. 12, 2005.
[11] H.R. Wu, Digital Video Image Quality and Perceptual Coding. CRC Press, 2005. ISBN 978-1420027822.
[12] Philip Corriveau, Arthur Webster. (2003) VQEG F