8/8/2019 Multimedia Middle Ware
1/100
Helwan University
Faculty of Engineering
Department of Electronics, Communications, and Computers
MULTIMEDIA MIDDLEWARE
by
Nora Abdel Gaffar Naguib El-Morsy
B.Sc. in Telecommunication Engineering, 2005
A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science in Telecommunications Engineering
Supervised by:
Prof. Mohamed I. El-Adawy, Faculty of Engineering, Helwan University
Dr. Hesham A. Keshk, Faculty of Engineering, Helwan University
Dr. Ahmed E. Hussein, Faculty of Engineering, Helwan University
2010
ACKNOWLEDGEMENT
It is a pleasure to thank those who made this thesis possible. I would like to
express my gratitude to Prof. Mohamed I. El-Adawy for his constant support
and most valuable advice. I would like to thank the rest of the supervisory
committee for all their help, and Dr. Ahmed E. Hussein for the suggestion of
reference titles.
I would also like to thank my family for the support they provided me
through my entire life and in particular, I really cannot express my full
gratitude to my brother Yasser Naguib who patiently proofread this entire
thesis. Special thanks go to my brother Wael Naguib without whose
motivation and encouragement I would not have considered a post graduate
degree. Above all, to my mother who stood beside me all the time.
Lastly, I offer my regards to all of those who supported me in any respect
during the completion of the project.
I dedicate this thesis to My Mother
PUBLICATIONS
Nora A. Naguib, Ahmed E. Hussein, Hesham A. Keshk, and Mohamed I. El-Adawy, "Contrast Error Distribution Measurement for Full Reference Image Quality Assessment," The 18th International Conference on Computer Theory and Applications, 2008, Alexandria, Egypt.

Nora A. Naguib, Ahmed E. Hussein, Hesham A. Keshk, and Mohamed I. El-Adawy, "Using PFA in Feature Analysis and Selection for H.264 Adaptation," World Academy of Science, Engineering and Technology, Volume 54, June 2009, Paris, France, ISSN: 2070-3724.
ABSTRACT
In today's world, users have heterogeneous devices connected to a mesh of networks, each
with different capabilities and restrictions. Multimedia content providers need innovative
approaches: keeping not only one version of each video, but also having the capability to
offer different bitstreams for a variety of client capabilities. The previously used
"one size fits all" design cannot apply in the diverse environments present today. A single
bitstream with static parameters cannot satisfy the diversity present on the client side. This is
why researchers in Universal Multimedia Access (UMA) are working on the development
of new techniques for coding multimedia objects with maximum compression efficiency,
along with flexibility in the parameters of the provided video when dealing with client devices.
The transcoding of multimedia objects requires the presence of intermediate systems that are
capable of altering the bitstream on demand. Those systems should be capable of
manipulating different bitstream formats. A large number of adaptation techniques exist
in today's literature, each specialized in altering the video bitstream with respect to only one
dimension, namely temporal (frame rate), spatial (resolution), Signal to Noise Ratio (SNR),
or format conversion. In the real world, adaptation of video sequences should take the form of
multi-dimensional adaptation, allowing the system to apply a combination of reduction processes
to different parameters of the video sequence while providing the best possible quality.
In this thesis, we have focused on the transcoder policy module. While most of the previous
studies in multimedia transcoding focused on the transcoding techniques themselves, the lack
of a control algorithm rendered those techniques useless. The study was directed toward the
creation of an offline data analysis model for the transcoder's policy module.
The results and analysis provided in this thesis help toward the creation of a policy module
that controls the transcoder operation for universal multimedia access.
KEYWORDS: Multimedia Transcoding, Objective Quality Assessment, Universal
Multimedia Access.
4-3-3 Prediction Accuracy
4-3-4 Prediction Monotonicity
4-3-5 Prediction Consistency
4-4 Results
4-4-1 Overall Performance
4-4-2 Cross-Distortion Performance
4-4-3 Logistic Regression Performance
4-4-4 Complexity Performance
Data Analysis
5-1 Introduction
5-2 Offline Data Analysis Model
5-3 H.264 Setup
5-4 Test Sequences
5-5 Features
5-5-1 Feature Definitions
5-5-1-1 Source Domain Features
5-5-1-2 Resources Required
5-5-1-3 Coded Domain Features
5-5-2 Analysis and Selection
5-6 Results
5-7 Transcoder Configuration
5-8 Transcoder Setup
5-9 Clustering
Conclusion and Future Work
6-1 Conclusion
6-2 Future Work
Bibliography
LIST OF FIGURES
Figure 1-1 Multimedia Middleware
Figure 2-1 Multimedia Communications Study Areas (2001 ITU-T)
Figure 2-2 General Architecture of Coding Algorithms
Figure 2-3 Scalable Bitstreams
Figure 3-1 Block Diagram of the Perceptual Distortion Metric (PDM)
Figure 3-2 Block Diagram of the Structural Similarity
Figure 3-3 Block Diagram of the Multi-Scale Structural Similarity. L: low-pass filtering; 2: down-sampling by 2
Figure 3-4 Conceptual Diagram of the VIF
Figure 3-5 Subjective Experiments: viewing modes (on the left), score scale (on the right). (a) Double Stimulus Impairment Scale (DSIS) (b) Double Stimulus Continuous Quality Scale (DSCQS) (c) Single Stimulus Continuous Quality Scale (SSCQS)
Figure 3-6 (a) Video Coding Layer (VCL) and Network Abstraction Layer (NAL) arrangement. (b) NAL unit
Figure 3-7 Block Diagram of the H.264 Encoder
Figure 3-8 Block Diagram of the H.264 Decoder
Figure 3-9 H.264 Profiles
Figure 3-10 Homogeneous Transcoding
Figure 3-11 Transcoder Implementation
Figure 3-12 Utility Model
Figure 3-13 Info-Pyramid Based Control Scheme
Figure 3-14 Three-Dimensional View
Figure 3-15 System Overview
Figure 3-16 Adaptation, Resource, and Utility Spaces
Figure 4-1 Block Diagram of the Contrast Error Distribution (CED)
Figure 4-2 Scatter plot of VQRs against DMOS values (blue), and nonlinear logistic fitting curve (black). Calculated for 6 VQMs: PSNR, SSIM, VIF, PD-VIF, CED, log(CED) respectively
Figure 4-3 Scatter plot of predicted DMOS (VQRs after logistic regression) against DMOS values. Calculated for 6 VQMs: PSNR, SSIM, VIF, PD-VIF, CED, log(CED) respectively
Figure 4-4 Calibration curves for each error domain: JPEG2K (green), JPEG (red), White Noise (blue), Gaussian Blur (magenta), Fast Fading (cyan), and all error domains (black). Calculated for 6 VQMs: PSNR, SSIM, VIF, PD-VIF, CED, log(CED)
Figure 5-1 Block Diagram of Multimedia Middleware
Figure 5-2 Test Sequences Description
Figure 5-3 Standard Transcoder Configuration
Figure 5-4 Adopted Transcoder Configuration
Figure 5-5 Normalized bitrate against different transcoding parameters for all the test sequences
Figure 5-6 Dendrogram of the generated clusters
Figure 5-7 Normalized bitrate after adding the no-transcoding values
LIST OF TABLES
Table 1 Comparison between the PSNR, SSIM, CED, PD-VIF, log(CED), and log(VIF) with respect to CC: Pearson Correlation Coefficient, SROCC: Spearman Rank Order Correlation Coefficient, RMSE: Root Mean Square Error
Table 2 Pearson Correlation Coefficient of the SSIM, CED, PD-VIF, log(CED), log(VIF). Calculated for the distortion domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading
Table 3 Spearman Rank Correlation Coefficient of the SSIM, CED, PD-VIF, log(CED), log(VIF). Calculated for the distortion domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading
Table 4 Root Mean Square Error of the SSIM, CED, PD-VIF, log(CED), log(VIF). Calculated for the distortion domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading
Table 5 Evaluation of the Quality Metrics
Table 6 Source Domain Features
Table 7 Resource Features
Table 8 Coded Domain Features
Table 9 Final Trial
ACRONYMS
ARU Adaptation / Resource / Utility
CED Contrast Error Distribution
CPDT Cascaded Pixel Domain Transcoder
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
DMOS Differential Mean Opinion Score
DSCQS Double Stimulus Continuous Quality Scale
DSIS Double Stimulus Impairment Scale
DWT Discrete Wavelet Transform
FIR Finite Impulse Response
FR QA Full Reference Quality Assessment
HVS Human Visual System
ISO/IEC International Organization for Standardization /
International Electrotechnical Commission
IT Information Technology
ITU-R International Telecommunication Union
Radio Communication
ITU-T International Telecommunication Union
Telecommunications
MM FSA MultiMedia Framework Study Areas
MPEG Moving Picture Experts Group
MSE Mean Square Error
NAL Network Abstraction Layer
NR QA No Reference Quality Assessment
NSS Natural Scene Statistics
PCA Principal Component Analysis
PDM Perceptual Distortion Metric
PFA Principal Feature Analysis
PSNR Peak Signal to Noise Ratio
QoE Quality of Experience
RR QA Reduced Reference Quality Assessment
SDOs Standards Development Organizations
SG Study Group
SNR Signal to Noise Ratio
SSCQS Single Stimulus Continuous Quality Scale
SSIM Structural Similarity
UMA Universal Multimedia Access
VCL Video Coding Layer
VIF Visual Information Fidelity
VQEG Video Quality Experts Group
VQM Video Quality Metric
VQR Video Quality Rating
C h a p t e r 1
I n t r o d u c t i o n
1.
1-1 Motivation
Multimedia plays an important role in our lives. Terms have been
introduced to industry, culture, and leisure that depend solely on the
evolution of the Multimedia Communications field. Working with a
team member overseas through your laptop would never have been possible
if it were not for video conferencing capabilities. The term "webinar" was not
used until a few years ago, when it was found that a web-based seminar would
be more effective in reaching its entire target audience regardless of distance.
Multimedia objects can be described as the most demanding objects
transferred between networks, where the Quality of Experience (QoE) [1] is
the most important thing. The slightest delay or error would heavily affect the
quality and render the multimedia object useless. This, however, does not
change the fact that multimedia is the most popular type of data on the
internet.
The growth in the number of users with access to the internet, along with the
tremendous increase in their network capabilities and mobility, has made way
for an increase in the amount of data accessed and uploaded through the
internet. At least 70% of this data consists of multimedia objects, and those
users spend more than 20% of their time away from their primary workplace.
For a relatively long time now, we have been used to having two types of
networks available to us: telecommunications and IT (Information Technology)
networks. Though we have interconnections between them, we have not yet
reached the combination of the two. To achieve this merge, the ITU-T
(International Telecommunication Union - Telecommunication) is working
on the standardization of what are called Next Generation Networks.
The work of Study Group 16 is focused on providing guidelines for a
"Network of Networks" that unifies the viewpoints of end users, standards
committees, and telecommunication and IT providers. This will allow the
convergence of all services under the umbrella of one network, and the
cooperation of content providers and network service providers to serve end
users better.
This advancement in telecommunications networks and device
interoperability has increased the importance of multimedia objects.
Multimedia communication is expected to dominate the field of
communications in the next 10 years. This makes it crucial for us to
tackle the problem of exchanging multimedia objects seamlessly in these
changing environments. The research presented in this thesis is an attempt to
examine some of the open issues in the field of multimedia communications.
1-2 Problem Statement
Multimedia middleware consists of intermediate systems between the client
and the content server that provide a number of complementary services. The
generalized block diagram of multimedia middleware is illustrated in Figure
1-1. Those servers are used to transcode multimedia objects before delivery
to client devices. This transcoding helps in situations where we do not
want to exhaust network resources or device processing power when users
are just reviewing multimedia objects to select one, or when the client device
does not have a high screen resolution.
Figure 1-1 Multimedia Middleware
Transcoding can be done with respect to numerous domains, none of which
will result in the same combination of resources. The transcoding middleware
should be able to evaluate the client request, analyze the content of the
requested multimedia object, choose a transcoding scheme, then transcode
and deliver it to the user. This middleware server will need to fit within the
existing system and be transparent to both content server and client.
A multimedia middleware should possess the following qualities in order to
be transparent to the client side:
- When adding a new multimedia object to the content server, the time
required for the transcoding server to analyze the content of the
video should be minimized.
- The time from the reception of a client request till the delivery of the
content back to the user should be minimized.
- The transcoding server should not require the presence of any pixel
domain information in any of its processes.
- The server should have the means to assess the quality of the
generated version of the multimedia object and choose between
different transcoding schemes.
The above qualities provide a roadmap for the implementation of transcoding
servers. However, for those servers to function properly, a set of offline data
analysis studies for multimedia objects should be done. In the available
literature, a number of studies have worked on this point, but none has reached
the optimal criteria satisfying the above stated qualities. Our work in
multimedia middleware is focused toward the implementation of the
transcoder policy module. We have divided the analysis into two points: a
quality assessment model has been developed for use in offline data
analysis, along with an overall feature analysis for the selection of transcoding
schemes.
1-3 Objectives and contributions
The middleware server request cycle consists of the following:
- Data analysis of the pre-encoded video stream.
- Policy module: choosing a transcoding scheme that best fits the
client requirements and has the best quality of all possible solutions.
- Transcoding the video stream.
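As a rough illustration, the three stages of this request cycle can be sketched as follows; all names, the feature set, and the fixed scheme list are hypothetical placeholders, not the actual modules developed in this thesis:

```python
def analyze_stream(stream):
    """Stage 1: extract coded-domain features of the pre-encoded stream
    (here just the bitrate; the real analysis is far richer)."""
    return {"bitrate_kbps": stream["bitrate_kbps"]}

def choose_scheme(features, request):
    """Stage 2 (policy module): pick the scheme that satisfies the client's
    bandwidth constraint with the least reduction, as a stand-in for quality."""
    # Hypothetical schemes, ordered from highest to lowest quality
    schemes = [
        {"name": "none", "scale": 1.0},          # no transcoding
        {"name": "snr", "scale": 0.5},           # SNR (bitrate) reduction
        {"name": "spatial+snr", "scale": 0.25},  # resolution + bitrate reduction
    ]
    for scheme in schemes:
        if features["bitrate_kbps"] * scheme["scale"] <= request["max_kbps"]:
            return scheme
    return schemes[-1]  # fall back to the strongest reduction

def handle_request(request, stream):
    """One pass through the three-stage request cycle."""
    features = analyze_stream(stream)
    scheme = choose_scheme(features, request)
    # Stage 3 would transcode here; we only report the chosen scheme
    return {"scheme": scheme["name"],
            "bitrate_kbps": features["bitrate_kbps"] * scheme["scale"]}
```

For example, a 1000 kbps stream requested by a client limited to 600 kbps would be served through the SNR-reduction scheme at 500 kbps.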
The objective of this research is to examine the first two stages. This work
will help toward the practical implementation of the middleware server
control module. The contributions of this research are concentrated in the
following:
- Discovering the features that best serve in clustering multimedia
objects and provide a means of predicting the way those objects would
react to different transcoding schemes.
- Developing a new quality assessment metric for the evaluation and
the choice of the best available transcoding scheme.
1-4 Thesis Outline
This thesis is organized as follows: chapter 2 introduces some of the
multimedia communications concepts used in the discussion presented in this
thesis, chapter 3 provides a review of the related literature, chapter 4
introduces the proposed objective quality assessment model along with the
evaluation of its performance, chapter 5 presents the offline data analysis and
the feature analysis for the implementation of the transcoder policy module,
and chapter 6 presents the conclusion and future work.
C h a p t e r 2
M u l t i m e d i a   C o m m u n i c a t i o n s   B a s i c s
2.
2-1 ITU-T MediaCom2004 project
The advances in multimedia communications depend not only on fields
that study multimedia objects but also on the development of underlying
networks and services that will allow the integration of complex multimedia
objects in resource-limited networks, taking into consideration the quality
received by end users.
ITU-T SG16, the lead Study Group for Multimedia, is working on the
MEDIACOM 2004 (Multimedia Communication 2004) project [2]. The objective of
the MEDIACOM 2004 project is to establish a framework for multimedia
standardization for use both inside and outside the ITU. This framework
will support the harmonized and coordinated development of global
multimedia communication standards across all ITU-T and ITU-R Study
Groups, in close cooperation with other regional and international
standards development organizations (SDOs).
Figure 2-1 presents the Multimedia framework study areas (MM FSA) as
defined by the Mediacom project.
Figure 2-1 Multimedia Communications Study areas (2001 ITU-T)
2-2 MPEG-7 and MPEG-21
Another important segment of research is the semantic annotation of
multimedia content. This annotation provides a bigger-picture view of the
overall information that resides in a webpage. As a result, the content of this
webpage can be classified based on its importance and then delivered.
MPEG-7 and MPEG-21 are two standards developed by the Moving
Picture Experts Group (MPEG) in 2003. Those standards are not intended
for the coding of multimedia objects as the preceding standards were. Instead,
they aim at integration with the other coding algorithms to allow the
transmission of user preference and context information back and forth
between clients and content servers.
2-3 Coding Standards
Multimedia objects are known to contain a large amount of correlated data.
Coding algorithms are designed to decouple these associations in both the
temporal and spatial dimensions, thereby achieving a high compression rate
without losing valuable information. Figure 2-2 illustrates the main
components of coding algorithms.
Figure 2-2 General Architecture of Coding Algorithms
MPEG-4 and H.264 are the newest standards for multimedia coding
developed by the MPEG. They both rely on the same coding principles but
with significantly different visions: MPEG-4 is mainly concerned with
flexibility, whereas H.264 features efficient compression and reliability.
As stated above, the difference between the two standards does not reside in
the theory of the compression module itself, but in how the input is treated.
In MPEG-4, the input of the compression module is a series of multimedia
objects that are contained in video frames, whereas H.264 uses frame-based
compression.
2-4 Transcoding vs. Scalable Coding
Scalable video encoding is the coding of video streams to contain a number
of substreams that can be decoded separately. The bitstream structure is
shown in Figure 2-3. First comes a base substream containing the most basic
information, which allows client devices to render the video with the lowest
obtainable quality. This is usually the case for mobile devices where the
client is connected to a low-bandwidth network. That base substream is
followed by a series of enhancement layers that can be downloaded on
demand; this is usually the case when the client can afford more resources
to increase the quality of the received video.
Figure 2-3 Scalable Bitstreams
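The base-plus-enhancement structure lends itself to a simple on-demand selection rule. The sketch below is only illustrative (the function and its inputs are hypothetical) and assumes the bitrate of each layer is known:

```python
def decodable_layers(layer_rates_kbps, available_kbps):
    """Count how many layers (the base layer first, then enhancement
    layers in order) the client can receive within its bandwidth."""
    total = 0.0
    count = 0
    for rate in layer_rates_kbps:
        total += rate
        if total > available_kbps:
            break  # this layer no longer fits the bandwidth budget
        count += 1
    return count
```

For example, a client on a 550 kbps link offered layers of 300, 200, and 200 kbps would fetch the base layer and one enhancement layer.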
On the other hand, transcoding can be achieved by the presence of
intermediate systems (Multimedia Middleware) between server and client. On
these subsystems the video is re-encoded upon receiving client requests. Those requests will contain the characteristics of the client device along with
the available network resources. In this thesis, the terms transcoding and
adaptation will be used interchangeably.
The most basic form of a transcoder is a back-to-back encoder-decoder
configuration. However, this configuration requires heavy processing power
on the intermediate system. Another form is based on partially decoding the
stream and manipulating the data in its pre-coded form without referring to
the pixel domain data. Those transcoding systems exploit dependencies
between coded domain and pixel domain information, along with a full
understanding of the coding scheme itself.
Scalable coding and transcoding are the two coexisting lines of UMA
research, where each has its advantages and limitations. Scalable coding has
the advantage of processing videos in advance; therefore, it does not require
any intermediate system. However, it means that the video bitstream
resource/quality degradation can be done only in predefined steps, and
therefore it does not comply with the exact client requirements.
In other words, scalable coding leaves an error margin between the provided
bitstream and the requested resource/quality, while transcoding tailors
video bitstreams to the exact device/network requirements provided by the
client requests.
Two other limitations involved in the practical implementation of scalable
coding are as follows:
- The decoder's compliance with the scalable coding format: non-compliant
decoders will only decode the base layer of the bitstream, yielding
low-quality video on clients that can support higher quality.
- The enormous number of single-layer video bitstreams available on
today's networks: in order to accommodate scalable coding techniques,
transcoding would be required for all existing videos.
2-5 Quality Assessment
Quality assessment is an important step in the transcoding/adaptation process.
In a proxy/middleware, the choice of the transcoding dimension and the exact
parameter depends on the quality produced. Although meeting client
requests and resources is the steering wheel of the transcoding middleware,
the QoE on the client side is what this whole system is about.
During the assessment of a reduced bitstream, we should bear in mind that the
quality measurement of multimedia objects is not defined as the fidelity of the
new bitstream to the original. Quality, when it comes to multimedia objects, is
defined as the perceived quality, which means that some errors are more
important than others. The perceived quality is related to the limitations within
the Human Visual System (HVS), where some errors are neutral while others
are severely perceived by it.
Peak Signal to Noise Ratio (PSNR) is considered to be the most recognized
quality metric. This metric calculates the error power within the image.
Consequently, it overlooks the significance of the affected data within the
image, along with the modification in the HVS response due to this variation in data.
The degree to which the alteration of a video bitstream has affected the
perceived quality can be calculated by either subjective experiments or
objective quality metrics. Subjective experiments refer to the viewing of videos
by human observers, where each observer rates the video quality and then a
mean opinion score is calculated for the video. Objective quality metrics
measure the degradation of visual perceptual quality by defining a criterion
for describing the perceptual error.
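The subjective procedure just described reduces to a simple computation: average the observers' ratings into a mean opinion score, and form a differential score against the reference version. This is only a minimal sketch; the exact normalization of differential scores varies between studies:

```python
def mean_opinion_score(ratings):
    """Average of the observers' ratings for one video."""
    return sum(ratings) / len(ratings)

def differential_mos(reference_ratings, distorted_ratings):
    """Drop in mean opinion between the reference video and the
    distorted version (normalization details vary between studies)."""
    return mean_opinion_score(reference_ratings) - mean_opinion_score(distorted_ratings)
```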
on the screen with respect to the original video stream. This clarifies why any
fidelity measure such as the SNR would fail to describe the opinion of the
observer.
Although the HVS is a complex system, it is limited when it comes to error
perception. These limitations are the reason why an error with less power
might contribute in a much more severe way to the degradation of image quality.
Up until now, subjective experiments have been used for the assessment of
multimedia quality. However, those experiments are impractical, expensive,
and time consuming. Hence, they cannot be used for estimating the quality of
multimedia objects during their reproduction. Researchers in the field of
multimedia quality assessment are working on the development of objective
metrics that can predict the observer's opinion about the quality of
multimedia objects.
3-1-2 Simple Quality Metrics
Simple error power models are considered to be the most recognized quality
metrics. These metrics calculate the error power within the image.
Consequently, they overlook the significance of the affected data within the
image, along with the modification in the HVS response due to this variation
in data.
To calculate the PSNR between the original and distorted images, we start by
calculating the MSE (Mean Square Error) of the pixels' grayscale values.
\[
\mathrm{MSE} = \frac{1}{F\,X\,Y} \sum_{f=1}^{F} \sum_{x=1}^{X} \sum_{y=1}^{Y} \left( I_o(f,x,y) - I_d(f,x,y) \right)^2 \qquad [3]
\]
where I_o and I_d denote the original and distorted sequences respectively.
8/8/2019 Multimedia Middle Ware
31/100
P a g e | 17
Where: the images have a width of X pixels and a height of Y pixels, and the
video sequence contains F frames.
\[
\mathrm{PSNR} = 10 \log_{10} \frac{I^2}{\mathrm{MSE}} \qquad [3]
\]
Where: I is the maximum value that a pixel can take.
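The two formulas above translate directly into code. A minimal sketch, assuming the video is stored as NumPy arrays of shape F x Y x X with 8-bit pixels (so I = 255):

```python
import numpy as np

def mse(original, distorted):
    """Mean Square Error over all frames and pixels."""
    diff = original.astype(np.float64) - distorted.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(original, distorted, i_max=255.0):
    """Peak Signal to Noise Ratio in dB; i_max is the maximum pixel value."""
    err = mse(original, distorted)
    if err == 0:
        return float("inf")  # identical sequences
    return float(10.0 * np.log10(i_max ** 2 / err))
```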
From the above we can see that the MSE defines the difference between the
two signals, while the PSNR defines the fidelity of the distorted image to the
original. In [4], the authors illustrate why error power cannot be used as a
metric for perceptual quality. They considered the following cases:
- Different types of visual error with equal power introduced to the
same image.
- Identical error introduced to different images.
In these two cases, although the errors have identical power values, the two
images may have different perceptual quality. In other words, the type of
error should be studied with respect to its effect on the HVS and the image at
hand.
3-1-3 Objective Quality Metrics
The above argument about error power based metrics led researchers to
explore and formulate a definition for the perceived quality. Some of the
metrics were designed to be generic and utilized a basic understanding of
the limitations of the HVS. The metric itself was designed to mimic the
processing done in the human eye and brain. Other metrics were more
specific and relied on prior information about the distortion process that the
multimedia object went through (for example, coding algorithms introduce
blocking artifacts).
Three types of references can be used for quality assessment: Full Reference
(FR), Reduced Reference (RR), and No Reference (NR). In FR QA (Full
Reference Quality Assessment) the original image is compared to the
reproduced image, while in RR QA only some features of the original image
are used in the comparison. NR QA refers to techniques that rely on natural
image features to decide about the quality of the image without referring to
any outside information. Obviously, FR and RR are not very suitable for
the transmission quality problem, due to the need for the original image or
some of its features at the receiver. However, FR and RR are very useful
when developing coding and transcoding techniques. These metrics are
used to judge the quality of the image where the original is already
available.
In the following sections, we are going to present a number of FR QA
metrics that have been developed by researchers in the quality assessment
field, along with the underlying definition of the perceptual quality.
3-1-3-1 USING DCT, DWT, AND DFT
The authors in [5] examined the effect of decoupling inter-pixel dependencies by
using transforms like the Discrete Cosine Transform (DCT), Discrete Wavelet
Transform (DWT), or Discrete Fourier Transform (DFT). Their study shows
that by transforming images to the frequency domain and then performing a
simple pixel difference, the resulting performance surpasses that of complex
quality measures.
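A toy illustration of this transform-then-difference principle, using the DFT via NumPy (the DCT or DWT would play the same role); the weighting and pooling details of [5] are not reproduced, so the function below is only a sketch, not the metric from that study:

```python
import numpy as np

def dft_domain_distance(original, distorted):
    """Transform both images with a 2-D DFT, then take a simple
    element-wise difference of the spectral magnitudes."""
    spec_o = np.abs(np.fft.fft2(original.astype(np.float64)))
    spec_d = np.abs(np.fft.fft2(distorted.astype(np.float64)))
    return float(np.mean(np.abs(spec_o - spec_d)))
```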
3-1-3-2 PERCEPTUAL DISTORTION METRIC (PDM)
Figure 3-1 Block diagram of the Perceptual Distortion Metric (PDM)
In [3], a generic model of the HVS is used as an objective quality assessment
metric. The block diagram of the metric is illustrated in Figure 3-1. The color
space conversion block relies on the fact that the HVS treats colors as nonlinear
color differences (white-black, red-green, and blue-yellow) rather than RGB.
The perceptual decomposition is a set of spatio-temporal filters that mimic the
nonlinearity of the neuron responses in the HVS to different spatio-temporal
patterns. The HVS sensitivity decreases at high spatial frequencies; the contrast
gain control module is used to compensate for this feature.
3-1-3-3 STRUCTURAL SIMILARITY
Figure 3-2 Block diagram of the Structural Similarity
The argument behind this metric is that the human eye is tuned to detect
structural error. Three types of error can be introduced into multimedia
objects: variation of average local luminance, variation of contrast, and
structural error. The first two do not contribute to the degradation of the
perceived quality. Thus, by removing those two error types, we can isolate
the structural error, which defines the amount of degradation in image
quality. The block diagram of the Structural Similarity (SSIM) index is
shown in Figure 3-2.
The definitions of these three error components are as follows:
Luminance error:

l(x, y) = (2 μx μy + C1) / (μx² + μy² + C1)

Contrast error:

c(x, y) = (2 σx σy + C2) / (σx² + σy² + C2)

Structure error:

s(x, y) = (σxy + C3) / (σx σy + C3)

Where:
μx: mean of image X
μy: mean of image Y
σx²: variance of image X
σy²: variance of image Y
σxy: covariance between images X and Y
C1, C2, and C3 are small constants that stabilize the divisions.
Based on the above, the authors in [4], [6], and [7] present the structural
error as the cosine of the angle between the original image vector x and the
distorted image vector y. The logic is that after removing the luminance and
contrast errors, the remaining error vectors can be pictured as lying on a
circle: all have the same error power, but the angle of each determines its
effect on the perceived quality.
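The three terms above can be sketched directly from their definitions. This is a minimal global (whole-image) version; practical SSIM uses local sliding windows, and the constants C1, C2, and C3 = C2/2 used here are the common stabilizers assumed for an 8-bit range:

```python
import numpy as np

def ssim_components(x: np.ndarray, y: np.ndarray,
                    c1: float = 6.5025, c2: float = 58.5225):
    """Global luminance, contrast, and structure terms of SSIM."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    sx, sy = np.sqrt(vx), np.sqrt(vy)
    c3 = c2 / 2
    lum = (2 * mx * my + c1) / (mx**2 + my**2 + c1)     # luminance term
    con = (2 * sx * sy + c2) / (vx + vy + c2)           # contrast term
    st = (cov + c3) / (sx * sy + c3)                    # structure term
    return lum, con, st

def ssim(x, y):
    lum, con, st = ssim_components(np.asarray(x, float), np.asarray(y, float))
    return float(lum * con * st)
```

For identical images all three terms equal 1, so the product is 1; an inverted image keeps the same contrast but has negative covariance, which drives the structure term, and the score, down.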
Figure 3-3 Block diagram of the Multi-scale Structural Similarity. L: Low-pass filtering; 2: Down-sampling by 2
In [8], an improvement of the metric showed that running it on downscaled
versions of the images and combining the results is more effective at
capturing all the structural error in the image, and also compensates for
different viewing distances. A diagram of the Multi-scale SSIM is shown in
Figure 3-3.
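A minimal sketch of that multi-scale loop, assuming power-of-two image sizes, a single-window SSIM per scale, and equal per-scale weights (the published metric uses calibrated per-scale exponents):

```python
import numpy as np

def _ssim(x, y, c1=6.5025, c2=58.5225):
    # Global (single-window) SSIM, enough to illustrate the multi-scale loop.
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (x.var() + y.var() + c2))

def ms_ssim(x, y, scales=3):
    """Score at each scale, then low-pass (2x2 mean) and downsample by 2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    score = 1.0
    for _ in range(scales):
        score *= _ssim(x, y)
        # L then 2: average 2x2 neighbourhoods, keep every second sample
        x = (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4
        y = (y[0::2, 0::2] + y[1::2, 0::2] + y[0::2, 1::2] + y[1::2, 1::2]) / 4
    return float(score)
```

Each pass of the loop corresponds to one L + "2" stage of Figure 3-3, so structural errors that only become visible at coarser resolutions still affect the combined score.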
3-1-3-4 VISUAL INFORMATION FIDELITY AND NATURAL SCENE STATISTICS
Figure 3-4 Conceptual diagram of the VIF
Although at the beginning of this discussion we argued that fidelity measures
do not correlate well with perceived quality, the authors in [9-10] present a
fidelity measure that uses natural scene statistics to calculate the amount
of information conveyed correctly between the original and distorted image to
the observer. The concept is illustrated in Figure 3-4.
Natural Scene Statistics (NSS) rely on the fact that natural scenes occupy a
tiny subspace of all possible permutations of pixel values; consequently,
natural undistorted images can be described by a small number of statistical
features. Visual Information Fidelity (VIF) defines the perceived quality as
the difference in mutual information between the input and output of the HVS
for the no-distortion and distortion channels.
3-2 Subjective Experiments
Subjective experiments [11] are required for the evaluation of Video Quality
Metrics (VQMs). In these experiments, human subjects are asked to review and
rate the quality of the images in a database. The subjects are normally
screened for visual acuity and color blindness, to make sure the quality
scores describe the actually perceived quality of each image. Moreover, a
viewing session should last less than 30 minutes to reduce the effect of
fatigue on the observers.
The output of these experiments is the Differential Mean Opinion Score
(DMOS) of each image in the database. The DMOS values serve as a benchmark
for perceived quality, to be compared with the output values of objective
models when they are evaluated. Generally, the significance of an evaluation
is affected by the size of the database and the different error types it
contains.
There are a number of internationally accepted test methods for performing
subjective experiments. They are illustrated in Figure 3-5 and described
below:
3-2-1 Double Stimulus Impairment Scale (DSIS)
Human subjects review reference/test image sets, then rate the images on a
discrete scale: imperceptible, perceptible, slightly annoying, annoying, and
very annoying.
3-2-2 Double Stimulus Continuous Quality Scale (DSCQS)
In this test method, subjects are blind as to which image is the reference.
Each reference/test set is viewed twice. The images are scored on two
scales, one continuous and one discrete.
3-2-3 Single Stimulus Continuous Quality Scale (SSCQS)
This method differs from DSCQS in the number of times the reference/test
sets are viewed. It is therefore used for longer sequences (several
minutes), whereas DSCQS is only suitable for sequences of about 20-30
seconds. Furthermore, SSCQS resembles real viewing conditions more closely
than DSCQS.
Figure 3-5 Subjective Experiments: Viewing Modes (left) and Score Scale (right). (A) Double Stimulus Impairment Scale (DSIS) (B) Double Stimulus Continuous Quality Scale (DSCQS) (C) Single Stimulus Continuous Quality Scale (SSCQS)
3-3 VQEG
The Video Quality Experts Group (VQEG) was formed in 1997. Its main
objective is to validate and standardize objective quality assessment
models. Moreover, the group works toward the standardization of performance
metrics for validating the objective models. So far, the VQEG has completed
two sets of tests.
Phase I (1998): The subjective experiment used DSCQS. Nine objective
quality assessment models were evaluated. This test showed that 8 out of 9
models gave results that were indistinguishable from PSNR.
at least 20-29 human observers. The single stimulus method was used, and the
database was rated in 7 separate viewing sessions.
The fact that images were reviewed in more than one session led to a scale
mismatch in the scores given to those images. Therefore, an extra round of
review was performed using the double stimulus methodology on 50 randomly
selected images.
3-4-3 Realignment Process
The raw scores of each subject were converted to difference scores (between
the test and the reference), then to Z-scores, and finally scaled and
shifted to the full range (1 to 100). From these, a Differential Mean
Opinion Score (DMOS) value was computed for each distorted image.
For a single image, a score is considered an outlier if it lies outside a
certain interval, defined by the standard deviation, around the mean score
for that image. Such points are removed from the DMOS calculation for that
image.
A subject is rejected if the number of outliers exceeds a specified
acceptance rate; in that case, all ratings by that subject are excluded from
the final dataset.
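The realignment steps above can be sketched as follows. The interval width (two standard deviations here), the direction of the difference scores, and the omitted subject-level rejection are illustrative assumptions:

```python
import numpy as np

def realign_scores(raw, ref, outlier_k=2.0):
    """raw, ref: (subjects x images) ratings of the test images and of
    their corresponding reference images.

    Difference scores -> per-subject Z-scores -> linear rescale to
    [1, 100] -> per-image DMOS, with scores more than outlier_k
    standard deviations from the image mean dropped.
    """
    diff = np.asarray(raw, float) - np.asarray(ref, float)
    z = (diff - diff.mean(axis=1, keepdims=True)) / diff.std(axis=1, keepdims=True)
    scaled = 1 + 99 * (z - z.min()) / (z.max() - z.min())
    dmos = np.empty(scaled.shape[1])
    for i in range(scaled.shape[1]):
        col = scaled[:, i]
        keep = np.abs(col - col.mean()) <= outlier_k * col.std()
        dmos[i] = col[keep].mean()
    return dmos
```

The per-subject Z-scoring is what removes the session-to-session scale mismatch described above, since each subject's scores are re-expressed relative to their own mean and spread.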
3-4-4 Datasets
The database of images is accompanied by a number of datasets that define
the benchmark perceived-quality values for each of the 982 images in the
database.
dmos.mat: contains two arrays of length 982 each: DMOS and orgs.
o orgs(i)==0 for distorted images, and orgs(i)==1 for reference images.
o DMOS(1:227): JP2K, DMOS(228:460): JPEG, DMOS(461:634): White Noise,
DMOS(635:808): Gaussian Blur, DMOS(809:982): Fast Fading.
o The DMOS values corresponding to orgs==1 are zero (they are reference
images).
refnames_all.mat: contains a cell array refnames_all.
o refnames_all{i} is the name of the reference image for image i, whose
DMOS value is given by DMOS(i).
o If orgs(i)==0, then this is a valid DMOS entry. Else, if orgs(i)==1,
image i is a copy of the reference image.
DMOS_realigned.mat: DMOS values after realignment.
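A small helper for slicing these arrays might look like the following. The index ranges mirror the description above (converted to 0-based Python slices); in a real session the arrays would come from loading dmos.mat, and the function name is illustrative:

```python
import numpy as np

# Index ranges of the distortion types inside the 982-entry DMOS array
# (1-based ranges from the dataset description, as Python slices).
RANGES = {
    "jp2k":  (0, 227),
    "jpeg":  (227, 460),
    "wn":    (460, 634),
    "gblur": (634, 808),
    "ff":    (808, 982),
}

def distorted_dmos(dmos, orgs, kind):
    """DMOS values of the *distorted* images of one distortion type.

    orgs == 1 marks reference copies, whose DMOS entries are zero and
    are therefore excluded here.
    """
    lo, hi = RANGES[kind]
    d, o = np.asarray(dmos)[lo:hi], np.asarray(orgs)[lo:hi]
    return d[o == 0]
```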
3-5 H.264 Review
Throughout this study, the H.264 standard was used as the main compression
technique for encoding and transcoding all test sequences. In this section
we review the standard and its new features.
H.264 is the newest standard in its series, also known as ISO/IEC
International Standard 14496-10, or MPEG-4 Part 10 Advanced Video Coding.
The standard was finalized in March 2003 and approved by the ITU-T in May
2003 [14-16].
The encoder-decoder configuration is separated into two stages: the Video
Coding Layer (VCL) and the Network Abstraction Layer (NAL). Figure 3-6
shows the arrangement of both layers.
Figure 3-6 (A) Video Coding Layer (VCL) and Network Abstraction Layer (NAL) arrangement. (B) NAL unit
The VCL is responsible for efficiently coding the video frames and
delivering the coded information to be formatted by the NAL. The main aim of
the NAL is to arrange the coded information in a way that can be understood
by the receiver. All information is sent in what are known as NAL units;
these units act as packets that can be handled separately by the transport
layer for transmission, or stored in a file. Each NAL unit consists of a NAL
header, which specifies the sequencing of the information within the unit,
and the payload data.
The H.264 coding standard falls into the category of block-based
motion-compensated video compression. Figure 3-7 and Figure 3-8 show the
detailed block diagrams of the encoder and decoder.
Figure 3-7 Block diagram of H.264 Encoder
Figure 3-8 Block diagram of the H.264 Decoder
The term slice refers to a set of macroblocks, in raster order, that are
coded with the same type, i.e. I, P, B, SI, or SP. A macroblock is an area
of 16x16 pixels; it is the main building block on which processing occurs.
The slice type is defined by the type of coding applied to the macroblocks
it contains. The different slice types are:
I (Intra) slice: macroblocks are coded through prediction from macroblocks
in the same frame.
P (Predicted) slice: macroblocks are coded with reference to previously
coded frames.
B (Bi-directionally predicted) slice: macroblocks use both previous and
next frames.
SI and SP (Switching) slices: used to switch between different substreams.
The processing in the macroblock layer is divided into two categories: intra
and inter coding. In intra coding, a macroblock is predicted using only
spatial information, i.e., macroblocks from the same frame. In inter coding,
the prediction relies on temporal dependencies: an area from a previously
coded frame is copied and assigned to the macroblock currently being
encoded. The encoder then sends the motion vectors, reference frames, and
the error signal between the predicted and the current macroblock. However,
motion vectors are not sent to the receiver directly. Because motion
prediction in the encoder and decoder is identical, motion vectors are
predicted from the surrounding macroblocks, and only a displacement
(compensation) motion vector is sent to the receiver to correct the
predicted value.
Motion prediction in H.264 supports half- and quarter-pixel accuracy. The
intensity values at fractional pixel positions are determined by
interpolation:
Luma half pixel: 6-tap FIR filter.
Luma quarter pixel: averaging of half- and integer-pixel values.
Chroma: all fractional pixels are computed through averaging.
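The luma half-pel case can be made concrete: H.264 applies the 6-tap filter (1, -5, 20, 20, -5, 1), adds 16, and shifts right by 5. The one-dimensional sketch below handles borders by edge replication, which is a simplification:

```python
import numpy as np

# H.264 6-tap luma half-pel filter taps (sum = 32, hence the >> 5).
TAPS = np.array([1, -5, 20, 20, -5, 1])

def half_pel_row(row):
    """Half-pixel samples between consecutive integer pixels of one row.

    For integer pixels ...E F G H I J..., the half-pel value between
    G and H is clip((E - 5F + 20G + 20H - 5I + J + 16) >> 5).
    """
    padded = np.pad(np.asarray(row, int), (2, 3), mode="edge")
    out = []
    for i in range(len(row)):
        acc = int(np.dot(TAPS, padded[i:i + 6]))     # 6-tap FIR
        out.append(min(255, max(0, (acc + 16) >> 5)))  # round, normalize, clip
    return out
```

On a constant signal the filter reproduces the input exactly (the taps sum to 32), and on a linear ramp it lands on the midpoint between neighbours, which is the behaviour one expects from an interpolator.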
The following is a list of differences between H.264 and earlier standards:
H.264 includes a deblocking filter.
H.264 allows multiple reference frames.
H.264 introduces spatial prediction in intra frames.
H.264 uses a 4x4 integer transform instead of the former 8x8 DCT.
The standard defines a set of profiles in which H.264 can operate: baseline,
main, and extended. Each profile defines the accepted syntax and tools to be
used. The profiles are shown in Figure 3-9. In this study we have used the
Baseline profile.
Figure 3-9 H.264 profiles
H.264 is the most efficient coding algorithm with respect to bit rate
reduction, yet the most complex among its peers. In [17], the authors
performed a number of tests to analyze the complexity-distortion
relationship within H.264. They found that P frames are more efficient with
respect to distortion and complexity, but require more bit rate than
sequences containing B frames. The authors in [18] show that the processing
time of H.264 is dominated by the deblocking filter (49.01%) and fractional
pixel interpolation (19.98%).
3-6 Multimedia Transcoding
Research in multimedia transcoding falls into the following categories:
Transcoding techniques: the design of transcoding techniques that adapt the
video stream to fit fewer resources.
Transcoder analysis: the analysis of resource utilization in transcoders
and its optimization schemes.
Control schemes: controlling the selection of transcoding techniques, along
with the amount of transcoding done by each of them.
Although there is now a large number of studies on the design of transcoding
techniques, the lack of policy modules supporting transcoder implementations
has left those designs largely unused. In the following section we review
the first category, to familiarize the reader with baseline knowledge about
transcoding. The rest of the section provides a review of control schemes.
3-6-1 Transcoding Techniques
Transcoding has different types based on the kind of change induced in the
bitstream [19]:
Homogeneous: the modification of one or more of the resources required by
the bitstream. The different types of resources are shown in Figure 3-10.
Heterogeneous: the change of the bitstream syntax from one standard coding
scheme to another.
Error resilience: the injection of extra bits to increase the bitstream's
robustness to error.
Figure 3-10 Homogeneous transcoding
Transcoding techniques can also be categorized from the implementation point
of view. The simplest implementation is the back-to-back decoder-encoder
configuration, also known as the cascaded pixel domain transcoder (CPDT).
CPDT is the simplest, yet the most time-consuming, transcoder
implementation. As Figure 3-11 demonstrates, the deeper we go into the
structure of the bitstream, the higher the transcoding quality we obtain, at
the cost of transcoder complexity.
Figure 3-11 Transcoder Implementation
3-6-2 Control Schemes
In [20], the authors proposed a utility model based on maximizing utility
under a given amount of resources. The system supported neither dynamic
transcoding nor online transcoding. Three profiles were defined for each
multimedia object, namely: gold, silver, and bronze.
Figure 3-12 Utility Model
Another approach was taken by the authors in [21]: offline-transcoded
objects can be arranged in what is called an info-pyramid. The info-pyramid
is, by definition, a progressive data representation scheme. Objects stored
in the info-pyramid have different resolutions and abstraction levels:
Fidelity: the spatial and temporal resolution, using a lossy compression
technique.
Modality: the selection of either key-frame images, the audio track, or
closed captions.
When the customization and selection module receives a client request, it
assigns the object that best fits the request and sends it back to the user.
The architecture of the system is illustrated in Figure 3-13.
Figure 3-13 Info-pyramid based control scheme
On the other hand, the authors in [22] proposed a model with three
dimensions:
Device modality: display, audio, memory, CPU, and color
Network conditions: bandwidth, latency, and BER
User preferences
The dimensions and the overall system architecture are illustrated in
Figure 3-14 and Figure 3-15, respectively.
For each dimension a number of classes were defined, and offline transcoding
of the multimedia objects was performed. Storage and mapping of the
different bitstreams is done using the MPEG-7 standard. When a user's
request is received, the system chooses the most appropriate class from a
matrix of classes and sends the corresponding object to the user.
Figure 3-14 Three dimensional view
Figure 3-15 System overview
Another type of control scheme was proposed in [23]. The system operates in
real time and uses single-dimensional transcoding to fit videos to the
available bit rate. A buffer-based control scheme was used: the system
utilizes the relation between delay, buffer occupancy, and bit rate. Two
types of transcoding were used: re-quantization and frame dropping. The
number of bits required to encode a frame is estimated using information
gathered from previously encoded frames.
A control scheme can also be simplified to fit a specific application. In
[24], the authors proposed a control scheme for a map-viewing application.
The scheme is user-centric: information about the type of usage is important
in defining the amount of detail to be sent to the user. For example, a
hiker requires finer details than a car driver.
Figure 3-16 Adaptation, Resource, Utility spaces
The curves for these three spaces cannot be developed from a single video
sequence, since each video sequence can react differently to the adaptation
processes. The authors developed a system for generating utility functions
by extracting a set of features from video sequences. Those features are
then used to cluster the sequences into a number of predefined clusters that
are expected to behave in the same way with respect to the different
adaptation processes. The clusters are defined through the analysis of a set
of test sequences.
Chapter 4
Quality Assessment
4-1 Introduction
Our work on objective quality assessment was mainly driven by the need for
an objective model to be used in the policy module of the transcoding
engine. To replace the need for subjective experiments, this FR QA model
should possess the following properties:
High correlation with the output of subjective experiments.
Consistent reaction to different types of visual error and image content.
Inexpensive with respect to computation time.
These features are crucial for the metric to be used in place of human
observers. Research in quality assessment has produced different
perspectives on perceptual error. Although these definitions of perceptual
error make use of high-level image features, none of them has reached the
optimal criteria for providing the metric features described above.
In [28], the authors studied 10 state-of-the-art FR QA metrics. This
extensive evaluation shows that most of these metrics produce results worse
than, or indistinguishable from, PSNR. Although these metrics are based on
high-level visual features, they did not correlate well with the subjective
data.
In this chapter, we present our work on the formulation of an objective
metric that complies with the above criteria, along with the logic behind
its design.
4-2 Proposed Metric
Studies examining how the HVS treats received visual information found that
the HVS does not process images as luminance values but as contrast
differences. Moreover, this contrast-based response varies with viewing
distance. This led HVS-based metrics to apply a contrast sensitivity
function after decomposing the image into spatial and temporal bands.
The metric presented here builds on this fact. If a change in contrast
values is well distributed over the entire image, the HVS will not perceive
this type of error, since the relations between the contrast values are
maintained. Conversely, a contrast change with a large standard deviation
modifies the contrast relations in the image and is perceived as a
distortion.
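This intuition can be illustrated with a toy measure. Note that this is not the CED algorithm of [29]: the block-standard-deviation definition of contrast and the choice of spread statistic below are assumptions made purely for illustration:

```python
import numpy as np

def local_contrast(img, k=4):
    """Standard deviation of non-overlapping k x k blocks: a crude
    stand-in for whatever contrast definition a full metric would use."""
    img = np.asarray(img, float)
    h, w = (img.shape[0] // k) * k, (img.shape[1] // k) * k
    blocks = img[:h, :w].reshape(h // k, k, w // k, k)
    return blocks.std(axis=(1, 3))

def contrast_error_spread(ref, dist, k=4):
    """Spread of the contrast-error distribution: a uniformly shifted
    image (contrast relations preserved) scores near zero, while an
    unevenly distributed contrast change scores high."""
    err = local_contrast(ref, k) - local_contrast(dist, k)
    return float(err.std())
```

Adding a constant brightness offset leaves every block's contrast unchanged, so the spread is essentially zero; doubling the contrast of only half the image concentrates the contrast error there and yields a large spread.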
The proposed algorithm for calculating the Contrast Error Distribution
(CED) metric is as follows [29]:
Calculate the metric.
Figure 4-1 Block Diagram of the Contrast Error Distribution (CED)
4-3 Metric Evaluation Process
The metric evaluation process is not just a simple measurement of the
resemblance between DMOS values and Video Quality Ratings (VQRs). A number
of performance metrics must be applied to the VQRs to confirm that the
metric gives good results regardless of error type, image content, or the
amount of quality degradation.
In short, all of the above comply with a single definition:
generalizability. VQEG defines it as: "the ability of a model to perform
reliably over a very broad set of video content. This is obviously a
critical selection factor given the very wide variety of content found in
real applications. There is no specific metric that is specific to
generalizability so this objective testing procedure requires the selection
of as broad a set of representative test sequences as is possible." [12]
As stated above, to achieve this generalizability we have to run VQM tests
over a wide range of images and use performance tests that describe every
aspect of generalizability. For this reason, the VQEG standardized the
evaluation domains for VQMs as follows:
Prediction accuracy: the ability to predict the subjective quality ratings
with low error.
Prediction monotonicity: the degree to which the model's predictions agree
with the relative magnitudes of the subjective quality ratings.
Prediction consistency: the degree to which the model maintains prediction
accuracy over the range of video test sequences, i.e., its response is
robust to a variety of video impairments.
4-3-1 Subjective Data Rescaling
After realignment, DMOS values might take invalid values, for example
negative ones. Therefore, linear scaling is required to map the values to
the range 0 to 1, with zero being the worst perceived quality.
The scaling function is as follows:

Score = (Raw Difference Score - Minimum Value) / (Maximum Value - Minimum Value)
4-3-2 Nonlinear Regression
The relation between DMOS and VQRs is not linear, so applying performance
metrics directly to the VQM output would lead to inaccurate results. This
nonlinearity is due to the fact that subjective test results tend to be
compressed at the extremes of the test range. Consequently, a nonlinear
regression step is required to compensate for it.
We have used the 5-parameter logistic regression function of [28]:

DMOSp = β1 * logistic(β2, VQR - β3) + β4 * VQR + β5,
where logistic(τ, x) = 1/2 - 1/(1 + exp(τ * x))

The nonlinear regression converts the VQRs into predicted values DMOSp that
can then be compared to the subjective DMOS.
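A sketch of fitting such a function with SciPy follows. The parameterization matches the 5-parameter form used in [28]; the initial guess p0 and the synthetic usage below are illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic5(vqr, b1, b2, b3, b4, b5):
    # DMOSp = b1 * (1/2 - 1/(1 + exp(b2*(vqr - b3)))) + b4*vqr + b5
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (vqr - b3)))) + b4 * vqr + b5

def fit_logistic(vqr, dmos, p0):
    # Nonlinear least-squares fit of the five parameters.
    params, _ = curve_fit(logistic5, vqr, dmos, p0=p0, maxfev=20000)
    return params
```

Once fitted on the evaluation set, logistic5(vqr, *params) gives the predicted DMOS values on which the accuracy, monotonicity, and consistency statistics are then computed.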
4-3-3 Prediction Accuracy
The Pearson linear correlation coefficient:

CC = σxy / (σx σy)

Where σxy, σx, and σy are defined as follows:

σxy = (1/N) Σ xi yi - x̄ ȳ
σx² = (1/N) Σ xi² - x̄²
σy² = (1/N) Σ yi² - ȳ²
4-3-4 Prediction Monotonicity
The Spearman rank-order correlation coefficient is a measure of monotonic
association, used when the distribution of the data makes the Pearson
correlation coefficient undesirable or misleading.
ρ = 1 - (6 Σ di²) / (N (N² - 1))

where di is the difference between the ranks of the i-th data pair.
4-3-5 Prediction Consistency
Outlier ratio:

OR = No / N

Where:
No is the number of outlier points
N is the total number of data points
A point i (1 ≤ i ≤ N), with Qerror[i] = DMOS[i] - DMOSp[i], is considered an
outlier if the following condition is satisfied:

|Qerror[i]| > 2 * DMOS_std[i]

where DMOS_std[i] is the standard deviation of the subjective scores for
image i. The root mean square error (RMSE) is also considered a metric for
consistency.
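The evaluation criteria of sections 4-3-3 through 4-3-5 can be sketched with NumPy as follows (no tie handling in the rank computation, and the outlier threshold of two standard deviations as described above):

```python
import numpy as np

def pearson_cc(x, y):
    # CC = cov(x, y) / (std(x) * std(y)), via the moment formulas above.
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean(x * y) - x.mean() * y.mean()
    return cov / np.sqrt((np.mean(x**2) - x.mean()**2) *
                         (np.mean(y**2) - y.mean()**2))

def spearman_rocc(x, y):
    # Rank the data (double argsort, no ties) and apply
    # rho = 1 - 6 * sum(d_i^2) / (N * (N^2 - 1)).
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    d = (rx - ry).astype(float)
    n = len(rx)
    return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

def outlier_ratio(dmos, dmos_p, dmos_std):
    # Fraction of points where |Qerror| exceeds twice the per-image std.
    err = np.abs(np.asarray(dmos, float) - np.asarray(dmos_p, float))
    return float(np.mean(err > 2 * np.asarray(dmos_std, float)))

def rmse(dmos, dmos_p):
    d = np.asarray(dmos, float) - np.asarray(dmos_p, float)
    return float(np.sqrt(np.mean(d**2)))
```

In the evaluation of Section 4-4, each of these would be applied to the predicted DMOS values produced by the nonlinear regression step.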
4-4 Results
In the evaluation cycle we chose 6 FR QA metrics to compare:
Peak Signal-to-Noise Ratio (PSNR)
Structural Similarity (SSIM) [31]
Visual Information Fidelity (Log(VIF)) [32]
Pixel-Domain Visual Information Fidelity (VIF-PD): a less complex
implementation of VIF [33]
Contrast Error Distribution (CED) [proposed]
Contrast Error Distribution (Log(CED)) [proposed]
4-4-1 Overall Performance
The overall performance was measured by computing the Pearson correlation
coefficient, the Spearman rank-order correlation coefficient, and the root
mean square error of the 6 quality assessment metrics mentioned above. The
results are shown in Table 1; they demonstrate that CED gives results
similar to those of more sophisticated metrics such as VIF.
4-4-2 Cross-Distortion Performance
Table 2 through Table 4 show the detailed values of the above performance
metrics for each distortion domain. The tables show that CED's performance
is consistent across all distortion domains, whereas the other metrics
perform worse in the Fast Fading domain.
Table 1 Comparison between PSNR, SSIM, CED, PD-VIF, Log(CED), and Log(VIF)
with respect to CC: Pearson Correlation Coefficient, SROCC: Spearman Rank
Correlation Coefficient, RMSE: Root Mean Square Error

        PSNR     SSIM     CED (Proposed)  PD-VIF   Log(CED) (Proposed)  Log(VIF)
CC      0.8700   0.8959   0.9369          0.9326   0.9525               0.9544
SROCC   0.8755   0.9075   0.9550          0.9471   0.9550               0.9637
RMSE    13.4713  12.1396  9.9549          9.8798   8.3168               8.1708
Table 2 Pearson Correlation Coefficient of SSIM, CED, PD-VIF, Log(CED), and
Log(VIF), calculated for the distortion domains JPEG2000, JPEG, White Noise,
Gaussian Blur, and Fast Fading

                      JP2K    JPEG    WN      GBlur   FF
SSIM                  0.9311  0.9436  0.9693  0.8622  0.9271
CED (Proposed)        0.9561  0.9688  0.9325  0.9368  0.9466
PD-VIF                0.9702  0.9749  0.9717  0.9538  0.8698
Log(CED) (Proposed)   0.9598  0.9738  0.9716  0.9696  0.9635
Log(VIF)              0.9744  0.9688  0.9804  0.9707  0.9490
Table 3 Spearman Rank Correlation Coefficient of SSIM, CED, PD-VIF,
Log(CED), and Log(VIF), calculated for the distortion domains JPEG2000,
JPEG, White Noise, Gaussian Blur, and Fast Fading

                      JP2K    JPEG    WN      GBlur   FF
SSIM                  0.9331  0.9389  0.9684  0.8827  0.9380
CED (Proposed)        0.9545  0.9712  0.9719  0.9699  0.9658
PD-VIF                0.9717  0.9840  0.9872  0.9695  0.8675
Log(CED) (Proposed)   0.9545  0.9712  0.9719  0.9699  0.9658
Log(VIF)              0.9698  0.9600  0.9856  0.9734  0.9658
Table 4 Root Mean Square Error of SSIM, CED, PD-VIF, Log(CED), and Log(VIF),
calculated for the distortion domains JPEG2000, JPEG, White Noise, Gaussian
Blur, and Fast Fading

                      JP2K    JPEG     WN       GBlur   FF
SSIM                  9.2222  10.5526  6.8789   9.3565  10.6995
CED (Proposed)        7.6804  8.2344   10.4274  6.8455  9.6306
PD-VIF                6.1433  7.1296   6.6276   5.5593  14.0610
Log(CED) (Proposed)   7.0897  7.2565   6.6182   4.5263  7.6321
Log(VIF)              5.6908  7.8561   5.5314   4.4474  9.0253
4-4-3 Complexity Performance
VQEG has not yet standardized a complexity measure for VQMs. However, the
complexity of the metrics was evaluated on a Pentium M 1.86 GHz laptop,
using the time consumed in calculating the quality metric for all the
JPEG2000-distorted images (227 images). The complexity measures are shown in
Table 5.
The results show that CED provides a good tradeoff between performance and
complexity: it runs in about 1.4 seconds per image, whereas the metric with
comparable accuracy (VIF) takes about 12 seconds per image.
Table 5 Complexity Evaluation of the Quality Metrics

                            MSSIM    CED (Proposed)  PD-VIF   VIF
Total time (227 images), s  224.11   310.91          498.26   2768.4
Average time per image, s   0.99     1.37            2.2      12.2
4-4-4 Logistic Regression Performance
Figure 4-2 shows the scatter plots of the VQM outputs against DMOS values,
along with the logistic regression fit of the data. The plot for CED shows
that its VQR points are distributed evenly across the perceived quality
range.
Figure 4-3 shows the scatter plot of DMOS against the predicted DMOS
values; this plot reveals outlier points. For a metric to perform well, the
scatter points should lie near the diagonal of the graph and, moreover, be
distributed evenly across the range of perceived quality.
It can be seen from Figure 4-3 that the metrics have two empty spots, one
near the origin and the other at the far side of the graph, as highlighted
in red. The empty spot near the origin means that the zero point is
translated to a different value in the predicted DMOS. The graph for CED
shows that the empty spots have shrunk significantly, and therefore the
response of CED is improved for error figures located in those areas of the
graph.
Figure 4-4 shows the calibration curves of the 5 distortion domains of the
database used in the experiment. For a VQM's performance to be stable across
different types of distortion, the calibration curves should be
indistinguishable. In the figure, we can see that the calibration curves do
not overlie one another, but they are adjacent to each other. The points of
intersection indicate the amounts of error at which the metric reacts to
different types of error indifferently; elsewhere, the metric is more or
less sensitive to certain types of error.
Figure 4-2 Cont.
Figure 4-3 Scatter plots of predicted DMOS (VQRs after logistic regression) against DMOS values, calculated for the 6 VQMs: PSNR, SSIM, VIF, PD-VIF, CED, and Log(CED), respectively
Figure 4-4 Calibration curves for each error domain: JPEG2000 (green), JPEG (red), White Noise (blue), Gaussian Blur (magenta), Fast Fading (cyan), and all error domains (black), calculated for the 6 VQMs: PSNR, SSIM, VIF, PD-VIF, CED, and Log(CED)
Chapter 5
Data Analysis
5-1 Introduction
Nowadays, a large number of video transcoding schemes exist. These schemes
change a pre-encoded video bitstream into another that exhibits a lower bit
rate or complexity, and therefore lower quality.
Currently, the main problem in video adaptation is the management of the
process itself. More specifically, the problem lies in how to determine the
following:
The transcoding scheme to be used.
The amount of transcoding.
The problem stems from the fact that not all video sequences react in the
same way to transcoding processes. A given amount of transcoding can result
in different amounts of resource reduction in different video sequences, due
to the varied complexity of video content.
5-2 Offline Data Analysis Model
The authors in [34] put together a systematic procedure for designing video
adaptation technologies, they are as follows:
1. Identify the adequate entities for adaptation, e.g. frame, shot,
sequence of shot, etc.
2. Identify the feasible adaptation operators, e.g., de-quantization, frame
dropping, coefficient dropping, etc.
3. Develop models for measuring and estimating resource and utility
values associated with video entities undergoing identified operators.
4. Given user preferences and constraints on resource or utility, develop
strategies to find the optimal adaptation operator(s) satisfying the
constraints.
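Step 4 can be sketched as a simple constrained selection over the operator set. The sketch below is purely illustrative and is not taken from [34]: the operator names, resource costs, and utility values are hypothetical placeholders.

```python
# Sketch: choose the adaptation operator that maximizes utility (quality)
# while keeping the resource requirement within the client's constraint.
# Operator names and (resource, utility) numbers are purely illustrative.

operators = {
    "none":           {"resource": 1.00, "utility": 1.00},
    "drop_1_coeff":   {"resource": 0.94, "utility": 0.97},
    "drop_3_coeffs":  {"resource": 0.81, "utility": 0.90},
    "drop_5_coeffs":  {"resource": 0.69, "utility": 0.82},
    "frame_dropping": {"resource": 0.50, "utility": 0.70},
}

def best_operator(resource_budget):
    """Return the operator with the highest utility whose resource
    requirement fits within the budget (fraction of the original)."""
    feasible = {name: op for name, op in operators.items()
                if op["resource"] <= resource_budget}
    if not feasible:
        return None  # no operator satisfies the constraint
    return max(feasible, key=lambda name: feasible[name]["utility"])

print(best_operator(0.85))  # -> drop_3_coeffs
print(best_operator(0.60))  # -> frame_dropping
```

In a real policy module the (resource, utility) pairs would come from the models developed in step 3 rather than a hard-coded table.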
Figure 5-1 shows a conceptual diagram of the three-stage transcoding process: offline data analysis, policy module, and transcoding engine. The work in this thesis focuses mainly on the offline data analysis module. The policy module decides which transcoding algorithm to use and how much transcoding is needed. This is done by extracting features from pre-encoded videos and mapping them to a certain class. Each of the classes defined in the policy module contains information about the resource-transcoding relations. Those classes are created in the offline data analysis stage.
The main aim of the offline data analysis stage is to define the main classes of multimedia objects. Each class has its own resource-transcoding-quality graph, which contributes to the policy module decision.
Figure 5-1 Block diagram of Multimedia Middleware
The presented study relies mainly on the idea of finding key features that
would characterize the differences between video sequences. Those video
sequences usually reach the transcoding server in a pre-encoded form.
Transcoding servers should therefore identify the class of a sequence using only the information present in the coded domain.
5-3 H.264 Setup
The C++ reference implementation of the H.264 video coding algorithm [35], version JM 13.0, was used. The baseline profile was chosen for encoding the test sequences.
This profile contains the following features:
I slices: Intra-coding, only spatial prediction is allowed.
P slices: Inter-coding, forward temporal prediction.
CAVLC: Context-Adaptive Variable Length Coding
Configuration parameters for the coding algorithm:
Baseline Profile
QP=28
To be coded in IPPP
5-4 Test Sequences
The test video sequences used in this study are presented in [36]. Those video sequences are single-shot video segments. Therefore, each video sequence is encoded with the first frame as an I-frame and the rest of the frames as P-frames. The complexity of each video sequence is described in Figure 5-2.
5-5 Features
By classifying videos based on their content, video bitstreams can be grouped based on their behavior within the transcoding engine. This classification depends mainly on features extracted from the video sequences. A number of studies on transcoding control schemes have adopted the idea of classifying video content based on its complexity. However, the choice of features has been the main point of debate in this approach. In this chapter, the proposed feature analysis is presented. This analysis was done on most of the features used in the available literature [37-40]. The study conducted in this
thesis concluded that many of these features convey the same information and that some of them can be omitted from the proposed model.
Figure 5-2 Test Sequences Description
5-5-1 Feature Definitions
All feature definitions described in this section are calculated on a per-frame basis. In order to obtain a single value for each sequence, the average over all frames was computed. For the source domain features only, the averaged values were also compared against the first frame (I-frame) value.
5-5-1-1 Source Domain Features
Variance: Average variance of the luminance pixels
Pelact: Standard deviation of the luminance pixels
Pelspread: Standard deviation of Pelact
Edgeact: Magnitude of the pixel gradient
Edgespread: Standard deviation of Edgeact
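A minimal sketch of how these per-frame source domain features might be computed, assuming an 8-bit luminance plane held in a NumPy array. The gradient here uses simple finite differences, which may differ from the exact edge operator used in this work.

```python
import numpy as np

def source_features(luma):
    """Compute per-frame source domain features from a luminance plane.

    `luma` is a 2-D array of luminance pixel values. Returns the
    variance, the pixel activity (standard deviation), and the edge
    activity (mean gradient magnitude).
    """
    luma = luma.astype(np.float64)
    variance = luma.var()          # Variance feature
    pelact = luma.std()            # Pelact: std of luminance pixels
    # Edgeact: magnitude of the pixel gradient (finite differences here;
    # the thesis may use a different gradient operator).
    gy, gx = np.gradient(luma)
    edgeact = np.hypot(gx, gy).mean()
    return variance, pelact, edgeact

# Usage on a synthetic frame: a 16x16 horizontal luminance ramp.
frame = np.tile(np.arange(16, dtype=np.uint8), (16, 1))
var, pel, edge = source_features(frame)
```

Pelspread and Edgespread would then be the standard deviations of Pelact and Edgeact taken across blocks or frames.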
5-5-1-2 Resources Required
bitcount: Bit count used for coding each macroblock, accumulated over the whole frame
bitcount Y: Bit count used for coding only the Y component of the frame
ME time: Time consumed in motion estimation
SNR Y: Signal-to-noise ratio calculated on the Y frame
SNR U: Signal-to-noise ratio calculated on the U frame
SNR V: Signal-to-noise ratio calculated on the V frame
Time: Time consumed in coding
5-5-1-3 Coded Domain Features
MV magn: Motion vector magnitude (calculated only for non-static macroblocks)
MV magn var: Motion vector variance (calculated only for non-static macroblocks)
sub MV: Percentage of MVs that require subpixel interpolation (either half-pixel or quarter-pixel)
non zero MV: Percentage of non-static macroblocks
ave energy I: Average energy of AC coefficients in I-frames
ave energy P: Average energy of AC coefficients in P-frames
MV accel: Motion vector acceleration
MV dir: Motion vector change of direction
5-5-2 Analysis and Selection
Using principal component analysis (PCA) [41-42] would only help in projecting the features onto the axes with the highest covariance between features. Therefore, PCA is not suitable here, as the main purpose is to omit some features and to inspect whether source video features are important for differentiating between the video sequences or not. Principal feature analysis (PFA) [43] provides a way to do this: by clustering the features along the high-variance axes and finding the most dominant feature groups, only one feature from each dominant group needs to be chosen. First, this algorithm was used on each of the three feature domains separately
indistinguishable. The three source features selected are Ave variance, Pelspread, and Edgeact. The retained variability is equal to 99.3974%.
In Table 7, the trial on the resource features is presented. This analysis demonstrates that ME time can be used instead of encoding time without any loss of information, and that SNR can be calculated on any of the frame components (Y, U, or V) without any difference. The retained variability of this trial was 99.77155%.
In Table 8, the trial on the coded domain features is presented. The four selected features are MV magn, sub MV, Ave energy I, and Ave energy P.
The final trial is where both source and coded domain features are compared. The results of this trial are illustrated in Table 9. The retained variability for this trial is 99.9966%.
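The PFA selection procedure described above can be sketched as follows. This is a simplified illustration, not the exact implementation of [43]: the number of retained components, the clustering method, and its initialization are assumptions here.

```python
import numpy as np

def principal_feature_analysis(X, q, k, seed=0):
    """Select k representative features from data matrix X (samples x features).

    Project each feature's loadings onto the top-q principal axes, cluster
    the features with a plain k-means on those loading rows, and pick the
    feature closest to each cluster center (cf. the "distance from center"
    columns in the tables above).
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the features
    cov = np.cov(Xs, rowvar=False)
    _, vecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    A = vecs[:, -q:]                            # row i = loadings of feature i
    rng = np.random.default_rng(seed)
    centers = A[rng.choice(len(A), size=k, replace=False)]
    for _ in range(100):                        # plain k-means on the rows of A
        labels = np.argmin(((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        centers = np.array([A[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    selected = []                               # nearest feature to each center
    for j in range(k):
        members = np.where(labels == j)[0]
        if members.size:
            d = ((A[members] - centers[j]) ** 2).sum(axis=1)
            selected.append(int(members[np.argmin(d)]))
    return sorted(selected)

# Usage: features 0 and 1 are nearly duplicates, feature 2 is independent,
# so PFA keeps feature 2 plus one representative of the duplicate pair.
rng = np.random.default_rng(1)
a = rng.standard_normal(200)
b = rng.standard_normal(200)
X = np.column_stack([a, a + 0.01 * rng.standard_normal(200), b])
selected = principal_feature_analysis(X, q=2, k=2)
```

This mirrors how redundant features (e.g. encoding Time vs. ME time) collapse into one cluster, from which a single representative is kept.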
Table 6 Source Domain Features
Cluster Index Feature Distance from center
2 Ave Variance (I-frame) 0.063633
2 Ave Variance (Averaged) 0.063633
3 Pelact (I-frame) 0.0015095
3 Pelact (Averaged) 0.0019788
3 Pelspread (I-frame) 0.00086648
3 Pelspread (Averaged) 0.0013133
1 Edgeact (I-frame) 0.0045721
1 Edgeact (Averaged) 0.0045721
3 Edgespread (I-frame) 0.012588
3 Edgespread (Averaged) 0.0014647
Table 7 Resource Features
Cluster Index Feature Distance from center
3 Bitcount 0.18841
3 Bitcount Y 1.1781
2 ME Time 0.0012094
1 SNR V 0.02584
1 SNR U 0.025837
1 SNR Y 0.02599
2 Time 0.0012094
Table 8 Coded Domain Features
Cluster Index Feature Distance from center
1 MV magn 0
1 MV magn var 0
2 Sub MV 0
2 Non zero MV 0
3 Ave energy I 0
4 Ave energy P 0
1 MV accel 0
1 MV dir 0
Table 9 Final Trial
Cluster Index Feature Distance from center
2 MV magn 0.0016
2 Sub MV 0.0018
3 Ave energy I 0
1 Ave energy P 0
2 Ave variance 0.0481
2 PelSpread 0
2 Edgeact 0.0096
5-7 Transcoder Configuration
Figure 5-3 presents the architecture of the transcoding system. Videos are pre-encoded at the best supported quality, then passed through a transcoder that only decodes the NAL units into a set of VCL information. The transcoder changes some of this information in the coded domain and then re-encodes it into NAL units. The modified bitstream is then sent to the decoder at the client side to retrieve the pixel domain video sequence.
Figure 5-3 Standard Transcoder Configuration
The implementation used for the transcoder is presented in Figure 5-4. This
configuration was adopted to simplify the implementation of the transcoder.
This relies on the fact that the NAL encoder and decoder blocks are identical and can therefore be omitted.
Figure 5-4 Adopted transcoder configuration
5-8 Transcoder Setup
The implementation of the transcoder is based on the coefficient dropping transcoding scheme. This scheme has been applied to all test sequences and the same features were extracted. The details are as follows:
Transcoding parameters and amount of reduction:
Drop one coefficient (6.25% reduction)
Drop 3 coefficients (18.75% reduction)
Drop 5 coefficients (31.25% reduction)
Drop 7 coefficients (43.75% reduction)
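The coefficient dropping operation above can be sketched on a single transform block. The sketch assumes a 4x4 integer transform block, as used in the H.264 baseline profile, where each coefficient is 1/16 = 6.25% of the block; the scan used here is the standard 4x4 zigzag order.

```python
# Sketch: drop the n highest-frequency coefficients of a 4x4 transform
# block, following the standard 4x4 zigzag scan order. Dropping 1 of the
# 16 coefficients corresponds to the 6.25% reduction step above.

ZIGZAG_4x4 = [(0, 0), (0, 1), (1, 0), (2, 0),
              (1, 1), (0, 2), (0, 3), (1, 2),
              (2, 1), (3, 0), (3, 1), (2, 2),
              (1, 3), (2, 3), (3, 2), (3, 3)]

def drop_coefficients(block, n):
    """Zero out the last n coefficients (highest frequencies) in zigzag order."""
    out = [row[:] for row in block]       # copy the 4x4 block
    for r, c in ZIGZAG_4x4[16 - n:]:
        out[r][c] = 0
    return out

# Hypothetical coefficient block; dropping 3 coefficients = 18.75% step.
block = [[52, 12, 3, 1],
         [ 9,  4, 2, 1],
         [ 3,  2, 1, 1],
         [ 1,  1, 1, 1]]
reduced = drop_coefficients(block, 3)
```

Because the dropped coefficients are the highest frequencies in scan order, the operation trades fine detail for bit rate, which is why the reduction steps listed above map directly onto coefficient counts.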
In this experiment we used the features selected by the feature analysis discussed in the previous section. Those features are as follows:
Bitcount
ME time
SNR Y
Sub MV
Ave Energy I
Ave Energy P
MV Magn
Figure 5-5 shows the bit rate relations between the different bitstreams and transcoding parameters. The bit rate values are normalized using the zscore function in MATLAB. This function is defined as:

z = (V - mean(V)) / std(V)

where V is a column vector of D.
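The same normalization is easy to reproduce outside MATLAB; a minimal Python equivalent of zscore, applied column-wise (note that MATLAB's zscore uses the sample standard deviation, i.e. ddof=1):

```python
import numpy as np

def zscore(D):
    """Column-wise z-score normalization, matching MATLAB's zscore:
    each column V of D is replaced by (V - mean(V)) / std(V),
    using the sample standard deviation (ddof=1)."""
    D = np.asarray(D, dtype=np.float64)
    return (D - D.mean(axis=0)) / D.std(axis=0, ddof=1)

D = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Z = zscore(D)  # each column now has mean 0 and sample standard deviation 1
```

This puts bit rates measured at very different scales onto a common scale before clustering.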
Figure 5-6 Dendrogram of the generated clusters
Figure 5-7 Normalized Bitrate after adding the no transcoding values
The cluster analysis done in this study was able to predict the reaction of the test videos to the transcoding process. The dendrogram shows the presence of two clusters in the test sequences: one where the videos' bit rate without transcoding is higher than the transcoded bit rate, and a second where the videos' bit rate without transcoding is lower than some of the transcoded bit rates. Those two clusters are marked in the bit rate graph in Figure 5-7.
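The clustering step behind such a dendrogram can be sketched with a simple agglomerative procedure on the normalized feature vectors. This is a minimal single-linkage implementation; the actual dendrogram was presumably produced with MATLAB's clustering tools, so the linkage rule and Euclidean metric are assumptions here.

```python
import numpy as np

def two_clusters_single_linkage(X):
    """Agglomeratively merge points (single linkage, Euclidean distance)
    until exactly two clusters remain; return a cluster label per point."""
    n = len(X)
    clusters = [{i} for i in range(n)]
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    while len(clusters) > 2:
        # find the pair of clusters with the smallest single-linkage distance
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] |= clusters[b]
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels

# Usage: two well-separated groups of hypothetical normalized feature vectors.
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
              [5.0, 5.1], [5.1, 5.0]])
labels = two_clusters_single_linkage(X)
```

Cutting the hierarchy at two clusters corresponds to reading the dendrogram at its top split, as done for the two bit rate behaviors above.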
Chapter 6
Conclusion and Future Work

6-1 Conclusion
Research in multimedia transcoding has become an essential part of the field of multimedia communications. This is due to the fact that users are turning to multimedia as a key source of information. In reality, most of those users are using devices or networks that cannot yet handle the large amount of resources required for the transmission of multimedia objects. Multimedia middleware servers perform the required transcoding to allow video sequences to be transferred over these networks and devices seamlessly, without any intervention from the user's side. Such a system requires a thorough understanding of the video characteristics, device capabilities, and network resources. The overall objective of this structure is to provide users with exactly the right amount of information, excluding the possibility of requiring more resources than needed.
A large number of transcoding techniques have been developed in the available literature. Those techniques can alter a video sequence through the modification of one or more of its parameters. This leads to a variety of potential transcoded objects that can be transferred to the user. Currently, a management scheme for providing video sequences that best fit the requirements of the client devices and networks continues to be a challenge. The management system for multimedia content adaptation should be capable of providing an efficient use of resources on the client side while keeping the response time to client requests minimal. The concept adopted in this thesis for the implementation of the transcoding system relies mainly on the study of the video content while providing different transcoding plans for different content types.
The transcoding cycle starts with an offline analysis stage that clusters the multimedia objects into categories based on their characteristics. This analysis predicts the behavior of multimedia objects with respect to the transcoding techniques. Next, the best transcoding plan is chosen. This requires the presence of a quality assessment metric to evaluate the result and guarantee the transmission of the best option available given the resources on hand.
In our study we have explored those two points. The work done in this thesis will help toward the implementation of the transcoding server, and more specifically of the policy module in that server.
First, we examined the quality assessment methods in order to define a valid approach to computing the amount of degradation in object quality. We have defined the Contrast Error Distribution (CED)
metric, which provides a good tradeoff between performance and complexity. This makes it suitable for use in transcoders, where real-time response is valued greatly.
The results showed that CED is consistent across different error domains and visual content. This characteristic allows it to be used in the loopback analysis cycle, where both time and generalizability matter most. The proposed metric defines the perceived quality using a simple mathematical model deduced from common knowledge about the HVS. All previously available studies of FR QA models showed that, for a metric to perform well, it has to be based on a complex analysis of the image. However, the CED overcomes this weak point: it showed performance as high as that of the complex metrics and, at the same time, very low computational time.
Secondly, we ran an analytical study of the type of features to be included in
the offline analysis of videos. This study led to a set of features that can be
used in classifying and predicting the behavior of the video with respect to
the change in transcoding parameters.
The analysis showed that pixel domain features can be omitted. This is an important fact, as all the videos in the content servers will be in a pre-encoded form, and therefore the pixel domain features will not be available for use in the transcoding server. As a result, the offline analysis will not require any external information other than the pre-encoded video sequence.
In our study, we ran some preliminary experiments which showed that, using the selected features, a clustering system is able to predict the behavior of a set of video sequences.
6-2 Future Work
The contributions discussed so far have examined the implementation of the offline data analysis and the quality assessment metric. We have examined those two segments of the transcoding server separately. Consequently, the next step would be to integrate both of the proposed structures into the implementation of a transcoding server to validate the whole theory.
Moreover, we need to expand the analysis done in this thesis to include the
following:
Expand the evaluation process of the CED to include a database that contains compound error components instead of a single error component.
Change the CED to use 16x16 windows instead of 8x8, and apply it on DCT coefficients instead of luminance values.
Build a transcoding server that would use multiple transcoding techniques, and validate the ability of the clustering algorithm to detect the most significant clusters.
Criterion for Image Quality Assessment Using Natural Scene Statistics," IEEE Transactions on Image Processing, vol. 14, no. 12, 2005.
[11] H.R. Wu, Digital Video Image Quality and Perceptual Coding. CRC Press, 2005. ISBN 978-1420027822.
[12] Philip Corriveau, Arthur Webster. (2003) VQEG F