the CDN core network. In this case, it is essential that the Tailoring Server operates in a fully transparent way, since in OTT environments the service provider frequently has no control over the user terminal.
Both concepts will be referred to generically as Edge Servers from this point onwards.
3.6 Conclusions
In this chapter we have proposed a reference architecture for multimedia delivery services
over IP. This reference architecture provides a homogeneous view of the most relevant
scenarios: IPTV and OTT, both for live and on-demand contents, and including quality
monitoring points as well.
We have also introduced the QuEM quality monitoring framework, which is applicable in almost the same scenarios as PLR/PLP systems, but offers a more detailed analysis. Specifically, the basis of this approach has been set up with the objective of
developing a system that is able to characterize what is happening in the network, and
is easy enough to implement, integrate, and deploy in real video delivery systems.
Moreover, the proposed approach and the metrics that compose the monitoring architec-
ture have been validated by means of subjective assessment tests, analyzing the effects of
several transmission impairments on the QoE of the observers, and the relations among
those degradations. Those studies are also useful to calibrate the measurement elements
of the architecture to obtain reliable estimations of the impact of the distortions on the
perceived quality.
Finally, we have described some enablers: network elements that facilitate the imple-
mentation of QoE functionality in the delivery network.
In the next chapters we will fill this framework with information. In chapter 4 we will describe metrics to monitor the most relevant impairments using rich transport data. Those metrics will comply with the requirements established in the QuEM framework, and will be validated using the proposed subjective assessment methodology. In chapter 5 we will use the knowledge obtained in the generation of metrics to propose new value-added applications in the context of multimedia QoE. The implementation of these applications will also rely on the presence of some of the QoE enablers that we have described in this chapter.
Chapter 4
Quality Impairment Detectors
4.1 Introduction
This chapter describes the different metrics which are proposed for the monitoring of the
Quality of Experience in multimedia delivery services. Using the terminology defined in
the previous chapter, they are the Quality Impairment Detector (QuID) blocks needed
to build a Qualitative Experience Monitoring (QuEM) system. Each section is devoted to a different QuID.
The general approach to study each of the QuIDs has been similar. First, the impairment to be detected is defined and characterized. This implies identifying the cause of the impairment —and therefore proposing a technique to monitor it—, as well as understanding its impact on the perceived quality. Afterwards this analysis is completed
with specific subjective quality assessment tests, which use the methodology described
in section 3.4. A common set of subjective tests has been used for this purpose; they
are described in Appendix A.2. In specific sections of this chapter, additional subjective
and objective experiments have been used. They are described in different sections of
Appendix A, and referenced in the appropriate sections of the text when needed.
The metrics described in this chapter are the ones proposed in section 3.3.3. They cover
the most relevant defects described by users [7], and each of them fulfills the requirements
imposed by the QuEM architecture in sections 3.3.1 and 3.3.2 —scalability, significance,
and repeatability.
Section 4.2 describes a video Packet Loss Effect Prediction metric (PLEP). It predicts
how the loss of a video packet can lead to freezing or macroblocking effects, by analyz-
ing the propagation of the error within the video frame, as well as to adjacent frames
through the inter-frame prediction reference chain. The results of this metric are
analyzed objectively and subjectively using the test sequences described in Appendix
A.4 and the test set described in Appendix A.2, respectively.
Section 4.3 follows the same structure as 4.2, but analyzing the effect of the loss of audio
packets.
Section 4.4 analyzes the media coding quality, with two differentiated subsections. First,
in 4.4.1, the video artifacts produced by compression are analyzed with a specific set
of subjective quality assessment tests, described in Appendix A.3. The results of these
tests are used to explore the possibility of using RR or NR metrics to monitor video coding
artifacts in the context of a QuEM framework. Afterwards, in 4.4.2, a different approach
is presented, to analyze the effect of quality drops produced by strong variations in the
channel effective bandwidth —a typical OTT scenario with HTTP Adaptive Streaming.
In this case, two main alternatives are compared: switching to a version with different
bitrate and dropping frames. Their effects are analyzed with the subjective assessment
tests of Appendix A.2.
Section 4.5 describes outage events, understood as the total loss of video, audio, or both
signals for a period of time. Techniques to measure outage are described, as well as its
subjective effect according to the tests described in Appendix A.2.
Section 4.6 analyzes latency-related issues: lag and channel change time. This type of
analysis is sometimes excluded in the discussion of QoE, but it has been included in
this chapter for two reasons. On the one hand, lag and channel change are relevant
only in some specific scenarios, but these scenarios may have a great impact on the overall
perceived quality of the multimedia delivery service —live delivery of sports events is the
most typical case. On the other hand, there is a design trade-off between latency and
other quality factors, such as video coding quality or packet loss probability. Acknowl-
edging this relationship is relevant when considering the whole QoE of our services.
Section 4.7 describes the relationship, in terms of perceived quality, between the different
impairments that have been studied.
Finally, section 4.8 summarizes the main conclusions obtained in the whole chapter.
4.2 Video Packet Loss Effect Prediction (PLEP) model
Packet losses are the main cause of errors in multimedia services and, more specifically,
in IPTV. The loss of video packets can cause macroblocking and image freezing, which
are about half of the QoE impairments reported by customers in a field deployment [7].
For this reason, packet losses are a relevant QoS issue to monitor in IPTV networks. In
existing deployments, it is typical to use pure QoS metrics, such as the Media Delivery
Index (MDI), to monitor them [67]. On the one hand, MDI is a useful metric to esti-
mate QoE because, in the long term and for random losses, the packet loss rate correlates
reasonably well with the Mean Square Error which, in this scenario, can be a reasonably
good predictor of the perceived quality [40, 95]. On the other hand, in most cases there
is simply no other metric applicable in the context of real-time service monitoring: alternatives either need information that is not available at the monitoring point, or are too costly to apply.
However, other approaches are possible. If we have access to rich transport data, such
as the information provided by the rewrapper described in section 3.5.2, we can take
into account the structure of the video stream to improve the prediction of the effect of
losing some packets, instead of applying the sort of flat rate used by MDI.
Another important fact to consider is that the network QoS provided for IPTV should be good enough to make it difficult to assume that “MSE correlates to PLR”. Besides,
QoS-management decisions are taken in the short term (some dozens of packets or so;
otherwise delay is too high). Therefore we need to analyze the short-term effect of
isolated packet losses in order to improve quality management in IPTV.
We will focus in this section on the analysis of packet-loss effect in the short term.
We will build a model to predict the effect of packet losses in video, based on the
information available at transport level in a real deployment. In particular, we will
analyze the transport information (RTP and MPEG-2 Transport Stream), as well as
the network abstraction layer of H.264: NAL Unit Headers and Slice Headers. We will
not analyze deeper than Slice Header in any case: firstly because, when any scrambling
is applied (even partial), some parts of the slice are always unavailable; and secondly
because it would require CABAC entropy decoding, which would increase the computational cost of the monitoring tool excessively for practical applications, thus violating the scalability requirement imposed on QuIDs —see section 3.3.1.
The analysis has been performed in the context of an IPTV service, where the transport
unit (the minimum block that can get lost) is the RTP packet. It has also been assumed
that, to simplify the network processing, the MPEG-2 TS has been packaged into RTP
using a rewrapper. However, the model can be easily extended to other multimedia
delivery scenarios, just by adjusting the size and nature of the packets that can get lost.
4.2.1 Description of the model
We need a packet loss effect prediction (PLEP) model which is based on the analysis
of rich transport data, provides meaningful information to the operator using it, and is
as general as possible. To comply with these requirements, we propose a metric which
estimates the fraction of each of the frames which is affected by artifacts coming from
packet losses. Therefore a frame with a degradation value of, e.g., 50 percent, will have
half of its surface affected by artifacts.
The main advantage of this approach is that it focuses on the structure of the error in
the image, i.e. on the most direct impact of the packet loss, which is the absence of
correct information in parts of the image for some time. This metric does not depend
on the statistics of the image itself, and it is therefore usable in environments where
the picture intensity values are not available. Besides, it provides an easy qualitative
description of the impairment, which makes it suitable for our QuEM architecture.
Our solution encompasses two steps which are applied iteratively: we first compute
the degradation value in one frame, and then estimate the error propagation to the
neighboring pictures. The model only makes use of information which is available in the
slice header of H.264 slices: the slice type and reference picture buffer indexes. No data
is obtained from either the original (unimpaired) stream or from the decoded video.
4.2.1.1 Degradation Value
The first component of the impairment is the error generated in the frame where the
packet loss occurs. In an IPTV environment, video frames will typically be transported
over several transport packets (typically RTP). For that reason, a loss in one of the
packets does not necessarily mean the loss of the whole frame. In fact, the effect of the
loss of a single packet within the frame can be estimated by considering two well-known
properties of the H.264 coding:
• The information of macroblocks within a picture is transported in scan order (un-
less flexible macroblock ordering is used, which is not the case in Main and High
profiles).
• When there is an error in a NAL Unit, decoders usually cannot resynchronize video
decoding until the beginning of the next NAL Unit.
We measure the degradation value on a scale of 0 to 100, where 0 represents that an
image has been received without errors, and 100 indicates that it is completely impaired.
The metric will estimate the percentage of the image which is affected by the error:
$$E_0 = 100\% \cdot \frac{1}{N} \sum_{S=0}^{N-1} \left[ 1 - f\!\left( \frac{L(S)}{L_{avg}} \right) \right] \qquad (4.1)$$
where S represents each slice, N is the number of slices per frame, and L(S) represents
the length in bytes of the fragment of the slice which is not lost. It is assumed that
the rest of the slice is lost the moment an error is produced. Similarly, as macroblock
information is sequentially introduced in a slice (i.e., one macroblock after another), it
is reasonable to assume that the larger the portion of the slice affected, the larger the region of the image impaired. Lavg is an estimation of the length of the slice if there had
been no losses. Depending on the size of the loss and the video transport layer, it may
be estimated with higher or lower accuracy. In any case, it is always possible to assume
that the slice byte size will be similar to a sliding average of the sizes of the K previous
slices of the same type (I, P, B) and their position in the image. f is a function which
must be monotonically increasing. We will select the identity function saturated to the
value “1”, so that no slice can contribute to more than 100 percent of its size.
The equation assumes that all slices in the image have the same size (in pixels). Other-
wise, values should be weighted by their relative surface in the whole image.
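As an illustration, the following sketch computes the degradation value of equation (4.1) from the observed slice lengths. It is a minimal sketch, not the thesis implementation: the function name is ours, and the estimator for Lavg (a sliding average of previous slices of the same type) is assumed to be passed in precomputed.

```python
def degradation_value(received_lengths, expected_lengths):
    """Eq. (4.1): estimate the percentage of the frame affected by losses.

    received_lengths[s] -- bytes of slice s actually received before the loss
    expected_lengths[s] -- estimated full length Lavg of slice s (e.g., a
                           sliding average over the K previous slices of the
                           same type), assumed precomputed
    """
    n = len(received_lengths)
    total = 0.0
    for recv, avg in zip(received_lengths, expected_lengths):
        # f is the identity saturated to 1: an intact slice contributes no error
        total += 1.0 - min(recv / avg, 1.0)
    return 100.0 * total / n  # 0 = intact frame, 100 = completely impaired

# Example: 4 equal-sized slices, the second truncated to 25% of its length
print(degradation_value([5000, 1250, 5000, 5000], [5000.0] * 4))  # -> 18.75
```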
4.2.1.2 Error Propagation
Most of the pictures in an H.264 video sequence use other pictures as references in their
decoding process. This technique, needed to encode the stream with a reasonably low
bit rate, causes errors in one frame to propagate to all frames which make reference to it.
If those frames, in turn, serve as references for others, the impairment would propagate
even more along the reference chain. Therefore a picture with no losses can also have
artifacts which have been propagated from its reference frames.
We compute this propagated error Ep from the value E of each of the frames which are
used as a reference by the picture under study. Given a picture x depending on a set of references {yk}, the propagated error will be:
$$E_p = \gamma \sum_k \omega_k E(y_k) \qquad (4.2)$$
where E(yk) is the error level in the frame yk. This error can be the result of a packet loss in that frame (E0) or a propagated error itself (Ep); the values of ωk and γ model how to estimate the fraction of affected pixels in the predicted picture.
The constant γ represents the attenuation of the error effect along the reference chain.
In a typical H.264 coding scenario, instantaneous decoding refresh (IDR) pictures are introduced periodically (every few seconds, at most). Therefore, regardless of the value of γ, the error will only propagate until the next IDR frame in the worst case (which is with γ = 1). However, this assumption does not hold for long IDR repeat periods, or for cases where I frames are not IDRs and there can be references beyond GOP boundaries1. For this reason γ < 1 is recommended (for instance, γ = 0.9).
Factors ωk represent the weight of the different pictures which contribute as reference to
the picture under study. We use a model where higher level errors have a higher weight,
as they propagate in a more perceptible way:
$$\omega_k = \frac{E(y_k)}{\sum_k E(y_k)} \qquad (4.3)$$
This allows us to write:
$$E_p = \gamma \, \frac{\sum_k E^2(y_k)}{\sum_k E(y_k)} \qquad (4.4)$$
4.2.1.3 Error Composition
Finally, it is possible that one picture suffers from a packet loss and also that its reference
pictures had errors as well. In this situation, both error contributions must be combined.
In the best scenario, both contributions will overlap and the total error level will be the
maximum:
$$E_{bc} = \max \{E_0, E_p\} \qquad (4.5)$$
In the worst case, contributions will be independent and the error will be the sum:
$$E_{wc} = \min \{E_0 + E_p,\ 100\%\} \qquad (4.6)$$
Therefore we assume that the error will be somewhere in between:
$$E = \alpha E_{bc} + (1 - \alpha) E_{wc}, \quad 0 \le \alpha \le 1 \qquad (4.7)$$
1 In H.264, it is possible to define an I frame which is not an IDR. As an I frame, it can be decoded without needing other frames for prediction. However, unlike an IDR, it allows subsequent frames in decoding order to use previous frames as references. This can slightly improve the obtained video quality for a given bitrate constraint, and it is frequently used by IPTV video encoders.
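The propagation and composition rules can be summarized in a few lines of code. This is a sketch of equations (4.2)-(4.7) under our own naming; the value α = 0.5 in the example is an arbitrary illustration, since the model only constrains α to [0, 1].

```python
def propagated_error(ref_errors, gamma=0.9):
    """Eqs. (4.3)-(4.4): propagated error Ep from the reference frames' error
    levels, weighting larger errors more heavily."""
    total = sum(ref_errors)
    if total == 0:
        return 0.0
    return gamma * sum(e * e for e in ref_errors) / total

def composed_error(e0, ep, alpha=0.5):
    """Eqs. (4.5)-(4.7): combine the local loss E0 with the propagated error Ep,
    between full overlap (best case) and full independence (worst case)."""
    best_case = max(e0, ep)
    worst_case = min(e0 + ep, 100.0)
    return alpha * best_case + (1.0 - alpha) * worst_case

# A frame with a local 20% loss, referencing two impaired pictures (30% and 10%)
ep = propagated_error([30.0, 10.0])  # 0.9 * (900 + 100) / 40 = 22.5
print(composed_error(20.0, ep))      # 32.5, halfway between 22.5 and 42.5
```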
4.2.2 Experiment
To test the proposed PLEP model, it is necessary to design an experiment which focuses
on the effect of where packet losses occur. Instead of generating random error patterns,
we have designed an experiment where packet losses are set deterministically and where
it is possible to observe the effect of changing the loss position in the stream.
The sequences are pre-processed with the rewrapper described in section 3.5.2. This
way, each video frame is transported in an integer number of RTP packets, and so is
each GOP. With the aim of analyzing the effect of different packet losses within the
stream structure, one single GOP is selected to generate packet losses on it.
We apply the following steps, with K taking values from 0 to the number of RTP packets in the selected GOP (a code sketch of this loop follows the list):
1. In the selected GOP, the RTP packet in position K is dropped.
2. The PLEP metric is obtained for the resulting sequence.
3. The video sequence is then decoded using the open-source decoder FFmpeg2 (with
default error concealment) and stored on a disk without compression.
4. The obtained sequence is compared with the original one (without errors) using
MSE.
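A sketch of this loop is shown below. The helper functions (compute_plep, write_stream, mse_vs_reference) are hypothetical placeholders for the tooling described in the text, not actual thesis code; only the FFmpeg invocation is a real command line.

```python
import subprocess

def run_loss_experiment(rtp_packets, gop_start, gop_len):
    """Steps 1-4: for each position K in the selected GOP, drop packet K,
    compute PLEP, decode with FFmpeg (default concealment), and measure MSE."""
    results = []
    for k in range(gop_len):
        # Step 1: drop the RTP packet in position K of the selected GOP
        impaired = rtp_packets[:gop_start + k] + rtp_packets[gop_start + k + 1:]
        # Step 2: obtain the PLEP metric for the resulting sequence
        plep = compute_plep(impaired)            # hypothetical helper
        # Step 3: decode to uncompressed video with FFmpeg
        write_stream("impaired.ts", impaired)    # hypothetical helper
        subprocess.run(["ffmpeg", "-y", "-i", "impaired.ts", "decoded.yuv"],
                       check=True)
        # Step 4: compare with the unimpaired original using MSE
        mse = mse_vs_reference("decoded.yuv", "reference.yuv")  # hypothetical
        results.append((k, plep, mse))
    return results
```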
This experiment was conducted with the sequences A, B, C, and D described in the
Appendix A.4. The following discussion will consider sequence A, as it is the one with the longest GOP (100 frames), and therefore the one producing the most test cases.
However, the same process was repeated with sequences B, C, and D, with similar results
—a comparison will be provided later.
Sequence A is encoded in H.264 over MPEG-2 TS at 2.8 Mb/s (with the video stream
at 2.3 Mb/s). Each frame has only one slice, which is the most typical situation for
commercially available video encoders for IPTV. The GOP structure is a hierarchical
“. . . IBBBP. . . ”, such as the one discussed in section 2.4.1 and depicted in Figure 2.4 in
page 31. All I frames are IDR pictures.
The sequence is encapsulated in RTP using the rewrapper. Each GOP occupies about
1000 RTP packets and, in particular, the GOP under study had exactly 958 packets.
Therefore 958 different impaired sequences (each one with the error in a different position
within the GOP) were generated, decoded, and processed.
2 http://www.ffmpeg.org
It is worth noting that, due to the rewrapping process, all the losses affected only one
video frame, although the visual impairment will affect more than one frame due to
error propagation in the prediction process.
4.2.2.1 Qualitative Analysis
Before analyzing the results of the measurements, it is interesting to examine the video
itself, to better understand what happens when one packet is lost. We mainly consider
the results in sequence A since, having a longer GOP, it produces more data in the one-GOP analysis. Figure 4.1 is used as an example for this analysis, although the ideas
described in this section are applicable to the majority of sequences generated for the
study, including both other sequences generated from sequence A and from sequences
B, C and D. Figures 4.1(a), (c), and (e) show an IDR frame where RTP packets #11,
#28, and #29 have been lost, respectively. Figures 4.1(b), (d), and (f) show the next P
frame in display order for the same sequences. Figures 4.1(g) and (h) show the original
unimpaired IDR and P frames, respectively.
In all the measurements, the frame with the highest MSE is the one where the loss
occurred. However, this is not the frame where artifacts are most visible. This is
illustrated in Figure 4.1(a): in the frame where the packet is lost, the MSE is high but the visibility of the error is low. However, four frames later, in Figure 4.1(b), once the error has been
propagated by inter-frame predictions, the error has higher visibility even with a lower
MSE than before. This effect is also produced from Figure 4.1(c) to Figure 4.1(d), and
from Figure 4.1(e) to Figure 4.1(f).
This fact is due to error concealment: when part of the frame is lost, it is simply replaced
by the most recent reference frame available. The visual effect of this replacement is
a frame with a spatial discontinuity (part of the frame is correct, part comes from the previous one), which is not very disturbing visually. However, when the frame is used for
prediction, the predicted macroblocks will have errors, and the macroblocking effect will
appear.
It is also important to consider that in real situations, error concealment techniques may
not be as predictable as desired. For example, Figure 4.1(c) and Figure 4.1(e) show the
same frame for two different sequences —Figure 4.1(c) with the loss of packet #28, and
Figure 4.1(e) with the loss of packet #29, with both packets affecting the same frame.
In the first instance, FFmpeg concealment attempts to reuse the last available reference frame to replace the missing portion of the frame, and as a result the error has low visibility. In the second instance, the lost packet, #29, is directly adjacent to the packet
previously used, #28, which shows that the FFmpeg concealment has failed and that the error has high visibility. These kinds of concealment failures can occur in real decoders, either software or consumer set-top boxes. Therefore one must be careful when making a priori assumptions about how impaired frames appear on the user screen.

Figure 4.1: Video sequence used for qualitative analysis. The left column shows an IDR frame where one RTP packet is lost, while the right column shows the following P frame. The red line in each frame indicates the position in the image of the first macroblock which got lost. RTP packets lost are #11 (a,b), #28 (c,d) and #29 (e,f). (g,h) show the original unimpaired IDR and P frames.
We also found that the sooner an error is produced within an encoded frame, the higher the fraction of the decoded frame affected. The lines in Figure 4.1 show the position
of the error within the frame. Frames in Figure 4.1(a) and Figure 4.1(b), where the
error was produced in packet #11, have more visible and extensive artifacts than those in Figures 4.1(c) and 4.1(d), where the error was produced in packet #28. The underlying idea is that once a fragment of the H.264 slice is lost, the rest of the slice becomes useless to the decoder, which discards it completely since it is not
trivial to resynchronize CABAC decoding. As there is only one slice per frame, when
an error occurs within a video frame, the rest of the frame is lost.
Finally, we should mention a specific case of interest: when the first video packet in
the GOP is lost, then the whole I frame gets lost as well, including any GOP-level
header (such as Sequence Parameter Set, Picture Parameter Set or SEI messages). As a
result, and with the decoder implementation that we have used, the whole GOP becomes
impossible to decode and the image freezes until the next I frame arrives.
4.2.2.2 Quantitative Results
We have computed the Packet Loss Effect Prediction (PLEP) values for each one of the
sequences under study. As IDRs are used at GOP boundaries, sensitivity to γ is not so
critical. We have taken the default value of γ = 0.9. Since there is only one packet loss,
there is no error composition situation, and therefore the value of α is not relevant.
We selected MSE (aggregated along all the impaired frames) as our method of choice to
measure the impact of error in the sequence. Although there are other methods which
correlate better to subjective MOS, such as structural similarity index (SSIM) [116],
MSE has been shown to perform better when predicting packet loss visibility [93].
Figure 4.2 shows the MSE for all the sequences (varying the loss position) generated
from sequence A. The grey line shows the aggregated MSE of the whole sequence while
the green line shows the MSE only of the frame where the loss was produced. The red
line shows the MSE obtained by just substituting the frame where the error occurs with
the previous available reference frame (i.e., the concealment error at frame level). And
the blue line shows the result of the PLEP metric. Figure 4.3 shows the same values for
a reduced number of the sequences.
Figure 4.2: Mean Square Error and Packet Loss Effect Prediction metric for all sequences under study, varying the loss position: aggregated MSE (grey), MSE at the frame where the loss occurs (green), concealment error (red), and PLEP (blue).
Figure 4.3: Detail of Mean Square Error and Packet Loss Effect Prediction metric for all sequences under study
It can be seen that the error has a higher impact at higher levels of the reference hierarchy: when the error occurs in an I frame or P frame, it generates a higher MSE than when it occurs in a (reference) B frame, which in turn is higher than the error generated by losses in (non-reference) b frames. This is mainly due to the fact that errors in reference frames
propagate, and therefore affect more frames. Error concealment also produces more
visible results in I frames and P frames since the previous reference frame available is
further back in time (four frames distant), than in the case of B frames (two frames
away), or b frames (one frame away).
The analysis also indicates that the error decreases with the position of the loss within the frame. This is due to the fact that losing a single packet in a slice means losing the rest of the slice completely,
since the decoder is unable to resynchronize the CABAC decoding. Of course this
decrease is not completely monotonic, as the reconstruction of the damaged frame is
not always perfect. Sometimes concealment techniques fail or are just less effective than
expected.
Figure 4.4: Mean Square Error versus Packet Loss Effect Prediction metric (log scale) and linear fit between them (R² = 0.67)
Figure 4.5: Percentage of macroblocks which are different between both images versus Packet Loss Effect Prediction metric, both in log scale, as well as linear fit (R² = 0.85)
There is also some tendency for the error to decrease along the GOP, because the earlier the error occurs in the GOP, the greater the number of frames it affects. However, due to the
fact that there are some scene changes within the GOP, this effect is not very strong.
Figure 4.2 shows that the PLEP model follows the shape of the error, and in Figure 4.4 both magnitudes are directly compared. There is a reasonably good correlation (R²
= 0.67) between both values, which suggests that the PLEP model is robust enough
to predict packet loss effects. It is worth noting that in this scenario, unlike in other
experiments reported in the literature, there is no correlation between the MSE (which is
variable) and the PLR (which is constant and equal to 1/958 for all the sequences). This
means that our PLEP model is able to explain the effect of packet losses reasonably well, even in situations where the packet loss ratio does not provide any valuable information.

Figure 4.6: Percentage of macroblocks which are different between both images (blue) and Packet Loss Effect Prediction metric (red) for all sequences under study, varying the loss position
Results obtained from the other sequences are qualitatively quite similar. Table 4.1 shows the R² between PLEP and MSE for all video sequences.
Table 4.1: Coefficient of determination (R²) of MSE vs PLEP fit for several video sequences.

Sequence    A     B     C     D
GOP size    100   24    24    12
R²          0.67  0.63  0.74  0.91
With this in mind, it is also important to consider that the PLEP method is more
robust to failures in error concealment than MSE estimation methods. Indeed, error
concealment is quite unpredictable in a real case, and not easy to fit into a predefined
model, as we illustrated previously in Figure 4.3, where the MSE in the frame where
the loss occurred is shown in green, while the MSE in dashed black depicts an instance
when an error occurred and the damaged frame was replaced by the previous frame
available. This suggests that even knowing the MSE produced by replacing one frame
by its predecessor, there is no specific pattern which can easily model MSE in a specific
frame when the loss occurs in the middle of a GOP. However, predicting the “part of the
frame affected” is much more stable, since it does not depend on the error concealment
techniques used. Thus, a metric defined as the ratio (in percent) of macroblocks different
on a pixel-to-pixel basis between both images provides a better approximation than MSE
does for the concept of “part of the frame affected.”
Figures 4.5 and 4.6 show that the PLEP model is indeed a good predictor of the ratio of macroblocks which differ between the original and the impaired images. Correlation with
the PLEP model increases so that, for the sequence under study, R² = 0.85.
4.2.3 Subjective analysis
The next step in the analysis is discovering whether the prediction of the fraction of
the image affected by errors can be effectively used to model impairments in the per-
ceived Quality of Experience. With this goal, the subjective assessment test session described in Appendix A.2 included some impairments based on the PLEP model. The impairments were generated under the same conditions as in the previously discussed objective experiment: the video is sent by a rewrapper process and only one RTP packet is lost,
and the loss includes data of only one frame. The position of the RTP loss within the
GOP structure is varied to produce different effects.
The different impairment conditions are described in Table 4.2. We will consider the
simplified version of γ = 1, so that we assume that the error is propagated until the
end of the GOP. Impairment N is the hidden reference (no packet loss). Impairment E1 loses the first packet of the first non-reference B frame in the GOP; thus the error does not propagate to other frames. Impairments E2, E3 and E4 lose one packet in the first
reference P frame of the GOP, so that the error gets propagated along the GOP. To
vary the resulting effect, the packet is lost at the beginning (E4), in the middle (E3) or
at the end (E2) of the frame, which varies the packet loss effect according to what has
been discussed previously. Finally, impairment V1 has a special effect: it loses the very first packet of the GOP (in the I frame). In this case, as the most relevant headers for the GOP get lost, the resulting effect is not macroblocking, but a freeze of the image for the duration of the GOP (until another I frame is received).
Table 4.2: PLEP impairments analyzed in the subjective assessment tests

Code  Frame    % frame affected  Description
N     n/a      n/a               Hidden reference
E1    B (nr)   100               Loss of one frame
E2    P (ref)  25                25% of frame affected during one GOP
E3    P (ref)  50                50% of frame affected during one GOP
E4    P (ref)  95                95% of frame affected during one GOP
V1    I (ref)  100               Video freeze during one GOP
The results obtained from the tests are shown in Figure 4.7, differentiating the three
content sources under study: an action movie (Avatar, in blue), a football match (yellow)
and a documentary (red). The global average value is also displayed, together with its
confidence intervals. The description of the sources, as well as more details about the
tests, can be found in Appendix A.2.
Figure 4.7: Results of the subjective assessment for Video Loss impairments
As a first conclusion, the results suggest that the PLEP metric is applicable to the
characterization of video packet losses, as they confirm that the position of the error
within the GOP structure significantly affects the quality perceived by the end user.
This conclusion has to be taken with some degree of caution, because there is variability
in the results, especially from one content source to another. However, it is clear that
the PLEP model outperforms the simple packet loss rate metrics. More specifically, losing one single frame (without propagation) or a small part of the frame (even with propagation along the GOP) is, in general, either not perceived or perceived as not annoying, and statistically indistinguishable from the hidden reference. Beyond that, the bigger the fraction of the frame affected, the higher the severity. Finally, freezing the video for the whole GOP has a more severe impact on quality than the macroblocking effect.
The errors E2, E3, E4 and V1 belong to the same “impairment set”, as defined in section
3.4.3. That means that they are evaluated in parallel over the same segments. Figure
4.8 shows the detailed results for each of the segments of this “impairment set” for
the three sequences under study. Most of the segments follow the same pattern as the
general results, and it is also possible to see that the “inter-segment” variability for the
same error event is lower than the “intra segment” variability for the different errors
applied to each segment. The segments labeled as “Doc-10” and “Avatar-20” —from
the documentary and the movie sequences, respectively— may be considered outliers,
and they share the property of having a low MOS for the less perceptible error (E2).
This suggests that in both cases the “delivery quality” of the unimpaired version of those
segments might be lower than expected, and maybe a characterization of the properties
of the video in the headend could lead to an RR metric that improved the performance of PLEP.
Figure 4.8: Detailed results for each of the individual segments for Video Loss
4.3 Audio packet loss effect
When packets containing audio information get lost, there is also an impairment in the
perceived quality: either a temporary interruption in the displayed sound or a distortion
(glitch or noisy sound). Audio distortions are less frequent than video artifacts or, at
least, less frequently perceived by end users [7]. However, they are still common enough for any monitoring system to consider them, especially if we take into account that they are as unacceptable as video artifacts [57]. It is also relevant to consider that, as audio streams normally have a very stable bitrate, they require a relatively small buffer in the receiver (around 50 ms, compared to the 500-2000 ms typical for video streams). As a consequence, audio packets are much more sensitive to delay variation than video packets, and high values of jitter will easily increase the losses in the audio stream.
In this section we will study the effects of those packet losses, both objectively and
subjectively. We will take as baseline scenario an IPTV channel over MPEG-2 Transport
Stream. To simplify the analysis, we will assume that the stream has been encapsulated into RTP packets by a rewrapper. This way, a packet loss at RTP level will impair either
audio or video signals, but not both simultaneously.
4.3.1 Objective analysis
Audio coding formats used in multimedia systems normally use block coding: they take a time window of the audio waveform, divide it into spectrum sub-bands, and code each sub-band according to spectral masking criteria (obtained from a psychophysical model of the human hearing system), aimed at maximizing the perceived quality for a target bit rate. There is some overlap between adjacent windows, but no long-term coding prediction or complex prediction structures. All the audio codecs considered in our IPTV and OTT scenarios (MPEG-1 layer 2, MPEG-4 AAC, and Dolby AC3) have this kind of design.

Figure 4.9: Waveform of a lossy audio file
With this, the impairment produced by the loss of one audio RTP packet will affect only the time window to which the packet belongs. Therefore we can make the hypothesis that the impairment will be a silence whose length is proportional to the length of the packet loss burst. This, which is exact for uncompressed audio (PCM), will be a sufficiently good approximation for compressed audio as well.
Figure 4.9 shows the waveform obtained after decoding an audio file with losses. It is
the audio stream of sequence A described in Appendix A.4, encoded in MPEG-1 layer 2 at 192 kbps. 70 TS-packet losses (around 550 ms) were introduced every 1000 TS
packets (7.8 s). Silence intervals are clearly visible in the waveform, and their duration
is effectively around 0.5 seconds each.
In some cases, signal peaks can be observed next to the silence intervals. They are
perceived as glitches or audio discontinuities, and they may also appear in the event of packet losses. In principle, and for the sake of the analysis of the losses, we will consider
only the silences as the base impairment, since they cannot be distinguished from the
glitches just by the analysis of the lost packets.
Another 2-minute cut of the aforementioned sequence A (with MPEG-1 layer 2 audio at 192 kbps) has been taken to introduce audio packet losses, varying the number of consecutive packets lost (the loss burst). The expected duration of each TS packet loss would be:

$$\frac{188 \times 8}{192000} = 7.8 \times 10^{-3}\ \mathrm{s} \qquad (4.8)$$

Figure 4.10: Effect of audio losses: measured vs. expected (R² = 0.98)
Afterwards, the resulting stream has been decoded by a software decoder and the length of the silences has been determined. The result is shown in Figure 4.10. Blue points
show the length of the silence events (Y axis) as a function of the number of packet losses
(expressed in seconds, X axis). Most of the silence events have a length which is similar
to the expected one (although there is a small fraction of outliers, which represent the
short silence periods just after or before a glitch effect). Once the outliers have been
removed, the data fitting to a regression line (in red) allows us to determine the validity
of the approach. The line has a slope of 1.05 and an ordinate at the origin of 0.18, with a determination coefficient R² = 0.98.
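The expected duration of a loss burst, and the empirical fit just described, reduce to simple arithmetic. The sketch below encodes equation (4.8) and the regression of Figure 4.10; the function names are ours.

```python
def expected_silence_s(n_ts_packets, audio_bitrate_bps=192_000):
    """Eq. (4.8): each lost 188-byte TS packet removes 188*8 bits of audio,
    i.e. about 7.8 ms of signal at 192 kbps."""
    return n_ts_packets * 188 * 8 / audio_bitrate_bps

def observed_silence_s(n_ts_packets, audio_bitrate_bps=192_000):
    """Empirical fit from Figure 4.10: slope 1.05, intercept 0.18 s."""
    return 1.05 * expected_silence_s(n_ts_packets, audio_bitrate_bps) + 0.18

print(expected_silence_s(70))  # ~0.55 s, matching the bursts in Figure 4.9
print(observed_silence_s(1))   # ~0.19 s: even a single packet yields ~180 ms
```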
With this data, the following conclusions can be obtained:
• The model is sufficiently good to be used as a QuID.
• The slope is approximately 1, so that we can say that the perceptible duration of
the loss is quite similar to the length of the packet loss.
• Each packet loss, even the shortest ones, generates a silence of at least 180 ms.
This last figure of 180 ms should be taken with appropriate caution. Firstly,
because the offline software decoder is not very robust under packet loss events (and, in
fact, extracting the silence length has required a careful analysis of the recovered data).
Figure 4.11: Short-length audio losses
And secondly because the number of samples used in the model is not high enough to
be sure about the quantitative significance of this result.
However, from a qualitative point of view, it seems to be clear that there is a minimum
silence length that occurs in most cases. In Figure 4.11, which shows the values of Figure 4.10 for its smallest loss durations, it can be seen that the four columns of blue points on the left side (which refer to losses of 1, 3, 5 and 7 TS packets) generate errors between 150 and 300 ms indistinctly. Without considering the quantitative significance of those figures, it is possible to say that, qualitatively, the effect of losing one single TS packet is similar to the effect of a short burst of packet losses. A side effect of this conclusion is that encapsulating 7 audio TS packets into a single RTP audio packet in the rewrapper does not significantly increase the effect of the minimum audio loss, which would be 1 TS packet (plus probably some video packets as well) for non-rewrapped streams, and 7 TS packets (without additional loss of video) for rewrapped streams.
4.3.2 Subjective analysis
The subjective assessment test session described in Appendix A.2 also included impairments produced by the loss of audio packets. As the transmitted packets have been processed by the rewrapper, every 7 audio MPEG-2 TS packets are grouped into one audio RTP packet. As described before, the coded audio bitstream does not have complex prediction structures (as video does), and the effect of a packet loss is basically related to its duration. Therefore the different types of audio losses differ only in the number of packets that have been lost (it is similar to a packet loss rate / packet loss pattern metric, but with the important distinction that we know that the lost packets are audio packets). The RTP audio packet loss patterns used in the subjective assessment tests are described in Table 4.3.

Figure 4.12: Results of the subjective assessment for Audio Loss impairments
Table 4.3: Audio losses analyzed in the subjective assessment tests.
Code  Duration of the burst
N     0 (hidden reference)
A1    1 packet
A2    500 ms
A3    2 s
A4    6 s
The results obtained from the tests are shown in Figure 4.12, differentiating the three content sources under study: the action movie in blue, the football match in yellow, and the documentary in red. The global average value is also displayed, together with its confidence intervals. The results are stable and consistent with other research on the topic [79]: the longer the loss, the higher the severity. Isolated one-packet audio losses seem to be admissible under real viewing conditions. The acceptability of short bursts (up to 500 ms) depends strongly on the selected content: it is acceptable in the soundtrack of a movie, but not in the narration of a sports match. Long bursts (2 seconds or longer) are unacceptable in all cases.
Since A1, A2, A3 and A4 belong to the same “impairment set”, it is possible to compare their results segment by segment. This is shown in Figure 4.13, which confirms the conclusions mentioned before. In this case, since the audio structure is simpler and the original audio quality is, as in a real deployment, high enough for the purpose, the probability of having clear outliers is low.
Figure 4.13: Detailed results for each of the individual segments for Audio Loss
4.4 Coding quality and rate forced drops
Another relevant element for the Quality of Experience is the multimedia quality obtained at the end of the encoding process: the coding quality. The coding quality is important for the overall QoE, but it is not so critical for a monitoring system, for two main reasons. On the one hand, its impairments are less frequently reported by the users than the ones produced by packet losses [7]. On the other, the target coding quality is something that must be controlled in the design phase of the service, when selecting the encoder which is going to be used and the conditions, especially bitrate, under which it is going to work. But at runtime, there should be fewer unexpected events in the encoder than in the access network, for instance.
When considering coding quality, we will focus only on the video stream, and not on the audio. The reason is that, while both of them contribute similarly to the final multimedia quality [90], video requires much more bandwidth than audio [6] and, as a result, video encoders will be working under more stressful conditions.
In this section we will study the coding quality from two different perspectives. First
we will explore the options to control or estimate the coding quality using simple RR
or NR metrics (with a chance to be applicable in the QuEM framework). Then we will
analyze different scenarios of strong quality drops, such as the ones produced when the stream jumps from one bitrate to a much lower (or higher) one. This scenario is typical
of OTT services using HTTP adaptive streaming.
4.4.1 Analysis of feature-based RR/NR metrics as estimators of video
coding quality
The first step in the analysis of video quality has been trying to find out whether it is possible to estimate the perceived coding quality (or, at least, some salient impairments)
from elementary Reduced-Reference or No-Reference metrics performed in the pixel
domain. The main reason for that is trying to build a quality estimator that can be of
use in scenarios similar to the ones proposed in our QuEM architecture.
The approach taken to this problem has been analyzing several NR and RR metrics from
the literature. Those metrics have been applied to video at contribution quality (high-
quality recordings from television content, obtained directly from the television studios
in uncompressed D1 format), and to the result of encoding them with commercial H.264
video encoders at different bit rates. The obtained values have been compared to the
outputs of subjective assessment tests done for the same video segments.
The work described in this subsection 4.4.1 was done during the first steps of the re-
search activity of this thesis [81], before the development of the QuEM strategy and its
associated subjective assessment test methodology, described in chapter 3. Therefore,
the subjective tests referenced in this subsection, described in Appendix A.3, are different
from the QuEM-based subjective tests used in the rest of this chapter, and described in
Appendix A.2. The experiments, main results, and conclusions are described now.
4.4.1.1 Metrics under study
The aim of the experiment is to determine whether it is possible to detect degradations
in the video quality by using lightweight Reduced Reference (RR) and No Reference
(NR) metrics. Most RR metrics are based on comparing some image features before
and after the impairment process. These features usually model amount of movement
and spatial detail. NR metrics are normally based on the detection of known artifacts
produced in the coding process, such as blocking or blurring [121].
To compare different possible strategies homogeneously, we will extract the same features from the original and the processed (impaired) sequences, and measure their relative degradation, averaged along time:

$$M = \operatorname{mean}_t \left( \frac{\left| X[F_{orig}(t)] - X[F_{proc}(t)] \right|}{X[F_{orig}(t)]} \right) \qquad (4.9)$$
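Equation (4.9) is straightforward to implement once per-frame feature values are available. A minimal sketch, assuming the features are given as numeric time series:

```python
import numpy as np

def rr_degradation(feature_orig, feature_proc):
    """Eq. (4.9): relative degradation of a feature, averaged along time.

    feature_orig, feature_proc -- per-frame feature values X[F(t)] for the
    original and the processed (impaired) sequences.
    """
    orig = np.asarray(feature_orig, dtype=float)
    proc = np.asarray(feature_proc, dtype=float)
    return float(np.mean(np.abs(orig - proc) / orig))
```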
Four groups of features have been compared: spatial information (obtained from sev-
eral RR metrics), temporal information (from RR metrics as well), blocking (from NR
metrics), and blurring (from NR as well).
Different feature extractors have been considered for spatial information (or texture):
• Le Callet et al. [63] propose a pair of complementary measures based on intensity
and direction of borders, which they call GHV and GHVP. They compute GHV as
the average magnitude of intensity gradient for all the pixels in which this gradient
is horizontal or vertical, and GHVP as the average magnitude of intensity gradient
for all the pixels in which this gradient is neither horizontal nor vertical.
• The BTFR metric in ITU-T J.144 [45] includes a texture measure computed as the zero-crossing rate of the horizontal gradient.
• Saha and Vemuri [98] propose using the average value of absolute vertical and
horizontal differences, which they call IAM4.
• Webster et al. [117] propose a Spatial Information feature (SI), defined as the
standard deviation of the Sobel-filtered frame.
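As an example of one of these feature extractors, the sketch below computes Webster's SI feature as we read its definition (standard deviation of the Sobel-filtered frame); combining the two Sobel directions as a gradient magnitude is our assumption.

```python
import numpy as np
from scipy import ndimage

def spatial_information(frame):
    """Webster's SI feature: standard deviation of the Sobel-filtered frame.

    frame -- 2-D array of luma values.
    """
    luma = frame.astype(float)
    gx = ndimage.sobel(luma, axis=1)  # horizontal gradient
    gy = ndimage.sobel(luma, axis=0)  # vertical gradient
    return float(np.std(np.hypot(gx, gy)))
```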
When characterizing temporal variations, there is less diversity of metrics in the litera-
ture. We will consider Le Callet’s Temporal Information (TI), defined as the energy of
the difference image along time [63].
Regarding the blocking effect, we have studied three of the most frequently cited metrics:
• GBIM (Generalized Block-edge Impairment Metric) [122]. It measures the differences between both sides of the block boundary (which must present a regular and well-known pattern); a simplified version is sketched after this list.
• Vlachos metric [110], which uses a method based on the spectral analysis of the
pixels in block boundaries.
• Wang metric [115]. It analyzes the Fourier transform of the image to detect energy peaks at multiples of the inverse of the block period.
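The simplified blockiness sketch promised above: it only measures the mean absolute luma difference across vertical 8-pixel block boundaries, omitting the activity-based weighting that GBIM applies.

```python
import numpy as np

def blockiness(frame, block=8):
    """Mean absolute luma difference across vertical block boundaries.

    A strong regular pattern at 8-pixel boundaries indicates blocking; GBIM
    additionally normalizes by local masking activity, omitted here.
    """
    luma = frame.astype(float)
    left = luma[:, block - 1::block]   # last column of each block
    right = luma[:, block::block]      # first column of the following block
    n = min(left.shape[1], right.shape[1])
    return float(np.mean(np.abs(left[:, :n] - right[:, :n])))
```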
The other relevant artifact to study is blurring. Most blurring metrics are based on the
measurement of the average width of borders in the image [21]. We have selected the
implementation proposed by Marziliano et al. [68].
Finally, we have also included two basic measures: global brightness (mean value of
intensity) and global contrast (standard deviation of intensity).
4.4.1.2 Evaluation
Reference data to benchmark these video quality metrics were obtained from the results
of a study of subjective quality for real-time H.264 encoders, described in Appendix
A.3. The same sequences used for the subjective tests were provided as input for all the
feature extractors described in the previous subsection.
Reduced-Reference metrics were obtained for all the features by applying equation (4.9).
Besides, the blocking and blurring metrics were also considered as individual No-Reference metrics, just by computing their average along each test sequence.
The output of all the metrics, both RR and NR, was compared with the MOS obtained from the subjective tests, to check whether any of the features under study could be a reasonable predictor of MOS variations. Pearson correlation and Spearman
rank correlation (with p-test) were computed. Results are shown in Table 4.4.
Table 4.4: Comparison of NR/RR results with subjective tests
Metric      Pearson  Spearman  p-test
Brightness  0.41     0.44      OK
Contrast    0.61     0.64      OK
The denting component performs exactly this process: based on a configuration parameter (target bitrate, target frame rate, “remove all B”, etc.) it sends to its output the same media received at the input, except for some video frames which are carefully selected to meet the desired requirements. Due to the encoding properties of most codecs, video frames can usually not be removed arbitrarily, because the absence of a frame may prevent other frames which remain in the stream from being properly decoded. For this reason the denting component requires deep information about the video frames, not only about their boundaries but also about their decoding hierarchy. Padding packets can also be removed by the denting component, but non-audio/video streams (application data, teletext, subtitles, etc.) should only be removed if explicitly allowed by configuration parameters.
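A sketch of the selection logic behind denting, for the hierarchical “. . . IBBBP. . . ” structure discussed in section 2.4.1: non-reference b frames are dropped first, then reference B frames, and I and P frames are never touched. The frame representation is an assumption of ours, not the component's actual data model.

```python
def frames_to_drop(frames, target_ratio):
    """Choose frames to drop so that every remaining frame is still decodable.

    frames -- list of dicts like {"id": 3, "type": "b"}, in decoding order;
              an assumed representation of the stream's frame hierarchy.
    target_ratio -- fraction of frames to drop (e.g. 0.5 for "1/2 dropped").
    """
    # Drop non-reference b frames first; once they are all gone, reference B
    # frames can be dropped too (nothing left depends on them). I and P frames
    # are kept, since the rest of the GOP needs them for prediction.
    candidates = ([f for f in frames if f["type"] == "b"] +
                  [f for f in frames if f["type"] == "B"])
    n_drop = min(len(candidates), round(len(frames) * target_ratio))
    return {f["id"] for f in candidates[:n_drop]}
```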
Denting can be used in the Edge Server to dynamically generate lower-bitrate versions
of the main stream, either to create or enhance HAS structures or to reduce the bitrate
of a unicast transmission between the Edge Server and the user terminal. In particular,
denting has been successfully used in Fast Channel Change solutions to increase the
apparent bitrate of the unicast session without effectively allocating a higher bitrate for
it.
These quality drops (reducing the bitrate and denting) have also been included in the
subjective assessment tests described in Appendix A.2. Table 4.5 shows the different
values considered. R1 and R2 are a reduction of 50% and 75% of the bit rate. F1 and
F2 are a reduction of 50% and 75% of the frame rate. The effective bitrate reduction
of F1 and F2 depend on how the video was encoded. However, typical values for the
content assets under study are about 25-30% of bitrate reduction for F1, and 35-50%
for F2.
Table 4.5: Quality drops analyzed in the subjective assessment tests.
Code  Type     Description
N     n/a      Hidden reference
R1    Bitrate  Bitrate reduced to 1/2
R2    Bitrate  Bitrate reduced to 1/4
F1    Denting  1/2 of all frames dropped
F2    Denting  3/4 of all frames dropped
The results of the subjective assessment tests for these impairments are shown in Figure 4.15.

Figure 4.15: Results of the subjective assessment for Rate Drop impairments

Figure 4.16: Detailed results for each of the individual segments for Rate Drop

The following conclusions can be obtained:

• The results of the hidden reference are high. This means that coding defects introduced at the reference quality are perceived as much less severe than other defects (forced quality drops in this case, but also other defects considered in other sections).
• The impact of this kind of impairment depends on the source content, at least up
to some point.
• In general, the quality variations between bitrates are relevant (and between frame rates as well). However, their specific impact differs from one asset to another, and from one segment to another. This is better shown in the comparison within the “impairment set” formed by R1, R2, F1 and F2, in Figure 4.16.
• Denting has a higher impact on the perceived quality than the drop in coding quality, which was expected, as in the latter case the quality-rate trade-off has been optimized by the encoder, while in the former it has not.
4.5 Outages
All the issues considered so far are caused by isolated errors. Now we will analyze a different case: outage —the loss of service for a period of time. The relevance of this case is that users sometimes report errors which are described as a complete stop in the video playout, sometimes only recoverable after a reboot of the user terminal [7]. Any system that monitors the global QoE must be aware of this kind of error since, although such errors are less frequent than the ones caused by isolated packet losses, they have a higher impact on the final quality.
Outages can be roughly classified into two categories: “short” and “long”. By “long” outages we understand those caused by service unavailability for several minutes or hours. The most typical example is a software problem in the user terminal, but there can be more severe situations (such as a critical failure in the delivery equipment, for instance). “Short” outages are the ones caused by a brief stop (some seconds) in the video service delivery, typically caused by discontinuities in the service, or by an issue in the delivery equipment followed by a recovery of the service from a redundant one.
“Long” outages should always be monitored and managed by the Service Provider and are, in fact, outside the scope of our work. The impact of having no service at all is not easy to measure on the same scale that we are considering. We will focus exclusively on the detection and impact measurement of “short” outages.
4.5.1 Detection of outages
The outage can happen in the contribution (detectable in the headend), in the core
network (detectable in the PoP), or in the access network (detectable in the HNED,
maybe with the help of the Edge Server).
If it happens in the contribution, it should be monitored by continuity monitors in
the headend. An effective way to do it is using the VODA algorithm proposed by
Reibman and Wilkins [94]. This algorithm detects an outage when there is as sudden
and simultaneous drop of three different factors: average brightness (i.e. the picture
changes abruptly to black), space information, and audio signal power. The three factors
must also remain low for some seconds for the outage to be detected.
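As an illustration, a minimal sketch of such a detector in Python, assuming that per-second measurements of average brightness, spatial information, and audio power are already extracted from the decoded signal; the thresholds and hold time are illustrative, not the values of [94]:

# Illustrative thresholds and hold time; a real deployment would
# calibrate them (they are not the values used in [94]).
BRIGHTNESS_TH = 20.0   # average luma, 0-255 scale
SI_TH = 5.0            # spatial information
AUDIO_TH = 1e-4        # normalized audio signal power
HOLD_SECONDS = 3       # all factors must stay low this long

class OutageDetector:
    """Flags an outage when brightness, spatial information and audio
    power are simultaneously low and remain so for HOLD_SECONDS
    consecutive one-second samples."""

    def __init__(self):
        self.low_run = 0   # consecutive seconds with all factors low

    def update(self, brightness, si, audio_power):
        all_low = (brightness < BRIGHTNESS_TH and si < SI_TH
                   and audio_power < AUDIO_TH)
        self.low_run = self.low_run + 1 if all_low else 0
        return self.low_run >= HOLD_SECONDS   # True -> outage detected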
If the outage happens in the network, it will be an extreme case of packet loss with high impact (the loss of several seconds' worth of video and/or audio), which can normally be detected with packet loss effect estimators (and probably with simpler packet loss detectors).
Additionally, short outages in the contribution can be detected in the coded stream
(with less accuracy, though this can be enough for our purposes) by monitoring the global
video and audio signal level:
• For video, with the analysis of the frame size and structure (coded long freezes have almost zero-byte P and B frames); see the sketch after this list.
• For audio, either from the analysis of energy values for each sub-band (exact) or
with the analysis of the dynamic range compression parameters, when available.
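A minimal sketch of the video-side check, assuming per-frame coded sizes and frame types have already been extracted from the transport stream; the thresholds are illustrative:

NEAR_ZERO_BYTES = 200   # illustrative "almost zero-byte" threshold
MIN_RUN_FRAMES = 50     # about 2 s at 25 fps before flagging

def coded_freeze_detected(frames):
    """frames: iterable of (frame_type, coded_size_bytes) in decode
    order. A long run of near-empty P/B frames is the signature of a
    freeze already encoded at the contribution."""
    run = 0
    for ftype, size in frames:
        if ftype in ("P", "B") and size < NEAR_ZERO_BYTES:
            run += 1
            if run >= MIN_RUN_FRAMES:
                return True
        else:
            run = 0
    return False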
4.5.2 Subjective impact of outages
Some outage events have also been included in the subjective assessment tests described
in Appendix A.2. Table 4.6 shows the different values considered: stops of 2 and 6
seconds for audio and video (or both). The results are shown in Figure 4.17, with the comparison of the impairment set A4, V3, AV in Figure 4.18.
In general, and for the same sequence, the longer the outage, the worse the perceived
quality. However, the specific impact and the relative importance of video and audio is
quite dependent on the specific content.
Table 4.6: Outage events analyzed in the subjective assessment tests

Code   Outage Duration   Elementary Stream Affected
A3     2 s               Audio
A4     6 s               Audio
V2     2 s               Video
V3     6 s               Video
AV     6 s               Both

Figure 4.17: Results of the subjective assessment for Outage impairments

Figure 4.18: Detailed results for each of the individual segments for Outage
4.6 Latency
A final QoE factor to consider is latency. Latency issues are usually disregarded in many
QoE analyses, because they are only perceived in very specific scenarios. However, the
study of latency is relevant for two different, but related, reasons. On the one hand, as discussed in section 2.4, the scenarios where latency is relevant —mainly live sport events— are important enough to make latency a meaningful QoE element. On
the other hand, there is a trade-off between latency and other QoE components that
makes it difficult to have low-latency video delivery services without compromising the
perceived quality. These trade-offs will be summarized at the end of this section, in section 4.6.3.
Latency will be studied from two different perspectives. First we will analyze the end-to-end latency or lag. Afterwards we will analyze channel change time, which is also a
latency-related scenario with a significant contribution to the overall QoE.
4.6.1 Lag
End-to-end latency or lag refers to the delay observed in the displayed video by the user
with respect to the moment when the event is being recorded. With this definition,
lag only makes sense for live content streams: those which are being watched while
they are being captured. Although it is possible to provide an equivalent definition for
on-demand content, the reality is that lag is only a QoE factor in live events. And even
for live television channels, there are very few cases where lag is really an issue, i.e. where receiving the video with a few additional seconds of delay makes any difference. However, the few cases where lag is important are also important for service providers and users, the most typical ones being sport matches. For those reasons, keeping the lag under control is very relevant for IPTV service providers [70].

Figure 4.19: Simplified transmission chain for real-time video
Lag must be constant end-to-end, to avoid losing video continuity. As such, any protocol layer that imposes timing constraints must also have a constant end-to-end delay, since it cannot assume that the delay variation will be absorbed by the upper layers.
Figure 4.19 illustrates this. Points A and Z represent the decoded video stream. In the absence of errors, the video reproduced at A and Z should be identical, and therefore the delay between those points, $T_{AZ}$, must be constant.
A first component of this delay is introduced by the encoding process, and it is due to two main causes. On the one hand, coding video using frame prediction normally implies that frames are encoded and transmitted in a different order than they are displayed, to allow the use of bidirectional prediction. On the other hand, this kind of compression also makes the size, in bytes, of the different frames vary strongly from frame to frame and over time. This generates local peaks of bitrate that normally need to be smoothed before transmission, introducing additional delay, to comply with bandwidth restrictions. These two sub-components of the video delay are introduced by the encoder and depend only on coding decisions (and therefore can be known at point B).
MPEG-2 Transport Stream allows the encoder to manage the coding delay end-to-end. The transport stream includes a clock signal called PCR (program clock reference), which indicates the rate at which the coded stream is produced at point B and, therefore, the rate at which it is expected to be delivered at point Y. The stream also includes, for each video, audio or data access unit, its presentation time stamp (PTS) in the same clock base. The total encoder-decoder delay $T_{AB} + T_{YZ}$ is constant. This way, if the network is able to keep a constant delay $T_{BY}$, the end-to-end delay $T_{AZ}$ will be constant as expected.
However, the real delay in the transmission network $T_{CX}$, which is an IP network, cannot
be guaranteed to be constant. Therefore network elements are introduced to control the
network ingestion and the reception in the user terminal to flatten network jitter and
also to manage error correction protocols.
The delays introduced by server-side elements and by the decoder ($T_{AB} + T_{BC} + T_{YZ}$) are established by the network design and known a priori by the service provider. The network buffer $T_{XY}$ depends on the implementation of the user terminal, and it is normally set individually for each video session. Once it is established, however, the end-to-end network delay $T_{BY}$ will remain constant for the whole video session, and therefore each video packet whose jitter exceeds this buffer will arrive too late to be sent to the decoder, and will be considered a network loss. Therefore, when establishing the length of the network buffer, there is a trade-off between end-to-end delay and packet loss probability.
Additionally, if the video multiplexing format is the ISO File Format, it does not include transport timing information equivalent to the PCR. In that case, the user terminal must set the value of $T_{YZ}$ arbitrarily for the first decoded video frame, and assume that it will be enough to present every frame on time from then onwards. As a result, buffer sizes are normally overdimensioned, to avoid buffer emptying events, at the cost of suffering a higher lag. This overdimensioning is also generally applied to the network buffer $T_{XY}$, especially in the case of Over The Top services (where network capacity variations can be very strong).
4.6.2 Channel Change time
We will define channel change time (or zapping time) as the time between the moment
when the end user presses a “channel change” key in their user terminal and the instant
when the new channel (video and audio) starts playing on their screen. This time can
be divided into the following components:
$$T_{CC} = T_{term} + T_{net} + T_{buf} + T_{vid} \qquad (4.10)$$

where
• $T_{term}$ is the delay between the user key stroke and the moment when the user terminal effectively requests the new video stream from the network (by issuing an IGMP join, an HTTP request, or whatever is suitable for each scenario).

• $T_{net}$ is the delay between the moment the new video is requested and the moment the first byte of the new stream arrives back at the user terminal.

• $T_{buf}$ is the time needed to fill the network buffer in the user terminal.

• $T_{vid}$ is the time needed to present the first video frame at the decoder output.
From the analysis done in the previous subsection, it follows immediately that $T_{buf}$ is equal to $T_{XY}$ as depicted in Figure 4.19. $T_{vid}$ abstracts all the delay introduced by the video stream on the decoding side. It can be inferred by analyzing only the video stream, it depends only on the encoding process, and it can be modeled as:
$$T_{vid} = T_{RAP} + T_{dec} \qquad (4.11)$$
$T_{RAP}$ is the time that the decoder has to wait to reach a Random Access Point (RAP). A RAP is a specific point in the video stream where it is possible to start decoding, which corresponds approximately to the beginning of the intra-coded frames. Therefore $T_{RAP}$ can be modeled as a random variable uniformly distributed between 0 and the intra frame period $T_I$, whose mean value is $T_I/2$.
$T_{dec}$ is the interval between the RAP and the moment when the frame can be presented to the user. It is equal to the stationary delay of the video decoder, i.e. $T_{YZ}$ in Figure 4.19. It represents the decoding part of the end-to-end coding delay for each of the media components (audio, video, and data) and, in MPEG-2 Transport Stream, it is:

$$T_{dec} = PTS - PCR$$
It is relevant to note that the value of $T_{dec}$ will, in general, be different for each of the elementary streams. Even though the end-to-end delay ($T_{AB} + T_{YZ}$) is constant and equal for all of them, the part of the delay left to the decoder ($T_{dec} = T_{YZ}$) usually varies strongly from one component to another. A typical example taken from a commercial encoder is shown in Figure 4.20: the audio $T_{dec}$ is constant and below 100 ms, while the video $T_{dec}$ varies over time between approximately 800 and 1400 ms.

Figure 4.20: Decoding delay (PTS-PCR) in milliseconds for video (blue) and audio (red) components of an MPEG-2 Transport Stream, and its variation along time (in seconds)
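The measurement behind Figure 4.20 reduces to a subtraction in the common 90 kHz clock base; a minimal sketch, assuming the PCR (base part) and PTS values have already been demultiplexed from the Transport Stream:

TS_CLOCK_HZ = 90000   # common PTS/PCR clock base (90 kHz units)

def decoding_delay_ms(pts, pcr_base):
    """T_dec for one access unit: distance between its presentation
    time stamp and the program clock reference at the moment the unit
    arrives, in milliseconds. E.g. a PTS 90000 ticks ahead of the
    current PCR corresponds to a stationary delay of 1000 ms."""
    return (pts - pcr_base) * 1000.0 / TS_CLOCK_HZ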
With these elements, it is possible to build a QuID which monitors the channel change
time in the network in the following way:
• $T_{term}$ and $T_{buf}$ depend on the user terminal implementation, which is the only point where they are available. However, they are normally quite stable, so they can be known a priori and introduced into the model as parameters.

• $T_{net}$, $T_{RAP}$ and $T_{dec}$ can be easily monitored in the network; a sketch of the resulting estimator follows.
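A minimal sketch of such an estimator, combining the a-priori terminal parameters with the values monitored in the network and taking $E[T_{RAP}] = T_I/2$ as derived above (function and parameter names are illustrative):

def expected_channel_change_s(t_term, t_net, t_buf, t_i, t_dec):
    """Expected zapping time, eq. (4.10), with E[T_RAP] = T_I / 2
    from the uniform model behind eq. (4.11).
    t_term, t_buf: terminal-side parameters, known a priori (s).
    t_net:  measured network request/response delay (s).
    t_i:    intra frame period of the monitored stream (s).
    t_dec:  decoder delay, measured in the network as PTS - PCR (s)."""
    t_vid = t_i / 2.0 + t_dec            # eq. (4.11), expected value
    return t_term + t_net + t_buf + t_vid

For instance, with $T_{term} = 0.1$ s, $T_{net} = 0.05$ s, $T_{buf} = 0.8$ s, $T_I = 0.5$ s and $T_{dec} = 1.2$ s, the expected channel change time is 2.4 s.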
It is worth noting that most of the components of the channel change time are frequently sacrificed in the process of enhancing the overall end-to-end quality of experience. In particular, $T_{buf}$, as mentioned in the previous subsection, represents the buffering required to absorb network jitter and to correct packet losses. $T_{RAP}$ and $T_{dec}$ also give the encoder a higher degree of freedom to distribute its bit budget flexibly, according to the coding complexity of the images, thereby optimizing the coding quality. Reducing any of those parameters, which would reduce the channel change time by the same amount, could therefore have undesired side-effects on the global quality.
Unlike the case of the global lag, channel change time is a QoE element which is relevant
for many IPTV deployments, and for all the video channels. However, the mapping of
the channel change events into a global scale of severities (or qualities) is very dependent
on the expectations of the service provider, and there is no standard way to do it. Table 4.7 shows an example that could be used as a reference, based on informal laboratory experimentation.
Table 4.7: Example Channel Change time ranges and their mapping to QoE

Time (s)    QoE description
< 0.4       Very Fast
0.4 – 1     Fast
1 – 2.5     Normal
2.5 – 5     Slow
> 5         Very Slow
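Table 4.7 translates directly into a simple mapping function; a sketch based on those example ranges:

def zapping_qoe_label(t_cc_s):
    """Map a channel change time (seconds) to the example QoE
    descriptions of Table 4.7."""
    if t_cc_s < 0.4:
        return "Very Fast"
    if t_cc_s <= 1.0:
        return "Fast"
    if t_cc_s <= 2.5:
        return "Normal"
    if t_cc_s <= 5.0:
        return "Slow"
    return "Very Slow"

# Continuing the example above: zapping_qoe_label(2.4) -> "Normal"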
4.6.3 Latency trade-offs
Since lag and channel change can be considered relevant elements for the global QoE, we
may ask whether it is possible to improve them by reducing some of their components.
The answer is that it is possible, but at a cost: degrading other QoE factors. We
will show here why.
Regarding end-to-end lag, the encoding latency $T_{AB} + T_{YZ}$ is used to provide a buffer for rate-control operations in the video encoder. Reducing this buffer will impair the video quality that the encoder is able to produce at its output. The network processing delay $T_{BC} + T_{XY}$ provides a buffer to protect the decoder against network jitter. This buffer can be reduced, but only at the cost of increasing the packet loss probability.
The channel change components $T_{buf}$ and $T_{dec}$ are $T_{XY}$ and $T_{YZ}$ respectively, so the same considerations apply. $T_{RAP}$ is also a design parameter for the encoder: if it is reduced, the frequency of I frames will increase, which will degrade the video quality (assuming, as we do, that the bitrate is kept constant).
The rest of the delay components are limited by the technology itself, and are normally
outside the control of the service provider:
• $T_{CX}$ and $T_{net}$ depend on the performance of the communication network.

• $T_{term}$ depends on the performance of the user terminal software.
As a conclusion: there is a strong relationship between the latency and the video quality
components of the QoE. Therefore latency should always be controlled in any multimedia
delivery service. Even in the cases where lag or channel change are not important by
themselves, managing latency parameters is always a good strategy. Service providers
should be aware that reducing those latency elements in the future will always come at the cost of putting the video quality at risk.
4.7 Mapping to Severity
One of the most complex problems to solve when managing a QoE monitoring system in a large multimedia service deployment is the comparison and aggregation of a large quantity of data. In our QuEM model, this problem is addressed by referring all the measures to a common severity scale and synchronizing the measurement windows, so that one single severity value is produced for each monitoring period at each monitoring point (section 3.3.2). These values should then be processed statistically according to the needs of the monitoring service, with the particularity that, even though the aggregated value has meaning only in terms of average severity, each of the individual impairment events is easily traceable to a qualitative description of what happened.
Each QuEM system should be calibrated according to the specific needs of the service
provider, and should also be modified during the operation phase with the feedback
retrieved from the field. The best way to calibrate the different QuID elements to
produce severity values is by performing subjective quality assessment tests such as the
ones described in section 3.4. This way, each service provider can feed the tests with the type of content and impairments that best fit their deployment, keeping the Severity Transfer Functions completely under their control.
The results of the subjective assessments described in Appendix A.2 can provide an initial approach to the problem, which should be used as a starting point for real deployments of a QuEM infrastructure.
Figure 4.21 shows a summary of the different results that have been discussed along this
chapter. The most relevant conclusions for each type of error have already been discussed, but we can summarize them as follows:
• Video packet losses can have very different effect depending on the part of the
stream which is lost. We have proposed a simple but effective metric (PLEP) to
model this variability.
• Audio packet losses depend mostly on the packet loss rate and pattern. We have
also modeled this in our proposal for audio loss QuID.
• Bitrate is a reasonably good proxy to monitor video coding quality in the context
of a QuEM system. The comparative effect of bitrate change and denting has been
studied. The former technique has less impact than the latter in the final QoE,
but it requires generating and transporting the different versions of the content
stream from the headend to the network edge.
• Outages can be monitored as more severe versions of the rest of the impairments, but they must be considered separately because of their high impact on the perceived quality.
• Latency effects (end-to-end lag and channel change) have to be taken into account, both for their impact on the final QoE and for their relationship to other quality issues.

Figure 4.21: Results for all the QuIDs mentioned in the chapter
Besides, the cross-analysis of different QuIDs can also provide some additional ideas:
• In case of network congestion or any other error situation, the decision of which packet or packets to discard is critical for the final impact on the Quality of Experience. Losing all the no-reference frames for six seconds (F1) has an impact similar to losing all the audio during only half a second (A2) or having relevant macroblocking (90% of the picture) for half a second (E4), and is even better than any of the video screen freezes (V1-V3). All those impairments are produced by the loss of fewer packets than F1.
• Video freezing is probably the worst artifact (relative to the minimum loss burst
needed to produce it). For this reason, it should be avoided by any means. This
is especially relevant in scenarios where the network buffer is small because a low
latency is required. In such cases, countermeasures such as bitrate drop or frame rate drop are preferable to an empty buffer resulting in the loss of the video and audio signal.
4.8 Conclusions
This chapter has presented strategies to monitor all the relevant sources of quality im-
pairments in multimedia delivery services. We have proposed metrics to analyze the
effect of packet losses in video and audio, which are currently the most frequent errors
in multimedia services; and in particular in IPTV. We have also covered the analysis and
monitoring of media coding quality, with a special focus on the strong bitrate variations
which are typical of OTT scenarios. Finally, we have analyzed the causes and effects of service outages, as well as the effects of latency on the final QoE.
All the metrics proposed in this chapter can be integrated as Quality Impairment De-
tectors in the QuEM architecture described in chapter 3. Besides, we have analyzed a set of subjective quality assessment test results which support the selection of QuIDs and provide relevant information about the relative severity of the errors under study.
The ideas discussed in this chapter suggest that, with the right knowledge of the effect of network events on QoE, it is possible to design network systems whose policies are
optimized towards the final perceived quality. The next chapter will present and discuss
some of these applications.
Chapter 5
Applications
5.1 Introduction
This chapter describes applications which, by making use of the knowledge obtained
in previous chapters about the Quality of Experience, can enhance the functionality of
existing multimedia delivery services. In fact, some of the applications described here
have been applied to products and services which are currently deployed in the field.
Section 5.2 describes a variation of the Packet Loss Effect Prediction model which can
be used to establish packet priorities in a video communication network, supporting Unequal Error Protection schemes which make the best use of the error correction capabilities of the network.
A similar idea is applied in section 5.3 to an HTTP Adaptive Streaming scenario. By
composing HAS segments in priority order (instead of in the traditional decoding order),
it is possible to react better to dynamic variations in the network effective bandwidth
without needing to increase the buffering delay excessively.
Section 5.4 describes a selective scrambling algorithm which can be used to efficiently
protect video content in scenarios where the processing power of the deciphering elements
is small. By selecting for encryption only the most relevant packets (with respect to their impact on the QoE) it is possible to get very effective protection with a low packet scrambling rate.
Section 5.5 proposes a solution to overcome the channel change limitations described in
section 4.6.
Finally section 5.6 discusses the application of the results to stereoscopic video.
5.2 Unequal Error Protection
Not all packet losses have the same impact on the QoE. For instance, the effect
of isolated packet losses in perceived video quality depends on several factors, such as
coding structure (the type of prediction in the frame or the part of the frame which gets
lost), camera motion, or the presence of scene changes, among others [86, 93]. When the
number of errors grows, the effects of those factors tend to compensate among them, so
that the impact of random errors depends mainly on packet loss rate [95] and loss burst
structure [124]. Audio packet losses have a strong impact on the perceived quality, depending mainly on the frequency and length of the bursts of lost packets, with no
significant differences between individual packets [79, 84]. When they are studied jointly,
video errors seem to be more acceptable than audio errors, except for high error rates
[57].
Most of the studies mentioned so far analyze the effect of packet losses for relatively high
loss rates. In practical situations, however, real-time video services provide a quality of
experience resulting in less than one visible error per hour, with users showing sensitivity
to higher impairment rates [7]. In terms of network quality of service, it means that
only a few packet loss bursts per hour are allowed, at most.
Home networks typically have error rates which are some orders of magnitude above
these figures, especially in the case of Wi-Fi (802.11) [97]. If the media stream is to be
delivered through the home network, the residential gateway must provide some kind
of error correction mechanism (FEC or ARQ) in order to keep the required level of
service. This protection is performed at the cost of introducing end-to-end delay in the
transmission chain [61], as well as increasing the required bandwidth.
The understanding of how packet loss can affect video and audio quality has been used
to propose several unequal error protection (UEP) schemes, where packets with a higher impact on quality are protected better [29, 66]. This allows keeping a good QoE without an excessive increase in the required protection and, consequently, in the additional delay introduced. However, they usually require an in-depth video analysis which is difficult
to integrate in cost-effective consumer electronic devices. Lightweight UEP designs also
exist, but they usually focus on the characteristics of the loss patterns and use limited
approaches to characterize the priority of the packets [12, 71].
We have shown in the previous chapter that, even with its limitations, the PLEP model
we describe is a promising approximation for blind packet loss effect estimation. How-
ever, it is based on reading and building a reference frame list for each frame. Even
Simple as it is, this could be too expensive for some applications, such as packet
QoS policies applied in routers, and it may require the use of information which is not
available in real service deployments, perhaps because the elementary video stream is
completely scrambled.
Here we will show how it is possible to strongly reduce the effect of packet losses by
applying a simplified version of the PLEP metric to label video packet priorities (and
even using a low number of bits to encode them). This technique can be applied to
congestion control in home gateways or buffer management in dynamic HTTP adaptive
streaming. In addition, it can improve other lightweight UEP schemes by enriching
their characterization of the video sequence. This approach requires low processing
capabilities while clearly outperforming a random packet drop.
The solution specifically addresses short-term protection decisions, where the error cor-
rection system has to decide which packets to protect (or which ones to drop) within
a short window of time. Thus it is especially suitable for real-time multimedia trans-
missions. This solution is applicable not only to error correction, but also to congestion
control.
5.2.1 Priority Model
5.2.1.1 Effects of packet losses
The priority model proposed is based on the fact that not all the video packets contain
the same kind of information and, therefore, the loss of different kinds of packets will produce different effects on the perceived video quality. In fact, even the loss of a single
video packet can produce a wide range of different effects, depending on the kind of
packet which is lost.
There are several factors which influence the effect of a single packet loss. They can be roughly classified into two sets: content-based (camera motion, scene changes...) and coding-based (type of video frame, position of the packet within the frame...). Only the latter are considered in this approach, since they are the ones which can be easily identified in the analysis of the coded media stream. It will be shown later that they suffice to provide good UEP performance.
The factors considered are based on the following previous knowledge:
1. The effect of a loss is higher when it is produced in a reference frame (a frame
used by the encoding system to predict the following ones), because the error will
propagate to the frames which have it as reference [95].
Table 5.1: Priority value for each slice type

NALU Type           $P_S$
IDR (I)             1
Reference (R)       0.5
No-Reference (N)    0
2. If a packet in the middle of a video slice is lost, then the rest of the slice gets
lost too, as the decoder cannot easily re-synchronize in the middle of a slice. This
is especially relevant in H.264 video, where most commercial encoders use a low
number of slices per frame (typically one). In such cases, the sooner the error occurs within a frame, the higher its impact [86].
3. If packets are lost in two different frames, their contribution to the final error (in
terms of mean square error, MSE) can be considered to be the sum, as errors are
typically uncorrelated [95].
4. Audio packet loss effects are basically related to the length and structure of the loss burst, with no meaningful differences between individual audio packets [79, 84].
5.2.1.2 Packet Priority
A packet priority model is proposed in order to assign a higher priority to packets whose loss is going to produce a stronger effect on the QoE. The model is based on the type of video slice carried by the packet and on the position of the packet within the slice (assuming that a video slice is typically carried in several transport packets). As mentioned before, losses have a higher effect in reference slices than in no-reference ones, and at the beginning of the slice and of the GOP, where error propagation effects are stronger [66, 86].
The priority model is defined as follows:
$$P = \alpha P_S + \beta H + \gamma T_S + \delta T_G \qquad (5.1)$$
where $P_S$ is the priority of the slice type as described in Table 5.1, $H$ is a flag indicating whether the packet contains a NALU (Network Abstraction Layer Unit) header, $T_S$ indicates the number of packets until the next slice in the stream, and $T_G$ is the number of packets until the next I frame. All the parameters are normalized between 0 and 1. According to their relevance, the following coefficients are selected: $\alpha = 10^3$, $\beta = 10^2$, $\gamma = 10$, $\delta = 1$.
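A minimal sketch of equation (5.1) with the Table 5.1 values, assuming the NALU-level fields have already been parsed; the normalization of $T_S$ and $T_G$ by slice and GOP lengths in packets is one possible choice, as the text only states that the parameters are normalized to [0, 1]:

# Slice-type priorities from Table 5.1 and the coefficients of eq. (5.1).
P_SLICE = {"IDR": 1.0, "REF": 0.5, "NONREF": 0.0}
ALPHA, BETA, GAMMA, DELTA = 1e3, 1e2, 10.0, 1.0

def packet_priority(slice_type, has_nalu_header,
                    pkts_to_next_slice, pkts_to_next_i,
                    slice_len_pkts, gop_len_pkts):
    """Eq. (5.1): P = alpha*P_S + beta*H + gamma*T_S + delta*T_G,
    with T_S and T_G normalized to [0, 1] by the (assumed) slice and
    GOP lengths in packets."""
    h = 1.0 if has_nalu_header else 0.0
    t_s = pkts_to_next_slice / max(slice_len_pkts, 1)
    t_g = pkts_to_next_i / max(gop_len_pkts, 1)
    return ALPHA * P_SLICE[slice_type] + BETA * h + GAMMA * t_s + DELTA * t_g

def lowest_k(priorities, k):
    """Indices of the k packets whose loss should hurt least: the
    'drop first' (or 'protect last') candidates for a UEP scheme."""
    return sorted(range(len(priorities)), key=priorities.__getitem__)[:k]

The helper at the end shows the intended use: ranking a window of packets so that a UEP module can decide which ones to drop or protect.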
Figure 5.1: Example of the packet priority model applied to one GOP of a coded video sequence
Figure 5.1 shows an example of the application of the model to a sequence of video
packets in transmission order. Each box represents an RTP packet, while different colors
represent different frames. The figure shows all the elements of the prioritization model.
$P_S$ depends on the NALU type (IDR, Reference slice or No-reference slice), indicated as I, R or N within the boxes. $H = 1$ (presence of a NALU header) is represented as a black bold frame. Finally, $T_S$ and $T_G$ are shown for the packet marked by the red circle.
Audio packets can be easily introduced in this model just by assigning them a fixed priority value $P = P_A$. In line with the idea that audio losses are more relevant than video ones, except in case of high video degradations [57], $P_A$ is set to 900. This way, audio packets have a lower priority than IDR packets (for $\alpha = 10^3$, $P_A = 0.9\alpha$), but a higher one than any other video packet. Different values could be considered depending on the specific application.
It is important to remark that this is not a scale of priorities, but only an ordering. The intention of the model is to provide a way to sort a group of packets in priority order, so that the higher the priority, the higher the impact of the packet's loss. However, there is no information about the relative magnitudes of the losses.
Another relevant property of the model is that, once the priority for each packet is
known, no more analysis is required. This allows the unequal error protection schemes
to be stateless in the following sense: the decision of whether one packet is protected or not has no effect on the priority value applied to other packets. This significantly simplifies the work of the UEP mechanisms.
Figure 5.2: Implementation of the prioritization model
5.2.1.3 Implementing the model
Figure 5.2 shows the basic implementation modules to apply the described prioritization
model to a video source. As mentioned before, the priority labeling is applied indepen-
dently from the unequal error protection mechanism itself, and before it. To each packet $x$ in the sequence, a priority $P(x)$ is assigned and signaled to the UEP module.
In the specific case of an IPTV scenario, each packet $x$ is an RTP packet containing
H.264 or MPEG-2 video, or MPEG audio (MPEG-1, AAC or similar), over MPEG-
2 Transport Stream. To assign the priorities correctly to the transport packets, it is
necessary that audio and video are carried in different packets. It is also advisable that
no packet carries data from more than one slice; which, for the typical H.264 stream
with one slice per frame, means that no packet should carry data from two or more
different frames. All these conditions are satisfied if the packing of MPEG-2 TS into
RTP is done by the rewrapper described in section 3.5.2.
Priorities assigned to packets can be signaled in the RTP header extension, so that
the network processing elements can read them and use them to apply unequal error
protection techniques. This has the advantage that the extension is transparent to other
RTP receivers, so that the application of priority labels is backwards compatible with
any RTP-aware system. This compatibility has been successfully tested with several
commercial set-top-boxes, and this use of signaling in RTP header extensions is currently deployed in the field in some commercial IPTV systems.
Other implementation options are possible. For example, priorities can be signaled using
different protocols, such as the DSCP bits of the IP header. In such cases, the number
of bits available to encode the priorities can be reduced. The next section will show that even a few bits can be enough to encode the priority in an efficient way.
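For instance, the full ordering can be collapsed into a handful of classes; a sketch of one possible 2-bit quantization, consistent with the coefficient scale chosen above (the thresholds are illustrative):

def priority_class_2bit(p):
    """Collapse the eq. (5.1) ordering into four classes that fit in
    two bits. The thresholds follow the coefficient scale used above
    (alpha = 1000 for IDR slices, P_A = 900 for audio)."""
    if p >= 1000:
        return 3   # IDR slices
    if p >= 900:
        return 2   # audio packets (fixed P_A)
    if p >= 500:
        return 1   # other reference slices
    return 0       # no-reference slices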
One of the main advantages of this model is its simplicity. This makes lightweight
implementations possible: to assign a priority to a packet, only the video NALU header
has to be read and analyzed. This way, the prioritization algorithm can be implemented
in devices with limited processing capabilities, such as home network gateways. In such
cases, the priority labeling and the unequal error protection modules would both reside
in the same hardware device.
5.2.2 Experimentation and results
5.2.2.1 Description of the experiment
To test the performance of the model, three different short video sequences (4-12 sec-
onds), encoded by commercial IPTV encoders, have been selected. They are sequences
A, B and C from Appendix A.4. All of them are encoded in H.264 over MPEG-2 TS and
packed in RTP in the way described before; with each RTP packet containing informa-
tion about part of at most one video frame. Audio is not considered in the experiment.
Within each possible window of $W$ consecutive RTP packets in the sequence, the $K$ packets with the lowest priority are discarded. Then the resulting sequence is decoded, using the repetition of the last reference frame as the error concealment strategy, and the Mean Square Error of the resulting impaired sequence, $MSE_{PRIO}$, is computed.
For the same $W$-packet window, the MSE resulting from randomly dropping $K$ packets, $MSE_{RAND}$, is also computed. The calculation of the random loss is performed by randomly selecting 1000 of all the possible combinations of $K$ lost packets within the window. If there are fewer than 1000 combinations, then all are selected. $MSE_{RAND}$ is obtained as the average of the MSE of each of the (up to) 1000 combinations.
For each window, the MSE gain is computed as

$$MSE_{gain}(\mathrm{dB}) = 10 \log_{10} \left( \frac{MSE_{RAND}}{MSE_{PRIO}} \right) \qquad (5.2)$$
Based on this, an Aggregated Gain Ratio (AGR) can be defined to measure the performance of the model. For each sequence and each pair $(W, K)$, $AGR_{W,K}(G)$ is defined as the proportion of windows whose MSE gain is equal to or greater than $G$, and it is expressed as a percentage on a 0-100 scale.
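Both figures of merit are straightforward to compute; a minimal sketch, assuming the per-window MSE values are already available:

import math

def mse_gain_db(mse_rand, mse_prio):
    """Eq. (5.2): gain of priority-based dropping over random dropping."""
    return 10.0 * math.log10(mse_rand / mse_prio)

def aggregated_gain_ratio(gains_db, g):
    """AGR_{W,K}(G): percentage of windows whose MSE gain is >= G."""
    if not gains_db:
        return 0.0
    return 100.0 * sum(1 for x in gains_db if x >= g) / len(gains_db)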
Table 5.2 shows the values of AGR for some relevant values of MSE gain, $W$ and $K$, for the three sequences under study (A, B and C), summarizing the results of the experiment.
They will be discussed and analyzed in the following subsections.
5.2.2.2 Single-packet loss
The first test considered is the case where $K = 1$, for several values of $W$. For each original sequence it is necessary to individually discard each one of the RTP packets and
then decode and process the result of that individual discard. This way, more than 1500
impaired sequences have been obtained and used for the analysis.
Table 5.2: Values of the Aggregated Gain Ratio for some relevant values of MSE gain, $W$ and $K$
The results for sequence A, $K = 1$ and several values of $W$ are shown in Figure 5.3. Each of the curves refers to a different value of $W$ and represents, for several values of MSE gain, the proportion of the sequences that obtained at least that gain value. The range of values of $W$ is selected to cover typical loss burst lengths in a wireless home
network [97].
Gains of 20 dB in MSE can be reached in from 20% of the cases ($W = 5$) up to 85% ($W = 30$), using window sizes which are reasonable for a home network device. The figure also shows that the longer the window, the better the results, since it is easier to find a low-priority packet within the window.
Figure 5.4 shows some values of MSE for sequence A, $K = 1$, $W = 15$. As can be seen, the MSE varies strongly between different windows along the sequence, independently of
the protection method used. However, focusing on any of the specific windows (any
value in the horizontal axis), using the prioritization method results in lower MSE in
almost all the cases; and in most of them this reduction is very strong. This means that
the specific error will depend heavily on the specific window which is selected but, once
the window is there (i.e., once the error is bound to happen), a good UEP decision can
mitigate the error effect dramatically.
Figure 5.3: Effect of the window size: Aggregated Gain Ratio for $K = 1$ and several values of $W$
Figure 5.4: Values of MSE for some possible windows within sequence A, comparing random packet loss (grey line) with priority-based packet loss (red line) for $K = 1$ and $W = 15$
5.2.2.3 Multiple-packet loss
The second test sets $W$ to a fixed value and analyzes the effect of the burst size by changing the value of $K$. To simplify the implementation of the test bed, the results of the different $(W, K)$ combinations have been derived from the $(W, 1)$ case of the previous section, according to the considerations described in section
5.2.1.1. This way, only the first error within a slice is considered (as the rest of the slice
is lost anyway) and errors in two different frames are assumed to be uncorrelated.
Figure 5.5: Effect of varying the loss burst size ($K$) for a window of $W = 15$ packets
Figure 5.5 shows the results for sequence A and $W = 15$. This value has been selected as representative of the range that was considered in Figure 5.3. Qualitatively, curves for other values of $W$ within that range show similar behaviors. Results from the other
sequences are summarized in Table 5.2.
When the values of $K$ are high, it can be seen that the effectiveness of the model drops, as there is very little margin to select low-priority packets. It is also interesting that the curves gradually reduce their decreasing rate. For example, Figure 5.5 shows that, for $K = 8$, only 10% of the sequences have an MSE gain between 10 and 30 dB, while 20% reach gains over 30 dB.
This behavior is due to the fact that the prioritization method concentrates errors firstly in no-reference frames (versus reference ones) and secondly at the end of the frame (versus the beginning). When the window lies entirely within one frame, the gains against the random loss are limited. However, when the window covers part of two different
the random loss are limited. However, when the window covers part of two different
frames, then the priority strategy concentrates the error in the less-impacting part of
the window, thus reaching high MSE gains. As a consequence, even for severe error
patterns, the prioritization method allows that, in a representative proportion of the
cases, the error effect is negligible.
Figure 5.6: Contribution of each term to the prioritization equation: only $P_S$ (red), $P_S + H$ (green), $P_S + H + T_S$ (cyan), and all of them (blue). Computed for $W = 15$ and $K = 1$
5.2.2.4 Contribution of each priority factor
An additional analysis of the performance of the model is represented in Figure 5.6. It
shows the contribution of each of the terms in equation (5.1) to the aggregated MSE gain
of the method. The red line represents the use of only $P_S$ as the prioritization parameter. The green line then introduces the effect of $H$ in addition to $P_S$. Afterwards the effects of $T_S$ and $T_G$ are added.
Several aspects of the graph are notable. First of all, the very simple prioritization method of just considering the frame type of the packets ($P_S$) can be good enough for some applications. Secondly, the most relevant contribution afterwards is $T_S$, which allows dramatic improvements in performance. Therefore, in addition to $P_S$ and $H$, the parameter $T_S$ should always be considered.
As the scope of the study is focused on the short term, and window sizes are therefore relatively small, there is typically only a small number of frames within each packet window. This is the main reason why the contribution of $T_G$ is so limited in the current scenario. Nevertheless, additional tests show that when the window size is enlarged, the relative weight of $T_G$ increases, supporting the choice of a model with four terms.

On reception, the client starts extracting the fragments into the buffer. Each fragment is put at its right position using the associated information (its sequence number), thereby rebuilding the recovered segment. The buffer may be consumed at the normal pace by the client (no special buffering policy is needed). If the segment is consumed before the whole segment has arrived, there will be gaps in the buffer; but they will occur in the least important positions (the ones with the lowest priority and, therefore, the lowest impact on QoE). Late arrivals are discarded.
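A minimal sketch of this client-side rebuild, assuming each received fragment carries its sequence number:

def rebuild_segment(received_fragments, total_fragments):
    """Put each received fragment back at its original position using
    its sequence number. Positions never received stay None and, by
    construction of the prioritized segment, correspond to the
    lowest-priority fragments."""
    buffer = [None] * total_fragments
    for seq, payload in received_fragments:
        if 0 <= seq < total_fragments:   # ignore corrupt sequence numbers
            buffer[seq] = payload
    return buffer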
Figure 5.8 shows a schematic diagram of the solution in an exemplary scenario. Top line
(both left and right) describes a typical segment transmission and presentation. The bottom line describes a segment transmission and presentation using our solution. The segment
has video frames (I, P, B) and audio (a) frames or, being more general, access units
(AUs). For the present explanation, in order to simplify the figure, it is assumed that
one fragment contains exactly one AU, although each AU can be divided into smaller
fragments if needed.
The left part shows the structure of the segment for transmission. In our solution
(prioritized segment), the fragments have been re-ordered in priority order; but both
segments (top line and bottom line) represent the same content. Note that the prioritized segment contains the same AUs as the regular one, but in a different order.
Now the segment is transmitted but, for some reason, the download (streaming) has to
stop in the middle (i.e. all the data under the highlighted square are lost, because they
have not been received by the end device) and it has to be sent to play out (this is the
presentation part, on the right hand side of the image).
In the regular segment case (top), the answer is simple. The client plays out the first half of the segment, and then stops (black or frozen video, and no audio either). In the prioritized segment case (bottom), it is different: the fragments carry sequence numbers, so the client (end device) re-orders them and displays them in their right positions, and only the least important packets are lost. To simplify: we have all the I and P frames, plus all the audio. The result is that the segment is played out completely, although at a lower frame rate (33%), and with all the audio. Of course, dropping the frame rate and keeping the audio is much better than losing several seconds completely. According to the subjective assessment tests described in Appendix A.2 (see also [27]), there could be a difference of 1 to 3 points on a MOS scale (1 to 5) between both approaches.
It is important to note that the creation of the prioritized segment is a decision that
can be taken prior to the knowledge of the network status between the content server
and the end device. In other words, the prioritized segment is generated once in the
server, and all the end devices download and play it. If there is no network congestion,
the experience will be the same as with the original segment: it will be correctly and
completely displayed. However, if there is a sudden network QoS drop, the end device
will have its prioritized segment available without having to do anything special on the server side.
The solution has thus the following advantages:
• It allows recovering from buffer underruns in HAS in an optimal way. That is, smaller HAS buffers can be used, thereby reducing the latency of the whole HAS solution.

• It works passively, in the sense that neither the server nor the client has to change its default behavior when facing network congestion.

• Besides, it provides a mechanism to mitigate the effect of high network rate variations.

• More generically, it makes it possible to use in HAS all the QoE enhancement technology which has been developed for real-time RTP delivery, such as video preparation for Fast Channel Change, unequal loss protection or selective scrambling; that is, this solution allows using QoE enhancement techniques in a different environment (HTTP delivery).
5.4 Selective Scrambling
The concept of selective scrambling means that, when cryptographically protecting a multimedia asset or stream, only a (typically small) fraction of the data is scrambled, whilst the rest is distributed in the clear. The reasons for such an approach are twofold: on the one side, by leaving some specific information unscrambled, intermediate video-processing systems can access the part of the data which is required for them to work correctly —the rich transport data; on the other, keeping a reduced bit rate of scrambled packets can be the only possible solution for decoding devices with limited computing power, such as user terminals. Addressing the former problem is relatively simple, as the specific data headers required by the network processors are typically well-known. The latter is more interesting, as it is necessary to find a good balance between scrambling rate and protection effectiveness.
5.4.1 Problem statement and requirements
A user who is watching a partially scrambled content asset without being entitled to it (and who therefore does not have the appropriate keys to descramble it) will experience the same effect as a user who loses (for example, due to network errors) exactly the same packets that are scrambled in the stream. From this point of view, selective scrambling can be
seen as a reverse rate-distortion optimization (RDO) problem. Unlike in the typical
RDO problem, however, the aim here is maximizing the final distortion for a specific
rate of scrambled packets. In an ideal case, the resulting distortion should be so high
that no useful data can be extracted from the content. However, for many practical
applications, it can be enough that the resulting video has a quality bad enough to discourage the potential user from watching it. The underlying idea here is that, in order to
find a good selective scrambling algorithm, techniques for Quality of Experience analysis
can be used.
Notwithstanding, the design of selective scrambling schemes must take into account the reasons why such an algorithm is required: processing the scrambled video in the network, and the low computing power available in the descrambler. Besides, using a lightweight scheme on the scrambler side as well would broaden the applicability of the scheme. Hence the requirements for the selective scrambling algorithm are to:
1. Be transparent to video servers —by leaving the “rich transport data” in the clear;

2. Scramble only a (low) percentage of the video packets;

3. Be implementable with a low computational cost; and

4. Maximize the distortion introduced by the encrypted packets (i.e., do not allow the video sequence to be recovered from the unscrambled packets other than with heavy impairment).
5.4.2 Algorithms
Most existing commercial CAS/DRM solutions fulfill requirement 1. However, they
typically rely on the encryption of the full stream. There are several solutions in the
literature that address the partial encryption of the video stream. A description of the
state of the art can be found in the work of Massoudi et al. [69], who describe a set
of encryption techniques that allow good visual degradation of encrypted video while
scrambling only part of the packets. However, all of them either require deep analysis
of the video stream (thus not satisfying requirement 3) or scramble the video headers to
make video impossible to decode (not meeting requirement 1).
Fan et al. propose encoding with higher security the most important data and with
lower security (and complexity) the less important [20]. Shi et al. divide H.264 video
elements in different classes, which are provided with different protection [100]. In
the work of Zou et al., different encryption levels can be reached by analyzing the
entropy coding of the H.264 stream [125]. These methods satisfy requirement 2, but
all of them require analyzing H.264 up to, at least, macroblock level, which might be
computationally expensive (especially when CABAC entropy coding is used, as in most
IPTV streams).
The approach we propose exploits the error resilience characteristics of video coding standards such as, but not limited to, H.264, where video frames are divided into slices. It has been shown that, when a fragment of a video slice gets lost, the rest of the slice becomes almost impossible to decode [86]. Therefore, by scrambling a small set of data in each slice it is possible to achieve a very high video degradation.
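A minimal sketch of the packet selection this implies, assuming a preceding NALU-parsing stage has already flagged, for each packet, whether it carries slice data and whether it begins a slice (the field names are hypothetical); the scrambling cipher itself is out of scope here:

def packets_to_scramble(packets):
    """Select only the packets that carry the beginning of a video
    slice payload. Since a slice cannot be decoded once its first
    fragment is missing, scrambling just these packets yields a very
    high degradation at a low scrambling rate. Each packet is a dict
    with (hypothetical) fields filled by a previous NALU-parsing step:
      'is_slice'     -> True if the packet belongs to a slice NALU
      'starts_slice' -> True if it carries the start of that slice
    The slice header itself would be kept in the clear by the
    scrambler, which is omitted here."""
    return [p for p in packets if p["is_slice"] and p["starts_slice"]]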
This solution is especially suitable for multimedia deployments because:
• Commercial encoders use a low number of slices per frame (typically one in SDTV,
4-8 in HDTV, see section 2.4.1). Thus the fraction of video packets to encrypt
(scrambling rate) is kept low.
• The information required to process video in a video server (i.e., stream and
picture-level information) is contained in other H.264 syntax elements (called
NALUs —Network Abstraction Layer Units) which are not slices, and in the header
of the slices.
• The only analysis of the video stream required for this solution is detecting the type of NAL units, detecting slices and slice headers, and reading the coding type of each frame. This can be performed in the H.264 Network Abstraction Layer, i.e., it does not require analyzing anything beyond the slice header level. This makes processing much simpler than in any other selective scrambling algorithm.
Table 5.4: Minimum scrambling rate required to completely lose the video signal, as subjectively assessed by expert viewers in the laboratory, for several content assets.
The resulting streams were then chunked into 12-second segments for the tests and processed by a rewrapper. Impairments were introduced in the first half of each of the segments.
A.2.2 Selection of impairments
The selection of impairments was done to cover a sufficient range of error cases related to
the metrics that were going to be evaluated and calibrated (the ones defined in chapter
4).
A.2.2.1 Bitrate drops
To simulate the effect of a bandwidth drop, the first half of the segment was re-encoded
using a different bitrate, which was a fraction of the original one. Two different impair-
ments were defined (called R1 and R2) as detailed in Table A.2.
Table A.2: Bitrate drops

Test                        R1     R2
Bitrate (% of reference)    50%    25%
A.2.2.2 Frame rate drops
In these test cases, the first half of the segment is transmitted using a lower frame rate,
which is a fraction of the original one. Frame rate reduction is achieved by discarding
some B frames from the original stream (denting). Two different impairments were
defined, as detailed in Table A.3.
Table A.3: Frame rate drops

Test                           F1     F2
Frame Rate (% of reference)    50%    25%
A.2.2.3 Audio losses
These impairments are implemented by discarding audio packets in the middle of the
first half of the segment. The shortest loss length, achieved by dropping a single audio
packet, produced a silence of about 200 ms. Longer lengths were achieved by dropping
consecutive packets. Test cases A5 and A6 introduced a sequence of several short losses separated by approximately 1 second. Impairments are detailed in Table A.4. The ‘total
duration’ represents the time from the beginning of the first audio mute to the end of
the last one.
A.2.2.4 Video losses: macroblocking
The macroblocking effect caused by a transmission loss can be roughly characterized
using three parameters:
Table A.4: Audio losses

Test                  A1     A2     A3    A4    A5     A6
Loss length (s)       0.2    0.5    2     6     0.2    0.2
Loss events           1      1      1     1     3      7
Total duration (s)    0.2    0.5    2     6     2      6
• The fraction of the picture affected (position of the loss within the frame).
• The duration of the artifact due to error propagation (position of the loss within
the GOP).
• The loss pattern (i.e. the effect of losing several packets in several frames).
To simplify the experiment, the following restrictions were imposed on the test cases:

• There would be at most one packet loss in each GOP.
• Loss patterns would be established by introducing the same type of packet loss in
several consecutive GOPs.
Impairments are detailed in Table A.5. ‘MIN’ means that the impairment occurred in a
no-reference frame, and therefore its effect did not propagate through the GOP.
Table A.5: Macroblocking errors

Test              E1     E2    E3    E4     E5    E6    E7    E8
% of Frame        100    25    50    100    50    50    50    50
% of GOP          MIN    90    90    90     90    90    25    25
Number of GOPs    1      1     1     1      3     5     3     5
The rationale for this selection of impairments is the following:
• E1 — Verify that the loss of isolated no-reference frames has no effect on the perceived quality.
• E2–E4 — Analyze the effect of single packet losses.
• E5–E8 — Analyze the effect of multiple packet losses.
A.2.2.5 Video freezing
Video freezing was achieved by the loss of a single I frame (or its header), so that the
whole picture remains still until the beginning of the next GOP. The lengths of the freezes were selected as multiples of the GOP length (half a second), as shown in Table A.6.
Table A.6: Video freezing

Test                   V1     V2    V3
Freeze duration (s)    0.5    2     6
A.2.2.6 Impairment sets
The selected impairments were structured into impairment sets: groups of related impairments, as described in Table A.7. ‘N’ represents a hidden reference (no
impairment). ‘AV’ is the combination of A4+V3 (6 seconds audio mute and video freeze,
i.e., a 6-second full outage).
Table A.7: Impairment sets

Impairment Set     Freq.   Impairments     Description
Rate Drop          3       R1 R2 F1 F2     Reaction to bandwidth changes
Audio Loss 1       3       A1 A2 A3 A4     Audio mute length
Audio Loss 2       3       A3 A4 A5 A6     Continuous vs. periodic mutes
Macroblocking 1    3       E1 E1 N N       Detectability of no-reference loss
Macroblocking 2    3       E3 E4 E5 E6     Impairment duration
Macroblocking 3    3       E5 E6 E7 E8     Effect of % of GOP affected
Single Loss        5       V1 E2 E3 E4     Effect of a single video packet loss
Outage 1           1       V2 V3 A3 A4     Audio vs video outages
Outage 2           1       V3 A4 AV AV     Audio vs video vs both
The ‘Freq.’ (frequency) label indicates the number of times that each impairment set
appears in each test sequence. The sum of all the frequencies is 25, which means that
25 different impairments were introduced in each test sequence: one impairment every 12 seconds.
For each of the three video test sequences (movie, sports and documentary), the following
steps were followed:
1. Each segmented sequence was replicated 4 times, to create 4 different variants.
2. The 25 occurrences of the impairment sets were randomized, as well as the 4 dif-
ferent impairments within each set. This way, 4 different sequences of impairments
were generated, each one having 25 impairments.
3. Each sequence of impairments was applied to each of the variants, i.e., impairments were introduced in the first halves of the segments accordingly.

Figure A.1: Structure of the content streams in the subjective assessment test session
The resulting sequences have the structure shown in Figure A.1, where the impairments
introduced in each of the evaluation periods $T_i$ belong to the same impairment set. Table
A.8 shows an example of some of them —they are the first 13 impairments introduced
in each of the variants of the sports sequence in the final tests.