the CDN core network. In this case, it is essential that the Tailoring Server operates in a fully transparent way, since in OTT environments the service provider frequently has no control over the user terminal.
Both concepts will be referred to generically as Edge Servers from this point onwards.
3.6 Conclusions
In this chapter we have proposed a reference architecture for multimedia delivery services
over IP. This reference architecture provides a homogeneous view of the most relevant
scenarios: IPTV and OTT, both for live and on-demand contents, and including quality
monitoring points as well.
We have also introduced the QuEM quality monitoring framework, which is applicable in almost the same scenarios as PLR/PLP systems, but offers a more detailed analysis. Specifically, the basis of this approach has been set up with the objective of
developing a system that is able to characterize what is happening in the network, and
is easy enough to implement, integrate, and deploy in real video delivery systems.
Moreover, the proposed approach and the metrics that compose the monitoring architec-
ture have been validated by means of subjective assessment tests, analyzing the effects of
several transmission impairments on the QoE of the observers, and the relations among
those degradations. Those studies are also useful to calibrate the measurement elements
of the architecture to obtain reliable estimations of the impact of the distortions on the
perceived quality.
Finally, we have described some enablers: network elements that facilitate the imple-
mentation of QoE functionality in the delivery network.
In the next chapters we will fill this framework with information. In chapter 4 we will describe metrics to monitor the most relevant impairments using rich transport data. Those metrics will comply with the requirements established in the QuEM framework, and will be validated using the proposed subjective assessment methodology. In chapter 5 we will use the knowledge obtained in the generation of metrics to propose new value-added applications in the context of multimedia QoE. The implementation of these applications will also rely on the presence of some of the QoE enablers that we have described in this chapter.
Chapter 4
Quality Impairment Detectors
4.1 Introduction
This chapter describes the different metrics which are proposed for the monitoring of the
Quality of Experience in multimedia delivery services. Using the terminology defined in
the previous chapter, they are the Quality Impairment Detector (QuID) blocks needed
to build a Qualitative Experience Monitoring (QuEM) system. Each section is devoted to a different QuID.
The general approach to study each of the QuIDs has been similar. First, the impairment to be detected is defined and characterized. This implies identifying the cause of the impairment —and therefore proposing a technique to monitor it—, as well as understanding its impact on the perceived quality. Afterwards this analysis is completed
with specific subjective quality assessment tests, which use the methodology described
in section 3.4. A common set of subjective tests has been used for this purpose; they
are described in Appendix A.2. In specific sections of this chapter, additional subjective
and objective experiments have been used. They are described in different sections of
Appendix A, and referenced in the appropriate sections of the text when needed.
The metrics described in this chapter are the ones proposed in section 3.3.3. They cover
the most relevant defects described by users [7], and each of them fulfills the requirements
imposed by the QuEM architecture in sections 3.3.1 and 3.3.2 —scalability, significance,
and repeatability.
Section 4.2 describes a video Packet Loss Effect Prediction metric (PLEP). It predicts
how the loss of a video packet can lead to freezing or macroblocking effects, by analyz-
ing the propagation of the error within the video frame, as well as to adjacent frames
through the inter-frame prediction reference chain. The results of this metric are
analyzed objectively and subjectively using the test sequences described in Appendix
A.4 and the test set described in Appendix A.2, respectively.
Section 4.3 follows the same structure as 4.2, but analyzing the effect of the loss of audio
packets.
Section 4.4 analyzes the media coding quality, with two differentiated subsections. First,
in 4.4.1, the video artifacts produced by compression are analyzed with a specific set
of subjective quality assessment tests, described in Appendix A.3. The results of these
tests are used to explore the possibility of using RR or NR metrics to monitor video coding
artifacts in the context of a QuEM framework. Afterwards, in 4.4.2, a different approach
is presented, to analyze the effect of quality drops produced by strong variations in the
channel effective bandwidth —a typical OTT scenario with HTTP Adaptive Streaming.
In this case, two main alternatives are compared: switching to a version with different
bitrate and dropping frames. Their effects are analyzed with the subjective assessment
tests of Appendix A.2.
Section 4.5 describes outage events, understood as the total loss of video, audio, or both
signals for a period of time. Techniques to measure outage are described, as well as its
subjective effect according to the tests described in Appendix A.2.
Section 4.6 analyzes latency-related issues: lag and channel change time. This type of
analysis is sometimes excluded in the discussion of QoE, but it has been included in
this chapter for two reasons. On the one hand, lag and channel change are relevant
only in some specific scenarios, but these scenarios may have a great impact on the overall
perceived quality of the multimedia delivery service —live delivery of sports events is the
most typical case. On the other hand, there is a design trade-off between latency and
other quality factors, such as video coding quality or packet loss probability. Acknowl-
edging this relationship is relevant when considering the whole QoE of our services.
Section 4.7 describes the relationship, in terms of perceived quality, between the different
impairments that have been studied.
Finally, section 4.8 summarizes the main conclusions obtained in the whole chapter.
4.2 Video Packet Loss Effect Prediction (PLEP) model
Packet losses are the main cause of errors in multimedia services and, more specifically,
in IPTV. The loss of video packets can cause macroblocking and image freezing, which
are about half of the QoE impairments reported by customers in a field deployment [7].
For this reason, packet losses are a relevant QoS issue to monitor in IPTV networks. In
existing deployments, it is typical to use pure QoS metrics, such as the Media Delivery
Index (MDI), to monitor them [67]. On the one hand, MDI is a useful metric to esti-
mate QoE because, in the long term and for random losses, the packet loss rate correlates
reasonably well with the Mean Square Error which, in this scenario, can be a reasonably
good predictor of the perceived quality [40, 95]. On the other hand, in most cases there
is simply no other metric applicable in the context of real-time service monitoring: alternatives either need information that is not available at the monitoring point, or are too costly to apply.
However, other approaches are possible. If we have access to rich transport data, such
as the information provided by the rewrapper described in section 3.5.2, we can take
into account the structure of the video stream to improve the prediction of the effect of
losing some packets, instead of applying the sort of flat rate used by MDI.
Another important fact to consider is that the network QoS provided for IPTV should be good enough to make it difficult to assume that “MSE correlates to PLR”. Besides,
QoS-management decisions are taken in the short term (some dozens of packets or so;
otherwise delay is too high). Therefore we need to analyze the short-term effect of
isolated packet losses in order to improve quality management in IPTV.
We will focus in this section on the analysis of packet-loss effect in the short term.
We will build a model to predict the effect of packet losses in video, based on the
information available at transport level in a real deployment. In particular, we will
analyze the transport information (RTP and MPEG-2 Transport Stream), as well as
the network abstraction layer of H.264: NAL Unit Headers and Slice Headers. We will
not analyze deeper than Slice Header in any case: firstly because, when any scrambling
is applied (even partial), some parts of the slice are always unavailable; and secondly
because it would require CABAC entropy decoding, which would increase the computational cost of the monitoring tool excessively for practical applications, thus violating the scalability requirement imposed on QuIDs —see section 3.3.1.
The analysis has been performed in the context of an IPTV service, where the transport
unit (the minimum block that can get lost) is the RTP packet. It has also been assumed
that, to simplify the network processing, the MPEG-2 TS has been packaged into RTP
using a rewrapper. However, the model can be easily extended to other multimedia
delivery scenarios, just by adjusting the size and nature of the packets that can get lost.
4.2.1 Description of the model
We need a packet loss effect prediction (PLEP) model which is based on the analysis
of rich transport data, provides meaningful information to the operator using it, and is
as general as possible. To comply with these requirements, we propose a metric which
estimates the fraction of each of the frames which is affected by artifacts coming from
packet losses. Therefore a frame with a degradation value of, e.g., 50 percent, will have
half of its surface affected by artifacts.
The main advantage of this approach is that it focuses on the structure of the error in
the image, i.e. on the most direct impact of the packet loss, which is the absence of
correct information in parts of the image for some time. This metric does not depend
on the statistics of the image itself, and it is therefore usable in environments where
the picture intensity values are not available. Besides, it provides an easy qualitative
description of the impairment, which makes it suitable for our QuEM architecture.
Our solution encompasses two steps which are applied iteratively: we first compute
the degradation value in one frame, and then estimate the error propagation to the
neighboring pictures. The model only makes use of information which is available in the
slice header of H.264 slices: the slice type and reference picture buffer indexes. No data
is obtained from either the original (unimpaired) stream or from the decoded video.
4.2.1.1 Degradation Value
The first component of the impairment is the error generated in the frame where the
packet loss occurs. In an IPTV environment, video frames will typically be transported
over several transport packets (typically RTP). For that reason, a loss in one of the
packets does not necessarily mean the loss of the whole frame. In fact, the effect of the
loss of a single packet within the frame can be estimated by considering two well-known
properties of the H.264 coding:
• The information of macroblocks within a picture is transported in scan order (un-
less flexible macroblock ordering is used, which is not the case in Main and High
profiles).
• When there is an error in a NAL Unit, decoders usually cannot resynchronize video
decoding until the beginning of the next NAL Unit.
We measure the degradation value on a scale of 0 to 100, where 0 represents that an
image has been received without errors, and 100 indicates that it is completely impaired.
The metric will estimate the percentage of the image which is affected by the error:
$$E_0 = 100\% \cdot \frac{1}{N} \sum_{S=0}^{N-1} \left[ 1 - f\!\left( \frac{L(S)}{L_{avg}} \right) \right] \qquad (4.1)$$
where S represents each slice, N is the number of slices per frame, and L(S) represents
the length in bytes of the fragment of the slice which is not lost. It is assumed that
the rest of the slice is lost the moment an error is produced. Similarly, as macroblock
information is sequentially introduced in a slice (i.e., one macroblock after another), it
is reasonable to assume that the larger the portion of the slice affected, the larger the region of the image impaired. Lavg is an estimation of the length of the slice if there had
been no losses. Depending on the size of the loss and the video transport layer, it may
be estimated with higher or lower accuracy. In any case, it is always possible to assume
that the slice byte size will be similar to a sliding average of the sizes of the K previous
slices of the same type (I, P, B) and their position in the image. f is a function which
must be monotonically increasing. We will select the identity function saturated to the
value “1”, so that no slice can contribute to more than 100 percent of its size.
The equation assumes that all slices in the image have the same size (in pixels). Other-
wise, values should be weighted by their relative surface in the whole image.
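As an illustration, the following sketch computes the degradation value of equation (4.1) from the observed slice lengths. It is a minimal sketch, not the thesis implementation: the function name is ours, and the estimator for Lavg (a sliding average of previous slices of the same type) is assumed to be passed in precomputed.

```python
def degradation_value(received_lengths, expected_lengths):
    """Eq. (4.1): estimate the percentage of the frame affected by losses.

    received_lengths[s] -- bytes of slice s actually received before the loss
    expected_lengths[s] -- estimated full length Lavg of slice s (e.g., a
                           sliding average over the K previous slices of the
                           same type), assumed precomputed
    """
    n = len(received_lengths)
    total = 0.0
    for recv, avg in zip(received_lengths, expected_lengths):
        # f is the identity saturated to 1: an intact slice contributes no error
        total += 1.0 - min(recv / avg, 1.0)
    return 100.0 * total / n  # 0 = intact frame, 100 = completely impaired

# Example: 4 equal-sized slices, the second truncated to 25% of its length
print(degradation_value([5000, 1250, 5000, 5000], [5000.0] * 4))  # -> 18.75
```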
4.2.1.2 Error Propagation
Most of the pictures in an H.264 video sequence use other pictures as references in their
decoding process. This technique, needed to encode the stream with a reasonably low
bit rate, causes errors in one frame to propagate to all frames which make reference to it.
If those frames, in turn, serve as references for others, the impairment would propagate
even more along the reference chain. Therefore a picture with no losses can also have
artifacts which have been propagated from its reference frames.
We compute this propagated error Ep from the value E of each of the frames which are
used as a reference by the picture under study. Given a picture x depending on a set of references {yk}, the propagated error will be:
$$E_p = \gamma \sum_k \omega_k E(y_k) \qquad (4.2)$$
where E(yk) is the error level in the frame yk. This error can be the result of a packet loss in that frame (E0) or a propagated error itself (Ep); the values of ωk and γ model how to estimate the fraction of affected pixels in the predicted picture.
The constant γ represents the attenuation of the error effect along the reference chain.
In a typical H.264 coding scenario, instantaneous decoding refresh (IDR) pictures are introduced periodically (every few seconds, at most). Therefore, regardless of the value of γ, the error will only propagate until the next IDR frame in the worst case (which is with γ = 1). However, this assumption does not hold for long IDR repeat periods, or for cases where I frames are not IDRs and there can be references beyond GOP boundaries1. For this reason γ < 1 is recommended (for instance, γ = 0.9).
Factors ωk represent the weight of the different pictures which contribute as reference to
the picture under study. We use a model where higher level errors have a higher weight,
as they propagate in a more perceptible way:
$$\omega_k = \frac{E(y_k)}{\sum_k E(y_k)} \qquad (4.3)$$
This allows us to write:
$$E_p = \gamma \, \frac{\sum_k E^2(y_k)}{\sum_k E(y_k)} \qquad (4.4)$$
4.2.1.3 Error Composition
Finally, it is possible that one picture suffers from a packet loss and also that its reference
pictures had errors as well. In this situation, both error contributions must be combined.
In the best scenario, both contributions will overlap and the total error level will be the
maximum:
$$E_{bc} = \max \{E_0, E_p\} \qquad (4.5)$$
In the worst case, contributions will be independent and the error will be the sum:
$$E_{wc} = \min \{E_0 + E_p,\ 100\%\} \qquad (4.6)$$
Therefore we assume that the error will be somewhere in between:
$$E = \alpha E_{bc} + (1 - \alpha) E_{wc}, \quad 0 \le \alpha \le 1 \qquad (4.7)$$
1 In H.264, it is possible to define an I frame which is not an IDR. As an I frame, it can be decoded without needing other frames for prediction. However, unlike an IDR, it allows subsequent frames in decoding order to use previous frames as references. This can slightly improve the obtained video quality for a given bitrate constraint, and it is frequently used by IPTV video encoders.
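The propagation and composition rules can be summarized in a few lines of code. This is a sketch of equations (4.2)-(4.7) under our own naming; the value α = 0.5 in the example is an arbitrary illustration, since the model only constrains α to [0, 1].

```python
def propagated_error(ref_errors, gamma=0.9):
    """Eqs. (4.3)-(4.4): propagated error Ep from the reference frames' error
    levels, weighting larger errors more heavily."""
    total = sum(ref_errors)
    if total == 0:
        return 0.0
    return gamma * sum(e * e for e in ref_errors) / total

def composed_error(e0, ep, alpha=0.5):
    """Eqs. (4.5)-(4.7): combine the local loss E0 with the propagated error Ep,
    between full overlap (best case) and full independence (worst case)."""
    best_case = max(e0, ep)
    worst_case = min(e0 + ep, 100.0)
    return alpha * best_case + (1.0 - alpha) * worst_case

# A frame with a local 20% loss, referencing two impaired pictures (30% and 10%)
ep = propagated_error([30.0, 10.0])  # 0.9 * (900 + 100) / 40 = 22.5
print(composed_error(20.0, ep))      # 32.5, halfway between 22.5 and 42.5
```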
4.2.2 Experiment
To test the proposed PLEP model, it is necessary to design an experiment which focuses
on the effect of where packet losses occur. Instead of generating random error patterns,
we have designed an experiment where packet losses are set deterministically and where
it is possible to observe the effect of changing the loss position in the stream.
The sequences are pre-processed with the rewrapper described in section 3.5.2. This
way, each video frame is transported in an integer number of RTP packets, and so is
each GOP. With the aim of analyzing the effect of different packet losses within the
stream structure, one single GOP is selected to generate packet losses on it.
We apply the following steps, with K taking values from 0 to the number of RTP packets in the selected GOP (a code sketch of this loop follows the list):
1. In the selected GOP, the RTP packet in position K is dropped.
2. The PLEP metric is obtained for the resulting sequence.
3. The video sequence is then decoded using the open-source decoder FFmpeg2 (with
default error concealment) and stored on a disk without compression.
4. The obtained sequence is compared with the original one (without errors) using
MSE.
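A sketch of this loop is shown below. The helper functions (compute_plep, write_stream, mse_vs_reference) are hypothetical placeholders for the tooling described in the text, not actual thesis code; only the FFmpeg invocation is a real command line.

```python
import subprocess

def run_loss_experiment(rtp_packets, gop_start, gop_len):
    """Steps 1-4: for each position K in the selected GOP, drop packet K,
    compute PLEP, decode with FFmpeg (default concealment), and measure MSE."""
    results = []
    for k in range(gop_len):
        # Step 1: drop the RTP packet in position K of the selected GOP
        impaired = rtp_packets[:gop_start + k] + rtp_packets[gop_start + k + 1:]
        # Step 2: obtain the PLEP metric for the resulting sequence
        plep = compute_plep(impaired)            # hypothetical helper
        # Step 3: decode to uncompressed video with FFmpeg
        write_stream("impaired.ts", impaired)    # hypothetical helper
        subprocess.run(["ffmpeg", "-y", "-i", "impaired.ts", "decoded.yuv"],
                       check=True)
        # Step 4: compare with the unimpaired original using MSE
        mse = mse_vs_reference("decoded.yuv", "reference.yuv")  # hypothetical
        results.append((k, plep, mse))
    return results
```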
This experiment was conducted with the sequences A, B, C, and D described in the
Appendix A.4. The following discussion will consider sequence A, as it is the one with the longest GOP (100 frames), and therefore the one producing the most test cases.
However, the same process was repeated with sequences B, C, and D, with similar results
—a comparison will be provided later.
Sequence A is encoded in H.264 over MPEG-2 TS at 2.8 Mb/s (with the video stream
at 2.3 Mb/s). Each frame has only one slice, which is the most typical situation for
commercially available video encoders for IPTV. The GOP structure is a hierarchical
“. . . IBBBP. . . ”, such as the one discussed in section 2.4.1 and depicted in Figure 2.4 in
page 31. All I frames are IDR pictures.
The sequence is encapsulated in RTP using the rewrapper. Each GOP occupies about
1000 RTP packets and, in particular, the GOP under study had exactly 958 packets.
Therefore 958 different impaired sequences (each one with the error in a different position
within the GOP) were generated, decoded, and processed.
2 http://www.ffmpeg.org
It is worth noting that, due to the rewrapping process, all the losses affected only one
video frame, although the visual impairment will affect more than one frame due to
error propagation in the prediction process.
4.2.2.1 Qualitative Analysis
Before analyzing the results of the measurements, it is interesting to examine the video
itself, to better understand what happens when one packet is lost. We mainly consider
the results in sequence A since, having a longer GOP, it produces more data in the one-GOP analysis. Figure 4.1 is used as an example for this analysis, although the ideas
described in this section are applicable to the majority of sequences generated for the
study, including both other sequences generated from sequence A and from sequences
B, C and D. Figures 4.1(a), (c), and (e) show an IDR frame where RTP packets #11,
#28, and #29 have been lost, respectively. Figures 4.1(b), (d), and (f) show the next P
frame in display order for the same sequences. Figures 4.1(g) and (h) show the original
unimpaired IDR and P frames, respectively.
In all the measurements, the frame with the highest MSE is the one where the loss
occurred. However, this is not the frame where artifacts are most visible. This is
illustrated in Figure 4.1(a): in the frame where the packet is lost, the MSE is high but the visibility of the error is low. However, four frames later, in Figure 4.1(b), once the error has been
propagated by inter-frame predictions, the error has higher visibility even with a lower
MSE than before. This effect is also produced from Figure 4.1(c) to Figure 4.1(d), and
from Figure 4.1(e) to Figure 4.1(f).
This fact is due to error concealment: when part of the frame is lost, it is simply replaced
by the most recent reference frame available. The visual effect of this replacement is
a frame with a spatial discontinuity (part of the frame is correct, part comes from the previous one), which is not very disturbing visually. However, when the frame is used for
prediction, the predicted macroblocks will have errors, and the macroblocking effect will
appear.
It is also important to consider that in real situations, error concealment techniques may
not be as predictable as desired. For example, Figure 4.1(c) and Figure 4.1(e) show the
same frame for two different sequences —Figure 4.1(c) with the loss of packet #28, and
Figure 4.1(e) with the loss of packet #29, with both packets affecting the same frame.
In the first instance, FFmpeg concealment attempts to reuse the last available reference frame to replace the missing portion of the frame, and as a result the error has low visibility. In the second instance, the lost packet, #29, is directly adjacent to the packet
previously used, #28, which shows that the FFmpeg concealment has failed and that the error has high visibility. These kinds of concealment failures can occur in real decoders, either software or consumer set-top boxes. Therefore one must be careful when making a priori assumptions about how impaired frames appear on the user screen.

Figure 4.1: Video sequence used for qualitative analysis. The left column shows an IDR frame where one RTP packet is lost, while the right column shows the following P frame. The red line in each frame indicates the position in the image of the first macroblock which got lost. RTP packets lost are #11 (a,b), #28 (c,d) and #29 (e,f). (g,h) show the original unimpaired IDR and P frames.
We also found that the sooner an error is produced within an encoded frame, the higher the fraction of the decoded frame affected. The lines in Figure 4.1 show the position
of the error within the frame. Frames in Figure 4.1(a) and Figure 4.1(b), where the
error was produced in packet #11, have more visible and extensive artifacts than those in Figures 4.1(c) and 4.1(d), where the error was produced in packet #28. The underlying idea is that once a fragment of the H.264 slice is lost, the rest of the slice becomes useless to the decoder, which discards it completely since it is not
trivial to resynchronize CABAC decoding. As there is only one slice per frame, when
an error occurs within a video frame, the rest of the frame is lost.
Finally, we should mention a specific case of interest: when the first video packet in
the GOP is lost, then the whole I frame gets lost as well, including any GOP-level
header (such as Sequence Parameter Set, Picture Parameter Set or SEI messages). As a
result, and with the decoder implementation that we have used, the whole GOP becomes
impossible to decode and the image freezes until the next I frame arrives.
4.2.2.2 Quantitative Results
We have computed the Packet Loss Effect Prediction (PLEP) values for each one of the
sequences under study. As IDRs are used at GOP boundaries, sensitivity to γ is not so
critical. We have taken the default value of γ = 0.9. Since there is only one packet loss,
there is no error composition situation, and therefore the value of α is not relevant.
We selected MSE (aggregated along all the impaired frames) as our method of choice to
measure the impact of error in the sequence. Although there are other methods which
correlate better to subjective MOS, such as structural similarity index (SSIM) [116],
MSE has been shown to perform better when predicting packet loss visibility [93].
Figure 4.2 shows the MSE for all the sequences (varying the loss position) generated
from sequence A. The grey line shows the aggregated MSE of the whole sequence while
the green line shows the MSE only of the frame where the loss was produced. The red
line shows the MSE obtained by just substituting the frame where the error occurs with
the previous available reference frame (i.e., the concealment error at frame level). And
the blue line shows the result of the PLEP metric. Figure 4.3 shows the same values for
a reduced number of the sequences.
Figure 4.2: Mean Square Error and Packet Loss Effect Prediction metric for all sequences under study, varying the loss position: aggregated MSE (grey), MSE at the frame where the loss occurs (green), concealment error (red), and PLEP (blue).
Figure 4.3: Detail of Mean Square Error and Packet Loss Effect Prediction metric for all sequences under study
It can be seen that the error has a higher impact at higher levels of the reference hierarchy: when the error occurs in an I frame or P frame, it generates a higher MSE than when it occurs in a (reference) B frame, which in turn is higher than the error generated by losses in (non-reference) b frames. This is mainly due to the fact that errors in reference frames
propagate, and therefore affect more frames. Error concealment also produces more
visible results in I frames and P frames since the previous reference frame available is
further back in time (four frames distant), than in the case of B frames (two frames
away), or b frames (one frame away).
The analysis also indicates that the error decreases with the position of the loss within the frame. This is due to the fact that losing a single packet in a slice means losing the rest of the slice completely,
since the decoder is unable to resynchronize the CABAC decoding. Of course this
decrease is not completely monotonic, as the reconstruction of the damaged frame is
not always perfect. Sometimes concealment techniques fail or are just less effective than
expected.
Figure 4.4: Mean Square Error versus Packet Loss Effect Prediction metric (log scale) and linear fit between them (R² = 0.67)
Figure 4.5: Percentage of macroblocks which are different between both images versus Packet Loss Effect Prediction metric, both in log scale, as well as linear fit (R² = 0.85)
There is also some tendency for the error to decrease along the GOP, because the earlier the error occurs in the GOP, the greater the number of frames it affects. However, due to the
fact that there are some scene changes within the GOP, this effect is not very strong.
Figure 4.2 shows that the PLEP model follows the shape of the error, and in Figure 4.4 both magnitudes are directly compared. There is a reasonably good correlation (R²
= 0.67) between both values, which suggests that the PLEP model is robust enough
to predict packet loss effects. It is worth noting that in this scenario, unlike in other
experiments reported in the literature, there is no correlation between the MSE (which is
variable) and the PLR (which is constant and equal to 1/958 for all the sequences). This
means that our PLEP model is able to explain the effect of packet losses reasonably well, even in situations where the packet loss ratio does not provide any valuable information.

Figure 4.6: Percentage of macroblocks which are different between both images (blue) and Packet Loss Effect Prediction metric (red) for all sequences under study, varying the loss position
Results obtained from the other sequences are qualitatively quite similar. Table 4.1 shows the R² between PLEP and MSE for all video sequences.
Table 4.1: Coefficient of determination (R²) of MSE vs PLEP fit for several video sequences.

Sequence    A     B     C     D
GOP size    100   24    24    12
R²          0.67  0.63  0.74  0.91
With this in mind, it is also important to consider that the PLEP method is more
robust to failures in error concealment than MSE estimation methods. Indeed, error
concealment is quite unpredictable in a real case, and not easy to fit into a predefined
model, as we illustrated previously in Figure 4.3, where the MSE in the frame where
the loss occurred is shown in green, while the MSE in dashed black depicts an instance
when an error occurred and the damaged frame was replaced by the previous frame
available. This suggests that even knowing the MSE produced by replacing one frame
by its predecessor, there is no specific pattern which can easily model MSE in a specific
frame when the loss occurs in the middle of a GOP. However, predicting the “part of the
frame affected” is much more stable, since it does not depend on the error concealment
techniques used. Thus, a metric defined as the ratio (in percent) of macroblocks different
on a pixel-to-pixel basis between both images provides a better approximation than MSE
does for the concept of “part of the frame affected.”
Figures 4.5 and 4.6 show that the PLEP model is indeed a good predictor of the ratio of macroblocks which differ between the original and the impaired images. Correlation with
the PLEP model increases so that, for the sequence under study, R² = 0.85.
4.2.3 Subjective analysis
The next step in the analysis is discovering whether the prediction of the fraction of
the image affected by errors can be effectively used to model impairments in the per-
ceived Quality of Experience. With this goal, the subjective assessment test session described in Appendix A.2 included some impairments based on the PLEP model. The impairments were generated under the same conditions as in the previously discussed objective experiment: the video is sent by a rewrapper process and only one RTP packet is lost,
and the loss includes data of only one frame. The position of the RTP loss within the
GOP structure is varied to produce different effects.
The different impairment conditions are described in Table 4.2. We will consider the
simplified version of γ = 1, so that we assume that the error is propagated until the
end of the GOP. Impairment N is the hidden reference (no packet loss). Impairment E1 loses the first packet of the first non-reference B frame in the GOP; thus the error does not propagate to other frames. Impairments E2, E3 and E4 lose one packet in the first
reference P frame of the GOP, so that the error gets propagated along the GOP. To
vary the resulting effect, the packet is lost at the beginning (E4), in the middle (E3) or
at the end (E2) of the frame, which varies the packet loss effect according to what has
been discussed previously. Finally, impairment V1 has a special effect: it loses the very first packet of the GOP (in the I frame). In this case, as the most relevant headers for the GOP get lost, the resulting effect is not macroblocking, but a freeze of the image for the duration of the GOP (until another I frame is received).
Table 4.2: PLEP impairments analyzed in the subjective assessment tests

Code  Frame    % frame affected  Description
N     n/a      n/a               Hidden reference
E1    B (nr)   100               Loss of one frame
E2    P (ref)  25                25% of frame affected during one GOP
E3    P (ref)  50                50% of frame affected during one GOP
E4    P (ref)  95                95% of frame affected during one GOP
V1    I (ref)  100               Video freeze during one GOP
The results obtained from the tests are shown in Figure 4.7, differentiating the three
content sources under study: an action movie (Avatar, in blue), a football match (yellow)
and a documentary (red). The global average value is also displayed, together with its
confidence intervals. The description of the sources, as well as more details about the
tests, can be found in Appendix A.2.
Figure 4.7: Results of the subjective assessment for Video Loss impairments
As a first conclusion, the results suggest that the PLEP metric is applicable to the
characterization of video packet losses, as they confirm that the position of the error
within the GOP structure significantly affects the quality perceived by the end user.
This conclusion has to be taken with some degree of caution, because there is variability
in the results, especially from one content source to another. However, it is clear that
the PLEP model outperforms the simple packet loss rate metrics. More specifically, losing one single frame (without propagation) or a small part of the frame (even with propagation along the GOP) is, in general, either not perceived or perceived as not annoying, and statistically indistinguishable from the hidden reference. Beyond that, the bigger the fraction of the frame affected, the higher the severity. Finally, freezing the video for the whole GOP has a more severe impact on quality than the macroblocking effect.
The errors E2, E3, E4 and V1 belong to the same “impairment set”, as defined in section
3.4.3. That means that they are evaluated in parallel over the same segments. Figure
4.8 shows the detailed results for each of the segments of this “impairment set” for
the three sequences under study. Most of the segments follow the same pattern as the
general results, and it is also possible to see that the “inter-segment” variability for the
same error event is lower than the “intra segment” variability for the different errors
applied to each segment. The segments labeled as “Doc-10” and “Avatar-20” —from
the documentary and the movie sequences, respectively— may be considered outliers,
and they share the property of having a low MOS for the less perceptible error (E2).
This suggests that in both cases the “delivery quality” of the unimpaired version of those
segments might be lower than expected, and maybe a characterization of the properties
of the video in the headend could lead to an RR metric that improved the performance of PLEP.
Figure 4.8: Detailed results for each of the individual segments for Video Loss
4.3 Audio packet loss effect
When packets containing audio information get lost, there is also an impairment in the
perceived quality: either a temporary interruption in the displayed sound or a distortion
(glitch or noisy sound). Audio distortions are less frequent than video artifacts or, at
least, less frequently perceived by end users [7]. However, they are still common enough for any monitoring system to consider them, especially if we take into account that they are as unacceptable as video artifacts [57]. It is also relevant to consider that, as audio streams normally have a very stable bitrate, they require a relatively small buffer in the receiver (around 50 ms, compared to the 500-2000 ms typical for video streams). As a consequence, audio packets are much more sensitive to delay variation than video packets, and high values of jitter will easily increase the losses in the audio stream.
In this section we will study the effects of those packet losses, both objectively and
subjectively. We will take as baseline scenario an IPTV channel over MPEG-2 Transport
Stream. To simplify the analysis, we will assume that the stream has been encapsulated into RTP packets by a rewrapper. This way, a packet loss at RTP level will impair either
audio or video signals, but not both simultaneously.
4.3.1 Objective analysis
Audio coding formats used in multimedia systems normally use block coding: they take a time window of the audio waveform, divide it into spectrum sub-bands, and code each sub-band according to spectral masking criteria (obtained from a psychophysical model of the human hearing system), aimed at maximizing the perceived quality for a target bit rate. There is some overlap between adjacent windows, but no long-term coding prediction or complex prediction structures. All the audio codecs considered in our IPTV and OTT scenarios (MPEG-1 layer 2, MPEG-4 AAC, and Dolby AC3) have this kind of design.

Figure 4.9: Waveform of a lossy audio file
With this, the impairment produced by the loss of one audio RTP packet will affect only the time window to which the packet belongs. Therefore we can make the hypothesis that the impairment will be a silence whose length is proportional to the length of the packet loss burst. This, which is exact for uncompressed audio (PCM), will be a sufficiently good approximation for compressed audio as well.
Figure 4.9 shows the waveform obtained after decoding an audio file with losses. It is
the audio stream of sequence A described in Appendix A.4, encoded in MPEG-1 layer 2 at 192 kbps. 70 TS-packet losses (around 550 ms) were introduced every 1000 TS
packets (7.8 s). Silence intervals are clearly visible in the waveform, and their duration
is effectively around 0.5 seconds each.
In some cases, signal peaks can be observed next to the silence intervals. They are
perceived as glitches or audio discontinuities, and they may also appear in the event of packet losses. In principle, and for the sake of the analysis of the losses, we will consider
only the silences as the base impairment, since they cannot be distinguished from the
glitches just by the analysis of the lost packets.
Another 2-minute cut of the aforementioned sequence A (with MPEG-1 layer 2 audio at 192 kbps) has been taken to introduce audio packet losses, varying the number of consecutive packets lost (the loss burst). The expected duration of each TS packet loss would be:

$$\frac{188 \times 8}{192000} = 7.8 \times 10^{-3}\ \mathrm{s} \qquad (4.8)$$

Figure 4.10: Effect of audio losses: measured vs. expected (R² = 0.98)
Afterwards, the resulting stream has been decoded by a software decoder and the length of the silences has been determined. The result is shown in Figure 4.10. Blue points
show the length of the silence events (Y axis) as a function of the number of packet losses
(expressed in seconds, X axis). Most of the silence events have a length which is similar
to the expected one (although there is a small fraction of outliers, which represent the
short silence periods just after or before a glitch effect). Once the outliers have been
removed, the data fitting to a regression line (in red) allows us to determine the validity
of the approach. The line has a slope of 1.05 and an ordinate at the origin of 0.18, with a determination coefficient R² = 0.98.
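The expected duration of a loss burst, and the empirical fit just described, reduce to simple arithmetic. The sketch below encodes equation (4.8) and the regression of Figure 4.10; the function names are ours.

```python
def expected_silence_s(n_ts_packets, audio_bitrate_bps=192_000):
    """Eq. (4.8): each lost 188-byte TS packet removes 188*8 bits of audio,
    i.e. about 7.8 ms of signal at 192 kbps."""
    return n_ts_packets * 188 * 8 / audio_bitrate_bps

def observed_silence_s(n_ts_packets, audio_bitrate_bps=192_000):
    """Empirical fit from Figure 4.10: slope 1.05, intercept 0.18 s."""
    return 1.05 * expected_silence_s(n_ts_packets, audio_bitrate_bps) + 0.18

print(expected_silence_s(70))  # ~0.55 s, matching the bursts in Figure 4.9
print(observed_silence_s(1))   # ~0.19 s: even a single packet yields ~180 ms
```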
With this data, the following conclusions can be obtained:
• The model is sufficiently good to be used as a QuID.
• The slope is approximately 1, so that we can say that the perceptible duration of
the loss is quite similar to the length of the packet loss.
• Each packet loss, even the shortest ones, generates a silence of at least 180 ms.
This last figure of 180 ms should be taken with appropriate caution. Firstly,
because the offline software decoder is not very robust under packet loss events (and, in
fact, extracting the silence length has required a careful analysis of the recovered data).
Figure 4.11: Short-length audio losses
And secondly because the number of samples used in the model is not high enough to
be sure about the quantitative significance of this result.
However, from a qualitative point of view, it seems to be clear that there is a minimum
silence length that occurs in most cases. In Figure 4.11, which shows the values of Figure 4.10 for its smallest loss durations, it can be seen that the four columns of blue points on the left side (which refer to losses of 1, 3, 5 and 7 TS packets) generate errors between 150 and 300 ms indistinctly. Without considering the quantitative significance of those figures, it is possible to say that, qualitatively, the effect of losing one single TS packet is similar to the effect of a short burst of packet losses. A side effect of this conclusion is that encapsulating 7 audio TS packets into a single RTP audio packet in the rewrapper does not significantly increase the effect of the minimum audio loss, which would be 1 TS packet (plus probably some video packets as well) for non-rewrapped streams, and 7 TS packets (without additional loss of video) for rewrapped streams.
4.3.2 Subjective analysis
The subjective assessment test session described in Appendix A.2 also included impairments produced by the loss of audio packets. As the transmitted packets have been processed by the rewrapper, every 7 audio MPEG-2 TS packets are grouped into one audio RTP packet. As described before, the coded audio bitstream does not have complex prediction structures (as video does), and the effect of a packet loss is basically related to its duration. Therefore the different types of audio losses differ only in the number of packets that have been lost (it is similar to a packet loss rate / packet loss pattern metric, but with the important distinction that we know that the lost packets are audio packets). The RTP audio packet loss patterns used in the subjective assessment tests are described in Table 4.3.

Figure 4.12: Results of the subjective assessment for Audio Loss impairments
Table 4.3: Audio losses analyzed in the subjective assessment tests.
Code  Duration of the burst
N     0 (hidden reference)
A1    1 packet
A2    500 ms
A3    2 s
A4    6 s
The results obtained from the tests are shown in Figure 4.12, differentiating the three content sources under study: the action movie in blue, the football match in yellow, and the documentary in red. The global average value is also displayed, together with its confidence intervals. The results are stable and consistent with other research on the topic [79]: the longer the loss, the higher the severity. Isolated one-packet audio losses seem to be admissible under real viewing conditions. The acceptability of short bursts (up to 500 ms) depends strongly on the selected content: it is acceptable in the soundtrack of a movie, but not in the narration of a sports match. Long bursts (2 seconds or longer) are unacceptable in all cases.
Since A1, A2, A3 and A4 belong to the same “impairment set”, it is possible to compare their results segment by segment. This is shown in Figure 4.13, which confirms the conclusions mentioned before. In this case, since the audio structure is simpler and the original audio quality is, as in a real deployment, high enough for the purpose, the probability of having clear outliers is low.
Figure 4.13: Detailed results for each of the individual segments for Audio Loss
4.4 Coding quality and rate forced drops
Another relevant element for the Quality of Experience is the multimedia quality obtained at the end of the encoding process: the coding quality. The coding quality is important for the overall QoE, but it is not so critical for a monitoring system, for two main reasons. On the one hand, its impairments are less frequently reported by the users than the ones produced by packet losses [7]. On the other, the target coding quality is something that must be controlled in the design phase of the service, when selecting the encoder which is going to be used and the conditions, especially bitrate, under which it is going to work. But at runtime, there should be fewer unexpected events in the encoder than in the access network, for instance.
When considering coding quality, we will focus only on the video stream, and not on the audio. The reason is that, while both of them contribute similarly to the final multimedia quality [90], video requires much more bandwidth than audio [6] and, as a result, video encoders will be working under more stressful conditions.
In this section we will study the coding quality from two different perspectives. First
we will explore the options to control or estimate the coding quality using simple RR
or NR metrics (with a chance to be applicable in the QuEM framework). Then we will
analyze different scenarios of strong quality drops, such as the ones produced when the stream jumps from one bitrate to a much lower (or higher) one. This scenario is typical
of OTT services using HTTP adaptive streaming.
4.4.1 Analysis of feature-based RR/NR metrics as estimators of video
coding quality
The first step in the analysis of video quality has been trying to find out whether it is possible to estimate the perceived coding quality (or, at least, some salient impairments)
from elementary Reduced-Reference or No-Reference metrics performed in the pixel
domain. The main reason for that is trying to build a quality estimator that can be of
use in scenarios similar to the ones proposed in our QuEM architecture.
The approach taken to this problem has been analyzing several NR and RR metrics from
the literature. Those metrics have been applied to video at contribution quality (high-
quality recordings from television content, obtained directly from the television studios
in uncompressed D1 format), and to the result of encoding them with commercial H.264
video encoders at different bit rates. The obtained values have been compared to the
outputs of subjective assessment tests done for the same video segments.
The work described in this subsection 4.4.1 was done during the first steps of the re-
search activity of this thesis [81], before the development of the QuEM strategy and its
associated subjective assessment test methodology, described in chapter 3. Therefore,
the subjective tests referenced in this subsection, described in Appendix A.3, are different
from the QuEM-based subjective tests used in the rest of this chapter, and described in
Appendix A.2. The experiments, main results, and conclusions are described now.
4.4.1.1 Metrics under study
The aim of the experiment is to determine whether it is possible to detect degradations
in the video quality by using lightweight Reduced Reference (RR) and No Reference
(NR) metrics. Most RR metrics are based on comparing some image features before
and after the impairment process. These features usually model amount of movement
and spatial detail. NR metrics are normally based on the detection of known artifacts
produced in the coding process, such as blocking or blurring [121].
To compare different possible strategies homogeneously, we will extract the same features from the original and the processed (impaired) sequences, and measure their relative degradation, averaged along time:

$$M = \operatorname{mean}_t \left( \frac{\left| X[F_{orig}(t)] - X[F_{proc}(t)] \right|}{X[F_{orig}(t)]} \right) \qquad (4.9)$$
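Equation (4.9) is straightforward to implement once per-frame feature values are available. A minimal sketch, assuming the features are given as numeric time series:

```python
import numpy as np

def rr_degradation(feature_orig, feature_proc):
    """Eq. (4.9): relative degradation of a feature, averaged along time.

    feature_orig, feature_proc -- per-frame feature values X[F(t)] for the
    original and the processed (impaired) sequences.
    """
    orig = np.asarray(feature_orig, dtype=float)
    proc = np.asarray(feature_proc, dtype=float)
    return float(np.mean(np.abs(orig - proc) / orig))
```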
Four groups of features have been compared: spatial information (obtained from sev-
eral RR metrics), temporal information (from RR metrics as well), blocking (from NR
metrics), and blurring (from NR as well).
Different feature extractors have been considered for spatial information (or texture):
• Le Callet et al. [63] propose a pair of complementary measures based on intensity
and direction of borders, which they call GHV and GHVP. They compute GHV as
the average magnitude of intensity gradient for all the pixels in which this gradient
is horizontal or vertical, and GHVP as the average magnitude of intensity gradient
for all the pixels in which this gradient is neither horizontal nor vertical.
• The BTFR metric in ITU-T J.144 [45] includes a texture measure computed as the zero-crossing rate of the horizontal gradient.
• Saha and Vemuri [98] propose using the average value of absolute vertical and
horizontal differences, which they call IAM4.
• Webster et al. [117] propose a Spatial Information feature (SI), defined as the
standard deviation of the Sobel-filtered frame.
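As an example of one of these feature extractors, the sketch below computes Webster's SI feature as we read its definition (standard deviation of the Sobel-filtered frame); combining the two Sobel directions as a gradient magnitude is our assumption.

```python
import numpy as np
from scipy import ndimage

def spatial_information(frame):
    """Webster's SI feature: standard deviation of the Sobel-filtered frame.

    frame -- 2-D array of luma values.
    """
    luma = frame.astype(float)
    gx = ndimage.sobel(luma, axis=1)  # horizontal gradient
    gy = ndimage.sobel(luma, axis=0)  # vertical gradient
    return float(np.std(np.hypot(gx, gy)))
```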
When characterizing temporal variations, there is less diversity of metrics in the litera-
ture. We will consider Le Callet’s Temporal Information (TI), defined as the energy of
the difference image along time [63].
Regarding the blocking effect, we have studied three of the most frequently cited metrics:
• GBIM (Generalized Block-edge Impairment Metric) [122]. It measures the differences between both sides of the block boundary (which must present a regular and well-known pattern); a simplified version is sketched after this list.
• Vlachos metric [110], which uses a method based on the spectral analysis of the
pixels in block boundaries.
• Wang metric [115]. It analyzes the Fourier transform of the image to detect energy peaks at multiples of the inverse of the block period.
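The simplified blockiness sketch promised above: it only measures the mean absolute luma difference across vertical 8-pixel block boundaries, omitting the activity-based weighting that GBIM applies.

```python
import numpy as np

def blockiness(frame, block=8):
    """Mean absolute luma difference across vertical block boundaries.

    A strong regular pattern at 8-pixel boundaries indicates blocking; GBIM
    additionally normalizes by local masking activity, omitted here.
    """
    luma = frame.astype(float)
    left = luma[:, block - 1::block]   # last column of each block
    right = luma[:, block::block]      # first column of the following block
    n = min(left.shape[1], right.shape[1])
    return float(np.mean(np.abs(left[:, :n] - right[:, :n])))
```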
The other relevant artifact to study is blurring. Most blurring metrics are based on the
measurement of the average width of borders in the image [21]. We have selected the
implementation proposed by Marziliano et al. [68].
Finally, we have also included two basic measures: global brightness (mean value of
intensity) and global contrast (standard deviation of intensity).
4.4.1.2 Evaluation
Reference data to benchmark these video quality metrics were obtained from the results
of a study of subjective quality for real-time H.264 encoders, described in Appendix
A.3. The same sequences used for the subjective tests were provided as input for all the
feature extractors described in the previous subsection.
Reduced-Reference metrics were obtained for all the features by applying equation (4.9).
Besides, the blocking and blurring metrics were also considered as individual No-Reference metrics, just by computing their average along each test sequence.
The output of all the metrics, both RR and NR, was compared with the MOS obtained from the subjective tests, to check whether any of the features under study could be a reasonable predictor of MOS variations. Pearson correlation and Spearman
rank correlation (with p-test) were computed. Results are shown in Table 4.4.
Table 4.4: Comparison of NR/RR results with subjective tests
Metric      Pearson  Spearman  p-test
Brightness  0.41     0.44      OK
Contrast    0.61     0.64      OK
The denting component performs exactly this process: based on a configuration parameter (target bitrate, target frame rate, “remove all B”, etc.) it sends to its output the same media received at the input, except for some video frames which are carefully selected to meet the desired requirements. Due to the encoding properties of most codecs, video frames can usually not be removed arbitrarily, because the absence of a frame may prevent other frames which remain in the stream from being properly decoded. For this reason the denting component requires deep information about the video frames, not only about their boundaries but also about their decoding hierarchy. Padding packets can also be removed by the denting component, but non-audio/video streams (application data, teletext, subtitles, etc.) should only be removed if explicitly allowed by configuration parameters.
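A sketch of the selection logic behind denting, for the hierarchical “. . . IBBBP. . . ” structure discussed in section 2.4.1: non-reference b frames are dropped first, then reference B frames, and I and P frames are never touched. The frame representation is an assumption of ours, not the component's actual data model.

```python
def frames_to_drop(frames, target_ratio):
    """Choose frames to drop so that every remaining frame is still decodable.

    frames -- list of dicts like {"id": 3, "type": "b"}, in decoding order;
              an assumed representation of the stream's frame hierarchy.
    target_ratio -- fraction of frames to drop (e.g. 0.5 for "1/2 dropped").
    """
    # Drop non-reference b frames first; once they are all gone, reference B
    # frames can be dropped too (nothing left depends on them). I and P frames
    # are kept, since the rest of the GOP needs them for prediction.
    candidates = ([f for f in frames if f["type"] == "b"] +
                  [f for f in frames if f["type"] == "B"])
    n_drop = min(len(candidates), round(len(frames) * target_ratio))
    return {f["id"] for f in candidates[:n_drop]}
```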
Denting can be used in the Edge Server to dynamically generate lower-bitrate versions
of the main stream, either to create or enhance HAS structures or to reduce the bitrate
of a unicast transmission between the Edge Server and the user terminal. In particular,
denting has been successfully used in Fast Channel Change solutions to increase the
apparent bitrate of the unicast session without effectively allocating a higher bitrate for
it.
These quality drops (reducing the bitrate and denting) have also been included in the
subjective assessment tests described in Appendix A.2. Table 4.5 shows the different
values considered. R1 and R2 are a reduction of 50% and 75% of the bit rate. F1 and
F2 are a reduction of 50% and 75% of the frame rate. The effective bitrate reduction
of F1 and F2 depend on how the video was encoded. However, typical values for the
content assets under study are about 25-30% of bitrate reduction for F1, and 35-50%
for F2.
Table 4.5: Quality drops analyzed in the subjective assessment tests.
Code  Type     Description
N     n/a      Hidden reference
R1    Bitrate  Bitrate reduced to 1/2
R2    Bitrate  Bitrate reduced to 1/4
F1    Denting  1/2 of all frames dropped
F2    Denting  3/4 of all frames dropped
The results of the subjective assessment tests for these impairments are shown in Figure 4.15.

Figure 4.15: Results of the subjective assessment for Rate Drop impairments

Figure 4.16: Detailed results for each of the individual segments for Rate Drop

The following conclusions can be obtained:

• The results of the hidden reference are high. This means that coding defects introduced at the reference quality are perceived as much less severe than other defects (forced quality drops in this case, but also other defects considered in other sections).
• The impact of this kind of impairment depends on the source content, at least up
to some point.
• In general, the quality variations between bitrates are relevant (and between frame rates as well). However, their specific impact differs from one asset to another, and from one segment to another. This is better shown in the comparison within the “impairment set” formed by R1, R2, F1 and F2, in Figure 4.16.
• Denting has a higher impact on the perceived quality than the drop in coding quality, which was expected, as in the latter case the quality-rate trade-off has been optimized by the encoder, while in the former it has not.
4.5 Outages
All the issues considered so far are caused by isolated errors. Now we will analyze a different case: outage —the loss of service for a period of time. The relevance of this case is that users sometimes report errors which are described as a complete stop in the video playout, sometimes only recoverable after a reboot of the user terminal [7]. Any system that monitors the global QoE must be aware of this kind of error since, although such errors are less frequent than the ones caused by isolated packet losses, they have a higher impact on the final quality.
Outages can be roughly classified into two categories: “short” and “long”. By “long” outages we understand those caused by service unavailability for several minutes or hours. The most typical example is a software problem in the user terminal, but there can be more severe situations (such as a critical failure in the delivery equipment, for instance). “Short” outages are the ones caused by a brief stop (some seconds) in the video service delivery, typically caused by discontinuities in the service, or by an issue in the delivery equipment followed by a recovery of the service from a redundant one.
“Long” outages should always be monitored and managed by the Service Provider and are, in fact, outside the scope of our work. The impact of having no service at all is not easy to measure on the same scale that we are considering. We will focus exclusively on the detection and impact measurement of “short” outages.
4.5.1 Detection of outages
The outage can happen in the contribution (detectable in the headend), in the core
network (detectable in the PoP), or in the access network (detectable in the HNED,
maybe with the help of the Edge Server).
If it happens in the contribution, it should be monitored by continuity monitors in
the headend. An effective way to do it is using the VODA algorithm proposed by
Reibman and Wilkins [94]. This algorithm detects an outage when there is as sudden
and simultaneous drop of three different factors: average brightness (i.e. the picture
changes abruptly to black), space information, and audio signal power. The three factors
must also remain low for some seconds for the outage to be detected.
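As an illustration, a minimal sketch of such a detector in Python, assuming that per-second measurements of average brightness, spatial information, and audio power are already extracted from the decoded signal; the thresholds and hold time are illustrative, not the values of [94]:

# Illustrative thresholds and hold time; a real deployment would
# calibrate them (they are not the values used in [94]).
BRIGHTNESS_TH = 20.0   # average luma, 0-255 scale
SI_TH = 5.0            # spatial information
AUDIO_TH = 1e-4        # normalized audio signal power
HOLD_SECONDS = 3       # all factors must stay low this long

class OutageDetector:
    """Flags an outage when brightness, spatial information and audio
    power are simultaneously low and remain so for HOLD_SECONDS
    consecutive one-second samples."""

    def __init__(self):
        self.low_run = 0   # consecutive seconds with all factors low

    def update(self, brightness, si, audio_power):
        all_low = (brightness < BRIGHTNESS_TH and si < SI_TH
                   and audio_power < AUDIO_TH)
        self.low_run = self.low_run + 1 if all_low else 0
        return self.low_run >= HOLD_SECONDS   # True -> outage detected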
If the outage happens in the network, it will be an extreme case of packet loss with high impact (the loss of several seconds' worth of video and/or audio), which can normally be detected with packet loss effect estimators (and probably with simpler packet loss detectors).
Additionally, short outages in the contribution can be detected in the coded stream
(with less accuracy, though this can be enough for our purposes) by monitoring the global
video and audio signal level:
• For video, with the analysis of the frame size and structure (coded long freezes have almost zero-byte P and B frames); see the sketch after this list.
• For audio, either from the analysis of energy values for each sub-band (exact) or
with the analysis of the dynamic range compression parameters, when available.
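A minimal sketch of the video-side check, assuming per-frame coded sizes and frame types have already been extracted from the transport stream; the thresholds are illustrative:

NEAR_ZERO_BYTES = 200   # illustrative "almost zero-byte" threshold
MIN_RUN_FRAMES = 50     # about 2 s at 25 fps before flagging

def coded_freeze_detected(frames):
    """frames: iterable of (frame_type, coded_size_bytes) in decode
    order. A long run of near-empty P/B frames is the signature of a
    freeze already encoded at the contribution."""
    run = 0
    for ftype, size in frames:
        if ftype in ("P", "B") and size < NEAR_ZERO_BYTES:
            run += 1
            if run >= MIN_RUN_FRAMES:
                return True
        else:
            run = 0
    return False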
4.5.2 Subjective impact of outages
Some outage events have also been included in the subjective assessment tests described
in Appendix A.2. Table 4.6 shows the different values considered: stops of 2 and 6
seconds for audio and video (or both). The results are shown in Figure 4.17, with the comparison of the impairment set A4, V3, AV in Figure 4.18.
In general, and for the same sequence, the longer the outage, the worse the perceived
quality. However, the specific impact and the relative importance of video and audio is
quite dependent on the specific content.
Table 4.6: Outage events analyzed in the subjective assessment tests

Code   Outage Duration   Elementary Stream Affected
A3     2 s               Audio
A4     6 s               Audio
V2     2 s               Video
V3     6 s               Video
AV     6 s               Both

Figure 4.17: Results of the subjective assessment for Outage impairments

Figure 4.18: Detailed results for each of the individual segments for Outage
4.6 Latency
A final QoE factor to consider is latency. Latency issues are usually disregarded in many
QoE analyses, because they are only perceived in very specific scenarios. However, the
study of latency is relevant for two different, but related, reasons. On the one hand, as discussed in section 2.4, the scenarios where latency is relevant —mainly live sport events— are important enough to make latency a meaningful QoE element. On
the other hand, there is a trade-off between latency and other QoE components that
makes it difficult to have low-latency video delivery services without compromising the
perceived quality. These trade-offs will be summarized at the end of this section, in section 4.6.3.
Latency will be studied from two different perspectives. First we will analyze the end-to-end latency or lag. Afterwards we will analyze channel change time, which is also a
latency-related scenario with a significant contribution to the overall QoE.
4.6.1 Lag
End-to-end latency or lag refers to the delay observed in the displayed video by the user
with respect to the moment when the event is being recorded. With this definition,
lag only makes sense for live content streams: those which are being watched while
they are being captured. Although it is possible to provide an equivalent definition for
on-demand content, the reality is that lag is only a QoE factor in live events. And even
for live television channels, there are very few cases where lag is really an issue, i.e. where receiving the video with a few additional seconds of delay makes any difference. However, the few cases where lag is important are also important for service providers and users, the most typical ones being sport matches. For those reasons, keeping the lag under control is very relevant for IPTV service providers [70].

Figure 4.19: Simplified transmission chain for real-time video
Lag must be constant end-to-end, to avoid losing video continuity. As such, any protocol layer that imposes timing constraints must also have a constant end-to-end delay, since it cannot assume that the delay variation will be absorbed by the upper layers.
Figure 4.19 illustrates this. Points A and Z represent the decoded video stream. In the absence of errors, the video reproduced at A and Z should be identical, and therefore the delay between those points, $T_{AZ}$, must be constant.
A first component of this delay is introduced by the encoding process, and it is due to two main causes. On the one hand, coding video using frame prediction normally implies that frames are encoded and transmitted in a different order than they are displayed, to allow the use of bidirectional prediction. On the other hand, this kind of compression also makes the size, in bytes, of the different frames vary strongly from frame to frame and over time. This generates local peaks of bitrate that normally need to be smoothed before transmission, introducing additional delay, to comply with bandwidth restrictions. These two sub-components of the video delay are introduced by the encoder and depend only on coding decisions (and therefore can be known at point B).
MPEG-2 Transport Stream allows the encoder to manage the coding delay end-to-end. The transport stream includes a clock signal called PCR (program clock reference), which indicates the rate at which the coded stream is produced at point B and, therefore, the rate at which it is expected to be delivered at point Y. The stream also includes, for each video, audio or data access unit, its presentation time stamp (PTS) in the same clock base. The total encoder-decoder delay $T_{AB} + T_{YZ}$ is constant. This way, if the network is able to keep a constant delay $T_{BY}$, the end-to-end delay $T_{AZ}$ will be constant as expected.
However, the real delay in the transmission network $T_{CX}$, which is an IP network, cannot
be guaranteed to be constant. Therefore network elements are introduced to control the
network ingestion and the reception in the user terminal to flatten network jitter and
also to manage error correction protocols.
The delays introduced by server-side elements and by the decoder ($T_{AB} + T_{BC} + T_{YZ}$) are established by the network design and known a priori by the service provider. The network buffer $T_{XY}$ depends on the implementation of the user terminal, and it is normally set individually for each video session. Once it is established, however, the end-to-end network delay $T_{BY}$ will remain constant for the whole video session, and therefore each video packet whose jitter exceeds this buffer will arrive too late to be sent to the decoder, and will be considered a network loss. Therefore, when establishing the length of the network buffer, there is a trade-off between end-to-end delay and packet loss probability.
Additionally, if the video multiplexing format is the ISO File Format, it does not include transport timing information equivalent to the PCR. In that case, the user terminal must set the value of $T_{YZ}$ arbitrarily for the first decoded video frame, and assume that it will be enough to present every frame on time from then onwards. As a result, buffer sizes are normally overdimensioned, to avoid buffer emptying events, at the cost of suffering a higher lag. This overdimensioning is also generally applied to the network buffer $T_{XY}$, especially in the case of Over The Top services (where network capacity variations can be very strong).
4.6.2 Channel Change time
We will define channel change time (or zapping time) as the time between the moment
when the end user presses a “channel change” key in their user terminal and the instant
when the new channel (video and audio) starts playing on their screen. This time can
be divided into the following components:
$$T_{CC} = T_{term} + T_{net} + T_{buf} + T_{vid} \qquad (4.10)$$

where
• $T_{term}$ is the delay between the user key stroke and the moment when the user terminal effectively requests the new video stream from the network (by issuing an IGMP join, an HTTP request, or whatever is suitable for each scenario).

• $T_{net}$ is the delay between the moment the new video is requested and the moment the first byte of the new stream arrives back at the user terminal.

• $T_{buf}$ is the time needed to fill the network buffer in the user terminal.

• $T_{vid}$ is the time needed to present the first video frame at the decoder output.
From the analysis done in the previous subsection, it follows immediately that $T_{buf}$ is equal to $T_{XY}$ as depicted in Figure 4.19. $T_{vid}$ abstracts all the delay introduced by the video stream on the decoding side. It can be inferred by analyzing only the video stream, it depends only on the encoding process, and it can be modeled as:
$$T_{vid} = T_{RAP} + T_{dec} \qquad (4.11)$$
$T_{RAP}$ is the time that the decoder has to wait to reach a Random Access Point (RAP). A RAP is a specific point in the video stream where it is possible to start decoding, which corresponds approximately to the beginning of the intra-coded frames. Therefore $T_{RAP}$ can be modeled as a random variable uniformly distributed between 0 and the intra frame period $T_I$, whose mean value is $T_I/2$.
$T_{dec}$ is the interval between the RAP and the moment when the frame can be presented to the user. It is equal to the stationary delay of the video decoder, i.e. $T_{YZ}$ in Figure 4.19. It represents the decoding part of the end-to-end coding delay for each of the media components (audio, video, and data) and, in MPEG-2 Transport Stream, it is:

$$T_{dec} = PTS - PCR$$
It is relevant to note that the value of $T_{dec}$ will, in general, be different for each of the elementary streams. Even though the end-to-end delay ($T_{AB} + T_{YZ}$) is constant and equal for all of them, the part of the delay left to the decoder ($T_{dec} = T_{YZ}$) usually varies strongly from one component to another. A typical example taken from a commercial encoder is shown in Figure 4.20: the audio $T_{dec}$ is constant and below 100 ms, while the video $T_{dec}$ varies over time between approximately 800 and 1400 ms.

Figure 4.20: Decoding delay (PTS-PCR) in milliseconds for video (blue) and audio (red) components of an MPEG-2 Transport Stream, and its variation along time (in seconds)
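The measurement behind Figure 4.20 reduces to a subtraction in the common 90 kHz clock base; a minimal sketch, assuming the PCR (base part) and PTS values have already been demultiplexed from the Transport Stream:

TS_CLOCK_HZ = 90000   # common PTS/PCR clock base (90 kHz units)

def decoding_delay_ms(pts, pcr_base):
    """T_dec for one access unit: distance between its presentation
    time stamp and the program clock reference at the moment the unit
    arrives, in milliseconds. E.g. a PTS 90000 ticks ahead of the
    current PCR corresponds to a stationary delay of 1000 ms."""
    return (pts - pcr_base) * 1000.0 / TS_CLOCK_HZ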
With these elements, it is possible to build a QuID which monitors the channel change
time in the network in the following way:
• $T_{term}$ and $T_{buf}$ depend on the user terminal implementation, which is the only point where they are available. However, they are normally quite stable, so they can be known a priori and introduced into the model as parameters.

• $T_{net}$, $T_{RAP}$ and $T_{dec}$ can be easily monitored in the network; a sketch of the resulting estimator follows.
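A minimal sketch of such an estimator, combining the a-priori terminal parameters with the values monitored in the network and taking $E[T_{RAP}] = T_I/2$ as derived above (function and parameter names are illustrative):

def expected_channel_change_s(t_term, t_net, t_buf, t_i, t_dec):
    """Expected zapping time, eq. (4.10), with E[T_RAP] = T_I / 2
    from the uniform model behind eq. (4.11).
    t_term, t_buf: terminal-side parameters, known a priori (s).
    t_net:  measured network request/response delay (s).
    t_i:    intra frame period of the monitored stream (s).
    t_dec:  decoder delay, measured in the network as PTS - PCR (s)."""
    t_vid = t_i / 2.0 + t_dec            # eq. (4.11), expected value
    return t_term + t_net + t_buf + t_vid

For instance, with $T_{term} = 0.1$ s, $T_{net} = 0.05$ s, $T_{buf} = 0.8$ s, $T_I = 0.5$ s and $T_{dec} = 1.2$ s, the expected channel change time is 2.4 s.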
It is worth noting that most of the components of the channel change time are frequently sacrificed in the process of enhancing the overall end-to-end quality of experience. In particular, $T_{buf}$, as mentioned in the previous subsection, represents the buffering required to absorb network jitter and to correct packet losses. $T_{RAP}$ and $T_{dec}$ also give the encoder a higher degree of freedom to distribute its bit budget flexibly, according to the coding complexity of the images, thereby optimizing the coding quality. Reducing any of those parameters, which would reduce the channel change time by the same amount, could therefore have undesired side-effects on the global quality.
Unlike the case of the global lag, channel change time is a QoE element which is relevant
for many IPTV deployments, and for all the video channels. However, the mapping of
the channel change events into a global scale of severities (or qualities) is very dependent
on the expectations of the service provider, and there is no standard way to do it. Table 4.7 shows an example that could be used as a reference, based on informal laboratory experimentation.
Table 4.7: Example Channel Change time ranges and their mapping to QoE

Time (s)    QoE description
< 0.4       Very Fast
0.4 – 1     Fast
1 – 2.5     Normal
2.5 – 5     Slow
> 5         Very Slow
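Table 4.7 translates directly into a simple mapping function; a sketch based on those example ranges:

def zapping_qoe_label(t_cc_s):
    """Map a channel change time (seconds) to the example QoE
    descriptions of Table 4.7."""
    if t_cc_s < 0.4:
        return "Very Fast"
    if t_cc_s <= 1.0:
        return "Fast"
    if t_cc_s <= 2.5:
        return "Normal"
    if t_cc_s <= 5.0:
        return "Slow"
    return "Very Slow"

# Continuing the example above: zapping_qoe_label(2.4) -> "Normal"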
4.6.3 Latency trade-offs
Since lag and channel change can be considered relevant elements for the global QoE, we
may ask whether it is possible to improve them by reducing some of their components.
The answer is that it is possible, but at a cost: degrading other QoE factors. We
will show here why.
Regarding end-to-end lag, the encoding latency $T_{AB} + T_{YZ}$ is used to provide a buffer for rate-control operations in the video encoder. Reducing this buffer will impair the video quality that the encoder is able to produce at its output. The network processing delay $T_{BC} + T_{XY}$ provides a buffer to protect the decoder against network jitter. This buffer can be reduced, but only at the cost of increasing the packet loss probability.
The channel change components $T_{buf}$ and $T_{dec}$ are $T_{XY}$ and $T_{YZ}$ respectively, so the same considerations apply. $T_{RAP}$ is also a design parameter for the encoder: if it is reduced, the frequency of I frames will increase, which will degrade the video quality (assuming, as we do, that the bitrate is kept constant).
The rest of the delay components are limited by the technology itself, and are normally
outside the control of the service provider:
• $T_{CX}$ and $T_{net}$ depend on the performance of the communication network.

• $T_{term}$ depends on the performance of the user terminal software.
As a conclusion: there is a strong relationship between the latency and the video quality
components of the QoE. Therefore latency should always be controlled in any multimedia
delivery service. Even in the cases where lag or channel change are not important by
themselves, managing latency parameters is always a good strategy. Service providers
should be aware that reducing those latency elements in the future will always come at the cost of putting the video quality at risk.
4.7 Mapping to Severity
One of the most complex problems to solve when managing a QoE monitoring system in a large multimedia service deployment is the comparison and aggregation of a large quantity of data. In our QuEM model, this problem is addressed by referring all the measures to a common severity scale and synchronizing the measurement windows, so that one single severity value is produced for each monitoring period at each monitoring point (section 3.3.2). These values should then be processed statistically according to the needs of the monitoring service, with the particularity that, even though the aggregated value has meaning only in terms of average severity, each of the individual impairment events is easily traceable to a qualitative description of what happened.
Each QuEM system should be calibrated according to the specific needs of the service
provider, and should also be modified during the operation phase with the feedback
retrieved from the field. The best way to calibrate the different QuID elements to
produce severity values is by performing subjective quality assessment tests such as the
ones described in section 3.4. This way, each service provider can feed the tests with the type of content and impairments that best fit their deployment, keeping the Severity Transfer Functions completely under their control.
The results of the subjective assessments described in Appendix A.2 can provide an initial approach to the problem, which should be used as a starting point for real deployments of a QuEM infrastructure.
Figure 4.21 shows a summary of the different results that have been discussed along this
chapter. The most relevant conclusions for each type of error have already been discussed, but we can summarize them as follows:
• Video packet losses can have very different effect depending on the part of the
stream which is lost. We have proposed a simple but effective metric (PLEP) to
model this variability.
• Audio packet losses depend mostly on the packet loss rate and pattern. We have
also modeled this in our proposal for audio loss QuID.
• Bitrate is a reasonably good proxy to monitor video coding quality in the context
of a QuEM system. The comparative effect of bitrate change and denting has been
studied. The former technique has less impact than the latter in the final QoE,
but it requires generating and transporting the different versions of the content
stream from the headend to the network edge.
• Outages can be monitored as more severe versions of the rest of the impairments, but they must be considered separately because of their high impact on the perceived quality.
• Latency effects (end-to-end lag and channel change) have to be taken into account, both for their impact on the final QoE and for their relationship to other quality issues.

Figure 4.21: Results for all the QuIDs mentioned in the chapter
Besides, the cross-analysis of different QuIDs can also provide some additional ideas:
• In case of network congestion or any other error situation, the decision of which packet or packets to discard is critical for the final impact on the Quality of Experience. Losing all the no-reference frames for six seconds (F1) has an impact similar to losing all the audio during only half a second (A2) or having relevant macroblocking (90% of the picture) for half a second (E4), and is even better than any of the video screen freezes (V1-V3). All those impairments are produced by the loss of fewer packets than F1.
• Video freezing is probably the worst artifact (relative to the minimum loss burst
needed to produce it). For this reason, it should be avoided by any means. This
is especially relevant in scenarios where the network buffer is small because a low
latency is required. In such cases, countermeasures such as bitrate drop or frame rate drop are preferable to an empty buffer resulting in the loss of the video and audio signal.
4.8 Conclusions
This chapter has presented strategies to monitor all the relevant sources of quality im-
pairments in multimedia delivery services. We have proposed metrics to analyze the
effect of packet losses in video and audio, which are currently the most frequent errors
in multimedia services; and in particular in IPTV. We have also covered the analysis and
monitoring of media coding quality, with a special focus on the strong bitrate variations
which are typical of OTT scenarios. Finally, we have analyzed the causes and effects of service outages, as well as the effects of latency on the final QoE.
All the metrics proposed in this chapter can be integrated as Quality Impairment De-
tectors in the QuEM architecture described in chapter 3. Besides, we have analyzed a set of subjective quality assessment test results which support the selection of QuIDs and provide relevant information about the relative severity of the errors under study.
The ideas discussed in this chapter suggest that, with the right knowledge of the effect of network events on QoE, it is possible to design network systems whose policies are
optimized towards the final perceived quality. The next chapter will present and discuss
some of these applications.
Chapter 5
Applications
5.1 Introduction
This chapter describes applications which, by making use of the knowledge obtained
in previous chapters about the Quality of Experience, can enhance the functionality of
existing multimedia delivery services. In fact, some of the applications described here
have been applied to products and services which are currently deployed in the field.
Section 5.2 describes a variation of the Packet Loss Effect Prediction model which can
be used to establish packet priorities in a video communication network, supporting Unequal Error Protection schemes which make the best use of the error correction capabilities of the network.
A similar idea is applied in section 5.3 to an HTTP Adaptive Streaming scenario. By
composing HAS segments in priority order (instead of in the traditional decoding order),
it is possible to react better to dynamic variations in the network effective bandwidth
without needing to increase the buffering delay excessively.
Section 5.4 describes a selective scrambling algorithm which can be used to efficiently
protect video content in scenarios where the processing power of the deciphering elements
is small. By selecting for encryption only the most relevant packets (with respect to their impact on the QoE) it is possible to get very effective protection with a low packet scrambling rate.
Section 5.5 proposes a solution to overcome the channel change limitations described in
section 4.6.
Finally section 5.6 discusses the application of the results to stereoscopic video.
5.2 Unequal Error Protection
Not all packet losses have the same impact on the QoE. For instance, the effect
of isolated packet losses in perceived video quality depends on several factors, such as
coding structure (the type of prediction in the frame or the part of the frame which gets
lost), camera motion, or the presence of scene changes, among others [86, 93]. When the
number of errors grows, the effects of those factors tend to compensate among them, so
that the impact of random errors depends mainly on packet loss rate [95] and loss burst
structure [124]. Audio packet losses have a strong impact on the perceived quality, depending mainly on the frequency and length of the bursts of lost packets, with no
significant differences between individual packets [79, 84]. When they are studied jointly,
video errors seem to be more acceptable than audio errors, except for high error rates
[57].
Most of the studies mentioned so far analyze the effect of packet losses for relatively high
loss rates. In practical situations, however, real-time video services provide a quality of
experience resulting in less than one visible error per hour, with users showing sensitivity
to higher impairment rates [7]. In terms of network quality of service, it means that
only a few packet loss bursts per hour are allowed, at most.
Home networks typically have error rates which are some orders of magnitude above
these figures, especially in the case of Wi-Fi (802.11) [97]. If the media stream is to be
delivered through the home network, the residential gateway must provide some kind
of error correction mechanism (FEC or ARQ) in order to keep the required level of
service. This protection is performed at the cost of introducing end-to-end delay in the
transmission chain [61], as well as increasing the required bandwidth.
The understanding of how packet loss can affect video and audio quality has been used
to propose several unequal error protection (UEP) schemes, where packets with a higher impact on quality are protected better [29, 66]. This allows keeping a good QoE without an excessive increase in the required protection and, consequently, in the additional delay introduced. However, they usually require an in-depth video analysis which is difficult
to integrate in cost-effective consumer electronic devices. Lightweight UEP designs also
exist, but they usually focus on the characteristics of the loss patterns and use limited
approaches to characterize the priority of the packets [12, 71].
We have shown in the previous chapter that, even with its limitations, the PLEP model
we describe is a promising approximation for blind packet loss effect estimation. How-
ever, it is based on reading and building a reference frame list for each frame. Even
Simple as it is, this could be too expensive for some applications, such as packet
QoS policies applied in routers, and it may require the use of information which is not
available in real service deployments, perhaps because the elementary video stream is
completely scrambled.
Here we will show how it is possible to strongly reduce the effect of packet losses by
applying a simplified version of the PLEP metric to label video packet priorities (and
even using a low number of bits to encode them). This technique can be applied to
congestion control in home gateways or buffer management in dynamic HTTP adaptive
streaming. In addition, it can improve other lightweight UEP schemes by enriching
their characterization of the video sequence. This approach requires low processing
capabilities while clearly outperforming a random packet drop.
The solution specifically addresses short-term protection decisions, where the error cor-
rection system has to decide which packets to protect (or which ones to drop) within
a short window of time. Thus it is especially suitable for real-time multimedia trans-
missions. This solution is applicable not only to error correction, but also to congestion
control.
5.2.1 Priority Model
5.2.1.1 Effects of packet losses
The priority model proposed is based on the fact that not all the video packets contain
the same kind of information and, therefore, the loss of different kinds of packets will produce different effects on the perceived video quality. In fact, even the loss of a single
video packet can produce a wide range of different effects, depending on the kind of
packet which is lost.
There are several factors which influence the effect of a single packet loss. They can be roughly classified into two sets: content-based (camera motion, scene changes...) and coding-based (type of video frame, position of the packet within the frame...). Only the latter are considered in this approach, since they are the ones which can be easily identified in the analysis of the coded media stream. It will be shown later that they suffice to provide good UEP performance.
The factors considered are based on the following previous knowledge:
1. The effect of a loss is higher when it is produced in a reference frame (a frame
used by the encoding system to predict the following ones), because the error will
propagate to the frames which have it as reference [95].
Table 5.1: Priority value for each slice type

NALU Type           $P_S$
IDR (I)             1
Reference (R)       0.5
No-Reference (N)    0
2. If a packet in the middle of a video slice is lost, then the rest of the slice gets
lost too, as the decoder cannot easily re-synchronize in the middle of a slice. This
is especially relevant in H.264 video, where most commercial encoders use a low
number of slices per frame (typically one). In such cases, the sooner the error occurs within a frame, the higher its impact [86].
3. If packets are lost in two different frames, their contribution to the final error (in
terms of mean square error, MSE) can be considered to be the sum, as errors are
typically uncorrelated [95].
4. Audio packet loss effects are basically related to the length and structure of the loss burst, with no meaningful differences between individual audio packets [79, 84].
5.2.1.2 Packet Priority
A packet priority model is proposed in order to assign a higher priority to packets whose loss is going to produce a stronger effect on the QoE. The model is based on the type of video slice carried by the packet and on the position of the packet within the slice (assuming that a video slice is typically carried in several transport packets). As mentioned before, losses have a higher effect in reference slices than in no-reference ones, and at the beginning of the slice and of the GOP, where error propagation effects are stronger [66, 86].
The priority model is defined as follows:
$$P = \alpha P_S + \beta H + \gamma T_S + \delta T_G \qquad (5.1)$$
where $P_S$ is the priority of the slice type as described in Table 5.1, $H$ is a flag indicating whether the packet contains a NALU (Network Abstraction Layer Unit) header, $T_S$ indicates the number of packets until the next slice in the stream, and $T_G$ is the number of packets until the next I frame. All the parameters are normalized between 0 and 1. According to their relevance, the following coefficients are selected: $\alpha = 10^3$, $\beta = 10^2$, $\gamma = 10$, $\delta = 1$.
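A minimal sketch of equation (5.1) with the Table 5.1 values, assuming the NALU-level fields have already been parsed; the normalization of $T_S$ and $T_G$ by slice and GOP lengths in packets is one possible choice, as the text only states that the parameters are normalized to [0, 1]:

# Slice-type priorities from Table 5.1 and the coefficients of eq. (5.1).
P_SLICE = {"IDR": 1.0, "REF": 0.5, "NONREF": 0.0}
ALPHA, BETA, GAMMA, DELTA = 1e3, 1e2, 10.0, 1.0

def packet_priority(slice_type, has_nalu_header,
                    pkts_to_next_slice, pkts_to_next_i,
                    slice_len_pkts, gop_len_pkts):
    """Eq. (5.1): P = alpha*P_S + beta*H + gamma*T_S + delta*T_G,
    with T_S and T_G normalized to [0, 1] by the (assumed) slice and
    GOP lengths in packets."""
    h = 1.0 if has_nalu_header else 0.0
    t_s = pkts_to_next_slice / max(slice_len_pkts, 1)
    t_g = pkts_to_next_i / max(gop_len_pkts, 1)
    return ALPHA * P_SLICE[slice_type] + BETA * h + GAMMA * t_s + DELTA * t_g

def lowest_k(priorities, k):
    """Indices of the k packets whose loss should hurt least: the
    'drop first' (or 'protect last') candidates for a UEP scheme."""
    return sorted(range(len(priorities)), key=priorities.__getitem__)[:k]

The helper at the end shows the intended use: ranking a window of packets so that a UEP module can decide which ones to drop or protect.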
Figure 5.1: Example of the packet priority model applied to one GOP of a coded video sequence
Figure 5.1 shows an example of the application of the model to a sequence of video
packets in transmission order. Each box represents an RTP packet, while different colors
represent different frames. The figure shows all the elements of the prioritization model.
$P_S$ depends on the NALU type (IDR, Reference slice or No-reference slice), indicated as I, R or N within the boxes. $H = 1$ (presence of a NALU header) is represented as a black bold frame. Finally, $T_S$ and $T_G$ are shown for the packet marked by the red circle.
Audio packets can be easily introduced in this model just by assigning them a fixed priority value $P = P_A$. In line with the idea that audio losses are more relevant than video ones, except in case of high video degradations [57], $P_A$ is set to 900. This way, audio packets have a lower priority than IDR packets (for $\alpha = 10^3$, $P_A = 0.9\alpha$), but a higher one than any other video packet. Different values could be considered depending on the specific application.
It is important to remark that this is not a scale of priorities, but only an ordering. The intention of the model is to provide a way to sort a group of packets in priority order, so that the higher the priority, the higher the impact of the packet's loss. However, there is no information about the relative magnitudes of the losses.
Another relevant property of the model is that, once the priority for each packet is
known, no more analysis is required. This allows the unequal error protection schemes
to be stateless in the following sense: the decision of whether one packet is protected or not has no effect on the priority value applied to other packets. This significantly simplifies the work of the UEP mechanisms.
Figure 5.2: Implementation of the prioritization model
5.2.1.3 Implementing the model
Figure 5.2 shows the basic implementation modules to apply the described prioritization
model to a video source. As mentioned before, the priority labeling is applied indepen-
dently from the unequal error protection mechanism itself, and before it. To each packet $x$ in the sequence, a priority $P(x)$ is assigned and signaled to the UEP module.
In the specific case of an IPTV scenario, each packet $x$ is an RTP packet containing
H.264 or MPEG-2 video, or MPEG audio (MPEG-1, AAC or similar), over MPEG-
2 Transport Stream. To assign the priorities correctly to the transport packets, it is
necessary that audio and video are carried in different packets. It is also advisable that
no packet carries data from more than one slice; which, for the typical H.264 stream
with one slice per frame, means that no packet should carry data from two or more
different frames. All these conditions are satisfied if the packing of MPEG-2 TS into
RTP is done by the rewrapper described in section 3.5.2.
Priorities assigned to packets can be signaled in the RTP header extension, so that
the network processing elements can read them and use them to apply unequal error
protection techniques. This has the advantage that the extension is transparent to other
RTP receivers, so that the application of priority labels is backwards compatible with
any RTP-aware system. This compatibility has been successfully tested with several
commercial set-top-boxes, and this use of signaling in RTP header extensions is currently deployed in the field in some commercial IPTV systems.
Other implementation options are possible. For example, priorities can be signaled using
different protocols, such as the DSCP bits of the IP header. In such cases, the number
of bits available to encode the priorities can be reduced. The next section will show that even a few bits can be enough to encode the priority in an efficient way.
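For instance, the full ordering can be collapsed into a handful of classes; a sketch of one possible 2-bit quantization, consistent with the coefficient scale chosen above (the thresholds are illustrative):

def priority_class_2bit(p):
    """Collapse the eq. (5.1) ordering into four classes that fit in
    two bits. The thresholds follow the coefficient scale used above
    (alpha = 1000 for IDR slices, P_A = 900 for audio)."""
    if p >= 1000:
        return 3   # IDR slices
    if p >= 900:
        return 2   # audio packets (fixed P_A)
    if p >= 500:
        return 1   # other reference slices
    return 0       # no-reference slices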
One of the main advantages of this model is its simplicity. This makes lightweight
implementations possible: to assign a priority to a packet, only the video NALU header
has to be read and analyzed. This way, the prioritization algorithm can be implemented
in devices with limited processing capabilities, such as home network gateways. In such
cases, the priority labeling and the unequal error protection modules would both reside
in the same hardware device.
5.2.2 Experimentation and results
5.2.2.1 Description of the experiment
To test the performance of the model, three different short video sequences (4-12 sec-
onds), encoded by commercial IPTV encoders, have been selected. They are sequences
A, B and C from Appendix A.4. All of them are encoded in H.264 over MPEG-2 TS and
packed in RTP in the way described before; with each RTP packet containing informa-
tion about part of at most one video frame. Audio is not considered in the experiment.
Within each possible window of $W$ consecutive RTP packets in the sequence, the $K$ packets with the lowest priority are discarded. Then the resulting sequence is decoded, using the repetition of the last reference frame as the error concealment strategy, and the Mean Square Error of the resulting impaired sequence, $MSE_{PRIO}$, is computed.
For the same $W$-packet window, the MSE resulting from randomly dropping $K$ packets, $MSE_{RAND}$, is also computed. The calculation of the random loss is performed by randomly selecting 1000 of all the possible combinations of $K$ lost packets within the window. If there are fewer than 1000 combinations, then all are selected. $MSE_{RAND}$ is obtained as the average of the MSE of each of the (up to) 1000 combinations.
For each window, the MSE gain is computed as

$$MSE_{gain}(\mathrm{dB}) = 10 \log_{10} \left( \frac{MSE_{RAND}}{MSE_{PRIO}} \right) \qquad (5.2)$$
Based on this, an Aggregated Gain Ratio (AGR) can be defined to measure the performance of the model. For each sequence and each pair $(W, K)$, $AGR_{W,K}(G)$ is defined as the proportion of windows whose MSE gain is equal to or greater than $G$, and it is expressed as a percentage on a 0-100 scale.
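Both figures of merit are straightforward to compute; a minimal sketch, assuming the per-window MSE values are already available:

import math

def mse_gain_db(mse_rand, mse_prio):
    """Eq. (5.2): gain of priority-based dropping over random dropping."""
    return 10.0 * math.log10(mse_rand / mse_prio)

def aggregated_gain_ratio(gains_db, g):
    """AGR_{W,K}(G): percentage of windows whose MSE gain is >= G."""
    if not gains_db:
        return 0.0
    return 100.0 * sum(1 for x in gains_db if x >= g) / len(gains_db)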
Table 5.2 shows the values of AGR for some relevant values of MSE gain, $W$ and $K$, for the three sequences under study (A, B and C), summarizing the results of the experiment.
They will be discussed and analyzed in the following subsections.
5.2.2.2 Single-packet loss
The first test considered is the case where $K = 1$, for several values of $W$. For each original sequence it is necessary to individually discard each one of the RTP packets and
then decode and process the result of that individual discard. This way, more than 1500
impaired sequences have been obtained and used for the analysis.
Table 5.2: Values of the Aggregated Gain Ratio for some relevant values of MSE gain, $W$ and $K$
The results for sequence A, $K = 1$ and several values of $W$ are shown in Figure 5.3. Each of the curves refers to a different value of $W$ and represents, for several values of MSE gain, the proportion of the sequences that obtained at least that gain value. The range of values of $W$ is selected to cover typical loss burst lengths in a wireless home
network [97].
Gains of 20 dB in MSE can be reached in from 20% of the cases ($W = 5$) up to 85% ($W = 30$), using window sizes which are reasonable for a home network device. The figure also shows that the longer the window, the better the results, since it is easier to find a low-priority packet within the window.
Figure 5.4 shows some values of MSE for sequence A, $K = 1$, $W = 15$. As can be seen, the MSE varies strongly between different windows along the sequence, independently of
the protection method used. However, focusing on any of the specific windows (any
value in the horizontal axis), using the prioritization method results in lower MSE in
almost all the cases; and in most of them this reduction is very strong. This means that
the specific error will depend heavily on the specific window which is selected but, once
the window is there (i.e., once the error is bound to happen), a good UEP decision can
mitigate the error effect dramatically.
Figure 5.3: Effect of the window size: Aggregated Gain Ratio for $K = 1$ and several values of $W$
Figure 5.4: Values of MSE for some possible windows within sequence A, comparing random packet loss (grey line) with priority-based packet loss (red line) for $K = 1$ and $W = 15$
5.2.2.3 Multiple-packet loss
The second test sets $W$ to a fixed value and analyzes the effect of the burst size by changing the value of $K$. To simplify the implementation of the test bed, the results of the different $(W, K)$ combinations have been derived from the $(W, 1)$ case of the previous section, according to the considerations described in section
5.2.1.1. This way, only the first error within a slice is considered (as the rest of the slice
is lost anyway) and errors in two different frames are assumed to be uncorrelated.
Figure 5.5: Effect of varying the loss burst size ($K$) for a window of $W = 15$ packets
Figure 5.5 shows the results for sequence A and $W = 15$. This value has been selected as representative of the range that was considered in Figure 5.3. Qualitatively, curves for other values of $W$ within that range show similar behaviors. Results from the other
sequences are summarized in Table 5.2.
When the values of $K$ are high, it can be seen that the effectiveness of the model drops, as there is very little margin to select low-priority packets. It is also interesting that the curves gradually reduce their decreasing rate. For example, Figure 5.5 shows that, for $K = 8$, only 10% of the sequences have an MSE gain between 10 and 30 dB, while 20% reach gains over 30 dB.
This behavior is due to the fact that the prioritization method concentrates errors firstly in no-reference frames (versus reference ones) and secondly at the end of the frame (versus the beginning). When the window lies entirely within one frame, the gains against the random loss are limited. However, when the window covers part of two different
the random loss are limited. However, when the window covers part of two different
frames, then the priority strategy concentrates the error in the less-impacting part of
the window, thus reaching high MSE gains. As a consequence, even for severe error
patterns, the prioritization method allows that, in a representative proportion of the
cases, the error effect is negligible.
Figure 5.6: Contribution of each term to the prioritization equation: only $P_S$ (red), $P_S + H$ (green), $P_S + H + T_S$ (cyan), and all of them (blue). Computed for $W = 15$ and $K = 1$
5.2.2.4 Contribution of each priority factor
An additional analysis of the performance of the model is represented in Figure 5.6. It
shows the contribution of each of the terms in equation (5.1) to the aggregated MSE gain
of the method. The red line represents the use of only $P_S$ as the prioritization parameter. The green line then introduces the effect of $H$ in addition to $P_S$. Afterwards the effects of $T_S$ and $T_G$ are added.
Several aspects of the graph are notable. First of all, the very simple prioritization method of just considering the frame type of the packets ($P_S$) can be good enough for some applications. Secondly, the most relevant contribution afterwards is $T_S$, which allows dramatic improvements in performance. Therefore, in addition to $P_S$ and $H$, the parameter $T_S$ should always be considered.
As the scope of the study is focused on the short term, and window sizes are therefore relatively small, there is typically only a small number of frames within each packet window. This is the main reason why the contribution of $T_G$ is so limited in the current scenario. Nevertheless, additional tests show that when the window size is enlarged, the relative weight of $T_G$ increases, supporting the choice of a model with four terms.

On reception, the client starts extracting the fragments into the buffer. Each fragment is put at its right position using the associated information (its sequence number), thereby rebuilding the recovered segment. The buffer may be consumed at the normal pace by the client (no special buffering policy is needed). If the segment is consumed before the whole segment has arrived, there will be gaps in the buffer; but they will occur in the least important positions (the ones with the lowest priority and, therefore, the lowest impact on QoE). Late arrivals are discarded.
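A minimal sketch of this client-side rebuild, assuming each received fragment carries its sequence number:

def rebuild_segment(received_fragments, total_fragments):
    """Put each received fragment back at its original position using
    its sequence number. Positions never received stay None and, by
    construction of the prioritized segment, correspond to the
    lowest-priority fragments."""
    buffer = [None] * total_fragments
    for seq, payload in received_fragments:
        if 0 <= seq < total_fragments:   # ignore corrupt sequence numbers
            buffer[seq] = payload
    return buffer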
Figure 5.8 shows a schematic diagram of the solution in an exemplary scenario. Top line
(both left and right) describes a typical segment transmission and presentation. The bottom line describes a segment transmission and presentation using our solution. The segment
has video frames (I, P, B) and audio (a) frames or, being more general, access units
(AUs). For the present explanation, in order to simplify the figure, it is assumed that
one fragment contains exactly one AU, although each AU can be divided into smaller
fragments if needed.
The left part shows the structure of the segment for transmission. In our solution
(prioritized segment), the fragments have been re-ordered in priority order; but both
segments (top line and bottom line) represent the same content. Note that the prioritized segment contains the same AUs as the regular one, but in a different order.
Now the segment is transmitted but, for some reason, the download (streaming) has to
stop in the middle (i.e. all the data under the highlighted square are lost, because they
have not been received by the end device) and it has to be sent to play out (this is the
presentation part, on the right hand side of the image).
In the regular segment case (top), the answer is simple. The client plays out the first half of the segment, and then stops (black or frozen video, and no audio either). In the prioritized segment case (bottom), it is different: the fragments carry sequence numbers, so the client (end device) re-orders them and displays them in their right positions, and only the least important packets are lost. To simplify: we have all the I and P frames, plus all the audio. The result is that the segment is played out completely, although at a lower frame rate (33%), and with all the audio. Of course, dropping the frame rate and keeping the audio is much better than losing several seconds completely. According to the subjective assessment tests described in Appendix A.2 (see also [27]), there could be a difference of 1 to 3 points on a MOS scale (1 to 5) between both approaches.
It is important to note that the creation of the prioritized segment is a decision that
can be taken prior to the knowledge of the network status between the content server
and the end device. In other words, the prioritized segment is generated once in the
server, and all the end devices download and play it. If there is no network congestion,
the experience will be the same as with the original segment: it will be correctly and
completely displayed. However, if there is a sudden network QoS drop, the end device
will have its prioritized segment available without having to do anything special on the server side.
The solution has thus the following advantages:
• It allows recovering from buffer underruns in HAS in an optimal way. That is, smaller HAS buffers can be used, thereby reducing the latency of the whole HAS solution.

• It works passively, in the sense that neither the server nor the client has to change its default behavior when facing network congestion.

• Besides, it provides a mechanism to mitigate the effect of high network rate variations.

• More generically, it makes it possible to use in HAS all the QoE enhancement technology which has been developed for real-time RTP delivery, such as video preparation for Fast Channel Change, unequal loss protection or selective scrambling; that is, this solution allows using QoE enhancement techniques in a different environment (HTTP delivery).
5.4 Selective Scrambling
The concept of selective scrambling means that, when cryptographically protecting a multimedia asset or stream, only a (typically small) fraction of the data is scrambled, whilst the rest is distributed in the clear. The reasons for such an approach are twofold: on the one side, by leaving some specific information unscrambled, intermediate video-processing systems can access the part of the data which is required for them to work correctly —the rich transport data; on the other, keeping a reduced bit rate of scrambled packets can be the only possible solution for decoding devices with limited computing power, such as user terminals. Addressing the former problem is relatively simple, as the specific data headers required by the network processors are typically well-known. The latter is more interesting, as it is necessary to find a good balance between scrambling rate and protection effectiveness.
5.4.1 Problem statement and requirements
A user who is watching a partially scrambled content asset without being entitled to it (and who therefore does not have the appropriate keys to descramble it) will experience the same effect as a user who loses (for example, due to network errors) exactly the same packets that are scrambled in the stream. From this point of view, selective scrambling can be
seen as a reverse rate-distortion optimization (RDO) problem. Unlike in the typical
RDO problem, however, the aim here is maximizing the final distortion for a specific
rate of scrambled packets. In an ideal case, the resulting distortion should be so high
that no useful data can be extracted from the content. However, for many practical
applications, it can be enough that the resulting video has a quality bad enough to discourage the potential user from watching it. The underlying idea here is that, in order to
find a good selective scrambling algorithm, techniques for Quality of Experience analysis
can be used.
Notwithstanding, the design of selective scrambling schemes must take into account the reasons why such an algorithm is required: processing the scrambled video in the network, and the low computing power available in the descrambler. Besides, using a lightweight scheme on the scrambler side as well would broaden the applicability of the scheme. Hence the requirements for the selective scrambling algorithm are to:
1. Be transparent to video servers —by leaving the “rich transport data” in the clear;

2. Scramble only a (low) percentage of the video packets;

3. Be implementable with a low computational cost; and

4. Maximize the distortion introduced by the encrypted packets (i.e., do not allow the video sequence to be recovered from the unscrambled packets other than with heavy impairment).
5.4.2 Algorithms
Most existing commercial CAS/DRM solutions fulfill requirement 1. However, they
typically rely on the encryption of the full stream. There are several solutions in the
literature that address the partial encryption of the video stream. A description of the
state of the art can be found in the work of Massoudi et al. [69], who describe a set
of encryption techniques that allow good visual degradation of encrypted video while
scrambling only part of the packets. However, all of them either require deep analysis
of the video stream (thus not satisfying requirement 3) or scramble the video headers to
make video impossible to decode (not meeting requirement 1).
Fan et al. propose encoding with higher security the most important data and with
lower security (and complexity) the less important [20]. Shi et al. divide H.264 video
elements in different classes, which are provided with different protection [100]. In
the work of Zou et al., different encryption levels can be reached by analyzing the
entropy coding of the H.264 stream [125]. These methods satisfy requirement 2, but
all of them require analyzing H.264 up to, at least, macroblock level, which might be
computationally expensive (especially when CABAC entropy coding is used, as in most
IPTV streams).
The approach we propose exploits the error resilience characteristics of video coding standards such as, but not limited to, H.264, where video frames are divided into slices. It has been shown that, when a fragment of a video slice gets lost, the rest of the slice becomes almost impossible to decode [86]. Therefore, by scrambling a small set of data in each slice it is possible to achieve a very high video degradation.
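A minimal sketch of the packet selection this implies, assuming a preceding NALU-parsing stage has already flagged, for each packet, whether it carries slice data and whether it begins a slice (the field names are hypothetical); the scrambling cipher itself is out of scope here:

def packets_to_scramble(packets):
    """Select only the packets that carry the beginning of a video
    slice payload. Since a slice cannot be decoded once its first
    fragment is missing, scrambling just these packets yields a very
    high degradation at a low scrambling rate. Each packet is a dict
    with (hypothetical) fields filled by a previous NALU-parsing step:
      'is_slice'     -> True if the packet belongs to a slice NALU
      'starts_slice' -> True if it carries the start of that slice
    The slice header itself would be kept in the clear by the
    scrambler, which is omitted here."""
    return [p for p in packets if p["is_slice"] and p["starts_slice"]]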
This solution is especially suitable for multimedia deployments because:
• Commercial encoders use a low number of slices per frame (typically one in SDTV,
4-8 in HDTV, see section 2.4.1). Thus the fraction of video packets to encrypt
(scrambling rate) is kept low.
• The information required to process video in a video server (i.e., stream and
picture-level information) is contained in other H.264 syntax elements (called
NALUs —Network Abstraction Layer Units) which are not slices, and in the header
of the slices.
• The only analysis of the video stream required for this solution is detecting the type of NAL units, detecting slices and slice headers, and reading the coding type of each frame. This can be performed in the H.264 Network Abstraction Layer, i.e., it does not require analyzing anything beyond the slice header level. This makes processing much simpler than in any other selective scrambling algorithm.
Table 5.4: Minimum scrambling rate required to completely lose the video signal, as subjectively assessed by expert viewers in the laboratory, for several content assets.
The resulting streams were then chunked into 12-second segments for the tests and processed by a rewrapper. Impairments were introduced in the first half of each of the segments.
A.2.2 Selection of impairments
The selection of impairments was done to cover a sufficient range of error cases related to
the metrics that were going to be evaluated and calibrated (the ones defined in chapter
4).
A.2.2.1 Bitrate drops
To simulate the effect of a bandwidth drop, the first half of the segment was re-encoded
using a different bitrate, which was a fraction of the original one. Two different impair-
ments were defined (called R1 and R2) as detailed in Table A.2.
Table A.2: Bitrate drops

Test                        R1     R2
Bitrate (% of reference)    50%    25%
A.2.2.2 Frame rate drops
In these test cases, the first half of the segment is transmitted using a lower frame rate,
which is a fraction of the original one. Frame rate reduction is achieved by discarding
some B frames from the original stream (denting). Two different impairments were
defined, as detailed in Table A.3.
Table A.3: Frame rate drops

Test                           F1     F2
Frame Rate (% of reference)    50%    25%
A.2.2.3 Audio losses
These impairments are implemented by discarding audio packets in the middle of the
first half of the segment. The shortest loss length, achieved by dropping a single audio
packet, produced a silence of about 200 ms. Longer lengths were achieved by dropping
consecutive packets. Test cases A5 and A6 introduced a sequence of several short losses separated by approximately 1 second. Impairments are detailed in Table A.4. The ‘total
duration’ represents the time from the beginning of the first audio mute to the end of
the last one.
A.2.2.4 Video losses: macroblocking
The macroblocking effect caused by a transmission loss can be roughly characterized
using three parameters:
Table A.4: Audio losses

Test                  A1     A2     A3    A4    A5     A6
Loss length (s)       0.2    0.5    2     6     0.2    0.2
Loss events           1      1      1     1     3      7
Total duration (s)    0.2    0.5    2     6     2      6
• The fraction of the picture affected (position of the loss within the frame).
• The duration of the artifact due to error propagation (position of the loss within
the GOP).
• The loss pattern (i.e. the effect of losing several packets in several frames).
To simplify the experiment, the following restrictions were imposed on the test cases:

• There would be at most one packet loss in each GOP.
• Loss patterns would be established by introducing the same type of packet loss in
several consecutive GOPs.
Impairments are detailed in Table A.5. ‘MIN’ means that the impairment occurred in a
no-reference frame, and therefore its effect did not propagate through the GOP.
Table A.5: Macroblocking errors

Test              E1     E2    E3    E4     E5    E6    E7    E8
% of Frame        100    25    50    100    50    50    50    50
% of GOP          MIN    90    90    90     90    90    25    25
Number of GOPs    1      1     1     1      3     5     3     5
The rationale for this selection of impairments is the following:
• E1 — Verify that the loss of isolated no-reference frames has no effect on the perceived quality.
• E2–E4 — Analyze the effect of single packet losses.
• E5–E8 — Analyze the effect of multiple packet losses.
A.2.2.5 Video freezing
Video freezing was achieved by the loss of a single I frame (or its header), so that the
whole picture remains still until the beginning of the next GOP. The lengths of the freezes were selected as multiples of the GOP length (half a second), as shown in Table A.6.
Table A.6: Video freezing

Test                   V1     V2    V3
Freeze duration (s)    0.5    2     6
A.2.2.6 Impairment sets
The selected impairments were structured into impairment sets: groups of related impairments, as described in Table A.7. ‘N’ represents a hidden reference (no
impairment). ‘AV’ is the combination of A4+V3 (6 seconds audio mute and video freeze,
i.e., a 6-second full outage).
Table A.7: Impairment sets

Impairment Set     Freq.   Impairments     Description
Rate Drop          3       R1 R2 F1 F2     Reaction to bandwidth changes
Audio Loss 1       3       A1 A2 A3 A4     Audio mute length
Audio Loss 2       3       A3 A4 A5 A6     Continuous vs. periodic mutes
Macroblocking 1    3       E1 E1 N N       Detectability of no-reference loss
Macroblocking 2    3       E3 E4 E5 E6     Impairment duration
Macroblocking 3    3       E5 E6 E7 E8     Effect of % of GOP affected
Single Loss        5       V1 E2 E3 E4     Effect of a single video packet loss
Outage 1           1       V2 V3 A3 A4     Audio vs video outages
Outage 2           1       V3 A4 AV AV     Audio vs video vs both
The ‘Freq.’ (frequency) label indicates the number of times that each impairment set
appears in each test sequence. The sum of all the frequencies is 25, which means that
25 different impairments were introduced in each test sequence: one impairment every 12 seconds.
For each of the three video test sequences (movie, sports and documentary), the following
steps were followed:
1. Each segmented sequence was replicated 4 times, to create 4 different variants.
2. The 25 occurrences of the impairment sets were randomized, as well as the 4 dif-
ferent impairments within each set. This way, 4 different sequences of impairments
were generated, each one having 25 impairments.
3. Each sequence of impairments was applied to each of the variants, i.e., impairments were introduced in the first halves of the segments accordingly.

Figure A.1: Structure of the content streams in the subjective assessment test session
The resulting sequences have the structure shown in Figure A.1, where the impairments
introduced in each of the evaluation periods $T_i$ belong to the same impairment set. Table
A.8 shows an example of some of them —they are the first 13 impairments introduced
in each of the variants of the sports sequence in the final tests.