Menkovski, V., Exarchakos, G., Liotta, A., & Sánchez, A. C. (2010). Quality of Experience Models for Multimedia Streaming. International Journal of Mobile Computing and Multimedia Communications (IJMCMC), 2(4), 1-20. doi:10.4018/jmcmc.2010100101

Quality of Experience Models for Multimedia Streaming

Vlado Menkovski (1), Georgios Exarchakos (1), Antonio Liotta (1), Antonio Cuadra Sánchez (2)

(1) Electrical Engineering Department, Eindhoven University of Technology, Eindhoven, the Netherlands

(2) Telefonica R&D, 6 Emilio Vargas, 28043 Madrid, Spain

Abstract

Understanding how quality is perceived by the viewers of multimedia streaming services is essential for the efficient management of those services. Quality of Experience (QoE) is a subjective metric that quantifies the perceived quality and is therefore crucial in the process of optimizing the tradeoff between quality and resources. However, accurate estimation of QoE usually entails cumbersome subjective studies that are long and expensive to execute. This paper presents a QoE estimation methodology for developing Machine Learning prediction models based on initial, restricted-size subjective tests. Experimental results on subjective data from streaming multimedia tests show that the Machine Learning models outperform other statistical methods, achieving accuracy greater than 90%. These models are suitable for real-time use due to their small computational complexity. Even though they have high accuracy, these models are static and cannot adapt to changes in the environment. To maintain the accuracy of the prediction models, we have adopted Online Learning techniques that update the models on data from subjective viewer feedback. Overall, this method provides accurate and adaptive QoE prediction models that can become an indispensable component of a QoE-aware management service.

Keywords: Quality of Experience, QoE, Machine Learning, Online Learning, QoE-aware management, User-centric management

1. Introduction

Advances in telecommunication systems open opportunities for multimedia services of higher quality, which previously demanded excessive network resources. As these services become more common, many service providers face the problem of managing them efficiently. Streaming multimedia services such as IPTV or video conferencing have high resource demands and stringent requirements. Efficient management of multimedia services depends on understanding the value they bring to the viewers, which in turn depends on the service's perceived quality. The perception of quality of a particular multimedia service is closely related to many factors, such as image fidelity, image resolution, type of device, content, and audio fidelity. Traditional approaches to network service management focus solely on transport and encoding quality, such as the Quality of Service parameters, and neglect these additional factors. Thus, many multimedia services are under- or over-provisioned: the encoding parameters are not adapted to the presentation device or the type of content, and none take into account the user's expectations. As this data-centric service management is not aware of the customer's perceived quality, it cannot be as efficient with the system's resources as a user-centric management approach. To improve service management, a shift to a user-centric, or user-aware, multimedia service management is necessary (Agboma & Liotta, 2008).


To execute user-centric management, a model of the perceived quality, or Quality of Experience (QoE), is necessary. QoE is a subjective metric that quantifies the perceived quality of a service by its viewers. As such, QoE needs to correlate the numerous parameters that affect the perceived quality, such as the encoding, transport, content, and type of terminal, as well as the user's expectations (Agboma & Liotta, 2007). The QoE management approach aims at maximizing the perceived quality for the viewers while minimizing the impact on the system's resources.

Since QoE is a subjective metric, the most accurate estimation method is the execution of subjective studies and the calculation of Mean Opinion Score (MOS) values from user feedback. However, subjective studies involve a complex procedure for selecting an appropriate and statistically viable testing group and the exact test conditions. Organizing such tests is a cumbersome and expensive effort and is not feasible for live streaming content. The method proposed here builds QoE prediction models based on data from initial, limited subjective studies. The prediction models are built with Machine Learning (ML) algorithms that yield highly accurate QoE models requiring low processing power to execute.
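As a simple illustration of the MOS computation step, a sketch follows; the rating scale and the scores are hypothetical, not data from any study in this paper:

```python
# Illustrative MOS computation: average the opinion scores collected from
# viewers for one test condition (1 = Bad ... 5 = Excellent).

def mean_opinion_score(ratings):
    """Average of the individual opinion scores."""
    return sum(ratings) / len(ratings)

ratings = [4, 5, 3, 4, 4, 2, 5]               # hypothetical scores for one clip
print(round(mean_opinion_score(ratings), 2))  # 3.86
```

In practice each test condition (bitrate, framerate, terminal, content type) would receive its own MOS, computed over all participating viewers.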

To sum up, the proposed QoE prediction models aim at maintaining high perceived quality at the user end with a low resource cost on the delivery network. The product of these models is a QoE estimate to be used for fine-tuning service parameters. This can be an ongoing process, as viewers' expectations or conditions change over time with the introduction of new content and viewing devices. Such environment alterations cause the accuracy of the models to drop, rendering new subjective tests necessary. Online Learning techniques, adopted for maintaining the prediction accuracy, can help with building adaptive models updated with subjective feedback from the users. Our results show that the models adapt quickly to changes (with a small number of feedback responses) and reach high accuracy, closely comparable to the static models. Overall, this method provides a means of estimating the QoE of multimedia content using prediction models, and the ability to keep those models accurate in dynamic environments.

2. QoE Measurement Methods

Perception of quality in multimedia streaming is receiving more attention with the proliferation of high-quality multimedia technologies. The reason for this is twofold: on the one hand, users' expectations of quality have increased and, on the other, managing the larger demand on network resources has become a more pressing matter. There are many aspects and dimensions of perceived quality, but the focus of this paper, regarding multimedia services, is QoE as a comprehensive quality metric.

A number of definitions for QoE exist, but in general most descriptions agree that QoE is what the end user experiences while using the service. Even though there is more or less an agreement on how to define QoE, there is no single view on how to measure it. Many different approaches have been proposed, some of which are fundamentally different. Though there is agreement that QoE is a subjective metric (Takahashi, Hands, & Barriac, 2008), due to the drawbacks of subjective analysis (Winkler, 2007) most efforts focus on objective methodologies for measuring QoE. The typical approach is to estimate the quantity of errors that appear in the presented content; these errors are due to compression artifacts, transport impairments, and combinations of both.

There is a wide range of models for QoE estimation; the focus of these models is spread over different points in the path of the content from creation to presentation. Due to this diversification, some standardization bodies have attempted to standardize these models. In (Takahashi et al., 2008) the International Telecommunication Union (ITU) presents a classification of the different objective quality assessment models into the following types: media-layer, parametric, bit-stream, and hybrid models. This classification is based on the model's point of interest. The media-layer models focus on the media signal and use knowledge of the Human Visual System (HVS) to predict the subjective quality of video. The parametric models predict the quality by looking at protocol information and network statistics, acquired with non-intrusive probes. The bit-stream models derive the quality by analyzing content characteristics collected from the coded bit-stream information. In a survey of different video quality methods (Winkler, 2009), the author concludes that there are many methods and algorithms for QoE estimation but no standard for comparing their capabilities.

Models that only look at the fidelity of the audio and video estimate the QoE based on the signal distortion. This methodology remains oblivious to the content of the media as well as to the workings of the HVS. Typical examples are pixel-to-pixel comparisons such as the Peak Signal-to-Noise Ratio (PSNR) and Mean Squared Error (MSE) methods. The drawback of these methods is that they compare the signals without any understanding of the HVS or of how the content is perceived (Winkler, 2007). There are many cases in which PSNR delivers far from accurate results. One of the simplest examples is shifting the image in any direction by one pixel: this lowers the PSNR value significantly, but the quality perceived by a human remains basically intact.
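The pixel-shift weakness can be demonstrated in a few lines; the random synthetic frame below is an illustrative stand-in for real video content:

```python
# PSNR of an image against a one-pixel-shifted copy of itself: the shift is
# visually negligible but the PSNR drops sharply.
import numpy as np

def psnr(ref, img, peak=255.0):
    mse = np.mean((ref.astype(float) - img.astype(float)) ** 2)
    if mse == 0:
        return float("inf")          # identical images: no distortion measured
    return 10 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
shifted = np.roll(frame, 1, axis=1)  # shift the whole frame one pixel right

print(psnr(frame, frame))            # inf
print(psnr(frame, shifted))          # low PSNR despite intact perception
```

On natural images the effect is the same in kind: a one-pixel translation produces a large MSE while the viewer sees essentially the same picture.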

The media-layer models focus on the content of the stream itself. They implement objective perceptual video quality measurement by modeling the workings of the HVS (Winkler, 2005). These models are computationally expensive because they need to execute an in-depth analysis of the media content. Even though media-layer models perform better than methods focused purely on data fidelity, no objective method has yet been developed that takes into account all the physiological and psychological aspects of quality perception (Wang, Bovik, & Lu, 2002).

Modeling the effects that transport has on the delivered quality means looking at the Quality of Service (QoS) parameters. On its own, this approach is not very efficient and yields weaker results (Siller & Woods, 2003). The authors of (Siller & Woods, 2003) propose looking at the problem in three layers. The bottom layer is the network layer, which produces the QoS parameters, or more precisely the Network QoS (NQoS) parameters. The layer above is the application layer, concerned with parameters such as resolution, frame rate, color, and codec type; these are referred to as Application QoS (AQoS). The third, top layer is the perception layer, which is driven by the human perception of the multimedia content and is concerned with spatial and temporal perception and acoustic bandpass (Siller & Woods, 2003). The QoE, which is measured at the top layer, is a function of both AQoS and NQoS (1).

QoE = f(AQoS, NQoS)    (1)

In the proposed framework (Siller & Woods, 2003), the authors argue that arbitrating all of the QoS parameters together is significantly more effective in maximizing the QoE than considering each of them individually.

Due to the subjective nature of QoE, the most accurate way to measure it is by executing a subjective test. Subjective studies are of significant importance because they can accurately convey the satisfaction of the viewers with the service; this is why subjective tests are commonly used for comparing the capabilities of different QoE estimation methods. Subjective testing usually entails tests in a tightly controlled environment with a carefully selected group of subjects representing the population that uses the service. Guidelines for the execution of different subjective studies are provided by the ITU (ITU-T, 1999).


The drawbacks of subjective studies are obvious from their description: they require significant effort and resources for their design and execution. (Agboma & Liotta, 2007) presents a method that relies only on initial, limited subjective tests. From the results of these tests, statistical models are built that can predict the QoE on unseen cases. This approach minimizes the need for cumbersome subjective studies while still providing estimation based on the users' subjective feedback. Its weakness is the limited statistical method used for building the prediction models. The following sections present the design of prediction models using ML techniques that achieve significantly more accurate estimations of QoE. By expanding this method with ML Online Learning techniques based on continual viewer feedback, the prediction models become more flexible and adaptable to changes in the environment.

Figure 1. QoE Prediction Method

3. QoE Prediction Models

The method of estimating QoE using prediction models consists of gathering subjective data, feeding it into the prediction-model induction algorithm, and using the resulting model to predict the QoE (Figure 1). The first phase is data acquisition. In (Agboma & Liotta, 2007) and (Agboma & Liotta, 2008) the subjective tests are based on the Method of Limits (Fechner, Boring, Adler, & Howes, 1966). This method detects quality thresholds by changing a single stimulus in successive, discrete steps; a series terminates when the change of the stimulus becomes detectable, that is, when, according to the user, the quality decreases below the satisfactory level. The aim is to determine the users' thresholds of acceptability for the QoS parameters, taking into account content and terminal type. Table 1 depicts the different video samples that the users were subjected to: each segment is a row in the table, with its corresponding QoS parameters given in the columns.

Table 1. Test-bed combinations for descending series: 3G Mobile Phone (176 x 144 image size), PDA (320 x 240 image size), Laptop (640 x 480 image size)

3G Mobile Phone:

Segment   Time (seconds)   Video bitrate (kbit/s)   Audio bitrate (kbit/s)   Frame rate
1         1-20             384                      12.2                     25
2         21-40            303                      12.2                     25
3         41-60            243                      12.2                     20
4         61-80            194                      12.2                     15
5         81-100           128                      12.2                     12.5
6         101-120          96                       12.2                     10
7         121-140          64                       12.2                     6
8         141-160          32                       12.2                     6

PDA and Laptop (identical parameter series for both terminals):

Segment   Time (seconds)   Video bitrate (kbit/s)   Audio bitrate (kbit/s)   Frame rate
1         1-20             448                      32                       25
2         21-40            349                      32                       25
3         41-60            285                      32                       20
4         61-80            224                      32                       15
5         81-100           128                      32                       10
6         101-120          96                       32                       10
7         121-140          64                       32                       6
8         141-160          32                       32                       6

Figure 2. QoE Levels for different content types on all terminals

The results of the subjective tests given in Figure 2 show the dependency of the QoE on the type of terminal as well as on the content. For instance, users' expectations for content such as football are different on the different terminals. On the other hand, content with low dynamics, such as a news broadcast, has high perceived quality even with low bandwidth consumption on all devices. In addition, the audio quality is significantly more important than the video quality for most content types, but particularly for news (Agboma & Liotta, 2008).

It is of great importance to capture these correlations in order to derive the dependencies of the QoE on the QoS and other parameters. In (Agboma & Liotta, 2008) and (Agboma & Liotta, 2007), the correlations are captured with statistical models, as discussed in the next section.

3.1 Statistical models for QoE

Statistical analysis of the subjective tests is a technique that can be used to build models from the subjective data. In (Agboma & Liotta, 2008) the Discriminant Analysis method (Klecka, 1980) was used to build the prediction models. This method builds a linear function for each class, or label, with which a data point can be associated; the class whose function returns the maximum value is assigned to the data point. In this analysis the data was divided into subsets for each terminal and then into smaller subsets for each content type. The discriminant functions are built on two input parameters: Video Bitrate and Video Framerate.

In Figure 3, two pairs of classification discriminant functions are given: one for the news content and one for the action movie, both for the mobile terminal. Given the bitrate and the framerate of a video, these functions, h(acceptable) and h(unacceptable), calculate the acceptance degree of the video. If, for example, h(acceptable) is larger than h(unacceptable), the news broadcast is predicted to have acceptable QoE, and vice versa.
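Applied as described, the Figure 3 functions for news content on the mobile terminal reduce to a comparison of two linear expressions; the bitrate and framerate test values below are illustrative assumptions:

```python
# The mobile-terminal "News" classification functions from Figure 3: compute
# h(acceptable) and h(unacceptable) and assign the class whose function is larger.

def classify_news_mobile(video_bitrate, fr):
    h_acceptable = -5.699 + (-0.080 * video_bitrate) + (1.613 * fr)
    h_unacceptable = -4.223 + (-0.104 * video_bitrate) + (1.613 * fr)
    return "acceptable" if h_acceptable > h_unacceptable else "unacceptable"

print(classify_news_mobile(video_bitrate=128, fr=12.5))  # acceptable
print(classify_news_mobile(video_bitrate=32, fr=6))      # unacceptable
```

Note that for this pair the framerate terms are identical, so the decision effectively hinges on the bitrate alone.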

That work generated prediction models for all of the listed content types on the three terminals. The accuracies of those models, validated with the leave-one-out method and averaged per terminal, are (Agboma, 2009):

• Mobile phones: 76.9%
• PDA: 86.6%
• Laptop: 83.9%
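The leave-one-out validation used to obtain such figures can be sketched as follows; the tiny dataset and the 1-nearest-neighbour stand-in model are purely illustrative, not the classifiers or data of the paper:

```python
# Leave-one-out validation: hold each datapoint out once, train on the rest,
# and report the fraction of held-out points classified correctly.

def leave_one_out_accuracy(data, predict_fn):
    correct = 0
    for i, (x, y) in enumerate(data):
        training_set = data[:i] + data[i + 1:]   # all points except the i-th
        correct += predict_fn(training_set, x) == y
    return correct / len(data)

def predict_1nn(training_set, x):
    """Class of the nearest training point (squared Euclidean distance)."""
    nearest = min(training_set,
                  key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]

# (bitrate, framerate) -> acceptability, illustrative values
data = [((384, 25), "acceptable"), ((303, 25), "acceptable"),
        ((96, 10), "unacceptable"), ((32, 6), "unacceptable")]
print(leave_one_out_accuracy(data, predict_1nn))  # 1.0
```

The same harness works for any model: substitute the discriminant functions or a decision tree for `predict_1nn`.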

The following section carries on with the motivation and implementation of a different analysis method for building QoE prediction models, based on Machine Learning techniques, which shows superior accuracy.

3.2 ML QoE Prediction Models

Machine Learning is mainly focused on developing algorithms for the induction of models from training data. These models are commonly used for pattern recognition and decision support. As is the case here, these techniques build models for the estimation of QoE based on subjective feedback data.

News
h(acceptable) = -5.699 + (-0.080 × VideoBitrate) + (1.613 × FR)
h(unacceptable) = -4.223 + (-0.104 × VideoBitrate) + (1.613 × FR)

Action Movie
h(acceptable) = -9.735 + (-0.111 × VideoBitrate) + (2.411 × FR)
h(unacceptable) = -4.223 + (-0.104 × VideoBitrate) + (1.613 × FR)

Figure 3. Classification functions created using the discriminant analysis method

The algorithms used belong to the group of supervised learning algorithms. The training data is classified by a human or acquired experimentally. In our case, the two classes to which the data instances belong are "acceptable" and "not acceptable", referring to the expected QoE. The algorithms need to infer the correlation between the input parameters and the output and build the classification rules into the models. The models essentially map each combination of input parameters to a class value. The training dataset does not contain all possible combinations of input parameter values; hence, the ML algorithm needs to infer the decision rules from the available data. Moreover, the model should not be overly specific, so that it does not adopt errors or noise from the data: subjective information from different people may contain inconsistencies due to input noise or, more precisely, errors in measurement. The overall goal of the ML algorithm is to build an efficient model that classifies unseen data as fast as possible with maximum precision.

We present two classification algorithms for offline, batch-based training: one function-based, which builds a classification hyperplane, and one tree-based, which builds a decision tree. For online classification we further explore an ensemble model and an updatable decision tree.

Function based: Support Vector Machine

The first algorithm is the Support Vector Machine (SVM). The SVM is a function-type algorithm that works by first plotting the data in an n-dimensional space, n being the number of attributes. In the case of nominal (or discrete) attributes, the algorithm creates an axis for each value of the nominal attribute. In our case the terminal type is a nominal attribute, and the algorithm creates one variable for each of its three values. These variables take Boolean values (0/1) depending on the presence of that particular nominal value.

After plotting the data in the n-dimensional space, the SVM algorithm builds a hyperplane which separates the data in an optimal manner with regard to the two classes (Vapnik, 1982). If there are more classes, the SVM generates one hyperplane for each pair of classes. Substituting in the values of its attributes places a particular data point above or below the hyperplane; that is, it belongs to one or the other class.

The particular implementation of SVM used here is called Sequential Minimal Optimization (Platt, 1998).
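A minimal sketch of this step follows, using scikit-learn's SVC (whose solver is SMO-based) in place of the paper's own implementation; the one-hot encoding of the nominal terminal attribute mirrors the description above, and the training rows are illustrative, not the paper's dataset:

```python
# Nominal "terminal" attribute expanded to three 0/1 variables, then a linear
# SVM separates the "acceptable" and "not acceptable" QoE classes.
from sklearn.svm import SVC

TERMINALS = ["mobile", "pda", "laptop"]

def encode(terminal, video_bitrate, framerate):
    one_hot = [1 if terminal == t else 0 for t in TERMINALS]
    return one_hot + [video_bitrate, framerate]

X = [encode("mobile", 384, 25), encode("mobile", 32, 6),
     encode("pda", 448, 25), encode("pda", 64, 6),
     encode("laptop", 448, 25), encode("laptop", 96, 10)]
y = ["acceptable", "not acceptable",
     "acceptable", "not acceptable",
     "acceptable", "not acceptable"]

clf = SVC(kernel="linear").fit(X, y)        # build the separating hyperplane
print(clf.predict([encode("mobile", 303, 25)])[0])
```

A new data point is classified by the side of the hyperplane it falls on, exactly as described above.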

Decision Tree based approach

The decision tree algorithm used is C4.5 (Quinlan, 2003), an extension of the ID3 (Quinlan, 1983) tree-induction algorithm. Both algorithms work by recursively splitting the initial dataset into smaller subsets so that each subset contains, as far as possible, datapoints of the same class. Each split is done on a single attribute by its value; the algorithm searches for the split that brings the maximum information gain. These steps are repeated for each subset until a subset is clean enough with regard to class association, at which point the node is declared a leaf and the dominant class is associated with it. When the process of building the tree has finished, C4.5 prunes the decision tree to achieve a more general model, independent of data specifics such as measurement errors or noise.

After the decision tree is finalized, the classification process begins with inputting a new unseen case. The values of the attributes for this case are tested on each node of the decision tree starting from the top one. Depending on the result of each test we follow one or another branch of the decision tree until we reach a leaf. The class association of the leaf is the classification output of the decision tree.
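The information-gain criterion at the heart of this splitting can be sketched for a single numeric attribute; the bitrate/acceptability data is illustrative, and full C4.5 adds recursion over subsets, the gain-ratio refinement, and pruning:

```python
# One ID3/C4.5-style split: try each candidate threshold on "bitrate" and
# keep the one with maximum information gain (entropy reduction).
import math

def entropy(labels):
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def information_gain(values, labels, threshold):
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

bitrates = [384, 303, 243, 96, 64, 32]
labels = ["acc", "acc", "acc", "unacc", "unacc", "unacc"]

best = max(bitrates, key=lambda t: information_gain(bitrates, labels, t))
print(best, information_gain(bitrates, labels, best))  # 96 1.0
```

Splitting at 96 kbit/s yields two pure subsets, so the gain equals the full initial entropy of 1 bit.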

3.3 Online Prediction Models

The Online Learning algorithms are a subset of the supervised learning group of ML algorithms. These algorithms build the models in an online fashion, reading one datapoint at a time. They only have a partial view of the data, which leads to lower accuracy of the learned models. The strength of the Online Learning approach is that it does not need to keep a large training set in memory and it can adapt the model to changes in the environment. For the current work, an Online Learning algorithm can adapt the QoE prediction models based on feedback from the viewers. Changes in the environment can occur when new content or new devices are introduced, or when users' expectations shift; such changes can be quite frequent in this setting. To circumvent the need to redo the subjective studies and recreate the models, we explore the use of Online Learning techniques.

For the implementation of the Online Learning scheme we explored different algorithms and decided to use an Oza Bagging ensemble with a Hoeffding Option Tree classifier. The theoretical background and descriptions of these algorithms follow in the remainder of this section.

Hoeffding Trees and incremental tree induction

Hoeffding Trees (Domingos & Hulten, 2000) are a model designed to handle extremely large training sets, commonly so large that the training data cannot be kept in memory; it is instead processed from the input stream in a single pass. The fact that the data is processed sequentially, one datapoint at a time, characterizes this approach as Online Learning. The learner in this case has only a partial view of the data; this means that the attribute tested in a node cannot be selected with full confidence for any split criterion, but has to be selected with a more relaxed one. The Hoeffding Tree approaches the issue of the number of examples needed to make a split decision by relying on a statistical result known as the Hoeffding bound. Given n observations of a random variable r with range R, and the calculated mean r̄ of those observations, the Hoeffding bound states that, with probability 1 − δ, the true mean of the variable is at least r̄ − ε, where (2)

ε = sqrt( R² ln(1/δ) / (2n) )    (2)

Defining the attribute selection criterion as G(·), let ΔG = G(Xa) − G(Xb) > 0, assuming that attribute Xa is more favorable (i.e., has larger information gain) than Xb. Given the desired δ, the Hoeffding bound guarantees that Xa is the better selection with probability 1 − δ once n examples have been seen such that ΔG > ε.
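Equation (2) is straightforward to evaluate; the values of R, δ, and n below are illustrative, not parameters from the paper's experiments:

```python
# Evaluating the Hoeffding bound of equation (2).
import math

def hoeffding_epsilon(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))"""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# the bound tightens as more examples are seen, so a split can be accepted
# once the observed gain difference exceeds this shrinking epsilon
for n in (100, 1000, 10000):
    print(n, hoeffding_epsilon(R=1.0, delta=1e-6, n=n))
```

For information gain on a two-class problem, R is 1 bit, which is why R = 1.0 is used in the example.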

Hoeffding Option Trees

Option trees generalize regular decision trees by adding a new type of node, the option node (Kohavi & Kunz, 1997). Option nodes allow several tests instead of a single test per node; effectively, multiple paths are followed below an option node, and classification is commonly done by majority voting over the different paths. Option Decision Trees can reduce the error rate of Decision Trees by combining the predictions of multiple models while still maintaining a single compact classifier. For the proposed methodology, the combination of the predictions of the different paths is done with weighted voting (Pfahringer, Holmes, & Kirkby, 2007), which sums up the individual probability predictions for each class.

Hoeffding Option Trees with functional leaves

The usual way a decision tree is built is by assigning a fixed class to each leaf during training; this class is the class of the majority of the training datapoints that reach the node. In another approach, the leaves are not associated with a constant class but are functional: each has a simple classifier assigned to it, trained on the data that falls on that leaf. This approach can outperform both a standalone decision tree and a standalone classifier (Kohavi, 1996). In further research on functional leaves, the authors of (Gama, Rocha, & Medas, 2003) show that, for incremental learning models, naïve Bayes classifiers used as functional leaves improve the accuracy over the majority-class approach. However, this cannot be taken as a rule of thumb: there are exceptional cases, shown in (Holmes, Kirkby, & Pfahringer, 2005), where a standard Hoeffding Option Tree outperforms the tree with functional leaves. The authors of (Holmes et al., 2005) propose an adaptive approach, whose algorithm decides between the functional and majority-vote schemes based on their observed performance. That is the implementation adopted in the current work.

Oza Bagging

Ensemble methods have recently shown increasingly high potential across many research efforts. To exploit this, the proposed methodology adds the Oza Bagging algorithm to its toolkit. Ensemble methods deploy multiple classifiers trained with different strategies and combine their predictions into a more accurate group prediction; their strength lies in improving the generalization capabilities of the standalone classifier (Breiman, 1996). Bagging is a bootstrap-aggregation procedure in which each base classifier of the ensemble is trained on a bootstrap sample of the dataset D, sampled uniformly with replacement. Online bagging (Oza & Russell, 2001) modifies this method for streaming data in the following manner: each data example (x, y) is presented to a base classifier K times, where K is a random variable with a Poisson(1) distribution. Oza and Russell (2001) show that the online bagging classifier converges to the batch bagging classifier (under certain conditions) as the number of training examples tends to infinity.
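The online bagging update can be sketched as below. The majority-class base learner is a toy stand-in for a Hoeffding tree, used only to keep the example self-contained; the Poisson(1) draw is the essential step from Oza and Russell (2001):

```python
import math
import random
from collections import Counter

class MajorityLearner:
    """Toy incremental base learner: predicts the majority class seen.
    A real deployment would use a Hoeffding (Option) Tree here."""
    def __init__(self):
        self.counts = Counter()
    def learn(self, x, y):
        self.counts[y] += 1
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

def poisson1(rng):
    """Draw from a Poisson distribution with mean 1 (Knuth's method)."""
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

class OzaBagging:
    """Each new example is shown to every base learner k ~ Poisson(1) times."""
    def __init__(self, n_models=10, seed=0):
        self.rng = random.Random(seed)
        self.models = [MajorityLearner() for _ in range(n_models)]
    def learn(self, x, y):
        for m in self.models:
            for _ in range(poisson1(self.rng)):
                m.learn(x, y)
    def predict(self, x):
        votes = Counter(m.predict(x) for m in self.models)
        return votes.most_common(1)[0][0]

ens = OzaBagging()
for x, y in [((64, 6), "no"), ((128, 12.5), "yes"), ((128, 10), "yes")]:
    ens.learn(x, y)
print(ens.predict((128, 12.5)))
```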

Online QoE Prediction Approach

The previous sections outlined a toolkit of Online Learning algorithms that can quickly build accurate prediction models from streaming data and adapt to changes in the distribution of that data. This toolkit is used to develop an Online Prediction method for QoE. The objective is to estimate the QoE of a service during its execution, instead of running full-size subjective tests beforehand. To measure the QoE under different service parameters and conditions, the QoE prediction models rely directly on user feedback. This method is particularly useful for live real-time services, for which a priori subjective tests on the content are difficult to implement, but also for other services that rely on subjective perception of quality. A mechanism to gather user feedback at service run time is necessary. The proposed method, depicted in Figure 4, relies on any available system data, usually QoS data from the streaming setup as well as from network probes. This data is mapped to user feedback to produce training datapoints: each time user feedback is received, it is paired with the corresponding QoS data (x being QoS and y being QoE) and fed to the learner engine. The learner is an Online Learning algorithm that builds a prediction model from this data. Meanwhile, the built-up prediction model is used to predict the QoE from the available QoS data.

Figure 4. Online QoE Prediction approach

In other words, that model estimates the QoE based on available system data. It improves its performance over time as more and more feedback is available. Furthermore, if changes in the environment happen, such as introduction of different terminal types, or different content, the model adapts as soon as there is feedback for the new conditions.

This method provides an accurate and flexible way to measure QoE on systems where viewer feedback is available, without engaging in comprehensive subjective testing.
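The data flow just described can be sketched as follows. The function names and the 1-nearest-neighbour "learner" are illustrative stand-ins (the actual learner engine is a Hoeffding Option Tree); the point is the loop: QoS sample plus user feedback produces a training point, which updates the model that meanwhile serves predictions:

```python
class OnlineLearner:
    """Toy stand-in for the learner engine (1-nearest-neighbour)."""
    def __init__(self):
        self.seen = []  # (qos_vector, qoe_label) training points
    def learn(self, x, y):
        self.seen.append((x, y))
    def predict(self, x):
        if not self.seen:
            return None
        nearest = min(self.seen,
                      key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
        return nearest[1]

model = OnlineLearner()

def on_user_feedback(qos_sample, qoe_label):
    # Viewer feedback is mapped to the QoS data measured at that
    # moment (x = QoS, y = QoE) and fed to the learner.
    model.learn(qos_sample, qoe_label)

def estimate_qoe(qos_sample):
    # Meanwhile, the built-up model predicts QoE from QoS alone.
    return model.predict(qos_sample)

on_user_feedback((64, 6), "unacceptable")    # (bitrate kb/s, framerate)
on_user_feedback((128, 12.5), "acceptable")
print(estimate_qoe((110, 10)))  # -> acceptable
```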


4. Experimental Setup and Results

To demonstrate the capabilities of the ML method for QoE prediction, we implemented models from subjective data using two batch learning methods and one online learning method.

Data Adaptation

The data of the subjective tests was acquired from the questionnaires (Agboma, 2009), grouped by terminal type and content type. The first step of this approach is to rearrange the data into a more compact format. A sample of each dataset is presented in Tables 2a, 2b, and 2c. Each type of multimedia content has different Spatial and Temporal Information, given in the first two columns of Table 2.

The datasets are randomized and ready for use as training data for classification algorithms. We used the Weka ML suite (“Weka 3 - Data Mining with Open Source Machine Learning Software in Java,” n.d.) for building the ML models. Weka contains an implementation of the C4.5 algorithm, called J48, as well as an implementation of the SMO algorithm.

Table 2a. Sample set of the Mobile QoE subjective test data

Video SI   Video TI   Video Bitrate (kb/s)   Video Frame-rate   QoE Accept
   70         141             64                    6               no
   60         153            128                   12.5             yes
   21         187             64                    6               yes
   21         187             32                    6               no
   54         142             32                    6               no

Table 2b. Sample set of the PDA QoE subjective test data

Video SI   Video TI   Video Bitrate (kb/s)   Video Frame-rate   QoE Accept
   62         100            285                   20              yes
   56          87             32                    6              no
   62         119            349                   25              yes
   62         119            448                   25              yes
   71         125            128                   10              yes

Table 2c. Sample set of the Laptop QoE subjective test data

Video SI   Video TI   Video Bitrate (kb/s)   Video Frame-rate   QoE Accept
   56          35            180                   20              yes
   70          71            128                   15              no
   67          70             32                   10              no
   23         130            363                   25              yes
   63          90             64                   12.5            yes

SVM Results

Running the SMO algorithm generated the hyperplanes given in Figures 5a, 5b, and 5c. The accuracy of the Support Vector Machine was validated with 10-fold cross-validation and measured at 88.59±2.85%, 89.38±2.77%, and 91.45±2.66% for the mobile, PDA, and laptop datasets respectively. Each value is the percentage of accurately predicted instances averaged over the ten folds, with the standard deviation of the ten values given as the error range.

Figure 5a. SVM hyperplane for the Mobile dataset

  1.4555 * (normalized) Video SI
+ 1.0459 * (normalized) Video TI
- 5.0892 * (normalized) Video Bitrate
- 3.7632 * (normalized) Video Framerate
- 0.4582
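The Figure 5a hyperplane can be applied directly as a classifier. A minimal sketch follows; Weka's SMO normalizes attributes to [0, 1] by default, but the min/max ranges below are taken from the Table 2a sample only and are illustrative, not the actual training ranges:

```python
# Weights and bias of the Figure 5a (Mobile) SVM hyperplane.
WEIGHTS = {"si": 1.4555, "ti": 1.0459, "bitrate": -5.0892, "framerate": -3.7632}
BIAS = -0.4582
# Illustrative min-max normalization ranges (assumed, from the sample data).
RANGES = {"si": (21, 70), "ti": (141, 187), "bitrate": (32, 128), "framerate": (6, 12.5)}

def accept(si, ti, bitrate, framerate):
    """Negative side of the hyperplane = QoE acceptable."""
    raw = {"si": si, "ti": ti, "bitrate": bitrate, "framerate": framerate}
    norm = {k: (raw[k] - lo) / (hi - lo) for k, (lo, hi) in RANGES.items()}
    score = BIAS + sum(w * norm[k] for k, w in WEIGHTS.items())
    return score < 0

print(accept(21, 187, 128, 12.5))  # high bitrate/framerate -> True
print(accept(70, 187, 32, 6))      # complex, low-bitrate video -> False
```

Note the design choice: higher bitrate and framerate push the score down (negative weights), toward the acceptable side, matching the interpretation of Figure 6 below.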


Figure 5b. SVM hyperplane for the PDA dataset

Figure 5c. SVM hyperplane for the Laptop dataset

Figure 6 shows eight projections of the SVM hyperplane built from the mobile dataset onto a two-dimensional plane, whose dimensions are the Video Spatial Information and the Video Temporal Information. The eight projections correspond to the eight combinations of Video Bitrate and Framerate of the eight tested segments. The ten tested content types are also shown on the graph as diamond points, each positioned according to its Video Spatial and Temporal Information. We can observe that all the points lie between the 7th and the 8th SVM projection. The SVM classifies all points below its boundary as QoE acceptable and those above as unacceptable. That is, this SVM classifies any of the ten types of content as QoE acceptable if the Video Bitrate and Frame rate are at level seven or above (64 kbit/s and 6 fps).

Figure 6. Projection of the SVM hyperplane from mobile data on a Video SI - Video TI plane

If, for example, a new video type with higher complexity is introduced and its SI/TI point lies above projection seven, then the stream would need a video bitrate of 96 kbit/s and 10 frames per second for the classifier to find it QoE acceptable. Due to space constraints, the projections for the PDA and Laptop hyperplanes are omitted.

Decision Tree Results

The C4.5 experiments yielded three decision trees. They all have high classification accuracy: 93.55±1.76%, 90.29±2.61%, and 95.46±2.09% for the Mobile, PDA, and Laptop datasets respectively. The graphical representation of the models is shown in Figures 7a, 7b, and 7c.

Figure 7a. Decision Tree for the Mobile dataset

Figure 7b. Decision Tree for the PDA dataset

Figure 7c. Decision Tree for the Laptop dataset

For reference, the SVM hyperplanes for the PDA (Figure 5b) and Laptop (Figure 5c) datasets are:

Figure 5b (PDA dataset):
  1.4229 * (normalized) Video SI
- 0.4575 * (normalized) Video TI
- 4.2913 * (normalized) Video Bitrate
- 3.1618 * (normalized) Video Framerate
+ 1.3385

Figure 5c (Laptop dataset):
  2.5405 * (normalized) Video SI
+ 0.6061 * (normalized) Video TI
- 4.2157 * (normalized) Video Bitrate
- 4.3957 * (normalized) Video Framerate
- 0.7474

ML QoE models accuracy

The achieved accuracy of each model is tested by k-fold cross-validation. This technique splits the dataset into k subsets, uses k-1 of them to train the model and the remaining one to test it, repeats this k times over all combinations of subsets, and averages the test results. It is a more general technique than the leave-one-out validation used for the discriminant analysis models (Kohavi, 1995). In leave-one-out, a single datapoint is left out for testing, and the procedure is repeated for every datapoint; if k equals the number of datapoints, k-fold cross-validation reduces to leave-one-out. Since both validations are of the same nature, we compare the results directly.
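The cross-validation procedure just described can be sketched as follows. The majority-class "learner" is a placeholder standing in for J48 or SMO, used only to keep the example self-contained and deterministic:

```python
from collections import Counter

def k_fold_accuracies(data, k, train, evaluate):
    """Split into k subsets; train on k-1, test on the held-out one,
    repeat k times. mean(scores) is the accuracy, stdev the error range."""
    folds = [data[i::k] for i in range(k)]  # round-robin split
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [d for j, fold in enumerate(folds) if j != i for d in fold]
        model = train(training)
        scores.append(evaluate(model, held_out))
    return scores

def train_majority(dataset):
    # Placeholder learner: remember the majority class.
    return Counter(label for _, label in dataset).most_common(1)[0][0]

def evaluate_accuracy(model, dataset):
    return sum(1 for _, label in dataset if label == model) / len(dataset)

data = [(i, "yes") for i in range(8)] + [(i, "no") for i in range(2)]
print(k_fold_accuracies(data, 5, train_majority, evaluate_accuracy))
# -> [1.0, 1.0, 1.0, 0.5, 0.5]
```

With k equal to the number of datapoints, the same function performs leave-one-out validation.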

The accuracy of the models is:

SMO
• 88.59±2.85% for the Mobile dataset (Figure 5a)
• 89.38±2.77% for the PDA dataset (Figure 5b)
• 91.45±2.66% for the Laptop dataset (Figure 5c)

J48
• 93.55±1.76% for the Mobile dataset (Figure 7a)
• 90.29±2.61% for the PDA dataset (Figure 7b)
• 95.46±2.09% for the Laptop dataset (Figure 7c)

compared to the discriminant analysis models:
• 76.9% (mobile phone)
• 86.6% (PDA)
• 83.9% (laptop)

To conclude, J48 achieved significantly higher accuracy in the prediction models (Figure 8). Comparing the results of the SVM and C4.5 models suggests that decision-tree or rule-based approaches are more suitable for these datasets.

Figure 8. Comparison of models over the different datasets

Comparing the models over the distinct terminal-type subsets, we can see that the mobile dataset shows a far bigger difference in performance than the other subsets.

4.1 Online setup

To demonstrate the capabilities of the Online Learning scheme, we created an experimental setup where data is introduced sequentially to the algorithm for induction. The data is the same subjective data used in the batch learning experiments. As the Online Learning system we used the Massive Online Analysis (MOA) (Bifet, Holmes, Pfahringer, Kirkby, & Gavaldà, 2009) ML platform for data-stream mining, which provides implementations of Hoeffding Option Trees and the Oza Bagging algorithm. MOA is based on the WEKA (Witten & Frank, 2005) ML data-mining platform and is optimized for fast stream mining, which assumes a lot of online data passing through the classifier. In our case, however, viewer feedback is scarce and expensive, so we can only expect a small amount of it. In light of this difference we modified some parameters of the algorithms, mainly the nmin grace period, from its default value of 200 down to 1: we cannot afford to wait for 200 datapoints before starting to build the decision tree.
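Lowering the grace period is safe because the split decision itself is guarded by the Hoeffding bound (Domingos & Hulten, 2000); nmin only controls how often the check runs. A small sketch of the bound, with illustrative values for the range R and confidence parameter delta:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """After n observations of a variable with range R, the true mean is
    within epsilon of the observed mean with probability 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# The bound shrinks as feedback accumulates; e.g. for an
# information-gain range R = 1 and delta = 1e-7 (assumed values):
for n in (10, 100, 1000):
    print(n, round(hoeffding_bound(1.0, 1e-7, n), 3))
```

A split on the current best attribute is accepted once its gain advantage over the runner-up exceeds this epsilon, regardless of how frequently the test is evaluated.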

Since MOA assumes an abundance of data, it estimates the accuracy of the prediction models by interleaving testing and training: part of the data is dedicated to testing and is not used for training the models. The accuracy of these models could therefore be lowered when little training data is available, because datapoints are reserved for validation. The standard approach for validation in situations with scarce data is cross-validation (Kohavi, 1995).

We implemented a ten-fold cross-validation scheme to calculate the accuracy of the classifier: the data is split into 90% for training and 10% for testing, and the process is repeated ten times, each time with a different combination of datapoints used for training and testing. We also implemented a validation scheme for testing concept drift, in which the classifier is trained on one dataset and, at a certain point, a new dataset, possibly with a different distribution, is introduced. At the moment the new data appears, the model is unaware of the change and predicts based on knowledge from the previous data; then new feedback from the second dataset arrives. With this setup we can monitor the introduction of new concepts in the dataset, and how fast the model adapts and incorporates this knowledge.

4.2 Online results and analysis

The results of the execution of Online Learning using the Hoeffding Option Tree algorithm are shown in Figure 9.

Figure 9. Hoeffding Option Tree results

The classification accuracy rises quickly to over 80% with fewer than 100 datapoints, i.e., user-generated feedback instances. After around 1000 datapoints, the classifier converges to its maximum accuracy of approximately 90%. In the same manner, the standard deviation of the accuracy falls quickly to below 3 (with just a few exceptions) and then to around 2 after 1600 points.

Figure 10. OzaBagging Hoeffding Option Tree results

We obtained qualitatively similar findings from the execution of the ensemble Online Learning algorithm, Oza Bagging Hoeffding Option Tree (Figure 10). Both algorithms reach very high prediction accuracy (90%) very rapidly, on the order of a thousand datapoints. As expected, the ensemble approach gains accuracy faster. The classifier's standard deviation over the different folds of the cross-validation is also much lower than for the standalone classifier.

Overall, these results show that we can already achieve an accuracy of over 80% by learning from only 100 datapoints of randomly selected feedback.

Figure 11 presents the results of testing with concept drift. We split the dataset into two smaller ones: the first contains only video with Temporal Information smaller than 110, which is around 60% (2010 out of 3370) of the data; the second contains the remaining 40%.

Figure 11. Hoeffding Option Tree with concept drift results

The result of using the Hoeffding Option Tree algorithm shows a drop in accuracy and an increase in the standard deviation at the moment the new dataset is introduced. The accuracy is regained very quickly, converging above 90% in fewer than 200 datapoints. This very encouraging result shows the ability of the algorithm to adapt to changes. In this experiment the model was trained on content with small TI (slow-motion content) and then high-motion content was introduced. Even with a change as drastic as this, the accuracy recovered very quickly.

Figure 12. OzaBagging Hoeffding Option Tree with concept drift results

The results shown in Figure 12, from the Oza Bagging Hoeffding Option Tree ensemble, show that this algorithm is much more robust to changes and handles the concept drift with close to no loss in accuracy and a limited rise in the standard deviation. This result is even more impressive than that of the single classifier; it demonstrates the robustness of the ensemble approach and justifies the added complexity of using an ensemble.


To demonstrate the viability of our approach, Figure 13 illustrates how the two algorithms we used, Hoeffding Option Tree (HOT) and Oza Bagging HOT, compare with three standard ML approaches, namely Naïve Bayes, Support Vector Machine (SVM), and C4.5.

It is evident that C4.5 performs best, with 93% accuracy, but the online learning algorithms we used follow closely behind with 90.5% and 91.1%.

Figure 13. Comparison with standard ML algorithms

5. Conclusion

Having reached these results, we conclude that our QoE prediction models are suitable for use in real-world applications, particularly as part of the control loop of a network management system. The high accuracy in determining the expected level of QoE enables efficient decisions on the provisioning of network resources while keeping the customer satisfied. In addition, decision trees offer very fast classification, which in turn makes these models applicable in real-time decision support systems.

In addition to the static models, we presented methods for developing adaptable QoE models using Online Learning algorithms. These algorithms show that even in highly changeable environments the QoE prediction models can maintain significant accuracy from small amounts of user feedback. The accuracy of the online learning algorithms is close to that of the standard batch ML algorithms, which further justifies their use and the applicability of the approach.

The work presented herein is focused on the construction of accurate as well as practical QoE prediction models. Once we are able to directly correlate context (QoS) with user perception (QoE) it is possible to think about creating a real-time QoE monitoring system to be used for instance for user-centric SLA monitoring. Another possible development is in the area of QoE management and real-time QoE control. Hence, this work is also very relevant to those who are working on encoding and decoding techniques for real-time streaming.

Acknowledgments

The work included in this article has been supported by Telefonica I+D (Spain). The authors thank María del Mar Cutanda, head of division at Telefonica I+D, for providing guidance and feedback. Subjective QoE data has been provided by Dr. Florence Agboma, Essex University (U.K.).

References

Agboma, F. (2009). Quality of Experience Management in Mobile Content Delivery Systems. Department of Computing and Electronic Systems, University of Essex.

Agboma, F., & Liotta, A. (2007). Addressing user expectations in mobile content delivery. Mobile Information Systems, 3(3), 153-164.

Agboma, F., & Liotta, A. (2008). QoE-aware QoS management. In Proceedings of the 6th International Conference on Advances in Mobile Computing and Multimedia (pp. 111-116). Linz, Austria: ACM. doi:10.1145/1497185.1497210

Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., & Gavaldà, R. (2009). New ensemble methods for evolving data streams. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 139–148).

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.

Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 71-80).

Fechner, G. T. (1966). Elements of psychophysics (H. E. Adler, Trans.; D. H. Howes & E. G. Boring, Eds.). New York: Holt, Rinehart and Winston.

Gama, J., Rocha, R., & Medas, P. (2003). Accurate decision trees for mining high-speed data streams. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 523-528).

Holmes, G., Kirkby, R., & Pfahringer, B. (2005). Stress-Testing Hoeffding Trees. In Knowledge Discovery in Databases: PKDD 2005 (pp. 495-502). Retrieved from http://dx.doi.org/10.1007/11564126_50

ITU-T. (1999). Recommendation P.910: Subjective video quality assessment methods for multimedia applications.

Klecka, W. R. (1980). Discriminant analysis. SAGE.

Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (Vol. 7).

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI (pp. 1137-1145). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.529

Kohavi, R., & Kunz, C. (1997). Option Decision Trees with Majority Votes. In Proceedings of the Fourteenth International Conference on Machine Learning (pp. 161-169). Morgan Kaufmann Publishers Inc. Retrieved from http://portal.acm.org/citation.cfm?id=756499

Oza, N. C., & Russell, S. (2001). Online bagging and boosting. In Artificial Intelligence and Statistics (pp. 105–112).

Pfahringer, B., Holmes, G., & Kirkby, R. (2007). New Options for Hoeffding Trees. In AI 2007: Advances in Artificial Intelligence (pp. 90-99). Retrieved from http://dx.doi.org/10.1007/978-3-540-76928-6_11

Platt, J. C. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines (Technical Report MSR-TR-98-14). Microsoft Research.

Quinlan, J. R. (1983). Learning efficient classification procedures and their application to chess end games. Machine learning: An artificial intelligence approach, 1, 463-482.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Siller, M., & Woods, J. (2003). QoS arbitration for improving the QoE in multimedia transmission. In International Conference on Visual Information Engineering (VIE 2003) (pp. 238-241).

Takahashi, A., Hands, D., & Barriac, V. (2008). Standardization activities in the ITU for a QoE assessment of IPTV. Communications Magazine, IEEE, 46(2), 78-84. doi:10.1109/MCOM.2008.4473087

Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data: Springer Series in Statistics (Springer Series in Statistics). Springer-Verlag New York, Inc. Retrieved from http://portal.acm.org/citation.cfm?id=1098680

Wang, Z., Bovik, A. C., & Lu, L. (2002). Why is image quality assessment so difficult? In IEEE International Conference on Acoustics Speech and Signal Processing (Vol. 4, pp. 3313–3316).

Weka 3 - Data Mining with Open Source Machine Learning Software in Java. (n.d.). . Retrieved June 2, 2009, from http://www.cs.waikato.ac.nz/ml/weka/

Winkler, S. (2007). Video quality and beyond. Symmetricom.

Winkler, S. (2005). Digital video quality: Vision models and metrics. Chichester, West Sussex; Hoboken, NJ: J. Wiley & Sons.

Winkler, S. (2009). Video quality measurement standards - Current status and trends. In Proceedings of ICICS 2009. Macau, PRC. Retrieved from http://stefan.winkler.net/Publications/icics2009.pdf

Witten, I. H., & Frank, E. (2005). Data mining. Morgan Kaufmann.