A User Perceived Quality Assessment of Design Strategies for HTTP-Based Adaptive Streaming

AYUSH BHARGAVA, Clemson University
YUNHUI FU, Clemson University
SABARISH V. BABU, Clemson University
JAMES MARTIN, Clemson University

HTTP-based Adaptive Streaming (HAS) is the dominant Internet video streaming application. One specific protocol of HAS, Dynamic Adaptive Streaming over HTTP (DASH), piques interest as it is used by Netflix. Prior research has focused on networking and protocol issues and established the current understanding of video quality assessment. Our work primarily investigates broadcast video strategies and the additional complexity of assessing HAS video quality. For impaired networks, the HAS adaptation algorithm becomes the dominating factor by selecting a lower quality representation of the content, resulting in reduced bandwidth consumption. Unlike traditional broadcast video, HAS primarily suffers from stalls due to insufficient data in the playback buffer. In this paper, we present an adaptation design strategy for HAS which derives from the need to mitigate the conflict between avoiding buffer stalls and maximizing video quality. We also present results from a user study which provide insights into best practice guidelines for designing streaming algorithms.

General Terms: Algorithm, Design, Experimentation, Human factors, Performance

Additional Key Words and Phrases: Buffer-based, Capacity-based, Dynamic Adaptive Streaming over HTTP, HTTP-based Adaptive Streaming, Quality of Experience

ACM Reference Format:
Ayush Bhargava, Yunhui Fu, Sabarish V. Babu, James Martin. 2015. A User Perceived Quality Assessment of Design Strategies for HTTP-Based Adaptive Streaming. ACM Trans. Comput.-Hum. Interact. V, N, Article A (2015), 31 pages.
DOI: http://dx.doi.org/10.1145/0000000.0000000

1. INTRODUCTION
As described in Sandvine's most recent Internet traffic report [Sandvine 2014], Internet video streaming, referred to as HTTP-Based Adaptive Streaming (HAS), consumes more than 67% of the downstream bandwidth delivered to fixed access end points. A recent forecast from Cisco predicts that video traffic will represent 80% of all consumer Internet traffic by 2019 [ETSI 2012]. Ten years ago the term Internet video streaming assumed UDP transport. Today Internet video streaming typically involves a TCP streaming system that is based on a form of HTTP-based Adaptive Streaming (HAS). Various approaches for HAS have evolved from companies such as Netflix, Microsoft, Apple, and Google. This evolution motivated the development of the Dynamic Adaptive Streaming over HTTP (DASH) protocol. DASH provides a standard method for containing and distributing video content over the Internet [ETSI 2012; Stockhammer et al. 2011]. Both Netflix and YouTube now support DASH. Further,

This work has been supported in part by CableLabs, Comcast, and Intel.
Authors' addresses: Ayush Bhargava, Yunhui Fu, Sabarish V. Babu, James Martin, School of Computing, McAdams Hall, Clemson University, Clemson, South Carolina 29631.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2015 ACM. 1073-0516/2015/-ARTA $15.00
DOI: http://dx.doi.org/10.1145/0000000.0000000


HTML5 has been extended to support HAS JavaScript-based client players. This latter advancement is significant as it makes the technology accessible to developers and open-source communities. To support the user study that we describe in this paper, we have built a DASH-compliant broadcast system. To simplify the presentation, we refer to the system under study as HAS.

As illustrated in Figure 1, a HAS system consists of a server and a client. A HAS server is a web server that holds encoded content packaged in the correct manner. Content is divided into 'chunks' referred to as segments. Segments are self-contained such that the client can decode the content independent of previous or future segments. Each segment represents a specific amount of video in terms of viewing time. The amount of data contained in a segment depends on how it was encoded. A content provider will create multiple representations of the content, each corresponding to a different level of video quality. Figure 1 illustrates one content selection (for example, one movie) that has been encoded into four different representations. The highest quality representation would likely represent high definition video and the lowest quality would likely represent less than standard definition quality. A four second segment (which is a reasonable segment size parameter) encoded at high quality could require an order of magnitude more network bandwidth than that required by the lowest quality representation. At the start of a session, the client receives a Media Presentation Description (MPD) file for the content selection which identifies the possible bitrate options as well as the URL name of all segments.
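As a rough illustration of the bandwidth spread across representations (a minimal sketch; the 0.5 and 4.8 Mbps endpoints are taken from the encoding ladder described in Section 3, the intermediate rates and the 4-second segment size are illustrative assumptions), the per-segment download size follows directly from the encoded bitrate:

```python
# Approximate size of one segment for each representation bitrate.
# Assumes a 4-second segment and a simple bitrate ladder similar to the one
# described in Section 3 (0.5 Mbps ... 4.8 Mbps); real encoders vary segment
# sizes with scene complexity.
SEGMENT_SECONDS = 4
bitrates_mbps = [0.5, 1.2, 2.4, 4.8]   # illustrative ladder

for rate in bitrates_mbps:
    megabits = rate * SEGMENT_SECONDS
    megabytes = megabits / 8
    print(f"{rate:>4} Mbps -> {megabits:5.1f} Mb ({megabytes:.2f} MB) per segment")

# The 4.8 Mbps segment is roughly an order of magnitude larger than the
# 0.5 Mbps segment, which is the spread referred to in the text.
```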

The client consists of a playback buffer, a controller, and a video player. The playback buffer holds a sufficient amount of video data such that if the network temporarily experiences impairment, either due to network congestion or wireless connectivity, the player can continue to play back the stream without stalling. The controller monitors the arrival rate of data as well as the state of the playback buffer. It determines when the client should request additional content. The video player consumes video data from the playback buffer at a rate based on the encoded video rate. If the video player requires more data but the playback buffer is empty, the player moves into a stall state. This is referred to as a rebuffering event; the player will not resume rendering video until a minimum amount of data has been buffered. We describe in more detail the types of artifacts that are possible in a HAS system later in the paper.
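The following is a minimal sketch of the player-side behaviour just described. It is not the dash.js implementation used in the study; the class names and the resume threshold are illustrative assumptions.

```python
# Minimal sketch of the playback/rebuffering behaviour described above.
# Names and thresholds are illustrative, not taken from the study's player.
class PlayerState:
    PLAYING = "playing"
    REBUFFERING = "rebuffering"

class PlaybackBuffer:
    def __init__(self, resume_threshold_s=8.0):
        self.level_s = 0.0                      # seconds of video buffered
        self.resume_threshold_s = resume_threshold_s
        self.state = PlayerState.REBUFFERING    # start stalled until enough data arrives

    def enqueue_segment(self, segment_duration_s):
        """Controller adds a downloaded segment to the buffer."""
        self.level_s += segment_duration_s
        if self.state == PlayerState.REBUFFERING and self.level_s >= self.resume_threshold_s:
            self.state = PlayerState.PLAYING    # resume only after the threshold is reached

    def consume(self, wall_clock_s):
        """Video player drains the buffer in real time while playing."""
        if self.state != PlayerState.PLAYING:
            return
        self.level_s -= wall_clock_s
        if self.level_s <= 0.0:
            self.level_s = 0.0
            self.state = PlayerState.REBUFFERING  # buffer underrun -> stall (rebuffering event)
```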

When the network is not congested, the client selects the highest quality video representation. When the network becomes congested, the client could select lower quality video segments. This adaptation algorithm, which operates at the controller in Figure 1, makes decisions based on recently observed network and system conditions. The premise behind HAS is that reducing the requested video quality to match available network bandwidth leads to improved perceived quality. The HAS adaptation algorithm must address the conflicting design goals of maintaining high video quality and minimizing buffer stalls. In its simplest form, the former is achieved by having the HAS client request the highest quality segments that conditions appear to support, while the latter is achieved by having the client make adaptation decisions based only on the state of the playback buffer. In the literature, the two approaches are referred to as capacity-based and buffer-based adaptation, respectively. A capacity-based approach prioritizes high video quality and assumes the prediction of available bandwidth is sufficiently accurate to avoid buffer stalls. Buffer-based adaptation assumes buffer stalls must be avoided and that the most reliable method for avoiding buffer stalls is to base the video quality adaptation on the current state of the playback buffer.
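To make the distinction concrete, the sketch below contrasts the two decision rules in their simplest form. It is not the adaptation logic from [Huang et al. 2014] or [Liu et al. 2011] that our player implements; the ladder, thresholds, and safety margin are illustrative assumptions.

```python
# Simplest-form sketches of the two design strategies discussed above.
# Bitrates are in Mbps; thresholds and margins are illustrative only.
LADDER = [0.5, 1.2, 2.4, 4.8]   # available representation bitrates

def capacity_based(estimated_bandwidth_mbps, safety_margin=0.8):
    """Pick the highest bitrate the estimated capacity appears to support."""
    budget = estimated_bandwidth_mbps * safety_margin
    feasible = [r for r in LADDER if r <= budget]
    return max(feasible) if feasible else LADDER[0]

def buffer_based(buffer_level_s, low_s=15.0, high_s=60.0):
    """Map the current buffer occupancy onto the bitrate ladder.

    Below low_s the lowest bitrate is requested so the buffer refills quickly;
    above high_s the highest bitrate is requested; in between the choice
    scales with occupancy, ignoring bandwidth estimates entirely.
    """
    if buffer_level_s <= low_s:
        return LADDER[0]
    if buffer_level_s >= high_s:
        return LADDER[-1]
    fraction = (buffer_level_s - low_s) / (high_s - low_s)
    index = int(fraction * (len(LADDER) - 1))
    return LADDER[index]
```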


Fig. 1. HAS System Model

The two approaches represent opposite ends of the range of design strategies. The best approach is likely a compromise that unfortunately is quite difficult to pinpoint. The evaluation of a specific adaptation algorithm requires a method for obtaining perceived quality assessments from users viewing the video stream. An accurate assessment would likely involve a utility function that includes potentially many quality contribution components. Further, the weighting applied to prioritize the quality components would vary depending on both the viewer and the content. While there have been recent contributions towards understanding perceived quality in HAS (see for example [Survey] and its references), there are few studies that provide specific guidance. In fact, most papers that study HAS adaptation rely on a small number of published results that collectively provide best practices towards HAS design. The work in [Dobrian et al. 2011] suggests that buffer stalls have the biggest impact on user engagement. The work in [Cranley et al. 2006] suggests that frequent adaptations are distracting. The work in [Muller and Timmerer 2011] suggests that sudden changes in video quality trigger poor subjective video quality scores.

The motivating question driving our research is whether we can provide any further insight or extension to the current set of best practices, and, one step further, whether we can concretely map these insights to specific design choices in a HAS adaptation algorithm. In the research presented in this paper, we explore these issues. The underlying objective of the research is to provide more concrete design guidance for a HAS adaptation algorithm than what the current best practices provide. In this paper, we present the results of a user study that we have conducted at Clemson University that tightly couples human assessment in the evaluation of design choices available to a HAS adaptation algorithm.

This paper is organized as follows. In the next section we provide a brief overview of HAS. Next we present related research. The fourth section introduces the system and overall experimental procedures. We present the results along with a thorough discussion of the analysis and interpretation of the results. We end the paper with conclusions including a summary of the limitations of the study.


2. RELATED WORK
The development and study of HAS systems represent one of the most active areas of research in the networking research community. The interaction between a HAS client (i.e., the player) and server has been established in recent academic research [Akhshabi et al. 2011; Akhshabi et al. 2012; Jiang et al. 2012; Li et al. 2014b; Huang et al. 2012]. In a HAS system, video and audio content is encoded and made available at servers located either at the content provider's facilities or distributed to locations in the Internet by a content delivery network provider. Multiple representations of the content are created that reflect a range of possible encoded bitrates corresponding to different levels of video quality and subsequently different levels of bandwidth consumption. While the range of bitrates is continually being extended, the literature suggests that a bitrate range of 0.50 Mbps to 5.0 Mbps is reasonable [Akhshabi et al. 2011; Huang et al. 2014; De Cicco and Mascolo 2014]. The client requests portions of the content in chunks referred to as segments. A segment size between two and ten seconds is reasonable [Huang et al. 2014; Martin et al. 2013]. The client maintains a playback buffer that serves to compensate for the jitter in the traffic arrival process. The literature suggests that a playback buffer size ranging from 60 to 240 seconds is reasonable [Huang et al. 2014; Martin et al. 2013].

The more challenging aspect of HAS is perceived quality assessment. The foundations are based on the vast amount of research on video quality assessment that comes primarily from the broadcast video community [Wang et al. 2004; Xia et al. 2009a; Winkler and Mohandas 2008]. Objective metrics such as peak signal-to-noise ratio (PSNR), vision-based metrics, or packet-stream oriented metrics attempt to programmatically compute video quality [Winkler and Mohandas 2008]. Some metrics require a frame-by-frame comparison between a reference video and the video under assessment. No-reference assessment limits the analysis to the observed video, greatly enhancing the flexibility of assessment. However, the challenge with no-reference assessment is being able to accurately differentiate quality degradation from the original content. Subjective video quality assessment involves asking human subjects for their opinion of the video quality. While there are guidelines, subjective video quality assessment can be difficult due to the complexity and difficulties surrounding large scale user studies [Sector 1998; ITU-T RECOMMENDATION 1999; Assembly 2003]. There are several specific results from this prior literature that can be applied to HAS:

— Users are more sensitive to lower bit rate (lower video quality) for live content than for Video-on-Demand [Dobrian et al. 2011].

— Users are more tolerant of reduced video quality that is stable than of a video session that frequently changes between various levels of quality ranging from reduced video to high quality video [Cranley et al. 2006; Balachandran et al. 2012].

— The percentage of time spent buffering has the largest impact on user engagement across all types of content [Dobrian et al. 2011].

— QoE due to frame rate jerkiness changes logarithmically, and users prefer a single long stall to multiple short stalls [Huynh-Thu and Ghanbari 2006].

Recent studies that propose 'quality of experience' (QoE) [Rec 2007] metrics typically attempt to quantify a measurable aspect of human perception and to calibrate based on additional subjective assessment results. Unfortunately, it is very difficult to get the objective assessment correct over a wide range of conditions. Multiple studies now confirm that PSNR is not a reliable assessment for perceived quality. This issue has been the subject of intense academic study for decades [Degrande et al. 2008; Ni et al. 2011b; Xia et al. 2009b; Ou et al. 2010].

In addition to the studies that provide the best practices, there is additional research that focuses on QoE metrics for HAS [Jiang et al. 2012; Mok et al. 2012; Huang et al. 2012]. The same challenges apply when trying to develop objective metrics. However, determining the perceived QoE of a video streaming session is very complex as the assessment depends on many factors including the viewer, the video encoding details, and the content. Research done by [Agboma and Liotta 2007] supports the argument that QoE for multimedia services should be driven by the user perception of quality rather than raw engineering parameters such as latency, jitter, and bandwidth. For example, work done in [De Pessemier et al. 2013] suggests that users prefer fluent playback of videos over higher resolution, frame rate, and bitrate.

Assessing the perceived quality of video without a reference is much more challenging. The 3GPP community has identified several quality metrics for HAS including HTTP request/response transaction times, average throughput, and initial playout delay. In [Dobrian et al. 2011], the authors explore measures that impact perceived quality of Internet broadcasts and found that the percent buffering time has the largest impact on user engagement, although the specific impact varies by content genre. The work in [Mok et al. 2012] performed subjective tests to determine which bitrate adaptation behaviors lead to the highest overall perceived quality. The work in [Muller et al. 2012] evaluated three commercial HAS products (Microsoft Smooth Streaming, Adobe Dynamic Streaming, and Apple Live Streaming) as well as an open source HAS implementation. They determine that in vehicular scenarios none of the methods will always achieve the maximum available bandwidth with a minimal number of quality switches. The work in [Alberti et al. 2013] provides additional information to the HAS adaptation algorithm, such as user feedback and observed network measures such as sampled available bandwidth and playback buffer level, so that it selects the best possible quality.

Several studies explore the design space surrounding adaptation algorithms [Li et al. 2014a; Huang et al. 2014; Thang et al. 2013]. These include using the buffer size as breathing room to temporarily allocate bits among video segments to yield overall optimal video quality, or requesting a video rate that depends only on buffer occupancy. These approaches only help in avoiding rebuffering stalls and provide no guarantee of a high and stable video quality, especially with high network fluctuation.

The use of HAS to deliver video further complicates QoE assessment, primarily because the interactions between network impairment, TCP, and the adaptive application controls that are involved lead to different artifacts. In other words, objective or even subjective methods that have been used with traditional UDP-based video streaming might not work well with HAS-based video streaming. While there has been some recent work that has explored QoE assessment for HAS-based streaming [Balachandran et al. 2012; Oyman and Singh 2012; Houdaille and Gouache 2012; Huysegems et al. 2012; De Pessemier et al. 2013; Ou et al. 2010], the problem of finding reliable QoE metrics for HAS video streaming is an open issue.

While the issue of assessing HAS quality is under intense study, the literature does suggest that the following measures are useful for evaluating HAS [Akhshabi et al. 2012; Cranley et al. 2006; Dobrian et al. 2011; Jiang et al. 2012; Krishnan and Sitaraman 2013; De Cicco et al. 2013; Ni et al. 2011a; De Cicco and Mascolo 2014] (a brief computation sketch follows the list):


— playerRate: Attributes of the stream such as the resolution, pixel density, and frame rate, as well as end device capabilities, all determine the base quality of the video stream. This measure represents the average bitrate of the stream based on the rate (in Mbps) at which data is dequeued from the playback buffer. If the video player stalls, samples are not recorded (i.e., rebuffering events will not impact the measure).

— AdaptationsRate: It has been shown that frequent adaptations and sudden changes to video quality are distracting [Cranley et al. 2006; Ni et al. 2011a]. The number of adaptations is counted throughout the streaming session. This is translated to a unitless measure that represents the rate of adaptations per hour.

— AvgBuffer: The average playback buffer size (in seconds) maintained over the lifetime of the stream. The actual size of the buffer in bytes is determined by the number and quality level of each segment contained in the buffer.

— AverageRebufferingTime: A rebuffering event occurs when the video player moves from the running (playback) state to a rebuffering (stalled) state. This transition can occur when the player is not able to obtain at least a partial segment of data from the playback buffer. The player will remain in the rebuffering state until a threshold amount of data is available in the playback buffer. When the buffer has reached this threshold, the duration of the particular rebuffering event is recorded. This statistic is the average rebuffering time observed over all rebuffering events.

— TimeSpentRebuffering: This is the ratio of the total time spent in the rebuffering state to the total lifetime of the stream. The measure provides a relative indication of how long the video player was stalled.

— ArtifactRate: An artifact describes visual impairments that might be produced by the player. Depending on the cause, an artifact might be quite evident, or, depending on the display device, it might not be noticeable. Artifacts such as pixelation and stutters are common examples.
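A minimal sketch of how several of these measures could be derived from a player event log follows; the event format and field names are assumptions made for illustration, not the instrumentation used in our player.

```python
# Sketch: deriving playerRate, AdaptationsRate, TimeSpentRebuffering and
# AverageRebufferingTime from simple event records. The record formats are
# assumptions for illustration only.
# playback_samples: list of (timestamp_s, dequeue_rate_mbps) taken while playing.
# rebuffer_events:  list of (start_s, end_s). adaptation_times_s: list of times.

def player_rate(playback_samples):
    """Average bitrate (Mbps) while the player is actually playing."""
    rates = [r for (_, r) in playback_samples]
    return sum(rates) / len(rates) if rates else 0.0

def adaptations_rate(adaptation_times_s, session_length_s):
    """Adaptations per hour over the streaming session."""
    return len(adaptation_times_s) / (session_length_s / 3600.0)

def time_spent_rebuffering(rebuffer_events, session_length_s):
    """Fraction of the session spent in the rebuffering (stalled) state."""
    stalled = sum(end - start for (start, end) in rebuffer_events)
    return stalled / session_length_s

def average_rebuffering_time(rebuffer_events):
    """Mean duration of the individual rebuffering events."""
    if not rebuffer_events:
        return 0.0
    return sum(end - start for (start, end) in rebuffer_events) / len(rebuffer_events)
```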

In summary, the majority of the related work focuses on viewer satisfaction and reactions to content streamed using HAS, exploring HAS systems already in place, or defining broader metrics for measuring Quality of Experience (QoE). A survey of such works appears in [Seufert et al. 2014]. Although these yield useful statistics related to the effectiveness of HAS, they do not provide concrete guidance for HAS system designs and behaviors. In the research described in this paper, we define the behavior of the streaming technique under different circumstances and directly show how it affects the viewer's quality of experience.

3. SYSTEM DESCRIPTION
The system under study models a user viewing video over the Internet. An example could be a broadband Internet access subscriber viewing Netflix content. In our study, participants were subjected to a controlled viewing experience. Throughout the experience, we obtain quantitative and qualitative information that collectively provides feedback that we interpret to assess the participant's perceived quality. We selected a short sci-fi movie for the study. The film, 'Tears of Steel', is available under the terms of Creative Commons.1

We obtained the 4K version of the content in the '.mov' container format. We used ffmpeg to create H.264 encoded video and then WebM tools to create DASH formatted segment files. We created representations ranging from 0.5 Mbps to 4.8 Mbps. We set the segment size to 4 seconds and the maximum playback buffer size to 120 seconds. These configuration settings are based on recent measurement studies of deployed HAS systems [Martin et al. 2013; Akhshabi et al. 2011; Huang et al. 2014; De Cicco and Mascolo 2014].

The representations were viewed using a modified version of an open source JavaScript implementation of HAS.2 We extended the player with implementations of buffer-based adaptation (i.e., design strategy 1) and capacity-based adaptation (design strategy 2) based on [Huang et al. 2014] and [Liu et al. 2011], respectively.3

3.1. Experimental Setup
The viewing experiences were conducted in a theatre at the Digital Production Arts facility at Clemson University. The facility offers a realistic theatre experience based on a screen of size 20'x20' and a high-end projection system.4 The participant was asked to sit in the center of the first two rows so that he/she could maintain a full view of the screen.

To obtain more granular quality results from participants, we used a continuous feedback method involving a WiiMote device. During a viewing experience, participants were asked to indicate their perceived levels of poor quality of the video by rotating the hand-held Wii Remote to the right or left (further details are provided in section 3.3). Participants were not allowed to pause the video. Before the video was played, the participant was asked to fill out surveys and instructed on how to hold and use the WiiMote during the video playback.

3.2. Video Conditions
As mentioned earlier, an actual HAS system involves client players interacting with content servers located in the Internet. For our study we needed to reproduce aspects of the actual system, but in a manner that supports a repeatable experimentation platform with sufficient control such that we could carry out our study. Our process involved two steps: 1) platform emulation and calibration; 2) production of the video conditions.

Platform Emulation and Calibration. Figure 3 illustrates the experimental setup that was used to emulate realistic HAS streaming sessions over the Internet. The Windows PC represents the HAS client and the Content Server holds the content correctly formatted in multiple representations. We conducted a calibration analysis allowing us to create realistic network impairment that resulted in a reasonable HAS random outcome. The HAS outcome implies the specific sequence of quality levels for each video segment associated with the outcome.

1 Information from the film developer is available at https://mango.blender.org/
2 The dash.js open source project is available at https://github.com/Dash-Industry-Forum
3 Further information about the algorithms and the player implementations can be found at our project web site located at http://cybertiger.clemson.edu/vss
4 The DPA Theatre has a Christie D4K256 projector operating in 2K video mode.


Fig. 2. Experimental Setup

Modern Linux distributions include a network emulator, referred to as netem, that can add artificial packet loss and latency to network traffic that is forwarded over a network interface. The objective of the calibration phase was to find two specific settings of netem that could emulate a highly impaired network scenario and a scenario that represents a less-impaired network scenario. For each setting, we streamed the content in two different ways: first using the JavaScript HAS player with a buffer-based adaptation design strategy, and second using the player configured to reflect a capacity-based design.

The calibration allowed us to map artificial packet loss rates to specific HAS video quality levels. We found that a random loss process based on a Bernoulli loss model (i.e., an uncorrelated loss process) with an average loss setting of 3% roughly corresponds to a minimally impaired scenario. Increasing the average loss rate to 6% corresponds to a highly impaired setting (i.e., a low video stream quality level). However, the Internet is a dynamic system that can exhibit highly variable network conditions. In our calibration, we observed that both HAS adaptation approaches are effective even in high loss rate scenarios as long as the loss process remains stationary. In order to reflect realistic video streaming sessions, such as those observed in [Huang et al. 2012], we needed to vary the loss process as the streaming session progresses. To ensure we could repeat the realization of the underlying random process, we further customized the client player to request a specific sequence of segments.
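As a minimal sketch of this kind of impairment control (the interface name, the 60-second alternation, and the script itself are illustrative assumptions, not our calibration tooling; only the 3% and 6% loss settings come from the calibration described above), netem can be driven from a small script to vary the loss rate over a session:

```python
# Sketch: driving netem to vary random packet loss over a streaming session.
# Requires root privileges; the interface name and schedule are assumptions.
import subprocess
import time

INTERFACE = "eth0"

def set_loss(percent):
    """Replace the netem qdisc on INTERFACE with the given random loss rate."""
    subprocess.run(
        ["tc", "qdisc", "replace", "dev", INTERFACE, "root",
         "netem", "loss", f"{percent}%"],
        check=True)

# Alternate between the calibrated low-impairment (3%) and
# high-impairment (6%) settings every 60 seconds.
schedule = [(3, 60), (6, 60), (3, 60), (6, 60)]
for loss_percent, duration_s in schedule:
    set_loss(loss_percent)
    time.sleep(duration_s)
```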

Production of the video conditions. We created four renderings of the video streaming session which we refer to as the video conditions (identified in Table I). Each of the four video experiences represents a specific sample path of the random process that would normally define a random streaming session. Table II identifies the four video experiences. Even if the client player requested a specific sequence of segments, randomness due to network or platform effects might still cause participants who view the same sequence of segments to observe different outcomes.


Fig. 3. Calibration Setup

Table I. User Study Conditions - 2 x 2 experiment design between impairment (high/low) and strategy (buffer-based I vs. capacity-based II).

Impairment / Streaming Strategy    Strategy I (Buffer based)    Strategy II (Capacity based)
High                               S1-HI                        S2-HI
Low                                S1-LO                        S2-LO

To ensure that the video experience was delivered to participants in a repeatable manner, the four streaming sessions that were presented to participants were recorded in a high quality AVI video file format.5 The videos were recorded using SSR6 screen capture software. These files were placed on a machine that was directly connected to the projector. During a streaming session, one of the four videos was randomly played from the machine locally.

We defined and computed a set of measures that serve to quantify the rendered quality of the video. Some measures are computed by the HAS player. The measures related to rendering artifacts had to be manually identified. We examined each of the four video conditions and, for each observed artifact event, we noted the type of artifact (based on the set of measures) as well as the time of occurrence. We had three team members perform this, followed by a team discussion to ensure we produced an accurate list of artifacts.

5 The recorded versions of the four video experiences can be viewed at http://cybertiger.clemson.edu/vss/
6 The Simple Screen Recorder is available at http://www.maartenbaert.be/simplescreenrecorder/


Table II. Video Experience Definition and Summary

Quantitative Measure                           S1-HI    S2-HI    S1-LO    S2-LO
Average playback rate (Mbps)                   1.024    1.483    2.988    3.401
Average requested rate (Mbps)                  0.899    1.589    2.657    3.156
Average available bandwidth estimate (Mbps)    3.437    3.437    4.044    4.044
Average playback buffer size (seconds)         64.11    18.40    56.4     8.87
Total number of artifacts                      13       43       2        29
Number of quality changes                      9        21       1        1
Number of long stalls                          0        5        0        1
Number of short stalls                         0        12       0        11
Number of frame drops                          4        5        1        16

Table II summarizes the quantitative characteristics from each experience. The set of quantitative video viewing characteristics includes the following:

— Average playback rate: Average rate (in bps) at which the video was played by the player.

— Average requested rate: Average data transfer rate (in bits per second) requested by the player to stream the video.

— Average available bandwidth estimate: Average data transfer rate (in bits per second) available to stream the video.

— Average playback buffer size: Average length of video (in seconds) available in the player's memory/buffer that is ready to be played.

— Total number of artifacts: Total number of events/interruptions observed in the video, such as stalls, quality changes, etc.

— Number of quality changes: Total number of resolution upgrades or downgrades in the video.

— Number of long stalls: Total number of stalls/pauses in the video lasting longer than 5 seconds.

— Number of short stalls: Total number of stalls/pauses in the video shorter than 5 seconds.

— Number of frame drops: Total number of times a frame was skipped in the video.

3.3. WiiMote SetupAn application, developed using the Unity game engine 7, was used to poll the WiiRemote (referred to as a WiiMote) and record the participants continuous feedbackof the perceived quality of the video viewing experience into a log file8. The WiiMotecommunicates with the software that we developed on a computer using a Bluetoothconnection. To get continuous quality feedback from participants, they were instructedto hold the device in an upright position and rotate it from left to right (i.e. from anangle of 0-good quality low frustration to 180-poor quality high frustration), see figure4. The WiiMote’s rotation about the vertical axis was recorded every 50 milliseconds.This reading along with a corresponding unix time stamp was logged in a text file. Thestarting time of the video session was recorded allowing us to interpret the WiiMotedata as a fixed increment time series.
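The logging loop amounts to sampling the rotation every 50 ms and appending a timestamped value to a text file. The sketch below illustrates that loop only; it is not the Unity application used in the study, and read_wiimote_yaw() is a hypothetical placeholder for whatever API reports the device rotation.

```python
# Sketch of the 50 ms logging loop described above. read_wiimote_yaw() is a
# hypothetical placeholder for the call that returns the device rotation
# (0-180 degrees); the real study used a Unity application over Bluetooth.
import time

def read_wiimote_yaw():
    # Placeholder: returns the neutral position instead of a real reading.
    return 90.0

def log_responses(path, duration_s, interval_s=0.05):
    with open(path, "w") as log:
        start = time.time()
        while time.time() - start < duration_s:
            yaw = read_wiimote_yaw()                     # 0 = good quality, 180 = poor quality
            log.write(f"{time.time():.3f} {yaw:.1f}\n")  # unix time stamp + reading
            time.sleep(interval_s)
```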

7 Unity: http://unity3d.com/
8 Information about the Wiimote is available at http://wiibrew.org/wiki/Wiimote.


Fig. 4. WiiMote Usage

3.4. Visualization of Response
The study obtained participant feedback in the form of surveys and continuous data. The post-survey was administered at the end of the video playback. The participant was asked to assess his/her QoE at the beginning of the viewing experience, at the end of the video experience, and then overall. The assessment questions asked the participant to rate the video quality in terms of frustration, distortion, and video clarity. These terms were defined as follows:

— Frustration: feeling of being annoyed with the video playback experience.

— Distortion: presence of artifacts that would degrade or distort the video playback.

— Video Clarity: how clear or crisp the video is. Higher quality implies higher resolution.

The results from each participant in each condition were then analyzed. A detailed analysis is provided in the results section (section 5).

The continuous measure is based on the participant's continuous feedback obtained from the WiiMote device. The data visualized later in figure 5 illustrates a small sample of the data collected from one participant's experience viewing the S2-HI video condition. The time relative to the start of the video experience, in units of minutes and seconds, is plotted along the x-axis, and the level of frustration (based on the Wii Remote's unitless data ranging from 0 to 180) is plotted in blue along the y-axis. The colored vertical bars in the graph represent different artifacts that occur in the video at a particular time. The width of the bars indicates the duration of the artifact being represented. These graphs were created using the D3.js JavaScript library. All the components in the graph are scalable vector graphics that combine to form a layering of graphs that are synchronized to the same time reference. The raw data obtained from the WiiMote was further filtered and normalized for analysis. The post-processing analysis is explained in more detail in the Measures section (section 4.4). As instructed, the participants tended to hold the WiiMote around a value of 90 on the scale of 0 to 180. Values above or below this represented a change in observed perceived quality.

4. EXPERIMENTAL PROCEDURE
4.1. Research Questions and Hypotheses
The study is designed to formulate guidelines as to how a HAS adaptation algorithm should address the conflicting requirements of maximizing video quality and minimizing the frequency of buffer stalls.

We identify the following hypotheses:

— Hypothesis 1: In high impairment network conditions, perceived quality is greater when using Strategy 1 compared with Strategy 2.

— Hypothesis 2: In low impairment network conditions, perceived quality is greater when using Strategy 2 compared with Strategy 1.

— Hypothesis 3: The perceived levels of poor quality are expected to increase as the occurrence of artifacts increases with video playback.

— Hypothesis 4: For both network conditions, the magnitude of perceived quality is greater with Strategy 1.

4.2. Study Design
A within-subjects design was used for this study. Using two levels of network impairment and two streaming algorithms, we get a 2x2 grid with 4 different scenarios or conditions for the purpose of testing (see Table I). A third variable, viewing time, was also used to break down the video for better measurement of viewer responses.

We collected demographic and behavioral data, personality data, current mood state data, and feedback on level of frustration, video quality, distortion, and sound clarity at the beginning of the video, at the end of the video, and as an overall measure using surveys. A measure of perceived levels of poor quality was also recorded as a continuous measure using a WiiMote. After the first session, participants were also asked to record what they could understand about the movie's storyline.

A total of 33 participants were recruited, 31 males and 2 females. They included a mix of graduate and undergraduate students from Clemson University. The participant ages ranged from 18 years to 30 years with a mean of 22 years. Each participant was expected to attend 2 sessions on separate days, each lasting approximately 30 to 40 minutes. No participant saw the same video twice. This resulted in a total of 66 video sessions, 59 of which contributed data to the analysis. The distribution based on each video condition is as follows:

— S1-HI - 15 sessions
— S2-HI - 15 sessions
— S1-LO - 14 sessions
— S2-LO - 15 sessions

The types of artifacts recorded from these videos are stalls (long and short), frame drops, distortion, and quality changes (upgrades and downgrades). As a result of the four settings, each video has a varying number of each type of artifact. To maintain consistency between participants who viewed a particular video setting, the videos were prerecorded by streaming over each type of network impairment under each strategy.

4.3. Methodology
Participants were required to attend two sessions. In the first session:

(1) The participant is handed an informed consent form describing the purpose and the procedure of the study.

(2) Then the participant fills out 3 pre-surveys pertaining to their personality, current mood state, demographics, and general video content watching habits.

(3) Following this, instructions are given on how to use the WiiMote while watching the video.

(4) Next, one of the 4 videos is randomly selected to be presented to the participant.

(5) After the video, the participant fills out a post-experiment survey pertaining to his/her Quality of Experience (QoE).

(6) Lastly, the participant is asked to record what he/she understood about the movie's storyline.

In the second session:

(1) When the participant comes in after a few days, he/she only fills out the current mood state pre-survey.

(2) The participant is then presented with one of the 3 remaining videos (similar to steps 3 to 6 of the first session).

(3) The participant fills out the post-experiment survey.

(4) Finally, the participant is debriefed regarding the purpose of the experiment.

We do not ask the participant to record his/her understanding of the movie in the second session, as it provides less useful information compared to the participant's first response to the question.

4.4. Measures
4.4.1. Quantitative Subjective Measures. The post-survey was used to collect quantitative data about the participant's QoE. We made use of a 5-point Likert scale to help participants rate frustration, distortion, and video clarity by choosing the option that best aligned with their view.

To make it simpler for the participants to comprehend the scale, the values were replaced with phrases. For example, for frustration the scale read extremely frustrated, somewhat frustrated, neutral, somewhat satisfied, and extremely satisfied. The higher the frustration, the smaller the value it represented on the Likert scale. This was also true for distortion. The inverse was true for video clarity and sound clarity: the higher the clarity, the higher the value on the scale. A mapping of the QoE to the measure scores can be seen in Table III. All four measures were collected for the beginning of the video, the end of the video, and the video overall. The details of the analysis of this data are provided in the results section (5.1.1).


Table III. QoE to Quantitative Subjective score mapping

QoE              Very Poor (1)           Poor (2)               Neutral (3)    Good (4)              Very Good (5)
Frustration      Extremely Frustrated    Somewhat Frustrated    Neutral        Somewhat Satisfied    Extremely Satisfied
Distortion       Extremely Distorted     Somewhat Distorted     Neutral        Somewhat Clear        Extremely Clear
Video Clarity    Extremely Blurry        Somewhat Blurry        Neutral        Somewhat Crisp        Extremely Crisp
Sound Clarity    Extremely Distorted     Somewhat Distorted     Neutral        Somewhat Clear        Extremely Clear


4.4.2. Quantitative Objective Measures. The raw WiiMote data was filtered and normalized to remove human bias and noise and to make the data consistent for analysis. All the post-processing steps explained below were completed with the help of an automated program that parses a text file containing all the WiiMote values and time stamps recorded for a particular participant. The program performed these steps for each participant.

To filter the noise from the raw data, we applied a 1-second sliding window filter, similar to a moving average, to each participant's data individually. Skipping the first and the last half second of the data, for every time stamp we calculated an average of all readings from half a second before to half a second after the current reading, i.e., a 1-second subset for each time stamp.

Thus, representing each time stamp as t, the perceived level of poor quality at t as v_t, and the new filtered perceived level of poor quality obtained at t as V_t, we have

V_t = \frac{\sum_{t-0.5 \le s < t+0.5} v_s}{\text{number of time stamps recorded between } t-0.5 \text{ and } t+0.5} \quad (1)

This removed most of the noise from the data, as seen in figure 5. Since the range of values was different for every participant, the readings were normalized by scaling between 0 and 1. This was done by identifying the minimum and maximum values for each participant over the complete duration of the video. The minimum value was then subtracted from each recorded value and the result was divided by the range (maximum minus minimum), giving us the 0-1 scale. This was essential to maintain consistency in our observations and in formulating heuristics.

Using L to represent the series of filtered perceived levels of poor quality, the normalized value of l_i, the value of L in the i-th position, is calculated as

\text{Normalized}(l_i) = \frac{l_i - L_{\min}}{L_{\max} - L_{\min}} \quad (2)

where L_min is the minimum value for variable L and L_max is the maximum value for variable L. If L_max is equal to L_min, then Normalized(l_i) is set to 0.5.

Fig. 5. Graph with WiiMote readings plotted showing the transition from raw to filtered and from filtered to normalized data
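A minimal sketch of this post-processing step, covering the sliding-window filter of equation (1) and the min-max normalization of equation (2); the sampling assumption (a fixed 50 ms interval, so a 1-second window spans about 20 samples) is ours for illustration:

```python
# Sketch of the WiiMote post-processing: 1-second sliding-window filter (eq. 1)
# followed by min-max normalization (eq. 2). Assumes readings sampled every
# 50 ms, so a 1-second window is ~20 samples centred on the current reading.
def sliding_window_filter(readings, samples_per_second=20):
    half = samples_per_second // 2
    filtered = []
    # Skip the first and last half second, as described in the text.
    for i in range(half, len(readings) - half):
        window = readings[i - half:i + half]
        filtered.append(sum(window) / len(window))
    return filtered

def normalize(filtered):
    lo, hi = min(filtered), max(filtered)
    if hi == lo:
        return [0.5] * len(filtered)      # degenerate case from equation (2)
    return [(v - lo) / (hi - lo) for v in filtered]
```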

We analyzed the continuous data in two different ways. First, we divided each video into 3 equal time intervals, beginning, middle, and end, typically around 4 minutes each. The values from each interval were averaged for each participant and analyzed. The detailed analysis is in section 5.2.1.

For a participant P, using d to represent the duration of video playback and v_t to represent the filtered perceived level of poor quality at a particular time stamp t, the calculations were as follows:

\text{Avg}_{beg} = \frac{\sum_{0 \le t < d/3} v_t}{\text{number of time stamps recorded between } 0 \text{ and } d/3} \quad (3)

\text{Avg}_{mid} = \frac{\sum_{d/3 \le t < 2d/3} v_t}{\text{number of time stamps recorded between } d/3 \text{ and } 2d/3} \quad (4)

\text{Avg}_{end} = \frac{\sum_{2d/3 \le t < d} v_t}{\text{number of time stamps recorded between } 2d/3 \text{ and } d} \quad (5)

where Avg_beg is the average perceived level of poor quality for the first 1/3 of the video, Avg_mid is the average for the second 1/3 of the video, and Avg_end is the average for the last 1/3 of the video.
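A short sketch of the interval averaging in equations (3)-(5), operating on the normalized time series; the (timestamp, value) pair format is an assumption made for illustration:

```python
# Sketch of equations (3)-(5): average perceived level of poor quality over the
# beginning, middle and end thirds of the video. `series` is a list of
# (timestamp_s, value) pairs relative to the start of playback.
def interval_averages(series, duration_s):
    thirds = ([], [], [])
    for t, v in series:
        if t < duration_s / 3:
            thirds[0].append(v)          # beginning
        elif t < 2 * duration_s / 3:
            thirds[1].append(v)          # middle
        else:
            thirds[2].append(v)          # end
    return tuple(sum(vals) / len(vals) if vals else 0.0 for vals in thirds)

# Example: avg_beg, avg_mid, avg_end = interval_averages(normalized_series, duration_s=720)
```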

Secondly, we identified the frequency of artifacts a participant responded to during the video watching experience. This too was done using an automated program based on the observations made from the normalized graphs. The details of the complete process are given below and the detailed analysis is in section 5.2.2.

The following observations were made from the normalized graphs obtained for each participant:

— Participants tend to respond using the Wiimote in a way that shows up as a local maximum in the response curve for artifacts that degrade their QoE.

— Participants tend to move towards a local minimum for artifacts that improve their QoE.

— Participants take about 1-2 seconds to react to an artifact.

— It takes about 2-5 seconds to see a local maximum or minimum after the participant starts reacting.

— It might take longer to see peaks if the duration of an artifact is more than 5 seconds.

— The magnitude of response at the start of the video is not very large.

— The magnitude of response increases as the participant sees more frequent and different types of artifacts as the video progresses.

— There might be a combined response of high magnitude for multiple artifacts, all of which occur within a span of 5 seconds or less.

We developed a method to automatically detect and code local maxima and minima. The method identifies responses/reactions by the participant to a particular artifact. The program checks for a local maximum or minimum for each artifact, and if the maximum or minimum met a specified threshold it was considered to be a response to the corresponding artifact. The program uses the following heuristics:

— A local maximum was considered for stalls, frame drops, and quality downgrades. The basis for this assumption is that we expect frustration to rise for artifacts that degrade the QoE. Thus, a maximum serves as an indicator of degraded perceived quality.

— A local minimum was considered for quality upgrades. The basis for this assumption is that we expect the frustration level to drop for artifacts that improve the QoE. Thus, a minimum serves as an indicator of improved perceived quality.

— The threshold was set to 0.05 units on the normalized scale for the detection of maxima and minima.

— The time span considered for a local maximum or minimum was 5 seconds from the start time of the artifact.

— If the artifact lasted more than 5 seconds, the complete duration of the artifact was considered as the span for detecting maxima or minima.

The value of the maximum or minimum within the considered span for an artifact was compared to the value reported by the participant at the beginning of the corresponding artifact to check against the threshold.
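A minimal sketch of these detection heuristics is shown below; the data structures (an artifact as a (start, end, kind) tuple and the two helper callables) are assumptions for illustration, while the actual program operated on the logged text files.

```python
# Sketch of the response-detection heuristics described above. An artifact is
# (start_s, end_s, kind); value_at(t) returns the normalized response at time t
# and window_extremum(t0, t1) returns (min, max) over [t0, t1].
THRESHOLD = 0.05          # normalized units
DEFAULT_SPAN_S = 5.0      # window searched after the artifact starts

DEGRADING = {"stall", "frame_drop", "quality_downgrade"}

def detect_response(artifact, value_at, window_extremum):
    start, end, kind = artifact
    # Use the full artifact duration if it exceeds the default 5-second span.
    span_end = max(start + DEFAULT_SPAN_S, end)
    baseline = value_at(start)
    window_min, window_max = window_extremum(start, span_end)
    if kind in DEGRADING:
        # Degrading artifacts: look for a local maximum (rise in frustration).
        return (window_max - baseline) >= THRESHOLD
    # Quality upgrades: look for a local minimum (drop in frustration).
    return (baseline - window_min) >= THRESHOLD
```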

The process of quantifying the artifact responses using the program mentioned above gave us the number of artifacts a participant reacted to during the video viewing experience. These responses were marked on the graph using colored spherical marker tags placed right above the maximum or minimum corresponding to that artifact (see figure 6).

Marker tags of different colors and sizes were used to represent responses to different artifacts. The color of a tag corresponds to the color of the artifact whose response it represents. Thus, the red tags represent responses corresponding to stalls, the orange tags correspond to frame drops, the yellow tags correspond to quality downgrades, and the green tags correspond to quality upgrades. Different sizes were used to make it easier to read responses when the same peak was considered for 2 different artifacts, in case both artifacts occurred within a span of 5 seconds or less. In a separate validation process, the automatic method of identifying local maxima and minima was verified for correctness by applying the automatic detection and tagging method to experimenter-created data, and it was found to be robust in detecting instances of perceived poor quality or improvement in quality corresponding to the artifacts found in the videos.

The total number of artifacts responded to by a participant was recorded. A summary of each condition can be seen in Table IV. Since the number of artifacts actually present in the video was different for each condition, the participant responses were converted into percentages. This was done by dividing the number of artifacts a participant responded to by the total number of artifacts present in the video he/she viewed.

We were interested in analyzing the effect of the adaptation strategy. However, because the video conditions from the same level of impairment did not contain at least 1 of each type of artifact, we only considered a combined count of all types of artifacts. Thus only the percentage of total artifacts responded to by the participant was considered for studying the effects of streaming strategy on QoE. The results of the analysis are explained in detail in the results section.

4.4.3. Qualitative Measure. The post-survey also asked participants for comments about their experience at the beginning of the video, at the end of the video, and overall. Comments related to the video quality and the artifacts that occurred during the video were selected from each video condition for further analysis. A detailed analysis can be found in section 7.


Fig. 6. Graph with maxima and minima represented by spherical marker tags; the color of a tag corresponds to the color of the artifact; minima for quality upgrades; maxima for all other artifacts

5. RESULTS
5.1. Quantitative Subjective Results

5.1.1. Effects of Network Algorithm, Impairment, and Sampling Time on the User's Frustration, Clarity, Distortion, and Perceived Levels of Quality of the Video Experience.

For each of the dependent measures (frustration, clarity, distortion, and the continuous measure of quality), we performed a 3 x 2 x 2 mixed-model Analysis of Variance (ANOVA). Levene's test of equality of variance was conducted on the data to ensure near-equal variance in the samples prior to conducting parametric analysis. Sampling time (beginning, end, and overall) was the within-subjects independent variable. Algorithm (Strategy 1 (S1) and Strategy 2 (S2)) and impairment (high and low) were the between-subjects independent variables. Greenhouse-Geisser adjusted degrees of freedom were used when Mauchly's test of sphericity was significant. Post-hoc analyses were computed using Tukey-HSD and pairwise comparisons with Bonferroni-adjusted type-I error.
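For illustration, the following is a sketch of how one of the 3 x 2 block analyses reported below (sampling time within subjects, algorithm between subjects) might be run in Python with the pingouin package; the full 3 x 2 x 2 design with two between-subjects factors requires a more general tool. The DataFrame columns and file name are hypothetical and not part of our analysis pipeline.

import pandas as pd
import pingouin as pg

# Long-format data: one row per participant per sampling time, with
# hypothetical columns 'subject', 'time' (beginning/end/overall),
# 'algorithm' (S1/S2), and the frustration QoE score.
df = pd.read_csv("frustration_scores.csv")

# 3 x 2 mixed-model ANOVA: sampling time is within subjects,
# algorithm is between subjects. Sphericity correction is applied
# when Mauchly's test is significant.
aov = pg.mixed_anova(data=df, dv="frustration", within="time",
                     subject="subject", between="algorithm",
                     correction="auto", effsize="np2")
print(aov)

# Bonferroni-adjusted pairwise comparisons across sampling times.
posthoc = pg.pairwise_tests(data=df, dv="frustration", within="time",
                            subject="subject", between="algorithm",
                            padjust="bonf")
print(posthoc)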

(a) Frustration: The ANOVA did not reveal a significant main effect of time, but revealed a significant sampling time by algorithm interaction on frustration QoE scores, F(1.36, 74.8) = 3.665, p = 0.047, η² = 0.06. The data also revealed a significant sampling time by impairment interaction on frustration QoE scores, F(1.36, 74.8) = 5.98, p = 0.010, η² = 0.09. Recall that low frustration QoE scores indicate high levels of viewer frustration


Fig. 7. Mean frustration QoE scores in the beginning, end and overall in the S2 (Reactive: Capacity based) and S1 (Smooth: Buffer based) algorithm viewing conditions

and high frustration QoE scores indicate low levels of viewer frustration in the video viewing experience.

In order to further examine the interaction effects, we first performed a block analysis examining the mean frustration QoE scores in the beginning, end, and overall by algorithm as a 3 x 2 mixed-model ANOVA. The analysis revealed a non-significant main effect of sampling time on frustration scores, a non-significant main effect of algorithm (S2 vs. S1), as well as a non-significant sampling time by algorithm interaction. Block analyses of frustration scores within the S2 algorithm condition and within the S1 algorithm condition, conducted via one-way within-subjects ANOVAs across all sampling times, were also not significant. We also performed multiple pairwise independent-samples t-tests to examine whether the frustration scores differed within blocks of sampling time (beginning, end, and overall). Frustration scores were significantly lower at the end of the video watching experience in the S2 algorithm condition (M=2.60, SD=1.16) as compared to the S1 algorithm condition (M=3.24, SD=1.29), t = -1.99, p = 0.050, d = 0.52. However, frustration scores were not significantly different between the S2 and S1 algorithm conditions at the beginning and overall sampling times.

We then performed a block analysis examining the mean frustration QoE scores in the beginning, end, and overall by impairment as a 3 x 2 mixed-model ANOVA (see Figure 8). The analysis revealed a non-significant main effect of sampling time on frustration scores, a significant main effect of impairment (high vs. low), F(1, 57) = 4.57, p = 0.037, η² = 0.074, and a significant sampling time by impairment interaction, F(1.36, 77.94) = 5.312, p = 0.015, η² = 0.085. A block analysis of frustration scores within the low impairment condition, conducted via a one-way within-subjects ANOVA across all sampling times, revealed a significant effect, F(1.3, 37.8) = 4.12, p = 0.040, η² = 0.12. A post-hoc pairwise analysis conducted via Bonferroni adjustment revealed that frustration scores in the beginning (M=2.7, SD=1.37) were significantly lower


Fig. 8. Mean frustration QoE scores in beginning, end and overall by impairment levels high and low

than overall (M=3.53, SD=1.24), p = 0.028. A block analysis of frustration scores within the high impairment condition, conducted via a one-way within-subjects ANOVA across all sampling times, did not reveal a significant effect. A block analysis of frustration scores between the low and high impairment conditions in the beginning was not significant. However, mean frustration scores were significantly lower in the high impairment condition (M=2.58, SD=0.95) at the end of the viewing experience as compared to the low impairment condition (M=3.23, SD=1.45), t(57) = 2.01, p = 0.048, d = 0.52. Mean frustration scores were also significantly lower in the high impairment condition (M=2.52, SD=0.95) overall in the viewing experience as compared to the low impairment condition (M=3.53, SD=1.22), t(57) = 3.55, p = 0.001, d = 0.94.

(b) Clarity: The ANOVA revealed a significant main effect of time, F(1.35, 64.45) = 23.10, p < 0.001, η² = 0.32, and a significant sampling time by impairment interaction on clarity QoE scores, F(1.32, 64.45) = 7.38, p = 0.005, η² = 0.13. The sampling time by algorithm interaction as well as the three-way interaction on clarity QoE scores were not significant. Using the Bonferroni technique, we found that clarity scores in the beginning (M=2.30, SD=1.17) were significantly lower than at the end (M=3.22, SD=1.26), p < 0.001, as well as compared to overall (M=3.39, SD=1.30), p < 0.001. Recall that low clarity QoE scores correspond to low levels of perceived clarity and high clarity QoE scores correspond to high levels of perceived clarity in the video viewing experience.

In order to further examine the interaction effects, we first performed a block analysis examining the mean clarity scores in the beginning, end, and overall by impairment as a 3 x 2 mixed-model ANOVA. The analysis revealed a significant main effect of sampling time on clarity scores, F(1.32, 67.42) = 23.56, p < 0.001, η² = 0.31, a main effect of impairment, F(1, 51) = 12.35, p = 0.001, η² = 0.2, and an interaction effect of


Fig. 9. Mean clarity QoE scores in high and low impairment conditions by sampling time

sampling time by impairment on clarity scores, F(1.32, 67.42) = 7.84, p = 0.003, η² = 0.13 (see Figure 9). Block analyses of clarity scores were conducted within the high impairment condition and within the low impairment condition via a one-way within-subjects ANOVA within each impairment block. The one-way within-subjects ANOVA of clarity scores in the low impairment condition was significant, F(1.34, 37.50) = 26.68, p < 0.001, η² = 0.49. The contrast analysis using the Bonferroni method revealed that clarity scores in the beginning (M=2.37, SD=1.32) were significantly lower than at the end (M=3.72, SD=1.19), p < 0.001, as well as compared to overall (M=4.03, SD=1.12), p < 0.001. The one-way within-subjects ANOVA of clarity scores in the high impairment condition was not significant.

We also performed multiple pairwise independent-samples t-tests to examine whether the clarity scores differed between the high and low impairment conditions within each block of sampling time (beginning, end, and overall). The difference in clarity scores in the beginning between the low and high impairment conditions was not significant. However, clarity scores were significantly lower in the high impairment condition (M=2.62, SD=1.09) at the end as compared to the low impairment condition (M=3.72, SD=1.19), t(51) = -3.46, p = 0.001, d = 0.97. Clarity scores overall were significantly lower in the high impairment condition (M=2.62, SD=1.09) as compared to the low impairment condition (M=4.03, SD=1.12), t(51) = -4.61, p < 0.001, d = 1.29.

(c) Distortion: The ANOVA revealed a significant main effect of time, F(1.58, 85.33) = 5.53, p = 0.01, η² = 0.09, a main effect of impairment, F(1, 54) = 16.93, p < 0.001, η² = 0.24, and a significant sampling time by impairment interaction on distortion


Fig. 10. Mean Distortion QoE scores in high and low impairment conditions by sampling time

QoE scores, F(1.58, 85.33) = 11.03, p < 0.001, η² = 0.17. The sampling time by algorithm interaction as well as the three-way interaction on distortion QoE scores were not significant. Note that low distortion QoE scores correspond to higher levels of perceived distortion in the video, whereas high distortion QoE scores correspond to low levels of perceived distortion in the video viewing experience. Using the Bonferroni technique, we found that distortion QoE scores in the beginning (M=2.65, SD=1.01) were significantly lower than overall (M=3.21, SD=1.18), p = 0.005. Overall, distortion QoE scores in the high impairment condition (M=2.55, SD=0.78) were significantly lower than in the low impairment condition (M=3.39, SD=0.79), p < 0.001.

In order to further examine the interaction effects, we first performed a block analysis examining the mean distortion QoE scores in the beginning, end, and overall by impairment as a 3 x 2 mixed-model ANOVA. The analysis revealed a significant main effect of sampling time on distortion scores, F(1.58, 88.27) = 5.85, p = 0.007, η² = 0.09, a main effect of impairment, F(1, 56) = 16.36, p < 0.001, η² = 0.23, and an interaction effect of sampling time by impairment on distortion scores, F(1.58, 88.27) = 11.62, p < 0.001, η² = 0.17 (see Figure 10). Block analyses of distortion QoE scores were conducted within the high impairment condition and within the low impairment condition via a one-way within-subjects ANOVA within each impairment block. The one-way within-subjects ANOVA of distortion scores in the low impairment condition was significant, F(1.58, 44.25) = 13.24, p < 0.001, η² = 0.32. The contrast analysis using the Bonferroni method revealed that distortion QoE scores in the beginning (M=2.62, SD=1.17) were significantly lower than at the end (M=3.58,


SD=1.37), p = 0.015, as well as compared to overall (M=3.93, SD=1.07), p < 0.001, in the low impairment condition. The one-way within-subjects ANOVA of distortion QoE scores in the high impairment condition was not significant, and at the beginning, end, and overall, the distortion QoE scores were all lower than the mean scores in the low impairment condition at each sampling time.

5.2. Quantitative Objective Results
5.2.1. Results from an analysis of the continuous measure of user perception of video quality.

After the participant's continuous response was filtered and normalized based on the technique highlighted in Section 4.4.2, the following data analysis was performed. Final scores ranged from 0 (high video quality, low levels of viewer frustration) to 1 (low video quality, high levels of viewer frustration). We examined to what extent users perceived the quality in the beginning (first four minutes), middle (middle four minutes), and end (last four minutes) of the video viewing experience as a function of the impairment (high versus low) and algorithm (S2 versus S1), on the filtered and normalized quality rating (0 = high quality/low frustration, 1 = low quality/high frustration). The continuous measure of user quality was averaged across the four-minute periods in the beginning, middle, and end of the video viewing experience.
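As a rough illustration of this step, the sketch below averages a normalized continuous rating over the first, middle, and last four minutes of a session. It assumes a simple min-max normalization and uses hypothetical variable names; the actual filtering applied to the raw response is not reproduced here.

import numpy as np

def summarize_continuous_rating(times, raw_rating, segment_len=240.0):
    """Normalize a continuous quality rating to [0, 1] and average it over
    the first, middle, and last `segment_len` seconds of the session.
    Assumes min-max normalization purely for illustration."""
    t = np.asarray(times, dtype=float)
    r = np.asarray(raw_rating, dtype=float)
    normalized = (r - r.min()) / (r.max() - r.min())
    total = t[-1]
    windows = {
        "beginning": (0.0, segment_len),
        "middle": ((total - segment_len) / 2.0, (total + segment_len) / 2.0),
        "end": (total - segment_len, total),
    }
    # Mean normalized rating within each four-minute window.
    return {name: float(normalized[(t >= lo) & (t <= hi)].mean())
            for name, (lo, hi) in windows.items()}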

The average quality measures were treated with a three-way mixed-model ANOVA, with impairment (2 levels: high and low) and algorithm (2 levels: S1 and S2) as between-subjects factors and time of video viewing (3 levels: beginning, middle, end) as a within-subjects (repeated measures) factor. Greenhouse-Geisser adjusted degrees of freedom were used when Mauchly's test of sphericity was significant. Post-hoc analyses were computed using Tukey-HSD and pairwise comparisons with Bonferroni-adjusted type-I error. Recall that low continuous measure scores reveal high video quality (low levels of frustration), and high scores reveal poor video quality (high levels of frustration).

The mixed-model ANOVA revealed a significant main effect of time, F(2, 108) = 20.13, p < 0.001, η² = 0.27. The main effects of algorithm and impairment were not significant. However, the sampling time by algorithm interaction was significant, F(2, 108) = 5.83, p = 0.004, η² = 0.09, and the sampling time by impairment interaction was also significant, F(2, 108) = 5.79, p = 0.004, η² = 0.09. The three-way interaction of sampling time by algorithm by impairment was also significant, F(2, 108) = 3.42, p = 0.036. In order to thoroughly investigate the interaction effects, we performed a series of block analyses. We first examined the difference in perceived video quality scores between the S2 and S1 algorithms within each sampling time via independent-samples t-tests, which revealed no significant result. A similar analysis was conducted to examine the difference in perceived quality scores between high and low impairment within each sampling time via independent-samples t-tests. We found that in the middle four minutes of the movie watching experience, the perceived poor quality in the video was significantly higher in the high impairment condition (M=0.50, SD=0.17) as compared to the low impairment condition (M=0.36, SD=0.21), t(56) = 2.82, p = 0.007, d = 0.75.

In order to examine whether there were pairwise differences between the beginning, middle, and end sampling times within levels of the impairment or algorithm conditions, we performed a one-way within-subjects ANOVA within blocks of the levels of the


Fig. 11. Graphs show the mean normalized scores for the first, middle and last 4 minutes of the video viewing experience of the continuous measure of perceived poor quality and subsequent levels of frustration in the S2 (Reactive: Capacity based algorithm)/S1 (Smooth: Buffer based) conditions (left) and in the high/low impairment conditions (right)

independent variable. A one-way ANOVA of sampling times within the S2 condition was significant, F(2, 58) = 18.07, p < 0.001, η² = 0.38. Pairwise comparisons using the Bonferroni method revealed that perceived levels of poor quality were significantly lower in the beginning (M=0.31, SD=0.13) as compared to the middle (M=0.45, SD=0.16), p < 0.001, and significantly lower in the beginning as compared to the end (M=0.46, SD=0.16), p < 0.001. A one-way ANOVA of sampling times within the S1 condition was significant, F(2, 54) = 3.37, p = 0.04, η² = 0.11. Pairwise comparisons using the Bonferroni method did not reveal any significant pairwise differences between sampling times in the S1 condition. A one-way ANOVA of sampling times within the high impairment condition was significant, F(2, 56) = 12.40, p < 0.001, η² = 0.31. Pairwise comparisons using the Bonferroni method revealed that perceived levels of poor quality were significantly lower in the beginning (M=0.38,


Table IV. Artifacts and response summary for each condition

Impairment  Measure                                     Strategy I (Buffer based)   Strategy II (Capacity based)
High        Number of Stalls                            0                           17
            Number of Frame Drops                       4                           5
            Number of Quality Degrades                  6                           10
            Number of Quality Upgrades                  3                           9
            Total                                       13                          41
            Mean proportion of times participants
            reacted to artifacts                        31.32%                      51.87%
            Std. Deviation                              21.43%                      15.77%
Low         Number of Stalls                            0                           12
            Number of Frame Drops                       1                           16
            Number of Quality Degrades                  0                           0
            Number of Quality Upgrades                  1                           1
            Total                                       2                           29
            Mean proportion of times participants
            reacted to artifacts                        53.57%                      56.09%
            Std. Deviation                              41.42%                      24.68%

SD=0.16) as compared to the middle (M=0.49, SD=0.17), p = 0.001, and perceived levels of poor quality were also significantly lower in the beginning as compared to the end (M=0.46, SD=0.17), p = 0.005. Similarly, a one-way ANOVA of sampling times within the low impairment condition was significant, F(2, 56) = 11.59, p < 0.001, η² = 0.29. Pairwise comparisons using the Bonferroni method revealed that perceived levels of poor quality were significantly lower in the beginning (M=0.30, SD=0.17) as compared to the end (M=0.45, SD=0.23), p = 0.001, and significantly lower in the middle (M=0.36, SD=0.21) as compared to the end (M=0.45, SD=0.23), p = 0.008.

5.2.2. Results from the automatic classification of users' reactions to the artifacts in the video.

To analyze the quantitative results obtained from the automatic classification of users' reactions to the presence of the artifacts in the continuous user response recorded through the hand-held input device, we performed a block frequency analysis on the total response percentage in each condition. The means and standard deviations for each condition are given in Table IV.

The distribution of scores failed the test of normality; thus a Kruskal-Wallis H test was performed on the data. This was done separately under each network impairment condition to study the effect of algorithm.
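For reference, a minimal sketch of such a test using scipy.stats.kruskal is shown below, with the eta-squared effect size estimated from the H statistic as η² = (H - k + 1)/(n - k) for k groups and n observations. The response-percentage values in the example are hypothetical and only illustrate the calling convention.

from scipy.stats import kruskal

def kruskal_with_eta_squared(group_a, group_b):
    """Kruskal-Wallis H test for two independent groups of response
    percentages, plus the eta-squared effect size estimated from H."""
    h_stat, p_value = kruskal(group_a, group_b)
    k = 2
    n = len(group_a) + len(group_b)
    eta_squared = (h_stat - k + 1) / (n - k)
    return h_stat, p_value, eta_squared

# Hypothetical per-participant response percentages under high impairment.
s1_high = [0.25, 0.31, 0.08, 0.46, 0.38, 0.23, 0.54]
s2_high = [0.49, 0.61, 0.39, 0.56, 0.71, 0.44, 0.59]
print(kruskal_with_eta_squared(s1_high, s2_high))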

The test showed a statistically significant difference in response percentages between the video streaming algorithms under high network impairment, χ²(2) = 5.572, p = 0.018, η² = 0.19, with a mean rank of 11.14 for Strategy I and 18.60 for Strategy II. The video streaming strategy accounts for 19% of the variance in the response percentages. Participants seem to react more frequently to artifacts while viewing videos streamed using Strategy II under high network impairment, resulting in higher overall frustration levels.


The test showed no statistically significant difference in response percentages between the video streaming conditions under low network impairment.

6. QUALITATIVE RESULTS
The post-survey also asked the viewer for comments on the perceived quality of the video. We used these comments to qualitatively assess the perceived level of quality of the overall experience.

For the first streaming condition, i.e., Strategy 1 under high network impairment (S1-Hi), some viewers thought that a smooth streaming video is better than having stalls and that the overall audio and video quality were great: “Smooth is better than stall”, “The video overall looked and sounded great.” Other viewers did not like the video quality as much and would have switched to a better source: “If I was streaming this, I would have tried to find another source and only streamed from here if it was the only option”, “video quality seemed much poorer this time around overall”, “Quality could be better”.

For the second condition, i.e., Strategy 2 under high network impairment (S2-Hi), comments were critical of the frequent stalls and the low video quality. Viewers felt that the magnitude of impairment was enough to prevent them from understanding the plot of the movie: “pauses in video were most frustrating”, “Well, it just looks grainy. Especially on such a large screen”, “Between the buffering and grainy clarity, I was not able to enjoy the movie”, “Pauses in video were more prominent by the end, and it detracted from my following the plot of the movie”.

Viewers from the third condition, Strategy 1 under low network impairment (S1-Lo), thought that the overall experience was great and nothing was frustrating overall: “Extremely great video quality and audio quality”, “I could tell what was going on and it was a good short film. Nothing seemed too frustrating overall”, “Even though the beginning was blurry, an overwhelming majority of the film was very crisp”. However, a few viewers thought that the sound quality was random and the video had a lot of distortion: “Audio volume seemed random through the whole thing. Lots of tearing and distortion in the video”.

For the last condition, Strategy 2 under low network impairment (S2-Lo), viewers seemed to like the video quality but were frustrated with the frequent stalls and frame drops: “Overall, I was impressed with the quality. Other than the beginning quality, and ending lagging/chopping”, “When the video started to freeze and skip it became much more difficult to follow along with what was happening on screen”, “The video image was very good, but the laging/chopping was very frustrating”, “Towards the end of the video, the picture and audio seemed to freeze at certain moments and then skip ahead, which made viewing and understanding what was happening a little more difficult”.

Ranking the four video conditions based on viewer comments, we summarize that S2-Hi seems to provide the worst viewing experience. S1-Hi is better than S2-Hi, with smoother video playback. S1-Lo is better than S1-Hi, with better video quality, and S2-Lo provides the best playback of the four, with an exceptionally good overall experience.


7. DISCUSSION
From the quantitative subjective results (Section 5.1.1), we observe that the frustration QoE scores were lower towards the end of the video in Strategy 2 than in Strategy 1. For the quantitative subjective measures (Section 4.4.1), higher frustration is represented by lower scores on the QoE Likert scale; thus the above observation supports the 1st hypothesis, which states that the perceived quality is greater for Strategy 1 as compared to Strategy 2. The 1st hypothesis is also supported by the quantitative objective results (Section 5.2.2): in the high network impairment condition, participants responded to artifacts more frequently while viewing videos streamed using Strategy 2, resulting in higher frustration levels overall in Strategy 2 (capacity based) as compared to Strategy 1 (buffer based). The user comments from the qualitative results (Section 6) also show a similar trend, with viewers favoring videos streamed under high impairment using Strategy 1 rather than Strategy 2, hence supporting the 1st hypothesis. A possible explanation for this trend could be the presence of a larger number of artifacts in videos streamed using Strategy 2. Videos streamed using Strategy 2 also tend to have buffering stalls among the other artifacts, which seem to have the largest impact on user engagement [Dobrian et al. 2011].

User comments (Section 6) also support the 2nd hypothesis, which states that in the low impairment condition, the perceived quality is less negatively impacted in Strategy 2. Although the video streamed using Strategy 2 under low impairment exhibits more artifacts than Strategy 1 (Section 5.2.2, Table IV), viewer comments relate that the quality was better for Strategy 2 streamed videos in comparison to Strategy 1. This makes sense due to the higher playerRate in Strategy 2: since Strategy 2 tries to jump to the highest available bandwidth, the playback is of higher quality.

The frustration scores observed in the beginning of the video are lower than the frustration scores observed overall, and the distortion scores are higher towards the end and overall as compared to the beginning in the low network impairment condition (Section 5.1.1). For both frustration and distortion, the higher the values on the Likert scale, the lower the frustration/distortion (Section 4.4.1); thus these results portray a higher level of perceived quality as the video progresses. This does not tend to support the 3rd hypothesis, which states that the perceived level of poor quality is expected to increase with the number of artifacts over time. The same is true for the clarity scores obtained (Section 5.1.1), with scores being higher towards the end of the video and overall as compared to the beginning; the 3rd hypothesis is not supported, as higher clarity scores represent a higher value on the Likert scale. One possible explanation for these results could be that, when measured after the viewing experience, participants are more inclined to remember the effects of the artifacts in the beginning of the viewing experience more so than towards the end. This could be due to a novelty factor that facilitates better recall of the video viewing experience in the beginning; through acclimation over time, participants may be more inclined to remember the impact of the artifacts in the beginning as compared to the end of the video viewing experience. Therefore, we believe that the continuous measure is more robust than the subjective post-experience questionnaire, as the recorded response of the participants is immediate rather than after the fact and is less prone to short-term memory effects.

Although the quantitative subjective results do not support the 3rd hypothesis, the quantitative objective results (Section 5.2.1) and qualitative results (Section 6) do support it. An analysis of the continuous measure of users' perceived levels of


Table V. Hypotheses Result Summary.

Hypothesis      Result Summary
Hypothesis 1    Supported by all results
Hypothesis 2    Supported by all results
Hypothesis 3    Inconclusive; partially supported by quantitative objective and qualitative results
Hypothesis 4    Supported by all results

poor quality, with higher levels representing higher frustration, showed that the perceived level of poor quality for Strategy 2 was lower in the beginning as compared to the middle and the end under both high and low impairment conditions. Multiple user comments from different viewing conditions relate that stalls became more frequent towards the end, along with “chopping” and other sound-related issues, which made it difficult for the users to understand the plot of the movie. Therefore, the analysis of the continuous measure of perceived levels of video quality revealed that in Strategy 2 (capacity based) the level of poor quality increased dramatically from beginning to middle to end, as compared to Strategy 1 (buffer based). As with the QoE-based measures, the perceived level of poor quality in the continuous measure was significantly higher in the high impairment condition as compared to the low impairment condition. In the high impairment condition, the perceived level of poor quality was low in the beginning but increased in the middle and end of the video viewing experience, whereas in the low impairment condition, the perceived level of poor quality also seemed to increase from beginning to end, but the rate of increase was less dramatic and steep as compared to the high impairment condition.

The analysis of the automatic classification of user responses from the quantitative objective results (Section 5.2.2) showed that under high network impairment, participants responded to artifacts more frequently in Strategy 2 as compared to Strategy 1, implying higher perceived levels of poor quality for Strategy 2. This supports the 4th hypothesis, which states that the magnitude of change in perceived poor quality over time is expected to be higher in Strategy 2 as compared to Strategy 1. Overall, when presented with high impairment situations, perhaps arising due to network congestion, the quantitative and qualitative results of our user study suggest that participants favor the buffer-based Strategy 1 over the capacity-based Strategy 2 of video streaming.

8. CONCLUSIONS AND FUTURE WORK
We conducted a within-subjects user study involving a 2x2 factorial methodology based on the level of network impairment (high impairment or low impairment) and the choice of adaptation design strategy (buffer-based or capacity-based). We also explored the sensitivity of the results to the time during the 11-minute viewing session (first half, second half, or at the end considering the session in its entirety). We identified a set of four hypotheses. Based on quantitative subjective, quantitative objective, and qualitative measures, Table V provides a summary of the hypothesis results.

The findings surrounding hypotheses 1 and 2 suggest that a buffer-based strategy might provide a better experience under higher network impairment conditions. For the two network scenarios considered, the buffer-based strategy is effective in avoiding stalls but does so at the cost of reduced video quality (based on the lower playerRate


results). Participants in Strategy 1 do notice the drop in video quality, causing a decrease in perceived video quality. However, the perceived levels of video quality, viewer frustration, and opinions of video clarity and distortion are significantly worse due to artifacts such as stalls in Strategy 2, as compared to Strategy 1. The capacity-based strategy tries to provide the highest video quality possible but produces many more artifacts during playback. The results suggest that player video quality has more of an impact on perceived quality when stalls are infrequent.

The quantitative subjective results did not support the 3rd hypothesis. However, as these measures were recorded using a post-survey after the session, it is possible that one or more particular artifacts observed by the participant may have swayed the participant's opinion. The same might not have been true for the continuous perceived quality rating. Another reason for the above observation could be that the participants started to get accustomed to the artifacts and did not react to them as readily later during the video playback.

The objective of our study was to provide design guidance for a HAS adaptation algorithm. Our results can be interpreted from one of two perspectives. First, at a macroscopic level, the results show the human response to two different design approaches considered in two different realizations of a streaming session over an emulated network environment. Second, at a microscopic level, the results show user responses to specific artifacts. For the former, our results suggest that, at least for the two specific network scenarios considered, buffer-based adaptation results in higher perceived quality compared to a capacity-based adaptation algorithm. For the latter, we have collected (and have made publicly available) quantitative objective continuous results that show how different participants react to the same set of artifacts. The majority of the analysis presented in this paper focuses on the macroscopic rather than the microscopic results. In ongoing work, we continue to explore and understand the human reaction to specific artifacts (or sequences of artifacts) that we have collected. We also plan to consider the impact of both sound quality and network impairment on the viewer's comprehension of the plot and storyline of the short movie under each of the strategies. The objective would be to empirically evaluate whether the presence and types of artifacts have cognitive effects on viewers. Finally, we plan on extending the scope and scale of the user study by developing and conducting an online study designed to obtain results from potentially a very large number of participants.

REFERENCES
Florence Agboma and Antonio Liotta. 2007. Addressing user expectations in mobile content delivery. Mobile Information Systems 3, 3-4 (2007), 153–164.
Saamer Akhshabi, Lakshmi Anantakrishnan, Ali C Begen, and Constantine Dovrolis. 2012. What happens when HTTP adaptive streaming players compete for bandwidth?. In Proceedings of the 22nd international workshop on Network and Operating System Support for Digital Audio and Video. ACM, 9–14.
Saamer Akhshabi, Ali C Begen, and Constantine Dovrolis. 2011. An experimental evaluation of rate-adaptation algorithms in adaptive streaming over HTTP. In Proceedings of the second annual ACM conference on Multimedia systems. ACM, 157–168.
Claudio Alberti, Daniele Renzi, Christian Timmerer, Christopher Mueller, Stefan Lederer, Stefano Battista, and Marco Mattavelli. 2013. Automated QoE evaluation of dynamic adaptive streaming over HTTP. In Quality of Multimedia Experience (QoMEX), 2013 Fifth International Workshop on. IEEE, 58–63.
ITU Radiocommunication Assembly. 2003. Methodology for the subjective assessment of the quality of television pictures. International Telecommunication Union.
Athula Balachandran, Vyas Sekar, Aditya Akella, Srinivasan Seshan, Ion Stoica, and Hui Zhang. 2012. A quest for an internet video quality-of-experience metric. In Proceedings of the 11th ACM workshop on hot topics in networks. ACM, 97–102.


Nicola Cranley, Philip Perry, and Liam Murphy. 2006. User perception of adapting video quality. International Journal of Human-Computer Studies 64, 8 (2006), 637–647.
Luca De Cicco, Vito Caldaralo, Vittorio Palmisano, and Saverio Mascolo. 2013. Elastic: a client-side controller for dynamic adaptive streaming over http (dash). In Packet Video Workshop (PV), 2013 20th International. IEEE, 1–8.
Luca De Cicco and Saverio Mascolo. 2014. An adaptive video streaming control system: Modeling, validation, and performance evaluation. Networking, IEEE/ACM Transactions on 22, 2 (2014), 526–539.
Toon De Pessemier, Katrien De Moor, Wout Joseph, Lieven De Marez, and Luc Martens. 2013. Quantifying the influence of rebuffering interruptions on the user's quality of experience during mobile video watching. Broadcasting, IEEE Transactions on 59, 1 (2013), 47–61.
Natalie Degrande, Koen Laevens, Danny De Vleeschauwer, and Randy Sharpe. 2008. Increasing the user perceived quality for IPTV services. Communications Magazine, IEEE 46, 2 (2008), 94–100.
Florin Dobrian, Vyas Sekar, Asad Awan, Ion Stoica, Dilip Joseph, Aditya Ganjam, Jibin Zhan, and Hui Zhang. 2011. Understanding the impact of video quality on user engagement. ACM SIGCOMM Computer Communication Review 41, 4 (2011), 362–373.
ETSI. 2012. Transparent End-to-end Packet Switched Streaming Service (PSS); Progressive Download and Dynamic Adaptive Service over HTTP. 3GPP TS 26.247 version 10.1.0 Release 10 (2012).
Remi Houdaille and Stephane Gouache. 2012. Shaping http adaptive streams for a better user experience. In Proceedings of the 3rd Multimedia Systems Conference. ACM, 1–9.
Te-Yuan Huang, Nikhil Handigol, Brandon Heller, Nick McKeown, and Ramesh Johari. 2012. Confused, timid, and unstable: picking a video streaming rate is hard. In Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, 225–238.
Te-Yuan Huang, Ramesh Johari, Nick McKeown, Matthew Trunnell, and Mark Watson. 2014. A buffer-based approach to rate adaptation: Evidence from a large video streaming service. In Proceedings of the 2014 ACM conference on SIGCOMM. ACM, 187–198.
Quan Huynh-Thu and Mohammed Ghanbari. 2006. Impact of jitter and jerkiness on perceived video quality. In Proceedings of the Workshop on Video Processing and Quality Metrics.
Raf Huysegems, Bart De Vleeschauwer, Koen De Schepper, Chris Hawinkel, Tingyao Wu, Koen Laevens, and Werner Van Leekwijck. 2012. Session reconstruction for HTTP adaptive streaming: Laying the foundation for network-based QoE monitoring. In Proceedings of the 2012 IEEE 20th International Workshop on Quality of Service. IEEE Press, 15.
P ITU-T RECOMMENDATION. 1999. Subjective video quality assessment methods for multimedia applications. (1999).
Junchen Jiang, Vyas Sekar, and Hui Zhang. 2012. Improving fairness, efficiency, and stability in http-based adaptive video streaming with festive. In Proceedings of the 8th international conference on Emerging networking experiments and technologies. ACM, 97–108.
S Shunmuga Krishnan and Ramesh K Sitaraman. 2013. Video stream quality impacts viewer behavior: inferring causality using quasi-experimental designs. Networking, IEEE/ACM Transactions on 21, 6 (2013), 2001–2014.
Zhi Li, Ali C Begen, Joshua Gahm, Yufeng Shan, Bruce Osler, and David Oran. 2014a. Streaming video over HTTP with consistent quality. In Proceedings of the 5th ACM Multimedia Systems Conference. ACM, 248–258.
Zhi Li, Xiaoqing Zhu, Joshua Gahm, Rong Pan, Hao Hu, Ali Begen, and David Oran. 2014b. Probe and adapt: Rate adaptation for http video streaming at scale. Selected Areas in Communications, IEEE Journal on 32, 4 (2014), 719–733.
Chenghao Liu, Imed Bouazizi, and Moncef Gabbouj. 2011. Rate adaptation for adaptive HTTP streaming. In Proceedings of the second annual ACM conference on Multimedia systems. ACM, 169–174.
Jim Martin, Yunhui Fu, Nicholas Wourms, and Terry Shaw. 2013. Characterizing Netflix Bandwidth Consumption. In Consumer Communications and Networking Conference (CCNC), 2013 IEEE. IEEE, 230–235.
Ricky KP Mok, Xiapu Luo, Edmond WW Chan, and Rocky KC Chang. 2012. QDASH: a QoE-aware DASH system. In Proceedings of the 3rd Multimedia Systems Conference. ACM, 11–22.
Christopher Muller, Stefan Lederer, and Christian Timmerer. 2012. An evaluation of dynamic adaptive streaming over HTTP in vehicular environments. In Proceedings of the 4th Workshop on Mobile Video. ACM, 37–42.
Christopher Muller and Christian Timmerer. 2011. A test-bed for the dynamic adaptive streaming over HTTP featuring session mobility. In Proceedings of the second annual ACM conference on Multimedia systems. ACM, 271–276.


Pengpeng Ni, Ragnhild Eg, Alexander Eichhorn, Carsten Griwodz, and Pal Halvorsen. 2011a. Flicker effects in adaptive video streaming to handheld devices. In Proceedings of the 19th ACM international conference on Multimedia. ACM, 463–472.
Pengpeng Ni, Ragnhild Eg, Alexander Eichhorn, Carsten Griwodz, and Pal Halvorsen. 2011b. Spatial flicker effect in video scaling. In Quality of Multimedia Experience (QoMEX), 2011 Third International Workshop on. IEEE, 55–60.
Yen-Fu Ou, Yan Zhou, and Yao Wang. 2010. Perceptual quality of video with frame rate variation: A subjective study. In ICASSP. 2446–2449.
Ozgur Oyman and Sarabjot Singh. 2012. Quality of experience for HTTP adaptive streaming services. Communications Magazine, IEEE 50, 4 (2012), 20–27.
ITU-T Rec. 2007. P.10/G.100 Amendment 1: New Appendix 1 - Definition of Quality of Experience (QoE). International Telecommunication Union (Jan. 2007).
Sandvine. 2014. Global Internet Phenomena Report. Sandvine Corporation, https://www.sandvine.com/downloads/general/global-internet-phenomena/2014/2h-2014-global-internet-phenomena-report.pdf.
Telecommunication Standardization Sector. 1998. Subjective audiovisual quality assessment methods for multimedia applications. (1998).
Michael Seufert, Sebastian Egger, Martin Slanina, Thomas Zinner, Tobias Hobfeld, and Phuoc Tran-Gia. 2014. A survey on quality of experience of HTTP adaptive streaming. Communications Surveys & Tutorials, IEEE 17, 1 (2014), 469–492.
T Stockhammer, P Frojdh, I Sodagar, and S Rhyu. 2011. Information technology - MPEG systems technologies - Part 6: Dynamic adaptive streaming over HTTP (DASH). ISO/IEC, MPEG Draft International Standard (2011).
Truong Cong Thang, Hung T Le, Huan X Nguyen, Anh T Pham, Jung Won Kang, and Yong Man Ro. 2013. Adaptive video streaming over HTTP with dynamic resource estimation. Communications and Networks, Journal of 15, 6 (2013), 635–644.
Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. Image Processing, IEEE Transactions on 13, 4 (2004), 600–612.
Stefan Winkler and Praveen Mohandas. 2008. The evolution of video quality measurement: from PSNR to hybrid metrics. Broadcasting, IEEE Transactions on 54, 3 (2008), 660–668.
Jun Xia, Yue Shi, Kees Teunissen, and Ingrid Heynderickx. 2009a. Perceivable artifacts in compressed video and their relation to video quality. Signal Processing: Image Communication 24, 7 (2009), 548–556.
Jun Xia, Yue Shi, Kees Teunissen, and Ingrid Heynderickx. 2009b. Perceivable artifacts in compressed video and their relation to video quality. Signal Processing: Image Communication 24, 7 (2009), 548–556.
