A Multi-Agent Q-Learning-based Framework for Achieving Fairness in HTTP Adaptive Streaming

Stefano Petrangeli∗, Maxim Claeys∗, Steven Latré†, Jeroen Famaey∗, Filip De Turck∗

∗Department of Information Technology (INTEC), Ghent University - iMinds, Gaston Crommenlaan 8 (Bus 201), 9050 Ghent, Belgium, email: [email protected]

†Department of Mathematics and Computer Science, University of Antwerp - iMinds, Middelheimlaan 1, 2020 Antwerp, Belgium

Abstract—HTTP Adaptive Streaming (HAS) is quickly becoming the de facto standard for Over-The-Top video streaming. In HAS, each video is temporally segmented and stored in different quality levels. Quality selection heuristics, deployed at the video player, allow dynamically requesting the most appropriate quality level based on the current network conditions. Today's heuristics are deterministic and static, and thus not able to perform well under highly dynamic network conditions. Moreover, in a multi-client scenario, issues concerning fairness among clients arise, meaning that different clients negatively influence each other as they compete for the same bandwidth. In this article, we propose a Reinforcement Learning-based quality selection algorithm able to achieve fairness in a multi-client setting. A key element of this approach is a coordination proxy in charge of facilitating the coordination among clients. The strength of this approach is three-fold. First, the algorithm is able to learn and adapt its policy depending on network conditions, unlike current HAS heuristics. Second, fairness is achieved without explicit communication among agents and thus no significant overhead is introduced into the network. Third, no modifications to the standard HAS architecture are required. By evaluating this novel approach through simulations, under mutable network conditions and in several multi-client scenarios, we are able to show how the proposed approach can improve system fairness up to 60% compared to current HAS heuristics.

I. INTRODUCTION

Nowadays, multimedia applications are responsible for an important portion of the traffic exchanged over the Internet. One of the most relevant applications is video streaming. Particularly, HTTP Adaptive Streaming (HAS) techniques have gained a lot of popularity because of their flexibility, and can be considered as the de facto standard for Over-The-Top video streaming. Microsoft's Smooth Streaming, Apple's HTTP Live Streaming and Adobe's HTTP Dynamic Streaming are examples of proprietary HAS implementations. In a HAS architecture, video content is stored on a server as segments of fixed duration at different quality levels. Each client can request the segment at the most appropriate quality level on the basis of the locally perceived bandwidth. In this way, video playback dynamically changes according to the available resources, resulting in a smoother video streaming. The main disadvantage of current HAS solutions is that the heuristics used by clients to select the appropriate quality level are fixed and static. This entails that they can fail to adapt under highly dynamic network conditions, resulting in freezes or frequent quality switches that can negatively affect the user perceived video quality, the so-called Quality of Experience (QoE).

In order to overcome these issues, we propose to embed a Reinforcement Learning (RL) agent [13] into HAS clients, in charge of dynamically selecting the best quality level on the basis of its past experience. As shown in a single-client scenario [14], this approach is able to outperform current HAS heuristics, achieving a better QoE even under dynamic network conditions, with gains of up to 10%.

In a real scenario, multiple clients simultaneously request content from the HAS server. Often, clients have to share a single medium, and issues concerning fairness among them arise, meaning that the presence of a client has a negative impact on the performance of others. Particularly, fairness can be defined both from an application-aware point of view, as the deviation among clients' achieved QoE, and from an application-agnostic point of view, as the deviation among clients' achieved bit rate. Moreover, a fundamental aspect we have to consider when dealing with learning in a multi-agent setting is that the learning process of an agent influences that of the other agents. For example, when a client selects the i-th quality level, it is using a portion of the shared bandwidth. This decision can have an impact on the performance of the other clients and thus also on their learning process. This mutual interaction can lead to unstable behavior (e.g., the learning process never converges) or unfair behavior (e.g., some agents control all the resources to the detriment of the other ones). Traditional HAS clients also present fairness issues. The main drawback here is that HAS heuristics are static and uncoordinated. This entails that they are not aware of the presence of other clients, nor can they adapt their behavior to deal with it. Classical TCP rate adaptation algorithms are not effective in this case, since quality selection heuristics partly take over their role, as they decide on the rate to download.

In this paper, we investigate the aforementioned problems arising in a multi-client setting. Particularly, we present a multi-agent Q-Learning-based HAS client able to achieve smooth video playback, while coordinating with other clients in order to improve the fairness of the entire system. This goal is reached with the aid of a coordination proxy, in charge of collecting measurements on the behavior of the entire agent set. This information is then used by the clients to refine their learning process and develop a fair behavior.

The main contributions of this paper are three-fold. First, we present a Q-Learning-based HAS client able to learn the best action to perform depending on network conditions, in order to provide a smoother video streaming with respect to current deterministic HAS heuristics, and to improve system fairness. Second, we design a multi-agent framework to help agents coordinate their behavior, which requires neither explicit agent-to-agent communication nor a centralized decision process. Consequently, the quality level selection can still be performed locally and independently by each client, without any modification to the general HAS principle. Third, detailed simulation results are presented to characterize the gain of the proposed multi-agent HAS client compared to the proprietary Microsoft IIS Smooth Streaming algorithm and a single-agent Q-Learning-based HAS client [14].

The remainder of this article is structured as follows. Section II reports related work on HAS optimization and multi-agent algorithms. Section III formally introduces the fairness problem in a multi-client scenario, while Section IV presents a short overview of the single-agent Q-Learning client. Next, Section V illustrates the proposed multi-agent Q-Learning HAS client, both from an architectural and an algorithmic point of view. In Section VI, we evaluate our HAS client through simulations and show its effectiveness compared to current HAS heuristics. Section VII concludes the paper.

II. RELATED WORK

A. HAS Optimization

Akhshabi et al. provide a good overview of the performance and drawbacks of current HAS heuristics [1]. The authors point out that current HAS heuristics are effective in non-demanding network conditions, but fail when rapid changes occur, leading to drops in the playback buffer or unnecessary quality reductions. Moreover, these solutions carry out the quality level decision in a very conservative way. Furthermore, it is shown that two clients sharing the same bottleneck do not develop a fair behavior. Akhshabi et al. also investigate the main factors influencing fairness in HAS [2]. They report that the mutual synchronization among clients is a relevant cause of unfairness. Particularly, unfair behavior emerges when clients request video segments at different times, since this leads to wrong bandwidth estimations.

Several HAS clients have been proposed to deal with the aforementioned problems. Jarnikov et al. propose a quality decision algorithm based on a Markov Decision Process, which requires an offline training phase [3]. Liu et al. describe a step-wise client, but they do not consider the playback video buffer level in the decision process [4]. De Cicco et al. use a centralized quality decision process exploiting control theory techniques [5]. They have also studied a scenario where two clients share the same bottleneck and shown how their approach can result in a fair behavior, at least from the network point of view. Villa et al. investigate how to improve fairness by randomizing the time interval at which clients request a new segment [6]. A similar approach is also used by Jiang et al. [7]. They study different design aspects that may lead to fairness improvements, including the quality selection interval, a stateful quality level selection and a bandwidth estimation using a harmonic mean instead of a normal one.

In general, all the works available in the literature share some of the following drawbacks. First, the proposed algorithms are fixed and static. This entails that they are not able to modify their behavior online, taking into account their actual performance. Second, fairness is not explicitly considered in the client design, but is just a result of it. Third, they do not evaluate their outcomes from a QoE point of view. This can lead to a quality selection process that optimizes network resource utilization but not the quality perceived by the user. In this work, we incorporate the experience acquired by the client into the quality selection algorithm, using an RL approach. Moreover, we explicitly design our client to improve system fairness, and we exploit a QoE model to evaluate the obtained results directly at the user level.

B. Multi-Agent Algorithms

As far as multi-agent systems are concerned, a good overview of different approaches and challenges is given by Vidal [8]. Multi-agent algorithms can be subdivided into two categories: centralized and distributed. In the centralized case, agents communicate with a central entity in charge of deciding the best action set. The work presented by Bredin et al. belongs to this category [9]. They propose a market-based system for resource allocation, where each agent bids for computational priority to a central server, which finally decides on the resource allocation. This type of solution is not applicable in the HAS case, since each client has to decide autonomously which quality level to request. In a distributed approach, agents can communicate directly with each other. The nature of this communication has a big impact on system performance and on the algorithm's feasibility in a real scenario. Dowling et al. propose a model-based collaborative RL approach, where agents exchange their rewards to optimize routing in ad-hoc networks [10]. In classical fixed or mobile networks, agents may not communicate directly with each other due to the overhead introduced into the network. Moreover, Schaerf et al. show how naive agent-to-agent communication may even reduce system efficiency [11]. Crites et al. study the problem of elevator group control, using distributed Q-Learning agents fed with a common global reward [12]. This approach allows influencing global system performance only, and not that of each single agent separately. In a HAS setting, we are also interested in optimizing the local performance of the clients. Consequently, we use a reward composed of both a local and a global term.

Theoretical investigation of multi-agent RL algorithms mainly concentrates on stochastic games, exploiting game theory techniques. An example is the aforementioned work by Bredin et al., where the central server allocates resources by computing a Nash equilibrium point for the system. These approaches require strong assumptions (e.g., perfect knowledge of the environment) to work properly, or a huge amount of agent-to-agent communication. The multi-agent HAS client presented in this work does not require any explicit communication among agents or any a priori assumptions.

III. FAIRNESS PROBLEM STATEMENT

In this section, we formally define the multi-agent fairness problem addressed in this paper. The problem we want to solve is to reach the highest possible video quality at the clients while keeping the deviation among them as low as possible. The formal problem characterization is given in Definition 1:

Definition 1. Multi-Agent Optimization Problem

$$\max_{\mathbf{q}=(q_1,\dots,q_N)} \; J(\mathbf{q}) = \xi \cdot QualityIndex(\mathbf{q}) + (1-\xi) \cdot FairIndex(\mathbf{q}), \quad \text{with } \xi \in [0,1]$$

subject to

$$1 \le q_i(k) \le q_{max} \qquad \forall i = 1 \dots N,\ \forall k = 1 \dots K$$

$$DT_i^k(\mathbf{q}, Bandwidth) \le BL_i^k \qquad \forall i = 1 \dots N,\ \forall k = 1 \dots K$$

with N being the number of clients, K the number of segments the video content is composed of, q_i(k) the quality level requested by the i-th client for the k-th segment, q_i the vector containing all the quality levels requested by client i and q_max the highest available quality level. DT_i^k represents the download time of the k-th segment, while BL_i^k denotes the video player buffer filling level of client i when the k-th segment download starts. Bandwidth is the vector containing the bandwidth pattern.

The objective function J(q) is the linear combination of two terms. The first one, QualityIndex(q), measures the overall video streaming quality at the client side. The second term, FairIndex(q), represents the fairness of the system. The final formulation of QualityIndex(q) and FairIndex(q) depends on the actual interpretation given to the video quality at the client. From an application-aware point of view, video quality is explicitly associated with the user perceived video quality, or QoE. This way, we are explicitly focusing on achieving fairness from a QoE point of view: clients have to reach a similar perceived video quality. In this case, QualityIndex(q) can be characterized as the average of the clients' QoE values, while fairness can be expressed as the standard deviation from this average. The model used to compute the QoE is explained in Section VI.

On the other hand, if the main issue is network resource optimization, video quality can be associated with the bit rate achieved by the client or, equivalently, with the average quality level requested. With this application-agnostic formulation, we are interested in fairness from a network point of view: the clients' goal is to request the same average quality level, i.e. to equally share the available bandwidth. In light of the above, QualityIndex(q) and FairIndex(q) can be computed as the average and standard deviation of the clients' average requested quality level, respectively.

It is worth noting that both the application-aware and the application-agnostic interpretations are valid and can be used depending on the focus given to the multi-agent optimization problem. In the design of our client, we focus on the application-aware interpretation, since it is directly correlated to the user perceived quality of the video streaming. Nevertheless, the proposed framework can easily be modified to deal with the application-agnostic interpretation of the multi-agent optimization problem.

In light of the above, it is clear why QualityIndex(q) and FairIndex(q) have to be optimized together. If we only consider the maximization of the fairness index, agents could obtain similar but unacceptable video qualities. Instead, our goal is also to reach the highest possible video quality at the clients. Depending on applications and scenarios, ξ can be tuned to favor one of the two terms.

The second constraint of the optimization problem is intended to avoid freezes in the video playback. The download time of the next segment (DT_i^k) has to be lower than the video player buffer filling level when the download starts (BL_i^k). In this way, the video player buffer will never be empty and freezes are avoided. It is worth noting that the download time is not only a function of the quality level requested by client i, but also of the quality levels downloaded simultaneously by the other clients and of the available bandwidth.
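To make the application-aware instantiation of Definition 1 concrete, the following minimal sketch computes J(q) from per-client QoE values. It is not part of the original evaluation framework: the helper names and the sign convention for the fairness term (the negative standard deviation, so that a lower deviation increases the maximized objective) are assumptions of the sketch.

```python
# Minimal sketch of the application-aware objective J(q) of Definition 1.
# Assumption: per_client_qoe holds one QoE (MOS) value per client, and the
# fairness term is the negative standard deviation of those values.
from statistics import mean, pstdev

def objective(per_client_qoe, xi=0.5):
    quality_index = mean(per_client_qoe)      # average QoE over all clients
    fair_index = -pstdev(per_client_qoe)      # lower deviation -> fairer system
    return xi * quality_index + (1 - xi) * fair_index

# A fair allocation scores higher than an unfair one with the same average QoE.
print(objective([3.7, 3.6, 3.8]))
print(objective([4.6, 2.5, 4.0]))
```

The application-agnostic variant would simply feed the clients' average requested quality levels instead of their QoE values.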

IV. SINGLE-AGENT Q-LEARNING ALGORITHM

In this section we provide an overview of the single-agent Q-Learning client proposed by Claeys et al. [14], used as the basis for our multi-agent algorithm presented in Section V. Particularly, the local reward experienced by client i when the k-th segment is downloaded is:

$$r_i(k) = -\,|q_{max} - q_i(k)| \;-\; |q_i(k) - q_i(k-1)| \;-\; |b_{max} - b_i(k)| \quad (1)$$

with q_i(k−1) being the quality level requested at the (k−1)-th step, b_i(k) the video player buffer filling level when the download is completed and b_max the buffer saturation level. Moreover, when b_i(k) is equal to zero, i.e. when a video freeze occurs, r_i(k) is set to −100. The first two terms drive the agent to request the highest possible quality level, while keeping quality switches limited. In fact, these two factors have a big impact on the perceived quality. The last term is used to avoid freezes in the video playout, which also have a big impact on the final QoE.

The action the RL agent can take at each decision step is to select one of the available quality levels. Consequently, each agent has NL possible actions to perform, with NL being the number of available quality levels.

The goal of the learning agent is to select the best quality level depending on two parameters, which compose the agent state space. The first one is the locally perceived bandwidth, while the second is the buffer filling level b_i(k). These two terms are of great importance when deciding on the quality level to download. For example, if the perceived bandwidth is low and the playout buffer is almost empty, the agent will learn to request a low quality segment, in order to respect the bandwidth limitation and avoid a video freeze. Since both the perceived bandwidth and the playout buffer are continuous quantities, they are discretized into NL + 1 and b_max/T_seg intervals, respectively, where T_seg is the segment duration in seconds.
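As a rough illustration of the single-agent building blocks, the sketch below codes the local reward of Eq. 1 and the state discretization just described. The function and argument names are invented here, and the exact bin boundaries used for clamping are an assumption, since the paper only specifies the number of intervals.

```python
# Sketch of the single-agent local reward (Eq. 1) and state discretization.
# Quality levels are integers in [1, q_max]; buffer values are in seconds.
def local_reward(q_k, q_prev, b_k, q_max, b_max):
    if b_k == 0:                 # video freeze: fixed penalty from the text
        return -100
    return -abs(q_max - q_k) - abs(q_k - q_prev) - abs(b_max - b_k)

def discretize_state(bandwidth, bw_max, buffer_sec, b_max_sec, n_levels, t_seg):
    # Perceived bandwidth is split into NL + 1 bins, the playout buffer
    # into b_max / T_seg bins (one bin per stored segment).
    bw_bin = min(int(bandwidth / bw_max * (n_levels + 1)), n_levels)
    buffer_bin = min(int(buffer_sec / t_seg), int(b_max_sec / t_seg) - 1)
    return (bw_bin, buffer_bin)
```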

V. MULTI-AGENT Q-LEARNING ALGORITHM

In this section, we discuss the proposed multi-agent Q-Learning HAS client. First, we give an architectural overview. Next, we present the multi-agent quality selection algorithm.

A. Architectural Overview

A key element of our multi-agent Q-Learning algorithm is an intermediate node, called the coordination proxy, in charge of helping clients achieve fairness. Its operations are to monitor the performance of the system and to return this information to the clients, which use it to learn a fair policy. In light of the above, the actual position of the coordination proxy has to be carefully decided depending on the network scenario. In a real setting, multiple HAS clients may belong to different networks.

Fig. 1. Logical work-flow of the proposed solution.

Depending on the position of the bottlenecks, we can identify two types of network scenarios. When all clients share a common bottleneck, the best option is to use a single coordination proxy in charge of controlling the entire agent set. For example, the coordination proxy can be embedded into the HAS server. In a second and more complex scenario, a multitude of bottlenecks may be simultaneously present. In this case, we need a hierarchy of coordination proxies, exchanging information to coordinate the agents' behavior both locally and globally. In this paper we focus on the first network scenario, while we propose to investigate the more complex one in future work.

The logical work-flow of our multi-agent algorithm is shown in Fig. 1. First, each client i requests from the HAS server the next segment to download, at a certain quality. Based on this information, the coordination proxy estimates the reward the agents will experience in the future. This data is then aggregated into a global signal, representing the status of the entire set of agents. This global signal is then returned to the agents, which use this information to refine their learning process. In particular, the global signal informs an agent about the difference between its performance and that of the entire system. In this way, the agents can learn how to modify their behavior to achieve similar performance, i.e. fairness. In light of the above, the reward estimation and the global signal computation have to be simple enough to avoid overloading the coordination proxy and to maintain scalability.

The main advantage of this hybrid approach is two-fold. First, no communication is needed among clients and consequently no significant overhead is introduced. Second, the HAS architecture is not altered. The coordination proxy only has to collect and aggregate the agents' rewards and is not involved in any decision process. Furthermore, the global signal broadcast can be performed by reusing the existing communication channels between the HAS server and the clients.

B. Algorithmic Details

As shown in Fig. 1, our multi-agent algorithm is subdivided into four steps, which are detailed below. We assume here that client i has executed the (k−1)-th step and is waiting for the execution of the k-th step.

1) Reward Estimation: The coordination proxy can compute an estimate of the reward r_i(k) the client will experience at the k-th step, exploiting the information sent when requesting a new segment at a certain quality. In particular, for each client i, the coordination proxy can compute the following:

$$r_i^f(k) = -\,|q_{max} - q_i(k)| \;-\; |q_i(k) - q_i(k-1)| \quad (2)$$

r_i^f(k) represents an estimate of the local reward the i-th agent will experience at the next, k-th, step. We refer here to an estimate since the buffer filling level term is missing, as can be seen by comparing r_i^f(k) with the reward r_i(k) shown in Eq. 1. The video player buffer filling level b_i(k) is not accessible to the coordination proxy, as it is computed by the client only when the new segment is received, i.e. when the k-th step is actually executed.

2) Global Signal Computation and Broadcast: After having computed the reward estimate for each client, the coordination proxy aggregates them into a global signal. This value represents the global status of the entire system and helps the agents achieve fairness. The global signal formulation has been chosen equal to that of the QualityIndex(q) introduced in Section III, i.e. an average:

$$gs(k) = \frac{1}{N} \sum_{i=1}^{N} r_i^f(k) \quad (3)$$

Considering that clients are not synchronized, i.e. they request segments at different moments, gs(k) has to be continuously updated by the coordination proxy whenever a client requests a new segment. The global signal can then be added as an HTTP header field and returned to the agents when they download the next segment to play.
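The proxy-side bookkeeping of Eq. 2 and Eq. 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation; the class and method names (CoordinationProxy, on_segment_request) are invented here, and treating the first request of a client as switch-free is an assumption of the sketch.

```python
# Sketch of the coordination proxy: on each segment request it re-estimates
# the client's reward (Eq. 2) and refreshes the running average used as the
# global signal (Eq. 3).
class CoordinationProxy:
    def __init__(self):
        self.estimates = {}        # client id -> latest r^f_i(k)
        self.last_quality = {}     # client id -> previously requested level

    def on_segment_request(self, client_id, quality, q_max):
        # Assumption: on a client's first request, no switching penalty.
        q_prev = self.last_quality.get(client_id, quality)
        # Eq. 2: the buffer term is omitted, it is not visible to the proxy.
        self.estimates[client_id] = -abs(q_max - quality) - abs(quality - q_prev)
        self.last_quality[client_id] = quality
        return self.global_signal()

    def global_signal(self):
        # Eq. 3: average of the latest per-client estimates, refreshed on
        # every request since clients are not synchronized.
        return sum(self.estimates.values()) / len(self.estimates)
```

The value returned by on_segment_request would then be piggybacked on the HTTP response to the requesting client, e.g. as a header field, as described above.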

3) Learning Process at the Client: In a HAS multi-client scenario, there is a secondary goal with respect to the single-client one. Besides reaching the best possible video streaming quality at the client side, fairness among clients has to be obtained. In order to achieve these objectives, we modify the reward function shown in Eq. 1 by adding a Homo Egualis-like reward term [15], which follows the theory of inequity aversion. This theory states that agents are willing to reduce their own reward in order to increase that of under-performing agents. The general formulation is given in Eq. 4:

$$r_i^{he} = r_i \;-\; \alpha \sum_{r_i > r_j} \frac{r_i - r_j}{N-1} \;-\; \beta \sum_{r_j > r_i} \frac{r_j - r_i}{N-1} \quad (4)$$

The total reward r_i^{he} is composed of a first term r_i, the local reward an agent experiences while interacting with its environment. The other two terms take into account the performance of the other agents. Each agent experiences a punishment when others have a higher reward, as well as when they have a lower reward. r_i^{he} reaches its maximum when r_i = r_j for each j, i.e. when the agents show a fair behavior.

The Homo Egualis reward shown in Eq. 4 is not directly applicable to the HAS case, since it requires a direct reward communication among the agents. For this reason, we use the global signal gs(k), which has been designed to represent the overall performance of the system. The Homo Egualis reward consequently becomes as in Eq. 5:

$$r_i^{he}(k) = r_i(k) \;-\; \alpha \max\!\big(r_i^f(k) - gs(k),\, 0\big) \;-\; \beta \max\!\big(gs(k) - r_i^f(k),\, 0\big) \quad (5)$$

with r_i(k) being the local reward reported in Eq. 1.

TABLE I. AGENT STATE SPACE

State Element | Range | Elements
Buffer Filling | [0; b_max] sec | b_max / T_seg
Bandwidth | [0; BW_max] bps | NL + 1
gs(k) | [2 × (1 − q_max); 0] | 3
gs(k) − r_i^f(k) | [2 × (1 − q_max); 2 × (q_max − 1)] | 3

In accordance with the computation of the global signal (see Eq. 2 and 3), the punishment term in Eq. 5 is computed using only the quality and the switching reward terms. It can be noted that in this formulation the coordination proxy acts as a macro-agent representing the behavior of the entire system. The reward reaches its maximum when r_i^f(k) = gs(k), i.e. when the behavior of the agent matches that of the macro-agent. When the reward is far from the global signal, the punishment term operates to modify the agent's policy. This way, the agents' rewards will converge to a similar value, i.e. similar performance is achieved and, consequently, fairness. We fix the values of α and β in Eq. 5 to 1.5 to strengthen the punishment term and give higher priority to the fairness goal in the learning process of the agents.

In order to enforce the learning process, we also add an element to the agent state, in addition to the perceived bandwidth and the video player buffer filling level (see Section IV). In particular, we explore two possible configurations: (i) using the global signal gs(k), or (ii) using the difference between the global signal gs(k) and the reward r_i^f(k). This way, the agent can also consider the overall system behavior when requesting a new segment. The two state space configurations give the agent slightly different knowledge. In the first case, the agent directly considers the entire system behavior. For example, if gs(k) is close to zero, i.e. the overall system performance is good, the agent will learn to select a quality level that maintains this condition. In the second case, the absolute value of gs(k) is not relevant because the agent considers its deviation from it. We discretize gs(k) into three intervals, to represent the conditions when the agent set is performing badly (gs(k) ≈ 2 × (1 − q_max)), normally or well (gs(k) ≈ 0). The value gs(k) − r_i^f(k) has also been discretized into three intervals. In this case, the three intervals represent the situations when the agent behavior is in line with that of the entire set (gs(k) − r_i^f(k) ≈ 0), is under-performing (gs(k) − r_i^f(k) ≫ 0) or is over-performing (gs(k) − r_i^f(k) ≪ 0). The complete state space is given in Table I, with T_seg the segment length in seconds, BW_max the maximum bandwidth in bps and NL the number of available quality levels. From now on, we will refer to the two state space configurations presented above as the absolute global signal state space configuration (third row of Table I) and the relative global signal state space configuration (fourth row of Table I), respectively. The RL algorithm embedded into the clients is the well-known Q-Learning algorithm [13].
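Putting Eq. 5 together with a standard tabular Q-Learning update, a client-side learning step could look roughly like the sketch below. The table layout and names are illustrative; α = β = 1.5 follows the text, while the learning rate and discount factor shown are the values the tuning in Section VI-C eventually settles on (0.1 and 0.2).

```python
# Sketch of a client-side learning step with the modified reward of Eq. 5.
from collections import defaultdict

ALPHA_HE = BETA_HE = 1.5     # Homo Egualis weights fixed in the text
LEARNING_RATE = 0.1
DISCOUNT = 0.2

Q = defaultdict(float)       # maps (state, action) -> estimated value

def homo_egualis_reward(r_local, r_f, gs):
    return (r_local
            - ALPHA_HE * max(r_f - gs, 0.0)   # punished when over-performing
            - BETA_HE * max(gs - r_f, 0.0))   # punished when under-performing

def q_update(state, action, next_state, r_local, r_f, gs, n_actions):
    reward = homo_egualis_reward(r_local, r_f, gs)
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    Q[(state, action)] += LEARNING_RATE * (
        reward + DISCOUNT * best_next - Q[(state, action)])
```

Here the state tuple would include the discretized bandwidth, buffer level and global signal element listed in Table I, and the action index is the requested quality level.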

VI. PERFORMANCE EVALUATION

A. Experimental Setup

An NS-3-based simulation framework [16], [17] has been used to evaluate our multi-agent HAS client. The simulated network topology is shown in Fig. 2; the actual capacity CL depends on the number of clients and is equal to 2.5 × N Mbps, with N being the number of clients. The video trace streamed is Big Buck Bunny, composed of 299 segments, each 2 seconds long and encoded at 7 different quality levels (see Table II for details).

Fig. 2. Simulated network topology. The coordination proxy has been embedded into the HAS Server.

TABLE II. VIDEO TRACE QUALITY LEVELS

Quality Level | Bit Rate
1 | 300 Kbps
2 | 427 Kbps
3 | 608 Kbps
4 | 806 Kbps
5 | 1233 Kbps
6 | 1636 Kbps
7 | 2436 Kbps

The buffer saturation level for each client is equal to 5 segments, or 10 seconds. In order to give the RL algorithm enough time to learn, we simulate 800 episodes of the video trace. During each episode, the same variable bandwidth pattern on link A is used, varying every 250 msec and scaled with respect to the number of clients. The bandwidth model is obtained using a cross-traffic generator, introducing traffic ranging from 0 Kbps to 2380 × N Kbps into the network. In this way, we obtain an available bandwidth ranging from 120 × N Kbps to 2500 × N Kbps.

As far as the coordination proxy is concerned, it has been embedded into the HAS server. The network topology shown in Fig. 2 represents the situation where many clients share a common bottleneck. In this case, one coordination proxy is needed and its functions can easily be carried out by the HAS server.

B. QoE Model

As stated in the previous sections, an important aspect to consider when evaluating the performance of a video streaming client is the final quality perceived by the user enjoying the service, the so-called QoE. Consequently, we need to define a model to correlate client performance with user perceived quality. We use a metric in the same range as the Mean Opinion Score (MOS), which can be computed as in Eq. 6 [14], [18]:

$$QoE_i(t, t+T) = MOS_i(t, t+T) = 0.81 \cdot \bar{q}_i(t, t+T) - 0.96 \cdot \sigma_{q_i}(t, t+T) + 0.17 - 4.95 \cdot F_i(t, t+T) \quad (6)$$

The QoE experienced by client i over the time window [t, t+T] is a linear combination of the average quality level requested, q̄_i(t, t+T), its standard deviation, σ_{q_i}(t, t+T), and F_i(t, t+T), which models the influence of freezes and is computed as follows:

$$F_i(t, t+T) = \frac{7}{8}\left(\frac{\ln\!\big(f_i^{freq}(t, t+T)\big)}{6} + 1\right) + \frac{1}{8}\left(\frac{\min\!\big(f_i^{avg}(t, t+T),\, 15\big)}{15}\right) \quad (7)$$

f_i^{freq}(t, t+T) and f_i^{avg}(t, t+T) are the freeze frequency and the average freeze duration, respectively. All the coefficients reported in Eq. 6-7 have been tuned considering the works by Claeys et al. [14] and De Vriendt et al. [18].
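For reference, the QoE model of Eq. 6-7 can be coded as in the sketch below. The coefficients are the ones reported above; the function and argument names are illustrative, and the guard for the zero-freeze case is an assumption added here because ln(0) is undefined and the text does not spell out how that case is handled.

```python
# Sketch of the QoE/MOS model of Eq. 6-7.
import math

def freeze_impact(freeze_freq, avg_freeze_dur):
    # Eq. 7: frequency term dominates (weight 7/8); duration is capped at 15 s.
    # Assumption: no freezes means no frequency contribution.
    freq_term = 0.0 if freeze_freq == 0 else math.log(freeze_freq) / 6 + 1
    return 7 / 8 * freq_term + 1 / 8 * (min(avg_freeze_dur, 15) / 15)

def mos(avg_quality, std_quality, freeze_freq, avg_freeze_dur):
    # Eq. 6: reward high and stable quality, heavily penalize freezes.
    return (0.81 * avg_quality - 0.96 * std_quality + 0.17
            - 4.95 * freeze_impact(freeze_freq, avg_freeze_dur))
```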

C. Optimal Parameters Configuration

The Q-Learning approach, a widely employed Reinforcement Learning technique [13], has been used in the HAS client rate adaptation algorithm. The parameters characterizing it are the discount factor, which weighs the relevance of future rewards in the learning process, and the learning rate, which weighs the relevance of newly acquired experience with respect to past experience. Additionally, the exploration policy, which selects the action to take at each decision step, has to be carefully selected in order to balance the well-known trade-off in Reinforcement Learning algorithms between exploration and exploitation [13].

In order to properly tune our multi-agent HAS client and select the best configuration, an exhaustive evaluation of the parameter space has been performed. Particularly, we selected two exploration policies, Softmax [13] and VDBESoftmax [19], which are well-established exploration policies for Reinforcement Learning algorithms and allow a good balance between exploration and exploitation. An epsilon-greedy policy was also evaluated, but its results are omitted because of its poor performance. We also investigated the influence of the discount factor γ of the Q-Learning algorithm, the inverse temperature τ of the Softmax and VDBESoftmax policies and the σ value of the VDBESoftmax policy. These parameters are of interest, since they influence the behavior of an RL agent. The importance of the discount factor γ has been pointed out at the beginning of this section. The inverse temperature τ influences the action selection process: a low value entails that all the actions have a similar probability of being selected (a minimal Softmax selection sketch is given at the end of this subsection). σ, the inverse sensitivity of the VDBESoftmax policy, controls agent exploration. Low values facilitate exploration, while high values cause the agent to select a greedy action more often.

We consider five different γ values (0.05, 0.1, 0.15, 0.2, 0.25), three different τ values (0.2, 0.5, 0.7) and three σ values (1, 50, 100). Preliminary simulations showed that 0.1 is the best choice for the learning rate of the Q-Learning algorithm. For this reason, we kept it fixed for all simulations. We repeated the exhaustive evaluation of the parameter space for the two possible state space configurations reported in Section V-B and for scenarios with 4, 7 and 10 clients, leading to 360 different configurations overall. The outcome of this analysis is shown in Fig. 3-5. The main goal of this investigation is to find a parameter combination for our multi-agent client able to perform well even when conditions change (e.g., the number of clients). In this section, the performance evaluation is conducted considering the application-aware interpretation of the multi-agent optimization problem presented in Section III.

Fig. 3 and 4 investigate the influence of the discount factor γ on the performance of the multi-agent client, for both state space configurations, in a scenario with 10 clients streaming video. The x-axis reports the discount factor γ, while the y-axis reports the MOS. In the top graph, each point represents the average MOS of the entire agent set, computed during the last iteration, i.e. over the last 10 minutes of the video trace. In the bottom graph, each point represents the MOS standard deviation of the entire agent set, computed during the last iteration. We selected the four out of sixty policies achieving the highest average MOS with the lowest MOS standard deviation. From now on, we will refer to the Softmax and VDBESoftmax policies with the abbreviations SMAX and VDBE, respectively.

For the absolute global signal state space configuration (Fig. 3), the VDBE policy with τ = 0.5, σ = 1 and γ = 0.2 leads to the best result overall, with an average MOS of 3.75 and a standard deviation of 0.17. Moreover, this same policy appears to be robust also when the discount factor changes, leading to good results with γ = 0.05, 0.1. Another eligible configuration is VDBE τ = 0.7, σ = 1, γ = 0.15, which results in the second best outcome, with an average MOS of 3.69 and a standard deviation of 0.19. The same analysis, for the relative global signal state space configuration, is shown in Fig. 4. The best global result is reached with VDBE τ = 0.7, σ = 1 and γ = 0.1, resulting in an average MOS of 3.71 and a standard deviation of 0.14. The SMAX τ = 0.7 policy shows a low sensitivity when the discount factor changes, for both the average MOS and its standard deviation, and reaches its best outcome for γ = 0.2. Table III and IV summarize the outcome of the exhaustive parameter evaluation. We repeated the same analysis shown in Fig. 3-4 also for the 4 and 7 clients scenarios, obtaining a total of eight eligible configurations.

In light of the above, two preliminary conclusions can be drawn. First, the VDBE policy is in general the best choice. This is because VDBE tends to a greedy selection policy when the learning process converges. Second, higher values of τ and lower values of σ are preferable. In the first case, high τ values cause the agent to select the actions with the highest expected reward. To balance this aspect, low σ values allow more exploration during the learning phase.

A fundamental characteristic of the multi-agent client is that it should perform well independently of the number of clients in the system. In order to select the best configuration from this point of view, we evaluate the performance of every eligible combination reported in Table III-IV for 4, 7 and 10 clients. We then select the two best performing configurations across all client counts, for each state space configuration. Fig. 5 shows the influence of the number of clients on the selected configurations. All configurations perform similarly when considering the average MOS. A bigger variability can instead be noticed for the MOS standard deviation. In light of the above, the VDBE τ = 0.5, σ = 1, γ = 0.2, absolute global signal state space configuration (i.e., the third row of Table I) has finally been chosen. As can be seen from Fig. 5, it guarantees the best results from the average MOS point of view and a very low standard deviation for 4, 7 and 10 clients.
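As a reference for the exploration policies discussed above, the following is a minimal sketch of Softmax action selection with inverse temperature τ. It is illustrative only, with invented names, and it omits the value-difference-based adaptation that VDBESoftmax [19] layers on top of the basic Softmax rule.

```python
# Sketch of Softmax action selection with inverse temperature tau.
# A low tau makes all actions almost equally likely; a high tau concentrates
# probability on the actions with the highest Q-values.
import math
import random

def softmax_action(q_values, tau):
    # Subtract the maximum Q-value before exponentiating for numerical stability.
    q_max = max(q_values)
    prefs = [math.exp(tau * (q - q_max)) for q in q_values]
    threshold = random.random() * sum(prefs)
    acc = 0.0
    for action, p in enumerate(prefs):
        acc += p
        if threshold <= acc:
            return action
    return len(q_values) - 1
```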

Fig. 3. Influence of the discount factor γ on average MOS (top) and its standard deviation (bottom) for the absolute global signal state space configuration and 10 clients streaming video. Policies shown: SMAX τ=0.7, VDBE τ=0.5 σ=1, VDBE τ=0.7 σ=1, VDBE τ=0.2 σ=50.

Fig. 4. Influence of the discount factor γ on average MOS (top) and its standard deviation (bottom) for the relative global signal state space configuration and 10 clients streaming video. Policies shown: SMAX τ=0.7, VDBE τ=0.5 σ=1, VDBE τ=0.7 σ=1, VDBE τ=0.2 σ=50.

D. Gain Achieved by the Algorithm

In this section, we investigate the performance of the proposed multi-agent HAS client, in comparison with both the single-agent HAS client studied by Claeys et al. [14] and a traditional HAS client, the Microsoft IIS Smooth Streaming client (MSS)¹. In particular, we show the results for the 7 and 10 clients scenarios, both from a QoE and a network point of view.

¹Original source code available from: https://slextensions.svn.codeplex.com/svn/trunk/SLExtensions/AdaptiveStreaming

Fig. 5. Influence of the number of clients on average MOS (top) and its standard deviation (bottom). Configurations shown: SMAX τ=0.7 γ=0.25 and VDBE τ=0.5 σ=1 γ=0.2 (first configuration); SMAX τ=0.7 γ=0.2 and VDBE τ=0.5 σ=1 γ=0.25 (second configuration).

TABLE III. RESULTS OF THE PARAMETER SPACE EVALUATION. ABSOLUTE GLOBAL SIGNAL STATE SPACE CONFIGURATION

Clients Number | Eligible Combinations
4 | Softmax τ = 0.7, γ = 0.25; VDBESoftmax τ = 0.5, σ = 1, γ = 0.25
7 | VDBESoftmax τ = 0.5, σ = 1, γ = 0.2; VDBESoftmax τ = 0.7, σ = 1, γ = 0.15
10 | VDBESoftmax τ = 0.5, σ = 1, γ = 0.2; VDBESoftmax τ = 0.7, σ = 1, γ = 0.15

An exhaustive evaluation of the parameter space has also been carried out for the single-agent HAS client, and a VDBE policy with τ = 0.7, σ = 1, γ = 0.1 has been selected. For the multi-agent client, we consider the parameter configuration resulting from the analysis presented above. Also in this case, all the metrics are computed considering the last of the 800 iterations.

Fig. 6 shows the results obtained when analysing the clients' performance according to the application-aware interpretation of the multi-agent optimization problem introduced in Section III. Each bar represents the average MOS of the entire agent set, together with its standard deviation. The MSS client presents a very high standard deviation, both for the 7 and the 10 clients scenario. This entails that there is a big difference among the video quality perceived by different clients, i.e. unfairness. Remarkable improvements can be noticed when using an RL approach. The single-agent RL client is able to considerably reduce the MOS standard deviation, by 80% and 20% in the 7 and 10 clients case, respectively. This is a very good result, considering that, in this case, there is no coordination mechanism. Nevertheless, the lack of coordination affects the average MOS, which is similar to that reached by the MSS client. The multi-agent RL client is instead able to improve the average MOS by 11% for 7 clients and by 20% for 10 clients with respect to MSS. Moreover, a very fair behavior is obtained for the 7 clients scenario. For the 10 clients case, the standard deviation is 48% lower than with the single-agent solution and 60% lower than with MSS.

TABLE IV. RESULTS OF THE PARAMETER SPACE EVALUATION. RELATIVE GLOBAL SIGNAL STATE SPACE CONFIGURATION

Clients Number | Eligible Combinations
4 | VDBESoftmax τ = 0.5, σ = 1, γ = 0.25; VDBESoftmax τ = 0.2, σ = 50, γ = 0.15
7 | VDBESoftmax τ = 0.5, σ = 1, γ = 0.25; VDBESoftmax τ = 0.5, σ = 50, γ = 0.25
10 | Softmax τ = 0.7, γ = 0.2; VDBESoftmax τ = 0.7, σ = 1, γ = 0.1

Fig. 6. Comparison between the different clients, from a QoE perspective. The proposed multi-agent client outperforms both the MSS client and the single-agent Q-Learning-based one.

In Fig. 7, the network analysis is depicted. In this case, we evaluate the clients' performance considering the application-agnostic interpretation of the optimization problem in Section III. The graph reports the average and standard deviation of the clients' average requested quality level. Also in this case, the MSS deviation is very high: this means the agents do not fairly share the network resources. The multi-agent client is able to considerably reduce the deviation of the average quality level requested by the clients, both with respect to MSS and to the single-agent client. The situation arising in the 7 clients scenario is of interest. In this case, the single-agent client performance is close to that of the multi-agent one. If we recall the results shown in Fig. 6 for the 7 clients scenario, we see that there is a bigger difference between the average MOS of the two clients. This entails that in this case, exploited resources being equal, the multi-agent client results in a better overall perceived video quality, i.e. it is more efficient.

VII. CONCLUSIONS

In this paper, we presented a multi-agent Q-Learning-based HAS client, able to learn and dynamically adapt its behavior depending on network conditions, in order to obtain a high QoE at the client. Moreover, this client is able to coordinate with other clients in order to achieve fairness, both from the QoE and the network point of view. This was necessary as both traditional and earlier proposed RL-based approaches introduce non-negligible differences in the obtained quality among clients. Fairness is achieved by means of an intermediate node, called the coordination proxy, in charge of collecting information on the overall performance of the system. This information is then provided to the clients, which use it to enforce their learning process. Numerical simulations using NS-3 have validated the effectiveness of the proposed approach.

Fig. 7. Comparison between the different clients, from a network perspective. The proposed multi-agent client is able to improve fairness and increase the average requested quality level, both in the 7 and 10 clients scenario.

Particularly, we have compared our multi-agent HAS client with the Microsoft IIS Smooth Streaming client and with a single-agent Q-Learning-based HAS client. In the evaluated bandwidth scenario, we were able to show that our multi-agent HAS client resulted in a better video quality and in a remarkable improvement of fairness, up to 60% and 48% in the 10 clients case, compared to MSS and the Q-Learning-based client, respectively.

ACKNOWLEDGMENT

The research was performed partially within the iMinds MISTRAL project (under grant agreement no. 10838). This work was partly funded by Flamingo, a Network of Excellence project (ICT-318488) supported by the European Commission under its Seventh Framework Programme. Maxim Claeys is funded by a grant of the Agency for Innovation by Science and Technology in Flanders (IWT).

REFERENCES

[1] S. Akhshabi, S. Narayanaswamy, A. C. Begen and C. Dovrolis, An experimental evaluation of rate-adaptive video players over HTTP. Signal Processing: Image Communication, Volume 27, Issue 4, pp. 271-287, April 2012.

[2] S. Akhshabi, L. Anantakrishnan, A. C. Begen and C. Dovrolis, What happens when HTTP adaptive streaming players compete for bandwidth?. Proceedings of the 22nd International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV '12), 2012.

[3] D. Jarnikov and T. Ozcelebi, Client intelligence for adaptive streaming solutions. Signal Processing: Image Communication, Volume 26, Issue 7, pp. 378-389, August 2011.

[4] C. Liu, I. Bouazizi and M. Gabbouj, Rate adaptation for adaptive HTTP streaming. Proceedings of the Second Annual ACM Conference on Multimedia Systems (MMSys '11), 2011.

[5] L. De Cicco, S. Mascolo and V. Palmisano, Feedback control for adaptive live video streaming. Proceedings of the Second Annual ACM Conference on Multimedia Systems (MMSys '11), 2011.

[6] B. J. Villa and P. H. Heegaard, Improving perceived fairness and QoE for adaptive video streams. Eighth International Conference on Networking and Services, 2012.

[7] J. Jiang, V. Sekar and H. Zhang, Improving fairness, efficiency, and stability in HTTP-based adaptive video streaming with FESTIVE. Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies (CoNEXT '12), 2012.

[8] J. M. Vidal, Fundamentals of Multiagent Systems. Online: http://multiagent.com/p/fundamentals-of-multiagent-systems.html, Last accessed: September 2013.

[9] J. Bredin, R. T. Maheswaran, C. Imer, D. Kotz and D. Rus, A game-theoretic formulation of multi-agent resource allocation. Proceedings of the Fourth International Conference on Autonomous Agents, 2000.

[10] J. Dowling, E. Curran, R. Cunningham and V. Cahill, Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, pp. 360-372, 2005.

[11] A. Schaerf, Y. Shoham and M. Tennenholtz, Adaptive load balancing: a study in multi-agent learning. Journal of Artificial Intelligence Research, pp. 475-500, May 1995.

[12] R. H. Crites and A. G. Barto, Elevator group control using multiple reinforcement learning agents. Machine Learning, pp. 235-262, 1998.

[13] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, March 1998.

[14] M. Claeys, S. Latré, J. Famaey, T. Wu, W. Van Leekwijck and F. De Turck, Design of a Q-learning-based client quality selection algorithm for HTTP adaptive video streaming. Proceedings of the Adaptive and Learning Agents Workshop, part of AAMAS 2013, May 2013.

[15] S. de Jong, K. Tuyls and K. Verbeeck, Artificial agents learning human fairness. Proceedings of the 7th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), pp. 863-870, May 2008.

[16] ns-3, The Network Simulator ns-3. Online: http://www.nsnam.org, Last accessed: September 2013.

[17] N. Bouten, J. Famaey, S. Latré, R. Huysegems, B. De Vleeschauwer, W. Van Leekwijck and F. De Turck, QoE optimization through in-network quality adaptation for HTTP Adaptive Streaming. Eighth International Conference on Network and Service Management (CNSM), 2012.

[18] J. De Vriendt, D. De Vleeschauwer and D. Robinson, Model for estimating QoE of video delivered using HTTP adaptive streaming. 1st IFIP/IEEE Workshop on QoE Centric Management (IM 2013), May 2013.

[19] M. Tokic and G. Palm, Value-difference based exploration: adaptive control between Epsilon-greedy and Softmax. KI 2011: Advances in Artificial Intelligence, pp. 335-346, 2011.