Sequential Selection of Correlated Ads by POMDPs
Shuai Yuan, Jun Wang
University College London
October 29, 2012
Motivations and contributions

Motivations:
• help publishers gain more profit by displaying ads;
• go further than offline, content-based matching of webpages and ads.

Contributions:
• a framework of ad selection for revenue optimisation;
• formulating the sequential selection problem as a Partially Observable Markov Decision Process (POMDP) and providing exact and approximate solutions;
• a public keyword-bid-ad-webpage dataset for reproducible research¹.

¹http://www.computational-advertising.org
Related works

Contextual advertising:
• A semantic approach to contextual advertising [Broder 2007]
• Impedance coupling in content-targeted advertising [Ribeiro 2005]
• Contextual advertising by combining relevance with click feedback [Chakrabarti 2008]

Inventory management (contracts):
• Targeted advertising on the Web with inventory management [Chickering 2003]
• Revenue management for online advertising: Impatient advertisers [Fridgeirsdottir 2007]
• Dynamic revenue management for online display advertising [Roels 2009]

Optimal pricing models:
• Pricing of Online Advertising: Cost-Per-Click-Through Vs. Cost-Per-Action [Hu 2010]
• Online advertising: Pay-per-view versus pay-per-click [Mangani 2004]
• Online advertising: Pay-per-view versus pay-per-click: A comment [Fjell 2009]
• Single period balancing of pay-per-click and pay-per-view online display advertisements [Kwon 2011]

Related works (cont.)

Ad scheduling:
• Scheduling advertisements on a web page to maximize revenue [Kumar 2006]
• Scheduling of dynamic in-game advertising [Turner 2011]

Multi-armed bandits:
• Using confidence bounds for exploitation-exploration trade-offs [Auer 2003]
• Multi-armed bandit problems with dependent arms [Pandey 2007]

POMDPs:
• A survey of POMDP applications [Cassandra 1998]
• Monte Carlo POMDPs [Thrun 2000]
• Perseus: Randomized point-based value iteration for POMDPs [Spaan 2005]
Problem statement - setup
Figure: 1 webpage, 1 ad slot, M impressions at each time step. The payoff of the ads follows X ∼ N(µ, I · σ₀²); µ is generated by µ ∼ N(θ, Σ).
Problem statement - graphical model
Figure: The payoff model illustrated by an influence-diagram representation with the generative processes of a finite-horizon POMDP. s(t) is the selection action; (θ(t), Σ(t)) is the belief at each stage.
Problem statement - objective function

To maximise the expected cumulative payoff over time,

\[
\pi^* = \arg\max_{\pi} \mathbb{E}\big[R_{\pi}(T)\big]
= \arg\max_{\pi} \mathbb{E}\Big[\sum_{t=1}^{T} X_{s(t)}(t)\Big]
= \arg\max_{\pi} \sum_{t=1}^{T} \mathbb{E}\big[X_{s(t)}(t)\big]
= \arg\max_{\pi} \sum_{t=1}^{T} \int_x x_{s(t)}(t)\, p\big(x_{s(t)}(t) \mid \Psi(t)\big)\, dx
= \arg\max_{\pi} \sum_{t=1}^{T} \theta_{s(t)}(t) \tag{1}
\]

where,
• s(t) is the selection decision;
• Ψ(t) is the available information;
• π is a selection policy and π* is the optimal one;
• the constant factor "M impressions" is dropped from the objective function.
Belief update
Figure: Updating belief on ads' performance over time (t = 1, t = 2, ...).
Belief update - the selected ad

We update the belief using Bayes' theorem,

\[
p\big(x_1 \mid x_1(t), \Psi(t)\big) = \int p\big(x_1 \mid x_1(t), \Psi(t), \mu_1\big)\, p\big(\mu_1 \mid x_1(t), \Psi(t)\big)\, d\mu_1 \tag{2}
\]

by "completing the squares",

\[
p\big(\mu_1 \mid x_1(t), \Psi(t)\big) \propto p\big(x_1(t) \mid \mu_1, \Psi(t)\big)\, p\big(\mu_1 \mid \Psi(t)\big)
\propto \exp\big\{-\big(x_1(t) - \mu_1\big)^2 - \big(\mu_1 - \theta_1(t)\big)^2\big\} \tag{3}
\]

we obtain the new belief,

\[
\mu_1 \mid x_1(t) \sim \mathcal{N}\big(\theta_1(t+1),\ \sigma_1^2(t+1)\big) \tag{4}
\]

\[
\theta_1(t+1) = \frac{\sigma_1^2(t)\, x_1(t) + \sigma_0^2\, \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2}, \qquad
\sigma_1^2(t+1) = \frac{\sigma_1^2(t)\, \sigma_0^2}{\sigma_1^2(t) + \sigma_0^2} \tag{5}
\]

where we write θᵢ(t) and σᵢ²(t) as shorthand for θᵢ | Ψ(t) and σᵢ² | Ψ(t).
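Equation (5) is the standard conjugate-Gaussian posterior update. A minimal sketch in Python (the function and variable names are our own, not from the slides):

```python
def update_selected(theta, var, x, var0):
    # Conjugate-Gaussian posterior for the selected ad's mean payoff (Eq. 5):
    # (theta, var) describe the current belief N(theta, var) on mu_1,
    # x is the observed payoff x_1(t), var0 is the noise variance sigma_0^2.
    denom = var + var0
    theta_new = (var * x + var0 * theta) / denom   # precision-weighted mean
    var_new = var * var0 / denom                   # variance always shrinks
    return theta_new, var_new

theta1, var1 = update_selected(theta=10.0, var=4.0, x=14.0, var0=4.0)
# with equal belief and noise variances, the mean moves halfway toward the
# observation: theta1 == 12.0, var1 == 2.0
```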
Belief update - the correlated ad

We also update the belief of the non-selected ads,

\[
p\big(x_2 \mid x_1(t), \Psi(t)\big) = \int p\big(x_2 \mid \mu_2, x_1(t), \Psi(t)\big)\, p\big(\mu_2 \mid x_1(t), \Psi(t)\big)\, d\mu_2 \tag{6}
\]

with the linear Gaussian property,

\[
\mu_1 \mid \mu_2 \sim \mathcal{N}\big(\theta_{1|\mu_2},\ \sigma^2_{1|\mu_2}\big) \tag{7}
\]

\[
\theta_{1|\mu_2} = \theta_1 + \frac{\sigma_{1,2}}{\sigma_2^2}\big(\mu_2 - \theta_2\big), \qquad
\sigma^2_{1|\mu_2} = \sigma_1^2 - \frac{\sigma_{1,2}^2}{\sigma_2^2} \tag{8}
\]

we obtain the new belief on a correlated ad,

\[
\mu_2 \mid x_1(t) \sim \mathcal{N}\big(\theta_2(t+1),\ \sigma_2^2(t+1)\big) \tag{9}
\]

\[
\theta_2(t+1) = \theta_2(t) + \sigma_{1,2}\, \frac{x_1(t) - \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2}, \qquad
\sigma_2^2(t+1) = \sigma_2^2(t) - \frac{\sigma_{1,2}^2}{\sigma_1^2(t) + \sigma_0^2} \tag{10}
\]
Belief update - expected payoff

We also obtain the expected payoff of the selected ad,

\[
X_1 \mid x_1(t), \Psi(t) \sim \mathcal{N}\big(\theta_1(t+1),\ \sigma_0^2 + \sigma_1^2(t+1)\big) \tag{11}
\]

and the expected payoff of the correlated ad,

\[
X_2 \mid x_1(t), \Psi(t) \sim \mathcal{N}\big(\theta_2(t+1),\ \sigma_0^2 + \sigma_2^2(t+1)\big) \tag{12}
\]

The final objective function is,

\[
\pi^* = \arg\max_{\pi} \sum_{t=1}^{T} \theta_{s(t)}(t) \quad \text{subject to} \tag{13}
\]

\[
\theta_{s(t+1)}(t+1) = \theta_{s(t+1)}(t) + \sigma_{s(t),s(t+1)}\, \frac{x_{s(t)}(t) - \theta_{s(t)}(t)}{\sigma_{s(t)}^2(t) + \sigma_0^2} \tag{14}
\]

\[
\sigma_{s(t+1)}^2(t+1) = \sigma_{s(t+1)}^2(t) - \frac{\sigma_{s(t),s(t+1)}^2}{\sigma_{s(t)}^2(t) + \sigma_0^2} \tag{15}
\]
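The constraint updates (14) and (15) amount to a rank-one, Kalman-style conditioning of a joint Gaussian belief. A sketch under the assumption of a full covariance matrix Σ over the ads' mean payoffs (the names are ours):

```python
import numpy as np

def update_correlated(theta, Sigma, i, x, var0):
    # theta: belief means, one per ad; Sigma: belief covariance between the
    # ads' mean payoffs. After showing ad i and observing payoff x, every
    # ad j is updated through its covariance Sigma[i, j] with the shown ad.
    theta = np.array(theta, dtype=float)
    Sigma = np.array(Sigma, dtype=float)
    gain = Sigma[i] / (Sigma[i, i] + var0)    # Kalman-style gain, one per ad
    theta = theta + gain * (x - theta[i])     # Eq. 14 for all ads at once
    Sigma = Sigma - np.outer(gain, Sigma[i])  # Eq. 15 as a rank-one shrink
    return theta, Sigma

theta2, Sigma2 = update_correlated([10.0, 5.0], [[4.0, 2.0], [2.0, 3.0]],
                                   i=0, x=14.0, var0=4.0)
# the positively correlated ad's mean rises too: theta2 == [12.0, 6.0]
```

For the selected ad itself (j = i) this reduces exactly to Equation (5).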
POMDP formulation and solution
Figure: The POMDP model for the revenue optimisation problem. (θ(t), Σ(t)) is the belief at each stage; x(t) is the observation and reward; s(t) is the action; (θ, Σ) is the hidden state. There is no state transition.
Value iteration and MAB approximation

The value function can be expressed as,

\[
s(t) = \arg\max_{s(t) \in N} V_{s(t)}\big(\Psi(t)\big)
= \arg\max_{i \in N} \Big( \underbrace{\bar{x}_i}_{\text{expected immediate reward}} + \underbrace{\xi\big(\Psi(t), i\big)}_{\text{expected future reward}} \Big) \tag{16}
\]

The exact solution using value iteration²:

\[
V^*\big(\theta, \Sigma, T\big) = \max_{s(1) \in N} \mathbb{E}\Big[ X_{s(1)}(1) + V^*\big(\theta \mid X_{s(1)}(1),\ \Sigma \mid X_{s(1)}(1),\ T-1\big) \Big] \tag{17}
\]

The approximation based on the multi-armed bandit³:

\[
\xi_{\text{UCB1-NORMAL}} = \sqrt{16 \cdot \frac{q_i - t_i\, \theta_i^2(t)}{t_i - 1} \cdot \frac{\ln(t-1)}{t_i}} \tag{18}
\]

²R. E. Bellman (1957) "Dynamic Programming"
³Auer, P. et al. (2002) "Finite-time analysis of the multi-armed bandit problem"
Value iteration with Monte Carlo sampling⁴

We use sampling to reduce the computational complexity.

1: function VALUEFUNC(θ, Σ, t)
2:   array V ← 0                                ▷ Expected reward vector.
3:   loop i ← 1 to N
4:     V[i] ← θi(t)                             ▷ Expected immediate reward.
5:     if t < T then
6:       for all s in SAMPLE(θ, Σ) do
7:         [θ′, Σ′] ← UPDATEBELIEF(θ, Σ, s, i)  ▷ New belief after selecting i and observing s (Equation 13).
8:         V[i] ← V[i] + (1/M0) · VALUEFUNC(θ′, Σ′, t + 1)
9:       end for
10:     end if
11:   end loop
12:   return [MAX(V), MAXINDEX(V)]
13: end function

⁴Thrun, S. (2000) "Monte Carlo POMDPs"
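The VALUEFUNC pseudocode above can be sketched in Python. This is an illustrative simplification, not the authors' implementation: SAMPLE is taken to draw hypothetical payoffs from the current belief, and UPDATEBELIEF follows Equations (14) and (15); all names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def update_correlated(theta, Sigma, i, x, var0):
    # Rank-one Gaussian belief update (Eqs. 14-15), applied to all ads.
    theta = np.array(theta, dtype=float)
    Sigma = np.array(Sigma, dtype=float)
    gain = Sigma[i] / (Sigma[i, i] + var0)
    return theta + gain * (x - theta[i]), Sigma - np.outer(gain, Sigma[i])

def value_func(theta, Sigma, t, T, var0, m0=4):
    # V starts as the expected immediate reward of each ad (its belief mean).
    V = np.array(theta, dtype=float)
    if t < T:
        Sigma_a = np.array(Sigma, dtype=float)
        for i in range(len(V)):
            # SAMPLE: draw m0 hypothetical payoffs of ad i; the payoff
            # variance is belief variance plus observation noise (Eq. 11).
            xs = rng.normal(V[i], np.sqrt(Sigma_a[i, i] + var0), size=m0)
            for x in xs:
                th2, S2 = update_correlated(theta, Sigma, i, x, var0)
                # Average the sampled future values into V[i].
                V[i] += value_func(th2, S2, t + 1, T, var0, m0)[0] / m0
    return V.max(), int(V.argmax())

value, best = value_func([1.0, 0.0], [[1.0, 0.8], [0.8, 1.0]],
                         t=1, T=3, var0=0.5)
```

The branching factor N · m0 per stage is what makes sampling (small m0) necessary in place of integrating over all possible observations.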
Multi-armed bandit based approximation (cont.)

The UCB1-NORMAL-COR algorithm:

1: function PLAN(θ, Σ, Ψ(t))
2:   array V ← 0
3:   loop i ← 1 to N
4:     if ti < ⌈8 log t⌉ then                 ▷ ti is the number of times ad i gets selected.
5:       return i
6:     end if
7:   end loop
8:   [θ′, Σ′] ← UPDATEBELIEF(θ, Σ, Ψ(t))     ▷ New belief of all ads with all available information (Equation 13).
9:   loop i ← 1 to N
10:    V[i] ← θ′i + √(16 · (qi − ti·θ′i²)/(ti − 1) · ln(t − 1)/ti)   ▷ Expected reward plus confidence bonus.
11:  end loop
12:  return [MAX(V), MAXINDEX(V)]
13: end function
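A Python sketch of the PLAN step. The confidence bonus follows Auer et al.'s UCB1-NORMAL (whose bonus includes a ln(t − 1) factor); the belief update is left out, and all names here are our own:

```python
import math

def ucb1_normal_index(mean, q, t_i, t):
    # UCB1-NORMAL index: empirical mean plus a confidence bonus built from
    # q, the sum of squared observed payoffs of the arm (Auer et al. 2002).
    bonus = math.sqrt(16.0 * (q - t_i * mean ** 2) / (t_i - 1)
                      * math.log(t - 1) / t_i)
    return mean + bonus

def plan(theta, q, counts, t):
    # Force-sample any ad played fewer than ceil(8 log t) times; otherwise
    # select the ad with the highest upper-confidence index.
    for i, t_i in enumerate(counts):
        if t_i < math.ceil(8 * math.log(t)):
            return i
    scores = [ucb1_normal_index(theta[i], q[i], counts[i], t)
              for i in range(len(theta))]
    return max(range(len(theta)), key=scores.__getitem__)

best = plan([1.0, 2.0], q=[45.0, 170.0], counts=[40, 40], t=100)
# ad 1 has both the higher mean and the larger empirical spread, so best == 1
```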
Experiment datasets
Figure: the ad ecosystem (diagram: advertisers, publishers, the ad network/exchange, and the Google AdWords Traffic Estimator service).

• publishers gain 68% of advertisers' spending (2003);
• data was collected from 12/2011 to 05/2012;
• 512 different keywords, 310 with non-zero mean payoff, 8 categories;
• 20% for training and 80% for testing;
• we consider each keyword to be an ad.
Competing algorithms
We compare the following algorithms:
• RANDOM policy, which selects candidates uniformly at random;
• MYOPIC policy, based on the expected immediate reward;
• UCB1 policy, which assumes independence between arms and makes no assumption about the reward distribution;
• UCB1-NORMAL policy, which assumes independence between arms and Gaussian-distributed rewards;
• VI-COR policy, which solves value iteration using Monte Carlo sampling; and
• UCB1-NORMAL-COR policy, which considers the dependencies between candidates.
Results
Datasets     MYOPIC  RANDOM  UCB1  UCB1-N  VI-COR  UCB1-N-COR
Education    21.9    23.0    30.9  30.9    41.2*   27.6
Finance-1    38.5    27.8    40.9  26.4    44.5    27.4
Finance-2    22.1    16.5    30.6  22.8    38.0*   22.9
Information  14.1    12.9    27.8  15.9    29.4    15.9
P&O          41.6    30.4    50.5  31.4    72.9*   63.3
Shopping-1   17.4    10.6    42.3  16.1    40.2    16.4
Shopping-2   29.9    14.5    34.3  75.3    52.9    79.2*
Shopping-3   9.7     4.3     21.9  18.3    27.3    19.4
P&S          24.7    26.0    47.2  57.1    67.9*   59.9
Medical      30.5    19.6    52.7  32.2    58.0*   33.5

Table: The cumulative payoffs are averaged over 8 chunks, then normalised w.r.t. the GOLDEN policy for better presentation. The highest cumulative payoff in each row is marked with * when its difference from the second best is significant under the Wilcoxon signed-rank test. P&O is "People & organisations" and P&S is "Products & services".
Results (cont.)
Figure: Cumulative payoff on the "People & organisations" category, 5 candidates. Policies: Random, Myopic, Golden, UCB1, UCB1-Normal, UCB1-Normal-COR, VI-COR.
Results (cont.)
Figure: Comparison of normalised cumulative payoffs on the 10 datasets (Myopic, VI-COR, UCB1-Normal, UCB1-Normal-COR). VI-COR always performed better than MYOPIC, and UCB1-NORMAL-COR always performed better than UCB1-NORMAL, across all datasets.
Results (cont.)
Figure: Special case: the daily payoff of two candidates ("best phones" and "term insurance") with a sudden change.
Results (cont.)
Figure: The impact of the noise factor σ₀² for the situation in the previous figure (Golden, Myopic, VI-COR, UCB1-Normal-COR).

\[
\theta_{s(t+1)}(t+1) = \theta_{s(t+1)}(t) + \sigma_{s(t),s(t+1)}\, \frac{x_{s(t)}(t) - \theta_{s(t)}(t)}{\sigma_{s(t)}^2(t) + \sigma_0^2}
\]
Future works

• correlated update: if ad a1 on webpage w1 was shown to user u1 and we observed its performance, what is the belief on the performance of ad a2 on webpage w2 when shown to user u2, with the correlations known?
• multiple ads with diversification (another exploration-exploitation dilemma);
• a better solution for our continuous POMDP problem.