Sequential Selection of Correlated Ads by POMDPs
Shuai Yuan, Jun Wang
University College London
October 29, 2012
Motivations and contributions

Motivations:
• help publishers gain more profit by displaying ads;
• go further than offline, content-based matching of webpages and ads.

Contributions:
• a framework of ad selection for revenue optimisation;
• formulating the sequential selection problem as a Partially Observable Markov Decision Process (POMDP) and providing exact and approximate solutions;
• a public keyword-bid-ad-webpage dataset for reproducible research¹.

¹http://www.computational-advertising.org
Related works

Contextual advertising:
• A semantic approach to contextual advertising [Broder 2007]
• Impedance coupling in content-targeted advertising [Ribeiro 2005]
• Contextual advertising by combining relevance with click feedback [Chakrabarti 2008]

Inventory management (contracts):
• Targeted advertising on the Web with inventory management [Chickering 2003]
• Revenue management for online advertising: Impatient advertisers [Fridgeirsdottir 2007]
• Dynamic revenue management for online display advertising [Roels 2009]

Optimal pricing models:
• Pricing of Online Advertising: Cost-Per-Click-Through Vs. Cost-Per-Action [Hu 2010]
• Online advertising: Pay-per-view versus pay-per-click [Mangani 2004]
• Online advertising: Pay-per-view versus pay-per-click: A comment [Fjell 2009]
• Single period balancing of pay-per-click and pay-per-view online display advertisements [Kwon 2011]

Related works (cont.)

Ad scheduling:
• Scheduling advertisements on a web page to maximize revenue [Kumar 2006]
• Scheduling of dynamic in-game advertising [Turner 2011]

Multi-armed bandits:
• Using confidence bounds for exploitation-exploration trade-offs [Auer 2003]
• Multi-armed bandit problems with dependent arms [Pandey 2007]

POMDPs:
• A survey of POMDP applications [Cassandra 1998]
• Monte Carlo POMDPs [Thrun 2000]
• Perseus: Randomized point-based value iteration for POMDPs [Spaan 2005]
Problem statement - setup
Figure: 1 webpage, 1 ad slot, M impressions at each time step. The payoff of the ads follows X ∼ N(µ, I · σ₀²); µ is generated by µ ∼ N(θ, Σ).
Problem statement - graphical model
Figure: The payoff model illustrated by an influence-diagram representation with the generative processes of a finite-horizon POMDP. s(t) is the selection action; (θ(t), Σ(t)) is the belief at each stage.
Problem statement - objective function

To maximise the expected cumulative payoff over time,

\[
\pi^* = \arg\max_{\pi} \mathbb{E}\big[R_{\pi}(T)\big]
= \arg\max_{\pi} \mathbb{E}\Big[\sum_{t=1}^{T} X_{s(t)}(t)\Big]
= \arg\max_{\pi} \sum_{t=1}^{T} \mathbb{E}\big[X_{s(t)}(t)\big]
= \arg\max_{\pi} \sum_{t=1}^{T} \int_x x_{s(t)}(t)\, p\big(x_{s(t)}(t) \mid \Psi(t)\big)\, dx
= \arg\max_{\pi} \sum_{t=1}^{T} \theta_{s(t)}(t) \tag{1}
\]

where,
• s(t) is the selection decision;
• Ψ(t) is the available information;
• π is a selection policy and π* is the optimal one;
• the constant factor "M impressions" is dropped from the objective function.
Belief update
Figure: Updating belief on ads' performance over time (t = 1, t = 2, ...).
Belief update - the selected ad

We update the belief using Bayes' theorem,

\[
p\big(x_1 \mid x_1(t), \Psi(t)\big) = \int p\big(x_1 \mid x_1(t), \Psi(t), \mu_1\big)\, p\big(\mu_1 \mid x_1(t), \Psi(t)\big)\, d\mu_1 \tag{2}
\]

by "completing the squares",

\[
p\big(\mu_1 \mid x_1(t), \Psi(t)\big) \propto p\big(x_1(t) \mid \mu_1, \Psi(t)\big)\, p\big(\mu_1 \mid \Psi(t)\big)
\propto \exp\big\{-\big(x_1(t) - \mu_1\big)^2 - \big(\mu_1 - \theta_1(t)\big)^2\big\} \tag{3}
\]

we obtain the new belief,

\[
\mu_1 \mid x_1(t) \sim \mathcal{N}\big(\theta_1(t+1),\ \sigma_1^2(t+1)\big) \tag{4}
\]

\[
\theta_1(t+1) = \frac{\sigma_1^2(t)\, x_1(t) + \sigma_0^2\, \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2}, \qquad
\sigma_1^2(t+1) = \frac{\sigma_1^2(t)\, \sigma_0^2}{\sigma_1^2(t) + \sigma_0^2} \tag{5}
\]

where we write θᵢ(t) and σᵢ²(t) as shorthand for θᵢ | Ψ(t) and σᵢ² | Ψ(t).
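Equation (5) is the standard conjugate-Gaussian posterior update. A minimal sketch in Python (the function and variable names are our own, not from the slides):

```python
def update_selected(theta, var, x, var0):
    # Conjugate-Gaussian posterior for the selected ad's mean payoff (Eq. 5):
    # (theta, var) describe the current belief N(theta, var) on mu_1,
    # x is the observed payoff x_1(t), var0 is the noise variance sigma_0^2.
    denom = var + var0
    theta_new = (var * x + var0 * theta) / denom   # precision-weighted mean
    var_new = var * var0 / denom                   # variance always shrinks
    return theta_new, var_new

theta1, var1 = update_selected(theta=10.0, var=4.0, x=14.0, var0=4.0)
# with equal belief and noise variances, the mean moves halfway toward the
# observation: theta1 == 12.0, var1 == 2.0
```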
Belief update - the correlated ad

We also update the belief of the non-selected ads,

\[
p\big(x_2 \mid x_1(t), \Psi(t)\big) = \int p\big(x_2 \mid \mu_2, x_1(t), \Psi(t)\big)\, p\big(\mu_2 \mid x_1(t), \Psi(t)\big)\, d\mu_2 \tag{6}
\]

with the linear Gaussian property,

\[
\mu_1 \mid \mu_2 \sim \mathcal{N}\big(\theta_{1|\mu_2},\ \sigma^2_{1|\mu_2}\big) \tag{7}
\]

\[
\theta_{1|\mu_2} = \theta_1 + \frac{\sigma_{1,2}}{\sigma_2^2}\big(\mu_2 - \theta_2\big), \qquad
\sigma^2_{1|\mu_2} = \sigma_1^2 - \frac{\sigma_{1,2}^2}{\sigma_2^2} \tag{8}
\]

we obtain the new belief on a correlated ad,

\[
\mu_2 \mid x_1(t) \sim \mathcal{N}\big(\theta_2(t+1),\ \sigma_2^2(t+1)\big) \tag{9}
\]

\[
\theta_2(t+1) = \theta_2(t) + \sigma_{1,2}\, \frac{x_1(t) - \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2}, \qquad
\sigma_2^2(t+1) = \sigma_2^2(t) - \frac{\sigma_{1,2}^2}{\sigma_1^2(t) + \sigma_0^2} \tag{10}
\]
Belief update - expected payoff

We also obtain the expected payoff of the selected ad,

\[
X_1 \mid x_1(t), \Psi(t) \sim \mathcal{N}\big(\theta_1(t+1),\ \sigma_0^2 + \sigma_1^2(t+1)\big) \tag{11}
\]

and the expected payoff of the correlated ad,

\[
X_2 \mid x_1(t), \Psi(t) \sim \mathcal{N}\big(\theta_2(t+1),\ \sigma_0^2 + \sigma_2^2(t+1)\big) \tag{12}
\]

The final objective function is,

\[
\pi^* = \arg\max_{\pi} \sum_{t=1}^{T} \theta_{s(t)}(t) \quad \text{subject to} \tag{13}
\]

\[
\theta_{s(t+1)}(t+1) = \theta_{s(t+1)}(t) + \sigma_{s(t),s(t+1)}\, \frac{x_{s(t)}(t) - \theta_{s(t)}(t)}{\sigma_{s(t)}^2(t) + \sigma_0^2} \tag{14}
\]

\[
\sigma_{s(t+1)}^2(t+1) = \sigma_{s(t+1)}^2(t) - \frac{\sigma_{s(t),s(t+1)}^2}{\sigma_{s(t)}^2(t) + \sigma_0^2} \tag{15}
\]
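The constraint updates (14) and (15) amount to a rank-one, Kalman-style conditioning of a joint Gaussian belief. A sketch under the assumption of a full covariance matrix Σ over the ads' mean payoffs (the names are ours):

```python
import numpy as np

def update_correlated(theta, Sigma, i, x, var0):
    # theta: belief means, one per ad; Sigma: belief covariance between the
    # ads' mean payoffs. After showing ad i and observing payoff x, every
    # ad j is updated through its covariance Sigma[i, j] with the shown ad.
    theta = np.array(theta, dtype=float)
    Sigma = np.array(Sigma, dtype=float)
    gain = Sigma[i] / (Sigma[i, i] + var0)    # Kalman-style gain, one per ad
    theta = theta + gain * (x - theta[i])     # Eq. 14 for all ads at once
    Sigma = Sigma - np.outer(gain, Sigma[i])  # Eq. 15 as a rank-one shrink
    return theta, Sigma

theta2, Sigma2 = update_correlated([10.0, 5.0], [[4.0, 2.0], [2.0, 3.0]],
                                   i=0, x=14.0, var0=4.0)
# the positively correlated ad's mean rises too: theta2 == [12.0, 6.0]
```

For the selected ad itself (j = i) this reduces exactly to Equation (5).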
POMDP formulation and solution
Figure: The POMDP model for the revenue optimisation problem. (θ(t), Σ(t)) is the belief at each stage; x(t) is the observation and reward; s(t) is the action; (θ, Σ) is the hidden state. There is no state transition.
Value iteration and MAB approximation

The value function can be expressed as,

\[
s(t) = \arg\max_{s(t) \in N} V_{s(t)}\big(\Psi(t)\big)
= \arg\max_{i \in N} \Big( \underbrace{\bar{x}_i}_{\text{expected immediate reward}} + \underbrace{\xi\big(\Psi(t), i\big)}_{\text{expected future reward}} \Big) \tag{16}
\]

The exact solution using value iteration²:

\[
V^*\big(\theta, \Sigma, T\big) = \max_{s(1) \in N} \mathbb{E}\Big[ X_{s(1)}(1) + V^*\big(\theta \mid X_{s(1)}(1),\ \Sigma \mid X_{s(1)}(1),\ T-1\big) \Big] \tag{17}
\]

The approximation based on the multi-armed bandit³:

\[
\xi_{\text{UCB1-NORMAL}} = \sqrt{16 \cdot \frac{q_i - t_i\, \theta_i^2(t)}{t_i - 1} \cdot \frac{\ln(t-1)}{t_i}} \tag{18}
\]

²R. E. Bellman (1957) "Dynamic Programming"
³Auer, P. et al. (2002) "Finite-time analysis of the multi-armed bandit problem"
Value iteration with Monte Carlo sampling⁴

We use sampling to reduce the computational complexity.

1: function VALUEFUNC(θ, Σ, t)
2:   array V ← 0                                ▷ Expected reward vector.
3:   loop i ← 1 to N
4:     V[i] ← θi(t)                             ▷ Expected immediate reward.
5:     if t < T then
6:       for all s in SAMPLE(θ, Σ) do
7:         [θ′, Σ′] ← UPDATEBELIEF(θ, Σ, s, i)  ▷ New belief after selecting i and observing s (Equation 13).
8:         V[i] ← V[i] + (1/M0) · VALUEFUNC(θ′, Σ′, t + 1)
9:       end for
10:     end if
11:   end loop
12:   return [MAX(V), MAXINDEX(V)]
13: end function

⁴Thrun, S. (2000) "Monte Carlo POMDPs"
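The VALUEFUNC pseudocode above can be sketched in Python. This is an illustrative simplification, not the authors' implementation: SAMPLE is taken to draw hypothetical payoffs from the current belief, and UPDATEBELIEF follows Equations (14) and (15); all names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def update_correlated(theta, Sigma, i, x, var0):
    # Rank-one Gaussian belief update (Eqs. 14-15), applied to all ads.
    theta = np.array(theta, dtype=float)
    Sigma = np.array(Sigma, dtype=float)
    gain = Sigma[i] / (Sigma[i, i] + var0)
    return theta + gain * (x - theta[i]), Sigma - np.outer(gain, Sigma[i])

def value_func(theta, Sigma, t, T, var0, m0=4):
    # V starts as the expected immediate reward of each ad (its belief mean).
    V = np.array(theta, dtype=float)
    if t < T:
        Sigma_a = np.array(Sigma, dtype=float)
        for i in range(len(V)):
            # SAMPLE: draw m0 hypothetical payoffs of ad i; the payoff
            # variance is belief variance plus observation noise (Eq. 11).
            xs = rng.normal(V[i], np.sqrt(Sigma_a[i, i] + var0), size=m0)
            for x in xs:
                th2, S2 = update_correlated(theta, Sigma, i, x, var0)
                # Average the sampled future values into V[i].
                V[i] += value_func(th2, S2, t + 1, T, var0, m0)[0] / m0
    return V.max(), int(V.argmax())

value, best = value_func([1.0, 0.0], [[1.0, 0.8], [0.8, 1.0]],
                         t=1, T=3, var0=0.5)
```

The branching factor N · m0 per stage is what makes sampling (small m0) necessary in place of integrating over all possible observations.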
Multi-armed bandit based approximation (cont.)

The UCB1-NORMAL-COR algorithm:

1: function PLAN(θ, Σ, Ψ(t))
2:   array V ← 0
3:   loop i ← 1 to N
4:     if ti < ⌈8 log t⌉ then                 ▷ ti is the number of times ad i gets selected.
5:       return i
6:     end if
7:   end loop
8:   [θ′, Σ′] ← UPDATEBELIEF(θ, Σ, Ψ(t))     ▷ New belief of all ads with all available information (Equation 13).
9:   loop i ← 1 to N
10:    V[i] ← θ′i + √(16 · (qi − ti·θ′i²)/(ti − 1) · ln(t − 1)/ti)   ▷ Expected reward plus confidence bonus.
11:  end loop
12:  return [MAX(V), MAXINDEX(V)]
13: end function
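A Python sketch of the PLAN step. The confidence bonus follows Auer et al.'s UCB1-NORMAL (whose bonus includes a ln(t − 1) factor); the belief update is left out, and all names here are our own:

```python
import math

def ucb1_normal_index(mean, q, t_i, t):
    # UCB1-NORMAL index: empirical mean plus a confidence bonus built from
    # q, the sum of squared observed payoffs of the arm (Auer et al. 2002).
    bonus = math.sqrt(16.0 * (q - t_i * mean ** 2) / (t_i - 1)
                      * math.log(t - 1) / t_i)
    return mean + bonus

def plan(theta, q, counts, t):
    # Force-sample any ad played fewer than ceil(8 log t) times; otherwise
    # select the ad with the highest upper-confidence index.
    for i, t_i in enumerate(counts):
        if t_i < math.ceil(8 * math.log(t)):
            return i
    scores = [ucb1_normal_index(theta[i], q[i], counts[i], t)
              for i in range(len(theta))]
    return max(range(len(theta)), key=scores.__getitem__)

best = plan([1.0, 2.0], q=[45.0, 170.0], counts=[40, 40], t=100)
# ad 1 has both the higher mean and the larger empirical spread, so best == 1
```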
Experiment datasets
Figure: the ad ecosystem (diagram: advertisers, publishers, the ad network/exchange, and the Google AdWords Traffic Estimator service).

• publishers gain 68% of advertisers' spending (2003);
• data was collected from 12/2011 to 05/2012;
• 512 different keywords, 310 with non-zero mean payoff, 8 categories;
• 20% for training and 80% for testing;
• we consider each keyword to be an ad.
Competing algorithms
We compare the following algorithms:
• RANDOM policy, which selects candidates uniformly at random;
• MYOPIC policy, based on the expected immediate reward;
• UCB1 policy, which assumes independence between arms and makes no assumption about the reward distribution;
• UCB1-NORMAL policy, which assumes independence between arms and Gaussian-distributed rewards;
• VI-COR policy, which solves value iteration using Monte Carlo sampling; and
• UCB1-NORMAL-COR policy, which considers the dependencies between candidates.
Results
Datasets     MYOPIC  RANDOM  UCB1  UCB1-N  VI-COR  UCB1-N-COR
Education    21.9    23.0    30.9  30.9    41.2*   27.6
Finance-1    38.5    27.8    40.9  26.4    44.5    27.4
Finance-2    22.1    16.5    30.6  22.8    38.0*   22.9
Information  14.1    12.9    27.8  15.9    29.4    15.9
P&O          41.6    30.4    50.5  31.4    72.9*   63.3
Shopping-1   17.4    10.6    42.3  16.1    40.2    16.4
Shopping-2   29.9    14.5    34.3  75.3    52.9    79.2*
Shopping-3   9.7     4.3     21.9  18.3    27.3    19.4
P&S          24.7    26.0    47.2  57.1    67.9*   59.9
Medical      30.5    19.6    52.7  32.2    58.0*   33.5

Table: The cumulative payoffs are averaged over 8 chunks, then normalised w.r.t. the GOLDEN policy for better presentation. The highest cumulative payoff in each row is marked with * when its difference from the second best is significant under the Wilcoxon signed-rank test. P&O is "People & organisations" and P&S is "Products & services".
Results (cont.)
Figure: Cumulative payoff on the "People & organisations" category, 5 candidates. Policies: Random, Myopic, Golden, UCB1, UCB1-Normal, UCB1-Normal-COR, VI-COR.
Results (cont.)
Figure: Comparison of normalised cumulative payoffs on the 10 datasets (Myopic, VI-COR, UCB1-Normal, UCB1-Normal-COR). VI-COR always performed better than MYOPIC, and UCB1-NORMAL-COR always performed better than UCB1-NORMAL, across all datasets.
Results (cont.)
Figure: Special case: the daily payoff of two candidates ("best phones" and "term insurance") with a sudden change.
Results (cont.)
Figure: The impact of the noise factor σ₀² for the situation in the previous figure (Golden, Myopic, VI-COR, UCB1-Normal-COR).

\[
\theta_{s(t+1)}(t+1) = \theta_{s(t+1)}(t) + \sigma_{s(t),s(t+1)}\, \frac{x_{s(t)}(t) - \theta_{s(t)}(t)}{\sigma_{s(t)}^2(t) + \sigma_0^2}
\]
Future works

• correlated update: if ad a1 on webpage w1 was shown to user u1 and we observed its performance, what is the belief on the performance of ad a2 on webpage w2 when shown to user u2, with the correlations known?
• multiple ads with diversification (another exploration-exploitation dilemma);
• a better solution for our continuous POMDP problem.