Page 1
Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning
Pascal Poupart (University of Waterloo)
INFORMS 2009
Page 2
Outline
• Dynamic Pricing as a POMDP
• Symbolic Perseus
  – Generic POMDP solver
  – Point-based value iteration
  – Algebraic decision diagrams
• Experimental evaluation
• Conclusion
Page 3
Setting
• One or several firms (monopoly or oligopoly)
• Fixed capacity and fixed number of selling rounds (i.e., sale of seasonal items)
• Finite range of prices
• Unknown and varying demand
• Question: how to dynamically adjust prices to maximize sales?
Page 4
POMDP Formulation (monopoly)
[Figure: influence diagram unrolled over four time steps. Firm variables: Price and Inv (inventory); consumer variable: CC (consumer choice); Sales is observed at each step; Time indexes the selling rounds.]
Page 5
POMDP Formulation (oligopoly)
[Figure: the same diagram extended with competitor variables Price-i and Inv-i at each time step, alongside the firm's Price and Inv, the consumer's CC, observed Sales, and Time. Rows: Firm, Competitors, Consumer.]
Page 6
Unknown demand & competitors
[Figure: the oligopoly diagram again, highlighting that the demand model and the competitors' behaviour are unknown to the firm.]
Page 7
Demand Model
• Probability that the consumer chooses firm i:
  Pr(CC = i) = e^(a_i + b_i p_i) / (Σ_j e^(a_j + b_j p_j) + 1)   (see the sketch below)
• Parameters a_i and b_i are unknown
• Learn them
  – From historical data
  – As the process evolves
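A minimal numeric sketch of this logit choice model in Python (the function name and the example parameter values are illustrative, not from the talk; the trailing "+ 1" is the no-purchase option):

import math

def choice_probability(prices, a, b, i):
    # Pr(CC = i) = exp(a_i + b_i * p_i) / (sum_j exp(a_j + b_j * p_j) + 1)
    weights = [math.exp(a[j] + b[j] * prices[j]) for j in range(len(prices))]
    return weights[i] / (sum(weights) + 1.0)

# Two firms with made-up parameters: a higher price lowers the choice probability.
print(choice_probability(prices=[9.0, 10.0], a=[2.0, 2.0], b=[-0.3, -0.3], i=0))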
Page 8
Competitors
• Model each competitor by:
  – Pricing strategy: inv/time → price (see the sketch below)
  – Two thresholds: t_up and t_down
    • If inv/time < t_up, price↑
    • If inv/time > t_down, price↓
• Learn the thresholds
  – From historical data
  – As the process evolves
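A rough sketch of one competitor's threshold strategy as code (the names t_up and t_down follow the slide; the unit price step is an assumption):

def competitor_price_move(inventory, rounds_left, price, t_up, t_down, step=1.0):
    # The strategy maps inventory per remaining selling round to a price move.
    ratio = inventory / max(rounds_left, 1)
    if ratio < t_up:       # stock is low relative to time left -> raise the price
        return price + step
    if ratio > t_down:     # too much stock left -> lower the price
        return price - step
    return price           # otherwise leave the price unchanged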
Page 9
Expanded POMDP
[Figure: the oligopoly diagram expanded with persistent parameter nodes A, B (demand model) and T↑, T↓ (competitor thresholds) that influence every time step, alongside Price, CC, Inv, Sales, Price-i, Inv-i, and Time.]
Page 10
POMDPs
• Partially Observable Markov Decision Processes
  – S: set of states
    • Cross product of the domains of all variables
    • |S| = ∏_i |dom(V_i)| (exponentially large!)
  – A: set of actions
    • {price↑, price↓, price unchanged}
  – O: set of observations
    • Cross product of the domains of the observable variables
  – T(s,a,s') = Pr(s'|s,a): transition function
    • Factored rep: Pr(s'|s,a) = ∏_i Pr(V_i'|parents(V_i'))
  – R(s,a) = r: reward function
    • Sale revenue = price × CC
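For concreteness, a flat container for these ingredients might look as below; this is only a sketch (Symbolic Perseus' own interface is factored and ADD-based, and the discount factor is an illustrative assumption since the slide does not give one):

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class POMDP:
    states: Sequence        # S: cross product of all variable domains
    actions: Sequence       # A: e.g. {price up, price down, price unchanged}
    observations: Sequence  # O: joint values of the observable variables
    T: Callable             # T(s, a, s2) = Pr(s2 | s, a)
    Z: Callable             # Z(a, s2, o) = Pr(o | a, s2), the observation model
    R: Callable             # R(s, a), e.g. sale revenue = price x CC
    gamma: float = 0.95     # discount factor (illustrative value)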
Page 11
Belief monitoring
• Belief: b(s)
  – Distribution over states
• Belief update: Bayes' theorem
  – b_a^o'(s') = k Σ_{s∈S} b(s) Pr(s'|s,a) Pr(o'|a,s')
  – The updated belief b_a^o' is determined by the triple <b, a, o'>
• Demand learning and opponent modeling:
  – Implicit learning by belief monitoring
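A flat (unfactored) sketch of this Bayes update; Symbolic Perseus performs the same computation over the factored, ADD-encoded model:

def belief_update(b, a, o, states, T, Z):
    # b'(s2) = k * Pr(o | a, s2) * sum_s b(s) * Pr(s2 | s, a), with k normalizing.
    unnormalized = {s2: Z(a, s2, o) * sum(b[s] * T(s, a, s2) for s in states)
                    for s2 in states}
    k = sum(unnormalized.values())
    return {s2: p / k for s2, p in unnormalized.items()}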
Page 12
Policy trees
• Policy π
  – Mapping from past actions & observations to the next action
  – Tree representation
  – Problem: the tree grows exponentially with time
[Figure: a depth-three policy tree: root action a1; observations o1/o2 lead to a2 or a3, whose observations o1/o2 lead to a4, a5, a6, a7.]
Page 13
Policy Optimization
• Policy π : B → A
  – Mapping from beliefs to actions
• Value function: V^π(b) = Σ_t γ^t E_{b_t|π}[R]
• Optimal policy π*:
  – V*(b) ≥ V^π(b) for all π, b
• Bellman's equation:
  – V*(b) = max_a E_b[R] + γ Σ_o' Pr(o'|b,a) V*(b_a^o')
Page 14
Difficulties
• Exponentially large state space
  – |S| = ∏_i |dom(V_i)|
  – Solution: algebraic decision diagrams
• Complex policy space
  – Policy π : B → A
  – Continuous belief space
  – Solution: point-based Bellman backups
Page 15
Symbolic Perseus
• Publicly available:
  – http://www.cs.uwaterloo.ca/~ppoupart/software.html
• Has been used to solve POMDPs with millions of states
• Currently used by
  – Intel, Toronto Rehabilitation Institute, Univ of Dundee, Technical Univ of Lisbon, Univ of British Columbia, Univ of Manchester, Univ of Waterloo

Point-based value iteration + algebraic decision diagrams
Page 16
Piecewise linear & convex value function
• The value of a policy tree β is linear:
  V_β(b_0) = Σ_{s∈S} b_0(s) V_β(s)
• The value of an optimal finite-horizon policy is piecewise-linear and convex [SS73]
[Figure: a piecewise-linear, convex V^π over a one-dimensional belief space, from b(s)=0 to b(s)=1.]
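Because of this PWLC structure, a value function can be stored as a finite set of alpha vectors and evaluated at any belief by a maximum of dot products, e.g. (sketch):

def value_at(b, alpha_vectors, states):
    # V(b) = max over alpha of sum_s b(s) * alpha(s)
    return max(sum(b[s] * alpha[s] for s in states) for alpha in alpha_vectors)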
Page 17
Point-based value iteration
• Point-based backup (Pineau et al. 2003):
  α_{t-1}(b) = max_a E_b[R] + γ Σ_o' Pr(o'|b,a) α_t(b_a^o')
[Figure: a backup at belief point b: the V_t alpha vectors evaluated at the updated beliefs b_a^o1 and b_a^o2 are combined into a new V_{t-1} alpha vector that is maximal at b.]
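A flat-representation sketch of one point-based backup at a single belief point b, reusing the hypothetical POMDP container from the earlier slide (Symbolic Perseus carries out the same steps on ADDs):

def point_based_backup(b, alphas, m):
    # Build one new alpha vector that is maximal at belief b.
    S, A, O = m.states, m.actions, m.observations
    best, best_val = None, float("-inf")
    for a in A:
        g = {}
        for o in O:
            # Back-project every current alpha vector through the model ...
            candidates = [{s: sum(m.T(s, a, s2) * m.Z(a, s2, o) * al[s2] for s2 in S)
                           for s in S} for al in alphas]
            # ... and keep the one that scores best at b for this observation.
            g[o] = max(candidates, key=lambda c: sum(b[s] * c[s] for s in S))
        alpha_a = {s: m.R(s, a) + m.gamma * sum(g[o][s] for o in O) for s in S}
        val = sum(b[s] * alpha_a[s] for s in S)
        if val > best_val:
            best, best_val = alpha_a, val
    return best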
Page 18
Algebraic Decision Diagrams
• First use in MDPs: Hoey et al. 1999
• Factored representation
  – Exploit conditional independence
  – Pr(s'|s,a) = ∏_i Pr(V_i'|parents(V_i'))
• Automatic state aggregation
  – Exploit context-specific independence
  – Exploit sparsity
Page 19
Factored Representation
• Transition fn: Pr(s'|s,a)
  – Flat representation: matrix, O(|S|²)
  – Factored representation: often O(log |S|)
[Figure: the two-slice DBN for the monopoly model (Price, CC, Inv, Sales, Time over four time steps); its per-variable conditional probability tables form the factored transition function.]
Page 20
Computation with Factored Rep
• Belief monitoring:
  – b_a^o'(s') = k Pr(o'|a,s') Σ_s b(s) Pr(s'|s,a)
• Point-based Bellman backup:
  – α(s) = max_a R(s,a) + γ Σ_{s',o'} Pr(s'|s,a) Pr(o'|a,s') α_a^o'(s')
• Flat representation: O(|S|²)
• Factored representation: often O(|S| log |S|)
Page 21
Algebraic Decision Diagrams
• Tree-based representation
  – Acyclic directed graph
• Avoid duplicate entries
  – Exploit context-specific independence
  – Exploit sparsity
[Figure: example ADD over Boolean variables X, Y, Z with leaf values {0, 2, 3}, representing
  f(x,y,z)=0, f(x,y,~z)=0, f(x,~y,z)=0, f(x,~y,~z)=2,
  f(~x,y,z)=0, f(~x,y,~z)=2, f(~x,~y,z)=3, f(~x,~y,~z)=3;
the eight-entry table collapses to one X node, two Y nodes, one shared Z node and three leaves.]
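Written out, the example function above would need an eight-row table, yet only three distinct values occur; the same context-specific structure can be expressed with nested conditionals (a sketch of the idea, not of the ADD data structure itself):

def f(x, y, z):
    # In the context x: the value is 2 only when (~y, ~z), otherwise 0.
    if x:
        return 2 if (not y and not z) else 0
    # In the context (~x, ~y): z is irrelevant and the value is always 3.
    # In the context (~x, y): the same Z test as above decides between 2 and 0.
    return (2 if not z else 0) if y else 3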
Page 22
Empirical Results
• Monopolistic Dynamic Pricing

Inv / Time   |S|       SP Value   Upper bound   Runtime (min)
10 / 20      73,920    121        133           19
15 / 30      158,720   152        167           48
20 / 40      275,520   171        187           61
25 / 50      424,320   182        198           161
30 / 60      605,120   188        199           350
35 / 70      817,920   192        199           448
Page 23
COACH project
• Automated prompting system to help elderly persons wash their hands
• IATSL: Alex Mihailidis, Jesse Hoey, Jennifer Boger et al.
Page 24
Policy Optimization
• Partially observable MDP:
  – Handle noisy HandLocation and noisy WaterFlow
  – Can adapt to user responsiveness
  – 50,181,120 states, 20 actions, 12 observations
• Approximation: fully observable MDP
  – Assume HandLocation and WaterFlow are fully observable
  – Remove the user responsiveness variable
  – 25,090,560 states, 20 actions
Page 25
Empirical Comparison (Simulation)
Page 26
Conclusion
• Natural encoding of Dynamic Pricing as a POMDP
  – Demand and competitor learning by belief monitoring
  – Factored model
• Symbolic Perseus (generic POMDP solver)
  – Point-based value iteration + algebraic decision diagrams
  – Exploits problem-specific structure
• Future work
  – Bayesian reinforcement learning
  – Planning as inference