Monte-Carlo tree search for multi-player, no-limit Texas hold'em poker
Guy Van den Broeck
Deceptive play: Should I bluff?
Opponent modeling: Is he bluffing?
Incomplete information: Who has the Ace?
Game of chance: What are the odds?
Exploitation: I'll bet because he always calls
Huge state space: What can happen next?
Risk management & continuous action space: Should I bet $5 or $10?

Take-away message: We can solve all these problems!
Problem Statement
A bot for Texas hold'em poker: no-limit, > 2 players
Not done before! Exploitative, not game-theoretic
Game tree search + opponent modeling
Applies to any problem with incomplete information, non-determinism, or continuous actions
Outline
- Overview of approach: the poker game tree, opponent model, Monte-Carlo tree search
- Research challenges: search, opponent model
- Conclusion
Poker Game Tree
- Minimax trees: deterministic games (tic-tac-toe, checkers, chess, Go, …); max and min nodes
- Expecti(mini)max trees: games of chance (backgammon, …); adds chance nodes
- Miximax trees: hidden information + opponent model; max nodes plus "mix" nodes for chance and modeled opponents
[Example miximax tree, built up step by step on the slides (leaf values partly reconstructed): a "my action" node with three branches. Fold resolves to 0. Call leads to a Reveal Cards chance node (0.5/0.5 over showdown values -1 and 3; EV 1). Raise leads to an opp-1 action node evaluated with the opponent model (fold 0.6, call 0.3, raise 0.1, over subtrees that include an opp-2 action node, with values 4, 2, and 0; EV 3). Best action: raise, EV 3.]
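The backed-up value in this example can be reproduced in a few lines. A minimal miximax evaluation sketch in Python; the dictionary encoding and the leaf values (partly reconstructed from the slide) are illustrative, not the bot's actual data structures:

```python
# Minimal miximax evaluation sketch for the example tree above.
# Node kinds: "max" (my decision), "mix" (chance node or modeled
# opponent, with probabilities), "leaf" (resolved payoff).

def miximax(node):
    kind = node["kind"]
    if kind == "leaf":
        return node["value"]
    if kind == "max":                       # my action: pick the best EV
        return max(miximax(c) for c in node["children"])
    # mix node: probability-weighted average over (prob, child) pairs
    return sum(p * miximax(c) for p, c in node["children"])

# fold -> 0; call -> 0.5/0.5 chance over -1 and 3 (EV 1);
# raise -> opponent folds 0.6 / calls 0.3 / raises 0.1 over 4, 2, 0 (EV 3).
tree = {"kind": "max", "children": [
    {"kind": "leaf", "value": 0},                          # fold
    {"kind": "mix", "children": [                          # call
        (0.5, {"kind": "leaf", "value": -1}),
        (0.5, {"kind": "leaf", "value": 3})]},
    {"kind": "mix", "children": [                          # raise
        (0.6, {"kind": "leaf", "value": 4}),
        (0.3, {"kind": "leaf", "value": 2}),
        (0.1, {"kind": "leaf", "value": 0})]},
]}
print(miximax(tree))  # -> 3.0: raising maximizes expected value
```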
Short Experiment

Opponent Model
- Set of probability trees (Weka's M5')
- Separate models for: actions; hand cards at showdown
- Example: fold probability as a function of nbAllPlayerRaises
- Can also be relational: Tilde probability tree [Ponsen08]

Opponent Ranks
- Learn the distribution of hand ranks at showdown
[Charts: probability per rank bucket; probability per number of raises]
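To make the "fold probability" component concrete, here is a hand-written stand-in for one learned probability tree. The real models are M5' model trees learned with Weka; the thresholds and frequencies below are made up for illustration:

```python
# Hypothetical probability tree for one opponent-model component:
# the opponent's fold probability given nbAllPlayerRaises.
# Each leaf holds an (invented) observed fold frequency for its region.

def fold_probability(nb_all_player_raises: int) -> float:
    if nb_all_player_raises == 0:
        return 0.10        # unraised pots are rarely folded
    if nb_all_player_raises <= 2:
        return 0.35
    return 0.60            # heavy raising: this opponent usually folds

print(fold_probability(3))  # -> 0.6
```

During search, these per-action probabilities are exactly the weights attached to the opponent's branches in the mix nodes of the miximax tree.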
Traversing the Tree
- Limit Texas hold'em: ~10^18 nodes, fully traversable
- No-limit: > 10^71 nodes, too large to traverse; sampled, not searched: Monte-Carlo Tree Search
Monte-Carlo Tree Search [Chaslot08]

Selection
- UCT (multi-armed bandit): in each node, select the child i maximizing
  v_i + C * sqrt(ln(n_p) / n_i)
  where v_i is an estimate of the reward (the exploitation term), n_i is the number of samples of child i, and n_p is the parent's sample count (the square-root term drives exploration)
- Alternative selection strategy: CrazyStone
Expansion & Simulation

Backpropagation (of the reward estimate v_i and sample count n_i)
- Sample-weighted average
- Maximum child
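The selection and backpropagation steps above can be sketched compactly. A minimal UCT sketch in Python, assuming standard UCT as cited from [Chaslot08]; the `Node` class and its fields are illustrative, not the bot's actual data structures:

```python
import math

# Minimal sketch of the MCTS pieces described above: UCT selection and
# sample-weighted backpropagation.

class Node:
    def __init__(self):
        self.children = []   # list of Node
        self.value = 0.0     # running estimate of the reward (v_i)
        self.visits = 0      # number of samples (n_i)

def uct_select(parent: "Node", c: float = 1.4) -> "Node":
    def score(child):
        if child.visits == 0:
            return float("inf")          # try unvisited children first
        # exploitation term + exploration term
        return child.value + c * math.sqrt(math.log(parent.visits) / child.visits)
    return max(parent.children, key=score)

def backpropagate(path, reward):
    # Sample-weighted average: v <- v + (r - v) / n
    for node in path:
        node.visits += 1
        node.value += (reward - node.value) / node.visits

# Demo: a well-explored good child vs. a barely sampled child.
root, a, b = Node(), Node(), Node()
a.value, a.visits = 1.0, 5     # decent estimate, 5 samples
b.value, b.visits = 0.0, 1     # poor estimate, but only 1 sample
root.children, root.visits = [a, b], 10
print(uct_select(root) is b)   # -> True: the exploration term favors b
```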
MCTS Bot: initial experiments
- 1 MCTS bot vs. 2 rule-based bots: exploitative!
Outline
- Overview of approach: the poker game tree, opponent model, Monte-Carlo tree search
- Research challenges
  - Search: uncertainty in MCTS, continuous action spaces
  - Opponent model: online learning, concept drift
- Conclusion
MCTS for games with uncertainty? Expected reward distributions (ERD) [VandenBroeck09]
- Sample selection using ERD
- Backpropagation of ERD
Expected Reward Distribution
- Estimating: the distribution of the reward estimate narrows as samples accumulate (10 samples, 100 samples, …, ∞ samples)
- MiniMax: with infinitely many samples the distribution collapses to the minimax value; its variance comes from sampling only
- ExpectiMax/MixiMax: the reward remains a distribution even with infinitely many samples; its variance combines uncertainty (chance, opponent model) and sampling, with the sampling part shrinking in the number of samples T(P) taken in subtree P
[Plots: estimated reward distributions at 10, 100, and infinitely many samples, for minimax and for expectimax/miximax trees]
ERD Selection Strategy
- Objective: find the maximum expected reward
- Sample more in subtrees with (1) a high expected reward and (2) an uncertain estimate
- UCT does (1) but not really (2); CrazyStone does (1) and (2), but for deterministic games (Go)
- UCT+ selection combines (1) an estimate of the "expected value under perfect play" with (2) a "measure of uncertainty due to sampling"
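A minimal sketch of a UCT+-style score, assuming term (2) is taken as the standard deviation of the sample mean (the experiments slide mentions "UCT+ (stddev)"); the function name and parameters are illustrative:

```python
import math

# UCT+-style selection score: prefer children with a high estimated
# value (1) and high remaining uncertainty in that estimate (2).
# Here (2) is the standard error of the mean, which shrinks as 1/sqrt(n).

def uct_plus_score(mean, var, visits, c=1.0):
    stderr = math.sqrt(var / max(visits, 1))
    return mean + c * stderr

scores = {
    "A": uct_plus_score(mean=3.0, var=1.0, visits=100),  # well explored
    "B": uct_plus_score(mean=2.8, var=4.0, visits=4),    # still uncertain
}
print(max(scores, key=scores.get))  # -> B: its estimate is still uncertain
```

Unlike plain UCT, which keeps revisiting a child merely because it has few visits, this score stops exploring a child once its estimate is tight, even if its visit count is low.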
ERD Max-Distribution Backpropagation
Example: a max node P with children A (estimate 3) and B (estimate 4)
- Sample-weighted average: 3.5
- Maximum child: 4
- Max-distribution: 4.5 ("When the game reaches P, we'll have more time to find the real max")
Why the max-distribution exceeds the maximum child: with P(A > 4) = 0.2 and P(B > 4) = 0.5,
P(max(A,B) <= 4) = P(A <= 4) * P(B <= 4) = 0.8 * 0.5 = 0.4,
so P(max(A,B) > 4) = 0.6 > 0.5: the maximum is more likely above 4 than below it.
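The slide's probability computation can be checked directly, assuming (as the independence of subtrees suggests) that the children's reward distributions are independent:

```python
# Reproduces the slide's max-distribution argument:
#   P(A > 4) = 0.2  and  P(B > 4) = 0.5, with A and B independent.
p_A_gt4, p_B_gt4 = 0.2, 0.5

# max(A, B) is at most 4 only if both children are at most 4:
p_max_le4 = (1 - p_A_gt4) * (1 - p_B_gt4)   # 0.8 * 0.5 = 0.4
p_max_gt4 = 1 - p_max_le4                   # 0.6 > 0.5

print(p_max_gt4)  # -> 0.6
```

Since the maximum exceeds 4 with probability 0.6, the max-distribution estimate (about 4.5 on the slide) lands above both the sample-weighted average (3.5) and the maximum child (4).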
Experiments
- 2 MCTS bots: max-distribution vs. sample-weighted backpropagation
- 2 MCTS bots: UCT+ (stddev) vs. UCT selection
Dealing with Continuous Actions
- Sample discrete actions
- Progressive unpruning [Chaslot08] (ignores the smoothness of the EV function over relative bet size)
- ...
- Tree learning search (work in progress)

Tree Learning Search
- Based on regression tree induction from data streams: training examples arrive quickly, nodes split when a split gives a significant reduction in stddev, and training examples are immediately forgotten
- Edges in the TLS tree are not actions but sets of actions, e.g., (raise in [2,40]), (fold or call)
- MCTS provides a stream of (action, EV) examples; split action sets to reduce the stddev of EV (when significant)
Tree Learning Search example
- A max node starts with children {Fold, Call} and (Bet in [0,10]), each with an unknown EV
- The optimal split is found at 4: (Bet in [0,10]) splits into (Bet in [0,4]) and (Bet in [4,10])
- Each node has an EV estimate, which generalizes over actions
- Selection phase: sample a concrete action, e.g., bet 2.4 (one action of P1, one action of P2)
- Expansion: the expanded node represents any action of P3
- Backpropagation: a new sample arrives and a split becomes significant
Online Learning of the Opponent Model
- Start from a (safe) model of a general opponent
- Exploit weaknesses of the specific opponent
- Start to learn a model of the specific opponent (exploration of opponent behavior)

Multi-Agent Interaction
- Yellow learns a model for Blue and changes strategy
- Yellow doesn't profit!
- Green profits without changing strategy!!
Concept Drift
- While learning from a stream, the concept generating the training examples changes; in an opponent model: the opponent changes strategy
- "Changing gears is not just about bluffing, it's about changing strategy to achieve a goal."
- Learning with concept drift must adapt quickly to changes, yet be robust to noise (and recognize recurrent concepts)
Basic Approach to Concept Drift
- Maintain a window of training examples: large enough to learn, small enough to adapt quickly without 'old' concepts
- Heuristics to adjust the window size, based on the FLORA2 framework [Widmer92]
[Plots: accuracy and window size over time for the 4 components of a single opponent model; after online learning starts and concept drift occurs, badly chosen heuristic parameters make the approach NOT ROBUST]
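A minimal sketch of a FLORA2-style window heuristic, with made-up thresholds (the point of the slide: parameters like these are exactly what can make the heuristic react to noise instead of drift):

```python
# Sketch of a FLORA2-style window-size heuristic: shrink the training
# window when accuracy drops (suspected strategy change), grow it while
# accuracy is stable. All thresholds are illustrative.

def adjust_window(window_size, accuracy, prev_accuracy,
                  drop_threshold=0.1, min_size=20, max_size=500):
    if prev_accuracy - accuracy > drop_threshold:
        return max(min_size, window_size // 2)   # drift suspected: forget
    return min(max_size, window_size + 10)       # stable: keep learning

size = 200
size = adjust_window(size, accuracy=0.55, prev_accuracy=0.80)  # big drop
print(size)  # -> 100: half the window is discarded
```

With `drop_threshold` too low, ordinary poker variance (a run of bad cards) looks like drift and the window collapses; too high, and a genuine change of gears goes unnoticed, which is the robustness problem shown in the plots.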
Conclusions
- First exploitative poker bot for no-limit hold'em with > 2 players
- Apply to other games: backgammon, computational pool, ...
- Challenges for MCTS: games with uncertainty, continuous action spaces
- Challenges for ML: online learning, concept drift, (relational learning)
Thanks for listening!