The exploration-exploitation trade-off

Pantelis Pipergias Analytis
Cornell University
February 5, 2018

Outline
- Exploration-exploitation problems
- The multi-armed bandit framework
- Strategies
- Contextual bandits
- Results from a real-world experiment
- Conclusions
Examples of exploration and exploitation in real life
1. Going to your favorite restaurant/bar vs. trying a new one.
2. Listening to music from a band you love vs. discovering new ones.
3. Preparing a meal that you have made successfully in the past and enjoyed vs. cooking up something new.
4. Reading a newspaper article from a journalist you like vs. reading something from a newcomer in the field.
5. A chimpanzee foraging in a new territory with unknown food resources as opposed to the known home territory.
6. An organization trying a new organizational structure vs. a decently working existing one.
Multi-armed bandit (MAB) problem

[Figure: two options with normally distributed payoffs. Option 1 pays out according to N(µ1, σ1), here N(12, 3); its payoffs are still unknown ("??"). Option 2 pays out according to N(µ2, σ2), here N(15, 3), with observed draws such as 7.4 and 17.3.]
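The two-option example above can be sketched in code. This is an illustrative environment (not from the slides): each pull of an arm returns a draw from that arm's normal payoff distribution, and the decision maker only ever sees the draws, never the parameters.

```python
import random

class GaussianBandit:
    """Two-armed Gaussian bandit as in the slide's example:
    pulling arm i returns one draw from N(mu_i, sigma_i);
    the means are unknown to the decision maker."""

    def __init__(self, means=(12.0, 15.0), sds=(3.0, 3.0), seed=0):
        self.means = means
        self.sds = sds
        self.rng = random.Random(seed)

    def pull(self, arm):
        # Reward: one draw from the chosen arm's payoff distribution.
        return self.rng.gauss(self.means[arm], self.sds[arm])
```

With enough pulls the sample means approach 12 and 15, but any single draw (like the 7.4 above) can be badly misleading about which arm is better, which is exactly why the trade-off arises.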
History of the problem
1. "The [MAB] problem was formulated during the war, and efforts to solve it so sapped the energies and minds of Allied scientists that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage." – Whittle (1980)
2. The first papers and strategies on the topic were written by Thompson (1933) and Robbins (1952).
3. Bellman and Gittins provided backward-looking and forward-looking solutions to the problem.
4. Today the MAB framework is behind numerous algorithms used in the online world.
5. Note the similarities to the search problem considered last week: the two problems fold into each other.
Domains where MABs have been applied
1. Developing new medicine: clinical trials.
2. One of the steam engines for studying human (and animal) learning.
3. A very general framework for autonomous AI decision making; used as an alternative to A/B testing.
4. Currently used to allocate ads on the web. Companies like Criteo rely heavily on this framework.
5. Used to decide which learning algorithm to use in a specific context.
6. Used to model how companies might choose among organizational structures or technologies of unknown merit.
Different strategies for coping with the multi-armed bandit problem
- Go optimal: not always possible, and often computationally very expensive.
- Go greedy: always try the best alternative.
- Add some noise: randomize once in a while (ε-greedy).
- When randomizing, choose options with higher expected return with higher probability (softmax).
- Probability matching: choose actions according to their probability of being the best.
- Optimism in the face of uncertainty: prefer actions that are more uncertain, as they may turn out to be really good.
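The "add some noise" idea can be made concrete with a minimal ε-greedy sketch (an illustration, not code from the slides). The `pull` argument is an assumed callback that returns the reward of the chosen arm.

```python
import random

def epsilon_greedy(pull, n_arms, n_trials=1000, epsilon=0.1, seed=0):
    """Mostly exploit the arm with the best sample mean; with
    probability epsilon, explore an arm chosen uniformly at random."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    rewards = []
    for _ in range(n_trials):
        if 0 in counts:
            arm = counts.index(0)            # try every arm once first
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)      # explore
        else:
            means = [s / c for s, c in zip(sums, counts)]
            arm = means.index(max(means))    # exploit the current best
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        rewards.append(r)
    return rewards, counts
```

Larger ε means more exploration; ε = 0 recovers the pure greedy strategy from the list above.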
The Gittins index (Christian and Griffiths, ch. 2)
- Possible to calculate for Bernoulli bandits with stable discounting of future trials.
The softmax rule
- Biases exploration towards the more promising actions.
- The softmax rule assigns choice probabilities according to the options' estimated values:

P(C(t) = j) = exp(θ E_j(t)) / Σ_{k=1}^{K} exp(θ E_k(t))

where θ is a temperature parameter controlling how strongly the algorithm favors the options with higher estimated value.
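The formula above translates directly into code. A minimal sketch (the stability trick of subtracting the maximum is standard practice, not something the slides discuss):

```python
import math
import random

def softmax_probs(values, theta=1.0):
    """The slide's rule: P(C(t)=j) = exp(theta*E_j) / sum_k exp(theta*E_k).
    theta = 0 gives uniform choice; large theta approaches greedy."""
    m = max(values)                      # subtract max for numerical stability
    w = [math.exp(theta * (v - m)) for v in values]
    z = sum(w)
    return [x / z for x in w]

def softmax_choice(values, theta=1.0, rng=random):
    """Sample one arm index from the softmax distribution."""
    probs = softmax_probs(values, theta)
    u, acc = rng.random(), 0.0
    for j, p in enumerate(probs):
        acc += p
        if u <= acc:
            return j
    return len(values) - 1
```

For example, with estimated values 12 and 15 and θ = 2, nearly all probability mass goes to the second option, while θ = 0 splits it 50/50.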
Optimism in the face of uncertainty and the upper confidence bound (UCB)
- The more uncertain you are about the value of an option, the more important it is to explore it.
- That option could turn out to be really good and, in the long term, improve your overall utility.
- UCB: P(C = i) ∝ exp(θ m_i + α √var_i)
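The soft UCB rule on this slide (an exponentiated mean plus an uncertainty bonus) can be sketched as follows; θ and α are free parameters, and the example values are arbitrary.

```python
import math

def soft_ucb_probs(means, variances, theta=1.0, alpha=1.0):
    """The slide's rule: P(C = i) ∝ exp(theta * m_i + alpha * sqrt(var_i)).
    The sqrt(var_i) term is an uncertainty bonus, so less-explored
    (more uncertain) arms receive extra choice probability."""
    scores = [theta * m + alpha * math.sqrt(v) for m, v in zip(means, variances)]
    mx = max(scores)                       # stabilize the exponentials
    w = [math.exp(s - mx) for s in scores]
    z = sum(w)
    return [x / z for x in w]
```

With equal means but unequal uncertainty, e.g. `soft_ucb_probs([10, 10], [1, 9])`, the more uncertain arm gets the larger share of the choice probability, which is the "optimism" at work.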
UCB against ε-greedy

[Figure: performance comparison of UCB and ε-greedy; not recoverable from the transcript.]
Probability matching, changing environments and Thompson sampling
- Probability matching suggests sampling alternatives according to their rewards or their probability of being the best.
- Thompson sampling is an implementation of the probability matching principle.
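A minimal Thompson sampling sketch, assuming Bernoulli (0/1) rewards with Beta posteriors; the slides do not fix a reward model, so this particular conjugate setup is an illustrative choice, and `pull` is an assumed callback.

```python
import random

def thompson_bernoulli(pull, n_arms, n_trials=1000, seed=0):
    """Thompson sampling for Bernoulli rewards: keep a Beta(a_i, b_i)
    posterior per arm, draw one sample from each posterior, and play
    the arm with the highest draw. Each arm is thereby chosen with
    (roughly) its posterior probability of being the best arm."""
    rng = random.Random(seed)
    a = [1] * n_arms    # Beta(1, 1) prior: 1 + observed successes
    b = [1] * n_arms    # 1 + observed failures
    counts = [0] * n_arms
    for _ in range(n_trials):
        draws = [rng.betavariate(a[i], b[i]) for i in range(n_arms)]
        arm = draws.index(max(draws))
        r = pull(arm)               # reward must be 0 or 1
        a[arm] += r
        b[arm] += 1 - r
        counts[arm] += 1
    return counts, a, b
```

For changing environments, a common tweak is to discount old evidence, e.g. decaying each arm's a and b back toward the prior every trial so the posterior can track a drifting best arm.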
Collective exploration
- Rogers' paradox: produce or scrounge?
- The social learning tournament – Rendell et al. (2010)
- Counter-intuitive more-or-less effects – Toyokawa et al. (2014)
The typical bandit setting is like blind tasting...
My grandma’s problem: Choosing the best place to swim
The machine learner’s problem

Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010, April). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web.
A contextual bandit experiment
A contextual bandit experiment: Results
Contextual multi-armed bandit (CMAB) problem

[Figure: two options whose payoffs depend on observable features. Option 1 pays off according to N(f(·), σ1), here N(w1x1 + w2x2, σ); Option 2 according to N(f(·), σ2), here N(w1x1 + w2x2, σ).]
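The linear-payoff setting in the figure can be sketched as follows. The weights (5, 3) and noise level are arbitrary illustrative choices; the point is that a learner who regresses observed payoffs on the features can evaluate options it has never tried, which a plain MAB learner cannot do.

```python
import random

# Hypothetical CMAB environment matching the slide: each option is a
# feature vector (x1, x2), and its payoff is a draw from
# N(w1*x1 + w2*x2, sigma). The weights are unknown to the learner.
TRUE_W = (5.0, 3.0)
SIGMA = 1.0

def payoff(option, rng):
    x1, x2 = option
    return rng.gauss(TRUE_W[0] * x1 + TRUE_W[1] * x2, SIGMA)

def fit_weights(X, y):
    """Ordinary least squares for two features (no intercept),
    solved via the 2x2 normal equations."""
    s11 = sum(x1 * x1 for x1, _ in X)
    s12 = sum(x1 * x2 for x1, x2 in X)
    s22 = sum(x2 * x2 for _, x2 in X)
    b1 = sum(x1 * yi for (x1, _), yi in zip(X, y))
    b2 = sum(x2 * yi for (_, x2), yi in zip(X, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)
```

Fitting on a few hundred (option, payoff) observations recovers the underlying weights, so the estimated function generalizes across the whole feature space.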
Realistic decision problem...
Motivation

Why is the CMAB problem interesting?
1. It better captures the important characteristics of decisions in the wild.
2. We can study how function learning interacts with decision making, how people deal with novelty, and transfer of learning.
3. TD(λ) and the curse of dimensionality: function learning as a solution. These problems are notoriously hard to solve using optimization techniques.
4. There is no realistic framework within which we can study how people learn their preferences. CMAB might provide us with one.
CMAB task
MAB task
One-shot choices in the test phase

Three alternatives:
- Dominating: highest function value.
- Neutral: middle function value.
- Dominated: lowest function value.
Experimental Design

Training phase
- Between-subjects design: CMAB or MAB.
- Contextual multi-armed bandit (CMAB) task: two informative features are visually displayed.
- Classic multi-armed bandit (MAB) task: control group; features are not visible.
- 20 alternatives, 100 trials.

Test phase
- Designed to test functional knowledge.
- One-shot choices, no outcome feedback.
- 3 arms in 70 trials.
Gaussian process (GP) based “optimal” solutions
- Goal: simultaneously learn and optimize an unknown function.
- 376 participants – Amazon Mechanical Turk – monetary payoffs.

Test phase
- Test items for the mixed linear function are the same as for the positive linear one.
- Special items for the quadratic function test whether people detected the nonlinear nature of the relationship.
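The slides do not spell out the GP-based solution, so here is a minimal sketch of the standard recipe: GP regression to learn the unknown function, combined with a UCB acquisition rule to decide where to sample next. The RBF kernel, length scale, noise level, and β are all assumed hyperparameters.

```python
import numpy as np

def rbf(A, B, length_scale=0.2, var=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X, y, Xstar, noise=1e-4):
    """Standard GP regression: posterior mean and variance at Xstar
    given observations (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xstar)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y
    var = np.diag(rbf(Xstar, Xstar) - Ks.T @ Kinv @ Ks)
    return mu, np.maximum(var, 0.0)

def gp_ucb_pick(X, y, candidates, beta=2.0):
    """GP-UCB acquisition: choose the candidate maximizing
    posterior mean + beta * posterior standard deviation."""
    mu, var = gp_posterior(X, y, candidates)
    return int(np.argmax(mu + beta * np.sqrt(var)))
```

The posterior variance is small near observed points and grows away from them, so the UCB score naturally balances exploiting regions known to be good against exploring regions the learner knows little about.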
Exploration in the feature space, first 10 trials

[Figure: heatmaps of choice proportions over binned feature values (Feature 1 × Feature 2; bins (0.1,0.3], (0.3,0.5], (0.5,0.7], (0.7,0.9]) for four panels: MAB mixed, CMAB mixed, MAB quadratic, CMAB quadratic. Color scale: proportion 0.050–0.100. Additional panels: mean choice rank, one-shot choices.]
Exploration in the feature space, all trials
[Figure: heatmaps of choice proportions over binned Feature 1 × Feature 2 values ((0.1,0.3] to (0.7,0.9]), in four panels: MAB mixed, CMAB mixed, MAB quadratic, CMAB quadratic; color scale shows proportion (0.1–0.3)]
How much do people rely on knowledge of the relationships between features and alternative values when making decisions?
Can we model people's behavior using traditional machine-learning models?
How do priors about functional relationships affect decision making?
Do people explore the choice set strategically, in order to learn the relationships?
Experiment 2 – Function learning pretraining
Exploration to learn the function should depend on...
Uncertainty about the function.
Type of function.
Horizon.
Expecting need for generalization.
Training phase
Mixed design – two between-subjects factors: type of function (positive linear vs. quadratic) × horizon (100 or 30 trials in the CMAB phase), and one within-subjects factor (with or without a function-learning phase).
Function learning task – 100 trials with a single alternative, same two features and function, accuracy incentivized.
Same positive linear and quadratic functions as before, but alternatives now include randomly drawn intercepts!
425 participants – Amazon Turk – monetary payoffs.
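The role of the randomly drawn intercepts can be sketched as follows; the intercept range and payoff weights are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.uniform(0, 1, (20, 2))    # same two features as before
intercepts = rng.uniform(-2, 2, 20)      # assumed range, one per alternative

# Hypothetical linear payoffs with per-alternative intercepts.
values = intercepts + 5 * features[:, 0] + 5 * features[:, 1]

# With intercepts, the alternative with the best features need not be the
# best alternative, so feature knowledge alone cannot identify the optimum.
best_by_features = int(np.argmax(features.sum(axis=1)))
best_overall = int(np.argmax(values))
```

This forces participants to combine what they learned about the function with trial-by-trial feedback, rather than relying on the features alone.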
Summary
People learn the function and generalize their knowledge to new decision situations.
But there are inter-individual differences – some people rely on learning the function, others are naive learners; akin to model-based vs. model-free RL.
A new flavour of the exploration-exploitation trade-off – evidence that people simultaneously learn and optimize the function.
Priors about the functional relationship can hurt performance.
People do not seem to take the time horizon into account.
People exploit more aggressively when they have been pre-trained on the function.
Challenges and future directions
The goal is to develop a function-learning-based RL model at the algorithmic level.
Moreover, it is difficult to fit function learning models without prediction data; however, asking for predictions along with choices changes behaviour.
How do people behave in the presence of information about the alternatives and other contextual information?
Acknowledgments
Funding:
FPU grant, Ministry of Education, Culture and Sports, Spain
Max Planck Institute for Human Development, Berlin
Barcelona Graduate School of Economics
Quadratic function – An illustration
[Figure: illustration of the quadratic function used in the experimental design]
Individual behaviour in the training phase – Experiment 1a
[Figure: rank of the chosen alternative (1–20) over 100 trials for subject e2-0124, CMABn condition, LowNoise experiment; mean choice rank indicated]
Individual behaviour in the training phase – Experiment 1a
[Figure: rank of the chosen alternative (1–20) over 100 trials for subject e2-0065, CMABn condition, LowNoise experiment; mean choice rank indicated]
Mean choice rank – Lab replication
[Figure: mean rank of the chosen alternative across 5 blocks for the MAB and CMAB conditions, with a random-performance baseline; replication of Exp 1a]
One-shot choices – Lab replication
[Figure: proportions of choices by rank of the chosen alternative (1–3) in the CMABn test conditions Diff/Extra, Diff/Inter, Easy/Extra, Easy/Inter, and Weight test]