The exploration-exploitation trade-off

Pantelis Pipergias Analytis
Cornell University
February 5, 2018

Outline
- Exploration-exploitation problems
- The multi-armed bandit framework
- Strategies
- Contextual bandits
- Results from a real-world experiment
- Conclusions
Examples of exploration and exploitation in real life
1. Going to your favorite restaurant/bar vs. trying a new one.
2. Listening to music from a band you love vs. discovering new ones.
3. Preparing a meal that you have made successfully in the past and enjoyed vs. cooking up something new.
4. Reading a newspaper article from a journalist you like vs. reading something from a newcomer in the field.
5. A chimpanzee foraging in a new territory with unknown food resources as opposed to the known home territory.
6. An organization trying a new organizational structure vs. a decently working existing one.
Multi-armed bandit (MAB) problem

[Figure: two options with normally distributed payoffs. Option 1 pays out according to N(µ1, σ1), here N(12, 3); its payoffs are still unknown ("??"). Option 2 pays out according to N(µ2, σ2), here N(15, 3), with observed draws such as 7.4 and 17.3.]
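The two-option example above can be sketched in code. This is an illustrative environment (not from the slides): each pull of an arm returns a draw from that arm's normal payoff distribution, and the decision maker only ever sees the draws, never the parameters.

```python
import random

class GaussianBandit:
    """Two-armed Gaussian bandit as in the slide's example:
    pulling arm i returns one draw from N(mu_i, sigma_i);
    the means are unknown to the decision maker."""

    def __init__(self, means=(12.0, 15.0), sds=(3.0, 3.0), seed=0):
        self.means = means
        self.sds = sds
        self.rng = random.Random(seed)

    def pull(self, arm):
        # Reward: one draw from the chosen arm's payoff distribution.
        return self.rng.gauss(self.means[arm], self.sds[arm])
```

With enough pulls the sample means approach 12 and 15, but any single draw (like the 7.4 above) can be badly misleading about which arm is better, which is exactly why the trade-off arises.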
History of the problem
1. "The [MAB] problem was formulated during the war, and efforts to solve it so sapped the energies and minds of Allied scientists that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage." – Whittle (1980)
2. The first papers and strategies on the topic were written by Thompson (1933) and Robbins (1952).
3. Bellman and Gittins provided backward-looking and forward-looking solutions to the problem.
4. Today the MAB framework is behind numerous algorithms used in the online world.
5. Note the similarities to the search problem considered last week: the two problems fold into each other.
Domains where MABs have been applied
1. Developing new medicine: clinical trials.
2. One of the steam engines for studying human (and animal) learning.
3. A very general framework for autonomous AI decision making; used as an alternative to A/B testing.
4. Currently used to allocate ads on the web. Companies like Criteo rely heavily on this framework.
5. Used to decide which learning algorithm to use in a specific context.
6. Used to model how companies might choose among organizational structures or technologies of unknown merit.
Different strategies for coping with the multi-armed bandit problem
- Go optimal: not always possible, and often computationally very expensive.
- Go greedy: always try the best alternative.
- Add some noise: randomize once in a while (ε-greedy).
- When randomizing, choose options with higher expected return with higher probability (softmax).
- Probability matching: choose actions according to their probability of being the best.
- Optimism in the face of uncertainty: prefer actions that are more uncertain, as they may turn out to be really good.
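The "add some noise" idea can be made concrete with a minimal ε-greedy sketch (an illustration, not code from the slides). The `pull` argument is an assumed callback that returns the reward of the chosen arm.

```python
import random

def epsilon_greedy(pull, n_arms, n_trials=1000, epsilon=0.1, seed=0):
    """Mostly exploit the arm with the best sample mean; with
    probability epsilon, explore an arm chosen uniformly at random."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    rewards = []
    for _ in range(n_trials):
        if 0 in counts:
            arm = counts.index(0)            # try every arm once first
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)      # explore
        else:
            means = [s / c for s, c in zip(sums, counts)]
            arm = means.index(max(means))    # exploit the current best
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        rewards.append(r)
    return rewards, counts
```

Larger ε means more exploration; ε = 0 recovers the pure greedy strategy from the list above.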
The Gittins index (Christian and Griffiths, ch. 2)
- Possible to calculate for Bernoulli bandits with stable discounting of future trials.
The softmax rule
- Biases exploration towards the more promising actions.
- The softmax rule assigns choice probabilities according to the options' estimated values:

P(C(t) = j) = exp(θ E_j(t)) / Σ_{k=1}^{K} exp(θ E_k(t))

where θ is a temperature parameter controlling how strongly the algorithm favors the options with higher estimated value.
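The formula above translates directly into code. A minimal sketch (the stability trick of subtracting the maximum is standard practice, not something the slides discuss):

```python
import math
import random

def softmax_probs(values, theta=1.0):
    """The slide's rule: P(C(t)=j) = exp(theta*E_j) / sum_k exp(theta*E_k).
    theta = 0 gives uniform choice; large theta approaches greedy."""
    m = max(values)                      # subtract max for numerical stability
    w = [math.exp(theta * (v - m)) for v in values]
    z = sum(w)
    return [x / z for x in w]

def softmax_choice(values, theta=1.0, rng=random):
    """Sample one arm index from the softmax distribution."""
    probs = softmax_probs(values, theta)
    u, acc = rng.random(), 0.0
    for j, p in enumerate(probs):
        acc += p
        if u <= acc:
            return j
    return len(values) - 1
```

For example, with estimated values 12 and 15 and θ = 2, nearly all probability mass goes to the second option, while θ = 0 splits it 50/50.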
Optimism in the face of uncertainty and the upper confidence bound (UCB)
- The more uncertain you are about the value of an option, the more important it is to explore it.
- That option could turn out to be really good and, in the long term, improve your overall utility.
- UCB: P(C = i) ∝ exp(θ m_i + α √var_i)
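The soft UCB rule on this slide (an exponentiated mean plus an uncertainty bonus) can be sketched as follows; θ and α are free parameters, and the example values are arbitrary.

```python
import math

def soft_ucb_probs(means, variances, theta=1.0, alpha=1.0):
    """The slide's rule: P(C = i) ∝ exp(theta * m_i + alpha * sqrt(var_i)).
    The sqrt(var_i) term is an uncertainty bonus, so less-explored
    (more uncertain) arms receive extra choice probability."""
    scores = [theta * m + alpha * math.sqrt(v) for m, v in zip(means, variances)]
    mx = max(scores)                       # stabilize the exponentials
    w = [math.exp(s - mx) for s in scores]
    z = sum(w)
    return [x / z for x in w]
```

With equal means but unequal uncertainty, e.g. `soft_ucb_probs([10, 10], [1, 9])`, the more uncertain arm gets the larger share of the choice probability, which is the "optimism" at work.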
UCB against ε-greedy

[Figure: performance comparison of UCB and ε-greedy; not recoverable from the transcript.]
Probability matching, changing environments and Thompson sampling
- Probability matching suggests sampling alternatives according to their rewards or their probability of being the best.
- Thompson sampling is an implementation of the probability matching principle.
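A minimal Thompson sampling sketch, assuming Bernoulli (0/1) rewards with Beta posteriors; the slides do not fix a reward model, so this particular conjugate setup is an illustrative choice, and `pull` is an assumed callback.

```python
import random

def thompson_bernoulli(pull, n_arms, n_trials=1000, seed=0):
    """Thompson sampling for Bernoulli rewards: keep a Beta(a_i, b_i)
    posterior per arm, draw one sample from each posterior, and play
    the arm with the highest draw. Each arm is thereby chosen with
    (roughly) its posterior probability of being the best arm."""
    rng = random.Random(seed)
    a = [1] * n_arms    # Beta(1, 1) prior: 1 + observed successes
    b = [1] * n_arms    # 1 + observed failures
    counts = [0] * n_arms
    for _ in range(n_trials):
        draws = [rng.betavariate(a[i], b[i]) for i in range(n_arms)]
        arm = draws.index(max(draws))
        r = pull(arm)               # reward must be 0 or 1
        a[arm] += r
        b[arm] += 1 - r
        counts[arm] += 1
    return counts, a, b
```

For changing environments, a common tweak is to discount old evidence, e.g. decaying each arm's a and b back toward the prior every trial so the posterior can track a drifting best arm.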
Collective exploration
- Rogers' paradox: produce or scrounge?
- The social learning tournament – Rendell et al. (2010)
- Counter-intuitive more-or-less effects – Toyokawa et al. (2014)
The typical bandit setting is like blind tasting...
My grandma’s problem: Choosing the best place to swim
The machine learner’s problem

Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010, April). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web.
A contextual bandit experiment
A contextual bandit experiment: Results
Contextual multi-armed bandit (CMAB) problem

[Figure: two options whose payoffs depend on observable features. Option 1 pays off according to N(f(·), σ1), here N(w1x1 + w2x2, σ); Option 2 according to N(f(·), σ2), here N(w1x1 + w2x2, σ).]
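The linear-payoff setting in the figure can be sketched as follows. The weights (5, 3) and noise level are arbitrary illustrative choices; the point is that a learner who regresses observed payoffs on the features can evaluate options it has never tried, which a plain MAB learner cannot do.

```python
import random

# Hypothetical CMAB environment matching the slide: each option is a
# feature vector (x1, x2), and its payoff is a draw from
# N(w1*x1 + w2*x2, sigma). The weights are unknown to the learner.
TRUE_W = (5.0, 3.0)
SIGMA = 1.0

def payoff(option, rng):
    x1, x2 = option
    return rng.gauss(TRUE_W[0] * x1 + TRUE_W[1] * x2, SIGMA)

def fit_weights(X, y):
    """Ordinary least squares for two features (no intercept),
    solved via the 2x2 normal equations."""
    s11 = sum(x1 * x1 for x1, _ in X)
    s12 = sum(x1 * x2 for x1, x2 in X)
    s22 = sum(x2 * x2 for _, x2 in X)
    b1 = sum(x1 * yi for (x1, _), yi in zip(X, y))
    b2 = sum(x2 * yi for (_, x2), yi in zip(X, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)
```

Fitting on a few hundred (option, payoff) observations recovers the underlying weights, so the estimated function generalizes across the whole feature space.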
Realistic decision problem...
Motivation

Why is the CMAB problem interesting?
1. It better captures the important characteristics of decisions in the wild.
2. We can study how function learning interacts with decision making, how people deal with novelty, and transfer of learning.
3. TD(λ) and the curse of dimensionality: function learning as a solution. These problems are notoriously hard to solve using optimization techniques.
4. There is no realistic framework within which we can study how people learn their preferences. CMAB might provide us with one.
CMAB task
MAB task
One-shot choices in the test phase

Three alternatives:
- Dominating: highest function value.
- Neutral: middle function value.
- Dominated: lowest function value.
Experimental Design

Training phase
- Between-subjects design: CMAB or MAB.
- Contextual multi-armed bandit (CMAB) task: two informative features are visually displayed.
- Classic multi-armed bandit (MAB) task: control group; features are not visible.
- 20 alternatives, 100 trials.

Test phase
- Designed to test functional knowledge.
- One-shot choices, no outcome feedback.
- 3 arms in 70 trials.
Gaussian process (GP) based “optimal” solutions
- Goal: simultaneously learn and optimize an unknown function.
- 376 participants – Amazon Mechanical Turk – monetary payoffs.

Test phase
- Test items for the mixed linear function are the same as for the positive linear one.
- Special items for the quadratic function test whether people detected the nonlinear nature of the relationship.
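The slides do not spell out the GP-based solution, so here is a minimal sketch of the standard recipe: GP regression to learn the unknown function, combined with a UCB acquisition rule to decide where to sample next. The RBF kernel, length scale, noise level, and β are all assumed hyperparameters.

```python
import numpy as np

def rbf(A, B, length_scale=0.2, var=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X, y, Xstar, noise=1e-4):
    """Standard GP regression: posterior mean and variance at Xstar
    given observations (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xstar)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y
    var = np.diag(rbf(Xstar, Xstar) - Ks.T @ Kinv @ Ks)
    return mu, np.maximum(var, 0.0)

def gp_ucb_pick(X, y, candidates, beta=2.0):
    """GP-UCB acquisition: choose the candidate maximizing
    posterior mean + beta * posterior standard deviation."""
    mu, var = gp_posterior(X, y, candidates)
    return int(np.argmax(mu + beta * np.sqrt(var)))
```

The posterior variance is small near observed points and grows away from them, so the UCB score naturally balances exploiting regions known to be good against exploring regions the learner knows little about.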
Exploration in the feature space, first 10 trials

[Figure: heatmaps of choice proportions over binned feature values (Feature 1 × Feature 2; bins (0.1,0.3], (0.3,0.5], (0.5,0.7], (0.7,0.9]) for four panels: MAB mixed, CMAB mixed, MAB quadratic, CMAB quadratic. Color scale: proportion 0.050–0.100. Additional panels: mean choice rank, one-shot choices.]
Exploration in the feature space, all trials
[Figure: heatmaps of choice proportions over binned Feature 1 × Feature 2 values ((0.1,0.3] to (0.7,0.9]), in four panels: MAB mixed, CMAB mixed, MAB quadratic, CMAB quadratic; color scale shows proportion (0.1–0.3)]
How much do people rely on knowledge of the relationships between features and alternative values when making decisions?
Can we model people's behavior using traditional machine-learning models?
How do priors about functional relationships affect decision making?
Do people explore the choice set strategically, in order to learn the relationships?
Experiment 2 – Function learning pretraining
Exploration to learn the function should depend on...
Uncertainty about the function.
Type of function.
Horizon.
Expecting need for generalization.
Training phase
Mixed design – two between-subjects factors: type of function (positive linear vs. quadratic) × horizon (100 or 30 trials in the CMAB phase), and one within-subjects factor (with or without a function-learning phase).
Function learning task – 100 trials with a single alternative, same two features and function, accuracy incentivized.
Same positive linear and quadratic functions as before, but alternatives now include randomly drawn intercepts!
425 participants – Amazon Turk – monetary payoffs.
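The role of the randomly drawn intercepts can be sketched as follows; the intercept range and payoff weights are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.uniform(0, 1, (20, 2))    # same two features as before
intercepts = rng.uniform(-2, 2, 20)      # assumed range, one per alternative

# Hypothetical linear payoffs with per-alternative intercepts.
values = intercepts + 5 * features[:, 0] + 5 * features[:, 1]

# With intercepts, the alternative with the best features need not be the
# best alternative, so feature knowledge alone cannot identify the optimum.
best_by_features = int(np.argmax(features.sum(axis=1)))
best_overall = int(np.argmax(values))
```

This forces participants to combine what they learned about the function with trial-by-trial feedback, rather than relying on the features alone.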
Summary
People learn the function and generalize their knowledge to new decision situations.
But there are inter-individual differences – some people rely on learning the function, others are naive learners; akin to model-based vs. model-free RL.
A new flavour of the exploration-exploitation trade-off – evidence that people simultaneously learn and optimize the function.
Priors about the functional relationship can hurt performance.
People do not seem to take the time horizon into account.
People exploit more aggressively when they have been pre-trained on the function.
Challenges and future directions
The goal is to develop a function-learning-based RL model at the algorithmic level.
Moreover, it is difficult to fit function learning models without prediction data; however, asking for predictions along with choices changes behaviour.
How do people behave in the presence of information about the alternatives and other contextual information?
Acknowledgments
Funding:
FPU grant, Ministry of Education, Culture and Sports, Spain
Max Planck Institute for Human Development, Berlin
Barcelona Graduate School of Economics
Quadratic function – An illustration
[Figure: illustration of the quadratic function used in the experimental design]
Individual behaviour in the training phase – Experiment 1a
[Figure: rank of the chosen alternative (1–20) over 100 trials for subject e2-0124, CMABn condition, LowNoise experiment; mean choice rank indicated]
Individual behaviour in the training phase – Experiment 1a
[Figure: rank of the chosen alternative (1–20) over 100 trials for subject e2-0065, CMABn condition, LowNoise experiment; mean choice rank indicated]
Mean choice rank – Lab replication
[Figure: mean rank of the chosen alternative across 5 blocks for the MAB and CMAB conditions, with a random-performance baseline; replication of Exp 1a]
One-shot choices – Lab replication
[Figure: proportions of choices by rank of the chosen alternative (1–3) in the CMABn test conditions Diff/Extra, Diff/Inter, Easy/Extra, Easy/Inter, and Weight test]