Bayesian Prediction & Estimation

$p(y|x,\theta)$

Hypothesized models, parameterized by $\theta$, map each $x$ value to a probability distribution over $y$ values.

Kruschke, IPAM GSS 2007
Bayesian Prediction & Estimation

$p(y|x,\theta)$, with prior $p(\theta)$

There is a distribution of probabilities regarding values of $\theta$.
Bayesian Prediction & Estimation

For a given $x$, we predict $y$ by marginalizing over parameter values:

$p(y|x) = \int d\theta \; p(y|x,\theta)\, p(\theta)$

For SSE loss, $\hat{y} = \int dy \; y \, p(y|x)$.
Bayesian Prediction & Estimation

For a given $x, y$ pair, we estimate parameters by Bayes' rule:

$p(\theta|y,x) = \dfrac{p(y|x,\theta)\, p(\theta)}{\int d\theta \; p(y|x,\theta)\, p(\theta)}$
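To make the two operations concrete, here is a minimal numerical sketch (my illustration, not from the talk), assuming a toy Bernoulli model in which $\theta$ is a response bias on a discrete grid and $x$ is ignored:

```python
import numpy as np

# Toy model (illustrative assumption): theta is a Bernoulli response bias
# on a grid, with p(y=1|x,theta) = theta and x ignored for simplicity.
theta = np.linspace(0.01, 0.99, 99)            # parameter grid
prior = np.full(theta.size, 1.0 / theta.size)  # uniform p(theta)

def predict(p_theta):
    """Prediction: p(y=1|x) = sum over theta of p(y=1|x,theta) p(theta)."""
    return float(np.sum(theta * p_theta))

def update(p_theta, y):
    """Estimation by Bayes' rule: p(theta|y,x) is proportional to
    p(y|x,theta) p(theta)."""
    like = theta if y == 1 else 1.0 - theta
    post = like * p_theta
    return post / post.sum()

posterior = update(prior, y=1)
print(predict(prior), predict(posterior))  # prediction sharpens after one datum
```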
Bayesian Prediction & Estimation

The formalism doesn't care what it refers to in the world. Suppose that $x$ is a stimulus, $y$ is a response, and $\theta$ is a hypothesis.
Bayesian Prediction

$\hat{y} = \int dy \; y \, p(y|x)$, where $p(y|x) = \int d\theta \; p(y|x,\theta)\, p(\theta)$

Then $\theta$, $p(\theta)$, and $p(y|x,\theta)$ are in (or refer to) the mind; the stimulus $x$ is in the world.
Bayesian Estimation = Learning

$p(\theta|y,x) = \dfrac{p(y|x,\theta)\, p(\theta)}{\int d\theta \; p(y|x,\theta)\, p(\theta)}$
Bayesian Cognition

Prediction: $\hat{y} = \int dy \; y \, p(y|x)$, with $p(y|x) = \int d\theta \; p(y|x,\theta)\, p(\theta)$

Learning: $p(\theta|y,x) = \dfrac{p(y|x,\theta)\, p(\theta)}{\int d\theta \; p(y|x,\theta)\, p(\theta)}$
Not only cognition by Bayes...
Bayesian cognition by others, too
Bayesian Cognition?

[A series of example images; one is credited: Image from Jacob, Litorco & Lee (2004).]
To Ponder:
• For a Bayesian model of “cognitive behavior”, what level of analysis is appropriate?
• If a system is Bayesian at one level of analysis, is it Bayesian at other levels?
Bayesian Cognition?

Marr (1982): Image Intensity → Primal Sketch → 2½D Sketch → 3D Model

Is the overall mapping, from image to 3D model, Bayesian? Is each component Bayesian?
Consider a Chain of Bayesians

Thomas1 ($p(y_1|x_1,\theta_1)$, $p(\theta_1)$) → Thomas2 ($p(y_2|x_2,\theta_2)$, $p(\theta_2)$) → Thomas3 ($p(y_3|x_3,\theta_3)$, $p(\theta_3)$)

Image Intensity → Primal Sketch → 2½D Sketch → 3D Model
Not Parallel Bayesians

Thomas1 ($p(y_1|x_1,\theta_1)$, $p(\theta_1)$), Thomas2 ($p(y_2|x_2,\theta_2)$, $p(\theta_2)$), and Thomas3 ($p(y_3|x_3,\theta_3)$, $p(\theta_3)$) working side by side, each on its own data.
A Chain of Bayesians

$x_1$ → Thomas1 → $y_1 = x_2$ → Thomas2 → $y_2 = x_3$ → Thomas3 → $y_3$, with Thomas$k$ holding $p(y_k|x_k,\theta_k)$ and $p(\theta_k)$.
Not Iterated Bayesians

Thomas, Son of Thomas, and Grandson of Thomas: each generation $k$ has a generative model $p(y_k, x_k|\theta_k)$ and a prior $p(\theta_k)$, and one generation's output pairs become the next generation's training data.
Could Be Generative Bayesians

The same chain, $x_1$ → Thomas1 → Thomas2 → Thomas3 → $y_3$, but with each agent holding a generative model $p(y_k, x_k|\theta_k)$ and prior $p(\theta_k)$. But not pursued here.
The Globally Bayesian Approach

The standard approach: the three heads are conjoined over a joint parameter space, with joint prior $p(\theta_1,\theta_2,\theta_3)$, external input $x_1$, and final output $y_3$.

$p(y_3|x_1,\theta_1,\theta_2,\theta_3) = \int dy_1\, dy_2 \; p(y_3|y_2,\theta_3)\, p(y_2|y_1,\theta_2)\, p(y_1|x_1,\theta_1)$

$p(y_3|x_1) = \int d\theta_1\, d\theta_2\, d\theta_3 \; p(y_3|x_1,\theta_1,\theta_2,\theta_3)\; p(\theta_1,\theta_2,\theta_3)$
The Globally Bayesian Approach

$p(\theta_1,\theta_2,\theta_3|y_3,x_1) = \dfrac{p(y_3|x_1,\theta_1,\theta_2,\theta_3)\; p(\theta_1,\theta_2,\theta_3)}{\int d\theta_1\, d\theta_2\, d\theta_3 \; p(y_3|x_1,\theta_1,\theta_2,\theta_3)\; p(\theta_1,\theta_2,\theta_3)}$
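A minimal sketch of this globally Bayesian update (my illustration), assuming tiny discrete hypothesis sets and a toy noisy-copy likelihood at each stage; note that the sum runs over the joint space $\theta_1 \times \theta_2 \times \theta_3$, which grows multiplicatively with chain length:

```python
import itertools

THETAS = (0.2, 0.8)   # toy hypothesis set: each stage's copy-fidelity

def chain_like(y3, x1, t1, t2, t3):
    # p(y3|x1,t1,t2,t3) = sum_{y1,y2} p(y3|y2,t3) p(y2|y1,t2) p(y1|x1,t1)
    total = 0.0
    for y1 in (0, 1):
        for y2 in (0, 1):
            p1 = t1 if y1 == x1 else 1.0 - t1   # toy noisy-copy stage
            p2 = t2 if y2 == y1 else 1.0 - t2
            p3 = t3 if y3 == y2 else 1.0 - t3
            total += p1 * p2 * p3
    return total

def global_update(joint, y3, x1):
    # Bayes' rule over the JOINT parameter space (theta1, theta2, theta3).
    post = {t: chain_like(y3, x1, *t) * p for t, p in joint.items()}
    z = sum(post.values())
    return {t: p / z for t, p in post.items()}

joint = {t: 1.0 / 8 for t in itertools.product(THETAS, repeat=3)}
joint = global_update(joint, y3=1, x1=1)   # one globally Bayesian trial
```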
Yes, we are all individuals!

$x_1$ → Thomas ($p(y_1|x_1,\theta_1)$, $p(\theta_1)$) → $y_1 = x_2$ → Richard ($p(y_2|x_2,\theta_2)$, $p(\theta_2)$) → $y_2 = x_3$ → Harold ($p(y_3|x_3,\theta_3)$, $p(\theta_3)$) → $y_3$
Locally Bayesian Prediction

Each Bayesian agent (Thomas, Richard, Harold) computes its best prediction and propagates it forward. This process needs integrals over only the individual parameter spaces.
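A sketch of locally Bayesian prediction under the same toy noisy-copy assumptions as above: each agent marginalizes over only its own parameters and hands a point prediction forward.

```python
THETAS = (0.2, 0.8)   # toy hypothesis set shared by all agents

def local_predict(prior_k, x_k):
    # p(y=1|x) = sum over theta of p(y=1|x,theta) p(theta),
    # under the toy noisy-copy likelihood p(y=x|theta) = theta.
    p1 = sum((t if x_k == 1 else 1.0 - t) * p
             for t, p in zip(THETAS, prior_k))
    return 1 if p1 >= 0.5 else 0   # point prediction y_hat_k

chain_priors = [(0.3, 0.7), (0.3, 0.7), (0.3, 0.7)]  # Thomas, Richard, Harold
x = 1                                                # external stimulus x_1
for prior_k in chain_priors:
    x = local_predict(prior_k, x)   # each y_hat_k becomes the next x_{k+1}
y3_hat = x
```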
Locally Bayesian Learning

Harold updates $p(\theta_3|y_3,x_3)$ by Bayes' rule. This involves integrating only over the $\theta_3$ parameter space.
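Continuing the toy example, a sketch of this local update, which integrates over only the top agent's own hypothesis set:

```python
THETAS = (0.2, 0.8)   # same toy hypothesis set as above

def local_update(prior_k, x_k, y_k):
    # Bayes' rule over this agent's OWN parameter space only:
    # p(theta_k|y_k,x_k) is proportional to p(y_k|x_k,theta_k) p(theta_k)
    like = [(t if y_k == x_k else 1.0 - t) for t in THETAS]
    post = [l * p for l, p in zip(like, prior_k)]
    z = sum(post)
    return tuple(p / z for p in post)

posterior_3 = local_update((0.5, 0.5), x_k=1, y_k=1)  # -> (0.2, 0.8)
```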
Locally Bayesian Learning

But how should poor Richard update his beliefs about $\theta_2$? He needs a $y_2$ value to learn about!
Locally Bayesian Learning

Let $y_2 = x_3^* = \arg\max_{x_3} p(y_3|x_3)$.

Harold tells Richard to produce a value that is consistent with Harold's beliefs!
Locally Bayesian Learning

Let $y_2 = x_3^* = \arg\max_{x_3} p(y_3|x_3)$. In practice, one doesn't need to maximize; just get a value of $y_2$ with $p(y_3|y_2) > p(y_3|\hat{y}_2)$.
Locally Bayesian Learning

Likewise, let $y_1 = x_2^* = \arg\max_{x_2} p(y_2|x_2)$.

Richard tells Thomas to produce a value that is consistent with Richard's beliefs!
Locally Bayesian Learning

Let $y_1 = x_2^* = \arg\max_{x_2} p(y_2|x_2)$. In practice, one doesn't need to maximize; just get a value of $y_1$ with $p(y_2|y_1) > p(y_2|\hat{y}_1)$.
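A sketch of this target back-propagation under the same toy assumptions: each superior hands down the input value $x^*$ that best fits its own beliefs, and the inferior treats it as observed data.

```python
THETAS = (0.2, 0.8)   # toy hypothesis set, as above

def p_y_given_x(prior_k, x_k, y_k):
    # p(y_k|x_k) under the agent's current beliefs (toy noisy-copy likelihood)
    return sum((t if y_k == x_k else 1.0 - t) * p
               for t, p in zip(THETAS, prior_k))

def best_input(prior_k, y_k, candidates=(0, 1)):
    # x* = argmax over x of p(y_k|x): the datum handed down as a target
    return max(candidates, key=lambda x: p_y_given_x(prior_k, x, y_k))

harold_prior = (0.1, 0.9)    # Harold mostly believes in faithful copying
richard_prior = (0.1, 0.9)
y2_target = best_input(harold_prior, y_k=1)          # Harold -> Richard
y1_target = best_input(richard_prior, y_k=y2_target) # Richard -> Thomas
```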
Locally Bayesian Learning

Other updating dynamics are possible. For example, first propagate $y_3$ all the way back to the first agent and update $p(\theta_1|y_1,x_1)$; then compute the predicted $\hat{y}_1$; then update $p(\theta_2|y_2,\hat{y}_1)$; and so on.
Locally Bayesian Learning

Each agent is told by its superior to learn a datum that is maximally consistent (or minimally inconsistent) with the superior's current beliefs.
Locally Bayesian Learning

This process protects the superior's beliefs from disconfirmation! The inferior will learn to "distort the data" to avoid disconfirming the superior.
Locally Bayesian Learning (LBL)

LBL preserves current beliefs and creates "epicycles" for new data. Perhaps not perfectly optimal, but then, are real systems?
Put your models where your data are...

• Some real behavior, in the domain of associative learning, to which Locally Bayesian Learning can be applied.
Typical Learning Task

Stimulus presentation and response collection: e.g., the cue words RADIO and OCEAN are presented, and the learner presses F, G, H, or J.
Phenomena Suggestive of Attention in Learning

• Fewer relevant cues → faster learning.
• Intradimensional shifts are faster than extradimensional.
• Attenuated learning after blocking.
• Overshadowing.
• Context-specific attention.
• Highlighting.
• Et cetera!
Highlighting:

Early Training: I.PE → E
Late Training: I.PE → E, I.PL → L
Testing Results: I → ? (E!); PE.PL → ? (L!)

[Network diagram: cues PE, I, PL; outcomes E, L.]
Design: Highlighting

Phase: Cues → Outcome
Initial Training: (2x) I1.PE1 → E1; (2x) I2.PE2 → E2
3:1 base-rate Training: (3x) I1.PE1 → E1; (3x) I2.PE2 → E2; (1x) I1.PL1 → L1; (1x) I2.PL2 → L2
1:3 base-rate Training: (1x) I1.PE1 → E1; (1x) I2.PE2 → E2; (3x) I1.PL1 → L1; (3x) I2.PL2 → L2
Testing: PE.PL → ?, etc.
"Canonical" Design: Highlighting

# Blocks: Cues → Outcome
N1: (2x) I1.PE1 → E1; (2x) I2.PE2 → E2
N2: (3x) I1.PE1 → E1; (3x) I2.PE2 → E2; (1x) I1.PL1 → L1; (1x) I2.PL2 → L2
N1+N2: (1x) I1.PE1 → E1; (1x) I2.PE2 → E2; (3x) I1.PL1 → L1; (3x) I2.PL2 → L2
Testing: PE.PL → ?, etc.

Across the whole design, the frequency of I.PE → E trials equals the frequency of I.PL → L trials.
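As an illustration (the helper names and block counts are mine), one way to assemble a trial sequence for this canonical design; the final comment verifies the equal-frequency property noted above:

```python
import random

def phase_block(n_e, n_l):
    # One block: n_e copies of each early item, n_l copies of each late item.
    trials = ([("I1.PE1", "E1")] * n_e + [("I2.PE2", "E2")] * n_e +
              [("I1.PL1", "L1")] * n_l + [("I2.PL2", "L2")] * n_l)
    random.shuffle(trials)
    return trials

N1, N2 = 4, 4  # illustrative block counts
sequence = [t for _ in range(N1) for t in phase_block(2, 0)]        # initial
sequence += [t for _ in range(N2) for t in phase_block(3, 1)]       # 3:1 phase
sequence += [t for _ in range(N1 + N2) for t in phase_block(1, 3)]  # 1:3 phase
# Per cue set, I.PE->E occurs 2*N1 + 3*N2 + (N1+N2) times and I.PL->L occurs
# N2 + 3*(N1+N2) times; both equal 3*N1 + 4*N2, so overall frequencies match.
```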
An Application: Highlighting while web browsing.

If a page is browsed left-to-right and top-to-bottom, then I.PE → E tends to be encountered before I.PL → L.
Test items: I and PE.PL.

Results: I yields a strong preference for the Early quality; PE.PL yields a strong preference for the Later quality.
An Application: Highlighting of personal attributes.

Early Training: honest(+) & conventional(−) → Fred
Late Training: honest(+) & conventional(−) → Fred; honest(+) & materialistic(−) → Jack

[Diagram: honest(+) plays the role of I; conventional(−) plays PE; materialistic(−) plays PL; Fred and Jack play the outcomes E and L.]

[Figure: Fred and Jack, with likability ratings of 5.60 and 6.47.]
What causes highlighting?

• Can your favorite model of learning account for highlighting?
• How about various Bayesian approaches?
  – The only candidates are Bayesian approaches with sensitivity to time or trial order.
Rational Model (J. R. Anderson, 1990)

• Representation:
  – There are internal clusters that represent subsets of training items.
  – Each cluster has its own set of Dirichlet distributions over beliefs about feature probabilities.
• Learning:
  – Each presented item is assigned to the cluster that is most probable.
  – The Dirichlet parameters of that cluster are updated by Bayes' rule (see the sketch below).
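A much-simplified sketch of this assign-then-update step (my assumptions: binary features, Beta/Dirichlet smoothing with parameter ALPHA, and a fixed set of clusters, omitting the full model's coupling-probability and new-cluster mechanisms):

```python
ALPHA = 1.0   # symmetric Dirichlet/Beta smoothing (illustrative)

def item_prob(cluster, item):
    # Product over features of the cluster's posterior-predictive probability.
    p = 1.0
    for feature, value in item.items():
        counts = cluster["counts"].get(feature, {0: 0, 1: 0})
        total = counts[0] + counts[1]
        p *= (counts[value] + ALPHA) / (total + 2 * ALPHA)
    return p

def learn(clusters, item):
    # Assign the item to the most probable cluster ...
    best = max(clusters, key=lambda c: item_prob(c, item))
    # ... then Bayesian-update that cluster's Dirichlet (here Beta) counts.
    for feature, value in item.items():
        counts = best["counts"].setdefault(feature, {0: 0, 1: 0})
        counts[value] += 1

clusters = [{"counts": {}}, {"counts": {}}]
learn(clusters, {"I": 1, "PE": 1, "E": 1})   # one training item
```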
Kalman Filter (Sutton 1992; Dayan, Kakade, et al. 2000+)

[Network: cue activations $a_1^{cue}, a_2^{cue}$ feed weights $w_1, w_2$ into a linear output $a^{out} = \sum_i w_i\, a_i^{cue}$.]

$p(w) \sim N(w \mid \bar{w}, C) \qquad p(t) \sim N(t \mid a^{out}, v)$
Kalman Filter Updating: Step 1. Linear Dynamics

$\bar{w}^* = D\,\bar{w} \qquad C^* = D\,C\,D^{T} + U$
Kalman Filter Updating: Step 2. Bayesian Learning

$\bar{w}' = \bar{w}^* + C^* a^{cue} \left( a^{cue\,T} C^* a^{cue} + v \right)^{-1} \left( t - a^{out} \right)$

$C' = C^* - C^* a^{cue} \left( a^{cue\,T} C^* a^{cue} + v \right)^{-1} a^{cue\,T} C^*$

with $p(w) \sim N(w \mid \bar{w}^*, C^*)$ and $a^{out} = \sum_i \bar{w}_i^*\, a_i^{cue}$.
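A sketch of one complete trial implementing the two updating steps above; the constants $D$, $U$, $v$ and all numerical values are illustrative:

```python
import numpy as np

def kalman_trial(w_bar, C, a_cue, t, D, U, v):
    # Step 1: linear dynamics  w* = D w,  C* = D C D^T + U
    w_star = D @ w_bar
    C_star = D @ C @ D.T + U
    # Step 2: Bayesian learning toward target t for output a_out = a_cue . w*
    a_out = a_cue @ w_star
    gain = C_star @ a_cue / (a_cue @ C_star @ a_cue + v)  # Kalman gain
    w_new = w_star + gain * (t - a_out)
    C_new = C_star - np.outer(gain, a_cue) @ C_star
    return w_new, C_new

n = 3                                       # weights for cues PE, I, PL
w_bar, C = np.zeros(n), 0.04 * np.eye(n)    # Gaussian prior on w
D, U, v = np.eye(n), 0.01 * np.eye(n), 0.5  # illustrative constants
a_cue = np.array([1.0, 1.0, 0.0])           # PE and I present
w_bar, C = kalman_trial(w_bar, C, a_cue, t=1.0, D=D, U=U, v=v)
```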
Kalman Filter Does Not Show Highlighting:

Symmetric weights:
– The weight from cue I is near zero.
– The weights from PE and PL are equal and opposite.

Explanation of Highlighting:

• Attention rapidly shifts to the distinctive feature of the later-learned outcome.

[Diagrams: the taught network (I and PE → E; I and PL → L) versus the learned network (asymmetric, with I → E and PL → L dominant).]
Models of Attention Shifting: RASHNL (/ALCOVE)
Roughly analogous to Automatic Relevance Determination (ARD) in Radial Basis Function (RBF) networks.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.
Kruschke, J. K., & Johansen, M. K. (1999). A model of probabilistic category learning. Journal of Experimental Psychology: Learning, Memory and Cognition, 25, 1083-1119.

Models of Attention Shifting: EXIT (/ADIT)
Kruschke, J. K. (1996). Base rates in category learning. Journal of Experimental Psychology: Learning, Memory and Cognition, 22, 3-26.
Kruschke, J. K. (2001). Toward a unified model of attention in associative learning. Journal of Mathematical Psychology, 45, 812-863.

Models of Attention Shifting: ATRIUM & POLE
Erickson, M. A., & Kruschke, J. K. (1998). Rules and exemplars in category learning. Journal of Experimental Psychology: General, 127, 107-140.
Kalish, M. L., Lewandowsky, S., & Kruschke, J. K. (2004). Population of linear experts: Knowledge partitioning and function learning. Psychological Review, 111(4), 1072-1099.

Models of Attention Shifting: Locally Bayesian
Kruschke, J. K. (2006). Locally Bayesian learning with applications to retrospective revaluation and highlighting. Psychological Review, 113, 677-699.
Locally Bayesian Learning Implemented in an Attentional Learning Model

Cues (PE, I, PL): input activations $a^{in}$.
Attention (Hidden): $p(a \mid a^{in}, w^{att})$, with prior $w^{att} \sim p(w^{att})$.
Outcome (E): $p(c \mid a, w^{out})$, with prior $w^{out} \sim p(w^{out})$.
Locally Bayesian Learning Implemented in an Attentional Learning Model

$a_i^{in} = 1$ if cue $i$ is present, 0 otherwise.

$p(a_j = 1) = \operatorname{sig}\!\left( \sum_i w_{ji}^{att}\, a_i^{in} \right)$
Locally Bayesian Learning Implemented in an Attentional Learning Model

Hidden activations are attentionally filtered copies of the input activations.

[Figure: example attentional weight hypotheses $w_j^{att}$, with values such as 6, 4, 0, and −4.]
Locally Bayesian Learning Implemented in an Attentional Learning Model

Each combination of weights constitutes a hypothesis. The hypotheses are symmetrically distributed, with a uniform prior.
Locally Bayesian Learning Implemented in an Attentional Learning Model

$p(E = 1) = \operatorname{sig}\!\left( \sum_j w_j^{out}\, a_j \right)$, with $a_j \in \{0,1\}$ and $\hat{a}_j = p(a_j = 1)$.
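A sketch of the two layers' likelihoods with one hypothesized weight matrix and vector (the specific values are illustrative, drawn from the slides' range):

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

a_in = np.array([1.0, 1.0, 0.0])     # cues: PE and I present, PL absent
w_att = np.array([[6.0, 0.0, 0.0],   # rows: attention units for PE, I, PL
                  [0.0, 6.0, -4.0],  # PL (if present) would inhibit I's unit
                  [0.0, 0.0, 6.0]])
w_out = np.array([5.0, 5.0, 0.0])    # outcome weights for outcome E

a_hat = sig(w_att @ a_in)            # hat(a)_j = p(a_j = 1)
p_E = sig(w_out @ a_hat)             # outcome probability under mean attention
print(a_hat, p_E)
```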
Locally Bayesian Learning Implemented in an Attentional Learning Model

The outcome is an arbitrary combination of cues. The prior favors all-zero weights and is symmetrically distributed.

[Figure: example outcome weight hypotheses $w^{out}$, with values such as 0, 5, and −5.]
Highlighting: Prior Distribution

Prior beliefs are symmetric: there are 50-50 beliefs in neutral (0) versus inhibitory (−4) weights from PE and PL to attention on I.
Hypotheses After Initial Learning of PE.I → E

[Figure: surviving weight hypotheses; visible values include outcome weights of 5 from PE and I to E, and attention weights of 6.]
Hypotheses After All Learning, PE.I → E and I.PL → L

Inhibition of I by PL prevents disconfirmation of the previous learning that I → E.

[Figure: surviving weight hypotheses, including an inhibitory weight (−4) from PL to attention on I.]
Highlighting: End of training

• PE does not inhibit attention to I: beliefs in weights from PE to I-attention have shifted toward 0.
• PL does inhibit attention to I: beliefs in weights from PL to I-attention have shifted toward −4.
• Beliefs about I are asymmetric: stronger beliefs in +5 weights than in −5 weights.
• Beliefs about PE and PL are asymmetric: PL beliefs are more extreme than PE beliefs.
Layers of Kalman Filters Applied to Highlighting

[Architecture: cues PE, I, PL feed an attention layer of Kalman filters (weights $w_{PE}, w_I, w_{PL}$ per attention node), which feeds an outcome layer of Kalman filters (weights $w_{PE}, w_I, w_{PL}$ per outcome node) for outcomes E and L.]

Layers of Kalman Filters: Likelihood and Prior Distributions

[Each layer has Gaussian priors on its weights and a Gaussian outcome likelihood, as in the single Kalman filter above.]
Layers of Kalman Filters: Outcome generation

[The input activation vector $x$ is filtered by the attention layer, whose output $y$ drives the outcome layer.]
Layers of Kalman Filters: Target for Attention

The target for the attention layer is the attention vector that is maximally consistent with the outcome layer's current beliefs. (To determine a unique maximum, a tiny cost was included for unequal attention values, and a tiny cost for non-zero attention on an absent cue.)
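A sketch of this target search (the restriction to binary attention vectors, the Gaussian outcome likelihood, and the cost constants are my simplifying assumptions): enumerate candidate attention vectors and score each by outcome consistency minus the tiny costs.

```python
import numpy as np
from itertools import product

def attention_target(w_out_mean, a_in, t, v=1.0, eps=1e-3):
    best, best_score = None, -np.inf
    for alpha in product((0.0, 1.0), repeat=len(a_in)):
        alpha = np.array(alpha)
        log_lik = -(t - alpha @ w_out_mean) ** 2 / (2 * v)  # Gaussian outcome
        cost = eps * np.var(alpha)                # cost for unequal attention
        cost += eps * np.sum(alpha * (1 - a_in))  # cost for absent-cue attention
        if log_lik - cost > best_score:
            best, best_score = alpha, log_lik - cost
    return best

# Illustrative call: PE and I present, outcome target t = 1.
print(attention_target(np.array([0.6, 0.5, -0.2]),
                       np.array([1.0, 1.0, 0.0]), t=1.0))
```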
Layers of Kalman Filters: Dynamics and Bayesian Learning

[Each layer then applies the two Kalman updating steps, linear dynamics followed by Bayesian learning, to its own weights.]
Layers of Kalman Filters Applied to Highlighting: Initial p(w)

[Figure: pairwise weight distributions over $w_{PE}, w_I, w_{PL}$ for Outcome Nodes 1-2 and Attention Nodes 1-3. Initially every weight has mean 0.00 (var = 0.04), except each attention node's own-cue weight, which has mean 1.00 (var = 0.00); all covariances are 0.00.]
Layers of Kalman Filters Applied to Highlighting: Final p(w)

[Figure, after Phase 3, Epoch 3, Trial 4: posterior means for Outcome Node 1 (E) are $w_{PE} = .63$, $w_I = .57$, $w_{PL} = -.16$; for Outcome Node 2 (L), $w_{PE} = -.13$, $w_I = .17$, $w_{PL} = .93$. For Attention Node 2 (attention to I), the weight from PE has mean $-.14$ while the weight from PL has mean $-.68$: the PE and PL weights are unequal (≠), so the final beliefs are asymmetric.]

Inhibition of I by PL prevents disconfirmation of the previous learning that I → E.
Summary

• Locally Bayesian learning was applied to attentional shifts in associative learning, specifically to account for "highlighting" (taught: I.PE → E and I.PL → L; learned: asymmetric associations).
• Different levels of analysis invite the possibility of a chain of Bayesian learners (Thomas, Richard, Harold, each with likelihood $p(y_k|x_k,\theta_k)$ and prior $p(\theta_k)$).
• Locally Bayesian learning prevents disconfirmation of the superior's beliefs and creates distortions in the inferior's beliefs.