Day 1: Probability and speech perception
Day 2: Human sentence parsing
Day 3: Noisy-channel sentence processing
Day 4: Language production & acquisition

Acquisition example (Day 4): from an unsegmented input stream to segmented words to an abstract internal grammar/lexicon:

whatsthat thedoggie yeah wheresthedoggie

whats that the doggie yeah wheres the doggie

Grammar/lexicon
(abstract internal representation)

Computational Psycholinguistics
Day 1

Klinton Bicknell and Roger Levy
Northwestern & UCSD
July 7, 2015

Computational Psycholinguistics

Psycholinguistics deals with the problem of how humans

1. comprehend

2. produce

3. acquire

language.

In this class, we will study these problems from a computational, and especially probabilistic/Bayesian, perspective.

Class goals

Introduce you to the technical foundations of modeling work in the field

Overview the literature and major areas in which computational psycholinguistic research is carried out

Acquaint you with some of the key models and their empirical support

Give you experience in understanding the details of a model from the papers

Give you practice in critical analysis of models

What is computational modeling? Why do we do it?

Any phenomenon involving human behavior is so complex that we cannot hope to formulate a comprehensive theory

Instead, we devise a model that simplifies the phenomenon to capture some key aspect of it

What might we use a model for?

Models can serve any of the following (related) functions:

Prediction: estimating the behavior/properties of a new state/datum on the basis of an existing dataset

Hypothesis testing: as a framework for determining whether a given factor has an appreciable influence on some other variable

Data simulation: creating artificial data more cheaply and quickly than through empirical data collection

Summarization: If phenomenon X is complex but relevant to phenomenon Y, it can be most effective to use a simple model of X when constructing a model of Y

Insight: Most generally, a good model can be explored in ways that give insight into the phenomenon under consideration

Feedback from you

Please take a moment to fill out a sheet of paper with this info:

Name (optional)

School & Program/Department

Year/stage in program

Computational Linguistics background

Psycholinguistics background

Probability/Statistics/Machine Learning background

Do you know about (weighted) finite-state automata?

Do you know about (probabilistic) context-free grammars?

Other courses you’re taking at ESSLLI

(other side) What do you hope to learn in this class?

Today’s content

Foundations of probability theory

Joint, marginal, and conditional probability

Bayes’ Rule

Bayes Nets (a.k.a. directed acyclic graphical models, DAGs)

The Gaussian distribution

A probabilistic model of human phoneme categorization

A probabilistic model of the perceptual magnet effect

Probability spaces

Traditionally, probability spaces are defined in terms of sets. An event E is a subset of a sample space Ω: E ⊂ Ω.

A probability space P on a sample space Ω is a function from events E in Ω to real numbers such that the following three axioms hold:

1. P(E) ≥ 0 for all E ⊂ Ω (non-negativity).
2. If E1 and E2 are disjoint, then P(E1 ∪ E2) = P(E1) + P(E2) (disjoint union).
3. P(Ω) = 1 (properness).

We can also think of these things as involving logical rather than set relations:

Subset         A ⊂ B          A → B
Disjointness   E1 ∩ E2 = ∅    ¬(E1 ∧ E2)
Union          E1 ∪ E2        E1 ∨ E2

A simple example

In historical English, object NPs could appear both preverbally and postverbally:

[VP Verb Object]    [VP Object Verb]

There is a broad cross-linguistic tendency for pronominal objects to occur earlier on average than non-pronominal objects.

So, hypothetical probabilities from historical English:

                        Y: Pronoun    Y: Not Pronoun
X: Object Preverbal     0.224         0.655
X: Object Postverbal    0.014         0.107

We will sometimes call this the joint distribution P(X, Y) over two random variables—here, verb-object word order X and object pronominality Y.

Checking the axioms of probability

1. P(E) ≥ 0 for all E ⊂ Ω (non-negativity).
2. If E1 and E2 are disjoint, then P(E1 ∪ E2) = P(E1) + P(E2) (disjoint union).
3. P(Ω) = 1 (properness).

                        Pronoun    Not Pronoun
Object Preverbal        0.224      0.655
Object Postverbal       0.014      0.107

We can consider the sample space to be

Ω = {Preverbal+Pronoun, Preverbal+Not Pronoun, Postverbal+Pronoun, Postverbal+Not Pronoun}

Disjoint union tells us the probabilities of non-atomic events: If we define

E1 = {Preverbal+Pronoun, Postverbal+Not Pronoun}, then P(E1) = 0.224 + 0.107 = 0.331.

Check for properness: P(Ω) = 0.224 + 0.655 + 0.014 + 0.107 = 1
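
To make the bookkeeping concrete, here is a minimal Python sketch (not part of the original slides) that stores the hypothetical joint table in a dictionary and checks the three axioms numerically; the names joint and E1 are just illustrative.

```python
# A minimal sketch: the hypothetical joint distribution as a dictionary,
# with the three probability axioms checked numerically.
joint = {
    ("Preverbal", "Pronoun"): 0.224,
    ("Preverbal", "Not Pronoun"): 0.655,
    ("Postverbal", "Pronoun"): 0.014,
    ("Postverbal", "Not Pronoun"): 0.107,
}

# Non-negativity: every atomic outcome has probability >= 0.
assert all(p >= 0 for p in joint.values())

# Properness: the probabilities over the whole sample space sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9

# Disjoint union: a non-atomic event's probability is the sum over its atoms.
E1 = {("Preverbal", "Pronoun"), ("Postverbal", "Not Pronoun")}
print(round(sum(joint[outcome] for outcome in E1), 3))  # 0.331
```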

Marginal probability

Sometimes we have a joint distribution P(X, Y) over random variables X and Y, but we're interested in the distribution implied over one of them (here, without loss of generality, X)

The marginal probability distribution P(X) is

P(X = x) = ∑_y P(X = x, Y = y)

Marginal probability: an example

                        Y: Pronoun    Y: Not Pronoun
X: Object Preverbal     0.224         0.655
X: Object Postverbal    0.014         0.107

Finding the marginal distribution on X:

P(X = Preverbal) = P(X = Preverbal, Y = Pronoun) + P(X = Preverbal, Y = Not Pronoun)
                 = 0.224 + 0.655
                 = 0.879

P(X = Postverbal) = P(X = Postverbal, Y = Pronoun) + P(X = Postverbal, Y = Not Pronoun)
                  = 0.014 + 0.107
                  = 0.121

Marginal probability: an example

                        Y: Pronoun    Y: Not Pronoun
X: Object Preverbal     0.224         0.655
X: Object Postverbal    0.014         0.107

So, the marginal distribution on X is

P(X)
Preverbal    0.879
Postverbal   0.121

Likewise, the marginal distribution on Y is

P(Y)
Pronoun       0.238
Not Pronoun   0.762
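
As a small illustration (not from the slides), marginalization is just summing the joint table over the variable we are not interested in; the dictionary representation below is an assumption of the sketch.

```python
# A minimal sketch: computing the marginals P(X) and P(Y) by summing out
# the other variable from the joint table.
from collections import defaultdict

joint = {
    ("Preverbal", "Pronoun"): 0.224,
    ("Preverbal", "Not Pronoun"): 0.655,
    ("Postverbal", "Pronoun"): 0.014,
    ("Postverbal", "Not Pronoun"): 0.107,
}

p_x = defaultdict(float)  # marginal over word order X
p_y = defaultdict(float)  # marginal over pronominality Y
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

print(dict(p_x))  # {'Preverbal': 0.879, 'Postverbal': 0.121} (up to float rounding)
print(dict(p_y))  # {'Pronoun': 0.238, 'Not Pronoun': 0.762} (up to float rounding)
```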

Conditional probability

The conditional probability of event B given that A has occurred/is known is defined as follows:

P(B|A) ≡ P(A,B) / P(A)

Conditional Probability: an example

                        Y: Pronoun    Y: Not Pronoun
X: Object Preverbal     0.224         0.655
X: Object Postverbal    0.014         0.107

P(X): Preverbal 0.879, Postverbal 0.121
P(Y): Pronoun 0.238, Not Pronoun 0.762

How do we calculate the following?

P(Y = Pronoun|X = Postverbal) = P(X = Postverbal, Y = Pronoun) / P(X = Postverbal)
                              = 0.014 / 0.121
                              = 0.116
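
The same calculation in a short Python sketch (again not from the slides): divide the joint cell by the marginal of the conditioning variable.

```python
# A minimal sketch: P(Y = Pronoun | X = Postverbal) = P(X, Y) / P(X).
joint = {
    ("Preverbal", "Pronoun"): 0.224,
    ("Preverbal", "Not Pronoun"): 0.655,
    ("Postverbal", "Pronoun"): 0.014,
    ("Postverbal", "Not Pronoun"): 0.107,
}

p_postverbal = sum(p for (x, _), p in joint.items() if x == "Postverbal")
p_pron_given_postverbal = joint[("Postverbal", "Pronoun")] / p_postverbal
print(round(p_pron_given_postverbal, 3))  # 0.116
```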

The chain rule

A joint probability can be rewritten as the product of marginal and conditional probabilities:

P(E1, E2) = P(E2|E1)P(E1)

And this generalizes to more than two variables:

P(E1, E2) = P(E2|E1)P(E1)
P(E1, E2, E3) = P(E3|E1, E2)P(E2|E1)P(E1)
...
P(E1, E2, ..., En) = P(En|E1, E2, ..., En−1) ... P(E2|E1)P(E1)

Breaking a joint probability down into the product of a marginal probability and several conditional probabilities this way is called chain rule decomposition.
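
As a quick numerical check (not from the slides), running the two-variable chain rule "backwards" on the earlier numbers recovers the joint table entry.

```python
# A minimal sketch: P(X = Postverbal, Y = Pronoun) should equal
# P(Y = Pronoun | X = Postverbal) * P(X = Postverbal).
p_postverbal = 0.121             # marginal from the earlier example
p_pron_given_postverbal = 0.116  # conditional from the earlier example (rounded)

print(round(p_pron_given_postverbal * p_postverbal, 3))  # 0.014, the joint table entry
```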

Bayes' Rule (Bayes' Theorem)

P(A|B) = P(B|A)P(A) / P(B)

With extra "background" random variables I:

P(A|B, I) = P(B|A, I)P(A|I) / P(B|I)

This "theorem" follows directly from the definition of conditional probability:

P(A,B) = P(B|A)P(A)
P(A,B) = P(A|B)P(B)

So

P(A|B)P(B) = P(B|A)P(A)

P(A|B)P(B) / P(B) = P(B|A)P(A) / P(B)

Bayes' Rule, more closely inspected

P(A|B) = P(B|A)P(A) / P(B)

Here P(A|B) is the posterior, P(B|A) the likelihood, P(A) the prior, and P(B) the normalizing constant.

Bayes’ Rule in action

Let me give you the same information you had before:

P(Y = Pronoun) = 0.238

P(X = Preverbal|Y = Pronoun) = 0.941

P(X = Preverbal|Y = Not Pronoun) = 0.860

Imagine you're an incremental sentence processor. You encounter a transitive verb but haven't encountered the object yet. Inference under uncertainty: How likely is it that the object is a pronoun?

Bayes' Rule in action

P(Y = Pronoun) = 0.238
P(X = Preverbal|Y = Pronoun) = 0.941
P(X = Preverbal|Y = Not Pronoun) = 0.860

P(Y = Pron|X = PostV) = P(X = PostV|Y = Pron)P(Y = Pron) / P(X = PostV)

                      = P(X = PostV|Y = Pron)P(Y = Pron) / ∑_y P(X = PostV, Y = y)

                      = P(X = PostV|Y = Pron)P(Y = Pron) / ∑_y P(X = PostV|Y = y)P(Y = y)

                      = P(X = PostV|Y = Pron)P(Y = Pron) / [P(PostV|Pron)P(Pron) + P(PostV|NotPron)P(NotPron)]

                      = (1 − 0.941) × 0.238 / [(1 − 0.941) × 0.238 + (1 − 0.860) × (1 − 0.238)]

                      = 0.116
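
The same inference as a small Python sketch (not part of the slides), starting from the prior over Y and the conditionals for X = Preverbal.

```python
# A minimal sketch: Bayes' Rule for P(Y = Pronoun | X = Postverbal).
p_pron = 0.238
p_prev_given_pron = 0.941
p_prev_given_notpron = 0.860

# Likelihood of the observation X = Postverbal under each hypothesis about Y.
p_postv_given_pron = 1 - p_prev_given_pron
p_postv_given_notpron = 1 - p_prev_given_notpron

numerator = p_postv_given_pron * p_pron
normalizer = numerator + p_postv_given_notpron * (1 - p_pron)
print(round(numerator / normalizer, 3))  # 0.116, matching the table-based answer
```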

Other ways of writing Bayes' Rule

P(A|B) = P(B|A)P(A) / P(B)    (likelihood × prior, divided by the normalizing constant)

The hardest part of using Bayes' Rule was calculating the normalizing constant (a.k.a. the partition function)

Hence there are often two other ways we write Bayes' Rule:

1. Emphasizing explicit marginalization:

   P(A|B) = P(B|A)P(A) / ∑_a P(A = a, B)

2. Ignoring the partition function:

   P(A|B) ∝ P(B|A)P(A)
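
Here is a minimal sketch (not from the slides) of working with the proportional form: compute likelihood × prior for every hypothesis, then divide by their sum only at the end.

```python
# A minimal sketch: the unnormalized posterior, normalized only at the end.
prior = {"Pron": 0.238, "NotPron": 0.762}
likelihood_postverbal = {"Pron": 1 - 0.941, "NotPron": 1 - 0.860}  # P(X = PostV | Y)

unnormalized = {y: likelihood_postverbal[y] * prior[y] for y in prior}
z = sum(unnormalized.values())  # normalizing constant (partition function)
posterior = {y: round(p / z, 3) for y, p in unnormalized.items()}
print(posterior)  # {'Pron': 0.116, 'NotPron': 0.884}
```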

(Conditional) Independence

Events A and B are said to be conditionally independent given information C if

P(A,B|C) = P(A|C)P(B|C)

Conditional independence of A and B given C is often expressed as

A ⊥ B|C

Directed graphical models

A lot of the interesting joint probability distributions in the study of language involve conditional independencies among the variables

So next we'll introduce you to a general framework for specifying conditional independencies among collections of random variables

It won't allow us to express all possible independencies that may hold, but it goes a long way

And I hope that you'll agree that the framework is intuitive too!

A non-linguistic example

Imagine a factory that produces three types of coins in equal volumes:

Fair coins; 2-headed coins; 2-tailed coins.

Generative process: The factory produces a coin of type X and sends it to you; you receive the coin and flip it twice, with H(eads)/T(ails) outcomes Y1 and Y2.

Receiving a coin from the factory and flipping it twice is sampling (or taking a sample) from the joint distribution P(X, Y1, Y2)

This generative process as a Bayes net

The directed acyclic graphical model (DAG), or Bayes net:

X → Y1, X → Y2

Semantics of a Bayes net: the joint distribution can be expressed as the product of the conditional distributions of each variable given only its parents

In this DAG, P(X, Y1, Y2) = P(X)P(Y1|X)P(Y2|X)

X      P(X)
Fair   1/3
2-H    1/3
2-T    1/3

X      P(Y1 = H|X)   P(Y1 = T|X)
Fair   1/2           1/2
2-H    1             0
2-T    0             1

X      P(Y2 = H|X)   P(Y2 = T|X)
Fair   1/2           1/2
2-H    1             0
2-T    0             1
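
To see the Bayes-net semantics at work, here is a minimal sketch (not from the slides) that multiplies the three conditional probability tables into the full joint P(X, Y1, Y2) and checks that it is proper; exact fractions are used to avoid rounding.

```python
# A minimal sketch: the joint distribution defined by the DAG factorization
# P(X, Y1, Y2) = P(X) * P(Y1 | X) * P(Y2 | X).
from fractions import Fraction

p_x = {"Fair": Fraction(1, 3), "2-H": Fraction(1, 3), "2-T": Fraction(1, 3)}
p_flip_given_x = {  # the same conditional table applies to Y1 and Y2
    "Fair": {"H": Fraction(1, 2), "T": Fraction(1, 2)},
    "2-H": {"H": Fraction(1), "T": Fraction(0)},
    "2-T": {"H": Fraction(0), "T": Fraction(1)},
}

joint = {
    (x, y1, y2): p_x[x] * p_flip_given_x[x][y1] * p_flip_given_x[x][y2]
    for x in p_x for y1 in "HT" for y2 in "HT"
}
assert sum(joint.values()) == 1   # the factorization yields a proper distribution
print(joint[("Fair", "H", "T")])  # 1/12 = 1/3 * 1/2 * 1/2
```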

Conditional independence in Bayes nets

X      P(X)
Fair   1/3
2-H    1/3
2-T    1/3

X      P(Y1 = H|X)   P(Y1 = T|X)
Fair   1/2           1/2
2-H    1             0
2-T    0             1

X      P(Y2 = H|X)   P(Y2 = T|X)
Fair   1/2           1/2
2-H    1             0
2-T    0             1

Question:

Conditioned on not having any further information, are the two coin flips Y1 and Y2 in this generative process independent?

That is, if C = ∅, is it the case that Y1 ⊥ Y2|C?

No!

P(Y2 = H) = 1/2 (you can see this by symmetry)

But P(Y2 = H|Y1 = H) = (1/3)(1/2) [coin was fair] + (2/3)(1) [coin was 2-headed] = 5/6
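
The dependence between the two flips can be checked by brute-force enumeration over the joint defined by the Bayes net; the following sketch (not from the slides) reproduces the two numbers above.

```python
# A minimal sketch: P(Y2 = H) versus P(Y2 = H | Y1 = H) by enumerating
# P(X, Y1, Y2) = P(X) * P(Y1 | X) * P(Y2 | X).
from fractions import Fraction

p_x = {"Fair": Fraction(1, 3), "2-H": Fraction(1, 3), "2-T": Fraction(1, 3)}
p_flip = {
    "Fair": {"H": Fraction(1, 2), "T": Fraction(1, 2)},
    "2-H": {"H": Fraction(1), "T": Fraction(0)},
    "2-T": {"H": Fraction(0), "T": Fraction(1)},
}

def prob(event):
    """Total probability of all outcomes (x, y1, y2) satisfying the predicate."""
    return sum(
        p_x[x] * p_flip[x][y1] * p_flip[x][y2]
        for x in p_x for y1 in "HT" for y2 in "HT"
        if event(x, y1, y2)
    )

print(prob(lambda x, y1, y2: y2 == "H"))  # 1/2
print(prob(lambda x, y1, y2: y1 == "H" and y2 == "H")
      / prob(lambda x, y1, y2: y1 == "H"))  # 5/6, so Y1 and Y2 are not independent
```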

Formally assessing conditional independence in Bayes nets

The comprehensive criterion for assessing conditional independence is known as d-separation.

A path between two disjoint node sets A and B is a sequence of edges connecting some node in A with some node in B

Any node on a given path has converging arrows if two edges on the path connect to it and point to it.

A node on the path has non-converging arrows if two edges on the path connect to it, but at least one does not point to it.

A third disjoint node set C d-separates A and B if for every path between A and B, either:

1. there is some node on the path with converging arrows which is not in C; or
2. there is some node on the path whose arrows do not converge and which is in C.

Major types of d-separation

C d-separates A and B if for every path between A and B, either:

1. there is some node on the path with converging arrows which is not in C; or
2. there is some node on the path whose arrows do not converge and which is in C.

[Figure: four small DAGs over nodes A, B, and C illustrating common-cause d-separation, intervening d-separation, explaining away (no d-separation), and d-separation in the absence of knowledge of C]

Back to our example

X → Y1, X → Y2

Without looking at the coin before flipping it, the outcome Y1 of the first flip gives me information about the type of coin, and affects my beliefs about the outcome of Y2

But if I look at the coin before flipping it, Y1 and Y2 are rendered independent

An example of explaining away

"I saw an exhibition about the, uh..."

There are several causes of disfluency, including:

An upcoming word is difficult to produce (e.g., low frequency, astrolabe)

The speaker's attention was distracted by something in the non-linguistic environment

A reasonable graphical model:

W: hard word? → D: disfluency? ← A: attention distracted?

An example of explaining away

W: hard word? → D: disfluency? ← A: attention distracted?

Without knowledge of D, there's no reason to expect that W and A are correlated

But hearing a disfluency demands a cause

Knowing that there was a distraction explains away the disfluency, reducing the probability that the speaker was planning to utter a hard word

An example of the disfluency model

W: hard word? → D: disfluency? ← A: attention distracted?

Let's suppose that both hard words and distractions are unusual, the latter more so

P(W = hard) = 0.25
P(A = distracted) = 0.15

Hard words and distractions both induce disfluencies; having both makes a disfluency really likely

W      A             D = no disfluency    D = disfluency
easy   undistracted  0.99                 0.01
easy   distracted    0.7                  0.3
hard   undistracted  0.85                 0.15
hard   distracted    0.4                  0.6

An example of the disfluency model

Suppose that we observe the speaker uttering a disfluency. What is P(W = hard|D = disfluent)?

Now suppose we also learn that her attention is distracted. What does that do to our beliefs about W?

That is, what is P(W = hard|D = disfluent, A = distracted)?

Page 111: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

An example of the disfluency modelFortunately, there is automated machinery to“turn the Bayesiancrank”:

P(W = hard) = 0.25

34 / 38

Page 112: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

An example of the disfluency modelFortunately, there is automated machinery to“turn the Bayesiancrank”:

P(W = hard) = 0.25

P(W = hard|D = disfluent) = 0.57

34 / 38

Page 113: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

An example of the disfluency modelFortunately, there is automated machinery to“turn the Bayesiancrank”:

P(W = hard) = 0.25

P(W = hard|D = disfluent) = 0.57

P(W = hard|D = disfluent,A = distracted) = 0.40

34 / 38

Page 114: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

An example of the disfluency modelFortunately, there is automated machinery to“turn the Bayesiancrank”:

P(W = hard) = 0.25

P(W = hard|D = disfluent) = 0.57

P(W = hard|D = disfluent,A = distracted) = 0.40

Knowing that the speaker was distracted (A) decreased the probability that the speaker was about to utter a hard word (W)—A explained D away.

34 / 38

Page 115: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

An example of the disfluency model
Fortunately, there is automated machinery to “turn the Bayesian crank”:

P(W = hard) = 0.25

P(W = hard|D = disfluent) = 0.57

P(W = hard|D = disfluent,A = distracted) = 0.40

Knowing that the speaker was distracted (A) decreased the probability that the speaker was about to utter a hard word (W)—A explained D away.

A caveat: the type of relationship among A, W, and D will depend on the values one finds in the probability table!

P(W, A, D) = P(W) P(A) P(D | W, A)

34 / 38

Page 116: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Summary thus far

Key points:

Bayes’ Rule is a compelling framework for modeling inference under uncertainty

DAGs/Bayes Nets are a broad class of models for specifying joint probability distributions with conditional independencies

Classic Bayes Net references: Pearl (1988, 2000); Jordan (1998); Russell and Norvig (2003, Chapter 14); Bishop (2006, Chapter 8).

35 / 38

Page 117: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

References I

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Jordan, M. I., editor (1998). Learning in Graphical Models. Cambridge, MA: MIT Press.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 2nd edition.

Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press.

Russell, S. and Norvig, P. (2003). Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition.

36 / 38

Page 118: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

An example of the disfluency model

P(W = hard | D = disfluent, A = distracted)

Abbreviations: hard = (W = hard), easy = (W = easy), disfl = (D = disfluent), distr = (A = distracted), undistr = (A = undistracted)

P(hard | disfl, distr) = P(disfl | hard, distr) P(hard | distr) / P(disfl | distr)    (Bayes’ Rule)
                       = P(disfl | hard, distr) P(hard) / P(disfl | distr)            (Independence from the DAG)

P(disfl | distr) = Σ_{w'} P(disfl | W = w', distr) P(W = w')    (Marginalization)
                 = P(disfl | hard, distr) P(hard) + P(disfl | easy, distr) P(easy)
                 = 0.6 × 0.25 + 0.3 × 0.75
                 = 0.375

P(hard | disfl, distr) = (0.6 × 0.25) / 0.375 = 0.4

37 / 38

Page 119: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

An example of the disfluency model
P(W = hard | D = disfluent)

P(hard | disfl) = P(disfl | hard) P(hard) / P(disfl)    (Bayes’ Rule)

P(disfl | hard) = Σ_{a'} P(disfl | A = a', hard) P(A = a' | hard)
                = P(disfl | distr, hard) P(distr | hard) + P(disfl | undistr, hard) P(undistr | hard)
                = 0.6 × 0.15 + 0.15 × 0.85
                = 0.2175

P(disfl) = Σ_{w'} P(disfl | W = w') P(W = w')
         = P(disfl | hard) P(hard) + P(disfl | easy) P(easy)

P(disfl | easy) = Σ_{a'} P(disfl | A = a', easy) P(A = a' | easy)
                = P(disfl | distr, easy) P(distr | easy) + P(disfl | undistr, easy) P(undistr | easy)
                = 0.3 × 0.15 + 0.01 × 0.85
                = 0.0535

P(disfl) = 0.2175 × 0.25 + 0.0535 × 0.75 = 0.0945

P(hard | disfl) = (0.2175 × 0.25) / 0.0945 ≈ 0.575
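The same numbers can be obtained mechanically. Below is a minimal Python sketch (my own illustration, not from the slides) that enumerates the joint P(W)P(A)P(D|W,A) from the probability table above and recovers both posteriors; the function and variable names are hypothetical.

```python
# Exact inference by enumeration in the W -> D <- A network.
# Priors and the conditional probability table are copied from the slides.
p_W = {"hard": 0.25, "easy": 0.75}
p_A = {"distracted": 0.15, "undistracted": 0.85}
p_D_given = {  # P(D = disfluent | W, A)
    ("easy", "undistracted"): 0.01,
    ("easy", "distracted"): 0.3,
    ("hard", "undistracted"): 0.15,
    ("hard", "distracted"): 0.6,
}

def posterior_hard(observed_A=None):
    """P(W = hard | D = disfluent[, A = observed_A]) by summing the joint."""
    num = 0.0    # joint mass consistent with W = hard and the evidence
    denom = 0.0  # joint mass consistent with the evidence alone
    for w in p_W:
        for a in p_A:
            if observed_A is not None and a != observed_A:
                continue
            joint = p_W[w] * p_A[a] * p_D_given[(w, a)]  # P(W) P(A) P(D=disfl | W, A)
            denom += joint
            if w == "hard":
                num += joint
    return num / denom

print(round(posterior_hard(), 3))               # 0.575  (given D = disfluent only)
print(round(posterior_hard("distracted"), 3))   # 0.4    (given D = disfluent, A = distracted)
```

The drop from 0.575 to 0.4 when A = distracted is observed is the explaining-away effect described above.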

38 / 38

Page 120: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Sound categorization

1

our first computational psycholinguistic problem

• hear an acoustic signal, recover a sound category

• our example: distinguishing two similar sound categories, a voicing contrast between a pair of stops: /b/ vs. /p/ or /d/ vs. /t/

Page 121: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

voice onset time (VOT)

• primary cue distinguishing voiced and voiceless stops

Sound categorization

2
(Chen, 1980)

Page 122: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

identification curve (for /d/ vs. /t/)

Sound categorization

3

[Excerpt and figure from Connine, Blasko, & Hall: /d/–/t/ identification in biasing sentence contexts. FIG. 1. Percentage voiced (/d/) responses as a function of stimulus and sentence context bias for: (a) the NEAR condition, where semantic biasing information occurred three syllables subsequent to the target word (Experiment 1); (b) the FAR condition, where semantic biasing information occurred six syllables subsequent to the target word (Experiment 1). More voiced responses were produced in the voiced-bias than the voiceless-bias sentences; the context effect was confined to the category-boundary region and to the NEAR condition, while endpoint stimuli were labeled consistently regardless of sentential bias.]

(Connine et al., 1991) How do people do this?

Page 123: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

4

Generative model

• c ~ discrete choice, e.g., p(p) = p(b) = 0.5

• S|c ~ [some distribution]

(Graphical model: c → S, with c = category and S = sound value)

Bayesian inference

• prior p(c): probability of each category overall (the first step of the generative model)

• likelihood p(S|c): [some distribution]

Bayesian sound categorization

p(c|S) = p(S|c) p(c) / p(S) = p(S|c) p(c) / Σ_{c'} p(S|c') p(c')

Page 124: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Plan

5

• some high level considerations in building cognitive models

• probability in continuous spaces and the Gaussian distribution

• deriving and testing a probabilistic model of sound categorization

• a closely related model of the perceptual magnet effect

Page 125: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Marr's levels of analysis

6

Three levels of computational models (Marr, 1982)

• computational level

• what is the structure of the information processing problem?

• what are the inputs? what are the outputs?

• what information is relevant to solving the problem?

• algorithmic level

• what representations and algorithms are used?

• implementational level

• how are the representations and algorithms implemented neurally?

• the levels are mutually constraining, and each is necessary for a full understanding of the system

Page 126: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Rational analysis

7

How to perform rational analysis (Anderson, 1990)

• background: organism behavior is optimized for common problems both by evolution and by learning

• step 1: specify a formal model of the problem to be solved and the agent's goals

• make as few assumptions about computational limitations as possible

• step 2: derive optimal behavior given problem and goals

• step 3: compare optimal behavior to agent behavior

• step 4: if predictions are off, revisit assumptions about limitations and iterate

Page 127: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

8

Generative model

• c ~ discrete choice, e.g., p(p) = p(b) = 0.5

• S|c ~ [some distribution]

(Graphical model: c → S, with c = category and S = sound (VOT))

Bayesian inference

• prior p(c): probability of each category overall (the first step of the generative model)

• likelihood p(S|c): [some distribution]

Bayesian sound categorization

p(c|S) = p(S|c) p(c) / p(S) = p(S|c) p(c) / Σ_{c'} p(S|c') p(c')

Page 128: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Continuous probability

9

we can't just assign every VOT outcome a probability

• there are uncountably many possible outcomes (e.g., 60.1, 60.01, 60.001, …)

• instead, we use a probability density function that assigns each outcome a non-negative density

• actual probability is now an integral of the density function (area under the curve)

• properness requires that ∫_{−∞}^{+∞} p(x) dx = 1

Continuous random variables and probability density functions

• Sometimes we want to model distributions on a continuum of possible outcomes:
  • The amount of time an infant lives before it hears its first parasitic gap construction
  • Formant frequencies for different vowel productions
• We use continuous random variables for this
• Because there are uncountably many possible outcomes, we cannot use a probability mass function
• Instead, continuous random variables have a probability density function p(x) assigning non-negative density to every real number
• For continuous random variables, properness requires that ∫_{−∞}^{+∞} p(x) dx = 1

Page 129: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Continuous probability

10

a common continuous distribution: the Gaussian aka the normal

[Figure: Gaussian probability density over F2 (Hz), roughly 1600–2400 Hz on the x-axis]

Page 130: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Continuous probability

11

a common continuous distribution: the Gaussian aka the normal

[Figure repeated: Gaussian probability density over F2 (Hz)]

Page 131: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Continuous probability

12

a common continuous distribution: the Gaussian aka the normal

[Figure repeated: Gaussian probability density over F2 (Hz)]

Page 132: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Continuous probability

13

a common continuous distribution: the Gaussian aka the normal

[Figure repeated: Gaussian probability density over F2 (Hz)]

Page 133: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Continuous probability

14

a common continuous distribution: the Gaussian aka the normal

[Figure: Gaussian probability density over VOT (ms), roughly 9.6–10.4 ms on the x-axis]

Page 134: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Continuous probability

15

a common continuous distribution: the Gaussian aka the normal

[Figure repeated: Gaussian probability density over VOT (ms)]

Page 135: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Gaussian parameters

16

Normal(μ, σ²) = N(μ, σ²)

• has two parameters

• most probability distributions are properly families of distributions, indexed by parameters

• e.g., N(μ = 10, σ² = 10) vs. N(μ = 20, σ² = 5)

• formal definition of the Gaussian probability density function:

The normal distribution
• The normal distribution is perhaps the continuous distribution you’re most likely to encounter
• It’s a two-parameter distribution: the mean µ and the variance σ²
• Its probability density function is:

  p(x) = (1/√(2πσ²)) exp( −(x − µ)² / (2σ²) )

• We’ll spend some time deconstructing this scary-looking function... soon you will come to know and love it!
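As a concrete illustration of this density (my own sketch, not part of the slides), the following Python snippet implements the formula and numerically checks properness for the example parameterization N(μ = 10, σ² = 10); all names are mine.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Gaussian density: p(x) = (1/sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Density values under the two example parameterizations, N(10, 10) vs. N(20, 5):
print(normal_pdf(12.0, mu=10.0, sigma2=10.0))
print(normal_pdf(12.0, mu=20.0, sigma2=5.0))

# Crude numerical check of properness: the density integrates to ~1 over a wide range.
step = 0.01
total = sum(normal_pdf(i * step, 10.0, 10.0) * step for i in range(-5000, 5000))
print(round(total, 4))  # ~1.0
```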

Page 136: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Gaussian parameters

17

Mean = Expected value = Expectation = μ

• Formal definition

• intuitively: the center of mass (here: 0, 50)

E(X) = ∫_{−∞}^{+∞} x p(x) dx

[Figure: two Gaussian densities over x (means 0 and 50); y-axis is probability density]

Page 137: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Gaussian parameters

18

Variance = Var = σ²

• Formal definition: Var(X) = E[(X − E(X))²]

• equivalent alternative definition: Var(X) = E[X²] − E[X]²

• intuitively: how broadly outcomes are dispersed (here: 25, 100)

Variance
• Variance is a “second-order” mean: it quantifies how broadly dispersed the outcomes of the r.v. are
• Definition: Var(X) = E[(X − E(X))²], or equivalently, Var(X) = E[X²] − E[X]²
• What is the variance of a Bernoulli random variable? When is its variance smallest? largest? (see the worked answer below)

[Figure: two Gaussian densities over x (variances 25 and 100); y-axis is probability density]
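A worked answer to the Bernoulli question above (my addition, not on the slide): for X ~ Bernoulli(p), E[X] = 1·p + 0·(1−p) = p and E[X²] = 1²·p + 0²·(1−p) = p, so Var(X) = E[X²] − E[X]² = p − p² = p(1−p). The variance is smallest (0) when p = 0 or p = 1, and largest (1/4) when p = 1/2.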

Page 138: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Gaussian parameters

19

Putting both parameters together

Normal distributions with different means and variances

[Figure: normal densities p(x) over x ∈ [−4, 4] for µ = 0, σ² = 1; µ = 0, σ² = 2; µ = 0, σ² = 0.5; µ = 2, σ² = 1]

Page 139: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Bayesian sound categorization

20

modeling ideal speech sound categorization

• which Gaussian category did sound come from?

[Figure 5.2: Likelihood functions for /b/–/p/ phoneme categorization, with µb = 0, µp = 50, σb = σp = 12. For the input x = 27, the likelihoods favor /p/.]

[Figure 5.3: Posterior probability curve for Bayesian phoneme discrimination as a function of VOT.]

(Excerpt from Roger Levy, Probabilistic Models in the Study of Language, draft of November 6, 2012, p. 85:)

...the conditional distributions over acoustic representations, Pb(x) and Pp(x) for /b/ and /p/ respectively (the likelihood functions), and the prior distribution over /b/ versus /p/. We further simplify the problem by characterizing any acoustic representation x as a single real-valued number representing the VOT, and the likelihood functions for /b/ and /p/ as normal density functions (Section 2.10) with means µb, µp and standard deviations σb, σp respectively.

Figure 5.2 illustrates the likelihood functions for the choices µb = 0, µp = 50, σb = σp = 12. Intuitively, the phoneme that is more likely to be realized with VOT in the vicinity of a given input is a better choice for the input, and the greater the discrepancy in the likelihoods the stronger the categorization preference. An input with non-negligible likelihood for each phoneme is close to the “categorization boundary”, but may still have a preference. These intuitions are formally realized in Bayes’ Rule:

P(/b/|x) = P(x|/b/) P(/b/) / P(x)    (5.10)

and since we are considering only two alternatives, the marginal likelihood is simply the weighted sum of the likelihoods under the two phonemes: P(x) = P(x|/b/) P(/b/) + P(x|/p/) P(/p/). If we plug in the normal probability density function we get

P(/b/|x) = [ (1/√(2πσb²)) exp(−(x − µb)²/(2σb²)) P(/b/) ] / [ (1/√(2πσb²)) exp(−(x − µb)²/(2σb²)) P(/b/) + (1/√(2πσp²)) exp(−(x − µp)²/(2σp²)) P(/p/) ]    (5.11)

In the special case where σb = σp = σ we can simplify this considerably by cancelling the constants...

Page 140: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

21

Generative model

• c ~ discrete choice, e.g., p(p) = p(b) = 0.5

• S | c ~ Gaussian(μc, σc²)

(Graphical model: c → S, with c = category and S = sound value)

Bayesian inference

• prior p(c): probability of each category overall (the first step of the generative model)

• likelihood p(S|c): Gaussian(μc, σc²)

Bayesian sound categorization

p(c|S) = p(S|c) p(c) / p(S) = p(S|c) p(c) / Σ_{c'} p(S|c') p(c')

Page 141: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

The normal distribution
• The normal distribution is perhaps the continuous distribution you’re most likely to encounter
• It’s a two-parameter distribution: the mean µ and the variance σ²
• Its probability density function is: p(x) = (1/√(2πσ²)) exp( −(x − µ)² / (2σ²) )

22

Concrete parameters

• c ~ discrete choice, p(/p/) = p(/b/) = 0.5

• S | c ~ Normal; μb = 0, μp = 100; σb = 20, σp = 20

(Graphical model: c → S, with c = category and S = sound value)

Concrete example

• p(b|60) = p(60|b) p(b) / [p(60|b) p(b) + p(60|p) p(p)]

• = .0002(.5) / [.0002(.5) + .0027(.5)] ≈ .08

Bayesian sound categorization

p(c|S) = p(S|c) p(c) / p(S) = p(S|c) p(c) / Σ_{c'} p(S|c') p(c')
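A minimal Python sketch (mine, not from the slides) reproducing the concrete example with the stated parameters; the helper names are illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Concrete parameters from the slide: mu_b = 0, mu_p = 100, sigma_b = sigma_p = 20,
# and equal priors p(/b/) = p(/p/) = 0.5.
prior = {"b": 0.5, "p": 0.5}
mu = {"b": 0.0, "p": 100.0}
sigma = {"b": 20.0, "p": 20.0}

def posterior(category, s):
    """p(c | S = s) = p(s | c) p(c) / sum over c' of p(s | c') p(c')."""
    weighted = {c: normal_pdf(s, mu[c], sigma[c]) * prior[c] for c in prior}
    return weighted[category] / sum(weighted.values())

print(round(posterior("b", 60.0), 3))  # ~0.076, i.e. roughly .08 as on the slide
```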

Page 142: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

23

Categorization function

• which (Gaussian) category did sound come from?

Bayesian sound categorization

[Figures 5.2 and 5.3 again: /b/–/p/ likelihood functions with µb = 0, µp = 50, σb = σp = 12, and the resulting posterior probability of /b/ as a function of VOT; the accompanying textbook excerpt (eqs. 5.10–5.11) appears on the earlier slide above.]

Page 143: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

24

Bayesian sound categorization

[Figure 5.4: Clayards et al. (2008)’s manipulation of VOT variance for /b/–/p/ categories.]

[Figure 5.5: Ideal posterior distributions for narrow and wide variances.]

[Figure 5.6: Response rates observed by Clayards et al. (2008).]

(Excerpt from Roger Levy, Probabilistic Models in the Study of Language, draft of November 6, 2012, p. 86:)

...constants and multiplying through by exp( (x − µb)² / (2σb²) ):

P(/b/|x) = P(/b/) / [ P(/b/) + exp( ((x − µb)² − (x − µp)²) / (2σ²) ) P(/p/) ]    (5.12)

Since e⁰ = 1, when (x − µb)² = (x − µp)² the input is “on the category boundary” and the posterior probabilities of each phoneme are unchanged from the prior. When x is closer to µb, (x − µb)² − (x − µp)² < 0 and /b/ is favored; and vice versa when x is closer to µp. Figure 5.3 illustrates the phoneme categorization curve for the likelihood parameters chosen for this example and the prior P(/b/) = P(/p/) = 0.5.

This account makes clear, testable predictions about how the response profile depends on the parameters of the VOT distribution for each sound category. Clayards et al. (2008), for example, conducted an experiment in which native English speakers were exposed repeatedly to words with initial stops on a /b/–/p/ continuum such that either sound category would form a word (beach–peach, beak–peak, bees–peas). The distribution of the /b/–/p/ continuum used in the experiment was bimodal, approximating two overlapping Gaussian distributions (Section 2.10); high-variance distributions (156 ms²) were used for some experimental participants and low-variance distributions (64 ms²) for others (Figure 5.4). If these speakers were to learn the true underlying distributions to which they were exposed and use them to draw ideal Bayesian inferences about which word they heard on a given trial, then the posterior distribution as a function of VOT would be as in Figure 5.5: note that low-variance Gaussians would induce a steeper response curve than high-variance Gaussians. The actual response rates are given in Figure 5.6; although the discrepancy between the low- and high-variance conditions is smaller than predicted by ideal inference, suggesting that learning may have been incomplete, the results of Clayards et al. confirm that human response curves are indeed steeper when category variances are lower, as predicted by principles of Bayesian inference.

Categorization function

• ideal categorization function slope changes with category variance
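A small sketch (my own illustration, not from the slides) of this prediction, using the equal-variance posterior (cf. eq. 5.12) with µb = 0, µp = 50 and the two training variances reported for Clayards et al. (2008): the low-variance curve leaves 0.5 faster around the boundary, i.e. it is steeper.

```python
import math

def posterior_b(x, mu_b=0.0, mu_p=50.0, var=64.0, prior_b=0.5):
    """P(/b/ | x) under equal-variance Gaussian categories (equal-variance form of Bayes' rule)."""
    ratio = math.exp(((x - mu_b) ** 2 - (x - mu_p) ** 2) / (2 * var))
    return prior_b / (prior_b + ratio * (1 - prior_b))

# Posterior for /b/ a few ms on either side of the 25 ms category boundary,
# for the low-variance (64 ms^2) and high-variance (156 ms^2) training distributions:
for var in (64.0, 156.0):
    print(var, [round(posterior_b(x, var=var), 3) for x in (20.0, 25.0, 30.0)])
# The low-variance curve moves away from 0.5 faster: a steeper identification function.
```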

Page 144: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

25

Bayesian sound categorization

[Figures 5.4–5.6 again: the VOT-variance manipulation, the ideal posterior curves for narrow vs. wide variances, and the response rates observed by Clayards et al. (2008); see the excerpt on the previous slide.]

Clayards et al. (2008) tested exactly this prediction

• trained participants with Gaussian categories of two variances

• then tested categorization


Page 145: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

26

Bayesian sound categorization
Wrapping up categorization

• assumed knowledge of categories (which were Gaussian distributions)

• found the exact posterior probability that a sound belongs to each of two categories with a simple application of Bayes' rule

• confirmed the Bayesian model’s prediction that the categorization function becomes less steep as category variance increases

let's move on to a more complex situation

Page 146: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

27

Bayesian sound categorization

A more complex situation: the perceptual magnet effect

• empirical work by Kuhl and colleagues [Kuhl et al., 1992; Iverson & Kuhl, 1995]

• the modeling work we discuss is from Feldman and colleagues [Feldman & Griffiths, 2007; Feldman et al., 2009]

Page 147: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

28

Page 148: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Perceptual Magnet Effect

/ε/

/i/

(Iverson & Kuhl, 1995)

Page 149: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Perceptual Magnet Effect

(Iverson & Kuhl, 1995)

Perceptual Magnet Effect

Perceived Stimuli:

Actual Stimuli:

(Iverson & Kuhl, 1995)

To account for this, we need a new generative model for speech perception

Page 150: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Speech Perception

Page 151: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Speech Perception

Speaker chooses a phonetic category

c

Page 152: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Speech Perception

Speaker chooses a phonetic category

c

T
Speaker articulates a “target production”

Page 153: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Speech Perception

Speaker chooses a phonetic category

Noise in the speech signal

c

T
Speaker articulates a “target production”

Page 154: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Speech Perception
Listener hears a speech sound

Speaker chooses a phonetic category

Noise in the speech signal

c   S

T
Speaker articulates a “target production”

Page 155: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Speech Perception
Listener hears a speech sound

Speaker chooses a phonetic category

Noise in the speech signal

c   S

T
Speaker articulates a “target production”

Inferring an acoustic value: Compute p(T|S)

Page 156: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Statistical Model

c Choose a category c with probability p(c)

Page 157: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Statistical Model

c Choose a category c with probability p(c)

Articulate a target production T with probability p(T|c)
T
p(T|c) = N(µc, σc²)

Page 158: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Statistical Model

c

S

Choose a category c with probability p(c)

Articulate a target production T with probability p(T|c)

Listener hears speech sound S with probability p(S|T)

T
p(T|c) = N(µc, σc²)

p(S|T) = N(T, σS²)

Page 159: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Statistical Model

N(µc, σc²)

T

N(T, σS²)

Phonetic Category ‘c’

Speech Signal Noise

S

Target Production

Speech Sound

Page 160: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Statistical Model

N(µc, σc²)

T

N(T, σS²)

Phonetic Category ‘c’

Speech Signal Noise

S

Target Production

Speech Sound

Page 161: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Statistical Model

?

N(µc, σc²)

Phonetic Category ‘c’

S
Speech Sound

N(T, σS²)

Speech Signal Noise

Page 162: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Statistical Model

?

N(µc, σc²)

Phonetic Category ‘c’

S
Speech Sound

Speech Signal Noise

Prior, p(h)

Hypotheses, h

Data, d

Likelihood, p(d|h)

N(T, σS²)

Page 163: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Bayes for Speech Perception

Listeners must infer the target production based on the speech sound they hear and their prior knowledge of phonetic categories

– Data (d): speech sound S
– Hypotheses (h): target productions T
– Prior (p(h)): phonetic category structure p(T|c)
– Likelihood (p(d|h)): speech signal noise p(S|T)

p(h|d) ∝ p(d|h) p(h)

Page 164: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Bayes for Speech Perception

Prior, Likelihood

S
Speech Sound

Page 165: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Bayes for Speech Perception

Prior, Likelihood

Posterior

S
Speech Sound

Page 166: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Bayes for Speech Perception

E[T | S, c] = (σc² S + σS² µc) / (σc² + σS²)

Prior, Likelihood

Posterior

Page 167: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Perceptual Warping

Page 168: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Multiple Categories

• Want to compute

• Marginalize over categories

p(T|S)

p(T|S) = Σ_c p(T | S, c) p(c | S)

Page 169: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Multiple Categories

• Want to compute

• Marginalize over categories

p(T|S)

p(T|S) = Σ_c p(T | S, c) p(c | S)

solution for a single category

probability of category membership

Page 170: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Multiple Categories

S
Speech Sound

Page 171: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Multiple Categories

S
Speech Sound

Page 172: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Multiple Categories

E[T | S, c] = (σc² S + σS² µc) / (σc² + σS²)

Page 173: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Multiple Categories

E[T | S, c] = (σc² S + σS² µc) / (σc² + σS²)

Page 174: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Multiple Categories

E[T | S] = Σ_c [ (σc² S + σS² µc) / (σc² + σS²) ] p(c | S)
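A minimal Python sketch (mine, with purely illustrative parameter values) of this computation: the perceived value E[T|S] averages the single-category solutions weighted by p(c|S), so stimuli are pulled toward the mean of their most likely category.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical two-category setup (values chosen only for illustration):
cats = {"i": (0.0, 1.0), "e": (5.0, 1.0)}   # mu_c, sigma_c^2 for each category
prior = {"i": 0.5, "e": 0.5}
var_noise = 1.0                              # sigma_S^2, speech-signal noise

def perceived(S):
    """E[T|S] = sum_c [(sigma_c^2 * S + sigma_S^2 * mu_c) / (sigma_c^2 + sigma_S^2)] * p(c|S)."""
    # p(c|S): the marginal likelihood of S under category c is N(mu_c, sigma_c^2 + sigma_S^2)
    lik = {c: normal_pdf(S, mu, var + var_noise) * prior[c] for c, (mu, var) in cats.items()}
    z = sum(lik.values())
    out = 0.0
    for c, (mu, var) in cats.items():
        single = (var * S + var_noise * mu) / (var + var_noise)  # E[T | S, c]
        out += single * (lik[c] / z)
    return out

# A uniform continuum of stimuli is warped toward the category means (0 and 5):
print([round(perceived(s), 2) for s in [0.0, 1.0, 2.0, 2.5, 3.0, 4.0, 5.0]])
```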

Page 175: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Perceptual Warping

Page 176: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Perceptual Warping

To compare the model to human data

• we have a 13-step continuum

• estimate perceptual distance between each adjacent pair in humans and model

Perceptual Magnet Effect

Perceived Stimuli:

Actual Stimuli:

(Iverson & Kuhl, 1995)

Page 177: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

Modeling the /i/–/e/ Data

[Figure: Relative distances between neighboring stimuli (1–13) on the /i/–/e/ continuum; perceptual distance from MDS (human data) vs. the model, by stimulus number]

Page 178: Day 1: Probability and speech perceptionidiom.ucsd.edu/.../lecture-1/lecture1-slides-all-no-builds.pdfComputational Psycholinguistics Psycholinguistics deals with the problem of how

59

Bayesian sound categorization
Conclusions

• continuous probability theory lets us build ideal models of speech perception

• part 1: can build a principled model of categorization, which fits human data well

• e.g., categorization less steep for high variance categories

• part 2: can predict how linguistic category structure warps perceptual space

• speech sounds are perceived as being closer to the center of their likely category