580.691 Learning Theory Reza Shadmehr Bayesian learning 1: Bayes rule, priors and maximum a posteriori
Page 1

580.691 Learning Theory

Reza Shadmehr

Bayesian learning 1:

Bayes rule, priors and maximum a posteriori

Page 2

Frequentist vs. Bayesian Statistics

Frequentist Thinking

True parameter: w*

Estimate of this parameter: ŵ

Many different ways in which we can come up with estimates (e.g. the Maximum Likelihood estimate), and we can evaluate them:

$$\text{Bias: } E[\hat{w}] - w^{*} \qquad\qquad \text{Variance: } \mathrm{var}[\hat{w}]$$

Bayesian Thinking

Does not have the concept of a true parameter. Rather, at every given time we have knowledge about w (the prior), gain new data, and then update our knowledge using Bayes rule (the posterior):

$$p(w \mid D) = \frac{p(w, D)}{p(D)} = \frac{p(D \mid w)\, p(w)}{p(D)}, \qquad p(D) = \int p(w, D)\, dw$$

Posterior distr. (left-hand side); conditional distr. and prior distr. (numerator).

Given Bayes rule, there is only ONE correct way of learning.
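
A minimal numerical sketch of the update rule above, assuming a Gaussian prior over w and a single Gaussian-noise observation as example inputs; the posterior is evaluated on a grid and p(D) is approximated by numerical integration.

```python
import numpy as np

# Grid over the unknown parameter w.
w = np.linspace(-5.0, 5.0, 1001)

# Prior knowledge about w: a Gaussian centered at 0 with SD 1 (assumed).
prior = np.exp(-0.5 * w**2)

# New data D: one noisy observation of w (assumed value and noise SD).
observation, noise_sd = 1.5, 1.0
likelihood = np.exp(-0.5 * (observation - w) ** 2 / noise_sd**2)

# Bayes rule: posterior is likelihood times prior, divided by p(D);
# p(D) = integral over w, approximated by the trapezoidal rule on the grid.
unnormalized = likelihood * prior
posterior = unnormalized / np.trapz(unnormalized, w)

print("posterior mean of w:", np.trapz(w * posterior, w))  # ~0.75 here
```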

Page 3

Binomial distribution and discrete random variables

Suppose a random variable can take only one of two values (e.g., 0 and 1, success and failure, etc.). Such trials are termed Bernoulli trials.

$$x \in \{0, 1\}, \qquad P(x = 1) = \theta, \qquad P(x = 0) = 1 - \theta, \qquad p(x) = \theta^{x} (1 - \theta)^{1 - x}$$

Probability distribution of a specific sequence of successes and failures:

$$p\!\left(x^{(1)}, x^{(2)}, \ldots, x^{(N)}\right) = \prod_{i = 1}^{N} \theta^{x^{(i)}} (1 - \theta)^{1 - x^{(i)}}$$

Let n = number of times the trial succeeded:

$$n = \sum_{i = 1}^{N} x^{(i)}$$

Probability density or distribution of n (the binomial distribution):

$$p(n) = \frac{N!}{n!\,(N - n)!}\, \theta^{n} (1 - \theta)^{N - n}$$

$$E[n] = N\theta, \qquad \mathrm{var}[n] = N\theta(1 - \theta)$$
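
A short sketch that builds the binomial probabilities p(n) from the formula above, assuming example values θ = 0.25 and N = 10, and checks the mean Nθ and variance Nθ(1 − θ) numerically.

```python
import math

theta, N = 0.25, 10   # assumed example values

# p(n) = N!/(n!(N-n)!) * theta^n * (1-theta)^(N-n) for n = 0..N
p = [math.comb(N, n) * theta**n * (1 - theta) ** (N - n) for n in range(N + 1)]

mean = sum(n * pn for n, pn in enumerate(p))
var = sum((n - mean) ** 2 * pn for n, pn in enumerate(p))

print(sum(p))                          # ~1.0: the probabilities sum to one
print(mean, N * theta)                 # both 2.5
print(var, N * theta * (1 - theta))    # both 1.875
```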

Page 4

Poor performance of ML estimators with small data samples

• Suppose we have a coin and wish to estimate the outcome (head or tail) from observing a series of coin tosses. Let θ = probability of tossing a head.

• After observing n coin tosses, D = {x^(1), …, x^(n)}, we note that h of the trials are heads.

• To estimate whether the next toss will be head or tail, we form an ML estimator. The probability of observing a particular sequence of heads and tails in D is:

$$L(\theta) = p\!\left(x^{(1)}, \ldots, x^{(n)} \mid \theta\right) = p\!\left(x^{(1)} \mid \theta\right) p\!\left(x^{(2)} \mid \theta\right) \cdots p\!\left(x^{(n)} \mid \theta\right) = \theta^{h} (1 - \theta)^{n - h}$$

$$\log L(\theta) = h \log\theta + (n - h) \log(1 - \theta)$$

$$\frac{d}{d\theta} \log L(\theta) = \frac{h}{\theta} - \frac{n - h}{1 - \theta} = 0 \quad\Rightarrow\quad \hat{\theta}_{ML} = \frac{h}{n}$$

• After one toss, if it comes up tail, our ML estimate predicts zero probability of seeing heads. If the first n tosses are tails, the ML estimate continues to predict zero probability of seeing heads.
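
A minimal sketch of the ML estimate θ_ML = h/n on simulated tosses (the true head probability and the sample size are assumed example values), illustrating the point above: if the few observed tosses happen to be all tails, the ML estimate assigns zero probability to heads.

```python
import random

random.seed(1)
theta_true = 0.5   # assumed true probability of a head, used only to simulate
n = 3              # a deliberately small number of tosses

tosses = [1 if random.random() < theta_true else 0 for _ in range(n)]
h = sum(tosses)

theta_ml = h / n   # the ML estimate derived above
print("tosses:", tosses, " theta_ML =", theta_ml)
# If all n tosses happen to be tails (h = 0), theta_ML = 0 and the estimator
# predicts zero probability of ever seeing a head, however small n is.
```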

Page 5

Including prior knowledge into the estimation process

• Even though the ML estimator might say θ_ML = 0, we "know" that the coin can come up both heads and tails, i.e. 0 < θ < 1.

• The starting point for our consideration is that θ is not only a number; rather, we will give it a full probability distribution function.

• Suppose we know that the coin is either fair (θ = 0.5) with probability γ, or in favor of tails (θ = 0.4) with probability 1 − γ.

• We want to combine this prior knowledge with new data D (i.e. the number of heads in n throws) to arrive at a posterior distribution for θ. We will apply Bayes rule:

$$p(\theta \mid D) = \frac{p(\theta, D)}{p(D)} = \frac{p(\theta)\, p(D \mid \theta)}{p(D)} = \frac{p(\theta)\, p(D \mid \theta)}{\int p(\theta)\, p(D \mid \theta)\, d\theta}$$

Posterior distr. (left-hand side); prior distr. and conditional distr. (numerator).

The numerator is just the joint distribution of θ and D, evaluated at a particular D. The denominator is the marginal distribution of the data D; that is, it is just a number that makes the numerator integrate to one.

Page 6

Bayesian estimation for a potentially biased coin

• Suppose that we believe that the coin is either fair, or that it is biased toward tails. Let θ = probability of tossing a head. After observing n coin tosses, D = {x^(1), …, x^(n)}, we note that h of the trials are heads, so that:

$$p(D \mid \theta) = \theta^{h} (1 - \theta)^{n - h}$$

$$p(\theta) = \begin{cases} \gamma & \text{for } \theta = 0.5 \\ 1 - \gamma & \text{for } \theta = 0.4 \\ 0 & \text{otherwise} \end{cases}$$

$$P(\theta = 0.5 \mid D) = \frac{0.5^{h}\, 0.5^{n - h}\, \gamma}{0.5^{h}\, 0.5^{n - h}\, \gamma + (1 - \gamma)\, 0.4^{h}\, 0.6^{n - h}} = \frac{0.5^{n}\, \gamma}{0.5^{n}\, \gamma + (1 - \gamma)\, 0.4^{h}\, 0.6^{n - h}}$$

$$P(\theta = 0.4 \mid D) = \frac{(1 - \gamma)\, 0.4^{h}\, 0.6^{n - h}}{0.5^{n}\, \gamma + (1 - \gamma)\, 0.4^{h}\, 0.6^{n - h}}$$

Now we can accurately calculate the probability that we have a fair coin, given some data D. In contrast to the ML estimate, which only gave us one number θ_ML, we have here a full probability distribution; that is, we also know how certain we are that we have a fair or unfair coin.

In some situations we would like a single number that represents our best guess of θ. One possibility for this best guess is the maximum a posteriori (MAP) estimate.
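
A sketch of the two-point posterior above, assuming an example prior weight γ = 0.75 on the fair coin and example data of h = 3 heads in n = 10 tosses.

```python
def posterior_fair(h, n, gamma=0.75):
    """P(theta = 0.5 | D) for the two-point prior on this slide."""
    like_fair = 0.5**h * 0.5 ** (n - h)     # p(D | theta = 0.5)
    like_biased = 0.4**h * 0.6 ** (n - h)   # p(D | theta = 0.4)
    evidence = gamma * like_fair + (1 - gamma) * like_biased
    return gamma * like_fair / evidence

# Example data: 3 heads in 10 tosses.
p_fair = posterior_fair(h=3, n=10)
print("P(theta=0.5 | D) =", p_fair, " P(theta=0.4 | D) =", 1 - p_fair)
```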

Page 7

Maximum a-posteriori estimate

We define the MAP estimate as the maximum (i.e. mode) of the posterior distribution. MAP estimator:

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta) = \arg\max_{\theta}\, \left[ \log p(D \mid \theta) + \log p(\theta) \right]$$

The latter version makes the comparison to the maximum likelihood estimate easy:

$$\hat{\theta}_{ML} = \arg\max_{\theta}\, p(D \mid \theta) = \arg\max_{\theta}\, \log p(D \mid \theta)$$

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta) = \arg\max_{\theta}\, \left[ \log p(D \mid \theta) + \log p(\theta) \right]$$

We see that ML and MAP are identical if p(θ) is a constant that does not depend on θ. Our prior would then be a uniform distribution over the domain of θ. For obvious reasons, we call such a prior a flat or uninformative prior.
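
A quick grid check of this comparison, assuming example data h = 2, n = 10: the argmax of the log-likelihood and of the log-posterior coincide under a flat prior, and differ once the prior is not flat.

```python
import numpy as np

h, n = 2, 10                                   # assumed example data
theta = np.linspace(1e-3, 1 - 1e-3, 9999)      # grid over (0, 1)

log_lik = h * np.log(theta) + (n - h) * np.log(1 - theta)
flat_log_prior = np.zeros_like(theta)                 # log of a constant prior
beta_log_prior = np.log(theta) + np.log(1 - theta)    # a non-flat prior, p ∝ theta(1-theta)

theta_ml = theta[np.argmax(log_lik)]
theta_map_flat = theta[np.argmax(log_lik + flat_log_prior)]
theta_map_beta = theta[np.argmax(log_lik + beta_log_prior)]

print(theta_ml, theta_map_flat)   # identical: a flat prior does not move the maximum
print(theta_map_beta)             # ~ (h+1)/(n+2) = 0.25, pulled toward 0.5
```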

Page 8

Formulating a continuous prior for the coin toss problem

• In the last example, the probability of tossing a head, represented by θ, could only be either 0.5 or 0.4. How should we choose a prior distribution if θ can be anywhere between 0 and 1?

• Suppose we observed n tosses. The probability that exactly h of those tosses were heads is:

$$p(h) = \frac{n!}{h!\,(n - h)!}\, \theta^{h} (1 - \theta)^{n - h}, \qquad 0 \le h \le n$$

θ = probability of tossing a head

[Figure: Binomial distribution of h for n = 20, θ = 0.5.]

Page 9


Formulating a continuous prior for the coin toss problem

• θ represents the probability of a head. We want a continuous distribution that is defined between θ = 0 and θ = 1, and is 0 at θ = 0 and θ = 1. The beta distribution has this form:

$$p(\theta) = \frac{1}{c}\, \theta^{\alpha} (1 - \theta)^{n - \alpha}, \qquad c = \int_{0}^{1} \theta^{\alpha} (1 - \theta)^{n - \alpha}\, d\theta \quad \text{(normalizing constant)}$$

θ = probability of tossing a head

[Figure: Beta distributions p(θ) for (α, n) = (1, 2), (2, 4), (3, 6), (4, 8) in one panel, and (α, n) = (1, 2), (1, 4), (1, 6), (1, 8) in the other.]
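
A sketch (using scipy) that checks the normalizing constant c above against a closed form: for exponents α and n − α, c equals the Euler beta function B(α + 1, n − α + 1). The values α = 1, n = 2 are the example parameters from the figure.

```python
import numpy as np
from scipy.special import beta as beta_fn

alpha, n = 1, 2                          # example parameters from the figure
theta = np.linspace(0.0, 1.0, 100001)

# Numerical value of c = integral of theta^alpha (1-theta)^(n-alpha) over [0, 1]
c_numeric = np.trapz(theta**alpha * (1 - theta) ** (n - alpha), theta)

# Closed form: the Euler beta function B(alpha + 1, n - alpha + 1)
c_exact = beta_fn(alpha + 1, n - alpha + 1)

print(c_numeric, c_exact)                # both ~1/6 for alpha = 1, n = 2
```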

Page 10

Formulating a continuous prior for the coin toss problem

• In general, let’s assume our knowledge comes in the form of a beta distribution:

$$p(\theta) = \frac{1}{c}\, \theta^{\alpha} (1 - \theta)^{\beta}, \qquad c = \int_{0}^{1} \theta^{\alpha} (1 - \theta)^{\beta}\, d\theta$$

$$p(D \mid \theta) = \theta^{h} (1 - \theta)^{n - h}$$

$$p(\theta \mid D) = \frac{p(\theta)\, p(D \mid \theta)}{\int_{0}^{1} p(\theta)\, p(D \mid \theta)\, d\theta} = \frac{\theta^{\alpha + h} (1 - \theta)^{\beta + n - h}}{\int_{0}^{1} \theta^{\alpha + h} (1 - \theta)^{\beta + n - h}\, d\theta}$$

When we apply Bayes rule to integrate some old knowledge (the prior) in the form of a beta distribution with parameters α and β, with some new knowledge h and n (coming from a binomial distribution), we find that the posterior distribution also has the form of a beta distribution, with parameters α + h and β + n − h.

The beta and the binomial distribution are therefore called conjugate distributions.
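
A small numerical sketch of the conjugate update described above, assuming example values α = β = 1 and new data h = 7, n = 10. Note that scipy's Beta(a, b) density is ∝ θ^(a−1)(1 − θ)^(b−1), so the exponents α + h and β + n − h correspond to scipy parameters a = α + h + 1 and b = β + n − h + 1.

```python
import numpy as np
from scipy.stats import beta

a_exp, b_exp = 1, 1    # prior exponents: p(theta) ∝ theta^a_exp * (1-theta)^b_exp
h, n = 7, 10           # assumed new data: 7 heads in 10 tosses

theta = np.linspace(1e-4, 1 - 1e-4, 2001)

# Direct application of Bayes rule on a grid
unnorm = theta ** (a_exp + h) * (1 - theta) ** (b_exp + n - h)
posterior_grid = unnorm / np.trapz(unnorm, theta)

# The conjugate result: a beta density with exponents a_exp + h and b_exp + n - h
posterior_beta = beta.pdf(theta, a_exp + h + 1, b_exp + n - h + 1)

print(np.max(np.abs(posterior_grid - posterior_beta)))  # ~0: the two curves agree
```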

Page 11

MAP estimator for the coin toss problem

Let us look at the MAP estimator if we start with a prior with α = 1, n = 2, i.e. we have a slight prior belief that the coin is fair.

Our posterior is then:

$$p(\theta \mid D) = \frac{1}{c}\, \theta^{1 + h} (1 - \theta)^{1 + n - h}, \qquad c = \int_{0}^{1} \theta^{1 + h} (1 - \theta)^{1 + n - h}\, d\theta$$

Note that after one toss, if we get a tail, our probability of tossing a head is 0.33, not zero as in the ML case.

Let's calculate the MAP estimate so that we can compare it to the ML estimate:

$$\log p(\theta \mid D) = -\log c + (1 + h) \log\theta + (1 + n - h) \log(1 - \theta)$$

$$\frac{d}{d\theta} \log p(\theta \mid D) = \frac{1 + h}{\theta} - \frac{1 + n - h}{1 - \theta} = 0$$

$$(1 + h)(1 - \theta) = (1 + n - h)\, \theta \quad\Rightarrow\quad 1 + h = (2 + n)\, \theta$$

$$\hat{\theta}_{MAP} = \frac{h + 1}{n + 2}$$

[Figure: Beta distribution with n = 2, h = 1 (the prior), i.e. p(θ) ∝ θ(1 − θ), peaked at θ = 0.5.]
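
A tiny sketch comparing θ_MAP = (h + 1)/(n + 2) with θ_ML = h/n after a single toss that comes up tail, matching the 0.33-versus-0 comparison made on this slide.

```python
def theta_ml(h, n):
    return h / n

def theta_map(h, n):
    # MAP estimate under the alpha = 1, n = 2 prior derived above
    return (h + 1) / (n + 2)

# A single toss that comes up tail: h = 0, n = 1.
print(theta_ml(0, 1))   # 0.0      -> ML says a head is impossible
print(theta_map(0, 1))  # 0.333... -> MAP still allows for a head
```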

Page 12

Assume you only know the height of a person, but not their gender. Can height tell you something about gender? Assume y=height and x=gender (0=male or 1=female).

Height is normally distributed in the population of men and in the population of women, with different means and similar variances. Let x be an indicator variable for being female. Then the conditional distribution of y (the height) becomes:

$$p(y \mid x = 1) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\!\left( -\frac{1}{2\sigma^{2}} (y - \mu_{f})^{2} \right)$$

$$p(y \mid x = 0) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\!\left( -\frac{1}{2\sigma^{2}} (y - \mu_{m})^{2} \right)$$

What we have: the densities p(y | x = 0) and p(y | x = 1). What we want: the probability P(x = 1 | y).

$$P(x = 1 \mid y) = \frac{P(x = 1)\, p(y \mid x = 1)}{\sum_{i = 0}^{1} P(x = i)\, p(y \mid x = i)}$$

Classification with a continuous conditional distribution

Page 13

Let us further assume that we start with a prior distribution such that x is 1 with probability π.

$$P(x = 1 \mid y) = \frac{P(x = 1)\, p(y \mid x = 1)}{P(x = 1)\, p(y \mid x = 1) + P(x = 0)\, p(y \mid x = 0)}$$

$$= \frac{\pi \exp\!\left( -\frac{1}{2\sigma^{2}} (y - \mu_{f})^{2} \right)}{\pi \exp\!\left( -\frac{1}{2\sigma^{2}} (y - \mu_{f})^{2} \right) + (1 - \pi) \exp\!\left( -\frac{1}{2\sigma^{2}} (y - \mu_{m})^{2} \right)}$$

$$= \frac{1}{1 + \frac{1 - \pi}{\pi} \exp\!\left( \frac{1}{2\sigma^{2}} \left[ (y - \mu_{f})^{2} - (y - \mu_{m})^{2} \right] \right)} = \frac{1}{1 + \exp\!\left( \log\frac{1 - \pi}{\pi} + \frac{1}{2\sigma^{2}} \left[ (y - \mu_{f})^{2} - (y - \mu_{m})^{2} \right] \right)}$$

Expanding the squares, (y − μ_f)² − (y − μ_m)² = 2(μ_m − μ_f)y + μ_f² − μ_m², which is linear in y, so:

$$P(x = 1 \mid y) = \frac{1}{1 + \exp\!\left( -\boldsymbol{\theta}^{T} \mathbf{y} \right)}, \qquad \boldsymbol{\theta} = \begin{bmatrix} \dfrac{\mu_{m}^{2} - \mu_{f}^{2}}{2\sigma^{2}} + \log\dfrac{\pi}{1 - \pi} \\[2ex] \dfrac{\mu_{f} - \mu_{m}}{\sigma^{2}} \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} 1 \\ y \end{bmatrix}$$

The posterior is a logistic function of a linear function of the data and parameters (remember this result for the section on classification!).

The maximum-likelihood argument would just have decided under which model the data would have been more likely.

The posterior distribution gives us the full probability that we have a male or female.

We can also include prior knowledge in our scheme.

Classification with a continuous conditional distribution
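
A sketch checking the logistic form derived above against a direct application of Bayes rule to the two Gaussian densities; μ_m, μ_f, σ, and the prior π are the example values used on the next slide.

```python
import numpy as np

mu_m, mu_f, sigma = 176.0, 166.0, 12.0   # example values (see the next slide)
pi = 0.5                                  # prior probability P(x = 1)

def posterior_direct(y):
    # Bayes rule applied directly to the two Gaussian class-conditional densities
    g_f = np.exp(-0.5 * (y - mu_f) ** 2 / sigma**2)
    g_m = np.exp(-0.5 * (y - mu_m) ** 2 / sigma**2)
    return pi * g_f / (pi * g_f + (1 - pi) * g_m)

def posterior_logistic(y):
    # The same posterior written as a logistic function of [1, y]
    theta0 = (mu_m**2 - mu_f**2) / (2 * sigma**2) + np.log(pi / (1 - pi))
    theta1 = (mu_f - mu_m) / sigma**2
    return 1.0 / (1.0 + np.exp(-(theta0 + theta1 * y)))

y = 170.0  # an example height in cm
print(posterior_direct(y), posterior_logistic(y))  # the two values agree
```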

Page 14


Computing the probability that the subject is female, given that we observed height y.

$$P(x = 1 \mid y) = \frac{1}{1 + \exp\!\left( \dfrac{\mu_{m} - \mu_{f}}{\sigma^{2}}\, y + \dfrac{\mu_{f}^{2} - \mu_{m}^{2}}{2\sigma^{2}} + \log\dfrac{1 - \pi}{\pi} \right)}$$

$$\mu_{m} = 176 \text{ cm}, \qquad \mu_{f} = 166 \text{ cm}, \qquad \sigma = 12 \text{ cm}$$

[Figure: Posterior probability P(x = 1 | y) as a function of height y, for our prior probability P(x = 1) = 0.5 and for P(x = 1) = 0.3.]

Classification with a continuous conditional distribution
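
A numerical sketch of this slide's example: the posterior probability that the subject is female for a few example heights, under the two priors shown in the figure (P(x = 1) = 0.5 and 0.3).

```python
import math

mu_m, mu_f, sigma = 176.0, 166.0, 12.0   # values from this slide (cm)

def p_female(y, prior=0.5):
    # P(x = 1 | y) in the logistic form given on this slide
    z = ((mu_m - mu_f) / sigma**2) * y \
        + (mu_f**2 - mu_m**2) / (2 * sigma**2) \
        + math.log((1 - prior) / prior)
    return 1.0 / (1.0 + math.exp(z))

for y in (160, 171, 180):                # example heights in cm
    print(y, round(p_female(y, prior=0.5), 3), round(p_female(y, prior=0.3), 3))
```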

Page 15

Summary

•Bayesian estimation involves the application of Bayes rule to combine a prior density and a conditional density to arrive at a posterior density.

•Maximum a posteriori (MAP) estimation: If we need a “best guess” from our posterior distribution, often the maximum of the posterior distribution is used.

•The MAP and ML estimates are identical when our prior is uniformly distributed on θ, i.e. is flat or uninformative.

•With a two-way classification problem and data that is Gaussian given the category membership, the posterior is a logistic function, linear in the data.

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)$$

$$P(x = 1 \mid y) = \frac{1}{1 + \exp\!\left( -\boldsymbol{\theta}^{T} \mathbf{y} \right)}$$