Page 1: Title

COMS4771, Columbia University
Machine Learning 4771
Instructors: Adrian Weller and Ilia Vovsha

Page 2: Topic 2: Basic concepts of Bayesians and Frequentists

• Properties of PDFs
• Bayesians & Frequentists
• ML, MAP and Full Bayes
• Example: Coin Toss
• Bernoulli Priors
• Conjugate Priors
• Bayesian decision theory

Page 3: Properties of PDFs

• Review some basics of probability theory.
• First, a pdf (Probability Distribution Function) is a function: multiple inputs, one output:
  $p(x_1, \ldots, x_n)$, e.g. $p(X_1 = 0.3, \ldots, X_n = 1) = 0.2$
• The function's output is always non-negative:
  $p(x_1, \ldots, x_n) \ge 0$
• It can have discrete or continuous inputs, or both:
  $p(X_1 = 1, X_2 = 0, X_3 = 0, X_4 = 3.1415)$
• Summing over the domain of all inputs gives unity:
  continuous → integral: $\int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} p(x, y)\, dx\, dy = 1$
  discrete → sum: $\sum_x \sum_y p(x, y) = 1$
  [Figure: a 2×2 table of a discrete joint pdf over binary X and Y with entries 0.4, 0.1, 0.3, 0.2, which sum to 1.]
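A minimal sketch (not from the slides) checking the two defining properties above on a discrete joint table; the assignment of the four probabilities to particular (x, y) cells is illustrative:

```python
# Check that a discrete joint pdf is non-negative and sums to 1.
# The placement of the four probabilities in the table is illustrative.
import numpy as np

# rows indexed by y in {0, 1}, columns by x in {0, 1}
p_xy = np.array([[0.3, 0.2],
                 [0.4, 0.1]])

assert np.all(p_xy >= 0)                # p(x, y) >= 0
assert np.isclose(p_xy.sum(), 1.0)      # sum_x sum_y p(x, y) = 1
print("valid discrete pdf")
```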

Page 4: Properties of PDFs (Bishop PRML 1.2.1)

[Figure: a probability distribution function (PDF) and the corresponding cumulative distribution function (CDF).]

Page 5: Properties of PDFs

• Marginalizing: integrating/summing out a variable leaves a marginal distribution over the remaining ones…
  $\sum_y p(x, y) = p(x)$
• Conditioning: if a variable y is 'given', we get a conditional distribution over the remaining ones…
  $p(x \mid y) = \frac{p(x, y)}{p(y)}$
• Bayes Rule: mathematically just conditioning again, but it has a deeper meaning (1764)… if we have x being data and θ being a model:
  $p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$
  (posterior = likelihood × prior / evidence)
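A small sketch (not from the slides) that exercises marginalizing, conditioning, and Bayes rule on the same kind of discrete joint table used on the previous slide; the table values are illustrative:

```python
# Marginalize, condition, and apply Bayes rule on a discrete joint pdf p(x, y).
import numpy as np

p_xy = np.array([[0.3, 0.2],     # rows: y in {0, 1}; columns: x in {0, 1}
                 [0.4, 0.1]])    # illustrative values summing to 1

p_x = p_xy.sum(axis=0)           # marginal: sum_y p(x, y) = p(x)
p_y = p_xy.sum(axis=1)           # marginal: sum_x p(x, y) = p(y)

p_x_given_y0 = p_xy[0] / p_y[0]  # conditioning: p(x | y=0) = p(x, y=0) / p(y=0)

# Bayes rule: p(y=0 | x) = p(x | y=0) p(y=0) / p(x)
p_y0_given_x = p_x_given_y0 * p_y[0] / p_x
print(p_x, p_x_given_y0, p_y0_given_x)
```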

Page 6: Properties of PDFs

• Expectation: we can use the pdf p(x) to compute averages and expected values of quantities, denoted by:
  $E_{p(x)}\{f(x)\} = \int_x p(x)\, f(x)\, dx \quad \text{or} \quad = \sum_x p(x)\, f(x)$
• Properties:
  $E\{c\, f(x)\} = c\, E\{f(x)\}$
  $E\{f(x) + c\} = E\{f(x)\} + c$
  $E\{E\{f(x)\}\} = E\{f(x)\}$
• Mean: the expected value of x:
  $E_{p(x)}\{x\} = \int_{-\infty}^{\infty} p(x)\, x\, dx$
• Variance: the expected value of (x − mean)², i.e. how much x varies:
  $\mathrm{Var}\{x\} = E\{(x - E\{x\})^2\} = E\{x^2 - 2xE\{x\} + E\{x\}^2\} = E\{x^2\} - 2E\{x\}E\{x\} + E\{x\}^2 = E\{x^2\} - E\{x\}^2$
• Example: a speeding ticket. The fine is $0 with probability 0.8 and $20 with probability 0.2. What is the expected cost of speeding?
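Working the speeding-ticket question through the definitions above (the answer is not printed on the slide):

$E\{\text{fine}\} = 0.8 \times \$0 + 0.2 \times \$20 = \$4$

and, using the variance identity, $\mathrm{Var}\{\text{fine}\} = E\{\text{fine}^2\} - E\{\text{fine}\}^2 = 0.2 \times 400 - 4^2 = 64$.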

Page 7: The IID Assumption

• Most of the time, we will assume that a dataset is independent and identically distributed (IID).
• In many real situations, data is generated by some black-box phenomenon in an arbitrary order.
• Assume we are given a dataset:
  $\mathcal{D} = \{x_1, \ldots, x_N\}$
• "Independent" means that (given the model Θ) the probability of our data multiplies:
  $p(x_1, \ldots, x_N \mid \Theta) = \prod_{i=1}^{N} p_i(x_i \mid \Theta)$
• "Identically distributed" means that each marginal probability is the same for each data point:
  $p(x_1, \ldots, x_N \mid \Theta) = \prod_{i=1}^{N} p_i(x_i \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta)$
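A minimal sketch (not from the slides) of what the IID factorization buys computationally: the joint likelihood is a product of identical per-point terms, or a simple sum in log space; the Bernoulli model and parameter value here are hypothetical:

```python
# Under IID, p(x_1, ..., x_N | theta) factorizes into per-point terms.
import numpy as np

theta = 0.75                                  # hypothetical Bernoulli parameter P(x=1)
data = np.array([1, 1, 0, 1])                 # a small dataset D = {x_1, ..., x_N}

per_point = np.where(data == 1, theta, 1 - theta)   # p(x_i | theta) for each point
joint = per_point.prod()                             # prod_i p(x_i | theta)
log_joint = np.log(per_point).sum()                  # sum_i log p(x_i | theta)

print(joint, np.exp(log_joint))               # the two agree
```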

Page 8: Ex: Is a coin fair?

A stranger tells you his coin is fair.
Let's assume tosses are IID with P(H) = µ.
He tosses it 4 times, gets H H T H.
What can you say about µ?

Page 9: Bayesians & Frequentists

• Frequentists (Neyman/Pearson/Wald). An orthodox view that sampling is infinite and decision rules can be sharp.
• Bayesians (Bayes/Laplace/de Finetti). Unknown quantities are treated probabilistically and the state of the world can always be updated.
  de Finetti: p(event) = the (actuarially fair) price I would pay for a contract that pays $1 when the event happens.
• Likelihoodists (Fisher). Single-sample inference based on maximizing the likelihood function.

Page 10: Bayesians & Frequentists

• Frequentists:
  • Data are a repeatable random sample - there is a frequency
  • Underlying parameters remain constant during this repeatable process
  • Parameters are fixed
• Bayesians:
  • Data are observed from the realized sample
  • Parameters are unknown and described probabilistically
  • Data are fixed

Page 11: Frequentists

• Frequentists: classical / objective view / no priors
  every statistician should compute the same p(x), so no priors
  can't have a p(event) if it never happened
  avoid p(θ): there is 1 true model, not a distribution over them
  permitted: $p_\theta(x, y)$   forbidden: $p(x, y \mid \theta)$
• Frequentist inference: estimate one best model θ
  use the Maximum Likelihood Estimator (ML) (unbiased & minimum variance)
  do not depend on Bayes rule for learning
  $\theta_{ML} = \arg\max_\theta\, p(x_1, x_2, \ldots, x_n \mid \theta) = \arg\max_\theta\, p(\mathcal{D} \mid \theta)$, where $\mathcal{D}$ = Data

Page 12: Bayesians

• Bayesians: subjective view / priors are ok
  put a distribution or pdf on all variables in the problem
  even models & deterministic quantities (speed of light)
  use a prior p(θ) on the model θ before seeing any data
• Bayesian inference: use Bayes rule for learning, integrate over all unknown model (θ) variables

Page 13: Bayesian Inference

• Bayes rule can lead us to maximum likelihood.
• Assume we have a prior over models p(θ):
  $p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$
  (posterior = likelihood × prior / evidence)
• How to pick p(θ)?
  Pick: simpler θ is better
  Pick a form for mathematical convenience
• We have data (can assume IID): $\mathcal{D} = \{x_1, x_2, \ldots, x_N\}$
• Want to get a model to compute: $p(x)$
• Want p(x) given our data… How to proceed?

Page 14: Bayesian Inference

• Want p(x) given our data…
  $p(x \mid \mathcal{D}) = p(x \mid x_1, x_2, \ldots, x_N)$
  $p(x \mid \mathcal{D}) = \int p(x, \theta \mid \mathcal{D})\, d\theta$
  $= \int p(x \mid \theta, \mathcal{D})\, p(\theta \mid \mathcal{D})\, d\theta$
  $= \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$
  $= \int p(x \mid \theta)\, \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}\, d\theta$
  $= \int p(x \mid \theta)\, \frac{\prod_{i=1}^{N} p(x_i \mid \theta)\, p(\theta)}{p(\mathcal{D})}\, d\theta$
  Here $p(x \mid \theta)$ ranges over many models and $p(\theta \mid \mathcal{D})$ is the weight on each model.
  [Figure: a prior placing weight on discrete models θ = 1, θ = 2, θ = 3.]
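A minimal sketch (assumed setup, not from the slides) of the "many models, weight on each model" picture, using three candidate coin biases as the discrete models and the running H H T H data:

```python
# Full Bayesian prediction with three candidate coin-bias models.
import numpy as np

thetas = np.array([0.25, 0.50, 0.75])      # "many models"
prior  = np.array([1/3, 1/3, 1/3])         # prior weight on each model

data = [1, 1, 0, 1]                        # H H T H encoded as 1/0

# likelihood of the data under each model: prod_i p(x_i | theta)
lik = np.array([np.prod([t if x == 1 else 1 - t for x in data]) for t in thetas])

# posterior weights p(theta | D) = p(D | theta) p(theta) / p(D)
posterior = lik * prior
posterior /= posterior.sum()

# predictive probability of heads: sum_theta p(x=H | theta) p(theta | D)
p_heads = np.sum(thetas * posterior)
print(posterior, p_heads)
```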

Page 15: Bayesian Inference to MAP & ML

• The full Bayesian Inference integral can be mathematically tricky. MAP and ML are approximations of it…
  $p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, \frac{\prod_{i=1}^{N} p(x_i \mid \theta)\, p(\theta)}{p(\mathcal{D})}\, d\theta \;\approx\; \int p(x \mid \theta)\, \delta(\theta - \theta^*)\, d\theta$
• Maximum A Posteriori (MAP) is like Maximum Likelihood (ML) with a prior p(θ), which lets us prefer some models over others:
  $\theta^*_{MAP} = \arg\max_\theta \prod_{i=1}^{N} p(x_i \mid \theta)\, p(\theta)$
  $\theta^*_{ML} = \arg\max_\theta \prod_{i=1}^{N} p(x_i \mid \theta)$   (where $p(\theta)$ is uniform)
  In log form:
  $l_{MAP}(\theta) = l_{ML}(\theta) + \log p(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) + \log p(\theta)$
  [Figure: the posterior $p(\theta \mid \mathcal{D})$ peaking at $\theta^*$.]
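A small sketch (not from the slides) of ML versus MAP on a grid of candidate Bernoulli parameters; the prior $p(\mu) \propto \mu(1-\mu)$ is an illustrative choice, not one used in the lecture:

```python
# ML vs MAP for the coin example on a grid of candidate biases mu.
import numpy as np

mu = np.linspace(0.001, 0.999, 999)        # grid over candidate models
data = [1, 1, 0, 1]                        # H H T H
m, N = sum(data), len(data)

log_lik   = m * np.log(mu) + (N - m) * np.log(1 - mu)   # sum_i log p(x_i | mu)
log_prior = np.log(mu) + np.log(1 - mu)                 # illustrative prior, up to a constant

mu_ml  = mu[np.argmax(log_lik)]                 # ~ 0.75 = m/N
mu_map = mu[np.argmax(log_lik + log_prior)]     # pulled toward 1/2 by the prior
print(mu_ml, mu_map)
```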

Page 16: Ex: Is a coin fair?

A stranger tells you his coin is fair.
Let's assume tosses are IID with P(H) = µ.
He tosses it 4 times, gets H H T H.
What can you say about µ?
  $\mathcal{D} = (H, H, T, H)$

Page 17: Bernoulli Probability ML

• Bernoulli:
  $p(x) = \mu^{x} (1-\mu)^{1-x}, \quad x \in \{0, 1\}, \quad \mu \in [0, 1]$
  (0 ~ Tail, 1 ~ Head, µ = P(H); N trials/tosses, m heads/1s, N − m tails/0s)
• Log-Likelihood (IID):
  $\sum_{i=1}^{N} \log p(x_i \mid \mu) = \sum_{i=1}^{N} \log\left[\mu^{x_i} (1-\mu)^{1-x_i}\right]$
• Gradient = 0:
  $\frac{\partial}{\partial \mu} \sum_{i=1}^{N} \log\left[\mu^{x_i} (1-\mu)^{1-x_i}\right] = 0$
  $\frac{\partial}{\partial \mu} \sum_{i=1}^{N} \left[x_i \log \mu + (1-x_i) \log(1-\mu)\right] = 0$
  $\frac{\partial}{\partial \mu} \left[\sum_{i \in \text{class 1}} \log \mu + \sum_{i \in \text{class 0}} \log(1-\mu)\right] = 0$
  $\frac{\partial}{\partial \mu} \left[m \log \mu + (N-m) \log(1-\mu)\right] = 0$
  $\frac{m}{\mu} - \frac{N-m}{1-\mu} = 0$
  $m(1-\mu) - (N-m)\mu = 0$
  $m - N\mu = 0$
  $\mu_{ML} = \frac{m}{N}$
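Applying this result to the running coin example (a worked step, not printed on the slide): for $\mathcal{D} = (H, H, T, H)$ we have $m = 3$ heads in $N = 4$ tosses, so $\mu_{ML} = \frac{m}{N} = \frac{3}{4}$.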

Page 18: Bernoulli Bayes, Prior 1

• Assume prior µ = 1/2, a point-mass distribution.
• Posterior, with $\mathcal{D} = (H, H, T, H)$:
  $P(\mu = r \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mu = r) \times P(\mu = r)$
• If the prior is 0 for some value, the posterior will also be 0 at that value, no matter what data we see.

Page 19: Bernoulli Bayes, Prior 2

• Allow some chance of bias.
• Prior: P(µ = 1/2) = 1 − b,  P(µ = 3/4) = b
• Posterior, with $\mathcal{D} = (H, H, T, H)$:
  $P(\mu = \tfrac{1}{2} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mu = \tfrac{1}{2})\, P(\mu = \tfrac{1}{2}) = \left(\tfrac{1}{2}\right)^4 (1-b)$
  $P(\mu = \tfrac{3}{4} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mu = \tfrac{3}{4})\, P(\mu = \tfrac{3}{4}) = \left(\tfrac{3}{4}\right)^3 \tfrac{1}{4}\, b$

Page 20: Bernoulli Bayes, Prior 2

• Prior: P(µ = 1/2) = 1 − b,  P(µ = 3/4) = b
• Posterior, with $\mathcal{D} = (H, H, T, H)$:
  $P(\mu = \tfrac{1}{2} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mu = \tfrac{1}{2})\, P(\mu = \tfrac{1}{2}) = \left(\tfrac{1}{2}\right)^4 (1-b)$
  $P(\mu = \tfrac{3}{4} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mu = \tfrac{3}{4})\, P(\mu = \tfrac{3}{4}) = \left(\tfrac{3}{4}\right)^3 \tfrac{1}{4}\, b$
• The two are equal when $b = \tfrac{16}{43}$.
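The algebra behind that threshold (a worked step, not printed on the slide): setting the two unnormalized posteriors equal,

$\left(\tfrac{1}{2}\right)^4 (1-b) = \left(\tfrac{3}{4}\right)^3 \tfrac{1}{4}\, b \;\Rightarrow\; \tfrac{1}{16}(1-b) = \tfrac{27}{256}\, b \;\Rightarrow\; 16(1-b) = 27b \;\Rightarrow\; b = \tfrac{16}{43}$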

Page 21: Bernoulli Bayes, Prior 3

• Uniform prior: $\mu \sim U[0, 1]$, with $\mathcal{D} = (H, H, T, H)$
  $P(\mu = r \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mu = r)\, P(\mu = r)$
  $P(\mathcal{D} \mid \mu = r) = r^3 (1-r)$
  $\int_0^1 r^3 (1-r)\, dr = \tfrac{1}{4} - \tfrac{1}{5} = \tfrac{1}{20}$
  Hence the posterior is $P(\mu = r \mid \mathcal{D}) = 20\, r^3 (1-r)$
• Notice that for N tosses with m heads and N − m tails:
  $P(\mathcal{D} \mid \mu = r) = r^{m} (1-r)^{N-m}$
  Does this suggest a convenient prior?
• Where's the mode?

Page 22: Beta Distribution (Bishop PRML 2.1, especially 2.1.1)

• Distribution over $\mu \in [0, 1]$:
  $\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, \mu^{a-1} (1-\mu)^{b-1}$
• Here we switch to typical 'sloppy' notation, using µ both for the variable name and for its value.

Page 23: Bernoulli Bayes, Beta Prior (Bishop PRML 2.1, especially 2.1.1)

• The Beta distribution provides the conjugate prior for the Bernoulli distribution, i.e. the posterior distribution has the same form as the prior.
• The hyperparameters act like an effective number of observations + 1 (since the exponents are a − 1 and b − 1).
• All distributions in the Exponential Family (which includes the multinomial, Gaussian, and Poisson) have convenient conjugate priors (Bishop PRML 2.4).
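A minimal sketch (assumed setup, not from the slides) of the conjugate Beta-Bernoulli update for the coin example; the prior hyperparameters here are illustrative:

```python
# Conjugate Beta-Bernoulli update: prior Beta(a, b); after m heads in N tosses
# the posterior is Beta(a + m, b + N - m) -- the same functional form as the prior.
from scipy.stats import beta

a, b = 2, 2                      # illustrative prior hyperparameters
data = [1, 1, 0, 1]              # H H T H
m, N = sum(data), len(data)

a_post, b_post = a + m, b + N - m          # Beta(5, 3) here
posterior = beta(a_post, b_post)

print(posterior.mean())                    # a_post / (a_post + b_post) = 5/8
print(posterior.interval(0.95))            # a 95% credible interval for mu
```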

Page 24: Beta Distribution (Bishop PRML 2.1, especially 2.1.1)

[Figure: plots of the Beta distribution for various hyperparameter settings.]
Which distribution is this?

Page 25: Prior × Likelihood = Posterior (normalized) (Bishop PRML 2.1, especially 2.1.1)

$p(\mu) \times p(\mathcal{D} \mid \mu) \;\rightarrow\; p(\mu \mid \mathcal{D})$

Example: a prior with a = 2, b = 2 and a single observation H (x = 1) give a posterior with a = 3, b = 2.
Recall our earlier example of a Uniform prior; check this works…

Page 26: Properties of the Posterior (Bishop PRML 2.1, especially 2.1.1)

As the size of the data set, N, grows:

$E[\mu] = \frac{a_N}{a_N + b_N} = \frac{a_0 + m}{a_0 + m + b_0 + N - m} \;\rightarrow\; \frac{m}{N} = \mu_{ML}$

$\mathrm{var}[\mu] = \frac{a_N\, b_N}{(a_N + b_N)^2 (a_N + b_N + 1)} \;\rightarrow\; 0$

This is typical behavior for Bayesian learning.

Page 27: Prediction under the Posterior (Bishop PRML 2.1, especially 2.1.1)

What is the probability that the next coin toss will land heads up?

$p(x = H \mid \mathcal{D}) = \frac{a_N}{a_N + b_N}$

$\mathcal{D} = (H, H, T, H) \;\Rightarrow\; a_N = a_0 + 3, \quad b_N = b_0 + 1$

Page 28: Bayesian Decision Theory (Bishop PRML 1.5)

• Initially assume just 2 classes, $\mathcal{C}_1, \mathcal{C}_2$
• Given input data x, we want to determine which class is optimal
• Various possible criteria
• Need $p(\mathcal{C}_k \mid x)$, either directly (discriminative)
• Or by Bayes (generative):
  $p(\mathcal{C}_k \mid x) = \frac{p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(x)} = \frac{p(x, \mathcal{C}_k)}{p(x)}$
• Divide the input space into decision regions $\mathcal{R}_1, \mathcal{R}_2$ separated by decision boundaries, such that $x \in \mathcal{R}_k \Rightarrow$ assign $\mathcal{C}_k$
• How might we choose boundaries?
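A minimal sketch (not from the slides) of the generative route: class-conditional densities and priors combined by Bayes rule, then assign the class with the larger posterior. The Gaussian class-conditionals and their parameters are hypothetical:

```python
# Generative classification of a scalar input x via Bayes rule.
import numpy as np
from scipy.stats import norm

priors = {'C1': 0.6, 'C2': 0.4}                              # p(C_k)
likelihoods = {'C1': norm(0.0, 1.0), 'C2': norm(2.0, 1.0)}   # p(x | C_k), hypothetical

def posterior(x):
    """Return p(C_k | x) for both classes via Bayes rule."""
    joint = {k: likelihoods[k].pdf(x) * priors[k] for k in priors}   # p(x, C_k)
    evidence = sum(joint.values())                                   # p(x)
    return {k: v / evidence for k, v in joint.items()}

x = 1.3
post = posterior(x)
print(post, 'assign', max(post, key=post.get))   # pick the class with larger posterior
```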

Page 29: Criterion 1: Minimize Misclassification Rate (Bishop PRML 1.5)

$p(x, \mathcal{C}_k) = p(\mathcal{C}_k \mid x)\, p(x)$

[Figure: the joint densities $p(x, \mathcal{C}_1)$ and $p(x, \mathcal{C}_2)$ with a decision boundary (in general it could be more complex) and the optimal boundary; the shaded regions are where we assign class 1 but it should be 2, and where we assign class 2 but it should be 1.]
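A short worked statement of the criterion the figure illustrates (from Bishop PRML 1.5, not preserved in the extracted text): for two classes,

$p(\text{mistake}) = \int_{\mathcal{R}_1} p(x, \mathcal{C}_2)\, dx + \int_{\mathcal{R}_2} p(x, \mathcal{C}_1)\, dx$

which is minimized by assigning each x to the class with the larger $p(x, \mathcal{C}_k)$, i.e. the larger posterior $p(\mathcal{C}_k \mid x)$.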

Page 30: Criterion 2: Minimize Expected Loss (Bishop PRML 1.5)

Example loss matrix: classify medical images as 'cancer' or 'normal'.

[Table: a loss matrix with rows indexed by the truth and columns by the decision.]

Now choose regions $\mathcal{R}_j$ to minimize the Expected Loss.
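A minimal sketch (not from the slides; the loss values are hypothetical, chosen in the spirit of the cancer/normal example) of picking the decision that minimizes expected loss given the posterior:

```python
# Choose the decision j that minimizes expected loss sum_k L[k, j] * p(C_k | x).
import numpy as np

# loss[k, j] = cost of deciding class j when the truth is class k
# classes: 0 = cancer, 1 = normal (assumed ordering, hypothetical costs)
loss = np.array([[0.0, 1000.0],    # truth cancer: missing it is very costly
                 [1.0,    0.0]])   # truth normal: a false alarm costs little

def decide(posterior):
    """posterior: array of p(C_k | x); return the loss-minimizing decision j."""
    expected_loss = posterior @ loss          # E[loss | decide j] for each j
    return int(np.argmin(expected_loss))

print(decide(np.array([0.01, 0.99])))   # even a 1% cancer posterior yields decision 0 ('cancer')
```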

Page 31: Reject Option (Bishop PRML 1.5)

If $p(\mathcal{C}_1 \mid x) \approx p(\mathcal{C}_2 \mid x)$, then we are less confident about assigning a class.
One possibility is to reject, i.e. refuse to assign.