Statistical Learning
CSE 573, Lecture 16
© Daniel S. Weld
(slides which overlap and fix several errors)

Dec 22, 2015

Transcript
Page 1:

© Daniel S. Weld 1

Statistical Learning
CSE 573

Lecture 16 slides, which overlap and fix several errors

Page 2:

© Daniel S. Weld 2

Logistics

• Team Meetings
• Midterm
    Open book, notes
    Studying
• See AIMA exercises

Page 3:

© Daniel S. Weld 3

573 Topics

Agency

Problem Spaces

Search

Knowledge Representation & Inference

Planning

Supervised Learning

Logic-Based

Probabilistic

Reinforcement Learning

Page 4:

© Daniel S. Weld 4

Topics

• Parameter Estimation
    Maximum Likelihood (ML)
    Maximum A Posteriori (MAP)
    Bayesian
    Continuous case

• Learning Parameters for a Bayesian Network

• Naive Bayes
    Maximum Likelihood estimates
    Priors

• Learning Structure of Bayesian Networks

Page 5:

Coin Flip

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

Which coin will I use?

P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

Prior: Probability of a hypothesis before we make any observations

Page 6:

Coin Flip

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

Which coin will I use?

P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

Uniform Prior: All hypotheses are equally likely before we make any observations

Page 7:

Experiment 1: Heads

Which coin did I use?

P(C1|H) = ?    P(C2|H) = ?    P(C3|H) = ?

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

Page 8:

Experiment 1: Heads

Which coin did I use?

P(C1|H) = 0.066    P(C2|H) = 0.333    P(C3|H) = 0.600

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

Posterior: Probability of a hypothesis given data
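The numbers on this slide are just Bayes rule applied to each coin: P(Ci|H) = P(H|Ci) P(Ci) / P(H). A minimal sketch that reproduces them (the helper name `posterior` is ours, not from the slides):

```python
# Posterior over the three coins after a sequence of flips, via Bayes rule.
# P_HEADS holds P(H|C1), P(H|C2), P(H|C3) from the running example.
P_HEADS = [0.1, 0.5, 0.9]

def posterior(prior, flips):
    """Return [P(C1|flips), P(C2|flips), P(C3|flips)] given prior P(Ci)."""
    weights = []
    for p_h, p_c in zip(P_HEADS, prior):
        lik = 1.0
        for f in flips:                      # P(flips | Ci)
            lik *= p_h if f == "H" else 1.0 - p_h
        weights.append(lik * p_c)            # P(flips | Ci) P(Ci)
    z = sum(weights)                         # normalizer P(flips)
    return [w / z for w in weights]

uniform = [1/3, 1/3, 1/3]
print(posterior(uniform, "H"))   # ~[0.066, 0.333, 0.600], as on the slide
```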

Page 9:

Terminology

• Prior: Probability of a hypothesis before we see any data

• Uniform Prior: A prior that makes all hypotheses equally likely

• Posterior: Probability of a hypothesis after we see some data

• Likelihood: Probability of the data given a hypothesis

Page 10:

Experiment 2: Tails

Which coin did I use?

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

P(C1|HT) = ?    P(C2|HT) = ?    P(C3|HT) = ?

Page 11:

Experiment 2: Tails

Which coin did I use?

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

P(C1|HT) = 0.21    P(C2|HT) = 0.58    P(C3|HT) = 0.21
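Continuing the sketch from above, the same function reproduces the posterior after observing H then T:

```python
print(posterior(uniform, "HT"))  # ~[0.21, 0.58, 0.21], matching the slide
```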


Page 13:

Your Estimate?

What is the probability of heads after two experiments?

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

Most likely coin: C2

Best estimate for P(H): P(H|C2) = 0.5

Page 14:

Your Estimate?

Most likely coin: C2    (P(C2) = 1/3)

Best estimate for P(H): P(H|C2) = 0.5

Maximum Likelihood Estimate: The best hypothesis that fits the observed data, assuming a uniform prior

Page 15:

Using Prior Knowledge

• Should we always use a Uniform Prior?

• Background knowledge:
    Heads => we have a take-home midterm
    Dan doesn't like take-homes…
    => Dan is more likely to use a coin biased in his favor

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

Page 16:

Using Prior Knowledge

We can encode it in the prior:

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

Page 17:

Experiment 1: Heads

Which coin did I use?

P(C1|H) = ?    P(C2|H) = ?    P(C3|H) = ?

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

Page 18:

Experiment 1: Heads

Which coin did I use?

P(C1|H) = 0.006    P(C2|H) = 0.165    P(C3|H) = 0.829

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

Compare with the ML (uniform-prior) posterior after Exp 1:
P(C1|H) = 0.066    P(C2|H) = 0.333    P(C3|H) = 0.600
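The same sketch reproduces these numbers once the skewed prior from the earlier slide is plugged in:

```python
skewed = [0.05, 0.25, 0.70]      # prior encoding the background knowledge
print(posterior(skewed, "H"))    # ~[0.006, 0.165, 0.829]
```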

Page 19:

Experiment 2: Tails

Which coin did I use?

P(C1|HT) = ?    P(C2|HT) = ?    P(C3|HT) = ?

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

Page 20:

Experiment 2: Tails

Which coin did I use?

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485


Page 22:

Your Estimate?

What is the probability of heads after two experiments?

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

Most likely coin: C3

Best estimate for P(H): P(H|C3) = 0.9

Page 23:

Your Estimate?

Most likely coin: C3    (P(C3) = 0.70)

Best estimate for P(H): P(H|C3) = 0.9

Maximum A Posteriori (MAP) Estimate: The best hypothesis that fits the observed data, assuming a non-uniform prior

Page 24:

Did We Do The Right Thing?

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485

Page 25:

Did We Do The Right Thing?

P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

C2 and C3 are almost equally likely

Page 26:

A Better Estimate

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485

Recall: P(H|HT) = Σi P(H|Ci) P(Ci|HT) = 0.680

Page 27:

Bayesian Estimate

P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485

P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

C1    C2    C3

P(H|HT) = Σi P(H|Ci) P(Ci|HT) = 0.680

Bayesian Estimate: Minimizes prediction error, given data and (generally) assuming a non-uniform prior
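In the sketch from earlier, the Bayesian estimate is the posterior-weighted average of each coin's P(H), which reproduces the 0.680:

```python
post = posterior(skewed, "HT")                   # ~[0.035, 0.481, 0.485]
p_h = sum(p * w for p, w in zip(P_HEADS, post))  # sum_i P(H|Ci) P(Ci|HT)
print(p_h)                                       # ~0.680
```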

Page 28:

Comparison

After more experiments: HTH⁸ (i.e., H, T, then eight more heads)

ML (Maximum Likelihood): P(H) = 0.5; after 10 experiments: P(H) = 0.9

MAP (Maximum A Posteriori): P(H) = 0.9; after 10 experiments: P(H) = 0.9

Bayesian: P(H) = 0.68; after 10 experiments: P(H) = 0.9
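Reading HTH⁸ as H, T, then eight more heads (our interpretation, since the slide counts 10 experiments), the sketch reproduces all three rows:

```python
flips = "HT" + "H" * 8                          # the 10 experiments
ml_post = posterior(uniform, flips)             # ML: uniform prior
map_post = posterior(skewed, flips)             # MAP: skewed prior
best = lambda post: P_HEADS[post.index(max(post))]
bayes = sum(p * w for p, w in zip(P_HEADS, map_post))
print(best(ml_post), best(map_post), round(bayes, 2))  # 0.9 0.9 0.9
```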

Page 29:

Comparison

ML (Maximum Likelihood):
    Easy to compute

MAP (Maximum A Posteriori):
    Still easy to compute
    Incorporates prior knowledge

Bayesian:
    Minimizes error => great when data is scarce
    Potentially much harder to compute

Page 30:

Summary For Now

Estimate                         Prior      Hypothesis
Maximum Likelihood Estimate      Uniform    The most likely
Maximum A Posteriori Estimate    Any        The most likely
Bayesian Estimate                Any        Weighted combination

Page 31:

Continuous Case

• In the previous example, we chose from a discrete set of three coins

• In general, we have to pick from a continuous distribution of biased coins

Page 32:

Continuous Case

Page 33:

Continuous Case

[Plot: a prior density over the coin's bias, on the interval [0, 1]]

Page 34:

Continuous Case

[Plots: the prior, the posterior after Exp 1 (Heads), and the posterior after Exp 2 (Tails), each shown for a uniform prior and for a prior built from background knowledge; all are densities over the coin's bias on [0, 1]]

Page 35:

Continuous Case

Posterior after 2 experiments:

[Plots: the posterior density with a uniform prior and with background knowledge; the ML Estimate, MAP Estimate, and Bayesian Estimate are marked on the curves]
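One standard way to realize these curves is a Beta prior over the coin's bias, which is conjugate to coin flips; the Beta choice is our assumption, since the slides only show the plots. A sketch of the three estimates:

```python
# Beta(a, b) prior over the bias; after h heads and t tails the posterior
# is Beta(a + h, b + t).  Beta(1, 1) is the uniform prior; something like
# Beta(5, 2) could encode the "background knowledge" favoring heads.
def estimates(a, b, h, t):
    a2, b2 = a + h, b + t                   # posterior parameters
    ml = h / (h + t)                        # ignores the prior entirely
    map_ = (a2 - 1) / (a2 + b2 - 2)         # posterior mode
    bayes = a2 / (a2 + b2)                  # posterior mean
    return ml, map_, bayes

print(estimates(1, 1, 1, 1))   # uniform prior, data HT: (0.5, 0.5, 0.5)
print(estimates(1, 1, 9, 1))   # uniform prior after HTH^8: ML = 0.9
```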

Page 36:

After 10 Experiments…

Posterior:

[Plots: the posterior density with a uniform prior and with background knowledge; the ML Estimate, MAP Estimate, and Bayesian Estimate are marked on the curves]

Page 37:

After 100 Experiments...

Page 38:

© Daniel S. Weld 38

Topics

• Parameter Estimation
    Maximum Likelihood (ML)
    Maximum A Posteriori (MAP)
    Bayesian
    Continuous case

• Learning Parameters for a Bayesian Network

• Naive Bayes
    Maximum Likelihood estimates
    Priors

• Learning Structure of Bayesian Networks

Page 39:

© Daniel S. Weld 39

Review: Conditional Probability

• P(A | B) is the probability of A given B
• Assumes that B is the only info known.
• Defined by:

    P(A | B) = P(A ∧ B) / P(B)

[Venn diagram: the rectangle of all worlds ("True") containing overlapping regions A and B]

Page 40:

© Daniel S. Weld 40

Conditional Independence

[Venn diagram: the rectangle of all worlds ("True") with overlapping regions A and B]

A & B are not independent, since P(A|B) < P(A)

Page 41:

© Daniel S. Weld 41

Conditional Independence

[Venn diagrams: regions A and B inside "True", plus a region C illustrating that, once we condition on C, A and B become independent]

But: A & B are made independent by C

P(A|C) = P(A|B,C)

Page 42:

© Daniel S. Weld 42

Bayes Rule

P(H|E) = P(E|H) P(H) / P(E)

Simple proof from the definition of conditional probability:

1. P(H|E) = P(H ∧ E) / P(E)        (Def. cond. prob.)
2. P(E|H) = P(H ∧ E) / P(H)        (Def. cond. prob.)
3. P(H ∧ E) = P(E|H) P(H)          (Multiply #2 by P(H))
4. P(H|E) = P(E|H) P(H) / P(E)     (Substitute #3 into #1)

QED

Page 43:

© Daniel S. Weld 43

An Example Bayes Net

Earthquake    Burglary

Alarm

Nbr1Calls    Nbr2Calls    Radio

Pr(B=t) = 0.05    Pr(B=f) = 0.95

Pr(A | E, B):
    e, b:    0.9   (0.1)
    e, ¬b:   0.2   (0.8)
    ¬e, b:   0.85  (0.15)
    ¬e, ¬b:  0.01  (0.99)

Page 44:

© Daniel S. Weld 44

Given Parents, X is Independent of Non-Descendants

Page 45:

© Daniel S. Weld 45

Given Markov Blanket, X is Independent of All Other Nodes

MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))
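A small sketch of the formula, using a parent map as a hypothetical graph representation (and assuming Radio is a child of Earthquake, as in the usual version of the alarm example):

```python
def markov_blanket(parents, x):
    """parents maps each node to its set of parents."""
    children = {n for n, ps in parents.items() if x in ps}
    co_parents = {p for c in children for p in parents[c]}
    return (parents[x] | children | co_parents) - {x}

# The alarm network from the earlier slide:
parents = {"E": set(), "B": set(), "R": {"E"},
           "A": {"E", "B"}, "N1": {"A"}, "N2": {"A"}}
print(markov_blanket(parents, "E"))   # {'R', 'A', 'B'}
```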

Page 46:

Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F

...

We have: Bayes Net structure and observations
We need: Bayes Net parameters

Page 47:

Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F

...

P(B) = ?

[Plot: prior density] + data = [Plot: posterior density]

Now compute either the MAP or the Bayesian estimate

Page 48:

Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F

...

P(A|E,B) = ?    P(A|E,¬B) = ?    P(A|¬E,B) = ?    P(A|¬E,¬B) = ?
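Maximum-likelihood values for these entries are just conditional counts over the rows above; a sketch (the row encoding is ours):

```python
# The five example rows, columns E B R A J M, as T/F strings.
rows = [dict(zip("EBRAJM", r)) for r in
        ["TFTTFT", "FFFFFT", "FTFTTT", "FFFTTT", "FTFFFF"]]

def ml(target, **given):
    """ML estimate of P(target=T | given) by counting matching rows."""
    match = [r for r in rows if all(r[k] == v for k, v in given.items())]
    if not match:
        return None       # zero matching rows: the ML estimate is undefined
    return sum(r[target] == "T" for r in match) / len(match)

print(ml("A", E="T", B="T"))  # None -- no (E,B)=(T,T) row; priors fix this
print(ml("A", E="F", B="T"))  # 0.5
```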

Page 49:

Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F

...

P(A|E,B) = ?    P(A|E,¬B) = ?    P(A|¬E,B) = ?    P(A|¬E,¬B) = ?

[Plot: prior density] + data = [Plot: posterior density]

Now compute either the MAP or the Bayesian estimate

Page 50:

© Daniel S. Weld 50

Topics

• Parameter Estimation
    Maximum Likelihood (ML)
    Maximum A Posteriori (MAP)
    Bayesian
    Continuous case

• Learning Parameters for a Bayesian Network

• Naive Bayes
    Maximum Likelihood estimates
    Priors

• Learning Structure of Bayesian Networks

Page 51:

© Daniel S. Weld 51

Recap

• Given a BN structure (with discrete or continuous variables), we can learn the parameters of the conditional probability tables.

[Two example structures: a naive Bayes spam model, Spam → {Nigeria, Nude, Sex}; and the alarm network, {Earthqk, Burgl} → Alarm → {N1, N2}]
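For the naive Bayes case, the "ML estimates + priors" bullets amount to counting word occurrences per class and smoothing the counts. A sketch with add-one (Laplace) smoothing; the function and data layout are our own, not from the slides:

```python
from collections import Counter

def learn_nb(examples, k=1.0):
    """examples: (label, set_of_words) pairs.  Returns class priors and
    smoothed estimates of P(word present | label)."""
    n_label = Counter(lbl for lbl, _ in examples)
    vocab = {w for _, words in examples for w in words}
    counts = {lbl: Counter() for lbl in n_label}
    for lbl, words in examples:
        counts[lbl].update(words)
    prior = {lbl: n / len(examples) for lbl, n in n_label.items()}
    cond = {lbl: {w: (counts[lbl][w] + k) / (n_label[lbl] + 2 * k)
                  for w in vocab}
            for lbl in n_label}
    return prior, cond

prior, cond = learn_nb([("spam", {"nigeria", "nude"}),
                        ("ham", {"meeting"})])
print(cond["spam"]["nigeria"])   # (1+1)/(1+2) = 0.667
```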

Page 52:

What if we don’t know structure?

Page 53:

Learning The Structure of Bayesian Networks

• Search thru the space…
    of possible network structures!
    (for now, assume we observe all variables)

• For each structure, learn parameters
• Pick the one that fits the observed data best

    Caveat: won't we end up fully connected? Problem!?!?

• When scoring, add a penalty for model complexity

Page 54:

Learning The Structure of Bayesian Networks

• Search thru the space
• For each structure, learn parameters
• Pick the one that fits the observed data best

• Problem?
    Exponential number of networks!
    And we need to learn parameters for each!
    Exhaustive search is out of the question!

• So what now?

Page 55:

Learning The Structure of Bayesian Networks

Local search!
    Start with some network structure
    Try to make a change (add, delete, or reverse an edge)
    See if the new network is any better (a sketch follows below)

What should be the initial state?
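A sketch of the procedure in these bullets; the score is assumed to be a penalized log-likelihood (e.g., BIC), which supplies the complexity penalty from the earlier slide:

```python
import random

def local_search(initial, neighbors, score, steps=1000):
    """Greedy hill climbing over structures.  neighbors(net) should yield
    the nets reachable by one edge addition, deletion, or reversal, and
    score(net) should fit parameters and return a penalized likelihood."""
    current, best = initial, score(initial)
    for _ in range(steps):
        candidate = random.choice(list(neighbors(current)))
        s = score(candidate)
        if s > best:
            current, best = candidate, s
    return current
```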

Page 56:

Initial Network Structure?

• Uniform prior over random networks?

• Network which reflects expert knowledge?

Page 57:

© Daniel S. Weld 57

Learning BN Structure

Page 58:

The Big Picture

• We described how to do MAP (and ML) learning of a Bayes net (including structure)

• How would Bayesian learning (of BNs) differ?
    • Find all possible networks
    • Calculate their posteriors
    • When doing inference, return a weighted combination of predictions from all networks!