Page 1:

Bayesian Learning

Rong Jin

Page 2:

Outline

• MAP learning vs. ML learning
• Minimum description length principle
• Bayes optimal classifier
• Bagging

Page 3:

Maximum Likelihood Learning (ML)

• Find the best model by maximizing the log-likelihood of the training data
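In symbols, with H denoting the hypothesis space:

\[ h_{ML} = \arg\max_{h \in H} \log \Pr(D \mid h) \]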

Page 4:

Maximum A Posteriori Learning (MAP)

• ML learning
  • Models are determined solely by the training data
  • Unable to incorporate prior knowledge/preference about models
• Maximum a posteriori learning (MAP)
  • Knowledge/preference is incorporated through a prior
  • The prior encodes the knowledge/preference
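In symbols, MAP learning maximizes the posterior, which by Bayes' rule adds a prior term to the log-likelihood (the constant \Pr(D) drops out of the maximization):

\[ h_{MAP} = \arg\max_{h \in H} \Pr(h \mid D) = \arg\max_{h \in H} \left[ \log \Pr(D \mid h) + \log \Pr(h) \right] \]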

Page 5:

MAP

• Uninformative prior: regularized logistic regression
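For instance, assuming a zero-mean Gaussian prior \( \Pr(\mathbf{w}) \propto \exp(-\|\mathbf{w}\|^2 / 2\sigma^2) \) (an assumption here; it expresses no preference among individual words), MAP estimation becomes L2-regularized logistic regression:

\[ \mathbf{w}_{MAP} = \arg\max_{\mathbf{w}} \sum_{i=1}^{m} \log \Pr(y_i \mid \mathbf{x}_i, \mathbf{w}) - \frac{1}{2\sigma^2} \|\mathbf{w}\|^2 \]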

Page 6:

MAP

Consider text categorization:
• w_i: importance of the i-th word in classification
• Prior knowledge: the more common the word, the less important it is
• How to construct a prior according to this prior knowledge?

Page 7:

MAP

• An informative prior for text categorization
• n_i: the number of occurrences of the i-th word in the training data
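A plausible construction (the exact form is an assumption; only the idea is from the slide): a Gaussian prior whose strength grows with word frequency, so common words are shrunk harder toward zero importance:

\[ \Pr(\mathbf{w}) \propto \exp\left( -\sum_i \frac{n_i w_i^2}{2\sigma^2} \right) \]

The larger n_i, the stronger the penalty on w_i, encoding "the more common the word, the less important it is."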

Page 8:

MAP

Two correlated classification tasks: C1 and C2

• How to introduce an appropriate prior to capture this prior knowledge?

Page 9:

MAP

• Construct priors to capture the dependence between w_1 and w_2
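One natural choice (an assumption, not taken from the slide): a Gaussian prior on the difference of the two weight vectors,

\[ \Pr(\mathbf{w}_1, \mathbf{w}_2) \propto \exp\left( -\frac{\|\mathbf{w}_1 - \mathbf{w}_2\|^2}{2\sigma^2} \right) \]

which favors hypotheses where the two correlated tasks C1 and C2 use similar weights.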

Page 10:

Minimum Description Length (MDL) Principle

• Occam’s razor: prefer a simple hypothesis
• Simple hypothesis ⇒ short description length
• Minimum description length:

\[ h_{MDL} = \arg\min_{h \in H} \left[ L_{C_1}(h) + L_{C_2}(D \mid h) \right] \]

where L_C(x) is the description length for message x under coding scheme C:
• L_C1(h): bits for encoding hypothesis h
• L_C2(D|h): bits for encoding the data given h

Page 11:

MDL

Sender → Receiver: transmit the training data D
• Send only D?
• Send only h?
• Send h + D given h (the exceptions)?

Page 12:

Example: Decision Tree

H = decision trees, D = training data labels
• L_C1(h) is the number of bits to describe tree h
• L_C2(D|h) is the number of bits to describe D given tree h
• L_C2(D|h) = 0 if the examples are classified perfectly by h; only the exceptions need to be described
• h_MDL trades off tree size for training errors
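As a rough illustration (the costs here are hypothetical, chosen only to make the tradeoff concrete): if each tree node takes b bits to encode, and identifying one misclassified example among m training examples takes about \log_2 m bits, then

\[ h_{MDL} = \arg\min_h \left[ b \cdot \mathrm{size}(h) + \log_2 m \cdot \mathrm{errors}(h) \right] \]

so an extra node is justified only if it removes enough training errors to pay for its own description.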

Page 13:

MAP vs. MDL

MAP learning:

\[ h_{MAP} = \arg\max_h \Pr(D \mid h)\Pr(h) = \arg\min_h \left[ -\log_2 \Pr(D \mid h) - \log_2 \Pr(h) \right] \]

MDL learning:

\[ h_{MDL} = \arg\min_h \left[ L_{C_1}(h) + L_{C_2}(D \mid h) \right] \]

Under optimal codes, -\log_2 \Pr(h) is the description length of h and -\log_2 \Pr(D \mid h) is the description length of D given h, so MAP learning and MDL learning coincide.

Page 14:

Problems with Maximum Approaches

Consider three possible hypotheses:

\[ \Pr(h_1 \mid D) = 0.4, \quad \Pr(h_2 \mid D) = 0.3, \quad \Pr(h_3 \mid D) = 0.3 \]

Maximum approaches will pick h_1.

Given a new instance x:

\[ h_1(x) = +, \quad h_2(x) = -, \quad h_3(x) = - \]

Maximum approaches will output +. However, is this the most probable result?

Page 15:

Bayes Optimal Classifier (Bayesian Average)

Bayes optimal classification:

\[ \arg\max_{c \in C} \sum_{h \in H} \Pr(c \mid h, x) \Pr(h \mid D) \]

Example:

\[
\begin{aligned}
&\Pr(h_1 \mid D) = 0.4, \quad \Pr(+ \mid h_1, x) = 1, \quad \Pr(- \mid h_1, x) = 0 \\
&\Pr(h_2 \mid D) = 0.3, \quad \Pr(+ \mid h_2, x) = 0, \quad \Pr(- \mid h_2, x) = 1 \\
&\Pr(h_3 \mid D) = 0.3, \quad \Pr(+ \mid h_3, x) = 0, \quad \Pr(- \mid h_3, x) = 1
\end{aligned}
\]

\[
\sum_h \Pr(h \mid D)\Pr(+ \mid h, x) = 0.4, \qquad \sum_h \Pr(h \mid D)\Pr(- \mid h, x) = 0.6
\]

The most probable class is -, even though the MAP hypothesis h_1 predicts +.

Page 16:

Computational Issues

• Need to sum over all possible hypotheses
• It is expensive or impossible when the hypothesis space is large (e.g., decision trees)
• Solution: sampling!

Page 17:

Gibbs Classifier

Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this hypothesis to classify the new instance

• Surprising fact: the expected error of the Gibbs classifier is at most twice that of the Bayes optimal classifier:

\[ E[\mathrm{err}_{\mathrm{Gibbs}}] \le 2\, E[\mathrm{err}_{\mathrm{BayesOptimal}}] \]

• Improvement: sample multiple hypotheses from P(h|D) and average their classification results
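A minimal sketch of the Gibbs step in Python, assuming we can enumerate hypotheses together with their posterior weights (rarely feasible in practice, which motivates the bagging approximation below):

```python
import numpy as np

def gibbs_classify(hypotheses, posterior, x, seed=0):
    """Gibbs classifier: draw ONE hypothesis h ~ P(h|D), then use it alone.

    hypotheses: list of callables, each mapping an instance x to a class label
    posterior:  array of posterior probabilities P(h|D), summing to 1
    """
    rng = np.random.default_rng(seed)
    h = hypotheses[rng.choice(len(hypotheses), p=posterior)]
    return h(x)
```

Averaging the predictions of several such draws, instead of using a single one, is the improvement mentioned above.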

Page 18:

Bagging Classifiers

• In general, sampling from P(h|D) is difficult
  • P(h|D) is difficult to compute
  • P(h|D) is impossible to compute for non-probabilistic classifiers such as SVM
• Bagging classifiers: realize sampling from P(h|D) by sampling training examples

Page 19:

Bootstrap Sampling

Bagging = Bootstrap aggregating
• Bootstrap sampling: given a set D containing m training examples
• Create D_i by drawing m examples at random with replacement from D
• Each D_i is expected to leave out about 37% of the examples in D, since the probability that a given example is never drawn is

\[ \left( 1 - \frac{1}{m} \right)^m \approx e^{-1} \approx 0.368 \]

Page 20:

Bagging Algorithm

• Create k bootstrap samples D_1, D_2, …, D_k
• Train a distinct classifier h_i on each D_i
• Classify a new instance by a vote of the k classifiers with equal weights
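A minimal sketch of the algorithm in Python (assumptions: X and y are NumPy arrays, class labels are integer-coded, and scikit-learn's decision tree stands in for the unspecified base classifier):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=50, seed=0):
    """Train k classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)  # m draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify new instances by an equal-weight majority vote."""
    votes = np.stack([h.predict(X) for h in models])  # shape (k, n)
    # Majority vote per test instance; assumes non-negative integer labels.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```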

Page 21:

Bagging ≈ Bayesian Average

Bayesian average: sample hypotheses h_1, h_2, …, h_k from the posterior P(h|D), then average their predictions \sum_i \Pr(c \mid h_i, x).

Bagging: draw bootstrap samples D_1, D_2, …, D_k from D, train a classifier h_i on each D_i, then average their predictions \sum_i \Pr(c \mid h_i, x).

Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D).

Page 22:

Empirical Study of Bagging

Bagging decision trees:
• Bootstrap 50 different samples from the original training data
• Learn a decision tree over each bootstrap sample
• Predict the class labels for test instances by the majority vote of the 50 decision trees
• Bagging decision trees outperforms a single decision tree

Page 23:

Why does bagging work better than a single classifier?
• Real-valued case: \( y = f(x) + \epsilon \), where \( \epsilon \sim N(0, \sigma^2) \)
• \( \hat{f}(x \mid D) \) is a predictor learned from training data D

Bias-Variance Tradeoff
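In symbols, the standard decomposition of the expected squared error at a point x, whose three terms the labels below name:

\[
E\left[ \big( y - \hat{f}(x \mid D) \big)^2 \right]
= \sigma^2
+ \Big( f(x) - E_D\big[ \hat{f}(x \mid D) \big] \Big)^2
+ E_D\Big[ \big( \hat{f}(x \mid D) - E_D[\hat{f}(x \mid D)] \big)^2 \Big]
\]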

• Irreducible variance: the noise term \( \sigma^2 \), which no predictor can remove
• Model bias: the simpler \( \hat{f}(x \mid D) \) is, the larger the bias
• Model variance: the simpler \( \hat{f}(x \mid D) \) is, the smaller the variance

Page 24:

Bagging

• Bagging performs better than a single classifier because it effectively reduces the model variance

[Figure: bias and variance compared for a single decision tree vs. a bagged decision tree]