Bayesian Learning
Rong Jin
Dec 19, 2015
Outline
• MAP learning vs. ML learning
• Minimum description length principle
• Bayes optimal classifier
• Bagging
Maximum Likelihood Learning (ML)
• Find the best model by maximizing the log-likelihood of the training data
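A sketch in standard notation (w for the model parameters, D for the training data; the notation is assumed here):
\[
  w_{\mathrm{ML}} \;=\; \arg\max_{w}\; \log \Pr(D \mid w)
\]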
Maximum A Posteriori Learning (MAP)
• ML learning
  • Models are determined by the training data alone
  • Unable to incorporate prior knowledge/preference about models
• Maximum a posteriori learning (MAP)
  • Knowledge/preference is incorporated through a prior
Prior encodes the knowledge/preference
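In the same (assumed) notation, MAP adds the log-prior to the ML objective:
\[
  w_{\mathrm{MAP}} \;=\; \arg\max_{w}\; \big[\log \Pr(D \mid w) + \log \Pr(w)\big]
  \;=\; \arg\max_{w}\; \log \Pr(w \mid D)
\]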
MAP
• Uninformative prior: regularized logistic regression
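One common instance (a sketch, assuming a zero-mean Gaussian prior Pr(w) ∝ exp(−‖w‖²/2σ²), which is not spelled out on the slide):
\[
  w_{\mathrm{MAP}} \;=\; \arg\max_{w}\; \Big[\log \Pr(D \mid w) \;-\; \frac{\|w\|^2}{2\sigma^2}\Big],
\]
i.e., logistic regression with an L2 (ridge) penalty on the weights.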
MAP
Consider text categorization:
• w_i: importance of the i-th word in classification
• Prior knowledge: the more common the word, the less important it is
• How to construct a prior according to the prior knowledge?
MAP
• An informative prior for text categorization
• The occurrence count of the i-th word in the training data (written n_i below)
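One way to encode this preference (a sketch, not necessarily the exact prior used in the lecture; n_i is the occurrence count introduced above and σ an assumed shared scale):
\[
  \Pr(w) \;\propto\; \prod_i \exp\!\Big(-\frac{n_i\, w_i^2}{2\sigma^2}\Big),
\]
so frequent words (large n_i) are pulled more strongly toward w_i = 0, i.e., treated as less important.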
MAP
Two correlated classification tasks: C1 and C2
• How to introduce an appropriate prior to capture this prior knowledge?
MAP
• Construct priors to capture the dependence between w1 and w2
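One way to do this (a sketch; λ > 0 is an assumed coupling strength, not a quantity from the slide):
\[
  \Pr(w_1, w_2) \;\propto\; \exp\!\Big(-\frac{\lambda}{2}\,\|w_1 - w_2\|^2\Big),
\]
which favors solutions where the weight vectors of the two correlated tasks stay close to each other.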
Minimum Description Length (MDL) Principle
• Occam’s razor: prefer a simple hypothesis
• A simple hypothesis has a short description length
• Minimum description length: choose the hypothesis
  h_MDL = argmin_h [ L_C1(h) + L_C2(D|h) ]
  where L_C1(h) is the bits for encoding hypothesis h, and L_C2(D|h) is the bits for encoding the data given h
• L_C(x) is the description length for message x under coding scheme C
MDL
Sender → Receiver: how should the sender transmit the training labels D?
• Send only D?
• Send only h?
• Send h plus D given h (the exceptions)?
Example: Decision Tree
• H = decision trees, D = training data labels
• L_C1(h) is # bits to describe tree h
• L_C2(D|h) is # bits to describe D given tree h
  • L_C2(D|h) = 0 if examples are classified perfectly by h
  • Only need to describe the exceptions
• h_MDL trades off tree size for training errors
MAP vs. MDL
MAP learning
MDL learning
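Spelled out (a sketch of the standard correspondence; logs are base 2):
\[
  h_{\mathrm{MAP}} \;=\; \arg\max_h \Pr(D \mid h)\Pr(h)
  \;=\; \arg\min_h \big[-\log_2 \Pr(D \mid h) - \log_2 \Pr(h)\big]
\]
\[
  h_{\mathrm{MDL}} \;=\; \arg\min_h \big[L_{C_2}(D \mid h) + L_{C_1}(h)\big]
\]
The two coincide when the coding schemes spend −log₂ Pr(h) bits on the hypothesis and −log₂ Pr(D|h) bits on the data given the hypothesis.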
Problems with Maximum Approaches
Consider three possible hypotheses:
  Pr(h1|D) = 0.4, Pr(h2|D) = 0.3, Pr(h3|D) = 0.3
• Maximum approaches will pick h1
• Given a new instance x:
  h1(x) = +, h2(x) = −, h3(x) = −
• Maximum approaches will output +. However, is this the most probable result?
Bayes Optimal Classifier (Bayesian Average)
Bayes optimal classification: predict the class
  c* = argmax_c Σ_h Pr(c|h, x) Pr(h|D)
Example:
  Pr(h1|D) = 0.4, Pr(+|h1,x) = 1, Pr(−|h1,x) = 0
  Pr(h2|D) = 0.3, Pr(+|h2,x) = 0, Pr(−|h2,x) = 1
  Pr(h3|D) = 0.3, Pr(+|h3,x) = 0, Pr(−|h3,x) = 1
  Σ_h Pr(h|D) Pr(+|h,x) = 0.4,  Σ_h Pr(h|D) Pr(−|h,x) = 0.6
The most probable class is −
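As a quick sketch in code (the function name and data structures are illustrative, not from the lecture; the numbers are the ones from the example above):

    # A minimal sketch: Bayes optimal classification as a posterior-weighted vote.
    # posteriors[i] = Pr(h_i | D); class_probs[i][c] = Pr(c | h_i, x).
    def bayes_optimal_class(posteriors, class_probs):
        classes = class_probs[0].keys()
        scores = {c: sum(p * cp[c] for p, cp in zip(posteriors, class_probs))
                  for c in classes}
        return max(scores, key=scores.get), scores

    # Numbers from the slide's example: h1 predicts +, h2 and h3 predict -.
    posteriors = [0.4, 0.3, 0.3]
    class_probs = [{'+': 1.0, '-': 0.0},
                   {'+': 0.0, '-': 1.0},
                   {'+': 0.0, '-': 1.0}]
    print(bayes_optimal_class(posteriors, class_probs))  # ('-', {'+': 0.4, '-': 0.6})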
Computational Issues
• Need to sum over all possible hypotheses
• Expensive or impossible when the hypothesis space is large
  • E.g., decision trees
• Solution: sampling!
Gibbs Classifier
Gibbs algorithm:
1. Choose one hypothesis at random, according to p(h|D)
2. Use this hypothesis to classify the new instance
• Surprising fact: E[err_Gibbs] ≤ 2 · E[err_BayesOptimal]
• Improve by sampling multiple hypotheses from p(h|D) and averaging their classification results
Bagging Classifiers
• In general, sampling from p(h|D) is difficult
  • p(h|D) is difficult to compute
  • p(h|D) is impossible to compute for a non-probabilistic classifier such as SVM
• Bagging classifiers:
  • Realize sampling from p(h|D) by sampling training examples
Bootstrap Sampling
Bagging = Bootstrap aggregating
• Bootstrap sampling: given a set D containing m training examples
• Create D_i by drawing m examples at random with replacement from D
• Each D_i is expected to leave out about 37% of the examples in D (see the calculation below)
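Why roughly 0.37? A quick check (this calculation is not on the slide, but follows directly from drawing with replacement):
\[
  \Pr[\text{a fixed example is never drawn in } m \text{ draws}]
  = \Big(1 - \frac{1}{m}\Big)^{m} \;\longrightarrow\; e^{-1} \approx 0.368
  \quad \text{as } m \to \infty .
\]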
Bagging Algorithm
• Create k bootstrap samples D1, D2, …, Dk
• Train distinct classifier hi on each Di
• Classify new instance by classifier vote with equal weights
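A minimal code sketch of this procedure (the function names, the numpy-based bootstrap, and the scikit-learn usage in the comments are illustrative assumptions, not part of the lecture; integer class labels are assumed for the vote):

    import numpy as np

    def bagging_fit(train, X, y, k=50, seed=None):
        # `train` is any procedure that fits a classifier on (X, y) and returns
        # an object with a .predict(X) method. X and y are numpy arrays.
        rng = np.random.default_rng(seed)
        m = len(X)
        models = []
        for _ in range(k):
            idx = rng.integers(0, m, size=m)      # bootstrap: m draws with replacement
            models.append(train(X[idx], y[idx]))  # one classifier per bootstrap sample
        return models

    def bagging_predict(models, X):
        # Equal-weight majority vote; assumes labels are small non-negative integers.
        votes = np.stack([model.predict(X) for model in models])   # shape (k, n)
        return np.array([np.bincount(col).argmax() for col in votes.T])

    # Usage sketch, e.g. with decision trees (assumes scikit-learn is installed):
    # from sklearn.tree import DecisionTreeClassifier
    # models = bagging_fit(lambda X, y: DecisionTreeClassifier().fit(X, y),
    #                      X_train, y_train, k=50)
    # y_pred = bagging_predict(models, X_test)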
Bagging ≈ Bayesian Average
[Diagram: Bayesian averaging samples hypotheses h1, h2, …, hk from P(h|D) and averages Pr(c|h_i, x); bagging trains h1, h2, …, hk on bootstrap samples D1, D2, …, Dk of D and averages their predictions.]
Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D)
Empirical Study of Bagging
Bagging decision trees:
• Bootstrap 50 different samples from the original training data
• Learn a decision tree over each bootstrap sample
• Predict the class labels for test instances by the majority vote of the 50 decision trees
• Bagging decision trees outperforms a single decision tree
Why does Bagging work better than a single classifier?
• Real-valued case
• y ~ f(x) + ε, where ε ~ N(0, σ²)
• ĥ(x|D) is a predictor learned from training data D
Bias-Variance Tradeoff
• Irreducible variance: the noise σ² in y
• Model bias: the simpler the ĥ(x|D), the larger the bias
• Model variance: the simpler the ĥ(x|D), the smaller the variance
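Written out (the standard decomposition, sketched here; the expectation is over both the training set D and the label noise ε):
\[
  \mathbb{E}\big[(y - \hat{h}(x \mid D))^2\big]
  \;=\; \sigma^2
  \;+\; \big(f(x) - \mathbb{E}_D[\hat{h}(x \mid D)]\big)^2
  \;+\; \mathrm{Var}_D\big[\hat{h}(x \mid D)\big],
\]
i.e., irreducible variance plus squared model bias plus model variance. Averaging many predictors trained on bootstrap samples leaves the first two terms essentially unchanged but shrinks the variance term, which is the next slide's point.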
Bagging
• Bagging performs better than a single classifier because it effectively reduces the model variance
[Figure: bias and variance of a single decision tree vs. a bagging decision tree]