Bayesian Learning
Rong Jin
Dec 19, 2015
Outline
• MAP learning vs. ML learning
• Minimum description length principle
• Bayes optimal classifier
• Bagging
Maximum Likelihood Learning (ML)
• Find the best model by maximizing the log-likelihood of the training data
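A sketch in standard notation (w for the model parameters, D for the training data; the notation is assumed here):
\[
  w_{\mathrm{ML}} \;=\; \arg\max_{w}\; \log \Pr(D \mid w)
\]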
Maximum A Posteriori Learning (MAP)
• ML learning
  • Models are determined by the training data alone
  • Unable to incorporate prior knowledge/preference about models
• Maximum a posteriori learning (MAP)
  • Knowledge/preference is incorporated through a prior
Prior encodes the knowledge/preference
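In the same (assumed) notation, MAP adds the log-prior to the ML objective:
\[
  w_{\mathrm{MAP}} \;=\; \arg\max_{w}\; \big[\log \Pr(D \mid w) + \log \Pr(w)\big]
  \;=\; \arg\max_{w}\; \log \Pr(w \mid D)
\]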
MAP
• Uninformative prior: regularized logistic regression
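One common instance (a sketch, assuming a zero-mean Gaussian prior Pr(w) ∝ exp(−‖w‖²/2σ²), which is not spelled out on the slide):
\[
  w_{\mathrm{MAP}} \;=\; \arg\max_{w}\; \Big[\log \Pr(D \mid w) \;-\; \frac{\|w\|^2}{2\sigma^2}\Big],
\]
i.e., logistic regression with an L2 (ridge) penalty on the weights.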
MAP
Consider text categorization:
• w_i: importance of the i-th word in classification
• Prior knowledge: the more common the word, the less important it is
• How to construct a prior according to the prior knowledge?
MAP
• An informative prior for text categorization
• The occurrence count of the i-th word in the training data (written n_i below)
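One way to encode this preference (a sketch, not necessarily the exact prior used in the lecture; n_i is the occurrence count introduced above and σ an assumed shared scale):
\[
  \Pr(w) \;\propto\; \prod_i \exp\!\Big(-\frac{n_i\, w_i^2}{2\sigma^2}\Big),
\]
so frequent words (large n_i) are pulled more strongly toward w_i = 0, i.e., treated as less important.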
MAP
Two correlated classification tasks: C1 and C2
• How to introduce an appropriate prior to capture this prior knowledge?
MAP
• Construct priors to capture the dependence between w1 and w2
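One way to do this (a sketch; λ > 0 is an assumed coupling strength, not a quantity from the slide):
\[
  \Pr(w_1, w_2) \;\propto\; \exp\!\Big(-\frac{\lambda}{2}\,\|w_1 - w_2\|^2\Big),
\]
which favors solutions where the weight vectors of the two correlated tasks stay close to each other.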
Minimum Description Length (MDL) Principle
• Occam’s razor: prefer a simple hypothesis
• A simple hypothesis has a short description length
• Minimum description length: choose the hypothesis
  h_MDL = argmin_h [ L_C1(h) + L_C2(D|h) ]
  where L_C1(h) is the bits for encoding hypothesis h, and L_C2(D|h) is the bits for encoding the data given h
• L_C(x) is the description length for message x under coding scheme C
MDL
Sender → Receiver: how should the sender transmit the training labels D?
• Send only D?
• Send only h?
• Send h plus D given h (the exceptions)?
Example: Decision Tree
• H = decision trees, D = training data labels
• L_C1(h) is # bits to describe tree h
• L_C2(D|h) is # bits to describe D given tree h
  • L_C2(D|h) = 0 if examples are classified perfectly by h
  • Only need to describe the exceptions
• h_MDL trades off tree size for training errors
MAP vs. MDL
MAP learning
MDL learning
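Spelled out (a sketch of the standard correspondence; logs are base 2):
\[
  h_{\mathrm{MAP}} \;=\; \arg\max_h \Pr(D \mid h)\Pr(h)
  \;=\; \arg\min_h \big[-\log_2 \Pr(D \mid h) - \log_2 \Pr(h)\big]
\]
\[
  h_{\mathrm{MDL}} \;=\; \arg\min_h \big[L_{C_2}(D \mid h) + L_{C_1}(h)\big]
\]
The two coincide when the coding schemes spend −log₂ Pr(h) bits on the hypothesis and −log₂ Pr(D|h) bits on the data given the hypothesis.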
Problems with Maximum Approaches
Consider three possible hypotheses:
  Pr(h1|D) = 0.4, Pr(h2|D) = 0.3, Pr(h3|D) = 0.3
• Maximum approaches will pick h1
• Given a new instance x:
  h1(x) = +, h2(x) = −, h3(x) = −
• Maximum approaches will output +. However, is this the most probable result?
Bayes Optimal Classifier (Bayesian Average)
Bayes optimal classification: predict the class
  c* = argmax_c Σ_h Pr(c|h, x) Pr(h|D)
Example:
  Pr(h1|D) = 0.4, Pr(+|h1,x) = 1, Pr(−|h1,x) = 0
  Pr(h2|D) = 0.3, Pr(+|h2,x) = 0, Pr(−|h2,x) = 1
  Pr(h3|D) = 0.3, Pr(+|h3,x) = 0, Pr(−|h3,x) = 1
  Σ_h Pr(h|D) Pr(+|h,x) = 0.4,  Σ_h Pr(h|D) Pr(−|h,x) = 0.6
The most probable class is −
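As a quick sketch in code (the function name and data structures are illustrative, not from the lecture; the numbers are the ones from the example above):

    # A minimal sketch: Bayes optimal classification as a posterior-weighted vote.
    # posteriors[i] = Pr(h_i | D); class_probs[i][c] = Pr(c | h_i, x).
    def bayes_optimal_class(posteriors, class_probs):
        classes = class_probs[0].keys()
        scores = {c: sum(p * cp[c] for p, cp in zip(posteriors, class_probs))
                  for c in classes}
        return max(scores, key=scores.get), scores

    # Numbers from the slide's example: h1 predicts +, h2 and h3 predict -.
    posteriors = [0.4, 0.3, 0.3]
    class_probs = [{'+': 1.0, '-': 0.0},
                   {'+': 0.0, '-': 1.0},
                   {'+': 0.0, '-': 1.0}]
    print(bayes_optimal_class(posteriors, class_probs))  # ('-', {'+': 0.4, '-': 0.6})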
Computational Issues
• Need to sum over all possible hypotheses
• Expensive or impossible when the hypothesis space is large
  • E.g., decision trees
• Solution: sampling!
Gibbs Classifier
Gibbs algorithm:
1. Choose one hypothesis at random, according to p(h|D)
2. Use this hypothesis to classify the new instance
• Surprising fact: E[err_Gibbs] ≤ 2 · E[err_BayesOptimal]
• Improve by sampling multiple hypotheses from p(h|D) and averaging their classification results
Bagging Classifiers
• In general, sampling from p(h|D) is difficult
  • p(h|D) is difficult to compute
  • p(h|D) is impossible to compute for a non-probabilistic classifier such as SVM
• Bagging classifiers:
  • Realize sampling from p(h|D) by sampling training examples
Bootstrap Sampling
Bagging = Bootstrap aggregating
• Bootstrap sampling: given a set D containing m training examples
• Create D_i by drawing m examples at random with replacement from D
• Each D_i is expected to leave out about 37% of the examples in D (see the calculation below)
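Why roughly 0.37? A quick check (this calculation is not on the slide, but follows directly from drawing with replacement):
\[
  \Pr[\text{a fixed example is never drawn in } m \text{ draws}]
  = \Big(1 - \frac{1}{m}\Big)^{m} \;\longrightarrow\; e^{-1} \approx 0.368
  \quad \text{as } m \to \infty .
\]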
Bagging Algorithm
• Create k bootstrap samples D1, D2, …, Dk
• Train distinct classifier hi on each Di
• Classify new instance by classifier vote with equal weights
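A minimal code sketch of this procedure (the function names, the numpy-based bootstrap, and the scikit-learn usage in the comments are illustrative assumptions, not part of the lecture; integer class labels are assumed for the vote):

    import numpy as np

    def bagging_fit(train, X, y, k=50, seed=None):
        # `train` is any procedure that fits a classifier on (X, y) and returns
        # an object with a .predict(X) method. X and y are numpy arrays.
        rng = np.random.default_rng(seed)
        m = len(X)
        models = []
        for _ in range(k):
            idx = rng.integers(0, m, size=m)      # bootstrap: m draws with replacement
            models.append(train(X[idx], y[idx]))  # one classifier per bootstrap sample
        return models

    def bagging_predict(models, X):
        # Equal-weight majority vote; assumes labels are small non-negative integers.
        votes = np.stack([model.predict(X) for model in models])   # shape (k, n)
        return np.array([np.bincount(col).argmax() for col in votes.T])

    # Usage sketch, e.g. with decision trees (assumes scikit-learn is installed):
    # from sklearn.tree import DecisionTreeClassifier
    # models = bagging_fit(lambda X, y: DecisionTreeClassifier().fit(X, y),
    #                      X_train, y_train, k=50)
    # y_pred = bagging_predict(models, X_test)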
Bagging ≈ Bayesian Average
[Diagram: Bayesian averaging samples hypotheses h1, h2, …, hk from P(h|D) and averages Pr(c|h_i, x); bagging trains h1, h2, …, hk on bootstrap samples D1, D2, …, Dk of D and averages their predictions.]
Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D)
Empirical Study of Bagging
Bagging decision trees:
• Bootstrap 50 different samples from the original training data
• Learn a decision tree over each bootstrap sample
• Predict the class labels for test instances by the majority vote of the 50 decision trees
• Bagging decision trees outperforms a single decision tree
Why does Bagging work better than a single classifier?
• Real-valued case
• y ~ f(x) + ε, where ε ~ N(0, σ²)
• ĥ(x|D) is a predictor learned from training data D
Bias-Variance Tradeoff
• Irreducible variance: the noise σ² in y
• Model bias: the simpler the ĥ(x|D), the larger the bias
• Model variance: the simpler the ĥ(x|D), the smaller the variance
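Written out (the standard decomposition, sketched here; the expectation is over both the training set D and the label noise ε):
\[
  \mathbb{E}\big[(y - \hat{h}(x \mid D))^2\big]
  \;=\; \sigma^2
  \;+\; \big(f(x) - \mathbb{E}_D[\hat{h}(x \mid D)]\big)^2
  \;+\; \mathrm{Var}_D\big[\hat{h}(x \mid D)\big],
\]
i.e., irreducible variance plus squared model bias plus model variance. Averaging many predictors trained on bootstrap samples leaves the first two terms essentially unchanged but shrinks the variance term, which is the next slide's point.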
Bagging
• Bagging performs better than a single classifier because it effectively reduces the model variance
[Figure: bias and variance of a single decision tree vs. a bagging decision tree]