Ensemble Methods
NLP ML Web, Fall 2013
Andrew Rosenberg
TA/Grader: David Guy Brizan
How do you make a decision?
• What do you want for lunch today?
• What did you have last night?
• What are your favorite foods?
• Have you been to this restaurant before?
• How expensive is it?
• Are you on a diet?
• Moral/ethical/religious concerns.
Decision Making
• Collaboration.
• Weighing disparate sources of evidence.
Ensemble Methods
• Ensemble methods are based on the hypothesis that an aggregated decision from multiple experts can be superior to a decision from a single system.
Ensemble averaging
• Assume your prediction has noise in it:

  y = f(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)

• Average over k noisy predictions:

  \bar{y} = \frac{1}{k} \sum_{i=1}^{k} \left( f(x) + \epsilon^{(i)} \right)

• On multiple iid observations, the epsilons will cancel out, leaving a better estimate of the signal.

http://terpconnect.umd.edu/~toh/spectrum/SignalsAndNoise.html
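The variance-reduction claim above can be checked numerically. A minimal NumPy sketch (not from the slides; the signal `f`, the noise level, and `k` are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true signal, chosen only for illustration
    return np.sin(x)

x = np.linspace(0, 2 * np.pi, 100)
sigma = 0.5
k = 50

# k iid noisy observations: y_i = f(x) + eps_i, eps_i ~ N(0, sigma^2)
observations = np.array([f(x) + rng.normal(0, sigma, size=x.shape)
                         for _ in range(k)])

# Averaging the k observations shrinks the noise variance by roughly 1/k
y_bar = observations.mean(axis=0)

single_mse = np.mean((observations[0] - f(x)) ** 2)   # close to sigma^2
averaged_mse = np.mean((y_bar - f(x)) ** 2)           # close to sigma^2 / k
print(single_mse, averaged_mse)
```

The averaged estimate's error is roughly a factor of k smaller than a single observation's, which is the point of the slide.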
Combining disparate evidence
• Early Fusion - combine features.

[Diagram: acoustic, video, and lexical features are concatenated into one feature vector, which feeds a single classifier.]
Combining disparate evidence
• Late Fusion - combine predictions.

[Diagram: acoustic, video, and lexical features each feed their own classifier; the three predictions are merged.]
Classifier Fusion
• Construct an answer from k predictions.

[Diagram: a test instance is passed to classifiers C1, C2, C3, C4; the same features feed each classifier, and their predictions are merged.]
Majority Voting
• Each classifier generates a prediction and confidence score.
• Choose the prediction that receives the most “votes” from the ensemble.

[Diagram: the features feed each classifier; predictions are combined by summing votes.]
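A minimal sketch of majority voting (not from the slides; the labels are hypothetical):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label that receives the most votes from the ensemble."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three classifiers for one test instance
votes = ["spam", "ham", "spam"]
print(majority_vote(votes))  # spam
```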
Weighted Majority Voting
• Most classifiers can be interpreted as delivering a distribution over predictions.
• Rather than sum the number of votes, generate an average distribution from the sum.
• This is the same as taking a vote where each prediction contributes its confidence.

[Diagram: the features feed each classifier; the predicted distributions are combined by a weighted sum.]
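Averaging distributions instead of counting votes can be sketched as follows (the three distributions are made up for illustration):

```python
import numpy as np

def soft_vote(distributions):
    """Average the per-classifier distributions over labels; predict the argmax."""
    avg = np.mean(distributions, axis=0)
    return avg, int(np.argmax(avg))

# Hypothetical distributions over 3 labels from 3 classifiers
dists = np.array([
    [0.6, 0.3, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.4, 0.3],
])
avg, label = soft_vote(dists)
print(avg, label)
```

Note that a plain majority vote here would pick label 1 (two of the three classifiers rank it first), but the confidence-weighted average picks label 0, because the first classifier is much more confident.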
Sum, Max, Min
• Majority voting can be viewed as summing the scores from each ensemble member.
• Other aggregation functions can be used, including:
  • maximum
  • minimum
• What is the implication of these?

[Diagram: the features feed each classifier; an aggregator combines the scores.]
Second-tier classifier
• Classifier predictions are used as input features for a second classifier.
• How should the second-tier classifier be trained?

[Diagram: the features feed each first-tier classifier; their predictions are the inputs to a second-tier classifier.]
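A minimal sketch of the second-tier idea (often called stacking). Everything here is synthetic: the "first-tier scores" are simulated as the true label plus noise of varying strength, and the second tier is a simple least-squares combiner rather than any particular classifier:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical first-tier scores on held-out data: each column is one
# classifier's score for the positive class (n examples x 3 classifiers),
# with the first expert the most reliable (smallest noise).
n = 200
y = rng.integers(0, 2, size=n)
scores = np.column_stack([
    y + rng.normal(0, s, size=n) for s in (0.4, 0.8, 1.2)
])

# Second-tier model: a least-squares linear combiner trained on the
# first-tier outputs. It learns how much to trust each expert.
X = np.column_stack([scores, np.ones(n)])   # add a bias column
w, *_ = np.linalg.lstsq(X, y, rcond=None)

pred = (X @ w > 0.5).astype(int)
accuracy = (pred == y).mean()
```

In practice the second tier should be trained on predictions for data the first-tier classifiers did not see (held-out or cross-validated), otherwise it learns to trust overfit experts; that is the question the slide raises.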
Classifier Fusion
• Advantages
  • Experts can be trained separately on specialized data.
  • Training can be quicker, due to smaller data sets and lower feature space dimensionality.
• Disadvantages
  • Interactions across feature sets may be missed.
  • Explanation of how and why it works can be limited.
Bagging
• Bootstrap Aggregating.
• Train k models on different bootstrap samples of the training data.
• Predict by averaging the results of the k models.
• A simple instance of majority voting.
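A toy sketch of bagging (not from the slides). The "model" here is deliberately trivial — a 1-D threshold set at the mean of the sampled data, which works only because this toy problem is balanced around the class boundary — so the focus stays on the bootstrap-and-vote loop:

```python
import numpy as np

rng = np.random.default_rng(2)

def train_stump(X_sample, y_sample):
    """Toy 'model': threshold at the sample mean (illustration only)."""
    return X_sample.mean()

def bagged_predict(thresholds, x):
    """Majority vote over the k bootstrapped models."""
    votes = [1 if x > t else 0 for t in thresholds]
    return int(sum(votes) > len(votes) / 2)

# Toy 1-D data: class 0 centered at 0, class 1 centered at 2
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])
y = np.array([0] * 100 + [1] * 100)

# Bagging: train k models, each on a bootstrap sample (drawn with replacement)
k = 25
thresholds = []
for _ in range(k):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
    thresholds.append(train_stump(X[idx], y[idx]))

pred = np.array([bagged_predict(thresholds, x) for x in X])
accuracy = (pred == y).mean()
```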
Model averaging
• Seen in Language Modeling.
• Take, for example, a linear classifier:

  y = f(W^T x + b)

• Average the model parameters:

  W^* = \frac{1}{k} \sum_i W_i \qquad b^* = \frac{1}{k} \sum_i b_i

• As in ensemble averaging, each trained model's parameters can be viewed as noisy estimates, and the noise cancels in the average:

  W^* = \frac{1}{k} \sum_i (W_i + \epsilon) \qquad b^* = \frac{1}{k} \sum_i (b_i + \epsilon)
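A small numerical sketch of parameter averaging (not from the slides; the "trained" parameters are simulated as a true weight vector plus Gaussian noise, standing in for k separate training runs):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical: k linear classifiers, each with parameters (W_i, b_i)
# that are noisy versions of the same underlying model
W_true, b_true = np.array([1.5, -2.0]), 0.5
k = 20
Ws = [W_true + rng.normal(0, 0.3, size=2) for _ in range(k)]
bs = [b_true + rng.normal(0, 0.3) for _ in range(k)]

# Model averaging: average the parameters themselves
W_star = np.mean(Ws, axis=0)
b_star = np.mean(bs)

def predict(W, b, x):
    """y = f(W^T x + b) with f a step function."""
    return 1 if W @ x + b > 0 else 0

# The averaged parameters sit much closer to the underlying model
# than any single training run's parameters typically do
err_avg = np.linalg.norm(W_star - W_true)
print(err_avg)
```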
Mixture of Experts
• Can we do better than averaging over all training points?
• Look at the input data to see which points are better classified by which classifiers.
• Allow each expert to focus on those cases where it’s already doing better than average.

Mixture of Experts
• The array of p’s is called a “gating network”.
• Optimize p_i as part of the loss function.

[Diagram: the features feed each expert classifier; each expert's output is weighted by its gating probability p before the outputs are combined.]
Probability Correct under a mixture of experts
221
1 ||||
2)|(
cio
cd
i
ci
c epMoGdp−−
∑=π
From Hinton Lecture
prob. desired output on c
mixing coefficient for i on c
gaussian loss between desired and
observed output
Gating network Gradient
)(2
log)|(log
221
221
221
1
||||
||||
||||
2
ci
c
j
cjo
cdcj
cio
cdci
ci
c
cio
cd
i
ci
c
odep
epoE
epMoEdp
−−=∂
∂
−=−
∑
∑
−−
−−
−−
π
posterior probability of expert iFrom Hinton Lecture
AdaBoost
• Adaptive Boosting.
• Construct an ensemble of “weak” classifiers.
  • Typically single-split decision trees.
• Identify weights for each classifier.
Weak Classifiers
• Weak classifiers:
  • low performance (slightly better than chance)
  • high variance
  • (for AdaBoost) should have uncorrelated errors
Boosting Hypothesis
• The existence of a weak learner implies the existence of a strong learner.
AdaBoost Decision Function
• AdaBoost generates a prediction from a weighted sum of the predictions of each classifier.
• The AdaBoost algorithm determines the weights.
• Similar to systems that use a second-tier classifier to learn a combination function.

  C(x) = \alpha_1 C_1(x) + \alpha_2 C_2(x) + \ldots + \alpha_k C_k(x)
The weight training is different from any loss function we’ve used.
AdaBoost training algorithm
• Repeat:
  • Identify the best unused classifier C_i.
  • Assign it a weight based on its performance.
  • Update the weights of each data point based on whether or not it is classified correctly.
• Until performance converges or all classifiers are included.
Identify the best classifier
• Generate hypotheses using each unused classifier.
• Calculate the weighted error using the current data point weights.
  • Data point weights are initialized to one.

  W_e = \sum_{y_i \neq k_m(x_i)} w_i^{(m)}

  (the total weight of the misclassified points — how many errors were made)
Generate a weight for the classifier
• The larger the reduction in error, the larger the classifier weight:

  e_m = \frac{W_e}{W_m}   (weighted error rate: misclassified weight over total weight)

  \alpha_m = \frac{1}{2} \ln \left( \frac{1 - e_m}{e_m} \right)   (new classifier weight)
Data Point weighting
• If data point i was not correctly classified, its weight increases:

  w_i^{(m+1)} = w_i^{(m)} e^{\alpha_m} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}}   (multiplier > 1)

• If data point i was correctly classified, its weight decreases:

  w_i^{(m+1)} = w_i^{(m)} e^{-\alpha_m} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}}   (multiplier < 1)
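The training loop above can be sketched end-to-end. This is a minimal NumPy implementation on toy 1-D data with a pool of decision stumps as the weak classifiers; unlike the slides' "unused classifier" wording, this sketch allows a stump to be re-selected, which is how AdaBoost is usually stated:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy 1-D data with labels in {-1, +1}
X = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
y = np.array([-1] * 50 + [1] * 50)

def stump(threshold, sign):
    """Single-split decision tree: predict sign when x > threshold."""
    return lambda x: sign * np.where(x > threshold, 1, -1)

# Pool of candidate weak classifiers
candidates = [stump(t, s) for t in np.linspace(-2, 2, 21) for s in (1, -1)]

w = np.ones(len(X))      # data point weights, initialized to one
ensemble = []            # list of (alpha_m, classifier) pairs

for m in range(10):
    # Identify the classifier with the lowest weighted error e_m = W_e / W_m
    errors = [np.sum(w[c(X) != y]) / np.sum(w) for c in candidates]
    best = int(np.argmin(errors))
    e_m = errors[best]
    if e_m >= 0.5:       # no remaining weak learner beats chance
        break
    alpha_m = 0.5 * np.log((1 - e_m) / e_m)   # classifier weight
    c = candidates[best]
    # Reweight: misclassified points get e^{+alpha}, correct ones e^{-alpha}
    w = w * np.exp(-alpha_m * y * c(X))
    ensemble.append((alpha_m, c))

def predict(x):
    """Weighted sum of the ensemble's predictions: sign of C(x)."""
    return np.sign(sum(a * c(x) for a, c in ensemble))

accuracy = (predict(X) == y).mean()
```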
Random Forests
• Random Forests are similar to AdaBoost with decision trees, but without the adaptive training.
• An ensemble of classifiers is trained, each on a different subset of features and a different subset of data points.
  • Random subspace projection.
Decision Tree

[Example tree over the world state: Is it raining? yes → P(wet) = 0.95; no → Is the sprinkler on? yes → P(wet) = 0.9; no → P(wet) = 0.1.]
Construct a Forest of Trees

[Diagram: trees t1 … tT each vote on category c.]
Training Algorithm
• Divide the training data into K subsets of data points and M subsets of variables.
  • Improved generalization.
  • Reduced memory requirements.
• Train a unique decision tree on each subset.
  • Simple multithreading.
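A toy sketch of the random-subspace idea (not from the slides): each "tree" is simplified to a single-split classifier, and each is trained on a bootstrap sample of the data points and a random subset of the features, then the forest predicts by majority vote:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data: 4 features, only feature 2 is informative
n = 300
X = rng.normal(0, 1, size=(n, 4))
y = (X[:, 2] > 0).astype(int)

def train_tree(X_sub, y_sub, feats):
    """Toy 'tree': pick the (feature, sign) from the random feature subset
    whose sign best predicts the label, i.e. a single-split tree."""
    best, best_acc = None, -1.0
    for f in feats:
        for sign in (1, -1):
            acc = ((sign * X_sub[:, f] > 0).astype(int) == y_sub).mean()
            if acc > best_acc:
                best, best_acc = (f, sign), acc
    return best

# Forest: each tree sees a bootstrap sample and a random subspace of features
trees = []
for _ in range(25):
    idx = rng.integers(0, n, size=n)              # random subset of data points
    feats = rng.choice(4, size=2, replace=False)  # random subset of variables
    trees.append(train_tree(X[idx], y[idx], feats))

def forest_predict(X):
    """Majority vote over the forest."""
    votes = np.stack([(s * X[:, f] > 0).astype(int) for f, s in trees])
    return (votes.mean(axis=0) > 0.5).astype(int)

accuracy = (forest_predict(X) == y).mean()
```

Trees whose random subspace happens to include the informative feature learn the concept; the others vote roughly at chance, and the majority vote recovers good overall accuracy.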
Handling Class Imbalances
• Class imbalance, or a skewed class distribution, occurs when there are not equal numbers of each label.
• Class imbalance presents a number of challenges:
  • Density estimation
    • Low priors can lead to poor estimation of minority classes.
  • Loss functions
    • Since the loss of each point is equal, getting the many majority class points correct dominates training.
  • Evaluation
    • Accuracy is less informative.
Impact on Accuracy
• Example from Information Retrieval:
• Find 10 relevant documents from a set of 100.

                          True Values
                      Positive   Negative
  Hyp     Positive        0          0
  Values  Negative       10         90

  Accuracy = 90%
Contingency Table

                          True Values
                      Positive          Negative
  Hyp     Positive    True Positive     False Positive
  Values  Negative    False Negative    True Negative

  Accuracy = \frac{TP + TN}{TP + FP + TN + FN}
F-Measure
• Precision: how many hypothesized events were true events.

  P = \frac{TP}{TP + FP}

• Recall: how many of the true events were identified.

  R = \frac{TP}{TP + FN}

• F-Measure: harmonic mean of precision and recall.

  F = \frac{2PR}{P + R}
F-Measure
• F-measure can be weighted to favor precision or recall:
  • β > 1 favors recall
  • β < 1 favors precision

  F_\beta = \frac{(1 + \beta^2) P R}{(\beta^2 P) + R}
F-Measure

                          True Values
                      Positive   Negative
  Hyp     Positive        0          0
  Values  Negative       10         90

  P = 0,  R = 0,  F1 = 0

F-Measure

                          True Values
                      Positive   Negative
  Hyp     Positive       10         50
  Values  Negative        0         40

  P = 10/60,  R = 1,  F1 = .29

F-Measure

                          True Values
                      Positive   Negative
  Hyp     Positive        9          1
  Values  Negative        1         89

  P = .9,  R = .9,  F1 = .9
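The three tables above can be checked with a small helper that computes precision, recall, and F-measure directly from contingency-table counts:

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta from contingency-table counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

# The three slide examples (10 relevant documents out of 100):
print(f_measure(tp=0, fp=0, fn=10))    # P = 0,    R = 0,  F1 = 0
print(f_measure(tp=10, fp=50, fn=0))   # P = 10/60, R = 1, F1 ≈ .29
print(f_measure(tp=9, fp=1, fn=1))     # P = .9,   R = .9, F1 = .9
```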
ROC and AUC
• It is common to plot classifier performance at a variety of settings or thresholds.
• Receiver Operating Characteristic (ROC) curves plot true positives against false positives.
• The overall performance is calculated by the Area Under the Curve (AUC).
Skew in Classifier Training
• Most classifiers train better with balanced training data.
• Bayesian methods:
  • Reliance on a prior to weight classes.
  • Estimation of the class-conditioned density is impacted by skew in the number of samples.
• Loss functions:
  • There is more pressure to set the decision boundary to favor the majority classes.
Skew in Classifier Training

[Figure: at the same distance from the optimal decision boundary, the majority class contributes twice as many errors, pulling the learned boundary toward the minority class.]
Sampling
• Artificial manipulation of the number of training samples can help reduce the impact of class imbalance.
• Undersampling:
  • Randomly select N_m data points from the majority class for training.
• Oversampling:
  • Reproduce the minority class points until the class sizes are balanced.
Ensemble Sampling
• Repeat undersampling N_M / N_m times with different samples of the majority class data points.
• Train N_M / N_m classifiers; combine with majority voting.

[Diagram: classifiers C1, C2, C3 trained on different majority-class samples are merged.]
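A toy sketch of ensemble sampling (not from the slides; the 1-D data and the midpoint-threshold "classifier" are stand-ins so the resampling loop is the focus):

```python
import numpy as np

rng = np.random.default_rng(6)

# Imbalanced toy data: 900 majority points (label 0), 100 minority (label 1)
X_maj = rng.normal(0, 1, size=900)
X_min = rng.normal(2, 1, size=100)
N_M, N_m = len(X_maj), len(X_min)

def train(X0, X1):
    """Toy classifier: threshold halfway between the two class means."""
    return (X0.mean() + X1.mean()) / 2

# Ensemble sampling: repeat undersampling N_M / N_m times, each time pairing
# the full minority class with a fresh majority sample of size N_m
thresholds = []
for _ in range(N_M // N_m):
    sample = rng.choice(X_maj, size=N_m, replace=False)
    thresholds.append(train(sample, X_min))

def predict(x):
    """Majority vote over the undersampled ensemble."""
    votes = [x > t for t in thresholds]
    return int(sum(votes) > len(votes) / 2)

minority_recall = np.mean([predict(x) for x in X_min])
```

Unlike a single round of undersampling, this uses every majority-class point in some ensemble member while each individual classifier still trains on balanced data.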
Ensemble Methods
• A very simple and effective technique to improve classification performance.
  • Netflix Prize, Watson, etc.
• Mathematical justification.
• Intuitive appeal to how decisions are made by people and organizations.
• Can allow for modular training.