Page 1

Ensemble Methods
NLP ML Web

Fall 2013
Andrew Rosenberg

TA/Grader: David Guy Brizan

Page 2

How do you make a decision?

• What do you want for lunch today?

• What did you have last night?

• What are your favorite foods?

• Have you been to this restaurant before?

• How expensive is it?

• Are you on a diet?

• Moral/ethical/religious concerns.

Page 3

Decision Making

• Collaboration.

• Weighing disparate sources of evidence.

Page 4

Ensemble Methods

• Ensemble Methods are based on the hypothesis that an aggregated decision from multiple experts can be superior to a decision from a single system.

Page 5

Ensemble averaging

• Assume your prediction has noise in it:

y = f(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)

\bar{y} = \frac{1}{k} \sum_{i}^{k} \left( f(x) + \epsilon^{(i)} \right)

• Over multiple iid observations, the epsilons cancel out, leaving a better estimate of the signal (sketched below).

http://terpconnect.umd.edu/~toh/spectrum/SignalsAndNoise.html
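A minimal Python sketch of this effect (the signal value, noise level, and k are all assumed for illustration): averaging k iid noisy observations shrinks the noise variance by a factor of k.

import numpy as np

rng = np.random.default_rng(0)
f_x = 2.0     # true signal value f(x) (assumed)
sigma = 1.0   # noise standard deviation (assumed)
k = 1000      # number of iid noisy observations

y = f_x + rng.normal(0.0, sigma, size=k)  # y_i = f(x) + eps_i
print(y.mean())  # close to 2.0; the mean of k observations has variance sigma^2 / k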

Page 6

Combining disparate evidence

• Early Fusion - combine features.

[Figure: acoustic, video, and lexical features are concatenated and fed to a single classifier]

Page 7

Combining disparate evidence

• Late Fusion - combine predictions.

[Figure: acoustic, video, and lexical features each feed their own classifier; the three predictions are merged]

Page 8

Classifier Fusion

• Construct an answer from k predictions.

[Figure: a test instance is fed to classifiers C1, C2, C3, C4]

Page 9

Classifier Fusion

• Construct an answer from k predictions.

[Figure: one feature set feeds several classifiers; their predictions are merged]

Page 10

Majority Voting

• Each classifier generates a prediction and confidence score.

• Choose the prediction that receives the most "votes" from the ensemble (see the sketch below).

[Figure: one feature set feeds several classifiers; their votes are summed]
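A minimal Python sketch of hard majority voting (the prediction matrix is assumed; in practice each row would come from a trained classifier):

import numpy as np

preds = np.array([   # 3 classifiers' label predictions on 4 test instances
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
])

# For each test instance (column), choose the label receiving the most votes.
vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
print(vote)  # -> [0 1 1 0]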

Page 11

Weighted Majority Voting

• Most classifiers can be interpreted as delivering a distribution over predictions.

• Rather than sum the number of votes, generate an average distribution from the sum.

• This is the same as taking a vote where each prediction contributes its confidence (sketched below).

[Figure: one feature set feeds several classifiers; their distributions are combined by a weighted sum]
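A minimal Python sketch of the weighted (soft) vote (the distributions are assumed; each row would come from one classifier's predicted distribution):

import numpy as np

probs = np.array([   # 3 classifiers, one test instance, 2 classes
    [0.6, 0.4],
    [0.3, 0.7],
    [0.8, 0.2],
])

avg = probs.mean(axis=0)   # average distribution = vote weighted by confidence
print(avg, avg.argmax())   # -> [0.567 0.433], class 0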

Page 12

Sum, Max, Min

• Majority Voting can be viewed as summing the scores from each ensemble member.

• Other aggregation functions can be used, including:

• maximum

• minimum

• What is the implication of these?

[Figure: one feature set feeds several classifiers; their outputs pass through an aggregator]

Page 13

Second-tier classifier

• Classifier predictions are used as input features for a second classifier.

• How should the second-tier classifier be trained? (See the sketch below.)

[Figure: one feature set feeds several classifiers; their predictions feed a second-tier classifier]
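A minimal stacking sketch with scikit-learn (the data and base models are assumed for illustration). One common answer to the training question: fit the second tier on out-of-fold predictions, so it never sees predictions the base classifiers made on their own training data.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
stack = StackingClassifier(
    estimators=[("nb", GaussianNB()), ("tree", DecisionTreeClassifier())],
    final_estimator=LogisticRegression(),
    cv=5,  # second tier trains on out-of-fold base predictions
)
print(stack.fit(X, y).score(X, y))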

Page 14

Classifier Fusion

• Advantages

• Experts can be trained separately on specialized data.

• Training can be faster, due to smaller data sets and lower feature-space dimensionality.

• Disadvantages

• Interactions across feature sets may be missed.

• Explanation of how and why it works can be limited.

Page 15

Bagging

• Bootstrap Aggregating.

• Train k models on different bootstrap samples of the training data.

• Predict by averaging the results of the k models.

• A simple instance of majority voting (sketched below).
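A minimal bagging sketch (synthetic data assumed): each model fits a bootstrap sample drawn with replacement, and predictions are merged by majority vote.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(25):                        # k = 25 bootstrap models
    idx = rng.integers(0, len(X), len(X))  # sample with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.stack([m.predict(X) for m in models])  # shape (k, n)
y_hat = (votes.mean(axis=0) > 0.5).astype(int)    # majority vote
print((y_hat == y).mean())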

Page 16

Model averaging

• Seen in Language Modeling.

• Take, for example, a linear classifier:

y = f(W^T x + b)

• Average the model parameters (sketched below):

W^* = \frac{1}{k} \sum_i W_i \qquad b^* = \frac{1}{k} \sum_i b_i

• If each model's parameters carry noise, averaging cancels it:

W^* = \frac{1}{k} \sum_i (W_i + \epsilon) \qquad b^* = \frac{1}{k} \sum_i (b_i + \epsilon)
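A minimal sketch of parameter averaging (the true parameters and noise level are assumed): averaging k noisy copies of W and b recovers values close to the noise-free parameters.

import numpy as np

rng = np.random.default_rng(0)
W_true, b_true = np.array([1.0, -2.0]), 0.5
k = 50

Ws = W_true + rng.normal(0, 0.3, size=(k, 2))  # W_i = W + eps
bs = b_true + rng.normal(0, 0.3, size=k)       # b_i = b + eps

W_star, b_star = Ws.mean(axis=0), bs.mean()    # averaged parameters
print(W_star, b_star)  # close to [1, -2] and 0.5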

Page 17

Mixture of Experts

• Can we do better than averaging over all training points?

• Look at the input data to see which points are better classified by which classifiers.

• Allow each expert to focus on those cases where it's already doing better than average.

Page 18

Mixture of Experts

• The array of p's is called a "gating network".

• Optimize the p_i as part of the loss function.

[Figure: each classifier's output is weighted by a gating probability p before the second-tier combination]

Page 19

Probability Correct under a mixture of experts

p(d^c \mid \mathrm{MoE}) = \sum_i p_i^c \, \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} \| d^c - o_i^c \|^2}

• p(d^c | MoE): probability of the desired output on case c.

• p_i^c: mixing coefficient for expert i on case c.

• e^{-\frac{1}{2} \| d^c - o_i^c \|^2}: Gaussian loss between the desired and observed output.

From Hinton Lecture
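A minimal numeric sketch of this likelihood (toy outputs and gating coefficients assumed): each expert contributes a Gaussian term around its output, weighted by its gating probability.

import numpy as np

d = 1.0                        # desired output on case c
o = np.array([0.8, 1.5, 0.0])  # expert outputs o_i^c
p = np.array([0.5, 0.3, 0.2])  # gating (mixing) coefficients p_i^c

lik = np.sum(p / np.sqrt(2 * np.pi) * np.exp(-0.5 * (d - o) ** 2))
print(lik)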

Page 20

Gating network Gradient

E = -\log p(d^c \mid \mathrm{MoE}) = -\log \sum_j p_j^c \, e^{-\frac{1}{2} \| d^c - o_j^c \|^2}

\frac{\partial E}{\partial o_i^c} = - \frac{p_i^c \, e^{-\frac{1}{2} \| d^c - o_i^c \|^2}}{\sum_j p_j^c \, e^{-\frac{1}{2} \| d^c - o_j^c \|^2}} \, (d^c - o_i^c)

• The fraction is the posterior probability of expert i.

From Hinton Lecture
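A minimal numeric sketch of this gradient (same toy numbers as above, assumed): each expert's error signal is scaled by its posterior responsibility.

import numpy as np

d = 1.0
o = np.array([0.8, 1.5, 0.0])
p = np.array([0.5, 0.3, 0.2])

terms = p * np.exp(-0.5 * (d - o) ** 2)  # unnormalized posteriors
post = terms / terms.sum()               # posterior probability of expert i
dE_do = -post * (d - o)                  # dE / do_i^c
print(post, dE_do)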

Page 21

AdaBoost

• Adaptive Boosting.

• Construct an ensemble of "weak" classifiers.

• Typically single-split decision trees (decision stumps).

• Identify weights for each classifier.

Page 22

Weak Classifiers

• Weak classifiers:

• low performance (slightly better than chance)

• high variance

• (for AdaBoost) should have uncorrelated errors.

Page 23

Boosting Hypothesis

• The existence of a weak learner implies the existence of a strong learner.

Page 24

AdaBoost Decision Function

• AdaBoost generates a prediction from a weighted sum of the predictions of each classifier.

• The AdaBoost algorithm determines the weights.

• Similar to systems that use a second-tier classifier to learn a combination function.

C(x) = \alpha_1 C_1(x) + \alpha_2 C_2(x) + \dots + \alpha_k C_k(x)

The weight training is different from any loss function we've used.

Page 25

AdaBoost training algorithm

• Repeat:

• Identify the best unused classifier C_i.

• Assign it a weight based on its performance.

• Update the weight of each data point based on whether or not it is classified correctly.

• Until performance converges or all classifiers are included.

Page 26

Identify the best classifier

• Generate hypotheses using each unused classifier.

• Calculate the weighted error using the current data point weights.

• Data point weights are initialized to one.

W_e = \sum_{y_i \neq k_m(x_i)} w_i^{(m)}    (the total weight of the misclassified points, i.e., how many weighted errors were made)

Page 27

Generate a weight for the classifier

• The larger the reduction in error, the larger the classifier weight.

e_m = \frac{W_e}{W}    (weighted error rate: misclassified weight over total weight)

\alpha_m = \frac{1}{2} \ln \left( \frac{1 - e_m}{e_m} \right)    (the new classifier weight)

Page 28

Data Point weighting

• If data point i was not correctly classified:

w_i^{(m+1)} = w_i^{(m)} e^{\alpha_m} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}} > 1

• If data point i was correctly classified:

w_i^{(m+1)} = w_i^{(m)} e^{-\alpha_m} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}} < 1
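A minimal sketch of one AdaBoost round (toy labels and predictions assumed), applying the three formulas above:

import numpy as np

y     = np.array([ 1, -1,  1,  1, -1])  # true labels
y_hat = np.array([ 1, -1, -1,  1, -1])  # weak classifier's predictions
w     = np.ones(5)                      # data point weights, initialized to one

e_m = w[y != y_hat].sum() / w.sum()      # weighted error rate
alpha_m = 0.5 * np.log((1 - e_m) / e_m)  # classifier weight

w[y != y_hat] *= np.exp(alpha_m)   # misclassified points: weight grows (> 1)
w[y == y_hat] *= np.exp(-alpha_m)  # correct points: weight shrinks (< 1)
print(e_m, alpha_m, w)  # -> 0.2, ~0.693, [0.5 0.5 2. 0.5 0.5]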

Page 29

AdaBoost training algorithm

• Repeat:

• Identify the best unused classifier C_i.

• Assign it a weight based on its performance.

• Update the weight of each data point based on whether or not it is classified correctly.

• Until performance converges or all classifiers are included.

Page 30

Random Forests

• Random Forests are similar to AdaBoost over decision trees, sans the adaptive training.

• An ensemble of classifiers is trained, each on a different subset of features and a different set of data points.

• Random subspace projection.

Page 31

Decision Tree

[Figure: a decision tree over world state. Is it raining? If yes, P(wet) = 0.95; if no, ask: is the sprinkler on? If yes, P(wet) = 0.9; if no, P(wet) = 0.1]

Page 32

Construct a Forest of Trees

[Figure: trees t1 through tT each classify the input; together they vote for category c]

Page 33

Training Algorithm

• Divide the training data into K subsets of data points and M variables.

• Improved generalization.

• Reduced memory requirements.

• Train a unique decision tree on each of the K sets.

• Simple multithreading (sketched below).
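A minimal random forest sketch with scikit-learn (synthetic data assumed): each tree trains on a bootstrap sample with a random feature subset per split, and the trees train independently.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # random subspace: feature subset at each split
    n_jobs=-1,            # trees are independent, so training multithreads trivially
    random_state=0,
)
print(forest.fit(X, y).score(X, y))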

Page 34

Handling Class Imbalances

• Class imbalance, or a skewed class distribution, occurs when there are not equal numbers of each label.

• Class imbalance presents a number of challenges:

• Density Estimation

• Low priors can lead to poor estimation of minority classes.

• Loss Functions

• Since every point contributes equally to the loss, fitting the many majority-class points dominates training.

• Evaluation

• Accuracy is less informative.

Page 35

Impact on Accuracy

• Example from Information Retrieval:

• Find 10 relevant documents in a set of 100. A classifier that labels everything negative scores:

                   True Values
                   Positive   Negative
Hyp     Positive       0          0
Values  Negative      10         90

Accuracy = 90%

Page 36

Contingency Table

\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}

                   True Values
                   Positive          Negative
Hyp     Positive   True Positive     False Positive
Values  Negative   False Negative    True Negative

Page 37

F-Measure

• Precision: how many hypothesized events were true events.

P = \frac{TP}{TP + FP}

• Recall: how many of the true events were identified.

R = \frac{TP}{TP + FN}

• F-Measure: harmonic mean of precision and recall.

F = \frac{2PR}{P + R}

                   True Values
                   Positive   Negative
Hyp     Positive       0          0
Values  Negative      10         90

Page 38

F-Measure

• F-measure can be weighted to favor Precision or Recall:

• β > 1 favors recall

• β < 1 favors precision

F_\beta = \frac{(1 + \beta^2) P R}{\beta^2 P + R}

Page 39

F-Measure

                   True Values
                   Positive   Negative
Hyp     Positive       0          0
Values  Negative      10         90

P = 0, R = 0, F1 = 0

Page 40

F-Measure

                   True Values
                   Positive   Negative
Hyp     Positive      10         50
Values  Negative       0         40

P = 10/60, R = 1, F1 = .29

Page 41

F-Measure

                   True Values
                   Positive   Negative
Hyp     Positive       9          1
Values  Negative       1         89

P = .9, R = .9, F1 = .9
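A minimal sketch computing precision, recall, and F-measure from contingency-table counts, reproducing the three worked examples above (the zero guards handle the degenerate first case):

def f_measure(tp, fp, fn, beta=1.0):
    p = tp / (tp + fp) if tp + fp else 0.0  # precision
    r = tp / (tp + fn) if tp + fn else 0.0  # recall
    if p + r == 0:
        return p, r, 0.0
    b2 = beta ** 2
    return p, r, (1 + b2) * p * r / (b2 * p + r)

print(f_measure(tp=0,  fp=0,  fn=10))  # -> (0.0, 0.0, 0.0)
print(f_measure(tp=10, fp=50, fn=0))   # P = 1/6, R = 1, F1 ~ .29
print(f_measure(tp=9,  fp=1,  fn=1))   # P = .9, R = .9, F1 = .9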

Page 42

ROC and AUC

• It is common to plot classifier performance at a variety of settings or thresholds.

• Receiver Operating Characteristic (ROC) curves plot true positives against false positives.

• The overall performance is calculated by the Area Under the Curve (AUC), as sketched below.
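A minimal ROC/AUC sketch with scikit-learn (labels and scores assumed): sweep the decision threshold over the classifier's scores, then integrate the resulting curve.

import numpy as np
from sklearn.metrics import roc_curve, auc

y_true   = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])  # classifier confidences

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # true vs. false positive rates
print(auc(fpr, tpr))  # area under the curve -> 0.75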

Page 43

Skew in Classifier Training

• Most classifiers train better with balanced training data.

• Bayesian methods:

• Reliance on a prior to weight classes.

• Estimation of the class-conditional density is impacted by skew in the number of samples.

• Loss functions:

• There is more pressure to place the decision boundary to favor the majority classes.

Page 44

Skew in Classifier Training

Page 45

Skew in Classifier Training

[Figure annotations: "twice as many errors"; "same distance from the optimal decision boundary"]

Page 46

Sampling

• Artificial manipulation of the number of training samples can help reduce the impact of class imbalance.

• Undersampling:

• Randomly select N_m data points (the minority-class count) from the majority class for training.

Page 47

Sampling

• Oversampling:

• Duplicate minority-class points until the class sizes are balanced. (Both strategies are sketched below.)
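A minimal sketch of both strategies on an imbalanced label array (data assumed): undersampling shrinks the majority class to N_m points; oversampling duplicates minority points up to N_M.

import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)  # 90 majority, 10 minority
maj, mnr = np.where(y == 0)[0], np.where(y == 1)[0]

# Undersample: keep only N_m majority points (no replacement).
under = np.concatenate([rng.choice(maj, size=len(mnr), replace=False), mnr])
# Oversample: duplicate minority points up to N_M (with replacement).
over = np.concatenate([maj, rng.choice(mnr, size=len(maj), replace=True)])
print(len(under), len(over))  # -> 20 180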

Page 48

Ensemble Sampling

• Repeat undersampling N_M / N_m times with different samples of the majority-class data points.

• Train N_M / N_m classifiers and combine them with majority voting (sketched below).

[Figure: classifiers C1, C2, C3, each trained on a different undersampled set, feed a merge step]
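A minimal ensemble-sampling sketch (synthetic imbalanced data assumed): undersample N_M / N_m times with different majority-class samples, train one classifier per sample, and merge by majority vote.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, weights=[0.9], random_state=0)
rng = np.random.default_rng(0)
maj, mnr = np.where(y == 0)[0], np.where(y == 1)[0]

models = []
for _ in range(len(maj) // len(mnr)):  # roughly N_M / N_m rounds
    idx = np.concatenate([rng.choice(maj, len(mnr), replace=False), mnr])
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.stack([m.predict(X) for m in models])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote
print((y_hat == y).mean())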

Page 49

Ensemble Methods

• A very simple and effective technique for improving classification performance.

• Netflix Prize, Watson, etc.

• Mathematical justification.

• Intuitive appeal to how decisions are made by people and organizations.

• Can allow for modular training.