Variance Reduction and Ensemble Methods

Nicholas Ruozzi
University of Texas at Dallas

Based on the slides of Vibhav Gogate and David Sontag
Last Time
• PAC learning
• Bias/variance tradeoff
  • Small hypothesis spaces (not enough flexibility) can have high bias
  • Rich hypothesis spaces (too much flexibility) can have high variance
• Today: more on this phenomenon and how to get around it
Intuition
• Bias
  • Measures the accuracy or quality of the algorithm
  • High bias means a poor match
• Variance
  • Measures the precision or specificity of the match
  • High variance means a weak match
• We would like to minimize each of these
• Unfortunately, we can't do this independently; there is a trade-off
Bias-Variance Analysis in Regression
• True function is $y = f(x) + \epsilon$
  • where the noise $\epsilon$ is normally distributed with zero mean and standard deviation $\sigma$
• Given a set of training examples $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$, we fit a hypothesis $h(x) = w^\top x + b$ to the data to minimize the squared error

$$\sum_i \left( y^{(i)} - h(x^{(i)}) \right)^2$$
2-D Example
Sample 20 points from $f(x) = x + 2\sin(1.5x) + \mathcal{N}(0, 0.2)$
2-D Example
50 fits (20 examples each)
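Not part of the original slides: a minimal Python sketch of this setup. The sampling interval $[0, 5]$ is an assumption (the slide does not specify it), and 0.2 is treated as the noise standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x + 2 * np.sin(1.5 * x)

def sample_training_set(n=20):
    x = rng.uniform(0.0, 5.0, size=n)        # assumed sampling range
    y = f(x) + rng.normal(0.0, 0.2, size=n)  # 0.2 taken as the noise std
    return x, y

# 50 linear fits h(x) = w*x + b, one per fresh training set of 20 examples
fits = [np.polyfit(*sample_training_set(), deg=1) for _ in range(50)]
```

Plotting the 50 fitted lines against $f$ reproduces the spread shown in the figure.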
Bias-Variance Analysis
• Given a new data point $x'$ with observed value $y' = f(x') + \epsilon$, want to understand the expected prediction error
• Suppose that training sets $D$ are drawn independently from a distribution $p(D)$; want to compute the expected error of the estimator $h_D$

$$E\left[ \left( y' - h_D(x') \right)^2 \right]$$
Probability Reminder
• Variance of a random variable, $X$:

$$Var(X) = E\left[ (X - E[X])^2 \right] = E\left[ X^2 - 2X\,E[X] + E[X]^2 \right] = E[X^2] - E[X]^2$$

• Properties of $Var(X)$ (a quick numerical check follows below):

$$Var(aX) = E[a^2 X^2] - E[aX]^2 = a^2\,Var(X)$$
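As a sanity check (not from the slides), the identity $Var(aX) = a^2\,Var(X)$ can be verified on simulated draws; the distribution and constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(3.0, 2.0, size=1_000_000)  # any distribution works
a = 5.0

print(np.var(a * X))     # ~ a**2 * Var(X) = 25 * 4 = 100
print(a**2 * np.var(X))  # same quantity via the identity
```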
Bias-Variance-Noise Decomposition

$$
\begin{aligned}
E\left[ (y' - h_D(x'))^2 \right]
&= E\left[ h_D(x')^2 - 2\,h_D(x')\,y' + y'^2 \right] \\
&= E\left[ h_D(x')^2 \right] - 2\,E[h_D(x')]\,E[y'] + E[y'^2] \\
&= Var(h_D(x')) + E[h_D(x')]^2 - 2\,E[h_D(x')]\,f(x') + Var(y') + f(x')^2 \\
&= Var(h_D(x')) + \left( E[h_D(x')] - f(x') \right)^2 + Var(\epsilon) \\
&= \underbrace{Var(h_D(x'))}_{\text{variance}} + \underbrace{\left( E[h_D(x')] - f(x') \right)^2}_{\text{bias}^2} + \underbrace{\sigma^2}_{\text{noise}}
\end{aligned}
$$

• The second equality uses that the training set $D$ and the noise $\epsilon$ are independent
• The third equality follows from the definition of variance, applied to both $h_D(x')$ and $y'$
• The fourth equality uses $E[y'] = f(x')$, so $Var(y') = Var(\epsilon)$
Bias, Variance, and Noise
• Variance: $E\left[ (h_D(x') - E[h_D(x')])^2 \right]$
  • Describes how much $h_D(x')$ varies from one training set $D$ to another
• Bias: $E[h_D(x')] - f(x')$
  • Describes the average error of $h_D(x')$
• Noise: $E\left[ (y' - f(x'))^2 \right] = E[\epsilon^2] = \sigma^2$
  • Describes how much $y'$ varies from $f(x')$
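A simulation sketch (not from the slides) that estimates these three terms at a fixed query point, using the running 2-D example; the query point, sampling range, and number of training sets are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.2
x_prime = 2.5  # fixed query point x'

def f(x):
    return x + 2 * np.sin(1.5 * x)

# Train a linear hypothesis h_D on many independent training sets D
preds = []
for _ in range(2000):
    x = rng.uniform(0.0, 5.0, size=20)       # assumed sampling range
    y = f(x) + rng.normal(0.0, sigma, size=20)
    w, b = np.polyfit(x, y, deg=1)           # h_D
    preds.append(w * x_prime + b)            # h_D(x')
preds = np.array(preds)

variance = np.var(preds)                     # Var(h_D(x'))
bias_sq = (preds.mean() - f(x_prime)) ** 2   # (E[h_D(x')] - f(x'))^2
noise = sigma ** 2                           # Var(eps)
print(variance, bias_sq, noise)              # sums to ~ E[(y' - h_D(x'))^2]
```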
2-D Example

[figure: 50 fits (20 examples each)]

Bias

[figure]

Variance

[figure]

Noise

[figure]
Bias
• Low bias
  • ?
• High bias
  • ?
Bias
• Low bias
  • Linear regression applied to linear data
  • 2nd-degree polynomial applied to quadratic data
• High bias
  • Constant function applied to non-constant data
  • Linear regression applied to highly non-linear data
Variance
• Low variance
  • ?
• High variance
  • ?
Variance
• Low variance
  • Constant function
  • Model independent of training data
• High variance
  • High-degree polynomial
Bias/Variance Tradeoff
• (bias² + variance) is what counts for prediction
• As we saw in PAC learning, we often have
  • Low bias ⇒ high variance
  • Low variance ⇒ high bias
• How can we deal with this in practice?
Reduce Variance Without Increasing Bias
• Averaging reduces variance: let $X_1, \dots, X_n$ be i.i.d. random variables

$$Var\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) = \frac{1}{n} Var(X_i)$$

• Idea: average models to reduce model variance (a numerical illustration follows below)
• The problem
  • Only one training set
  • Where do multiple models come from?
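A numerical illustration of the averaging identity above; all constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 25
X = rng.normal(0.0, 3.0, size=(100_000, n))  # 100k draws of (X_1, ..., X_n)

print(np.var(X[:, 0]))         # Var(X_i) ~ 9
print(np.var(X.mean(axis=1)))  # variance of the average ~ 9 / 25
```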
Bagging: Bootstrap Aggregation
• Take repeated bootstrap samples from training set $D$ (Breiman, 1994)
• Bootstrap sampling: given a set $D$ containing $m$ training examples, create $D'$ by drawing $m$ examples at random with replacement from $D$
• Bagging (a from-scratch sketch follows below):
  • Create $k$ bootstrap samples $D_1, \dots, D_k$
  • Train a distinct classifier on each $D_i$
  • Classify a new instance by majority vote / average
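A minimal from-scratch sketch of bagging, not the slides' code; it assumes scikit-learn decision trees as the base classifier and integer class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=100, seed=0):
    """Train k trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # n draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the ensemble (assumes integer class labels)."""
    votes = np.stack([m.predict(X) for m in models])  # shape (k, n_test)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```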
Bagging: Bootstrap Aggregation
[image from the slides of David Sontag]
Bagging

Data:  1  2  3   4  5  6  7  8   9  10
BS 1:  7  1  9  10  7  8  8  4   7   2
BS 2:  8  1  3   1  1  9  7  4  10   1
BS 3:  5  4  8   8  2  5  5  7   8   8

• Build a classifier from each bootstrap sample
• In each bootstrap sample, each data point has probability $(1 - 1/m)^m$ of not being selected
• The expected number of distinct data points in each sample is then

$$m \cdot \left( 1 - \left( 1 - \frac{1}{m} \right)^m \right) \approx m \cdot (1 - \exp(-1)) = 0.632 \cdot m$$

• If we have 1 TB of data, each bootstrap sample will be ~632 GB (this can present computational challenges, e.g., you shouldn't replicate the data); a quick numerical check of the 0.632 rule follows below
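A quick numerical check (not from the slides) of the 0.632 rule:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000
idx = rng.integers(0, m, size=m)  # one bootstrap sample of size m
print(len(np.unique(idx)) / m)    # ~ 1 - exp(-1) = 0.632
```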
Decision Tree Bagging
[image from the slides of David Sontag]
Decision Tree Bagging (100 Bagged Trees)
[image from the slides of David Sontag]
Bagging Results
Breiman, "Bagging Predictors", Berkeley Statistics Department TR#421, 1994

[figure: results without bagging vs. with bagging]
Random Forests
• Ensemble method specifically designed for decision tree classifiers
• Introduces two sources of randomness: "bagging" and "random input vectors"
  • Bagging method: each tree is grown using a bootstrap sample of the training data
  • Random vector method: the best split at each node is chosen from a random sample of $m$ attributes instead of all attributes
Random Forest Algorithm

• For $b = 1$ to $B$
  • Draw a bootstrap sample of size $n$ from the data
  • Grow a tree $T_b$ using the bootstrap sample as follows:
    • Choose $m$ attributes uniformly at random from the data
    • Choose the best attribute among the $m$ to split on
    • Split on the best attribute and recurse (until partitions have fewer than $s_{min}$ number of nodes)
• Prediction for a new data point $x$ (a usage sketch follows below)
  • Regression: $\frac{1}{B} \sum_b T_b(x)$
  • Classification: choose the majority class label among $T_1(x), \dots, T_B(x)$
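This is essentially the algorithm behind scikit-learn's RandomForestClassifier; a usage sketch with illustrative hyperparameter values and a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,    # B: number of trees
    max_features=5,      # m: attributes sampled at each split
    min_samples_leaf=2,  # stop splitting small partitions (role of s_min)
    random_state=0,
).fit(X, y)

print(forest.predict(X[:5]))  # majority vote across the 100 trees
```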
Random Forest Demo
A demo of random forests implemented in JavaScript
When Will Bagging Improve Accuracy?
• Depends on the stability of the base-level classifiers
• A learner is unstable if a small change to the training set causes a large change in the output hypothesis
  • If small changes in $D$ cause large changes in the output, then there will likely be an improvement in performance with bagging
• Bagging can help unstable procedures, but could hurt the performance of stable procedures (compare the sketch below)
  • Decision trees are unstable
  • $k$-nearest neighbor is stable
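A sketch (not from the slides) comparing bagging an unstable learner (a decision tree) with bagging a stable one ($k$-nearest neighbor), using scikit-learn's BaggingClassifier; the dataset and hyperparameters are illustrative. Typically the tree's accuracy improves noticeably under bagging while $k$-NN's barely moves.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for base in (DecisionTreeClassifier(), KNeighborsClassifier()):
    single = cross_val_score(base, X, y, cv=5).mean()
    bagged = cross_val_score(BaggingClassifier(base, n_estimators=50),
                             X, y, cv=5).mean()
    print(f"{type(base).__name__}: single={single:.3f}, bagged={bagged:.3f}")
```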