Understanding the Bias-Variance Tradeoff
June 2012

When we discuss prediction models, prediction errors can be decomposed into two main subcomponents we care about: error due to "bias" and error due to "variance". There is a tradeoff between a model's ability to minimize bias and variance. Understanding these two types of error can help us diagnose model results and avoid the mistake of over- or under-fitting.

1 Bias and Variance

Understanding how different sources of error lead to bias and variance helps us improve the data fitting process, resulting in more accurate models. We define bias and variance in three ways: conceptually, graphically and mathematically.

1.1 Conceptual Definition

Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. Of course you only have one model, so talking about expected or average prediction values might seem a little strange. However, imagine you could repeat the whole model building process more than once: each time you gather new data and run a new analysis creating a new model. Due to randomness in the underlying data sets, the resulting models will have a range of predictions. Bias measures how far off in general these models' predictions are from the correct value.

Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point. Again, imagine you can repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model.

1.2 Graphical Definition

We can create a graphical visualization of bias and variance using a bulls-eye diagram. Imagine that the center of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions get worse and worse. Imagine we can repeat our entire model building process to get a number of separate hits on the target. Each hit represents an individual realization of our model, given the chance variability in the training data we gather. Sometimes we will get a good distribution of training data so we predict very well and we are close to the bulls-eye, while sometimes our training data might be full of outliers or non-standard values resulting in poorer predictions. These different realizations result in a scatter of hits on the target. We can plot four different cases representing combinations of both high and low bias and variance.
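In symbols, writing $f(x)$ for the correct value we are trying to predict, $\hat{f}(x)$ for our model's prediction, and taking the expectation over repeated realizations of the model building process, these definitions can be stated compactly (the notation here, including the irreducible noise $\sigma_\epsilon^2$, is introduced for illustration and matches the Err(x) expression used later in Section 3.3):

$$\mathrm{Bias}\big[\hat{f}(x)\big] = \mathrm{E}\big[\hat{f}(x)\big] - f(x), \qquad \mathrm{Var}\big[\hat{f}(x)\big] = \mathrm{E}\Big[\big(\hat{f}(x) - \mathrm{E}[\hat{f}(x)]\big)^2\Big]$$

$$\mathrm{Err}(x) = \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{Irreducible\ Error}$$

The irreducible error is the noise in the true relationship itself, which no model can remove.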
However, in a world with imperfect models and finite data, there is a tradeoff between minimizing the bias and minimizing the variance.
2 An Illustrative Example: Voting Intentions

Let's undertake a simple model building task. We wish to create a model for the percentage of people who will vote for a Republican president in the next election. As models go, this is conceptually trivial and is much simpler than what people commonly envision when they think of "modeling", but it helps us to cleanly illustrate the difference between bias and variance.

A straightforward, if flawed (as we will see below), way to build this model would be to randomly choose 50 numbers from the phone book, call each one and ask the responder who they planned to vote for in the next election. Imagine we got the following results:
Voting Republican | Voting Democratic | Non-Respondent | Total
13 | 16 | 21 | 50
From the data, we estimate that the probability of voting Republican is 13/(13+16), or 44.8%. We put out our press release that the Democrats are going to win by over 10 points; but, when the election comes around, it turns out they actually lose by 10 points. That certainly reflects poorly on us. Where did we go wrong in our model?

Clearly, there are many issues with the trivial model we built. A list would include that we only sample people from the phone book and so only include people with listed numbers, we did not follow up with non-respondents and they might have different voting patterns from the respondents, we do not try to weight responses by likeliness to vote and we have a very small sample size.
It is tempting to lump all these causes of error into one big box. However, they can actually be separated into sources causing bias and sources causing variance.
For instance, using a phonebook to select participants in our survey is one of our sources of bias. By only surveying certain classes of people, it skews the results in a way that will be consistent if we repeat the entire model building exercise. Similarly, not following up with non-respondents is another source of bias, as it consistently changes the mixture of responses we get. On our bulls-eye diagram these move us away from the center of the target, but they would not result in an increased scatter of estimates.
On the other hand, the small sample size is a source of variance. If we increased our sample size, the results would be more consistent each time we repeated the survey and prediction. The results still might be highly inaccurate due to our large sources of bias, but the variance of predictions will be reduced. On the bulls-eye diagram, the low sample size results in a wide scatter of estimates. Increasing the sample size would make the estimates clump closer together, but they still might miss the center of the target.
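To make this concrete, here is a minimal simulation sketch of the idea. All of the numbers are hypothetical and purely illustrative: the sampling frame under-represents one group (mimicking the phone-book skew), so a larger sample tightens the spread of estimates without moving their center toward the true value.

import numpy as np

rng = np.random.default_rng(0)

TRUE_REPUBLICAN_SHARE = 0.55        # hypothetical "truth" we are trying to estimate
PHONEBOOK_REPUBLICAN_SHARE = 0.45   # hypothetical skewed sampling frame (source of bias)

def run_survey(n):
    """Draw one survey of n respondents from the skewed frame and return the estimate."""
    votes = rng.random(n) < PHONEBOOK_REPUBLICAN_SHARE
    return votes.mean()

for n in (50, 1000):
    estimates = np.array([run_survey(n) for _ in range(2000)])
    print(f"n={n:4d}  mean estimate={estimates.mean():.3f}  "
          f"spread (std)={estimates.std():.3f}  true value={TRUE_REPUBLICAN_SHARE}")
    # The spread shrinks as n grows (variance falls), but the mean estimate stays
    # near 0.45 rather than 0.55 -- the bias from the skewed frame does not go away.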
Again, this voting model is trivial and quite removed from the modeling tasks most often faced in practice. In general the data set used to build the model is provided
prior to model construction and the modeler cannot simply say, "Let's increase the sample size to reduce variance." In practice an explicit tradeoff exists between bias and variance where decreasing one increases the other. Minimizing the total error of the model requires a careful balancing of these two forms of error.
3 An Applied Example: Voter Party Registration

Let's look at a slightly more realistic example. Assume we have a training data set of voters, each tagged with three properties: voter party registration, voter wealth, and a quantitative measure of voter religiousness. These simulated data are plotted below 2. The x-axis shows increasing wealth, the y-axis increasing religiousness; the red circles represent Republican voters and the blue circles represent Democratic voters. We want to predict voter registration using wealth and religiousness as predictors.
Fig. 2 Hypothetical party registration. Plotted on religiousness (y-axis) versus wealth (x-axis).
3.1 The k-Nearest Neighbor Algorithm
There are many ways to go about this modeling task. For binary data like ours, logistic regressions are often used. However, if we think there are non-linearities in the relationships between the variables, a more flexible, data-adaptive approach might be desired. One very flexible machine-learning technique is a method called k-Nearest Neighbors.
Fig. 4 Nearest neighbor predictions for new data. Hover over a point to see the neighborhood used to predict it.
We can also plot the full prediction regions showing where individuals will be classified as either Democrats or Republicans. The following figure shows this.

A key part of the k-Nearest Neighbors algorithm is the choice of k. Up to now, we have been using a value of 1 for k. In this case, each new point is predicted by its nearest neighbor in the training set. However, k is an adjustable value; we can set it to anything from 1 to the number of data points in the training set. Below you can adjust the value of k used to generate these plots. As you adjust it, both the following and the preceding plots will be updated to show how predictions change when k changes.
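As a rough sketch of how such a classifier can be fit in code, here is an example using scikit-learn's KNeighborsClassifier on stand-in data (the article's simulated voter data set is not reproduced here, so the training data below is made up purely for illustration); the choice of k is just a parameter of the model:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-in training data: columns are (wealth, religiousness), labels are
# 1 = Republican, 0 = Democrat.  Purely illustrative, not the article's data set.
X_train = rng.random((200, 2))
y_train = (X_train[:, 1] > 0.5).astype(int)          # a simple dividing line...
flip = rng.random(200) < 0.1                          # ...plus some label noise
y_train[flip] = 1 - y_train[flip]

new_voters = np.array([[0.2, 0.9], [0.8, 0.1]])       # points we want to classify

for k in (1, 20, 80):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:3d}  predictions={model.predict(new_voters)}")
    # k = 1 uses only the single nearest training voter; larger k averages
    # over a wider neighborhood, giving smoother prediction regions.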
Fig. 5 Nearest neighbor prediction regions. Lighter colors indicate less certainty about predictions. You can adjust the value of k.
What is the best value of k? In this simulated case, we are fortunate in that we know the actual model that was used to classify the original voters as Republicans or Democrats. A simple split was used and the dividing line is plotted in the above figure. Voters north of the line were classified as Republicans, voters south of the line as Democrats. Some stochastic noise was then added to change a random fraction of voters' registrations. You can also generate new training data to see how the results are sensitive to the original training data.
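For readers who want to generate similar data themselves, here is a hypothetical sketch of that generating process. The specific dividing line, noise level, and sample size used in the article's figures are not given, so the values below are assumptions:

import numpy as np

rng = np.random.default_rng(42)

n_voters = 500
wealth = rng.random(n_voters)
religiousness = rng.random(n_voters)

# True model: a simple dividing line; voters "north" of it are Republican (1),
# voters south of it are Democrat (0).  The slope and intercept are assumed values.
boundary = 0.75 - 0.5 * wealth
party = (religiousness > boundary).astype(int)

# Stochastic noise: flip the registration of a random fraction of voters.
flip = rng.random(n_voters) < 0.15
party[flip] = 1 - party[flip]

X_train, y_train = np.column_stack([wealth, religiousness]), party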
Try experimenting with the value of k to find the best prediction algorithm that matches up well with the black boundary line.
3.2 Bias and Variance
Increasing k results in the averaging of more voters in each prediction, which produces smoother prediction curves. With a k of 1, the separation between Democrats and Republicans is very jagged. Furthermore, there are "islands" of Democrats in generally Republican territory and vice versa. As k is increased to, say, 20, the transition becomes smoother, the islands disappear, and the split between Democrats and Republicans does a good job of following the boundary line. As k becomes very large, say, 80, the distinction between the two categories becomes more blurred and the boundary prediction line is not matched very well at all.

At small k's the jaggedness and islands are signs of variance. The locations of the islands and the exact curves of the boundaries will change radically as new data is
gathered. On the other hand, at large k's the transition is very smooth so there isn't much variance, but the lack of a match to the boundary line is a sign of high bias.

What we are observing here is that increasing k will decrease variance and increase bias, while decreasing k will increase variance and decrease bias.

Take a look at how variable the predictions are for different data sets at low k. As k increases this variability is reduced. But if we increase k too much, then we no longer follow the true boundary line and we observe high bias. This is the nature of the Bias-Variance Tradeoff.
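One way to see this numerically is to re-draw the training data many times and measure how much the fitted predictions move around at a fixed set of points. Here is a rough sketch of that experiment (the data-generating choices are the same assumed placeholders as above, not the article's actual simulation):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_training_data(n=500, flip_rate=0.15):
    """One realization of the hypothetical voter-generating process."""
    X = rng.random((n, 2))                                  # (wealth, religiousness)
    y = (X[:, 1] > 0.75 - 0.5 * X[:, 0]).astype(int)        # assumed true boundary
    flip = rng.random(n) < flip_rate
    y[flip] = 1 - y[flip]
    return X, y

# Fixed grid of test points where we watch the predictions.
X_test = rng.random((200, 2))

for k in (1, 20, 80):
    preds = []
    for _ in range(50):                                     # 50 re-drawn training sets
        X, y = make_training_data()
        preds.append(KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(X_test))
    preds = np.array(preds)
    variability = preds.std(axis=0).mean()                  # spread across realizations
    print(f"k={k:3d}  average prediction variability={variability:.3f}")
    # Variability shrinks as k grows: predictions stabilize, at the cost of
    # drifting away from the true boundary (higher bias).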
3.3 Analytical Bias and Variance
In the case of k-Nearest Neighbors we can derive an explicit analytical expression for the total error as a summation of bias and variance:

$$\mathrm{Err}(x) = \underbrace{\left( f(x) - \frac{1}{k}\sum_{i=1}^{k} f(x_i) \right)^{2}}_{\mathrm{Bias}^2} + \underbrace{\frac{\sigma_\epsilon^2}{k}}_{\mathrm{Variance}} + \underbrace{\sigma_\epsilon^2}_{\mathrm{Irreducible\ Error}}$$

Here $f$ is the true underlying function, the $x_i$ are the $k$ nearest neighbors of $x$, and $\sigma_\epsilon^2$ is the irreducible noise in the data.
The variance term is a function of the irreducible error and k, with the variance error steadily falling as k increases. The bias term is a function of how rough the model space is (i.e., how quickly the true values change as we move through the space of different wealths and religiousness levels). The rougher the space, the faster the bias term will increase as further away neighbors are brought into estimates.
4 Managing Bias and Variance

There are some key things to think about when trying to manage bias and variance.
4.1 Fight Your Instincts
A gut feeling many people have is that they should minimize bias even at the expense of variance. Their thinking goes that the presence of bias indicates something basically wrong with their model and algorithm. Yes, they acknowledge, variance is also bad, but a model with high variance could at least predict well on average; at least it is not fundamentally wrong.

This is mistaken logic. It is true that a high variance and low bias model can perform well in some sort of long-run average sense. However, in practice modelers are always dealing with a single realization of the data set. In these cases, long-run averages are irrelevant; what is important is the performance of the model on the data you actually have, and in this case bias and variance are equally important and one should not be improved at an excessive expense to the other.
4.2 Bagging and Resampling
Bagging and other resampling techniques can be used to reduce the variance in model predictions. In bagging (Bootstrap Aggregating), numerous replicates of the original data set are created using random selection with replacement. Each derivative data set is then used to construct a new model and the models are gathered together into an ensemble. To make a prediction, all of the models in the ensemble are polled and the results are averaged.
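A minimal sketch of that procedure, written out by hand rather than with a library's built-in bagging class (the base model and the training data below are placeholders, not part of the article):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Placeholder training data; in practice this is whatever data set you actually have.
X = rng.random((300, 2))
y = (X[:, 1] > 0.75 - 0.5 * X[:, 0]).astype(int)

n_models = 50
ensemble = []
for _ in range(n_models):
    # Bootstrap replicate: sample rows with replacement from the original data.
    idx = rng.integers(0, len(X), size=len(X))
    ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(X_new):
    """Poll every model in the ensemble and average the votes."""
    votes = np.mean([m.predict(X_new) for m in ensemble], axis=0)
    return (votes > 0.5).astype(int)

print(bagged_predict(np.array([[0.2, 0.9], [0.8, 0.1]])))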
$$\mathrm{Err}(x) = \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{Irreducible\ Error}$$
Fig. 6 Bias and variance contributing to total error.
Understanding bias and variance is critical for understanding the behavior of prediction models, but in general what you really care about is overall error, not the specific decomposition. The sweet spot for any model is the level of complexity at which the increase in bias is equivalent to the reduction in variance. Mathematically:

$$\frac{d\,\mathrm{Bias}}{d\,\mathrm{Complexity}} = -\frac{d\,\mathrm{Variance}}{d\,\mathrm{Complexity}}$$
If our model complexity exceeds this sweet spot, we are in effect over-fitting our model; while if our complexity falls short of the sweet spot, we are under-fitting the model. In practice, there is not an analytical way to find this location. Instead we must use an accurate measure of prediction error and explore differing levels of model complexity and then choose the complexity level that minimizes the overall error. A key to this process is the selection of an accurate error measure, as often grossly inaccurate measures are used which can be deceptive. The topic of accuracy measures is discussed here, but generally resampling-based measures such as cross-validation should be preferred over theoretical measures such as Akaike's Information Criterion.
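For the k-Nearest Neighbors model above, that search can be as simple as cross-validating over a grid of k values and keeping the one with the lowest estimated error. A rough sketch using scikit-learn's cross_val_score, with the same placeholder data-generating assumptions as in the earlier sketches:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Placeholder data standing in for the simulated voter data set.
X = rng.random((500, 2))
y = (X[:, 1] > 0.75 - 0.5 * X[:, 0]).astype(int)
flip = rng.random(500) < 0.15
y[flip] = 1 - y[flip]

best_k, best_error = None, np.inf
for k in range(1, 101, 5):
    # 10-fold cross-validated accuracy; 1 - accuracy is the estimated prediction error.
    accuracy = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    error = 1 - accuracy
    if error < best_error:
        best_k, best_error = k, error

print(f"chosen k={best_k}, estimated error={best_error:.3f}")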
❧
Scott Fortmann-Roe
1. The Elements of Statistical Learning. A superb book.↩
2. The simulated data are purely for demonstration purposes. There is no reason to expect them to accurately reflect what would be observed if this experiment was actually carried out.↩
3. The k-Nearest Neighbors algorithm described above is a little bit of a paradoxical case. It might seem that the "simplest" and least complex model is