Regression and the Bias-Variance Decomposition
William Cohen, 10-601, April 2008
Readings: Bishop 3.1, 3.2
Regression

• Technically: learning a function f(x) = y where y is real-valued, rather than discrete.
  – Replace livesInSquirrelHill(x1, x2, …, xn) with averageCommuteDistanceInMiles(x1, x2, …, xn)
  – Replace userLikesMovie(u, m) with usersRatingForMovie(u, m)
  – …
Example: univariate linear regression

• Example: predict age from number of publications

[Scatter plot: Age in Years (0-50) on the y-axis vs. Number of Publications (0-160) on the x-axis]
Linear regression

• Model: yi = a·xi + b + εi, where εi ~ N(0, σ)
• Training data: (x1, y1), …, (xn, yn)
• Goal: estimate â, b̂ with ŵ = (â, b̂)

$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \Pr(\mathbf{w} \mid D) = \arg\max_{\mathbf{w}} \Pr(D \mid \mathbf{w})\Pr(\mathbf{w}) \quad \text{(assume MLE, i.e. a uniform prior on } \mathbf{w}\text{)}$$
$$= \arg\max_{\mathbf{w}} \log \Pr(D \mid \mathbf{w}) = \arg\max_{\mathbf{w}} \log \Pr(D \mid \mathbf{w}, \sigma)$$
$$= \arg\min_{\mathbf{w}} \sum_i [y_i - \hat{y}_i(\mathbf{w})]^2, \qquad \text{where } \hat{y}_i(\mathbf{w}) = \hat{a}x_i + \hat{b}$$
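To spell out the last step (a standard expansion, not shown on the original slide): with Gaussian noise,

$$\log \Pr(D \mid \mathbf{w}, \sigma) = \sum_{i=1}^n \log\left[\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{[y_i - \hat{y}_i(\mathbf{w})]^2}{2\sigma^2}\right)\right] = -\frac{1}{2\sigma^2}\sum_{i=1}^n [y_i - \hat{y}_i(\mathbf{w})]^2 + \text{const},$$

so maximizing the log-likelihood in w is exactly minimizing the sum of squared errors.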
Linear regression

• Model: yi = a·xi + b + εi, where εi ~ N(0, σ)
• Training data: (x1, y1), …, (xn, yn)
• Goal: estimate â, b̂ with ŵ = (â, b̂), i.e.
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \hat{\varepsilon}_i^2$$
• Ways to estimate the parameters:
  – Find the derivative w.r.t. the parameters a, b
  – Set it to zero and solve
  – Or use gradient ascent to solve
  – Or …
Linear regression

[Figure: data points (x1, y1), (x2, y2) with a fitted line and residuals d1, d2, d3]

How to estimate the slope?

$$\text{slope} = \frac{\Delta y}{\Delta x} \approx \frac{y_2 - y_1}{x_2 - x_1} \approx \dots$$

$$\hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{n \cdot \mathrm{cov}(X, Y)}{n \cdot \mathrm{var}(X)}, \qquad \bar{x} = \frac{1}{n}\sum_i x_i, \quad \bar{y} = \frac{1}{n}\sum_i y_i$$
Linear regression

[Figure: the same fitted line, ŷ = â·x + b̂]

How to estimate the intercept? Since ŷ = â·x + b̂, set

$$\hat{b} = \bar{y} - \hat{a}\bar{x}$$

(A quick sketch of both estimators in Python follows.)
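As a concrete illustration (my own sketch, not from the slides; the function name fit_univariate and the toy numbers are invented):

import numpy as np

def fit_univariate(x, y):
    """Least-squares fit of y ~ a*x + b using the formulas above."""
    xbar, ybar = x.mean(), y.mean()
    a = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # n*cov(X,Y) / n*var(X)
    b = ybar - a * xbar
    return a, b

# hypothetical data: publication counts vs. ages
x = np.array([5.0, 20.0, 40.0, 80.0, 120.0])
y = np.array([25.0, 28.0, 31.0, 37.0, 44.0])
a, b = fit_univariate(x, y)
print(a, b)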
Bias/Variance Decomposition of Error
Bias-Variance decomposition of error

• Return to the simple regression problem f: X → Y, with y = f(x) + ε, where f is deterministic and ε is noise, ε ~ N(0, σ)
• What is the expected error for a learned h?
Bias-Variance decomposition of error

$$E_{D,x}\big[(f(x) - h_D(x))^2\big] = \iint [f(x) - h_D(x)]^2 \,\Pr(x)\,\Pr(D)\, dx\, dD$$

where f is the true fct, D is the dataset, and h_D is learned from D.

Experiment (the error of which I'd like to predict):
1. Draw a size-n sample D = (x1, y1), …, (xn, yn)
2. Train a linear function hD using D
3. Draw a test example (x, f(x) + ε)
4. Measure the squared error of hD on that example
Bias-Variance decomposition of error (2)

$$E_{D,\varepsilon}\big[(f(x) - h_D(x))^2\big]$$

where f is the true fct and h_D is learned from the dataset D.

Fix x, then do this experiment:
1. Draw a size-n sample D = (x1, y1), …, (xn, yn)
2. Train a linear function hD using D
3. Draw the test example (x, f(x) + ε)
4. Measure the squared error of hD on that example
Bias-Variance decomposition of error

Write t = f(x) + ε for the observed target and ŷ = h_D(x) for the learned prediction (really ŷ_D; the subscript is dropped below to reduce clutter). Why not analyze E_{D,ε}[(f(x) - h_D(x))²] directly? Because the error we actually observe is measured against the noisy target t, not the unobservable f(x).

$$
\begin{aligned}
E[(t - \hat{y})^2] &= E[((t - f) + (f - \hat{y}))^2] \\
&= E[(t - f)^2] + E[(f - \hat{y})^2] + 2\,E[(t - f)(f - \hat{y})]
\end{aligned}
$$
Bias-Variance decomposition of error

$$
\begin{aligned}
E[(t - \hat{y})^2] &= E[(t - f)^2] + E[(f - \hat{y})^2] + 2\,E[(t - f)(f - \hat{y})] \\
&= E[(t - f)^2] + E[(f - \hat{y})^2] + 2\,E[t - f]\,E[f - \hat{y}] \\
&= E[(t - f)^2] + E[(f - \hat{y})^2]
\end{aligned}
$$

since ε = t - f is independent of D, and E[t - f] = E[ε] = 0.

• E[(t - f)²]: intrinsic noise
• E[(f - ŷ)²]: depends on how well the learner approximates f
Bias-Variance decomposition of error

Now decompose the second term around the "long-term average" h̄ = E_D{h_D(x)}, writing ŷ = ŷ_D = h_D(x):

$$
\begin{aligned}
E[(f - \hat{y})^2] &= E[((f - \bar{h}) + (\bar{h} - \hat{y}))^2] \\
&= E[(f - \bar{h})^2] + E[(\bar{h} - \hat{y})^2] + 2\,E[(f - \bar{h})(\bar{h} - \hat{y})] \\
&= \underbrace{(f(x) - E_D[h_D(x)])^2}_{\text{BIAS}^2} + \underbrace{E_D[(E_D[h_D(x)] - \hat{y})^2]}_{\text{VARIANCE}}
\end{aligned}
$$

(the cross term vanishes because f - h̄ is a constant and E_D[h̄ - ŷ] = 0).

• BIAS²: the squared difference between the best possible prediction for x, f(x), and our "long-term" expectation for what the learner will do if we averaged over many datasets D, E_D[h_D(x)].
• VARIANCE: the squared difference between our long-term expectation for the learner's performance, E_D[h_D(x)], and what we expect in a representative run on a dataset D (that is, ŷ).

(A simulation below makes this decomposition concrete.)
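A small simulation (my own sketch; the true function, noise level, and sample sizes are invented for illustration) that repeats the fixed-x experiment from the earlier slide many times and checks that noise + bias² + variance matches the measured expected error:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2.0 * x + 1.0           # true, deterministic function
sigma, n, trials = 1.0, 20, 5000      # noise level, sample size, repetitions
x0 = 0.8                              # the fixed test point x

preds = np.empty(trials)
errs = np.empty(trials)
for i in range(trials):
    xs = rng.uniform(0, 1, n)                  # 1. draw a size-n sample D
    ys = f(xs) + rng.normal(0, sigma, n)
    a, b = np.polyfit(xs, ys, 1)               # 2. train a linear h_D on D
    preds[i] = a * x0 + b                      #    h_D(x)
    t = f(x0) + rng.normal(0, sigma)           # 3. draw the test target t = f(x) + eps
    errs[i] = (t - preds[i]) ** 2              # 4. measure squared error

noise = sigma ** 2                             # E[(t - f)^2]
bias2 = (f(x0) - preds.mean()) ** 2            # (f(x) - E_D[h_D(x)])^2
variance = preds.var()                         # E_D[(h_D(x) - E_D[h_D(x)])^2]
print(errs.mean(), noise + bias2 + variance)   # the two numbers should be close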
Bias-variance decomposition

• How can you reduce the bias of a learner?
  – Make the long-term average E_D[h_D(x)] better approximate the true function f(x).
• How can you reduce the variance of a learner?
  – Make the learner less sensitive to variations in the data.
A generalization of the bias-variance decomposition to other loss functions

• "Arbitrary" real-valued loss L(t, y), with L(y, y') = L(y', y), L(y, y) = 0, and L(y, y') ≠ 0 if y ≠ y'
• Define the "optimal prediction": y* = argmin_{y'} E_t[L(t, y')]
• Define the "main prediction of the learner": ym = ym,D = argmin_{y'} ED[L(y, y')], where y is the learner's prediction on dataset D
• Define the "bias of the learner": B(x) = L(y*, ym)
• Define the "variance of the learner": V(x) = ED[L(ym, y)]
• Define the "noise for x": N(x) = Et[L(t, y*)]

Claim [Domingos, 2000]:
$$E_{D,t}[L(t, y)] = c_1 N(x) + B(x) + c_2 V(x)$$
where, for two-class zero-one loss, c1 = 2 PrD[y = y*] - 1 and c2 = 1 if ym = y*, -1 otherwise. (For squared loss, c1 = c2 = 1 and this reduces to the decomposition above.)
Other regression methods
Example: univariate linear regression

• Example: predict age from number of publications

[Scatter plot: Age in Years (0-50) vs. Number of Publications (0-160), with the fitted line ŷ = x/7 + 26]

Paul Erdős, Hungarian mathematician, 1913-1996: x ≈ 1500 publications, so the predicted age is about 240.
Linear regression

[Figure: fitted line with residuals d1, d2, d3]

Summary:
$$\hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{a}\bar{x}$$

To simplify:
• assume zero-centered data, as we did for PCA
• let x = (x1, …, xn) and y = (y1, …, yn)
• then
$$\hat{a} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y}, \qquad \hat{b} = 0$$
Onward: multivariate linear regression

Univariate: $\hat{w} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y}$, with x = (x1, …, xn) and y = (y1, …, yn).

Multivariate:
$$\hat{y} = \hat{w}_1 x_1 + \dots + \hat{w}_k x_k, \qquad \hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}$$
$$X = \begin{pmatrix} x_{11} & \dots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \dots & x_{nk} \end{pmatrix} \ \text{(row is example, col is feature)}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i [y_i - \hat{y}_i(\mathbf{w})]^2, \qquad \hat{y}_i(\mathbf{w}) = \mathbf{w}^T \mathbf{x}_i$$

(A sketch of this solution in code follows.)
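A minimal Python sketch of the normal-equations solution (my own illustration with invented data; it solves the linear system rather than forming the explicit inverse, which is the numerically safer equivalent):

import numpy as np

def fit_linear(X, y):
    """w_hat = (X^T X)^{-1} X^T y, via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# hypothetical data: n=5 examples (rows), k=2 features (columns)
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0], [5.0, 2.5]])
y = np.array([3.1, 3.9, 6.2, 9.8, 11.0])
w = fit_linear(X, y)
yhat = X @ w          # predictions y_hat_i = w^T x_i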
Onward: multivariate linear regression

As before:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i [y_i - \hat{y}_i(\mathbf{w})]^2, \qquad \hat{y}_i(\mathbf{w}) = \mathbf{w}^T \mathbf{x}_i, \qquad \hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}$$

Regularized:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i [y_i - \hat{y}_i(\mathbf{w})]^2 + \lambda\,\mathbf{w}^T\mathbf{w}, \qquad \hat{\mathbf{w}} = (\lambda I + X^T X)^{-1} X^T \mathbf{y}$$
Onward: multivariate linear regression

Multivariate, multiple outputs: X is the n×m input matrix as before (row is example, col is feature), and the outputs form an n×k matrix

$$Y = \begin{pmatrix} y_{11} & \dots & y_{1k} \\ \vdots & & \vdots \\ y_{n1} & \dots & y_{nk} \end{pmatrix}, \qquad \hat{W} = (X^T X)^{-1} X^T Y, \qquad \hat{\mathbf{y}} = \hat{W}^T \mathbf{x}$$

Each column of Ŵ is just the single-output solution for the corresponding column of Y.
Onward: multivariate linear regression

Regularized, again:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i [y_i - \hat{y}_i(\mathbf{w})]^2 + \lambda\,\mathbf{w}^T\mathbf{w}, \qquad \hat{\mathbf{w}} = (\lambda I + X^T X)^{-1} X^T \mathbf{y}$$

What does increasing λ do?
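One way to see the answer empirically (a sketch of mine, with invented data): solve the regularized problem for growing λ and watch the weights shrink toward zero, which lowers variance at the cost of added bias:

import numpy as np

def fit_ridge(X, y, lam):
    """w_hat = (lam*I + X^T X)^{-1} X^T y."""
    k = X.shape[1]
    return np.linalg.solve(lam * np.eye(k) + X.T @ X, X.T @ y)

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0], [5.0, 2.5]])
y = np.array([3.1, 3.9, 6.2, 9.8, 11.0])
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, fit_ridge(X, y, lam))   # weights shrink as lam grows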
Onward: multivariate linear regression

Now include a constant feature, so each row of X is (1, xi):

$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad \hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i [y_i - \hat{y}_i(\mathbf{w})]^2 + \lambda\,\mathbf{w}^T\mathbf{w} = (\lambda I + X^T X)^{-1} X^T \mathbf{y}$$

With w = (w1, w2): what does fixing w2 = 0 do (if λ = 0)?
Regression trees [Quinlan's M5] - summary

• Growing tree:
  – Split to optimize information gain
• At each leaf node:
  – Predict the majority class → for regression: build a linear model, then greedily remove features
• Pruning tree:
  – Prune to reduce error on holdout → for regression: use the estimated error on the training data, where estimates are adjusted by (n+k)/(n-k): n = #cases, k = #features
• Prediction:
  – Trace the path to a leaf and predict the associated majority class → for regression: use a linear interpolation of every prediction made by every node on the path

(A minimal sketch of split selection follows.)
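Not M5 itself, but a minimal sketch (mine, with invented data) of the core step of growing a regression tree: pick the split that most reduces squared error, with each side predicting its mean (M5 would fit a linear model instead):

import numpy as np

def best_split(x, y):
    """Return (threshold, sse): the single-feature split minimizing the
    summed squared error of the two sides' mean predictions."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.sum((y - y.mean()) ** 2))   # baseline: no split, one leaf
    for i in range(1, len(x)):
        left, right = y[:i], y[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best[1]:
            best = ((x[i - 1] + x[i]) / 2.0, sse)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))   # splits near x = 6.5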
Regression trees: example - 1
Regression trees: example - 2
What does pruning do to bias and variance?
Kernel regression

• aka locally weighted regression, locally linear regression, LOESS, …
Kernel regression

• aka locally weighted regression, locally linear regression, …
• Close approximation to kernel regression:
  – Pick a few values z1, …, zk up front
  – Preprocess: for each example (x, y), replace x with x' = <K(x, z1), …, K(x, zk)>, where K(x, z) = exp( -(x-z)² / 2σ² )
  – Use multivariate regression on the (x', y) pairs, as sketched below
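A sketch of that recipe in Python (my own; the centers, kernel width, target function, and the tiny ridge term added for numerical stability are all invented for illustration):

import numpy as np

def rbf_features(x, centers, sigma):
    """Map each scalar x to <K(x,z1),...,K(x,zk)>, K(x,z) = exp(-(x-z)^2 / (2*sigma^2))."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = np.sin(x) + rng.normal(0, 0.1, 50)     # made-up examples (x, y)
centers = np.linspace(0, 10, 8)            # pick a few values z1,...,zk up front
Phi = rbf_features(x, centers, sigma=1.0)  # preprocess: replace x with x'
w = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(len(centers)), Phi.T @ y)
yhat = Phi @ w                             # multivariate regression on (x', y) pairs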
Kernel regression

• aka locally weighted regression, locally linear regression, LOESS, …

What does making the kernel wider do to bias and variance?
Additional readings

• P. Domingos, A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231-238), 2000. Stanford, CA: Morgan Kaufmann.
• J. R. Quinlan, Learning with Continuous Classes. 5th Australian Joint Conference on Artificial Intelligence, 1992.
• Y. Wang & I. Witten, Inducing Model Trees for Continuous Classes. 9th European Conference on Machine Learning, 1997.
• D. A. Cohn, Z. Ghahramani, & M. Jordan, Active Learning with Statistical Models. JAIR, 1996.