Copyright 2001, 2003, Andrew W. Moore
Predicting Real-valued Outputs: An Introduction to Regression
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599
Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
This is re-ordered material from the Neural Nets lecture and the Favorite Regression Algorithms lecture.
Single-Parameter Linear Regression
Linear Regression
Linear regression assumes that the expected value of the output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.
DATASET:

inputs   | outputs
x1 = 1   | y1 = 1
x2 = 3   | y2 = 2.2
x3 = 2   | y3 = 2
x4 = 1.5 | y4 = 1.9
x5 = 4   | y5 = 3.1

(figure: the datapoints with a fitted line through the origin of slope w)
1-parameter linear regression
Assume that the data is formed by

$y_i = w x_i + \text{noise}_i$

where:
the noise signals are independent
the noise has a normal distribution with mean 0 and unknown variance $\sigma^2$

So p(y|w,x) has a normal distribution with mean wx and variance $\sigma^2$.
Bayesian Linear Regression
p(y|w,x) = Normal(mean wx, variance $\sigma^2$)
We have a set of datapoints (x1,y1), (x2,y2), ..., (xn,yn) which are EVIDENCE about w.
We want to infer w from the data:

$p(w \mid x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n)$

You can use BAYES rule to work out a posterior distribution for w given the data.
Or you could do Maximum Likelihood Estimation.
Maximum likelihood estimation of w
Asks the question: "For which value of w is this data most likely to have happened?"
For what w is $p(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n, w)$ maximized?
For what w is $\prod_{i=1}^{n} p(y_i \mid w, x_i)$ maximized?
For what w is $\prod_{i=1}^{n} p(y_i \mid w, x_i)$ maximized?

For what w is $\prod_{i=1}^{n} \exp\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2\right)$ maximized?

For what w is $\sum_{i=1}^{n} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2$ maximized?

For what w is $\sum_{i=1}^{n} (y_i - w x_i)^2$ minimized?
Linear Regression
The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:

$E(w) = \sum_i (y_i - w x_i)^2 = \left(\sum_i y_i^2\right) - 2w\left(\sum_i x_i y_i\right) + w^2\left(\sum_i x_i^2\right)$

We want to minimize a quadratic function of w.
(figure: E(w) plotted against w is an upward-opening parabola)
Linear Regression
Easy to show the sum of squares is minimized when

$w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$

The maximum likelihood model is Out(x) = wx.
We can use it for prediction.
Note: In Bayesian stats you'd have ended up with a probability distribution of w, and predictions would have given a probability distribution of the expected output. It's often useful to know your confidence. Maximum likelihood can give some kinds of confidence too.
(figure: the posterior p(w) plotted against w)
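As a concrete sketch (added here, not from the original slides), the whole single-parameter estimator in Python/NumPy, using the small dataset from a few slides back:

    import numpy as np

    # The dataset from the slides above.
    x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
    y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

    w = np.sum(x * y) / np.sum(x ** 2)   # MLE: w = sum_i x_i y_i / sum_i x_i^2

    def out(x_new):
        return w * x_new                 # Out(x) = wx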
Multivariate Linear Regression
Multivariate Regression
What if the inputs are vectors? The dataset has the form:

x1 y1
x2 y2
x3 y3
:  :
xR yR

(figure: a 2-d input example, points in the (x1, x2) plane, each labeled with its output value)
Multivariate Regression
Write matrix X and vector Y thus:

$X = \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_R \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & & & \vdots \\ x_{R1} & x_{R2} & \cdots & x_{Rm} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{pmatrix}$

(there are R datapoints; each input has m components)

The linear regression model assumes a vector w such that

Out(x) = $w^T \mathbf{x}$ = $w_1 x[1] + w_2 x[2] + \ldots + w_m x[m]$

The max. likelihood w is $w = (X^T X)^{-1} (X^T Y)$.
Copyright 2001, 2003, Andrew W. Moore 14
Multivariate RegressionWrite matrix X and Y thus:
=
=
=
RRmRR
m
m
R y
y
y
xxx
xxx
xxx
MMM
2
1
21
22221
11211
2
...
...
...
..........
..........
..........
y
x
x
x
x
1
(there are Rdatapoints. Each input has mcomponents)
The linear regression model assumes a vector wsuch that
Out(x) = wTx= w1x[1] + w2x[2] + .wmx[D]
The max. likelihood wis w= (XTX)-1(XTY)
IMPORTANT EXERCISE: PROVE IT!!!!!
Multivariate Regression (cont.)

The max. likelihood w is $w = (X^T X)^{-1} (X^T Y)$.

$X^T X$ is an m x m matrix: its (i, j)th element is $\sum_{k=1}^{R} x_{ki} x_{kj}$.

$X^T Y$ is an m-element vector: its ith element is $\sum_{k=1}^{R} x_{ki} y_k$.
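A minimal NumPy sketch of this normal-equations solution (np.linalg.solve is used instead of forming the inverse explicitly, which is numerically kinder but computes the same formula):

    import numpy as np

    def ml_weights(X, Y):
        """w = (X^T X)^{-1} (X^T Y) for an R x m input matrix X and R-vector Y."""
        # solve() avoids forming the matrix inverse explicitly
        return np.linalg.solve(X.T @ X, X.T @ Y)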
Constant Term in Linear Regression
What about a constant term?
We may expect linear data that does not go through the origin.
Statisticians and Neural Net folks all agree on a simple, obvious hack.
Can you guess??
The constant term
The trick is to create a fake input X0 that always takes the value 1.

Before:
X1 | X2 | Y
2  | 4  | 16
3  | 4  | 17
5  | 5  | 20

Y = w1 X1 + w2 X2 has to be a poor model.

After:
X0 | X1 | X2 | Y
1  | 2  | 4  | 16
1  | 3  | 4  | 17
1  | 5  | 5  | 20

Y = w0 X0 + w1 X1 + w2 X2 = w0 + w1 X1 + w2 X2 has a fine constant term.

In this example, you should be able to see the MLE w0, w1 and w2 by inspection.
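A quick NumPy check of the trick on the table above (spoiler: it recovers w0 = 10, w1 = w2 = 1, which is also what inspection gives):

    import numpy as np

    X = np.array([[2.0, 4.0], [3.0, 4.0], [5.0, 5.0]])   # X1, X2 from the table
    Y = np.array([16.0, 17.0, 20.0])

    X0 = np.hstack([np.ones((X.shape[0], 1)), X])        # fake input X0 = 1
    w = np.linalg.solve(X0.T @ X0, X0.T @ Y)
    print(w)                                             # [10. 1. 1.]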
Linear Regression with Varying Noise
Heteroscedasticity...
Regression with varying noise
Suppose you know the variance of the noise that was added to each datapoint:

xi  | yi  | sigma_i^2
1/2 | 1/2 | 4
1   | 1   | 1
2   | 1   | 1/4
2   | 3   | 4
3   | 2   | 1/4

(figure: the datapoints plotted on 0 <= x, y <= 3, each marked with its noise level: sigma = 2, 1, 1/2, 2, 1/2)

Assume $y_i \sim N(w x_i, \sigma_i^2)$. What's the MLE estimate of w?
MLE estimation with varying noise

$\operatorname*{argmax}_w \; \log p(y_1, y_2, \ldots, y_R \mid x_1, x_2, \ldots, x_R, \sigma_1^2, \sigma_2^2, \ldots, \sigma_R^2, w)$

(assuming independence among the noise and then plugging in the equation for the Gaussian and simplifying)

$= \operatorname*{argmin}_w \; \sum_{i=1}^{R} \frac{(y_i - w x_i)^2}{\sigma_i^2}$

(setting dLL/dw equal to zero)

$=$ the w such that $\sum_{i=1}^{R} \frac{x_i (y_i - w x_i)}{\sigma_i^2} = 0$

(trivial algebra)

$= \frac{\sum_{i=1}^{R} x_i y_i / \sigma_i^2}{\sum_{i=1}^{R} x_i^2 / \sigma_i^2}$
This is Weighted Regression
We are asking to minimize the weighted sum of squares

$\operatorname*{argmin}_w \; \sum_{i=1}^{R} \frac{(y_i - w x_i)^2}{\sigma_i^2}$

where the weight for the ith datapoint is $\frac{1}{\sigma_i^2}$.

(figure: the same datapoints with their sigma labels)
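In code the weighted estimate is one line. This sketch (added, not from the slides) uses the example data reconstructed in the table above:

    import numpy as np

    # Example data and per-datapoint noise variances sigma_i^2 from the table.
    x  = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
    y  = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
    s2 = np.array([4.0, 1.0, 0.25, 4.0, 0.25])

    # Weighted MLE: each datapoint gets weight 1 / sigma_i^2.
    w = np.sum(x * y / s2) / np.sum(x ** 2 / s2)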
Non-linear Regression
Non-linear Regression
Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

xi | yi
1  | 2.5
2  | 3
3  | 2
3  | 3

(figure: the datapoints plotted on 0 <= x, y <= 3)

Assume $y_i \sim N(\sqrt{w + x_i}, \sigma^2)$. What's the MLE estimate of w?
Non-linear MLE estimation

$\operatorname*{argmax}_w \; \log p(y_1, y_2, \ldots, y_R \mid x_1, x_2, \ldots, x_R, \sigma, w)$

(assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)

$= \operatorname*{argmin}_w \; \sum_{i=1}^{R} \left(y_i - \sqrt{w + x_i}\right)^2$

(setting dLL/dw equal to zero)

$=$ the w such that $\sum_{i=1}^{R} \frac{y_i - \sqrt{w + x_i}}{\sqrt{w + x_i}} = 0$
Copyright 2001, 2003, Andrew W. Moore 26
Non-linear MLE estimation
=),,,...,,|,...,,(log 2121argmax wxxxyyypw
RR
( ) =+=
R
i
ii xwy
w1
2
argmin
=
=+
+
= 0such that 1
R
i i
ii
xw
xwyw
Assuming i.i.d. andthen plugging inequation for Gaussianand simplifying.
Setting dLL/dw
equal to zero
Were down thealgebraic toilet
Soguess
what
wedo?
Common (but not only) approach:
Numerical Solutions:
Line Search
Simulated Annealing
Gradient Descent
Conjugate Gradient
Levenberg-Marquardt
Newton's Method

Also, special-purpose statistical-optimization-specific tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction).
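For this particular model, a bare-bones gradient-descent sketch might look like the following (an added illustration; the learning rate 0.01 and iteration count are arbitrary choices, the data rows come from the reconstructed table above, and w + x_i must stay positive throughout):

    import numpy as np

    # Data from the table above; the model is y_i ~ N(sqrt(w + x_i), sigma^2).
    x = np.array([1.0, 2.0, 3.0, 3.0])
    y = np.array([2.5, 3.0, 2.0, 3.0])

    w, lr = 1.0, 0.01                          # arbitrary start; keep w + x_i > 0
    for _ in range(2000):
        r = y - np.sqrt(w + x)                 # residuals
        grad = -np.sum(r / np.sqrt(w + x))     # d/dw of sum_i (y_i - sqrt(w + x_i))^2
        w -= lr * grad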
Polynomial Regression
Polynomial Regression
So far we've mainly been dealing with linear regression:

X1 | X2 | Y
3  | 2  | 7
1  | 1  | 3
:  | :  | :

$X = \begin{pmatrix} 3 & 2 \\ 1 & 1 \\ \vdots \end{pmatrix}, \quad y = \begin{pmatrix} 7 \\ 3 \\ \vdots \end{pmatrix}$, with $\mathbf{x}_1 = (3, 2)$, ..., $y_1 = 7$, ...

Prepend a 1 to each input: $\mathbf{z}_k = (1, x_{k1}, x_{k2})$, so

$Z = \begin{pmatrix} 1 & 3 & 2 \\ 1 & 1 & 1 \\ \vdots \end{pmatrix}, \quad \mathbf{z}_1 = (1, 3, 2), \ldots$

$\beta = (Z^T Z)^{-1} (Z^T y)$

$y^{est} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
Quadratic Regression
It's trivial to do linear fits of fixed nonlinear basis functions. With the same dataset (x1 = (3, 2), y1 = 7, ...), take

$\mathbf{z} = (1, \; x_1, \; x_2, \; x_1^2, \; x_1 x_2, \; x_2^2)$

so that

$Z = \begin{pmatrix} 1 & 3 & 2 & 9 & 6 & 4 \\ \vdots \end{pmatrix}, \quad y = \begin{pmatrix} 7 \\ 3 \\ \vdots \end{pmatrix}$

$\beta = (Z^T Z)^{-1} (Z^T y)$

$y^{est} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2$
Each component of a z vector is called a term.
Each column of the Z matrix is called a term column.
How many terms in a quadratic regression with m inputs?
1 constant term
m linear terms
(m+1)-choose-2 = m(m+1)/2 quadratic terms
(m+2)-choose-2 terms in total = O(m^2)
Note that solving $\beta = (Z^T Z)^{-1} (Z^T y)$ is thus O(m^6).
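A sketch of building the quadratic term columns for arbitrary m (added code, not from the slides):

    import numpy as np

    def quadratic_terms(X):
        """Map rows (x1, ..., xm) to all 1 + m + m(m+1)/2 quadratic-regression terms."""
        R, m = X.shape
        cols = [np.ones(R)]                                # 1 constant term
        cols += [X[:, i] for i in range(m)]                # m linear terms
        cols += [X[:, i] * X[:, j]                         # m(m+1)/2 quadratic terms
                 for i in range(m) for j in range(i, m)]
        return np.column_stack(cols)

    Z = quadratic_terms(np.array([[3.0, 2.0]]))            # -> [[1, 3, 2, 9, 6, 4]]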
Qth-degree polynomial Regression
Same dataset as before, but now

$\mathbf{z}$ = (all products of powers of inputs in which the sum of powers is q or less, ...)

$\beta = (Z^T Z)^{-1} (Z^T y)$

$y^{est} = \beta_0 + \beta_1 x_1 + \ldots$
m inputs, degree Q: how many terms?

= the number of unique terms of the form $x_1^{q_1} x_2^{q_2} \cdots x_m^{q_m}$ where $\sum_{i=1}^{m} q_i \le Q$

= the number of unique terms of the form $x_0^{q_0} x_1^{q_1} x_2^{q_2} \cdots x_m^{q_m}$ where $\sum_{i=0}^{m} q_i = Q$

= the number of lists of non-negative integers [q0, q1, q2, ..., qm] in which $\sum_i q_i = Q$

= the number of ways of placing Q red disks on a row of squares of length Q+m

= (Q+m)-choose-Q

Example with Q = 11, m = 4: q0 = 2, q1 = 2, q2 = 0, q3 = 4, q4 = 3.
Radial Basis Functions
Radial Basis Functions (RBFs)
Same dataset as before (x1 = (3, 2), y1 = 7, ...), but now

$\mathbf{z}$ = (list of radial basis function evaluations)

$\beta = (Z^T Z)^{-1} (Z^T y)$
1-d RBFs

$y^{est}(x) = \beta_1 \phi_1(x) + \beta_2 \phi_2(x) + \beta_3 \phi_3(x)$

where $\phi_i(x) = \text{KernelFunction}(|x - c_i| / KW)$

(figure: three basis-function bumps centered at c1, c2, c3 along the x-axis)
Example

$y^{est}(x) = 2\phi_1(x) + 0.05\phi_2(x) + 0.5\phi_3(x)$

where $\phi_i(x) = \text{KernelFunction}(|x - c_i| / KW)$

(figure: the resulting curve over the three bumps)
RBFs with Linear Regression

$y^{est}(x) = 2\phi_1(x) + 0.05\phi_2(x) + 0.5\phi_3(x)$
where $\phi_i(x) = \text{KernelFunction}(|x - c_i| / KW)$

All ci's are held constant (initialized randomly or on a grid in m-dimensional input space).
KW is also held constant (initialized to be large enough that there's decent overlap between basis functions; usually much better than the crappy overlap in my diagram).

Then, given Q basis functions, define the matrix Z such that $Z_{kj} = \text{KernelFunction}(|\mathbf{x}_k - \mathbf{c}_j| / KW)$, where $\mathbf{x}_k$ is the kth vector of inputs.
And as before, $\beta = (Z^T Z)^{-1} (Z^T y)$.
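A 1-d sketch of the whole procedure (added code; the Gaussian bump is just one possible KernelFunction, which the slides deliberately leave unspecified):

    import numpy as np

    def rbf_fit(x, y, centers, kw):
        """beta for y_est(x) = sum_j beta_j phi_j(x), centers and KW held fixed."""
        # Z_kj = KernelFunction(|x_k - c_j| / KW); here a Gaussian bump
        Z = np.exp(-0.5 * (np.abs(x[:, None] - centers[None, :]) / kw) ** 2)
        return np.linalg.solve(Z.T @ Z, Z.T @ y)   # beta = (Z^T Z)^-1 (Z^T y)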
RBFs with Nonlinear Regression

$y^{est}(x) = 2\phi_1(x) + 0.05\phi_2(x) + 0.5\phi_3(x)$
where $\phi_i(x) = \text{KernelFunction}(|x - c_i| / KW)$

But how do we now find all the betas, ci's and KW?
Allow the ci's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).
KW is allowed to adapt to the data. (Some folks even let each basis function have its own KWj, permitting fine detail in dense regions of input space.)

Answer: Gradient Descent.
(But I'd like to see, or hope someone's already done, a hybrid, where the ci's and KW are updated with gradient descent while the betas use matrix inversion.)
Radial Basis Functions in 2-d
Two inputs; outputs (heights sticking out of the page) not shown.
(figure: the (x1, x2) plane with a center and its sphere of significant influence)
Happy RBFs in 2-d
Blue dots denote coordinates of input vectors.
(figure: the centers' spheres of significant influence covering the inputs)
Crabby RBFs in 2-d
Blue dots denote coordinates of input vectors.
What's the problem in this example?
(figure: centers and spheres of significant influence placed relative to the inputs)
More crabby RBFs
Blue dots denote coordinates of input vectors.
And what's the problem in this example?
(figure: another placement of centers and spheres of significant influence)
Hopeless!
Even before seeing the data, you should understand that this is a disaster!
(figure: a placement of centers and spheres of significant influence, no data shown)
Unhappy
Even before seeing the data, you should understand that this isn't good either.
(figure: another placement of centers and spheres of significant influence, no data shown)
Robust Regression
Robust Regression
(figure: noisy (x, y) data with a few gross outliers)
This is the best fit that Quadratic Regression can manage...
...but this is what we'd probably prefer.
LOESS-based Robust Regression
After the initial fit, score each datapoint according to how well it's fitted:
"You are a very good datapoint."
"You are not too shabby."
"But you are pathetic."
(figure: the quadratic fit with datapoints labeled by fit quality)
Robust Regression
For k = 1 to R:
Let (xk, yk) be the kth datapoint.
Let $y_k^{est}$ be the predicted value of yk.
Let wk be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

$w_k = \text{KernelFn}\!\left([y_k - y_k^{est}]^2\right)$
Then redo the regression using weighted datapoints.
Weighted regression was described earlier in the "varying noise" section, and is also discussed in the Memory-based Learning lecture.
Guess what happens next?
I taught you how to do this in the Instance-based lecture (only then the weights depended on distance in input-space).
Repeat the whole thing until converged!
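Putting the loop together, a LOESS-style sketch (added code; the Gaussian KernelFn and its width c are illustrative choices, as is the fixed iteration count in place of a convergence test):

    import numpy as np

    def robust_fit(Z, y, iters=5, c=1.0):
        """Iteratively reweighted least squares on a term matrix Z (e.g. quadratic)."""
        wk = np.ones(len(y))                                  # start unweighted
        for _ in range(iters):
            W = np.diag(wk)
            beta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)  # weighted regression
            r2 = (y - Z @ beta) ** 2
            wk = np.exp(-r2 / (2 * c ** 2))                   # wk = KernelFn([y_k - y_k_est]^2)
        return beta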
Robust Regression: what we're doing
What regular regression does:
Assume yk was originally generated using the following recipe:

$y_k = \beta_0 + \beta_1 x_k + \beta_2 x_k^2 + N(0, \sigma^2)$

The computational task is to find the Maximum Likelihood $\beta_0$, $\beta_1$ and $\beta_2$.
What LOESS robust regression does:
Assume yk was originally generated using the following recipe:
With probability p:

$y_k = \beta_0 + \beta_1 x_k + \beta_2 x_k^2 + N(0, \sigma^2)$

But otherwise:

$y_k \sim N(\mu, \sigma_{\text{huge}}^2)$

The computational task is to find the Maximum Likelihood $\beta_0$, $\beta_1$, $\beta_2$, p, $\mu$ and $\sigma_{\text{huge}}$.
Mysteriously, the reweighting procedure does this computation for us.
Your first glimpse of two spectacular letters: E.M.
Regression Trees
Regression Trees: "Decision trees for regression"
A regression tree leaf

Predict age = 47
(the mean age of records matching this leaf node)
A one-split regression tree

Gender?
Female: Predict age = 39
Male: Predict age = 36
Choosing the attribute to split on

Gender | Rich? | Num. Children | Num. BeanyBabies | Age
Female | No    | 2             | 1                | 38
Male   | No    | 0             | 0                | 24
:      | :     | :             | :                | :
Male   | Yes   | 0             | 5+               | 72

We can't use information gain. What should we use?
MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.

If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j. Call this mean quantity $\bar{y}^{x=j}$.

Then

$\mathrm{MSE}(Y \mid X) = \frac{1}{R} \sum_{j=1}^{N_X} \; \sum_{k \,:\, x_k = j} \left( y_k - \bar{y}^{x=j} \right)^2$

(summing over the $N_X$ values that attribute X can take)
Copyright 2001, 2003, Andrew W. Moore 68
Choosing the attribute to split on
MSE(Y|X) = The expected squared error if we must predict a records Y
value given only knowledge of the records X valueIf were told x=j, the smallest expected error comes from predicting the
mean of the Y-values among those records in which x=j. Call this meanquantity y
x=j
Then
725+0YesMale
:::::
2400NoMale
3812NoFemale
AgeNum. BeanyBabies
Num.Children
Rich?Gender
= =
==X
k
N
j jxk
jx
yk yR
XYMSE1 )such that(
2)(
1)|(
Regression tree attribute selection: greedilychoose the attribute that minimizes MSE(Y|X)
Guess what we do about real-valued inputs?
Guess how we prevent overfitting
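A sketch of scoring one categorical attribute by MSE(Y|X) (added code, not from the slides):

    import numpy as np

    def mse_y_given_x(x, y):
        """MSE(Y|X) for one categorical attribute: predict the mean y per x value."""
        sse = 0.0
        for j in np.unique(x):
            yj = y[x == j]
            sse += np.sum((yj - yj.mean()) ** 2)
        return sse / len(y)

    # Greedy selection: split on the attribute with the smallest mse_y_given_x.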
Pruning Decision
Consider the split at the node property-owner = Yes:
Gender?
Female: Predict age = 39
Male: Predict age = 36
The split asks: "Do I deserve to live?"

# property-owning females = 56712
Mean age among POFs = 39
Age std dev among POFs = 12

# property-owning males = 55800
Mean age among POMs = 36
Age std dev among POMs = 11.5

Use a standard Chi-squared test of the null hypothesis "these two populations have the same mean" and Bob's your uncle.
Linear Regression Trees
At the node property-owner = Yes, split on Gender? as before, but now the leaves contain linear functions (trained using linear regression on all records matching that leaf), e.g.:

Predict age = 26 + 6 * NumChildren - 2 * YearsEducation (one leaf)
Predict age = 24 + 7 * NumChildren - 2.5 * YearsEducation (the other leaf)

Also known as Model Trees.
The split attribute is chosen to minimize the MSE of the regressed children.
Pruning uses a different Chi-squared test.
Detail: You typically ignore any categorical attribute that has been tested higher up in the tree during the regression. But use all untested attributes, and use real-valued attributes even if they've been tested above.
Test your understanding
Assuming regular regression trees, can you sketch a graph of the fitted function yest(x) over this diagram?
(figure: a 1-d scatter of (x, y) data)
Test your understanding
Assuming linear regression trees, can you sketch a graph of the fitted function yest(x) over this diagram?
(figure: the same 1-d scatter of (x, y) data)
Multilinear Interpolation
Multilinear Interpolation
Consider this dataset. Suppose we wanted to create a continuous and piecewise linear fit to the data.
(figure: a 1-d scatter of (x, y) data)
Create a set of knot points: selected X-coordinates (usually equally spaced) that cover the data.
(figure: knots q1, q2, q3, q4, q5 marked along the x-axis)
We are going to assume the data was generated by a noisy version of a function that can only bend at the knots. Here are 3 examples (none fits the data well).
How to find the best fit?
Idea 1: Simply perform a separate regression in each segment, for each part of the curve.
What's the problem with this idea?
Let's look at what goes on in the red segment, between knots q2 and q3 with knot heights h2 and h3:

$y^{est}(x) = \frac{q_3 - x}{w}\, h_2 + \frac{x - q_2}{w}\, h_3, \quad \text{where } w = q_3 - q_2$
In the red segment:

$y^{est}(x) = h_2 \phi_2(x) + h_3 \phi_3(x)$

where $\phi_2(x) = 1 - \frac{x - q_2}{w}$ and $\phi_3(x) = 1 - \frac{q_3 - x}{w}$.
Equivalently, written with absolute values:

$y^{est}(x) = h_2 \phi_2(x) + h_3 \phi_3(x)$

where $\phi_2(x) = 1 - \frac{|x - q_2|}{w}$ and $\phi_3(x) = 1 - \frac{|x - q_3|}{w}$.
In general:

$y^{est}(x) = \sum_{i=1}^{N_K} h_i \phi_i(x)$

where $N_K$ is the number of knots, and each $\phi_i$ is the "tent" function above peaking at $q_i$ (taken to be zero outside the segments adjacent to $q_i$).
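A 1-d sketch of fitting the knot heights by linear regression on these basis functions (added code; the equally spaced knots and toy data are illustrative):

    import numpy as np

    def tent_basis(x, knots):
        """phi_i(x): 1 at knot q_i, falling linearly to 0 at the neighboring knots."""
        w = knots[1] - knots[0]                    # assumes equally spaced knots
        return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots[None, :]) / w)

    x = np.linspace(0.0, 10.0, 50)
    y = np.sin(x) + 0.1 * np.random.randn(50)      # toy data
    Z = tent_basis(x, np.linspace(0.0, 10.0, 5))   # knots q1..q5 covering the data
    h = np.linalg.solve(Z.T @ Z, Z.T @ y)          # the optimal knot heights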
In two dimensions

Blue dots show locations of input vectors (outputs not depicted). Each purple dot is a knot point; it will contain the height of the estimated surface.

But how do we do the interpolation to ensure that the surface is continuous?

To predict the value at a query point, use the four knots at the corners of its grid cell (heights 7, 9, 8 and 3 in the figure):
First interpolate its value on two opposite edges (giving 7.33 and 7 in the figure).
Then interpolate between those two values (giving 7.05).

Notes:
This can easily be generalized to m dimensions.
It should be easy to see that it ensures continuity.
The patches are not linear.
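A sketch of the two-step interpolation within one grid cell (added code; the corner names and the tx, ty parameterization are illustrative choices, not from the slides):

    def cell_interpolate(h00, h10, h01, h11, tx, ty):
        """Predict inside one grid cell from its four corner knot heights.
        tx, ty in [0, 1] are the query point's fractional positions in the cell."""
        lower = (1 - tx) * h00 + tx * h10     # interpolate along one edge
        upper = (1 - tx) * h01 + tx * h11     # ...and along the opposite edge
        return (1 - ty) * lower + ty * upper  # then interpolate between the two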
Doing the regression
Given data, how do we find the optimal knot heights?
Happily, it's simply a two-dimensional basis-function problem. (Working out the basis functions is tedious, unilluminating, and easy.)
What's the problem in higher dimensions?
(figure: the grid of knots over the (x1, x2) plane)
MARS: Multivariate Adaptive Regression Splines
MARS: Multivariate Adaptive Regression Splines
Invented by Jerry Friedman (one of Andrew's heroes).
Simplest version: let's assume the function we are learning is of the following form:

$y^{est}(\mathbf{x}) = \sum_{k=1}^{m} g_k(x_k)$

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.
Copyright 2001, 2003, Andrew W. Moore 96
MARS =
=m
k
kk
est xgy1
)()(x
Instead of a linear combination of the inputs, its a linearcombination of non-linear functions ofindividualinputs
x
y
q1 q4q3 q5q2
Idea: Each
gk is one ofthese
7/31/2019 Intro Reg 05
49/51
Copyright 2001, 2003, Andrew W. Moore 97
So:

$y^{est}(\mathbf{x}) = \sum_{k=1}^{m} \sum_{j=1}^{N_K} h_j^k \, \phi_j^k(x_k)$
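A sketch of building the design matrix for this additive model (added code; it reuses the 1-d tent basis above and assumes one shared, equally spaced knot grid for every input):

    import numpy as np

    def additive_design(X, knots):
        """Columns phi_j^k(x_k) for every input k and knot j (shared knot grid)."""
        w = knots[1] - knots[0]
        blocks = [np.maximum(0.0, 1.0 - np.abs(X[:, k:k + 1] - knots[None, :]) / w)
                  for k in range(X.shape[1])]
        return np.hstack(blocks)

    # The heights h_j^k then come from the usual normal equations on this matrix.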
If you like MARS...
See also CMAC (Cerebellar Model Articulation Controller) by James Albus (another of Andrew's heroes).
Many of the same gut-level intuitions.
But entirely in a neural-network, biologically plausible way.
(All the low-dimensional functions are by means of lookup tables, trained with a delta rule and using a clever blurred update and hash-tables.)
Where are we now?

Inputs -> Classifier -> Predict category: Dec Tree, Gauss/Joint BC, Gauss Naïve BC
Inputs -> Density Estimator -> Probability: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE
Inputs -> Regressor -> Predict real no.: Linear Regression, Polynomial Regression, RBFs, Robust Regression, Regression Trees, Multilinear Interp, MARS
Inputs -> Inference Engine -> Learn p(E1|E2): Joint DE
Citations

Radial Basis Functions:
T. Poggio and F. Girosi, "Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks," Science, 247, 978-982, 1989.

LOESS:
W. S. Cleveland, "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74(368), 829-836, December 1979.

Regression Trees etc.:
L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984.
J. R. Quinlan, "Combining Instance-Based and Model-Based Learning," Machine Learning: Proceedings of the Tenth International Conference, 1993.

MARS:
J. H. Friedman, "Multivariate Adaptive Regression Splines," Department of Statistics, Stanford University, Technical Report No. 102, 1988.