Transcript

    Copyright 2001, 2003, Andrew W. Moore

Predicting Real-valued outputs: an introduction to Regression

    Andrew W. Moore

    Professor

    School of Computer Science

    Carnegie Mellon University

www.cs.cmu.edu/~awm
[email protected]

    412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials:

    http://www.cs.cmu.edu/~awm/tutorials

Comments and corrections gratefully received.

This is re-ordered material from the Neural Nets lecture and the Favorite Regression Algorithms lecture.

    Copyright 2001, 2003, Andrew W. Moore 2

Single-Parameter Linear Regression


    Copyright 2001, 2003, Andrew W. Moore 3

    Linear Regression

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear.

Simplest case: Out(x) = wx for some unknown w.

Given the data, we can estimate w.

DATASET:

    inputs      outputs
    x1 = 1      y1 = 1
    x2 = 3      y2 = 2.2
    x3 = 2      y3 = 2
    x4 = 1.5    y4 = 1.9
    x5 = 4      y5 = 3.1

(figure: the five datapoints with a fitted line of slope w through the origin)

    Copyright 2001, 2003, Andrew W. Moore 4

1-parameter linear regression

Assume that the data is formed by

    y_i = w x_i + noise_i

where

    the noise signals are independent
    the noise has a normal distribution with mean 0 and unknown variance σ²

p(y|w,x) has a normal distribution with

    mean wx
    variance σ²


    Copyright 2001, 2003, Andrew W. Moore 5

Bayesian Linear Regression

p(y|w,x) = Normal(mean wx, variance σ²)

We have a set of datapoints (x1,y1), (x2,y2), ..., (xn,yn) which are EVIDENCE about w.

We want to infer w from the data:

    p(w | x1, x2, x3, ..., xn, y1, y2, ..., yn)

You can use BAYES rule to work out a posterior distribution for w given the data.

Or you could do Maximum Likelihood Estimation.

    Copyright 2001, 2003, Andrew W. Moore 6

Maximum likelihood estimation of w

Asks the question:

"For which value of w is this data most likely to have happened?"

For what w is p(y1, y2, ..., yn | x1, x2, ..., xn, w) maximized?

For what w is

    Π_{i=1}^n p(y_i | w, x_i)   maximized?


    Copyright 2001, 2003, Andrew W. Moore 7

For what w is

    Π_{i=1}^n p(y_i | w, x_i)   maximized?

For what w is

    Π_{i=1}^n exp( -(1/2) ((y_i - w x_i)/σ)² )   maximized?

For what w is

    Σ_{i=1}^n -(1/2) ((y_i - w x_i)/σ)²   maximized?

For what w is

    Σ_{i=1}^n (y_i - w x_i)²   minimized?

    Copyright 2001, 2003, Andrew W. Moore 8

    Linear Regression

The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:

    E(w) = Σ_i (y_i - w x_i)²
         = Σ_i y_i²  -  2w Σ_i x_i y_i  +  w² Σ_i x_i²

We want to minimize a quadratic function of w.

(figure: the quadratic curve E(w) plotted against w)


    Copyright 2001, 2003, Andrew W. Moore 9

Linear Regression

Easy to show the sum of squares is minimized when

    w = ( Σ_i x_i y_i ) / ( Σ_i x_i² )

The maximum likelihood model is

    Out(x) = w x

We can use it for prediction.

    Copyright 2001, 2003, Andrew W. Moore 10

Linear Regression

Easy to show the sum of squares is minimized when

    w = ( Σ_i x_i y_i ) / ( Σ_i x_i² )

The maximum likelihood model is

    Out(x) = w x

We can use it for prediction.

Note: In Bayesian stats you'd have ended up with a prob dist of w, p(w), and predictions would have given a prob dist of expected output.

Often useful to know your confidence. Max likelihood can give some kinds of confidence too.
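To make the closed form concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) that fits the single-parameter model to the toy DATASET above and uses Out(x) = wx for prediction:

    import numpy as np

    # Toy dataset from the earlier DATASET slide: five (x, y) pairs.
    x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
    y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

    # MLE slope for the no-intercept model y = w*x + noise:
    #   w = sum(x_i * y_i) / sum(x_i^2)
    w = np.sum(x * y) / np.sum(x * x)

    def out(x_new):
        """Maximum-likelihood prediction Out(x) = w * x."""
        return w * x_new

    print(f"w = {w:.4f}, Out(2.5) = {out(2.5):.4f}")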


    Copyright 2001, 2003, Andrew W. Moore 11

Multivariate Linear Regression

    Copyright 2001, 2003, Andrew W. Moore 12

Multivariate Regression

What if the inputs are vectors?

Dataset has form:

    x_1  y_1
    x_2  y_2
    x_3  y_3
     :    :
    x_R  y_R

(figure: "2-d input example": points in the (x1, x2) plane, each labelled with its output value, e.g. 3, 4, 6, 5, 8, 10)


    Copyright 2001, 2003, Andrew W. Moore 13

Multivariate Regression

Write matrix X and vector Y thus:

    X = [ x_1 ]   [ x_11  x_12  ...  x_1m ]        Y = [ y_1 ]
        [ x_2 ] = [ x_21  x_22  ...  x_2m ]            [ y_2 ]
        [  :  ]   [   :     :          :  ]            [  :  ]
        [ x_R ]   [ x_R1  x_R2  ...  x_Rm ]            [ y_R ]

(there are R datapoints; each input has m components)

The linear regression model assumes a vector w such that

    Out(x) = w^T x = w_1 x[1] + w_2 x[2] + ... + w_m x[m]

The max. likelihood w is  w = (X^T X)^{-1} (X^T Y)

    Copyright 2001, 2003, Andrew W. Moore 14

Multivariate Regression

Write matrix X and vector Y thus:

    X = [ x_1 ]   [ x_11  x_12  ...  x_1m ]        Y = [ y_1 ]
        [ x_2 ] = [ x_21  x_22  ...  x_2m ]            [ y_2 ]
        [  :  ]   [   :     :          :  ]            [  :  ]
        [ x_R ]   [ x_R1  x_R2  ...  x_Rm ]            [ y_R ]

(there are R datapoints; each input has m components)

The linear regression model assumes a vector w such that

    Out(x) = w^T x = w_1 x[1] + w_2 x[2] + ... + w_m x[m]

The max. likelihood w is  w = (X^T X)^{-1} (X^T Y)

IMPORTANT EXERCISE: PROVE IT !!!!!


    Copyright 2001, 2003, Andrew W. Moore 15

Multivariate Regression (cont)

The max. likelihood w is  w = (X^T X)^{-1} (X^T Y)

X^T X is an m x m matrix: its (i,j)th element is  Σ_{k=1}^R x_ki x_kj

X^T Y is an m-element vector: its ith element is  Σ_{k=1}^R x_ki y_k
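As an illustration (my own, not from the slides), here is a short NumPy sketch of the multivariate fit on synthetic data; the normal-equations formula is written exactly as on the slide, alongside the numerically safer least-squares call that is usually preferred in practice:

    import numpy as np

    rng = np.random.default_rng(0)

    # R datapoints, m input components (synthetic data, just for illustration).
    R, m = 100, 3
    X = rng.normal(size=(R, m))
    true_w = np.array([1.5, -2.0, 0.5])
    Y = X @ true_w + rng.normal(scale=0.1, size=R)

    # Normal equations, exactly as on the slide: w = (X^T X)^{-1} (X^T Y).
    w_normal_eq = np.linalg.inv(X.T @ X) @ (X.T @ Y)

    # Numerically safer equivalent: solve the least-squares problem directly.
    w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

    print(w_normal_eq, w_lstsq)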

    Copyright 2001, 2003, Andrew W. Moore 16

Constant Term in Linear Regression


    Copyright 2001, 2003, Andrew W. Moore 17

What about a constant term?

We may expect linear data that does not go through the origin.

Statisticians and Neural Net folks all agree on a simple obvious hack.

Can you guess??

    Copyright 2001, 2003, Andrew W. Moore 18

The constant term

The trick is to create a fake input X0 that always takes the value 1.

Before:

    X1  X2  Y
    2   4   16
    3   4   17
    5   5   20

Y = w1 X1 + w2 X2 has to be a poor model.

After:

    X0  X1  X2  Y
    1   2   4   16
    1   3   4   17
    1   5   5   20

Y = w0 X0 + w1 X1 + w2 X2 = w0 + w1 X1 + w2 X2 has a fine constant term.

In this example, you should be able to see the MLE w0, w1 and w2 by inspection.
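A hedged sketch of the trick in NumPy (the data are the three rows of the toy table above; everything else is my own illustration):

    import numpy as np

    # The toy table from the slide: columns X1, X2 and target Y.
    X = np.array([[2.0, 4.0],
                  [3.0, 4.0],
                  [5.0, 5.0]])
    Y = np.array([16.0, 17.0, 20.0])

    # The trick: a fake input X0 that is always 1, prepended as a column.
    X_aug = np.column_stack([np.ones(len(X)), X])

    # Ordinary least squares on the augmented matrix gives (w0, w1, w2).
    w, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    print(w)   # for this data the fit is exact: w0 = 10, w1 = 1, w2 = 1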


    Copyright 2001, 2003, Andrew W. Moore 19

Linear Regression with varying noise

Heteroscedasticity...

    Copyright 2001, 2003, Andrew W. Moore 20

Regression with varying noise

Suppose you know the variance of the noise that was added to each datapoint.

    x_i   y_i   σ_i²
    1/2   1/2   4
    1     1     1
    2     1     1/4
    2     3     4
    3     2     1/4

(figure: the five datapoints plotted on x and y axes running from 0 to 3, each annotated with its noise standard deviation σ_i)

Assume  y_i ~ N(w x_i, σ_i²)

What's the MLE estimate of w?


    Copyright 2001, 2003, Andrew W. Moore 21

MLE estimation with varying noise

    argmax_w  log p(y_1, y_2, ..., y_R | x_1, ..., x_R, σ_1², ..., σ_R², w)

        (assuming independence among the noise terms, then plugging in the equation for the Gaussian and simplifying)

    = argmin_w  Σ_{i=1}^R (y_i - w x_i)² / σ_i²

        (setting dLL/dw equal to zero)

    = the w such that  Σ_{i=1}^R x_i (y_i - w x_i) / σ_i² = 0

        (trivial algebra)

    = ( Σ_{i=1}^R x_i y_i / σ_i² ) / ( Σ_{i=1}^R x_i² / σ_i² )

    Copyright 2001, 2003, Andrew W. Moore 22

This is Weighted Regression

We are asking to minimize the weighted sum of squares

    argmin_w  Σ_{i=1}^R (y_i - w x_i)² / σ_i²

where the weight for the ith datapoint is 1/σ_i².

(figure: the five datapoints again, each annotated with its σ_i)
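A minimal sketch of this weighted MLE (the x, y and σ² values are my reading of the slide's table; the code just evaluates the closed form above):

    import numpy as np

    # Datapoints and their known noise variances (from the slide's table).
    x   = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
    y   = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
    var = np.array([4.0, 1.0, 0.25, 4.0, 0.25])   # sigma_i^2

    # Closed-form weighted MLE for y = w*x with per-point noise:
    #   w = sum(x_i * y_i / var_i) / sum(x_i^2 / var_i)
    weights = 1.0 / var
    w = np.sum(weights * x * y) / np.sum(weights * x * x)
    print(w)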


    Copyright 2001, 2003, Andrew W. Moore 23

Non-linear Regression

    Copyright 2001, 2003, Andrew W. Moore 24

Non-linear Regression

Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

    x_i   y_i
    1     2.5
    2     3
    3     2
    3     3

(figure: the four datapoints plotted on x and y axes running from 0 to 3)

Assume  y_i ~ N( √(w + x_i), σ² )

What's the MLE estimate of w?


    Copyright 2001, 2003, Andrew W. Moore 25

Non-linear MLE estimation

    argmax_w  log p(y_1, y_2, ..., y_R | x_1, ..., x_R, σ, w)

        (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)

    = argmin_w  Σ_{i=1}^R ( y_i - √(w + x_i) )²

        (setting dLL/dw equal to zero)

    = the w such that  Σ_{i=1}^R ( y_i - √(w + x_i) ) / √(w + x_i) = 0

    Copyright 2001, 2003, Andrew W. Moore 26

Non-linear MLE estimation

    argmax_w  log p(y_1, y_2, ..., y_R | x_1, ..., x_R, σ, w)

        (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)

    = argmin_w  Σ_{i=1}^R ( y_i - √(w + x_i) )²

        (setting dLL/dw equal to zero)

    = the w such that  Σ_{i=1}^R ( y_i - √(w + x_i) ) / √(w + x_i) = 0

We're down the algebraic toilet.

So guess what we do?


    Copyright 2001, 2003, Andrew W. Moore 27

Non-linear MLE estimation

    argmax_w  log p(y_1, y_2, ..., y_R | x_1, ..., x_R, σ, w)

        (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)

    = argmin_w  Σ_{i=1}^R ( y_i - √(w + x_i) )²

        (setting dLL/dw equal to zero)

    = the w such that  Σ_{i=1}^R ( y_i - √(w + x_i) ) / √(w + x_i) = 0

We're down the algebraic toilet.

So guess what we do?

Common (but not only) approach:

Numerical Solutions:

    Line Search
    Simulated Annealing
    Gradient Descent
    Conjugate Gradient
    Levenberg-Marquardt
    Newton's Method

Also, special purpose statistical-optimization-specific tricks such as E.M. (See the Gaussian Mixtures lecture for an introduction.)
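As one concrete, purely illustrative numerical route, here is a plain gradient-descent sketch for the model above, assuming the y_i ~ N(√(w + x_i), σ²) form and the four datapoints as I have read them off the slide:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 3.0])
    y = np.array([2.5, 3.0, 2.0, 3.0])

    def sse(w):
        """Sum of squared residuals for the model y ~ N(sqrt(w + x), sigma^2)."""
        return np.sum((y - np.sqrt(w + x)) ** 2)

    def grad(w):
        """d/dw of the sum of squares: -sum((y_i - sqrt(w+x_i)) / sqrt(w+x_i))."""
        r = np.sqrt(w + x)
        return -np.sum((y - r) / r)

    w = 1.0                      # initial guess (keeps w + x_i > 0 throughout)
    for _ in range(5000):        # plain gradient descent, fixed step size
        w -= 0.01 * grad(w)
    print(w, sse(w))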

    Copyright 2001, 2003, Andrew W. Moore 28

Polynomial Regression


    Copyright 2001, 2003, Andrew W. Moore 29

Polynomial Regression

So far we've mainly been dealing with linear regression:

    X1  X2  Y
    3   2   7
    1   1   3
    :   :   :

    X = [ 3  2 ]    y = [ 7 ]       x_1 = (3,2), ...    y_1 = 7, ...
        [ 1  1 ]        [ 3 ]
        [ :  : ]        [ : ]

    Z = [ 1  3  2 ]    y = [ 7 ]    z_1 = (1,3,2), ...    z_k = (1, x_k1, x_k2)
        [ 1  1  1 ]        [ 3 ]    y_1 = 7, ...
        [ :  :  : ]        [ : ]

    β = (Z^T Z)^{-1} (Z^T y)

    y_est = β0 + β1 x1 + β2 x2

    Copyright 2001, 2003, Andrew W. Moore 30

Quadratic Regression

It's trivial to do linear fits of fixed nonlinear basis functions.

    X1  X2  Y
    3   2   7
    1   1   3
    :   :   :

    X = [ 3  2 ]    y = [ 7 ]       x_1 = (3,2), ...    y_1 = 7, ...
        [ 1  1 ]        [ 3 ]
        [ :  : ]        [ : ]

    Z = [ 1  3  2  9  6  4 ]    y = [ 7 ]      z = (1, x1, x2, x1², x1 x2, x2²)
        [ 1  1  1  1  1  1 ]        [ 3 ]
        [ :                ]        [ : ]

    β = (Z^T Z)^{-1} (Z^T y)

    y_est = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x1 x2 + β5 x2²

  • 7/31/2019 Intro Reg 05

    16/51

    Copyright 2001, 2003, Andrew W. Moore 31

Quadratic Regression

It's trivial to do linear fits of fixed nonlinear basis functions.

    X1  X2  Y
    3   2   7
    1   1   3
    :   :   :

    X = [ 3  2 ]    y = [ 7 ]       x_1 = (3,2), ...    y_1 = 7, ...
        [ 1  1 ]        [ 3 ]
        [ :  : ]        [ : ]

    Z = [ 1  3  2  9  6  4 ]    y = [ 7 ]      z = (1, x1, x2, x1², x1 x2, x2²)
        [ 1  1  1  1  1  1 ]        [ 3 ]
        [ :                ]        [ : ]

    β = (Z^T Z)^{-1} (Z^T y)

    y_est = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x1 x2 + β5 x2²

Each component of a z vector is called a term.

Each column of the Z matrix is called a term column.

How many terms in a quadratic regression with m inputs?

    1 constant term
    m linear terms
    (m+1)-choose-2 = m(m+1)/2 quadratic terms

(m+2)-choose-2 terms in total = O(m²)

Note that solving β = (Z^T Z)^{-1} (Z^T y) is thus O(m⁶).
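A small sketch of quadratic regression as a linear fit of fixed basis functions (the dataset and all names are made up for illustration):

    import numpy as np

    def quadratic_terms(x):
        """Expand a 2-input vector into the terms (1, x1, x2, x1^2, x1*x2, x2^2)."""
        x1, x2 = x
        return np.array([1.0, x1, x2, x1**2, x1*x2, x2**2])

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 5, size=(30, 2))                   # 30 two-component inputs
    y = (1 + 2*X[:, 0] - X[:, 1] + 0.5*X[:, 0]*X[:, 1]
         + rng.normal(scale=0.1, size=30))                # noisy quadratic truth

    Z = np.array([quadratic_terms(x) for x in X])         # the "term columns" matrix
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)          # beta = argmin ||Z beta - y||^2
    print(np.round(beta, 2))   # should roughly recover (1, 2, -1, 0, 0.5, 0)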

    Copyright 2001, 2003, Andrew W. Moore 32

Qth-degree polynomial Regression

    X1  X2  Y
    3   2   7
    1   1   3
    :   :   :

    X = [ 3  2 ]    y = [ 7 ]       x_1 = (3,2), ...    y_1 = 7, ...
        [ 1  1 ]        [ 3 ]
        [ :  : ]        [ : ]

    Z = [ 1  3  2  9  6  4  ... ]    y = [ 7 ]
        [ 1  1  1  1  1  1  ... ]        [ 3 ]
        [ :                     ]        [ : ]

    z = (all products of powers of inputs in which the sum of powers is q or less)

    β = (Z^T Z)^{-1} (Z^T y)

    y_est = β0 + β1 x1 + ...


    Copyright 2001, 2003, Andrew W. Moore 33

m inputs, degree Q: how many terms?

= the number of unique terms of the form

    x1^q1 · x2^q2 · ... · xm^qm    where    Σ_{i=1}^m q_i ≤ Q

= the number of unique terms of the form

    1^q0 · x1^q1 · x2^q2 · ... · xm^qm    where    q0 + Σ_{i=1}^m q_i = Q

= the number of lists of non-negative integers [q0, q1, q2, ..., qm] in which Σ q_i = Q

= the number of ways of placing Q red disks on a row of squares of length Q+m

= (Q+m)-choose-Q

(example: Q=11, m=4:  q0=2, q1=2, q2=0, q3=4, q4=3)
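The term count is easy to check numerically; a tiny sketch (the function name is mine):

    from math import comb

    def num_polynomial_terms(m: int, Q: int) -> int:
        """Number of terms in a degree-Q polynomial regression with m inputs: (Q+m) choose Q."""
        return comb(Q + m, Q)

    print(num_polynomial_terms(m=2, Q=2))   # 6 terms: 1, x1, x2, x1^2, x1*x2, x2^2
    print(num_polynomial_terms(m=4, Q=11))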

    Copyright 2001, 2003, Andrew W. Moore 34

Radial Basis Functions


    Copyright 2001, 2003, Andrew W. Moore 35

Radial Basis Functions (RBFs)

    X1  X2  Y
    3   2   7
    1   1   3
    :   :   :

    X = [ 3  2 ]    y = [ 7 ]       x_1 = (3,2), ...    y_1 = 7, ...
        [ 1  1 ]        [ 3 ]
        [ :  : ]        [ : ]

    Z = [ ... ]    y = [ 7 ]        z = (list of radial basis function evaluations)
                       [ 3 ]
                       [ : ]

    β = (Z^T Z)^{-1} (Z^T y)

    y_est = β0 + β1 x1 + ...

    Copyright 2001, 2003, Andrew W. Moore 36

1-d RBFs

    y_est = β1 φ1(x) + β2 φ2(x) + β3 φ3(x)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

(figure: three basis-function bumps along the x axis, centred at c1, c2, c3)


    Copyright 2001, 2003, Andrew W. Moore 37

Example

    y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

(figure: the resulting curve over the three bumps centred at c1, c2, c3)

    Copyright 2001, 2003, Andrew W. Moore 38

RBFs with Linear Regression

    y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)    (for example)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

All c_i's are held constant (initialized randomly or on a grid in m-dimensional input space).

KW is also held constant (initialized to be large enough that there's decent overlap between basis functions*).

*Usually much better than the crappy overlap on my diagram


    Copyright 2001, 2003, Andrew W. Moore 39

RBFs with Linear Regression

    y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)    (for example)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

Then given Q basis functions, define the matrix Z such that

    Z_kj = KernelFunction( |x_k - c_j| / KW )

where x_k is the kth vector of inputs.

And as before,  β = (Z^T Z)^{-1} (Z^T y)

All c_i's are held constant (initialized randomly or on a grid in m-dimensional input space).

KW is also held constant (initialized to be large enough that there's decent overlap between basis functions*).

*Usually much better than the crappy overlap on my diagram
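A minimal sketch of RBFs with linear regression, assuming a Gaussian kernel (the slides leave KernelFunction unspecified) and made-up 1-d data, fixed grid centres, and fixed KW:

    import numpy as np

    def gaussian_kernel(r):
        """One common choice of KernelFunction: exp(-r^2)."""
        return np.exp(-r**2)

    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0, 10, size=80))          # 1-d inputs
    y = np.sin(x) + rng.normal(scale=0.1, size=80)    # noisy targets

    centres = np.linspace(0, 10, 12)                  # c_j: fixed, on a grid
    KW = 1.0                                          # fixed kernel width

    # Z_kj = KernelFunction(|x_k - c_j| / KW), then beta = (Z^T Z)^{-1} Z^T y.
    Z = gaussian_kernel(np.abs(x[:, None] - centres[None, :]) / KW)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)

    y_est = Z @ beta
    print(np.mean((y - y_est) ** 2))                  # training MSE of the RBF fit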

    Copyright 2001, 2003, Andrew W. Moore 40

RBFs with NonLinear Regression

    y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)    (for example)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

But how do we now find all the β_j's, c_i's and KW?

Allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).

KW is allowed to adapt to the data. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.)


    Copyright 2001, 2003, Andrew W. Moore 41

RBFs with NonLinear Regression

    y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)    (for example)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

But how do we now find all the β_j's, c_i's and KW?

Allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).

KW is allowed to adapt to the data. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.)

Answer: Gradient Descent

    Copyright 2001, 2003, Andrew W. Moore 42

RBFs with NonLinear Regression

    y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)    (for example)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

But how do we now find all the β_j's, c_i's and KW?

Allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).

KW is allowed to adapt to the data. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.)

Answer: Gradient Descent

(But I'd like to see, or hope someone's already done, a hybrid, where the c_i's and KW are updated with gradient descent while the β_j's use matrix inversion.)


    Copyright 2001, 2003, Andrew W. Moore 43

Radial Basis Functions in 2-d

Two inputs. Outputs (heights sticking out of the page) not shown.

(figure: the x1-x2 plane showing a center and its sphere of significant influence)

    Copyright 2001, 2003, Andrew W. Moore 44

Happy RBFs in 2-d

Blue dots denote coordinates of input vectors.

(figure: the x1-x2 plane with the centers, their spheres of significant influence, and the input points)


    Copyright 2001, 2003, Andrew W. Moore 45

Crabby RBFs in 2-d

Blue dots denote coordinates of input vectors.

What's the problem in this example?

(figure: the x1-x2 plane with the centers, their spheres of significant influence, and the input points)

    Copyright 2001, 2003, Andrew W. Moore 46

More crabby RBFs

Blue dots denote coordinates of input vectors.

And what's the problem in this example?

(figure: the x1-x2 plane with the centers, their spheres of significant influence, and the input points)


    Copyright 2001, 2003, Andrew W. Moore 47

Hopeless!

Even before seeing the data, you should understand that this is a disaster!

(figure: centers and their spheres of significant influence in the x1-x2 plane)

    Copyright 2001, 2003, Andrew W. Moore 48

Unhappy

Even before seeing the data, you should understand that this isn't good either.

(figure: centers and their spheres of significant influence in the x1-x2 plane)


    Copyright 2001, 2003, Andrew W. Moore 49

Robust Regression

    Copyright 2001, 2003, Andrew W. Moore 50

    Robust Regression

(figure: a y vs. x scatter of datapoints)


    Copyright 2001, 2003, Andrew W. Moore 51

    Robust Regression

(figure: the datapoints with a fitted quadratic curve)

This is the best fit that Quadratic Regression can manage.

    Copyright 2001, 2003, Andrew W. Moore 52

    Robust Regression

(figure: the same datapoints with a different fitted curve)

...but this is what we'd probably prefer.


    Copyright 2001, 2003, Andrew W. Moore 53

    LOESS-based Robust Regression

(figure: the datapoints with the current fitted curve)

After the initial fit, score each datapoint according to how well it's fitted:

"You are a very good datapoint."

    Copyright 2001, 2003, Andrew W. Moore 54

    LOESS-based Robust Regression

(figure: the datapoints with the current fitted curve)

After the initial fit, score each datapoint according to how well it's fitted:

"You are a very good datapoint."

"You are not too shabby."


    Copyright 2001, 2003, Andrew W. Moore 55

    LOESS-based Robust Regression

(figure: the datapoints with the current fitted curve)

After the initial fit, score each datapoint according to how well it's fitted:

"You are a very good datapoint."

"You are not too shabby."

"But you are pathetic."

    Copyright 2001, 2003, Andrew W. Moore 56

    Robust Regression

(figure: the datapoints with the current fitted curve)

For k = 1 to R:

    Let (x_k, y_k) be the kth datapoint.
    Let y_k^est be the predicted value of y_k.
    Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

        w_k = KernelFn( [y_k - y_k^est]² )


    Copyright 2001, 2003, Andrew W. Moore 57

    Robust Regression

(figure: the datapoints with the current fitted curve)

For k = 1 to R:

    Let (x_k, y_k) be the kth datapoint.
    Let y_k^est be the predicted value of y_k.
    Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

        w_k = KernelFn( [y_k - y_k^est]² )

Then redo the regression using weighted datapoints.

Weighted regression was described earlier in the "varying noise" section, and is also discussed in the Memory-based Learning lecture.

Guess what happens next?

    Copyright 2001, 2003, Andrew W. Moore 58

    Robust Regression

(figure: the datapoints with the current fitted curve)

For k = 1 to R:

    Let (x_k, y_k) be the kth datapoint.
    Let y_k^est be the predicted value of y_k.
    Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

        w_k = KernelFn( [y_k - y_k^est]² )

Then redo the regression using weighted datapoints.

I taught you how to do this in the Instance-based lecture (only then the weights depended on distance in input-space).

Repeat the whole thing until converged!
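A hedged sketch of this iterative reweighting for a quadratic fit; the choice of KernelFn, its width, and the data (including a few injected outliers) are my own, since the slides do not pin them down:

    import numpy as np

    def kernel_fn(sq_residual, width=1.0):
        """A plausible KernelFn: the weight decays as the squared residual grows."""
        return np.exp(-sq_residual / (2.0 * width**2))

    def robust_quadratic_fit(x, y, n_iterations=10):
        """Repeatedly fit a quadratic by weighted least squares, down-weighting badly fitted points."""
        Z = np.column_stack([np.ones_like(x), x, x**2])   # quadratic term columns
        w = np.ones_like(y)                               # start with equal weights
        for _ in range(n_iterations):
            sw = np.sqrt(w)
            beta, *_ = np.linalg.lstsq(sw[:, None] * Z, sw * y, rcond=None)  # weighted LS
            y_est = Z @ beta
            w = kernel_fn((y - y_est) ** 2)               # re-score each datapoint
        return beta

    rng = np.random.default_rng(3)
    x = np.linspace(0, 4, 60)
    y = 1 + x - 0.2 * x**2 + rng.normal(scale=0.1, size=60)
    y[::15] += 4.0                                        # a few gross outliers
    print(robust_quadratic_fit(x, y))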


    Copyright 2001, 2003, Andrew W. Moore 59

Robust Regression---what we're doing

What regular regression does:

Assume y_k was originally generated using the following recipe:

    y_k = β0 + β1 x_k + β2 x_k² + N(0, σ²)

Computational task is to find the Maximum Likelihood β0, β1 and β2.

    Copyright 2001, 2003, Andrew W. Moore 60

Robust Regression---what we're doing

What LOESS robust regression does:

Assume y_k was originally generated using the following recipe:

With probability p:
    y_k = β0 + β1 x_k + β2 x_k² + N(0, σ²)
But otherwise:
    y_k ~ N(μ, σ_huge²)

Computational task is to find the Maximum Likelihood β0, β1, β2, p, μ and σ_huge.


    Copyright 2001, 2003, Andrew W. Moore 61

Robust Regression---what we're doing

What LOESS robust regression does:

Assume y_k was originally generated using the following recipe:

With probability p:
    y_k = β0 + β1 x_k + β2 x_k² + N(0, σ²)
But otherwise:
    y_k ~ N(μ, σ_huge²)

Computational task is to find the Maximum Likelihood β0, β1, β2, p, μ and σ_huge.

Mysteriously, the reweighting procedure does this computation for us.

Your first glimpse of two spectacular letters:

E.M.

    Copyright 2001, 2003, Andrew W. Moore 62

Regression Trees


    Copyright 2001, 2003, Andrew W. Moore 63

Regression Trees: decision trees for regression

    Copyright 2001, 2003, Andrew W. Moore 64

A regression tree leaf

    Predict age = 47
    (mean age of records matching this leaf node)


    Copyright 2001, 2003, Andrew W. Moore 65

A one-split regression tree

    Gender?
      Female: Predict age = 39
      Male:   Predict age = 36

    Copyright 2001, 2003, Andrew W. Moore 66

Choosing the attribute to split on

    Gender  Rich?  Num.Children  Num. BeanyBabies  Age
    Female  No     2             1                 38
    Male    No     0             0                 24
    :       :      :             :                 :
    Male    Yes    0             5+                72

We can't use information gain.

What should we use?


    Copyright 2001, 2003, Andrew W. Moore 67

Choosing the attribute to split on

    Gender  Rich?  Num.Children  Num. BeanyBabies  Age
    Female  No     2             1                 38
    Male    No     0             0                 24
    :       :      :             :                 :
    Male    Yes    0             5+                72

MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.

If we're told x=j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x=j. Call this mean quantity μ_y^{x=j}.

Then

    MSE(Y|X) = (1/R) Σ_{j=1}^{N_X} Σ_{k such that x_k = j} ( y_k - μ_y^{x=j} )²

    Copyright 2001, 2003, Andrew W. Moore 68

Choosing the attribute to split on

    Gender  Rich?  Num.Children  Num. BeanyBabies  Age
    Female  No     2             1                 38
    Male    No     0             0                 24
    :       :      :             :                 :
    Male    Yes    0             5+                72

MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.

If we're told x=j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x=j. Call this mean quantity μ_y^{x=j}.

Then

    MSE(Y|X) = (1/R) Σ_{j=1}^{N_X} Σ_{k such that x_k = j} ( y_k - μ_y^{x=j} )²

Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y|X).

Guess what we do about real-valued inputs?

Guess how we prevent overfitting.
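A small sketch of MSE(Y|X) for categorical attributes and the greedy choice among them (the records and attribute names are made up):

    import numpy as np
    from collections import defaultdict

    def mse_given_attribute(x_values, y_values):
        """MSE(Y|X): mean squared error when each record's Y is predicted by
        the mean Y of all records sharing its (categorical) X value."""
        groups = defaultdict(list)
        for x, y in zip(x_values, y_values):
            groups[x].append(y)
        total = 0.0
        for ys in groups.values():
            mean = np.mean(ys)
            total += np.sum((np.array(ys) - mean) ** 2)
        return total / len(y_values)

    # Toy records in the spirit of the slide's table.
    gender = ["Female", "Male", "Female", "Male", "Male"]
    rich   = ["No", "No", "Yes", "No", "Yes"]
    age    = np.array([38.0, 24.0, 45.0, 30.0, 72.0])

    # Greedy split selection: pick the attribute with the smallest MSE(Y|X).
    for name, attr in [("Gender", gender), ("Rich?", rich)]:
        print(name, mse_given_attribute(attr, age))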


    Copyright 2001, 2003, Andrew W. Moore 69

Pruning Decision

    Gender? (property-owner = Yes)
      Female: Predict age = 39
      Male:   Predict age = 36

Do I deserve to live?

    # property-owning females = 56712
    Mean age among POFs = 39
    Age std dev among POFs = 12

    # property-owning males = 55800
    Mean age among POMs = 36
    Age std dev among POMs = 11.5

Use a standard Chi-squared test of the null hypothesis "these two populations have the same mean" and Bob's your uncle.

    Copyright 2001, 2003, Andrew W. Moore 70

Linear Regression Trees

Also known as "Model Trees".

    Gender? (property-owner = Yes)
      Predict age = 26 + 6 * NumChildren - 2 * YearsEducation      (one leaf)
      Predict age = 24 + 7 * NumChildren - 2.5 * YearsEducation    (other leaf)

Leaves contain linear functions (trained using linear regression on all records matching that leaf).

Split attribute chosen to minimize MSE of regressed children.

Pruning with a different Chi-squared test.


    Copyright 2001, 2003, Andrew W. Moore 71

Linear Regression Trees

Also known as "Model Trees".

    Gender? (property-owner = Yes)
      Predict age = 26 + 6 * NumChildren - 2 * YearsEducation      (one leaf)
      Predict age = 24 + 7 * NumChildren - 2.5 * YearsEducation    (other leaf)

Leaves contain linear functions (trained using linear regression on all records matching that leaf).

Split attribute chosen to minimize MSE of regressed children.

Pruning with a different Chi-squared test.

Detail: you typically ignore any categorical attribute that has been tested higher up in the tree during the regression. But use all untested attributes, and use real-valued attributes even if they've been tested above.

    Copyright 2001, 2003, Andrew W. Moore 72

    Test your understanding

(figure: a y vs. x scatter of datapoints)

Assuming regular regression trees, can you sketch a graph of the fitted function y_est(x) over this diagram?


    Copyright 2001, 2003, Andrew W. Moore 73

    Test your understanding

(figure: the same y vs. x scatter of datapoints)

Assuming linear regression trees, can you sketch a graph of the fitted function y_est(x) over this diagram?

    Copyright 2001, 2003, Andrew W. Moore 74

Multilinear Interpolation


    Copyright 2001, 2003, Andrew W. Moore 75

    Multilinear Interpolation

(figure: a y vs. x scatter of datapoints)

Consider this dataset. Suppose we wanted to create a continuous and piecewise linear fit to the data.

    Copyright 2001, 2003, Andrew W. Moore 76

    Multilinear Interpolation

(figure: the datapoints with knot positions q1, q2, q3, q4, q5 marked on the x axis)

Create a set of knot points: selected X-coordinates (usually equally spaced) that cover the data.


    Copyright 2001, 2003, Andrew W. Moore 77

    Multilinear Interpolation

(figure: three piecewise linear curves drawn over the knots q1 ... q5)

We are going to assume the data was generated by a noisy version of a function that can only bend at the knots. Here are 3 examples (none fits the data well).

    Copyright 2001, 2003, Andrew W. Moore 78

How to find the best fit?

Idea 1: Simply perform a separate regression in each segment for each part of the curve.

What's the problem with this idea?

(figure: the datapoints and the knots q1 ... q5)


    Copyright 2001, 2003, Andrew W. Moore 79

How to find the best fit?

Let's look at what goes on in the red segment (between q2 and q3, with knot heights h2 and h3):

    y_est(x) = ( (q3 - x) / w ) h2  +  ( (x - q2) / w ) h3        where  w = q3 - q2

    Copyright 2001, 2003, Andrew W. Moore 80

How to find the best fit?

In the red segment:

    y_est(x) = h2 φ2(x) + h3 φ3(x)

where

    φ2(x) = 1 - (x - q2)/w,        φ3(x) = 1 - (q3 - x)/w

(figure: φ2(x) drawn over the segment)


    Copyright 2001, 2003, Andrew W. Moore 81

How to find the best fit?

In the red segment:

    y_est(x) = h2 φ2(x) + h3 φ3(x)

where

    φ2(x) = 1 - (x - q2)/w,        φ3(x) = 1 - (q3 - x)/w

(figure: φ2(x) and φ3(x) drawn over the segment)

    Copyright 2001, 2003, Andrew W. Moore 82

How to find the best fit?

In the red segment:

    y_est(x) = h2 φ2(x) + h3 φ3(x)

where

    φ2(x) = 1 - |x - q2|/w,        φ3(x) = 1 - |x - q3|/w

(figure: φ2(x) and φ3(x) drawn over the segment)


    Copyright 2001, 2003, Andrew W. Moore 83

How to find the best fit?

In the red segment:

    y_est(x) = h2 φ2(x) + h3 φ3(x)

where

    φ2(x) = 1 - |x - q2|/w,        φ3(x) = 1 - |x - q3|/w

(figure: φ2(x) and φ3(x) drawn over the segment)

    Copyright 2001, 2003, Andrew W. Moore 84

How to find the best fit?

In general:

    y_est(x) = Σ_{i=1}^{N_K} h_i φ_i(x)

(where N_K is the number of knots)

(figure: the tent-shaped basis functions over the knots q1 ... q5)


    Copyright 2001, 2003, Andrew W. Moore 85

How to find the best fit?

In general:

    y_est(x) = Σ_{i=1}^{N_K} h_i φ_i(x)

(figure: the tent-shaped basis functions over the knots q1 ... q5)
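A sketch of the 1-d case as a linear regression on "tent" basis functions over equally spaced knots (the data, knot count, and helper names are my own; the slide only gives the general form y_est(x) = Σ h_i φ_i(x)):

    import numpy as np

    def tent_basis(x, knots):
        """phi_i(x) = max(0, 1 - |x - q_i| / w) for equally spaced knots with spacing w."""
        w = knots[1] - knots[0]
        return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots[None, :]) / w)

    rng = np.random.default_rng(4)
    x = np.sort(rng.uniform(0, 4, size=120))
    y = np.sin(2 * x) + rng.normal(scale=0.1, size=120)

    knots = np.linspace(0, 4, 9)                     # q_1 ... q_NK, equally spaced
    Phi = tent_basis(x, knots)                       # design matrix of phi_i(x_k)
    h, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # optimal knot heights

    y_est = Phi @ h                                  # continuous, piecewise linear fit
    print(np.mean((y - y_est) ** 2))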


    Copyright 2001, 2003, Andrew W. Moore 87

In two dimensions

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

(figure: a grid of knot points over the x1-x2 plane, with the input points scattered among them)

    Copyright 2001, 2003, Andrew W. Moore 88

In two dimensions

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

But how do we do the interpolation to ensure that the surface is continuous?

(figure: one grid cell whose four corner knots have heights 9, 7, 8 and 3)


    Copyright 2001, 2003, Andrew W. Moore 89

In two dimensions

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

But how do we do the interpolation to ensure that the surface is continuous?

To predict the value here... (a query point inside the cell whose corner knots have heights 9, 7, 8 and 3)

    Copyright 2001, 2003, Andrew W. Moore 90

In two dimensions

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

But how do we do the interpolation to ensure that the surface is continuous?

To predict the value here: first interpolate its value on two opposite edges (here giving 7.33 and 7).


    Copyright 2001, 2003, Andrew W. Moore 91

In two dimensions

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

But how do we do the interpolation to ensure that the surface is continuous?

To predict the value here: first interpolate its value on two opposite edges (here giving 7.33 and 7), then interpolate between those two values (here giving 7.05).

    Copyright 2001, 2003, Andrew W. Moore 92

In two dimensions

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

To predict the value at a query point: first interpolate its value on two opposite edges of its cell, then interpolate between those two values.

Notes:

    This can easily be generalized to m dimensions.
    It should be easy to see that it ensures continuity.
    The patches are not linear.
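A sketch of the two-step edge-then-between interpolation inside one grid cell; the corner heights and query position here are made up, and the corner naming convention is mine:

    def bilinear_interpolate(fx, fy, h00, h10, h01, h11):
        """Interpolate inside one grid cell.

        fx, fy are the query point's fractional positions (0..1) along the two axes;
        h00, h10, h01, h11 are the knot heights at the cell's four corners
        (h[ix][iy], with ix along x1 and iy along x2).
        """
        # First interpolate along two opposite edges (constant iy on each edge)...
        bottom = h00 + fx * (h10 - h00)
        top    = h01 + fx * (h11 - h01)
        # ...then interpolate between those two values.
        return bottom + fy * (top - bottom)

    # Example with four made-up corner heights.
    print(bilinear_interpolate(0.25, 0.5, h00=7.0, h10=8.0, h01=3.0, h11=9.0))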


    Copyright 2001, 2003, Andrew W. Moore 93

Doing the regression

Given data, how do we find the optimal knot heights?

Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.)

What's the problem in higher dimensions?

(figure: the knot grid with example heights 9, 7, 8, 3)

    Copyright 2001, 2003, Andrew W. Moore 94

MARS: Multivariate Adaptive Regression Splines


    Copyright 2001, 2003, Andrew W. Moore 95

MARS: Multivariate Adaptive Regression Splines

Invented by Jerry Friedman (one of Andrew's heroes).

Simplest version:

Let's assume the function we are learning is of the following form:

    y_est(x) = Σ_{k=1}^m g_k(x_k)

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

    Copyright 2001, 2003, Andrew W. Moore 96

MARS

    y_est(x) = Σ_{k=1}^m g_k(x_k)

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

Idea: each g_k is one of these (a piecewise linear function over its own knots, as in the 1-d multilinear interpolation above).

(figure: a piecewise linear g over knots q1 ... q5)


    Copyright 2001, 2003, Andrew W. Moore 97

MARS

    y_est(x) = Σ_{k=1}^m g_k(x_k)

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

    y_est(x) = Σ_{k=1}^m Σ_{j=1}^{N_K} h_j^k φ_j^k(x_k)

(figure: the piecewise linear basis over knots q1 ... q5)
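A hedged sketch of the "simplest version" additive form above, using fixed tent bases per input and one joint linear fit for all the h_j^k; note this is not Friedman's full adaptive MARS procedure, just the model written on this slide:

    import numpy as np

    def tent_basis(v, knots):
        """Tent basis phi_j(v) = max(0, 1 - |v - q_j| / w) for equally spaced knots."""
        w = knots[1] - knots[0]
        return np.maximum(0.0, 1.0 - np.abs(v[:, None] - knots[None, :]) / w)

    rng = np.random.default_rng(5)
    R, m, n_knots = 200, 2, 7
    X = rng.uniform(0, 4, size=(R, m))
    y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=R)

    # Additive model: y_est(x) = sum_k g_k(x_k), each g_k expanded in its own
    # knot basis, so all the h_j^k come from a single linear least-squares fit.
    knots = np.linspace(0, 4, n_knots)
    Z = np.hstack([tent_basis(X[:, k], knots) for k in range(m)])
    h, *_ = np.linalg.lstsq(Z, y, rcond=None)
    print(np.mean((y - Z @ h) ** 2))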


    Copyright 2001, 2003, Andrew W. Moore 99

If you like MARS...

See also CMAC (Cerebellar Model Articulated Controller) by James Albus (another of Andrew's heroes).

Many of the same gut-level intuitions.

But entirely in a neural-network, biologically plausible way.

(All the low dimensional functions are by means of lookup tables, trained with a delta-rule and using a clever blurred update and hash-tables.)

    Copyright 2001, 2003, Andrew W. Moore 100

Where are we now?

    Inputs -> Classifier -> Predict category:
        Dec Tree, Gauss/Joint BC, Gauss Naïve BC

    Inputs -> Density Estimator -> Probability:
        Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE

    Inputs -> Regressor -> Predict real no.:
        Linear Regression, Polynomial Regression, RBFs, Robust Regression, Regression Trees, Multilinear Interp, MARS

    Inputs -> Inference Engine -> Learn p(E1|E2):
        Joint DE


    Copyright 2001, 2003, Andrew W. Moore 101

Citations

Radial Basis Functions

    T. Poggio and F. Girosi, "Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks," Science, 247, 978-982, 1989.

LOESS

    W. S. Cleveland, "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74(368), 829-836, December 1979.

Regression Trees etc.

    L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984.

    J. R. Quinlan, "Combining Instance-Based and Model-Based Learning," Machine Learning: Proceedings of the Tenth International Conference, 1993.

MARS

    J. H. Friedman, "Multivariate Adaptive Regression Splines," Department of Statistics, Stanford University, Technical Report No. 102, 1988.