Transcript

    Copyright 2001, 2003, Andrew W. Moore

Predicting Real-valued outputs: an introduction to Regression

    Andrew W. Moore

    Professor

    School of Computer Science

    Carnegie Mellon University

www.cs.cmu.edu/~awm
[email protected]

    412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials:

    http://www.cs.cmu.edu/~awm/tutorials

Comments and corrections gratefully received.

This is re-ordered material from the Neural Nets lecture and the Favorite Regression Algorithms lecture.

    Copyright 2001, 2003, Andrew W. Moore 2

Single-Parameter Linear Regression


    Copyright 2001, 2003, Andrew W. Moore 3

    Linear Regression

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear.

Simplest case: Out(x) = wx for some unknown w.

Given the data, we can estimate w.

DATASET:

    inputs      outputs
    x1 = 1      y1 = 1
    x2 = 3      y2 = 2.2
    x3 = 2      y3 = 2
    x4 = 1.5    y4 = 1.9
    x5 = 4      y5 = 3.1

(figure: the five datapoints with a fitted line of slope w through the origin)

    Copyright 2001, 2003, Andrew W. Moore 4

1-parameter linear regression

Assume that the data is formed by

    y_i = w x_i + noise_i

where

    the noise signals are independent
    the noise has a normal distribution with mean 0 and unknown variance σ²

p(y|w,x) has a normal distribution with

    mean wx
    variance σ²


    Copyright 2001, 2003, Andrew W. Moore 5

Bayesian Linear Regression

p(y|w,x) = Normal(mean wx, variance σ²)

We have a set of datapoints (x1,y1), (x2,y2), ..., (xn,yn) which are EVIDENCE about w.

We want to infer w from the data:

    p(w | x1, x2, x3, ..., xn, y1, y2, ..., yn)

You can use BAYES rule to work out a posterior distribution for w given the data.

Or you could do Maximum Likelihood Estimation.

    Copyright 2001, 2003, Andrew W. Moore 6

Maximum likelihood estimation of w

Asks the question:

"For which value of w is this data most likely to have happened?"

For what w is p(y1, y2, ..., yn | x1, x2, ..., xn, w) maximized?

For what w is

    Π_{i=1}^n p(y_i | w, x_i)   maximized?


    Copyright 2001, 2003, Andrew W. Moore 7

For what w is

    Π_{i=1}^n p(y_i | w, x_i)   maximized?

For what w is

    Π_{i=1}^n exp( -(1/2) ((y_i - w x_i)/σ)² )   maximized?

For what w is

    Σ_{i=1}^n -(1/2) ((y_i - w x_i)/σ)²   maximized?

For what w is

    Σ_{i=1}^n (y_i - w x_i)²   minimized?

    Copyright 2001, 2003, Andrew W. Moore 8

    Linear Regression

The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:

    E(w) = Σ_i (y_i - w x_i)²
         = Σ_i y_i²  -  2w Σ_i x_i y_i  +  w² Σ_i x_i²

We want to minimize a quadratic function of w.

(figure: the quadratic curve E(w) plotted against w)


    Copyright 2001, 2003, Andrew W. Moore 9

Linear Regression

Easy to show the sum of squares is minimized when

    w = ( Σ_i x_i y_i ) / ( Σ_i x_i² )

The maximum likelihood model is

    Out(x) = w x

We can use it for prediction.

    Copyright 2001, 2003, Andrew W. Moore 10

Linear Regression

Easy to show the sum of squares is minimized when

    w = ( Σ_i x_i y_i ) / ( Σ_i x_i² )

The maximum likelihood model is

    Out(x) = w x

We can use it for prediction.

Note: In Bayesian stats you'd have ended up with a prob dist of w, p(w), and predictions would have given a prob dist of expected output.

Often useful to know your confidence. Max likelihood can give some kinds of confidence too.
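To make the closed form concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) that fits the single-parameter model to the toy DATASET above and uses Out(x) = wx for prediction:

    import numpy as np

    # Toy dataset from the earlier DATASET slide: five (x, y) pairs.
    x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
    y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

    # MLE slope for the no-intercept model y = w*x + noise:
    #   w = sum(x_i * y_i) / sum(x_i^2)
    w = np.sum(x * y) / np.sum(x * x)

    def out(x_new):
        """Maximum-likelihood prediction Out(x) = w * x."""
        return w * x_new

    print(f"w = {w:.4f}, Out(2.5) = {out(2.5):.4f}")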


    Copyright 2001, 2003, Andrew W. Moore 11

Multivariate Linear Regression

    Copyright 2001, 2003, Andrew W. Moore 12

Multivariate Regression

What if the inputs are vectors?

Dataset has form:

    x_1  y_1
    x_2  y_2
    x_3  y_3
     :    :
    x_R  y_R

(figure: "2-d input example": points in the (x1, x2) plane, each labelled with its output value, e.g. 3, 4, 6, 5, 8, 10)


    Copyright 2001, 2003, Andrew W. Moore 13

Multivariate Regression

Write matrix X and vector Y thus:

    X = [ x_1 ]   [ x_11  x_12  ...  x_1m ]        Y = [ y_1 ]
        [ x_2 ] = [ x_21  x_22  ...  x_2m ]            [ y_2 ]
        [  :  ]   [   :     :          :  ]            [  :  ]
        [ x_R ]   [ x_R1  x_R2  ...  x_Rm ]            [ y_R ]

(there are R datapoints; each input has m components)

The linear regression model assumes a vector w such that

    Out(x) = w^T x = w_1 x[1] + w_2 x[2] + ... + w_m x[m]

The max. likelihood w is  w = (X^T X)^{-1} (X^T Y)

    Copyright 2001, 2003, Andrew W. Moore 14

Multivariate Regression

Write matrix X and vector Y thus:

    X = [ x_1 ]   [ x_11  x_12  ...  x_1m ]        Y = [ y_1 ]
        [ x_2 ] = [ x_21  x_22  ...  x_2m ]            [ y_2 ]
        [  :  ]   [   :     :          :  ]            [  :  ]
        [ x_R ]   [ x_R1  x_R2  ...  x_Rm ]            [ y_R ]

(there are R datapoints; each input has m components)

The linear regression model assumes a vector w such that

    Out(x) = w^T x = w_1 x[1] + w_2 x[2] + ... + w_m x[m]

The max. likelihood w is  w = (X^T X)^{-1} (X^T Y)

IMPORTANT EXERCISE: PROVE IT !!!!!


    Copyright 2001, 2003, Andrew W. Moore 15

Multivariate Regression (cont)

The max. likelihood w is  w = (X^T X)^{-1} (X^T Y)

X^T X is an m x m matrix: its (i,j)th element is  Σ_{k=1}^R x_ki x_kj

X^T Y is an m-element vector: its ith element is  Σ_{k=1}^R x_ki y_k
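As an illustration (my own, not from the slides), here is a short NumPy sketch of the multivariate fit on synthetic data; the normal-equations formula is written exactly as on the slide, alongside the numerically safer least-squares call that is usually preferred in practice:

    import numpy as np

    rng = np.random.default_rng(0)

    # R datapoints, m input components (synthetic data, just for illustration).
    R, m = 100, 3
    X = rng.normal(size=(R, m))
    true_w = np.array([1.5, -2.0, 0.5])
    Y = X @ true_w + rng.normal(scale=0.1, size=R)

    # Normal equations, exactly as on the slide: w = (X^T X)^{-1} (X^T Y).
    w_normal_eq = np.linalg.inv(X.T @ X) @ (X.T @ Y)

    # Numerically safer equivalent: solve the least-squares problem directly.
    w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

    print(w_normal_eq, w_lstsq)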

    Copyright 2001, 2003, Andrew W. Moore 16

Constant Term in Linear Regression


    Copyright 2001, 2003, Andrew W. Moore 17

What about a constant term?

We may expect linear data that does not go through the origin.

Statisticians and Neural Net folks all agree on a simple obvious hack.

Can you guess??

    Copyright 2001, 2003, Andrew W. Moore 18

The constant term

The trick is to create a fake input X0 that always takes the value 1.

Before:

    X1  X2  Y
    2   4   16
    3   4   17
    5   5   20

Y = w1 X1 + w2 X2 has to be a poor model.

After:

    X0  X1  X2  Y
    1   2   4   16
    1   3   4   17
    1   5   5   20

Y = w0 X0 + w1 X1 + w2 X2 = w0 + w1 X1 + w2 X2 has a fine constant term.

In this example, you should be able to see the MLE w0, w1 and w2 by inspection.
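A hedged sketch of the trick in NumPy (the data are the three rows of the toy table above; everything else is my own illustration):

    import numpy as np

    # The toy table from the slide: columns X1, X2 and target Y.
    X = np.array([[2.0, 4.0],
                  [3.0, 4.0],
                  [5.0, 5.0]])
    Y = np.array([16.0, 17.0, 20.0])

    # The trick: a fake input X0 that is always 1, prepended as a column.
    X_aug = np.column_stack([np.ones(len(X)), X])

    # Ordinary least squares on the augmented matrix gives (w0, w1, w2).
    w, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    print(w)   # for this data the fit is exact: w0 = 10, w1 = 1, w2 = 1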


    Copyright 2001, 2003, Andrew W. Moore 19

Linear Regression with varying noise

Heteroscedasticity...

    Copyright 2001, 2003, Andrew W. Moore 20

Regression with varying noise

Suppose you know the variance of the noise that was added to each datapoint.

    x_i   y_i   σ_i²
    1/2   1/2   4
    1     1     1
    2     1     1/4
    2     3     4
    3     2     1/4

(figure: the five datapoints plotted on x and y axes running from 0 to 3, each annotated with its noise standard deviation σ_i)

Assume  y_i ~ N(w x_i, σ_i²)

What's the MLE estimate of w?


    Copyright 2001, 2003, Andrew W. Moore 21

MLE estimation with varying noise

    argmax_w  log p(y_1, y_2, ..., y_R | x_1, ..., x_R, σ_1², ..., σ_R², w)

        (assuming independence among the noise terms, then plugging in the equation for the Gaussian and simplifying)

    = argmin_w  Σ_{i=1}^R (y_i - w x_i)² / σ_i²

        (setting dLL/dw equal to zero)

    = the w such that  Σ_{i=1}^R x_i (y_i - w x_i) / σ_i² = 0

        (trivial algebra)

    = ( Σ_{i=1}^R x_i y_i / σ_i² ) / ( Σ_{i=1}^R x_i² / σ_i² )

    Copyright 2001, 2003, Andrew W. Moore 22

This is Weighted Regression

We are asking to minimize the weighted sum of squares

    argmin_w  Σ_{i=1}^R (y_i - w x_i)² / σ_i²

where the weight for the ith datapoint is 1/σ_i².

(figure: the five datapoints again, each annotated with its σ_i)
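A minimal sketch of this weighted MLE (the x, y and σ² values are my reading of the slide's table; the code just evaluates the closed form above):

    import numpy as np

    # Datapoints and their known noise variances (from the slide's table).
    x   = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
    y   = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
    var = np.array([4.0, 1.0, 0.25, 4.0, 0.25])   # sigma_i^2

    # Closed-form weighted MLE for y = w*x with per-point noise:
    #   w = sum(x_i * y_i / var_i) / sum(x_i^2 / var_i)
    weights = 1.0 / var
    w = np.sum(weights * x * y) / np.sum(weights * x * x)
    print(w)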


    Copyright 2001, 2003, Andrew W. Moore 23

Non-linear Regression

    Copyright 2001, 2003, Andrew W. Moore 24

Non-linear Regression

Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

    x_i   y_i
    1     2.5
    2     3
    3     2
    3     3

(figure: the four datapoints plotted on x and y axes running from 0 to 3)

Assume  y_i ~ N( √(w + x_i), σ² )

What's the MLE estimate of w?


    Copyright 2001, 2003, Andrew W. Moore 25

Non-linear MLE estimation

    argmax_w  log p(y_1, y_2, ..., y_R | x_1, ..., x_R, σ, w)

        (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)

    = argmin_w  Σ_{i=1}^R ( y_i - √(w + x_i) )²

        (setting dLL/dw equal to zero)

    = the w such that  Σ_{i=1}^R ( y_i - √(w + x_i) ) / √(w + x_i) = 0

    Copyright 2001, 2003, Andrew W. Moore 26

Non-linear MLE estimation

    argmax_w  log p(y_1, y_2, ..., y_R | x_1, ..., x_R, σ, w)

        (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)

    = argmin_w  Σ_{i=1}^R ( y_i - √(w + x_i) )²

        (setting dLL/dw equal to zero)

    = the w such that  Σ_{i=1}^R ( y_i - √(w + x_i) ) / √(w + x_i) = 0

We're down the algebraic toilet.

So guess what we do?


    Copyright 2001, 2003, Andrew W. Moore 27

Non-linear MLE estimation

    argmax_w  log p(y_1, y_2, ..., y_R | x_1, ..., x_R, σ, w)

        (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)

    = argmin_w  Σ_{i=1}^R ( y_i - √(w + x_i) )²

        (setting dLL/dw equal to zero)

    = the w such that  Σ_{i=1}^R ( y_i - √(w + x_i) ) / √(w + x_i) = 0

We're down the algebraic toilet.

So guess what we do?

Common (but not only) approach:

Numerical Solutions:

    Line Search
    Simulated Annealing
    Gradient Descent
    Conjugate Gradient
    Levenberg-Marquardt
    Newton's Method

Also, special purpose statistical-optimization-specific tricks such as E.M. (See the Gaussian Mixtures lecture for an introduction.)
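As one concrete, purely illustrative numerical route, here is a plain gradient-descent sketch for the model above, assuming the y_i ~ N(√(w + x_i), σ²) form and the four datapoints as I have read them off the slide:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 3.0])
    y = np.array([2.5, 3.0, 2.0, 3.0])

    def sse(w):
        """Sum of squared residuals for the model y ~ N(sqrt(w + x), sigma^2)."""
        return np.sum((y - np.sqrt(w + x)) ** 2)

    def grad(w):
        """d/dw of the sum of squares: -sum((y_i - sqrt(w+x_i)) / sqrt(w+x_i))."""
        r = np.sqrt(w + x)
        return -np.sum((y - r) / r)

    w = 1.0                      # initial guess (keeps w + x_i > 0 throughout)
    for _ in range(5000):        # plain gradient descent, fixed step size
        w -= 0.01 * grad(w)
    print(w, sse(w))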

    Copyright 2001, 2003, Andrew W. Moore 28

Polynomial Regression


    Copyright 2001, 2003, Andrew W. Moore 29

Polynomial Regression

So far we've mainly been dealing with linear regression:

    X1  X2  Y
    3   2   7
    1   1   3
    :   :   :

    X = [ 3  2 ]    y = [ 7 ]       x_1 = (3,2), ...    y_1 = 7, ...
        [ 1  1 ]        [ 3 ]
        [ :  : ]        [ : ]

    Z = [ 1  3  2 ]    y = [ 7 ]    z_1 = (1,3,2), ...    z_k = (1, x_k1, x_k2)
        [ 1  1  1 ]        [ 3 ]    y_1 = 7, ...
        [ :  :  : ]        [ : ]

    β = (Z^T Z)^{-1} (Z^T y)

    y_est = β0 + β1 x1 + β2 x2

    Copyright 2001, 2003, Andrew W. Moore 30

Quadratic Regression

It's trivial to do linear fits of fixed nonlinear basis functions.

    X1  X2  Y
    3   2   7
    1   1   3
    :   :   :

    X = [ 3  2 ]    y = [ 7 ]       x_1 = (3,2), ...    y_1 = 7, ...
        [ 1  1 ]        [ 3 ]
        [ :  : ]        [ : ]

    Z = [ 1  3  2  9  6  4 ]    y = [ 7 ]      z = (1, x1, x2, x1², x1 x2, x2²)
        [ 1  1  1  1  1  1 ]        [ 3 ]
        [ :                ]        [ : ]

    β = (Z^T Z)^{-1} (Z^T y)

    y_est = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x1 x2 + β5 x2²

  • 7/31/2019 Intro Reg 05

    16/51

    Copyright 2001, 2003, Andrew W. Moore 31

Quadratic Regression

It's trivial to do linear fits of fixed nonlinear basis functions.

    X1  X2  Y
    3   2   7
    1   1   3
    :   :   :

    X = [ 3  2 ]    y = [ 7 ]       x_1 = (3,2), ...    y_1 = 7, ...
        [ 1  1 ]        [ 3 ]
        [ :  : ]        [ : ]

    Z = [ 1  3  2  9  6  4 ]    y = [ 7 ]      z = (1, x1, x2, x1², x1 x2, x2²)
        [ 1  1  1  1  1  1 ]        [ 3 ]
        [ :                ]        [ : ]

    β = (Z^T Z)^{-1} (Z^T y)

    y_est = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x1 x2 + β5 x2²

Each component of a z vector is called a term.

Each column of the Z matrix is called a term column.

How many terms in a quadratic regression with m inputs?

    1 constant term
    m linear terms
    (m+1)-choose-2 = m(m+1)/2 quadratic terms

(m+2)-choose-2 terms in total = O(m²)

Note that solving β = (Z^T Z)^{-1} (Z^T y) is thus O(m⁶).
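A small sketch of quadratic regression as a linear fit of fixed basis functions (the dataset and all names are made up for illustration):

    import numpy as np

    def quadratic_terms(x):
        """Expand a 2-input vector into the terms (1, x1, x2, x1^2, x1*x2, x2^2)."""
        x1, x2 = x
        return np.array([1.0, x1, x2, x1**2, x1*x2, x2**2])

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 5, size=(30, 2))                   # 30 two-component inputs
    y = (1 + 2*X[:, 0] - X[:, 1] + 0.5*X[:, 0]*X[:, 1]
         + rng.normal(scale=0.1, size=30))                # noisy quadratic truth

    Z = np.array([quadratic_terms(x) for x in X])         # the "term columns" matrix
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)          # beta = argmin ||Z beta - y||^2
    print(np.round(beta, 2))   # should roughly recover (1, 2, -1, 0, 0.5, 0)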

    Copyright 2001, 2003, Andrew W. Moore 32

Qth-degree polynomial Regression

    X1  X2  Y
    3   2   7
    1   1   3
    :   :   :

    X = [ 3  2 ]    y = [ 7 ]       x_1 = (3,2), ...    y_1 = 7, ...
        [ 1  1 ]        [ 3 ]
        [ :  : ]        [ : ]

    Z = [ 1  3  2  9  6  4  ... ]    y = [ 7 ]
        [ 1  1  1  1  1  1  ... ]        [ 3 ]
        [ :                     ]        [ : ]

    z = (all products of powers of inputs in which the sum of powers is q or less)

    β = (Z^T Z)^{-1} (Z^T y)

    y_est = β0 + β1 x1 + ...


    Copyright 2001, 2003, Andrew W. Moore 33

m inputs, degree Q: how many terms?

= the number of unique terms of the form

    x1^q1 · x2^q2 · ... · xm^qm    where    Σ_{i=1}^m q_i ≤ Q

= the number of unique terms of the form

    1^q0 · x1^q1 · x2^q2 · ... · xm^qm    where    q0 + Σ_{i=1}^m q_i = Q

= the number of lists of non-negative integers [q0, q1, q2, ..., qm] in which Σ q_i = Q

= the number of ways of placing Q red disks on a row of squares of length Q+m

= (Q+m)-choose-Q

(example: Q=11, m=4:  q0=2, q1=2, q2=0, q3=4, q4=3)
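The term count is easy to check numerically; a tiny sketch (the function name is mine):

    from math import comb

    def num_polynomial_terms(m: int, Q: int) -> int:
        """Number of terms in a degree-Q polynomial regression with m inputs: (Q+m) choose Q."""
        return comb(Q + m, Q)

    print(num_polynomial_terms(m=2, Q=2))   # 6 terms: 1, x1, x2, x1^2, x1*x2, x2^2
    print(num_polynomial_terms(m=4, Q=11))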

    Copyright 2001, 2003, Andrew W. Moore 34

Radial Basis Functions


    Copyright 2001, 2003, Andrew W. Moore 35

Radial Basis Functions (RBFs)

    X1  X2  Y
    3   2   7
    1   1   3
    :   :   :

    X = [ 3  2 ]    y = [ 7 ]       x_1 = (3,2), ...    y_1 = 7, ...
        [ 1  1 ]        [ 3 ]
        [ :  : ]        [ : ]

    Z = [ ... ]    y = [ 7 ]        z = (list of radial basis function evaluations)
                       [ 3 ]
                       [ : ]

    β = (Z^T Z)^{-1} (Z^T y)

    y_est = β0 + β1 x1 + ...

    Copyright 2001, 2003, Andrew W. Moore 36

1-d RBFs

    y_est = β1 φ1(x) + β2 φ2(x) + β3 φ3(x)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

(figure: three basis-function bumps along the x axis, centred at c1, c2, c3)


    Copyright 2001, 2003, Andrew W. Moore 37

Example

    y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

(figure: the resulting curve over the three bumps centred at c1, c2, c3)

    Copyright 2001, 2003, Andrew W. Moore 38

RBFs with Linear Regression

    y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)    (for example)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

All c_i's are held constant (initialized randomly or on a grid in m-dimensional input space).

KW is also held constant (initialized to be large enough that there's decent overlap between basis functions*).

*Usually much better than the crappy overlap on my diagram


    Copyright 2001, 2003, Andrew W. Moore 39

RBFs with Linear Regression

    y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)    (for example)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

Then given Q basis functions, define the matrix Z such that

    Z_kj = KernelFunction( |x_k - c_j| / KW )

where x_k is the kth vector of inputs.

And as before,  β = (Z^T Z)^{-1} (Z^T y)

All c_i's are held constant (initialized randomly or on a grid in m-dimensional input space).

KW is also held constant (initialized to be large enough that there's decent overlap between basis functions*).

*Usually much better than the crappy overlap on my diagram
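A minimal sketch of RBFs with linear regression, assuming a Gaussian kernel (the slides leave KernelFunction unspecified) and made-up 1-d data, fixed grid centres, and fixed KW:

    import numpy as np

    def gaussian_kernel(r):
        """One common choice of KernelFunction: exp(-r^2)."""
        return np.exp(-r**2)

    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0, 10, size=80))          # 1-d inputs
    y = np.sin(x) + rng.normal(scale=0.1, size=80)    # noisy targets

    centres = np.linspace(0, 10, 12)                  # c_j: fixed, on a grid
    KW = 1.0                                          # fixed kernel width

    # Z_kj = KernelFunction(|x_k - c_j| / KW), then beta = (Z^T Z)^{-1} Z^T y.
    Z = gaussian_kernel(np.abs(x[:, None] - centres[None, :]) / KW)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)

    y_est = Z @ beta
    print(np.mean((y - y_est) ** 2))                  # training MSE of the RBF fit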

    Copyright 2001, 2003, Andrew W. Moore 40

RBFs with NonLinear Regression

    y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)    (for example)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

But how do we now find all the β_j's, c_i's and KW?

Allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).

KW is allowed to adapt to the data. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.)


    Copyright 2001, 2003, Andrew W. Moore 41

RBFs with NonLinear Regression

    y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)    (for example)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

But how do we now find all the β_j's, c_i's and KW?

Allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).

KW is allowed to adapt to the data. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.)

Answer: Gradient Descent

    Copyright 2001, 2003, Andrew W. Moore 42

RBFs with NonLinear Regression

    y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x)    (for example)

where

    φ_i(x) = KernelFunction( |x - c_i| / KW )

But how do we now find all the β_j's, c_i's and KW?

Allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).

KW is allowed to adapt to the data. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.)

Answer: Gradient Descent

(But I'd like to see, or hope someone's already done, a hybrid, where the c_i's and KW are updated with gradient descent while the β_j's use matrix inversion.)


    Copyright 2001, 2003, Andrew W. Moore 43

Radial Basis Functions in 2-d

Two inputs. Outputs (heights sticking out of the page) not shown.

(figure: the x1-x2 plane showing a center and its sphere of significant influence)

    Copyright 2001, 2003, Andrew W. Moore 44

Happy RBFs in 2-d

Blue dots denote coordinates of input vectors.

(figure: the x1-x2 plane with the centers, their spheres of significant influence, and the input points)


    Copyright 2001, 2003, Andrew W. Moore 45

Crabby RBFs in 2-d

Blue dots denote coordinates of input vectors.

What's the problem in this example?

(figure: the x1-x2 plane with the centers, their spheres of significant influence, and the input points)

    Copyright 2001, 2003, Andrew W. Moore 46

More crabby RBFs

Blue dots denote coordinates of input vectors.

And what's the problem in this example?

(figure: the x1-x2 plane with the centers, their spheres of significant influence, and the input points)


    Copyright 2001, 2003, Andrew W. Moore 47

Hopeless!

Even before seeing the data, you should understand that this is a disaster!

(figure: centers and their spheres of significant influence in the x1-x2 plane)

    Copyright 2001, 2003, Andrew W. Moore 48

Unhappy

Even before seeing the data, you should understand that this isn't good either.

(figure: centers and their spheres of significant influence in the x1-x2 plane)


    Copyright 2001, 2003, Andrew W. Moore 49

Robust Regression

    Copyright 2001, 2003, Andrew W. Moore 50

    Robust Regression

(figure: a y vs. x scatter of datapoints)


    Copyright 2001, 2003, Andrew W. Moore 51

    Robust Regression

(figure: the datapoints with a fitted quadratic curve)

This is the best fit that Quadratic Regression can manage.

    Copyright 2001, 2003, Andrew W. Moore 52

    Robust Regression

(figure: the same datapoints with a different fitted curve)

...but this is what we'd probably prefer.


    Copyright 2001, 2003, Andrew W. Moore 53

    LOESS-based Robust Regression

(figure: the datapoints with the current fitted curve)

After the initial fit, score each datapoint according to how well it's fitted:

"You are a very good datapoint."

    Copyright 2001, 2003, Andrew W. Moore 54

    LOESS-based Robust Regression

(figure: the datapoints with the current fitted curve)

After the initial fit, score each datapoint according to how well it's fitted:

"You are a very good datapoint."

"You are not too shabby."


    Copyright 2001, 2003, Andrew W. Moore 55

    LOESS-based Robust Regression

(figure: the datapoints with the current fitted curve)

After the initial fit, score each datapoint according to how well it's fitted:

"You are a very good datapoint."

"You are not too shabby."

"But you are pathetic."

    Copyright 2001, 2003, Andrew W. Moore 56

    Robust Regression

(figure: the datapoints with the current fitted curve)

For k = 1 to R:

    Let (x_k, y_k) be the kth datapoint.
    Let y_k^est be the predicted value of y_k.
    Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

        w_k = KernelFn( [y_k - y_k^est]² )


    Copyright 2001, 2003, Andrew W. Moore 57

    Robust Regression

(figure: the datapoints with the current fitted curve)

For k = 1 to R:

    Let (x_k, y_k) be the kth datapoint.
    Let y_k^est be the predicted value of y_k.
    Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

        w_k = KernelFn( [y_k - y_k^est]² )

Then redo the regression using weighted datapoints.

Weighted regression was described earlier in the "varying noise" section, and is also discussed in the Memory-based Learning lecture.

Guess what happens next?

    Copyright 2001, 2003, Andrew W. Moore 58

    Robust Regression

(figure: the datapoints with the current fitted curve)

For k = 1 to R:

    Let (x_k, y_k) be the kth datapoint.
    Let y_k^est be the predicted value of y_k.
    Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

        w_k = KernelFn( [y_k - y_k^est]² )

Then redo the regression using weighted datapoints.

I taught you how to do this in the Instance-based lecture (only then the weights depended on distance in input-space).

Repeat the whole thing until converged!
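A hedged sketch of this iterative reweighting for a quadratic fit; the choice of KernelFn, its width, and the data (including a few injected outliers) are my own, since the slides do not pin them down:

    import numpy as np

    def kernel_fn(sq_residual, width=1.0):
        """A plausible KernelFn: the weight decays as the squared residual grows."""
        return np.exp(-sq_residual / (2.0 * width**2))

    def robust_quadratic_fit(x, y, n_iterations=10):
        """Repeatedly fit a quadratic by weighted least squares, down-weighting badly fitted points."""
        Z = np.column_stack([np.ones_like(x), x, x**2])   # quadratic term columns
        w = np.ones_like(y)                               # start with equal weights
        for _ in range(n_iterations):
            sw = np.sqrt(w)
            beta, *_ = np.linalg.lstsq(sw[:, None] * Z, sw * y, rcond=None)  # weighted LS
            y_est = Z @ beta
            w = kernel_fn((y - y_est) ** 2)               # re-score each datapoint
        return beta

    rng = np.random.default_rng(3)
    x = np.linspace(0, 4, 60)
    y = 1 + x - 0.2 * x**2 + rng.normal(scale=0.1, size=60)
    y[::15] += 4.0                                        # a few gross outliers
    print(robust_quadratic_fit(x, y))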


    Copyright 2001, 2003, Andrew W. Moore 59

Robust Regression---what we're doing

What regular regression does:

Assume y_k was originally generated using the following recipe:

    y_k = β0 + β1 x_k + β2 x_k² + N(0, σ²)

Computational task is to find the Maximum Likelihood β0, β1 and β2.

    Copyright 2001, 2003, Andrew W. Moore 60

Robust Regression---what we're doing

What LOESS robust regression does:

Assume y_k was originally generated using the following recipe:

With probability p:
    y_k = β0 + β1 x_k + β2 x_k² + N(0, σ²)
But otherwise:
    y_k ~ N(μ, σ_huge²)

Computational task is to find the Maximum Likelihood β0, β1, β2, p, μ and σ_huge.


    Copyright 2001, 2003, Andrew W. Moore 61

Robust Regression---what we're doing

What LOESS robust regression does:

Assume y_k was originally generated using the following recipe:

With probability p:
    y_k = β0 + β1 x_k + β2 x_k² + N(0, σ²)
But otherwise:
    y_k ~ N(μ, σ_huge²)

Computational task is to find the Maximum Likelihood β0, β1, β2, p, μ and σ_huge.

Mysteriously, the reweighting procedure does this computation for us.

Your first glimpse of two spectacular letters:

E.M.

    Copyright 2001, 2003, Andrew W. Moore 62

Regression Trees


    Copyright 2001, 2003, Andrew W. Moore 63

Regression Trees: decision trees for regression

    Copyright 2001, 2003, Andrew W. Moore 64

A regression tree leaf

    Predict age = 47
    (mean age of records matching this leaf node)


    Copyright 2001, 2003, Andrew W. Moore 65

A one-split regression tree

    Gender?
      Female: Predict age = 39
      Male:   Predict age = 36

    Copyright 2001, 2003, Andrew W. Moore 66

Choosing the attribute to split on

    Gender  Rich?  Num.Children  Num. BeanyBabies  Age
    Female  No     2             1                 38
    Male    No     0             0                 24
    :       :      :             :                 :
    Male    Yes    0             5+                72

We can't use information gain.

What should we use?


    Copyright 2001, 2003, Andrew W. Moore 67

Choosing the attribute to split on

    Gender  Rich?  Num.Children  Num. BeanyBabies  Age
    Female  No     2             1                 38
    Male    No     0             0                 24
    :       :      :             :                 :
    Male    Yes    0             5+                72

MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.

If we're told x=j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x=j. Call this mean quantity μ_y^{x=j}.

Then

    MSE(Y|X) = (1/R) Σ_{j=1}^{N_X} Σ_{k such that x_k = j} ( y_k - μ_y^{x=j} )²

    Copyright 2001, 2003, Andrew W. Moore 68

Choosing the attribute to split on

    Gender  Rich?  Num.Children  Num. BeanyBabies  Age
    Female  No     2             1                 38
    Male    No     0             0                 24
    :       :      :             :                 :
    Male    Yes    0             5+                72

MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.

If we're told x=j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x=j. Call this mean quantity μ_y^{x=j}.

Then

    MSE(Y|X) = (1/R) Σ_{j=1}^{N_X} Σ_{k such that x_k = j} ( y_k - μ_y^{x=j} )²

Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y|X).

Guess what we do about real-valued inputs?

Guess how we prevent overfitting.
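A small sketch of MSE(Y|X) for categorical attributes and the greedy choice among them (the records and attribute names are made up):

    import numpy as np
    from collections import defaultdict

    def mse_given_attribute(x_values, y_values):
        """MSE(Y|X): mean squared error when each record's Y is predicted by
        the mean Y of all records sharing its (categorical) X value."""
        groups = defaultdict(list)
        for x, y in zip(x_values, y_values):
            groups[x].append(y)
        total = 0.0
        for ys in groups.values():
            mean = np.mean(ys)
            total += np.sum((np.array(ys) - mean) ** 2)
        return total / len(y_values)

    # Toy records in the spirit of the slide's table.
    gender = ["Female", "Male", "Female", "Male", "Male"]
    rich   = ["No", "No", "Yes", "No", "Yes"]
    age    = np.array([38.0, 24.0, 45.0, 30.0, 72.0])

    # Greedy split selection: pick the attribute with the smallest MSE(Y|X).
    for name, attr in [("Gender", gender), ("Rich?", rich)]:
        print(name, mse_given_attribute(attr, age))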


    Copyright 2001, 2003, Andrew W. Moore 69

Pruning Decision

    Gender? (property-owner = Yes)
      Female: Predict age = 39
      Male:   Predict age = 36

Do I deserve to live?

    # property-owning females = 56712
    Mean age among POFs = 39
    Age std dev among POFs = 12

    # property-owning males = 55800
    Mean age among POMs = 36
    Age std dev among POMs = 11.5

Use a standard Chi-squared test of the null hypothesis "these two populations have the same mean" and Bob's your uncle.

    Copyright 2001, 2003, Andrew W. Moore 70

Linear Regression Trees

Also known as "Model Trees".

    Gender? (property-owner = Yes)
      Predict age = 26 + 6 * NumChildren - 2 * YearsEducation      (one leaf)
      Predict age = 24 + 7 * NumChildren - 2.5 * YearsEducation    (other leaf)

Leaves contain linear functions (trained using linear regression on all records matching that leaf).

Split attribute chosen to minimize MSE of regressed children.

Pruning with a different Chi-squared test.


    Copyright 2001, 2003, Andrew W. Moore 71

Linear Regression Trees

Also known as "Model Trees".

    Gender? (property-owner = Yes)
      Predict age = 26 + 6 * NumChildren - 2 * YearsEducation      (one leaf)
      Predict age = 24 + 7 * NumChildren - 2.5 * YearsEducation    (other leaf)

Leaves contain linear functions (trained using linear regression on all records matching that leaf).

Split attribute chosen to minimize MSE of regressed children.

Pruning with a different Chi-squared test.

Detail: you typically ignore any categorical attribute that has been tested higher up in the tree during the regression. But use all untested attributes, and use real-valued attributes even if they've been tested above.

    Copyright 2001, 2003, Andrew W. Moore 72

    Test your understanding

(figure: a y vs. x scatter of datapoints)

Assuming regular regression trees, can you sketch a graph of the fitted function y_est(x) over this diagram?


    Copyright 2001, 2003, Andrew W. Moore 73

    Test your understanding

(figure: the same y vs. x scatter of datapoints)

Assuming linear regression trees, can you sketch a graph of the fitted function y_est(x) over this diagram?

    Copyright 2001, 2003, Andrew W. Moore 74

Multilinear Interpolation


    Copyright 2001, 2003, Andrew W. Moore 75

    Multilinear Interpolation

(figure: a y vs. x scatter of datapoints)

Consider this dataset. Suppose we wanted to create a continuous and piecewise linear fit to the data.

    Copyright 2001, 2003, Andrew W. Moore 76

    Multilinear Interpolation

(figure: the datapoints with knot positions q1, q2, q3, q4, q5 marked on the x axis)

Create a set of knot points: selected X-coordinates (usually equally spaced) that cover the data.


    Copyright 2001, 2003, Andrew W. Moore 77

    Multilinear Interpolation

(figure: three piecewise linear curves drawn over the knots q1 ... q5)

We are going to assume the data was generated by a noisy version of a function that can only bend at the knots. Here are 3 examples (none fits the data well).

    Copyright 2001, 2003, Andrew W. Moore 78

How to find the best fit?

Idea 1: Simply perform a separate regression in each segment for each part of the curve.

What's the problem with this idea?

(figure: the datapoints and the knots q1 ... q5)


    Copyright 2001, 2003, Andrew W. Moore 79

How to find the best fit?

Let's look at what goes on in the red segment (between q2 and q3, with knot heights h2 and h3):

    y_est(x) = ( (q3 - x) / w ) h2  +  ( (x - q2) / w ) h3        where  w = q3 - q2

    Copyright 2001, 2003, Andrew W. Moore 80

How to find the best fit?

In the red segment:

    y_est(x) = h2 φ2(x) + h3 φ3(x)

where

    φ2(x) = 1 - (x - q2)/w,        φ3(x) = 1 - (q3 - x)/w

(figure: φ2(x) drawn over the segment)


    Copyright 2001, 2003, Andrew W. Moore 81

How to find the best fit?

In the red segment:

    y_est(x) = h2 φ2(x) + h3 φ3(x)

where

    φ2(x) = 1 - (x - q2)/w,        φ3(x) = 1 - (q3 - x)/w

(figure: φ2(x) and φ3(x) drawn over the segment)

    Copyright 2001, 2003, Andrew W. Moore 82

How to find the best fit?

In the red segment:

    y_est(x) = h2 φ2(x) + h3 φ3(x)

where

    φ2(x) = 1 - |x - q2|/w,        φ3(x) = 1 - |x - q3|/w

(figure: φ2(x) and φ3(x) drawn over the segment)


    Copyright 2001, 2003, Andrew W. Moore 83

How to find the best fit?

In the red segment:

    y_est(x) = h2 φ2(x) + h3 φ3(x)

where

    φ2(x) = 1 - |x - q2|/w,        φ3(x) = 1 - |x - q3|/w

(figure: φ2(x) and φ3(x) drawn over the segment)

    Copyright 2001, 2003, Andrew W. Moore 84

How to find the best fit?

In general:

    y_est(x) = Σ_{i=1}^{N_K} h_i φ_i(x)

(where N_K is the number of knots)

(figure: the tent-shaped basis functions over the knots q1 ... q5)


    Copyright 2001, 2003, Andrew W. Moore 85

How to find the best fit?

In general:

    y_est(x) = Σ_{i=1}^{N_K} h_i φ_i(x)

(figure: the tent-shaped basis functions over the knots q1 ... q5)
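A sketch of the 1-d case as a linear regression on "tent" basis functions over equally spaced knots (the data, knot count, and helper names are my own; the slide only gives the general form y_est(x) = Σ h_i φ_i(x)):

    import numpy as np

    def tent_basis(x, knots):
        """phi_i(x) = max(0, 1 - |x - q_i| / w) for equally spaced knots with spacing w."""
        w = knots[1] - knots[0]
        return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots[None, :]) / w)

    rng = np.random.default_rng(4)
    x = np.sort(rng.uniform(0, 4, size=120))
    y = np.sin(2 * x) + rng.normal(scale=0.1, size=120)

    knots = np.linspace(0, 4, 9)                     # q_1 ... q_NK, equally spaced
    Phi = tent_basis(x, knots)                       # design matrix of phi_i(x_k)
    h, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # optimal knot heights

    y_est = Phi @ h                                  # continuous, piecewise linear fit
    print(np.mean((y - y_est) ** 2))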


    Copyright 2001, 2003, Andrew W. Moore 87

In two dimensions

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

(figure: a grid of knot points over the x1-x2 plane, with the input points scattered among them)

    Copyright 2001, 2003, Andrew W. Moore 88

In two dimensions

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

But how do we do the interpolation to ensure that the surface is continuous?

(figure: one grid cell whose four corner knots have heights 9, 7, 8 and 3)


    Copyright 2001, 2003, Andrew W. Moore 89

In two dimensions

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

But how do we do the interpolation to ensure that the surface is continuous?

To predict the value here... (a query point inside the cell whose corner knots have heights 9, 7, 8 and 3)

    Copyright 2001, 2003, Andrew W. Moore 90

In two dimensions

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

But how do we do the interpolation to ensure that the surface is continuous?

To predict the value here: first interpolate its value on two opposite edges (here giving 7.33 and 7).


    Copyright 2001, 2003, Andrew W. Moore 91

In two dimensions

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

But how do we do the interpolation to ensure that the surface is continuous?

To predict the value here: first interpolate its value on two opposite edges (here giving 7.33 and 7), then interpolate between those two values (here giving 7.05).

    Copyright 2001, 2003, Andrew W. Moore 92

In two dimensions

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

To predict the value at a query point: first interpolate its value on two opposite edges of its cell, then interpolate between those two values.

Notes:

    This can easily be generalized to m dimensions.
    It should be easy to see that it ensures continuity.
    The patches are not linear.
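A sketch of the two-step edge-then-between interpolation inside one grid cell; the corner heights and query position here are made up, and the corner naming convention is mine:

    def bilinear_interpolate(fx, fy, h00, h10, h01, h11):
        """Interpolate inside one grid cell.

        fx, fy are the query point's fractional positions (0..1) along the two axes;
        h00, h10, h01, h11 are the knot heights at the cell's four corners
        (h[ix][iy], with ix along x1 and iy along x2).
        """
        # First interpolate along two opposite edges (constant iy on each edge)...
        bottom = h00 + fx * (h10 - h00)
        top    = h01 + fx * (h11 - h01)
        # ...then interpolate between those two values.
        return bottom + fy * (top - bottom)

    # Example with four made-up corner heights.
    print(bilinear_interpolate(0.25, 0.5, h00=7.0, h10=8.0, h01=3.0, h11=9.0))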


    Copyright 2001, 2003, Andrew W. Moore 93

Doing the regression

Given data, how do we find the optimal knot heights?

Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.)

What's the problem in higher dimensions?

(figure: the knot grid with example heights 9, 7, 8, 3)

    Copyright 2001, 2003, Andrew W. Moore 94

MARS: Multivariate Adaptive Regression Splines


    Copyright 2001, 2003, Andrew W. Moore 95

MARS: Multivariate Adaptive Regression Splines

Invented by Jerry Friedman (one of Andrew's heroes).

Simplest version:

Let's assume the function we are learning is of the following form:

    y_est(x) = Σ_{k=1}^m g_k(x_k)

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

    Copyright 2001, 2003, Andrew W. Moore 96

MARS

    y_est(x) = Σ_{k=1}^m g_k(x_k)

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

Idea: each g_k is one of these (a piecewise linear function over its own knots, as in the 1-d multilinear interpolation above).

(figure: a piecewise linear g over knots q1 ... q5)


    Copyright 2001, 2003, Andrew W. Moore 97

MARS

    y_est(x) = Σ_{k=1}^m g_k(x_k)

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

    y_est(x) = Σ_{k=1}^m Σ_{j=1}^{N_K} h_j^k φ_j^k(x_k)

(figure: the piecewise linear basis over knots q1 ... q5)
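A hedged sketch of the "simplest version" additive form above, using fixed tent bases per input and one joint linear fit for all the h_j^k; note this is not Friedman's full adaptive MARS procedure, just the model written on this slide:

    import numpy as np

    def tent_basis(v, knots):
        """Tent basis phi_j(v) = max(0, 1 - |v - q_j| / w) for equally spaced knots."""
        w = knots[1] - knots[0]
        return np.maximum(0.0, 1.0 - np.abs(v[:, None] - knots[None, :]) / w)

    rng = np.random.default_rng(5)
    R, m, n_knots = 200, 2, 7
    X = rng.uniform(0, 4, size=(R, m))
    y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=R)

    # Additive model: y_est(x) = sum_k g_k(x_k), each g_k expanded in its own
    # knot basis, so all the h_j^k come from a single linear least-squares fit.
    knots = np.linspace(0, 4, n_knots)
    Z = np.hstack([tent_basis(X[:, k], knots) for k in range(m)])
    h, *_ = np.linalg.lstsq(Z, y, rcond=None)
    print(np.mean((y - Z @ h) ** 2))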


    Copyright 2001, 2003, Andrew W. Moore 99

If you like MARS...

See also CMAC (Cerebellar Model Articulated Controller) by James Albus (another of Andrew's heroes).

Many of the same gut-level intuitions.

But entirely in a neural-network, biologically plausible way.

(All the low dimensional functions are by means of lookup tables, trained with a delta-rule and using a clever blurred update and hash-tables.)

    Copyright 2001, 2003, Andrew W. Moore 100

Where are we now?

    Inputs -> Classifier -> Predict category:
        Dec Tree, Gauss/Joint BC, Gauss Naïve BC

    Inputs -> Density Estimator -> Probability:
        Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE

    Inputs -> Regressor -> Predict real no.:
        Linear Regression, Polynomial Regression, RBFs, Robust Regression, Regression Trees, Multilinear Interp, MARS

    Inputs -> Inference Engine -> Learn p(E1|E2):
        Joint DE


    Copyright 2001, 2003, Andrew W. Moore 101

Citations

Radial Basis Functions

    T. Poggio and F. Girosi, "Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks," Science, 247, 978-982, 1989.

LOESS

    W. S. Cleveland, "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74(368), 829-836, December 1979.

Regression Trees etc.

    L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984.

    J. R. Quinlan, "Combining Instance-Based and Model-Based Learning," Machine Learning: Proceedings of the Tenth International Conference, 1993.

MARS

    J. H. Friedman, "Multivariate Adaptive Regression Splines," Department of Statistics, Stanford University, Technical Report No. 102, 1988.