
Regression and Least Squares: A MATLAB Tutorial

Dr. Michael D. Porter (porter@stat.ncsu.edu)

Department of Statistics, North Carolina State University

and SAMSI

Tuesday May 20, 2008


Introduction to Regression

Goal: Express the relationship between two (or more) variables by a mathematical formula.

x is the predictor (independent) variable
y is the response (dependent) variable

We specifically want to indicate how y varies as a function of x.

y(x) is considered a random variable, so it can never be predicted perfectly.

Example: Relating Shoe Size to Height
The problem

Footwear impressions are commonly observed at crime scenes. While there are numerous forensic properties that can be obtained from these impressions, one in particular is the shoe size. The detectives would like to be able to estimate the height of the impression maker from the shoe size.

Example: Relating Shoe Size to Height
The data

[Scatter plot: Determining Height from Shoe Size. x-axis: Shoe Size (Mens), 6 to 15; y-axis: Height (in), 60 to 76]

Data taken from: http://staff.imsa.edu/~brazzle/E2Kcurr/Forensic/Tracks/TracksSummary.html

Example: Relating Shoe Size to Height
Your answers

[Scatter plot: Determining Height from Shoe Size. x-axis: Shoe Size (Mens); y-axis: Height (in)]

1. What is the predictor? What is the response?

2. Can the height of the impression maker be accurately estimated from the shoe size?

3. If a shoe is size 11, what would you advise the police?

4. What if the size is 7? Size 12.5?

General Regression Model

Assume the true model is of the form:

y(x) = m(x) + ε(x)

The systematic part, m(x), is deterministic.
The error, ε(x), is a random variable:
- Measurement error
- Natural variations due to exogenous factors
Therefore, y(x) is also a random variable.

The error is additive.

Example: Sinusoid Function

y(x) = A · sin(ωx + φ) + ε(x)

A = 1; ω = π/2; φ = π; σ = 0.5

[Plot of y(x) and m(x) for x from 0 to 10]

Amplitude A
Angular frequency ω
Phase φ
Random error ε(x) ∼ N(0, σ²)
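A minimal MATLAB sketch (not from the original slides) that simulates and plots this model:

    % Simulate y(x) = A*sin(w*x + phi) + eps(x) with the parameters above
    A = 1; w = pi/2; phi = pi; sigma = 0.5;
    x = linspace(0, 10, 200)';
    m = A * sin(w*x + phi);            % systematic part m(x)
    y = m + sigma * randn(size(x));    % additive error, eps(x) ~ N(0, sigma^2)
    plot(x, y, 'b.', x, m, 'r-')
    xlabel('x'); ylabel('y(x)'); legend('y(x)', 'm(x)')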

Regression Modeling

We want to estimate m(x) and possibly the distribution of ε(x).

There are two general situations:

Theoretical Models: m(x) is of some known (or hypothesized) form but with some parameters unknown (e.g. the sinusoid function with A, ω, φ unknown).

Empirical Models: m(x) is constructed from the observed data (e.g. shoe size and height).

We often end up using both: constructing models from the observed data and prior knowledge.

The Standard Assumptions

y(x) = m(x) + ε(x)

A1: E[ε(x)] = 0 ∀ x (Mean 0)
A2: Var[ε(x)] = σ² ∀ x (Homoskedastic)
A3: Cov[ε(x), ε(x′)] = 0 ∀ x ≠ x′ (Uncorrelated)

These assumptions are only on the error term:

ε(x) = y(x) − m(x)

Residuals

The residuals

e(x_i) = y(x_i) − m̂(x_i)

can be used to check the estimated model, m̂(x).

If the model fit is good, the residuals should satisfy our three assumptions.

A1 - Mean 0

[Two residual plots of e(x) against x: one violates A1, one satisfies A1]

A2 - Constant Variance

[Two residual plots of e(x) against x: one violates A2, one satisfies A2]

A3 - Uncorrelated

[Two residual plots of e(x) against x: one violates A3, one satisfies A3]

Back to the Shoes

How can we estimate m(x) for the shoe example?

(Non-parametric): For each shoe size, take the mean of the observed heights.
(Parametric): Assume the trend is linear.

[Scatter plot: Determining Height from Shoe Size, with 'Local Mean' and 'Linear Trend' overlays. x-axis: Shoe Size (Mens); y-axis: Height (in)]

Simple Linear Regression

Simple linear regression assumes that m(x) is of the parametric form

m(x) = β_0 + β_1 x

which is the equation for a line.

Simple Linear Regression

Which line is the best estimate?

[Scatter plot: Determining Height from Shoe Size, with three candidate lines. x-axis: Shoe Size (Mens); y-axis: Height (in)]

m(x) = β_0 + β_1 x

          β_0    β_1
Line #1   48.6   1.9
Line #2   51.5   1.6
Line #3   45.0   2.3

Estimating Parameters in Linear Regression
Data

Write the observed data:

y_i = β_0 + β_1 x_i + ε_i,   i = 1, 2, . . . , n

where

y_i ≡ y(x_i) is the response value for observation i
β_0 and β_1 are the unknown parameters (regression coefficients)
x_i is the predictor value for observation i
ε_i ≡ ε(x_i) is the random error for observation i

Estimating Parameters in Linear Regression
Statistical Decision Theory

Let g(x) ≡ g(x; β) be an estimator for y(x).

Define a loss function, L(y(x), g(x)), which describes how far g(x) is from y(x).

Example: Squared Error Loss

L(y(x), g(x)) = (y(x) − g(x))²

The best predictor minimizes the Risk (or expected loss)

R(x) = E[L(y(x), g(x))]

g*(x) = arg min_{g ∈ G} E[L(y(x), g(x))]

Estimating Parameters in Linear Regression
Method of Least Squares

If we assume a squared error loss function

L(y_i, m_i) = (y_i − (β_0 + β_1 x_i))²

an approximation to the Risk function is the Sum of Squared Errors (SSE):

R(β_0, β_1) = Σ_{i=1}^{n} (y_i − (β_0 + β_1 x_i))²

Then it makes sense to estimate (β_0, β_1) as the values that minimize R(β_0, β_1):

(β̂_0, β̂_1) = arg min_{β_0, β_1} R(β_0, β_1)

Estimating Parameters in Linear Regression
Derivation of the Linear Least Squares Solution

R(β_0, β_1) = Σ_{i=1}^{n} (y_i − (β_0 + β_1 x_i))²

Differentiate the Risk function with respect to the unknown parameters and equate to 0:

∂R/∂β_0 = −2 Σ_{i=1}^{n} (y_i − (β_0 + β_1 x_i)) = 0

∂R/∂β_1 = −2 Σ_{i=1}^{n} x_i (y_i − (β_0 + β_1 x_i)) = 0

Estimating Parameters in Linear Regression
Linear Least Squares Solution

R(β_0, β_1) = Σ_{i=1}^{n} (y_i − (β_0 + β_1 x_i))²

The least squares estimates are

β̂_1 = (Σ_{i=1}^{n} x_i y_i − n x̄ ȳ) / (Σ_{i=1}^{n} x_i² − n x̄²)

β̂_0 = ȳ − β̂_1 x̄

where x̄ and ȳ are the sample means of the x_i's and y_i's.
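These formulas translate directly into MATLAB. A minimal sketch (the x and y vectors here are made up for illustration, not the workshop data):

    % Closed-form least squares estimates for a simple linear regression
    x = [8 9 10 11 12 13]';      % hypothetical shoe sizes
    y = [65 67 68 70 71 73]';    % hypothetical heights (in)
    n = length(x);
    xbar = mean(x); ybar = mean(y);
    b1 = (sum(x.*y) - n*xbar*ybar) / (sum(x.^2) - n*xbar^2);   % slope
    b0 = ybar - b1*xbar;                                       % intercept

For a straight line, polyfit(x, y, 1) returns the same two estimates (highest power first).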

And the winner is ...

Line #2!

[Scatter plot: Determining Height from Shoe Size, with the three candidate lines. x-axis: Shoe Size (Mens); y-axis: Height (in)]

For these data: x̄ = 11.03, ȳ = 69.31

β̂_0 = 51.46
β̂_1 = 1.62

Residuals

The fitted value, ŷ_i, for the ith observation is

ŷ_i = β̂_0 + β̂_1 x_i

The residual, e_i, is the difference between the observed and fitted value:

e_i = y_i − ŷ_i

The residuals are used to check if our three assumptions appear valid.

Residuals for shoe size data

[Residual plot: Determining Height from Shoe Size. x-axis: Shoe Size (Mens); y-axis: residual, from −5 to 5]

Example of poor fit

[Scatter plot of y(x) against x, and the corresponding residual plot of e(x) against x]

Adding Polynomial Terms in the Linear Model

Modeling the mean trend as a line doesn't seem to fit extremely well in the above example: there is a systematic lack of fit.

Consider a polynomial form for the mean:

m(x) = β_0 + β_1 x + β_2 x² + . . . + β_p x^p = Σ_{k=0}^{p} β_k x^k

This is still considered a linear model: m(x) is a linear combination of the β_k.

Danger of over-fitting.
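A short sketch of fitting polynomial terms in MATLAB (simulated data; polyfit constructs the polynomial least squares fit):

    % Compare a 1st-order and a quadratic fit on simulated data
    x = linspace(-1, 1, 50)';
    y = 2 + 3*x.^2 + 0.5*randn(size(x));   % truth is quadratic
    p1 = polyfit(x, y, 1);                 % linear fit (systematic lack of fit)
    p2 = polyfit(x, y, 2);                 % quadratic fit
    plot(x, y, '.', x, polyval(p1, x), '--', x, polyval(p2, x), '-')
    legend('data', '1st order', 'quadratic')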

Quadratic Fit: y(x) = β_0 + β_1 x + β_2 x² + ε(x)

[Scatter plots showing the 1st-order and quadratic fits, and residual plots for the quadratic fit]

Matrix Approach to Linear Least Squares
Setup

Previously, we wrote our data as y_i = Σ_{k=0}^{p} β_k x_i^k + ε_i. In matrix notation this becomes

Y = Xβ + ε

where

Y = (y_1, y_2, . . . , y_n)^T
β = (β_0, β_1, . . . , β_p)^T
ε = (ε_1, ε_2, . . . , ε_n)^T
X is the n × (p + 1) design matrix whose ith row is (1, x_i, x_i², . . . , x_i^p)

How many unknown parameters are in the model?

Matrix Approach to Linear Least Squares
Solution

To minimize the SSE (Sum of Squared Errors), use the Risk function

R(β) = (Y − Xβ)^T (Y − Xβ)

Taking the derivative w.r.t. β gives the Normal Equations

X^T X β = X^T Y

The least squares solution for β is ...
Hint: See "Linear Inverse Problems: A MATLAB Tutorial" by Qin Zhang

β̂ = (X^T X)^{−1} X^T Y
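In MATLAB the backslash operator computes this solution via a QR factorization, which is numerically preferable to forming the inverse explicitly. A minimal sketch with simulated data:

    % Least squares via the design matrix
    n = 50; p = 2;
    x = linspace(-1, 1, n)';
    X = [ones(n,1), x, x.^2];            % n-by-(p+1) design matrix
    y = X*[2; 0; 3] + 0.5*randn(n,1);    % simulated response
    beta = X \ y;                        % least squares estimate
    % inv(X'*X)*(X'*y) gives the same answer, less stably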

STRETCH BREAK!!!


MATLAB Demonstration
Linear Least Squares

MATLAB Demo #1: Open Regression_Intro.m

Model Selection

How can we compare and select a final model?

How many terms should be included in polynomial models?

What is the danger of over-fitting (including too many terms)?

What is the problem with under-fitting (not including enough terms)?

Estimating Variance

Recall assumptions A1, A2, and A3.

For our fitted model, the residuals e_i = y_i − ŷ_i can be used to estimate Var[ε(x)].

An estimator for the variance is ...
Hint: See "Basic Statistical Concepts and Some Probability Essentials" by Justin Shows and Betsy Enstrom

The Sample Variance

s_z² = (1/(n − 1)) Σ_{i=1}^{n} (z_i − z̄)²

Estimating Variance

Sample Variance for a rv z:

s_z² = (1/(n − 1)) Σ_{i=1}^{n} (z_i − z̄)²

The estimator for the regression problem is similar:

σ̂_ε² = (1/(n − (p + 1))) Σ_{i=1}^{n} e_i² = SSE/df

where the degrees of freedom df = n − (p + 1); there are p + 1 unknown parameters in the model.
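A brief sketch of this estimator in MATLAB (simulated straight-line data, so p = 1):

    % Estimate the error variance from the residuals
    n = 30; p = 1;
    x = (1:n)';
    y = 2 + 0.5*x + randn(n,1);           % simulated data
    X = [ones(n,1), x];
    beta = X \ y;
    e = y - X*beta;                       % residuals e_i = y_i - yhat_i
    sigma2 = sum(e.^2) / (n - (p + 1));   % SSE / df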

Statistical Inference
An additional assumption

In order to calculate confidence intervals (C.I.), we need a distributional assumption on ε(x).

Up to now, we haven't needed one.

The standard assumption is a Normal or Gaussian distribution:

A4: ε(x) ∼ N(0, σ²)

Statistical Inference
Distributions

Using

y(x_0) = x_0^T β + ε(x_0),  so that  y(x_0) ∼ N(x_0^T β, σ²)

β̂ = (X^T X)^{−1} X^T Y

where x_0 is a point in design space, together with the 4 assumptions, we find

m̂(x_0) ∼ N(x_0^T β, σ² x_0^T (X^T X)^{−1} x_0)

ŷ(x_0) ∼ N(x_0^T β, σ² (1 + x_0^T (X^T X)^{−1} x_0))

β̂ ∼ MVN(β, σ² (X^T X)^{−1})

From these we can find C.I.'s and perform hypothesis tests.
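As an illustration, a 95% confidence interval for m(x_0) can be computed as follows (a minimal sketch on simulated data; tinv requires the Statistics Toolbox):

    % 95% CI for the mean response at a design point x0
    n = 30; p = 1;
    x = (1:n)'; X = [ones(n,1), x];
    y = 2 + 0.5*x + randn(n,1);
    beta = X \ y;
    sigma2 = sum((y - X*beta).^2) / (n - (p+1));
    x0 = [1; 10];                                  % design point (x = 10)
    se = sqrt(sigma2 * (x0' / (X'*X)) * x0);       % sd of mhat(x0)
    ci = x0'*beta + [-1, 1] * tinv(0.975, n-(p+1)) * se;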

Model Comparison
R²

Sum of Squares Error:

SSE = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} e_i² = e^T e

Sum of Squares Total:

SST = Σ_{i=1}^{n} (y_i − ȳ)²

This is the SSE for the intercept-only model, ŷ(x) = ȳ.

Coefficient of Determination:

R² = 1 − SSE/SST

R² is a measure of how much better a regression model is than the intercept-only model.

Model Comparison
Adjusted R²

What happens to R² if you add more terms to the model?

R² = 1 − SSE/SST

Adjusted R² penalizes by the number of terms (p + 1) in the model:

R²_adj = 1 − (SSE/(n − (p + 1))) / (SST/(n − 1)) = 1 − σ̂_ε² / (SST/(n − 1))

Also see residual plots, Mallow's Cp, PRESS (cross-validation), AIC, etc.
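Both quantities are easy to compute by hand in MATLAB (a minimal sketch reusing a simulated straight-line fit):

    % R^2 and adjusted R^2 for a fitted model
    n = 30; p = 1;
    x = (1:n)'; X = [ones(n,1), x];
    y = 2 + 0.5*x + randn(n,1);
    yhat = X * (X \ y);                  % fitted values
    SSE = sum((y - yhat).^2);
    SST = sum((y - mean(y)).^2);
    R2 = 1 - SSE/SST;
    R2adj = 1 - (SSE/(n-(p+1))) / (SST/(n-1));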

MATLAB Demonstration
cftool

MATLAB Demo #2: Type cftool

Nonlinear Regression

A linear regression model can be written

y(x) = Σ_{k=0}^{p} β_k h_k(x) + ε(x)

The mean, m(x), is a linear combination of the β's.

Nonlinear regression takes the general form

y(x) = m(x; β) + ε(x)

for some specified function m(x; β) with unknown parameters β.

Example

The sinusoid we looked at earlier,

y(x) = A · sin(ωx + φ) + ε(x)

with parameters β = (A, ω, φ), is a nonlinear model.

Nonlinear Regression
Parameter Estimation

Making the same assumptions as in linear regression (A1-A3), the least squares solution is still valid:

β̂ = arg min_β Σ_{i=1}^{n} (y_i − m(x_i; β))²

Unfortunately, this usually doesn't have a closed form solution (like in the linear case).

Approaches to finding the solution will be discussed later in the workshop.

But that won't stop us from using nonlinear (and nonparametric) regression in MATLAB!
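For instance, base MATLAB's fminsearch can minimize the SSE numerically (a minimal sketch; nonlinear fits are sensitive to the starting guess):

    % Nonlinear least squares for y = A*sin(w*x + phi) + eps
    x = linspace(0, 10, 100)';
    y = sin((pi/2)*x + pi) + 0.5*randn(size(x));        % simulated data (A = 1)
    sse = @(b) sum((y - b(1)*sin(b(2)*x + b(3))).^2);   % risk R(beta), beta = (A, w, phi)
    bhat = fminsearch(sse, [1; 1.5; 3]);                % requires a starting guess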

Off again to cftool

MATLAB Demo #3


Weighted Regression

Consider the risk function we have used so far:

R(β) = Σ_{i=1}^{n} (y_i − m(x_i; β))²

Each observation contributes equally to the risk.

Weighted regression uses the risk function

R_w(β) = Σ_{i=1}^{n} w_i (y_i − m(x_i; β))²

so observations with larger weights are more important.

Some examples:

w_i = 1/σ_i²   Heteroskedastic (non-constant variance)
w_i = 1/x_i
w_i = 1/y_i
w_i = k/|e_i|   Robust regression
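For a linear mean, MATLAB's lscov accepts a weight vector directly. A quick sketch with simulated heteroskedastic data:

    % Weighted least squares: minimize sum of w_i*(y_i - x_i'*beta)^2
    n = 50;
    x = linspace(1, 10, n)';
    y = 2 + 1.5*x + 0.2*x.*randn(n,1);   % error sd grows with x
    X = [ones(n,1), x];
    w = 1 ./ x.^2;                       % w_i = 1/sigma_i^2, sigma_i proportional to x_i
    beta_w = lscov(X, y, w);             % weighted least squares estimate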

Transformations

Sometimes transformations are used to obtain better models:

Transform the predictors: x → x′
Transform the response: y → y′

Make sure assumptions A1-A3 (and A4) are still valid.

Standardize: x′ = (x − x̄)/s_x
Log: y′ = log(y)
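Each transformation is one line in MATLAB (a minimal sketch with made-up data; the log transform assumes y > 0):

    % Standardize a predictor and log-transform a response
    x = [6 8 9 11 13]'; y = [61 65 67 70 75]';   % hypothetical data
    xs = (x - mean(x)) / std(x);                 % x' = (x - xbar)/s_x
    yl = log(y);                                 % y' = log(y)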

The Competition

Contest to see who can construct the best model in cftool.

Get into groups.

Data can be found in competition_data.m.

Scoring will be performed on a testing set.

The goal is to minimize the sum of squared errors.

When your group is ready, enter your model into this computer.

MATLAB Help

There is lots of good assistance in the MATLAB help window.

Specifically, look at the Demos tab of the help window.

The Statistics (Regression) and Optimization Toolboxes may be particularly useful for this workshop.

Have a great workshop!
