Simple Linear Regression: An Introduction Dr. Tuan V. Nguyen Garvan Institute of Medical Research Sydney

Dec 22, 2015

Page 1

Simple Linear Regression: An Introduction

Dr. Tuan V. Nguyen

Garvan Institute of Medical Research

Sydney

Page 2

Give a man three weapons – correlation, regression and a pen – and he will use all three

(Anon, 1978)

Page 3

An example

ID   Age   Chol (mg/ml)
 1    46   3.5
 2    20   1.9
 3    52   4.0
 4    30   2.6
 5    57   4.5
 6    25   3.0
 7    28   2.9
 8    36   3.8
 9    22   2.1
10    43   3.8
11    57   4.1
12    33   3.0
13    22   2.5
14    63   4.6
15    40   3.2
16    48   4.2
17    28   2.3
18    49   4.0

Age and cholesterol levels in 18 individuals

Page 4

Read data into R

# Subject IDs, ages and cholesterol levels for the 18 individuals
id <- 1:18
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
         43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Scatter plot of cholesterol against age (pch=16 draws solid circles)
plot(chol ~ age, pch=16)

Page 5

[Figure: scatter plot of chol (y-axis, 2.0 to 4.5) against age (x-axis, 20 to 60)]

Page 6

Questions of interest

• Association between age and cholesterol levels

• Strength of association

• Prediction of cholesterol for a given age

Correlation and Regression analysis

Page 7

Variance and covariance: algebra

• Let x and y be two random variables from a sample of n observations.

• Measure of variability of x and y: variance

$$\mathrm{var}(x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

$$\mathrm{var}(y) = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$$

• Measure of covariation between x and y?

• Algebraically:

var(x + y) = var(x) + var(y)   (if x and y are independent)

var(x + y) = var(x) + var(y) + 2cov(x, y)   (if x and y are dependent)

Where:

$$\mathrm{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$
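As a hedged illustration, the definitions above can be checked numerically in R with the age and cholesterol data from the example. The names `var_x_manual`, `cov_manual`, `lhs` and `rhs` are ours and do not come from the slides; the divisor n - 1 is assumed because it reproduces the SD and covariance quoted later in the deck.

```r
# Age and cholesterol data from the example (18 individuals)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
          43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

# Sample variance and covariance from their definitions (divisor n - 1)
var_x_manual <- sum((age - mean(age))^2) / (n - 1)
cov_manual   <- sum((age - mean(age)) * (chol - mean(chol))) / (n - 1)

# Compare with R's built-in functions
print(c(manual = var_x_manual, builtin = var(age)))
print(c(manual = cov_manual,   builtin = cov(age, chol)))

# Check the identity var(x + y) = var(x) + var(y) + 2 cov(x, y)
lhs <- var(age + chol)
rhs <- var(age) + var(chol) + 2 * cov(age, chol)
print(all.equal(lhs, rhs))
```

Adding age to cholesterol has no practical meaning; the sum is used only to verify the algebraic identity.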

Page 8

Variance and covariance: geometry

• The independence or dependence between x and y can be represented geometrically:

[Figure: two triangles with sides x and y and third side h; in the first, x and y meet at a right angle, in the second they meet at an angle H]

Right angle between x and y (independence): $h^2 = x^2 + y^2$

Angle H between x and y (dependence): $h^2 = x^2 + y^2 - 2xy\cos(H)$

Page 9

Meaning of variance and covariance

• Variance is always non-negative.

• If x and y are independent, their covariance is 0. The converse does not hold in general: a covariance of 0 only rules out a linear association.

• Covariance is based on the sum of cross-products of deviations from the means: it can be positive or negative.

• Negative covariance = deviations in the two distributions are in opposite directions, e.g. genetic covariation.

• Positive covariance = deviations in the two distributions are in the same direction.

• Covariance = a measure of the strength of linear association, expressed in the units of x and y.

Page 10

Covariance and correlation

• Covariance is unit-dependent.

• The coefficient of correlation (r) between x and y is a standardized covariance.

• r is defined by:

$$r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} = \frac{\mathrm{cov}(x, y)}{SD_x \, SD_y}$$
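A minimal sketch of what "standardized covariance" means in practice, using the age and cholesterol data from the example. Re-expressing age in months is a hypothetical rescaling of ours, chosen only to show that the covariance changes with the unit while r does not.

```r
# Age and cholesterol data from the example (18 individuals)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
          43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# r as covariance divided by the product of the standard deviations
r_manual  <- cov(age, chol) / (sd(age) * sd(chol))
r_builtin <- cor(age, chol)
print(c(manual = r_manual, builtin = r_builtin))

# Hypothetical change of unit: age in months instead of years
age_months <- age * 12
cov_years  <- cov(age, chol)
cov_months <- cov(age_months, chol)   # 12 times larger than cov_years
r_months   <- cor(age_months, chol)   # same as r_builtin
print(c(cov_years = cov_years, cov_months = cov_months, r_months = r_months))
```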

Page 11

Positive and negative correlation

[Figure: two scatter plots of y against x (x from 8 to 16), one with r = 0.9 and one with r = -0.9]

Page 12

Test of hypothesis of correlation

• Hypothesis: Ho: r = 0 versus H1: r not equal to 0.

• Standard error of r is:

$$SE(r) = \sqrt{\frac{1 - r^2}{n - 2}}$$

• The t-statistic:

$$t = \frac{r}{SE(r)} = r\sqrt{\frac{n - 2}{1 - r^2}}$$

• This statistic has a t distribution with n – 2 degrees of freedom.

• Fisher’s z-transformation:

$$z = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right)$$

• Standard error of z:

$$SE(z) = \frac{1}{\sqrt{n - 3}}$$

• Then 95% CI of z can be constructed as:

$$z \pm 1.96 \times \frac{1}{\sqrt{n - 3}}$$

Page 13

An illustration of correlation analysis

ID     Age (x)   Cholesterol (y; mg/100ml)
 1     46        3.5
 2     20        1.9
 3     52        4.0
 4     30        2.6
 5     57        4.5
 6     25        3.0
 7     28        2.9
 8     36        3.8
 9     22        2.1
10     43        3.8
11     57        4.1
12     33        3.0
13     22        2.5
14     63        4.6
15     40        3.2
16     48        4.2
17     28        2.3
18     49        4.0
Mean   38.83     3.33
SD     13.60     0.84

Cov(x, y) = 10.68

$$r = \frac{\mathrm{cov}(x, y)}{SD_x \, SD_y} = \frac{10.68}{13.60 \times 0.84} = 0.94$$

$$z = \frac{1}{2}\ln\left(\frac{1 + 0.94}{1 - 0.94}\right) = 1.74$$

$$SE(z) = \frac{1}{\sqrt{n - 3}} = \frac{1}{\sqrt{15}} = 0.26$$

Test statistic = z / SE(z) = 1.74 / 0.26 = 6.7

Under the null hypothesis this ratio is approximately standard normal, so the critical value at alpha = 5% is 1.96. (For the t-statistic of the previous page, the critical t-value with n – 2 = 16 df is 2.12.)

Conclusion: There is a significant association between age and cholesterol.
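The hand calculation can be cross-checked in R. This is a sketch under our own naming (`z`, `se_z`, `ci_r`, `t_stat`); it uses the unrounded r, so the values differ slightly from figures computed with r rounded to 0.94. `cor.test()` is base R's test for Pearson correlation and reports the t-statistic with n - 2 degrees of freedom together with a confidence interval based on Fisher's z.

```r
# Age and cholesterol data from the example (18 individuals)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
          43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)
r <- cor(age, chol)

# Fisher's z-transformation and its standard error
z    <- 0.5 * log((1 + r) / (1 - r))   # identical to atanh(r)
se_z <- 1 / sqrt(n - 3)

# 95% CI on the z scale, then back-transformed to the r scale
ci_z <- z + c(-1, 1) * qnorm(0.975) * se_z
ci_r <- tanh(ci_z)

# t-statistic for Ho: r = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt((n - 2) / (1 - r^2))

# Built-in test for comparison
test <- cor.test(age, chol)
print(c(z = z, se_z = se_z, t = t_stat))
print(ci_r)
print(test)
```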

Page 14

Simple linear regression analysis

• Assessment:
– Quantify the relationship between two variables

• Prediction:
– Make predictions and validate a test

• Control:
– Adjust for confounding effects (in the case of multiple variables)

• Only two variables are of interest: one response variable and one predictor variable

• No adjustment is needed for confounders or covariates

Page 15

Relationship between age and cholesterol

Page 16

Linear regression: model

• Y : random variable representing a response

• X : random variable representing a predictor variable (predictor, risk factor)

– Both Y and X can be a categorical variable (e.g., yes / no) or a continuous variable (e.g., age).

– If Y is categorical, the model is a logistic regression model; if Y is continuous, a simple linear regression model.

• Model

$$Y = \alpha + \beta X + \varepsilon$$

α : intercept
β : slope / gradient
ε : random error (variation between subjects in y even if x is constant, e.g., variation in cholesterol for patients of the same age.)

Page 17

Linear regression: assumptions

• The relationship is linear in terms of the parameters;

• X is measured without error;

• The values of Y are independent of each other (e.g., Y1 is not correlated with Y2);

• The random error term (ε) is normally distributed with mean 0 and constant variance.

Page 18

Expected value and variance

• If the assumptions are tenable, then:

• The expected value of Y is: E(Y | x) = α + βx

• The variance of Y is: var(Y | x) = var(ε) = σ²

Page 19

Estimation of model parameters

Given two points A(x1, y1) and B(x2, y2) in a two-dimensional space, we can derive an equation connecting the points.

[Figure: straight line through A(x1, y1) and B(x2, y2), showing the rise dy, the run dx and the intercept a on the y-axis]

Gradient:

$$m = \frac{dy}{dx} = \frac{y_2 - y_1}{x_2 - x_1}$$

Equation: y = mx + a

What happens if we have more than 2 points?

Page 20

Estimation of α and β

• For a series of pairs: (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)

• Let a and b be sample estimates for parameters α and β,

• We have a sample equation: Y* = a + bx

• Aim: finding the values of a and b so that the overall difference between Y and Y* is minimal.

• Let SSE be the sum of squared differences:

$$SSE = \sum_{i=1}^{n}(Y_i - a - b x_i)^2$$

• Values of a and b that minimise SSE are called least squares estimates.

Page 21

Criteria of estimation

[Figure: scatter plot of Chol against Age with the fitted line; for each observed value yi, di is the vertical distance to the line]

$$\hat{y}_i = a + b x_i$$

$$d_i = y_i - \hat{y}_i$$

The goal of the least squares estimator (LSE) is to find a and b such that the sum of $d_i^2$ is minimal.

Page 22

Estimation of α and β

• After some calculus operations, the results can be shown to be:

$$b = \frac{S_{xy}}{S_{xx}}$$

$$a = \bar{y} - b\bar{x}$$

Where:

$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$$

$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

• When the regression assumptions are valid, the estimators of α and β have the following properties:

– Unbiased

– Uniformly minimal variance (i.e. efficient)
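The closed-form estimates can be computed directly and compared with `lm()`. A minimal sketch with the age and cholesterol data from the example; `Sxx`, `Sxy`, `a` and `b` follow the notation above, while `fit` is our own name.

```r
# Age and cholesterol data from the example (18 individuals)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
          43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Sums of squares and cross-products about the means
Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (chol - mean(chol)))

# Least squares estimates of slope and intercept
b <- Sxy / Sxx
a <- mean(chol) - b * mean(age)
print(c(intercept = a, slope = b))

# The same estimates from R's linear model function
fit <- lm(chol ~ age)
print(coef(fit))
```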

Page 23

Goodness-of-fit

• Now, we have the equation Y = a + bX + e

• Question: how well does the regression equation describe the actual data?

• Answer: coefficient of determination (R²): the proportion of variation in Y that is explained by the variation in X.

Page 24

Partitioning of variations: concept

• SST = sum of squared differences between yi and the mean of y.

• SSR = sum of squared differences between the predicted value of y and the mean of y.

• SSE = sum of squared differences between the observed and predicted value of y.

SST = SSR + SSE

The coefficient of determination is:

R² = SSR / SST

Page 25

Partitioning of variations: geometry

[Figure: plot of Chol (Y) against Age (X) marking the mean of Y and the SST, SSR and SSE components]

Page 26

Partitioning of variations: algebra

• Some statistics:

• Total variation:

$$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$$

• Attributed to the model:

$$SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$

• Residual sum of squares:

$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

• SST = SSR + SSE

• SSR = SST – SSE
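A short sketch of the partition for the age and cholesterol example, assuming the model is fitted with `lm()`; the names `fit`, `y_hat` and `R2` are ours.

```r
# Age and cholesterol data from the example (18 individuals)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
          43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

fit   <- lm(chol ~ age)
y_hat <- fitted(fit)                 # predicted values

SST <- sum((chol - mean(chol))^2)    # total variation
SSR <- sum((y_hat - mean(chol))^2)   # attributed to the model
SSE <- sum((chol - y_hat)^2)         # residual variation

# Coefficient of determination
R2 <- SSR / SST
print(c(SST = SST, SSR = SSR, SSE = SSE, R2 = R2))
```

SSR and SSE computed this way should agree with the "Sum Sq" column that `anova()` prints for the same model.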

Page 27

Analysis of variance

• SS increases in proportion to sample size (n)

• Mean squares (MS): normalise for degrees of freedom (df)

– MSR = SSR / p (where p = number of predictor variables, i.e. the regression degrees of freedom; p = 1 in simple linear regression)

– MSE = SSE / (n – p – 1)

– MST = SST / (n – 1)

• Analysis of variance (ANOVA) table:

Source       d.f.        Sum of squares (SS)   Mean squares (MS)   F-test
Regression   p           SSR                   MSR                 MSR/MSE
Residual     n – p – 1   SSE                   MSE
Total        n – 1       SST
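The ANOVA table can be assembled by hand for the age and cholesterol example. This is a sketch with our own variable names; `pf()` is the F distribution function in base R, and p = 1 because there is a single predictor.

```r
# Age and cholesterol data from the example (18 individuals)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
          43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(chol)
p <- 1                                # one predictor variable

fit <- lm(chol ~ age)
SSR <- sum((fitted(fit) - mean(chol))^2)
SSE <- sum(residuals(fit)^2)

# Mean squares and F statistic
MSR    <- SSR / p
MSE    <- SSE / (n - p - 1)
F_stat <- MSR / MSE

# Upper-tail probability of the F distribution with (p, n - p - 1) df
p_value <- pf(F_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
print(c(MSR = MSR, MSE = MSE, F = F_stat, p_value = p_value))

# Built-in ANOVA table for comparison
print(anova(fit))
```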

Page 28

Hypothesis tests in regression analysis

• Now, we have

Sample data: Y = a + bX + e
Population: Y = α + βX + ε

• Ho: β = 0. There is no linear association between the outcome and predictor variable.

• In layman's language: “what is the chance, if the null hypothesis of no association were true, of observing a sample of data at least as inconsistent with that hypothesis as the data we observed?”

Page 29

Inference about slope (parameter β)

• Recall that ε is assumed to be normally distributed with mean 0 and variance = σ².

• Estimate of σ² is MSE (or s²)

• It can be shown that

– The expected value of b is β, i.e. E(b) = β

– The standard error of b is:

$$SE(b) = \frac{s}{\sqrt{S_{xx}}}$$

• Then the test of whether β = 0 is: t = b / SE(b), which follows a t-distribution with n – 2 degrees of freedom.
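As a hedged check, the standard error and t-test for the slope can be reproduced step by step for the age and cholesterol data. The variable names are ours. The test uses n - 2 = 16 degrees of freedom, matching the residual degrees of freedom R reports for these 18 observations.

```r
# Age and cholesterol data from the example (18 individuals)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
          43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

fit <- lm(chol ~ age)
b   <- unname(coef(fit)["age"])

# s estimates sigma: square root of MSE with n - 2 degrees of freedom
s   <- sqrt(sum(residuals(fit)^2) / (n - 2))
Sxx <- sum((age - mean(age))^2)

# Standard error of the slope, t-statistic and two-sided p-value
se_b <- s / sqrt(Sxx)
t_b  <- b / se_b
p_b  <- 2 * pt(abs(t_b), df = n - 2, lower.tail = FALSE)
print(c(b = b, se_b = se_b, t = t_b, p = p_b))

# The "age" row of the coefficient table for comparison
print(summary(fit)$coefficients["age", ])
```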

Page 30

Confidence interval around predicted value

• Observed value is Yi.

• Predicted value is:

$$\hat{Y}_i = a + b x_i$$

• The standard error of the predicted value is:

$$SE(\hat{Y}_i) = s\sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}}$$

• Interval estimation for Yi values

$$\hat{Y}_i \pm SE(\hat{Y}_i)\, t_{(n-p-1,\; 1-\alpha/2)}$$
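A minimal sketch of the interval for an individual value, assuming a hypothetical new subject aged 50 (`x0` is our choice, not from the slides). The manual calculation follows the standard-error formula with the leading 1 under the square root, which corresponds to `interval = "prediction"` in R's `predict()`.

```r
# Age and cholesterol data from the example (18 individuals)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
          43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

fit <- lm(chol ~ age)
x0  <- 50                             # hypothetical age for illustration

# Ingredients of the standard error of the predicted value
s     <- sqrt(sum(residuals(fit)^2) / (n - 2))
Sxx   <- sum((age - mean(age))^2)
y_hat <- unname(coef(fit)[1] + coef(fit)[2] * x0)
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(age))^2 / Sxx)

# 95% interval using the t distribution with n - 2 degrees of freedom
t_crit <- qt(0.975, df = n - 2)
manual <- c(fit = y_hat,
            lwr = y_hat - t_crit * se_pred,
            upr = y_hat + t_crit * se_pred)
print(manual)

# Built-in equivalent
builtin <- predict(fit, newdata = data.frame(age = x0),
                   interval = "prediction", level = 0.95)
print(builtin)
```

Dropping the leading 1 under the square root gives the narrower interval for the mean response at `x0`, which `predict()` returns with `interval = "confidence"`.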

Page 31

Checking assumptions

• Assumption of constant variance

• Assumption of normality

• Correctness of functional form

• Model stability

• All can be conducted with graphical analysis. The residuals from the model or a function of the residuals play an important role in all of the model diagnostic procedures.

Page 32

Checking assumptions

• Assumption of constant variance
– Plot the studentized residuals versus their predicted values. Examine whether the variability between residuals remains relatively constant across the range of fitted values.

• Assumption of normality
– Plot the residuals versus their expected values under normality (Normal probability plot). If the residuals are normally distributed, they should fall along a 45° line.

• Correct functional form?
– Plot the residuals versus fitted values. Examine the residual plot for evidence of a non-linear trend in the value of the residual across the range of fitted values.

• Model stability
– Check whether one or more observations are influential. Use Cook’s distance.

Page 33

Checking assumptions (Cont)

• Cook’s distance (D) is a measure of the magnitude by which the fitted values of the regression model change if the ith observation is removed from the data set.

• Leverage is a measure of how extreme the value of xi is relative to the remaining values of x.

• The Studentized residual provides a measure of how far yi lies from its fitted value, scaled by the estimated standard error of the residual.
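These three diagnostics are available in base R. The sketch below applies them to the age and cholesterol model. The closed-form leverage and the Cook's distance identity are the standard textbook expressions for a model with k = 2 coefficients, not formulas taken from the slides. R distinguishes internally studentized residuals (`rstandard()`) from externally studentized ones (`rstudent()`); which one the slides intend is an assumption on our part.

```r
# Age and cholesterol data from the example (18 individuals)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
          43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)
fit <- lm(chol ~ age)

# Leverage: how extreme each x value is (simple regression closed form)
h_manual <- 1 / n + (age - mean(age))^2 / sum((age - mean(age))^2)
lev      <- hatvalues(fit)

# Studentized residuals (internal and external versions)
r_std  <- rstandard(fit)
r_stud <- rstudent(fit)

# Cook's distance, and the same quantity from leverage and residuals
cook        <- cooks.distance(fit)
k           <- length(coef(fit))      # number of coefficients (2)
cook_manual <- r_std^2 / k * lev / (1 - lev)

# Observations with the largest Cook's distance
diagnostics <- data.frame(id = 1:n, leverage = lev,
                          studentized = r_stud, cook = cook)
print(head(diagnostics[order(-diagnostics$cook), ], 3))
```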

Page 34

Remedial measures

• Non-constant variance
– Transforming the response variable (y) to a new scale (e.g. logarithm) is often helpful.
– If no transformation can resolve the non-constant variance problem, use a more robust estimator such as iteratively reweighted least squares.

• Non-normality
– Non-normality and non-constant variance often go hand-in-hand.

• Outliers
– Check for accuracy
– Use a robust estimator

Page 35

Regression analysis using R

id <- 1:18
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
         43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Fit linear regression model
reg <- lm(chol ~ age)
summary(reg)

Page 36

ANOVA result

> anova(reg)
Analysis of Variance Table

Response: chol
          Df  Sum Sq Mean Sq F value    Pr(>F)
age        1 10.4944 10.4944  114.57 1.058e-08 ***
Residuals 16  1.4656  0.0916
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Page 37

Results of R analysis

> summary(reg)

Call:
lm(formula = chol ~ age)

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775,     Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF,  p-value: 1.058e-08

Page 38

Diagnostics: influential data

par(mfrow=c(2,2))
plot(reg)

[Figure: four diagnostic plots from plot(reg): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours. Observations 6, 8 and 17 are labelled in the first three panels; observations 2, 6 and 8 in the fourth.]

Page 39

A non-linear illustration: BMI and sexual attractiveness

– Study on 44 university students

– Measure body mass index (BMI)

– Sexual attractiveness (SA) score

id <- 1:44
bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00, 14.00,
         14.00, 14.80, 15.00, 15.00, 15.50, 16.00, 16.50, 17.00,
         17.00, 18.00, 18.00, 19.00, 19.00, 20.00, 20.00, 20.00,
         20.50, 22.00, 23.00, 23.00, 24.00, 24.50, 25.00, 25.00,
         26.00, 26.00, 26.50, 28.00, 29.00, 31.00, 32.00, 33.00,
         34.00, 35.50, 36.00, 36.00)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2,
        3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9,
        5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7, 3.5, 4.0, 3.7,
        3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1, 2.1, 2.0, 1.8, 1.7)

Page 40

Linear regression analysis of BMI and SA

reg <- lm(sa ~ bmi)
summary(reg)

Residuals:
     Min       1Q   Median       3Q      Max
-2.54204 -0.97584  0.05082  1.16160  2.70856

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.92512    0.64489   7.637 1.81e-09 ***
bmi         -0.05967    0.02862  -2.084   0.0432 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.354 on 42 degrees of freedom
Multiple R-Squared: 0.09376,    Adjusted R-squared: 0.07218
F-statistic: 4.345 on 1 and 42 DF,  p-value: 0.04323

Page 41

BMI and SA: analysis of residuals

plot(reg)

[Figure: four diagnostic plots from plot(reg): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance. Observations 10, 20 and 21 are labelled in the first three panels; observations 10 and 13 in the fourth.]

Page 42

BMI and SA: a simple plot

par(mfrow=c(1,1))
reg <- lm(sa ~ bmi)
plot(sa ~ bmi, pch=16)
abline(reg)

[Figure: scatter plot of sa (y-axis) against bmi (x-axis) with the fitted regression line]

Page 43

Re-analysis of sexual attractiveness data

# Fit 3 regression models
linear <- lm(sa ~ bmi)
quad <- lm(sa ~ poly(bmi, 2))
cubic <- lm(sa ~ poly(bmi, 3))

# Make new BMI axis
bmi.new <- 10:40

# Get predicted values
quad.pred <- predict(quad, data.frame(bmi=bmi.new))
cubic.pred <- predict(cubic, data.frame(bmi=bmi.new))

# Plot predicted values (added to the scatter plot of sa against bmi drawn earlier)
abline(reg)
lines(bmi.new, quad.pred, col="blue", lwd=3)
lines(bmi.new, cubic.pred, col="red", lwd=3)
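The three fits can also be compared numerically rather than only by eye. The sketch below is one possible approach, not necessarily the author's: it collects R² for each model in a vector `r2` (our name) and uses `anova()` on the nested models to test whether each added polynomial term reduces the residual sum of squares.

```r
# BMI and sexual attractiveness (SA) scores for the 44 students
bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00, 14.00,
         14.00, 14.80, 15.00, 15.00, 15.50, 16.00, 16.50, 17.00,
         17.00, 18.00, 18.00, 19.00, 19.00, 20.00, 20.00, 20.00,
         20.50, 22.00, 23.00, 23.00, 24.00, 24.50, 25.00, 25.00,
         26.00, 26.00, 26.50, 28.00, 29.00, 31.00, 32.00, 33.00,
         34.00, 35.50, 36.00, 36.00)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2,
        3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9,
        5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7, 3.5, 4.0, 3.7,
        3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1, 2.1, 2.0, 1.8, 1.7)

# The three nested models from the slide
linear <- lm(sa ~ bmi)
quad   <- lm(sa ~ poly(bmi, 2))
cubic  <- lm(sa ~ poly(bmi, 3))

# Proportion of variation explained by each model
r2 <- c(linear = summary(linear)$r.squared,
        quad   = summary(quad)$r.squared,
        cubic  = summary(cubic)$r.squared)
print(round(r2, 3))

# Sequential F-tests: does each extra term improve the fit?
comparison <- anova(linear, quad, cubic)
print(comparison)
```

Because the models are nested, R² cannot decrease as terms are added; the F-tests indicate whether the increase is larger than chance alone would explain.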

Page 44

[Figure: scatter plot of sa (y-axis) against bmi (x-axis) with the fitted linear, quadratic (blue) and cubic (red) curves]

Page 45

Some comments: Interpretation of correlation

• Correlation lies between –1 and +1. A very small correlation does not mean that there is no association between the two variables. The relationship may be non-linear.

• For a curvilinear but monotonic relationship, a rank correlation is better than Pearson’s correlation.

• A small correlation (e.g. 0.1) may be statistically significant, but clinically unimportant.

• R² is another measure of strength of association. An r = 0.7 may sound impressive, but R² is 0.49!

• Correlation does not mean causation.
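Two of these points can be illustrated with made-up data (the vectors below are hypothetical and chosen only for the demonstration). For a curved but monotonic relationship, the rank correlation is perfect while Pearson's r is not; for a symmetric U-shaped relationship, both are essentially zero even though y is completely determined by x.

```r
# Hypothetical monotonic but non-linear relationship
x       <- 1:20
y_curve <- exp(x / 2)
pearson_curve  <- cor(x, y_curve)                       # below 1
spearman_curve <- cor(x, y_curve, method = "spearman")  # equal to 1

# Hypothetical U-shaped relationship: strong dependence, no linear trend
x_sym <- -10:10
y_u   <- x_sym^2
pearson_u  <- cor(x_sym, y_u)
spearman_u <- cor(x_sym, y_u, method = "spearman")

print(c(pearson_curve = pearson_curve, spearman_curve = spearman_curve))
print(c(pearson_u = pearson_u, spearman_u = spearman_u))

# Squaring r gives the proportion of variation explained
print(0.7^2)
```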

Page 46

Some comments: Interpretation of correlation

• Be careful with multiple correlations. For p variables, there are p(p – 1)/2 possible pairs of correlation, and false positives are a problem.

• A correlation cannot be inferred indirectly from other correlations.
– r(age, weight) = 0.05; r(weight, fat) = 0.03; it does not mean that r(age, fat) is near zero.
– In fact, r(age, fat) = 0.79.
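To see how quickly the multiple-correlation problem grows, here is a small arithmetic sketch. The choice of p = 10 variables, a 5% significance level and independence between tests are our assumptions for illustration; real correlation tests on the same variables are not independent, so the probability below is only a rough guide.

```r
# Hypothetical number of variables
p <- 10

# Number of distinct pairs of variables: p(p - 1)/2
n_pairs <- choose(p, 2)

# Chance of at least one false positive at alpha = 0.05 if all null
# hypotheses were true and the tests were independent (an assumption)
alpha <- 0.05
prob_any_false_positive <- 1 - (1 - alpha)^n_pairs

# Bonferroni-adjusted significance level per test
alpha_bonferroni <- alpha / n_pairs

print(c(pairs = n_pairs,
        prob_any_false_positive = prob_any_false_positive,
        alpha_bonferroni = alpha_bonferroni))
```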

Page 47

Some comments: Interpretation of regression

• The fitted line (regression) is only an estimate of the relation between these variables in the population.

• There is uncertainty associated with the estimated parameters.

• The regression line should not be used to make predictions for x values outside the range of values in the observed data.

• A statistical model is an approximation; the “true” relation may be nonlinear, but a linear model may be a reasonable approximation.

Page 48

Some comments: Reporting results

• Results should be reported in sufficient detail: nature of response variable, predictor variable; any transformation; checking assumptions, etc.

• Regression coefficients (a, b), their associated standard errors, and R² are useful summaries.

Page 49

Some final comments

• Equations are the cornerstone on which the edifice of science rests.

• Equations are like poems, or even an onion.

• So, be careful with your building of equations!