

Chapter 11: Linear Regression and Correlation

• Regression analysis is a statistical tool that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others.
• Some Examples:
  • Height and weight of people
  • Income and expenses of people
  • Production size and production time
  • Soil pH and the rate of growth of plants


Correlation
• An easy way to determine if two quantitative variables are linearly related is by looking at their scatterplot.
• Another way is to calculate the correlation coefficient, usually denoted by r.

Note: -1 ≤ r ≤ 1.


• The Linear Correlation measures the strength of the linear relationship between the explanatory variable (x) and the response variable (y). An estimate of this correlation parameter is provided by the Pearson sample correlation coefficient, r.

Example Scatterplots with Correlations


If X and Y are independent, then their correlation is 0.

Correlation

• Some Guidelines in Interpreting r.

Value of |r|           Strength of linear relationship
|r| ≥ 0.95             Very strong
0.85 ≤ |r| < 0.95      Strong
0.65 ≤ |r| < 0.85      Moderate to strong
0.45 ≤ |r| < 0.65      Moderate
0.25 ≤ |r| < 0.45      Weak
|r| < 0.25             Very weak / close to none

If the correlation between X and Y is 0, it doesn't mean they are independent. It only means that they are not linearly related.
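To see this concretely, here is a small R sketch (not from the slides): y is completely determined by x, yet the Pearson correlation is 0 because the relationship is not linear.

# Zero correlation does not imply independence
x = seq(-3, 3, by = 0.1)     # values symmetric about 0
y = x^2                      # y depends perfectly on x, but not linearly
cor(x, y)                    # 0 (up to rounding), since the relationship is quadratic
plot(x, y, pch = 19, main = "Dependent, yet r = 0")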

One complaint about correlation is that interpreting its value can be subjective. Some people are very happy with r ≈ 0.6, while others are not.

Note: Correlation does not necessarily imply causation!



Computing Correlation in R


data.health = read.csv("HealthExam.csv", header = T)
head(data.health)
  Gender Age Height Weight Waist Pulse SysBP DiasBP Cholesterol BodyMass  Leg Elbow Wrist  Arm
1      F  12   63.3  156.3  81.4    64   104     41          89     27.5 41.0   6.8   5.5 33.0
2      F  16   57.0  100.7  68.7    64   106     64           2     21.9 33.8   5.6   4.6 26.4
3      M  17   63.0  156.3  86.7    96   109     65          78     27.8 44.2   7.1   5.3 31.7

attach(data.health)
plot(Height, Weight, pch = 19, main = "Scatterplot")

cor(Height, Weight)     # 0.544563
cor(Waist, Weight)      # 0.9083268
plot(Waist, Weight, pch = 19, main = "Scatterplot")

Simple Linear Regression
• Model: Yi = (β0 + β1xi) + εi, where
  • Yi is the ith value of the response variable.
  • xi is the ith value of the explanatory variable.
  • εi's are uncorrelated with a mean of 0 and constant variance σ².

[Figure: scatterplot with the true line Y = β0 + β1x; at x1, the observed point deviates from the expected point on the line by the random error ε1.]

• How do we determine the underlying linear relationship?
• Well, since the points are following this linear trend, why don't we look for a line that "best" fits the points.
• But what do we mean by "best" fit? We need a criterion to help us determine which of two competing candidate lines is better.

[Figure: scatterplot with four candidate lines L1-L4.]


Method of Least Squares
• Model: Yi = (β0 + β1xi) + εi, where
  • Yi is the ith value of the response variable.
  • xi is the ith value of the explanatory variable.
  • εi's are uncorrelated with a mean of 0 and constant variance σ².

[Figure: scatterplot with a candidate fitted line (example: ŷ = 2 + 0.8x); the observed point P1(x1, y1) lies a vertical distance e1 from the line, with residuals e1 and e2 marked.]

• Residual = (Observed y-value) − (Predicted y-value), i.e., ei = yi − ŷi (for example, e1 = y1 − ŷ1).

• Method of Least Squares: Choose the line that minimizes the SSE (the sum of squared errors, SSE = Σ ei² = Σ (yi − ŷi)²) as the "best" line. This line is known as the Least-Squares Regression Line.

• Question: But there are infinitely many possible candidate lines; how can we find the one that minimizes the SSE?


• Answer: Since the SSE is a continuous function of two variables (β0 and β1), we can use methods from calculus to minimize the SSE.
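Carrying out that calculus (setting the two partial derivatives of the SSE with respect to β0 and β1 equal to zero) gives closed-form formulas for the estimates. Here is a minimal R sketch of those textbook formulas, not taken from the slides; it should reproduce coef(lm(y ~ x)).

# Closed-form least-squares estimates (solution of the normal equations)
ls.fit = function(x, y){
  Sxy = sum((x - mean(x)) * (y - mean(y)))
  Sxx = sum((x - mean(x))^2)
  b1  = Sxy/Sxx                   # slope estimate
  b0  = mean(y) - b1*mean(x)      # intercept estimate
  c(intercept = b0, slope = b1)
}
# e.g., ls.fit(Waist, Weight) should give -51.72790 and 2.39469, matching coef(result) below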

Obtaining  the  Regression  Line  in  R


data.health = read.csv("HealthExam.csv", header = T)
head(data.health)
  Gender Age Height Weight Waist Pulse SysBP DiasBP Cholesterol BodyMass  Leg Elbow Wrist  Arm
1      F  12   63.3  156.3  81.4    64   104     41          89     27.5 41.0   6.8   5.5 33.0
2      F  16   57.0  100.7  68.7    64   106     64           2     21.9 33.8   5.6   4.6 26.4
3      M  17   63.0  156.3  86.7    96   109     65          78     27.8 44.2   7.1   5.3 31.7

attach(data.health)
plot(Waist, Weight, pch = 19, main = "Scatterplot")

result = lm(Weight ~ Waist)
coef(result)
(Intercept)       Waist
  -51.72790     2.39469

abline(a = -51.7279, b = 2.39469, lwd = 2, col = "blue")

So, for the first person, her predicted weight is 143.2 pounds.
Predicted.1 = -51.728 + 2.395*81.4     # 143.225 pounds

Since her actual weight is 156.3 pounds, her residual is
Residual.1 = 156.3 - 143.2             # 13.1 pounds

As waist increases by 1 cm, weight goes up by about 2.4 pounds.


What else do we get from the 'lm' function?

data.health = read.csv("HealthExam.csv", header = T)
attach(data.health)
result = lm(Weight ~ Waist)
attributes(result)
$names
"coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr"
"df.residual" "xlevels" "call" "terms" "model"

result$fit[1]    # 143.1999
result$res[1]    # 13.10011

summary(result)
lm(formula = Weight ~ Waist)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.7279    11.1288  -4.648 1.34e-05 ***
Waist         2.3947     0.1249  19.180  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.68 on 78 degrees of freedom
Multiple R-squared: 0.8251,  Adjusted R-squared: 0.8228
F-statistic: 367.9 on 1 and 78 DF,  p-value: < 2.2e-16

Coefficient of Determination (R2) : This index measures the amount of variability in the dependent variable (y) that can be explained by the regression line. Hence, about 82.51% of the variability of weight can be explained by the regression line involving the waist size.

Testing H0: β1 = 0 vs. H1: β1 ≠ 0. Since the p-value is extremely small (< 0.05), we can reject the null hypothesis and conclude that waist size has a significant effect on weight.

Model Assumptions

Since the underlying (true) regression line is unknown to us, we can't calculate the values of the error terms (εi). The best that we can do is study the residuals (ei).

• Model: Yi = (β0 + β1xi) + εi, where
  • εi's are uncorrelated with a mean of 0 and constant variance σε².
  • εi's are normally distributed. (This is needed in the test for the slope.)

[Figure: the true line Y = β0 + β1x with an observed point, its expected point, and its predicted point at x1; ε1 is the error and e1 the residual.]


Estimating the Variance of the Error Terms


• The unbiased estimator for σε² is MSE = SSE/(n − 2).

sse = sum(result$residuals^2)    # 16811.16
mse = sse/(80-2)                 # 215.5277
sigma.hat = sqrt(mse)            # 14.68086

anova(result)
Response: Weight
          Df Sum Sq Mean Sq F value    Pr(>F)
Waist      1  79284   79284  367.86 < 2.2e-16 ***
Residuals 78  16811     216
Total     79  96095

summary(result)
lm(formula = Weight ~ Waist)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.7279    11.1288  -4.648 1.34e-05 ***
Waist         2.3947     0.1249  19.180  < 2e-16 ***

Residual standard error: 14.68 on 78 degrees of freedom
Multiple R-squared: 0.8251,  Adjusted R-squared: 0.8228
F-statistic: 367.9 on 1 and 78 DF,  p-value: < 2.2e-16

[Figure: decomposition of the deviation of an observed point Pi(xi, yi) from ȳ into the parts explained and unexplained by the fitted line Y = β0 + β1x.]

SSTO = SSE + SSR and R² = SSR/SSTO. Since the p-value is less than 0.05, we conclude that the regression model accounts for a significant amount of the variability in weight.
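As a quick check (not on the slides), these pieces can be read directly from the anova table of the fitted model 'result':

# Verify SSTO = SSE + SSR and R^2 = SSR/SSTO for the Waist/Weight fit
ss   = anova(result)$"Sum Sq"
SSR  = ss[1]           # 79284, variability explained by Waist
SSE  = ss[2]           # 16811, residual variability
SSTO = SSR + SSE       # 96095, same as sum((Weight - mean(Weight))^2)
SSR/SSTO               # 0.8251, matches Multiple R-squared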

Things that affect the slope estimate


• Watch the regression podcast by Dr. Will posted on our course webpage.
• Three things that affect the slope estimate:

1. Sample size (n).
2. Variability of the error terms (σε²).
3. Spread of the independent variable.

summary(result)
lm(formula = Weight ~ Waist)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.7279    11.1288  -4.648 1.34e-05 ***
Waist         2.3947     0.1249  19.180  < 2e-16 ***

SS = function(x, y){ sum((x - mean(x)) * (y - mean(y))) }
SSxy = SS(Waist, Weight)     # 33108.35
SSxx = SS(Waist, Waist)      # 13825.73
SSyy = SS(Weight, Weight)    # 96095.4 = SSTO
Beta1.hat = SSxy/SSxx        # 2.39469

MSE = anova(result)$Mean[2]    # 215.5277
SE.beta1 = sqrt(MSE/SSxx)      # 0.1248554

t.obs = (Beta1.hat - 0)/SE.beta1        # 19.17971
p.value = 2*(1 - pt(19.18, df=78))      # virtually 0

Testing H0: β1 = 0 vs. H1: β1 ≠ 0.

As n increases, the standard error of the slope estimate decreases.

The smaller σε is, the smaller the standard error of the slope estimate.
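The same message can be seen by simulation. The sketch below is illustrative only (simulated data, values loosely modeled on the waist/weight example, not from the slides): the empirical standard error of the slope shrinks as n grows, as σε shrinks, or as the spread of x increases.

# Simulation sketch: how n, sigma, and the spread of x affect SE(Beta1.hat)
sim.se = function(n, sigma, x.spread, reps = 2000){
  slopes = replicate(reps, {
    x = runif(n, 70, 70 + x.spread)            # hypothetical waist-like values
    y = -50 + 2.4*x + rnorm(n, sd = sigma)     # true line roughly like our example
    coef(lm(y ~ x))[2]
  })
  sd(slopes)                                   # empirical SE of the slope estimate
}
sim.se(n = 20, sigma = 15, x.spread = 20)      # baseline
sim.se(n = 80, sigma = 15, x.spread = 20)      # larger n -> smaller SE
sim.se(n = 20, sigma = 5,  x.spread = 20)      # smaller sigma -> smaller SE
sim.se(n = 20, sigma = 15, x.spread = 60)      # more spread in x -> smaller SE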


Effect of Outliers on the Slope Estimate


Three types of outliers:
1. Outlier in the x direction – this type of outlier is said to be a high leverage point.
2. Outlier in the y direction.
3. Outlier in both the x and y directions – this point is said to be a high influence point.

[Figures: the effect of a high-influence point on the fitted line; the effect of a point with an outlying y value.]
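A small simulated illustration (hypothetical data, not from the slides) of how much each kind of outlier moves the fitted slope:

# Sketch: effect of different outliers on the slope estimate
set.seed(1)
x = runif(30, 60, 100)                     # waist-like values
y = -50 + 2.4*x + rnorm(30, sd = 10)
coef(lm(y ~ x))[2]                         # slope from the clean data, near 2.4

# High-influence point: outlying in both x and y
coef(lm(c(y, 0) ~ c(x, 160)))[2]           # slope drops sharply; the single point drags the line toward it

# Outlier in y only, at a typical x value
coef(lm(c(y, 400) ~ c(x, mean(x))))[2]     # slope changes little; the point mainly shifts the line up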

Confidence Intervals
• The (1−α)100% C.I. for β1 is β̂1 ± t(α/2, n−2)·SE(β̂1). Hence, the 90% C.I. for β1 for our example is

Lower = Beta1.hat - qt(0.95, df=78)*SE.beta1    # 2.186853
Upper = Beta1.hat + qt(0.95, df=78)*SE.beta1    # 2.602528

confint(result, level=.90)
                    5 %        95 %
(Intercept) -70.253184  -33.202619
Waist         2.186853    2.602528

• Estimating the mean response (µy) at a specified value of x:
predict(result, newdata=data.frame(Waist=c(80,90)))
       1        2
139.8473 163.7942

• Confidence interval for the mean response (µy) at a specified value of x:
predict(result, newdata=data.frame(Waist=c(80,90)), interval="confidence")
       fit      lwr      upr
1 139.8473 136.0014 143.6932
2 163.7942 160.4946 167.0938

Prediction Intervals
• Predicting the value of the response variable at a specified value of x:
predict(result, newdata=data.frame(Waist=c(80,90)))
       1        2
139.8473 163.7942

• Prediction interval for a new response value (y_{n+1}) at a specified value of x:
predict(result, newdata=data.frame(Waist=c(80,90)), interval="prediction")
       fit      lwr      upr
1 139.8473 110.3680 169.3266
2 163.7942 134.3812 193.2072

predict(result, newdata=data.frame(Waist=c(80,90)), interval="prediction", level=.99)
       fit      lwr      upr
1 139.8473 100.7507 178.9439
2 163.7942 124.7855 202.8029

Note that the only difference between the prediction interval and confidence interval for the mean response is the addition of 1 inside the square root. This makes the prediction intervals wider than the confidence intervals for the mean response.
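A sketch of that difference by hand (not from the slides), assuming result, Waist, and the MSE from the anova table are available, at x = 80:

# Why the prediction interval is wider: the extra MSE (the "+1") under the square root
x0   = 80
n    = length(Waist)
mse  = anova(result)$Mean[2]
SSxx = sum((Waist - mean(Waist))^2)
fit0 = coef(result)[1] + coef(result)[2]*x0

se.mean = sqrt(mse*(1/n + (x0 - mean(Waist))^2/SSxx))       # SE for the mean response
se.pred = sqrt(mse*(1 + 1/n + (x0 - mean(Waist))^2/SSxx))   # extra "1" -> wider interval

fit0 + c(-1, 1)*qt(0.975, df = n - 2)*se.mean    # should match interval = "confidence"
fit0 + c(-1, 1)*qt(0.975, df = n - 2)*se.pred    # should match interval = "prediction"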

Confidence and Prediction Bands
• Working-Hotelling (1−α)100% confidence band:

result = lm(Weight ~ Waist)
CI = predict(result, se.fit=TRUE)    # se.fit = SE(mean)
W = sqrt(2*qf(0.95, 2, 78))          # 2.495513
band.lower = CI$fit - W*CI$se.fit
band.upper = CI$fit + W*CI$se.fit

plot(Waist, Weight, xlab="Waist", ylab="Weight", main="Confidence Band")
abline(result)
points(sort(Waist), sort(band.lower), type="l", lwd=2, lty=2, col="Blue")
points(sort(Waist), sort(band.upper), type="l", lwd=2, lty=2, col="Blue")

• The (1−α)100% Prediction Band:
mse = anova(result)$Mean[2]
se.pred = sqrt(CI$se.fit^2 + mse)
band.lower.pred = CI$fit - W*se.pred
band.upper.pred = CI$fit + W*se.pred

points(sort(Waist), sort(band.lower.pred), type="l", lwd=2, lty=2, col="Red")
points(sort(Waist), sort(band.upper.pred), type="l", lwd=2, lty=2, col="Red")



Tests for Correlations


• Testing H0: ρ = 0 vs. H1: ρ ≠ 0.
cor(Waist, Weight)      # Computes the Pearson correlation coefficient, r
0.9083268
cor.test(Waist, Weight, conf.level=.99)    # Tests Ho: rho=0 and also constructs a C.I. for rho

        Pearson's product-moment correlation
data: Waist and Weight
t = 19.1797, df = 78, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
99 percent confidence interval:
 0.8409277 0.9479759

• Testing H0: ρ = 0 vs. H1: ρ ≠ 0 using the (nonparametric) Spearman's method.
cor.test(Waist, Weight, method="spearman")    # Test of independence using the Spearman rank correlation

        Spearman's rank correlation rho
data: Waist and Weight
S = 8532, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.9

Note that the results of the Pearson test (t = 19.18, df = 78, p-value < 2.2e-16) are exactly the same as what we got when testing H0: β1 = 0 vs. H1: β1 ≠ 0.
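This is no coincidence: in simple linear regression the t statistic for the slope can be written directly in terms of r. A short check (not from the slides), using the values reported above:

# The slope t statistic equals r*sqrt(n-2)/sqrt(1-r^2)
r = cor(Waist, Weight)               # 0.9083268
n = length(Waist)                    # 80
r*sqrt(n - 2)/sqrt(1 - r^2)          # 19.18, same as the t value for Waist in summary(result)
r^2                                  # 0.8251, same as Multiple R-squared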

Model Diagnostics


• Model: Yi = (β0 + β1xi) + εi, where
  • εi's are uncorrelated with a mean of 0 and constant variance σε².
  • εi's are normally distributed. (This is needed in the test for the slope.)

• Assessing uncorrelatedness of the error terms:
plot(result$residuals, type='b')

• Assessing normality:
qqnorm(result$residuals); qqline(result$residuals)
shapiro.test(result$residuals)
W = 0.9884, p-value = 0.6937

• Assessing constant variance:
plot(result$fitted, result$residuals)
levene.test(result$residuals, Waist)    # levene.test() here is assumed to come from the 'lawstat' package
Test Statistic = 2.1156, p-value = 0.06764