CHAPTER 7  Linear Correlation & Regression Methods
Jan 01, 2016

• 7.1 - Motivation
• 7.2 - Correlation / Simple Linear Regression
• 7.3 - Extensions of Simple Linear Regression
Testing for association between two POPULATION variables X and Y, with parameter estimation via SAMPLE DATA...

• Categorical variables: cross-tabulate the categories of X against the categories of Y and apply the Chi-squared Test.
  Examples: X = Disease status (D+, D–), Y = Exposure status (E+, E–);
  X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High)
• Numerical variables: ???
PARAMETERS

Means: $\mu_X = E[X]$,  $\mu_Y = E[Y]$
Variances: $\sigma_X^2 = E[(X - \mu_X)^2]$,  $\sigma_Y^2 = E[(Y - \mu_Y)^2]$
Covariance: $\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)]$
Parameter Estimation via SAMPLE DATA...

STATISTICS (computed from $n$ paired observations $x_1, x_2, x_3, \dots, x_n$ and $y_1, y_2, y_3, \dots, y_n$)

Means: $\bar{x} = \dfrac{\sum x}{n}$,  $\bar{y} = \dfrac{\sum y}{n}$
Variances: $s_x^2 = \dfrac{\sum (x - \bar{x})^2}{n-1}$,  $s_y^2 = \dfrac{\sum (y - \bar{y})^2}{n-1}$
Covariance: $s_{xy} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{n-1}$  (can be +, –, or 0)
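These are exactly the quantities that R's built-in var and cov compute; a minimal sketch of the variance and covariance formulas by hand, assuming x and y are any paired numeric vectors of equal length:

> n = length(x)
> sum((x - mean(x))^2) / (n - 1)                 # sample variance; matches var(x)
> sum((x - mean(x)) * (y - mean(y))) / (n - 1)   # sample covariance; matches cov(x, y)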
Scatterplot of the response Y versus the predictor X ($n$ data points). [Figure source: JAMA. 2003;290:1486-1493]

Does this suggest a linear trend between X and Y? If so, how do we measure it?
Testing for association between two population variables X and Y...

For numerical variables, define the population Linear Correlation Coefficient:

$\rho = \dfrac{\sigma_{XY}}{\sqrt{\sigma_X^2\,\sigma_Y^2}} = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}$

Always between –1 and +1; it measures specifically LINEAR association.
Parameter Estimation via SAMPLE DATA...

The sample estimate of $\rho$ is

$r = \dfrac{s_{xy}}{\sqrt{s_x^2\,s_y^2}} = \dfrac{s_{xy}}{s_x s_y}$

Always between –1 and +1; r measures the strength of linear association in the sample.
Example in R (reformatted for brevity):

> pop = seq(0, 20, 0.1)
> x = sort(sample(pop, 10))     # n = 10
> x
 [1]  1.1  1.8  2.1  3.7  4.0  7.3  9.1 11.9 12.4 17.1
> y = sample(pop, 10)
> y
 [1] 13.1 18.3 17.6 19.1 19.3  3.2  5.6 13.6  8.0  3.0
> plot(x, y, pch = 19)
> c(mean(x), mean(y))
[1]  7.05 12.08
> var(x)
[1] 29.48944
> var(y)
[1] 43.76178
> cov(x, y)
[1] -25.86667
> cor(x, y)
[1] -0.7200451
r measures the strength of linear association:

    –1 (negative linear correlation)  ←—  0 (no linear correlation)  —→  +1 (positive linear correlation)

For our sample, r = cor(x, y) = –0.7200451, a negative linear correlation.
Testing for linear association between two numerical population variables X and Y...

Now that we have r, we can conduct HYPOTHESIS TESTING on the Linear Correlation Coefficient $\rho = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}$:

$H_0$: $\rho = 0$  "No linear association between X and Y."
$H_A$: $\rho \neq 0$  "Linear association between X and Y."

Test Statistic for p-value:

$T = r\sqrt{\dfrac{n-2}{1-r^2}} \sim t_{n-2}$

Here, $t = -0.72\sqrt{\dfrac{10-2}{1-(-0.72)^2}} = -2.935$ on 8 df, so

p-value = 2 * pt(-2.935, 8) = .0189 < .05, and we reject $H_0$.
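The same test is available directly in R; a minimal sketch, assuming the x and y vectors from the example above (R's built-in cor.test performs exactly this t test on n – 2 df):

> cor.test(x, y)                           # Pearson correlation t test on n - 2 df
> r = cor(x, y)
> t.stat = r * sqrt((10 - 2) / (1 - r^2))  # T = r sqrt((n-2)/(1-r^2)) = -2.935
> 2 * pt(-abs(t.stat), df = 8)             # two-sided p-value = 0.0189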
Parameter Estimation via SAMPLE DATA...

If such an association between X and Y exists, then it follows that for some intercept $\beta_0$ and slope $\beta_1$, we have

$Y = \beta_0 + \beta_1 X + \varepsilon$        "Response = Model + Error"

Find estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for the "best" line

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$

...best in what sense??? For each data point $(x_i, y_i)$ with fitted point $(x_i, \hat{y}_i)$, define the Residuals $e_i = y_i - \hat{y}_i$.
SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

"Best" in the sense that $(\hat{\beta}_0, \hat{\beta}_1)$ minimizes $SS_{Err} = \sum e_i^2$. The solution is the "Least Squares Regression Line":

$\hat{\beta}_1 = \dfrac{s_{xy}}{s_x^2} = \dfrac{-25.86667}{29.48944} = -0.87715$

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 12.08 - (-0.87715)(7.05) = 18.26391$

$\hat{Y} = 18.26391 - 0.87715\,X$

Check: $(\bar{x}, \bar{y})$ is on the line.
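A quick numeric check of these closed-form estimates in R (a sketch, assuming the x and y from the example above):

> b1 = cov(x, y) / var(x)        # slope: -25.86667 / 29.48944 = -0.87715
> b0 = mean(y) - b1 * mean(x)    # intercept: 12.08 - (-0.87715)(7.05) = 18.26391
> coef(lm(y ~ x))                # lm agrees: (Intercept) 18.2639, x -0.8772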
Applying $\hat{Y} = 18.26391 - 0.87715\,X$ to the sample:

X (predictor):            1.1   1.8   2.1   3.7   4.0   7.3   9.1  11.9  12.4  17.1
Y (observed response):   13.1  18.3  17.6  19.1  19.3   3.2   5.6  13.6   8.0   3.0
Ŷ (fitted response):     ~ E X E R C I S E ~
Y – Ŷ (residuals):       ~ E X E R C I S E ~

$SS_{Err} = \sum e_i^2 = 189.6555$  (one way to compute this is sketched below)
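A sketch of the exercise in R, assuming x and y as above:

> yhat = 18.26391 - 0.87715 * x   # fitted responses
> e = y - yhat                    # residuals
> sum(e^2)                        # SS_Err = 189.6555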
Testing for linear association between two numerical population variables X and Y...

Now that we have these, we can conduct HYPOTHESIS TESTING on the Linear Regression Coefficients $\beta_0$ and $\beta_1$:

$H_0$: $\beta_1 = 0$  "No linear association between X and Y."
$H_A$: $\beta_1 \neq 0$  "Linear association between X and Y."

Test Statistic for p-value, with $SS_{Err} = \sum (y - \hat{y})^2$ and $MS_{Err} = \dfrac{SS_{Err}}{n-2}$:

$T = \hat{\beta}_1 \sqrt{\dfrac{(n-1)\,s_x^2}{MS_{Err}}} \sim t_{n-2}$

Here, $t = (-0.87715 - 0)\sqrt{\dfrac{(9)(29.48944)}{189.6555/8}} = -2.935$ on 8 df.

Same t-score as $H_0$: $\rho = 0$!  p-value = .0189
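The t-score is easy to reproduce numerically in R (a sketch, assuming x and the quantities above):

> b1 = -0.87715
> MS.Err = 189.6555 / 8                    # SS_Err / (n - 2)
> b1 * sqrt((10 - 1) * var(x) / MS.Err)    # t = -2.935
> 2 * pt(-2.935, df = 8)                   # p-value = 0.0189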
> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)    # or lsfit(x, y)
> abline(lsreg)
> summary(lsreg)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6607 -3.2154  0.8954  3.4649  5.7742 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.2639     2.6097   6.999 0.000113 ***
x            -0.8772     0.2989  -2.935 0.018857 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.869 on 8 degrees of freedom
Multiple R-squared: 0.5185, Adjusted R-squared: 0.4583 
F-statistic: 8.614 on 1 and 8 DF,  p-value: 0.01886
BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM??? Because this second method generalizes...

The t-test of $H_0$: $\beta_1 = 0$ vs. $H_A$: $\beta_1 \neq 0$ for the model $Y = \beta_0 + \beta_1 X + \varepsilon$ can be recast as an ANOVA Table, exactly as for comparing treatment groups, with the "Treatment" row renamed "Regression." With $n = 10$ data points, $df_{Reg} = 1$, $df_{Err} = n - 2 = 8$, and $df_{Tot} = n - 1 = 9$:

Source       df    SS    MS    F-ratio    p-value
Regression    1     ?     ?       ?          ?
Error         8     ?     ?
Total         9     ?     –
Filling in the SS column: $SS_{Tot}$ is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting):

$SS_{Tot} = \sum (y - \bar{y})^2 = (n-1)\,s_y^2$
$SS_{Reg}$ is a measure of the total amount of variability in the fitted responses (i.e., after model-fitting):

$SS_{Reg} = \sum (\hat{y} - \bar{y})^2$

$SS_{Err}$ is a measure of the total amount of variability in the resulting residuals (i.e., after model-fitting):

$SS_{Err} = \sum (y - \hat{y})^2$
For our sample:

$SS_{Tot} = \sum (y - \bar{y})^2 = (n-1)\,s_y^2 = 9\,(43.76178) = 393.856$
$SS_{Reg} = \sum (\hat{y} - \bar{y})^2 = 204.200$
$SS_{Err} = \sum (y - \hat{y})^2 = 189.656$  (the minimum, by least squares)

and indeed $SS_{Tot} = SS_{Reg} + SS_{Err}$.
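A quick numeric check of this decomposition in R (a sketch, assuming x, y, and the fitted yhat from the exercise above):

> SS.Tot = sum((y - mean(y))^2)      # 393.856, = 9 * var(y)
> SS.Reg = sum((yhat - mean(y))^2)   # 204.200
> SS.Err = sum((y - yhat)^2)         # 189.656
> SS.Tot - (SS.Reg + SS.Err)         # ~ 0, up to rounding in yhat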
Filling in the table, with $MS = SS/df$ and F-ratio $= MS_{Reg}/MS_{Err} \sim F_{k-1,\,n-k}$ (so $0 < p < 1$):

ANOVA Table for $H_0$: $\beta_1 = 0$ vs. $H_A$: $\beta_1 \neq 0$, $Y = \beta_0 + \beta_1 X + \varepsilon$

Source       df    SS        MS        F-ratio    p-value
Regression    1    204.200   204.200   8.61349    0.018857
Error         8    189.656    23.707
Total         9    393.856      –

Same as before! (With 1 numerator df, $F = t^2$, so the two tests must agree.)

> summary(aov(lsreg))
            Df Sum Sq Mean Sq F value  Pr(>F)  
x            1 204.20 204.201  8.6135 0.01886 *
Residuals    8 189.66  23.707
Moreover:

Coefficient of Determination

$r^2 = \dfrac{SS_{Reg}}{SS_{Tot}} = \dfrac{204.2}{393.856} = 0.5185 = (-0.72)^2$

The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.
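This identity can be confirmed directly in R (assuming the objects from above):

> cor(x, y)^2                # (-0.7200451)^2 = 0.5185
> summary(lsreg)$r.squared   # the "Multiple R-squared" reported by lm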
Summary of Linear Correlation and Simple Linear Regression

Given: sample data $x_1, x_2, \dots, x_n$ and $y_1, y_2, \dots, y_n$ (scatterplot of Y vs. X; JAMA. 2003;290:1486-1493), with
Means: $\bar{x}$, $\bar{y}$   Variances: $s_x^2$, $s_y^2$   Covariance: $s_{xy}$

• Linear Correlation Coefficient: $r = \dfrac{s_{xy}}{s_x s_y}$, with $-1 \le r \le +1$; measures the strength of linear association.

• Least Squares Regression Line: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$, where $\hat{\beta}_1 = s_{xy}/s_x^2$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$; it minimizes $SS_{Err} = \sum (y - \hat{y})^2 = SS_{Tot} - SS_{Reg}$ (ANOVA).

• Coefficient of Determination: $r^2 = \dfrac{SS_{Reg}}{SS_{Tot}}$ = proportion of total variability modeled by the regression line's variability.

• All point estimates can be upgraded to CIs for hypothesis testing, etc., e.g., upper and lower 95% confidence bands around $\hat{y}$ (see notes for "95% prediction intervals"); a sketch follows.
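A minimal R sketch of such interval estimates via the built-in predict (assuming lsreg from above; the new x values are illustrative):

> new = data.frame(x = c(5, 10, 15))
> predict(lsreg, new, interval = "confidence")   # 95% confidence band for the mean response
> predict(lsreg, new, interval = "prediction")   # wider 95% prediction intervals for a new y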
Multilinear Regression

Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, ... etc.

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_{k-1} X_{k-1} + \varepsilon$    "Response = Model + Error"

The coefficients $\beta_1, \dots, \beta_{k-1}$ are the "main effects." For now, assume the "additive model," i.e., main effects only. The fitted model is

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_{k-1} X_{k-1}$

and for each true response $y_i$ at predictor values $(x_{1i}, x_{2i}, \dots)$, the fitted response is $\hat{y}_i$ and the residual is $e_i = y_i - \hat{y}_i$ (geometrically, the fit is now a plane over the predictor axes rather than a line).

$H_0$: $\beta_1 = \beta_2 = \beta_3 = \dots = \beta_{k-1} = 0$  "No linear association between Y and any of its predictors $X_1, X_2, X_3, \dots, X_{k-1}$."
$H_A$: $\beta_i \neq 0$ for some $i = 1, 2, \dots, k-1$  "Linear association between Y and at least one of its predictors."

Least Squares calculation of the regression coefficients is computer-intensive; the formulas require Linear Algebra (matrices)! Once calculated, how do we then test the null hypothesis? ANOVA.
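Concretely, least squares solves the "normal equations" $(X^T X)\hat{\beta} = X^T y$; a minimal matrix sketch in R, where x1, x2, and y are assumed to be numeric vectors of equal length (illustrative names, not from the example above):

> X = cbind(1, x1, x2)                       # design matrix: intercept column + predictors
> beta.hat = solve(t(X) %*% X, t(X) %*% y)   # solves the normal equations
> coef(lm(y ~ x1 + x2))                      # lm should return the same coefficients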
R code example (main effects only): lsreg = lm(y ~ x1 + x2 + x3)

The model can also include quadratic terms, cubes, etc. ("polynomial regression"):

$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_{k-1} X_{k-1} + \beta_{1,1} X_1^2 + \beta_{2,2} X_2^2 + \dots + \beta_{k-1,k-1} X_{k-1}^2 + \text{cubes} + \dots + \varepsilon$

R code example (note that inside an R formula, powers must be wrapped in I()): lsreg = lm(y ~ x + I(x^2) + I(x^3))
...and "interactions" between predictors:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{k-1} X_{k-1}$
$\quad + \beta_{1,1} X_1^2 + \beta_{2,2} X_2^2 + \dots + \beta_{k-1,k-1} X_{k-1}^2 + \text{cubes} + \dots$  ("polynomial regression")
$\quad + \beta_{1,2} X_1 X_2 + \beta_{1,3} X_1 X_3 + \dots + \beta_{1,k-1} X_1 X_{k-1}$
$\quad + \beta_{2,3} X_2 X_3 + \beta_{2,4} X_2 X_4 + \dots + \beta_{2,k-1} X_2 X_{k-1} + \dots + \varepsilon$  ("interactions")

R code example: lsreg = lm(y ~ x1 + x2 + x1:x2), or equivalently, lsreg = lm(y ~ x1*x2)
Recall the earlier scatterplot... Suppose these are actually two subgroups, requiring two distinct linear regressions! Multiple Linear Regression with interaction, using an indicator ("dummy") variable:

Example in R (reformatted for brevity):

> I = c(1,1,1,1,1,0,0,0,0,0)
> lsreg = lm(y ~ x*I)
> summary(lsreg)

Coefficients:
             Estimate
(Intercept)  6.56463
x            0.00998
I            6.80422
x:I          1.60858

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2 I + \hat{\beta}_3 X I = 6.56 + 0.01\,X + 6.80\,I + 1.61\,X I$

I = 0:  $\hat{Y} = 6.56 + 0.01\,X$
I = 1:  $\hat{Y} = 13.36 + 1.62\,X$
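The two subgroup lines can be recovered from the coefficient vector; a small sketch, assuming lsreg from above:

> b = coef(lsreg)               # (Intercept), x, I, x:I
> c(b[1], b[2])                 # I = 0 line: intercept  6.56, slope 0.01
> c(b[1] + b[3], b[2] + b[4])   # I = 1 line: intercept 13.36, slope 1.62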
ANOVA Table (revisited)

From a sample of $n$ data points, fit $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_{k-1} X_{k-1}$ for the model $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_{k-1} X_{k-1} + \varepsilon$.

$H_0$: $\beta_1 = \beta_2 = \beta_3 = \dots = \beta_{k-1} = 0$  "No linear association between Y and any of its predictors $X_1, X_2, X_3, \dots, X_{k-1}$."  Note that if $H_0$ is true, then it would follow that $\mu_Y = \beta_0$ and $\hat{y} = \hat{\beta}_0$.
$H_A$: $\beta_i \neq 0$ for some $i = 1, 2, \dots, k-1$  "Linear association between Y and at least one of its predictors."

Source       df       SS                                        MS                           F                                                p-value
Regression   k – 1    $\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$   $\frac{SS}{df} = MS_{Reg}$   $\frac{MS_{Reg}}{MS_{Err}} \sim F_{k-1,\,n-k}$   $0 < p < 1$
Error        n – k    $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$       $\frac{SS}{df} = MS_{Err}$
Total        n – 1    $\sum_{i=1}^{n}(y_i - \bar{y})^2$           –

But how are these regression coefficients calculated in general? Via the "normal equations," solved via computer (intensive); see the matrix sketch above.
*** How are only the statistically significant variables determined? ***

"MODEL SELECTION" (Backward Elimination)

Step 0. Conduct an overall F-test of significance (via ANOVA) of the full model
    $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \dots$
If significant, then...

Step 1. t-tests on the individual coefficients: $H_0$: $\beta_1 = 0$, $H_0$: $\beta_2 = 0$, $H_0$: $\beta_3 = 0$, $H_0$: $\beta_4 = 0$, ...
    p-values: $p_1 < .05$ (Reject $H_0$), $p_2 < .05$ (Reject $H_0$), $p_3 \ge .05$ (Accept $H_0$), $p_4 < .05$ (Reject $H_0$), ...

Step 2. Are all coefficients significant at level $\alpha$? If not, delete that term (here $X_3$) and recompute new coefficients for the remaining terms!

Step 3. Repeat Steps 1-2 as necessary until all coefficients are significant → reduced model (an automated analogue is sketched below).
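R can automate this kind of pruning; a sketch using the built-in step function (note: step drops terms by AIC rather than by individual t-test p-values, so it is an analogue of Backward Elimination, not an exact match; x1 through x4 are illustrative):

> full = lm(y ~ x1 + x2 + x3 + x4)               # Step 0: fit the full model
> summary(full)                                  # overall F-test and individual t-tests
> reduced = step(full, direction = "backward")   # iteratively deletes the weakest terms
> summary(reduced)                               # the reduced model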
Recall ~ Analysis of Variance (ANOVA): $k \ge 2$ independent, equivariant, normally-distributed "treatment groups" with means $\mu_1, \mu_2, \dots, \mu_k$, testing $H_0$: $\mu_1 = \mu_2 = \dots = \mu_k$.

MODEL ASSUMPTIONS? "Regression Diagnostics"

• Re-plot data on a "log" scale (of Y only).
• Re-plot data on a "log-log" scale.
Logistic Regression: Binary outcome, e.g., "Have you ever had surgery?" (Yes / No). Let $\pi = P(\text{Yes})$.

$\ln\!\left(\dfrac{\hat{\pi}}{1-\hat{\pi}}\right) = \hat{\beta}_0 + \hat{\beta}_1 X$,  equivalently  $\hat{\pi} = \dfrac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 X)}}$

The coefficients are found via "MAXIMUM LIKELIHOOD ESTIMATION," not least squares. The "log-odds" ("logit") is an example of a general "link function" $g(\pi)$. (Note: Not being based on LS implies a "pseudo-$R^2$," etc.)
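In R, such a model is fit by maximum likelihood with the built-in glm; a minimal sketch, where surgery is assumed to be a 0/1 outcome vector and x a numeric predictor (illustrative names):

> fit = glm(surgery ~ x, family = binomial)   # logit is the default binomial link
> coef(fit)                                   # beta0-hat, beta1-hat on the log-odds scale
> head(predict(fit, type = "response"))       # fitted probabilities pi-hat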
More generally, with multiple predictors:

$\ln\!\left(\dfrac{\hat{\pi}}{1-\hat{\pi}}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k$,  i.e.,  $\hat{\pi} = \dfrac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k)}}$

Suppose one of the predictor variables is binary, e.g.,

$X_1 = \begin{cases} 1, & \text{Age} \ge 50 \\ 0, & \text{Age} < 50 \end{cases}$

Evaluating the log-odds at the two levels of $X_1$ (all other predictors held fixed):

$X_1 = 1$:  $\ln\!\left(\dfrac{\hat{\pi}_1}{1-\hat{\pi}_1}\right) = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k$

$X_1 = 0$:  $\ln\!\left(\dfrac{\hat{\pi}_0}{1-\hat{\pi}_0}\right) = \hat{\beta}_0 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k$

SUBTRACT:

$\ln\!\left(\dfrac{\hat{\pi}_1}{1-\hat{\pi}_1}\right) - \ln\!\left(\dfrac{\hat{\pi}_0}{1-\hat{\pi}_0}\right) = \ln\!\left(\dfrac{\hat{\pi}_1/(1-\hat{\pi}_1)}{\hat{\pi}_0/(1-\hat{\pi}_0)}\right) = \hat{\beta}_1$

$= \ln\!\left(\dfrac{\text{odds of surgery given Age} \ge 50}{\text{odds of surgery given Age} < 50}\right) = \ln \widehat{OR}$

so the ODDS RATIO is $\widehat{OR} = e^{\hat{\beta}_1}$ ..... which implies ..... $\ln \widehat{OR} = \hat{\beta}_1$.
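Continuing the hypothetical glm sketch above, with a 0/1 age indicator x1 among the predictors:

> fit = glm(surgery ~ x1 + x2, family = binomial)
> exp(coef(fit)["x1"])   # estimated odds ratio, OR-hat = e^(beta1-hat)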
Recall logistic growth in population dynamics:

Unrestricted population growth (e.g., bacteria). Population size $y$ obeys the law $\dfrac{dy}{dt} = a\,y$ with constant $a > 0$. Separating variables:

$\int \dfrac{1}{y}\,dy = \int a\,dt \;\Rightarrow\; \ln|y| = at + b \;\Rightarrow\; y = e^{b}e^{at} = C\,e^{at}$

With initial condition $y(0) = y_0$: $y = y_0\,e^{at}$, i.e., exponential growth.

Restricted population growth (disease, predation, starvation, etc.). Population size $y$ obeys the law $\dfrac{dy}{dt} = a\,y\left(1 - \dfrac{y}{M}\right)$, with constant $a > 0$ and "carrying capacity" $M$. Let the survival probability be $\pi = y/M$, so that $\dfrac{d\pi}{dt} = a\,\pi(1-\pi)$. Then

$\int \dfrac{1}{\pi(1-\pi)}\,d\pi = \int a\,dt \;\Rightarrow\; \ln|\pi| - \ln|1-\pi| = at + b \;\Rightarrow\; \ln\!\dfrac{\pi}{1-\pi} = at + b$

Logistic growth: with $\pi(0) = \pi_0$,

$\pi = \dfrac{\pi_0}{\pi_0 + (1-\pi_0)\,e^{-at}}$
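A small R sketch of the two growth curves (the values of a, y0, and pi0 are illustrative):

> a = 0.5; y0 = 1; pi0 = 0.1
> curve(y0 * exp(a * t), xname = "t", from = 0, to = 10,
+       ylab = "y(t)")                                     # unrestricted: exponential growth
> curve(pi0 / (pi0 + (1 - pi0) * exp(-a * t)), xname = "t",
+       from = 0, to = 20, ylab = "pi(t)")                 # restricted: S-shaped logistic curve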