13. Linear Regression and Correlation

Jan 12, 2022

dariahiddleston
Outline
• Data: two continuous measurements on each subject
• Goal: study the relationship between the two variables
• PART I: correlation analysis
  – Study the relationship between two continuous variables.
  – Steps:
    • Scatter diagram
    • Correlation coefficient: calculation, meaning, hypothesis testing
• PART II: linear regression
  – Construct a linear equation between two variables.
    • Model building
    • Model estimating: confidence intervals and prediction intervals
    • Model fitting: strength of the linear association, coefficient of determination
– X : explanatory variable (categorical).
  • Ex. gender (binary), brand (3-level)
– Y : response variable (continuous or binary).
  • Ex. score, success-failure, yield
– Q : are X and Y correlated?
  • A : if Y is continuous, compare the population means of Y in the groups divided by X.
  • Ex : Z-test, t-test, ANOVA F-test
• Recall : in Ch. 11 and 12,
  – Q : are X and Y correlated?
    • A : if X and Y are binary, compare the population proportions of Y in the two groups divided by X.
  – When the sample sizes are large, a Z-test is used.
  – Q : how do we determine the correlation if X and Y are both continuous?
    • A : correlation and regression analysis!
• Data :
  – A sample of n sets of observations.
  – There are k continuous variables measured in each observation.
  – Example. n = 10 students were surveyed and k = 3 scores were recorded for each.
  – Question : is there any association between the scores?

Student   X1   X2   X3
1         82   67   56
2         89   99   70
3         45   31   42
4         74   66   67
5         75   86   99
6         69   39   75
7         70   86   67
8         47   61   86
9         92   88   75
10        92   79   54
• What is correlation analysis?
  – Study the relationship between several continuous variables.
  – Measure the strength of the association between variables.
• Correlation analysis consists of :
  – Step 1. Scatter diagram : plot (X1, X2)
  – Step 2. Coefficient of correlation
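• Population coefficient of correlation, ρ :
  – A measure of the strength of the linear relationship between two variables.
  – Definition (population correlation coefficient) : $\rho = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$
  – Estimation (sample correlation coefficient) : $r = \dfrac{S_{xy}}{S_x S_y}$, where
    $S_{xy} = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{n-1}$, $S_x = \sqrt{\dfrac{\sum (x-\bar{x})^2}{n-1}}$, $S_y = \sqrt{\dfrac{\sum (y-\bar{y})^2}{n-1}}$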
• Properties :
  – $-1 \le r \le 1$
  – "Positive linear association" : r > 0
  – "Negative linear association" : r < 0
  – "No linear relation" : r ≈ 0 (! other relations may exist)
  – "Strongly positive linear association" : r ≈ 1
  – "Strongly negative linear association" : r ≈ −1
• Why such a definition?
  – If there is a strongly positive linear association, y is large when x is large, so $S_{xy}$ takes a large positive value.
  – If there is a strongly negative linear association, y is small when x is large, so $S_{xy}$ takes a large negative value.
  – If there is no relation, some y are large and some are small when x is large, so $S_{xy} \approx 0$ and $r \approx 0$.
EXCEL : sample correlation coefficients for the three pairs of scores
1. r1 = 0.033
2. r2 = 0.7151
3. r3 = 0.3755
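The three coefficients can be reproduced from the score table with the definition r = S_xy / (S_x S_y). The following is a minimal sketch using only the Python standard library; the list names `x1`, `x2`, `x3` and the helper `sample_corr` are illustrative choices, not part of the slides.

```python
import math

# Scores of the n = 10 students, one list per column of the data table.
x1 = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
x2 = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]
x3 = [56, 70, 42, 67, 99, 75, 67, 86, 75, 54]

def sample_corr(x, y):
    """Sample correlation r = S_xy / (S_x * S_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return sxy / (sx * sy)

r12 = sample_corr(x1, x2)  # first vs. second score
r13 = sample_corr(x1, x3)  # first vs. third score
r23 = sample_corr(x2, x3)  # second vs. third score
print(r12, r13, r23)
```

The values agree with the EXCEL results up to rounding in the last digit.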
• Population : N = ∞ subjects, with ρ unknown.
• Is "H0 : ρ = 0" true? Unknown!
• Inference from the sample : a t-test!
• Testing the null hypothesis of no correlation : ρ = 0
• Step 1. State the hypotheses
  – H0 : no correlation vs. H1 : correlated
  – H0 : ρ = 0 vs. H1 : ρ ≠ 0
• Step 2. Select the significance level α
• Step 3. Compute the test statistic
  – $t = \dfrac{r}{\sqrt{(1-r^2)/(n-2)}} = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
  – Note that under the null hypothesis, t follows a t-distribution with d.f. = n−2.
• Step 4. Formulate the decision rule
  – A two-sided test;
  – A t-test;
  – With significance level α, H0 should be rejected if $t > t_{\alpha/2,\,n-2}$ or $t < -t_{\alpha/2,\,n-2}$.
• Step 5. Collect data, compute the t-value, draw the conclusion
• Example. At α = 0.05, n = 10, df = 10−2 = 8, t(0.025, 8) = 2.306
• Test 1 (first vs. third score) :
  – H0 : ρ1 = 0 vs. H1 : ρ1 ≠ 0
  – Since r1 = 0.033 and n = 10, $t = \dfrac{0.033\sqrt{10-2}}{\sqrt{1-(0.033)^2}} = 0.093$
  – Since −2.306 < t = 0.093 < 2.306, H0 is not rejected.
  – Conclusion : there is not sufficient evidence to reject the null hypothesis of no correlation.
• Example. At α = 0.05, n = 10, df = 10−2 = 8, t(0.025, 8) = 2.306
• Test 2 (first vs. second score) :
  – H0 : ρ2 = 0 vs. H1 : ρ2 ≠ 0
  – Since r2 = 0.7151 and n = 10, $t = \dfrac{0.7151\sqrt{10-2}}{\sqrt{1-(0.7151)^2}} = 2.89$
  – Since t = 2.89 > 2.306, H0 is rejected.
  – Conclusion : there is sufficient evidence to reject the null hypothesis of no correlation.
• Example. At α = 0.05, n = 10, df = 10−2 = 8, t(0.025, 8) = 2.306
• Test 3 (second vs. third score) :
  – H0 : ρ3 = 0 vs. H1 : ρ3 ≠ 0
  – Since r3 = 0.3755 and n = 10, $t = \dfrac{0.3755\sqrt{10-2}}{\sqrt{1-(0.3755)^2}} = 1.15$
  – Since −2.306 < t = 1.15 < 2.306, H0 is not rejected.
  – Conclusion : there is not sufficient evidence to reject the null hypothesis of no correlation.
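The three tests follow the same recipe, so they can be checked in a few lines. This is a minimal sketch: the critical value 2.306 is copied from the slides rather than computed, and the function names `corr_t_stat` and `reject_h0` are hypothetical.

```python
import math

T_CRIT = 2.306  # t(0.025, 8), the tabulated value quoted in the examples

def corr_t_stat(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2), d.f. = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def reject_h0(r, n, t_crit=T_CRIT):
    """Two-sided rule: reject H0 (rho = 0) when |t| exceeds the critical value."""
    return abs(corr_t_stat(r, n)) > t_crit

for label, r in [("Test 1", 0.033), ("Test 2", 0.7151), ("Test 3", 0.3755)]:
    t = corr_t_stat(r, 10)
    decision = "reject H0" if reject_h0(r, 10) else "do not reject H0"
    print(label, round(t, 2), decision)
```

Only Test 2 exceeds the critical value, matching the conclusions above.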
! A word of caution : when H0 of no correlation has been rejected,
! only a linear relationship between the variables is ascertained. Quadratic? Cubic? These are not tested.
! No "cause and effect" is established.
! Spurious correlations are possible.
• Variables :
  – X = independent variable(s), explanatory variable, predictor.
  – Y = dependent variable, response.
    • To be predicted or estimated.
• Regression analysis :
  – Develop an equation/function that allows us to estimate/predict Y based on X.
  – Example. Given X = 60, estimate or predict Y.
• Recall : in a one-way ANOVA
  – AGE vs. INCOME
  – The whole population is classified into three sub-populations by "AGE" :
    • a young population,
    • a middle-aged population,
    • a senior population.
  – The INCOMEs of all sub-populations are
    • normally distributed with the same variance.
  – Research question : "Are the mean INCOMEs, μ_income, the same?"
ANOVA vs. Regression
• Recall : in a simple linear regression model,
  – X vs. Y
  – The whole population is classified into many sub-populations by X :
    • X=0 population; X=1 population; ...; X=100 population.
  – The Y values of all sub-populations are
    • normally distributed with the same variance.
  – Research questions : "Are the means μ_Y the same?" and "What is the relationship between μ_Y and X?"
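Regression Model: (P449)
1. For each value of X, there is a group of Y values.
  – Ex. the Y values of all subjects with X = 60; the Y values of all subjects with X = 50.
2. These Y values follow a normal distribution.
  – At X = 60, $Y \sim N(\mu_{Y|X=60}, \sigma^2)$
  – At X = 50, $Y \sim N(\mu_{Y|X=50}, \sigma^2)$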
3. The means of these normal distributions are a linear function of x.
  – Example. $\mu_{Y|x} = -30 + 1.5x$
(Figure: plot with axis ticks from 0 to 100.)
4. The standard deviations of these normal distributions are all the same (independent of x).

In summary : $Y \sim N(\alpha + \beta x, \sigma^2)$
• In practice, $\alpha$, $\beta$ and $\sigma^2$ are unknown, and only sample data are collected.
• How can the regression equation $\mu_{Y|x} = \alpha + \beta x$ be estimated from sample data?
(Figure: the observations plotted around the unknown regression line.)
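• Regression equation : $\mu_{Y|x} = \alpha + \beta x$, with $\alpha = ?$, $\beta = ?$, $\sigma^2 = ?$
• Let a, b be estimates of $\alpha$, $\beta$.
• Predicted equation : Y' = a + bx. It can be read as
  1. a predicted value of Y for an individual.
    – Ex. an individual with X = 60
  2. an estimate of the mean of Y, $\mu_{Y|x}$.
    – Ex. all subjects with X = 60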
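• In the predicted equation, the intercept a = ? The slope b = ?
• Least squares estimates (LSE) a, b :
  – Principle : find the regression equation that minimizes the sum of squared vertical distances between the observations and the line, $\sum_{i=1}^{n} (y_i - a - b x_i)^2$.
• Estimated regression coefficients :
  $b = \dfrac{S_{xy}}{S_x^2}$, $a = \bar{y} - b\bar{x}$, where
  $S_{xy} = \sum (x-\bar{x})(y-\bar{y})/(n-1) = \{\sum xy - n\bar{x}\bar{y}\}/(n-1)$,
  $S_x^2 = \sum (x-\bar{x})^2/(n-1) = \{\sum x^2 - n\bar{x}^2\}/(n-1)$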
Meaning of the estimated intercept, a
• a = Y' at X = 0.
  – The estimated value of $\mu_{Y|X=0}$, the mean of Y when X = 0.
  – The predicted value of Y when X = 0.
• a is an estimate of the true intercept α.
• One may be interested in testing H0 : α = 0.
• When α = 0, the regression line passes through the origin : $\mu_{Y|x} = \beta x$, so $\mu_{Y|x=0} = 0$.
Meaning of the estimated slope, b
• b = the change in $\mu_{Y|x}$ per unit change of x.
  – When x changes by one unit, b is the estimated increment/decrement in $\mu_{Y|x}$.
  – Example. If b = 0.2, a one-unit increase in X raises the estimated mean of Y by 0.2.
• b is an estimate of the true slope β.
• One is more interested in testing H0 : β = 0.
• When β = 0, the equation is a constant and does not depend on X : $\mu_{Y|x} = \alpha$, $Y \sim N(\alpha, \sigma^2)$.
  – The distribution of Y is unrelated to X.
  – X and Y are independent!
Student   X      Y      XY
1         82     67     5494
2         89     99     8811
3         45     31     1395
4         74     66     4884
5         75     86     6450
6         69     39     2691
7         70     86     6020
8         47     61     2867
9         92     88     8096
10        92     79     7268
Mean      73.5   70.2
Sum       735    702    53976
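$S_{xy} = \{\sum xy - n\bar{x}\bar{y}\}/(n-1) = \{53976 - 10(73.5)(70.2)\}/(10-1) = 264.333$
$S_x^2 = \{\sum x^2 - n\bar{x}^2\}/(n-1) = \{56569 - 10(73.5)^2\}/(10-1) = 282.94$
$b = S_{xy}/S_x^2 = 264.333/282.94 = 0.9342$
$a = \bar{y} - b\bar{x} = 70.2 - 0.9342 \times 73.5 = 1.54$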
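The hand calculation can be verified with a short script. It is a minimal sketch of the least squares formulas b = S_xy / S_x² and a = ȳ − b x̄ applied to the ten (X, Y) pairs; the function name `least_squares` is a hypothetical helper.

```python
# The ten (X, Y) observations from the table.
x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]

def least_squares(x, y):
    """Return (a, b, S_xy, S_x^2) using the shortcut sums of the slides."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = (sum(xi * yi for xi, yi in zip(x, y)) - n * mx * my) / (n - 1)
    sx2 = (sum(xi * xi for xi in x) - n * mx * mx) / (n - 1)
    b = sxy / sx2        # slope
    a = my - b * mx      # intercept
    return a, b, sxy, sx2

a, b, sxy, sx2 = least_squares(x, y)
print(round(sxy, 3), round(sx2, 2), round(b, 4), round(a, 4))
```

Because no intermediate value is rounded, the intercept comes out close to the EXCEL figure rather than to the hand-rounded one.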

EXCEL : output

Model fitting (ANOVA)
             df   SS        MS        F        Significance F
Regression   1    2222.52   2222.52   8.3747   0.0201
Residual     8    2123.08   265.39

Model estimating (coefficients)
            Coefficient   Std. error   t        P-value   Lower 95%   Upper 95%
Intercept   1.5346        24.2804      0.0632   0.9512    -54.4562    57.5253
X           0.9342        0.3228       2.8939   0.0201    0.1898      1.6787
Note : the difference from the previous calculation is due to rounding error.
• Coefficient : a, b, the estimates of α, β
• Std. error : SE(a), SE(b)
• t : t-value(a) = a/SE(a), t-value(b) = b/SE(b)
• p-value(a) = 0.9512 > 0.05 : do not reject α = 0
• p-value(b) = 0.0201 < 0.05 : reject β = 0, so β ≠ 0
• Lower 95%, Upper 95% : the 95% confidence intervals for α, β
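The coefficient rows of the EXCEL output can be reproduced by hand. The slides do not show the standard error formulas, so this sketch assumes the usual textbook ones, SE(b) = s / sqrt(Σ(x − x̄)²) and SE(a) = s · sqrt(1/n + x̄² / Σ(x − x̄)²), and takes the critical value 2.306 from the slides instead of computing it.

```python
import math

x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]
T_CRIT = 2.306  # t(0.025, 8) as quoted in the slides

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)                      # (n - 1) * S_x^2
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))   # (n - 1) * S_xy
b = sxy / sxx
a = my - b * mx

# Standard error of estimate from the residuals, d.f. = n - 2.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# Assumed textbook formulas for the standard errors of the coefficients.
se_b = s / math.sqrt(sxx)
se_a = s * math.sqrt(1 / n + mx ** 2 / sxx)

t_a, t_b = a / se_a, b / se_b
ci_a = (a - T_CRIT * se_a, a + T_CRIT * se_a)
ci_b = (b - T_CRIT * se_b, b + T_CRIT * se_b)
print(se_a, se_b, t_a, t_b, ci_a, ci_b)
```

The interval for β excludes 0 while the interval for α contains 0, which is the same decision as the p-value comparison above.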
$\mu_{Y|x} = \alpha + \beta x \approx a + bx = Y'$
The standard error of estimate :
• Variance $\sigma^2$ :
  – Dispersion of Y around the regression line
  – The variation of the random "error", Error = $\varepsilon = Y - \mu_{Y|x}$ : unobtainable
• Standard error of estimate :
  – Use "residuals" to estimate "error", Residual = $e = Y - Y'$ : observable
  – The standard error of estimate is defined by
    $S_{y \cdot x} = \sqrt{\dfrac{\sum (Y-Y')^2}{n-2}} = \sqrt{\dfrac{n-1}{n-2}\left(S_y^2 - b^2 S_x^2\right)}$
    where $S_y$ : sample s.d. of Y, $S_x$ : sample s.d. of X
• $S_{y \cdot x}^2$ :
  – The random variation that is unexplained by the regression line.
Example. Y' = 1.53 + 0.93X

Student   X    Y    Y'      Y−Y'     (Y−Y')²
1         82   67   78.14   -11.14   124.12
2         89   99   84.68   14.32    205.05
3         45   31   43.57   -12.57   158.12
4         74   66   70.67   -4.67    21.78
5         75   86   71.60   14.40    207.32
6         69   39   66.00   -27.00   728.78
7         70   86   66.93   19.07    363.66
8         47   61   45.44   15.56    242.02
9         92   88   87.48   0.52     0.27
10        92   79   87.48   -8.48    71.96
Sum                                  2123.08
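Note : $S_{y \cdot x} = \sqrt{\dfrac{\sum (Y-Y')^2}{n-2}} = \sqrt{\dfrac{2123.08}{8}} = 16.29$
Equivalently, $S_{y \cdot x} = \sqrt{\dfrac{n-1}{n-2}\left(S_y^2 - b^2 S_x^2\right)} = \sqrt{\dfrac{9}{8}\left(482.84 - (0.93)^2 \times 282.94\right)} \approx 16.3$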
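Both routes to the standard error of estimate can be compared numerically. This is a minimal sketch under the same data; the variable names `s_resid` and `s_short` are illustrative.

```python
import math

x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sx2 = sum((xi - mx) ** 2 for xi in x) / (n - 1)   # sample variance of X
sy2 = sum((yi - my) ** 2 for yi in y) / (n - 1)   # sample variance of Y
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
b = sxy / sx2
a = my - b * mx

# Route 1: from the residuals Y - Y'.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e * e for e in residuals)
s_resid = math.sqrt(sse / (n - 2))

# Route 2: shortcut using the sample variances and the slope.
s_short = math.sqrt((n - 1) / (n - 2) * (sy2 - b * b * sx2))

print(round(sse, 2), round(s_resid, 2), round(s_short, 2))
```

The two routes agree exactly when b is not rounded; the hand calculation with b = 0.93 differs only in the second decimal.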

ESTIMATION & PREDICTION: confidence intervals and prediction intervals
• ESTIMATION :
  – Q : at X = x, what is the mean value of Y, $\mu_{Y|x} = ?$ (e.g. $\mu_{Y|x=3} = ?$)
  – Point estimation, confidence interval
• PREDICTION :
  – Q : if an individual is drawn from the population with X = x, Y = ?
  – Point prediction, prediction interval
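Confidence interval of $\mu_{Y|x}$ at X = x
• Confidence interval : at X = x, the mean value of Y, $\mu_{Y|x}$
  – Point estimation : Y' = a + bx
  – 100(1−α)% confidence interval : $Y' \pm t_{\alpha/2,\,n-2}\, S_{y \cdot x} \sqrt{\dfrac{1}{n} + \dfrac{(x-\bar{x})^2}{(n-1)S_x^2}}$
  – Example. Y' = 1.53 + 0.93X; the 95% confidence interval uses $t_{0.025,\,8} = 2.306$ and $S_{y \cdot x} = 16.29$.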
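As an illustration of the confidence interval formula, the sketch below evaluates it at X = 60. That value is an assumption borrowed from the prediction interval example, and the function name `mean_ci` is hypothetical. The script uses unrounded coefficients, so its point estimate differs slightly from the 57.33 obtained with Y' = 1.53 + 0.93X.

```python
import math

x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]
T_CRIT = 2.306  # t(0.025, 8) as quoted in the slides

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)                      # (n - 1) * S_x^2
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

def mean_ci(x0, t_crit=T_CRIT):
    """Confidence interval for the mean of Y at X = x0: (estimate, lower, upper)."""
    y_hat = a + b * x0
    half = t_crit * s * math.sqrt(1 / n + (x0 - mx) ** 2 / sxx)
    return y_hat, y_hat - half, y_hat + half

y_hat, lower, upper = mean_ci(60)   # X = 60 is an assumed illustration value
print(round(y_hat, 2), round(lower, 2), round(upper, 2))
```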
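Prediction interval of Y at X = x
• Prediction interval : if an individual is drawn from the population with X = x, Y = ?
  – Prediction : Y' = a + bx
  – 100(1−α)% prediction interval : $Y' \pm t_{\alpha/2,\,n-2}\, S_{y \cdot x} \sqrt{1 + \dfrac{1}{n} + \dfrac{(x-\bar{x})^2}{(n-1)S_x^2}}$
  – Example. Y' = 1.53 + 0.93X, at X = 60 :
    $Y' = 57.33$, $t = 2.306$, $S_{y \cdot x} = 16.29$, $n = 10$, $(x-\bar{x})^2 = (60-73.5)^2 = 182.25$, $S_x^2 = 282.94$
    $57.33 \pm 2.306 \times 16.29 \times \sqrt{1 + \dfrac{1}{10} + \dfrac{182.25}{9(282.94)}} = 57.33 \pm 40.66$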
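The half-width of 40.66 in the example can be checked with a few lines. This is a minimal sketch with a hypothetical function name `prediction_interval`; the critical value is the tabulated 2.306 from the slides, and the point prediction uses unrounded coefficients, so its centre is slightly above 57.33.

```python
import math

x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]
T_CRIT = 2.306  # t(0.025, 8) as quoted in the slides

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)                      # (n - 1) * S_x^2
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

def prediction_interval(x0, t_crit=T_CRIT):
    """Prediction interval for one new Y at X = x0: (prediction, half-width)."""
    y_pred = a + b * x0
    half = t_crit * s * math.sqrt(1 + 1 / n + (x0 - mx) ** 2 / sxx)
    return y_pred, half

y_pred, half = prediction_interval(60)
print(round(y_pred, 2), "+/-", round(half, 2))
```

The extra "1 +" under the square root is what makes the prediction interval wider than the confidence interval for the mean at the same X.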
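RECALL : ANOVA table
  – SS total = total variation of Y = $\sum (Y-\bar{Y})^2$
    • D.f. = n−1 for n observations.
    • MS total = SS total/(n−1)
  – SST = variation due to treatment = $\sum_j n_j (\bar{Y}_j - \bar{Y})^2$
    • $\bar{Y}_j$ = estimated mean of Y in the treatment-j group ($n_j$ observations)
    • D.f. = k−1 for k treatments
    • MST = SST/(k−1)
  – SSE = variation due to random error = $\sum_j \sum_i (Y_{ij} - \bar{Y}_j)^2$
    • D.f. = n−k
    • MSE = SSE/(n−k)
  – SS total = SST + SSE

Source      SS    Degrees of Freedom   Mean Square       F
Treatment   SST   k−1                  SST/(k−1) = MST   MST/MSE
Error       SSE   n−k                  SSE/(n−k) = MSE

H0 : all treatment means are equal vs. H1 : not all treatment means are equal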
• SStotal = total variation of Y :
  $\sum (Y-\bar{Y})^2 = \sum (Y'-\bar{Y})^2 + \sum (Y-Y')^2$, i.e. SStotal = SSR + SSE
• SSR = the variation explained by the regression model
• SSE = the unexplained variation
• SStotal = $\sum (Y-\bar{Y})^2 = (n-1)S_Y^2$
  – D.f. = n−1 for n observations.
  – MS total = SS total/(n−1)
• SSR = $\sum (Y'-\bar{Y})^2 = (n-1)\,b^2 S_X^2$, due to the regression model
  – Y' = estimated mean of Y at some X-level
  – D.f. = 2−1 = 1 for 2 regression coefficients
  – MSR = SSR/1 = SSR
• SSE = $\sum (Y-Y')^2 = (n-2)\,S_{y \cdot x}^2$, due to random error
  – D.f. = n−2
  – MSE = SSE/(n−2) = $S_{y \cdot x}^2$
Source       SS    Degrees of Freedom   Mean Square       F
Regression   SSR   2−1                  SSR/1 = MSR       MSR/MSE
Error        SSE   n−2                  SSE/(n−2) = MSE

H0 : β = 0 vs. H1 : β ≠ 0
Under H0, the regression line is horizontal.
2 89 99
3 45 31
4 75 67
5 76 86
6 69 40
7 71 87
8 48 61
9 93 89
10 93 80
sum 740 706
mean 74.0 70.6
sd 16.8 22.0
variance 283.2 482.7
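• SStotal = $\sum (Y-\bar{Y})^2 = (n-1)S_Y^2 = 9 \times 482.84 \approx 4345.6$
• SSR = $\sum (Y'-\bar{Y})^2 = (n-1)\,b^2 S_X^2 = 9 \times 0.9342^2 \times 282.9 \approx 2222.5$

             df   SS         MS         F          p-value
Regression   1    2222.518   2222.518   8.374682   0.020079
Error        8    2123.082   265.3853

Further, since the p-value of the F-test is 0.02 < 0.05, H0 : β = 0 is rejected, so the linear relationship exists.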
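The sums of squares and the F statistic in the table can be rebuilt directly from the data. The following is a minimal sketch; the p-value is not computed here because that would need an F-distribution routine outside the standard library.

```python
x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
fitted = [a + b * xi for xi in x]                          # Y' for each subject

ss_total = sum((yi - my) ** 2 for yi in y)                 # total variation of Y
ssr = sum((fi - my) ** 2 for fi in fitted)                 # explained by the model
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))     # unexplained

msr = ssr / 1            # d.f. = 2 - 1
mse = sse / (n - 2)      # d.f. = n - 2
f_stat = msr / mse
print(round(ss_total, 1), round(ssr, 3), round(sse, 3), round(f_stat, 4))
```

In simple linear regression the F statistic equals the square of the slope's t-value, so the F-test and the t-test of β = 0 give the same p-value.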
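The Coefficient of Determination
• Coefficient of determination, $r^2$ :
  – The proportion of the total variation of Y that is explained by the variation of X.
  – $r^2 = \dfrac{SSR}{SS_{total}} = \dfrac{\sum (Y'-\bar{Y})^2}{\sum (Y-\bar{Y})^2}$ (variation explained by the model / total variation)
  – Equivalently, $r^2 = 1 - \dfrac{SSE}{SS_{total}}$ ((total − unexplained) / total)
  – Example. From the ANOVA table, SSR = 2222.518 and SSE = 2123.082, so $r^2 = 2222.518/4345.6 \approx 0.511$.
  – EXCEL : CORREL returns r; its square is $r^2$.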
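The two expressions for the coefficient of determination, and its link to the squared correlation coefficient, can be confirmed numerically. This is a minimal sketch with illustrative variable names.

```python
import math

x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)                      # SStotal
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
b = sxy / sxx
a = my - b * mx

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ssr = syy - sse

r2_explained = ssr / syy          # SSR / SStotal
r2_unexplained = 1 - sse / syy    # 1 - SSE / SStotal
r = sxy / math.sqrt(sxx * syy)    # sample correlation coefficient

print(round(r2_explained, 4), round(r2_unexplained, 4), round(r * r, 4))
```

About half of the variation in Y is explained by X in this example.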

Exercise.
• Linear regression analysis : – 45, 46, 53, 57 – EXCEL: 47, 49
• (X)(Y) 1. (correlation analysis) – Scatter plot, correlation matrix
– XY(α=0.05)
3. ANOVA