PowerPoint Presentation
Outline
• Data: two continuous measurements on each subject
• Goal: study the relationship between the two variables
• PART I: correlation analysis
  – Study the relationship between two continuous variables.
  – Steps:
    • Scatter diagram
    • Correlation coefficient: calculation, meaning, hypothesis testing
• PART II: linear regression
  – Construct a linear equation between 2 variables.
    • Model building
    • Model estimating: confidence intervals and prediction intervals
    • Model fitting: strength of the linear association, coefficient of determination
• Ex. Gender (binary), brand (3-level)
  – Y: response variable (continuous or binary).
    • Ex. Score, success-failure, yield
  – Q: Are X and Y correlated?
    • A: If Y is continuous, compare the population means of Y in the groups divided by X.
      – Ex: Z-test, t-test, ANOVA F-test
• Recall: In Ch. 11 and 12,
  – Q: Are X and Y correlated?
    • A: If X and Y are binary, compare the population proportions of Y in the two groups divided by X.
  – Ex:
  – When sample sizes are large, the Z-test is used.
  – Q: How to determine the correlation if X and Y are both continuous?
    • A: Correlation and regression analysis!
• Data:
  – A sample of n sets of observations.
  – There are k continuous variables measured in each observation.
  – Example. Surveyed n=10 students; k=3 scores are recorded.
  – Question: is there any association between the scores?

Student  Score 1  Score 2  Score 3
1        82       67       56
2        89       99       70
3        45       31       42
4        74       66       67
5        75       86       99
6        69       39       75
7        70       86       67
8        47       61       86
9        92       88       75
10       92       79       54
• What is correlation analysis?
  – Study the relationship between several continuous variables.
  – Measure the strength of the association between variables.
• Correlation analysis consists of:
  – Step 1. Scatter diagram: plot (X1, X2)
  – Step 2. Coefficient of correlation
Conclusion :
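• Population coefficient of correlation, ρ:
  – A measure of the strength of the linear relationship between two variables.
  – Definition (population correlation coefficient):
    $\rho = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$
  – Estimation (sample correlation coefficient):
    $r = \dfrac{S_{xy}}{S_x S_y} = \dfrac{\sum (x-\bar{x})(y-\bar{y})/(n-1)}{\sqrt{\sum (x-\bar{x})^2/(n-1)}\,\sqrt{\sum (y-\bar{y})^2/(n-1)}}$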
• Properties:
  – −1 ≤ r ≤ 1
  – "Positive linear association": r > 0
  – "Negative linear association": r < 0
  – "No linear relation": r ≈ 0 (! other relations may exist)
  – "Strongly positive linear association": r ≈ 1
  – "Strongly negative linear association": r ≈ −1
• Why such a definition?
  – If there is a strongly positive linear association, then when x is large, y is large, and we have a large positive value of Sxy.
  – If there is a strongly negative linear association, then when x is large, y is small, and we have a large negative value of Sxy.
  – If there is no relation, then when x is large, some y are large and some y are small, so Sxy ≈ 0 and r ≈ 0.
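The definition can be traced numerically. The sketch below is only an illustration: it takes the first two score columns of the ten-student data set as x and y, and the function name `sample_correlation` is a hypothetical helper, not something from the slides.

```python
import math

# First two score columns of the ten-student data set (treated here as x and y).
x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]

def sample_correlation(x, y):
    """Return (S_xy, S_x, S_y, r), all computed with (n - 1) denominators."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy, s_x, s_y, s_xy / (s_x * s_y)

s_xy, s_x, s_y, r = sample_correlation(x, y)
print(f"S_xy = {s_xy:.3f}, S_x = {s_x:.2f}, S_y = {s_y:.2f}, r = {r:.3f}")
```

A positive S_xy here reflects that students with a high first score tend to have a high second score, which is what pushes r toward +1.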
EXCEL: sample correlation coefficients
1. r1 = 0.033
2. r2 = 0.7151
3. r3 = 0.3754
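The three coefficients can be reproduced without a spreadsheet. The following is a minimal sketch, assuming the three score columns of the ten-student table; the labels "score 1" to "score 3" and the helper `pearson_r` are illustrative names only.

```python
import math
from itertools import combinations

# The three score columns of the ten-student data set (labels are assumptions).
scores = {
    "score 1": [82, 89, 45, 74, 75, 69, 70, 47, 92, 92],
    "score 2": [67, 99, 31, 66, 86, 39, 86, 61, 88, 79],
    "score 3": [56, 70, 42, 67, 99, 75, 67, 86, 75, 54],
}

def pearson_r(x, y):
    """Sample correlation coefficient r = S_xy / (S_x * S_y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    # The (n - 1) factors cancel between numerator and denominator.
    return sxy / math.sqrt(sxx * syy)

# One coefficient per pair of columns.
pairwise = {
    (a, b): pearson_r(scores[a], scores[b])
    for a, b in combinations(scores, 2)
}
for (a, b), r in pairwise.items():
    print(f"{a} vs {b}: r = {r:.3f}")
```

Under this labeling, the pair (score 1, score 2) gives the largest coefficient, and the pair (score 1, score 3) is close to zero.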
Population: N=∞ subjects
Is "H0: ρ = 0" true? Unknown! It is tested from the sample with a t-test.
• Testing the null hypothesis of no correlation: ρ = 0
• Step 1. State the hypotheses
  – H0: no correlation vs. H1: correlated
  – H0: ρ = 0 vs. H1: ρ ≠ 0
• Step 2. Select the significance level α
• Step 3. Select the test statistic
  – $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
  – Note that under the null hypothesis, t follows a t-distribution with d.f. = n−2.
• Step 4. Formulate the decision rule
  – A two-sided t-test.
  – With significance level α, H0 should be rejected if $t > t_{\alpha/2,\,n-2}$ or $t < -t_{\alpha/2,\,n-2}$.
• Step 5. Collect data, compute the t-value, draw a conclusion
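The five steps can be sketched in a few lines. This is an illustrative sketch only: the function name `correlation_t_test` is hypothetical, the critical value 2.306 is t(0.025, 8) read from a t-table for α = 0.05 and n = 10, and the three r values are the sample correlations reported in this section.

```python
import math

def correlation_t_test(r, n, t_crit):
    """Two-sided test of H0: rho = 0. Returns (t, reject_h0)."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, abs(t) > t_crit

T_CRIT = 2.306  # t(0.025, 8) from a t-table: alpha = 0.05, d.f. = 10 - 2

results = {}
for r in (0.033, 0.7151, 0.3755):
    t, reject = correlation_t_test(r, n=10, t_crit=T_CRIT)
    results[r] = (t, reject)
    print(f"r = {r}: t = {t:.2f}, reject H0: {reject}")
```

Passing the critical value in as a parameter keeps the sketch free of third-party libraries; a statistics package could supply it instead.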
• Example. At α=0.05, n=10, d.f.=10−2=8, t(0.025, 8)=2.306
• Test 1:
  – H0: ρ1 = 0 vs. H1: ρ1 ≠ 0
  – Since r1=0.033, n=10, $t = \dfrac{0.033\sqrt{10-2}}{\sqrt{1-(0.033)^2}} = 0.093$
  – Since −2.306 < t=0.093 < 2.306, H0 is not rejected.
  – Conclusion: there is not sufficient evidence to reject the null hypothesis of no correlation.
• Example. At α=0.05, n=10, d.f.=10−2=8, t(0.025, 8)=2.306
• Test 2:
  – H0: ρ2 = 0 vs. H1: ρ2 ≠ 0
  – Since r2=0.7151, n=10, $t = \dfrac{0.7151\sqrt{10-2}}{\sqrt{1-(0.7151)^2}} = 2.89$
  – Since t=2.89 > 2.306, H0 is rejected.
  – Conclusion: there is sufficient evidence to reject the null hypothesis of no correlation.
• Example. At α=0.05, n=10, d.f.=10−2=8, t(0.025, 8)=2.306
• Test 3:
  – H0: ρ3 = 0 vs. H1: ρ3 ≠ 0
  – Since r3=0.3755, n=10, $t = \dfrac{0.3755\sqrt{10-2}}{\sqrt{1-(0.3755)^2}} = 1.15$
  – Since −2.306 < t=1.15 < 2.306, H0 is not rejected.
  – Conclusion: there is not sufficient evidence to reject the null hypothesis of no correlation.
! A word of caution: when H0 of no correlation has been rejected,
! only a linear relationship between the variables is ascertained. Quadratic? Cubic? These are not examined.
! No "cause and effect" is established.
! Spurious correlations may occur.
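The first caution also works in reverse: r near zero does not rule out a nonlinear relationship. The sketch below uses made-up data (y = x² on a symmetric set of x values, chosen only for illustration) where y is completely determined by x and yet r is zero.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r = S_xy / (S_x * S_y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect quadratic relationship, symmetric around x = 0.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(f"r = {r:.3f}")  # the linear association is zero although y depends on x
```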
• Variables:
  – X = independent variable(s), explanatory variable, predictor
  – Y = response variable
    • To be predicted or estimated.
• Regression analysis:
  – Develop an equation/function that allows us to estimate/predict Y based on X.
  – Example. Given X = 60, what is Y?
• Recall: In a one-way ANOVA
  – AGE vs. INCOME
  – The whole population is classified into three sub-populations by "AGE"
    • A young population.
    • A middle-age population.
    • A senior population.
  – The INCOMEs of all sub-populations are
    • Normally distributed with the same variance
  – Research question: "Are the mean INCOMEs, μincome, the same?"
ANOVA Regression
• Recall: In a simple linear regression model,
  – X vs. Y
  – The whole population is classified into many sub-populations by X
    • X=0 population; X=1 population; ...; X=100 population.
  – The Y values of all sub-populations are
    • Normally distributed with the same variance
  – Research question: instead of "Are the means μY the same?", "Establish the relationship between μY and X"
Regression Model: (P449)
1. Given each value of X, there is a group of Y values.
  – At X=60, $Y \sim N(\mu_{Y|X=60}, \sigma^2)$
  – At X=50, $Y \sim N(\mu_{Y|X=50}, \sigma^2)$
3. The means of these normal distributions are a linear function of x.
[Figure: regression line of the means against x; axis from 0 to 100]
Example. $\mu_{Y|x} = -30 + 1.5x$
4. The standard deviations of these normal distributions are all the same (independent of x).

$Y \sim N(\alpha + \beta x, \sigma^2)$
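Practically, only sample data are collected, and $\alpha$, $\beta$, $\sigma^2$ are unknown.
$\mu_{Y|x} = \alpha + \beta x$ = ?
[Figure: distribution P(y) around the regression line; "x" marks the observations]

How to estimate the regression equation using sample data?
$\mu_{Y|x} = \alpha + \beta x$ = ?
[Figure: scatter of Y; "x" marks the observations]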
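• Regression equation: $\mu_{Y|x} = \alpha + \beta x$, with $\alpha$ = ?, $\beta$ = ?, $\sigma^2$ = ?
• Let a, b be estimates of $\alpha$, $\beta$.
• Predicted equation: Y' = a + bx. It could be
  1. a predicted value of Y
    – Ex. X=60
  2. an estimate of the mean of Y, $\mu_{Y|x}$
    – Ex. X=60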
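• In the predicted equation, the intercept a = ? The slope b = ?
• Least squares estimates (LSE) a, b:
  – Principle: find a regression equation which minimizes the sum of squared residuals,
    $\sum_{i=1}^{n} (y_i - y_i')^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$
• Estimated regression coefficients:
  $b = \dfrac{S_{xy}}{S_x^2}, \qquad a = \bar{y} - b\bar{x}$
  where
  $S_{xy} = \sum (x-\bar{x})(y-\bar{y})/(n-1) = \{\sum xy - n\bar{x}\bar{y}\}/(n-1)$,
  $S_x^2 = \sum (x-\bar{x})^2/(n-1) = \{\sum x^2 - n\bar{x}^2\}/(n-1)$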
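As a rough check of the least squares formulas, the sketch below applies the shortcut forms of S_xy and S_x² to the X and Y scores of the ten students. The function name `least_squares` is a hypothetical helper introduced only for this illustration.

```python
# X and Y scores of the ten students used throughout this section.
x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]

def least_squares(x, y):
    """Return (a, b) of the fitted line Y' = a + b x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # Shortcut forms: {sum(xy) - n*x_bar*y_bar}/(n-1) and {sum(x^2) - n*x_bar^2}/(n-1).
    s_xy = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / (n - 1)
    s_x2 = (sum(xi ** 2 for xi in x) - n * x_bar ** 2) / (n - 1)
    b = s_xy / s_x2
    a = y_bar - b * x_bar
    return a, b

a, b = least_squares(x, y)
print(f"Y' = {a:.4f} + {b:.4f} X")
```

Because a is computed from b, rounding b early (for example to 0.93) shifts a noticeably; keeping full precision until the end avoids that.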
Meaning of the estimated intercept, a
• a = Y' at X=0.
  – The estimated value of $\mu_{Y|X=0}$, the mean of Y when X=0.
  – The predicted value of Y when X=0.
• a is an estimate of the true intercept α.
• One may be interested in testing H0: α=0.
• When α=0, the equation passes through the origin: $\mu_{Y|x} = \beta x$, so $\mu_{Y|x=0} = 0$.
Meaning of the estimated slope, b
• b = the increment in $\mu_{Y|x}$ per unit change of x
  – When there is a one-unit change in x, b is the increment/decrement in $\mu_{Y|x}$.
  – Example. In the previous case, if b=0.2, then increasing X by 1 increases $\mu_{Y|x}$ by 0.2.
• b is an estimate of the true slope β.
• One is more interested in testing H0: β=0.
• When β=0, the equation is a constant and independent of the X values: $\mu_{Y|x} = \alpha$, $Y \sim N(\alpha, \sigma^2)$.
  – The distribution of Y is uncorrelated with X.
  – X and Y are independent!
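No.   X     Y     XY
1     82    67    5494
2     89    99    8811
3     45    31    1395
4     74    66    4884
5     75    86    6450
6     69    39    2691
7     70    86    6020
8     47    61    2867
9     92    88    8096
10    92    79    7268
mean  73.5  70.2
sum               53976

$S_{xy} = \{\sum xy - n\bar{x}\bar{y}\}/(n-1) = \{53976 - 10(73.5)(70.2)\}/(10-1) = 264.333$
$b = S_{xy}/S_x^2 = 264.333/282.94 \approx 0.934$
$a = \bar{y} - b\bar{x} = 70.2 - 0.934 \times 73.5 \approx 1.55$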
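EXCEL output

Model fitting (ANOVA):
            d.f.  SS       MS       F       Significance F
Regression  1     2222.52  2222.52  8.3747  0.0201
Error       8     2123.08  265.39

Model estimating (coefficients):
           Coefficient  Std. error  t       P-value  Lower 95%  Upper 95%
Intercept  1.5346       24.2804     0.0632  0.9512   -54.4562   57.5253
X          0.9342       0.3228      2.8939  0.0201   0.1898     1.6787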
EXCEL: output
           Coefficient  Std. error  t       P-value  Lower 95%  Upper 95%
Intercept  1.5346       24.2804     0.0632  0.9512   -54.4562   57.5253
X          0.9342       0.3228      2.8939  0.0201   0.1898     1.6787
Note: the difference from the previous calculation is due to rounding error.
• Coefficient: a, b, the estimates of α, β
• Std. error: SE(a), SE(b)
• t: t-value(a) = a/SE(a), t-value(b) = b/SE(b)
• P-value:
  – p-value(a) = 0.9512 > 0.05: do not reject α = 0
  – p-value(b) = 0.02 < 0.05: reject! β ≠ 0
• Lower 95%, Upper 95%: 95% confidence intervals for α, β
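The columns of this output can be approximated by hand. The sketch below is not how EXCEL computes them internally; it assumes the standard textbook formulas SE(b) = s/√Σ(x−x̄)² and SE(a) = s·√(1/n + x̄²/Σ(x−x̄)²), which are not stated on the slides, and it takes the critical value t(0.025, 8) = 2.306 from a t-table.

```python
import math

x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

# Standard error of estimate from the residuals, with n - 2 degrees of freedom.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# Assumed textbook formulas for the standard errors of the coefficients.
se_b = s / math.sqrt(sxx)
se_a = s * math.sqrt(1 / n + x_bar ** 2 / sxx)

T_CRIT = 2.306  # t(0.025, 8) from a t-table
t_a, t_b = a / se_a, b / se_b
ci_a = (a - T_CRIT * se_a, a + T_CRIT * se_a)
ci_b = (b - T_CRIT * se_b, b + T_CRIT * se_b)

print(f"intercept: {a:.4f}, SE {se_a:.4f}, t {t_a:.4f}, 95% CI {ci_a}")
print(f"slope:     {b:.4f}, SE {se_b:.4f}, t {t_b:.4f}, 95% CI {ci_b}")
```

The interval for the slope excludes 0 while the interval for the intercept contains 0, which matches the two p-value conclusions.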
$\mu_{Y|x} = \alpha + \beta x \approx a + bx = Y'$
The standard error of estimate:
• Variance $\sigma^2$:
  – Dispersion of Y around the regression line
  – The variation of the random "error", Error = $Y - \mu_{Y|x}$: unobtainable
• Standard error of estimate:
  – Use "residuals" to estimate "error", Residual = Y − Y': observable
  – The standard error of estimate is defined by
    $s_{y\cdot x} = \sqrt{\dfrac{\sum (Y-Y')^2}{n-2}} = \sqrt{\dfrac{n-1}{n-2}\,(S_y^2 - b^2 S_x^2)}$
    where $S_y$: sample s.d. of Y, $S_x$: sample s.d. of X
• $s_{y\cdot x}^2$, the estimate of $\sigma^2$:
  – The random variation that is unexplained by the regression line.
Example. Y' = 1.53 + 0.93X

No.  X    Y    Y'     Y−Y'    (Y−Y')²
1    82   67   78.14  -11.14  124.12
2    89   99   84.68   14.32  205.05
3    45   31   43.57  -12.57  158.12
4    74   66   70.67   -4.67   21.78
5    75   86   71.60   14.40  207.32
6    69   39   66.00  -27.00  728.78
7    70   86   66.93   19.07  363.66
8    47   61   45.44   15.56  242.02
9    92   88   87.48    0.52    0.27
10   92   79   87.48   -8.48   71.96
sum                           2123.08
Note:
$s_{y\cdot x} = \sqrt{\dfrac{n-1}{n-2}\,(S_y^2 - b^2 S_x^2)} = \sqrt{\dfrac{9}{8}\,(482.84 - 282.94\,(0.93)^2)} \approx 16.3$
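The two routes to the standard error of estimate (residuals versus the variance shortcut) can be compared directly. This is a minimal sketch on the same ten (X, Y) pairs, using unrounded a and b, so its result differs slightly from the hand calculation with b = 0.93.

```python
import math

x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

# Route 1: residuals Y - Y' and their sum of squares.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s_from_residuals = math.sqrt(sse / (n - 2))

# Route 2: shortcut through the sample variances S_y^2 and S_x^2.
s_y2 = syy / (n - 1)
s_x2 = sxx / (n - 1)
s_from_variances = math.sqrt((n - 1) / (n - 2) * (s_y2 - b ** 2 * s_x2))

print(f"SSE = {sse:.2f}")
print(f"s (residuals) = {s_from_residuals:.2f}, s (variances) = {s_from_variances:.2f}")
```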
ESTIMATION & PREDICTION: Confidence intervals and prediction intervals
• ESTIMATION:
  – Q: At X=x, what is the mean value of Y, $\mu_{Y|x}$ = ?
  – Point estimation, confidence interval
• PREDICTION:
  – Q: If an individual is drawn from the population of X=x, Y = ?
  – Point prediction, prediction interval

Confidence interval of $\mu_{Y|x}$ at X=x
• Confidence interval: at X=x, the mean value of Y, $\mu_{Y|x}$
  – Point estimation: Y' = a + bx
  – 100(1−α)% confidence interval:
    $Y' \pm t_{\alpha/2,\,n-2}\; s_{y\cdot x} \sqrt{\dfrac{1}{n} + \dfrac{(x-\bar{x})^2}{(n-1)S_x^2}}$
Example. Y' = 1.53 + 0.93X
Ans.
2. 95% confidence interval:
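Prediction interval of Y at X=x
• Prediction interval: if an individual is drawn from the population of X=x, Y = ?
  – Prediction: Y' = a + bx
  – 100(1−α)% prediction interval:
    $Y' \pm t_{\alpha/2,\,n-2}\; s_{y\cdot x} \sqrt{1 + \dfrac{1}{n} + \dfrac{(x-\bar{x})^2}{(n-1)S_x^2}}$
Example. Y' = 1.53 + 0.93X, at X = 60:
  $Y' = 57.33,\; t = 2.306,\; s_{y\cdot x} = 16.29,\; n = 10,\; (x-\bar{x})^2 = (60-73.5)^2 = 182.25,\; S_x^2 = 282.94$
  $57.33 \pm 2.306 \times 16.29 \times \sqrt{1 + \dfrac{1}{10} + \dfrac{182.25}{9(282.94)}} = 57.33 \pm 40.66$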
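To see how the two intervals differ, the sketch below evaluates both at X = 60 using the rounded quantities quoted in the example (a = 1.53, b = 0.93, s = 16.29, x̄ = 73.5, S_x² = 282.94, t = 2.306). The function `intervals_at` is a hypothetical helper; the only difference between the two half-widths is the extra "1 +" under the square root for the prediction interval.

```python
import math

def intervals_at(x0, a, b, s, n, x_bar, s_x2, t_crit):
    """Return (y_hat, ci_half_width, pi_half_width) at X = x0."""
    y_hat = a + b * x0
    # Shared term: 1/n + (x0 - x_bar)^2 / ((n - 1) * S_x^2)
    core = 1 / n + (x0 - x_bar) ** 2 / ((n - 1) * s_x2)
    ci_half = t_crit * s * math.sqrt(core)      # for the mean of Y at x0
    pi_half = t_crit * s * math.sqrt(1 + core)  # for one new individual at x0
    return y_hat, ci_half, pi_half

# Rounded inputs taken from the worked example in this section.
y_hat, ci_half, pi_half = intervals_at(
    x0=60, a=1.53, b=0.93, s=16.29, n=10, x_bar=73.5, s_x2=282.94, t_crit=2.306
)
print(f"point estimate: {y_hat:.2f}")
print(f"95% confidence interval: {y_hat:.2f} +/- {ci_half:.2f}")
print(f"95% prediction interval: {y_hat:.2f} +/- {pi_half:.2f}")
```

The prediction interval is always wider, since it covers the scatter of an individual around the mean in addition to the uncertainty of the estimated mean.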
RECALL: ANOVA table
  – SS total
    • D.f. = n−1 for n observations.
    • MS total = SS total/(n−1)
  – SST = variation due to treatment
    • Ȳj = estimated mean of Y of the treatment-j group
    • D.f. = k−1 for k treatments
    • MST = SST/(k−1)
  – SSE = variation due to random error
    • D.f. = n−k
    • MSE = SSE/(n−k)
  – SS total = SST + SSE

Source     SS   Degrees of Freedom  Mean Square      F
Treatment  SST  k−1                 SST/(k−1) = MST  MST/MSE
Error      SSE  n−k                 SSE/(n−k) = MSE
• SStotal = total variation of Y
• SSR = the variation explained by the regression model
• SSE = the unexplained variation
  $\sum (Y-\bar{Y})^2 = \sum (Y'-\bar{Y})^2 + \sum (Y-Y')^2$, that is, SStotal = SSR + SSE
• SStotal = $\sum (Y-\bar{Y})^2 = (n-1)S_Y^2$
  – D.f. = n−1 for n observations.
  – MStotal = SStotal/(n−1)
• SSR = due to the regression model = $\sum (Y'-\bar{Y})^2 = (n-1)\,b^2 S_X^2$
  – Y' = estimated mean of Y at some X-level
  – D.f. = 2−1 = 1 for 2 regression coefficients
  – MSR = SSR/1 = SSR
• SSE = due to random error = $\sum (Y-Y')^2 = (n-2)\,S_{y\cdot x}^2$
  – D.f. = n−2
  – MSE = SSE/(n−2) = $S_{y\cdot x}^2$
Source      SS   Degrees of Freedom  Mean Square      F
Regression  SSR  2−1                 SSR/1 = MSR      MSR/MSE
Error       SSE  n−2                 SSE/(n−2) = MSE

H0: β = 0 vs. H1: β ≠ 0
Under H0, the regression line is horizontal.
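The regression ANOVA table can be filled in numerically. The sketch below is an illustration on the ten (X, Y) pairs used earlier in this section; variable names such as `ss_total` and `f_stat` are chosen here and do not come from the slides.

```python
x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares fit Y' = a + b x.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# Decomposition of the total variation of Y.
ss_total = sum((yi - y_bar) ** 2 for yi in y)             # d.f. = n - 1
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)              # d.f. = 1
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # d.f. = n - 2

msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse

print(f"SStotal = {ss_total:.2f} = SSR {ssr:.2f} + SSE {sse:.2f}")
print(f"MSR = {msr:.2f}, MSE = {mse:.2f}, F = {f_stat:.4f}")
```

In simple linear regression this F equals the square of the t-value of the slope, so the F-test and the t-test of β = 0 lead to the same decision.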
No.       X      Y
2         89     99
3         45     31
4         75     67
5         76     86
6         69     40
7         71     87
8         48     61
9         93     89
10        93     80
sum       740    706
mean      74.0   70.6
sd        16.8   22.0
variance  283.2  482.7

            d.f.  SS        MS        F         Significance F
Regression  1     2222.518  2222.518  8.374682  0.020079
Error       8     2123.082  265.3853
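$SStotal = \sum (Y-\bar{Y})^2 = (n-1)S_Y^2 = 9 \times 482.84 \approx 4345.6$
$SSR = \sum (Y'-\bar{Y})^2 = (n-1)\,b^2 S_X^2 = 9 \times (0.9342)^2 \times 282.9 \approx 2222.5$
Further, since the p-value of the F-test is 0.02 < 0.05, the linear relationship exists.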
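The Coefficient of Determination
• Coefficient of determination:
  – The proportion of the total variation of Y that is explained by the variation of X.
  $r^2 = \dfrac{SSR}{SStotal} = \dfrac{\sum (Y'-\bar{Y})^2}{\sum (Y-\bar{Y})^2} = \dfrac{\text{variation explained by the model}}{\text{total variation}}$
  $r^2 = 1 - \dfrac{SSE}{SStotal} = \dfrac{\text{total variation} - \text{unexplained variation}}{\text{total variation}}$

            d.f.  SS        MS        F         Significance F
Regression  1     2222.518  2222.518  8.374682  0.020079
Error       8     2123.082  265.3853

  – EXCEL function: CORREL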
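The link between the ANOVA sums of squares and the correlation coefficient can be checked on the same data. This is a sketch only, using the ten (X, Y) pairs from earlier in the section; it computes the coefficient of determination as SSR/SStotal and compares it with the square of r.

```python
import math

x = [82, 89, 45, 74, 75, 69, 70, 47, 92, 92]
y = [67, 99, 31, 66, 86, 39, 86, 61, 88, 79]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Fitted values and the sums of squares of the regression ANOVA table.
b = sxy / sxx
a = y_bar - b * x_bar
ssr = sum((a + b * xi - y_bar) ** 2 for xi in x)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_total = syy

r_squared = ssr / ss_total             # explained / total
r_squared_alt = 1 - sse / ss_total     # (total - unexplained) / total
r = sxy / math.sqrt(sxx * syy)         # sample correlation coefficient

print(f"r^2 = {r_squared:.4f} (SSR/SStotal), {r_squared_alt:.4f} (1 - SSE/SStotal)")
print(f"r = {r:.4f}, r squared = {r ** 2:.4f}")
```

So roughly half of the variation of Y is explained by X in this example, even though the slope is statistically significant.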
Exercise.
• Linear regression analysis:
  – 45, 46, 53, 57
  – EXCEL: 47, 49
• X and Y:
  1. Correlation analysis
    – Scatter plot, correlation matrix
    – X and Y (α=0.05)
  3. ANOVA