Chapter 15 Multiple Linear Regression Analysis
Dec 25, 2015
• Multiple linear regression
• Choice of independent variables
• Application
Goal: construct a multiple linear regression model to assess the relationship between one dependent variable and a set of independent variables.
Data: the dependent variable is quantitative; all or most of the independent variables are quantitative. Qualitative or ranked variables must first be converted to numeric codes.
Application: explanation and prediction.
Significance: an outcome is usually influenced by many factors, so a change in the dependent variable may be driven by several independent variables. For example, blood sugar in patients with diabetes may be affected by many biochemical indices such as insulin, glycosylated hemoglobin, serum total cholesterol, triglyceride and so on.
§ 1 Multiple linear regression
• Variables: one dependent variable and a set of m independent variables, m+1 variables in total.
• Sample size: n
• Data form: see Table 15-1
• General model of the regression equation:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m + e$$
1. Multiple linear regression model
In the above model, the dependent variable Y is expressed approximately as a linear function of the independent variables X1, X2, …, Xm. β0 is the constant term; β1, β2, …, βm are the partial regression coefficients: βj is the mean change in Y when Xj increases or decreases by one unit while the other independent variables are held constant. The residual e is the random error in Y that remains after the influence of the m independent variables has been removed.
Case No.   X1    X2    …   Xm    Y
1          X11   X12   …   X1m   Y1
2          X21   X22   …   X2m   Y2
⋮          ⋮     ⋮     …   ⋮     ⋮
n          Xn1   Xn2   …   Xnm   Yn
Table 15-1 Data form of multiple regression
Assumptions
(1) There is a linear relationship between Y and X1, X2, …, Xm.
(2) The observed values Yi (i = 1, 2, …, n) are mutually independent.
(3) The residual e is independent and normally distributed with mean 0 and variance σ². Equivalently, for any given values of X1, X2, …, Xm, the dependent variable Y is normally distributed with the same variance.
General process
(1) Estimate the partial regression coefficients b0, b1, b2, …, bm and construct the regression equation
$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_m X_m$$
(2) Test and evaluate the regression equation and the effect of each independent variable.
2. Construction of the multiple linear regression equation
Case 15-1 The measured values of serum total cholesterol, triglyceride, fasting insulin, glycosylated hemoglobin and fasting blood glucose for 27 patients with diabetes are listed in Table 15-2. Construct the multiple linear regression equation of blood sugar on the other indices.
No.   Total cholesterol   Triglyceride   Insulin   Glycosylated     Blood sugar
      (mmol/L)            (mmol/L)       (μU/ml)   hemoglobin (%)   (mmol/L)
i     X1                  X2             X3        X4               Y
1      5.68                1.90           4.53      8.2             11.2
2      3.79                1.64           7.32      6.9              8.8
3      6.02                3.56           6.95     10.8             12.3
4      4.85                1.07           5.88      8.3             11.6
5      4.60                2.32           4.05      7.5             13.4
6      6.05                0.64           1.42     13.6             18.3
7      4.90                8.50          12.60      8.5             11.1
8      7.08                3.00           6.75     11.5             12.1
9      3.85                2.11          16.28      7.9              9.6
10     4.65                0.63           6.59      7.1              8.4
11     4.59                1.97           3.61      8.7              9.3
12     4.29                1.97           6.61      7.8             10.6
13     7.97                1.93           7.57      9.9              8.4
14     6.19                1.18           1.42      6.9              9.6
15     6.13                2.06          10.35     10.5             10.9
16     5.71                1.78           8.53      8.0             10.1
17     6.40                2.40           4.53     10.3             14.8
18     6.06                3.67          12.79      7.1              9.1
19     5.09                1.03           2.53      8.9             10.8
20     6.13                1.71           5.28      9.9             10.2
21     5.78                3.36           2.96      8.0             13.6
22     5.43                1.13           4.31     11.3             14.9
23     6.50                6.21           3.47     12.3             16.0
24     7.98                7.92           3.37      9.8             13.2
25    11.54               10.89           1.20     10.5             20.0
26     5.84                0.92           8.61      6.4             13.3
27     3.84                1.20           6.45      9.6             10.4
Table 15-2 Blood sugar and related measurements in 27 patients with diabetes
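Principle: least sum of squares. The coefficients are chosen to minimize
$$Q = \sum (Y - \hat{Y})^2 = \sum \left[ Y - (b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_m X_m) \right]^2$$
Setting the partial derivatives of Q with respect to each coefficient to zero gives the normal equations:
$$\begin{cases}
l_{11} b_1 + l_{12} b_2 + \cdots + l_{1m} b_m = l_{1Y} \\
l_{21} b_1 + l_{22} b_2 + \cdots + l_{2m} b_m = l_{2Y} \\
\quad \vdots \\
l_{m1} b_1 + l_{m2} b_2 + \cdots + l_{mm} b_m = l_{mY}
\end{cases}$$
$$b_0 = \bar{Y} - (b_1 \bar{X}_1 + b_2 \bar{X}_2 + \cdots + b_m \bar{X}_m)$$
where
$$l_{ij} = \sum (X_i - \bar{X}_i)(X_j - \bar{X}_j) = \sum X_i X_j - \frac{\left(\sum X_i\right)\left(\sum X_j\right)}{n}, \quad i, j = 1, 2, \ldots, m$$
$$l_{jY} = \sum (X_j - \bar{X}_j)(Y - \bar{Y}) = \sum X_j Y - \frac{\left(\sum X_j\right)\left(\sum Y\right)}{n}, \quad j = 1, 2, \ldots, m$$
For Case 15-1 the fitted equation is
$$\hat{Y} = 5.9433 + 0.1424 X_1 + 0.3515 X_2 - 0.2706 X_3 + 0.6382 X_4$$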
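The least-squares step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `solve` and `fit` are hypothetical, the data are transcribed from Table 15-2, and the normal equations are solved by simple Gauss-Jordan elimination rather than by whatever software produced the slides.

```python
# Data transcribed from Table 15-2: (X1, X2, X3, X4, Y) for the 27 patients.
DATA = [
    (5.68, 1.90, 4.53, 8.2, 11.2), (3.79, 1.64, 7.32, 6.9, 8.8), (6.02, 3.56, 6.95, 10.8, 12.3),
    (4.85, 1.07, 5.88, 8.3, 11.6), (4.60, 2.32, 4.05, 7.5, 13.4), (6.05, 0.64, 1.42, 13.6, 18.3),
    (4.90, 8.50, 12.60, 8.5, 11.1), (7.08, 3.00, 6.75, 11.5, 12.1), (3.85, 2.11, 16.28, 7.9, 9.6),
    (4.65, 0.63, 6.59, 7.1, 8.4), (4.59, 1.97, 3.61, 8.7, 9.3), (4.29, 1.97, 6.61, 7.8, 10.6),
    (7.97, 1.93, 7.57, 9.9, 8.4), (6.19, 1.18, 1.42, 6.9, 9.6), (6.13, 2.06, 10.35, 10.5, 10.9),
    (5.71, 1.78, 8.53, 8.0, 10.1), (6.40, 2.40, 4.53, 10.3, 14.8), (6.06, 3.67, 12.79, 7.1, 9.1),
    (5.09, 1.03, 2.53, 8.9, 10.8), (6.13, 1.71, 5.28, 9.9, 10.2), (5.78, 3.36, 2.96, 8.0, 13.6),
    (5.43, 1.13, 4.31, 11.3, 14.9), (6.50, 6.21, 3.47, 12.3, 16.0), (7.98, 7.92, 3.37, 9.8, 13.2),
    (11.54, 10.89, 1.20, 10.5, 20.0), (5.84, 0.92, 8.61, 6.4, 13.3), (3.84, 1.20, 6.45, 9.6, 10.4),
]


def solve(a, rhs):
    """Solve the linear system a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit(data):
    """Return [b0, b1, ..., bm]; the last column of each row is Y."""
    n, m = len(data), len(data[0]) - 1
    means = [sum(row[k] for row in data) / n for k in range(m + 1)]

    def l(i, j):
        # Corrected sum of cross-products l_ij (index m stands for Y).
        return sum((row[i] - means[i]) * (row[j] - means[j]) for row in data)

    b = solve([[l(i, j) for j in range(m)] for i in range(m)],
              [l(i, m) for i in range(m)])
    b0 = means[m] - sum(bj * xbar for bj, xbar in zip(b, means[:m]))
    return [b0] + b


coef = fit(DATA)
print([round(c, 4) for c in coef])
```

The printed coefficients should agree, up to rounding, with the equation fitted for Case 15-1.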
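3. Hypothesis test and evaluation
(1) For the regression equation
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_m = 0$$
$$H_1: \text{not all } \beta_j \ (j = 1, 2, \ldots, m) \text{ are zero}$$
$$\alpha = 0.05$$
3.1.1 Analysis of variance:
$$SS_Y = SS_{reg} + SS_{res}$$
$$F = \frac{SS_{reg}/m}{SS_{res}/(n-m-1)} = \frac{MS_{reg}}{MS_{res}}$$
$$F \sim F(m,\ n-m-1)$$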
Source of variation   df       SS      MS               F             P
Total                 n-1      SSY
Regression            m        SSreg   SSreg/m          MSreg/MSres
Residual              n-m-1    SSres   SSres/(n-m-1)
Table 15-3 Frame of the analysis of variance for multiple linear regression

Source of variation   df   SS         MS        F      P
Total                 26   222.5519
Regression             4   133.7107   33.4277   8.28   <0.01
Residual              22    88.8412    4.0382
Table 15-4 Analysis of variance for Case 15-1
From the table of critical F values, $F_{0.01(4,22)} = 4.31$. Since $F = 8.28 > 4.31$, $P < 0.01$. At the $\alpha = 0.05$ level we reject H0 and accept H1, and conclude that the regression equation is statistically significant.
A. Coefficient of determination R²
$$R^2 = \frac{SS_{reg}}{SS_Y} = 1 - \frac{SS_{res}}{SS_Y}, \qquad 0 \le R^2 \le 1$$
R² is the proportion of variation in the dependent variable that is predictable from the best linear combination of the independent variables. The closer R² is to 1, the better the model fits the data. In this case:
$$R^2 = \frac{133.7107}{222.5519} = 0.6008$$
In this example, 60% of the variation in blood sugar is predictable from insulin, glycosylated hemoglobin, serum total cholesterol and triglyceride.
B. Multiple correlation coefficient R
R measures the strength of the linear relationship between the dependent variable Y and the set of independent variables, that is, the correlation between the observed Y and the estimated $\hat{Y}$.
It is calculated as $R = \sqrt{R^2}$; in this example $R = \sqrt{0.6008} = 0.7751$. If m = 1, then $R = |r|$, where r is the simple correlation coefficient.
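The overall F test, R² and R follow from three numbers only (the total and regression sums of squares and the degrees of freedom), so the arithmetic can be checked with a small sketch. The function name `anova_summary` is hypothetical; the inputs are the values reported for Case 15-1.

```python
from math import sqrt


def anova_summary(ss_total, ss_reg, n, m):
    """Overall ANOVA quantities for a regression with m predictors and n cases."""
    ss_res = ss_total - ss_reg
    ms_reg = ss_reg / m
    ms_res = ss_res / (n - m - 1)
    r2 = ss_reg / ss_total
    return {
        "SS_res": ss_res,
        "MS_reg": ms_reg,
        "MS_res": ms_res,
        "F": ms_reg / ms_res,
        "R2": r2,
        "R": sqrt(r2),
    }


summary = anova_summary(ss_total=222.5519, ss_reg=133.7107, n=27, m=4)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Comparing F with a critical value (or converting it to a P-value) still requires an F table or a statistics library; that step is left out here.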
(2) For each independent variable
The analysis of variance and the coefficient of determination evaluate the equation as a whole. The effect of each independent variable on Y in the equation should also be shown clearly.
A. Partial regression sum of squares
Meaning: the partial regression sum of squares of an independent variable Xj, written $SS_{reg}(X_j)$, is the contribution of Xj to the dependent variable Y given that the other m-1 independent variables are in the equation. It is the decrease in the regression sum of squares when Xj is removed from the equation or, equivalently, the increase in the regression sum of squares when Xj is added to an equation that already contains the other m-1 independent variables.
$$F_j = \frac{SS_{reg}(X_j)/1}{SS_{res}/(n-m-1)}, \qquad \nu_1 = 1,\ \nu_2 = n-m-1$$
The larger $SS_{reg}(X_j)$ is, the more important the corresponding independent variable.
In general, the regression sum of squares for the other m-1 independent variables must be obtained by fitting a new equation, not by simply deleting the Xj term from the equation with m independent variables.
Independent variables in the equation   SSreg      SSres
① X1, X2, X3, X4                        133.7107    88.8412
② X2, X3, X4                            133.0978    89.4540
③ X1, X3, X4                            121.7480   100.8038
④ X1, X2, X4                            113.6472   108.9047
⑤ X1, X2, X3                            105.9168   116.6351
Table 15-5 Partial results of the regression analyses for Case 15-1
The partial regression sum of squares of each independent variable can be calculated by fitting regression equations with different sets of independent variables. Table 15-5 gives part of the results for Case 15-1.
$$SS_{reg}(X_1) = SS_{reg}(X_1, X_2, X_3, X_4) - SS_{reg}(X_2, X_3, X_4) = 133.7107 - 133.0978 = 0.6129$$
$$SS_{reg}(X_2) = SS_{reg}(X_1, X_2, X_3, X_4) - SS_{reg}(X_1, X_3, X_4) = 133.7107 - 121.7480 = 11.9627$$
$$SS_{reg}(X_3) = SS_{reg}(X_1, X_2, X_3, X_4) - SS_{reg}(X_1, X_2, X_4) = 133.7107 - 113.6472 = 20.0635$$
$$SS_{reg}(X_4) = SS_{reg}(X_1, X_2, X_3, X_4) - SS_{reg}(X_1, X_2, X_3) = 133.7107 - 105.9168 = 27.7939$$
Results:
$$F_1 = \frac{0.6129/1}{88.8412/(27-4-1)} = 0.152, \qquad F_2 = \frac{11.9627/1}{88.8412/(27-4-1)} = 2.962$$
$$F_3 = \frac{20.0635/1}{88.8412/(27-4-1)} = 4.968, \qquad F_4 = \frac{27.7939/1}{88.8412/(27-4-1)} = 6.883$$
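The partial F statistics above are plain arithmetic on the regression sums of squares in Table 15-5. A minimal sketch, with hypothetical names (`SS_REG_WITHOUT`, `partial_f`) and the table values typed in by hand:

```python
N, M = 27, 4
SS_REG_FULL = 133.7107   # regression SS with X1..X4 in the equation
SS_RES_FULL = 88.8412    # residual SS of the same equation

# Regression SS of the equation refitted without the named variable (Table 15-5).
SS_REG_WITHOUT = {"X1": 133.0978, "X2": 121.7480, "X3": 113.6472, "X4": 105.9168}


def partial_f(name):
    """Partial F statistic for one variable, with 1 and n-m-1 degrees of freedom."""
    ss_partial = SS_REG_FULL - SS_REG_WITHOUT[name]
    return (ss_partial / 1) / (SS_RES_FULL / (N - M - 1))


partial = {name: partial_f(name) for name in SS_REG_WITHOUT}
for name, f in partial.items():
    print(f"{name}: SS_reg(Xj) = {SS_REG_FULL - SS_REG_WITHOUT[name]:.4f}, F = {f:.3f}")
```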
B. t-test
This method is equivalent to the test based on the partial regression sum of squares. The formula is
$$t_j = \frac{b_j}{S_{b_j}}$$
where $b_j$ is the estimate of the partial regression coefficient and $S_{b_j}$ is the standard error of $b_j$.
Hypothesis test:
H0: $\beta_j = 0$. Under H0, $t_j$ follows a t distribution with $\nu = n-m-1$ degrees of freedom. If $|t_j| > t_{\alpha/2,\ n-m-1}$, then at the $\alpha$ (0.05) level we reject H0 and accept H1, and conclude that there is a linear relationship between $X_j$ and Y.
Results:
$$t_1 = \frac{0.1424}{0.3656} = 0.390, \qquad t_2 = \frac{0.3515}{0.2042} = 1.721$$
$$t_3 = \frac{-0.2706}{0.1214} = -2.229, \qquad t_4 = \frac{0.6382}{0.2433} = 2.623$$
$t_{0.05/2,\ 22} = 2.074$ and $|t_4| > |t_3| > 2.074$, so the P-values for $b_3$ and $b_4$ are below 0.05. That is, $b_3$ and $b_4$ are statistically significant, but $b_1$ and $b_2$ are not.
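As a check on the t statistics, the sketch below divides each coefficient by its standard error and compares the result with the critical value quoted in the text. The dictionaries `B` and `SE` simply restate the reported values; the variable names are hypothetical.

```python
# Partial regression coefficients and their standard errors as reported for Case 15-1.
B = {"X1": 0.1424, "X2": 0.3515, "X3": -0.2706, "X4": 0.6382}
SE = {"X1": 0.3656, "X2": 0.2042, "X3": 0.1214, "X4": 0.2433}
T_CRIT = 2.074  # t(0.05/2, 22) as quoted in the text

t_values = {name: B[name] / SE[name] for name in B}
significant = [name for name, t in t_values.items() if abs(t) > T_CRIT]

for name, t in t_values.items():
    # t squared reproduces the partial F statistic of the same variable.
    print(f"{name}: t = {t:.3f}, t^2 = {t * t:.3f}")
print("significant at 0.05:", significant)
```

Squaring each t value gives back the corresponding partial F statistic, which is why the two tests are described as equivalent.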
C. Standardized regression coefficient
A standardized variable is obtained by subtracting the mean of the variable from the original data and then dividing by the standard deviation of the variable:
$$X_j' = \frac{X_j - \bar{X}_j}{S_j}$$
The regression equation fitted to the standardized variables is called the standardized regression equation, and its coefficients are called standardized regression coefficients:
$$b_j' = b_j \sqrt{\frac{l_{jj}}{l_{YY}}} = b_j \left( \frac{S_j}{S_Y} \right)$$
Standardized regression coefficients have no units, so they can be used to compare the strength of the effect of each independent variable Xj on Y. In general, among statistically significant variables, the larger the absolute value of the standardized regression coefficient, the greater the effect of the corresponding independent variable on Y.
Attention:
• In general, the regression coefficient $b_j$ has units and is used to interpret the effect of each independent variable on the dependent variable: it is the mean change in Y when $X_j$ increases or decreases by one unit while the other independent variables are held constant. The $b_j$ cannot be used to compare the effects of different $X_j$ on Y.
• The standardized regression coefficient $b_j'$ has no units and is used to compare the effects of the independent variables on the dependent variable: the larger $|b_j'|$ is, the larger the effect of $X_j$ on Y.
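$$S_1 = 1.5934, \quad S_2 = 2.5748, \quad S_3 = 3.6706, \quad S_4 = 1.8234, \quad S_Y = 2.9257$$
Results:
$$b_1' = 0.1424 \times \frac{1.5934}{2.9257} = 0.0776$$
$$b_2' = 0.3515 \times \frac{2.5748}{2.9257} = 0.3093$$
$$b_3' = -0.2706 \times \frac{3.6706}{2.9257} = -0.3395$$
$$b_4' = 0.6382 \times \frac{1.8234}{2.9257} = 0.3977$$
By absolute value of the standardized coefficients, the factors affecting blood sugar rank as follows, from strongest to weakest: glycosylated hemoglobin (X4), insulin (X3), triglyceride (X2), serum total cholesterol (X1).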
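The standardized coefficients and the resulting ranking can be reproduced from the reported coefficients and standard deviations. The sketch below is illustrative only; the names `B`, `S`, `S_Y` and `ranking` are hypothetical.

```python
# Reported partial regression coefficients and standard deviations for Case 15-1.
B = {"X1": 0.1424, "X2": 0.3515, "X3": -0.2706, "X4": 0.6382}
S = {"X1": 1.5934, "X2": 2.5748, "X3": 3.6706, "X4": 1.8234}
S_Y = 2.9257

# b'_j = b_j * S_j / S_Y
b_std = {name: B[name] * S[name] / S_Y for name in B}

# Rank variables by the absolute size of the standardized coefficient.
ranking = sorted(b_std, key=lambda name: abs(b_std[name]), reverse=True)

for name in ranking:
    print(f"{name}: b' = {b_std[name]:.4f}")
```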
§ 2 Choice of independent variables
Purpose: to make the prediction and/or explanation as good as possible.
1. All-subsets selection
Goal: better prediction.
Meaning: compare the regression equations built from different combinations of independent variables and select the best one.
Methods:
1. Selection by the adjusted coefficient of determination $R_c^2$
2. Selection by $C_p$
1.1.1 Selection by the adjusted coefficient of determination $R_c^2$. Formula:
$$R_c^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - \frac{MS_{res}}{MS_{total}}$$
Here n is the sample size, $R^2$ is the coefficient of determination of the regression equation containing p (p ≤ m) independent variables, and $MS_{total} = SS_Y/(n-1)$. The behaviour of $R_c^2$ is as follows: for equal $R^2$, the more independent variables there are, the smaller $R_c^2$ is. The "best" regression equation is the one with the largest $R_c^2$.
1.1.2 Selection by $C_p$
$$C_p = \frac{(SS_{res})_p}{(MS_{res})_m} - [n - 2(p+1)]$$
$(SS_{res})_p$ is the residual sum of squares of the regression on p (p ≤ m) independent variables, and $(MS_{res})_m$ is the residual mean square of the regression model containing all m independent variables.
When the equation with p independent variables is theoretically the best, the expected value of $C_p$ is p+1, so the regression equation whose $C_p$ is closest to p+1 should be chosen as the optimum equation.
$C_p$ should not be used to choose independent variables if none of the candidate variables has a major effect on Y.
Case 15-2 Use all-subsets selection to choose the independent variables in Case 15-1.

Independent variables   Rc²     Cp
X2, X3, X4              0.546    3.15
X1, X2, X3, X4          0.528    5.00
X1, X3, X4              0.488    5.96
X1, X2, X4              0.447    7.97
X1, X4                  0.441    7.42
X2, X4                  0.440    7.51
X3, X4                  0.435    7.72
X1, X2, X3              0.408    9.88
X2, X3                  0.408    9.14
X1, X3                  0.375   10.78
X4                      0.347   11.63
X1                      0.284   14.92
X1, X2                  0.275   15.89
X3                      0.231   17.77
X2                      0.179   20.53

With m = 4, the number of candidate regression equations is $2^m - 1 = 2^4 - 1 = 15$. The best combination is X2, X3, X4; that is, the optimum regression equation relates blood sugar to triglyceride, insulin and glycosylated hemoglobin.
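Both selection criteria depend only on the residual sum of squares of each candidate equation. The sketch below recomputes them for the five equations whose residual sums of squares are listed in Table 15-5; the function names `adjusted_r2` and `mallows_cp` are hypothetical labels for the two formulas above.

```python
N, M = 27, 4
SS_TOTAL = 222.5519
MS_RES_FULL = 88.8412 / (N - M - 1)   # residual mean square of the full model

# Residual SS for the subsets reported in Table 15-5.
SS_RES = {
    ("X1", "X2", "X3", "X4"): 88.8412,
    ("X2", "X3", "X4"): 89.4540,
    ("X1", "X3", "X4"): 100.8038,
    ("X1", "X2", "X4"): 108.9047,
    ("X1", "X2", "X3"): 116.6351,
}


def adjusted_r2(ss_res, p):
    """R_c^2 = 1 - MS_res / MS_total for an equation with p predictors."""
    return 1 - (ss_res / (N - p - 1)) / (SS_TOTAL / (N - 1))


def mallows_cp(ss_res, p):
    """C_p = (SS_res)_p / (MS_res)_m - [n - 2(p + 1)]."""
    return ss_res / MS_RES_FULL - (N - 2 * (p + 1))


results = {
    subset: (adjusted_r2(ss, len(subset)), mallows_cp(ss, len(subset)))
    for subset, ss in SS_RES.items()
}
best = max(results, key=lambda subset: results[subset][0])

for subset, (r2c, cp) in results.items():
    print(", ".join(subset), f"Rc2 = {r2c:.3f}", f"Cp = {cp:.2f}")
print("largest Rc2:", ", ".join(best))
```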
2. Stepwise selection
2.1 Forward selection: enter the independent variables into the regression equation one at a time. This method does not take the whole set of variables into consideration.
2.2 Backward elimination: place all the independent variables in the equation, then progressively eliminate those without statistical significance. At each step, select the variable with the smallest partial regression sum of squares and use an F-test to decide whether it should be eliminated. If it is not statistically significant, eliminate it and fit a new regression equation with the remaining variables. Repeat this process until no independent variable in the equation can be eliminated. Theoretically this is the best method, and it is strongly recommended.
2.3 Stepwise regression: this method builds on the two approaches above and filters variables in both directions. In essence, it is a form of forward selection.
Setting the test level: for small samples the test level is 0.10 or 0.15; for large samples it is 0.05.
A lower level means a stricter standard for selecting variables, so fewer variables are selected. A higher level means a looser standard, so more variables are chosen.
Attention: the level for entering a variable must be lower than or equal to the level for removing a variable.
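Backward elimination as described above can be sketched in plain Python. Several things here are assumptions made for illustration: the helper names (`solve`, `ss_res`, `backward_eliminate`) are hypothetical, and a fixed F-to-remove threshold `f_out` stands in for the P-value comparison against the removal level, which would need an F-distribution routine. The value 2.2 used below is an arbitrary illustrative threshold, not a figure from the source.

```python
# Data transcribed from Table 15-2: (X1, X2, X3, X4, Y).
DATA = [
    (5.68, 1.90, 4.53, 8.2, 11.2), (3.79, 1.64, 7.32, 6.9, 8.8), (6.02, 3.56, 6.95, 10.8, 12.3),
    (4.85, 1.07, 5.88, 8.3, 11.6), (4.60, 2.32, 4.05, 7.5, 13.4), (6.05, 0.64, 1.42, 13.6, 18.3),
    (4.90, 8.50, 12.60, 8.5, 11.1), (7.08, 3.00, 6.75, 11.5, 12.1), (3.85, 2.11, 16.28, 7.9, 9.6),
    (4.65, 0.63, 6.59, 7.1, 8.4), (4.59, 1.97, 3.61, 8.7, 9.3), (4.29, 1.97, 6.61, 7.8, 10.6),
    (7.97, 1.93, 7.57, 9.9, 8.4), (6.19, 1.18, 1.42, 6.9, 9.6), (6.13, 2.06, 10.35, 10.5, 10.9),
    (5.71, 1.78, 8.53, 8.0, 10.1), (6.40, 2.40, 4.53, 10.3, 14.8), (6.06, 3.67, 12.79, 7.1, 9.1),
    (5.09, 1.03, 2.53, 8.9, 10.8), (6.13, 1.71, 5.28, 9.9, 10.2), (5.78, 3.36, 2.96, 8.0, 13.6),
    (5.43, 1.13, 4.31, 11.3, 14.9), (6.50, 6.21, 3.47, 12.3, 16.0), (7.98, 7.92, 3.37, 9.8, 13.2),
    (11.54, 10.89, 1.20, 10.5, 20.0), (5.84, 0.92, 8.61, 6.4, 13.3), (3.84, 1.20, 6.45, 9.6, 10.4),
]


def solve(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ss_res(data, cols):
    """Residual SS of Y (last column) regressed on the predictor columns in cols."""
    n, y = len(data), len(data[0]) - 1
    means = [sum(r[k] for r in data) / n for k in range(y + 1)]

    def l(i, j):
        return sum((r[i] - means[i]) * (r[j] - means[j]) for r in data)

    if not cols:
        return l(y, y)
    b = solve([[l(i, j) for j in cols] for i in cols], [l(i, y) for i in cols])
    return l(y, y) - sum(bj * l(j, y) for bj, j in zip(b, cols))


def backward_eliminate(data, f_out):
    """Drop the weakest predictor while its partial F is below f_out."""
    n = len(data)
    cols = list(range(len(data[0]) - 1))
    while cols:
        current = ss_res(data, cols)
        ms_res = current / (n - len(cols) - 1)
        # Partial F: increase in residual SS when the variable is left out.
        partial = {j: (ss_res(data, [c for c in cols if c != j]) - current) / ms_res
                   for j in cols}
        weakest = min(partial, key=partial.get)
        if partial[weakest] >= f_out:
            break
        cols.remove(weakest)
    return cols


kept = backward_eliminate(DATA, f_out=2.2)
print(["X%d" % (j + 1) for j in kept])
```

A production version would compare P-values with the chosen removal level instead of using a fixed F threshold.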
Case 15-3 Use stepwise regression to analyze the data in Case 15-1 ($\alpha_{in} = 0.10$, $\alpha_{out} = 0.15$).

Step (l)   Variable entered   Variable removed   Number of variables p   R²      SSreg(l)(Xj)   SSres(l)   F-value   P-value
1          X4                                    1                       0.372   82.714         139.837    14.788    0.0007
2          X1                                    2                       0.484   25.076         114.762     5.244    0.0311
3          X3                                    3                       0.547   13.958         100.804     3.185    0.0875
4          X2                                    4                       0.601   11.963          88.841     2.962    0.0993
5                             X1                 3                       0.598    0.613          88.841     0.152    0.7006
Table 15-7 The process of stepwise regression
Source of variation   df   SS         MS       F       P
Total                 26   222.5519
Regression             3   133.098    44.366   11.41   0.0001
Residual              23    89.454     3.889
Table 15-8 Analysis of variance for Case 15-3
The "best" regression equation:
$$\hat{Y} = 6.4996 + 0.4023 X_2 - 0.2871 X_3 + 0.6632 X_4$$
Result: there is a linear relationship between the change in blood sugar and triglyceride, insulin and glycosylated hemoglobin. The relationship with insulin is negative. From the standardized regression coefficients, we can conclude that glycosylated hemoglobin has the largest effect on fasting blood glucose.
Table 15-9 Estimates and tests of the regression coefficients in Case 15-3
Variable   Regression      Standard    Standardized regression   t-value   P-value
           coefficient b   error Sb    coefficient b'
Constant    6.4996         2.3962       0                         2.713    0.0124
X2          0.4023         0.1540       0.3541                    2.612    0.0156
X3         -0.2870         0.1117      -0.3601                   -2.570    0.0171
X4          0.6632         0.2303       0.4133                    2.880    0.0084
§ 3 Application of multiple linear regression and points for attention
1. Application
1.1. Analysis of related factors
• For example, many factors can affect hypertension, such as age, diet, habits, smoking, tension, family history and so on. It is necessary to find out which of these factors are closely related and which are only remotely related.
• In clinical practice it is difficult to make all groups agree on every characteristic, because conditions are complicated.
• For example, regression can help compare two therapies when the groups differ in age, severity of illness and so on.
• A simple way to control confounding factors is to enter them into the regression equation and analyze them together with the main variables.
1.2. Estimation and prediction
• For example, estimating the surface area of children's hearts from the transverse cardiac diameter (TCD); predicting an infant's weight from gestational age, head diameter, chest diameter and abdominal girth (AG).
1.3. Statistical control (inverse estimation)
• For example, when a radiofrequency therapy apparatus is used to treat brain tumors, the diameter of the lesion produced in the cerebral cortex has a linear regression relationship with the radiofrequency temperature and the exposure time. Once the regression equation is established, it can be used to determine the temperature and exposure time needed to produce a lesion diameter specified in advance.
2. Problems in using multiple regression
2.1. Quantification of indices
• (1) Quantification; transform non-linear relationships into linear ones.
• (2) Convert qualitative indices to quantitative ones with (0,1) variables, also called dummy variables or indicator variables.
For a binary classification use one (0,1) variable, such as sex:
X = 0 male
X = 1 female
For a classification with k categories use k-1 (0,1) variables, such as blood type:

Blood type   X1   X2   X3
O            0    0    0
A            1    0    0
B            0    1    0
AB           0    0    1

Data form:

No.   X1   X2   X3   Y
1     1    0    0
2     0    0    0
3     0    1    0
⋮
n     0    0    1

Fitted regression equation:
$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$$
b1: the difference between type A and type O
b2: the difference between type B and type O
b3: the difference between type AB and type O
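The k-1 coding scheme can be written as a short helper. This is a minimal sketch: the function name `dummy_code` is hypothetical, and the first entry of `levels` is treated as the reference category (type O here, matching the coding table for blood type).

```python
def dummy_code(value, levels):
    """Return k-1 (0,1) indicators; levels[0] is the reference category."""
    if value not in levels:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


BLOOD_TYPES = ["O", "A", "B", "AB"]   # O is the reference, coded (0, 0, 0)

for blood_type in BLOOD_TYPES:
    print(blood_type, dummy_code(blood_type, BLOOD_TYPES))
```

With this coding each coefficient b1, b2, b3 is read as the difference between one category and the reference category.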
(3) Quantification of ranked data
Ranks are usually coded from one end of the scale to the other as X = 1, 2, 3, … (or X = 0, 1, 2, …). For example, education level can be classified into 4 grades: primary school, middle school (junior or senior), college, and postgraduate (graduate or PhD). Y stands for income.

X1 = 1 primary school
X1 = 2 middle school
X1 = 3 college
X1 = 4 postgraduate

$$\hat{Y} = b_0 + b_1 X_1$$
Explanation: b1 means that when X1 increases by one unit, $\hat{Y}$ increases by b1 units (for example 500). That is, people with middle school education earn on average 500 more than those with primary school education, people with college education earn 500 more than those with middle school education, and so on.
We could also convert the k grades into k-1 (0,1) variables. Then b1, b2, b3 represent the income differences of middle school, college and postgraduate education compared with primary school.

Dummy variable    X1   X2   X3
Primary school    0    0    0
Middle school     1    0    0
College           0    1    0
Graduate or PhD   0    0    1
2.2. Sample size: n = (5 to 10) × m, that is, 5 to 10 times the number of independent variables.
2.3. Stepwise regression: do not trust the result of stepwise regression blindly. The so-called "best" regression equation is not necessarily the best, and a variable excluded from the equation is not necessarily without statistical significance. For example, in Case 15-3, if we change the entry level to $\alpha_{in} = 0.05$ and the removal level to $\alpha_{out} = 0.10$, the variables finally chosen are $X_1, X_4$ rather than $X_1, X_2, X_3, X_4$.
Which regression equation to use should be decided on the basis of professional knowledge.
2.4. Multicollinearity: there may be strong linear relationships among the independent variables.
For example, in a study of hypertension, age, years of smoking, years of drinking and so on are highly correlated with one another. This makes the equation estimated by the method of least squares unreliable, and it can lead to several undesirable results:
• (1) the standard errors of the regression coefficients become large, so the t values become small;
• (2) the regression equation becomes unstable: the estimates can change markedly when observations are added or removed;
• (3) the t-test becomes inaccurate, so important variables that should be in the model may be discarded;
• (4) the signs of the estimated coefficients may be inconsistent with objective reality.
Elimination of multicollinearity: discard the independent variable that causes the collinearity and rebuild the regression equation, or use stepwise regression.
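One simple screen for strongly related independent variables is to look at their pairwise correlation coefficients before fitting. The sketch below is an illustration under assumptions: the function names, the 0.9 cut-off and the small data set are all hypothetical and are not taken from the source. Pairwise correlations can also miss collinearity that involves three or more variables at once.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Simple correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    lxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    lxx = sum((a - mx) ** 2 for a in x)
    lyy = sum((b - my) ** 2 for b in y)
    return lxy / sqrt(lxx * lyy)


def flag_collinear(columns, threshold=0.9):
    """List pairs of predictors whose |r| reaches the (arbitrary) threshold."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(columns.items(), 2):
        r = pearson_r(xa, xb)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical predictors: age and years of smoking move together, BMI does not.
predictors = {
    "age": [35, 42, 50, 58, 63, 70],
    "smoking_years": [10, 18, 25, 33, 40, 46],
    "bmi": [24, 30, 22, 27, 21, 26],
}
flagged = flag_collinear(predictors)
print(flagged)
```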
2.5. Interaction between variables
To test whether there is an interaction between two independent variables, we usually add their product to the equation.
In analyzing the data in Table 15-2, we chose three variables: triglyceride (x2), insulin (x3) and glycosylated hemoglobin (x4). Now we add the product x3·x4 to the equation. If the product term is statistically significant, there is an interaction between insulin and glycosylated hemoglobin. We therefore define a new variable z = x3·x4 and re-estimate the test statistics from the new equation y = b0 + b2x2 + b3x3 + b4x4 + bz·z. If the hypothesis test rejects H0: βz = 0, we conclude that there is an interactive effect in addition to the main effects of x3 and x4. In this case the term z is statistically significant (P < 0.01), and the fitted equation is
$$\hat{y} = 0.7898 + 0.3690 x_2 + 1.2267 x_3 + 1.5097 x_4 - 0.1785 z$$
This means that the effect of insulin in patients with diabetes depends on the concentration of glycosylated hemoglobin.
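To see what the interaction term implies, the fitted equation can be evaluated directly: with z = x3·x4 in the model, the change in predicted blood sugar per unit of insulin is b3 + bz·x4, so it varies with glycosylated hemoglobin. The sketch below only restates the reported coefficients; the function names `predict` and `insulin_slope` are hypothetical.

```python
# Coefficients of the reported equation with the product term z = x3 * x4.
B0, B2, B3, B4, BZ = 0.7898, 0.3690, 1.2267, 1.5097, -0.1785


def predict(x2, x3, x4):
    """Predicted blood sugar from triglyceride, insulin and glycosylated hemoglobin."""
    z = x3 * x4                      # interaction (product) term
    return B0 + B2 * x2 + B3 * x3 + B4 * x4 + BZ * z


def insulin_slope(x4):
    """Change in predicted blood sugar per unit of insulin at a given x4."""
    return B3 + BZ * x4


for x4 in (6.0, 8.0, 10.0, 12.0):
    print(f"x4 = {x4:4.1f}: slope for insulin = {insulin_slope(x4):+.4f}")
```

In this equation the slope for insulin changes sign as x4 grows, which is one way to read the statement that the effect of insulin depends on the level of glycosylated hemoglobin.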
2.6. Residual analysis
The residual is
$$e_i = Y_i - \hat{Y}_i$$
Under regular circumstances the residuals $e_i$ are normally distributed with mean zero and variance σ². The residual plot uses the standardized residuals
$$e_i' = \frac{e_i}{\sqrt{MS_{res}}}$$
as the vertical axis and $Y_i$ as the horizontal axis.
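Standardized residuals are straightforward to compute once observed and fitted values are available. In the sketch below the function name `standardized_residuals` is hypothetical and the five observed/fitted pairs are made-up numbers used only to show the calculation; m is the number of independent variables in the fitted equation.

```python
from math import sqrt


def standardized_residuals(y, y_hat, m):
    """Return e_i / sqrt(MS_res), where MS_res = sum(e_i^2) / (n - m - 1)."""
    residuals = [obs - fit for obs, fit in zip(y, y_hat)]
    ms_res = sum(e * e for e in residuals) / (len(y) - m - 1)
    return [e / sqrt(ms_res) for e in residuals]


# Hypothetical observed and fitted values from a one-predictor model (m = 1).
y_obs = [11.2, 8.8, 12.3, 11.6, 13.4]
y_fit = [10.9, 9.5, 12.0, 11.0, 13.9]

std_res = standardized_residuals(y_obs, y_fit, m=1)
for obs, r in zip(y_obs, std_res):
    print(f"Y = {obs:5.1f}  standardized residual = {r:+.3f}")
```

Plotting these values against the horizontal-axis variable and looking for points far from zero or for systematic patterns is the usual next step.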