
Regression analysis

Feb 23, 2016

allene

Transcript
Page 1: Regression  analysis

REGRESSION ANALYSIS

Page 2: Regression  analysis

Regression analysis
• Objective:
• Investigate the interplay of quantitative variables
• Identify relations between a dependent variable and one or several independent variables
• Make predictions based on observed data

• Dependent variable: the variable whose values are to be explained

• Independent variables: variables that have an impact on the dependent variable

Page 3: Regression  analysis

REGRESSION ANALYSIS
Linear regression

Page 4: Regression  analysis

Linear regression
• Here it is assumed that the influence of the independent variables on the dependent variable is linear

• We distinguish between:

• Simple linear regression: explaining a dependent variable by a single independent variable

• Multiple linear regression: explaining a dependent variable by several independent variables

Page 5: Regression  analysis

REGRESSION ANALYSIS
Simple linear regression

Page 6: Regression  analysis

Linear regression
• The variables education and income are considered, whereof one variable (education) can be assumed to have an impact on the other (income)

• Dependent variable Y = (y₁, …, yₙ): income
• Independent variable X = (x₁, …, xₙ): education

Page 7: Regression  analysis

Linear regression
• Basic idea of linear regression: find a straight line which optimally describes the relationship between the two variables

Page 8: Regression  analysis

Linear regression
• Linear regression model ( R: lm(y~x) ):

Y = β₀ + β₁X + ε

Here, β₀ is called the (axis) intercept, β₁ is the slope, X is the predictor variable, and ε is the residual. The residual is the difference between the regression line and the measured values of Y. Ŷ = β₀ + β₁X is called the estimate of Y, and we have ε = Y − Ŷ.

Page 9: Regression  analysis

Linear regression
• Objective:

• Estimate the coefficients such that the model fits the data optimally

• Predict Y values

• The straight line shall be chosen in such a way that the squared distances between the values predicted by the model and the empirically observed values are minimized

• We want to minimize: Σᵢ (yᵢ − β₀ − β₁xᵢ)²

Page 10: Regression  analysis

Linear regression
• We obtain estimates of the coefficients, which are also called regression coefficients, β̂₀ and β̂₁:

• β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²

• β̂₀ = ȳ − β̂₁·x̄

• β̂₀ and β̂₁ are the least squares (LSQ) estimates of β₀ and β₁
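The LSQ formulas above can be sketched in a few lines of NumPy; the education/income numbers below are invented purely for illustration:

```python
import numpy as np

# Hypothetical data, for illustration only: years of education and income.
education = np.array([8.0, 10.0, 12.0, 12.0, 14.0, 16.0, 16.0, 18.0])
income = np.array([25.0, 30.0, 34.0, 38.0, 41.0, 50.0, 54.0, 60.0])

# LSQ estimates, exactly as in the formulas on the slide:
#   beta1_hat = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   beta0_hat = y_bar - beta1_hat * x_bar
x_bar, y_bar = education.mean(), income.mean()
beta1_hat = np.sum((education - x_bar) * (income - y_bar)) / np.sum((education - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
```

For the same data, R's lm(income ~ education) returns the same two coefficients.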

Page 11: Regression  analysis

Linear regression
• In case the residuals ε are normally distributed, we obtain for β̂₀ and β̂₁ the confidence intervals:

• β̂₀ ± t_{n−2; 1−α/2} · SE(β̂₀)
• β̂₁ ± t_{n−2; 1−α/2} · SE(β̂₁)

• where SE(β̂₀) and SE(β̂₁) are the respective standard errors of the estimates

Page 12: Regression  analysis

Linear regression
• We obtain a t-test for the null hypothesis H₀: β₁ = 0 against the alternative H₁: β₁ ≠ 0, with test statistic T = β̂₁ / SE(β̂₁)
• Reject H₀ if |T| > t_{n−2; 1−α/2}

Page 13: Regression  analysis

REGRESSION ANALYSIS
Multiple linear regression

Page 14: Regression  analysis

Multiple linear regression
• Now there are multiple independent variables X₁, …, Xₘ. A sample of size n now consists of the values (yᵢ, xᵢ₁, …, xᵢₘ), i = 1, …, n
• Hence: yᵢ = β₀ + β₁xᵢ₁ + … + βₘxᵢₘ + εᵢ

• Here, the βⱼ, j = 1, …, m, are the unknown regression coefficients and the εᵢ are the residuals

• Matrix notation: Y = Xβ + ε

Page 15: Regression  analysis

Linear regression
• Estimation of the regression coefficients is again performed with the least squares method. After extensive calculus one obtains β̂ = (XᵀX)⁻¹XᵀY

• An estimate of the residual variance σ² is obtained according to σ̂² = RSS / (n − m − 1), where RSS = Σᵢ (yᵢ − ŷᵢ)² is the residual sum of squares

• The estimation is computationally demanding (a matrix inversion is needed!) and has to be done by computers
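The matrix formula β̂ = (XᵀX)⁻¹XᵀY can be sketched directly in NumPy. The design matrix below is invented for illustration; in practice one would use a numerically stabler solver such as np.linalg.lstsq instead of an explicit inverse:

```python
import numpy as np

# Hypothetical design matrix: a column of ones for the intercept,
# then m = 2 predictor columns; n = 6 observations.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 0.0],
              [1.0, 5.0, 2.0],
              [1.0, 7.0, 1.0],
              [1.0, 8.0, 3.0],
              [1.0, 9.0, 2.0]])
y = np.array([4.0, 5.0, 9.0, 11.0, 14.0, 15.0])
n, m = X.shape[0], X.shape[1] - 1

# LSQ estimate via the normal equations (matrix inversion, as on the slide).
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Residual variance estimate: sigma2_hat = RSS / (n - m - 1)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - m - 1)
```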

Page 16: Regression  analysis

Linear regression
• In case the residuals are normally distributed, we obtain an F-test for the null hypothesis H₀: β₁ = … = βₘ = 0 against the alternative H₁: βⱼ ≠ 0 for at least one j

• We want to test whether the overall model is significant

• Overall F-test statistic: F = ((TSS − RSS)/m) / (RSS/(n − m − 1)), where TSS = Σᵢ (yᵢ − ȳ)² is the total sum of squares

• Reject H₀ if F > F_{m, n−m−1; 1−α}
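The overall F statistic can be sketched with the same kind of invented data (hypothetical numbers, for illustration only):

```python
import numpy as np

# Hypothetical data: n = 6 observations, m = 2 predictors plus an intercept.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 0.0],
              [1.0, 5.0, 2.0],
              [1.0, 7.0, 1.0],
              [1.0, 8.0, 3.0],
              [1.0, 9.0, 2.0]])
y = np.array([4.0, 5.0, 9.0, 11.0, 14.0, 15.0])
n, m = X.shape[0], X.shape[1] - 1

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta_hat) ** 2)   # residual sum of squares (full model)
tss = np.sum((y - y.mean()) ** 2)       # total sum of squares (intercept only)

# Overall F statistic; reject H0 if F exceeds the F(m, n-m-1) critical value.
F = ((tss - rss) / m) / (rss / (n - m - 1))
```

Equivalently, F can be written in terms of R² = 1 − RSS/TSS, which is what R prints at the bottom of summary(lm(...)).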

Page 17: Regression  analysis

REGRESSION ANALYSIS
Logistic regression

Page 18: Regression  analysis

Logistic regression
• Until now, the dependent variable Y was continuous. Now we consider the situation where there are just two possible outcome values. An example is the dichotomous trait y = "affection status" with the values "1 = affected by the disease" and "0 = not affected" (healthy)

• We want to predict the probability that an individual has the value 1 (= is affected).

• The range of possible values is [0,1].

• => Linear regression cannot be used since the dependent variable is nominal.
• => Instead, logistic regression is used.

Page 19: Regression  analysis

Logistic regression
• Example of binary logistic regression:

• Sample with information on survival of the sinking of the Titanic

• Question: was the chance to survive dependent on sex?

          Dead   Survived   Total
  Female   126        344     470
  Male    1364        367    1731
  Total   1490        711    2201

Page 20: Regression  analysis

Logistic regression
• The odds ratio (OR) is used to model the chance to survive:

• We consider the ratio of the survival odds of women and the survival odds of men: OR = (344/126) / (367/1364) ≈ 10.14

• The OR of 10.14 indicates that the odds to survive were about 10 times as high for women as for men

• From here until slide 22: details for specialists
• => A regression on the logarithmic odds (the so-called logit) that the 0/1-coded dependent variable takes on the value 1
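The odds ratio can be checked with a few lines of arithmetic on the counts from the 2x2 table:

```python
# Titanic counts from the 2x2 table on the previous slide.
female_dead, female_survived = 126, 344
male_dead, male_survived = 1364, 367

# Odds of survival per sex, and their ratio (the odds ratio, OR).
odds_female = female_survived / female_dead   # about 2.73
odds_male = male_survived / male_dead         # about 0.27
odds_ratio = odds_female / odds_male          # about 10.1, as on the slide
```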

Page 21: Regression  analysis

Logistic regression
• Logarithmic odds are well suited for regression analysis since their values are in (−∞, +∞) and since they are symmetric

• Regression equation: ln( P(Y=1|x) / (1 − P(Y=1|x)) ) = β₀ + β₁x
  (left side: the logarithmic odds that the 0/1-variable takes on the value 1; right side: known from linear regression)

• Now, the probability that the dependent variable takes on the value 1 given the value x can be computed as P(Y=1|x) = exp(β₀ + β₁x) / (1 + exp(β₀ + β₁x))

• Estimation of the regression coefficients is now done with maximum-likelihood estimation.

Page 22: Regression  analysis

Logistic regression

• Interpretation of the parameters:

• β₀: defines the probability P(Y=1|X=0) for the value X=0 of the independent variable X: P(Y=1|X=0) = exp(β₀) / (1 + exp(β₀))

• The larger β₀, the larger the probability P(Y=1|X=0)

• Probabilities depend on β₀ (in the slide's figure, β₁ was set to 1)

Page 23: Regression  analysis

Logistic regression
• β₁: determines the slope of the probability function, and thereby how strongly differences in the independent variable X influence differences in the conditional probabilities

• β₁ > 0: the conditional probability of category 1 ("affected") is a monotonically increasing function of X

• β₁ < 0: the conditional probability of category 1 ("affected") is a monotonically decreasing function of X

• β₁ = 0: X and Y are independent (in the slide's figure, β₀ was set to 0)

Page 24: Regression  analysis

Logistic regression
• Titanic example:

• Coding: male=1, female=0, survival=1, non-survival=0

• Compute logarithmic odds according to ln( P(Y=1|x) / (1 − P(Y=1|x)) ) = β₀ + β₁x
  (In R: glm(y~x,family=binomial("logit")) ):

• Female: logit = 1.004
• Male: logit = 1.004 − 2.317 = −1.313

• The logit coefficients are difficult to interpret, therefore they are transformed back by P = exp(logit) / (1 + exp(logit))

• Female: P ≈ 0.73, Male: P ≈ 0.21

             Beta coefficient    S.E.
  Intercept            1.004   0.104
  Sex                 −2.317   0.119
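The back-transformation can be verified numerically from the coefficient table (intercept 1.004, sex −2.317); the resulting probabilities match the raw survival fractions 344/470 and 367/1731 from the earlier table:

```python
import math

# Fitted coefficients from the slide's table: logit = b0 + b1 * sex.
b0, b1 = 1.004, -2.317      # sex coding: male = 1, female = 0

def survival_prob(sex):
    """Back-transform the logit: P(Y=1|x) = exp(b0+b1*x) / (1 + exp(b0+b1*x))."""
    logit = b0 + b1 * sex
    return math.exp(logit) / (1.0 + math.exp(logit))

p_female = survival_prob(0)   # about 0.73, close to 344/470
p_male = survival_prob(1)     # about 0.21, close to 367/1731
```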

Page 25: Regression  analysis

Logistic regression
• "Wald test" for the null hypothesis β = 0, with test statistic z = β̂ / SE(β̂); under the null hypothesis, z² follows a chi-square distribution

• Reject the null hypothesis if z² > χ²_{p; 1−α}, where p is the number of degrees of freedom (= the number of independent variables)

• Titanic example:
• Wald test for β₀: Note: β₀ is typically not of interest. Here, it is significant because it detects that there were many more men than women on the Titanic.
• Wald test for β₁: z = −2.317 / 0.119 ≈ −19.5, so z² ≈ 379 > 3.84
• => The null hypothesis can be rejected
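A quick check of the Wald statistic for the sex coefficient, using the estimate and standard error from the coefficient table:

```python
# Wald test: z = beta_hat / SE(beta_hat); under H0, z^2 is chi-square with 1 df.
beta_sex, se_sex = -2.317, 0.119

z = beta_sex / se_sex        # about -19.5
wald = z ** 2                # about 379

# The 95% critical value of the chi-square distribution with 1 df is about 3.84.
reject_h0 = wald > 3.84      # True: sex is a significant predictor of survival
```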

Page 26: Regression  analysis

Note• So far, the logistic regression example could have been

computed with the chi-square test for 2x2 tables. The advantage of the logistic regression is that it can be extended to multiple independent variables and that the independent variables can be continuous.

Page 27: Regression  analysis

Logistic regression
• Logistic regression with multiple predictor variables
  (In R: glm(y~x1+x2,family=binomial("logit")) ):

• Multiple predictor variables can also be analyzed as a "cross-classification"

• Example: Does lung function depend on air pollution and smoking?

• Dependent variable: lufu = lung function test, "normal"=1, "not normal"=0

• Independent variables:
• LV = degree of air pollution, "no"=0, "yes"=1
• Smoking: "no"=0, "yes"=1

Page 28: Regression  analysis

Logistic regression
• Data:

  Air pollution   Smoking   # normal lufu   # abnormal lufu
  no = 0                0             209                10
  no = 0                1             159                 9
  yes = 1               0              45                 5
  yes = 1               1              37                 5
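Before fitting the full model, crude (unadjusted) odds ratios can already be read off this table; as a sketch, the crude OR for abnormal lung function under air pollution, collapsing over smoking:

```python
# Counts from the slide: (air_pollution, smoking) -> (n_normal, n_abnormal).
counts = {
    (0, 0): (209, 10),
    (0, 1): (159, 9),
    (1, 0): (45, 5),
    (1, 1): (37, 5),
}

# Collapse over smoking to get a 2x2 table: air pollution vs. lufu outcome.
normal_no = counts[(0, 0)][0] + counts[(0, 1)][0]      # 368
abnormal_no = counts[(0, 0)][1] + counts[(0, 1)][1]    # 19
normal_yes = counts[(1, 0)][0] + counts[(1, 1)][0]     # 82
abnormal_yes = counts[(1, 0)][1] + counts[(1, 1)][1]   # 10

# Crude odds ratio for abnormal lufu under air pollution (not adjusted for
# smoking; the logistic regression on the next slide adjusts for it).
or_pollution = (abnormal_yes / normal_yes) / (abnormal_no / normal_no)
```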

Page 29: Regression  analysis

Logistic regression
• Logistic regression with R yields: [model output shown on the slide, not included in this transcript]