Regression analysis

Post on 23-Feb-2016


Transcript

REGRESSION ANALYSIS

Regression analysis

• Objective:
• Investigate the interplay of quantitative variables
• Identify relations between a dependent variable and one or several independent variables
• Make predictions based on observed data

• Dependent variable: the variable whose values shall be explained
• Independent variables: variables that have an impact on the dependent variable

REGRESSION ANALYSIS
Linear regression

Linear regression

• Here it is assumed that the influence of the independent variables on the dependent variable is linear
• One distinguishes between:
• Simple linear regression: explanation of a dependent variable by one independent variable
• Multiple linear regression: explanation of a dependent variable by several independent variables

REGRESSION ANALYSIS
Simple linear regression

Linear regression

• The variables education and income are considered, whereof one variable (education) can be assumed to have an impact on the other (income)
• Dependent variable Y: income
• Independent variable X: education

Linear regression

• Basic idea of linear regression: find a straight line which optimally describes the correlation between the two variables

Linear regression

• Linear regression model (R: lm(y~x)):

  Y = β₀ + β₁X + ε

• Here, β₀ is called the (axis) intercept, β₁ is the slope, X is the predictor variable, and ε is the residual. The residual is the difference between the regression line and the measured values Y. Here, Ŷ = β₀ + β₁X is called the estimate of Y, and we have Y = Ŷ + ε.

Linear regression

• Objective:
• Estimate the coefficients β₀, β₁ such that the model fits the data optimally
• Prediction of Y values

• The straight line shall be chosen in such a way that the squared distances between the values predicted by the model and the empirically observed values are minimized

• We want: Σᵢ (yᵢ − ŷᵢ)² → min

Linear regression

• We will obtain estimates of the coefficients, which are also called regression coefficients β̂₀, β̂₁:

  β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²,  β̂₀ = ȳ − β̂₁x̄

• β̂₀, β̂₁ are the least squares (LSQ) estimates of β₀, β₁
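As a sanity check, the LSQ formulas can be evaluated in plain Python; the data below are hypothetical illustration values (constructed so that y = 2 + 3x exactly), not from the slides:

```python
# Least-squares estimates for simple linear regression:
#   beta1_hat = sum((x - xbar)*(y - ybar)) / sum((x - xbar)^2)
#   beta0_hat = ybar - beta1_hat * xbar
# Toy data (hypothetical): y = 2 + 3x exactly, so the fit is exact.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 8.0, 11.0, 14.0, 17.0]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)

beta1 = sxy / sxx            # slope estimate
beta0 = ybar - beta1 * xbar  # intercept estimate

print(beta0, beta1)  # -> 2.0 3.0
```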

Linear regression

• In case the residuals εᵢ are normally distributed, we obtain for β̂₀ and β̂₁ the confidence intervals:

  β̂₀ ± t(n−2; 1−α/2) · se(β̂₀),  β̂₁ ± t(n−2; 1−α/2) · se(β̂₁)

• where se(β̂₀), se(β̂₁) are the respective standard errors of the estimates

Linear regression

• We obtain a t-test for the null hypothesis H₀: β₁ = 0 against the alternative H₁: β₁ ≠ 0
• Reject H₀ if |T| = |β̂₁ / se(β̂₁)| > t(n−2; 1−α/2)
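The slope t-statistic can be sketched in plain Python; the toy data here are hypothetical (roughly y = 1 + 2x plus noise), and the sketch assumes the textbook formula se(β̂₁) = s / √Σ(xᵢ − x̄)² with s² = SSE/(n−2):

```python
import math

# t-statistic for H0: beta1 = 0 in simple linear regression.
# Toy data (hypothetical), roughly y = 1 + 2x with small noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
beta1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
beta0 = ybar - beta1 * xbar

sse = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))   # residual standard deviation
se_beta1 = s / math.sqrt(sxx)  # standard error of the slope
t_stat = beta1 / se_beta1      # compare with t(n-2; 1-alpha/2)

print(round(beta1, 3), round(t_stat, 1))
```

With this nearly linear data the t-statistic is far above any usual t quantile, so H₀: β₁ = 0 would be rejected.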

REGRESSION ANALYSIS
Multiple linear regression

Multiple linear regression

• Now there are multiple independent variables X₁, …, Xₘ. A sample of size n now consists of the values (yᵢ, xᵢ₁, …, xᵢₘ), i = 1, …, n
• Hence:

  yᵢ = β₀ + β₁xᵢ₁ + … + βₘxᵢₘ + εᵢ

• Here, the βⱼ, j = 1, …, m, are the unknown regression coefficients and the εᵢ are the residuals
• Matrix notation: Y = Xβ + ε

Linear regression

• Estimation of the regression coefficients is again performed with the least squares method. After extensive calculus one obtains

  β̂ = (XᵀX)⁻¹ XᵀY

• Estimation of the residual variance σ² is obtained according to σ̂² = RSS / (n − m − 1), where RSS = Σᵢ (yᵢ − ŷᵢ)² is the residual sum of squares

• The estimation process is computationally demanding (matrix inversion is needed!) and has to be done by computer
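The normal-equations estimate can be sketched in pure Python. Instead of inverting XᵀX explicitly, the sketch solves the linear system (XᵀX)β̂ = XᵀY by Gaussian elimination; the data are hypothetical, constructed so that y = 1 + 2x₁ + 3x₂ exactly:

```python
# Multiple linear regression via the normal equations:
# (X^T X) beta_hat = X^T y, solved by Gaussian elimination.

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting
    (a and b are modified in place)."""
    n = len(b)
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (b[r] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x

# Design matrix with an intercept column; y = 1 + 2*x1 + 3*x2 exactly.
X = [[1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0],
     [1.0, 1.0, 1.0],
     [1.0, 2.0, 1.0],
     [1.0, 2.0, 2.0]]
y = [4.0, 3.0, 6.0, 8.0, 11.0]

m = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(m)] for i in range(m)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(m)]
beta = solve(xtx, xty)
print([round(b, 6) for b in beta])  # -> [1.0, 2.0, 3.0]
```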

Linear regression

• In case the residuals are normally distributed, we obtain an F-test for the null hypothesis H₀: β₁ = … = βₘ = 0 against the alternative H₁: βⱼ ≠ 0 for at least one j

• We want to test whether the overall model is significant

• Overall F-test statistic: F = (SSR/m) / (SSE/(n − m − 1))

• Reject H₀ if F > F(m; n − m − 1; 1 − α)
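A minimal sketch of the overall F statistic, using the decomposition SSR = Σ(ŷᵢ − ȳ)² and SSE = Σ(yᵢ − ŷᵢ)²; the toy data are hypothetical, and with a single predictor (m = 1) the F statistic equals the square of the slope's t statistic:

```python
# Overall F-test statistic: F = (SSR/m) / (SSE/(n-m-1)).
# Toy data (hypothetical), simple regression, i.e. m = 1 predictor.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

n, m = len(x), 1
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
beta1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
beta0 = ybar - beta1 * xbar

yhat = [beta0 + beta1 * xi for xi in x]
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # explained sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual sum of squares

f_stat = (ssr / m) / (sse / (n - m - 1))
print(round(f_stat, 1))  # compare with the F(1, 4) quantile (7.71 at alpha = 0.05)
```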

REGRESSION ANALYSIS
Logistic regression

Logistic regression

• Until now, the dependent variable y was continuous. Now we consider the situation where there are just two possible outcome values. An example is the dichotomous trait y = "affection status" with the values "1 = affected by the disease" and "0 = not affected" (healthy)

• We want to predict the probability that an individual has the value 1 (= is affected).

• The range of possible values is [0,1].

• => Linear regression cannot be used since the dependent variable is nominal.
• => Instead, logistic regression is used.

Logistic regression

• Example of binary logistic regression:

• Sample with information on survival of the sinking of the Titanic

• Question: did the chance to survive depend on sex?

         Dead   Survived   Total
Female    126        344     470
Male     1364        367    1731
Total    1490        711    2201

Logistic regression

• The odds ratio is used to model the chance to survive:

• We consider the ratio of the survival odds of women to the survival odds of men: OR = (344/126) / (367/1364) ≈ 10.14

• The OR of 10.14 indicates that the odds of survival were about 10 times as high for women as for men

• From here until slide 22: details for specialists
• => A regression on the logarithmic odds (so-called logit) that the 0/1-coded dependent variable takes on the value 1
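The odds ratio can be recomputed directly from the counts in the Titanic table:

```python
# Odds ratio for survival, women vs. men, from the Titanic 2x2 table.
survived_f, dead_f = 344, 126
survived_m, dead_m = 367, 1364

odds_f = survived_f / dead_f  # survival odds for women
odds_m = survived_m / dead_m  # survival odds for men
odds_ratio = odds_f / odds_m
print(round(odds_ratio, 1))  # -> 10.1
```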

Logistic regression

• Logarithmic odds are well suited for regression analysis since their values are in (−∞, +∞) and since they are symmetric

• Regression equation: logit(p) = ln(p / (1 − p)) = β₀ + β₁x

• Now, the probability that the dependent variable takes on the value 1 given the value x can be computed as P(Y = 1 | x) = exp(β₀ + β₁x) / (1 + exp(β₀ + β₁x))

• Estimation of the regression coefficients is now done with maximum-likelihood estimation.
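The logit link and its inverse can be sketched as follows; the coefficient values b0, b1 are hypothetical illustration values, not fitted estimates:

```python
import math

# Logit link: logit(p) = ln(p / (1 - p)) = b0 + b1 * x,
# inverted by p = exp(L) / (1 + exp(L)).

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(L):
    return math.exp(L) / (1.0 + math.exp(L))

b0, b1 = -1.5, 0.8  # hypothetical coefficients for illustration
for x in (0.0, 1.0, 2.0):
    L = b0 + b1 * x
    print(x, round(inv_logit(L), 3))  # probability P(Y=1 | x)

# Round trip: inv_logit(logit(p)) recovers p.
print(round(inv_logit(logit(0.25)), 6))  # -> 0.25
```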

• The left-hand side of the regression equation is the logarithmic odds that the 0/1-variable takes on the value 1; the right-hand side is known from linear regression

Logistic regression

• Interpretation of the parameters:

• β₀: defines the probability P(Y=1|X=0) for the value X=0 of the independent variable X: P(Y = 1 | X = 0) = exp(β₀) / (1 + exp(β₀))

• The larger β₀, the larger the probability P(Y=1|X=0)

• The probabilities also depend on β₁ (in the slide's figure, β₁ was set to 1)

Logistic regression

• β₁: determines the slope of the probability function, and thereby how strongly differences in the independent variable X influence differences in the conditional probabilities

• β₁ > 0: the conditional probability of category 1 ("affected") is a monotonically increasing function of X

• β₁ < 0: the conditional probability of category 1 ("affected") is a monotonically decreasing function of X

• β₁ = 0: X and Y are independent (in the slide's figure, β₀ was set to zero)

Logistic regression

• Titanic:

• Coding: male=1, female=0, survival=1, non-survival=0

• Compute logarithmic odds according to logit(p) = β₀ + β₁x (in R: glm(y~x, family=binomial("logit"))):

• Female: logit = β₀ = 1.004
• Male: logit = β₀ + β₁ = 1.004 − 2.317 = −1.313

• The logit coefficients are difficult to interpret, therefore they are transformed back by p = exp(logit) / (1 + exp(logit))

• Female: p ≈ 0.73, Male: p ≈ 0.21

           Beta coefficient   S.E.
Intercept             1.004  0.104
Sex                  -2.317  0.119
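Because a single binary predictor yields a saturated model, the coefficients in the table can be recovered directly from the observed survival proportions; this sketch checks that and applies the back-transformation to probabilities:

```python
import math

# For one binary predictor the logistic model is saturated, so the ML
# coefficients equal the observed logits:
#   intercept = logit of female survival (sex = 0)
#   slope     = male logit - female logit
p_f = 344 / 470    # observed survival proportion, women
p_m = 367 / 1731   # observed survival proportion, men

logit_f = math.log(p_f / (1 - p_f))
logit_m = math.log(p_m / (1 - p_m))

intercept = logit_f        # matches the table's 1.004
slope = logit_m - logit_f  # matches the table's -2.317

# Back-transformation: p = exp(L) / (1 + exp(L)).
back_f = math.exp(intercept) / (1 + math.exp(intercept))
back_m = math.exp(intercept + slope) / (1 + math.exp(intercept + slope))
print(round(intercept, 3), round(slope, 3))  # -> 1.004 -2.317
print(round(back_f, 3), round(back_m, 3))    # -> 0.732 0.212
```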

Logistic regression

• "Wald test" for the null hypothesis β = 0: W = (β̂ / se(β̂))²

• Reject the null hypothesis if W > χ²(p; 1 − α), where p is the number of degrees of freedom (= the number of independent variables)

• Titanic example:
• Wald test for β₀: Note: β₀ is typically not of interest. Here it is significant because it detects that there were many more men than women on the Titanic.
• Wald test for β₁: W = (−2.317 / 0.119)² ≈ 379
• => The null hypothesis can be rejected
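A one-line check of the Wald statistic for the sex coefficient, using the estimate and standard error from the table above (the quantile 3.84 assumes α = 0.05 with 1 degree of freedom):

```python
# Wald test statistic W = (beta_hat / se)^2 for H0: beta = 0.
beta_sex, se_sex = -2.317, 0.119
w = (beta_sex / se_sex) ** 2
print(round(w))   # -> 379
print(w > 3.84)   # -> True: reject H0 at alpha = 0.05
```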

Note

• So far, the logistic regression example could also have been computed with the chi-square test for 2x2 tables. The advantage of logistic regression is that it can be extended to multiple independent variables and that the independent variables can be continuous.

Logistic regression

• Logistic regression with multiple predictor variables (in R: glm(y~x1+x2, family=binomial("logit"))):

• Multiple predictor variables can also be analyzed as "cross-classification"

• Example: does lung function depend on air pollution and smoking?

• Dependent variable: lufu = lung function test, "normal"=1, "not normal"=0

• Independent variables:
• LV = degree of air pollution, "no"=0, "yes"=1
• Smoking: "no"=0, "yes"=1

Logistic regression

• Data:

Air pollution   Smoking   # normal Lufu   # abnormal Lufu
no=0                  0             209                10
no=0                  1             159                 9
yes=1                 0              45                 5
yes=1                 1              37                 5

Logistic regression

• Logistic regression with R yields:
