Statistics Lab Rodolfo Metulini IMT Institute for Advanced Studies, Lucca, Italy Lesson 4 - The linear Regression Model: Theory and Application - 23.01.2015
Transcript
Page 1: Talk 4

Statistics Lab

Rodolfo Metulini

IMT Institute for Advanced Studies, Lucca, Italy

Lesson 4 - The linear Regression Model: Theory and Application - 23.01.2015

Page 2: Talk 4

Introduction

In the past lessons we analyzed one variable at a time.

For many reasons, it is also useful to analyze two or more variables together.

The questions we want to answer are:

- What are the relations and the causal effects between two or more variables?

- How can we analyze the determinants of changes in a variable?

- How can we forecast or predict a variable for an unknown n or t?

In symbols, the idea can be represented as follows:

Y = f(X1, X2, ...)

Y is the response, which is a function of (it depends on) one or more explanatory variables.

Page 3: Talk 4

Objectives

All in all, the regression model is the instrument used to:

- measure the size of the relation between two or more variables: ΔY/ΔX,

  and to assess the causal direction (ΔX −→ ΔY or vice versa?);

- forecast the value of the variable Y in response to changes in the others X1, X2, ... (called explanatories),

  or for some cases that are not included in the sample.

Page 4: Talk 4

Simple linear regression model

The regression model is a stochastic model, which differs from a deterministic one.

Given two sets of values (two variables) from a random sample of length n: x = {x1, x2, ..., xi, ..., xn}; y = {y1, y2, ..., yi, ..., yn}:

Deterministic formula:

yi = β1 + β2 xi, ∀i = 1, ..., n

Stochastic formula:

yi = β1 + β2 xi + εi, ∀i = 1, ..., n

where εi is the stochastic component.

β2 defines the slope of the relation between X and Y (see graph in chart 1).
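The contrast between the two formulas can be illustrated with a short simulation (a Python sketch; the parameter values β1 = 2.0, β2 = 0.5 and σ = 1.0 are illustrative, not from the slides):

```python
import random

random.seed(1)
beta1, beta2, sigma = 2.0, 0.5, 1.0  # illustrative "true" parameters
n = 100
x = [random.uniform(0, 10) for _ in range(n)]

# Deterministic formula: y_i = beta1 + beta2 * x_i
y_det = [beta1 + beta2 * xi for xi in x]

# Stochastic formula: y_i = beta1 + beta2 * x_i + eps_i,
# where eps_i ~ N(0, sigma^2) is the stochastic component
y_sto = [beta1 + beta2 * xi + random.gauss(0, sigma) for xi in x]
```

Plotting y_sto against x would show points scattered around the straight line that y_det lies on exactly.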

Page 5: Talk 4

Simple linear regression model - 2

We need to find β̂ = {β̂1, β̂2} as estimators of β1 and β2.

After β̂ is estimated, we can draw the estimated regression line, which corresponds to the estimated regression model, as follows:

ŷi = β̂1 + β̂2 xi

Here, ε̂i = yi − ŷi,

where ŷi is the i-th element of the estimated y vector, and yi is the i-th element of the real y vector. (See graph in chart 2.)

Page 6: Talk 4

Empirical Steps in the Regression Analysis

1. Study of the relations (scatter plots, correlations) between two or more variables.

2. Estimation of the parameters of the model, β̂ = {β̂1, β̂2}.

3. Hypothesis tests on the estimated β̂2 to verify the causal effect of X on Y.

4. Robustness checks on the estimated model.

5. Use of the model to analyse the causal effect and/or to make forecasts/predictions.
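Step 1 can be done numerically as well as graphically; a minimal Python sketch of the sample correlation coefficient (the helper name `correlation` is my own):

```python
def correlation(x, y):
    # Sample (Pearson) correlation coefficient between two variables:
    # covariance of the deviations, normalized by the product of the
    # standard deviations, so the result lies in [-1, 1]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

print(correlation([1, 2, 3], [2, 4, 6]))  # → 1.0 (perfect linear relation)
```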

Page 7: Talk 4

Why linear?

- It is simple to estimate, to analyse and to interpret.

- It likely fits most empirical cases, in which the relation between two phenomena is linear (NOT REALLY SURE OF IT!)1.

1 The real, complex world is not linear in its relations: logit, probit, mixed models and generalized additive models (GAM) are only some examples of the more advanced non-linear models you will study in econometrics classes.

Page 8: Talk 4

Model Hypotheses

For the OLS estimation of the model to be unbiased, certain hypotheses must hold:

- E(εi) = 0, ∀i −→ E(yi) = β1 + β2 xi

- Homoscedasticity: V(εi) = σi² = σ², ∀i

- Null covariance: Cov(εi, εj) = 0, ∀i ≠ j

- Null covariance between residuals and explanatories: Cov(xi, εi) = 0, ∀i

- Normality assumption: εi ∼ N(0, σ²)

Page 9: Talk 4

Model Hypotheses - 2

From the hypotheses above, it follows that:

- V(yi) = σ², ∀i. Y is stochastic only through the ε component.

- Cov(yi, yj) = 0, ∀i ≠ j, since the residuals are uncorrelated.

- yi ∼ N[(β1 + β2 xi), σ²], since the residuals are also normal in shape.

Page 10: Talk 4

Ordinary Least Squares (OLS) Estimation

The OLS is the estimation method used to estimate the vector β. The method comes from the idea of minimizing the values of the residuals.

Since ei (the estimate of εi) is ei = yi − ŷi, we are interested in minimizing the component ei = yi − β̂1 − β̂2 xi.

N.B. εi = yi − β1 − β2 xi, while ei = yi − β̂1 − β̂2 xi.

The method consists in minimizing the sum of the squared differences:

∑ᵢ (yi − ŷi)² = ∑ᵢ ei² = Min,

which is equivalent to solving the following two-equation system derived using derivatives.

Page 11: Talk 4

Ordinary Least Squares (OLS) Estimation - 2

∂/∂β̂1 ∑ᵢ ei² = 0 (1)

∂/∂β̂2 ∑ᵢ ei² = 0 (2)

After some maths, we end up with these estimators for the vector β̂:

β̂1 = ȳ − β̂2 x̄ (3)

β̂2 = ∑ᵢ (yi − ȳ)(xi − x̄) / ∑ᵢ (xi − x̄)² (4)
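Formulas (3) and (4) map directly onto code; a minimal Python sketch (the helper name `ols` is my own):

```python
def ols(x, y):
    # Formula (4): slope from the sum of deviation products over the
    # sum of squared x-deviations; formula (3): intercept from the means
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b2 = (sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b1 = ybar - b2 * xbar
    return b1, b2

print(ols([1, 2, 3], [3, 5, 7]))  # → (1.0, 2.0)
```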

Page 12: Talk 4

OLS estimators

- The OLS estimators β̂1 and β̂2 are stochastic (they come from a distribution: they belong to the sample space of all the possible estimates obtained from different samples).

- β̂2 measures the estimated variation in Y determined by a unitary variation in X (ΔY/ΔX).

- The OLS estimators are both unbiased (E(β̂1) = β1 and E(β̂2) = β2),

  and they are BLUE (correct and with the lowest variance; furthermore, they are constructed on the full sample).

Page 13: Talk 4

Linear dependency index (R2)

The R² index is the most used measure to evaluate the linear fit of the model.

R² is confined to the interval [0, 1], where values near 1 mean that the explanatories properly describe the changes in Y (the model is well defined).

How R² is constructed:

SQT = SQR + SQE, or

∑ᵢ (yi − ȳ)² = ∑ᵢ (ŷi − ȳ)² + ∑ᵢ (yi − ŷi)², or

total variation = model variation + residual variation.

The R² is defined as SQR/SQT, or 1 − SQE/SQT. Or, equivalently:

R² = ∑ᵢ (ŷi − ȳ)² / ∑ᵢ (yi − ȳ)²
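The decomposition above can be checked numerically; a Python sketch of R² computed as SQR/SQT (the function name `r_squared` is mine, and the coefficients are assumed to be already estimated):

```python
def r_squared(x, y, b1, b2):
    # R^2 = SQR / SQT = model variation / total variation
    ybar = sum(y) / len(y)
    yhat = [b1 + b2 * xi for xi in x]
    sqr = sum((yh - ybar) ** 2 for yh in yhat)  # model variation (SQR)
    sqt = sum((yi - ybar) ** 2 for yi in y)     # total variation (SQT)
    return sqr / sqt

# A perfectly linear sample gives R^2 = 1
print(r_squared([1, 2, 3], [2, 4, 6], 0.0, 2.0))  # → 1.0
```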

Page 14: Talk 4

Hypotesis testing on β2

The hypothesis test for the slope parameter is really similar to the tests for the mean parameter. The estimated slope parameter β̂2 is stochastic. It is distributed as a normal variable when the sample is large:

β̂2 ∼ N[β2, σ²/SSx]

We can make use of the hypothesis testing approach to investigate the causal relation between Y and X:

H0: β2 = 0; H1: β2 ≠ 0,

where the alternative hypothesis means a causal relation. The test is:

z = (β̂2 − β2) / √(σ²/SSx) ∼ N(0, 1).

Since SSx is, generally, unknown, we estimate it as ŜSx = ∑ᵢ (xi − x̄)², and we use a t-test with n − 1 degrees of freedom (in case n is small).
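The test statistic can be sketched in Python as follows; note that the residual variance σ̂² is computed here with an n − 2 denominator, a common convention that the slides do not state explicitly (the function name is mine):

```python
def slope_test_stat(x, y):
    # Fit OLS, then compute (b2_hat - 0) / sqrt(sigma2_hat / SSx)
    # under H0: beta2 = 0
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ssx = sum((xi - xbar) ** 2 for xi in x)      # SSx = sum (x_i - xbar)^2
    b2 = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) / ssx
    b1 = ybar - b2 * xbar
    resid = [yi - b1 - b2 * xi for xi, yi in zip(x, y)]
    # Residual variance estimate (n - 2 denominator: an assumption here)
    sigma2 = sum(e ** 2 for e in resid) / (n - 2)
    return b2 / (sigma2 / ssx) ** 0.5
```

A large absolute value of the statistic leads to rejecting H0 and concluding that X has an effect on Y.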

Page 15: Talk 4

Prediction within the regression model

The question we want to answer is the following: what is the expected value of y (say yn+1) for a certain observation that is not in the sample?

Suppose we have, for that observation, the value of the variable X (say xn+1).

We make use of the estimated β̂ to estimate yn+1 as:

ŷn+1 = β̂1 + β̂2 xn+1
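In code the prediction step is a single line once β̂ is available (the coefficient values below are illustrative, not estimated from any dataset in the slides):

```python
b1_hat, b2_hat = 0.5, 2.2  # illustrative estimated coefficients

def predict(x_new):
    # y_hat_{n+1} = b1_hat + b2_hat * x_{n+1}
    return b1_hat + b2_hat * x_new

print(predict(5.0))  # → 11.5
```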

Page 16: Talk 4

Model Checking

Several methods are used to test the robustness of the model, most of them based on the stochastic part of the model (the estimated residuals).

- Graphical (at-eye) checks: plot the residuals versus the fitted values (residual hypotheses)

- qq-plot and Shapiro-Wilk test for normality

- Durbin-Watson test for residual correlation

- Breusch-Pagan test for residual heteroscedasticity.

Moreover, the leverage is used to evaluate the contribution of each observation in determining the estimated coefficients β̂.

The stepwise procedure is used to choose between different model specifications, in other words, to remove the explanatories which are not significant.
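As one concrete example from the list, the Durbin-Watson statistic can be computed directly from the estimated residuals (a sketch; values near 2 suggest no serial correlation, while values near 0 or 4 suggest positive or negative serial correlation):

```python
def durbin_watson(resid):
    # DW = sum_{i=2}^n (e_i - e_{i-1})^2 / sum_{i=1}^n e_i^2
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# Alternating residuals (negative serial correlation) push DW toward 4
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # → 3.0
```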

Page 17: Talk 4

Model Checking using estimated residuals - Linearity

An example of departure from the linearity assumption. In this case we can draw a curve (not a horizontal line) to interpolate the points.

Figure: residuals (Y) versus estimated (X) values

Page 18: Talk 4

Model Checking using estimated residuals - Homoscedasticity

An example of departure from the homoscedasticity assumption. In this picture the estimated residuals increase as the predicted values increase.

Figure: residuals (Y) versus estimated (X) values

Page 19: Talk 4

Model Checking using estimated residuals - Normality

An example of departure from the normality assumption. Here the qq-points do not lie within the qq-line bounds.

Figure: residuals (Y) versus estimated (X) values

Page 20: Talk 4

Model Checking using estimated residuals - Serial correlation

An example of departure from the assumption of no serial correlation of residuals: the residual at i depends on the value at i − 1.

Figure: residuals (Y) versus estimated (X) values

Page 21: Talk 4

Homeworks

1. Using the cement data (n = 13), determine the β̂1 and β̂2 coefficients manually, using the OLS formulas on page 11, for the model y = β1 + β2 x1.

2. Using the cement data, estimate the R² index of the model y = β1 + β2 x1, using the formula on page 13.

Page 22: Talk 4

Charts - 1

Figure: Slope coefficient in the linear model

Page 23: Talk 4

Charts - 2

Figure: Fitted (line) versus real (points) values