Page 1: Regression

LINEAR REGRESSION

BY- ROHIT ARORA 10113057 IPE-FINAL YEAR

Page 2: Regression

REGRESSION

• Technique used for the modeling and analysis of numerical data

• Exploits the relationship between two or more variables so that information about one variable can be gained from known values of the others

• Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships

Page 3: Regression

LINEAR REGRESSION

• Simple linear regression is used for three main purposes:

• To describe the linear dependence of one variable on another

• To predict values of one variable from values of another, for which more data are available

• To correct for the linear dependence of one variable on another, in order to clarify other features of its variability.

• Linear regression determines the best-fit line through a scatterplot of data by minimizing the sum of squared residuals (equivalently, the error variance). The fit is "best" precisely in that least-squares sense, which is why the method is also termed "ordinary least squares" (OLS) regression.

Page 4: Regression

ASSUMPTIONS IN REGRESSION

• The relationship between the response y and regressors is linear.

• The error term ε has zero mean.
• The error term ε has constant variance σ².
• The errors are uncorrelated.
• The errors are normally distributed.

Page 5: Regression

CALCULATION OF PARAMETERS

• Y = β₀ + β₁x
• β₀ = intercept
• β₁ = slope
• β₁ = Sxy / Sxx
• β₀ = ȳ - β₁x̄
• n = number of observations
• Sxy = Ʃxy - (1/n)(Ʃx)(Ʃy)
• Sxx = Ʃx² - (1/n)(Ʃx)²
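A minimal sketch of these formulas in Python (NumPy assumed; the x and y values are made up for illustration):

    import numpy as np

    # Hypothetical paired observations
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    n = len(x)
    Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # Sxy = Ʃxy - (1/n)(Ʃx)(Ʃy)
    Sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n         # Sxx = Ʃx² - (1/n)(Ʃx)²

    beta1 = Sxy / Sxx                    # slope
    beta0 = y.mean() - beta1 * x.mean()  # intercept
    print(beta0, beta1)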

Page 6: Regression

TESTING OF HYPOTHESIS ON SIGNIFICANCE OF REGRESSION

[Figure: partition of the variation of the yᵢ about ȳ for the model Yᵢ = β₀ + β₁Xᵢ + Єᵢ, with B the explained variation and C the unexplained variation]

Page 7: Regression

TESTING OF HYPOTHESIS ON SIGNIFICANCE OF REGRESSION

• One should check whether the estimates from the regression model represent the real-world data

• Model of regression is Yᵢ = β₀ + β₁Xᵢ + Єᵢ
• Total variation is made up of two parts:

SST = SSR + SSE
(total sum of squares = regression sum of squares + error sum of squares)

SST = Ʃ(Yᵢ - Ȳ)²    SSR = Ʃ(Ŷᵢ - Ȳ)²    SSE = Ʃ(Yᵢ - Ŷᵢ)²

Where
Ȳ = average value of the dependent variable
Ŷᵢ = predicted value of Y for a given Xᵢ
Yᵢ = observed value of the dependent variable

Page 8: Regression

TESTING OF HYPOTHESIS ON SIGNIFICANCE OF REGRESSION

• H₀: The slope β₁ = 0 (there is no linear relationship between Y and X in the model)

• H₁: The slope β₁ ≠ 0 (there is a linear relationship between Y and X in the model)

• The ANOVA table to test the null hypothesis is:

Source of variation   Sum of squares   Degrees of freedom   Mean sum of squares    F-ratio
Due to regression     SSR              1                    MSSR = SSR / 1         F = MSSR / MSSE
Due to error          SSE              n - 2                MSSE = SSE / (n - 2)
Total                 SST              n - 1
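A short sketch of this table's F-ratio in Python (continuing from the earlier parameter-estimation sketch, so x, y, n, beta0 and beta1 are assumed to be in scope):

    import numpy as np
    from scipy import stats

    y_hat = beta0 + beta1 * x            # fitted values
    SST = np.sum((y - y.mean()) ** 2)    # total sum of squares
    SSE = np.sum((y - y_hat) ** 2)       # error sum of squares
    SSR = SST - SSE                      # regression sum of squares

    MSSR = SSR / 1
    MSSE = SSE / (n - 2)
    F = MSSR / MSSE
    p_value = stats.f.sf(F, 1, n - 2)    # P(F(1, n-2) > observed F)
    print(F, p_value)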

Page 9: Regression

COEFFICIENT OF DETERMINATION

• Coefficient of determination (R²) = SSR / SST

• The coefficient of determination is the proportion of the variation in Y explained by the regression model relative to the total variation. The higher it is, the stronger the relationship between Y and X. R² ranges from 0 to 1.

• The statistic R² should be used with caution, since it is always possible to make R² large by adding enough terms to the model.

• The magnitude of R² also depends on the range of variability in the regressor variable. Generally, R² will increase as the spread of the x's increases and decrease as the spread of the x's decreases, provided the assumed model form is correct.

• E(R²) ≈ β₁²Sxx / (β₁²Sxx + σ²)

• Clearly the expected value of R² increases as Sxx increases. Thus a large value of R² may result simply because x has been varied over an unrealistically large range.

• R² does not measure the appropriateness of the linear model; R² will often be large even though y and x are nonlinearly related.

• Coefficient of correlation r = (SSR / SST)^(1/2)

Page 10: Regression

COEFFICIENT OF CORRELATION

• Coefficient of correlation r = (SSR / SST)^(1/2)
• The sign of r is the same as that of the slope β₁ in the regression model.
• A zero correlation indicates that there is no linear relationship between the variables.
• A correlation of -1 indicates a perfect negative relation.
• A correlation of +1 indicates a perfect positive relation.
• It is very dangerous to conclude that there is no association between x and y just because r is close to zero, as shown in the figure below. The correlation coefficient is of value only when the relation between x and y is linear.

[Figure: scatter plot of temperature against ice-cream sales; the calculated coefficient of correlation is zero, yet the data follow a clear curve]
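A tiny Python illustration of this caveat, using made-up data that follow a symmetric curve:

    import numpy as np

    # Hypothetical temperature/sales data lying on a perfect symmetric curve
    temp = np.linspace(-3.0, 3.0, 31)
    sales = temp ** 2

    r = np.corrcoef(temp, sales)[0, 1]
    print(round(r, 6))   # essentially 0, despite the exact nonlinear dependence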

Page 11: Regression

SITUATION WHERE HYPOTHESIS H₀:β₁=0 IS NOT REJECTED

The failure to reject H₀: β₁ = 0 suggests that there is no linear relationship between y and x. Figure 11.8 illustrates the implication of this result: it may mean either that x is of little value in explaining the variation in y (a) or that the true relationship between x and y is not linear (b).

Page 12: Regression

SITUATION WHERE HYPOTHESIS H₀:β₁=0 IS REJECTED

If H₀: β₁ = 0 is rejected, this implies that x is of value in explaining the variability in y. However, rejecting H₀: β₁ = 0 could mean either that the straight-line model is adequate (fig. a) or that, even though there is a linear effect of x, better results could be obtained by adding higher-order polynomial terms.

Page 13: Regression

HYPOTHESIS TESTING OF SLOPE

• We wish to test the hypothesis that the slope equals a constant β₁₀. The hypotheses are:

• H₀: β₁ = β₁₀    H₁: β₁ ≠ β₁₀
• For this test we assume that the errors are normally and independently distributed.
• t₀ = (β̂₁ - β₁₀) / (MSSE / Sxx)^(1/2)
• β̂₁ is a linear combination of the observations. It is normally distributed with mean β₁ and variance σ² / Sxx.
• The degrees of freedom associated with t₀ are the degrees of freedom associated with MSSE.
• The procedure rejects the null hypothesis if |t₀| > t(α/2, n-2).
• The standard error of the slope is se(β̂₁) = (MSSE / Sxx)^(1/2).
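A hedged sketch of this t-test in Python (continuing from the earlier sketches, so beta1, MSSE, Sxx and n are assumed to be in scope; the hypothesised value β₁₀ = 0 is just an example):

    import numpy as np
    from scipy import stats

    beta10 = 0.0                               # hypothesised slope value (example)
    se_beta1 = np.sqrt(MSSE / Sxx)             # standard error of the slope
    t0 = (beta1 - beta10) / se_beta1
    t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)  # t(alpha/2, n-2) for alpha = 0.05
    print(t0, t_crit, abs(t0) > t_crit)        # True -> reject H0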

Page 14: Regression

CONFIDENCE INTERVALS OF SLOPE

• The sampling distribution of the slope estimate is:
• t = (β̂₁ - β₁) / (MSSE / Sxx)^(1/2)
• t has n-2 degrees of freedom.
• A 100(1-α) percent confidence interval for the slope β₁ is given by:
• β̂₁ - t(α/2, n-2) se(β̂₁) ≤ β₁ ≤ β̂₁ + t(α/2, n-2) se(β̂₁)
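Continuing from the slope t-test sketch above, a 95% confidence interval for the slope could be computed as:

    from scipy import stats

    alpha = 0.05
    half_width = stats.t.ppf(1 - alpha / 2, n - 2) * se_beta1
    print(beta1 - half_width, beta1 + half_width)   # lower and upper limits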

Page 15: Regression

HYPOTHESIS TESTING OF INTERCEPT

• We wish to test the hypothesis that the intercept equals a constant β₀₀. The hypotheses are:

• H₀: β₀ = β₀₀    H₁: β₀ ≠ β₀₀
• For this test we assume that the errors are normally and independently distributed.
• t₀ = (β̂₀ - β₀₀) / [MSSE (1/n + x̄² / Sxx)]^(1/2)
• β̂₀ is a linear combination of the observations. It is normally distributed with mean β₀.
• The degrees of freedom associated with t₀ are the degrees of freedom associated with MSSE.
• The procedure rejects the null hypothesis if |t₀| > t(α/2, n-2).
• The standard error of the intercept is se(β̂₀) = [MSSE (1/n + x̄² / Sxx)]^(1/2).

Page 16: Regression

CONFIDENCE INTERVALS OF INTERCEPT

• The sampling distribution of the intercept estimate is:
• t = (β̂₀ - β₀) / [MSSE (1/n + x̄² / Sxx)]^(1/2)
• t has n-2 degrees of freedom.
• A 100(1-α) percent confidence interval for the intercept β₀ is given by:
• β̂₀ - t(α/2, n-2) se(β̂₀) ≤ β₀ ≤ β̂₀ + t(α/2, n-2) se(β̂₀)
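A short sketch of the intercept test and interval (continuing from the earlier sketches, so x, n, Sxx, MSSE and beta0 are assumed to be in scope; β₀₀ = 0 is just an example):

    import numpy as np
    from scipy import stats

    x_bar = x.mean()
    se_beta0 = np.sqrt(MSSE * (1.0 / n + x_bar ** 2 / Sxx))  # se of the intercept
    t0 = (beta0 - 0.0) / se_beta0                            # H0: beta0 = 0 (example)
    half_width = stats.t.ppf(1 - 0.05 / 2, n - 2) * se_beta0
    print(t0, (beta0 - half_width, beta0 + half_width))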

Page 17: Regression

Confidence interval for mean and individual response of y at a specified x

• ŷ₀ = β̂₀ + β̂₁x₀, n = number of observations
• For the mean response: ŷ₀ - t(α/2, n-2)[MSSE (1/n + (x₀ - x̄)²/Sxx)]^(1/2) ≤ E(y₀) ≤ ŷ₀ + t(α/2, n-2)[MSSE (1/n + (x₀ - x̄)²/Sxx)]^(1/2)

• For an individual response: ŷ₀ - t(α/2, n-2)[MSSE (1 + 1/n + (x₀ - x̄)²/Sxx)]^(1/2) ≤ y₀ ≤ ŷ₀ + t(α/2, n-2)[MSSE (1 + 1/n + (x₀ - x̄)²/Sxx)]^(1/2)
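A sketch of both intervals at a new point x₀ (continuing from the earlier sketches; x₀ = 2.5 is a hypothetical value):

    import numpy as np
    from scipy import stats

    x0 = 2.5                                   # hypothetical new regressor value
    y0_hat = beta0 + beta1 * x0
    t_val = stats.t.ppf(1 - 0.05 / 2, n - 2)

    se_mean = np.sqrt(MSSE * (1.0 / n + (x0 - x.mean()) ** 2 / Sxx))
    se_ind = np.sqrt(MSSE * (1.0 + 1.0 / n + (x0 - x.mean()) ** 2 / Sxx))

    print((y0_hat - t_val * se_mean, y0_hat + t_val * se_mean))  # mean response
    print((y0_hat - t_val * se_ind, y0_hat + t_val * se_ind))    # individual response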

Page 18: Regression


Residual Analysis

Definition of Residuals

• Residual:
  – The deviation between the data and the fit.
  – A measure of the variability in the response variable not explained by the regression model.
  – The realized or observed values of the model errors.

eᵢ = yᵢ - ŷᵢ,  i = 1, …, n

Page 19: Regression

Residual Analysis

Residual Plot

• Graphical analysis is a very effective way to investigate the adequacy of the fit of a regression model and to check the underlying assumptions.
• Normal Probability Plot:

– If the errors come from a distribution with thicker or heavier tails than the normal, the LS fit may be sensitive to a small subset of the data.

– Heavy-tailed error distributions often generate outliers that "pull" the LS fit too much in their direction. A normal probability plot is a simple way to check the normality assumption.

– Ranked residuals: e[1] ≤ … ≤ e[n]

– Plot e[i] against Pᵢ = (i - 1/2)/n (see the sketch after this list)

– Sometimes plot e[i] against Φ⁻¹[(i - 1/2)/n]

– The points fall nearly on a straight line for large samples (n > 32) if the e[i] are normal.

– Small samples (n ≤ 16) may deviate from a straight line even when the e[i] are normal.
– Usually at least 20 points are required for a useful normal probability plot.
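A plain-Python sketch of that recipe (continuing from the earlier sketches, so x, y, n, beta0 and beta1 are assumed to be in scope):

    import numpy as np
    from scipy import stats

    residuals = y - (beta0 + beta1 * x)
    e_sorted = np.sort(residuals)                # ranked residuals e[1] <= ... <= e[n]
    P = (np.arange(1, n + 1) - 0.5) / n          # cumulative probabilities (i - 1/2)/n
    theoretical = stats.norm.ppf(P)              # Phi^(-1)[(i - 1/2)/n]

    # A roughly straight-line pattern here supports the normality assumption.
    for q, e in zip(theoretical, e_sorted):
        print(f"{q:8.3f}  {e:8.3f}")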

Page 20: Regression

Residual Analysis

– Fitting the parameters tends to destroy the evidence of nonnormality in the residuals, so we cannot always rely on the normal probability plot to detect departures from normality.

– Defect: the occurrence of one or two large residuals. Sometimes this is an indication that the corresponding observations are outliers.

Page 21: Regression

Residual Analysis

• Plot of Residuals against the Fitted Values

a: Satisfactory.
b: Variance is an increasing function of ŷ.
c: Often occurs when y is a proportion between 0 and 1.
d: Indicates nonlinearity.
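A minimal matplotlib sketch of this plot (continuing from the earlier sketches, so x, y, beta0 and beta1 are assumed to be in scope):

    import matplotlib.pyplot as plt

    fitted = beta0 + beta1 * x
    residuals = y - fitted

    plt.scatter(fitted, residuals)
    plt.axhline(0.0, linestyle="--")   # reference line at zero residual
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.title("Residuals vs fitted values")
    plt.show()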

Page 22: Regression

Residual Analysis

• Plot of Residuals in Time Sequence:
– The time sequence plot of residuals may indicate that the errors at one time period are correlated with those at other time periods.
– Autocorrelation: the correlation between model errors at different time periods.
– (a) positive autocorrelation
– (b) negative autocorrelation
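As a rough numerical companion to such a plot, a lag-1 sample autocorrelation of the residuals can be computed; this is only a quick check, not a procedure from the original slides (continuing from the earlier sketches):

    import numpy as np

    residuals = y - (beta0 + beta1 * x)
    e = residuals - residuals.mean()
    lag1_autocorr = np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)
    print(lag1_autocorr)   # values well away from zero hint at correlated errors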

Page 23: Regression

LACK OF FIT TEST

• In this test we test the following hypotheses:
  H₀: The model adequately fits the data
  H₁: The model does not fit the data
• The test involves partitioning the error sum of squares into two components, SSE = SSPE + SSLOF, where SSPE is the sum of squares attributable to pure error and SSLOF is the sum of squares attributable to lack of fit of the model.
• To compute SSPE we must have repeated observations on y for at least one level of x:
  y11, y12, …, y1n₁   repeated observations at x₁
  ⋮
  ya1, ya2, …, yanₐ   repeated observations at xₐ
  Note that there are a distinct levels of x. The pure-error sum of squares is obtained as SSPE = ƩᵢƩᵤ(yᵢᵤ - ȳᵢ)²

Page 24: Regression

LACK OF FIT TEST

• The sum of squares for lack of fit is simply SSLOF = SSE - SSPE

• The F-test statistic has a-2 and n-a degrees of freedom: F₀ = [SSLOF / (a-2)] / [SSPE / (n-a)]

• We reject the hypothesis of adequate fit if F₀ > F(α, a-2, n-a)
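A self-contained sketch of the test in Python (the data values are made up; x is repeated at a = 4 levels):

    import numpy as np
    from scipy import stats

    x = np.array([1, 1, 2, 2, 3, 3, 4, 4], dtype=float)   # repeats at 4 levels
    y = np.array([2.3, 2.5, 3.9, 4.4, 6.1, 5.8, 7.9, 8.4])
    n, levels = len(x), np.unique(x)
    a = len(levels)

    # Fit the straight line and compute SSE
    Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    Sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n
    b1 = Sxy / Sxx
    b0 = y.mean() - b1 * x.mean()
    SSE = np.sum((y - (b0 + b1 * x)) ** 2)

    # Pure error: variation of the repeats about their level means
    SSPE = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)
    SSLOF = SSE - SSPE

    F0 = (SSLOF / (a - 2)) / (SSPE / (n - a))
    print(F0, stats.f.sf(F0, a - 2, n - a))   # reject adequacy if F0 > F(alpha, a-2, n-a)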

[Figure: a plot illustrating the lack-of-fit test]

Page 25: Regression

CONSIDERATIONS IN USE OF REGRESSIONS

• Regression models are intended as interpolation equations over the range of the regressor variable used to fit the model. We must be careful if we extrapolate outside of this range.

• The disposition of the x values plays an important role in the least-squares fit. The slope is more strongly influenced by the remote values of x.

The slope depends heavily on points A and B. The remaining data would give a very different estimate of the slope if A and B were deleted. Situations like this often require corrective action, such as further analysis or estimation of the model parameters with some other technique that is less influenced by these points.

Page 26: Regression

CONSIDERATIONS IN USE OF REGRESSIONS

• In the following situation the slope is largely determined by the extreme point. If this point is deleted, the slope estimate is probably zero. Because of the gap between the two clusters of points, we really have only two distinct units of information with which to fit the model. So we should be aware that a small cluster of points may control key model parameters.

• Just because a regression analysis has indicated a strong relationship between two variables, this does not imply that the variables are related in a causal sense. Causality implies necessary correlation, but regression analysis cannot address the issue of necessity.

Page 27: Regression

CONSIDERATIONS IN USE OF REGRESSIONS

• Outliers can seriously disturb the least-squares fit, as shown in the figure; however, such a data point (A) may not be a bad value and may be a highly useful piece of evidence concerning the process under investigation.

• In some applications the value of the regressor variable x required to predict y is itself unknown; for example, to forecast the load on an electric power generation system we must first forecast the temperature.

Page 28: Regression

POLYNOMIAL REGRESSION MODELS

• Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-order polynomial.

• In many settings a linear relationship may not hold. For example, if we are modeling the yield of a chemical synthesis in terms of the temperature at which the synthesis takes place, we may find that the yield improves by increasing amounts for each unit increase in temperature. In this case, we might propose a quadratic model of the form y = β₀ + β₁x + β₂x² + ε.

• In general, we can model the expected value of y as an nth-order polynomial, yielding the general polynomial regression model y = β₀ + β₁x + β₂x² + … + βₙxⁿ + ε.
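A minimal sketch of a quadratic fit with NumPy (the yield and temperature values are hypothetical):

    import numpy as np

    temperature = np.array([50.0, 60.0, 70.0, 80.0, 90.0, 100.0])
    yield_pct = np.array([32.0, 38.0, 46.0, 57.0, 71.0, 88.0])

    # np.polyfit returns coefficients from the highest-order term downwards
    coeffs = np.polyfit(temperature, yield_pct, deg=2)
    fitted = np.polyval(coeffs, temperature)
    print(coeffs)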

Page 29: Regression

CONSIDERATIONS FOR FITTING A POLYNOMIAL IN ONE VARIABLE

• Order of the model : It is important to keep the order of the model as low as possible. Transformations should be tried to keep the model first order. A low order model in a transformed variable is almost always preferable to a high order model in the original metric.

• Extrapolation: Extrapolation with polynomial models can be very hazardous. If we extrapolate beyond the range of the original data, the predicted response can turn in a direction (for example, downward) that the data do not support.

Page 30: Regression

CONSIDERATIONS FOR FITTING A POLYNOMIAL IN ONE VARIABLE

• Hierarchy: The regression model y = β₀ + β₁x + β₂x² + β₃x³ + ε is said to be hierarchical because it contains all terms of order three and lower.

• Model Building Strategy: Various strategies for choosing the order of an approximating polynomial have been suggested. The two main strategies are forward selection and backward elimination.

Page 31: Regression

SELECTING REGRESSION MODELS

INTRODUCTION

• In complex regression situations, when there is a large number of explanatory variables which may or may not be relevant for making predictions about the response variable, it is useful to be able to reduce the model to contain only the variables which provide important information about the response variable.

• First of all, we need to define the maximum model, that is, the model containing all explanatory variables which could possibly be present in the final model. Let k denote the maximum number of feasible explanatory variables; then the maximum model is given by

Yᵢ = β₀ + β₁xᵢ,₁ + β₂xᵢ,₂ + … + βₖxᵢ,ₖ + Єᵢ

where x₁, x₂, …, xₖ are the explanatory variables, and the Єᵢ are independent, normally distributed random error terms with zero mean and common variance.

SELECTION CRITERIA

• When the maximum model has been defined, the next point to consider is how to determine whether one model is 'better' than the rest: which criterion should we use to compare the possible models? A selection criterion is a criterion which orders all possible models from 'best' to 'worst'. Many different criteria have been suggested over time; some are better than others, but there is no single criterion which is preferred overall.

Page 32: Regression

SELECTING REGRESSION MODELS

• The purpose of a selection criterion is to compare the maximum model with a reduced model

Yᵢ = β₀ + β₁xᵢ,₁ + β₂xᵢ,₂ + … + βₘxᵢ,ₘ + Єᵢ

which is a restriction of the maximum model. If the reduced model provides (almost) as good a fit to the data as the maximum model, then we prefer the reduced model.

• The Ra² criterion: Because of the way R² is defined, the largest model (the one with the most explanatory variables) will always have the largest R², whether or not the extra variables provide any important information about the response variable. A common way to avoid this problem is to use an adjusted version of R² instead of R² itself. The adjusted R² statistic, for a model with k explanatory variables, is given by

Ra² = 1 - (1 - R²)(n - 1) / (n - k - 1)

Note that Ra² does not necessarily increase when the number of explanatory variables increases. According to the Ra² criterion, one should choose the model which has the largest Ra².
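A one-line Python helper implementing this formula (the function name and example values are hypothetical):

    def adjusted_r2(R2: float, n: int, k: int) -> float:
        """Ra^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
        return 1.0 - (1.0 - R2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(0.90, n=30, k=5))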

Page 33: Regression

SELECTING REGRESSION MODELS

• The F-test criterion: A different, but equally intuitive, selection criterion is the F-test criterion. The idea is to test the significance of the k - m explanatory variables xₘ₊₁, …, xₖ in the maximum model in order to get the reduced model. That is, we need to test the null hypothesis

H₀: βₘ₊₁ = βₘ₊₂ = … = βₖ = 0

• The F-test statistic for testing the significance of xₘ₊₁, …, xₖ is

F = [(SSEₘ - SSEₖ) / (k - m)] / [SSEₖ / (n - k - 1)]

where SSEₘ and SSEₖ are the error sums of squares of the reduced and maximum models, respectively.

• If H₀ is not rejected, the reduced model provides as good a fit to the data as the maximum model, so we can use the reduced model instead of the maximum model.
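A small sketch of that statistic in Python (the helper name and the numbers in the usage line are hypothetical; SSE_red and SSE_full would come from fitting the two models):

    from scipy import stats

    def partial_f_test(SSE_red, SSE_full, n, k, m, alpha=0.05):
        """F-test for the significance of the k - m extra variables."""
        F0 = ((SSE_red - SSE_full) / (k - m)) / (SSE_full / (n - k - 1))
        p = stats.f.sf(F0, k - m, n - k - 1)
        return F0, p, p < alpha      # True in the last slot -> reject H0

    print(partial_f_test(SSE_red=120.0, SSE_full=95.0, n=50, k=6, m=3))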

SELECTION PROCEDURES

• All possible models procedure: The most careful selection procedure is the all possible models procedure, in which all possible models are fitted to the data and the selection criterion is used on all the models in order to find the model which is preferable to all others.

Page 34: Regression

SELECTING REGRESSION MODELS

• BACKWARD ELIMINATION: The backward elimination procedure is basically a sequence of tests for significance of explanatory variables. Starting from the maximum model

Yᵢ = β₀ + β₁xᵢ,₁ + β₂xᵢ,₂ + … + βₖxᵢ,ₖ + Єᵢ

we remove (or eliminate) the variable with the highest p-value for the test of significance of that variable, provided the p-value is bigger than some pre-determined level (say, 0.10). Next, we fit the reduced model (having removed the variable from the maximum model), and remove from the reduced model the variable with the highest p-value for the test of significance of that variable (if p > 0.10). And so on. The procedure ends when no more variables can be removed from the model at the 10% significance level. Note that we use the F-test criterion in this procedure (see the sketch after this list).

• FORWARD SELECTION: The forward selection procedure is a reversed version of the backward elimination procedure. Instead of starting with the maximum model and eliminating variables one by one, we start with an 'empty' model with no explanatory variables, and add variables one by one until we cannot improve the model significantly by adding another variable.
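A hedged backward-elimination sketch using statsmodels OLS p-values (the function name, threshold and random example data are all hypothetical choices, not part of the original slides):

    import numpy as np
    import statsmodels.api as sm

    def backward_elimination(X, y, threshold=0.10):
        """Drop the least significant regressor while its p-value exceeds the threshold."""
        cols = list(range(X.shape[1]))
        while cols:
            model = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
            pvals = np.asarray(model.pvalues)[1:]   # skip the intercept's p-value
            worst = int(np.argmax(pvals))
            if pvals[worst] > threshold:
                cols.pop(worst)                     # eliminate that variable and refit
            else:
                break
        return cols                                 # indices of retained regressors

    # Hypothetical usage: only columns 0 and 2 truly influence y
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=100)
    print(backward_elimination(X, y))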

Page 35: Regression

SELECTING REGRESSION MODELS

• STEPWISE REGRESSION PROCEDURE: The stepwise regression procedure modifies the forward selection procedure in the following way. Each time a new variable is added to the model, the significance of each of the variables already in the model is re-examined. That is, at each step in the forward selection procedure, we test for significance of each of the variables currently in the model, and remove the one with the highest p-value (if the p-value is above some threshold value, say 0.10). The model is then re-fitted without this variable before going to the next step in the forward selection procedure. The stepwise regression procedure continues until no more variables can be added or removed.

Page 36: Regression

THANK YOU