Quantitative Methods - Level II - CFA Program

Page 1: Quantitative Methods - Level II - CFA Program

CORRELATION AND REGRESSION Quantitative Analysis R-11 (SS-3)


Covariance:
o Measures the linear relationship between two variables.
o Its value is not very meaningful on its own because it ranges from negative to positive infinity and is expressed in squared units (e.g. %², $²).

Correlation:
o A standardized measure of the linear relationship between two variables.
o Its value has no measurement unit and ranges from -1 (perfectly negatively correlated) to +1 (perfectly positively correlated).
o Limitations include the impact of outliers, the potential for spurious correlation, and the inability to capture non-linear relationships.

Interpreting a scatter plot:
o A collection of points on a graph where each point represents the values of two variables.
o If correlation equals +1, the points lie exactly on an upward-sloping line; if correlation equals -1, they lie exactly on a downward-sloping line.

Hypothesis testing for statistical significance:
o Tests whether the population correlation between two variables is equal to zero (a two-tailed test with n-2 degrees of freedom at a given confidence level).
o Test structure: H0: ρ = 0 versus Ha: ρ ≠ 0.
o Test statistic (assuming normally distributed variables): t = r√(n-2) / √(1-r²), with n-2 degrees of freedom.
o Decision rule: reject H0 if t > +t critical or t < -t critical.
o Interpretation: if the null cannot be rejected, we conclude that the correlation between variables X and Y is not significantly different from zero at the given significance level (e.g. 5%).
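A minimal numerical sketch of this test (the paired observations below are made up for illustration; scipy supplies the critical t-value):

import numpy as np
from scipy import stats

# Hypothetical paired observations (e.g. monthly returns of two assets)
x = np.array([1.2, -0.5, 0.8, 2.1, -1.0, 0.3, 1.7, -0.2, 0.9, 1.1])
y = np.array([0.9, -0.7, 1.1, 1.8, -1.2, 0.1, 1.5, -0.4, 0.6, 1.3])

n = len(x)
r = np.corrcoef(x, y)[0, 1]            # sample correlation

# t = r * sqrt(n-2) / sqrt(1 - r^2), with n-2 degrees of freedom
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
t_crit = stats.t.ppf(0.975, df=n - 2)  # two-tailed test at the 5% level

print(f"r = {r:.3f}, t = {t_stat:.2f}, critical t = {t_crit:.2f}")
print("Reject H0 (rho = 0)" if abs(t_stat) > t_crit else "Fail to reject H0")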


Page 2: Quantitative Methods - Level II - CFA Program


Simple Linear Regression:
o Purpose: to explain the variation in a dependent variable in terms of the variation in a single independent variable.
  - Dependent variable = the explained, endogenous, or predicted variable.
  - Independent variable = the explanatory, exogenous, or predicting variable.

o Assumptions (mostly related to the residual, disturbance, or error term ε):
  - A linear relationship exists between the dependent and the independent variable.
  - The independent variable is uncorrelated with the residuals.
  - The expected value of the residual term is zero.
  - The variance of the residual term is constant for all observations (otherwise, the data is heteroskedastic).
  - The residual term is independently distributed; that is, the residual for one observation is not correlated with the residual of another (otherwise, the data exhibits autocorrelation).
  - The residual term is normally distributed.

o Model construction: Yi = b0 + b1·Xi + εi. The linear equation (regression line or line of best fit) is the line that minimizes the Sum of Squared Errors (SSE); that is why simple linear regression is often called Ordinary Least Squares (OLS) regression and the estimated values are called least squares estimates.
  - Slope coefficient b1: describes the change in Y for a one-unit change in X. (The stock's β, or systematic risk level, when X = market excess returns and Y = stock excess returns.)
  - Intercept term b0: the line's intersection with the Y axis (the value of Y at X = 0). (The ex-post α, or excess risk-adjusted return relative to a market benchmark, when X = market excess returns and Y = stock excess returns.)
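A sketch of the least squares estimates on hypothetical excess-return data, using the closed-form results b1 = Cov(X,Y)/Var(X) and b0 = mean(Y) - b1·mean(X):

import numpy as np

# Hypothetical monthly excess returns (%): X = market, Y = stock
x = np.array([1.5, -2.0, 0.8, 3.1, -1.2, 0.5, 2.2, -0.7])
y = np.array([2.0, -2.5, 1.1, 3.8, -1.0, 0.9, 2.9, -1.2])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope = beta
b0 = y.mean() - b1 * x.mean()                         # intercept = ex-post alpha

print(f"beta (slope) = {b1:.3f}, alpha (intercept) = {b0:.3f}")
# The fitted line Y_hat = b0 + b1*X minimizes the sum of squared errors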


Page 3: Quantitative Methods - Level II - CFA Program


o Importance of the regression model in explaining the dependent variable: requires determining the statistical significance of the regression (slope) coefficient through:

  Confidence Interval:
  - Structure: b1 ± tc × s(b1), where tc is the critical two-tailed t-value for a given confidence level with n-2 df and s(b1) is the standard error of the slope coefficient.
  - Decision rule & interpretation: if the confidence interval does not include zero, we can conclude that the slope coefficient is significantly different from zero.

  Hypothesis Testing:
  - Test structure: H0: b1 = 0 versus Ha: b1 ≠ 0.
  - Test statistic (assuming normal distribution): t = (estimated b1 - hypothesized b1) / s(b1), with n-2 df.
  - Decision rule: reject H0 if t > +tc or t < -tc.
  - Interpretation: if the null cannot be rejected, we conclude that the slope coefficient is not significantly different from the hypothesized value of b1 (zero in this case) at the given significance level (e.g. 5%).
  - (A numerical sketch of the confidence-interval and t-test checks appears after this list.)

F-Test: to be discussed later at the end of this reading
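A sketch of the confidence-interval and t-test checks for the slope, using statsmodels on synthetic data (all numbers and names are illustrative assumptions):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=60)
y = 0.5 + 1.2 * x + rng.normal(scale=0.8, size=60)   # simulated data, true slope = 1.2

model = sm.OLS(y, sm.add_constant(x)).fit()
b1, se_b1 = model.params[1], model.bse[1]
lo, hi = model.conf_int(alpha=0.05)[1]               # 95% CI: b1 +/- tc * s(b1)

print(f"b1 = {b1:.3f}, s(b1) = {se_b1:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
print(f"t = {model.tvalues[1]:.2f}, p-value = {model.pvalues[1]:.4f}")
# A CI that excludes zero (equivalently |t| > critical t) means the slope is significant.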

o Standard Error of Estimate (SEE):
  - Also known as the standard error of the residual or the standard error of the regression, it measures the degree of variability of the actual Y-values relative to the estimated Y-values from the regression equation; it is the estimate of σε, i.e. SEE = √[SSE / (n-2)].
  - The higher the correlation, the smaller the standard error and the better the fit.

o Coefficient of determination (R²):
  - The percentage of the total variation in the dependent variable that is explained by the independent variable.
  - For simple linear regression, R² = ρ² (the square of the correlation coefficient).

o Confidence interval for predicted values:
  - Structure: Ŷ ± tc × sf, where tc is the critical two-tailed t-value for a given confidence level with n-2 df and sf is the standard error of the forecast. (Being asked to calculate sf itself is highly improbable in the exam.)


Page 4: Quantitative Methods - Level II - CFA Program


  - Interpretation: given a forecasted value of X, we can be (e.g. 95%) confident that Y will lie between Ŷ - tc×sf and Ŷ + tc×sf.
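A sketch of the interval on made-up data, assuming the standard expression for the forecast variance, sf² = SEE² × [1 + 1/n + (X - X̄)² / ((n-1)·s²x)] (this formula is not shown in the notes above):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])
n = len(x)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
see2 = (resid**2).sum() / (n - 2)                  # SEE squared

x_new = 9.0                                        # forecasted value of X
y_hat = b0 + b1 * x_new
sf = np.sqrt(see2 * (1 + 1/n + (x_new - x.mean())**2 / ((n - 1) * np.var(x, ddof=1))))
tc = stats.t.ppf(0.975, df=n - 2)

print(f"95% prediction interval: {y_hat - tc*sf:.2f} to {y_hat + tc*sf:.2f}")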

o Analysis of Variance (ANOVA):
  - Total variation = Explained variation + Unexplained variation.
  - Total variation = Total Sum of Squares (SST) = Σ(Yi - Ȳ)².
  - Explained variation = Regression Sum of Squares (RSS) = Σ(Ŷi - Ȳ)².
  - Unexplained variation = Sum of Squared Errors (SSE) = Σ(Yi - Ŷi)².
  - If we denote the number of independent variables as k, then regression df = k = 1 for simple linear regression, and error df = n-k-1 = n-2 for the same.
  - MSR = RSS / k is the mean regression sum of squares and MSE = SSE / (n-k-1) is the mean squared error.
  - R² = Explained variation (RSS) / Total variation (SST).
  - Standard Error of Estimate (SEE) = √MSE; the variance of Y = SST / (n-1).
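The decomposition can be checked numerically; the data below are hypothetical:

import numpy as np

x = np.array([0.8, 1.6, 2.3, 3.1, 4.0, 4.7, 5.5, 6.2, 7.1, 7.9])
y = np.array([1.1, 1.9, 2.8, 3.0, 4.4, 4.9, 5.2, 6.5, 7.0, 8.3])
n, k = len(x), 1

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = ((y - y.mean())**2).sum()       # total variation
rss = ((y_hat - y.mean())**2).sum()   # explained variation
sse = ((y - y_hat)**2).sum()          # unexplained variation

msr, mse = rss / k, sse / (n - k - 1)
print(f"SST = {sst:.3f} = RSS {rss:.3f} + SSE {sse:.3f}")
print(f"R^2 = {rss/sst:.3f}, SEE = {np.sqrt(mse):.3f}, F = {msr/mse:.2f}")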


Page 5: Quantitative Methods - Level II - CFA Program


o The F-Statistic: (more useful with multiple regression)
  - Assesses how well a set of independent variables, as a group, explains the variation in the dependent variable at a desired level of significance. In other words, it tests whether at least one of the independent variables explains a significant portion of the variation of the dependent variable.
  - The F-test is a one-tailed test.
  - Test structure: H0: b1 = 0 versus Ha: b1 ≠ 0 (for simple linear regression; with multiple independent variables, H0 is that all slope coefficients equal zero and Ha is that at least one does not).
  - Test statistic: F = MSR / MSE = (RSS / k) / (SSE / (n-k-1)).
  - Fc is the critical F-value at a given level of significance with df(numerator) = k = 1 and df(denominator) = n-k-1 = n-2.
  - Decision rule: reject H0 if F > Fc.
  - Interpretation: if the null cannot be rejected, we conclude that the slope coefficient is not significantly different from zero at the given significance level (e.g. 5%).

o Limitations of regression analysis:
  - Linear relationships can change over time (parameter instability).
  - Its usefulness is limited if other market participants are aware of the relationship and act on it.
  - If the assumptions of the model do not hold, the interpretation of the results will not be valid. Major reasons for model invalidity include heteroskedasticity (non-constant variance of the error terms) and autocorrelation (error terms that are not independent).


Page 6: Quantitative Methods - Level II - CFA Program

MULTIPLE REGRESSION & ISSUES IN REGRESSION ANALYSIS Quantitative Analysis R-12 (SS-3)


Multiple regression is regression analysis with more than one independent variable: Yi = b0 + b1·X1i + b2·X2i + ... + bk·Xki + εi.

o Slope coefficient bj: describes the change in Y for a one-unit change in Xj, holding the other independent variables constant.

o Intercept term b0: the regression's intersection with the Y axis (the value of Y when all Xs = 0).

Hypothesis testing (t-tests): t = (estimated bj - hypothesized bj) / s(bj), with n-k-1 degrees of freedom.

p-values: the smallest level of significance for which the null hypothesis can be rejected. So, if the p-value < the significance level, the null hypothesis can be rejected; otherwise, it cannot be rejected.

Confidence interval for a regression coefficient: bj ± tc × s(bj).

If an independent variable proves to be statistically insignificant (its coefficient is not different from zero at a given confidence level), the whole model needs to be re-estimated, as the coefficients of the other, significant variables will likely change.

Assumptions: the same as for univariate regression, with the addition that there is no exact linear relation between any two or more independent variables (otherwise, multicollinearity).

The F-Statistic: tests whether all slope coefficients are jointly equal to zero (F = MSR / MSE with k and n-k-1 degrees of freedom).

R²: the percentage of the variation in the dependent variable that is collectively explained by all of the independent variables.

o Multiple R: the correlation between the actual and forecasted values of Y. Multiple R is the square root of R². For simple regression, the correlation between the dependent and independent variables equals multiple R, with the same sign as the slope coefficient.

Adjusted R²: R² increases as more independent variables are added to the model, regardless of their explanatory power; this problem is called overestimating the regression. To overcome it, R² should be adjusted for the number of independent variables as per the following formula: Adjusted R² = 1 - [(n-1) / (n-k-1)] × (1 - R²).

o Adjusted R² <= R².
o Adding a new variable to the model will increase R², while it may increase or decrease adjusted R².
o Adjusted R² may be less than zero if R² is low enough.
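A sketch with two synthetic regressors, showing R², adjusted R² (computed both manually and by statsmodels), and the joint F-test; all data are simulated:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.8 * x1 - 0.5 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

k = 2
adj_r2 = 1 - (1 - res.rsquared) * (n - 1) / (n - k - 1)   # manual adjusted R^2
print(f"R^2 = {res.rsquared:.3f}, adjusted R^2 = {res.rsquared_adj:.3f} (manual: {adj_r2:.3f})")
print(f"F = {res.fvalue:.1f}, p-value = {res.f_pvalue:.2e}")  # joint significance of the slopes
print("slope p-values:", np.round(res.pvalues[1:], 4))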

Dummy variables:
o Usually used to quantify the impact of qualitative, binary (on or off) events. Dummy variables are assigned values of 1 or 0 for on or off status.
o Whenever we need to distinguish between n classes we must use n-1 dummy variables; otherwise, the multiple regression assumption of no exact linear relationship between independent variables would be violated.


Page 7: Quantitative Methods - Level II - CFA Program


o The omitted class should be thought of as the reference point, which is represented by the intercept.
o Testing the statistical significance of a dummy variable's slope coefficient is equivalent to testing whether that class differs from the omitted (reference) class captured by the intercept.
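A sketch of the n-1 dummy approach for quarterly data (the EPS figures are invented; Q1 is the omitted reference class):

import numpy as np
import statsmodels.api as sm

# Hypothetical quarterly EPS for 6 years; quarters cycle 1..4
eps = np.array([1.0, 1.2, 1.1, 1.8] * 6) + np.random.default_rng(1).normal(scale=0.05, size=24)
quarter = np.tile([1, 2, 3, 4], 6)

# 4 classes -> use 4 - 1 = 3 dummies; Q1 is the omitted (reference) class
d2 = (quarter == 2).astype(float)
d3 = (quarter == 3).astype(float)
d4 = (quarter == 4).astype(float)

X = sm.add_constant(np.column_stack([d2, d3, d4]))
res = sm.OLS(eps, X).fit()

# Intercept = average Q1 EPS; each dummy coefficient = that quarter's average minus Q1's
print(np.round(res.params, 3))
print(np.round(res.tvalues[1:], 2))   # a significant t means the quarter differs from the reference class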

Issues in regression analysis:

o Heteroskedasticity:
  Definition: occurs when the variance of the residuals is not the same across all observations. There are two types:
  - Unconditional heteroskedasticity: not related to the level of the independent variables. Not a major problem.
  - Conditional heteroskedasticity: related to the level of the independent variables. A significant problem.
  Effect: the standard errors are unreliable (affecting the t-tests and the F-test), while the coefficients themselves are not affected. Standard errors that are too small are the main concern, as they can lead to Type I errors: rejecting a true null hypothesis of an insignificant coefficient.
  Detection:
  - Examine the scatter plot of the residuals against the independent variables.
  - Breusch-Pagan (BP) test: run a second regression of the squared residuals (from the first regression) on the independent variables and test whether the independent variables significantly explain the squared residuals. The test statistic has a chi-square (χ²) distribution with k degrees of freedom and is calculated as BP = n × R²(resid), where R²(resid) is the R² of this second regression. This is a one-tailed test, as the concern is only with a test statistic that is too large. If the test statistic > the chi-square critical value ⟹ reject the null hypothesis and conclude that a conditional heteroskedasticity problem is present.
  Correction:
  - Use robust standard errors (White-corrected or heteroskedasticity-consistent standard errors), which are usually higher than the original standard errors.
  - Use generalized least squares, which modifies the original equation in an attempt to eliminate the heteroskedasticity.
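A sketch of the BP test on simulated data whose error variance grows with x, showing both the manual n × R²(resid) statistic and statsmodels' het_breuschpagan:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1, 10, size=n)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x, size=n)   # error variance grows with x

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

# Manual BP statistic: n * R^2 of the regression of squared residuals on x
r2_aux = sm.OLS(resid**2, X).fit().rsquared
bp_manual = n * r2_aux
chi2_crit = stats.chi2.ppf(0.95, df=1)                # k = 1 independent variable

lm, lm_pvalue, _, _ = het_breuschpagan(resid, X)
print(f"BP = {bp_manual:.2f} (statsmodels: {lm:.2f}), critical chi2 = {chi2_crit:.2f}")
# BP above the critical value => conditional heteroskedasticity is present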

o Serial Correlation (Autocorrelation):
  Definition: occurs when the residual terms are correlated with one another. It is a common problem with time-series data. There are two types:
  - Positive serial correlation: a positive error in one period increases the probability of observing a positive error in the next period.
  - Negative serial correlation: a positive error in one period increases the probability of observing a negative error in the next period.


Page 8: Quantitative Methods - Level II - CFA Program


  Effect: the tendency of the data to cluster together understates the coefficient standard errors, leading to Type I errors.
  Detection:
  - Examine the scatter plot of the residuals against time.
  - Durbin-Watson (DW) statistic (the full calculation is impractical for the exam). If the sample size is very large, DW ≈ 2 × (1 - r), where r is the correlation coefficient between residuals from one period and those from the previous period:
    DW = 2 if r = 0 (no serial correlation)
    DW > 2 if r < 0 (negative serial correlation)
    DW < 2 if r > 0 (positive serial correlation)
  - For the Durbin-Watson test there are upper (dU) and lower (dL) critical values that depend on the level of significance, the number of observations, and the number of independent variables k.
    Test structure: H0: no positive serial correlation.
    Decision rule: DW < dL ⟹ reject H0 (positive serial correlation is present); dL ≤ DW ≤ dU ⟹ the test is inconclusive; DW > dU ⟹ fail to reject H0.
  Correction:
  - Use the Hansen method to produce Hansen-White standard errors, which can also correct for conditional heteroskedasticity. The general rule for the use of adjusted standard errors is:
    If the problem is serial correlation only ⟹ Hansen method
    If the problem is conditional heteroskedasticity only ⟹ White-corrected standard errors
    If the problem is both ⟹ Hansen method
  - Improve the model specification, for example by including a seasonal term to reflect the time-series nature of the data. This can be tricky.
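A sketch with simulated AR(1) errors, comparing the DW statistic with the 2 × (1 - r) approximation:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
n = 120
x = rng.normal(size=n)

# Build AR(1) errors so the residuals are positively serially correlated
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal(scale=0.5)
y = 1 + 0.6 * x + e

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
dw = durbin_watson(resid)
r = np.corrcoef(resid[1:], resid[:-1])[0, 1]

print(f"DW = {dw:.2f}, approx 2*(1-r) = {2*(1-r):.2f}")
# DW well below 2 points to positive serial correlation (compare with dL/dU from a table)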

o Multicollinearity:
  Definition: occurs when independent variables, or linear combinations of independent variables, are highly correlated with each other. For k > 2, high correlation between individual independent variables (> 0.7) suggests the possibility of multicollinearity, but low pairwise correlations do not necessarily rule it out.


Page 9: Quantitative Methods - Level II - CFA Program


  Effect: slope coefficients tend to be unreliable, and standard errors are artificially inflated. Hence, there is a greater probability of Type II error.
  Detection: the F-test is statistically significant and R² is high, while the t-tests indicate that none of the individual coefficients is significant.
  Correction: use statistical procedures, such as stepwise regression, which systematically remove variables from the regression until multicollinearity is minimized.
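A sketch of the classic detection symptom using two nearly identical synthetic regressors:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)     # x2 is almost a copy of x1
y = 1 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(f"pairwise correlation(x1, x2) = {np.corrcoef(x1, x2)[0, 1]:.3f}")
print(f"R^2 = {res.rsquared:.3f}, F p-value = {res.f_pvalue:.2e}")  # jointly significant
print("slope t-stats:", np.round(res.tvalues[1:], 2))               # individually insignificant: inflated SEs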

Model misspecification:
o Categories:
  I. The functional form can be misspecified:
     1. Important variables are omitted.
     2. Variables should be transformed but are not. (For example, the dependent variable may be linearly related to the natural log of the independent variable, or balance-sheet items may need to be standardized by dividing by total assets and income-statement and cash-flow items by sales; common mistakes include squaring or taking the square root of a variable when that transformation is not appropriate.)
     3. Data is improperly pooled (by pooling sub-periods that exhibit structural change).
  II. Explanatory variables are correlated with the error term in time-series analysis:
     1. A lagged dependent variable is used as an independent variable.
     2. A function of the dependent variable is used as an independent variable ("forecasting the past", e.g. using end-of-month market cap to predict returns during that month).
     3. Independent variables are measured with error (e.g. using free float as a proxy for corporate governance quality, or actual inflation as a proxy for expected inflation).
  III. Other time-series misspecifications that result in nonstationarity.

o Effect: regression coefficients are often biased and/or inconsistent ⟹ unreliable hypothesis testing and inaccurate predictions.

Qualitative (dummy) dependent variables:
o Probit and logit models:
  - Estimate the probability that an event occurs (e.g. the probability of default or of a merger).
  - Maximum likelihood is used to estimate the coefficients.
  - A probit model is based on the normal distribution, while a logit model is based on the logistic distribution.
o Discriminant models:
  - Result in a linear function, similar to an ordinary regression, that generates an overall score for an observation. The scores can then be used to rank or classify observations.
  - Example: use financial ratios to generate a score that places a company in a bankrupt or not-bankrupt class.
  - Similar to probit and logit models but make different assumptions regarding the independent variables.
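A sketch of a logit fit on simulated default data; the predictors (leverage and interest coverage) and all numbers are hypothetical:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 300
leverage = rng.uniform(0, 1, size=n)           # hypothetical debt ratio
coverage = rng.uniform(0, 10, size=n)          # hypothetical interest coverage
logit_p = -1 + 4 * leverage - 0.6 * coverage
default = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit_p))).astype(int)

X = sm.add_constant(np.column_stack([leverage, coverage]))
res = sm.Logit(default, X).fit(disp=0)         # estimated by maximum likelihood

print(np.round(res.params, 3))
# Predicted default probability for leverage = 0.8, coverage = 2
print(round(res.predict([[1, 0.8, 2]])[0], 3))
# sm.Probit has the same interface but uses the normal distribution instead of the logistic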


Page 10: Quantitative Methods - Level II - CFA Program

TIME-SERIES ANALYSIS Quantitative Analysis R-13 (SS-3)


A time series is a set of observations on a variable measured over successive periods of time.

Linear trend model: yt = b0 + b1·t + εt (the data plot on a straight line).

Log-linear trend model: (the data plot on a curve) the model defines y as an exponential function of time, yt = e^(b0 + b1·t). By taking the natural log of both sides, we transform the equation from an exponential to a linear function: ln(yt) = b0 + b1·t + εt.

For financial time series that display exponential growth, the log-linear model provides a better fit to the data and thus increases the model's predictive power.

When a variable grows at a constant rate (e.g. financial data such as company sales), a log-linear model is most appropriate. When it grows by a constant amount (e.g. inflation), a linear trend model is most appropriate.
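A sketch comparing the two trend models on a simulated series that grows at roughly 5% per period (all data are synthetic):

import numpy as np
import statsmodels.api as sm

t = np.arange(1, 41)
rng = np.random.default_rng(2)
sales = 100 * 1.05**t * np.exp(rng.normal(scale=0.02, size=40))   # constant-rate growth

X = sm.add_constant(t)
linear = sm.OLS(sales, X).fit()             # sales = b0 + b1*t
loglin = sm.OLS(np.log(sales), X).fit()     # ln(sales) = b0 + b1*t

print(f"linear R^2 = {linear.rsquared:.4f}, log-linear R^2 = {loglin.rsquared:.4f}")
print(f"estimated growth rate = {np.exp(loglin.params[1]) - 1:.3%}")   # close to 5%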

Limitation of trend models: when the time-series residuals exhibit serial correlation, as evidenced by the DW test, we need to use an autoregressive (AR) model instead. This is done by regressing the dependent variable against one or more lagged values of itself, on condition that the time series being modeled is covariance stationary.

Conditions for covariance stationarity:
I. Constant and finite expected value (a mean-reverting level).
II. Constant and finite variance (homoskedastic).
III. Constant and finite covariance between values at any given lag (the covariance of the time series with leading or lagged values of itself is constant).

An AR model of order p, AR(p), is expressed as: xt = b0 + b1·x(t-1) + b2·x(t-2) + ... + bp·x(t-p) + εt (where p is the number of lagged values included as independent variables).

Forecasting with an autoregressive model: applying the chain rule of forecasting, it is necessary to calculate a one-step-ahead forecast before a two-step-ahead forecast: for AR(1), x̂(t+1) = b0 + b1·xt and x̂(t+2) = b0 + b1·x̂(t+1). This implies that:
o Multi-period forecasts are more uncertain than single-period forecasts.
o Sample size = number of observations - the order p of the AR model.
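A tiny sketch of the chain rule with hypothetical AR(1) coefficients:

# Chain rule of forecasting for an estimated AR(1) model: x_t = b0 + b1 * x_{t-1}
b0, b1 = 0.6, 0.7            # hypothetical estimated coefficients
x_last = 1.0                 # last observed value

x_1 = b0 + b1 * x_last       # one-step-ahead forecast: 1.30
x_2 = b0 + b1 * x_1          # two-step-ahead forecast uses the forecast itself: 1.51

print(x_1, x_2)
print(b0 / (1 - b1))         # mean-reverting level = 2.0; forecasts drift toward it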

Detection of and correction for autocorrelation in autoregressive models: the DW test used with trend models is not appropriate for AR models. Instead, the following steps are followed to detect autocorrelation and make sure the AR model is correctly specified:

I. Estimate the AR(1) model.

II. Calculate the autocorrelations of the model's residuals.

III. t-test whether each residual autocorrelation is significantly different from zero, with df = T-2: t = (residual autocorrelation) / (1/√T), where T is the number of observations and 1/√T is the standard error of the residual autocorrelation.

IV. If any of the autocorrelations is significantly different from zero, the model is not correctly specified. Add more lags to the model and repeat from step II until all residual autocorrelations are insignificant.
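A sketch of the procedure on a simulated AR(1) series:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
T = 200
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.3 + 0.6 * x[t - 1] + rng.normal()   # simulated AR(1) series

# Step I: estimate AR(1) by regressing x_t on x_{t-1}
res = sm.OLS(x[1:], sm.add_constant(x[:-1])).fit()
resid = res.resid
T_eff = len(resid)

# Steps II-III: residual autocorrelations and their t-statistics (standard error = 1/sqrt(T))
for lag in (1, 2, 3, 4):
    rho = np.corrcoef(resid[lag:], resid[:-lag])[0, 1]
    t_stat = rho / (1 / np.sqrt(T_eff))
    print(f"lag {lag}: autocorrelation = {rho:+.3f}, t = {t_stat:+.2f}")
# A |t| beyond the critical t (df = T-2) would call for adding more lags (step IV)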


Page 11: Quantitative Methods - Level II - CFA Program


Mean reversion and random walk: for a time series to be covariance stationary it must have a constant and finite mean-reverting level, which is the value the series tends to move toward. Once this level is reached, the model predicts no change; setting xt = x(t-1) in the AR(1) model gives: mean-reverting level = b0 / (1 - b1).

For b1 = 1 (called a unit root) the model does not have a finite mean-reverting level and is therefore not covariance stationary. This happens when the series follows a random walk process, which is classified as:
o Random walk without a drift: xt = x(t-1) + εt (b0 = 0, b1 = 1).
o Random walk with a drift: xt = b0 + x(t-1) + εt (b0 ≠ 0, b1 = 1).

Unit root detection: as testing whether b1 = 1 cannot be performed directly, use the Dickey-Fuller (DF) test, which transforms the AR(1) model into a simple regression by subtracting x(t-1) from both sides: xt - x(t-1) = b0 + (b1 - 1)·x(t-1) + εt, i.e. Δxt = b0 + g·x(t-1) + εt with g = b1 - 1. Then test whether the transformed coefficient g is different from zero using a modified t-test. With H0: g = 0, if we fail to reject, we conclude that the series has a unit root.
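A sketch using statsmodels' adfuller (an augmented Dickey-Fuller test) on one simulated random walk and one stationary AR(1) series:

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(6)
T = 300
random_walk = np.cumsum(rng.normal(size=T))        # x_t = x_{t-1} + e_t (unit root)
stationary = np.zeros(T)
for t in range(1, T):
    stationary[t] = 0.5 * stationary[t - 1] + rng.normal()

for name, series in [("random walk", random_walk), ("AR(1), b1=0.5", stationary)]:
    stat, pvalue = adfuller(series)[:2]
    print(f"{name}: ADF stat = {stat:.2f}, p-value = {pvalue:.3f}")
# Failing to reject the null (high p-value) => the series has a unit root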

Unit root correction: use first differencing to transform the data into a covariance stationary time series. Define yt = xt - x(t-1) and construct the AR(1) model yt = b0 + b1·y(t-1) + εt, where b0 = b1 = 0, so that the differenced series has a mean-reverting level of b0 / (1 - b1) = 0 / (1 - 0) = 0 (a finite value).

If the data has a linear trend, first difference the data. If the data has an exponential trend, first difference the natural log of the data.

Seasonality detection: a pattern that tends to repeat from year to year. Not accounting for seasonality when it is present makes the AR model misspecified and unreliable for forecasting purposes. Seasonality can be detected by observing that the residual autocorrelation for the month or quarter from the previous year (month 12 or quarter 4) is significantly different from zero.

Seasonality correction: add to the original model an additional lag corresponding to the same period in the previous year (e.g. for quarterly data: xt = b0 + b1·x(t-1) + b2·x(t-4) + εt).

In-sample forecasts are made within the range of data used to estimate the model. Out-of-sample forecasts are made outside the sample period to assess the predictive power of the model. Given two models, to assess which one is better, apply the root mean squared error (RMSE) criterion (the square root of the average of the squared forecast errors) to out-of-sample data; the model with the lowest RMSE is the most accurate.
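A sketch of the RMSE comparison on hypothetical out-of-sample forecasts:

import numpy as np

# Hypothetical out-of-sample actuals and forecasts from two competing models
actual  = np.array([1.2, 0.8, 1.5, 1.1, 0.9, 1.4])
model_a = np.array([1.0, 0.9, 1.3, 1.2, 1.0, 1.2])
model_b = np.array([1.4, 0.5, 1.9, 0.8, 1.2, 1.0])

rmse = lambda f: np.sqrt(np.mean((actual - f) ** 2))
print(f"RMSE model A = {rmse(model_a):.3f}, RMSE model B = {rmse(model_b):.3f}")
# The model with the lower out-of-sample RMSE is preferred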

As financial and economic environments are dynamic and frequently subject to structural shifts, there is a tradeoff between the increased statistical reliability when using long time-series periods, and the increased stability of the estimates when using shorter periods.


Page 12: Quantitative Methods - Level II - CFA Program


Autoregressive Conditional Heteroskedasticity (ARCH) model: ARCH exists if the variance of the residuals in one period is dependent on the (squared) residuals in a previous period. An ARCH(1) model is expressed as: ε²t = a0 + a1·ε²(t-1) + μt.

If the coefficient a1 is statistically different from zero, the variance of the error terms in one period depends on the size of the error in the previous period, indicating that the error terms exhibit conditional heteroskedasticity. In that case the time series is ARCH(1) and, according to our need, we can either:

o Correct the model using procedures that correct for heteroskedasticity, such as generalized least squares, or
o Predict the variance of the residuals in future periods: predicted σ²(t+1) = a0 + a1·ε²t.
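A sketch that simulates ARCH(1) errors and then runs the test regression of squared residuals on their first lag (all values are synthetic):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
T = 500
eps = np.zeros(T)
for t in range(1, T):
    sigma2 = 0.2 + 0.6 * eps[t - 1] ** 2        # simulated ARCH(1) error variance
    eps[t] = rng.normal(scale=np.sqrt(sigma2))

# ARCH(1) test: regress squared residuals on their own first lag
sq = eps ** 2
res = sm.OLS(sq[1:], sm.add_constant(sq[:-1])).fit()
a1, t_a1 = res.params[1], res.tvalues[1]

print(f"a1 = {a1:.3f}, t = {t_a1:.2f}")
# A significant a1 => the error variance depends on last period's squared error (ARCH(1))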

Considerations for using two time-series variables in a linear regression: test each series for covariance stationarity (by checking for autocorrelation or a unit root), with the following possibilities determining whether the data can be used:

1. Both time series are covariance stationary ⟹ Yes.
2. Only one of the two series is covariance stationary ⟹ No.
3. Neither time series is covariance stationary:
   3.1. The two series are cointegrated ⟹ Yes.
   3.2. The two series are not cointegrated ⟹ No.

Cointegration: means that the two time series are economically linked or follow the same trend, and that this relationship is not expected to change. To test for cointegration, regress one variable on the other: yt = b0 + b1·xt + εt. The residuals are then tested for a unit root using the DF test with critical t-values calculated by Engle and Granger (the DF-EG test). If the test rejects the null hypothesis of a unit root, we conclude that the error terms generated by the two time series are covariance stationary and the two series are cointegrated.
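A sketch using statsmodels' coint (an Engle-Granger style test) on two simulated series that share a common random-walk trend:

import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(10)
T = 400
common_trend = np.cumsum(rng.normal(size=T))          # shared random walk
x = common_trend + rng.normal(scale=0.5, size=T)      # both series are nonstationary
y = 2 + 0.8 * common_trend + rng.normal(scale=0.5, size=T)

t_stat, pvalue, crit = coint(y, x)                    # Engle-Granger style cointegration test
print(f"EG t-stat = {t_stat:.2f}, p-value = {pvalue:.3f}, 5% critical = {crit[1]:.2f}")
# Rejecting the unit-root null for the regression residuals => the two series are cointegrated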
