P1.T2.Quantitative Analysis FRM 2012 Study Notes – Vol. II

By David Harper, CFA FRM CIPM

www.bionicturtle.com


Table of Contents

Stock, Chapter 2: Review of Probability

Stock, Chapter 3: Review of Statistics

Stock, Chapter 4: Linear Regression with One Regressor

Stock, Chapter 5: Single Regression: Hypothesis Tests and Confidence Intervals

Stock, Chapter 6: Linear Regression with Multiple Regressors

Stock, Chapter 7: Hypothesis Tests and Confidence Intervals in Multiple Regression

Rachev, Menn, and Fabozzi, Chapter 2: Discrete Probability Distributions

Rachev, Menn, and Fabozzi, Chapter 3: Continuous Probability Distributions

Jorion, Chapter 12: Monte Carlo Methods

Hull, Chapter 22: Estimating Volatilities and Correlations

Allen, Boudoukh, and Saunders, Chapter 2: Quantifying Volatility in VaR Models


Stock, Chapter 2:

Review of Probability

In this chapter…

Define random variables, and distinguish between continuous and discrete random variables.

Define the probability of an event.

Define, calculate, and interpret the mean, standard deviation, and variance of a random variable.

Define, calculate, and interpret the skewness and kurtosis of a distribution.

Describe joint, marginal, and conditional probability functions.

Explain the difference between statistical independence and statistical dependence.

Calculate the mean and variance of sums of random variables.

Describe the key properties of the normal, standard normal, multivariate normal, Chi-squared, Student t, and F distributions.

Define and describe random sampling and what is meant by i.i.d.

Define, calculate, and interpret the mean and variance of the sample average.

Describe, interpret, and apply the Law of Large Numbers and the Central Limit Theorem.


Define random variables, and distinguish between continuous and

discrete random variables.

We characterize (describe) a random variable with a probability distribution. The random

variable can be discrete or continuous; and in either the discrete or continuous case, the

probability can be local (PMF, PDF) or cumulative (CDF).

A random variable is a variable whose value is determined by the outcome of an

experiment (a.k.a., stochastic variable). “A random variable is a numerical summary of

a random outcome. The number of times your computer crashes while you are writing a

term paper is random and takes on a numerical value, so it is a random variable.”—S&W

Continuous random variable

A continuous random variable (X) has an infinite number of values within an interval:

P(a \le X \le b) = \int_a^b f(x)\,dx

For the standard normal, cumulative probabilities are given by the CDF, \Phi:

\Pr(c_1 \le Z \le c_2) = \Phi(c_2) - \Phi(c_1)

\Pr(Z \le c) = \Phi(c)

In either the discrete or continuous case, the probability can be expressed locally or cumulatively:

                 Probability function (pdf, pmf)        Cumulative Distribution Function (CDF)
Discrete         Pr(X = 3)                              Pr(X ≤ 3)
Continuous       P(a ≤ X ≤ b) = ∫ f(x) dx               Pr(Z ≤ c) = Φ(c)


Discrete random variable

A discrete random variable (X) assumes a value among a finite set including x1, x2, x3 and so

on. The probability function is expressed by:

P(X = x_k) = f(x_k)

Notes on continuous versus discrete random variables

Discrete random variables can be counted. Continuous random variables must be

measured.

Examples of a discrete random variable include: coin toss (head or tails, nothing in

between); roll of the dice (1, 2, 3, 4, 5, 6); and “did the fund beat the benchmark?”(yes,

no). In risk, common discrete random variables are default/no default (0/1) and loss

frequency.

Examples of continuous random variables include: distance and time. A common

example of a continuous variable, in risk, is loss severity.

Note the similarity between the summation (∑ ) under the discrete variable and the

integral (∫) under the continuous variable. The summation (∑) of all discrete outcomes

must equal one. Similarly, the integral (∫) captures the area under the continuous

distribution function. The total area “under this curve,” from (-∞) to (∞), must equal one.
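A short Python sketch (assuming numpy and scipy are installed; not part of the original notes) illustrates both facts: the discrete probabilities sum to one, the continuous density integrates to one, and cumulative probability is a CDF difference.

```python
# Minimal sketch (assumes numpy and scipy): the discrete pmf sums to one;
# the continuous pdf integrates to one; P(a <= X <= b) is a CDF difference.
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Discrete example: a fair six-sided die
pmf = np.full(6, 1/6)
print(pmf.sum())                              # 1.0

# Continuous example: standard normal pdf integrates to 1 over its support
area, _ = quad(stats.norm.pdf, -np.inf, np.inf)
print(round(area, 6))                         # 1.0

# Local vs. cumulative: P(a <= X <= b) equals the CDF difference
a, b = -1.0, 1.0
print(stats.norm.cdf(b) - stats.norm.cdf(a))  # ~0.6827
```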

All four of the so-called sampling distributions—that each converge to the

normal—are continuous: normal, student’s t, chi-square, and F distribution.


Summary

Continuous                           Discrete
Are measured                         Are counted
Infinite                             Finite

Examples in finance:
Distance, time (e.g.)                Default (1/0) (e.g.)
Severity of loss (e.g.)              Frequency of loss (e.g.)
Asset returns (e.g.)

For example:
Normal                               Bernoulli (0/1)
Student's t                          Binomial (series of i.i.d. Bernoullis)
Chi-square                           Poisson
F distribution                       Logarithmic
Lognormal
Exponential
Gamma, Beta
EVT distributions (GPD, GEV)

Define the probability of an event.

Probability: Classical or “a priori” definition

The probability of outcome (A) is given by:

P(A) = \frac{\text{Number of outcomes favorable to } A}{\text{Total number of outcomes}}

For example, consider a craps roll of two six-sided dice. What is the probability of rolling a

seven; i.e., P[X=7]? There are six outcomes that generate a roll of seven: 1+6, 2+5, 3+4, 4+3, 5+2,

and 6+1. Further, there are 36 total outcomes. Therefore, the probability is 6/36.

In this case, the outcomes need to be mutually exclusive, equally likely, and collectively exhaustive (i.e., all possible outcomes are included in the total). A key property of a probability is that the sum of the probabilities for all (discrete) outcomes is 1.0.


Probability: Relative frequency or empirical definition

Relative frequency is based on an actual number of historical observations (or Monte Carlo

simulations). For example, here is a simulation (produced in Excel) of one hundred (100) rolls of

a single six-sided die:

Empirical Distribution
Roll      Freq.     %
1         11        11%
2         17        17%
3         18        18%
4         21        21%
5         18        18%
6         15        15%
Total     100       100%

Note the difference between an a priori probability and an empirical probability:

The a priori (classical) probability of rolling a three (3) is 1/6,

But the empirical frequency, based on this sample, is 18%. If we generate another

sample, we will produce a different empirical frequency.

This relates also to sampling variation. The a priori probability is based on population

properties; in this case, the a priori probability of rolling any number is clearly 1/6th.

However, a sample of 100 trials will exhibit sampling variation: the number of threes (3s)

rolled above varies from the parametric probability of 1/6th. We do not expect the

sample to produce 1/6th perfectly for each outcome.


Define, calculate, and interpret the mean, standard deviation, and

variance of a random variable.

If we can characterize a random variable (e.g., if we know all outcomes and that each outcome is

equally likely—as is the case when you roll a single die)—the expectation of the random

variable is often called the mean or arithmetic mean.

Mean (expected value)

Expected value is the weighted average of possible values. In the case of a discrete random

variable, expected value is given by:

E(Y) = y_1 p_1 + y_2 p_2 + \cdots + y_k p_k = \sum_{i=1}^{k} y_i p_i

In the case of a continuous random variable, expected value is given by:

E(X) = \int x\, f(x)\,dx

Variance

Variance and standard deviation are the second moment measures of dispersion. The variance of

a discrete random variable Y is given by:

\sigma_Y^2 = \operatorname{variance}(Y) = E\big[(Y - \mu_Y)^2\big] = \sum_{i=1}^{k} (y_i - \mu_Y)^2\, p_i

Variance is also expressed as the difference between the expected value of X^2 and the square

of the expected value of X. This is the more useful variance formula:

\sigma_Y^2 = E\big[(Y - \mu_Y)^2\big] = E(Y^2) - \big[E(Y)\big]^2

Please memorize this variance formula above: it comes in handy! For example, if the probability of loan default (PD) is a Bernoulli trial, what is the variance of PD?

We can solve with E[PD^2] - (E[PD])^2. As E[PD^2] = p and E[PD] = p, E[PD^2] - (E[PD])^2 = p - p^2 = p(1 - p).


Example: Variance of a single six-sided die

For example, what is the variance of a single six-sided die? First, we need to solve for the

expected value of X-squared, E[X2]. This is given by:

E[X^2] = (1^2)\tfrac{1}{6} + (2^2)\tfrac{1}{6} + (3^2)\tfrac{1}{6} + (4^2)\tfrac{1}{6} + (5^2)\tfrac{1}{6} + (6^2)\tfrac{1}{6} = \tfrac{91}{6}

Then, we need to square the expected value of X, [E(X)]2. The expected value of a single six-sided

die is 3.5 (the average outcome). So, the variance of a single six-sided die is given by:

\operatorname{Variance}(X) = E(X^2) - [E(X)]^2 = \tfrac{91}{6} - (3.5)^2 \approx 2.92

Here is the same derivation of the variance of a single six-sided die (which has a uniform

distribution) in tabular format:
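A short Python sketch (assumes numpy; illustrative only) performs the same outcome-by-outcome calculation:

```python
# Sketch of the die-variance calculation (assumes numpy is installed).
import numpy as np

faces = np.arange(1, 7)          # outcomes 1..6
probs = np.full(6, 1/6)          # uniform probabilities

e_x  = np.sum(faces * probs)     # E[X]   = 3.5
e_x2 = np.sum(faces**2 * probs)  # E[X^2] = 91/6 ~ 15.167
var  = e_x2 - e_x**2             # E[X^2] - (E[X])^2 ~ 2.92

print(e_x, e_x2, var)            # 3.5 15.1667 2.9167
```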

What is the variance of the total of two six-sided dice cast together? It is simply the Variance (X) plus the Variance (Y), or about 5.83. The reason we can simply add them together is that they are independent random variables.

Sample Variance:

The unbiased estimate of the sample variance is given by:

s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}\big(Y_i - \bar{Y}\big)^2


Properties of variance

1. \sigma_b^2 = 0 (the variance of a constant, b, is zero)

2. \sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2, only if X and Y are independent

3. \sigma_{X-Y}^2 = \sigma_X^2 + \sigma_Y^2, only if X and Y are independent

4. \sigma_{X+b}^2 = \sigma_X^2

5. \sigma_{aX}^2 = a^2\,\sigma_X^2

6. \sigma_{aX+b}^2 = a^2\,\sigma_X^2

7. \sigma_{aX+bY}^2 = a^2\,\sigma_X^2 + b^2\,\sigma_Y^2, only if X and Y are independent

8. E(XY) = E(X)\,E(Y), only if X and Y are independent

Standard deviation:

Standard deviation is given by:

\sigma_Y = \sqrt{\operatorname{var}(Y)} = \sqrt{E\big[(Y - \mu_Y)^2\big]} = \sqrt{\sum_i (y_i - \mu_Y)^2\, p_i}

As variance = standard deviation^2, standard deviation = Square Root[variance]

Sample Standard Deviation:

The unbiased estimate of the sample standard deviation is given by:

s_Y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\big(Y_i - \bar{Y}\big)^2}

This is merely the square root of the sample variance. This formula is important because

this is the technically precise way to calculate volatility.
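For instance, a short sketch (assumes numpy; the return series is hypothetical) of the unbiased sample variance and standard deviation (volatility):

```python
# Unbiased sample standard deviation (volatility) sketch; assumes numpy.
import numpy as np

returns = np.array([0.012, -0.008, 0.005, 0.020, -0.015])  # hypothetical daily returns

sample_var = np.var(returns, ddof=1)   # divides by (n - 1): the unbiased estimator
sample_vol = np.std(returns, ddof=1)   # square root of the sample variance

print(sample_var, sample_vol)
```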


Define, calculate, and interpret the skewness, and kurtosis of a

distribution.

Skewness (asymmetry)

Skewness refers to whether a distribution is symmetrical. An asymmetrical distribution is

skewed, either positively (to the right) or negatively (to the left) skewed. The measure of “relative

skewness” is given by the equation below, where zero indicates symmetry (no skewness):

\text{Skewness} = \frac{E\big[(X - \mu)^3\big]}{\sigma^3}

For example, the gamma distribution has positive skew (skew > 0):

Skewness is a measure of asymmetry

If a distribution is symmetrical, mean = median = mode. If a distribution has positive

skew, the mean > median > mode. If a distribution has negative skew, the mean <

median < mode.

Kurtosis

Kurtosis measures the degree of “peakedness” of the distribution, and consequently of

“heaviness of the tails.” A value of three (3) indicates normal peakedness. The normal

distribution has kurtosis of 3, such that “excess kurtosis” equals (kurtosis – 3).

\text{Kurtosis} = \frac{E\big[(X - \mu)^4\big]}{\sigma^4}

Note that technically skew and kurtosis are not, respectively, equal to the third and fourth

moments; rather they are functions of the third and fourth moments.

[Figure: Gamma distribution, positive (right) skew; plotted for alpha=1, beta=1; alpha=2, beta=0.5; alpha=4, beta=0.25]


A normal distribution has relative skewness of zero and kurtosis of three (or the same

idea put another way: excess kurtosis of zero). Relative skewness > 0 indicates positive

skewness (a longer right tail) and relative skewness < 0 indicates negative skewness (a

longer left tail). Kurtosis greater than three (>3), which is the same thing as saying

“excess kurtosis > 0,” indicates high peaks and fat tails (leptokurtic). Kurtosis less than

three (<3), which is the same thing as saying “excess kurtosis < 0,” indicates lower peaks.

Kurtosis is a measure of tail weight (heavy, normal, or light-tailed) and “peakedness”:

kurtosis > 3.0 (or excess kurtosis > 0) implies heavy-tails.

Financial asset returns are typically considered leptokurtic (i.e., heavy or fat- tailed)

For example, the logistic distribution exhibits leptokurtosis (heavy-tails; kurtosis > 3.0):

Univariate versus multivariate probability density functions

A single variable (univariate) probability distribution is concerned with only a single random

variable; e.g., roll of a die, default of a single obligor. A multivariate probability density

function concerns the outcome of an experiment with more than one random variable. This

includes, in the simplest case, two variables (i.e., a bivariate distribution).

              Density                          Cumulative
Univariate    f(x) = P(X = x)                  F(x) = P(X ≤ x)
Bivariate     f(x, y) = P(X = x, Y = y)        F(x, y) = P(X ≤ x, Y ≤ y)

[Figure: Logistic distribution, heavy tails (excess kurtosis > 0); plotted for alpha=0, beta=1; alpha=2, beta=1; alpha=0, beta=3; versus N(0,1)]


Describe joint, marginal, and conditional probability functions.

Stock & Watson illustrate with two variables:

The age of the computer (A), a Bernoulli such that the computer is old (0) or new (1)

The number of times the computer crashes (M)

Marginal probability functions

A marginal (or unconditional) probability is the simple case: it is the probability that does

not depend on a prior event or prior information. The marginal probability is also called the

unconditional probability.

In the following table, please note that ten joint outcomes are possible because the age variable

(A) has two outcomes and the “number of crashes” variable (M) has five outcomes. Each of the

ten outcomes is mutually exclusive and the sum of their probabilities is 1.0 or 100%. For

example, the probability that a new computer crashes once is 0.035 or 3.5%.

The marginal (unconditional) probability that a computer is new (A = 1) is the sum of joint

probabilities in the second row:

\Pr(Y = y) = \sum_{i=1}^{l} \Pr(X = x_i,\, Y = y)

\Pr(A = 1) = 0.50

               Number of crashes (M)
               M=0      M=1      M=2      M=3      M=4      Total
A=0 (Old)      0.35     0.065    0.05     0.025    0.01     0.50
A=1 (New)      0.45     0.035    0.01     0.005    0.00     0.50
Total          0.80     0.10     0.06     0.03     0.01     1.00

“The marginal probability distribution of a random variable Y is just another name for

its probability distribution. This term distinguishes the distribution of Y alone (marginal

distribution) from the joint distribution of Y and another random variable. The marginal

distribution of Y can be computed from the joint distribution of X and Y by adding up the

probabilities of all possible outcomes for which Y takes on a specified value”—S&W


Joint probability functions

The joint probability is the probability that the random variables (in this case, both random

variables) take on certain values simultaneously.

\Pr(X = x,\, Y = y); for example, \Pr(A = 0,\, M = 0) = 0.35

               Number of crashes (M)
               M=0      M=1      M=2      M=3      M=4      Total
A=0 (Old)      0.35     0.065    0.05     0.025    0.01     0.50
A=1 (New)      0.45     0.035    0.01     0.005    0.00     0.50
Total          0.80     0.10     0.06     0.03     0.01     1.00

“The joint probability distribution of two discrete random variables, say X and Y, is the

probability that the random variables simultaneously take on certain values, say x and y.

The probabilities of all possible ( x, y) combinations sum to 1. The joint probability

distribution can be written as the function Pr(X = x, Y = y).” —S&W

Conditional probability functions

Conditional is the probability of an outcome given (conditional on) another outcome:

\Pr(Y = y \mid X = x) = \frac{\Pr(X = x,\, Y = y)}{\Pr(X = x)}

For example, \Pr(M = 0 \mid A = 0) = 0.35 / 0.50 = 0.70

               Number of crashes (M)
               M=0      M=1      M=2      M=3      M=4      Total
A=0 (Old)      0.35     0.065    0.05     0.025    0.01     0.50
A=1 (New)      0.45     0.035    0.01     0.005    0.00     0.50
Total          0.80     0.10     0.06     0.03     0.01     1.00

“The distribution of a random variable Y conditional on another random variable X taking

on a specific value is called the conditional distribution of Y given X. The conditional

probability that Y takes on the value y when X takes on the value x is written:

Pr(Y = y | X = x).” –S&W


Conditional probability = Joint Probability/Marginal Probability

What is the probability of B occurring, given that A has already occurred?

P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \quad \text{equivalently} \quad P(B \mid A)\,P(A) = P(A \cap B)

Conditional and unconditional expectation

An unconditional expectation is the expected value of the variable without any restrictions (or

lacking any prior information).

A conditional expectation is an expected value for the variable conditional on prior

information or some restriction (e.g., the value of a correlated variable). The conditional

expectation of Y, conditional on X = x, is given by:

E(Y \mid X = x)

The conditional variance of Y, conditional on X=x, is given by:

\operatorname{var}(Y \mid X = x)

The two-variable regression is an important conditional expectation. In this case, we say the expected Y is conditional on X: E(Y_i \mid X_i) = B_1 + B_2 X_i

For Example: Two Stocks (S) and (T)

For example, consider two stocks. Assume that both Stock (S) and Stock (T) can each only reach

three price levels. Stock (S) can achieve: $10, $15, or $20. Stock (T) can achieve: $15, $20, or $30.

Historically, assume we witnessed 26 outcomes and they were distributed as follows.

Note: S takes the values $10, $15, or $20; T takes the values $15, $20, or $30.

            S=$10    S=$15    S=$20    Total
T=$15       0        2        2        4
T=$20       3        4        3        10
T=$30       3        6        3        12
Total       6        12       8        26

What is the joint probability?

A joint probability is the probability that both random variables will have a certain outcome.

Here the joint probability P(S=$20, T=$30) = 3/26.


What is the marginal (unconditional) probability?

The unconditional probability of the outcome where S=$20 = 8/26 because there are eight

events out of 26 total events that produce S=$20. The unconditional probability P(S=20) = 8/26

What is the conditional probability?

Instead we can ask a conditional probability question: “What is the probability that S=$20 given

that T=$20?” The probability that S=$20 conditional on the knowledge that T=$20 is 3/10

because among the 10 events that produce T=$20, three are S=$20.

P(S = \$20 \mid T = \$20) = \frac{P(S = \$20,\, T = \$20)}{P(T = \$20)} = \frac{3}{10}

In summary:

The unconditional probability P(S=20) = 8/26

The conditional probability P(S=20 | T=20) = 3/10

The joint probability P(S=20,T=30) = 3/26
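A short sketch (assumes numpy; not part of the original notes) computes the same joint, marginal, and conditional probabilities from the count table:

```python
# Joint, marginal, and conditional probabilities from the 26-outcome count table.
import numpy as np

# Rows: T = $15, $20, $30; columns: S = $10, $15, $20
counts = np.array([[0, 2, 2],
                   [3, 4, 3],
                   [3, 6, 3]])
total = counts.sum()                                 # 26

joint_S20_T30 = counts[2, 2] / total                 # P(S=20, T=30) = 3/26
marginal_S20  = counts[:, 2].sum() / total           # P(S=20)       = 8/26
cond_S20_given_T20 = counts[1, 2] / counts[1, :].sum()   # P(S=20 | T=20) = 3/10

print(joint_S20_T30, marginal_S20, cond_S20_given_T20)
```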

Explain the difference between statistical independence and statistical

dependence.

X and Y are independent if the conditional distribution of Y given X equals the marginal distribution of Y. Since independence implies Pr(Y = y | X = x) = Pr(Y = y):

\Pr(Y = y \mid X = x) = \frac{\Pr(X = x,\, Y = y)}{\Pr(X = x)}

The most useful test of statistical independence is given by:

\Pr(X = x,\, Y = y) = \Pr(X = x)\,\Pr(Y = y)

X and Y are independent if their joint distribution is equal to the product of their

marginal distributions.

Statistical independence is when the value taken by one variable has no effect on the value

taken by the other variable. If the variables are independent, their joint probability will equal

the product of their marginal probabilities. If they are not independent, they are dependent.


For example, when rolling two dice, the second will be independent

of the first.

This independence implies that the probability of rolling double-

sixes is equal to the product of P(rolling one six) and P(rolling one

six). If two dice are independent, then P(first roll = 6, second roll = 6) = P(rolling a six) * P(rolling a six). And, indeed: 1/36 = (1/6)*(1/6).

Calculate the mean and variance of sums of random variables.

Mean

E(a + bX + cY) = a + b\,\mu_X + c\,\mu_Y

Variance

In regard to the sum of correlated variables, the variance of correlated variables is given by the

following (note the two expressions; the second merely substitutes the covariance with the

product of correlation and volatilities. Please make sure you are comfortable with this

substitution).

\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 + 2\,\sigma_{XY}, \quad \text{and given that } \sigma_{XY} = \rho_{XY}\,\sigma_X\,\sigma_Y:

\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 + 2\,\rho_{XY}\,\sigma_X\,\sigma_Y

In regard to the difference between correlated variables, the variance of correlated variables is

given by:

\sigma_{X-Y}^2 = \sigma_X^2 + \sigma_Y^2 - 2\,\sigma_{XY}, \quad \text{and given that } \sigma_{XY} = \rho_{XY}\,\sigma_X\,\sigma_Y:

\sigma_{X-Y}^2 = \sigma_X^2 + \sigma_Y^2 - 2\,\rho_{XY}\,\sigma_X\,\sigma_Y

Variance with constants (a) and (b)

Variance of sum includes covariance (X,Y):

\operatorname{variance}(aX + bY) = a^2\,\sigma_X^2 + 2ab\,\sigma_{XY} + b^2\,\sigma_Y^2

If X and Y are independent, the covariance term drops out and the variance simply adds:

\operatorname{variance}(X + Y) = \sigma_X^2 + \sigma_Y^2
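For example, a short sketch (assumes numpy; the volatilities, correlation, and weights are hypothetical) of the variance of a weighted sum of correlated variables:

```python
# Two-asset (correlated) variance sketch; hypothetical volatilities and correlation.
import numpy as np

sigma_x, sigma_y = 0.20, 0.10   # volatilities of X and Y
rho = 0.30                      # correlation between X and Y
a, b = 0.6, 0.4                 # constant weights

cov_xy = rho * sigma_x * sigma_y
var_sum = a**2 * sigma_x**2 + 2*a*b*cov_xy + b**2 * sigma_y**2

print(cov_xy, var_sum, np.sqrt(var_sum))
```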


Describe the key properties of the normal, standard normal, multivariate

normal, Chi-squared, Student t, and F distributions.

Normal distribution

Key properties of the normal:

Symmetrical around mean; skew = 0

Parsimony: Only requires (is fully described by) two parameters: mean and variance

Summation stability: a linear combination (function) of two normally distributed random

variables is itself normally distributed

Kurtosis = 3 (excess kurtosis = 0)

The normal distribution is commonplace for at least three reasons:

The central limit theorem (CLT) says that sampling distribution of sample means tends

to be normal (i.e., converges toward a normally shaped distributed) regardless of the

shape of the underlying distribution; this explains much of the “popularity” of the normal

distribution.

The normal is economical (elegant) because it only requires two parameters (mean

and variance). The standard normal is even more economical: it requires no

parameters.

The normal is tractable: it is easy to manipulate (especially in regard to closed-form

equations like the Black-Scholes)

[Figure: normal probability density function]

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}


Standard normal distribution

A normal distribution is fully specified by two parameters, mean and variance (or standard

deviation). We can transform a normal into a unit or standardized variable:

Z = \frac{X - \mu}{\sigma}

The standard normal has mean = 0 and variance = 1: no parameters required!

This unit or standardized variable is normally distributed with zero mean and variance of

one (1.0). Its standard deviation is also one (variance = 1.0 and standard deviation = 1.0). This is

written as: Variable Z is approximately (“asymptotically”) normally distributed: Z ~ N(0,1)

Standard normal distribution: Critical Z values:

Key locations on the normal distribution are noted below. In the FRM curriculum, the choice of

one-tailed 5% significance and 1% significance (i.e., 95% and 99% confidence) is common, so

please pay particular attention to the yellow highlights:

Critical z value      Two-sided confidence      One-sided significance
1.00                  ~ 68%                     ~ 15.87%
1.645 (~1.65)         ~ 90%                     ~ 5.0%
1.96                  ~ 95%                     ~ 2.5%
2.327 (~2.33)         ~ 98%                     ~ 1.0%
2.58                  ~ 99%                     ~ 0.5%

Memorize two common critical values: 1.65 and 2.33. These correspond to confidence

levels, respectively, of 95% and 99% for a one-tailed test. For VAR, the one-tailed test is

relevant because we are concerned only about losses (left-tail) not gains (right-tail).

Multivariate normal distributions

Normal can be generalized to a joint distribution

of normal; e.g., bivariate normal distribution.

Properties include:

1. If X and Y are bivariate normal, then aX + bY is normal;

any linear combination is normal

2. If a set of variables has a multivariate normal distribution,

the marginal distribution of each is normal

3. If variables with a multivariate normal distribution have covariances that equal zero,

then the variables are independent


Chi-squared distribution

For the chi-square distribution, we observe a sample variance and compare to hypothetical

population variance. This variable has a chi-square distribution with (n-1) d.f.:

\frac{(n-1)\,s^2}{\sigma^2} \sim \chi^2_{(n-1)}

Chi-squared distribution is the sum of m squared independent standard normal random

variables. Properties of the chi-squared distribution include:

Nonnegative (>0)

Skewed right, but as d.f. increases it approaches normal

Expected value (mean) = k, where k = degrees of freedom

Variance = 2k, where k = degrees of freedom

The sum of two independent chi-square variables is also a chi-squared variable

Chi-squared distribution: For example (Google’s stock return variance)

Google’s sample variance over 30 days is 0.0263%. We can test the hypothesis that the

population variance (Google’s “true” variance) is 0.02%. The chi-square variable = 38.14:

Sample variance (30 days)     0.0263%
Degrees of freedom (d.f.)     29
Population variance?          0.0200%
Chi-square variable           38.14     = 0.0263% / 0.02% * 29
p value (=CHIDIST)            11.93%    (at 29 d.f., the 0.10 critical value = 39.0875)
Area under curve (1 - p)      88.07%

With 29 degrees of freedom (d.f.), 38.14 corresponds to a p value of roughly 10% to 12% (it lies to the left of the 0.10 critical value on the lookup table). Therefore, we can reject the null with only about 88% confidence; i.e., we are likely to accept (fail to reject) the null that the true variance is 0.02%.
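A brief sketch (assumes scipy is installed) reproduces this chi-square variance test:

```python
# Chi-square test of a single variance (sketch; assumes scipy is installed).
from scipy import stats

sample_var = 0.000263    # 0.0263%
hypo_var   = 0.000200    # 0.0200% under the null
n = 30
df = n - 1

chi_sq = (n - 1) * sample_var / hypo_var      # ~38.14
p_value = stats.chi2.sf(chi_sq, df)           # one-tailed p ~ 0.119

print(chi_sq, p_value)
```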

[Figure: Chi-square distribution pdf for k = 2, k = 5, and k = 29 degrees of freedom]


Student t’s distribution

The student’s t distribution (t distribution) is among the most commonly used distributions. As

the degrees of freedom (d.f.) increases, the t-distribution converges with the normal

distribution. It is similar to the normal, except it exhibits slightly heavier tails (the lower the d.f..,

the heavier the tails). The student’s t variable is given by:

t = \frac{\bar{X} - \mu_X}{S_X / \sqrt{n}}

Properties of the t-distribution:

Like the normal, it is symmetrical

Like the standard normal, it has mean of zero (mean = 0)

Its variance = k/(k-2) where k = degrees of freedom. Note, as k increases, the variance

approaches 1.0. Therefore, as k increases, the t-distribution approximates the

standard normal distribution.

Always slightly heavy-tailed (kurtosis > 3.0), but converges to the normal. Still, the student's t is not considered a truly heavy-tailed distribution.

In practice, the student's t is the most commonly used distribution. When we test the significance of regression coefficients, the central limit theorem (CLT) justifies the normal distribution (because the coefficients are effectively sample means). But we rarely know the population variance, such that the student's t is the appropriate distribution.

When the d.f. is large (e.g., sample over ~30), as the student’s t approximates the

normal, we can use the normal as a proxy. In the assigned Stock & Watson, the sample

sizes are large (e.g., 420 students), so they tend to use the normal.

[Figure: Student's t distribution versus the normal, for 2 and 20 degrees of freedom]


Student t’s distribution: For example

For example, Google’s average periodic return over a ten-day sample period was +0.02% with

sample standard deviation of 1.54%. Here are the statistics:

Sample Mean                    0.02%
Sample Std Dev                 1.54%
Days (n)                       10
Confidence                     95%
Significance (1 - confidence)  5%
Critical t                     2.262
Lower limit                    -1.08%
Upper limit                    1.12%

The sample mean is a random variable. If we know the population variance, we assume the

sample mean is normally distributed. But if we do not know the population variance (typically

the case!), the sample mean is a random variable following a student’s t distribution.

In the Google example above, we can use this to construct a confidence (random) interval:

\bar{X} \pm t\,\frac{s}{\sqrt{n}}

We need the critical (lookup) t value. The critical t value is a function of:

Degrees of freedom (d.f.); e.g., 10-1 =9 in this example, and

Significance; e.g., 1-95% confidence = 5% in this example

The 95% confidence interval can be computed. The upper limit is given by:

\bar{X} + 2.262 \times \frac{1.54\%}{\sqrt{10}} = 1.12\%

And the lower limit is given by:

\bar{X} - 2.262 \times \frac{1.54\%}{\sqrt{10}} = -1.08\%

Please make sure you can take a sample standard deviation, compute the critical t value

and construct the confidence interval.
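A brief sketch (assumes scipy is installed) reproduces this confidence interval from the summary statistics:

```python
# 95% confidence interval for the mean using the Student's t (sketch; assumes scipy).
import math
from scipy import stats

mean, s, n = 0.0002, 0.0154, 10          # sample mean 0.02%, std dev 1.54%, 10 days
se = s / math.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)    # ~2.262 for 9 d.f. (two-tailed 95%)

lower, upper = mean - t_crit * se, mean + t_crit * se
print(t_crit, lower, upper)              # approximately -1.08% and +1.12%
```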


Both the normal (Z) and student's t (t) distribution characterize the sampling distribution of the sample mean. The difference is that the normal is used when we know the population variance; the student's t is used when we must rely on the sample variance. In practice, we don't know the population variance, so the student's t is typically appropriate.

Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} \qquad t = \frac{\bar{X} - \mu_X}{S_X / \sqrt{n}}

F-Distribution

The F distribution is also called the variance ratio distribution (it may be helpful to think of it as

the variance ratio!). The F ratio is the ratio of sample variances, with the greater sample variance

in the numerator:

F = \frac{s_x^2}{s_y^2}

Properties of F distribution:

Nonnegative (>0)

Skewed right

Like the chi-square distribution, as d.f. increases, approaches normal

The square of a t-distributed r.v. with k d.f. has an F distribution with (1, k) d.f.

As the denominator d.f. n grows large, m \cdot F(m, n) approaches a chi-square variable with m d.f.

[Figure: F distribution pdf for (19, 19) and (9, 9) degrees of freedom]


F-Distribution: For example

For example, based on two 10-day samples, we calculated the sample variance of Google and

Yahoo. Google’s variance was 0.0237% and Yahoo’s was 0.0084%. The F ratio, therefore, is 2.82

(divide higher variance by lower variance; the F ratio must be greater than, or equal to, 1.0).

                GOOG       YHOO
=VAR()          0.0237%    0.0084%
=COUNT()        10         10
F ratio         2.82
Confidence      90%
Significance    10%
=FINV()         2.44

At 10% significance, with (10-1) and (10-1) degrees of freedom, the critical F value is 2.44.

Because our F ratio of 2.82 is greater than (>) 2.44, we reject the null (i.e., that the population

variances are the same). We conclude the population variances are different.
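A brief sketch (assumes scipy is installed) reproduces this F test:

```python
# F test comparing two sample variances (sketch; assumes scipy is installed).
from scipy import stats

var_goog, var_yhoo = 0.000237, 0.000084
n1 = n2 = 10

f_ratio = var_goog / var_yhoo                         # ~2.82 (larger variance on top)
f_crit  = stats.f.ppf(0.90, dfn=n1 - 1, dfd=n2 - 1)   # ~2.44 at 10% significance
p_value = stats.f.sf(f_ratio, dfn=n1 - 1, dfd=n2 - 1)

print(f_ratio, f_crit, p_value)   # reject the null if f_ratio > f_crit
```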

Moments of a distribution

The k-th moment about the mean (x̄) is given by:

k\text{-th moment} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^k}{n}

In this way, the difference of each data point from the mean is raised to a power (k=1, k=2, k=3, and k=4). These are the four moments of the distribution:

If k=1, this refers to the first moment about zero: the mean.

If k=2, this refers to the second moment about the mean: the variance.

If k=3, this refers to the third moment about the mean: skewness.

If k=4, this refers to the fourth moment about the mean: tail density and peakedness (kurtosis).
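For illustration, a short sketch (assumes numpy and scipy; the return series is hypothetical) computes sample versions of these moments:

```python
# Sample moments sketch (assumes numpy/scipy): mean, variance, skew, excess kurtosis.
import numpy as np
from scipy import stats

x = np.array([0.01, -0.02, 0.015, 0.03, -0.01, 0.005])   # hypothetical returns

mean  = x.mean()                          # first moment
var   = np.mean((x - mean) ** 2)          # second central moment
skew  = stats.skew(x)                     # standardized third moment
xkurt = stats.kurtosis(x, fisher=True)    # excess kurtosis (kurtosis - 3)

print(mean, var, skew, xkurt)
```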


Define and describe random sampling and what is meant by i.i.d.

A random sample is a sample of random variables that are independent and identically

distributed (i.i.d.)

Independent and identically distributed (i.i.d.) variables:

Each random variable has the same (identical) probability distribution (PDF/PMF, CDF)

distribution

Each random variable is drawn independently of the others; no serial- or auto-

correlation

The concept of independent and identically distributed (i.i.d.) variables is a key assumption we often encounter: to scale volatility by the square root of time requires i.i.d. returns. If returns are not i.i.d., then scaling volatility by the square root of time will give an incorrect answer.

Define, calculate, and interpret the mean and variance of the sample

average.

The sample mean is given by:

E(\bar{Y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \mu_Y

The variance of the sample mean is given by:

2

variance( ) Std Dev( )Y YY

Y Yn n

Independent: not (auto)correlated. Identical: same mean and same variance (homoskedastic).


We expect the sample mean to equal the population mean

The sample mean is denoted by Ȳ. The expected value of the sample mean is, as you might expect, the population mean:

E(\bar{Y}) = \mu_Y

This formula says, "we expect the average of our sample will equal the average of the population" (the over-bar signifies the sample mean; the Greek mu signifies the population mean).

Sampling distribution of the sample mean

If either: (i) the population is infinite and random sampling, or (ii) finite population and

sampling with replacement, the variance of the sampling distribution of means is:

\sigma_{\bar{Y}}^2 = E\big[(\bar{Y} - \mu_Y)^2\big] = \frac{\sigma_Y^2}{n}

This says, "The variance of the sample mean is equal to the population variance divided by the sample size." For example, the (population) variance of a single six-sided die is 2.92. If we roll three dice (i.e., sampling "with replacement"), then the variance of the sampling distribution = (2.92 / 3) ≈ 0.97.

If the population is size (N), if the sample size n ≤ N, and if sampling is conducted "without replacement," then the variance of the sampling distribution of means is given by:

\sigma_{\bar{Y}}^2 = \frac{\sigma_Y^2}{n}\cdot\frac{N - n}{N - 1}

Standard error is the standard deviation of the sample mean

The standard error is the standard deviation of the sampling distribution of the estimator,

and the sampling distribution of an estimator is a probability (frequency distribution) of the

estimator (i.e., a distribution of the set of values of the estimator obtained from all possible

same-size samples from a given population). For a sample mean (per the central limit theorem!),

the variance of the estimator is the population variance divided by sample size. The

standard error is the square root of this variance; the standard error is a standard deviation:

se(\bar{Y}) = \sqrt{\frac{\sigma_Y^2}{n}} = \frac{\sigma_Y}{\sqrt{n}}


If the population is distributed with mean μ and variance σ² but the distribution is not a normal distribution, then the standardized variable given by Z below is asymptotically normal; i.e., as (n) approaches infinity (∞) the distribution becomes normal.

Z = \frac{\bar{Y} - \mu_Y}{se(\bar{Y})} = \frac{\bar{Y} - \mu_Y}{\sigma_Y/\sqrt{n}} \sim N(0, 1)

The denominator is the standard error: which is simply the name for the standard

deviation of sampling distribution.

Describe, interpret, and apply the Law of Large Numbers and the Central

Limit Theorem.

In brief:

Law of large numbers: under general conditions, the sample mean (Ӯ) will be near the

population mean.

Central limit theorem (CLT): As the sample size increases, regardless of the underlying

distribution, the sampling distributions approximates (tends toward) normal

Central limit theorem (CLT)

We assume a population with a known mean and finite variance, but not necessarily a normal

distribution (we may not know the distribution!). Random samples of size (n) are then

drawn from the population. The expected value of each random variable is the population’s

mean. Further, the variance of each random variable is equal the population’s variance divided

by n (note: this is equivalent to saying the standard deviation of each random variable is equal to

the population’s standard deviation divided by the square root of n).

The central limit theorem says that this random variable (i.e., of sample size n, drawn from the

population) is itself normally distributed, regardless of the shape of the underlying

population. Given a population described by any probability distribution having mean (μ) and finite variance (σ²), the distribution of the sample mean computed from samples (where each sample equals size n) will be approximately normal. Generally, if the size of the sample is at least 30 (n ≥ 30), then we can assume the sample mean is approximately normal!


Each sample has a sample mean. There are many sample means. The sample means have

variation: a sampling distribution. The central limit theorem (CLT) says the sampling

distribution of sample means is asymptotically normal.

Summary of central limit theorem (CLT):

We assume a population with a known mean and finite variance, but not necessarily a

normal distribution.

Random samples (size n) drawn from the population.

The expected value of each random variable is the population mean

The distribution of the sample mean computed from samples (where each sample equals

size n) will be approximately (asymptotically) normal.

The variance of each random variable is equal to population variance divided by n

(equivalently, the standard deviation is equal to the population standard deviation

divided by the square root of n).

Sample Statistics and Sampling Distributions

When we draw from (or take) a sample, the sample is a random variable with its own

characteristics. The “standard deviation of a sampling distribution” is called the

standard error. The mean of the sample or the sample mean is a random variable defined by:

\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}

The individual draws need not be normal, but the sample mean (and sum) tends toward a normal distribution (if the variance is finite).
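A simulation sketch of the CLT (assumes numpy; the uniform population is an arbitrary illustrative choice):

```python
# CLT simulation sketch (assumes numpy): sample means of a non-normal
# (uniform) population behave approximately normally for moderate n.
import numpy as np

rng = np.random.default_rng(42)
n, num_samples = 30, 10_000

# Population: uniform on [0, 1), with mean 0.5 and variance 1/12
samples = rng.uniform(0, 1, size=(num_samples, n))
sample_means = samples.mean(axis=1)

print(sample_means.mean())   # ~0.5  (the population mean)
print(sample_means.var())    # ~(1/12)/30 ~ 0.00278 (population variance / n)
```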


Stock, Chapter 3:

Review of Statistics

In this chapter…

Describe and interpret estimators of the sample mean and their properties.

Describe and interpret the least squares estimator.

Define, interpret and calculate the critical t-values.

Define, calculate and interpret a confidence interval.

Describe the properties of point estimators:
  Distinguish between unbiased and biased estimators
  Define an efficient estimator and consistent estimator

Explain and apply the process of hypothesis testing:
  Define and interpret the null hypothesis and the alternative hypothesis
  Distinguish between one-sided and two-sided hypotheses
  Describe the confidence interval approach to hypothesis testing
  Describe the test of significance approach to hypothesis testing
  Define, calculate and interpret type I and type II errors
  Define and interpret the p value

Define, calculate, and interpret the sample variance, sample standard deviation, and standard error.

Define, calculate, and interpret confidence intervals for the population mean.

Perform and interpret hypothesis tests for the difference between two means.

Define, describe, apply, and interpret the t-statistic when the sample size is small.

Interpret scatterplots.

Define, describe, and interpret the sample covariance and correlation.

Describe and interpret estimators of the sample mean and their

properties.

An estimator is a function of a sample of data to be drawn randomly from a population.

An estimate is the numerical value of the estimator when it is actually computed using data from

a specific sample. An estimator is a random variable because of randomness in selecting the

sample, while an estimate is a nonrandom number.


The sample mean, Ӯ, is the best linear unbiased estimator (BLUE). In the Stock & Watson

example, the average (mean) wage among 200 people is $22.64:

Sample Mean $22.64

Sample Standard Deviation $18.14

Sample size (n) 200

Standard Error 1.28

H0: Population Mean = $20.00

Test t statistic 2.06

p value 4.09%

Please note:

The average wage of (n =) 200 observations is $22.64
The standard deviation of this sample is $18.14
The standard error of the sample mean is $1.28 because $18.14/SQRT(200) = $1.28
The degrees of freedom (d.f.) in this case are 199 = 200 - 1

“An estimator is a recipe for obtaining an estimate of a population parameter. A simple

analogy explains the core idea: An estimator is like a recipe in a cook book; an estimate

is like a cake baked according to the recipe.” Barreto & Howland, Introductory

Econometrics

In the above example, the sample mean is an estimator of the unknown, true population mean

(in this case, the same mean estimator gives an estimate of $22.64).

What makes one estimator superior to another?

Unbiased: the mean of the sampling distribution is the population mean (mu)

Consistent: when the sample size is large, the uncertainty about the value of μY arising from random variations in the sample is very small.

Variance and efficiency: among all unbiased estimators, the estimator that has the smallest variance is "efficient."

If the sample is random (i.i.d.), the sample mean is the Best Linear Unbiased Estimator

(BLUE). The sample mean is:

Consistent, AND

The most EFFICIENT among all linear UNBIASED estimators of the population mean

Page 31: FRM 2

FRM 2012 QUANTITATIVE ANALYSIS 30 www.bionicturtle.com

Describe and interpret the least squares estimator.

The estimator (m) that minimizes the sum of squared gaps (Yi – m) is called the least squares

estimator:

\sum_{i=1}^{n} (Y_i - m)^2

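A tiny numerical sketch (assumes numpy; the observations are hypothetical) confirms that the sample mean is the value of m that minimizes the sum of squared gaps:

```python
# The sample mean minimizes the sum of squared gaps (least squares sketch; assumes numpy).
import numpy as np

y = np.array([22.0, 18.5, 25.0, 30.0, 17.7])          # hypothetical observations

candidates = np.linspace(y.min(), y.max(), 1001)       # candidate values of m
sse = [np.sum((y - m) ** 2) for m in candidates]       # sum of squared gaps for each m

best_m = candidates[int(np.argmin(sse))]
print(best_m, y.mean())                                # best_m ~ sample mean
```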

Define, interpret and calculate critical t‐values.

The t-statistic or t-ratio is given by:

t = \frac{\bar{Y} - \mu_{Y,0}}{SE(\bar{Y})}

The critical t-value or “lookup” t-value is the t-value for which the test just rejects the null

hypothesis at a given significance level. For example:

95% two-tailed (2T) critical t-value with 20 d.f. is 2.086

Significance test: is t-statistic > critical (lookup) t?

The critical t-values bound a region within the student’s distribution that is a specific

percentage (90%? 95%? 99%?) of the total area under the student’s t distribution curve. The

student’s t distribution with (n-1) degrees of freedom (d.f.) has a confidence interval given by:

Y YY

S SY t Y t

n n

For example: critical t

If the (small) sample size is 20, then the 95% two-tailed critical t is 2.093. That is because the

degrees of freedom are 19 (d.f. = n - 1) and if we review the lookup table below (corresponds to Gujarati A-2) under the column 0.025 (one-tail) / 0.05 (two-tail) and row = 19, then we find the cell

value = 2.093. Therefore, given 19 d.f., 95% of the area under the student’s t distribution is

bounded by +/- 2.093. Specifically, P(-2.093 ≤ t ≤ 2.093) = 95%.

Please note, further because the distribution is symmetrical (skew=0), 5% among both tails

implies 2.5% in the left-tail.


Student’s t Lookup Table

Excel function: = TINV(two-tailed probability [larger #], d.f.)

1-tail: 0.25 0.1 0.05 0.025 0.01 0.005 0.001

d.f. 2-tail: 0.50 0.2 0.1 0.05 0.02 0.01 0.002

1 1.000 3.078 6.314 12.706 31.821 63.657 318.309

2 0.816 1.886 2.920 4.303 6.965 9.925 22.327

3 0.765 1.638 2.353 3.182 4.541 5.841 10.215

4 0.741 1.533 2.132 2.776 3.747 4.604 7.173

5 0.727 1.476 2.015 2.571 3.365 4.032 5.893

6 0.718 1.440 1.943 2.447 3.143 3.707 5.208

7 0.711 1.415 1.895 2.365 2.998 3.499 4.785

8 0.706 1.397 1.860 2.306 2.896 3.355 4.501

9 0.703 1.383 1.833 2.262 2.821 3.250 4.297

10 0.700 1.372 1.812 2.228 2.764 3.169 4.144

11 0.697 1.363 1.796 2.201 2.718 3.106 4.025

12 0.695 1.356 1.782 2.179 2.681 3.055 3.930

13 0.694 1.350 1.771 2.160 2.650 3.012 3.852

14 0.692 1.345 1.761 2.145 2.624 2.977 3.787

15 0.691 1.341 1.753 2.131 2.602 2.947 3.733

16 0.690 1.337 1.746 2.120 2.583 2.921 3.686

17 0.689 1.333 1.740 2.110 2.567 2.898 3.646

18 0.688 1.330 1.734 2.101 2.552 2.878 3.610

19 0.688 1.328 1.729 2.093 2.539 2.861 3.579

20 0.687 1.325 1.725 2.086 2.528 2.845 3.552

21 0.686 1.323 1.721 2.080 2.518 2.831 3.527

22 0.686 1.321 1.717 2.074 2.508 2.819 3.505

23 0.685 1.319 1.714 2.069 2.500 2.807 3.485

24 0.685 1.318 1.711 2.064 2.492 2.797 3.467

25 0.684 1.316 1.708 2.060 2.485 2.787 3.450

26 0.684 1.315 1.706 2.056 2.479 2.779 3.435

27 0.684 1.314 1.703 2.052 2.473 2.771 3.421

28 0.683 1.313 1.701 2.048 2.467 2.763 3.408

29 0.683 1.311 1.699 2.045 2.462 2.756 3.396

30 0.683 1.310 1.697 2.042 2.457 2.750 3.385

The green shaded area represents values less than three (< 3.0). Think of it as the “sweet

spot.” For confidences less than 99% and d.f. > 13, the critical t is always less than 3.0. So, for

example, a computed t of 7 or 13 will generally be significant. Keep this in mind because in

many cases, you do not need to refer to the lookup table if the computed t is large; you can

simply reject the null.
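These critical (lookup) values can also be computed directly rather than read from the table; a short sketch (assumes scipy is installed), analogous to the Excel TINV function above:

```python
# Critical t values via scipy (sketch); equivalent to the Excel TINV lookup.
from scipy import stats

df = 19
t_two_tail_95 = stats.t.ppf(1 - 0.05 / 2, df)   # ~2.093 (5% two-tailed significance)
t_one_tail_95 = stats.t.ppf(1 - 0.05, df)       # ~1.729 (5% one-tailed significance)

print(t_two_tail_95, t_one_tail_95)
```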


Define, calculate and interpret a confidence interval.

The confidence interval uses the product of [standard error х critical “lookup” t]. In the Stock

& Watson example, the confidence interval is given by 22.64 +/- (1.28)(1.96) because 1.28 is the

standard error and 1.96 is the critical t (critical Z) value associated with 95% two-tailed

confidence:

Sample Mean $22.64

Sample Std Deviation $18.14

Sample size (n) 200

Standard Error 1.28

Confidence 95%

Critical t 1.972

Lower limit $20.11

Upper limit $25.17

Confidence Intervals: Another example with a sample of 28 P/E ratios

Assume we have price-to-earnings ratios (P/E ratios) of 28 NYSE companies:

Mean                 23.25
Variance             90.13
Std Dev              9.49
Count                28
d.f.                 27
Confidence (1-α)     95%
Significance (α)     5%
Critical t           2.052
Standard error       1.794

Lower limit          19.6     = 23.25 - (2.052)*(1.794)
Upper limit          26.9     = 23.25 + (2.052)*(1.794)

Hypothesis           18.5
t value              2.65     = (23.25 - 18.5) / (1.794)
p value              1.3%     Reject null with 98.7% confidence

The confidence coefficient is selected by the user; e.g., 95% (0.95) or 99% (0.99). The significance = 1 – confidence coefficient.

95% CI for \mu_Y: \bar{Y} \pm 1.96\,SE(\bar{Y}); here, 22.64 \pm 1.972 \times 1.28


To construct a confidence interval with the dataset above:

Determine degrees of freedom (d.f.). d.f. = sample size – 1. In this case, 28 – 1 = 27 d.f.

Select confidence. In this case, confidence coefficient = 0.95 = 95%

We are constructing an interval, so we need the critical t value for 5% significance with

two-tails.

The critical t value is equal to 2.052. That’s the value with 27 d.f. and either 2.5% one-

tailed significance or 5% two-tailed significance (see how they are the same provided the

distribution is symmetrical?)

The standard error is equal to the sample standard deviation divided by the square root of the sample size (not d.f.!). In this case, 9.49/SQRT(28) ≈ 1.794.

The lower limit of the confidence interval is given by: the sample mean minus the

critical t (2.052) multiplied by the standard error (9.49/SQRT[28]).

The upper limit of the confidence interval is given by: the sample mean plus the

critical t (2.052) multiplied by the standard error (9.49/SQRT[28]).

\bar{X} - t\,\frac{S_X}{\sqrt{n}} \;\le\; \mu_X \;\le\; \bar{X} + t\,\frac{S_X}{\sqrt{n}}: \qquad 23.25 - 2.052\,\frac{9.49}{\sqrt{28}} \;\le\; \mu \;\le\; 23.25 + 2.052\,\frac{9.49}{\sqrt{28}}
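A short sketch (assumes scipy is installed) reproduces this interval from the summary statistics:

```python
# Confidence interval from summary statistics (sketch; assumes scipy is installed).
import math
from scipy import stats

mean, std_dev, n = 23.25, 9.49, 28
se = std_dev / math.sqrt(n)                 # ~1.794
t_crit = stats.t.ppf(0.975, df=n - 1)       # ~2.052 for 27 d.f.

lower, upper = mean - t_crit * se, mean + t_crit * se
print(lower, upper)                          # ~19.6 to ~26.9
```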

This confidence interval is a random interval. Why? Because it will vary randomly with

each sample, whereas we assume the population mean is static.

We don’t say the probability is 95% that the “true” population mean lies within

this interval. That implies the true mean is variable. Instead, we say the

probability is 95% that the random interval contains the true mean. See how the

population mean is trusted to be static and the interval varies?


Describe the properties of point estimators:

An estimator is a function of a sample of data to be drawn randomly from a population.

An estimate is the numerical value of the estimator when it is actually computed using

data from a specific sample.

The key properties of point estimators include:

Linearity: estimator is a linear function of sample observations. For example, the sample

mean is a linear function of the observations.

Unbiasedness: the average or expected value of the estimator is equal to the true value

of the parameter.

Minimum variance: the variance of the estimator is smaller than any “competing”

estimator. Note: an estimator can have minimum variance yet be biased.

Efficiency: Among the set of unbiased estimators, the estimator with the minimum

variance is the efficient estimator (i.e., it has the smallest variance among unbiased

estimators)

Best linear unbiased estimator (BLUE): the estimator that combines three properties: (i) linear, (ii) unbiased, and (iii) minimum variance

Consistency: an estimator is consistent if, as the sample size increases, it approaches

(converges on) the true value of the parameter

Distinguish between unbiased and biased estimators

An estimator is unbiased if:

E(\hat{\mu}_Y) = \mu_Y

Otherwise the estimator is biased.

If the expected value of the estimator is the population parameter, the estimator is

unbiased. If, in repeated applications of a method the mean value of the estimators

coincides with the true parameter value, that estimator is called an unbiased estimator.

Unbiasedness is a repeated sampling property: if we draw several samples of size (n) from a population and compute the unbiased sample statistic for each sample, the average of these statistics will tend to approach (converge on) the population parameter.


Define an efficient estimator and consistent estimator

An efficient estimate is both unbiased (i.e., the mean or expectation of the statistic is equal to

the parameter) and its variance is smaller than the alternatives (i.e., all other things being equal,

we would prefer a smaller variance). A statement of the error or precision of an estimate is

often called its reliability

Efficient: among unbiased estimators, the one with the smallest variance

"Consistent" refers to the estimator's behavior as the sample size increases

Efficient
• Unbiased
• Smallest variance

Consistent
• As sample size increases, estimator approaches true parameter value
• As n→∞, E[estimator] = parameter

\operatorname{var}(\hat{\mu}_Y) < \operatorname{var}(\tilde{\mu}_Y) \qquad \bar{Y} \xrightarrow{\;p\;} \mu_Y


Explain and apply the process of hypothesis testing:

Define & interpret the null hypothesis and the alternative

Distinguish between one‐sided and two‐sided hypotheses

Describe the confidence interval approach to hypothesis testing

Describe the test of significance approach to hypothesis testing

Define, calculate and interpret type I and type II errors

Define and interpret the p value



Define and interpret the null hypothesis and the alternative hypothesis

Please note the null must contain the equal sign ("="):

H_0: E(Y) = \mu_{Y,0}

H_1: E(Y) \ne \mu_{Y,0}

The null hypothesis, denoted by H0, is tested against

the alternative hypothesis, which is denoted by H1 or

sometimes HA.

Often, we test for the significance of the intercept or a

partial slope coefficient in a linear regression. Typically,

in this case, our null hypothesis is: “the slope is zero” or “there is no correlation between X and

Y” or “the regression coefficients jointly are not significant.” In which case, if we reject the null,

we are finding the statistic to be significant which, in this case, means “significantly different

than zero.”

Statistical significance implies our null hypothesis (i.e., the parameter equals zero) was

rejected. We concluded the parameter is nonzero. For example, a “significant” slope

estimate means we rejected the null hypothesis that the true slope is zero.

H_0: E(Y) = \$20

H_1: E(Y) \ne \$20



Distinguish between one‐sided and two‐sided hypotheses

Your default assumption should be a two-sided

hypothesis. If unsure, assume two-sided.

Here is a one-sided null hypothesis:

H_0: E(Y) \le \mu_{Y,0}

H_1: E(Y) > \mu_{Y,0}

Specifically, “The one-sided null hypothesis is that the

population average wage is less than or equal to $20.00:”

H_0: E(Y) \le \$20

H_1: E(Y) > \$20

The null hypothesis always includes the equal sign (=), regardless! The null cannot include

only less than (<) or greater than (>).



Describe the confidence interval approach to hypothesis testing

In the confidence interval approach, instead of

computing the test statistic, we define the confidence

interval as a function of our confidence level; i.e., higher

confidence implies a wider interval.

Then we simply ascertain if the null hypothesized value

is within the interval (within the “acceptance region”).

90% CI for \mu_Y: \bar{Y} \pm 1.64\,SE(\bar{Y})

95% CI for \mu_Y: \bar{Y} \pm 1.96\,SE(\bar{Y})

99% CI for \mu_Y: \bar{Y} \pm 2.58\,SE(\bar{Y})



Describe the test of significance approach to hypothesis testing

In the significance approach, instead of defining the

confidence interval, we compute the standardized

distance in standard deviations from the observed mean

to the null hypothesis: this is the test statistic (or

computed t value). We compare it to the critical (or

lookup) value.

If the test statistic is greater than the critical (lookup)

value, then we reject the null.

Reject H_0 at 90% if |t^{act}| > 1.64

Reject H_0 at 95% if |t^{act}| > 1.96

Reject H_0 at 99% if |t^{act}| > 2.58
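A short sketch (assumes scipy; using the wage example from earlier in this chapter) of the test of significance:

```python
# Test of significance for a sample mean (sketch; assumes scipy is installed).
import math
from scipy import stats

mean, s, n, mu_0 = 22.64, 18.14, 200, 20.00    # wage example: sample mean, std dev, size, null

se = s / math.sqrt(n)                           # ~1.28
t_act = (mean - mu_0) / se                      # ~2.06
p_value = 2 * stats.t.sf(abs(t_act), df=n - 1)  # two-tailed; ~4.1%

print(t_act, p_value)   # |t_act| > 1.96, so reject the null at 95% confidence
```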



Define, calculate and interpret type I and type II errors

If we reject a hypothesis which is actually true, we have

committed a Type I error. If, on the other hand, we

accept a hypothesis that should have been rejected, we

have committed a Type II error.

Type I error = significance level = α = Pr [reject H0

| H0 is true]

Type II error = β = Pr [“accept” H0 | H0 is false]

We can reject null with (1-p)% confidence

Type I: to reject a true hypothesis

Type II: to accept a false hypothesis

Type I and Type II errors: for example

Suppose we want to hire a portfolio manager who has

produced an average return of +8% versus an index that

returned +7%. We conduct a test statistical test to

determine whether the “excess +1%” is due to luck or “alpha” skill. We set a 95% confidence

level for our test. In technical parlance, our null hypothesis is that the manager adds no skill (i.e.,

the expected return is 7%).

Under the circumstances, a Type I error is the following: we decide that excess is significant and

the manager adds value, but actually the out-performance was random (he did not add skill). In

technical terms, we mistakenly rejected the null. Under the circumstances, a Type II error is the

following: we decide the excess is random (the manager adds no skill). But actually it was not random and he did add value. In technical terms, we falsely accepted the null.



Define and interpret the p value

The p-value is the “exact significance level:”

The lowest significance level at which the null

can be rejected

We can reject null with (1-p)% confidence

The p-value is an abbreviation that stands for

“probability-value.” Suppose our hypothesis is

that a population mean is 10; another way of

saying this is “our null hypothesis is H0: mean =

10 and our alternative hypothesis is H1: mean ≠

10.” Suppose we conduct a two-tailed test, given

the results of a sample drawn from the

population, and the test produces a p-value of .03.

This means that we can reject the null hypothesis

with 97% confidence – in other words, we can be

fairly confident that the true population mean is

not 10.

Our example was a two-tailed test, but recall we have three possible tests:

The parameter is greater than (>) the stated value (right-tailed test), or

The parameter is less than (<) the stated value (left-tailed test), or

The parameter is either greater than or less than (≠) the stated value (two-tailed test).

Small p-values provide evidence for rejecting the null hypothesis in favor of the alternative

hypothesis, and large p values provide evidence for not rejecting the null hypothesis in favor of

the alternative hypothesis.

Keep in mind a subtle point about the p-value and “rejecting the null.” It is a soft rejection.

Rather than accept the alternative, we fail to reject the null. Further, if we reject the null, we are

merely rejecting the null in favor of the alternative.

The analogy is to a jury verdict. The jury does not return a verdict of “innocent;” rather they

return a verdict of “not guilty.”

$p\text{-value} = \Pr_{H_0}\big[\,|Z| > |t^{act}|\,\big] = 2\big(1 - \Phi(|t^{act}|)\big)$
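For instance, a minimal Python sketch of this calculation (using scipy; the test statistic of 2.17 is a hypothetical value chosen only to reproduce the p-value of about .03 used in the example above):

from scipy.stats import norm

# Two-tailed p-value for a large-sample test statistic (standard normal under H0)
t_act = 2.17                               # hypothetical computed test statistic
p_value = 2 * (1 - norm.cdf(abs(t_act)))   # = 2 * [1 - Phi(|t_act|)]
print(round(p_value, 4))                   # ~0.03, so we can reject H0 with ~97% confidence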


Define, calculate, and interpret the sample variance, sample standard

deviation, and standard error.

$s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 \qquad s_Y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$

$SE(\bar{Y}) = \hat{\sigma}_{\bar{Y}} = \frac{s_Y}{\sqrt{n}}$
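A minimal Python sketch of these formulas (the four-point series 10, 12, 14, 16 is reused from the worked variance example later in these notes):

import numpy as np

Y = np.array([10.0, 12.0, 14.0, 16.0])      # small illustrative sample
n = len(Y)

s2 = np.sum((Y - Y.mean())**2) / (n - 1)    # sample variance, divisor (n-1)
s  = np.sqrt(s2)                            # sample standard deviation
se = s / np.sqrt(n)                         # standard error of the sample mean
print(round(s2, 3), round(s, 3), round(se, 3))   # 6.667, 2.582, 1.291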

Define, calculate, and interpret confidence intervals for the population

mean.

90% CI: $\bar{Y} \pm 1.64 \times SE(\bar{Y})$

95% CI: $\bar{Y} \pm 1.96 \times SE(\bar{Y})$

99% CI: $\bar{Y} \pm 2.58 \times SE(\bar{Y})$

Perform and interpret hypothesis tests for the difference between two

means

Test statistic for comparing two means:

$t = \frac{(\bar{Y}_m - \bar{Y}_w) - d_0}{SE(\bar{Y}_m - \bar{Y}_w)}$

Define, describe, apply, and interpret the t-statistic when the sample size

is small.

If the sample size is small, t-statistic has a Student’s t distribution with (n-1) degrees of freedom

$t = \frac{\bar{Y} - \mu_{Y,0}}{s_Y / \sqrt{n}}$


Interpret scatterplots.

The scattergram is a plot of the dependent variable (on the Y axis) against the independent

(explanatory) variable (on the X axis). In Stock and Watson, the explanatory variable is the

student-teacher ratio (STR). The dependent variable is test score:

Define, describe, and interpret the sample covariance and correlation.

Covariance is the average cross-product. Sample covariance multiplies the sum of cross-

products by 1/(n-1) rather than 1/n:

$s_{XY} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$

Sample correlation is sample covariance divided by the product of sample standard deviations:

$r_{XY} = \frac{s_{XY}}{S_X\,S_Y}$

The population covariance is defined as $\text{cov}(X,Y) = \sigma_{XY} = E\big[(X - \mu_X)(Y - \mu_Y)\big]$

(Figure: scatterplot of Test Scores versus Student-Teacher Ratio; Stock & Watson Figure 5.2.)


Covariance: For example

For a very simple example, consider three (X,Y) pairs: {(3,5), (2,4), (4,6)}:

X          Y          (X - X avg)(Y - Y avg)
3          5          0.0
2          4          1.0
4          6          1.0
Avg = 3    Avg = 5    Avg = sigma(XY) = 0.67
StdDev = SQRT(0.67)   StdDev = SQRT(0.67)   Correl. = 1.0

Please note:

Average X = (3+2+4)/3 = 3.0. Average Y = (5+4+6)/3 = 5.0

The first cross-product = (3 – 3)*(5 - 5) = 0.0

The sum of cross-products = 0 + 1 + 1 = 2.0

The population covariance = [sum of cross-products] / n = 2.0 / 3 = 0.67

The sample covariance = [sum of cross-products] / (n- 1) = 2.0 / 2 = 1.0
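A quick check of this example in Python (note that numpy's default std() uses the population divisor n, which is what the correlation calculation above uses):

import numpy as np

X = np.array([3.0, 2.0, 4.0])
Y = np.array([5.0, 4.0, 6.0])

cross = (X - X.mean()) * (Y - Y.mean())
pop_cov    = cross.sum() / len(X)        # population covariance: 0.67
sample_cov = cross.sum() / (len(X) - 1)  # sample covariance: 1.00
corr = pop_cov / (X.std() * Y.std())     # population moments -> correlation = 1.0
print(round(pop_cov, 2), sample_cov, round(corr, 2))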

Properties of covariance

1. If X & Y are independent, $\text{cov}(X,Y) = \sigma_{XY} = 0$

2. $\text{cov}(a + bX,\ c + dY) = bd \cdot \text{cov}(X,Y)$

3. $\text{cov}(X,X) = \text{var}(X)$; in notation, $\sigma_{XX} = \sigma_X^2$

4. If X & Y are not independent: $\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY}$ and $\sigma_{X-Y}^2 = \sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}$

Note that a variable’s covariance with itself is its variance. Keeping this in mind, we

realize that the diagonal in a covariance matrix is populated with variances.


Correlation Coefficient

The correlation coefficient is the covariance (X,Y) divided by the product of each variable’s

standard deviation. The correlation coefficient translates covariance into a unitless metric

that runs from -1.0 to +1.0:

$\rho_{XY} = \frac{\text{cov}(X,Y)}{\text{StdDev}(X)\ \text{StdDev}(Y)} = \frac{\sigma_{XY}}{\sigma_X\,\sigma_Y} \quad\Longleftrightarrow\quad \sigma_{XY} = \rho_{XY}\,\sigma_X\,\sigma_Y$

Memorize this relationship between the covariance, the correlation coefficient, and the

standard deviations. It has high testability.

On the next page we illustrate the application of the variance theorems and the correlation

coefficient.

Please walk through this example so you understand the calculations.

The example refers to two products, Coke (X) and Pepsi (Y).

We (somehow) can generate growth projections for both products. For both Coke (X) and Pepsi

(Y), we have three scenarios (bad, medium, and good). Probabilities are assigned to each

growth scenario.

In regard to Coke:

Coke has a 20% chance of growing 3%,

a 60% chance of growing 9%, and

a 20% chance of growing 12%.

In regard to Pepsi,

Pepsi has a 20% chance of growing 5%,

a 60% chance of growing 7%, and

a 20% chance of growing 9%.

Finally, we know these outcomes are not independent. We want to calculate the correlation

coefficient.


Probability            20%      60%      20%
Coke (X)               3        9        12
Pepsi (Y)              5        7        9
pX                     0.6      5.4      2.4
pY                     1.0      4.2      1.8
E(X) = 8.4   (sum of the pX row)
E(Y) = 7.0   (sum of the pY row)
XY                     15       63       108
pXY                    3        37.8     21.6
E(XY) = 62.4
Covariance of X,Y (key formula): E(XY) - E(X)E(Y) = 3.6
X^2                    9        81       144
Y^2                    25       49       81
pX^2                   1.8      48.6     28.8
pY^2                   5        29.4     16.2
E(X^2) = 79.2
E(Y^2) = 50.6
VAR(X) = E[X^2] - [E(X)]^2 = 8.64
VAR(Y) = E[Y^2] - [E(Y)]^2 = 1.60
STDEVP(X) = 2.939
STDEVP(Y) = 1.265
Correlation = COV / (STD x STD) = 0.9682

The calculation of expected values is required: E(X), E(Y), E(XY), E(X2) and E(Y2). Make sure you

can replicate the following two steps:

The covariance is equal to E(XY) – E(X)E(Y): 3.6 = 62.4 – (8.4)(7.0)

The correlation coefficient (ρ) is equal to the Cov(X,Y) divided by the product of the standard deviations: ρ(XY) ≈ 97% = 3.6 / (2.94 × 1.26)
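The same two steps can be verified with a short Python sketch (the probabilities and growth rates are simply the scenario inputs from the table above):

import numpy as np

p = np.array([0.20, 0.60, 0.20])   # scenario probabilities
X = np.array([3.0, 9.0, 12.0])     # Coke growth (%)
Y = np.array([5.0, 7.0,  9.0])     # Pepsi growth (%)

EX, EY, EXY = p @ X, p @ Y, p @ (X * Y)
cov_xy = EXY - EX * EY                        # 62.4 - 8.4*7.0 = 3.6
var_x  = p @ X**2 - EX**2                     # 8.64
var_y  = p @ Y**2 - EY**2                     # 1.60
rho    = cov_xy / np.sqrt(var_x * var_y)      # ~0.968
print(round(cov_xy, 2), round(var_x, 2), round(var_y, 2), round(rho, 4))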


Key properties of correlation:

Correlation has the same sign (+/-) as covariance

Correlation measures the linear relationship between two variables

Between -1.0 and +1.0, inclusive

The correlation is a unit-less metric

Zero covariance → zero correlation. However, zero correlation does not imply independence: for example, if Y = X^2 the relationship is nonlinear, and the correlation can be zero even though Y depends on X.

Correlation (or dependence) is not causation. For example, in a basket credit default

swap, the correlation (dependence) between the obligors is a key input. But we do not

assume there is mutual causation (e.g., that one default causes another). Rather, more

likely, different obligors are similarly sensitive to economic conditions. So, economic

deterioration may be the external cause that all obligors have in common. Consequently, their defaults exhibit dependence. But the causation is not internal.

Further, note that (linear) correlation is a special case of dependence. Dependence is

more general and includes non-linear relationships.

Sample mean

Sample mean is the sum of observations divided by the number of observations:

$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$

Variance

A population variance is given by:

$\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$

The sample variance is divided by (n-1):

$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$

For example: population versus sample variance

Assume the following series of four numbers: 10, 12, 14, and 16. The average of the series is (10+12+14+16) ÷ 4 = 13. For the population variance, in the numerator we want to sum the squared differences. The population variance is given by [(10-13)² + (12-13)² + (14-13)² + (16-13)²] ÷ 4 = 20 ÷ 4 = 5. The sample variance has the same numerator and (4-1) for the denominator: 20 ÷ 3 ≈ 6.7. The standard deviation is the square root of the variance. The population standard deviation is the square root of 5 ≈ 2.24 and the sample standard deviation is the square root of 6.7 ≈ 2.6.


Covariance

Covariance is the average cross-product:

$\sigma_{XY} = \frac{1}{n}\sum_i (X_i - \bar{X})(Y_i - \bar{Y})$

Sample covariance is given by:

$s_{XY} = \frac{1}{n-1}\sum_i (X_i - \bar{X})(Y_i - \bar{Y})$

Correlation coefficient

Correlation coefficient is given by:

$\rho_{XY} = \frac{\text{cov}(X,Y)}{\text{StdDev}(X)\ \text{StdDev}(Y)} = \frac{\sigma_{XY}}{\sigma_X\,\sigma_Y}$

Sample correlation coefficient is given by:

$r_{XY} = \frac{s_{XY}}{S_X\,S_Y}$

Skewness

Skewness is given by:

$\text{Skewness} = \frac{E[(X-\mu)^3]}{\sigma^3}$

Sample skewness is given by:

$\text{Sample Skewness} = \frac{\sum (X - \bar{X})^3}{(N-1)\,S^3}$

Kurtosis

Kurtosis is given by:

$\text{Kurtosis} = \frac{E[(X-\mu)^4]}{\sigma^4}$

Sample kurtosis is given by:

$\text{Sample Kurtosis} = \frac{\sum (X - \bar{X})^4}{(N-1)\,S^4}$
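A minimal Python sketch of the sample skewness and kurtosis formulas above (the data series is hypothetical, and note that software packages often use slightly different divisor conventions):

import numpy as np

X = np.array([2.0, 4.0, 5.0, 7.0, 12.0])   # small illustrative sample
n, xbar = len(X), X.mean()
s = X.std(ddof=1)                          # sample standard deviation

skew = np.sum((X - xbar)**3) / ((n - 1) * s**3)   # per the sample skewness formula above
kurt = np.sum((X - xbar)**4) / ((n - 1) * s**4)   # per the sample kurtosis formula above
print(round(skew, 3), round(kurt, 3))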


Stock, Chapter 4:

Linear Regression

with one regressor

In this chapter…

Explain how regression analysis in econometrics measures the relationship between dependent and independent variables.

Define and interpret a population regression function, regression coefficients, parameters, slope and the intercept.

Define and interpret the stochastic error term (or noise component).
Define and interpret a sample regression function, regression coefficients, parameters, slope and the intercept.
Describe the key properties of a linear regression.
Describe the method and assumptions of ordinary least squares for estimation of parameters.
Define and interpret the explained sum of squares (ESS), the total sum of squares (TSS), the sum of squared residuals (SSR), the standard error of the regression (SER), and the regression R2.
Interpret the results of an ordinary least squares regression.

What is Econometrics?

Econometrics is a social science that applies tools (economic theory, mathematics and statistical

inference) to the analysis of economic phenomena. Econometrics consists of “the application of

mathematical statistics to economic data to lend empirical support to the models constructed

by mathematical economics.”

Methodology of econometrics

Create a statement of theory or hypothesis

Collect data: time-series, cross-sectional, or pooled (combination of time-series and

cross-sectional)

Specify the (pure) mathematical model: a linear function with parameters (but without

an error term)

Specify the statistical model: adds the random error term


Estimate the parameters of the chosen econometric model: we are likely to use ordinary

least squares (OLS) approach to estimate parameters

Check for model adequacy: model specification testing

Test the model’s hypothesis

Use the model for prediction or forecasting

Note:

The pure mathematical model “although of prime interest to the mathematical

economist, is of limited appeal to the econometrician, for such a model assumes an

exact, or deterministic, relationship between the two variables.”

The difference between the mathematical and statistical model is the random error

term (u in the econometric equation below). The statistical (or empirical) econometric

model adds the random error term (u):

$Y_i = B_0 + B_1 X_i + u_i$

(Flowchart: Create theory (hypothesis) → Collect data → Specify mathematical model → Specify statistical (econometric) model → Estimate parameters → Test model specification → Test hypothesis → Use model to predict or forecast)


Three different data types used in empirical analysis

Three types of data for empirical analysis:

Time series - returns over time for an individual asset

Cross-sectional - average return across assets on a given day

Pooled (combination of time series and cross-sectional) - returns over time for a

combination of assets; and

Panel data (a.k.a., longitudinal or micropanel) data is a special type of pooled data in

which the cross-sectional unit (e.g., family, company) is surveyed over time.

For example, we often characterize a portfolio with a matrix. In such a matrix, the assets are

given in the rows and the period returns (e.g., days/months/years) are given in the columns:

(Table: rows = Asset #1 through Asset #4; columns = time periods 2006, 2007, 2008. Down a column: cross-sectional (or spatial) correlation across assets. Across a row: returns over time for one asset, where auto- or serial correlation of returns and of volatility can be examined.)

For such a “matrix portfolio,” we can examine the data in at least three ways:

Returns over time for an individual asset (time series)

Average return across assets on a given day (cross-sectional or spatial)

Returns over time for a combination of assets (pooled)

Time Series

• Returns over time for an individual asset

• Example – Returns on a single asset from Jan. through Mar. 2009

Cross-sectional

• Average return across assets on a given day

• Example – Returns for a business/family on a given day

Pooled

• Returns over time for a combination of assets

• Includes panel data (a.k.a., longitudinal, micropanel)


Explain how regression analysis in econometrics measures the

relationship between dependent and independent variables.

A linear regression may have one or more of the following objectives:

To estimate the (conditional) mean, or average, value of dependent variable

Test hypotheses about nature of dependence

To predict or forecast the dependent variable

One or more of the above

Correlation (dependence) is not causation. Further, linear correlation is a specific type of

dependence, which is a more general relationship (e.g., non-linear relationship).

In Stock and Watson, the authors regress student test scores (dependent variable) against class

size (independent variable):

$TestScore = \beta_0 + \beta_{ClassSize} \times ClassSize + \text{other factors}$

$Y_i = \beta_0 + \beta_1 X_i + u_i$

Define and interpret a population regression function, regression

coefficients, parameters, slope and the intercept.

$Y_i = \beta_0 + \beta_1 X_i + u_i$

where Y is the dependent (regressand) variable and X is the independent (regressor) variable.


Define and interpret the stochastic error term (or noise component).

The error term contains all the other factors aside from (X) that determine the value of the

dependent variable (Y) for a specific observation.

$Y_i = \beta_0 + \beta_1 X_i + u_i$

The stochastic error term is a random variable. Its value cannot be a priori determined.

May (probably) contain variables not explicit in the model

Even if all variables are included, there will still be some randomness

Error may also include measurement error

Ockham’s razor: a model is a simplification of reality. We don’t necessarily want to

include every explanatory variable

Define and interpret a sample regression function, regression

coefficients, parameters, slope and the intercept.

In theory, there is one unknowable population and one set of unknowable parameters (B0, B1).

But there are many samples, each sample → SRF → Estimator (statistic) → Estimate

Stochastic population regression function (PRF): $Y_i = B_0 + B_1 X_i + u_i$

Sample regression function (SRF): $\hat{Y}_i = b_0 + b_1 X_i$

Stochastic sample regression function (SRF): $Y_i = b_0 + b_1 X_i + e_i$

Each sample produces its own scatterplot. Through this sample scatterplot, we can plot a sample

regression line (SRL). The sample regression function (SRF) characterizes this line; the SRF is

analogous to the PRF, but for each sample.

B0 = intercept = regression coefficient

B1 = slope = regression coefficient

e(i) = the residual term

Note the correspondence between error term and the residual. As we specify the model,

we ex ante anticipate an error; after we analyze the observations, we ex post observe

residuals.


Unlike the PRF which is presumed to be stable (unobserved), the SRF varies with each

sample. So, we expect to get different SRF. There is no single “correct” SRF!

Describe the key properties of a linear regression.

It is okay if the regression function is non-linear in variables, but it must be linear in

parameters:

$E(Y) = B_0 + B_1 X_i^2$  (nonlinear variable, linear parameter: acceptable)

$E(Y) = B_0 + B_1^2 X_i$  (linear variable, nonlinear parameter: not a linear regression)

(Figure: each sample returns a different SRF due to sampling variation; Sample #1 and Sample #2 scatterplots with their fitted linear SRFs.)


Describe the method and assumptions of ordinary least squares for

estimation of parameters:

The process of ordinary least squares estimation seeks to achieve the minimum value for the

residual sum of squares (squared residuals = e^2).

Define and interpret the explained sum of squares (ESS), the total sum of

squares (TSS), the sum of squared residuals (SSR), the standard error of

the regression (SER), and the regression R2.

We can break the regression equation into three parts:

Explained sum of squares (ESS),

Sum of squared residual (RSS), and

Total sum of squares (TSS).

The explained sum of squares (ESS) is the squared distance between the predicted Y and the

mean of Y:

$ESS = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$

The three least squares assumptions: (1) the conditional distribution of u(i) given X(i) has a mean of zero; (2) the pairs [X(i), Y(i)] are independent and identically distributed (i.i.d.); and (3) large outliers are unlikely. The objectives of the regression: estimate the (conditional) mean of the dependent variable; test hypotheses about the nature of the dependence; and forecast the mean value of the dependent variable. Correlation (dependence) is not causation.


The sum of squared residuals (SSR) is the summation of each squared deviation between

the observed (actual) Y and the predicted Y:

$SSR = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$

The sum of squared residual (SSR) is the square of the error term. It is directly related to the

standard error of the regression (SER):

$SSR = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}\hat{u}_i^2 = SER^2 \times (n-2)$

Equivalently:

$SER^2 = \frac{\sum_i \hat{e}_i^2}{n-2}$

The ordinary least square (OLS) approach minimizes the SSR.

The SSR and the standard error of regression (SER) are directly related; the SER is the

standard deviation of the Y values around the regression line.

The standard error of the regression (SER) is a function of the sum of squared residual (SSR):

$SER = \sqrt{\frac{SSR}{n-2}} = \sqrt{\frac{\sum_i e_i^2}{n-2}}$

Note the use of (n-2) instead of (n) in the denominator. Division by this smaller number, (n-2) instead of (n), is what makes this an unbiased estimate.

(n-2) is used because the two-variable regression has (n-2) degrees of freedom (d.f.).

In order to compute the slope and intercept estimates, two independent observations are

consumed.

If k = the number of explanatory variables plus the intercept (e.g., 2 if one explanatory

variable; 3 if two explanatory variables), then SER = SQRT[SSR/(n-k)].

If k = the number of slope coefficients (excluding the intercept), then similarly, SER =

SQRT[SSR/(n-k -1)]
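A minimal Python sketch that ties these pieces together on hypothetical data (not the Stock & Watson dataset): it estimates a one-regressor OLS line, then computes ESS, SSR, TSS, R^2 and SER with the (n-2) divisor:

import numpy as np

# Hypothetical data: one regressor (illustration only)
x = np.array([18.0, 20.0, 22.0, 25.0, 15.0, 19.0, 23.0, 21.0])
y = np.array([680.0, 660.0, 650.0, 640.0, 700.0, 670.0, 645.0, 655.0])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)   # OLS slope
b0 = y.mean() - b1 * x.mean()                                              # OLS intercept
y_hat = b0 + b1 * x

ESS = np.sum((y_hat - y.mean())**2)
SSR = np.sum((y - y_hat)**2)
TSS = np.sum((y - y.mean())**2)          # TSS = ESS + SSR
R2  = ESS / TSS
SER = np.sqrt(SSR / (n - 2))             # n-2: slope and intercept consume 2 degrees of freedom
print(round(b0, 1), round(b1, 2), round(R2, 3), round(SER, 2))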


Interpret the results of an ordinary least squares regression

In the Stock & Watson example, the authors regress TestScore against the Student-teacher ratio

(STR):

The regression function, with standard error, is given by:

$\widehat{TestScore} = 698.9 - 2.28 \times STR$

(standard errors: 9.47 for the intercept, 0.48 for the slope)

The regression results are given by:

                              B(1) slope     B(0) intercept
Regression coefficients       -2.28          698.93
Standard errors, SE()         0.48           9.47
R^2 = 0.05,  SER = 18.58
F = 22.58,   d.f. = 418
ESS = 7,794, RSS = 144,315

Please note:

Both the slope and intercept are significant at 95%, at least. The test statistics are 73.8 for the intercept (698.9/9.47) and 4.75 for the slope (2.28/0.48). For example, given the very high test statistic for the slope, its p-value is approximately zero.

The coefficient of determination (R^2) is 0.05, which is equal to 7,794/(7,794 + 144,315)

The degrees of freedom are n – 2 = 420 – 2 = 418

(Figure: scatterplot of Test Scores versus Student-Teacher Ratio with the fitted OLS regression line.)


Stock, Chapter 5: Single

Regression: Hypothesis Tests

and Confidence Intervals

In this chapter…

Define, calculate, and interpret confidence intervals for regression coefficients.
Define and interpret hypothesis tests about regression coefficients.
Define and differentiate between homoskedasticity and heteroskedasticity.
Describe the implications of homoskedasticity and heteroskedasticity.
Explain the Gauss-Markov Theorem and its limitations, and alternatives to the OLS.

Define, calculate, and interpret confidence intervals for regression

coefficients.

Upper/lower limit = Regression coefficient ± [standard error × critical value @ c%]

The coefficient is effectively a sample mean, so this is essentially similar to

computing the confidence interval for a sample mean

In the example from Stock and Watson, lower limit = 680.4 = 698.9 – 9.47 × 1.96

              Coefficient    SE      95% CI Lower    95% CI Upper
Intercept     698.9          9.47    680.4           717.5
Slope (B1)    -2.28          0.48    -3.2            -1.3


Define and interpret hypothesis tests about regression coefficients.

The key idea here is that the regression coefficient (the estimator or sample statistic) is a

random variable that follows a student’s t distribution (because we do not know the population

variance, or it would be the normal):

$t = \frac{b_1 - B_{1,0}}{se(b_1)} = \frac{\text{regression coefficient} - \text{null hypothesized value}}{se(\text{regression coefficient})} \sim t_{n-2}$

Test of hypothesis for the slope (b1)

To test the hypothesis that the regression coefficient (b1) is equal to some specified value (the hypothesized B1), we use the fact that the statistic

$t = \frac{b_1 - B_{1,0}}{se(b_1)}$

This has a Student’s t distribution with n - 2 degrees of freedom because there are two coefficients

(slope and intercept).

Using the same example:

$\widehat{TestScore} = 698.9 - 2.28 \times STR$

(standard errors: 9.47 and 0.48)

STR: t statistic = |(-2.28 - 0)/0.48| = 4.75; two-tailed p-value ≈ 0%
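A short Python sketch of the same test (the coefficient and standard error are taken from the regression above; the normal approximation is used for the p-value since n is large):

from scipy.stats import norm

b1, se_b1 = -2.28, 0.48          # slope and its standard error from the regression above
t_act  = (b1 - 0) / se_b1        # test statistic versus the null B1 = 0
p_val  = 2 * norm.sf(abs(t_act)) # large-sample two-tailed p-value
ci     = (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1)   # 95% confidence interval
print(round(t_act, 2), p_val, [round(v, 2) for v in ci])   # -4.75, ~0.000002, [-3.22, -1.34]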


Define and differentiate between homoskedasticity and

heteroskedasticity.

The error term u(i) is homoskedastic if the variance of

the conditional distribution of u(i) given X(i) is constant

for i = 1,…,n and in particular does not depend on X(i).

Otherwise the error term is heteroskedastic.

Describe the implications of homoskedasticity and heteroskedasticity.

Mathematical Implications of Homoskedasticity

The OLS estimators remain unbiased and asymptotically normal. “Whether the

errors are homoskedastic or heteroskedastic, the OLS estimator is unbiased, consistent,

and asymptotically normal.”

OLS estimators are efficient if the least squares assumptions are true. This result is

called the Gauss– Markov theorem.

Homoskedasticity-only variance formula. If the errors are homoskedastic, then there

is a specialized formula that can be used for the standard errors of the slope and

intercept estimates.

Explain the Gauss-Markov Theorem and its limitations, and alternatives to

the OLS.

The Gauss-Markov theorem states that, under a set of conditions known as the Gauss-Markov conditions, the OLS slope estimator (b1) has the smallest conditional variance, given the regressors, of all linear conditionally unbiased estimators of the parameter (B1); that is, the OLS estimator is BLUE.

The Gauss– Markov theorem provides a theoretical justification for using OLS, but has two key

limitations:

Its conditions might not hold in practice. “In particular, if the error term is

heteroskedastic— as it often is in economic applications— then the OLS estimator is no

longer BLUE.” An alternative to OLS when there is heteroskedasticity of a known form is called the weighted least squares estimator.

Even if the conditions of the theorem hold, there are other candidate estimators that are

not linear and conditionally unbiased; under some conditions, these other estimators are

more efficient than OLS.


Alternatives to ordinary least squares (OLS)

Under certain conditions, some regression estimators are more efficient than OLS.

The weighted least squares (WLS) estimator: If the errors are heteroskedastic, then

OLS is no longer BLUE. If the heteroskedasticity is known (i.e., if the conditional variance of u(i) given X(i) is known up to a constant factor of proportionality), then an alternate estimator

exists with a smaller variance than the OLS estimator. This method, weighted least

squares (WLS), weights the (i-th) observation by the inverse of the square root of the

conditional variance of u(i) given X(i). Because of this weighting, the errors in this

weighted regression are homoskedastic, so OLS, when applied to the weighted data, is

BLUE.

o Although theoretically elegant, the problem with weighted least squares is that we

must know how the conditional variance of u(i) depends on X(i). Because this is

rarely known, weighted least squares is used far less frequently in practice than

OLS.

The least absolute deviations (LAD) estimator: The OLS estimator can be sensitive to

outliers. If extreme outliers are “not rare” (or if we can safely ignore extreme outliers),

then the least absolute deviations (LAD) estimator may be more effective. In LAD, the

regression coefficients are obtained by solving a minimization that uses the absolute

value of the prediction “mistake” (i.e., instead of its square).

o Because “in many economic data sets, severe outliers are rare,” the use of the LAD

estimator is uncommon in applications.


Stock: Chapter 6:

Linear Regression

with Multiple

Regressors

In this chapter…

Define, interpret, and discuss methods for addressing omitted variable bias.
Distinguish between simple and multiple regression.
Define and interpret the slope coefficient in a multiple regression.
Describe homoskedasticity and heteroskedasticity in a multiple regression.
Describe and discuss the OLS estimator in a multiple regression.
Define, calculate, and interpret measures of fit in multiple regression.
Explain the assumptions of the multiple linear regression model.
Explain the concept of imperfect and perfect multicollinearity and their implications.

Define, interpret, and discuss methods for addressing omitted variable

bias.

Omitted variable bias occurs if both:

1. Omitted variable is correlated with the included regressor, and

2. Omitted variable is a determinant of the dependent variable

“If the regressor ( the student– teacher ratio) is correlated with a variable that has been

omitted from the analysis ( the percentage of English learners) and that determines, in

part, the dependent variable ( test scores), then the OLS estimator will have omitted

variable bias.” –S&W

The first least squares assumption is that the error term, u(i), has a conditional mean of

zero: E[ u(i) | X(i) ] = 0. Omitted variable bias means this OLS assumption is not true.


Stock and Watson show that omitted variable bias can be expressed mathematically, if we

assume a correlation between u(i) and X(i) denoted by rho(X,u). Then the OLS estimator has the

following limit; i.e., as the sample size increases the estimator (B1 hat) does not tend toward

the parameter (B1) but rather:

$\hat{\beta}_1 \xrightarrow{\ p\ } \beta_1 + \rho_{Xu}\,\frac{\sigma_u}{\sigma_X}$

1. Omitted variable bias is a problem whether the sample size is large or small. Because the

estimator (B1 hat) does not converge in probability to the true value (B1), the estimator (B1 hat) is biased and inconsistent; i.e., [B1 hat] is not a consistent

estimator of [B1] when there is omitted variable bias.

2. Whether this bias is large or small in practice depends on the correlation (rho) between

the regressor and the error term. The larger is the correlation, the larger the bias.

3. The direction of the bias depends on whether (X) and (u) are positively or negatively

correlated.

Distinguish between simple [single] and multiple regression.

Multiple regression model extends the single variable regression model:

$Y_i = \beta_0 + \beta_1 X_{1i} + u_i$

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i, \quad i = 1,\ldots,n$

Define and interpret the slope coefficient in a multiple regression.

The B(1) slope coefficient, for example, is the effect on Y of a unit change in X(1) if we hold the

other independent variables, X(2) …., constant.

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i, \quad i = 1,\ldots,n$

B(2) is the effect on Y(i) given a unit change in X(2), if

we hold constant X(1), X(3), … and X(k)


Describe homoskedasticity and heteroskedasticity in a multiple regression.

If the variance [u(i) | X(1i), …, X(ki)] is constant for i = 1, …, n, then the model is homoskedastic.

Otherwise, the model is heteroskedastic.

Describe and discuss the OLS estimator in a multiple regression.

$\widehat{TestScore} = 686 - 1.10 \times STR - 0.65 \times PctEL$

(here, -1.10 is the OLS estimate of the coefficient on the student-teacher ratio, B1)

Define, calculate, and interpret measures of fit in multiple regression.

Standard error of regression (SER)

Standard error of regression (SER) estimates the standard deviation of the error term u(i). In

this way, the SER is a measure of spread of the distribution of Y around the regression line. In a

multiple regression, the SER is given by:

$SER = \sqrt{\frac{SSR}{n-k-1}}$

Where (k) is the number of slope coefficients; e.g., in this case of a two variable regression, k = 1.

For the standard error of the regression (SER), the denominator is n – [# of variables], or

n – [# of coefficients including the intercept].

In a univariate regression (i.e., one independent variable/regressor), the denominator is

n – 2 as n – 1 – 1 = n -2

In a regression with two regressors (two independent variables), the denominator is n – 3

as n – 2 – 1 = n – 3.



Coefficient of determination (R^2)

The coefficient of determination is the fraction of the sample variance of Y(i) explained by (or

predicted by) the independent variables”

$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}$

In multiple regression, the R^2 increases whenever a regressor (independent variable)

is added, unless the estimated coefficient on the added regressor is exactly zero.

“Adjusted R^2”

The unadjusted R^2 will tend to increase as additional independent variables are added.

However, this does not necessarily reflect a better fitted model. The adjusted R^2 is a

modified version of the R^2 that does not necessarily increase when a new independent variable is added. Adjusted R^2 is given by:

$\bar{R}^2 = 1 - \frac{n-1}{n-k-1}\,\frac{SSR}{TSS} = 1 - \frac{s_{\hat{u}}^2}{s_Y^2}$

“The R^2 is useful because it quantifies the extent to which the regressors (independent

variables) account for, or explain, the variation in the dependent variable. Nevertheless,

heavy reliance on the R^2 can be a trap. In applications, “maximize the R^2” is rarely the

answer to any economically or statistically meaningful question. Instead, the decision

about whether to include a variable in a multiple regression should be based on whether

including that variable allows you better to estimate the causal effect of interest.” –S&W

Explain the assumptions of the multiple linear regression model.

Conditional distribution of u(i) given X(1i), X(2i),…,X(ki) has mean of zero

X(1i), X(2i), … X(ki), Y(i) are independent and identically distributed (i.i.d.)

Large outliers are unlikely

No perfect collinearity

The regressors exhibit perfect multi-collinearity if one of the regressors is a perfect

linear function of the other regressors. The fourth least squares assumption is that the

regressors are not perfectly multicollinear.


Explain the concept of imperfect and perfect multicollinearity and their

implications.

Imperfect multicollinearity is when two or more of the independent variables (regressors) are

highly correlated: there is a linear function of one of the regressors that is highly correlated with

another regressor. Imperfect multicollinearity does not pose any problems for the theory

of the OLS estimators; indeed, a purpose of OLS is to sort out the independent influences of the

various regressors when these regressors are potentially correlated. Imperfect multicollinearity

does not prevent estimation of the regression, nor does it imply a logical problem with the

choice of independent variables (i.e., regressor).

However, imperfect multicollinearity does mean that one or more of the regression

coefficients could be estimated imprecisely

“Perfect multicollinearity is a problem that often signals the presence of a logical error.

In contrast, imperfect multicollinearity is not necessarily an error, but rather just a

feature of OLS, your data, and the question you are trying to answer. If the variables

in your regression are the ones you meant to include— the ones you chose to address the

potential for omitted variable bias— then imperfect multicollinearity implies that it will

be difficult to estimate precisely one or more of the partial effects using the data at

hand.” –S&W


Stock, Chapter 7:

Hypothesis Tests and

Confidence Intervals

in Multiple Regression

In this chapter…

Construct, perform, and interpret hypothesis tests and confidence intervals for a single coefficient in a multiple regression.

Construct, perform, and interpret hypothesis tests and confidence intervals for multiple coefficients in a multiple regression.

Define and interpret the F-statistic.
Define, calculate, and interpret the homoskedasticity-only F-statistic.
Describe and interpret tests of single restrictions involving multiple coefficients.
Define and interpret confidence sets for multiple coefficients.
Define and discuss omitted variable bias in multiple regressions.
Interpret the R2 and adjusted-R2 in a multiple regression.

Construct, perform, and interpret hypothesis tests and confidence

intervals for a single coefficient in a multiple regression.

The Stock & Watson example adds an additional independent variable (regressor). Under this

three variable regression, Test Scores (dependent) are a function of the Student/Teacher ratio

(STR) and the Percentage of English learners in district (PctEL).


STR = Student/Teacher ratio

PctEL = Percentage (%) of English learners in district

$\widehat{TestScore} = 686 - 1.10 \times STR - 0.65 \times PctEL$

(standard errors: 7.41, 0.38, and 0.04)

STR: t statistic = (-1.10 - 0)/0.38 = -2.90; two-tailed p-value = 0.40%

PctEL: t statistic = (-0.65 - 0)/0.04 = -16.52; two-tailed p-value ≈ 0.0%

STR 95% CI: lower limit = -1.10 - 0.38×1.96 = -1.85; upper limit = -1.10 + 0.38×1.96 = -0.35

PctEL 95% CI: lower limit = -0.65 - 0.04×1.96 = -0.73; upper limit = -0.65 + 0.04×1.96 = -0.57

Construct, perform, and interpret hypothesis tests and confidence

intervals for multiple coefficients in a multiple regression.

The “overall” regression F-statistic tests the joint hypothesis that all slope coefficients are zero

Define and interpret the F-statistic.

F-statistic is used to test joint hypothesis about regression coefficients.

$F = \frac{1}{2}\left(\frac{t_1^2 + t_2^2 - 2\,\hat{\rho}_{t_1,t_2}\,t_1 t_2}{1 - \hat{\rho}_{t_1,t_2}^2}\right)$

where $\hat{\rho}_{t_1,t_2}$ is an estimator of the correlation between the two t-statistics.

The “overall” regression F- statistic tests the joint hypothesis that all the slope coefficients are

zero; i.e., the null hypothesis is that all slope coefficients are zero. Under this null hypothesis,

none of the regressors explains any of the variation in the dependent variable (although the

intercept can be nonzero).


Define, calculate, and interpret the homoskedasticity-only F-statistic.

If the error term is homoskedastic, the F-statistic can be written in terms of the improvement in

the fit of the regression as measured either by the sum of squared residuals or by the regression

R^2.

$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n - k_{unrestricted} - 1)}$
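A minimal Python sketch of this formula with hypothetical sums of squared residuals (q is the number of restrictions tested and k is the number of regressors in the unrestricted model):

from scipy.stats import f

ssr_r, ssr_u = 1650.0, 1500.0   # hypothetical restricted vs. unrestricted SSR
q, n, k_u = 2, 420, 3           # restrictions, observations, unrestricted regressors

F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k_u - 1))
p_value = f.sf(F, q, n - k_u - 1)    # right-tail probability of the F distribution
print(round(F, 2), round(p_value, 6))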

Describe and interpret tests of single restrictions involving multiple

coefficients.

Approach #1: Test the restrictions directly

Approach #2: Transform the regression

Define and interpret confidence sets for multiple coefficients.

Confidence ellipse characterizes a confidence set for two coefficients; this is the two-dimension

analog to the confidence interval:

(Figure: confidence ellipse for the coefficient on Expn (B2), vertical axis, against the coefficient on STR (B1), horizontal axis.)


Define and discuss omitted variable bias in multiple regressions.

Omitted variable bias: an omitted determinant of Y (the dependent variable) is correlated with

at least one of the regressor (independent variables).

Interpret the R2 and adjusted-R2 in a multiple regression.

There are four pitfalls to watch in using the R^2 or adjusted R^2:

1. An increase in the R^2 or adjusted R^2 does not necessarily imply that an added variable

is statistically significant

2. A high R^2 or adjusted R^2 does not mean the regressors are a true cause of the

dependent variable

3. A high R^2 or adjusted R^2 does not mean there is no omitted variable bias

4. A high R^2 or adjusted R^2 does not necessarily mean you have the most appropriate set

of regressors, nor does a low R^2 or adjusted R^2 necessarily mean you have an

inappropriate set of regressors


Rachev, Menn, and

Fabozzi, Chapter 2:

Discrete Probability

Distributions

In this chapter…

Describe the key properties of the Bernoulli distribution, Binomial distribution, and Poisson distribution, and identify common occurrences of each distribution.

Identify the distribution functions of Binomial and Poisson distributions for various parameter values.

Describe the key properties of the Bernoulli distribution, Binomial

distribution, and Poisson distribution, and identify common occurrences

of each distribution.

Bernoulli

A random variable X is called Bernoulli distributed with parameter (p) if it has only two

possible outcomes, often encoded as 1 (“success” or “survival”) or 0 (“failure” or

“default”), and if the probability for realizing “1” equals p and the probability for “0”

equals 1 – p. The classic example for a Bernoulli-distributed random variable is the default

event of a company.

A Bernoulli variable is discrete and has two possible outcomes:

$X = \begin{cases} 1 & \text{if company C defaults in period I} \\ 0 & \text{else} \end{cases}$


Binomial

A binomial distributed random variable is the sum of (n) independent and identically distributed

(i.i.d.) Bernoulli-distributed random variables. The probability of observing (k) successes is

given by:

$P(Y = k) = \binom{n}{k}\,p^k\,(1-p)^{n-k}, \qquad \binom{n}{k} = \frac{n!}{(n-k)!\,k!}$

Poisson

The Poisson distribution depends upon only one parameter, lambda λ, and can be interpreted as

an approximation to the binomial distribution. A Poisson-distributed random variable is usually

used to describe the random number of events occurring over a certain time interval. The

lambda parameter (λ) indicates the rate of occurrence of the random events; i.e., it tells us

how many events occur on average per unit of time.

In the Poisson distribution, the random number of events that occur during an interval of time,

(e.g., losses/ year, failures/ day) is given by:

$P(N = k) = \frac{\lambda^k}{k!}\,e^{-\lambda}$

                   Normal      Binomial      Poisson
Mean               μ           np            λ
Variance           σ^2         npq           λ
Standard Dev.      σ           √(npq)        √λ

In Poisson, lambda is both the expected value (the mean) and the variance!
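A minimal Python sketch using scipy (the basket size, default probability, and lambda are hypothetical, chosen only to illustrate the two probability functions):

from scipy.stats import binom, poisson

# Binomial: probability of exactly 3 defaults in a basket of 50 credits, each with p = 2%
print(round(binom.pmf(3, n=50, p=0.02), 4))       # ~0.0607

# Poisson: probability of exactly 3 losses when lambda = 1 loss per period on average
print(round(poisson.pmf(3, mu=1.0), 4))           # ~0.0613
print(poisson.mean(mu=4.0), poisson.var(mu=4.0))  # mean = variance = lambda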


The Bernoulli is used to characterize default; consequently the binomial is used to characterize a

portfolio of credits. In finance, the Poisson distribution is often used, as a generic stochastic

process, to model the time of default in some credit risk models.

Identify the distribution functions of Binomial and Poisson distributions

for various parameter values.

(Figure: Poisson, binomial, and normal distributions overlaid for comparison.)

(Figure: binomial distribution with different success probabilities: p = 20%, p = 50%, and p = 80%.)

Bernoulli: default (0/1)

Binomial: basket of credits; basket CDS; BET

Poisson: operational loss frequency


(Figure: Poisson distributions with different lambdas: λ = 5, λ = 10, and λ = 20.)


Rachev, Menn, and

Fabozzi, Chapter 3:

Continuous Probability

Distributions

In this chapter…

Describe the key properties of Normal, Exponential, Weibull, Gamma, Beta, Chi‐squared, Student’s t, Lognormal, Logistic and Extreme Value distributions.

Explain the summation stability of normal distributions.
Describe the hazard rate of an exponentially distributed random variable.
Explain the relationship between exponential and Poisson distributions.
Explain why the generalized Pareto distribution is commonly used to model operational risk events.
Explain the concept of mixtures of distributions.

Describe the key properties of Normal, Exponential, Weibull, Gamma,

Beta, Chi‐squared, Student’s t, Lognormal, Logistic and Extreme Value

distributions.

Normal

Characteristics of the normal distribution include:

The middle of the distribution, mu (µ), is the mean (and median). This first moment is

also called the “location”

Standard deviation and variance are measures of dispersion (a.k.a., shape). Variance is

the second-moment; typically, variance is denoted by sigma-squared such that standard

deviation is sigma.

The distribution is symmetric around µ. In other words, the normal has skew = 0

The normal has kurtosis = 3 or “excess kurtosis” = 0


Properties of normal distribution:

Location-scale invariance: Imagine random variable X, which is normally distributed

with the parameters µ and σ. Now consider random variable Y, which is a linear function

of X: Y = aX + b. In general, the distribution of Y might substantially differ from the

distribution of X, but in the case where X is normally distributed, the random variable Y

is again normally distributed with parameters [mean = a*mu + b] and [variance =

a^2*sigma^2]. Specifically, we do not leave the class of normal distributions if we

multiply the random variable by a factor or shift the random variable.

Summation stability: If you take the sum of several independent random variables,

which are all normally distributed with mean (µi) and standard deviation (σi), then the

sum will be normally distributed again.

The normal distribution possesses a domain of attraction. The central limit theorem

(CLT) states that—under certain technical conditions—the distribution of a large sum of

random variables behaves necessarily like a normal distribution.

The normal distribution is not the only class of probability distributions having a domain

of attraction. Actually three classes of distributions have this property: they are called

stable distributions.

Exponential

The exponential distribution is popular in queuing theory. It is used to model the time we have

to wait until a certain event takes place. According to the text, examples include “the time

until the next client enters the store, the time until a certain company defaults or the time until

some machine has a defect.”

$f(x) = \frac{1}{\beta}\,e^{-x/\beta}, \qquad \beta > 0,\ x \geq 0$

The exponential density is non-zero only for non-negative x:

(Figure: exponential distributions with parameters 0.5, 1, and 2.)


Weibull

Weibull is a generalized exponential distribution; i.e., the exponential is a special case of the

Weibull where the alpha parameter equals 1.0.

$F(x) = 1 - e^{-(x/\beta)^{\alpha}}, \qquad x \geq 0$

The main difference between the exponential distribution and the Weibull is that, under the

Weibull, the default intensity depends upon the point in time t under consideration. This allows

us to model the aging effect or teething troubles:

For α > 1—also called the “light-tailed” case—the default intensity is monotonically increasing

with increasing time, which is useful for modeling the “aging effect” as it happens for machines:

The default intensity of a 20-year old machine is higher than the one of a 2-year old machine.

For α < 1—the “heavy-tailed” case—the default intensity decreases with increasing time. That

means we have the effect of “teething troubles,” a figurative explanation for the effect that after

some trouble at the beginning things work well, as it is known from new cars. The credit spread

on noninvestment-grade corporate bonds provides a good example: Credit spreads usually

decline with maturity. The credit spread reflects the default intensity and, thus, we have the

effect of “teething troubles.” If the company survives the next two years, it will survive for a

longer time as well, which explains the decreasing credit spread.

For α = 1, the Weibull distribution reduces to an exponential distribution with parameter β.

(Figure: Weibull distributions with α = 0.5, β = 1; α = 2, β = 1; and α = 2, β = 2.)


Gamma distribution

The family of Gamma distributions forms a two parameter probability distribution family with

the density function (pdf) given by:

$f(x) = \frac{x^{\alpha-1}}{\beta^{\alpha}\,\Gamma(\alpha)}\,e^{-x/\beta}, \qquad x \geq 0$

The Gamma distribution is related to:

For alpha = 1, Gamma distribution becomes exponential distribution

For alpha = k/2 and beta = 2, Gamma distribution becomes Chi-square distribution

Beta distribution

The beta distribution has two parameters: alpha (“center”) and beta (“shape”). The beta

distribution is very flexible, and popular for modeling recovery rates.

(Figure: Gamma distributions with α = 1, β = 1; α = 2, β = 0.5; and α = 4, β = 0.25.)

(Figure: Beta distributions, popular for recovery rates, with α = 0.6, β = 0.6; α = 1, β = 5; α = 2, β = 4; and α = 2, β = 1.5.)


Example of Beta Distribution

The beta distribution is often used to model recovery rates. Here are two examples: one beta

distribution to model a junior class of debt (i.e., lower mean recovery) and another for a senior

class of debt (i.e., lower loss given default):

                    Junior    Senior
alpha (center)      2.0       4.0
beta (shape)        6.0       3.3
Mean recovery       25%       55%

Lognormal

The lognormal is common in finance: If an asset return (r) is normally distributed, the

continuously compounded future asset price level (or ratio of prices; i.e., the wealth ratio) is

lognormal. Expressed in reverse, if a variable is lognormal, its natural log is normal.

(Figure: Beta distributions of recovery (residual value) for senior versus junior debt.)

(Figure: lognormal distribution: non-zero, positive skew, heavy right tail.)


Logistic

A logistic distribution has heavy tails:

Extreme Value Theory

Measures of central tendency and dispersion (variance, volatility) are impacted more by

observations near the mean than outliers. The problem is that, typically, we are concerned with

outliers; we want to size the likelihood and magnitude of low frequency, high severity (LFHS)

events. Extreme value theory (EVT) solves this problem by fitting a separate distribution to

the extreme loss tail. EVT uses only the tail of the distribution, not the entire dataset.

In applying extreme value theory (EVT), two general approaches are:

Block maxima (BM). The classic approach

Peaks over threshold (POT). The modern approach that is often preferred.

(Figure: logistic distributions with α = 0, β = 1; α = 2, β = 1; and α = 0, β = 3, compared to the standard normal N(0,1).)


Block maxima

The dataset is parsed into (m) identical, consecutive and non-overlapping periods called blocks.

The length of the block should be greater than the periodicity; e.g., if the returns are daily, blocks

should be weekly or more. Block maxima partitions the set into time-based intervals. It requires

that observations be identically and independently (i.i.d.) distributed.

Generalized extreme value (GEV) fits block maxima

The Generalized extreme value (GEV) distribution is given by:

$H(y) = \begin{cases} \exp\left[-(1+\xi y)^{-1/\xi}\right] & \xi \neq 0 \\ \exp\left(-e^{-y}\right) & \xi = 0 \end{cases}$

The (xi) parameter is the “tail index;” it represents the fatness of the tails: a higher tail index corresponds to fatter tails.

Per the (unassigned) Jorion reading on EVT, the key thing to know here is that (1) among

the three classes of GEV distributions (Gumbel, Frechet, and Weibull), we only care

about the Frechet because it fits to fat-tailed distributions, and (2) the shape parameter

determines the fatness of the tails (higher shape → fatter tails)

(Figure: generalized extreme value (GEV) distribution.)


Peaks over threshold (POT)

Peaks over threshold (POT) collects the dataset of losses above (or in excess of) some threshold.

The cumulative distribution function here refers to the probability that the “excess loss” (i.e., the

loss, X, in excess of the threshold, u, is less than some value, y, conditional on the loss exceeding

the threshold):

$F_u(y) = P(X - u \leq y \mid X > u)$

Peaks over threshold (POT), the generalized Pareto distribution:

$G_{\xi,\beta}(x) = \begin{cases} 1 - \left(1 + \xi x / \beta\right)^{-1/\xi} & \xi \neq 0 \\ 1 - \exp\left(-x/\beta\right) & \xi = 0 \end{cases}$

(Figure: distribution with the threshold u marked; X denotes losses in excess of u.)


Block maxima is: time-based (i.e., blocks of time), traditional, less sophisticated, more

restrictive in its assumptions (i.i.d.)

Peaks over threshold (POT) is: more modern, has at least three variations (semi-

parametric; unconditional parametric; and conditional parametric), is more flexible

EVT Highlights:

Both GEV and GPD are parametric distributions used to model heavy-tails.

GEV (Block Maxima)

Has three parameters: location, scale and tail index

If tail > 0: Frechet

GPD (peaks over threshold, POT)

Has two parameters: scale and tail (or shape)

But must select threshold (u)

Explain the summation stability of normal distributions.

The sum of independent normally distributed random variables is also normally distributed

(Figure: generalized Pareto distribution (GPD).)


Describe the hazard rate of an exponentially distributed random variable.

In credit risk modeling, the parameter λ = 1/β is interpreted as a hazard rate or default intensity.

$f(x) = \frac{1}{\beta}\,e^{-x/\beta}, \quad x \geq 0 \qquad\qquad F(x) = 1 - e^{-x/\beta}, \quad x \geq 0$

Equivalently, with $\lambda = 1/\beta$:

$f(x) = \lambda\,e^{-\lambda x} \qquad\qquad F(x) = 1 - e^{-\lambda x}$
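To connect the pieces (a short added step, not in the original reading): the hazard rate is the density divided by the survival function, and for the exponential it is constant:

$h(x) = \frac{f(x)}{1 - F(x)} = \frac{\lambda\,e^{-\lambda x}}{e^{-\lambda x}} = \lambda = \frac{1}{\beta}$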

Explain the relationship between exponential and Poisson distributions.

The Poisson distribution counts the number of discrete events in a fixed time period; it is related

to the exponential distribution, which measures the time between arrivals of the events. If

events occur in time as a Poisson process with parameter λ, the time between events are

distributed as an exponential random variable with parameter λ. For example (from the learning

XLS):

                                  Case 1     Case 2
Avg. events / day (lambda)        6          4
Number of hours/day               24         24
Average per hour                  0.250      0.167

Poisson
Events / day (x)                  6.00       2.00
P[X = x]                          16.1%      14.7%
P[X = x], Excel check             16.1%      14.7%
(The Poisson distribution gives the probability that exactly x events (losses) will occur in one day.)

Exponential
Hours (t)                         1          12
Average events per t hours        0.25       2.00
Days / hour                       0.042      0.042
P[Y > t]                          77.9%      13.5%
P[Y < t], CDF                     22.1%      86.5%
(The exponential distribution gives the probability that the next loss will occur before (within) the next 1 hour, or the next 12 hours.)

Alternative:
Days / event                      0.17       0.25
Hours / event                     4.00       6.00
P[Y > t]                          77.9%      13.5%
P[Y < t], CDF                     22.1%      86.5%
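The first column of this example can be checked with a short Python sketch (scipy's exponential distribution is parameterized by scale = 1/lambda):

from scipy.stats import poisson, expon

lam_per_day = 6.0
lam_per_hour = lam_per_day / 24.0                       # 0.25 events per hour

# Poisson: probability of exactly 6 events in one day
print(round(poisson.pmf(6, mu=lam_per_day), 3))         # ~0.161

# Exponential: probability the next event arrives within 1 hour
print(round(expon.cdf(1.0, scale=1/lam_per_hour), 3))   # ~0.221, so P[wait > 1h] ~ 0.779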


Explain why the generalized Pareto distribution is commonly used to

model operational risk events.

The generalized Pareto distribution (GPD) models the distribution of so-called “peaks over threshold.” The GPD is the limiting distribution of excesses over a threshold (the “peaks-over-threshold” model). Possible applications are in the field of operational risk, where we are concerned about losses above a certain threshold.

For severity tails, empirical distributions are rarely sufficient (there is rarely enough data!).

Explain the concept of mixtures of distributions.

If two normal distributions with the same mean (but different variances) are combined, the mixture distribution exhibits leptokurtosis (heavy tails). More generally, mixture distributions are extremely flexible.

(Figure: two normal densities and their mixture.)


Jorion, Chapter 12:

Monte Carlo

Methods

In this chapter…

Describe how to simulate a price path using a geometric Brownian motion model.
Describe how to simulate various distributions using the inverse transform method.
Describe the bootstrap method.
Explain how simulations can be used for computing VaR and pricing options.
Describe the relationship between the number of Monte Carlo replications and the standard error of the estimated values.
Describe and identify simulation acceleration techniques.
Explain how to simulate correlated random variables using Cholesky factorization.
Describe deterministic simulations.
Discuss the drawbacks and limitations of simulation procedures.

Describe how to simulate a price path using a geometric Brownian motion

(GBM) model.

Geometric Brownian Motion (GBM) is a continuous-time process in which the randomly varying quantity (in our example, the asset value) fluctuates over time (the stochastic variable is time). The random component is the shock and the deterministic trend in the asset’s value is the drift; the GBM can be represented as drift + shock, as shown below.

Specify a random process (e.g., GBM for a stock)

Run trials (10 or 1 MM), each a function of a random variable

For all trials, calculate the terminal (at horizon) asset (or portfolio) value

Sort outcomes, best to worst; quantiles (e.g., the 1st percentile) are the VaRs


$dS_t = \mu S_t\,dt + \sigma S_t\,dz_t$  is the infinitesimal (continuous) representation of the GBM

$\Delta S_t = S_{t-1}\left(\mu\,\Delta t + \sigma\,\varepsilon\,\sqrt{\Delta t}\right)$  is the discrete representation of the GBM

GBM models a deterministic drift plus a stochastic random shock

The above equations describe the drift-and-shock progression of the asset. The asset drifts upward with the expected return μ over the time interval Δt. But the drift is also impacted by shocks from the random variable ε (the random shock), which is scaled by the volatility σ. Because variance scales with time Δt, volatility is scaled with the square root of time, √Δt.

The expected drift is the deterministic component and the shock is the random component in this stock price process simulation. The simulated outcomes form an empirical distribution rather than a parametric distribution; this Monte Carlo distribution of future values can be used directly to calculate the VaR.

GBM assumes constant volatility (generally a weakness) unlike GARCH(1,1) which models

time-varying volatility.
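To make the discrete recursion concrete, here is a minimal Python sketch (not part of the original reading; the drift, volatility, and price values are hypothetical) that simulates a handful of 10-day GBM price paths:

```python
import numpy as np

def simulate_gbm_paths(s0, mu, sigma, dt, n_steps, n_trials, seed=42):
    """Simulate GBM price paths with the discrete recursion
    S_t = S_(t-1) * (1 + mu*dt + sigma*eps*sqrt(dt))."""
    rng = np.random.default_rng(seed)
    paths = np.empty((n_trials, n_steps + 1))
    paths[:, 0] = s0
    for t in range(1, n_steps + 1):
        eps = rng.standard_normal(n_trials)      # random shock for each trial
        paths[:, t] = paths[:, t - 1] * (1 + mu * dt + sigma * eps * np.sqrt(dt))
    return paths

# Hypothetical inputs: $10 stock, 12% annual drift, 20% annual volatility, 10 daily steps
paths = simulate_gbm_paths(s0=10.0, mu=0.12, sigma=0.20, dt=1/252, n_steps=10, n_trials=40)
print(paths[:3, -1])   # terminal values of the first three of the 40 trials
```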

[Figure: 10-day GBM simulation (40 trials); simulated price paths ranging roughly between $8 and $12.]


Describe how to simulate various distributions using the inverse

transform method.

The inverse transform method translates a random number (drawn from a uniform distribution and interpreted as a cumulative probability) into a standard normal variable via the inverse of the cumulative distribution function (CDF):

Random (CDF)   NORMSINV()   pdf NORMDIST()
0.10           -1.282       0.18
0.15           -1.036       0.23
0.20           -0.842       0.28
0.25           -0.674       0.32
0.30           -0.524       0.35
0.35           -0.385       0.37
0.40           -0.253       0.39
0.45           -0.126       0.40
0.50            0.000       0.40

A random variable is generated between 0 and 1; in Excel, the function is =RAND(). This draw corresponds to the Y-axis on the first chart below and is treated as a cumulative probability under the standard normal CDF; e.g., a draw of 0.45 corresponds to -0.126 because NORMSINV(0.45) = -0.126.
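As an illustration, a minimal Python sketch of the same inverse transform (scipy's norm.ppf plays the role of Excel's NORMSINV; the uniform draws are hypothetical):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
u = rng.uniform(0.0, 1.0, size=5)   # uniform draws, like Excel's =RAND()
z = norm.ppf(u)                     # inverse CDF, like Excel's =NORMSINV()
pdf = norm.pdf(z)                   # standard normal density at z

for ui, zi, pi in zip(u, z, pdf):
    print(f"random={ui:.2f}  z={zi:+.3f}  pdf={pi:.2f}")

print(round(norm.ppf(0.45), 3))     # spot-check against the table: about -0.126
```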


Describe the bootstrap method.

The bootstrap method is a subclass of (type of) historical simulation (like HS). In regular

historical simulation, current portfolio weights are applied to the historical sample (e.g., 250

trading days). The bootstrap differs because it is “with replacement:” a historical period (i.e., a

vector of daily returns on a given day in the historical window) is selected at random. This

becomes the “simulated” vector of returns for day T+1. Then, for day T+2 simulation, a daily

vector is again selected from the window; it is “with replacement” because each simulated day

can select from the entire historical sample. Unlike historical simulation—which runs the

current portfolio through the single historical sample—the bootstrap randomizes the historical

sample and therefore can generate many historically-informed samples.
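A minimal Python sketch of the idea, under the assumption of a hypothetical 250-day, three-asset historical return matrix; each simulated day draws an entire historical date with replacement, so cross-sectional correlation within a date is preserved:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical historical sample: 250 trading days x 3 assets of daily returns
hist_returns = rng.normal(0.0, 0.01, size=(250, 3))
weights = np.array([0.5, 0.3, 0.2])              # current portfolio weights

def bootstrap_horizon_returns(hist, w, horizon_days, n_sims, rng):
    """Each simulated day draws one historical date (a full cross-sectional
    return vector) with replacement, then applies today's portfolio weights."""
    n_days = hist.shape[0]
    sims = np.empty(n_sims)
    for s in range(n_sims):
        idx = rng.integers(0, n_days, size=horizon_days)   # dates drawn with replacement
        port_daily = hist[idx] @ w                         # portfolio return on each simulated day
        sims[s] = np.prod(1 + port_daily) - 1              # compounded return over the horizon
    return sims

sims = bootstrap_horizon_returns(hist_returns, weights, horizon_days=10, n_sims=5000, rng=rng)
print("5th percentile of simulated 10-day returns:", round(np.percentile(sims, 5), 4))
```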

The advantages of the bootstrap include: can model fat-tails (like HS); by generating

repeated samples, we can ascertain estimate precision. Limitations, according to Jorion,

include: for small sample sizes, the bootstrapped distribution may be a poor

approximation of the actual one.

[Figure: standard normal PDF. Annotation: the bootstrap randomizes the historical date but keeps the same indexed returns within each date (preserves cross-sectional correlations).]


Monte Carlo versus bootstrapping

Monte Carlo simulation is a generation of a distribution of returns and/or asset prices paths by

the use of random numbers. Bootstrapping randomizes the selection of actual historical returns.

Monte Carlo                                      Bootstrapping
Both generate a hypothetical future scenario and determine VaR by a "lookup:" what is the 95th or 99th worst simulated loss?
Neither incorporates autocorrelation (basic MC does not)
Algorithm describes return path (e.g., GBM)      Retrieves set (vector) of actual historical returns
Randomizes return                                Randomizes historical date
Correlation must be modeled                      Built-in correlation
Uses parametric assumptions                      No distributional assumption (does not assume normality)
Model risk                                       Needs lots of data

Monte Carlo advantages include:

Powerful & flexible

Able to account for a range of risks (e.g., price risk, volatility risk, and nonlinear

exposures)

Can be extended to longer horizons (important for credit risk measurement)

Can measure operational risk.

However, Monte Carlo simulation can be expensive and time-consuming to run,

including: costly computing power and costly expertise (human capital).

Bootstrapping advantages include:

Simple to implement

Naturally incorporates spatial (cross-sectional) correlation

Automatically captures non-normality in price changes (i.e., does not impose a

parametric distributional assumption)


Explain how simulations can be used for computing VaR and pricing

options.

Value at Risk (VaR)

Once a price path has been generated, we can build a portfolio distribution at the end of the

selected horizon:

1. Choose a stochastic process and parameters

2. Generate a pseudo-sequence of variables from which prices are computed

3. Calculate the value of the asset (or portfolio) under this particular sequence of prices at

the target horizon

4. Repeat steps 2 and 3 as many times as needed

This process creates a distribution of values. We can sort the observations and tabulate the expected value $E(F_T)$ and the quantile $Q(F_T, c)$; with 10,000 replications, the quantile is approximately the (c × 10,000)th ranked observation.

Value at risk (VaR) relative to the mean is then:

$\text{VaR}(c, T) = E(F_T) - Q(F_T, c)$
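Following the four steps above, a minimal Python sketch (the GBM process and all parameters are hypothetical) that builds the terminal-value distribution and reads off the 99% VaR relative to the mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: choose a stochastic process (GBM) and hypothetical parameters
s0, mu, sigma = 100.0, 0.10, 0.25
horizon = 10 / 252          # 10-day target horizon, in years
c = 0.99                    # VaR confidence level
n_reps = 10_000             # number of replications

# Steps 2-4: generate terminal prices from pseudo-random sequences, repeated n_reps times
eps = rng.standard_normal(n_reps)
f_T = s0 * np.exp((mu - 0.5 * sigma**2) * horizon + sigma * np.sqrt(horizon) * eps)

# Sort/tabulate the distribution: expected value and the c-quantile of terminal values
expected = f_T.mean()
quantile = np.quantile(f_T, 1 - c)        # roughly the (1-c)*n_reps-th worst outcome

var_relative = expected - quantile        # VaR(c, T) = E(F_T) - Q(F_T, c)
print(f"E[F_T]={expected:.2f}  Q(F_T,c)={quantile:.2f}  VaR={var_relative:.2f}")
```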

Pricing options

Options can be priced under the risk-neutral valuation method by using Monte Carlo simulation:

1. Choose a process with drift equal to riskless rate (mu = r)

2. Simulate prices to the horizon

3. Calculate the payoff of the stock option (or derivative) at maturity

4. Repeat steps as often as needed

The current value of the derivative is obtained by discounting at the risk free rate and averaging

across all experiments:

$f_t = E^{*}\!\left[e^{-r\tau} F(S_T)\right]$

This formula means that each future simulated price, F(St), is discounted at the risk-free rate;

i.e., to solve for the present value. Then the average of those values is the expected value, or

value of the option. The Monte Carlo method has several advantages. It can be applied in

many situations, including options with so-called price-dependent paths (i.e., where the value

depends on the particular path) and options with atypical payoff patterns. Also, it is powerful

and flexible enough to handle varieties of options. With one notable exception: it cannot

accurately price options where the holder can exercise early (e.g., American-style options).
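A minimal Python sketch of the risk-neutral recipe above for a plain-vanilla European call (all inputs are hypothetical); the option value is the discounted average payoff across the simulations:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical European call inputs
s0, k, r, sigma, tau = 100.0, 105.0, 0.03, 0.20, 1.0
n_reps = 200_000

# Steps 1-2: drift equals the riskless rate (mu = r); simulate terminal prices
eps = rng.standard_normal(n_reps)
s_T = s0 * np.exp((r - 0.5 * sigma**2) * tau + sigma * np.sqrt(tau) * eps)

# Step 3: payoff of the option at maturity
payoff = np.maximum(s_T - k, 0.0)

# Step 4: discount at the risk-free rate and average across all experiments
call_price = np.exp(-r * tau) * payoff.mean()
print(f"Monte Carlo call price: {call_price:.2f}")
```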


Describe the relationship between the number of Monte Carlo

replications and the standard error of the estimated values.

The relationship between the number of replications and precision (i.e., the standard error of

estimated values) is not linear: to increase the precision by 10X requires approximately 100X

more replications. The standard error of the sample standard deviation:

$SE(\hat{\sigma}) \approx \hat{\sigma}\,\frac{1}{\sqrt{2T}}$

Therefore, to shrink the standard error by a factor of 1/T requires about T² times as many replications; e.g., 10x precision needs 100x the replications.

Describe and identify simulation acceleration techniques.

Because an increase in precision requires exponentially more replications, acceleration

techniques are used:

Antithetic variable technique: changes the sign of the random samples. Appropriate

when the original distribution is symmetric. Creates twice as many replications at little

additional cost.

Control variates technique: attempts to increase accuracy by reducing sample variance

instead of by increasing sample size (the traditional approach).

Importance sampling technique (Jorion calls this the most effective acceleration

technique): attempts to sample along more important paths

Stratified sampling technique: partitions the simulation region into two zones.

[Illustration: to reduce se(σ̂) to 1/10 of its value (better precision) requires 10² = 100x the replications.]
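As a sketch of the antithetic variable technique listed above (the option-pricing setup is hypothetical), each draw z is paired with its mirror image -z; the paired estimate shows a smaller standard error for the same number of underlying draws:

```python
import numpy as np

rng = np.random.default_rng(3)
s0, k, r, sigma, tau = 100.0, 105.0, 0.03, 0.20, 1.0
n = 50_000

def disc_payoff(z):
    """Discounted call payoff for a vector of standard normal draws z."""
    s_T = s0 * np.exp((r - 0.5 * sigma**2) * tau + sigma * np.sqrt(tau) * z)
    return np.exp(-r * tau) * np.maximum(s_T - k, 0.0)

z = rng.standard_normal(n)
plain = disc_payoff(z)
paired = 0.5 * (disc_payoff(z) + disc_payoff(-z))   # antithetic: average each draw with its mirror

print("plain      mean =", round(plain.mean(), 3), " se =", round(plain.std(ddof=1) / np.sqrt(n), 4))
print("antithetic mean =", round(paired.mean(), 3), " se =", round(paired.std(ddof=1) / np.sqrt(n), 4))
```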


Explain how to simulate correlated random variables using Cholesky

factorization.

Cholesky factorization

By virtue of the inverse transform method, we can use =NORMSINV(RAND()) to create standard

random normal variables. The RAND() function is a uniform distribution bounded by [0,1]. The

NORMSINV() translates the random number into the z-value that corresponds to the probability

given by a cumulative distribution. For example, =NORMSINV(5%) returns -1.645 because 5% of

the area under a normal curve lies to the left of - 1.645 standard deviations.

But no realistic asset or portfolio contains only one risk factor. To model several risk factors, we

could simply generate multiple random variables. Put more technically, the realistic modeling

scenario is a multivariate distribution function that models multiple random variables. But the

problem with this approach, if we just stop there, is that correlations are not included. What we

really want to do is simulate random variables but in such a way that we capture or reflect the

correlations between the variables. In short, we want random but correlated variables.

The typical way to incorporate the correlation structure is by way of a Cholesky factorization (or

decomposition) . There are four steps:

1. The covariance matrix. This contains the implied correlation structure; in fact, a

covariance matrix can itself be decomposed into a correlation matrix and a volatility

vector.

2. The covariance matrix(R) will be decomposed into a lower-triangle matrix (L) and an

upper-triangle matrix (U). Note they are mirrors of each other. Both have identical

diagonals; their zero elements and nonzero elements are merely "flipped"

3. Given that R = LU, we can solve for all of the matrix elements: a,b,c (the diagonal) and x,

y, z. That is “by definition.” That's what a Cholesky decomposition is, it is the solution

that produces two triangular matrices whose product is the original (covariance) matrix.

4. Given the solution for the matrix elements, we can calculate the product of the triangular matrices to ensure the product does equal the original covariance matrix (i.e., does LU = R?). Note, in Excel a single array formula can be used with =MMULT().

The lower triangle (L) is the result of the Cholesky decomposition. It is what we use to simulate random variables that are themselves "informed" by our covariance matrix.


Correlated random variables

The following transforms two independent random variables into correlated random variables:

$\epsilon_1 = \eta_1$
$\epsilon_2 = \rho\,\eta_1 + \left(1 - \rho^2\right)^{1/2}\eta_2$

where $\eta_1, \eta_2$ are independent random variables, $\rho$ is the correlation coefficient, and $\epsilon_1, \epsilon_2$ are the correlated random variables.

Snapshot from the learning spreadsheet:

Assumptions: Correlation 0.75; Mean 1% (both series); Volatility 10% (both series)

Correlated N(0,1) #1   Correlated N(0,1) #2   Series #1   Series #2
2.06                   1.26                   $10.00      $10.00
0.52                   (0.73)                 $10.62      $9.37
1.51                   0.99                   $12.34      $10.39
(1.44)                 0.48                   $10.68      $11.00

If the variables are uncorrelated, randomization can be performed independently for each variable. Generally, however, variables are correlated. To account for this correlation, we start with a set of independent variables η, which are then transformed into correlated variables ε. In a two-variable setting, we construct the following:

[Figures: scatter plot of the correlated random variables; correlated time series of the two simulated price paths.]


$\epsilon_1 = \eta_1$
$\epsilon_2 = \rho\,\eta_1 + \left(1 - \rho^2\right)^{1/2}\eta_2$

where ρ is the correlation coefficient between the variables ε1 and ε2. This is a transformation of two independent random variables into correlated random variables. Prior to the transformation, η1 and η2 are independent; after it, ε1 and ε2 have the necessary correlation. The first random variable is retained (ε1 = η1) and the second is transformed (recast) into a random variable that is correlated with the first.
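A minimal Python sketch of both routes under a hypothetical two-asset covariance matrix: np.linalg.cholesky returns the lower triangle L, and multiplying independent standard normals by L (or applying the two-variable formula above) produces correlated draws:

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.75
vols = np.array([0.10, 0.10])                      # hypothetical volatility vector

# Covariance matrix R = correlation matrix scaled by the volatility vector
corr = np.array([[1.0, rho], [rho, 1.0]])
R = np.outer(vols, vols) * corr

L = np.linalg.cholesky(R)                          # lower-triangular factor
assert np.allclose(L @ L.T, R)                     # check: the product recovers R

eta = rng.standard_normal((2, 10_000))             # independent standard normals
eps = L @ eta                                      # correlated draws (covariance ~ R)
print("sample correlation (Cholesky):", round(np.corrcoef(eps)[0, 1], 3))

# Two-variable special case: eps1 = eta1, eps2 = rho*eta1 + sqrt(1 - rho^2)*eta2
eps2 = rho * eta[0] + np.sqrt(1 - rho**2) * eta[1]
print("sample correlation (formula):", round(np.corrcoef(eta[0], eps2)[0, 1], 3))
```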

Describe deterministic simulations.

Quasi Monte Carlo (QMC) – a.k.a. deterministic simulation

Instead of drawing independent samples, the deterministic scheme systematically fills the space

left by previous numbers in the series.

Advantage: the standard error shrinks at a rate of 1/K rather than $1/\sqrt{K}$.

Disadvantage: since the draws are not independent, accuracy cannot be easily determined.

Monte Carlo simulations methods generate independent, pseudorandom points that attempt to

“fill” an N-dimensional space, where N is the number of variables driving the price of securities.

Researchers now realize that the sequence of points does not have to be chosen randomly. In a

deterministic scheme, the draws (or trials) are not entirely random. Instead of random trials,

this scheme fills space left by previous numbers in the series.

Scenario Simulation

The first step consists of using principal-component analysis to reduce the dimensionality of the

problem; i.e., to use the handful of factors, among many, that are most important.

The second step consists of building scenarios for each of these factors, approximating a normal

distribution by a binomial distribution with a small number of states.


Discuss the drawbacks and limitations of simulation procedures.

Simulation Methods are flexible

Either postulate a stochastic process or resample historical data

All provide full valuation on the target date

However

More prone to model risk: need to pre-specify the distribution

Much slower and less transparent than analytical methods

Sampling variation (more precision requires vastly greater number of replications)

The tradeoff is speed vs. accuracy

A key drawback of the Monte Carlo method is the computational requirements; a large number

of replications are typically required (e.g., thousands of trials are not unusual).

Simulations inevitably generate sampling variability, or variations in summary statistics due to

the limited number of replications. More replications lead to more precise estimates but take

longer to estimate.


Hull, Chapter 22:

Estimating Volatilities

and Correlations

In this chapter…

Discuss how historical data and various weighting schemes can be used in estimating volatility.
Describe the exponentially weighted moving average (EWMA) model for estimating volatility and its properties. Estimate volatility using the EWMA model.
Describe the generalized auto regressive conditional heteroscedasticity [GARCH(p,q)] model for estimating volatility and its properties. Estimate volatility using the GARCH(p,q) model. Explain mean reversion and how it is captured in the GARCH(1,1) model.
Discuss how the parameters of the GARCH(1,1) and the EWMA models are estimated using maximum likelihood methods.
Explain how GARCH models perform in volatility forecasting.
Discuss how correlations and covariances are calculated, and explain the consistency condition for covariances.

How to Estimate Volatility

Volatility is instantaneously unobservable. In general, our basic approaches either infer an

implied volatility (based on an observed market price) or estimate the current volatility based

on a historical series of returns. There are two broad steps to computing historical volatility:

1. Compute the series of periodic (e.g., daily) returns;

2. Choose a weighting scheme (to translate a series into a single metric)


1. Compute the series of periodic returns (e.g., 1 period = 1 day)

In many cases, we assume one period equals one day. In this case, we are estimating a daily

volatility. We can either compute the “continuously compounded daily return” or the “simple

percentage change.” If Si-1 is yesterday’s price and Si is today’s price,

Continuously compounded return (aka, log return):

$u_i = \ln\!\left(\frac{S_i}{S_{i-1}}\right)$

The simple percentage return is given by:

$u_i = \frac{S_i - S_{i-1}}{S_{i-1}}$

Linda Allen (the next reading) contrasts three periodic returns: continuously

compounded, simple percentage change, and absolute change (she says we should only

use absolute changes in the case of interest rate-related variables). She argues that

continuously compounded returns should be used when computing VAR because these

returns are “time consistent.”

2. Choose a weighting scheme

The series can be either un-weighted (each return is equally weighted) or weighted. A weighted

scheme puts more weight on recent returns because they tend to be more relevant.

The “standard” un-weighted (or equally weighted) scheme

The un-weighted (which is really equally-weighted) variance is a “standard” historical variance.

In this case, the variance is given by:

$\sigma_n^2 = \frac{1}{m-1}\sum_{i=1}^{m}\left(u_{n-i} - \bar{u}\right)^2$

where $\sigma_n^2$ = the variance rate per day, $m$ = the most recent m observations, and $\bar{u}$ = the mean (average) of all daily returns $u_i$.

Please note this is the sample formula employed by Stock and Watson for the sample

variance. This is technically a correct variance.


However, in practice the dataset often consists of daily returns, a relatively large sample (e.g.,

one year = 250 trading days), and the mean is often near to zero. Given this, for practical

purposes and because the difference is typically insignificant, Hull makes two simplifying

assumptions:

The average daily return is assumed to be zero: $\bar{u} = 0$

The denominator (m-1) is replaced with m.

This produces a simplified version of the standard (un-weighted) variance:

$\sigma_n^2 = \frac{1}{m}\sum_{i=1}^{m}u_{n-i}^2$

According to Hull: “These three changes make very little difference to the estimates that are

calculated, but they allow us to simplify the formula.” Hull’s third change is to switch from the

continuously compounded (log) return to the simple return, but we recommend that you keep

the log return to maintain consistency with the next (Linda Allen) reading.

In the "convenient" version, we replace (m-1) with (m) in the denominator. (m-1) produces an "unbiased" estimator and (m) produces a "maximum likelihood" estimator.

Which is correct? Both are correct; the choice turns on which properties of the

estimator we find more desirable. Estimators are like recipes intended to give estimates

of the true population variance. There can be different “recipes;” some will have more

desirable properties than others.

Discuss how historical data and various weighting schemes can be used in

estimating volatility.

The weighted scheme (a better approach, generally)

The simple historical approach does not apply different weights to each return (put another

way, it gives equal weights to each return). But we generally prefer to apply greater weights to

more recent returns:

$\sigma_n^2 = \sum_{i=1}^{m}\alpha_i u_{n-i}^2$

The alpha (α) parameters are simply weights; the sum of the alpha (α) parameters must equal one because they are weights.

The glaring flaw in a simple historical volatility (i.e., an un-weighted or equally-weighted

variance) is that the most distant return gets the same weight as yesterday’s return


We can now add another factor to the model: the long-run average variance rate. The idea here

is that the variance is "mean regressing:" think of the variance as having a "gravitational pull"

toward its long-run average. We add another term to the equation above, in order to capture the

long-run average variance. The added term is the weighted long-run variance:

$\sigma_n^2 = \gamma V_L + \sum_{i=1}^{m}\alpha_i u_{n-i}^2$

The added term is gamma (γ, the weight) multiplied by the long-run variance ($V_L$); i.e., the long-run variance enters the model as another weighted factor.

The omega (ω)-formatted ARCH(m) model:

$\sigma_n^2 = \omega + \sum_{i=1}^{m}\alpha_i u_{n-i}^2$

This is the same ARCH(m) only the product of gamma and the long-run variance is

replaced by a single constant, omega (ω). Why does this matter? Because you may see

GARCH(1,1) represented with a single constant (i.e., the omega term), and you want to

realize the constant will not be the long-run variance itself; rather, the constant will be

the product of the long-run variance and a weight.

Summary: Un-weighted versus weighted

Un-weighted scheme:

$\sigma_n^2 = \frac{1}{m}\sum_{i=1}^{m}u_{n-i}^2$

Weighted scheme (the alpha(i) weights must sum to one):

$\sigma_n^2 = \sum_{i=1}^{m}\alpha_i u_{n-i}^2$


Describe the exponentially weighted moving average (EWMA) model for estimating volatility. Estimate volatility using the EWMA model.

In exponentially weighted moving average (EWMA), the weights decline (in constant proportion,

given by lambda). The exponentially weighted moving average (EWMA) is given by:

$\sigma_n^2 = (1-\lambda)\left[\lambda^{0}u_{n-1}^2 + \lambda^{1}u_{n-2}^2 + \lambda^{2}u_{n-3}^2 + \cdots\right]$ (infinite series)

The infinite series above reduces to (i.e., is equivalent to) the recursive form of EWMA:

$\sigma_n^2 = \lambda\,\sigma_{n-1}^2 + (1-\lambda)\,u_{n-1}^2$

RiskMetricsTM is just a branded version of EWMA:

$\sigma_n^2 = (0.94)\,\sigma_{n-1}^2 + (0.06)\,u_{n-1}^2$

“The EWMA approach has the attractive feature that relatively little data need to be

stored. At any given time, only the current estimate of the variance rate and the most

recent observation on the value of the market variable need be remembered. When a

new observation on the market variable is obtained, a new daily percentage change is

calculated … to update the estimate of the variance rate. The old estimate of the

variance rate and the old value of the market variable can then be discarded.” –Hull
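A minimal Python sketch of the recursive EWMA update (the price series is hypothetical; lambda is the RiskMetrics daily value of 0.94):

```python
import numpy as np

lam = 0.94                                              # RiskMetrics daily decay factor
prices = np.array([10.00, 10.05, 9.98, 10.10, 10.02])   # hypothetical closing prices
returns = np.log(prices[1:] / prices[:-1])              # continuously compounded daily returns

var = returns[0] ** 2                      # seed the recursion with the first squared return
for u in returns[1:]:
    var = lam * var + (1 - lam) * u ** 2   # sigma^2_n = lambda*sigma^2_(n-1) + (1-lambda)*u^2_(n-1)

print(f"updated EWMA volatility estimate: {np.sqrt(var):.4%}")
```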

Describe the generalized auto regressive conditional heteroscedasticity

(GARCH(p,q)) model for estimating volatility and its properties. Estimate

volatility using the GARCH(p,q) model.

EWMA is a special case of GARCH(1,1) where gamma = 0 and (alpha + beta = 1). GARCH (1,1) is

the weighted sum of a long run-variance (weight = gamma), the most recent squared-return

(weight = alpha), and the most recent variance (weight = beta)

$\sigma_n^2 = \gamma V_L + \alpha\,u_{n-1}^2 + \beta\,\sigma_{n-1}^2$

This GARCH(1,1) is a case of the ARCH(m) above: the first term is constant omega (i.e.,

the weighted long-run variance) and the second and third terms are recursively giving

exponentially decreasing weights to the historical series of returns.

Note: the ratio between any two consecutive weights is constant, lambda (λ); this is why the infinite series elegantly reduces to the recursive EWMA.


To summarize the key features of the GARCH(1,1):

$\sigma_n^2 = \gamma V_L + \alpha\,u_{n-1}^2 + \beta\,\sigma_{n-1}^2$

"The '(1,1)' in GARCH(1,1) indicates that σn² is based on the most recent observation of u² and the most recent estimate of the variance rate. The more general GARCH(p, q) model calculates σn² from the most recent p observations on u² and the most recent q estimates of the variance rate. GARCH(1,1) is by far the most popular of the GARCH models." – Hull

Two worked examples (in two columns) are on the following page.

The mean reversion term is the product of a weight (gamma) and the long-run (unconditional) variance. If gamma = 0, GARCH(1,1) “reduces” to EWMA

An alpha weight applied to the most recent return^2 (aka, “innovation”). Alpha is analogous to (1- lambda) in EWMA

A beta weight applied to the most recent variance. Beta is analogous to lambda in EWMA.


In the volatility practice bag (learning spreadsheet 2.b.6), we illustrate and compare the

calculation of EWMA to GARCH(1,1):

                                Example 1      Example 2
beta (b) or lambda              0.860          0.898        In both, most weight to lag variance
If EWMA: lambda only
  1 - lambda                    0.140          0.102        In EWMA, only two weights
  sum of weights                1.00           1.00
If GARCH(1,1): alpha, beta, & gamma
  omega (w)                     0.00000200     0.00000176   omega = gamma * long-run variance
  alpha (a)                     0.130          0.063        Weight to lag return
  alpha + beta (a+b)            0.9900         0.9602       "persistence" of GARCH
  gamma                         0.010          0.040        Weight to L.R. var = 1 - alpha - beta
  sum of weights                1.000          1.000
Long Term Variance              0.00020        0.00004      omega/(1 - alpha - beta)
Long Term Volatility            1.4142%        0.6650%      SQRT()

Updated Volatility Estimate: Assumptions
Last Volatility                 1.60%          0.60%
Last Variance                   0.000256       0.000036
Yesterday's price               10.00          10.00
Today's price                   9.90           10.21
Last Return                     -1.0%          2.0%

EWMA
Updated Variance                0.000235       0.000074     lambda*lag variance + (1-lambda)*lag return^2
Updated Volatility              1.53%          0.86%

GARCH(1,1)
Updated Variance                0.000236       0.000060     omega + beta*lag var + alpha*lag return^2
Updated Volatility              1.53%          0.77%

GARCH(1,1) Forecast
Number of days (t)              10             10
Forecast Variance               0.00023218     0.00005463   L.R. var + (alpha+beta)^t*(var - L.R. var)
Forecast Volatility             1.524%         0.739%
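A minimal Python sketch that reproduces the first column of the comparison above (omega = 0.000002, alpha = 0.13, beta = 0.86, and an EWMA lambda of 0.86):

```python
import numpy as np

# Example 1 parameters from the comparison above
omega, alpha, beta = 0.000002, 0.13, 0.86
lam = 0.86                                   # EWMA lambda (same weight on the lagged variance)

last_vol, last_return = 0.016, -0.01         # last volatility 1.60%, last return -1.0%
last_var = last_vol ** 2

ewma_var = lam * last_var + (1 - lam) * last_return ** 2
garch_var = omega + beta * last_var + alpha * last_return ** 2
long_run_var = omega / (1 - alpha - beta)    # 0.000002 / 0.01 = 0.0002

print(f"EWMA updated volatility : {np.sqrt(ewma_var):.2%}")      # ~1.53%
print(f"GARCH updated volatility: {np.sqrt(garch_var):.2%}")     # ~1.53%
print(f"long-run volatility     : {np.sqrt(long_run_var):.2%}")  # ~1.41%
```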


Explain mean reversion and how it is captured in the GARCH(1,1) model.

If we are given omega and two of the weights (alpha and beta), we can use our understanding of

GARCH(1,1) ….

$\sigma_n^2 = \gamma V_L + \alpha\,u_{n-1}^2 + \beta\,\sigma_{n-1}^2 = \omega + \alpha\,u_{n-1}^2 + \beta\,\sigma_{n-1}^2$

… to solve for the long-run average variance as a function of omega and the weights:

$V_L = \frac{\omega}{1 - \alpha - \beta}$

Discuss how the parameters of the GARCH(1,1) and the EWMA models are

estimated using maximum likelihood methods.

In maximum likelihood methods we choose parameters that maximize the likelihood of the

observations occurring.

Max Likelihood Estimation (MLE) for GARCH(1,1)
Avg Return              0.001
Std Dev (Returns)       0.006
mu                      0.0006
Omega                   0.0000
Alpha                   0.0001
Beta                    0.8221
mu * 1000               0.646
alpha                   0.000
persistence             0.822
variance * 10000        0.363
Log likelihood value:   110.94

“It is now appropriate to discuss how the parameters in the models we have been

considering are estimated from historical data. The approach used is known as the

maximum likelihood method. It involves choosing values for the parameters that

maximize the chance (or likelihood) of the data occurring. To illustrate the method, we

start with a very simple example. Suppose that we sample 10 stocks at random on a

certain day and find that the price of one of them declined on that day and the prices of

the other nine either remained the same or increased. What is the best estimate of the

probability of a price decline? The natural answer is 0.1.” –Hull


Explain how GARCH models perform in volatility forecasting.

The forecasted variance rate, (k) days forward, is given by:

$E\left[\sigma_{n+k}^2\right] = V_L + (\alpha + \beta)^k\left(\sigma_n^2 - V_L\right)$

Equivalently, the expected future variance rate, (t) periods forward, is given by:

$E\left[\sigma_{n+t}^2\right] = V_L + (\alpha + \beta)^t\left(\sigma_n^2 - V_L\right)$

For example, assume that a current volatility estimate (period n) is given by the following

GARCH (1, 1) equation:

$\sigma_n^2 = 0.00008 + (0.1)(4\%)^2 + (0.7)(0.0016)$

In this example, alpha is the weight (0.1) assigned to the previous squared return (the previous

return was 4%), beta is the weight (0.7) assigned to the previous variance (0.0016).

What is the expected future volatility, in ten days (n + 10)?

First, solve for the long-run variance. It is not 0.00008; this term is the product of the variance

and its weight. Since the weight must be 0.2 (= 1 - 0.1 -0.7), the long run variance = 0.0004.

$V_L = \frac{0.00008}{1 - 0.1 - 0.7} = 0.0004$

Second, we need the current variance (period n). That is almost given to us above:

$\sigma_n^2 = 0.00008 + 0.00016 + 0.00112 = 0.00136$

Now we can apply the formula to solve for the expected future variance rate:

$E\left[\sigma_{n+10}^2\right] = 0.0004 + (0.1 + 0.7)^{10}\,(0.00136 - 0.0004) = 0.0005031$

This is the expected variance rate, so the expected volatility is approximately 2.24%. Notice how

this works: the current volatility is about 3.69% and the long-run volatility is 2%. The 10-day

forward projection “fades” the current rate nearer to the long-run rate.
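A minimal Python check of the worked forecast above:

```python
# GARCH(1,1) parameters from the example above
omega, alpha, beta = 0.00008, 0.1, 0.7
u_prev, var_prev = 0.04, 0.0016              # previous return 4%, previous variance 0.0016

long_run_var = omega / (1 - alpha - beta)                    # 0.0004
current_var = omega + alpha * u_prev**2 + beta * var_prev    # 0.00136

# E[sigma^2_(n+k)] = V_L + (alpha + beta)^k * (sigma^2_n - V_L)
k = 10
forecast_var = long_run_var + (alpha + beta) ** k * (current_var - long_run_var)

print(round(forecast_var, 7))         # ~0.0005031
print(round(forecast_var ** 0.5, 4))  # ~0.0224, i.e., roughly 2.24% volatility
```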


Discuss how correlations and covariances are calculated, and explain the

consistency condition for covariances.

Correlations play a key role in the calculation of value at risk (VaR). We can use similar methods

to EWMA for volatility. In this case, an updated covariance estimate is a weighted sum of

The recent covariance; weighted by lambda

The recent cross-product; weighted by (1-lambda)

$\text{cov}_n = \lambda\,\text{cov}_{n-1} + (1-\lambda)\,x_{n-1}\,y_{n-1}$
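A minimal sketch of this covariance update in Python (the lambda, prior covariance, and return pair are hypothetical):

```python
lam = 0.94                       # decay factor (same role as lambda in the volatility update)
cov_prev = 0.000120              # yesterday's covariance estimate (hypothetical)
x_prev, y_prev = 0.01, -0.005    # yesterday's returns on x and y (hypothetical)

cov_new = lam * cov_prev + (1 - lam) * x_prev * y_prev
print(cov_new)                   # 0.94*0.000120 + 0.06*(0.01 * -0.005) = 0.0001098
```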


Allen, Boudoukh, and

Saunders, Chapter 2:

Quantifying Volatility in

VaR Models

In this chapter…

Discuss how asset return distributions tend to deviate from the normal distribution.
Explain potential reasons for the existence of fat tails in a return distribution and discuss the implications fat tails have on analysis of return distributions.
Distinguish between conditional and unconditional distributions.
Discuss the implications regime switching has on quantifying volatility.
Explain the various approaches for estimating VaR.
Compare, contrast and calculate parametric and non-parametric approaches for estimating conditional volatility, including: historical standard deviation; exponential smoothing; GARCH approach; historic simulation; multivariate density estimation; hybrid methods.
Explain the process of return aggregation in the context of volatility forecasting methods.
Discuss implied volatility as a predictor of future volatility and its shortcomings.
Explain long horizon volatility/VaR and the process of mean reversion according to an AR(1) model.


Key terms

Risk varies over time. Models often assume a normal (Gaussian) distribution (“normality”) with

constant volatility from period to period. But actual returns are non-normal and volatility varies

over time (volatility is “time-varying” or “non-constant”). Therefore, it is hard to use parametric

approaches to random returns; in technical terms, it is hard to find robust “distributional

assumptions for stochastic asset returns”

Conditional parameter (e.g., conditional volatility): a parameter such as variance that

depends on (is conditional on) circumstances or prior information. A conditional

parameter, by definition, changes over time.

Persistence: In EWMA, the lambda parameter (λ). In GARCH (1,1), the sum of the alpha

(α) and beta (β) parameters. High persistence implies slow decay toward the long-run

average variance.

Autoregressive: Recursive. A parameter (today’s variance) is a function of itself

(yesterday’s variance).

Heteroskedastic: Variance changes over time (homoskedastic = constant variance).

Leptokurtosis: a fat-tailed distribution where relatively more observations are near the

middle and in the fat tails (kurtosis > 3).

Stochastic behavior of returns

Risk measurement (VaR) concerns the tail of a distribution, where losses occur. We want to

impose a mathematical curve (a “distributional assumption”) on asset returns so we can

estimate losses. The parametric approach uses parameters (i.e., a formula with parameters) to

make a distributional assumption but actual returns rarely conform to the distribution curve. A

parametric distribution plots a curve (e.g., the normal bell-shaped curve) that approximates a

range of outcomes but actual returns are not so well-behaved: they rarely “cooperate.”


Value at Risk (VaR) – 2 asset, relative vs. absolute

Know how to compute two-asset portfolio variance & scale portfolio volatility to derive VaR:

Inputs (per annum)
Trading days/year                      252
Initial portfolio value (W)            $100
VaR time horizon (days) (h)            10
VaR confidence interval                95%
Asset A: Volatility (per year)         10.0%
Asset A: Expected return (per year)    12.0%
Asset A: Portfolio weight (w)          50%
Asset B: Volatility                    20.0%
Asset B: Expected return (per year)    25.0%
Asset B: Portfolio weight (1-w)        50%
Correlation (A,B)                      0.30
Autocorrelation (h-1, h)               0.25      Independent = 0; mean reverting = negative

Outputs (annual)
Covariance (A,B)                       0.0060    COV = (correlation A,B)(volatility A)(volatility B)
Portfolio variance                     0.0155
Expected portfolio return              18.5%
Portfolio volatility (per year)        12.4%

Period (h days)
Expected periodic return (u)           0.73%
Std deviation (h), i.i.d.              2.48%
Scaling factor                         15.78     Not needed here; used for AR(1)
Std deviation (h), autocorrelation     3.12%     Standard deviation if auto-correlated
Normal deviate (critical z value)      1.64
Expected future value                  100.73
Relative VaR, i.i.d.                   $4.08     Does not include the mean return
Absolute VaR, i.i.d.                   $3.35     Includes the return; i.e., loss from zero
Relative VaR, AR(1)                    $5.12     The corresponding VaRs if autocorrelation is incorporated; note VaR is higher!
Absolute VaR, AR(1)                    $4.39

Relative VaR, i.i.d. = $100 value * 2.48% 10-day sigma * 1.645 normal deviate
Absolute VaR, i.i.d. = $100 * (-0.73% + 2.48% * 1.645)
Relative VaR, AR(1) = $100 value * 3.12% 10-day AR sigma * 1.645 normal deviate
Absolute VaR, AR(1) = $100 * (-0.73% + 3.12% * 1.645)
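A minimal Python sketch that reproduces the i.i.d. figures in the table above (the AR(1) adjustment is omitted):

```python
import numpy as np
from scipy.stats import norm

# Inputs from the table above
W, h, days, conf = 100.0, 10, 252, 0.95
vol = np.array([0.10, 0.20])          # annual volatilities of assets A and B
mu = np.array([0.12, 0.25])           # annual expected returns
w = np.array([0.5, 0.5])              # portfolio weights
rho = 0.30

cov_ab = rho * vol[0] * vol[1]                                                   # 0.0060
port_var = (w[0] * vol[0])**2 + (w[1] * vol[1])**2 + 2 * w[0] * w[1] * cov_ab    # 0.0155
port_vol = np.sqrt(port_var)                                                     # ~12.4% per year
port_ret = w @ mu                                                                # 18.5% per year

sigma_h = port_vol * np.sqrt(h / days)        # ~2.48% over 10 days (i.i.d. scaling)
mu_h = port_ret * h / days                    # ~0.73% over 10 days
z = norm.ppf(conf)                            # ~1.645

print("Relative VaR, i.i.d.:", round(W * sigma_h * z, 2))               # ~4.08
print("Absolute VaR, i.i.d.:", round(W * (-mu_h + sigma_h * z), 2))     # ~3.35
```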


Discuss how asset return distributions tend to deviate from the normal

distribution.

Compared to a normal (bell-shaped) distribution, actual asset returns tend to be:

Fat-tailed (a.k.a., heavy tailed): A fat-tailed distribution is characterized by having

more probability weight (observations) in its tails relative to the normal distribution.

Skewed: A skewed distribution refers—in this context of financial returns—to the

observation that declines in asset prices are more severe than increases. This is in

contrast to the symmetry that is built into the normal distribution.

Unstable: the parameters (e.g., mean, volatility) vary over time due to variability in

market conditions.

NORMAL RETURNS ACTUAL FINANCIAL RETURNS

Symmetrical Skewed

“Normal” Tails Fat-tailed (leptokurtosis)

Stable Unstable (time-varying)

Interest rate distributions are not constant over time

10 years of interest rate data are collected (1982 – 1993). The distribution plots the daily change

in the three-month treasury rate. The average change is approximately zero, but the “probability

mass” is greater at both tails. It is also greater at the mean; i.e., the actual mean occurs more

frequently than predicted by the normal distribution.

[Figure: distribution of daily changes in the three-month Treasury rate compared to the normal curve. Annotations: 1st moment = mean ("location"); 2nd moment = variance ("scale"); 3rd moment = skew; 4th moment = kurtosis. Actual returns are: 1. skewed, 2. fat-tailed (kurtosis > 3), 3. unstable.]


Explain potential reasons for the existence of fat tails in a return

distribution and discuss the implications fat tails have on analysis of

return distributions.

A distribution is unconditional if tomorrow’s distribution is the same as today’s distribution. But

fat tails could be explained by a conditional distribution: a distribution that changes over time.

Two things can change in a normal distribution: mean and volatility. Therefore, we can explain

fat tails in two ways:

Conditional mean is time-varying; but this is unlikely given the assumption that

markets are efficient

Conditional volatility is time-varying; Allen says this is the more likely explanation!

Explain how outliers can really be indications that the volatility varies with time.

We observe that actual financial returns tend to exhibit fat-tails. Jorion (like Allen et al) offers

two possible explanations:

The true distribution is stationary. Therefore, fat-tails reflect the true distribution but the

normal distribution is not appropriate

The true distribution changes over time (it is “time-varying”). In this case, outliers can in

reality reflect a time-varying volatility.

[Figure annotation: the normal distribution says -10% at the 95th percentile; if the true distribution has fat tails, the expected VaR loss is understated!]


Distinguish between conditional and unconditional distributions.

An unconditional distribution is the same regardless of market or economic conditions; for

this reason, it is likely to be unrealistic.

A conditional distribution is not always the same: it is different, or conditional on, some

economic or market or other state. It is measured by parameters such as its conditional mean,

conditional standard deviation (conditional volatility), conditional skew, and conditional

kurtosis.

Discuss the implications regime switching has on quantifying volatility.

A typical example is a regime-switching volatility model: the regime (state) switches from

low to high volatility, but is never in between. A distribution is “regime-switching” if it changes

from high to low volatility.

The problem: a risk manager may assume (and measure) an unconditional volatility but the

distribution is actually regime switching. In this case, the distribution is conditional (i.e., it

depends on conditions) and might be normal but regime-switching; e.g., volatility is 10% during a

low-volatility regime and 20% during a high-volatility regime but during both regimes, the

distribution may be normal. However, the risk manager may incorrectly assume a single 15%

unconditional volatility. But in this case, the unconditional volatility is likely to exhibit fat

tails because it does not account for the regime switching.


Explain the various approaches for estimating VaR.

Volatility versus Value at Risk (VaR)

Volatility is an input into our (parametric) value at risk (VaR):

$\$\text{VaR} = W \times z \times \sigma$

$\%\text{VaR} = z \times \sigma$

Linda Allen’s Historical-based approaches

The common attribute to all the approaches within this class is their use of historical time series

data in order to determine the shape of the conditional distribution.

Parametric approach. The parametric approach imposes a specific distributional

assumption on conditional asset returns. A representative member of this class of models

is the conditional (log) normal case with time-varying volatility, where volatility is

estimated from recent past data.

Nonparametric approach. This approach uses historical data directly, without

imposing a specific set of distributional assumptions. Historical simulation is the

simplest and most prominent representative of this class of models.


Implied volatility based approach.

This approach uses derivative pricing models and current derivative prices in order to impute

an implied volatility without having to resort to historical data. The use of implied volatility

obtained from the Black–Scholes option pricing model as a predictor of future volatility is the

most prominent representative of this class of models.

Jorion’s Value at Risk (VaR) typology

Please note that Jorion’s taxonomy approaches from the perspective of local versus full

valuation. In that approach, local valuation tends to associate with parametric approaches:

Risk Measurement
  Local valuation
    Linear models: full covariance matrix; factor models; diagonal models
    Nonlinear models: gamma; convexity
  Full valuation
    Historical simulation
    Monte Carlo simulation


Value at Risk (VaR)
  Parametric: delta normal
  Non-parametric: historical simulation; bootstrap; Monte Carlo
  Hybrid (semi-parametric): HS + EWMA
  EVT: POT (GPD); block maxima (GEV)


Volatility
  1. Implied volatility
  2. Equally weighted returns or un-weighted (STDEV)
  3. More weight to recent returns: GARCH(1,1), EWMA
  4. MDE (more weight to similar states!)


Historical approaches

An historical-based approach can be non-parametric, parametric or hybrid (both). Non-

parametric directly uses a historical dataset (historical simulation, HS, is the most common).

Parametric imposes a specific distributional assumption (this includes historical standard

deviation and exponential smoothing)

Compare, contrast and calculate parametric and non-parametric

approaches for estimating conditional volatility, including: HISTORICAL

STANDARD DEVIATION

Historical standard deviation is the simplest and most common way to estimate or predict

future volatility. Given a history of an asset’s continuously compounded rate of returns we take

a specific window of the K most recent returns.

This standard deviation is called a moving average (MA) by Jorion. The estimate requires a

window of fixed length; e.g., 30 or 60 trading days. If we observe returns (rt) over M days, the

volatility estimate is constructed from a moving average (MA):

$\sigma_t^2 = \frac{1}{M}\sum_{i=1}^{M} r_{t-i}^2$

Each day, the forecast is updated by adding the most recent day and dropping the furthest day.

In a simple moving average, all weights on past returns are equal and set to (1/M). Note raw

returns are used instead of returns around the mean (i.e., the expected mean is assumed zero).

This is common in short time intervals, where it makes little difference on the volatility estimate.

For example, assume the previous four daily returns for a stock are 6% (n-1), 5% (n-2), 4% (n-3) and 3% (n-4). What is a current volatility estimate, applying the moving average, given that our short trailing window is only four days (m=4)? If we square each return, the series is

0.0036, 0.0025, 0.0016 and 0.0009. If we sum this series of squared returns, we get 0.0086.

Divide by 4 (since m=4) and we get 0.00215. That’s the moving average variance, such that the

moving average volatility is about 4.64%.

The above example illustrates a key weakness of the moving average (MA): since all

returns weigh equally, the trend does not matter. In the example above, notice that

volatility is trending down, but the MA does not reflect this trend in any way. We could

reverse the order of the historical series and the MA estimation would produce the same

result.
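A minimal Python check of the worked moving-average example above:

```python
import numpy as np

returns = np.array([0.06, 0.05, 0.04, 0.03])   # the four most recent daily returns
ma_variance = np.mean(returns ** 2)            # equal weights of 1/M on raw (not demeaned) squared returns
ma_volatility = np.sqrt(ma_variance)

print(round(ma_variance, 5))     # 0.00215
print(round(ma_volatility, 4))   # ~0.0464, i.e., about 4.64%
```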


The moving average (MA) series is simple but has two drawbacks

The MA series ignores the order of the observations. Older observations may no

longer be relevant, but they receive the same weight.

The MA series has a so-called ghosting feature: data points are dropped arbitrarily due

to length of the window.

Compare, contrast and calculate parametric and non-parametric

approaches for estimating conditional volatility, including: GARCH

APPROACH, EXPONENTIAL SMOOTHING (EWMA), and Exponential

smoothing (conditional parametric)

Modern methods place more weight on recent information. Both EWMA and GARCH place

more weight on recent information. Further, as EWMA is a special case of GARCH, both EWMA

and GARCH employ exponential smoothing.

GARCH (p, q) and in particular GARCH (1, 1)

GARCH (p, q) is a general autoregressive conditional heteroskedastic model:

Autoregressive (AR): tomorrow’s variance (or volatility) is a regressed function of today’s variance—it regresses on itself

Conditional (C): tomorrow’s variance depends—is conditional on—the most recent variance. An unconditional variance would not depend on today’s variance

Heteroskedastic (H): variances are not constant, they flux over time

GARCH regresses on “lagged” or historical terms. The lagged terms are either variance or

squared returns. The generic GARCH (p, q) model regresses on (p) squared returns and (q)

variances. Therefore, GARCH (1, 1) “lags” or regresses on last period’s squared return (i.e., just 1

return) and last period’s variance (i.e., just 1 variance).

GARCH (1, 1) is given by the following equation:

$h_t = \alpha_0 + \alpha_1 r_{t-1}^2 + \beta_1 h_{t-1}$

where $h_t$ = the conditional variance (i.e., what we are solving for), $\alpha_0$ = the weighted long-run (average) variance, $r_{t-1}^2$ = the previous squared return, and $h_{t-1}$ = the previous variance.


Persistence is a feature embedded in the GARCH model.

In the above formulas, persistence is (b + c) or, equivalently, (α₁ + β₁). Persistence refers to

how quickly (or slowly) the variance reverts or “decays” toward its long-run average.

High persistence equates to slow decay and slow “regression toward the mean;” low

persistence equates to rapid decay and quick “reversion to the mean.”

A persistence of 1.0 implies no mean reversion. A persistence of less than 1.0 implies “reversion

to the mean,” where a lower persistence implies greater reversion to the mean.

As above, the sum of the weights assigned to the lagged variance and lagged squared

return is persistence (b+c = persistence). A high persistence (greater than zero but less

than one) implies slow reversion to the mean.

But if the weights assigned to the lagged variance and lagged squared return are greater

than one, the model is non-stationary. If (b+c) is greater than 1 (if b+c > 1) the model is

non-stationary and, according to Hull, unstable. In which case, EWMA is preferred.

Linda Allen says about GARCH (1, 1):

GARCH is both “compact” (i.e., relatively simple) and remarkably accurate. GARCH

models predominate in scholarly research. Many variations of the GARCH model have

been attempted, but few have improved on the original.

The drawback of the GARCH model is its nonlinearity.

For example: Solve for long-run variance in GARCH (1,1)

Consider the GARCH (1, 1) equation below:

$\sigma_n^2 = 0.2 + \alpha\,u_{n-1}^2 + \beta\,\sigma_{n-1}^2$

Assume that:
 the alpha parameter = 0.2, and
 the beta parameter = 0.7.

Note that omega is 0.2, but don't mistake omega (0.2) for the long-run variance! Omega is the product of gamma and the long-run variance. So, if alpha + beta = 0.9, then gamma must be 0.1. Given that omega is 0.2, we know that the long-run variance must be 2.0 (0.2 ÷ 0.1 = 2.0).


EWMA

EWMA is a special case of GARCH (1,1). Here is how we get from GARCH (1,1) to EWMA:

GARCH(1,1): $\sigma_{t+1}^2 = a + b\,r_t^2 + c\,\sigma_t^2$

Then we let a = 0 and (b + c) = 1, such that the above equation simplifies to:

GARCH(1,1): $\sigma_{t+1}^2 = b\,r_t^2 + (1-b)\,\sigma_t^2$

This is now equivalent to the formula for the exponentially weighted moving average (EWMA):

EWMA: $\sigma_{t+1}^2 = b\,r_t^2 + (1-b)\,\sigma_t^2 = \lambda\,\sigma_t^2 + (1-\lambda)\,r_t^2$

In EWMA, the lambda parameter now determines the “decay:” a lambda that is close to one

(high lambda) exhibits slow decay.

RiskMetricsTM Approach

RiskMetrics is a branded form of the exponentially weighted moving average (EWMA) approach.

The optimal (theoretical) lambda varies by asset class, but the overall optimal parameter used

by RiskMetrics has been 0.94. In practice, RiskMetrics only uses one decay factor for all series:

0.94 for daily data; 0.97 for monthly data (month defined as 25 trading days)

Technically, the daily and monthly models are inconsistent. However, they are both easy to use,

they approximate the behavior of actual data quite well, and they are robust to misspecification.

Each of GARCH (1, 1), EWMA and RiskMetrics are each parametric and recursive.

Advantages and Disadvantages of MA (i.e., STDEV) vs. GARCH

GARCH estimations can provide estimates that are more accurate than MA:

Jorion's Moving average (MA) = Allen's STDEV       GARCH
Ghosting feature                                   More recent data assigned greater weights
Trend information is not incorporated              A term is added to incorporate mean reversion

Except Linda Allen warns: GARCH (1,1) needs more parameters and may pose greater

MODEL RISK (“chases a moving target”) when forecasting out-of-sample


Graphical summary of the parametric methods that assign more weight to recent returns (GARCH & EWMA)


Summary Tips:

GARCH (1, 1) is generalized RiskMetrics; and, conversely, RiskMetrics is restricted case of

GARCH (1,1) where a = 0 and (b + c) =1. GARCH (1, 1) is given by:

$\sigma_n^2 = \gamma V_L + \alpha\,u_{n-1}^2 + \beta\,\sigma_{n-1}^2$

The three parameters are weights and therefore must sum to one:

$\gamma + \alpha + \beta = 1$

Be careful about the first term in the GARCH (1, 1) equation: omega (ω) = gamma (γ) × (average long-run variance). If you are asked for the long-run variance, you may need to divide omega by the weight (gamma) in order to compute it.

Determine when and whether a GARCH or EWMA model should be used in volatility

estimation

In practice, variance rates tend to be mean reverting; therefore, the GARCH (1, 1) model is

theoretically superior (“more appealing than”) to the EWMA model. Remember, that’s the big

difference: GARCH adds the parameter that weights the long-run average and therefore it

incorporates mean reversion.

GARCH (1, 1) is preferred unless the first parameter is negative (which is implied if alpha

+ beta > 1). In this case, GARCH (1,1) is unstable and EWMA is preferred.

Explain how the GARCH estimations can provide forecasts that are more accurate.

The moving average computes variance based on a trailing window of observations; e.g., the

previous ten days, the previous 100 days.

There are two problems with moving average (MA):

Ghosting feature: volatility shocks (sudden increases) are abruptly incorporated into the

MA metric and then, when the trailing window passes, they are abruptly dropped from

the calculation. Due to this the MA metric will shift in relation to the chosen window

length

Trend information is not incorporated

GARCH estimates improve on these weaknesses in two ways:

More recent observations are assigned greater weights. This overcomes ghosting

because a volatility shock will immediately impact the estimate but its influence will fade

gradually as time passes

A term is added to incorporate reversion to the mean


Explain how persistence is related to the reversion to the mean.

Given the GARCH (1, 1) equation $h_t = \alpha_0 + \alpha_1 r_{t-1}^2 + \beta_1 h_{t-1}$:

$\text{Persistence} = \alpha_1 + \beta_1$

GARCH (1, 1) is unstable if the persistence > 1. A persistence of 1.0 indicates no mean reversion.

A low persistence (e.g., 0.6) indicates rapid decay and high reversion to the mean.

GARCH (1, 1) has three weights assigned to three factors. Persistence is the sum of the

weights assigned to both the lagged variance and lagged squared return. The other

weight is assigned to the long-run variance.

If P = persistence and G = weight assigned to long-run variance, then P+G = 1.

Therefore, if P (persistence) is high, then G (mean reversion) is low: the persistent series

is not strongly mean reverting; it exhibits “slow decay” toward the mean.

If P is low, then G must be high: the impersistent series does strongly mean revert; it

exhibits “rapid decay” toward the mean.

The average, unconditional variance in the GARCH (1, 1) model is given by:

$V_L = \frac{\alpha_0}{1 - \alpha_1 - \beta_1}$


Compare, contrast and calculate parametric and non-parametric

approaches for estimating conditional volatility, including: HISTORIC

SIMULATION

Historical simulation is easy: we only need to determine the “lookback window.” The problem is

that, for small samples, the extreme percentiles (e.g., the worst one percent) are less precise.

Historical simulation effectively throws out useful information.

“The most prominent and easiest to implement methodology within the class of

nonparametric methods is historical simulation (HS). HS uses the data directly. The only

thing we need to determine up front is the lookback window. Once the window length is

determined, we order returns in descending order, and go directly to the tail of this

ordered vector. For an estimation window of 100 observations, for example, the fifth

lowest return in a rolling window of the most recent 100 returns is the fifth percentile.

The lowest observation is the first percentile. If we wanted, instead, to use a 250

observations window, the fifth percentile would be somewhere between the 12th and the

13th lowest observations (a detailed discussion follows), and the first percentile would be

somewhere between the second and third lowest returns.” –Linda Allen

Compare and contrast the use of historic simulation, multivariate density

estimation, and hybrid methods for volatility forecasting.

Nonparametric Volatility Forecasting

Historic Simulation (HS): sort returns; look up the worst; if n = 100, for the 95th percentile look between the bottom 5th & 6th returns.

MDE: like ARCH(m), but the weights are based on a function of the [current vs. historical] state; if the state at (n-50) resembles today's state, that squared return gets a heavy weight.

Hybrid (HS & EWMA): sort returns (like HS), but weight them, with greater weight to recent returns (like EWMA).


Approach                          Advantages                                                    Disadvantages
Historical simulation             Easiest to implement (simple, convenient)                     Uses data inefficiently (much data is not used)
Multivariate density estimation   Very flexible: weights are a function of the state (e.g.,     Onerous model: weighting scheme, conditioning variables,
                                  economic context such as interest rates), not constant        number of observations; data intensive
Hybrid approach                   Unlike the HS approach, better incorporates more              Requires model assumptions; e.g., number of observations
                                  recent information

Compare, contrast and calculate parametric and non-parametric

approaches for estimating conditional volatility, including: MULTIVARIATE

DENSITY ESTIMATION

Multivariate Density Estimation (MDE)

The key feature of multivariate density estimation is that the weights (assigned to historical

square returns) are not a constant function of time. Rather, the current state—as

parameterized by a state vector—is compared to the historical state: the more similar the states

(current versus historical period), the greater the assigned weight. The relative weighting is

determined by the kernel function:

$\sigma_t^2 = \sum_{i=1}^{K}\omega(x_{t-i})\,u_{t-i}^2$

Instead of weighting the squared returns by time, MDE weights them by proximity to the current state: ω(·) is the kernel function and $x_{t-i}$ is the vector describing the economic state at time t-i.


Compare EWMA to MDE:

Both assign weights to historical squared returns (squared returns = variance

approximation);

Where EWMA assigns the weight as an exponentially declining function of time (i.e., the

nearer to today, the greater the weight), MDE assigns the weight based on the nature of

the historical period (i.e., the more similar to the historical state, the greater the weight)

Compare, contrast and calculate parametric and non-parametric

approaches for estimating conditional volatility, including: HYBRID

METHODS

The hybrid approach is a variation on historical simulation (HS). Consider the ten (10)

illustrative returns below. In simple HS, the returns are sorted from best-to-worst (or worst-to-

best) and the quantile determines the VaR. Simple HS amounts to giving equal weight to each

returns (last column). Given 10 returns, the worst return (-31.8%) earns a 10% weight under

simple HS.

However, under the hybrid approach, the EWMA weighting scheme is instead applied. Since the

worst return happened seven (7) periods ago, the weight applied is given by the following,

assuming a lambda of 0.9 (90%):

Weight (7 periods prior) = 90%^(7-1)*(1-90%)/(1-90%^10) = 8.16%

Sorted Return   Periods Ago   Hybrid Weight   Cum'l Hybrid Weight   Compare to HS
-31.8%          7             8.16%           8.16%                 10%
-28.8%          9             6.61%           14.77%                20%
-25.5%          6             9.07%           23.83%                30%
-22.3%          10            5.95%           29.78%                40%
5.7%            1             15.35%          45.14%                50%
6.1%            2             13.82%          58.95%                60%
6.5%            3             12.44%          71.39%                70%
6.9%            4             11.19%          82.58%                80%
12.1%           5             10.07%          92.66%                90%
60.6%           8             7.34%           100.00%               100%


Note that because the return happened further in the past, the weight is below the 10% that is

assigned under simple HS.
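A minimal Python sketch that reproduces the hybrid weights in the table above (lambda = 0.90, n = 10 observations):

```python
lam, n = 0.90, 10

def hybrid_weight(periods_ago, lam, n):
    """EWMA-style weight for an observation k periods ago, normalized over n observations."""
    return lam ** (periods_ago - 1) * (1 - lam) / (1 - lam ** n)

print(round(hybrid_weight(7, lam, n), 4))    # worst return, 7 periods ago  -> ~0.0816 (8.16%)
print(round(hybrid_weight(9, lam, n), 4))    # next worst, 9 periods ago    -> ~0.0661 (6.61%)
print(round(sum(hybrid_weight(k, lam, n) for k in range(1, n + 1)), 4))   # the n weights sum to 1.0
```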

Hybrid methods using Google stock’s prices and returns:

Google (GOOG)
Date        Close    Return     Rank   Sorted    Days Ago   Cum'l HS   Hybrid Weight   Cum'l Hybrid
6/24/2009   409.29    0.89%     1      -5.90%    76         1.0%       0.2%            0.2%
6/23/2009   405.68   -0.41%     2      -5.50%    94         2.0%       0.1%            0.3%
6/22/2009   407.35   -3.08%     3      -4.85%    86         3.0%       0.1%            0.4%
6/19/2009   420.09    1.45%     4      -4.29%    90         4.0%       0.1%            0.5%
6/18/2009   414.06   -0.27%     5      -4.25%    78         5.0%       0.2%            0.7%
6/17/2009   415.16   -0.20%     6      -3.35%    47         6.0%       0.6%            1.3%
6/16/2009   416      -0.18%     7      -3.26%    81         7.0%       0.2%            1.4%
6/15/2009   416.77   -1.92%     8      -3.08%    3          8.0%       3.7%            5.1%
6/12/2009   424.84   -0.97%     9      -3.01%    88         9.0%       0.1%            5.2%
6/11/2009   429      -0.84%     10     -2.64%    55         10.0%      0.4%            5.7%

In this case:

Sample includes 100 returns (n=100)

We are solving for the 95th percentile (95%) value at risk (VaR)

For the hybrid approach, lambda = 0.96

Sorted returns are shown in the purple column

The HS 95% VaR = ~ 4.25% because it is the fifth-worst return (actually, the quantile can

be determined in more than one way)

However, the hybrid approach returns a 95% VaR of 3.08% because the “worst returns”

that inform the dataset tend to be further in the past (i.e., days ago = 76, 94, 86, 90…).

Due to this, the individual weights are generally less than 1%.

[Figure: cumulative hybrid weights versus cumulative HS weights across the ten sorted returns.]


Explain the process of return aggregation in the context of volatility

forecasting methods.

The question is: how do we compute VAR for a portfolio which consists of several positions.

The first approach is the variance-covariance approach: if we make (parametric) assumptions about the covariances among the positions, then we can extend the parametric approach to the entire portfolio. The problem with this approach is that correlations tend to increase (or change) during stressful market events, so portfolio VaR may underestimate the true risk in exactly those circumstances.
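As a minimal sketch of this first (variance-covariance) approach, the snippet below computes delta-normal portfolio VaR from assumed position weights and an assumed covariance matrix; the weights, volatilities, correlations, confidence level, and portfolio value are hypothetical inputs, not figures from the reading.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical inputs (illustration only)
    w     = np.array([0.5, 0.3, 0.2])              # today's portfolio weights
    vols  = np.array([0.02, 0.015, 0.03])          # daily volatilities of the positions
    corr  = np.array([[1.0, 0.4, 0.2],
                      [0.4, 1.0, 0.3],
                      [0.2, 0.3, 1.0]])            # assumed correlation matrix
    cov   = np.outer(vols, vols) * corr            # covariance matrix
    value = 1_000_000                              # portfolio value

    port_vol = np.sqrt(w @ cov @ w)                # daily portfolio volatility
    var_95   = norm.ppf(0.95) * port_vol * value   # 95% one-day delta-normal VaR
    print(f"95% one-day VaR: {var_95:,.0f}")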

The second approach is to extend the historical simulation (HS) approach to the portfolio: apply today's portfolio weights to the historical return series. In other words, "what would have happened if we had held this portfolio in the past?"

The third approach is to combine these two approaches: aggregate the simulated returns and

then apply a parametric (normal) distributional assumption to the aggregated portfolio.

The first approach (variance-covariance) requires the dubious assumption of normality—for the

positions “inside” the portfolio. The text says the third approach is gaining in popularity and is

justified by the law of large numbers: even if the components (positions) in the portfolio are not

normally distributed, the aggregated portfolio will converge toward normality.
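The second and third approaches can be sketched in a few lines of Python. This is an illustration under assumed inputs (a hypothetical matrix of historical asset returns and today's weights), not code from the reading:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(42)
    hist_returns = rng.normal(0.0, 0.02, size=(250, 3))   # hypothetical 250 days x 3 assets
    w = np.array([0.5, 0.3, 0.2])                          # today's portfolio weights

    # Aggregate: what the portfolio would have returned each past day with today's weights
    port_returns = hist_returns @ w

    # Approach 2 (historical simulation): VaR is the empirical 5th percentile of the aggregated returns
    hs_var = -np.percentile(port_returns, 5)

    # Approach 3 (aggregation): fit a normal to the aggregated portfolio returns, then apply the parametric quantile
    norm_var = -(port_returns.mean() + norm.ppf(0.05) * port_returns.std(ddof=1))

    print(f"HS 95% VaR: {hs_var:.2%}   Normal-on-aggregate 95% VaR: {norm_var:.2%}")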


Explain how implied volatility can be used to predict future volatility

To impute volatility is to derive it (to reverse-engineer it, really) from the observed market price of a traded derivative on the asset. A typical example uses the Black-Scholes option pricing model to compute the implied volatility of a stock option; i.e., option traders will average the at-the-money implied volatility from traded puts and calls.

The advantages of implied volatility are:

Truly predictive (reflects market’s forward-looking consensus)

Does not require, and is not constrained by, historical distribution patterns

The shortcomings (or disadvantages) of implied volatility include:

Model-dependent

Options on the same underlying asset may trade at different implied volatilities; e.g., volatility smile/smirk

Stochastic volatility; i.e., the model assumes constant volatility, but volatility tends to change over time

Limited availability because it requires traded (set by market) price

Explain how to use option prices to derive forecasts of volatilities

This requires that a market mechanism (e.g., an exchange) can provide a market price for the

option. If a market price can be observed, then instead of solving for the price of an option, we

use an option pricing model (OPM) to reveal the implied (implicit) volatility. We solve (“goal

seek”) for the volatility that produces a model price equal to the market price:

c_market = f(ISD)

Where the implied standard deviation (ISD) is the volatility input into an option pricing model

(OPM). Similarly, implied correlations can also be “recovered” (reverse-engineered) from

options on multiple assets. According to Jorion, ISD is a superior approach to volatility

estimation. He says, “Whenever possible, VAR should use implied parameters” [i.e., ISD or

market implied volatility].
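A minimal sketch of this "goal seek" in Python follows. It assumes a plain-vanilla European call priced with Black-Scholes and uses a root-finder to back out the implied volatility; the inputs (spot, strike, rate, maturity, market price) are hypothetical.

    import math
    from scipy.optimize import brentq
    from scipy.stats import norm

    def bs_call(S, K, r, T, sigma):
        """Black-Scholes price of a European call (no dividends)."""
        d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
        d2 = d1 - sigma * math.sqrt(T)
        return S * norm.cdf(d1) - K * math.exp(-r * T) * norm.cdf(d2)

    def implied_vol(c_market, S, K, r, T):
        """Solve for the ISD: the sigma at which the model price equals the market price."""
        return brentq(lambda sigma: bs_call(S, K, r, T, sigma) - c_market, 1e-6, 5.0)

    # Hypothetical at-the-money call: spot 100, strike 100, 4% rate, 3 months, market price 4.50
    print(f"Implied volatility: {implied_vol(4.50, 100, 100, 0.04, 0.25):.2%}")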


Discuss implied volatility as a predictor of future volatility and its

shortcomings.

Many risk managers describe the application of historical volatility as similar to “driving by

looking in the rear-view mirror.” Another flaw is the assumption of stationarity; i.e., the

assumption that the past is indicative of the future.

Implied volatility, “an intriguing alternative,” can be imputed from derivative prices using a

specific derivative pricing model. The simplest example is the Black–Scholes implied volatility

imputed from equity option prices.

In the presence of multiple implied volatilities for various option maturities and exercise prices, it is common to take the at-the-money (ATM) implied volatility from puts and calls and average them into a single implied volatility; this implied is derived from the most liquid (ATM) options.

The advantage of implied volatility is that it is a forward-looking, predictive measure.

“A particularly strong example of the advantage obtained by using implied volatility (in

contrast to historical volatility) as a predictor of future volatility is the GBP currency

crisis of 1992. During the summer of 1992, the GBP came under pressure as a result of the

expectation that it should be devalued relative to the European Currency Unit (ECU)

components, the deutschmark (DM) in particular (at the time the strongest currency

within the ECU). During the weeks preceding the final drama of the GBP devaluation,

many signals were present in the public domain … This was the case many times prior to

this event, especially with the Italian lira’s many devaluations. Therefore, the market

was prepared for a crisis in the GBP during the summer of 1992. Observing the thick

solid line depicting option-implied volatility, the growing pressure on the GBP

manifests itself in options prices and volatilities. Historical volatility is trailing,

“unaware” of the pressure. In this case, the situation is particularly problematic since

historical volatility happens to decline as implied volatility rises. The fall in historical

volatility is due to the fact that movements close to the intervention band are bound to

be smaller by the fact of the intervention bands’ existence and the nature of

intervention, thereby dampening the historical measure of volatility just at the time that

a more predictive measure shows increases in volatility.” – Linda Allen

Is implied volatility a superior predictor of future volatility?

“It would seem as if the answer must be affirmative, since implied volatility can react

immediately to market conditions. As a predictor of future volatility this is certainly an

important feature.”


Why does implied volatility tend to be greater than historical volatility?

According to Linda Allen, “empirical results indicate, strongly and consistently, that implied

volatility is, on average, greater than realized volatility.” There are two common explanations.

Market inefficiency due to supply and demand forces.

Rational markets: implied volatility is greater than realized volatility due to stochastic

volatility. “Consider the following facts: (i) volatility is stochastic; (ii) volatility is a priced

source of risk; and (iii) the underlying model (e.g., the Black–Scholes model) is, hence,

misspecified, assuming constant volatility. The result is that the premium required by the

market for stochastic volatility will manifest itself in the forms we saw above – implied

volatility would be, on average, greater than realized volatility.”

But implied volatility has shortcomings.

Implied volatility is model-dependent. A mis-specified model can result in an

erroneous forecast.

“Consider the Black–Scholes option-pricing model. This model hinges on a few

assumptions, one of which is that the underlying asset follows a continuous time

lognormal diffusion process. The underlying assumption is that the volatility parameter is

constant from the present time to the maturity of the contract. The implied volatility is

supposedly this parameter. In reality, volatility is not constant over the life of the options

contract. Implied volatility varies through time. Oddly, traders trade options in “vol”

terms, the volatility of the underlying, fully aware that (i) this vol is implied from a

constant volatility model, and (ii) that this very same option will trade tomorrow at a

different vol, which will also be assumed to be constant over the remaining life of the

contract.” –Linda Allen

At any given point in time, options on the same underlying may trade at different

vols. An example is the [volatility] smile effect – deep out of the money (especially) and

deep in the money (to a lesser extent) options trade at a higher volatility than at the

money options.

Explain long horizon volatility/VaR and the process of mean reversion

according to an AR(1) model.

Explain the implications of mean reversion in returns and return volatility

The key idea refers to the application of the square root rule (the S.R.R. says that variance scales linearly with time, so that volatility scales with the square root of time). The square root rule, while mathematically convenient, doesn't really work in practice because it requires that (normally distributed) returns are independent and identically distributed (i.i.d.).


What I mean is: we use it on the exam, but in practice, when applying the square root rule to scale delta-normal VaR/volatility, we should be sensitive to the likely error introduced.

Allen gives two scenarios that each illustrate “violations” in the use of the square root rule to

scale volatility over time:

If mean reversion ...       Then the square root rule ...
In returns                  Overstates the long-run volatility
In return volatility        Overstates if current volatility > long-run volatility;
                            understates if current volatility < long-run volatility

For FRM purposes, three definitions of mean reversion are used:

Mean reversion in the asset dynamics. The price/return tends towards a long-run

level; e.g., interest rate reverts to 5%, equity log return reverts to +8%

Mean reversion in variance. Variance reverts toward a long-run level; e.g., volatility

reverts to a long-run average of 20%. We can also refer to this as negative

autocorrelation, but it's a little trickier. Negative autocorrelation refers to the fact that a

high variance is likely to be followed in time by a low variance. The reason it's tricky is

due to short/long timeframes: the current volatility may be high relative to the long run

mean, but it may be "sticky" or cluster in the short-term (positive autocorrelation) yet, in

the longer term it may revert to the long run mean. So, there can be a mix of (short-term)

positive and negative autocorrelation on the way being pulled toward the long run mean.

Autoregression in the time series. The current estimate (variance) is informed by (a

function of) the previous value; e.g., in GARCH(1,1) and exponentially weighted moving

average (EWMA), the variance is a function of the previous variance.
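To illustrate the third definition (autoregression in the variance series), here is a minimal sketch of the EWMA and GARCH(1,1) one-step updates; the parameter values (lambda, omega, alpha, beta, and the previous return and variance) are illustrative assumptions, not values from the reading. GARCH(1,1) mean-reverts toward its long-run variance omega / (1 - alpha - beta), whereas EWMA has no long-run level.

    # Illustrative parameters (assumptions, not from the reading)
    lam = 0.94                                   # EWMA decay factor
    omega, alpha, beta = 0.000002, 0.08, 0.90    # GARCH(1,1) parameters

    prev_var    = 0.01 ** 2                      # yesterday's variance estimate (1% daily vol)
    prev_return = 0.02                           # yesterday's return (2%)

    # Both estimators express today's variance as a function of yesterday's variance
    ewma_var  = lam * prev_var + (1 - lam) * prev_return ** 2
    garch_var = omega + alpha * prev_return ** 2 + beta * prev_var

    long_run_var = omega / (1 - alpha - beta)    # GARCH long-run (unconditional) variance

    print(f"EWMA vol: {ewma_var ** 0.5:.2%}  GARCH vol: {garch_var ** 0.5:.2%}  "
          f"GARCH long-run vol: {long_run_var ** 0.5:.2%}")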


Square root rule

The simplest approach to extending the horizon is to use the “square root rule”

VaR(r[t, t+J]) = √J × VaR(r[t, t+1]); i.e., J-period VAR = √J × 1-period VAR

For example, if the 1-period VAR is $10, then the 2-period VAR is $14.14 ($10 x square root of 2)

and the 5-period VAR is $22.36 ($10 x square root of 5).

The square-root-rule: under the two assumptions below, VaR scales with the square root

of time. Extend one-period VaR to J-period VAR by multiplying by the square root of J.

The square root rule (i.e., variance is linear with time) only applies under restrictive i.i.d. assumptions.

The square-root rule for extending the time horizon requires two key assumptions:

Random-walk (acceptable)

Constant volatility (unlikely)
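As a rough illustration of why mean reversion in returns causes the square root rule to overstate long-horizon volatility, the simulation sketch below assumes an AR(1) return process with a negative autoregressive coefficient (one simple way mean reversion in returns can manifest); the coefficient, shock volatility, and horizon are arbitrary illustrative choices, not values from the reading.

    import numpy as np

    rng = np.random.default_rng(0)
    b, sigma_eps, n = -0.3, 0.01, 200_000        # AR(1) coefficient, shock vol, sample size

    # Simulate AR(1) returns: r_t = b * r_{t-1} + e_t
    r = np.zeros(n)
    eps = rng.normal(0.0, sigma_eps, n)
    for t in range(1, n):
        r[t] = b * r[t - 1] + eps[t]

    J = 10                                       # horizon in periods
    one_period_vol = r.std()
    srr_vol        = one_period_vol * np.sqrt(J)                       # square-root-rule scaling
    actual_vol     = r[: n - n % J].reshape(-1, J).sum(axis=1).std()   # vol of non-overlapping J-period returns

    print(f"SRR 10-period vol: {srr_vol:.2%}   Actual 10-period vol: {actual_vol:.2%}")
    # With b < 0 (mean-reverting returns), the actual multi-period volatility is lower,
    # so the square root rule overstates it, consistent with the table above.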