Statistics Introduction Slides

Transcript


STAT101

Statistics can be divided into two parts:
* Descriptive Statistics
* Inferential Statistics

We start with Descriptive Statistics.

A population is a set of objects/individuals/things whose numerical characteristic(s) we are interested in.

For example, we may be interested in the heights of all residents of the US, or the number of pages in all the books in the US Library of Congress.

One way to describe a population is to list all the numerical values; for example, we can list the heights of all the residents of the US, a database that will have almost 300 million entries. Alternatively, we can list some characteristics of the population, like the mean and standard deviation.

This will just be 2 numbers instead of 300 million, but will give us an idea of what the population looks like

The mean of a population is the average of all the individual values in it and is represented by the Greek letter μ (mu):

μ = (1/N) * Σ(i=1 to N) xi

The above notation is a formal way of writing that μ equals the sum of the xi as i varies from 1 to N, divided by N.

xi is the value of an individual data point within the population, and N is the total number of data points in the population. Specifically, note that the symbol Σ is used to represent the sum of whatever follows it, with i (the parameter of summation) varying from the amount appearing at the lower-right of Σ, i.e. 1, to the amount appearing at the top-right of Σ, i.e. N.

So we have μ to simply be the sum of the values of all data points in the population divided by the population size. σ (the standard deviation) is a measure of how variable the values are within the population. To measure σ we find the deviation of an individual data point from the mean μ, then square it, and add these squared deviations for all data points.

Why do we square? Because if we simply add without squaring we will get a big grand ZERO as a result. Finally we once again divide by N, and the result is the variance σ², the square root of which is the standard deviation σ.

And the standard deviation is given by:

σ = [ (1/N) * Σ(i=1 to N) (xi − μ)² ]^(1/2)

There is actually more to the standard deviation than just avoiding getting zero by summing

There are other ways of avoiding the zero, for example by taking absolute values or the fourth power. However σ is special; for example, it is a parameter of the normal distribution, which is probably the most important probability distribution. We will get to that later.

How are means and standard deviations useful? For example, we may find that the mean height of adult men in the US is 5'9" and the standard deviation 4 inches, whereas the mean height of adult Dutch men is 5'10" and the standard deviation 2 inches. This information enables us to form an idea about the heights of the US male and Dutch male populations.

Alternatively we could actually list the heights of all adult males, but then we would have to deal with many million values instead of just 2 for each country.

Think of a bowl with many Red Balls and Blue Balls, say a total of 1,000,000 (only a few of which are shown in the picture).

To apply numerical methods, we assign numerical values of 0 to Reds and 1 to Blues.

That implies, for example, that if the population has 30% (300,000) Reds and 70% (700,000) Blues, the population mean will be 0.70. Of course we could have assigned other values, for example 5 to Reds and -1 to Blues; that would just be an origin shift and scaling.

For the remainder of this course assume that the pot actually has 52% Reds and 48% Blues. If we actually counted all the balls in the pot, we would find 480,000 Blues and 520,000 Reds.

Therefore:

μ = (520K*0 + 480K*1) / 1,000,000 = 0.48

σ² = (520K*(0 − 0.48)² + 480K*(1 − 0.48)²) / 1,000K = 0.2496

=> σ = 0.4996

Distributions:
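As a quick check, these numbers can be reproduced with a short Python sketch (the 0/1 coding and the 520K/480K counts are the ones used above):

import math

# Population: 520,000 Reds coded as 0 and 480,000 Blues coded as 1
n_red, n_blue = 520_000, 480_000
N = n_red + n_blue

mu = (n_red * 0 + n_blue * 1) / N                            # population mean, 0.48
var = (n_red * (0 - mu) ** 2 + n_blue * (1 - mu) ** 2) / N   # population variance, 0.2496
sigma = math.sqrt(var)                                       # standard deviation, ~0.4996

print(mu, var, sigma)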

The population of Red and Blue Balls is an example of a binomial (discrete) distribution; a data point can take only one of 2 values.

There are many variables that can take an infinite number of values, for example height. These variables have a continuous distribution. In such distributions a particular value has a very, very small probability (tending to zero), for example a height of exactly 5'10". You will be hard put to find anyone in a population who is absolutely exactly 5'10". Even those who are very, very close will probably differ by at least a billionth of an inch.

In that case it becomes meaningful to talk about the number of individuals for a range of outcomes (say heights with values between 5'9" and 5'10") rather than individual outcomes (like 5'9" exactly or 5'10" exactly). Distributions that give density (amounts for ranges) are continuous distributions. The most useful and hence famous example of a continuous distribution is the Normal distribution.

We see above that the Normal Distribution is symmetric around the mean. For example, the probability density 2 standard deviations above the mean is the same as that 2 standard deviations below the mean.

What in the world is normally distributed?

Well, heights aren't! Suppose an apple orchard produces a million apples and puts them randomly into bags of a thousand apples each. Then the weights of the bags would be approximately normal.

Values for the normal distribution are usually given in the form of cumulative tables. These tables are given in terms of the z variable, which is:

Z = (Variable Value − Mean) / Std Dev

For example, a variable that is one standard deviation above the mean will have a value for Z = 1. Normally distributed data (a variable) can be origin shifted and scaled to a standard distribution called the Standardized Normal Distribution, or Z-distribution. The origin shifting is done by subtracting the mean, and the scaling is done by dividing the variable by its standard deviation. Once this is done, the data has the same distribution as all other standardized normal variables. The mean of the Z variable is 0 and the standard deviation 1, and tables are available for cumulative density values.

For example, from the above we see that the area (probability) of the random variable having a value from minus infinity to 1 standard deviation above the mean (Z = 1) is:

0.8413 = 84.13%. As the probability from minus infinity to the mean is 50%, due to symmetry around the mean this also implies that the probability of the random variable lying between the mean and the mean plus one standard deviation is:

84.13% - 50% = 34.13%
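These table values are easy to reproduce with scipy's standard normal CDF (a sketch; scipy is not part of the original slides):

from scipy.stats import norm

p_upto_z1 = norm.cdf(1.0)                # area from minus infinity to Z = 1, ~0.8413
p_mean_to_z1 = p_upto_z1 - norm.cdf(0)   # area between the mean and Z = 1, ~0.3413

print(p_upto_z1, p_mean_to_z1)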

Example: Suppose you are told that the consumption of potatoes by the Irish is normally distributed with a mean of 30 lbs a year, and a standard deviation of 5 lbs. The population of Ireland is 4 million. Calculate the number who consume between 27 lbs and 31 lbs. The deviation of 27 lbs from the mean of 30 lbs = 3 lbs below the mean = 3/5 = 0.6 std dev.

Therefore the probability mass from 27 lbs to 30 lbs = (0.7257 − 0.5000) = 0.2257

Similarly, from 30 lbs to 31 lbs, the deviation = 1/5 = 0.20 std dev and the probability mass = (0.5793 − 0.5000) = 0.0793

So total probability = 0.2257 + 0.0793 = 0.3050
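The same numbers can be reproduced with a short sketch (the 4 million population figure comes from the example statement):

from scipy.stats import norm

mean, sd, population = 30.0, 5.0, 4_000_000

# P(27 lbs < consumption < 31 lbs) for a normal(30, 5) variable
p = norm.cdf(31, loc=mean, scale=sd) - norm.cdf(27, loc=mean, scale=sd)   # ~0.3050

print(p, p * population)   # ~0.305 and ~1.22 million people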

So the number of Irish consuming between 27 lbs and 31 lbs = 0.3050 * 4M = 1.22M.

Inferential Statistics

Think again of the bowl with many Red Balls and Blue Balls.

Now instead of counting the number of Reds and Blues we will Sample and make inferences about the totals

A random variable is one whose value is not known to us. In the future the value will be known to us (if observable). For example, the value of a random draw from the population in the jar is unknown to us, it is a random variable, but once we have made the pick it will be either 0 or 1. We may know the probability of getting a particular value of a random variable, for example in a jar containing 52% Reds and 48% Blues, the probability of getting 0 is 52% and 1 is 48%. This is the probability distribution for this random variable.

You want to estimate what fraction of the total is Red and what fraction is Blue. You don't have time to count them all, so you just dip your hand in and select a few randomly (a Random Sample). From your sample you make an Estimate of the fractions of Blues and Reds. The branch of mathematics that deals with such issues is Inferential Statistics.

When you take a Random Sample from the jar, each ball has an equal Probability of being picked. Suppose you pick up one ball every time you dip your hand into the jar. If there are a total of N balls in the jar, then each ball has a probability of 1/N of being randomly picked on every dip of your hand. The total probability that some ball will be picked is the sum of the probabilities for all balls: 1/N + 1/N + … + 1/N (for the N balls) = 1. So the probability that one of the balls is picked in one dip of the hand is 1. A probability of 1 is a Certain event. The reason we can add the probabilities of 1/N is that the events (of a particular ball being picked) are Mutually Exclusive. That is, if one ball is picked then another can't be picked.

When you pick a ball, the result is either Red or Blue. So your data takes one of two values, which is the simplest possible result. Your data is said to be Binomial, and the Probability Distribution of the data is also Binomial. Your data has a Probability Distribution because every time you dip your hand into the jar, you cannot predict Ex-ante (prior to the dip) what data you will get, but you can say what the probability will be. For example, if the fraction of Reds equals 52%, then you can say the probability of picking a Red (randomly) is 52% and a Blue is 48%. The Probability Distribution then is (Red = 0.52; Blue = 0.48).

Most of the time your data will take many values, in which case there will be probabilities associated with not 2, but many possible outcomes. If the outcomes are Continuous (that is, not discrete), the number of possible outcomes becomes infinite, and each individual outcome has a very, very small probability (tending to zero). In that case it becomes meaningful to talk about probabilities for a range of outcomes (say outcomes with values between 2.5 and 3.1) rather than individual outcomes (like 2.5 or 2.9 or 3.1). When a random variable can take an infinite number of values, the probability of one particular value (usually) tends to zero. Then we do not have probabilities associated with particular outcomes, like a height of 5'10", but probability densities given by a (continuous) probability distribution. The probability for a range of outcomes is given by the integral of the density over the range, that is, the area under the probability (density) distribution curve. If the range is small we can approximate the area (probability) as the average of the densities at the endpoints of the range, multiplied by the length of the range.

The probability distributions then become Continuous Probability Distributions, the most famous example of which is the Normal Distribution. Other examples are the Chi-Square distribution and the t-distribution.
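The range-probability idea, and the small-range endpoint approximation mentioned above, can be illustrated for a standard normal variable (a sketch; the 2.5 to 3.1 range is just the example from the text):

from scipy.stats import norm

a, b = 2.5, 3.1

p_exact = norm.cdf(b) - norm.cdf(a)                     # area under the density over [a, b]
p_approx = 0.5 * (norm.pdf(a) + norm.pdf(b)) * (b - a)  # average endpoint density * range length

print(p_exact, p_approx)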

Suppose you have dipped your hand in the pot 80 times and find that you have picked 38 Reds and 42 Blues, that is a frequency of 0.475 for Reds and 0.525 for Blues. What can you conclude?

Now there are different ways in which we can think of the situation and different questions we can answer. Two major ways in which we can do statistical analysis are:

1) Frequentist (or Classical) Statistics

2) Bayesian Statistics

Though there is disagreement between the above two approaches, each is logically valid. They just view the world differently and ask different questions. Classical Statistics is by far the more commonly used, and that is what we will study. I will also discuss Bayes' Rule in an Advanced MBA Lecture.

As is proper for a course of this nature, I will take an applied approach. That is, I will sacrifice rigor while striving to give you the important intuitions.

Remember that the question of estimating the population frequency of Reds and Blues is equivalent to estimating the population mean.

To make an inference from the data we have (Reds = 0.475 and Blues = 0.525, for a mean of 0.525) we first need to understand what kind of samples a binomial population will generate. For example, it is quite likely that our population (52% Reds and 48% Blues) will generate a sample with, say, 40 Reds and 40 Blues, but NOT IMPOSSIBLE that it will generate a sample of 0 Reds and 80 Blues. If we see our sample has 0 Reds and 80 Blues, we cannot therefore conclude that the population does not have 52% Reds and 48% Blues, but we can conclude it is UNLIKELY. Similarly, if we see a sample of 70 Reds and 10 Blues, we would conclude that it is LIKELY that the population has more Reds than Blues. Essentially, to make inferences from samples we first need to know with what probabilities a population produces different samples. Our inferences are in terms of what is likely and what is not; that is, from a sample we cannot make an inference that a population is exactly something, only that it is likely to be something and unlikely to be something else. This likeliness/unlikeliness is expressed in terms of rejection of, or failure to reject, a hypothesis, or in terms of confidence intervals for the estimates of the mean, standard deviation, etc.

Our binomial population generates samples according to a binomial distribution, which involves Combinatorial mathematics (Permutations and Combinations). Instead of going there we will use an Asymptotic result, the Central Limit Theorem (CLT).

CLT is undoubtedly the most useful result in statistics. It says the mean of a large sample from a population (discrete or continuous) is distributed according to a Normal Distribution, with mean equal to the population mean, and standard deviation equal to the population standard deviation divided by the square root of the sample size. CLT applies to populations with all kinds of distributions as long as they don't have some unusual features (like infinite variance). As a practical matter, a sample larger than 30 can be thought of as a large sample.

Now think of repeated samples of 80 from our population. How would the means of these samples be distributed? CLT says that the means would be distributed with a mean of 0.48 and a standard deviation of 0.4996/80^(1/2) = 0.05586. Or, when you do repeated sampling of the population many times, you will see that about 68.26% of the means of the samples lie between 0.48 − 0.05586 = 0.42414 and 0.48 + 0.05586 = 0.53586. I repeat, the result given by CLT is that sample means are normally distributed with mean 0.48 and standard deviation 0.05586.

You may wonder how a discrete distribution in which the data takes values of only 0 or 1 leads to a continuous normal distribution for the means. The answer is that a sample of 80 has a mean that takes one of 81 values, starting from 0 (80 Reds, 0 Blues) and increasing in steps of 0.0125 (for one more Blue) all the way to 1 (0 Reds, 80 Blues). Though the mean is still discrete, the number of possible values it takes is large enough to think of it as approximately continuous.

The next thing to do is to define precisely the inference we wish to test. We call it the Null Hypothesis. This is the hypothesis we test using the sample, and we either Reject it, or Fail to Reject it.
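A small simulation makes the CLT statement concrete (a sketch, assuming the 52% Red / 48% Blue population above and repeated samples of size 80):

import numpy as np

rng = np.random.default_rng(0)
p_blue, n, trials = 0.48, 80, 100_000

# Each trial: draw 80 balls (1 = Blue with probability 0.48) and record the sample mean
sample_means = rng.binomial(n, p_blue, size=trials) / n

print(sample_means.mean())   # ~0.48, the population mean
print(sample_means.std())    # ~0.0559, close to 0.4996 / sqrt(80) = 0.05586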

Let's say the Null Hypothesis (represented by H0) is:

H0: μ = 0.40, i.e., the population is 60% Reds and 40% Blues.

We are almost ready to make our first statistical inference, exciting!

CLT says that the mean of the probability distribution of the sample mean is the population mean μ.

So if we have many samples we can find an estimate of the mean of the means. However we don't have many samples, but just one. So how do we proceed?

This is the joy of statistics, to be able to say something logical about something of which we have only partial knowledge!

We do the following two things:

1) We start by assuming the Null Hypothesis is true. Then we will have the distribution for sample means given by CLT.

μ = 0.40, and σ² = 60% * (0 − 0.40)² + 40% * (1 − 0.40)² = 0.24, or σ = 0.48990. This implies the sample means will be distributed normally with mean 0.40 and std dev of 0.4899/80^(1/2) = 0.05477.

Note: As the actual population has 48% (not 40%) Blues, the actual distribution of sample means will have a mean of 0.48 and std dev of 0.05586. However we do not know what the actual population is (if we did, the game would be over and we could go home), so we proceed assuming the hypothesis that we are testing to be true.

2) Next we look at the sample mean of the sample we have taken, and ask the question: Can we say that it is really UNLIKELY that we would have got this mean from a distribution with a mean of 0.40 and std dev of 0.05477? If the answer is Yes, then we reject the Null Hypothesis. If the answer is No, we Fail to Reject the Null Hypothesis.

Next we need to define what we mean by UNLIKELY. It is defined in terms of how far away the sample mean is from the mean as per the Null Hypothesis. This distance is presented in terms of the percentage of the total area under the normal curve.

We decide, before taking the sample, what the level for rejection is. The choice is ours, but generally two levels are used: 95% and 99%. A level of 95% means that:

IF the sample mean DOES NOT lie in the 95% area surrounding the mean, THEN we infer that it is unlikely that the sample was produced by a distribution with the characteristics of the Null Hypothesis, and therefore REJECT the Null Hypothesis. OTHERWISE we FAIL TO REJECT the Null Hypothesis.

From the above picture we see that the Null Hypothesis would be rejected if the Z for the sample mean is greater than 1.96 or less than -1.96.

Similarly, if we set our standard for rejection at 99%, then we reject the null only if the sample mean does not lie in the 99% area surrounding the mean.

Now the value of the mean we have for the sample is 0.525. We are assuming (not having yet rejected the Null Hypothesis) that this has been produced by a normal distribution with a mean of 0.40 and standard deviation of 0.05477. The number of standard deviations the value of 0.525 is above the mean is:

Z = (0.525 − 0.40) / 0.05477 = 2.28

The above statistic is called the Z statistic. Now if we had chosen our level for rejection to be 95% BEFORE taking the sample, then we will REJECT the Null Hypothesis of μ = 0.40.

However if we had chosen our level for rejection to be 99% BEFORE taking the sample, then we will FAIL TO REJECT the Null Hypothesis of μ = 0.40.
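The whole test fits in a few lines (a sketch of the calculation above; 1.96 and 2.58 are the usual two-tailed cutoffs for the 95% and 99% levels):

import math
from scipy.stats import norm

mu0, n = 0.40, 80                        # Null hypothesis: mean (fraction of Blues) is 0.40
sigma0 = math.sqrt(0.40 * 0.60)          # std dev implied by the Null, ~0.4899
se = sigma0 / math.sqrt(n)               # std dev of the sample mean, ~0.05477

sample_mean = 42 / 80                    # 0.525 (42 Blues in 80 draws)
z = (sample_mean - mu0) / se             # ~2.28

print(z)
print("reject at 95%" if abs(z) > 1.96 else "fail to reject at 95%")
print("reject at 99%" if abs(z) > 2.58 else "fail to reject at 99%")
print(2 * (1 - norm.cdf(abs(z))))        # two-tailed p-value, ~0.0226 (discussed below)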

You should understand the above example of statistical inference well. We will see many different examples of statistical inference in this course, but the underlying logic presented above, which is the Classical (Frequentist) approach, remains the same.

The p-value

With a Z of 2.28, the area under the curve for the right and left tails (the area for which Z < -2.28 or Z > +2.28) equals 1.13% * 2 = 2.26% = 0.0226. This is referred to as the p-value. It is the probability that the population (assuming the Null is true) would produce a sample mean at least as far from the hypothesized mean as 0.525. Often statistical inference is approached not by deciding beforehand the confidence level (say 95% or 99%), but by reporting the p-value and letting those considering the results judge for themselves whether the p-value warrants rejection of the Null Hypothesis.

The smaller the p-value, the smaller the probability that the population under the Null Hypothesis would have produced the sample. Hence the smaller the p-value, the more confidence we can have in rejecting the Null Hypothesis.

Recap of the Classical (Frequentist) Approach to Statistical Testing

1) We start with a population about which we want to test a Null Hypothesis, say that the mean of the population equals a certain value.

2) We describe an appropriate test statistic. Appropriate implies a statistic that we will be able to test.

3) Assuming that the Null Hypothesis is true, we find the distribution of the test statistic. For means, the CLT is a very powerful result that provides us the distribution when the sample size is large (in practical terms > 30). It is the statistician's job to tell us what the test statistic is, how it is distributed, and to provide us tables for critical values. We will take these as given rather than delving deep into these issues.

4) We optionally define the level of confidence (level of significance/rejection). Alternatively we skip this step and report the p-value at step 6.

5) We take a sample of the population and estimate the test statistic.

6) On the distribution of the test statistic implied by the Null, we see where the test statistic lies. If it lies in the regions of rejection we REJECT the Null Hypothesis. Otherwise we FAIL TO REJECT the Null Hypothesis. The regions of rejection are those in which it is UNLIKELY for the test statistic to end up if the Null Hypothesis is true, for example the right and left tails of the standard normal distribution when we were testing means. If the test statistic does not lie in the regions of rejection we conclude the sample estimates are NOT UNLIKELY ENOUGH to justify rejecting the Null Hypothesis.

You should get comfortable with the above logic. This is what statistical testing is; it will appear in many different forms, but the logic remains the same.

Also, various words/phrases are used to describe the same thing: for example, 95% confidence level = 5% level of significance = 5% level of rejection.

One-tailed and Two-tailed tests:

Look back at the picture of the 95% region around the mean. We reject the hypothesis that the population mean equals 0.40 if the Z-value of the sample mean lies in either of the shaded regions on the right or the left. This is an example of a Two-tailed test, that is we reject if the sample mean lies in either tail.

Sometimes we may want to test the hypothesis that the population mean is greater than a particular number, rather than different from a particular number. For example, we may want to test that the population mean is greater than 0.40. Why would we want to do that?

Suppose 40% is the success rate of a drug. We have a population (an existing drug that is already in use) that we know has a population mean of 0.40, and we have a preference for greater means (greater success). Success could imply, for example, that the patient is alive after 5 years. Hence the population is distributed binomially: the patient is either alive or not alive. We have to choose between the known population (existing drug) and the population to be sampled (new drug).

We want to continue using the existing drug unless it is UNLIKELY that the new drug has a smaller success rate, that is, that the population to be sampled has a smaller mean. Then the Null Hypothesis becomes μ < 0.40. In this situation, what is the rejection region for the Null Hypothesis? The tail on the left can no longer be a rejection region. So only the right tail is the rejection region, and now this has to have the entire probability mass of 5%, so the value of Z for rejection falls to 1.64.

Note that in the above test comparing two drugs, the benefit of doubt is being given to the existing drug, the Null Hypothesis is that it is the better drug. The Null Hypothesis gets the benefit of doubt.

We are ready to accept that the new drug is an inferior treatment unless the sample result shows that to be UNLIKELY, as in having a sample mean with Z > 1.64. This corresponds to sample mean > 0.40 + 1.64 * 0.05477, i.e. sample mean > 0.4898 (sample size 80).

It is not enough for the new drug to have a sample mean merely greater than 0.40; for example, 0.42 won't do. The amount by which 0.42 exceeds 0.40 is not sufficient to reject the Null Hypothesis. Statistics says that if we use 95% one-tailed confidence then the sample mean must exceed 0.4898.

We are favoring the Null Hypothesis, in the sense that we are making it difficult to reject. Even if the sample mean is 0.42 we still do not reject the Null that the population mean of the new drug is less than 0.40. Even though we favor the Null Hypothesis, we may still sometimes mistakenly reject it (this can happen with a small probability).

For example suppose the population mean of the new drug is actually 0.38, then there is still a chance (less than 5%) that we will end up with a sample mean greater than 0.4898. We would then have wrongly rejected the Null Hypothesis. An error of this type is called a Type I Error.

We are more likely to make the error of the other sort, that is, not rejecting the Null Hypothesis even though the new drug has a success rate larger than 0.40. For example, if the success rate of the new drug is 0.42, then the population standard deviation will be σ = (0.42 * 0.58)^(1/2) = 0.4936. Then the Z for the sample mean (of a sample of size 80) that equals 0.4898 is:

Z = (0.4898 − 0.42) / (0.4936/80^(1/2)) = 1.265

The probability of the sample mean exceeding 0.4898 can be found by using the above Z and the table for Standard Normal Distribution. It is 10.29%. So there is a 100% - 10.29% = 89.71% probability that with a population mean of 0.42 we will erroneously FAIL TO REJECT the Null Hypothesis that the population mean of the new drug is less than 0.40 (while in reality it is 0.42 > 0.40). This is a Type II Error, the error of not rejecting a false Null.
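A sketch of this Type II error calculation (assuming, as in the example, a true success rate of 0.42, samples of 80, and the one-tailed rejection cutoff of 0.4898):

import math
from scipy.stats import norm

true_p, n, cutoff = 0.42, 80, 0.4898     # reject the Null only if the sample mean exceeds 0.4898

sigma = math.sqrt(true_p * (1 - true_p)) # ~0.4936
se = sigma / math.sqrt(n)

z = (cutoff - true_p) / se               # ~1.265
power = 1 - norm.cdf(z)                  # probability of (correctly) rejecting, ~0.1029
type_ii = norm.cdf(z)                    # probability of failing to reject, ~0.8971

print(z, power, type_ii)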

Another quick example: If the population mean for the new drug is 0.4898, then the probability of making a Type II error with the rejection level at 95% (also called the 5% significance level) is 50%. This follows as there is a 50% chance that the sample mean from a population with mean 0.4898 will turn out to be less than 0.4898.

If a test of the Null has a lesser probability of making a Type II Error, it is said to possess greater Power. The greater the Power of a test, the more faith we can have in our statistical inference. Tests should be designed to maximize Power for a given probability of making a Type I Error. We should be aware of the Power of our tests.

If we use, say, the 99% confidence level (1% significance level) instead of the 95% confidence level, we make it more difficult to reject the Null Hypothesis. This reduces the probability of making a Type I Error, but increases the probability of making a Type II Error. It decreases the Power of the test.

The t-distribution

The example above was of a population which was Binomially distributed. Specifically, such a distribution enabled us to calculate the standard deviation that was implied by the mean of the Null Hypothesis. If the population is not distributed Binomially, then the Null will not be enough to compute the standard deviation, and we will have to form an estimate of the population standard deviation from the sample standard deviation.

When the population standard deviation also has to be estimated, the test statistic becomes the deviation of the sample mean from the population mean, divided by the estimate of the population standard deviation divided by the square root of N. This statistic is called the t-statistic, and it is said to have N−1 degrees of freedom (dof).

Henceforth we will call the sample mean xS and the sample standard deviation sS.

These estimators are random variables. From the sample we get one particular estimate for each of these two random variables. We use these estimates to make inferences about the population.

Fortunately for us, the t-distribution has been extensively studied and tables similar to the Standard Normal table have been prepared. So the only change we need to make from the earlier example is to calculate the sample standard deviation, compute the t-statistic, and use the t tables.

If the size of the sample is large (say greater than 30) we can approximate the population standard deviation by the sample standard deviation, and use the Z-statistic instead of the t-statistic.

t-statistic tables are available on the internet, for example:
http://www.statsoft.com/textbook/sttable.html

Confidence Intervals

Besides testing a Null Hypothesis that a population parameter (for example the mean μ) has a particular value, we may also wish, by sampling, to ESTIMATE the population parameter. For finite (especially small) samples we cannot determine the population parameter for sure, but can only make a statement. The statement that we make is in the form of a CONFIDENCE INTERVAL.

In Classical Statistics a Confidence Interval is an Interval Estimator. The Confidence Interval that we compute based on a sample is one particular realization of the Interval Estimator. A Confidence Interval for a population parameter consists of two endpoints (the interval is between the endpoints) and a confidence level p% (usually 95%). For the population mean, the sample mean is the midpoint of the Confidence Interval.

We consider 4 situations:

1) If the underlying population is distributed normally (which is rare), then sample means are distributed normally (even if the sample size is small). If in addition the population standard deviation σ is known, then the confidence interval is:

xs ± Zα * σ/(N^(1/2))

where α = (100% − p%)/2 and Zα is defined so that Z > Zα has a probability mass of α.

2) If the underlying population has a known standard deviation σ and the sample size is large (say > 80), then once again the confidence interval is:

xs ± Zα * σ/(N^(1/2))

3) If the underlying population has an unknown standard deviation and the sample size is small, then the confidence interval is:

xs ± tα * ss/(N^(1/2))

4) If the underlying population has an unknown standard deviation and the sample size is large (say > 80), then the t distribution can be approximated by the Z distribution and the confidence interval is:

xs ± Zα * ss/(N^(1/2))

Note that the sample standard deviation estimator ss is:

ss = [ Σ(i=1 to N) (xi − xs)² / (N − 1) ]^(1/2)

Why divide by N−1 rather than N? The intuition is that we are not using the actual mean μ but the fitted mean xs, which will lead to the deviations being underestimated; to correct for that we divide by N−1 rather than N.

Online resource for Stats: http://onlinestatbook.com

Example: Suppose we wish to find the Confidence Interval at the 95% level for estimating the mean of a binomial population (with values 0 or 1). The sample mean is 52.5% for a sample of size 80 (38 0s and 42 1s).

First estimate the sample standard deviation ss.

ss² = (38*(0 − 0.525)² + 42*(1 − 0.525)²) / (80 − 1)

=> ss = 0.5025

Our estimate of the standard deviation for the mean is: 0.5025 / (80^(1/2)) = 0.0562

Given the sample size of 80, we can use the Z approximation in place of t. The critical values for the 5% two-tailed test are ±1.96. The Confidence Interval then becomes:

52.5% ± 1.96 * 5.62%

= 52.5% ± 11.02% = (41.48%, 63.52%)

In the newspapers the results of the sampling will be reported as: In our opinion poll, the Blues have 52.5% of the vote and the Reds have 47.5%, with an error of 11.02%. The confidence level is not reported; the default level is 95% for such sampling.
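The interval can be reproduced with a short sketch (using the Z approximation, as in the example):

import math

n, ones = 80, 42
xbar = ones / n                                # sample mean, 0.525

# Sample standard deviation (note the division by n - 1)
ss = math.sqrt(((n - ones) * (0 - xbar) ** 2 + ones * (1 - xbar) ** 2) / (n - 1))  # ~0.5025
se = ss / math.sqrt(n)                         # ~0.0562

print(xbar - 1.96 * se, xbar + 1.96 * se)      # ~ (0.4148, 0.6352)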

What does the Confidence Interval of (41.48%, 63.52%) mean? It means that if the population standard deviation is really the one we estimated (which is a fair approximation given the largish sample size of 80), then if the population mean was 41.48%, then the probability of getting a sample mean of 52.5% or greater would be 2.5%. And if the population mean was 63.52%, then the probability of getting a sample mean of 52.5% or lesser would be 2.5%.

Another way to think of the confidence interval: Pick any value from the interval. Assume that is the population mean. Then the 95% region around that population mean will contain the sample estimate of the mean (0.525). If the sample is large, the 95% region is given by the Z-distribution with the population standard deviation approximated by the sample standard deviation. If the sample is small, then the 95% region is given by the t-distribution. Yet another way to think of a confidence interval: it is the set of values for which we would fail to reject a Null Hypothesis setting the population mean equal to those values.

What a Confidence Interval in Classical Statistics does NOT mean is that there is a 95% chance that the population mean lies between 41.48% and 63.52%.

The population parameter in this sort of statistics is not a random variable; hence talking about probability with respect to it is meaningless.

Difference of Means

Another example of statistical testing is to determine whether two populations are different in terms of their means. This differs slightly from the earlier existing-drug versus new-drug example. There we knew the population mean of one population (the existing drug) and wanted to test whether the other drug was better. Here we simply wish to test whether two populations are different, without any predisposition about one having a greater or lesser mean. For example, we may wish to test whether the populations of adult females of the US and Canada have the same mean height.

We take one sample each from both populations. Both sample means are normally distributed. We also assume that the population variances (standard deviations) are equal, though we do not know what they are. Under the Null that both populations have the same mean, the following statistic is distributed as t with n1 + n2 − 2 degrees of freedom:

t = [(xs1 − xs2) − (μ1 − μ2)] / [sp * (1/n1 + 1/n2)^(1/2)], where sp² = ((n1 − 1)*ss1² + (n2 − 1)*ss2²) / (n1 + n2 − 2)

Here n1 and n2 are the sizes of the samples of populations 1 and 2 respectively, xs1 and xs2 are the sample means respectively, ss1 and ss2 are the sample standard deviations respectively, and μ1 and μ2 are the population means under the Null Hypothesis respectively.

If we do not assume the population variances are equal, then we have the t-statistic given by:

t = [(xs1 − xs2) − (μ1 − μ2)] / (ss1²/n1 + ss2²/n2)^(1/2)
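scipy performs both versions of the difference-of-means test; here is a sketch with made-up height samples (the numbers are purely illustrative, not from the slides):

import numpy as np
from scipy import stats

us = np.array([64.1, 63.5, 65.2, 62.8, 64.9, 63.7])       # hypothetical heights, inches
canada = np.array([63.9, 64.4, 62.7, 63.1, 64.0, 63.3])

# Pooled (equal-variance) two-sample t-test of equal means
t_stat, p_value = stats.ttest_ind(us, canada, equal_var=True)

# equal_var=False would give the unequal-variance (Welch) version instead
print(t_stat, p_value)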

Paired Comparison Test

Did firms increase R&D spending as a percentage of total sales after Congress passed tax breaks for R&D expenditure? Suppose we look at a sample of 25 firms. One way to do the test is to look at the two populations of R&D spending and compare their means using the above tests of difference of means. A test with more power is to instead study the difference in each firm's spending before and after. That is, take the population of changes and test whether its mean is greater than zero. This is a Paired Comparison Test, and it has more power because we are not losing information by aggregating the R&D spending.

Suppose the sample mean of the changes is 7%, and the sample standard deviation is 5%. Assume that the distribution is normal, so the test will use the t-statistic. The degrees of freedom (dof) is 25 − 1 = 24. Suppose we test at the 95% confidence level; then the critical value as given by the t-tables is 1.7109.

The t-statistic will be:

t24 = 7% / (5% / 25^(1/2)) = 7

As the t-statistic exceeds the critical value, we will reject the Null Hypothesis that the tax breaks passed by Congress did not increase R&D expenditure.

The t-tables are given in terms of the right tail. If you want them for two-tailed tests, you have to take the critical value for half the significance level. That is, if you want 95% confidence, which corresponds to 5% significance, then look up the critical value for 2.5% significance. This works because the t-distribution is symmetric (like the normal distribution), so 2.5% on each tail adds up to 5% significance. The 2.5% critical value for only the right tail from the tables with 24 dof is 2.0639, which would be the critical value for the two-tailed test.

How does the Paired Comparison test increase power? Suppose there are 8 firms in the sample, 4 of which move from 5% to 6% and the other 4 move from 25% to 26%. Then if we aggregate, the aggregate mean moves from 15% to 16%. There is however substantial standard deviation in the samples due to the difference between 5% and 25% and between 6% and 26%. The standard deviation for the two samples (pre and post change) turns out to be about 10%. If we do the t-test we get:

t14 = 0.01 / (0.10²/8 + 0.10²/8)^(1/2) = 0.2

The above t-stat is too small to reject the Null at any meaningful level (the p-value is greater than 40%). Given this, the 1% change may not show up as significant, and we may fail to reject the Null. The above is a test of difference in means.

However, when we form a sample of differences, all differences are 1%, making the sample standard deviation zero. This makes the t-stat infinite and the p-value zero! We can reject the Null at all levels of significance. This is a test of mean of differences.

This is an example of how the power of tests can be increased by paying attention to the design of the test. How did the increase in power of the test happen? In the unequal means test there was variation in the samples due to variation between firms. We are however interested only in the variation due to the tax breaks that Congress passed. So considering only intra-firm variation enables us to focus on the tax breaks and eliminate the noise due to inter-firm differences, and results in a test of greater power. Of course, if the samples are independent (say trials of two drugs) then we can't use the Paired Comparison test. However, when samples are not independent (as in the case of changes in firms' R&D spending), we should use the Paired Comparison (mean of differences) whenever possible.
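The 8-firm example can be run both ways to see the gain in power (a sketch; the numbers are the ones used above):

import numpy as np
from scipy import stats

before = np.array([5, 5, 5, 5, 25, 25, 25, 25], dtype=float)
after = before + 1.0                     # every firm's R&D share rises by exactly 1 point

# Difference-in-means test: between-firm variation drowns out the 1-point shift
t_unpaired, p_unpaired = stats.ttest_ind(after, before, equal_var=True)
print(t_unpaired, p_unpaired)            # t ~ 0.2, p far too large to reject

# Paired view: work with the differences themselves
diffs = after - before
print(diffs.mean(), diffs.std(ddof=1))   # mean 1.0, sample std dev 0.0
# With zero spread in the differences the paired t-statistic is unbounded,
# so the Null of "no increase" is rejected at every level of significance.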

Tests of Variances

For a normally distributed population we can test whether the variance equals a particular value. This is similar to testing whether the mean equals a particular value, as we have done before, but with the variance substituted for the mean.

Suppose the Null Hypothesis H0 and the alternative Hypothesis HA are:

H0: σ² = σ0²

HA: σ² ≠ σ0²

The logic for the test remains the same as for the Classical (Frequentist) approach to statistical testing. That is, suppose the population is normally distributed and has σ² = σ0²; then the test statistic will have a particular distribution. This distribution was Z (standard normal) or t for the statistic to test the mean, and is Chi-squared for testing the variance. If the test statistic lies beyond the critical values then we conclude our initial assumption that the population conformed to the Null Hypothesis was wrong and we Reject. If the test statistic lies within the critical values then we Fail to Reject the Null Hypothesis.

The Chi-squared test statistic χ² has N − 1 degrees of freedom and is:

χ²(N−1) = (N − 1) * ss² / σ0²

The two-tailed Chi-squared test rejects the Null if the value of the test statistic is too small (sample variance much smaller than expected under the Null) or too large (sample variance much larger than expected under the Null). For example, we divide a 5% level of significance into a 2.5% right tail and a 2.5% left tail.

The Chi-squared table, and also the F-tables that we will need soon, are available at:
http://www.statsoft.com/textbook/sttable.html

Example: A sample of 25 daily returns for a stock has a variance of 0.20%². Assuming that the process that generates the returns is stable (unchanging), test at the 1% significance level the hypothesis that the variance of the daily stock returns is 0.25%².

The dof is 24 and the two tails at the 1% level of significance are: 9.886 and 45.559

The test statistic from the sample is:

χ²(24) = 24 * 0.20² / 0.25² = 15.36

As 45.559 > 15.36 > 9.886, the test statistic does not lie in the region of rejection and we FAIL TO REJECT the Null Hypothesis that the variance of daily stock returns is 0.25%².
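A sketch of the same variance test, with scipy supplying the Chi-squared critical values:

from scipy.stats import chi2

n, s2, sigma0_sq = 25, 0.20 ** 2, 0.25 ** 2
dof = n - 1

stat = dof * s2 / sigma0_sq          # 15.36
lower = chi2.ppf(0.005, dof)         # ~9.886  (left tail, 0.5%)
upper = chi2.ppf(0.995, dof)         # ~45.559 (right tail, 0.5%)

print(stat, lower, upper)
print("reject" if (stat < lower or stat > upper) else "fail to reject")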

Test of equality of variances

Equality of variances can be tested using a test statistic that is the ratio of the variances of the samples and that, under the Null, has an F-distribution. The larger variance is put in the numerator and the smaller in the denominator. The test statistic has two dofs, first the numerator's, then the denominator's. The order of the dofs is important, as they are not interchangeable. As before, the dofs are one less than the sample sizes. The larger variance is put in the numerator, so the test is single (right) tailed. Putting the larger sample variance on top means that we eliminate the lower tail. So when we test for inequality of variances at, say, a 5% level of significance, the right tail should only have a probability of 2.5%.

Example: An analyst is comparing the monthly returns of 2-year T-bonds and bonds issued by GM. She decides to investigate whether the T-bonds have the same variance as the GM bonds by taking a sample of 20 monthly returns of the former and 31 monthly returns of the latter. The variances are 0.0010² and 0.0012² respectively. What can you conclude if the level of significance is 10%?

The critical value for the F-stat at the 10% level of significance with degrees of freedom 30 and 19 is 2.07 (right tail with 5% probability).

The F-statistic for the test of equality of variances is:

F(30, 19) = 0.0012² / 0.0010² = 1.44

As the F-stat is less than the critical value 2.07, we cannot reject the Null Hypothesis that the two variances are equal.
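A sketch of the calculation, with scipy supplying the F critical value:

from scipy.stats import f

var_gm, n_gm = 0.0012 ** 2, 31       # larger variance goes in the numerator
var_tb, n_tb = 0.0010 ** 2, 20

f_stat = var_gm / var_tb                        # 1.44
crit = f.ppf(0.95, n_gm - 1, n_tb - 1)          # right tail with 5% probability, ~2.07

print(f_stat, crit)
print("reject" if f_stat > crit else "fail to reject")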

Covariance, Correlation Coefficient

These are (descriptive) statistics that measure the characteristics not of one variable (like the mean and standard deviation) but of a pair of variables.

The Covariance of a pair of variables is a measure of how much they move together. For example, if one of the variables increases by 5%, then the covariance is a measure of how much we would expect the other variable to have increased. If the variables have a tendency to move in opposite directions the covariance will be negative.

The formula for Covariance is similar to that for Variance, except that it takes the deviations from the mean for the two variables and multiplies them rather than taking the square of either. The formula for the estimator for covariance in a sample between variables X and Y is:

Cov(X, Y) = Σ(i=1 to N) (xi − μX)(yi − μY) / N

We take the pairwise occurrences of the variables X and Y, find their deviations from their means, and multiply them. Finally we divide by the number of occurrences to find the covariance. If the population means are not known, then we can use the sample means xs and ys to form an estimate of the covariance.

The correlation coefficient ρXY is the covariance between X and Y normalized (scaled down) by the standard deviations of the two variables:

ρXY = Cov(X, Y) / (σX * σY)

The correlation coefficient lies in the range −1 to +1 (inclusive).

A correlation of +1 means that X and Y are perfectly correlated; every P% increase (decrease) in X will result in a P% increase (decrease) in Y.

A correlation of -1 means that X and Y are perfectly negatively correlated; every P% decrease (increase) in X will result in a P% increase (decrease) in Y.

We denote the estimator of ρXY as rXY. It is a random variable. Sometimes it is called the sample correlation coefficient. Prior to the sample being taken we have the random variable (which we know how to compute using the formula). Post sampling we have one realization for (an estimate of) the estimator.
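numpy computes both statistics from paired data; here is a sketch with made-up daily returns (purely illustrative values):

import numpy as np

x = np.array([0.012, -0.004, 0.007, 0.001, -0.009, 0.005])
y = np.array([0.010, -0.002, 0.004, 0.003, -0.007, 0.006])

cov_xy = np.cov(x, y)[0, 1]        # sample covariance (numpy divides by N - 1 by default)
r_xy = np.corrcoef(x, y)[0, 1]     # sample correlation coefficient rXY

print(cov_xy, r_xy)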

Outliers are data points that deviate significantly from the norm. As the computation of variances and covariances involves squaring, outliers can have a disproportionate impact on their calculation. For example, in the calculation of variance, a data point that deviates from the mean by 3 units will have the same impact as 9 data points that deviate from the mean by 1 unit. Hence we should examine the impact of outliers on our estimates, and may even exclude them from our calculations.

Correlation and Causality

Correlation is a statistical estimate; it does not imply causality. Just because X and Y are positively correlated does not mean that X causes Y or vice versa. Consider two variables, the number of boats out on Lake Michigan and the number of cars on the highways. If you compute a correlation you will find that they are positively correlated. However it is not the boats that are causing more cars to be on the highways, or vice versa. Rather, both are activities that increase in the summer, leading to the positive correlation. The correlation between boats and cars is spurious in the sense that it is not causal.

Testing whether the correlation coefficient is ZERO

We represent the estimator of the correlation coefficient ρXY by rXY.

The Null to be tested is:

H0: ρXY = 0

The following test statistic is for a two-tailed test of the Null and is distributed as t with N−2 degrees of freedom:

t = rXY * (N − 2)^(1/2) / (1 − rXY²)^(1/2)

Example: You have 22 pairwise observations for daily returns to stocks X and Y. The correlation coefficient is 0.42. Test the Null at the 1% level of significance.

The critical value for t at the 1% level of significance with 20 dof is 2.85. The test statistic has the value:

t = 0.42 * 20^(1/2) / (1 − 0.42²)^(1/2) = 2.07

So we cannot reject the Null that the two daily stock returns have a zero correlation (are uncorrelated) at the 1% level of significance.

Linear Regressions

We now come to an area of statistical inference and estimation that is of particular importance to Economics and Finance: Regressions.

The basic single independent variable Regression model is:

Yi = α + β*Xi + εi

The above model says that the dependent variable Y is determined by the value of the independent variable X multiplied by a constant β, plus a constant α, plus other influences on Y summarized in the error term ε. The constant β is also called the slope coefficient, and the constant α the intercept. The reason should be clear from the following graph.

The work now is to estimate the coefficients of the Regression Equation, namely α and β.

If we define the vector b, the N×2 matrix X (with columns X1 and X2), and the vector Y:

b = [α, β]

X1 = [1, 1, …, 1]

X2 = [x1, x2, …, xN]

Y = [y1, y2, …, yN]

where X1 and X2 represent the 1st and 2nd columns of the N×2 matrix X. The first column can be thought of as using 1s (a constant) as a variable to extract the coefficient α. That is, like the other coefficient β, the coefficient α is also multiplied by a variable, except that the variable happens to be the constant 1.

The linear regression model becomes:

Y = Xb

Then the Ordinary Least Squares (OLS) estimator for the coefficients is the 2-dimensional vector:

bOLS = (X'X)^(-1) X'Y

with the first element of bOLS the intercept and the second element the slope.

If you do not want vector notation, you can use the summary statistics:

Sx = x1 + x2 + … + xN

Sy = y1 + y2 + … + yN

Sxx = x1² + x2² + … + xN²

Sxy = x1*y1 + x2*y2 + … + xN*yN

βOLS = (N*Sxy − Sx*Sy) / (N*Sxx − Sx*Sx)

αOLS = (Sy − βOLS*Sx) / N

Let the vector of residuals generated by the regression be e. It is an N-dimensional vector. OLS chooses the coefficients such that the sum of squares of the ei (i = 1, 2, …, N) is minimized. The method was applied in 1801 by Gauss to predict the position of the asteroid Ceres. Other mathematicians who had tried to predict (forecast) the asteroid's position had been unsuccessful. Legendre shares the credit with Gauss for discovering OLS.
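A sketch showing that the summary-statistics formulas and the matrix formula give the same coefficients (the data values are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])
N = len(x)

# Summary-statistics route
Sx, Sy, Sxx, Sxy = x.sum(), y.sum(), (x ** 2).sum(), (x * y).sum()
beta = (N * Sxy - Sx * Sy) / (N * Sxx - Sx * Sx)
alpha = (Sy - beta * Sx) / N

# Matrix route: b_OLS = (X'X)^(-1) X'Y, with a column of 1s for the intercept
X = np.column_stack([np.ones(N), x])
b_ols = np.linalg.inv(X.T @ X) @ (X.T @ y)

print(alpha, beta)   # the two routes agree
print(b_ols)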

The OLS estimate has some desirable features, provided some conditions are satisfied, prominently:

1) X and ε are uncorrelated

2) ε has the same variance across different data points.

3) ε has a zero expected value (note that α and ε can be defined to make this true)

4) ε is not serially correlated (that is, εi is uncorrelated with εi+1, εi+2, etc.)

If these conditions are satisfied the OLS estimator is BLUE (Best Linear Unbiased Estimator)

Best as in lowest variance; unbiased implies that if you find the OLS estimators repeatedly, their average will be α and β.

If the conditions for OLS are not satisfied a variety of other estimators have been designed by statisticians, for example:

1) Generalized Least Squares (GLS), if ε does not have the same variance across different data points.

2) Instrumental Variable estimators, if X and ε are correlated.

Just like we had a variance for the estimator of the mean of a population, we have variances for the OLS estimators. For now, I will not get into how the variances are calculated; rather, I will provide you the variances. Usually exam questions and statistical packages provide the variances rather than expect you to calculate them.

Hypothesis Testing

The most common Null Hypothesis to be tested for the linear regression is that one of the coefficients is zero. As before, the Null is rejected if the value of the coefficient estimate divided by its standard deviation exceeds (in magnitude) some critical value (usually either Z or t).

Example: A regression with 50 data points produces an estimate of 2.27 for the βOLS estimator. The sample estimate of its standard deviation ssβ is 1.14. Test at the 95% level (one-tailed) whether the slope coefficient is greater than 0.

As the number of data points is large at 50, we can use the Z-distribution. The critical value for the Z-stat at the one-tailed 95% level is 1.64.

test statistic = 2.27 / 1.14 = 1.99

As the test statistic is more than the critical value, we reject the Null Hypothesis that the slope coefficient is 0.

Following the Frequentist approach, we can form Confidence Intervals (which are estimators) for the coefficients of the linear model with critical values tC from the t-table:

αOLS ± tC*ssα

βOLS ± tC*ssβ

where ssα and ssβ are the standard deviations for the αOLS and βOLS estimators.

Example: A regression with 70 data points yields αOLS and βOLS estimates of 1.57 and 2.49. The standard deviations ssα and ssβ are 0.49 and 1.08. Estimate the 95% confidence intervals for the coefficients. As the number of data points is large, we can use the two-tailed 95% Z critical values of ±1.96.

αOLS ± ZC*ssα = 1.57 ± 1.96*0.49 = (0.61, 2.53)

βOLS ± ZC*ssβ = 2.49 ± 1.96*1.08 = (0.37, 4.61)

Analysis of Variance (ANOVA) for Regressions

When we say Sum of Squares we mean Sum of Squared Deviations from the Mean.

The Sum of Squares for Y is called the Total Sum of Squares (SST).

The Sum of Squares for βOLS*X is called the Regression Sum of Squares (SSR). This is the amount of variation that would exist for Y if X were the only cause of variation. This is the amount explained by the regression.

The Sum of Squares for ε is called the Error Sum of Squares (SSE). This is the amount of variation that the regression does not explain, and attributes to the error term. It is the unexplained part. This is also called the Sum of Squared Residuals, as it is indeed calculated by squaring the estimated residuals and adding.

By algebra we have: SSR + SSE = SST

The Mean Squared Error (MSE) equals SSE/(N-2).

MSE is the average of the squared residual errors. Normally when we calculate an average we divide by the number of observations N, but here we divide by the number of observations minus 2. Why N-2 rather than N? The answer is that the degrees of freedom are N-2. At which point you may be inclined to ask: Why are the degrees of freedom N-2? Without getting into a detailed answer, note that our goal for the regression is to minimize the residuals in some optimal fashion. Our ability to choose α and β (the two unknowns) implies we can pick these parameters such that 2 of the residuals become zero or any arbitrary number we wish to assign. We are not really facing a vector of N residuals that are beyond our control, but rather a vector of N-2 residuals. That is, the residuals can vary in N-2 dimensions, rather than N dimensions, hence N-2 degrees of freedom.

The R2 (R-squared or Coefficient of Determination) for a regression is:

R² = SSR/SST

The better the data fits the linear regression, the higher the R² will be. If X can explain all the variation in Y, the R² will be 100%. That would mean there is no variation left to be explained by the error term, and X and Y would be perfectly correlated. An R² of zero implies that X can explain none of the variation in Y; that is, X is orthogonal to Y, they have zero correlation.
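A sketch of the ANOVA decomposition and R² on the same kind of made-up data used earlier (illustrative values only):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])
N = len(x)

X = np.column_stack([np.ones(N), x])
alpha, beta = np.linalg.lstsq(X, y, rcond=None)[0]

y_hat = alpha + beta * x
sst = ((y - y.mean()) ** 2).sum()        # Total Sum of Squares
ssr = ((y_hat - y.mean()) ** 2).sum()    # Regression Sum of Squares
sse = ((y - y_hat) ** 2).sum()           # Error Sum of Squares

print(sst, ssr + sse)                    # SSR + SSE = SST
print(ssr / sst)                         # R-squared
print(sse / (N - 2))                     # MSE with N - 2 degrees of freedom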

Confidence Interval for Forecasts of Y

We can use the regression coefficients to make Forecasts for values of Y for given values of X.

The Predicted Value of Y is:

Ŷ = αOLS + βOLS*X

The sign over Y is a hat and the variable is called Y-hat. This is standard statistics notation.

Define the Standard Error of Estimate (SEE):

SEE = (SSE / (N − 2))^(1/2)

And then the variance σFY² for the Forecasts of Y is:

σFY² = SEE² * (1 + 1/N + (XG − X̄S)² / ((N − 1) * sSX²))

where XG is the given value of X for which Y is predicted, and X̄S and sSX are the sample mean and standard deviation of X respectively.

Example: A linear regression with 40 data points yields αOLS and βOLS estimates of 7.45 and 1.93. The SEE from the regression is 0.059, X̄S is 1.25 and sSX is 0.31. Estimate the Predicted Y value for X = 1.89 and the 95% confidence interval.

The Predicted Value = 7.45 + 1.93 * 1.89 = 11.10

We next compute the variance for the predicted value of Y to be:

0.059² * (1 + 1/40 + (1.89 − 1.25)² / ((40 − 1) * 0.31²))

= 0.00394846

which gives the standard deviation: 0.0628, i.e. 6.28%.

As N is larger than 30, we can use the Z critical values. If it were smaller, we should use t with N-2 dof. The critical values for the 95% confidence interval are ±1.96.

The 95% confidence interval is:

11.10 ± 1.96*0.0628

= (10.98, 11.22)