Statistical Inference Two Statistical Tasks 1. Description 2. Inference.

Post on 30-Dec-2015


Statistical Inference

Two Statistical Tasks
1. Description
2. Inference

Thus far, we have completed:
1. Descriptive Statistics
   a. Central tendency
      i. discrete variables
      ii. continuous variables
   b. Variation
      i. discrete variables
      ii. continuous variables
   c. Association
      i. discrete variables

Now we begin:
2. Inferential Statistics
   a. Estimation
   b. Hypothesis testing

Inferential statistics are based on random sampling. A sample is a subset of some universe (or population [set]). If (and only if) the sample is selected according to the laws of probability, we can make inferences about the universe from known (statistical) characteristics of the sample.

“Random” means selected so that each element in the universe has exactly the same chance of being picked for the sample (sometimes called an equi-probability sample).

Put differently, the only difference between elements selected into the sample and those not selected is pure chance (i.e., “the luck of the draw”).
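As a minimal sketch of equal-probability selection, the Python snippet below draws a simple random sample from a small made-up universe. The universe of 20 numbered elements, the sample size of 4, and the function name `draw_random_sample` are illustrative assumptions, not part of the original slides.

```python
import random

def draw_random_sample(universe, n, seed=None):
    """Draw n distinct elements so that every element has the same chance of selection."""
    rng = random.Random(seed)          # the seed only makes the illustration repeatable
    return rng.sample(list(universe), n)

# Hypothetical universe of 20 numbered elements (an assumption for illustration).
universe = range(1, 21)
sample = draw_random_sample(universe, n=4, seed=1)
print(sample)
```

Because selection depends on nothing but the random number generator, the elements in `sample` differ from those left out only by chance.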

All inferential statistics evaluate the probability that unlucky selection in creating a random sample (the “luck of the draw,” technically called “sampling error”) explains the statistical outcomes obtained from random samples.

Sample 1: 75% cardinal (n1 = 4)
Sample 2: 0% cardinal (n2 = 4)
Sample 3: 25% cardinal (n3 = 4)

Percent cardinal    f
  0                 lowest
 25                 medium
 50                 highest
 75                 medium
100                 lowest

[Figure: frequency of each percent cardinal in random samples; horizontal axis marked 0%, 25%, 50%, 75%, 100%]

All statistics calculated on variables from a random sample have a (known) sampling distribution. Sampling distributions are the theoretically possible distributions of statistical outcomes from an infinite number of random samples of the same size.
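The shape of such a sampling distribution can be worked out directly for the small samples shown earlier. The sketch below assumes a very large universe in which exactly half of the elements are cardinal (p = 0.5); that proportion is an assumption chosen for illustration, not a figure given in the slides. Under that assumption, the percent cardinal in samples of size 4 follows a binomial distribution. The function name `sampling_distribution` is hypothetical.

```python
from math import comb

def sampling_distribution(n, p):
    """Return {percent: probability} for the percent of 'successes' in samples of size n."""
    return {100 * k / n: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

# Assumption: half of a very large universe is cardinal (p = 0.5); n = 4 as in the samples above.
dist = sampling_distribution(n=4, p=0.5)
for percent, prob in dist.items():
    print(f"{percent:5.0f}% cardinal: {prob:.4f}")
```

Under these assumptions the probabilities are lowest at 0% and 100%, higher at 25% and 75%, and highest at 50%, which matches the lowest/medium/highest pattern in the frequency listing.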

Knowing this, we do not actually need to draw an infinite number of random samples. When we draw ONE (large) random sample, CHANCES ARE that its characteristics will be closer to the center of its sampling distribution than the extremes. That is, any sample statistic is likely to be close to (rather than very different from) the actual (unknown) value (parameter) in the universe.

For example, when we find that the value of χ² for the association between two variables in a large random sample is 13.748, chances are that the (unknown) value of χ² for the universe (the so-called “true” value) is similar rather than very different.

The question is: Does this sample value of χ² permit us to infer that the two variables are (probably) related or are (probably) independent in the universe? The answer requires knowing how to use the Chi-Square sampling distribution(s).

Sampling distributions allow us to identify the probability that a sample statistic has a similar value in the universe from which the random sample was drawn (that is, whether the value holds in general, not merely for the sample).

Unfortunately, χ² has not one but several sampling distributions, each differently shaped. The one that is relevant for the specific inference we wish to make can be identified by knowing the number of degrees of freedom involved in the calculation of this sample statistic.

In the case of contingency tables (crosstabulations), degrees of freedom associated with χ² are a function of the size of the table (i.e., the number of rows and columns). Specifically,

df = (R – 1)(C – 1)

For example, a contingency table having two rows and two columns (i.e., a 2 x 2 table) has only one degree of freedom:

df = (R – 1)(C – 1) = (2 – 1)(2 – 1) = (1)(1) = 1
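The same calculation can be written as a small helper. This is only a sketch; the function name `degrees_of_freedom` and the convention of passing the table as a list of rows (without totals) are assumptions made for illustration.

```python
def degrees_of_freedom(table):
    """df = (R - 1)(C - 1) for a contingency table given as a list of rows (no totals)."""
    rows = len(table)
    cols = len(table[0])
    return (rows - 1) * (cols - 1)

two_by_two = [[0, 0], [0, 0]]   # placeholder cells; only the shape matters here
print(degrees_of_freedom(two_by_two))
```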

              Column
Row       One     Two     Total
One         ?       ?       100
Two         ?       ?       200
Total     200     100       300

              Column
Row       One     Two     Total
One        96       ?       100
Two         ?       ?       200
Total     200     100       300

              Column
Row       One     Two     Total
One        96       4       100
Two         ?       ?       200
Total     200     100       300

              Column
Row       One     Two     Total
One        96       4       100
Two       104       ?       200
Total     200     100       300

              Column
Row       One     Two     Total
One        96       4       100
Two       104      96       200
Total     200     100       300
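The sequence of tables above shows what one degree of freedom means: once the marginal totals are fixed, choosing a single cell determines the other three. A minimal sketch of that arithmetic follows; the function name `fill_2x2` and its parameter names are hypothetical.

```python
def fill_2x2(row_totals, col_totals, top_left):
    """Given the marginal totals and one cell of a 2 x 2 table, derive the other three cells."""
    top_right = row_totals[0] - top_left        # row One must sum to its total
    bottom_left = col_totals[0] - top_left      # column One must sum to its total
    bottom_right = row_totals[1] - bottom_left  # row Two must sum to its total
    return [[top_left, top_right], [bottom_left, bottom_right]]

table = fill_2x2(row_totals=(100, 200), col_totals=(200, 100), top_left=96)
print(table)
```

Only `top_left` is free to vary; every other cell follows by subtraction from the totals.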

An Example

                          Year
Party Preference     1984     1985     Total
Democrat              545      595     1,140
Independent           528      462       990
Republican            370      455       825
Total               1,443    1,512     2,955

For the crosstabulation in this table, χ² = 13.748. Is the association in this table confined to the sample, or does this mean that there was a “real” shift in party identification from one year to the next? There are several steps in answering this question.
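For readers who want to check the reported value, the sketch below recomputes the statistic from the cell counts in the table. It assumes the usual Pearson formula, the sum over all cells of (observed - expected)² / expected, with expected counts taken from the marginal totals; the slides do not restate that formula here, and the function name `chi_square` is hypothetical.

```python
def chi_square(observed):
    """Pearson chi-square for a contingency table given as a list of rows (no totals)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # count expected under independence
            statistic += (count - expected) ** 2 / expected
    return statistic

# Party preference (rows) by year (columns: 1984, 1985), from the example table.
party_by_year = [[545, 595], [528, 462], [370, 455]]
chi2 = chi_square(party_by_year)
print(round(chi2, 3))
```

The result should land very close to the 13.748 reported for this table; a difference in the last decimal place can arise from rounding of the expected counts.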

Since these data are from a large random sample, we can use the laws of chance to infer whether this value represents a “real” shift in the universe (i.e., among people in the U.S. in general) or is merely an artifact of sampling (bad luck in randomly selecting 2,955 people who are NOT like the rest of the population).

We know that 13.748 is ONE of the values on a sampling distribution of χ², but which sampling distribution? Since df = 2 [i.e., (3 – 1)(2 – 1)], we can determine that the sampling distribution is the one whose values are located in row 2 of the table in Appendix 4, the “Critical Values of Chi-Square.”

We need a DECISION RULE or CUT POINT to decide whether this represents a true shift or merely the result of chance in drawing the random sample.

We must decide what chance of being wrong we want to entertain in deciding between a “true” relationship between changes over time and political party preference (i.e., one that actually exists in the universe) and an artifact of sampling (i.e., a relationship that exists nowhere else except in our sample due to the “luck of the draw”). Actually, with Appendix 4 we are limited to some conventional probabilities of deciding incorrectly: 10 percent (.10, column 1), 5 percent (.05, column 2), 1 percent (.01, column 3), or 1/10 of 1 percent (.001, column 4). Until we have introduced some additional criteria, let's stick with a 5 percent chance of incorrectly deciding between a real association and chance.

This is known as an alpha level (or significance level) and is expressed as:

α = 0.05

It means that, if the two variables are in fact unrelated in the universe, we run only a 5 percent chance of wrongly concluding that a true association exists when the one we observe is due to chance (and exists only in the sample). In other words, in that situation we have a 95 percent chance of being correct in making our inference.

Having decided on an alpha level of .05 (i.e., accepting a 5 percent chance that we will decide incorrectly) and knowing the appropriate Chi-Square sampling distribution (the one defined by 2 degrees of freedom), we can find the critical value of χ². From row 2 (df = 2) and column 2 (α = .05) of Appendix 4, we find that the appropriate critical value is 5.99. Since χ² for the data was calculated to be 13.748, and since 13.748 is GREATER than the critical value, we conclude that the odds favor there being a true association between party preference and year of poll. In other words, there is less than a 5 percent chance that this association could be due to chance (by randomly selecting people who are atypical of the rest of the population).
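The decision rule can be sketched in a few lines. The code below does not read Appendix 4; instead it relies on a closed form that holds for df = 2 only, where the upper-tail probability of the Chi-Square distribution is exp(-x / 2). That shortcut is an addition for illustration and does not come from the slides, and the function names are hypothetical.

```python
from math import exp, log

def critical_value_df2(alpha):
    """Critical chi-square value for df = 2 only, where P(X > x) = exp(-x / 2)."""
    return -2.0 * log(alpha)

def p_value_df2(statistic):
    """Upper-tail probability of a chi-square statistic with df = 2 only."""
    return exp(-statistic / 2.0)

alpha = 0.05
chi2 = 13.748                      # sample value reported in the example
critical = critical_value_df2(alpha)
reject_chance = chi2 > critical    # True means chance alone is an unlikely explanation
print(round(critical, 2), reject_chance, p_value_df2(chi2))
```

The computed critical value rounds to the 5.99 found in the table, and the sample value exceeds it. For any other number of degrees of freedom this shortcut does not apply, and a table of critical values or a statistical library would be needed.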

Recapitulation

1. Statistical inference involves “generalizing” from a sample to a (statistical) universe.
2. Statistical inference is only possible with random samples.
3. Statistical inference estimates the probability that a sample result could be due to chance (in sample selection).
4. Sampling distributions are the “keys” that connect (known) sample statistics and (unknown) universe parameters.
5. Alpha levels are used to identify “critical values” on sampling distributions.
