Chapter 12: Analysis of Categorical Data 1
Chapter 12 Analysis of Categorical Data
LEARNING OBJECTIVES
This chapter presents several nonparametric statistics that can be used to analyze data, enabling you to:
1. Understand the chi-square goodness-of-fit test and how to use it.
2. Analyze data using the chi-square test of independence.
CHAPTER TEACHING STRATEGY
Chapter 12 covers the two most prevalent chi-square tests: the chi-square goodness-of-fit test and the chi-square test of independence. These two techniques are important because they give the statistician a tool that is particularly useful for analyzing nominal data (even though the categories of a variable can sometimes be ordinal or higher). It should be emphasized that there are many instances in business research where the data gathered are merely categorical identifications. For example, in segmenting the marketplace (consumers or industrial users), information is gathered regarding gender, income level, geographical location, political affiliation, religious preference, ethnicity, occupation, size of company, type of industry, etc. On these variables, the measurement is often a tally of the frequency of occurrence of individuals, items, or companies in each category. The subject of the research is given no "score" or "measurement" other than a 0/1 for being a member or not of a given category. These two chi-square tests are well suited to analyzing such data.
The chi-square goodness-of-fit test examines the categories of one variable to determine whether the distribution of observed occurrences matches some expected or theoretical distribution of occurrences. It can be used to determine whether some standard or previously known distribution of proportions is the same as some observed distribution of proportions. It can also be used to validate the theoretical distribution of occurrences of phenomena such as random arrivals, which are often assumed to be Poisson distributed. Note that the degrees of freedom, which are k - 1 for a given set of expected values or for the uniform distribution, change to k - 2 for an expected Poisson distribution and to k - 3 for an expected normal distribution. To conduct a chi-square goodness-of-fit test against an expected Poisson distribution, the value of lambda must be estimated from the observed data, which costs one additional degree of freedom. With the normal distribution, both the mean and the standard deviation of the expected distribution are estimated from the observed values, which costs two additional degrees of freedom from the k - 1 value.
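The degrees-of-freedom rule for the goodness-of-fit test can be written as a small helper. This is only a minimal sketch: the function name `gof_degrees_of_freedom` and the lookup table `ESTIMATED_PARAMETERS` are illustrative assumptions, not names used in the text.

```python
# Number of parameters estimated from the observed data for each expected
# distribution discussed in the chapter (illustrative lookup table).
ESTIMATED_PARAMETERS = {
    "given": 0,    # expected values supplied in advance
    "uniform": 0,  # equal expected frequencies
    "poisson": 1,  # lambda estimated from the data
    "normal": 2,   # mean and standard deviation estimated from the data
}


def gof_degrees_of_freedom(k, distribution="given"):
    """Return k - 1 minus the number of parameters estimated from the data."""
    df = k - 1 - ESTIMATED_PARAMETERS[distribution]
    if df < 1:
        raise ValueError("too few categories for this expected distribution")
    return df


print(gof_degrees_of_freedom(6))            # six categories, given expected values
print(gof_degrees_of_freedom(6, "poisson")) # six categories, Poisson fit
print(gof_degrees_of_freedom(7, "normal"))  # seven categories, normal fit
```

Keeping the count of estimated parameters in one table makes the k - 1, k - 2, and k - 3 cases a single rule rather than three separate ones.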
The chi-square test of independence compares the observed frequencies across the categories of two variables with expected values to determine whether the two variables are independent. If the variables are not independent, they are dependent, or related. This allows business researchers to reach conclusions about such questions as whether smoking is independent of gender or whether the type of housing preferred is independent of geographic region. The chi-square test of independence is often used as a tool for preliminary analysis of data gathered in exploratory research, where the researcher has little idea of which variables are related to which and the data are nominal. This test is particularly useful with demographic data.
A word of warning is appropriate here. When an expected frequency is small, the observed chi-square value can be inordinately large, which increases the possibility of committing a Type I error. Research on this problem has yielded varying results, with some authors indicating that expected values as low as two or three are acceptable and other researchers demanding that expected values be ten or more. In this text, we have settled on the widely accepted criterion of five or more.
CHAPTER OUTLINE
12.1 Chi-Square Goodness-of-Fit Test
     Testing a Population Proportion Using the Chi-Square Goodness-of-Fit Test as an Alternative Technique to the z Test
12.2 Contingency Analysis: Chi-Square Test of Independence
KEY TERMS
Categorical Data
Chi-Square Distribution
Chi-Square Goodness-of-Fit Test
Chi-Square Test of Independence
Contingency Analysis
Contingency Table
SOLUTIONS TO CHAPTER 12
12.1
  $f_o$    $f_e$    $(f_o - f_e)^2 / f_e$
   53       68       3.309
   37       42       0.595
   32       33       0.030
   28       22       1.636
   18       10       6.400
   15        8       6.125

Ho: The observed distribution is the same as the expected distribution.
Ha: The observed distribution is not the same as the expected distribution.

Observed $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = 18.095$

df = k - 1 = 6 - 1 = 5, α = .05
$\chi^2_{.05,5} = 11.07$

Since the observed $\chi^2 = 18.095 > \chi^2_{.05,5} = 11.07$, the decision is to reject the null hypothesis. The observed frequencies are not distributed the same as the expected frequencies.
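The arithmetic in Problem 12.1 can be checked with a few lines of Python. This is a minimal sketch: the function name `chi_square_statistic` is an illustrative assumption, and the critical value is copied from the solution rather than computed. Because the code sums unrounded terms, its result agrees with 18.095 only up to rounding of the individual terms.

```python
def chi_square_statistic(observed, expected):
    """Sum of (fo - fe)^2 / fe over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


# Observed and expected frequencies from Problem 12.1.
observed = [53, 37, 32, 28, 18, 15]
expected = [68, 42, 33, 22, 10, 8]

# Table value for alpha = .05 and df = 5, taken from the solution.
critical_value = 11.07

chi_sq = chi_square_statistic(observed, expected)
decision = "reject" if chi_sq > critical_value else "fail to reject"
print(f"chi-square = {chi_sq:.4f}, decision: {decision} the null hypothesis")
```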
Ho: The observed frequencies are normally distributed.
Ha: The observed frequencies are not normally distributed.

For Category 10-20                                       Prob
   for x = 10, $z = \frac{10 - 44.3}{14.43} = -2.38$     .4913
   for x = 20, $z = \frac{20 - 44.3}{14.43} = -1.68$    -.4535
   Expected prob.:                                       .0378

For Category 20-30                                       Prob
   for x = 20, z = -1.68                                 .4535
   for x = 30, $z = \frac{30 - 44.3}{14.43} = -0.99$    -.3389
   Expected prob.:                                       .1146

For Category 30-40                                       Prob
   for x = 30, z = -0.99                                 .3389
   for x = 40, $z = \frac{40 - 44.3}{14.43} = -0.30$    -.1179
   Expected prob.:                                       .2210

For Category 40-50                                       Prob
   for x = 40, z = -0.30                                 .1179
   for x = 50, $z = \frac{50 - 44.3}{14.43} = 0.40$     +.1554
   Expected prob.:                                       .2733

For Category 50-60                                       Prob
   for x = 60, $z = \frac{60 - 44.3}{14.43} = 1.09$      .3621
   for x = 50, z = 0.40                                 -.1554
   Expected prob.:                                       .2067
For Category 60-70                                       Prob
   for x = 70, $z = \frac{70 - 44.3}{14.43} = 1.78$      .4625
   for x = 60, z = 1.09                                 -.3621
   Expected prob.:                                       .1004

For Category 70-80                                       Prob
   for x = 80, $z = \frac{80 - 44.3}{14.43} = 2.47$      .4932
   for x = 70, z = 1.78                                 -.4625
   Expected prob.:                                       .0307

For < 10:
   Probability between 10 and the mean, 44.3 = (.0378 + .1146 + .2210 + .1179) = .4913
   Probability < 10 = .5000 - .4913 = .0087

For > 80:
   Probability between 80 and the mean, 44.3 = (.0307 + .1004 + .2067 + .1554) = .4932
   Probability > 80 = .5000 - .4932 = .0068

Category    Prob     Expected frequency
< 10        .0087    .0087(129) =  1.12
10-20       .0378    .0378(129) =  4.88
20-30       .1146                 14.78
30-40       .2210                 28.51
40-50       .2733                 35.26
50-60       .2067                 26.66
60-70       .1004                 12.95
70-80       .0307                  3.96
> 80        .0068                  0.88

Due to the small sizes of the expected frequencies, category < 10 is folded into 10-20 and category > 80 is folded into 70-80.
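The expected frequencies above can be reproduced without a printed z table. This is a minimal sketch under stated assumptions: the function names are illustrative, the normal areas come from `math.erf` rather than a table, and z is rounded to two decimals to mimic table lookup, so results can differ from the hand calculation in the last digit. The mean 44.3, standard deviation 14.43, and n = 129 are the values used in the solution.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_normal_frequencies(boundaries, mean, sd, n):
    """Expected counts for (-inf, b0), [b0, b1), ..., [b_last, +inf)."""
    # z is rounded to two decimals to mimic looking it up in a z table.
    cumulative = [normal_cdf(round((b - mean) / sd, 2)) for b in boundaries]
    cumulative = [0.0] + cumulative + [1.0]
    return [n * (hi - lo) for lo, hi in zip(cumulative, cumulative[1:])]


def fold_open_tails(freqs, minimum=5.0):
    """Fold each open-ended tail category into its neighbour if it is below minimum."""
    freqs = list(freqs)
    if freqs[-1] < minimum:
        freqs[-2:] = [freqs[-2] + freqs[-1]]
    if freqs[0] < minimum:
        freqs[:2] = [freqs[0] + freqs[1]]
    return freqs


boundaries = [10, 20, 30, 40, 50, 60, 70, 80]
expected = expected_normal_frequencies(boundaries, mean=44.3, sd=14.43, n=129)
folded = fold_open_tails(expected)

for fe in expected:
    print(f"{fe:6.2f}")
print("categories after folding:", len(folded))
```

The helper folds each open-ended tail only once, which matches the single fold applied in the solution.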
$\chi^2_{.05,4} = 9.48773$

Since the observed $\chi^2 = 2.004 < \chi^2_{.05,4} = 9.48773$, the decision is to fail to reject the null hypothesis. There is not enough evidence to declare that the observed frequencies are not normally distributed.
12.5
Definition                $f_o$    Exp. Prop.    $f_e$                $(f_o - f_e)^2 / f_e$
Happiness                  42      .39           227(.39) = 88.53      24.46
Sales/Profit               95      .12           227(.12) = 27.24     168.55
Helping Others             27      .18                      40.86       4.70
Achievement/Challenge      63      .31                      70.37       0.77
Total                     227                                         198.48

Ho: The observed frequencies are distributed the same as the expected frequencies.
Ha: The observed frequencies are not distributed the same as the expected frequencies.

Observed $\chi^2 = 198.48$
df = k - 1 = 4 - 1 = 3, α = .05
$\chi^2_{.05,3} = 7.81473$
Since the observed $\chi^2 = 198.48 > \chi^2_{.05,3} = 7.81473$, the decision is to reject the null hypothesis. The observed frequencies for men are not distributed the same as the expected frequencies, which are based on the responses of women.
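When the expected distribution is given as proportions, as in Problem 12.5, it can help to see how much each category contributes to the statistic. The sketch below is illustrative only; the function name `gof_contributions` is an assumption, and the data are the observed counts and expected proportions from the problem.

```python
def gof_contributions(observed, proportions):
    """Per-category (fo - fe)^2 / fe, with fe = n * expected proportion."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("expected proportions must sum to 1")
    n = sum(observed)
    expected = [n * p for p in proportions]
    return [(fo - fe) ** 2 / fe for fo, fe in zip(observed, expected)]


labels = ["Happiness", "Sales/Profit", "Helping Others", "Achievement/Challenge"]
observed = [42, 95, 27, 63]
proportions = [0.39, 0.12, 0.18, 0.31]

contributions = gof_contributions(observed, proportions)
chi_sq = sum(contributions)
largest = max(zip(contributions, labels))[1]

for label, part in zip(labels, contributions):
    print(f"{label:<24}{part:8.2f}")
print(f"{'Total':<24}{chi_sq:8.2f}")
print("largest contribution:", largest)
```

Listing the contributions shows that the Sales/Profit category accounts for most of the observed chi-square value.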
12.6
Age      $f_o$    Prop. from survey    $f_e$                  $(f_o - f_e)^2 / f_e$
10-14     22      .09                  (.09)(212) = 19.08     0.45
15-19     50      .23                  (.23)(212) = 48.76     0.03
20-24     43      .22                               46.64     0.28
25-29     29      .14                               29.68     0.02
30-34     19      .10                               21.20     0.23
> 35      49      .22                               46.64     0.12
Total    212                                                  1.13

Ho: The distribution of observed frequencies is the same as the distribution of expected frequencies.
Ha: The distribution of observed frequencies is not the same as the distribution of expected frequencies.

α = .01, df = k - 1 = 6 - 1 = 5
$\chi^2_{.01,5} = 15.0863$
The observed $\chi^2 = 1.13$

Since the observed $\chi^2 = 1.13 < \chi^2_{.01,5} = 15.0863$, the decision is to fail to reject the null hypothesis. There is not enough evidence to declare that the distribution of observed frequencies is different from the distribution of expected frequencies.
Ho: The observed frequencies are normally distributed.
Ha: The observed frequencies are not normally distributed.

For Category 10-20                                       Prob
   for x = 10, $z = \frac{10 - 39.63}{13.6} = -2.18$     .4854
   for x = 20, $z = \frac{20 - 39.63}{13.6} = -1.44$    -.4251
   Expected prob.:                                       .0603

For Category 20-30                                       Prob
   for x = 20, z = -1.44                                 .4251
   for x = 30, $z = \frac{30 - 39.63}{13.6} = -0.71$    -.2611
   Expected prob.:                                       .1640

For Category 30-40                                       Prob
   for x = 30, z = -0.71                                 .2611
   for x = 40, $z = \frac{40 - 39.63}{13.6} = 0.03$     +.0120
   Expected prob.:                                       .2731
For Category 40-50                                       Prob
   for x = 50, $z = \frac{50 - 39.63}{13.6} = 0.76$      .2764
   for x = 40, z = 0.03                                 -.0120
   Expected prob.:                                       .2644

For Category 50-60                                       Prob
   for x = 60, $z = \frac{60 - 39.63}{13.6} = 1.50$      .4332
   for x = 50, z = 0.76                                 -.2764
   Expected prob.:                                       .1568

For Category 60-70                                       Prob
   for x = 70, $z = \frac{70 - 39.63}{13.6} = 2.23$      .4871
   for x = 60, z = 1.50                                 -.4332
   Expected prob.:                                       .0539

For < 10:
   Probability between 10 and the mean = .0603 + .1640 + .2611 = .4854
   Probability < 10 = .5000 - .4854 = .0146

For > 70:
   Probability between 70 and the mean = .0120 + .2644 + .1568 + .0539 = .4871
   Probability > 70 = .5000 - .4871 = .0129

Age       Probability    $f_e$
< 10      .0146          (.0146)(231) =  3.37
10-20     .0603          (.0603)(231) = 13.93
20-30     .1640                         37.88
30-40     .2731                         63.09
40-50     .2644                         61.08
50-60     .1568                         36.22
60-70     .0539                         12.45
> 70      .0129                          2.98

The expected frequencies for categories < 10 and > 70 are less than 5. Collapse the < 10 category into 10-20 and the > 70 category into 60-70.
$\chi^2_{.05,3} = 7.81473$
Observed $\chi^2 = 2.45$

Since the observed $\chi^2 = 2.45 < \chi^2_{.05,3} = 7.81473$, the decision is to fail to reject the null hypothesis. There is not enough evidence to reject the claim that the observed frequencies are normally distributed.
$\chi^2_{.01,5} = 15.0863$

Since the observed $\chi^2 = 10.27 < \chi^2_{.01,5} = 15.0863$, the decision is to fail to reject the null hypothesis. There is not enough evidence to reject the claim that the observed frequencies are Poisson distributed.

12.9 H0: p = .28     n = 270     x = 62
     Ha: p ≠ .28

                     $f_o$    $f_e$                 $(f_o - f_e)^2 / f_e$
Spend More            62      270(.28) =  75.6      2.44656
Don't Spend More     208      270(.72) = 194.4      0.95144
Total                270                 270.0      3.39800
The observed value of $\chi^2$ is 3.398.
α = .05 and α/2 = .025, df = k - 1 = 2 - 1 = 1
$\chi^2_{.025,1} = 5.02389$

Since the observed $\chi^2 = 3.398 < \chi^2_{.025,1} = 5.02389$, the decision is to fail to reject the null hypothesis.

12.10 H0: p = .30     n = 180     x = 42
      Ha: p ≠ .30
                  $f_o$    $f_e$               $(f_o - f_e)^2 / f_e$
Provide            42      180(.30) =  54      2.6666
Don't Provide     138      180(.70) = 126      1.1429
Total             180                 180      3.8095

The observed value of $\chi^2$ is 3.8095.
α = .05 and α/2 = .025, df = k - 1 = 2 - 1 = 1
$\chi^2_{.025,1} = 5.02389$

Since the observed $\chi^2 = 3.8095 < \chi^2_{.025,1} = 5.02389$, the decision is to fail to reject the null hypothesis.
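Problems 12.9 and 12.10 use the goodness-of-fit test as an alternative to the z test of a population proportion. With only two categories, the chi-square statistic equals the square of the z statistic for the same data, which the sketch below checks numerically. The function names are illustrative assumptions; the inputs are the x, n, and hypothesized p from the two problems.

```python
import math


def z_for_proportion(x, n, p0):
    """z statistic for testing H0: p = p0 with x successes in n trials."""
    p_hat = x / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


def chi_square_two_categories(x, n, p0):
    """Goodness-of-fit statistic for the categories 'success' and 'failure'."""
    observed = [x, n - x]
    expected = [n * p0, n * (1 - p0)]
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


# (x, n, p0) for Problems 12.9 and 12.10.
for x, n, p0 in [(62, 270, 0.28), (42, 180, 0.30)]:
    z = z_for_proportion(x, n, p0)
    chi_sq = chi_square_two_categories(x, n, p0)
    print(f"z = {z:.4f}, z^2 = {z * z:.4f}, chi-square = {chi_sq:.4f}")
```

Both expressions reduce algebraically to (x - n·p0)² / (n·p0·(1 - p0)), which is why the two columns agree.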
12.11
                     Variable Two
                                        Total
Variable One       203        326        529
                    68        110        178
Total              271        436        707

Ho: Variable One is independent of Variable Two.
Ha: Variable One is not independent of Variable Two.
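For a test of independence, each expected frequency is the row total times the column total divided by the grand total, and the degrees of freedom are (rows - 1)(columns - 1). The following sketch applies that rule to the 2 x 2 table of Problem 12.11; the function names are illustrative assumptions.

```python
def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def independence_df(table):
    """Degrees of freedom: (rows - 1) * (columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Observed frequencies from Problem 12.11 (rows: Variable One, columns: Variable Two).
observed = [[203, 326],
            [68, 110]]

expected = expected_frequencies(observed)
df = independence_df(observed)

for row in expected:
    print("  ".join(f"{fe:8.2f}" for fe in row))
print("df =", df)
```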
Since the observed $\chi^2$ is less than $\chi^2_{.05,5} = 11.0705$, the decision is to fail to reject the null hypothesis. There is not enough evidence to reject the claim that the observed frequency of arrivals is Poisson distributed.
12.24 Ho: The distribution of observed frequencies is the same as the distribution of expected frequencies.
Ha: The distribution of observed frequencies is not the same as the distribution of expected frequencies.
$\chi^2_{.05,6} = 12.5916$

Since the observed $\chi^2 = 26.03 > \chi^2_{.05,6} = 12.5916$, the decision is to reject the null hypothesis. The observed frequencies are not distributed the same as the expected frequencies from the national poll.

12.25
127     387     57     571

Ho: Number of Children is independent of Type of College or University.
Ha: Number of Children is not independent of Type of College or University.
$\chi^2_{.05,6} = 12.5916$

Since the observed $\chi^2 = 54.63 > \chi^2_{.05,6} = 12.5916$, the decision is to reject the null hypothesis. Number of children is not independent of type of college or university.
12.28 The observed chi-square is 30.18 with a p-value of .0000043. The chi-square goodness-of-fit test indicates that there is a significant difference between the observed frequencies and the expected frequencies. The distribution of responses to the question is not the same for adults between 21 and 30 years of age as it is for others. Marketing and sales people might reorient their efforts aimed at 21- to 30-year-olds away from home improvement and pay more attention to leisure travel/vacation, clothing, and home entertainment.
12.29 The observed chi-square value for this test of independence is 5.366. The associated p-value of .252 indicates failure to reject the null hypothesis. There is not enough evidence here to say that color choice is dependent upon gender. Automobile marketing people therefore have no basis in these data for targeting particular colors at men or at women. In addition, design and production people can determine car color quotas based on other variables.