Chapter 12: The Analysis of Categorical Data and Goodness-of-Fit Test
Jan 04, 2016
Chi-Square Tests for Univariate Categorical Data
• One-way frequency table – the most convenient summary of univariate categorical data, listing the frequency observed for each category:
Category   Cash  Credit  Exchange  Refused
Frequency  34    18      31        17
Notation
k = number of categories of a categorical variable
π1 = true proportion for Category 1
π2 = true proportion for Category 2
...
πk = true proportion for Category k
(Note: π1 + π2 + ... + πk = 1)
The hypotheses to be tested have the form
H0: π1 = hypothesized proportion for Category 1
    π2 = hypothesized proportion for Category 2
    ...
    πk = hypothesized proportion for Category k
Ha: H0 is not true, so at least one of the true category proportions differs from the corresponding hypothesized value.
Example
• A number of psychological studies have considered the relationship between various deviant behaviors and other variables, such as lunar phase. An article focused on the existence of any relationship between date of patient admission for specified treatment and patient’s birthday. Admission date was partitioned into four categories according to how close it was to the patient’s birthday:
1. Within 7 days of birthday
2. Between 8 and 30 days, inclusive, from the birthday
3. Between 31 and 90 days, inclusive, from the birthday
4. More than 90 days from the birthday
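• Let π1, π2, π3, and π4 denote the true proportions in categories 1, 2, 3, and 4, respectively. If there is no relationship between admission date and birthday, each proportion equals the fraction of the 365 days of the year that falls in that category. There are 15 days in the first category (from 7 days before the patient’s birthday to 7 days after, including, of course, the birthday itself), 46 in the second (23 days on each side of the birthday), 120 in the third (60 days on each side), and the remaining 184 in the fourth, so
π1 = 15/365 = .041
π2 = 46/365 = .126
π3 = 120/365 = .329
π4 = 184/365 = .504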
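The day counts behind these proportions can be checked with a short Python sketch. The variable names are illustrative and do not come from the cited article; a 365-day year is assumed, as in the text.

```python
# Days in each category, counting both sides of the birthday.
# All names here are illustrative assumptions, not from the source.
days_in_year = 365
days = {
    1: 2 * 7 + 1,          # within 7 days, plus the birthday itself
    2: 2 * (30 - 8 + 1),   # 8 to 30 days away, inclusive, on each side
    3: 2 * (90 - 31 + 1),  # 31 to 90 days away, inclusive, on each side
}
days[4] = days_in_year - sum(days.values())  # all remaining days

# Hypothesized proportions, rounded to three decimals as in the text.
proportions = {cat: round(d / days_in_year, 3) for cat, d in days.items()}
print(days)
print(proportions)
```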
The hypotheses of interest are then
H0: π1 = .041, π2 = .126, π3 = .329, π4 = .504
Ha: H0 is not true
• The cited article gave data for n = 200 patients admitted for alcoholism treatment. If H0 is true, the expected counts are
(expected count for Category 1) = n(hypothesized proportion for Category 1) = 200(.041) = 8.2
(expected count for Category 2) = n(hypothesized proportion for Category 2) = 200(.126) = 25.2
(expected count for Category 3) = n(hypothesized proportion for Category 3) = 200(.329) = 65.8
(expected count for Category 4) = n(hypothesized proportion for Category 4) = 200(.504) = 100.8
Category  1    2     3     4
Observed  11   24    69    96
Expected  8.2  25.2  65.8  100.8
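As a quick check on the table, the expected counts can be recomputed as n times each hypothesized proportion. This is a minimal Python sketch with illustrative variable names; the values are the ones given in the text.

```python
# Sample size and hypothesized proportions from the example above.
n = 200
hypothesized = [0.041, 0.126, 0.329, 0.504]
observed = [11, 24, 69, 96]

# Expected count for each category = n * hypothesized proportion,
# rounded to one decimal place to match the table.
expected = [round(n * p, 1) for p in hypothesized]

for category, (o, e) in enumerate(zip(observed, expected), start=1):
    print(f"Category {category}: observed = {o}, expected = {e}")
```

Both rows sum to n = 200, which is a useful sanity check whenever the hypothesized proportions sum to 1.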
The goodness-of-fit statistic, X², results from first computing the quantity
$$\frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$
for each cell. The X² statistic is the sum of these quantities for all k cells:
$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$
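The sum over cells translates directly into code. The function below is a minimal sketch; the name `chi_square_statistic` is an illustrative choice, and the inputs are the observed and expected counts from the admission-date table.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Observed and expected counts for the four admission-date categories.
x2 = chi_square_statistic([11, 24, 69, 96], [8.2, 25.2, 65.8, 100.8])
print(f"X^2 = {x2:.4f}")
```

Small values of X² mean the observed counts sit close to the expected counts; large values point away from H0.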
Example
• We use the data from the previous example to test the hypothesis that admission date is unrelated to birthday. Let’s use a .05 significance level and the nine-step hypothesis-testing procedure.
1. Let π1, π2, π3, and π4 denote the proportions of all admissions for treatment of alcoholism falling in the four categories.
2. H0: π1=.041, π2=.126, π3=.329, π4=.504
3. Ha: H0 is not true.
4. Significance level: α = .05
5. Test statistic:
$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$
6. Assumptions: The expected cell counts (from Example 12.1) are 8.2, 25.2, 65.8, and 100.8, all of which are greater than 5. The article did not indicate how the patients were selected. We can proceed with the chi-square test if it is reasonable to assume that the 200 patients in the sample can be regarded as a random sample of patients admitted for treatment of alcoholism.
7. Calculation:
$$X^2 = \frac{(11 - 8.2)^2}{8.2} + \frac{(24 - 25.2)^2}{25.2} + \frac{(69 - 65.8)^2}{65.8} + \frac{(96 - 100.8)^2}{100.8} = .96 + .06 + .16 + .23 = 1.41$$
8. P-value: The P-value is based on a chi-square distribution with df = 4 – 1 = 3. The computed value of X² is smaller than 6.25 (the smallest entry in the df = 3 column of the chi-square table), so P-value > .10.
9. Conclusion: Because P-value > α, H0 cannot be rejected. There is not sufficient evidence to conclude that date admitted for treatment and birthday are related.
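The table lookup in step 8 only bounds the P-value. As a cross-check, the upper-tail area can be computed directly. The sketch below is not the method used in the text: it swaps in the standard closed-form tail area for a chi-square distribution with df = 3, and the function name `chi2_sf_df3` is an illustrative assumption.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail area P(chi-square >= x) for df = 3 (closed form)."""
    if x <= 0:
        return 1.0
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


alpha = 0.05
x2 = 1.41                    # value from step 7
p_value = chi2_sf_df3(x2)
reject_h0 = p_value <= alpha

print(f"P-value = {p_value:.3f}, reject H0: {reject_h0}")
```

The computed P-value is roughly 0.70, consistent with the table-based bound P-value > .10, and the tail area at 6.25 comes out at about .10, matching the table entry cited in step 8.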
Example
• Does the color of a car influence the chance that it will be stolen? The following information was reported for a random sample of 830 stolen vehicles: 140 were white, 100 were blue, 270 were red, 230 were black, and 90 were other colors. We use the X² goodness-of-fit test and a significance level of .01 to test the hypothesis that the color proportions for stolen cars are identical to the population color proportions.
• Suppose that it is known that 15% of all cars are white, 15% are blue, 35% are red, 30% are black, and 5% are other colors. If these same population color proportions hold for stolen cars, the expected counts are:
• Expected for white = 830(0.15) = 124.5
• Expected for blue = 830(0.15) = 124.5
• Expected for red = 830(0.35) = 290.5
• Expected for black = 830(0.30) = 249.0
• Expected for other = 830(0.05) = 41.5
Observed and Expected Counts
Category  Color  Observed Count  Expected Count
1         White  140             124.5
2         Blue   100             124.5
3         Red    270             290.5
4         Black  230             249.0
5         Other  90              41.5
1. Let π1, π2,…, π5 denote the true proportions of stolen cars that fall into the five color categories.
2. H0: π1=.15, π2=.15, π3=.35, π4=.30, π5=.05
3. Ha: H0 is not true
4. Significance level: α = .01
5. Test statistic:
$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$
6. Assumptions: The sample was a random sample of stolen vehicles. All expected counts are greater than 5, so the sample size is large enough to use the chi-square test.
7. Calculation:
$$X^2 = \frac{(140 - 124.5)^2}{124.5} + \frac{(100 - 124.5)^2}{124.5} + \frac{(270 - 290.5)^2}{290.5} + \frac{(230 - 249)^2}{249} + \frac{(90 - 41.5)^2}{41.5} = 1.93 + 4.82 + 1.45 + 1.45 + 56.68 = 66.33$$
8. P-value: All expected counts exceed 5, so the P-value can be based on a chi-square distribution with df = 5 – 1 = 4. The computed value is larger than 18.46, the largest value in the df = 4 column, so P-value < .001.
9. Conclusion: Because P-value ≤ α, H0 is rejected. There is convincing evidence that at least one of the color proportions for stolen cars differs from the corresponding proportion for all cars.
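The whole procedure for the car-color data can be sketched in Python as follows. The helper names `chi2_sf` and `goodness_of_fit` are hypothetical, and the tail area uses a standard closed-form expression for integer degrees of freedom in place of the table lookup used in the text. The check on expected counts follows the "greater than 5" condition stated in step 6.

```python
import math


def chi2_sf(x, df):
    """Upper-tail area P(chi-square >= x) for a positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
        m = df // 2
        return math.exp(-half) * sum(
            half ** j / math.factorial(j) for j in range(m)
        )
    # Odd df = 2m + 1: erfc term plus a finite sum of half-integer powers.
    m = (df - 1) // 2
    tail = sum(half ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, m + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail


def goodness_of_fit(observed, proportions, alpha):
    """Return (X^2, df, P-value, reject H0?) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    if min(expected) <= 5:
        raise ValueError("every expected count should exceed 5")
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    p_value = chi2_sf(x2, df)
    return x2, df, p_value, p_value <= alpha


# White, blue, red, black, other: counts and population proportions.
observed = [140, 100, 270, 230, 90]
proportions = [0.15, 0.15, 0.35, 0.30, 0.05]
x2, df, p_value, reject_h0 = goodness_of_fit(observed, proportions, alpha=0.01)

print(f"X^2 = {x2:.2f}, df = {df}, P-value = {p_value:.2e}, reject H0: {reject_h0}")
```

Most of the statistic comes from the "other" category, where 90 cars were observed against 41.5 expected, so that category drives the rejection of H0.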