Statistics for Business and Economics Chapter 9 Categorical Data Analysis
Mar 30, 2015
Learning Objectives
1. Explain χ² Test for Proportions
2. Explain χ² Test of Independence
3. Solve Hypothesis Testing Problems
• More Than Two Population Proportions
• Independence
Data Types
Data
• Quantitative
– Discrete
– Continuous
• Qualitative
Qualitative Data
• Qualitative random variables yield responses that classify
– Example: gender (male, female)
• Measurement reflects number in category
• Nominal or ordinal scale
• Examples
– What make of car do you drive?
– Do you live on-campus or off-campus?
Hypothesis Tests Qualitative Data
Qualitative Data
• Proportion
– 1 pop.: Z Test
– 2 pop.: Z Test
– More than 2 pop.: χ² Test
• Independence: χ² Test
Chi-Square (χ²) Test for k Proportions
Multinomial Experiment
• n identical trials
• k outcomes to each trial
• Constant outcome probability, pk
• Independent trials
• Random variable is count, nk
• Example: ask 100 people (n) which of 3 candidates (k) they will vote for
Chi-Square (χ²) Test for k Proportions
• Tests equality (=) of proportions only
– Example: p1 = .2, p2 = .3, p3 = .5
• One variable with several levels
• Uses one-way contingency table
One-Way Contingency Table
Shows number of observations in k independent groups (outcomes or variable levels)
Outcomes (k = 3): Candidate

                       Tom    Bill    Mary    Total
Number of responses     35      20      45      100
Conditions Required for a Valid Test: One-way Table
1. A multinomial experiment has been conducted
2. The sample size n is large: E(ni) is greater than or equal to 5 for every cell
χ² Test for k Proportions Hypotheses & Statistic
1. Hypotheses
H0: p1 = p1,0, p2 = p2,0, ..., pk = pk,0
Ha: At least one pi is different from above
2. Test Statistic
$$\chi^2 = \sum_{\text{all cells}} \frac{[n_i - E(n_i)]^2}{E(n_i)}$$
where ni is the observed count and E(ni) = n·pi,0 is the expected count, with pi,0 the hypothesized probability
3. Degrees of Freedom: k – 1, where k is the number of outcomes
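The statistic can be sketched in a few lines of Python. This is only an illustration: the function name `chi_square_one_way` is made up here, and the equal-share null hypothesis applied to the candidate counts (35, 20, 45) is an assumption chosen for the example, not something the slides test.

```python
def chi_square_one_way(observed, hypothesized_probs):
    """Return (statistic, degrees_of_freedom, expected_counts) for a one-way table."""
    n = sum(observed)
    # Expected count in each cell under H0: E(n_i) = n * p_i0
    expected = [n * p for p in hypothesized_probs]
    # Squared difference relative to the expected count, summed over all cells
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1, expected


# Candidate counts from the one-way table; H0 of equal shares is assumed for illustration.
stat, df, expected = chi_square_one_way([35, 20, 45], [1 / 3, 1 / 3, 1 / 3])
print(round(stat, 2), df)
```

The function returns the expected counts as well so that the large-sample condition (every expected count at least 5) can be checked before the statistic is used.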
χ² Test Basic Idea
1. Compares observed count to expected count assuming null hypothesis is true
2. The closer the observed counts are to the expected counts, the more consistent the data are with H0
• Measured by squared difference relative to expected count
– Reject H0 for large values of χ²
Finding Critical Value Example
What is the critical χ² value if k = 3 and α = .05?
df = k - 1 = 2

χ² Table (Portion)
       Upper Tail Area
DF     .995     …     .95      …     .05
1      ...      …     0.004    …     3.841
2      0.010    …     0.103    …     5.991

The critical value is 5.991: reject H0 if χ² > 5.991; otherwise do not reject H0.
If ni = E(ni), χ² = 0.
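The df = 2 row of the table can be reproduced without a printed table, because for exactly 2 degrees of freedom the chi-square upper-tail area has the closed form exp(-x/2). The sketch below relies on that special case only; the function names are illustrative, and other degrees of freedom would need a statistics library or the table.

```python
import math


def chi2_upper_tail_df2(x):
    """Upper-tail area P(X > x) for a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2)


def chi2_critical_df2(alpha):
    """Value whose upper-tail area is alpha; valid only for 2 degrees of freedom."""
    return -2 * math.log(alpha)


critical = chi2_critical_df2(0.05)
print(round(critical, 3))  # 5.991
```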
χ² Test for k Proportions Example
As personnel director, you want to test the perception of fairness of three methods of performance evaluation. Of 180 employees, 63 rated Method 1 as fair, 45 rated Method 2 as fair, 72 rated Method 3 as fair. At the .05 level of significance, is there a difference in perceptions?

χ² Test for k Proportions Solution
• H0: p1 = p2 = p3 = 1/3
• Ha: At least 1 is different
• α = .05
• n1 = 63, n2 = 45, n3 = 72
• Critical Value(s): 5.991 (reject H0 if χ² > 5.991)
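χ² Test for k Proportions Solution
Expected counts:
$$E(n_i) = n p_{i,0}, \qquad E(n_1) = E(n_2) = E(n_3) = 180 \cdot \tfrac{1}{3} = 60$$
Test statistic:
$$\chi^2 = \sum_{\text{all cells}} \frac{[n_i - E(n_i)]^2}{E(n_i)} = \frac{(63 - 60)^2}{60} + \frac{(45 - 60)^2}{60} + \frac{(72 - 60)^2}{60} = 6.3$$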
Test Statistic: χ² = 6.3
Decision: Reject at α = .05
Conclusion: There is evidence of a difference in proportions
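The personnel example can be checked numerically. The following is a minimal sketch using only the counts, the null proportions, and the tabled critical value 5.991 given above; the variable names are illustrative.

```python
observed = [63, 45, 72]            # employees rating Methods 1, 2, 3 as fair
hypothesized = [1 / 3, 1 / 3, 1 / 3]
critical_value = 5.991             # chi-square table, df = 2, alpha = .05

n = sum(observed)
expected = [n * p for p in hypothesized]

# Large-sample condition: every expected count should be at least 5.
condition_ok = all(e >= 5 for e in expected)

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
reject_h0 = chi_sq > critical_value
print(round(chi_sq, 2), reject_h0)  # 6.3 True
```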
Contingency Tables
• Useful in situations involving multiple population proportions
• Used to classify sample observations according to two or more characteristics
• Also called a cross-classification table.
Contingency Table Example
Left-Handed vs. Gender
Dominant Hand: Left vs. Right
Gender: Male vs. Female
2 categories for each variable, so called a 2 x 2 table
Suppose we examine a sample of 300 children
Contingency Table Example (continued)
Sample results organized in a contingency table:

            Hand Preference
Gender      Left    Right    Total
Female        12      108      120
Male          24      156      180
Total         36      264      300

Sample size = n = 300:
• 120 Females, 12 were left handed
• 180 Males, 24 were left handed
χ² Test for the Difference Between Two Proportions
• If H0 is true, then the proportion of left-handed females should be the same as the proportion of left-handed males
• The two proportions above should be the same as the proportion of left-handed people overall
H0: π1 = π2 (Proportion of females who are left handed is equal to the proportion of males who are left handed)
H1: π1 ≠ π2 (The two proportions are not the same; hand preference is not independent of gender)
The Chi-Square Test Statistic
The Chi-square test statistic is:
$$\chi^2_{STAT} = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e}$$
• where:
fo = observed frequency in a particular cell
fe = expected frequency in a particular cell if H0 is true
χ²STAT for the 2 x 2 case has 1 degree of freedom
(Assumed: each cell in the contingency table has expected frequency of at least 5)
Decision Rule
The χ²STAT test statistic approximately follows a chi-squared distribution with one degree of freedom
Decision Rule: If χ²STAT > χ²α, reject H0; otherwise, do not reject H0
(The rejection region is the upper tail, to the right of χ²α)
Computing the Average Proportion
The average proportion is:
$$\bar{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{X}{n}$$
Here: 120 Females, 12 were left handed; 180 Males, 24 were left handed
$$\bar{p} = \frac{12 + 24}{120 + 180} = \frac{36}{300} = 0.12$$
i.e., of all the children the proportion of left handers is 0.12, that is, 12%
Finding Expected Frequencies
• To obtain the expected frequency for left handed females, multiply the average proportion left handed (p) by the total number of females
• To obtain the expected frequency for left handed males, multiply the average proportion left handed (p) by the total number of males
If the two proportions are equal, then
P(Left Handed | Female) = P(Left Handed | Male) = .12
i.e., we would expect
(.12)(120) = 14.4 females to be left handed
(.12)(180) = 21.6 males to be left handed
Observed vs. Expected Frequencies

            Hand Preference
Gender      Left                Right                Total
Female      Observed = 12       Observed = 108       120
            Expected = 14.4     Expected = 105.6
Male        Observed = 24       Observed = 156       180
            Expected = 21.6     Expected = 158.4
Total       36                  264                  300
The Chi-Square Test Statistic
The test statistic is:
$$\chi^2_{STAT} = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e} = \frac{(12 - 14.4)^2}{14.4} + \frac{(108 - 105.6)^2}{105.6} + \frac{(24 - 21.6)^2}{21.6} + \frac{(156 - 158.4)^2}{158.4} = 0.7576$$
Decision Rule
The test statistic is χ²STAT = 0.7576; χ²0.05 with 1 d.f. = 3.841
Decision Rule: If χ²STAT > 3.841, reject H0; otherwise, do not reject H0
Here, χ²STAT = 0.7576 < χ²0.05 = 3.841, so we do not reject H0 and conclude that there is not sufficient evidence that the two proportions are different at α = 0.05
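The hand-preference calculation can be traced step by step in Python: pooled proportion, expected frequencies, statistic, decision. This is a sketch with illustrative variable names. The p-value line is an addition not shown on the slides; it uses the identity that a chi-square variable with 1 degree of freedom is a squared standard normal, so its upper-tail area is erfc(sqrt(x/2)).

```python
import math

left_handed = [12, 24]      # left-handed counts: females, males
group_sizes = [120, 180]    # sample sizes: females, males
critical_value = 3.841      # chi-square table, df = 1, alpha = .05

# Average (pooled) proportion of left-handers under H0
p_bar = sum(left_handed) / sum(group_sizes)

observed, expected = [], []
for x, n in zip(left_handed, group_sizes):
    observed += [x, n - x]                      # left, right
    expected += [p_bar * n, (1 - p_bar) * n]    # expected left, expected right

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = math.erfc(math.sqrt(chi_sq / 2))      # valid for 1 degree of freedom only
reject_h0 = chi_sq > critical_value
print(round(chi_sq, 4), reject_h0)  # 0.7576 False
```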
χ² Test for Differences Among More Than Two Proportions
• Extend the χ² test to the case with more than two independent populations:
H0: π1 = π2 = … = πc
H1: Not all of the πj are equal (j = 1, 2, …, c)
The Chi-Square Test Statistic
The Chi-square test statistic is:
$$\chi^2_{STAT} = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e}$$
• Where:
fo = observed frequency in a particular cell of the 2 x c table
fe = expected frequency in a particular cell if H0 is true
χ²STAT for the 2 x c case has (2 - 1)(c - 1) = c - 1 degrees of freedom
(Assumed: each cell in the contingency table has expected frequency of at least 1)
Computing the Overall Proportion
The overall proportion is:
$$\bar{p} = \frac{X_1 + X_2 + \cdots + X_c}{n_1 + n_2 + \cdots + n_c} = \frac{X}{n}$$
• Expected cell frequencies for the c categories are calculated as in the 2 x 2 case, and the decision rule is the same:
Decision Rule: If χ²STAT > χ²α, reject H0; otherwise, do not reject H0
where χ²α is from the chi-squared distribution with c – 1 degrees of freedom
The Marascuilo Procedure
• Used when the null hypothesis of equal proportions is rejected
• Enables you to make comparisons between all pairs
• Start with the observed differences, pj – pj’, for all pairs (for j ≠ j’)
• Then compare each absolute difference to a calculated critical range
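The slides do not state how the critical range is calculated. The sketch below assumes the usual textbook form of the Marascuilo critical range, sqrt(χ²α) · sqrt(pj(1 − pj)/nj + pj’(1 − pj’)/nj’), with χ²α taken at c − 1 degrees of freedom. The three-group counts are hypothetical, chosen so that the overall test rejects equal proportions (χ² ≈ 18.18 against 5.991).

```python
import math
from itertools import combinations

# Hypothetical data (not from the slides): successes and sample sizes for c = 3 groups
successes = [60, 45, 30]
sizes = [100, 100, 100]
chi2_critical = 5.991    # chi-square table, df = c - 1 = 2, alpha = .05

proportions = [x / n for x, n in zip(successes, sizes)]

results = []
for j, k in combinations(range(len(sizes)), 2):
    difference = abs(proportions[j] - proportions[k])
    # Assumed critical range formula for the pair (j, k)
    critical_range = math.sqrt(chi2_critical) * math.sqrt(
        proportions[j] * (1 - proportions[j]) / sizes[j]
        + proportions[k] * (1 - proportions[k]) / sizes[k]
    )
    # A pair differs significantly when the absolute difference exceeds its critical range
    results.append((j + 1, k + 1, difference > critical_range))

for group_a, group_b, significant in results:
    print(group_a, group_b, significant)
```

With these made-up counts only groups 1 and 3 are declared different, which shows that rejecting the overall null hypothesis does not imply that every pair differs.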
χ² Test of Independence
χ² Test of Independence
• Shows if a relationship exists between two qualitative variables
– One sample is drawn
– Does not show causality
• Uses two-way contingency table
χ² Test of Independence Contingency Table
Shows number of observations from 1 sample jointly in 2 qualitative variables

               House Location
House Style    Urban    Rural    Total
Split-Level       63       49      112
Ranch             15       33       48
Total             78       82      160

Rows: levels of variable 1 (House Style). Columns: levels of variable 2 (House Location).
Conditions Required for a Valid χ² Test: Independence
1. Multinomial experiment has been conducted
2. The sample size, n, is large: Eij is greater than or equal to 5 for every cell
χ² Test of Independence Hypotheses & Statistic
1. Hypotheses
• H0: Variables are independent
• Ha: Variables are related (dependent)
2. Test Statistic
$$\chi^2 = \sum_{\text{all cells}} \frac{[n_{ij} - E_{ij}]^2}{E_{ij}}$$
where nij is the observed count and Eij is the expected count
3. Degrees of Freedom: (r – 1)(c – 1), where r is the number of rows and c is the number of columns
χ² Test of Independence Expected Counts
1. Statistical independence means joint probability equals product of marginal probabilities
2. Compute marginal probabilities and multiply for joint probability
3. Expected count is sample size times joint probability
Expected Count Example

               Location
House Style    Urban    Rural    Total
               Obs.     Obs.
Split–Level       63       49      112
Ranch             15       33       48
Total             78       82      160

Marginal probability (Split–Level) = 112/160
Marginal probability (Urban) = 78/160
Joint probability = (112/160)(78/160)
Expected count = 160 · (112/160)(78/160) = 54.6
Expected Count Calculation
$$E_{ij} = \frac{R_i C_j}{n}$$
where Ri is the total of row i and Cj is the total of column j

               House Location
               Urban                        Rural
House Style    Obs.   Exp.                  Obs.   Exp.                  Total
Split-Level      63   112·78/160 = 54.6       49   112·82/160 = 57.4       112
Ranch            15   48·78/160 = 23.4        33   48·82/160 = 24.6         48
Total            78   78                      82   82                      160
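The two routes to an expected count (sample size times the product of marginal probabilities, and the shortcut Ri·Cj/n) give the same numbers. A small sketch with illustrative variable names, using the house counts above:

```python
# Rows: Split-Level, Ranch; columns: Urban, Rural
observed = [[63, 49], [15, 33]]

row_totals = [sum(row) for row in observed]          # R_i
col_totals = [sum(col) for col in zip(*observed)]    # C_j
n = sum(row_totals)

# Route 1: sample size times the product of the marginal probabilities
via_probabilities = [[n * (r / n) * (c / n) for c in col_totals] for r in row_totals]

# Route 2: the shortcut E_ij = R_i * C_j / n
via_shortcut = [[r * c / n for c in col_totals] for r in row_totals]

for row in via_shortcut:
    print([round(e, 1) for e in row])
```

Each row and column of expected counts sums to the same total as the observed counts, which is a quick check on the arithmetic.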
χ² Test of Independence Example
As a realtor you want to determine if house style and house location are related. At the .05 level of significance, is there evidence of a relationship?

               House Location
House Style    Urban    Rural    Total
Split-Level       63       49      112
Ranch             15       33       48
Total             78       82      160
χ² Test of Independence Solution
• H0: No Relationship
• Ha: Relationship
• α = .05
• df = (2 - 1)(2 - 1) = 1
• Critical Value(s): 3.841 (reject H0 if χ² > 3.841)
Eij ≥ 5 in all cells
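χ² Test of Independence Solution

               House Location
               Urban            Rural
House Style    Obs.   Exp.      Obs.   Exp.      Total
Split-Level      63   54.6        49   57.4        112
Ranch            15   23.4        33   24.6         48
Total            78   78          82   82          160

Expected counts: 112·78/160 = 54.6, 112·82/160 = 57.4, 48·78/160 = 23.4, 48·82/160 = 24.6

$$\chi^2 = \sum_{\text{all cells}} \frac{[n_{ij} - E_{ij}]^2}{E_{ij}} = \frac{[n_{11} - E_{11}]^2}{E_{11}} + \frac{[n_{12} - E_{12}]^2}{E_{12}} + \cdots + \frac{[n_{22} - E_{22}]^2}{E_{22}}$$
$$= \frac{[63 - 54.6]^2}{54.6} + \frac{[49 - 57.4]^2}{57.4} + \cdots + \frac{[33 - 24.6]^2}{24.6} = 8.41$$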
χ² Test of Independence Solution
Test Statistic: χ² = 8.41
Decision: Reject at α = .05
Conclusion: There is evidence of a relationship
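The house-style test can be reproduced with a short, general function for an r x c table. This is a sketch: the function name `chi_square_independence` is illustrative, and the p-value line is an addition that holds only for 1 degree of freedom (upper-tail area erfc(sqrt(x/2))).

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n    # E_ij = R_i * C_j / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Rows: Split-Level, Ranch; columns: Urban, Rural
house_counts = [[63, 49], [15, 33]]
statistic, df = chi_square_independence(house_counts)
p_value = math.erfc(math.sqrt(statistic / 2))    # valid only because df == 1
reject_h0 = statistic > 3.841                    # tabled critical value, alpha = .05
print(round(statistic, 2), df, reject_h0)  # 8.41 1 True
```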
χ² Test of Independence Thinking Challenge
You’re a marketing research analyst. You ask a random sample of 286 consumers if they purchase Diet Pepsi or Diet Coke. At the .05 level of significance, is there evidence of a relationship?

             Diet Pepsi
Diet Coke    No     Yes    Total
No           84      32      116
Yes          48     122      170
Total       132     154      286
χ² Test of Independence Solution*
• H0: No Relationship
• Ha: Relationship
• α = .05
• df = (2 - 1)(2 - 1) = 1
• Critical Value(s): 3.841 (reject H0 if χ² > 3.841)

             Diet Pepsi
             No               Yes
Diet Coke    Obs.   Exp.      Obs.   Exp.      Total
No             84   53.5        32   62.5        116
Yes            48   78.5       122   91.5        170
Total         132   132        154   154         286

Expected counts: 116·132/286 = 53.5, 116·154/286 = 62.5, 170·132/286 = 78.5, 170·154/286 = 91.5
Eij ≥ 5 in all cells
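χ² Test of Independence Solution*
$$\chi^2 = \sum_{\text{all cells}} \frac{[n_{ij} - E_{ij}]^2}{E_{ij}} = \frac{[n_{11} - E_{11}]^2}{E_{11}} + \frac{[n_{12} - E_{12}]^2}{E_{12}} + \cdots + \frac{[n_{22} - E_{22}]^2}{E_{22}}$$
$$= \frac{[84 - 53.5]^2}{53.5} + \frac{[32 - 62.5]^2}{62.5} + \cdots + \frac{[122 - 91.5]^2}{91.5} = 54.29$$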
χ² Test of Independence Solution*
Test Statistic: χ² = 54.29
Decision: Reject at α = .05
Conclusion: There is evidence of a relationship
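The value 54.29 comes from expected counts rounded to one decimal place. The sketch below (variable and function names are illustrative) computes the statistic both ways; with unrounded expected counts it is about 54.15. The conclusion is unchanged, since both values are far above 3.841.

```python
# Rows: Diet Coke No, Yes; columns: Diet Pepsi No, Yes
observed = [[84, 32], [48, 122]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)


def pearson_statistic(obs, exp):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(obs, exp)
        for o, e in zip(obs_row, exp_row)
    )


# Unrounded expected counts, E_ij = R_i * C_j / n
exact_expected = [[r * c / n for c in col_totals] for r in row_totals]
# Expected counts as printed in the solution table (rounded to one decimal)
rounded_expected = [[53.5, 62.5], [78.5, 91.5]]

exact_stat = pearson_statistic(observed, exact_expected)
rounded_stat = pearson_statistic(observed, rounded_expected)
print(round(exact_stat, 2), round(rounded_stat, 2))  # 54.15 54.29
```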
χ² Test of Independence Thinking Challenge 2
There is a statistically significant relationship between purchasing Diet Coke and Diet Pepsi. So what do you think the relationship is? Aren’t they competitors?

             Diet Pepsi
Diet Coke    No     Yes    Total
No           84      32      116
Yes          48     122      170
Total       132     154      286
You Re-Analyze the Data
Separate tables for the Low Income and High Income groups:

             Diet Pepsi
Diet Coke    No     Yes    Total
No            4      30       34
Yes          40       2       42
Total        44      32       76

             Diet Pepsi
Diet Coke    No     Yes    Total
No           80       2       82
Yes           8     120      128
Total        88     122      210
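A short sketch shows what the re-analysis reveals. It checks that the two income-level tables add up to the combined table, computes χ² for each, and reports the direction of association as the sign of ad − bc. The names `first_table` and `second_table` simply follow the order in which the tables appear, and both helper functions are illustrative.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2 x 2 table given as [[a, b], [c, d]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (table[i][j] - expected) ** 2 / expected
    return total


def association_direction(table):
    """Sign of ad - bc: +1 if the two 'Yes' answers tend to go together, -1 if not."""
    (a, b), (c, d) = table
    cross = a * d - b * c
    return (cross > 0) - (cross < 0)


# Rows: Diet Coke No, Yes; columns: Diet Pepsi No, Yes
first_table = [[4, 30], [40, 2]]
second_table = [[80, 2], [8, 120]]
combined = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(first_table, second_table)]

for name, table in [("first", first_table), ("second", second_table), ("combined", combined)]:
    print(name, round(chi_square_2x2(table), 2), association_direction(table))
```

All three statistics exceed 3.841, but the direction of association in the first table is opposite to that in the second table and in the combined data, which is why the combined table alone is misleading.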
True Relationships*
• Apparent relation: between Diet Coke and Diet Pepsi
• Underlying causal relation: from the control or intervening variable (true cause) to both Diet Coke and Diet Pepsi
Moral of the Story*
Numbers don’t think - People do!
Conclusion
1. Explained χ² Test for Proportions
2. Explained χ² Test of Independence
3. Solved Hypothesis Testing Problems
• More Than Two Population Proportions
• Independence