Statistics for Business and Economics Chapter 9 Categorical Data Analysis.


Learning Objectives

1. Explain the χ² Test for Proportions

2. Explain the χ² Test of Independence

3. Solve Hypothesis Testing Problems
• More Than Two Population Proportions
• Independence

Data Types

Data
• Quantitative: Continuous or Discrete
• Qualitative

Qualitative Data

• Qualitative random variables yield responses that classify

– Example: gender (male, female)

• Measurement reflects number in category

• Nominal or ordinal scale

• Examples
– What make of car do you drive?
– Do you live on-campus or off-campus?

Hypothesis Tests Qualitative Data

Qualitative Data
• Proportion
– 1 pop.: Z Test
– 2 pop.: Z Test
– More than 2 pop.: χ² Test
• Independence: χ² Test

Chi-Square (χ²) Test for k Proportions



Multinomial Experiment

• n identical trials

• k outcomes to each trial

• Constant outcome probability, $p_k$

• Independent trials

• Random variable is count, $n_k$

• Example: ask 100 people (n) which of 3 candidates (k) they will vote for

Chi-Square (χ²) Test for k Proportions

• Tests equality (=) of proportions only
– Example: p1 = .2, p2 = .3, p3 = .5

• One variable with several levels

• Uses one-way contingency table

One-Way Contingency Table

Shows number of observations in k independent groups (outcomes or variable levels)

Outcomes (k = 3)

Candidate              Tom    Bill    Mary    Total
Number of responses    35     20      45      100

Conditions Required for a Valid Test: One-way Table

1. A multinomial experiment has been conducted

2. The sample size n is large: $E(n_i)$ is greater than or equal to 5 for every cell

χ² Test for k Proportions Hypotheses & Statistic

1. Hypotheses

H0: p1 = p1,0, p2 = p2,0, ..., pk = pk,0

Ha: At least one pi is different from above

2. Test Statistic

$$\chi^2 = \sum_{\text{all cells}} \frac{[n_i - E(n_i)]^2}{E(n_i)}$$

where $n_i$ is the observed count and $E(n_i) = n p_{i,0}$ is the expected count ($p_{i,0}$ is the hypothesized probability)

3. Degrees of Freedom: k – 1, where k is the number of outcomes

χ² Test Basic Idea

1. Compares observed count to expected count assuming the null hypothesis is true

2. The closer the observed counts are to the expected counts, the more consistent the data are with H0

• Measured by the squared difference relative to the expected count; reject H0 for large values
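The basic idea can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original slides: the function name `chi_square_statistic` is hypothetical, and pairing the candidate counts (35, 20, 45) with the hypothesized proportions .2, .3, .5 from the earlier example is an assumption made only for demonstration.

```python
def chi_square_statistic(observed, hypothesized_probs):
    """Sum over cells of (observed - expected)^2 / expected."""
    n = sum(observed)
    expected = [n * p for p in hypothesized_probs]  # E(n_i) = n * p_i,0
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Observed counts equal to the expected counts give a statistic of 0.
perfect_fit = chi_square_statistic([20, 30, 50], [0.2, 0.3, 0.5])

# Illustrative pairing (an assumption): candidate counts vs. hypothesized proportions.
candidate_stat = chi_square_statistic([35, 20, 45], [0.2, 0.3, 0.5])

print(perfect_fit, round(candidate_stat, 3))
```

Each cell contributes its squared deviation scaled by the expected count, so a cell with a small expected count and a large deviation dominates the total.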

Finding Critical Value Example

What is the critical χ² value if k = 3 and α = .05?

df = k - 1 = 2

χ² Table (Portion)

        Upper Tail Area
DF    .995     …    .95      …    .05
1     ...      …    0.004    …    3.841
2     0.010    …    0.103    …    5.991

The critical value is 5.991: reject H0 if χ² > 5.991 (upper-tail area α = .05); otherwise, do not reject H0.

If $n_i = E(n_i)$, χ² = 0.
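The table lookup can be cross-checked numerically. The sketch below uses only the Python standard library and relies on the closed-form upper-tail areas that exist for 1 and 2 degrees of freedom; the helper names `chi2_upper_tail` and `critical_value` are hypothetical, and other degrees of freedom would need a statistics library.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail area P(chi-square > x); closed forms exist for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("this sketch only covers df = 1 or df = 2")


def critical_value(alpha, df):
    """Find x whose upper-tail area is alpha by bisection (the tail area decreases in x)."""
    low, high = 0.0, 100.0
    for _ in range(200):
        mid = (low + high) / 2
        if chi2_upper_tail(mid, df) > alpha:
            low = mid   # tail area still too large: move right
        else:
            high = mid  # tail area small enough: move left
    return (low + high) / 2


k = 3
crit = critical_value(0.05, k - 1)
print(round(crit, 3))
```

For df = 2 the answer can also be written directly as -2 ln(α), which agrees with the tabled 5.991.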

χ² Test for k Proportions Example

As personnel director, you want to test the perception of fairness of three methods of performance evaluation. Of 180 employees, 63 rated Method 1 as fair, 45 rated Method 2 as fair, and 72 rated Method 3 as fair. At the .05 level of significance, is there a difference in perceptions?

• H0: p1 = p2 = p3 = 1/3
• Ha: At least 1 is different
• α = .05
• n1 = 63, n2 = 45, n3 = 72
• Critical Value(s): 5.991 (df = 2, α = .05); reject H0 if χ² > 5.991

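χ² Test for k Proportions Solution

$$E(n_i) = n p_{i,0}$$

$$E(n_1) = E(n_2) = E(n_3) = 180\left(\tfrac{1}{3}\right) = 60$$

$$\chi^2 = \sum_{\text{all cells}} \frac{[n_i - E(n_i)]^2}{E(n_i)} = \frac{[63 - 60]^2}{60} + \frac{[45 - 60]^2}{60} + \frac{[72 - 60]^2}{60} = 6.3$$

Test Statistic: χ² = 6.3

Decision: Reject at α = .05

Conclusion: There is evidence of a difference in proportions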
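The arithmetic of this solution can be reproduced with a short script. This is a sketch using only the standard library; the p-value line is an addition not shown on the slide and uses the exact upper-tail formula that holds when df = 2.

```python
import math

observed = [63, 45, 72]          # employees rating each method as fair
n = sum(observed)                # 180
expected = [n / 3] * 3           # under H0: p1 = p2 = p3 = 1/3, so 60 per cell

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1           # k - 1 = 2
p_value = math.exp(-chi_sq / 2)  # exact upper-tail area when df = 2
reject_h0 = chi_sq > 5.991       # critical value from the table at alpha = .05

print(round(chi_sq, 2), round(p_value, 4), reject_h0)
```

The statistic exceeds the critical value only narrowly, which the p-value just under .05 reflects.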



Contingency Tables


• Useful in situations involving multiple population proportions

• Used to classify sample observations according to two or more characteristics

• Also called a cross-classification table.

Contingency Table Example

Left-Handed vs. Gender

Dominant Hand: Left vs. Right

Gender: Male vs. Female

2 categories for each variable, so called a 2 x 2 table

Suppose we examine a sample of 300 children

Contingency Table Example (continued)

Sample results organized in a contingency table:

            Hand Preference
Gender    Left    Right    Total
Female    12      108      120
Male      24      156      180
Total     36      264      300

Sample size = n = 300: of 120 females, 12 were left-handed; of 180 males, 24 were left-handed.

χ² Test for the Difference Between Two Proportions

• If H0 is true, then the proportion of left-handed females should be the same as the proportion of left-handed males

• The two proportions above should be the same as the proportion of left-handed people overall

H0: π1 = π2 (the proportion of females who are left-handed is equal to the proportion of males who are left-handed)

H1: π1 ≠ π2 (the two proportions are not the same; hand preference is not independent of gender)

The Chi-Square Test Statistic

The Chi-square test statistic is:

$$\chi^2_{STAT} = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e}$$

• where:

$f_o$ = observed frequency in a particular cell

$f_e$ = expected frequency in a particular cell if H0 is true

$\chi^2_{STAT}$ for the 2 x 2 case has 1 degree of freedom

(Assumed: each cell in the contingency table has expected frequency of at least 5)

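Decision Rule

The test statistic $\chi^2_{STAT}$ approximately follows a chi-squared distribution with one degree of freedom

Decision Rule: If $\chi^2_{STAT} > \chi^2_{\alpha}$, reject H0; otherwise, do not reject H0

[Figure: chi-squared distribution starting at 0; do not reject H0 to the left of $\chi^2_{\alpha}$, reject H0 in the upper tail of area α]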

Computing the Average Proportion

The average proportion is:

$$\bar{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{X}{n}$$

Here: of 120 females, 12 were left-handed; of 180 males, 24 were left-handed

$$\bar{p} = \frac{12 + 24}{120 + 180} = \frac{36}{300} = 0.12$$

i.e., of all the children the proportion of left-handers is 0.12, that is, 12%

Finding Expected Frequencies

• To obtain the expected frequency for left handed females, multiply the average proportion left handed (p) by the total number of females

• To obtain the expected frequency for left handed males, multiply the average proportion left handed (p) by the total number of males

If the two proportions are equal, then

P(Left Handed | Female) = P(Left Handed | Male) = .12

i.e., we would expect
(.12)(120) = 14.4 females to be left-handed
(.12)(180) = 21.6 males to be left-handed
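The pooled proportion and the resulting expected frequencies can be computed as follows. This is a minimal sketch; the dictionary layout and variable names are illustrative choices, not part of the original slides.

```python
left_handed = {"Female": 12, "Male": 24}
group_size = {"Female": 120, "Male": 180}

# Pooled ("average") proportion of left-handers across both groups.
p_bar = sum(left_handed.values()) / sum(group_size.values())

# Expected left- and right-handed counts per group if H0 is true.
expected = {
    g: {"Left": p_bar * size, "Right": (1 - p_bar) * size}
    for g, size in group_size.items()
}

for g, cells in expected.items():
    print(g, round(cells["Left"], 1), round(cells["Right"], 1))
```

The right-handed expected counts follow from the complement 1 - 0.12 = 0.88 applied to the same group sizes.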

Observed vs. Expected Frequencies

            Hand Preference
Gender    Left                Right               Total
Female    Observed = 12       Observed = 108      120
          Expected = 14.4     Expected = 105.6
Male      Observed = 24       Observed = 156      180
          Expected = 21.6     Expected = 158.4
Total     36                  264                 300

The test statistic is:

$$\chi^2_{STAT} = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e} = \frac{(12 - 14.4)^2}{14.4} + \frac{(108 - 105.6)^2}{105.6} + \frac{(24 - 21.6)^2}{21.6} + \frac{(156 - 158.4)^2}{158.4} = 0.7576$$

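Decision Rule

The test statistic is $\chi^2_{STAT} = 0.7576$; $\chi^2_{0.05}$ with 1 d.f. = 3.841

Decision Rule: If $\chi^2_{STAT} > 3.841$, reject H0; otherwise, do not reject H0

Here, $\chi^2_{STAT} = 0.7576 < \chi^2_{0.05} = 3.841$, so we do not reject H0 and conclude that there is not sufficient evidence that the two proportions are different at α = 0.05

[Figure: chi-squared distribution starting at 0; $\chi^2_{STAT} = 0.7576$ falls in the do-not-reject region to the left of $\chi^2_{0.05} = 3.841$; the rejection region is the upper tail of area 0.05]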
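The decision above can be reproduced end to end with the standard library. The sketch below also computes a p-value (valid here because df = 1) and the pooled two-proportion Z statistic, whose square matches the chi-square statistic for a 2 x 2 table; neither of those two quantities appears on the slide, so treat them as supplementary checks.

```python
import math

observed = [[12, 108], [24, 156]]  # rows: Female, Male; columns: Left, Right
row_totals = [sum(row) for row in observed]
n = sum(row_totals)
p_bar = (observed[0][0] + observed[1][0]) / n  # pooled proportion left-handed

chi_sq = 0.0
for row, size in zip(observed, row_totals):
    for count, prob in zip(row, (p_bar, 1 - p_bar)):
        expected = prob * size
        chi_sq += (count - expected) ** 2 / expected

p_value = math.erfc(math.sqrt(chi_sq / 2))  # exact upper-tail area when df = 1

# Pooled two-proportion Z statistic; its square equals the chi-square statistic.
p1 = observed[0][0] / row_totals[0]
p2 = observed[1][0] / row_totals[1]
z = (p1 - p2) / math.sqrt(p_bar * (1 - p_bar) * (1 / row_totals[0] + 1 / row_totals[1]))

print(round(chi_sq, 4), round(p_value, 3), round(z ** 2, 4))
```

Because the statistic is well below 3.841, the p-value is far above .05, in line with the decision not to reject H0.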

χ² Test for Differences Among More Than Two Proportions

• Extend the χ² test to the case with more than two independent populations:

H0: π1 = π2 = … = πc

H1: Not all of the πj are equal (j = 1, 2, …, c)


The Chi-square test statistic is:

$$\chi^2_{STAT} = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e}$$

• Where:

$f_o$ = observed frequency in a particular cell of the 2 x c table

$f_e$ = expected frequency in a particular cell if H0 is true

$\chi^2_{STAT}$ for the 2 x c case has (2 - 1)(c - 1) = c - 1 degrees of freedom

(Assumed: each cell in the contingency table has expected frequency of at least 1)

Computing the Overall Proportion

The overall proportion is:

$$\bar{p} = \frac{X_1 + X_2 + \cdots + X_c}{n_1 + n_2 + \cdots + n_c} = \frac{X}{n}$$

• Expected cell frequencies for the c categories are calculated as in the 2 x 2 case, and the decision rule is the same:

Decision Rule: If $\chi^2_{STAT} > \chi^2_{\alpha}$, reject H0; otherwise, do not reject H0

where $\chi^2_{\alpha}$ is from the chi-squared distribution with c – 1 degrees of freedom

The Marascuilo Procedure

• Used when the null hypothesis of equal proportions is rejected

• Enables you to make comparisons between all pairs

• Start with the observed differences, pj – pj’, for all pairs (for j ≠ j’) . . .

• . . .then compare the absolute difference to a calculated critical range
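The slides do not spell out the critical range. In its usual textbook form it is the square root of the χ² critical value times the square root of p_j(1 - p_j)/n_j + p_j'(1 - p_j')/n_j'. The sketch below applies that form to hypothetical counts (the three groups and their numbers are invented purely for illustration and do not come from the slides); the function name `marascuilo` is likewise an assumption.

```python
import math
from itertools import combinations

# Hypothetical data (not from the slides): successes and sample sizes for c = 3 groups.
successes = [30, 45, 60]
sizes = [100, 100, 100]
chi2_crit = 5.991  # upper-tail critical value for c - 1 = 2 df at alpha = .05

props = [x / n for x, n in zip(successes, sizes)]


def marascuilo(props, sizes, chi2_crit):
    """Return (j, j', |p_j - p_j'|, critical range, significant?) for every pair."""
    results = []
    for j, k in combinations(range(len(props)), 2):
        diff = abs(props[j] - props[k])
        spread = props[j] * (1 - props[j]) / sizes[j] + props[k] * (1 - props[k]) / sizes[k]
        crit_range = math.sqrt(chi2_crit) * math.sqrt(spread)
        results.append((j, k, diff, crit_range, diff > crit_range))
    return results


pairs = marascuilo(props, sizes, chi2_crit)
for j, k, diff, crit_range, significant in pairs:
    print(j + 1, k + 1, round(diff, 3), round(crit_range, 3), significant)
```

A pair is declared different only when its absolute difference exceeds its own critical range, so each pair gets a separate threshold.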

χ² Test of Independence



χ² Test of Independence

• Shows if a relationship exists between two qualitative variables
– One sample is drawn
– Does not show causality

• Uses two-way contingency table

χ² Test of Independence Contingency Table

Shows number of observations from 1 sample jointly in 2 qualitative variables

                 House Location
House Style    Urban    Rural    Total
Split-Level    63       49       112
Ranch          15       33       48
Total          78       82       160

Rows: levels of variable 1 (house style). Columns: levels of variable 2 (house location).

Conditions Required for a Valid χ² Test: Independence

1. Multinomial experiment has been conducted

2. The sample size, n, is large: $E_{ij}$ is greater than or equal to 5 for every cell

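χ² Test of Independence Hypotheses & Statistic

1. Hypotheses
• H0: Variables are independent
• Ha: Variables are related (dependent)

2. Test Statistic

$$\chi^2 = \sum_{\text{all cells}} \frac{[n_{ij} - E_{ij}]^2}{E_{ij}}$$

where $n_{ij}$ is the observed count and $E_{ij}$ is the expected count

3. Degrees of Freedom: (r – 1)(c – 1), where r is the number of rows and c is the number of columns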

χ² Test of Independence Expected Counts

1. Statistical independence means joint probability equals product of marginal probabilities

2. Compute marginal probabilities and multiply for joint probability

3. Expected count is sample size times joint probability

Expected Count Example

                 Location
House Style    Urban    Rural    Total
Split-Level    63       49       112
Ranch          15       33       48
Total          78       82       160

Marginal probability (Split-Level) = 112/160

Marginal probability (Urban) = 78/160

Joint probability = (112/160)(78/160)

Expected count = 160 · (112/160)(78/160) = 54.6

Expected Count Calculation

$$E_{ij} = \frac{R_i C_j}{n}$$

where $R_i$ is the total for row i and $C_j$ is the total for column j

                 House Location
                 Urban                        Rural
House Style    Obs.    Exp.                 Obs.    Exp.                 Total
Split-Level    63      112·78/160 = 54.6    49      112·82/160 = 57.4    112
Ranch          15      48·78/160 = 23.4     33      48·82/160 = 24.6     48
Total          78      78                   82      82                   160
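The row-total times column-total rule generalizes to any r x c table. A minimal sketch follows; the function name `expected_counts` and the list-of-rows layout are illustrative assumptions.

```python
def expected_counts(table):
    """Expected counts under independence: E_ij = (row i total)(column j total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: Split-Level, Ranch; columns: Urban, Rural.
house = [[63, 49], [15, 33]]
house_expected = expected_counts(house)
for row in house_expected:
    print([round(e, 1) for e in row])
```

The expected counts preserve the observed row and column totals, which is why the Total row and column of the table are unchanged.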

χ² Test of Independence Example

As a realtor you want to determine if house style and house location are related. At the .05 level of significance, is there evidence of a relationship?

                 House Location
House Style    Urban    Rural    Total
Split-Level    63       49       112
Ranch          15       33       48
Total          78       82       160

χ² Test of Independence Solution

• H0: No Relationship
• Ha: Relationship
• α = .05
• df = (2 - 1)(2 - 1) = 1
• Critical Value(s): 3.841; reject H0 if χ² > 3.841

$E_{ij} \ge 5$ in all cells

                 House Location
                 Urban                        Rural
House Style    Obs.    Exp.                 Obs.    Exp.                 Total
Split-Level    63      112·78/160 = 54.6    49      112·82/160 = 57.4    112
Ranch          15      48·78/160 = 23.4     33      48·82/160 = 24.6     48
Total          78      78                   82      82                   160

$$\chi^2 = \sum_{\text{all cells}} \frac{[n_{ij} - E_{ij}]^2}{E_{ij}} = \frac{[n_{11} - E_{11}]^2}{E_{11}} + \frac{[n_{12} - E_{12}]^2}{E_{12}} + \cdots + \frac{[n_{22} - E_{22}]^2}{E_{22}}$$

$$= \frac{[63 - 54.6]^2}{54.6} + \frac{[49 - 57.4]^2}{57.4} + \cdots + \frac{[33 - 24.6]^2}{24.6} = 8.41$$

Test Statistic: χ² = 8.41

Decision: Reject at α = .05

Conclusion: There is evidence of a relationship
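The full test can be sketched as one function that returns the statistic and degrees of freedom for any table of counts. The name `independence_test` is hypothetical, and the p-value line is an addition that is exact only because df = 1 in this example.

```python
import math


def independence_test(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


house = [[63, 49], [15, 33]]  # rows: Split-Level, Ranch; columns: Urban, Rural
stat, df = independence_test(house)
p_value = math.erfc(math.sqrt(stat / 2))  # valid only because df = 1 here
print(round(stat, 2), df, stat > 3.841, round(p_value, 4))
```

All four cells enter the sum, including the Ranch/Urban term that the slide abbreviates with an ellipsis.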



χ² Test of Independence Thinking Challenge

You’re a marketing research analyst. You ask a random sample of 286 consumers if they purchase Diet Pepsi or Diet Coke. At the .05 level of significance, is there evidence of a relationship?

               Diet Pepsi
Diet Coke    No     Yes    Total
No           84     32     116
Yes          48     122    170
Total        132    154    286

χ² Test of Independence Solution*

• H0: No Relationship
• Ha: Relationship
• α = .05
• df = (2 - 1)(2 - 1) = 1
• Critical Value(s): 3.841; reject H0 if χ² > 3.841

$E_{ij} \ge 5$ in all cells

               Diet Pepsi
               No                            Yes
Diet Coke    Obs.    Exp.                  Obs.    Exp.                  Total
No           84      116·132/286 = 53.5    32      116·154/286 = 62.5    116
Yes          48      170·132/286 = 78.5    122     170·154/286 = 91.5    170
Total        132     132                   154     154                   286

$$\chi^2 = \sum_{\text{all cells}} \frac{[n_{ij} - E_{ij}]^2}{E_{ij}} = \frac{[n_{11} - E_{11}]^2}{E_{11}} + \frac{[n_{12} - E_{12}]^2}{E_{12}} + \cdots + \frac{[n_{22} - E_{22}]^2}{E_{22}}$$

$$= \frac{[84 - 53.5]^2}{53.5} + \frac{[32 - 62.5]^2}{62.5} + \cdots + \frac{[122 - 91.5]^2}{91.5} = 54.29$$

Test Statistic: χ² = 54.29

Decision: Reject at α = .05

Conclusion: There is evidence of a relationship
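A quick check of this result is sketched below. It computes the statistic twice: once with unrounded expected counts and once with expected counts rounded to one decimal, as they are displayed in the table. The slide's 54.29 corresponds to the rounded version; the unrounded value is slightly smaller, and the conclusion is the same either way. The helper name `chi_square` is an illustrative choice.

```python
observed = [[84, 32], [48, 122]]  # rows: Diet Coke No/Yes; columns: Diet Pepsi No/Yes
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


exact = [[r * c / n for c in col_totals] for r in row_totals]
rounded = [[round(e, 1) for e in row] for row in exact]  # as displayed in the table

stat_exact = chi_square(observed, exact)
stat_rounded = chi_square(observed, rounded)
print(round(stat_exact, 2), round(stat_rounded, 2))
```

Carrying full precision in the expected counts is the safer habit, since rounding them first can shift the statistic in the second decimal place.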



χ² Test of Independence Thinking Challenge 2

There is a statistically significant relationship between purchasing Diet Coke and Diet Pepsi. So what do you think the relationship is? Aren’t they competitors?

               Diet Pepsi
Diet Coke    No     Yes    Total
No           84     32     116
Yes          48     122    170
Total        132    154    286

You Re-Analyze the Data

Low Income

               Diet Pepsi
Diet Coke    No    Yes    Total
No           4     30     34
Yes          40    2      42
Total        44    32     76

High Income

               Diet Pepsi
Diet Coke    No    Yes    Total
No           80    2      82
Yes          8     120    128
Total        88    122    210
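One way to see what the re-analysis reveals is to compare, within each income table and in the combined table, the share of Diet Pepsi buyers among Diet Coke buyers with the share among non-buyers. The sketch below does that; the neutral names `table_a` and `table_b` and the helper `pepsi_rate_gap` are assumptions made for illustration.

```python
# The two income-level tables (rows: Diet Coke No/Yes; columns: Diet Pepsi No/Yes).
table_a = [[4, 30], [40, 2]]    # the table with n = 76
table_b = [[80, 2], [8, 120]]   # the table with n = 210

# Adding the two tables cell by cell recovers the combined 286-consumer table.
pooled = [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(table_a, table_b)]


def pepsi_rate_gap(table):
    """P(buys Pepsi | buys Coke) minus P(buys Pepsi | does not buy Coke)."""
    no_coke, yes_coke = table
    return yes_coke[1] / sum(yes_coke) - no_coke[1] / sum(no_coke)


for name, table in [("n = 76", table_a), ("n = 210", table_b), ("pooled", pooled)]:
    print(name, round(pepsi_rate_gap(table), 3))
```

The gap is negative in one income group and positive in the other, so the positive association in the combined table does not describe either group on its own.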

True Relationships*

Apparent relation: Diet Coke ↔ Diet Pepsi

Underlying causal relation: control or intervening variable (true cause) → Diet Coke and Diet Pepsi

Moral of the Story*


Numbers don’t think - People do!

Conclusion

1. Explained the χ² Test for Proportions

2. Explained the χ² Test of Independence

3. Solved Hypothesis Testing Problems
• More Than Two Population Proportions
• Independence
