BIOL 582


Lecture Set 16: Analysis of frequency and categorical data

Part I: Goodness of Fit Tests

Some preliminary comments

• Until this point, we have concerned ourselves mainly with continuous quantitative response data, somewhat with discrete data that behave as if continuous, and rarely with categorical response data.

• The emphasis for this lecture topic is how to analyze response data that tend to fall in categories.

• Analyses such as these are best motivated by examples.

• The examples, nomenclature, and coverage of topics largely follow Chapter 17 of Sokal and Rohlf (2011) Biometry, 4th Edition (but exclude some additional detail that the book goes into).

A Simple Example

• An example of simple Mendelian genetics

• A phenotypic trait that exhibits genetic dominance:

• White and brown fuzzy bunnies

• Let the gene for coat color be denoted by alleles B (brown) or b (not brown)

• There are three genotypes possible: BB (brown), Bb (brown), and bb (white)

• If a monohybrid cross is performed – i.e., two heterozygous brown bunnies (Bb) are mated – the possible offspring produced would have one of the following genotypes

               B from Mom   b from Mom
B from Dad     BB           Bb
b from Dad     bB           bb

• Realizing that bB and Bb are the same thing, the expected genotype frequency of offspring is 1:2:1 for BB:Bb:bb. The expected phenotype frequency is 3:1, brown:white, because of genetic dominance.
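The Punnett square and the 1:2:1 and 3:1 ratios can be reproduced with a few lines of R. This is only a sketch of the bookkeeping; the object names (mom, dad, n_B, and so on) are made up for illustration.

```r
# Alleles each heterozygous (Bb) parent can contribute
mom <- c("B", "b")
dad <- c("B", "b")

# Punnett square: every pairing of one allele from Dad with one from Mom
punnett <- outer(dad, mom, paste0)
dimnames(punnett) <- list(paste(dad, "from Dad"), paste(mom, "from Mom"))
print(punnett)

# Count dominant (B) alleles per offspring so that bB and Bb are treated alike
crosses <- expand.grid(dad = dad, mom = mom, stringsAsFactors = FALSE)
n_B <- (crosses$dad == "B") + (crosses$mom == "B")

# Genotype counts (BB : Bb : bb) across the four equally likely outcomes
genotype <- factor(c("bb", "Bb", "BB")[n_B + 1], levels = c("BB", "Bb", "bb"))
genotype_counts <- table(genotype)
print(genotype_counts)

# Phenotype proportions: any B allele gives a brown coat (dominance)
phenotype <- ifelse(n_B > 0, "brown", "white")
phenotype_probs <- prop.table(table(phenotype))
print(phenotype_probs)
```

The proportion of brown offspring from this enumeration (0.75) is the value of π used in the tests that follow.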


• Some heterozygous brown bunnies are mated and 100 offspring are born, 89 brown and 11 white

• Does this result defy expectation?

• Solution 1: use the binomial probability distribution to get an exact p-value

$$P(k) = \binom{n}{k}\,\pi^{k}\,(1-\pi)^{n-k}$$

• Where n is the number of subjects and k is the number of subjects with some specified value (e.g., brown color). π is the expected proportion of subjects with the specified value; 1 - π is the expected proportion without the value. For our example,

$$P(89) = \binom{100}{89}\,(0.75)^{89}\,(0.25)^{11} = 0.0002564172$$

• Note that this is the probability of finding exactly 89 brown bunnies out of 100 when 75 are expected. It is probably better to find

$$P(k \ge 89) = \sum_{k=89}^{100} \binom{100}{k}\,(0.75)^{k}\,(0.25)^{100-k}$$

• as this is the probability of finding at least 89 brown bunnies when 75 are expected. This returns a P-value of 0.0003935178.



• Does this result defy expectation?

• Solution 1: use the binomial probability distribution to get an exact p-value

• Using R…

> # For probability that k = 89, when n = 100
> pbinom(89, 100, 0.75, lower.tail=TRUE) - pbinom(88, 100, 0.75, lower.tail=TRUE)
[1] 0.0002564172
> # or
> pbinom(88, 100, 0.75, lower.tail=FALSE) - pbinom(89, 100, 0.75, lower.tail=FALSE)
[1] 0.0002564172
> # Note that pbinom is the cumulative probability function:
> # lower.tail=TRUE gives P(k <= q); lower.tail=FALSE gives P(k > q)
> # For probability that k >= 89, when n = 100
> 1 - pbinom(88, 100, 0.75, lower.tail=TRUE)
[1] 0.0003935178
> # or
> pbinom(88, 100, 0.75, lower.tail=FALSE)
[1] 0.0003935178
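The same two probabilities can also be obtained without differencing cumulative probabilities. The sketch below uses dbinom (the binomial probability mass function) and binom.test from base R; the variable names are only for illustration.

```r
n <- 100         # number of offspring
k <- 89          # number of brown offspring observed
p_brown <- 0.75  # expected proportion of brown offspring (3:1 hypothesis)

# Probability of exactly 89 brown bunnies
p_exact <- dbinom(k, size = n, prob = p_brown)

# Probability of at least 89 brown bunnies: sum the mass from 89 to 100
p_upper <- sum(dbinom(k:n, size = n, prob = p_brown))

# One-sided exact binomial test; its p-value is the same upper-tail sum
bt <- binom.test(k, n, p = p_brown, alternative = "greater")

print(c(exact = p_exact, upper_tail = p_upper, binom_test = bt$p.value))
```

Note that alternative = "greater" matches the one-tailed question asked here (at least 89 brown); the default two-sided test would answer a different question.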

• Does this result defy expectation?

• Solution 2: “Chi-square” test (Note: this is really a bad name, since a Chi-square distribution is a continuous distribution and the test statistic calculated in the following example – as you probably learned in a genetics class – is a sample statistic calculated from discrete frequencies. However, the statistic approximately follows a Chi-square distribution.)

• Notation: χ² denotes the theoretical distribution; X² denotes the sample statistic

$$X^2 = \sum_{i=1}^{a} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i}$$

f_i is the observed frequency for category i; f-hat_i is the expected frequency, found as πn or (1 - π)n, depending on whether the expected frequency corresponds to the specified category or the unspecified category

In the example, 0.75*100 = 75 is the expected number of brown bunnies; 0.25*100 = 25 is the expected number of white bunnies

For the example,

$$X^2 = \frac{(89-75)^2}{75} + \frac{(11-25)^2}{25} = 2.6133 + 7.8400 = 10.4533$$

With df = 1, this gives P = 0.001224.
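A short R sketch of this calculation for the bunny data, computing the statistic by hand and then checking it against chisq.test from base R. The names observed, expected_prop, and so on are illustrative.

```r
observed <- c(brown = 89, white = 11)
expected_prop <- c(brown = 0.75, white = 0.25)    # 3:1 hypothesis

# Expected frequencies: pi * n and (1 - pi) * n
expected <- expected_prop * sum(observed)

# Sample statistic: sum over categories of (observed - expected)^2 / expected
X2 <- sum((observed - expected)^2 / expected)

# Compare to the theoretical chi-square distribution with a - 1 df
df <- length(observed) - 1
p_value <- pchisq(X2, df = df, lower.tail = FALSE)
print(c(X2 = X2, df = df, p = p_value))

# Built-in goodness of fit test with the same expected proportions
ct <- chisq.test(observed, p = expected_prop)
print(ct)
```

Supplying p = expected_prop is what makes chisq.test a goodness of fit test against stated proportions rather than a test of equal frequencies.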

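• Does this result defy expectation?

• Solution 3: “G” (Goodness of Fit) test

$$G = 2\sum_{i=1}^{a} f_i \ln\!\left(\frac{f_i}{\hat{f}_i}\right)$$

G is compared to the theoretical χ² distribution with a - 1 df.

The interior part is a likelihood ratio, which approximates the ratio of binomial probabilities for π and k/n.

For the example,

$$G = 2\left[89\ln\!\left(\frac{89}{75}\right) + 11\ln\!\left(\frac{11}{25}\right)\right] = 2\,(15.2322 - 9.0308) = 12.4028$$

With df = 1, this gives P = 0.000429.

This equation can also be written as

$$G = 2\left[\sum_{i=1}^{a} f_i \ln f_i - \sum_{i=1}^{a} f_i \ln \hat{f}_i\right]$$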
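The G statistic for the bunny data can be sketched in R as follows, assuming the usual form G = 2 Σ f ln(f / f-hat). Base R has no built-in G test, so the statistic is computed directly; the variable names are illustrative.

```r
observed <- c(brown = 89, white = 11)
expected <- c(brown = 0.75, white = 0.25) * sum(observed)

# G = 2 * sum of f * ln(f / f-hat); log() in R is the natural logarithm
G <- 2 * sum(observed * log(observed / expected))

# Equivalent form: 2 * (sum f ln f - sum f ln f-hat)
G_alt <- 2 * (sum(observed * log(observed)) - sum(observed * log(expected)))

# Compare to the chi-square distribution with a - 1 df
df <- length(observed) - 1
p_value <- pchisq(G, df = df, lower.tail = FALSE)
print(c(G = G, G_alt = G_alt, df = df, p = p_value))
```

This direct form fails when an observed frequency is 0, because 0 * log(0) evaluates to NaN in R; such terms are conventionally taken as 0 and would need special handling.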



• Does this result defy expectation?

• Summary of solutions

Method                 P value                  Conclusion?
Binomial probability   0.000393 (exact)         Reject 3:1 hypothesis
G test                 0.000429 (approximate)   Reject 3:1 hypothesis
Chi-square test        0.001224 (approximate)   Reject 3:1 hypothesis

• Why not always use binomial probability?
  • Expected frequencies might not be known, but a reference distribution could be used

• Why did the G test (in this case) have more statistical power?
  • Although the G test and “Chi-square” test statistics both approximately follow a chi-square distribution of the same df, the G statistic is known to follow it more closely (it produces values consistent with the theoretical distribution).
  • The G test is also a likelihood ratio test, and will have some better properties for more complicated examples (as we will see).
  • In general, the two produce similar results (especially with large sample sizes). Both are also susceptible to problems with small sample size, but G is better.
  • Rule of Thumb: use G when |O - E| > E for any (O)bserved and (E)xpected values.

Example: Single Classification Tests for Goodness of Fit

• Example set-up

• Ecologists are often interested in whether species diversity at local scales differs from regional scales

• If one were to sample species in a local area, would the sample comprise the same species in the same proportions as are found in the region?

• One can substitute taxonomic affinities for the following “expected” regional species proportions: Species A 50%, Species B 22%, Species C 16%, Species D 9%, Species E 1%, Species F 0.5%, Species G, H, I, J “Pooled” 1.5%

• A scientist collects the following numbers of individuals of each species in a sampling event of a local place (e.g., pond, lake, river, prairie fragment, sinkhole, etc.) – see the table that follows

• Question: is local species diversity the same as regional species diversity?


• Example set-up

Species   f
A         125
B         42
C         30
D         24
E         10
F         0
G         14
H         0
I         12
J         11
TOTAL     268

• Knowing that zero values can cause problems for the likely statistical tests, and already being constrained to pool species G, H, I, and J, the scientist summarizes the data as below

Species                            f     Expected
A                                  125   0.50*268 = 134
B                                  42    0.22*268 = 58.96
C                                  30    0.16*268 = 42.88
D                                  24    0.09*268 = 24.12
E                                  10    0.01*268 = 2.68
F, G, H, I, J (the rare species)   37    0.02*268 = 5.36
TOTAL                              268   268
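The pooling step and the expected frequencies can be sketched in R as below. The counts and proportions are those given in the example; the object names (counts, rare, regional) are illustrative.

```r
# Raw counts from the local sampling event
counts <- c(A = 125, B = 42, C = 30, D = 24, E = 10,
            F = 0, G = 14, H = 0, I = 12, J = 11)

# Pool the rare species F through J into a single class
rare <- c("F", "G", "H", "I", "J")
observed <- c(counts[c("A", "B", "C", "D", "E")], rare = sum(counts[rare]))

# Regional proportions; the rare class is 0.5% (F) + 1.5% (G, H, I, J pooled)
regional <- c(A = 0.50, B = 0.22, C = 0.16, D = 0.09, E = 0.01,
              rare = 0.005 + 0.015)

# Expected frequencies: regional proportion times total sample size
expected <- regional * sum(observed)

print(data.frame(f = observed, Expected = expected))
```

Pooling leaves the total unchanged, so the expected frequencies also sum to 268.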


• Example analysis

• Note: There is no way to calculate the binomial probability, as these are not binomial data. But goodness of fit tests can still be applied.




• Goodness of fit tests can still be applied. Just add a few columns…

Species         f     Expected           (f - f-hat)^2 / f-hat   f ln(f / f-hat)
A               125   0.50*268 = 134     0.6045                  -8.6908
B               42    0.22*268 = 58.96   4.8786                  -14.2460
C               30    0.16*268 = 42.88   3.8688                  -10.7162
D               24    0.09*268 = 24.12   0.0006                  -0.1197
E               10    0.01*268 = 2.68    19.9934                 13.1677
F, G, H, I, J   37    0.02*268 = 5.36    186.7704                71.4823
TOTAL (SUM)     268   268                216.1164                50.8773

X² = 216.1164

G = 2(50.8773) = 101.7546

df = a - 1 for a classes; here df = 6 - 1 = 5
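The two added columns and their sums can be reproduced in R. This is a sketch using the pooled frequencies from the example; column and object names are illustrative.

```r
observed <- c(A = 125, B = 42, C = 30, D = 24, E = 10, rare = 37)
regional <- c(A = 0.50, B = 0.22, C = 0.16, D = 0.09, E = 0.01, rare = 0.02)
expected <- regional * sum(observed)

# Per-class contributions to each statistic
x2_terms <- (observed - expected)^2 / expected
g_terms  <- observed * log(observed / expected)

print(round(data.frame(f = observed, Expected = expected,
                       X2_term = x2_terms, G_term = g_terms), 4))

X2 <- sum(x2_terms)        # "Chi-square" statistic
G  <- 2 * sum(g_terms)     # G statistic
df <- length(observed) - 1 # a - 1 for a classes

# Upper-tail probabilities from the chi-square distribution
p_X2 <- pchisq(X2, df = df, lower.tail = FALSE)
p_G  <- pchisq(G,  df = df, lower.tail = FALSE)
print(c(X2 = X2, G = G, df = df))
```

Looking at the per-class terms shows which classes drive the result: species E and the pooled rare species contribute almost all of both statistics.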



• Note that there are correction factors for these tests, as type I error rates tend to be higher than the intended levels. The inflation is more substantial for small sample sizes.

• For the G test, divide G by q (Williams' Correction):

$$q = 1 + \frac{a^2 - 1}{6nv}$$

where a is the number of classes, n is the total sample size, and v = a - 1 is the df

• For the X² or G test, the statistics are adjusted by adding or subtracting 0.5 from the observed frequencies, whichever moves each observed frequency toward its expected value (Correction for Continuity)

• With the Chi-square test, the correction for continuity is easy to perform by taking the absolute difference of observed and expected frequencies and subtracting ½ before squaring.

• Williams' correction for the example:

$$q = 1 + \frac{6^2 - 1}{6(268)(5)} = 1.00435, \qquad G_{adj} = \frac{101.7546}{1.00435} = 101.31$$
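Both corrections can be sketched in R. The Williams factor below assumes the form q = 1 + (a² - 1)/(6nv) with v = a - 1, as given in Sokal and Rohlf; the continuity correction is shown on the two-class bunny data. Object names are illustrative.

```r
## Williams' correction, species example
observed <- c(A = 125, B = 42, C = 30, D = 24, E = 10, rare = 37)
expected <- c(0.50, 0.22, 0.16, 0.09, 0.01, 0.02) * sum(observed)

G <- 2 * sum(observed * log(observed / expected))
a <- length(observed)   # number of classes
n <- sum(observed)      # total sample size
v <- a - 1              # degrees of freedom

q <- 1 + (a^2 - 1) / (6 * n * v)   # assumed form of the correction factor
G_adj <- G / q
print(c(G = G, q = q, G_adj = G_adj))

## Correction for continuity, bunny example (Chi-square statistic)
obs_bunny <- c(brown = 89, white = 11)
exp_bunny <- c(brown = 0.75, white = 0.25) * sum(obs_bunny)

# Subtract 1/2 from each absolute deviation before squaring
X2_corrected <- sum((abs(obs_bunny - exp_bunny) - 0.5)^2 / exp_bunny)
print(X2_corrected)
```

With n = 268 the Williams factor is very close to 1, so the adjusted G barely changes; the corrections matter more when samples are small.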


• Example conclusion

• This mythical place has many more rare species than expected, compared to the regional species pool

• Note that the pooling of species F, G, H, I, and J might influence the outcome


Expansion of Goodness of Fit Tests

• The first two examples included one frequency distribution and some known or true expectation.

• The first two examples included categorical data

• There are two different ways we can (and will) go

1. Goodness of fit tests for continuous frequency data

2. Goodness of fit tests for more than one distribution

• We have to start with one of these, so let’s start with 1.

• Before proceeding, it is important to distinguish two kinds of hypotheses that are used as the “null” model for frequency expectations. In the previous two examples, the expected frequencies were established by theory (expected genotypes) or by a larger empirical pool of information (species proportions). These are extrinsic hypotheses for the basis of expected frequencies. Intrinsic hypotheses can also be used for estimating frequencies. For example, if we wish to test whether a continuous frequency distribution is normal, we can generate expected frequencies, but we first need to estimate the mean and variance from the sample. Thus, the degrees of freedom for the test are reduced by one for each additional parameter estimated.
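A minimal sketch of an intrinsic hypothesis in R, testing whether a continuous variable is normal. The data are simulated purely for illustration (they are not course data), and the choice of five classes with boundaries at ±0.5 and ±1.5 standard deviations is an arbitrary assumption.

```r
set.seed(582)
y <- rnorm(200, mean = 10, sd = 2)   # hypothetical simulated sample

# Intrinsic hypothesis: the normal parameters are estimated from the sample
m <- mean(y)
s <- sd(y)

# Class boundaries (arbitrary choice), with open-ended tails
breaks <- c(-Inf, m + s * c(-1.5, -0.5, 0.5, 1.5), Inf)

# Observed frequency in each class
f <- as.vector(table(cut(y, breaks)))

# Expected frequency: n times the normal probability of each class
f_hat <- length(y) * diff(pnorm(breaks, mean = m, sd = s))

# "Chi-square" statistic, as in the categorical examples
X2 <- sum((f - f_hat)^2 / f_hat)

# df = a - 1, minus one more for each estimated parameter (mean and variance)
a <- length(f)
df <- a - 1 - 2
p_value <- pchisq(X2, df = df, lower.tail = FALSE)
print(c(X2 = X2, df = df, p = p_value))
```

With an extrinsic hypothesis the same five classes would give df = 4; estimating the mean and variance from the sample costs two more, leaving df = 2.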
