
Advances in Statistics
Or, what you might find if you picked up a current issue of a Biological Journal

Dec 25, 2015


Emory Tucker


Advances in Statistics

• Extensions to the ANOVA
• Computer-intensive methods
• Maximum likelihood


Extensions to ANOVA

• One-way ANOVA
  – This works for a single explanatory variable
  – Simplest possible design
• Two-way ANOVA
  – Two categorical explanatory variables
  – Factorial design


ANOVA Tables

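Source of variation | Sum of squares | df | Mean squares | F ratio | P
Treatment | $SS_{group} = \sum_i n_i (\bar{Y}_i - \bar{Y})^2$ | $k - 1$ | $MS_{group} = SS_{group} / df_{group}$ | $F = MS_{group} / MS_{error}$ | *
Error | $SS_{error} = \sum_i s_i^2 (n_i - 1)$ | $N - k$ | $MS_{error} = SS_{error} / df_{error}$ | |
Total | $SS_{group} + SS_{error}$ | $N - 1$ | | |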
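The formulas in the ANOVA table can be checked numerically. The sketch below is only an illustration: the function name `one_way_anova` and the three groups of measurements are made up for this example and do not come from the slides. It computes every entry except the P-value, which would need the F distribution.

```python
from statistics import mean, variance

def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table for a list of samples."""
    k = len(groups)                             # number of groups
    n_total = sum(len(g) for g in groups)       # N, total sample size
    grand_mean = sum(sum(g) for g in groups) / n_total

    # SS_group = sum over groups of n_i * (group mean - grand mean)^2
    ss_group = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # SS_error = sum over groups of s_i^2 * (n_i - 1)
    ss_error = sum(variance(g) * (len(g) - 1) for g in groups)

    df_group, df_error = k - 1, n_total - k
    ms_group = ss_group / df_group
    ms_error = ss_error / df_error
    return {
        "SS_group": ss_group, "SS_error": ss_error,
        "SS_total": ss_group + ss_error,
        "df_group": df_group, "df_error": df_error,
        "MS_group": ms_group, "MS_error": ms_error,
        "F": ms_group / ms_error,
    }

# Hypothetical measurements for three treatment groups (not from the slides).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
table = one_way_anova(groups)
print(table)
```

The total sum of squares is obtained by adding the two components, mirroring the "Total" row of the table.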


Two-factor ANOVA Table

Source of variation | Sum of Squares | df | Mean Square | F ratio | P
Treatment 1 | SS1 | k1 - 1 | SS1 / (k1 - 1) | MS1 / MSE |
Treatment 2 | SS2 | k2 - 1 | SS2 / (k2 - 1) | MS2 / MSE |
Treatment 1 * Treatment 2 | SS1*2 | (k1 - 1)*(k2 - 1) | SS1*2 / [(k1 - 1)*(k2 - 1)] | MS1*2 / MSE |
Error | SSerror | XXX | SSerror / XXX | |
Total | SStotal | N-1 | | |


Two categorical explanatory variables


General Linear Models

• Used to analyze variation in Y when there is more than one explanatory variable

• Explanatory variables can be categorical or numerical


General Linear Models

• First step: formulate a model statement

• Example:

Y = μ + TREATMENT


where μ is the overall mean and TREATMENT is the treatment effect.


General Linear Models

• Second step: Make an ANOVA table

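• Example:

Source of variation | Sum of squares | df | Mean squares | F ratio | P
Treatment | $SS_{group} = \sum_i n_i (\bar{Y}_i - \bar{Y})^2$ | $k - 1$ | $MS_{group} = SS_{group} / df_{group}$ | $F = MS_{group} / MS_{error}$ | *
Error | $SS_{error} = \sum_i s_i^2 (n_i - 1)$ | $N - k$ | $MS_{error} = SS_{error} / df_{error}$ | |
Total | $SS_{group} + SS_{error}$ | $N - 1$ | | |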


This is the same as a one-way ANOVA!


General Linear Models

• If there is only one explanatory variable, these are exactly equivalent to things we've already done
  – One categorical variable: ANOVA
  – One numerical variable: regression

• Great for more complicated situations


Example 1: Experiment with blocking

• Fish experiment: sensitivity of goldfish to light

• Fish are randomly selected from the population

• Four different light treatments are applied to each fish


Randomized Block Design

[Figure: layout of the design, with blocks (fish) and treatments (light wavelengths).]



Step 1: Make a model statement

Y = μ + BLOCK + TREATMENT
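To make the model statement concrete, here is a minimal sketch of how the variation could be split into block, treatment, and error components. It assumes one observation per fish per light treatment and uses the standard textbook partition for that case; the function name `randomized_block_anova` and the numbers are hypothetical and are not the goldfish data.

```python
def randomized_block_anova(data):
    """data[b][t] is the response of block b under treatment t (one value per cell)."""
    n_blocks, n_treat = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n_blocks * n_treat)
    block_means = [sum(row) / n_treat for row in data]
    treat_means = [sum(row[t] for row in data) / n_blocks for t in range(n_treat)]

    # Each sum of squares measures deviations of the relevant means from the grand mean.
    ss_block = n_treat * sum((m - grand) ** 2 for m in block_means)
    ss_treat = n_blocks * sum((m - grand) ** 2 for m in treat_means)
    ss_total = sum((y - grand) ** 2 for row in data for y in row)
    ss_error = ss_total - ss_block - ss_treat   # what is left after BLOCK and TREATMENT

    df_block, df_treat = n_blocks - 1, n_treat - 1
    df_error = df_block * df_treat
    ms_error = ss_error / df_error
    return {
        "SS": (ss_block, ss_treat, ss_error),
        "df": (df_block, df_treat, df_error),
        "F_block": (ss_block / df_block) / ms_error,
        "F_treatment": (ss_treat / df_treat) / ms_error,
    }

# Hypothetical sensitivities: 3 fish (rows) by 4 light wavelengths (columns).
data = [
    [1.0, 2.0, 4.0, 3.0],
    [2.0, 4.0, 5.0, 5.0],
    [3.0, 3.0, 6.0, 4.0],
]
result = randomized_block_anova(data)
print(result)
```

Including BLOCK in the model removes fish-to-fish variation from the error term, which is the point of blocking.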


Step 2: Make an ANOVA table


Another Example: Mole Rats

• Are there lazy mole rats?
• Two variables:
  – Worker type: categorical ("frequent workers" and "infrequent workers")
  – Body mass (ln-transformed): numerical


Step 1: Make a model statement

Y = μ + CASTE + LNMASS + CASTE * LNMASS


Step 2: Make an ANOVA table



Step 1: Make a model statement

Y = μ + CASTE + LNMASS


Step 2: Make an ANOVA table


Also called ANCOVA (analysis of covariance)


General Linear Models

• Can handle any number of predictor variables

• Each can be categorical or numerical

• Tables have the same basic structure

• Same assumptions as ANOVA


General Linear Models

• Don’t run out of degrees of freedom!

• Sometimes, the F-statistics will have DIFFERENT denominators - see book for an example


Computer-intensive methods

• Hypothesis testing:
  – Simulation
  – Randomization
• Confidence intervals:
  – Bootstrap


Simulation

• Simulates the sampling process on a computer many times; the estimates calculated from the simulated data sets form the null distribution

• The simulation assumes the null hypothesis is true


Example: Social spider sex ratios

Social spiders live in groups


• Groups are mostly females

• Hypothesis: Groups have just enough males to allow reproduction

• Test: Whether distribution of number of males is as predicted by chance

• Problem: Groups are of many different sizes

• Binomial distribution therefore doesn’t apply


Simulation:

• For each group, the number of spiders is known. The overall proportion of males, pm, is known.

• For each group, the computer draws as many simulated spiders as the real group contains, and each has probability pm of being male.

• This is done for all groups, and the variance in proportion of males is calculated.

• This is repeated a large number of times.
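The simulation steps above can be sketched as follows. The slides do not give the raw spider counts, so the group sizes and male counts here are invented, and the function names are hypothetical. The direction of the tail (variance as small as or smaller than observed) is also an assumption, chosen because the hypothesis predicts less variation in males than chance would give.

```python
import random

def variance_of_proportions(males, sizes):
    """Sample variance (n - 1 denominator) of the per-group proportion of males."""
    props = [m / n for m, n in zip(males, sizes)]
    mean_p = sum(props) / len(props)
    return sum((p - mean_p) ** 2 for p in props) / (len(props) - 1)

def simulate_null(sizes, p_male, n_sims, rng):
    """Variance under the null: every spider is male independently with probability p_male."""
    null_values = []
    for _ in range(n_sims):
        # Draw the real number of spiders for each group, each male with probability p_male.
        sim_males = [sum(rng.random() < p_male for _ in range(n)) for n in sizes]
        null_values.append(variance_of_proportions(sim_males, sizes))
    return null_values

# Hypothetical group sizes and male counts (not from the slides).
sizes = [12, 30, 8, 45, 20, 15]
males = [2, 5, 1, 8, 3, 3]
p_male = sum(males) / sum(sizes)          # overall proportion of males, pm

rng = random.Random(1)                    # fixed seed so the run is repeatable
observed = variance_of_proportions(males, sizes)
null_values = simulate_null(sizes, p_male, 2000, rng)

# Assumed one-tailed test: how often is the simulated variance at least as small as observed?
p_value = sum(v <= observed for v in null_values) / len(null_values)
print(observed, p_value)
```

Because each simulated group keeps its real size, the unequal group sizes that rule out a single binomial distribution are handled automatically.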


[Figure: frequency distribution of the simulated variance in proportion of males (pseudo-values), with the actual observed value (0.44) marked.]

The observed value (0.44), or something more extreme, is observed in only 4.9% of the simulations. Therefore P = 0.049.


Randomization

• Used for hypothesis testing

• Mixes the real data randomly

• Variable 1 from an individual is paired with variable 2 data from a randomly chosen individual. This is done for all individuals.

• The estimate is made on the randomized data.

• The whole process is repeated numerous times. The distribution of the randomized estimates is the null distribution.


Without replacement

• Randomization is done without replacement.

• In other words, all data points are used exactly once in each randomized data set.


Randomization can be done for any test of association between two variables


Example: Sage crickets

Sage cricket males sometimes offer their hind-wings to females to eat during mating.

Do females who eat hind-wings wait longer to re-mate?


Table 12.3A. Waiting time to remating in sage cricket females after initial mating with either a wingless or winged male (presented in ln(days))

Male wingless | Male winged
0 | 1.4
0.7 | 1.6
0.7 | 1.9
1.4 | 2.3
1.6 | 2.6
1.8 | 2.8
1.9 | 2.8
1.9 | 2.8
1.9 | 3.1
2.2 | 3.8
2.1 | 3.9
2.1 | 4.5
 | 4.7


[Figure: distributions of ln(time to remating) for females whose first mate had no wings and for females whose first mate had intact wings.]

Problems: unequal variance, non-normal distributions


Real data (Table 12.3A): $\bar{Y}_1 - \bar{Y}_2 = -1.41$

Randomized data:

Male wingless | Male winged
0.7 | 2.8
2.3 | 1.9
1.9 | 2.1
1.8 | 1.6
3.8 | 0
1.4 | 1.4
1.9 | 2.2
3.9 | 2.1
4.7 | 1.6
2.6 | 4.5
1.9 | 2.8
2.8 | 0.7
 | 3.1

$\bar{Y}_1 - \bar{Y}_2 = 0.41$


Note that each data point was only used once


1000 randomizations

P < 0.001
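The randomization test on the sage cricket data can be sketched as below, using the values from Table 12.3A. The function names are hypothetical, and the two-tailed comparison on the absolute difference in means is an assumption, since the slides do not state which tail was used. The exact P-value from a run depends on the random seed.

```python
import random

# ln(days) to remating, from Table 12.3A
wingless = [0, 0.7, 0.7, 1.4, 1.6, 1.8, 1.9, 1.9, 1.9, 2.2, 2.1, 2.1]
winged = [1.4, 1.6, 1.9, 2.3, 2.6, 2.8, 2.8, 2.8, 3.1, 3.8, 3.9, 4.5, 4.7]

def mean_difference(a, b):
    """Difference between the two group means."""
    return sum(a) / len(a) - sum(b) / len(b)

def randomization_test(a, b, n_rand, rng):
    """Shuffle group labels n_rand times and compare with the observed difference."""
    observed = mean_difference(a, b)
    pooled = a + b                      # new list, so the inputs are not modified
    extreme = 0
    for _ in range(n_rand):
        rng.shuffle(pooled)             # without replacement: each value used exactly once
        diff = mean_difference(pooled[:len(a)], pooled[len(a):])
        if abs(diff) >= abs(observed):  # assumed two-tailed comparison
            extreme += 1
    return observed, extreme / n_rand

observed, p_value = randomization_test(wingless, winged, 1000, random.Random(0))
print(round(observed, 2), p_value)
```

No assumption of normality or equal variance is needed, which is why this approach suits these data.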


Randomization: Other questions

Q: Is this periodic?

(yes)


Bootstrap

• Method for estimation (and confidence intervals)

• Often used for hypothesis testing too

• "Picking yourself up by your own bootstraps"


• For each group, randomly pick with replacement an equal number of data points, from the data of that group

• With this bootstrap dataset, calculate the estimate; this is called a bootstrap replicate estimate


Real data (Table 12.3A): $\bar{Y}_1 - \bar{Y}_2 = -1.41$

Bootstrap data:

Male wingless | Male winged
0.7 | 1.4
0.7 | 1.4
1.4 | 2.8
1.4 | 2.8
1.8 | 2.8
1.8 | 3.1
1.8 | 3.1
1.9 | 3.9
1.9 | 4.5
2.1 | 4.7
2.1 | 4.7
2.1 | 4.7
 | 4.7

$\bar{Y}_1 - \bar{Y}_2 = -1.78$
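A bootstrap of the difference in means for the sage cricket data might look like the sketch below. The function name is hypothetical, and turning the replicates into a 95% interval with the percentile method is an assumption: the slides say the bootstrap gives confidence intervals but do not say which method they use.

```python
import random

# ln(days) to remating, from Table 12.3A
wingless = [0, 0.7, 0.7, 1.4, 1.6, 1.8, 1.9, 1.9, 1.9, 2.2, 2.1, 2.1]
winged = [1.4, 1.6, 1.9, 2.3, 2.6, 2.8, 2.8, 2.8, 3.1, 3.8, 3.9, 4.5, 4.7]

def bootstrap_differences(a, b, n_boot, rng):
    """Bootstrap replicate estimates of mean(a) - mean(b)."""
    replicates = []
    for _ in range(n_boot):
        # Resample WITH replacement within each group, keeping each group's sample size.
        a_star = rng.choices(a, k=len(a))
        b_star = rng.choices(b, k=len(b))
        replicates.append(sum(a_star) / len(a_star) - sum(b_star) / len(b_star))
    return replicates

reps = sorted(bootstrap_differences(wingless, winged, 2000, random.Random(0)))

# Assumed percentile method: take the middle 95% of the sorted replicates.
lower = reps[int(0.025 * len(reps))]
upper = reps[int(0.975 * len(reps)) - 1]
print(lower, upper)
```

Unlike the randomization test, values can appear more than once or not at all in a bootstrap data set, because sampling is with replacement.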


Bootstraps are often used in evolutionary trees


Likelihood

$L(\text{hypothesis A} \mid \text{data}) = P[\text{data} \mid \text{hypothesis A}]$

Likelihood considers many possible hypotheses, not just one


Law of likelihood

A particular data set supports one hypothesis better than another if the likelihood of that hypothesis is higher than the likelihood of the other hypothesis.

Therefore we try to find the hypothesis with the maximum likelihood.


All estimates we have learned so far are also maximum likelihood estimates.


"Simple" example

• Using likelihood to estimate a proportion

• Data: 3 out of 8 individuals are male.

• Question: What is the maximum likelihood estimate of the proportion of males?


Likelihood

$L(p = x) = P[\text{3 males out of 8} \mid p = x]$

where x is a hypothesized value of the proportion of males.

e.g., L(p=0.5) is the likelihood of the hypothesis that the proportion of males is 0.5.


For this example only...

The probability of getting 3 males out of 8 independent trials is given by the binomial distribution.

$L(p = x) = \Pr[\text{data} \mid p = x] = \Pr[\text{3 out of 8} \mid p = x] = \binom{8}{3} x^3 (1 - x)^{8-3}$


How to find maximum likelihood hypothesis

1. Calculus

or

2. Computer calculations


By calculus...

• Maximum value of L(p=x) is found when x = 3/8.

• Note that this is the same value we would have gotten by methods we already learned.
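The calculus step is not shown on the slide. One way to carry it out, assuming we maximize the log of the binomial likelihood (a common convenience, since the log has its maximum at the same x), is:

```latex
\begin{align*}
\ln L(p = x) &= \ln\binom{8}{3} + 3\ln x + 5\ln(1 - x) \\
\frac{d}{dx}\,\ln L(p = x) &= \frac{3}{x} - \frac{5}{1 - x} = 0 \\
3(1 - x) &= 5x \quad\Longrightarrow\quad x = \frac{3}{8}
\end{align*}
```

The result equals the sample proportion, 3 males out of 8.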


By computer calculation...

[Figure: plot of L(p = x) against x, with the maximum at x = 3/8.]

Input the likelihood formula to the computer, plot the value of L for each value of x, and find the largest L.
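The computer approach can be sketched as a simple grid search over candidate values of x. The function name `likelihood` and the grid spacing of 0.001 are choices made for this illustration only; the formula is the binomial likelihood for 3 males out of 8.

```python
from math import comb

def likelihood(x, males=3, n=8):
    """Binomial likelihood of the hypothesis p = x, given `males` males out of `n`."""
    return comb(n, males) * x ** males * (1 - x) ** (n - males)

# Evaluate L(p = x) on an evenly spaced grid of x values from 0 to 1.
grid = [i / 1000 for i in range(1001)]
best_x = max(grid, key=likelihood)      # the x with the largest likelihood

print(best_x, likelihood(best_x))
```

A finer grid gives a more precise answer when the maximum does not fall exactly on a grid point.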


Finding genes for corn yield:

Corn Chromosome 5


Hypothesis testing by likelihood

• Compares the likelihood of maximum likelihood estimate to a null hypothesis

Log-likelihood ratio = $\ln\left[\frac{\text{Likelihood}[\text{Maximum likelihood hypothesis}]}{\text{Likelihood}[\text{Null hypothesis}]}\right]$


Test statistic

$\chi^2 = 2 \times (\text{log-likelihood ratio})$

With df equal to the number of variables fixed to make null hypothesis


Example: 3 males out of 8 individuals

• H0: 50% are male

• Maximum likelihood estimate

$\hat{p} = \frac{3}{8}$

$L[p = 3/8] = \binom{8}{3} (3/8)^3 (1 - 3/8)^5 = 0.2816$


Likelihood of null hypothesis

$L[p = 0.5] = \binom{8}{3} (0.5)^3 (1 - 0.5)^5 = 0.21875$


Log likelihood ratio

$\ln\left[\frac{L[p = 3/8]}{L[p = 0.5]}\right] = \ln\left[\frac{0.2816}{0.21875}\right] = 0.2526$

$\chi^2 = 2(0.2526) = 0.5051$

We fixed one variable in the null hypothesis (p), so the test has df = 1.

$\chi^2_{0.05,1} = 3.84$. Because 0.5051 is less than 3.84, we do not reject H0.
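The arithmetic of this likelihood ratio test can be reproduced in a few lines. This is only a sketch: the function name `likelihood` is hypothetical, and the critical value 3.84 is taken from the slide rather than computed.

```python
from math import comb, log

def likelihood(p, males=3, n=8):
    """Binomial likelihood of proportion p, given `males` males out of `n`."""
    return comb(n, males) * p ** males * (1 - p) ** (n - males)

l_mle = likelihood(3 / 8)        # likelihood at the maximum likelihood estimate
l_null = likelihood(0.5)         # likelihood under H0: 50% are male

log_lr = log(l_mle / l_null)     # log-likelihood ratio
chi2 = 2 * log_lr                # test statistic, df = 1 (one parameter fixed by H0)

critical_value = 3.84            # chi-square critical value for alpha = 0.05, df = 1 (from the slide)
reject_h0 = chi2 > critical_value

print(round(l_mle, 4), l_null, round(log_lr, 4), round(chi2, 4), reject_h0)
```

With only 8 individuals, an observed proportion of 3/8 is well within what a true proportion of 0.5 could produce, so the test does not reject H0.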