Biochemistry Prelims Statistics Lecture II: Sampling and the t-test

Jack J. Miller, DPhil [email protected]

Hilary Term 2018


Sampling

HT 2018 Statistics Lecture 2 — Introduction 2


Last time...

▶ We explored the fundamental idea of Frequentist statistics, namely that we live in an uncertain world, and each measurement we make of it is drawn randomly from some (unspecified) Probability Distribution Function, or PDF.

▶ If we know the shape of a PDF, we can compute ways of characterising it – for example, by computing its mean and median, or standard deviation and interquartile range.


However...

▶ When we do experiments, we make one or more measurements of an unknown quantity. We don’t know what the PDF of the unknown quantity looks like (otherwise there would be no point in doing the experiment!)

▶ As we repeat the experiment more and more times, we are drawing samples at random from the underlying PDF. (This is often referred to as “simple random sampling”.)

▶ We want to infer as much as we can about the properties of the underlying distribution as a whole based on this sample.

▶ Things are complicated by the fact that there are, in general, infinitely many distributions that the data could have come from!


An example

Let’s consider the height of people in the UK. Population data shows that, ignoring sex, on average our height is normally distributed (with µ = 1686 mm, σ = 98.89 mm):

[Figure: histogram of the population; x-axis Height (mm), y-axis People.]


Imagine I pick five people at random from this room, measure them, and obtain their heights as xi = 1589, 1565, 1529, 1823, 1694 mm.

I’d like to estimate the population mean µ (and ideally the standard deviation σ) from these five numbers. It turns out that the best I can do is to use the constructs

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2},$$

where n is the number of samples I have taken (here, n = 5).
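As a quick check, the two estimators can be evaluated for the five heights with nothing more than Python's standard library. This is a minimal sketch: the variable names are illustrative, and `statistics.stdev` is used for comparison because it divides by n − 1, matching the definition of s used here.

```python
import math
import statistics

# The five heights (in mm) from the example.
heights = [1589, 1565, 1529, 1823, 1694]
n = len(heights)

# Sample mean: the sum of the values divided by n.
x_bar = sum(heights) / n

# Sample standard deviation, dividing by n - 1 as in the definition of s.
s = math.sqrt(sum((x - x_bar) ** 2 for x in heights) / (n - 1))

# The standard library functions agree with the hand-written versions.
assert math.isclose(x_bar, statistics.mean(heights))
assert math.isclose(s, statistics.stdev(heights))

print(f"x_bar = {x_bar:.0f} mm, s = {s:.1f} mm")
```

The mean comes out at exactly 1640 mm and s at a little over 119 mm, i.e. 120 mm to two significant figures.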


For these five numbers, I can easily compute

x̄ = 1640 mm, s ≈ 120 mm.

What happens if I ask more people to stand up, and measure them? Or what if I tell those people to sit down, and measure another five instead?

My values for x̄ and s will change. Let’s do this a few times and make up a histogram of the values of x̄ (and likewise of s). These histograms ultimately become known as the sampling distributions of the mean and the standard deviation respectively.


To make things straightforward, let’s just consider x̄ for now – here’s a “histogram” of it for the five people I sampled above:

[Figure: two panels. Top: Height (mm) against People. Bottom: Mean height (mm) against Probability.]


Let’s tell them to sit down, and pick another five people instead:

[Figure: two panels. Top: Height (mm) against People. Bottom: Mean height (mm) against Probability, with the mean of the new sample marked.]


If I continue doing this, I get an idea of the distribution of the sample mean when I measure the height of five people: here’s the plot with 200 lots of samples of 5:

[Figure: two panels. Top: Height (mm) against People. Bottom: Mean height (mm) against Probability.]


...and with 20 000 lots of samples of five people each:

[Figure: two panels. Top: Height (mm) against People. Bottom: Mean height (mm) against Probability.]


Now, clearly I’ve done this a bit strangely: if I measure the height of 5 × 20 000 people, I’d probably be much better off computing x̄ of all of them!

What happens if I repeat the above, but draw samples containing 30 people instead of 5?


One sample of 30 people:

[Figure: two panels. Top: Height (mm) against People. Bottom: Mean height (mm) against Probability.]


Ten samples of 30 people:

[Figure: two panels. Top: Height (mm) against People. Bottom: Mean height (mm) against Probability.]


20 000 samples of 30 people:

[Figure: two panels. Top: Height (mm) against People. Bottom: Mean height (mm) against Probability.]


Let’s have a look at the sampling distribution of the mean for these data, as we vary the number of samples we take, n:

[Figure: three panels showing the sampling distribution of the mean, Mean height (mm) against Probability, for n = 2, n = 5 and n = 30.]
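The narrowing of the sampling distribution with n can be reproduced with a small simulation. The sketch below assumes heights really are normal with µ = 1686 mm and σ = 98.89 mm, draws 20 000 samples for each n, and compares the observed spread of x̄ with σ/√n, the standard result for the standard deviation of a sample mean. The function name, seed and repeat count are illustrative choices, not part of the lecture.

```python
import math
import random
import statistics

MU, SIGMA = 1686.0, 98.89  # assumed population mean and standard deviation (mm)

def sample_means(n, repeats, rng):
    """Return `repeats` sample means, each computed from n simulated heights."""
    return [
        statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(repeats)
    ]

rng = random.Random(2018)  # fixed seed so the run is repeatable
spread = {}
for n in (2, 5, 30):
    means = sample_means(n, 20_000, rng)
    # The spread of the sample means is the width of the sampling distribution.
    spread[n] = statistics.stdev(means)
    print(f"n = {n:2d}: mean of x_bar = {statistics.fmean(means):7.1f} mm, "
          f"spread of x_bar = {spread[n]:5.1f} mm, "
          f"sigma/sqrt(n) = {SIGMA / math.sqrt(n):5.1f} mm")
```

In each case the sample means centre on µ, and their spread shrinks roughly as 1/√n, which is why the n = 30 panel is so much narrower than the n = 2 panel.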


What about s² and σ²?

[Figure: three panels showing the sampling distribution of the sample variance, Variance of height (mm²) against Probability, for n = 2, n = 5 and n = 30.]


What about s and σ? Playing the same trick, we can show how s approximates σ as the number of samples increases:

[Figure: three panels showing the sampling distribution of s, Standard deviation of height (mm) against Probability, for n = 2, n = 5 and n = 30.]


So, to summarise, x̄ and s give us an estimate of µ and σ – but this estimate is itself uncertain!

It turns out that x̄ and s² are the Best Unbiased Estimators of µ and σ² that we can construct (in most circumstances).

Only as n → ∞ does x̄ → µ and s² → σ².


BUE?

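▶ What do I mean by Best and Unbiased?

▶ Unbiased means that, averaged over many repeated samples, they give the “right answer”, i.e. that

$$E[\bar{x}] = \mu \quad\text{and}\quad E[s^2] = \sigma^2,$$

and they converge on it as the sample grows:

$$\lim_{n\to\infty} \bar{x} = \mu \quad\text{and}\quad \lim_{n\to\infty} s^2 = \sigma^2.$$

▶ Best means here that the width of their sampling distribution is minimal – i.e. they’re “usually close to the right answer”.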


A standard deviation related word of caution...

[Figure: sampling distribution of s for n = 30, Standard deviation of height (mm) against Probability. The true answer and the mean of the estimates (provided by s with n = 30) are both marked. The difference is bias!]


It turns out that s is a biased estimator of σ, but it is usually the best we can do without knowing more about the (unknown) distribution.

It can be shown that for any finite sample size s is, on average, an underestimate of σ, and that the bias originates from the nonlinear behaviour of the square root. This bias is small – if the data are normally distributed, it is approximately σ/(4n). We shall henceforth ignore it.

(However, had we used the definition of standard deviation that involves division by n, as opposed to n − 1, we would find a greater bias here – the corresponding variance estimate would be too small by a factor of (n − 1)/n. More on this much later.)
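These statements about bias can be checked numerically. The following sketch assumes normally distributed heights with the population values quoted earlier, uses n = 30 to match the figure, and averages s, s² and the divide-by-n variance over many repeated samples. The helper name, seed and repeat count are illustrative assumptions.

```python
import math
import random

MU, SIGMA = 1686.0, 98.89  # assumed population values (mm)
N, REPEATS = 30, 20_000    # sample size and number of repeated samples

def sum_sq_dev(sample):
    """Sum of squared deviations of the sample from its own mean."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample)

rng = random.Random(1908)  # fixed seed so the run is repeatable
total_s = total_s2 = total_s2_div_n = 0.0
for _ in range(REPEATS):
    ss = sum_sq_dev([rng.gauss(MU, SIGMA) for _ in range(N)])
    total_s += math.sqrt(ss / (N - 1))  # s, dividing by n - 1
    total_s2 += ss / (N - 1)            # s^2, dividing by n - 1
    total_s2_div_n += ss / N            # variance dividing by n instead

mean_s = total_s / REPEATS
mean_s2 = total_s2 / REPEATS
mean_s2_div_n = total_s2_div_n / REPEATS

print(f"sigma   = {SIGMA:.2f} mm,   mean of s   = {mean_s:.2f} mm")
print(f"sigma^2 = {SIGMA**2:.0f} mm^2, mean of s^2 = {mean_s2:.0f} mm^2")
print(f"mean of divide-by-n variance = {mean_s2_div_n:.0f} mm^2, "
      f"(n-1)/n * sigma^2 = {(N - 1) / N * SIGMA**2:.0f} mm^2")
```

Under these assumptions the average of s² sits close to σ², the average of s falls slightly below σ (by an amount of the order of σ/(4n)), and the divide-by-n variance is low by roughly the factor (n − 1)/n.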


Clearly, as n → ∞, everything gets easier.

However, biochemistry is filled with small n experiments, usually for understandable reasons (e.g. cost and ethics).

Time for a historical interlude.


The t-distribution

William Sealy Gosset (New College, graduated in 1899, read chemistry and maths) was employed by the Guinness Son and Co. brewery in Dublin straight out of university, initially doing something that we would perhaps regard as industrial biochemistry – systematically optimising beer quality given variable starting products and conditions.


Guinness had a policy of employing Oxbridge graduates, as they “found before them an almost unexplored field lying open to investigation. A great mass of data was available or could easily be collected which would throw light on the relations, hitherto undetermined or only guessed at in an empirical way, between the quality of the raw materials of beer, such as barley and hops, the conditions of production and the quality of the finished article.”

Biometrika, Volume 30, Issue 3-4, 1 January 1939, pp. 210–250


After two years of training to be a Brewer, he set his mind to work trying to improve the production process, and was specifically interested in the sugar composition of malted barley, which affected the alcohol content of the final product (and hence the tax bill!).

He, and others in the firm, had difficulty coming to firm conclusions on matters such as whether or not the soil nitrogen content on a barley farm mattered, due to a lot of variation in the measurement, and the fact that the sample sizes were necessarily low (due to the limited availability of comparable barley farms in Ireland).


Naturally, Gosset tried to work out the shape of the sampling distributions I showed you above – and wrote an internal memo to the other brewers in Guinness, entitled “The application of the law of error to the work of the Brewery” (1904) detailing some of his progress.

He published the results in 1908, under the pseudonym “Student”.

Biometrika, Volume 6, Issue 1, 1 March 1908, pp. 1–25.


▶ The trouble with investigating x̄ and s is that they depend on the problem at hand.

▶ One way to make all problems “look the same” is to standardise them, by computing the quantity

$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}.$$

If we know µ and σ, then this quantity follows a normal distribution of mean 0 and variance 1.

▶ This was known before Student, and most people assumed that s was a very good approximation for σ. This is true with “enough” samples.
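To make the standardisation concrete, here is a minimal sketch that applies it to the five heights from the earlier example. It assumes, purely for illustration, that the people in the room are a random draw from the UK population with µ = 1686 mm and σ = 98.89 mm.

```python
import math

MU, SIGMA = 1686.0, 98.89  # assumed known population values (mm)
heights = [1589, 1565, 1529, 1823, 1694]

n = len(heights)
x_bar = sum(heights) / n

# Standardise: subtract the population mean, then divide by sigma / sqrt(n),
# the standard deviation of the sampling distribution of the mean.
z = (x_bar - MU) / (SIGMA / math.sqrt(n))

print(f"x_bar = {x_bar:.0f} mm, standardised value = {z:.2f}")
```

The result is close to −1: this sample mean sits about one standard deviation (of the sampling distribution) below µ, which is entirely unremarkable for a draw from a normal distribution of mean 0 and variance 1.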


▶ Student showed that if we don’t know σ, but know s and the data really are sampled from a normal distribution, then the quantity

$$Z = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

follows a different distribution, which has since become known as the t-distribution. NB: some authors call this T or t, and use Z for the case where n → ∞. To try and be concise, I’m calling it Z regardless of n.

▶ (Quantities like Z have a specific name – formally it is known as a test statistic.)

▶ The t-distribution depends on a parameter, known as the number of degrees of freedom, ν, which here is n − 1. As ν → ∞, the t-distribution becomes the normal distribution with mean 0 and variance 1.
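The effect of replacing σ by s can be seen in a short simulation. The sketch below assumes normally distributed heights with the population values used earlier and n = 5 (so ν = 4), and counts how often the standardised quantity falls beyond ±1.96, the value that a normal distribution of mean 0 and variance 1 exceeds in magnitude about 5% of the time. The function names, seed and repeat count are illustrative assumptions.

```python
import math
import random

MU, SIGMA = 1686.0, 98.89  # assumed population values (mm)
N, REPEATS = 5, 20_000     # sample size (nu = N - 1 = 4) and repeats

def sample_sd(sample):
    """Sample standard deviation s, dividing by n - 1."""
    x_bar = sum(sample) / len(sample)
    return math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (len(sample) - 1))

def standardise(sample, scale):
    """(x_bar - mu) / (scale / sqrt(n)) for the given scale (sigma or s)."""
    x_bar = sum(sample) / len(sample)
    return (x_bar - MU) / (scale / math.sqrt(len(sample)))

rng = random.Random(1908)  # fixed seed so the run is repeatable
beyond_with_sigma = beyond_with_s = 0
for _ in range(REPEATS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    if abs(standardise(sample, SIGMA)) > 1.96:
        beyond_with_sigma += 1
    if abs(standardise(sample, sample_sd(sample))) > 1.96:
        beyond_with_s += 1

frac_with_sigma = beyond_with_sigma / REPEATS
frac_with_s = beyond_with_s / REPEATS
print(f"fraction beyond +/-1.96 using sigma: {frac_with_sigma:.3f}")
print(f"fraction beyond +/-1.96 using s:     {frac_with_s:.3f}")
```

With σ known the fraction stays near 5%, whereas with s in its place it is more than twice as large: extreme values of Z are noticeably more common, which is what the t-distribution with few degrees of freedom describes.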


▶ For a small number of degrees of freedom, t is “broader” than the corresponding Gaussian, and has fatter tails.

▶ The full analytic form for t is mildly hairy, but computers are very good at providing numbers from it should we need them:

$$p_t(x, \nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}},$$

where

$$\Gamma(\nu) = \int_0^\infty x^{\nu-1} e^{-x}\,dx.$$
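As an illustration of how readily a computer provides these numbers, the density can be evaluated directly from the formula using the standard library. This is a sketch rather than a reference implementation: the function names are hypothetical, and the log-gamma function `math.lgamma` is used in place of Γ itself so that large ν does not overflow.

```python
import math

def t_pdf(x, nu):
    """Density of the t-distribution with nu degrees of freedom at x."""
    # Logarithm of Gamma((nu+1)/2) / (sqrt(nu*pi) * Gamma(nu/2)).
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    # Multiply by (1 + x^2/nu)^(-(nu+1)/2), working in logarithms.
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))

def normal_pdf(x):
    """Density of the normal distribution with mean 0 and variance 1."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

for nu in (1, 2, 3, 40, 1000):
    print(f"nu = {nu:4d}: p_t(0) = {t_pdf(0.0, nu):.4f}, "
          f"p_t(3) = {t_pdf(3.0, nu):.4f}")
print(f"normal   : p(0)   = {normal_pdf(0.0):.4f}, "
      f"p(3)   = {normal_pdf(3.0):.4f}")
```

For small ν the peak at x = 0 is lower and the density at x = 3 is much higher than for the Gaussian, and both values approach the Gaussian ones as ν grows.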


The t-distribution with ν degrees of freedom looks like this:

[Figure: the t-distribution, x against Probability, for ν = 1, 2, 3, 40 and ∞.]


[Figure: sampling distributions of Mean height (mm) and Standard deviation of height (mm), each against Probability, together with the distribution of Z against Probability, for ν = 1.]

[Figure: the same three distributions for ν = 4.]

[Figure: the same three distributions for ν = 40.]


Confidence limits

Why is this useful? Because we don’t know µ, but we do know things about Z!


▶ Remember the first rule about probabilities, how they sum to one? Well, we can look at the above and write down a statement about probability –

$$P\left(-z_{\alpha/2} \le Z \le z_{\alpha/2}\right) = 1 - \alpha$$


▶ Then, simply replacing Z, we get:

$$P\left(-z_{\alpha/2} \le \frac{\bar{x} - \mu}{s/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha$$


▶ Let’s do a bit of algebra inside the brackets:

$$-z_{\alpha/2} \le \frac{\bar{x} - \mu}{s/\sqrt{n}} \le z_{\alpha/2}$$

$$-z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) \le \bar{x} - \mu \le +z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$

$$-\bar{x} - z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) \le -\mu \le -\bar{x} + z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$

Multiplying through by −1, which reverses the direction of both inequalities, gives:

$$\bar{x} - z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) \le \mu \le \bar{x} + z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$

(Remember me!)


Now, the thing about zα/2 is that it’s just a number, chosen to divide the area under the curve as shown so that most of it lies within a particular region. Specifically, for some number α known as the “significance level” (which is typically chosen to be 5%), we want

$$\int_{-z_{\alpha/2}}^{z_{\alpha/2}} p(t, \nu)\,dt = 1 - \alpha$$

We can’t do this integral by hand very easily, but a computer can. In R, zα/2 is given by qt(1-alpha/2,df=n-1) where you fill in n and alpha to taste.
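As a rough illustration of what the computer is doing here, the following Python sketch (standard library only; it is not how R's `qt` is implemented) integrates the t density numerically with Simpson's rule and then bisects for the value of z that leaves an area of 1 − α in the middle. The names `t_pdf`, `central_area` and `critical_value` are made up for this example, and it assumes ν is below roughly 340 so that `math.gamma` does not overflow.

```python
import math

def t_pdf(x, nu):
    """Density of the t-distribution with nu degrees of freedom at x."""
    norm = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return norm * (1 + x * x / nu) ** (-(nu + 1) / 2)

def central_area(z, nu, steps=2000):
    """Area under the t density between -z and +z (Simpson's rule, steps must be even)."""
    # The density is symmetric about zero, so integrate from 0 to z and double.
    h = z / steps
    total = t_pdf(0.0, nu) + t_pdf(z, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 2 * total * h / 3

def critical_value(alpha, nu):
    """Find z such that the central area equals 1 - alpha, by bisection."""
    hi = 1.0
    while central_area(hi, nu) < 1 - alpha:  # grow the bracket until it contains z
        hi *= 2
    lo = 0.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if central_area(mid, nu) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(critical_value(0.05, 4))    # n = 5 samples, so nu = 4
print(critical_value(0.05, 100))
```

The first value should land close to the 2.776 found in printed t tables for 4 degrees of freedom, and the second should already be near the large-sample limit.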


▶ This lets us say some very powerful things about the location of the true population mean given the sample we obtained.

▶ For example, I can tell you that zα/2 for a large number of samples (ν → ∞) at the 5% significance level is approximately 1.96.

▶ This lets me say that the 95% confidence limits for the population mean µ are given by

$$\bar{x} - 1.96\,\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\,\frac{s}{\sqrt{n}}$$
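The 1.96 can be checked without any tables, because as ν → ∞ the t-distribution becomes the standard Gaussian. A minimal Python sketch, assuming a made-up sample of 500 heights (the numbers 1650 and 90 are purely illustrative, not data from this lecture):

```python
import math
import random
from statistics import NormalDist, mean, stdev

# Upper 2.5% point of the standard Gaussian: the large-sample value of z_{alpha/2}.
z = NormalDist().inv_cdf(1 - 0.05 / 2)

# Hypothetical sample: 500 heights in mm drawn from a Gaussian.
rng = random.Random(1)
heights_mm = [rng.gauss(1650, 90) for _ in range(500)]

x_bar = mean(heights_mm)
half_width = z * stdev(heights_mm) / math.sqrt(len(heights_mm))
lower, upper = x_bar - half_width, x_bar + half_width

print(f"z = {z:.3f}")
print(f"95% confidence limits: {lower:.1f} mm to {upper:.1f} mm")
```

For a small sample the 1.96 should be replaced by the appropriate t value, which is larger.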


This means that if I repeated everything again and again, computing a fresh interval each time, 95% of those intervals would contain the population mean. In other words, a (1 − α) × 100% confidence interval is an interval calculated using a procedure such that it will contain the true value (1 − α) × 100% of the times you use it, but the rest of the time you will be unlucky.
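This "repeat everything again and again" reading can be tried out by simulation. The sketch below is an illustration under stated assumptions: the population is Gaussian with a mean and standard deviation I have invented, each experiment has n = 10 samples, and the value 2.262 is the tabulated two-tailed 5% t value for 9 degrees of freedom, hard-coded to keep the example short.

```python
import math
import random
from statistics import mean, stdev

T_CRIT_DF9 = 2.262  # tabulated t value for alpha = 0.05 (two-tailed), nu = 9

def coverage(trials=2000, n=10, mu=1650.0, sigma=90.0, seed=0):
    """Fraction of simulated 95% confidence intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        half_width = T_CRIT_DF9 * stdev(sample) / math.sqrt(n)
        # The interval is mean(sample) +/- half_width; count it if mu lies inside.
        if abs(mean(sample) - mu) <= half_width:
            hits += 1
    return hits / trials

print(coverage())
```

The printed fraction should sit close to 0.95, and the remaining intervals are the "unlucky" ones.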


In R, it’s easy to generate a whatever-precision-you-like confidence limit, e.g. for the upper 95% limit:

mean(data) + qt(0.975, df=length(data)-1) * sd(data) / sqrt(length(data))


The Standard Error on the Mean

If we set zα/2 to one we obtain an estimator for the standard error on the mean or SEM, which is the standard deviation of the mean’s sampling distribution. I.e., I mean this:

[Figure: histogram of height (mm) against number of people, alongside the sampling distribution of the mean height (mm), whose 1σ width is labelled as the SEM.]


In other words,

$$\mathrm{SEM} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}$$

(This definition is a bit woolly – everyone in the biosciences uses s, which estimates σ.)

Note that as n → ∞ the SEM tends to 0, i.e. the sample mean tends to the population mean.

Not all authors of papers seem to appreciate this point!
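A quick way to see the 1/√n behaviour is to compute s/√n for larger and larger subsets of one sample. This Python sketch uses invented Gaussian "heights" (mean 1650 mm, standard deviation 90 mm, both hypothetical), and the helper name `sem` is my own:

```python
import math
import random
from statistics import stdev

def sem(sample):
    """Estimated standard error on the mean: s / sqrt(n)."""
    return stdev(sample) / math.sqrt(len(sample))

# Hypothetical measurements: 10,000 heights in mm.
rng = random.Random(2)
heights_mm = [rng.gauss(1650, 90) for _ in range(10_000)]

for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:>6}: SEM = {sem(heights_mm[:n]):.2f} mm")
```

The standard deviation s stays near 90 mm however many samples are taken, while the SEM shrinks by roughly a factor of √10 at each step.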


Summary so far

▶ Estimators based on a finite sample size are inherently uncertain.

▶ The sampling distribution for an estimator tells us about that uncertainty.

▶ It turns out that the sampling distribution of the sample mean is related to the t-distribution.

▶ (I haven’t discussed this, but the sampling distribution of the variance is related to something called the χ² distribution.)

▶ One can use knowledge of this to construct confidence limits on the mean.


The t-test


Factors

Now, let’s go back to the height histogram:

[Figure: histogram of height (mm) against number of people, shown first for everyone together and then split by sex (M, F).]


▶ In the language of statistics sex – here a variable relevant to the quantity at hand – is called a factor, and its levels are “male” and “female”. Colloquially we may refer to them as groups.

▶ Note that we can’t “rank” or “order” levels; consequently sex is called a categorical variable.

▶ (Sometimes we do have categorical variables that we might be able to rank, like that classic ‘Strongly agree, Agree, ..., Strongly Disagree’ scale that you might have seen before. These are known as ordinal variables, as they can be ordered.)


Because we often have many factors that may influence a particular experiment, it’s much more common to see factors plotted on an x axis, e.g. in a box plot:

[Figure: example box plot of Value (axis from 75 to 95), annotated with: largest non-extreme value (typically 1.5 × IQR); upper quartile; median; lower quartile; smallest non-extreme value; extreme value.]
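The quantities a box plot draws can be computed by hand. The Python sketch below is one possible convention, not necessarily the one used to draw the slide: quartiles come from `statistics.quantiles` with the "inclusive" method, and a point counts as extreme if it lies more than 1.5 × IQR beyond a quartile. The data values and the name `box_plot_summary` are hypothetical.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return the numbers a box plot displays, using the 1.5 x IQR rule for extremes."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "smallest_non_extreme": inside[0],   # end of the lower whisker
        "lower_quartile": q1,
        "median": median,
        "upper_quartile": q3,
        "largest_non_extreme": inside[-1],   # end of the upper whisker
        "extreme_values": [v for v in data if v < low_fence or v > high_fence],
    }

values = [76, 80, 82, 84, 85, 86, 88, 90, 110]  # made-up measurements
for name, number in box_plot_summary(values).items():
    print(f"{name}: {number}")
```

Different software packages interpolate quartiles slightly differently, so small discrepancies against a plotted box are to be expected.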

[Figure: box plot of height (mm) by sex (M, F).]

[Figure: box plot of median thyrotropin level (mU/liter) for four groups: no thyroxine treatment; thyroxine treatment with low thyrotropin level; treatment with thyroxine and omeprazole; treatment with higher-dose thyroxine and omeprazole.]


Statistical tests

A common question (asked by Student and many other people since!) is as follows:

H1: Given that I have measured a set of samples in both groups, is there evidence that there is a difference between the population means of the two groups?

H0: Or could my samples all come from one similar underlying distribution? (It’s always important to consider the case where nothing happens!)

[Figure: distribution of height (mm), split by sex (M, F).]


One-sample Student’s t-test

▶ The way we deal with this is by going back to the expression for Z we had earlier – we know that $\frac{\bar{x} - \mu}{s/\sqrt{n}}$ is t-distributed with n − 1 degrees of freedom.

▶ So, if we want to test the hypothesis that our sample comes from a population whose mean is some specified value µ, we just compute

$$\frac{\bar{x} - \mu}{s/\sqrt{n}}.$$

Since we know that this quantity is t-distributed, and we have the ability to look up values of zα, we can see if this is very likely – i.e. obtain a value p that represents the probability, if the hypothesis is true, of observing a value at least as extreme as the one observed.


▶ More “extreme” values of observed Z are larger in magnitude, and are less likely to occur.

▶ Here, “more extreme” means having a Z-value at least as great in magnitude (at least as far from zero) as the observed Z-value. This is called a two-tailed test, as I’m interested in both tails of the t-distribution.

▶ If, a priori, I have a good reason to know that an effect can only possibly exist in one direction, I can do a one-tailed test. This is discouraged.


An example

▶ Suppose I measure the plasma iron concentration in five people with a particular SNP in the gene called HBB, which codes for a protein whose absence or reduction is known to cause thalassemia, a form of anaemia that arises because blood cells are destroyed.

▶ The “reference range” for a normal healthy adult is 11–32 µmol l⁻¹, reflecting the fact that plasma iron can change due to physiological reasons in different people.

▶ I measure their plasma iron concentration as being 42, 34, 48, 45, and 55 µmol l⁻¹.

▶ Is this different from the known population maximum value of µ = 32 µmol l⁻¹?


▶ So, I obtain x̄ = 44.8 and s = 7.73 µmol l⁻¹ as estimators for the population mean (µ1) and SD (σ1) of the iron concentration.

▶ I can then formally state the hypothesis that I am testing:

H0: The sample is drawn from the healthy population: µ1 = µ

H1: They’re different: µ1 ≠ µ

▶ I then compute

$$Z = \frac{\bar{x} - 32}{s/\sqrt{n}} \approx 3.70$$
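For anyone who wants to check the arithmetic, these are the intermediate steps, using only the five measurements quoted above (n = 5, with the usual n − 1 in the sample variance):

$$\bar{x} = \frac{42 + 34 + 48 + 45 + 55}{5} = 44.8$$

$$s^2 = \frac{(-2.8)^2 + (-10.8)^2 + 3.2^2 + 0.2^2 + 10.2^2}{5 - 1} = \frac{238.8}{4} = 59.7, \qquad s \approx 7.73$$

$$Z = \frac{44.8 - 32}{7.73/\sqrt{5}} \approx \frac{12.8}{3.46} \approx 3.70$$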


▶ Looking this up in either a big table of t values (or using a computer) for the distribution with 4 degrees of freedom (i.e. 5 − 1) I find that this corresponds to a p value of 0.021. (In a printed table, the measured Z ≈ 3.7 implies that p is between 0.05 and 0.02.)

▶ This is less than the commonly-used significance threshold of 0.05 – in other words, a sample mean this far from 32 µmol l⁻¹ would be unlikely if the null hypothesis were true.

▶ I can therefore state that, as p < 0.05, we reject the null hypothesis at the 5% level and conclude that those people with the SNP in question are likely to have higher plasma iron levels than the reference range.

▶ In R we’d do this more concisely as: t.test(data, mu=32).
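To show what a one-sample test involves under the hood, here is a hedged Python sketch (standard library only, and not how R's `t.test` is implemented) that computes Z from the five iron measurements and obtains the two-tailed p value by numerically integrating the t density with Simpson's rule. The function names `t_pdf`, `two_tailed_p` and `one_sample_t` are invented for this illustration.

```python
import math
from statistics import mean, stdev

def t_pdf(x, nu):
    """Density of the t-distribution with nu degrees of freedom at x."""
    norm = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return norm * (1 + x * x / nu) ** (-(nu + 1) / 2)

def two_tailed_p(t_stat, nu, steps=2000):
    """Probability of a value at least as far from zero as t_stat, in either tail."""
    z = abs(t_stat)
    if z == 0:
        return 1.0
    # Simpson's rule for the area between 0 and z; double it for the central area.
    h = z / steps
    total = t_pdf(0.0, nu) + t_pdf(z, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    central = 2 * total * h / 3
    return 1 - central

def one_sample_t(sample, mu0):
    """Return (Z, p) for the null hypothesis that the population mean equals mu0."""
    n = len(sample)
    t_stat = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))
    return t_stat, two_tailed_p(t_stat, n - 1)

iron = [42, 34, 48, 45, 55]  # plasma iron, umol/l, from the example above
t_stat, p = one_sample_t(iron, 32)
print(f"Z = {t_stat:.2f}, p = {p:.3f}")
```

This should reproduce the values quoted above, Z ≈ 3.70 and p ≈ 0.021, to the precision shown.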


Spoilers ahead

If you have already covered this material before, and are waiting for me to start talking about all the assumptions and problems of t-tests, and say words like ‘Type I error’, that’s the subject of the next lecture.

For now, let’s extend this machinery to compare two different groups of samples, and ask what evidence there is that their means are different.

Two-sample Student’s t-tests

[Figure: number of people against height (mm), grouped by sex (M, F).]

Two-sample Student’s t-tests

▶ The general approach is similar, but things are more complicated because we don’t know either mean exactly.

▶ Moreover, we also don’t know either standard deviation!

Group A        Parameter                    Group B
$\bar{x}_A$    Sample mean                  $\bar{x}_B$
$s_A$          Sample standard deviation    $s_B$
$n_A$          Number of samples            $n_B$

Two-sample Student’s t-test

▶ It turns out that there are two fundamentally different ways of interpreting the two different estimates for population variance that the two groups give us.

▶ We can either assume that both groups have the same variance, or, unsurprisingly, that they have different variances.

Equal variance two-sample Student’s t-test

If the two groups have the same mean, then the difference of $\bar{x}_A$ and $\bar{x}_B$ should be, on average, zero.

It turns out that we can construct a pooled estimate of the standard deviation, if we assume that it’s common to both groups. This estimate is

$$
s_p = \sqrt{\frac{(n_A - 1)\,s_A^2 + (n_B - 1)\,s_B^2}{n_A + n_B - 2}}
$$
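To make the pooled estimate concrete, here is a minimal sketch in R. The vectors a and b are hypothetical measurements invented for illustration; the variable names are assumptions, not part of the lecture.

```r
# Hypothetical measurements for two groups (illustrative values only).
a <- c(5.1, 4.8, 5.6, 5.3, 4.9)
b <- c(5.9, 6.2, 5.7, 6.4, 6.0, 5.8)

n_a <- length(a)
n_b <- length(b)
s_a <- sd(a)
s_b <- sd(b)

# Pooled standard deviation: the two sample variances are averaged,
# each weighted by its degrees of freedom (n - 1).
s_p <- sqrt(((n_a - 1) * s_a^2 + (n_b - 1) * s_b^2) / (n_a + n_b - 2))
print(c(s_a = s_a, s_b = s_b, s_p = s_p))
```

Because the pooled variance is a weighted average of the two sample variances, s_p always lies between s_a and s_b.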

Equal variance two-sample Student’s t-test

We therefore construct the quantity

$$
\frac{\bar{x}_A - \bar{x}_B}{s_p \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}},
$$

which is very much like before – it’s t-distributed, but with $n_A + n_B - 2$ degrees of freedom.

To test $H_1$: the groups A and B have different population means, we plug the numbers in and compare the value we get to the t-distribution.
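The equal-variance test can be sketched by hand and checked against R's built-in function. The data are hypothetical values made up for illustration; var.equal = TRUE is the t.test() argument that requests the pooled (Student's) version rather than the default.

```r
# Hypothetical measurements for two groups (illustrative values only).
a <- c(5.1, 4.8, 5.6, 5.3, 4.9)
b <- c(5.9, 6.2, 5.7, 6.4, 6.0, 5.8)
n_a <- length(a)
n_b <- length(b)

# Pooled standard deviation, assuming a common population variance.
s_p <- sqrt(((n_a - 1) * var(a) + (n_b - 1) * var(b)) / (n_a + n_b - 2))

# Test statistic and its degrees of freedom.
t_obs <- (mean(a) - mean(b)) / (s_p * sqrt(1 / n_a + 1 / n_b))
dof   <- n_a + n_b - 2
p_val <- 2 * pt(-abs(t_obs), dof)   # two-sided p value

# The same test using the built-in function.
res <- t.test(a, b, var.equal = TRUE)
print(c(t = t_obs, df = dof, p = p_val))
```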

Unequal variance two-sample Welch’s t-test

If we instead take $s_A$ and $s_B$ to be estimators of fundamentally different population variances, then the picture is more complex. (This approach was originally described by B. L. Welch in 1947, in Biometrika, 34, 28–35.)

Here, the quantity in question is

$$
Z = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}
$$

Unequal variance two-sample Welch’s t-test

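There’s a catch, however: this isn’t t-distributed “nicely”.

It’s approximately t-distributed with

$$
\frac{\left(\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}\right)^2}{\frac{\left(s_A^2 / n_A\right)^2}{n_A - 1} + \frac{\left(s_B^2 / n_B\right)^2}{n_B - 1}}
$$

degrees of freedom (!)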
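The unwieldy expression is easy to evaluate numerically. The sketch below computes Z and the degrees of freedom for hypothetical data (values invented for illustration) and compares them with what t.test() reports by default, which is Welch's test.

```r
# Hypothetical measurements for two groups (illustrative values only).
a <- c(5.1, 4.8, 5.6, 5.3, 4.9)
b <- c(5.9, 6.2, 5.7, 6.4, 6.0, 5.8)
n_a <- length(a)
n_b <- length(b)

# Per-group squared standard errors, s^2 / n.
v_a <- var(a) / n_a
v_b <- var(b) / n_b

# Welch's statistic and its (generally non-integer) degrees of freedom.
z     <- (mean(a) - mean(b)) / sqrt(v_a + v_b)
dof_w <- (v_a + v_b)^2 / (v_a^2 / (n_a - 1) + v_b^2 / (n_b - 1))

# t.test() uses Welch's test unless var.equal = TRUE is given.
res <- t.test(a, b)
print(c(Z = z, df = dof_w))
```

The degrees of freedom never exceed the pooled value of n_a + n_b - 2, which is one way of seeing that dropping the equal-variance assumption costs a little certainty.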

Nevertheless...

▶ This is just a quantity that can be calculated, so the observed t-value can be compared with the theoretical t-value for that number of degrees of freedom.

▶ We can therefore perform a hypothesis test as before, and choose to reject (or not) the null hypothesis that the population means are the same at some chosen significance level.

Paired t-tests

One other very powerful trick is to perform repeated experiments on the same subject – for example, measuring a quantity with and without administration of a drug within a number of patients.

We’re then interested in changes, and in whether the mean difference between the paired measurements is distinct from zero. As we obtain data in pairs, this is known as a paired test – and it can have more statistical power.
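A paired test amounts to a one-sample t-test on the within-subject differences, as this small sketch illustrates. The before and after vectors are hypothetical measurements on the same six patients, invented for illustration.

```r
# Hypothetical paired measurements on the same six patients (illustrative values only).
before <- c(12.1, 10.4, 11.8, 13.0, 9.7, 11.2)
after  <- c(13.0, 11.1, 12.9, 13.4, 10.9, 12.0)

# Work with the per-patient changes rather than the two groups separately.
d     <- after - before
t_obs <- mean(d) / (sd(d) / sqrt(length(d)))
dof   <- length(d) - 1

# Equivalent built-in calls: a paired test, or a one-sample test on the differences.
res_paired <- t.test(after, before, paired = TRUE)
res_diff   <- t.test(d, mu = 0)
print(c(t = t_obs, df = dof, p = res_paired$p.value))
```

Pairing removes the patient-to-patient variation from the comparison, which is where the extra statistical power comes from.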

In practice...

[Figure: number of people against height (mm), grouped by sex (M, F).]


> t.test(x=males, y=females)

        Welch Two Sample t-test

data:  males and females
t = 3.893, df = 9.9081, p-value = 0.003047
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  53.23589 196.14649
sample estimates:
mean of x mean of y
 1751.374  1626.683
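As a sanity check on how to read this output, the p-value and confidence interval can be approximately recovered from the reported t, df and sample means. This is only a sketch: the inputs are the rounded figures printed above, so the results agree with the output to within rounding, and the variable names are illustrative.

```r
# Figures copied from the t.test() output above (rounded as printed).
t_obs  <- 3.893
dof    <- 9.9081
mean_x <- 1751.374   # males
mean_y <- 1626.683   # females

# Two-sided p value: the area in both tails of the t-distribution beyond |t|.
p_val <- 2 * pt(-abs(t_obs), dof)

# The standard error follows from t = difference / standard error.
d_mean <- mean_x - mean_y
se     <- d_mean / t_obs

# 95% confidence interval for the difference in means.
ci <- d_mean + c(-1, 1) * qt(0.975, dof) * se
print(p_val)
print(ci)
```

The interval excludes zero, which is the same conclusion as p < 0.05: the data are unlikely under the null hypothesis of equal mean heights.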

Quick summary & Spoilers

▶ The sample mean and standard deviation provide estimates for the population mean and standard deviation.

▶ The standardised (or “Studentised”) constructs I’ve shown you all have the same distribution – a t-distribution if the data are drawn from a normal distribution.

▶ We can use this to infer whether or not two groups of measurements are likely to have been drawn from one population with one mean.

▶ Next time, I’ll talk a lot about the perils of the t-test, and what happens if your data are not normally distributed.

The end!
