EM Stats Reading Group

Housekeeping

The text: Field, Discovering Statistics Using SPSS, 3rd Edition

The software: SPSS, version [xxx], available through MCIT

Session 1 (today): Chapters 1 and 2

Homework: Chapters 3 and 4

▪ Installing SPSS and making graphs

Session 2 (to be scheduled): Chapter 5

▪ Testing assumptions

Today

Chapter One

Nuts and bolts
▪ Types of variables
▪ Levels of measurement
▪ Measurement error
▪ Validity and reliability
▪ Observation and experiment

Frequency distributions

Chapter Two

Population and sample

Summary statistics

The Central Limit Theorem

Test statistics

Variables, Parameters, and Statistics

Variable

Usually, a feature of some subject that you measure
▪ E.g., the speed of a car

Sometimes, something you infer from indirect measurements
▪ E.g., your intelligence

Synonymous with attribute

Parameter

A number that relates two variables
▪ E.g., the displacement of an accelerator and the speed of a car
▪ E.g., your annual income and your intelligence

$$\text{Variable}_2 = \text{Parameter} \times \text{Variable}_1$$

$$\text{Variable}_2 = f(\text{Variable}_1, \text{Parameter})$$


Statistic

A measure of some feature of a sample of a variable

E.g.,
▪ Collect the height of everyone in this room
▪ Calculate the mean height
▪ The mean is a summary statistic and an estimate of the mean height of the population in general

E.g.,
▪ Collect the height of everyone downstairs
▪ Compare to the height of everyone in this room with a t test
▪ The t value is a test statistic

A statistic does not necessarily reflect any measurable feature of an individual
▪ E.g., you do not have a mean height


Independent and dependent variables

An independent variable is one that you can adjust
▪ E.g., the thermostat setting in this room

A dependent variable is one (potentially) driven by the independent variable
▪ E.g., the temperature in this room (likely related)
▪ E.g., the price of gold this morning (likely not related)

Sometimes defined in the opposite way

Not useful terminology: probably best to just drop these terms

Predictor and outcome variables

Same idea, more clearly stated

Careful to note that this is not a statement of causality

Levels of Measurement

Categorical Variables

Variable may take only some finite (and usually small) number of values

E.g.,
▪ Binary variable: Only two possible states: Heads/Tails, Yes/No, True/False
▪ Nominal variable: More than two states: Red/Violet/Green/Yellow
▪ Ordinal variable: Multiple states, ordered:
▪ Annual income: ≤ $10,000; $10,001 - $30,000; $30,001 - $50,000; > $50,000
▪ Red/Yellow/Green/Violet…?

Continuous Variables

E.g.,
▪ Interval variable: Like an ordinal variable, but the distance between any two adjacent states is assumed to be equal
▪ Rate your pain on a scale of 1 to 10
▪ Ratio variable: Ratios between states must also ‘hold up’
▪ Middle ‘C’ on a piano is ½ the frequency of the ‘C’ one octave above it

Levels of Measurement

The distinctions can get blurry

Some ‘two state’ systems may have more states
▪ E.g., Heads; Tails; Coin rolled under Fridge

Most ‘ratio data’ aren’t really
▪ No experimental instrument can provide infinitely resolute measurements
▪ E.g., ED oral thermometers only have 3-4 significant figures
▪ While your temperature can be 39.0304327… °C, your measured temp can only be 39.0 or 39.1

Many attributes of a patient, sprocket, etc., can be measured using different scales

When possible, go with the ratio method

Most test statistics are more ‘efficient’ for ratio data
▪ They detect differences with fewer observations

Measurement Error

For now we see through a glass, darkly…

The things you want to measure cannot be absolutely measured

In the course of statistical analysis, there are many sources of ‘error’

Error does not imply a mistake

Measurement error is the discrepancy between the value of the feature you’re evaluating and the measurement you take of it.

All measurement is associated with error.

In research, an important goal is to make the measurement error as small as economically feasible, or at least much, much smaller than the natural variability in the feature you are measuring.

Validity and Reliability

Validity

Is your measuring scheme capturing the feature you’re interested in?
▪ Does the measurement change in some predictable way with a change in the feature?
▪ Does the measurement not change when things other than the feature are changed?

Many different types of validity have been described (e.g., content validity, etc.)

Reliability

Does the measuring scheme produce the same results when applied repeatedly to the same experimental condition?

Study Design: Observation vs. Experiment

Observational studies aim to take measurements without influencing the system under examination

Typically easier than experiments

Sometimes the only means feasible for studying a problem

Some study designs based on observation are referred to as quasiexperimental methods

Establishing causality in observational studies is not possible.

Experimental studies take measures on a system that the investigator is intentionally perturbing

When well-designed, these methods are typically more powerful than observations

May get you closer to establishing causality

Frequency Distributions

A frequency distribution is a tabulation of values taken by a sample of some variable under study (if graphed, it’s called a histogram)

In statistical analysis, you will usually approximate your empirical distribution with a probability distribution
▪ E.g., Normal (Gaussian), t, Beta, Exponential, Gamma distributions

Swapping your actual distribution with one of these allows you to use statistical tools which are well behaved and thoroughly understood

Always remember that when you assume your data follow some distribution, you must do some homework to make sure that assumption is true! (See Chapter 5)
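A frequency distribution can be built directly from raw observations. The sketch below is only an illustration: it simulates tossing a pair of fair dice with Python's standard library (the seed, the function name `toss_dice_pairs`, and the text histogram are assumptions, not part of the slides) and tabulates how often each total occurs.

```python
import random
from collections import Counter

def toss_dice_pairs(n_tosses, seed=0):
    """Simulate n_tosses of a pair of fair dice and tabulate the totals."""
    rng = random.Random(seed)  # seeded so the tabulation is reproducible
    totals = (rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_tosses))
    return Counter(totals)

freq = toss_dice_pairs(1000)
for result in sorted(freq):
    # One row per observed total: the count plus a crude text histogram bar
    print(f"{result:>2}  {freq[result]:>4}  {'#' * (freq[result] // 5)}")
```

The printed rows are the frequency distribution; the bars are the same tabulation drawn as a histogram.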


An attractive feature of defined probability distributions as a means of describing a frequency distribution is that they can usually be fully described by just a few parameters.

Result   Count
2        27
3        43
4        86
5        121
6        148
7        178
8        137
9        108
10       74
11       50
12       28


[Figure: 1000 Tossed Dice Pairs. x-axis: Result of Toss; y-axis: Fraction of All Tosses]

Mean (μ) = 6.96
Standard Deviation (σ) = 2.35

These are NOT your data – they are a model of your data!
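As a quick check, the reported mean and standard deviation can be recomputed from the tabulated counts for the 1000 tossed dice pairs. This is a minimal sketch in Python; the use of an n − 1 denominator for the variance is an assumption about how the slide's value was obtained.

```python
import math

# Counts copied from the Result/Count table for 1000 tossed dice pairs
counts = {2: 27, 3: 43, 4: 86, 5: 121, 6: 148, 7: 178,
          8: 137, 9: 108, 10: 74, 11: 50, 12: 28}

n = sum(counts.values())
mean = sum(result * count for result, count in counts.items()) / n
# Sample variance with an n - 1 denominator (assumed convention)
variance = sum(count * (result - mean) ** 2
               for result, count in counts.items()) / (n - 1)
sd = math.sqrt(variance)
print(f"n = {n}, mean = {mean:.2f}, SD = {sd:.2f}")
```

Two numbers now stand in for eleven rows of counts, which is the sense in which the fitted distribution is a model of the data rather than the data themselves.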


Normal (Gaussian) distribution

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Common, but not universal

Described with 2 parameters
▪ Mean (μ)
▪ Standard deviation (σ)

Turns out it’s really easy to calculate these two parameters given a set of data
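To make the two-parameter description concrete, here is a small sketch of the normal density in Python, evaluated with the mean and standard deviation reported for the dice data. The function name `normal_pdf` and the choice of which outcomes to compare are illustrative assumptions. Because the dice totals are one unit apart, the density at each total is roughly comparable to the observed fraction of tosses.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and SD sigma at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

mu, sigma = 6.96, 2.35  # values reported for the 1000 tossed dice pairs
# Observed fractions are counts / 1000 from the Result/Count table
observed_fraction = {2: 0.027, 7: 0.178, 12: 0.028}
for result, observed in observed_fraction.items():
    model = normal_pdf(result, mu, sigma)
    print(f"result {result:>2}: observed {observed:.3f}, model {model:.3f}")
```

The model values will not match the observed fractions exactly; they are what the normal curve with these two parameters predicts.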


Finding the mean and standard deviation for a normal distribution

Mean

$$\mu_x = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Variance

$$\sigma_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu_x)^2$$

Standard Deviation

$$\sigma_x = \sqrt{\sigma_x^2}$$
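The three quantities can be computed in a few lines. The sketch below implements them from scratch in Python with an n − 1 variance denominator and cross-checks against the standard library's `statistics` module; the height values are hypothetical and exist only to exercise the code.

```python
import math
import statistics

def mean_variance_sd(xs):
    """Return (mean, sample variance, sample SD) using an n - 1 denominator."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, variance, math.sqrt(variance)

heights_cm = [162.0, 170.5, 158.2, 181.3, 175.0, 168.4]  # hypothetical data
mean, variance, sd = mean_variance_sd(heights_cm)

# Cross-check against the standard library implementations
assert math.isclose(mean, statistics.mean(heights_cm))
assert math.isclose(sd, statistics.stdev(heights_cm))
print(f"mean = {mean:.1f} cm, variance = {variance:.1f}, SD = {sd:.1f} cm")
```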


A counterexample: The time to arrival of the next patient in triage

A sample from the UM ED was taken during a period of brisk activity – 10 new patients an hour

The question is asked – given a patient arriving at t = 0, what’s the range of likely times until the next patient arrives? E.g., how many minutes do I have to get this patient triaged?

[Figure: Time to Next Patient (x-axis) against Probability (y-axis)]


Given 10 arrivals per hour, you know the average time to the next patient should be around 6 minutes.

As a first approximation, a normal distribution is chosen.

▪ Mean = 9.9 minutes
▪ SD = 9.3 minutes

The normal distribution does a very poor job at estimating the distribution of arrival times.
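One way to see why the normal model fits poorly is to ask how much probability it places on impossible values. This sketch uses Python's `statistics.NormalDist` with the fitted mean and SD quoted above; treating "time below zero" as the diagnostic is an illustrative choice, not something stated in the slides.

```python
from statistics import NormalDist

# Normal model fitted to the arrival data: mean 9.9 minutes, SD 9.3 minutes
model = NormalDist(mu=9.9, sigma=9.3)

# Probability the model assigns to a negative waiting time, which cannot occur
p_negative = model.cdf(0.0)
print(f"P(time to next patient < 0) under the normal model: {p_negative:.3f}")
```

A model that puts a sizeable share of its probability below zero minutes cannot describe waiting times well, however convenient it is.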




An alternative distribution

The exponential distribution
▪ Describes the data very nicely
▪ The mean and standard deviation have different forms and are derived from the intensity of the process – the number of cases per hour, usually abbreviated with λ




Finding the mean and standard deviation for an exponential distribution

$$\mu = \frac{1}{\lambda} \qquad \sigma^2 = \frac{1}{\lambda^2} \qquad \sigma = \frac{1}{\lambda}$$

Weird: The mean and the standard deviation are the same…

With the exponential distribution, you only need one number (λ) to describe the whole thing.
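The claim that the mean and standard deviation coincide can be checked by simulation. The sketch below draws waiting times from an exponential distribution with λ set to 10 arrivals per hour (expressed per minute), using Python's `random.expovariate`; the seed and the number of draws are arbitrary assumptions.

```python
import random
import statistics

rate_per_minute = 10 / 60  # lambda: 10 arrivals per hour, expressed per minute
rng = random.Random(1)     # seeded so the run is reproducible

# Simulated waiting times (minutes) until the next arrival
waits = [rng.expovariate(rate_per_minute) for _ in range(20_000)]

sample_mean = statistics.mean(waits)
sample_sd = statistics.stdev(waits)
print(f"1/lambda = {1 / rate_per_minute:.1f} min")
print(f"simulated mean = {sample_mean:.2f} min, simulated SD = {sample_sd:.2f} min")
```

Both simulated values should land close to 1/λ, which is the sense in which a single number describes the whole distribution.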


For Discrete Variables (e.g., count data)
▪ Bernoulli
▪ Rademacher
▪ Binomial
▪ Hypergeometric
▪ Poisson

For Continuous Variables
▪ Beta
▪ Von Mises-Fisher
▪ Chi squared
▪ Exponential
▪ Gamma
▪ Log-normal
▪ Normal
▪ Weibull

Many, many more.

Quantum mechanics is based on probability distributions over complex numbers (with real and imaginary parts)


Take home point:

The frequency distribution of your data is just a tabulation of your observations

A probability distribution function is a mathematical tool that you can use to reduce a lot of observations into a small number of summary statistic values to describe your data and perform analysis

The mean and standard deviation are two common summary statistics
▪ They will be calculated differently depending on which distribution you choose

Many probability functions exist
▪ Your choice will depend on the problem at hand and available software

Populations and Samples

A population is the entire set of individuals about which you want to learn something and to which you may want to generalize.

In general, it is assumed that a true census – an observation of every member of a population – is not possible

Thus, the actual features of a population are essentially unknowable.

A sample is a subset of a population selected for measurement

A random sample is one in which every member of the population has the possibility of joining

A uniform random sample is one in which every member of the population has an equal probability of joining

Central themes in statistics include:
▪ Collecting representative samples
▪ Choosing appropriate summary measures to describe those samples
▪ Quantifying the likely discrepancy between one’s summary measures and the true values in the population (e.g., quantifying confidence)
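A uniform random sample is straightforward to draw in software. As a minimal sketch, assuming a hypothetical population of 10,000 numbered patient records, Python's `random.sample` selects without replacement so that every record has the same chance of being chosen.

```python
import random

# Hypothetical population: record numbers 1..10000 (illustrative only)
population = list(range(1, 10_001))

rng = random.Random(42)  # seeded so the draw is reproducible
# Uniform random sample without replacement: every record is equally likely
sample = rng.sample(population, k=50)

print(f"sampled {len(sample)} of {len(population)} records")
print(f"first five record numbers drawn: {sample[:5]}")
```

In practice the hard part is having a list of the whole population to sample from, which is exactly what is usually missing.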

Summary Statistics

Central tendency

Median: The 50th percentile value – half of observations are greater than, half are less than, the median value

Mode: The most frequent single observation

Mean: The calculated summary statistic based on the probability distribution under consideration

[Figure: Faculty Comments Per Faculty Meeting. y-axis: Number of Faculty]

Median: 8
Mode: 8
Mean: 8

Summary Statistics

Outlier

An observation that appears to deviate greatly from other members of the sample

Effect of outlier on central tendency summary statistics:
▪ Median – relatively resistant to outliers
▪ Mode – relatively resistant to outliers, but a sample may have more than one mode
▪ Mean – resistance to outliers is proportional to the sample size – the mean of a small sample may be swayed dramatically by an outlier


[Figure: faculty comments plot with the outlier included. y-axis: Number of Faculty]

Median: 8
Mode: 8
Mean: 10
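The effect of a single outlier on the three measures can be reproduced with a toy sample. The numbers below are hypothetical, chosen only so that the median, mode, and mean start at 8 and the mean moves to 10 once an outlier is added; they are not the data behind the plots.

```python
import statistics

# Hypothetical comment counts for nine faculty members (not the slide's data)
comments = [5, 6, 7, 8, 8, 8, 9, 10, 11]
# The same sample with one very talkative colleague added
with_outlier = comments + [28]

for label, data in (("without outlier", comments), ("with outlier", with_outlier)):
    print(f"{label:>16}: median = {statistics.median(data)}, "
          f"mode = {statistics.mode(data)}, mean = {statistics.mean(data)}")
```

The median and mode stay put while the mean shifts, and with only ten observations one value is enough to move it by two full comments.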

Summary Statistics

Dispersion

Interquartile range: A pair of numbers – the 25th and 75th percentiles of the sample

Standard Deviation: Calculated value based on the distribution under consideration

Confidence intervals: Variation on the standard deviation


[Figure: faculty comments plot. y-axis: Number of Faculty]

IQR: 6, 9
SD: 2.3

Summary Statistics

Effect of outliers on dispersion summary statistics

IQR: Relatively resistant to outliers

SD, CI: Potentially very sensitive to outliers. As with the mean, the extent is a function of the sample size (small samples are very susceptible).


[Figure: faculty comments plot with the outlier included. y-axis: Number of Faculty]

IQR: 7, 10
SD: 6.4
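The same contrast can be sketched for dispersion. The sample below is hypothetical (it is not the data behind the plots), and the quartiles come from Python's `statistics.quantiles`, whose default interpolation rule may differ slightly from the one SPSS uses.

```python
import statistics

# Hypothetical comment counts (not the slide's data), with and without an outlier
comments = [5, 6, 7, 8, 8, 8, 9, 10, 11]
with_outlier = comments + [28]

def dispersion(data):
    """Return (25th percentile, 75th percentile, sample SD)."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return q1, q3, statistics.stdev(data)

q1_a, q3_a, sd_a = dispersion(comments)
q1_b, q3_b, sd_b = dispersion(with_outlier)
print(f"without outlier: IQR = {q1_a}, {q3_a}; SD = {sd_a:.1f}")
print(f"with outlier:    IQR = {q1_b}, {q3_b}; SD = {sd_b:.1f}")
```

The quartiles shift only a little while the SD grows several-fold, the same pattern the plots show.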

Summary Statistics


[Figure: faculty comments plot with two fitted curves. y-axis: Number of Faculty. Curves: Normal distribution, excluding outlier; Normal distribution, including outlier]

Choosing Summary Statistics to Use

Distribution-Based Statistics

Pros:
▪ Commonly used
▪ Provide a natural link to statistical tests and methods to be developed later

Cons:
▪ Are only as good as the choice of probability distribution
▪ May significantly misrepresent the data when outliers are present

Distribution-Free (i.e., Non-parametric) Statistics

Pros:
▪ Commonly used
▪ Avoid (potentially wrong) assumptions associated with probability distributions

Cons:
▪ Don’t benefit from the powerful toolkit in parametric statistical methods
▪ Many real-world phenomena likely follow theoretical distributions – avoiding them may suggest ‘pathologic’ data

The Central Limit Theorem

An Example:

~ 750,000 cases of sepsis each year in US

Suppose you could measure everyone’s WBC on admission (i.e., you knew the population frequency distribution of WBC)

Find some summary statistics of that distribution
▪ Mean = 14.4
▪ SD = 6.5
▪ IQR = 8.9, 19.4

[Figure: WBC in All US Septic Patients in 2011. x-axis: WBC (x 1000 / mm^3); y-axis: Number of Patients]


Now consider this experiment:

You enroll 50 patients in a sepsis study at UM

You calculate your summary statistics:
▪ Mean = 14.6 (v. 14.4)
▪ SD = 7.1 (v. 6.5)
▪ IQR = 7.9, 20.2 (v. 8.9, 19.4)

A reasonable question to ask is:

If my sample is representative of the whole population, how close are my summary statistics likely to be to ‘the truth’?

[Figure: WBC in UM Sepsis Study. x-axis: WBC (x1000/mm^3); y-axis: Number of Patients]


To figure it out, do your experiment over, enrolling 50 new patients

Now, repeat your experiment 1,000 times…

[Figure: WBC in New UM Sepsis Trial. x-axis: WBC (x1000/mm^3); y-axis: Number of Patients]

Group        Mean   SD    IQR
Population   14.4   6.5   8.9, 19.4
1st Trial    14.6   7.1   7.9, 20.2
2nd Trial    14.2   6.7   7.9, 19.6


Sampling Distribution of the Mean Approaches a Normal Distribution in the Limit (Wow!)

The SD of this curve is the Standard Error of the Mean

The SEM is estimated by:

$$\sigma_m = \frac{s}{\sqrt{N}}$$

where σ_m = SEM, s = standard deviation of the original sample, and N = sample size
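The repeated-experiment idea is easy to try in simulation. The sketch below is only an illustration: it assumes a right-skewed (gamma) population with the mean and SD quoted for the septic population, because the true shape of the WBC distribution is not given, then repeats a 50-patient study 1,000 times and compares the spread of the sample means with SD/√N. The population SD is used in the formula here because, unlike in a real study, it is known in a simulation.

```python
import math
import random
import statistics

POP_MEAN, POP_SD = 14.4, 6.5         # population values quoted in the slides
# Gamma shape/scale chosen to match that mean and SD (assumed population shape)
shape = (POP_MEAN / POP_SD) ** 2
scale = POP_SD ** 2 / POP_MEAN

rng = random.Random(2011)            # seeded so the run is reproducible

def sample_mean(n):
    """Mean WBC of one simulated study enrolling n patients."""
    return statistics.fmean(rng.gammavariate(shape, scale) for _ in range(n))

# Repeat the 50-patient experiment 1,000 times
means = [sample_mean(50) for _ in range(1000)]

sem_simulated = statistics.stdev(means)
sem_formula = POP_SD / math.sqrt(50)
print(f"mean of sample means = {statistics.fmean(means):.2f}")
print(f"SD of sample means   = {sem_simulated:.2f}")
print(f"SD / sqrt(N)         = {sem_formula:.2f}")
```

Even though the assumed population is skewed, the sample means cluster around the population mean with a spread close to what the formula predicts.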

[Figure, four panels: Mean WBC; WBC Standard Deviation; WBC 25th %ile; WBC 75th %ile. x-axis: WBC (x 1000/mm^3); y-axis: Density]


The distribution of the sampled mean value will, when resampled many, many times, approach a normal distribution with a mean equal to the population mean and a standard deviation equal to the population standard deviation divided by the square root of the sample size.

To double the precision of your estimate of the mean, you need to increase your sample size 4-fold
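The 4-fold rule follows directly from the square root in the SEM formula. A minimal sketch in Python, using the SD from the first 50-patient trial (7.1) and a hypothetical follow-up study four times as large:

```python
import math

def sem(sd, n):
    """Standard error of the mean for a sample SD and sample size n."""
    return sd / math.sqrt(n)

s = 7.1  # SD reported for the first 50-patient trial
for n in (50, 200):  # 200 is a hypothetical study four times the size
    print(f"N = {n:>3}: SEM = {sem(s, n):.2f}")

# Quadrupling N halves the SEM, assuming the sample SD stays about the same
print(f"ratio = {sem(s, 50) / sem(s, 200):.1f}")
```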

Other statistics of the distribution (e.g., the standard deviation, IQR, etc.) will also approach fixed distributions

These distributions are not necessarily normal (e.g., because the SD by definition is a positive real number, it obeys a distribution that avoids ‘0’)

These distributions are closely tied to the idea of Bayesian statistical methods

The Standard Deviation and the Standard Error of the Mean

Standard deviation describes the likely distance between any measurement in your sample and the mean of the sample (i.e., the variation or dispersion in your sample)

The SEM describes the likely distance between your sample mean and the (unknown) mean of the population

For any sample with more than one observation, the SEM is by definition smaller than the SD; many will report this number because it ‘looks better.’

I prefer SD as it more naturally implies the spread of the measurements actually taken and does not refer to a population the details of which aren’t really knowable.

Confidence Intervals

CIs are a simple extension of standard deviation

They proceed from a probability distribution

Easily calculated by numerous software packages

An ‘n’% confidence interval is a range built so that, if new experimental samples were drawn from the population again and again and an interval calculated from each, n% of those intervals would contain the true value of the summary statistic (e.g., the mean)

This is an odd definition that stems from details of the probability logic involved

A close, but not identical, definition is the range in which there is an n% probability of finding the true value of the summary statistic (e.g., the mean)

See text for calculation details
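One way to see what the percentage in a confidence interval refers to is to simulate many repeated studies and count how often the interval computed from each one contains the true mean. The sketch below makes several assumptions that are not in the slides: a normal population with the mean and SD quoted for septic patients, 50 patients per study, and the normal-approximation interval (mean ± z × SEM) rather than whatever method the text describes.

```python
import math
import random
import statistics

TRUE_MEAN, TRUE_SD = 14.4, 6.5   # population values quoted in the slides
N, REPEATS = 50, 2000            # study size and number of simulated studies

z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
rng = random.Random(7)                      # seeded so the run is reproducible

def ci95(sample):
    """Normal-approximation 95% confidence interval for the mean."""
    m = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return m - half_width, m + half_width

covered = 0
for _ in range(REPEATS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    low, high = ci95(sample)
    covered += low <= TRUE_MEAN <= high  # True counts as 1

coverage = covered / REPEATS
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction should come out near 0.95. Any single interval either contains the true mean or does not; the percentage describes the procedure over many repetitions.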

Test Statistics

The Foundation of All Statistical Testing:

$$\text{Test Statistic} = \frac{\text{Variance Explained by Model}}{\text{Variance Not Explained by Model}} = \frac{\text{Effect}}{\text{Error}}$$

Don’t freak out, and keep in mind:
▪ The test statistic is a number being used to describe some feature of a sample
▪ Because it is derived from a random sample, the value of a test statistic will vary from one sample to another
▪ Statistical significance is a comment on the probability of finding a value of the test statistic as high as yours, or higher, if Effect < Error
▪ A statistical ‘test’ is a comparison of a sample’s test statistic versus the distribution of that test statistic seen when Effect < Error
▪ Accordingly, we’re going to need to rely on someone having figured out what the sampling distribution of a test statistic is when Effect < Error
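The effect-over-error ratio can be made concrete with the t statistic for comparing two groups: a difference in means (the effect) divided by the standard error of that difference (the error). The sketch below is an illustration under stated assumptions: the height data are hypothetical, and the unpooled (Welch) form of the standard error is used, which may differ from the version the text presents.

```python
import math
import statistics

def welch_t(a, b):
    """t statistic: difference in means divided by its standard error."""
    effect = statistics.fmean(a) - statistics.fmean(b)
    error = math.sqrt(statistics.variance(a) / len(a)
                      + statistics.variance(b) / len(b))
    return effect / error

# Hypothetical heights in cm (illustrative only)
this_room = [172.0, 168.5, 181.0, 175.5, 169.0, 178.5]
downstairs = [165.0, 171.5, 160.0, 168.0, 174.0, 163.5]

t = welch_t(this_room, downstairs)
print(f"t = {t:.2f}")
```

Whether a given t is large depends on the sampling distribution of t when there is no real difference, which is what the tables and software supply.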

Test Statistics

Much, much more to come in later chapters
