EM Stats Reading Group
Housekeeping
The text: Field, Discovering Statistics Using SPSS, 3rd Edition
The software: SPSS, version [xxx], available through MCIT
Session 1 (today): Chapters 1 and 2
Homework: Chapters 3 and 4
▪ Installing SPSS and making graphs
Session 2 (to be scheduled): Chapter 5
▪ Testing assumptions
Today
Chapter One
Nuts and bolts
▪ Types of variables
▪ Levels of measurement
▪ Measurement error
▪ Validity and reliability
▪ Observation and experiment
Frequency distributions
Chapter Two
Population and sample
Summary statistics
The Central Limit Theorem
Test statistics
Variables, Parameters, and Statistics
Variable:
Usually, a feature of some subject that you measure
▪ E.g., the speed of a car
Sometimes, something you infer from indirect measurements
▪ E.g., your intelligence
Synonymous with attribute
Parameter:
A number that relates two variables
▪ E.g., the displacement of an accelerator and the speed of a car
▪ E.g., your annual income and your intelligence
$$\text{Variable}_2 = \text{Parameter} \times \text{Variable}_1$$
$$\text{Variable}_2 = f(\text{Variable}_1, \text{Parameter})$$
Variables, Parameters, and Statistics
Statistic:
A measure of some feature of a sample of a variable
▪ E.g.,
▪ Collect the height of everyone in this room
▪ Calculate the mean height
▪ The mean is a summary statistic and an estimate of the height of the population in general
▪ E.g.,
▪ Collect the height of everyone downstairs
▪ Compare to the height of everyone in this room with a t test
▪ The t value is a test statistic
A statistic does not necessarily reflect any measurable feature of an individual
▪ E.g., you do not have a mean height
Variables, Parameters, and Statistics
Independent and dependent variables
An independent variable is one that you can adjust
▪ E.g., the thermostat setting in this room
A dependent variable is one (potentially) driven by the independent variable
▪ E.g., the temperature in this room (likely related)
▪ E.g., the price of gold this morning (likely not related)
Sometimes defined in the opposite way
Not useful terminology; probably best to just drop these terms
Predictor and outcome variables
Same idea, more clearly stated
Careful to note that this is not a statement of causality
Levels of Measurement
Categorical Variables
The variable may take only some finite (and usually small) number of values
E.g.,
▪ Binary variable: Only two possible states: Heads/Tails, Yes/No, True/False
▪ Nominal variable: More than two states: Red/Violet/Green/Yellow
▪ Ordinal variable: Multiple states, ordered:
▪ Annual income: < $10,000; $10,001 - $30,000; $30,001 - $50,000; > $50,000
▪ Red/Yellow/Green/Violet…?
Continuous Variables
E.g.,
▪ Interval variable: Like an ordinal variable, but the distance between any two adjacent states is assumed to be equal
▪ Rate your pain on a scale of 1 to 10
▪ Ratio variable: Ratios between the states also must ‘hold up’
▪ Middle ‘C’ on a piano is ½ the frequency of the ‘C’ one octave above it
Levels of Measurement
The distinctions can get blurry
Some ‘two state’ systems may have more states
▪ E.g., Heads; Tails; Coin rolled under Fridge
Most ‘ratio data’ aren’t really
▪ No experimental instrument can provide infinitely fine resolution
▪ E.g., ED oral thermometers only have 3-4 significant figures
▪ While your temperature can be 39.0304327… °C, your measured temp can only be 39.0 or 39.1
Many attributes of a patient, sprocket, etc., can be measured using different scales
When possible, go with the ratio method
Most test statistics are more ‘efficient’ for ratio data
▪ They detect differences with fewer observations
Measurement Error
For now we see through a glass, darkly…
The things you want to measure cannot be absolutely measured
In the course of statistical analysis, there are many sources of ‘error’
Error does not imply a mistake
Measurement error is the discrepancy between the value of the feature you’re evaluating and the measurement you take of it.
All measurement is associated with error.
In research, an important goal is to make the measurement error as small as economically feasible, or at least much, much smaller than the natural variability in the feature you are measuring.
Validity and Reliability
Validity
Is your measuring scheme capturing the feature you’re interested in?
▪ Does the measurement change in some predictable way with a change in the feature?
▪ Does the measurement not change when things other than the feature are changed?
Many different types of validity have been described (e.g., content validity)
Reliability
Does the measuring scheme produce the same results when applied repeatedly to the same experimental condition?
Study Design: Observation vs. Experiment
Observational studies aim to take measurements without influencing the system under examination
Typically easier than experiments
Sometimes the only feasible means of studying a problem
Some study designs based on observation are referred to as quasi-experimental methods
Establishing causality in observational studies is not possible.
Experimental studies take measurements on a system that the investigator is intentionally perturbing
When well designed, these methods are typically more powerful than observations
May get you closer to establishing causality
Frequency Distributions
A frequency distribution is a tabulation of values taken by a sample of some variable under study (if graphed, it’s called a histogram)
In statistical analysis, you will usually approximate your empirical distribution with a probability distribution
E.g., Normal (Gaussian), t, Beta, Exponential, Gamma distributions
Swapping your actual distribution with one of these allows you to use statistical tools which are well behaved and thoroughly understood
Always remember that when you assume your data follow some distribution, you must do some homework to make sure that assumption is true! (See Chapter 5)
Frequency Distributions
An attractive feature of defined probability distributions as a means of describing a frequency distribution is that they can usually be fully described by just a few parameters.
Result Count
2 27
3 43
4 86
5 121
6 148
7 178
8 137
9 108
10 74
11 50
12 28
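A tally like this one can be reproduced by simulation. The following is a minimal Python sketch, assuming two fair six-sided dice and an arbitrary random seed (the function name `toss_dice_pairs` is made up for illustration); its counts will resemble the table but will not match it exactly.

```python
import random
from collections import Counter

def toss_dice_pairs(n_tosses, seed=0):
    """Tally the sums of n_tosses simulated tosses of two fair dice."""
    rng = random.Random(seed)  # seed is arbitrary, used only for repeatability
    sums = (rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_tosses))
    return Counter(sums)

tally = toss_dice_pairs(1000)

# Print the frequency distribution in the same Result/Count layout
print("Result Count")
for result in sorted(tally):
    print(result, tally[result])
```

Changing the seed gives a different tally each time, which is the sample-to-sample variation the later slides discuss.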
[Figure: 1000 Tossed Dice Pairs. x-axis: Result of Toss; y-axis: Fraction of All Tosses. Mean (μ) = 6.96, Standard Deviation (σ) = 2.35]
These are NOT your data: they are a model of your data!
Frequency Distributions
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Normal (Gaussian) distribution
Common, but not universal
Described with 2 parameters: mean (μ) and standard deviation (σ)
Turns out it’s really easy to calculate these two parameters given a set of data
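To make the "model of your data" idea concrete, here is a small Python sketch that evaluates the normal density at a toss result of 7, using the mean and standard deviation quoted for the dice example, and sets it beside the observed fraction from the tally. The helper name `normal_pdf` is an illustrative choice; the standard library's `statistics.NormalDist` is used only as a cross-check.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and SD sigma at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

mu, sigma = 6.96, 2.35            # values quoted for the dice example
observed_fraction_7 = 178 / 1000  # 178 of the 1000 tosses came up 7

model_density_7 = normal_pdf(7, mu, sigma)
library_density_7 = NormalDist(mu, sigma).pdf(7)  # same quantity, library version

print("observed fraction at 7:", observed_fraction_7)
print("model density at 7:    ", round(model_density_7, 3))
```

Because the results are one unit apart, the density at 7 is roughly comparable to the fraction of tosses that came up 7. The two numbers are close but not equal, which is the sense in which the curve is a model rather than the data.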
Frequency Distributions
Finding the mean and standard deviation for a normal distribution
Mean:
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Variance:
$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2$$
Standard Deviation:
$$\sigma = \sqrt{\sigma^2}$$
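As a rough sketch of this calculation, the Python below computes the mean, variance, and standard deviation for the dice tally shown earlier. It assumes the n − 1 form of the variance; working from the tally weights each result by its count, which is equivalent to summing over all 1000 individual tosses.

```python
import math

# Dice tally from the earlier slide: result -> count
tally = {2: 27, 3: 43, 4: 86, 5: 121, 6: 148, 7: 178,
         8: 137, 9: 108, 10: 74, 11: 50, 12: 28}

n = sum(tally.values())

# Mean: sum of all observations divided by n
mean = sum(x * count for x, count in tally.items()) / n

# Variance: sum of squared deviations from the mean, divided by n - 1
variance = sum(count * (x - mean) ** 2 for x, count in tally.items()) / (n - 1)

# Standard deviation: square root of the variance
sd = math.sqrt(variance)

print(f"n = {n}, mean = {mean:.2f}, SD = {sd:.2f}")
```

With these counts the result agrees with the mean of 6.96 and standard deviation of 2.35 quoted on the dice figure.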
Frequency Distributions
A counterexample: The time to arrival of the next patient in triage
A sample from the UM ED was taken at brisk activity – 10 new patients an hour
The question is asked – given a patient arriving at t = 0, what’s the range of likely times until the next patient arrives? E.g., how many minutes do I have to get this patient triaged?
[Figure: Time to Next Patient (x-axis) vs. Probability (y-axis)]
Frequency Distributions
Given 10 arrivals per hour, you know the average time to the next patient should be around 6 minutes.
As a first approximation, a normal distribution is chosen.
Mean = 9.9 minutes SD = 9.3 minutes
The normal distribution does a very poor job at estimating the distribution of arrival times.
[Figure: Time to Next Patient (x-axis) vs. Probability (y-axis)]
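One quick way to see the poor fit is to ask how much probability the fitted normal model places on waiting times below zero, which cannot occur. This is a small Python sketch using the mean and SD quoted above; it illustrates only one symptom of the misfit and is not a formal goodness-of-fit test.

```python
from statistics import NormalDist

# Normal model with the fitted values quoted on the slide (minutes)
model = NormalDist(mu=9.9, sigma=9.3)

# Probability the model assigns to an impossible negative waiting time
p_negative = model.cdf(0)
print(f"P(time < 0) under the normal model: {p_negative:.3f}")
```

A model that puts a noticeable share of its probability on impossible values is a sign that a different distribution should be considered.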
Frequency Distributions
An alternative distribution
The exponential distribution
Describes the data very nicely
The mean and standard deviation have different forms and are derived from the intensity of the process (the number of cases per hour), usually abbreviated λ
[Figure: Time to Next Patient (x-axis) vs. Probability (y-axis)]
Frequency Distributions
Finding the mean and standard deviation for an exponential distribution
Weird: The mean and the standard deviation are the same…
With the exponential distribution, you only need one number (λ) to describe the whole thing.
[Figure: Time to Next Patient (x-axis) vs. Probability (y-axis)]
$$\mu = \frac{1}{\lambda}$$
$$\sigma^2 = \frac{1}{\lambda^2}$$
$$\sigma = \frac{1}{\lambda}$$
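The claim that the mean and standard deviation coincide can be checked by simulation. This Python sketch assumes exponentially distributed waiting times at the rate of 10 arrivals per hour mentioned earlier, expressed per minute; the seed and the number of simulated arrivals are arbitrary choices.

```python
import random
import statistics

rate_per_minute = 10 / 60  # lambda: 10 arrivals per hour, expressed per minute
rng = random.Random(1)     # arbitrary seed, used only for repeatability

# Simulated times (in minutes) until the next patient arrives
waits = [rng.expovariate(rate_per_minute) for _ in range(20_000)]

sim_mean = statistics.mean(waits)
sim_sd = statistics.stdev(waits)

print(f"1 / lambda     = {1 / rate_per_minute:.1f} min")
print(f"simulated mean = {sim_mean:.2f} min")
print(f"simulated SD   = {sim_sd:.2f} min")
```

Both simulated values should land near 1/λ = 6 minutes, so the single number λ is enough to describe the whole distribution.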
Frequency Distributions
For Discrete Variables (e.g., count data): Bernoulli, Rademacher, Binomial, Hypergeometric, Poisson
For Continuous Variables: Beta, Von Mises-Fisher, Chi squared, Exponential, Gamma, Log-normal, Normal, Weibull
Many, many more.
Quantum mechanics is based on probability distributions over complex numbers (with real and imaginary parts)
Frequency Distributions
Take home point:
The frequency distribution of your data is just a tabulation of your observations
A probability distribution function is a mathematical tool that you can use to reduce a lot of observations to a small number of summary statistics that describe your data and support analysis
The mean and standard deviation are two common summary statistics
▪ They will be calculated differently depending on which distribution you choose
Many probability functions exist
▪ Your choice will depend on the problem at hand and available software
Populations and Samples
A population is the entire set of individuals about which you want to learn something and to which you may want to generalize your findings.
In general, it is assumed that a true census (an observation of every member of a population) is not possible
Thus, the actual features of a population are essentially unknowable.
A sample is a subset of a population selected for measurement
A random sample is one in which every member of the population has the possibility of joining
A uniform random sample is one in which every member of the population has an equal probability of joining
Central themes in statistics include:
▪ Collecting representative samples
▪ Choosing appropriate summary measures to describe those samples
▪ Quantifying the likely discrepancy between one’s summary measures and the true values in the population (e.g., quantifying confidence)
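Drawing a uniform random sample is a one-line operation in most software. The Python sketch below assumes a hypothetical population of 1,000 individuals identified by ID number; the population size, sample size, and seed are all invented for illustration.

```python
import random

rng = random.Random(42)  # arbitrary seed, used only for repeatability

# Hypothetical population: ID numbers for 1,000 individuals
population = list(range(1000))

# Uniform random sample without replacement: every member of the
# population has the same probability of being selected
sample = rng.sample(population, k=50)

print("first few sampled IDs:", sorted(sample)[:5])
```

In practice the hard part is rarely the random draw itself but assembling a list that really covers the population of interest.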
Summary Statistics
Central tendency
Median: The 50th percentile value; half of observations are greater than, and half are less than, the median value
Mode: The most frequent single observation
Mean: The calculated summary statistic based on the probability distribution under consideration
[Figure: Faculty Comments Per Faculty Meeting. y-axis: Number of Faculty. Median: 8, Mode: 8, Mean: 8]
Summary Statistics
Outlier
An observation that appears to deviate greatly from other members of the sample
Effect of outlier on central tendency summary statistics:
Median: relatively resistant to outliers
Mode: relatively resistant to outliers, but a sample may have more than one mode
Mean: resistance to outliers increases with the sample size; small samples may be swayed dramatically by outliers
[Figure: Faculty Comments Per Faculty Meeting. y-axis: Number of Faculty. Median: 8, Mode: 8, Mean: 10]
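The same pattern can be reproduced with a few lines of Python. The comment counts below are hypothetical, invented to behave like the figure; they are not the data behind it.

```python
import statistics

# Hypothetical comment counts per faculty member (not the slide's data)
comments = [5, 6, 7, 8, 8, 8, 9, 10, 11]
with_outlier = comments + [30]  # add one very talkative colleague

for label, data in [("without outlier", comments),
                    ("with outlier", with_outlier)]:
    print(label,
          "| median:", statistics.median(data),
          "| mode:", statistics.mode(data),
          "| mean:", statistics.mean(data))
```

In this small sample the median and mode stay put when the outlier is added, while the mean is pulled upward.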
Summary Statistics
Dispersion
Interquartile range: A pair of numbers, the 25th and 75th percentiles of the sample
Standard deviation: Calculated value based on the distribution under consideration
Confidence intervals: Variation on the standard deviation
[Figure: Faculty Comments Per Faculty Meeting. y-axis: Number of Faculty. IQR: 6, 9; SD: 2.3]
Summary Statistics
Effect of outliers on dispersion summary statistics
IQR: Relatively resistant to outliers
SD, CI: Potentially very sensitive to outliers. As with the mean, the extent is a function of the sample size (small samples are very susceptible).
[Figure: Faculty Comments Per Faculty Meeting. y-axis: Number of Faculty. IQR: 7, 10; SD: 6.4]
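A short Python sketch shows the same contrast between the IQR and the SD. The data are hypothetical, and the helper name `iqr_and_sd` is made up for illustration. Quartile definitions differ slightly between software packages, so the exact quartile values from `statistics.quantiles` may not match those reported by SPSS.

```python
import statistics

def iqr_and_sd(data):
    """Return (25th percentile, 75th percentile, sample SD) for data."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return q1, q3, statistics.stdev(data)

# Hypothetical comment counts per faculty member (not the slide's data)
comments = [5, 6, 7, 8, 8, 8, 9, 10, 11]
with_outlier = comments + [30]

q1_a, q3_a, sd_a = iqr_and_sd(comments)
q1_b, q3_b, sd_b = iqr_and_sd(with_outlier)

print(f"without outlier: IQR = {q1_a}, {q3_a}; SD = {sd_a:.1f}")
print(f"with outlier:    IQR = {q1_b}, {q3_b}; SD = {sd_b:.1f}")
```

Here the quartiles shift only slightly when the outlier is added, while the standard deviation grows several-fold.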
Summary Statistics
[Figure: Faculty Comments Per Faculty Meeting. y-axis: Number of Faculty. Two fitted curves: normal distribution excluding the outlier, and normal distribution including the outlier]
Choosing Summary Statistics to Use
Distribution-Based Statistics
Pros:
▪ Commonly used
▪ Provide a natural link to statistical tests and methods to be developed later
Cons:
▪ Are only as good as the choice of probability distribution
▪ May significantly misrepresent the data when outliers are present
Distribution-Free (i.e., Non-parametric) Statistics
Pros:
▪ Commonly used
▪ Avoid (potentially wrong) assumptions associated with probability distributions
Cons:
▪ Don’t benefit from the powerful toolkit in parametric statistical methods
▪ Many real-world phenomena likely follow theoretical distributions; avoiding them may suggest ‘pathologic’ data
The Central Limit Theorem
An Example:
~ 750,000 cases of sepsis each year in US
Suppose you could measure everyone’s WBC on admission (i.e., you knew the population frequency distribution of WBC)
Find some summary statistics of that distribution
▪ Mean = 14.3
▪ SD = 6.5
▪ IQR = 8.9, 19.4
[Figure: WBC in All US Septic Patients in 2011. x-axis: WBC (x 1000 / mm^3); y-axis: Number of Patients]
The Central Limit Theorem
Now consider this experiment:
You enroll 50 patients in a sepsis study at UM
You calculate your summary statistics:
▪ Mean = 14.6 (v. 14.4)
▪ SD = 7.1 (v. 6.5)
▪ IQR = 7.9, 20.2 (v. 8.9, 19.4)
A reasonable question to ask is:
If my sample is representative of the whole population, how close are my summary statistics likely to be to ‘the truth’?
[Figure: WBC in UM Sepsis Study. x-axis: WBC (x1000/mm^3); y-axis: Number of Patients]
The Central Limit Theorem
To figure it out, do your experiment over, enrolling 50 new patients
Now, repeat your experiment 1,000 times…
[Figure: WBC in New UM Sepsis Trial. x-axis: WBC (x1000/mm^3); y-axis: Number of Patients]
Group       Mean  SD   IQR
Population  14.4  6.5  8.9, 19.4
1st Trial   14.6  7.1  7.9, 20.2
2nd Trial   14.2  6.7  7.9, 19.6
The Central Limit Theorem
Sampling Distribution of the Mean Approaches a Normal Distribution in the Limit (Wow!)
The SD of this curve is the Standard Error of the Mean
The SEM is estimated by:
$$\sigma_m = \frac{s}{\sqrt{N}}$$
where σ_m = SEM, s = standard deviation of the original sample, and N = sample size
[Figure: four density panels, each with WBC (x 1000/mm^3) on the x-axis and Density on the y-axis: Mean WBC; WBC Standard Deviation; WBC 25th%ile; WBC 75th%ile]
The Central Limit Theorem
The distribution of the sample mean will, when resampled many, many times, approach a normal distribution with a mean equal to the population mean and a standard deviation equal to the population standard deviation divided by the square root of the sample size.
To double the precision of your estimate of the mean, you need to increase your sample size 4-fold
Other statistics of the distribution (e.g., the standard deviation, IQR, etc.) will also approach a fixed distribution
These distributions are not necessarily normal (e.g., because the SD by definition is a positive real number, it obeys a distribution that avoids ‘0’)
These distributions are closely tied to the idea of Bayesian statistical methods
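The repeated-trial thought experiment can be sketched in Python. The shape of the population distribution is not stated in these slides, so this sketch assumes, purely for illustration, a gamma distribution tuned to the quoted population mean of 14.4 and SD of 6.5; the seed, the helper name `run_trial`, and the use of 1,000 trials of 50 patients are likewise illustrative choices.

```python
import math
import random
import statistics

rng = random.Random(2011)  # arbitrary seed, used only for repeatability

# ASSUMPTION: population WBC follows a gamma distribution with the
# quoted mean and SD; the true population shape is not given.
pop_mean, pop_sd = 14.4, 6.5
shape = (pop_mean / pop_sd) ** 2
scale = pop_sd ** 2 / pop_mean

def run_trial(n_patients):
    """Mean WBC from one simulated study enrolling n_patients."""
    return statistics.mean(rng.gammavariate(shape, scale)
                           for _ in range(n_patients))

n = 50
trial_means = [run_trial(n) for _ in range(1000)]

print("mean of trial means:", round(statistics.mean(trial_means), 2))
print("SD of trial means:  ", round(statistics.stdev(trial_means), 2))
print("sigma / sqrt(N):    ", round(pop_sd / math.sqrt(n), 2))
```

The SD of the trial means should come out close to σ/√N even though the assumed population is skewed. Rerunning with n = 200 should roughly halve it, which is the 4-fold rule in action.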
The Standard Deviation and the Standard Error of the Mean
Standard deviation describes the likely distance between any measurement in your sample and the mean of the sample (i.e., the variation or dispersion in your sample)
The SEM describes the likely distance between your sample mean and the (unknown) mean of the population
SEM is by definition always smaller; many will report this number because it ‘looks better.’
I prefer SD as it more naturally implies the spread of the measurements actually taken and does not refer to a population the details of which aren’t really knowable.
Confidence Intervals
CIs are a simple extension of the standard deviation
They proceed from a probability distribution
Easily calculated by numerous software packages
An ‘n’% confidence interval represents the range in which a summary statistic (e.g., the mean) will fall n% of the time when new experimental samples are drawn from a population
This is an odd definition that stems from details of the probability logic involved
A close, but not identical, definition (for a 95% CI) is the range in which there is a 95% probability of finding the true value of the summary statistic (e.g., the mean)
See text for calculation details
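As a hedged illustration of the arithmetic, the Python below builds an approximate 95% confidence interval for the mean from the first UM sepsis sample (mean 14.6, SD 7.1, N = 50). It assumes the normal approximation (a multiplier of about 1.96); the text may instead use the t distribution, which gives a slightly wider interval for a sample of this size.

```python
import math
from statistics import NormalDist

# Summary statistics from the first UM sepsis sample
sample_mean, sample_sd, n = 14.6, 7.1, 50

# Standard error of the mean: s / sqrt(N)
sem = sample_sd / math.sqrt(n)

# ASSUMPTION: normal approximation; z is about 1.96 for a two-sided 95% CI
z = NormalDist().inv_cdf(0.975)
lower = sample_mean - z * sem
upper = sample_mean + z * sem

print(f"SEM = {sem:.2f}")
print(f"approximate 95% CI for the mean = ({lower:.1f}, {upper:.1f})")
```

In this example the interval contains the population mean of 14.4 quoted in the trial table.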
Test Statistics
The Foundation of All Statistical Testing:
$$\text{Test Statistic} = \frac{\text{Variance Explained by Model}}{\text{Variance Not Explained by Model}} = \frac{\text{Effect}}{\text{Error}}$$
Don’t freak out, and keep in mind:
The test statistic is a number being used to describe some feature of a sample
Because it is derived from a random sample, the value of a test statistic will vary from one sample to another
Statistical significance is a comment on the probability of finding the value of a test statistic higher than what you’d expect if Effect < Error
A statistical ‘test’ is a comparison of a sample’s test statistic versus the distribution of that test statistic seen when Effect < Error
Accordingly, we’re going to need to rely on someone having figured out what the sampling distribution of a test statistic is when Effect < Error
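One familiar instance of this effect-over-error pattern is the t statistic for comparing two group means. The Python sketch below uses the unequal-variance form of the statistic and hypothetical heights invented for illustration; the function name `t_statistic` is also an illustrative choice, and the sketch stops short of computing a p-value.

```python
import math
import statistics

def t_statistic(a, b):
    """Difference in means (effect) divided by its standard error (error)."""
    effect = statistics.mean(a) - statistics.mean(b)
    error = math.sqrt(statistics.variance(a) / len(a)
                      + statistics.variance(b) / len(b))
    return effect / error

# Hypothetical heights in cm, invented for illustration
this_room = [170, 172, 174, 176, 178]
downstairs = [160, 162, 164, 166, 168]

t = t_statistic(this_room, downstairs)
print("t =", t)
```

Whether a given t is "large" is judged against the sampling distribution of t when there is no real difference, which is what the statistical test supplies.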