Lab 1: The Scientific Method and Statistics
The power of science comes not from scientists but from its
method. - E. O. Wilson, The Creation
When asked "What is biology?" most people respond with "It's the
study of living things," which is mostly correct. A better answer
would be "It's the scientific study of living things." But what makes
one inquiry scientific and another not? The wealth of knowledge
that fills your textbook is the result of the hard work of many
scientists over the course of centuries (and only represents a
fraction of what scientists have discovered during that period). As
Dr. Wilson accurately points out above, the key to scientific
inquiry is not the scientist, but the scientific method.
There are two main approaches to scientific discovery: one in
which nature is described using observation and measurements
(observational science) and one in which a natural phenomenon is
explained using the scientific method (hypothesis-based or
empirical science). Often, observational science leads to questions
which can be answered through hypothesis-based investigations.
There are six steps to the scientific process, each important and
vital to making the process work:
THE SCIENTIFIC METHOD
1. Observe: You cannot ask educated questions about natural
phenomena without knowing something about the nature of the
phenomena. The key to any investigation, scientific or otherwise,
is observation. Observation leads to the collection of data and
usually will lead the observer to ask a question.
2. Question: As a scientist, you should be
curious about the way the things you observed work, came to be,
interact with other things, etc. What is it you want to know about
the world? Why did that stop? How is this formed? Which group is
faster? The question formed will lead to the formation of
hypotheses.
3. Hypothesize: A hypothesis (pl.
hypotheses) is a proposed explanation for a set of observations.
Another way to look at a hypothesis is to say that it is a
reasonable guess of what might be occurring; a possible answer to
your question. The hypothesis is formed using information gathered
from observations. To be useful, a hypothesis should be testable
using experimentation. Once a hypothesis is proposed, several
consequences can be reasonably expected. These expectations can be
termed predictions and they are the expected outcomes if a
hypothesis is true.
Based on a figure from Lehner, Handbook of Ethological Methods, 1996.
4. Design experiment and collect data: Once a hypothesis (or
hypotheses) is formulated, the investigator should attempt to
verify the predictions. To test the validity of the predictions,
experiments are created to allow for controlled testing of the
hypothesis. Your experiment will determine if your hypothesis is
supported or not by evidence. A good experiment is one that will
provide supporting evidence if a hypothesis is correct and is
equally likely to show that a hypothesis is false, if it is not
correct. Since biological systems vary a lot, it is best to repeat
your experiment several times (replication) and then use statistics
to sort out the outcome.
Experiments that are appropriately set up only change one
variable (called manipulated variable or independent variable) for
each run of the experiment. All other factors are controls (or
controlled variables). In other words, in order to test whether or
not this one variable (as determined by the hypothesis) affects the
outcome of the experiment, you must 1) keep all other variables
constant (i.e., everything but the variable in
question is the same between subjects) and 2) have controls to
show that the organism is able to function normally in this
experiment. Your experiment should produce data. Data (plural) are
measurable outcomes of the experiment. These are the numbers that
will indicate if your hypothesis is actually causing the observed
effect. Data can be measured in height, width, length, seconds,
minutes, days, volume, density, number of something, and many other
measurable quantities. Don't forget your units!
5. Analyze data: Analyze data using statistical methods (see
below). Statistics typically check to see if the data produced by
one group (e.g., the control) is different from the data produced
by another (the manipulated variable) in a statistically meaningful
way. They take into consideration not only the average, but the
variation around that average. If the data are not statistically
significant, the outcome from the manipulated variable did not
differ from the control and you must reject your hypothesis. If you
accept your hypothesis, then there is evidence that your prediction
is true. If the experimenter can provide replication of these
results, then the hypothesis can be considered a reasonable
explanation of the observed phenomena. If you reject your
hypothesis, then your prediction is probably not causing the
observed phenomenon and you still do not know what is causing it.
As a curious scientist, the experimenter should revise the
hypothesis and/or experimental design and try again.
6. Conclude: When data are collected and analyzed, the
hypothesis is either tentatively accepted or rejected. The
acceptance of a hypothesis may only be temporary, as new
observations and experiments can lead to the rejection of the
hypothesis and/or an alternative explanation may also be true.
Here, the outcomes of the experiment are interpreted in a broader
context.
As a side note: a theory is an idea supported by many lines of
evidence gathered by many investigators testing many hypotheses
over many years. It leads to other predictions, or
hypotheses, which when tested are often also supported by evidence.
We can only base our conclusions on the evidence at hand. In the
future, further evidence may lead us to a different conclusion.
Therefore, in science, all of our knowledge is theoretical. Famous
theories include things like the following:
Gravity - that all objects have a pulling force; larger objects have a larger pull.
Cell theory - that the basic unit of life is the cell, cells come from other cells, etc.
Germ theory - that diseases are often caused by microorganisms, rather than bad karma.
Plate tectonics - that large continental plates drift across the liquid layer below them, causing earthquakes, etc.
Atomic theory - that atoms are the smallest units of matter.
Some of these have been supported by a hundred years of data
now. Even so, they are still called theories. Most were
controversial when first proposed and some of them may still be
controversial. All of them could, in theory (pun intended) be
overturned by a better theory that fits all the evidence more
closely. In this lab, you will be making and breaking hypotheses
constantly, but will you create a theory? Not likely. If you still
don't see the difference, talk to someone about it.
APPLYING THE SCIENTIFIC METHOD

A certain observation or situation leads us to ask questions. A
hypothesis is a proposed explanation or answer to your question
based on an educated/informed guess. For example, Suzie's car won't
start this morning on her way to school. Your question: "Why is
Suzie's car not starting?" A hypothesis for this situation might be
that Suzie's car battery has died. Another hypothesis might be that
Suzie has no gas in her car. Both of those statements assume that
you might know the problem and have guessed the reason. Note that a
hypothesis must be written in statement form, not as a question.
Also, a good hypothesis must be testable and falsifiable (able to
be proved false).
The second step is experimental procedure. In this step you design
an experiment to test your hypothesis. This experimental design
will need to test the hypothesis directly as well as give you a
clear yes or no answer. To help with determining the yes or no
answer, a control should be incorporated. In the experimental
design, controls are set up exactly the same as your experimental
test, but with only one thing different. In the non-working car
example, the first hypothesis being tested is whether the battery
is dead. The experimental procedure would consist of putting a new
battery into Suzie's car. The control would be the original old
battery in Suzie's car. Only the type of battery has changed: old
versus new. The same car and battery cables must be used to ensure
that only one thing has changed. Or, Suzie could take her battery
to the mechanic, who can test the battery and compare Suzie's
battery results to those of a functional battery of the same brand.
The results of that experiment will then need to be analyzed so
that a conclusion can be made. If the new battery in Suzie's car
makes the car start, or the mechanic states that Suzie's old
battery is below normal working standards, then Suzie will know
that the hypothesis of a non-working battery has been validated (a
yes answer was achieved). We can then conclude, or make a
substantiated explanation of the situation, that the battery was
indeed the reason the car did not start. But what if the new
battery did not restart Suzie's car, or what if the mechanic stated
that Suzie's battery was well within working parameters? That would
mean that the hypothesis was invalidated (a no answer was
achieved). Therefore, Suzie has not determined the cause of her car
troubles. So, she will have to start over to diagnose the problem
and come up with another hypothesis, such as "the car wouldn't
start because it had no gas in the engine." Suzie will then
redesign an experimental procedure to test this hypothesis using
the proper controls. Suzie can again either solve her problem, or
will have to come up with another hypothesis.

The art of deduction, from Monty Python and the Holy Grail, Scene 5:
BEDEMIR: Quiet, quiet. Quiet! There are ways of telling whether she is a witch.
CROWD: Are there? What are they?
BEDEMIR: Tell me, what do you do with witches?
VILLAGER #2: Burn!
CROWD: Burn, burn them up!
BEDEMIR: And what do you burn apart from witches?
VILLAGER #1: More witches!
VILLAGER #2: Wood!
BEDEMIR: So, why do witches burn?
[pause]
VILLAGER #3: B--... 'cause they're made of wood...?
BEDEMIR: Good!
CROWD: Oh yeah, yeah...
BEDEMIR: So, how do we tell whether she is made of wood?
VILLAGER #1: Build a bridge out of her.
BEDEMIR: Aah, but can you not also build bridges out of stone?
VILLAGER #2: Oh, yeah.
BEDEMIR: Does wood sink in water?
VILLAGER #1: No, no.
VILLAGER #2: It floats! It floats!
VILLAGER #1: Throw her into the pond!
CROWD: The pond!
BEDEMIR: What also floats in water?
VILLAGER #1: Bread!
VILLAGER #2: Apples!
VILLAGER #3: Very small rocks!
VILLAGER #1: Cider!
VILLAGER #2: Great gravy!
VILLAGER #1: Cherries!
VILLAGER #2: Mud!
VILLAGER #3: Churches -- churches!
VILLAGER #2: Lead -- lead!
ARTHUR: A duck.
CROWD: Oooh.
BEDEMIR: Exactly! So, logically...,
VILLAGER #1: If... she.. weighs the same as a duck, she's made of wood.
BEDEMIR: And therefore--?
VILLAGER #1: A witch!
CROWD: A witch!
BEDEMIR: We shall use my larger scales!
Introduction to Statistics

All of the above information is to show you how science works.
Science demands a lot. It requires the scientist to be unassuming,
unbiased, curious, skeptical, and clever. Just because you may
think things happen in a certain way or because of a certain
reason, you cannot assume anything without evidence. If your
experiment shows that birds fly because they are lighter than air,
then this is what you assume. Of course, your experiment may not be
appropriate to test this, or it wasn't done properly, but until it
is redone, any conclusions must be drawn on the evidence at hand.

We use science and statistics more than you might think. And often,
we are drawing the wrong conclusions from them. I just heard on NPR
that New Mexico dropped from 1st to 17th when ranking states by the
number of alcohol-related fatalities. Good for NM, right? But it is
important to know how those stats are determined. Previously this
was the number of fatalities per state population (per capita). Now
it is the number of fatalities per miles driven on average. And as
the reporter said, New Mexicans drive a lot. It is still in the
bottom 10 for per capita deaths. The
two figures should not be compared directly.

Flipping through an old Newsweek magazine, I came across these
advertising slogans that use statistics, but I wonder about the
specifics. Here are the claims they make and some questions that
immediately come to my mind:

Product: Dodge
Claim: "The world's biggest cab"
Questions: What is this measuring: volume, surface area, length,
width? Internal or external? Compared to what: other trucks, cars,
everything that ever existed?

Product: Crestor
Claim: "10-mg dose of Crestor, along with diet, can lower bad
cholesterol by as much as 52% (vs 7% with placebo)"
Questions: Are these percentages averages of the study group or
maxima? In other words, perhaps only 1 person in the entire study
saw a 52% drop and this was way above the average. What is the
effect of diet alone? Did the placebo treatment have the same diet?

Product: Tempur-Pedic
Claim: "In a recent survey, 92% of our enthusiastic owners report
sleeping better and waking more refreshed."
Questions: How many people were surveyed? 9.2/10 is not the same as
920/1000. Compared to what? Their old mattress? Nothing? Did you
only survey enthusiastic owners, or were only the opinions of
enthusiastic owners included in this statistic?
Since many of you have aspirations to join the medical field, let
me point this out. It is my experience that most medical
researchers do not use proper statistical tests during their
experiments. This means that most of the studies used to show a
link between cancer and your favorite pastime, or to discover the
gene that controls your diet preferences, or to develop medicines
that treat the common cold might be based on faulty premises. I've
seen peer-reviewed papers published with no regard for proper
statistical procedure. You'd be surprised how many papers talk
about how correlated their treatment is with recovery based on
figures worse than the last regression example above. The authors
got away with it because the reviewers don't know it either. In
addition, most patients and many physicians are overwhelmed,
uninformed, or ignorant of how statistics are used by researchers,
medical supply companies, or mainstream media to sell their story.
I've seen news stories based on differences between groups that are
smaller than the margin of error (meaning there is no difference).
Hopefully, with this lab manual you can change some of that. My
point is: be critical and be curious. There is a lot of information
out there these days. Not all of it is valid. Do not take it at
face value. Ask questions. Question your consumer products, your
friends, your doctors, your professors. Then you'll be one step
closer to thinking like a scientist.
"You can claim anything with statistics, but only to those who
don't understand statistics." - unknown
STATISTICS!

Humans consider themselves good judges of what is bigger, faster,
or better than the rest, but how much bigger is bigger? For
example, imagine you want to know if frogs from one population (Pop
A) are bigger than those from another population (Pop B). You could
take a frog from each one and measure their lengths. You might find
that the frog from Pop B is much bigger than the one from Pop A.
But do these frogs really show the difference between the two
populations? What if you just happened to catch a small frog from
Pop A and a large one from Pop B? What if you caught several more
and found that the frogs from both populations really had a lot of
variation in lengths and looked something like the dataset below?
Some of the frogs in Pop A are bigger than some in Pop B. Now can
you tell if one population is bigger than another? Which one?
This is where statistics becomes useful. We cannot easily judge
the differences between groups accurately when there is variation
(as in most studies in biology). Even the advertisement on the left
admits "Individual results may vary." Therefore we must analyze them
statistically. We usually use statistics to understand something
about a given population, be it animal, vegetable, or mineral. The
best way would be to measure every single individual from both
groups, but that is not often possible. Therefore we collect data
on a subset or sample of the group of interest (n = sample size,
the total number of individuals sampled). We then extrapolate the
data for this sample to the rest of the population. But
the sample should be unbiased, and care should be taken to
control for the following: 1. Make sure the sample is a good
representation of the population as a whole. For instance, if
you wanted to know if Georgians prefer the Bulldogs or the
Jackets, you will not get a complete understanding of peoples
preferences if you only survey people in Athens.
2. If you are using multiple groups to examine the potential
effects of a specific variable, make
sure that all groups are equivalent in every way, except in the
variable being tested. For instance, it is not good science to give
a placebo to people over 50 years of age and a new medicine to
people under 50 years of age, and conclude that the people given
the drug are healthier. In this case, there are two experimental
variables: presence of the drug and age. The two groups are not
equivalent.
Once you have gone out and taken some measurements of a sample (the
number of trees in a forest, the size of seeds eaten by cotton
rats, the blood pressure of patients with arthritis), your first
step will be to describe different characteristics of your sample.
To do this, we use Descriptive Statistics, so here we go!

Measures of Center

These statistics indicate which values are most common. They
attempt to define what is normal for the population.

Mean - In popular speech, the mean is the average. It is the most
commonly used measure of central tendency. The mean is computed by
summing all values in your sample and dividing that sum by the
sample size:

X̄ = (ΣXi) / n

where Xi = each individual data point, n = sample size (the number
of data points in your sample), and Σ = summation.
In our frogs, Pop A has a mean length of 7.38 cm and Pop B is
9.06 cm on average.
Mode - The mode of a sample is the score that appears most
often. In Pop A, mode = 8.7 cm (i.e. 2 of the frogs were 8.7 cm
long). Pop B does not have a mode.
Median - The median divides the distribution into halves; half
of the scores are above the median and half are below it when the
data are arranged in numerical order.
When the sample size is an odd number, as in our frog sample,
the median is the middle value (Pop A = 7.7, Pop B = 8.8). When the
sample size is an even number, the median is halfway between the
middle values, e.g., for the dataset (1 3 4 5 8 9), the median
location is half-way between the 3rd and 4th scores (4 and 5) or
4.5.
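These measures of center can be checked with Python's standard statistics module. The frog lengths below are hypothetical values chosen only to be consistent with the numbers quoted for Pop A (mean 7.38 cm, median 7.7 cm, mode 8.7 cm); the actual lab dataset may differ.

```python
import statistics

# Hypothetical Pop A frog lengths in cm (illustrative, not the real dataset)
pop_a = [5.3, 6.5, 7.7, 8.7, 8.7]

print(round(statistics.mean(pop_a), 2))  # 7.38
print(statistics.median(pop_a))          # 7.7 (middle value, odd sample size)
print(statistics.mode(pop_a))            # 8.7 (appears twice)

# With an even sample size, the median is halfway between the middle values:
print(statistics.median([1, 3, 4, 5, 8, 9]))  # 4.5
```

Note that for the odd-sized sample the median is simply the middle score, while the even-sized dataset from the text gives 4.5, halfway between 4 and 5.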
Median is a useful measure of center for data that is very
skewed towards one end, like salaries, in which the mean does not
give a good measurement of the norm (see figure at right). If you
were complaining to your boss that you and your coworkers needed a
raise, would you complain about how low the mean or the median
is?
If the dataset is normally distributed, the data points are
equally distributed about the mean (as on right) and mean = median
= mode
[Figure: a normal distribution, labeled Mean, Median, Mode]
A biologist, a chemist, and a statistician are out hunting. The
biologist shoots at a deer and misses 5 ft to the left, the chemist
takes a shot and misses 5 ft to the right, and the statistician
yells "We got 'em!"
Measures of Variation

Although Measures of Center are informative, as they describe how
scores are centered in the distribution, the mean, median, and mode
alone do not provide the best possible description of a sample
(distribution). For example, think of samples of two populations, X
and Y. Both X and Y have the same mean (50) and similar medians (50
and 48, respectively). But would you call those two datasets very
similar? As you can see, Measures of Center alone are not
sufficient to clearly describe a data set. Measures of variation
provide additional critical information about your data.
Specifically, the degree to which
information about your data. Specifically, the degree to which
individual scores are clustered about or deviate from the average
value in a distribution. In biology, everything varies, such as the
seed sizes of the two species on right. So it is important to
always report some measure of variation along with your sample mean
when describing your data.

Range

Range is the simplest measure of variability. It describes how much
the population is spread around the mean. It is the difference
between the highest and lowest score in a distribution:

Range = maximum value − minimum value

Although easy to compute, range is based solely on the two most
extreme scores in the distribution, and thus it is susceptible to
much fluctuation. For instance, in frog Pop A, the range is 3.4 cm.
However, if we caught one additional frog that measures 11.5 cm,
the range would jump to 6.2 cm! Therefore, the range is not often
a reliable measurement of variability.
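The sensitivity of range to a single extreme score is easy to see in a couple of lines of Python. The frog lengths here are hypothetical values chosen to match the 3.4 cm range quoted in the text; the real dataset may differ.

```python
# Hypothetical Pop A frog lengths in cm (illustrative, not the real dataset)
pop_a = [5.3, 6.5, 7.7, 8.7, 8.7]
print(round(max(pop_a) - min(pop_a), 1))  # 3.4

# One extreme newcomer inflates the range dramatically:
pop_a_plus = pop_a + [11.5]
print(round(max(pop_a_plus) - min(pop_a_plus), 1))  # 6.2
```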
Variance (s²) - Variance measures, roughly, the average distance of
each data point from the mean. A natural first step is to look at
the deviations themselves:

Σ(Xi − X̄)

However, simply summing the deviations will result in a value of 0,
because values below the mean (negative) cancel out those above the
mean (positive). To get around this problem, variance is based on
squared deviations of scores about the mean:

Σ(Xi − X̄)²

Squaring the scores removes the positive/negative signs.
If we had a much larger sample size (say 100 frogs instead of just
5), our summed squared deviations would be expected to rise due
simply to sampling effort (we would have caught larger and smaller
frogs). To control for this, the sum of the squared deviations is
divided by the sample size minus one (n − 1). The result is,
roughly, the average of the squared deviations. This is the
variance:

s² = Σ(Xi − X̄)² / (n − 1)
The variance in Pop A = 2.17 and in Pop B = 3.46. This means
that the lengths of frogs in Pop B vary more than those in Pop
A.
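Python's statistics module implements exactly this sample formula (squared deviations divided by n − 1). The dataset below is a hypothetical Pop A sample, chosen only to be consistent with the variance of 2.17 quoted in the text.

```python
import statistics

# Hypothetical Pop A frog lengths in cm (illustrative, not the real dataset)
pop_a = [5.3, 6.5, 7.7, 8.7, 8.7]

# statistics.variance divides the summed squared deviations by (n - 1)
print(round(statistics.variance(pop_a), 2))  # 2.17
```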
The two samples mentioned above:

X: 49, 50, 51
Y: 2, 48, 100

Note: the symbol for variance is s². So if s² = 9, variance equals
9. You do not need to take the square root. The same goes for other
symbols in this lab (e.g., X²).
Standard deviation (s)

Standard deviation is a measure of variability expressed in the
same units as the data being measured. It is calculated by taking
the square root of the variance. (Variance is a measure in squared
units, and so has little meaning with respect to the data's units.)

s = √s²

The standard deviation for Pop A is 1.47 cm. Often the mean is
written with the standard deviation to show the variability of the
data; for instance, frogs in Pop A had an average length of 7.4 +/-
1.5 cm.
Standard error (se)

Standard error is the square root of the variance divided by the
sample size:

se = √(s²/n)

Standard error takes sample size into account, unlike standard
deviation: if we sampled 100 frogs, we might expect to understand
the true nature of the population better than if we just measured 5
frogs, and the standard error shrinks accordingly. The standard
error of our frog Pop A is 0.66.
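Both quantities follow directly from the variance. The sample below is hypothetical, chosen only to be consistent with the s = 1.47 cm and se = 0.66 reported in the text.

```python
import math
import statistics

# Hypothetical Pop A frog lengths in cm (illustrative, not the real dataset)
pop_a = [5.3, 6.5, 7.7, 8.7, 8.7]

s2 = statistics.variance(pop_a)    # sample variance (n - 1 denominator)
s = math.sqrt(s2)                  # standard deviation, same units as data (cm)
se = math.sqrt(s2 / len(pop_a))    # standard error = sqrt(s^2 / n)

print(round(s, 2), round(se, 2))  # 1.47 0.66
```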
Data are graphed using bars that show the mean and some measurement
of variance around that mean, such as standard deviation or
standard error. You should always indicate which is used. The graph
on the right is mean +/- standard error.

WHAT ELSE CAN WE DO WITH STATISTICS?
Now that you have an understanding of how to use Descriptive
Statistics to characterize a population and analyze its
distribution we can begin to use other Statistical techniques to
help us answer scientific questions. Science experiments are
usually designed to determine the effect of some variable on a
group by changing the variable for one group and comparing the
effects of this change to an unchanged group (control group). For
instance we might like to know if a certain drug actually helps
people get better. We could then set up an experiment and compare
patients on the drug to patients on a placebo (a sugar pill or
otherwise neutral medication). In addition, scientists often wonder
whether populations might differ with respect to certain
characteristics and if so what factors account for those
differences. Here too, we can use Statistics to determine whether
differences really exist.
So let's keep using our frog populations A and B to examine this
further.
We might start by asking:
Are Pop A frogs smaller than Pop B frogs? Based on our knowledge
of their habitats we might hypothesize that Pop A frogs are smaller
than Pop B frogs.
However, it will be impossible to collect ALL frogs from each pond
(and probably bad for the populations' survival), so we must rely
on statistical analysis of samples collected in each pond. Whenever
we rely on samples to answer questions about populations, we must
use statistics to decipher how different they are. Most of the
questions we will be asking in this lab pertain to differences
between samples. We often hypothesize that the populations are
different because of some variable (otherwise we wouldn't be
interested in them). To determine if they are different enough to
be meaningful, our statistical methods need something to compare
to. The default for statistical tests is that there is no
difference. We call this the Null Hypothesis. A null hypothesis
(H0) states that there is no difference among the groups you are
comparing, or no effect of a variable on a system. In this example,
your Null Hypothesis would be "There is no difference in body size
between Pop A and Pop B."
If the factor has a big enough effect, the samples will be
different statistically. We can state this as "Pop A frogs are
smaller than Pop B frogs." This is termed our Alternative
Hypothesis. You can formalize an alternative hypothesis, but the
statistics are testing the Null Hypothesis. After you collect data,
statistics will allow you to either Accept or Reject this Null
Hypothesis. If you reject the null, your evidence may point to the
alternative as the cause, but you did not prove that the variable
you measured was the cause. There could have been some unknown
thing happening too. When using statistics it is NOT POSSIBLE to
prove anything with 100% certainty, so you can't prove that one
population is larger than the other, BUT statistics gives you the
tools to DISPROVE that they are the same (disprove your Null
Hypothesis). Think about this statement: "All male Cardinals are
red." Is it even possible to prove that statement correct? I think
not! But you can easily disprove that statement when you photograph
the first male Cardinal that is not red!

Framing your hypothesis in the form of a Null Hypothesis gives you
the ability to statistically Accept or Reject the hypothesis. If
you reject the Null, then as a scientist you can begin to explain
why you think the groups are different.
In the following sections, four statistical tests will be
introduced: t-test, ANOVA, regression, and chi-square test. Using
simple math (trust me: if you can add, subtract, multiply, and
divide, you can do statistics! And if you can't, you can use a
computer and still do statistics), each of these four statistical
tests allows us to make specific kinds of comparisons with our
data, but more on this later. Each test utilizes the data we
collect and computes a number called a Calculated Test Statistic
(each test has its own CTS). A Calculated Test Statistic is a
single number that quantifies (or represents) the difference among
the groups being compared, based on their sample size, total value,
and/or mean and variance. For each statistical test, we compare our
CTS to a theoretical critical value (do not worry about how these
theoretical values are computed; it is wizardry! No one really
knows). Based on that comparison, you will determine whether to
accept or reject your Null Hypothesis. Regardless of how
overwhelmingly your data may show that the frog populations are
different, there is always a risk of being wrong when you Reject a
Null Hypothesis. That risk is given with each statistical analysis
that you perform in the form of a p value. So, let's start there.
What is the p value?

Probability of significance (p value) - The p value represents the
likelihood that your results are simply due to random chance and do
not represent something biologically real. So obviously, a low p
value is best!

In most areas of science, we have agreed that a p value equal to or
less than 0.05 is scientifically significant, and therefore that is
the level at which we can confidently REJECT a null hypothesis.
That is, there is less than a 5% chance that the result you
obtained from your experiments is random. If you did this
experiment with different subjects 100 times, you should get
similar results 95 times. So again, if the p value for a
statistical analysis is 0.05, you can confidently reject your Null
Hypothesis and shout it from the rooftop: "I reject my Null
Hypothesis; the frog populations are truly different in body size."
But are you 100% sure that the populations in these 2 ponds are
really different? No! But you are 95% sure, and that is enough for
us to make that statement.

A
higher value of the calculated test statistic results in a lower p
value. The exact relationship between a calculated test statistic
and p is usually complicated and cannot generally be calculated
with a simple formula. But think about it, if the CTS represents
the degree of difference among the groups you are comparing, then
the greater the number, the lower the probability that those
differences are random. In contrast, a low CTS shows us that the
differences among groups are small and likely not biologically
real, so your p value will be larger. The relationship also
includes the number of degrees of freedom (df). Degrees of freedom
are an integer representing the number of independent pieces of
information used to estimate a statistical parameter. They are
related to the sample size and the number of classes, categories,
or groups.
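One way to build intuition for the 0.05 cutoff is a small simulation, sketched below as a rough illustration rather than part of the lab. Both samples in each trial are drawn from the same population, so the null hypothesis is true by construction; the calculated t should still exceed the critical value for df = 8 (about 2.306) in roughly 5% of trials, purely by chance. The sample size, distribution, and seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(1)     # fixed seed so the run is repeatable
n = 5              # individuals per sample (arbitrary)
trials = 10_000
t_crit = 2.306     # critical t at p = 0.05 for df = (5-1) + (5-1) = 8

false_alarms = 0
for _ in range(trials):
    # Both samples come from the SAME population: the null is true.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    t = abs(statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        statistics.variance(a) / n + statistics.variance(b) / n)
    if t > t_crit:
        false_alarms += 1

# Roughly 5% of null experiments "look significant" by chance.
print(false_alarms / trials)
```

In other words, the 5% risk of wrongly rejecting a true null hypothesis is built into the cutoff itself.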
If our calculated test statistic is high enough and our p-value
is low enough, we can conclude that there is indeed a difference
between our samples. In science, we say that the means are
significantly
different. Since the word significant is commonly used, care
must be taken when writing in science not to use it in any other
sense. To say something is significant implies that the proper
stats have been done and a difference was found. To say two groups
are different indicates that they are significantly different.
Types of data

There are two types of data. Which type you collect determines
which test you use to analyze them:

Continuous data (quantitative) - data that can take on many
different values: in theory, any value between the lowest and
highest points on the measurement scale, e.g., (1, 2, 3) or (4.011,
4.012, 4.013). Use a t-test or ANOVA to analyze continuous data.

Discrete data (qualitative) - categorical data that have a limited
number of values, e.g., (yes/no), gender (male/female), or college
class (freshman/sophomore/junior/senior). We will use a chi-square
test to analyze discrete data.
In order to correctly use the following tests, a few things are
assumed. If these assumptions are not met, you should transform
your data into a format that is acceptable, or choose a test that
does not require these assumptions to be met (which we will not go
over in this class unless we have to).

Assumptions of Parametric Statistics:
1. Variables are continuous (or nearly so, i.e., there are a lot of
possible values).
2. Samples are collected randomly.
3. Observations (data) are independent of each other. The members
of each group are assumed to have nothing in common except the
desired treatment.
4. Within-group variance is equal across groups. (Use an F-test to
test for differences among variances.)
5. Data must be normally distributed.

If these assumptions are not met, see your instructor.
STATISTICAL TESTS
All the statistical tests described here (except regression) ask the question: Is there a difference between these groups? They then test that question mathematically.
I. t-test of means
o Used with continuous data, one variable.
o Looks for differences in means of 2 groups.
Example of when you would use a t-test:
Question: Is there a difference in height between male and female giraffes?
Variable of interest: Height (continuous data)
Groups: Male and Female (2 groups)
Comparison: Means (continuous data with variation)
Null Hypothesis: Mean height of Males = Mean height of Females
Alternative Hypothesis: Mean height of Males ≠ Mean height of Females
The heart of the t-test is the calculation of a statistic known as the "t value". The formula for the t value associated with two sample means is:

|t| = |X̄1 − X̄2| / sqrt(s1²/n1 + s2²/n2)

Where:
X̄1 = the mean of group 1, s1² = the variance of group 1, n1 = the sample size of group 1
X̄2 = the mean of group 2, s2² = the variance of group 2, n2 = the sample size of group 2

For the t-test, the number of degrees of freedom is df = (n1 − 1) + (n2 − 1). By convention, the sample with the larger mean is designated sample 1 to avoid a negative value of t, but some statistical software does not do this and thus produces negative values for t. In that case, simply take the absolute value of the listed t (|t|). Because of its complexity, the calculation of p is not easily done by hand. Rather, the calculated t value is compared to a table of critical values, which lists the value that the calculated statistic must exceed in order for p to be less than 0.05 at the appropriate number of degrees of freedom (SEE TABLE 1). If the calculated t value is greater than the critical t value in the table, then we REJECT THE NULL HYPOTHESIS: the means are significantly different.
Explanation of equation: The numerator evaluates the size of the
difference between the two sample means. A greater difference in
the means in the numerator produces a larger value of t. The
denominator is actually the formula for the standard error of the
difference between the means. Just as was the case for the standard
error of a single mean, the size of this standard error depends on
how many measurements we made (n) and how variable the measurements
are (the standard deviation, s). When the measurements are more
variable (i.e. a bigger s), our samples are less likely to be
representative, our standard error is bigger, and the calculated t
is smaller. When our sample size increases (i.e. a bigger n), we
are more confident that our sample is representative because the
variation in individual measurements tends to cancel out, leading to a smaller standard error and a larger value of t. Thus you can see that the formula for t includes all of the factors that affect our ability to assess whether differences are real or have resulted from chance, unrepresentative sampling: the size of the difference, the variability in the population, and the sample size of our experiment.
Example of t-test: The average age (in days) at which individuals of Daphnia longispina, a crustacean, begin reproduction was measured in two populations.

Question: Do the populations begin reproduction at different ages?
Variable of interest: Age (in days)
Groups: Population I and II
Comparison: Means (continuous data with variation)
Null Hypothesis: Mean age of reproduction in Pop I = Mean age in Pop II
Alternative Hypothesis: Mean age of reproduction in Pop I ≠ Mean age in Pop II

Population:             I        II
Individual ages (X):    7.2      8.8
                        7.1      7.5
                        9.1      7.7
                        7.2      7.6
                        7.3      7.4
                        7.2      6.7
                        7.5      7.2
Sum (X):                52.6     52.9
Sample size (n):        7        7
Mean (X̄):              7.5143   7.5571
Variance (s²):          0.5047   0.4095
Plug the data into the equation for t:

|t| = |X̄1 − X̄2| / sqrt(s1²/n1 + s2²/n2)
|t| = |7.5143 − 7.5571| / sqrt(0.5047/7 + 0.4095/7)
    = 0.0428 / sqrt(0.1306)
    = 0.0428 / 0.3613
    = 0.1184

df = (n1 − 1) + (n2 − 1) = (7 − 1) + (7 − 1) = 12

The critical value from the table at the desired 0.05 p value and 12 degrees of freedom is 2.179. Since our t, which equals 0.1184, is not greater than 2.179, we must ACCEPT our Null Hypothesis: the means of the two populations are not found to be different, and thus we cannot say that these populations reach reproductive maturity at different ages. We have no evidence to the contrary! Here is how this might be graphed:
[Figure: Age at reproduction in Daphnia longispina. Variance is depicted as standard error. Notice that the error bars overlap, another indication that the means are not statistically different.]
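The calculation above can be reproduced in a few lines of Python using only the standard library. This is a sketch of the lab's two-sample t formula applied to the Daphnia data, not a full statistics package; the p-value still comes from Table 1.

```python
import math
import statistics

# Daphnia longispina ages at first reproduction (days), from the example above
pop1 = [7.2, 7.1, 9.1, 7.2, 7.3, 7.2, 7.5]
pop2 = [8.8, 7.5, 7.7, 7.6, 7.4, 6.7, 7.2]

mean1, mean2 = statistics.mean(pop1), statistics.mean(pop2)
var1, var2 = statistics.variance(pop1), statistics.variance(pop2)  # s², n - 1 denominator
n1, n2 = len(pop1), len(pop2)

# |t| = |X1 - X2| / sqrt(s1²/n1 + s2²/n2)
t = abs(mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
df = (n1 - 1) + (n2 - 1)

print(f"|t| = {t:.4f} with df = {df}")  # about 0.12 with df = 12
# Compare to the critical value 2.179 (Table 1, p = 0.05, df = 12):
# since 0.12 < 2.179, we fail to reject the null hypothesis.
```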
II. Analysis of variance (ANOVA)
o Used with continuous data, one or more variables.
o Looks for differences in means among 3 or more groups.
Example of when you would use an ANOVA:
Question: Is there a difference in body size (as measured by weight) among frogs from 4 different ponds?
Variable of interest: Weight (continuous data)
Groups: Population I, II, III, IV (more than 2 groups)
Comparison: Means (continuous data with variation)
Null Hypothesis: There is no difference in weight among the four ponds
Alternative Hypothesis: There is a difference in weight among the four ponds
ANOVA will let you simultaneously compare the means of 3 or more
groups.
Although an ANOVA can be performed by hand, we will not take the
time to do that in this lab. Instead, we will use a computer
program to perform the messy parts. You can do that here:
http://www.physics.csbsju.edu/stats/anova.html
t-tests tell you if there is a significant difference between two groups. If there is, you can easily look at the two means and tell which one is bigger. An ANOVA tells you if there is a difference among more than two groups, but in this situation you cannot easily tell whether treatment A differs from B but not from C, etc. If you want to know where among the four ponds there is a difference, you must perform a follow-up test, called a post hoc test (such as the Tukey-Kramer test), to look for differences among the means. For instance, if we had measured a third population of Daphnia, we might get results like the following:
ANOVA table:

Source of variation   Sum of Squares   df   Mean squares   F
Between               158.8             2   77.40          13.12
Error                  70.8            12    5.900
Total                 225.6            14
The probability of this result, assuming the null hypothesis, is 0.001. Therefore we can REJECT the NULL and say that there is a difference among these populations. The graph on the right shows the mean of each population with its standard error. If the error bars of two populations overlap, those populations are not different (also indicated by the letters).
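A one-way ANOVA can also be sketched by hand in Python. The three groups below are made-up data for illustration (they are not the Daphnia or frog measurements); the point is how the total variation is partitioned into a between-group part and a within-group (error) part.

```python
import statistics

# Three hypothetical groups (made-up data for illustration)
groups = [
    [4.1, 4.5, 4.3, 4.7, 4.4],
    [5.6, 5.2, 5.9, 5.4, 5.8],
    [4.9, 5.1, 4.8, 5.3, 5.0],
]

all_values = [x for g in groups for x in g]
grand_mean = statistics.mean(all_values)
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group sum of squares: how far each group mean sits from the grand mean
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
# Within-group (error) sum of squares: spread of each value around its own group mean
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

df_between = k - 1       # 2
df_within = n_total - k  # 12
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"F = {f_stat:.2f} with df = ({df_between}, {df_within})")
```

A large F (relative to the critical F value for those degrees of freedom) means the between-group variation dominates the within-group variation, so at least one mean differs; a post hoc test such as Tukey-Kramer would then locate which means differ.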
Why not just use several t-tests? t-tests should not be used for comparing means of more than two groups because each comparison has its own error (probability of getting a significant result due to chance), and this error adds up with each comparison. In other words, if you compared each possible pair of samples with a p-value cutoff of 0.05, each comparison would carry a 5% chance of finding a difference purely by chance. The more samples, the more likely this is. For example, if we had 7 different groups, there would be 21 pairs, and we would expect to see a difference in at least one of them simply by chance.
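This multiple-comparison problem is easy to quantify. Under the null hypothesis each test has a 5% false-positive rate, and across 21 pairwise tests the chance of at least one false positive balloons. A small sketch (treating the comparisons as independent, which is a simplifying assumption):

```python
from math import comb

alpha = 0.05          # per-comparison false-positive rate
k = 7                 # number of groups
n_pairs = comb(k, 2)  # number of pairwise t-tests: 21

# Probability of at least one false positive across all pairs,
# assuming the comparisons are independent (a simplification)
familywise = 1 - (1 - alpha) ** n_pairs

print(f"{n_pairs} pairwise tests -> {familywise:.0%} chance of at least one false positive")
```

With 7 groups, the chance of at least one spurious "significant" result is roughly two in three, which is why ANOVA plus a post hoc test is used instead.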
III. Regression/Correlation
o Used with continuous data, two variables.
o Tests the relationship of two variables.
Example of when you would use a Regression:
Question: Is height related to shoe size?
Variables of interest: Height and Shoe size (continuous data)
Groups: The group of individuals you measured (only 1 group, but 2 variables)
Comparison: Each individual's measurements (continuous data with variation)
Null Hypothesis: There is no relationship between height and shoe size.
Alternative Hypothesis: Height and shoe size are related.
Regressions and correlations are very similar, and for the purposes of this class we may treat them the same, but technically:

Regression - Tests the relationship of one variable to another by expressing one as a linear (or more complex) function of the other. In regression, one variable is the cause and the other is the effect. For example, people who predominantly eat fatty foods weigh more (the more fatty foods one eats, the heavier he/she will be).
Correlation - Tests the degree to which two variables vary together. Both variables change together, not necessarily because one causes the other; often both are affected by a third variable. A recent study found a correlation between the amount of chocolate consumed by a country and the number of Nobel laureates it produces. Is this due to the chocolate? Maybe, but most likely the two are affected by a third cause.

CORRELATION DOES NOT EQUAL CAUSATION!!!
A function is a mathematical relationship enabling us to predict what values of variable Y correspond to given values of variable X. Such a relationship is written as Y = f(X). You may recognize this as Y = bX. In the simplest regression, Y = X; therefore, for example, when X = 25, we can predict that Y will also equal 25. Fitting a line through this relationship produces something like the figure on the right. Here, X is the independent variable (the cause, free to vary) and Y is the dependent variable (the effect, due to the cause).
The following figure shows a functional relationship (the
variables are not perfectly correlated), in which for every
increase of 7 units of X, there will be a 1 unit increase in Y.
In nature, there is variation in the relationship between each pair of X and Y values. For example, the vertical lines connecting each data point to the best-fit line in the figure on the right measure this variation. And whenever there is variation, we must do statistics to see whether what we found is random chance or a significant relationship. The test for regression is similar to that for ANOVA,
but the math involved is beyond this class. There are two things
you need to know to understand a regression: the p-value (see
above) and the r2 value. The short and skimpy explanation is that
the p-value, as always, tells you if there is a significant
relationship between your two variables. The r2 value tells you how
good a predictor that relationship is. It measures the variation of
each data point from the best fit line (that is, how far away from
the line each
dot is; see the figure above). If the relationship is a near-perfect predictor, the data will line up nicely and you will get a high r². In the figure to the left, p < 0.001 and r² = 0.94. This means that the independent variable explains 94% of the variation in the dependent variable; it is a really good predictor. By the way, this is a positive relationship: as X increases, so does Y.
But in the figure to the right, p = 0.02, so the line is
significant, but r2 = 0.22. X is not as good a predictor of Y (it
only explains 22% of the variation). This is an example of a
negative relationship: as X increases, Y decreases. You can think
of r2 as a measurement of the scatter of the data. How scattered is
the data? In the first figure, it is less scattered than in the
second figure. Here is a general guideline for r² values: [figure not reproduced].

Pair of data   Independent Variable (X)   Dependent Variable (Y)
1              20                         30
2              21                         33
3              27                         40
4              29                         39
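The four (X, Y) pairs in the table above can be run through a simple least-squares fit. The sketch below computes the slope, intercept, and r² by hand with Python's standard library; the p-value requires more machinery, so it is omitted here.

```python
import statistics

# Data pairs from the table above
x = [20, 21, 27, 29]
y = [30, 33, 40, 39]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)

# Sums of squares and cross-products around the means
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx                    # b in Y = a + bX
intercept = mean_y - slope * mean_x  # a
r_squared = sxy ** 2 / (sxx * syy)   # fraction of variation in Y explained by X

print(f"Y = {intercept:.2f} + {slope:.2f}X, r^2 = {r_squared:.2f}")
```

For these pairs, X explains roughly 90% of the variation in Y, and the slope is positive, matching the intuition from the figures above.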
IV. Chi square test (X²)
o Used with discrete data, one variable.
o Compares observed frequencies of an experiment to expected frequencies. Either/or, yes/no, proportions.
o Since the data are discrete, there is no real variation. Also, a chi square result is usually not graphed because there are only two numbers; these can be listed in the text or in a table.
Example of when to use a Chi Square: You made a bet with a friend based on coin tosses. You picked heads every time, but lost best of 10 by 8 to 2. You think the coin might have been weighted and your friend cheated.

Question: Is 8:2 different from what we would expect with a random toss (5:5)?
Variable of interest: The ratio of heads to tails
Groups: Heads and Tails
Comparison: Observed vs. expected frequencies (discrete data)
Null Hypothesis: Observed frequency = Expected frequency (8:2 is not significantly different from 5:5)
Alternative Hypothesis: Observed frequency ≠ Expected frequency (8:2 is significantly different from 5:5)

With Chi Square, you can statistically compare how many times heads and tails came up with this coin (observed frequencies) to what you would expect if the coin is not rigged (expected frequencies).

How to do a Chi Square test
1. Collect your data! You flipped the coin 10 times and observed the following (these are your Observed Frequencies):
o 2 heads
o 8 tails
2. Determine the expected frequencies. If the coin is not rigged, we would expect an even number of heads and tails, a 50:50 ratio. Since we flipped it 10 times, our expected frequencies are:
o 5 heads
o 5 tails
3. Calculate your chi square test value. Make a chi square table like the one below to set up your calculations.

X² = Σ (observed − expected)² / expected
4. Compare X² and degrees of freedom in Table 2 to find the p value.
Degrees of freedom (df) for X² = number of categories − 1
o In our example we had two potential outcomes, Heads or Tails, so df = 2 − 1 = 1.
o If the calculated X² value is greater than the theoretical (critical) value given under p = 0.05 in the Chi Square Distribution table (see Table 2), we reject the null hypothesis and conclude that the coin is rigged. If our calculated chi square is less than the critical chi square, we must accept the null and conclude that the coin is not rigged (it did not behave differently than what you expected).
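Steps 1-4 above can be sketched in a few lines of Python. This reproduces the coin-toss example exactly; the critical value 3.84 comes from Table 2 (df = 1, p = 0.05).

```python
# Observed and expected frequencies for the coin-toss example
observed = [2, 8]  # heads, tails
expected = [5, 5]  # a fair coin flipped 10 times

# Chi square: sum of (observed - expected)^2 / expected over all categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # number of categories minus one

print(f"X^2 = {chi_square} with df = {df}")
# 3.6 < 3.84 (critical value, Table 2), so we accept the null: the coin is fair.
```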
CHI SQUARE TABLE

Step:     1           2          3          4        5          6
          Observed    Expected   Expected   o − e    (o − e)²   (o − e)² / e
          freq. (o)   ratio      freq. (e)
Heads     2           1/2        5          −3       9          1.8
Tails     8           1/2        5          +3       9          1.8
Sum (n)   10          1          10         0                   X² = 3.6

p > 0.05
Since our p value is greater than 0.05, we must ACCEPT the NULL.
This means that there is no difference between our observed
frequency and the expectation of randomness. We can trust our friend's coin, and we have lost the bet.
In the pages below, you will find a quick flow chart to
determine which statistical test to use and two tables that allow
you to convert a test statistic (t and X2) to a p value. Now you
have many tools in your statistics toolbox. Go out and DO SCIENCE
TO IT!
Remember, the symbol for chi square is X². So if X² = 9, your chi square value is 9; you do not need to take the square root. The same goes for other squared symbols in this lab (e.g., s²).
[Flow chart for choosing a statistical test appeared here as a figure.]
Table 1. Critical Values for t-tests

Two-tailed p values (columns to the left of 0.05: means are NOT significantly different; 0.05 and beyond: means ARE significantly different):

df     1.00   0.50   0.40   0.30   0.20   0.10   0.05   0.02   0.01   0.002  0.001
1      0.000  1.000  1.376  1.963  3.078  6.314  12.71  31.82  63.66  318.3  636.6
2      0.000  0.816  1.061  1.386  1.886  2.920  4.303  6.965  9.925  22.32  31.59
3      0.000  0.765  0.978  1.250  1.638  2.353  3.182  4.541  5.841  10.21  12.92
4      0.000  0.741  0.941  1.190  1.533  2.132  2.776  3.747  4.604  7.173  8.610
5      0.000  0.727  0.920  1.156  1.476  2.015  2.571  3.365  4.032  5.893  6.869
6      0.000  0.718  0.906  1.134  1.440  1.943  2.447  3.143  3.707  5.208  5.959
7      0.000  0.711  0.896  1.119  1.415  1.895  2.365  2.998  3.499  4.785  5.408
8      0.000  0.706  0.889  1.108  1.397  1.860  2.306  2.896  3.355  4.501  5.041
9      0.000  0.703  0.883  1.100  1.383  1.833  2.262  2.821  3.250  4.297  4.781
10     0.000  0.700  0.879  1.093  1.372  1.812  2.228  2.764  3.169  4.144  4.587
11     0.000  0.697  0.876  1.088  1.363  1.796  2.201  2.718  3.106  4.025  4.437
12     0.000  0.695  0.873  1.083  1.356  1.782  2.179  2.681  3.055  3.930  4.318
13     0.000  0.694  0.870  1.079  1.350  1.771  2.160  2.650  3.012  3.852  4.221
14     0.000  0.692  0.868  1.076  1.345  1.761  2.145  2.624  2.977  3.787  4.140
15     0.000  0.691  0.866  1.074  1.341  1.753  2.131  2.602  2.947  3.733  4.073
16     0.000  0.690  0.865  1.071  1.337  1.746  2.120  2.583  2.921  3.686  4.015
17     0.000  0.689  0.863  1.069  1.333  1.740  2.110  2.567  2.898  3.646  3.965
18     0.000  0.688  0.862  1.067  1.330  1.734  2.101  2.552  2.878  3.610  3.922
19     0.000  0.688  0.861  1.066  1.328  1.729  2.093  2.539  2.861  3.579  3.883
20     0.000  0.687  0.860  1.064  1.325  1.725  2.086  2.528  2.845  3.552  3.850
21     0.000  0.686  0.859  1.063  1.323  1.721  2.080  2.518  2.831  3.527  3.819
22     0.000  0.686  0.858  1.061  1.321  1.717  2.074  2.508  2.819  3.505  3.792
23     0.000  0.685  0.858  1.060  1.319  1.714  2.069  2.500  2.807  3.485  3.768
24     0.000  0.685  0.857  1.059  1.318  1.711  2.064  2.492  2.797  3.467  3.745
25     0.000  0.684  0.856  1.058  1.316  1.708  2.060  2.485  2.787  3.450  3.725
26     0.000  0.684  0.856  1.058  1.315  1.706  2.056  2.479  2.779  3.435  3.707
27     0.000  0.684  0.855  1.057  1.314  1.703  2.052  2.473  2.771  3.421  3.690
28     0.000  0.683  0.855  1.056  1.313  1.701  2.048  2.467  2.763  3.408  3.674
29     0.000  0.683  0.854  1.055  1.311  1.699  2.045  2.462  2.756  3.396  3.659
30     0.000  0.683  0.854  1.055  1.310  1.697  2.042  2.457  2.750  3.385  3.646
40     0.000  0.681  0.851  1.050  1.303  1.684  2.021  2.423  2.704  3.307  3.551
60     0.000  0.679  0.848  1.045  1.296  1.671  2.000  2.390  2.660  3.232  3.460
80     0.000  0.678  0.846  1.043  1.292  1.664  1.990  2.374  2.639  3.195  3.416
100    0.000  0.677  0.845  1.042  1.290  1.660  1.984  2.364  2.626  3.174  3.390

One-tailed p values:
       0.50   0.25   0.20   0.15   0.10   0.05   0.025  0.01   0.005  0.001  0.0005

df = (n1 − 1) + (n2 − 1)
Table 2. Critical Values for Chi-Square tests

(Columns to the left of 0.05: expected and observed are NOT significantly different; 0.05 and beyond: expected and observed ARE significantly different.)

df   p: 0.99   0.95    0.90    0.75   0.50   0.25    0.10    0.05    0.025   0.01
1       0.0002  0.003   0.015   0.10   0.45   1.32    2.70    3.84    5.02    6.63
2       0.0201  0.102   0.210   0.57   1.38   2.77    4.60    5.99    7.37    9.21
3       0.1148  0.351   0.584   1.21   2.36   4.10    6.25    7.81    9.34    11.34
4       0.2971  0.710   1.063   1.92   3.35   5.38    7.77    9.48    11.14   13.27
5       0.5543  1.145   1.610   2.67   4.35   6.62    9.23    11.07   12.83   15.08
6       0.8721  1.635   2.204   3.45   5.34   7.84    10.64   12.59   14.44   16.81
7       1.2390  2.167   2.833   4.25   6.34   9.03    12.01   14.06   16.01   18.47
8       1.6465  2.732   3.489   5.07   7.34   10.21   13.36   15.50   17.53   20.09
9       2.0879  3.325   4.168   5.89   8.34   11.38   14.63   16.91   19.02   21.66
10      2.5582  3.940   4.865   6.73   9.34   12.54   15.98   18.30   20.48   23.20

X² = Σ (observed − expected)² / expected

Degrees of freedom equals the number of groups being compared minus one (df = n − 1). The table gives critical values that, at the given degrees of freedom, indicate the given p-values.
1108K Lab 1: Statistics Postlab
Name___________________________________

1. Let's say that one time you drank a soda before a test and did better on that test than you ever have before. Design an experiment using the other students in this class to determine whether drinking one of the following: soda, milk, or water, improves performance on a test over the other drinks. You DO NOT have to perform this experiment, just think through the design. DO NOT MAKE UP DATA. Provide the following:
a. Question
b. Null Hypothesis
c. Alternate Hypothesis
d. Experimental design. Make sure to keep all variables the same except the one of interest. Also remember that you need replication to tease apart any variation.
e. Type of data you would collect. What will you measure
exactly?
f. Statistical test you need to analyze the results.
2. Why is standard error always smaller than standard
deviation?
3. What does a p-value of 0.03 mean? Be specific in your
interpretation without making up data.
4. Is it more likely that the data depicted on the right has a
high r2 or a low r2?
5. What does that mean?
6. If you increase the sample size, what happens to the critical
value (the number you have to reach to find a significant
difference) of a t-test?
7. Using a t-test, determine if these two groups are
statistically different. Be sure to show your t-test work (filled
in equation) and report the t-value and p-value. You are trying to
determine whether breastfeeding or bottle-feeding is the best
method to speed up the growth of a human baby. You surveyed 15
women who breastfed their babies. Their babies gained 17 pounds on
average (variance = 5) over a 3-week period. You surveyed 20 women
with similar age babies who bottle-fed their child. Their babies
gained 12 pounds on average (variance = 4) over a 3-week
period.
a. Question
b. Null Hypothesis
c. Alternate Hypothesis
d. What are your results and conclusions? (Do you accept or
reject your hypothesis?)
8. For this part, I want you to perform a t-test on data you
collect yourself. Nothing too complicated (no need to perform an
experiment), you just need to collect continuous, quantitative data
of two groups and test to see if there is a difference between
their means. Analyze this data using a t-test. Provide the
following.
a. Question:
b. Null Hypothesis:
c. Alternate Hypothesis:
d. Your data:

e. Complete the table:
                     Group 1    Group 2
Mean
Median
Mode
Variance
Standard Deviation
Standard Error
f. Your calculations for the t-test
g. Your conclusions, including the t statistic, p-value, and
interpretation of this p-value.
h. Attach a graph of your results using Excel. Show the means
and the standard error. Label your axes.
Group 1            Group 2
Individual Data    Individual Data