Lab 1: The Scientific Method and Statistics
The power of science comes not from scientists but from its
method. - E. O. Wilson, The Creation
When asked "What is biology?" most people respond with "It's the
study of living things," which is mostly correct. A better answer
would be "It's the scientific study of living things." But what makes
one inquiry scientific and another not? The wealth of knowledge
that fills your textbook is the result of the hard work of many
scientists over the course of centuries (and only represents a
fraction of what scientists have discovered during that period). As
Dr. Wilson accurately points out above, the key to scientific
inquiry is not the scientist, but the scientific method.
There are two main approaches to scientific discovery: one in
which nature is described using observation and measurements
(observational science) and one in which a natural phenomenon is
explained using the scientific method (hypothesis-based or
empirical science). Often, observational science leads to questions
which can be answered through hypothesis-based investigations.
There are six steps to the scientific process, each important and
vital to making the process work:
THE SCIENTIFIC METHOD
1. Observe: You cannot ask educated questions about natural
phenomena without knowing something about the nature of the
phenomena. The key to any investigation, scientific or otherwise,
is observation. Observation leads to the collection of data and
usually will lead the observer to ask a question.
2. Question: As a scientist, you should be
curious about the way the things you observed work, came to be,
interact with other things, etc. What is it you want to know about
the world? Why did that stop? How is this formed? Which group is
faster? The question formed will lead to the formation of
hypotheses.
3. Hypothesize: A hypothesis (pl.
hypotheses) is a proposed explanation for a set of observations.
Another way to look at a hypothesis is to say that it is a
reasonable guess of what might be occurring; a possible answer to
your question. The hypothesis is formed using information gathered
from observations. To be useful, a hypothesis should be testable
using experimentation. Once a hypothesis is proposed, several
consequences can be reasonably expected. These expectations can be
termed predictions and they are the expected outcomes if a
hypothesis is true.
Based on a figure from Lehner, Handbook of Ethological Methods, 1996.
4. Design experiment and collect data: Once a hypothesis (or
hypotheses) is formulated, the investigator should attempt to
verify the predictions. To test the validity of the predictions,
experiments are created to allow for controlled testing of the
hypothesis. Your experiment will determine if your hypothesis is
supported or not by evidence. A good experiment is one that will
provide supporting evidence if a hypothesis is correct and is
equally likely to show that a hypothesis is false, if it is not
correct. Since biological systems vary a lot, it is best to repeat
your experiment several times (replication) and then use statistics
to sort out the outcome.
Experiments that are appropriately set up only change one
variable (called manipulated variable or independent variable) for
each run of the experiment. All other factors are controls (or
controlled variables). In other words, in order to test whether or
not this one variable (as determined by the hypothesis) affects the
outcome of the experiment, you must 1) keep all other variables
constant (i.e., everything but the variable in
question is the same between subjects) and 2) have controls to
show that the organism is able to function normally in this
experiment. Your experiment should produce data. Data (plural) are
measurable outcomes of the experiment. These are the numbers that
will indicate if your hypothesis is actually causing the observed
effect. Data can be measured in height, width, length, seconds,
minutes, days, volume, density, number of something, and many other
measurable quantities. Don't forget your units!
5. Analyze data: Analyze data using statistical methods (see
below). Statistics typically check to see if the data produced by
one group (e.g., the control) is different from the data produced
by another (the manipulated variable) in a statistically meaningful
way. They take into consideration not only the average, but the
variation around that average. If the data are not statistically
significant, the outcome from the manipulated variable did not
differ from the control and you must reject your hypothesis. If you
accept your hypothesis, then there is evidence that your prediction
is true. If the experimenter can provide replication of these
results, then the hypothesis can be considered a reasonable
explanation of the observed phenomena. If you reject your
hypothesis, then your prediction is probably not causing the
observed phenomenon and you still do not know what is causing it.
As a curious scientist, the experimenter should revise the
hypothesis and/or experimental design and try again.
6. Conclude: When data are collected and analyzed, the
hypothesis is either tentatively accepted or rejected. The
acceptance of a hypothesis may only be temporary, as new
observations and experiments can lead to the rejection of the
hypothesis and/or an alternative explanation may also be true.
Here, the outcomes of the experiment are interpreted in a broader
context.
As a side note: a theory is an idea supported by many lines of
evidence gathered by many investigators testing many hypotheses
over many years. It leads to other predictions, or
hypotheses, which when tested are often also supported by evidence.
We can only base our conclusions on the evidence at hand. In the
future, further evidence may lead us to a different conclusion.
Therefore, in science, all of our knowledge is theoretical. Famous
theories include things like the following:
Gravity - that all objects have a pulling force; larger objects have a larger pull.
Cell theory - that the basic unit of life is the cell, cells come from other cells, etc.
Germ theory - that diseases are often caused by microorganisms, rather than bad karma.
Plate tectonics - that large continental plates drift across the liquid layer below them, causing earthquakes, etc.
Atomic theory - that atoms are the smallest units of matter.
Some of these have been supported by a hundred years of data
now. Even so, they are still called theories. Most were
controversial when first proposed and some of them may still be
controversial. All of them could, in theory (pun intended) be
overturned by a better theory that fits all the evidence more
closely. In this lab, you will be making and breaking hypotheses
constantly, but will you create a theory? Not likely. If you still
don't see the difference, talk to someone about it.
APPLYING THE SCIENTIFIC METHOD

A certain observation or situation leads us to ask questions. A
hypothesis is a proposed explanation or answer to your question
based on an educated/informed guess. For example, Suzie's car won't
start this morning on her way to school. Your question: "Why is
Suzie's car not starting?" A hypothesis for this situation might be
that Suzie's car battery has died. Another hypothesis might be that
Suzie has no gas in her car. Both of those statements assume that
you might know the problem and have guessed the reason. Note that a
hypothesis must be written in statement form, not as a question.
Also, a good hypothesis must be testable and falsifiable (able to
be proved false).
The second step is experimental procedure. In this step you design
an experiment to test your hypothesis. This experimental design
will need to test the hypothesis directly as well as give you a
clear yes or no answer. To help with determining the yes or no
answer, a control should be incorporated. In the experimental
design, controls are set up exactly the same as your experimental
test, but with only one thing different. In the non-working car
example, the first hypothesis being tested is whether the battery
is dead. The experimental procedure would consist of putting a new
battery into Suzie's car. The control would be the original old
battery in Suzie's car. Only the type of battery has changed: old
versus new. The same car and battery cables must be used to ensure
that only one thing has changed. Or, Suzie could take her battery
to the mechanic, who can test the battery and compare Suzie's
battery results to those of a functional battery of the same brand.
The results of that experiment will then need to be analyzed so
that a conclusion can be made. If the new battery in Suzie's car
makes the car start, or the mechanic states that Suzie's old
battery is below normal working standards, then Suzie will know
that the hypothesis of a non-working battery has been validated (a
yes answer was achieved). We can then conclude, or make a
substantiated explanation of the situation, that the battery was
indeed the reason the car did not start. But what if the new
battery did not restart Suzie's car, or what if the mechanic stated
that Suzie's battery was well within working parameters? That would
mean that the hypothesis was invalidated (a no answer was
achieved). Therefore, Suzie has not determined the cause of her car
troubles. So, she will have to start over to diagnose the problem
and come up with another hypothesis, such as "the car wouldn't
start because it had no gas in the engine." Suzie will then
redesign an experimental procedure to test this hypothesis using
the proper controls. Suzie can again either solve her problem, or
will have to come up with another hypothesis.

The art of deduction, from Monty Python and the Holy Grail, Scene 5:
BEDEMIR: Quiet, quiet. Quiet! There are ways of telling whether she is a witch.
CROWD: Are there? What are they?
BEDEMIR: Tell me, what do you do with witches?
VILLAGER #2: Burn!
CROWD: Burn, burn them up!
BEDEMIR: And what do you burn apart from witches?
VILLAGER #1: More witches!
VILLAGER #2: Wood!
BEDEMIR: So, why do witches burn?
[pause]
VILLAGER #3: B--... 'cause they're made of wood...?
BEDEMIR: Good!
CROWD: Oh yeah, yeah...
BEDEMIR: So, how do we tell whether she is made of wood?
VILLAGER #1: Build a bridge out of her.
BEDEMIR: Aah, but can you not also build bridges out of stone?
VILLAGER #2: Oh, yeah.
BEDEMIR: Does wood sink in water?
VILLAGER #1: No, no.
VILLAGER #2: It floats! It floats!
VILLAGER #1: Throw her into the pond!
CROWD: The pond!
BEDEMIR: What also floats in water?
VILLAGER #1: Bread!
VILLAGER #2: Apples!
VILLAGER #3: Very small rocks!
VILLAGER #1: Cider!
VILLAGER #2: Great gravy!
VILLAGER #1: Cherries!
VILLAGER #2: Mud!
VILLAGER #3: Churches -- churches!
VILLAGER #2: Lead -- lead!
ARTHUR: A duck.
CROWD: Oooh.
BEDEMIR: Exactly! So, logically...,
VILLAGER #1: If... she.. weighs the same as a duck, she's made of wood.
BEDEMIR: And therefore--?
VILLAGER #1: A witch!
CROWD: A witch!
BEDEMIR: We shall use my larger scales!
Introduction to Statistics

All of the above information is to show you how science works.
Science demands a lot. It requires the scientist to be unassuming,
unbiased, curious, skeptical, and clever. Just because you may
think things happen in a certain way or because of a certain
reason, you cannot assume anything without evidence. If your
experiment shows that birds fly because they are lighter than air,
then this is what you assume. Of course, your experiment may not be
appropriate to test this, or it wasn't done properly, but until it
is redone, any conclusions must be drawn on the evidence at hand.

We use science and statistics more than you might think. And often,
we are drawing the wrong conclusions from them. I just heard on NPR
that New Mexico dropped from 1st to 17th when ranking states by the
number of alcohol-related fatalities. Good for NM, right? But it is
important to know how those stats are determined. Previously this
was the number of fatalities per state population (per capita). Now
it is the number of fatalities per miles driven on average. And as
the reporter said, New Mexicans drive a lot. It is still in the
bottom 10 for per capita deaths. The
two figures should not be compared directly.

Flipping through an old Newsweek magazine, I came across these
advertising slogans that use statistics, but I wonder about the
specifics. Here are the claims they make and some questions that
immediately come to my mind:

Product: Dodge
Claim: "The world's biggest cab"
Questions: What is this measuring: volume, surface area, length,
width? Internal or external? Compared to what: other trucks, cars,
everything that ever existed?

Product: Crestor
Claim: "10-mg dose of Crestor, along with diet, can lower bad
cholesterol by as much as 52% (vs 7% with placebo)"
Questions: Are these percentages averages of the study group or
maxima? In other words, perhaps only 1 person in the entire study
saw a 52% drop and this was way above the average. What is the
effect of diet alone? Did the placebo treatment have the same diet?

Product: Tempur-Pedic
Claim: "In a recent survey, 92% of our enthusiastic owners report
sleeping better and waking more refreshed."
Questions: How many people were surveyed? 9.2/10 is not the same as
920/1000. Compared to what? Their old mattress? Nothing? Did you
only survey enthusiastic owners, or were only the opinions of
enthusiastic owners included in this statistic?
Since many of you have aspirations to join the medical field, let
me point this out. It is my experience that most medical
researchers do not use proper statistical tests during their
experiments. This means that most of the studies used to show a
link between cancer and your favorite pastime, or to discover the
gene that controls your diet preferences, or to develop medicines
that treat the common cold might be based on faulty premises. I've
seen peer-reviewed papers published with no regard for proper
statistical procedure. You'd be surprised how many papers talk
about how correlated their treatment is with recovery based on
figures worse than the last regression example above. The authors
got away with it because the reviewers don't know it either. In
addition, most patients and many physicians are overwhelmed,
uninformed, or ignorant of how statistics are used by researchers,
medical supply companies, or mainstream media to sell their story.
I've seen news stories based on differences between groups that are
smaller than the margin of error (meaning there is no difference).
Hopefully, with this lab manual you can change some of that. My
point is: be critical and be curious. There is a lot of information
out there these days. Not all of it is valid. Do not take it at
face value. Ask questions. Question your consumer products, your
friends, your doctors, your professors. Then you'll be one step
closer to thinking like a scientist.
"You can claim anything with statistics, but only to those who
don't understand statistics." - unknown
STATISTICS!

Humans consider themselves good judges of what is bigger, faster,
or better than the rest, but how much bigger is bigger? For
example, imagine you want to know if frogs from one population (Pop
A) are bigger than those from another population (Pop B). You could
take a frog from each one and measure their lengths. You might find
that the frog from Pop B is much bigger than the one from Pop A.
But do these frogs really show the difference between the two
populations? What if you just happened to catch a small frog from
Pop A and a large one from Pop B? What if you caught several more
and found that the frogs from both populations really had a lot of
variation in lengths and looked something like the dataset below?
Some of the frogs in Pop A are bigger than some in Pop B. Now can
you tell if one population is bigger than another? Which one?
This is where statistics becomes useful. We cannot easily judge
the differences between groups accurately when there is variation
(as in most studies in biology). Even the advertisement on the left
admits "Individual results may vary." Therefore we must analyze them
statistically. We usually use statistics to understand something
about a given population, be it animal, vegetable, or mineral. The
best way would be to measure every single individual from both
groups, but that is not often possible. Therefore we collect data
on a subset or sample of the group of interest (n = sample size,
the total number of individuals sampled). We then extrapolate the
data for this sample to the rest of the population. But
the sample should be unbiased, and care should be taken to
control for the following: 1. Make sure the sample is a good
representation of the population as a whole. For instance, if
you wanted to know if Georgians prefer the Bulldogs or the
Jackets, you will not get a complete understanding of peoples
preferences if you only survey people in Athens.
2. If you are using multiple groups to examine the potential
effects of a specific variable, make
sure that all groups are equivalent in every way, except in the
variable being tested. For instance, it is not good science to give
a placebo to people over 50 years of age and a new medicine to
people under 50 years of age, and conclude that the people given
the drug are healthier. In this case, there are two experimental
variables: presence of the drug and age. The two groups are not
equivalent.
Once you have gone out and taken some measurements of a sample (the
number of trees in a forest, the size of seeds eaten by cotton
rats, the blood pressure of patients with arthritis), your first
step will be to describe different characteristics of your sample.
To do this, we use Descriptive Statistics, so here we go!

Measures of Center

These statistics indicate which values are most common. They
attempt to define what is normal for the population.

Mean - In popular speech, the mean is the average. It is the most
commonly used measure of central tendency. The mean is computed by
summing all values in your sample and dividing that sum by the
sample size:

X̄ = (ΣXi) / n

where Xi = each individual data point, n = sample size (the number
of data points in your sample), and Σ = summation.
In our frogs, Pop A has a mean length of 7.38 cm and Pop B is
9.06 cm on average.
Mode - The mode of a sample is the score that appears most
often. In Pop A, mode = 8.7 cm (i.e. 2 of the frogs were 8.7 cm
long). Pop B does not have a mode.
Median - The median divides the distribution into halves; half
of the scores are above the median and half are below it when the
data are arranged in numerical order.
When the sample size is an odd number, as in our frog sample,
the median is the middle value (Pop A = 7.7, Pop B = 8.8). When the
sample size is an even number, the median is halfway between the
middle values, e.g., for the dataset (1 3 4 5 8 9), the median
location is half-way between the 3rd and 4th scores (4 and 5) or
4.5.
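These measures of center can be checked with Python's standard statistics module. The frog lengths below are hypothetical values chosen only to be consistent with the numbers quoted for Pop A (mean 7.38 cm, median 7.7 cm, mode 8.7 cm); the actual lab dataset may differ.

```python
import statistics

# Hypothetical Pop A frog lengths in cm (illustrative, not the real dataset)
pop_a = [5.3, 6.5, 7.7, 8.7, 8.7]

print(round(statistics.mean(pop_a), 2))  # 7.38
print(statistics.median(pop_a))          # 7.7 (middle value, odd sample size)
print(statistics.mode(pop_a))            # 8.7 (appears twice)

# With an even sample size, the median is halfway between the middle values:
print(statistics.median([1, 3, 4, 5, 8, 9]))  # 4.5
```

Note that for the odd-sized sample the median is simply the middle score, while the even-sized dataset from the text gives 4.5, halfway between 4 and 5.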
Median is a useful measure of center for data that is very
skewed towards one end, like salaries, in which the mean does not
give a good measurement of the norm (see figure at right). If you
were complaining to your boss that you and your coworkers needed a
raise, would you complain about how low the mean or the median
is?
If the dataset is normally distributed, the data points are
equally distributed about the mean (as on right) and mean = median
= mode
[Figure: a normal distribution, labeled Mean, Median, Mode]
A biologist, a chemist, and a statistician are out hunting. The
biologist shoots at a deer and misses 5 ft to the left, the chemist
takes a shot and misses 5 ft to the right, and the statistician
yells "We got 'em!"
Measures of Variation

Although Measures of Center are informative, as they describe how
scores are centered in the distribution, the mean, median, and mode
alone do not provide the best possible description of a sample
(distribution). For example, think of samples of two populations, X
and Y. Both X and Y have the same mean (50) and similar medians (50
and 48, respectively). But would you call those two datasets very
similar? As you can see, Measures of Center alone are not
sufficient to clearly describe a data set. Measures of variation
provide additional critical information about your data.
Specifically, the degree to which
information about your data. Specifically, the degree to which
individual scores are clustered about or deviate from the average
value in a distribution. In biology, everything varies, such as the
seed sizes of the two species on right. So it is important to
always report some measure of variation along with your sample mean
when describing your data.

Range

Range is the simplest measure of variability. It describes how much
the population is spread around the mean. It is the difference
between the highest and lowest score in a distribution:

Range = maximum value − minimum value

Although easy to compute, range is based solely on the two most
extreme scores in the distribution, and thus it is susceptible to
much fluctuation. For instance, in frog Pop A, the range is 3.4 cm.
However, if we caught one additional frog that measures 11.5 cm,
the range would jump to 6.2 cm! Therefore, the range is not often
a reliable measurement of variability.
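The sensitivity of range to a single extreme score is easy to see in a couple of lines of Python. The frog lengths here are hypothetical values chosen to match the 3.4 cm range quoted in the text; the real dataset may differ.

```python
# Hypothetical Pop A frog lengths in cm (illustrative, not the real dataset)
pop_a = [5.3, 6.5, 7.7, 8.7, 8.7]
print(round(max(pop_a) - min(pop_a), 1))  # 3.4

# One extreme newcomer inflates the range dramatically:
pop_a_plus = pop_a + [11.5]
print(round(max(pop_a_plus) - min(pop_a_plus), 1))  # 6.2
```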
Variance (s²) - Variance measures, roughly, the average distance of
each data point from the mean. A natural first step is to look at
the deviations themselves:

Σ(Xi − X̄)

However, simply summing the deviations will result in a value of 0,
because values below the mean (negative) cancel out those above the
mean (positive). To get around this problem, variance is based on
squared deviations of scores about the mean:

Σ(Xi − X̄)²

Squaring the scores removes the positive/negative signs.
If we had a much larger sample size (say 100 frogs instead of just
5), our summed squared deviations would be expected to rise due
simply to sampling effort (we would have caught larger and smaller
frogs). To control for this, the sum of the squared deviations is
divided by the sample size minus one (n − 1). The result is,
roughly, the average of the squared deviations. This is the
variance:

s² = Σ(Xi − X̄)² / (n − 1)
The variance in Pop A = 2.17 and in Pop B = 3.46. This means
that the lengths of frogs in Pop B vary more than those in Pop
A.
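Python's statistics module implements exactly this sample formula (squared deviations divided by n − 1). The dataset below is a hypothetical Pop A sample, chosen only to be consistent with the variance of 2.17 quoted in the text.

```python
import statistics

# Hypothetical Pop A frog lengths in cm (illustrative, not the real dataset)
pop_a = [5.3, 6.5, 7.7, 8.7, 8.7]

# statistics.variance divides the summed squared deviations by (n - 1)
print(round(statistics.variance(pop_a), 2))  # 2.17
```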
The two samples mentioned above:

X: 49, 50, 51
Y: 2, 48, 100

Note: the symbol for variance is s². So if s² = 9, variance equals
9. You do not need to take the square root. The same goes for other
symbols in this lab (e.g., X²).
Standard deviation (s)

Standard deviation is a measure of variability expressed in the
same units as the data being measured. It is calculated by taking
the square root of the variance. (Variance is a measure in squared
units, and so has little meaning with respect to the data's units.)

s = √s²

The standard deviation for Pop A is 1.47 cm. Often the mean is
written with the standard deviation to show the variability of the
data; for instance, frogs in Pop A had an average length of 7.4 +/-
1.5 cm.
Standard error (se)

Standard error is the square root of the variance divided by the
sample size:

se = √(s²/n)

Standard error takes sample size into account, unlike standard
deviation: if we sampled 100 frogs, we might expect to understand
the true nature of the population better than if we just measured 5
frogs, and the standard error shrinks accordingly. The standard
error of our frog Pop A is 0.66.
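Both quantities follow directly from the variance. The sample below is hypothetical, chosen only to be consistent with the s = 1.47 cm and se = 0.66 reported in the text.

```python
import math
import statistics

# Hypothetical Pop A frog lengths in cm (illustrative, not the real dataset)
pop_a = [5.3, 6.5, 7.7, 8.7, 8.7]

s2 = statistics.variance(pop_a)    # sample variance (n - 1 denominator)
s = math.sqrt(s2)                  # standard deviation, same units as data (cm)
se = math.sqrt(s2 / len(pop_a))    # standard error = sqrt(s^2 / n)

print(round(s, 2), round(se, 2))  # 1.47 0.66
```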
Data are graphed using bars that show the mean and some measurement
of variance around that mean, such as standard deviation or
standard error. You should always indicate which is used. The graph
on the right is mean +/- standard error.

WHAT ELSE CAN WE DO WITH STATISTICS?
Now that you have an understanding of how to use Descriptive
Statistics to characterize a population and analyze its
distribution we can begin to use other Statistical techniques to
help us answer scientific questions. Science experiments are
usually designed to determine the effect of some variable on a
group by changing the variable for one group and comparing the
effects of this change to an unchanged group (control group). For
instance we might like to know if a certain drug actually helps
people get better. We could then set up an experiment and compare
patients on the drug to patients on a placebo (a sugar pill or
otherwise neutral medication). In addition, scientists often wonder
whether populations might differ with respect to certain
characteristics and if so what factors account for those
differences. Here too, we can use Statistics to determine whether
differences really exist.
So let's keep using our frog populations A and B to examine this
further.
We might start by asking:
Are Pop A frogs smaller than Pop B frogs? Based on our knowledge
of their habitats we might hypothesize that Pop A frogs are smaller
than Pop B frogs.
However, it will be impossible to collect ALL frogs from each pond
(and probably bad for the populations' survival), so we must rely
on statistical analysis of samples collected in each pond. Whenever
we rely on samples to answer questions about populations, we must
use statistics to decipher how different they are. Most of the
questions we will be asking in this lab pertain to differences
between samples. We often hypothesize that the populations are
different because of some variable (otherwise we wouldn't be
interested in them). To determine if they are different enough to
be meaningful, our statistical methods need something to compare
to. The default for statistical tests is that there is no
difference. We call this the Null Hypothesis. A null hypothesis
(H0) states that there is no difference among the groups you are
comparing, or no effect of a variable on a system. In this example,
your Null Hypothesis would be "There is no difference in body size
between Pop A and Pop B."
If the factor has a big enough effect, the samples will be
different statistically. We can state this as "Pop A frogs are
smaller than Pop B frogs." This is termed our Alternative
Hypothesis. You can formalize an alternative hypothesis, but the
statistics are testing the Null Hypothesis. After you collect data,
statistics will allow you to either Accept or Reject this Null
Hypothesis. If you reject the null, your evidence may point to the
alternative as the cause, but you did not prove that the variable
you measured was the cause. There could have been some unknown
thing happening too. When using statistics it is NOT POSSIBLE to
prove anything with 100% certainty, so you can't prove that one
population is larger than the other, BUT statistics gives you the
tools to DISPROVE that they are the same (disprove your Null
Hypothesis). Think about this statement: "All male Cardinals are
red." Is it even possible to prove that statement correct? I think
not! But you can easily disprove that statement when you photograph
the first male Cardinal that is not red!

Framing your hypothesis in the form of a Null Hypothesis gives you
the ability to statistically Accept or Reject the hypothesis. If
you reject the Null, then as a scientist you can begin to explain
why you think the groups are different.
In the following sections, four statistical tests will be
introduced: t-test, ANOVA, regression, and chi-square test. Using
simple math (trust me: if you can add, subtract, multiply, and
divide, you can do statistics! And if you can't, you can use a
computer and still do statistics), each of these four statistical
tests allows us to make specific kinds of comparisons with our
data, but more on this later. Each test utilizes the data we
collect and computes a number called a Calculated Test Statistic
(each test has its own CTS). A Calculated Test Statistic is a
single number that quantifies (or represents) the difference among
the groups being compared, based on their sample size, total value,
and/or mean and variance. For each statistical test, we compare our
CTS to a theoretical critical value (do not worry about how these
theoretical values are computed; it is wizardry! No one really
knows). Based on that comparison, you will determine whether to
accept or reject your Null Hypothesis. Regardless of how
overwhelmingly your data may show that the frog populations are
different, there is always a risk of being wrong when you Reject a
Null Hypothesis. That risk is given with each statistical analysis
that you perform in the form of a p value. So, let's start there.
What is the p value?

Probability of significance (p value) - The p value represents the
likelihood that your results are simply due to random chance and do
not represent something biologically real. So obviously, a low p
value is best!

In most areas of science, we have agreed that a p value equal to or
less than 0.05 is scientifically significant, and therefore that is
the level at which we can confidently REJECT a null hypothesis.
That is, there is less than a 5% chance that the result you
obtained from your experiments is random. If you did this
experiment with different subjects 100 times, you should get
similar results 95 times. So again, if the p value for a
statistical analysis is 0.05, you can confidently reject your Null
Hypothesis and shout it from the rooftop: "I reject my Null
Hypothesis; the frog populations are truly different in body size."
But are you 100% sure that the populations in these 2 ponds are
really different? No! But you are 95% sure, and that is enough for
us to make that statement.

A
higher value of the calculated test statistic results in a lower p
value. The exact relationship between a calculated test statistic
and p is usually complicated and cannot generally be calculated
with a simple formula. But think about it, if the CTS represents
the degree of difference among the groups you are comparing, then
the greater the number, the lower the probability that those
differences are random. In contrast, a low CTS shows us that the
differences among groups are small and likely not biologically
real, so your p value will be larger. The relationship also
includes the number of degrees of freedom (df). Degrees of freedom
are an integer representing the number of independent pieces of
information used to estimate a statistical parameter. They are
related to the sample size and the number of classes, categories,
or groups.
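One way to build intuition for the 0.05 cutoff is a small simulation, sketched below as a rough illustration rather than part of the lab. Both samples in each trial are drawn from the same population, so the null hypothesis is true by construction; the calculated t should still exceed the critical value for df = 8 (about 2.306) in roughly 5% of trials, purely by chance. The sample size, distribution, and seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(1)     # fixed seed so the run is repeatable
n = 5              # individuals per sample (arbitrary)
trials = 10_000
t_crit = 2.306     # critical t at p = 0.05 for df = (5-1) + (5-1) = 8

false_alarms = 0
for _ in range(trials):
    # Both samples come from the SAME population: the null is true.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    t = abs(statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        statistics.variance(a) / n + statistics.variance(b) / n)
    if t > t_crit:
        false_alarms += 1

# Roughly 5% of null experiments "look significant" by chance.
print(false_alarms / trials)
```

In other words, the 5% risk of wrongly rejecting a true null hypothesis is built into the cutoff itself.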
If our calculated test statistic is high enough and our p-value
is low enough, we can conclude that there is indeed a difference
between our samples. In science, we say that the means are
significantly
different. Since the word significant is commonly used, care
must be taken when writing in science not to use it in any other
sense. To say something is significant implies that the proper
stats have been done and a difference was found. To say two groups
are different indicates that they are significantly different.
Types of data

There are two types of data. Which type you collect determines
which test you use to analyze them:

Continuous data (quantitative) - data that can take on many
different values: in theory, any value between the lowest and
highest points on the measurement scale, e.g., (1, 2, 3) or (4.011,
4.012, 4.013). Use a t-test or ANOVA to analyze continuous data.

Discrete data (qualitative) - categorical data that have a limited
number of values, e.g., (yes/no), gender (male/female), or college
class (freshman/sophomore/junior/senior). We will use a chi-square
test to analyze discrete data.
In order to correctly use the following tests, a few things are
assumed. If these assumptions are not met, you should transform
your data into a format that is acceptable, or choose a test that
does not require these assumptions to be met (which we will not go
over in this class unless we have to).

Assumptions of Parametric Statistics:
1. Variables are continuous (or nearly so, i.e., there are a lot of
possible values).
2. Samples are collected randomly.
3. Observations (data) are independent of each other. The members
of each group are assumed to have nothing in common except the
desired treatment.
4. Within-group variance is equal across groups. (Use an F-test to
test for differences among variances.)
5. Data must be normally distributed.

If these assumptions are not met, see your instructor.
STATISTICAL TESTS
All the statistical tests described here (except regression) ask the question: Is there a difference between these groups? They then test that question mathematically.
I. t-test of means
o Used with continuous data, one variable.
o Looks for differences in means of 2 groups.
Example of when you would use a t-test:
Question: Is there a difference in height between male and female giraffes?
Variable of interest: Height (continuous data)
Groups: Male and Female (2 groups)
Comparison: Means (continuous data with variation)
Null Hypothesis: Mean height of Males = Mean height of Females
Alternative Hypothesis: Mean height of Males ≠ Mean height of Females
The heart of the t-test is the calculation of a statistic known as the "t value". The formula for the t value associated with two sample means is:

|t| = |X̄1 − X̄2| / sqrt(s1²/n1 + s2²/n2)

Where:
X̄1 = the mean of group 1, s1² = the variance of group 1, n1 = the sample size of group 1
X̄2 = the mean of group 2, s2² = the variance of group 2, n2 = the sample size of group 2

For the t-test, the number of degrees of freedom is df = (n1 − 1) + (n2 − 1). By convention, the sample with the larger mean is designated sample 1 to avoid a negative value of t, but some statistical software does not do this and thus produces negative values for t. In that case, simply take the absolute value of the listed t (|t|). Because of its complexity, the calculation of p is not easily done by hand. Rather, the calculated t value is compared to a table of critical values, which lists the value that the calculated statistic must exceed in order for p to be less than 0.05 at the appropriate number of degrees of freedom (SEE TABLE 1). If the calculated t value is greater than the critical t value in the table, then we REJECT THE NULL HYPOTHESIS: the means are significantly different.
Explanation of equation: The numerator evaluates the size of the
difference between the two sample means. A greater difference in
the means in the numerator produces a larger value of t. The
denominator is actually the formula for the standard error of the
difference between the means. Just as was the case for the standard
error of a single mean, the size of this standard error depends on
how many measurements we made (n) and how variable the measurements
are (the standard deviation, s). When the measurements are more
variable (i.e. a bigger s), our samples are less likely to be
representative, our standard error is bigger, and the calculated t
is smaller. When our sample size increases (i.e. a bigger n), we
are more confident that our sample is representative because the
variation in individual measurements tends to cancel out, leading to a smaller standard error and a larger value of t. Thus you can see that the formula for t includes all of the factors that affect our ability to assess whether differences are real or have resulted from chance, unrepresentative sampling: the size of the difference, the variability in the population, and the sample size of our experiment.
Example of t-test: The average age (in days) at which individuals of Daphnia longispina, a crustacean, begin reproduction was measured in two populations.

Question: Do the populations begin reproduction at different ages?
Variable of interest: Age (in days)
Groups: Population I and II
Comparison: Means (continuous data with variation)
Null Hypothesis: Mean age of reproduction in Pop I = Mean age in Pop II
Alternative Hypothesis: Mean age of reproduction in Pop I ≠ Mean age in Pop II

Population:             I        II
Individual ages (X):    7.2      8.8
                        7.1      7.5
                        9.1      7.7
                        7.2      7.6
                        7.3      7.4
                        7.2      6.7
                        7.5      7.2
Sum (X):                52.6     52.9
Sample size (n):        7        7
Mean (X̄):              7.5143   7.5571
Variance (s²):          0.5047   0.4095
Plug the data into the equation for t:

|t| = |X̄1 − X̄2| / sqrt(s1²/n1 + s2²/n2)
|t| = |7.5143 − 7.5571| / sqrt(0.5047/7 + 0.4095/7)
    = 0.0428 / sqrt(0.1306)
    = 0.0428 / 0.3613
    = 0.1184

df = (n1 − 1) + (n2 − 1) = (7 − 1) + (7 − 1) = 12

The critical value from the table at the desired 0.05 p value and 12 degrees of freedom is 2.179. Since our t, which equals 0.1184, is not greater than 2.179, we must ACCEPT our Null Hypothesis: the means of the two populations are not found to be different, and thus we cannot say that these populations reach reproductive maturity at different ages. We have no evidence to the contrary! Here is how this might be graphed:
[Figure: Age at reproduction in Daphnia longispina. Variance is depicted as standard error. Notice that the error bars overlap, another indication that the means are not statistically different.]
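The calculation above can be reproduced in a few lines of Python using only the standard library. This is a sketch of the lab's two-sample t formula applied to the Daphnia data, not a full statistics package; the p-value still comes from Table 1.

```python
import math
import statistics

# Daphnia longispina ages at first reproduction (days), from the example above
pop1 = [7.2, 7.1, 9.1, 7.2, 7.3, 7.2, 7.5]
pop2 = [8.8, 7.5, 7.7, 7.6, 7.4, 6.7, 7.2]

mean1, mean2 = statistics.mean(pop1), statistics.mean(pop2)
var1, var2 = statistics.variance(pop1), statistics.variance(pop2)  # s², n - 1 denominator
n1, n2 = len(pop1), len(pop2)

# |t| = |X1 - X2| / sqrt(s1²/n1 + s2²/n2)
t = abs(mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
df = (n1 - 1) + (n2 - 1)

print(f"|t| = {t:.4f} with df = {df}")  # about 0.12 with df = 12
# Compare to the critical value 2.179 (Table 1, p = 0.05, df = 12):
# since 0.12 < 2.179, we fail to reject the null hypothesis.
```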
II. Analysis of variance (ANOVA)
o Used with continuous data, one or more variables.
o Looks for differences in means among 3 or more groups.
Example of when you would use an ANOVA:
Question: Is there a difference in body size (as measured by weight) among frogs from 4 different ponds?
Variable of interest: Weight (continuous data)
Groups: Population I, II, III, IV (more than 2 groups)
Comparison: Means (continuous data with variation)
Null Hypothesis: There is no difference in weight among the four ponds
Alternative Hypothesis: There is a difference in weight among the four ponds
ANOVA will let you simultaneously compare the means of 3 or more
groups.
Although an ANOVA can be performed by hand, we will not take the
time to do that in this lab. Instead, we will use a computer
program to perform the messy parts. You can do that here:
http://www.physics.csbsju.edu/stats/anova.html
t-tests tell you if there is a significant difference between two groups. If there is, you can easily look at the two means and tell which one is bigger. An ANOVA tells you if there is a difference among more than two groups, but in this situation you cannot easily tell whether treatment A differs from B but not from C, etc. If you want to know where among the four ponds there is a difference, you must perform a follow-up test, called a post hoc test (such as the Tukey-Kramer test), to look for differences among the means. For instance, if we had measured a third population of Daphnia, we might get results like the following:
ANOVA table:

Source of variation   Sum of Squares   df   Mean squares   F
Between               158.8             2   77.40          13.12
Error                  70.8            12    5.900
Total                 225.6            14
The probability of this result, assuming the null hypothesis, is 0.001. Therefore we can REJECT the NULL and say that there is a difference among these populations. The graph on the right shows the mean of each population with its standard error. If the error bars of two populations overlap, those populations are not different (also indicated by the letters).
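A one-way ANOVA can also be sketched by hand in Python. The three groups below are made-up data for illustration (they are not the Daphnia or frog measurements); the point is how the total variation is partitioned into a between-group part and a within-group (error) part.

```python
import statistics

# Three hypothetical groups (made-up data for illustration)
groups = [
    [4.1, 4.5, 4.3, 4.7, 4.4],
    [5.6, 5.2, 5.9, 5.4, 5.8],
    [4.9, 5.1, 4.8, 5.3, 5.0],
]

all_values = [x for g in groups for x in g]
grand_mean = statistics.mean(all_values)
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group sum of squares: how far each group mean sits from the grand mean
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
# Within-group (error) sum of squares: spread of each value around its own group mean
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

df_between = k - 1       # 2
df_within = n_total - k  # 12
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"F = {f_stat:.2f} with df = ({df_between}, {df_within})")
```

A large F (relative to the critical F value for those degrees of freedom) means the between-group variation dominates the within-group variation, so at least one mean differs; a post hoc test such as Tukey-Kramer would then locate which means differ.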
Why not just use several t-tests? t-tests should not be used for comparing means of more than two groups because each comparison has its own error (probability of getting a significant result due to chance), and this error adds up with each comparison. In other words, if you compared each possible pair of samples with a p-value cutoff of 0.05, each comparison would carry a 5% chance of finding a difference purely by chance. The more samples, the more likely this is. For example, if we had 7 different groups, there would be 21 pairs, and we would expect to see a difference in at least one of them simply by chance.
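This multiple-comparison problem is easy to quantify. Under the null hypothesis each test has a 5% false-positive rate, and across 21 pairwise tests the chance of at least one false positive balloons. A small sketch (treating the comparisons as independent, which is a simplifying assumption):

```python
from math import comb

alpha = 0.05          # per-comparison false-positive rate
k = 7                 # number of groups
n_pairs = comb(k, 2)  # number of pairwise t-tests: 21

# Probability of at least one false positive across all pairs,
# assuming the comparisons are independent (a simplification)
familywise = 1 - (1 - alpha) ** n_pairs

print(f"{n_pairs} pairwise tests -> {familywise:.0%} chance of at least one false positive")
```

With 7 groups, the chance of at least one spurious "significant" result is roughly two in three, which is why ANOVA plus a post hoc test is used instead.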
III. Regression/Correlation
o Used with continuous data, two variables.
o Tests the relationship of two variables.
Example of when you would use a Regression:
Question: Is height related to shoe size?
Variables of interest: Height and Shoe size (continuous data)
Groups: The group of individuals you measured (only 1 group, but 2 variables)
Comparison: Each individual's measurements (continuous data with variation)
Null Hypothesis: There is no relationship between height and shoe size.
Alternative Hypothesis: Height and shoe size are related.
Regressions and correlations are very similar, and for the purposes of this class we may treat them the same, but technically:

Regression - Tests the relationship of one variable to another by expressing one as a linear (or more complex) function of the other. In regression, one variable is the cause and the other is the effect. For example, people who predominantly eat fatty foods weigh more (the more fatty foods one eats, the heavier he/she will be).
Correlation - Tests the degree to which two variables vary together. Both variables change together, not necessarily because one causes the other; often both are affected by a third variable. A recent study found a correlation between the amount of chocolate consumed by a country and the number of Nobel laureates it produces. Is this due to the chocolate? Maybe, but most likely the two are affected by a third cause.

CORRELATION DOES NOT EQUAL CAUSATION!!!
A function is a mathematical relationship enabling us to predict what values of variable Y correspond to given values of variable X. Such a relationship is written as Y = f(X). You may recognize this as Y = bX. In the simplest regression, Y = X; therefore, for example, when X = 25, we can predict that Y will also equal 25. Fitting a line through this relationship produces something like the figure on the right. Here, X is the independent variable (the cause, free to vary) and Y is the dependent variable (the effect, due to the cause).
The following figure shows a functional relationship (the
variables are not perfectly correlated), in which for every
increase of 7 units of X, there will be a 1 unit increase in Y.
In nature, there is variation in the relationship between each pair of X and Y values. For example, the vertical lines connecting each data point to the best-fit line in the figure on the right measure this variation. And whenever there is variation, we must do statistics to see whether what we found is random chance or a significant relationship. The test for regression is similar to that for ANOVA,
but the math involved is beyond this class. There are two things
you need to know to understand a regression: the p-value (see
above) and the r2 value. The short and skimpy explanation is that
the p-value, as always, tells you if there is a significant
relationship between your two variables. The r2 value tells you how
good a predictor that relationship is. It measures the variation of
each data point from the best fit line (that is, how far away from
the line each
dot is; see the figure above). If the relationship is a near-perfect predictor, the data will line up nicely and you will get a high r². In the figure to the left, p < 0.001 and r² = 0.94. This means that the independent variable explains 94% of the variation in the dependent variable; it is a really good predictor. By the way, this is a positive relationship: as X increases, so does Y.
But in the figure to the right, p = 0.02, so the line is
significant, but r2 = 0.22. X is not as good a predictor of Y (it
only explains 22% of the variation). This is an example of a
negative relationship: as X increases, Y decreases. You can think
of r2 as a measurement of the scatter of the data. How scattered is
the data? In the first figure, it is less scattered than in the
second figure. Here is a general guideline for r² values: [figure not reproduced].

Pair of data   Independent Variable (X)   Dependent Variable (Y)
1              20                         30
2              21                         33
3              27                         40
4              29                         39
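The four (X, Y) pairs in the table above can be run through a simple least-squares fit. The sketch below computes the slope, intercept, and r² by hand with Python's standard library; the p-value requires more machinery, so it is omitted here.

```python
import statistics

# Data pairs from the table above
x = [20, 21, 27, 29]
y = [30, 33, 40, 39]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)

# Sums of squares and cross-products around the means
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx                    # b in Y = a + bX
intercept = mean_y - slope * mean_x  # a
r_squared = sxy ** 2 / (sxx * syy)   # fraction of variation in Y explained by X

print(f"Y = {intercept:.2f} + {slope:.2f}X, r^2 = {r_squared:.2f}")
```

For these pairs, X explains roughly 90% of the variation in Y, and the slope is positive, matching the intuition from the figures above.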
IV. Chi square test (X²)
o Used with discrete data, one variable.
o Compares observed frequencies of an experiment to expected frequencies. Either/or, yes/no, proportions.
o Since the data are discrete, there is no real variation. Also, a chi square result is usually not graphed because there are only two numbers; these can be listed in the text or in a table.
Example of when to use a Chi Square: You made a bet with a friend based on coin tosses. You picked heads every time, but lost best of 10 by 8 to 2. You think the coin might have been weighted and your friend cheated.

Question: Is 8:2 different from what we would expect with a random toss (5:5)?
Variable of interest: The ratio of heads to tails
Groups: Heads and Tails
Comparison: Observed vs. expected frequencies (discrete data)
Null Hypothesis: Observed frequency = Expected frequency (8:2 is not significantly different from 5:5)
Alternative Hypothesis: Observed frequency ≠ Expected frequency (8:2 is significantly different from 5:5)

With Chi Square, you can statistically compare how many times heads and tails came up with this coin (observed frequencies) to what you would expect if the coin is not rigged (expected frequencies).

How to do a Chi Square test
1. Collect your data! You flipped the coin 10 times and observed the following (these are your Observed Frequencies):
o 2 heads
o 8 tails
2. Determine the expected frequencies. If the coin is not rigged, we would expect an even number of heads and tails, a 50:50 ratio. Since we flipped it 10 times, our expected frequencies are:
o 5 heads
o 5 tails
3. Calculate your chi square test value. Make a chi square table like the one below to set up your calculations.

X² = Σ (observed − expected)² / expected
4. Compare X² and degrees of freedom in Table 2 to find the p value.
Degrees of freedom (df) for X² = number of categories − 1
o In our example we had two potential outcomes, Heads or Tails, so df = 2 − 1 = 1.
o If the calculated X² value is greater than the theoretical (critical) value given under p = 0.05 in the Chi Square Distribution table (see Table 2), we reject the null hypothesis and conclude that the coin is rigged. If our calculated chi square is less than the critical chi square, we must accept the null and conclude that the coin is not rigged (it did not behave differently than what you expected).
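Steps 1-4 above can be sketched in a few lines of Python. This reproduces the coin-toss example exactly; the critical value 3.84 comes from Table 2 (df = 1, p = 0.05).

```python
# Observed and expected frequencies for the coin-toss example
observed = [2, 8]  # heads, tails
expected = [5, 5]  # a fair coin flipped 10 times

# Chi square: sum of (observed - expected)^2 / expected over all categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # number of categories minus one

print(f"X^2 = {chi_square} with df = {df}")
# 3.6 < 3.84 (critical value, Table 2), so we accept the null: the coin is fair.
```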
CHI SQUARE TABLE

Step:     1           2          3          4        5          6
          Observed    Expected   Expected   o − e    (o − e)²   (o − e)² / e
          freq. (o)   ratio      freq. (e)
Heads     2           1/2        5          −3       9          1.8
Tails     8           1/2        5          +3       9          1.8
Sum (n)   10          1          10         0                   X² = 3.6

p > 0.05
Since our p value is greater than 0.05, we must ACCEPT the NULL.
This means that there is no difference between our observed
frequency and the expectation of randomness. We can trust our friend's coin, and we have lost the bet.
In the pages below, you will find a quick flow chart to
determine which statistical test to use and two tables that allow
you to convert a test statistic (t and X2) to a p value. Now you
have many tools in your statistics toolbox. Go out and DO SCIENCE
TO IT!
Remember, the symbol for chi square is X². So if X² = 9, your chi square value is 9; you do not need to take the square root. The same goes for other squared symbols in this lab (e.g., s²).
[Flow chart for choosing a statistical test appeared here as a figure.]
Table 1. Critical Values for t-tests

Two-tailed p values (columns to the left of 0.05: means are NOT significantly different; 0.05 and beyond: means ARE significantly different):

df     1.00   0.50   0.40   0.30   0.20   0.10   0.05   0.02   0.01   0.002  0.001
1      0.000  1.000  1.376  1.963  3.078  6.314  12.71  31.82  63.66  318.3  636.6
2      0.000  0.816  1.061  1.386  1.886  2.920  4.303  6.965  9.925  22.32  31.59
3      0.000  0.765  0.978  1.250  1.638  2.353  3.182  4.541  5.841  10.21  12.92
4      0.000  0.741  0.941  1.190  1.533  2.132  2.776  3.747  4.604  7.173  8.610
5      0.000  0.727  0.920  1.156  1.476  2.015  2.571  3.365  4.032  5.893  6.869
6      0.000  0.718  0.906  1.134  1.440  1.943  2.447  3.143  3.707  5.208  5.959
7      0.000  0.711  0.896  1.119  1.415  1.895  2.365  2.998  3.499  4.785  5.408
8      0.000  0.706  0.889  1.108  1.397  1.860  2.306  2.896  3.355  4.501  5.041
9      0.000  0.703  0.883  1.100  1.383  1.833  2.262  2.821  3.250  4.297  4.781
10     0.000  0.700  0.879  1.093  1.372  1.812  2.228  2.764  3.169  4.144  4.587
11     0.000  0.697  0.876  1.088  1.363  1.796  2.201  2.718  3.106  4.025  4.437
12     0.000  0.695  0.873  1.083  1.356  1.782  2.179  2.681  3.055  3.930  4.318
13     0.000  0.694  0.870  1.079  1.350  1.771  2.160  2.650  3.012  3.852  4.221
14     0.000  0.692  0.868  1.076  1.345  1.761  2.145  2.624  2.977  3.787  4.140
15     0.000  0.691  0.866  1.074  1.341  1.753  2.131  2.602  2.947  3.733  4.073
16     0.000  0.690  0.865  1.071  1.337  1.746  2.120  2.583  2.921  3.686  4.015
17     0.000  0.689  0.863  1.069  1.333  1.740  2.110  2.567  2.898  3.646  3.965
18     0.000  0.688  0.862  1.067  1.330  1.734  2.101  2.552  2.878  3.610  3.922
19     0.000  0.688  0.861  1.066  1.328  1.729  2.093  2.539  2.861  3.579  3.883
20     0.000  0.687  0.860  1.064  1.325  1.725  2.086  2.528  2.845  3.552  3.850
21     0.000  0.686  0.859  1.063  1.323  1.721  2.080  2.518  2.831  3.527  3.819
22     0.000  0.686  0.858  1.061  1.321  1.717  2.074  2.508  2.819  3.505  3.792
23     0.000  0.685  0.858  1.060  1.319  1.714  2.069  2.500  2.807  3.485  3.768
24     0.000  0.685  0.857  1.059  1.318  1.711  2.064  2.492  2.797  3.467  3.745
25     0.000  0.684  0.856  1.058  1.316  1.708  2.060  2.485  2.787  3.450  3.725
26     0.000  0.684  0.856  1.058  1.315  1.706  2.056  2.479  2.779  3.435  3.707
27     0.000  0.684  0.855  1.057  1.314  1.703  2.052  2.473  2.771  3.421  3.690
28     0.000  0.683  0.855  1.056  1.313  1.701  2.048  2.467  2.763  3.408  3.674
29     0.000  0.683  0.854  1.055  1.311  1.699  2.045  2.462  2.756  3.396  3.659
30     0.000  0.683  0.854  1.055  1.310  1.697  2.042  2.457  2.750  3.385  3.646
40     0.000  0.681  0.851  1.050  1.303  1.684  2.021  2.423  2.704  3.307  3.551
60     0.000  0.679  0.848  1.045  1.296  1.671  2.000  2.390  2.660  3.232  3.460
80     0.000  0.678  0.846  1.043  1.292  1.664  1.990  2.374  2.639  3.195  3.416
100    0.000  0.677  0.845  1.042  1.290  1.660  1.984  2.364  2.626  3.174  3.390

One-tailed p values:
       0.50   0.25   0.20   0.15   0.10   0.05   0.025  0.01   0.005  0.001  0.0005

df = (n1 − 1) + (n2 − 1)
Table 2. Critical Values for Chi-Square tests

(Columns to the left of 0.05: expected and observed are NOT significantly different; 0.05 and beyond: expected and observed ARE significantly different.)

df   p: 0.99   0.95    0.90    0.75   0.50   0.25    0.10    0.05    0.025   0.01
1       0.0002  0.003   0.015   0.10   0.45   1.32    2.70    3.84    5.02    6.63
2       0.0201  0.102   0.210   0.57   1.38   2.77    4.60    5.99    7.37    9.21
3       0.1148  0.351   0.584   1.21   2.36   4.10    6.25    7.81    9.34    11.34
4       0.2971  0.710   1.063   1.92   3.35   5.38    7.77    9.48    11.14   13.27
5       0.5543  1.145   1.610   2.67   4.35   6.62    9.23    11.07   12.83   15.08
6       0.8721  1.635   2.204   3.45   5.34   7.84    10.64   12.59   14.44   16.81
7       1.2390  2.167   2.833   4.25   6.34   9.03    12.01   14.06   16.01   18.47
8       1.6465  2.732   3.489   5.07   7.34   10.21   13.36   15.50   17.53   20.09
9       2.0879  3.325   4.168   5.89   8.34   11.38   14.63   16.91   19.02   21.66
10      2.5582  3.940   4.865   6.73   9.34   12.54   15.98   18.30   20.48   23.20

X² = Σ (observed − expected)² / expected

Degrees of freedom equals the number of groups being compared minus one (df = n − 1). The table gives critical values that, at the given degrees of freedom, indicate the given p-values.
1108K Lab 1: Statistics Postlab
Name___________________________________

1. Let's say that one time you drank a soda before a test and did better on that test than you ever have before. Design an experiment using the other students in this class to determine whether drinking one of the following: soda, milk, or water, improves performance on a test over the other drinks. You DO NOT have to perform this experiment, just think through the design. DO NOT MAKE UP DATA. Provide the following:
a. Question
b. Null Hypothesis
c. Alternate Hypothesis
d. Experimental design. Make sure to keep all variables the same except the one of interest. Also remember that you need replication to tease apart any variation.
e. Type of data you would collect. What will you measure
exactly?
f. Statistical test you need to analyze the results.
2. Why is standard error always smaller than standard
deviation?
3. What does a p-value of 0.03 mean? Be specific in your
interpretation without making up data.
4. Is it more likely that the data depicted on the right has a
high r2 or a low r2?
5. What does that mean?
6. If you increase the sample size, what happens to the critical
value (the number you have to reach to find a significant
difference) of a t-test?
7. Using a t-test, determine if these two groups are
statistically different. Be sure to show your t-test work (filled
in equation) and report the t-value and p-value. You are trying to
determine whether breastfeeding or bottle-feeding is the best
method to speed up the growth of a human baby. You surveyed 15
women who breastfed their babies. Their babies gained 17 pounds on
average (variance = 5) over a 3-week period. You surveyed 20 women
with similar age babies who bottle-fed their child. Their babies
gained 12 pounds on average (variance = 4) over a 3-week
period.
a. Question
b. Null Hypothesis
c. Alternate Hypothesis
d. What are your results and conclusions? (Do you accept or
reject your hypothesis?)
8. For this part, I want you to perform a t-test on data you
collect yourself. Nothing too complicated (no need to perform an
experiment), you just need to collect continuous, quantitative data
of two groups and test to see if there is a difference between
their means. Analyze this data using a t-test. Provide the
following.
a. Question:
b. Null Hypothesis:
c. Alternate Hypothesis:
d. Your data:

e. Complete the table:
                     Group 1    Group 2
Mean
Median
Mode
Variance
Standard Deviation
Standard Error
f. Your calculations for the t-test
g. Your conclusions, including the t statistic, p-value, and
interpretation of this p-value.
h. Attach a graph of your results using Excel. Show the means
and the standard error. Label your axes.
Group 1            Group 2
Individual Data    Individual Data