Chapter 6: Basics of Experimentation
Experiment: a test designed to arrive at a causal explanation (Cook & Campbell, 1979)
Mill (1843), joint method of agreement and difference: causation can be inferred if some result, X, follows an event, A, if A and X vary together, and if it can be shown that event A produces result X
If A occurs, then so will X, and if A does not occur, then neither will X
If event B occurs, then X does not occur
Chapter 6 continued:
Tip-of-the-Tongue (TOT) example: X = correct resolution of the TOT state, A = presenting letter initials, B = repeating the question, C = presenting a picture of the celebrity
Subjects were instructed to name celebrities, and 10.5 instances per subject resulted in TOT states (the name of the celebrity was on the “tip of the subject’s tongue,” but they could not actually recall it)
Subjects showed significantly better resolution of TOT states with letter initials as a cue than with either repeating the question or presenting a picture of the celebrity
This suggests that memory for celebrities is coded using letter-level orthographic information rather than visual or “data warehouse” related information codes
However, we are not told whether conditions B or C differed significantly from chance; if they are above chance, then this suggests that this type of coding occurs but is less common than orthographic coding
Also, in so-called TOT states, the unresolved cases could actually have been due to subjects not knowing the name of the celebrity
Chapter 6 continued:
Joint Method of Agreement and Difference continued:
Note that in the real world of science, A does not always produce X, and the absence of event A does not always fail to produce X (because science is inductive, or probabilistic, rather than deductive)
Thus, the inductive version of Mill’s joint method of agreement and difference is that event A (presenting letter initials) produces significantly more resolution of X than event B (repeating the question) or event C (presenting a picture of the celebrity)
So, “more” is defined by statistical significance
Statistical significance tests whether two (or more) means differ even when we consider error variance (noise)
This is why we refer to statistics as the “language of science”
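As a rough illustration of how a difference between two condition means is weighed against error variance, the sketch below computes a Welch t statistic in Python. The scores are invented for illustration (they are not data from the TOT study), and the name welch_t is only a hypothetical helper.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Mean difference divided by its standard error (Welch's t statistic)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Invented percent-resolved scores for two cue conditions (illustration only).
initials = [62, 58, 71, 66, 60, 69]   # condition A: letter initials
repeat_q = [41, 47, 39, 50, 44, 43]   # condition B: repeated question

t = welch_t(initials, repeat_q)
print(f"t = {t:.2f}")  # a large |t| means the difference is large relative to noise
```

A full significance test would also compare t with a critical value for the appropriate degrees of freedom; that step is left out here to keep the sketch short.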
Chapter 6 continued:
In many experiments in psychology, you compare an experimental condition with a neutral baseline (e.g., repeating a question in our TOT example, although if this question was coded as a contextual cue along with the memory for the celebrity’s name, then it would not have been a neutral baseline!)
The experimental condition (presenting the celebrity’s initials) should show a significantly larger effect on the DV (percent recall of the celebrity’s name) than the control condition(s)
Experimental control is central to an experiment because it allows the production of a comparison by controlling the occurrence or nonoccurrence of a variable (while holding all other possible causes constant so they cannot affect the outcome)
Control has three components:
Comparison (the control condition is used as a comparison)
Production (levels or values of the IV can be produced)
Constancy (the experimental setting can be controlled by holding certain aspects constant)
Chapter 6 continued:
Advantages of Experimentation:
Using animal models can sometimes save money and is considered to be more ethical by many
E.g., cosmetics are frequently tested on rabbits
But this has been very controversial!
Experimentation has more control than ex post facto research, in which levels of the IV are selected after the fact (selected rather than manipulated with control)
E.g., research on the health consequences of smoking has been almost entirely correlational
The tobacco lobby actually used as a defense at trial that cigarette smokers were more likely to develop lung cancer than non-smokers because smokers were more neurotic, and that it was really the higher levels of neuroticism that were causing the cancer risk!
One cannot rule this out because neuroticism was not controlled
Chapter 6 continued:
Variables in Experimentation:
IVs (independent variables) are manipulated by the experimenter because they are hypothesized to cause changes in the DV
Failure to find an effect of the IV on the DV is termed a “null result”
This can be due to a lack of an effect, an invalid manipulation, or a lack of statistical power
DV (dependent variable): the performance variable observed and recorded by the experimenter. A good DV (e.g., RT or accuracy) should be reliable and should not be overly susceptible to floor or ceiling effects
Floor effect: when it is impossible to do any worse on a task because you are already at the bottom
Ceiling effect: when it is impossible to improve because you are already at perfect performance
CV (control variable): potential IVs that are held constant during an experiment
This is usually because one can only manipulate a small number of variables (usually five or fewer) in any given experiment
If a potential IV that is not manipulated is not controlled, then it can become a confounded variable
Chapter 6 continued:
Review four examples of experiments from the text
Chapter 6 continued:
More than one IV: a typical experiment will manipulate two to four IVs
This is done because it is more efficient: experimental control is typically superior with multiple IVs, and the results can be generalized across a group of IVs rather than just a single IV
Multiple IVs also allow a researcher to examine both main effects (an effect of just one IV in isolation) and interactions (when the effects produced by one IV are not the same across the levels of a second IV)
Interactions allow us to examine joint effects of multiple IVs and add increased precision
An interaction takes precedence over main effects
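To make the distinction between main effects and interactions concrete, here is a small sketch for a 2 x 2 design. The cell means and the level names (a1, a2, b1, b2) are invented for illustration; a real analysis would test these differences with ANOVA rather than just computing them.

```python
# Invented cell means for a 2 x 2 design: IV A (a1, a2) crossed with IV B (b1, b2).
cell_means = {
    ("a1", "b1"): 650.0,
    ("a1", "b2"): 660.0,
    ("a2", "b1"): 700.0,
    ("a2", "b2"): 750.0,
}

def marginal_mean(factor_index, level):
    """Average the cell means for one level of an IV, collapsing over the other IV."""
    values = [m for cell, m in cell_means.items() if cell[factor_index] == level]
    return sum(values) / len(values)

# Main effects: difference between marginal means of one IV.
main_effect_a = marginal_mean(0, "a2") - marginal_mean(0, "a1")
main_effect_b = marginal_mean(1, "b2") - marginal_mean(1, "b1")

# Interaction: does the effect of B differ across the levels of A?
effect_b_at_a1 = cell_means[("a1", "b2")] - cell_means[("a1", "b1")]
effect_b_at_a2 = cell_means[("a2", "b2")] - cell_means[("a2", "b1")]
interaction = effect_b_at_a2 - effect_b_at_a1

print(main_effect_a, main_effect_b, interaction)
```

With these numbers the effect of B is five times larger at a2 than at a1, which is why an interaction qualifies how the main effect of B should be read.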
Chapter 6 continued:
More than one DV: we analyze just one DV at a time in univariate statistics
If we truly wish to analyze two or more DVs at a time, we need a multivariate statistical technique (such as MANOVA)
In a MANOVA, we form a composite DV from multiple DVs
But MANOVAs do not tell us whether the pattern of effects is consistent across DVs
We can use correlations across trial blocks, diffusion models, or entropy/RT models to look at the overall pattern of results across multiple DVs (e.g., RT and errors)
However, these techniques are complicated
Consequently, most experiments in psychology use a single DV and are analyzed using ANOVA
Chapter 6 continued:
Possible sources of experimental error:
Demand characteristics or reactivity: the Hawthorne effect (Homans, 1965)
Deception can be used to prevent demand characteristics
Because subjects do not know what is being tested, they cannot be biased through reactivity
However, if an experimenter uses deception, they typically need to debrief participants after the study
Chapter 6 continued:
External validity of the research procedure:
Representativeness of subjects: the ability to generalize across different participant populations
Are rats really representative of humans?
E.g., rats’ basal ganglia system is probably different from that of humans
Variable representativeness: the ability to generalize across different experimental manipulations
E.g., the relationship of background noise to studying efficiency (do noise and music both impair performance?)
Setting representativeness: the representativeness of the experimental setting (or ecological validity)
Realism is not the same as generalizability, though
Chapter 7: Validity and Reliability in Psychological Research
Validity: the truth of an observation
Types of Validity:
Predictive validity: checking the truth of an observation by comparing it to another criterion that is thought to measure the same thing
We will use SAT I as an example
Criterion: another measurement of behavior that serves as a standard for the measurement in question (e.g., ACT, college freshman GPA)
In predictive validity, the relation between two scores is typically assessed by a statistic termed the correlation coefficient (e.g., Pearson’s product-moment correlation coefficient)
The better the prediction of the observation (e.g., SAT I score predicting college freshman GPA), the greater the predictive validity of the predictor score
However, predictive validity does not define a measure or construct
E.g., we cannot assume that a person with a higher SAT I score than another person is smarter than the other person, because predictive validity does not allow us to do this unless our criterion is that sort of measurement
E.g., an intelligence test score rather than freshman GPA
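The correlation coefficient used to assess predictive validity can be sketched in a few lines. The SAT and GPA values below are invented for illustration, and pearson_r is a hypothetical helper written out by hand so that each step of the formula is visible.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(ss_x * ss_y)

# Invented predictor (SAT) and criterion (freshman GPA) scores for six students.
sat = [1050, 1200, 1310, 980, 1420, 1150]
gpa = [2.9, 3.1, 3.4, 2.6, 3.8, 3.2]

r = pearson_r(sat, gpa)
print(f"r = {r:.3f}")  # the closer r is to 1, the better the predictor tracks the criterion
```

Note that a high r only says the predictor tracks this particular criterion; it says nothing about what construct either score measures.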
Chapter 7 continued:
Types of Validity continued:
Construct Validity: the degree to which the independent and dependent variables accurately reflect or measure what they are intended to measure (Cook & Campbell, 1979; Judd et al., 1991); really, are the names accurate?
In our Stroop experiment from Chapter 1, did our tasks really reflect reading and does scan really reflect reading performance?
Counting the number of digits in a row is probably not a good measure of reading
Extraneous Variables: confounding variables that may be a source of invalidity can threaten construct validity
Reading aloud requires speech production processes that are not required in reading, and Tasks 2 and 3 required counting, which is not the same as reading
Katz et al. (1990) have also claimed that the SAT I is not construct valid
Freedle and Kostin (1994) found that SAT test takers did use the passages to respond, so they found some construct validity
Reactivity and Random Error
Subjects could have been afraid of looking like a poor reader on a Stroop task
Some subjects could have been timed with the second hand on a watch, and others could have been timed with a chronograph (a stopwatch); this could have led to random error in timing precision
Chapter 7 continued:
Construct Validity continued:
We can improve construct validity by using an operational definition (a recipe for specifying how a construct, such as reading, is produced and measured)
This is because operational definitions allow the conditions that produce the concept to be measured and defined
In our Stroop example, reading is reduced to the independent variables that produce it and the dependent variable(s) used to measure it
Protocols (the specification of how the measurement and procedures are to be undertaken) also reduce the risk of construct invalidity because they reduce the likelihood of random error
Circular reasoning is a potential problem when using an operational definition, though
We need to have a method of defining something independent of how we measure it
Some have claimed that the concept of processing resources suffers from this problem (circularity; Navon, 1979)
However, we can use PRP and coactivation methods
Chapter 7 continued:
Construct validity is usually demonstrated using psychometric methods: Factor Analysis
A data reduction method in which you determine which measured variables are related to which constructs
You can also show that your constructs from factor analysis are related in the manner predicted by your theory using causal analysis:
Path analysis or Structural Equation Modeling (or covariance structure modeling)
Item Response Theory (IRT): a mathematical technique for determining which items on a test measure the same construct
Chapter 7 continued:
Types of Validity continued:
External Validity: the extent to which we can generalize our research results (in this setting, measured on this sample) to other settings and other populations or samples
To demonstrate external validity, we need to replicate our initial results in other settings and on different people
Hypertension, gender and race
Our experimental setting needs to be representative of the typical situation (e.g., reading is typically tested using a reading out-loud method in elementary school even though this is not an accurate measure of reading comprehension; it is more of a measure of speech perception or production)
Chapter 7 continued:
Internal Validity: when we can make causal statements about the relationship between IVs and DVs
Specifically, when your IV causes an effect on the DV (are we testing what we claim to be testing? This can be similar to construct validity, though)
Without internal validity, we are not doing science
Internal validity requires good experimental control
This is at odds with external validity because as we increase experimental control, our results become less generalizable!
A major challenge in science is to maximize both internal and external validity even though they are negatively correlated
We can do this by keeping good experimental control and by comparing our results across multiple samples with large sample sizes
Chapter 7 continued:
Reliability—the consistency of behavioral measures
Types of Reliability:
Test-Retest: giving the same test twice in succession over a short time interval in order to measure consistency (using a correlation coefficient to measure consistency)
Parallel Forms: giving two versions of a test on two testing occasions to determine whether they result in consistent scores
Split-Half: dividing test items from a single test into two arbitrary groups and correlating the resulting scores after administration—if the correlation is sufficiently high, then test reliability is confirmed (this also establishes the equivalency of your test items)
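A split-half check can be sketched as follows. The item responses are invented (1 = correct, 0 = incorrect), the odd/even split is just one of many possible arbitrary splits, and pearson_r is a hypothetical helper rather than a library call.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Invented responses: one row per subject, one column per test item.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

# Score each half separately: items 1, 3, 5 versus items 2, 4, 6.
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

split_half_r = pearson_r(odd_half, even_half)
print(f"split-half r = {split_half_r:.3f}")
```

The same correlation step applies to test-retest and parallel-forms reliability; only the source of the two score lists changes (two occasions or two forms instead of two halves).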
Chapter 7 continued:
Statistical Reliability and Validity:
Statistical reliability determines whether findings are the result of chance
If not, we assume that the results occur because of the effect of the IV(s) on the DV
Statistical validity is whether we are measuring what we claim to be measuring
We sample subjects from a population when we use inferential statistics
The sample size needs to be large enough for the sample to estimate its underlying population(s)
The Central Limit Theorem states that the sampling distribution of the mean approaches a normal distribution as sample size increases, even when the population itself is not normal; as a rule of thumb, samples of 20-30 are usually treated as large enough for this approximation
Increasing sample size typically increases statistical power: the ability to reject a false null hypothesis
Random sampling increases the likelihood that the obtained sample accurately estimates the characteristics of the population that it is attempting to estimate
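The role of sample size can be seen in a small simulation. This is only a sketch: the skewed (exponential) population, the seed, and the sample sizes of 5 and 30 are all assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def sample_means(n, n_samples=2000):
    """Draw many random samples of size n from a skewed population; return their means."""
    return [mean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(n_samples)]

# Spread of the sample means for a small and a moderate sample size.
spread = {n: stdev(sample_means(n)) for n in (5, 30)}
grand_mean_30 = mean(sample_means(30))

for n, s in spread.items():
    print(f"n = {n:2d}: standard deviation of sample means = {s:.3f}")
```

The sample means cluster more tightly around the population mean (1.0 here) as n grows, which is the sense in which a larger random sample gives a better estimate of its population.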
Chapter 7 continued:
Types of errors in inferential statistics:
Type I error: rejecting a true null hypothesis (its probability is the alpha level)
Type II error: failing to reject a false null hypothesis
Power = 1 - probability of a Type II error
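The relation between power and the Type II error rate can be worked through for a simple case. The sketch below assumes a one-sided, one-sample z test with a known standard deviation and a standardized effect size; that test, the function name, and the example values are assumptions made for illustration, not something the slides specify.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(effect_size, n, alpha=0.05):
    """Power of a one-sided z test for a standardized effect size and sample size n."""
    z = NormalDist()                              # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha)                 # cutoff that fixes the Type I error rate
    beta = z.cdf(z_crit - effect_size * sqrt(n))  # probability of a Type II error
    return 1 - beta                               # power = 1 - beta

for n in (10, 20, 40):
    print(f"n = {n}: power = {power_one_sided_z(0.5, n):.2f}")
```

When the true effect is zero, the function returns alpha, because every rejection is then a Type I error; with a nonzero effect, power rises as n increases.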
Chapter 7 continued:
Measurement procedures: a systematic method of assigning numbers or names to objects and their attributes:
Nominal scale: labels with no quantitative significance
Ordinal scale: measures differences in magnitude (ranks), but not how much
Interval scale: measures differences in magnitude as well as how much different
Ratio scale: same as interval except with an added absolute zero, so you can determine how many times greater something is
Chapter 8: Experimental Design
Internal Validity in Experiments: by using experimental control, the researcher can rule out confounding variables as a cause, so that one’s results really do reflect an effect of the IV on the DV
Internal validity requires careful selection of IVs and a well thought-out experimental design
You can never “fix” design problems at the analysis stage
Although you can use “statistical control” through the use of ANCOVA
In this chapter, we will discuss two main types of experimental designs: between subjects and within subjects
Between Subjects: independent groups of subjects receive the different levels of the IV
Within Subjects: all subjects receive all levels of the IV
Chapter 8 continued:
Crossed versus Nested designs:
A crossed design is a factorial design: there are no empty cells
A nested design is when subjects receive different levels of the IV
You have empty cells
You only use this design in special situations because you cannot interpret interactions
You might use a nested design to save money when only certain cells are of interest
A placebo design is nested
But you can treat this as a crossed design (see example)
Chapter 8 continued:
Why experimental design matters, and how even with the best of intentions you must be very careful in interpreting your results
Example of a between subjects design: Executive Monkeys
Brady (1958) found that “executive monkeys” in control of when they were shocked were more likely to develop ulcers than “blue-collar” monkeys that had no control over when they were shocked
However, Weiss (1968, 1971) found that executive rats that had control over when an electric shock was administered were less likely to develop ulcers than helpless rats that had no control over when electric shocks were administered (this is an example of learned helplessness)
The discrepancy occurred because Brady randomly assigned high response-rate monkeys to the executive monkey condition (“neurotic monkeys”), an individual difference
The moral of the story is that individual differences are ALWAYS confounded with IV effects in a between subjects design
With large sample sizes, hopefully this would not occur
Also, replication is essential to catch these errant results
Chapter 8 continued:
To see if the animal results on the effect of unavoidable stress on performance generalize to humans, many researchers look at the effect of different stressors on cortisol (a stress hormone)
Meta-analysis (Dickerson & Kemeny, 2004) has shown that cognitive tasks (e.g., mental arithmetic) and public speaking cause cortisol levels to rise, but that noise exposure and emotion induction do not
So, stress does increase cortisol levels in humans as well as non-human animals
Chronically high levels of cortisol can cause cell death in the hippocampus and amygdala
Chapter 8 continued:
Example of a within subjects design: experiments with LSD
Jarrard (1963) looked at the dose-response curve of LSD in rats (by looking at the rate of lever pressing, with salt water being the control)
Jarrard counterbalanced the order of the doses (.05, .10, .20, .40, .80 milligram per kilogram of body weight)
Jarrard found that the two smallest doses slightly enhanced the response rate but that the two highest doses severely impaired response rate
One problem with drug studies using a within subjects design is that carryover effects may be so strong that counterbalancing cannot correct them
So, you may need to use a between subjects design for this type of study
Chapter 8 continued:
Types of Experimental Designs:
Between-subjects: a conservative design that prevents carryover effects (by using different subjects for different levels of the IV)
However, this design is extremely susceptible to individual differences confounding results
In order to minimize individual differences confounding one’s results, one can use matching (important subject characteristics are matched in the various treatment conditions) and randomization (random assignment)
However, subject attrition can make matching difficult, although newer mixed models can be used to analyze data with missing data points
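Random assignment in a between subjects design can be sketched as shuffling the subject list and dealing subjects into conditions in turn. The subject IDs, condition names, seed, and the function name randomly_assign are all hypothetical.

```python
import random

def randomly_assign(subjects, conditions, seed=None):
    """Shuffle the subjects, then deal them into conditions so group sizes stay equal."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for i, subject in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(subject)
    return groups

subjects = [f"S{i:02d}" for i in range(1, 13)]             # 12 hypothetical subjects
conditions = ["letter initials", "repeat question", "picture"]
groups = randomly_assign(subjects, conditions, seed=1)

for condition, members in groups.items():
    print(condition, members)
```

Because chance alone decides who lands in which group, individual differences are unlikely to line up with the IV, although with small groups an unlucky split is still possible.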
Chapter 8 continued:
Within subjects designs are more efficient and control for individual differences (because each subject serves as their own control), but this design is sensitive to carryover effects (e.g., practice and fatigue effects)
Counterbalancing can help minimize carryover effects
Factorial counterbalancing is the most comprehensive method, although it may not be practical (go over factorials)
A Latin square design can simplify counterbalancing
Balanced Latin square: for an even number of conditions, the first row is 1, 2, n, 3, n-1, 4, n-2 …
For an odd number of conditions, two squares are needed (the one above and a second reversed square)
Another option is to use a modular counterbalancing scheme (n-1)
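The counterbalancing schemes above can be sketched in code. The first row follows the 1, 2, n, 3, n-1, … pattern from the slide; building the remaining rows by adding 1 to every entry (wrapping n back to 1) is the usual construction and is an assumption here, since the slide does not spell it out. The function name is hypothetical.

```python
from itertools import permutations

def balanced_latin_square(n):
    """Condition orders (rows) for n conditions numbered 1..n."""
    # First row: 1, 2, n, 3, n-1, 4, n-2, ...
    first, low, high, take_low = [1], 2, n, True
    while len(first) < n:
        if take_low:
            first.append(low)
            low += 1
        else:
            first.append(high)
            high -= 1
        take_low = not take_low
    # Each later row adds 1 to every entry of the previous row, wrapping n back to 1.
    rows = [[(c - 1 + shift) % n + 1 for c in first] for shift in range(n)]
    # Odd n: add a second, reversed square.
    if n % 2 == 1:
        rows += [list(reversed(row)) for row in rows]
    return rows

# Factorial (complete) counterbalancing uses every possible order: n! of them.
n_full_orders = len(list(permutations(range(1, 5))))  # 4 conditions

for row in balanced_latin_square(4):
    print(row)
```

With four conditions, complete counterbalancing needs 24 orders, whereas the balanced square needs only four, and each condition still appears once in every position and immediately follows every other condition exactly once.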
Chapter 8 continued:
Control condition: in its simplest form, a group that does not receive a treatment
It is a baseline against which some other variable in the experiment can be compared
Mixed designs: when you have at least one between subjects variable and at least one within subjects variable
Choosing an experimental design: issues to consider
Carryover effects in a within subjects design
Individual differences in a between subjects design
Chapter 9: Complex Designs
Factorial Designs: we use these complex designs because real-world information processing is complex and requires multiple IVs
As we begin to understand a phenomenon better, the complexity of our experiments tends to increase from single IVs to many IVs
Chapter 9 continued:
Main Effects and Interactions
Color (hue), Case Type, and Spacing in visual word recognition
If we use a fast achromatic (magnocellular) channel and two slower parvocellular channels (one chromatic and one achromatic) to recognize words on a lexical decision task, then we should see a different pattern of hue effects for consistent lowercase versus mixed-case presentation
If this effect is due to the channel dynamics mentioned above, it should be relatively consistent for spaced and unspaced words
Chapter 9 continued: Experiment 1 Results:
[Figure: results for lowercase and mixed-case words under monochrome and mixed-hue presentation; numeric axis from 620 to 760]
Chapter 9 continued: Experiment 4 Results:
[Figure: results under Monochrome and Mixed-Hue presentation for four series labeled US LC, US MC, S LC, and S MC; numeric axis from 720 to 820]
Chapter 9 continued:
Main Effects: when we look at the effect of one IV collapsed across all other IVs
In our case, a main effect for case type
Interaction: when the effects of one IV depend upon the levels of another IV
In our case, a Case Type x Hue Type interaction, but no three-way interaction
Because interactions typically qualify main effects, if you have an interaction, then you need to make sure that the interaction does not attenuate or eliminate your main effects
Control in between subjects designs: random-groups and matched-groups designs
Chapter 9 continued:
Complex within subjects designs (such as our example above)
Block randomization or complete randomization (we used complete randomization)
Mixed designs: when you have at least one between subjects variable and at least one within subjects variable
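The difference between block randomization and complete randomization of trial order in a within subjects design can be sketched as follows. The condition labels, the seed, and both function names are hypothetical.

```python
import random
from collections import Counter

def block_randomized(conditions, n_blocks, rng):
    """Every block contains each condition once, in a fresh random order."""
    order = []
    for _ in range(n_blocks):
        block = list(conditions)
        rng.shuffle(block)
        order.extend(block)
    return order

def completely_randomized(conditions, n_reps, rng):
    """All trials are shuffled together, with no constraint within any stretch of trials."""
    order = list(conditions) * n_reps
    rng.shuffle(order)
    return order

rng = random.Random(2)                 # fixed seed so the sketch is repeatable
conditions = ["A", "B", "C"]
blocked = block_randomized(conditions, 4, rng)
complete = completely_randomized(conditions, 4, rng)

print("block randomization:   ", blocked)
print("complete randomization:", complete)
print(Counter(complete))               # each condition still occurs 4 times overall
```

Both schemes present each condition equally often; block randomization additionally guarantees that the conditions are spread evenly across the session, while complete randomization can produce runs of the same condition.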
Chapter 10: Small-n Experimentation
Small-n Experimentation: when a very few subjects are studied intensely
This design framework is often used for non-human animal research (because of the expense and logistic complexity of testing large numbers of, say, rats)
It is also used for special populations of humans that are difficult to obtain (e.g., progeria cases) and for clinical populations (e.g., ADHD children or patients with peculiar brain damage, such as H.M., that can be difficult to obtain because of privacy issues), although this is of questionable validity because there are costs to using this approach
The main cost in using small-n designs is that you are using descriptive statistics
That is, you are not obtaining a sample and assuming that this sample estimates a population; you are simply describing this small group of individuals
You must be very careful in assuming that these results generalize to a population as a whole
Chapter 10 continued:
Types of Small-n designs:
The AB design: A represents a baseline condition before, say, therapy (the control condition of the IV) and B represents the condition after the introduction of therapy (the treatment condition of the IV)
This design is used in some research, although it is a very poor design because changes that occur during treatment in the B phase may be caused by other uncontrolled variables that are confounded with therapy and that really cause the change in the DV
E.g., development (the passage of time during which we mature)
Chapter 10 continued:
Small-n designs continued:
ABA (or ABAB) or reversal design: a design in which there are interspersed baseline (A) and treatment (B) phases of manipulation:
This design rules out maturation, so it is superior to an AB design
[Figure: ABAB sequence with phases labeled A (Baseline), B (Extinction), A (Baseline), B (Extinction)]
Chapter 10 continued:
Small-n designs continued:
Before an ABA design is used, researchers usually use a “functional analysis of behavior” (a la Skinner) approach to better understand the phenomenon of interest
In a functional analysis of behavior study, one attempts to discover the antecedents and consequences of a given behavior in considerable detail
Functional relationship: the functional relation between what leads to the target behavior and the consequences that it produces
The Contingency—the relationship between the behavior and the outcome (includes reinforcement, punishment, escape, and avoidance)
The Discriminative Stimulus: the controlling stimulus or stimuli that cause the unwanted behavior
Chapter 10 continued:
Small-n designs continued:
Alternating Treatments Design (ACABCBCB) (A = no treatment, B = cookie with no dye, C = cookie with dye that potentially causes hyperactivity): more than one IV is used, and there may be numerous baseline periods
This design extends the ABAB design because it allows multiple IVs (or at least a control condition)
However, it does not work well when carryover effects are present with some or all of the IVs (but the same holds for an ABAB design)
In the Rose (1978) study, the two hyperactive girls showed no difference between A and B, but they did show more hyperactivity in the C condition—suggesting that the dye caused the increase in hyperactivity rather than the cookie, per se
Chapter 10 continued:
Small-n designs continued:
The Multiple-Baseline design can be used with a between-subjects design to overcome carryover effects: several behaviors (within subjects) or several people (between subjects) receive baseline periods of varying length, after which the IV is introduced (you can also look across settings)
One behavior is allowed to occur under baseline conditions (e.g., crying) and then the experimenter switches to the treatment
The timing of the onset of treatment is varied across subjects; if the treatment is consistently associated with a change in behavior (when other potential causes are held constant), then it is assumed that the treatment caused the change in behavior
You can use this same approach with the same subjects across different behaviors with different timing of the onset of the treatment—if the treatment for crying reduces crying but it does not affect fighting (and vice versa), then you can assume that your treatment caused the change in behavior
Chapter 10 continued:
Small-n designs continued:
The changing-criterion design: a method in which the researcher changes the behavior necessary to obtain reinforcement
If the behavior changes systematically with the changing criteria (e.g., you have to ride 5 miles instead of 3 miles on a stationary bike to get bonus points), then one assumes that the reinforcement criteria are producing the change
That is, if the experimenter removes the incentive completely (e.g., points that can be used to buy video games if 11-year-old boys exercise; DeLuca & Holborn, 1992), the level of exercise decreases back to zero
Note that if people base behavior on just external rewards, then this is not a good situation (e.g., if children do not clean their room unless they get paid to do so, then their house will be in bad shape when they are adults)
Chapter 10 continued:
Clinical Psychology, case studies: typically based on one patient with a disorder (e.g., H.M.)
Nissen et al.’s (1988) study of a dissociative identity disorder (multiple personality disorder) patient using memory tasks
This study is interesting because the explicit task showed an effect but the implicit task did not; contrary to the authors’ interpretation, it could have been that the DID patient was simply not able to catch the automatic processing but could catch the processing of which he was consciously aware