
Chapter 6: Basics of Experimentation

Experiment: A test designed to arrive at a causal explanation (Cook & Campbell, 1979)

Mill (1843), joint method of agreement and difference: causation can be inferred if some result, X, follows an event, A, if A and X vary together, and if it can be shown that event A produces result X

If A occurs, then so will X, and if A does not occur, then neither will X

If event B occurs, then X does not occur

Chapter 6 continued:

Tip-of-the-Tongue (TOT) example: X = correct resolution of the TOT state, A = presenting letter initials, B = repeating the question, C = presenting a picture of the celebrity

Subjects were instructed to name celebrities, and 10.5 instances per subject resulted in TOT states (the name of the celebrity was on the “tip of the subject’s tongue,” but they could not actually recall it)

Subjects showed significantly better resolution of TOT states with letter initials as a cue than with either repeating the question or presenting a picture of the celebrity

This suggests that memory for celebrities is coded using letter-level orthographic information rather than visual or “data warehouse” related information codes

However, we are not told whether conditions B or C differed significantly from chance. If they are above chance, then this suggests that these types of coding occur, but are less common than orthographic coding

Also, in so-called TOT states, the unresolved cases could actually have been due to subjects not knowing the name of the celebrity


Joint Method of Agreement and Difference continued:

Note that in the real world of science, A does not always produce X, and the absence of event A does not always fail to produce X (because science is inductive, or probabilistic, rather than deductive)

Thus, the inductive version of Mill’s joint method of agreement and difference is that event A (presenting letter initials) produces significantly more of result X (TOT resolution) than event B (repeating the question) or event C (presenting a picture of the celebrity)

So, “more” is defined by statistical significance

Statistical significance tests whether two (or more) means differ even when we consider error variance (noise)

This is why we refer to statistics as the “language of science”
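One way to make "means differ even when we consider noise" concrete is a permutation test. The sketch below is only an illustration and not the analysis used in the TOT study. The function name `permutation_p_value` and the scores are hypothetical, invented for this example. It asks how often a difference in means as large as the observed one arises when the condition labels are shuffled at random.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference between two group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled scores breaks any link between score and condition.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical percent-resolved scores, invented for illustration only.
initials_cue = [62, 70, 58, 66, 71, 64, 69, 60]
repeat_cue = [41, 48, 39, 52, 45, 43, 50, 38]

p = permutation_p_value(initials_cue, repeat_cue)
print(f"p = {p:.4f}")
```

A small p value means the observed difference would rarely arise from noise alone, which is the sense in which "more" is defined by statistical significance.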


In many experiments in psychology, you compare an experimental condition against a neutral baseline (e.g., repeating a question in our TOT example, although if this question was coded as a contextual cue with the memory for the celebrity’s name, then this would not have been a neutral baseline!)

The experimental condition (presenting celebrity’s initials) should show a significantly larger effect on the DV (percent recall of the celebrity’s name) than the control condition(s)

Experimental control is central to an experiment because it allows the production of a comparison by controlling the occurrence or nonoccurrence of a variable (while holding all other possible causes constant so they cannot affect the outcome)

Control has three components:

Comparison (the control condition is used as a comparison)

Production (levels or values of the IV can be produced)

Constancy (the experimental setting can be controlled by holding certain aspects constant)


Advantages of Experimentation:

Using animal models can sometimes save money and is considered to be more ethical by many

E.g., cosmetics are frequently tested on rabbits

But this has been very controversial!

Experimentation has more control than ex post facto research, in which levels of the IV are selected after the fact (selected rather than manipulated with control)

E.g., research on the health consequences of smoking has been almost entirely correlational

The tobacco lobby actually argued as a defense at trial that cigarette smokers were more likely to develop lung cancer than non-smokers because smokers were more neurotic, and that it was really the higher levels of neuroticism that were causing the cancer risk!

One cannot rule this out because neuroticism was not controlled


Variables in Experimentation:

IVs (independent variables): manipulated by the experimenter because they are hypothesized to cause changes in the DV

Failure to find an effect of the IV on the DV is termed a “null result”

This can be due to a lack of an effect, an invalid manipulation, or a lack of statistical power

DV (dependent variable): the performance variable observed and recorded by the experimenter. A good DV (e.g., RT or accuracy) should be reliable and should not be overly sensitive to floor or ceiling effects

Floor effect: when it is impossible to do any worse on a task because you are already at the bottom

Ceiling effect: when it is impossible to improve because you are already at perfect performance

CV (control variable): potential IVs that are held constant during an experiment

This is usually because one can only manipulate a small number of variables (usually five or fewer) in any given experiment

If a potential IV that is not manipulated is not controlled, then it can become a confounded variable


Review four examples of experiments from the text


More than one IV:

A typical experiment will manipulate between two and four IVs

This is done because it is more efficient: experimental control is typically superior with multiple IVs, and the results can be generalized across a group of IVs rather than just a single IV

Multiple IVs also allow a researcher to examine both main effects (an effect of just one IV in isolation) and interactions (when the effects produced by one IV are not the same across the levels of a second IV)

Interactions allow us to examine joint effects of multiple IVs and add increased precision

An interaction takes precedence over main effects


More than one DV: we analyze just one DV at a time in univariate statistics

If we truly wish to analyze two or more DVs at a time, this requires a multivariate statistical technique (such as MANOVA)

In a MANOVA, we form a composite DV from multiple DVs

But MANOVAs do not tell us whether the pattern of effects is consistent across DVs

We can use correlations across trial blocks, diffusion models, or entropy/RT models to look at the overall pattern of results across multiple DVs (e.g., RT and errors)

However, these techniques are complicated

Consequently, most experiments in psychology use a single DV and are analyzed using ANOVA


Possible sources of experimental error:

Demand characteristics or reactivity: Hawthorne effect (Homans, 1965)

Deception can be used to prevent demand characteristics

Because subjects do not know what is being tested, they cannot be biased through reactivity

However, if an experimenter uses deception, they typically need to debrief participants after the study


External validity of the research procedure:

Representativeness of subjects: the ability to generalize across different participant populations

Are rats really representative of humans?

E.g., rats’ basal ganglia system is probably different from that of humans

Variable representativeness: the ability to generalize across different experimental manipulations

E.g., the relationship of background noise to studying efficiency (do noise and music both impair performance?)

Setting representativeness: the representativeness of the experimental setting (or ecological validity)

Realism is not the same as generalizability, though

Chapter 7: Validity and Reliability in Psychological Research

Validity: the truth of an observation

Types of Validity:

Predictive validity: checking the truth of an observation by comparing it to another criterion that is thought to measure the same thing

We will use SAT I as an example

Criterion: another measurement of behavior that serves as a standard for the measurement in question (e.g., ACT, college freshman GPA)

In predictive validity, the relation between two scores is typically assessed by a statistic termed the correlation coefficient (e.g., Pearson’s product-moment correlation coefficient)

The better the prediction of the observation (e.g., SAT I score predicting college freshman GPA), the greater the predictive validity of the predictor score

However, predictive validity does not define a measure or construct

E.g., we cannot assume that a person with a higher SAT I score than another person is smarter than the other person, because predictive validity does not allow us to do this unless our criterion is that sort of measurement

E.g., an intelligence test score rather than freshman GPA
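As a rough illustration of how predictive validity is quantified, the sketch below computes Pearson's product-moment correlation between a predictor and a criterion. The helper name `pearson_r` and all of the scores are hypothetical, made up for this example. They are not real SAT or GPA data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [xi - mean_x for xi in x]
    dev_y = [yi - mean_y for yi in y]
    # Sum of cross-products of deviations, scaled by the two sums of squares.
    cross = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    ss_x = sum(dx * dx for dx in dev_x)
    ss_y = sum(dy * dy for dy in dev_y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical predictor (test score) and criterion (freshman GPA) values.
test_scores = [1050, 1180, 1240, 990, 1320, 1400, 1110, 1270]
freshman_gpa = [2.7, 3.0, 3.2, 2.5, 3.4, 3.8, 2.9, 3.1]

r = pearson_r(test_scores, freshman_gpa)
print(f"r = {r:.2f}")
```

A value of r near +1 or -1 indicates strong prediction of the criterion. As the slide notes, a high r still says nothing about what construct the predictor measures.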


Types of Validity continued:

Construct Validity: the degree to which the independent and dependent variables accurately reflect or measure what they are intended to measure (Cook & Campbell, 1979; Judd et al., 1991). Really, are the names accurate?

In our Stroop experiment from Chapter 1, did our tasks really reflect reading and does scan really reflect reading performance?

Counting the number of digits in a row is probably not a good measure of reading

Extraneous Variables: confounding variables that may be a source of invalidity can threaten construct validity

Reading aloud requires speech production processes that are not required in reading, and Tasks 2 and 3 required counting, which is not the same as reading

Katz et al. (1990) have also claimed that the SAT I is not construct valid

Freedle and Kostin (1994) found that SAT test takers did use the passages to respond, so they found some construct validity

Reactivity and Random Error

Subjects could have been afraid of looking like a poor reader on a Stroop task

Some subjects could have been timed with the second hand on a watch, and others could have been timed with a chronograph (a stopwatch); this could have led to random error in timing precision


Construct Validity continued:

We can improve construct validity by using an operational definition (a recipe for specifying how a construct, such as reading, is produced and measured)

This is because operational definitions allow the conditions that produce the concept to be measured and defined

In our Stroop example, reading is reduced to the independent variables that produce it and the dependent variable(s) used to measure it

Protocols (the specification of how the measurement and procedures are to be undertaken) also reduce the risk of construct invalidity because they reduce the likelihood of random error

Circular reasoning is a potential problem when using an operational definition, though

We need to have a method of defining something independent of how we measure it

Some have claimed that the concept of processing resources suffers from this problem (circularity; Navon, 1979)

However, we can use PRP and coactivation methods


Construct validity is usually demonstrated using psychometric methods:

Factor Analysis

A data reduction method in which you determine which measured variables are related to which constructs

You can also show that your constructs from factor analysis are related in the manner predicted by your theory using causal analysis:

Path analysis or Structural Equation Modeling (or covariance structure modeling)

Item Response Theory (or IRT): a mathematical technique for determining which items on a test measure the same construct


Types of Validity continued:

External Validity: the extent to which we can generalize our research results (in this setting, measured on this sample) to other settings and other populations or samples

To demonstrate external validity, we need to replicate our initial results in other settings and on different people

Hypertension, gender and race

Our experimental setting needs to be representative of the typical situation (e.g., reading is typically tested using a reading out-loud method in elementary school even though this is not an accurate measure of reading comprehension; it is more of a measure of speech perception or production)


Internal Validity: when we can make causal statements about the relationship between IVs and DVs

Specifically, when your IV causes an effect on the DV (are we testing what we claim to be testing? This can be similar to construct validity)

Without internal validity, we are not doing science

Internal validity requires good experimental control

This is at odds with external validity because as we increase experimental control, our results become less generalizable!

A major challenge in science is to maximize both internal and external validity even though they are negatively correlated

We can do this by keeping good experimental control and by comparing our results across multiple samples with large sample sizes


Reliability—the consistency of behavioral measures

Types of Reliability:

Test-Retest: giving the same test twice in succession over a short time interval in order to measure consistency (using a correlation coefficient to measure consistency)

Parallel Forms: giving two versions of a test on two testing occasions to determine whether they result in consistent scores

Split-Half: dividing test items from a single test into two arbitrary groups and correlating the resulting scores after administration—if the correlation is sufficiently high, then test reliability is confirmed (this also establishes the equivalency of your test items)
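The split-half procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions: the items are split into odd- and even-numbered halves (one of many possible arbitrary splits), the function names `pearson_r` and `split_half_r` are hypothetical, and the 0/1 item responses are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def split_half_r(item_scores):
    """item_scores holds one list of item scores per test taker."""
    # Total score on items 1, 3, 5, ... versus items 2, 4, 6, ...
    odd_totals = [sum(person[0::2]) for person in item_scores]
    even_totals = [sum(person[1::2]) for person in item_scores]
    return pearson_r(odd_totals, even_totals)


# Hypothetical responses: 6 test takers, 8 items, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 0],
]

reliability = split_half_r(responses)
print(f"split-half r = {reliability:.2f}")
```

The same `pearson_r` helper would serve for test-retest or parallel-forms reliability by passing the two sets of total scores directly.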


Statistical Reliability and Validity:

Statistical reliability determines whether findings are the result of chance

If not, we assume that the results occur because of the effect of the IV(s) on the DV

Statistical validity is whether we are measuring what we claim to be measuring

We sample subjects from a population when we use inferential statistics

The sample size needs to be large enough for the sample to estimate its underlying population(s)

By the Central Limit Theorem, with samples of roughly 20-30 or more the sampling distribution of the mean is approximately normal for most populations, whatever the shape of the population itself

Increasing sample size typically increases statistical power: the ability to reject a false null hypothesis

Random sampling increases the likelihood that the obtained sample accurately estimates the characteristics of the population that it is attempting to estimate


Types of errors in inferential statistics:

Type I error: rejecting a true null hypothesis (the probability of doing so is the alpha level)

Type II error: failing to reject a false null hypothesis

Power = 1 - the probability of a Type II error
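The relation between alpha, the Type II error rate, and power can be worked through for a deliberately simple case. The sketch below assumes a one-sided, one-sample z test with a known standard deviation, which is a textbook simplification and not a design taken from these notes. The function name `z_test_power` and the effect size of 0.5 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a one-sided, one-sample z test.

    effect_size is the true mean shift in standard-deviation units.
    """
    std_normal = NormalDist()
    # Reject the null when the z statistic exceeds this cutoff.
    critical_z = std_normal.inv_cdf(1 - alpha)
    # Under the alternative, the z statistic is centered on effect_size * sqrt(n).
    beta = std_normal.cdf(critical_z - effect_size * sqrt(n))  # P(Type II error)
    return 1 - beta


for n in (10, 20, 40, 80):
    print(n, round(z_test_power(0.5, n), 3))
```

With a true effect of zero the function returns alpha, the Type I error rate. With a nonzero effect the returned power rises as n grows.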


Measurement procedures: a systematic method of assigning numbers or names to objects and their attributes:

Nominal scale: labels with no quantitative significance

Ordinal scale: measures differences in magnitude (ranks), but not how much

Interval scale: measures differences in magnitude as well as how much different

Ratio scale: same as interval except with an added absolute zero, so you can determine how many times greater something is

Chapter 8: Experimental Design

Internal Validity in Experiments: by using experimental control, the researcher can rule out confounding variables as a cause, so that one’s results really do reflect an effect of the IV on the DV

Internal validity requires careful selection of IVs and a well thought-out experimental design

You can never “fix” design problems at the analysis stage

Although you can use “statistical control” through the use of ANCOVA

In this chapter, we will discuss two main types of experimental designs: between subjects and within subjects

Between Subjects: independent groups of subjects receive the different levels of the IV

Within Subjects: all subjects receive all levels of the IV


Crossed versus Nested designs:

A crossed design is a factorial design: there are no empty cells

A nested design is when subjects receive different levels of the IV

You have empty cells

You only use this design in special situations because you cannot interpret interactions

You might use a nested design to save money when only certain cells are of interest

A placebo design is nested

But you can treat this as a crossed design (see example)


Why experimental design matters and how even with the best of intentions you must be very careful in interpreting your results

Example of a between subjects design: Executive Monkeys

Brady (1958) found that “executive monkeys” in control of when they were shocked were more likely to develop ulcers than “blue-collar” monkeys that had no control over when they were shocked

However, Weiss (1968, 1971) found that executive rats that had control over when an electric shock was administered were less likely to develop ulcers than helpless rats that had no control over when electric shocks were administered (this is an example of learned helplessness)

The discrepancy occurred because Brady assigned high response-rate monkeys (“neurotic monkeys”) to the executive condition rather than assigning monkeys at random, so an individual difference was confounded with the IV

The moral of the story is that individual differences are ALWAYS confounded with IV effects in a between subjects design

With large sample sizes, hopefully this would not occur

Also, replication is essential to catch these errant results


To see if the animal results on the effect of unavoidable stress on performance generalize to humans, many researchers look at the effect of different stressors on cortisol (a stress hormone)

Meta-analysis (Dickerson & Kemeny, 2004) has shown that cognitive tasks (e.g., mental arithmetic) and public speaking cause cortisol levels to rise, but that noise exposure and emotion induction do not

So, stress does increase cortisol levels in humans as well as non-human animals

Chronically high levels of cortisol can cause cell death in the hippocampus and amygdala


Example of a within subjects design: experiments with LSD

Jarrard (1963) looked at the dose-response curve of LSD in rats (by looking at the rate of lever pressing, with salt water being the control)

Jarrard counterbalanced the order of the doses (.05, .10, .20, .40, .80 milligrams per kilogram of body weight)

Jarrard found that the two smallest doses slightly enhanced the response rate but that the two highest doses severely impaired response rate

One problem with drug studies using a within subjects design is that carryover effects may be so strong that counterbalancing cannot correct them

So, you may need to use a between subjects design for this type of study


Types of Experimental Designs:

Between-subjects: a conservative design that prevents carryover effects (by using different subjects for different levels of the IV)

However, this design is extremely susceptible to individual differences confounding results

In order to minimize individual differences confounding one’s results, one can use matching (important subject characteristics are matched in the various treatment conditions) and randomization (random assignment)

However, subject attrition can make matching difficult, although newer mixed models can be used to analyze the data with missing data points
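Random assignment can be sketched as a shuffle followed by dealing subjects into groups. This is one simple scheme among several. The function name `randomly_assign`, the subject IDs, and the condition labels are hypothetical.

```python
import random


def randomly_assign(subject_ids, conditions, seed=None):
    """Shuffle subjects, then deal them into conditions in rotation.

    Dealing in rotation keeps group sizes within one subject of each other.
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, subject in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(subject)
    return groups


# Hypothetical study: 12 subjects, two levels of a between subjects IV.
groups = randomly_assign(range(1, 13), ["treatment", "control"], seed=42)
for condition, members in groups.items():
    print(condition, sorted(members))
```

Because chance alone decides who lands in which group, a subject characteristic such as response rate is not systematically tied to the IV, although it can still differ between groups by chance in small samples.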


Within subjects designs: more efficient and control for individual differences (because each subject serves as their own control), but this design is sensitive to carryover effects (e.g., practice and fatigue effects)

Counterbalancing can help minimize carryover effects

Factorial counterbalancing (using all n! possible orders) is the most comprehensive method, although it may not be practical (go over factorials)

A Latin square design can simplify counterbalancing

Balanced Latin square: for an even number of conditions, the first order is 1, 2, n, 3, n-1, 4, n-2 …

For an odd number of conditions, two squares are needed (the one above and a second reversed square)

Another option is to use a modular counterbalancing scheme (n-1)
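The balanced Latin square rule can be turned into a short generator. The sketch below uses the common construction in which the first order follows the 1, 2, n, 3, n-1, … pattern and each later order adds 1 to every condition number, wrapping around after n. The wrap-around step is an assumption, since the notes give only the first order. The function name `balanced_latin_square` is hypothetical.

```python
def balanced_latin_square(n):
    """Return a list of condition orders, with conditions numbered 1..n."""
    # Build the first order: 1, 2, n, 3, n-1, 4, n-2, ...
    first_row = [1]
    low, high = 2, n
    take_low = True
    while len(first_row) < n:
        if take_low:
            first_row.append(low)
            low += 1
        else:
            first_row.append(high)
            high -= 1
        take_low = not take_low
    # Each later order adds 1 to every entry, wrapping n back around to 1.
    square = [[(c - 1 + shift) % n + 1 for c in first_row] for shift in range(n)]
    # With an odd number of conditions, add the reversed orders as a second square.
    if n % 2 == 1:
        square += [list(reversed(row)) for row in square]
    return square


for order in balanced_latin_square(4):
    print(order)
```

With four conditions this needs only 4 orders, whereas factorial counterbalancing would need 4! = 24. In the resulting square each condition appears once in every ordinal position and immediately follows every other condition equally often.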


Control condition: in its simplest form, a group that does not receive a treatment

It is a baseline against which some other variable in the experiment can be compared

Mixed designs: when you have at least one between subjects variable and at least one within subjects variable

Choosing an experimental design: Issues to consider

Carryover effects in a within subjects design

Individual differences in a between subjects design

Chapter 9: Complex Designs

Factorial Designs: we use these complex designs because real-world information processing is complex and requires multiple IVs

As we begin to understand a phenomenon better, the complexity of our experiments tends to increase from single IVs to many IVs


Main Effects and Interactions

Color (hue), Case Type, and Spacing in visual word recognition

If we use a fast achromatic (magnocellular) channel and two slower parvocellular channels (one chromatic and one achromatic) to recognize words on a lexical decision task, then we should see a different pattern of hue effects for consistent lowercase versus mixed-case presentation

If this effect is due to the channel dynamics mentioned above, it should be relatively consistent for spaced and unspaced words

Chapter 9 continued: Experiment 1 Results:

[Figure: results for monochrome and mixed-hue presentation, with lowercase and mixed-case conditions; axis ticks from 620 to 760 in steps of 20]

Chapter 9 continued: Experiment 4 Results:

[Figure: results for Monochrome and Mixed-Hue presentation in four conditions (US LC, US MC, S LC, S MC); axis ticks from 720 to 820 in steps of 10]


Main Effects: when we look at the effect of one IV collapsed across all other IVs

In our case, a main effect for case type

Interaction: when the effects of one IV depend upon the levels of another IV

In our case, a Case Type x Hue Type interaction, but no three-way interaction

Because interactions typically qualify main effects, if you have an interaction, then you need to make sure that the interaction does not attenuate or eliminate your main effects
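The definitions of main effect and interaction can be illustrated with a 2 x 2 table of cell means. The sketch below is descriptive only, assumes equal cell sizes, and uses hypothetical mean RTs. They are not the values from Experiment 1 or Experiment 4. The function names are also invented for this example.

```python
# Hypothetical mean RTs in ms for a Case Type x Hue Type design.
cell_means = {
    ("lowercase", "monochrome"): 650.0,
    ("lowercase", "mixed-hue"): 660.0,
    ("mixed-case", "monochrome"): 700.0,
    ("mixed-case", "mixed-hue"): 750.0,
}


def marginal_mean(cells, factor_index, level):
    """Mean of all cells at one level of a factor, collapsed across the other."""
    values = [m for key, m in cells.items() if key[factor_index] == level]
    return sum(values) / len(values)


def main_effect(cells, factor_index, level_a, level_b):
    """Difference between the marginal means of two levels of one factor."""
    return marginal_mean(cells, factor_index, level_b) - marginal_mean(
        cells, factor_index, level_a
    )


def interaction_contrast(cells):
    """Hue effect for mixed-case words minus hue effect for lowercase words."""
    hue_lower = cells[("lowercase", "mixed-hue")] - cells[("lowercase", "monochrome")]
    hue_mixed = cells[("mixed-case", "mixed-hue")] - cells[("mixed-case", "monochrome")]
    return hue_mixed - hue_lower


case_effect = main_effect(cell_means, 0, "lowercase", "mixed-case")  # 725 - 655
hue_effect = main_effect(cell_means, 1, "monochrome", "mixed-hue")   # 705 - 675
interaction = interaction_contrast(cell_means)                       # 50 - 10
print(case_effect, hue_effect, interaction)
```

A nonzero interaction contrast means the hue effect depends on case type, so the overall hue main effect does not describe either case type well on its own. Whether any of these differences is reliable is a question for the ANOVA, not for this arithmetic.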

Control in between subjects designs: random-groups and matched-groups designs


Complex within subjects designs (such as our example above)

Block randomization or complete randomization (we used complete randomization)
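The two trial-ordering schemes can be contrasted in a short sketch. It rests on an assumed reading of the terms: block randomization presents every condition once, in random order, before any condition repeats, while complete randomization shuffles the whole trial list with no block constraint. The function names and condition labels are hypothetical.

```python
import random


def complete_randomization(conditions, repetitions, seed=None):
    """Shuffle the entire trial list; runs of the same condition can occur."""
    rng = random.Random(seed)
    trials = list(conditions) * repetitions
    rng.shuffle(trials)
    return trials


def block_randomization(conditions, n_blocks, seed=None):
    """Shuffle within each block; every condition occurs once per block."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_blocks):
        block = list(conditions)
        rng.shuffle(block)
        sequence.extend(block)
    return sequence


# Hypothetical condition labels for a case type by hue design.
conditions = ["LC-mono", "LC-hue", "MC-mono", "MC-hue"]
print(complete_randomization(conditions, repetitions=3, seed=1))
print(block_randomization(conditions, n_blocks=3, seed=1))
```

Both schemes give each condition the same number of trials here. Block randomization additionally spreads the conditions evenly over the session, which helps when practice or fatigue changes performance over time.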

Mixed designs: when you have at least one between subjects variable and at least one within subjects variable

Chapter 10: Small-n Experimentation

Small-n Experimentation: when a very few subjects are studied intensely

This design framework is often used for non-human animal research (because of the expense and logistic complexity of testing large numbers of, say, rats)

It is also used for special populations of humans that are difficult to obtain (e.g., progeria cases) and for clinical populations (e.g., ADHD children, or patients with peculiar brain damage, such as H.M., who can be difficult to obtain because of privacy issues), although this is of questionable validity because there are costs to using this approach

The main cost in using small-n designs is that you are using descriptive statistics

That is, you are not obtaining a sample and assuming that this sample estimates a population; you are simply describing this small group of individuals

You must be very careful in assuming that these results generalize to a population as a whole


Types of Small-n designs:

The AB design: A represents a baseline condition before, say, therapy (the control condition of the IV) and B represents the condition after the introduction of therapy (the treatment condition of the IV)

This design is used in some research, although it is a very poor design because changes that occur during treatment in the B phase may be caused by other uncontrolled variables that are confounded with therapy, in that they really cause the change on the DV

E.g., development (the passage of time during which we mature)


Small-n designs continued:

ABA (or ABAB) or reversal design: a design in which there are interspersed baseline (A) and treatment (B) phases of manipulation

This design rules out maturation, so it is superior to an AB design

[Figure: ABAB phase sequence: A (Baseline), B (Extinction), A (Baseline), B (Extinction)]


Small-n designs continued:

Before an ABA design is used, researchers usually use a “functional analysis of behavior” (a la Skinner) approach to better understand the phenomenon of interest

In a functional analysis of behavior study, one attempts to discover the antecedents and consequences of a given behavior in considerable detail

Functional relationship: the functional relation between what leads to the target behavior and the consequences that it produces

The Contingency: the relationship between the behavior and the outcome (includes reinforcement, punishment, escape, and avoidance)

The Discriminative Stimulus: the controlling stimulus or stimuli that cause the unwanted behavior


Small-n designs continued:

Alternating Treatments Design (ACABCBCB) (A = no treatment, B = cookie with no dye, C = cookie with dye that potentially causes hyperactivity): more than one IV is used, and there may be numerous baseline periods

This design extends the ABAB design because it allows multiple IVs (or at least a control condition)

However, it does not work well when carryover effects are present with some or all of the IVs (but the same holds for an ABAB design)

In the Rose (1978) study, the two hyperactive girls showed no difference between A and B, but they did show more hyperactivity in the C condition—suggesting that the dye caused the increase in hyperactivity rather than the cookie, per se


Small-n designs continued:

The Multiple-Baseline design: can be used with a between-subjects design to overcome carryover effects. Several behaviors (within subjects) or several people (between subjects) receive baseline periods of varying length, after which the IV is introduced (you can also look across settings)

One behavior is allowed to occur under baseline conditions (e.g., crying) and then the experimenter switches to the treatment

The timing of the onset of treatment is varied across subjects. If the treatment is consistently associated with a change in behavior (when other potential causes are held constant), then it is assumed that the treatment caused the change in behavior

You can use this same approach with the same subjects across different behaviors with different timing of the onset of the treatment—if the treatment for crying reduces crying but it does not affect fighting (and vice versa), then you can assume that your treatment caused the change in behavior


Small-n designs continued:

The changing-criterion design: a method in which the researcher changes the behavior necessary to obtain reinforcement

If the behavior changes systematically with the changing criteria (e.g., you have to ride 5 miles instead of 3 miles on a stationary bike to get bonus points), then one assumes that the reinforcement criteria are producing the change

That is, if the experimenter removes the incentive completely (e.g., points that can be used to buy video games if 11-year-old boys exercise; DeLuca & Holborn, 1992), the level of exercise decreases back to zero

Note that if people base behavior on just external rewards, then this is not a good situation (e.g., if children do not clean their room unless they get paid to do so, then their house will be in bad shape when they are adults)


Clinical Psychology, case studies: typically based on one patient with a disorder (e.g., H.M.)

Nissen et al.’s (1988) study of a dissociative identity disorder (multiple personality disorder) patient using memory tasks

This study is interesting because the explicit task showed an effect but the implicit task did not. Contrary to the authors’ interpretation, it could have been that the DID patient was simply not able to catch the automatic processing, but could catch the processing of which the patient was consciously aware
