RESEARCH METHODOLOGY AND STATISTICAL ANALYSIS 2004
Validity Overview & Definitions
One of the fundamental questions to be asked and addressed is:
how valid and reliable are the study and the data collected?
Validity of Research: Two general concepts are important.
1. Internal validity: Extent to which results can be attributed
to "treatment".
2. External validity: Extent to which results can be
generalized. External validity is examined qualitatively by
scrutinizing the sampling scheme employed.
Validity of Data: Concerned primarily with the dependent
variable. The instrument used to quantify the dependent variable
should be examined for its validity (ability to truly measure what
it is supposed to). If the instrument is a well-known one with
established validity it may be enough to cite a reference where
validity was examined and show that the same protocol was followed
in your study on similar subjects.
Validity of the dependent variable can be assessed:
Qualitatively
Content/logical validity
Quantitatively
Content/logical - factor analysis
Construct validity - multi-trait/multi-method procedure
(correlations) or factor analysis
Reliability of Research: Essentially this refers to the
replicability of the research. The reliability of the research is
assessed qualitatively by scrutinizing the design and methodology
employed in the research.
An estimate of power (probability of correctly rejecting the
null hypothesis) is also useful in examining the stability of the
research.
Reliability of Data: Concerned primarily with the dependent
variable. The instrument used to quantify the dependent variable
should be examined for its reliability (accuracy of measures
reflected in consistency).
Reliability of the dependent variable can be assessed
quantitatively using:
1. Coefficient alpha
2. Intraclass R
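
For illustration, coefficient alpha can be computed directly from a
subjects-by-items (or subjects-by-trials) score matrix. The sketch
below is a minimal Python example with made-up data, not tied to any
particular package's implementation:

    import numpy as np

    def cronbach_alpha(items):
        # items: subjects x items (or subjects x trials) score matrix
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)      # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # hypothetical data: 5 subjects measured on 4 trials
    scores = [[10, 12, 11, 13],
              [8, 9, 9, 10],
              [14, 15, 13, 16],
              [9, 10, 11, 10],
              [12, 13, 12, 14]]
    print(cronbach_alpha(scores))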
Once the statement of the problem is translated into a null
hypothesis, choices with respect to statistical analysis can be
made.
Definitions
RESEARCH: The systematic, replicable, empirical investigation of
relationships between or among several variables.
EXPERIMENTAL RESEARCH: Involves the manipulation of some
variable by the researcher in order to determine whether changes in
a behavior of interest are effected. A well-controlled experiment
also requires that the assignment of subjects to conditions (or
vice versa) be made on a random basis and that all other variables,
except the conditions that the subjects will experience, be held
constant.
NON-EXPERIMENTAL RESEARCH: No manipulation of variables takes
place. Observational, historical, and survey studies are typically
non-experimental. Much of the research conducted in natural
settings is non-experimental since the researcher is unable to
manipulate the conditions that the subjects will experience.
INDEPENDENT VARIABLE: The variable manipulated by the
experimenter. Or a broader definition would be - any variable that
is assumed to produce an effect on, or be related to, a behavior of
interest.
LEVELS OF AN INDEPENDENT VARIABLE: The various values or
groupings of values of an independent variable. Ex: a study is
conducted to determine the effect of room temperature on
performance. If the experimenter tests the subject at 70, 80, and
90 degrees, there is one independent variable - room temperature -
with three levels.
FACTORIAL DESIGN: One that allows two or more independent
variables to be combined in the same study so that the effects of
each variable can be evaluated independently of the effects of the
other(s). In addition, a factorial design allows the experimenter
to determine whether there was an interaction between the
independent variables. That is whether the effects of one variable
depend on the specific level of the other variable with which it
was combined. To produce a factorial design, each level of one
variable is paired with each level of the other independent
variable(s). Ex: Suppose the independent variables are room
temperature (70, 90) and class size (15, 30 students). A factorial
design would require four research conditions.
DEPENDENT VARIABLE: The behavior or characteristic observed or
analyzed by the researcher, generally with regard to how the
independent variable(s) affected or were related to it.
TYPE OF DEPENDENT VARIABLE: In empirical research, the dependent
variable is quantified in some way. Statistical analysis is carried
out on the numerical values of the dependent variable. The three
basic types are score data (ratio, interval), ordered data
(ordinal), and frequency data (categorical).
Score data: Generally requires relatively precise measuring
instruments and an understanding of the behavior being measured.
Statistical techniques developed to analyze score data make rather
stringent assumptions about the nature of the scores. The most
common assumptions are:
1. The intervals between scores are equal; that is, differences
between scores at one point of the measuring scale are equivalent
to the same size difference at any other point on the scale.
2. The scores are assumed to be normally distributed within the
population(s) from which they were drawn.
3. The variances of the populations are assumed to be
homogeneous.
Ordered data: Used when reliable score data cannot be (or are
not) obtained, but the subjects can be ranked from high to low
along the dimension of interest. In some cases, a researcher may
convert score data to ranks because it is believed that the
measuring instrument was not precise enough to trust the numerical
scores, or that the assumptions underlying a statistical test for
score data would be badly violated by the data. Statistical tests
designed for use with ordered data generally do not make stringent
assumptions about the nature of the underlying distributions and
hence are more conservative than those designed for score data.
Frequency data: Each subject is classified into a particular
category. The frequency of occurrence of subjects in each category
typically provides the data from which statistical analysis is
done.
Qualitative Research
It is important to recognize the sources of qualitative data, as
distinct from data for empirical/experimental research.
3 broad categories that produce qualitative data:
In-depth open ended interviews
Direct observation
Analysis of written documents
Tendency for qualitative research to be more exploratory in
nature.
Validity and reliability are as important in qualitative
research as quantitative research.
General definitions:
Validity: tests/observations/instruments etc. are valid if they
produce relevant and clean measures of what you want to assess.
Reliability: tests/observations/instruments etc. are reliable if
they produce accurate measures of what you want to assess.
Validity of Research
Internal validity: extent to which findings are free of bias,
observer effects, Hawthorne effects, etc.; the extent to which
findings match reality.
External validity: Extent to which findings can be
generalized.
Validity of Data
Qualitative data is valid when it records relevant
events/information without interpretation, bias or filtering.
Reliability of Research
Extent to which findings are replicable.
Reliability of Data
Qualitative data is reliable when it accurately (objectively)
records events/information.
The validity and reliability of qualitative data depend to a
great extent on the methodological skill, sensitivity and integrity
of the researcher.
Skillful interviewing involves more than just asking
questions.
Systematic and rigorous observation involves far more than just
being present and looking around.
Content analysis requires considerably more than just reading to
see what's there.
All require discipline, knowledge, training, practice, and
creativity on the part of the researcher.
Qualitative and Quantitative research are not polar opposites
with completely different sets of techniques and approaches to
inquiry. They exist along a continuum commonly framed in terms of
the amount of control or manipulation present.
You can consider quantitative research to be grounded in
scientific inquiry or the scientific method and proceeding along
two dimensions:
The extent to which the researcher manipulates some phenomenon
in advance in order to study it.
The extent to which constraints are placed on output measures -
that is, predetermined categories or variables are used to describe
the phenomenon under study.
You can think of qualitative inquiry as discovery oriented. No
pre-conditions are set and no constraints are placed on the
outcomes. Variables emerge from the data.
The advantage of a quantitative approach is that it is possible
to measure the reactions of many people to a limited set of
questions thus facilitating comparison and statistical aggregation
of data. A broad, generalizable set of findings result.
The advantage of a qualitative approach is that a wealth of
detailed information about a specific event is produced. This
increases understanding of the cases and situations studied but
reduces generalizability.
Measurement Issues
The place to start is with how to classify data - the scale the
data appropriately belongs on will affect analysis decisions.
Measurement Scales
Categorical/nominal scale: Used to measure discrete variables
that can be classified by two or more mutually exclusive
categories.
Ex: Gender is a categorically scaled variable with two
categories: male & female. The scale scores (0,1) have no
meaning.
Ordinal scale: Used to measure discrete variables that are
categorical in nature and can be ordered (meaningfully).
Ex: Undergraduate class is an ordinally scaled variable with
four meaningfully ordered categories: freshman, sophomore, junior,
senior. The scale scores (1,2,3,4) have meaning in that juniors
have completed more units than sophomores, who have completed more
than freshmen . . .
Interval scale: Used to measure continuous variables that are
ordinal in nature and result in values that represent actual and
equal differences in the variable measured.
Ex: Temperature is an interval scaled variable with meaningfully
ordered categories (hot, cold) that can be measured (scale has a
constant unit of measurement) to finer and finer degrees given
appropriate instrumentation.
Ratio scale: Used to measure continuous variables that have a
true zero, implying total lack of the attribute/property being
measured.
Ex: Weight is a ratio scaled variable with meaningfully ordered
categories (heavy, light) that can be measured to finer and finer
degrees that also has a true rather than arbitrary zero.
Continuous variables are ones that are at least interval scaled
and call for the use of parametric statistics.
Discrete variables are ones that are categorical or ordinal in
nature and call for use of non-parametric statistics.
Validity, Reliability, Objectivity
Recall that it is important to question 'how valid and reliable
are the study and the data collected'?
Validity of Research: Two general concepts important.
1. Internal validity: Extent to which results can be attributed
to "treatment". Internal validity is examined qualitatively by
scrutinizing the research design for 'sources of invalidity'.
2. External validity: Extent to which results can be
generalized. External validity is examined qualitatively by
scrutinizing the sampling scheme employed.
Factors that can affect internal and external validity
Your research design and selection of a sample are the keys to
limiting the problems of internal and external validity.
Sources of Invalidity
Rosenthal effect: Self fulfilling prophecy - you get what you
expect. Best to do a double-blind study when this is a potential
source of invalidity.
Halo effect: General effect of a good or bad feeling you have
about a person. In observational designs this may be a particular
problem. Best to use a check-list and verify the reliability of the
instrument and those collecting data.
Demand characteristics: Allowing subjects to know what the goals
are. Deception (of an ethical nature) may be needed to avoid this
source of invalidity.
Volunteer effect: Volunteers may be fundamentally different from
the overall population you are trying to generalize to.
Instrumentation effect: Changes in instruments can be mistaken
for changes in subjects.
Pre-testing effect: Subjects can be changed or learning can take
place during a pre-test which could affect results.
Time: Over a length of time, maturation may have more of an
impact than the independent variable. Also, major events can affect
subjects' behaviors and/or opinions.
Hawthorne effect: When the giving of attention rather than the
independent variable is the cause of observed
differences/relationships.
Validity of Dependent Variable
The instrument used to quantify the dependent variable(s) should
be examined with respect to validity. If the instrument is a well
known one with established validity it may be enough to cite a
reference where validity was examined and show that the same
protocol has been followed in your study on similar subjects. If
the measures come from an instrument devised by you, work must be
done to show at least logical/content validity and preferably
appropriate estimates of criterion related validity.
Data (from tests, instruments, observation, etc.) are good when
they are valid - relevant and clean (reflecting what they are
supposed to) - and reliable (producing accurate measures).
Depending on the type and purpose of a data collection, validity
can be examined from one or more of several perspectives.
Purpose       Cognitive                  Motor Skills
Evaluation    Content Validity,          Logical Validity,
              Concurrent Validity        Concurrent Validity
Prediction    Predictive Validity        Predictive Validity
Research      Content Validity,          Logical Validity,
              Concurrent Validity,       Concurrent Validity,
              Predictive Validity,       Predictive Validity,
              Construct Validity         Construct Validity
When measures are found to be valid for one purpose they will
not necessarily be valid for another purpose. Validity also may not
be generalizable across groups with varying characteristics.
Content/logical validity (assessed qualitatively)
1. Clearly define what you want to measure.
2. State all procedures you will use to gather measures.
3. Have an "expert" assess whether or not you are measuring what
you think you are.
Content validity (assessed quantitatively) Ex: survey
research
1. Pilot test the survey
2. Conduct a factor analysis of survey results
3. Revise based on analysis
4. Administer survey and conduct another factor analysis
Criterion-related validity (predictive and concurrent)
Compare measures from your 'instrument' with measures from a
criterion (expert, another test, etc.)
Concurrent validity (assessed quantitatively)
1. Gather x and y measures from a large group
2. Compute an appropriate correlation coefficient
3. If correlation > .80 for positively correlated variables
or < -.80 for inversely related variables your measure (x) is
said to have good concurrent validity
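
A minimal Python sketch of this check, using scipy and made-up x
(your instrument) and y (criterion) values:

    import numpy as np
    from scipy.stats import pearsonr

    x = np.array([12., 15., 11., 18., 14., 16., 13., 17.])  # your instrument
    y = np.array([45., 52., 40., 60., 50., 55., 44., 58.])  # criterion measure
    r, p = pearsonr(x, y)
    # apply the |r| > .80 rule of thumb from the steps above
    print(r, "good concurrent validity" if abs(r) > 0.80 else "not supported")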
Predictive validity (assessed quantitatively)
1. Gather measures using your instrument (x) and measures on the
variable(s) you are trying to predict (y)
2. Compute an appropriate correlation coefficient
3. If correlation > .80 for positively correlated variables
or < -.80 for inversely related variables your measure (x) is
said to have good predictive validity
4. Follow up with estimation of the SEE - a band placed around
the predicted score to quantify prediction error.
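
The SEE in step 4 is commonly estimated as SEE = s_y * sqrt(1 - r^2).
A minimal sketch, with the criterion's standard deviation and the
validity coefficient as assumed values:

    import numpy as np

    sd_y = 8.0   # standard deviation of the criterion (assumed value)
    r = 0.85     # validity (correlation) coefficient from step 2
    see = sd_y * np.sqrt(1 - r ** 2)   # standard error of estimate
    predicted = 60.0                    # a predicted score
    band = (predicted - 1.96 * see, predicted + 1.96 * see)  # ~95% band
    print(see, band)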
Construct validity (assessed quantitatively)
A construct is an intangible characteristic. When you want to
measure a construct such as anxiety, competitiveness, etc., you
have no direct means to do so. Therefore indirect methods need to
be employed. To then estimate the validity of the indirect measures
(as reflections of the construct you're interested in) you record a
pattern of correlations between the indirect measure(s) and other
similar and dissimilar measures. Your hope is that the pattern
reveals high correlations with similar measures (convergent
validity) and low correlations with different measures
(divergent/discriminant validity).
Two techniques used to quantitatively assess construct validity
- Multi-trait multi-method matrix and factor analysis.
Reliability of Research: Essentially this refers to the
replicability of the research. The reliability of the research is
assessed qualitatively by scrutinizing the design and methodology
employed in the research.
Reliability of Data: Concerned primarily with the dependent
variable. The instrument used to quantify the dependent variable
should be examined for its reliability (accuracy of measures
reflected in consistency).
Reliability of the dependent variable can be assessed
quantitatively using:
Coefficient alpha
Intraclass R
Should then follow up with the SEM: a band placed around the
observed score to quantify measurement error.
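
The SEM is conventionally estimated as SEM = s * sqrt(1 - reliability),
where the reliability estimate might be coefficient alpha or the
intraclass R. A minimal sketch with assumed values:

    import numpy as np

    sd = 6.0            # standard deviation of observed scores (assumed)
    reliability = 0.90  # e.g., coefficient alpha or intraclass R
    sem = sd * np.sqrt(1 - reliability)   # standard error of measurement
    observed = 72.0
    band = (observed - 1.96 * sem, observed + 1.96 * sem)  # ~95% band
    print(sem, band)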
The primary concern here is the accuracy of measures of the
dependent variable (in a correlational study both the independent
and dependent variable should be examined). Reducing sources of
measurement error is the key to enhancing the reliability of the
data.
Reliability is typically assessed in one of two ways:
1. Internal consistency - Precision and consistency of test
scores on one administration of a test.
2. Stability - Precision and consistency of test scores over
time. (test-retest)
To estimate reliability you need 2 or more scores per
person.
If motor skills/physiological measures collected at one time
only, the most common way of getting 2 scores per person is to
split the measures in half - usually by odd/even or first
half/second half by time or trials.
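
For example, a split-half estimate can be stepped up to full-test
length with the Spearman-Brown formula, r_full = 2r / (1 + r). A
minimal Python sketch with hypothetical trial data:

    import numpy as np
    from scipy.stats import pearsonr

    # hypothetical data: 5 subjects x 4 trials
    trials = np.array([[10, 12, 11, 13],
                       [8, 9, 9, 10],
                       [14, 15, 13, 16],
                       [9, 10, 11, 10],
                       [12, 13, 12, 14]], dtype=float)
    odd = trials[:, 0::2].sum(axis=1)    # trials 1 and 3
    even = trials[:, 1::2].sum(axis=1)   # trials 2 and 4
    r_half, _ = pearsonr(odd, even)
    r_full = 2 * r_half / (1 + r_half)   # Spearman-Brown step-up
    print(r_half, r_full)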
For survey research with multiple factors, reliability is
typically assessed within factors by examining consistency of
response across items within a factor. So, for a survey with 3
factors, you will compute 3 reliability coefficients.
If every subject can be measured twice on the dependent variable
then you readily have data from which reliability can be
examined.
Once you have 2 scores per person the question is how consistent
overall were the scores in order to infer how accurate scores
were.
Sources of measurement error
1. Random fluctuations, or a person's inability to score the
same twice or perform consistently throughout one
administration.
2. Measuring device - test
3. Researcher
4. Temporary effects - warm-up, practice
5. Testing length - time/trials
As a researcher it is important to identify and eliminate as
many sources of error as possible in order to enhance
reliability.
Relationship between reliability and validity
It is possible to have reliable measures that are invalid.
Measures that are valid will typically also be reliable. However,
reliability does not ensure validity.
What statistic to use
In the past, reliability has been estimated using the Pearson
correlation coefficient. This is not appropriate since
1) the PPMC is meant to show the relationship between two
different variables - not two measures of the same variable,
and
2) the PPMC is not sensitive to fluctuations in test scores.
The PPMC is an interclass coefficient; what is needed is an
intraclass coefficient. The most appropriate statistic is the
intraclass R calculated from values in an analysis of variance
table though coefficient alpha is equally acceptable.
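
One common single-measure form of the intraclass R can be computed
from the between-subjects and within-subjects mean squares of a
one-way ANOVA. The sketch below implements that one version with
hypothetical data; several other ICC forms exist, so treat it as
illustrative:

    import numpy as np

    def intraclass_r(data):
        # data: subjects x trials score matrix
        data = np.asarray(data, dtype=float)
        n, k = data.shape
        grand = data.mean()
        subj_means = data.mean(axis=1)
        ms_between = k * ((subj_means - grand) ** 2).sum() / (n - 1)
        ms_within = ((data - subj_means[:, None]) ** 2).sum() / (n * (k - 1))
        # one-way, single-measure intraclass correlation
        return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

    scores = [[10, 12, 11], [8, 9, 9], [14, 15, 13], [9, 10, 11]]
    print(intraclass_r(scores))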
Objectivity
Whenever measures have a strong subjective component to them it
is essential to examine objectivity. Subjectivity itself is a
source of measurement error and so affects reliability and
validity. Therefore, objectivity is a matter of determining the
accuracy of measures by examining consistency across multiple
observations (multiple judges on one occasion or repeated measures
over time from one evaluator) that typically involve the use of
rating scales.
Objectivity (using coefficient alpha/Intraclass R)
This way of examining objectivity requires that you have either
multiple evaluators assessing performance/knowledge on one
occasion, or one evaluator assessing the same performance
(videotaped)/knowledge at least twice. If the measures (typically
from a rating scale) are objective they will be consistent across
the two or more measures per subject.
Experimental Research - Designs, Power, Type I error, Type II error
When interested in differences or change over time for one
group or between groups, a number of designs are applicable. The
most frequently used designs can be collapsed into two broad types:
true experimental and quasi-experimental.
True experimental designs: these designs all have in common the
fact that the groups are randomly formed. The advantage associated
with this feature is that it permits the assumption to be made that
the groups were equivalent at the beginning of the research, which
provides control over sources of invalidity based on
non-equivalency of groups. The control is of course not inherent in
the design. The researcher must still work with the groups in such
a way that nothing happens to one group (other than the treatment)
that does not happen to the other, that scores on the dependent
measure do not vary as a result of instrumentation problems, and
that the loss of subjects is not different between the groups.
Randomized groups design:
This design requires the formation of two groups. One group will
receive the experimental treatment; the other will not. The group
not receiving the treatment is commonly referred to as the control
group.
This design allows the researcher to test for significant
differences between the control and experimental group after the
experimental group has received the treatment. An independent
t-test or one-way analysis of variance (ANOVA) may be used to
statistically test the null hypothesis that means are equal.
In this design there is one independent variable and one
dependent variable. In the situation depicted above there are two
levels of one independent variable. The independent variable is
group or treatment condition (two levels - experimental/group 1
& control/group 2). The dependent variable is whatever is under
study - e.g., cholesterol level.
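
A minimal Python sketch of the independent t-test for this design,
with hypothetical cholesterol values:

    import numpy as np
    from scipy.stats import ttest_ind

    control = np.array([210., 225., 198., 240., 215.])    # no treatment
    treatment = np.array([190., 205., 185., 200., 195.])  # treated group
    t, p = ttest_ind(treatment, control)  # H0: the two means are equal
    print(t, p)   # reject H0 if p < alpha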
Extension of randomized groups designs.
One extension requires the formation of three groups. Two groups
will receive varying levels or conditions of the experimental
treatment; the third will not be treated and will again be referred
to as the control group.
This design allows the researcher to test for significant
differences across the two experimental groups and the control
group after the experimental groups have received the treatment. A
one-way ANOVA would be used to statistically test the null
hypothesis H0: μ1 = μ2 = μ3.
In this expanded design there is still one independent variable
(now with 3 levels) and one dependent variable. The independent
variable is still groups or treatment condition and the dependent
variable is again the variable under study. In the text, the
example is of training at two levels (70% of VO2 max & 40% of
VO2 max) with the
control group not training. The dependent variable is
cardiorespiratory fitness. So, at the end of training, measures of
cardiorespiratory fitness are collected on each subject and the
means for each group compared using a one-way ANOVA (if assumptions
not met, a Kruskal-Wallis test would be used).
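
A sketch of that analysis in Python, with hypothetical fitness
scores for the three groups; the Kruskal-Wallis call is the fallback
mentioned above:

    import numpy as np
    from scipy.stats import f_oneway, kruskal

    control = np.array([38., 40., 37., 41.])   # no training
    train40 = np.array([42., 44., 41., 45.])   # 40% VO2 max
    train70 = np.array([47., 49., 46., 50.])   # 70% VO2 max
    print(f_oneway(control, train40, train70))   # H0: mu1 = mu2 = mu3
    print(kruskal(control, train40, train70))    # if ANOVA assumptions not met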
Points to consider
1. For each subject in the study there can be only one score. If
many measures are taken for each subject, these measures must be
combined in some fashion (eg averaged) so that there is only one
score for each subject.
2. Although it is not necessary, it is usually best to have an
equal number of subjects in each of the experimental groups.
3. The number of groups compared depends on the hypothesis being
examined, however, it is rare that more than four or five groups
are compared in one study.
Example
Assume that the researcher is interested in determining the
effect of shock on the time required to solve a set of difficult
problems. Subjects are randomly assigned to four experimental
conditions. Subjects in Group 1 receive no shock; Group 2, very low
intensity shocks; Group 3, medium shocks; and Group 4, high
intensity shocks. The total time required to solve all the problems
is the measure recorded for each subject. The independent variable
then is shock (which has 4 levels) and the dependent variable is
time.
Factorial design:
Essentially an extension of the randomized-groups design, this
design has more than one independent variable and just one
dependent variable. This design requires the formation of a group
for every combination (of every level) of the two or more
independent variables.
This design allows the researcher to test for significant
differences as a function of each independent variable separately
(main effects) and in combination (interaction). A two-way ANOVA
would be used to statistically test the null hypothesis that means
are equivalent for the first independent variable, that means are
equivalent for the second independent variable, and that the
interaction is not significant.
Points to consider
1. For each subject there can be only one score. If many
measures are taken for each subject, these must be combined (eg
averaged) so that there is only one score for each subject.
2. It is best to have an equal number of subjects in each
group.
3. The number of groups compared depends on the hypothesis being
examined, however, it is rare that more than four or five groups
are included in either of the two factors.
Example
Assume that a researcher is interested in determining the
effects of high vs. low-intensity shock on the memorization of a
hard vs. an easy list of nonsense syllables. Subjects would be
randomly assigned to four experimental conditions: (1) low shock
& easy list, (2) high shock & easy list, (3) low shock
& hard list, and (4) high shock & hard list. The total
number of errors made by each subject is the measure recorded. The
dependent variable then is the number of errors and the independent
variables are shock (with two levels) and list difficulty (with two
levels).
This would require the formation of 4 randomly assigned
groups:
                    List difficulty
                    Easy        Difficult
Shock Type   Low    A1 B1       A1 B2
             High   A2 B1       A2 B2
A two-way ANOVA would be used in this situation to test for a
difference in
number of errors made depending on shock type (regardless of
list difficulty);
the number of errors made depending on list difficulty
(regardless of shock type);
and the number of errors made due to the combined effect of
shock type and list difficulty.
Since 3 F statistics are examined, you would divide your alpha
by 3 before comparing the 3 p values to your alpha. This is done to
keep the overall study's alpha intact.
The analysis would be referred to as a 2X2 ANOVA, communicating
that there are two levels of the first independent variable and two
levels of the second independent variable. The language used to talk
about the results would be the main effect for shock type, the main
effect for list difficulty, and the interaction.
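
A minimal two-way (2X2) ANOVA sketch using statsmodels, with made-up
error counts for the shock-by-difficulty example:

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # hypothetical data: two subjects per cell of the 2x2 design
    df = pd.DataFrame({
        "errors":     [12, 9, 15, 11, 14, 10, 22, 18],
        "shock":      ["low", "low", "high", "high",
                       "low", "low", "high", "high"],
        "difficulty": ["easy", "easy", "easy", "easy",
                       "hard", "hard", "hard", "hard"],
    })
    model = ols("errors ~ C(shock) * C(difficulty)", data=df).fit()
    # table contains both main effect F tests and the interaction F test
    print(sm.stats.anova_lm(model, typ=2))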
Variation of factorial design: When one or more of the
independent variables is a categorical variable, such as gender,
where individuals cannot be randomly assigned to the levels, you
have a factorial design that no longer qualifies completely as a
true experimental design but is used quite frequently and is
quite appropriate when the topic under study calls for the
examination of characteristics to which people cannot be assigned.
Using gender and shock type as the independent variables you
would still need to form 4 groups, but, you could not randomly
assign individuals to the gender categories though you would still
randomly assign individuals to the shock category.
                    Gender
                    Male        Female
Shock Type   Low    A1 B1       A1 B2
             High   A2 B1       A2 B2
Pretest-Posttest Randomized-groups design
In its simplest form, this design requires the formation of two
groups. One group will receive the experimental treatment; the
other will not. The group not receiving the treatment is still referred
to as the control group.
This design allows the researcher to test for significant
differences or amount of change produced by the treatment - does
the experimental group change more than the control group. Though
the testing effect cannot be evaluated it is assumed to be
controlled since both the control and treatment groups are
pretested. A factorial repeated measures ANOVA is the recommended
analytical procedure. With this approach you have two independent
variables (or factors) and one dependent variable. The first factor
(not repeated measures) is treatment condition and the second is
test (repeated measures) - pre/post. The dependent variable is what
is being measured at the pre and post test.
Consider a dietary seminar intended to change eating habits
particularly with respect to consumption of fat.
Group 1:   Pre-test   Seminar   Post-test
Group 2:   Pre-test             Post-test
In this example there are two independent variables and one
dependent variable. In the situation depicted above there are two
levels of each independent variable. The first independent variable
is group or treatment condition (two levels - experimental/group 1
& control/group 2). The second independent variable is test
(two levels - pretest & posttest). The dependent variable is
grams of fat consumed.
Repeated Measures Design
The repeated measures design is a variation of the completely
randomized design though not considered a true experimental design.
Instead of using different groups of subjects, only one group of
subjects is formed and all subjects are measured/tested multiple
times. There is no control group.
This design allows the researcher to test for significant
differences produced by the treatment - are the means across
repeated measures different. A repeated measures ANOVA is the
recommended analytical procedure. With this approach you have one
independent variable and one dependent variable.
Assume that a researcher wants to know whether or not mean
scores on an intelligence test change from year to year. To answer
this, the researcher chooses subjects, all twelve years old, and
records an IQ score for each subject at age 12, 13, 14,
and 15. The dependent variable in this case is IQ score and the
independent variable is age.
As another example, assume that a researcher wants to know
whether or not mean scores on a measure of exercise satisfaction
change depending on the environment runners exercise in. To answer
this, the researcher obtains measures of exercise satisfaction from
subjects after they run in an urban setting, the countryside, an
indoor track, and an outdoor track. The dependent variable is
exercise satisfaction and the independent variable is exercise
environment.
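
A sketch of a repeated measures ANOVA for the running-environment
example, using statsmodels' AnovaRM with hypothetical satisfaction
scores (long format, one row per subject per condition):

    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    df = pd.DataFrame({
        "subject":      [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
        "environment":  ["urban", "country", "indoor", "outdoor"] * 3,
        "satisfaction": [5, 8, 4, 7, 6, 9, 5, 8, 4, 7, 3, 6],
    })
    res = AnovaRM(df, depvar="satisfaction", subject="subject",
                  within=["environment"]).fit()
    print(res)   # F test of H0: means equal across environments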
The major advantage of this design over the completely
randomized design is that fewer subjects are required. In addition,
very often increased statistical power is gained because the random
variability of a single subject from one measure to the next is
usually much less than the variability introduced by measuring and
comparing different subjects. The major disadvantage is that there
may be carry-over effects from one treatment/testing to the next.
In addition, subjects might become progressively more proficient at
performing the criterion task and show an improvement in
performance more attributable to learning than the treatment.
Points to consider
1. Each subject is tested under, and a score is entered for,
each treatment condition.
2. The number of repeated measures depends on the research
question, however, it is rare to have more than four or five
treatment conditions.
Solomon four-group design
This design requires the formation of four groups. It allows the
researcher to determine whether or not the pretest results in
increased sensitivity of the subjects to the treatment. Two of the
groups are pre & post tested while the other two are only
tested once. One of the groups receiving both the pre & post
test along with one of the groups tested only once are exposed to
the treatment. This arrangement permits evaluation of reactive or
interactive effects of testing (threats to external validity). As a
reminder, this is the problem of the pretest making subjects more
sensitive to the treatment and thus reducing the ability to
generalize the findings to an un-pretested population.
It would be ideal to examine:
Replication of the treatment effect
Assessment of the amount of change due to the treatment
Evaluation of the testing effect
An assessment of whether or not the pretest interacts with the
treatment
Unfortunately, there is no way to tackle all of this
statistically. The best that can be done is a 2X2 ANOVA. One
independent variable then is
treatment condition (no treatment & treatment) and the second
independent variable is testing (pre tested & not pre
tested).
                 No Treatment    Treatment
Pre tested
Not Pretested
From the ANOVA produced with this analysis you can assess the
effects of pretesting (main effect for testing), the effect of the
treatment (main effect for treatment condition), and the external
validity threat of the pretest interacting with the treatment
(interaction effect). Other concerns can and should be examined
descriptively.
Power, Type I error, Type II error
These topics must be considered in the context of hypothesis
testing.
Hypothesis testing involves examination of a statistically
expressed hypothesis.
The statistical expression is referred to as the null hypothesis
and is written as H0. It is called null because the expression,
when completed, implies no difference or relationship, depending on
the problem being examined.
When you compare observed data with expected results from
normative values this is called testing the null hypothesis.
You can think of hypothesis testing as trying to see if your
results are unusual enough so that they would not even be expected
by chance.
By chance, your sample could lie at the extremes of the
distribution and so you could draw the wrong conclusion. These
erroneous decisions are generally referred to as type I and type II
errors.
Type I error = Incorrectly deciding to reject the null
hypothesis; that is, rejecting a true null hypothesis.
Type II error = Incorrectly deciding not to reject the null
hypothesis; that is, failing to reject a false null hypothesis.
Alpha = The level of risk an experimenter is willing to take of
rejecting a true null hypothesis. Often called the level of
significance, this value is used in establishing a critical value
around which decisions (reject or not reject null) are made. It is
also common to define alpha as the probability of incorrectly
rejecting a true null hypothesis or the probability of making a
type I error.
Beta = The level of risk (not under direct control of an
experimenter) of failing to reject a false null hypothesis. It is
also common to define beta as the probability of making a type II
error.
Power = 1 - Beta. The probability of correctly rejecting a false
null hypothesis.
*In practice you will never know whether or not you've made a
poor decision (made a type I or type II error) but, you can (a) set
the probability that you will make a type I error when you select
your alpha, and (b) determine beta (through estimating power) to
estimate the probability that you made a type II error. Note: since
sample size is directly related to power (and so tied to beta),
studies will fail to find statistically significant results, even
when real effects exist, because of a small sample size.
- If you decrease alpha (more stringent) power will decrease so
beta will increase.
- As you increase sample size, power increases so beta
decreases.
- As you enhance measurement precision both alpha and beta
decrease so power increases.
- As effect size increases both alpha and beta decrease so power
increases.
Power
Power is the probability of correctly rejecting a false null
hypothesis.
Ideally, power should be considered when planning a study, not
after it is over. Knowing what you would like power to be you can
determine (using software or power charts) what your sample size
should be.
For studies examining differences between means, power charts
are used to determine the sample size needed to achieve a desired
power.
For studies examining relationships, software is available. By
hand, the computations are very complex.
For studies estimating proportions or means, the computations
can be done by hand.
If power is not considered at the start of a study it should be
estimated at the end, particularly when non-significant results
arise.
When non-significant results are obtained one of the following
has occurred:
Inadequate theory
Evaluation of preliminary evidence incorrect
Design, analysis, or sample size choices faulty
Sample size is closely tied to power. True
differences/relationships go unnoticed without enough subjects. On
the other hand, trivial differences/relationships can be
statistically significant with large sample sizes.
Another factor affecting power is measurement precision. As
precision increases, power increases.
The information needed to determine the sample size needed to
achieve a particular level of power for differences includes:
Alpha
# of groups (k)
Minimum effect size you want to detect. This would come from a
pilot study, literature, or your own opinion on the size of the
effect you would like to be able to detect (.80, .50, .30).
Power desired
The information needed to determine the sample size needed to
achieve a particular level of power for relationships includes:
Alpha
Standard deviation for dependent and independent variables. This
would come from a pilot study or the literature.
Minimum effect you want to detect. Here the effect size is the
correlation coefficient itself. Again this would come from a pilot
study, literature, or your own opinion on the size of the effect
you would like to be able to detect (.80, .50, .30).
Power desired.
Sample size calculations for a t-test
To determine how many subjects to use per group to achieve
desired power:
Select alpha and effect size.
Determine df1. Since groups = 2, df1 = 2-1 = 1.
Get power chart for df1 = 1.
Find phi in power chart where desired power and the infinity
line cross.
Using values determined in steps 1-4, solve for n using sample
size equation:
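
The sample size equation itself did not survive in these notes. With
the Pearson-Hartley power charts, phi is commonly related to the
standardized effect size f by phi = f * sqrt(n), so n = (phi/f)^2
per group; take that reconstruction as an assumption. In practice
the same answer can be obtained from software; a sketch using
statsmodels, with effect size, alpha, and power as assumed values:

    from statsmodels.stats.power import TTestIndPower

    # per-group n for a two-group comparison
    n = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
    print(round(n))   # subjects needed in each group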
Expanding to the ANOVA situation:
Sample size calculations for an ANOVA
To determine how many subjects to use per group to achieve
desired power:
Select alpha and effect size.
Determine df1 = K -1.
Get power chart for df1 = K-1.
Find phi in power chart where desired power and the infinity
line cross.
Using values determined in steps 1-4, solve for n using
equation:
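
The same idea extends to k groups; a sketch with statsmodels' ANOVA
power routine (Cohen's f, alpha, and power again assumed values):

    from statsmodels.stats.power import FTestAnovaPower

    n_total = FTestAnovaPower().solve_power(effect_size=0.25, alpha=0.05,
                                            power=0.80, k_groups=3)
    print(round(n_total))   # total N across the 3 groups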
Statistics
Descriptive Statistics
Central tendency
A technique for conveying group information regarding the middle
or center of a distribution of scores. Indicates where scores tend
to be concentrated.
Measures of central tendency
1. Mode: Score most frequently observed
2. Median: Score that divides the distribution of scores into
two equal halves
3. Mean: Arithmetic average of a distribution of scores.
To decide which is appropriate to use consider:
Distribution of scores. When symmetrical the mean=median=mode so
all are equally appropriate. When skewed, the median is a more
accurate representation of central tendency.
Degree of accuracy needed. The mean considers all scores and is
more stable than other measures. When skewed and information on the
center is needed an exact median should be calculated.
Level of measurement. You should not calculate a mean or median
on categorical or ordinal data. Frequency distribution tables
should be used to convey distributional information on categorical
and ordinal data. The mean and median can be used when variables
are at least interval scaled.
Relationships among measures of central tendency:
When distribution symmetrical, mean = median = mode
When distribution positively skewed, mode < median <
mean
When distribution negatively skewed, mode > median >
mean
Variability
Measures of central tendency are often not enough by themselves
to communicate useful information about distributions of scores. A
measure of central tendency is valuable to report anytime you want
to summarize data, but, the spread of scores around the center
should accompany it.
Measures of variability
Inclusive range: Crude measure. Not stable since it uses only
two scores.
Signed deviation: An individual statistic. Represents the signed
distance of a raw score from its mean. Not useful as summary
statistic. Only conveys information on an individual.
Standard deviation: The average unsigned distance of all scores
from the mean.
For categorical data, frequency distribution tables or
crosstabulation tables are useful in summarizing information.
Example:
                  Male    Female
Smokes            42%     30%
Does Not Smoke    58%     70%
Correlation
The correlation statistic allows you to examine the strength of
the relationship between two variables. This statistic helps you
answer questions like 'Do gymnasts with considerable amounts of
fast twitch muscle fibers tumble better?' The underlying question
can be phrased: is there a relationship between concentration of
FTMF and tumbling ability?
There are several types of correlation coefficients to choose
from. The choice is based on the nature of the variables being
correlated.
Pearson Product Moment Correlation (PPMC)
Use when both variables are continuous.
Phi
Use when both variables are true dichotomies.
Point Biserial
Use when one variable is continuous and the other a true
dichotomy.
A PPMC coefficient describes the strength and direction of the
linear relationship between two variables. When two variables are
not linearly related, the PPMC is likely to underestimate the true
strength of the relationship. A graph of the x and y values can
show whether or not the relationship is linear.
When the scores on two variables get larger/smaller together,
the direction of the relationship is positive. When scores on one
get larger as scores on the other variable get smaller, the
direction of the relationship is negative due to the inverse
relationship of the two variables. When there is no pattern, there
is no relationship.
A PPMC coefficient is a signed number between -1 and 1 where 0
represents no relationship. Presence of a relationship should never
be interpreted as demonstrating cause and effect. Remember the
negative sign simply conveys direction. The farther away from zero
in either direction, the stronger the relationship.
The PPMC is affected by the variability of the scores collected.
Other things being equal, the more homogeneous the group (on the
variables being measured) the lower the correlation coefficient.
Since small groups tend to be more homogeneous, Pearson is most
meaningful and most stable when group size is large (>50).
A point biserial correlation coefficient tells you the strength
of the relationship between one continuous and one dichotomous
variable. The sign carries little meaning. It only indicates which
group tended to have higher scores. The point biserial coefficient
is a signed number between -1 and 1 where again zero represents no
relationship.
A phi correlation coefficient tells you the strength of the
relationship between two dichotomous variables. The sign carries
little meaning. It only indicates which diagonal had the greater
concentration of scores. The phi coefficient is a signed number
between -1 and 1 where again zero represents no relationship.
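
A sketch of the three choices in Python with made-up data; note that
phi for two dichotomies is numerically the Pearson coefficient
computed on 0/1 codes:

    import numpy as np
    from scipy.stats import pearsonr, pointbiserialr

    ftmf = np.array([55., 62., 48., 70., 66., 59.])    # % fast twitch fibers
    tumble = np.array([7.1, 8.0, 6.4, 9.2, 8.5, 7.6])  # tumbling scores
    print(pearsonr(ftmf, tumble))           # both continuous -> PPMC

    gender = np.array([0, 1, 0, 1, 1, 0])   # true dichotomy
    print(pointbiserialr(gender, tumble))   # dichotomy vs. continuous

    smokes = np.array([0, 1, 0, 1, 1, 0])   # second dichotomy
    print(pearsonr(gender, smokes))         # phi: Pearson on 0/1 codes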
Inferential Statistics
Once a statistical statement of the null hypothesis is made
decisions must be made regarding what statistic to use to test the
null hypothesis. Two broad categories of statistics are available:
Parametric and non-parametric. All other things being equal more
power exists with parametric statistics. However, even when a
researcher wants to use a parametric statistic it is not always
possible. Prior to using any statistic you must first check to see
whether the assumptions associated with the statistic are met.
Parametric & Non-parametric Statistical Tests
A parametric statistical test specifies certain conditions about
the distribution of responses in the population from which the
research sample was drawn. Since these conditions are not
ordinarily able to be tested, they are assumed to hold. The
meaningfulness of the results of a parametric test depends on
whether or not these assumptions have been met. These assumptions
typically include (a) normality, (b) homogeneity of variance, (c)
sample randomly drawn, and (d) at least interval scaled data.
A non-parametric statistical test is based on a model that
specifies only very general conditions and none regarding the
specific form of the distribution from which the sample was drawn
or the level of measurement required. Additionally, non-parametric
procedures often test different hypotheses than parametric
procedures. Assumptions associated with most non-parametric tests
include (a) observations are independent, (b) sample randomly
drawn, and (c) variable(s) have underlying continuity. These are less
stringent than parametric assumptions.
In choosing a statistical test for use in testing a hypothesis
you should consider
(a) The applicability of the test, and
(b) the power and efficiency of the test.
Applicability refers to the type of analysis needed, level of
measurement, and whether or not assumptions have been met. Power
refers to the probability of correctly rejecting a false null
hypothesis. Efficiency refers to the simplicity of the analysis as
well as design considerations.
All other things being equal, parametric tests are more powerful
than non-parametric tests provided all the assumptions are met.
However, since power may be enhanced by increasing N and parametric
assumptions are difficult to meet, non-parametric procedures become
very important.
Advantages of Non-parametric Statistical Tests
If the sample is very small, distributional assumptions (tied to
parametric tests) are not likely to be met. Therefore, an advantage
is that no distributional assumptions must be met.
Non-parametric tests can be applied to variables at any level of
measurement.
Non-parametric tests are available for categorical data.
Non-parametric tests are available for treating samples made up
of observations from several different populations.
Interpretations are direct and often less complex than
parametric findings.
Disadvantages of Non-parametric Tests
Less powerful when parametric assumptions have been met.
They are not systematic though common themes do exist.
The tables needed to compare findings against are scattered
widely and appear in different formats.
Differences - Parametric Tests
To examine whether or not there is a statistically significant
difference in means on some dependent variable (continuous) as a
function of some independent variable (categorical) you can use the
t-test when you have just two levels of the independent variable
(ex: gender) or you can use the ANOVA procedure when you have two
or more levels of the independent variable (ex: ethnicity).
Independent t-test Statistical Procedure for testing H0: that
two means are equivalent when the two levels of the independent
variable are not related.
One-way ANOVA Statistical Procedure for testing H0: that two or
more means are equivalent when the two or more levels of the
independent variable are not related.
Assumptions of the independent t-test and ANOVA procedure:
Homogeneity of variance - is the variability of the dependent
variable in the population similar for each level of the
independent variable? You examine this assumption by comparing the
largest and smallest standard deviations for the groups in your
sample. If they are similar (the ratio of larger to smaller is
small), the assumption is considered met; if the test statistic
then exceeds the critical value (CV), reject H0.
It is important to look beyond statistical significance for
practical significance. For example, with N = 102 and alpha = .05,
an rxy of .20 is statistically significant, but we know intuitively
this is not a strong (or useful) correlation.
To assess practical significance
Calculate a coefficient of determination (rxy^2). This value
indicates the proportion of variance in the dependent variable that
can be explained by the independent variable.
Ex: If rxy = .60, rxy^2 = .36. So, 36% of the variance in the DV
can be explained by the IV. Left unexplained is 1 - rxy^2.
Note: Outliers can significantly affect rxy. All outliers should
be critically examined before leaving them in the analysis. If the
values are legitimate and your sample size is substantial leave
them in the analysis.
Partial correlation
Sometimes a correlation between two variables is due to their
dependence on a 3rd variable.
Ex: Any set of variables that increase with age (shoe size &
intelligence) - if you remove (control for) the effects of age,
correlation could change in direction and/or strength.
A partial correlation procedure allows you to hold constant a
3rd variable and look at a 'truer' correlation between x and y.
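
One way to compute a partial correlation is to correlate the
residuals left after regressing each variable on the third. A
minimal numpy sketch with invented age/shoe-size/score values:

    import numpy as np

    def partial_corr(x, y, z):
        # correlation of x and y with the linear effect of z removed
        rx = x - np.polyval(np.polyfit(z, x, 1), z)  # residuals of x on z
        ry = y - np.polyval(np.polyfit(z, y, 1), z)  # residuals of y on z
        return np.corrcoef(rx, ry)[0, 1]

    age = np.array([6., 7., 8., 9., 10., 11.])
    shoe = np.array([1.0, 1.6, 1.9, 2.6, 2.9, 3.6])   # shoe size
    score = np.array([30., 36., 40., 45., 49., 56.])  # raw test score
    print(np.corrcoef(shoe, score)[0, 1])  # inflated by shared age trend
    print(partial_corr(shoe, score, age))  # 'truer' correlation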
Regression
This is the most common approach to prediction problems when you
have one dependent variable and multiple independent variables.
Assumptions
Errors are independent and normally distributed
Homoscedasticity (variability of y's at each x similar)
Linearity (lack of fit of linear model)
Dependent variable at least interval scaled
Hypothesis testing for significant regression
H0: b = 0
Values from an analysis of variance table (which partitions the
variance into regression (explained) and residual (unexplained)
components) can be used to (a) test the lack-of-fit assumption, (b)
if assumptions are met, test for a significant regression, and (c)
examine practical significance.
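
A sketch of testing H0: b = 0 with statsmodels; the summary output
includes the ANOVA-based F test, the slope's t test, and R-squared
for practical significance (x and y values are invented):

    import numpy as np
    import statsmodels.api as sm

    x = np.array([2.0, 3.5, 4.1, 5.0, 6.2, 7.3, 8.0])  # predictor
    y = np.array([1.1, 2.0, 2.3, 3.1, 3.0, 4.2, 4.8])  # dependent variable
    model = sm.OLS(y, sm.add_constant(x)).fit()
    print(model.summary())   # F test, slope t test, and R-squared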
Data reduction using Stepwise regression
A regression procedure called stepwise regression analyzes a set
of independent variables in such a way that it finds the most
potent variable(s) with respect to their relationship to the
dependent variable. An excellent exploratory analytical
technique.
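
Classic stepwise regression adds and drops predictors using
significance tests; a related greedy approach is forward selection
scored by cross-validation, sketched here with scikit-learn on
simulated data (an approximation, not a reproduction of the
traditional stepwise algorithm):

    import numpy as np
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                  # five candidate predictors
    y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(size=100)
    sfs = SequentialFeatureSelector(LinearRegression(),
                                    n_features_to_select=2,
                                    direction="forward").fit(X, y)
    print(sfs.get_support())   # mask marking the most potent predictors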
Discriminant Function Analysis
This technique is used when the interest is in examining several
independent variables with respect to their ability to discriminate
between groups. This is very similar to multiple regression. In
fact, conceptually, the only difference is that discrimination
between groups rather than prediction of a score is the object.
Factor Analysis
Also a data reduction procedure, this technique is used when the
interest is in examining the underlying structure to several
variables or items on a test. Both exploratory and confirmatory
techniques exist.
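
An exploratory sketch with scikit-learn's FactorAnalysis, fitting
two factors to simulated item scores; confirmatory factor analysis
would require a different tool:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(1)
    latent = rng.normal(size=(200, 2))    # two underlying traits
    loadings = rng.normal(size=(2, 6))    # how 6 items load on the traits
    items = latent @ loadings + 0.5 * rng.normal(size=(200, 6))
    fa = FactorAnalysis(n_components=2).fit(items)
    print(fa.components_)   # estimated loadings of the 6 items on 2 factors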
Non-parametric analyses for relationship problems
Non-parametric statistics are needed when (a) the variables
being related are categorical or ordinal, or (b) when the
assumptions associated with parametric statistics have been
violated.
With categorical/ordinal data, descriptive analyses:
Charts
Frequency distribution tables
Cross tabulated tables
are informative, however, it is often desirable to conduct a
test of the null hypothesis that two categorical/ordinal variables
are not related.
The statistic that will test for the presence of a relationship
between two categorical (though it can also be used on ordinal data)
variables is the chi-square statistic. The null hypothesis used to
examine this is that there is no relationship. Another way to say
this is that the variables x and y are independent. In fact the chi
square statistic is commonly referred to as the chi square test of
independence.
Assumptions
The expected frequency in all cells is at least 5.
Data must be random samples from multinomial distributions.
For example, Is there a relationship between level of ability of
athletes and willingness to spend time on a task for someone else?
Assume return rate (of a survey or other information) is considered
willingness to spend time for someone else's benefit. The
information in the table below then represents return rate by
level.
Spend Time    Elite    College    Intramural
Yes           10       32         35
No            62       40         37
To examine the expected frequency assumption you need an
expected frequencies table. Each cell in the expected frequencies
table should have at least 5 cases.
To determine if this chi square is statistically significant,
you compare it to a critical value found in a chi square table. The
degrees of freedom for a chi square statistic are:
df = (R-1)(C-1): Where R = # of rows, and C = # of columns in
the two-way table.
The degrees of freedom for this problem are 2 so the critical
value for an alpha of .01 is 9.21. Therefore, if the chi square
statistic for this problem is > 9.21 you can reject the null
hypothesis, which suggests that there is a statistically significant
relationship between level of ability and willingness to spend time
on a task for someone else.
This does not necessarily mean that the relationship is of any
practical significance. At this point all you know is that the
variables in question are not independent. You should not stop here
and should not claim you have something special to report.
Since the chi square statistic is sensitive to sample size, just
about any two variables can be found to be related statistically
given a large enough sample size. So, to examine practical
significance you assess the strength of the association between
variables using phi or Cramer's V.
Use Phi for 2X2 tables: Phi = sqrt(chi-square / N).
Use Cramer's V for larger tables: V = sqrt(chi-square / (N(k - 1))),
where k is the smaller of the number of rows and columns. (Cramer's
V and Phi are equivalent for 2X2 tables.)
With chi square based measures you cannot say much beyond the
strength of the relationship. No predictive interpretation is
possible.
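
A sketch of the whole chain for the return-rate example above -
chi-square test, expected-frequency check, and Cramer's V - using
scipy:

    import numpy as np
    from scipy.stats import chi2_contingency

    # rows: Yes/No; columns: Elite/College/Intramural (table above)
    table = np.array([[10, 32, 35],
                      [62, 40, 37]])
    chi2, p, dof, expected = chi2_contingency(table)
    print(expected)      # every cell should be at least 5
    print(chi2, p, dof)  # compare p to alpha (or chi2 to critical value)
    n = table.sum()
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))  # Cramer's V
    print(v)             # strength of the association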
Meta Analysis
Not a procedure for the analytically faint of heart. Much
remains unsettled and is the source of considerable disagreement
among many researchers. At the very least, careful thought and
research need to be undertaken before beginning a meta-analysis
project regarding the questions raised in the Thomas & Nelson text:
What should be used as the standard deviation when calculating
an ES?
Because sample ESs are biased estimators of the population of
ESs, how can this bias be corrected?
Should ESs be weighted for their sample size?
Are all ESs in a sample from the same population of ESs? This is
the "apples and oranges" issue: Is the sample of ESs
homogeneous?
What are appropriate statistical tests for analyzing ESs?
If a sample of ESs includes outliers, how can they be
identified?
VALIDITY OF DATA COLLECTION PROCESSES

Validity of data collection addresses the question of whether a data collection process is really measuring what it purports to be measuring. A data collection process is valid to the extent that the results are actually a measurement of the characteristic the process was designed to measure, free from the influence of extraneous factors. Validity is the most important characteristic of a data collection process.

A data collection process is invalid to the extent that the results have been influenced by irrelevant characteristics rather than by the factors the process was intended to measure. For example, if a teacher gives a reading test and the test does not really measure reading performance, the test is useless. There is no logical way that the invalid test can help the teacher measure the outcome in which she is interested. If she gives a self-concept test that is so difficult to read that the third graders taking it are unable to interpret the tasks correctly, the test cannot validly measure self-concept among those students. It is invalid for that purpose, because it is so heavily influenced by reading skills that self-concept is not likely to come to the surface. This test cannot help the teachers make decisions about the outcome variable "self-concept." For example, if they ran a self-concept program for their students and their students' "self-concept" scores improved, how could they know whether it was really self-concept and not just reading ability that improved? In designing and carrying out any sort of data collection process, therefore, validity is of paramount importance.

As we said with regard to reliability, it is important to keep in mind that it is the validity of the data collection process - not of the data collection instrument - that must be demonstrated. What we really want to do is strengthen the validity of the conclusions we draw based on the data collection process; we don't want to draw conclusions based on the measurement of the wrong outcomes. It is technically incorrect to refer to the validity of a test. A test, a checklist, an interview schedule, or any other data collection device that is valid in one setting or for one purpose may be invalid in another setting or for another purpose. Therefore, this chapter always refers to the validity of data collection processes. It is important to remember this distinction.
SOURCES OF INVALIDITYWhat makes a data collection process valid
or invalid? A data collection process is valid to the extent that
it meets the triple criteria of (1) employing a logically
appropriate operational definition, (2) matching the items to the
operational definition, and (3) possessing a reasonable degree of
reliability. Invalidity enters the picture when the data collection
strategy fails seriously with regard to one of these criteria or
fails to lesser degrees in a combination of these criteria.It may
be instructive to look at some examples of invalid data collection
processes. Assume that a researcher wants to develop an
intelligence test. He operationally defines intelligence as
follows: "A person is intelligent to the extent that he/she agrees
with me." He then makes up a list of 100 of his opinions and has
people indicate whether they agree or disagree with each item on
this list. A person agreeing with 95 of the items would be defined
as being more intelligent than one who agreed with 90, and so on.
This is an invalid measure of intelligence, because the operational
definition has nothing to do with intelligence as any reputable
theorist has ever defined it.

Not all invalid data collection
processes are so blatantly invalid. Indeed, one of the most heated
arguments in psychology today is over the question of what
intelligence tests actually measure. This whole question is one of
validity. The advocates of many IQ tests argue that intelligence
can be defined as general problem-solving ability. They
operationally define intelligence as something like, "People are
intelligent to the extent that they can solve new problems
presented to them." They test for intelligence by giving a child a
series of problems and counting how many she can solve. A child who
can solve a large number of problems is considered to be more
intelligent than one who can solve only a few. The opponents of
such tests argue that the tests are invalid. They say that general
problem-solving ability is not the only quality - or even the most
important one - required to do well on such tests. The tests, they
argue, really measure how well a person has adapted to a specific
middle-class culture. Success on such tests, therefore, is really
an operational definition of "ability to adapt to middle-class
culture." Since the test is designed to measure intelligence but
really measures a different ability, it is invalid. The argument
over the validity of IQ tests is far from settled. Important
theorists continue to line up on both sides, and others continue to
suggest compromises - such as recommending new tests or redefining
the concept of intelligence.

Consider another hypothetical
intelligence test. Assume that we ask the child one question
directly related to a valid operational definition. This is an
excessively short test, and thus it is likely to provide an
unreliable estimate of intelligence. Our result is also likely to
be invalid, because our conclusion that a child is a genius for
answering 100% of the questions correctly is about as likely to be
a result of chance factors (unreliability) as it is to be a result
of real ability related to the concept of intelligence.

The factors
that determine the validity of a data collection process are
diagrammed in Figure 5.1. The first test cited in this section was
invalid because the operational definition was inappropriate. In
the second case, the operational definition was logically
appropriate, but it was not clear whether the tasks the child
performed were really related to this operational definition. The
final IQ test was considerably limited in its validity because the
test was unreliable.
To the extent that there is a complete breakdown at any of these
stages, the data collection process is invalid. Likewise, if there
is a cumulative breakdown at several stages, the data collection
process can be invalid.

[Figure 5.1: Factors Influencing Test Validity]

ESTABLISHING VALIDITY

From the
preceding discussion, it can be seen that there are three steps to
establishing the validity of a data collection process designed to
measure an outcome variable:
1. Demonstrate that the operational definition upon which the
data collection process is based is actually a logically
appropriate operational definition of the outcome variable under
consideration. The strategy for demonstrating logical
appropriateness was discussed in detail in chapter 4, where we
pointed out that operational definitions are not actually
synonymous with the outcome variable but rather represent the
evidence that we are willing to accept to indicate that an internal
behavior is occurring. Table 5.2 lists some cases where the
operational definitions are to varying degrees logically
inappropriate. For example, if the instructors in English 101
administer an anonymous questionnaire at the end of the semester to
evaluate their performance in the course, they might think that the
students are responding to questions about how they performed
during the course. However, it's possible that the students who are
completing the questionnaire are thinking, "If we tell them what we
really think, they'll be upset and come down hard on us when they
grade the exam. I think we should play it safe and give them good
ratings for the course." If this is what students are thinking,
then the favorable comments on the questionnaire are actually an
operational definition of "anxiety over alienating instructor"
rather than of "quality teaching."In many cases, the logical
connection is easy to establish, and hence the logical fallacies
found in Table 5.2 are often easy to avoid. For example, the
connections between the operational definitions and the outcome
variables in Table 5.3 are much more obvious than the connections
in Table 5.2. It's still possible for a person to perform behaviors
described in the operational definitions without having achieved
the outcome variable, but it is much less likely than was the case
in the situations in Table 5.2.

Logical inappropriateness is most
likely to occur when the outcome variable under consideration is a
highly internalized one. Affective outcomes present particularly
difficult problems, because the evidence is much less directly
connected to the internal outcome than is the case with behavioral,
psychomotor, and cognitive outcomes. The guidelines presented in
chapter 4 are applicable here - namely, rule out as many
alternative explanations as possible, and use more than one
operational definition.

Table 5.2 Some Examples of Logically Inappropriate Operational Definitions of Outcome Variables

Assumed Outcome Variable | Operational Definition | Conceivable Real Outcome Variable
Ability to understand reading passages | The pupil paraphrases a passage he/she has read silently | Ability to guess from context clues
Love of Shakespearean drama | The student will carry a copy of Shakespeare's plays with him to class | Eagerness to impress professor
Appreciation of English 101 | The students will indicate on a questionnaire that they liked the course | Anxiety over alienating instructor
Knowledge of driving laws | The candidate will get at least 17 out of 20 true-false questions right on the license test | Ability to take true-false tests with subtle clues present in them
Friendliness toward peers | The pupil will stand near other children on the playground | Anxiety over being beaten up if he or she stands apart
Appreciation of American heritage | Child will voluntarily attend the Fourth of July picnic given by the American Legion | Appreciation of watching fireworks explode
Table 5.3 Some Examples of Operational Definitions That Are Almost Certain to Be Appropriate for the Designated Outcome Variables

Outcome Variable | Operational Definition
Ability to add single-digit integers | The student will add single-digit integers presented to him ten at a time on a test sheet
Ability to tie one's own shoes | The student will tie her own shoes after they have been presented to her untied
Ability to bench press 150 pounds | The student will bench press 150 pounds during the test period in the gymnasium
Ability to spell correctly from memory | The student will write down from memory the correct spelling of each word given in dictation
Ability to spell correctly on essays with use of dictionary | The student will make no more than two spelling errors in a 200-word essay written during class with the aid of a dictionary
Ability to type 60 words per minute | The student will type a designated 300-word passage in five minutes or less
Ability to raise hand before talking in class | The student will raise his hand before talking in class
Ability to recall the quadratic equation | The student will write from memory the quadratic equation
Ability to apply the quadratic equation | Given the quadratic equation and ten problems that can be solved using the equation, the student will solve at least nine correctly
2. Demonstrate that the tasks the respondent has to perform to
generate a score during the data collection process match the task
suggested by the operational definition. The benefits of stating
operational definitions can be completely nullified if the tasks
that generate a score during the data collection process do not
match the tasks stated in the operational definitions. Table 5.4
provides examples of such mismatches. The first three are not
intended to be facetious. Mismatches this obvious actually do occur
on teacher-designed tests. They say they are going to measure one
thing, and then they measure something else. The other examples in
Table 5.4 are more subtle. In these cases, the teacher has one
behavior in mind; and in fact, many of the persons responding to
the data collection process will perform the behavior anticipated
by the teacher. But the mismatch occurs whenever a respondent
performs the different or additional tasks indicated in the second
column of the table.

Table 5.4 Some Examples of a Mismatch Between the Operational Definition and the Task the Respondent Has to Perform on the Instrument

Operational Definition | Task on Instrument
The student will add single-digit integers presented to him ten at a time on a test sheet | "If I have three apples and you give me two more apples, how many do I have?"
The student will solve problems using the quadratic equation | "Explain the derivation of the quadratic equation."
The student will use prepositions correctly in her essays | "Write the definition of a preposition."
The student will apply the principles of operant conditioning to hypothetical situations | The student first has to unscramble a complex multiple-choice thought pattern and then apply the principles
Given a (culturally familiar) novel problem to solve, the test taker will be able to solve the problem | The student is presented with a problem entirely foreign to his cultural background
The student will describe the relationship between nuclear energy and atmospheric pollution | The student will write, in correct grammatical structures, a description of the relationship between nuclear energy and atmospheric pollution
The student will circle each of the prepositions in the paragraph provided | The student will first decipher the teacher's unintelligible directions and then circle each of the prepositions
The respondent will place herself in the simulated job situation provided to her and will indicate how she would perform in that situation | The respondent has to first ignore that the situation is absurdly artificial and highly different from the real world and then still respond as she would perform in the hypothetical situation
When questions arise concerning various sorts of bias in the
data collection process, it is often the mismatch between task and
operational definition that is being challenged. For example, with
regard to bias in IQ tests, one of the most common arguments is
essentially that middle-class youngsters who take the test are
actually performing behaviors related to the operational
definition, whereas equally intelligent lower-class youngsters are
taking a test where there is a discrepancy between what they are
doing and the operational definition of intelligence.
It is important to be aware of the various kinds of bias and
other contaminating factors that could cause discrepancies, and to
carefully rule these out. Such sources of mismatching include
cultural bias, test-wiseness, reading ability, writing ability,
ability to put oneself in a hypothetical framework, tendency to
guess, and social-desirability bias. The preceding list is not to
be considered exhaustive. There are other factors unique to specific individuals that produce a similar effect. A good way to assure a match is to have several different qualified persons examine the data collection process and state whether the task matches the operational definition.

A special type of mismatch between
operational definition and task is worth mentioning. Some data
collection strategies are so obtrusive that the respondent is more
likely to be responding to the data collection process itself than
to be performing the tasks indicated in the operational definition.
For example, if a child knows that a questionnaire is measuring
prejudice and that it is not nice to be prejudiced, the child may
answer what he thinks he should answer instead of revealing his
true attitude. (This is referred to as a social-desirability bias.)
Likewise, if a researcher comes into the classroom and sits in a
prominent position with a behavioral checklist, children may be
acutely aware that something unusual is happening; and so the
behavior recorded on the checklist is more a reaction to the data
collection strategy than an indication of actual behavioral
tendencies. (Specific strategies for overcoming obtrusiveness are
discussed in chapter 6.)

3. Demonstrate that the data collection
process is reliable. Reliability was discussed extensively earlier
in this chapter. The contribution of reliability to validity was
mentioned in Figure 5.1 and in the accompanying discussion. The
relationship between reliability and validity is diagrammed more
specifically in Figure 5.2. As this diagram suggests, a certain
amount of reliability is necessary before a data collection process
can possess validity. In other words, a data collection process
cannot measure what it's supposed to measure if it measures nothing
consistently. In demonstrating that data collection processes are
valid, professional test constructors first demonstrate that their
data collection processes are reliable - that they measure
something consistently; then they demonstrate that this something
is the characteristic that the data collection processes are
supposed to measure. In other words, they first demonstrate
reliability in several ways, and then they demonstrate validity.

An
important caution is necessary in discussing the relationship
between reliability and validity. It is crucial to realize that it
is possible (but undesirable and inappropriate) to increase
reliability while simultaneously reducing the validity of a data
collection process. This can be done by either (1) narrowing or
changing the operational definition so that it is no longer
logically appropriate or (2) changing the tasks based on the
operational definition to less directly related tasks and then (3)
devising a more reliable data collection process based on the more
measurable but less appropriate operational definition or tasks.
This is obviously a bad idea, because the result is that the data
collection now measures a less valid or wrong outcome "more
reliably."Such an increase in reliability accompanied by a
reduction in validity occurs, for example, if a teacher introduces
unnecessarily complex language into a data collection process. A
data collection process that had previously measured "ability to
apply scientific concepts" might now instead measure "ability to
decipher complex language and then apply scientific concepts." The
resulting reliability might be higher; but if the teacher is still
making decisions about the original outcome, the data collection
process has become less valid.

Overemphasis on reliability is one of
the arguments against culturally biased norm-referenced tests.
Their detractors argue that many standardized tests become more
reliable when cultural bias is added, because such bias is a
relatively stable (consistent) factor, which is likely to work the
same way on all questions and on all administrations of the test.
However, the cultural bias detracts from the validity of the test.
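
A minimal simulation, with invented numbers, makes the point concrete. Below, a stable irrelevant factor (standing in for something like a consistent cultural or reading-skill hurdle) is mixed into the test scores. Test-retest reliability rises, while the correlation with the intended trait - the validity coefficient - falls. This is an illustrative sketch, not a model of any particular test.

    import random

    def pearson_r(x, y):
        # Plain Pearson correlation between two equal-length lists.
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        return cov / (vx * vy) ** 0.5

    random.seed(1)
    trait = [random.gauss(0, 1) for _ in range(2000)]  # intended construct
    bias = [random.gauss(0, 1) for _ in range(2000)]   # stable irrelevant factor

    for w in (0.0, 2.0):  # weight of the irrelevant factor in each score
        # Two administrations: the bias repeats exactly; random error does not.
        form1 = [t + w * b + random.gauss(0, 1) for t, b in zip(trait, bias)]
        form2 = [t + w * b + random.gauss(0, 1) for t, b in zip(trait, bias)]
        print("bias weight", w,
              "| test-retest r:", round(pearson_r(form1, form2), 2),
              "| validity r:", round(pearson_r(form1, trait), 2))

With no bias, the expected test-retest correlation is about .50 and the validity correlation about .71; with the stable bias added, test-retest rises toward .83 while validity falls toward .41 - a more reliable measure of a less appropriate outcome.
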
It is important to be alert to the tendency to accept spuriously high statistical estimates of reliability as solid
evidence of validity. The fact that a certain amount of reliability
is a necessary prerequisite for validity does not mean that the
most reliable data collection process is also the most valid.
Statistical reliability is only one factor in establishing the
validity of a data collection process. Another way to state this is
to say that reliability is a necessary but not sufficient condition
for validity.

As you can see, establishing validity is predominantly a logical process.

Finally, before leaving this introduction to the
validity of data collection processes, it is important to note that
a data collection process that provides valid data for group
decisions will not always provide valid data for decisions about
individuals. On the other hand, a data collection process that
provides valid data for decisions about individuals will always
provide valid data for group decisions. This is not as complicated
as it sounds. To take an example, we might operationally define
appreciation of Shakespeare as "borrowing Shakespearean books from
the library without being required to do so." Even if Janet Jones
borrows books on Shakespeare without being required to do so, it is
not possible to diagnose her specifically as either appreciating or
not appreciating the bard using this operational definition. There
are too many competing explanations for her behavior, and these
would invalidate this data collection process as an estimate of her
appreciation. (For example, she might hate the subject but need to
pass the exam; and so she has to borrow a vast number of books to
do burdensome, additional studying. Or she might like Shakespeare
so much that she owns annotated copies of all the plays and never
has to borrow from any library except her own.) Nevertheless, it
may still be valid to evaluate the group based on this operational
definition. If you teach the Shakespeare plays a certain way one
year and only 2% of the students ever borrow related books from the
library, and the next year you teach the same subject differently
and 50% of the students spontaneously borrow books, it is probably
valid to infer from the borrowing records that
appreciation of Shakespeare has increased. The group decision, at
any rate, is more likely to be valid than is the individual
diagnosis.

Box 5.1 An Argument-Based Approach to Validity

Kane (1992) presents the
practical yet sophisticated idea that validity should be discussed
in terms of the practical effectiveness of the argument to support
the interpretation of the results of a data collection process for
a particular purpose. The researcher or user of the research
chooses an interpretation of the data, specifies the interpretive
argument associated with that interpretation, identifies competing
interpretations, and develops evidence to support the intended
interpretation and refute the competing interpretations. The amount
and type of evidence needed in a particular case depend on the
inferences and assumptions associated with a particular
application.

The key points in this approach are that the
interpretive argument and the associated assumptions be stated as
clearly as possible and that the assumptions be carefully tested by
whatever strategies will best rule out bias and other sources of
faulty conclusions. As the most questionable inferences and
assumptions are checked and either supported by the evidence or
adjusted so that they become more plausible, the plausibility
(validity) of the interpretive argument increases.

This
interpretation of validity is compatible with the discussion
presented in this chapter. In addition, it has the advantage of
presenting validity as a special instance of the overall
application of formal and informal reasoning to solving problems.
From this viewpoint, when educators do research, they are under the
same obligation as any other person making public statements to
demonstrate that those statements really do mean what the speaker
or writer says they mean. Statistical procedures and other specific
techniques are merely pieces of evidence to check the quality of
inferences and the authenticity of the assumptions underlying a
particular interpretation.

(Source: Kane, M. T. [1992]. An argument-based approach to validity. Psychological Bulletin, 112, 527-535.)
REVIEW QUIZ 5.4

Part I

Identify the item from each pair that is most likely to be an invalid measure of the outcome variable given in parentheses.

Set 1.
a. The child will correspond intelligibly with an assigned Spanish-speaking pen pal. (understands Spanish)
b. The child will correspond intelligibly with an assigned Spanish-speaking pen pal. (appreciates Spanish culture)

Set 2.
a. The student will identify examples of the principles of physics in the kitchen at home. (understands principles of physics)
b. The student will choose to take optional courses in the physical sciences. (appreciates physical sciences)

Part II

Write Invalid next to statements that indicate an invalid data collection process; write Valid next to those that indicate a valid data collection process; write N if no relevant information regarding validity is contained in the statement.

1. ____ The questions were so hard that I was reduced to flipping a coin to guess the answers.
2. ____ The test measures mere trivia, not the important outcomes of the course.
3. ____ To rule out the influence of memorized information regarding a problem, only topics that were entirely novel to all the students were included on the problem-solving test.
4. ____ The only way he got an A was by having his girlfriend write the term paper for him.
5. ____ The length of the true-false English test was increased from 30 to 50 items to minimize the chances of getting a high score by guessing.
6. ____ The teacher ruled out the likelihood of cheating by giving each of the students seated at the same table a different form of the test.
7. ____ Since the personality test had such a difficult vocabulary level, it probably was influenced more by intelligence than by personality factors.
8. ____ The observer rated the classroom as displaying a hostile environment toward handicapped people, but the teacher argued that the observer's judgment was clouded because she observed from a position where she was next to students who were not at all typical of the entire class.
9. ____ The observer rated the atmosphere of the school board meeting as being supportive of innovative teaching, but the newspaper critic pointed out that this was because the board members were local residents with business interests and were therefore very likely to be supportive of innovation.

If you got most of the questions in Review Quiz 5.4
correct, or if you easily saw the logic of the explanations, then
you probably have a good basic grasp of the concept of validity. If
you do not understand the concept, reread the chapter to this
point, check the chapter in the workbook, refer to the recommended
readings, or ask your instructor or a peer for help. Be sure that
you understand the summary in the following paragraph so that you
will profit from the rest of this chapter.

In summary, validity
refers to whether a data collection process really measures what it
is designed to measure. Invalidity occurs to the extent that the
data collection process measures an incorrect variable or no
consistent variable at all. The main sources of invalidity are
logically inappropriate operational definitions, mismatches between
operational definitions and the tasks employed to measure them, and
unreliability of data collection processes. Validity is not an
all-or-nothing characteristic; data collection processes range from
strong validity to weak validity. Because of the highly
internalized nature of educational outcomes, data collection
processes in education can never be perfectly valid. By carefully
stating appropriate operational definitions, ascertaining that
tasks employed in data collection processes are directly related to
the operational definitions, and designing reliable data collection
processes, we can increase the validity of our data collection
processes and the probability that we will draw valid conclusions
from them.

SPECIFIC, TECHNICAL EVIDENCE OF MEASUREMENT VALIDITY

If
you read a test manual or look up the citation of a test in The
Mental Measurements Yearbook (Kramer & Conoley, 2002), you will
find references to three basic types of evidence to support
measurement validity. These have been defined by several major
organizations interested in mental measurement (American
Educational Research Association et al., 1985). The technical types
of evidence for validity are rooted in the theory discussed earlier
in this chapter, and it is not difficult to achieve a fundamental
understanding of these concepts. A brief discussion of these types
of evidence for validity can help teachers and researchers develop
more valid data collection processes for their own use. In
addition, an understanding of these concepts will be especially
useful when selecting or using standardized tests, reading the
professional literature, and attempting to measure psychological or
theoretical characteristics beyond those that are typically covered
by classroom tests. These three types of evidence for validity are
(1) content validity, (2) criterion-related validity, and (3)
construct validity.

Content Validity

Content validity refers to the
extent to which a data collection process measures a representative
sample of the subject matter or behavior that should be encompassed
by the operational definition. A high school English teacher's
midterm exam, for example, lacks content validity when it focuses
exclusively on what was covered in the last two weeks of the term
and inadvertently ignores the first six weeks of the grading
period. Likewise, a self-concept test would lack content validity
if all the items focused on academic situations, ignoring the
impact of home, church, and other factors outside the school.
Content validity is assured by logically analyzing the domain of
subject matter or behavior that would be appropriate for inclusion
on a data collection process and examining the items to make sure
that a representative sample of the possible domain is included. In
classroom tests, a frequent violation of content validity occurs
when test items are written that focus on knowledge and
comprehension levels (because such items are easy to write), while
ignoring the important higher levels, such as synthesis and
application of principles (because such items are difficult to
write).
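
One simple way to make that logical analysis systematic is to tally the items on a draft instrument against a content blueprint and flag under-represented areas. The sketch below is hypothetical: the blueprint shares and item tags are invented, and a real blueprint would normally cross-tabulate content areas with cognitive levels as well.

    from collections import Counter

    # Hypothetical blueprint: planned share of items per content area.
    blueprint = {"weeks 1-2": 0.25, "weeks 3-4": 0.25,
                 "weeks 5-6": 0.25, "weeks 7-8": 0.25}

    # Content-area tag of each item actually written for the draft exam.
    items = ["weeks 7-8"] * 14 + ["weeks 5-6"] * 4 + ["weeks 1-2"] * 2

    counts = Counter(items)
    for area, planned in blueprint.items():
        actual = counts.get(area, 0) / len(items)
        flag = "  <-- under-represented" if actual < planned / 2 else ""
        print(f"{area}: planned {planned:.0%}, actual {actual:.0%}{flag}")

Run on these invented numbers, the tally flags the early weeks of the term, mirroring the midterm example above; the same check over levels such as knowledge, comprehension, and application would catch the knowledge-level overload just described.
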
Criterion-Related Validity

Criterion-related validity refers to how closely performance on a data collection process is related to other measures of performance. There are two types of criterion-related validity: predictive and concurrent.

Predictive validity refers to
how well a data collection process predicts some future
performance. If a university uses the Graduate Record Exam (GRE) as
a criterion for admission to graduate school, for example, the
predictive validity of the GRE must be known. This predictive
validity would have been established by administering the GRE to a
group of students entering a school and determining how their
performance on the GRE corresponded with their performance in that
school. It would be expressed as correlation coefficient. A high
positive coefficient would indicate that persons who did well on
the GRE tended to do well in graduate school, whereas who scored
low on the GRE tended to perform poorly in school. A low
correlation would indicate that there was little relationship
between GRE performance and success in that particular graduate
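
A minimal sketch of that computation, using invented scores (the identical arithmetic applied to two measures collected at the same time yields the concurrent validity coefficient discussed next):

    # Hypothetical admissions-test scores and the same students'
    # later graduate grade-point averages.
    gre = [310, 325, 300, 335, 315, 328, 305, 320]
    gpa = [3.1, 3.6, 2.9, 3.8, 3.2, 3.5, 3.0, 3.4]

    n = len(gre)
    mx, my = sum(gre) / n, sum(gpa) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(gre, gpa))
    sx = sum((x - mx) ** 2 for x in gre) ** 0.5
    sy = sum((y - my) ** 2 for y in gpa) ** 0.5
    print("predictive validity coefficient r =", round(cov / (sx * sy), 2))
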
Concurrent validity refers to how well a data collection
process correlates with some current criterion - usually another
test. It "predicts" the present. At first glance it sounds like an
exercise in futility to predict what is already known, but more
careful consideration will suggest two important uses for
concurrent validity. First, it is a useful precursor to
predictive validity. If the GRE, for example, does not even
correlate with success among those who are going to school right
now, then there is little value in doing the more expensive,
time-consuming, predictive validity study. Second, concurrent
validity enables us to use one measuring strategy in place of
another. If a university wants to require that students either take
freshman composition or take a test to "test out" of the course,
concurrent validity would enable the English department to
demonstrate that a high score on the alternative test has a similar
meaning to a high grade in the course. Like predictive validity,
concurrent validity is expressed by a correlation
coefficient.

Construct Validity

Construct validity refers to the
extent to which the results of a data collection process can be
interpreted in terms of underlying psychological constructs. A
construct is a label or hypothetical interpretation of an internal
behavior or psychological quality - such as self-confidence,
motivation, or intelligence - that we assume exists to explain some
observed behavior. Construct validity often necessitates an
extremely complicated process of validation. To state it briefly,
the researcher develops a theory about how people should perform
during the data collection process if it really measures the
alleged construct and then collects data to see whether this is
what really happens. The process is complicated because the
researcher is doing two separate things: (1) proving that the data
collection process possesses construct validity and (2) refining
the theory.
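
Although the full validation process runs beyond this excerpt, one common correlational check is easy to sketch: scores on the new instrument should correlate strongly with an established measure of the same construct (convergent evidence) and only weakly with a measure of a different construct (discriminant evidence). All scores below are invented, and this is only one strand of the evidence a theory-driven validation would gather.

    def pearson_r(x, y):
        # Plain Pearson correlation between two equal-length lists.
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        return cov / (vx * vy) ** 0.5

    new_scale = [10, 14, 9, 16, 12, 15, 11, 13]    # hypothetical new measure
    same_trait = [22, 30, 20, 33, 25, 31, 23, 27]  # established measure, same construct
    other_trait = [9, 7, 8, 10, 6, 9, 11, 8]       # measure of an unrelated construct

    print("convergent r:", round(pearson_r(new_scale, same_trait), 2))
    print("discriminant r:", round(pearson_r(new_scale, other_trait), 2))

A high convergent coefficient alongside a near-zero discriminant coefficient is one piece of evidence that the data collection process reflects the intended construct rather than something else.
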