
METODOLOGI PENYELIDIKAN DAN ANALISIS STATISTIK 2004 (Research Methodology and Statistical Analysis 2004)

Validity Overview

Overview & Definitions

One of the fundamental questions to be asked and addressed is 'how valid and reliable are the study and the data collected'?

Validity of Research: Two general concepts are important.

1. Internal validity: Extent to which results can be attributed to "treatment".

2. External validity: Extent to which results can be generalized. External validity is examined qualitatively by scrutinizing the sampling scheme employed.

Validity of Data: Concerned primarily with the dependent variable. The instrument used to quantify the dependent variable should be examined for its validity (ability to truly measure what it is supposed to). If the instrument is a well-known one with established validity, it may be enough to cite a reference where validity was examined and show that the same protocol was followed in your study on similar subjects.

Validity of the dependent variable can be assessed:

Qualitatively

Content/logical validity

Quantitatively

Content/logical - factor analysis

Construct validity - multi-trait/multi- method procedure (correlations) or Factor analysis

Reliability of Research; Essentially this refers to the replicability of the research. The reliability of the research is assessed qualitatively by scrutinizing the design and methodology employed in the research.

An estimate of power (probability of correctly rejecting the null hypothesis) is also useful in examining the stability of the research.

Reliability of Data: Concerned primarily with the dependent variable. The instrument used to quantify the dependent variable should be examined for its reliability (accuracy of measures reflected in consistency).

Reliability of the dependent variable can be assessed quantitatively using:

1. Coefficient alpha

2. Intraclass R

Once the statement of the problem is translated into a null hypothesis, choices with respect to statistical analysis can be made.

Definitions

RESEARCH: The systematic, replicable, empirical investigation of relationships between or among several variables.

EXPERIMENTAL RESEARCH: Involves the manipulation of some variable by the researcher in order to determine whether changes in a behavior of interest are effected. A well-controlled experiment also requires that the assignment of subjects to conditions (or vice versa) be made on a random basis and that all other variables be held constant except the conditions that the subjects will experience.

NON-EXPERIMENTAL RESEARCH: No manipulation of variables takes place. Observational, historical, and survey studies are typically non-experimental. Much of the research conducted in natural settings is non-experimental since the researcher is unable to manipulate the conditions that the subjects will experience.

INDEPENDENT VARIABLE: The variable manipulated by the experimenter. Or a broader definition would be - any variable that is assumed to produce an effect on, or be related to, a behavior of interest.

LEVELS OF AN INDEPENDENT VARIABLE: The various values or groupings of values of an independent variable. Ex: a study is conducted to determine the effect of room temperature on performance. If the experimenter tests the subject at 70, 80, and 90 degrees, there is one independent variable - room temperature - with three levels.

FACTORIAL DESIGN: One that allows two or more independent variables to be combined in the same study so that the effects of each variable can be evaluated independently of the effects of the other(s). In addition, a factorial design allows the experimenter to determine whether there was an interaction between the independent variables. That is whether the effects of one variable depend on the specific level of the other variable with which it was combined. To produce a factorial design, each level of one variable is paired with each level of the other independent variable(s). Ex: Suppose the independent variables are room temperature (70, 90) and class size (15, 30 students). A factorial design would require four research conditions.

DEPENDENT VARIABLE: The behavior or characteristic observed or analyzed by the researcher, generally in regard to how the independent variable(s) affected or were related to it.

TYPE OF DEPENDENT VARIABLE: In empirical research, the dependent variable is quantified in some way. Statistical analysis is carried out on the numerical values of the dependent variable. The three basic types are score data (ratio, interval), ordered data (ordinal), and frequency data (categorical).

Score data: Generally requires relatively precise measuring instruments and an understanding of the behavior being measured. Statistical techniques developed to analyze score data make rather stringent assumptions about the nature of the scores. The most common assumptions are:

1. The intervals between scores are equal; that is, differences between scores at one point of the measuring scale are equivalent to the same size difference at any other point on the scale.

2. The scores are assumed to be normally distributed within the population(s) from which they were drawn.

3. The variances of the populations are assumed to be homogeneous.

Ordered data: Used when reliable score data cannot be (or are not) obtained, but the subjects can be ranked from high to low along the dimension of interest. In some cases, a researcher may convert score data to ranks because it is believed that the measuring instrument was not precise enough to trust the numerical scores, or that the assumptions underlying a statistical test for score data would be badly violated by the data. Statistical tests designed for use with ordered data generally do not make stringent assumptions about the nature of the underlying distributions and hence are more conservative than those designed for score data.

Frequency data: Each subject is classified into a particular category. The frequency of occurrence of subjects in each category typically provides the data from which statistical analysis is done.

Qualitative Research

It is important to recognize the sources of qualitative data as distinct from the data used in empirical/experimental research.

3 broad categories that produce qualitative data:

In-depth open ended interviews

Direct observation

Analysis of written documents

Tendency for qualitative research to be more exploratory in nature.

Validity and reliability are as important in qualitative research as quantitative research.

General definitions:

Validity: tests/observations/instruments etc. are valid if they produce relevant and clean measures of what you want to assess.

Reliability: tests/observations/instruments etc. are reliable if they produce accurate measures of what you want to assess.

Validity of Research

Internal validity: extent to which findings are free of bias, observer effects, Hawthorne effects, etc. Extent to which findings match reality.

External validity: Extent to which findings can be generalized.

Validity of Data

Qualitative data is valid when it records relevant events/information without interpretation, bias or filtering.

Reliability of Research

Extent to which findings are replicable.

Reliability of Data

Qualitative data is reliable when it accurately (objectively) records events/information.

The validity and reliability of qualitative data depend to a great extent on the methodological skill, sensitivity and integrity of the researcher.

Skillful interviewing involves more than just asking questions.

Systematic and rigorous observation involves far more than just being present and looking around.

Content analysis requires considerably more than just reading to see what's there.

All require discipline, knowledge, training, practice, and creativity on the part of the researcher.

Qualitative and Quantitative research are not polar opposites with completely different sets of techniques and approaches to inquiry. They exist along a continuum commonly framed in terms of the amount of control or manipulation present.

You can consider quantitative research to be grounded in scientific inquiry or the scientific method and proceeding along two dimensions:

The extent to which the researcher manipulates some phenomenon in advance in order to study it.

The extent to which constraints are placed on output measures - that is, predetermined categories or variables are used to describe the phenomenon under study.

You can think of qualitative inquiry as discovery oriented. No pre-conditions are set and no constraints are placed on the outcomes. Variables emerge from the data.

The advantage of a quantitative approach is that it is possible to measure the reactions of many people to a limited set of questions thus facilitating comparison and statistical aggregation of data. A broad, generalizable set of findings result.

The advantage of a qualitative approach is that a wealth of detailed information about a specific event is produced. This increases understanding of the cases and situations studied but reduces generalizability.

Measurement Issues

The place to start is with how to classify data - the scale the data appropriately belongs on will affect analysis decisions.

Measurement Scales

Categorical/nominal scale: Used to measure discrete variables that can be classified by two or more mutually exclusive categories.

Ex: Gender is a categorically scaled variable with two categories: male & female. The scale scores (0, 1) have no meaning.

Ordinal scale: Used to measure discrete variables that are categorical in nature and can be ordered (meaningfully).

Ex: Undergraduate class is an ordinally scaled variable with four meaningfully ordered categories: freshman, sophomore, junior, senior. The scale scores (1, 2, 3, 4) have meaning in that juniors have completed more units than sophomores, who have completed more than freshmen . . .

Interval scale: Used to measure continuous variables that are ordinal in nature and result in values that represent actual and equal differences in the variable measured.

Ex: Temperature is an interval scaled variable with meaningfully ordered categories (hot, cold) that can be measured (scale has a constant unit of measurement) to finer and finer degrees given appropriate instrumentation.

Ratio scale: Used to measure continuous variables that have a true zero, implying total lack of the attribute/property being measured.

Ex: Weight is a ratio scaled variable with meaningfully ordered categories (heavy, light) that can be measured to finer and finer degrees that also has a true rather than arbitrary zero.

Continuous variables are ones that are at least interval scaled - they call for the use of parametric statistics.

Discrete variables are ones that are categorical or ordinal in nature and call for use of non-parametric statistics.

Validity, Reliability, Objectivity

Recall that it is important to question 'how valid and reliable are the study and the data collected'?

Validity of Research: Two general concepts important.

1. Internal validity: Extent to which results can be attributed to "treatment". Internal validity is examined qualitatively by scrutinizing the research design for 'sources of invalidity'.

2. External validity: Extent to which results can be generalized. External validity is examined qualitatively by scrutinizing the sampling scheme employed.

Factors that can affect internal and external validity

Your research design and selection of a sample are the keys to limiting the problems of internal and external validity.

Sources of Invalidity

Rosenthal effect: Self fulfilling prophecy - you get what you expect. Best to do a double-blind study when this is a potential source of invalidity.

Halo effect: General effect of good or bad feeling you have about a person. In observational designs this may be a particular problem. Best to use a checklist and verify the reliability of the instrument and those collecting data.

Demand characteristics: Allowing subjects to know what the goals are. Deception (of an ethical nature) may be needed to avoid this source of invalidity.

Volunteer effect: Volunteers may be fundamentally different from the overall population you are trying to generalize to.

Instrumentation effect: Changes in instruments can be mistaken for changes in subjects.

Pre-testing effect: Subjects can be changed or learning can take place during a pre-test which could affect results.

Time: Over a length of time, maturation may have more of an impact than the independent variable. Also, major events can affect subjects' behaviors and/or opinions.

Hawthorne effect: When the giving of attention rather than the independent variable is the cause of observed differences/relationships.

Validity of Dependent Variable

The instrument used to quantify the dependent variable(s) should be examined with respect to validity. If the instrument is a well-known one with established validity, it may be enough to cite a reference where validity was examined and show that the same protocol has been followed in your study on similar subjects. If the measures come from an instrument devised by you, work must be done to show at least logical/content validity and preferably appropriate estimates of criterion-related validity.

Data (from tests, instruments, observation, etc.) is good when it is valid - relevant and clean (reflects what it's supposed to) - and reliable (produces accurate measures).

Depending on the type and purpose of a data collection, validity can be examined from one or more of several perspectives.

Purpose: Evaluation
  Cognitive: Content Validity, Concurrent Validity
  Motor Skills: Logical Validity, Concurrent Validity

Purpose: Prediction
  Cognitive: Predictive Validity
  Motor Skills: Predictive Validity

Purpose: Research
  Cognitive: Content Validity, Concurrent Validity, Predictive Validity, Construct Validity
  Motor Skills: Logical Validity, Concurrent Validity, Predictive Validity, Construct Validity

When measures are found to be valid for one purpose they will not necessarily be valid for another purpose. Validity also may not be generalizable across groups with varying characteristics.

Content/logical validity (assessed qualitatively)

1. Clearly define what you want to measure.

2. State all procedures you will use to gather measures.

3. Have an "expert" assess whether or not you are measuring what you think you are.

Content validity (assessed quantitatively) Ex: survey research

1. Pilot test the survey

2. Conduct a factor analysis of survey results

3. Revise based on analysis

4. Administer survey and conduct another factor analysis

Criterion-related validity (predictive and concurrent)

Compare measures from your 'instrument' with measures from a criterion (expert, another test, etc.)

Concurrent validity (assessed quantitatively)

1. Gather x and y measures from a large group

2. Compute an appropriate correlation coefficient

3. If the correlation is > .80 for positively correlated variables or < -.80 for inversely related variables, your measure (x) is said to have good concurrent validity.

Predictive validity (assessed quantitatively)

1. Gather measures using your instrument (x) and measures on the variable(s) you are trying to predict (y)

2. Compute an appropriate correlation coefficient

3. If the correlation is > .80 for positively correlated variables or < -.80 for inversely related variables, your measure (x) is said to have good predictive validity.

4. Follow up with estimation of the SEE - a band placed around the predicted score to quantify prediction error (see the sketch below).
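The correlation-based steps above (for both concurrent and predictive validity) can be illustrated with a short sketch. This is a minimal, hypothetical example assuming Python with numpy and scipy; the scores are made up, and the SEE is obtained from a simple regression of the criterion on the instrument.

```python
# Sketch of a criterion-related validity check (concurrent or predictive),
# assuming paired measures x (your instrument) and y (the criterion).
import numpy as np
from scipy import stats

x = np.array([12.0, 15.5, 11.2, 18.3, 14.1, 16.8, 13.4, 17.2])  # hypothetical instrument scores
y = np.array([13.1, 16.0, 10.9, 19.0, 13.8, 17.5, 12.9, 18.1])  # hypothetical criterion scores

r, p = stats.pearsonr(x, y)                 # step 2: correlation coefficient
print(f"r = {r:.2f}, p = {p:.4f}")
print("good criterion-related validity" if abs(r) > 0.80 else "weak criterion-related validity")

# Follow-up: standard error of estimate (SEE) from regressing y on x,
# a band placed around predicted scores to quantify prediction error.
slope, intercept, r_value, p_value, se = stats.linregress(x, y)
y_hat = intercept + slope * x
see = np.sqrt(np.sum((y - y_hat) ** 2) / (len(y) - 2))
print(f"SEE = {see:.2f}")
```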

Construct validity (assessed quantitatively)

A construct is an intangible characteristic. When you want to measure a construct such as anxiety, competitiveness, etc., you have no direct means to do so. Therefore indirect methods need to be employed. To then estimate the validity of the indirect measures (as reflections of the construct you're interested in) you record a pattern of correlations between the indirect measure(s) and other similar and dissimilar measures. Your hope is that the pattern reveals high correlations with similar measures (convergent validity) and low correlations with different measures (divergent/discriminant validity).

Two techniques used to quantitatively assess construct validity - Multi-trait multi-method matrix and factor analysis.

Reliability of Research: Essentially this refers to the replicability of the research. The reliability of the research is assessed qualitatively by scrutinizing the design and methodology employed in the research.

Reliability of Data: Concerned primarily with the dependent variable. The instrument used to quantify the dependent variable should be examined for its reliability (accuracy of measures reflected in consistency).

Reliability of the dependent variable can be assessed quantitatively using:

Coefficient alpha

Intraclass R

Should then follow up with SEM: band placed around observed score to quantify measurement error.

The primary concern here is the accuracy of measures of the dependent variable (in a correlational study both the independent and dependent variable should be examined). Reducing sources of measurement error is the key to enhancing the reliability of the data.

Reliability is typically assessed in one of two ways:

1. Internal consistency - Precision and consistency of test scores on one administration of a test.

2. Stability - Precision and consistency of test scores over time. (test-retest)

To estimate reliability you need 2 or more scores per person.

If motor skills/physiological measures collected at one time only, the most common way of getting 2 scores per person is to split the measures in half - usually by odd/even or first half/second half by time or trials.

For survey research with multiple factors, reliability is typically assessed within factors by examining consistency of response across items within a factor. So, for a survey with 3 factors, you will compute 3 reliability coefficients.

If every subject can be measured twice on the dependent variable then you readily have data from which reliability can be examined.

Once you have 2 scores per person, the question is how consistent the scores were overall, in order to infer how accurate they were.
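As a rough illustration of the quantitative assessment described above, the following sketch computes coefficient alpha from a subjects-by-trials score matrix and follows up with an SEM. It assumes Python with numpy; the score matrix is hypothetical, and alpha is computed from the standard formula rather than the intraclass R/ANOVA route mentioned earlier.

```python
# Sketch: reliability from 2+ scores per subject (rows = subjects, columns = trials/items).
import numpy as np

scores = np.array([
    [10, 12],
    [14, 15],
    [ 9, 10],
    [16, 15],
    [11, 13],
], dtype=float)  # hypothetical: each subject measured twice

k = scores.shape[1]
item_vars = scores.var(axis=0, ddof=1)          # variance of each trial/item
total_var = scores.sum(axis=1).var(ddof=1)      # variance of subjects' total scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"coefficient alpha = {alpha:.2f}")

# Follow-up: SEM, the band placed around an observed score to quantify measurement error.
# Here the SD of subjects' mean scores is used as the score SD (one reasonable choice).
sd = scores.mean(axis=1).std(ddof=1)
sem = sd * np.sqrt(1 - alpha)
print(f"SEM = {sem:.2f}")
```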

Sources of measurement error

1. Random fluctuations, or a person's inability to score the same twice or perform consistently throughout one administration.

2. Measuring device - test

3. Researcher

4. Temporary effects - warm-up, practice

5. Testing length - time/trials

As a researcher it is important to identify and eliminate as many sources of error as possible in order to enhance reliability.

Relationship between reliability and validity

It is possible to have reliable measures that are invalid. Measures that are valid will typically also be reliable. However, reliability does not ensure validity.

What statistic to use

In the past, reliability has been estimated using the Pearson correlation coefficient. This is not appropriate since

1) the PPMC is meant to show the relationship between two different variables - not two measures of the same variable, and

2) the PPMC is not sensitive to fluctuations in test scores.

The PPMC is an interclass coefficient; what is needed is an intraclass coefficient. The most appropriate statistic is the intraclass R calculated from values in an analysis of variance table though coefficient alpha is equally acceptable.

Objectivity

Whenever measures have a strong subjective component to them it is essential to examine objectivity. Subjectivity itself is a source of measurement error and so affects reliability and validity. Therefore, objectivity is a matter of determining the accuracy of measures by examining consistency across multiple observations (multiple judges on one occasion or repeated measures over time from one evaluator) that typically involve the use of rating scales.

Objectivity (using coefficient alpha/Intraclass R)

This way of examining objectivity requires that you have either multiple evaluators assessing performance/knowledge on one occasion, or one evaluator assessing the same performance (videotaped)/knowledge at least twice. If the measures (typically from a rating scale) are objective they will be consistent across the two or more measures per subject.

Experimental Research - Designs, Power, Type I error, Type II error

When interested in differences or change over time for one group or between groups, a number of designs are applicable. The most frequently used designs can be collapsed into two broad types: true experimental and quasi-experimental.

True experimental designs: these designs all have in common the fact that the groups are randomly formed. The advantage associated with this feature is that it permits the assumption to be made that the groups were equivalent at the beginning of the research, which provides control over sources of invalidity based on non-equivalency of groups. The control is of course not inherent in the design. The researcher must still work with the groups in such a way that nothing happens to one group (other than the treatment) that does not happen to the other, that scores on the dependent measure do not vary as a result of instrumentation problems, and that the loss of subjects is not different between the groups.

Randomized groups design:

This design requires the formation of two groups. One group will receive the experimental treatment; the other will not. The group not receiving the treatment is commonly referred to as the control group.

This design allows the researcher to test for significant differences between the control and experimental group after the experimental group has received the treatment. An independent t-test or one-way analysis of variance (ANOVA) may be used to statistically test the null hypothesis that means are equal.

In this design there is one independent variable and one dependent variable. In the situation depicted above there are two levels of one independent variable. The independent variable is group or treatment condition (two levels - experimental/group 1 & control/group 2). The dependent variable is whatever is under study - eg. Cholesterol level.

Extension of randomized groups designs.

One extension requires the formation of three groups. Two groups will receive varying levels or conditions of the experimental treatment the third will not be treated and will again be referred to as the control group.

This design allows the researcher to test for significant differences across the two experimental groups and the control group after the experimental groups have received the treatment. A one-way ANOVA would be used to statistically test the null hypothesis H0: μ1 = μ2 = μ3.

In this expanded design there is still one independent variable (now with 3 levels) and one dependent variable. The independent variable is still group or treatment condition and the dependent variable is again the variable under study. In the text, the example is of training at two levels (70% of VO2 max & 40% of VO2 max) with the control group not training. The dependent variable is cardiorespiratory fitness. So, at the end of training, measures of cardiorespiratory fitness are collected on each subject and the means for each group compared using a one-way ANOVA (if assumptions are not met, a Kruskal-Wallis test would be used).
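A minimal sketch of the analysis just described (a control group and two training groups compared with a one-way ANOVA, falling back to Kruskal-Wallis when assumptions look doubtful) might look like the following. It assumes Python with scipy; the fitness scores are hypothetical.

```python
# Sketch: comparing a control group and two training groups on one dependent variable.
from scipy import stats

control  = [38.1, 40.2, 37.5, 39.0, 41.3]   # hypothetical fitness scores
train_40 = [41.0, 43.2, 40.5, 42.8, 44.1]   # trained at 40% VO2 max
train_70 = [45.3, 47.0, 44.8, 46.2, 48.5]   # trained at 70% VO2 max

f_stat, p_anova = stats.f_oneway(control, train_40, train_70)
print(f"one-way ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

# Non-parametric fallback when normality/homogeneity assumptions are not met.
h_stat, p_kw = stats.kruskal(control, train_40, train_70)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
```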

Points to consider

1. For each subject in the study there can be only one score. If many measures are taken for each subject, these measures must be combined in some fashion (eg averaged) so that there is only one score for each subject.

2. Although it is not necessary, it is usually best to have an equal number of subjects in each of the experimental groups.

3. The number of groups compared depends on the hypothesis being examined, however, it is rare that more than four or five groups are compared in one study.

Example

Assume that the researcher is interested in determining the effect of shock on the time required to solve a set of difficult problems. Subjects are randomly assigned to four experimental conditions. Subjects in Group 1 receive no shock; Group 2, very low intensity shocks; Group 3, medium shocks; and Group 4, high intensity shocks. The total time required to solve all the problems is the measure recorded for each subject. The independent variable then is shock (which has 4 levels) and the dependent variable is time.

Factorial design:

Essentially an extension of the randomized-groups design, this design has more than one independent variable and just one dependent variable. This design requires the formation of a group for every combination (of every level) of the two or more independent variables.

This design allows the researcher to test for significant differences as a function of each independent variable separately (main effects) and in combination (interaction). A two-way ANOVA would be used to statistically test the null hypothesis that means are equivalent for the first independent variable, that means are equivalent for the second independent variable, and that the interaction is not significant.

Points to consider

1. For each subject there can be only one score. If many measures are taken for each subject, these must be combined (eg averaged) so that there is only one score for each subject.

2. It is best to have an equal number of subjects in each group.

3. The number of groups compared depends on the hypothesis being examined, however, it is rare that more than four or five groups are included in either of the two factors.

Example

Assume that a researcher is interested in determining the effects of high vs. low-intensity shock on the memorization of a hard vs. an easy list of nonsense syllables. Subjects would be randomly assigned to four experimental conditions: (1) low shock & easy list, (2) high shock & easy list, (3) low shock & hard list, and (4) high shock & hard list. The total number of errors made by each subject is the measure recorded. The dependent variable then is the number of errors and the independent variables are shock (with two levels) and list difficulty (with two levels).

This would require the formation of 4 randomly assigned groups:

                         List difficulty
                         Easy         Difficult
Shock Type     Low       A1 B1        A1 B2
               High      A2 B1        A2 B2

A two-way ANOVA would be used in this situation to test for a difference in

number of errors made depending on shock type (regardless of list difficulty);

the number of errors made depending on list difficulty (regardless of shock type);

and the number of errors made due to the combined effect of shock type and list difficulty.

Since 3 F statistics are examined, you would divide your alpha by 3 before comparing the 3 p values to your alpha. This is done to keep the overall study's alpha intact.

The analysis would be referred to as a 2X2 ANOVA communicating that there are two levels of the first independent variable and two levels of the 2nd independent variable. The language used to talk about the results would be the main effect for shock type, the main effect for list difficulty, and the interaction.
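A sketch of how a 2X2 factorial ANOVA like the shock-by-list-difficulty example could be run is shown below. It assumes Python with pandas and statsmodels; the error counts are hypothetical, and the calls shown are one reasonable way to obtain the two main effects and the interaction, not the only one.

```python
# Sketch: a 2x2 factorial (shock type x list difficulty) analyzed with a two-way ANOVA.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

data = pd.DataFrame({
    "shock":      ["low"] * 6 + ["high"] * 6,
    "difficulty": (["easy"] * 3 + ["hard"] * 3) * 2,
    "errors":     [4, 5, 3, 9, 8, 10, 6, 7, 5, 15, 14, 16],   # hypothetical error counts
})

model = smf.ols("errors ~ C(shock) * C(difficulty)", data=data).fit()
# Main effect for shock, main effect for list difficulty, and the interaction.
print(anova_lm(model, typ=2))
```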

Variation of factorial design: When one or more of the independent variables is a categorical variable, such as gender, where individuals cannot be randomly assigned to the levels, you have a factorial design that no longer qualifies completely as a true experimental design. It is, however, used quite frequently and is quite appropriate when the topic under study calls for the examination of characteristics to which people cannot be assigned.

Using gender and shock type as the independent variables you would still need to form 4 groups, but, you could not randomly assign individuals to the gender categories though you would still randomly assign individuals to the shock category.

                         Gender
                         Male         Female
Shock Type     Low       A1 B1        A1 B2
               High      A2 B1        A2 B2

Pretest-Posttest Randomized-groups design

In its simplest form, this design requires the formation of two groups. One group will receive the experimental treatment; the other will not. The group not receiving the treatment is still referred to as the control group.

This design allows the researcher to test for significant differences or amount of change produced by the treatment - does the experimental group change more than the control group. Though the testing effect cannot be evaluated it is assumed to be controlled since both the control and treatment groups are pretested. A factorial repeated measures ANOVA is the recommended analytical procedure. With this approach you have two independent variables (or factors) and one dependent variable. The first factor (not repeated measures) is treatment condition and the second is test (repeated measures) - pre/post. The dependent variable is what is being measured at the pre and post test.

Consider a dietary seminar intended to change eating habits particularly with respect to consumption of fat.

Group 1:    Pre-test    Seminar    Post-test

Group 2:    Pre-test               Post-test

In this example there are two independent variables and one dependent variable. In the situation depicted above there are two levels of each independent variable. The first independent variable is group or treatment condition (two levels - experimental/group 1 & control/group 2). The second independent variable is test (two levels - pretest & posttest). The dependent variable is grams of fat consumed.

Repeated Measures Design

The repeated measures design is a variation of the completely randomized design though not considered a true experimental design. Instead of using different groups of subjects, only one group of subjects is formed and all subjects are measured/tested multiple times. There is no control group.

This design allows the researcher to test for significant differences produced by the treatment - are the means across repeated measures different. A repeated measures ANOVA is the recommended analytical procedure. With this approach you have one independent variable and one dependent variable.

Assume that a researcher wants to know whether or not mean scores on an intelligence test change from year to year. To answer this, the researcher chooses subjects, all twelve years old, and an IQ score for each subject is recorded at age 12, 13, 14, and 15. The dependent variable in this case is IQ score and the independent variable is age.

As another example, assume that a researcher wants to know whether or not mean scores on a measure of exercise satisfaction change depending on the environment runners exercise in. To answer this, the researcher obtains measures of exercise satisfaction from subjects after they run in an urban setting, the countryside, an indoor track, and an outdoor track. The dependent variable is exercise satisfaction and the independent variable is exercise environment.

The major advantage of this design over the completely randomized design is that fewer subjects are required. In addition, very often increased statistical power is gained because the random variability of a single subject from one measure to the next is usually much less than the variability introduced by measuring and comparing different subjects. The major disadvantage is that there may be carry-over effects from one treatment/testing to the next. In addition, subjects might become progressively more proficient at performing the criterion task and show an improvement in performance more attributable to learning than the treatment.

Points to consider

1. Each subject is tested under, and a score is entered for, each treatment condition.

2. The number of repeated measures depends on the research question, however, it is rare to have more than four or five treatment conditions.
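As a hedged illustration of the repeated measures design, the sketch below analyzes the exercise-environment example with statsmodels' AnovaRM. The satisfaction scores and subject IDs are hypothetical, and the example assumes balanced data (one score per subject per condition).

```python
# Sketch: one group measured under four exercise environments,
# analyzed with a repeated measures ANOVA.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

scores = {                                  # hypothetical satisfaction ratings
    "urban":         [5, 6, 4, 5, 6, 5],
    "countryside":   [8, 9, 7, 8, 9, 8],
    "indoor_track":  [6, 6, 5, 7, 6, 6],
    "outdoor_track": [7, 8, 6, 7, 8, 7],
}
rows = []
for env, values in scores.items():
    for subject, score in enumerate(values, start=1):
        rows.append({"subject": subject, "environment": env, "satisfaction": score})
data = pd.DataFrame(rows)

result = AnovaRM(data, depvar="satisfaction", subject="subject",
                 within=["environment"]).fit()
print(result)   # F test for the within-subject factor (exercise environment)
```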

Solomon four-group design

This design requires the formation of four groups. It allows the researcher to determine whether or not the pretest results in increased sensitivity of the subjects to the treatment. Two of the groups are pre- and post-tested while the other two are only tested once. One of the groups receiving both the pre- and post-test, along with one of the groups tested only once, are exposed to the treatment. This arrangement permits evaluation of reactive or interactive effects of testing (threats to external validity). As a reminder, this is the problem of the pretest making subjects more sensitive to the treatment and thus reducing the ability to generalize the findings to an un-pretested population.

While it would be ideal to examine:

Replication of the treatment effect

Assessment of the amount of change due to the treatment

Evaluation of the testing effect

An assessment of whether or not the pretest interacts with the treatment

Unfortunately there is no way to tackle all of this statistically. The best that can be done is a 2X2 ANOVA. One independent variable then is treatment condition (no treatment & treatment) and the second independent variable is testing (pretested & not pretested).

                    No Treatment     Treatment
Pretested
Not pretested

From the ANOVA produced with this analysis you can assess the effects of pretesting (main effect for testing), the effect of the treatment (main effect for treatment condition), and the external validity threat of the pretest interacting with the treatment (interaction effect). Other concerns can and should be examined descriptively.

Power, Type I error, Type II error

These topics must be considered in the context of hypothesis testing.

Hypothesis testing involves examination of a statistically expressed hypothesis.

The statistical expression is referred to as the null hypothesis and is written as H0. It is called null because the expression, when completed, implies no difference or relationship, depending on the problem being examined.

When you compare observed data with expected results from normative values this is called testing the null hypothesis.

You can think of hypothesis testing as trying to see if your results are unusual enough so that they would not even be expected by chance.

By chance, your sample could lie at the extremes of the distribution and so you could draw the wrong conclusion. These erroneous decisions are generally referred to as type I and type II errors.

Type I error = Incorrectly deciding to reject a null hypothesis; that is, incorrectly rejecting a true null hypothesis.

Type II error = Incorrectly deciding not to reject a null hypothesis. Failing to reject a false null hypothesis.

Alpha = The level of risk an experimenter is willing to take of rejecting a true null hypothesis. Often called the level of significance, this value is used in establishing a critical value around which decisions (reject or not reject null) are made. It is also common to define alpha as the probability of incorrectly rejecting a true null hypothesis or the probability of making a type I error.

Beta = The level of risk (not under direct control of an experimenter) of failing to reject a false null hypothesis. It is also common to define beta as the probability of making a type II error.

Power = 1 - Beta. The probability of correctly rejecting a false null hypothesis.

*In practice you will never know whether or not you've made a poor decision (made a type I or type II error) but, you can (a) set the probability that you will make a type I error when you select your alpha, and (b) determine beta (through estimating power) to estimate the probability that you made a type II error. Note: since sample size is directly related to power (and so tied to beta), studies will fail to find statistically significant results even when they do exist because of a small sample size.

- If you decrease alpha (more stringent) power will decrease so beta will increase.

- As you increase sample size, power increases so beta decreases.

- As you enhance measurement precision, beta decreases (for a given alpha), so power increases.

- As effect size increases, beta decreases (for a given alpha), so power increases.

Power

Power is the probability of correctly rejecting a false null hypothesis.

Ideally, power should be considered when planning a study, not after it is over. Knowing what you would like power to be you can determine (using software or power charts) what your sample size should be.

For studies examining differences between means, power charts are used to determine the sample size needed to achieve a desired power.

For studies examining relationships, software is available. By hand, the computations are very complex.

For studies estimating proportions or means, the computations can be done by hand.

If power is not considered at the start of a study it should be estimated at the end, particularly when non-significant results arise.

When non-significant results are obtained one of the following has occurred:

Inadequate theory

Evaluation of preliminary evidence incorrect

Design, analysis, or sample size choices faulty

Sample size is closely tied to power. True differences/relationships go unnoticed without enough subjects. On the other hand, trivial differences/relationships can be statistically significant with large sample sizes.

Another factor affecting power is measurement precision. As precision increases, power increases.

The information needed to determine the sample size needed to achieve a particular level of power for differences includes:

Alpha

# of groups (k)

Minimum effect size you want to detect. This would come from a pilot study, literature, or your own opinion on the size of the effect you would like to be able to detect (.80, .50, .30).

Power desired

The information needed to determine the sample size needed to achieve a particular level of power for relationships includes:

Alpha

Standard deviation for dependent and independent variables. This would come from a pilot study or the literature.

Minimum effect you want to detect. Here the effect size is the correlation coefficient itself. Again this would come from a pilot study, literature, or your own opinion on the size of the effect you would like to be able to detect (.80, .50, .30).

Power desired.

Sample size calculations for a t-test

To determine how many subjects to use per group to achieve desired power:

Select alpha and effect size.

Determine df1. Since groups = 2, df1 = 2 - 1 = 1.

Get power chart for df1 = 1.

Find phi in power chart where desired power and the infinity line cross.

Using values determined in steps 1-4, solve for n using sample size equation:
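The chart-based sample size equation itself is not reproduced in this transcript. Purely as a stand-in, the sketch below solves for per-group n with statsmodels' power routines (an analytic approximation, not the phi/power-chart method described in the steps above); the effect size, alpha, and desired power are illustrative choices.

```python
# Sketch: per-group n for an independent t-test, via statsmodels' power routines.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.50,   # medium effect (Cohen's d)
                                           alpha=0.05,
                                           power=0.80,
                                           alternative="two-sided")
print(f"subjects needed per group: {n_per_group:.1f}")        # roughly 64 per group
```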

Expanding to the ANOVA situation:

Sample size calculations for an ANOVA

To determine how many subjects to use per group to achieve desired power:

Select alpha and effect size.

Determine df1 = K -1.

Get power chart for df1 = K-1.

Find phi in power chart where desired power and the infinity line cross.

Using values determined in steps 1-4, solve for n using equation:
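Again as a stand-in for the chart-based equation, the same idea for the k-group ANOVA case can be sketched with statsmodels. The effect size here is Cohen's f rather than d, and the sample size solved for is assumed to be the total across the k groups.

```python
# Sketch: sample size for a one-way ANOVA with k groups, via statsmodels.
from statsmodels.stats.power import FTestAnovaPower

k = 3
total_n = FTestAnovaPower().solve_power(effect_size=0.25,     # medium effect (Cohen's f)
                                        alpha=0.05,
                                        power=0.80,
                                        k_groups=k)
# nobs solved for is taken here to be the total sample size across the k groups.
print(f"total subjects needed: {total_n:.0f} (about {total_n / k:.0f} per group)")
```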

Statistics

Descriptive Statistics

Central tendency

A technique for conveying group information regarding the middle or center of a distribution of scores. Indicates where scores tend to be concentrated.

Measures of central tendency

1. Mode: Score most frequently observed

2. Median: Score that divides the distribution of scores into two equal halves

3. Mean: Arithmetic average of a distribution of scores.

To decide which is appropriate to use consider:

Distribution of scores. When symmetrical the mean=median=mode so all are equally appropriate. When skewed, the median is a more accurate representation of central tendency.

Degree of accuracy needed. The mean considers all scores and is more stable than other measures. When skewed and information on the center is needed an exact median should be calculated.

Level of measurement. You should not calculate a mean or median on categorical or ordinal data. Frequency distribution tables should be used to convey distributional information on categorical and ordinal data. The mean and median can be used when variables are at least interval scaled.

Relationships among measures of central tendency:

When distribution symmetrical, mean = median = mode

When distribution positively skewed, mode < median < mean

When distribution negatively skewed, mode > median > mean

Variability

Measures of central tendency are often not enough by themselves to communicate useful information about distributions of scores. A measure of central tendency is valuable to report anytime you want to summarize data, but, the spread of scores around the center should accompany it.

Measures of variability

Inclusive range: Crude measure. Not stable since it uses only two scores.

Signed deviation: An individual statistic. Represents the signed distance of a raw score from its mean. Not useful as summary statistic. Only conveys information on an individual.

Standard deviation: The average unsigned distance of all scores from the mean.

For categorical data, frequency distribution tables or crosstabulation tables are useful in summarizing information.

Example:

                    Male      Female
Smokes              42%       30%
Does Not Smoke      58%       70%
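A small sketch of these descriptive summaries (central tendency, variability, and a crosstabulation like the smoking-by-gender table above) using pandas is shown below; all the data values are hypothetical.

```python
# Sketch: central tendency, variability, and a crosstabulation of two categorical variables.
import pandas as pd

scores = pd.Series([12, 15, 15, 18, 20, 22, 25])
print("mean:", scores.mean(), " median:", scores.median(),
      " mode:", scores.mode().tolist(), " sd:", round(scores.std(), 2))

survey = pd.DataFrame({
    "gender": ["male"] * 5 + ["female"] * 5,
    "smokes": ["yes", "yes", "no", "no", "no", "yes", "no", "no", "no", "no"],
})
# Column percentages, in the same spirit as the smoking-by-gender table above.
print(pd.crosstab(survey["smokes"], survey["gender"], normalize="columns") * 100)
```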

Correlation

The correlation statistic allows you to examine the strength of the relationship between two variables. This statistic helps you answer questions like 'Do gymnasts with considerable amounts of fast twitch muscle fibers tumble better?' The underlying question can be phrased: is there a relationship between concentration of FTMF and tumbling ability?

There are several types of correlation coefficients to choose from. The choice is based on the nature of the variables being correlated.

Pearson Product Moment Correlation (PPMC)

Use when both variables continuous

Phi

Use when both variables true dichotomies

Point Biserial

Use when one variable continuous and the other a true dichotomy.

A PPMC coefficient describes the strength and direction of the linear relationship between two variables. When two variables are not linearly related, the PPMC is likely to underestimate the true strength of the relationship. A graph of the x and y values can show whether or not the relationship is linear.

When the scores on two variables get larger/smaller together, the direction of the relationship is positive. When scores on one get larger as scores on the other variable get smaller, the direction of the relationship is negative due to the inverse relationship of the two variables. When there is no pattern, there is no relationship.

A PPMC coefficient is a signed number between -1 and 1, where 0 represents no relationship. Presence of a relationship should never be interpreted as demonstrating cause and effect. Remember, the negative sign simply conveys direction. The farther away from zero in either direction, the stronger the relationship.

The PPMC is affected by the variability of the scores collected. Other things being equal, the more homogeneous the group (on the variables being measured) the lower the correlation coefficient. Since small groups tend to be more homogeneous, Pearson is most meaningful and most stable when group size is large (>50).

A point biserial correlation coefficient tells you the strength of the relationship between one continuous and one dichotomous variable. The sign carries little meaning. It only indicates which group tended to have higher scores. The point biserial coefficient is a signed number between -1 and 1 where again zero represents no relationship.

A phi correlation coefficient tells you the strength of the relationship between two dichotomous variables. The sign carries little meaning. It only indicates which diagonal had the greater concentration of scores. The phi coefficient is a signed number between -1 and 1 where again zero represents no relationship.
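The choice among these three coefficients can be sketched as follows, assuming Python with scipy. The fast-twitch/tumbling numbers and the dichotomous codes are hypothetical, and phi is obtained here by applying the Pearson formula to 0/1-coded variables (numerically equivalent for two true dichotomies).

```python
# Sketch: choosing a correlation coefficient based on the nature of the variables.
import numpy as np
from scipy import stats

# Both continuous -> Pearson product moment correlation (PPMC)
x = np.array([55.0, 60.2, 48.5, 71.3, 66.0, 58.4])   # hypothetical % fast twitch fibers
y = np.array([6.1, 7.0, 5.2, 8.4, 7.8, 6.5])          # hypothetical tumbling score
print("PPMC:", stats.pearsonr(x, y)[0])

# One continuous, one true dichotomy -> point biserial
group = np.array([0, 0, 0, 1, 1, 1])                  # hypothetical dichotomy
print("point biserial:", stats.pointbiserialr(group, y)[0])

# Both true dichotomies -> phi (equal to Pearson r computed on 0/1 coded variables)
a = np.array([0, 0, 1, 1, 1, 0])
b = np.array([0, 1, 1, 1, 0, 0])
print("phi:", stats.pearsonr(a, b)[0])
```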

Inferential Statistics

Once a statistical statement of the null hypothesis is made decisions must be made regarding what statistic to use to test the null hypothesis. Two broad categories of statistics are available: Parametric and non-parametric. All other things being equal more power exists with parametric statistics. However, even when a researcher wants to use a parametric statistic it is not always possible. Prior to using any statistic you must first check to see whether the assumptions associated with the statistic are met.

Parametric & Non-parametric Statistical Tests

A parametric statistical test specifies certain conditions about the distribution of responses in the population from which the research sample was drawn. Since these conditions are not ordinarily able to be tested, they are assumed to hold. The meaningfulness of the results of a parametric test depends on whether or not these assumptions have been met. These assumptions typically include (a) normality, (b) homogeneity of variance, (c) a randomly drawn sample, and (d) at least interval scaled data.

A non-parametric statistical test is based on a model that specifies only very general conditions and none regarding the specific form of the distribution from which the sample was drawn or the level of measurement required. Additionally, non-parametric procedures often test different hypotheses than parametric procedures. Assumptions associated with most non-parametric tests include (a) observations are independent, (b) the sample is randomly drawn, and (c) variable(s) have underlying continuity. These are less stringent than parametric assumptions.

In choosing a statistical test for use in testing a hypothesis you should consider

(a) The applicability of the test, and

(b) the power and efficiency of the test.

Applicability refers to the type of analysis needed, level of measurement, and whether or not assumptions have been met. Power refers to the probability of correctly rejecting a false null hypothesis. Efficiency refers to the simplicity of the analysis as well as design considerations.

All other things being equal, parametric tests are more powerful than non-parametric tests provided all the assumptions are met. However, since power may be enhanced by increasing N and parametric assumptions are difficult to meet, non-parametric procedures become very important.

Advantages of Non-parametric Statistical Tests

If the sample is very small, distributional assumptions (tied to parametric tests) are not likely to be met. Therefore, an advantage is that no distributional assumptions must be met.

Non-parametric tests can be applied to variables at any level of measurement.

Non-parametric tests are available for categorical data.

Non-parametric tests are available for treating samples made up of observations from several different populations.

Interpretations are direct and often less complex than parametric findings.

Disadvantages of Non-parametric Tests

Less powerful when parametric assumptions have been met.

They are not systematic though common themes do exist.

Tables often necessary to compare findings to are scattered widely and appear in different formats.

Differences - Parametric Tests

To examine whether or not there is a statistically significant difference in means on some dependent variable (continuous) as a function of some independent variable (categorical) you can use the t-test when you have just two levels of the independent variable (ex: gender) or you can use the ANOVA procedure when you have two or more levels of the independent variable (ex: ethnicity).

Independent t-test: Statistical procedure for testing the H0 that two means are equivalent when the two levels of the independent variable are not related.

One-way ANOVA: Statistical procedure for testing the H0 that two or more means are equivalent when the two or more levels of the independent variable are not related.

Assumptions of the independent t-test and ANOVA procedure:

Homogeneity of variance - is the variability of the dependent variable in the population similar for each level of the independent variable? You examine this assumption by comparing the largest and smallest standard deviations for the groups in your sample; if they are similar, the assumption is considered met.

It is important to look beyond statistical significance for practical significance. For example, with N = 102 and α = .05, an rxy of .20 is statistically significant, but we know intuitively this is not a strong (or useful) correlation.

To assess practical significance

Calculate a coefficient of determination (rxy²). This value indicates the proportion of variance in the dependent variable that can be explained by the other variable.

Ex: If rxy = .60, rxy² = .36. So, 36% of the variance in the DV can be explained by the IV. Left unexplained is 1 - rxy² = .64, or 64%.

Note: Outliers can significantly affect rxy. All outliers should be critically examined before leaving them in the analysis. If the values are legitimate and your sample size is substantial leave them in the analysis.

Partial correlation

Sometimes a correlation between two variables is due to their dependence on a 3rd variable.

Ex: Any set of variables that increase with age (shoe size & intelligence) - if you remove (control for) the effects of age, correlation could change in direction and/or strength.

A partial correlation procedure allows you to hold constant a 3rd variable and look at a 'truer' correlation between x and y.
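A sketch of a first-order partial correlation, using the standard formula r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz²)(1 - r_yz²)), is shown below. It assumes Python with numpy and scipy; the shoe size, test score, and age values are hypothetical, chosen only to mimic the "both increase with age" situation.

```python
# Sketch: first-order partial correlation of x and y, controlling for z.
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    r_xy = stats.pearsonr(x, y)[0]
    r_xz = stats.pearsonr(x, z)[0]
    r_yz = stats.pearsonr(y, z)[0]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

age   = np.array([6, 7, 8, 9, 10, 11, 12], dtype=float)      # hypothetical ages
shoe  = np.array([1.0, 1.8, 2.1, 2.6, 3.4, 3.9, 4.7])        # shoe sizes rising with age
score = np.array([21.0, 23, 28, 30, 34, 37, 43])             # test scores rising with age

print("zero-order r :", stats.pearsonr(shoe, score)[0])
print("partial r    :", partial_corr(shoe, score, age))       # age held constant
```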

Regression

This is the most common approach to prediction problems when you have one dependent variable and multiple independent variables.

Assumptions

Errors are independent and normally distributed

Homoscedasticity (variability of y's at each x similar)

Linearity (lack of fit of linear model)

Dependent variable at least interval scaled

Hypothesis testing for significant regression

H0: b = 0

Values from an analysis of variance table (which partitions the variance into regression (explained) and residual (unexplained)) can be used to (a) test the lack of fit assumption, (b) then, if assumptions are met, test for a significant regression, and (c) examine practical significance.
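A minimal sketch of testing H0: b = 0 for a simple regression, assuming Python with statsmodels, is shown below; the x and y values are hypothetical, and the overall F test and R-squared correspond to the statistical and practical significance questions raised above.

```python
# Sketch: testing H0: b = 0 and examining variance explained for a simple regression.
import numpy as np
import statsmodels.api as sm

x = np.array([2.0, 3.5, 5.1, 6.0, 7.2, 8.8, 10.1])     # hypothetical independent variable
y = np.array([11.0, 13.2, 17.8, 19.1, 22.5, 26.0, 30.2])

X = sm.add_constant(x)          # adds the intercept term
model = sm.OLS(y, X).fit()
print("p value for overall regression (H0: b = 0):", model.f_pvalue)
print("R-squared (proportion of variance explained):", model.rsquared)
print("intercept and slope estimates:", model.params)
```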

Data reduction using Stepwise regression

A Regression procedure called stepwise regression analyzes a set of independent variables in such a way that it finds the most potent variable(s) with respect to their relationship to the dependent variable. An excellent exploratory analytical technique.

Discriminant Function Analysis

This technique is used when the interest is in examining several independent variables with respect to their ability to discriminate between groups. This is very similar to multiple regression. In fact, conceptually, the only difference is that discrimination between groups rather than prediction of a score is the object.

Factor Analysis

Also a data reduction procedure, this technique is used when the interest is in examining the underlying structure to several variables or items on a test. Both exploratory and confirmatory techniques exist.

Non-parametric analyses for relationship problems

Non parametric statistics are needed when (a) the variables being related are categorical or ordinal, or (b) when the assumptions associated with parametric statistics have been violated.

With categorical/ordinal data, descriptive analyses:

Charts

Frequency distribution tables

Cross tabulated tables

are informative, however, it is often desirable to conduct a test of the null hypothesis that two categorical/ordinal variables are not related.

The statistic that will test for the presence of a relationship between two categorical variables (though it can also be used on ordinal data) is the chi-square statistic. The null hypothesis used to examine this is that there is no relationship. Another way to say this is that the variables x and y are independent. In fact, the chi-square statistic is commonly referred to as the chi-square test of independence.

Assumptions

The expected frequency in all cells is at least 5.

Data must be random samples from multinomial distributions.

For example, is there a relationship between level of ability of athletes and willingness to spend time on a task for someone else? Assume return rate (of a survey or other information) is considered willingness to spend time for someone else's benefit. The information in the table below then represents return rate by level.

Spend Time     Elite     College     Intramural
Yes            10        32          35
No             62        40          37

To examine the expected frequency assumption you need an expected frequencies table. Each cell in the expected frequencies table should have at least 5 cases.

To determine if this chi square is statistically significant, you compare it to a critical value found in a chi square table. The degrees of freedom for a chi square statistic are:

df = (R-1)(C-1): Where R = # of rows, and C = # of columns in the two-way table.

The degrees of freedom for this problem are 2, so the critical value for an alpha of .01 is 9.21. Therefore, if the chi-square statistic for this problem is > 9.21, you can reject the null hypothesis, which suggests that there is a statistically significant relationship between level of ability and willingness to spend time on a task for someone else.

This does not necessarily mean that the relationship is of any practical significance. At this point all you know is that the variables in question are not independent. You should not stop here and claim you have something special to report.

Since the chi square statistic is sensitive to sample size, just about any two variables can be found to be related statistically given a large enough sample size. So, to examine practical significance you assess the strength of the association between variables using phi or Cramer's V.

Use Phi for 2X2 tables:

Use Cramer's V for larger tables (Cramer's V and Phi are equivalent for smaller tables)

With chi square based measures you cannot say much beyond the strength of the relationship. No predictive interpretation is possible.
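The chi-square test of independence and the Cramer's V follow-up for the return-rate table above can be sketched as follows, assuming Python with scipy; the counts are those given in the example table.

```python
# Sketch: chi-square test of independence plus Cramer's V for practical significance.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[10, 32, 35],     # Yes: Elite, College, Intramural
                  [62, 40, 37]])    # No
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
print("smallest expected frequency:", expected.min())   # check the >= 5 assumption

n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"Cramer's V = {cramers_v:.2f}")
```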

Meta Analysis

Not a procedure for the analytically faint of heart. Much remains unsettled and is the source of considerable disagreement among many researchers. At the very least, careful thought and research need to be undertaken before beginning a meta analysis project regarding the questions raised in the Thomas & Nelson text:

What should be used as the standard deviation when calculating an ES?

Because sample ESs are biased estimators of the population of ESs, how can this bias be corrected?

Should ESs be weighted for their sample size?

Are all ESs in a sample from the same population of ESs? This is the "apples and oranges" issue: Is the sample of ESs homogeneous?

What are appropriate statistical tests for analyzing ESs?

If a sample of ESs includes outliers, how can they be identified?

VALIDITY OF DATA COLLECTION PROCESSES

Validity of data collection addresses the question of whether a data collection process is really measuring what it purports to be measuring. A data collection process is valid to the extent that the results are actually a measurement of the characteristic the process was designed to measure, free from the influence of extraneous factors. Validity is the most important characteristic of a data collection process.

A data collection process is invalid to the extent that the results have been influenced by irrelevant characteristics rather than by the factors the process was intended to measure. For example, if a teacher gives a reading test and the test does not really measure reading performance, the test is useless. There is no logical way that the invalid test can help the teacher measure the outcome in which she is interested. If she gives a self-concept test that is so difficult to read that the third graders taking it are unable to interpret the tasks correctly, the test cannot validly measure self-concept among those students. It is invalid for that purpose, because it is so heavily influenced by reading skills that self-concept is not likely to come to the surface. This test cannot help the teachers make decisions about the outcome variable "self-concept." For example, if they ran a self-concept program for their students and their students' "self-concept" scores improved, how could they know whether it was really self-concept and not just reading ability that improved? In designing and carrying out any sort of data collection process, therefore, validity is of paramount importance.

As we said with regard to reliability, it is important to keep in mind that it is the validity of the data collection process - not of the data collection instrument - that must be demonstrated. What we really want to do is strengthen the validity of the conclusions we draw based on the data collection process; we don't want to draw conclusions based on the measurement of the wrong outcomes. It is technically incorrect to refer to the validity of a test. A test, a checklist, an interview schedule, or any other data collection device that is valid in one setting or for one purpose may be invalid in another setting or for another purpose. Therefore, this chapter always refers to the validity of data collection processes. It is important to remember this distinction.

SOURCES OF INVALIDITY

What makes a data collection process valid or invalid? A data collection process is valid to the extent that it meets the triple criteria of (1) employing a logically appropriate operational definition, (2) matching the items to the operational definition, and (3) possessing a reasonable degree of reliability. Invalidity enters the picture when the data collection strategy fails seriously with regard to one of these criteria or fails to lesser degrees on a combination of them.

It may be instructive to look at some examples of invalid data collection processes. Assume that a researcher wants to develop an intelligence test. He operationally defines intelligence as follows: "A person is intelligent to the extent that he/she agrees with me." He then makes up a list of 100 of his opinions and has people indicate whether they agree or disagree with each item on the list. A person agreeing with 95 of the items would be defined as more intelligent than one who agreed with 90, and so on. This is an invalid measure of intelligence, because the operational definition has nothing to do with intelligence as any reputable theorist has ever defined it.

Not all invalid data collection processes are so blatantly invalid. Indeed, one of the most heated arguments in psychology today is over the question of what intelligence tests actually measure. This whole question is one of validity. The advocates of many IQ tests argue that intelligence can be defined as general problem-solving ability. They operationally define intelligence as something like, "People are intelligent to the extent that they can solve new problems presented to them." They test for intelligence by giving a child a series of problems and counting how many she can solve. A child who can solve a large number of problems is considered to be more intelligent than one who can solve only a few. The opponents of such tests argue that the tests are invalid. They say that general problem-solving ability is not the only quality - or even the most important one - required to do well on such tests. The tests, they argue, really measure how well a person has adapted to a specific middle-class culture. Success on such tests, therefore, is really an operational definition of "ability to adapt to middle-class culture." Since the test is designed to measure intelligence but really measures a different ability, it is invalid. The argument over the validity of IQ tests is far from settled. Important theorists continue to line up on both sides, and others continue to suggest compromises - such as recommending new tests or redefining the concept of intelligence.

Consider another hypothetical intelligence test. Assume that we ask the child one question directly related to a valid operational definition. This is an excessively short test, and thus it is likely to provide an unreliable estimate of intelligence. Our result is also likely to be invalid, because our conclusion that a child is a genius for answering 100% of the questions correctly is about as likely to be a result of chance factors (unreliability) as it is to be a result of real ability related to the concept of intelligence.

The factors that determine the validity of a data collection process are diagrammed in Figure 5.1. The first test cited in this section was invalid because the operational definition was inappropriate. In the second case, the operational definition was logically appropriate, but it was not clear whether the tasks the child performed were really related to this operational definition. The final IQ test was considerably limited in its validity because it was unreliable.

To the extent that there is a complete breakdown at any of these stages, the data collection process is invalid. Likewise, if there is a cumulative breakdown at several stages, the data collection process can be invalid.

Figure 5.1 Factors Influencing Test Validity

ESTABLISHING VALIDITY

From the preceding discussion, it can be seen that there are three steps to establishing the validity of a data collection process designed to measure an outcome variable:

1. Demonstrate that the operational definition upon which the data collection process is based is actually a logically appropriate operational definition of the outcome variable under consideration. The strategy for demonstrating logical appropriateness was discussed in detail in chapter 4, where we pointed out that operational definitions are not actually synonymous with the outcome variable but rather represent the evidence that we are willing to accept to indicate that an internal behavior is occurring. Table 5.2 lists some cases where the operational definitions are, to varying degrees, logically inappropriate. For example, if the instructors in English 101 administer an anonymous questionnaire at the end of the semester to evaluate their performance in the course, they might think that the students are responding to questions about how they performed during the course. However, it is possible that the students completing the questionnaire are thinking, "If we tell them what we really think, they'll be upset and come down hard on us when they grade the exam. I think we should play it safe and give them good ratings for the course." If this is what students are thinking, then the favorable comments on the questionnaire are actually an operational definition of "anxiety over alienating the instructor" rather than of "quality teaching."

In many cases, the logical connection is easy to establish, and hence the logical fallacies found in Table 5.2 are often easy to avoid. For example, the connections between the operational definitions and the outcome variables in Table 5.3 are much more obvious than the connections in Table 5.2. It is still possible for a person to perform the behaviors described in the operational definitions without having achieved the outcome variable, but it is much less likely than was the case in the situations in Table 5.2.

Logical inappropriateness is most likely to occur when the outcome variable under consideration is a highly internalized one. Affective outcomes present particularly difficult problems, because the evidence is much less directly connected to the internal outcome than is the case with behavioral, psychomotor, and cognitive outcomes. The guidelines presented in chapter 4 are applicable here - namely, rule out as many alternative explanations as possible, and use more than one operational definition.

Table 5.2 Some Examples of Logically Inappropriate Operational Definitions of Outcome Variables

Assumed Outcome Variable | Operational Definition | Conceivable Real Outcome Variable
Ability to understand reading passages | The pupil paraphrases a passage he/she has read silently | Ability to guess from context clues
Love of Shakespearean drama | The student will carry a copy of Shakespeare's plays with him to class | Eagerness to impress professor
Appreciation of English 101 | The students will indicate on a questionnaire that they liked the course | Anxiety over alienating instructor
Knowledge of driving laws | The candidate will get at least 17 out of 20 true-false questions right on the license test | Ability to take true-false tests with subtle clues present in them
Friendliness toward peers | The pupil will stand near other children on the playground | Anxiety over being beaten up if he or she stands apart
Appreciation of American heritage | Child will voluntarily attend the Fourth of July picnic given by the American Legion | Appreciation of watching fireworks explode

Table 5.3 Some Examples of Operational Definitions That Are Almost Certain to Be Appropriate for the Designated Outcome Variables

Outcome Variable | Operational Definition
Ability to add single-digit integers | The student will add single-digit integers presented to him ten at a time on a test sheet
Ability to tie one's own shoes | The student will tie her own shoes after they have been presented to her untied
Ability to bench press 150 pounds | The student will bench press 150 pounds during the test period in the gymnasium
Ability to spell correctly from memory | The student will write down from memory the correct spelling of each word given in dictation
Ability to spell correctly on essays with use of dictionary | The student will make no more than two spelling errors in a 200-word essay written during class with the aid of a dictionary
Ability to type 60 words per minute | The student will type a designated 300-word passage in five minutes or less
Ability to raise hand before talking in class | The student will raise his hand before talking in class
Ability to recall the quadratic equation | The student will write from memory the quadratic equation
Ability to apply the quadratic equation | Given the quadratic equation and ten problems that can be solved using the equation, the student will solve at least nine correctly

2. Demonstrate that the tasks the respondent has to perform to generate a score during the data collection process match the task suggested by the operational definition. The benefits of stating operational definitions can be completely nullified if the tasks that generate a score during the data collection process do not match the tasks stated in the operational definitions.

Table 5.4 provides examples of such mismatches. The first three are not intended to be facetious; mismatches this obvious actually do occur on teacher-designed tests. They say they are going to measure one thing, and then they measure something else. The other examples in Table 5.4 are more subtle. In these cases, the teacher has one behavior in mind, and in fact many of the persons responding to the data collection process will perform the behavior anticipated by the teacher. But the mismatch occurs whenever a respondent performs the different or additional tasks indicated in the second column of the table.

Table 5.4 Some Examples of a Mismatch Between the Operational Definition and the Task the Respondent Has to Perform on the Instrument

Operational Definition | Task on Instrument
The student will add single-digit integers presented to him ten at a time on a test sheet | "If I have three apples and you give me two more apples, how many do I have?"
The student will solve problems using the quadratic equation | "Explain the derivation of the quadratic equation."
The student will use prepositions correctly in her essays | "Write the definition of a preposition."
The student will apply the principles of operant conditioning to hypothetical situations | The student first has to unscramble a complex multiple-choice thought pattern and then apply the principles
Given a (culturally familiar) novel problem to solve, the test taker will be able to solve the problem | The student is presented with a problem entirely foreign to his cultural background
The student will describe the relationship between nuclear energy and atmospheric pollution | The student will write, in correct grammatical structures, a description of the relationship between nuclear energy and atmospheric pollution
The student will circle each of the prepositions in the paragraph provided | The student will first decipher the teacher's unintelligible directions and then circle each of the prepositions
The respondent will place herself in the simulated job situation provided to her and will indicate how she would perform in that situation | The respondent has to first ignore that the situation is absurdly artificial and highly different from the real world and then still respond as she would perform in the hypothetical situation

When questions arise concerning various sorts of bias in the data collection process, it is often the mismatch between task and operational definition that is being challenged. For example, with regard to bias in IQ tests, one of the most common arguments is essentially that middle-class youngsters who take the test are actually performing behaviors related to the operational definition, whereas equally intelligent lower-class youngsters are taking a test where there is a discrepancy between what they are doing and the operational definition of intelligence.

It is important to be aware of the various kinds of bias and other contaminating factors that could cause such discrepancies, and to carefully rule these out. Sources of mismatching include cultural bias, test-wiseness, reading ability, writing ability, ability to put oneself in a hypothetical framework, tendency to guess, and social-desirability bias. This list is not exhaustive; there are other factors unique to specific individuals that produce a similar effect. A good way to assure a match is to have several different qualified persons examine the data collection process and state whether the task matches the operational definition.

A special type of mismatch between operational definition and task is worth mentioning. Some data collection strategies are so obtrusive that the respondent is more likely to be responding to the data collection process itself than to be performing the tasks indicated in the operational definition. For example, if a child knows that a questionnaire is measuring prejudice and that it is not nice to be prejudiced, the child may answer what he thinks he should answer instead of revealing his true attitude. (This is referred to as a social-desirability bias.) Likewise, if a researcher comes into the classroom and sits in a prominent position with a behavioral checklist, children may be acutely aware that something unusual is happening, and so the behavior recorded on the checklist is more a reaction to the data collection strategy than an indication of actual behavioral tendencies. (Specific strategies for overcoming obtrusiveness are discussed in chapter 6.)

3. Demonstrate that the data collection process is reliable. Reliability was discussed extensively earlier in this chapter. The contribution of reliability to validity was mentioned in Figure 5.1 and in the accompanying discussion. The relationship between reliability and validity is diagrammed more specifically in Figure 5.2. As this diagram suggests, a certain amount of reliability is necessary before a data collection process can possess validity. In other words, a data collection process cannot measure what it is supposed to measure if it measures nothing consistently. In demonstrating that data collection processes are valid, professional test constructors first demonstrate that their data collection processes are reliable - that they measure something consistently - and then demonstrate that this something is the characteristic that the data collection processes are supposed to measure. In other words, they first demonstrate reliability in several ways, and then they demonstrate validity.

An important caution is necessary in discussing the relationship between reliability and validity. It is crucial to realize that it is possible (but undesirable and inappropriate) to increase reliability while simultaneously reducing the validity of a data collection process. This can be done by either (1) narrowing or changing the operational definition so that it is no longer logically appropriate or (2) changing the tasks based on the operational definition to less directly related tasks, and then (3) devising a more reliable data collection process based on the more measurable but less appropriate operational definition or tasks.
This is obviously a bad idea, because the result is that the data collection process now measures a less valid or wrong outcome "more reliably."

Such an increase in reliability accompanied by a reduction in validity occurs, for example, if a teacher introduces unnecessarily complex language into a data collection process. A data collection process that had previously measured "ability to apply scientific concepts" might now instead measure "ability to decipher complex language and then apply scientific concepts." The resulting reliability might be higher; but if the teacher is still making decisions about the original outcome, the data collection process has become less valid.

Overemphasis on reliability is one of the arguments against culturally biased norm-referenced tests. Their detractors argue that many standardized tests become more reliable when cultural bias is added, because such bias is a relatively stable (consistent) factor, which is likely to work the same way on all questions and on all administrations of the test. However, the cultural bias detracts from the validity of the test.

It is important to be alert to the tendency to accept spuriously high statistical estimates of reliability as solid evidence of validity. The fact that a certain amount of reliability is a necessary prerequisite for validity does not mean that the most reliable data collection process is also the most valid. Statistical reliability is only one factor in establishing the validity of a data collection process. Another way to state this is to say that reliability is a necessary but not sufficient condition for validity. As you can see, establishing validity is predominantly a logical process.

Finally, before leaving this introduction to the validity of data collection processes, it is important to note that a data collection process that provides valid data for group decisions will not always provide valid data for decisions about individuals. On the other hand, a data collection process that provides valid data for decisions about individuals will always provide valid data for group decisions. This is not as complicated as it sounds. To take an example, we might operationally define appreciation of Shakespeare as "borrowing Shakespearean books from the library without being required to do so." Even if Janet Jones borrows books on Shakespeare without being required to do so, it is not possible to diagnose her specifically as either appreciating or not appreciating the bard using this operational definition. There are too many competing explanations for her behavior, and these would invalidate this data collection process as an estimate of her appreciation. (For example, she might hate the subject but need to pass the exam, and so she has to borrow a vast number of books to do burdensome, additional studying. Or she might like Shakespeare so much that she owns annotated copies of all the plays and never has to borrow from any library except her own.) Nevertheless, it may still be valid to evaluate the group based on this operational definition. If you teach the Shakespeare plays a certain way one year and only 2% of the students ever borrow related books from the library, and the next year you teach the same subject differently and 50% of the students spontaneously borrow books, it is probably valid to infer from their available documented records that appreciation of Shakespeare has increased. The group decision, at any rate, is more likely to be valid than is the individual diagnosis.

Box 5.1

An Argument-Based Approach to Validity

Kane (1992) presents the practical yet sophisticated idea that validity should be discussed in terms of the practical effectiveness of the argument to support the interpretation of the results of a data collection process for a particular purpose. The researcher or user of the research chooses an interpretation of the data, specifies the interpretive argument associated with that interpretation, identifies competing interpretations, and develops evidence to support the intended interpretation and refute the competing interpretations. The amount and type of evidence needed in a particular case depend on the inferences and assumptions associated with a particular application.

The key points in this approach are that the interpretive argument and the associated assumptions be stated as clearly as possible and that the assumptions be carefully tested by whatever strategies will best rule out bias and other sources of faulty conclusions. As the most questionable inferences and assumptions are checked and either supported by the evidence or adjusted so that they become more plausible, the plausibility (validity) of the interpretive argument increases.

This interpretation of validity is compatible with the discussion presented in this chapter. In addition, it has the advantage of presenting validity as a special instance of the overall application of formal and informal reasoning to solving problems. From this viewpoint, when educators do research, they are under the same obligation as any other person making public statements to demonstrate that those statements really do mean what the speaker or writer says they mean. Statistical procedures and other specific techniques are merely pieces of evidence to check the quality of inferences and the authenticity of the assumptions underlying a particular interpretation.

(Source: Kane, M. T. [1992]. An argument-based approach to validity. Psychological Bulletin, 112, 527-535.)

REVIEW QUIZ 5.4

Part 1

Identify the item from each pair that is most likely to be an invalid measure of the outcome variable given in parentheses.

Set 1.
a. The child will correspond intelligibly with an assigned Spanish-speaking pen pal. (understands Spanish)
b. The child will correspond intelligibly with an assigned Spanish-speaking pen pal. (appreciates Spanish culture)

Set 2.
a. The student will identify examples of the principles of physics in the kitchen at home. (understands principles of physics)
b. The student will choose to take optional courses in the physical sciences. (appreciates physical sciences)

Part 2

Write Invalid next to statements that indicate an invalid data collection process; write Valid next to those that indicate a valid data collection process; write N if no relevant information regarding validity is contained in the statement.

1. ____ The questions were so hard that I was reduced to flipping a coin to guess the answers.

2. ____ The test measures mere trivia, not the important outcomes of the course.
3. ____ To rule out the influence of memorized information regarding a problem, only topics that were entirely novel to all the students were included on the problem-solving test.
4. ____ The only way he got an A was by having his girlfriend write the term paper for him.
5. ____ The length of the true-false English test was increased from 30 to 50 items to minimize the chances of getting a high score by guessing.
6. ____ The teacher ruled out the likelihood of cheating by giving each of the students seated at the same table a different form of the test.
7. ____ Since the personality test had such a difficult vocabulary level, it probably was influenced more by intelligence than by personality factors.
8. ____ The observer rated the classroom as displaying a hostile environment toward handicapped people, but the teacher argued that the observer's judgment was clouded because she observed from a position next to students who were not at all typical of the entire class.
9. ____ The observer rated the atmosphere of the school board meeting as being supportive of innovative teaching, but the newspaper critic pointed out that this was because the board members were local residents with business interests and were therefore very likely to be supportive of innovation.

If you got most of the questions in Review Quiz 5.4 correct, or if you easily saw the logic of the explanations, then you probably have a good basic grasp of the concept of validity. If you do not understand the concept, reread the chapter to this point, check the chapter in the workbook, refer to the recommended readings, or ask your instructor or a peer for help. Be sure that you understand the summary in the following paragraph so that you will profit from the rest of this chapter.

In summary, validity refers to whether a data collection process really measures what it is designed to measure. Invalidity occurs to the extent that the data collection process measures an incorrect variable or no consistent variable at all. The main sources of invalidity are logically inappropriate operational definitions, mismatches between operational definitions and the tasks employed to measure them, and unreliability of data collection processes. Validity is not an all-or-nothing characteristic; data collection processes range from strong validity to weak validity. Because of the highly internalized nature of educational outcomes, data collection processes in education can never be perfectly valid. By carefully stating appropriate operational definitions, ascertaining that the tasks employed in data collection processes are directly related to the operational definitions, and designing reliable data collection processes, we can increase the validity of our data collection processes and the probability that we will draw valid conclusions from them.

SPECIFIC, TECHNICAL EVIDENCE OF MEASUREMENT VALIDITY

If you read a test manual or look up the citation of a test in The Mental Measurements Yearbook (Kramer & Conoley, 2002), you will find references to three basic types of evidence to support measurement validity. These have been defined by several major organizations interested in mental measurement (American Educational Research Association et al., 1985). The technical types of evidence for validity are rooted in the theory discussed earlier in this chapter, and it is not difficult to achieve a fundamental understanding of these concepts.
A brief discussion of these types of evidence for validity can help teachers and researchers develop more valid data collection processes for their own use. In addition, an understanding of these concepts will be especially useful when selecting or using standardized tests, reading the professional literature, and attempting to measure psychological or theoretical characteristics beyond those that are typically covered by classroom tests. The three types of evidence for validity are (1) content validity, (2) criterion-related validity, and (3) construct validity.

Content Validity

Content validity refers to the extent to which a data collection process measures a representative sample of the subject matter or behavior that should be encompassed by the operational definition. A high school English teacher's midterm exam, for example, lacks content validity when it focuses exclusively on what was covered in the last two weeks of the term and inadvertently ignores the first six weeks of the grading period. Likewise, a self-concept test would lack content validity if all the items focused on academic situations, ignoring the impact of home, church, and other factors outside the school. Content validity is assured by logically analyzing the domain of subject matter or behavior that would be appropriate for inclusion in a data collection process and examining the items to make sure that a representative sample of the possible domain is included. In classroom tests, a frequent violation of content validity occurs when test items are written that focus on the knowledge and comprehension levels (because such items are easy to write), while ignoring important higher levels, such as synthesis and application of principles (because such items are difficult to write).

Criterion-Related Validity

Criterion-related validity refers to how closely performance on a data collection process is related to some other measure of performance. There are two types of criterion-related validity: predictive and concurrent.

Predictive validity refers to how well a data collection process predicts some future performance. If a university uses the Graduate Record Exam (GRE) as a criterion for admission to graduate school, for example, the predictive validity of the GRE must be known. This predictive validity would have been established by administering the GRE to a group of students entering a school and determining how their performance on the GRE corresponded with their performance in that school. It would be expressed as a correlation coefficient. A high positive coefficient would indicate that persons who did well on the GRE tended to do well in graduate school, whereas those who scored low on the GRE tended to perform poorly in school. A low correlation would indicate that there was little relationship between GRE performance and success in that particular graduate school.

Concurrent validity refers to how well a data collection process correlates with some current criterion - usually another test. It "predicts" the present. At first glance it sounds like an exercise in futility to predict what is already known, but more careful consideration suggests two important uses for concurrent validity. First, it is a useful predecessor to predictive validity. If the GRE, for example, does not even correlate with success among those who are going to school right now, then there is little value in doing the more expensive, time-consuming predictive validity study.
Second, concurrent validity enables us to use one measuring strategy in place of another. If a university wants to require that students either take freshman composition or take a test to "test out" of the course, concurrent validity would enable the English department to demonstrate that a high score on the alternative test has a meaning similar to a high grade in the course. Like predictive validity, concurrent validity is expressed by a correlation coefficient.

Construct Validity

Construct validity refers to the extent to which the results of a data collection process can be interpreted in terms of underlying psychological constructs. A construct is a label or hypothetical interpretation of an internal behavior or psychological quality - such as self-confidence, motivation, or intelligence - that we assume exists to explain some observed behavior. Construct validity often necessitates an extremely complicated process of validation. To state it briefly, the researcher develops a theory about how people should perform during the data collection process if it really measures the alleged construct and then collects data to see whether this is what really happens. The process is complicated because the researcher is doing two separate things: (1) proving that the data collection process possesses construct validity and (2) refining the theory.
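As a minimal illustration of the criterion-related validity coefficients discussed above (the numbers are hypothetical, not from the text), both predictive and concurrent validity amount to a Pearson correlation between scores on the data collection process and scores on a criterion measure; the sketch assumes the numpy library is available.

    import numpy as np

    # Hypothetical predictor scores (e.g., GRE) and criterion scores (e.g., graduate GPA)
    gre_scores = np.array([310, 325, 300, 335, 315, 290, 320, 330])
    grad_gpa   = np.array([3.2, 3.7, 3.0, 3.9, 3.4, 2.8, 3.5, 3.8])

    r = np.corrcoef(gre_scores, grad_gpa)[0, 1]   # Pearson correlation coefficient
    print(f"Criterion-related validity coefficient r = {r:.2f}")

A high positive r would support using the predictor in place of, or as a forecast of, the criterion; a coefficient near zero would indicate little criterion-related validity.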