Chapter 7: Evaluating What a Test Really Measures
Transcript
Page 1: Chapter 7 Evaluating What a Test Really Measures.

Chapter 7: Evaluating What a Test Really Measures

Page 2: Chapter 7 Evaluating What a Test Really Measures.

Validity

APA – Standards for Educational and Psychological Testing (1985) – recognized three ways of deciding whether a test is sufficiently valid to be useful.

Page 3: Chapter 7 Evaluating What a Test Really Measures.

Validity: Does the test measure what it claims to measure?

The appropriateness with which inferences can be made on the basis of test results.

Page 4: Chapter 7 Evaluating What a Test Really Measures.

Validity

There is no single type of validity appropriate for all testing purposes.

Validity is not a matter of all or nothing, but a matter of degree.

Page 5: Chapter 7 Evaluating What a Test Really Measures.

Types of Validity

Content
Criterion-Related (concurrent or predictive)
Construct
Face

Page 6: Chapter 7 Evaluating What a Test Really Measures.

Content Validity

Whether items (questions) on a test are representative of the domain (material) that should be covered by the test.

Most appropriate for tests such as achievement tests (i.e., concrete attributes).

Page 7: Chapter 7 Evaluating What a Test Really Measures.

Content Validity – Guiding Questions:

1. Are the test questions appropriate, and does the test measure the domain of interest?

2. Does the test contain enough information to cover appropriately what it is supposed to measure?

3. What is the level of mastery at which the content is being assessed?

***NOTE – Content validity does not involve statistical analysis.

Page 8: Chapter 7 Evaluating What a Test Really Measures.

Obtaining Content Validity

Two ways:

Define the testing universe and administer the test.

Have experts rate "how essential" each question is (1 – essential, 2 – useful but not essential, 3 – not necessary). Questions are considered valid if more than half of the experts indicate the question is "essential".
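As a rough sketch of that expert-rating approach (the panel size, item names, and ratings below are invented; the keep/drop rule is simply the slide's "more than half rate it essential" criterion):

```python
# Tally expert ratings per item on the 1-3 scale from the slide:
# 1 = essential, 2 = useful but not essential, 3 = not necessary.
# Ratings are hypothetical, one per expert.
ratings = {
    "item_1": [1, 1, 2, 1, 1],
    "item_2": [2, 3, 1, 2, 3],
    "item_3": [1, 2, 1, 1, 3],
}

for item, panel in ratings.items():
    n_essential = sum(1 for r in panel if r == 1)
    keep = n_essential > len(panel) / 2   # "more than half" rule
    verdict = "keep" if keep else "drop"
    print(f"{item}: {n_essential}/{len(panel)} rated essential -> {verdict}")
```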

Page 9: Chapter 7 Evaluating What a Test Really Measures.

Defining the Testing Universe

What is the body of knowledge or behaviors that the test represents?

What are the intended outcomes (skills, knowledge)?

Page 10: Chapter 7 Evaluating What a Test Really Measures.

Developing a Test Plan

Step 1: Define the testing universe
– Locate theoretical or empirical research on the attribute
– Interview experts

Step 2: Develop test specifications
– Identify content areas (topics to be covered in the test)
– Identify instructional objectives (what one should be able to do with these topics)

Step 3: Establish a test format

Step 4: Construct test questions

Page 11: Chapter 7 Evaluating What a Test Really Measures.

Attributes

Concrete attributes: attributes that can be described in terms of specific behaviors.
e.g., ability to play the piano, do math problems

Abstract attributes: more difficult to describe in terms of behaviors because people might disagree on what the behaviors represent.
e.g., intelligence, creativity, personality

Page 12: Chapter 7 Evaluating What a Test Really Measures.

Chapter 8

Using Tests to Make Decisions: Criterion-Related Validity

Page 13: Chapter 7 Evaluating What a Test Really Measures.

What is a criterion?

This is the standard by which your measure is being judged or evaluated.

The measure of performance that is correlated with test scores.

An evaluative standard that can be used to measure a person's performance, attitude, or motivation.

Page 14: Chapter 7 Evaluating What a Test Really Measures.

Two Ways to Demonstrate Criterion-Related Validity

1. Predictive Method

2. Concurrent Method

Page 15: Chapter 7 Evaluating What a Test Really Measures.

Criterion-Related Validity

Predictive validity – correlating test scores with future behavior…after examinees have had a chance to exhibit the predicted behavior; e.g., success on the job.

Page 16: Chapter 7 Evaluating What a Test Really Measures.

Concurrent validity – correlating test scores with an independent measure of the same trait that the test is designed to measure – currently available.

Or being able to distinguish between groups known to be different; i.e., significantly different mean scores on the test.

Page 17: Chapter 7 Evaluating What a Test Really Measures.

Examples of Concurrent Validity

E.g. 1: Teachers' ratings of reading ability validated by correlating them with reading test scores.

E.g. 2: Validate an index of self-reported delinquency by comparing responses to official police records on the respondents.

Page 18: Chapter 7 Evaluating What a Test Really Measures.

In both predictive and concurrent validity, we validate by comparing scores with a criterion (the standard by which your measure is being judged or evaluated).

Most appropriate for tests that claim to predict outcomes.

Evidence of criterion-related validity depends on empirical or quantitative methods of data analysis.

Page 19: Chapter 7 Evaluating What a Test Really Measures.

Example of How to Determine Predictive Validity

Give the test to applicants for a position.

For all those hired, compare their test scores to supervisors' ratings after 6 months on the job.

The supervisors' ratings are the criterion.

If employees' test scores correspond closely to the supervisors' ratings, then the predictive validity of the test is supported.
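A minimal sketch of that example (the test scores and 6-month supervisor ratings below are invented; the Pearson correlation computed here is what the later slides call the validity coefficient):

```python
import numpy as np

# Hypothetical data for the hired applicants: selection-test scores and
# supervisors' ratings after 6 months on the job (the criterion).
test_scores = np.array([52, 61, 47, 70, 66, 58, 74, 63, 55, 68])
ratings_6mo = np.array([3.1, 3.8, 2.9, 4.5, 4.0, 3.4, 4.6, 3.9, 3.2, 4.2])

# The validity coefficient is the correlation between predictor and criterion.
r = np.corrcoef(test_scores, ratings_6mo)[0, 1]
print(f"validity coefficient r = {r:.2f}")
```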

Page 20: Chapter 7 Evaluating What a Test Really Measures.

Problems with Using Predictive Validity

Restricted range of scores on either the predictor or the criterion measure will cause an artificially lower correlation (see the sketch below).

Attrition of criterion scores; i.e., some folks drop out before you can measure them on the criterion measure (e.g., 6 months later).
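A quick simulation of the range-restriction problem (data and cutoff are made up; it simply shows the correlation shrinking when only the higher-scoring applicants, e.g., those actually hired, are retained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 1,000 applicants whose criterion scores track their predictor
# scores plus noise, so the full-range correlation is substantial.
predictor = rng.normal(50, 10, 1000)
criterion = predictor + rng.normal(0, 10, 1000)
r_full = np.corrcoef(predictor, criterion)[0, 1]

# Restrict the range: keep only applicants above the predictor median,
# mimicking the fact that low scorers are never hired (or never rated).
hired = predictor > np.median(predictor)
r_restricted = np.corrcoef(predictor[hired], criterion[hired])[0, 1]

print(f"full-range r       = {r_full:.2f}")
print(f"restricted-range r = {r_restricted:.2f}  # artificially lower")
```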

Page 21: Chapter 7 Evaluating What a Test Really Measures.

Selecting a Criterion

Objective criteria: observable and measurable; e.g., sales figures, number of accidents, etc.

Subjective criteria: based on a person's judgment; e.g., employee job ratings. Example…

Page 22: Chapter 7 Evaluating What a Test Really Measures.

CRITERION MEASUREMENTS MUST THEMSELVES BE VALID!

Criteria must be representative of the events that they are supposed to measure.
– e.g., sales ability – not just dollar amount, but also the number of sales calls made, the size of the target population, etc.

Criterion contamination – when the criterion measures more dimensions than those measured by the test.

Page 23: Chapter 7 Evaluating What a Test Really Measures.

BOTH PREDICTOR AND CRITERION MEASURES MUST BE RELIABLE FIRST!

E.g., inter-rater reliability obtained by supervisors rating the same employees independently.

Reliability estimates of predictors can be obtained by one of the four methods covered in Chapter 6.

Page 24: Chapter 7 Evaluating What a Test Really Measures.

Calculating & Estimating Validity Calculating & Estimating Validity CoefficientsCoefficients

Validity Coefficient – Validity Coefficient – Predictive Predictive and concurrent validity also and concurrent validity also represented by correlation represented by correlation coefficients. Represents the coefficients. Represents the amount or strength of criterion-amount or strength of criterion-related validity that can be related validity that can be attributed to the test.attributed to the test.

Page 25: Chapter 7 Evaluating What a Test Really Measures.

Two Methods for Evaluating Validity Coefficients

1. Test of significance: determining the probability that the study would have yielded the calculated validity coefficient by chance alone.

– Requires that you take into account the size of the group (N) from whom the data were obtained.

– When researchers or test developers report a validity coefficient, they should also report its level of significance.

– The coefficient must be demonstrated to be greater than zero at p < .05 (look up the critical value in a table).

Page 26: Chapter 7 Evaluating What a Test Really Measures.

Two Methods for Evaluating Validity Coefficients

2. Coefficient of determination: the amount of variance shared by the two variables being correlated, such as test and criterion, obtained by squaring the validity coefficient.

r² tells us how much covariation exists between predictor and criterion; e.g., if r = .70, then 49% of the variance is common to both.

i.e., if the correlation (r) is .30, then the coefficient of determination (r²) is .09. (This means that the test and criterion have 9% of their variance in common.)

Page 27: Chapter 7 Evaluating What a Test Really Measures.

Using Validity Information to Make Predictions

Linear regression: predicting Y from X.

Set a "pass" or acceptance score on Y.

Determine what minimum X score ("cutting score") will produce that Y score or better ("success" on the job).

Examples…
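One way to sketch the cutting-score idea (all numbers hypothetical; the line is fit with numpy.polyfit and then solved for the minimum X that predicts the chosen "pass" level on Y):

```python
import numpy as np

# Hypothetical validation sample: test scores (X) and job-performance ratings (Y).
x = np.array([40, 45, 50, 55, 60, 65, 70, 75, 80, 85])
y = np.array([2.1, 2.4, 2.6, 3.0, 3.1, 3.5, 3.6, 4.0, 4.1, 4.4])

# Fit the regression line Y' = bX + a.
b, a = np.polyfit(x, y, 1)

# Choose the "pass" level on the criterion, then invert the line to get the
# minimum test score (the cutting score) that predicts it.
y_pass = 3.5
cutting_score = (y_pass - a) / b
print(f"Y' = {b:.3f}X + {a:.2f}; predicted Y >= {y_pass} requires X >= {cutting_score:.1f}")
```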

Page 28: Chapter 7 Evaluating What a Test Really Measures.

Outcomes of Prediction

Hits:
a) True positives – predicted to succeed and did.
b) True negatives – predicted to fail and did.

Misses:
a) False positives – predicted to succeed and didn't.
b) False negatives – predicted to fail and would have succeeded.

WE WANT TO MAXIMIZE TRUE HITS AND MINIMIZE MISSES!
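A small tally of those four outcomes for an invented validation sample ("predicted" here means predicted to succeed, i.e., scored at or above the cutting score):

```python
# Each pair is (predicted_to_succeed, actually_succeeded) for one person.
outcomes = [(True, True), (True, False), (False, False), (True, True),
            (False, True), (True, True), (False, False), (True, False)]

true_pos  = sum(p and a for p, a in outcomes)          # predicted success, succeeded
true_neg  = sum(not p and not a for p, a in outcomes)  # predicted failure, failed
false_pos = sum(p and not a for p, a in outcomes)      # predicted success, failed
false_neg = sum(not p and a for p, a in outcomes)      # predicted failure, would have succeeded

print(f"hits   = {true_pos + true_neg} (TP = {true_pos}, TN = {true_neg})")
print(f"misses = {false_pos + false_neg} (FP = {false_pos}, FN = {false_neg})")
```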

Page 29: Chapter 7 Evaluating What a Test Really Measures.
Page 30: Chapter 7 Evaluating What a Test Really Measures.

The predictive validity correlation determines the accuracy of prediction.

Page 31: Chapter 7 Evaluating What a Test Really Measures.

Chapter 9

Construct Validity

Page 32: Chapter 7 Evaluating What a Test Really Measures.

Construct Validity

The extent to which the test measures a theoretical construct.

Most appropriate when a test measures an abstract construct (e.g., marital satisfaction).

Page 33: Chapter 7 Evaluating What a Test Really Measures.

What is a construct?

An attribute that exists in theory but is not directly observable or measurable. (Remember there are two kinds of attributes: concrete and abstract.)

We can observe and measure the behaviors that show evidence of these constructs.

Definitions of constructs can vary from person to person.
– e.g., self-efficacy

Example…

Page 34: Chapter 7 Evaluating What a Test Really Measures.

When some trait, attribute, or quality is not operationally defined, you must use indirect measures of the construct, e.g., a scale which references behaviors that we consider evidence of the construct.

But how can we validate that scale?

Page 35: Chapter 7 Evaluating What a Test Really Measures.

Construct Validity

Evidence of construct validity of a scale may be provided by comparing high- vs. low-scoring people on behavior implied by the construct; e.g., do high scorers on the Attitudes Toward Church Going Scale actually attend church more often than low scorers?

Or by comparing groups known to differ on the construct; e.g., comparing pro-life members with pro-choice members on an Attitudes Toward Abortion scale.
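A sketch of that known-groups comparison (scale scores are invented; an independent-samples t-test stands in for "significantly different mean scores"):

```python
import numpy as np
from scipy import stats

# Hypothetical Attitudes Toward Abortion scale scores for two groups
# expected to differ on the construct.
group_a = np.array([12, 15, 11, 14, 13, 16, 12, 10])   # e.g., pro-life members
group_b = np.array([22, 25, 21, 27, 24, 23, 26, 20])   # e.g., pro-choice members

t, p = stats.ttest_ind(group_a, group_b)
print(f"t = {t:.2f}, p = {p:.4f}")
# A large, significant difference in the expected direction supports the
# construct validity of the scale.
```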

Page 36: Chapter 7 Evaluating What a Test Really Measures.

Construct Validity (cont'd)

Factor analysis also gives you a look at the unidimensionality of the construct being measured; i.e., the homogeneity of the items.

As does the split-half reliability coefficient (sketched below).

ONLY ONE CONSTRUCT CAN BE MEASURED BY ONE SCALE!
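As a rough sketch of the split-half check mentioned above (item responses are invented; odd- and even-numbered items form the halves, and the Spearman–Brown correction estimates the full-length reliability, consistent with the split-half method referenced from Chapter 6):

```python
import numpy as np

# Hypothetical responses: 8 people x 10 items scored 0/1 on a single scale.
items = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
])

# Total the odd- and even-numbered items separately.
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlate the halves, then apply the Spearman-Brown correction to
# estimate the reliability of the full-length scale.
r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, Spearman-Brown estimate = {r_full:.2f}")
```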

Page 37: Chapter 7 Evaluating What a Test Really Measures.

Convergent Validity

Evidence that the scores on a test correlate strongly with scores on other tests that measure the same construct.
– e.g., we would expect two measures of general self-efficacy to yield strong, positive, and statistically significant correlations.

Page 38: Chapter 7 Evaluating What a Test Really Measures.

Discriminant Validity

When the test scores are not correlated with measures of unrelated constructs.

Page 39: Chapter 7 Evaluating What a Test Really Measures.

Multitrait-Multimethod Method

Searching for convergence across different measures of the same thing, and for divergence between measures of different things.
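A toy multitrait-multimethod layout (all data simulated; the two traits and two methods, e.g., self-report vs. observer rating, are assumptions for illustration). Same-trait, different-method correlations should come out high (convergence), while different-trait correlations should stay low (divergence):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Two unrelated latent traits, each measured by two methods (assumed here:
# a self-report and an observer rating), with measurement noise added.
trait_a = rng.normal(0, 1, n)
trait_b = rng.normal(0, 1, n)
measures = {
    "A_self":     trait_a + rng.normal(0, 0.5, n),
    "A_observer": trait_a + rng.normal(0, 0.5, n),
    "B_self":     trait_b + rng.normal(0, 0.5, n),
    "B_observer": trait_b + rng.normal(0, 0.5, n),
}

# Correlation matrix across all four measures.
names = list(measures)
corr = np.corrcoef(np.array([measures[k] for k in names]))

print(" " * 11 + "  ".join(f"{k:>10}" for k in names))
for i, row_name in enumerate(names):
    print(f"{row_name:>10} " + "  ".join(f"{corr[i, j]:10.2f}" for j in range(len(names))))
# Expect high A_self/A_observer and B_self/B_observer correlations (convergent
# validity) and near-zero correlations between the A and B measures
# (discriminant validity).
```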

Page 40: Chapter 7 Evaluating What a Test Really Measures.

Face Validity

The items look like they reflect whatever is being measured.

The extent to which the test taker perceives that the test measures what it is supposed to measure.

The attractiveness and appropriateness of the test as perceived by the test takers.

Influences how test takers approach the test.

Uses experts to evaluate.

Page 41: Chapter 7 Evaluating What a Test Really Measures.

Which type of validity would be most suitable for the following?

a) mathematics test
b) intelligence test
c) vocational interest inventory
d) music aptitude test

Page 42: Chapter 7 Evaluating What a Test Really Measures.

Discuss the value of predictive validity to each of the following:

a) personnel manager
b) teacher or principal
c) college admissions officer
d) prison warden
e) psychiatrist
f) guidance counselor
g) veterinary dermatologist
h) professor in medical school