Controversy Surrounding Modern Validity Theory

Dr Paul E. Newton

Director, Cambridge Assessment Network Division

Cambridge Assessment

Paper presented to SQA Research Seminar

11/11/11, Glasgow

Messick (1989)

Why is validity important?

• Validity is a (if not the) hallmark of quality for educational assessment

• A declaration of validity provides a ‘green light’ to use an assessment procedure for the purpose at hand... a declaration of invalidity presents a ‘red light’

Q1: Can you date this quote?

“Two of the most important types of problems in measurement are those connected with the determination of what a test measures, and of how consistently it measures. The first should be called the problem of validity, the second, the problem of reliability.”

Q2: Can you identify either series?

Series 1

• 1951
• 1971
• 1989
• 2006

Series 2

• 1954
• 1966
• 1974
• 1985
• 1999

Q3: Which is not a type of validity?

• Construct validity
• Ecological validity
• External validity
• Internal validity
• Outcome validity
• Population validity
• Statistical conclusion validity
• Temporal validity
• Treatment variation validity

• Catalytic validity
• Descriptive validity
• Evaluative validity
• Interpretive validity
• Ironic validity
• Paralogical validity
• Rhizomatic validity
• Theoretical validity
• Transactional validity
• Transformational validity
• Voluptuous validity

60 more types of validity

• A priori validity
• Cognitive validity
• Comparative validity
• Concurrent validity
• Congruent validity
• Consensual validity
• Consequential validity
• Construct validity
• Content validity
• Context validity
• Convergent validity
• Correlational validity
• Criteria validity
• Criterion validity
• Cross-age validity
• Cross-cultural validity
• Curricular validity
• Design validity
• Diagnostic validity
• Differential validity
• Discriminant validity
• Discriminative validity
• Divergent validity
• Domain validity
• Edumetric validity
• Elemental validity
• Empirical validity
• Etiological validity
• External validity
• Extratest validity
• Face validity
• Factorial validity
• Incremental validity
• Instructional validity
• Internal validity
• Interpretive validity
• Interpretative validity
• Intrinsic validity
• Job component validity
• Logical validity
• Manifest validity
• Nomological validity
• Operational validity
• Practical validity
• Predictive validity
• Sampling validity
• Scoring validity
• Semantic validity
• Site validity
• Statistical validity
• Status validity
• Structural validity
• Substantive validity
• Synthetic validity
• System validity
• Systemic validity
• Theory-based validity
• Trait validity
• Translation validity
• Treatment validity

Theoretical revolution

• From fragmented conception of validity (Trinitarian) to an holistic one (Unitarian)

• Championed by Samuel Messick

• Between mid-1970s and late-1980s

The 1920s definition

“By validity is meant the degree to which a test or examination measures what it purports to measure.”

(Ruch, 1924, p.13)

Validity is conditional

• Upon having observed procedural guidelines
  – e.g. a well-developed test that had been administered incorrectly would not necessarily produce accurate results

• Upon the context of administration
  – e.g. a well-developed test designed in one decade would not necessarily produce accurate results two decades later

• Upon characteristics of the group assessed
  – e.g. a well-developed test of reading comprehension designed for 16-year-olds would not necessarily produce accurate results for 11-year-olds

• Upon the use(s) to which results are to be put
  – e.g. a well-developed test designed for selection would not necessarily produce accurate results for placement.

Anastasi (1976)

Contra 1920s definition

• For any test, validity may differ
  – if procedural guidelines are not followed
  – for different groups
  – within different contexts
  – when different interpretations (using different constructs) are made, for different uses

So ‘the test’ cannot be valid or invalid, only ‘the interpretation’ of results.

• Each important interpretation needs to be validated in its own right.

Anastasi (1976)

The 1950-1970s ‘conception’

• Different kinds of validity, requiring different kinds of validation, apply to different kinds of testing.

• For curriculum-based testing:
  – content validity needs to be demonstrated
  – content validation is the appropriate method

• If there is satisfactory alignment between the content of the test and the content of the curriculum then the test is valid.

Mono-validation insufficient

• Even for testing educational attainment, content validation (to check adequate sampling of content) is insufficient
  – content validation can only help to ‘validate’ inferences concerning students who score maximum marks
  – the way that questions present content may prevent them from eliciting the intended KSU evidence
  – the way that questions are marked may prevent evidence of KSU from being rewarded appropriately
  – different students will use different kinds of KSU to answer the same question
  – inferences are drawn in terms of constructs (e.g. X is better at ‘scientific reasoning’ than Y) and even these construct labels need validating

It’s all about construct validity

• “[...] the profession is coming around to the view that all validation is construct validation.” (Cronbach, 1984, p.126)

• “[...] construct validity may ultimately be taken as the whole of validity in the final analysis.” (Messick, 1989, p.21)

Contra 1950s-70s ‘conception’

CONTENT | CRITERION | CONSTRUCT

• 1954: Content validity | Predictive validity, Concurrent validity | Construct validity

• 1966: Content validity | Criterion-related validity | Construct validity

• 1974: Content validity | Criterion-related validities | Construct validity

• 1985: Content-related evidence | Criterion-related evidence | Construct-related evidence

• 1999 (no longer split into three categories):
  – Evidence based on test content
  – Evidence based on response processes
  – Evidence based on internal structure
  – Evidence based on relations to other variables
  – Evidence based on the consequences of testing

Double whammy!

• Rejection of 1920s definition:
  – for any given test, multiple interpretations will need to be validated (particularly when the same test is used for multiple purposes)

• Rejection of 1950-1970s ‘conception’:
  – for any given interpretation, multiple validation activities will be required to establish its (construct) validity

The last word on validity?

• Despite talk of a general consensus over the central tenets of modern validity theory:
  – substantial ambiguity over detail of the theory
  – ongoing resistance to putting it into practice
  – growing debate over its plausibility

1a. Ambiguity – meaning

• Miller et al (2009), on a single page (p.104), refer to:
  – “the validity of an assessment”
  – “the validity of the assessment for that use or interpretation”
  – “the validity of interpretations of tests and assessments”
  – “the validity of test and assessment results”
  – “the validity of the uses and interpretations”

1a. Ambiguity – meaning

• Which is the ‘proper’ referent of validity?
  – the interpretation of the score (i.e. the claim) is valid
  – the use of results (i.e. the decision) is valid
  – the inferential process (assessment procedure) is valid
  – the intended, or actual, inferences from results are valid
  – the argument for interpreting and using results is valid
  – the inferential links within the argument chain are valid
  – the validation research conclusions are valid
  – the hypothesis is valid
  – the explanation is valid

1b. Ambiguity – evidence

• Relevance
  – is every kind of evidence/analysis relevant to every validation?

• Necessity
  – is every kind of evidence/analysis required for every validation?

Relevance and necessity

“Therefore, the profession is coming around to the view that all validation is construct validation. [...] Content- and criterion-based arguments develop parts of the story. With almost any test it makes sense to join all three kinds of inquiry in building an explanation. The three distinct terms do no more than spotlight aspects of the inquiry.”

(Cronbach, 1984, p.126)


“[...] test validity cannot rely on any one of the supplementary forms of evidence just discussed. But neither does validity require any one form, as long as there is defensible convergent and discriminant evidence supporting test meaning. To the extent that some form of evidence cannot be developed [...] heightened emphasis can be placed on other evidence [...] What is required is a compelling argument that the available evidence justifies the test interpretation and use, even though some pertinent evidence had to be forgone.”

(Messick, 1998, pp.70-71)


“[...] if the proposed interpretation of test results relies on predictions of future performance, these predictions should be empirically evaluated as part of the validation of the proposed interpretation; if no such predictions are made, no evidence for predictive accuracy is called for.”

(Kane, 2008, p.79)

2. Resistance – validation

• Well-established disjunction between modern validity theory and contemporary validation practice
  – Jonson & Plake (1998)
  – Hogan & Agnello (2004)
  – Cizek et al (2008)
  – Wolming & Wikstrom (2010)

2. Resistance – validation

• Validation has become very demanding...
  – multiple validation ‘foci’
  – multiple validation constructs
  – multiple uses of results

Multiple validation ‘foci’

• If we’re not just checking test content against curriculum content, for each test, what else do we need to do for each test?
  – OCR > A level > physics > version A > 2011 certification

• How much additional validation is required for distinct subgroups of the population?
  – ethnicity, class, gender, region, school, etc.

Multiple validation constructs

Construct (interpretation of results) | Example of Selection Use (2012 entry)

1. Physics attainment, across broadly specified set of KSU (from specification) | Physics courses in UK HEIs

2. Physics attainment, across loosely specified set of KSU (from subject criteria) | Physics courses in UK HEIs

3. Physics aptitude, for degree-level physics | Physics courses in UK HEIs

4. Physics aptitude, for a particular degree-level course in physics | Physics at Bath (ABB required, inc. grade B in physics)

5. Science aptitude, for degree-level science courses | Psychology at Portsmouth (320 points required, inc. grade B in a science)

6. General aptitude, for degree-level courses | Courses in UK HEIs

7. Law aptitude, for a particular degree-level course in law | Law at Sussex (AAB profile required)

Multiple uses of results

1. student monitoring

2. formative

3. social evaluation

4. diagnosis

5. provision eligibility

6. screening

7. segregation

8. guidance

9. transfer

10. placement

11. qualification

12. selection

13. licensing

14. certification

15. school choice

16. institution monitoring

17. resource allocation

18. organizational intervention

19. programme evaluation

20. system monitoring

21. comparability

22. national accounting

3a. Debate – good/bad impacts

• Can bad impacts, from otherwise good tests, really undermine validity?
  – when test used for intended purpose?
  – when test used for unintended purposes?

• Must developers provide evidence that their tests have had (or will have) good impacts?
  – is it really their responsibility?
  – how could evidence be collected in advance?
  – who ought to judge what counts as a good or bad impact?

3b. Debate – unitary concept

• Borsboom et al (2004)
  – the test is valid (after all)

• Lissitz & Samuelsen (2007)
  – attainment tests don’t require much more than content validation

• Murphy (2009)
  – aptitude tests don’t require much more than criterion validation

3b. Debate – what’s in and out?

The Measurement (Interpretation of Test Score) | The Use of Test Score (Decision) | The Impact of Testing (Cost vs. Benefit)

• Spirit of Borsboom: (Narrow) Construct Validity | Theorise use separately | Theorise impact separately

• Spirit of Messick: (Broad) Construct Validity, spanning measurement and use | Theorise impact separately

• Spirit of Cronbach: Validity (or maybe even 'Evaluation'), spanning measurement, use and impact

My modern validity theory

1. Technical standard
• validity of (each) use of results – depends on strength of argument for interpreting those results in terms of the validation construct (includes reliability)

2. Ethical standard
• defensibility (includes social policy, i.e. good/bad impacts)

3. Legal standard
• legality

4. Economic standard
• feasibility

5. Political standard
• acceptability (includes ‘face validity’)

Messick (1989)
