Large-scale testing: Uses and abuses Richard P. Phelps Universidad Finis Terrae, Santiago, Chile January 7, 2014

Dec 22, 2015

Transcript
Page 1:

Large-scale testing: Uses and abuses

Richard P. Phelps

Universidad Finis Terrae, Santiago, Chile

January 7, 2014

Page 2:

Large-scale testing: Uses and abuses

1. Three types of large-scale tests
2. Measuring test quality
3. A chronology of mistakes
4. Economists misunderstand testing
5. How SIMCE is affected

Page 3:

1. Three types of large-scale tests

Achievement

Aptitude

Non-cognitive

Page 4:

Achievement tests

Historically, these were larger versions of classroom tests

~ 1900 - “scientific” achievement tests developed (Germany & USA)

J.M. Rice - systematically analyzed test structures & effects

E.L. Thorndike - developed scoring scales

SOURCE: Phelps, Standardized Testing Primer, 2007

Page 5:

Achievement tests

Purpose: to measure how much you know and can recall

Developed using: content coverage analysis

How validated: retrospective or concurrent validity (correlation with past measures, such as high school grades)

Requires mastery of the content before the test.

Fairness assumes that all test-takers have the same opportunity to learn the content

Coachable – specific content is known in advance

SOURCE: Phelps, Standardized Testing Primer, 2007

Page 6:

Aptitude tests

1890s – A. Binet & T. Simon (France)

- Pre-school children with mental disabilities

- Achievement test not possible

- Developed content-free test of mental abilities (association, attention, memory, motor skills, reasoning)

1917 – Adapted by U.S. Army to select and assign soldiers in World War I

1930s – Harvard University president J. Conant

- Wanted new admission test to identify students from lower social classes with the potential to succeed at Harvard

- Developed the first Scholastic Aptitude Test (SAT)

SOURCE: Phelps, Standardized Testing Primer, 2007

Page 7:

Aptitude tests

Purpose: to predict how much can be learned

Developed using: skills/job analysis

How validated: predictive validity, correlation with future activity (e.g., university or job evaluations)

Content independent. Measures:

- what the student does with the content provided

- how the student applies skills & abilities developed over a lifetime

Not easily coachable – the content is either:

- not known in advance,

- basic, broad, commonly known by all, curriculum-free;

- less dependent on the quality of schools

SOURCE: Phelps, Standardized Testing Primer, 2007

Page 8:

Aptitude tests

Aptitude tests can identify:

- Students bored in school who study what interests them on their own

- Students not well adapted to high school, but well adapted to university

- Students of high ability stuck in poor schools

SOURCE: Phelps, Standardized Testing Primer, 2007

Page 9:

Comparing Achievement & Aptitude tests

              Achievement        Aptitude
Measure       past learning      potential
Development   content analysis   job/skills analysis
Validation    retrospective      predictive
Content       dependent          independent
Coachable?    very much          not much

Page 10:

Non-cognitive tests

More recently developed – measure values, attitudes, preferences

Types:

- integrity tests

- career exploration

- matchmaking

- employment “fit”

Page 11:

Non-cognitive tests

Purpose: to identify “fit” with others or a situation

Developed using: surveys, personal interviews

How validated: success rate in future activities

Content is personal, not learned

“Faking” can be an issue (e.g., “honesty” tests)

Page 12:

Comparing Achievement, Aptitude, & Non-Cognitive Tests

              Achievement        Aptitude              Non-Cognitive
Measure       past learning      potential             attitudes, values, preferences
Development   content analysis   job/skills analysis   surveys
Validation    retrospective      predictive            predictive
Content       dependent          independent           independent
Coachable?    very much          very little           can be faked

Page 13:

2. Measuring test quality

Three measures are important:

1. Predictive validity
2. Content coverage
3. Sub-group differences

Test reports can be “data dumps”

Page 14:

Predictive validity (values from -1.0 to +1.0)

…measures how well higher scores on an admission test match better outcomes at university (e.g., grades, completion)

A test with low predictive validity provides little information.
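
One common way to see why a low validity coefficient carries little information is to square it. In a simple linear relationship, r squared is the share of variation in the outcome that the test score accounts for. The sketch below is not from the slides; the function name and the listed r values are illustrative assumptions, not results for any real test.

```python
def variance_explained(r):
    """Return r squared for a correlation coefficient r in [-1, 1].

    In a simple linear relationship, r squared is the share of variance
    in the outcome accounted for by the predictor.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    return r * r


# Illustrative values only (assumed for the example, not measured results).
for r in (0.1, 0.3, 0.5, 0.7):
    share = variance_explained(r)
    print(f"r = {r:.1f} -> about {share:.0%} of outcome variance accounted for")
```

Because the share grows with the square of r, halving a validity coefficient cuts the information it carries to roughly a quarter.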

Page 15:

Source: NIST, Engineering Statistics Handbook

A positive correlation between two measures

Page 16:

Source: NIST, Engineering Statistics Handbook

A negative correlation between two measures

Page 17:

Source: NIST, Engineering Statistics Handbook

No correlation between two measures

Page 18:

How does one measure predictive capacity?

Correlation coefficient: a scale running from -1 through 0 to +1
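
As a minimal sketch of how such a coefficient could be computed, the code below calculates the Pearson correlation between admission test scores and a later university outcome. It assumes the Pearson coefficient is the measure intended; the slides do not name a specific formula. The function name and both data lists are hypothetical and invented purely for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one measure is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data (made up for the example): admission test scores
# and first-year university grades for six students.
admission_scores = [450, 520, 580, 610, 700, 740]
first_year_grades = [4.1, 4.0, 4.8, 5.2, 5.5, 6.1]

r = pearson_r(admission_scores, first_year_grades)
print(f"predictive validity (Pearson r) = {r:.2f}")
```

A value near +1 means higher test scores go with better outcomes, a value near 0 means the score says almost nothing about the outcome, and a value near -1 means higher scores go with worse outcomes.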

Page 19:

[Figure: Predictive validities: SAT and PSU. Series: SAT, PSU 2010. Vertical axis from 0 to 0.6.]

SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013

Page 20:

[Figure: Predictive validities: SAT and PSU (faculty: Administracion). Categories: Language, Mathematics, SAT Writing, PSU Social Science. Series: SAT, PSU Administracion. Vertical axis from 0 to 0.6.]

SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013