YAACOV PETSCHER, PH.D. FLORIDA CENTER FOR READING RESEARCH FLORIDA STATE UNIVERSITY Statistical Considerations for Educational Screening & Diagnostic Assessments A discussion of methodological applications which have existed in the literature for a long time and are used in other disciplines but are emerging more now in education
Discussion Points
Assessment Assumptions
Contexts of Assessments
Statistical Considerations: Reliability, Validity, Benchmarking

“Disclaimer”
Focusing on breadth, not depth
Based on applied contract and grant research
One slide of equations
Assumptions of Assessment - Researchers
Constructs exist but we can’t see them
Constructs can be measured
Although we can measure constructs, our measurement is not perfect
There are different ways to measure any given construct
All assessment procedures have strengths and limitations
Assumptions of Assessment - Practitioner
Multiple sources of information should be part of the assessment process
Performance on tests can be generalized to non-test behaviors
Assessment can provide information that helps educators make better educational decisions
Assessment can be conducted in a fair manner
Testing and assessment can benefit our educational institutions and society as a whole
Contexts of Assessments
Instructional Formative Interim Summative
Research Individual Differences Group Differences (RCT) Growth
Legislative Initiatives NCLB Reading First Race to the Top Common Core
Reliability estimates should not be viewed as interchangeable
One could have very high stability but very poor internal consistency
(e.g., Date of Birth/Height/SSN)
Statistical Considerations - Reliability
Most frequently used framework is classical test theory
What does this assume?
X = T + e
(observed score X = true score T + error e)
Benefits of IRT
Puts persons and items on the same scale (CTT looks at total score by p-value/difficulty)
Can result in shorter tests (CTT reliability increases with more items)
Can estimate the precision of scores at the individual level (CTT assumes error is the same for everyone)
Item Difficulty by Total Score Decile Groups
Item Difficulty by Ability
Items Don’t Always Do What We Want
Item Information
Test Information – Standard Error
Precision/Reliability
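The item-information and standard-error ideas behind these charts can be sketched with the 2PL model; the item parameters below are invented for illustration:

```python
import math

# 2PL IRT: P(theta) = 1 / (1 + exp(-a * (theta - b)))
# Item information: I(theta) = a^2 * P * (1 - P)
# Test information sums over items; SE(theta) = 1 / sqrt(I_test).

def p2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    p = p2pl(theta, a, b)
    return a ** 2 * p * (1 - p)

# Invented parameters: a = discrimination, b = difficulty
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5)]

for theta in (-2, -1, 0, 1, 2):
    info = sum(item_info(theta, a, b) for a, b in items)
    se = 1.0 / math.sqrt(info)
    print(f"theta={theta:+d}  info={info:.2f}  SE={se:.2f}")
```

Running this shows information peaking where items are targeted, so the standard error varies across ability, unlike CTT's single error term.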
Statistical Considerations - Reliability
While precision improves on the idea of reliability, can precision itself be improved?
Account for context effects (Wainer et al., 2000; Petscher & Foorman, 2011)
Account for time (Verhelst, Verstralen, & Jansen, 1997; Prindle, Petscher, & Mitchell, 2013)
Statistical Considerations - Reliability
Context effects
Any influence or interpretation that an item may acquire as a result of its relationship to other items
A greater problem in CAT due to unique testing
Emerges as an item- and passage-level problem
Statistical Considerations - Reliability
Common stimulus
Statistical Considerations - Reliability
“If several questions within a test are experimentally linked so that the reaction to one question influences the reaction to another, the entire group of questions should be treated preferably as an ‘item’ when the data arising from application of split-half or appropriate analysis-of-variance methods are reported in the test manual”
APA Standards of Educational and Psychological Testing (1966)
Expressed in IRT
Testlet model (item i in testlet d(i), person j):

p(x_ij = 1 | θ_j) = c_i + (1 − c_i) · exp[a_i(θ_j − b_i − γ_d(i)j)] / (1 + exp[a_i(θ_j − b_i − γ_d(i)j)])

Standard model (no testlet effect):

p(x_ij = 1 | θ_j) = c_i + (1 − c_i) · exp[a_i(θ_j − b_i)] / (1 + exp[a_i(θ_j − b_i)])
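A minimal sketch of the two models above, the standard 3PL and the testlet version that adds a person-specific passage effect γ; all parameter values here are invented:

```python
import math

# Standard 3PL:  p = c + (1 - c) * logistic(a * (theta - b))
# Testlet 3PL:   p = c + (1 - c) * logistic(a * (theta - b - gamma))
# where gamma is the person's effect for the testlet (passage)
# containing the item.

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_3pl(theta, a, b, c, gamma=0.0):
    return c + (1 - c) * logistic(a * (theta - b - gamma))

# Invented values: a = 1.1, b = 0.2, c = 0.20 (guessing), theta = 0.5
print(p_3pl(0.5, 1.1, 0.2, 0.20))             # standard 3PL
print(p_3pl(0.5, 1.1, 0.2, 0.20, gamma=0.4))  # same item within a testlet
```

A positive γ lowers the predicted probability for items in that passage, which is how the model soaks up passage-level dependency instead of treating it as person ability.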
Study 1: Reading Comprehension in Florida
Precision – After 3 passages
FAIR Technical Manual
Simulations are all well and good…
How does accounting for item dependency improve testing in the real world?
N ≈ 800, randomly assigned to testing condition
Control: current 2PL scoring
Experimental: unrestricted bi-factor
Evaluate: precision, # of passages, prediction to state achievement
RCT
What this suggests
“Newer” models help us to more appropriately model the data
Precision/reliability are improved just by modeling the context effect
Improve the efficiency and precision of a computer-adaptive test by modeling the item-dependency
Study 2: Morphology CAT
Accounting for Time
Somewhat similar to the item dependency model
IRT models are concerned with accuracy
What about fluency?
For a while, we have known that MA is correlated with reading comprehension (e.g., Carlisle, 2000; Freyd & Baron, 1982; Tyler & Nagy, 1990)
MA RC
MA predicts RC,above & beyond Vocabulary (V)
Unique contributions of MA to RC, controlling for vocabulary (e.g., Carlisle, 2000; Kieffer, Biancarosa, & Mancilla-Martinez, in press; Kieffer & Lesaux, 2008, 2012; Kieffer & Box, 2013; Nagy, Berninger, & Abbott, 2006)
MA RC
V
But wait…
Are we actually measuring MA and vocabulary as separate dimensions of lexical knowledge?
Observed correlations between MA and vocabulary are attenuated by measurement error
Reliability of researcher-created MA measures has been moderate In the .70-.80 range & occasionally lower
So, “unique” contributions of MA beyond V could be an artifact of measurement error
MA V
But wait…
Using Confirmatory Factor Analysis (CFA), Muse (2005) found that MA could not be distinguished from vocabulary in fourth grade, but instead form a unidimensional construct (See also Wagner, Muse, & Tannenbaum, 2007).
Spencer (2012) replicated this finding with eighth graders.
MA/V
On the other hand…
Using CFA, Kieffer & Lesaux (2012) found that MA was measurably separable from two other dimensions of vocabulary, though strongly related to them, for both native English speakers & language minority learners in Grade 6
Neugebauer, Kieffer, & Howard (under review) replicated this finding for Spanish speaking language minority learners in Grades 6-8
MA V
But
Is it possible a multidimensional structure exists but could be best captured by a general factor of lexical knowledge and specific factors of morphological awareness and vocabulary?
If the common variance is captured by a general factor as well as specific factors, do they each predict a distal outcome?
Modeling Dimensionality of Lexical Knowledge: Unidimensional
Fit poorly
Rejected across parametric & nonparametric EFA & CFA models

Modeling Dimensionality of Lexical Knowledge: Two-Dimensional

Modeling Dimensionality of Lexical Knowledge: Bi-factor Model
Students with poor reading skills have difficulty in closing achievement gaps
Accurate identification is necessary to remediate difficulties
Many assessments include guidelines for cut-points
Sample Risk Levels Chart
How to Validate – Current Theory
Variety of Methods Best Guess
+/- 1SD Percentile Ranks
Simple Stat Bivariate Correlations Interrater Reliability
More Advanced Logistic Regression Discriminant Function Analysis Achievement-IQ Discrepancies
Typical “Diagnostic/Screening” Q’s
What is the relationship (WITR) between blood characteristics and being HIV positive?
WITR between electromagnetic signals and correctly distinguishing signal from noise?
WITR between students’ scores on the Scholastic Reading Inventory and future risk on the SAT-10?
What is our question?
Correlational? Bivariate correlation, interrater reliability
Discrimination? Logistic regression, discriminant function analysis, Receiver Operating Characteristic (ROC) curves
ROC
Graphical representation of operating points
Multiple indices of efficiency
Moving cut-points
Outperforms other techniques in diagnostic efficiency (Hintze, 2005)
Advantages of using ROC
It defines the quality of a test or prediction without specifying a cut-off value for decision making
Greater flexibility in diagnostic accuracy and predictive power
Assuming a normal distribution:
The mean and standard error can be estimated
The 95% CI can be estimated
Statistical significance can be determined
Whether one test is better than another can be determined
Old School Discrimination
Form two groups
Give the test
4 outcomes:
People who have the attribute were detected (true positives)
People who have the attribute were not detected (false negatives)
People who don’t have the attribute were detected (false positives)
People who don’t have the attribute were not detected (true negatives)
Using the Results
What is a ROC Curve?
[ROC curve plot: Sensitivity (y-axis) vs. 1-Specificity (x-axis), both ranging 0 to 1]
What is a ROC Curve?
[ROC curve plot: Sensitivity vs. 1-Specificity]
Based on Cumulative Frequency %
Data Scheme

SRI Lexile Score | SAT-10 <40th %ile (y-axis) | SAT-10 >=40th %ile (x-axis)
505              | 35 (.35)                   | 5 (.05)
520              | 30 (.65)                   | 10 (.15)
550              | 20 (.85)                   | 20 (.35)
600              | 10 (.95)                   | 30 (.65)
700              | 5 (1.00)                   | 35 (1.00)
TOTALS           | N=100                      | N=100
What is a ROC Curve?
[ROC curve plot with the five operating points from the data scheme: (.05, .35), (.15, .65), (.35, .85), (.65, .95), (1.00, 1.00)]
Confusion Matrix

                          SAT-10 Score
                   <40th %ile   >=40th %ile
SRI  At-Risk            A            B
     Not At-Risk        C            D

Sensitivity:                    SE  = TP / (TP + FN) = A / (A + C)
Specificity:                    SP  = TN / (TN + FP) = D / (B + D)
Positive Predictive Power:      PPP = TP / (TP + FP) = A / (A + B)
Negative Predictive Power:      NPP = TN / (TN + FN) = D / (C + D)
Overall Correct Classification: OCC = (TP + TN) / (TP + FP + FN + TN) = (A + D) / (A + B + C + D)
Base Rate:                      BR  = (TP + FN) / (TP + FP + FN + TN) = (A + C) / (A + B + C + D)
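A sketch computing these cell formulas, using the counts from the deck's cut-score example (A=35, B=5, C=65, D=95):

```python
# Confusion-matrix indices from the four cells:
# A = true positives, B = false positives,
# C = false negatives, D = true negatives.
A, B, C, D = 35, 5, 65, 95
n = A + B + C + D

se  = A / (A + C)   # sensitivity
sp  = D / (B + D)   # specificity
ppp = A / (A + B)   # positive predictive power
npp = D / (C + D)   # negative predictive power
occ = (A + D) / n   # overall correct classification
br  = (A + C) / n   # base rate

print(f"SE={se:.3f} SP={sp:.3f} PPP={ppp:.3f} "
      f"NPP={npp:.3f} OCC={occ:.3f} BR={br:.3f}")
```

For these counts: SE = .35, SP = .95, PPP = .875, NPP ≈ .59, OCC = .65, BR = .50, a test that rarely flags students incorrectly but misses most who are truly at risk.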
[ROC curve plot repeated, with the five operating points from the data scheme marked]
[ROC curve plot highlighting the operating point for the cut score of 505: 35 (.35) at-risk, 5 (.05) not-at-risk]
Classification – Example 1
Evaluation of Cut Scores
Cut score 505: 35 (.35) at-risk, 5 (.05) not-at-risk

                  SAT-10 At-Risk   SAT-10 Not At-Risk   Total
SRI At-Risk             35                 5               40
SRI Not At-Risk         65                95              160
Total                  100               100              200
If we had employed a test with those measurement properties statewide to detect children who were at-risk for reading problems, we would have mislabeled around 16,000 kids as at-risk who weren’t (37%).
However, we would have missed only about 1,400 students who needed services (1%).
Concluding Thoughts - Reliability
Researchers:
Evaluating other methods of reliability (precision, generalizability)
Practitioners:
What is being reported? (internal consistency, test-retest, etc.)
How reliable is it? (Nunnally/Bernstein: >.80 for research, >.90 for clinical decisions)
Concluding Thoughts – Factor Validity
Researchers:
Testing additional specifications outside of the traditional 1/multi framework (bi-factor, causal indicator, etc.)
Practitioners:
What type of factor analysis was done? (EFA/CFA)
Rules of thumb? Too many? 200?
Concluding Thoughts - Benchmarking
Researchers:
Improve the rigor of our methods (ROC, diagnostic measurement, cost curves)
Practitioners:
Identify what “at-risk” means
Establish the goal of the screening process
Study how the screen was developed
Determine the base rate
Attend to the positive/negative predictive power
Collect local data
Implications of these Considerations
We must be careful in how we choose assessments (AYP, value-added modeling, promotion/retention)
Moving toward a new phase in assessments: computer-delivered, computer-adaptive (Smarter Balanced, FCRR, RFU)
Be more aware of what other disciplines are doing
Be more aware of what’s in older literature
Technology!
Great Resources
IRT:
The Theory and Practice of Item Response Theory (De Ayala)
Fundamentals of Item Response Theory (Hambleton et al.)
Factor Analysis:
CFA for Applied Research (Brown)
SEM:
Beginner’s Guide to SEM (Schumacker & Lomax)
ROC analysis