Introduction to Measurement
Goals of Workshop
• Reviewing assessment concepts
• Reviewing instruments used in the norming process
• Getting an overview of the secondary and elementary normative samples
• Learning how to use the manuals in interpreting students’ scores.
ASSESSMENT
• The process of collecting data for the purpose of making decisions about students
• It’s a process and typically involves multiple sources and methods.
• Assessment is in service of a goal or purpose.
• The data we collect will be used to support some type of decision (e.g., monitoring, intervention, placement)
Major Types of Assessment in Schools
• More frequently used:
– Achievement: how well is the child doing in the curriculum?
– Aptitude: what are this child’s intellectual and other capabilities?
– Behavior: Is the child’s behavior affecting learning?
• Less frequently used:
– Teacher competence: Is the teacher actually imparting knowledge?
– Classroom environment: Are classroom conditions conducive to learning?
– Other concerns: home, community,...
Types of Tests
• Norm-referenced
– Comparison of performance to a specified population/set of individuals
• Individually-referenced
– Comparisons to self
• Criterion-referenced
– Comparison of performance to mastery of a content area; what does the student know?
• The data in the manual will allow you to look at norms and at individual growth.
MAJOR CONCEPTS
• Nomothetic and Idiographic
• Samples
• Norms
• Standardized Administration
• Reliability
• Validity
Nomothetic
• Relating to the abstract, the universal, the general.
• Nomothetic assessment focuses on the group as a unit.
• Refers to finding principles that are applicable on a broad level.
• For example, boys report higher math self-concepts than girls; girls report more depressive symptoms than boys.
Idiographic
• Relating to the concrete, the individual, the unique
• Idiographic assessment focuses on the individual student
• What type of phonemic awareness skills does Joe possess?
Populations and Samples I
• A population consists of all the representatives of a particular domain that you are interested in
• The domain could be people, behavior, curriculum (e.g., reading, math, spelling, ...)
Populations and Samples II
• A sample is a subgroup that you actually draw from the population of interest
• Ideally, you want your sample to represent your population
– people polled or examined, test content, manifestations of behavior
Random Samples
• A sample in which each member of the population has an equal and independent chance of being selected.
• Random samples are important because the goal is a sample that represents the population fairly: an unbiased sample.
• A sample can be used to represent the population.
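As a minimal illustration (not from the workshop materials), the Python sketch below draws a simple random sample with the standard library's random.sample; the population of 500 student IDs is hypothetical.

```python
import random

# Hypothetical population: 500 student IDs
population = list(range(1, 501))

# random.sample gives every member an equal chance of selection
# and never picks the same member twice
sample = random.sample(population, k=50)
print(sorted(sample)[:10])   # peek at the first few selected IDs
```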
Probability Samples I
• Sampling in which elements are drawn according to some known probability structure.
• Random samples are subcases of probability samples.
• Probability samples are typically used in conjunction with subgroups (e.g., ethnicity, socioeconomic status, gender).
Probability Samples II
• Probability samples using subgroups are also referred to as stratified samples.
• Standardization samples are typically probability or stratified samples.
• Standardization samples need to represent the population because the sample’s results will be used to create norms against which all members of the population will be compared.
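A minimal sketch of a stratified (probability) sample, assuming a hypothetical sampling frame in which each student record carries a grade-level stratum; the 10% sampling rate is illustrative only.

```python
import random

# Hypothetical sampling frame: (student_id, grade) pairs
frame = [(i, random.choice(["Grade 1", "Grade 2", "Grade 3"])) for i in range(1, 601)]

# Group students by stratum (here, grade level)
strata = {}
for student_id, grade in frame:
    strata.setdefault(grade, []).append(student_id)

# Draw 10% of each stratum at random, so every grade is represented
# in proportion to its size in the population
stratified_sample = []
for grade, members in strata.items():
    n = max(1, round(0.10 * len(members)))
    stratified_sample.extend(random.sample(members, n))

print(len(stratified_sample), "students drawn across", len(strata), "strata")
```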
Norms I
• Norms describe how the “average” individual performs.
• Many of the tests and rating scales that are used to compare children in the US are norm-referenced.
– An individual child’s performance is compared to the norms established using a representative sample.
Norms II
• For the score on a normed instrument to be valid, the person being assessed must belong to the population for which the test was normed
• If we wish to apply the test to another group of people, we need to establish norms for the new group
Norms III
• To create new norms, we need to do a number of things:
– Get a representative sample of the new population
– Administer the instrument to the sample in a standardized fashion.
– Examine the reliability and validity of the instrument with that new sample
– Determine how we are going to report on scores and create the appropriate tables
Standardized Administration
• All measurement has error.
• Standardized administration is one way to reduce error due to examiner/clinician effects.
• For example, consider these questions with different facial expressions and tone:
• Please define a noun for me. :-)
• DEFINE a noun if you can? :-(
Distributions
• Any group of scores can be arranged in a distribution, from lowest to highest (or highest to lowest):
• 10, 3, 31, 100, 17, 4
• 3, 4, 10, 17, 31, 100
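For example, in Python a set of scores can be arranged with sorted():

```python
scores = [10, 3, 31, 100, 17, 4]
print(sorted(scores))                 # lowest to highest: [3, 4, 10, 17, 31, 100]
print(sorted(scores, reverse=True))   # highest to lowest
```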
Normal Curve
• Many distributions of human traits form a normal curve
• Most cases cluster near middle, with fewer individuals at extremes; symmetrical
• When a trait follows the normal curve, we know what proportion of the population falls within any range of scores
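As a quick check of how cases cluster under the normal curve, the sketch below uses Python's statistics.NormalDist to compute the proportion of the population within 1, 2, and 3 standard deviations of the mean.

```python
from statistics import NormalDist

nd = NormalDist()   # standard normal curve: mean 0, SD 1
for k in (1, 2, 3):
    pct = (nd.cdf(k) - nd.cdf(-k)) * 100
    print(f"Within ±{k} SD of the mean: {pct:.2f}% of cases")
# Roughly 68%, 95%, and 99.7% of cases, as listed on the next slide
```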
Ways of reporting scores
• Mean, standard deviation
• Distribution of scores
– 68.26% within ±1 SD; 95.44% within ±2 SD; 99.72% within ±3 SD
• Stanines (1, 2, 3, 4, 5, 6, 7, 8, 9)
• Standard scores: linear transformations of raw scores that are easier to interpret
• Percentile ranks
• Box and Whisker Plots
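A minimal sketch of turning raw scores into z-scores, standard scores (mean 100, SD 15), and stanines; the raw scores are hypothetical, and the stanine rule shown is the usual half-SD banding.

```python
from statistics import mean, stdev

raw_scores = [42, 55, 61, 48, 70, 66, 53, 59]   # hypothetical raw scores
m, s = mean(raw_scores), stdev(raw_scores)

for raw in raw_scores:
    z = (raw - m) / s                            # z-score: distance from mean in SD units
    standard = 100 + 15 * z                      # a common standard-score metric (mean 100, SD 15)
    stanine = max(1, min(9, 5 + round(2 * z)))   # nine bands, each half an SD wide
    print(f"raw={raw:3d}  z={z:+.2f}  standard={standard:5.1f}  stanine={stanine}")
```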
Percentiles
• A way of reporting where a person falls on a distribution.
• The percentile rank of a score tells you what percentage of people obtained a score equal to or lower than that score.
• So if we have a score at the 23rd %tile and another at the 69th %tile, which score is higher?
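A minimal sketch of the percentile-rank idea, using a hypothetical distribution of 20 scores.

```python
def percentile_rank(score, distribution):
    """Percentage of scores in the distribution at or below the given score."""
    at_or_below = sum(1 for s in distribution if s <= score)
    return 100 * at_or_below / len(distribution)

# Hypothetical distribution of 20 test scores
scores = [3, 4, 10, 17, 21, 24, 25, 28, 31, 33, 35, 38, 40, 44, 47, 52, 60, 71, 85, 100]
print(percentile_rank(31, scores))   # 45.0, i.e., the 45th percentile
```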
Percentiles 2
• Is a high percentile always better than a low percentile?
• It depends on what you are measuring.
• For example….
• Box and whisker plots are visual displays or graphic representations of the shape of a distribution using percentiles.
The box plot is a picture of the distribution of scores on a measure.
[Figure: Explanation of the box plot. A box plot of individual performance for Grade 2 students, marking the 10th, 25th, 50th, 75th, and 90th percentiles, with outliers plotted as individual points; the score axis runs from 0 to 20.]
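A display like the one above can be produced with matplotlib; the scores below are hypothetical, and the whiskers are set at the 10th and 90th percentiles to match the figure.

```python
import matplotlib.pyplot as plt

# Hypothetical scores for a class of Grade 2 students
grade2_scores = [4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 12, 13, 14, 18]

fig, ax = plt.subplots()
# Box spans the 25th to 75th percentiles, the line marks the 50th (median),
# whiskers sit at the 10th and 90th percentiles, outliers appear as points
ax.boxplot(grade2_scores, whis=(10, 90))
ax.set_title("Grade 2 Students")
ax.set_ylabel("Score")
plt.show()
```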
Correlation
• We need to understand the correlation coefficient to understand the manual
• The correlation coefficient, r, quantifies the relationship between two sets of scores.
• A correlation coefficient can range from -1 to +1.
– Zero means the two sets of scores are not related.
– One means the two sets of scores are perfectly related (a perfect correlation)
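A minimal sketch of computing r with NumPy; the two sets of scores are hypothetical.

```python
import numpy as np

# Hypothetical reading and spelling scores for the same six students
reading = np.array([3, 5, 7, 9, 11, 13])
spelling = np.array([4, 3, 9, 10, 10, 15])

r = np.corrcoef(reading, spelling)[0, 1]   # Pearson correlation coefficient
print(round(r, 2))                         # about .93: a strong positive relationship
```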
Correlation 2
• Correlations can be positive or negative.
• A positive correlation tells us that as one set of scores increases, the second set of scores also increases. Examples?
• A negative correlation tells us that as one set of scores increases, the other set decreases. Think of some examples of variables with negative r’s.
• The absolute value of a correlation indicates the strength of the relationship. Thus .55 is equal in strength to -.55.
How would you describe the correlations shown by these charts?
[Figure: three charts, each plotting a single series of scores (Series1), used to illustrate different patterns of correlation]
Correlation 4
• .25, .70, -.40, .55, -.87, .58, .05
• Order these from strongest to weakest (see the sketch below)
• -.87, .70, .58, .55, -.40, .25, .05
• We will meet 3 different types of correlation coefficients today:
• Reliability coefficients - Definitions?
• Validity coefficients
• Pattern coefficients
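For the ordering exercise above, a one-line check in Python sorts the coefficients by absolute value.

```python
coefficients = [.25, .70, -.40, .55, -.87, .58, .05]

# Strength is the absolute value, so sort by |r|, strongest first
print(sorted(coefficients, key=abs, reverse=True))
# [-0.87, 0.7, 0.58, 0.55, -0.4, 0.25, 0.05]
```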
Reliability
• Reliability addresses the stability, consistency, or reproducibility of scores.
– Internal consistency (split half, Cronbach’s alpha)
– Test-retest
– Parallel forms
– Inter-rater
Reliability 2
• Internal Consistency
– How do the items on a scale relate to one another? Are respondents relating to them in the same way? (A minimal Cronbach’s alpha sketch follows this slide.)
• Test-retest
– How do respondents’ scores at Time 1 relate to their scores at Time 2?
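A minimal sketch of Cronbach's alpha for internal consistency, assuming a small hypothetical respondents-by-items matrix of ratings.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Coefficient alpha for a respondents-by-items matrix of scores."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                          # number of items
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings: 5 respondents x 4 items
ratings = [[3, 4, 3, 4],
           [2, 2, 3, 2],
           [4, 5, 4, 4],
           [1, 2, 1, 2],
           [3, 3, 4, 3]]
print(round(cronbach_alpha(ratings), 2))
```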
Reliability 3
• Parallel forms
– Begin by creating at least two versions of the exam. How does respondents’ performance on one version compare to their performance on another version?
• Inter-rater
– Connected to ratings of behavior. How do one rater’s scores compare to another’s?
Validity
• Validity addresses the accuracy or truthfulness of scores. Are they measuring what we want them to?
– Content
– Criterion - Concurrent
– Criterion - Predictive
– Construct
– Face
Content Validity
• Is the assessment tool representative of the domain (behavior, curriculum) being measured?
• An assessment tool is scrutinized for its (a) completeness or representativeness, (b) appropriateness, (c) format, and (d) bias
– E.g., MSPAS
Criterion-related Validity
• What is the correlation between our instrument, scale, or test and another variable that measures the same thing, or measures something that is very close to ours?
• In concurrent validity, we compare scores on the instrument we are validating to scores on another variable that are obtained at the same time.
• In predictive validity, we compare scores on the instrument we are validating to scores on another variable that are obtained at some future time.
Structural Validity
• Used when an instrument has multiple scales.
• Asks the question, “Which items go together best?”
• For example, how would you group these items from the Self-Description Questionnaire?
• 3. I am hopeless in English classes.
• 5. Overall, I am no good.
• 7. I look forward to mathematics class.
• 15. I feel that my life is not very useful.
• 24. I get good marks in English.
• 28. I hate mathematics.
Structural Validity 2
• We expect the English items (3, 24), Math items (7, 28) and global items (5, 15) to group together.
• The items that group together make up a new composite variable we call a factor.
• We want each item to correlate highly with the factor it clusters on, and less well with other factors.
• Typically, we accept item-factor coefficients of about .30 and higher. (A small factor-analysis sketch follows this slide.)
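A minimal factor-analysis sketch using scikit-learn on simulated data with two clusters of items; the item data are hypothetical (not SDQ norms), and rotation="varimax" assumes scikit-learn 0.24 or later.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200

# Simulate 200 students with separate verbal and math self-concepts,
# each measured by two noisy items (hypothetical data)
verbal = rng.normal(size=n)
math = rng.normal(size=n)
items = np.column_stack([
    verbal + rng.normal(scale=0.5, size=n),   # verbal item A
    verbal + rng.normal(scale=0.5, size=n),   # verbal item B
    math + rng.normal(scale=0.5, size=n),     # math item A
    math + rng.normal(scale=0.5, size=n),     # math item B
])

# Two-factor solution: items measuring the same construct should show
# high coefficients on the same factor and low coefficients on the other
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)
print(np.round(fa.components_.T, 2))   # rows = items, columns = factors
```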
What can we say about the structural validity of the SDQ given these scores?
Item # Verbal Math Global
3 .587 -.044 .624
5 -.016 .024 .561
7 .086 .630 -.059
23 .019 -.015 .625
24 .754 -.006 -.024
28 -.020 .750 .042
Construct Validity
• Overarching construct: Is the instrument measuring what it is supposed to?
– Dependent on reliability, content, and criterion-related validity.
• We also sometimes look at other types of validity evidence:
– Convergent validity: r with similar construct
– Discriminant validity: r with unrelated construct
– Structural validity: What is the structure of the scores on this instrument?
Statistical Significance
• When we examine group differences in science, we want to make objective rather than subjective decisions.
• We use statistics to tell us how likely it is that the difference we are observing occurred by chance.
• In psychology, we typically set our alpha or error rate at 5% (i.e., .05): if a difference would occur by chance less than 5% of the time, we conclude that it is statistically significant.
Statistical Significance 2
• When our statistical test tells us that our difference is statistically significant (i.e., p < .05), we conclude that the difference is unlikely to be due to chance alone.
• Statistical significance is affected by a number of variables, including sample size. The larger the sample, the easier it is to achieve statistical significance.
• We also look at the magnitude of the difference (or effect size).
• A difference may be statistically significant, but have a small effect size.
• .10 to .30 = small effect; .40 to .60 = medium effect; > .60 = large effect.
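A minimal sketch of checking statistical significance and an effect size in Python, using hypothetical scores for two groups; Cohen's d is shown as one common effect-size index.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for two groups of students
group_a = np.array([52, 55, 60, 58, 61, 57, 63, 59, 54, 62])
group_b = np.array([50, 53, 55, 51, 56, 54, 52, 57, 49, 55])

# Independent-samples t-test: p < .05 is the usual cutoff for significance
t, p = stats.ttest_ind(group_a, group_b)

# Cohen's d: difference between means in pooled-SD units
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
d = (group_a.mean() - group_b.mean()) / pooled_sd
print(f"t = {t:.2f}, p = {p:.4f}, d = {d:.2f}")
```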