Test Construction and Reliability WISC-IV By Jill Hutzel, A.M, K.W & L.K


Feb 22, 2016




Page 1: Test Construction and Reliability  WISC-IV

Test Construction and Reliability WISC-IV

By Jill Hutzel, A.M, K.W & L.K

Page 2: Test Construction and Reliability  WISC-IV

What Does This Test Measure (3.2)?

• The Wechsler Intelligence Scale for Children-Fourth Edition (WISC-IV) was designed to measure intellectual functioning in specific cognitive areas such as:

• Verbal Comprehension
• Perceptual Reasoning
• Working Memory
• Processing Speed

• The test also provides a composite score (e.g., the Full Scale IQ) that represents a child’s general intellectual ability.

Page 3: Test Construction and Reliability  WISC-IV

These Four Index Scores measure a child’s overall:

Crystallized Ability (Gc)- acquired skills and knowledge that are developmentally dependent on exposure to the culture

Visual Processing (Gv)- a facility for visualizing and manipulating figures and responding appropriately to spatial forms

Fluid Reasoning (Gf)- a broad pattern of reasoning, seriation, sorting, and classifying

Processing Speed (Gs)- an ability to scan and react to simple tasks rapidly

(Sattler, 2008)

Page 4: Test Construction and Reliability  WISC-IV

What Are the Test Specifics (3.3)?

Age of Examinees: 6:0-16:11

Number of Subtests: 10 core subtests across 4 indexes, including:
(VCI) Similarities, Vocabulary, Comprehension
(PRI) Block Design, Picture Concepts, Matrix Reasoning
(WMI) Digit Span, Letter-Number Sequencing
(PSI) Coding, Symbol Search

Number of Supplemental Subtests: 5, including:
(VCI) Information, Word Reasoning
(PRI) Picture Completion
(WMI) Arithmetic
(PSI) Cancellation

Administration Time: approximately 65 to 80 minutes

Qualification of Examiners: graduate or professional level of training in psychological assessment

Page 5: Test Construction and Reliability  WISC-IV

Procedures to Norm and Standardize Test Scores (3.4)

• The test was developed in 5 general stages: Conceptual Development, Pilot, National Tryout, Standardization, and Final Assembly and Evaluation

• Sample size: 2,200 children ages 6:0 to 16:11 (the Arithmetic subtest was normed on a subsample of 1,100 children, 100 per age group)

• To provide evidence of the scale’s validity, additional children were administered the WISC-IV alongside other cognitive measures, including the WISC-III, WAIS-III, WPPSI-III, WASI, WIAT-II, CMS, GRS, BarOn EQ, and ABAS-II

• Description of the Sample: to ensure the standardization sample included representative proportions of children, it was stratified on selected demographic variables, including sex, age, race/ethnicity, parent education level, and geographic region. Researchers used March 2000 data from the U.S. Census Bureau

• AGE: 2,200 children divided into 11 age groups (6:0-6:11, 7:0-7:11, … 16:0-16:11), with 200 participants in each age group

• SEX: equal numbers of males and females in each age group (100 each)

• RACE: the proportions of racial groups were based on the racial proportions of children within that age group of the U.S. population according to the Census

• PARENT EDUCATION LEVEL: the sample was divided according to 5 parent education levels based on years of education completed

• GEOGRAPHIC REGION: divided into the 4 major geographic regions specified by the Census reports
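As a concrete illustration of the stratification just described, the sketch below allocates one 200-child age group evenly by sex and proportionally across racial groups. The census proportions used here are illustrative placeholders, not the actual March 2000 figures.

```python
# Hypothetical sketch of proportional stratified allocation for one age
# group. The race proportions are made-up placeholders for illustration.

def allocate_age_group(n_group=200, race_proportions=None):
    """Split one age group evenly by sex, then allocate each sex
    proportionally across racial groups using largest-remainder rounding."""
    if race_proportions is None:
        race_proportions = {"White": 0.62, "African American": 0.15,
                            "Hispanic": 0.17, "Asian": 0.04, "Other": 0.02}
    per_sex = n_group // 2  # equal numbers of males and females

    def largest_remainder(n, props):
        raw = {k: n * p for k, p in props.items()}
        counts = {k: int(v) for k, v in raw.items()}   # floor each cell
        short = n - sum(counts.values())               # seats left over
        # hand remaining seats to the largest fractional remainders
        for k in sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)[:short]:
            counts[k] += 1
        return counts

    return {sex: largest_remainder(per_sex, race_proportions)
            for sex in ("Female", "Male")}

plan = allocate_age_group()
assert sum(sum(v.values()) for v in plan.values()) == 200
```

Largest-remainder rounding guarantees the cells of each sex sum exactly to 100, which plain rounding of proportions does not.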

Page 6: Test Construction and Reliability  WISC-IV

Procedures Used to Develop Test Items (3.5) (Conduct and document review by relevant, independent experts, including review process and experts’ qualifications, relevant experiences, and demographics)

• Specific procedures were utilized in the WISC IV research program to optimize the quality of obtained data and to assist in the formulation of final scoring criteria.

• One of the first steps was to recruit examiners with extensive experience testing children and adolescents. Potential examiners completed a questionnaire supplying information about their educational and professional experience, administration experience with various intellectual measures, certification, and licensing status. The majority were certified or licensed professionals working in private or public facilities.

• Potential standardization examiners were provided training material, which consisted of a training video, a summary of common administration and scoring errors, and a two-part training quiz. The content of the training quiz included questions on administration and scoring rules as well as a task that required the examiner to identify administration and scoring errors in a fictitious test protocol.


Page 7: Test Construction and Reliability  WISC-IV

3.5 continued…

• Selected examiners scored at least 90% correct on both parts of the training quiz. Any errors or omissions on the training quiz were reviewed with the examiner. As an oversight measure, examiners were required to submit a review case prior to testing additional children. Every attempt was made to discuss administration and scoring errors on the review case with the examiner within 48 hours of its submission. Subsequent cases were reviewed within 72 hours of receipt if possible, and any errors resulting in loss or inaccuracy of data were discussed with the examiner. A periodic newsletter was sent to all examiners, alerting them to potentially problematic areas.

• All scorers had a minimum of a bachelor’s degree and attended a 5-day training program led by members of the research team. Scorers were required to score at least 90% correct on a quiz that required them to identify scoring errors in a fictitious protocol. Each protocol collected during the national tryout and standardization stages of development was rescored and entered into a database by two qualified scorers working independently. Any discrepancies between the two scorers were resolved daily by a third scorer (resolver). The resolvers were chosen based on their demonstrated exceptional scoring accuracy and previous scoring experience.

Page 8: Test Construction and Reliability  WISC-IV

3.6 (Empirical analyses and/or expert judgment as to the appropriateness of test items, content, and response formats for different groups of test takers)

• To ensure the validity of the WISC-IV, 16 special group studies were conducted during the WISC-IV standardization. The results from the special group studies support the validity and clinical utility of the WISC-IV. The majority of results are consistent with expectations based on previous research and the theoretical foundations of the scale’s development. It is expected that future investigations utilizing the WISC-IV in different clinical settings and populations will provide additional evidence of the scale’s utility for clinical diagnosis and intervention purposes.

Page 9: Test Construction and Reliability  WISC-IV

3.7 (Procedures used to develop, review, and tryout items from item pool)

• Early in the development process, 45 assessment professionals from eight major cities met in a focus group with members of a marketing research firm to refine revision goals and assist in formulating the scale’s working blueprint.

• Also, a telephone survey (N = 308) was conducted with users of the WISC-III as well as professionals in child and adolescent assessment. The research team, an advisory panel composed of nationally recognized experts in school psychology and clinical neuropsychology, and clinical measurement consultants from Harcourt Assessment reviewed the feedback from the focus groups and telephone surveys. Based on the findings, the working blueprint was established and the first research version of the scale was developed for use in the initial pilot study.

Page 10: Test Construction and Reliability  WISC-IV

3.7 continued…

• The primary goal of the pilot stage was to produce a version of the scale for use in the subsequent national tryout stage. A number of research questions were addressed through a series of five pilot studies (N = 255, 151, 110, 389, and 197) and three mini pilot studies (N = 31, 16, and 34). Each of these studies utilized a research version of the scale that included various groupings of subtests retained from the WISC-III and new, experimental subtests being considered for inclusion at the national tryout stage.

• The primary research questions at this stage of development focused on such issues as item content and relevance, adequacy of subtest floors and ceilings, clarity of instructions to the examiner and child, identification of response processes, administration procedures, scoring criteria, item bias, and other relevant psychometric properties.

Page 11: Test Construction and Reliability  WISC-IV

3.8 (selection procedures and demographics of item tryout and/or standardization sample)

• The national tryout stage utilized a version of the scale with 16 subtests. Data were obtained from a stratified sample of 1,270 children who reflected key demographic variables in the national population. An analysis gathered by the U.S. Bureau of the Census (1998) provided the basis for stratification along the following variables: age, sex, race, parent education level, and geographic region.

• Using this larger, more representative sample of children, research questions from the pilot phase were reexamined, and additional issues were addressed. Refinements to the item order were made based on more precise estimates of the items’ relative difficulty levels, and exploratory and confirmatory factor analyses were conducted to determine the underlying factor structure of the scale.

• In addition, data were collected at this stage from a number of special groups (children identified as intellectually gifted, children with intellectual disability or learning disorders, and children with ADHD) to provide additional evidence regarding the adequacy of the subtest floors and ceilings, as well as the clinical utility of the scale. An oversample of 252 African American children and 186 Hispanic children was collected to allow for a statistical examination of item bias using IRT methods of analysis.
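The tryout item-bias analyses used IRT methods. As a simpler stand-in illustration of screening one item for differential item functioning (DIF) between a reference group and an oversampled focal group, the sketch below uses the Mantel-Haenszel common odds ratio instead, stratifying examinees by total score. All responses are simulated, not WISC-IV data.

```python
# Mantel-Haenszel DIF screen (a simpler alternative to IRT-based bias
# analysis). An odds ratio near 1.0 suggests no DIF on the studied item.
import random
from collections import defaultdict

def mh_odds_ratio(records, item):
    """records: list of (group, item_scores) with group in {'ref', 'focal'}.
    Stratify by total score, then pool 2x2 tables across strata."""
    strata = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for group, scores in records:
        cell = strata[sum(scores)][group]
        cell[scores[item]] += 1          # index 1 = correct, 0 = incorrect
    num = den = 0.0
    for s in strata.values():
        b, a = s["ref"]                  # a = ref correct, b = ref incorrect
        d, c = s["focal"]                # c = focal correct, d = focal incorrect
        t = a + b + c + d
        if t:
            num += a * d / t
            den += b * c / t
    return num / den if den else float("inf")

# Simulate 2,000 examinees with identical ability distributions per group,
# so the studied item should show no DIF (odds ratio near 1).
random.seed(1)
records = []
for _ in range(2000):
    group = random.choice(["ref", "focal"])
    ability = random.gauss(0, 1)
    scores = [1 if random.gauss(ability, 1) > 0 else 0 for _ in range(10)]
    records.append((group, scores))

alpha = mh_odds_ratio(records, item=0)
assert 0.5 < alpha < 2.0   # no flagged DIF in this unbiased simulation
```

In practice, a log-odds ratio far from 0 (e.g., |ln α| beyond the conventional ETS thresholds) would flag the item for expert review.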

Page 12: Test Construction and Reliability  WISC-IV

3.8 Continued…

• After reviewing the accumulated evidence from the pilot and national tryout studies a standardization edition of the WISC IV was created.

• The standardization sample consisted of 2,200 children divided into 11 age groups, with each age group consisting of 200 participants. Stratification was based on an analysis of March 2000 data from the U.S. Bureau of the Census along the variables of age, sex, race/ethnicity, parent education level, and geographic region. For each age group, the proportions of Whites, African Americans, Hispanics, Asians, and other racial groups were based on the racial proportions of children within the corresponding age group of the U.S. population according to March 2000 census data. The sample was stratified according to five parent education levels based on the number of years of school completed. If the child resided with only one parent or guardian, the educational level of that parent or guardian was assigned. If the child resided with two parents, a parent and a guardian, or two guardians, the average of both individuals’ educational levels was used, with partial levels rounded up to the next highest level.

Page 13: Test Construction and Reliability  WISC-IV

Evidence for Internal Consistency (2.7)

According to the Technical Manual…

• The evidence for internal consistency was obtained using the normative sample and the split-half method. “The split-half method is done by sorting the items on a test into two parallel subtests of equal size. Then you compute a composite score for each subtest and correlate the two composite scores. By doing so, you have created two parallel tests from the items within one test. It is possible to use these subtest scores to compute an estimate of total test reliability” (Furr & Bacharach, 2008).

• As stated in the WISC-IV Technical Manual, the split-half method was used for all subtests except Coding, Symbol Search, and Cancellation, because these are Processing Speed subtests; test-retest stability coefficients were used as the reliability estimates for these particular subtests instead.
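The split-half procedure quoted above can be sketched as follows, using an odd-even item split and the Spearman-Brown correction (the standard adjustment for correlating two half-length tests). The item responses here are simulated, not WISC-IV data.

```python
# Minimal split-half reliability sketch on simulated 0/1 item scores.
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return sp / math.sqrt(ssx * ssy)

def split_half_reliability(scores):
    """scores: one list of item scores per examinee.
    Odd-even split, then Spearman-Brown correction to full length."""
    odd = [sum(items[0::2]) for items in scores]   # half-test A totals
    even = [sum(items[1::2]) for items in scores]  # half-test B totals
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)               # Spearman-Brown formula

# Simulate 50 examinees answering 20 items driven by a latent ability.
random.seed(0)
data = []
for _ in range(50):
    ability = random.gauss(0, 1)
    data.append([1 if random.gauss(ability, 1) > 0 else 0 for _ in range(20)])

rxx = split_half_reliability(data)
assert 0 < rxx <= 1   # a reliability coefficient for correlated halves
```

The correction is needed because each half is only half the test’s length; correlating the halves alone would understate the full test’s reliability.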

Page 14: Test Construction and Reliability  WISC-IV

2.7 continued…

• The reliability coefficients for the WISC-IV composite scales range from .88 (Processing Speed) to .97 (Full Scale). These coefficients are generally higher than those of the individual subtests that comprise the composite scales. The average reliability coefficient for the Processing Speed composite is slightly lower (.88) because it is based on test-retest reliabilities, which tend to be lower than split-half reliabilities. The reliability coefficients for the WISC-IV composite scales are identical to or slightly better than those of the corresponding WISC-III scales.

• The evidence of internal consistency reliability was obtained via the split-half method from children ages 6 to 16. The overall average coefficients are as follows: Verbal Comprehension (VCI) .94, Perceptual Reasoning (PRI) .92, Working Memory (WMI) .92, Processing Speed (PSI) .88, and Full Scale (FSIQ) .97.

Page 15: Test Construction and Reliability  WISC-IV

Test-Retest Approaches (Are alternate-form or test-retest approaches used, and if so, what were the results? Were separately timed administrations used to investigate a practice effect, and if so, what were the results? Additional information includes procedures used to estimate this type of reliability)


• Test-Retest Approaches: Yes, a test-retest approach was used. According to Wechsler (2004), the sample consisted of:
- 243 children
- 18 to 27 participants in each of the 11 age groups
- Each participant was given two separate WISC-IV administrations, with 13 to 63 days between test and retest (mean interval of 32 days)

Page 16: Test Construction and Reliability  WISC-IV

2.9 Continued…

• The sample consisted of:
- 52.3% female vs. 47.7% male
- 74.1% White
- 7.8% African American
- 11.1% Hispanic
- 7.0% Other

• Parent Education Level:
- 0-8 years: 4.9%
- 9-11 years: 9.1%
- 12 years: 25.9%
- 13-15 years: 36.2%
- 16+ years: 23.9%

Page 17: Test Construction and Reliability  WISC-IV

2.9 Continued…

• Pearson’s product-moment correlation was used to estimate TEST-RETEST RELIABILITY for 5 different age groups (Wechsler, 2004)

Age groups: 6-7, 8-9, 10-11, 12-13, 14-16

r = SP / √(SSx · SSy)

Table 4.4 in the WISC-IV Integrated Technical and Interpretive Manual displays:
• Mean subtest scaled scores and composite scores with SDs
• Standard differences (effect sizes) between the first and second testings
• Correlation coefficients corrected for the variability of the standardization sample
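A small worked example of the Pearson formula above, where SP is the sum of cross-products of deviations and SSx, SSy are the sums of squared deviations. The test and retest scores below are made-up toy values, not WISC-IV data.

```python
# Pearson's r computed directly from SP and SS, as in the slide's formula.
import math

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sp  = sum((a - mx) * (b - my) for a, b in zip(x, y))  # cross-products
    ssx = sum((a - mx) ** 2 for a in x)                   # squared deviations
    ssy = sum((b - my) ** 2 for b in y)
    return sp / math.sqrt(ssx * ssy)

test_scores   = [95, 102, 88, 110, 100]   # first administration (toy data)
retest_scores = [98, 105, 90, 113, 101]   # second administration (toy data)

r = pearson_r(test_scores, retest_scores)
assert 0.9 < r <= 1.0   # these toy score pairs track each other closely
```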

(Continuation and chart follow on the next two slides; the chart is not reproduced here.)

Page 18: Test Construction and Reliability  WISC-IV
Page 19: Test Construction and Reliability  WISC-IV

2.9 Continued…

• Fisher’s z transformation was used to calculate TEST-RETEST COEFFICIENTS for the overall sample (Wechsler, 2004; Williams, Weiss, & Rolfhus, 2003)

• The standard difference was calculated as: (the mean score difference between the first and second testing sessions) divided by (the pooled standard deviation)

• Effect size: a measure intended to provide a measurement of the absolute magnitude of a treatment effect, independent of the size of the sample(s) being used (Gravetter & Wallnau, 2009)

Cohen’s d = mean difference / standard deviation

• Comprehension had the smallest effect size (.08), Picture Completion had the largest (.60), and the FSIQ had an effect size of .46
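Both computations named on this slide can be sketched briefly. The correlations and score values below are illustrative placeholders, not the manual’s actual figures.

```python
# Fisher's z transformation for averaging correlations, and the standard
# difference (Cohen's d) using a pooled standard deviation.
import math

def fisher_avg(rs):
    """Average correlations via Fisher's z: transform, mean, back-transform."""
    zs = [math.atanh(r) for r in rs]       # z = 0.5 * ln((1 + r) / (1 - r))
    return math.tanh(sum(zs) / len(zs))

def standard_difference(mean1, sd1, mean2, sd2):
    """Cohen's d: mean score difference over the pooled standard deviation."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (mean2 - mean1) / pooled_sd

# Illustrative inputs: five age-group coefficients, and a retest gain of
# 5.6 points on a scale with mean 100 and SD 15.
avg_r = fisher_avg([0.85, 0.90, 0.88, 0.92, 0.87])
d = standard_difference(100, 15, 105.6, 15)

assert 0.85 < avg_r < 0.92        # the z-average lies within the inputs' range
assert abs(d - 5.6 / 15) < 1e-9   # equal SDs, so pooled SD is just 15
```

Averaging through Fisher’s z avoids the bias of averaging raw correlations directly, since r is not additive near 1.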

Page 20: Test Construction and Reliability  WISC-IV

2.9 Continued…

• RESULTS: The WISC-IV scores have adequate stability across time for all five age groups (Wechsler, 2004)

• Corrected Stability Coefficients:
- Excellent (.92): Vocabulary
- Good (.80): Block Design, Similarities, Digit Span, Coding, Letter-Number Sequencing, Matrix Reasoning, Comprehension, Symbol Search, Picture Completion, Information, Word Reasoning
- Adequate (.70): other subtests

• Composite scores have better stability than individual subtest scores: Good (.80) or better

Page 21: Test Construction and Reliability  WISC-IV

2.9 Continued…

• Retest score means for the WISC-IV subtests are higher than the scores from the first testing session, possibly due to practice effects from the short interval between test and retest
- Practice Effects: retest gains were smaller for the VCI and WMI subtests than for the PRI and PSI subtests
- Score differences between test and retest, attributed primarily to practice effects: VCI +2.1, PRI +5.2, WMI +2.6, PSI +7.1, FSIQ +5.6

• Stability of the WISC-IV in a Sample of Elementary and Middle School Children
- Ryan, Glass, and Bartels (2010) investigated test-retest stability of the WISC-IV in 43 elementary/middle school students in a rural location, tested on two separate occasions roughly 11 months apart
- They believed that the stability found in the WISC-IV standardization sample does not generalize to clinically realistic test-retest intervals and does not generalize to other populations

Page 22: Test Construction and Reliability  WISC-IV

2.9 Continued…

• Participants
- 76 students from a small private school in a Midwestern community
- 43 were retested: 25 female, 18 male

• Stability coefficients ranged from .26 (Picture Concepts) to .84 (Vocabulary); the FSIQ coefficient was .88

(Table follows on the next slide; not reproduced here.)

Page 23: Test Construction and Reliability  WISC-IV
Page 24: Test Construction and Reliability  WISC-IV

2.9 Continued…

• Results (results table not reproduced in this document)

Page 25: Test Construction and Reliability  WISC-IV

2.9 Continued…

• Stability coefficients from the standardization sample were slightly larger than those from the sample in this study (Ryan et al., 2010)
- FSIQ: .91 in the standardization sample vs. .88 in the abovementioned sample
- As in the standardization sample, composite scores were slightly more stable than individual subtest scores, with the FSIQ being the most stable

• Ryan et al. (2010) believe that the 11-month test-retest interval, compared to the 32-day interval, accounted for an overall smaller stability coefficient and an overall smaller practice effect

• This study supports Wechsler’s statistical evidence that (Ryan et al., 2010):
- The FSIQ is the most stable score provided by the WISC-IV over time
- During long test-retest intervals, only the FSIQ has sufficient stability for interpretation
- Individual subtest scores should NOT be used for any diagnostic and/or decision-making purposes

Page 26: Test Construction and Reliability  WISC-IV

Evidence Provided for Both Interrater Consistency & Consistency Over Repeated Measurements (2.10)

According to the WISC-IV Technical Manual…

• The test-retest sample for the WISC-IV was composed of 243 children. There were 18-27 participants from each of the 11 age groups.

• The WISC-IV was given once to all of the children. The test was then administered a second time anywhere from 13-63 days later; the mean interval was 32 days. The sample was 52.3% female and 47.7% male.

• “The test-retest reliability was calculated for five age groups (6:0-7:11, 8:0-9:11, 10:0-11:11, 12:0-13:11, and 14:0-16:11) using Pearson’s product-moment correlation.” The coefficients of the test-retest for the general sample were calculated using Fisher’s z transformation.

• “The standard difference was calculated using the mean score difference between two testing divided by the pooled standard deviation.”

• The retest mean scores for all seven scaled process scores are higher than those from the first testing, with effect sizes ranging from .14 to .41. “In general, test-retest gains are less pronounced for the process scores in the Working Memory domain than for the process scores in the Perceptual and Processing Speed domains” (Technical Manual, p. 136).

Page 27: Test Construction and Reliability  WISC-IV

2.10 Continued…

• In a study by Ryan, Glass, and Bartels, 76 students in a Midwestern community took the WISC-IV; the 43 students who agreed to take a second WISC-IV administration became the participants of the investigation.

• According to Ryan, Glass and Bartels (2010), in all of the dependent samples, except for one, the t-tests failed to discover significant differences in scores from the first time the WISC-IV was administered to the second time.

• “Stability coefficients in the present sample were consistently smaller than those reported in the WISC-IV Technical and Interpretive Manual (Wechsler, 2003b) for children 8 to 9 years of age.”

• The study did have some limitations: it was conducted in a rural community composed mainly of White students attending a private school, so the sample is not a good representation of an ethnically diverse population (Ryan, Glass, & Bartels, 2010).

Page 28: Test Construction and Reliability  WISC-IV

References

• Furr, R. M., & Bacharach, V. R. (2008). Psychometrics: An introduction. Thousand Oaks, CA: Sage Publications. ISBN: 978-1-412-927604

• Gravetter, F., & Wallnau, L. (2009). Statistics for the behavioral sciences (8th ed.). Belmont, CA: Wadsworth Cengage Learning.

• Ryan, J., Glass, L., & Bartels, J. (2010). Stability of the WISC-IV in a sample of elementary and middle school children. Applied Neuropsychology, 17, 68-72.

• Sattler, J. M. (2008). Assessment of children: Cognitive foundations (5th ed.). San Diego, CA: Author.

• Wechsler, D. (2004). WISC-IV Technical and Interpretive Manual. San Antonio, TX: Psychological Corporation.

• Williams, P., Weiss, L., & Rolfhus, E. (2003). WISC-IV Technical Report #1: Theoretical model and test blueprint. San Antonio, TX: Psychological Corporation.

• Williams, P., Weiss, L., & Rolfhus, E. (2003). WISC-IV Technical Report #2: Psychometric properties. San Antonio, TX: Psychological Corporation.