The Pennsylvania State University
The Graduate School
College of Education
FACTOR STRUCTURE OF WECHSLER PRESCHOOL AND PRIMARY SCALE
OF INTELLIGENCE (THIRD EDITION) – SPANISH VERSION SCORES
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
August, 2016
The dissertation of Abigail E. Crimmins was reviewed and approved* by the following:

Barbara A. Schaefer
Associate Professor of Education
Dissertation Advisor
Chair of Committee
Professor-in-Charge, School Psychology Program

Peter M. Nelson
Assistant Professor of School Psychology

Richard M. Kubina, Jr.
Professor of Education

Laura E. Murray-Kolb
Assistant Professor of Nutritional Sciences

*Signatures are on file in the Graduate School.
Abstract
A critical component in the adaptation of measures across culturally different populations
is the validation of the adapted measure for use in the new population. Validation
requires evidence that the scores from the new tool measure the same qualities or aspects
of the construct in the new population as they purport to measure in the original
population. This study examined the reliability and validity of scores for an adapted
version of the Spanish-language form of the Wechsler Preschool and Primary Scale of
Intelligence-Third Edition (WPPSI-III-SP; Wechsler, 2009) as a measure of cognitive
ability among a cohort of children in rural Peru. Using confirmatory factor analyses
(CFA), a series of models were fit to data from a cohort of children age 36 months (n =
147) and the same cohort at age 48 months (n = 167). These models represented the
theoretical factor structure established by the publisher for the normative data as well as
models derived from other studies of cross-cultural intelligence test adaptation in the
region. It was hypothesized that the models derived from prior South American studies
would yield a better fit for the data as compared to the normative sample model.
Convergent validity was also assessed based on the hypothesis that the scores from the
adapted WPPSI-III-SP would be positively and strongly correlated with cognitive scores
from a similarly adapted Bayley Scales of Infant and Toddler Development-Third Edition
(Bayley-III; Bayley, 2005) administered with these children at age 24 months. CFA
results support a one-factor model for both the 36- and 48-month time points for the
adapted WPPSI-III-SP measure; however, evidence for convergent validity with prior
estimates of cognitive ability using the adapted Bayley-III was minimal. Implications for
cross-cultural test adaptation and the use of the adapted WPPSI-III-SP are discussed.
Table of Contents
List of Tables ................................................................ vi
List of Figures ............................................................... vii
Chapter 1: INTRODUCTION ....................................................... 1
    Cultural Context of Peru .................................................. 4
    Purpose and Proposed Models ............................................... 7
Chapter 2: LITERATURE REVIEW .................................................. 11
    Cross-Cultural Test Adaptation ............................................ 11
    Cross-Cultural Intelligence Testing ....................................... 18
    Adaptation of Preschool Cognitive Assessment .............................. 22
    Description of the WPPSI-III .............................................. 25
    Present Study ............................................................. 32
Chapter 3: METHOD ............................................................. 42
    Sample .................................................................... 42
    Measures .................................................................. 43
    Procedure ................................................................. 45
    Data Analyses ............................................................. 46
Chapter 4: RESULTS ............................................................ 52
    Younger Cohort ............................................................ 52
        Preliminary Analyses and Descriptive Statistics ....................... 52
        Model 1-NormYoung ..................................................... 54
        Model 2-OneFactorYoung ................................................ 56
        Model Comparison ...................................................... 58
    Older Cohort .............................................................. 59
        Preliminary Analyses and Descriptive Statistics ....................... 59
        Model 3-NormOlder ..................................................... 61
        Model 4-OneFactorOlder ................................................ 62
        Model 5-AltOlder ...................................................... 64
    Convergent Validity Analyses .............................................. 65
Chapter 5: DISCUSSION ......................................................... 67
    Fit of Hypothesized Models ................................................ 68
    Convergent Validity Evidence .............................................. 75
    Limitations and Future Research ........................................... 77
    Implications .............................................................. 79
    Conclusions ............................................................... 81
References .................................................................... 82
Appendix A. Overview of WPPSI Content ......................................... 93
Appendix B. Parameter Estimates and Standard Errors for Models 3, 4, and 5 ... 95
List of Tables
Table 1. Model Identification Rules for Standard CFA Models................................. 48
Table 2. Free Parameters and Observations for Specified Models............................. 48
Table 3. Intercorrelations, Descriptive Statistics, and Reliability Estimates for Adapted WPPSI-III-SP Subtest Scores (Younger Cohort)............................ 53
Table 4. Parameter and Standard Error Estimates for Model 1-NormYoung........... 55
Table 5. Parameter and Standard Error Estimates for Model 2-OneFactorYoung.. 57
Table 6. Selected Fit Indices for Younger Cohort Models......................................... 58
Table 8. Intercorrelations, Descriptive Statistics, and Reliability Estimates for Adapted WPPSI-III-SP Subtest Scores (Older Cohort).................................. 60
Table 9. Parameter and Standard Error Estimates for Model 4-OneFactorOlder with Modifications.......................................................................................... 63
Table 10. Selected Fit Indices for Older Cohort Models............................................ 65
Table 11. Descriptive Statistics and Reliability Estimates for Adapted WPPSI-III-SP Total Scores and Adapted Bayley-III Scores (Younger and Older Cohorts)........................................................................................................... 66
Table A1. Subtests and Composite Scores of the WPPSI, WPPSI-R, and WPPSI-III by Age........................................................................................... 94
Table B1. Parameter and Standard Error Estimates for Model 3-NormOlder............ 95
Table B2. Parameter and Standard Error Estimates for Model 4-OneFactorOlder.... 96
Table B3. Parameter and Standard Error Estimates for Model 5-AltOlder.......... 97
List of Figures
Figure 1. Hypothesized model of the adapted WPPSI-III-SP scores among the Peruvian cohort at age 36 months based on the factor structure of the normative data (Wechsler, 2002b), referred to as Model 1-NormYoung for analyses........................................................................................................... 37
Figure 2. Hypothesized single-factor model of the adapted WPPSI-III-SP for the
Peruvian cohort at age 36 months, referred to as Model 2-OneFactorYoung model for analyses................................................................................... 38
Figure 3. Hypothesized model of the adapted WPPSI-III-SP for the Peruvian
cohort at age 48 months based on the factor structure of the normative data (Wechsler, 2002b), referred to as Model 3-NormOlder for analyses............ 39
Figure 4. Hypothesized single-factor model of the adapted WPPSI-III-SP for the
Peruvian cohort at age 48 months, referred to as Model 4-OneFactorOlder model for analyses.......................................................................................... 40
Figure 5. Hypothesized model of the adapted WPPSI-III-SP for the Peruvian
cohort at age 48 months, referred to as Model 5-AltOlder model for analyses........................................................................................................... 41
Figure 6. Completely standardized factor loadings for Model 1-NormYoung...........
Figure 7. Completely standardized factor loadings for Model 2-OneFactorYoung..
Figure 8. Completely standardized factor loadings for modified Model 4-
Quotient. The VIQ and PIQ aim to measure the same abilities as described for the
younger cohort. The subtests of Information and Block Design carry over from the
younger age range to the older range. Information, Vocabulary, and Word Reasoning
subtests combine to provide the VIQ. The Vocabulary subtest asks children to verbally
define a series of words, whereas the Word Reasoning subtest asks the children to
identify the common concept being described in a series of increasingly specific clues.
The Block Design, Matrix Reasoning, and Picture Concepts subtests combine to provide
the PIQ composite. During the Matrix Reasoning subtest, the child looks at an incomplete
matrix and selects the missing portion from response options. The child chooses pictures
from a series of rows that form a group with a common characteristic during the Picture
Concepts subtest. The scores from these core subtests, in addition to the Coding subtest,
are then combined to provide an overall measure of general intellectual functioning
(FSIQ; Wechsler, 2002a). During the Coding subtest, the child is presented with a series
of geometric shapes (e.g., star, circle, square). The child uses a key to copy symbols (e.g.,
line, cross) into each shape within a certain time limit.
Reliability and validity. The WPPSI-III was first developed and standardized in
the United States using a normative sample of 1,700 children. The sample was stratified
on the characteristics of age, sex, ethnicity, geographic region, and parental education
(Wechsler, 2002b). The scores from the test were found to have good reliability with
internal consistency indices ranging from .94 to .96 for the VIQ, from .89 to .95 for the
PIQ, and from .95 to .97 for the FSIQ. Split-half reliability estimates for the core subtests
ranged from .83 (Symbol Search) to .91 (Word Reasoning; Wechsler, 2002b). As
reported in the test’s technical manual (Wechsler, 2002b) and through a principal axis
factor analysis by Sattler (2008), the WPPSI-III is comprised of two factors for the
younger age range and four factors for the older age range. These factors align with the
composite and subtest groups described previously. For the younger children, significant
factor loadings for the subtests onto their respective factors ranged from .59 (Object
Assembly on PIQ) to .85 (Receptive Vocabulary on VIQ). For the older children,
significant factor loadings for the subtests onto their respective factors ranged from .38
(Matrix Reasoning on PIQ) to .88 (Vocabulary on VIQ).
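Split-half reliability coefficients like those reported above are typically obtained by correlating two half-test scores and applying the Spearman-Brown correction. The sketch below illustrates the computation in Python under a common odd/even item split; the data and split rule are illustrative assumptions, not the publisher's procedure.

```python
import numpy as np

def split_half_reliability(item_scores: np.ndarray) -> float:
    """Estimate split-half reliability with the Spearman-Brown correction.

    item_scores: 2D array, rows = examinees, columns = items.
    Splits items into odd/even halves (one common convention), correlates
    the two half-test scores, then steps the correlation up to full length.
    """
    odd_half = item_scores[:, 0::2].sum(axis=1)
    even_half = item_scores[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    # Spearman-Brown prophecy formula for doubling the test length
    return 2 * r_half / (1 + r_half)
```

With reliable items that share a common ability signal, the corrected coefficient approaches the .83 to .91 range reported for the WPPSI-III core subtests.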
However, Gordon (2004) notes in his review of the WPPSI-III that
intercorrelational relationships among the subtests point to a potential one-factor
structure, with all subtests loading on a general intellectual factor. A two-factor
structure would require the VIQ subtests to correlate more highly with each
other (convergent validity) than with the PIQ subtests (discriminant validity). Across
both age bands, the VIQ subtests do correlate more highly with each other than
with the subtests from the PIQ composite. This pattern, however, does not hold for the
PIQ subtests, which correlated as highly with each other as with the
subtests from the VIQ factor. The WPPSI-III test manual (Wechsler, 2002b) addresses
these validity concerns. The authors posit that this lack of discriminant validity may be a
reflection of less differentiation between cognitive abilities evidenced among young
children and may be due to the high g (general intelligence) loadings of all subtests.
Based on these arguments and the intercorrelations presented, Gordon (2004) questions
whether a one-factor structure may be more appropriate, particularly for the
younger age band. The factor structure presented in the manual was replicated through
principal axis factor analysis (Sattler, 2008) and item response theory (Price, Raju, Lurie,
Wilkins, & Zhu, 2006). Beyond the intercorrelational evidence highlighted by Gordon
(2004), no further studies were found to either confirm or dismiss the superiority of a
one-factor structure.
The WPPSI-III has been translated, adapted, and standardized for use in the
following languages: Spanish (normed in Spain), French (normed in France), French
Canadian, German, Italian, Swedish, Korean, Japanese, and Dutch. Standardization also
occurred in Australia, the United Kingdom, and Canada (Visser, Ruiter, van der Meulen,
Ruijssenaars, & Timmerman, 2012). Few further adaptations of the WPPSI-III for use in
a culture different from the original normative culture were found within the literature.
While Bagdonas, Pociute, Rimkute, and Valickas (2008) refer to the adaptation of the
WPPSI-III for use in Lithuania, no studies confirming this adaptation process were found.
Furthermore, Wasserman and colleagues (2004) outlined the translation and adaptation of
the WPPSI-III for use among young children in Bangladesh. However, no description of
reliability or validity evidence for scores from the adapted measure was provided.
Similarly, Karino, Laros, and Ribeiro de Jesus (2011) used an adapted version of the
WPPSI-III in a study with no mention of the adaptation and validation process. It
should be noted that the languages in which such studies are published limit any
search for evidence of cross-cultural validation of the WPPSI-III. Studies providing this
evidence may exist and be accessible to researchers or clinicians within the countries
that use the adapted measures. For the purposes of this study, however, it
remains unclear as to whether or not the factor structures provided within the
standardization samples of WPPSI-III are replicated when these assessments are adapted
for use in a dissimilar culture.
Spanish adaptation. In 2009, a Spanish-language version of the WPPSI-III, the
Escala de Inteligencia de Wechsler para Preescolar y Primaria – III (hereafter identified
as the WPPSI-III-SP; Wechsler, 2009) was adapted and normed for use in Spain. The test
was normed on a sample of 1,220 Spanish children (Rodriguez & Miguel, 2012). This
test contains the same subtests and composite scores as the English-language version.
Through the adaptation process, however, items were changed or adapted to be culturally
appropriate for use in Spain. The order of items was necessarily changed to ensure
that items became increasingly difficult for the Spanish children.
Present Study
The question remains, however, as to whether an adapted version of the
WPPSI-III-SP reliably and fairly measures intelligence in a cohort of children in rural
Peru. The primary purpose of this study, therefore, is to examine the extent to which a
model based on the factor structure derived from scores on the WPPSI-III-SP completed
by the Spanish normative sample fits the scores from an adapted WPPSI-III-SP completed
by children from rural communities in Peru (see Figures 1 and 3). Given the cultural,
developmental, construct, and adaptation considerations, however, the present study also
proposes to determine if another factor structure would provide a better fit. As such, the
study attempts to answer the following questions:
1. Is the factor structure of scores from the original version of the WPPSI-III
replicated with children living in rural Peruvian communities?
2. If not, does another model fit the data better than the model outlined within the
normative sample?
In addition, this study aims to assess the construct validity of the scores from the
adapted measure by assessing the relationship of these scores to the scores of another
measure of cognitive development. The final research question addresses the convergent
validity of the adapted measure’s scores:
3. To what extent do the scores from the adapted WPPSI-III-SP correlate with the
scores from another, previously administered adapted measure of cognitive
development?
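Research question 3 reduces to a bivariate correlation between two sets of scores. As an illustrative sketch (with invented scores, not study data), the correlation could be computed as:

```python
import numpy as np
from scipy import stats

# Hypothetical illustration: paired scores for the same children on the
# adapted WPPSI-III-SP and the earlier adapted Bayley-III (values invented).
wppsi_total = np.array([38, 45, 41, 52, 47, 60, 55, 49, 43, 58])
bayley_cog = np.array([80, 92, 85, 98, 90, 110, 105, 95, 88, 108])

r, p = stats.pearsonr(wppsi_total, bayley_cog)
print(f"r = {r:.2f}, p = {p:.4f}")
# A strong positive, statistically significant r would support convergent validity.
```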
Possible alternative factor structures. Looking to other instances of intelligence
test adaptation in the region may help predict possible factor structures of the adapted
WPPSI-III-SP, outside the factor structure of the normative sample. As noted earlier,
no studies were found to examine the validity of scores from a cross-culturally
adapted version of the WPPSI-III. However, two studies of the adaptation of an
intelligence measure for older children were conducted with children in Colombia and
Chile. While the cultures of Colombia and Chile are not the same as the culture in rural
Peru, the cultures may be more similar than a comparison between Spain and rural Peru.
As such, the standardization of intelligence tests in these two countries may offer possible
alternative factor structures to consider, despite outlining the psychometric properties of
an assessment aimed at older children.
Contreras and Rodriquez (2013) studied the reliability and validity of scores from
the Spanish version of the Wechsler Intelligence Scale for Children – Fourth Edition
(WISC-IV-SP; Wechsler, 2005) in a sample of children and adolescents from
Bucaramanga, Colombia. Similar to the WPPSI-III, the WISC-IV-SP's 15
subtests are organized into a four-factor structure. Regarding reliability, the WISC-IV scores had
similar reliability estimates for the Colombian sample as they had for the Spanish
version’s normative sample. For the overall assessment, Contreras and Rodriquez (2013)
calculated a split-half alpha coefficient of .95 and a Cronbach’s alpha coefficient of .98.
Regarding validity, the data did not support a four-factor structure as presented in the
original version of the test. Instead, the researchers found evidence for a single factor that
accounted for 70.26% of the total variance in the test scores (Contreras & Rodriquez,
2013). In addition, Baron and Leonberger (2012) argue that the intellectual functioning of
the preschool-aged child is more homogeneous than is the cognitive ability of an older
individual. As such, the alternative models described in Figures 2 and 4 may prove a
better fit for the data from the Peruvian sample of children.
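The 70.26% figure reported by Contreras and Rodriquez (2013) is a proportion of total variance attributed to a single factor. As a rough illustration (using principal components on a hypothetical correlation matrix, not the authors' exact factor-analytic method), such a proportion can be computed from the largest eigenvalue of the subtest correlation matrix:

```python
import numpy as np

def first_component_variance(corr: np.ndarray) -> float:
    """Proportion of total variance carried by the first principal
    component of a correlation matrix (largest eigenvalue / number of
    variables)."""
    eigenvalues = np.linalg.eigvalsh(corr)  # returned in ascending order
    return eigenvalues[-1] / corr.shape[0]

# Hypothetical 4-subtest correlation matrix with a strong general factor
corr = np.array([
    [1.0, 0.7, 0.7, 0.7],
    [0.7, 1.0, 0.7, 0.7],
    [0.7, 0.7, 1.0, 0.7],
    [0.7, 0.7, 0.7, 1.0],
])
print(first_component_variance(corr))  # ≈ 0.775, i.e., 77.5% of total variance
```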
In a second study of an intelligence test adaptation in South America, Ramirez
and Rosas (2007) adapted the Argentinian version of the Wechsler Intelligence Scale for
Children – Third Edition (WISC-III; Wechsler, 1991/1997) for use in Chile. The
researchers administered the adapted test to a stratified sample of 1,914 children, divided
into 11 age categories. The internal consistency of the subscale and composite scores and
the factorial structure of the test are reported. Regarding internal consistency, Cronbach’s
alpha coefficients ranged from .65 to .91 for the subscale scores and from .75 to .87 for
the composite scores. Using factor analysis, Ramirez and Rosas (2007) analyzed the
factor structure of the sample as a whole and for four age ranges (6 – 7, 8 – 10, 11 – 13,
and 14 – 16 years). Overall, the individual subtests loaded onto four distinct factors in a
manner consistent with the original test’s factor structure. In looking at the results for the
four age ranges, however, the Coding subtest loaded significantly on the factor
representing Perceptual Organization and not on the supplemental Processing Speed
factor, as was Coding’s loading in the Argentinian sample. The authors argue that this
result perhaps demonstrates children’s cognitive skills have not fully differentiated and is
a reflection of preoperational thought. In other words, the Coding subtest may require
more abstract reasoning than previously theorized (Ramirez & Rosas, 2007).
This interpretation may be especially relevant for a population in which children
may not have had extensive exposure to paper-and-pencil educational activities. To be
able to quickly complete the Coding subtest, a child relies on fluent skills in shape and
symbol identification. A child who has not had formal exposure to this type
of paper-and-pencil test may rely more on his or her perceptual reasoning to
interpret the larger shape and then identify which symbol goes into that shape. As such,
the task becomes more a performance task than a processing speed task. Therefore, a third
proposed model for the older cohort of children posits the loading of Coding on the PIQ
factor (see Figure 5).
In summary, for each age group of children (i.e., 36-month and 48-month
cohorts), two potential models are proposed: (1) a model identical to the factor structure
demonstrated in the normative data (see Figures 1 and 3) and (2) a one-factor model, in
which all subtests load on one general latent factor of intelligence (i.e., no separate
composite scores; see Figures 2 and 4; Contreras & Rodriquez, 2013). For the 48-month-
old children, a third model is proposed in which the Coding subtest loads on the
Performance factor (see Figure 5; Ramirez & Rosas, 2007).
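For readers who work with SEM software, the competing structures can be written in lavaan-style model syntax (used by R's lavaan and Python's semopy). The snippet below is an illustrative sketch with assumed subtest labels, not the exact specification files used in this study:

```python
# Lavaan-style model syntax (as used by R's lavaan or Python's semopy).
# Subtest names follow the document; treat these as illustrative sketches.

# Models 2 and 4: one general factor for all core subtests (older cohort shown)
one_factor = """
g =~ Information + Vocabulary + WordReasoning +
     BlockDesign + MatrixReasoning + PictureConcepts + Coding
"""

# Model 5: the normative two-factor structure, but with Coding moved to PIQ
alt_older = """
VIQ =~ Information + Vocabulary + WordReasoning
PIQ =~ BlockDesign + MatrixReasoning + PictureConcepts + Coding
"""
```

In lavaan-style syntax, `=~` reads "is measured by," so each line lists the indicators hypothesized to load on a latent factor.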
Hypotheses. It was hypothesized that, if the adapted WPPSI-III-SP measures the
construct of intelligence in the same manner as the original version, its theorized factor
structure would provide an adequate fit for the Peruvian cohort data.
However, given that the cultures of Colombia and Chile may more closely resemble the
Peruvian culture than does the culture of Spain, it was also hypothesized that the models based on
the research in these South American countries would provide a better fit to the data than
the model based on the normative data. Finally, it was hypothesized that the scores from
an adapted measure of intelligence completed by the children at age 24 months would be
positively correlated with scores from the adapted WPPSI-III-SP.
Figure 1. Hypothesized model of the adapted WPPSI-III-SP scores among the Peruvian cohort at age 36 months based on the factor structure of the normative data (Wechsler, 2002b), referred to as Model 1-NormYoung for analyses. GLC = General Language Composite; VIQ = Verbal Intelligence Quotient; PIQ = Performance Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient.
Figure 2. Hypothesized single-factor model of the adapted WPPSI-III-SP for the Peruvian cohort at age 36 months, referred to as Model 2-OneFactorYoung model for analyses. This model proposes a single overall ability as found by Contreras and Rodriguez (2013).
Figure 3. Hypothesized model of the adapted WPPSI-III-SP for the Peruvian cohort at age 48 months based on the factor structure of the normative data (Wechsler, 2002b), referred to as Model 3-NormOlder for analyses. VIQ = Verbal Intelligence Quotient; PIQ = Performance Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient.
Figure 4. Hypothesized single-factor model of the adapted WPPSI-III-SP for the Peruvian cohort at age 48 months, referred to as Model 4-OneFactorOlder model for analyses. This model proposes a single overall ability as found by Contreras and Rodriguez (2013).
Figure 5. Hypothesized model of the adapted WPPSI-III-SP for the Peruvian cohort at age 48 months, referred to as Model 5-AltOlder model for analyses. This model proposes the loading of Coding on the factor representing the Performance Intelligence Quotient (PIQ), based on the findings of Ramirez and Rosas (2007). VIQ = Verbal Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient.
Chapter 3: Method
Sample

A total of 188 children (101 boys) completed the younger version only (10.63%
of children), older version only (18.09%), or both versions (71.28%) of the adapted
WPPSI-III-SP. Three children were missing gender data. Among this cohort of children,
on average, the children’s mothers completed 7.77 years of education (SD = 2.68). One
hundred fifty-six children, aged 35 to 47 months (M = 36.38, SD = 3.04), completed the
younger version of the assessment. For the older version, 168 children completed the
test, all aged 48 months, with the exception of one child age 49 months and one child age
36 months. For brevity, these will be referred to as the younger and older cohorts,
respectively; however, the cohorts overlap substantially (i.e., the 71.28% noted above).
These children were drawn from a larger sample of children participating in the
Interactions of Malnutrition and Enteric Infections: Consequences for Child Health and
Development (MAL-ED) project overseen by the Foundation for the National Institutes
of Health and the Fogarty International Center. All participants lived in rural
communities in northeastern Peru. A review of this cultural context is provided in the
Introduction. Children were eligible to participate in the longitudinal study if their mother
was older than 16 years of age, if no other child in the household participated in the
study, if they were healthy (e.g., no congenital diseases or severe neonatal diseases
requiring prolonged hospitalization), if the family had no plans to move away from the
community within 6 months, and if the child was not part of a multiple pregnancy.
Measures
Adapted Wechsler Preschool and Primary Scale of Intelligence - Spanish
Version (adapted WPPSI-III-SP). The cognitive ability of each child was measured
through an adapted version of the Wechsler Preschool and Primary Scale of Intelligence -
Spanish Version (Wechsler, 2009). A review of the original WPPSI-III and the WPPSI-
III-SP is presented in Chapter 2. For use with the Peruvian sample, adaptations were
made to the WPPSI-III-SP by researchers from the Department of International Health at
Johns Hopkins University. These adaptations included altering pictures to be culturally
appropriate and rewording instructions to be appropriate for the dialect of Spanish spoken
in Peru.
For example, for an item on the Picture Concepts subtest, the picture of a
capybara replaced the picture of a squirrel. As squirrels are not native to the Peruvian
jungle environment, the children would have been unfamiliar with the animal and the
item may have been unfairly difficult for them to answer. In another example on this
subtest, pictures of dogs that were more realistic and recognizable to the children
replaced the original cartoon pictures. In an example of changes made for Peruvian Spanish, the item
asking children to define "Swing" on the Vocabulary subtest was altered. The word used
on the WPPSI-III-SP signifies both the object (e.g., a playground swing) and a movement
(e.g., to swing back and forth) in Castilian Spanish. In Peruvian Spanish, however, the
word only represents the object. Another word was substituted asking the child to
describe the movement of swinging (A. Orbe, personal communication, July 14, 2014).
For the Block Design subtest, children earned a score of 0, 1, or 2 for
each item. Scoring for this subtest depended not only on whether the child
completed the construction within a specified time limit but also on whether the
child required one or two trials to do so. Children could earn either a score of 0 (incorrect
answer) or 1 (correct answer) on each item of the Information, Receptive Vocabulary,
Word Reasoning, Matrix Reasoning, and Picture Concepts subtests. Scores on the Object
Assembly subtest were based on the number of junctures (i.e., the place where two
adjacent puzzle pieces meet) correctly joined, with a possible per item score ranging from
0 to 5 points. For the Coding subtest, children received one point for each correctly
paired symbol and shape. Finally, for each item on the Vocabulary subtest, children could
earn 0, 1, or 2 points, with more sophisticated and specific definitions earning a higher
score. Possible score ranges for the subtests were as follows: (1) Block Design: 0 - 40; (2)

[Table 3 note: n = 147. WPPSI-III-SP = Wechsler Preschool and Primary Scale of Intelligence (Third Edition) - Spanish Version.]

The simultaneous test of multivariate skewness and kurtosis was statistically
significant, χ²(2) = 52.28, p < .001. However, the relative multivariate kurtosis was 1.08,
indicating that the multivariate kurtosis was 8% larger than expected under a multivariate
normal distribution. Multivariate kurtosis, therefore, was considered mildly non-normal (per
Kline, 1998). The χ² tests of simultaneous univariate skewness and kurtosis were also
statistically significant for all variables at the .05 level, with the exception of the Block
Design subtest. All variables with statistically significant simultaneous univariate
skewness and kurtosis had skewness values that were significantly different
from normal (p < .05). Object Assembly was the only subtest with a kurtosis value
significantly different from normal. In analyzing the skewness and kurtosis values
presented in Table 3, univariate skewness fell in the moderate range for Object Assembly
and in the mild range for all other variables. Univariate kurtosis was considered mild for
all variables. In sum, the data were considered to be mildly non-normal, justifying the use
of robust tests in the analyses.
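The relative multivariate kurtosis used above is Mardia's multivariate kurtosis statistic divided by its expected value under multivariate normality, p(p + 2). A sketch of this computation (assuming the standard Mardia formula; SEM software may apply small-sample corrections that differ slightly):

```python
import numpy as np

def relative_multivariate_kurtosis(X: np.ndarray) -> float:
    """Mardia's multivariate kurtosis divided by its expectation p(p + 2)
    under multivariate normality; values near 1 indicate normal kurtosis."""
    n, p = X.shape
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True)  # ML (biased) covariance
    inv_cov = np.linalg.inv(cov)
    # Squared Mahalanobis distance of each observation
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    b2p = np.mean(d2 ** 2)  # Mardia's b_{2,p}
    return b2p / (p * (p + 2))
```

For multivariate normal data the ratio is close to 1.0; the 1.08 reported above corresponds to kurtosis 8% above that baseline.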
Intercorrelations were computed to determine the relationships among the various
subtest scores; all values are reported in Table 3. All subtests were positively
correlated with each other. The correlation between Receptive Vocabulary and
Information fell in the moderate range; all other correlations were weak. Scores from the
Receptive Vocabulary subtest were found to have good reliability, and Block Design
subtest scores demonstrated acceptable reliability, whereas the Information
and Object Assembly subtest scores revealed low and poor reliability, respectively (see
Table 3). Overall, the average reliability coefficient across all four subtests was low
(Cronbach's α = 0.61).
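The reported coefficient treats the four subtest scores as items in a single scale. A minimal sketch of Cronbach's alpha, assuming a complete examinee-by-subtest score matrix:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a score matrix
    (rows = examinees, columns = subtests or items)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```

When the columns are perfectly parallel (identical scores), alpha equals exactly 1.0; weakly intercorrelated subtests, as in Table 3, pull the coefficient down toward values like the .61 reported here.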
Model 1-NormYoung. The model based on the normative structure appears to
provide a reasonable fit to the data. Selected fit indices are presented in Table 6. The fit
indices of CFI, NNFI, and IFI fall above .95, indicating good fit. The RMSEA falls below
.06, also indicating good fit. Finally, the Satorra-Bentler Scaled Chi-Square is
nonsignificant, indicating good overall fit, χ2SB = 0.092, p = 0.762, df = 1.
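The global-fit criteria applied throughout this chapter (CFI, NNFI, and IFI above .95 and RMSEA below .06 for good fit) can be expressed as a small helper. This is purely illustrative of the decision rule, not part of the original analyses:

```python
def good_global_fit(cfi, nnfi, ifi, rmsea):
    """True when all incremental indices exceed .95 and RMSEA is
    below .06 -- the 'good fit' criteria used in this study."""
    return all(v > 0.95 for v in (cfi, nnfi, ifi)) and rmsea < 0.06
```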
Regarding component fit, parameter and standard error estimates are presented in
Table 4. All completely standardized factor loadings are within range and statistically
significant (z-test statistics > 1.96; see Figure 6). Furthermore, as expected, all
path coefficients were positive (i.e., a positive relationship between indicator and latent
variable). The standard errors are reasonable as they are smaller than the standard
deviations of the indicator variables. All standardized residuals are acceptable (< |2.58|).
No modification indices were greater than 3.84, and all standardized expected change
values were small, suggesting that no paths should have been freed. Measurement model
R2 values were poor (< .36) for Object Assembly (R2 = .21), Block Design (R2 = .33), and
Receptive Vocabulary (R2 = .32), and moderate for Information (R2 = .52).
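These R2 values follow directly from the completely standardized loadings: with a single loading per indicator and no correlated uniquenesses, an indicator's R2 is its squared loading. A quick illustrative check (small discrepancies from the tabled values reflect rounding of the reported loadings):

```python
# Completely standardized loadings reported for Model 1-NormYoung
loadings = {
    "Receptive Vocabulary": 0.57,
    "Information": 0.73,
    "Block Design": 0.57,
    "Object Assembly": 0.46,
}
# R^2 = squared standardized loading
r_squared = {name: round(val ** 2, 2) for name, val in loadings.items()}
```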
Table 4

Parameter and Standard Error Estimates for Model 1-NormYoung

Model Parameters    Standardized Estimate    Unstandardized Estimate    Standard Error
Loadings on VIQ
Receptive Vocabulary .57 2.33a 2.02
Information .73 1.67* 0.63
Loadings on PIQ
Block Design .57 2.25a 1.99
Object Assembly .46 0.78* 0.41
Loadings on FSIQ
VIQ .96 0.43a 0.21
PIQ .99 0.45a 0.33
Note. Table values are Maximum Likelihood estimates. VIQ = Verbal Intelligence Quotient; PIQ = Performance Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient. *p < .05. a fixed factor loading.
Figure 6. Completely standardized factor loadings for Model 1-NormYoung. *p < .05. a fixed factor loading.

Model 2-OneFactorYoung. The one-factor model based on the research of
Contreras and Rodriquez (2013) appears to provide a reasonable fit to the data. Selected
fit indices are presented in Table 6. The fit indices of CFI, NNFI, and IFI fall above .95,
indicating good fit. The RMSEA falls below .06, also indicating good fit. Finally, the
Satorra-Bentler Scaled Chi-Square is nonsignificant, indicating good overall fit, χ2SB =
0.191, p = .91, df = 2.
Regarding component fit, parameter and standard error estimates are presented in
Table 5. All completely standardized factor loadings are within range and statistically
significant (z-test statistics > 1.96; see Figure 7). As expected, all path coefficients were
positive (i.e., a positive relationship between indicator and latent variable). The standard
errors are reasonable as they are all smaller than the standard deviations of the indicator
variables. All standardized residuals are acceptable (< |2.58|). No modification indices
were greater than 3.84, and all standardized expected change values were small,
suggesting that no paths should have been freed. Measurement model R2 values were
poor (< .36) for Receptive Vocabulary (R2 = .32), Block Design (R2 = .31), and Object
Assembly (R2 = .20), and moderate for Information (R2 = .52).
Table 5
Parameter and Standard Error Estimates for Model 2-OneFactorYoung
Model Parameters    Standardized Estimate    Unstandardized Estimate    Standard Error
Loadings on FSIQ
Receptive Vocabulary .57 2.33* 0.32
Information .72 1.65* 0.25
Block Design .55 2.17* 0.35
Object Assembly .45 0.76* 0.14
Note. Table values are Maximum Likelihood estimates. FSIQ = Full Scale Intelligence Quotient. *p < .05
Figure 7. Completely standardized factor loadings for Model 2-OneFactorYoung. *p < .05.

Table 6

Selected Fit Indices for Younger Cohort Models

Note. χ2SB = Satorra-Bentler Chi-Square; df = degrees of freedom; RMSEA = Root Mean Square Error of Approximation; CI90 = 90% Confidence Interval for RMSEA; CFI = Comparative Fit Index; NNFI = Non-Normed Fit Index; IFI = Incremental Fit Index.

Model comparisons. To compare the overall fit of Model 1-NormYoung to
Model 2-OneFactorYoung, a Satorra-Bentler Chi-Square difference test was conducted.
As summarized in Table 7, results suggest that Model 1-NormYoung and Model 2-
OneFactorYoung are equivalent.
Table 7
Satorra-Bentler Chi-Square Difference Test
χ2SB    df
Model 2-OneFactorYoung    0.188    2
Model 1-NormYoung    0.092    1
Difference    0.096    1

Note. Difference is statistically significant if greater than 3.84.
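When scaling correction factors are available, the Satorra-Bentler scaled difference statistic is computed from the ordinary ML chi-squares of the nested models; with equal correction factors it reduces to the simple difference of scaled values used in Table 7. The sketch below is illustrative; the correction factors c0 and c1 are hypothetical inputs, as they are not reported here:

```python
def sb_scaled_chi_square_difference(t0, c0, df0, t1, c1, df1):
    """Satorra-Bentler scaled chi-square difference test.

    t0, t1: ordinary ML chi-squares for the nested (more constrained)
    and comparison models; c0, c1: their scaling correction factors;
    df0 > df1. Returns the scaled difference statistic and its df."""
    delta_df = df0 - df1
    # correction factor for the difference test
    cd = (df0 * c0 - df1 * c1) / delta_df
    return (t0 - t1) / cd, delta_df
```

The resulting statistic is referred to a chi-square distribution with df equal to the difference in model degrees of freedom (here, 1), with 3.84 as the .05 critical value.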
Older Cohort
Preliminary analyses and descriptive statistics. Preliminary analyses were
conducted to examine the scores from the older cohort for outliers and missing data. The
data from one child aged 36 months were deleted listwise because this age fell outside the
range of the assessment. Although some cases had scores greater than three standard
deviations from the mean, no Mahalanobis distance scores were significant (CV = 24.32
at p = .001). The final sample size for data analysis was 167. Table 8 presents the
descriptive statistics (i.e., intercorrelations, means, skew, and kurtosis) for this dataset.
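The multivariate outlier screen described above flags cases whose squared Mahalanobis distance from the sample centroid exceeds a chi-square critical value with df equal to the number of variables (with seven subtests, the p = .001 critical value is approximately 24.32). A minimal sketch, assuming the subtest scores are in a NumPy array; it is illustrative only:

```python
import numpy as np

def flag_multivariate_outliers(X, critical_value=24.32):
    """Flag rows whose squared Mahalanobis distance exceeds the
    chi-square critical value (here the df = 7, p = .001 value
    of roughly 24.32)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    s_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", centered, s_inv, centered)
    return d2 > critical_value
```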
Note. n = 167. WPPSI-III-SP = Wechsler Preschool and Primary Scale of Intelligence (Third Edition) - Spanish Version.

The simultaneous test of multivariate skewness and kurtosis was statistically
significant, χ2 (2) = 474.35, p < .001. The relative multivariate kurtosis was 1.52,
indicating that multivariate kurtosis was 52% larger than that of a multivariate normal
distribution. The χ2 tests of simultaneous univariate skewness and kurtosis were also
statistically significant for all variables at the .05 level. Regarding univariate skewness
and kurtosis, all variables had skewness and kurtosis values that were significantly
different from normal (p < .05), with two exceptions. The skewness value for Information
and the kurtosis value for Vocabulary were not significant. In analyzing the skewness and
kurtosis values presented in Table 8, univariate skewness fell in the moderate range for
all variables. Univariate kurtosis was considered mild for all variables, with the exception
of the Word Reasoning subtest. This subtest demonstrated moderate kurtosis. In sum, the
data were considered to be non-normal, justifying the use of robust tests in the analyses.
Intercorrelations were conducted to determine the relationships among the various
subtest scores and all values are reported in Table 8. All subtests were positively
correlated with each other. Correlations between Matrix Reasoning and Picture Concepts,
Information and Word Reasoning, Information and Vocabulary, and Vocabulary and
Word Reasoning fell in the moderate range, whereas all other correlations fell in the
weak to very weak ranges. Scores from the Vocabulary, Picture Concepts, and Coding
subtests were found to have good reliability. Scores from the Block Design, Information,
Matrix Reasoning, and Word Reasoning subtests were found to have acceptable
reliability. Overall, the average reliability estimate across subtest scores fell in the
acceptable range (Cronbach's α = .75).
Model 3-NormOlder. The model based on the normative structure appears to
provide a poor fit to the data. Initial analyses yielded a non-positive definite matrix for
latent variables, with a negative error variance for the PIQ factor. Gerbing and
Anderson (1987) studied three methods for respecification of initial models with one
negative estimate and small sample sizes. The authors suggested fixing the variance of
the improper parameter to a negligible number. This method is also noted by Brown
(2015). As such, the variance of PIQ was set to 0.001. Selected fit indices are presented
in Table 10. The fit indices of CFI, NNFI, and IFI fall below .95, indicating inadequate
fit. The RMSEA falls above .08, also indicating inadequate fit. Finally, the Satorra-
Bentler Scaled Chi-Square is significant, indicating poor overall fit, χ2SB = 38.79, p =
.007, df = 14. Parameter and standard error estimates are presented in Table B1 in
Appendix B. While some modification indices were greater than 3.84, these
modifications to the model were not theoretically supported.
Model 4-OneFactorOlder. The one-factor model based on the research of
Contreras and Rodriquez (2013) appears to provide a poor fit to the data. Selected fit
indices are presented in Table 10. The fit indices of CFI, NNFI, and IFI fall below .95,
indicating inadequate fit. The RMSEA falls above .08, also indicating inadequate fit.
Finally, the Satorra-Bentler Scaled Chi-Square is significant, indicating poor overall fit,
χ2SB = 37.87, p < .001, df = 14. Parameter and standard error estimates are presented in
Table B2 in Appendix B.
Of the modification indices (MI) greater than 3.84, the MIs suggesting correlated
errors between Coding and Block Design and between Vocabulary and Word Reasoning
were the largest (20.78 and 8.62, respectively) and the most theoretically supported.
Coding and Block Design rely on recognizing and matching shapes. Vocabulary and
Word Reasoning draw on word knowledge and definition skills. The selected fit statistics
for this reduced model are also presented in Table 10. While the IFI and CFI
fall at the threshold of .95, the NNFI falls below it. The RMSEA is equal to .08,
indicating acceptable fit. However, the Satorra-Bentler Scaled Chi-Square is significant,
indicating potentially unacceptable overall fit, χ2SB = 21.534, p = .04, df = 12. Taking into
account all fit evidence, it appears that, overall, correlating these errors provided an
adequate fitting model.
Regarding component fit, parameter and standard error estimates are presented in
Table 9. All completely standardized factor loadings are within range and statistically
significant (z-test statistics > 1.96; see Figure 8). As expected, all path coefficients were
positive (i.e., a positive relationship between indicator and latent variable). The standard
errors are reasonable as they are all smaller than the standard deviations of the indicator
variables. All standardized residuals are acceptable (<|2.58|). Measurement model R2
values were poor (< 0.36) for Information (R2 = 0.21), Block Design (R2 = 0.24), Coding
(R2 = 0.18), and Vocabulary (R2 = 0.18), and moderate for Picture Concepts (R2 = 0.42),
Word Reasoning (R2 = 0.49), and Matrix Reasoning (R2 = 0.51).
Table 9
Parameter and Standard Error Estimates for Model 4-OneFactorOlder with Modifications
Model Parameters    Standardized Estimate    Unstandardized Estimate    Standard Error
Loadings on FSIQ
Picture Concepts .65 2.08* 1.07
Information .46 1.57* 1.41
Block Design .49 1.05* 0.58
Word Reasoning .70 1.26* 0.49
Vocabulary .42 1.07* 1.04
Matrix Reasoning .72 2.88* 1.53
Coding .42 0.86* 0.49
Note. Table values are Maximum Likelihood estimates. FSIQ = Full Scale Intelligence Quotient. *p < .05.
Figure 8. Completely standardized factor loadings for modified Model 4-OneFactorOlder. *p < .05.

Model 5-AltOlder. The alternative second-order model based on the research of
Ramirez and Rosas (2007) appears to provide a poor fit to the data. Similar to Model 3-
NormOlder, initial analyses yielded a non-positive definite matrix for latent
variables, with a negative error variance for the VIQ factor. As such, the variance of
VIQ was set to 0.001 (Gerbing & Anderson, 1987). Selected fit indices are presented in
Table 10. The fit indices of CFI, NNFI, and IFI fall below .95, indicating inadequate fit.
The RMSEA falls above .08, also indicating inadequate fit. Finally, the Satorra-Bentler
Scaled Chi-Square is significant, indicating poor overall fit, χ2SB = 36.83, p < .001, df =
13. Parameter and standard error estimates are presented in Table B3 in Appendix B.
While some modification indices were greater than 3.84, these modifications to the model
were not theoretically supported.
Table 10
Selected Fit Indices for Older Cohort Models
χ2SB df RMSEA (CI90) CFI NNFI IFI
Model 3-NormOlder 38.70 14 0.11(0.08 - 0.15) .88 .83 .89
Note. χ2SB = Satorra-Bentler Chi-Square; df = degrees of freedom; RMSEA = Root Mean Square Error of Approximation; CI90 = 90% Confidence Interval for RMSEA; CFI = Comparative Fit Index; NNFI = Non-Normed Fit Index; IFI = Incremental Fit Index.

Convergent Validity Analyses
Of the 147 children in the younger cohort who were included in the sample for
confirmatory factor analyses, 141 of these children completed the adapted Bayley-III
Cognitive subtest at 24 months. Among the older cohort, 158 children completed the
adapted Bayley-III and the adapted WPPSI-III-SP. Table 11 presents the descriptive
statistics (i.e., means, skew, kurtosis, and coefficient alphas) for these samples. In
addition, 134 children completed the adapted WPPSI-III-SP at 36 and 48 months. Data
for eight children from this sample were deleted due to the child's age being outside the
range of the assessment. As such, the final sample size for examining the direction and
strength of the relationship between scores at each time point was 126.
Cognitive subtest scores from the adapted Bayley-III at 24 months were
significantly and positively correlated (r = .21; p < .05) with scores from the adapted
WPPSI-III-SP at 36 months. This correlation was weak. For the older cohort, cognitive
subtest scores from the adapted Bayley-III at 24 months were significantly and positively
correlated (r = .28; p < .05) with scores from the adapted WPPSI-III-SP at 48 months.
This relationship was also weak. Regarding the cohort of children who completed the
adapted WPPSI-III-SP at 36 and 48 months, scores from these time points were
significantly, positively, and moderately correlated (r = .55; p < .05).
Table 11
Descriptive Statistics and Reliability Estimates for Adapted WPPSI-III-SP Total Scores and Adapted Bayley-III Scores (Younger and Older Cohorts)

            Bayley-III (Younger Cohort)    Bayley-III (Older Cohort)    WPPSI-III-SP (Younger Cohort)    WPPSI-III-SP (Older Cohort)
M 4.66 6.37 26.06 39.11
SD 2.98 2.35 8.64 12.66
Skew 0.45 0.01 -0.19 0.82
Kurtosis -0.57 -0.39 -0.03 1.11
Note. n =141 for younger cohort; n = 158 for older cohort.
Chapter 5: Discussion
The primary purpose of this study was to examine the construct validity of the
WPPSI-III-SP adapted for use among children in rural Peru. Using confirmatory factor
analyses, data from a younger and older cohort of children were fitted to models based on
the normative structure of the WPPSI-III-SP. If the adapted version measured
intelligence in the same manner as the original version, it was hypothesized that these
normative models would provide an adequate fit for the scores from the Peruvian cohorts.
The process of adapting a test for use in a different culture, however, is complicated
and must take into account many cultural, developmental, construct, and adaptation
considerations. For example, the process considers differences as to how the construct in
question is expressed across cultures, differences in language that may affect how items
are interpreted across cultures, and differences in exposure to the assessment's tasks (e.g.,
completing pencil-and-paper tasks) across cultures.
Given these adaptation considerations, the present study also proposed to
determine if another factor structure would provide a better fit to the scores. Based on the
research of others (Contreras & Rodriquez, 2013; Ramirez & Rosas, 2007) conducting
test adaptation research in Colombia and Chile, three additional factor structures were
proposed. For both the younger and older cohorts, a one-factor model was proposed. For
the older cohort, an alternative to the normative structure was proposed in which the
Coding subtest loaded on the Performance factor instead of on the overall Full Scale
Intelligence Quotient factor. As the cultures of Colombia and Chile may more closely
resemble the Peruvian culture (in comparison to Spanish culture), it was hypothesized
that the models based on the research in South America would provide a better fit to the
data than the models based on the normative data. Finally, to examine evidence of
convergent validity the total subtest scores from the younger cohort and from the older
cohort were correlated with scores from another adapted cognitive ability measure
completed by the children at 24 months. It was expected that the scores from these
measures of cognitive ability would be positively correlated. A supplemental analysis
was conducted to examine the direction and strength of the relationship between scores
on the adapted WPPSI-III-SP from children who completed the assessment at 36 and 48
months.
Fit of the Hypothesized Models
Younger cohort. For the younger cohort, both the model based on the normative
data (Model 1-NormYoung) and the one-factor model based on the research of Contreras
and Rodriquez (2013; Model 2-OneFactorYoung) appear to provide an adequate fit to the
data. Furthermore, Model 1-NormYoung appeared to fit the data as well as Model 2-
OneFactorYoung. As such, the first hypothesis was supported as Model 1-NormYoung
provided an adequate fit for the data. The second hypothesis, however, was not supported
based on the confirmatory factor analyses. In comparing the fit of Model 2-
OneFactorYoung and Model 1-NormYoung, the former did not provide a better fit
to the data than the latter.
Regarding component fit, all standardized estimates were significant across
models. Scores from the Information subtest demonstrated the strongest relationship with
VIQ and with the latent global intelligence factor, while Receptive Vocabulary, Block
Design, and Object Assembly were moderately related to the latent factors. It should be
noted that measurement model R2 values were poor for three out of four subtests
(Receptive Vocabulary, Block Design, and Object Assembly), indicating that much of the
variance associated with these subtests was left unexplained. Finally, the second-order
factor loadings were larger than the subtest loadings on the first-order factors,
suggesting the verbal and performance factors were strongly influenced by overall
cognitive ability.
The relationships between subtest scores demonstrated potential support for the
one-factor structure. In looking at evidence for convergent validity, the scores from the
VIQ subtests (Receptive Vocabulary and Information) were moderately correlated.
However, the relationship between scores from the PIQ subtests (Object Assembly and
Block Design) was weak. Furthermore, little evidence was present for discriminant
validity. Information subtest scores demonstrated similarly strong relationships with
scores from the Block Design subtest and scores from the Receptive Vocabulary subtest.
Scores from the Block Design subtest were more strongly related to scores from the VIQ
subtests as compared to Object Assembly subtest scores. In other words, scores did not
demonstrate consistently stronger relationships among subtest scores from the same
factor (convergent validity) and consistently weaker relationships among subtest scores
from differing factors (discriminant validity). As such, this pattern may provide evidence
of the superiority of a one-factor structure with an overall cognitive ability factor.
Reliability. In exploring the validity of an assessment it is imperative to also look
at the test's scores' reliability, as reliability is the foundation of validity (Sattler, 2008).
The reliability estimate of Object Assembly subtest scores is questionable. As noted
earlier, assessments of young children often struggle to produce highly reliable scores
(Alfonso & Flanagan, 1999; Sattler, 2008). In addition, the test-taking behavior
of young children adds potential error and inconsistency to scores (Frisby, 1999b). Young
children have shorter attention spans, less expressive language, and less exposure to the
Note. WPPSI = Wechsler Preschool and Primary Scale of Intelligence. VIQ = Verbal Intelligence Quotient; PIQ = Performance Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient; GLC = General Language Composite; PSQ = Processing Speed Quotient. Check marks indicate the inclusion of a specific subtest into the composite score. Parentheses indicate a supplemental subtest. Reviewed tests are based on US normative sample.
Appendix B
Parameter Estimates and Standard Errors for Models 3, 4, and 5
Table B1
Parameter and Standard Error Estimates for Model 3-NormOlder
Model Parameters    Standardized Estimate    Unstandardized Estimate    Standard Error
Loadings on VIQ
Vocabulary .56 1.38* 0.81
Information .39 1.27a 1.45
Word Reasoning .81 1.46* 0.38
Loadings on PIQ
Block Design .56 1.19* 0.54
Matrix Reasoning .72 2.86* 1.28
Picture Concepts .61 1.94* 1.10
Loadings on FSIQ
VIQ .79 -- 0.32
PIQ -- -- --
Coding .52 1.06* 0.44
Note. Table values are Maximum Likelihood estimates. VIQ = Verbal Intelligence Quotient; PIQ = Performance Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient. *p < .05. a fixed factor loading.
Table B2

Parameter and Standard Error Estimates for Model 4-OneFactorOlder

Model Parameters    Standardized Estimate    Unstandardized Estimate    Standard Error
Loadings on FSIQ
Picture Concepts .61 1.96* 0.29
Information .45 1.56* 1.43
Block Design .54 1.16* 0.21
Word Reasoning .72 1.30* 0.42
Vocabulary .51 1.28* 0.87
Matrix Reasoning .68 1.07* 0.99
Coding .49 1.00* 0.45
Note. Table values are Maximum Likelihood estimates. FSIQ = Full Scale Intelligence Quotient. *p < .05.
Table B3
Parameter and Standard Error Estimates for Model 5-AltOlder
Model Parameters    Standardized Estimate    Unstandardized Estimate    Standard Error
Loadings on VIQ
Vocabulary .56 0.87* 0.80
Information .46 0.97* 1.42
Word Reasoning .80 0.89* 0.36
Loadings on PIQ
Block Design .56 1.20* 0.53
Matrix Reasoning .72 2.88* 1.25
Picture Concepts .62 1.97a 1.09
Coding .52 1.06* 0.44
Loadings on FSIQ
VIQ -- -- --
PIQ .51 -- 0.22
Note. Table values are Maximum Likelihood estimates. VIQ = Verbal Intelligence Quotient; PIQ = Performance Intelligence Quotient; FSIQ = Full Scale Intelligence Quotient. *p < .05. a fixed factor loading.
VITA
Abigail E. Crimmins 96 Sunnyside Dr. Elmira, NY 14905 [email protected]
607-215-3252
Education 2011 – present The Pennsylvania State University M.S. (August 2013), Ph.D. (exp. August, 2016) School Psychology 2005 – 2009 Hamilton College B.A. (May 2009; GPA – 3.85) Psychology and Hispanic Studies
Research Experience • Research Assistant, Penn State Department of Special Education, Summer 2014 • Research Assistant, LEGACY Project, Summer 2013 • Predissertation Research Project, Student-Teacher Relationships among Children with Autism:
Contribution of Students’ Social Skills, August 2013 • Honors Thesis, The Use of Thought Suppression to Cope with Ego-Threat among Those with
Fragile Self-Esteem, May 2009 • Research Assistant, Department of Psychology, Hamilton College, Summer 2007
Clinical Experience • Doctoral School Psychology Intern, Letchworth Central School District, 2015 – present • CEDAR Clinic Mobile Clinician, Juniata County School District, Spring 2015 • CEDAR Clinic Student Supervisor, Penn State CEDAR Clinic, 2014 – 2015 • School Psychology Practicum Intern, State College Area School District, 2013 - 2014 • School Psychology Student Clinician, Penn State CEDAR Clinic, 2011 – 2014
Teaching Experience • Graduate Teaching Assistant, Human Development and Family Studies, 2011 – 2014 • Clinical Graduate Assistant, School Psychology Program, 2012 – 2014 • Statistics Teaching Assistant, Psychology Department, 2007 – 2009
Work Experience • Respite Care Specialist, 2011 – present • Level II Teacher, New England Center for Children, 2009 – 2011 • Undergraduate Counselor Intern for Children with ADHD, Center for Children and Families, 2008
Publications and Presentations Woika, S. A., & Crimmins, A. E. (2014, October). Practical Guidance for Supervisors of School Psychologists. Presentation at the meeting of the Association of School Psychologists of Pennsylvania, State College, PA. Crimmins, A. E. (2014, February). Student-Teacher Relationships among Children with Autism: Contribution of Students’ Social Skills. Poster presented at the meeting of the National Association of School Psychologists, Washington DC. Clark, T. C., Crimmins, A. E., & Leposa, B. (2012, February). Reading first, or is it? Paper presented at the meeting of the National Association of School Psychologists, Philadelphia, PA. Borton, J. L. S., Crimmins, A. E., Ashby, R. S., & Ruddiman, J. F. (2012). How do individuals with fragile self-esteem cope with intrusive thoughts following ego threat? Self and Identity, 11, 16 – 35.
Awards and Honors • Membership Award, Pennsylvania Psychologists Association, June 2013