Address for correspondence: Yo In'nami, Department of Humanities, Management Science, and Engineering, Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku, Toyohashi, Aichi 441-8580, Japan; email: [email protected]

Language Testing 2009 26 (2) 219–244

    The Author(s), 2009. Reprints and Permissions: http://www.sagepub.co.uk/journalsPermissions.nav

DOI: 10.1177/0265532208101006

A meta-analysis of test format effects on reading and listening test performance: Focus on multiple-choice and open-ended formats

Yo In'nami, Toyohashi University of Technology, Japan
Rie Koizumi, Tokiwa University, Japan

A meta-analysis was conducted on the effects of multiple-choice and open-ended formats on L1 reading, L2 reading, and L2 listening test performance. Fifty-six data sources located in an extensive search of the literature were the basis for the estimates of the mean effect sizes of test format effects. The results using the mixed effects model of meta-analysis indicate that multiple-choice formats are easier than open-ended formats in L1 reading and L2 listening, with the degree of format effect ranging from small to large in L1 reading and medium to large in L2 listening. Overall, format effects in L2 reading are not found, although multiple-choice formats are found to be easier than open-ended formats when any one of the following four conditions is met: the studies involve between-subjects designs, random assignment, stem-equivalent items, or learners with a high L2 proficiency level. Format effects favoring multiple-choice formats across the three domains are consistently observed when studies employ between-subjects designs, random assignment, or stem-equivalent items.

Keywords: meta-analysis, test format, test method, multiple-choice, open-ended, reading test, listening test

Among the many existing variables that are considered to affect language test performance, one central issue is the effect of test formats on test performance (e.g., Alderson, 2000; Bachman & Palmer, 1996; Brantmeier, 2005; Buck, 2001). Although this topic has been researched in the fields of language learning and educational measurement, it appears that previous studies have been limited by the narrative approach to accumulating their findings. This approach has been criticized because (a) it has less objectivity and replicability due to individual differences among reviewers, (b) it mainly builds on conclusions drawn by the authors of original studies without reanalysis or reinterpretation, and (c) it has difficulty handling the rich volume of information extracted from studies (e.g., Cook et al., 1992; Norris & Ortega, 2000). To avoid these limitations, the current study uses meta-analysis to quantitatively synthesize format effects on reading and listening test performance.

    I Literature review

    1 Format effects

A variety of test formats or methods have been employed in language testing, including cloze, c-test, gap-filling, matching, multiple-choice, open-ended (or short-answer), ordering, recall, summary, and summary gap-filling (e.g., Alderson, 2000; Buck, 2001; Kobayashi, 2002). Since there is no perfect test format that functions well in every situation, researchers must understand the characteristics of each format and make the best selection according to which one(s) most appropriately serve(s) the purpose of a test in each context. While the literature on test format effects is enormous, the literature review below focuses on comparing multiple-choice and open-ended formats in reading and listening. This has been one of the most investigated comparisons; thus, it provides a solid basis for quantitative synthesis. Based on Davies et al. (1999), these two formats are defined as follows. Multiple-choice is a format with a stem and three or more options from which learners are required to select one. An open-ended format refers to a question that requires learners to formulate their own answers with several words or phrases. Table 1 provides examples from a reading test. In these examples, the same correct responses are required across the formats that share the same or a similar stem (i.e., stem equivalent).

The previous literature from a quantitative perspective has mainly focused on two issues: (a) differences in construct or trait measured using multiple-choice and open-ended formats and (b) differences between test scores in multiple-choice and open-ended formats (i.e., the relative difficulty of test formats). Among a wide range of studies investigating the former issue (e.g., Bennett & Ward, 1993; Buck, 2001; Campbell, 1999; Cohen, 1998), highly important and interesting in terms of its comprehensive coverage of previous studies is Rodriguez (2003), which synthesized 56 sets of correlation coefficients between stem-equivalent multiple-choice and open-ended formats based on 29 studies from a variety of disciplines. The results revealed a correlation between them approaching unity (0.95 [95% confidence interval: 0.91, 0.97]). This appears to suggest that multiple-choice and open-ended formats measure a very similar construct when they use the same stem.

In the case of the differences between the test scores in both test formats, most studies have used the t-test or analysis of variance to compare the mean scores in L1 reading, L2 reading, and L2 listening. Since format effects vary by domain (Traub, 1993), the previous findings are summarized below, according to each domain. It should also be noted that the summary was based on a narrative or traditional review of the literature, which was somewhat problematic (see section I.2); however, this approach was employed to illustrate how different conclusions were drawn according to the way the literature was summarized (i.e., narrative review vs. quantitative synthesis).

The literature in L1 reading has provided mixed results. Some studies have shown that multiple-choice formats are easier than open-ended formats (e.g., Arthur et al., 2002; Davey, 1987), whereas some other studies have found no statistical difference between the two formats (e.g., Elinor, 1997; Pressley et al., 1990). In L2 reading, most studies have shown that multiple-choice formats are easier than open-ended formats (e.g., Shohamy, 1984; Wolf, 1991), which contradicts Elinor (1997) and Trujillo (2006), in which the two formats were considered to be of similar difficulty. All studies have shown that in L2 listening, multiple-choice formats are easier than open-ended formats (e.g., Berne, 1992; In'nami, 2006; Teng, 1999). What became clear through the review of L1 reading, L2 reading, and L2 listening was that in most studies, multiple-choice formats were easier than open-ended formats; however, in some L1 and L2 reading studies, no significant difference was found between the two formats.

Table 1 Examples of parallel formats

Multiple-choice version:
Based on the passage, what is the most probable reason Howe and Mann encouraged Dorothea Dix to push for reform?
A. They wanted to share in her fame.
B. They did not fully understand how much danger she was in.
C. They were also concerned about people with mental illness.*
D. They thought it would be a good way for her to be elected to public office.

Open-ended version:
Based on the passage, what is the most probable reason Howe and Mann encouraged Dorothea Dix to push for reform?
(Example answers)
They were (also) concerned about people with mental illness.
They understood the plight of people with mental illness.
They believed in what Dorothea was doing.
They truly thought Dorothea could make a difference.

Note: Adopted from Campbell (1999, pp. 223, 226). * = answer keys in the multiple-choice version. Underlined letters in the open-ended version indicate examples of correct answers.

Regarding the two issues of multiple-choice and open-ended format effects (i.e., differences in construct measured and test scores), the former was quantitatively summarized in Rodriguez (2003), but the latter does not appear to have been quantitatively synthesized.

    2 Research synthesis approach

Although previous studies have individually provided valuable information on format effects, all of the studies that examined the test score differences between multiple-choice and open-ended formats have used a narrative approach to reviewing the literature. However, such a method precludes integrative and systematic reviews for three reasons (e.g., Cooper & Hedges, 1994; Norris & Ortega, 2000). First, a narrative approach appears less objective and less replicable in collecting and summarizing previous studies, since the studies included in the integration of the findings often depend on the reviewer (e.g., Glass et al., 1981; Light & Pillemer, 1984). Not surprisingly, different reviewers can draw different conclusions about the findings in question.

Second, even if all of the relevant studies are collected and reviewed, reviewers mostly take the findings of previous studies at face value and directly infer their own conclusions without reanalysis or reinterpretation (Norris & Ortega, 2000). However, since all studies are limited in one way or another (e.g., sample size or research design), it should not be taken for granted that the conclusion drawn in each study is trustworthy; therefore, the interpretation of each study must always be carefully examined.

Third, even if each study was reanalyzed and reinterpreted, it would become very difficult to synthesize previous studies because reviewers easily lose track of the information collected as the number of studies or amount of information regarding the relationships between study findings and study characteristics increases (Lipsey & Wilson, 2001).

One way to address these three problems with the narrative review is to use meta-analysis, which is a research method that summarizes a set of empirical data across studies. To contend with the first problem, meta-analysis reports the detailed processes of data retrieval, inclusion criteria for analysis, and data summarization, in order for the reader to evaluate the appropriateness of each step, and if necessary, replicate the entire step (e.g., Cooper & Hedges, 1994; Lipsey & Wilson, 2001). In response to the second and third problems, meta-analysis reanalyzes and reinterprets previous studies by taking the study characteristics into account. A great deal of information from the study findings and characteristics is systematically coded and used to conduct a detailed analysis of how the study findings are explained by the study features.

Reflecting the advantages discussed above, meta-analysis has recently been used in second language acquisition studies (e.g., Blok, 1999; Norris & Ortega, 2000, 2006), but it has rarely been used in language testing (except for Ross, 1998). Thus, the current study conducts meta-analysis on the effects of two formats (multiple-choice and open-ended) on reading and listening test performance to combine and interpret previous studies in a meaningful way. Besides the use of meta-analysis, this study expands previous studies by targeting a wide range of areas, including L1 reading, L2 reading, and L2 listening. Although test format effects are claimed to vary by domain (Traub, 1993), a comparison of meta-analytic findings across domains would help clarify the specificity and generalizability of test format effects.

The research questions, investigated separately for L1 reading, L2 reading, and L2 listening studies, are as follows: Which are easier, multiple-choice or open-ended formats? To what degree are they easier? Are there any variables related to format effects?

    II Method

    1 Data collection

a Data identification: In order to identify as many relevant studies as possible to obtain the most comprehensive synthesis of format effects, three approaches were used to perform the literature search. First, we conducted literature retrieval through computer searches of the Educational Resources Information Center, FirstSearch, Linguistics and Language Behavior Abstracts, PsycINFO, ScienceDirect, and Web of Science. Since a single keyword retrieves a large number of irrelevant studies, combinations were used: test method, test format, task format, task type, response format, response type, response mode, assessment method, assessment format, assessment type, evaluation method, evaluation format, evaluation type, question format, question type, item format, item type, multiple-choice, selected response, open-ended, short answer, free response, constructed response, comparison, difference, and difficulty. This list of keywords was constructed by the authors based on the keywords and synonyms retrieved from the thesauruses supplied in databases, books and articles reviewed, the authors' experiences, and feedback from colleagues. Abstract, title, and article keyword searches were used. A date range restriction was not imposed.

Second, books and journals in language testing, first and second language acquisition, and educational measurement were reviewed. The books in language testing mainly included Alderson, Clapham, and Wall (1995), Bachman (1990), Bachman and Palmer (1996), Brown (2005), Clapham and Corson (1997), Cohen (1994), Davies et al. (1999), Davies and Elder (2004), Fulcher and Davidson (2007), ILTA Language Testing Bibliography (1999), the Studies in Language Testing series published by Cambridge University Press, and Weir (2005). The books in first and second language acquisition mainly included Brown (2006), Cohen (1998), Doughty and Long (2003), Ellis (1994), Flowerdew and Miller (2005), Grabe and Stoller (2002), Hinkel (2005), Kaplan (2002), Kintsch (1998), Richards and Schmidt (2002), Rost (2002), and Urquhart and Weir (1998). The books in educational measurement mainly included Anastasi and Urbina (1995), Bennett and Ward (1993), Brennan (2006), Cronbach (1990), Downing and Haladyna (2006), and Haladyna (2004). Different editions of the same book were also checked. The journals included Annual Review of Applied Linguistics, Applied Linguistics, ELT Journal, Language Assessment Quarterly, Language Learning, Language Teaching, Language Testing, Modern Language Journal, Reading Research Quarterly, RELC Journal, Second Language Research, Studies in Second Language Acquisition, System, and TESOL Quarterly; Applied Measurement in Education, Applied Psychological Measurement, Educational and Psychological Measurement, Educational Measurement: Issues and Practice, Journal of Educational and Behavioral Statistics, and Journal of Educational Measurement.

Third, relevant studies were searched through communication with other researchers. In each of the three approaches, the reference list of every empirical, theoretical, and review paper and chapter, both published and unpublished, was further scrutinized for additional relevant materials.

b Criteria for the inclusion of a study: The literature search retrieved approximately 10,000 studies. Their titles, abstracts, and study descriptors were inspected if (a) the studies used multiple-choice and open-ended formats (see section I.1 for the definitions) and (b) the subject matter of the test was language comprehension in a certain length of text in L1/L2 reading or L1/L2 listening. A sample of 0.2% (n = 20) of the 10,000 studies was independently examined by both authors. The agreement percentage was 90, and the kappa coefficient was 0.80. Disagreement was resolved through discussion. The remaining studies were examined by the first author.
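The screening agreement statistics reported here (percentage agreement and the kappa coefficient) can be reproduced from a simple cross-classification of the two authors' include/exclude decisions. The sketch below is illustrative only, not the authors' procedure or code; the counts are hypothetical but consistent with 90% agreement and kappa = 0.80 on a sample of 20 studies.

def cohen_kappa(table):
    """Cohen's kappa for a square rater-by-rater classification table.
    table[i][j] = number of studies rater 1 placed in category i and rater 2 in category j."""
    total = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / total
    # Chance agreement expected from the two raters' marginal proportions
    expected = sum(
        (sum(table[i]) / total) * (sum(row[i] for row in table) / total)
        for i in range(len(table))
    )
    return (observed - expected) / (1 - expected)

# Hypothetical screening decisions for 20 studies: 9 include/include, 9 exclude/exclude, 2 disagreements
print(cohen_kappa([[9, 1], [1, 9]]))  # 0.80, with 18/20 = 90% raw agreement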

These processes reduced the number of retrieved studies to 237. They were further inspected if they met all of the following criteria: (a) the information required to calculate effect sizes was reported (i.e., means, SDs, n, r, t, or F values); and (b) the full score was the same across formats or the percentage of correct responses could be calculated. When necessary, every effort was made to contact the authors to request further details. A sample of 8.4% (n = 20) of the 237 studies was separately examined by both authors. The agreement percentage was 80, and the kappa coefficient was 0.60. Disagreement was discussed and resolved. The first author investigated the remaining studies. As a result, 37 studies were retained and included in the meta-analysis.

c Moderator variables coded for each study: In order to examine the relation between the format effects in a study and the variables affecting those format effects (moderator variables), the 37 studies retained for the current meta-analysis were inspected, and the moderator variables reported in at least two studies (the minimum number of studies required for meta-analysis) were coded. This resulted in 15 coded variables, listed as (a) to (o) below.


a) Between-subjects or within-subjects designs (counterbalanced or non-counterbalanced): Studies with between-subjects designs tended to find multiple-choice formats easier than open-ended formats (e.g., Shohamy, 1984; Wolf, 1991), whereas with within-subjects studies, counterbalanced designs found the two formats to be of similar difficulty (Trujillo, 2006). Maxwell and Delaney (1990) suggested that due to learning effects, learners assigned to one condition (a multiple-choice followed by open-ended format) were more likely to perform better on both formats than those assigned to the opposite condition (an open-ended followed by multiple-choice format), and this may obscure differences in test format effects. Thus, it appears that format effects are more clearly observable in studies with between-subjects designs than those with within-subjects designs, and more so in studies that administer open-ended formats first than studies that administer multiple-choice formats first.

b) Random or non-random assignment: A random assignment of learners to different treatments can exclude factors that are irrelevant to treatments and enable the interpretation of the differences observed between treatment groups as effects of the treatments (Shadish et al., 2002). Thus, studies with random assignment were predicted to more clearly show format effects than were studies with non-random assignment.

c) Stem or non-stem equivalency: Sharing the same or similar stem as well as the same correct response across formats seems to be crucial in investigating format effects because it avoids confounding format effects with other variables (Rodriguez, 2003). Thus, studies with stem equivalency were predicted to more clearly show format effects than were studies with non-stem equivalency.

d) Access or no access to the text when answering: Davey and LaSasso (1984) reported that when learners were allowed to consult the text (i.e., access to the text), there was no significant difference between the test scores in multiple-choice and open-ended formats. In contrast, when learners were not allowed to refer to the text (i.e., no access to the text), multiple-choice formats were easier than open-ended formats. This suggests that an opportunity to consult the text in open-ended formats may increase scores and reduce format differences. In listening, whether learners were permitted to take notes while listening to the text and refer to them while answering questions was coded for access to the text.

e) Text explicit or implicit questions: Kobayashi (2004) found that learners performed equally in both multiple-choice and open-ended formats when the questions were explicit in the text; however, they performed better on multiple-choice formats when the questions were implicit.

f) The number of options in a multiple-choice format: Since Rodriguez (2005) showed that reductions in the number of multiple-choice options tended to make multiple-choice formats slightly easier, reductions in the number of multiple-choice options were hypothesized to widen mean score differences between multiple-choice and open-ended formats, thereby increasing the degree of format effects.

g) Learners' L2 proficiency level: The definition of this variable is often inconsistently reported and not easily comparable across studies. However, fairly relevant information often reported, and thus coded for the current meta-analysis, was the L2 instruction time that learners had received. Based on Norris and Ortega (2000), learners' L2 proficiency level was defined as follows: A middle level of L2 proficiency referred to learners with three to four semesters of L2 study, whereas a high level of L2 proficiency referred to learners with five or more semesters of L2 study. Based on Shohamy (1984), high-proficiency learners were hypothesized to be less affected by format differences than were low-proficiency learners.

h) Learners' age (primary, secondary, or adult); (i) learners' L1; and (j) learners' L2: These are among the test taker characteristics whose plausible effects on test performance must be considered (Bachman & Palmer, 1996); thus, they were coded.

k) Reliability of tests with a multiple-choice format; (l) reliability of tests with an open-ended format; and (m) reliability of scoring in open-ended formats: The high reliabilities of tests and scoring suggest that the tests consistently measured something relative to the errors. Thus, studies with high reliabilities of tests and scoring were predicted to more clearly show format effects than were studies with low reliabilities of tests and scoring. Concerning the reliability of tests with a multiple-choice format, another possibility was that if the distractors did not function well, the reliability may be depressed, and the scores may also be increased, which leads to the format effect.

n) Percentage correct in a multiple-choice format and (o) percentage correct in an open-ended format: Although it appears that no studies have investigated this issue, a positive relationship might exist between the percentages of correct responses in multiple-choice formats and format effects. The reason for this is that since multiple-choice formats usually appear to yield higher scores than do open-ended formats, if multiple-choice formats are designed to be more difficult, they might produce lower scores, which would be similar to the scores in open-ended formats. This would result in a decrease in the degree of format effects. On the other hand, a negative relationship could be predicted between the percentage correct in open-ended formats and format effects. This is because since open-ended formats usually appear to yield lower scores than do multiple-choice formats, if open-ended formats are designed to be easier, they might produce higher scores, which would be similar to the scores in multiple-choice formats. The result would be a decrease in the format effects.

The coding for all the variables was independently conducted by both authors based on a sample of 27% (n = 10) of the 37 studies. The agreement percentage between the two authors was 90, and the kappa coefficient was 0.80. Disagreement was discussed and resolved. Coding for the remaining studies was conducted by the first author. When necessary, further details were solicited from the authors of each study. When the reliability of a test was not reported, it was estimated using Kuder-Richardson formula 21 (KR21). The effect size data associated with each variable of interest was also coded.
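Where a study reported no reliability coefficient, KR21 needs only the number of items, the test mean, and the test variance, so it can be estimated from descriptive statistics alone. A minimal sketch of that estimate, with hypothetical values rather than figures from any of the 37 studies:

def kr21(n_items, mean, variance):
    """Kuder-Richardson formula 21 reliability estimate for a test of dichotomous items."""
    k = n_items
    return (k / (k - 1.0)) * (1.0 - mean * (k - mean) / (k * variance))

# Hypothetical 30-item test with a mean of 21 and an SD of 5
print(round(kr21(30, 21.0, 5.0 ** 2), 2))  # about 0.77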

    2 Meta-analysis

a Effect size for individual studies: Effect size is an estimate of the magnitude of the observed effect or relationship, and can be divided into two types (Kline, 2004): (a) d type, which represents the standardized mean difference between groups, and (b) r type, which indicates the proportion of variance explained by an effect of interest. The former type of effect size was used because this study focused on the mean score difference between multiple-choice and open-ended tests. Among d type, Hedges' g was selected because it uses a pooled SD computed from two groups and tends to estimate the population SD accurately, compared with Cohen's d, which uses the SD of only one group (e.g., Cooper & Hedges, 1994). According to Morris and DeShon (2002), there are two types of effect size g: g for a between-subjects design and g for a within-subjects design. They are defined as

g_between-subjects = (Mean_1 − Mean_2) / s_p,

where s_p (the pooled or combined SD) = sqrt{[(n_1 − 1)s_1^2 + (n_2 − 1)s_2^2] / (n_1 + n_2 − 2)};

g_within-subjects = (Mean_1 − Mean_2) / SD_D,

where SD_D (the SD of the difference scores) = sqrt{SD_group1^2 + SD_group2^2 − 2(r)(SD_group1)(SD_group2)}.

One difficulty in calculating SD_D was that the correlation between groups 1 and 2 was not always reported in all studies, demanding that we impute values for the missing correlations. The mean of correlations was calculated by Fisher's Z transformation from six of the within-subjects studies included in the current meta-analysis. The derived value was 0.74 (Pearson's correlations before Fisher's Z transformation ranged from 0.34 to 0.93 [mean = 0.68, SD = 0.21]), and was used to replace a missing correlation. Nevertheless, mean imputation is considered to leave a centralizing artifact and result in the artificial homogeneity of effect sizes (Cooper & Hedges, 1994). However, other alternatives to mean imputation such as Buck's method and maximum likelihood methods (Pigott, 1994) assume a large set of data sources; therefore, the mean imputation method was considered to be more appropriate in the current study.

Once the effect sizes from each study were computed, those from the within-subjects design were converted into the between-subjects metric (Morris & DeShon, 2002), using

g_between-subjects = g_within-subjects × sqrt{2(1 − ρ)},

where ρ is the population correlation between the repeated measures. Since an aggregate of correlations across within-subjects studies is the best estimate of ρ (Morris & DeShon), the average correlation of 0.74, calculated above, was also used as ρ. The effect size g_between-subjects was hereafter denoted simply as g. Then, the sampling variance of the effect size was computed according to the study design and effect size metric based on Morris and DeShon (2002).
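The effect size computations described above reduce to a few lines of code. The sketch below simply restates the reconstructed formulas: Hedges' g with a pooled SD for between-subjects designs, g standardized by the SD of difference scores for within-subjects designs, Fisher's Z averaging of correlations (the value imputed for missing correlations), and the Morris and DeShon (2002) conversion of a within-subjects g to the between-subjects metric. It is an illustration of the formulas, not the authors' Excel worksheet, and the example figures are hypothetical.

import math

def g_between(mean1, mean2, sd1, sd2, n1, n2):
    """Hedges' g for a between-subjects design, using the pooled SD of the two groups."""
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / sp

def g_within(mean1, mean2, sd1, sd2, r):
    """g for a within-subjects design, standardized by the SD of the difference scores."""
    sd_d = math.sqrt(sd1**2 + sd2**2 - 2 * r * sd1 * sd2)
    return (mean1 - mean2) / sd_d

def mean_correlation(rs):
    """Average correlations via Fisher's Z transformation (used here to impute a missing r)."""
    zs = [0.5 * math.log((1 + r) / (1 - r)) for r in rs]
    z_bar = sum(zs) / len(zs)
    return (math.exp(2 * z_bar) - 1) / (math.exp(2 * z_bar) + 1)

def within_to_between(g_w, rho):
    """Convert a within-subjects g to the between-subjects metric (Morris & DeShon, 2002)."""
    return g_w * math.sqrt(2 * (1 - rho))

# Hypothetical between-subjects study: two groups of 40 learners
print(round(g_between(24, 20, 5, 6, 40, 40), 2))
# Hypothetical correlations; the paper averaged six such values and obtained 0.74
print(round(mean_correlation([0.34, 0.68, 0.93]), 2))
# Hypothetical within-subjects study: multiple-choice M = 24, SD = 5; open-ended M = 20, SD = 6; r = .74
g_w = g_within(24, 20, 5, 6, 0.74)
print(round(within_to_between(g_w, 0.74), 2))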

When multiple effect sizes were available in a single study, a weighted mean of the means was calculated, along with a weighted mean of the SDs, which was computed using a formula for a pooled SD, described above (Lipsey & Wilson, 2001). Thus, each study contributed a single effect size. This weighted-average procedure for means and SDs is widely used to reduce the bias caused by dependency between the effect sizes in one study (e.g., Cooper & Hedges, 1994; Lipsey & Wilson, 2001). However, this procedure might mix up some potentially important variables that the primary researcher independently investigated; therefore, information on moderator variables could be lost. Among such variables, 15 were coded (see section II.1.c) and each effect size was calculated along with an overall single effect size from each study. Although such coding allowed multiple effect sizes from one study to be entered into the meta-analysis, it was conducted in order to obtain the most comprehensive picture of the variables to explain the format effects across studies. In the end, the current meta-analysis included 56 data sources from 37 studies, all of which reported format effects based on mean scores. The full reference of studies can be obtained from the authors.
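As a rough illustration of this within-study averaging step, the sketch below collapses several outcomes from one study into a single mean and a pooled SD before g is computed. The sample-size weighting shown is an assumption; the paper cites the Lipsey and Wilson (2001) pooled-SD formula but does not spell out the weights, and the figures are hypothetical.

import math

def combine_outcomes(means, sds, ns):
    """Collapse several outcomes reported by one study into one mean and one pooled SD.
    means, sds, and ns are parallel lists, one entry per outcome."""
    total_n = sum(ns)
    mean = sum(n * m for n, m in zip(ns, means)) / total_n          # sample-size-weighted mean
    pooled_var = sum((n - 1) * sd**2 for n, sd in zip(ns, sds)) / (total_n - len(ns))
    return mean, math.sqrt(pooled_var)

# Hypothetical study reporting two reading passages scored separately
print(combine_outcomes([22.0, 18.0], [4.0, 5.0], [40, 40]))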

b Effect size aggregation: The effect sizes from individual studies were combined to attain an aggregate effect size for each format effect using the mixed effects model. Analysis by another meta-analysis model, that is, the fixed effects model, demonstrated a wide variability among the effect sizes, which suggested the necessity of using the mixed effects model. The mixed effects model assumes that effect sizes vary across studies partly due to study characteristics (i.e., moderator variables) and partly due to other randomly distributed unaccountable sources (Lipsey & Wilson, 2001). The model also assumes that each study in a meta-analysis is a randomly sampled observation drawn from a population of studies; thus, the mixed effects model not only permits the drawing of inferences about studies that have already been conducted but also generalizes to the population of studies from which these studies were sampled (Raudenbush, 1994; Shadish & Haddock, 1994; see Lipsey & Wilson, 2001, for the equations in the mixed effects model).
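The aggregation itself was run in Comprehensive Meta-Analysis, and the mixed-effects equations are given in Lipsey and Wilson (2001). As a rough stand-in for that machinery, the sketch below shows the closely related random-effects calculation: fixed-effect (inverse-variance) weights give a method-of-moments estimate of the between-studies variance, the weights are re-formed with that variance added, and the weighted mean g and its 95% confidence interval follow. The input values are hypothetical.

import math

def random_effects_mean(gs, vs):
    """gs: study-level effect sizes; vs: their sampling variances.
    Returns the random-effects mean g and its 95% confidence interval."""
    w = [1.0 / v for v in vs]                                   # fixed-effect weights
    g_fixed = sum(wi * gi for wi, gi in zip(w, gs)) / sum(w)
    q = sum(wi * (gi - g_fixed) ** 2 for wi, gi in zip(w, gs))  # homogeneity statistic Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(gs) - 1)) / c)                    # method-of-moments between-studies variance
    w_star = [1.0 / (v + tau2) for v in vs]                     # random-effects weights
    g_bar = sum(wi * gi for wi, gi in zip(w_star, gs)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return g_bar, (g_bar - 1.96 * se, g_bar + 1.96 * se)

# Hypothetical set of five study-level effect sizes and sampling variances
print(random_effects_mean([0.3, 0.9, 0.5, 1.2, 0.2], [0.04, 0.06, 0.05, 0.08, 0.03]))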

Moderator variable analysis was conducted by grouping studies according to the moderator variables or by performing a regression analysis. Moderator variables (a) to (j) were categorical, whereas moderator variables (k) to (o) were continuous, and an artificial categorization of the latter variables was not appropriate (e.g., Hunter & Schmidt, 2004). These continuous variables were analyzed using a weighted single variable linear regression based on a method of moments (Raudenbush, 1994), with the standardized mean score difference (g) between the two formats as a dependent variable and each of the five moderator variables ([k] to [o]) as a separate independent variable. Since moderator variables were often highly correlated (VIFs [variance inflation factors] ranged from 2.04 to 50.00 [mean = 21.84, SD = 19.75]), multiple regression could not be conducted. In addition, since the number of studies was small (k = 4 to 21), especially in L2 listening (k = 4 to 5), in which the regression assumptions were difficult to test, the results from the regression analysis were considered tentative.
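As a simplified picture of what a single-moderator meta-regression does, the sketch below regresses g on one continuous moderator with weighted least squares. It assumes the weights are the inverse variances of the study effect sizes and, unlike Wilson's METAREG.SPS macro, does not re-estimate the residual between-studies variance by the method of moments; it is meant only to illustrate where B, Z, and p values of the kind shown in Table 3 come from, and all inputs are hypothetical.

import math

def weighted_meta_regression(x, g, w):
    """Single-moderator weighted regression g_i = b0 + b1 * x_i with weights w_i.
    Returns the slope b1, its standard error, the Z statistic, and a two-sided p value."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    g_bar = sum(wi * gi for wi, gi in zip(w, g)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxg = sum(wi * (xi - x_bar) * (gi - g_bar) for wi, xi, gi in zip(w, x, g))
    b1 = sxg / sxx
    se = math.sqrt(1.0 / sxx)          # slope SE when each w_i is an inverse variance
    z = b1 / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return b1, se, z, p

# Hypothetical: does multiple-choice percentage correct predict g across five data sources?
print(weighted_meta_regression([0.55, 0.62, 0.70, 0.76, 0.81],
                               [1.6, 1.3, 1.0, 0.8, 0.6],
                               [25.0, 20.0, 22.0, 18.0, 15.0]))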

A 95% confidence interval around the mean effect size was calculated. A result with a small confidence interval is more trustworthy than a result with a large confidence interval. If a confidence interval does not include zero, this suggests a statistical difference between the observed effect and the null hypothesis of no effect, which, in the current study, means that there was a significant difference between the mean scores of the two formats. If one confidence interval does not overlap with another, this suggests a statistical difference between the two observed effects.

The calculated effect size g was interpreted as follows. First, if g = 0.50, the mean of one format is half a pooled SD higher than the mean of another format. To simplify, assume a test with an SD of 10 and two groups of learners. If g = 0.50, the mean of one format (e.g., multiple-choice) is five points higher than that of another format (e.g., open-ended). Second, g and its confidence interval were interpreted based on Cohen's (1988) guideline of |0.20| as small, |0.50| as medium, and |0.80| as large effects, although Cohen warned against its use because of its arbitrariness and recommended a context-specific interpretation of magnitude differences. In the current study, a positive effect size with a 95% confidence interval not encompassing zero indicated that a multiple-choice format was easier than an open-ended format. The effect size for individual studies was calculated using Excel, effect size aggregation was conducted using Comprehensive Meta-Analysis (Borenstein et al., 2005), and a weighted single variable linear regression was conducted using an SPSS macro (METAREG.SPS) written by Wilson (2001).

    III Results and discussion

    1 Format effects in L1 reading

The results are shown in Table 2. The overall effect size was positive with a confidence interval greater than zero, ranging from 0.24 to 1.06 (g = 0.65 [0.24, 1.06]). This suggested that multiple-choice formats were easier than open-ended formats and that the degree of difference ranged from small to large. Format effects were particularly observed across conditions such as between-subjects designs (g = 0.64 [0.20, 1.07]), within-subjects designs (g = 0.66 [0.10, 1.22] for both counterbalanced and non-counterbalanced), and random assignment (g = 0.44 [0.13, 0.76]). Furthermore, the results of the single variable linear regression in Table 3 indicated that the moderator variables did not significantly affect multiple-choice and open-ended format effects (p = 0.15, 0.87, 0.42, 0.85, and 0.06, respectively).

Table 2 Meta-analysis of multiple-choice and open-ended format effects in L1 reading

Variable                                   n1      n2     k    g (95% CI)
Overall (a)                                18581   3661   22   0.65 (0.24, 1.06)
Between-subjects                           16795   1875   8    0.64 (0.20, 1.07)
Within-subjects                            1786    1786   14   0.66 (0.10, 1.22)
  Counterbalanced                          670     670    6    0.53 (0.23, 0.83)
  Not counterbalanced                      926     926    5    1.10 (0.08, 2.12)
    Multiple-choice → Open-ended           527     527    3    1.15 (−0.56, 2.86)
    Open-ended → Multiple-choice           399     399    2    1.03 (−0.06, 2.11)
Random assignment                          2513    2509   11   0.44 (0.13, 0.76)
Not random assignment                      16068   1152   11   0.85 (0.25, 1.45)
Stem equivalency                           18025   3105   18   0.55 (0.26, 0.84)
Non-stem equivalency                       556     556    4    1.10 (−0.32, 2.51)
Access to the text when answering          3619    3571   21   0.62 (0.20, 1.04)
No access to the text when answering       15195   323    4    0.79 (0.58, 1.00)
Text explicit question                     121     121    2    −0.30 (−0.73, 0.13)
Text implicit question                     121     121    2    0.14 (−0.33, 0.60)
Number of multiple-choice options (3)      176     144    2    0.62 (−0.33, 1.57)
Number of multiple-choice options (4)      2212    2208   8    0.17 (−0.12, 0.45)
Number of multiple-choice options (5)      269     269    3    0.91 (0.10, 1.73)
Learners' age (primary)                    352     352    5    0.49 (0.22, 0.77)
Learners' age (secondary)                  16676   1800   4    0.72 (0.18, 1.26)
Learners' age (adult)                      1293    1249   11   0.76 (0.09, 1.43)
Learners' L1 (Dutch)                       274     316    1    0.60 (0.40, 0.79)
Learners' L1 (English)                     3308    3230   19   0.65 (0.20, 1.11)
Learners' L1 (Hebrew)                      37      25     1    0.06 (−0.46, 0.58)
Learners' L1 (Swedish)                     14962   90     1    1.10 (0.90, 1.29)

Notes: n1 and n2 = sample sizes. k = the number of data sources. CI = confidence interval. (a) The italicized results for the moderator variables are based on one data source and are not interpreted.


Table 3 Weighted single variable linear regression meta-analysis of multiple-choice and open-ended format effects

Domain        k   Variable                                B       SE of B   Z       p     β      R²
L1 reading    19  Constant                                0.10    0.53      0.18    .86          .00
                  Test reliability (Multiple-choice)      1.14    0.79      1.44    .15   .41    .17
              19  Constant                                0.71    0.71      0.99    .32          .00
                  Test reliability (Open-ended)           0.16    1.01      0.16    .87   .05    .00
              12  Constant                                4.45    6.38      0.70    .49          .00
                  Scoring reliability (Open-ended)        5.39    6.68      0.81    .42   .28    .08
              21  Constant                                0.42    0.86      0.49    .62          .00
                  Percentage correct (Multiple-choice)    0.27    1.43      0.19    .85   .05    .00
              21  Constant                                1.84    0.70      2.64    .01          .00
                  Percentage correct (Open-ended)         2.45    1.29      1.90    .06   .50    .25
L2 reading    8   Constant                                1.02    0.50      2.02    .04          .00
                  Test reliability (Multiple-choice)      1.04    0.83      1.26    .21   .41    .17
              8   Constant                                1.56    0.86      1.81    .07          .00
                  Test reliability (Open-ended)           1.56    1.17      1.32    .18   .43    .19
              5   Constant                                10.82   9.05      1.20    .23          .00
                  Scoring reliability (Open-ended)        11.10   9.46      1.17    .24   .52    .27
              10  Constant                                0.55    0.79      0.70    .48          .00
                  Percentage correct (Multiple-choice)    1.59    1.44      1.10    .27   .36    .13
              10  Constant                                1.26    0.66      1.91    .06          .00
                  Percentage correct (Open-ended)         2.10    1.27      1.67    .10   .52    .27
L2 listening  5   Constant                                3.14    0.44      7.08    .00          .00
                  Test reliability (Multiple-choice)      −5.39   1.12      −4.81   .00   −.93   .87
              5   Constant                                4.73    1.33      3.53    .00          .00
                  Test reliability (Open-ended)           −6.21   2.27      −2.74   .01   −.83   .68
              4   Constant                                17.47   11.98     1.46    .14          .00
                  Scoring reliability (Open-ended)        19.06   12.36     1.54    .12   .75    .56
              5   Constant                                1.01    0.53      1.89    .06          .00
                  Percentage correct (Multiple-choice)    4.16    1.03      4.04    .00   .90    .82
              5   Constant                                0.07    2.56      0.03    .98          .00
                  Percentage correct (Open-ended)         3.61    8.81      0.41    .68   .19    .04

Note: k = the number of data sources. B = an unstandardized regression coefficient. β = a standardized regression coefficient.


    2 Format effects in L2 reading

As shown in Table 4, the overall effect of multiple-choice and open-ended formats was not observed because its confidence interval contained zero, ranging from −0.23 to 0.66 (g = 0.22 [−0.23, 0.66]). This indicated that the mean scores in multiple-choice and open-ended formats were not statistically different across studies. However, positive values of effect sizes with confidence intervals not including zero were observed, which suggests that multiple-choice formats were easier than open-ended formats in the following conditions: between-subjects designs (g = 0.70 [0.207, 1.20]), random assignment (g = 0.72 [0.23, 1.20]), stem equivalency (g = 0.49 [0.10, 0.88]), and learners' L2 proficiency level being high (g = 1.29 [0.65, 1.94]). In contrast, a negative value of effect size with the confidence interval not including zero was observed for non-stem equivalency (g = −1.02 [−1.46, −0.58]), which suggests that multiple-choice formats were more difficult than open-ended formats. Among the five moderator variables investigated in the regression, none of them was found to significantly affect the effect sizes, as shown in Table 3 (p = 0.21, 0.18, 0.24, 0.27, and 0.10, respectively).

Table 4 Meta-analysis of multiple-choice and open-ended format effects in L2 reading

Variable                                   n1     n2     k   g (95% CI)
Overall (a)                                1217   1093   11  0.22 (−0.23, 0.66)
Between-subjects                           609    485    6   0.70 (0.207, 1.20)
Within-subjects                            608    608    5   −0.33 (−0.87, 0.208)
  Counterbalanced                          126    126    1   0.05 (−0.09, 0.19)
  Not counterbalanced                      329    329    2   0.15 (−0.75, 1.04)
    Multiple-choice → Open-ended
    Open-ended → Multiple-choice           329    329    2   0.15 (−0.75, 1.04)
Random assignment                          698    586    6   0.72 (0.23, 1.20)
Not random assignment                      519    507    5   −0.38 (−0.98, 0.21)
Stem equivalency                           1064   940    9   0.49 (0.10, 0.88)
Non-stem equivalency                       153    153    2   −1.02 (−1.46, −0.58)
Access to the text when answering          876    752    9   0.21 (−0.40, 0.81)
No access to the text when answering       341    341    2   0.28 (−0.89, 1.45)
Text explicit question                     32     32     1   1.87 (1.35, 2.39)
Text implicit question                     32     32     1   0.27 (−0.25, 0.78)
Number of multiple-choice options (3)
Number of multiple-choice options (4)      1148   1036   9   0.37 (−0.14, 0.88)
Number of multiple-choice options (5)
Learners' L2 proficiency level (mid)       56     56     2   0.03 (−1.57, 1.64)
Learners' L2 proficiency level (high)      94     85     2   1.29 (0.65, 1.94)
Learners' age (primary)
Learners' age (secondary)                  449    337    2   1.03 (−0.04, 2.10)
Learners' age (adult)                      768    756    9   0.03 (−0.42, 0.47)
Learners' L1 (Chinese)                     32     32     1   1.30 (0.78, 1.81)
Learners' L1 (English)                     80     80     2   0.05 (−1.59, 1.69)
Learners' L1 (Hebrew)                      459    344    3   0.19 (−0.23, 0.60)
Learners' L1 (Japanese)                    70     61     1   1.59 (1.25, 1.92)
Learners' L1 (Spanish)                     126    126    1   0.05 (−0.09, 0.19)
Learners' L1 (Taiwanese)                   121    121    1   −1.23 (−1.37, −1.09)
Learners' L2 (English)                     776    652    6   0.13 (−0.65, 0.91)
Learners' L2 (Japanese)                    32     32     1   1.30 (0.78, 1.81)
Learners' L2 (Spanish)                     80     80     2   0.05 (−1.59, 1.69)

Note: (a) See Table 2 note.

    3 Format effects in L2 listening

As seen in Table 5, the overall effect size was positive with a confidence interval greater than zero, ranging from 0.57 to 1.66 (g = 1.11 [0.57, 1.66]). This suggested that multiple-choice formats were easier than open-ended formats and that the degree of difference ranged from medium to large. In particular, format effects were found across conditions such as between-subjects designs (g = 0.99 [0.36, 1.62]) and random assignment (g = 1.11 [0.57, 1.66]). In addition, Table 3 illustrates that three moderator variables (i.e., test reliabilities of multiple-choice and open-ended formats, and percentage correct in multiple-choice formats) in the regression each significantly affected the effect sizes (p = 0.00, 0.01, and 0.00, respectively). This is interpreted in section III.4.

Table 5 Meta-analysis of multiple-choice and open-ended format effects in L2 listening

Variable                                   n1    n2    k   g (95% CI)
Overall (a)                                321   324   5   1.11 (0.57, 1.66)
Between-subjects                           276   279   4   0.99 (0.36, 1.62)
Within-subjects                            45    45    1   1.58 (1.30, 1.86)
  Counterbalanced                          45    45    1   1.58 (1.30, 1.86)
  Not counterbalanced
    Multiple-choice → Open-ended
    Open-ended → Multiple-choice
Random assignment                          321   324   5   1.11 (0.57, 1.66)
Not random assignment
Stem equivalency                           321   324   5   1.11 (0.57, 1.66)
Non-stem equivalency
Access to the text when answering          264   271   4   1.17 (0.50, 1.85)
No access to the text when answering       57    53    1   0.89 (0.50, 1.28)
Text explicit question
Text implicit question
Number of multiple-choice options (3)      57    53    1   0.89 (0.50, 1.28)
Number of multiple-choice options (4)      63    70    2   1.02 (−0.80, 2.84)
Number of multiple-choice options (5)
Learners' L2 proficiency level (mid)       18    17    1   0.89 (0.50, 1.28)
Learners' L2 proficiency level (high)      275   281   5   1.10 (0.49, 1.70)
Learners' age (primary)
Learners' age (secondary)
Learners' age (adult)                      321   324   5   1.11 (0.57, 1.66)
Learners' L1 (English)                     57    53    1   0.89 (0.49, 1.28)
Learners' L1 (Japanese)                    219   226   3   1.04 (0.15, 1.92)
Learners' L1 (Taiwanese)                   45    45    1   1.58 (1.30, 1.86)
Learners' L2 (English)                     264   271   4   1.17 (0.50, 1.85)
Learners' L2 (Spanish)                     57    53    1   0.89 (0.50, 1.28)

Note: (a) See Table 2 note.

4 Analysis of moderator variables

In section II.1.c, 15 moderator variables were listed with possible predictions of format effects. However, as shown in Tables 2 to 5, for most moderator variables, the confidence intervals for effect sizes within each moderator variable overlapped; thus, overall, our predictions were not supported. For example, the confidence intervals for effect sizes overlapped in L1 reading for between-subjects designs (g = 0.64 [0.20, 1.07]) and within-subjects designs (g = 0.66 [0.10, 1.22]). However, there were two moderator variables with confidence intervals that did not overlap, which indicated that they were crucial moderator variables for test format effects. First, in L2 reading, there was no overlap between studies with random assignment (g = 0.72 [0.23, 1.20]) and non-random assignment (g = −0.38 [−0.98, 0.21]). Since the confidence intervals for effect sizes for studies with non-random assignment included zero, this indicated that studies with random assignment more clearly showed format effects than did those with non-random assignment, which was in line with the prediction. Second, again in L2 reading, the confidence intervals for effect sizes did not overlap between studies with stem equivalency (g = 0.49 [0.10, 0.88]) and non-stem equivalency (g = −1.02 [−1.46, −0.58]), both of which did not include zero, which suggests that studies with stem equivalency indicated format effects of multiple-choice formats being easier than open-ended formats, whereas studies with non-stem equivalency showed format effects in the opposite direction. This did not support the prediction that studies with stem equivalency were likely to more clearly show format effects than were studies with non-stem equivalency.

In addition to the inspection of confidence intervals for effect sizes within each moderator variable, single variable linear regression analysis was used to examine the relationships between test format effects and moderator variables. Although most of the relationships were not significant, three were found to be significant: the reliabilities of multiple-choice and open-ended test formats and percentage correct in multiple-choice formats, all in L2 listening. Concerning reliability, there were two contradictory predictions. Studies with high test reliabilities were predicted to more clearly show format effects than were studies with low test reliabilities, and were also predicted to have positive regression coefficients. On the other hand, studies with low test reliabilities of multiple-choice formats were predicted to more clearly show format effects, and were also predicted to have negative regression coefficients. The results showed negative standardized regression coefficients for multiple-choice formats (p = 0.00, β = −0.93, R² = 0.87) and open-ended formats (p = 0.01, β = −0.83, R² = 0.68). The results of test reliabilities of multiple-choice formats supported the latter prediction, whereas there was only one prediction for the test reliabilities of open-ended formats, which was contrary to the results. Regarding the relationship between percentage correct and format effects, our hypothesis that a positive relationship would exist between the percentage correct in multiple-choice formats and format effects was supported (p = 0.00, β = 0.90, R² = 0.82). These results suggested that the moderator variables influenced the test format effects in both predictable and unpredictable ways; however, this requires further testing with larger data sets.

    5 Comparison of moderator variables across domains

Thus far, some moderator variables were found to have relationships with multiple-choice and open-ended format effects in a rather complicated way. However, when the variables were compared across the three domains of L1 reading, L2 reading, and L2 listening, three variables were consistently and significantly related to those format effects and could be considered especially important. They were between-subjects designs (L1 reading g = 0.64 [0.20, 1.07]; L2 reading g = 0.70 [0.207, 1.20]; and L2 listening g = 0.99 [0.36, 1.62]), random assignment (L1 reading g = 0.44 [0.13, 0.76]; L2 reading g = 0.72 [0.23, 1.20]; and L2 listening g = 1.11 [0.57, 1.66]), and stem equivalency (L1 reading g = 0.55 [0.26, 0.84]; L2 reading g = 0.49 [0.10, 0.88]; and L2 listening g = 1.11 [0.57, 1.66]). Their confidence intervals for effect sizes across the three domains did not include zero. This suggested that in studies with these three variables, multiple-choice formats were likely to be easier than open-ended formats. This suggested the importance of these variables in the investigation of multiple-choice and open-ended format effects and the generalizability of the functions of these variables across L1 reading, L2 reading, and L2 listening. One reason for which these three variables were related to multiple-choice and open-ended format effects would be that between-subjects designs exclude the learners' carry-over effects by requiring them to take either multiple-choice or open-ended formats, and that the random assignment of learners to multiple-choice or open-ended conditions and stem-equivalent tests help eliminate irrelevant factors and enable a direct comparison of test formats.

6 Summary and comparison of narrative with meta-analytic synthesis

This study used meta-analysis to quantitatively synthesize the effects of two test formats (multiple-choice and open-ended) on test performance in L1 reading, L2 reading, and L2 listening. More specifically, the current study examined the relative difficulty of multiple-choice and open-ended formats and the variables related to this difficulty. The results indicate that multiple-choice formats are easier than open-ended formats in L1 reading and L2 listening. The degree of format effect difference ranges from small to large in L1 reading and medium to large in L2 listening. In contrast, on the whole, the effects of multiple-choice and open-ended formats in L2 reading are not observed in the current meta-analysis, although multiple-choice formats are found to be easier than open-ended formats when any one of the four moderator variables (i.e., between-subjects designs, random assignment, stem equivalency, or learners' L2 proficiency level being high) is included in the analysis. Format effects favoring multiple-choice formats across domains are consistently observed when the studies use between-subjects designs, random assignment, or stem-equivalent items.

Of interest is a comparison between the findings in the meta-analysis and narrative review. It should be remembered that the narrative review (see section I.1) summarized that in most studies across L1 reading, L2 reading, and L2 listening, multiple-choice formats were easier than open-ended formats. However, some studies in L1 and L2 reading showed no significant difference between the two formats. Overall, the results of the meta-analysis indicate that in L1 reading, multiple-choice formats yield higher scores than do open-ended formats with the degree of difference ranging from small to large, whereas no significant difference is found in L2 reading. In L2 listening, multiple-choice formats are shown to be easier than open-ended formats to a medium to large degree. Thus, when the findings from the narrative and meta-analytic reviews are compared, the inconsistent results in L1 and L2 reading from the narrative approach to research synthesis are clarified in the meta-analysis. Furthermore, the degree of format effects from the narrative review was unknown, but it is found to be from small to large in L1 reading and medium to large in L2 listening.

    IV Implications

Two implications are discussed. First, based on the finding that in L1 reading and L2 listening multiple-choice formats are easier than open-ended formats, a test developer might prefer using the former if the intention is to make the test easier. The choice between multiple-choice and open-ended formats would produce a small to large score difference in L1 reading and a medium to large score difference in L2 listening. For example, in a test with an SD of 10, there would be a 6.5-point difference in L1 reading, with a 95% confidence interval of 2.4 to 10.6 points, and an 11.1-point difference in L2 listening, with a 95% confidence interval of 5.7 to 16.6 points. Using Dunlap's (1999) conversion program, effect sizes can be converted into probabilities based on normal-curve z probability values; in percentile terms, the probability that the multiple-choice format is easier than the open-ended format is approximately 68% in L1 reading and 78% in L2 listening.


These differences may be of little importance in a low-stakes test but of great significance in a high-stakes test, especially if an examinee's score is located around the cut-off point.
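To make the arithmetic behind these figures explicit, the following sketch in Python reproduces the two conversions using the overall estimates implied in the text (g of about 0.65 for L1 reading and 1.11 for L2 listening): multiplying g by the test's standard deviation gives the expected score difference, and McGraw and Wong's common language effect size, Φ(g/√2), gives the probability values that Dunlap's (1999) program provides. The code is illustrative only and is not the program referred to above.

from statistics import NormalDist

def score_difference(g, sd):
    # Expected raw-score difference between formats for a test with the given SD.
    return g * sd

def common_language_effect_size(g):
    # McGraw and Wong's common language effect size: the probability that a
    # randomly chosen multiple-choice score exceeds a randomly chosen
    # open-ended score, assuming normality.
    return NormalDist().cdf(g / 2 ** 0.5)

for domain, g in [("L1 reading", 0.65), ("L2 listening", 1.11)]:
    diff = score_difference(g, sd=10)
    prob = common_language_effect_size(g)
    print(f"{domain}: {diff:.1f}-point difference (SD = 10), "
          f"probability MC is easier = {prob:.0%}")
# Output: roughly a 6.5-point difference and 68% for L1 reading, and an
# 11.1-point difference and 78% for L2 listening, matching the text.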

Second, the inconsistent relative difficulty of multiple-choice and open-ended formats in L1 and L2 reading found in the narrative review was clarified in the current meta-analysis, which also showed the prevalent effects of three moderator variables (between-subjects designs, random assignment, and stem-equivalent items) in investigations of multiple-choice and open-ended format effects across L1 reading, L2 reading, and L2 listening. These results thus provide evidence that quantitative data synthesis plays an important role in explaining inconsistencies in test format effects.

    V Suggestions for future research

There are two areas of future research that may provide more insight into the effects of format on test performance. First, although a meta-analysis can be performed with as few as two studies, larger aggregations of studies are desirable for obtaining more precise information on format effects. This would be especially significant in L2 listening, for which only a few studies were available for inclusion in the current meta-analysis. Furthermore, the tentative findings from the regression analysis must be replicated, and other moderator variables that were not included in the current meta-analysis (e.g., passage length or learners' background knowledge) must be investigated (a minimal sketch of such a moderator analysis is given at the end of this section). Second, another equally important and promising research agenda would be to conduct meta-analyses of the effects of other test formats, such as cloze and C-tests. These two formats have been widely examined in language testing (e.g., Brown, 1993; Grotjahn, 2006), and the rich volume of literature would be ideal for quantitative synthesis.
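As a concrete illustration of the kind of moderator (regression) analysis referred to in the first suggestion above, the following is a minimal Python sketch of an inverse-variance weighted meta-regression with a single dummy-coded moderator, in the spirit of Lipsey and Wilson (2001) and Wilson's (2001) METAREG macro; the effect sizes, sampling variances, and moderator coding are invented for illustration and are not data from the present synthesis.

import numpy as np

# Hypothetical study-level data: Hedges' g, its sampling variance, and a
# dummy-coded moderator (1 = between-subjects design, 0 = within-subjects).
g = np.array([0.72, 0.55, 0.10, 0.41, -0.05, 0.88])
v = np.array([0.04, 0.06, 0.03, 0.05, 0.07, 0.05])
design = np.array([1, 1, 0, 1, 0, 1])

w = 1 / v                                       # inverse-variance weights
X = np.column_stack([np.ones_like(g), design])  # intercept + moderator

# Weighted least squares: b = (X'WX)^(-1) X'Wg
XtWX = X.T @ (w[:, None] * X)
XtWg = X.T @ (w * g)
b = np.linalg.solve(XtWX, XtWg)
se = np.sqrt(np.diag(np.linalg.inv(XtWX)))      # coefficient standard errors

for name, coef, s in zip(["intercept (within-subjects)", "between-subjects"], b, se):
    print(f"{name}: b = {coef:.2f}, 95% CI = [{coef - 1.96*s:.2f}, {coef + 1.96*s:.2f}]")
# A positive moderator coefficient whose interval excludes zero would suggest
# larger multiple-choice advantages in studies with that design characteristic.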

    Acknowledgements

We would like to thank Akihiko Mochizuki, Miyoko Kobayashi, Takayuki Nakanishi, and two anonymous reviewers for their valuable comments on earlier versions of this paper. This research was supported by Educational Testing Service (TOEFL Small Grants for Doctoral Research in Second or Foreign Language Assessment) and Japan Society for the Promotion of Science (Grants-in-Aid for Scientific Research, No. 06J03782).


    VI References

Alderson, J. C. (2000). Assessing reading. Cambridge, UK: Cambridge University Press.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge, UK: Cambridge University Press.
Anastasi, A., & Urbina, S. (1995). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Arthur, W. Jr., Edwards, B. D., & Barrett, G. V. (2002). Multiple-choice and constructed response tests of ability: Race-based subgroup performance differences on alternative paper-and-pencil test formats. Personnel Psychology, 55, 985–1008.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, UK: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford, UK: Oxford University Press.
Bennett, R. E., & Ward, W. C. (Eds.). (1993). Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment. Hillsdale, NJ: Erlbaum.
Berne, J. E. (1992). The effects of text type, assessment task, and target language experience on foreign language learners' performance on listening comprehension tests. (UMI No. 9236396)
Blok, H. (1999). Reading to young children in educational settings: A meta-analysis of recent research. Language Learning, 49, 343–371.
Borenstein, M., Hedges, L., Higgins, J., & Rothstein, H. (2005). Comprehensive meta-analysis (Version 2.2.023) [Computer software]. Englewood Cliffs, NJ: Biostat.
Brantmeier, C. (2005). Effects of readers' knowledge, text type, and test type on L1 and L2 reading comprehension in Spanish. Modern Language Journal, 89, 37–53.
Brennan, R. L. (2006). Educational measurement (4th ed.). Westport, CT: Praeger.
Brown, H. D. (2006). Principles of language learning and teaching (5th ed.). White Plains, NY: Pearson.
Brown, J. D. (1993). What are the characteristics of natural cloze tests? Language Testing, 10, 93–116.
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment. New York: McGraw Hill.
Buck, G. (2001). Assessing listening. Cambridge, UK: Cambridge University Press.
Campbell, J. R. (1999). Cognitive processes elicited by multiple-choice and constructed-response questions on an assessment of reading comprehension. (UMI No. 9938651)
Clapham, C., & Corson, D. (Eds.). (1997). Encyclopedia of language and education, Volume 7: Language testing and assessment. Dordrecht, Netherlands: Kluwer.


Cohen, A. D. (1994). Assessing language ability in the classroom (2nd ed.). Boston, MA: Heinle & Heinle.
Cohen, A. D. (1998). Strategies in learning and using a second language. Harlow, Essex, UK: Longman.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cook, T. D., Cooper, H., Cordray, D. S., Hartmann, H., Hedges, L. V., Light, R. J., et al. (1992). Meta-analysis for explanation: A casebook. New York: Russell Sage Foundation.
Cooper, H., & Hedges, L. V. (Eds.). (1994). The handbook of research synthesis. New York: Russell Sage Foundation.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York: HarperCollins.
Davey, B. (1987). Postpassage questions: Task and reader effects on comprehension and metacomprehension processes. Journal of Reading Behavior, 19, 261–283.
Davey, B., & LaSasso, C. (1984). The interaction of reader and task factors in the assessment of reading comprehension. Journal of Experimental Education, 52, 199–206.
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Dictionary of language testing. Cambridge, UK: Cambridge University Press.
Davies, A., & Elder, C. (Eds.). (2004). The handbook of applied linguistics. Malden, MA: Blackwell.
Doughty, C. J., & Long, M. H. (Eds.). (2003). The handbook of second language acquisition. Malden, MA: Blackwell.
Downing, S. M., & Haladyna, T. M. (Eds.). (2006). Handbook of test development. Mahwah, NJ: Erlbaum.
Dunlap, W. P. (1999). A program to compute McGraw and Wong's common language effect size indicator. Behavior Research Methods, Instruments, & Computers, 31, 706–709.
Elinor, S.-H. (1997, May). Reading native and foreign language texts and tests: The case of Arabic and Hebrew native speakers reading L1 and English FL texts and tests. Paper presented at the Language Testing Symposium, Ramat-Gan, Israel. (ERIC Document Reproduction Service No. ED 412746)
Ellis, R. (1994). The study of second language acquisition. Oxford, UK: Oxford University Press.
Flowerdew, J., & Miller, L. (2005). Second language listening: Theory and practice. New York: Cambridge University Press.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. New York: Routledge.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Grabe, W., & Stoller, F. L. (2002). Teaching and researching reading. Harlow, UK: Pearson.
Grotjahn, R. (Ed.). (2006). The C-test: Theory, empirical research, applications. Frankfurt am Main, Germany: Peter Lang.


Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Erlbaum.
Hinkel, E. (Ed.). (2005). Handbook of research in second language teaching and learning. Mahwah, NJ: Erlbaum.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Thousand Oaks, CA: Sage.
ILTA Language Testing Bibliography. (1999). Retrieved January 2, 2006, from http://www.iltaonline.com/ILTA_pubs.htm
In'nami, Y. (2006). The effects of task types on listening test performance: A quantitative and qualitative study. Unpublished doctoral dissertation, University of Tsukuba, Japan.
Kaplan, R. B. (Ed.). (2002). The Oxford handbook of applied linguistics. New York: Oxford University Press.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge, UK: Cambridge University Press.
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Kobayashi, K. (2004). Dokkai test kaito hoho ga jukensha no test tokuten ni ataeru eikyo: Chugokugo bogo washa no baai [The effects of test methods on reading test scores of Chinese students learning Japanese as a foreign language]. Unpublished master's thesis, Ochanomizu University, Japan.
Kobayashi, M. (2002). Method effects on reading comprehension test performance: Text organization and response format. Language Testing, 19, 193–220.
Light, R., & Pillemer, D. (1984). Summing up: The science of reviewing research. Cambridge, MA: Harvard University Press.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data: A model comparison perspective. Belmont, CA: Wadsworth.
Morris, S. B., & DeShon, R. P. (2002). Combining effect size estimates in meta-analysis with repeated measures and independent-group designs. Psychological Methods, 7, 105–125.
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50, 417–528.
Norris, J. M., & Ortega, L. (Eds.). (2006). Synthesizing research on language learning and teaching. Amsterdam: John Benjamins.
Pigott, T. D. (1994). Methods for handling missing data in research synthesis. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 163–175). New York: Russell Sage Foundation.
Pressley, M., Ghatala, E. S., Woloshyn, V., & Pirie, J. (1990). Sometimes adults miss the main ideas and do not realize it: Confidence in responses to short-answer and multiple-choice comprehension questions. Reading Research Quarterly, 25, 232–249.


Raudenbush, S. W. (1994). Random effects models. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 301–321). New York: Russell Sage Foundation.
Richards, J. C., & Schmidt, R. (2002). Longman dictionary of language teaching & applied linguistics (3rd ed.). Harlow, Essex, UK: Longman.
Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations. Journal of Educational Measurement, 40, 163–184.
Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3–13.
Ross, S. (1998). Self-assessment in second language testing: A meta-analysis and analysis of experiential factors. Language Testing, 15, 1–20.
Rost, M. (2002). Teaching and researching listening. Harlow, UK: Pearson Education.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Shadish, W. R., & Haddock, C. K. (1994). Combining estimates of effect size. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 261–281). New York: Russell Sage Foundation.
Shohamy, E. (1984). Does the testing method make a difference? The case of reading comprehension. Language Testing, 1, 147–170.
Teng, H.-C. (1999, March). The effects of question type and preview on EFL listening assessment. Paper presented at the American Association for Applied Linguistics. (ERIC Document Reproduction Service No. ED 432920)
Traub, R. E. (1993). On the equivalence of the traits assessed by multiple-choice and constructed-response tests. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 29–44). Hillsdale, NJ: Erlbaum.
Trujillo, J. L. (2006). The effect of format and language on the observed scores of secondary-English speakers. (UMI No. 3198256)
Urquhart, A. H., & Weir, C. (1998). Reading in a second language: Process, product and practice. London: Longman.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Basingstoke, Hampshire, UK: Palgrave Macmillan.
Wilson, D. B. (2001). METAREG.SPS [SPSS macro]. Retrieved February 13, 2005, from http://mason.gmu.edu/~dwilsonb/downloads/spss_macros.zip
Wolf, D. F. (1991). The effects of task, language of assessment, and target language experience on foreign language learners' performance on reading comprehension tests. (UMI No. 9124507)