Administration, Scoring, and Reporting Scores
Ari Huhta, University of Jyväskylä, Finland

Introduction
Administration, scoring, and reporting scores are essential
elements of the testing process because they can significantly
impact the quality of the inferences that can be drawn from test
results, that is, the validity of the tests (Bachman & Palmer,
1996; McCallin, 2006; Ryan, 2006). Not surprisingly, therefore,
professional language-testing organizations and educational bodies
more generally cover these elements in some detail in their
guidelines of good practice.
The Standards for Educational and Psychological Testing devote
several pages to describing standards that relate specifically to
test administration, scoring, and reporting scores (American
Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA, APA, & NCME], 1999, pp. 61–6). The three major international language-testing organizations, namely the International Language Testing Association (ILTA), the European Association for Language Testing and Assessment (EALTA), and the Association of Language Testers in Europe (ALTE), also make specific recommendations about administration, scoring, and reporting scores for different contexts and purposes (e.g., classroom tests and large-scale examinations) and for different stakeholders (e.g., test designers, institutions, and test takers).
Although the detailed recommendations vary depending on the
context, stakeholder, and professional association, the above guidelines endorse very similar practices. Guidelines on the administration of assessments typically aim at creating
standardized conditions that would allow test takers to have a fair
and equal opportunity to demonstrate their language proficiency.
These include, for example, clear and uniform directions to test
takers, an environment that is free of noise and disruptions, and
adequate accommodations for disadvantaged test takers, such as
extra time for people with dyslexia or a different version of the
test for
blind learners. A slightly different consideration is test
security: Individual test takers should not have an unfair
advantage over others by accessing test material prior to the test
or by copying answers from others during the test because of
inadequate invigilation, for example. Administration thus concerns
everything that is involved in presenting the test to the test
takers: time, place, equipment, and instructions, as well as
support and invigilation procedures (see Mousavi, 1999, for a
detailed definition).
Scoring, that is, giving numerical values to test items and tasks (Mousavi, 1999), is a major concern for all types of testing, and professional
associations therefore give several recommendations. From the point
of view of test design, these associations emphasize the creation
of clear and detailed scoring guidelines for all kinds of tests but
especially for those that contain constructed response items and
speaking and writing tasks. Accurate and exhaustive answer keys
should be developed for open-ended items, raters should be given
adequate training, and the quality of their work should be
regularly monitored. Test scores and ratings should also be
analyzed to examine their quality, and appropriate action should be
taken to address any issues to ensure adequate reliability and
validity.
The main theme in reporting, namely communicating test results to stakeholders (Cohen & Wollack, 2006, p. 380), is ensuring the intelligibility and interpretability of the scores. Reporting
just the raw test scores is not generally recommended, so usually
test providers convert test scores onto some reporting scale that
has a limited number of score levels or bands, which are often
defined verbally. An increasingly popular trend in reporting scores
is to use the Common European Framework of Reference (CEFR) to
provide extra meaning to scores. Other recommendations on
reporting scores include that test providers give information about
the quality (validity, reliability) of their tests, and about the
accuracy of the scores, that is, how much the score is likely to
vary around the reported score.
Test Administration, Scoring, and Reporting Scores
In the following, test administration, scoring, and reporting
scores are described in terms of what is involved in each, and of
how differences in the language skills tested and the purposes and
contexts of assessment can affect the way tests are administered,
scored, and reported. An account is also given of how these might
have changed over time and whether any current trends can be
discerned.
Administration of Tests
The administration of language tests and other types of language
assessments is highly dependent on the skill tested and task types
used, and also on the purpose and stakes involved. Different
administration conditions can significantly affect test takers' performance and, thus, the inferences drawn from test scores. As
was described above, certain themes emerge in the professional
guidelines that are fairly common across all kinds of test
administrations. The key point is to create standardized conditions
that allow test takers a fair opportunity to demonstrate what they
can do in the language assessed, and so to get valid,
comparable
information about their language skills. Clear instructions, a
chance for the test taker to ask for clarifications, and
an appropriate physical environment in terms of, for example, noise, temperature, ventilation, and space all contribute in their own ways to creating a fair setting (see Cohen & Wollack, 2006, pp. 356–60, for a detailed discussion of test administration and special
accommodations).
A general condition that is certain to affect both test administration and performance is the time limit set for the test. Some tests can be speeded on purpose, especially if they attempt to tap time-critical aspects of performance, such
as in a scanning task where test takers have to locate specific
information in the text fast. Setting up a speeded task in an
otherwise nonspeeded paper-based test is challenging
administratively; on computer, task-specific time limits are
obviously easy to implement. In most tests, time is not a key
component of the construct measured, so enough time is given for
almost everybody to finish the test. However, speededness can occur
in nonspeeded tests when some learners cannot fully complete the
test or have to change their response strategy to be able to reply
to all questions. Omitted items at the end of a test are easy to
spot but other effects of unintended speededness are more difficult
to discover (see Cohen & Wollack, 2006, pp. 357–8 on research
into the latter issue).
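As an illustration of the first, easily detected symptom, the short sketch below simply flags test takers whose answers stop some way before the end of the test. The response data and the threshold are hypothetical, and the subtler effects of unintended speededness require the kind of analyses Cohen and Wollack (2006) discuss.

# Minimal sketch: flag test takers whose responses simply stop before the end
# of a test, a crude symptom of unintended speededness. Data are hypothetical.

def trailing_omits(responses):
    """Count consecutive omitted items (None) at the end of a response list."""
    count = 0
    for answer in reversed(responses):
        if answer is None:
            count += 1
        else:
            break
    return count

# Each list holds one test taker's answers to a 10-item test; None = no answer.
candidates = {
    "A01": ["c", "a", "d", "b", "a", "c", "b", "d", "a", "c"],
    "A02": ["b", "a", "a", "c", "d", "b", None, None, None, None],
}

for person, answers in candidates.items():
    omitted = trailing_omits(answers)
    if omitted >= 3:  # arbitrary threshold for this illustration
        print(f"{person}: {omitted} unanswered items at the end of the test")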
A major factor in test administration is the aspect of language
assessed; in practice, this boils down to testing speaking versus testing the other skills (reading, writing, and listening). Most aspects of language can be tested in groups, sometimes in very
large groups indeed. The prototypical test administration context
is a classroom or a lecture hall full of learners sitting at their
own tables writing in their test booklets. Testing reading and
writing or vocabulary and structures can be quite efficiently done
in big groups, which is obviously an important practical
consideration in large-scale testing, as the per learner
administration time and costs are low (for more on test
practicality as an aspect of overall test usefulness, see Bachman
& Palmer, 1996). Listening, too, can be administered to big
groups, if equal acoustic reception can be ensured for
everybody.
Certain tests are more likely to be administered to somewhat
smaller groups. Listening tests and, more recently, computerized
tests of any skill are typically administered to groups of 10–30 learners in dedicated language studios or computer laboratories
that create more standardized conditions for listening tests, as
all test takers can wear headphones.
Testing speaking often differs most from testing the other
skills when it comes to administration. If the preferred approach
to testing speaking is face to face with an interviewer or with
another test taker, group administrations become almost impossible.
The vast majority of face-to-face speaking tests involve one or two
test takers at a time (for different oral test types, see Luoma,
2004; Fulcher, 2003; Taylor, 2011). International language tests
are no exception: Tests such as the International English Language
Testing System (IELTS), the Cambridge examinations, the Goethe Institut's examinations, and the French Diplôme d'études en langue française (DELF) and Diplôme approfondi de langue française (DALF) examinations all test one or two candidates at a time.
Interestingly, the practical issues in testing speaking have led
to innovations in test administration such as the creation of
semidirect tests. These are administered in a language or computer
laboratory: Test takers, wearing headphones and microphones, perform speaking tasks following instructions they hear
from a tape or computer, and possibly also read in a test booklet.
Their responses are recorded and rated afterwards. There has been
considerable debate about the validity of this semidirect approach
to testing speaking. The advocates argue that these tests cover a
wider range of contexts, their administration is more standardized,
and they result in very similar speaking grades compared with
face-to-face tests (for a summary of research, see Malone, 2000).
The approach has been criticized on the grounds that it solicits
somewhat different language from face-to-face tests (Shohamy,
1994). Of the international examinations, the Test of English as a
Foreign Language Internet-based test (TOEFL iBT) and the Test
Deutsch als Fremdsprache (TestDaF), for example, use computerized
semidirect speaking tests that are scored afterwards by human
raters. The new Pearson Test of English (PTE) Academic also employs
a computerized speaking test but goes a step further as the scoring
is also done by the computer.
The testing context, purpose, and stakes involved can have a
marked effect on test administration. The higher the stakes, the
more need there is for standardization of test administration, security, confidentiality, checking of identity, and measures
against all kinds of test fraud (see Cohen & Wollack, 2006, for
a detailed discussion on how these affect test administration).
Such is typically the case in tests that aim at making important
selections or certifying language proficiency or achievement. All
international language examinations are prime examples of such
tests. However, in lower stakes formative or diagnostic
assessments, administration conditions can be more relaxed, as
learners should have fewer reasons to cheat, for example (though of
course, if an originally low stakes test becomes more important
over time, its administration conditions should be reviewed).
Obviously, avoidance of noise and other disturbances makes sense in
all kinds of testing, unless the specific aim is to measure
performance under such conditions. Low stakes tests are also not
tied to a specific place and time in the same way as high stakes
tests are. Computerization, in particular, offers considerable
freedom in this respect. A good example is DIALANG, an online
diagnostic assessment system which is freely downloadable from the
Internet (Alderson, 2005) and which can thus be taken anywhere, any
time. Administration conditions of some forms of continuous
assessment can also differ from the prototypical invigilated
setting: Learners can be given tasks and tests that they do at home
in their own time. These tasks can be included in a portfolio, for
example, which is a collection of different types of evidence of
learners' abilities and progress for either formative or summative purposes, or both (on the popular European Language Portfolio, see
Little, 2005).
Scoring and Rating Procedures
The scoring of test takers' responses and performances should be
as directly related as possible to the constructs that the tests
aim at measuring (Bachman & Palmer, 1996). If the test has test
specifications, they typically contain information about the
principles of scoring items, as well as the scales and procedures
for the rating of speaking and writing. Traditionally, a major
concern about scoring has been reliability: To what extent are the
scoring and rating consistent over time and
across raters? The rating of speaking and writing performances,
in particular, continues to be a major worry and considerable
attention is paid to ensuring a fair and consistent assessment,
especially in high stakes contexts. A whole new trend in scoring is
computerization, which is quite straightforward in selected
response items but much more challenging the more open-ended the
tasks are. Despite the challenges, computerized scoring of all
skills is slowly becoming a viable option, and some international
language examinations have begun employing it.
As was the case with test administration, scoring, too, is
highly dependent on the aspects of language tested and the task
types used. The purpose and stakes of the test do not appear to
have such a significant effect on how scoring is done, although
attention to, for instance, rater consistency is obviously closer
in high stakes contexts. The approach to scoring is largely
determined by the nature of the tasks and responses to be scored
(see Millman & Greene, 1993; Bachman & Palmer, 1996).
Scoring selected response items dichotomously as correct versus
incorrect is a rather different process from rating learners' performances on speaking and writing tasks with the help of a rating scale or scoring constructed response items polytomously (that is, awarding points on a simple scale depending on the
content and quality of the response).
Let us first consider the scoring of item-based tests. Figure 58.1 shows the main steps in a typical scoring process: It starts with the test takers' responses, which can be choices made in selected response items (e.g., A, B, C, D) or free responses to gap-fill or short answer items (parts of words, words, sentences). Prototypical responses are test takers' markings on the test booklets that also contain the task materials. Large-scale tests often use separate optically readable answer sheets for multiple choice items. Paper is not, obviously, the only medium used to deliver tests and collect responses. Tape-mediated speaking tests often contain items that are scored rather than rated, and test takers' responses to such items are normally recorded on tape. In computer-based tests, responses are captured in electronic format, too, to be scored either by the computer applying some scoring algorithm or by a human rater.

Figure 58.1 Steps in scoring item-based tests: (1) individual item responses; (2) scoring, guided by the scoring key; (3) individual item scores; (4) weighting of scores, where used; (5) sum of scores (the score scale), possibly revised by item analyses (deletion of items, etc.); (6) application of cutoffs, informed by standard setting; (7) score band or reporting scale.
In small-scale classroom testing the route to step 2, scoring,
is quite straightforward. The teacher simply collects the booklets from the students and marks the papers. In large-scale testing this
phase is considerably more complex, unless we have a computer-based
test that automatically scores the responses. If the scoring is
centralized, booklets and answer sheets first need to be mailed
from local test centers to the main regional, national, or even international
center(s). There the optically readable answer sheets, if any, are
scanned into electronic files for further processing and analyses
(see Cohen & Wollack, 2006, pp. 372–7 for an extended discussion
of the steps in processing answer documents in large-scale
examinations).
Scoring key: An essential element of scoring is the scoring key,
which for the selected response items simply tells how many points
each option will be awarded. Typically, one option is given one
point and the others zero points. However, sometimes different
options receive different numbers of points depending on their
degree of correctness or appropriateness. For productive items, the
scoring can be considerably more complex. Some items have only one
acceptable answer; this is typical of items focusing on grammar or
vocabulary. For short answer items on reading and listening, the
scoring key can include a number of different but acceptable
answers but the scoring may still be simply right versus wrong, or
it can be partial-credit and polytomous (that is, some answers
receive more points than others).
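The structure of such a key can be illustrated with a small sketch. The items, acceptable answers, and point values below are entirely hypothetical; they simply show how a dichotomous multiple choice item, an item whose options earn different numbers of points, and a partial-credit short answer item might be encoded.

# Minimal sketch of a scoring key; all items, answers, and points are invented.

scoring_key = {
    "item1": {"a": 0, "b": 1, "c": 0, "d": 0},               # one correct option
    "item2": {"a": 0, "b": 1, "c": 2, "d": 0},               # options credited by degree of appropriateness
    "item3": {"by train": 2, "train": 2, "on a train": 1},   # short answer with partial credit
}

def score_item(item_id, response):
    """Return the points awarded for a response; answers not in the key score zero."""
    return scoring_key[item_id].get(response.strip().lower(), 0)

print(score_item("item1", "B"))           # 1
print(score_item("item3", "On a train"))  # 1 (acceptable but not fully correct)
print(score_item("item3", "by car"))      # 0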
The scoring key is usually designed when the test items are
constructed. The key can, however, be modified during the scoring
process, especially for open-ended items. Some examinations employ
a two-stage process in which a proportion of the responses is
first scored by a core group of markers who then complement the key
for the marking of the majority of papers by adding to the list of
acceptable answers based on their work with the first real
responses.
Markers and their training: Another key element of the scoring
plan is the selection of scorers or markers and their training. In
school-based testing, the teacher is usually the scorer, although
sometimes she may give the task to the students themselves or, more
often, do it in cooperation with colleagues. In high stakes
contexts, the markers and raters usually have to meet specified
criteria to qualify. For example, they may have to be native
speakers or non-native speakers with adequate proficiency and they
probably need to have formally studied the language in
question.
Item analyses: An important part of the scoring process in professionally designed language tests is item analyses. The so-called classical item analyses are probably still the most common approach; they aim to find out how demanding the items are (item difficulty or facility) and how well they discriminate between good and poor test takers. These analyses can also identify problematic items or items tapping different constructs. Item analyses can result in the acceptance of additional responses or answer options for certain items (a change in the scoring key) or the removal of entire items from the test, which can change the overall test score.
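The two classical indices mentioned here can be computed very simply. The sketch below uses an invented matrix of dichotomous responses and reports each item's facility (proportion correct) and a simple discrimination index (the correlation between the item score and the rest of the test); operational analyses would of course use much larger data sets and dedicated software.

# Minimal sketch of classical item analysis on a small, invented data set.
# Rows are test takers, columns are items (1 = correct, 0 = incorrect).

data = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

n_items = len(data[0])
totals = [sum(row) for row in data]

def pearson(x, y):
    """Plain Pearson correlation, used here as a simple discrimination index."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

for i in range(n_items):
    item_scores = [row[i] for row in data]
    facility = sum(item_scores) / len(item_scores)              # proportion answering correctly
    rest_scores = [t - s for t, s in zip(totals, item_scores)]  # total score without this item
    discrimination = pearson(item_scores, rest_scores)
    print(f"item {i + 1}: facility={facility:.2f}, discrimination={discrimination:.2f}")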
Test score scale: When the scores of all items are ready, the next logical step is to combine them in some way into one or more overall scores. The simplest way to arrive at an overall test score is to sum up the item scores; here the maximum score equals the number of items in the test, if each item is worth one point. The scoring of a test comprising a mixture of dichotomously (0 or 1 point per item) scored multiple choice items and partial-credit/polytomous short answer items is obviously more complex. A straightforward sum of such items results in the short answer questions being given more weight because test takers get more points from them; for example, three points for a completely acceptable answer compared with only one point from a multiple choice item. This may be what we want, if the short answer items have been designed to tap more important aspects of proficiency than the other items. However, if we want all items to be equally important, each item score should be weighted by an appropriate number.
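The arithmetic involved is straightforward, as the sketch below shows with invented item scores: summing raw points lets the three-point short answer items dominate, whereas dividing each item score by its maximum before summing gives every item the same weight.

# Minimal sketch: raw versus equally weighted totals for a mixed item set.
# Item names, maximum points, and the test taker's scores are hypothetical.

item_max = {"mc1": 1, "mc2": 1, "sa1": 3, "sa2": 3}    # maximum points per item
scores = {"mc1": 1, "mc2": 0, "sa1": 2, "sa2": 3}      # one test taker's item scores

raw_total = sum(scores.values())                                      # short answer items dominate
equal_weight_total = sum(s / item_max[i] for i, s in scores.items())  # every item counts the same

print(raw_total)                     # 6 (out of 8 raw points)
print(round(equal_weight_total, 2))  # 2.67 (out of 4 equally weighted items)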
Language test providers increasingly complement classical item
analyses with analyses based on what is known as modern test theory
or item response theory (IRT; one often-used IRT approach is Rasch
analysis). What makes them particularly useful is that they are
far less dependent than the classical approaches on the
characteristics of the learners who happened to take the test and
the items in the test. With the help of IRT analyses, it is
possible to construct test score scales that go beyond the simple
summing up of item scores, since they are adjusted for item
difficulty and test takers' ability, and sometimes also for item
discrimination or guessing. Most large-scale international language
tests rely on IRT analyses as part of their test analyses, and also
to ensure that their tests are comparable across
administrations.
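For readers unfamiliar with these models, the core of the simplest of them, the Rasch model, can be written down in a few lines: the probability of a correct response depends only on the difference between a person's ability and an item's difficulty, both expressed on the same logit scale. The values below are invented; estimating abilities and difficulties from real response data requires dedicated IRT software.

# Minimal sketch of the Rasch (one-parameter IRT) model.
# Ability and difficulty values are hypothetical logits.

import math

def p_correct(ability, difficulty):
    """Rasch model: P(correct) = exp(ability - difficulty) / (1 + exp(ability - difficulty))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

for difficulty in (-1.0, 0.0, 1.5):
    print(f"item difficulty {difficulty:+.1f}: P(correct) = {p_correct(0.5, difficulty):.2f}")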
An example of a language test that combines IRT analysis and
item weighting in the computation of its score scale is DIALANG,
the low stakes, multilingual diagnostic language assessment system
mentioned above (Alderson, 2005). In the fully developed test
languages of the system, the items are weighted differentially,
ranging from 1 to 5 points, depending on their ability to
discriminate.
Setting cutoff points for the reporting scale: Instead of
reporting raw or weighted test scores many language tests convert
the score to a simpler scale for reporting purposes, to make the
test results easier to interpret. The majority of educational
systems probably use simple scales comprising a few numbers (e.g.,
1–5 or 1–10) or letters (e.g., A–F). Sometimes it is enough to report
whether the test taker passes or fails a particular test, and thus
a simple two-level scale (pass or fail) is sufficient for the
purpose. Alternatively, test results can be turned into
developmental scores such as age- or grade-equivalent scores, if
the group tested are children and if such age- or grade-related
interpretations can be made from the particular test scores.
Furthermore, if the reporting focuses on rank ordering test takers
or comparing them with some normative group, percentiles or
standard scores (z or T scores) can be used, for example (see Cohen
& Wollack, 2006, p. 380).
The conversion of the total test score to a reporting scale
requires some mechanism for deciding how the scores correspond to
the levels on the reporting scale. The process through which such
cutoff points (cut scores) for each level are decided is called
standard setting (step 6 in Figure 58.1).
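Once the cut scores exist, applying them is mechanical, as the sketch below illustrates; the cutoffs and the CEFR-style band labels are invented and would in practice come from a standard-setting procedure.

# Minimal sketch of applying cut scores (step 6 in Figure 58.1).
# The cutoffs and band labels are hypothetical.

cutoffs = [(40, "B2"), (25, "B1"), (0, "A2")]   # minimum total score for each band, highest first

def to_band(total_score):
    """Return the highest band whose minimum score the total reaches."""
    for minimum, band in cutoffs:
        if total_score >= minimum:
            return band
    raise ValueError("score below the lowest cutoff")

print(to_band(43))  # B2
print(to_band(31))  # B1
print(to_band(12))  # A2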
Intuition and tradition are likely to play at least as big a
role as any empirical evidence in setting the cutoffs; few language
tests have the means to conduct systematic and sufficient
standard-setting exercises. Possibly the only empirical evidence
available to teachers, in particular, is to compare their students
with each other (ranking), with the students' performances on previous tests, or with other students' performance on the same test
(norm referencing). The teacher may focus on the best and weakest
students and decide to use cutoffs that result in the regular top
students getting top scores in the current test, too, and so on. If
the results of the current test are unexpectedly low or high, the
teacher may raise or lower the cutoffs accordingly.
Many large-scale tests are obviously in a better position to
make more empirically based decisions about cutoff points than
individual teachers and schools. A considerable range of
standard-setting methods has been developed to inform decisions
about cutoffs on test score scales (for reviews, see Kaftandjieva,
2004; Cizek & Bunch, 2006). The most common standard-setting
methods focus on the test tasks; typically, experts evaluate how
individual test items match the levels of the reporting scale.
Empirical data on test takers' performance on the items or the whole
test can also be considered when making judgments. In addition to
these test-centered standard-setting methods, there are
examinee-centered methods in which persons who know the test takers
well (typically teachers) make judgments about their level.
Learners' performances on the items and the test are then compared with the teachers' estimates of the learners to arrive at the most
appropriate cutoffs.
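The logic of such examinee-centered methods can be sketched in a few lines: try each possible cut score and keep the one whose pass/fail decisions agree best with the teachers' independent judgments of the same learners. The scores and judgments below are invented, and operational methods (such as borderline-group and contrasting-groups procedures) are considerably more refined than this.

# Minimal sketch of an examinee-centered way of choosing a cut score.
# Test scores and teacher judgments are hypothetical.

scores = [12, 15, 18, 21, 24, 27, 30, 33]
teacher_says_at_level = [False, False, False, True, False, True, True, True]

def agreement(cut):
    """Proportion of learners whose pass/fail decision at this cut matches the teacher."""
    decisions = [score >= cut for score in scores]
    return sum(d == t for d, t in zip(decisions, teacher_says_at_level)) / len(scores)

best_cut = max(range(min(scores), max(scores) + 1), key=agreement)
print(best_cut, round(agreement(best_cut), 2))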
Interestingly, the examinee-centered approaches resemble what
most teachers are likely to do when deciding on the cutoffs for
their own tests. Given the diffi-culty and inherent subjectivity of
any formal standard-setting procedure, one wonders whether
experienced teachers who know their students can in fact make at
least equally good decisions about cutoffs as experts relying on
test-centered methods, provided that the teachers also know the
reporting scale well.
Sometimes the scale score conversion is based on a type of norm
referencing where the proportion of test takers at the different
reporting scale levels is kept constant across different tests and
administrations. For example, the Finnish school-leaving
matriculation examination for 18-year-olds reports test results on
a scale where the highest mark is always given to the top 5% in the
score distribution, the next 15% get the second highest grade, the
next 20% the third grade, and so on (Finnish Matriculation
Examination Board, n.d.).
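A minimal sketch of this kind of quota-based, norm-referenced conversion is given below; the proportions follow the ones quoted above, while the grade labels and the score distribution are invented and do not reproduce the official Finnish procedure.

# Minimal sketch of quota-based, norm-referenced grading. The proportions follow
# the ones quoted above; grade labels and the score distribution are invented.

quotas = [("top", 0.05), ("second", 0.15), ("third", 0.20), ("rest", 0.60)]

def assign_grades(scores):
    """Give the highest grades to the highest scores, filling each quota in turn."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    grades = [None] * len(scores)
    position = 0
    for grade, share in quotas:
        count = round(share * len(scores))
        for i in order[position:position + count]:
            grades[i] = grade
        position += count
    for i in order[position:]:      # anyone left over after rounding gets the lowest grade
        grades[i] = quotas[-1][0]
    return grades

print(assign_grades(list(range(100, 80, -1))))   # 20 hypothetical total scores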
A recent trend in score conversion concerns the CEFR. Many
language tests have examined how their test scores relate to the
CEFR levels in order to give added meaning to their results and to
help compare them with the results of other language tests (for a
review, see Martyniuk, 2011). This is in fact score conversion (or
setting cutoffs) at a higher or secondary level: The first one
involves converting the test scores to the reporting scale the test
uses, and the second is about convert-ing the reporting scale to
the CEFR scale.
Scoring Tests Based on Performance Samples
The scoring of speaking and writing tasks usually takes place
with the help of one or more rating scales that describe test-taker
performance at each scale level. The rater observes the test taker's performance and decides which scale level best matches the observed performance. Such rating is inherently criterion referenced in nature as the scale serves as the criteria against which test takers' performances are judged (Bachman & Palmer, 1996, p.
212). This is in fact where the rating of speaking and writing
differs the most from the scoring of tests consisting of items
(e.g., reading or listening): In many tests the point or level on
the rating scale assigned to the test taker is what will be
reported to him or her. There is thus no need to count a total
speaking score and then convert it to a different reporting scale,
which is the standard practice in item-based tests. The above
simplifies
matters somewhat because in reality some examinations use more
complex procedures and may do some scale conversion and setting of cutoffs also for speaking and writing. However, in its most straightforward form, the rating scale for speaking and writing is
the same as the reporting scale, although the wording of the two
probably differs because they target different users (raters vs.
test score users).
It should be noted that instead of rating, it is possible to
count, for example, features of language in speaking and writing
samples. Such attention to detail at the expense of the bigger
picture may be appropriate in diagnostic or formative assessment
that provides learners with detailed feedback.
Rating scales are a specific type of proficiency scale and
differ from the more general descriptive scales designed to guide
the selection of test content and teaching materials or to inform test users about the test results (Alderson, 1991). Rating scales should focus on what is observable in test takers' performance, and they
should be relatively concise in order to be practical. Most rating
scales refer to both what the learners can and what they cannot do
at each level; other types of scales may often avoid references to
deficiencies in learners' proficiency (e.g., the CEFR scales focus
on what learners can do with the language, even at the lowest
proficiency levels).
Details of the design of rating scales are beyond the scope of
this chapter; the reader is advised to consult, for example,
McNamara (1996) and Bachman and Palmer (1996). Suffice it to say
that test purpose significantly influences scale design, as do the
designers' views about the constructs measured. A major decision
concerns whether to use only one overall (holistic) scale or
several scales. For obtaining broad information about a skill for
summative, selection, and placement purposes, one holistic scale is
often preferred as a quick and practical option. To provide more
detailed information for diagnostic or formative purposes, analytic
rating makes more sense. Certain issues concerning the validity of
holistic rating, such as difficulties in balancing the different
aspects lumped together in the level descriptions, have led to
recommendations to use analytic rating, and if one overall score is
required, to combine the component ratings (Bachman & Palmer,
1996, p. 211). Another major design feature relates to whether only
language is to be rated or also content (Bachman & Palmer,
1996, p. 217). A further important question concerns the number of
levels in a rating scale. A very fine-grained scale could yield more precise information than a scale consisting of just three or four levels, but this benefit is cancelled out if the raters are unable to distinguish the levels reliably. The aspect of language
captured in the scale can also affect the number of points in the
scale; it is quite possible that some aspects lend themselves to be
split into quite a few distinct levels whereas others do not (see,
e.g., the examples in Bachman & Palmer, 1996, pp. 214–18).
Since rating performances is usually more complex than scoring
objective items, a lot of attention is normally devoted, in high
stakes tests in particular, to ensuring the dependability of
ratings. Figure 58.2 describes the steps in typical high stakes
tests of speaking and writing. While most classroom assessment is
based on only one rater, namely the teacher, the standard practice
in most high stakes tests is for at least a proportion of
performances to be double rated (step 3 in Figure 58.2). Sometimes
the first rating is done during the (speaking) test (e.g., the
rater is present in the Cambridge examinations but leaves the
conduct of the test to an
interlocutor), but often the first and second ratings are done
afterwards from an audio- or videorecording, or from the scripts in
the writing tests. Typically, all raters involved are employed and
trained by the testing organization, but some-times the first
rater, even in high stakes tests, is the teacher (as in the Finnish
matriculation examination) even if the second and decisive rating
is done by the examination board.
Large-scale language tests employ various monitoring procedures
to try to ensure that their raters work consistently enough. Double
rating is in fact one such monitoring device, as it will reveal significant disagreement between raters; if this can be
spotted while rating is still in progress, one or both of the
raters can be given feedback and possibly retrained before being
allowed to continue. Some tests use a small number of experienced
master raters who continuously sample and check the ratings of a
group of raters assigned to them. The TOEFL iBT has an online
system that forces the raters to start each new rating session by
assessing a number of calibration samples, and only if the rater
passes them is he or she allowed to proceed to the actual
ratings.
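One of these monitoring devices, comparing the first and second ratings, amounts to very little computation, as the sketch below shows; the performance identifiers, band ratings, and tolerance are invented.

# Minimal sketch: flag performances whose two ratings differ by more than a
# tolerated margin, so that a third rating or rater feedback can be arranged.
# Identifiers, ratings, and the tolerance are hypothetical.

first_rating = {"P01": 4, "P02": 3, "P03": 5, "P04": 2}
second_rating = {"P01": 4, "P02": 5, "P03": 4, "P04": 2}
tolerance = 1   # adjacent bands are accepted; larger gaps are flagged

for performance in first_rating:
    gap = abs(first_rating[performance] - second_rating[performance])
    if gap > tolerance:
        print(f"{performance}: ratings differ by {gap} bands; refer to a third rater")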
Figure 58.2 Steps in rating speaking and writing performances: (1) performance during the test; (2) first rating, either during the test (speaking) or afterwards from a recording or a script; (3) second rating, typically afterwards; (4) identification of (significant) discrepancies between raters and of performances that are difficult to rate; (5) third and possibly further ratings; (6) compilation of the different raters' ratings; (7) compilation of the different rating criteria into one, if analytic rating is used but only one score is reported; (8) sum of scores, if the final rating is not directly based on the rating scale categories or levels; (9) application of cutoffs, informed by standard setting; (10) reporting of results on the reporting scale. The ratings are guided by the rating scale(s) and benchmark samples, and raters and ratings are monitored and analyzed throughout.
A slightly different approach to monitoring raters involves
adjusting their ratings up or down depending on their severity or
lenience, which can be estimated with the help of multifaceted
Rasch analysis. For example, the TestDaF, which measures German
needed in academic studies, regularly adjusts reported scores for
rater severity or lenience (Eckes et al., 2005, p. 373).
Analytic rating scales appear to be the most common approach to
rating speaking and writing in large-scale international language
examinations, irrespective of language. Several English (IELTS,
TOEFL, Cambridge, Pearson), German (Goethe Institut, TestDaF), and
French (DELF, DALF) language examinations implement analytic rating
scales, although they typically report speaking and writing as a
single score or band.
It is usually also the case that international tests relying on
analytic rating weigh all criteria equally and take the arithmetic
or conceptual mean rating as the overall score for speaking or
writing (step 7, Figure 58.2). Exceptions to this occur, however.
The International Civil Aviation Organization (ICAO) specifies that
all aviation English tests adhering to their guidelines must
implement the five dimensions of oral proficiency in a
noncompensatory fashion (Bachman & Palmer, 1996, p. 224). That
is, the lowest rating across the five criteria determines the
overall level reached by the test taker (ICAO, 2004).
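The difference between the two ways of combining analytic ratings is easy to see in a small sketch; the criteria and ratings below are invented and do not reproduce any particular examination's scale.

# Minimal sketch: compensatory (mean) versus noncompensatory (minimum) combination
# of analytic ratings. Criteria and ratings are hypothetical.

ratings = {"pronunciation": 4, "structure": 4, "vocabulary": 5, "fluency": 3, "comprehension": 4}

compensatory = sum(ratings.values()) / len(ratings)   # arithmetic mean of the criteria
noncompensatory = min(ratings.values())               # the weakest criterion decides

print(f"mean of criteria: {compensatory:.1f}")   # 4.0
print(f"lowest criterion: {noncompensatory}")    # 3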
Reporting Scores
Score reports inform different stakeholders, such as test
takers, parents, admission officers, and educational authorities,
about individuals' or groups' test results for possible action. Thus,
these reports can be considered more formal feedback to the
stakeholders. Score reports are usually pieces of paper that list
the scores or grades obtained by the learner, possibly with some
description of the test and the meaning of the grades. Some,
typically more informal reports may be electronic in format, if
they are based on computerized tests and intended only for the
learners and their teachers (e.g., the report and feedback from
DIALANG). Score reports use the reporting scale onto which raw
scores were converted, as described in the previous section.
Score reports are forms of communication and thus have a sender,
receiver, content, and medium; furthermore, they serve particular
purposes (Ryan, 2006, p. 677). Score reports can be divided into
two broad types: reports on individuals and reports on groups.
Reporting scores is greatly affected by the purpose and type of
testing.
The typical sender of score reports on individual learners and
based on classroom tests is the teacher, who acts on behalf of the
school and municipality and ultimately also as a representative of
some larger public or private educational system. The sender of
more formal end-of-term school reports or final school-leaving
certificates is most often the school, again acting on behalf of a
larger entity. The main audiences of both score reports and formal
certificates are the students and their parents, who may want to
take some action based on the results (feedback) given to them.
School-leaving certificates also have other users, such as higher-level educational institutions or employers making decisions about admitting and hiring individual applicants.
School-external tests and examinations are another major
originator of score reports for individuals. The sender here is
typically an examination board, a regional or national educational
authority, or a commercial test provider. Often such score reports
are related to examinations that take place only at important
points in the learners' careers, such as the end of compulsory
education, end of pre-university education, or when students apply
for a place in a university. The main users of such reports are
basically the same as for school-based reports except that in many
contexts external reports are considered more prestigious and
trustworthy, and may thus be the only ones accepted as proof of
language proficiency, for instance for studying in a university
abroad.
In addition to score reports on individuals' performance,
group-level reports are also quite common. They may be simply
summaries of individual score reports at the class, school,
regional, or national level. Sometimes tests are administered from
which no reports are issued to individual learners; only
group-level results are reported. The latter are typically tests
given by educational authorities to evaluate students' achievement
across the regions of a country or across different curricula.
International comparative studies on educational achievement
exist, in language subjects among others. The best known is the
Programme for International Student Assessment (PISA) by the
Organisation for Economic Co-operation and Development (OECD), which regularly tests 15-year-olds' reading skills in their language of education and reports the results at country level.
The content of score reports clearly depends on the purpose of
assessment. The prototypical language score report provides
information about the test taker's proficiency on the reporting
scale used in the educational system or the test in question.
Scales consisting of numbers or letters are used in most if not all
educational systems across the world. With the increase in
criterion-referenced testing, such simple scales are nowadays often
accompanied by descriptions of what different scale points mean in
terms of language proficiency. Entirely non-numeric reports also
exist; in some countries the reporting of achievement in the first
years of schooling consists of only verbal descriptions.
Score reports from language proficiency examinations and
achievement tests often report on overall proficiency only as a
single number or letter (e.g., the Finnish matriculation
examination). Some proficiency tests, such as the TOEFL iBT and the
IELTS, issue subtest scores in addition to a total score. In many
placement contexts, too, it may not be necessary to report more
than an overall estimate of candidates' proficiency. However, the more the test aims at supporting learning, as diagnostic and
formative tests do, the more useful it is to report profiles based
on subtests or even individual tasks and items. For example, the
diagnostic DIALANG test reports on test-, subskill-, and item-level
performance.
Current Research
Research on the three aspects of the testing process covered
here is very uneven. Test administration appears to be the least
studied (McCallin, 2006, pp. 639–40),
except for the types of testing where it is intertwined with the
test format, such as in computerized testing, which is often
compared with paper-based testing, and in oral testing, where
factors related to the setting and participants have been studied.
Major concerns with computerized tests include the effect of
computer familiarity on the test results and to what extent such
tests are, or should be, comparable with paper-based tests (e.g.,
adaptivity is really possible only with computerized tests)
(Chapelle & Douglas, 2006).
As far as oral tests are concerned, their characteristics and
administration have been studied for decades. In particular, the
nature of the communication and the effect of the tester
(interviewer) have been hotly debated. For example, can the
prototypical test format, the oral interview, represent normal
face-to-face communication? The imbalance of power, in particular,
has been criticized (Luoma, 2004, p. 35), which has contributed to
the use of paired tasks in which two candidates interact with each
other, in a supposedly more equal setting. Whether the pairs are in
fact equal has also been a point of contention (Luoma, 2004, p.
37). Research seems to have led to more mixed use of different
types of speaking tasks in the same test, such as both interviews
and paired tasks. Another issue with the administration conditions
and equal treatment of test takers concerns the consistency of interviewers' behavior: Do they treat different candidates in the
same way? Findings indicating that they do not (Brown, 2003) have
led the IELTS, for example, to impose stricter guidelines on their
interviewers to standardize their behavior.
An exception to the paucity of research into the more general
aspects of test administration concerns testing time. According to
studies reviewed by McCallin (2006, pp. 6312), allowing examinees
more time on tests often benefits everybody, not just examinees
with disabilities. One likely reason for this is that many tests
that are intended to test learners' knowledge (power tests) may in
fact be at least partly speeded.
Compared with test administration, research on scoring and rating of performances has a long tradition. Space does not allow a comprehensive treatment, but a list of some of the important topics gives an idea of the research foci:

• analysis of factors involved in rating speaking and writing, such as the rater, rating scales, and participants (e.g., Cumming, Kantor, & Powers, 2002; Brown, 2003; Lumley, 2005);
• linking test scores (and reporting scales) with the CEFR (e.g., Martyniuk, 2011);
• validity of automated scoring of writing and speaking (e.g., Bernstein, Van Moere, & Cheng, 2010; Xi, 2010); and
• scoring short answer questions (e.g., Carr & Xi, 2010).
Research into reporting scores is not as common as studies on scoring and rating. Goodman and Hambleton (2004) and Ryan (2006) provide reviews of practices, issues, and research into reporting scores. Given that the main purpose of reports is to provide different users with information, Ryan's statement that whatever research exists presents a fairly consistent picture of the ineffectiveness of score reports to communicate meaningful information to various stakeholder groups (2006, p. 684) is rather discouraging. The comprehensibility of large-scale
assessment reports, in particular, seems to be poor due to, for
example, the use of technical terms, too much information too
densely packed, and lack of descriptive information (Ryan, 2006, p.
685). Such reports could be made more readable, for example, by
making them more concise, by providing a glossary of the terms
used, by displaying more information visually, and by supporting
figures and tables with adequate descriptive text.
Ryan's own study on educators' expectations of the score reports from the statewide assessments in South Carolina, USA, showed that his informants wanted more specific information about the students' performance and better descriptions of what different scores and achievement levels meant in terms of knowledge and ability (2006, p. 691). The educators also reviewed different types of individual and group score reports for mathematics and English. The most meaningful report was the achievement performance level narrative, a four-level description of content and content demands that systematically covered what learners at a particular level could and could not do (Ryan, 2006, pp. 692–705).
Challenges
Reviews of test administration (e.g., McCallin, 2006, p. 640) suggest that nonstandard administration practices can be a major source of construct-irrelevant variation in test results. The scarcity of research on test administration is therefore all the more surprising. McCallin calls for a more systematic gathering of information from test takers about administration practices and conditions, and for a more widespread use of, for example, test administration training courseware as effective ways of increasing the validity of test scores (2006, p. 642).
Scoring and rating continue to pose a host of challenges,
despite considerable research. The multiple factors that can affect
ratings of speaking and writing, in particular, deserve further
attention across all contexts where these are tested. One challenge
such research faces is that applying such powerful approaches as
multifaceted Rasch analysis in the study of rating data requires
considerable expertise.
Automated scoring will increase in the future, and will face at
least two major challenges. The first is the validity of such
scoring: to what extent it can capture everything that is relevant
in speaking and writing, in particular, and whether it works
equally well with all kinds of tasks. The second is the
acceptability of automated scoring, if used as the sole means of
rating. Recent surveys of users indicate that the majority of test
takers feel uneasy about fully automated rating of speaking (Xi,
Wang, & Schmidgall, 2011).
As concerns reporting scores, little is known about how
different reports are actually used by different stakeholders
(Ryan, 2006, p. 709), although something is already known about
what makes a score report easy or difficult to understand. Another
challenge is how to report reliable profile scores for several
aspects of proficiency when each aspect is measured by only a few
items (see, e.g., Ryan, 2006, p. 699). This is particularly
worrying from the point of view of diagnostic and formative
testing, where rich and detailed profiling of abilities would be
useful.
Future Directions
The major change in the administration and scoring of language
tests and in the reporting of test results in the past decades has
been the gradual introduction of different technologies.
Computer-based administration, automated scoring of fairly simple
items, and the immediate reporting of scores have been technically
possible for decades, even if not widely implemented across
educational systems. With the advent of new forms of information
and communication technologies (ICT) such as the Internet and the
World Wide Web, all kinds of online and computer-based
examinations, tests, and quizzes have proliferated.
High stakes international language tests have implemented ICT
since the time optical scanners were invented. Some of the more
modern applications are less obvious, such as the distribution of
writing and speaking samples for online rating. The introduction of
a computerized version of such high stakes examinations as the
TOEFL in the early 2000s marked the beginning of a new era. The new
computerized TOEFL iBT and the PTE are likely to show the way most
large-scale language tests are headed.
The most important recent technological innovation concerns
automated assessment of speaking and writing performances. The
TOEFL iBT combines human and computer scoring in the writing test,
and implements automated rating in its online practice speaking
tasks. The PTE implements automated scoring in both speaking and
writing, with a certain amount of human quality control involved
(see also the Versant suite of automated speaking tests [Pearson, n.d.]). It
can be predicted that many other high stakes national and
international language tests will become computerized and will also
implement fully or partially automated scoring procedures.
What will happen at the classroom level? Changes in major
examinations will obviously impact schools, especially if the
country has high stakes national examinations. Thus, the inevitable
computerization of national examinations will have some effect on
schools over time, irrespective of their current use of ICT. The
effect may simply be a computerization of test preparation
activities, but changes may be more profound, because there is
another possible trend in computerized testing that may impact classrooms: more widespread use of computerized formative and
diagnostic tests. Computers have potential for highly
individualized feedback and exercises based on diagnosis of
learners' current proficiency and previous learning paths. The design of truly useful diagnostic tools and meaningful interventions for foreign and second language learning is still in its infancy, and much more basic research is needed to understand
language development (Alderson, 2005). However, different
approaches to designing more useful diagnosis and feedback are
being taken currently, including studies that make use of insights
into dyslexia in the first language (Alderson & Huhta, 2011),
analyses of proficiency tests for their diagnostic potential (Jang,
2009), and dynamic assessment based on dialogical views on learning
(Lantolf & Poehner, 2004), all of which could potentially lead
to tools that are capable of diagnostic scoring and reporting, and
could thus have a major impact on language education.
SEE ALSO: Chapter 51, Writing Scoring Criteria and Score
Reports; Chapter 52, Response Formats; Chapter 56, Statistics and
Software for Test Revisions; Chapter 59, Detecting Plagiarism and
Cheating; Chapter 64, Computer-Automated Scoring of Written
Responses; Chapter 67, Accommodations in the Assessment of English
Language Learners; Chapter 80, Raters and Ratings
References
Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language testing in the 1990s: The communicative legacy (pp. 71–86). London, England: Macmillan.
Alderson, J. C. (2005). Diagnosing foreign language proficiency:
The interface between learning and assessment. New York, NY:
Continuum.
Alderson, J. C., & Huhta, A. (2011). Can research into the diagnostic testing of reading in a second or foreign language contribute to SLA research? In L. Roberts, M. Howard, M. Ó Laoire, & D. Singleton (Eds.), EUROSLA yearbook. Vol. 11 (pp. 30–52). Amsterdam, Netherlands: John Benjamins.
American Educational Research Association, American
Psychological Association, & National Council on Measurement in
Education. (1999). Standards for educational and psychological
testing. Washington, DC: American Educational Research
Association.
Bachman, L., & Palmer, A. (1996). Language testing in practice: Designing and developing useful language tests. Oxford, England: Oxford University Press.
Bernstein, J., Van Moere, A., & Cheng, J. (2010). Validating automated speaking tests. Language Testing, 27(3), 355–77.
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20(1), 1–25.
Carr, N., & Xi, X. (2010). Automated scoring of short-answer reading items: Implications for constructs. Language Assessment Quarterly, 7(2), 205–18.
Chapelle, C., & Douglas, D. (2006). Assessing language through computer technology. Cambridge, England: Cambridge University Press.
Cizek, G., & Bunch, M. (2006). Standard setting: A guide to establishing and evaluating performance standards on tests. London, England: Sage.
Cohen, A., & Wollack, J. (2006). Test administration, security, scoring, and reporting. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 355–86). Westport, CT: ACE.
Cumming, A., Kantor, R., & Powers, D. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. Modern Language Journal, 86, 67–96.
Eckes, T., Ellis, M., Kalnberzina, V., Pižorn, K., Springer, C., Szollás, K., & Tsagari, C. (2005). Progress and problems in reforming public language examinations in Europe: Cameos from the Baltic States, Greece, Hungary, Poland, Slovenia, France and Germany. Language Testing, 22(3), 355–77.
Finnish Matriculation Examination Board. (n.d.). Finnish
Matriculation Examination. Retrieved July 14, 2011 from
http://www.ylioppilastutkinto.fi
Goodman, D., & Hambleton, R. (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education, 17(2), 145–221.
International Civil Aviation Organization. (2004). Manual on the implementation of ICAO language proficiency requirements. Montréal, Canada: Author.
Jang, E. (2009). Cognitive diagnostic assessment of L2 reading comprehension ability: Validity arguments for Fusion Model application to LanguEdge assessment. Language Testing, 26(1), 31–73.
Kaftandjieva, F. (2004). Standard setting. Reference supplement
to the preliminary pilot version of the manual for relating
language examinations to the Common European Framework of Reference
for Languages: Learning, teaching, assessment. Strasbourg, France:
Council of Europe.
Lantolf, J., & Poehner, M. (2004). Dynamic assessment: Bringing the past into the future. Journal of Applied Linguistics, 1, 49–74.
Little, D. (2005). The Common European Framework and the European Language Portfolio: Involving learners and their judgments in the assessment process. Language Testing, 22(3), 321–36.
Lumley, T. (2005). Assessing second language writing: The rater's perspective. Frankfurt, Germany: Peter Lang.
Luoma, S. (2004). Assessing speaking. Cambridge, England: Cambridge University Press.
Malone, M. (2000). Simulated oral proficiency interview: Recent developments (EDO-FL-00-14). Retrieved July 14, 2011 from http://www.cal.org/resources/digest/0014simulated.html
Martyniuk, W. (Ed.). (2011). Aligning tests with the CEFR: Reflections on using the Council of Europe's draft manual. Cambridge, England: Cambridge University Press.
McCallin, R. (2006). Test administration. In S. Downing & T. Haladyna (Eds.), Handbook of test development (pp. 625–51). Mahwah, NJ: Erlbaum.
McNamara, T. (1996). Measuring second language performance. Boston, MA: Addison Wesley Longman.
Millman, J., & Greene, J. (1993). The specification and development of tests of achievement and ability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 335–66). Phoenix, AZ: Oryx Press.
Mousavi, S. E. (1999). A dictionary of language testing (2nd
ed.). Tehran, Iran: Rahnama Publications.
Pearson. (n.d.). Versant tests. Retrieved July 14, 2011 from http://www.versanttest.com
Ryan, J. (2006). Practices, issues, and trends in student test score reporting. In S. Downing & T. Haladyna (Eds.), Handbook of test development (pp. 677–710). Mahwah, NJ: Erlbaum.
Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing, 11(2), 99–123.
Taylor, L. (2011). Examining speaking: Research and practice in assessing second language speaking. Cambridge, England: Cambridge University Press.
Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are we heading? Language Testing, 27(3), 291–300.
Xi, X., Wang, Y., & Schmidgall, J. (2011, June). Examinee perceptions of automated scoring of speech and validity implications. Paper presented at the LTRC 2011, Ann Arbor, MI.
Suggested Readings
Abedi, J. (2008). Utilizing accommodations in assessment. In E.
Shohamy & N. Hornberger (Eds.), Encyclopedia of language and
education. Vol. 7: Language testing and assessment (2nd ed., pp.
331–47). New York, NY: Springer.
Alderson, J. C. (2000). Assessing reading. Cambridge, England: Cambridge University Press.
Becker, D., & Pomplun, M. (2006). Technical reporting and documentation. In S. Downing & T. Haladyna (Eds.), Handbook of test development (pp. 711–23). Mahwah, NJ: Erlbaum.
Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Erlbaum.
Buck, G. (2000). Assessing listening. Cambridge, England: Cambridge University Press.
Fulcher, G. (2003). Testing second language speaking. Harlow, England: Pearson.
Fulcher, G. (2008). Criteria for evaluating language quality. In E. Shohamy & N. Hornberger (Eds.), Encyclopedia of language and education. Vol. 7: Language testing and assessment (2nd ed., pp. 157–76). New York, NY: Springer.
Fulcher, G., & Davidson, F. (2007). Language testing and
assessment: An advanced resource book. London, England:
Routledge.
North, B. (2001). The development of a common framework scale of descriptors of language proficiency based on a theory of measurement. Frankfurt, Germany: Peter Lang.
Organisation for Economic Co-operation and Development. (n.d.).
OECD Programme for International Student Assessment (PISA).
Retrieved July 14, 2011 from http://www.pisa.oecd.org
Weigle, S. (2002). Assessing writing. Cambridge, England: Cambridge University Press.
Xi, X. (2008). Methods of test validation. In E. Shohamy & N. Hornberger (Eds.), Encyclopedia of language and education. Vol. 7: Language testing and assessment (2nd ed., pp. 177–96). New York, NY: Springer.