ITEM EVALUATION OF THE READING TEST OF THE MALAYSIAN UNIVERSITY ENGLISH TEST (MUET)
RUSILAH BINTI YUSUP
Submitted in partial fulfilment of the requirements for the degree of Master of Assessment and Evaluation
Melbourne Graduate School of Education (MGSE)
The University of Melbourne
September 2012
ABSTRACT
The present study is an item-level evaluation of the reading test of the Malaysian
University English Test (MUET), the high-stakes entrance test for Malaysian pre-degree
students. It comprises an in-depth analysis of student responses at item level to explain why this compulsory entry test appears to be a formidable challenge for test takers. The study aims to assess the quality of the test items within the framework of two
widely used psychometric theories – classical test theory (CTT) and the Rasch model.
Additionally, it examines the effects of item features and examinees’ characteristics in
determining the difficulty level of the test items. These two issues have been explored
by using regression analysis and differential item functioning (DIF) respectively. The
findings of item analysis demonstrate the complementary nature of CTT and the Rasch
model as useful tools for test design and evaluation. The study also reports that item
difficulty of this reading test is influenced largely by question format features
(particularly plausibility of the distractors) rather than passage-related variables and
question-type variables. DIF analysis points out that natural/real differences are seen as
a possible explanation for variation between the various groups being examined. These
findings, though subject to limitations, have practical implications for instruction, test
construction and educational research. It also provides directions for the future exploration of several issues identified in the course of the research. Owing to its scope, the study focuses on only one of the four components of MUET, namely the reading test.
DECLARATION
This thesis does not contain material which has been accepted for any other degree in
any university. To the best of my knowledge and belief, this thesis contains no material
previously published or written by any other person, except where due reference is
2.2 Assessment of Reading Comprehension ......................................................... 14
2.3 Psychometric Item Analysis of the MUET Reading Test ............................... 17
2.4 The Effects of Test Features and Examinee’s Characteristics on Item Difficulty of Reading Test ...................................................................... 26
2.5 Review of Previous Studies ............................................................................. 34
Figure 4.6 ICC Plot of Item with Negative Discrimination Index (Item 34)........ 61
Figure 4.7 ICC Plot of Item with Negative Discrimination Index (Item 38)........ 61
Figure 4.8 ICC Plot of Misfit Item (Item 34)........................................................ 63
Figure 4.9 Boxplot of Interaction between Plausibility of Distractors and Item Difficulty ...................................................... 69
Figure 4.10 Boxplot of Interaction between Inference Level and Item Difficulty 69
Figure 4.11 Boxplot of Interaction between Structure of Responses and Item Difficulty ...................................................... 70
Figure 4.12 ICC Plot of Item with Gender DIF (Item 6) ...................................... 74
Figure 4.13 ICC Plot of Item without Gender DIF (Item 15) ............................... 75
Figure 4.14 ICC Plot of Item with Ethnicity DIF (Item 30) ............................... 77
CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION
Broadly speaking, the purpose of educational testing and assessment is straightforward: to measure or gauge what learners know or can do. Testing or assessment is seen as a process of gathering evidence to infer the performance of an individual (McNamara, 2000). In the educational context, many educators recognize that tests play a crucial role in the arsenal of tools used to measure student achievement.
In the Malaysian context, the outcome of assessment through standardized tests1 is seen
as the linchpin to track how well students perform throughout their schooling years.
Stakeholders in education, particularly students, teachers and parents, make inferences
about students’ overall performance from national standardized high-stakes
examinations like Penilaian Menengah Rendah (PMR or Lower Secondary Evaluation),
Sijil Pelajaran Malaysia (SPM or Malaysian Certificate of Education) and Sijil Tinggi
Pelajaran Malaysia (STPM or Malaysian Higher School Certificate Examination).
After PMR for example, students will be streamed into either a Science stream or Arts
stream based on their results. Those with outstanding results in SPM will have the advantage of entering matriculation centres2, which offer a pre-university programme that prepares Malaysian ‘bumiputera’ students to qualify for Degree Programmes in the fields of Science and Technology in both local and overseas
universities.
The results of standardized tests serve a variety of purposes in educational settings. At
the individual level, Mertler (2007) asserts that test scores are used to describe one’s
learning abilities and levels of achievements. The information helps students to identify
1 A test that is developed, administered and scored in a predetermined standard manner.
Students take the same set of exam questions, marked with the same marking scheme and graded using the same grading system.
2 Centres for foundation studies which offer one or two-year programmes run by the Ministry of Education.
their areas of strength and weakness (Schwartz, 1984) and this guides them to modify or
adapt to the instruction based on their own needs (Mertler, 2007). In addition, test
results can provide useful information at group level. Often, test results are used to
compare students with other students (Schwartz, 1984). Tests serve as an indicator of
general ability levels of students across classes, grade levels, schools or states.
Over the decades, we have witnessed a change in the use of educational testing.
Educational tests no longer serve primarily as indicators of educational achievement.
Tests have become an effective policy device to implement changes or modifications to
educational policies (Baker, 1989) and to monitor the effectiveness of instruction or
academic courses (Bachman & Palmer, 1996). Test results are now used to evaluate
teachers, administrators, and even the quality of an entire curricular and instructional
program. As an example, many higher educational institutions make use of scores from
standardized tests as the sole, mandatory, or primary criterion for admissions or
certification. Therefore, education stakeholders should view the results of tests as a
source of information which needs to be put into good use to reach appropriate
decisions about students, instruction and curriculum at large.
1.1.1 English Language in Malaysia
Owing to the legacy of British rule, the English language has been spoken in Malaysia for decades. From pre-independence days until today, it has been widely spoken and is therefore considered the second language of the country. It has been used extensively in commercial and social settings, formal and informal situations – in business transactions, internet communication, advertising and the entertainment industry. In
government administration, although Malay is the official language, English usage is
frequent and necessary in many international transactions and correspondences. To a
certain extent, English has become part and parcel of the life of Malaysians. As an
example, failure in securing jobs after graduation is often linked to the inability to
communicate effectively in English. It is also a common notion in Malaysia that one’s
success in today’s competitive global world is associated with the mastery of the
English language.
Due to its importance, English has been made a compulsory subject taught and tested as
a second language from the first year of an individual's primary education to the end of
his/her secondary education in Form Five. Unfortunately, prior to 1999, English was not
taught or tested at the Sixth Form or pre-university level. However, upon entry into the
local public tertiary institutions, these pre-1999 students were required to undergo a
course in English language proficiency. This is because at the tertiary level, although
the medium of instruction in the public universities is the national language (Malay),
English is widely used to teach science and mathematics-related subjects or courses.
It was with the dual purpose of filling the gap with respect to the training and learning
of English and that of consolidating and enhancing the language literacy of the Sixth
Form and pre-university students, that the Malaysian University English Test (MUET)
was first introduced in 1999, along with a curriculum/syllabus for delivery at Sixth
Form and equivalent level.
MUET is administered twice a year, i.e. at mid-year (April/May) and year-end
(October/November). The test is developed and run by the Malaysian Examination
Council3. It is a test to measure the English language proficiency of pre-university
students for entry into tertiary education. It is a mandatory test to gain entry into degree
courses offered at all Malaysian public universities. Unlike the International English
Language Testing System (IELTS) and Test of English as a Foreign Language
(TOEFL) which are globally accepted as the certification of English language
proficiency, MUET is recognized only in Malaysia and Singapore (National University
of Singapore, Nanyang Technological University and Singapore Management
University).
MUET comprises the four language skills of listening, speaking, reading and writing. It
gauges and reports a candidate’s level of proficiency based upon an aggregated score
3 A statutory body under the Ministry of Education, which is solely responsible for the
development and administration of MUET. This body is not involved in the management of other high stakes examinations like PMR and SPM. These two standardized tests are run by the Malaysian Examination Syndicate.
ranging from zero to 300, which is then converted into a banding system ranging from the lowest, Band 1, to the highest, Band 6.
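The score-to-band conversion can be sketched as follows. The cut-off scores in this sketch are hypothetical placeholders for illustration only; they are not the official band boundaries, which are not stated here.

```python
# Illustrative sketch of converting an aggregated MUET score to a band.
# NOTE: these cut-offs are hypothetical placeholders, not the official
# Malaysian Examination Council boundaries.
HYPOTHETICAL_CUTOFFS = [(260, 6), (220, 5), (180, 4), (140, 3), (100, 2)]

def muet_band(aggregated_score: int) -> int:
    """Map an aggregated score (0-300) to a band (1-6)."""
    if not 0 <= aggregated_score <= 300:
        raise ValueError("aggregated score must be between 0 and 300")
    for cutoff, band in HYPOTHETICAL_CUTOFFS:
        if aggregated_score >= cutoff:
            return band
    return 1  # below the lowest hypothetical cut-off: Band 1
```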
The MUET syllabus aims to equip students with the appropriate level of proficiency in
English so as to enable them to perform effectively in their academic pursuits at tertiary
level. The syllabus is designed to bridge the gap in language needs between secondary
and tertiary education by enhancing communicative competency, providing the context
for language use that is related to the tertiary experience and developing critical
thinking through the competent use of language skills. In a broader sense, it aims to prepare Malaysian university graduates to compete effectively at the global level, which requires mastery of English as a worldwide lingua franca.
1.2 PROBLEM STATEMENT
After receiving their SPM examination results, qualified students may move on to
study in various higher learning institutions in the country. They can choose to enrol in
Form Six (pre-university level), a matriculation college, a teacher training institute, a
polytechnic or a community college. At this level, English is given considerable
emphasis. For example, English is taught in teacher training colleges and matriculation
centres to help students to enhance their English proficiency as well as to prepare them
for the MUET exam. Teaching English or the MUET syllabus for pre-university
students is therefore seen as a consolidation phase or continuation of what they have
learnt in secondary schools.
Achievement in MUET acts as an indicator of a student’s language proficiency level
and enables him or her to enrol for undergraduate programmes at Malaysian public
universities or other higher learning institutions. For most universities, students must
obtain a higher band in MUET in order to be accepted in the faculties of Engineering,
Dentistry, Medicine and Law. In University Malaya, for example, students aspiring to
pursue Bachelor of Law and Bachelor of TESL need to pass with at least Band 4 or
equivalent. In another case, Band 5 for MUET is the minimum requirement for students
enrolling in the Faculty of Law in MARA University of Technology. Thus, to be
granted admission to their choice of programme, students must pass the MUET with a
satisfactory grade to meet the requirement outlined by the universities. The following
chart depicts the use of MUET for pre-degree students in sixth form and equivalent.
Figure 1.1: Flow Chart of MUET Use for Pre-degree Students
Despite its status as a hurdle requirement or mandatory language test for entry into
public universities, the exam is a formidable challenge for many students. The final
analysis of MUET-END 2009 by the Malaysian Examination Council (see Table 1.1),
for example, revealed that 89% of students fell below Band 3 (see the band descriptor in
Appendix A). It was noted that 39% of test-takers were categorised as limited users
(Band 2). Also, the percentage of students obtaining the upper bands (Band 4 - Band 6)
is small. This figure shows that the level of English language proficiency among Malaysian students is at a low ebb. These low results restrict many students’ chances of
entry to the programme of their choice.
Many studies have been conducted to understand the factors which contribute to
students’ poor performance in English. A study conducted by Hamzah and Abdullah
(2009) found that ESL learners are unable to use the language because of a lack of
learning strategies. The research showed that the respondents, ESL learners in institutions of higher learning, could not master the language without proper training in metacognitive strategies.
Other possible reasons for this problem are factors such as attitude, perception and
environment (Kaur & Thiyagarajah, 1999; Jalaluddin, Awal & Bakar, 2009). The
researchers revealed that social embarrassment fuelled the hesitation to use the
language. This means that students hesitate to practise the language and are more
comfortable communicating using their mother tongue. Moreover, Jalaluddin et al.
(2009) added that differences in language structures act as a barrier to acquisition of the
second language. The study demonstrated that structural differences between the first
language (i.e. Malay) and the target language (i.e. English) can inhibit mastery of the
language.
1.3 AIM / PURPOSE OF THE STUDY
A review of literature has indicated that the possible reasons for Malaysian students’
lack of English proficiency have been the object of numerous academic inquiries. The
emphasis has primarily been on extraneous variables such as students’ perception and
attitude, social environment and linguistic factors. It appears that these extraneous
variables are hindrances to Malaysian students mastering the language and eventually
this affects their performance in a language test, in this case MUET (Kaur &
Thiyagarajah, 1999; Jalaluddin et al., 2009; Hamzah & Abdullah, 2009).
Unfortunately, studies that concentrate on the influence of the actual test items on the
difficulty level of a particular test have been minimal. To date, no published or unpublished studies have undertaken a comprehensive exploration of these psychometric issues in the Malaysian context, in part because analysis of this problem has been hampered by restricted access to the test data.
Based on the results of previous administrations, many MUET test-takers struggle with the reading comprehension test. There is concern that
Malaysian students and graduates lack reading comprehension skills (Sarudin &
Zubairy, 2008). Malaysian university graduates have also been criticised for lacking the general reading skills needed to perform effectively in the workplace. Of the four components
in MUET, reading comprehension has been given the highest weighting, i.e. 40% of the
total score. This clearly shows that the Malaysian educational policy is concerned with
equipping students with reading skills to engage successfully in tertiary education. This
is due to the fact that in the second language learning context, reading is perceived as a
prominent academic skill for university students. Carrell (1988) acknowledges that:

“In second language teaching/learning situations for academic purposes, especially in higher education in English-medium universities, or other programmes that make extensive use of academic reading materials written in English, reading is paramount. Quite simply, without solid reading proficiency, second language readers cannot perform at levels they must in order to succeed...”

(Carrell, 1988, p.1)

It is through reading that learners are exposed to new information and are able to interpret, evaluate and synthesize the course content. Yet, most often, many students who enrol in higher learning institutions are unprepared for the reading demands of academic life. Poor performance on MUET can be seen as one of the indicators of this
problem. Factors such as poor reading strategies, low interest in reading English
materials and reading habits are mentioned by researchers as the causes of reading
problems for Malaysian ESL learners (Ramaiah & Nambiar, 1993; Abdul Majid, Jelas & Azman, 2002; Ibrahim, 2005, 2006). Obviously, reading comprehension is seen as
the key to unlocking success and thus warrants particular attention, especially in the
ESL context.
As mentioned earlier, a national standardized high stakes examination like MUET plays
a vital role in assessing Malaysian students’ academic achievement. MUET is used as a
means of entry to undergraduate courses in public universities. It is essential therefore
that the test is of high quality. Previous studies show that extraneous factors have been the focal point in examining poor achievement in English language tests. In the
Malaysian context, empirical research that focuses on the psychometric properties of a test at the item level has yet to receive due attention. The main purpose of the present
study is to address this gap and to examine the psychometric properties of the test (the
quality of the test) as well as to investigate the role played by test item characteristics as contributing factors to the difficulty level of MUET, particularly the reading
component.
Pumfrey (1976) and Schwartz (1984) summarized that the most important
characteristics of a good reading test are validity, reliability and practicality. The first
two characteristics are relevant to the present study. Therefore, for the purpose of this
research, it is necessary to investigate the quality of individual items in order to examine
the reliability and validity of this MUET reading test. This is because test developers
have recognized that reliability and validity of test scores are contingent upon the
quality of the test items (Reynolds, Livingston & Wilson, 2009). Logically, as the
quality of the individual items improves, the overall quality of the test also improves.
This process of item analysis is viewed as the key to the development of a successful
test as it provides insights about the pattern of students’ response to an item and the
relation of the item to the overall performance (Nunnally & Bernstein, 1994). The item
analysis in this research utilized the two analytic procedures that are commonly used in
test development and validation, namely traditional or standard item analysis within the
framework of Classical Test Theory (CTT) and the Rasch model, one of the models of
Item Response Theory (IRT).
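To make these two frameworks concrete, the following minimal sketch computes the CTT item facility and point-biserial discrimination, and the Rasch model probability of a correct response. The 0/1 response matrix is invented for illustration; it is not the study's data, and the study's own analyses were not carried out with this code.

```python
import math

# Toy 0/1 response matrix: rows = examinees, columns = items (invented data).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
]

n_persons = len(responses)
n_items = len(responses[0])
totals = [sum(row) for row in responses]  # each examinee's total score

# CTT item facility: proportion of examinees answering the item correctly.
facility = [sum(row[i] for row in responses) / n_persons for i in range(n_items)]

# CTT discrimination (point-biserial): correlation between the item score
# and the total score; negative values flag problematic items.
def point_biserial(item_index):
    scores = [row[item_index] for row in responses]
    mean_s, mean_t = sum(scores) / n_persons, sum(totals) / n_persons
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(scores, totals))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_t = sum((t - mean_t) ** 2 for t in totals)
    return cov / math.sqrt(var_s * var_t)

# Rasch model: the probability of success depends only on the difference
# between person ability (theta) and item difficulty (b), on a logit scale.
def rasch_probability(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))
```

Under the Rasch model, when a person's ability equals an item's difficulty the probability of success is exactly 0.5, which is what the item characteristic curves (ICCs) discussed in Chapter 4 display graphically.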
The second aim of this research is to explore the influence of item features on the
difficulty of reading comprehension items. The investigation dwelt on several characteristics of test items, such as the type of question, length of the passage, plausibility of distractors and number of alternatives, as contributing agents to item
difficulty. In this study, a regression analysis was conducted to investigate the
relationship between the selected item features/characteristics and the difficulty level of
the reading comprehension items.
Another purpose of this research is to examine the effect of students’ background
characteristics on their responses to items of the MUET reading test. Differential Item
Functioning (DIF) analysis was utilised to explore the extent to which background variables such as gender, geographical location and ethnicity influence students’ responses. In the case of a high-stakes examination, DIF analysis is important because the plural, multi-ethnic reality of Malaysia should be taken into consideration in the construction of any
test item. Standardized tests, which, by definition, give all test-takers the same test
under the same (or reasonably equal) conditions, should ensure fairness regardless of
race, socioeconomic status, or other considerations.
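The logic of a DIF check can be illustrated with the Mantel-Haenszel procedure, one common approach: examinees are stratified by total score, and within each stratum the odds of success for the reference and focal groups are compared. The counts below are invented, and this sketch is not claimed to be the specific DIF method applied in this study.

```python
# Mantel-Haenszel common odds ratio for one item (invented counts).
# Each stratum (a matched total-score level) holds counts:
# (reference correct, reference wrong, focal correct, focal wrong).
strata = [
    (20, 10, 15, 15),
    (30, 5, 25, 10),
    (40, 2, 35, 7),
]

def mantel_haenszel_odds_ratio(strata):
    num = den = 0.0
    for a, b, c, d in strata:   # a/b: reference group, c/d: focal group
        n = a + b + c + d       # stratum size
        num += a * d / n
        den += b * c / n
    return num / den

# An odds ratio near 1 suggests no DIF; values far from 1 suggest the item
# favours one group even after matching examinees on overall ability.
odds_ratio = mantel_haenszel_odds_ratio(strata)
```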
1.4 RESEARCH QUESTIONS
Based on the above discussion, this research is designed to address the following
questions:
1. How are the items spread in terms of their difficulty values and the abilities of the students?
2. How good are the items of the MUET reading test?
3. To what extent do the selected features of the test items contribute to the item
difficulty in the MUET reading test?
4. Is there any differential item functioning in the MUET reading test in terms of
gender, geographical location and ethnicity?
1.5 SIGNIFICANCE AND LIMITATION OF THE STUDY
It is hoped that this study will meet its objectives as mentioned earlier. Furthermore, it is
intended to provide sound information to the Ministry of Education generally, and to the
concerned divisions, particularly the Malaysian Examination Council and the Malaysian
Examination Syndicate. It is also hoped that the findings of this research will help these institutions to implement quality-control measures for their examination materials.
Information gained from this research can serve as a guide for those individuals who are
actively involved in the design and construction of test items, as it will provide a better
understanding of measurement complexity. More specifically, the findings from item
analysis of this study can be used by the Malaysian Examination Council test
constructors to design new sets of items for a more defensible reading test in MUET.
Item analysis indeed is valuable in improving items which will be used again in later
tests. It can also be used to eliminate ambiguous or misleading items. Popham (2000)
suggested that in large-scale test development, empirical item improvement through
item analysis should be given a major emphasis. It is this kind of empirical analysis that
facilitates the revision of test items.
The study of test features effects and DIF, in addition, can contribute to the
understanding of the effect of item features and examinee background characteristics on
the construct the test is intended to measure. The results concerning the effects of individual item features on the difficulty level of this reading test will help test writers balance the content of particular features in the development of test items. The DIF analysis, furthermore, will guide test designers in controlling for the possible causes of differences between the groups being compared. Test characteristic effects and DIF investigations help test
evaluators to ensure test validity (Osterlind & Everson, 2009; Camilli & Shepard, 1994)
and to make decisions on the interpretation of a test score.
Due to time constraints, this research examines the factors affecting the item difficulty
of the reading test only. Thus, the findings of the study cannot be taken to represent the overall performance and quality of MUET, which also comprises three other language skills: listening, speaking and writing. In addition, the statistical analysis generated from the
data relies only on the features of the model used, namely the Rasch model. It does not deal with other derivations of CTT (e.g., generalizability theory) or other IRT models (e.g., the two-parameter and three-parameter models).
Because of this limitation in sampling, the findings of this study cannot be generalized to the whole population: the participants are limited to MUET-END 2009 candidates from two regions only, Sabah and the Federal Territory of Kuala Lumpur.
1.6 STRUCTURE OF THE THESIS
This thesis is divided into five chapters.
Chapter 1 provides introductory information for the study. The importance of the English language in the Malaysian setting and the implementation of MUET for pre-university students are described. The problems of low English proficiency among Malaysian
students are also discussed. This chapter introduces the aims of the study and the
research questions which need to be addressed in this study.
Chapter 2 outlines a review of literature on the topics of interest in this study. First, it
highlights the psychometric properties of CTT and the Rasch model for the item
analysis. It explains the item facility, item discrimination, reliability and fit index,
which are used to check the quality of the overall test. Second, it looks at several test characteristics which researchers have found to influence the item difficulty of a test. Third, the discussion of the meaning of DIF and its relation
to bias is then presented. At the end, this chapter reviews the previous research related
to this study.
Chapter 3 describes the methodology utilized for this research. This includes the
description of the data/sample and the materials. This section also outlines the three phases conducted in order to answer the research questions. The three procedures involved in the study are:
Item analysis of CTT and the Rasch model
Coding of individual items and regression analysis
DIF analysis
Chapter 4 presents the findings of the study. The first section describes the results of
the CTT and the Rasch item analysis. Next, the findings of the investigation on the six
predictors of item difficulty are discussed. The last part of this chapter reports the extent
to which the DIF indicators (gender, ethnicity and state) influence students’ responses to
an item.
Chapter 5 gives the main conclusions from the findings of the study by providing
answers to the research questions. Implications of the findings for teaching and testing
MUET and further research are also given in this chapter.
CHAPTER 2: A REVIEW OF LITERATURE
2.1 INTRODUCTION
In the context of second language acquisition, reading is by far the most important skill
to be learnt (Carrell, 1988). Certainly, many learners of English language find
themselves engaged in reading most of the time in order to master the language. In
addition, the ability to read is a central asset in today’s modern, technologically-oriented
world. Numerous research findings have shown strong links between reading
proficiency and success in educational contexts at all ages, from primary school to
university level (Adamson, 1993; Collier, 1989). In higher educational institutions that
make extensive use of academic materials written in English, reading is arguably the
basic foundation on which academic skills of the individual are built. In academia, most
subjects taught are based on a simple process – read, synthesize, analyze and process
information. Simply put, students’ performance at tertiary level is contingent upon their
reading proficiency.
Recognizing the importance of reading as a part of academic literacy, it is no surprise
that there have been many attempts to measure reading skills. Students’ reading ability
is frequently assessed using standardized tests. Today, there are dozens of commercial reading tests; for the purpose of English as a foreign/second language assessment, the most frequently used are TOEFL and IELTS, which can be used as a means to determine attainment in, or attitude towards, reading (Pumfrey, 1976). These tests are
assumed to gauge reading ability which requires the test-takers to read various types of
passages and to respond to questions about the passage. The nature of the multiple-
choice format, which characterizes many standardized tests, provides an objective way
to determine the correct and incorrect responses. Many educators and researchers favour
this type of standardized test mainly due to its practicality (Brown, 2004).
2.2 ASSESSMENT OF READING COMPREHENSION
Devine (1989) defines reading comprehension as:

“Reading comprehension is the process of using syntactic, semantic, and rhetorical information found in printed texts to reconstruct in the reader’s mind, using the knowledge of the world he or she possesses (plus appropriate cognitive and reasoning abilities when necessary), a hypothesis or personal explanation that may account for the message that existed in the writer’s mind as the printed text was prepared.”

(Devine, 1989, p.120)

The above definition of reading comprehension implies that reading is a dynamic and complex process (Pumfrey, 1976; Alderson, 2000; Devine, 1989; Carrell, 1988; Schwartz, 1984). This means that readers are involved in an active process to construct meaning from print or writing. The interactive nature of reading, nonetheless, poses challenges to the design of tests of reading skills. Ample studies have demonstrated that reading assessment has particular complexities (Weaver & Kintsch, 1991; Klapper, 1992) due to the complex and active interactions between reader, text and task.

The first major challenge is that reading comprehension involves dynamic and multi-component processes (Fletcher, 2006; Snow, 2003). Readers use a variety of reading strategies to decipher the meaning of a written text. For example, readers may use semantic, syntactic and contextual clues to make sense of the meaning of unknown words. They may also use various cognitive skills such as inferring, reasoning, predicting, comparing and contrasting to draw conclusions about their interpretation of the text. Readers also need to integrate the words they have read with their prior knowledge, experience, attitude and language (in the case of a second/foreign language context, this refers to the interference of the first language in the reading process). These complicated activities pose an essential question: how do test constructors decide which aspects of reading to measure? It appears that the complexity of the cognitive processes required to derive
meaning challenges the test designers to accurately measure the many skills required for
a particular reading test.
Alderson (2000) and McKay (2006) supported the above notion and confirmed that
reading is both process and product. The process of reading is a reader-text interaction
which involves many different things that are going on when a reader reads. The
product of reading is comprehension or construction of meaning; that is, the
understanding of what has been read. Both need to be assessed. Alderson asserted that
any variable that has an impact on either the reading process or its product needs to be taken
into account in test design and test validation. He also noted that assessing the process
of reading can be a challenging task for educational practitioners.
Additionally, Pumfrey (1976) pointed out that “reading is characteristically
developmental” (p.11). This suggests that the skills required by young readers
inevitably differ from those of adult learners, especially those at the tertiary stage of education.
Thus, the relative importance of particular reading skills at a given stage should be a
prime concern in designing items for reading assessment.
Another challenge of reading assessment is that, like listening, it is often associated with
the measurement of other skills (McKay, 2006). For instance, a student’s reading ability is
typically judged through speaking or writing, so care needs to be taken that the assessment
of reading is not ‘contaminated’ by these other skills. Because reading, when integrated
with other language skills, is unobservable, its assessment must be carried out by
inference (Brown, 2004), which complicates the justification and interpretation of test
results. The interpretation of tests of receptive comprehension is further contested
because different readers infer from, and interpret, a written text in different ways
(Allison, 1999).
The preceding challenges, nonetheless, lead to another crucial issue in reading
assessment, that is, construct validity. As described in the Standards for Educational
and Psychological Testing (AERA, APA, & NCME, 1999), validity refers to “the
degree to which evidence and theory support the interpretations of test scores entailed
by proposed uses of tests” (p. 9) and a construct is defined as “the concept or
characteristic that a test is designed to measure” (p. 173). In its simplest terms, construct
validity refers to multiple sources of evidence supporting or refuting the accuracy of
test-score interpretations (Messick, 1995).
Leading scholars of language testing have identified two major threats to score validity:
construct underrepresentation and construct-irrelevant variance. Messick (1996)
asserted that the validity of the test is affected by an inadequate or incomplete sampling
of the construct (construct underrepresentation) and the measurements of ‘things’ that
are simply not relevant to a construct (construct-irrelevant variance).
It has been repeatedly noted that any source of construct-irrelevant variance may lead to
incorrect inferences about test takers and therefore diminish validity (McNamara,
2000; Alderson, 2000).
Clearly, the construction of a reading assessment requires a series of decisions. It is a
demanding task for test writers to decide what skills to measure, how to measure them
and how to interpret the test score. Despite these challenges, it is acknowledged that the
assessment of reading plays a crucial role in educational practice and research. Reading
test results can serve as an indicator for evaluating various approaches to the teaching of
reading (Pumfrey, 1976; Schwartz, 1984) and for improving reading comprehension
ability (Snow, 2003), because the positive washback4 of reading assessment provides
strategies for researchers and teachers to identify and diagnose reading comprehension
problems in students. The importance of reading assessment thus makes research on
comprehension assessment paramount. In his introductory note to Alderson’s (2000) book, Bachman
recognized that “reading, through which we can access worlds of ideas and feelings, as
well as the knowledge of the ages and visions of the future, is at once the most
extensively researched and the most enigmatic of the so-called language skills” (p. x).

4 Generally, washback refers to the effect of testing on the process of teaching and
learning. Bachman and Palmer (1996) consider washback to be a subset of test impact
in a larger context: the educational system and society.
2.3 PSYCHOMETRIC ITEM ANALYSIS OF THE MUET READING TEST
Both groups of stakeholders, teachers and students, perceive MUET as a high-stakes test.
Due to its significance as a mandatory requirement for admission into public
universities, it is, therefore, essential to assess the reliability and validity of this test. In
other words, as part of evaluation practice, it is fundamental to review test items after
they have been constructed or administered.
The process of evaluating the effectiveness of individual items in a test is called item
analysis. It is normally conducted for the purpose of item selections in the construction
and revision phases of the test. In addition, it is also performed to investigate how well
the items are working with a target group of students. Nunnally and Bernstein (1994)
highlighted that item analysis is extremely useful as it furnishes important information
about how examinees respond to each item and how each item relates to the overall
performance of the test. In this study, all 45 items of the MUET reading test are
scrutinized for statistical analysis using the framework of CTT and the Rasch model.
It should be emphasized here that the purpose of this study is not to compare the two
approaches, but to demonstrate how they complement each other as tools for
educational assessment. The discussion of psychometric characteristics of CTT and the
Rasch model in this section will necessarily be an overview, without extensive recourse
to the mathematical equations of the concerned properties, and the contentious
arguments about which particular approach is superior.
2.3.1 Classical Test Theory (CTT)
CTT is derived from a relatively simple assumption. CTT statistics are based on the
total scores on a test. It assumes that total scores, typically defined as the number of
correct responses, serve as the sole indicator of a person’s level of ability or knowledge
(de Ayala, 2009). Obviously, in CTT, the examinee’s attained score on the whole test is
the unit of focus. Hambleton and Jones (1993) acknowledge that the major advantage of
CTT is its “relatively weak assumptions”, which makes it easy to apply in many testing
situations. It is considered to be “weak” because the above assumption is likely to be
met by the data.
Within this theoretical framework, it is postulated that the score obtained by an
individual is made up of two components: a true score and a random error (de Ayala, 2009;
Hambleton & Jones, 1993). The observed score is thus a function of the true score plus the
random error. The relationship between the three components is expressed in Equation (2.1):

X = T + E                                                        (2.1)

where X is the observed score, T is the true score, and E is the error score.
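Equation (2.1) can be illustrated with a short simulation (a sketch for illustration only, not part of the thesis): each hypothetical administration adds a random error E around a fixed true score T, and the mean observed score across many administrations converges on T.

```python
# Illustrative sketch of X = T + E (not thesis code): simulate many parallel
# administrations for one examinee and check that the mean observed score
# approaches the true score, since the random error E has an expected value of 0.
import random

random.seed(1)
true_score = 25.0                                  # T: fixed true score
observed = [true_score + random.gauss(0, 3.0)      # X = T + E, error SD of 3
            for _ in range(100_000)]               # hypothetical administrations

mean_observed = sum(observed) / len(observed)
print(round(mean_observed, 1))                     # close to T = 25.0
```

This is exactly the thought experiment behind the definition of the true score: it is the expected value of the observed score over (hypothetical) repeated parallel administrations.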
CTT theorizes that each person has a true score, conceptualized as the mean score that he
or she would obtain on parallel tests administered an infinite number of
Due to the fact that standardized reading tests often utilize the multiple-choice format,
classifications of question format focus on features of multiple-choice items which
consist of stem and alternatives (made up of several wrong answers, known as
distractors, and at least one correct answer). For example, the stem, the stimulus segment
or statement of a multiple-choice item, is frequently classified into wh- direct question8
and incomplete statement9 formats (Popham, 2000).
Other question format variables that have become of interest for exploration include
stem length, stem content words, structure of alternatives/options, length of correct
answer and distractors, etc. For example, Scheuneman and Gerritz (1990) recommended
three categories of option structures based on the previous work of Carlton and Harris
(1989). The categories were: a) complete sentences or complex phrases containing
clauses that could stand alone as sentences, b) simple phrases, and c) short lists of 1-4
words. In another study, question format is addressed in terms of the falsity of the
distractors. A falsifiable distractor means that the information which establishes that the
option is incorrect is explicit in the text, whereas a distractor is not falsifiable if the
passage does not provide explicit textual evidence (Ozuru et al., 2008).
6 The format which requires students to respond to a question by selecting the correct answer
from three, four or five options
7 Also known as constructed-response item. This question requires students to write or construct their answer, rather than simply selecting it
8 Complete statement of question which normally begins with wh-question (i.e. what, who, when, where, which, why, whose and how) and ends with question mark
9 The question is formatted as incomplete statement where an omission occurs at the end of the stem/question
2.4.2 Individual Characteristics and Differential Item Functioning (DIF)
According to Bachman and Palmer’s (1996) philosophy of language testing, fairness
is one of the central considerations in test design. Fairness stipulates equal educational
opportunities for all students regardless of their ethnic background, economic and social
status and gender. The Code of Fair Testing Practices in Education developed by the Joint
Committee on Testing Practices (2004) urges test developers to design tests that are as
fair as possible and do not disadvantage examinees of different races, ethnic backgrounds,
genders or demographic locations (rural and urban).
Fairness is a complex and broad area, involving test design, development, test
administration and scoring procedures (Kunnan, 2000; Popham, 2000; McNamara &
Roever, 2006). In the layperson’s view, bias is typically associated with unfairness and
favouritism. Psychometrically, Angoff (1993) defined an item as biased if test takers of
equal ability from different groups respond differently to the item. Shepard et al. (1981)
defined bias as “a kind of invalidity that harms one group more than another” (p. 318).
In the language testing context, examination items are considered biased if they contain
sources of difficulty that are not relevant to the construct being measured (Zumbo,
1999). This suggests that bias is present when construct-irrelevant characteristics of the
test takers influence the score of a test. An item might also be considered biased if it
contains language or content that is differentially difficult for different subgroups of
test-takers. In addition, an item might demonstrate item structure and format bias if
there are ambiguities or inadequacies in the item stem, test instructions, or distractors
(Hambleton & Rogers, 1995).
There are two methods to investigate potential bias in measurement/assessment
(Zumbo, 1999): (a) judgmental and (b) statistical. Zumbo recommended that for a high-
stakes test, statistical techniques are the more feasible and defensible way to flag
potentially biased items, which leads us to differential item functioning (DIF).
The problem of inconsistent behaviour of common items across administrations can be
viewed as an instance of DIF, where the two groups taking two different forms with some
items in common serve as the focal and reference groups. In principle, two groups of students
with the same level of English language proficiency should have equal probabilities of
responding to a reading test item correctly. If their probabilities are different, the item is
said to exhibit DIF.
Dorans and Holland (1993) defined DIF as a psychometric difference between groups
that are matched on the ability or the achievement measured by an item. That is, an item
exhibits DIF if it provides a consistent advantage or disadvantage to members of a
group, not because of differences in the trait of interest, but because of differences in
other traits or because different versions (e.g., translations) of an item measure different
traits. More simply, when examinees in different groups have different probabilities of
answering an item correctly after controlling for overall ability, the item is said to
exhibit DIF (Shepard et al., 1981).
de Ayala (2009) further defined DIF in terms of its graphical representation: the difference
between two item response functions (IRFs), commonly drawn as item characteristic
curves (ICCs). The IRFs represent the item parameter estimates of the focal group and the
reference group. An item is flagged as exhibiting DIF when the two IRFs are not superimposed on one
another. Figure 2.4 and Figure 2.5 show an illustration of DIF.
Figure 2.4: ICC Plot of Item with DIF
Figure 2.5: ICC Plot of Item without DIF
Figure 2.4 shows that the observed curves of the two groups (for example, bumiputera
and non-bumiputera) are far apart from each other. On the other hand, an item with no
significant DIF index in Figure 2.5 indicates that the curves representing boys (L) and
girls (P) are very close. de Ayala’s elaboration of DIF is consistent with the definition
provided by Hambleton (1989): “a test is unbiased if the item characteristic
curves across different groups are identical” (p. 189).
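de Ayala’s graphical definition can be sketched numerically. Under the Rasch model the ICC is P(theta) = exp(theta - b) / (1 + exp(theta - b)); if the focal and reference groups require different difficulty estimates b for the same item, examinees of equal ability have unequal success probabilities and the two curves separate. The difficulty values below are invented for illustration only.

```python
# Illustrative sketch (not thesis code): under the Rasch model the probability of
# a correct response is P(theta) = exp(theta - b) / (1 + exp(theta - b)).
# If two groups need different item difficulties b, the two ICCs separate and
# the item is flagged for DIF.
import math

def rasch_icc(theta, b):
    """Probability of a correct response at ability theta for item difficulty b."""
    return math.exp(theta - b) / (1 + math.exp(theta - b))

theta = 0.5                        # a matched ability level (in logits)
b_reference, b_focal = 0.0, 0.8    # hypothetical per-group difficulty estimates

p_ref = rasch_icc(theta, b_reference)
p_foc = rasch_icc(theta, b_focal)
# Equal-ability examinees, unequal success probabilities -> the item shows DIF
print(round(p_ref, 3), round(p_foc, 3))   # 0.622 0.426
```

The gap between the two probabilities at each matched ability level is exactly the vertical distance between the two ICCs in Figures 2.4 and 2.5.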
The use of DIF analysis, however, should be approached with caution. Some researchers
have at times used the terms ‘item bias’ and DIF interchangeably. Camilli and Shepard
(1994) and Angoff (1993) advocated that the two terms be treated as distinct
entities to avoid the perception that DIF is itself a source of bias. The term DIF is
generally preferred by researchers because it describes what is actually being
observed rather than making inferences about the nature or cause of the difference.
In traditional item analysis, reliability is an essential characteristic of a good test
(McNamara, 1996; Allison, 1999). This is because if a test does not measure
consistently (reliably), the score from a particular administration cannot be taken as an
accurate index of students’ achievement. The reliability
of a test therefore refers to the extent to which the test is likely to produce consistent
scores.
Table 4.5: Summary of CTT Item Analysis
For this test, the coefficient alpha shown in Table 4.5 is 0.78, indicating that the
overall test is moderately reliable. A value in this range is generally considered good for
a classroom test, though a few items probably need further inspection.
Another essential index which is linked to the reliability coefficient is the standard error
of measurement (SEM). Conceptually, the SEM is related to test reliability because it
indicates the amount of error contained in an observed score of the examinees. The
SEM is denoted as a function of the reliability and standard deviation (SD) of a test
(Reynolds et al., 2009). The SEM gives a general indication of the precision of the test:
the smaller the error, the more accurate the measurement provided by the test. For this
test the SEM is about 3. Observe also that as the reliability index increases, the SEM
decreases (see Table 4.6).
N                                  8472
Mean                              21.19
Standard deviation                 6.33
Variance                          40.09
Skewness                           0.40
Kurtosis                          -0.25
Standard error of mean             0.07
Standard error of measurement      2.99
Coefficient alpha                  0.78
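The reported SEM can be checked against the table values using the standard formula SEM = SD × sqrt(1 − alpha) (Reynolds et al., 2009); the small gap between the computed 2.97 and the reported 2.99 presumably reflects rounding of the SD and alpha in the table.

```python
# Verifying the SEM in Table 4.5 from SEM = SD * sqrt(1 - alpha),
# using the rounded SD and coefficient alpha reported in the table.
import math

sd, alpha = 6.33, 0.78
sem = sd * math.sqrt(1 - alpha)
print(round(sem, 2))   # 2.97, consistent with the reported SEM of about 3
```

The formula also makes the inverse relationship explicit: holding SD fixed, a larger alpha shrinks the 1 − alpha term and therefore the SEM.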
Ebel and Frisbie (1991) explain that test reliability is sensitive to item characteristics,
particularly the discrimination index. Hence, in order to obtain higher reliability,
Alagumalai and Curtis (2005) suggest that poorly discriminating items be removed
from the test. Table 4.6 demonstrates how the deletion of non-discriminating items can
improve the reliability of a test.
Table 4.6: Summary of Reliability Analyses
The 43-item dataset in Table 4.6 was obtained by removing the 2 items with negative
discrimination indices. In the 36-item dataset, the 9 items with a discrimination index
≤ 0.10 have been deleted. Notice that the removal of low discriminating items improves
the reliability index of the test: although there are fewer items, the reliability is higher,
evidence that poor items ought to be removed.
                               43-ITEM DATA    36-ITEM DATA
N                                      8472            8472
Mean                                  20.80           17.64
Standard deviation                     6.36            5.93
Variance                              40.44           35.20
Skewness                               0.37            0.33
Kurtosis                              -0.34           -0.46
Standard error of mean                 0.07            0.06
Standard error of measurement          2.94            2.66
Coefficient alpha                      0.79            0.80
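The reliability figures in Tables 4.5 and 4.6 are coefficient alpha values, alpha = k/(k−1) × (1 − sum of item variances / variance of total scores). A minimal sketch of the computation on an invented 0/1 response matrix (not the MUET data) shows how alpha is obtained from item and total-score variances:

```python
# Hedged sketch: coefficient alpha from a dichotomous (0/1) response matrix.
# The five-person, four-item data below are invented for illustration only.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(responses):           # responses: one list of item scores per person
    k = len(responses[0])                # number of items
    items = [[r[i] for r in responses] for i in range(k)]
    totals = [sum(r) for r in responses]
    return k / (k - 1) * (1 - sum(variance(it) for it in items) / variance(totals))

data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(data), 2))
```

Recomputing alpha after dropping a column reproduces, in miniature, the item-deletion exercise of Table 4.6: removing an item whose variance is mostly noise raises alpha even though k decreases.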
4.3 RELATIONSHIP BETWEEN ITEM CHARACTERISTICS AND ITEM
DIFFICULTY
A regression analysis was performed to examine the predictive relationship between the
dependent variable (item difficulty) and the six sets of item characteristics. A stepwise
method, which enters predictors in order of their contribution to the multiple correlation,
was employed.
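The stepwise procedure (run in the thesis with standard statistical software) can be sketched as a forward selection that, at each step, adds the candidate predictor giving the largest increase in R squared. The data below are simulated for illustration only and are not the MUET item features.

```python
# Illustrative forward-stepwise sketch (not the procedure or data used in the
# thesis): at each step, add the predictor that most increases R squared.
import numpy as np

rng = np.random.default_rng(0)
n = 45                                    # one row per test item (simulated)
X = rng.normal(size=(n, 3))               # three candidate item-feature predictors
y = 0.9 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(scale=0.5, size=n)

def r_squared(cols):
    """R squared of an OLS fit of y on an intercept plus the given columns of X."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return 1 - resid.var() / y.var()

selected = []
for _ in range(3):
    best = max((c for c in range(3) if c not in selected),
               key=lambda c: r_squared(selected + [c]))
    selected.append(best)
    print(best, round(r_squared(selected), 3))
```

The printed sequence mirrors the structure of Table 4.8: each successive model keeps the earlier predictors and reports the cumulative R squared.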
The characteristics of the items studied are shown in Table 4.7.
Table 4.7: Characteristics of the MUET Reading Items
Type of variable / Predictor        Characteristics of item                            N

Passage
  Length of the passage             More than 25 lines                                 7
                                    More than 35 lines                                21
                                    More than 45 lines                                17

Question type
  Type of question                  Retrieving directly stated information (RI)       10
                                    Interpreting explicit information (IE)            12
                                    Interpreting by making inference (II)             19
                                    Reflecting on texts (RF)                           4
  Inference type                    No inference                                       3
                                    Paraphrase inference                               7
                                    Across-sentence (bridging) inference              30
                                    Macrostructure (gist) inference                    1
                                    Reader-based (prior knowledge) inference           4

Question format
  Structure of the responses/       Short lists of 1-4 words                          18
  alternatives                      Simple phrases                                    13
                                    Complex phrases containing clauses that
                                      could stand alone as sentences                  14
  Number of options                 3-option                                          29
                                    4-option                                          16
  Plausibility of the distractors   No response options are plausible                 23
                                    One or more options are plausible                 22
The analyses are expected to explore specific item characteristics associated with the
difficulty of the reading comprehension test items. The summary of the regression
analyses is shown in Table 4.8.
Table 4.8: Results of Linear Regression

Model  Predictors                                                   R      R square  Adjusted R square
1      (Constant), plausibility of the distractors                 .371    .137      .117
2      (Constant), plausibility of the distractors,
       structure of the responses                                  .514    .264      .229
3      (Constant), plausibility of the distractors,
       structure of the responses, inference level                 .587    .345      .297
As shown in Table 4.8, the values of R indicate the correlation between the predictors
and the dependent variable. Plausibility of the distractors alone accounts for about
14 percent of the variance in item difficulty. Model 3 (R = .587) demonstrates that the
correlation between the dependent variable and the three predictors together (plausibility
of the distractors, structure of the responses and inference type) is fairly strong.
Table 4.8 also reports measures of model fit. Larger values of R square indicate that the
model explains more of the data; for example, 34.5 percent of the variation in the
dependent variable is explained by Model 3. The effects of the three predictors are
evidently substantial for these reading items.
The regression results clearly reveal that there is a significant relationship between test
features and item difficulty. The findings single out three significant predictors of item
difficulty: plausibility of the distractors (p = .003), structure of the responses (p = .011)
and inference type (p = .030).
So, it is of great interest to focus on these three variables. Table 4.9 depicts the result of
coefficient analysis which shows the statistical significance15 and the direction of the
correlation of these predictors.
15 Conventionally set at p ≤ 0.05
Table 4.9: Results of Coefficient Analysis

Model  Predictor                           B      Std. Error   Beta      t      Sig.
1      (Constant)                       -1.087      .438               -2.481   .017
       plausibility of the distractors    .730      .279       .371     2.617   .012
2      (Constant)                        -.494      .465               -1.062   .294
       plausibility of the distractors    .880      .267       .447     3.301   .002
       structure of the responses        -.427      .159      -.364    -2.688   .010
3      (Constant)                       -1.363      .589               -2.315   .026
       plausibility of the distractors    .818      .256       .415     3.192   .003
       structure of the responses        -.406      .152      -.346    -2.669   .011
       inference type                     .317      .141       .286     2.249   .030

a. Dependent variable: item difficulty estimate
It is noted that plausibility of the distractors is highly significant (p = .003). The
positive sign of the coefficient indicates that items with one or more plausible distractors
tend to be more difficult. The effect of structure of the responses is also statistically
significant (p = .011), but its coefficient is negative, implying, somewhat unexpectedly,
that the longer the structure of the response, the easier the item. The third best predictor,
inference type, also appears to be significantly related to the difficulty of the items in this
test, supporting the notion that items requiring higher levels of inference skill are harder
to answer.
The direction of these relationships can be seen in the following graphs. Figure 4.9 and
Figure 4.10 visualize the positive coefficients of plausibility of the distractors and of
inference level on item difficulty: the more plausible distractors an item has, and the
higher the inference level it requires, the more difficult the item.
Figure 4.9: Boxplot of Interaction between Plausibility of Distractors and Item
Difficulty
Figure 4.10: Boxplot of Interaction between Inference Level and Item Difficulty
Figure 4.11: Boxplot of Interaction between Structure of Distractors and Item
Difficulty
Conversely, the correlation between structure of the responses and the dependent variable
is negative (Figure 4.11). Surprisingly, items with short lists of 1-4 words turn out to be
the difficult items, while items with complex option structures are the easier ones. This is
probably due to the interaction of this variable with the other predictors, plausibility of
the distractors and inference level. The data in Table 4.10 reveal that 40 percent of the
items in this MUET reading test used short lists of responses; of these 18 items, 13 were
categorised in levels 3, 4 and 5 of the inference types. By contrast, none of the items with
longer and more complex phrases required macrostructure or prior-knowledge inferences.
Table 4.10: Interaction between Structure of Responses and Inference Type
Predictors Characteristics of item N Structure of the responses/ alternatives
In summary, the remaining variables (i.e. length of the passage, type of question and
number of options) did not play an important role in determining the item difficulty of
the MUET reading test.

In short, items in this data set are more difficult if they:

contain one or more plausible distractors
consist of simple and short option structures
require higher levels of inference skill

It is also important to conclude here that two of these predictors, plausibility of the
distractors and structure of the responses, reveal the presence of construct-irrelevant
variance in this MUET reading test.
4.4 RESULT OF DIF ANALYSES
DIF occurs when people from different groups (commonly defined by gender, nationality
or ethnicity) with the same latent trait have different probabilities of giving a certain
response to an item. In the Rasch analyses reported here, an item is flagged as exhibiting
DIF when the relevant index falls outside the range of -2 to +2, that is, when the
parameter estimate is more than twice its standard error.
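The -2 to +2 rule amounts to checking whether an estimate lies more than two standard errors from zero. A minimal sketch with illustrative values (not taken from the analysis):

```python
# Sketch of the rule of thumb used here: an estimate is flagged as significant
# DIF when estimate / standard error falls outside the range -2 to +2,
# i.e. the estimate exceeds twice its standard error. Values are illustrative.
def flag_dif(estimate, standard_error):
    return abs(estimate / standard_error) > 2

print(flag_dif(0.25, 0.04))   # True: 6.25 standard errors from zero
print(flag_dif(0.03, 0.04))   # False: within two standard errors of zero
```

The same two-standard-errors criterion is applied below to the overall gender, ethnicity and state parameter estimates as well as to the item-by-group interactions.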
4.4.1 Findings of Gender DIF
Table 4.11: Summary of Overall Performance between Males and Females
Table 4.11 compares the overall performance of the two groups, male (L) and female (P),
on this MUET reading comprehension test. A negative parameter estimate indicates the
group for which the test was relatively easier; the result shows that male students
performed slightly better than female students. The difference is statistically significant:
the parameter estimate is more than twice its standard error (0.027/0.008 = 3.375)
(Wu et al., 2007).
TERM 1: gender
                            UNWEIGHTED FIT               WEIGHTED FIT
gender  ESTIMATE  ERROR^    MNSQ  CI           T         MNSQ  CI           T
L       -0.027    0.008     1.04  (0.95,1.05)  1.6       1.04  (0.95,1.05)  1.5
P        0.027*   0.008     1.04  (0.96,1.04)  1.8       1.03  (0.96,1.04)  1.6
An asterisk next to a parameter estimate indicates that it is constrained.
Separation reliability: not applicable
Chi-square test of parameter equality = 11.10, df = 1
^ Empirical standard errors have been used

Table 4.12 reports the results of the DIF investigation of gender differences for these
data. It reveals that 19 items are flagged as having DIF: their gender-by-item interaction
estimates are more than twice their standard errors, indicating significant differences
between the two subgroups. Ten items (Items 8, 12, 14, 16, 26, 33, 34, 35, 37 and 44)
favour females, while the other nine (Items 2, 4, 6, 9, 13, 19, 21, 40 and 45) appear to
favour male students.
Table 4.12: Parameter Estimates of Gender DIF Investigation
Another method of tracing DIF in an item is visual inspection of its ICC. Figure 4.12 for
Item 6 demonstrates that the empirical curves of the two groups (male represented by the
blue line, female by the green line) are far apart. This ICC also shows that male students
have a higher probability of answering the item correctly than female students.
Figure 4.12: ICC Plot of Item with Gender DIF (Item 6)
Figure 4.13 represents the graphical display of an item with no DIF. Unlike items which
exhibit DIF, this item (Item 15) shows that the empirical curves which represent
females and males are very close. This is an indication that there is no significant
difference between males and females in responding to this item.
Figure 4.13: ICC Plot of Item without Gender DIF (Item 15)
Further examination of DIF across all the items in this test reveals a pattern in how the
two groups respond to particular types of passage. Interestingly, three of the six passages
are relatively easier for male students: those dealing with legal issues (Passage 1),
business (Passage 3) and leadership (Passage 6). On the other hand, topics related to
advertisement/communication (Passage 2) and travel (Passage 5) appear to advantage
female students.
4.4.2 Findings of Ethnicity and State DIF
The DIF analyses for ethnicity and state are treated as one discussion because the two
facets are closely related. In the context of this study, demographic location is
characterized by its multi-racial ethnic composition: many non-Bumiputeras reside in the
capital territory, Kuala Lumpur, while Bumiputeras (comprising more than 30 different
ethnic groups) are found in Sabah.

The estimate of the ethnicity parameter in Table 4.13 shows that this test is relatively
difficult for the Bumiputeras, who largely make up the candidates from Sabah.
Presumably, this is because Sabahans, who are multilingual, do not consider English
their second language. This finding is consistent with the summary result for the
difference between states, as the test appears to advantage candidates in Kuala Lumpur,
who are familiar with the use of English in their daily life.
Table 4.13: Summary of Overall Performance between Ethnic Groups and States
A closer look at the results reveals a significant discrepancy between the two ethnic
groups and between the two states: the parameter estimates of both analyses are far larger
than twice their standard errors. The estimate for state, for example, is about 82 times its
standard error (0.329/0.004). These large values imply substantial differences between
Bumiputera and non-Bumiputera candidates and between candidates in Kuala Lumpur
and Sabah.
TERM 1: Ethnicity
                             UNWEIGHTED FIT               WEIGHTED FIT
ethnic   ESTIMATE  ERROR^    MNSQ  CI           T         MNSQ  CI           T
1 B       0.255    0.004     1.02  (0.97,1.03)  1.2       1.02  (0.97,1.03)  1.1
2 N      -0.255*   0.004     1.09  (0.92,1.08)  2.2       1.08  (0.92,1.08)  1.9
An asterisk next to a parameter estimate indicates that it is constrained.
Separation reliability: not applicable
Chi-square test of parameter equality = 4884.57, df = 1

TERM 1: State
                               UNWEIGHTED FIT              WEIGHTED FIT
state           ESTIMATE  ERROR^   MNSQ  CI           T    MNSQ  CI           T
1 Kuala Lumpur  -0.329    0.004    1.09  (0.95,1.05)  3.9  1.08  (0.95,1.05)  3.5
2 Sabah          0.329*   0.004    0.99  (0.96,1.04) -0.7  0.99  (0.96,1.04) -0.6
An asterisk next to a parameter estimate indicates that it is constrained.
Separation reliability: not applicable
Chi-square test of parameter equality = 8093.95, df = 1
^ Empirical standard errors have been used

The existence of DIF between the states and ethnic groups can also be examined through
the ICC plots of individual items. For example, Item 30 in Figure 4.14 shows a wide gap
between the empirical curves of the Bumiputera (blue line) and non-Bumiputera (green
line) groups.
Figure 4.14: ICC Plot of Item with Ethnicity DIF (Item 30)
The full DIF analyses for ethnicity and state flag many items: 37 items show DIF
between the states and 35 between Bumiputera and non-Bumiputera candidates. The
following table summarizes the ten items with the largest DIF indices for each facet.
Table 4.14: Items with the Largest DIF Indices
ETHNICITY STATE ITEM DIF INDEX IN FAVOUR OF ITEM DIF INDEX IN FAVOUR OF
33 23.9 Non Bumiputera 38 36.5 Sabah
34 22.2 Bumiputera 34 29.5 Sabah
23 22.1 Bumiputera 30 20.6 Kuala Lumpur
38 20.4 Bumiputera 2 19.6 Kuala Lumpur
30 18.4 Non Bumiputera 33 19.2 Kuala Lumpur
2 18.2 Non Bumiputera 37 18.3 Kuala Lumpur
17 14.7 Non Bumiputera 23 16.5 Sabah
35 14.4 Bumiputera 35 16.2 Sabah
32 12.5 Non Bumiputera 17 15 Kuala Lumpur
42 11.9 Bumiputera 42 14.7 Sabah
Notice that the items flagged for DIF on both state and ethnicity tend to have large DIF
indices. Indeed, Table 4.14 shows nine items that display DIF on both facets: Items 2,
17, 30 and 33 appear to advantage the candidates from Kuala Lumpur and the
non-Bumiputeras, while Items 23, 34, 35, 38 and 42 are relatively easier for the
Bumiputeras and the candidates from Sabah. A possible explanation for this pattern lies
in the ethnic composition of the populations of the two states.
CHAPTER 5: DISCUSSION AND CONCLUSION
5.1 INTRODUCTION
This study set out to explore several issues which may affect the difficulty level of
MUET, the high-stakes English test used for admission into Malaysian public
universities. The primary focus of the study has been assessing the quality of the most
heavily weighted component of MUET, namely the reading test. The 45 test items
have been scrutinised through item analyses based on CTT and the Rasch model. Also,
the study investigates to what extent features of test items and student background
characteristics influence the difficulty of the items. These two issues have been
examined through regression analysis and DIF analysis respectively.
The subsequent sections present the summary of the main findings. The discussion
focuses on addressing the research questions raised in Chapter 1. Furthermore, this
chapter discusses the implications of the findings for practice and future research.
5.2 DISCUSSION OF MAJOR FINDINGS
This section discusses the findings with respect to the four research questions posed in
Chapter 1. The results from item analysis, regression analysis and DIF analysis are used
to address the questions.
How do the items spread in terms of their difficulty values and the ability of the
students?
It is concluded that the test was reasonably well targeted to the ability of the
students. On the basis of the CTT and Rasch analyses, there is an even spread of easy
and difficult items in the test. The Rasch analyses provide a clear view of item
distribution and student ability through the item-person map, which shows that the
majority of the items are distributed around the average student ability of -0.168
logits.
The person-item map illustrates an advantage of the Rasch model over CTT. McNamara
(1996) views the calibration of item difficulty and person ability on the same scale as
one of the key features of the Rasch model, because such a map lets test constructors
trace how well the items are matched to the ability of students.
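The common logit scale follows directly from the Rasch model's single logistic function. The following minimal Python sketch (illustrative, not from the thesis; the function name is an assumption) shows how the probability of success depends only on the difference between person ability and item difficulty, which is why persons and items can be mapped together:

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Rasch model: probability of a correct response given person
    ability theta and item difficulty b, both expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A person whose ability equals an item's difficulty succeeds half the time,
# which is why items clustered near the mean ability (-0.168 logits here)
# are well targeted for this cohort.
print(rasch_probability(-0.168, -0.168))        # 0.5
# An item one logit harder than the person is answered correctly about 27% of the time.
print(round(rasch_probability(0.0, 1.0), 3))
```

Reading the item-person map is then a matter of comparing positions on this shared scale: the further an item sits above a person, the lower that person's success probability.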
How good are the items of the MUET reading test?
The quality of the items was examined using discrimination index, reliability index and
fit statistics.
CTT analysis of item discrimination shows that 60 percent of the items are classified as
weak, with nine items having a discrimination index of 0.10 or less. Further
investigation of the items with negative discrimination indices (Items 34 and 38)
indicates that the response options are miskeyed and the distractors are misleading. As
advocated by Popham (2000) and Reynolds et al. (2009), such problematic items need
closer examination; if they are to be retained for future use, they clearly require
revision, either rewording or removing the options given.
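The logic behind flagging such items can be illustrated with the classical upper-lower discrimination index: proportion correct among the top scorers minus proportion correct among the bottom scorers. This is a hedged sketch, not the thesis's actual computation; the function name, the 27% grouping fraction and the data are all invented for illustration:

```python
def discrimination_index(totals, item_scores, frac=0.27):
    """Upper-lower discrimination index for one item: proportion correct
    in the top scorers minus proportion correct in the bottom scorers,
    using 0/1 item scores and total test scores for ranking."""
    n = len(totals)
    order = sorted(range(n), key=lambda i: totals[i])   # ascending by total score
    k = max(1, round(frac * n))
    lower, upper = order[:k], order[-k:]
    return (sum(item_scores[i] for i in upper) -
            sum(item_scores[i] for i in lower)) / k

totals = [44, 40, 35, 20, 12, 8]      # hypothetical total test scores
good = [1, 1, 1, 0, 0, 0]             # correct responses track overall ability
miskeyed = [0, 0, 0, 1, 1, 1]         # the 'key' rewards the weakest candidates

print(discrimination_index(totals, good))      # 1.0
print(discrimination_index(totals, miskeyed))  # -1.0
```

A negative index, as for the miskeyed item above, is exactly the pattern reported for Items 34 and 38: strong candidates are systematically marked wrong.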
The overall internal consistency of the test is moderate. Its coefficient alpha of 0.78
suggests that several items need further inspection; the many low-discriminating items
identified appear to contribute to this moderate reliability index. Given the test's
significance for pre-degree students seeking entry to undergraduate programmes in
public universities, a reliability estimate of at least 0.85, and preferably 0.90, would
be appropriate. This is supported by Nunnally and Bernstein (1994), who recommended a
reliability estimate of 0.90 or even 0.95 for tests used to make decisions about
individuals.
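Coefficient alpha follows the standard formula alpha = k/(k-1) x (1 - sum of item variances / variance of total scores), and the link to low-discriminating items is direct: items that do not covary with the rest of the test inflate the item-variance sum relative to the total variance. A minimal sketch (illustrative; the data are invented):

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Coefficient alpha for a persons-by-items matrix of 0/1 scores."""
    k = len(scores[0])                                            # number of items
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Two items that always agree give alpha = 1.0; two unrelated items drive it to 0.
consistent = [[1, 1], [0, 0], [1, 1], [0, 0]]
unrelated = [[1, 1], [0, 0], [1, 0], [0, 1]]
print(cronbach_alpha(consistent))  # 1.0
print(cronbach_alpha(unrelated))   # 0.0
```

On a 45-item test, revising or replacing the weakly discriminating items is therefore the natural route from 0.78 towards the 0.85-0.90 range.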
The Rasch analysis of fit statistics reveals many misfitting items in the test, a signal
of poor item construction. The miskeyed and misleading distractors of Items 34 and 38,
together with the other low-discriminating items, can be traced as sources of misfit.
However, interpretation of misfit should be treated with caution: misfitting items are
not necessarily problematic items. Keeves and Alagumalai (1999) warned that the
sensitivity of fit statistics to sample size must be taken into account when
interpreting item fit. With a sample size of 8,472, even small departures from the model
are easily detected by the fit statistics. Thus, misfitting items should not be
discarded without good reason.
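The fit statistic in question, the information-weighted (infit) mean-square, compares squared residuals with their model variances; values near 1 indicate fit, and with thousands of candidates even tiny departures become detectable once the mean-square is standardised. A minimal sketch for one item (illustrative; it assumes Rasch-model probabilities have already been estimated):

```python
def infit_mnsq(responses, probs):
    """Infit mean-square for one item: sum of squared residuals (x - P)
    divided by the sum of the binomial variances P(1 - P)."""
    residual_sq = sum((x - p) ** 2 for x, p in zip(responses, probs))
    information = sum(p * (1 - p) for p in probs)
    return residual_sq / information

# Responses matching the model's expected spread give a mean-square near 1.0.
print(infit_mnsq([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))  # 1.0
# Responses that contradict confident predictions inflate the statistic sharply.
print(infit_mnsq([0, 1], [0.9, 0.1]))                   # 9.0
```

Because the sampling variability of this statistic shrinks with sample size, a value only slightly above 1 can be flagged as significant at n = 8,472, which is why substantive review, not the flag alone, should decide an item's fate.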
From the complementary findings based on the two measurement approaches, it appears
that the IRT model supplements rather than contradicts CTT (Lord, 1980; Barnard,
1999; Zubairi & Abu Kassim, 2006).
To what extent do selected features of the test items contribute to the item difficulty in
the MUET reading test?
Characteristics of the test items have a profound effect on item difficulty in this
reading test. Three of the six predictors show a strong relationship with item
difficulty: plausibility of the distractors has the most significant relationship,
followed by structure of the options and inference level.
There are significant positive relationships between two of the independent variables
(plausibility of distractors and inference level) and item difficulty. The difficulty
level of an item is primarily determined by the plausibility of its distractors, which
independently accounts for 37 percent of the variance. The positive coefficient (.818)
implies that an item tends to be more difficult when one or more of its distractors are
plausible. This finding is in line with Rupp et al. (2001) and Drum et al. (1981), who
found that difficulty rises with the number of plausible distractors. Items with more
plausible distractors are harder to answer because the distractors share several
features with the correct answer.
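The regression logic can be illustrated with a single-predictor ordinary least squares fit, where the slope plays the role of the reported coefficient and R-squared the share of variance explained. This is a simplified sketch: the thesis used multiple regression over six coded predictors, and the data below are invented:

```python
def ols(x, y):
    """Slope, intercept and R^2 of y regressed on one predictor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Invented data: number of plausible distractors vs. item difficulty (logits).
plausible = [0, 1, 1, 2, 3]
difficulty = [-1.0, -0.2, 0.0, 0.6, 1.4]
slope, intercept, r2 = ols(plausible, difficulty)
# A positive slope means each extra plausible distractor raises predicted difficulty.
```

In the multivariate case the same idea applies predictor by predictor, with each coefficient holding the other coded item features constant.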
The inference level of the questions also influences difficulty, though the effect is
relatively small: items requiring higher-level cognitive skills tend to be more
difficult. Item 42 (one of the most difficult items), for instance, requires students to
bring in prior knowledge of language function in order to respond correctly. This result
corroborates Hamzah and Abdullah (2009) and Sarudin and Zubairy (2008), who claimed that
lack of metacognitive skills is one of the factors contributing to reading problems
among Malaysian students.
In addition to the above variables, item difficulty appears to be influenced by the
structure of the options in a negative direction (coefficient = -.406). Surprisingly,
these examinees found items with longer, more complex options easier than items whose
options consist of one to four words. One possible explanation for this unexpected trend
is strong interaction among the three significant predictors: the other two variables
(plausibility of the distractors and inference level) may exert the stronger influence
on item difficulty. The results show that items with longer, more complex options are
those requiring lower inference skills, whereas items with short, simple options require
students to use macrostructure inference and schemata to reach an answer.
An important point emerging from this pattern is the evidence of interaction among the
predictors in determining item difficulty. This is supported by the significant
regression results, which imply that the three variables, alone or in combination,
account for significant variance in item difficulty in this MUET reading test.
The exploration in this study also indicates that the other variables, length of the
passage, type of question and number of options, do not make the items more difficult as
hypothesised. Unlike earlier research (e.g. Just & Carpenter, 1992; Ozuru et al., 2008;
Rupp et al., 2001), this study does not find that longer passages produce more difficult
questions. Likewise, the explicitness or implicitness of information in a text and the
number of options are not significant factors in examinees' performance on the test.
These contradictory results might be due to an uneven distribution of coded categories
within each variable; for the length of the passage, for example, almost half of the
items fall into the category of passages with more than 35 lines.
Taken together, these findings suggest that the difficulty level of the MUET reading
test is significantly affected by variables related to question format (plausibility of
distractors and structure of the options) rather than by passage-related or
question-type variables. In other words, construct-irrelevant variance is present in
this reading test, and this may produce negative washback on the teaching and learning
of the MUET syllabus, particularly the reading component. Test developers of MUET should
therefore ensure the test covers the range of the construct to be measured and avoid
test-method effects and other contributors to construct-irrelevant variance.
Is there any differential item functioning in the MUET reading test in terms of
gender, geographical location and ethnicity?
The gender analysis finds that some test items are easier for one group than the other:
ten items exhibiting DIF tend to favour females, while the other nine advantage males. A
possible explanation for these gender variations lies in the two groups' responses to
the subject matter of the passages. The results show that items drawn from male-friendly
topics (i.e. legal issues, business and leadership) are relatively easier for males than
for females, whereas items from passages dealing with communication and travel tend to
favour females.
Such differences appear to be based purely on gender preference. This confirms the view
of previous researchers (e.g. O'Neill & McPeek, 1993; Dolittle & Welch, 1989) that males
and females display distinct reading preferences. The responses of examinees in this
test indicate that females are interested in humanities-related reading materials, while
males prefer topics related to law, business and governance, perhaps because they play a
dominant role in these fields.
In terms of ethnicity and geographical location, the findings reveal a large disparity
between the groups compared. Many items that are difficult for Bumiputera candidates and
Sabahans are easy for the other groups, and vice versa. This significant discrepancy is
likewise attributed to real differences between the groups: the composition of the
population in each state plays an important role in shaping students' responses to an
item. Many items are easy for non-Bumiputera candidates, who mainly come from Kuala
Lumpur and who largely prefer to use English in their daily lives.
The results of this study replicate observations from previous projects (Elder, 1996;
Chen & Henning, 1985), which concluded that actual differences between subgroups of
examinees are a potential source of DIF. Therefore, before a 'problematic' item is
removed, it should be reviewed thoroughly by a panel of experts to illuminate the
possible causes of the significant difference.
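DIF detection of this kind can be sketched with the Mantel-Haenszel common odds ratio, which compares two groups' odds of answering an item correctly after matching candidates on total score. This is an illustrative alternative, not the thesis's Rasch-based DIF index, and the stratum counts below are invented:

```python
import math

def mantel_haenszel_delta(strata):
    """Mantel-Haenszel common odds ratio across score strata, reported on
    the ETS delta scale (-2.35 * ln of the odds ratio). Each stratum is
    (ref_correct, ref_wrong, focal_correct, focal_wrong). A delta near 0
    means no DIF; the sign shows which group the item favours."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return -2.35 * math.log(num / den)

# Equal odds in every matched stratum: no DIF, delta = 0.
print(round(mantel_haenszel_delta([(10, 10, 10, 10), (5, 5, 5, 5)]), 6))  # 0.0
# Reference group succeeds far more often at the same ability level: negative delta.
print(mantel_haenszel_delta([(15, 5, 5, 15)]) < 0)  # True
```

The key design point is the matching: only after conditioning on overall ability does a between-group gap on a single item count as DIF rather than as a real ability difference between the groups.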
5.3 IMPLICATIONS OF THE FINDINGS
The findings of this study have implications for several groups. The first is language
teachers. Item analysis can help teachers identify misconceptions in the materials that
need further explanation. The analysis of test-feature effects on reading performance
gives teachers insight into the features that most strongly influence an item's
difficulty. In addition, DIF analysis helps teachers understand the differences between
groups so that they can look for ways to narrow the gap.
The second group is test developers. An examination of individual items is clearly
fundamental to test evaluation, as it allows developers to identify problematic items
and remedy them. Moreover, the complementary nature of CTT and the Rasch model shows
that test constructors can incorporate both approaches as measurement strategies for
test design and evaluation. The evidence that test features and examinees'
characteristics affect performance on an item should prompt test makers to balance
content and item features in the test specification, so as to reduce this variation.
Findings from both analyses can also be used in test validation to clarify what the test
is measuring. The study also shows that DIF is difficult to interpret; the removal of
items flagged for DIF should therefore be considered carefully.
The last group is researchers in reading and language assessment. The findings add
support to the notion that CTT and the Rasch model are complementary approaches and
useful tools for language test development and evaluation. In addition, the test-feature
effects on item difficulty provide useful information about specific variables that can
influence students' performance on a reading test. The DIF findings also show that DIF
does not necessarily indicate that a test disadvantages a particular group,
reemphasising the standpoint that DIF is not necessarily evidence of test bias (Angoff,
1993; Camilli & Shepard, 1994; McNamara & Roever, 2006; de Ayala, 2009). The study also
suggests directions for researchers to investigate several issues, identified below.
5.4 DIRECTIONS FOR FUTURE RESEARCH
The findings of the study have shed light on directions for future research. The
following are several areas that seem ripe for further exploration:
1. Since the results are based on a limited pool of items (45) and passages (6), the
study needs to be replicated with a larger set of items and passages so that the
findings can be generalised more widely. It would also be interesting to explore the
same issues in the other components of MUET, namely listening, speaking and writing.
This would provide a clearer picture of the factors behind unsatisfactory grades on
MUET.
2. The unexpected effect of the structure of the options clearly deserves more
attention in future. Further investigation is necessary to explain the negative
effects of this variable on the difficulty level of reading items.
3. Increasing the number of variables might be useful for examining the influence of
test features on students' success in reading assessment or other types of language
test. Other important item characteristics that might be added in a future study are
correct-answer variables, stem-related variables, vocabulary-related variables and new
passage-related variables.
4. Given that pre-university students are streamed into various classes (i.e. Science,
Arts and Commerce), it would be useful to investigate the effect of students' background
discipline on reading comprehension. Presumably, examinees' responses to specific
subject matter are related to their prior knowledge, particularly the field or course
they have taken.
5. Another important area of additional research is interaction between indicators
of DIF and item characteristics as a source contributing to item difficulty. It
would be of great interest to explore if a causal link between these two facets
could be established in influencing students’ responses to an item.
5.5 CONCLUSION
This study is built primarily on the exploration of factors that affect students'
performance on a reading test. Previous research on reading English as a second language
in the Malaysian context has seldom examined item-level analysis as an explanation for
why the MUET reading test appears so challenging for Malaysian students. The current
study addresses this gap.
One salient finding to emerge from this study is the usefulness of CTT and the Rasch
model as tools for the measurement and evaluation of language tests; the two
psychometric theories complement each other. The study also concludes that
characteristics of the test items are significant factors in item difficulty:
plausibility of distractors, structure of the options and inference level strongly
influence the difficulty level of this reading test. The DIF procedure, furthermore,
provides insight into the influence of examinees' backgrounds on their success on
individual items. Real differences between gender, ethnic and state groups are seen as
the source of DIF in this test.
Conclusions derived from this study should be interpreted in light of a few constraints.
Nonetheless, even as a small, focused analysis, the research may provide useful insights
for researchers, test developers and educators about important issues to be addressed in
reading assessment.
REFERENCES:
Abdul Majid, F., M.Jelas, Z., & Azman, N. (2002). Selected Malaysian adult learners' academic reading strategies: a case study (Publication. Retrieved 15 April 2010: http:www.face.stir.ac.uk/documents/Paper61
Adams, R. J. (2005). Reliability as a measurement design effect. Studies in Educational
Evaluation, 31(2-3), 167 - 172. Adams, R. J., & Khoo, S. T. (1993). QUEST: The interactive test analysis system.
Melbourne: ACER. Adamson, H. D. (1993). Academic competence: theory and classroom practice --
preparing ESL for content courses. White Plains, NY: Longmans. Alagumalai, S., & Curtis, D. (2005). Classical test theory. In S. Alagumalai, D. D.
Curtis & N. Hungi (Eds.), Applied Rasch measurement: a book of exemplars: papers in honour of John P.Keeves (pp. 1-14). Dordrecht ; Norwell, MA Springer.
Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press. Alderson, J. C., & Urquhart, A. H. (1983). The effect of student background discipline
on comprehension: a pilot study. In A. Hughes & D. Porter (Eds.), Current development in language testing (pp. 121-128). London: Academic Press.
Allison, D. (1999). Language testing and evaluation: an introductory course.
Singapore: Singapore University Press. American Psychological Association, American Educational Research Association, &
National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, D.C.: American Educational Research Association.
Angoff, W. H. (1993). Perspective on differential item functioning methodology. In P.
W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3 - 33). New Jersey: Lawrence Erlbaum Associates, Publishers.
Athanasou, J. A., & Lamprianou, I. (2004). Reading in one's ethnic language: a study of
Greek-Australian high school students. Australian Journal of Educational & Developmental Psychology, 4, 86 - 96.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford:
Oxford University Press. Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford
University Press.
88
Baker, D. (1989). Language testing: a critical survey and practical guide. London: Edward Arnold.
Baker, E. L. (1989). Mandated tests: educational reform or quality indicator. In B. R.
Gifford (Ed.), Test policy and test performance: education, language and culture. Boston: Kluwer Academic Publishers.
Barnard, J. J. (1999). Item analysis in test construction. In G. N. Masters & J. P. Keeves
(Eds.), Advances in measurement in educational research and assessment. Amsterdam; New York: Pergamon.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: fundamental
measurement in the human sciences (2nd ed.). New Jersey: Lawrence Erlbaum Associates, Publishers.
Brown, H. D. (2004). Language assessment: principles and classroom practices. New
York: Longman. Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items (Vol. 4).
Thousand Oaks, California: SAGE Publications. Carlton, S. T., & Harris, A. M. (1989). Characteristics associated with differential item
performance on the SAT: gender and majority/minority group comparisons. Unpublished manuscript.
Carr, N. T. (2006). The factor structure of test task characteristics and examinee
performance. Language Testing, 23(3), 269-289. Carrell, P. L. (1988). Introduction: interactive approach to second language reading. In
P. L. Carrell, J. Devine & D. Eskey (Eds.), Interactive approach to second language reading. Cambridge: Cambridge University Press.
Chen, Z., & Henning, G. (1985). Linguistic and cultural bias in language proficiency
tests. Language Testing, 2(2), 155-163. Code of Fair Testing Practices in Education. (2004). Washington, DC: Joint Committee
on Testing Practices.
Collier, V. P. (1989). How long? A synthesis of research on academic achievement in a second language. TESOL Quarterly, 23(3), 509 - 531.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New
York: Holt, Rinehart and Winston. Davey, B. (1988). Factors affecting the difficulty of reading comprehension items for
successful and unsuccessful readers. Journal of Experimental Education, 56(2), 67 - 76.
89
Davey, B., & LaSasso, C. (1984). The interaction of reader and task factors in the assessment of reading comprehension. Journal of Experimental Education, 52(4), 199 - 206.
Davey, B., LaSasso, C., & Macready, G. (1983). A comparison of reading
comprehension task performance for dear and hearing readers. Journal of Speech and Hearing Research, 26, 622 - 628.
Davey, B., & Macready, G. (1985). Prerequisite relations among inference tasks for
skilled and less-skilled reader. journal of Educational Psychology, 77, 539 - 552. de Ayala, R. J. d. (2009). The theory and practice of item response theory. New York:
The Guilford Press. Devine, T. G. (1989). Teaching reading in the elementary school: from theory to
practice. Massachusetts, US: Allyn and Bacon, Inc. Dolittle, A., & Welch, C. (1989). Gender differences in performance on a college level
achievement test. Iowa City, IA: American College Testing Programme. Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: mantel-haenszel
and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35 - 66). New Jersey: Lawrence Erlbaum Associates, Publishers.
Drum, P. A., Calfee, R. C., & Cook, L. K. (1981). The effects of surface structure
variables on performance in reading comprehension test. Reading Research Quarterly, 16(14), 486-514.
Ebel, R. L. (1982). Proposed solutions to two problems of test construction. Journal of
Educational Measurement, 19, 276 - 278. Ebel, R. L., & Frisbie, D. A. (1991). Essentials of educational measurement (5th ed.).
Engelwood Cliffs, N.J: Prentice Hall. Elder, C. (1996). The effect of language background on foreign language test
performance: the case of Chinese, Italian, and modern Greek. Language Learning, 46, 233-282.
Embretson, S., & Wetzel, C. D. (1987). Component latent trait models for paragraph
comprehension. Applied Psychological Measurement, 11, 175 - 193. Fletcher, M. J. (2006). Measuring reading comprehension. Scientific study of Reading,
10, 323 - 330. Freedle, R., & Kostin, I. (1993). The prediction of TOEFL reading item difficulty:
implications for construct validity. Language Testing, 10(2), 133 - 167.
90
Freedle, R., & Kostin, I. (1994). Can multiple-choice reading tests be construct-valid? A reply to Katz, Lautenschlager, Blackburn and Harris. Psychological Science, 5, 107 - 110.
Hambleton, R. K. (1989). Principles and selected applications of item response theory.
In R. L. Linn (Ed.), Education measurement (3rd ed., pp. 147-200). New York: MacMillan Publishers.
Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item
response theory and their applications to test development. Educational Measurement: Issues and Practices, 12(3), 38 - 47.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item
response theory. Newbury Park, California: SAGE Publications. Hamzah, M. S. G., & Abdullah, S. K. (2009). Analysis on metacognitive strategies in
reading and writing among Malaysian ESL learners in four education institutions. European Journal of Sciences, 11(4), 676 - 683.
Henning, G. (1984). Advantages of latent trait measurement in language testing.
Language Testing, 1(2), 123-133. Henning, G. (1987). A guide to language testing: development, evaluation, research.
Cambridge Mass: Newberry House Publisher. Henning, G., Hudson, T., & Turner, J. (1985). Item response theory and the assumption
of unidimensionality for language tests. Language Testing, 2(2), 141-154. Ibrahim, A. H. (2005). The effect of purposeful questioning technique in reading
performance. Paper presented at the 2nd National Seminar on Second/Foreign Language Learners and Learning.
Ibrahim, A. H. (2006). The process and problems of reading. Masalah Pendidikan, 115
- 129. Jalaluddin, N. H., Awal, N. M., & Bakar, K. A. (2009). Linguistics and environment in
English language learning: towards the development of quality human capital. European Journal of Sciences, 9(4), 627 - 642.
Just, M. A., & Carpenter, P. A. (1992). A capacity theory of comprehension: individual
differences in working memory. Psychological Review, 99, 122 - 149. Kartz, S., & Lautenschlager, G. J. (1994). Answering reading comprehension questions
without passages on the SAT-I, ACT and GRE. Educational Assessment, 2, 295 - 308.
91
Kartz, S., & Lautenschlager, G. J. (2001). The contribution of passage and no-passage factors to item performance on the SAT reading task. Educational Assessment, 7(2), 165 - 176.
Kaur, S., & Thiyagarajah, R. (1999). The English reading habits of ELLs students in
University Science Malaysia. Paper presented at the 6th International Literacy and Education Research Network Conference on Learning.
Keeves, J. P., & Alagumalai, S. (1999). New approaches to measurement. In G. N.
Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 23-42). Amsterdam, New York: Pergamon.
Kirsch, I., Jong, J. d., LaFontaine, D., McQueen, J., Mendelovits, J., & Monseur, C.
(2002). Reading for change: performance and engagement across countries: result from PISA 2000. Paris: OECD.
Klapper, J. (1992). Reading in a foreign language; theoretical issues. Language
Learning, 1(5), 53 - 56. Kunnan, A. J. (1990). DIF in native language and gender groups in an ESL placement
test. TESOL Quarterly, 24, 741-746. Kunnan, A. J. (2000). Fairness and justice for all. In A. J. Kunnan (Ed.), Fairness and
validation in language assessment (pp. 1 - 14). Cambridge: Cambridge University Press.
Lord, F. M. (1980). Application of item response theory to practical testing problems.
New Jersey: Lawrence Erlbaum Associates Publishers. McKay, P. (2006). Assessing young language learners. Cambridge: Cambridge
University Press. McKenna, M. C., & Stahl, K. A. D. (2009). Assessment for reading instruction (2nd
ed.). New York: The Guilford Press. McNamara, T., & Roever, C. (2006). Language testing: the social dimension. Malden,
MA: Blackwell Publishing. McNamara, T. F. (1996). Measuring second language performance. London; New
York: Longman. McNamara, T. F. (2000). Language testing. New York: Oxford University Press. Mertler, C. A. (2007). Interpreting standardized test scores: strategies for data-driven
instructional decision making. Los Angeles: SAGE Publications.
92
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50 (9), 741-749.
Messick, S. (1996). Validity and washback in language testing. Language Testing 13
(3), 241-256. Mosenthal, P. (1996). Understanding the strategies of document literacy and their
conditions of use. Journal of Educational Psychology, 88, 314 - 332. Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York:
McGraw-Hill, Inc. O'Neill, K. A., & McPeek, W. M. (1993). Item and test characteristics that are
associated with differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 255-276). Hillsdale, NJ: Lawrence Erlbaum Associates.
Osterlind, S. J., & Everson, H. T. (2009). Differential Item Functioning (2nd ed.).
Thousand Oaks, CA: SAGE Publications. Ozuru, Y., Best, R., Bell, C., Witherspoon, A., & McNamara, D. S. (2007). Influence of
question format and text availability on the assessment of expository text comprehension. Cognition and Instruction, 25(4), 399 - 438.
Ozuru, Y., Rowe, M., O'Reilly, T., & McNamara, D. S. (2008). Where's the difficulty in
standardized reading tests: the passage or the question? Behaviour Research Methods, 40(4), 1001 - 1015.
Pae, T.-I. (2004). DIF for examinees with different academic background. Language
Testing, 21(1), 53 - 73. Pearson, P. D., & Johnson, D. D. (1978). Teaching reading comprehension. New York,
NJ: Holt, Rinehart and Winston. Perfetti, C. (1985). reading ability. New York: Oxford University Press. Perkins, K., & Miller, L. D. (1984). Comparative analyses of English as a second
language reading comprehension data: classical test theory and latent trait measurement. Language Testing, 1(1), 20-31.
Popham, W. J. (2000). Modern educational measurement: practical guidelines for
educational leaders (3rd ed.). Boston: Allyn and Bacon. Pumfrey, P. D. (1976). Reading: tests and assessment techniques. London: Hodder and
Stoughton.
93
Ramaiah, M., & Nambiar, M. K. (1993). Do undergraduates understand what they read: an investigation into the comprehension monitoring of ESL students through the use of textual anomalies. Journal of Educational Research, 15, 95 - 106.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment test.
Copenhagen: Danmarks Paedogogiske Institut. Reynolds, C. R., Livingston, R. B., & Wilson, V. (2009). Measurement and assessment
in education. Upper Saddle River, N.J: Pearson Merrill. Rupp, A. A., Garcia, P., & Jamieson, J. (2001). Combining multiple regression and
CART to understand difficulty in second language reading and listening comprehension test. International Journal of Testing, 1(3 & 4), 185 - 216.
Sarudin, I., & Zubairy, A. M. (2008). Assessment of language proficiency of university
students. Paper presented at the 34th International Association for Educational Assessment (IAEA).
Scheuneman, J. D. (1982). A posteriori analyses of biased items. In R. A. Berk (Ed.),
Handbook of methods for detecting test bias (pp. 180 - 197). Baltimore and London: The Johns Hopkins University Press.
Scheuneman, J. D., & Gerritz, K. (1990). Using differential item functioning procedures
to explore sources of item difficulty and group performance characteristics. Journal of Educational Measurement, 27(2), 109-131.
Schwartz, S. (1984). Measuring reading competence: a theoretical-prescriptive
approach. New York: Plenum Press. Sheehan, K. M., & Ginther, A. (2001). what do passage-based multiple-choice verbal
reasoning items really measure? an analysis of the cognitive skills underlying performance on the current TOEFL reading section. Paper presented at the the 2001 Annual Meeting of the National Council of Measurement in Education.
Shepard, L. A., Camilli, G., & Averill, M. (1981). Comparison of procedures for
detecting test-item bias with both internal and external ability criteria. Journal of Educational Statistics, 6, 317 - 375.
Snow, C. E. (2003). Assessment of reading comprehension. In A. P. Sweet & C. E.
Stephanou, A., Anderson, P., & Urbach, D. (2008). PAT-R Progressive Achievement
Tests in Reading: comprehension, vocabulary and spelling (4th ed.). Camberwell, Victoria: Australian Council for Educational Research.
Twist, L., & Sainsbury, M. (2009). Girl friendly? Investigating the gender gap in
national reading tests at age 11. Educational Research, 51(2), 283-297.
94
Weaver, C. A., & Kintsch, W. (1991). Expository text. In R. Barr, M. L. Kamil, P. Mosenthal & P. D. Pearson (Eds.), Handbook of reading research (Vol. 2). New York: Longman.
Woods, A., & Baker, R. (1985). Item response theory. Language Testing, 2(2), 117-140. Wright, B. D. (1999). Rasch measurement models. In G. N. Masters & J. P. Keeves
(Eds.), Advances in measurement in educational research and assessment. Amsterdan; New York: Pergamon.
Wu, M. (2010). Using item response theory as a tool in educational measurement. Unpublished book chapter. University of Melbourne.
Wu, M. L., & Adams, R. J. (2008). Properties of Rasch residual fit statistics. Unpublished paper.
Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ACER ConQuest Version 2.0. Camberwell, Victoria: ACER Press.
Zubairi, A. M., & Kassim, N. L. A. (2006). Classical and Rasch analyses of dichotomously scored reading comprehension test items. Malaysian Journal of ELT Research, 2, 1-20.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning: logistic regression modelling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defence.
APPENDICES
APPENDIX A
BAND DESCRIPTOR OF MUET
Band 6 (aggregated score 260-300): Highly proficient user
  Communicative ability: very fluent; highly appropriate use of language; hardly any grammatical errors
  Comprehension: very good understanding of language and context
  Task performance: very high ability to function in the language

Band 5 (aggregated score 220-259): Proficient user
  Communicative ability: fluent; appropriate use of language; few grammatical errors
  Comprehension: good understanding of language and context
  Task performance: high ability to function in the language

Band 4 (aggregated score 180-219): Satisfactory user
  Communicative ability: generally fluent; generally appropriate use of language; some grammatical errors
  Comprehension: satisfactory understanding of language and context
  Task performance: satisfactory ability to function in the language

Band 3 (aggregated score 140-179): Modest user
  Communicative ability: fairly fluent; fairly appropriate use of language; many grammatical errors
  Comprehension: fair understanding of language and context
  Task performance: fair ability to function in the language

Band 2 (aggregated score 100-139): Limited user
  Communicative ability: not fluent; inappropriate use of language; very frequent grammatical errors
  Comprehension: limited understanding of language and context
  Task performance: limited ability to function in the language

Band 1 (aggregated score below 100): Very limited user
  Communicative ability: hardly able to use the language
  Comprehension: very limited understanding of language and context
  Task performance: very limited ability to function in the language
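The score-to-band mapping in the table above is a simple threshold lookup on the aggregated score. A minimal sketch (the function name and the Python rendering are illustrative, not part of official MUET documentation):

```python
def muet_band(aggregated_score: int) -> int:
    """Map a MUET aggregated score (0-300) to its band (1-6),
    following the cut-offs in the band descriptor table above."""
    if aggregated_score >= 260:
        return 6  # Highly proficient user
    if aggregated_score >= 220:
        return 5  # Proficient user
    if aggregated_score >= 180:
        return 4  # Satisfactory user
    if aggregated_score >= 140:
        return 3  # Modest user
    if aggregated_score >= 100:
        return 2  # Limited user
    return 1      # Very limited user
```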
APPENDIX B
CODING SCHEME OF TEST ITEMS
Type of variable: Passage
  Predictor: Length of the passage
    More than 25 lines (code 1)
    More than 35 lines (code 2)
    More than 45 lines (code 3)

Type of variable: Question type
  Predictor: Type of question
    Retrieving directly stated information (RI)
    Interpreting explicit information (IE)
    Interpreting by making inference (II)
    Reflecting on texts (RF)
  Characteristics of the response options
    Short lists of 1-4 words (code 1)
    Simple phrases (code 2)
    Complex phrases containing clauses that could stand alone as sentences (code 3)
  Predictor: Number of options
    3-option (code 1)
    4-option (code 2)
  Predictor: Plausibility of the distractors
    No response options are plausible (code 1)
    One or more options are plausible (code 2)
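For the regression analysis described in the thesis, each item's features must be translated into numeric predictor codes. A hedged sketch of that coding step, assuming the dictionary names and the 1-4 codes for question type, which the scheme above does not show, are illustrative choices rather than the thesis's actual variable names:

```python
# Coding tables following the scheme above. QUESTION_TYPE codes 1-4
# are an assumption; the coding scheme lists the four categories
# (RI, IE, II, RF) without showing their numeric codes.
PASSAGE_LENGTH = {"more_than_25_lines": 1, "more_than_35_lines": 2,
                  "more_than_45_lines": 3}
QUESTION_TYPE = {"RI": 1, "IE": 2, "II": 3, "RF": 4}
OPTION_LENGTH = {"short_list": 1, "simple_phrase": 2, "complex_phrase": 3}
NUM_OPTIONS = {3: 1, 4: 2}
DISTRACTOR_PLAUSIBILITY = {"none_plausible": 1, "one_or_more_plausible": 2}

def code_item(passage_length, question_type, option_length,
              n_options, plausibility):
    """Return the numeric predictor vector for one test item."""
    return [
        PASSAGE_LENGTH[passage_length],
        QUESTION_TYPE[question_type],
        OPTION_LENGTH[option_length],
        NUM_OPTIONS[n_options],
        DISTRACTOR_PLAUSIBILITY[plausibility],
    ]
```

For example, a 4-option item on a passage of more than 35 lines, asking for directly stated information via simple phrases with no plausible distractors, codes as `[2, 1, 2, 2, 1]`.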
APPENDIX C
CONQUEST COMMAND FILES
1) Command File for CTT and Rasch Item Analysis

datafile readingmuet.txt;
format id 1-4 state 6-17 gender 19 ethnic 23 responses 27-71;
key BCBAABCABCBCACCABBBABCACBCABBCCCABBCDDACBDDAC!1;
group gender;
model item;
estimate;
show !estimate=latent >> readingmuet2.shw;
itanal >> readingmuet2.itn;

2) Command File for DIF Gender

datafile readingmuet.txt;
format id 1-4 state 6-17 gender 19 ethnic 23 responses 27-71;
key BCBAABCABCBCACCABBBABCACBCABBCCCABBCDDACBDDAC!1;
keepcases L,P!gender;
model gender + item + gender*item;
estimate;
show !estimate=latent >> DIFreadingmuet3.shw;
itanal >> DIFreadingmuet3.itn;
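The `itanal` statement in these command files produces the classical item statistics used in the CTT analysis: item facility (proportion correct) and item discrimination (correlation of the item score with the total score). A hedged sketch of those two statistics computed directly from keyed responses; the toy data and function name are invented for illustration, not drawn from the MUET data set:

```python
from math import sqrt

def item_stats(responses, key):
    """Classical item statistics from multiple-choice responses.

    responses: one answer string per examinee (same length as key).
    key: the answer-key string.
    Returns, per item, (facility, point-biserial discrimination).
    """
    scored = [[1 if r[i] == key[i] else 0 for i in range(len(key))]
              for r in responses]
    totals = [sum(row) for row in scored]
    n = len(responses)
    mean_t = sum(totals) / n
    sd_t = sqrt(sum((t - mean_t) ** 2 for t in totals) / n)
    stats = []
    for i in range(len(key)):
        item = [row[i] for row in scored]
        p = sum(item) / n  # facility: proportion answering correctly
        if sd_t == 0 or p in (0.0, 1.0):
            r_pb = 0.0  # discrimination undefined; report 0 by convention
        else:
            m1 = sum(t for t, s in zip(totals, item) if s) / sum(item)
            m0 = sum(t for t, s in zip(totals, item) if not s) / (n - sum(item))
            r_pb = (m1 - m0) / sd_t * sqrt(p * (1 - p))
        stats.append((p, r_pb))
    return stats
```

On the real data, the key would be the 45-character string in the command files above and each examinee's responses would come from columns 27-71 of `readingmuet.txt`.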
Minerva Access is the Institutional Repository of The University of Melbourne
Author/s:
Yusup, Rusilah Binti
Title:
Item evaluation of the reading test of the Malaysian University English Test (MUET)
Date:
2012
Citation:
Yusup, R. B. (2012). Item evaluation of the reading test of the Malaysian University English
Test (MUET). Masters by Coursework & Shorter thesis, Melbourne Graduate School of
Education, The University of Melbourne.
Persistent Link:
http://hdl.handle.net/11343/37608
File Description:
Item evaluation of the reading test of the Malaysian University English Test (MUET)
Terms and Conditions:
Terms and Conditions: Copyright in works deposited in Minerva Access is retained by the
copyright owner. The work may not be altered without permission from the copyright owner.
Readers may only download, print and save electronic copies of whole works for their own
personal non-commercial use. Any use that exceeds these limits requires permission from
the copyright owner. Attribution is essential when quoting or paraphrasing from these works.