CHAPTER 4 RESEARCH DESIGN - repository.tufs.ac.jp

CHAPTER 4 RESEARCH DESIGN

In the previous chapters, ways of defining reading ability from the perspective of test item specifications were explored. Chapter 2 examined and emphasized that, in investigating the nature of a reading test in relation to the latent structure of reading ability, the scope of the present study is the "product" of FL reading as a result of the FL reading "process". Chapter 3 then described a way in which a construct of reading ability could be defined by developing test items that elicit certain types of reading product in test takers' reading comprehension. Reading "competence" was taken to be a facet that constitutes a major part of reading "performance", and in defining the reading construct for the purpose of reading test item development, it was proposed that, although a test item is defined as a tool which elicits a reading performance, that performance should be accepted as something that allows testers to draw inferences and make generalizations about what sort of reading activities the test taker might be able to do. Furthermore, this should be considered analytically, as an interaction of the test taker's competence and the context, rather than as something holistic and content-representative. Continuing along the same lines, the significance of specifying the components of a test item, "question types" in particular, in operationalizing the reading construct to be tested was discussed. This was further explored by reflecting on item difficulty, a quantitative aspect of a test item. The discussion concluded by suggesting a possible link between the question type of a test item and its difficulty, which leads to the following research questions for the present research.

4.1 Research questions

Research Question 1:

Is it valid to employ 'question types' as a prime component that constructs test items used in eliciting test takers' L2 reading performances?

東京外国語大学博士学位論文 Doctoral thesis (Tokyo University of Foreign Studies)

What are the factors that constitute the L2 reading performances of learners of English in secondary education in Japan, when they are extracted from factor analytic studies of reading products elicited using reading test items? Would they differ across learners with different reading abilities?

In an attempt to arrive at a test item specification that effectively operationalizes the different reading performances to be tested, and inspired by Negishi (1996) and Wada (2003), the present study proposes the 'question type' of a test item as a prime component of such a framework. At the same time, however, because Negishi (1996) and Wada (2003) did not accommodate the interactions of these constructing components with the latent reading structure of test takers, this aspect will be given much closer attention, as it is possible that the prime factors change in accordance with the test takers' reading abilities.

Research Question 2:

Is it valid to assume a certain relationship between question types and item difficulty in eliciting test takers' L2 reading performances?

Is the item difficulty of a test item, calibrated using Item Response Theory, affected by its question type? If so, how? Would this relationship differ across learners with different reading abilities?

With an interest in identifying the facets of a reading test item that would allow the writers of test items to predetermine the difficulty of a test item, the present study investigates the possibility of a link between the item difficulty of a test item and its question type. Attention will also be given to test takers of different abilities, to see whether the order of perceived difficulty across question types differs between ability groups.

4.2 Data Collection

4.2.1 Subjects

A sample of 830 learners of English from senior high school and university in Japan participated in the main part of the present study. Of these, 280 were third-year high school students and 550 were first-year undergraduate students in university.

The majority of the high school students had had five years of English education under the Course of Study provided by the Ministry of Education, Culture, Sports, Science and Technology, learning English in a foreign language environment. They were told that the test was administered to collect data on individuals' English proficiency. The students had five English classes a week; nothing was done in the classroom that would help the students prepare for the tests administered in this study.

For the university students, the circumstances were the same as for the high school students, except that most of them had learned English for six years. All of the university students majored in one foreign language other than English and were given the test early in April, immediately after they had entered university, as a placement test for the English classes that were required in the university curriculum. This was to ensure that the test takers did not have any special knowledge of English or of any other academic field that would distort the outcome of the data collection.

There was some variation among both high school and university students in how, and for how long, English had been learned (e.g. students who had overseas experience). However, the variation in the number of years spent abroad and in the intensity of English learning was so great that it was not possible to arrive at any generalizable criterion for omitting scores. Moreover, it could be assumed that such variation is an inherent factor in learners' reading ability that enables them to score high on the test, so the present author decided to disregard these factors in the process of data collection as long as they did not affect the distribution of scores too greatly.

4.2.2 Materials

Two sets of test instruments were employed in the main study.

4.2.2.1 Test Set A

Test Set A (presented in Appendix A) consists of nine passages, each followed by three multiple-choice test items (one correct option and three distracters) to be answered on the basis of the test taker's comprehension of the passage. These nine passages were selected after an item selection was carried out in the pilot study, providing 27 reading test items. The features of these nine passages are as follows:

Table 4-1 The features of passages employed in Test Set A

TEXT    Item#    R. Ease    Gr. Level    Words#
1       1-3      56.8       8.7          95
2       4-6      66.4       7.3          109
3       7-9      55.2       10           108
5       13-15    65.2       9.6          110
6       16-18    64.8       7.5          95
7       19-21    65.7       7.6          101
8       22-24    68.1       7.4          104
9       25-27    55.8       8.4          97
10      28-30    57.1       9.5          103
Mean             61.68      8.44         102.44

(Text 4, as well as Items 10, 11 and 12, are missing from the table because they were omitted after the item selection.)

All of the passages are taken from the Reading Comprehension Section (advanced level) of the Global Test of English Communication (GTEC) developed by Benesse Corporation. The present author determined GTEC to be an appropriate source of reading texts since it was designed to test the English proficiency of high-intermediate learners in senior high schools and universities in Japan, a level equivalent to that of the subjects to be tested and to what the Course of Study provided by the Ministry of Education, Culture, Sports, Science and Technology aims for.

In Table 4-1, "R. Ease" indicates the Flesch Reading Ease and "Gr. Level" indicates the Flesch-Kincaid Grade Level. Both are readability indices, a means of describing how easily written materials can be read and understood. Although they employ the same core measures (word length and sentence length) to calculate the index, they have different weighting factors, which sometimes produce inconsistent results. The Flesch Reading Ease index indicates the ease of reading a passage on a scale of zero to one hundred, zero being the most difficult and one hundred the easiest. The Flesch-Kincaid Grade Level expresses readability as a grade level in the US educational system, making it easier to judge the readability level of various books and texts. Observing these indices for the nine passages used in Test Set A, the present author assumes that the difficulty of the passages was appropriate for the subjects and for the purpose of the present research (see 4.3.1 for further explanation of how the subject groups were predetermined for the main study).
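The relation between the two indices can be illustrated with a short sketch. It uses the commonly published Flesch formulas, in which word length is measured as syllables per word; the chapter does not state which tool produced the values in Table 4-1, so the function names and the sample counts below are assumptions for illustration only.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a 100-word passage (not taken from the test sets).
ease = flesch_reading_ease(words=100, sentences=5, syllables=150)
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(ease, 2), round(grade, 2))
```

Both functions take the same three counts and differ only in their weights, which is why the two indices can rank a pair of passages differently.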

The number of words in each passage was counted so as to control the characteristics of each passage. The present author selected passages of around 100 words, considering the time constraints of the testing environment. The numbers at the bottom of the table indicate the mean of each index.

As for the three multiple-choice test items to be answered after reading each passage, the present author wrote the questions and the four options. The classification of each item by question type (see 3.4.2 for detailed explanations) was checked by two of her colleagues (both teachers at a senior high school), and their assessments showed a sufficient correlation of .76. Items on which disagreements were found were discussed and revised until all three people (the two colleagues and the present author) were satisfied with the decision.

For each passage, the first item was written so that the question elicits a "global-inferential" comprehension of the passage. These were the items numbered 1, 4, 7, 13, 16, 19, 22, 25, and 28, and they asked for the main idea of the passage. For example, item 1 of Test Set A ("1. What is the main idea of this passage?") can be answered correctly if a test taker comprehends that the main idea of the passage is the growing seam in the seafloor of the Atlantic Ocean. The wording and phrases used in each question may vary, but all nine questions (items 1, 4, 7, 13, 16, 19, 22, 25, and 28) are designed to elicit the "global-inferential" type of reading.

The second item was written so that the question asks for a "local-literal" comprehension. These were the items numbered 2, 5, 8, 14, 17, 20, 23, 26, and 29, and they asked for information which is directly interpreted from a relatively small amount of text. For the first passage in Test Set A, item 2 is such a test item. Item 2 requires a test taker to complete the sentence, "2. The speed at which the seafloor is spreading is ___." The correct option, "(C) half as fast as human fingernails grow," can be chosen if the test taker can locate and understand the last sentence in the passage, "This spreading occurs in half of a speed of how fast fingernails grow," as it is, without any further inference from the text.

The last item was composed so that the question provokes a "local-inferential" understanding of the passage. These were items 3, 6, 9, 15, 18, 21, 27, 30, and they called for information which could be obtained after making an inference from a relatively small amount of text. For the first passage in Test Set A, item 3 ("3. The break-off of Pangaea started because...") requires this type of comprehension and asks for the cause of the growing seam in the seafloor of the Atlantic Ocean. In order to choose the correct option, "(B) a plate started to develop underwater and the land was separated," a test taker needs to understand the sentence, "Since that time, the Atlantic Ocean has widened along a hot, rock-producing seam in the seafloor," and infer that the 'rock-producing seam' is the cause of the break-off of Pangaea.

The three questions for each passage were ordered so that the global-inferential question would come first, the local-literal question second, and the local-inferential question third. The present author chose this order because it is the order in which such questions seem to appear in the reading sections of common standardized proficiency tests, such as TOEFL or TOEIC.
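Because the question type of an item follows from its position within a passage, the mapping from item number to question type can be sketched as a simple rule. The names below are hypothetical and serve only to illustrate the ordering described above; under this positional rule, item 24 falls in the third slot as well.

```python
QUESTION_TYPES = ("global-inferential", "local-literal", "local-inferential")


def question_type(item_number: int) -> str:
    """Return the question type implied by an item's position in its passage."""
    return QUESTION_TYPES[(item_number - 1) % 3]


# Items retained in Test Set A (items 10-12 were dropped together with Text 4).
retained = [n for n in range(1, 31) if n not in (10, 11, 12)]

# Group the retained item numbers by their question type.
by_type = {
    qt: [n for n in retained if question_type(n) == qt] for qt in QUESTION_TYPES
}
print(by_type["global-inferential"])
```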

As for the time allocated to this test, because one class period in senior high schools is usually 50 minutes, 50 minutes was the maximum length of time allowed to implement Test Set A. Ideally, sufficient time should be given to the test takers, since the focus of the present study is on the test takers' 'power' rather than their 'speed'. Therefore, special attention was given so that the test takers would be able to complete the test set within the time allocated.

Prior to the test implementation for the main study, a pilot test was carried out in order to validate the test items developed by the procedures described above. The subjects were 143 students from a senior high school considered to be of an academic level equivalent to that of the high school at which Test Set A was implemented in the main study.

The main interest in carrying out the pilot test was to find and edit the test items that exhibited problems with their item discrimination indices. Item discrimination is "the capacity of test items to differentiate among candidates possessing more or less of the trait that the test is designed to measure" (Davies et al. 1999: 96). In developing a test instrument, it is essential that the test items have high levels of item discriminability to ensure a reliable measurement of test takers' ability. Items with a low item discrimination index are usually eliminated from a test or edited. In the present study, item discriminability was calculated using classical test theory (point-biserial correlation calculated by ITEMAN) because of the small number of subjects and items.

In Table 4-2, "PBs" indicates the point-biserial correlation, and "PC" indicates the percentage of test takers who correctly answered each item, expressed as a proportion. The point-biserial correlation is used to indicate how well an item discriminates test takers who are more capable from those who are less capable. Point-biserial correlations of .25 and above are often regarded as acceptable (Henning 1987: 53), and most of the items surpassed this criterion. Percentage correct is used to show how easy (or difficult) a test item is, because the higher (lower) the percentage of people who correctly answered a test item, the easier (more difficult) the test takers found it.
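The two indices can be sketched in a few lines. The sketch computes the point-biserial as the ordinary Pearson correlation between a dichotomous item score and the total score, without removing the item from the total; whether ITEMAN applies such a correction is not stated in this chapter, so that choice, the function names, and the sample data are assumptions.

```python
from math import sqrt


def facility(item: list[int]) -> float:
    """Proportion of test takers who answered the item correctly (PC)."""
    return sum(item) / len(item)


def point_biserial(item: list[int], totals: list[float]) -> float:
    """Pearson correlation between a 0/1 item score and the total score (PBs).

    Undefined (division by zero) if every test taker has the same item score
    or the same total score.
    """
    n = len(item)
    mean_x = sum(item) / n
    mean_y = sum(totals) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item, totals))
    var_x = sum((x - mean_x) ** 2 for x in item)
    var_y = sum((y - mean_y) ** 2 for y in totals)
    return cov / sqrt(var_x * var_y)


# Hypothetical responses of four test takers (not data from the pilot study).
item = [1, 1, 0, 0]
totals = [4, 3, 2, 1]
print(facility(item), round(point_biserial(item, totals), 3))
```

In this toy example the two test takers with the highest totals are the ones who answered the item correctly, so the item discriminates well; an item answered correctly mainly by low scorers would yield a negative value, as items 10 and 11 do in Table 4-2.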

As is apparent, items 10, 11, and 12 were considered to be problematic because they show negative or very low discrimination. These items belonged to the same passage, so it could be presumed that the passage itself was problematic for this level of test takers. For this reason, the present author decided it best to eliminate all three items along with the passage. Items 1, 2, 3, 9 and 16 also had low discriminability, so the present author reviewed and revised each item. Test Set A presented in Appendix A is the final version of these items after the revision. (The item numbers were left as they were when the test set was implemented in the main study, and this was announced orally to test takers by the proctors.)

Table 4-2 The discrimination indices of test items in the pilot version of Test Set A

ITEM#    PBs      PC
1        0.05     0.46
2        0.13     0.43
3        0.21     0.33
4        0.51     0.80
5        0.49     0.70
6        0.39     0.40
7        0.35     0.76
8        0.42     0.64
9        0.01     0.13
10       -0.02    0.19
11       -0.10    0.12
12       0.18     0.38
13       0.41     0.47
14       0.56     0.36
15       0.51     0.69
16       0.13     0.40
17       0.49     0.43
18       0.49     0.48
19       0.42     0.54
20       0.42     0.57
21       0.39     0.28
22       0.51     0.62
23       0.58     0.45
24       0.44     0.53
25       0.51     0.62
26       0.54     0.43
27       0.38     0.68
28       0.47     0.42
29       0.56     0.47
30       0.51     0.42

In order to compare the reading abilities of the test takers who took this test set and, more importantly, to observe changes in latent ability structure among test takers with different reading abilities, items 1, 2, and 3 reappear in Test Set B as items 1, 2, and 3; items 7, 8, and 9 as items 4, 5, and 6; and items 10, 11, and 12 as items 7, 8, and 9. However, as stated above, because items 10, 11, and 12 were omitted from Test Set A, items 7, 8, and 9 had to be omitted from Test Set B as well.

As for the time allocated for the completion of the test, the teachers who had proctored the pilot study reported that most of the test takers appeared to have reached the last item of the test, which indicates that 50 minutes was sufficient time for the test takers in the present study.

4.2.2.2 Test Set B

Test Set B is presented in Appendix B. In total, there are 27 test items in the test set; nine passages are provided, each with three multiple-choice test items to test the test takers' comprehension. Each item has one correct option and three distracters. These nine passages were selected after an item selection was carried out in the pilot study. The features of these nine passages are presented in Table 4-3.

Table 4-3 The features of passages employed in Test Set B

TEXT    Item#    R. Ease    Gr. Level    Words#
1       1-3      56.8       8.7          95
2       4-6      55.2       10           108
4       10-12    34.1       12           157
5       13-15    35.3       12           142
6       16-18    34.8       12           160
7       19-21    37.7       12           160
8       22-24    38.9       12           152
9       25-27    33.4       12           155
10      28-30    33.6       12           151
Mean             40         11.4         142.22

(Text 3, as well as Items 7, 8 and 9, are missing from the table because they were omitted after the item selection.)

Text 1 is the same passage as Text 1 in Test Set A, Text 2 is the same passage as Text 3 in Test Set A, and Text 3 is the same passage as Text 4 in Test Set A. This was done to compare the reading abilities of the test takers who took Test Set B with those of the test takers who took Test Set A and, in particular, to see whether any change would emerge in test takers' latent ability structure across different ability groups. The rest of the passages were taken from the Reading Comprehension Section of the TOEFL Test Preparation Kit Workbook (ETS 1998). The present author determined the TOEFL test preparation material to be an appropriate source of reading passages because TOEFL was designed to test the English proficiency of students seeking to study at an undergraduate or graduate level in an English-speaking environment; the level of English proficiency required to complete the passages successfully would therefore be the same as that of advanced learners in Japan, which is equivalent to the level of the subjects to be tested by Test Set B.

In Table 4-3, "R. Ease" indicates the Flesch Reading Ease and "Gr. Level" indicates the Flesch-Kincaid Grade Level. The number of words was counted so as to control the characteristics of each passage. The present author selected passages of around 150 words for Texts 4 to 10, considering the time constraints of the testing environment. The numbers at the bottom of the table indicate the mean of each index.

As for the three multiple-choice test items to be answered after reading each passage, the present author wrote the questions and the four options. These questions and options were examined for their validity by her two colleagues. After each passage, a "global-inferential" question, a "local-literal" question, and a "local-inferential" question (see 3.4.2 for detailed explanations of 'question types') are presented in the same manner as in Test Set A. This means that, for each passage, a "global-inferential" question is the first item that comes after the passage, a "local-literal" question the second, and a "local-inferential" question the last. Therefore, items numbered 1, 4, 10, 13, 16, 19, 22, 25, and 28 are "global-inferential" questions which ask for the main idea of the passage; items numbered 2, 5, 11, 14, 17, 20, 23, 26, and 29 are "local-literal" questions which ask for information that is directly interpreted from a relatively small amount of text; and items 3, 6, 12, 15, 18, 21, 27, 30 are "local-inferential" questions which ask for information that can be obtained after making an inference from a relatively small amount of text (see 4.2.2.1 for detailed explanation and examples of how these questions were presented). The classification of each item by question type was confirmed by the two colleagues who had worked on the question types of Test Set A, and their correlation was .71. Items on which disagreements were found were discussed and revised until all three people (the two colleagues and the present author) were satisfied with the decision.

For Test Set B, the time allocated to the test was 50 minutes in order to parallel Test Set A. In writing and revising Test Set B, special attention was again given so that the test takers would be able to complete the test set within the time allocated.

Prior to the test implementation for the main study, a pilot test was carried out in order to validate the test items developed by the procedures described above. The subjects were 156 students from the same university at which Test Set B was implemented in the main study. They were of the same academic background as the subjects who participated in the main study.

The main interest in carrying out the pilot test was to find and edit the test items that exhibited problems with their item discrimination indices. As in the pilot study for Test Set A, item discriminability was calculated using classical test theory (point-biserial correlation calculated by ITEMAN) because of the small number of subjects.

In Table 4-4, "PBs" indicates the point-biserial correlation for item discriminability, and "PC" indicates the percentage of test takers who correctly answered each item, expressed as a proportion, to show item difficulty. Items 7, 8, and 9 were automatically eliminated because they were the same items as those eliminated from Test Set A (items 10, 11, and 12). The present author had originally intended to use these three items for level comparison across different subject groups but decided to discard them for this reason and also because of the time constraint expected in the testing environment. Furthermore, items 1 and 2, which show low item discrimination in Table 4-4, were revised because they were the items presented as items 1 and 2 in Test Set A and had also shown low item discrimination in the pilot test for Test Set A. The same was true for items 4 and 6, which were numbered 7 and 9 in Test Set A. Items 23 and 29 also had low discriminability, so they were reviewed and revised accordingly. Test Set B, which is presented in Appendix B, is the final version after these revisions. (The item numbers were left as they were when the test set was implemented in the main study, and this was announced orally to test takers by the proctors.)

Table 4-4 The discrimination indices of test items in the pilot version of Test Set B

ITEM#    PBs     PC
1        0.27    0.94
2        0.18    0.99
3        0.30    0.86
4        0.29    0.43
5        0.43    0.81
6        0.20    0.44
7        0.38    0.80
8        0.56    0.63
9        0.18    0.67
10       0.45    0.71
11       0.39    0.84
12       0.49    0.57
13       0.42    0.84
14       0.54    0.31
15       0.41    0.36
16       0.30    0.36
17       0.33    0.63
18       0.32    0.21
19       0.20    0.97
20       0.28    0.91
21       0.33    0.81
22       0.23    0.65
23       0.15    0.36
24       0.42    0.51
25       0.47    0.52
26       0.27    0.40
27       0.36    0.22
28       0.29    0.91
29       0.18    0.84
30       0.33    0.75

As for the time allocated for the completion of the test, the teachers who had proctored the pilot study reported that most of the test takers appeared to have reached the last item of the test, which indicates that 50 minutes was sufficient time for the test takers in the present study.

4.2.3 Test Administration

Test Set A and Test Set B were both administered in 50 minutes. Senior high school students were given Test Set A. It was implemented as a reading proficiency test in a 50-minute class period, proctored by the teachers who taught the class in the regular lessons.

For university students, the test was administered as part of a placement test for their required English classes, which consisted of a listening comprehension section and a reading comprehension section. They were given either Test Set A or Test Set B, depending on the date on which they took the test. Students who took the placement test on the first day were given a test which included Test Set A as the reading comprehension section, and those who took it on the second day, Test Set B. The scores on the reading comprehension section were not counted in the placement itself because of the difference in difficulty between the two test sets. In the first half of the testing time, students were given 50 items that tested their listening skills. In this part of the test, the time was regulated by the listening material. At the end of this section, which was announced by the listening material itself, students were told to begin the reading section. The students were given 50 minutes for the reading section. The test was proctored by the teachers who teach the required English classes.

Both high school students and university students were asked to provide their answers on mark-sheets. These mark-sheets were scored electronically by a mark-sheet scanner.

4.3 Data Analysis

4.3.1 Predetermining Ability Groups

Prior to the data analyses, three groups of different abilities were determined based on the results of the data collection above. The three groups are: Group A-Low, Group A-High, and Group B.

Group A-Low and Group B were to represent the groups of test takers who were responding to items with a difficulty equivalent to their reading ability, and Group A-High to represent the test takers who were responding to items considered to have a difficulty lower than their reading ability. In this way, the results of Group A-Low and Group A-High could be compared to investigate the differences exhibited by test takers with different reading abilities tackling test items of the same difficulty. Furthermore, the results of Group A-Low and Group B were to be compared to observe the differences presented by test takers with different reading abilities responding to test items that had a difficulty equivalent to their ability.

Here, an explanation may be necessary of what is meant by "test takers with different reading abilities responding to the test items that had the difficulty equivalent to their ability" for Group A-Low and Group B, and by "the test takers who were responding to the items that were considered to have the difficulty lower than their reading ability" for Group A-High. In Item Response Theory (IRT), the theory on which the calculation of item difficulty was based in the analyses of Section 5.3, the idea is to find the relationship between the difficulty of a test item, the ability of a test taker, and the probability of a test taker answering a test item correctly (Ohtomo 1996: 69). The difficulty of a test item is determined from its "item characteristic curve", a graph which is drawn after calibration using the logistic function. On this graph, the point at which the probability of responding correctly to the item is 0.50 (50%) indicates the ability level of a person whose probability of answering that test item correctly is 0.50, and that ability index is employed as the difficulty of the test item. Therefore, the index provided as "theta" in Appendices C-1, D-1, and E-1 indicates the ability level (from -3.0 to 3.0) of a person whose probability of responding correctly to that item is 0.50, and it also represents the difficulty of the test item. This relationship between the ability of a test taker and the difficulty of a test item allows each subject group to be characterized as having an ability that is "equivalent to" or "higher than" the difficulty level of the test items.
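The relationship between ability, difficulty, and the 0.50 point can be shown with a minimal sketch of an item characteristic curve. It assumes the one-parameter logistic (Rasch) model, which this chapter names for its item analyses; the function name and the sample values are hypothetical.

```python
from math import exp


def p_correct(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model.

    theta is the ability of the test taker and b the difficulty of the item,
    both expressed on the same logit scale.
    """
    return 1.0 / (1.0 + exp(-(theta - b)))


# Hypothetical item of difficulty 0.5 answered by three test takers.
for theta in (-0.5, 0.5, 1.5):
    print(theta, round(p_correct(theta, b=0.5), 3))
```

When ability equals difficulty the probability is exactly 0.50, which corresponds to a group whose ability is "equivalent to" the items; a test taker whose ability lies above the item difficulty has a probability above 0.50, which corresponds to the "higher than" case.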

Originally, the present author had chosen to give Test Set A to the high school students and half of the university students, so that the high school students would represent Group A-Low and the university students Group A-High. Test Set B was given to the rest of the university students to represent Group B. At this point, the author had assumed that university students would possess higher ability in English reading comprehension, since they had had an extra year of English education along with their preparatory learning experience for university entrance examinations. However, this method of predetermining the ability groups did not work for the present study because virtually no difference could be found between the scores of high school students and university students on Test Set A; the mean scores were 17.6 for the high school students and 17.9 for the university students. One possible cause is that the university students were given the reading comprehension test after they had worked on the listening comprehension section of the placement test. The cognitive load imposed on the test takers while working on the listening comprehension section could have exhausted them and impeded their performance on the reading section, producing the result above. However, when the listening test material was evaluated, it did not appear to be difficult enough to influence test takers' performance in the later section of the test. Therefore, it was presumed that there was indeed little difference in reading ability between the high school students and the university students who were given Test Set A. For this reason, the present author decided to look at the results of the test takers who worked on Test Set A as a whole, regardless of whether they were high school or university students, and to predetermine the ability groups based on their scores on Test Set A. A detailed description of how these groups were decided is presented in Chapter 5. No change was made in predetermining Group B, since the university students who worked on Test Set B had averaged 16.3, which showed that the test takers who were given Test Set B were advanced learners whose ability level matched the reading ability required to respond correctly to the test items in Test Set B.

4.3.2 Statistical Procedures

Three statistical procedures were used to analyze the data collected.

4.3.2.1 Descriptive Statistics

For each test set, the mean and standard deviation were calculated. KR20 was used to estimate the internal consistency of each test set to ensure its reliability in measuring students' reading ability. For the purpose of test validation, the facility value (percentage correct) and discrimination index (point-biserial correlation), calculated using Classical Test Theory by ITEMAN (Assessment Systems Corporation), were also provided.
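For reference, the KR20 estimate can be sketched as follows, using the standard Kuder-Richardson formula 20 for dichotomously scored items. The sketch uses the population variance of total scores; some software uses the sample variance instead, and the chapter does not state which convention ITEMAN follows, so that choice, the function name, and the sample responses are assumptions.

```python
def kr20(responses: list[list[int]]) -> float:
    """Kuder-Richardson formula 20 for a persons-by-items matrix of 0/1 scores."""
    n_persons = len(responses)
    k = len(responses[0])  # number of items

    # Variance of total scores (population form, an assumption of this sketch).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons

    # Sum over items of p * q, where p is the proportion correct and q = 1 - p.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_persons
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical responses: four test takers, three items.
sample = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(sample))
```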

4.3.2.2 Factor Analytic Studies

In an attempt to arrive at a test item specification that effectively operationalizes the different reading performances to be tested, the present study proposes that the "question type" of a test item could be a prime component of such a framework. In order to identify the components, or factors, that constitute L2 reading performances, factor analyses are conducted on the data collected with each test set. The nature of the factors generated is examined qualitatively.

Full-information factor analysis was applied in the factor analytic studies of both test sets via TESTFACT 2 (Scientific Software International). Although problems have been pointed out in using traditional factor analysis methods with binary data (i.e. items that are scored dichotomously as right or wrong), full-information factor analysis has been judged to accommodate such data (Negishi 1996; Bock 1984).

4.3.2.3 Item Analyses

To discover which facets of a reading test item would allow the writers of test items to predetermine the difficulty of a test item, the present study investigates the possibility of a link between the item difficulty of a test item and its question type. For this purpose, test items are analyzed by examining their item difficulty indices, calculated via Rasch analysis using RASCAL (Assessment Systems Corporation), in relation to question type. Other information in the final parameter estimates, as well as a raw score conversion table, an item-by-person distribution map, a test characteristic curve, and a test information curve, is provided in this section of the analysis.
