Test Theory

Fred Li, 2003

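The Road to Science

☆ The road to quantitative science: observation, experiment, measurement
• Define the psychological construct
• Decide the unit of measurement
• Construct the measurement instrument
• Basic condition: is the trait to be measured quantifiable? That is, does it have order and additivity? (See Michell, 1990)

The line of thought in measurement:
We dream before we think.
We think before we point.
We point before we count.
We count before we rank.
We rank before we define equal units.
We define equal units before we seek a natural origin.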

Test Theory: From Classical to Contemporary

Random sampling theory:

• Classical test theory (true-score test theory)
– Basic model: X = T + E (traced back to Spearman, 1904; Gulliksen, 1950)
– Basic assumption: traits are constant, and the variation in observed scores is caused by random errors.

• Generalizability theory (Cronbach et al., 1972): the true score is not unique; it depends on the measurement design, the random facets, and the sources of measurement error.

Item response theory:

• The father of modern testing: Frederic Lord (1913~2000), whose doctoral dissertation (1951) is a classic of IRT
• Basic assumptions: unidimensionality and local independence
• Basic model:

$$P_j(\theta) = c_j + \frac{1 - c_j}{1 + e^{-a_j(\theta - b_j)}}$$

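Functions of Test Theory

* Provides evaluation criteria for the test development process and for the construction of test instruments
* Provides the theoretical basis and the methods for solving practical testing problems:
– estimating reliability and validity
– conducting item analysis
– detecting biased items
– equating test scores
– deciding on a scoring method
– setting passing standards
* Provides guidance on what to keep in mind when interpreting test scores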

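Characteristics of a Good Measurement Instrument

☆ Unidimensionality
☆ Linearity, so that arithmetic operations can be performed
☆ Ordinality and conjoint additivity: for example, ICCs do not cross and measurement units are equal
☆ Sample-free item calibration
☆ Test-free person measurement
☆ Good reliability and validity
☆ Ease of use

Because raw scores typically have unequal units and are nonlinear, among other reasons, using raw scores may lead to biased measurement.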

Origins of CTT

• Classical test theory is nearly a century old; Charles Spearman (1904) laid its foundation in a paper in which he introduced the decomposition of an observed score into a true score and an error and showed how to estimate the reliability of observed scores. After more than 60 years of extension and derivation, Novick (1966) was finally able to present a complete CTT.

• Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1-18.

Test Theory: CTT (1)

• X = T + E
• X = the observed test score
• T = a hypothetical error-free true score
• E = the random error associated with a true score
• Further, items are assumed to be sampled from “universes” or “domains”. Estimation of reliability and other parameters may be made using the algebra of linear sums.
• See Nunnally (1979), pp. 190-224, and Suen (1990), pp. 27-39.

Test Theory: CTT (2)

• Observed score = True score + error score

• X = T + E

• SEM for a sample:

$$s_e = s_x\sqrt{1 - r_{xx'}} = s_x\sqrt{1 - r_{TX}^2}$$

• SEM using population parameters (note: r is for a sample while ρ is for a population; there is a tendency to use population parameters to denote reliability coefficients):

$$\mathrm{SEM} = \sigma_e = \sigma_X\sqrt{1 - \rho_{XX'}} = \sigma_X\sqrt{1 - \rho_{TX}^2}$$

• Confidence intervals (bands):

$$95\%\ \mathrm{CI} = X \pm 1.96\, s_e$$
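The SEM and confidence-band formulas can be tried out numerically. The following is a minimal Python sketch; the function names and the example scale (SD = 15, reliability = 0.91, observed score = 100) are illustrative assumptions, not values from the slides.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return sd * math.sqrt(1.0 - reliability)

def confidence_band(observed, sd, reliability, z=1.96):
    """Return (lower, upper) = observed -/+ z * SEM."""
    half_width = z * standard_error_of_measurement(sd, reliability)
    return observed - half_width, observed + half_width

# Hypothetical scale: SD = 15, reliability = 0.91, observed score = 100
sem_value = standard_error_of_measurement(15, 0.91)
low, high = confidence_band(100, sd=15, reliability=0.91)
print(round(sem_value, 2), round(low, 2), round(high, 2))
```

With these assumed values the SEM is about 4.5 score points, so the 95% band extends roughly 8.8 points on either side of the observed score.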

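Basic Assumptions of CTT: Parallel Tests

• If two tests have equal means ($\mu_A = \mu_B$) and equal variances ($\sigma_A^2 = \sigma_B^2$), the correlations between observed scores and true scores are also equal ($r_{tA} = r_{tB}$), and the error scores are unrelated ($\mathrm{Cov}(e_A, e_B) = 0$), then reliability can be estimated directly. Alternate-forms reliability and test-retest reliability are examples.

$$r_{AB} = \frac{\mathrm{Cov}(A, B)}{\sigma_x^2} = \frac{\mathrm{Cov}(t + e_A,\ t + e_B)}{\sigma_x^2} = \frac{\mathrm{Var}(t) + \mathrm{Cov}(t, e_A) + \mathrm{Cov}(t, e_B) + \mathrm{Cov}(e_A, e_B)}{\sigma_x^2} = \frac{\sigma_t^2}{\sigma_x^2}$$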

The Meaning of Parallel Tests: CTT and IRT

• In CTT, observed scores X and X′ are parallel tests if 1) $X = T + E$, 2) $E(X) = T$, 3) $\rho_{ET} = 0$, 4) $\rho_{E_1E_2} = 0$, 5) $\rho_{E_1T_2} = 0$, 6) $T = T'$, 7) $\sigma_E^2 = \sigma_{E'}^2$.

• In IRT, parallel tests are defined by 1) the same ability being measured, 2) equal information functions, 3) identical item parameters in the two forms.

Test Theory: Generalizability Theory

• True-score theory (or CTT) is largely concerned with estimating the reliability of a test, given the assumptions about error made within the model (all sources are aggregated into the error parameter E, regardless of the context in which the errors occur). Generalizability theory is simply an extension of CTT, whereby the components of error may be estimated using an ANOVA approach. Generalizability theory can thus provide many different types of reliability coefficient, depending upon which sources of error are partitioned from E. Further, there may be more than one “true score” for an individual, taking into account the context in which the score was obtained.

Test Theory: IRT

• The focus of the theory is at the item level. Rather than make assumptions about hypothetical domains of items, or the nature of “true scores”, IRT seeks to impose a response model on the pattern of responses to an item, thus permitting a quantitative estimate of the degree of fit of the data to the model. Further, response error can be partitioned precisely into that due to the “badness of fit” of items to the model (test score error or unreliability) and that due to the “badness of fit” of persons to the model (person error or unreliability).

Checking Test Dimensionality

• The models presented make a common assumption of unidimensionality
• Hattie (1985) reviewed 30 techniques
• Some propose the ratio of the 1st eigenvalue to the 2nd eigenvalue (Lord, 1980)
• Others describe how to examine the eigenvalues following Principal Axis Factoring (PAF)

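Item Difficulty and Person Ability on One Map

[Figure: items and examinees located on a common vertical scale, hardest/strongest at the top]

Items (hard to easy)    Examinees (strong to weak)
9                       A (highest ability; answered 9 items correctly)
8, 10
5
                        B (missed four items: 5, 8, 9, 10)
3
7
6, 4
2                       C (missed two items: 1, 2)
1
Easy                    Weakest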

PAF and the Scree Plot

• If the data are dichotomous, factor analyze tetrachoric correlations
– Assume a continuum underlies item responses

[Figure: scree plot showing a dominant first factor]

Examples: 3PL and SGR

• The Three Parameter Logistic model (3PL)
– For dichotomous data
– E.g., cognitive ability tests

• Samejima's Graded Response model
– For polytomous data where options are ordered along a continuum
– E.g., Likert scales

The 3PL Model

• The three parameters are:
– a = item discrimination
– b = item extremity/difficulty
– c = lower asymptote, “pseudo-guessing”
• θ = the latent trait
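To see how a, b, and c shape the response curve, the 3PL probability can be evaluated directly. This is a minimal sketch in Python, assuming the logistic form without the scaling constant D = 1.7 that some texts include; the function name and the example parameter values are hypothetical.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta: latent trait, a: discrimination, b: difficulty,
    c: lower asymptote (pseudo-guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: a = 1.0, b = 0.0, c = 0.2
for theta in (-3, 0, 3):
    print(theta, round(p_3pl(theta, a=1.0, b=0.0, c=0.2), 3))
```

At theta = b the probability is c + (1 - c)/2, halfway between the lower asymptote and 1, which is why b locates the curve and c lifts its floor.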

Effect of the “a” Parameter

[Figures: item characteristic curves]
• Small “a”: poor discrimination
• Larger “a”: better discrimination

Effect of the “b” Parameter

[Figures: item characteristic curves]
• Low “b”: “easy item”
• Higher “b”: more difficult item
• “b” is inversely related to the CTT p-value (proportion correct)

Effect of the “c” Parameter

[Figures: item characteristic curves]
• c = 0: lower asymptote at zero
• Nonzero c: “low ability” respondents may endorse the correct response

Samejima's Graded Response model

• Used for scales whose options are ordered, such as Likert-type scales
– v = response to the polytomously scored item i
– k = particular option
– a = discrimination parameter
– b = extremity parameter
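The model equation did not survive on this slide, so the following Python sketch relies on the standard logistic form of the graded response model as an assumption: the probability of responding in option k or higher is a 2PL-type curve with extremity b_k, and option probabilities are differences between adjacent boundary curves. The function name and parameter values are hypothetical.

```python
import math

def grm_option_probs(theta, a, thresholds):
    """Option probabilities under a logistic graded response model.

    thresholds: increasing extremity parameters b_1 < b_2 < ... ;
    an item with m thresholds has m + 1 ordered options.
    """
    # Boundary curves P*(k) = P(response in option k or higher);
    # the lowest option is always reached (1.0), nothing exceeds the highest (0.0).
    bounds = [1.0]
    bounds += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    bounds.append(0.0)
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]

# Hypothetical 4-option Likert-type item
probs = grm_option_probs(theta=0.0, a=2.0, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

Lowering a (for example to 0.4) flattens the option curves, which is the contrast the SGR figures illustrate.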

SGR Examples

[Figures: option response curves running from the “Low option” to the “High option”]
• Low discrimination (a = 0.4)
• Better discrimination (a = 2)

Raw Scores and Units of Measurement

• Logit measures are "equal interval" in the sense that they take item difficulty into account.

• A unit of measurement is always a process of some kind which can be repeated without modification in the different parts of the measurement continuum. (Thurstone,1931).



• An inch is an inch regardless of where you are on the ruler. A logit is a logit regardless of which test items you have taken.

• Definition of the logit: the unit of the log-odds scale,

$$\text{logit} = \ln\left(\frac{p}{q}\right)$$


• The ratio, (Probability of Success)/ (Probability of Failure) is called the "odds of success". "Log[(Probability of Success)/(Probability of Failure)]" is called log-odds. The units of measurement constructed by this theory are called "log-odds units" or "logits".
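The log-odds definition can be checked numerically. A minimal Python sketch follows; the function names are illustrative assumptions.

```python
import math

def logit(p):
    """Log-odds of success: ln(p / q) with q = 1 - p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Map a logit back to a probability of success."""
    return 1.0 / (1.0 + math.exp(-x))

for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3))
```

Probabilities of 0.1 and 0.9 sit equally far from 0 logits on opposite sides, and a probability of 0.5 corresponds to exactly 0 logits.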

Stevens's Four Basic Scales

• Nominal scales
• Ordinal scales
• Interval scales
• Ratio scales

Levels of Measurement: Nominal Scale

• Nominal: the assignment of numbers to labels or classes of objects, e.g., types of cigarettes smoked, the presence or absence of a symptom, the names of each course within the Dept. of Education.

Example:

Please indicate your current marital status.

__ Married __ Single __ Single, never married __ Widowed

Levels of Measurement: Ordinal Scale

• Ordinal: the assignment of numbers to persons or objects so that they reflect their rank ordering on a chosen attribute, e.g., the order of runners finishing in a race. Ordered ranks are monotonic in sequence.

Example:

Which one category best describes your knowledge about the assortment of services offered by your main HCP?
__ Complete knowledge of services
__ Good knowledge of services
__ Basic knowledge of services
__ Little knowledge of services
__ No knowledge of services

Levels of Measurement: Interval Scale (1)

• Interval: numbers are assigned to objects such that they satisfy the ordinal-level measurement constraints AND the differences between the numbers are constant. However, interval measures do not possess a true ZERO; therefore, a ratio formed by two values on the scale is not preserved when the units of measurement are changed.

• For example, 70°F is numerically twice 35°F, but after conversion to Celsius (21.11°C and 1.67°C) the ratio becomes 12.67.

• Supports addition and subtraction

Levels of Measurement: Interval Scale (2)

• Example:
• Approximately how many overdrawn charges on your checking account (NSF checks) has “your” bank imposed on you in the past year?
• __ None __ 1-2 __ 3-7 __ 8-15 __ 16-25 __ More than 25

Levels of Measurement: Ratio Scale (1)

• Ratio: in addition to the constraints of the ordinal and interval measurement scales, ratio measurement possesses a true ZERO value, e.g., length, time, weight, absolute temperature. A change of units of measurement does not change any ratio formed by two points on the measurement scale.

• For example, 12 kg is twice 6 kg, and after conversion to pounds (26.4555 lb and 13.2277 lb) the ratio is still two.

• Supports multiplication and division
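The contrast between the Fahrenheit example on the interval-scale slide and the kilogram example here can be reproduced in a few lines. This Python sketch uses the standard conversion formulas; the function names are illustrative assumptions.

```python
def fahrenheit_to_celsius(f):
    """Interval-scale conversion: shifts the zero point and rescales."""
    return (f - 32.0) * 5.0 / 9.0

def kilograms_to_pounds(kg):
    """Ratio-scale conversion: pure rescaling, zero stays at zero."""
    return kg * 2.20462262

# Interval scale: the ratio of two values changes with the unit
ratio_f = 70.0 / 35.0
ratio_c = fahrenheit_to_celsius(70.0) / fahrenheit_to_celsius(35.0)

# Ratio scale: the ratio of two values is unit-free
ratio_kg = 12.0 / 6.0
ratio_lb = kilograms_to_pounds(12.0) / kilograms_to_pounds(6.0)

print(round(ratio_f, 2), round(ratio_c, 2))
print(round(ratio_kg, 2), round(ratio_lb, 2))
```

The additive constant (the 32 in the temperature conversion) is what breaks ratio invariance; a conversion that only multiplies leaves ratios untouched.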

Levels of Measurement: Ratio Scale (2)

• Example:
• Please circle the number of children under 18 years of age currently living in your household.
• 0 1 2 3 4 5 6 7 (if more than 7, please specify ___.)

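Critique of Stevens

• An absolute scale: for example, a probability cannot be multiplied by an arbitrary constant to create a new scale. Probability is therefore a variable “beyond ratio”, and Stevens's classification of scales is not exhaustive.

• A cyclical measure: for example, with angles, the distance between 359° and 0° is in fact the same as the distance between 0° and 1°.

• The scale type is determined not by the properties of the data but by the question to be answered; for example, 8-cylinder, 6-cylinder, and 4-cylinder engines.

• See Velleman & Wilkinson (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65-72.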

Levels of Measurement and Statistical Methods

Scale    | Operation | Location                       | Dispersion                                   | Association                | Test
--------------------------------------------------------------------------------------
Nominal  | Equality  | Mode                           | (range?)                                     |                            | χ²
Ordinal  | <, >      | Median                         | Percentile rank (range)                      | Rank correlation           | Sign test
Interval | Distance  | Mean                           | Standard deviation                           | Product-moment correlation | F-test
Ratio    | Ratio     | Geometric mean (harmonic mean) | Percent variation (coefficient of variation) |                            |
--------------------------------------------------------------------------------------

• To obtain the mean & SD of a set of ratio scale numbers, the logarithm of each number must be calculated.

• Stevens' Classification of Scales (after Stevens, 1959, pp. 25, 27)

What Is a Test?

• A test
– A device that is used to make measurements
• Measurement
– A procedure for identifying values of quantitative variables through their numerical relationships to other values
• Unit of measurement
– A particular value of the relevant variable that is being measured. It is singled out as that value relative to which all others are to be compared (Michell, 1990, pp. 63-64).
• Assessment
– The process by which measurements are interpreted in order to provide extra information about an individual

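Types of Data Collected

• State of Being: physiological or demographic variables (e.g., age, income level)
• State of Intention: plans for future action (e.g., using surveys to ask about future intentions)
• State of Mind: people's attitudes (e.g., attitudes, beliefs)
• State of Behavior: actions that are currently observable or already on record (e.g., shopping habits)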

Philosophy of Measurement (1): Representationalism

• From Hölder (1901) through Campbell (1920) and finally Stevens (1946, 1951).

• A system of labels used to describe a property of an empirical object or event. The assignment of numbers to attributes of objects according to a rule or convention. e.g. “the assignment of numerals to objects or events according to rules” … “provided a consistent rule is followed, some measurement is achieved”.

• The problem with this is that the assignment of numbers is quite arbitrary, leading to some extraordinary results. The degree of subjectivity in assigning numbers can create great confusion in the apparent measurement being made. More formal arguments against representationalism exist within Michell (1990), pp. 28-49.

• The numerical relations among objects are not inherent properties of the objects; they result from the operation of measuring with a scale (a view originating in psychophysics).

Philosophy of Measurement (2): Empirical Realism

• From Michell (1986, 1990, 1994, 1997).
• The relations represented in measurement have an existence independent of human observations or operations. Numbers are considered to be empirical facts, not abstract entities. That is, reference to numbers in quantitative science is literal, and not merely metaphorical, as the representational theory of measurement would have it. Therefore, in order to use numbers to stand for the units of measurement of a variable (quantification), we need to first confirm the quantitative structure of our variable in order to establish that any mapping between numbers and our proposed variable units is valid.

• The numerical relations among objects are inherent properties of the objects, not human definitions (a view originating in physics).

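Types of Measurement

• Fundamental (direct) measurement of a single variable. It aims to reflect directly the quantitative structure of the variable being measured by discovering the ordering and concatenation relations among objects. Examples of such extensively measured variables are length, weight, duration (time), and electrical resistance.

• Derived (indirect) measurement, involving two or more fundamental variables. For example, physical quantities such as velocity, acceleration, force, and work are composed of the fundamental variables weight, time, and length.

• Implicit (conjoint) measurement, involving two or more ordinal or nominal variables.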

Conjoint Measurement

• Conjoint measurement is used to investigate the joint effect of a set of independent variables on an ordinal-scale-of-measurement dependent variable. The independent variables are typically nominal and sometimes interval-scaled variables. Conjoint measurement simultaneously finds a monotonic scoring of the dependent variable and numerical values for each level of each independent variable. The goal is to monotonically transform the ordinal values to equal the sum of their attribute level values. Hence, conjoint measurement is used to derive an interval variable from ordinal data.

• The conjoint analysis commonly used in marketing applies this method.

Conjoint Measurement (1)

• The function that describes the concatenation relation between two variables and a third can be deduced axiomatically from the measurements made of the outcome (the third variable) produced by combining the values of the two variables. For example, in a test containing say 20 items, where we assume that the “ability” to answer those items is a “latent” variable, the items and the amount of latent variable are “combined” to produce a third variable (the test score). This is the essence of Rasch scaling and IRT theory.

• The function that describes the concatenation relation between two variables and a third is not, however, required to be arithmetic addition. It does require that the two variables in the concatenation operation are “non-interactive” (i.e. values on each variable can be manipulated independently of each other). It enables quantitative structure to be detected via ordinal relations upon a variable. As Cliff (1992) has written … “a certain kind of mild-looking ordinal consistency among three or more variables is necessary and sufficient to define equal-interval scales”.



• It is also to be noted that conjoint measurement (usually referred to as additive conjoint measurement [ACM] because of the arithmetic addition of units in most scales) is not different from, or an alternative to, extensive measurement, but in fact, is capable of demonstrating extensive measurement of a third variable from a pair of ordinally or nominally scaled variables, given certain relations hold between them. The Rasch model is a probabilistic implementation of ACM.

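Rasch Reliability & Separation (1)

• The reliability of a measurement instrument is defined for a particular group under particular conditions, so it is not a fixed property of the instrument. The reliability formula makes this clear:

• Observed Variance = True Variance + Error Variance

In the expression above, True Variance reflects the characteristics of the sample tested, while Error Variance reflects the properties of the measurement instrument. In the Rasch model, Error Variance is the mean-square error (which can be derived from the model).

• Reliability = True Variance / Observed Variance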

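Rasch Reliability & Separation (2)

• Reliability ranges from 0 to 1. Wright (1996) uses the separation coefficient (G, ranging from 0 to ∞) to express the number of distinct performance levels that the test can clearly distinguish.

$$G = \text{Separation Ratio} = \frac{\text{True } SD}{RMSE} = \frac{\sigma_t}{\sigma_e}$$

$$\text{Reliability (KR-20 / Alpha)} = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2} = \frac{G^2}{1 + G^2}$$

$$G = \sqrt{\frac{\text{Reliability}}{1 - \text{Reliability}}}$$

• From these formulas, when G = 1, True SD = RMSE and the reliability is 0.5. This implies that half of the observed score variance is due to measurement error. Likewise, a reliability of 0.7 means that 70% of the variance is not due to error.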
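The relationship between the separation coefficient G and reliability is easy to tabulate. A minimal Python sketch, with function names chosen here for illustration:

```python
import math

def separation(true_sd, rmse):
    """G = True SD / RMSE."""
    return true_sd / rmse

def reliability_from_separation(g):
    """Reliability = G^2 / (1 + G^2)."""
    return g * g / (1.0 + g * g)

def separation_from_reliability(reliability):
    """G = sqrt(Reliability / (1 - Reliability)); requires reliability < 1."""
    return math.sqrt(reliability / (1.0 - reliability))

for g in (1, 2, 3):
    print(g, round(reliability_from_separation(g), 3))
print(round(separation_from_reliability(0.7), 3))
```

G = 1 gives a reliability of 0.5, G = 2 gives 0.8, and G = 3 gives 0.9, so equal gains in reliability near 1 correspond to increasingly large gains in separation.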

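Properties of Reliability

• A quantitative index of the consistency of examinees' test scores
• The degree to which the test items consistently measure the same ability or trait
• Reliability is not a sufficient condition for validity: a reliable test is not guaranteed to be valid
• Validity, however, does imply some degree of reliability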

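True Scores (Universe Scores)

• A test is usually regarded as a random sample from a universe (domain) of items. Because it is representative of that domain, the test score can be taken as an estimate of the examinee's true score or universe (domain) score.

• The correlation between observed scores and universe scores is the index of test reliability, and the difference between them is called measurement error.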

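Measurement Error

• The difference by which a score departs from the individual's true ability (may be positive or negative)

• Assumption 1: error scores are normally distributed with a mean of 0

• Assumption 2: the correlation between error scores and ability is zero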

Sources of Error

Random error:

1) fluctuations in the person's current mood
2) misreading or misunderstanding the questions
3) measurement of the individuals on different days or in different places

These errors may cancel out as you collect many samples.

Systematic error:

Sources of systematic error include the style of measurement, a tendency toward self-promotion, cooperative reporting, and other conceptual variables that are also being measured.

Standard Error of Measurement (SEM)

• The SEM is the standard deviation of error scores for a specific examinee under repeated independent testings with the same test or parallel tests.
– The assumption of homoscedasticity implies that these conditional distributions are the same for each universe score
– The smaller the SEM, the more reliable the test

$$\mathrm{SEM} = SD\sqrt{1 - \text{reliability coefficient}}$$

• where SD is the standard deviation of the obtained test scores

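Reliability Coefficient and Index of Reliability

• The reliability coefficient is the correlation between two sets of observed scores, or between scores on parallel tests

• The square root of a test's reliability coefficient equals the test's index of reliability

• The index of reliability is numerically larger than the reliability coefficient (whenever the coefficient lies strictly between 0 and 1)

$$\text{index of reliability} = \rho_{TX} = \sqrt{\rho_{XX'}}$$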

Test-Retest Reliability

• The correlation between the scores obtained when the same test is administered at different times
– Test-retest reliability coefficients are known as stability coefficients
– Usually higher than parallel-form reliability coefficients
– Disadvantage: requires two test administrations
– Researchers need to report the specific type of reliability coefficient for their studies

Spearman-Brown Formula

• Used to estimate how much a test's reliability will increase when the length of the test is increased by adding parallel items:

$$\rho_{\text{new}} = \frac{L\,\rho_{XX'}}{1 + (L - 1)\,\rho_{XX'}}$$

where L = the number of times longer the new test will be.

• An estimate of the SEM for different test lengths can be obtained using

$$\sigma_e \approx 0.43\sqrt{k}$$

where k = the number of items.
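Both formulas on this slide are simple enough to explore numerically. The following Python sketch uses illustrative function names and a hypothetical starting reliability of 0.60.

```python
import math

def spearman_brown(reliability, length_factor):
    """Projected reliability when the test is made length_factor times longer."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)

def approximate_sem(n_items):
    """Rule-of-thumb SEM from the slide: about 0.43 * sqrt(k)."""
    return 0.43 * math.sqrt(n_items)

# Hypothetical test with reliability 0.60, lengthened 2x and 3x
for factor in (1, 2, 3):
    print(factor, round(spearman_brown(0.60, factor), 3))
print(round(approximate_sem(25), 2))
```

Doubling a test with reliability 0.60 projects to 0.75 and tripling to about 0.82, which shows the diminishing return from adding further parallel items.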

Internal Consistency Reliability

• Examines whether all the items in a test measure the same trait or skill (e.g., the test is “consistently” measuring the same skill).

• Only requires one test administration

– Split-half: the test is split in half and the correlation between scores on each half constitutes the reliability coefficient
– Kuder-Richardson 20 (KR20): mean of all possible split-half coefficients
– Coefficient alpha: most often used for survey scales and objective assessments
– Kuder-Richardson 21 (KR21)

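Kuder-Richardson 20 Reliability

• KR20 formula:

$$KR_{20} = \frac{k}{k - 1}\left[1 - \frac{\sum pq}{\sigma_X^2}\right]$$

where k = the number of items, p = the proportion answering the item correctly, and q = 1 - p.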

Kuder-Richardson 21 Reliability

• Similar to KR20, yet easier to compute

• KR21 will always be less than KR20 unless all the items are equal in difficulty (in which case KR21 = KR20)

• Formula:

$$KR_{21} = \frac{k}{k - 1}\left[1 - \frac{\bar{X}(k - \bar{X})}{k\,\sigma_X^2}\right]$$

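Coefficient Alpha

• When the items do not all measure a single trait, alpha is a lower-bound estimate of reliability

• May be negative (if inter-item correlations are negative); its range is -∞ to 1

• If items are dichotomously scored, coefficient alpha equals the KR20 value

• Formula:

$$\alpha = \frac{k}{k - 1}\left[1 - \frac{\sum \sigma_i^2}{\sigma_X^2}\right]$$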

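α Is Negative When the Sum of Covariances Is Negative

$$
\begin{aligned}
\alpha &= \frac{k}{k-1}\left[1 - \frac{\sum_{i=1}^{k} s_i^2}{S_X^2}\right] \\
&= \frac{k}{k-1}\left[1 - \frac{\sum_{i=1}^{k} s_i^2}{\sum_{i=1}^{k} s_i^2 + 2\sum_{i=1}^{k}\sum_{j>i} r_{ij}s_is_j}\right] \\
&= \frac{k}{k-1}\left[\frac{\sum_{i=1}^{k} s_i^2 + 2\sum_{i=1}^{k}\sum_{j>i} r_{ij}s_is_j - \sum_{i=1}^{k} s_i^2}{\sum_{i=1}^{k} s_i^2 + 2\sum_{i=1}^{k}\sum_{j>i} r_{ij}s_is_j}\right] \\
&= \frac{k}{k-1}\left[\frac{2\sum_{i=1}^{k}\sum_{j>i} r_{ij}s_is_j}{\sum_{i=1}^{k} s_i^2 + 2\sum_{i=1}^{k}\sum_{j>i} r_{ij}s_is_j}\right] \\
&= \frac{k}{k-1}\cdot\frac{\text{sum of covariances}}{S_X^2}
\end{aligned}
$$

(computed from the covariance matrix)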

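Why α Can Be Negative

• Reverse-keyed items were not recoded
• The sample is too small and there are too few items
• Too many negatively correlated items (the items do not measure a single trait)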

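Standardized Item α

** Computed from the correlation matrix

$$\alpha_{\text{std}} = \frac{k}{k-1}\left[\frac{2\sum_{i=1}^{k}\sum_{j>i} r_{ij}}{k + 2\sum_{i=1}^{k}\sum_{j>i} r_{ij}}\right] = \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}}$$

where $\bar{r}$ is the mean inter-item correlation.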

Rulon's Split-Half Reliability

• Split the test into two halves and create half-test scores

• Compute the difference between half-test scores

• Compute the variances of the differences and of the total scores

• Reliability estimate:

$$\rho_{XX'} = 1 - \frac{\sigma_{\text{diff}}^2}{\sigma_X^2}$$

Inter-Rater Reliability

• Inter-rater reliability is the correlation between two or more scorers looking at the same phenomena.

• Sometimes called the coefficient of agreement; it indicates the percent of agreement between observers.

Computing Inter-Rater Reliability

Aggression     Code (Coder 1)   Code (Coder 2)
Hit boy A      1                3
Hit boy B      3                3
Hit girl A     3                2
Hit girl B     1                1

Use Cohen's kappa, the tetrachoric correlation, or the polychoric correlation to compute the agreement between the raters' judgments.
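As a worked illustration of the first option, Cohen's kappa compares the observed proportion of agreement with the agreement expected by chance from each rater's marginal code frequencies. The Python sketch below is a minimal implementation; the function name is illustrative, and pairing the slide's two columns of codes (1, 3, 3, 1 and 3, 3, 2, 1) row by row is an assumption about the original layout.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters assigning nominal codes to the same cases."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("both raters must code the same non-empty set of cases")
    n = len(rater1)
    # Observed proportion of exact agreement
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from the product of marginal proportions per code
    counts1, counts2 = Counter(rater1), Counter(rater2)
    expected = sum(counts1[c] * counts2[c] for c in set(counts1) | set(counts2)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)

coder1 = [1, 3, 3, 1]  # assumed row order: boy A, boy B, girl A, girl B
coder2 = [3, 3, 2, 1]
kappa = cohens_kappa(coder1, coder2)
print(round(kappa, 3))
```

With only four cases the estimate is very unstable; the example is meant to show the arithmetic, not a realistic sample size.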

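Parallel-Forms Reliability

• The extent to which scores on two equivalent forms, given at different times, correlate with each other.
• The correlation between scores on two different tests that are parallel forms.
• According to classical test theory, truly parallel tests should have equal true scores and equal SEMs.
• Also called alternate-forms reliability
• Drawback: two tests must be constructed and administered

Examples: GRE, SAT, GMAT, TOEFL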

Relationships Among the Types of Reliability

[Figure: diagram relating test-retest reliability, equivalent-forms reliability, reliability as internal consistency, and interrater reliability, drawn with Questionnaire 1 (Items 1-3), a second copy of Questionnaire 1 (Items 1-3), and Questionnaire 2 (Items 1-3)]

Reliability and Test Type

• Power test
– Tests in which every student has adequate time to complete the test
– Appropriate reliability indices include split-half or internal consistency measures

• Speeded test
– Tests in which not all examinees are expected to finish
– Appropriate reliability indices include Rulon's split-half for separately timed comparable halves of the test

Simulated Data

Student   Item: 1   2   3   4   5   6    1st-half score   2nd-half score   Total score
1               1   1   1   0   1   0    3 (9)            1 (1)            4 (16)
2               0   1   1   0   1   0    2 (4)            1 (1)            3 (9)
3               0   0   1   1   0   0    1 (1)            1 (1)            2 (4)
4               0   0   0   1   0   1    0 (0)            2 (4)            2 (4)
5               1   1   1   1   1   1    3 (9)            3 (9)            6 (36)
Tot                                      9 (23)           8 (16)           17 (69)

p-value         .4  .6  .8  .6  .6  .4
1 - p           .6  .4  .2  .4  .4  .6
pq              .24 .24 .16 .24 .24 .24

(Values in parentheses are squared scores.)

Exercise

• Using the simulated data:
– determine the mean and SD for the total test
– estimate reliability using
• Spearman-Brown
• Rulon split-half
• KR20
• KR21
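One way to check hand calculations for this exercise is the Python sketch below. It makes two assumptions not stated on the slide: variances are population variances (dividing by N, which matches the pq item variances in the table), and the two halves are items 1-3 and items 4-6, as in the table's half-score columns.

```python
import math

# Simulated data from the table: 5 students x 6 dichotomous items
data = [
    [1, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
]

def mean(xs):
    return sum(xs) / len(xs)

def pvar(xs):
    """Population variance (divide by N)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pcorr(xs, ys):
    """Pearson correlation using population moments."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / math.sqrt(pvar(xs) * pvar(ys))

k = len(data[0])
first = [sum(row[:3]) for row in data]    # items 1-3
second = [sum(row[3:]) for row in data]   # items 4-6
total = [a + b for a, b in zip(first, second)]

total_mean = mean(total)
total_var = pvar(total)
total_sd = math.sqrt(total_var)

# Split-half correlation stepped up with Spearman-Brown (L = 2)
r_half = pcorr(first, second)
spearman_brown = 2 * r_half / (1 + r_half)

# Rulon: 1 - variance of half-score differences / total variance
diff = [a - b for a, b in zip(first, second)]
rulon = 1 - pvar(diff) / total_var

# KR20 and KR21
p = [mean([row[j] for row in data]) for j in range(k)]
sum_pq = sum(pj * (1 - pj) for pj in p)
kr20 = k / (k - 1) * (1 - sum_pq / total_var)
kr21 = k / (k - 1) * (1 - total_mean * (k - total_mean) / (k * total_var))

print(round(total_mean, 2), round(total_sd, 3))
print(round(spearman_brown, 3), round(rulon, 3), round(kr20, 3), round(kr21, 3))
```

Under these assumptions KR21 comes out below KR20, as expected when items differ in difficulty. Using sample variances (dividing by N - 1) would change the numerical values.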

Problem 2: Compute KR20 and Coefficient Alpha

Student   Item: 1  2  3  4  5  6  7  8  9  10
1               1  1  1  1  1  1  1  1  0  1
2               1  1  1  0  0  0  0  1  0  1
3               1  1  1  0  1  0  0  0  0  0
4               0  0  0  0  0  0  0  1  0  1
5               0  0  0  0  0  0  0  0  0  1
6               1  1  1  0  1  1  1  1  1  1
7               1  1  1  1  1  0  0  1  0  1
8               0  1  0  0  0  0  0  0  0  0
9               1  1  1  1  1  0  1  1  1  1
10              0  1  0  0  0  0  0  1  1  1
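A short Python sketch can confirm, for this table, that coefficient alpha computed from item variances equals KR20 computed from p and q when items are scored 0/1. As an assumption, population variances (dividing by N) are used throughout so that each item variance equals pq.

```python
# Problem 2 data: 10 students x 10 dichotomous items
scores = [
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 0, 1, 1, 1],
]

def pvar(xs):
    """Population variance (divide by N)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

k = len(scores[0])
n = len(scores)
totals = [sum(row) for row in scores]
columns = [[row[j] for row in scores] for j in range(k)]

# Coefficient alpha from item variances
alpha = k / (k - 1) * (1 - sum(pvar(col) for col in columns) / pvar(totals))

# KR20 from item proportions correct
p = [sum(col) / n for col in columns]
kr20 = k / (k - 1) * (1 - sum(pj * (1 - pj) for pj in p) / pvar(totals))

print(round(alpha, 3), round(kr20, 3))
```

The two values coincide because the population variance of a 0/1 item is exactly p(1 - p).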

Basic Principles of Test Equating

* All forms measure the same trait
* Symmetry
* Invariance
* Equity

Are equating and scaling the same thing?
Are equating and linking the same thing?

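Steps in Test Equating

* Choose a data collection design
– The single group design
– Random groups design
– Linking item non-equivalent groups design

* Choose the type of score transformation
– Mean equating
– Linear equating
– Equipercentile equating

* Choose a statistical method to carry out the equating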

Types of Test Equating

* Horizontal equating
* Vertical equating

Methods of DIF Analysis

* Types of bias:
– Content bias
– Atmosphere bias
– Prediction bias
– Consequence bias

* Methods of DIF analysis:
– Mantel-Haenszel χ² method
– Logistic regression method

Why Use IRT to Study DIF/DTF?

• Researchers are often interested in comparing cultural, ethnic, or gender groups.

• Meaningful comparisons require that measurement equivalence holds.

• Classical test theory methods confound “bias” with true mean differences; IRT does not.

• In IRT terminology, item/test bias is referred to as DIF/DTF

Defining DIF and DTF

• DIF refers to a difference in the probability of endorsing an item for members of a reference group (e.g., US workers) and a focal group (e.g., Chinese workers), having the same standing on theta.

• DTF refers to a difference in the test characteristic curves, obtained by summing the item response functions for each group.

• DTF is perhaps more important for selection because decisions are made based on test scores, not individual item responses.
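The two definitions can be made concrete with a small Python sketch: DIF is the gap between the two groups' item response functions at the same theta, and DTF is the gap between their test characteristic curves (sums of item response functions). The 3PL form and all item parameters below are hypothetical values chosen for illustration.

```python
import math

def irf(theta, a, b, c):
    """3PL item response function (assumed model for this sketch)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(irf(theta, a, b, c) for a, b, c in items)

# Hypothetical two-item test; item 1 is harder for the focal group (uniform DIF)
reference_items = [(1.0, 0.0, 0.2), (1.2, -0.5, 0.2)]
focal_items = [(1.0, 0.5, 0.2), (1.2, -0.5, 0.2)]

for theta in (-2.0, 0.0, 2.0):
    dif_item1 = irf(theta, *reference_items[0]) - irf(theta, *focal_items[0])
    dtf = tcc(theta, reference_items) - tcc(theta, focal_items)
    print(theta, round(dif_item1, 3), round(dtf, 3))
```

Because only one item differs here, the DTF equals that item's DIF; with several DIF items in opposite directions the item-level differences could cancel at the test level.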

DIF Examples

[Figure: Uniform DIF Against Focal Group. Item response curves for the Reference and Focal groups; x-axis: Theta (-3 to 3), y-axis: Prob. of Positive Response (0.0 to 1.0). Reference group favored at all levels.]

[Figure: Nonuniform (Crossing) DIF. Same axes and groups. Focal favored at low theta; Reference favored at high theta.]

Detecting DIF/DTF

• DIF
– Parametric
  • Lord's Chi-Square
  • Likelihood Ratio Test
  • Signed and Unsigned Area Methods
– Nonparametric
  • SIBTEST
  • Mantel-Haenszel

• DTF
– Parametric
  • Raju's DFIT Method
– Nonparametric
  • SIBTEST

Lord's Chi-Square Test

$$\chi_i^2 = v_i'\,\Sigma_i^{-1}\,v_i$$

• $v_i$ is a vector of the differences in the estimated item parameters for the ith item between the focal and reference groups
• $\Sigma_i$ is the variance-covariance matrix for the differences in item parameter estimates

Lord's Chi-Square is sensitive to both uniform and nonuniform DIF.
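For a two-parameter comparison (differences in a and b), the quadratic form can be computed by hand with a 2 x 2 inverse. The Python sketch below does exactly that; the function name and the numerical inputs are hypothetical, and the covariance matrix of the differences is taken as given rather than estimated.

```python
def lord_chi_square_2x2(v, sigma):
    """Quadratic form v' * inverse(sigma) * v for a 2-element v and 2x2 sigma."""
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Explicit inverse of a 2x2 matrix
    inv = [[s22 / det, -s12 / det],
           [-s21 / det, s11 / det]]
    w0 = inv[0][0] * v[0] + inv[0][1] * v[1]
    w1 = inv[1][0] * v[0] + inv[1][1] * v[1]
    return v[0] * w0 + v[1] * w1

# Hypothetical differences in (a, b) estimates and their covariance matrix
v_i = (0.2, 0.3)
sigma_i = [[0.01, 0.0],
           [0.0, 0.04]]
chi_square = lord_chi_square_2x2(v_i, sigma_i)
print(round(chi_square, 3))
```

The statistic is referred to a chi-square distribution whose degrees of freedom equal the number of parameters compared (two in this sketch); a 3PL comparison that also included c would need a 3 x 3 inverse.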

Lord's Chi-Square Test

1. Estimate item parameters and covariances for focal and reference groups separately.

2. Obtain linking constants, A and K, for putting the focal and reference parameters on a common metric.

3. Compute Lord’s chi-square to identify DIF items using the reference and transformed focal group parameters and their covariances.

4. Once the DIF items have been identified, reequate the focal and reference group metrics using only the non-DIF items.

5. Repeat steps 2 through 4 until the same items are identified on consecutive trials.

This procedure is implemented in the program ITERLINK.

Using ITERLINK

• ITERLINK is an interactive program that performs iterative linking for the 2PL and 3PL models using Lord's Chi-Square.

• Creates three output files:
– ITERLINK.DBG
  • DIF results and linking constants across iterations
– PAIRDIF.DBG
  • Summary of DIF results
– User-named file
  • Contains transformed focal parameters

ITERLINK.DBG

---------------------------------------------------------------
ITEM   P obs.       DIF PRESENT using P<.05   Corrected P
---------------------------------------------------------------
 1     .03910006    YES                       NO
 2     .22354231    NO                        NO
 3     .55714918    NO                        NO
 4     .00030662    YES                       YES
 5     .00669607    YES                       NO
 6     .47102172    NO                        NO
 7     .00001849    YES                       YES
 8     .00527094    YES                       NO
 9     .12618701    NO                        NO
10     .02565616    YES                       NO
11     .46091057    NO                        NO
12     .05661002    NO                        NO
13     .00001317    YES                       YES
14     .74325914    NO                        NO
15     .70399638    NO                        NO
16     .00314141    YES                       NO
17     .12334338    NO                        NO
18     .33370542    NO                        NO
19     .13698496    NO                        NO
20     .22581136    NO                        NO
21     .36562514    NO                        NO
22     .10503999    NO                        NO
23     .58623106    NO                        NO
24     .21399115    NO                        NO
25     .02355756    YES                       NO
26     .51280795    NO                        NO
27     .51541161    NO                        NO
28     .00079191    YES                       YES
29     .65164401    NO                        NO
30     .89292899    NO                        NO
31     .04190196    YES                       NO
32     .28610441    NO                        NO

Bonferroni corrected p: .05 / number of items

“YES” = DIF

“NO” = no DIF
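The two flag columns can be reproduced from the listed p-values. This Python sketch applies the uncorrected .05 criterion and the Bonferroni-corrected criterion (.05 divided by the 32 items); the variable names are illustrative.

```python
# Observed p-values from the ITERLINK.DBG listing, keyed by item number
p_values = {
    1: .03910006, 2: .22354231, 3: .55714918, 4: .00030662,
    5: .00669607, 6: .47102172, 7: .00001849, 8: .00527094,
    9: .12618701, 10: .02565616, 11: .46091057, 12: .05661002,
    13: .00001317, 14: .74325914, 15: .70399638, 16: .00314141,
    17: .12334338, 18: .33370542, 19: .13698496, 20: .22581136,
    21: .36562514, 22: .10503999, 23: .58623106, 24: .21399115,
    25: .02355756, 26: .51280795, 27: .51541161, 28: .00079191,
    29: .65164401, 30: .89292899, 31: .04190196, 32: .28610441,
}

alpha = 0.05
corrected_alpha = alpha / len(p_values)  # Bonferroni: .05 / number of items

flagged_uncorrected = sorted(i for i, p in p_values.items() if p < alpha)
flagged_corrected = sorted(i for i, p in p_values.items() if p < corrected_alpha)

print(corrected_alpha)
print(flagged_uncorrected)
print(flagged_corrected)
```

Eleven items fall below .05, but only four remain flagged under the corrected criterion, which is the pattern shown in the listing's last column.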

PAIRDIF.DBG

[Screenshots of PAIRDIF.DBG output; callouts:]
• Check that the BILOG *.cov and *.3PL files were read correctly
• v (the parameter-difference vector) and Σ⁻¹
• Transformed focal parameters

Example of DTF for 50-Item Test

[Figure: DTF Against Reference Group. Test characteristic curves for the Focal and Reference groups; x-axis: Theta (-3.0 to 3.0), y-axis: Proportion Correct True Score (0.0 to 1.0). Most focal group members are expected to score about 3 points higher.]

Detecting DTF Using the DFITD4 Program

• Parametric procedure that detects DTF by comparing test characteristic curves.

• Determines whether DIF cancels or cumulates to produce DTF.
– Linking coefficients, item parameters, and thetas are required.

• Note: what we refer to as the reference group, Raju calls the focal group.

JCL File for DFITD4

[Screenshot of the JCL file; callouts:]
• FORTRAN format statements for reading item parameters in *.3pl files
• FORTRAN format statement for reading thetas in *.sco file
• “1” to include item in analysis
• # Items
• Logistic
• DTF criterion for dichotomous data
• Linking constants, A and K

Output File for DFITD4

ITEM DELETION PROCEDURE A

RUN   ITEM REMOVED   DTF      CHI-SQUARE   PROB   MEAN D    MEAN |D|
---   ------------   ------   ----------   ----   -------   --------
 1    NONE           .11780   10278.45     .000   -.29857   .31108
 2    63             .09449   12044.00     .000   -.27363   .28243
 3    64             .07466   11768.57     .000   -.24248   .25236
 4    65             .05990   14139.40     .000   -.22206   .22801
 5    3              .04637    9342.44     .000   -.18428   .19983
 6    88             .03507    8842.57     .000   -.15860   .17361
 7    99             .02569    8253.24     .000   -.13382   .14930
 8    50             .01790    8599.06     .000   -.11267   .12488
 9    67             .01252    7671.49     .000   -.09188   .10473
10    13             .00816    5725.96     .000   -.06782   .08423
11    84             .00494    4808.99     .000   -.04869   .06501

• DTF present; value > .006
• On each run, the item with the largest DIF statistic is removed
• DTF eliminated after removing 10 items

Detecting DIF/DTF Using SIBTEST

• Nonparametric method that can be used to examine individual items or groups of items
– Assumes only monotonicity
– Requires only item response data
– Works well with fairly small samples (250+)

• Several variations exist
– Original SIBTEST: uniform DIF
– Crossing SIBTEST: nonuniform DIF
– PolySIB: uniform DIF, polytomous data
– MultiSIB: uniform DIF, multiple dimensions

Discussed in Web Tutorial

Using SIBTEST

• SIBTEST consists of two executable files:
– SIBIN.EXE: interactive, creates input file
– SIBTEST.EXE: performs DIF/DTF analyses

• Choose “E” for either, “R” for reference, or “F” for focal group

• Detailed discussion of running SIBIN and SIBTEST is presented on the web

SIBTEST DIF Output

[Screenshot of SIBTEST output; callouts:]
• Magnitude of DIF
• Use Bonferroni correction
• E: either group
• MH results for comparison

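Theory and Methods of Standard Setting

* Standard setting involves determining cut scores, which are used for high-stakes decisions such as mastery versus non-mastery or pass versus fail.

Standard-setting methods:
– Examinee-centered models vs. test-centered models
– Continuum models vs. state models
– Human-judgment models vs. computerized adaptive judgment models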
