Fred Li, 2003
Test Theory
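
The Road to Science
The road to quantitative science: observation, experiment, measurement
• Define the psychological construct
• Decide on the unit of measurement
• Construct the measurement instrument
Basic condition: Is the trait to be measured quantifiable? That is, does it have order and additivity? (See Michell, 1990)
The logic of measurement:
We dream before we think. We think before we point. We point before we count. We count before we rank. We rank before we define equal units. We define equal units before we seek a natural origin.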

Test Theory: From Classical to Modern
Random sampling theory:
• Classical test theory (true-score test theory)
  – Basic model: X = T + E (traced back to Spearman, 1904; Gulliksen, 1950)
  – Basic assumption: traits are constant, and the variation in observed scores is caused by random errors.
• Generalizability theory (Cronbach et al., 1972): the true score is not unique; it depends on the measurement design, the random facets, and the sources of measurement error.
Item response theory:
• Frederic Lord (1912-2000), the father of modern test theory, whose doctoral dissertation (1951) is a classic of IRT
  – Basic assumptions: unidimensionality and local independence
  – Basic model:
$$
P_j(\theta) = c_j + \frac{1 - c_j}{1 + e^{-a_j(\theta - b_j)}}
$$
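
Functions of Test Theory
* Provides evaluation criteria for the test development process and for the construction of test instruments
* Provides the theoretical basis and methods for solving practical testing problems:
  – estimating reliability and validity
  – conducting item analysis
  – detecting biased items
  – equating test scores
  – deciding on scoring methods
  – setting passing standards, etc.
* Provides guidance on what to keep in mind when interpreting test scores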
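
Characteristics of a Good Measurement Instrument
• Unidimensionality
• Linearity, so that arithmetic operations can be carried out
• Ordinality and conjoint additivity: for example, ICCs do not cross and the units of measurement are equal
• Sample-free item calibration (item parameter estimates do not depend on the sample)
• Test-free person measurement (person ability estimates do not depend on the test)
• Good reliability and validity
• Ease of use
Raw scores usually have unequal units and are nonlinear, among other problems, so using raw scores may lead to biased measurement.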

Origins of CTT
• Classical test theory is nearly a century old; Charles Spearman (1904) laid its foundation in a paper in which he introduced the decomposition of an observed score into a true score and an error and showed how to estimate the reliability of observed scores. After more than 60 years of extension and derivation, Novick (1966) was finally able to set out a complete CTT.
• Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1-18.

Test Theory: CTT (1)
• X = T + E
• X = the observed test score
• T = a hypothetical error-free true score
• E = the random error associated with a true score
• Further, items are assumed to be sampled from “universes” or “domains”. Estimation of reliability and other parameters may be made using the algebra of linear sums.
• See Nunnally (1979), pp. 190-224, and Suen (1990), pp. 27-39.

Test Theory: CTT (2)
• Observed score = True score + error score
• X = T + E
• SEM for a sample:
$$
s_e = s_x\sqrt{1 - r_{xx'}} = s_x\sqrt{1 - r_{TX}^2}
$$
• SEM using population parameters:
$$
\mathrm{SEM} = \sigma_e = \sigma_X\sqrt{1 - \rho_{XX'}} = \sigma_X\sqrt{1 - \rho_{TX}^2}
$$
(Note: r is for a sample while ρ is for a population. There is a tendency to use population parameters to denote reliability coefficients.)
• Confidence intervals (bands):
$$
95\%\ \mathrm{CI} = X \pm 1.96\, s_e
$$
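A minimal Python sketch of the SEM and the 95% band on this slide, assuming the usual CTT form SEM = SD × √(1 − reliability). The function names and the score, SD, and reliability values are hypothetical and chosen only for illustration.

```python
import math


def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return sd * math.sqrt(1.0 - reliability)


def confidence_band(x: float, sd: float, reliability: float, z: float = 1.96):
    """Return the (lower, upper) band X +/- z * SEM around an observed score."""
    half_width = z * sem(sd, reliability)
    return x - half_width, x + half_width


# Hypothetical values: SD = 15, reliability = 0.91, observed score = 100.
low, high = confidence_band(100.0, 15.0, 0.91)
print(f"SEM = {sem(15.0, 0.91):.2f}, 95% band = [{low:.2f}, {high:.2f}]")
```

With these assumed inputs the SEM is 4.5 score points, so the band is roughly 100 ± 8.8; a more reliable test with the same SD gives a narrower band.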
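
Basic Assumption of CTT: Parallel Tests
• If two tests have equal means (μ_A = μ_B) and equal variances (σ_A² = σ_B²), their observed scores correlate equally with the true score (r_tA = r_tB), and their error scores are uncorrelated (Cov(e_A, e_B) = 0), then reliability can be estimated directly. Alternate-forms reliability and test-retest reliability are examples.
$$
r_{AB} = \frac{\mathrm{Cov}(A,B)}{\sigma_A\sigma_B}
       = \frac{\mathrm{Cov}(t+e_A,\ t+e_B)}{\sigma_x^2}
       = \frac{\mathrm{Var}(t)+\mathrm{Cov}(t,e_B)+\mathrm{Cov}(t,e_A)+\mathrm{Cov}(e_A,e_B)}{\sigma_x^2}
       = \frac{\sigma_t^2}{\sigma_x^2}
$$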
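The parallel-forms result can be checked numerically. The sketch below simulates two forms A = t + e_A and B = t + e_B with independent errors and compares their correlation with σ_t²/σ_x². The normal distributions, the sample size, and both standard deviations are assumptions made only for the illustration.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(2003)          # fixed seed so the sketch is repeatable
n = 20000                          # hypothetical number of examinees
true_sd, error_sd = 10.0, 5.0      # hypothetical sigma_t and sigma_e

t = [rng.gauss(50.0, true_sd) for _ in range(n)]
form_a = [ti + rng.gauss(0.0, error_sd) for ti in t]   # A = t + e_A
form_b = [ti + rng.gauss(0.0, error_sd) for ti in t]   # B = t + e_B

theoretical = true_sd**2 / (true_sd**2 + error_sd**2)  # sigma_t^2 / sigma_x^2
r_ab = pearson(form_a, form_b)
print(f"theoretical reliability = {theoretical:.3f}, simulated r_AB = {r_ab:.3f}")
```

Because both forms share the same true score and have equal error variance, the simulated correlation should land close to the theoretical ratio, up to sampling noise.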

The Meaning of Parallel Tests: CTT and IRT
• In CTT, observed scores X and X′ are parallel tests if 1) X = T + E, 2) E(X) = T, 3) ρ_ET = 0, 4) ρ_E1E2 = 0, 5) ρ_E1T2 = 0, 6) T = T′, 7) σ²_E = σ²_E′.
• In IRT, parallel tests are defined by 1) the same ability being measured, 2) equal information functions, 3) identical item parameters in the two forms.

Test Theory: Generalizability Theory
• True-score theory (or CTT) is largely concerned with estimating the reliability of a test, given the assumptions about error made within the model (all sources are aggregated into the error term E, regardless of the context in which the errors occur). Generalizability theory is an extension of CTT in which the components of error may be estimated using an ANOVA approach. Generalizability theory can thus provide many different types of reliability coefficient, depending upon which sources of error are partitioned from E. Further, there may be more than one “true score” for an individual, taking into account the context in which the score was obtained.

Test Theory: IRT
• The focus of the theory is at the item level. Rather than make assumptions about hypothetical domains of items, or the nature of “true scores”, IRT seeks to impose a response model on the pattern of responses to an item, thus permitting a quantitative estimate of the degree of fit of the data to the model. Further, response error can be partitioned precisely into that due to the “badness of fit” of items to the model (test score error or unreliability) and that due to the “badness of fit” of persons to the model (person error or unreliability).

Testing the Dimensionality of a Test
• The models presented make a common assumption of unidimensionality
• Hattie (1985) reviewed 30 techniques
• Some propose the ratio of the 1st eigenvalue to the 2nd eigenvalue (Lord, 1980)
• Others describe how to examine the eigenvalues following Principal Axis Factoring (PAF)

Diagram of Item Difficulty and Person Ability
Items      Examinees
9          A (most able; answered 9 items correctly)
8, 10
5
           B (missed four items: 5, 8, 9, 10)
3
7
6, 4
2          C (missed two items: 1, 2)
1
Easy       Weakest

PAF and the Scree Plot
• If the data are dichotomous, factor analyze tetrachoric correlations
  – Assume a continuum underlies item responses
[Figure: scree plot showing a dominant first factor]

Examples: 3PL and SGR
• The Three Parameter Logistic model (3PL)
  – For dichotomous data
  – E.g., cognitive ability tests
• Samejima's Graded Response model
  – For polytomous data where options are ordered along a continuum
  – E.g., Likert scales

The 3PL Model
• The three parameters are:
  – a = item discrimination
  – b = item extremity/difficulty
  – c = lower asymptote, “pseudo-guessing”
• θ = the latent trait
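A short Python sketch of the item characteristic curve these parameters describe, assuming the logistic form P(θ) = c + (1 − c) / (1 + e^(−a(θ − b))) with no scaling constant D. The function name `p_3pl` and the item parameter values are hypothetical.

```python
import math


def p_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: a = 1.2, b = 0.5, c = 0.2.
for theta in (-3.0, 0.5, 3.0):
    print(f"theta = {theta:+.1f}  P = {p_3pl(theta, 1.2, 0.5, 0.2):.3f}")
```

At θ = b the probability is halfway between c and 1, which is why b locates the item on the trait scale; a controls the slope at that point and c sets the floor for low-ability respondents.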

Effect of the “a” Parameter
[Figure: ICC with a small “a”: poor discrimination]
[Figure: ICC with a larger “a”: better discrimination]

Effect of the “b” Parameter
[Figure: ICC with a low “b”: an “easy” item]
[Figure: ICC with a higher “b”: a more difficult item]
“b” is inversely related to the CTT p-value.

Effect of the “c” Parameter
[Figure: ICC with c = 0: lower asymptote at zero]
[Figure: ICC with c > 0: “low ability” respondents may endorse the correct response]

Samejima's Graded Response Model
• Used for scales whose options are ordered, such as Likert-type scales
  – v = response to the polytomously scored item i
  – k = particular option
  – a = discrimination parameter
  – b = extremity parameter
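A minimal Python sketch of option probabilities under the graded response model, assuming its standard logistic form: each threshold b_k defines a cumulative curve P*(k) = 1 / (1 + e^(−a(θ − b_k))) for responding in option k or higher, and the probability of an option is the difference between adjacent cumulative curves. The function name and the threshold values are hypothetical.

```python
import math


def grm_probs(theta: float, a: float, thresholds: list[float]) -> list[float]:
    """Category probabilities for one item under the graded response model.

    thresholds must be increasing; m thresholds give m + 1 ordered options.
    """
    # Cumulative curves: probability of responding in option k or higher.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Each option's probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


# Hypothetical 5-option Likert item with a = 2.
probs = grm_probs(theta=0.0, a=2.0, thresholds=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

The probabilities always sum to 1. Lowering `a` flattens the option curves so that neighbouring options become harder to tell apart along θ.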

SGR Examples
[Figure: option response curves from the “low option” to the “high option”, low discrimination (a = 0.4)]
[Figure: option response curves, better discrimination (a = 2)]

Raw Scores and Units of Measurement (1)
• Logit measures are "equal interval" in the sense that they take item difficulty into account.
• A unit of measurement is always a process of some kind which can be repeated without modification in the different parts of the measurement continuum. (Thurstone,1931).

Raw Scores and Units of Measurement (2)
• An inch is an inch regardless of where you are on the ruler. A logit is a logit regardless of which test items you have taken.
• Definition of the logit: the unit of the log-odds scale
$$
\text{logit} = \ln\frac{p(\theta)}{q(\theta)}
$$

Raw Scores and Units of Measurement (3)
• The ratio, (Probability of Success)/ (Probability of Failure) is called the "odds of success". "Log[(Probability of Success)/(Probability of Failure)]" is called log-odds. The units of measurement constructed by this theory are called "log-odds units" or "logits".
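The odds and log-odds defined here can be computed directly. The following Python sketch uses hypothetical helper names `logit` and `inverse_logit` and a few illustrative probabilities.

```python
import math


def logit(p: float) -> float:
    """Log-odds of success, ln(p / (1 - p)), for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(x: float) -> float:
    """Probability of success corresponding to a log-odds value."""
    return 1.0 / (1.0 + math.exp(-x))


for p in (0.10, 0.50, 0.90):
    print(f"p = {p:.2f}  odds = {p / (1 - p):.3f}  logit = {logit(p):+.3f}")
```

A probability of 0.5 sits at 0 logits, and complementary probabilities such as 0.1 and 0.9 are equally far from it in opposite directions.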

Stevens's Four Basic Scales
• Nominal scales
• Ordinal scales
• Interval scales
• Ratio scales

Levels of Measurement: Nominal Scale
• Nominal ... the assignment of numbers to labels or classes of objects. E.g., types of cigarettes smoked, the presence or absence of a symptom, the names of each course within the Dept. of Education.
Example:
Please indicate your current marital status.
__ Married __ Single __ Single, never married __ Widowed

Levels of Measurement: Ordinal Scale
• Ordinal ... the assignment of numbers to persons or objects so that they reflect their rank ordering on a chosen attribute. E.g., the order of runners finishing in a race. Ordered ranks are monotonic in sequence.
Example: Which one category best describes your knowledge about the assortment of services offered by your main HCP?
__ Complete knowledge of services
__ Good knowledge of services
__ Basic knowledge of services
__ Little knowledge of services
__ No knowledge of services

Levels of Measurement: Interval Scale (1)
• Interval ... numbers are assigned to objects such that they satisfy the ordinal-level measurement constraints AND the differences between the numbers are constant. However, interval measures do not possess a true ZERO; therefore, a ratio formed by two values on this scale is not preserved when the units of measurement are changed.
• For example: 70°F is twice 35°F, but after conversion to 21.11°C and 1.67°C the ratio becomes 12.67.
• Supports addition and subtraction
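The temperature example can be verified in a few lines of Python. The helper name and the midpoint value 52.5°F are hypothetical additions used only to show what an interval scale does preserve.

```python
def fahrenheit_to_celsius(f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32.0) * 5.0 / 9.0


f_hi, f_lo = 70.0, 35.0
c_hi, c_lo = fahrenheit_to_celsius(f_hi), fahrenheit_to_celsius(f_lo)

# Ratios of values are not preserved by the change of units ...
print(f"Fahrenheit ratio: {f_hi / f_lo:.2f}")
print(f"Celsius ratio:    {c_hi / c_lo:.2f}")

# ... but ratios of differences are, which is what an interval scale guarantees.
f_mid = 52.5
c_mid = fahrenheit_to_celsius(f_mid)
ratio_f = (f_hi - f_mid) / (f_mid - f_lo)
ratio_c = (c_hi - c_mid) / (c_mid - c_lo)
print(f"difference ratios: {ratio_f:.2f} (F), {ratio_c:.2f} (C)")
```

The conversion shifts the zero point as well as rescaling, which is why the value ratio changes while the ratio of differences does not.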

Levels of Measurement: Interval Scale (2)
• Example:
• Approximately how many overdrawn charges on your checking account (NSF checks) has “your” bank imposed on you in the past year?
• __ None __ 1-2 __ 3-7 __ 8-15 __ 16-25 __ More than 25

Levels of Measurement: Ratio Scale (1)
• Ratio ... in addition to the constraints of the ordinal and interval measurement scales, ratio measurement possesses a true ZERO value. E.g., length, time, weight, absolute temperature. A change of units of measurement does not change any ratio formed by two points on the measurement scale.
• For example: 12 kg is twice 6 kg, and after conversion to 26.4554 lb and 13.2277 lb the ratio is still 2.
• Supports multiplication and division

Levels of Measurement: Ratio Scale (2)
• Example:
• Please circle the number of children under 18 years of age currently living in your household.
• 0 1 2 3 4 5 6 7 (if more than 7, please specify ___.)
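
Criticisms of Stevens
• An absolute scale: probability, for example, cannot be multiplied by an arbitrary constant to create a new scale. Probability is therefore a "beyond ratio" variable, and Stevens's classification of scales is not exhaustive.
• A cyclical measure: for example, the distance between an angle of 359° and one of 0° is in fact the same as 1°.
• The scale type is determined not by the properties of the data but by the question being asked. For example: 8-cylinder, 6-cylinder, and 4-cylinder engines.
• See Velleman & Wilkinson (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65-72.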

Levels of Measurement and Statistical Methods
Scale     Operation  Location                        Dispersion                                    Association                 Test
--------  ---------  ------------------------------  --------------------------------------------  --------------------------  ---------
Nominal   Equality   Mode                            (range?)                                                                  χ²
Ordinal   <, >       Median                          Percentile rank (range)                       Rank correlation            Sign test
Interval  Distance   Mean                            Standard deviation                            Product-moment correlation  F-test
Ratio     Ratio      Geometric mean (harmonic mean)  Percent variation (coefficient of variation)
• To obtain the mean & SD of a set of ratio scale numbers, the logarithm of each number must be calculated.
• Stevens' Classification of Scales (after Stevens, 1959, pp. 25, 27)

What Is a Test?
• A Test
  – A device that is used to make measurement
• Measurement
  – A procedure for identifying values of quantitative variables through their numerical relationships to other values.
• Unit of Measurement
  – A particular value of the relevant variable that is being measured. It is singled out as that value relative to which all others are to be compared (Michell, 1990, pp. 63-64).
• Assessment
  – The process by which measurements are interpreted in order to provide extra information about an individual.

Types of Data Collected
• State of Being: physiological or demographic variables, e.g., age, income level
• State of Intention: plans for future action, e.g., using surveys to ask about future intentions
• State of Mind: people's attitudes, e.g., attitudes, beliefs
• State of Behavior: actions that can currently be observed or that are already on record, e.g., shopping habits

Philosophy of Testing (1): Representationalism
• From Hölder (1901) through Campbell (1920) and finally Stevens (1946, 1951).
• A system of labels used to describe a property of an empirical object or event. The assignment of numbers to attributes of objects according to a rule or convention. e.g. “the assignment of numerals to objects or events according to rules” … “provided a consistent rule is followed, some measurement is achieved”.
• The problem with this is that the assignment of numbers is quite arbitrary, leading to some extraordinary results. The degree of subjectivity in assigning numbers can create great confusion in the apparent measurement being made. More formal arguments against representationalism exist within Michell (1990), pp. 28-49.
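• Numerical relations among objects are not inherent properties of the objects; they result from the operation of measuring with a scale (a view originating in psychophysics).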

Philosophy of Testing (2): Empirical Realism
• From Michell (1986, 1990, 1994, 1997).
• The relations represented in measurement have an existence
independent of human observations or operations. Numbers are considered to be empirical facts, not abstract entities. That is, reference to numbers in quantitative science is literal, and not merely metaphorical, as the representational theory of measurement would have it. Therefore, in order to use numbers to stand for the units of measurement of a variable (quantification), we need to first confirm the quantitative structure of our variable in order to establish that any mapping between numbers and our proposed variable units is valid.
• Numerical relations among objects are inherent properties of the objects, not human definitions (a view originating in physics).
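
Types of Measurement
• Fundamental (direct) measurement (a single variable). It aims to reflect directly the quantitative structure of the variable being measured by discovering the ordering and concatenation relations among objects. Examples of such extensively measured variables are length, weight, duration (time), and electrical resistance.
• Derived (indirect) measurement (involving two or more fundamental variables). For example, physical quantities such as velocity, acceleration, force, and work are composed of the fundamental variables weight, time, and length.
• Implicit (conjoint) measurement (involving two or more ordinal or nominal variables).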

Conjoint Measurement
• Conjoint measurement is used to investigate the joint effect of a set of independent variables on an ordinal-scale-of-measurement dependent variable. The independent variables are typically nominal and sometimes interval-scaled variables. Conjoint measurement simultaneously finds a monotonic scoring of the dependent variable and numerical values for each level of each independent variable. The goal is to monotonically transform the ordinal values to equal the sum of their attribute level values. Hence, conjoint measurement is used to derive an interval variable from ordinal data.
• Conjoint analysis, widely used in marketing, applies this method.
Conjoint Measurement(1)
• The function that describes the concatenation relation between two variables and a third can be deduced axiomatically from the measurements made of the outcome (the third variable) produced by combining the values of the two variables. For example, in a test containing say 20 items, where we assume that the “ability” to answer those items is a “latent” variable, the items and the amount of latent variable are “combined” to produce a third variable (the test score). This is the essence of Rasch scaling and IRT theory.

Conjoint Measurement(2)
• The function that describes the concatenation relation between two variables and a third is not required to be arithmetic addition. It does, however, require that the two variables in the concatenation operation are “non-interactive” (i.e., values on each variable can be manipulated independently of each other). It enables quantitative structure to be detected via ordinal relations upon a variable. As Cliff (1992) has written … “a certain kind of mild-looking ordinal consistency among three or more variables is necessary and sufficient to define equal-interval scales”.
Conjoint Measurement(3)
• It is also to be noted that conjoint measurement (usually referred to as additive conjoint measurement [ACM] because of the arithmetic addition of units in most scales) is not different from, or an alternative to, extensive measurement, but in fact, is capable of demonstrating extensive measurement of a third variable from a pair of ordinally or nominally scaled variables, given certain relations hold between them. The Rasch model is a probabilistic implementation of ACM.
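
Rasch Reliability & Separation(1)
• The reliability of an instrument is specific to a particular group under particular conditions, so it is not a fixed property of the instrument. The reliability formula makes this clear:
• Observed Variance = True Variance + Error Variance
In this expression, True Variance reflects the characteristics of the sample tested, while Error Variance reflects the characteristics of the instrument. In the Rasch model, Error Variance is the mean-square error (which can be derived from the model).
• Reliability = True Variance / Observed Variance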
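
Rasch Reliability & Separation(2)
• The reliability of an instrument lies between 0 and 1. Wright (1996) uses the separation coefficient (G, which lies between 0 and ∞) to express the number of distinct performance levels the test can clearly distinguish.
$$
\text{Separation ratio: } G = \frac{\text{True SD}}{\text{RMSE}} = \frac{\sigma_t}{\sigma_e} = \sqrt{\frac{\text{Reliability}}{1 - \text{Reliability}}}
$$
$$
\text{Reliability (KR-20 / Alpha)} = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2} = \frac{G^2}{1 + G^2}
$$
• From these formulas, when G = 1, True SD = RMSE and the reliability is 0.5, which means that as much of the variation among measured scores is due to measurement error as to true differences. When the reliability is 0.7, 70% of the variance is not due to error.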
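A small Python sketch of the conversion between reliability and the separation ratio, assuming the relations G = √(R / (1 − R)) and R = G² / (1 + G²) that follow from G = True SD / RMSE. The function names are hypothetical.

```python
import math


def separation_from_reliability(reliability: float) -> float:
    """G = sqrt(R / (1 - R)), defined for 0 <= R < 1."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must lie in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))


def reliability_from_separation(g: float) -> float:
    """R = G^2 / (1 + G^2), defined for G >= 0."""
    if g < 0.0:
        raise ValueError("separation must be non-negative")
    return g * g / (1.0 + g * g)


for r in (0.5, 0.7, 0.8, 0.9):
    print(f"reliability = {r:.1f}  ->  G = {separation_from_reliability(r):.2f}")
```

Reliability is bounded by 1, so it compresses differences among highly reliable tests; G has no upper bound and keeps growing as error shrinks.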
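
Properties of Reliability
• A quantitative index of the consistency of examinees' test scores
• The degree to which the test items consistently measure the same ability or trait
• Reliability is not a sufficient condition for validity; a reliable test is not guaranteed to be valid
• A valid test, however, must have some degree of reliability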
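
True Scores (Universe Scores)
• A test is usually regarded as a random sample from a universe (domain) of items. Because it is representative of that universe, the test score can be taken as an estimate of the examinee's true score or universe score.
• The correlation between observed scores and universe scores is the index of reliability of the test, and the difference between them is called measurement error.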
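
Measurement Error
• The difference score by which an observation departs from the individual's true ability (it may be positive or negative)
• Assumption 1: error scores are normally distributed with a mean of 0
• Assumption 2: the correlation between error scores and ability is zero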

Sources of Error
Random error:
1) fluctuations in the person's current mood
2) misreading or misunderstanding the questions
3) measurement of the individuals on different days or in different places
These errors may cancel out as you collect many samples.
Systematic error:
Sources of error include the style of measurement, a tendency toward self-promotion, cooperative reporting, and other conceptual variables being measured.

Standard Error of Measurement: SEM
• SEM is the standard deviation of error scores for a specific examinee under repeated independent testings with the same test or parallel tests.
  – the assumption of homoscedasticity implies that these conditional distributions are the same for each universe score
  – the smaller the SEM, the more reliable the test
$$
\mathrm{SEM} = SD\sqrt{1 - \text{reliability coefficient}}
$$
• where SD is the standard deviation of the obtained test scores
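
Reliability Coefficient and Index of Reliability
• The reliability coefficient is the correlation between two sets of observed scores or between scores on parallel tests
• The square root of a test's reliability coefficient equals the test's index of reliability
• The index of reliability is numerically larger than the reliability coefficient (whenever the coefficient lies strictly between 0 and 1)
$$
\text{index of reliability} = \rho_{TX} = \sqrt{\rho_{XX'}}
$$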

Test-Retest Reliability
• The correlation between scores obtained when the same test is administered at different times (test-retest reliability coefficients)
  – known as stability coefficients
  – usually higher than parallel-form reliability coefficients
  – disadvantage: requires two test administrations
  – researchers need to report the specific type of reliability coefficient for their studies

Spearman-Brown Formula
• Used to estimate how much a test's reliability will increase when the length of the test is increased by adding parallel items:
$$
\rho_{XX'}^{(L)} = \frac{L\,\rho_{XX'}}{1 + (L - 1)\,\rho_{XX'}}
$$
where L = the number of times longer the new test will be.
• An estimate of the SEM for different test lengths can be obtained using
$$
\sigma_e \approx .43\sqrt{k}
$$
where k = the number of items.
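A brief Python sketch of the Spearman-Brown projection, assuming the standard form L·ρ / (1 + (L − 1)·ρ). The function name and the starting reliability of 0.60 are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when the test is made length_factor times longer."""
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical test with reliability 0.60.
for factor in (0.5, 1, 2, 3):
    projected = spearman_brown(0.60, factor)
    print(f"L = {factor}: projected reliability = {projected:.3f}")
```

Doubling this hypothetical test raises the projected reliability from 0.60 to 0.75, and each further doubling adds less, so lengthening a test has diminishing returns. A factor below 1 projects the reliability of a shortened test.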

Internal Consistency Reliability
• Examines whether all the items in a test measure the same trait or skill (e.g., the test is “consistently” measuring the same skill).
• only requires one test administration
– Split-half - test is split in half and the correlation between scores on each half constitutes your reliability coefficient
– Kuder-Richardson 20 (KR20) - mean of all possible split-half coefficients
– Coefficient alpha - most often used for survey scales and objective assessments
– Kuder-Richardson 21 (KR21)
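
Kuder-Richardson 20 Reliability
• KR20 formula:
$$
KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{\sigma_X^2}\right)
$$
where k = number of items, p = proportion answering the item correctly, q = 1 - p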

Kuder-Richardson 21 Reliability
• Similar to KR20, yet easier to compute
• KR21 will always be less than KR20 unless all the items are equal in difficulty (in which case KR21 = KR20)
• Formula:
$$
KR_{21} = \frac{k}{k-1}\left(1 - \frac{\bar{X}\,(k - \bar{X})}{k\,\sigma_X^2}\right)
$$
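
Coefficient Alpha
• When the items do not all measure a single trait, α is a lower-bound estimate of reliability
• α can be negative (if inter-item correlations are negative); its range is -∞ to 1
• If items are dichotomously scored, coefficient alpha equals the KR20 value
• Formula:
$$
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum \sigma_i^2}{\sigma_X^2}\right)
$$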
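
α Is Negative When the Covariances Are Negative
$$
\begin{aligned}
\alpha &= \frac{k}{k-1}\left[1 - \frac{\sum_{i=1}^{k} s_i^2}{S_X^2}\right] \\
       &= \frac{k}{k-1}\left[1 - \frac{\sum_{i=1}^{k} s_i^2}{\sum_{i=1}^{k} s_i^2 + 2\sum_{i=1}^{k}\sum_{j>i} r_{ij}s_is_j}\right] \\
       &= \frac{k}{k-1}\left[\frac{\sum_{i=1}^{k} s_i^2 + 2\sum_{i=1}^{k}\sum_{j>i} r_{ij}s_is_j - \sum_{i=1}^{k} s_i^2}{\sum_{i=1}^{k} s_i^2 + 2\sum_{i=1}^{k}\sum_{j>i} r_{ij}s_is_j}\right] \\
       &= \frac{k}{k-1}\left[\frac{2\sum_{i=1}^{k}\sum_{j>i} r_{ij}s_is_j}{\sum_{i=1}^{k} s_i^2 + 2\sum_{i=1}^{k}\sum_{j>i} r_{ij}s_is_j}\right]
\end{aligned}
$$
The numerator of the last line, 2ΣΣ r_ij s_i s_j, is the sum of the covariances.
Computed from the covariance matrix.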
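
Why α Can Be Negative
• Reverse-worded items were not recoded
• The sample is too small and there are too few items
• Too many items are negatively correlated (the items do not measure a single trait)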
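
Standardized Item α
** Computed from the correlation matrix
$$
\alpha_{\text{std}} = \frac{k}{k-1}\left[1 - \frac{k}{k + 2\sum_{i=1}^{k}\sum_{j>i} r_{ij}}\right]
 = \frac{k}{k-1}\cdot\frac{2\sum_{i=1}^{k}\sum_{j>i} r_{ij}}{k + 2\sum_{i=1}^{k}\sum_{j>i} r_{ij}}
 = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}
$$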

Rulon's Split-Half Reliability
• Split the test into two halves and create half-test scores
• Compute the difference between half-test scores
• Compute the variances of the differences and of the total scores
• Reliability estimate:
$$
\rho_{XX'} = 1 - \frac{\sigma_{\text{diff}}^2}{\sigma_X^2}
$$

Inter-Rater Reliability
• Inter-rater reliability is the correlation between two or more scorers looking at the same phenomena.
• Sometimes called the coefficient of agreement; it indicates the percent of agreement between observers.

Computing Inter-Rater Reliability
Aggression Code   Coder 1   Coder 2
Hit boy A         1         3
Hit boy B         3         3
Hit girl A        3         2
Hit girl B        1         1
Use Cohen's kappa, the tetrachoric correlation, or the polychoric correlation to compute the agreement between the raters' ratings.
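A minimal Python sketch of Cohen's kappa for two coders. The ratings below are an assumed reading of the digits "1331" and "3321" on the slide as the Coder 1 and Coder 2 columns, which may not be what the slide intended. The function name `cohen_kappa` is hypothetical.

```python
from collections import Counter


def cohen_kappa(rater1, rater2) -> float:
    """Cohen's kappa for two raters coding the same cases with nominal codes."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("both raters must code the same non-empty set of cases")
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    m1, m2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of the two raters' marginal proportions per code.
    expected = sum(m1[c] * m2[c] for c in set(m1) | set(m2)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Assumed ratings for Hit boy A, Hit boy B, Hit girl A, Hit girl B.
coder1 = [1, 3, 3, 1]
coder2 = [3, 3, 2, 1]
kappa = cohen_kappa(coder1, coder2)
print(f"kappa = {kappa:.3f}")
```

Kappa discounts the agreement that two coders would reach by chance given how often each uses each code, so it is lower than the raw percent of agreement.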
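
Parallel-Forms Reliability
• The extent to which scores on two equivalent forms given at different times correlate with each other.
• The correlation between scores on two different tests that are parallel forms.
• According to classical test theory, truly parallel tests should have equal true scores and equal SEMs.
• Also called alternate-forms reliability
• Disadvantage: two forms must be constructed and the test administered twice
Examples: GRE, SAT, GMAT, TOEFL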

Relationships Among the Types of Reliability
[Figure: four types of reliability illustrated with three-item questionnaires: test-retest reliability, equivalent-forms reliability (Questionnaire 1 and Questionnaire 2), reliability as internal consistency, and interrater reliability]

Reliability and Type of Test
• Power test
  – tests in which every student has adequate time to complete the test
  – appropriate reliability indices include
    • split-half or internal consistency measures
• Speeded test
  – tests in which not all examinees are expected to finish
  – appropriate reliability indices include
    • Rulon's split-half for separately timed comparable halves of the test

Simulated Data
                 Item                    1st-half  2nd-half  Total
Student   1    2    3    4    5    6     score     score     score
1         1    1    1    0    1    0     3 (9)     1 (1)     4 (16)
2         0    1    1    0    1    0     2 (4)     1 (1)     3 (9)
3         0    0    1    1    0    0     1 (1)     1 (1)     2 (4)
4         0    0    0    1    0    1     0 (0)     2 (4)     2 (4)
5         1    1    1    1    1    1     3 (9)     3 (9)     6 (36)
Tot                                      9 (23)    8 (16)    17 (69)
p-value   .4   .6   .8   .6   .6   .4
1 - p     .6   .4   .2   .4   .4   .6
pq        .24  .24  .16  .24  .24  .24
(Values in parentheses are squared scores.)

Computation Exercise
• Using the simulated data:
  – determine the mean and SD for the total test
  – estimate reliability using
    • Spearman-Brown
    • Rulon split-half
    • KR20
    • KR21
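One way to work through these four estimates for the simulated data is sketched below in Python. It assumes the split is items 1-3 versus items 4-6 and that variances are population variances (divided by N); dividing by N − 1 instead would change the numbers. The helper `pcov` is a hypothetical name.

```python
import statistics

# Item responses from the simulated data (5 students x 6 items).
data = [
    [1, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
]
k = len(data[0])
first = [sum(row[:3]) for row in data]     # first-half scores (items 1-3)
second = [sum(row[3:]) for row in data]    # second-half scores (items 4-6)
total = [sum(row) for row in data]

mean = statistics.fmean(total)
var = statistics.pvariance(total)          # population variance (divide by N)
sd = var ** 0.5


def pcov(x, y):
    """Population covariance of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


# Split-half correlation stepped up with Spearman-Brown (L = 2).
r_half = pcov(first, second) / (statistics.pstdev(first) * statistics.pstdev(second))
sb = 2 * r_half / (1 + r_half)

# Rulon: 1 - variance of half-score differences / variance of total scores.
diff = [a - b for a, b in zip(first, second)]
rulon = 1 - statistics.pvariance(diff) / var

# KR20 uses the item p-values; KR21 uses only the mean and variance.
p = [sum(row[j] for row in data) / len(data) for j in range(k)]
kr20 = k / (k - 1) * (1 - sum(pj * (1 - pj) for pj in p) / var)
kr21 = k / (k - 1) * (1 - mean * (k - mean) / (k * var))

print(f"mean = {mean:.2f}, SD = {sd:.3f}")
print(f"Spearman-Brown = {sb:.3f}, Rulon = {rulon:.3f}")
print(f"KR20 = {kr20:.3f}, KR21 = {kr21:.3f}")
```

Under these assumptions KR21 comes out below KR20, as expected when the items differ in difficulty, and the two split-half estimates are lower still because this particular split yields weakly correlated halves.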

Problem 2: Compute KR20 and Coefficient Alpha
                     Item
Student   1  2  3  4  5  6  7  8  9  10
1         1  1  1  1  1  1  1  1  0  1
2         1  1  1  0  0  0  0  1  0  1
3         1  1  1  0  1  0  0  0  0  0
4         0  0  0  0  0  0  0  1  0  1
5         0  0  0  0  0  0  0  0  0  1
6         1  1  1  0  1  1  1  1  1  1
7         1  1  1  1  1  0  0  1  0  1
8         0  1  0  0  0  0  0  0  0  0
9         1  1  1  1  1  0  1  1  1  1
10        0  1  0  0  0  0  0  1  1  1
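A possible Python solution sketch for this problem, assuming population variances (divided by N) for both the items and the total score.

```python
import statistics

# Responses from the problem (10 students x 10 items).
data = [
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 0, 1, 1, 1],
]
n = len(data)
k = len(data[0])
total = [sum(row) for row in data]
items = list(zip(*data))                   # one tuple of scores per item

var_total = statistics.pvariance(total)

# KR20: item variance written as p * q for 0/1 items.
p = [sum(col) / n for col in items]
kr20 = k / (k - 1) * (1 - sum(pj * (1 - pj) for pj in p) / var_total)

# Coefficient alpha: the same formula with item variances in place of p * q.
item_vars = [statistics.pvariance(col) for col in items]
alpha = k / (k - 1) * (1 - sum(item_vars) / var_total)

print(f"KR20 = {kr20:.3f}, alpha = {alpha:.3f}")
```

The two coefficients coincide here because the variance of a 0/1 item is exactly p·q, which is the sense in which alpha generalizes KR20 to items that are not dichotomously scored.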

Basic Principles of Test Equating
* The tests measure the same trait
* Symmetry
* Invariance
* Equity
Are equating and scaling the same thing? Are equating and linking the same thing?

Steps in Test Equating
* Choose a data collection design
  – Single group design
  – Random groups design
  – Linking-item nonequivalent groups design
* Choose the type of score transformation
  – Mean equating
  – Linear equating
  – Equipercentile equating
* Choose a statistical method to carry out the equating

Types of Test Equating
* Horizontal equating
* Vertical equating

Methods of DIF Analysis
* Types of bias:
  – Content bias
  – Atmosphere bias
  – Prediction bias
  – Consequence bias
* Methods of DIF analysis:
  – Mantel-Haenszel χ² method
  – Logistic regression method
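A compact Python sketch of the Mantel-Haenszel approach for a single studied item, assuming examinees have already been matched into score levels and using the continuity-corrected chi-square. The function name, the tuple layout, and all counts are hypothetical; a stratum with no reference-incorrect or focal-correct responses in any level would make the odds ratio undefined, which this sketch does not handle.

```python
import math


def mantel_haenszel(strata):
    """Mantel-Haenszel statistics for one studied item.

    strata: list of (a, b, c, d) counts per matched score level, where
      a = reference correct, b = reference incorrect,
      c = focal correct,     d = focal incorrect.
    Returns (common odds ratio, chi-square with continuity correction).
    """
    num = den = sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue                      # stratum carries no information
        n_ref, n_foc = a + b, c + d
        m1, m0 = a + c, b + d             # correct / incorrect margins
        num += a * d / n
        den += b * c / n
        sum_a += a
        sum_e += n_ref * m1 / n           # expected reference-correct count
        sum_v += n_ref * n_foc * m1 * m0 / (n * n * (n - 1))
    odds_ratio = num / den
    chi_square = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v
    return odds_ratio, chi_square


# Hypothetical counts for three score levels (not taken from the slides).
strata = [(20, 10, 10, 20), (30, 10, 20, 20), (40, 5, 30, 15)]
alpha_mh, chi2 = mantel_haenszel(strata)
delta = -2.35 * math.log(alpha_mh)        # ETS delta-scale index (MH D-DIF)
print(f"alpha_MH = {alpha_mh:.2f}, MH chi-square = {chi2:.2f}, MH D-DIF = {delta:.2f}")
```

A common odds ratio above 1 means the reference group is more likely to answer correctly than matched focal-group members; the chi-square is referred to a distribution with 1 degree of freedom.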

Why Use IRT to Study DIF/DTF?
• Researchers are often interested in comparing cultural, ethnic, or gender groups.
• Meaningful comparisons require that measurement equivalence holds.
• Classical test theory methods confound “bias” with true mean differences; IRT does not.
• In IRT terminology, item/test bias is referred to as DIF/DTF

Defining DIF and DTF
• DIF refers to a difference in the probability of endorsing an item for members of a reference group (e.g., US workers) and a focal group (e.g., Chinese workers), having the same standing on theta.
• DTF refers to a difference in the test characteristic curves, obtained by summing the item response functions for each group.
• DTF is perhaps more important for selection because decisions are made based on test scores, not individual item responses.

DIF Examples
[Figure: Uniform DIF Against Focal Group. Probability of positive response vs. theta (-3 to 3) for the reference and focal groups; the reference group is favored at all levels.]
[Figure: Nonuniform (Crossing) DIF. Probability of positive response vs. theta (-3 to 3) for the reference and focal groups; the focal group is favored at low theta and the reference group at high theta.]

Tests for DIF/DTF
• DIF
  – Parametric
    • Lord's Chi-Square
    • Likelihood Ratio Test
    • Signed and Unsigned Area Methods
  – Nonparametric
    • SIBTEST
    • Mantel-Haenszel
• DTF
  – Parametric
    • Raju's DFIT Method
  – Nonparametric
    • SIBTEST

Lord's Chi-Square Test
$$
\chi_i^2 = \mathbf{v}_i'\,\boldsymbol{\Sigma}_i^{-1}\,\mathbf{v}_i
$$
v_i is a vector of the differences in the estimated item parameters for the ith item between the focal and reference groups.
Σ_i is the variance-covariance matrix for the differences in item parameter estimates.
Lord's Chi-Square is sensitive to both uniform and nonuniform DIF.

Lord's Chi-Square Test
1. Estimate item parameters and covariances for focal and reference groups separately.
2. Obtain linking constants, A and K, for putting the focal and reference parameters on a common metric.
3. Compute Lord’s chi-square to identify DIF items using the reference and transformed focal group parameters and their covariances.
4. Once the DIF items have been identified, reequate the focal and reference group metrics using only the non-DIF items.
5. Repeat steps 2 through 4 until the same items are identified on consecutive trials.
This procedure is implemented in the program ITERLINK.
Using ITERLINK
• ITERLINK is an interactive program that performs iterative linking for the 2PL and 3PL models using Lord’s Chi-Square.
• Creates three output files:– ITERLINK.DBG
• DIF results and linking constants across iterations
– PAIRDIF.DBG • Summary of DIF results
– User-named file • Contains transformed focal parameters

ITERLINK.DBG
---------------------------------------------------------------
ITEM   P obs.      DIF PRESENT using P<.05   Corrected P
---------------------------------------------------------------
 1     .03910006   YES                       NO
 2     .22354231   NO                        NO
 3     .55714918   NO                        NO
 4     .00030662   YES                       YES
 5     .00669607   YES                       NO
 6     .47102172   NO                        NO
 7     .00001849   YES                       YES
 8     .00527094   YES                       NO
 9     .12618701   NO                        NO
10     .02565616   YES                       NO
11     .46091057   NO                        NO
12     .05661002   NO                        NO
13     .00001317   YES                       YES
14     .74325914   NO                        NO
15     .70399638   NO                        NO
16     .00314141   YES                       NO
17     .12334338   NO                        NO
18     .33370542   NO                        NO
19     .13698496   NO                        NO
20     .22581136   NO                        NO
21     .36562514   NO                        NO
22     .10503999   NO                        NO
23     .58623106   NO                        NO
24     .21399115   NO                        NO
25     .02355756   YES                       NO
26     .51280795   NO                        NO
27     .51541161   NO                        NO
28     .00079191   YES                       YES
29     .65164401   NO                        NO
30     .89292899   NO                        NO
31     .04190196   YES                       NO
32     .28610441   NO                        NO
Bonferroni corrected p: .05 / #items
“Yes” = DIF
“No” = No DIF
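The Bonferroni rule noted here (.05 divided by the number of items) can be applied to the listed p-values with a short Python sketch. The variable names are hypothetical; the p-values are copied from the ITERLINK.DBG listing.

```python
# Observed p-values for the 32 items, copied from the ITERLINK.DBG listing.
p_values = [
    .03910006, .22354231, .55714918, .00030662, .00669607, .47102172,
    .00001849, .00527094, .12618701, .02565616, .46091057, .05661002,
    .00001317, .74325914, .70399638, .00314141, .12334338, .33370542,
    .13698496, .22581136, .36562514, .10503999, .58623106, .21399115,
    .02355756, .51280795, .51541161, .00079191, .65164401, .89292899,
    .04190196, .28610441,
]
alpha = 0.05
corrected_alpha = alpha / len(p_values)    # Bonferroni: .05 / #items

# Item numbers start at 1 to match the listing.
uncorrected = [i for i, p in enumerate(p_values, start=1) if p < alpha]
corrected = [i for i, p in enumerate(p_values, start=1) if p < corrected_alpha]

print("corrected threshold:", corrected_alpha)
print("DIF at p < .05:", uncorrected)
print("DIF after correction:", corrected)
```

With 32 items the corrected threshold is .05 / 32, so an item such as 16 (p ≈ .003) is flagged at the nominal level but not after correction, matching the two columns of the listing.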
PAIRDIF.DBG
Check that BILOG *.cov and *.3PL files were read correctly

PAIRDIF.DBG
[Annotated output: callouts mark the inverse matrix (-1), the vector v, and the transformed focal parameters]

Example of DTF for 50-Item Test
[Figure: DTF Against Reference Group. Proportion-correct true score vs. theta (-3.0 to 3.0) for the focal and reference groups. Most focal group members are expected to score about 3 points higher.]
Detecting DTF Using the DFITD4 Program
• Parametric procedure that detects DTF by comparing test characteristic curves.
• Determines whether DIF cancels or cumulates to produce DTF.– Linking coefficients, item parameters, and thetas
are required.
• Note: What we refer to as the reference group, Raju calls the focal group
JCL File for DFITD4
FORTRAN format statements for reading item parameters in *.3pl files
FORTRAN format statement for reading thetas in *.sco file
“1” to include item in analysis
# Items Logistic
DTF criterion for Dichotomous data
Linking Constants, A and K

Output File for DFITD4
ITEM DELETION PROCEDURE A
RUN  ITEM REMOVED  DTF     CHI-SQUARE  PROB  MEAN D   MEAN |D|
---  ------------  ------  ----------  ----  -------  --------
 1   NONE          .11780  10278.45    .000  -.29857  .31108
 2   63            .09449  12044.00    .000  -.27363  .28243
 3   64            .07466  11768.57    .000  -.24248  .25236
 4   65            .05990  14139.40    .000  -.22206  .22801
 5   3             .04637   9342.44    .000  -.18428  .19983
 6   88            .03507   8842.57    .000  -.15860  .17361
 7   99            .02569   8253.24    .000  -.13382  .14930
 8   50            .01790   8599.06    .000  -.11267  .12488
 9   67            .01252   7671.49    .000  -.09188  .10473
10   13            .00816   5725.96    .000  -.06782  .08423
11   84            .00494   4808.99    .000  -.04869  .06501
• DTF is present when the value > .006
• On each run, the item with the largest DIF statistic is removed
• DTF is eliminated after removing 10 items

Detecting DIF/DTF Using SIBTEST
• Nonparametric method that can be used to examine individual items or groups of items
  – Assumes only monotonicity
  – Requires only item response data
  – Works well with fairly small samples (250+)
• Several variations exist
  – Original SIBTEST: Uniform DIF
  – Crossing SIBTEST: Nonuniform DIF
  – PolySIB: Uniform DIF, polytomous data
  – MultiSIB: Uniform DIF, multiple dimensions
Discussed in Web Tutorial

Using SIBTEST
• SIBTEST consists of two executable files:
  – SIBIN.EXE: interactive, creates the input file
  – SIBTEST.EXE: performs DIF/DTF analyses
• Choose “E” for either, “R” for reference, or “F” for focal group
• Detailed discussion of running SIBIN and SIBTEST is presented on the web
SIBTEST DIF Output
Magnitude of DIF
Use Bonferroni correction
E: either group
MH results for comparison
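
Theory and Methods of Standard Setting
* Standard setting involves determining cut scores, which are used for high-stakes decisions such as mastery vs. non-mastery or pass vs. fail.
Standard-setting methods:
  – Examinee-centered models vs. test-centered models
  – Continuum models vs. state models
  – Human-judgment models vs. computerized adaptive judgment models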