IUPAC-CITAC Guide Draft 0PT Schemes2019.10.09


INTERNATIONAL UNION OF PURE AND APPLIED CHEMISTRY

ANALYTICAL CHEMISTRY DIVISION*

INTERDIVISIONAL WORKING PARTY FOR HARMONIZATION OF QUALITY ASSURANCE SCHEMES

COOPERATION ON INTERNATIONAL TRACEABILITY IN ANALYTICAL CHEMISTRY (CITAC)

IUPAC/CITAC GUIDE

SELECTION AND USE OF PROFICIENCY TESTING SCHEMES FOR A LIMITED NUMBER OF PARTICIPANTS – CHEMICAL ANALYTICAL LABORATORIES

(IUPAC Technical Report)

Prepared for publication by

ILYA KUSELMAN1 AND ALEŠ FAJGELJ2

1The National Physical Laboratory of Israel, Givat Ram, Jerusalem 91904, Israel;
2International Atomic Energy Agency, Wagramer Strasse 5, P.O. Box 100, Vienna A-1400, Austria

Corresponding author: e-mail: [email protected]

*Membership of the Analytical Chemistry Division during the final preparation of this report was as follows:

President: A. Fajgelj (IAEA); Vice-President: W. Lund (Norway); Past-President: R. Lobinski (France); Secretary: D.B. Hibbert (Australia); Titular Members: M.F. Camões (Portugal); Z. Chai (China); P. De Bièvre (Belgium); J. Labuda (Slovakia); Z. Mester (Canada); S. Motomizu (Japan); Associate Members: P. De Zorzi (Italy); A. Felinger (Hungary); M. Jarosz (Poland); D.E. Knox (USA); P. Minkkinen (Finland); P.M. Pingarrón (Spain); National Representatives: S.K. Aggarwal (India); R. Apak (Turkey); M.S. Iqbal (Pakistan); H. Kim (Korea); T.A. Maryutina (Russia); R.M. Smith (UK); N. Trendafilova (Bulgaria)

Membership of the Task Group:

Chairman: A. Fajgelj (IAEA); Members: I. Kuselman (Israel); M. Belli (Italy); S.L.R. Ellison (UK); U. Sansone (IAEA); W. Wegscheider (Austria)

ACKNOWLEDGEMENTS

The Task Group would like to thank P. Fisicaro (France) and M. Koch (Germany) for their data and for help in preparing Examples 1 and 2, respectively, in Annex B of the Guide; H. Emons (IRMM) for helpful discussions; and Springer, Heidelberg (www.springer.com) and the Royal Society of Chemistry, London (www.rsc.org) for permission to use material from the published papers cited in the Guide.

IUPAC/CITAC Guide

Selection and Use of Proficiency Testing Schemes for a Limited Number of Participants – Chemical Analytical Laboratories

(IUPAC Technical Report)

Abstract: A metrological background for implementation of proficiency testing (PT) schemes for a limited number of participating laboratories (fewer than 30) is discussed. Such schemes should be based on the use of certified reference materials with traceable property values to serve as proficiency test items whose composition is unknown to the participants. It is shown that achieving quality of PT results in the framework of the concept "tested once, accepted everywhere" requires both metrological comparability and compatibility of these results.

The possibility of assessing the collective/group performance of PT participants by comparing the PT consensus value (mean or median of the PT results) with the certified value of the test items is analyzed. Tabulated criteria for this assessment are proposed.

Practical examples are described to illustrate the issues discussed.

Keywords: proficiency testing, sample size, metrological traceability, measurement uncertainty, metrological comparability and compatibility

ABBREVIATIONS AND SYMBOLS

A - critical value for numbers N+ and/or N−
AAS - atomic absorption spectrometry
ai - empirical sensitivity coefficient of the i-th component
AN - acid number
AS - adequacy score
α - probability equivalent to the area under the tail/s of a distribution
bcf - buoyancy correction factor
β - probability of type 2 error
c1, c2 - measurement/test results corresponding to the crossing points of two probability density functions
ccert - certified (assigned) value of a particular property of a CRM
ci - measurement/test result of i-th laboratory participating in PT
cis - value of a particular property of routine samples
CP - criterion power
cPT - population (theoretical) mean of PT results
cPT/avg - observed/experimental mean of PT results (consensus value)
CRM - certified reference material
- ratio cert/PT
- permissible bias of MPT from ccert
and - parameters
EMD - École des Mines de Douai
F - frequency of a c-value
f - probability density function
GC-MS - gas chromatography-mass spectroscopy
GF-AAS - graphite furnace-atomic absorption spectrometry
H0 - null hypothesis
H1 - alternative hypothesis
hand - hand preparation of a sample
HPLC - high performance liquid chromatography
i, j, n - index numbers
ICP-MS - inductively coupled plasma mass spectroscopy
ICP-OES - inductively coupled plasma-optical emission spectroscopy
ID-ICP-MS - isotope dilution-inductively coupled plasma-mass spectrometry
IHRM - in-house reference material
INPL - National Physical Laboratory of Israel
ISO - International Organization for Standardization
K - kelvin
LNE - Laboratoire National de Métrologie et d'Essais
MCL - maximum contaminant level
mAs2O3 - mass of a sample of arsenic oxide
mdil - mass of the diluted solution (a sample)
mdil/t - total mass of the diluted solution
mlot - total mass of final lot
MPT - population median of PT results
mss - mass of the stock solution (a sample)
mss/t - total mass of the stock solution
N− - number of PT results ci < ccert −
N - size of the statistical sample of measurement results of PT participants
N* - number of potentiometric titration results
N+ - number of PT results ci > ccert +
NIST SRM - standard (certified) reference material developed by the National Institute of Standards and Technology, USA
NMR - nuclear magnetic resonance
Np - size of the population of PT participants
P - probability
pc - purity of chemicals
Pe - probability of an event
pH-metr. - pH-metric method
Pot. titr. - potentiometric titration
PT - proficiency testing
pAs/As2O3 - proportion of atomic weights of As and As2O3
Π - symbol of multiplication
Qest - questionable
RAN - limit of a difference between two results of AN determination (range)
Ri - ratio of the min to the max values from two concentrations
RL - reference laboratory
ρlot - density of a lot of an aqueous IHRM
s - observed sample standard deviation
SADCMET - Southern African Cooperation in Measurement Traceability
sbsi and sisi - between-sample and intra-sample standard deviations
SI - International System of Units
sPT - observed sample standard deviation of PT results
σPT - population standard deviation of PT results
σPT/av - standard deviation of the sample mean cPT/av of PT results
σtarg - target standard deviation of PT results
t1−α/2 - percentile of the one-tailed Student's distribution at level of confidence 1−α/2
TP - test power
u(ci) and U(ci) - standard and expanded uncertainties of ci, respectively
ucert and Ucert - standard and expanded uncertainty of ccert, respectively
ucomb - combined standard uncertainty
umLP - standard measurement uncertainty declared by a laboratory participating in PT
umRL - standard measurement uncertainty declared by the reference laboratory
USN - ultrasonic nebulization
UV - ultraviolet
vibr - sample preparation with a vibrating table
VIM3 - International Vocabulary of Metrology; 3rd ed.
xj - normalized value of the j-th PT result
χ²{α, N−1} - 100α percentile of the χ² distribution at N−1 degrees of freedom
- function of normalized normal distribution
(xj) - value of the function of the normalized normal distribution for xj
- fraction of the statistical sample of size N from the population of size Np
ω² - empirical value of the Cramér-von Mises criterion
z, ζ and En - scores for assessment of proficiency of a laboratory participating in PT

CONTENTS

1. INTRODUCTION
1.1. Scope and field of application
1.2. Terminology
2. APPROACH
2.1. Properties of PT consensus values: dependence on the statistical sample size
2.2. Measurement uncertainty use for interpretation of PT results
2.3. What is a metrological approach to PT?
3. VALUE ASSIGNMENT
3.1. Metrological traceability of a CRM property value and of PT results
3.1.1. Commutability of the CRMs and routine samples
3.1.2. Three scenarios
3.2. Scenario I: Use of adequate CRM
3.3. Scenario II: No closely matched CRMs
3.4. Scenario III: Appropriate CRMs are not available
4. INDIVIDUAL LABORATORY PERFORMANCE EVALUATION AND SCORING
4.1. Single (external) criterion for all laboratories participating in a PT
4.2. Own criterion for every laboratory
5. METROLOGICAL COMPARABILITY & COMPATIBILITY OF PT RESULTS
6. EFFECT OF SMALL LABORATORY POPULATION ON SAMPLE ESTIMATES
7. OUTLIERS
8. EFFECTIVENESS OF APPROACHES TO PT
ANNEX A. CRITERIA FOR ASSESSMENT OF METROLOGICAL COMPATIBILITY OF PT RESULTS
ANNEX B. EXAMPLES
ANNEX C. REFERENCES

1. INTRODUCTION

The International Harmonized Protocol for the proficiency testing (PT) of analytical chemistry laboratories, adopted by IUPAC in 1993 [1], was revised in 2006 [2]. Statistical methods for use in PT [3] have been published as a complementary standard to ISO/IEC Guide 43, which describes PT schemes based on interlaboratory comparisons [4]. General requirements for PT are updated in the new standard [5]. International Laboratory Accreditation Cooperation (ILAC) Guidelines define requirements for the competence of PT providers [6]. Guidelines for PT use in specific sectors, such as clinical laboratories, are also widely available [7]. In some other sectors they are under development.

These documents are, however, oriented mostly towards PT schemes for a relatively large number N of laboratories or participants (greater than or equal to 30), henceforth referred to as "large schemes". This is important from a statistical point of view, since with N below 30, evaluations by statistical methods become increasingly unreliable, especially for N < 20. For example, uncertainties in estimates of location (such as mean and median) are sufficiently small to be neglected in scoring as N increases to approximately 30, but cannot be neglected safely with N < 20. Deviations from normal distribution are harder to identify if N is small. Robust statistics, too, are not usually recommended when N < 20. Therefore, the assigned/certified value of the proficiency test items ccert cannot be calculated safely from the measurement results obtained by the participants (PT results) as a consensus value: its uncertainty becomes large enough to affect scores in "small schemes", that is, schemes with small numbers of participants (N < 20).

Moreover, if the size Np of the population of laboratories participating in PT is not infinite, and the size of the statistical sample N is greater than 5 to 10 % of Np, the value of the sample fraction N/Np may need to be taken into account.

Thus, implementation of small PT schemes is sometimes not a routine task. Such schemes are quite often required for quality assurance of environmental analysis specific for a local region, analysis of specific materials in an industry (e.g. under development), for purposes of a regulator or a laboratory accreditation body, etc. [8].

1.1. Scope and field of application

This Guide is developed for implementation of simultaneous participation schemes when the number of laboratories is smaller than 30. This includes: 1) selection of a scheme based on simultaneous distribution of test items to participants for concurrent quantitative testing; 2) use of certified reference materials (CRMs) as test items unknown to the participants; 3) individual laboratory performance assessment and assessment of the metrological comparability and compatibility of the measurement results of the laboratories taking part in the PT scheme as a collective (group) of participants.

The document is intended for PT providers and PT participants (chemical analytical laboratories), for accreditation bodies, laboratory customers, regulators, quality managers, metrologists and analysts.

1.2. Terminology

Terminology used in this Guide corresponds to ISO standards 17043 [5] and 3534 [9], and ISO Guide 99 (VIM) [10].

2. APPROACH

2.1. Properties of PT consensus values: dependence on the statistical sample size

The difference between the population parameters and the corresponding sample estimates increases with decreasing sample size N. In particular, a sample mean cPT/avg of N PT results can differ from the population mean cPT by up to 1.96σPT/√N with 95 % probability, 1.96 being the appropriate percentile of the normal distribution for a two-sided 95 % interval, and σPT being the population standard deviation of the results. Dependence of the upper limit of the interval for the expected bias |cPT/avg − cPT| on N is shown (in units of σPT) in Fig. 1, where the range N = 20 to 30 is indicated by the grey bar. Even for N = 30 the bias may reach 0.36σPT at the 95 % level of confidence. Similarly, the sample standard deviation sPT is expected to be in the range

$$\sigma_{PT}\left[\chi^2\{0.025,\,N-1\}/(N-1)\right]^{1/2} \le s_{PT} \le \sigma_{PT}\left[\chi^2\{0.975,\,N-1\}/(N-1)\right]^{1/2}$$

with probability of 95 %, where χ²{α, N − 1} is the 100α percentile of the χ² distribution at N − 1 degrees of freedom. The dependence of the range limits for sPT on N is shown in Fig. 2 (again in σPT units), also with the range N = 20 to 30 marked by the grey bar. For example, for N = 30 the upper 95 % limit for sPT is 1.26σPT. In other words, sPT can differ from σPT for N = 30 by over 25 % (relative) at the level of confidence 0.95. For N < 30 the difference between the sample and the population characteristics increases with decreasing N, especially dramatically for the standard deviation when N < 20.
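The two limits quoted in this section can be checked numerically. The following is a minimal sketch, not part of the Guide: the function names are hypothetical, and the χ² percentile is obtained with the Wilson-Hilferty approximation (an assumption chosen here to stay within the Python standard library; an exact χ² quantile function would give marginally different digits).

```python
from math import sqrt
from statistics import NormalDist


def bias_limit(n, level=0.95):
    """Upper limit of |cPT/avg - cPT| in units of sigma_PT for n PT results."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a two-sided 95 % interval
    return z / sqrt(n)


def chi2_percentile(p, df):
    """Approximate 100p percentile of the chi-square distribution.

    Wilson-Hilferty approximation (an assumption of this sketch, not the Guide's method).
    """
    z = NormalDist().inv_cdf(p)
    k = 2.0 / (9.0 * df)
    return df * (1.0 - k + z * sqrt(k)) ** 3


def s_limits(n, level=0.95):
    """Lower and upper limits of sPT in units of sigma_PT for n PT results."""
    tail = (1.0 - level) / 2
    df = n - 1
    lower = sqrt(chi2_percentile(tail, df) / df)
    upper = sqrt(chi2_percentile(1.0 - tail, df) / df)
    return lower, upper


for n in (10, 20, 30):
    low, high = s_limits(n)
    print(f"N={n}: bias <= {bias_limit(n):.2f}, sPT in [{low:.2f}, {high:.2f}] (units of sigma_PT)")
```

For N = 30 this sketch gives a bias limit of about 0.36 and an upper sPT limit of about 1.26, in line with the values stated in the text.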


[Figure 1: plot of Bias/σPT (vertical axis) versus N (horizontal axis, 0 to 100).]

Fig. 1. Dependence of the upper limit of the bias |cPT/avg − cPT| (in units of σPT) on the number N of PT results; reproduced from ref. [8] by permission of Springer. The line is the upper 97.5th percentile, corresponding to the upper limit of the two-sided 95 % interval for the expected bias. The range of N = 20 to 30, intermediate between small and large sample sizes, is shown by the grey bar.

While consensus mean values are less affected than observed standard deviations, uncertainties in consensus means are relatively large in small schemes, and will practically never meet the guidelines for unqualified scoring suggested in the IUPAC Harmonized Protocol [2] for cases when the uncertainties are negligible. It follows that scoring for small schemes should usually avoid simple consensus values. Methods of obtaining traceable assigned values ccert are to be used wherever possible to provide comparable PT results [11, 12].

The high variability of dispersion estimates in small statistical samples has special implications for scoring based on observed participant standard deviation sPT. This practice is already not recommended even for large schemes [3], on the grounds that it does not provide consistent interpretation of scores from one round (or scheme) to the next. For small schemes, the variability of sPT magnifies the problem.

[Figure 2: plot of sPT/σPT (vertical axis, 0.0 to 2.0) versus N (horizontal axis, 0 to 100).]

Fig. 2. Dependence of the sample standard deviation sPT limits (in units of σPT) on the number N of PT results; reproduced from ref. [8] by permission of Springer. Solid lines show 2.5th (lower line) and 97.5th (upper line) percentiles for sPT. The dashed line is at sPT/σPT = 1.0 for reference. The grey bar shows the range of intermediate sample sizes (N = 20 to 30).

It follows that scores based on the observed participant standard deviation should not be applied in such a case. If a PT provider can set an external, fit-for-purpose, normative or target standard deviation σtarg, then z-scores, which compare the bias of a result from the assigned value with σtarg, can be calculated in a small scheme in the same manner as recommended in refs. [1-5] for a large scheme. The only condition is that the standard uncertainty of the assigned/certified value ucert is insignificant in comparison to σtarg (u²cert < 0.1σ²targ).
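One way to read the negligibility condition, offered as a worked illustration rather than as part of the cited protocols: if ucert were combined with σtarg in quadrature in the score denominator (an assumption made only for this estimate), the condition would bound its effect at about 5 %.

```latex
u_{cert}^2 < 0.1\,\sigma_{targ}^2
\;\Longrightarrow\;
u_{cert} < \sqrt{0.1}\,\sigma_{targ} \approx 0.32\,\sigma_{targ},
\qquad
\sqrt{\sigma_{targ}^2 + u_{cert}^2} < \sqrt{1.1}\,\sigma_{targ} \approx 1.05\,\sigma_{targ}
```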


2.2. Measurement uncertainty use for interpretation of PT results

When information necessary to set σtarg is not available, and/or ucert is not negligible, the information included in the measurement uncertainty u(ci) of the result ci reported by the i-th laboratory is helpful for performance assessment using zeta-scores and/or En numbers [2, 3]. It may also be important for a small scheme that laboratories working according to their own fitness-for-purpose criteria (for example, in conditions of competition) can be judged by individual criteria based on their declared measurement uncertainty values.
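The two uncertainty-based scores mentioned here are defined in Section 4.2 of the Guide. A minimal sketch of their calculation follows; the function names and all numerical values are hypothetical and serve only as an illustration, and the expanded uncertainties are assumed to use a coverage factor of 2.

```python
from math import sqrt


def zeta_score(c_i, u_i, c_cert, u_cert):
    """Zeta score: deviation from the assigned value over the combined standard uncertainty."""
    return (c_i - c_cert) / sqrt(u_i ** 2 + u_cert ** 2)


def en_number(c_i, U_i, c_cert, U_cert):
    """En number: same deviation, but over the combined expanded uncertainty."""
    return (c_i - c_cert) / sqrt(U_i ** 2 + U_cert ** 2)


# Hypothetical values, for illustration only
c_i, u_i = 10.6, 0.25        # participant result and its standard uncertainty
c_cert, u_cert = 10.0, 0.15  # assigned value and its standard uncertainty
k = 2                        # coverage factor assumed for both expanded uncertainties

zeta = zeta_score(c_i, u_i, c_cert, u_cert)
en = en_number(c_i, k * u_i, c_cert, k * u_cert)
print(f"zeta = {zeta:.2f}, En = {en:.2f}")
```

When both expanded uncertainties use the same coverage factor k, En equals zeta divided by k, so the usual limits of 2 for zeta and 1 for En coincide for k = 2.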

2.3. What is a metrological approach to PT?

The approach based on metrological traceability of an assigned value of test items, providing comparability of PT results, and on scoring PT results taking into account uncertainties of the assigned value and uncertainties of the measurement results, has been described as a "metrological approach" [13].

Two main steps are common for any PT scheme using this approach: 1) establishment of a metrologically traceable assigned value, ccert, of analyte concentration in the test items/reference material and quantification of the standard uncertainty ucert of this value, including components arising from the material homogeneity and stability during the PT round, and 2) calculation of fitness-for-purpose performance statistics as well as assessment of the laboratory performance, taking into account the laboratory measurement uncertainty. For the second step it may be necessary in addition to take into account the small population size of laboratories able to take part in the PT. These issues are considered below.

3. VALUE ASSIGNMENT

3.1. Metrological traceability of a CRM property value and of PT results

Since the approach to PT for a limited number N of participants is based on the use of CRMs as test items unknown to the participants, metrological traceability of a CRM property value is a key to understanding metrological comparability and compatibility of the PT results. Interrelations of these parameters are shown in Fig. 3.

[Figure 3: diagram. Labels: SI units (kg, K, mol, others); Primary CRM (NMIs); Secondary CRM (CRM producers); Working CRM/IHRM (testing labs and other users); Ref. meas. stand.; Assigned value / measurement result; pointers for Uncertainty, Traceability, Comparability, Compatibility and CRM commutability.]

Fig. 3. A scheme of calibration hierarchy, traceability and commutability (adequacy or match) of reference materials used for PT, comparability and compatibility of PT results; reproduced from ref. [16] by permission of Springer.

The left pyramid in Fig. 3 illustrates the calibration hierarchy of CRMs as measurement standards or calibrators [10] ranked by increasing uncertainties of supplied property values from primary CRMs (mostly pure substances developed by National Metrology Institutes - NMIs), to secondary CRMs (e.g. a matrix CRM traceable to primary CRMs), and from secondary to working CRMs (certified in-house reference materials - IHRMs - developed by testing/analytical laboratories, PT providers and other users) [14,15]. When a CRM of a higher level is used for certification of a reference material of a lower level by comparing them (for example, for certification of an IHRM), the first one plays the role of a reference measurement standard; this is shown in Fig. 3 by semicircular pointers. Since the uncertainty of CRM property values increases in this way, the uncertainty pointer is directed from the top of the pyramid to the bottom.

The same CRM can be used for calibration of a measurement system and for PT, i.e. for two different purposes: as a calibrator and as a quality control material (test items), but not at the same time, in the same measurement or in the same test [17].

The right-side overturned pyramid in Fig. 3 shows traceability chains from a reference material certified value and the corresponding measurement/analysis/test results to SI units. As a rule, one result is to be traceable to the definition of its unit, while simultaneously there are several influence quantities which also need to be traceable to their own definitions of units: to the mole of the analyte entities per mass of sample (i.e. for the concentrations in the calibration solutions), to the kilogram because the size of a sample under analysis is quantified by mass or volume, to the kelvin when temperature influences the results obtained for the main quantity, etc. Thus, the traceability pointer has a direction opposite to that of the measurement uncertainty pointer. Of course, the width of the overturned pyramid is not correlated with the uncertainty values, as is the case in the left-side pyramid.

Understanding traceability of measurement/analysis/test and PT results to the mole (realized through the chain of the CRMs according to their hierarchy) is often not simple and requires reliable information about the measurement uncertainty. The problem is that the uncertainty of analytical results may increase because of deviations of the chemical composition of the matrix CRM (used for calibration of the measurement system) from the chemical composition of the routine samples under analysis. Similarly, the difference between a certified value of the matrix reference material (applied in a PT as test items) and the result of a laboratory participating in the PT may increase when the CRM has a different chemical composition than the routine samples. This is known as the problem of CRM commutability - adequacy or match - to a sample under analysis [18], and is shown in Fig. 3 as an additional pointer above the uncertainty pointer. Commutability is discussed in the following paragraph 3.1.1, while the metrological comparability and compatibility pointers, also shown in Fig. 3, are discussed in paragraph 5.

3.1.1. Commutability of the CRMs and routine samples

Since a difference in property values and matrices of the CRM and of routine samples influences the measurement uncertainty in PT, the chemical composition of the measurement standard (the CRM used as test items) and of the routine samples of the test object should be as close as possible. An algorithm for a priori evaluation of CRM adequacy can be based on the use of an adequacy score:

$$AS\ (\%) = 100 \prod_{i=1}^{n} R_i^{a_i},$$

where Π is the symbol of multiplication; i = 1, 2, ..., n is the number of a component or of a physico-chemical parameter; Ri = min(ci,s, ci,cert)/max(ci,s, ci,cert) is the ratio of the minimal to the maximal values from ci,s and ci,cert; ci,s and ci,cert are the concentrations of the i-th component or the values of the i-th physico-chemical parameter in the sample and certified in the CRM, respectively; 0 ≤ ai ≤ 1 is the empirical sensitivity coefficient, which allows decreasing the influence of a component or a parameter on the score value if the component or the parameter is less important for the analysis than others. According to this score, the ideal adequacy (AS = 100 %) is achieved when the composition and properties of the sample and of the CRM coincide. The adequacy is absent (AS = 0 %) when the sample and the CRM are different substances or materials, and/or the analyte is absent in the CRM (ci,cert = 0). Intermediate cases, for example for two components under control, are shown in Fig. 4. The ratios R1 and R2 providing adequacy score values AS = 70, 80 and 90 % form curves 1, 2 and 3, respectively.

Fig. 4. Adequacy score AS values in dependence on ratios R1 and R2 of concentrations of two components in a sample under analysis and in a CRM; reproduced from ref. [16] by permission of Springer. Curves 1, 2 and 3 correspond to AS = 70, 80 and 90 %, respectively. The dotted pointer shows the direction of increasing adequacy.
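The adequacy score defined in this section is a weighted product of min/max ratios and is straightforward to compute. The sketch below is illustrative only: the function name, the default of ai = 1 for every component, the treatment of a component absent from both materials, and the example concentrations are all assumptions made here, not taken from the Guide.

```python
from math import prod


def adequacy_score(sample, certified, sensitivity=None):
    """Adequacy score AS in percent for matched lists of component values.

    sample      -- values c_i,s in the routine sample
    certified   -- values c_i,cert certified in the CRM
    sensitivity -- coefficients a_i between 0 and 1 (assumed 1 if omitted)
    """
    if sensitivity is None:
        sensitivity = [1.0] * len(sample)
    ratios = []
    for c_s, c_cert in zip(sample, certified):
        largest = max(c_s, c_cert)
        # Assumption of this sketch: a component absent from both materials counts as matched
        ratios.append(1.0 if largest == 0 else min(c_s, c_cert) / largest)
    return 100.0 * prod(r ** a for r, a in zip(ratios, sensitivity))


# Hypothetical two-component case; the second component is given less weight
score = adequacy_score([2.0, 50.0], [2.5, 45.0], [1.0, 0.5])
print(f"AS = {score:.1f} %")
```

Lowering ai towards 0 pushes the corresponding factor towards 1, which is how a less important component is prevented from dominating the score.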

The adequacy score may be helpful for CRM choice as a calibrator since direct use of a CRM having a low adequacy score can lead to an incorrect/broken traceability chain. Such a CRM applied for PT will decrease the reliability of a laboratory performance assessment. Therefore, CRM commutability in PT and a score allowing its evaluation are also important. However, the adequacy score does not properly quantify the measurement uncertainty contribution caused by insufficient commutability (AS < 100 %). This requires a special study.

For more details of AS calculations, see Annex B, Example 5.

3.1.2. Three scenarios

Thus, the task of value assignment is divided into the following three scenarios: I) an adequate matrix CRM with traceable property value is available for use as test items; II) available matrix CRMs are not directly applicable, but a CRM can be used in formulating a spiked material with traceable property values; III) only an IHRM with a limited traceability chain of the property value is available (for example, because of instability of the material under analysis).

3.2. Scenario I: Use of adequate CRM

The ideal case is when the test items distributed among the laboratories participating in the PT are portions of a purchased adequate matrix CRM (primary or secondary measurement standard). However, when the CRMs available in the market are too expensive for direct use in PT as test items, a corresponding IHRM (working measurement standard) is to be developed. Characterization of an IHRM with a property value traceable to the CRM value by comparison, and application of the IHRM for PT, are described in refs. [3, 19-21]. The characterization can be effectively carried out by analysis of the two materials in pairs, each pair consisting of one portion of the IHRM and one portion of the CRM. A pair is analyzed practically simultaneously, by the same analyst and method, in the same laboratory and conditions. According to this design, the analyte concentration in the IHRM under characterization is compared with the certified value of the CRM and is calculated using differences in results of the analyte determinations in the pairs. The standard uncertainty of the IHRM certified value is evaluated as a combination of the CRM standard uncertainty and of the differences' standard uncertainty (the standard deviation of the mean of the differences). The uncertainty of the IHRM certified value includes homogeneity uncertainties of both the CRM and the IHRM, since the differences in the results are caused not only by the measurement uncertainties, but also by fluctuations of the analyte concentrations in the test portions. When more than one unit of IHRM is prepared for PT, care still needs to be taken to include the IHRM between-unit homogeneity term in evaluating the uncertainty. Since, in this scenario, the CRM and IHRM have similar matrices and close chemical compositions, their stability characteristics during PT under similar processing, packaging and transportation conditions are assumed to be identical unless there is information to the contrary. The CRM uncertainty forms a part of the IHRM uncertainty budget and is expected to include any necessary uncertainty related to stability; therefore, no additional stability term is included in the IHRM uncertainty.
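The paired design described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure of refs. [3, 19-21]: the function name and data are hypothetical, the IHRM value is taken as the CRM certified value plus the mean paired difference, and the "combination" of uncertainties is assumed to be a root sum of squares. A between-unit homogeneity term, where needed, would have to be added separately.

```python
from math import sqrt
from statistics import mean, stdev


def characterize_ihrm(ihrm_results, crm_results, c_crm, u_crm):
    """Assign a value and standard uncertainty to an IHRM from paired analyses.

    ihrm_results, crm_results -- results for the IHRM and CRM portions of each pair
    c_crm, u_crm              -- certified value and standard uncertainty of the CRM
    """
    diffs = [x - y for x, y in zip(ihrm_results, crm_results)]
    d_mean = mean(diffs)
    u_diff = stdev(diffs) / sqrt(len(diffs))  # standard deviation of the mean difference
    c_ihrm = c_crm + d_mean                   # assumed additive model
    u_ihrm = sqrt(u_crm ** 2 + u_diff ** 2)   # assumed root-sum-of-squares combination
    return c_ihrm, u_ihrm


# Hypothetical results for four pairs (same units as the certified value)
ihrm = [10.2, 10.4, 10.1, 10.3]
crm = [10.0, 10.1, 9.9, 10.0]
c_ihrm, u_ihrm = characterize_ihrm(ihrm, crm, c_crm=10.0, u_crm=0.05)
print(f"IHRM value = {c_ihrm:.2f}, standard uncertainty = {u_ihrm:.3f}")
```

Because each difference is taken within a pair analyzed under the same conditions, effects common to both portions largely cancel, which is the motivation for the design.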

The criterion of fitness-for-purpose uncertainty of the property value of a reference material applied for PT is formulated depending on the task. For example, for PT in the field of water analysis in Israel [22], expanded uncertainty values should be negligible in comparison to the maximum contaminant level (MCL), i.e. the maximum permissible analyte concentration in water delivered to any user of the public water system. In this example, the uncertainty was limited to 2ucert


A related scenario is based on traceable quantitative elemental analysis and qualitative information on purity/degradation of the analyte under characterization in the IHRM. For example, IHRMs for determination of inorganic polysulfides in water have been developed in this way [24]. The determination included derivatization of the polysulfides with a methylation agent followed by GC-MS or HPLC analysis of the difunctionalized polysulfides. Therefore, the IHRMs were synthesized in the form of dimethylated polysulfides containing four to eight atoms of sulfur. Composition of the compounds was confirmed by NMR and by dependence of HPLC retention time of the dimethylpolysulfides on the number of sulfur atoms in the molecule. Stability of the IHRMs was studied by HPLC with UV detection. Total sulfur content was determined by oxidation of the IHRMs with perchloric acid in high-pressure vessels (bombs), followed by determination of the formed sulfate using ICP-OES. IHRM certified values were traceable to NIST SRM 682 through the Anion Multi-Element Standard II from Merck (containing certified concentration of sulfate ions) that was used for the ICP-OES calibration, and to the SI kg, since all the test portions were quantified by weight.

For a more detailed example, see Annex B, Example 2.

3.4. Scenario III: Appropriate CRMs are not available

This scenario can arise when a component or an impurity of an object/material under analysis is unstable, or the matrix is unstable, and no CRMs (primary or secondary measurement standards) are available. The proposed PT scheme for such a case is based on preparation of an individual sample of IHRM for every participant in the same conditions provided by a reference laboratory (RL), allowing the participant to start the measurement/test process immediately after the sample preparation. In this scheme IHRM instability is not relevant as a source of measurement/test uncertainty, while intra- and between-sample inhomogeneity parameters are evaluated using the results of RL testing of the samples taken at the beginning, the middle and the end of the PT experiment. For example, such a PT scheme was used for concrete testing; for more details, see Annex B, Example 3.

4. INDIVIDUAL LABORATORY PERFORMANCE EVALUATION AND SCORING

4.1. Single (external) criterion for all laboratories participating in a PT

The present IUPAC Harmonized Protocol [2] recommends that z-score values

$$z_i = \frac{c_i - c_{cert}}{\sigma_{targ}}$$

are considered acceptable within ±2, unacceptable with values outside ±3, and questionable with intermediate values (the grounds for that are discussed thoroughly elsewhere [2]). This score provides the simplest and most direct answer to the question: "Is the laboratory performing to the quantitative requirement ($\sigma_{targ}$) set for the particular scheme?" The laboratory's quoted uncertainty is not directly relevant to this particular question, so it is not included in the score. Over the longer term, however, a laboratory will be scored poorly if its real (as opposed to estimated) uncertainty is too large for the job, whether the problem is caused by unacceptable bias or unacceptable variability. This scoring, based on an externally set value $\sigma_{targ}$ (without explicitly taking uncertainties of the assigned value and participant uncertainties into account), remains applicable to small schemes, provided that laboratories share a common purpose for which a single value of $\sigma_{targ}$ can be determined for each round. Examples of setting $\sigma_{targ}$ and of z-score use are given in Annex B, Examples 1-2.
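The z-score rule can be sketched in a few lines of Python. This is only an illustration: the function names and the numerical values of the results, the assigned value and the target standard deviation are hypothetical and are not taken from the Guide.

```python
def z_score(c_i, c_cert, sigma_targ):
    """z = (c_i - c_cert) / sigma_targ for one participant result."""
    return (c_i - c_cert) / sigma_targ


def classify(score, warn=2.0, action=3.0):
    """|z| <= 2 acceptable, |z| > 3 unacceptable, otherwise questionable."""
    a = abs(score)
    if a <= warn:
        return "acceptable"
    if a > action:
        return "unacceptable"
    return "questionable"


# Hypothetical round: assigned value 10.0, target standard deviation 0.5
results = {"lab A": 10.4, "lab B": 11.2, "lab C": 8.2}
for lab, c in results.items():
    z = z_score(c, c_cert=10.0, sigma_targ=0.5)
    print(lab, round(z, 2), classify(z))
```

The thresholds are passed as parameters so that the same helper can be reused for zeta scores, which the text interprets in the same way.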


4.2. Own criterion for every laboratory

Often, however, a small group of laboratories has sufficiently different requirements that a single criterion is not appropriate. It may then (as well as generally) be of interest to consider a somewhat different question about performance: "Are the participants' results consistent with their own quoted uncertainties?" For this purpose, zeta ($\zeta$) and $E_n$ number scores are appropriate. The scores are calculated as

$$\zeta_i = \frac{c_i - c_{cert}}{\sqrt{u^2(c_i) + u_{cert}^2}} \quad \text{and} \quad E_n = \frac{c_i - c_{cert}}{\sqrt{U^2(c_i) + U_{cert}^2}},$$

where $u(c_i)$ and $U(c_i)$ are the standard and expanded uncertainties of the i-th participant result $c_i$, respectively, and $u_{cert}$ and $U_{cert}$ are the standard and expanded uncertainties of the certified (or otherwise assigned) value $c_{cert}$. Zeta score values are typically interpreted in the same way as z-score values (see Annex B, Example 3). The $E_n$ number differs from the zeta score in the use of expanded uncertainties, and $E_n$ values are usually considered acceptable within ±1. The advantages of zeta scoring are that i) it takes explicit account of the laboratory's reported uncertainty; ii) it provides feedback on both the laboratory result and on the laboratory's uncertainty estimation procedures. The main disadvantages are that i) it cannot be directly related to an independent criterion of fitness-for-purpose; ii) pessimistic uncertainty estimates lead to consistently good zeta scores irrespective of whether they are fit for a particular task; and iii) the PT provider has no way of checking that reported uncertainties are the same as those given to customers, although a customer or accreditation body is able to check this if necessary. The $E_n$ number shares these characteristics, but adds two more. First, it additionally evaluates the laboratory's choice of coverage factor for converting standard to expanded uncertainty. This is an advantage. Second, unless the confidence level is set in advance, $E_n$ is sensitive to the level of confidence chosen both by participant and by provider in calculating $U(c_i)$ and $U_{cert}$. It is obviously important to ensure consistency in the use of coverage factors if $E_n$ numbers are to be compared.

It is clear that a single score cannot provide simultaneous information on whether laboratories meet external criteria (z-scores apply best here) and on whether they meet their own criteria (zeta or $E_n$ number scores apply best).
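A minimal sketch of the zeta and En calculations follows. The function names, the numerical values and the common coverage factor k = 2 are assumptions made for illustration only; the formulas are the root-sum-of-squares forms described in the text.

```python
import math


def zeta_score(c_i, u_i, c_cert, u_cert):
    """Zeta score: bias divided by the combined standard uncertainty."""
    return (c_i - c_cert) / math.sqrt(u_i ** 2 + u_cert ** 2)


def en_number(c_i, U_i, c_cert, U_cert):
    """En number: bias divided by the combined expanded uncertainty."""
    return (c_i - c_cert) / math.sqrt(U_i ** 2 + U_cert ** 2)


# Hypothetical participant result and assigned value (same units)
c_i, u_i = 10.6, 0.3
c_cert, u_cert = 10.0, 0.4
k = 2  # assumed coverage factor, used by both participant and provider

zeta = zeta_score(c_i, u_i, c_cert, u_cert)
en = en_number(c_i, k * u_i, c_cert, k * u_cert)
print(round(zeta, 2), round(en, 2))
```

When participant and provider use the same coverage factor k, En equals zeta divided by k, so the limit of ±1 for En with k = 2 corresponds to ±2 for zeta. With different coverage factors the two scores are no longer proportional, which is the sensitivity noted in the text.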

5. METROLOGICAL COMPARABILITY & COMPATIBILITY OF PT RESULTS

The meaning of metrological comparability of PT results is that, being traceable to the same metrological reference, they are comparable independently of the result values and of the associated measurement uncertainties. Since scoring a laboratory's proficiency in the discussed small PT schemes is based on evaluation of the bias $c_i - c_{cert}$ of the i-th laboratory result $c_i$ from the certified property value $c_{cert}$ of the test items, both PT results and the CRM certification (measurement) data should be comparable, i.e. traceable to the same metrological reference. The same is true for different runs of the PT scheme, when laboratory score values obtained in these runs are compared. Inasmuch as metrological comparability is a consequence of metrological traceability, the comparability pointer in Fig. 3 is directed like the traceability one.

Metrological compatibility can be interpreted for PT results as the property satisfied by each pair of PT results, so that the absolute value of the difference between them is smaller than some chosen multiple of the standard measurement uncertainty of that difference. Moreover, successful PT scoring means that the absolute value of the bias $|c_i - c_{cert}|$ is smaller than the corresponding chosen multiple of the bias standard uncertainty. In other words, a PT result is successful when it is compatible with the CRM (test item) certified value. Therefore compatibility is shown in Fig. 3 by a horizontal pointer uniting the direct and the inverted pyramids.

Thus, achieving the quality of measurement/analysis/test and PT results in the framework of the concept "tested once, accepted everywhere" [11, 25] requires both comparability and compatibility of the results.

When PT is based on the metrological approach, there are two key parameters for assessment of comparability & compatibility of results [26]: 1) the position of the CRM sent to the participants in the calibration hierarchy of measurement standards, and 2) the closeness of the distribution of PT results to the distribution of the CRM data.

The position of a CRM in the calibration hierarchy depends on the top measurement standard in the traceability chain. For example, if a CRM property value is traceable to SI units (scenarios I and II), it confirms world-wide comparability of PT results. Any PT scheme based on the use of IHRM with a limited traceability chain of the property value (not traceable to SI units: scenario III) provides the possibility of confirming local comparability only. The same situation existed in the classical fields of mass and length measurements before the Convention of the Metre, when measurement results in different countries were traceable to different national (local) measurement standards.

Whatever the traceability of the CRM property value used, the closeness of the distributions of the PT results and of the CRM data is important for the result compatibility and performance assessment. Since laboratory performance is assessed individually for each PT participant, even in a case when the performance of the majority of them is found to be successful, compatibility of all the PT results (i.e. a group performance characteristic of the laboratories participating in PT) still remains unassessed.

The situation is illustrated in Fig. 5, where both distribution density functions f of PT results (curve 1) and of CRM data (curve 2) are shown as normal ones. The vertical lines are the centers of these distributions: $c_{PT}$ and $c_{cert}$, respectively. The common shaded area P under the density function curves is the probability of obtained PT results belonging to the population of the CRM data. It can be considered as a parameter of compatibility. The value P tends to zero when the difference between $c_{PT}$ and $c_{cert}$ is significantly larger than the standard deviations $\sigma_{PT}$ and $u_{cert}$ of both distributions. The closer $c_{PT}$ is to $c_{cert}$ (shown by the semicircular pointers in Fig. 5), the higher the P value is.

[Figure 5: density f (vertical axis, 0.0 to 3.0) versus c (horizontal axis, 9.8 to 13.4), with curves 1 and 2, the centers $c_{PT}$ and $c_{cert}$, and the shaded area P.]

Fig. 5. Probability density functions f of PT results, curve 1, and of CRM data, curve 2; reproduced from ref. [16] by permission of Springer. Vertical lines are the centers of these distributions: $c_{PT}$ and $c_{cert}$, respectively. The common shaded area under the density function curves is the probability P of obtained PT results belonging to the population of the CRM data. The semicircular pointers show the direction of increasing compatibility.

The distributions, P values and hypotheses necessary for assessment of compatibility of results of a limited number N of PT participants, as a group, and suitable criteria for that based on analysis of the statistical sample characteristics (average $c_{PT/av}$, standard deviation $s_{PT}$, etc.) are discussed in detail in Annex A.

In principle, $c_{PT/av}$ and $s_{PT}$ are consensus values which cannot be used for a reliable assessment of an individual laboratory performance when the number of the laboratories participating in the PT scheme is limited. However, here the consensus values are used for another purpose: for comparison of PT results, as a statistical sample, with the CRM data (see Examples 1-4 in Annex B). The compatibility of PT results of a group of laboratories can be low if one or more laboratories from the group perform badly. Analysis of the reasons leading to such a situation, as well as of ways to correct it, is a task for the corresponding accreditation body and/or the regulator responsible for these laboratories and interested in the comparability & compatibility of the results.

6. EFFECT OF SMALL LABORATORY POPULATION ON SAMPLE ESTIMATES

The population of possible laboratory participants is not usually infinite. For example, the population size of possible PT participants in motor oil testing organized by the Israel Forum of Managers of Oil Laboratories was only $N_P$ = 12, while the statistical sample size, i.e. the number of participants who agreed to take part in the PT in different years, was N = 6 to 10 (see Annex B, Example 4). In such cases the sample fraction $\varphi = N/N_P$ = 6/12 to 10/12 = 0.5 to 0.8 (i.e. 50 to 80 %) is not negligible and corrections for finite population size are necessary in the statistical data analyses. The corrections include the standard deviation (standard uncertainty) of the sample mean $c_{PT/av}$ of N PT results, equal to

$$\sigma_{PT/av} = \sigma_{PT}\left\{\frac{(N_P - N)/(N_P - 1)}{N}\right\}^{1/2},$$

and the standard deviation of a PT result, equal to

$$s_{PT} = \sigma_{PT}\left[\frac{N_P}{N_P - 1}\right]^{1/2}.$$

After simple transformations the following formula for the sample mean can be obtained:

$$\frac{\sigma_{PT/av}}{\sigma_{PT}/\sqrt{N}} = \left[\frac{N_P - N}{N_P - 1}\right]^{1/2} = \left[\frac{1 - \varphi}{1 - 1/N_P}\right]^{1/2}.$$

The dependence of $\sigma_{PT/av}$ on $\varphi$ is shown (in units of $\sigma_{PT}/\sqrt{N}$) in Fig. 6 for the populations of $N_P$ = 10, 20 and 100 laboratories, curves 1, 2 and 3, respectively.

[Figure 6: $\sigma_{PT/av}/(\sigma_{PT}/\sqrt{N})$ (vertical axis, 0.4 to 1.0) versus the sample fraction $\varphi$, % (horizontal axis, 0 to 80), curves 1, 2 and 3.]

Fig. 6. Dependence of the standard deviation of the sample mean $\sigma_{PT/av}$ (in units of $\sigma_{PT}/\sqrt{N}$) on the sample fraction $\varphi$; reproduced from ref. [8] by permission of Springer. Curves 1, 2 and 3 are for the populations of $N_P$ = 10, 20 and 100 laboratories, respectively. The grey bar shows the intermediate range of sample fraction values $\varphi$ = 5 to 10 % (at $\varphi$ < 5 % corrections for a finite population size are negligible, as a rule).

Since at least two PT results are necessary for calculation of a standard deviation (i.e. the minimal sample size is N = 2), curve 1 is shown for $\varphi$ ≥ 20 %, curve 2 for $\varphi$ ≥ 10 %, and curve 3 for $\varphi$ ≥ 2 %. The population size has much less influence here than the sample fraction value.

The dependence of $s_{PT}$ on $\varphi$ by the formula $s_{PT}/\sigma_{PT} = [1/(1 - \varphi/N)]^{1/2}$ is weak in comparison with the previous one in Fig. 6, since the correction factor values are of 0.96 to 1.00 only for any event when the sample size is N = 10 to 100 PT results.

As $N_P$ increases and $\varphi$ decreases, the values $(N_P - N)/(N_P - 1) \to 1$ and $1/(1 - \varphi/N) \to 1$, and the corrections for finite population size disappear: $\sigma_{PT/av} \to \sigma_{PT}/\sqrt{N}$ and $s_{PT} \to \sigma_{PT}$. Therefore, the corrections are negligible for $\varphi$ values up to around 5 to 10 % (shown by the grey bars in Fig. 6).

These corrections should, however, be applied with care, only when the population is really finite.
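The finite-population corrections can be evaluated with a short Python sketch. The function names are illustrative, and the formulas are assumed to be the ones stated in this section: the factor [(N_P - N)/(N_P - 1)]^(1/2) for the standard deviation of the sample mean and [N_P/(N_P - 1)]^(1/2) for the standard deviation of a result. The population and sample sizes are those quoted for the motor oil PT.

```python
import math


def mean_sd_factor(N, N_P):
    """Ratio of sigma_PT/av to sigma_PT/sqrt(N) for a sample of N out of N_P."""
    return math.sqrt((N_P - N) / (N_P - 1))


def result_sd_factor(N_P):
    """Ratio s_PT / sigma_PT for a population of N_P laboratories."""
    return math.sqrt(N_P / (N_P - 1))


N_P = 12  # population size quoted for the motor oil example
for N in (6, 10):
    phi = N / N_P  # sample fraction
    # Same factor written through the sample fraction phi
    via_phi = math.sqrt((1 - phi) / (1 - 1 / N_P))
    print(N, round(phi, 2), round(mean_sd_factor(N, N_P), 3), round(via_phi, 3))
print(round(result_sd_factor(N_P), 3))
```

For this population the factor for the sample mean falls well below 1, which is why the correction cannot be neglected at sample fractions of 50 to 80 %.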

7. OUTLIERS

Since the number of PT results (the sample size N) is limited, it is also important to treat extreme results correctly if they are not caused by a known gross error or miscalculation. Even at large N, extreme results can provide valuable information to the PT provider and should not be disregarded entirely in analysis of the PT results without due consideration. When N is small, extreme results cannot usually be identified as outliers by known statistical tests because of the low power of these tests.

Fortunately, the metrological approach for small schemes makes outlier handling less important, since assigned values should not be calculated by consensus, and scores are not expected to be based on observed standard deviations. Accordingly, outliers have an effect on scoring only for the laboratory reporting outlying results and for the PT provider seeking the underlying causes of such problems.


8. EFFECTIVENESS OF APPROACHES TO PT

While traditional approaches to PT (using consensus values for assessment of a laboratory performance) are not acceptable for N < 30, the metrological one (based on the CRM use) is acceptable from statistical and metrological points of view for any N, including N ≥ 30 as well. However, a PT cost increasing with N should also be taken into account for any correct PT scheme design.


ANNEX A. CRITERIA FOR ASSESSMENT OF METROLOGICAL COMPATIBILITY OF PT RESULTS

CONTENTS

1. RELATIONSHIP BETWEEN THE DISTRIBUTION OF CRM ASSIGNED VALUE DATA AND THE DISTRIBUTION OF PT RESULTS

2. NULL AND ALTERNATIVE HYPOTHESES

3. A CRITERION FOR PT RESULTS BEING NORMALLY DISTRIBUTED

3.1. Example

3.2. Reliability of the assessment

4. A NON-PARAMETRIC TEST FOR PT RESULTS WITH AN UNKNOWN DISTRIBUTION

4.1. Reliability of the test

4.2. Example

4.3. Limitations

1. RELATIONSHIP BETWEEN THE DISTRIBUTION OF CRM ASSIGNED VALUE DATA AND THE DISTRIBUTION OF PT RESULTS

Data used for calculation of the CRM assigned value, and the measurement/analysis results of the laboratories participating in PT, can be considered as independent random events. Therefore, the relation between them can be characterized by the common area P under the density function curves for both CRM data and PT results. The P value is the probability of joint events and, therefore, the probability of obtained PT results belonging to the population of CRM data.

For the sake of simplicity, both distributions are assumed to be normal, with parameters $c_{cert}$, $\sigma_{cert}$ and $c_{PT}$, $\sigma_{PT}$, as shown in Fig. 7. The figure refers to a simulated example of aluminum determination in coal fly ashes using a CRM developed by NIST, USA: SRM 2690 with $c_{cert}$ = 12.35 % and $\sigma_{cert}$ = 0.14 % (as mass fraction) [27].

[Figure 7: density f (vertical axis, 0.0 to 3.0) versus c (horizontal axis, 9.8 to 13.4), with the centers $c_{PT}$ and $c_{cert}$ and the crossing points $c_1$ and $c_2$ marked.]

Fig. 7. Probability density functions f of the PT results and of the CRM data when $c_{PT}$ = 12.25 % and $\sigma_{PT}$ = 0.34 %; reproduced from ref. [27] by permission of RSC. Values $c_1$ and $c_2$ are the measurement/test results corresponding to the crossing points of the f curves.

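Since both density functions, $f_{cert}$ of CRM data and $f_{PT}$ of PT results, are equal at the $c_1$ and $c_2$ values, one can write

$$f_{PT} = \frac{1}{\sigma_{PT}\sqrt{2\pi}}\, e^{-(c - c_{PT})^2/2\sigma_{PT}^2} = \frac{1}{\sigma_{cert}\sqrt{2\pi}}\, e^{-(c - c_{cert})^2/2\sigma_{cert}^2} = f_{cert}. \qquad (1)$$

As shown in ref. [27], after transformations of expression (1), $c_1$ and $c_2$ can be calculated by the following formula:

$$c_{1,2} = \frac{c_{cert}\,\sigma_{PT}^2 - c_{PT}\,\sigma_{cert}^2 \mp \sigma_{PT}\,\sigma_{cert}\sqrt{\lambda}}{\sigma_{PT}^2 - \sigma_{cert}^2}, \qquad (2)$$

where

$$\lambda = (c_{cert} - c_{PT})^2 + 2(\sigma_{PT}^2 - \sigma_{cert}^2)\ln\frac{\sigma_{PT}}{\sigma_{cert}}. \qquad (3)$$

When $c_1$ and $c_2$ are known, the probability is conveniently calculated by the following formula:

$$P = \int_{-\infty}^{c_1} f_{cert}\,dc + \int_{c_1}^{c_2} f_{PT}\,dc + \int_{c_2}^{+\infty} f_{cert}\,dc = 1 + \Phi\!\left(\frac{c_1 - c_{cert}}{\sigma_{cert}}\right) - \Phi\!\left(\frac{c_2 - c_{cert}}{\sigma_{cert}}\right) + \Phi\!\left(\frac{c_2 - c_{PT}}{\sigma_{PT}}\right) - \Phi\!\left(\frac{c_1 - c_{PT}}{\sigma_{PT}}\right), \qquad (4)$$

where $\Phi$ stands for the normalized normal distribution function. For example, calculations by formulas (2)-(4) in the case shown in Fig. 7 yield $c_1$ = 12.16, $c_2$ = 12.58 and P = 0.58.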
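The worked numbers for Fig. 7 can be checked with the following sketch. It assumes that the crossing points are the roots of the quadratic obtained by equating the two normal densities, and that the common area is the narrower (CRM) density integrated outside the interval between the crossing points plus the wider (PT) density integrated inside it. The function names are illustrative, and the sketch is valid only when the PT standard deviation is larger than the CRM one, as in Fig. 7.

```python
import math
from statistics import NormalDist


def crossing_points(c_cert, s_cert, c_pt, s_pt):
    """Points c1 < c2 where the two normal densities are equal (s_pt > s_cert)."""
    lam = (c_cert - c_pt) ** 2 + 2 * (s_pt ** 2 - s_cert ** 2) * math.log(s_pt / s_cert)
    centre = c_cert * s_pt ** 2 - c_pt * s_cert ** 2
    half = s_pt * s_cert * math.sqrt(lam)
    denom = s_pt ** 2 - s_cert ** 2
    return (centre - half) / denom, (centre + half) / denom


def common_area(c_cert, s_cert, c_pt, s_pt):
    """Area under both density curves: CRM tails outside [c1, c2], PT body inside."""
    c1, c2 = crossing_points(c_cert, s_cert, c_pt, s_pt)
    cert = NormalDist(c_cert, s_cert)
    pt = NormalDist(c_pt, s_pt)
    return cert.cdf(c1) + (pt.cdf(c2) - pt.cdf(c1)) + (1 - cert.cdf(c2))


# Values quoted for SRM 2690 and the simulated PT results of Fig. 7
c1, c2 = crossing_points(12.35, 0.14, 12.25, 0.34)
P = common_area(12.35, 0.14, 12.25, 0.34)
print(round(c1, 2), round(c2, 2), round(P, 2))
```

Rounded to two decimals, the sketch gives the same c1, c2 and P as quoted in the text for this case.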

Information on the distributions of both PT results and CRM data is limited by the experimental statistical sample sizes. Therefore, the common area P under the probability density function curves of the distributions (the probability of obtained PT results belonging to the population of the CRM data) can adequately characterize the metrological compatibility only insofar as the goodness-of-fit of the empirical and theoretical distributions is high. However, the P value is of practical importance since it allows one to choose a suitable null hypothesis for a criterion of a yes-no type for assessment of the metrological compatibility of a relatively small (not infinite) number of PT results.


2. NULL AND ALTERNATIVE HYPOTHESES

The chosen null hypothesis H0 states that the metrological compatibility is satisfactory if the bias $|c_{PT} - c_{cert}|$ exceeds $\sigma_{cert}$ only by a value which is insignificant in comparison with random interlaboratory errors:

$$H_0:\ |c_{PT} - c_{cert}| \le [\sigma_{cert}^2 + (0.3\,\sigma_{PT})^2]^{1/2}, \qquad (5)$$

where a coefficient of 0.3 is used according to the known metrological rule defining one standard deviation as insignificant in comparison with another one when the former does not exceed 1/3 of the latter (i.e. the first variance is smaller than the second one by an order of magnitude). By this hypothesis, the probability P of considering the PT results as belonging to the population of CRM data is P ≥ 0.53 for the ratio $\gamma = \sigma_{cert}/\sigma_{PT}$ ≥ 0.4 (as shown in Fig. 7), when the right-hand side of expression (5) reaches the value of $1.25\,\sigma_{cert}$.

The alternative hypothesis H1 assumes that the metrological compatibility is not satisfactory and the bias $|c_{PT} - c_{cert}|$ exceeds $\sigma_{cert}$ significantly, for example:

$$H_1:\ |c_{PT} - c_{cert}| = 2.0\,[\sigma_{cert}^2 + (0.3\,\sigma_{PT})^2]^{1/2}, \qquad (6)$$

etc.

3. A CRITERION FOR PT RESULTS BEING NORMALLY DISTRIBUTED

The criterion for not rejecting H0 for a statistical sample of size N, i.e. for results of N laboratories participating in the PT, is

$$|c_{PT/av} - c_{cert}| + t_{1-\alpha/2}\, s_{PT}/\sqrt{N} \le [\sigma_{cert}^2 + (0.3\,\sigma_{PT})^2]^{1/2}, \qquad (7)$$

where $c_{PT/av}$ and $s_{PT}$ are the sample estimates of $c_{PT}$ and $\sigma_{PT}$ calculated from the same N results as the sample average and standard deviation, respectively; the left-hand side of the expression represents the upper limit of the confidence interval for the bias $|c_{PT} - c_{cert}|$; $t_{1-\alpha/2}$ is the percentile of the one-tailed Student's distribution for the number of degrees of freedom N - 1; the $1 - \alpha/2$ value is the probability of the bias not exceeding the upper limit of its confidence interval.

By substituting the ratio $\gamma = \sigma_{cert}/\sigma_{PT}$ and $s_{PT}/\sigma_{PT} = [\chi^2_{\alpha/2}/(N - 1)]^{1/2}$, where $\chi^2_{\alpha/2}$ is the $100\alpha/2$ percentile of the $\chi^2$ distribution for the number of degrees of freedom N - 1, into formula (7), the following transformation of the criterion is obtained:

$$\frac{|c_{PT/av} - c_{cert}|}{s_{PT}} \le \left[\frac{(\gamma^2 + 0.09)(N - 1)}{\chi^2_{\alpha/2}}\right]^{1/2} - \frac{t_{1-\alpha/2}}{\sqrt{N}}. \qquad (8)$$

Table 1 gives the numerical values for the right-hand side of the criterion at $\alpha$ = 0.05.

Table 1
The bias norms in $s_{PT}$ units by criterion (8)

$\gamma$ \ N     5      10     15     20     30     40     50
0.4             0.20   0.20   0.23   0.26   0.30   0.32   0.34
0.7             0.95   0.68   0.65   0.64   0.65   0.66   0.67
1.0             1.76   1.19   1.09   1.06   1.03   1.02   1.02

These values are the norms for the bias of the average PT result from the analyte concentration certified in the CRM (in $s_{PT}$ units). The value of $\gamma$ should be set based on the requirements for the analytical results, taking into account the fit-for-purpose value of $\sigma_{PT}$, which is equal either to the standard analytical/measurement uncertainty or to the target standard deviation $\sigma_{targ}$ calculated using the Horwitz curve [2, 3] or another database.
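Some entries of Table 1 can be recomputed with the sketch below. It assumes that the right-hand side of criterion (8) is [(gamma^2 + 0.09)(N - 1)/chi2]^(1/2) - t/sqrt(N), with gamma the ratio of the CRM to the PT standard deviation, t the 0.975 quantile of Student's distribution and chi2 the 0.025 quantile of the chi-square distribution, both for N - 1 degrees of freedom. The quantiles are hard-coded from standard statistical tables for three sample sizes only; the dictionary and function names are illustrative.

```python
import math

# N -> (t quantile at 0.975, chi-square quantile at 0.025), both for N - 1
# degrees of freedom; values copied from standard statistical tables.
QUANTILES = {
    5: (2.776, 0.4844),
    15: (2.145, 5.629),
    30: (2.045, 16.047),
}


def bias_norm(gamma, N):
    """Permissible bias of the PT average from the certified value, in s_PT units."""
    t, chi2 = QUANTILES[N]
    return math.sqrt((gamma ** 2 + 0.09) * (N - 1) / chi2) - t / math.sqrt(N)


for gamma in (0.4, 0.7, 1.0):
    print(gamma, [round(bias_norm(gamma, N), 2) for N in (5, 15, 30)])
```

With a statistics library that provides Student and chi-square quantile functions, the dictionary could be replaced by direct quantile calls to cover any N.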

3.1. Example

According to the ASTM standard [29], the means of the results of duplicate aluminum determinations in coal fly ashes carried out by different laboratories on riffled splits of the analysis sample should not differ by more than 2.0 % for Al2O3, i.e. 1.06 % for aluminum. Since the range for two laboratory results is limited by the standard, $\sigma_{PT}$ = 1.06/2.77 = 0.38 %, where 2.77 is the 95 % percentile of the range distribution. In the case of the discussed SRM 2690 with $\sigma_{cert}$ = 0.14 %, the value of $\gamma$ is 0.14/0.38 = 0.4. Simulated statistical samples of the PT results are given in Table 2. Metrological compatibility of results of the first 15 laboratories can be assessed as satisfactory by the norm in Table 1 for $\gamma$ = 0.4 (0.23), since $|c_{PT/av} - c_{cert}|$ = |12.30 - 12.35| = 0.05 < 0.23 $s_{PT}$ = 0.23 × 0.34 = 0.08 % (as mass fraction). The same is true concerning the metrological compatibility of results of all the 30 laboratories (the norm in Table 1 is 0.30): $|c_{PT/av} - c_{cert}|$ = |12.38 - 12.35| = 0.03 < 0.30 $s_{PT}$ = 0.30 × 0.35 = 0.11 %.

For other detailed examples see Annex B, Examples 3 and 4.


Table 2
PT results of aluminum determination in SRM 2690 (simulated in % as mass fraction)

Lab. No. i    100·c_i      Lab. No. i    100·c_i
1             12.76        16            12.60
2             12.19        17            12.81
3             12.68        18            12.39
4             12.21        19            11.96
5             12.96        20            11.91
6             12.27        21            11.86
7             11.96        22            12.32
8             12.03        23            12.53
9             11.88        24            12.84
10            11.97        25            12.67
11            12.23        26            12.86
12            12.48        27            12.75
13            12.69        28            12.66
14            12.21        29            11.99
15            11.98        30            12.61

Laboratories 1 to 15:    c_PT/av = 12.30,  s_PT = 0.34
All 30 laboratories:     c_PT/av = 12.38,  s_PT = 0.35
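The summary statistics of Table 2 and the compatibility check of the example can be reproduced as follows. The data are the simulated results listed in Table 2; the function name is illustrative, and the bias norms 0.23 and 0.30 are simply read from Table 1 for N = 15 and N = 30 rather than recomputed.

```python
from statistics import mean, stdev

C_CERT = 12.35  # certified value of SRM 2690, % as mass fraction

labs_1_15 = [12.76, 12.19, 12.68, 12.21, 12.96, 12.27, 11.96, 12.03,
             11.88, 11.97, 12.23, 12.48, 12.69, 12.21, 11.98]
labs_16_30 = [12.60, 12.81, 12.39, 11.96, 11.91, 11.86, 12.32, 12.53,
              12.84, 12.67, 12.86, 12.75, 12.66, 11.99, 12.61]


def assess(results, norm):
    """Return (average, standard deviation, True if |bias| <= norm * s_PT)."""
    c_av = mean(results)
    s_pt = stdev(results)  # sample standard deviation (N - 1 in the denominator)
    return c_av, s_pt, abs(c_av - C_CERT) <= norm * s_pt


first15 = assess(labs_1_15, norm=0.23)            # norm from Table 1, N = 15
all30 = assess(labs_1_15 + labs_16_30, norm=0.30)  # norm from Table 1, N = 30
print([round(x, 2) for x in first15[:2]], first15[2])
print([round(x, 2) for x in all30[:2]], all30[2])
```

Both samples satisfy the criterion, in agreement with the conclusion of the example.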

3.2. Reliability of the assessment

Reliability in such metrological compatibility assessment is determined by the probabilities of not rejecting the null hypothesis H0 when it is true, and of rejecting it when it is false (i.e. when the alternative hypothesis H1 is true). Criterion (8) does not allow rejecting hypothesis H0 with probability $1 - \alpha/2$ when it is true. The probability of an error of type 1 by this criterion (to reject the H0 hypothesis when it is true) is $\alpha/2$. The probability of rejecting H0 when it is false, i.e. when the alternative hypothesis H1 is actually true (the criterion power, CP), is:

$$CP = \Phi\left\{\frac{\delta - t_{1-\alpha/2}}{[1 + t_{1-\alpha/2}^2/2(N - 1)]^{1/2}}\right\}, \qquad (9)$$

where

$$\delta = \frac{|c_{PT} - c_{cert}| - (\gamma^2 + 0.09)^{1/2}\,\sigma_{PT}}{\sigma_{PT}/\sqrt{N}}. \qquad (10)$$

The value of the deviation parameter $\delta$ is calculated by substituting the bias $|c_{PT} - c_{cert}|$ in equation (10) by its value corresponding to the alternative hypothesis. For hypothesis H1 by formula (6) the substitution is $2.0\,[\sigma_{cert}^2 + (0.3\,\sigma_{PT})^2]^{1/2}$ and, therefore, $\delta = [(0.09 + \gamma^2)N]^{1/2}$. The probability of an error of type 2 (not rejecting H0 when it is false) equals $\beta = 1 - CP$. Both operational characteristics of the criterion, CP and $\beta$, are shown in Fig. 8 at $\alpha$ = 0.05 for different $\gamma$ values and different numbers N of the PT participants.

Thus, the reliability of the compatibility assessment using the hypotheses H0 against H1 for the PT scheme for aluminum determination in coal fly ashes (where $\gamma$ = 0.4) can be characterized by 1) probability $1 - \alpha/2$ = 0.975 of the correct assessment of the compatibility as successful (i.e. not rejecting the null hypothesis H0 when it is true) for any number N of the laboratories participating in PT, and by 2) probability CP = 0.42 of correct assessment of the compatibility as unsuccessful (i.e. rejecting H0 when the alternative hypothesis H1 is true) for N = 15, and probability CP = 0.75 for N = 30 results. Probability $\alpha/2$ of a type 1 error is 0.025 for any N, while probability $\beta$ of a type 2 error is 0.58 for N = 15, and 0.25 for N = 30, etc.
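The quoted power values can be checked numerically. The sketch assumes one reading of formulas (9) and (10): the power is the standard normal distribution function of (delta - t) / [1 + t^2 / (2(N - 1))]^(1/2), with delta = [(0.09 + gamma^2) N]^(1/2) under the alternative hypothesis (6) and t the 0.975 quantile of Student's distribution for N - 1 degrees of freedom. The t values are taken from standard tables; the names are illustrative.

```python
import math
from statistics import NormalDist

# 0.975 quantiles of Student's distribution for N - 1 = 14 and 29 degrees
# of freedom (standard table values)
T_975 = {15: 2.145, 30: 2.045}


def criterion_power(gamma, N):
    """Approximate probability of rejecting H0 when H1 (bias doubled) is true."""
    t = T_975[N]
    delta = math.sqrt((0.09 + gamma ** 2) * N)  # deviation parameter under H1
    return NormalDist().cdf((delta - t) / math.sqrt(1 + t ** 2 / (2 * (N - 1))))


for N in (15, 30):
    cp = criterion_power(0.4, N)
    print(N, round(cp, 2), round(1 - cp, 2))  # power and type 2 error probability
```

Under this reading the sketch returns the power values 0.42 and 0.75 quoted for N = 15 and N = 30.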

[Figure 8: CP and $\beta$ (vertical axes, 0 to 1) versus N (horizontal axis, 5 to 45), curves 1 and 2.]

Fig. 8. Power CP of the criterion and probability $\beta$ of an error of type 2 (in dependence on the number N of laboratories participating in PT) for probability $\alpha/2$ = 0.025 of an error of type 1; reproduced from ref. [28] by permission of Springer. Curve 1 is at $\gamma$ = 0.4, and curve 2 at $\gamma$ = 1.0.

The power of criterion (8) is high (CP > 0.5) for a number of PT participants N ≥ 20.


4. A NON-PARAMETRIC TEST FOR PT RESULTS WITH UNKNOWN DISTRIBUTION

In the case of unknown distributions differing from the normal one, the median is more robust than the average, i.e. better reproduced in repeated experiments, being less sensitive to extreme results/outliers. Therefore, the null hypothesis, assuming here that the bias of PT results exceeds $\sigma_{cert}$ by a value which is insignificant in comparison with random interlaboratory errors, has the following form:

$$H_0:\ |M_{PT} - c_{cert}| \le [\sigma_{cert}^2 + (0.3\,\sigma_{PT})^2]^{1/2} = \Delta, \qquad (11)$$

where $M_{PT}$ is the median of PT results of a hypothetically infinite number N of participants, i.e. the population median.

If $M_{PT} \ge c_{cert}$, the null hypothesis H0 implies that the probability $P_e$ of an event when a result $c_i$ of the i-th PT-participating laboratory exceeds the value $c_{cert} + \Delta$ is $P_e\{c_i > c_{cert} + \Delta\} \le 1/2$ according to the median definition. If $M_{PT} < c_{cert}$, the probability of $c_i$ falling below the value $c_{cert} - \Delta$ is also $P_e\{c_i < c_{cert} - \Delta\} \le 1/2$. The alternative hypothesis assumes that the bias exceeds $\sigma_{cert}$ significantly and the probabilities of the events described above are $P_e > 1/2$, for example:

$$H_1:\ |M_{PT} - c_{cert}| = 2\Delta, \qquad (12)$$

where $\Delta$ is the same as in expression (11). Probabilities $P_e$ of the events according to the alternative hypothesis H1 at normal distribution (depending on the permissible bias $\Delta$ in $\sigma_{PT}$ units at different $\gamma$ values) are shown in Table 3.


Table 3
Probability $P_e$ according to alternative hypothesis H1

$\gamma$     $\Delta/\sigma_{PT}$     $P_e$
0.4          0.50                     0.69
0.7          0.75                     0.77
1.0          1.04                     0.85

Since the population median is unknown in practice, and the results of N laboratories participating in PT form an N-size statistical sample from the population, hypothesis H0 is not rejected when the upper limit of the median confidence interval does not exceed $c_{cert} + \Delta$, or the lower limit is not below $c_{cert} - \Delta$. The limits can be evaluated based on the simplest non-parametric sign test [30]. According to this test, the number $N_+$ of results $c_i > c_{cert} + \Delta$ or the number $N_-$ of results $c_i < c_{cert} - \Delta$ should not exceed the critical value A (the bias norm) in order not to reject H0. The A values are available, for example, in ref. [31]. For N from 5 to 50 PT participants and levels of confidence 0.975 ($\alpha/2$ = 1 - 0.975 = 0.025) and 0.95 ($\alpha/2$ = 0.05), these values are shown in Table 4. The A value for fewer than six participants at $\alpha/2$ = 0.025 cannot be determined and, therefore, is not presented in Table 4 for N = 5.

Table 4
The bias norms A by the sign test

$\alpha/2$ \ N     5     10    15    20    30    40    50
0.025              -     1     3     5     9     13    17
0.05               0     1     3     5     10    14    18
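The entries of Table 4 agree with the usual lower-tail critical values of the sign test, which can be computed from the binomial distribution with success probability 1/2. The sketch below makes that assumption explicitly: A is taken as the largest integer for which the cumulative binomial probability does not exceed alpha/2. The function name is illustrative, and ref. [31] may tabulate the values in a different way.

```python
from math import comb


def sign_test_critical_value(N, alpha_half):
    """Largest A with P(X <= A) <= alpha_half for X ~ Binomial(N, 1/2).

    Returns None when even A = 0 exceeds alpha_half (no critical value exists).
    """
    total = 2 ** N
    cumulative = 0
    A = None
    for k in range(N + 1):
        cumulative += comb(N, k)
        if cumulative / total <= alpha_half:
            A = k
        else:
            break
    return A


for alpha_half in (0.025, 0.05):
    print(alpha_half,
          [sign_test_critical_value(N, alpha_half) for N in (5, 10, 15, 20, 30, 40, 50)])
```

For N = 5 and alpha/2 = 0.025 the function returns None, because the smallest attainable tail probability, 1/32, is already larger than 0.025; this corresponds to the missing entry in Table 4.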


4.1. Reliability of the test

The test does not allow rejecting hypothesis H0 with a probability of $1 - \alpha/2$ when it is true. The probability of an error of type 1 by this test (to reject the H0 hypothesis when it is true) is $\alpha/2$. The probability of rejecting the null hypothesis when it is false, i.e. when the alternative hypothesis is actually true (the test power, TP), is tabulated in ref. [31]. The probability of a type 2 error (not rejecting H0 when it is false) equals $\beta = 1 - TP$. The operational characteristics of the test (TP and $\beta$) are shown in Fig. 9 at $\alpha$ = 0.05 for the alternative hypothesis H1 at different $\gamma$ values and different numbers N of the PT participants.

[Figure 9: TP and $\beta$ (vertical axes, 0.0 to 1.0) versus N (horizontal axis, 5 to 45), curves 1 and 2.]

Fig. 9. Power TP of the test and probability $\beta$ of an error of type 2 in dependence on the number N of laboratories participating in PT, when the probability of an error of type 1 is $\alpha/2$ = 0.025; reproduced from ref. [30] by permission of Springer. The null hypothesis H0 is tested against the alternative hypotheses H1 at $\gamma$ = 0.4 and $\gamma$ = 1.0, shown by curves 1 and 2, respectively.

4.2. Example

The hypothesis about normal distribution of the PT results in the example shown in Table 2 was not tested because of the small size of the statistical samples. Therefore, the sample size is increased here to N = 50: the simulated data are presented in Table 5 (the simulation is performed by the known method of successive approximations). Such a sample size allows testing the hypothesis about the normal distribution of the data by applying the Cramer-von-Mises $\omega^2$-criterion, which is powerful for statistical samples of small sizes [32]:

$$\omega^2 = -N - 2\sum_{j=1}^{N}\left\{\frac{2j - 1}{2N}\ln\Phi(x_j) + \left[1 - \frac{2j - 1}{2N}\right]\ln[1 - \Phi(x_j)]\right\}, \qquad (13)$$

where j = 1, 2, ..., N is the number of the PT result $c_j$ in the statistical sample ranked by increasing c value ($c_1 \le c_2 \le \dots \le c_N$); $x_j = (c_j - c_{PT/av})/s_{PT}$ is the normalized value of the j-th result, which is distributed with the mean of 0 and the standard deviation of 1; and $\Phi(x_j)$ is the value of the function of the normalized normal distribution for $x_j$.

The value $\omega^2$ = 1.95 calculated by formula (13) for the data in Table 5 exceeds the critical value 1.94 (for N = 50), which is exceeded by chance with a probability of 0.10 [31]. Therefore, the hypothesis about normal distribution of these data should be rejected at the level of confidence of 0.90. The corresponding empirical histogram and the theoretical (normal) distribution are shown in Fig. 10. It is clear that the empirical distribution is a bimodal one; therefore, no normal distribution can fit it. Since other known distributions are also not suitable here, let us apply the proposed non-parametric test for the compatibility assessment of the results.

Table 5
PT results of aluminum determination in SRM 2690 (simulated in % as mass fraction) ranked according to their increasing value

No. j  100·c_j  c_j - c_cert  Sign    No. j  100·c_j  c_j - c_cert  Sign    No. j  100·c_j  c_j - c_cert  Sign
1      11.86    -0.49         -       18     12.44    0.09          0       35     12.53    0.18          0
2      11.88    -0.47         -       19     12.44    0.09          0       36     12.55    0.20          +
3      11.90    -0.45         -       20     12.45    0.10          0       37     12.56    0.21          +
4      11.91    -0.44         -       21     12.46    0.11          0       38     12.57    0.22          +
5      11.93    -0.42         -       22     12.46    0.11          0       39     12.60    0.25          +
6      11.96    -0.39         -       23     12.47    0.12          0       40     12.61    0.26          +
7      11.96    -0.39         -       24     12.48    0.13          0       41     12.64    0.29          +
8      11.97    -0.38         -       25     12.49    0.14          0       42     12.66    0.31          +
9      11.98    -0.37         -       26     12.49    0.14          0       43     12.67    0.32          +
10     11.99    -0.36         -       27     12.50    0.15          0       44     12.68    0.33          +
11     12.03    -0.32         -       28     12.50    0.15          0       45     12.69    0.34          +
12     12.07    -0.28         -       29     12.51    0.16          0       46     12.76    0.41          +
13     12.17    -0.18         0       30     12.51    0.16          0       47     12.81    0.46          +
14     12.19    -0.16         0       31     12.52    0.17          0       48     12.84    0.49          +
15     12.20    -0.15         0       32     12.52    0.17          0       49     12.90    0.55          +
16     12.34    -0.01         0       33     12.53    0.18          0       50     12.96    0.61          +
17     12.43    0.08          0       34     12.53    0.18          0       N- = 12; N+ = 15

Taking into account $c_{cert}$ = 12.35 %, $\sigma_{cert}$ = 0.14 %, $\sigma_{PT}$ = 0.38 %, and $\gamma$ = 0.14/0.38 = 0.4, one can calculate $\Delta$ = 0.50 × 0.38 = 0.19 % (Table 3), $c_{cert} + \Delta$ = 12.54 % and $c_{cert} - \Delta$ = 12.16 %. There are $N_+$ = 15 results $c_j$ > 12.54 %, $N_-$ = 12 results $c_j$ < 12.16 %, and $N - N_+ - N_-$ = 23 values in the range $c_{cert} \pm \Delta$. The sample median found is $c_{25} = c_{26}$ = 12.49 % > $c_{cert}$ = 12.35 % and $N_+ > N_-$. However, $N_+$ is lower than the critical value A = 17 at $\alpha/2$ = 0.025 and N = 50 (Table 4). Therefore, the null hypothesis H0 concerning successful metrological compatibility of the results is not rejected.
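The counting step of this example can be sketched as follows. The data are the 50 simulated results of Table 5; the bias norm of 0.19 % and the critical value A = 17 are taken from the text and from Table 4, and the variable names are illustrative.

```python
C_CERT = 12.35  # certified value, % as mass fraction
DELTA = 0.19    # permissible bias, %, as calculated in the text
A_CRIT = 17     # Table 4, N = 50, alpha/2 = 0.025

results = [
    11.86, 11.88, 11.90, 11.91, 11.93, 11.96, 11.96, 11.97, 11.98, 11.99,
    12.03, 12.07, 12.17, 12.19, 12.20, 12.34, 12.43, 12.44, 12.44, 12.45,
    12.46, 12.46, 12.47, 12.48, 12.49, 12.49, 12.50, 12.50, 12.51, 12.51,
    12.52, 12.52, 12.53, 12.53, 12.53, 12.55, 12.56, 12.57, 12.60, 12.61,
    12.64, 12.66, 12.67, 12.68, 12.69, 12.76, 12.81, 12.84, 12.90, 12.96,
]

upper = round(C_CERT + DELTA, 2)  # 12.54
lower = round(C_CERT - DELTA, 2)  # 12.16

n_plus = sum(c > upper for c in results)    # results above c_cert + Delta
n_minus = sum(c < lower for c in results)   # results below c_cert - Delta
n_inside = len(results) - n_plus - n_minus  # results within c_cert +/- Delta

# H0 is not rejected while neither count exceeds the critical value A
h0_not_rejected = max(n_plus, n_minus) <= A_CRIT
print(n_minus, n_inside, n_plus, h0_not_rejected)
```

The limits are rounded to two decimals so that floating-point noise in the sums cannot affect the comparison with results reported to two decimals.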

[Figure 10: frequency F (vertical axis, 0.00 to 0.50) versus c, % (horizontal axis, 11.7 to 13.2).]

Fig. 10. Histogram of PT results (frequency F of a result value c), solid line, and the fitted normal distribution, dotted line; reproduced from ref. [30] by permission of Springer.

Reliability of the assessment with hypotheses H0 against H1 for this case can be characterized by: 1) probability $1 - \alpha/2$ = 0.975 of correct assessment of the compatibility as successful (not rejecting the null hypothesis when it is true) for any number N ≥ 6 of the PT participants, and 2) probability TP = 0.73 of correct assessment of the compatibility of N = 50 PT results as unsuccessful (rejecting H0 when the alternative hypothesis H1 is true). Probability $\alpha/2$ of a type 1 error is 0.025 for any N ≥ 6, while probability $\beta$ of a type 2 error is 0.27 for N = 50.

For additional examples of the use of the sign test see Annex B, Examples 1 and 2; for application of the $\omega^2$-criterion see Example 3.

6

4.3. Limitations7

Since the critical values A of the sign test are determined only for N ≥ 4 to 8, depending on the probability α, and the test power is likewise calculated only for N ≥ 6 to 8, the proposed metrological compatibility assessment cannot be performed for a smaller sample size. The power efficiency of the sign test in relation to the t-test (the ratio of the sizes N of statistical samples from normal populations allowing the same power) ranges from 0.96 for N = 5 to 0.64 for infinite N. For example, practically the same power (0.73 and 0.75) was achieved in the sign test of the compatibility of PT results for aluminum determination in coal fly ashes at N = 50 discussed above, and in the t-test for the same purpose at N = 30 in paragraph 3. The power efficiency here is approximately 30/50 = 0.6. On the other hand, when information about the distribution of PT results is limited by N < 50, it is difficult to evaluate the goodness of fit between the empirical and the theoretical (normal) distributions, the decrease of the t-test power, and the corresponding decrease in reliability of the compatibility assessment caused by deviation of the empirical distribution from the normal one.
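As a worked check of the figures quoted above, the efficiency estimated from the two aluminum examples can be set beside the large-sample limit. The symbol E and the subscripts are introduced here for illustration only; the limit 2/π is the known asymptotic relative efficiency of the sign test with respect to the t-test for normal populations, which rounds to the 0.64 quoted in the text.

```latex
E \approx \frac{N_{t}}{N_{\mathrm{sign}}} = \frac{30}{50} = 0.6 ,
\qquad
\lim_{N \to \infty} E = \frac{2}{\pi} \approx 0.64
```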


ANNEX B. EXAMPLES

CONTENTS

EXAMPLE 1. SCENARIO 1: PT FOR LEAD DETERMINATION IN AIRBORNE PARTICLES
1.1. Aim of the PT
1.2. Procedure for preparation of the IHRM
1.3. Analytical methods used and raw data
1.4. Statistical analysis of the data
1.4.1. Metrological compatibility assessment

EXAMPLE 2. SCENARIO 2: PT FOR ARSENIC DETERMINATION IN WATER
2.1. Aim of the PT
2.4. Statistical analysis of the data

EXAMPLE 3. SCENARIO 3: PT FOR DETERMINATION OF CONCRETE COMPRESSIVE STRENGTH
3.1. Aim of the PT
3.2.1. IHRM homogeneity, certified value and its uncertainty
3.3. Methods used and raw data

EXAMPLE 4. A LIMITED POPULATION OF PT PARTICIPANTS: PT FOR ACID NUMBER DETERMINATION IN USED MOTOR OILS
4.1. Aim of the PT
4.2.1. Characterization of the IHRM
4.3. Methods used and raw data

EXAMPLE 5. SELECTION OF THE MOST COMMUTABLE (ADEQUATE) CRM FOR PT OF CEMENTS
5.1. Twelve components
5.2. Six components
5.3. One component
5.4. Sensitivity coefficient

EXAMPLE 1. SCENARIO 1: PT FOR LEAD DETERMINATION IN AIRBORNE PARTICLES

1.1. Aim of the PT

The objectives of this PT were to determine whether the quality criteria described in the European Directives [33, 34] concerning the analysis of As, Cd, Ni and Pb in airborne particles are met and the most important sources of uncertainty are identified. The standard [35] divides the measurement method into two main parts: first, the sampling in the field and, second, the analysis in the laboratory. During sampling, particles are collected by drawing a measured volume of air through a filter mounted in a sampler designed to collect the fraction of suspended particulate matter of less than 10 µm (PM10) [36]. The sample filter is transported to the laboratory and the analytes are taken into solution by closed-vessel microwave digestion using nitric acid and hydrogen peroxide. The resultant solution is analysed by known analytical methods. When the quantity of an analyte in the solution is measured, its concentration can be expressed in ng/m³ of the sampled air.

The PT was organized in 2005 and focused on the second (analytical) part of the method. The PT provider was the Ecole des Mines de Douai (EMD) supported by the Laboratoire National de Métrologie et d'Essais (LNE). Ten laboratories (N = 10) of the Association Agréées de Surveillance de la Qualité de l'Air participated in this trial.

Results for lead only are discussed below for brevity.


1.2. Procedure for preparation of the IHRM

The PM10 fraction of suspended particulate matter was collected by EMD on an industrial site according to the standard [36]. The sampling was performed on 20 quartz filters (diameter of 50 mm) during one week at a flow rate of 1 m³ h⁻¹, which means a total of 168 m³. Dust on the filters was then digested with 5 ml HNO₃ + 1 ml H₂O₂ in a closed microwave oven.

The LNE was in charge of preparing one liter of a solution from the digestion residue, which could be used in the PT as an IHRM. The assigned/certified value of the lead content in the solution, ccert = 26.72 µg l⁻¹, provided by LNE, was obtained with a primary method: isotope dilution inductively coupled plasma mass spectrometry (ID-ICP-MS). This content corresponds to 26.72 × 1000/168 = 159 ng m⁻³ Pb in the sampled air. The expanded measurement uncertainty of the certified value was Ucert = 0.77 µg l⁻¹ at the level of confidence 0.95 and the coverage factor of 2. No stability tests were conducted, since the laboratories used the solution just after the preparation. The uncertainty due to inhomogeneity of the one-liter solution was considered negligible. Note that the standard uncertainty was ucert = 0.77/2 = 0.38 µg l⁻¹, i.e. 1.4 % of the certified value.

Each laboratory received a bottle of 50 ml of this solution (for all analytes).
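The two conversions in this subsection can be checked with a few lines of arithmetic. The sketch below only repeats the calculation with the values quoted in the text; the variable names are illustrative and are not part of the Guide.

```python
# Values quoted in the text for lead in the IHRM solution (Example 1).
c_cert_ug_per_l = 26.72    # certified Pb content of the solution, µg/l
solution_volume_l = 1.0    # one liter of solution prepared by LNE
air_volume_m3 = 168.0      # sampled air: 1 m³/h during one week (168 h)
U_cert_ug_per_l = 0.77     # expanded uncertainty of the certified value, µg/l
k = 2.0                    # coverage factor stated in the text

# Pb content referred to the sampled air, in ng/m³ (1 µg = 1000 ng).
pb_in_air_ng_per_m3 = c_cert_ug_per_l * solution_volume_l * 1000.0 / air_volume_m3

# Standard uncertainty and its value relative to the certified value, in %.
u_cert = U_cert_ug_per_l / k
u_cert_rel_percent = 100.0 * u_cert / c_cert_ug_per_l

print(round(pb_in_air_ng_per_m3), round(u_cert, 3), round(u_cert_rel_percent, 1))
```

The relative standard uncertainty obtained this way rounds to the 1.4 % quoted in the text.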

1.3. Analytical methods used and raw data

The list of the laboratories-participants was confidential. All of them followed the standard [35]. The methods used were: inductively coupled plasma mass spectrometry (ICP-MS), graphite furnace atomic absorption spectrometry (GF-AAS), and inductively coupled plasma optical emission spectroscopy with ultrasonic nebulization (ICP-OES-USN). The measurement results ci of the i-th laboratory, i = 1, 2, …, N = 10, are shown in Table 6.


1.4. Statistical analysis of the data

There was no statistically significant dependence of the results on the analytical method used. The robust value of the experimental standard deviation sPT of a laboratory result ci, calculated by the LNE from the data shown in Table 6 using Algorithm A of the standards [3, 37], was 3.93 µg l⁻¹, i.e. 14.7 % of the certified value. Since the expanded uncertainty stated for lead in the European Directives [33, 34] and the standard [35, p. 30] is 25 %, the target value for the standard deviation of a laboratory result in the PT was σtarg = 25/2 = 12.5 % or 3.34 µg l⁻¹.
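The Guide cites Algorithm A without restating it. The sketch below follows the procedure as it is commonly described for ISO 13528: start from the median and the scaled median absolute deviation, winsorize at x* ± 1.5 s*, recompute, and iterate. The function name, the stopping tolerance and the iteration cap are assumptions made here; the stopping rule and rounding used by LNE are not stated, so on the Table 6 data the result is expected to land close to, though not necessarily exactly on, the reported 3.93 µg l⁻¹.

```python
import statistics

def algorithm_a(values, tol=1e-6, max_iter=100):
    """Robust mean x* and robust standard deviation s* by iterated winsorization.

    Sketch of Algorithm A as commonly described for ISO 13528; `tol` and
    `max_iter` are assumptions of this example, not values from the standard.
    """
    x_star = statistics.median(values)
    s_star = 1.483 * statistics.median([abs(v - x_star) for v in values])
    for _ in range(max_iter):
        delta = 1.5 * s_star
        low, high = x_star - delta, x_star + delta
        # Pull values outside [low, high] back to the nearer limit.
        clipped = [min(max(v, low), high) for v in values]
        new_x = statistics.mean(clipped)
        new_s = 1.134 * statistics.stdev(clipped)
        converged = abs(new_x - x_star) < tol and abs(new_s - s_star) < tol
        x_star, s_star = new_x, new_s
        if converged:
            break
    return x_star, s_star

# Laboratory results ci from Table 6, µg/l.
results = [20.12, 20.28, 30.34, 29.00, 25.00, 28.40, 27.80, 25.70, 28.20, 25.51]
c_cert = 26.72

x_star, s_pt = algorithm_a(results)
sigma_targ = 0.125 * c_cert  # 25 % / 2 = 12.5 % of the certified value

print(f"s_PT = {s_pt:.2f} µg/l, sigma_targ = {sigma_targ:.2f} µg/l")
```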

Table 6
Results of the PT for lead content determination in the solution

Lab No, i   Method         ci, µg l⁻¹   ci - ccert, µg l⁻¹   zi      Sign
1           ICP-MS         20.12        -6.60                -1.98   -
2           ICP-MS         20.28        -6.44                -1.93   -
3           ICP-OES-USN    30.34         3.62                 1.08   +
4           GF-AAS         29.00         2.28                 0.68   +
5           ICP-MS         25.00        -1.72                -0.51   -
6           GF-AAS         28.40         1.68                 0.50   +
7           ICP-MS         27.80         1.08                 0.32   +
8           ICP-MS         25.70        -1.02                -0.31   -
9           GF-AAS         28.20         1.48                 0.44   +
10          ICP-MS         25.51        -1.21                -0.36   -

The uncertainty of the certified value, ucert = 1.4 %, was negligible in comparison with σtarg, so the z-score based on the target value σtarg was applicable for the proficiency testing. The calculated z-score values are shown in Table 6. All of them are between −2 and +2 and were therefore interpreted as satisfactory.
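The z-scores in Table 6 can be recomputed from the tabulated results as zi = (ci − ccert)/σtarg. The sketch below does this with the values given in the text; the dictionary layout and variable names are choices made for this example.

```python
c_cert = 26.72               # certified value, µg/l
sigma_targ = 0.125 * c_cert  # target standard deviation: 12.5 % of c_cert

# Laboratory number -> reported result ci, µg/l (Table 6).
results = {
    1: 20.12, 2: 20.28, 3: 30.34, 4: 29.00, 5: 25.00,
    6: 28.40, 7: 27.80, 8: 25.70, 9: 28.20, 10: 25.51,
}

z_scores = {lab: (c - c_cert) / sigma_targ for lab, c in results.items()}

# A result is read as satisfactory when its z-score lies between -2 and +2.
all_satisfactory = all(abs(z) <= 2 for z in z_scores.values())

for lab, z in z_scores.items():
    print(f"{lab:2d}  {z:+.2f}")
print("all satisfactory:", all_satisfactory)
```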


1.4.1. Metrological compatibility assessment

Since a hypothesis on the normal distribution of the PT results was not taken into account, compatibility of the results (as a group) is tested based on non-parametric statistics as shown in Annex A, para. 4.

As the standard uncertainty of the certified value ucert = 1.4 % was insignificant in comparison with the target standard deviation of PT results σtarg = 12.5 %, the permissible bias of the median of the PT results from the certified value was Δ = 0.3σtarg = 3.75 % or 1.00 µg l⁻¹. Therefore, ccert + Δ = 27.72 µg l⁻¹ and ccert − Δ = 25.72 µg l⁻¹. There were N+ = 5 results ci > 27.72 µg l⁻¹ and N− = 5 results ci < 25.72 µg l⁻¹. They are shown in Table 6 as signs "+" and "-", respectively. Both N+ and N− values are higher than the critical value A = 1 in Table 4. Therefore, the null hypothesis H0 concerning compatibility of this group of results should be rejected, in spite of the satisfactory z-score values for every laboratory-participant of the PT. The probability of a type 1 error (rejecting the hypothesis when it is correct) for this decision is 0.025, while the probability of a type 2 error (not rejecting the hypothesis when it is false) is above 0.85 according to Fig. 9.
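The counting step of this sign test can be sketched as follows. The function name is hypothetical, the critical value A = 1 is taken from Table 4 as quoted in the text rather than computed, and the decision rule (reject H0 when N+ or N− exceeds A) is inferred from the two worked cases in the Guide; how a count exactly equal to A is treated is an assumption of this sketch.

```python
def sign_test_compatibility(results, c_cert, delta, critical_a):
    """Count results above c_cert + delta and below c_cert - delta.

    Returns (n_plus, n_minus, reject_h0). The strict '>' comparison with the
    critical value is an assumption; the Guide's Table 4 is not reproduced here.
    """
    n_plus = sum(1 for c in results if c > c_cert + delta)
    n_minus = sum(1 for c in results if c < c_cert - delta)
    reject_h0 = n_plus > critical_a or n_minus > critical_a
    return n_plus, n_minus, reject_h0

# Laboratory results ci from Table 6, µg/l.
results = [20.12, 20.28, 30.34, 29.00, 25.00, 28.40, 27.80, 25.70, 28.20, 25.51]
c_cert = 26.72
sigma_targ = 0.125 * c_cert   # 12.5 % of the certified value
delta = 0.3 * sigma_targ      # permissible bias, about 1.00 µg/l

n_plus, n_minus, reject_h0 = sign_test_compatibility(
    results, c_cert, delta, critical_a=1
)
print(n_plus, n_minus, reject_h0)
```

With the Table 6 data this gives five results on each side of the permissible range, so H0 is rejected, in line with the conclusion in the text.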

EXAMPLE 2. SCENARIO 2: PT FOR ARSENIC DETERMINATION IN WATER

2.1. Aim of the PT

The aim of the PT was to support water testing laboratories from the Southern African Development Community (SADC) and from the East African Community in their effort to improve the quality of measurement results. The PT round was organized in 2006 within the Water PT Scheme of the SADCMET (SADC Cooperation in Measurement Traceability). The organizers were the Water Quality Services, Windhoek, Namibia, in cooperation with the Universität Stuttgart, Germany, and with financial support by the Physikalisch-Technische Bundesanstalt, Braunschweig, Germany. The analytes were Ca, Mg, Na, K, Fe, Mn, Al, Pb, Cu, Zn, Cr, Ni, Cd, As, SO₄²⁻, Cl⁻, F⁻, NO₃⁻, and PO₄³⁻ in synthetic water modeling drinking/ground water. Three IHRMs with different analyte concentrations were prepared and distributed among the laboratories-participants for analysis.

In the following description, the determination of the arsenic concentration in one IHRM only was selected as an example.


The IHRM was formulated on the basis of analytical grade water spiked with pure chemicals. Arsenic(III) oxide from Sigma-Aldrich (purity pc = 99.995 %) was used for the preparation of the stock solution with a content of As of about 0.4 mg g⁻¹. The mass mAs₂O₃ of the oxide was measured on an analytical balance (Sartorius RC 210D); the total mass mss/t of the stock solution was determined by difference weighing on a Sartorius BA3100P balance. About mss = 100 g of the stock solution was diluted to about mdil/t = 1000 g, also on a Sartorius BA3100P balance. Finally, about mdil = 200 g of the diluted solution (also weighed on the same balance) was diluted to about mlot = 49900 g. The total mass mlot of this lot was determined by difference weighing on a Sartorius F150S balance.

The assigned/certified value of the As concentration in the IHRM was assessed according to the preparation procedure, taking into account the proportion pAs/As₂O₃ of atomic weights (from IUPAC publications), the purity of the As₂O₃ used, the density ρlot of the final lot and a buoyancy correction factor bcf. The density of the final lot was measured gravimetrically using a 100 ml pycnometer. The certified value ccert of the mass concentration of As in the final lot was calculated by the following formula:

$$
c_{\mathrm{cert}} = \frac{m_{\mathrm{As_2O_3}}\; p_{\mathrm{As/As_2O_3}}\; p_{\mathrm{c}}\; m_{\mathrm{ss}}\; \rho_{\mathrm{lot}}\; m_{\mathrm{dil}}}{m_{\mathrm{ss/t}}\; m_{\mathrm{lot}}\; b_{\mathrm{cf}}\; m_{\mathrm{dil/t}}} \qquad (14)
$$
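To make the structure of formula (14) concrete, the sketch below evaluates it as a product of the oxide mass, the As mass fraction in As₂O₃, the purity, the two aliquot masses and the lot density, divided by the total masses of the stock and diluted solutions, the lot mass and the buoyancy correction factor. This reading of the formula is an assumption of the example. The atomic weights are rounded standard values; all numerical inputs in the usage part are HYPOTHETICAL nominal values chosen only to match the approximate masses quoted in the text, so the output is illustrative and is not the certified value.

```python
def certified_concentration(m_as2o3, p_as, p_c, m_ss, rho_lot, m_dil,
                            m_ss_t, m_lot, b_cf, m_dil_t):
    """Evaluate formula (14) as read in this sketch.

    With m_as2o3 in mg, all solution masses in g and rho_lot in g/ml,
    the result is a mass concentration in mg/ml.
    """
    numerator = m_as2o3 * p_as * p_c * m_ss * rho_lot * m_dil
    denominator = m_ss_t * m_lot * b_cf * m_dil_t
    return numerator / denominator

# Mass fraction of As in As2O3 from rounded standard atomic weights.
AR_AS, AR_O = 74.9216, 15.999
p_as = 2 * AR_AS / (2 * AR_AS + 3 * AR_O)

# HYPOTHETICAL nominal inputs (not the values used by the PT provider).
c_mg_per_ml = certified_concentration(
    m_as2o3=528.0,   # mg of oxide, chosen to give about 0.4 mg/g As in the stock
    p_as=p_as,
    p_c=0.99995,     # purity 99.995 % expressed as a fraction
    m_ss=100.0,      # g of stock solution taken
    rho_lot=0.998,   # g/ml, hypothetical density of the final lot
    m_dil=200.0,     # g of diluted solution taken
    m_ss_t=1000.0,   # g, hypothetical total mass of the stock solution
    m_lot=49900.0,   # g, total mass of the final lot
    b_cf=1.001,      # hypothetical buoyancy correction factor
    m_dil_t=1000.0,  # g, total mass of the diluted solution
)
c_mg_per_l = 1000.0 * c_mg_per_ml
print(f"p_As/As2O3 = {p_as:.4f}, c = {c_mg_per_l:.4f} mg/l")
```

With these nominal inputs the result is of the same order as the certified value reported in the text; the difference reflects the hypothetical masses, density and correction factor.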

Formula (14) also enables calculation of the uncertainty budget of the certified value. The uncertainties of the masses were derived from precision experiments, which deliver the standard uncertainty directly, and from the linearity tolerances given by the manufacturer (treated as a rectangular distribution). The uncertainty of the purity was derived from the manufacturer's information. The uncertainty of the buoyancy correction factor was estimated from the possible variations in the atmospheric pressure, air humidity and temperature [38]. For the estimation of the uncertainty of density, a separate budget was calculated taking into account the uncertainties of the weighing and that of the temperature measurement. The uncertainties of the atomic weights and of stability and homogeneity of the solution were neglected.
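A budget of this kind can be sketched under two standard assumptions: a tolerance treated as a rectangular distribution contributes its half-width divided by √3, and for a model that is a pure product and quotient of uncorrelated quantities the relative standard uncertainties combine in quadrature. The helper names and every numerical input below are HYPOTHETICAL and serve only to show the mechanics; they are not the values of the PT provider's budget.

```python
import math

def standard_from_tolerance(half_width):
    """Standard uncertainty of a rectangular distribution with this half-width."""
    return half_width / math.sqrt(3)

def combine(*components):
    """Root-sum-square combination of uncorrelated uncertainty components."""
    return math.sqrt(sum(u * u for u in components))

# HYPOTHETICAL weighing: repeatability from a precision experiment plus the
# manufacturer's linearity tolerance treated as rectangular.
mass_g = 100.0
u_repeatability_g = 0.005
linearity_tolerance_g = 0.01
u_mass_g = combine(u_repeatability_g, standard_from_tolerance(linearity_tolerance_g))
u_mass_rel = u_mass_g / mass_g

# HYPOTHETICAL relative standard uncertainties of the remaining factors
# (purity, density, buoyancy correction).
u_rel_inputs = [u_mass_rel, 2e-5, 1e-4, 5e-5]

u_c_rel = combine(*u_rel_inputs)  # relative combined standard uncertainty
U_c_rel = 2 * u_c_rel             # expanded, coverage factor 2 as in the text

print(f"u_rel = {u_c_rel:.2e}, U_rel (k = 2) = {U_c_rel:.2e}")
```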

The assigned/certified value of the As content in the IHRM and its expanded uncertainty were ccert ± Ucert = 0.1706 ± 0.0001 mg l⁻¹ at the level of confidence 0.95 and the coverage factor of 2. Note that the expanded uncertainty was 0.07 % of the reference value.

Each laboratory received a bottle of 1 L of this IHRM (for all analytes).


Nine laboratories-participants (N = 9) reported results on determination of the As concentration, shown in Table 7. One of the major problems of the current situation with water analysis in Africa is the absence of any common standard for analytical methods. The methods used were: inductively coupled plasma optical emission spectrometry (ICP-OES), atomic absorption spectrometry (AAS) and others.


2.4. Statistical analysis of the data

High standard deviations from the certified value (above 20 % of the value) were expected at a workshop organized for representatives of the laboratories-participants prior to this PT round. Therefore, it was decided to use the target standard deviation σtarg of 20 % of the certified value when the experimental standard deviation sPT > 20 %. Since in the As case the robust sPT value, calculated from the data shown in Table 7 by Algorithm A of the standards [3, 37], was 50.5 % (0.086 mg l⁻¹), the stated target value σtarg = 20 % (0.034 mg l⁻¹) was applied for the proficiency assessment with z-score. The z-score values are shown in Table 7 with the comments: satisfactory (Yes) when they were between −2 and +2, questionable (Quest) for 2

Table 7
Results of the PT for arsenic content determination in water

Lab No, i   Method    ci, mg l⁻¹   ci - ccert, mg l⁻¹   zi      Comment   Sign
4           AAS       0.03         -0.1406              -4.12   No        -
10          other     0.20          0.0294               0.86   Yes       +
18          ICP-OES   0.20          0.0294               0.86   Yes       +
19          ICP-OES   0.12         -0.0506              -1.48   Yes       -
26          ICP-OES   0.12         -0.0506              -1.48   Yes       -
34          AAS       0.169        -0.0016              -0.05   Yes       0
35          AAS       0.08         -0.0906              -2.66   Quest     -
37          ICP-OES   0.789         0.6184              18.12   No        +
38          other     0.258         0.0874               2.56   Quest     +
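The z-scores and comments of Table 7 can be recomputed from the reported results with ccert = 0.1706 mg l⁻¹ and σtarg = 20 % of ccert. In the sketch below the class limits (satisfactory for |z| ≤ 2, questionable for 2 < |z| < 3, unsatisfactory otherwise) are an assumption: they follow the usual z-score convention and agree with the comments in the table, but the sentence defining them is incomplete in this text. The function and variable names are illustrative.

```python
c_cert = 0.1706              # certified As concentration, mg/l
sigma_targ = 0.20 * c_cert   # target standard deviation, about 0.034 mg/l

# Laboratory number -> reported result ci, mg/l (Table 7).
results = {4: 0.03, 10: 0.20, 18: 0.20, 19: 0.12, 26: 0.12,
           34: 0.169, 35: 0.08, 37: 0.789, 38: 0.258}

def comment(z):
    """Classify a z-score; the limits 2 and 3 are the assumed conventional ones."""
    if abs(z) <= 2:
        return "Yes"
    if abs(z) < 3:
        return "Quest"
    return "No"

z_scores = {lab: (c - c_cert) / sigma_targ for lab, c in results.items()}
comments = {lab: comment(z) for lab, z in z_scores.items()}

for lab in results:
    print(f"{lab:2d}  {results[lab] - c_cert:+.4f}  {z_scores[lab]:+.2f}  {comments[lab]}")
```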


Therefore, the permissib
