Analytical and Bioanalytical Chemistry Electronic ...10.1007/s00216-012-5869...Analytical and Bioanalytical Chemistry Electronic Supplementary Material ... i may not be available in

Analytical and Bioanalytical Chemistry

Electronic Supplementary Material

An international assessment of the metrological equivalence of higher-order measurement services for creatinine in serum

Johanna E. Camara, Katrice A. Lippa, David L. Duewer, Hugo Gasca-Aragon, Blaza Toman

Repeatability Measurements. Table S1 lists the repeatability measurements for the CCQM-K80

study. All values are formally expressed in arbitrary units.

Variance Components and Uncertainty Estimation. The sources of measurement variability

(instrumental, sample preparation and between-campaign) can be described using the

experimental design in Figure 1 as:

$R_{ijkl} \sim N\left(\mu_i + \gamma_{ij} + \delta_{ijk},\ \sigma_{r,i}^2\right)$ [S1]

where i indexes materials, j indexes units, k indexes independent aliquots per unit, l indexes independent replicates per aliquot, “~ N” indicates “is distributed as Normal (i.e., Gaussian) with the specified mean and variance,” μi represents the unknowable true value of the measurand in the material (i.e., creatinine in serum), γij are between-unit differences, δijk are between-aliquot differences, and σr,i is the true instrumental repeatability and is assumed to be the same across all units. The γij are assumed to be

$\gamma_{ij} \sim N\left(0,\ \sigma_{c,i}^2\right)$ [S2]

where σc,i reflects the true between-campaign variability for the material (one unit is analyzed in each campaign). The δijk are assumed to be

$\delta_{ijk} \sim N\left(0,\ \sigma_{a,i}^2\right)$ [S3]

where σa,i reflects the true within-unit (between-aliquot) material variability and is assumed to be the same for all units of a given material. While potentially over-simplified, this model with these assumptions is likely fit-for-purpose given that HOVAMs are designed to be stable and homogeneous. Use of more complex models, e.g., allowing σr,i or σa,i to vary between units, generally will require more data in order to provide reliable estimates.
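The nested structure of Equations S1 to S3 can be illustrated with a short simulation. This is a minimal sketch, not part of the original analysis: the function name, its arguments, and the numerical values passed to it are illustrative assumptions, and only the Python standard library is used.

```python
import random

def simulate_material(mu, sd_c, sd_a, sd_r, n_c=2, n_a=2, n_r=2, seed=None):
    """Draw one data set from the three-level nested model (S1 to S3).

    Returns data[campaign][aliquot][replicate]. One unit is analyzed per
    campaign, so the between-unit offset doubles as the campaign offset.
    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_c):
        gamma = rng.gauss(0.0, sd_c)        # between-campaign (unit) offset, S2
        campaign = []
        for _ in range(n_a):
            delta = rng.gauss(0.0, sd_a)    # between-aliquot offset, S3
            # replicate measurements scatter about mu + gamma + delta, S1
            campaign.append([rng.gauss(mu + gamma + delta, sd_r)
                             for _ in range(n_r)])
        data.append(campaign)
    return data

# Hypothetical parameter values chosen only for illustration.
sim = simulate_material(mu=7.1, sd_c=0.09, sd_a=0.06, sd_r=0.10, seed=1)
for c, campaign in enumerate(sim, start=1):
    print(f"campaign {c}:", [[round(x, 3) for x in aliquot] for aliquot in campaign])
```

Setting all three standard deviations to zero returns the true value for every replicate, which is a quick check that the offsets are applied at the intended levels.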

For completely balanced measurement designs, an appropriate estimate for μi is the grand mean,

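Ri, of all the measurements for each material. The usual ANOVA-type estimate of the standard uncertainty of Ri is:

$u(R_i) = \sqrt{\dfrac{n_a n_r\,\hat{\sigma}_{c,i}^2 + n_r\,\hat{\sigma}_{a,i}^2 + \hat{\sigma}_{r,i}^2}{n_c\, n_a\, n_r}}$ [S4]

where nc is the number of campaigns, na is the number of aliquots taken from the unit used in each campaign, nr is the number of replicates of each aliquot, and the hats denote estimates of the between-campaign, between-aliquot, and repeatability standard deviations.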
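As a worked check of Equation S4, the sketch below computes the standard uncertainty of the grand mean as the sum of the between-campaign variance divided by nc, the between-aliquot variance divided by nc·na, and the repeatability variance divided by nc·na·nr. The function name is an assumption made for illustration; the inputs are the rounded variance components listed in Table S2 for materials A-1 and C-3 (12 measurements).

```python
import math

def standard_uncertainty(sd_c, sd_a, sd_r, n_c, n_a, n_r):
    """Standard uncertainty of the grand mean for a balanced nested design."""
    variance = (sd_c ** 2 / n_c
                + sd_a ** 2 / (n_c * n_a)
                + sd_r ** 2 / (n_c * n_a * n_r))
    return math.sqrt(variance)

# Rounded inputs taken from Table S2.
u_a1 = standard_uncertainty(sd_c=0.089, sd_a=0.059, sd_r=0.100, n_c=2, n_a=2, n_r=2)
u_c3 = standard_uncertainty(sd_c=0.0, sd_a=0.516, sd_r=0.431, n_c=2, n_a=3, n_r=2)
print(round(u_a1, 3), round(u_c3, 3))  # 0.078 0.245, matching u(R) in Table S2
```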

In CCQM-K80, nc and nr are always 2 and na is either 2 or 3. Given that these serum-based

materials may not be of uniform composition throughout a given unit, analyzing duplicate

aliquots (or triplicate in the case of LGC materials) within each unit as well as evaluating at least

two separate units can help ensure a representative assessment of each material. This necessitates the three-level nested measurement design depicted in Figure 1 of the main text.

Estimates for the variance components, symbolized $\hat{\sigma}_{c,i}$, $\hat{\sigma}_{a,i}$, and $\hat{\sigma}_{r,i}$, can be obtained using linear mixed model analysis systems such as the SAS MIXED [1] and the R “lmer” [2] procedures. The variance component and standard uncertainty estimates are listed in Table S2. All non-zero estimates of relative $\hat{\sigma}_{r,i}$ were pooled and resulted in a value of 0.99 %, which estimates the instrumental sources of variance. There are no significant differences in the estimates based on duplicate analyses of two aliquots per unit versus those on duplicates of three aliquots. Since many of the $\hat{\sigma}_{a,i}$ are estimated as zero, the 0.68 % pooled relative standard deviation provides only a worst-case bound on the aliquot preparation variance. Since many of the $\hat{\sigma}_{c,i}$ are likewise estimated as zero, the 0.77 % pooled relative standard deviation also provides only a worst-case bound on the sample preparation variance sources.
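For a balanced design, the variance components can also be approximated with classical method-of-moments (ANOVA) estimates. The sketch below is a stand-in for illustration only: it is not the SAS MIXED or lmer fit used for Table S2, the function name is an assumption, and it can differ from the mixed-model results whenever a component is truncated at zero. The data are the A-1 measurements from Table S1.

```python
import math

def nested_anova(data):
    """Method-of-moments estimates for a balanced campaign/aliquot/replicate design.

    data[campaign][aliquot][replicate]; returns (mean, sd_r, sd_a, sd_c).
    Negative variance estimates are truncated at zero.
    """
    n_c, n_a, n_r = len(data), len(data[0]), len(data[0][0])
    aliquot_means = [[sum(a) / n_r for a in c] for c in data]
    campaign_means = [sum(m) / n_a for m in aliquot_means]
    grand = sum(campaign_means) / n_c

    # Sums of squares at each level of the nesting.
    ss_r = sum((x - aliquot_means[i][j]) ** 2
               for i, c in enumerate(data) for j, a in enumerate(c) for x in a)
    ss_a = n_r * sum((m - campaign_means[i]) ** 2
                     for i, row in enumerate(aliquot_means) for m in row)
    ss_c = n_a * n_r * sum((m - grand) ** 2 for m in campaign_means)

    # Mean squares and their expected-value solutions.
    ms_r = ss_r / (n_c * n_a * (n_r - 1))
    ms_a = ss_a / (n_c * (n_a - 1))
    ms_c = ss_c / (n_c - 1)
    var_a = max(0.0, (ms_a - ms_r) / n_r)
    var_c = max(0.0, (ms_c - ms_a) / (n_a * n_r))
    return grand, math.sqrt(ms_r), math.sqrt(var_a), math.sqrt(var_c)

# Material A-1 from Table S1: [campaign][aliquot][replicate].
a1 = [[[7.105, 6.983], [7.067, 7.042]],
      [[7.132, 7.096], [7.172, 7.422]]]
print([round(v, 3) for v in nested_anova(a1)])
```

For A-1, where no component is truncated, the output agrees with the Table S2 entries (7.127, 0.100, 0.059, 0.089) to three decimals.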

Estimating 95 % level of confidence coverage intervals U95(Ri) for each Ri from a small number

of measurements can be much more complicated than multiplying the standard uncertainties

u(Ri) obtained from equation [S4] by a coverage factor of 2. The approach used herein is to first

expand the standard uncertainties into U95(Ri) form and then revert to “large sample standard

uncertainties,” by dividing by a factor of 2. We refer to this quantity as u∞(Ri):

u∞(Ri) = U95(Ri)/2 [S5]

The u∞(Ri) estimates provide the same statistical interpretation as the U95(Vi)/2 in equation 3 and thus provide a consistent description for both the assigned values, V, and the measurement responses, R.

Two methods were used to estimate the U95(Ri): classical long-term frequency (“frequentist”)

expansion and constrained empirical Bayesian analysis. The frequentist method recommended

in the GUM [3] expands the standard uncertainty u(Ri) using an appropriate Student’s t coverage

factor to generate an expanded uncertainty:

$U_{95}(R_i)_f = t_{95\%,\nu_i} \cdot u(R_i)$ [S6]

where νi are the degrees of freedom. Depending on the relative magnitude of the estimated variance components $\hat{\sigma}_{r,i}$, $\hat{\sigma}_{a,i}$, and $\hat{\sigma}_{c,i}$, the νi for the CCQM-K80 materials range from 7 to 1. This generates two-tailed $t_{95\%,\nu_i}$ that range from 2.4 to 12.7, the latter of which translates into unrealistically large U95(Ri)f values. The νi and U95(Ri)f values are listed in Table S2.
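The frequentist expansion of Equation S6 and the reversion to a “large sample” standard uncertainty of Equation S5 amount to a table lookup and two multiplications. The sketch below assumes a small hard-coded table of standard two-tailed 95 % Student's t values; the dictionary and function names are illustrative, and the inputs are the rounded u(R) and degrees of freedom from Table S2.

```python
# Two-tailed 95 % Student's t coverage factors for 1 to 7 degrees of freedom
# (standard tabulated values, rounded to three decimals).
T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365}

def expand_frequentist(u, dof):
    """Return (U95, u_inf): the t-expanded uncertainty and U95 / 2."""
    u95 = T95[dof] * u
    return u95, u95 / 2.0

for code, u, dof in [("A-1", 0.078, 1), ("C-1", 0.021, 3)]:
    u95, u_inf = expand_frequentist(u, dof)
    print(code, round(u95, 3), round(u_inf, 3))
```

Because the tabulated u(R) are rounded to three decimals, the last digit of a recomputed U95(R)f can differ from Table S2 for some materials. With one degree of freedom the coverage factor is more than six times the large-sample value of 2, which is why the corresponding intervals are so wide.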

Bayesian analysis is based on a somewhat different definition of probability than the frequentist

interpretation that underpins classical statistical inference. Under the Bayesian paradigm,

parameters such as the measurand value and variance components have probability distributions

that quantify our knowledge about them. The estimation process starts with quantification of

prior knowledge about the parameters followed by specification of the statistical model that

relates the parameters to the data. The statistical model is called a likelihood function and is the

same as described in Equation S1. The components of these models are combined via Bayes’ theorem to obtain posterior distributions for the parameters. These distributions update our knowledge about the parameters based on the evidence provided by the data. This analysis can produce a probability distribution for each of the μi (the true value of the measurand quantity estimated by the measurement mean, Ri) which encompasses all of the information and

variability present in the data but is confined by bounds based on prior knowledge. The process

yields a probability interval which is interpretable as an uncertainty interval as defined in JCGM

101:2008 “Evaluation of measurement data — Supplement 1 to the GUM — Propagation of

distributions using a Monte Carlo method (GUM Supplement 1)” [4].

While the probability distributions for the μi may not be available in closed form, with these

priors and using empirical Bayesian methods, it is possible to estimate the desired coverage

intervals, U95(Ri)B. Software systems suitable for computing the intervals, such as WinBUGS

[5], are freely available and (relatively) easy-to-use. The U95(Ri)B are listed in Table S2. The

U95(Ri)B are smaller than the U95(Ri)f for materials associated with one degree of freedom and

tend to be somewhat larger than the U95(Ri)f for materials associated with more than one degree

of freedom.

As discussed in the main text, “Leave-one-out” (LOO) cross-validation was used if the GDR

function was strongly influenced by materials having relatively small U95(Vi) and/or U95(Ri) or

very low or very high {Vi, Ri}. LOO is an established and routine tool for evaluating the

predictive utility of a model [6] and is an efficient, if empirical, approach to establishing which, if any, materials are sufficiently influential to distort the consensus estimation of the GDR function. Figure S1 compares the “exact” uncertainty-scaled distances, εi, calculated using all materials with the LOO-estimated εi for each material. Circles with crosses that are not substantially on the diagonal line indicate materials that strongly influence the GDR. Circles outside the square bounded by the red lines are potentially anomalous. Only materials B-1 and E-1 appear either anomalous or strongly influential using the u∞(Ri)f. None of the materials appear particularly problematic when the u∞(Ri)B are used. This reflects the somewhat more realistic estimates of repeatability measurement uncertainty provided by the Bayesian procedure.

After careful consideration, it was decided by NIST to exclude material “E-1” from the

estimation of the GDR parameters on the grounds that 1) U95(Vi) is suspiciously small, 2) the

process used to establish the U95(Vi) does not reflect their current practice, and 3) this CRM is

completely sold out and no longer available for re-evaluation, let alone sale. After E-1’s

exclusion from the GDR model, none of the remaining 16 materials were anomalous or

influential with either set of repeatability measurement uncertainties.

Degrees of Equivalence. The degree of equivalence for a given material expressed as a relative

percent, %di, can be estimated from the signed orthogonal distance of the certified and measured

values (Vi, Ri) relative to the estimates provided by the GDR analysis (the fitted values $\hat{V}_i$ and $\hat{R}_i$ and the fitted function parameters):

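$\%d_i = 100 \cdot \mathrm{SIGN}\left(V_i - \hat{V}_i\right) \dfrac{\sqrt{\left(V_i - \hat{V}_i\right)^2 + \left(R_i - \hat{R}_i\right)^2}}{\left(\hat{V}_i + \hat{R}_i\right)/2}$ [S7]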

The function SIGN returns the sign (±1) of its argument and defines whether the observed {Vi, Ri} pair is “above” or “below” the GDR function. The measurement-related terms are

transformed to have the same scale as the assigned values.

Given the covariances among the parameters of the GDR function and the {Vi, u(Vi), Ri, u(Ri)} used in their estimation, the expanded uncertainties for the %di, U95(%di), were estimated via a parametric bootstrap Monte Carlo (PBMC) approach [7]. For a sufficiently large number of samples from the bootstrap analysis and assuming an approximately symmetric distribution, the percentiles of the empirical distribution of %di provide the desired estimates:

$U_{95}(\%d_i) = \left[\mathrm{Percentile}\left(\%d_{ij},\ 0.975\right) - \mathrm{Percentile}\left(\%d_{ij},\ 0.025\right)\right]/2$ [S8]

where j indexes the bootstrap samples.

If a distribution is significantly asymmetric, a symmetric ±U95(%di) interval can be estimated

from the largest half-interval using the same empirical percentiles. The PBMC distributions for

all of the CCQM-K80 materials were approximately symmetrical.
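The percentile calculation of Equation S8, together with the largest-half-interval fallback for asymmetric distributions, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the percentile is taken by linear interpolation between order statistics (the source does not specify an interpolation rule), the `center` argument stands in for the point estimate about which the half-intervals are measured, and the samples are synthetic rather than actual PBMC output.

```python
import random

def percentile(sorted_values, p):
    """Empirical percentile by linear interpolation between order statistics."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac

def u95_from_samples(samples, center=None):
    """Half-width of the central 95 % interval of the samples.

    If `center` is given, return the largest half-interval about it instead,
    the fallback for a markedly asymmetric distribution.
    """
    s = sorted(samples)
    p_lo, p_hi = percentile(s, 0.025), percentile(s, 0.975)
    if center is None:
        return (p_hi - p_lo) / 2.0
    return max(p_hi - center, center - p_lo)

# Synthetic stand-in for bootstrap %d samples (not CCQM-K80 results).
rng = random.Random(0)
fake = [rng.gauss(0.5, 1.0) for _ in range(20000)]
print(round(u95_from_samples(fake), 2))
```

For an approximately symmetric distribution the two forms give nearly the same value when `center` is the median; they diverge as the distribution becomes skewed.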

If the joint distribution of the PBMC %di estimates for all of the materials submitted by a given

participating NMI is approximately symmetric, the relative degrees of equivalence for the

participating NMIs, %D, can be estimated as the mean of the %di for each material and the

U95(%D) can be estimated from the standard deviation of these %di and the pooled U95(%di). If

the distribution is significantly asymmetric, the ±U95(%D) interval can be estimated empirically

from the joint distribution. The direct and empirical U95(%D) estimates for all of the CCQM-

K80 participants were essentially identical.

Table S1. Summary of measured creatinine responses (in arbitrary units) for each human serum material investigated.

                                  Campaign1 (Unit1)                               Campaign2 (Unit2)
                                      Aliquot1        Aliquot2    Aliquot3 (a)        Aliquot1        Aliquot2    Aliquot3 (a)
NMI    Label            Code      Rep1    Rep2    Rep1    Rep2    Rep1    Rep2    Rep1    Rep2    Rep1    Rep2    Rep1    Rep2
CENAM  DMR 263a         A-1      7.105   6.983   7.067   7.042                   7.132   7.096   7.172   7.422
KRISS  111-01-01A       B-1      5.964   5.848   5.858   5.828                   5.789   5.928   5.940   5.830
KRISS  111-01-03A       B-2      7.136   7.048   7.078   7.103                   7.020   7.024   7.010   7.013
KRISS  111-01-04A       B-3     24.825  25.083  25.220  25.278                  25.163  25.387  25.406  25.171
KRISS  111-01-02A       B-4     27.370  27.436  27.170  27.467                  27.508  27.429  27.431  27.651
LGC    ERM-DA252a       C-1      2.986   3.098   3.146   3.128                   3.121   3.125   3.083   3.124
LGC    ERM-DA251a       C-2     22.074  21.933  22.580  22.091  21.989  22.293  21.646  21.811  22.009  21.847  21.600  22.186
LGC    ERM-DA250a       C-3     39.911  39.023  40.485  39.723  40.168  39.628  39.400  39.251  40.362  39.826  40.759  41.267
LGC    ERM-DA253a       C-4     50.868  50.225  49.429  49.346  50.042  49.163  51.468  50.045  50.621  50.521  50.618  50.478
NIM    Creatinine-1     D-1      7.959   8.098   8.143   7.996                   8.001   8.152   8.104   8.157
NIM    Creatinine-2     D-2     34.045  34.562  34.031  33.249                  34.421  34.320  33.852  34.512
NIST   SRM 909b I (b)   E-1      7.075   7.156   7.139   7.184                   7.065   7.166   7.236   7.152
NIST   SRM 967a I       E-2      8.347   8.116   8.270   8.161                   8.265   8.218   8.293   8.294
NIST   SRM 909b II (b)  E-3     34.158  34.256  34.065  34.283                  34.335  33.759  33.712  33.927
NIST   SRM 967a II      E-4     37.451  38.416  37.720  38.268                  38.054  37.386  37.934  38.484
PTB    RELA 1/05 KS-A   F-1     45.180  46.153  45.425  44.829                  45.792  45.188  45.009  45.174
PTB    RELA 1/05 KS-B   F-2     58.431  56.935  57.079  57.317                  58.195  58.066  57.437  57.466

(a) Results for Aliquot3 were obtained using the “routine” sample preparation protocol used for all other materials; the Aliquot1 and Aliquot2 results for these three materials are for aliquots prepared using the certificate-specified minimum sample volume.

(b) Results adjusted for the measured fill-weights of each unit.

Table S2. Variance components, means and standard uncertainties together with the 95 % level of confidence intervals from both a frequentist-type expansion and a Bayesian estimation.

Materials                             Numbers (a) Variance Components (b)              Means and Uncertainties (c)
NMI    Label            Code       Nt  Nr  Na  Nc      σr      σa      σc        R    u(R)   ν   U95(R)f   U95(R)B
CENAM  DMR 263a         A-1         8   2   2   2   0.100   0.059   0.089    7.127   0.078   1     0.991     0.280
KRISS  111-01-01A       B-1         8   2   2   2   0.063       0       0    5.873   0.022   7     0.053     0.095
KRISS  111-01-03A       B-2         8   2   2   2   0.027       0   0.051    7.054   0.037   1     0.474     0.111
KRISS  111-01-04A       B-3         8   2   2   2   0.148   0.104   0.073   25.192   0.090   1     1.144     0.385
KRISS  111-01-02A       B-4         8   2   2   2   0.120       0   0.082   27.433   0.072   1     0.913     0.365
LGC    ERM-DA252a       C-1         8   2   2   2   0.043   0.029       0    3.101   0.021   3     0.067     0.137
LGC    ERM-DA251a       C-2 (d)     8   2   2   2   0.198   0.134   0.199   21.999   0.171   1     2.171     0.990
LGC    ERM-DA251a       C-2        12   2   3   2   0.230       0   0.198   22.005   0.155   1     1.969     0.805
LGC    ERM-DA250a       C-3 (d)     8   2   2   2   0.458   0.251       0   39.748   0.205   3     0.652     1.360
LGC    ERM-DA250a       C-3        12   2   3   2   0.431   0.516       0   39.983   0.245   5     0.629     1.130
LGC    ERM-DA253a       C-4 (d)     8   2   2   2   0.554   0.437   0.266   50.315   0.348   1     4.429     1.635
LGC    ERM-DA253a       C-4        12   2   3   2   0.520   0.248   0.488   50.235   0.390   1     4.951     1.045
NIM    Creatinine-1     D-1         8   2   2   2   0.079       0       0    8.076   0.028   7     0.066     0.120
NIM    Creatinine-2     D-2         8   2   2   2   0.407   0.166       0   34.124   0.166   3     0.529     0.900
NIST   SRM 909b I       E-1         8   2   2   2   0.056       0       0    7.147   0.020   7     0.047     0.098
NIST   SRM 967a I       E-2         8   2   2   2   0.076       0       0    8.245   0.027   7     0.064     0.132
NIST   SRM 909b II      E-3         8   2   2   2   0.212       0   0.148   34.062   0.129   1     1.636     0.505
NIST   SRM 967a II      E-4         8   2   2   2   0.420       0       0   37.964   0.148   7     0.351     0.675
PTB    RELA 1/05 KS-A   F-1         8   2   2   2   0.434       0       0   45.344   0.154   7     0.363     0.895
PTB    RELA 1/05 KS-B   F-2         8   2   2   2   0.538   0.112       0   57.616   0.198   3     0.631     1.025

(a) Nr is the number of replicates per aliquot, Na is the number of aliquots per campaign, and Nc is the number of campaigns. Nt is the total number of measurements, which is equal to Nr × Na × Nc.

(b) σr, σa, and σc are the estimated between-replicate, between-aliquot, and between-campaign components of variance expressed as standard deviations.

(c) R is the mean of the measured creatinine responses, u(R) the standard uncertainty for R, ν the number of degrees of freedom associated with u(R), U95(R)f a frequentist 95 % confidence estimate for R, and U95(R)B a Bayesian 95 % confidence estimate for R.

(d) Estimated using only the Aliquot1 and Aliquot2 measurements.

Fig. S1. “Leave One Out” (LOO) Analysis in the Identification of Potentially Anomalous, Strongly Influential Materials.

[Figure: two panels, each plotting the LOO uncertainty-weighted distance, εi (vertical axis, 0 to 7), against the “exact” uncertainty-weighted distance, εi (horizontal axis, 0 to 7). The point labels E-1 and B-1 appear in each panel.]

The left panel presents results using the frequentist u∞(Ri)f; the right panel presents results using the Bayesian u∞(Ri)B. Each open circle represents the estimates for a particular material; the cross represents the estimated 95 % level of confidence intervals on those estimates. The red lines bound the distances expected for materials compatible with the GDR function with a 95 % level of confidence.

References

1. SAS/STAT 9.2 User’s Guide (2008) SAS Institute Inc. Cary, NC USA.

2. Bates D, Maechler M (2009) lme4: Linear mixed-effects models using S4 classes. R package version 0.999375-32 http://cran.us.r-project.org/web/packages/lme4/

3. JCGM 100:2008 (2008) Guide to the Expression of Uncertainty in Measurement (GUM). BIPM, Sèvres, France. http://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf.

4. JCGM 101:2008 (2008) Evaluation of measurement data — Supplement 1 to the “Guide to the expression of uncertainty in measurement” — Propagation of distributions using a Monte Carlo method. BIPM, Sèvres, France. http://www.bipm.org/utils/common/documents/jcgm/JCGM_101_2008_E.pdf.

5. Lunn DJ, Thomas A, Best N, Spiegelhalter D (2000) WinBUGS - a Bayesian modelling framework: concepts, structure, and extensibility. Stat Comput 10:325-337

6. Picard RR, Cook RD (1984) Cross-validation of regression models. J Am Stat Assoc 79(387):575-583

7. Duewer DL, Kowalski BR, Fasching JL (1976) Improving reliability of factor-analysis of chemical data by utilizing measured analytical uncertainty. Anal Chem 48:2002-2010
