Options for Summarizing Medical Test Performance in the Absence of a “Gold Standard”

Prepared for: The Agency for Healthcare Research and Quality (AHRQ)

Training Modules for Medical Test Reviews Methods Guide

www.ahrq.gov

Recognize settings where the reference standard may be imperfect (i.e., no “gold standard”)

Describe sources of potential bias resulting from the use of an imperfect reference standard when estimating the sensitivity and specificity of a medical test

Understand the options for analyzing data, their advantages and justification, and potential assumptions

Learning Objectives

Trikalinos TA, Balion CM. Options for summarizing medical test performance in the absence of a “gold standard.” In: Methods guide for medical test reviews. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

Introduction: Classical Paradigm

                 “Truly” Diseased       “Truly” Healthy
Index test (+)   True positive (TP)     False positive (FP)
Index test (-)   False negative (FN)    True negative (TN)
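As a minimal sketch of how the cells of this table translate into accuracy measures, the Python snippet below computes sensitivity and specificity from the four counts. The function name and the example counts are illustrative assumptions and are not taken from the module.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from the four cells of a 2-by-2 table."""
    diseased = tp + fn  # column total: "truly" diseased
    healthy = fp + tn   # column total: "truly" healthy
    if diseased == 0 or healthy == 0:
        raise ValueError("Each disease column needs at least one observation.")
    sensitivity = tp / diseased  # proportion of diseased with index test (+)
    specificity = tn / healthy   # proportion of healthy with index test (-)
    return sensitivity, specificity


# Hypothetical counts, for illustration only.
se, sp = sensitivity_specificity(tp=80, fp=10, fn=20, tn=90)
print(f"Sensitivity = {se:.2f}, specificity = {sp:.2f}")
```

These formulas are only as trustworthy as the column assignment: they presume that "truly" diseased and "truly" healthy are known without error.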


In some cases, the “true” status is directly observable (e.g., for tests predicting short-term mortality after a procedure).

More commonly, the “true” status is based on a reference standard (test), which is considered a “gold standard” if it actually reflects the “true” status.

“Reference standard bias” arises when the reference test does not mirror the truth well. The further the reference test deviates from the truth, the less accurate the estimate of the index test’s performance.

An “imperfect reference standard” is a reference standard test that misclassifies “true” status at a rate that cannot be ignored.

Introduction: Reference Standard Issues


The simplest case is an index test and a reference standard that give dichotomous results (e.g., positive or negative for disease).

Both the index and reference tests can err by not reflecting the true status.

The example on the following slide relates the true 2-by-2 table probabilities to the eight combinations of index test result, reference test result, and true disease status. These eight probabilities (α1, β1, γ1, δ1, α2, β2, γ2, and δ2) need to be estimated from the accuracy data.

The “perfect” reference standard is the “gold standard.”

Imperfect Reference Standard Scenario (1 of 2)


Imperfect Reference Standard Scenario (2 of 2)

                 “Truly” Diseased     “Truly” Healthy
                 RS (+)    RS (-)     RS (+)    RS (-)
Index test (+)   α1        β2         α2        β1
Index test (-)   γ1        δ2         γ2        δ1

                 RS (+)           RS (-)
Index test (+)   α = α1 + α2      β = β1 + β2
Index test (-)   γ = γ1 + γ2      δ = δ1 + δ2

RS = reference standard


Imperfect Reference Standard Bias (1 of 2)


“Naïve” estimates are underestimates versus true values when test results are independent among those with and without the condition of interest (“conditional independence”).

Imperfect Reference Standard Bias (2 of 2)

Solid red line = true sensitivity
Dashed red line = true specificity
Solid black line = naïve sensitivity
Dashed black line = naïve specificity

Abbreviations:
Se_index = index test sensitivity
Sp_index = index test specificity
P = disease prevalence
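The downward bias under conditional independence can be illustrated with a short calculation. The sketch below assumes the two tests err independently given true disease status and computes the values that the “naïve” sensitivity and specificity would be expected to take. The function name and all input values are hypothetical.

```python
def naive_estimates(se_index, sp_index, se_ref, sp_ref, prevalence):
    """Expected naive Se and Sp of the index test against an imperfect reference.

    Assumes conditional independence: given true disease status, the index
    and reference tests err independently of each other.
    """
    p = prevalence
    # Joint probabilities of the observed results, summed over true status.
    both_pos = p * se_index * se_ref + (1 - p) * (1 - sp_index) * (1 - sp_ref)
    both_neg = p * (1 - se_index) * (1 - se_ref) + (1 - p) * sp_index * sp_ref
    ref_pos = p * se_ref + (1 - p) * (1 - sp_ref)
    ref_neg = 1 - ref_pos
    naive_se = both_pos / ref_pos  # P(index + | reference +)
    naive_sp = both_neg / ref_neg  # P(index - | reference -)
    return naive_se, naive_sp


# Hypothetical true values, for illustration only.
naive_se, naive_sp = naive_estimates(
    se_index=0.80, sp_index=0.90, se_ref=0.90, sp_ref=0.95, prevalence=0.20
)
# With these inputs the naive Se is about 0.67 and the naive Sp about 0.88,
# both below the true values of 0.80 and 0.90.
print(f"Naive Se = {naive_se:.2f}, naive Sp = {naive_sp:.2f}")
```

Under this assumption the naïve sensitivity is a weighted average of the true sensitivity and the false-positive rate of the index test, which is why it falls below the true value whenever the index test performs better than chance.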


Only rarely are we absolutely sure that the reference standard is a perfect reflection of the truth.

Often, we are comfortable with overlooking small or moderate misclassifications by the reference standard.

Hard-and-fast rules for judging the (in)adequacy of the reference standard do not exist. Consult content experts on a case-by-case basis to make judgments.

There are three settings in which one might question the validity of the reference standard:

1. The reference method yields different measurements over time or across settings.

2. The condition of interest is variably defined.

3. The new method is an improved version of a usually applied test.

Reference Standard Validity


Situation: The reference method yields different measurements over time or across settings.

Example: Diagnosis of obstructive sleep apnea typically requires a high Apnea-Hypopnea Index (AHI; an objective measurement) and the presence of suggestive symptoms and signs.

Problem: There is large night-to-night variability in measured AHI and substantial between-rater and between-laboratory variability.

Imperfect Reference Standard: Setting 1


Situation: The condition of interest is variably defined.

Example: The disease, such as psoriatic arthritis, is complex.

Problem: There is no single symptom, sign, or measurement that suffices to make a diagnosis of the disease with certainty. Instead, a set of diagnostic criteria (symptoms, signs, imaging results, and laboratory measures) is used to identify the disease, which will unavoidably be differentially applied across studies.



Situation: The new method is an improved version of a usually applied test.

Example: Measurement of parathyroid hormone (PTH)

Problem: Older measurement methodologies are being replaced by newer, more specific ones. Measurements with the new and old methodologies do not agree very well. It is incorrect to use the older method as the reference standard.



Analytic options 1 and 2 below are preferred when possible to summarize data from two fallible tests; option 3 is also suitable.

1. Forgo the classical paradigm, which focuses on test accuracy; assess the ability of the index test to predict patient outcomes (using the index test as a predictive instrument).

2. Forgo the classical paradigm; assess agreement of the index and reference test results, that is, treat index and reference tests as two alternative measurement methods.

3. Using the classical paradigm, calculate “naïve” estimates of the index test’s sensitivity and specificity, but qualify study findings to avoid misinterpretation.

4. Mathematically adjust the “naïve” estimates of the index test’s sensitivity and specificity to account for the imperfect reference standard.

Analytic Options for a Systematic Review


Forgo the classical paradigm, which compares the index test to a reference standard (test “accuracy”). Such a comparison is not informative or interpretable when the reference standard is “imperfect.”

Instead, assess the ability of the index test to predict patient outcomes such as history, future clinical events, and response to therapy.

This option follows a well-known paradigm in systematic reviews for evaluating prognostic tests (more information is available in Module 12).

Analysis Option 1: Focus on Prediction of Patient Outcomes


Forgo the classical paradigm (test “accuracy”). Instead, assess agreement (concordance) of the index and reference test results. Simply treat the index and reference tests as two alternative measurement methods. How to do this depends on whether the results are categorical or continuous.

For categorical test results:

Cohen’s kappa statistic is a measure of categorical agreement that accounts for agreement by chance.

Meta-analyses of kappa statistics are not common in the medical literature; they will need to be explained and interpreted in detail.
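For readers unfamiliar with the statistic, the following is a minimal sketch of Cohen’s kappa for two dichotomous tests. The cell labels (a, b, c, d), the function name, and the counts are assumptions chosen for illustration.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for two dichotomous tests.

    a: both tests (+); b: index (+), reference (-);
    c: index (-), reference (+); d: both tests (-).
    """
    n = a + b + c + d
    observed = (a + d) / n  # observed proportion of agreement
    # Agreement expected by chance, from the marginal totals of each test.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    if expected == 1:
        raise ValueError("Kappa is undefined when chance agreement equals 1.")
    return (observed - expected) / (1 - expected)


# Hypothetical counts, for illustration only.
kappa = cohen_kappa(a=40, b=10, c=5, d=45)
print(f"kappa = {kappa:.2f}")  # observed agreement 0.85, chance agreement 0.50
```

A kappa of 0 corresponds to agreement no better than chance and a kappa of 1 to perfect agreement.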

Analysis Option 2: Focus on the Agreement of Index and Reference Tests (1 of 2)


When test results are continuous and individual data points are available, the researcher can directly compare measurements between tests, or pool data from all available studies and:

1. Perform a regression of one test on the other using a method that accounts for measurement error

2. Conduct a Bland-Altman analysis (the difference between the two test results versus their average)

When test results are continuous but individual data points are not available, the researcher can summarize study-level information from (1) or (2) above.
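A Bland-Altman analysis reduces to a few summary quantities. The sketch below assumes paired measurements from the same patients and uses the conventional mean difference ± 1.96 standard deviations for the 95-percent limits of agreement. The function name and the measurements are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(index_values, reference_values):
    """Return Bland-Altman summaries for paired continuous measurements."""
    if len(index_values) != len(reference_values) or len(index_values) < 2:
        raise ValueError("Need at least two paired measurements.")
    pairs = list(zip(index_values, reference_values))
    differences = [i - r for i, r in pairs]   # plotted on the vertical axis
    averages = [(i + r) / 2 for i, r in pairs]  # plotted on the horizontal axis
    bias = mean(differences)                  # mean difference ("mean bias")
    half_width = 1.96 * stdev(differences)    # uses the sample standard deviation
    return {
        "averages": averages,
        "differences": differences,
        "bias": bias,
        "limits_of_agreement": (bias - half_width, bias + half_width),
    }


# Hypothetical paired measurements, for illustration only.
result = bland_altman([10, 20, 30, 40], [12, 18, 33, 37])
low, high = result["limits_of_agreement"]
print(f"Mean bias = {result['bias']:.2f}, limits of agreement = {low:.2f} to {high:.2f}")
```

Whether the resulting limits are narrow enough to treat the two methods as interchangeable is a clinical judgment, not a statistical one.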

Analysis Option 2: Focus on the Agreement of Index and Reference Tests (2 of 2)


Calculate “naïve” estimates of the index test sensitivity (Se) and specificity (Sp), ignoring the imperfection of the reference standard but making qualitative judgments on the direction of bias of these “naïve” estimates.

If the index and reference tests are independent within strata of disease (conditional independence), naïve estimates of index test Se and Sp are biased downward (underestimated).

If the index and reference tests are correlated within strata of disease, naïve estimates of Se and Sp can be:

1. Overestimates when the tests agree more than expected by chance

2. Underestimates when the tests disagree more than expected by chance

Problem: The researcher cannot assume conditional independence without justification; external data are needed.

Analysis Option 3: Calculate “Naïve” Estimates and Discuss Bias


The prostate-specific antigen (PSA) test is used to detect prostate cancer. Numerous methods have been developed to measure PSA levels.

These tests are prone to false-negative misclassification: PSA levels are not elevated in up to 15 percent of prostate cancer cases.

Obesity can reduce serum PSA. Obesity will likely affect all PSA-detection methods, old and new (“conditional dependence”).

Conditional dependence between PSA tests results in overestimation of the accuracy of a new (index) test.

When the index test is compared with a non-PSA reference (e.g., a prostate biopsy), there is no conditional dependence; misclassification results in underestimation.

Analysis Option 3: Example


Mathematically adjust (correct) the “naïve” estimates of the index test sensitivity and specificity to account for the imperfect reference standard.

Data from 2 × 2 tables are not enough; additional information is needed from the literature.

The task is easiest if conditional independence can be assumed. Options include:

1. The sensitivity and specificity of the imperfect reference test are known from other studies.

2. The specificities of both the index test and the imperfect reference standard are known from other studies, but the sensitivities are unknown.

3. Bayesian inference is used to incorporate information from other studies as prior distributions rather than fixed values. This approach provides estimates of sensitivity, specificity, and disease prevalence.

Alternative sets of assumptions are possible.

Problem: Model misspecification can result in biased estimates.
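To make the first scenario concrete, the sketch below shows one closed-form adjustment that follows algebraically from the conditional-independence assumption when the reference test’s sensitivity and specificity are treated as known. It is an illustration under stated assumptions, not the module’s prescribed method; the function name, cell labels, and counts are hypothetical, and with real data sampling error can push the corrected values outside the 0 to 1 range.

```python
def adjust_for_imperfect_reference(a, b, c, d, se_ref, sp_ref):
    """Correct naive index-test accuracy for a reference test of known accuracy.

    a: index (+), reference (+); b: index (+), reference (-);
    c: index (-), reference (+); d: index (-), reference (-).
    Assumes conditional independence of the two tests given true status.
    """
    n = a + b + c + d
    youden = se_ref + sp_ref - 1
    ref_pos = (a + c) / n  # observed proportion of reference (+) results
    if youden <= 0 or not (1 - sp_ref) < ref_pos < se_ref:
        raise ValueError("Inputs are incompatible with the assumed model.")
    prevalence = (ref_pos - (1 - sp_ref)) / youden
    se_index = (a / n * sp_ref - b / n * (1 - sp_ref)) / (ref_pos - (1 - sp_ref))
    sp_index = (d / n * se_ref - c / n * (1 - se_ref)) / (se_ref - ref_pos)
    return se_index, sp_index, prevalence


# Hypothetical counts generated from Se = 0.80, Sp = 0.90, prevalence = 0.20,
# with a reference test assumed to have Se = 0.90 and Sp = 0.95.
se, sp, prev = adjust_for_imperfect_reference(148, 92, 72, 688, se_ref=0.90, sp_ref=0.95)
print(f"Adjusted Se = {se:.2f}, Sp = {sp:.2f}, prevalence = {prev:.2f}")
```

The naïve estimates from the same table would be 148/220 (about 0.67) for sensitivity and 688/780 (about 0.88) for specificity, so the adjustment recovers the higher values used to generate the counts. If the conditional-independence assumption is wrong, the adjusted values can be as misleading as the naïve ones.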

Analysis Option 4: Mathematically Adjust “Naïve” Estimates


Obstructive sleep apnea (OSA) is characterized by sleep disturbances secondary to upper airway obstruction.

OSA has a prevalence of 2 to 4 percent in middle-aged adults.

It is associated with daytime somnolence, cardiovascular morbidity, diabetes, and other adverse outcomes.

Treatment includes continuous positive airway pressure.

A systematic review on the diagnosis of OSA in the home setting used portable monitors as the index diagnostic test and facility-based polysomnography as the reference standard.

The reviewers first attempted analysis option 3, then moved on to analysis option 2.

Example: Performing a Systematic Review on Obstructive Sleep Apnea

Trikalinos TA, Balion CM. Options for summarizing medical test performance in the absence of a “gold standard.” In: Methods guide for medical test reviews. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

Trikalinos TA, Ip S, Raman G, et al. Home diagnosis of obstructive sleep apnea-hypopnea syndrome. Technology Assessment. Available at www.cms.gov/Medicare/Coverage/DeterminationProcess/downloads/id48TA.pdf.

There is no “perfect” or accepted reference standard for obstructive sleep apnea (OSA).

A diagnosis of OSA is based on suggestive signs and symptoms and objective assessment of breathing patterns during sleep with facility-based polysomnography (PSG). PSG quantifies the Apnea-Hypopnea Index (AHI). Portable monitors can also measure AHI.

A high AHI (usually ≥15 events per hour of sleep) is suggestive of OSA; alternative cutoffs range from 5 to 40 events per hour.

The main analysis in the systematic review used a cutoff of AHI ≥15, but cutoffs of 10 and 20 were also analyzed (there were too few data to analyze other cutoffs).
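Analyzing several cutoffs amounts to dichotomizing both AHI measurements at each threshold and recomputing the “naïve” 2-by-2 table. The sketch below assumes the same cutoff is applied to the portable monitor and to polysomnography; the function name and the AHI values are hypothetical and are not data from the review.

```python
def naive_accuracy_at_cutoff(index_ahi, reference_ahi, cutoff):
    """Naive Se and Sp of the index test when both AHI values are cut at `cutoff`."""
    tp = fp = fn = tn = 0
    for idx, ref in zip(index_ahi, reference_ahi):
        index_pos = idx >= cutoff
        ref_pos = ref >= cutoff
        if index_pos and ref_pos:
            tp += 1
        elif index_pos:
            fp += 1
        elif ref_pos:
            fn += 1
        else:
            tn += 1
    # Return None when a column of the 2-by-2 table is empty.
    se = tp / (tp + fn) if tp + fn else None
    sp = tn / (tn + fp) if tn + fp else None
    return se, sp


# Hypothetical paired AHI values (events per hour), for illustration only.
portable = [4, 12, 18, 30, 9, 22]
polysomnography = [6, 10, 16, 35, 16, 14]
for cutoff in (10, 15, 20):
    print(cutoff, naive_accuracy_at_cutoff(portable, polysomnography, cutoff))
```

Because polysomnography is itself imperfect, the estimates at every cutoff remain “naïve” in the sense used throughout this module.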

Systematic Review Example: Choice of Reference Standard and Cutoff


The reviewers calculated “naïve” estimates of the sensitivity (Se) and specificity (Sp) of the Apnea-Hypopnea Index by comparing portable monitors with polysomnography and qualified the results.

“Naïve” estimates of sensitivity and specificity were displayed in receiver operating characteristic (ROC) space.

High Se and Sp levels were suggested.

Systematic Review Example: Analysis Option 3 - Naïve Estimates

However, there was considerable variability in the measurements.

It was not possible to deduce whether the “naïve” estimates overestimate or underestimate the “true” Se and Sp.


Reviewers also described concordance between the Apnea-Hypopnea Index (AHI) measured by portable monitors (“index” test) and by polysomnography (“reference” test) using Bland-Altman analysis (continuous data with individual points available), to judge whether the two tests are interchangeable.

They found better agreement for lower AHI levels.

Systematic Review Example: Analysis Option 2 - Pooled Data Analysis

Dashed line = line of perfect agreement

Broad limits = suboptimal agreement


The reviewers summarized Bland-Altman plots across studies.

The mean difference in the two measurements of the Apnea-Hypopnea Index (mean bias) and the 95-percent limits of agreement are shown for each study.

The 95-percent limits of agreement are very wide in most studies, suggesting great variability in the measurements with the two methods.

Systematic Review Example: Analysis Option 2 - Study-Specific Results


Measurements of the Apnea-Hypopnea Index (AHI) with the two methods generally agree on which patients have 15 or fewer events per hour of sleep (low AHI).

The methods disagree on the exact measurement among people who have higher AHIs on average.

The reviewers identified a gap in the literature.

The reviewers recommended undertaking studies that perform clinical validation of portable monitors, i.e., their ability to predict patients’ history, risk propensity, or clinical profile (analysis option 1).

Systematic Review Example: Conclusions and a Recommendation


When multiple reference standard tests, or multiple cutoffs for the same reference test, are available, justify the choice of test and/or cutoff, or consider analyzing multiple options.

Decide on the most appropriate analysis options to synthesize test performance.

The four analysis options presented in this module are largely complementary approaches and are not mutually exclusive.

Analysis options 1, 2, and 3 are recommended.

Analysis option 4 requires expert statistical help. There are no empirical data on the merits and pitfalls of the mathematical adjustments in option 4 for an imperfect reference standard.

Overall Recommendations


1. The validity of the reference standard should be questioned when the new test being evaluated is an improved version of the usually applied test.

a. True

b. False

Practice Question 1 (1 of 2)

Explanation for Question 1:

The statement is true. There are several situations when the validity of the reference standard should be questioned. These include when a new method of testing is an improved version of the usually applied test. Measurements using the different methods may not agree well.


2. Which of the following options is considered most preferable for evaluating information on a diagnostic test when there is no perfect reference test (gold standard)?

a. Assess the test’s ability to predict patient-relevant outcomes instead of test accuracy.

b. Assess whether the results of the two tests agree or disagree and treat them as two alternative measurement methods.

c. Calculate estimates of the index test’s sensitivity and specificity from each study, but qualify the study findings.

d. Adjust the estimates of sensitivity and specificity of the index test to account for the imperfect reference standard.



The correct answer is a. All of the options listed are suggested methods for synthesizing information on medical tests when there is no gold standard. The preferred method involves assessing the test’s ability to predict patient-relevant outcomes instead of calculating test accuracy when compared with an imperfect standard. This way, the index test is treated as a predictive instrument.


3. When considering imperfect reference standard bias, which of the following applies to naïve estimates of sensitivity and specificity when there is conditional independence of the results?

a. They are overestimates compared to the true values.

b. They are underestimates compared to the true values.

c. They are always equal to the true values.

d. They cannot be compared to the true values.



The correct answer is b. Conditional independence implies that the results of the index and reference tests are independent among people with and without the condition of interest. In this case, estimates of sensitivity and specificity from the standard formulas will usually be smaller than the true values. In other words, the naïve estimates of sensitivity and specificity for the index test will be underestimates of the true values.


4. When evaluating a medical test with no gold standard, one can mathematically calculate accurate sensitivity and specificity of the index test using standard 2 × 2 cross-tabulation of test results.

a. True

b. False



The statement is false. The estimates of sensitivity and specificity will have to be adjusted to account for the imperfect reference standard. This may require expert statistical help.


This presentation was prepared by Brooke Heidenfelder, Andrzej Kosinski, Rachael Posey, Lorraine Sease, Remy Coeytaux, Gillian Sanders, and Alex Vaz, members of the Duke University Evidence-based Practice Center.

The module is based on Trikalinos TA, Balion CM. Options for summarizing medical test performance in the absence of a “gold standard.” In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June 2012. p. 9.1-16. AHRQ Publication No. 12-EHC017. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

Authors

Albert PS, Dodd LE. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics 2004 Jun;60(2):427-35. PMID: 15180668.

Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995 Aug;311(7003):485. PMID: 7647644.

Bablok W, Passing H, Bender R, et al. A general regression procedure for method transformation. Application of linear regression procedures for method comparison studies in clinical chemistry, Part III. J Clin Chem Clin Biochem 1988 Nov;26(11):783-90. PMID: 3235954.

Black MA, Craig BA. Estimating disease prevalence in the absence of a gold standard. Stat Med 2002 Sep 30;21(18):2653-69. PMID: 12228883.

Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res 1999 Jun;8(2):135-60. PMID: 10501650.

References (1 of 8)

Bland JM, Altman DG. Applying the right statistics: analyses of measurement studies. Ultrasound Obstet Gynecol 2003 Jul;22(1):85-93. PMID: 12858311.

Bossuyt PM. Interpreting diagnostic test accuracy studies. Semin Hematol 2008 Jul;45(3):189-95. PMID: 18582626.

Dendukuri N, Hadgu A, Wang L. Modeling conditional dependence between diagnostic tests: a multiple latent variable model. Stat Med 2009 Feb 1;28(3):441-61. PMID: 19067379.

Dendukuri N, Joseph L. Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests. Biometrics 2001 Mar;57(1):158-67. PMID: 11252592.

Garrett ES, Eaton WW, Zeger S. Methods for evaluating the performance of diagnostic tests in the absence of a gold standard: a latent class model approach. Stat Med 2002 May 15;21(9):1289-307. PMID: 12111879.

References (2 of 8)

Gart JJ, Buck AA. Comparison of a screening test and a reference test in epidemiologic studies. II. A probabilistic model for the comparison of diagnostic tests. Am J Epidemiol 1966 May;83(3):593-602. PMID: 5932703.

Goldberg JD, Wittes JT. The estimation of false negatives in medical screening. Biometrics 1978 Mar;34(1):77-86. PMID: 630038.

Gyorkos TW, Genta RM, Viens P, et al. Seroepidemiology of Strongyloides infection in the Southeast Asian refugee population in Canada. Am J Epidemiol 1990 Aug;132(2):257-64. PMID: 2196791.

Hui SL, Zhou XH. Evaluation of diagnostic tests without gold standards. Stat Methods Med Res 1998 Dec;7(4):354-70. PMID: 9871952.

Joseph L, Gyorkos TW. Inferences for likelihood ratios in the absence of a "gold standard". Med Decis Making 1996 Oct-Dec;16(4):412-7. PMID: 8912303.

References (3 of 8)

Jonas DE, Wilt TJ, Taylor BC, et al. Chapter 11: challenges in and principles for conducting systematic reviews of genetic tests used as predictive indicators. J Gen Intern Med 2012 Jun;27 Suppl 1:S83-93. PMID: 22648679.

Linnet K. Estimation of the linear relationship between the measurements of two methods with proportional errors. Stat Med 1990 Dec;9(12):1463-73. PMID: 2281234.

Linnet K. Performance of Deming regression analysis in case of misspecified analytical error ratio in method comparison studies. Clin Chem 1998 May;44(5):1024-31. PMID: 9590376.

Qu Y, Tan M, Kutner MH. Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics 1996 Sep;52(3):797-810. PMID: 8805757.

References (4 of 8)

Reitsma JB, Rutjes AW, Khan KS, et al. A review of solutions for diagnostic accuracy studies with an imperfect or missing reference standard. J Clin Epidemiol 2009 Aug;62(8):797-806. PMID: 19447581.

Rutjes AW, Reitsma JB, Coomarasamy A, et al. Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technol Assess 2007 Dec;11(50):iii, ix-51. PMID: 18021577.

Sokal RR, Rohlf EF. Biometry. New York, NY: Freeman; 1981.

Sun S. Meta-analysis of Cohen's kappa. Health Serv Outcomes Res Method 2011;11:145-163.

Thompson IM, Pauler DK, Goodman PJ, et al. Prevalence of prostate cancer among men with a prostate-specific antigen level < or =4.0 ng per milliliter. N Engl J Med 2004 May 27;350(22):2239-46. PMID: 15163773.

References (5 of 8)

Toft N, Jorgensen E, Hojsgaard S. Diagnosing diagnostic tests: evaluating the assumptions underlying the estimation of sensitivity and specificity in the absence of a gold standard. Prev Vet Med 2005 Apr;68(1):19-33. PMID: 15795013.

Torrance-Rynard VL, Walter SD. Effects of dependent errors in the assessment of diagnostic test performance. Stat Med 1997 Oct 15;16(19):2157-75. PMID: 9330426.

Trikalinos TA, Balion CM. Options for summarizing medical test performance in the absence of a “gold standard.” In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June 2012. p. 9.1-16. AHRQ Publication No. 12-EHC017. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

References (6 of 8)

Trikalinos TA, Balion CM, Coleman CI, et al. Chapter 8: meta-analysis of medical test performance when there is a “gold standard.” J Gen Intern Med 2012 Jun;27 Suppl 1:S56-66. PMID: 22648676.

Trikalinos TA, Ip S, Raman G, et al. Home diagnosis of obstructive sleep apnea-hypopnea syndrome. Technology Assessment (Prepared by the Tufts–New England Medical Center Evidence-based Practice Center). Rockville, MD: Agency for Healthcare Research and Quality; August 2007. Available at www.cms.gov/Medicare/Coverage/DeterminationProcess/downloads/id48TA.pdf.

Vacek PM. The effect of conditional dependence on the evaluation of diagnostic tests. Biometrics 1985 Dec;41(4):959-68. PMID: 3830260.

Walter SD, Irwig LM. Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review. J Clin Epidemiol 1988;41(9):923-37. PMID: 3054000.

References (7 of 8)

Walter SD, Irwig L, Glasziou PP. Meta-analysis of diagnostic tests with imperfect reference standards. J Clin Epidemiol 1999 Oct;52(10):943-51. PMID: 10513757.

Whiting P, Rutjes AW, Reitsma JB, et al. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004 Feb 3;140(3):189-202. PMID: 14757617.

References (8 of 8)