The University of Sydney School of Public Health Diagnostic test accuracy reviews. Advanced Meta-analysis: dealing with heterogeneity and test comparisons. Petra Macaskill Screening and Test Evaluation Program School of Public Health University of Sydney Co-convenor, Cochrane Screening and Diagnostic Tests Methods Group
Outline
• Background
• Descriptive Analyses (available in RevMan)
– Graphical displays
– Summary ROC
– Exploring heterogeneity
• Hierarchical Models (not available in RevMan)
– Rationale for using hierarchical models
– Choice of model:
• Bivariate
• HSROC (Rutter and Gatsonis model)
– Investigating heterogeneity
– Index test comparisons
Requires statistical expertise
Major steps covered in:
Cochrane Handbook for Systematic Reviews of
Diagnostic Test Accuracy
Objective of the review (e.g. performance of a single test,
exploring heterogeneity in test performance, test comparisons)
Locating and selecting studies
Assessing study quality – QUADAS-2 updates in preparation
Extracting data – to be updated
Meta-analysis
Interpretation of the results – in preparation
Chapter 10: Analysing and Presenting Results. Petra Macaskill, Constantine Gatsonis, Jonathan Deeks, Roger Harbord, Yemisi Takwoingi.
Systematic Review of
Diagnostic Test Performance
http://srdta.cochrane.org/handbook-dta-reviews
Single index test:
Remains a common form of systematic review
Heterogeneity in test performance between studies is likely to be present, and reasons for it should be explored.
Test comparisons:
Increasing in importance and relevance
Methods for investigating heterogeneity can be applied
Ideally, test comparisons should focus on studies that directly compare the tests of interest
Systematic Review of
Diagnostic Test Performance
Reference test (binary)
“true” disease status, i.e. target condition
Index test (continuous, ordinal or binary)
Test threshold
Sensitivity and specificity
Likelihood ratios
ROC curve
Underlying Concepts
Test threshold: Individual Study Level
A plot of sensitivity against 1-specificity across the range of thresholds
results in a receiver operating characteristic (ROC) curve.
For a single study:
[Figure: overlapping test-measurement distributions for the non-diseased and diseased groups, separated by a movable threshold. Lowering the threshold increases both TP and FP; raising it decreases both.]
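The threshold behaviour above can be sketched in code: for a single study with overlapping measurement distributions, lowering the positivity threshold raises both the true positive and false positive rates together. The measurements below are hypothetical.

```python
# Sketch (hypothetical data): moving the positivity threshold traces out the
# ROC curve for a single study. "Positive" means measurement >= threshold.

def roc_points(diseased, non_diseased, thresholds):
    """Return (fpr, tpr) pairs, one per threshold."""
    points = []
    for t in thresholds:
        tpr = sum(x >= t for x in diseased) / len(diseased)
        fpr = sum(x >= t for x in non_diseased) / len(non_diseased)
        points.append((fpr, tpr))
    return points

diseased = [55, 60, 70, 82, 90, 95, 101, 110]      # hypothetical measurements
non_diseased = [20, 25, 33, 40, 48, 52, 61, 70]
print(roc_points(diseased, non_diseased, [30, 50, 70, 90]))
```

Raising the threshold lowers both the tpr and the fpr, so the points move down the ROC curve towards the origin.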
ROC curves: Individual Study Level
[Figure: four panels, each pairing the test-measurement distributions for the diseased and non-diseased groups (measurement range 0 to 120) with the corresponding plot of sensitivity against specificity. Each choice of threshold yields one (sensitivity, specificity) point; varying the threshold traces out the ROC curve.]
Data extraction
Most studies report test sensitivity and specificity at one or more thresholds, or provide sufficient information to construct the following 2 x 2 table at each threshold:

                  "true" disease status
                       +      -
test result      +    TP     FP
                 -    FN     TN

From this table we can compute:
True positive rate (tpr) = sensitivity = TP / (TP + FN)
False positive rate (fpr) = 1 - specificity = FP / (FP + TN)
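As a minimal sketch, these measures can be computed directly from the 2 x 2 counts; the example call uses the counts for the first study (Aho 1999) in the rheumatoid factor example that follows.

```python
# Minimal sketch: accuracy measures from a single study's 2 x 2 table.

def accuracy_measures(tp, fp, fn, tn):
    sens = tp / (tp + fn)          # true positive rate
    spec = tn / (tn + fp)          # 1 - false positive rate
    return {"sensitivity": sens, "specificity": spec,
            "tpr": sens, "fpr": 1 - spec}

# Counts for Aho 1999 (TP=64, FP=16, FN=27, TN=153)
print(accuracy_measures(tp=64, fp=16, fn=27, tn=153))
```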
Reasons for variability in test accuracy
between studies
• Random sampling error
For each study, the estimated sensitivity and specificity are subject to
sampling error. The larger the sample size, the smaller the sampling
error, as shown by the confidence intervals in a Forest plot.
Because the sensitivity and specificity are both proportions, the
within study sampling error is straightforward to estimate using
the binomial distribution.
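To illustrate how sampling error shrinks with sample size, here is a sketch using the Wilson score interval for a binomial proportion (the choice of interval method is an assumption here, not taken from the slides):

```python
import math

# Sketch: 95% Wilson score interval for a proportion (e.g. sensitivity),
# showing the CI narrowing as sample size grows. z = 1.96 for 95% coverage.

def wilson_ci(successes, n, z=1.96):
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Same observed sensitivity (80%), different sample sizes:
print(wilson_ci(16, 20))    # small study: wide interval
print(wilson_ci(160, 200))  # large study: narrow interval
```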
Reasons for variability in test accuracy
between studies
• True underlying differences between studies
– In diagnostic reviews, sampling error is unlikely to account for all
of the variability (scatter) between studies.
– Additional heterogeneity in test performance between studies is
likely to occur for other reasons, including differences in:
• Cut-point chosen to define a positive test (threshold effect)
• Spectrum of disease
• Clinical setting
• Study design
• etc…
Even if all studies use the same cut-point, sensitivity and
specificity are expected to vary between studies
Graphical Displays
Descriptive plots should include:
– Forest plot showing sensitivity and specificity for each study and
the numbers on which these estimates are based for each study
– Scatter plot showing (1-specificity, sensitivity) pair for each study
in ROC space. The size of each marker should ideally reflect the
numbers in both the diseased and non-diseased groups.
RevMan provides facilities for:
• graphical displays (improvements made in version 5.2)
• summary ROC curve estimation based on the Moses-Littenberg method
• descriptive exploration of heterogeneity using subgroup analyses
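A minimal sketch of the Moses-Littenberg method: each study contributes D = logit(tpr) − logit(fpr), the log diagnostic odds ratio, and S = logit(tpr) + logit(fpr), a proxy for the positivity threshold; D is regressed on S by ordinary least squares and the fitted line is mapped back to a summary ROC curve. The example call uses the counts for the first three studies of the RF example; the 0.5 continuity correction is a common convention.

```python
import math

# Sketch of the Moses-Littenberg summary ROC method.

def logit(p):
    return math.log(p / (1 - p))

def moses_littenberg(studies):
    """studies: list of (tp, fp, fn, tn) tuples. Returns (intercept a, slope b)
    of the OLS fit D = a + b*S, with a 0.5 continuity correction per cell."""
    D, S = [], []
    for tp, fp, fn, tn in studies:
        tpr = (tp + 0.5) / (tp + fn + 1)
        fpr = (fp + 0.5) / (fp + tn + 1)
        D.append(logit(tpr) - logit(fpr))
        S.append(logit(tpr) + logit(fpr))
    n = len(studies)
    mS, mD = sum(S) / n, sum(D) / n
    b = (sum((s - mS) * (d - mD) for s, d in zip(S, D))
         / sum((s - mS) ** 2 for s in S))
    a = mD - b * mS
    return a, b

def sroc_sensitivity(a, b, fpr):
    """Expected sensitivity on the fitted SROC curve at a given fpr."""
    x = logit(fpr)
    return 1 / (1 + math.exp(-(a + (1 + b) * x) / (1 - b)))

# First three RF studies (Aho 1999, Anuradha 2005, Banchuin 1992):
a, b = moses_littenberg([(64, 16, 27, 153), (482, 2, 82, 153), (36, 6, 41, 313)])
print(sroc_sensitivity(a, b, 0.1))  # expected sensitivity at fpr = 0.1
```

The slope b captures asymmetry of the curve (b = 0 gives a symmetric SROC). Note that this method is descriptive only; it ignores sampling error in S and between-study heterogeneity, which is the rationale for the hierarchical models discussed later.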
50 studies taken from the review conducted by Nishimura (2007) of
Rheumatoid factor (RF) as a marker for rheumatoid arthritis (RA)
The cut-point for test positivity for RF varied between studies, ranging from
3 to 100 U/ml (not all studies reported the cut-point).
The reference standard was based on the 1987 revised American
College of Rheumatology (ACR) criteria or clinical diagnosis.
Note: RF contributes to the ACR criteria so there is some risk of bias in
this analysis.
Example: Rheumatoid Factor as a marker
for Rheumatoid Arthritis
[Forest plot, sorted by specificity: 50 studies (Aho 1999 through Young 1991), each shown with its TP, FP, FN and TN counts, RF cut-off (3 to 100 U/ml, where reported), method of measurement (latex agglutination (LA), ELISA, nephelometry, RA hemagglutination, or not reported), and sensitivity and specificity with 95% confidence intervals plotted on a 0 to 1 scale.]
Example: Rheumatoid Factor as a marker
for Rheumatoid Arthritis
Moses LE, Shapiro D, Littenberg B Stat Med 1993; 12:1293-1316.
The shape parameter, beta, is common to all three curves.
Example: Rheumatoid Factor as a marker for
Rheumatoid Arthritis:
Method of measurement of RF
LA appears to be less accurate than nephelometry (N) and ELISA (E),
whose curves show very similar accuracy.
Removing a1*rfm1 + a2*rfm2 from the model gave a chi-squared statistic
of 0.6 (2 df, P = 0.74). Hence, there is no statistical evidence that
the method of measurement of RF is associated with accuracy.
The effect of potentially influential
studies should be investigated.
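The quoted P-value can be checked directly: for 2 degrees of freedom, the chi-squared survival function has the closed form exp(−x/2).

```python
import math

# Check the quoted P-value for the 2-df chi-squared statistic of 0.6
# obtained when the two method-of-measurement terms are dropped.
chi2_stat = 0.6          # likelihood-ratio statistic, 2 degrees of freedom
p_value = math.exp(-chi2_stat / 2)   # survival function; valid for df = 2 only
print(round(p_value, 2))  # → 0.74
```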
Index Test Comparisons
Comparison based on all studies that evaluate one or both tests:
Methods of analysis follow the same approach as already outlined for investigation of heterogeneity
It may be necessary to allow variances of random effects to vary by test.
Such comparisons may be biased due to confounding arising from heterogeneity among studies in terms of design, study quality, setting, etc.
Adjusting for potential confounders is often not feasible because the required information is typically missing or poorly reported.
Index Test Comparisons
Comparison restricted to studies that evaluate both tests:
Restricting the analysis to studies that evaluated both tests in the same patients (truly "paired" studies), or randomised patients to receive each test, removes the need to adjust for confounders.
Methods of analysis for investigation of heterogeneity are extended to model sensitivity and specificity for both tests within each study (i.e. 2 records for sensitivity and 2 records for specificity per study, with a covariate for test type):
• all studies are analysed as if they are randomised
• this approach is generally conservative
• methods for dealing with pairing of test results within studies are under development
The cross-classification of test results within disease groups for truly paired studies is generally not reported.
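A sketch of the data layout described above, using hypothetical counts: each comparative study contributes two records per test (one for the diseased group, one for the non-diseased group), with the test identity carried as a covariate.

```python
# Sketch (hypothetical counts): long-format records for a comparative
# meta-analysis, with a covariate for test type.

def long_format(study_id, counts_by_test):
    """counts_by_test: {test_name: (tp, fp, fn, tn)} -> list of record dicts,
    two per test (diseased and non-diseased groups)."""
    records = []
    for test, (tp, fp, fn, tn) in counts_by_test.items():
        records.append({"study": study_id, "test": test,
                        "group": "diseased", "pos": tp, "n": tp + fn})
        records.append({"study": study_id, "test": test,
                        "group": "non_diseased", "pos": fp, "n": fp + tn})
    return records

rows = long_format("Study 1", {"CT": (45, 5, 5, 45), "US": (38, 8, 12, 42)})
print(len(rows))  # → 4
```

This layout gives the model 2 records for sensitivity and 2 for specificity per study; because the pairing of results within patients is usually not reported, the records are modelled as if independent, which is the conservative assumption noted above.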
Example: Comparison of Computed Tomography (CT)
and Ultrasonography (US) for the diagnosis of
appendicitis.
22 studies were included in the review by Terasawa (2004)
12 studies evaluated CT
14 studies evaluated US
4 studies evaluated both CT and US.
Example: Comparison of Computed Tomography (CT)
and Ultrasonography (US) for the diagnosis of
appendicitis.
Analysis based on all studies:
Strong statistical evidence of a difference
in sensitivity and specificity between the
tests (P<0.001)
CT has higher sensitivity and specificity
than US.
Example: Comparison of Computed Tomography (CT)
and Ultrasonography (US) for the diagnosis of
appendicitis.
Analysis based on comparative studies:
CT consistently shows higher sensitivity
than US
Specificity for CT is equal to or greater
than for US
Only 4 studies available for this model.
Convergence is an issue, and simplifying
assumptions may be necessary.
Analyses in RevMan are designed to be descriptive and exploratory.
Hierarchical models provide a more rigorous approach. The bivariate model
and the Rutter and Gatsonis HSROC model are the most commonly used.
The choice of model must be informed by the research question and whether a
common threshold for test positivity is used across studies.
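A minimal sketch of the bivariate model referred to above, in its standard logit-normal form (the notation is assumed here, not taken from the slides): within each study the observed counts are binomial, and between studies the logit-transformed sensitivity and specificity are jointly normal, which captures their correlation across studies.

```latex
% Within study i (positives among the diseased / non-diseased subjects):
%   y_{1i} \sim \mathrm{Binomial}(n_{1i},\, Se_i), \quad
%   y_{2i} \sim \mathrm{Binomial}(n_{2i},\, 1 - Sp_i)
% Between studies:
\begin{pmatrix} \operatorname{logit}(Se_i) \\ \operatorname{logit}(Sp_i) \end{pmatrix}
\sim N\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} \right)
```

Study-level covariates (e.g. test type) can be added to the means, which is how heterogeneity and test comparisons are investigated in this framework.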
Covariates can be included in hierarchical models to investigate heterogeneity.
The results can be input to RevMan for graphical display.
Modelling of test comparisons follows the approach for investigation of
heterogeneity.
Ideally, comparative meta-analysis should focus on studies that compare tests
directly.
A comprehensive list of references is provided in Chapter 10 of the Handbook
for DTA Reviews.
Concluding Remarks
Small number of studies
Convergence issues
Model checking
Data reported at multiple thresholds per study:
• choosing a cut-point for each study
• methods for analysing multiple 2x2 tables per study (Hamza TH, Arends LR, van Houwelingen HC, Stijnen T. Multivariate random effects meta-analysis of diagnostic tests with multiple thresholds. BMC Medical Research Methodology 2009;9:73. DOI: 10.1186/1471-2288-9-73)
Other?
Discussion Points (methods continue to be extended and refined!)