Studies of Diagnostic Tests Thomas B. Newman, MD, MPH October 14, 2010

Dec 28, 2015

Transcript
Page 1

Studies of Diagnostic Tests

Thomas B. Newman, MD, MPH

October 14, 2010

Page 2

Reminders/Announcements

Write down answers to as many of the problems in the book as you can and check your answers!

Final exam to be passed out 12/2, reviewed 12/9

– Send questions!

Page 3

Overview

Common biases of studies of diagnostic test accuracy

Prevalence, spectrum and nonindependence

Meta-analysis of diagnostic tests

Checklist & systematic approach

Examples:

– Pain with percussion, hopping or cough for appendicitis

– Pertussis

Page 4

Bias #1 Example

Study of BNP to diagnose congestive heart failure (CHF, Chapter 4, Problem 3)

Page 5

Bias #1 Example

Gold standard: determination of CHF by two cardiologists blinded to BNP

“The best clinical predictor of congestive heart failure was an increased heart size on chest roentgenogram (accuracy, 81 percent)”

Is there a problem with assessing accuracy of chest x-rays to diagnose CHF in this study?

*Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347(3):161-7.

Page 6

Bias #1: Incorporation bias

Cardiologists not blinded to chest X-ray

Probably used (incorporated) it to make final diagnosis

Incorporation bias for assessment of chest X-ray (not BNP)

Biases both sensitivity and specificity upward

Page 7

Bias #2 Example: Visual assessment of jaundice in newborns

– Study patients who are getting a bilirubin measurement

– Ask clinicians to estimate extent of jaundice at time of blood draw

Page 8

Visual Assessment of jaundice*: Results

*Moyer et al., APAM 2000; 154:391

Sensitivity of jaundice below the nipple line for bilirubin ≥ 12 mg/dL = 97%

Specificity = 19%

What is the problem?

Editor’s Note: The take-home message for me is that no jaundice below the nipple line equals no bilirubin test, unless there’s some other indication.

--Catherine D. DeAngelis, MD

Page 9

Bias #2: Verification Bias*

Inclusion criterion for study: gold standard test was done

– in this case, blood test for bilirubin

Subjects with positive index tests are more likely to get the gold standard and to be included in the study

– clinicians usually don’t order blood test for bilirubin if there is little or no jaundice

How does this affect sensitivity and specificity?

*AKA Work-up, Referral Bias, or Ascertainment Bias

Page 10

Bias #2: Verification Bias*

                            TSB >12    TSB < 12
Jaundice below nipple          a          b
No jaundice below nipple       c          d

Sensitivity, a/(a+c), is biased ___.

Specificity, d/(b+d), is biased ___.

*AKA Work-up, Referral Bias, or Ascertainment Bias

But is sensitivity what we really want to know to support Cathy’s conclusion?
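The direction of the bias in the 2x2 table above can be explored numerically. The following is a minimal sketch, not data from the jaundice study: the cohort counts and the verification probabilities `p_pos` and `p_neg` are hypothetical values chosen only to illustrate what happens when infants with a positive index test are far more likely to get the gold standard.

```python
def sens_spec(a, b, c, d):
    """Return (sensitivity, specificity) for a 2x2 table with cells a, b, c, d."""
    return a / (a + c), d / (b + d)


def verified_cells(a, b, c, d, p_pos, p_neg):
    """Expected cell counts among subjects who actually get the gold standard.

    p_pos / p_neg are hypothetical probabilities that a subject with a
    positive / negative index test has the gold standard done.
    """
    return a * p_pos, b * p_pos, c * p_neg, d * p_neg


# Hypothetical full cohort of 1000 newborns (illustrative only).
full = (80, 200, 20, 700)
true_sens, true_spec = sens_spec(*full)

# Assume 90% of jaundiced but only 10% of non-jaundiced infants get a blood test.
observed = verified_cells(*full, p_pos=0.9, p_neg=0.1)
obs_sens, obs_spec = sens_spec(*observed)

print(f"full cohort:   sensitivity {true_sens:.2f}, specificity {true_spec:.2f}")
print(f"verified only: sensitivity {obs_sens:.2f}, specificity {obs_spec:.2f}")
```

Under these assumptions, restricting the study to verified subjects pushes sensitivity up and specificity down, because cells c and d shrink much more than cells a and b.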

Page 11

Bias #2: Verification Bias

                            TSB >12    TSB < 12
Jaundice below nipple          a          b
No jaundice below nipple       c          d

Negative predictive value was 94%. Is it biased?

The “Test negative” group (no jaundice) that still gets the gold standard may have other risk factors or indications

Therefore, c may be too high relative to d and NPV may be underestimated
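A small numeric sketch can make the NPV argument concrete. All numbers below are hypothetical (they are not from the jaundice study): test-negative infants are split into a small group with other indications for a bilirubin test and a large group without, and the higher-risk group is assumed to be verified much more often.

```python
# Hypothetical test-negative newborns (no jaundice below the nipple line).
# Each tuple: (label, n, P(high bilirubin), P(gets the blood test)).
groups = [
    ("other risk factors or indications", 100, 0.20, 0.50),
    ("no other indication", 900, 0.02, 0.05),
]


def npv(groups, verified_only):
    """NPV = d / (c + d) among test-negatives, optionally verified ones only."""
    c = d = 0.0
    for _label, n, p_disease, p_verify in groups:
        weight = n * (p_verify if verified_only else 1.0)
        c += weight * p_disease          # test negative, disease present
        d += weight * (1.0 - p_disease)  # test negative, disease absent
    return d / (c + d)


npv_all = npv(groups, verified_only=False)
npv_verified = npv(groups, verified_only=True)

print(f"NPV in all test-negatives:      {npv_all:.3f}")
print(f"NPV in verified test-negatives: {npv_verified:.3f}")
```

In this sketch the verified test-negatives are enriched for higher-risk infants, so the NPV computed from them is lower than the NPV in all test-negatives.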

Page 12

Bias #3 Example: PIOPED study of accuracy of ventilation/perfusion (V/Q) scan to diagnose pulmonary embolus*

Study Population: All patients presenting to the ED who received a V/Q scan

Test: V/Q Scan

Disease: Pulmonary embolism (PE)

Gold Standards:

– 1. Pulmonary arteriogram (PA-gram) if done (more likely with more abnormal V/Q scan)

– 2. Clinical follow-up in other patients (more likely with normal V/Q scan)

*PIOPED. JAMA 1990;263(20):2753-9.

Page 13

Double Gold Standard Bias

Two different “gold standards”

– One gold standard (usually an immediate, more invasive test, e.g., angiogram, surgery) is more likely to be applied in patients with positive index test

– Second gold standard (e.g., clinical follow-up) is more likely to be applied in patients with a negative index test.

Page 14

Double Gold Standard Bias

There are some patients in whom the two “gold standards” do not give the same answer

– Spontaneously resolving disease (positive with immediate invasive test, but not with follow-up)

– Newly occurring or newly detectable disease (positive with follow-up but not with immediate invasive test)

Page 15

Effect of Double Gold Standard Bias 1: Spontaneously resolving disease

Test result will always agree with gold standard

Both sensitivity and specificity increase

Example: Joe has a small pulmonary embolus (PE) that will resolve spontaneously.

– If his VQ scan is positive, he will get an angiogram that shows the PE (true positive)

– If his VQ scan is negative, his PE will resolve and we will think he never had one (true negative)

VQ scan can’t be wrong!

Page 16

Effect of Double Gold Standard Bias 2: Newly occurring or newly detectable disease

Test result will always disagree with gold standard

Both sensitivity and specificity decrease

Example: Jane has a nasty breast cancer that is currently undetectable

– If her mammogram is positive, she will get biopsies that will not find the tumor (mammogram will look falsely positive)

– If her mammogram is negative, she will return in several months and we will think the tumor was initially missed (mammogram will look falsely negative)

Mammogram can’t be right!

Page 17

Spectrum of Disease, Nondisease and Test Results

Disease is often easier to diagnose if severe

“Nondisease” is easier to diagnose if patient is well than if the patient has other diseases

Test results will be more reproducible if ambiguous results excluded

Page 18

Spectrum Bias

Sensitivity depends on the spectrum of disease in the population being tested.

Specificity depends on the spectrum of non-disease in the population being tested.

Example: Absence of Nasal Bone (on 13-week ultrasound) as a Test for Chromosomal Abnormality

Page 19

Spectrum Bias Example: Absence of Nasal Bone as a Test for Chromosomal Abnormality*

Nasal Bone Absent     D+      D-      LR
Yes                  229     129    27.8
No                   104    5094    0.32
Total                333    5223

Sensitivity = 229/333 = 69%

BUT the D+ group only included fetuses with Trisomy 21

*Cicero et al., Ultrasound Obstet Gynecol 2004; 23: 218-23

Page 20

Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality

The D+ group excluded 295 fetuses with other chromosomal abnormalities (mainly Trisomy 18)

Among these fetuses, the sensitivity of nasal bone absence was 32% (not 69%)

What decision is this test supposed to help with?

– If it is whether to test chromosomes using chorionic villus sampling or amniocentesis, these 295 fetuses should be included!

Page 21

Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality, effect of including other trisomies in D+ group

Nasal Bone Absent     D+                   D-      LR
Yes                  229 +  95 = 324      129    20.4
No                   104 + 200 = 304     5094    0.50
Total                333 + 295 = 628     5223

Sensitivity = 324/628 = 52%

NOT 69% obtained when the D+ group only included fetuses with Trisomy 21
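The arithmetic behind the two nasal bone tables can be checked with a short script. This is only a sketch: the counts are the ones shown on the slides, while the helper name `accuracy` is introduced here for illustration.

```python
def accuracy(tp, fn, fp, tn):
    """Return sensitivity, specificity, LR+ and LR- from 2x2 cell counts."""
    sens = tp / (tp + fn)
    spec = tn / (fp + tn)
    return {
        "sens": sens,
        "spec": spec,
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
    }


# D+ = Trisomy 21 only.
t21_only = accuracy(tp=229, fn=104, fp=129, tn=5094)
# D+ = Trisomy 21 plus the 295 fetuses with other chromosomal abnormalities.
all_abnormal = accuracy(tp=229 + 95, fn=104 + 200, fp=129, tn=5094)

for label, r in [("Trisomy 21 only", t21_only), ("All abnormalities", all_abnormal)]:
    print(f"{label}: sens {r['sens']:.2f}, LR+ {r['lr_pos']:.1f}, LR- {r['lr_neg']:.2f}")
```

Recomputed this way, the sensitivities (69% and 52%) and the LRs of 27.8, 0.32 and 0.50 match the slides. The LR+ for the pooled D+ group comes out at about 20.9 rather than the 20.4 shown, which may reflect rounding or a transcription difference in the slide.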

Page 22

Quiz: What if we considered the nasal bone absence as a test for Trisomy 21?

Then instead of excluding subjects with other chromosomal abnormalities or including them as D+, we should count them as D-.

Compared with excluding them,

What would happen to sensitivity?

What would happen to specificity?

Page 23

Quiz: What if we considered the nasal bone absence as a test for Trisomy 21?

Nasal Bone Absent     D+      D-
Yes                  229      478 +  95 =  573
No                   104     4745 + 200 = 4945
Total                333     5223 + 295 = 5518

Sensitivity unchanged

Specificity reduced
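The quiz answer can be verified directly from the counts printed on this slide. Note that the D- counts here (478 and 4745) differ from those in the earlier nasal bone tables (129 and 5094); this sketch simply uses the quiz slide's numbers as printed, and the variable names are introduced for illustration.

```python
# Counts as printed on the quiz slide.
tp, fn = 229, 104               # Trisomy 21, nasal bone absent / present
fp, tn = 478, 4745              # D- before adding other abnormalities
other_pos, other_neg = 95, 200  # other chromosomal abnormalities

# Sensitivity only involves the D+ column, so moving subjects into D- leaves it alone.
sens = tp / (tp + fn)

# Specificity with other abnormalities excluded vs. counted as D-.
spec_excluded = tn / (fp + tn)
spec_as_d_neg = (tn + other_neg) / (fp + other_pos + tn + other_neg)

print(f"sensitivity: {sens:.2f} in both cases")
print(f"specificity: {spec_excluded:.3f} excluded, {spec_as_d_neg:.3f} counted as D-")
```

Specificity falls because the added fetuses have a much higher rate of nasal bone absence (95/295) than the original D- group.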

Page 24

Prevalence, spectrum and nonindependence

Prevalence (prior probability) of disease may be related to disease severity

One mechanism is different spectra of disease or nondisease

Another is that whatever is causing the high prior probability is related to the same aspect of the disease as the test

Page 25

Prevalence, spectrum and nonindependence

Examples

– Iron deficiency, HIV

– Diseases identified by screening

Urinalysis as a test for UTI in women with more and fewer symptoms (high and low prior probability)

              Sensitivity   Specificity   LR+    LR-
High Prior        92%           42%       1.6    0.19
Low Prior         56%           78%       2.5    0.56
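The likelihood ratios in the urinalysis table follow from the sensitivity and specificity in each row. A minimal sketch of that calculation, with the function name `likelihood_ratios` introduced here for illustration:

```python
def likelihood_ratios(sens, spec):
    """Return (LR+, LR-) for a dichotomous test."""
    lr_pos = sens / (1 - spec)   # P(T+ | D+) / P(T+ | D-)
    lr_neg = (1 - sens) / spec   # P(T- | D+) / P(T- | D-)
    return lr_pos, lr_neg


rows = [("High Prior", 0.92, 0.42), ("Low Prior", 0.56, 0.78)]
for label, sens, spec in rows:
    lr_pos, lr_neg = likelihood_ratios(sens, spec)
    print(f"{label}: LR+ {lr_pos:.1f}, LR- {lr_neg:.2f}")
```

The point of the table is that the LRs differ between the two groups, so the test's accuracy is not independent of the prior probability.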

Page 26

Overfitting

Page 27

Meta-analyses of Diagnostic Tests

Systematic and reproducible approach to finding studies

Summary of results of each study

Investigation into heterogeneity

Summary estimate of results, if appropriate

Unlike other meta-analyses (risk factors, treatments), results aren’t summarized with a single number (e.g., RR), but with two related numbers (sensitivity and specificity)

These can be plotted on an ROC plane

Page 28

MRI for the diagnosis of MS

Whiting et al. BMJ 2006;332:875-84

Page 29

Dermoscopy vs Naked Eye for Diagnosis of Malignant Melanoma

Br J Dermatol. 2008 Sep;159(3):669-76

Page 30

Studies of Diagnostic Test Accuracy: Checklist

Was there an independent, blind comparison with a reference (“gold”) standard of diagnosis?

Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?

Was the reference standard applied regardless of the diagnostic test result?

Was the test (or cluster of tests) validated in a second, independent group of patients?

From Sackett et al., Evidence-based Medicine,2nd ed. (NY: Churchill Livingstone), 2000. p 68

Page 31

Systematic Approach

Authors and funding source

Research question

Study design

Study subjects

Predictor variable

Outcome variable

Results & Analysis

Conclusions

Consider possible biases due to deviations from a perfect study and estimate the magnitude and direction of each

Page 32

A clinical decision rule to identify children at low risk for appendicitis (Problem 5.6)

Study design: prospective cohort study

Subjects

– 4140 patients 3-18 years presenting to Boston Children’s Hospital ED with abdominal pain

– Of these, 767 (19%) received surgical consultation for possible appendicitis

– 113 excluded (chronic diseases, recent imaging)

– 53 missed

– 601 included in the study (425 in derivation set)

Kharbanda et al. Pediatrics 2005; 116(3): 709-16

Page 33

A clinical decision rule to identify children at low risk for appendicitis

Predictor variable

– Standardized assessment by pediatric ED attending

– Focus on “Pain with percussion, hopping or cough” (complete data in N=381)

Outcome variable:

– Pathologic diagnosis of appendicitis (or not) for those who received surgery (37%)

– Follow-up telephone call to family or pediatrician 2-4 weeks after the ED visit for those who did not receive surgery (63%)

Kharbanda et al. Pediatrics 116(3): 709-16

Page 34

A clinical decision rule to identify children at low risk for appendicitis

Results: Pain with percussion, hopping or cough

78% sensitivity and 83% NPV seem low to me. Are they valid for me in deciding whom to image?

Kharbanda et al. Pediatrics 116(3): 709-16

Page 35

Checklist

Was there an independent, blind comparison with a reference (“gold”) standard of diagnosis?

Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?

Was the reference standard applied regardless of the diagnostic test result?

Was the test (or cluster of tests) validated in a second, independent group of patients?

From Sackett et al., Evidence-based Medicine,2nd ed. (NY: Churchill Livingstone), 2000. p 68

Page 36

In what direction would these biases affect results?

Sample not representative (population referred to pedi surgery)?

Verification bias?

Double-gold standard bias?

Spectrum bias?

Page 37

For children presenting with abdominal pain to SFGH 6-M

Sensitivity probably valid (not falsely low)

– But whether all of them tried to hop is not clear

Specificity probably low

PPV is too high

NPV is too low

Does not address surgical consultation decision
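The claim that PPV is too high and NPV too low for an unreferred population follows from the dependence of predictive values on prevalence. The sketch below holds sensitivity and specificity fixed to isolate that effect; only the 78% sensitivity comes from the slides, while the specificity and both prevalence values are hypothetical.

```python
def predictive_values(sens, spec, prevalence):
    """Return (PPV, NPV) for a given sensitivity, specificity and prevalence."""
    tp = sens * prevalence
    fn = (1 - sens) * prevalence
    fp = (1 - spec) * (1 - prevalence)
    tn = spec * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


SENS = 0.78  # reported sensitivity of pain with percussion, hopping or cough
SPEC = 0.60  # hypothetical specificity, for illustration only

# Hypothetical prevalences: children referred for surgical consultation
# versus all children presenting with abdominal pain.
ppv_ref, npv_ref = predictive_values(SENS, SPEC, prevalence=0.35)
ppv_all, npv_all = predictive_values(SENS, SPEC, prevalence=0.07)

print(f"referred sample:  PPV {ppv_ref:.2f}, NPV {npv_ref:.2f}")
print(f"all presenting:   PPV {ppv_all:.2f}, NPV {npv_all:.2f}")
```

With a lower prevalence in the population of interest, the same test has a lower PPV and a higher NPV than the referred sample suggests.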

Page 38

Does this coughing patient have pertussis?*

RQ (for us): what are LR for coughing fits, whoop, and post-tussive vomiting in adults with persistent cough?

Design (for one study we reviewed**): Prospective cross-sectional study

Subjects: 217 adults ≥18 years with cough 7-21 days, no fever or other clear cause for cough enrolled by 80 French GPs.

– In a subsample from 58 GPs, of 710 who met inclusion criteria only 99 (14%) enrolled

* Cornia et al. JAMA 2010;304(8):890-896

**Gilberg S et al. J Inf Dis 2002;186:415-8

Page 39

Pertussis diagnosis

Predictor variables: “GPs interviewed patients using a standardized questionnaire.”

Outcome variable: Evidence of pertussis based on

– Culture (N=1)

– PCR (N=36)

– Or ≥ 2-fold change in anti-pertussis toxin IgG (N=40)

– Total N = 70/217 with evidence of pertussis (32%)

*Gilberg S et al. J Inf Dis 2002;186:415-8

Page 40

Results

                         Sensitivity   Specificity
Paroxysm                     93%            6%
Whoop                        16%           79%
Post-tussive vomiting        25%           79%

89% in both groups met CDC criteria for pertussis
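Since the research question asks for likelihood ratios, it may help to convert the sensitivities and specificities above into LRs and see how far a positive finding moves the probability of pertussis. This is a sketch only: it takes the study's 70/217 as the pretest probability, and the helper `post_test_probability` is introduced here for illustration.

```python
# Sensitivity and specificity of each finding, from the results table.
results = {
    "Paroxysm": (0.93, 0.06),
    "Whoop": (0.16, 0.79),
    "Post-tussive vomiting": (0.25, 0.79),
}

pretest = 70 / 217  # proportion with evidence of pertussis in the study


def post_test_probability(pretest, lr):
    """Convert a pretest probability to a post-test probability via odds x LR."""
    odds = pretest / (1 - pretest) * lr
    return odds / (1 + odds)


summary = {}
for finding, (sens, spec) in results.items():
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    summary[finding] = (lr_pos, lr_neg, post_test_probability(pretest, lr_pos))
    print(f"{finding}: LR+ {lr_pos:.2f}, LR- {lr_neg:.2f}, "
          f"P(pertussis | finding present) {summary[finding][2]:.2f}")
```

All three positive LRs land close to 1, so in this study none of the findings shifts the probability of pertussis much from the pretest value.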

Page 41

Issues

Verification (selection) bias: only 14% of eligible subjects included

Questionable gold standard

– 2-fold dilution too small

– Increase or decrease counted

– Internally inconsistent: pts with + PCR no more likely to have change in Ab titres.

Page 42

Questions?

Page 43

Additional slides

Page 44

Double Gold Standard Bias: effect of spontaneously resolving disease

                PE +    PE -
V/Q Scan +       a       b
V/Q Scan -       c       d

                                                                     Sensitivity, a/(a+c)   Specificity, d/(b+d)
Double gold standard compared with immediate invasive test for all       biased __              biased __
Double gold standard compared with follow-up for all                     biased __              biased __

Page 45

Double Gold Standard Bias: effect of newly occurring cases

                PE +    PE -
V/Q Scan +       a       b
V/Q Scan -       c       d

                                                         Sensitivity, a/(a+c)   Specificity, d/(b+d)
Double gold standard compared with PA-Gram for all           biased __              biased __
Double gold standard compared with follow-up for all         biased __              biased __

Page 46

Double Gold Standard Bias: Ultrasound diagnosis of intussusception

          Intussusception   No Intussusception
U/S +           37                  7
U/S -            3             86 + 18 = 104
Total           40                111

Sens = 37/40 = 93%

Spec = 104/111 = 94%

Page 47

          Intussusception   No Intussusception
U/S +           37                  7
U/S -            3             86 + 18 = 104
Total           40                111

Sens = 37/40 = 93%

Spec = 104/111 = 94%

          Intussusception   No Intussusception
U/S +
U/S -
Total

What if 10% of the 86 U/S- followed subjects actually had intussusceptions that resolved spontaneously?
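One way to work through this what-if is to move the hypothetically missed cases from the true-negative cell to the false-negative cell and recompute. The sketch below does that under the slide's assumption that 10% of the 86 followed U/S-negative subjects had an intussusception that resolved; it keeps the fractional count (8.6) rather than rounding, and the variable names are introduced for illustration.

```python
# Observed 2x2 table with the double gold standard.
tp, fp = 37, 7
fn = 3
tn_followed, tn_other = 86, 18  # U/S- subjects classified by follow-up / the rest

sens_obs = tp / (tp + fn)
spec_obs = (tn_followed + tn_other) / (fp + tn_followed + tn_other)

# Hypothetical: 10% of the followed U/S- subjects really had an intussusception
# that resolved spontaneously, so an immediate reference test would have
# classified them as diseased (false negatives) instead of true negatives.
resolved = 0.10 * tn_followed
fn_alt = fn + resolved
tn_alt = tn_followed + tn_other - resolved

sens_alt = tp / (tp + fn_alt)
spec_alt = tn_alt / (fp + tn_alt)

print(f"double gold standard:    sens {sens_obs:.1%}, spec {spec_obs:.1%}")
print(f"immediate test for all:  sens {sens_alt:.1%}, spec {spec_alt:.1%}")
```

Under this assumption both observed values are higher than they would have been with an immediate reference test for everyone, with the larger effect on sensitivity. This is the direction described earlier for spontaneously resolving disease.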