Ground truth generation in medical imaging: a crowdsourcing-based iterative approach
Antonio Foncubierta-Rodríguez, Henning Müller
Jul 13, 2015
Introduction
• Medical image production is growing rapidly in scientific and clinical environments
• If images are easily accessible, they can be reused for:
  • Clinical decision support
  • Training young physicians
  • Relevant document retrieval for researchers
• Modality classification improves the retrieval and accessibility of images
Motivation and dataset
• ImageCLEF dataset:
• Over 300,000 images from open access
biomedical literature
• Over 30 modalities hierarchically defined
• Manual classification is expensive and time-consuming
• How can this be done more efficiently?
Classification Hierarchy
(hierarchy reconstructed from the slide figure)
• Diagnostic
  • Radiology (conventional): Ultrasound, MRI, CT, 2D X-ray, Angiography, PET, SPECT, Infrared, Combined
  • Visible light: Gross, Skin, Organs, Endoscopy
  • Signals, waves: EEG; ECG, EKG; EMG
  • Microscopy: Light micr., Electron micr., Transmission microscope, Fluorescence, Interference, Phase contrast, Dark field
  • Reconstructions: 2D, 3D
• Graph: Tables, forms; Program listing; Statistical figures, graphs and charts; System overviews; Flowcharts; Gene sequence; Chromatography, gel; Chemical structure; Symbol; Math formulae
• Non-clinical photos
• Hand-drawn sketches
• Compound
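As a rough illustration, the hierarchy above can be encoded as a nested mapping; this is a hypothetical sketch (category grouping and names are paraphrased from the slide, not the official ImageCLEF codes):

```python
# Hypothetical sketch of the modality hierarchy as a nested dict.
# The "Generic" grouping of the non-diagnostic classes is an assumption.
HIERARCHY = {
    "Diagnostic": {
        "Radiology": ["Ultrasound", "MRI", "CT", "2D X-ray", "Angiography",
                      "PET", "SPECT", "Infrared", "Combined"],
        "Visible light": ["Gross", "Skin", "Organs", "Endoscopy"],
        "Signals, waves": ["EEG", "ECG/EKG", "EMG"],
        "Microscopy": ["Light", "Electron", "Transmission", "Fluorescence",
                       "Interference", "Phase contrast", "Dark field"],
        "Reconstructions": ["2D", "3D"],
    },
    "Generic": ["Graph", "Non-clinical photos", "Hand-drawn sketches"],
    "Compound": [],
}

def leaf_classes(tree):
    """Flatten the hierarchy into the list of leaf class labels."""
    leaves = []
    for key, value in tree.items():
        if isinstance(value, dict):
            leaves.extend(leaf_classes(value))   # recurse into subtrees
        elif value:
            leaves.extend(value)                 # list of leaf labels
        else:
            leaves.append(key)                   # node with no children is a leaf
    return leaves
```

A flat leaf list like this is what the annotation interface ultimately presents as the 30+ selectable categories.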
Image examples
• COMPOUND
• DIAGNOSTIC → Radiology → CT
• DIAGNOSTIC → Microscopy → Fluorescence
• DIAGNOSTIC → Radiology → Ultrasound
• GENERIC → Figures/Charts
• GENERIC → Table
Iterative workflow
• Avoid manual classification as much as possible
• Iterative approach:
  1. Create a small training set (manual classification into 34 categories)
  2. Use an automatic tool that learns from the training set
  3. Evaluate the results (manual verification into right/wrong categories)
  4. Improve the training set
  5. Repeat from step 2
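The five steps above can be sketched as a loop; this is a minimal illustration with hypothetical placeholders (`train`, `verify`, and the stopping rule are assumptions, not the authors' actual tools):

```python
# Minimal sketch of the iterative ground-truth workflow (steps 1-5 above).
# `train` and `verify` are hypothetical callables standing in for the
# automatic classifier and the crowd verification task, respectively.
def iterative_ground_truth(images, initial_labels, verify, train, max_rounds=5):
    """Grow a labeled set by alternating automatic classification with
    crowd verification of the predicted labels."""
    training_set = dict(initial_labels)          # step 1: small manual seed
    for _ in range(max_rounds):
        classifier = train(training_set)         # step 2: learn from training set
        predictions = {img: classifier(img) for img in images
                       if img not in training_set}
        # step 3: binary crowd task, approve or refuse each predicted label
        approved = {img: lab for img, lab in predictions.items()
                    if verify(img, lab)}
        if not approved:                         # nothing new was accepted
            break
        training_set.update(approved)            # step 4: improve training set
    return training_set                          # step 5: loop repeats from 2
```

The key saving is that only the seed set and the refused predictions require full manual classification; the rest is a cheap approve/refuse judgement.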
Crowdsourcing in medical imaging
• Crowdsourcing reduces time and cost for
annotation
• Medical image annotation is often done by
• Medical doctors
• Domain experts
• Can unknown users provide valid annotations?
• Quality?
• Speed?
User Groups
• Experiments were performed with three different user groups:
  • 1 medical doctor (MD)
  • 18 known experts
  • 2,470 contributors from open crowdsourcing
Crowdsourcing platform
• The Crowdflower platform was chosen for the experiments:
  • Integrated interface for job design
  • Complete set of management tools: gold creation, internal interface, statistics, raw data
  • Hub feature: jobs can be announced in several crowdsourcing pools:
    • Amazon MTurk
    • Get Paid
    • Zoombucks
Experiment: Initial training set generation
• Initial training set generation: 1,000 images
• Limited to the 18 known experts
• Aim: test the crowdsourcing interface
Experiment: Automated classification verification
• 300,000 images
• Binary task: approve or refuse the automatic classification
• Aim: evaluate the speed and difficulty of the verification task
Experiments: trustability
• Aim: compare the expected accuracy of the user groups
• 3,415 images were classified by the medical doctor
• The two other user groups were asked to reclassify these images
• A random subset of 1,661 images was used as gold standard
• Feedback on wrong classifications was given to the known experts, to detect ambiguities
• Feedback on 847 of the gold images was muted for the crowd
Results: user self assessment
• Users were asked to state how sure they were of their choice
• This allows discarding low-confidence data even from trusted sources
• Confidence rate:
  • Medical doctor: 100 %
  • Known experts group: 95.04 %
  • Crowd group: 85.56 %
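Discarding low-confidence answers before aggregating could look like the following sketch (the record layout and threshold are illustrative assumptions, not the platform's actual data format):

```python
# Sketch: drop annotations below a self-assessment threshold, then
# majority-vote the remaining labels. Field names are hypothetical.
from collections import Counter

def filter_and_vote(annotations, min_confidence=0.8):
    """annotations: list of dicts with 'label' and 'confidence' in [0, 1].
    Returns the majority label among trusted answers, or None if none remain."""
    trusted = [a for a in annotations if a["confidence"] >= min_confidence]
    if not trusted:
        return None  # no trusted answer for this image
    counts = Counter(a["label"] for a in trusted)
    return counts.most_common(1)[0][0]
```

Returning `None` for images with no trusted answer lets the workflow route them back for manual classification instead of accepting a doubtful label.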
Results: MD and known experts
• Agreement
  • Broad category: 88.76 %
  • Diagnostic subcategory: 97.40 %
    • Microscopy: 89.06 %
    • Radiology: 90.91 %
    • Reconstructions: 100 %
    • Visible light photography: 79.41 %
  • Conventional subcategory: 76 %
• Speed
  • MD: 85 judgements per hour
  • Experts: 66 judgements per hour per user
Results: MD and Crowd
• Agreement
  • Broad category: 85.53 %
  • Diagnostic subcategory: 85.15 %
    • Microscopy: 70.89 %
    • Radiology: 64.01 %
    • Reconstructions: 0 %
    • Visible light photography: 58.89 %
  • Conventional subcategory: 75.91 %
• Speed
  • MD: 85 judgements per hour
  • Crowd: 25 judgements per hour per user
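The per-category agreement figures on these two slides can be computed as below; this is a sketch under an assumed data layout (the MD's labels as the reference, one label per image per group):

```python
# Sketch: per-category agreement between the reference annotator (the MD)
# and a user group. `category_of` maps a label to its broad category;
# the dict layout is an assumption for illustration.
def agreement(reference, group, category_of):
    """Fraction of images where the group label matches the reference
    label, broken down by broad category."""
    per_cat = {}
    for img, ref_label in reference.items():
        cat = category_of(ref_label)
        hits, total = per_cat.get(cat, (0, 0))
        per_cat[cat] = (hits + (group.get(img) == ref_label), total + 1)
    return {cat: hits / total for cat, (hits, total) in per_cat.items()}
```

Grouping by the reference label's category (rather than the group's) means a misclassification still counts against the category the image truly belongs to.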
Results: Automatic classification verification
• Verification by experts
  • 1,000 images were verified
  • Agreement among annotators: 100 %
• Speed: users answered twice as fast as in the full classification task
Conclusions
• Iterative approach reduces amount of manual
work
• Only a small subset is fully manually annotated
• Automatic classification verification is faster
• Significant differences among user groups
• Faster crowd annotations due to the number of
contributors
• Poorer crowd annotations in the most specific
classes
• Comparable performance among user groups on broad categories
Future work
• Experiments can be redesigned to fit crowd behaviour:
  • A smaller number of (good) contributors has previously led to CAD-comparable performance
  • Selection of contributors:
    • Historical performance on the platform?
    • Selection/training phase within the job
Thanks for your attention!
Antonio Foncubierta-Rodríguez and Henning Müller. “Ground truth generation in medical imaging: A crowdsourcing-based iterative approach”, in Workshop on Crowdsourcing for Multimedia, ACM Multimedia, Nara, Japan, 2012.
Contact: [email protected]