Ground truth generation in medical imaging: a crowdsourcing-based iterative approach
Antonio Foncubierta-Rodríguez, Henning Müller
Jul 13, 2015
Introduction
• Medical image production is growing rapidly in scientific and clinical environments
• If images are easily accessible, they can be reused for:
  • Clinical decision support
  • Training young physicians
  • Relevant document retrieval for researchers
• Modality classification improves the retrieval and accessibility of images
Motivation and dataset
• ImageCLEF dataset:
• Over 300,000 images from open access
biomedical literature
• Over 30 modalities hierarchically defined
• Manual classification is expensive and time-consuming
• How can this be done more efficiently?
Classification Hierarchy
(hierarchy reconstructed from the slide figure)
• Diagnostic
  • Radiology (conventional): Ultrasound, MRI, CT, 2D X-ray, Angiography, PET, SPECT, Infrared, Combined
  • Visible light: Gross, Skin, Organs, Endoscopy
  • Signals, waves: EEG; ECG, EKG; EMG
  • Microscopy: Light micr., Electron micr., Transmission microscope, Fluorescence, Interference, Phase contrast, Dark field
  • Reconstructions: 2D, 3D
• Graph: Tables, forms; Program listing; Statistical figures, graphs and charts; System overviews; Flowcharts; Gene sequence; Chromatography, gel; Chemical structure; Symbol; Math formulae
• Non-clinical photos
• Hand-drawn sketches
• Compound
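As a rough illustration, the hierarchy above can be encoded as a nested mapping; this is a hypothetical sketch (category grouping and names are paraphrased from the slide, not the official ImageCLEF codes):

```python
# Hypothetical sketch of the modality hierarchy as a nested dict.
# The "Generic" grouping of the non-diagnostic classes is an assumption.
HIERARCHY = {
    "Diagnostic": {
        "Radiology": ["Ultrasound", "MRI", "CT", "2D X-ray", "Angiography",
                      "PET", "SPECT", "Infrared", "Combined"],
        "Visible light": ["Gross", "Skin", "Organs", "Endoscopy"],
        "Signals, waves": ["EEG", "ECG/EKG", "EMG"],
        "Microscopy": ["Light", "Electron", "Transmission", "Fluorescence",
                       "Interference", "Phase contrast", "Dark field"],
        "Reconstructions": ["2D", "3D"],
    },
    "Generic": ["Graph", "Non-clinical photos", "Hand-drawn sketches"],
    "Compound": [],
}

def leaf_classes(tree):
    """Flatten the hierarchy into the list of leaf class labels."""
    leaves = []
    for key, value in tree.items():
        if isinstance(value, dict):
            leaves.extend(leaf_classes(value))   # recurse into subtrees
        elif value:
            leaves.extend(value)                 # list of leaf labels
        else:
            leaves.append(key)                   # node with no children is a leaf
    return leaves
```

A flat leaf list like this is what the annotation interface ultimately presents as the 30+ selectable categories.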
Image examples
• COMPOUND
• DIAGNOSTIC → Radiology → CT
• DIAGNOSTIC → Microscopy → Fluorescence
• DIAGNOSTIC → Radiology → Ultrasound
• GENERIC → Figures/Charts
• GENERIC → Table
Iterative workflow
• Avoid manual classification as much as possible
• Iterative approach:
  1. Create a small training set (manual classification into 34 categories)
  2. Use an automatic tool that learns from the training set
  3. Evaluate the results (manual verification into right/wrong categories)
  4. Improve the training set
  5. Repeat from step 2
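The five steps above can be sketched as a loop; this is a minimal illustration with hypothetical placeholders (`train`, `verify`, and the stopping rule are assumptions, not the authors' actual tools):

```python
# Minimal sketch of the iterative ground-truth workflow (steps 1-5 above).
# `train` and `verify` are hypothetical callables standing in for the
# automatic classifier and the crowd verification task, respectively.
def iterative_ground_truth(images, initial_labels, verify, train, max_rounds=5):
    """Grow a labeled set by alternating automatic classification with
    crowd verification of the predicted labels."""
    training_set = dict(initial_labels)          # step 1: small manual seed
    for _ in range(max_rounds):
        classifier = train(training_set)         # step 2: learn from training set
        predictions = {img: classifier(img) for img in images
                       if img not in training_set}
        # step 3: binary crowd task, approve or refuse each predicted label
        approved = {img: lab for img, lab in predictions.items()
                    if verify(img, lab)}
        if not approved:                         # nothing new was accepted
            break
        training_set.update(approved)            # step 4: improve training set
    return training_set                          # step 5: loop repeats from 2
```

The key saving is that only the seed set and the refused predictions require full manual classification; the rest is a cheap approve/refuse judgement.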
Crowdsourcing in medical imaging
• Crowdsourcing reduces time and cost for
annotation
• Medical image annotation is often done by
• Medical doctors
• Domain experts
• Can unknown users provide valid annotations?
• Quality?
• Speed?
User Groups
• Experiments were performed with three different user groups:
  • 1 medical doctor (MD)
  • 18 known experts
  • 2,470 contributors from open crowdsourcing
Crowdsourcing platform
• The Crowdflower platform was chosen for the experiments:
  • Integrated interface for job design
  • Complete set of management tools: gold creation, internal interface, statistics, raw data
  • Hub feature: jobs can be announced in several crowdsourcing pools:
    • Amazon MTurk
    • Get Paid
    • Zoombucks
Experiment: Initial training set generation
• Initial training set generation: 1,000 images
• Limited to the 18 known experts
• Aim: test the crowdsourcing interface
Experiment: Automated classification verification
• 300,000 images
• Binary task: approve or refuse the automatic classification
• Aim: evaluate the speed and difficulty of the verification task
Experiments: trustability
• Aim: compare the expected accuracy of the user groups
• 3,415 images were classified by the medical doctor
• The two other user groups were asked to reclassify these images
• A random subset of 1,661 images was used as gold standard
• Feedback on wrong classifications was given to the known experts, to detect ambiguities
• Feedback on 847 of the gold images was muted for the crowd
Results: user self assessment
• Users were asked to state how sure they were of their choice
• This allows discarding low-confidence data even from trusted sources
• Confidence rate:
  • Medical doctor: 100 %
  • Known experts group: 95.04 %
  • Crowd group: 85.56 %
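Discarding low-confidence answers before aggregating could look like the following sketch (the record layout and threshold are illustrative assumptions, not the platform's actual data format):

```python
# Sketch: drop annotations below a self-assessment threshold, then
# majority-vote the remaining labels. Field names are hypothetical.
from collections import Counter

def filter_and_vote(annotations, min_confidence=0.8):
    """annotations: list of dicts with 'label' and 'confidence' in [0, 1].
    Returns the majority label among trusted answers, or None if none remain."""
    trusted = [a for a in annotations if a["confidence"] >= min_confidence]
    if not trusted:
        return None  # no trusted answer for this image
    counts = Counter(a["label"] for a in trusted)
    return counts.most_common(1)[0][0]
```

Returning `None` for images with no trusted answer lets the workflow route them back for manual classification instead of accepting a doubtful label.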
Results: MD and known experts
• Agreement
  • Broad category: 88.76 %
  • Diagnostic subcategory: 97.40 %
    • Microscopy: 89.06 %
    • Radiology: 90.91 %
    • Reconstructions: 100 %
    • Visible light photography: 79.41 %
  • Conventional subcategory: 76 %
• Speed
  • MD: 85 judgements per hour
  • Experts: 66 judgements per hour per user
Results: MD and Crowd
• Agreement
  • Broad category: 85.53 %
  • Diagnostic subcategory: 85.15 %
    • Microscopy: 70.89 %
    • Radiology: 64.01 %
    • Reconstructions: 0 %
    • Visible light photography: 58.89 %
  • Conventional subcategory: 75.91 %
• Speed
  • MD: 85 judgements per hour
  • Crowd: 25 judgements per hour per user
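The per-category agreement figures on these two slides can be computed as below; this is a sketch under an assumed data layout (the MD's labels as the reference, one label per image per group):

```python
# Sketch: per-category agreement between the reference annotator (the MD)
# and a user group. `category_of` maps a label to its broad category;
# the dict layout is an assumption for illustration.
def agreement(reference, group, category_of):
    """Fraction of images where the group label matches the reference
    label, broken down by broad category."""
    per_cat = {}
    for img, ref_label in reference.items():
        cat = category_of(ref_label)
        hits, total = per_cat.get(cat, (0, 0))
        per_cat[cat] = (hits + (group.get(img) == ref_label), total + 1)
    return {cat: hits / total for cat, (hits, total) in per_cat.items()}
```

Grouping by the reference label's category (rather than the group's) means a misclassification still counts against the category the image truly belongs to.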
Results: Automatic classification verification
• Verification by experts
  • 1,000 images were verified
  • Agreement among annotators: 100 %
• Speed: users answered twice as fast as in the full classification task
Conclusions
• Iterative approach reduces amount of manual
work
• Only a small subset is fully manually annotated
• Automatic classification verification is faster
• Significant differences among user groups
• Faster crowd annotations due to the number of
contributors
• Poorer crowd annotations in the most specific
classes
• Comparable performance among user groups on broad categories
Future work
• Experiments can be redesigned to fit crowd behaviour:
  • A smaller number of (good) contributors has previously led to CAD-comparable performance
  • Selection of contributors:
    • Historical performance on the platform?
    • Selection/training phase within the job
Thanks for your attention!
Antonio Foncubierta-Rodríguez and Henning Müller. “Ground truth generation in medical imaging: A crowdsourcing-based iterative approach”, in Workshop on Crowdsourcing for Multimedia, ACM Multimedia, Nara, Japan, 2012.
Contact: [email protected]