Crowdsourcing and Citizen Science for Biology Andrew Su, Ph.D. @andrewsu [email protected] http://sulab.org February 6, 2015 UCSD Slides: slideshare.net/andrewsu
Page 1: UCSD / DBMI seminar 2015-02-6

Crowdsourcing and Citizen Science for Biology

Andrew Su, Ph.D. @andrewsu

[email protected]

http://sulab.org

February 6, 2015

UCSD

Slides: slideshare.net/andrewsu

Page 2: UCSD / DBMI seminar 2015-02-6

Few genes are well annotated…

[Figure: GO annotation counts (y-axis) for 20,473 protein-coding genes, sorted by decreasing counts (x-axis). Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Callouts on the plot: 41%, 65%. Data: NCBI, February 2013.]

Page 3: UCSD / DBMI seminar 2015-02-6

… because the literature is sparsely curated?

[Figure: Number of new PubMed-indexed articles (y-axis, 0 to 1,200,000) by year (x-axis, 1983 to 2013).]

Page 4: UCSD / DBMI seminar 2015-02-6

… because the literature is sparsely curated?

[Figure: Average capacity of human scientist (y-axis, 0 to 40) by year (x-axis, 1983 to 2013).]

Page 5: UCSD / DBMI seminar 2015-02-6

311,696 articles (1.5% of PubMed) have been cited by GO annotations

Page 6: UCSD / DBMI seminar 2015-02-6

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

Page 7: UCSD / DBMI seminar 2015-02-6

The Long Tail is a prolific source of content

[Figure: content produced (y-axis) versus contributors, sorted (x-axis), divided into a Short Head and a Long Tail.]

                  Short Head         Long Tail
News:             Newspapers         Blogs
Video:            TV/Hollywood       YouTube
Product reviews:  Consumer reports   Amazon reviews
Food reviews:     Food critics       Yelp
Talent judging:   Olympics           American Idol

Page 8: UCSD / DBMI seminar 2015-02-6

Wikipedia is reasonably accurate

Page 9: UCSD / DBMI seminar 2015-02-6

Wikipedia has breadth and depth

[Figure: Articles and Words (millions), Wikipedia versus Britannica Online.]

http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008

Page 10: UCSD / DBMI seminar 2015-02-6

We can harness the Long Tail of scientists to directly participate in the gene annotation process.

Page 11: UCSD / DBMI seminar 2015-02-6

From crowdsourcing to structured data

The Gene Wiki

Mark2Cure

Page 12: UCSD / DBMI seminar 2015-02-6

Filtering, extracting, and summarizing PubMed

Documents

Concepts Review article

Page 13: UCSD / DBMI seminar 2015-02-6

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Page 14: UCSD / DBMI seminar 2015-02-6

Wiki success depends on a positive feedback

Gene wiki page utility

Number of

users

Number of

contributors

1001

2002

Page 15: UCSD / DBMI seminar 2015-02-6

10,000 gene “stubs” within Wikipedia

• Protein structure
• Symbols and identifiers
• Tissue expression pattern
• Gene Ontology annotations
• Links to structured databases
• Gene summary
• Protein interactions
• Linked references

Huss, PLoS Biol, 2008

Utility

Users

Contributors

Page 16: UCSD / DBMI seminar 2015-02-6

Gene Wiki has a critical mass of readers

Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

Utility

Users

Contributors

Page 17: UCSD / DBMI seminar 2015-02-6

Gene Wiki has a critical mass of editors

Increase of ~10,000 words / month from >1,000 edits

Currently 1.42 million words

Approximately equal to 230 full-length articles

Good, NAR, 2011

Utility

Users

Contributors

[Figure axes: Editor count (Editors), Edit count (Edits).]

Page 18: UCSD / DBMI seminar 2015-02-6

A review article for every gene is powerful

References to the literature

Hyperlinks to related concepts

Reelin: 98 editors, 703 edits since July 2002

Heparin: 358 editors, 654 edits since June 2003

AMPK: 109 editors, 203 edits since March 2004

RNAi: 394 editors, 994 edits since October 2002

Page 19: UCSD / DBMI seminar 2015-02-6

Making the Gene Wiki more computable

Structured annotations

Free text

Analyses

Text-mining

Page 20: UCSD / DBMI seminar 2015-02-6

Making the Gene Wiki more computable

Structured annotations

Free text

Analyses

Text-mining

http://fiehnlab.ucdavis.edu/projects/rice_metabolome/

Page 21: UCSD / DBMI seminar 2015-02-6

Making the Gene Wiki more computable

Structured annotations

Free text

Analyses

Text-mining

Page 22: UCSD / DBMI seminar 2015-02-6

Making the Gene Wiki more computable

Structured annotations

Free text

Databases

Page 23: UCSD / DBMI seminar 2015-02-6

Making the Gene Wiki more computable

Structured annotations

Free text

Page 24: UCSD / DBMI seminar 2015-02-6

Making the Gene Wiki more computable

Structured annotations

Free text

Page 25: UCSD / DBMI seminar 2015-02-6

Wikidata

Provide a database of the world’s knowledge that anyone can edit

- Denny Vrandečić

Page 26: UCSD / DBMI seminar 2015-02-6

Centralizing key data storage

Source: http://commons.wikimedia.org/wiki/File:Wikidata_slides_Magnus_Manske,_Cambridge,_2014-02-27.pdf

Page 27: UCSD / DBMI seminar 2015-02-6

Centralizing key data storage

Page 28: UCSD / DBMI seminar 2015-02-6

Centralizing key data storage

Page 29: UCSD / DBMI seminar 2015-02-6

Centralizing key data storage

287 language editions of Wikipedia

Bioinformatics community

Page 30: UCSD / DBMI seminar 2015-02-6

Loading biological data into Wikidata

Sources: Entrez Gene, Ensembl, UniProt, UCSC, PDB, RefSeq

Page 31: UCSD / DBMI seminar 2015-02-6

Wikidata for biology

Item: Reelin (Q414043), http://www.wikidata.org/wiki/Q414043

Relations shown: is a (Property:P31), regulates (Property:P128), interacts with (Property:P129)

Related concepts shown: Protein, Glycoprotein, Neural development, VLDL receptor, Amyloid precursor protein

Item identifiers shown: Q8054, Q187126, Q1345738, Q1979313, Q423510, Q414043

Page 32: UCSD / DBMI seminar 2015-02-6

Wikidata for biology

Property:P31

Property:P128

Property:P129

Q8054

Q187126

Q1345738

Q1979313

Q423510

Q414043

http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
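
The API call above can be reproduced programmatically. The following is a minimal sketch, not the speaker's code: it builds the same `wbgetentities` request (with `format=json` added so the reply is JSON) and reads the targets of one property out of the reply. The helper names `entity_url`, `fetch_entity`, and `claim_targets`, the `https://www.wikidata.org` host, and the User-Agent string are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://www.wikidata.org/w/api.php"


def entity_url(qid, language="en"):
    """Build a wbgetentities request URL like the one on the slide."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "languages": language,
        "format": "json",  # added here so the reply is machine-readable JSON
    }
    return API_URL + "?" + urlencode(params)


def fetch_entity(qid, language="en"):
    """Download the entity record (needs network access)."""
    # The User-Agent value is a placeholder; replace it with your own.
    request = Request(entity_url(qid, language),
                      headers={"User-Agent": "gene-wiki-example/0.1"})
    with urlopen(request) as response:
        return json.load(response)


def claim_targets(record, qid, prop):
    """List the item IDs that `qid` links to through property `prop`."""
    claims = record["entities"][qid].get("claims", {}).get(prop, [])
    targets = []
    for claim in claims:
        snak = claim["mainsnak"]
        if snak.get("snaktype") != "value":
            continue  # skip "no value" / "unknown value" statements
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "id" in value:
            targets.append(value["id"])  # only item-valued statements
    return targets
```

With network access, `claim_targets(fetch_entity("Q414043"), "Q414043", "P31")` would list the items that Reelin is linked to by the "is a" property; the result depends on the current state of Wikidata.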

Page 33: UCSD / DBMI seminar 2015-02-6

Current progress

• All human and mouse genes and proteins loaded
• All diseases (Human Disease Ontology) loaded
• Dataset of all drugs in preparation
• Datasets for gene-disease, drug-disease, and drug-protein relationships in preparation

Page 34: UCSD / DBMI seminar 2015-02-6

The Long Tail of scientists is a valuable source of information on gene function

Page 35: UCSD / DBMI seminar 2015-02-6

From crowdsourcing to structured data

The Gene Wiki

Mark2Cure

Page 36: UCSD / DBMI seminar 2015-02-6

The biomedical literature is growing fast…

[Figure: Number of new PubMed-indexed articles (y-axis, 0 to 1,200,000) by year (x-axis, 1983 to 2013).]

Page 37: UCSD / DBMI seminar 2015-02-6

… but it is very hard to query and compute

Page 38: UCSD / DBMI seminar 2015-02-6

… but it is very hard to query and compute

Imatinib

Crizotinib

Erlotinib

Gefitinib

Sorafenib

Lapatinib

Dasatinib

Acute myeloid leukemia

Acute lymphoblastic leukemia

Chronic myelogenous leukemia

Chronic lymphocytic leukemia

Hodgkin lymphoma

Non-Hodgkin lymphoma

Myeloma

AND
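
The two lists joined by AND read like a boolean literature search. As a rough sketch of what such a query looks like when written out, the snippet below joins the terms within each list with OR (an assumption; the slide only shows the AND between the lists) and quotes each term as a phrase. The function names are hypothetical.

```python
def or_group(terms):
    """Quote each term as a phrase and join the group with OR."""
    return "(" + " OR ".join('"%s"' % term for term in terms) + ")"


def build_query(drugs, diseases):
    """Any of the drugs AND any of the diseases."""
    return or_group(drugs) + " AND " + or_group(diseases)


drugs = ["Imatinib", "Crizotinib", "Erlotinib", "Gefitinib",
         "Sorafenib", "Lapatinib", "Dasatinib"]
diseases = ["Acute myeloid leukemia", "Acute lymphoblastic leukemia",
            "Chronic myelogenous leukemia", "Chronic lymphocytic leukemia",
            "Hodgkin lymphoma", "Non-Hodgkin lymphoma", "Myeloma"]

query = build_query(drugs, diseases)
print(query)
```

Even this small example expands to 7 x 7 = 49 drug and disease pairings, and a keyword match says nothing about how a drug and a disease are related in each article, which is the difficulty the slide points at.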

Page 39: UCSD / DBMI seminar 2015-02-6

Information Extraction

1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts

Page 40: UCSD / DBMI seminar 2015-02-6

Disease mentions in PubMed abstracts

NCBI Disease corpus

• 793 PubMed abstracts (100 development, 593 training, 100 test)
• 12 expert annotators (2 annotate each abstract)

6,900 “disease” mentions

Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.

Page 41: UCSD / DBMI seminar 2015-02-6

Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?

Page 42: UCSD / DBMI seminar 2015-02-6

The Mechanical Turk

http://en.wikipedia.org/wiki/The_Turk

Page 43: UCSD / DBMI seminar 2015-02-6

The Mechanical Turk

http://en.wikipedia.org/wiki/The_Turk

Page 44: UCSD / DBMI seminar 2015-02-6

Amazon Mechanical Turk (AMT)

Requester

For each task, specify:
• a qualification test
• how many workers per task
• how much we will pay per task

Amazon

Manages:
• parallel execution of jobs
• worker access to tasks via qualification tests
• payments
• task advertising

Workers

1. Create tasks
2. Execute
3. Aggregate

Page 45: UCSD / DBMI seminar 2015-02-6

Instructions to workers

• Highlight all diseases and disease abbreviations
– “...are associated with Huntington disease ( HD )... HD patients received...”
– “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
– “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans.
– “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms - physical results of having a disease
– “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”

Page 46: UCSD / DBMI seminar 2015-02-6

Qualification test

Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”

Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.”

Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”

26 yes / no questions

Page 47: UCSD / DBMI seminar 2015-02-6

Qualification test results

[Figure: workers' qualification test results, with the threshold for passing marked. Labels: Workers, qualified workers.]

33/194 passed (17%)

Page 48: UCSD / DBMI seminar 2015-02-6

Simple annotation interface

Click to see instructions

Highlight disease mentions

Page 49: UCSD / DBMI seminar 2015-02-6

Experimental design

• Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus
– $0.06 per Human Intelligence Task (HIT)
– HIT = annotate one abstract from PubMed
– 5 workers annotate each abstract

Page 50: UCSD / DBMI seminar 2015-02-6

Aggregation function based on simple voting

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

[Figure: the passage above shown with worker highlights, then repeated for each vote threshold: 1 or more votes (K=1), K=2, K=3, K=4. Legend scale: 0 to 5.]
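
The simple-voting aggregation on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the code used in the study: each worker's highlights are represented as a set of (start, end) character offsets, a vote is counted only when two workers mark exactly the same span, and the function name and the example offsets are made up.

```python
from collections import Counter


def aggregate_by_votes(worker_spans, k):
    """Keep every span highlighted by at least k workers.

    worker_spans: one collection of (start, end) offsets per worker.
    """
    votes = Counter()
    for spans in worker_spans:
        votes.update(set(spans))  # one vote per worker per span
    return {span for span, count in votes.items() if count >= k}


# Hypothetical highlights from 5 workers on one abstract.
workers = [
    {(0, 5), (10, 20)},
    {(0, 5)},
    {(0, 5), (10, 20), (30, 35)},
    {(10, 20)},
    {(0, 5)},
]

for k in range(1, 6):
    print(k, sorted(aggregate_by_votes(workers, k)))
```

Raising K trades recall for precision: K=1 keeps anything any worker marked, while higher thresholds keep only spans that several workers agree on.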

Page 51: UCSD / DBMI seminar 2015-02-6

Comparison to gold standard

F = 0.81, k = 2

• 593 documents
• 5 users / doc
• 7 days
• $192.90

[Figure axes: Precision, Recall]

Page 52: UCSD / DBMI seminar 2015-02-6

Comparison to gold standard

F = 0.87, k = 6

• 593 documents
• 15 users / doc
• 9 days
• $630.96

[Figure axes: Precision, Recall]
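
The slides report F without defining it. A minimal sketch of the usual computation follows, assuming F is the balanced F-score (harmonic mean of precision and recall) and that a predicted span counts as correct only when it matches a gold-standard span exactly; both are assumptions, and the example spans are made up.

```python
def precision_recall_f(predicted, gold):
    """Score predicted spans against gold-standard spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical spans: 3 predicted, 4 in the gold standard, 2 in common.
predicted = {(0, 5), (10, 20), (30, 35)}
gold = {(0, 5), (10, 20), (40, 50), (60, 70)}
p, r, f = precision_recall_f(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Sweeping the vote threshold k and scoring each aggregated result this way gives one precision and recall point per k, which is how a precision versus recall curve like the one on the slide can be traced.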

Page 53: UCSD / DBMI seminar 2015-02-6

Comparison to gold standard

[Figure: Maximum F-score (y-axis, 0 to 1) versus workers per document (x-axis, 0 to 16).]

Page 54: UCSD / DBMI seminar 2015-02-6

Comparisons to text-mining algorithms

[Figure: F score for text-mining tools (BANNER, NCBO Annotator) and Mechanical Turk.]

Page 55: UCSD / DBMI seminar 2015-02-6

Comparisons to human annotators

Average level of agreement between expert annotators (stage 1)

F = 0.76

Page 56: UCSD / DBMI seminar 2015-02-6

Comparisons to human annotators

F = 0.76

F = 0.87

Average level of agreement between expert annotators (stage 2)

Page 57: UCSD / DBMI seminar 2015-02-6

In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition.

Page 58: UCSD / DBMI seminar 2015-02-6

Information Extraction

1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts

Page 59: UCSD / DBMI seminar 2015-02-6

Annotating the relationships

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

[Figure labels: therapeutic target; subject, predicate, object; GENE; DISEASE]

Page 60: UCSD / DBMI seminar 2015-02-6

Does Mechanical Turk scale?

1,000,000 articles per year

10 annotators / article

4 tasks / doc

$0.06 / task

$ 2,400,000 / year
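
The yearly figure is the product of the four numbers above it. A small sketch of the arithmetic (the function name is hypothetical, and the price is kept in whole cents to avoid floating-point rounding):

```python
def yearly_cost_dollars(articles_per_year, annotators_per_article,
                        tasks_per_doc, cents_per_task):
    """Total yearly payment to workers, in dollars."""
    total_cents = (articles_per_year * annotators_per_article
                   * tasks_per_doc * cents_per_task)
    return total_cents / 100


cost = yearly_cost_dollars(1_000_000, 10, 4, 6)
print(f"${cost:,.0f} / year")  # prints "$2,400,000 / year"
```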

Page 61: UCSD / DBMI seminar 2015-02-6

http://mark2cure.org

Page 62: UCSD / DBMI seminar 2015-02-6

Key stats

• Launched Jan 19, 2015

• In 2.5 weeks

– 1984 document annotations

– 80 unique users

– 22% complete

[Figure axis: Document annotations]

Page 63: UCSD / DBMI seminar 2015-02-6

The Long Tail of citizen scientists can collaboratively annotate biomedical text.

Page 64: UCSD / DBMI seminar 2015-02-6

Ben Good

Andra Waagmeester

Lynn Schriml, U Maryland

Elvira Mitraka, U Maryland

Gang Fu, NCBI

Evan Bolton, NCBI

Paul Pavlidis, U British Columbia

Peter Robinson, Charite

Many Wikipedia and Wikidata editors

WP:MCB Project

Gene Wiki / Wikidata

Ramya Gamini

Louis Gioia

Salvatore Loguercio

Adam Mark

Erick Scott

Greg Stupp

Kevin Xin

Other Group members

Funding and Support

BioGPS: GM83924

Gene Wiki: GM089820

BD2K COE: GM114833

Contact

http://sulab.org

[email protected]

@andrewsu

+Andrew Su

Mark2Cure

Ben Good

Max Nanis

Ginger Tsueng

Chunlei Wu

Next slide!

Page 65: UCSD / DBMI seminar 2015-02-6

Why do I Mark2Cure?

I am retired, have a doctorate in medical humanities, and have two children with Gaucher disease. I am just looking for some way to put my education to use. Sounds like a perfect situation for me.

My 4 year old daughter Phoebe is living with and battling rare disease.

I have Ehlers Danlos Syndrome. I hope to help people learn about this painful and debilitating disorder, so that others like me can receive more effective medical care.

Take part in something that helps humanity.

I Mark2Cure in memory of my son Mike who had type 1 diabetes.

Studied biology in college and I really miss it!

In memory of my daughter who had Cystic Fibrosis

Give back

Page 66: UCSD / DBMI seminar 2015-02-6

Worker demographics: gender

First HIT was a survey

Page 67: UCSD / DBMI seminar 2015-02-6

Age

Page 68: UCSD / DBMI seminar 2015-02-6

Occupation

Page 69: UCSD / DBMI seminar 2015-02-6

Education

Page 70: UCSD / DBMI seminar 2015-02-6

Why?