Open biomedical knowledge using crowdsourcing and citizen science
Andrew Su, Ph.D. @andrewsu http://sulab.org
November 5, 2015, UCSD
Slides: slideshare.net/andrewsu
Candidate genes
FLNB
CTNNB1
EPHA3
SMAD3
XPO1
RPS27
FLCN
ATR
FLT3
BRD2
ERG
RAF1
EGFR
ERBB4
RARA
JAK3
LRP1
WT1
PML
SMARCA4
…
Candidate variants
chr1:g.156084782C>G
chr6:g.31911991G>T
chr19:g.3767338C>T
chr19:g.3783925C>T
chr7:g.552021G>A
chr3:g.123005609G>T
…
Biology is an INFORMATION science
Pietro Bellini https://flic.kr/p/k5jmja
Prioritization of human genetic variants
1000s of genetic variants
< 10 candidate genes
Filters
- Variant type
- Allele frequencies
- Previous clinical observation
- Predicted functional effects
- Gene function
- …
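The filter list above can be sketched as a chain of predicates applied in sequence. This is only an illustration, not the pipeline used in the talk: the `Variant` fields, the thresholds, the gene labels and the per-variant annotations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Variant:
    hgvs: str                 # genomic HGVS id, e.g. "chr1:g.156084782C>G"
    gene: str                 # hypothetical gene label
    variant_type: str         # e.g. "missense", "synonymous"
    allele_frequency: float   # population allele frequency in [0, 1]
    predicted_damaging: bool  # stand-in for a functional-effect predictor


Filter = Callable[[Variant], bool]


def apply_filters(variants: Iterable[Variant], filters: List[Filter]) -> List[Variant]:
    """Keep only the variants that pass every filter, applied in order."""
    kept = list(variants)
    for keep in filters:
        kept = [v for v in kept if keep(v)]
    return kept


# Hypothetical thresholds, chosen only to make the example concrete.
FILTERS: List[Filter] = [
    lambda v: v.variant_type != "synonymous",  # variant type
    lambda v: v.allele_frequency < 0.01,       # allele frequency
    lambda v: v.predicted_damaging,            # predicted functional effect
]

# The HGVS ids come from the candidate list; every annotation is made up.
EXAMPLE = [
    Variant("chr1:g.156084782C>G", "GENE_A", "missense", 0.0001, True),
    Variant("chr6:g.31911991G>T", "GENE_B", "synonymous", 0.0002, False),
    Variant("chr19:g.3767338C>T", "GENE_C", "missense", 0.2, True),
]

candidates = apply_filters(EXAMPLE, FILTERS)
candidate_genes = sorted({v.gene for v in candidates})
```

Keeping each filter as a separate predicate makes it easy to report how many variants survive each step.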
Data integration as a cottage industry
dbNSFP
Data integration as hardened community software
dbNSFP
MyVariant.info
MyGene.info for integrating gene annotations
Gene
MyGene.info
MyGene.info for integrating gene annotations
http://mygene.info/metadata
Current version history
Current stats
MyGene.info for integrating gene annotations
[Figure: histogram of request time (ms) against frequency for the gene annotation service (/v2/gene)]
MyGene.info for integrating gene annotations
2 to 3M requests per month
MyGene.info for integrating gene annotations
2015 – 2018
Bioinformatician-friendly JSON output, REST API
http://MyGene.info/v2/gene/7157
http://MyVariant.info/v1/variant/chr7:g.55241707G>T
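A minimal sketch of calling these two endpoints from Python with only the standard library. The URL patterns and version prefixes (`v2`, `v1`) are the ones on the slide and may have changed since 2015. The helper names are hypothetical, and no response fields are assumed.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URLs as shown on the slide (2015); newer API versions may exist.
MYGENE_BASE = "http://mygene.info/v2"
MYVARIANT_BASE = "http://myvariant.info/v1"


def gene_url(entrez_id: int) -> str:
    """URL of the gene annotation object for an Entrez Gene id."""
    return f"{MYGENE_BASE}/gene/{entrez_id}"


def variant_url(hgvs_id: str) -> str:
    """URL of the variant annotation object for a genomic HGVS id.

    The '>' character is percent-encoded; ':' and '.' are kept readable.
    """
    return f"{MYVARIANT_BASE}/variant/{quote(hgvs_id, safe=':.')}"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Fetch a URL and decode the JSON body (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_json(gene_url(7157))` would retrieve the JSON document for the gene on the slide, provided the service is reachable.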
Variant and gene prioritization
[Figure: successive counts shown on the slide: 2441, 2308, 1917, 18, 9, 5]
https://github.com/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_R_miller.ipynb
Open biomedical knowledge
MyVariant.info MyGene.info
Integration of molecular biology databases via high performance APIs
Biomedical Linked Open Data
The Gene Wiki project
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Huss, PLoS Biol, 2008
Wikidata
Provide a database of the world’s knowledge that anyone can edit
- Denny Vrandečić
Centralizing key data storage
Source: http://commons.wikimedia.org/wiki/File:Wikidata_slides_Magnus_Manske,_Cambridge,_2014-02-27.pdf
Loading biological data into Wikidata
Entrez Gene
Ensembl
UniProt
UCSC
PDB
RefSeq
Wikidata for biology
Reelin (Q414043) http://www.wikidata.org/wiki/Q414043
- is a (Property:P31): Protein, Glycoprotein
- regulates (Property:P128): Neural development
- interacts with (Property:P129): VLDL receptor, Amyloid precursor protein
Item identifiers shown on the slide: Q8054, Q187126, Q1345738, Q1979313, Q423510
Wikidata for biology
Properties: P31, P128, P129
Items: Q8054, Q187126, Q1345738, Q1979313, Q423510, Q414043
http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
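A rough sketch of working with the `wbgetentities` call above. The `format=json` parameter and the `https` scheme are additions not shown on the slide. `SAMPLE_ENTITY` is a hand-written, heavily trimmed stand-in for one entity in a response, and its pairing of properties with items is illustrative only.

```python
from urllib.parse import urlencode


def wbgetentities_url(item_id: str, language: str = "en") -> str:
    """Build a wbgetentities request URL for one Wikidata item."""
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "languages": language,
        "format": "json",  # added here; the slide URL omits it
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)


def item_values(entity: dict, property_id: str) -> list:
    """Return the item ids (Q...) that an entity states for one property."""
    values = []
    for claim in entity.get("claims", {}).get(property_id, []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            values.append(value["id"])
    return values


def _claim(item_id: str) -> dict:
    """Helper for the sample below: one trimmed item-valued claim."""
    return {"mainsnak": {"datavalue": {"value": {"id": item_id}}}}


# Hand-written illustration, NOT a real API response.
SAMPLE_ENTITY = {
    "id": "Q414043",
    "claims": {
        "P31": [_claim("Q8054"), _claim("Q187126")],
        "P128": [_claim("Q1345738")],
        "P129": [_claim("Q1979313"), _claim("Q423510")],
    },
}
```

With a live response, the entity to pass to `item_values` would be found under `response["entities"]["Q414043"]`.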
~150k genes and proteins
~2k FDA-approved drugs
~7k human diseases
Centralizing key data storage
287 language editions of Wikipedia
Bioinformatics community
Toxicology community
Epidemiology community
…
Open biomedical knowledge
Free text to structured data
MyVariant.info MyGene.info
Integration of molecular biology databases via high performance APIs
Biomedical Linked Open Data
The biomedical literature is massive…
[Figure: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000]
… but it is very hard to query and compute
Imatinib
Crizotinib
Erlotinib
Gefitinib
Sorafenib
Lapatinib
Dasatinib
…
AND
Acute myeloid leukemia
Acute lymphoblastic leukemia
Chronic myelogenous leukemia
Chronic lymphocytic leukemia
Hodgkin lymphoma
Non-Hodgkin lymphoma
Myeloma
…
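As a small illustration of the query on this slide, a keyword search would combine the two lists as one boolean expression. The function names are hypothetical, and the quoting style is a generic one, not a claim about any particular search engine's syntax. Such a query only finds articles that mention both a drug and a disease; it says nothing about how the two are related.

```python
from typing import Sequence


def or_group(terms: Sequence[str]) -> str:
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def drug_disease_query(drugs: Sequence[str], diseases: Sequence[str]) -> str:
    """Require at least one drug AND at least one disease."""
    return f"{or_group(drugs)} AND {or_group(diseases)}"


query = drug_disease_query(
    ["Imatinib", "Crizotinib"],
    ["Acute myeloid leukemia", "Myeloma"],
)
```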
The Network of BioThings
1. Identify biomedical concepts in text
… We report a case of familial systemic mastocytosis with the rare KIT K509I germ line mutation. In vitro treatment with imatinib, dasatinib and PKC412 reduced cell viability of primary mast cells harboring KIT K509I mutation. Both patients with familial systemic mastocytosis had remarkable hematological and skin improvement after three months of imatinib treatment.
Leuk Res. 2014 Oct;38(10):1245-51. doi: 10.1016/j.leukres.
GENES
DISEASES
DRUGS
VARIANTS
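Step 1 can be illustrated with plain dictionary matching over the abstract above. This is a deliberately naive sketch, not the method used in the talk: the lexicon is hand-written for this one passage, and case-insensitive matching would mislabel an ordinary word such as "kit" in other texts.

```python
import re
from typing import List, NamedTuple


class Mention(NamedTuple):
    start: int
    end: int
    text: str
    concept_type: str


# Hand-written lexicon covering only this passage (an assumption).
LEXICON = {
    "KIT": "GENE",
    "K509I": "VARIANT",
    "familial systemic mastocytosis": "DISEASE",
    "imatinib": "DRUG",
    "dasatinib": "DRUG",
    "PKC412": "DRUG",
}

ABSTRACT = (
    "We report a case of familial systemic mastocytosis with the rare KIT "
    "K509I germ line mutation. In vitro treatment with imatinib, dasatinib "
    "and PKC412 reduced cell viability of primary mast cells harboring KIT "
    "K509I mutation. Both patients with familial systemic mastocytosis had "
    "remarkable hematological and skin improvement after three months of "
    "imatinib treatment."
)


def tag_concepts(text: str) -> List[Mention]:
    """Return every lexicon hit in the text, ordered by position."""
    mentions = []
    for term, concept_type in LEXICON.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            mentions.append(
                Mention(match.start(), match.end(), match.group(0), concept_type)
            )
    return sorted(mentions)


mentions = tag_concepts(ABSTRACT)
```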
The Network of BioThings
1. Identify biomedical concepts in text
2. Identify relationships between concepts
Concepts: imatinib, dasatinib, PKC412, Familial systemic mastocytosis, KIT, K509I
Relationship labels: Mutation of, Mutation causes, causes, treats, inhibits
Goal: Assemble a network of biomedical knowledge that is comprehensive, current, computable and traceable.
Question: Can Citizen Scientists collectively perform concept recognition in biomedical texts?
Simple annotation interface
Click to see instructions
Highlight disease mentions
15 workers annotate each abstract
Experts versus crowd for concept identification
593 PubMed abstracts
6,900 mentions of “disease concepts”
F = 0.87
F = 0.78
$$$
Experts versus crowd for concept identification
593 PubMed abstracts
6,900 mentions of “disease concepts”
F = 0.87
F = 0.87
$$$
• 9 days
• 145 workers
• Total: $630.96
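For readers unfamiliar with the F score reported here, it is the harmonic mean of precision and recall. The sketch below assumes mentions are compared as exact (start, end) character spans; the matching criterion actually used in the study is not stated on the slide, and the example spans are invented.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of one mention


def f_score(gold: Set[Span], predicted: Set[Span]) -> float:
    """Harmonic mean of precision and recall under exact span matching."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented spans: two of three predictions match the gold standard.
gold_spans = {(0, 10), (20, 30), (40, 50)}
crowd_spans = {(0, 10), (20, 30), (60, 70)}
score = f_score(gold_spans, crowd_spans)
```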
Does Mechanical Turk scale?
1,000,000 articles per year
× 10 annotators / article
× 4 tasks / doc
× $0.066 / task
= $2,640,000 / year
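The cost estimate is a straight product of the four figures on the slide. A short sketch, with hypothetical constant and function names:

```python
# Figures taken from the slide.
ARTICLES_PER_YEAR = 1_000_000
ANNOTATORS_PER_ARTICLE = 10
TASKS_PER_DOC = 4
COST_PER_TASK_USD = 0.066


def yearly_cost_usd(articles: int, annotators: int, tasks: int, cost_per_task: float) -> float:
    """Total yearly cost when every annotator does every task on every article."""
    return round(articles * annotators * tasks * cost_per_task, 2)


total = yearly_cost_usd(
    ARTICLES_PER_YEAR, ANNOTATORS_PER_ARTICLE, TASKS_PER_DOC, COST_PER_TASK_USD
)
```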
http://mark2cure.org
Paid crowdsourcing ($$$)
• F = 0.87
• 9 days
• 145 workers
• Total: $630.96
Citizen Science (“Help science, please”)
• F = 0.84
• 28 days
• 212 workers
• Total cost: $0
Does Citizen Science scale?
Volunteers needed = (number of annotation events per year) / (number of annotation events per year per volunteer)
= (1,000,000 articles * 10 AE / article) / ((10,275 AE / (212 annotators * 28 days)) * 365 days)
= 15,828 volunteers needed
AE = Annotation events
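The estimate can be reproduced in a few lines. The sketch assumes each volunteer keeps contributing at the average daily rate observed in the 28-day experiment, which is the assumption built into the slide's formula; the function and parameter names are hypothetical.

```python
import math


def volunteers_needed(
    articles_per_year: int,
    ae_per_article: int,
    observed_ae: int,
    observed_annotators: int,
    observed_days: int,
) -> int:
    """Volunteers required to supply a year's worth of annotation events (AE)."""
    ae_needed_per_year = articles_per_year * ae_per_article
    ae_per_volunteer_per_day = observed_ae / (observed_annotators * observed_days)
    ae_per_volunteer_per_year = ae_per_volunteer_per_day * 365
    return math.ceil(ae_needed_per_year / ae_per_volunteer_per_year)


needed = volunteers_needed(
    articles_per_year=1_000_000,
    ae_per_article=10,
    observed_ae=10_275,
    observed_annotators=212,
    observed_days=28,
)
```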
Does Citizen Science scale?
15,828 volunteers needed
175,000 volunteers
300,000 volunteers
37,000 volunteers
1,000,000 volunteers
Annotating the relationships
This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
subject: GENE
predicate: therapeutic target
object: DISEASE
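One way to hold such an annotation is a typed subject-predicate-object record that also keeps its source, so the resulting network stays traceable. The class and field names are hypothetical, and the choice of "CDK9" and "hematologic malignancies" as subject and object is one reading of the passage, not an annotation taken from the talk.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Concept:
    text: str
    concept_type: str  # e.g. "GENE", "DISEASE"


@dataclass(frozen=True)
class Relation:
    subject: Concept
    predicate: str
    obj: Concept
    source: str  # where the statement came from, for traceability

    def as_triple(self) -> Tuple[str, str, str]:
        """Plain (subject, predicate, object) view of the relation."""
        return (self.subject.text, self.predicate, self.obj.text)


# Illustrative reading of the passage above.
relation = Relation(
    subject=Concept("CDK9", "GENE"),
    predicate="therapeutic target",
    obj=Concept("hematologic malignancies", "DISEASE"),
    source="passage quoted on the slide",
)
```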
Goal: Assemble a network of biomedical knowledge that is comprehensive, current, computable and traceable.
Nina Hale https://flic.kr/p/zoVih
Rare disease case study #1
Photo: Retta Beery
Bainbridge et al., STM, 2011
Photo: Retta Beery
Rare disease case study #2
… but no obvious treatments
Bainbridge et al., STM, 2011
SPR
What differentiates SPR and NGLY1?
SPR
Sarah Olmstead
https://flic.kr/p/364dZW
NGLY1
NGLY1 (11 PubMed articles)
Congenital disorders of glycosylation (822)
PNGase (686)
ERAD (1330)
glycosylation (48,862)
alacrima (164)
Genetic interactors (3016)
symptoms (109,928)
24 million articles in PubMed
Mapping the biomedical network around NGLY1
NGLY1
A preliminary view of the NGLY1-focused biological network
Why do I Mark2Cure?
I am retired, have a doctorate in medical humanities, and have two children with Gaucher disease. I am just looking for some way to put my education to use. Sounds like a perfect situation for me.
My 4 year old daughter Phoebe is living with and battling rare disease.
I have Ehlers Danlos Syndrome. I hope to help people learn about this painful and debilitating disorder, so that others like me can receive more effective medical care.
Take part in something that helps humanity.
I Mark2Cure in memory of my son Mike who had type 1 diabetes.
Studied biology in college and I really miss it!
In memory of my daughter who had Cystic Fibrosis
Give back
Open biomedical knowledge
Free text to structured data
MyVariant.info MyGene.info
Integration of molecular biology databases via high performance APIs
Biomedical Linked Open Data
Contact
http://sulab.org
@andrewsu
Gene Wiki / Wikidata
Ben Good
Sebastian Burgstaller
Tim Putman
Julia Turner
Ginger Tsueng
Andra Waagmeester
Elvira Mitraka, UMB
Lynn Schriml, UMB
Justin Leong, UBC
Paul Pavlidis, UBC
Join the team!
http://bit.ly/JoinSuLab
Slides: slideshare.net/andrewsu
Funding and Support
BioGPS: GM83924
Gene Wiki: GM089820
MyGene / MyVariant: HG008473
BD2K COE: GM114833
Icon credits (Noun Project, Wikimedia Commons): Zach VanDeHey, hunotika, Viktorvoigt, Alberto Rojas, Lloyd Humphreys
Other Group members
Jake Bruggemann
Ramya Gamini
Karthik Gangavarapu
Louis Gioia
Toby Li
Greg Stupp
MyGene / MyVariant
Chunlei Wu
Cyrus Afrasiabi
Kevin Xin
Adam Mark
Mark2Cure
Max Nanis
Ginger Tsueng
Jennifer Fouquier
Ben Good
Chunlei Wu
All Mark2Curators!