Linking Database Submissions to Primary Citations with PubMed Central Heather Piwowar and Wendy Chapman Department of Biomedical Informatics University of Pittsburgh BioLINK 2008
Nov 01, 2014
These links are important for several reasons
Sometimes the links are easy to discover
But the meaning of hyperlinks is ambiguous:
And often there are no hyperlinks at all:
One way to identify links:
NLP systems that identify statements of shared data
from within full text.
BUT this requires developing and maintaining a full-text archive!
What about using PubMed Central?
Usage?
• scientists looking for datasets for reuse
• curators looking for primary citations
• researchers studying data sharing behaviour
Goal:
Use the simple, full-text query interface of PubMed Central
to identify articles with shared datasets
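A minimal sketch of how such a query could be sent to PubMed Central programmatically. It assumes the NCBI E-utilities `esearch` endpoint with `db=pmc`, which is not mentioned on the slides (they refer only to the PMC query interface). The function name `pmc_search_url` and the example term are illustrative. The code only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint; the slides do not name it.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pmc_search_url(term, retmax=20):
    """Build an esearch URL that runs `term` against PubMed Central (db=pmc)."""
    params = {"db": "pmc", "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Illustrative term only; not the query developed in this work.
example_url = pmc_search_url('microarray AND "gene expression"', retmax=5)
```

Fetching `example_url` would return a list of matching PMC identifiers, which could then be compared against known GEO citation links.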
Method:
• Gene expression microarray data
• GEO database
Method:
• Open-access articles to train
• Non-open-access articles to test
• Gene-expression articles selected by MeSH term query
Gold Standard:
• True positives (N=550): articles with primary citation links from GEO + screening of full text
• True negatives (N=165): the rest
Building the query:
• Used full text of the open-access cohort
• Removed words with fewer than 40 occurrences
• Unigram bag-of-words vectors
• Tree and rule algorithms, with a variety of parameters
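The feature-building steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the tokenizer, the choice to count occurrences across the whole corpus, and the presence/absence encoding are all guesses, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents, min_occurrences=40):
    """Keep words seen at least `min_occurrences` times across all documents.

    Assumption: the threshold applies to total corpus occurrences.
    """
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return sorted(w for w, c in counts.items() if c >= min_occurrences)


def to_binary_vector(text, vocabulary):
    """Unigram bag-of-words as presence/absence features (assumed encoding)."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]
```

Presence/absence features are a natural fit here because a learned tree or rule over them translates directly into a Boolean query of AND, OR, and NOT terms.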
(geo OR omnibus) AND microarray AND "gene expression" AND accession
NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published))
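To make the logic of the query concrete, here is a rough local equivalent written as a Python predicate. It is a sketch only: it assumes exact, case-insensitive word matching, whereas the PubMed Central search engine may tokenize or expand terms differently. The name `matches_query` is hypothetical.

```python
import re


def matches_query(text):
    """Approximate the Boolean query on a single article's text."""
    lowered = " ".join(text.lower().split())  # normalise whitespace
    words = set(re.findall(r"[a-z0-9]+", lowered))

    # (geo OR omnibus) AND microarray AND "gene expression" AND accession
    required = (
        ("geo" in words or "omnibus" in words)
        and "microarray" in words
        and "gene expression" in lowered
        and "accession" in words
    )
    # NOT (databases OR user OR users OR (public AND accessed)
    #      OR (downloaded AND published))
    excluded = (
        "databases" in words
        or "user" in words
        or "users" in words
        or ("public" in words and "accessed" in words)
        or ("downloaded" in words and "published" in words)
    )
    return required and not excluded
```

The excluded terms appear to target articles that reuse or describe GEO data rather than deposit it, which is one reading of why the learned rules picked them.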
Evaluation Results
• 40% recall
• 94% precision, 65% for those not yet linked
• worse than full-NLP results (~89%, 83%)
• slightly better than trivial query (34%, 90%)
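For reference, recall and precision for a query can be computed from the set of articles it retrieves and the gold-standard set of articles with shared data. The sketch below uses made-up article identifiers purely for illustration; they are not the study's data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a gold standard."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy example: 5 relevant articles, query retrieves 3, 2 correct.
gold = {"A", "B", "C", "D", "E"}
hits = {"A", "B", "X"}
precision, recall = precision_recall(hits, gold)
```

A query like the one in this work trades recall for precision: it misses many articles with shared data, but most of what it returns is correct.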
Limitations
• only one datatype
• database-centric
• performance so far is rather mediocre…
Impact?
• Today’s performance:
  – would increase GEO links by 2.6%
  – by 5.5% annually once all NIH-funded articles are in PMC
• Double the recall, to 80%:
  – double the numbers above
☺ GEO curators added the 40 links identified by this study
We hope this work
inspires future enhancements, and
highlights the opportunities for simple full-text queries in PubMed Central given the mandated influx of NIH-funded research reports.
Thank you
Advisor: Dr. Wendy Chapman
Funders: NLM and Pitt DBMI
Enablers: Everyone who deposits their publications in PubMed Central!
My shared data: www.dbmi.pitt.edu/piwowar
Share your research data too!