Linking Database Submissions to Primary Citations with PubMed Central
Heather Piwowar and Wendy Chapman
Department of Biomedical Informatics, University of Pittsburgh
BioLINK 2008

BIOLINK 2008: Linking database submissions to primary citations with PubMed Central

Nov 01, 2014



Heather Piwowar

Abstract:

Background: Dataset submissions are growing exponentially. Links between dataset submissions and primary literature that describe the data collection are useful for many reasons: rich documentation, proper attribution, improved information retrieval, and enhanced text/data integration for analysis. Unfortunately, many database submissions do not include primary citation links, as database submissions are often made prior to publication. We suggest that automated tools can be developed to help identify links between dataset submissions and the primary literature. These tools require full text to differentiate cases of data sharing from data reuse and other contexts. In this study, we explore the possibility that deep analysis of full text may not be necessary, thereby enabling the querying of all reports in PubMed Central.

Methods: We trained machine learning tree and rule-based classifiers on full-text open-access article unigram vectors, with the existence of a primary citation link from NCBI’s Gene Expression Omnibus (GEO) database submission records as the binary output class. We manually combined and simplified the classifier trees and rules to create a query compatible with the interface for PubMed Central.

Results: The query identified 40% of non-OA articles with dataset submission links from GEO (recall), and 65% of the returned articles without dataset submission links were manually judged to include statements of dataset deposit despite having no link from the database (applicable precision).

Conclusion: We hope this work inspires future enhancements, and highlights the opportunities for simple full-text queries in PubMed Central given the mandated influx of NIH-funded research reports.
Page 1: BIOLINK 2008:    Linking database submissions to primary citations with PubMed Central

Linking Database Submissions to Primary Citations with PubMed Central

Heather Piwowar and Wendy Chapman
Department of Biomedical Informatics

University of Pittsburgh

BioLINK 2008


These links are important for several reasons


Sometimes the links are easy to discover


But the meaning of hyperlinks is ambiguous:


And often no hyperlinks at all:


One way to identify links:

NLP systems that identify statements of shared data from within full text.


BUT this requires developing and maintaining a full-text archive!


What about using PubMed Central?


Usage?

• scientists looking for datasets for reuse
• curators looking for primary citations
• researchers studying data sharing behaviour


Goal:

Use the simple, full-text query interface of PubMed Central to identify articles with shared datasets


Method:

• Gene expression microarray data
• GEO database


Method:

• Open Access articles to train
• Non-Open Access articles to test
• Gene-expression articles selected by MeSH term query


Gold Standard:

• True positives (N=550): articles with primary citation links from GEO + screening of full text
• True negatives (N=165): the rest


Building the query:

• Used full text of open-access cohort
• Removed words with <40 occurrences
• Unigram bag-of-words vectors
• Tree and Rule algorithms, a variety of parameters
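
The feature-building steps listed above can be sketched roughly as follows. This is only an illustration, not the authors' code: the function names are hypothetical, the tokenizer is a guess, and the slide does not say whether the 40-occurrence cutoff counts total occurrences or documents (total occurrences is assumed here) or whether the vectors were binary or counts (binary is assumed here).

```python
import re
from collections import Counter

# Threshold from the slide; words seen fewer times are dropped.
MIN_OCCURRENCES = 40


def tokenize(text):
    """Lowercase a document and split it into unigram tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents, min_occurrences=MIN_OCCURRENCES):
    """Keep words whose total count across all documents reaches the threshold."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return sorted(word for word, n in counts.items() if n >= min_occurrences)


def to_vector(text, vocabulary):
    """Binary bag-of-words vector: 1 if the word occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]
```

Vectors like these, paired with a yes/no label for "has a primary citation link from GEO", are the kind of input a tree or rule learner would take; the learner itself is not shown.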


(geo OR omnibus) AND microarray AND "gene expression" AND accession
NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published))
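
To make the logic of the query concrete, here is a rough local approximation in Python. It is a sketch under assumptions: `matches_query` is a hypothetical helper that does plain word matching, while the real PubMed Central search may treat stemming, fields, and phrases differently. The `esearch_url` helper assumes the query is sent through NCBI's E-utilities `esearch` endpoint with `db=pmc`; the slides describe the PubMed Central query interface and do not mention E-utilities.

```python
import re
from urllib.parse import urlencode

QUERY = (
    '(geo OR omnibus) AND microarray AND "gene expression" AND accession '
    "NOT (databases OR user OR users OR (public AND accessed) "
    "OR (downloaded AND published))"
)


def matches_query(text):
    """Approximate the boolean query with simple whole-word matching."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z0-9]+", lowered))

    def has(word):
        return word in words

    # Positive part: terms suggesting a GEO microarray deposit.
    included = (
        (has("geo") or has("omnibus"))
        and has("microarray")
        and re.search(r"\bgene\s+expression\b", lowered) is not None
        and has("accession")
    )
    # Negative part: terms suggesting data reuse or database papers.
    excluded = (
        has("databases")
        or has("user")
        or has("users")
        or (has("public") and has("accessed"))
        or (has("downloaded") and has("published"))
    )
    return included and not excluded


def esearch_url(query=QUERY, retmax=20):
    """Build (but do not send) an E-utilities search URL for PubMed Central."""
    params = {"db": "pmc", "term": query, "retmax": retmax}
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(params)
```

The positive terms look for deposit language, and the NOT clause filters out articles that are more likely to describe reusing someone else's data.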


Evaluation Results

• 40% recall
• 94% precision, 65% for those not yet linked
• worse than full-NLP results (~89%, 83%)
• slightly better than trivial query (34%, 90%)
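
For readers less familiar with the metrics, recall and precision can be computed from raw counts as in the sketch below. The function name is hypothetical and the counts in the usage line are made up for illustration; they are not the study's data.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from retrieval counts.

    precision = share of returned articles that really share data
    recall    = share of data-sharing articles that the query returned
    """
    retrieved = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Made-up counts: 30 correct hits, 10 wrong hits, 20 missed articles.
precision, recall = precision_recall(30, 10, 20)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The "65% for those not yet linked" figure is the same kind of precision, restricted to returned articles that had no link from GEO and were judged by hand.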


Limitations

• only one datatype
• database-centric
• performance so far is rather mediocre…


Impact?

• Today’s performance:
  – would increase GEO links by 2.6%
  – by 5.5% annually once all NIH-funded research reports are in PMC
• Double the recall, to 80%:
  – double the numbers above

☺ GEO curators added the 40 links identified by this study
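
The "double the recall, double the numbers" reasoning amounts to assuming that the number of new links grows in proportion to recall. A minimal sketch of that arithmetic, with a hypothetical function name and the proportionality stated as an assumption:

```python
def scaled_gain(current_gain_pct, current_recall, target_recall):
    """Scale a percentage gain linearly with recall (assumed proportionality)."""
    return current_gain_pct * (target_recall / current_recall)


# Figures from the slide: 2.6% today and 5.5% annually, both at 40% recall.
print(scaled_gain(2.6, 0.40, 0.80))
print(scaled_gain(5.5, 0.40, 0.80))
```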


We hope this work inspires future enhancements, and highlights the opportunities for simple full-text queries in PubMed Central given the mandated influx of NIH-funded research reports.


Thank you

Advisor: Dr. Wendy Chapman
Funders: NLM and Pitt DBMI
Enablers: Everyone who deposits their publications in PubMed Central!

My shared data: www.dbmi.pitt.edu/piwowar
Share your research data too!