Linking literature to data in the life sciences OpenAIREplus workshop, Copenhagen, 11 June 2012
Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

May 11, 2015


OpenAIREplus workshop - “Linking Open Access publications to data – policy development and implementation” (June 11, 2012)
Transcript
Page 1: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Linking literature to data in the life sciences

OpenAIREplus workshop, Copenhagen, 11 June 2012

Page 2: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Overview

• What literature? What data?

• How we make literature-data connections

• Case study

• Challenges and future directions

Page 3: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

What literature? What data?

Page 4: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Data Landscape and Definitions

[Diagram labels:]

• Big Data: Deposition

• Primary Research articles

• Big Data: Curated Annotation

• Unstructured Data

• Funder mandates

• Journal requirements

• Metadata Standards

• *reuse

Page 5: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

PMC336623 Extended to several other biological data types

Page 6: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

[Figure: six charts, each plotted by year, 2001 to 2010]

• European Nucleotide Archive: nucleotides (millions), axis 0 to 45,000

• Ensembl and Ensembl Genomes: genomes, axis 0 to 300

• UniProt: entries, axis 0 to 14,000,000

• InterPro: entries, axis 0 to 25,000

• ArrayExpress: hybridisations, axis 0 to 500,000

• PDBe: structures, axis 0 to 70,000

• Big data

• Thematic data

• Public data

• Archived data

• Two petabytes of data

• Scales to 7 PB of raw disk

• Majority is DNA

Page 7: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Two core literature databases

• 26 million abstracts (PubMed, Patents, Agricola)

• Website and web services

• 2.2 million full text articles (217K articles with supplementary data)

• Website

• Citation networks

• Database links

• Whatizit textmining

• Supplemented by CiteXplore

• Additional text mining

• over 1.1 million new records per year

• over 150K new articles per year

Page 8: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

UK PubMed Central Overview

• Built in collaboration with PubMed Central USA (+ PMC Canada) since 2006

• Led by the European Bioinformatics Institute since 2011, with the British Library and the University of Manchester

• Supported by 16 UK and 2 European funders, led by the Wellcome Trust. Research spend: ~2 billion GBP

• A life-science web-based repository

• Manuscript submission service (self archiving by grant holders)

• Database of grant information, with details of about 18,000 PIs

• Grant reporting and funder analysis tool

• 250K requests, 40K IPs, 7K direct interactive searches per day

Page 9: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

How many articles?

Overall: 20% OA (~ 450K OA articles out of 2.2 million total)

Page 10: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

How we make literature-data connections

Page 11: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Links

• by the author - on submission, as metadata (primary databases)

• by database curators - information and links from the literature

• expensive, slow, but high quality

Text mining

• by algorithms that use terminologies (can be subject to lag)

• post publication – can find new associations

• variable quality, but high throughput
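
As a rough illustration of "algorithms that use terminologies", the sketch below tags text by dictionary lookup. It is a minimal example under stated assumptions, not the Whatizit or UKPMC implementation: the terms, the `TERM:` identifiers and the function names are all made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical mini-terminology: surface form -> (semantic type, identifier).
# The identifiers are placeholders, not real database accessions.
TERMINOLOGY = {
    "dna repair": ("GO Term", "TERM:0001"),
    "muts": ("Gene/Protein", "TERM:0002"),
    "escherichia coli": ("Organism", "TERM:0003"),
    "colon cancer": ("Disease", "TERM:0004"),
}


def build_pattern(terms):
    """Compile one regex matching any term as a whole word, longest term first."""
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def annotate(text, terminology=TERMINOLOGY):
    """Return (start, end, matched text, semantic type, identifier) tuples."""
    pattern = build_pattern(terminology)
    hits = []
    for match in pattern.finditer(text):
        sem_type, ident = terminology[match.group(0).lower()]
        hits.append((match.start(), match.end(), match.group(0), sem_type, ident))
    return hits


def count_by_type(hits):
    """Count annotations per semantic type."""
    counts = defaultdict(int)
    for hit in hits:
        counts[hit[3]] += 1
    return dict(counts)


sentence = "MutS drives DNA repair in Escherichia coli; defects are linked to colon cancer."
for hit in annotate(sentence):
    print(hit)
```

A lookup like this only finds terms already in the dictionary, which is one way the lag mentioned above can arise: new names are missed until the terminology is updated.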

Page 12: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Links from Literature to Databases

• Proteins

• Nucleotides

• OMIM

• Chemicals

• Structure

• Clinical reviews

• Protein families

• Protein-protein interactions

• Gene expression experiments …

800 K

370 K

110 K

Page 13: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Text Mining in UKPMC (2.2 million articles)

Semantic Type    Unique Terms     Articles    Annotations
Gene/Protein          225,905    1,288,809     15,021,502
GO Terms               32,486    1,806,539     15,016,957
Organism              178,847    1,689,251     12,322,782
Disease               170,592    1,743,212     16,201,198
Accession No.         232,950       65,640        331,329
Chemical               76,350    1,669,500     22,438,980
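
The counts in the text-mining table can be turned into two simple derived figures: annotations per annotated article, and the share of the 2.2 million articles that carry at least one annotation of each type. The sketch below does that arithmetic; the numbers are copied from the table, while the function and field names are illustrative, and the total of 2,200,000 is the rounded figure from the slide.

```python
# Counts copied from the "Text Mining in UKPMC" table:
# semantic type -> (unique terms, articles, annotations)
TEXT_MINING_COUNTS = {
    "Gene/Protein": (225_905, 1_288_809, 15_021_502),
    "GO Terms": (32_486, 1_806_539, 15_016_957),
    "Organism": (178_847, 1_689_251, 12_322_782),
    "Disease": (170_592, 1_743_212, 16_201_198),
    "Accession No.": (232_950, 65_640, 331_329),
    "Chemical": (76_350, 1_669_500, 22_438_980),
}

# "2.2 million articles" is a rounded total, so coverage is approximate.
TOTAL_ARTICLES = 2_200_000


def summarise(counts, total_articles):
    """Derive annotations per article and article coverage for each type."""
    rows = {}
    for sem_type, (_terms, articles, annotations) in counts.items():
        rows[sem_type] = {
            "annotations_per_article": annotations / articles,
            "article_coverage": articles / total_articles,
        }
    return rows


for sem_type, row in summarise(TEXT_MINING_COUNTS, TOTAL_ARTICLES).items():
    print(
        f"{sem_type:<14}"
        f"{row['annotations_per_article']:6.1f} annotations per article, "
        f"{row['article_coverage']:.0%} of articles"
    )
```

On these figures, accession numbers stand out: they appear in a small fraction of articles compared with the other semantic types, which each occur in more than half of the collection.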

Page 14: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Case study

Page 15: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

3.9 billion years ago

Page 16: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

E. coli meets humans

Human colon cancer DNA repair

Page 17: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI


Page 18: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI
Page 19: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Protein structure in PDBe

Page 20: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Link to the literature from the PDBe record

Page 21: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Algorithms that find similar structures

Page 22: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Text mine full text for 1ewq
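
One way to picture mining full text for a structure identifier such as 1ewq is a pattern match combined with a context check. The sketch below is a heuristic under stated assumptions, not the UKPMC pipeline: the cue words, the 40-character window, the rule that skips all-digit tokens, and the name `find_pdb_ids` are illustrative choices.

```python
import re

# Classic PDB identifiers are four characters: a digit 1-9 followed by three
# alphanumerics. On its own the pattern is far too permissive, so candidates
# are only accepted when they follow a cue word closely.
PDB_ID = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")
CUE = re.compile(r"\b(?:PDBe?|Protein Data Bank)\b", re.IGNORECASE)
WINDOW = 40  # assumption: characters after a cue in which an ID is trusted


def find_pdb_ids(text, window=WINDOW):
    """Return lower-cased PDB-like identifiers found shortly after a cue word."""
    found = []
    for cue in CUE.finditer(text):
        # Scan forward from the end of the cue and stop once outside the window.
        for match in PDB_ID.finditer(text, cue.end()):
            if match.start() - cue.end() > window:
                break
            candidate = match.group(0)
            if candidate.isdigit():
                continue  # heuristic: skip plain numbers such as years
            pdb_id = candidate.lower()
            if pdb_id not in found:
                found.append(pdb_id)
    return found


sample = "The crystal structure (PDB code 1EWQ) was compared with related repair proteins."
print(find_pdb_ids(sample))
```

Requiring a nearby cue trades recall for precision: identifiers mentioned without the word "PDB" are missed, but ordinary four-character tokens are far less likely to be reported as structures.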

Page 23: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Towards understanding DNA repair mechanisms

Page 24: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Challenges and future directions

Page 25: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Data-driven science

Data re-use: biology is post publication

Linking: citing papers and data (provenance and integration)

Metrics and attribution

Hard decisions about value of keeping complete data sets

Page 26: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Data landscape - possibilities

[Diagram labels:]

• Big Data: Deposition

• Primary Research articles

• Big Data: Curated Annotation

• Unstructured Data

• reuse?

• Structured links

• analysis

Page 27: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Analysis supplied by Mimas, University of Manchester

[Chart labels: PDF, TIF, XSL, DOC, MOV, HTML, GIF, JPG]

Page 28: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

Solutions that make sense to scientists

Page 29: Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI

http://ukpmc.ac.uk