Defrosting the Digital Library: A survey of bibliographic tools for the next generation web


Duncan Hull

Faculty of Life Sciences (1992-6): BSc

Computer Science (2002-2007): MSc, PhD

Chemistry (2008-date): Postdoc

It’s all Casey’s fault!

Dr. Casey Bergman, Lecturer, Faculty of Life Sciences

I s Citeulike.org!

http://ukpmc.ac.uk/

http://pubmed.gov/19060304


Defrosting the Digital Library (in one slide)

• There are lots of digital libraries out there for scientists!

– ACM, IEEE, PubMed, DBLP, Scopus, ISI-WoK, Google Scholar, arXiv

• But they have some fundamental problems with their data

– Identity crisis: identifying people accurately

– Identity crisis: identifying publications accurately

– Keeping data and metadata coupled together

– Impersonal, unsociable, difficult to use: “Cold”

• Some new tools exist to make things better: “warmer”

– Citeulike, Mendeley, Zotero, Papyro, Papers etc

– BUT Fundamental problems with identity and data need to be fixed before the tools will get any better

Metawhat?

getMetadata

getData

• From the Greek μετά (meta) meaning after

– metadata not just data about data

– metadata is data after data

– data first

– metadata second

– Reversible reaction (“round-tripping”)

Title: defrosting the digital library

Authors: Duncan Hull, Steve Pettifer and Douglas Kell

Published: 2008

Journal: PLoS Computational Biology

Tell me more?

What is it about?

Where did it come from?
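The getMetadata / getData round trip can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Library` class, its method names and the placeholder PDF bytes are hypothetical, and a content hash stands in for whatever a real library would use to recognise a document.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    identifier: str
    title: str
    authors: tuple
    year: int
    journal: str


class Library:
    """Toy in-memory library that keeps data and metadata coupled (hypothetical)."""

    def __init__(self):
        self._data_by_id = {}          # identifier -> document bytes
        self._metadata_by_digest = {}  # content hash -> Metadata

    @staticmethod
    def _digest(data):
        return hashlib.sha256(data).hexdigest()

    def add(self, metadata, data):
        self._data_by_id[metadata.identifier] = data
        self._metadata_by_digest[self._digest(data)] = metadata

    def get_metadata(self, data):
        """Data first: given the document, say what it is about."""
        return self._metadata_by_digest[self._digest(data)]

    def get_data(self, metadata):
        """Metadata second: given the description, fetch the document."""
        return self._data_by_id[metadata.identifier]


library = Library()
paper = Metadata(
    identifier="doi:10.1371/journal.pcbi.1000204",
    title="Defrosting the digital library",
    authors=("Duncan Hull", "Steve Pettifer", "Douglas Kell"),
    year=2008,
    journal="PLoS Computational Biology",
)
pdf_bytes = b"%PDF-1.4 placeholder content, not the real paper"
library.add(paper, pdf_bytes)

# The reaction is reversible: data -> metadata -> data
round_tripped = library.get_data(library.get_metadata(pdf_bytes))
```

The point of the sketch is that both directions work from either starting point; the rest of the talk is about what happens when one of them does not.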

Chemistry (Science of Matter)

Biology (Science of Life)

Informatics (Science of Information)

Cheminformatics

Biochemistry

Bioinformatics

Science!

www.mib.ac.uk

www.nactem.ac.uk/refine

www.citeulike.org

Metadata in:


Representing Evidence For Interacting Network Elements

www.sbml.org from www.biomodels.net database at the EBI.ac.uk


Example from Glycolysis in Yeast

[Reaction diagram labels: reactant, reactant, product, product, modifier]

This is just one reaction; there are at least another 1,700 in yeast

Name: Synonyms

D-Glucose: dextrose; D-Glucose; D-(+)-glucose; D(+)-glucose; grape sugar; Traubenzucker

ATP: Adenosine 5'-triphosphate; Adenosine triphosphate; H4atp

Hexokinase: Hexokinase-1; Hexokinase-A; Hexokinase PI; YFR053C

ADP: 5'-adenylphosphoric acid; Adenosine 5'-diphosphate; H3adp

Glucose-6-phosphate: Robison ester; D-Glucose 6-phosphate

Synonyms from Pedro Mendes B-Net Database http://www.comp-sys-bio.org/yeastnet/
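A minimal sketch of why the synonym table matters: before any text can be linked to a reaction, every label has to be mapped back to one name. The dictionary below copies the table above; the function names and the case-insensitive matching are assumptions made for illustration, not how the B-Net database works.

```python
# Synonym table copied from the slide; keys are the preferred names.
SYNONYMS = {
    "D-Glucose": ["dextrose", "D-Glucose", "D-(+)-glucose", "D(+)-glucose",
                  "grape sugar", "Traubenzucker"],
    "ATP": ["Adenosine 5'-triphosphate", "Adenosine triphosphate", "H4atp"],
    "Hexokinase": ["Hexokinase-1", "Hexokinase-A", "Hexokinase PI", "YFR053C"],
    "ADP": ["5'-adenylphosphoric acid", "Adenosine 5'-diphosphate", "H3adp"],
    "Glucose-6-phosphate": ["Robison ester", "D-Glucose 6-phosphate"],
}


def build_index(synonyms):
    """Map every label (preferred name or synonym) to its preferred name."""
    index = {}
    for name, alternatives in synonyms.items():
        for label in [name, *alternatives]:
            index[label.casefold()] = name
    return index


INDEX = build_index(SYNONYMS)


def canonical_name(label):
    """Return the preferred name for a label, or None if it is unknown."""
    return INDEX.get(label.strip().casefold())
```

For example, `canonical_name("Traubenzucker")` and `canonical_name("grape sugar")` both resolve to `"D-Glucose"`, while a label missing from the table returns `None`. Exact matching like this is the easy part; real text contains spellings no table lists.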

Chemistry

Biology

Informatics

Cheminformatics

Biochemistry

Bioinformatics

For more info: www.nactem.ac.uk/refine

One of the biggest challenges is getting hold of accurate metadata from libraries and databases


But first…

• Before getting into the paper…

• Some lessons I learnt while working in industrial informatics for a small startup company called CSW Informatics Ltd

– Ford and BBC

• How business and governments manage metadata

• Ford Focus (launched 1998)

getMetadata

getData

6 million+ “units” sold worldwide to date: America, Europe, Middle East, Africa, Australasia

Lots of data, metadata and money!

Owner’s handbook

Tell me more?

What is it about?

Final solution:

Web XSLT Print

Summary: Lessons from Ford

• Data often the tip of the iceberg

– If the data doesn’t sink you, the metadata will

• Businesses like Ford spend $ £ € keeping data and metadata together

• Data is often worthless without its metadata

– Can’t sell data (cars) without metadata (manuals)

– Don’t just “make cars”

DATA

METADATA

BBC Spooks?

• Open Source Intelligence (OSINT)

• Overt, not covert, espionage: 370 journalists, 24/7, ~100 languages, at Caversham, Reading.

Keeping an eye on people around the world

since 1939

Winston Churchill

“Big British Castle” (BBC)

I hate PowerPoint

Radio

MS Word

TV

How do they stay in business?

Broadcasting House, London

Foreign governments, e.g. U.S.A. etc

Word: Not the best way to manage data and metadata

Getting Rid of Word

database

XML schema

Web & Intranet

Printed documents

XSLT

A solution that worked!
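The single-source idea (one structured record, several outputs) can be sketched as follows. The real pipeline used an XML schema and XSLT; this sketch swaps in plain Python functions over the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies. The element names and the record content are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET
from html import escape

# One structured source record (hypothetical element names and content).
SOURCE = """<report>
  <title>Who is Thabo Mbeki?</title>
  <subject>Thabo Mbeki</subject>
  <body>Placeholder text standing in for a monitored broadcast.</body>
</report>"""


def to_web(record):
    """Render the record as a small HTML page, keeping the subject as metadata."""
    title = escape(record.findtext("title"))
    subject = escape(record.findtext("subject"))
    body = escape(record.findtext("body"))
    return (
        "<html><head><title>" + title + "</title>"
        '<meta name="subject" content="' + subject + '"></head>'
        "<body><h1>" + title + "</h1><p>" + body + "</p></body></html>"
    )


def to_print(record):
    """Render the same record as plain text for a printed document."""
    title = record.findtext("title")
    underline = "=" * len(title)
    subject = record.findtext("subject")
    body = record.findtext("body")
    return title + "\n" + underline + "\nSubject: " + subject + "\n\n" + body + "\n"


record = ET.fromstring(SOURCE)
web_page = to_web(record)
printed_page = to_print(record)
```

Both outputs are generated from the same record, so the subject metadata cannot drift apart between the web and print versions, which is the property a Word document cannot guarantee.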

getMetadata

getData

Who is Thabo Mbeki?

These documents are all about Thabo Mbeki

Thabo Mbeki

Summary: Lessons from the BBC

• Important decisions are made on the basis of metadata

– Crucial that metadata is accurate, high quality and trustworthy

– Identifying people properly is crucial (100%)

– You know what data is about (getMetadata)

– You know where it came from (getData)

– Looked after properly (this can be expensive)

– Businesses built on buying/selling metadata:

How have libraries managed metadata?

On paper since 300 B.C.

(Library of Alexandria)

Organised in physical space

In buildings made from bricks and mortar

Expensive and slow to distribute

Only ever read by humans

Filled with content bought from publishers, locked up with copyright

Image via http://en.wikipedia.org/wiki/Library_of_Alexandria


From ~1824 until ~1989

Photos via dpicker http://www.flickr.com/photos/dpicker/3107856991/ and pit yacker http://www.flickr.com/photos/78825653@N00/131611136

JRULM (Main Library)

Joule Library

Mostly “private” only available to an elite (e.g. University of Manchester Students and Staff)


Metadata (after)

Data

Tightly bound (literally)

Rarely separated

First published 1687, over 300 years old

Data and metadata were like this for centuries!

• Until…

+

Tim Berners-Lee

1989

Timeline: Unchanged for centuries but…

20 years ÷ 2309 years = <1%

Everything’s Gone Digital!

www.scopus.com

www.pubmed.gov

http://ukpmc.ac.uk

www.isiknowledge.com

scholar.google.com


Digital Utopia?

• Bits and bytes 1010100101000001101010 (not paper)

• In pervasive cyberspace (not physical space)

• Databases and/or the Web, identified by URIs (not buildings)

• Cost of distribution has fallen by orders of magnitude

• Read and indexed by machines like Googlebot et al (not just humans)

• Increasingly public, available to everyone via Open-Access publishing (less private, less restrictive copyright)

• Everything is great?

Alexander Griekspoor

www.mekentosj.com


Welcome to Digital Dystopia

• Isolation

– each discipline has its own data silo

• Impersonal and unsociable

– “Who the hell are you?”

– Where are “my” papers? (authored by me, or of interest to me)

– What are my friends and colleagues reading?

– What are the experts reading? What is popular this week / month / year ?

• “Cold”: Identity of publications and authors is inadequate

• Data divorced from its metadata

– GetMetadata / GetData unreliable

– Therefore can be difficult to tell what data is about, or where metadata came from

• Obsolete models of publication, not everything fits publication-sized holes

– Micro-attribution

– Mega-attribution

– Digital contributions (databases, software, wikis/blogs?)

Isolated publication silos

Chemistry

Informatics

Biology

Impersonal, isolated, unsociable, generally rubbish

Identity Crisis part 1: Which publication?

1. http://pubmed.gov/18974831

2. http://www.ncbi.nlm.nih.gov/pubmed/18974831

3. http://ukpmc.ac.uk/articlerender.cgi?accid=pmcA2568856

4. http://ukpmc.ac.uk/picrender.cgi?artid=1687256&blobtype=pdf

5. http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000204

6. http://www.dbkgroup.org/Papers/hull_defrost_ploscb08.pdf

7. http://dx.doi.org/10.1371/journal.pcbi.1000204

• One paper, many URIs. Disambiguation algorithms rely on getting metadata for each

– These redundant duplicates are a big problem for libraries

• Matching can be done by Digital Object Identifier (DOI) and PubMed ID (PMID);

– these are frequently absent < 5% (Kevin Emamy, citeulike)
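A rough sketch of matching by DOI and PMID, applied to the seven URIs listed on this slide. The regular expressions and the `publication_key` function are assumptions made for illustration; they are not Citeulike's actual matching rules.

```python
import re
from collections import defaultdict
from urllib.parse import unquote

# Illustrative patterns only: a DOI anywhere in the URI, or a PubMed ID
# in one of the two PubMed URL shapes that appear on the slide.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#&]+")
PMID_PATTERN = re.compile(r"(?:pubmed\.gov|ncbi\.nlm\.nih\.gov/pubmed)/(\d+)")


def publication_key(uri):
    """Return 'doi:...' or 'pmid:...' if the URI exposes one, else None."""
    decoded = unquote(uri)  # turns %3A and %2F back into ':' and '/'
    doi = DOI_PATTERN.search(decoded)
    if doi:
        return "doi:" + doi.group(0).lower()
    pmid = PMID_PATTERN.search(decoded)
    if pmid:
        return "pmid:" + pmid.group(1)
    return None


URIS = [
    "http://pubmed.gov/18974831",
    "http://www.ncbi.nlm.nih.gov/pubmed/18974831",
    "http://ukpmc.ac.uk/articlerender.cgi?accid=pmcA2568856",
    "http://ukpmc.ac.uk/picrender.cgi?artid=1687256&blobtype=pdf",
    "http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000204",
    "http://www.dbkgroup.org/Papers/hull_defrost_ploscb08.pdf",
    "http://dx.doi.org/10.1371/journal.pcbi.1000204",
]

groups = defaultdict(list)
for uri in URIS:
    groups[publication_key(uri)].append(uri)
```

Under these assumptions two URIs group under the PMID, two under the DOI, and three expose no identifier at all. Even the two successful groups stay separate, because nothing in the URIs says that this PMID and this DOI name the same paper; that link has to come from metadata.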



Identity crisis part 2: Who are you? Who, who … who, who?

1. Douglas Kell

2. Doug Kell

3. Douglas B Kell

4. Kell, D

5. Kell, D.B.

6. Douglas Bruce Kell

7. Druglas Kell

Neil Smalheiser and Vetle Torvik

Typo

Attribution would seem to be a simple process and yet it represents a major, unsolved problem for information science.

http://tinyurl.com/authorid
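To see why attribution is harder than it looks, here is a deliberately naive sketch that reduces each name to a "surname, first initial" key. The `author_key` function is a made-up heuristic for illustration, not the method used by Smalheiser and Torvik, and "David Kell" is an invented name used only to show a collision.

```python
def author_key(name):
    """Reduce a name to 'surname, initial' (naive heuristic, illustration only)."""
    name = name.strip()
    if "," in name:
        # "Kell, D.B." style: surname comes first
        surname, given = [part.strip() for part in name.split(",", 1)]
    else:
        # "Douglas B Kell" style: surname is the last word
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return surname.lower() + ", " + given[:1].upper()


VARIANTS = [
    "Douglas Kell",
    "Doug Kell",
    "Douglas B Kell",
    "Kell, D",
    "Kell, D.B.",
    "Douglas Bruce Kell",
    "Druglas Kell",  # the typo from the slide
]

keys = {author_key(name) for name in VARIANTS}

# A different (invented) person collapses onto the same key.
collision = author_key("David Kell")
```

All seven variants, typo included, collapse to the single key `kell, D`, which looks like success until the invented "David Kell" lands on the same key. A key loose enough to merge the variants is also loose enough to merge different people, which is why names alone cannot solve the problem.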


Identity crisis part 3: Mistaken Identity

Google Scholar thinks I’m Maurice Wilkins

Dr. Duncan Hull, Humble Postdoc

Article

about

Authored-by

Authored-by

Wrong!

“DNA mania”

title

http://tinyurl.com/mistakenid

Can’t get metadata (decoupled from data): PDF

getMetadata

getData

Title: defrosting the digital library

Authors: Duncan Hull, Steve Pettifer and Douglas Kell

Published: 2008

Tell me more

Don’t know, try Google

Don’t know, title might be “defrosting…”

Where did this come from?


Why can't I manage academic papers like MP3s?

http://tinyurl.com/mp3vpdf

James Howison, Carnegie Mellon University

Data is tightly coupled to its metadata

MP3 music file in iTunes

getMetadata

getData

Artist: The Who

Title: Who Are You?

Recorded: 1978

Album: Who Are You
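What "tightly coupled" means in practice: the metadata travels inside the file. The sketch below uses ID3v1, the simplest MP3 tag format (a fixed 128-byte block at the end of the file), chosen here for brevity; many files carry the newer and more involved ID3v2 tags instead. The "audio" bytes are a placeholder, not a real recording.

```python
import struct

# ID3v1 layout: "TAG", title(30), artist(30), album(30), year(4),
# comment(30), genre(1) = 128 bytes at the very end of the file.
ID3V1 = struct.Struct("<3s30s30s30s4s30sB")


def make_id3v1(title, artist, album, year):
    """Build an ID3v1 block; short fields are padded with null bytes."""
    return ID3V1.pack(
        b"TAG",
        title.encode("latin-1"),
        artist.encode("latin-1"),
        album.encode("latin-1"),
        year.encode("latin-1"),
        b"",   # empty comment
        255,   # genre 255 = unspecified
    )


def read_id3v1(mp3_bytes):
    """getMetadata: read the tag from the end of the file, or return None."""
    if len(mp3_bytes) < ID3V1.size:
        return None
    tag, title, artist, album, year, _comment, _genre = ID3V1.unpack(
        mp3_bytes[-ID3V1.size:]
    )
    if tag != b"TAG":
        return None

    def text(raw):
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(title),
        "artist": text(artist),
        "album": text(album),
        "year": text(year),
    }


fake_audio = b"\xff\xfb" + b"\x00" * 64  # placeholder bytes, not real audio
mp3_file = fake_audio + make_id3v1("Who Are You?", "The Who", "Who Are You", "1978")
tags = read_id3v1(mp3_file)
```

Copy the file anywhere and the artist, title, album and year go with it. A typical PDF of a paper offers no equally dependable place to look.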





Peter Murray-Rust

Hamburger

(unstructured data)

PDF is a hamburger, and we're trying to turn it back into a cow.

http://tinyurl.com/pdfhamburger

Cow (structured data)

publishing

text-mining

Can’t get metadata (decoupled from data): HTTP

• Arbitrary URI (not just pubmed, but any scientific paper) http://www.ncbi.nlm.nih.gov/pubmed/18974831

Can’t get metadata (decoupled from data): HTTP

• Fundamental problem with the way the web is built using HTTP, can’t change it now…

Tim Bray, Sun Microsystems

One of the Web's distinguishing features is that there's a big gaping hole where the metadata ought to be.

http://tinyurl.com/nometadata

I’ll stop moaning now

• Isolation

• Can’t identify people

• Can’t identify publications

• Metadata gets divorced from its data

• But what are the solutions?

www.citeulike.org

Richard Cameron, Kevin Emamy. Pictures from http://network.nature.com/people/mfenner/blog/2009/01/30/interview-with-kevin-emamy and http://www.citeulike.org/faq/faq.adp

The reason I wrote the site [citeulike.org] was, after recently coming back to academia, I was slightly shocked by the quality of some of the tools available to help academics do their job. I found it preferable to start writing proper tools for my own use than to use existing software.


Why should you care about citeulike?

1. Could save you time

2. But also like Green Fluorescent Protein…

All references in one place

Click Post to Citeulike

Tag it (optional)

Citeulike: Recoupling data and metadata

• Wouldn’t be a problem if the publishers hadn’t decoupled it in the first place!

Citegeist = Citeulike + Zeitgeist

How Big?

[Bar chart: size in publications (millions, axis 0 to 16) per library / database: Scopus, Citeulike, Pubmed, Arxiv. Annotations: “allegedly”; 2,243,177; ~2,000 /day; variable; 674,076; 2,880 /day (2 papers / min); linear growth; ~500,000]

Where will citeulike break?

• The more people that use “social software”, the better it gets

– Citeulike is one of the leading ones, but there is plenty of competition

• Parsers are fragile, easily (and deliberately) broken by publishers

– ISI WOK and Scopus

– Each publisher has its own parser (euuuggh!)

• Privacy and competition

– “I don’t want to share any of my data before publication”

– “It’s nobody’s business but mine” (basic human right to privacy)

• Closer integration with Word (and LaTeX tools)

• Might go bust? Why put all my precious data in the hands of a commercial company?

Why should you bother with citeulike?

• Organisation and time saving

– Searching

– Browsing

– Managing references while writing papers

• Quick and efficient sharing of data before publication

– e.g. tag “defrost” when writing this paper

– http://www.citeulike.org/tag/defrost

• Serendipity

– Casey Bergman story

Casey Bergman story

I was importing papers on solexa and 454 genome assembly and came across the following paper: http://www.citeulike.org/user/cisevol/article/1465689 which was a real find in terms of convincing me that light shotgun sequence data is worth analysing. I nicked this from a phd student's library in Brazil http://www.citeulike.org/profile/GustavoLacerda

Wouldn’t have found this any other way (e.g. keyword searching or following citation trails)

Many different solutions

e.g. Papyro: Steve Pettifer

http://utopia.cs.manchester.ac.uk/


And the rest…

www.mendeley.com

www.zotero.org

www.connotea.org

www.mekentosj.com

www.hubmed.org

Re-couple metadata that has been de-coupled from data

www.2collab.com

www.refworks.com

“iTunes for PDF files”


There is still lots more metadata

How many times has http://pubmed.gov/19060304 been cited?

Who has cited http://pubmed.gov/19060304 ? Give me all the references that cite this one

Give me all the references cited by http://pubmed.gov/19060304

Who the hell is Doug Kell? Steve Pettifer? Duncan Hull?

What is Doug Kell’s h-index?

Remember: Machines ask these questions, not just humans

Notify me whenever Steve Pettifer publishes a paper

Notify me whenever someone cites http://pubmed.gov/19060304

Impact factor?
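The h-index question shows how much rests on getting identity right. The calculation itself is simple: an author has index h if h of their papers have at least h citations each. A minimal sketch follows; the citation counts in the usage note are invented for illustration, not anyone's real record.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h
```

For example, `h_index([10, 8, 5, 4, 3])` gives 4, because four papers have at least four citations each. The arithmetic is the easy part. The answer is only as good as the list of papers, so a library that merges two authors or splits one author into several name variants will report the wrong number.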

Digital Identity would solve some of these problems

Give yourself a URI, you deserve it!

Tim Berners-Lee http://www.w3.org/People/Berners-Lee/card#i

see http://dig.csail.mit.edu/breadcrumbs/node/71
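A small sketch of how a URI like the one above can be read. By the convention described in the linked post, the part before `#` locates a document, and the full URI with its fragment names the person that document describes. The function name `split_identity` is hypothetical; `urldefrag` is from the Python standard library.

```python
from urllib.parse import urldefrag


def split_identity(uri):
    """Split a URI into (document URI, fragment naming the thing described)."""
    document, fragment = urldefrag(uri)
    return document, fragment


document, fragment = split_identity("http://www.w3.org/People/Berners-Lee/card#i")
```

Here `document` is `http://www.w3.org/People/Berners-Lee/card` and `fragment` is `i`. A machine can fetch the document and still refer unambiguously to the person, which is exactly what name strings cannot offer.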

URIs for Douglas Kell

1. http://blogs.bbsrc.ac.uk

2. http://www.chemistry.manchester.ac.uk/aboutus/staff/showprofile.php?id=194

3. http://dbkgroup.org/kell.htm

4. http://douglaskell.myopenid.com

5. http://dx.doi.org/10.1371/journal.pcbi.1000204

“Contributor identifier” from

www.myopenid.com

www.openid.net

(Also note ResearcherID from Thomson)

• http://pubmed.gov/19112480 Phil Bourne

John Ziman, Physicist

Science is public knowledge

http://tinyurl.com/publicknowledge

Conclusions: What hasn’t changed

• The Web has revolutionised libraries in just 20 short years but…

• Still takes time for humans to read and digest: We can get more papers but there are still only 24 hours in a day, 7 days in a week, 52 weeks in a year

– We need help from machines (and the people that build them)

– Need to make metadata more machine-friendly

Conclusions: Publication metadata matters

• Managed to convince you metadata matters (and why)

• People make important decisions based on metadata

– Funding

– Hiring (and Firing)

– Publishing

– Who to collaborate with

Yet our current libraries can’t even accurately identify crucial metadata

Individual people - digital identity needed

Publications - disambiguation

Everything else…

Conclusions: Scientists are too blasé about metadata!

• Leave it to stamp collectors, dusty-librarians, informaticians, database administrators (yawn!), “biocurators” http://biocurator.org/

– Boring, unscientific, not cutting-edge innovation?

• Everyone wants to use good metadata but few people want to spend time curating and cleaning metadata

– Like a clean toilet

• We ignore metadata at our peril (“not my job”)

– We leave it to publishers, who then mess it up and charge us for their services; we should be getting better value for money

– We waste precious time organising metadata

– We waste precious time searching for metadata

– Data is more valuable with better metadata

• Have a look at citeulike (and other tools)


Conclusions: Do us a favour!

Acknowledgements

• Refine project: Sophia Ananiadou, Jun'ichi Tsujii, Pedro Mendes, Steve Pettifer, Yoshimasa Tsuruoka, Douglas Kell www.nactem.ac.uk/refine

• BBSRC grant code BB/E004431/1

• CSW Informatics Ltd.: John Chelsom, Mavis Cournane, Niki Dinsey www.csw.co.uk BBC Monitoring, Ford Motor Company

• School of Chemistry, MIB (now) www.mib.ac.uk

• Faculty of Life Sciences (a long long time ago) and Casey Bergman, Jean-Marc Schwartz (now)

• School of Computer Science (not so long ago) Information Management Group http://img.cs.man.ac.uk/

• Any Questions?


