ETD 2008

Extracting and re-using research data from chemistry e-theses: the SPECTRa-T project

Peter Morgan
SPECTRa-T Project Director
Head of Medical and Science Libraries
Cambridge University Library
[email protected]
www.lib.cam.ac.uk/spectra-t/


Outline

• Why SPECTRa-T?
• Getting started
• Mining the text
  – PDFs
  – .docx
• Workflows
• Further thoughts

Why SPECTRa-T?

“theses should be semantic and interactive”
- Peter Murray-Rust (ETD 2007 keynote address)

SPECTRa-T background

• SPECTRa-T = Submission, Preservation, & Exposure of Chemistry Teaching and Research data from Theses

• SPECTRa-T funded by JISC Digital Repositories Programme

• 1 year project (April 2007 – March 2008)

• partners:
  – University of Cambridge (Chemistry + Library)
  – Imperial College London (Chemistry + ICT)

• team had previously worked together on “SPECTRa”

Why SPECTRa-T?

• research chemists produce experimental data (materials, reactions, properties = “recipes”)

• these data are the basis of further research

• theses are a rich source of data
  – c.10k chemistry papers p.a. worldwide
  – a typical thesis contains 50-60 preparations
  – 20% will be published in research papers
  – 80% are not published

Why SPECTRa-T?

• text-mining can retrieve these data

• two basic data types:
  – Named Chemical Entities (NCEs) (e.g. words/phrases describing properties, procedures, instruments, etc.)
  – Chemical Objects (COs) (e.g. molecules, spectra)

• our Semantic Web aim:
  – extract both data types
  – create RDF triples and chemical objects
  – link them to enable semantic querying

RDF triples

• RDF triples are statements containing a subject (resource), predicate (property), and object (value)

• “water boils at 100 degrees Celsius”

• the value of one property can be used as the resource for another
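
The three bullets above can be illustrated with a small sketch. It is not SPECTRa-T code: the `http://example.org/` namespace, the property names (`boilingPoint`, `value`, `unit`) and the `objects` helper are all assumptions made up for illustration. It encodes "water boils at 100 degrees Celsius" as subject/predicate/object statements and shows the value of one property (`bp1`) being re-used as the resource of further statements.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the resource being described
    predicate: str  # the property
    obj: str        # the value (a literal or another resource)


EX = "http://example.org/"  # hypothetical namespace, for illustration only

triples = [
    # "water boils at 100 degrees Celsius", split into linked statements
    Triple(EX + "water", EX + "boilingPoint", EX + "bp1"),
    # the value of the first statement (bp1) is the resource of the next two
    Triple(EX + "bp1", EX + "value", "100"),
    Triple(EX + "bp1", EX + "unit", EX + "degreeCelsius"),
]


def objects(store, subject, predicate):
    """Return every value stored for the given resource and property."""
    return [t.obj for t in store if t.subject == subject and t.predicate == predicate]


# Follow the chain: water -> boiling point node -> numeric value
bp_nodes = objects(triples, EX + "water", EX + "boilingPoint")
bp_value = objects(triples, bp_nodes[0], EX + "value")
```

A real triplestore would answer the same question with a query language such as SPARQL; the list comprehension here only stands in for that lookup.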

Getting started

Test material

• 100 PDF chemistry theses from CalTech, MIT, St Andrews & Stirling
  – some MIT theses OCR-derived (later removed from analysis because of misassigned characters)

• 20 Word chemistry theses from Cambridge (converted to Office Open XML .docx mark-up format)

Software

• OSCAR3 (Open Source Chemistry Analysis Routines) as text-mining tool
  – developed by SciBorg Project (Cambridge)
  – natural language processing to identify chemical terms
  – converts human-readable text into XML marked-up content that machines can manipulate
  – prefers SciXML documents
  – uses ChEBI Ontology for chemical name recognition

OSCAR3 parsing

Highlighted experimental procedures created by OSCAR3

Mining the text

PDF ...

• wraps text in simple high-level elements
• is optimized for human, not machine, readability
• produces poor SciXML
  – line breaks = loss of continuous text and paragraph structures
  – chemical drawings replaced by text and disconnected lines
  – loss of subscript and superscript characters
  – non-printing characters
  – OCR-derived text produces erroneous character assignment (e.g. i, l, 1)

PDF processing

• SPECTRa-T tools...
  – removed line-breaks
  – removed non-printing characters
  – removed text fragments resulting from broken drawings
  – used UTF-8 Unicode to preserve Greek characters (lost in ASCII)
• (note: PDF/A can avoid some but not all such problems)

• text then converted to SciXML
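
As a rough illustration of the clean-up steps listed above, here is a minimal sketch. It is not the SPECTRa-T tooling: the function names and the rule that a blank line marks a paragraph break are assumptions. It removes non-printing characters, re-joins hard line breaks into continuous paragraphs, and keeps the text as Unicode so that Greek characters survive. Removing fragments of broken drawings is not attempted here.

```python
import re
import unicodedata


def strip_non_printing(text: str) -> str:
    """Drop control and format characters (Unicode categories C*), keeping newlines."""
    return "".join(
        ch for ch in text
        if ch == "\n" or not unicodedata.category(ch).startswith("C")
    )


def rejoin_lines(text: str) -> str:
    """Assumption: blank lines are paragraph breaks; other line breaks are layout only."""
    paragraphs = re.split(r"\n\s*\n", text)
    joined = [
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    ]
    return "\n\n".join(p for p in joined if p)


def clean_pdf_text(raw: str) -> str:
    """Apply both steps; the result stays a Unicode string (encode as UTF-8 on output)."""
    return rejoin_lines(strip_non_printing(raw))


# Sample with a form feed, a hard line break, a zero-width space and a Greek letter
sample = "The \u03b1-anomer was stirred\x0c\nat 0 \u00b0C for 2 h.\n\nYield: 85%\u200b"
cleaned = clean_pdf_text(sample)
utf8_bytes = cleaned.encode("utf-8")  # Greek alpha is preserved; ASCII could not hold it
```

Note that tabs also fall into the control-character category and are dropped by this sketch; a production tool would probably map them to spaces instead.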

SciXML from PDF

• OSCAR retrieves Named Chemical Entities
• OSCAR creates SAFXML (Standoff Annotated Format XML) output
• NCE metadata transformed by XSL stylesheets into RDF triples
• RDF triplestore can be queried

BUT...
• OSCAR cannot identify Chemical Objects
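
The annotation-to-triple step can be sketched as follows. The project used XSL stylesheets for this; the Python below is a stand-in chosen only to keep the example self-contained. The XML layout (`annot` elements carrying `slot` children), the `example.org` URIs and the function name are simplified assumptions, not the actual SAFXML schema or the project's stylesheet.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified standoff annotations (not the real SAFXML schema)
SAF_SAMPLE = """<saf>
  <annot id="o1" type="oscar">
    <slot name="type">CM</slot>
    <slot name="surface">ethanol</slot>
  </annot>
  <annot id="o2" type="oscar">
    <slot name="type">RN</slot>
    <slot name="surface">reduction</slot>
  </annot>
</saf>"""

BASE = "http://example.org/thesis/123#"  # assumed URI for one thesis
VOCAB = "http://example.org/vocab#"      # assumed property vocabulary


def saf_to_ntriples(saf_xml, base=BASE, vocab=VOCAB):
    """Emit one N-Triples line per slot: <annotation> <slot name> "slot text" ."""
    root = ET.fromstring(saf_xml)
    lines = []
    for annot in root.iter("annot"):
        subject = f"<{base}{annot.get('id')}>"
        for slot in annot.findall("slot"):
            predicate = f"<{vocab}{slot.get('name')}>"
            # Escape backslashes and quotes so the literal stays valid N-Triples
            literal = (slot.text or "").replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} {predicate} "{literal}" .')
    return lines


ntriples = saf_to_ntriples(SAF_SAMPLE)
```

Each annotation becomes a resource and each slot becomes a property of it, which is the shape a triplestore needs before the named chemical entities can be queried.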

.docx processing

• Word theses converted to Office Open XML (.docx) using MS Word 2007

• XML is converted into rich SciXML
• SciXML structure enables OSCAR3 to identify “Experimental” sections and extract Chemical Objects
• XML converted to CML (Chemical Markup Language)
• URIs assigned to CO metadata & associated with NCEs
• CML COs deposited in lightweight data repository

• RDF triplestore and CO data repository, linked by URIs, can now be queried semantically
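
To show why .docx is easier to mine than PDF, here is a minimal sketch that reads paragraph text straight out of the XML inside a .docx file (a zip archive whose main part is `word/document.xml`). It is not the SPECTRa-T converter: the helper names, the in-memory demo document and the rule "everything after a paragraph reading Experimental" are assumptions for illustration, and no SciXML or CML is produced.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Office Open XML documents
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_bytes):
    """Return the text of each non-empty paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def experimental_section(paragraphs, heading="Experimental"):
    """Hypothetical heuristic: everything after the first paragraph equal to `heading`."""
    for i, text in enumerate(paragraphs):
        if text.strip().lower() == heading.lower():
            return paragraphs[i + 1:]
    return []


def make_demo_docx():
    """Build a tiny stand-in .docx in memory (document.xml only) for the demo."""
    texts = ["Introduction", "Background text.", "Experimental",
             "Compound 1 was stirred in ethanol."]
    body = "".join(f"<w:p><w:r><w:t>{t}</w:t></w:r></w:p>" for t in texts)
    document = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
    return buffer.getvalue()


demo_paragraphs = docx_paragraphs(make_demo_docx())
demo_experimental = experimental_section(demo_paragraphs)
```

Because paragraphs arrive as explicit elements, none of the line-break repair needed for PDF text is required before a section such as "Experimental" can be located.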

Workflows

PDF workflow

[Diagram: PDF flow]
THESIS → input PDF document (text) → SPECTRa-T text processing tools → SciXML → OSCAR3 → SAFXML → XSL stylesheet → RDF → Triplestore (NCEs) → Query

Processing of PDF e-theses to yield named chemical entities in a queryable RDF triplestore
(Text and lines in red indicate SPECTRa-T tools)

.docx workflow

[Diagram: DOCX flow]
THESIS → input .docx document (XML markup) → SPECTRa-T text processing tools → SciXML
SciXML → OSCAR3 → SAFXML → XSL stylesheet → RDF → Triplestore (NCEs)
SciXML → OSCAR3 → Data XML → CML Chemical Objects → Data Repository (COs)
Create URI → add URI link between the RDF and the CML Chemical Objects
Triplestore (NCEs) + Data Repository (COs) → Semantic Query

Processing of DOCX e-theses to yield named chemical entities and linked chemical objects in a semantically queryable linked RDF triplestore and data repository
(Text and lines in red indicate SPECTRa-T tools)
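
The linking idea in this workflow can be sketched as a join on shared URIs. Everything below is assumed for illustration: the `example.org` URIs, the `mentionsEntity` and `describesObject` property names, the placeholder CML fragments and the `objects_for_entity` helper are not taken from the project.

```python
BASE = "http://example.org/thesis/123/"  # hypothetical URI space for one thesis

# Triplestore side: NCE statements, plus a link from each passage to a chemical object URI
triplestore = [
    (BASE + "para/7", "mentionsEntity", "recrystallisation"),
    (BASE + "para/7", "describesObject", BASE + "co/1"),
    (BASE + "para/9", "mentionsEntity", "NMR"),
    (BASE + "para/9", "describesObject", BASE + "co/2"),
]

# Repository side: CML documents keyed by the same URIs (placeholder fragments)
repository = {
    BASE + "co/1": '<molecule xmlns="http://www.xml-cml.org/schema" id="m1"/>',
    BASE + "co/2": '<spectrum xmlns="http://www.xml-cml.org/schema" id="s1"/>',
}


def objects_for_entity(entity):
    """Find passages mentioning `entity`, then fetch the chemical objects they link to."""
    passages = {s for s, p, o in triplestore if p == "mentionsEntity" and o == entity}
    uris = [o for s, p, o in triplestore if p == "describesObject" and s in passages]
    return {uri: repository[uri] for uri in uris if uri in repository}


nmr_hits = objects_for_entity("NMR")
```

The query starts from a named chemical entity in the triplestore and ends with chemical objects from the repository; the URI is the only thing the two stores need to share.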

Further thoughts

Caveats

• SPECTRa-T a proof-of-concept approach
• restricted to a few chemistry sub-disciplines
• investigated only 2 file formats
• dangerous to generalise too far

• but our specific observations raise questions about broader implications ...

File formats

• PDF has some value for text-mining
• born-digital PDF is better than OCR-derived
• PDF/A will resolve some problems
• but both still contain broken text and unreliable structure for text-mining
  – (and most legacy material is still only in PDF)

• XML better at providing structured documents for text-mining
  – (and may be good for preservation as well)

Role of institutional repository

• preservation versus re-usability?

• should a central IR require both PDF and Word/XML versions of a thesis?

• which file format(s) should be openly accessible?
  – cf. UKPMC XML policy for research papers

• should subject data be held in subject-specific data repositories managed by domain experts?

• can subject-based departmental repositories co-exist with a central IR?

• how can librarians and repository managers understand researchers’ needs?

IPR

• institutions can best realise the value of their research data assets by encouraging their discovery

• facts cannot be copyrighted

• derived data and databases raise complex legal issues

• ownership and licensing issues need urgent clarification

Fit for purpose?

• need to be clear why we collect theses
• are they intended to be fully re-usable?
• what does this entail for each subject?
• do librarians understand researchers?
• do thesis regulations ensure appropriate formats and submission processes?
• do IPR policies facilitate re-use?

• in short, are our e-theses fit for purpose?

Thanks...

• thanks to my colleagues on the Project team
  – at Cambridge: Jim Downing, Peter Murray-Rust, Diana Stewart, Alan Tonge, Joe Townsend
  – at Imperial College London: Matt Harvey, Henry Rzepa

• thanks to the Joint Information Systems Committee (JISC) for funding the project (see www.lib.cam.ac.uk/spectra-t for Final Report)

... and thanks to you for listening!