ETD 2008
Extracting and re-using research data from chemistry e-theses: the SPECTRa-T project
Peter Morgan, SPECTRa-T Project Director, Head of Medical and Science Libraries, Cambridge University Library
www.lib.cam.ac.uk/spectra-t/
• research chemists produce experimental data (materials, reactions, properties = “recipes”)
• these data are the basis of further research
• theses are a rich source of data
– c.10k chemistry papers p.a. worldwide
– a typical thesis contains 50-60 preparations
– 20% will be published in research papers
– 80% are not published
• OSCAR3 (Open Source Chemistry Analysis Routines) as text-mining tool
– developed by SciBorg Project (Cambridge)
– natural language processing to identify chemical terms
– converts human-readable text into XML marked-up content that machines can manipulate
– prefers SciXML documents
– uses ChEBI Ontology for chemical name
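The idea of turning human-readable text into XML mark-up can be illustrated with a minimal sketch. This is not OSCAR3's actual algorithm or output format: the tiny lexicon, the `<s>` and `<ne>` element names, and the `type` attribute are all assumptions made for illustration, and simple dictionary matching stands in for real natural language processing.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mini-lexicon; a real tool would draw on a resource such as ChEBI.
LEXICON = {"benzaldehyde", "ethanol", "sodium borohydride"}


def tag_chemical_terms(sentence, lexicon=LEXICON):
    """Return an <s> element in which known chemical names are wrapped in <ne>."""
    # Try longer names first so multi-word names win over shorter overlaps.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    root = ET.Element("s")
    last = None  # most recently created <ne> element
    pos = 0
    for m in pattern.finditer(sentence):
        between = sentence[pos:m.start()]
        if last is None:
            root.text = between   # text before the first tagged name
        else:
            last.tail = between   # text between two tagged names
        last = ET.SubElement(root, "ne", {"type": "chemical"})
        last.text = m.group(0)
        pos = m.end()
    rest = sentence[pos:]
    if last is None:
        root.text = rest
    else:
        last.tail = rest
    return root


element = tag_chemical_terms(
    "Benzaldehyde was reduced with sodium borohydride in ethanol."
)
print(ET.tostring(element, encoding="unicode"))
```

Once names are wrapped in elements, a program can list or count them without re-parsing the prose, which is the sense in which marked-up content is something "machines can manipulate".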
• wraps text in simple high-level elements
• is optimized for human, not machine, readability
• produces poor SciXML
– line breaks = loss of continuous text and paragraph structures
– chemical drawings replaced by text and disconnected lines
– loss of subscript and superscript characters
– non-printing characters
– OCR-derived text produces erroneous character
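Two of the problems listed above, hard line breaks and non-printing characters, can be partly mitigated with simple heuristics. The sketch below is an assumption about how such clean-up might look, not a description of what the project did; the function name is hypothetical. It cannot recover lost subscripts, superscripts or chemical drawings.

```python
import re


def rejoin_extracted_lines(raw):
    """Heuristically rebuild paragraphs from text extracted with hard line breaks."""
    # Normalise line endings so that only "\n" marks a line break.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Drop non-printing control characters, keeping newline and tab.
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)
    # Assume a blank line separates paragraphs; single breaks are layout only.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse the remaining line breaks and runs of spaces into single spaces.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


sample = "The aldehyde was dissolved\nin dry THF and\x0c stirred.\n\nYield: 85%.\n"
for paragraph in rejoin_extracted_lines(sample):
    print(paragraph)
```

The blank-line assumption fails whenever the extraction drops paragraph spacing altogether, which is one reason the resulting structure remains unreliable.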
• Word theses converted to Office Open XML (.docx) using MS Word 2007
• XML is converted into rich SciXML
• SciXML structure enables OSCAR3 to identify “Experimental” sections and extract Chemical Objects
• XML converted to CML (Chemical Markup Language)
• URIs assigned to CO metadata & associated with NCEs
• CML COs deposited in lightweight data repository
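The first steps of this pipeline can be sketched using only the Python standard library. A .docx file is a ZIP archive whose main body is stored as `word/document.xml`, so paragraph text can be read without MS Word. The function names and the heading heuristic below are assumptions for illustration; they are not the project's code, and the conversion to SciXML and CML is not shown.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Office Open XML documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_file):
    """Return the plain text of each non-empty paragraph in a .docx file."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def experimental_section(paragraphs, heading="Experimental"):
    """Return the paragraphs after the first one that starts with `heading`.

    A crude heuristic: a structured format would instead mark headings
    explicitly, and this version runs to the end of the document.
    """
    for i, text in enumerate(paragraphs):
        if text.strip().lower().startswith(heading.lower()):
            return paragraphs[i + 1:]
    return []
```

Calling `experimental_section(docx_paragraphs("thesis.docx"))` on a hypothetical file would return the candidate text from which preparations could then be extracted.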
• RDF triplestore and CO data repository, linked by URIs, can now be queried semantically
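What "linked by URIs" and "queried semantically" mean can be shown with a toy in-memory triple store. Everything here is hypothetical: the URIs, the predicate names and the helper functions are invented for illustration, and a production system would more likely use a real triplestore with a query language such as SPARQL.

```python
# Hypothetical (subject, predicate, object) triples, invented for illustration.
TRIPLES = {
    ("urn:example:thesis/1", "ex:describes", "urn:example:compound/42"),
    ("urn:example:compound/42", "ex:hasName", "benzyl alcohol"),
    ("urn:example:compound/42", "ex:hasData", "urn:example:repo/co/42.cml"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def data_for_thesis(triples, thesis_uri):
    """Follow thesis -> compound -> data-file links through shared URIs."""
    results = []
    for _, _, compound in match(triples, s=thesis_uri, p="ex:describes"):
        for _, _, data_uri in match(triples, s=compound, p="ex:hasData"):
            results.append((compound, data_uri))
    return results


print(data_for_thesis(TRIPLES, "urn:example:thesis/1"))
```

The compound's URI appears both as an object (of the thesis) and as a subject (of its data file), and that shared identifier is what lets the metadata store and the data repository be traversed as one graph.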
• SPECTRa-T a proof-of-concept approach
• restricted to a few chemistry sub-disciplines
• investigated only 2 file formats
• dangerous to generalise too far
• but our specific observations raise questions about broader implications ...
• PDF has some value for text-mining
• born-digital PDF is better than OCR-derived
• PDF/A will resolve some problems
• but both still contain broken text and unreliable structure for text-mining
– (and most legacy material is still only in PDF)
• XML better at providing structured documents for text-mining
– (and may be good for preservation as well)
• need to be clear why we collect theses
• are they intended to be fully re-usable?
• what does this entail for each subject?
• do librarians understand researchers?
• do thesis regulations ensure appropriate formats and submission processes?
• do IPR policies facilitate re-use?