SemanticCamp London, 16th February 2008

Automatically indexing science using natural-language processing, RDF and SPARQL

Andrew Walkingshaw, Nick Day, Peter Corbett, Jim Downing, Joe Townsend, Peter Murray-Rust

Gathering data

Extracting (meta)data

Using the data

Thanks




Data sources

• Supplemental and experimental data

• Journals

• Self-archived papers (e.g. arXiv)

• Mainstream journalism

• Blogs




Supplemental data: CrystalEye

• http://wwmm.ch.cam.ac.uk/crystaleye/

• Repository for crystallographic data




Journals and arXiv

• “Traditional” journal articles

• Titles and abstracts...




Journalism and blogs

• Unstructured text with little semantics;

• ... hence Google Scholar, Web of Science, etc.




Semi-structured data: Golem

• We’ve got a lot of chemical data as CML

• http://en.wikipedia.org/wiki/Chemical_Markup_Language

• ... but we still need to get data out of that and into a more useful form

• hence Golem: http://www.lexical.org.uk/science/golem/

• GRDDLish strategy for extracting data from CML files: identify dialect-specific concepts with XPath expressions and XSLT stylesheets

• upshot: we can extract JSON objects from CML files.
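As a rough illustration of the XPath idea (this is not Golem's actual API), a Python sketch might map concept names to XPath expressions and dump whatever matches as JSON. The `CONCEPTS` table, the `dictRef` values and the sample CML fragment below are assumptions made up for the example.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by CML documents.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical concept table for one CML dialect: concept name -> XPath.
CONCEPTS = {
    "cell_length_a": ".//cml:scalar[@dictRef='iucr:_cell_length_a']",
    "cell_length_b": ".//cml:scalar[@dictRef='iucr:_cell_length_b']",
}

# Made-up CML fragment, only here so the sketch can be run.
SAMPLE_CML = """<cml xmlns="http://www.xml-cml.org/schema">
  <crystal>
    <scalar dictRef="iucr:_cell_length_a" units="units:angstrom">5.431</scalar>
    <scalar dictRef="iucr:_cell_length_b" units="units:angstrom">5.431</scalar>
  </crystal>
</cml>"""


def extract_concepts(cml_text, concepts=CONCEPTS):
    """Return a JSON-ready dict of the concepts found in a CML document."""
    root = ET.fromstring(cml_text)
    result = {}
    for name, xpath in concepts.items():
        node = root.find(xpath, CML_NS)
        if node is not None and node.text:
            result[name] = {
                "value": float(node.text),
                "units": node.get("units"),
            }
    return result


if __name__ == "__main__":
    print(json.dumps(extract_concepts(SAMPLE_CML), indent=2))
```

Keeping the XPath expressions in a table is what makes the approach dialect-specific: a different CML dialect would only need a different table.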




Free text: OSCAR3

• http://oscar3-chem.sourceforge.net/

• Natural-language parser for documents about chemistry

• Dark magic: don’t ask me how it works!

• ... but it can be run as a Jetty web service, so as long as it does, I’m happy

• Author’s blog: http://wwmm.ch.cam.ac.uk/blogs/corbett/




Getting the data in

• Everything (more or less) talks RSS nowadays...

• RSS 0.91, RSS 1.0 (which one?), Atom, etc etc etc.

• Thankfully: feedparser (http://feedparser.org/)
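A minimal sketch of what that buys you, assuming feedparser is installed: `feedparser.parse` normalises the various RSS and Atom flavours into entries with the same keys, so one small helper can pull the free text out of any of them. The helper names `entry_text` and `fetch_entry_texts` are made up for this example.

```python
def entry_text(entry):
    """Join the title and summary of a feed entry (any dict-like object)."""
    parts = [entry.get("title", ""), entry.get("summary", "")]
    return "\n".join(p for p in parts if p)


def fetch_entry_texts(feed_url):
    """Return (link, text) pairs for every entry in the feed at feed_url."""
    # Third-party dependency, imported here so entry_text works without it.
    import feedparser

    parsed = feedparser.parse(feed_url)
    return [(e.get("link", ""), entry_text(e)) for e in parsed.entries]
```

Calling `fetch_entry_texts` with the URL of a journal or blog feed should give back text that is ready to hand to a parser, whatever format the feed was published in.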




Serializing metadata

• RDF – using:

• Dublin Core terms

• A homebrew ontology based on the IUCr’s CIF data format

• and another homebrew ontology for OSCAR annotations

• (it’d be good to standardise these, but to be honest, not many people are doing this sort of thing)
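To make the shape of the metadata concrete, here is a small standard-library sketch that writes N-Triples by hand. The Dublin Core namespace and the CrystalEye dictionary namespace are the ones used elsewhere in these slides; the property name `cell_length_a` and the example URI are assumptions, and a real pipeline would more likely build the graph with rdflib.

```python
DC = "http://purl.org/dc/terms/"
# Homebrew ontology namespace; the property names used with it are hypothetical.
CE = "http://wwmm.ch.cam.ac.uk/crystaleye/dictionary#"


def literal(value):
    """Serialise a value as a plain (untyped) N-Triples literal."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"%s"' % escaped


def triple(subject, predicate, obj):
    """Format one N-Triples line; obj must already be serialised."""
    return "<%s> <%s> %s ." % (subject, predicate, obj)


def describe_entry(uri, title, contributors, observables):
    """Return N-Triples describing one data file or article."""
    lines = [triple(uri, DC + "title", literal(title))]
    for name in contributors:
        lines.append(triple(uri, DC + "contributor", literal(name)))
    for key, value in sorted(observables.items()):
        lines.append(triple(uri, CE + key, literal(value)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_entry(
        "http://example.org/data/1",   # hypothetical subject URI
        "An example structure",
        ["A. Author"],
        {"cell_length_a": 5.431},
    ))
```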




The process

• For each feed in a list of feeds:

• If it’s supplying CML data, set Golem on each entry, get the observables out, and turn them into triples; run OSCAR3 over the title and/or abstract

• If it’s not, extract the free text from each entry, send it to the OSCAR web service, and assign triples based on the chemical entities OSCAR finds

• Upload the RDF to your triple store

• (I’m using the Talis platform, so that’s just curl)

• And...
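The loop above can be sketched in a few lines of Python. This is only an outline under stated assumptions: `is_cml`, `extract_observables` and `find_entities` are hypothetical callables standing in for feed detection, Golem and the OSCAR web service, and the `"mentions"` predicate is a placeholder rather than a term from any real ontology.

```python
def index_entries(entries, is_cml, extract_observables, find_entities):
    """Turn feed entries into (subject, predicate, object) triples.

    entries: dict-like feed entries with at least a "link" key.
    is_cml, extract_observables, find_entities: caller-supplied stand-ins
    for feed detection, Golem and OSCAR respectively.
    """
    triples = []
    for entry in entries:
        subject = entry["link"]
        if is_cml(entry):
            # Structured route: observables pulled out of the CML.
            for name, value in sorted(extract_observables(entry).items()):
                triples.append((subject, name, value))
        # Free-text route, applied to every entry: title and/or abstract.
        parts = (entry.get("title", ""), entry.get("summary", ""))
        text = "\n".join(p for p in parts if p)
        for entity in find_entities(text):
            triples.append((subject, "mentions", entity))
    return triples
```

Passing the three steps in as functions keeps the loop itself independent of how Golem and OSCAR are actually invoked, which also makes it easy to try out with stubs.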




SPARQL is great.

Just post queries at a SPARQL endpoint:

authortemplate = '''PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ce: <http://wwmm.ch.cam.ac.uk/crystaleye/dictionary#>
DESCRIBE ?file WHERE { ?file dc:contributor "some author" . }'''
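For what "posting" amounts to, here is a standard-library sketch that follows the SPARQL protocol convention of sending the query as a form-encoded `query` parameter. The endpoint URL is a placeholder and the function names are invented for the example; it is not the code behind the demo.

```python
import urllib.parse
import urllib.request

AUTHOR_TEMPLATE = """PREFIX dc: <http://purl.org/dc/terms/>
DESCRIBE ?file WHERE { ?file dc:contributor "%s" . }"""


def escape_literal(text):
    """Escape backslashes and double quotes for use inside a SPARQL string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_request(endpoint, author):
    """Build (but do not send) a POST request carrying the author query."""
    query = AUTHOR_TEMPLATE % escape_literal(author)
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/rdf+xml",
    }
    return urllib.request.Request(endpoint, data=body, headers=headers)


def run_query(endpoint, author):
    """Send the query and return the response body as text."""
    with urllib.request.urlopen(build_request(endpoint, author)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder endpoint: substitute the address of a real SPARQL service.
    request = build_request("http://example.org/sparql", "A. Author")
    print(request.get_method(), request.full_url)
```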




SPARQL isn’t (entirely) great.

• Scientists shouldn’t have to know this stuff.

• So we need to build a front end which your average senior academic might be able to use...

• (i.e. it’s got to look like a website.)




What queries do we want?

• What experimental data is an author responsible for?

• What chemical entities are in some data?

• Where is a given chemical entity talked about?

• So we can build a web app around these queries.

• Django + rdflib + SPARQL + Talis Platform
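One way to picture the three questions is as three query templates that a web app fills in from a form. The sketch below is an assumption throughout: the `ex:` namespace and its `mentions` property are placeholders for whatever the OSCAR annotation ontology really calls them, and the function names are invented.

```python
PREFIXES = """PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ex: <http://example.org/oscar#>
"""


def quote(text):
    """Return text as a double-quoted, escaped SPARQL string literal."""
    return '"%s"' % text.replace("\\", "\\\\").replace('"', '\\"')


def data_by_author(author):
    """What experimental data is an author responsible for?"""
    return PREFIXES + "SELECT ?file WHERE { ?file dc:contributor %s . }" % quote(author)


def entities_in(file_uri):
    """What chemical entities are in some data? (ex:mentions is hypothetical)"""
    return PREFIXES + "SELECT ?entity WHERE { <%s> ex:mentions ?entity . }" % file_uri


def mentions_of(entity):
    """Where is a given chemical entity talked about?"""
    return PREFIXES + "SELECT ?file WHERE { ?file ex:mentions %s . }" % quote(entity)


if __name__ == "__main__":
    print(data_by_author("A. Author"))
```

The point of wrapping each query in a function is that the person using the site only ever supplies a name or a URI and never sees the SPARQL.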




Demo!

And here it is.




Thanks to...

• Talis (http://n2.talis.com/) for access to their platform

• and to the RSC and IUCr for their support of CrystalEye.



