The PIMMS project and Natural Language Processing for Climate Science

Extending the Chemical Tagger natural language processing tool with climate science controlled vocabularies

Charlotte Pascoe, Hannah Barjat, Peter Murray-Rust and Gerry Devine

June 9th 2012, Open Repositories 2012

Portable Infrastructure for the Metafor Metadata System

http://proj.badc.rl.ac.uk/pimms/


• Shared: some concepts are shared.

• Quality: we can record the quality of things.

• ISO: we reuse various ISO classes.

• Data: we can talk about DataObjects collected together in any number of ways, stored in a particular medium.

• Software: we can talk about hierarchical ModelComponents with ModelProperties, some of which can be coupled together.

• Activity: we can talk about Simulations run in support of Experiments. Experiments consist of Requirements; Simulations conform to Requirements.

• Grids: we can define a GridSpec or some other geometry.

• A particular Activity uses a particular SoftwareComponent.

Common Information Model
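The relationships listed on this slide can be sketched as plain Python data classes. This is only an illustration and not the CIM schema itself: the class names follow the slide, but every field and method name (for example `walk`, `supports`, `conforms_to`, `unmet_requirements`) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelProperty:
    name: str
    value: str


@dataclass
class ModelComponent:
    """A hierarchical component: it may contain child components."""
    name: str
    properties: List[ModelProperty] = field(default_factory=list)
    children: List["ModelComponent"] = field(default_factory=list)

    def walk(self):
        """Yield this component and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Requirement:
    name: str


@dataclass
class Experiment:
    """Experiments consist of Requirements."""
    name: str
    requirements: List[Requirement] = field(default_factory=list)


@dataclass
class Simulation:
    """A Simulation is run in support of an Experiment and conforms to Requirements."""
    name: str
    supports: Experiment
    conforms_to: List[Requirement] = field(default_factory=list)

    def unmet_requirements(self):
        """Requirements of the experiment that this simulation does not conform to."""
        return [r for r in self.supports.requirements if r not in self.conforms_to]
```

A Radiation component with a Longwave child, for instance, would be `ModelComponent("Radiation", children=[ModelComponent("Longwave")])`.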


Mind maps are used to capture information requirements from domain experts and build a controlled vocabulary.

Mind Maps

<component name="Radiation">
  <definition status="missing">Definition of component type Radiation required</definition>
  <parameter name="RadiativeTimeStep" choice="keyboard">
    <definition status="missing">Definition of property name RadiativeTimeStep required</definition>
    <value format="numerical" name="time step" units="time units"/>
  </parameter>
  <parametergroup name="Longwave">
    <parameter name="SchemeType" choice="XOR">
      <definition status="missing">Definition of property name SchemeType required</definition>
      <value name="Wide-band model"/>
      <value name="Wide-band (Morcrette)"/>
      <value name="K-correlated"/>
      <value name="K-correlated (RRTM)"/>
      <value name="other"/>
    </parameter>
    <parameter name="Method" choice="XOR">
      <definition status="missing">Definition of property name Method required</definition>
      <value name="Two stream"/>
      <value name="Layer interaction"/>
      <value name="other"/>
    </parameter>
    <parameter name="NumberOfSpectralIntervals" choice="keyboard">
      <definition status="missing">Definition of property name NumberOfSpectralIntervals required</definition>
      <value format="numerical" name=""/>
    </parameter>
  </parametergroup>

The Python parser processes the XML files generated from the mind maps

Python Parser
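As a rough idea of what processing this XML might involve, the sketch below reads a component and collects each parameter's choice type and allowed values. It is not the PIMMS parser: the function names and the returned structure are assumptions, and `SAMPLE` is an abbreviated copy of the Radiation excerpt with a closing `</component>` added so that it parses.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the Radiation excerpt, closed so that it is well-formed.
SAMPLE = """
<component name="Radiation">
  <parameter name="RadiativeTimeStep" choice="keyboard">
    <value format="numerical" name="time step" units="time units"/>
  </parameter>
  <parametergroup name="Longwave">
    <parameter name="SchemeType" choice="XOR">
      <value name="Wide-band model"/>
      <value name="K-correlated"/>
      <value name="other"/>
    </parameter>
  </parametergroup>
</component>
"""


def _describe(param):
    """Return (choice type, list of value names) for one <parameter>."""
    values = [v.get("name") for v in param.findall("value")]
    return param.get("choice"), values


def extract_vocabulary(xml_text):
    """Return {(group, parameter): (choice, [value names])} for one component."""
    root = ET.fromstring(xml_text)
    vocab = {}
    # Parameters directly under the component have no group.
    for param in root.findall("parameter"):
        vocab[(None, param.get("name"))] = _describe(param)
    # Parameters nested one level down inside a <parametergroup>.
    for group in root.findall("parametergroup"):
        for param in group.findall("parameter"):
            vocab[(group.get("name"), param.get("name"))] = _describe(param)
    return vocab
```

Here `extract_vocabulary(SAMPLE)[("Longwave", "SchemeType")]` gives the choice type `"XOR"` together with the list of permitted scheme names, which is the kind of information a web form needs to build a drop-down.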

Web Forms

Web forms generate content in CIM XML format

http://q.cmip5.ceda.ac.uk/

CIM Viewer

http://zonda5.badc.rl.ac.uk/site/public/tools/viewer/integrated/1.5/en/73c59aba-dc6d-11df-a442-00163e9152a5/1

http://chemicaltagger.ch.cam.ac.uk/

ChemicalTagger is an open-source tool that uses OSCAR4 and NLP techniques for tagging and parsing experimental sections in the chemistry literature.

Chemical Tagger


• Java project developed by the Peter Murray-Rust group, Cambridge. Online demo: http://chemicaltagger.ch.cam.ac.uk/

• Adapted for use with ACP Abstracts (Lezan Hawizy and Hannah Barjat).

– Modification by use of dictionaries and changes to grammar.

– First use case outside of laboratory chemistry.

– Still with a significant chemistry component.

– Wider physical science.


Chemical Tagger https://bitbucket.org/wwmm/chemicaltagger & https://bitbucket.org/wwmm/acpgeo

• Open Source NLP tool for processing chemical text

• Combines Chemical Entity Recognitions (OSCAR) with NLP techniques

• Extendible and Reconfigurable Taggers and Parsers generated using ANTLR (ANother Tool for Language Recognition)





• To extend Chemical Tagger to be more suited to climate modelling.

– Specifically:

• Palaeoclimate modelling, and how the process of text mining might differ from development of a controlled vocabulary.

• Highlighting of text for comparison with CIM documents.

• Initially only using XML abstracts, e.g. from EGU’s Geoscientific Model Development and Climate of the Past.

– Brief look at PDF to text.

Chemical Tagger & PIMMS

• Time periods and climatic events

– Includes named Ages, Epochs, Eras etc. [including all those in a mind map produced for the PIMMS project at Bristol].

– Context of proper nouns, e.g. with words such as ‘period’, ‘era’, ‘epoch’.

– Numbers with appropriate units, e.g. Mya, yr BP.

– Likely date numbers, e.g. 1750 AD.

– Known acronyms, e.g. ‘LGM’ [acronyms in context have not been investigated].

– Related adjectives, e.g. seasonal, decadal, glacial, interglacial, stadial, interstadial, maximum, minimum, where used as proper nouns.

• Palaeoclimate Models

– Can guess model names from context

• e.g. proper noun or acronym followed by ‘model’

• e.g. reconstruction / simulation with XXX

– Can develop/use glossary of model names.

• Palaeoclimate Acronyms

– Time periods and models.

– Theories, techniques, physical and chemical parameters?

– Can develop/use glossary of acronyms. Problem area: often not unique even within subject.

Paleoclimate Language
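The time-period rules on this slide (named periods with context words, numbers with units, likely dates) can be illustrated with a few regular expressions. This is a stand-in sketch only: Chemical Tagger itself uses dictionaries and an ANTLR grammar, not these patterns, and `PERIOD_NAMES`, `CONTEXT_WORDS`, `PATTERNS` and `tag_time_phrases` are names invented for the example. The period list is a tiny illustrative subset.

```python
import re

# Illustrative subset only; a real list would be far longer.
PERIOD_NAMES = ["Holocene", "Pleistocene", "Eocene"]
CONTEXT_WORDS = ["period", "era", "epoch"]

PATTERNS = [
    # Numbers with palaeo time units, e.g. "8 kyr BP", "55 Mya".
    ("age", re.compile(r"\b\d+(?:\.\d+)?\s?(?:Mya|kyr BP|yr BP)(?![A-Za-z])")),
    # Likely calendar dates, e.g. "1750 AD".
    ("date", re.compile(r"\b\d{3,4}\s?(?:AD|BC)\b")),
    # Named periods, optionally followed by a context word.
    ("period", re.compile(
        r"\b(?:%s)(?:\s(?:%s))?\b"
        % ("|".join(PERIOD_NAMES), "|".join(CONTEXT_WORDS))
    )),
]


def tag_time_phrases(text):
    """Return (label, matched text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            hits.append((m.start(), label, m.group(0)))
    return [(label, s) for _, label, s in sorted(hits)]
```

Patterns like these cannot tell a study date from a citation year, which is one reason context matters in the results that follow.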


• Quick compilation of proper nouns used for time periods (primarily from Wikipedia) contains 185 words.

– Use of these words together with adjectives / dates / details of events would produce a very large number of phrases.

• Controlled Vocabulary from Bristol contains around 24 of these.

• Use of these words together with other proper nouns / adjectives / dates gives only 44 phrases within the Bristol CV.

• Map natural language to CV?

– Straightforward for most dates?

– Understanding of context important

• Does context refer to main emphasis of paper?

• Is an event/time period described unambiguously? e.g. “Last Glacial

Natural Language vs CV

Tag / Tags | Example | Comment

<timePhrase><PALAEOTIME> | (i) Holocene, (ii) 8 kyr BP (iii) |

<referencePhrase> | (i) (Otto et al. 2009b) (ii) Giraudeau et al. 2000 | Important to distinguish year pattern from dates relevant to the study.

<locationPhrase> | (i) around Lake Kotokel, (ii) over Tibetan Plateau | False positives: e.g. “from Sphagnum”

<LOCATION> | (i) 52°47´ N, 108°07´ E, 458 m a.s.l (ii) London. | Cannot currently do degrees from pdf-text.

<TempPhrase> | | ‘warm’ and ‘cool’: verbs in synthetic chem unlike env. chem.

Preliminary Results (from 68 files)

Tag / Tags | Example | Numbers found

<CAMPAIGN> | (i) PMIP, (ii) PANASH | Less relevant here than to ACP in general

<MODEL> | (i) REVEALS model, (ii) ECBILT-CLIO intermediate complexity climate model |

<acronymPhrase> | (i) Modern Analogues Technique ( MAT ) (ii) REVEALS ( Regional Estimates of VEgetation Abundance from Large Sites ) | May pick up campaigns / models where phrases above have failed.

<QUANTITY> | (i) 10 ppm (ii) 0.53 mm/day | Units dictionary could be more extensive

<MOLECULE> | (i) CO2, (ii) calcium carbonate | Many false positives, since molecules are what Chemical Tagger was designed to find.


XML rendered with CSS

Chemical Tagger Rendering of PALEOTIME

http://www.clim-past.net/2/205/2006/cp-2-205-2006.html


GMD Journal Article

http://www.geosci-model-dev.net/4/1035/2011/gmd-4-1035-2011.html

The acronym / name MIROC4 is not explained – so reproduce sentence

The description is just the first few sentences after the appearance of <MODEL>
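The rule of taking the first few sentences from where a model name appears might be sketched as follows. This is an assumption-laden illustration, not the ACPgeo code: `describe_model` is a hypothetical helper, it works on plain text rather than tagged XML, and its sentence splitter is deliberately naive.

```python
import re


def describe_model(text, model_name, n_sentences=2):
    """Return the sentence that first mentions model_name plus the ones after it."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace;
    # abbreviations such as "e.g." or "et al." will be split wrongly.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for i, sentence in enumerate(sentences):
        if model_name in sentence:
            return " ".join(sentences[i:i + n_sentences])
    return None
```

Returning `None` when the name never appears keeps "not found" distinct from an empty description.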

CIM Document Viewer

Makes use of existing chemical tagging.

CIM Document Viewer

http://zonda5.badc.rl.ac.uk/site/public/repository

The number of spectral intervals was not found! The CIM has no place for “not found”.

CIM Document Viewer

• Unless the paper is specifically about the model, we are unlikely to find much Metafor-type CV in the abstract

– Look at experimental / methods sections

• model name

• model resolution

• model schemes

– Problem with PDF -> text.

– Only certain elements easy to extract (e.g. resolution)

Climate Models – General Constraints


• Add a few more phrases e.g. specific phrases to look for model resolution, using expected vocabulary (e.g. grid, levels, resolution, directions etc).

• Refine output of ACPgeo to look for specific CV terms.

• Try to put CV terms in context:

– Look for proximity of CV terms to other phrases:

• Within phrase; within sentence or within a number of sentences

Refine ACPgeo Output
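The proximity idea can be illustrated with a sentence-window search. The sketch below is not part of ACPgeo: `cooccurrences` is a hypothetical function, matching is by simple case-insensitive substring, and sentences are split naively on end punctuation.

```python
import re


def cooccurrences(text, cv_terms, context_terms, window=0):
    """Find CV terms that appear within `window` sentences of a context term.

    window=0 means the same sentence; window=1 also allows adjacent sentences.
    Returns a sorted list of (cv_term, context_term, sentence_distance) tuples.
    """
    sentences = [s.lower() for s in re.split(r"(?<=[.!?])\s+", text.strip())]
    found = set()
    for i, sentence in enumerate(sentences):
        for cv in cv_terms:
            if cv.lower() not in sentence:
                continue
            # Inspect the sentences inside the window around sentence i.
            lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
            for j in range(lo, hi):
                for ctx in context_terms:
                    if ctx.lower() in sentences[j]:
                        found.add((cv, ctx, abs(i - j)))
    return sorted(found)
```

Widening `window` trades precision for recall: a scheme name two sentences away from "longwave" may or may not describe the longwave scheme.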

– Chemical Tagger was designed to be used primarily with chemistry.

• Unsurprising that there is a tendency to assign acronyms, hyphenated words, and words with common chemical endings as molecules.

– It is possible to filter some of these wrongly assigned words by probability.

– There are still conflicts e.g. C3 and C4 could refer to hydrocarbons or plants.

• Extensive testing and modifying / machine learning might reduce these.

– Better to get right first time if important!

<MOLECULE>
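Filtering by probability could look something like the sketch below. It assumes the tagger output can be reduced to (text, confidence) pairs; that pairing, the threshold value, the stoplist and the scores in the usage note are all assumptions for illustration, not Chemical Tagger's actual interface or output.

```python
def filter_by_confidence(candidates, threshold=0.5, stoplist=()):
    """Keep (text, confidence) pairs at or above threshold and not on the stoplist."""
    blocked = {s.lower() for s in stoplist}
    return [
        (text, conf)
        for text, conf in candidates
        if conf >= threshold and text.lower() not in blocked
    ]
```

With made-up scores such as `[("CO2", 0.95), ("PMIP", 0.2), ("C4", 0.6)]`, a threshold of 0.5 drops the campaign acronym but keeps C4. An ambiguous term like that needs a domain stoplist or context, since a score alone cannot tell hydrocarbons from plants.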

CIM was designed to be populated by modellers with the (probably over-simplistic) assumption that if something isn't in the CIM document then it either isn't in the model or isn't relevant. But CIM documents created by harvesting information from papers will naturally not cover everything about a model, so missing information doesn't mean that those things weren't included or aren't relevant.

PIMMS will need to describe different protocols for interpreting CIM documents depending on how they were created, but we will also want to ensure that CIM accounts for missing data more intelligently in future releases.
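One way to picture such a protocol is a lookup whose answer depends on how the document was created. This is a hypothetical sketch, not a PIMMS or CIM design: `Provenance`, `Status` and `interpret` are invented names, and a CIM document is reduced to a plain dictionary of properties.

```python
from enum import Enum


class Provenance(Enum):
    MODELLER = "entered by a modeller"
    HARVESTED = "harvested from a paper"


class Status(Enum):
    PRESENT = "present"
    ABSENT = "absent or not relevant"
    UNKNOWN = "not found in source"


def interpret(document, property_name, provenance):
    """Interpret a property lookup according to how the document was created."""
    if property_name in document:
        return Status.PRESENT
    # A modeller-populated document is assumed complete, so a gap means absence;
    # a harvested document only reflects what the paper happened to mention.
    if provenance is Provenance.MODELLER:
        return Status.ABSENT
    return Status.UNKNOWN
```

A harvested document lacking `NumberOfSpectralIntervals` would then report `Status.UNKNOWN` rather than implying that the model has no spectral intervals.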

In essence the difference between journal article descriptions and metadata documentation is Narrative. Journal articles need to tell a story so the information they include is only that which is relevant to the narrative, whereas metadata documentation is an attempt to include as much as possible across the board. The general nature of metadata documentation is probably why it has historically been perceived as such a boring task to complete.

PIMMS will make metadata documentation more fun by bringing back the Narrative. Once PIMMS is established at an institution, users will be able to create generalised metadata having only described those things that are relevant to the story of their experiment.

Harvested Metadata vs Documented Metadata

http://proj.badc.rl.ac.uk/pimms/blog/
