Facilitating the development of controlled vocabularies for metabolomics with text mining

I. Spasić,1 D. Schober,2 S. Sansone,2 D. Rebholz-Schuhmann,2 D. Kell,1 N. Paton1 and the MSI Ontology Working Group Members3

1 MCISB http://www.mcisb.org

2 EBI http://www.ebi.ac.uk

3 MSI http://msi-workgroups.sf.net

Motivation

• experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology & bioinformatics

• controlled vocabularies and ontologies play a crucial role in consistent interpretation and seamless integration of information scattered across public resources

• hence the pressing need for controlled vocabularies and ontologies for metabolomics

Metabolomics Society

• http://www.metabolomicssociety.org

• the most recent community-wide initiative to coordinate the efforts in standardising reporting structures of metabolomics experiments

• five working groups:

– biological sample context

– chemical analysis

– data analysis

– ontology

– data exchange

MSI OWG

• Metabolomics Standardisation Initiative Ontology WG

• http://msi-ontology.sourceforge.net

• [email protected]

• coordinated by Dr Susanna-Assunta Sansone

• develop a common semantic framework for metabolomics studies by means of

– controlled vocabularies

– ontologies

so as to be able to:

– describe the experimental process consistently

– ensure meaningful and unambiguous data exchange

Scope

• the coverage of the domain reflects the typical structure of metabolomics investigations:

– general components (investigation design; sample source, characteristics, treatments and collection; computational analysis)

– technology-specific components (sample preparation; instrumental analysis; data pre-processing)

• analytical technologies: mass spectrometry (MS), gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), nuclear magnetic resonance (NMR) spectroscopy…

Terms

• terms:

– linguistic representations of domain-specific concepts

– means of conveying scientific and technical information

• CV terms:

– used to tag units of information so that they can be more easily retrieved by a search

– improve technical communication by ensuring that everyone is using the same term to mean the same thing

Term acquisition

• CV terms are chosen and organised by trained professionals who possess expertise in the subject area

• in the rapidly developing domain of metabolomics, new analytical techniques emerge regularly, often compelling domain experts to use non-standardised terms

• problem: manual term acquisition approaches are time-consuming, labour-intensive and error-prone

• solution: a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a CV with terms already in use in the scientific literature

Strategy

• each CV is compiled in an iterative process consisting of the following steps:

1. create an initial CV by re-using the existing terminologies from database models, glossaries, etc. and normalise the terms according to the common naming conventions

2. expand the CV with other frequently co-occurring terms identified automatically using text mining over a relevant corpus of scientific publications

3. circulate the proposed CV to the practitioners in the relevant area of metabolomics for validation in order to ensure its quality and completeness

A text mining workflow

1. information retrieval: gather a technology-specific corpus of documents

search terms: MeSH terms & CV terms

documents: abstracts & full papers

resources: Entrez, covering MEDLINE & PubMed Central (PMC)

2. term recognition: extract terms as lexical units frequently occurring in a domain-specific corpus

method: C-value provided by NaCTeM

3. term filtering: filter out terms not directly related to a given technology, such as those denoting substances, organisms, organs, diseases, etc.

resources: UMLS — MetaThesaurus & Semantic Network

Information retrieval using MeSH terms

• MeSH = Medical Subject Headings

• http://www.nlm.nih.gov/mesh/

• MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed

• MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts

IR using MeSH terms

• finding the relevant MeSH terms using the MeSH browser

• http://www.nlm.nih.gov/mesh/MBrowser.html

• look up: NMR

• resulting MeSH term(s): Magnetic Resonance Spectroscopy

• PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms]
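The PubMed query above can also be issued programmatically. The following is a minimal sketch, assuming the NCBI E-utilities `esearch` endpoint is used to reach Entrez; the slides do not say how the queries were actually submitted, and the function name `build_pubmed_query_url` and the `retmax` default are illustrative choices. The code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "esearch" endpoint (assumed access route).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query_url(mesh_term: str, retmax: int = 20) -> str:
    """Return an esearch URL for a PubMed query restricted to one MeSH term.

    `build_pubmed_query_url` is a hypothetical helper, not part of any library.
    """
    params = {
        "db": "pubmed",                        # search the PubMed database
        "term": f"{mesh_term} [MeSH Terms]",   # same query syntax as on the slide
        "retmax": retmax,                      # maximum number of IDs to return
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pubmed_query_url("Magnetic Resonance Spectroscopy")
print(url)
```

Fetching the resulting URL would return the matching PubMed IDs; that step is left out here so that the sketch stays runnable without network access.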

Beyond MeSH terms

• NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study, so an abstract can be expected to report only the results discovered and not the experimental conditions leading to these results

• the experimental conditions are typically reported within “Materials & Methods” sections or as part of the supplementary material, so it is important to process full-text articles rather than abstracts only

• as a consequence, an IR approach based on MeSH terms or on search terms limited to abstracts will result in low recall (i.e. many of the relevant articles will be overlooked)
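To make the recall argument concrete, here is a small worked sketch. The document IDs and the two retrieval results are invented purely for illustration and are not data from the study.

```python
def recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of the relevant documents that were actually retrieved."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(retrieved & relevant) / len(relevant)


# Hypothetical document IDs: five articles that really describe NMR methods.
relevant = {"d1", "d2", "d3", "d4", "d5"}

# Hypothetical search results: abstracts rarely mention the technique,
# whereas full texts mention it in their methods sections.
found_in_abstracts = {"d1", "d9"}
found_in_full_text = {"d1", "d2", "d3", "d4", "d9"}

abstract_recall = recall(found_in_abstracts, relevant)
full_text_recall = recall(found_in_full_text, relevant)
print(abstract_recall, full_text_recall)
```

In this toy setting the abstract-only search finds one of the five relevant articles, while the full-text search finds four of them.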

[Figure: diagram with the labels “NMR”, “MEDLINE (abstracts)”, “PubMed Central (full papers)” and “biomedical literature”]

Selecting search terms

2400

Selecting documents

[Figure: document selection, with the labels “doc ID”, “number of matching terms > threshold” and “local corpus”]

[Chart: documents (axis from 0 to 30000) against search terms (axis from 1 to 31); annotation “= 3”]
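The selection rule suggested by the “Selecting documents” slide (keep a document when its number of matching search terms exceeds a threshold) can be sketched as follows. This is an illustration under stated assumptions: matching is done by plain case-insensitive substring search, the threshold of 3 is taken from the “= 3” annotation without knowing exactly what it refers to, and the documents and search terms are made up.

```python
def select_documents(
    documents: dict[str, str],
    search_terms: list[str],
    threshold: int,
) -> list[str]:
    """Return IDs of documents matching more than `threshold` search terms."""
    selected = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        # Count how many distinct search terms occur in the document text.
        matches = sum(1 for term in search_terms if term.lower() in lowered)
        if matches > threshold:
            selected.append(doc_id)
    return selected


# Hypothetical search terms and documents, for illustration only.
terms = ["chemical shift", "pulse sequence", "spectrometer", "free induction decay"]
docs = {
    "doc1": "The spectrometer recorded the free induction decay; "
            "the pulse sequence and chemical shift reference were standard.",
    "doc2": "A spectrometer was used.",
}

local_corpus = select_documents(docs, terms, threshold=3)
print(local_corpus)
```

A production version would more likely match on tokenised text than on raw substrings, so that short terms do not match inside longer unrelated words.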

Term recognition: C-value

• http://www.nactem.ac.uk/batch.php

C-value

• syntactic pattern matching used to select term candidates:

(ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*) N

• termhood of each candidate term t is calculated using:

– |t| its length as the number of words

– f(t) its frequency of occurrence

– S(t) the set of other candidate terms containing it as a subphrase

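$$
C(t) = \begin{cases}
\ln|t| \cdot f(t), & \text{if } S(t) = \emptyset \\
\ln|t| \cdot \left( f(t) - \dfrac{1}{|S(t)|} \sum_{s \in S(t)} f(s) \right), & \text{if } S(t) \neq \emptyset
\end{cases}
$$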
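The termhood calculation can be sketched in a few lines. The sketch assumes the formula reads C(t) = ln|t| · f(t) when S(t) is empty and C(t) = ln|t| · (f(t) − mean of f(s) over s in S(t)) otherwise. It is not the NaCTeM implementation; the function names and the toy frequencies are illustrative.

```python
import math


def is_subphrase(short: tuple[str, ...], long: tuple[str, ...]) -> bool:
    """True if `short` occurs as a contiguous word sequence inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(frequencies: dict[str, int]) -> dict[str, float]:
    """Score each candidate term from its length, frequency and nesting."""
    tokens = {term: tuple(term.split()) for term in frequencies}
    scores = {}
    for t, f_t in frequencies.items():
        # S(t): the other candidate terms that contain t as a subphrase.
        containing = [
            s for s in frequencies
            if s != t and is_subphrase(tokens[t], tokens[s])
        ]
        length_weight = math.log(len(tokens[t]))  # ln|t|
        if containing:
            nested_mean = sum(frequencies[s] for s in containing) / len(containing)
            scores[t] = length_weight * (f_t - nested_mean)
        else:
            scores[t] = length_weight * f_t
    return scores


# Hypothetical candidate terms and corpus frequencies, for illustration only.
toy_frequencies = {
    "chemical shift": 10,
    "proton chemical shift": 4,
    "shift": 12,
}
scores = c_value(toy_frequencies)
print(scores)
```

With the natural logarithm as written here, any single-word candidate scores zero because ln 1 = 0, so the measure favours multi-word terms.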

C-value results

Unified Medical Language System (UMLS)

• UMLS = an “ontology” which merges information from over 100 biomedical source vocabularies

• http://umlsks.nlm.nih.gov

• UMLS contains the following semantic classes relevant to our problem:

– Organism A.1.1

– Anatomical Structure A.1.2

– Substance A.1.4

– Biological Function B.2.2.1

– Injury or Poisoning B.2.3

• we used these classes to automatically extract the corresponding terms from the UMLS thesaurus
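The filtering step can be illustrated with a toy lookup table standing in for the UMLS thesaurus. This is only a sketch: no real UMLS interface is called, the `semantic_class_of` mapping and the candidate terms are invented, and real UMLS entries carry finer-grained semantic types that would first have to be mapped onto the classes listed above.

```python
# Semantic classes whose terms should be filtered out (as listed on the slide).
EXCLUDED_CLASSES = {
    "Organism",
    "Anatomical Structure",
    "Substance",
    "Biological Function",
    "Injury or Poisoning",
}


def filter_terms(candidates: list[str], semantic_class_of: dict[str, str]) -> list[str]:
    """Keep candidates that are not known members of an excluded class."""
    return [
        term for term in candidates
        if semantic_class_of.get(term.lower()) not in EXCLUDED_CLASSES
    ]


# Hypothetical stand-in for terms extracted from the UMLS thesaurus.
semantic_class_of = {
    "glucose": "Substance",
    "liver": "Anatomical Structure",
    "mouse": "Organism",
}

candidates = ["pulse sequence", "glucose", "liver", "chemical shift"]
technology_terms = filter_terms(candidates, semantic_class_of)
print(technology_terms)
```

Terms absent from the lookup are kept, which matches the idea of removing only those candidates that are positively identified as substances, organisms and the like.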

Summary

UMLS

Results

• input: 243 NMR terms & 152 GC terms

• output: 5,699 NMR terms & 2,612 GC terms

2%

16.25

0.13

The End
