Facilitating the development of controlled vocabularies for metabolomics with text mining I. Spasić, 1 D. Schober, 2 S. Sansone, 2 D. Rebholz-Schuhmann, 2 D. Kell, 1 N. Paton 1 and the MSI Ontology Working Group Members 3 1 MCISB http://www.mcisb.org 2 EBI http://www.ebi.ac.uk 3 MSI http://msi-workgroups.sf.net


Jan 23, 2016


Transcript
Page 1

Facilitating the development of controlled vocabularies for metabolomics with text mining

I. Spasić,1 D. Schober,2 S. Sansone,2 D. Rebholz-Schuhmann,2 D. Kell,1 N. Paton1 and the MSI Ontology Working Group Members3

1 MCISB http://www.mcisb.org
2 EBI http://www.ebi.ac.uk
3 MSI http://msi-workgroups.sf.net

Page 2

Motivation

• experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology & bioinformatics

• controlled vocabularies and ontologies play a crucial role in consistent interpretation and seamless integration of information scattered across public resources

• the pressing need for vocabularies and ontologies for metabolomics

Page 3

Metabolomics Society

• http://www.metabolomicssociety.org

• the most recent community-wide initiative to coordinate the efforts in standardising reporting structures of metabolomics experiments

• five working groups:

– biological sample context

– chemical analysis

– data analysis

– ontology

– data exchange

Page 4

MSI OWG

• Metabolomics Standardisation Initiative Ontology WG

• http://msi-ontology.sourceforge.net


• coordinated by Dr Susanna-Assunta Sansone

• develop a common semantic framework for metabolomics studies by means of

– controlled vocabularies

– ontologies

so as to be able to:

– describe the experimental process consistently

– ensure meaningful and unambiguous data exchange

Page 5

Scope

• the coverage of the domain reflects the typical structure of metabolomics investigations:

– general components (investigation design; sample source, characteristics, treatments and collection; computational analysis)

– technology-specific components (sample preparation; instrumental analysis; data pre-processing)

• analytical technologies: mass spectrometry (MS), gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), nuclear magnetic resonance (NMR) spectroscopy…

Page 6

Terms

• terms:

– linguistic representations of domain-specific concepts

– means of conveying scientific and technical information

• CV terms:

– used to tag units of information so that they can be more easily retrieved by a search

– improve technical communication by ensuring that everyone is using the same term to mean the same thing

Page 7

Term acquisition

• CV terms are chosen and organised by trained professionals who possess expertise in the subject area

• in the rapidly developing domain of metabolomics, new analytical techniques emerge regularly, thus often compelling domain experts to use non-standardised terms

• problem: manual term acquisition approaches are time-consuming, labour-intensive and error-prone

• solution: a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a CV with terms already in use in the scientific literature

Page 8

Strategy

• each CV is compiled in an iterative process consisting of the following steps:

1. create an initial CV by re-using the existing terminologies from database models, glossaries, etc. and normalise the terms according to the common naming conventions

2. expand the CV with other frequently co-occurring terms identified automatically using text mining over a relevant corpus of scientific publications

3. circulate the proposed CV to the practitioners in the relevant area of metabolomics for validation in order to ensure its quality and completeness
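
The three steps above can be sketched as a loop. This is a minimal illustration only: the function name `compile_cv`, the `mine_terms` and `validate` callables, the `rounds` limit and the normalisation rule (lower case, single spaces) are all assumptions, since the slides do not specify the naming conventions or how many iterations are run.

```python
def compile_cv(seed_terms, mine_terms, validate, rounds=3):
    """Iteratively grow a controlled vocabulary (CV).

    seed_terms: terms re-used from existing resources (step 1)
    mine_terms: callable, CV -> candidate terms found by text mining (step 2)
    validate:   callable, candidates -> subset accepted by practitioners (step 3)
    """
    # Step 1: normalise the seed terms. The convention used here
    # (lower case, single spaces) is an assumption for illustration.
    cv = {" ".join(term.lower().split()) for term in seed_terms}

    for _ in range(rounds):
        # Step 2: propose terms that are not yet in the CV.
        candidates = set(mine_terms(cv)) - cv
        # Step 3: keep only what the practitioners accept.
        accepted = set(validate(candidates))
        if not accepted:
            break  # nothing new was accepted, so stop iterating
        cv |= accepted
    return cv
```

In practice step 3 is a manual review, so `validate` stands in for a round of expert feedback rather than an automatic check.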

Page 9

A text mining workflow

1. information retrieval: gather a technology-specific corpus of documents

search terms: MeSH terms & CV terms
documents: abstracts & full papers
resources: Entrez - MEDLINE & PubMed Central (PMC)

2. term recognition: extract terms as lexical units frequently occurring in a domain-specific corpus

method: C-value provided by NaCTeM

3. term filtering: filter out terms not directly related to a given technology, such as those denoting substances, organisms, organs, diseases, etc.

resources: UMLS — MetaThesaurus & Semantic Network

Page 10

Information retrieval using MeSH terms

• MeSH = Medical Subject Headings

• http://www.nlm.nih.gov/mesh/

• MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed

• MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts

Page 11

IR using MeSH terms

• finding the relevant MeSH terms using the MeSH browser

• http://www.nlm.nih.gov/mesh/MBrowser.html

• look up: NMR

• resulting MeSH term(s): Magnetic Resonance Spectroscopy

• PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms]
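
The same query can also be issued programmatically. The sketch below only builds a request URL for the ESearch endpoint of NCBI's E-utilities (the programmatic interface to Entrez); whether the authors used E-utilities is an assumption, and the helper name `mesh_query_url` and the `retmax` default are illustrative.

```python
from urllib.parse import urlencode

# ESearch endpoint of the NCBI E-utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mesh_query_url(mesh_term, db="pubmed", retmax=100):
    """Build an ESearch URL that restricts the search to a MeSH term."""
    # The [MeSH Terms] field tag limits matching to MeSH indexing.
    term = f"{mesh_term}[MeSH Terms]"
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


url = mesh_query_url("Magnetic Resonance Spectroscopy")
```

Setting `db="pmc"` would point the same query at PubMed Central instead of PubMed.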

Page 12

Beyond MeSH terms

• NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study, so an abstract is expected to report only the results discovered and not the experimental conditions leading to them

• the experimental conditions are typically reported within “Materials & Methods” sections or as part of the supplementary material, so it is important to process full-text articles as opposed to abstracts only

• as a consequence, an IR approach based on MeSH terms or search terms limited to abstracts will result in low recall (i.e. many of the relevant articles will be overlooked)

[Figure: occurrences of "NMR" in MEDLINE (abstracts) and PubMed Central (full papers) within the biomedical literature]
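
Recall, as used on this slide, is the fraction of the relevant articles that a search actually retrieves. A minimal sketch with made-up document IDs (none of the numbers come from the presentation):

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when nothing is relevant")
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical example: five articles use NMR, but only one says so
# in its abstract; the other four mention it only in the full text.
relevant = {"d1", "d2", "d3", "d4", "d5"}
abstract_hits = {"d1"}
full_text_hits = {"d1", "d2", "d3", "d4"}

abstract_recall = recall(abstract_hits, relevant)    # 1 of 5 found
full_text_recall = recall(full_text_hits, relevant)  # 4 of 5 found
```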

Page 13

Selecting search terms

[Figure: only the value 2400 survives]

Page 14

Selecting documents

[Figure: document selection. Diagram labels: doc ID; number of matching terms > threshold; local corpus. Chart: number of documents (0 to 30000) against number of search terms (1 to 31); annotation "= 3".]
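
The selection rule on this slide (keep a document for the local corpus when its number of matching search terms exceeds a threshold) can be sketched as follows. The default threshold of 3 is read off the figure annotation and is an assumption, as are the function name and the plain substring matching, which is a simplification of whatever matching the authors used.

```python
def select_documents(docs, search_terms, threshold=3):
    """Return IDs of documents matching more than `threshold` search terms.

    docs: mapping of document ID -> document text
    search_terms: iterable of terms; duplicates are counted once
    """
    terms = {term.lower() for term in search_terms}
    selected = []
    for doc_id, text in docs.items():
        lowered = text.lower()
        # Count distinct search terms that occur anywhere in the text.
        matches = sum(1 for term in terms if term in lowered)
        if matches > threshold:
            selected.append(doc_id)
    return selected
```

Counting distinct terms rather than total occurrences means a single term repeated many times is not enough to admit a document.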

Page 15

Term recognition: C-value

• http://www.nactem.ac.uk/batch.php

Page 16

C-value

• syntactic pattern matching used to select term candidates:

(ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*) N

• termhood of each candidate term t is calculated using:

– |t| its length as the number of words

– f(t) its frequency of occurrence

– S(t) the set of other candidate terms containing it as a subphrase

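\[
C(t) =
\begin{cases}
\ln|t| \cdot f(t), & \text{if } S(t) = \emptyset \\
\ln|t| \cdot \left( f(t) - \dfrac{1}{|S(t)|} \sum_{s \in S(t)} f(s) \right), & \text{if } S(t) \neq \emptyset
\end{cases}
\]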
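
A minimal sketch of both parts of this slide, not the NaCTeM implementation. It assumes that part-of-speech tags are encoded as one letter each (A = adjective, N = noun, P = preposition), that the square brackets in the pattern mark an optional element, and that f(t) counts every occurrence of t, including those nested inside longer candidates. Termhood is ln|t| · f(t) when no other candidate contains t, and otherwise ln|t| multiplied by f(t) minus the mean frequency of the candidates in S(t). The function names and the example frequencies are hypothetical.

```python
import math
import re

# (ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*) N, over one-letter tag codes.
CANDIDATE = re.compile(r"[AN]+|[AN]*(?:NP)?[AN]*N")


def is_candidate(tags):
    """True if the whole tag sequence (e.g. "AN") fits the pattern."""
    return CANDIDATE.fullmatch(tags) is not None


def _contains(longer, shorter):
    """True if `shorter` is a proper contiguous subphrase of `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )


def c_values(freq):
    """Map each candidate term to its C-value, given term -> frequency."""
    words = {term: tuple(term.split()) for term in freq}
    scores = {}
    for term, f in freq.items():
        # S(t): the other candidates that contain t as a subphrase.
        nesting = [s for s in freq if _contains(words[s], words[term])]
        if nesting:
            f = f - sum(freq[s] for s in nesting) / len(nesting)
        scores[term] = math.log(len(words[term])) * f
    return scores


# Hypothetical frequencies, for illustration only.
example = c_values({
    "magnetic resonance": 10,
    "nuclear magnetic resonance": 6,
    "nuclear magnetic resonance spectroscopy": 4,
})
```

Because ln 1 = 0, the formula as given on the slide assigns a score of zero to every single-word candidate.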

Page 17

C-value results

Page 18

Unified Medical Language System (UMLS)

• UMLS = an “ontology” which merges information from over 100 biomedical source vocabularies

• http://umlsks.nlm.nih.gov

• UMLS contains the following semantic classes relevant to our problem:

Organism A.1.1
Anatomical Structure A.1.2
Substance A.1.4
Biological Function B.2.2.1
Injury or Poisoning B.2.3

• we used these classes to automatically extract the corresponding terms from the UMLS thesaurus
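
The filtering step can be sketched as a lookup against the excluded classes. The `semantic_class` dictionary below is a hypothetical stand-in for a UMLS lookup: it is assumed to map each known term directly to one of the top-level classes listed on this slide, whereas a real lookup would have to resolve more specific semantic types to these ancestors first.

```python
# Classes named on the slide; terms belonging to them are filtered out.
EXCLUDED_CLASSES = {
    "Organism",
    "Anatomical Structure",
    "Substance",
    "Biological Function",
    "Injury or Poisoning",
}


def filter_terms(candidates, semantic_class):
    """Drop candidate terms whose semantic class is excluded.

    semantic_class: mapping of lower-case term -> class name
    (hypothetical stand-in for a UMLS Metathesaurus lookup).
    Terms that are not in the mapping are kept.
    """
    return [
        term for term in candidates
        if semantic_class.get(term.lower()) not in EXCLUDED_CLASSES
    ]


# Made-up example: substances and organs go, technique terms stay.
lookup = {"glucose": "Substance", "liver": "Anatomical Structure"}
kept = filter_terms(
    ["chemical shift", "Glucose", "liver", "pulse sequence"], lookup
)
```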

Page 19

Summary

[Figure: summary diagram; surviving label: UMLS]

Page 20
Page 21

Results

• input: 243 NMR terms & 152 GC terms

• output: 5,699 NMR terms & 2,612 GC terms

[Figure: surviving values: 2%, 16.25, 0.13]

Page 22

The End