Facilitating the development of controlled vocabularies for metabolomics with text mining
I. Spasić,1 D. Schober,2 S. Sansone,2 D. Rebholz-Schuhmann,2 D. Kell,1 N. Paton1 and the MSI Ontology Working Group Members3
1 MCISB http://www.mcisb.org
2 EBI http://www.ebi.ac.uk
3 MSI http://msi-workgroups.sf.net
• experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology & bioinformatics
• controlled vocabularies and ontologies play a crucial role in consistent interpretation and seamless integration of information scattered across public resources
• hence the pressing need for controlled vocabularies and ontologies for metabolomics
Metabolomics Society
• http://www.metabolomicssociety.org
• the most recent community-wide initiative to coordinate the efforts in standardising reporting structures of metabolomics experiments
• five working groups:
– biological sample context
– chemical analysis
– data analysis
– ontology
– data exchange
MSI OWG
• Metabolomics Standardisation Initiative Ontology WG
• develop a common semantic framework for metabolomics studies by means of
– controlled vocabularies
– ontologies
so as to be able to:
– describe the experimental process consistently
– ensure meaningful and unambiguous data exchange
Scope
• the coverage of the domain reflects the typical structure of metabolomics investigations:
– general components (investigation design; sample source, characteristics, treatments and collection; computational analysis)
– technology-specific components (sample preparation; instrumental analysis; data pre-processing)
• analytical technologies: mass spectrometry (MS), gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), nuclear magnetic resonance (NMR) spectroscopy…
Terms
• terms:
– linguistic representations of domain-specific concepts
– means of conveying scientific and technical information
• CV terms:
– used to tag units of information so that they can be more easily retrieved by a search
– improve technical communication by ensuring that everyone is using the same term to mean the same thing
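As a rough illustration of how CV terms support retrieval, the sketch below maps free-text labels to a single CV term and indexes records under it. The synonym table, the record IDs and the chosen preferred terms are invented for the example and are not taken from any MSI vocabulary.

```python
from collections import defaultdict

# Hypothetical synonym table: free-text variants mapped to one CV term.
CV_SYNONYMS = {
    "nmr": "NMR spectroscopy",
    "nuclear magnetic resonance": "NMR spectroscopy",
    "gc-ms": "gas chromatography-mass spectrometry",
    "gc/ms": "gas chromatography-mass spectrometry",
}

def to_cv_term(label):
    """Return the CV term for a free-text label, or None if it is not covered."""
    return CV_SYNONYMS.get(label.strip().lower())

def build_index(records):
    """Group record IDs under the CV term their free-text label maps to."""
    index = defaultdict(list)
    for record_id, label in records:
        term = to_cv_term(label)
        if term is not None:
            index[term].append(record_id)
    return dict(index)

# Invented records: (record ID, label as typed by the experimenter).
records = [("exp1", "NMR"), ("exp2", "Nuclear Magnetic Resonance"), ("exp3", "GC/MS")]
index = build_index(records)
```

Searching the index for one CV term then returns every record tagged with any of its variants, which is the retrieval benefit the slide describes.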
Term acquisition
• CV terms are chosen and organised by trained professionals who possess expertise in the subject area
• in the rapidly developing domain of metabolomics, new analytical techniques emerge regularly, often compelling domain experts to use non-standardised terms
• problem: manual term acquisition approaches are time-consuming, labour-intensive and error-prone
• solution: a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a CV with terms already in use in the scientific literature
Strategy
• each CV is compiled in an iterative process consisting of the following steps:
1. create an initial CV by re-using the existing terminologies from database models, glossaries, etc. and normalise the terms according to the common naming conventions
2. expand the CV with other frequently co-occurring terms identified automatically using text mining over a relevant corpus of scientific publications
3. circulate the proposed CV to the practitioners in the relevant area of metabolomics for validation in order to ensure its quality and completeness
A text mining workflow
1. information retrieval: gather a technology-specific corpus of documents
search terms: MeSH terms & CV terms
documents: abstracts & full papers
resources: Entrez (MEDLINE & PubMed Central (PMC))
2. term recognition: extract terms as lexical units frequently occurring in a domain-specific corpus
method: C-value provided by NaCTeM
3. term filtering: filter out terms not directly related to a given technology, such as those denoting substances, organisms, organs, diseases, etc.
• MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed
• MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts
IR using MeSH terms
• finding the relevant MeSH terms using the MeSH browser
• http://www.nlm.nih.gov/mesh/MBrowser.html
• look up: NMR
• resulting MeSH term(s): Magnetic Resonance Spectroscopy
• PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms]
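The same query can be issued programmatically through the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and does not send it; the helper name `esearch_url` and the `retmax` default are choices made for this example, not part of the original workflow.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, db="pubmed", retmax=20):
    """Build (but do not send) an ESearch request URL for the given query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})

# The PubMed query from the slide; db="pmc" would target PubMed Central instead.
url = esearch_url("Magnetic Resonance Spectroscopy [MeSH Terms]")
print(url)
```

Fetching the URL (for instance with `urllib.request.urlopen`) returns an XML list of matching record IDs; that step is left out here so the example runs offline.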
Beyond MeSH terms
• NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study, so an abstract is expected to report only the results discovered, not the experimental conditions leading to these results
• the experimental conditions are typically reported within “Materials & Methods” sections or as part of the supplementary material, so it is important to process full-text articles as opposed to abstracts only
• as a consequence, an IR approach based on MeSH terms or on search terms limited to abstracts will result in low recall (i.e. many of the relevant articles will be overlooked)
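To make the recall argument concrete, here is a toy calculation. The document IDs and the hit sets are invented; only the definition of recall (relevant documents retrieved divided by all relevant documents) is standard.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually retrieved."""
    return len(retrieved & relevant) / len(relevant)

# Invented example: four articles really use NMR in their methods.
relevant = {"doc1", "doc2", "doc3", "doc4"}
abstract_hits = {"doc1"}                   # NMR mentioned in the abstract
fulltext_hits = {"doc1", "doc2", "doc3"}   # NMR mentioned anywhere in the paper

print(recall(abstract_hits, relevant))
print(recall(fulltext_hits, relevant))
```

In this made-up case, searching abstracts alone finds one of the four relevant articles, while searching full text finds three.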
[Figure: NMR articles across the biomedical literature: MEDLINE (abstracts) and PubMed Central (full papers)]
Selecting search terms
[Figure: search-term selection; surviving label: 2400]
Selecting documents
[Figure: a document (doc ID) is added to the local corpus if its number of matching terms > threshold. Chart: number of documents (axis 0 to 30000) against number of search terms (axis 1 to 31); annotation "= 3".]
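The document selection step can be sketched as follows. The search terms and document texts are invented, matching is plain case-insensitive substring search, and the default threshold of 3 assumes that the "= 3" annotation on the slide refers to the threshold.

```python
def matching_terms(text, search_terms):
    """Return the search terms that occur in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return {term for term in search_terms if term.lower() in lowered}

def select_documents(docs, search_terms, threshold=3):
    """Keep the IDs of documents matching more than `threshold` distinct search terms."""
    return [
        doc_id
        for doc_id, text in docs.items()
        if len(matching_terms(text, search_terms)) > threshold
    ]

# Invented search terms and documents, for illustration only.
SEARCH_TERMS = ["nmr", "chemical shift", "pulse sequence", "free induction decay", "spectrometer"]
DOCS = {
    "doc-A": "Spectra were acquired on a 600 MHz NMR spectrometer; the pulse "
             "sequence and chemical shift referencing are described below.",
    "doc-B": "Samples were analysed by NMR.",
}

local_corpus = select_documents(DOCS, SEARCH_TERMS)
```

Requiring several distinct terms rather than one is a simple way to keep documents that actually describe the technology instead of mentioning it in passing.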
Term recognition: C-value
• http://www.nactem.ac.uk/batch.php
C-value
• syntactic pattern matching used to select term candidates:
(ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*) N
• termhood of each candidate term t is calculated using:
– |t| its length as the number of words
– f(t) its frequency of occurrence
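– S(t) the set of other candidate terms containing it as a subphrase
$$
C(t) =
\begin{cases}
\ln|t| \cdot f(t), & \text{if } S(t) = \emptyset \\
\ln|t| \cdot \left( f(t) - \dfrac{1}{|S(t)|} \sum_{s \in S(t)} f(s) \right), & \text{if } S(t) \neq \emptyset
\end{cases}
$$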
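A minimal sketch of the two steps on this slide, candidate selection by syntactic pattern and C-value scoring, is given below. It is not the NaCTeM implementation: the one-letter tag encoding, the pre-tagged toy sentences, the `max_len` limit and the restriction to multi-word candidates are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Assumed one-letter POS encoding: A = adjective, N = noun, P = preposition,
# O = anything else. The slide's pattern is read as
# ((A|N)+ | (A|N)* (N P)? (A|N)*) N.
TERM_PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")

def candidate_terms(tagged_sentence, max_len=5):
    """Return every span of 2..max_len words whose tag sequence matches the pattern."""
    words = [word.lower() for word, _ in tagged_sentence]
    tags = "".join(tag for _, tag in tagged_sentence)
    found = []
    for i in range(len(words)):
        for j in range(i + 2, min(i + max_len, len(words)) + 1):
            if TERM_PATTERN.fullmatch(tags[i:j]):
                found.append(tuple(words[i:j]))
    return found

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def c_values(freq):
    """C(t) = ln|t| * (f(t) - mean f(s) over the other candidates s containing t)."""
    scores = {}
    for term, f_t in freq.items():
        nesting = [s for s in freq if s != term and contains(s, term)]
        adjusted = f_t
        if nesting:
            adjusted -= sum(freq[s] for s in nesting) / len(nesting)
        scores[term] = math.log(len(term)) * adjusted
    return scores

# Invented, hand-tagged sentences standing in for a POS-tagged corpus.
SENTENCES = [
    [("nuclear", "A"), ("magnetic", "A"), ("resonance", "N"),
     ("spectroscopy", "N"), ("was", "O"), ("used", "O")],
    [("magnetic", "A"), ("resonance", "N"), ("spectroscopy", "N"),
     ("of", "P"), ("urine", "N")],
    [("magnetic", "A"), ("resonance", "N"), ("imaging", "N")],
]

freq = Counter(t for sentence in SENTENCES for t in candidate_terms(sentence))
scores = c_values(freq)
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(" ".join(term), round(score, 3))
```

Single-word candidates are skipped because ln 1 = 0 gives them a score of zero under this formula. The subtraction penalises candidates such as "resonance spectroscopy" that mostly occur inside longer candidates.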
C-value results
Unified Medical Language System (UMLS)
• UMLS = an “ontology” which merges information from over 100 biomedical source vocabularies
• http://umlsks.nlm.nih.gov
• UMLS contains the following semantic classes relevant to our problem:
– Organism A.1.1
– Anatomical Structure A.1.2
– Substance A.1.4
– Biological Function B.2.2.1
– Injury or Poisoning B.2.3
• we used these classes to automatically extract the corresponding terms from the UMLS thesaurus
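The filtering step could then look roughly like the sketch below. The term sets are tiny invented stand-ins for the terms extracted from the UMLS, and the rule used here (drop a candidate if it equals or contains such a term as a whole-word phrase) is an assumption; the slides do not state the exact matching rule.

```python
# Hypothetical stand-in for terms extracted from the UMLS, grouped by semantic class.
UMLS_TERMS = {
    "Organism": {"rat", "escherichia coli"},
    "Anatomical Structure": {"liver"},
    "Substance": {"urine", "glucose"},
    "Biological Function": {"glycolysis"},
    "Injury or Poisoning": {"liver injury"},
}

def contains_phrase(term, phrase):
    """True if `phrase` occurs in `term` on whole-word boundaries."""
    return f" {phrase} " in f" {term} "

def filter_terms(candidates, umls_terms):
    """Drop candidates that equal or contain a term from the excluded classes."""
    excluded = set().union(*umls_terms.values())
    return [
        term
        for term in candidates
        if not any(contains_phrase(term.lower(), phrase) for phrase in excluded)
    ]

# Invented candidate terms, e.g. as produced by a term recognition step.
candidates = ["pulse sequence", "rat liver", "urine sample", "chemical shift", "glucose"]
kept = filter_terms(candidates, UMLS_TERMS)
print(kept)
```

A stricter rule (exact match only) would keep "urine sample"; which behaviour is preferable depends on how technology-specific the resulting CV needs to be.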