Mapping disease annotation
in Swiss-Prot to Medical terminology MeSH
In the frame of a Master in Proteomics and Bioinformatics
University of Geneva 2006/2007
Anaïs Mottaz Supervised by: Anne-Lise Veuthey, Yum Lina Yip
Swiss-Prot research Group, SIB
Summary
The objective of this work is to find an automatic way to link the Swiss-Prot knowledgebase to the medical thesaurus MeSH. We propose here to take advantage of the more than 2000 disease-related annotations present in the database, as well as their references to OMIM. We first mapped separately to the MeSH terms the disease terms automatically extracted from the annotations and the disease names obtained through the OMIM references. We used exact matching to map the diseases when possible; otherwise we used a similarity score developed specifically for this work to try to retrieve the closest concept in the terminology. We then evaluated the mapping using a set of manually mapped entries that serves as benchmark. Better results were obtained using OMIM references than using Swiss-Prot (SP) disease terms. We obtained a recall of 43% and a precision of 95% with OMIM, against a recall of 35% and a precision of 89% with SP. When we combined
the mapping of SP and OMIM, we reached a precision of 100% with a
recall of 37%. By lowering the similarity score threshold and using
another SP and OMIM combination to enhance the mapping coverage, we
obtained a recall of 59% with a precision of 89%. The low recall
values obtained are mainly due to the absence of MeSH terms for
many genetic diseases present in OMIM and Swiss-Prot. This mapping
represents a good basis for the development of a complete mapping
of Swiss-Prot to medical terminologies.
Table of contents
Summary
1. Introduction
1.1. Disease annotation in Swiss-Prot
1.2. OMIM
1.3. Medical terminologies
1.4. MeSH
1.5. Terminology mapping
2. Methods
2.1. Extraction of disease names
2.2. Mapping algorithm
2.3. Benchmark
2.4. Evaluation
3. Results
3.1. Manual mapping
3.2. Disease extraction
3.3. Automatic mapping
4. Discussion
5. Conclusion
Appendix
Bibliography
1. Introduction
Medicine is in constant evolution. It has changed from an exclusively symptomatic diagnosis and treatment to a more elaborate care management (from prevention to diagnosis) involving molecular biology data. Treatments are increasingly adapted to the specificity of the patient and of the disease, and this specificity involves, among other things, molecular information. Blood markers that reveal cancer and treatment protocols adapted to the molecular type of cancer are examples of this evolution. Meanwhile, bioinformatics resources, which are devoted to storing, processing and analyzing molecular biology information, contain more and more data related to medicine. These growing interactions call for a standardized method for communicating results between these fields.
The use of a common terminology is required to help the integration of molecular biology data at the clinical level. The UniMed project fits into this context. Its aim is to link the UniProt knowledgebase, a protein database, to disease terminology in order to enhance the interoperability between bioinformatics and medical resources and to provide a link between gene and phenotype. This area is of interest, as indicated by the existence of different projects such as INFOBIOMED [1], whose aim is to develop biomedical informatics, bridging the gap between medical informatics and bioinformatics. The work presented here is a first step toward the achievement of this goal. We took advantage of the annotations relating to diseases present in Swiss-Prot, the manually annotated part of UniProtKB. We developed an automatic mapping between these annotations and the biomedical thesaurus MeSH. We tested two different approaches, one using the text present in Swiss-Prot concerning the involvement of the protein in a disease, and the other using the Swiss-Prot citations to OMIM, a gene and phenotype database.
We first present the Swiss-Prot database and its data concerning diseases, followed by a presentation of OMIM, then a short introduction to the principal existing medical terminologies, and finally some related work on the mapping of medical terminologies.
1.1. Disease annotation in Swiss-Prot
UniProtKB/Swiss-Prot is a protein sequence database containing high-quality, manually annotated and non-redundant protein sequence records. It is composed of more than 260 000 entries, among which nearly 16 000 are human gene products. Manual annotation consists of the analysis, comparison and merging of all available sequences for a given protein, as well as a critical review of associated data from the literature [Bairoch et al., 2007]. The information concerns many aspects of the protein, such as its function, its structure, its variants and its involvement in disease.
Around 2 600 entries contain annotation relating to the involvement of the protein in one or more diseases, of which more than 2000 are human.
The medical annotation appears in different fields: the description of associated diseases in the comment lines, the data on variants in the feature lines, the keywords, and the cross-references to OMIM, to DrugBank and to many other databases related to human genes and proteins [Bairoch et al., 2007]. In this work we focus on the disease comment lines and the cross-references to OMIM.
Figure 1. Example of the disease comment lines of one Swiss-Prot entry
Figure 1 presents an example of disease comment lines in Swiss-Prot, which are the Swiss-Prot data that we used in this work. There can be several disease comment lines per entry. First, more than one disease-causing variant can exist for a protein. Second, a given disease-causing variant can have different phenotypic expressions, depending on the genome and on the environment [Boeckmann et al., 2005]. Note that in this work we mapped the disease comment lines and not the entries.
1.2. OMIM
OMIM is a public knowledgebase of human genes and genetic
disorders [Hamosh et al., 2005]. Each OMIM entry has a full-text
summary of a genetically determined phenotype and/or gene and has
numerous links to other genetic databases. An OMIM entry includes a
primary title, which is the principal name of the disease,
alternative titles, which are the synonyms, and ‘included’ titles,
which are related but not synonymous information [2]. Different
types of entries exist:
• Gene of known sequence (*)
• Phenotype that does not represent a unique locus (#)
• Gene of known sequence and a phenotype (+)
• Mendelian phenotype or phenotypic locus for which the underlying molecular basis is not known (%)
• Phenotype for which the mendelian basis, although suspected, has not been clearly established or when the separateness of this phenotype from that in another entry is unclear (no sign)
The references of Swiss-Prot to OMIM in the disease comment
lines can be
present or absent, depending on the existence of the
corresponding entry in OMIM. Among the disease comment lines, about
73% have a reference to an OMIM entry. The entries referenced are
always of the phenotype (#) or phenotype and gene (+) type.
Sometimes there is more than one reference to OMIM in one disease comment line. This is the case when the annotation references a closely related disease to complete the description of the disease. It also happens when the annotation is about a disease whose subtypes are classified under different OMIM entries.
1.3. Medical terminologies
Terminology refers to the set of words employed in a domain.
This set of words can be organized in different structures that are
referred to as controlled vocabulary, taxonomy, thesaurus, ontology
or meta-model. The role of all these structured sets of vocabulary
is to help structure, classify, model, and represent the concepts
and relationships pertaining to some subject matter and to enable a
community to come to agreement and to commit to use the same terms
in the same way [3]. Definitions of these terms are provided in the
Appendix A.
Here is an introduction to the most common medical vocabularies, except MeSH [4], to which the next section is dedicated.
• SNOMED CT - "Systematized Nomenclature of Medicine Clinical Terms" is developed by the College of American Pathologists. This terminology is designed to deal with clinical information. It covers diseases, clinical findings and procedures [5]. Its coverage of diseases is extensive, as reported in [B.L. Humphreys et al., 1997].
• ICD - "International Classification of Disease" is published by the World Health Organization. It is used worldwide to deal with diagnostic information, for example epidemiological and health management data, particularly in hospital records [6]. The most recent version is ICD-10. Its disease coverage, especially concerning genetic diseases, has been shown to be less extensive than that of other terminologies [K.M. O’Keefe et al., 1993].
• UMLS - NLM produces and distributes the Unified Medical Language System Knowledge Sources (databases) and associated software tools (programs) [7]. The major component of UMLS is the Metathesaurus, a repository of inter-related biomedical concepts, as presented in Figure 2. Two other knowledge sources in the UMLS are the Semantic Network, providing high-level categories used to categorize every Metathesaurus concept, and lexical resources including the SPECIALIST lexicon and programs for generating the lexical variants of biomedical terms. These programs include Norm, which generates the normalized form of a term; WordInd, which decomposes strings into words; and Lvg, which generates lexical variants [O. Bodenreider, 2004]. MMTx, a program accessible as a web service, maps text to Metathesaurus concepts.
Figure 2. The various subdomains integrated in the UMLS.
In this work, UMLS was not used because of its complexity, SNOMED CT because it is not freely available, and ICD because of its lack of granularity. All of these resources will nevertheless be used in future work.
1.4. MeSH
The National Library of Medicine (NLM) has produced the Medical
Subject Headings since 1960. The MeSH thesaurus is NLM's controlled
vocabulary for subject indexing and searching of journal articles
in MEDLINE, as well as books, journal titles, and non-print
materials in NLM's catalog [Nelson et al., 2001]. It consists of a set of nouns called descriptors, organized in a hierarchical structure that allows searching at different levels of specificity. The last major change to its structure was made in 2000 [Savage A., 2000].
MeSH has three major components:
• Main Headings (Descriptors)
• Subheadings (Qualifiers)
• Supplementary Concept Records (formerly ‘Supplementary Chemical Records’)
This work only used the Main Headings. The other components are described for information only.
Main Headings
They are the index terms used to indicate the main subject and characterize the content. They are also called Descriptors.
Subheadings
Also called qualifiers, they are used in conjunction with descriptors to provide a means of grouping together the citations that are concerned with a particular aspect of a subject. Complex concepts that are not pre-coordinated in a descriptor and that can be represented by the conjunction of a descriptor and a qualifier are examples of such combined use. Monoamine Oxidase/deficiency is one such example, Monoamine Oxidase being the descriptor and deficiency the qualifier. Each descriptor has a list of allowable qualifiers, in order to prevent wrong associations. Descriptors can also be coordinated with other descriptors, but then there is a risk of wrong associations.
Supplementary Concept Records
They are supplementary headings, each being linked to a descriptor. They serve to control the use of substance names during indexing.
As mentioned above, the Descriptors are the component of MeSH we used in this work. They are organized in 16 different categories, such as Anatomy, Organisms, Diseases, and Chemicals and Drugs. Only the Diseases category is used in this work. Descriptors are arranged in a hierarchy of increasing specificity, containing up to eleven levels. The links are not strictly defined (not strict ‘is a’ or ‘part of’ relations). A descriptor can be found in different places in the hierarchy, and its children can vary according to its place in the hierarchy.
Each descriptor contains a set of concepts that are not equivalent but not (or not yet) distinct enough to be descriptors themselves. The descriptor itself is one of its concepts, the preferred concept. The other concepts have a defined link to the preferred concept (narrower, broader or related).
Each concept itself contains a set of terms, which are synonyms
and lexical variants. The concept is one of its terms, the
preferred term.
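The descriptor, concept and term levels described above can be pictured with a small data model. The following Python sketch is only an illustration under stated assumptions: the class names, the field names and the `term_index` helper are hypothetical and are not taken from the programs of this work, and the example strings are illustrative rather than copied from a MeSH record.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One concept of a descriptor, with its synonyms and lexical variants."""
    preferred_term: str
    terms: list = field(default_factory=list)   # other terms of the concept
    relation: str = "preferred"                 # or narrower / broader / related


@dataclass
class Descriptor:
    """A MeSH main heading grouping one or more concepts."""
    name: str
    concepts: list = field(default_factory=list)


def term_index(descriptors):
    """Map every lower-cased term to the set of descriptor names it belongs to."""
    index = {}
    for descriptor in descriptors:
        for concept in descriptor.concepts:
            for term in [concept.preferred_term, *concept.terms]:
                index.setdefault(term.lower(), set()).add(descriptor.name)
    return index


# Illustrative content only (hypothetical grouping of terms).
example = Descriptor(
    name="Esophageal Neoplasms",
    concepts=[
        Concept("Esophageal Neoplasms", ["Neoplasms, Esophageal"]),
        Concept("Esophageal Cancers", ["Cancer, Esophageal"], relation="narrower"),
    ],
)
index = term_index([example])
```

Such an index reflects the way terms are used here: a disease is matched against terms, and the descriptor reached through the term is what is compared during validation.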
Figure 3 shows, for one descriptor, its relations to the concepts and terms. In this work, we only used the Terms, to which we mapped the Swiss-Prot annotation lines, and the Descriptor, to which we referred for the validation of our automatic mapping.
Figure 3. Relations between terms, concepts and descriptors in MeSH.
Figure 5 shows the descriptor ‘Esophageal Neoplasms’ in one of its places in the ‘Diseases’ tree. Figure 4 shows the entry corresponding to this descriptor, as it appears in the MeSH browser accessible on the web.
Figure 4. Example of MeSH entry as it appears with the web MeSH browser.
Figure 5. The entry of Figure 4 represented in the descriptor hierarchy
1.5. Terminology mapping
In the domain of terminology mapping, ontology mapping has been the focus of a variety of works originating from diverse communities over a number of years. Indeed, this kind of mapping could provide a common layer from which several ontologies could be accessed and hence exchange information in a semantically sound manner. Concerning medical terminologies, an important effort has already been made in this direction: UMLS is the result of the merging of several medical vocabularies, including a small part of OMIM. Current efforts in this field tend to integrate biological terminologies and biological data, as well as phenotypic data and medical terminologies, often in UMLS.
To address the need for a representation of genes and gene products in the UMLS, a method has been proposed to map the Gene Ontology to UMLS [Sarkar et al., 2003]. It uses three mapping approaches: (1) exact string matching, (2) partial matching using two tools provided by UMLS, namely Norm, a lexical tool, and MMTx, which automatically maps text to UMLS concepts, and finally (3) a mapping based on similarities using a BLAST algorithm. The authors obtained the best results using the exact matches and the Norm tool.
Motivated by the pressing demand for technologies allowing greater integration of phenotypic data and for phenotype-centric discovery tools to facilitate biomedical research, an attempt has been made to map a phenotype vocabulary developed by the Mouse Genome Database, the Phenoslim terminology, to SNOMED CT [Lussier and Li, 2004]. The method proposed the decomposition of Phenoslim concepts and their normalization with Norm.
These works have in common the mapping of terms to concepts in UMLS or in another medical terminology such as SNOMED CT. The mapping problems encountered are thus often the same. Here is a summary of the principal difficulties, with some illustrative examples:
• Synonymy
  Ex: ‘Primary adrenal insufficiency’ and ‘Addison disease’
• Morphology
  - Inflection
    Ex: ‘cancer, esophageal’ and ‘cancers, esophageal’
  - Derivation (laryngeal, larynx)
    Ex: ‘cancer, esophageal’ and ‘cancer, esophagus’
• Orthography
  - Spelling variants
    Ex: ‘cancer, esophageal’ and ‘cancer, oesophageal’
• Syntax
  - Complementation for verbs, nouns and adjectives
    Ex: ‘cancer, esophageal’ and ‘cancer of the esophagus’
• Level of specificity
  Ex: ‘esophageal cancers’ narrower than ‘esophageal neoplasms’ narrower than ‘gastrointestinal neoplasms’
• Context
  Ex: ‘human embryo’ and ‘mammalian embryo’. The embryo is defined in the human context as an organism at stage
• Definition conflict
  Ex: ‘cirrhosis’. Cirrhosis is sometimes defined as a chronic interstitial inflammation of any tissue or organ, and more often as a chronic disease of the liver, the liver cirrhosis.
Some of these problems are resolved in this work by the synonyms and variants provided by MeSH through the different terms grouped under each concept. The syntactic variants are further handled by decomposing the diseases into words to calculate the partial matching. Only the context and definition conflicts are not handled here.
2. Methods
All the programs developed in this project are written in Perl (v5.8.5). We extracted diseases from the human entries of Swiss-Prot (version 51.0). This represents 2033 Swiss-Prot human entries with at least one disease comment line, and 2966 disease comment lines, of which 2179 have a cross-reference to an OMIM entry of the phenotype (#) or phenotype and gene (+) type. The version of OMIM is that of September 2006. This version contains 2306 entries of the phenotype and phenotype and gene types. The fields of the entries used for mapping are the “Title”, “Alternative titles” and “Included titles”.
We used the ‘Diseases’ category of MeSH (version 2006 in XML file format). This category contains 4150 descriptors, 7146 concepts and 38193 terms.
A database has been created to store all the necessary information from the three databases (Swiss-Prot, MeSH, OMIM), and to easily retrieve the data required by the mapping program. The database is implemented using the PostgreSQL (8.2.3) database management system. The schema of the database is in the Appendix (C. Graph 1). The information present in this database has been extracted from the Swiss-Prot and OMIM flat files using Perl regular expressions, and from the MeSH XML file using the XML::Twig Perl library. The database contains more information than is strictly used by the programs. It has been designed to contain information that could be useful in the next steps of the project. For example, the semantic types of the MeSH concepts are stored although we do not use them.
2.1. Extraction of disease names
The approach used to map the annotation lines includes a
procedure for the extraction of the disease names from the disease
comment lines of Swiss-Prot, as well as of the corresponding MIM
identifier.
The extraction of the disease names has been done using regular expressions. First, by looking at the disease comment lines, we established a list of expressions that are used to indicate the association of a protein with the disease. These expressions act as starters for the disease name extraction. A list of specific ‘stop’ words has been defined to remove terms that obviously do not belong to the name of the disease. The role of the termination terms is to remove additional information that sometimes follows the name of the disease. These expressions are presented in Table 1. The extraction procedure
does not treat cases where several diseases are displayed in the
same disease comment line. These cases are rare, although one is
present in the benchmark.
The MIM identifiers have also been extracted from the disease comment line. They have not been extracted from the DR lines (cross-reference lines) because the DR lines give no indication as to which disease comment line they refer. This information is important because we map the disease comment lines and not the entry. These references have also been extracted using regular expressions. This was made easier by the fact that the MIM identifiers are always presented in the same way ([MIM:115150]) in the disease comment line. The MIM numbers are then used to retrieve the corresponding title and alternative titles in the database created for this project.
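As an illustration of the extraction of MIM identifiers, here is a minimal Python sketch (the programs of this work were written in Perl; Python is used here only for illustration). It relies solely on the fixed ‘[MIM:115150]’ notation mentioned above. The function name and the example comment lines are hypothetical.

```python
import re

# References are always written as "[MIM:<digits>]" in a disease comment line.
MIM_PATTERN = re.compile(r"\[MIM:(\d+)\]")


def extract_mim_ids(comment_line):
    """Return all MIM identifiers cited in one disease comment line, in order."""
    return MIM_PATTERN.findall(comment_line)


# Hypothetical comment lines, used only to exercise the pattern.
single = extract_mim_ids("defects in xyz are the cause of some disease [MIM:115150].")
double = extract_mim_ids("related to [MIM:100100]; see also [MIM:200200].")
absent = extract_mim_ids("cooperates with the myc oncogene to produce b-lymphomas.")
```

Returning a list rather than a single value accommodates the comment lines that cite more than one OMIM entry.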
(1) Starter expressions (2) Specific stop words (3) Termination
term Cause(s) of /a involved in (can) contribute(s) to
associated/association with correlated with responsible for
contributor to result(s)/resulting in lead(s) to induce(s)
defective in individual(s) with patient(s) with/suffering from
reduce(s) influence(s) deleted in down-regulated in found in
implicated in predispose(s) to favor antigen of antigen for thought
to be an role in could impart mediate(s) candidate (gene)
susceptibility to development of genetic predisposition for
developing pathogenesis of subset of various types of some form of
increased risk of
also known as but which an due to in condition(s) such as .
[MIM:
Table 1. (1) Expressions used to extract the part of the string
containing the disease name. (2) Terms removed from the extracted
string. (3) Expressions indicating the end of the disease name.
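The three roles of the expressions of Table 1 can be sketched as follows in Python. This is a simplified illustration and not the extraction program itself: only a handful of expressions are used, their assignment to the stop-word and termination lists is an assumption, plain substring search replaces the regular expressions, and the function name and the first two example lines are hypothetical.

```python
# Small, assumed subsets of the three kinds of expressions (all lower case).
STARTERS = ["cause of", "involved in", "associated with", "responsible for"]
STOP_WORDS = ["susceptibility to", "some form of"]
TERMINATORS = ["also known as", "[mim:", "."]


def extract_disease_name(comment_line):
    """Return the disease name found after the first starter, or None."""
    text = comment_line.lower()
    starts = [(text.find(s), len(s)) for s in STARTERS if s in text]
    if not starts:
        return None                      # no starter expression: nothing extracted
    position, length = min(starts)       # earliest starter in the line
    name = text[position + length:]
    ends = [name.find(t) for t in TERMINATORS if t in name]
    if ends:
        name = name[:min(ends)]          # cut at the earliest termination term
    for stop in STOP_WORDS:
        name = name.replace(stop, " ")   # drop words that are not part of the name
    name = " ".join(name.split()).strip(" ,;:")
    return name or None


first = extract_disease_name("Defects in FBN1 are a cause of Marfan syndrome [MIM:154700].")
second = extract_disease_name("Associated with susceptibility to asthma, also known as X.")
third = extract_disease_name("sera from scleroderma patients recognize rpp38.")
```

Like the procedure described in this section, the sketch returns a single name per line and therefore does not handle comment lines that mention several diseases.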
2.2. Mapping algorithm
The mapping algorithm is divided into two steps, the exact match and the partial match. The partial match is searched for only if no exact match has been found.
Exact match
The exact match consists of searching among the MeSH terms to
determine if the exact same term exists (same length, same word
order, but case insensitive). For OMIM, there can be more than one
disease term per disease comment line, either because of the
synonyms provided by the OMIM entry or because of multiple OMIM
references in the comment line. In this case one exact match among
all the possibilities is sufficient.
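The exact match step can be illustrated with the short Python sketch below. It assumes that the MeSH terms are available as a plain list of strings; the function name and the example terms are hypothetical.

```python
def exact_match(candidate_names, mesh_terms):
    """Return the first MeSH term equal to one of the candidate names, ignoring case.

    candidate_names holds one name for SP, or all titles and synonyms for OMIM;
    a single hit among the candidates is sufficient.
    """
    by_lowercase = {term.lower(): term for term in mesh_terms}
    for name in candidate_names:
        hit = by_lowercase.get(name.lower())
        if hit is not None:
            return hit
    return None


mesh = ["Esophageal Neoplasms", "Marfan Syndrome"]
found = exact_match(["MFS", "marfan syndrome"], mesh)
missing = exact_match(["inclusion body myopathy 2"], mesh)
```

When `exact_match` returns `None`, the partial match described next would be attempted.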
Partial match
To evaluate which term can be the corresponding term in MeSH
when no exact match is found, a similarity score is calculated
between the disease and all terms present in the Disease category
of MeSH.
To calculate the similarity score between a disease and a MeSH term, the disease and the MeSH term are first decomposed into words. For this purpose, a word is defined as a token containing only alphanumeric characters, with a length of one or greater.
A weight is calculated for each of these words. This weight is a function of the word frequency and represents the importance of the word in the context of origin of the terms. Its purpose is to give more importance to infrequent words, hemophilia for example, than to frequent words like disease or syndrome.
The score is calculated by adding the weight of each word in common between the disease and the MeSH term, and subtracting the weight of every word of the disease and of the MeSH term that is not in common. The score is normalized by dividing it by the number of words contained in the disease term:
\[
\mathrm{score} = \frac{\sum_{cw} \log\left(\frac{1}{\mathrm{freq}(cw)}\right) - \sum_{ncw} \log\left(\frac{1}{\mathrm{freq}(ncw)}\right)}{\mathrm{size}(\mathrm{disease})}
\]
freq = occurrence in all OMIM titles, alternative and included titles, all MeSH terms of the Disease category and all Swiss-Prot disease comment lines, divided by the total number of words in these documents
cw = words in common between the MeSH term and the disease to map
ncw = words not in common of the MeSH term and the disease to map
disease = the disease name to map
size = word count
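The score defined above can be sketched in Python as follows. Several points are assumptions made for the illustration: words are lower-cased, repeated words are counted once when comparing the two terms, and the frequency table is built from a tiny toy corpus instead of the OMIM, MeSH and Swiss-Prot texts. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def words(term):
    """Split a term into lower-cased alphanumeric tokens of length >= 1."""
    return re.findall(r"[a-z0-9]+", term.lower())


def word_frequencies(corpus_terms):
    """freq(w) = occurrences of w divided by the total number of words."""
    counts = Counter(w for term in corpus_terms for w in words(term))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def similarity(disease, mesh_term, freq):
    """Sum of weights of common words minus weights of the others, normalized."""
    disease_words = words(disease)
    if not disease_words:
        return None
    d, m = set(disease_words), set(words(mesh_term))
    weight = lambda w: math.log(1.0 / freq[w])     # rare words weigh more
    common = sum(weight(w) for w in d & m)
    not_common = sum(weight(w) for w in d ^ m)     # words of either term not shared
    return (common - not_common) / len(disease_words)


# Toy corpus: freq(a) = 0.5, freq(b) = freq(c) = 0.25.
freq = word_frequencies(["a b", "a c"])
same = similarity("a b", "a b", freq)
close = similarity("a b", "a c", freq)
```

With this toy corpus, the identical pair obtains log 2 + log 4 divided by 2, while the pair differing by one word on each side obtains a negative score, which shows how unshared rare words penalize a candidate.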
This formula is inspired by the IDF, the Inverse Document Frequency used in the Information Retrieval domain [Manning et al., 2007]. In addition, an adjustment has been made to the calculation of the score:
Hyphenated terms
To take into account the fact that some words are linked by hyphens, we treat them specially. These words are considered common words only if all the words composing the hyphenated term are common. If at least one of the words is not a common word, all of them are considered as words not in common. This is to avoid false positive matches.
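The hyphen rule can be illustrated by the following Python sketch, which returns the words treated as common and as not common for a pair of terms. It assumes that a hyphenated term is a whitespace-delimited chunk containing at least one hyphen, and that words are lower-cased; the function names and example strings are hypothetical.

```python
import re

WORD = re.compile(r"[a-z0-9]+")


def hyphen_groups(term):
    """Word lists of the whitespace-delimited chunks that contain a hyphen."""
    return [WORD.findall(chunk) for chunk in term.lower().split() if "-" in chunk]


def common_and_different(disease, mesh_term):
    """Apply the hyphen rule and return (common words, words not in common)."""
    d = set(WORD.findall(disease.lower()))
    m = set(WORD.findall(mesh_term.lower()))
    shared = d & m
    demoted = set()
    for group in hyphen_groups(disease) + hyphen_groups(mesh_term):
        # One unshared member makes the whole hyphenated term "not common".
        if not all(w in shared for w in group):
            demoted.update(group)
    common = shared - demoted
    different = (d | m) - common
    return common, different


partial = common_and_different("x-linked ichthyosis", "ichthyosis linked")
full = common_and_different("x-linked ichthyosis", "ichthyosis, x-linked")
```

In the first pair, ‘linked’ occurs in both terms but is demoted because ‘x’ is missing from the second term; in the second pair, the whole hyphenated term is shared and all words remain common.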
We calculated the partial matching score of each disease name against all MeSH terms. The term obtaining the best score is returned. In the case of OMIM, the score is calculated between all the synonyms and the MeSH terms, and the MeSH term with the best score is returned. With the use of synonyms, several MeSH terms can obtain the same score. In this case, all MeSH terms with the best score are considered as the result. ‘No match’ refers to disease names that have no word in common with any MeSH term of the Disease category.
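The selection of the best-scoring MeSH terms, including ties and the ‘no match’ case, can be sketched in Python as below. The scoring function is passed as a parameter; the `shared_fraction` function used in the example is a deliberately simple stand-in and not the similarity score of this work. All names and example terms are hypothetical.

```python
def best_mesh_terms(names, mesh_terms, score):
    """Return (best score, all MeSH terms reaching it) over every name and synonym.

    Pairs with no word in common are skipped, so (None, []) means 'no match'.
    """
    best_score, best_terms = None, []
    for name in names:
        name_words = set(name.lower().split())
        for term in mesh_terms:
            if not name_words & set(term.lower().split()):
                continue                          # no word in common with this term
            value = score(name, term)
            if best_score is None or value > best_score:
                best_score, best_terms = value, [term]
            elif value == best_score and term not in best_terms:
                best_terms.append(term)           # tie: keep every best term
    return best_score, best_terms


def shared_fraction(a, b):
    """Stand-in score: shared words divided by all distinct words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)


mesh = ["esophageal neoplasms", "stomach cancer", "liver cirrhosis"]
tied = best_mesh_terms(["esophageal cancer", "cancer of esophagus"], mesh, shared_fraction)
none = best_mesh_terms(["xyz"], mesh, shared_fraction)
```

The example shows why synonyms can produce ties: two different MeSH terms reach the same best score, and both are kept as the result.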
2.3. Benchmark
In order to evaluate the performance of the mapping program, we performed a manual mapping between a set of Swiss-Prot entries and MeSH terms, which was used as benchmark. The benchmark was created by taking the 92 disease comment lines present in 43 randomly selected Swiss-Prot entries. To ensure a representative sampling, these entries were chosen according to the results of a preliminary version of the mapping program. In this way, the benchmark follows the ratio of exact matches, partial matches and no matches obtained with the preliminary program.
The manual mapping was made, as far as possible, to a single MeSH term. When possible, the mapping was based on the term used in the disease comment line of the Swiss-Prot entry. If necessary, the referenced OMIM entry was used, and sometimes also other medical resources such as Orphanet [8], a freely accessible database of rare diseases and orphan drugs.
2.4. Evaluation
The procedure used to validate the mapping generated by the program consisted of comparing the manually selected descriptor with the automatically retrieved ones. The automatic mapping was considered correct if both descriptors were the same.
If more than one term had been either manually or automatically mapped, the mapping was considered correct if at least one of the manually mapped terms and one of the automatically mapped terms corresponded to the same MeSH descriptor.
In this particular context, the terminology used to analyze the results was the following:
• True Positive (TP): number of correct mappings above the threshold
• True Negative (TN): number of wrong mappings below the threshold
• False Positive (FP): number of wrong mappings above the threshold
• False Negative (FN): number of correct mappings below the threshold
• No Result (NR): number of unmapped diseases (no word in common with any MeSH term)
• Retrieved (R): number of mappings (correct or wrong) above the threshold
The formulas used to evaluate the global results of the mapping on the benchmark are the recall and the precision. To evaluate the mapping of all Swiss-Prot disease terms, the retrieval was used. The formulas are listed below:
Recall = TP / (TP+TN+FP+FN+NR)
Precision = TP / (TP+FP)
Retrieval = R / (TP+TN+FP+FN+NR)
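The counts and formulas above can be sketched in Python as follows. Two points are assumptions of the illustration: ‘above the threshold’ is taken as greater than or equal to the threshold, and a disease without any result is represented by a score of `None`. The function names and the example data are hypothetical.

```python
from collections import Counter


def classify(score, correct, threshold):
    """Assign one mapping to TP, FP, FN, TN or NR as defined in this section."""
    if score is None:
        return "NR"                       # no word in common with any MeSH term
    if score >= threshold:                # assumption: the threshold itself counts as above
        return "TP" if correct else "FP"
    return "FN" if correct else "TN"


def evaluate(results, threshold):
    """results: (score or None, mapping is correct) pairs, one per disease."""
    counts = Counter(classify(s, c, threshold) for s, c in results)
    total = sum(counts[k] for k in ("TP", "TN", "FP", "FN", "NR"))
    retrieved = counts["TP"] + counts["FP"]
    return {
        "recall": counts["TP"] / total if total else 0.0,
        "precision": counts["TP"] / retrieved if retrieved else 0.0,
        "retrieval": retrieved / total if total else 0.0,
    }


# Invented example: one mapping in each of the five classes.
results = [(0.9, True), (0.8, False), (0.2, True), (0.1, False), (None, False)]
metrics = evaluate(results, threshold=0.5)
```

Note that with these definitions the recall is computed over all diseases of the set, so that unmapped diseases and mappings below the threshold both lower it.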
3. Results
We used two parallel sources to map the Swiss-Prot annotation lines to MeSH. The first source was the disease extracted from the disease comment line and will be referred to as “SP”. The second source was the disease names of the corresponding OMIM entry cited in the disease comment line. This source will be referred to as “OMIM”.
First, we introduce the manual mapping in order to better understand the aim of the mapping and its difficulties. Then, we look at the principal points concerning the first step of the automatic mapping, the automatic extraction of the disease from the comment lines, and see how well it worked. The automatic mapping itself is then introduced. We discuss the choice of the threshold for the partial matching and finally we present the results of the program on the benchmark and on the whole of Swiss-Prot, with a comparison of the SP and OMIM sources. Combinations of SP and OMIM mappings are also presented, together with their goals and their results.
3.1. Manual Mapping
We manually mapped a set of 92 disease comment lines. This
manual mapping is of interest as it highlights the difficulties
that can help understand the complexity of terminology mapping,
automatic or not.
About three quarters of the disease comment lines did not raise any particular problem for the manual mapping. Either the disease had a corresponding entry in
MeSH for an identical concept, or, if it did not, choosing a corresponding entry was obvious, for example ‘short qt syndrome type 2’ corresponding to ‘arrhythmia’.
About a quarter (26) were more complicated to map, for different reasons. The most frequent reason was that the disease was not present in MeSH (40) and the choice of the corresponding descriptor was not obvious (25).
Among the latter cases, many were diseases or syndromes that affect many different parts of the body (10). The only entries in MeSH representing this kind of disease are ‘Abnormalities, Multiple’ and ‘Genetic Diseases, Inborn’. The problem with this kind of mapping is the loss of information. We looked at the approach chosen by MeSH to classify this kind of disease. To do so, we took one existing syndrome and looked under which descriptors it is classified. Marfan syndrome is thus the child of many different descriptors: ‘Abnormalities, Multiple’, ‘Bone Diseases, Developmental’, ‘Heart Defects, Congenital’, ‘Genetic Diseases, Inborn’ and ‘Connective Tissue Diseases’. These descriptors represent the range of expression of the disease, and can be numerous depending on the disease. In this work, as we want to map to one descriptor as much as possible, we chose to map to one of the two general descriptors (‘Abnormalities, Multiple’ and ‘Genetic Diseases, Inborn’). In this way, we avoided the problems arising from mapping to all possible descriptors. We chose ‘Abnormalities, Multiple’ because it conveys the information that the disease involves several different problems, which is not always the case for ‘Genetic Diseases, Inborn’. This choice was made even though the disorders under ‘Abnormalities, Multiple’ do not have to be genetic, but can also be chromosomal or only congenital. The only cases in which we mapped to a descriptor other than ‘Abnormalities, Multiple’ were those in which the different abnormalities affect only one system. This made the mapping easier and allowed us to retain some of the information given by the original disease term.
The other problematic mappings due to the absence of diseases in MeSH concerned 15 cases. These problems arise because a disease can sometimes be classified according to different aspects. For example, ‘long-chain 3-hydroxyl-coa dehydrogenase deficiency’ has been mapped to ‘lipid metabolism, inborn error’ but could also have been mapped to ‘mitochondrial diseases’.
The remaining case concerned an inconsistency in MeSH. The disease, ‘inclusion body myopathy 2’, is not directly represented in MeSH. The sporadic form is represented (‘inclusion body myopathy, sporadic’) under the descriptor ‘inclusion body myositis’, child of myositis (inflammation of muscle tissue). Unfortunately, the familial form, to which ‘inclusion body myopathy 2’ belongs, does not show signs of muscle inflammation. This particularity of the familial form is even mentioned in the definition of the ‘inclusion body myositis’ descriptor, but neither a term nor the classification corresponds to the familial form. Thus it cannot be classified under this descriptor. We have therefore mapped it to ‘myopathy’, which is elsewhere in the hierarchy.
Some disease comment lines have been mapped to several different
descriptors for other reasons than those mentioned above. Two cases
concern cancers and their classifications at both anatomical and
histological level. ‘Squamous cell carcinoma of the head and neck’
and ‘oral squamous cell carcinomas’ have thus been both mapped to
‘squamous cell carcinoma’, and to ‘head and neck cancer’ and ‘oral
cancer’ respectively. Two other cases of multiple mappings are
disease comment lines mentioning several diseases. As explained in
the methods, in cases of multiple mapping, the mapping to a single
correct term is considered as a successful mapping.
3.2. Disease extraction
The automatic approach presented in this work requires the extraction of the disease name from the disease comment line of Swiss-Prot. Given the number of these annotation lines (2966), an automatic extraction has been developed (see Methods 2.1.).
In only seven cases did the program extract nothing from the disease comment lines of Swiss-Prot (Table 2); none of them belongs to the benchmark.
AC     | Disease comment line
-------|---------------------------------------------------------------
P01008 | at-iii basel, tours/alger/amiens/toyama, rouen-1, -2, -3 and -4 have decreased (or lack) heparin-binding properties.
P22681 | can be converted to an oncogenic protein by deletions or mutations that disturb its ability to down-regulate rtks.
Q8WYB5 | a chromosomal aberration involving myst4 may be a cause acute myeloid leukemias. translocation t(10;16)(q22;p13) with crebbp.
P35226 | cooperates with the myc oncogene to produce b-lymphomas.
P78346 | sera from scleroderma patients recognize rpp38.
P78345 | sera from scleroderma patients recognize rpp38.
P28908 | Most specific hodgkin disease associated antigen.
Table 2. Comment lines from which a disease could not be
automatically extracted.
For the rest of Swiss-Prot (2959 disease comment lines), we
manually checked whether the diseases were correctly extracted, by
visually scanning the 2959 extracted diseases for wrong
extractions. We detected cases where the extraction did not work
correctly, mainly comment lines mentioning several diseases. The
number of such cases cannot be given precisely, as the check was
not exhaustive. The problem arises because the automatic extraction
does not handle lines containing several diseases. One such line is
present in the benchmark (‘KRT16 and KRT17 are coexpressed only in
pathological situations such as metaplasias and carcinomas of the
uterine cervix and in psoriasis vulgaris.’). On the other hand,
some diseases were more easily extracted, namely the diseases
caused by variants. In these cases the protein is the cause of the
disease and the expressions used to say so are similar (‘is the
cause of’, ‘causes’, etc.).
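To illustrate why lines built on these causal expressions are the easy case, here is a minimal sketch of a pattern-based extraction. It is not the program described in Methods 2.1.: the regular expression, the function name `extract_disease` and the example sentence are assumptions made for illustration only.

```python
import re

# Hypothetical pattern keyed on the causal expressions quoted in the text
# ('is the cause of', 'causes'). The real extraction rules are not shown here.
CAUSAL_PATTERN = re.compile(
    r"\b(?:is|are) (?:the|a) cause of\s+(?P<after_cause_of>[^.;(\[]+)"
    r"|\bcauses\s+(?P<after_causes>[^.;(\[]+)",
    re.IGNORECASE,
)


def extract_disease(comment_line):
    """Return the text following a causal expression, or None if none is found."""
    match = CAUSAL_PATTERN.search(comment_line)
    if match is None:
        return None
    name = match.group("after_cause_of") or match.group("after_causes")
    return name.strip().lower()


# Illustrative comment line (not copied from Swiss-Prot).
print(extract_disease("Defects in GNE are the cause of inclusion body myopathy 2 (IBM2)."))
# A line from Table 2 contains no causal expression, so nothing is extracted.
print(extract_disease("sera from scleroderma patients recognize rpp38."))
```

A single pattern of this kind returns at most one name per line, which is consistent with the difficulty reported above for comment lines that mention several diseases.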
3.3. Automatic mapping
The automatic mapping approach consisted of exact and partial
matching. We first had to choose a threshold for the similarity
score used in the partial match; this choice is presented in the
first part of this section. We then evaluated the benchmark mapping
with this threshold and tested different combinations of the
results obtained with SP diseases and OMIM diseases. Finally, we
tested the mapping on all the disease comment lines of
Swiss-Prot.
Figure 6 presents the most interesting mapping procedure, a
combination of the SP and OMIM sources, whose results are discussed
later in this section.
Figure 6. One of the automatic mapping procedures. [Figure not
reproduced. Box labels: Swiss-Prot entry - Disease comment line;
Disease extracted (SP); OMIM title and alternative titles; Exact
match (one per source); Partial match (one per source); Same
Descriptor; MeSH.]
Threshold set-up
Almost all disease names of the benchmark provided a mapping to a
MeSH term, either exact or partial. We used these results to set a
threshold on the score of the partial matches above which the
program considers them correct. This threshold affects both the
recall and the precision obtained. In our case, precision is what
we have to favor, because the automatic mapping should be
reliable. To choose the threshold, we considered the partial
matches of SP and OMIM separately and calculated the recall and the
precision of the system for different thresholds (graph 1). For the
same threshold, the level of recall for SP and OMIM disease names
is not exactly the same. This effect is most likely due to the
small number of terms with a partial match in the benchmark, 76 for
SP and 61 for OMIM. Therefore, we could not really justify the use
of two different thresholds. As we favor precision over recall, we
decided to take the value +1 for the threshold. This threshold
represents a good compromise between OMIM and SP, and it ensures a
precision of at least 80% for partial matches. Lowering the
threshold to 0, for instance, would enhance the recall without
lowering the precision too much. However, we would then approach
the region of the graph where there is an important drop in
precision. A bigger benchmark set is definitely needed to analyze
more precisely the effect of the threshold on the recall and
precision.
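The threshold sweep described above can be sketched as follows. This is a minimal illustration, not the program used in this work: the function name, the use of `>=` for "above the threshold" and the scored matches are assumptions, and the toy data are invented.

```python
def precision_recall(scored_matches, threshold, benchmark_size):
    """Precision and recall of partial matches retained at a given threshold.

    scored_matches: list of (score, is_correct) pairs, one per disease name
    whose best partial match was checked against the manual benchmark.
    Assumption: a match is retained when score >= threshold.
    """
    retained = [is_correct for score, is_correct in scored_matches if score >= threshold]
    correct = sum(retained)
    precision = correct / len(retained) if retained else None
    recall = correct / benchmark_size
    return precision, recall


# Invented scores, for illustration only.
toy_matches = [(2.5, True), (1.3, False), (1.0, True),
               (-0.5, True), (-2.8, True), (-3.3, False)]

for threshold in (1, 0, -1):
    print(threshold, precision_recall(toy_matches, threshold, benchmark_size=10))
```

Running the sweep over a range of thresholds (from –7 to +7 in this work) gives one recall-precision point per threshold, which is what graph 1 plots for SP and OMIM.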
Graph 1. Recall-Precision functions of the threshold, for SP and
OMIM (x axis: recall; y axis: precision). The threshold varies from
–7 to +7; the prominent dots correspond to a threshold of +1.
[Plot not reproduced.]
Evaluation of the benchmark set
Table 3 summarizes the recall and precision obtained on the
benchmark using a threshold of +1. The benchmark is composed of 92
disease comment lines, of which 82 have a cross-reference to OMIM
(89%). The results of OMIM are slightly better than those of SP
for both exact and partial matches. The results are nevertheless
quite similar for the exact matches, with an excellent precision
(100%) and a modest recall (around 20%). The partial matches have
recall values comparable to those of the exact matches, but it
should be noted that they are calculated only on the diseases that
did not obtain an exact match. This represents 76 SP diseases and
61 OMIM diseases. The precision is better with OMIM (90%), although
SP also has a fairly good precision of 80%. Adding the exact and
partial matches gave interesting results with OMIM, with a recall
of 43% and a precision of 95%.
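The percentages quoted above can be re-derived from the raw counts reported in Table 3. The sketch below assumes the reading that is consistent with those figures: retrieval and recall are expressed over the 92 benchmark lines, and precision is the share of retrieved mappings that are correct. The function name is hypothetical.

```python
BENCHMARK_SIZE = 92  # disease comment lines in the benchmark


def summarize(retrieved, correct, total=BENCHMARK_SIZE):
    """Return retrieval, recall and precision as rounded percentages."""
    return {
        "retrieval": round(100 * retrieved / total),
        "recall": round(100 * correct / total),
        "precision": round(100 * correct / retrieved),
    }


# Counts taken from the "Total" columns of Table 3.
print("SP  ", summarize(retrieved=36, correct=32))
print("OMIM", summarize(retrieved=42, correct=40))
```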
Given these results, we tried combining the results of SP and
OMIM to see the effect on the recall and the precision. We tested
two ways of combining them.
1) We considered the intersection of the mappings (SP ∩ OMIM),
designed to enhance the confidence of the mapping. We kept only the
partial matches where both SP and OMIM had a score above the
threshold and where the mapped terms from both sources corresponded
to the same descriptor. On the exact matches, the recall was
lowered without affecting the precision, which was already maximal.
On the partial matches, the precision rose to 100%, while the
recall fell by more than half. Summing exact and partial matches
thus gave a poor recall (20%) with an excellent precision (100%),
as expected.
2) We calculated the union of the mappings (SP ∪ OMIM), aiming to
enhance the coverage of the mapping. We kept the mappings where at
least one of the SP or OMIM matches had obtained a score above the
threshold. The results showed an increase in recall for the exact
matches, leading to more than a quarter of the benchmark correctly
mapped (29%). Compared to OMIM alone, the partial matching showed
the same recall (21%) with a worse precision (83% compared to 90%
for OMIM). Exact and partial matches together gave a recall of 50%
and a precision of 92%, which indeed corresponded to an enhancement
of the coverage, at the cost of a lower precision.
As the union of SP and OMIM had a precision of 100% for the
exact matches, we decided to combine it with the intersection of SP
and OMIM for the partial matching, enhancing the coverage with the
exact matches while keeping an excellent precision with the partial
matching. This association is the one represented at the beginning
of this section (figure 6). It resulted in a quite modest recall of
37%, but with a precision of 100%.
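The combined procedure (union of the exact matches, intersection of the partial matches) can be sketched as below. This is an illustration under stated assumptions, not the actual implementation: the `Match` record, the `combine` function, the use of `>=` for the threshold test and the example scores are all hypothetical. The text does not say how two exact matches pointing to different descriptors are resolved, so the sketch arbitrarily prefers the SP one.

```python
from typing import NamedTuple, Optional


class Match(NamedTuple):
    """Best MeSH match found for one source (hypothetical record)."""
    descriptor: str
    score: float = 0.0   # similarity score, only meaningful for partial matches
    exact: bool = False  # True when the disease name matched a MeSH term exactly


def combine(sp: Optional[Match], omim: Optional[Match], threshold: float = 1.0) -> Optional[str]:
    """Union on exact matches, intersection on partial matches."""
    # Union: an exact match from either source is accepted.
    # Assumption: SP is checked first when both sources have an exact match.
    for candidate in (sp, omim):
        if candidate is not None and candidate.exact:
            return candidate.descriptor
    # Intersection: both sources must score above the threshold
    # and agree on the same descriptor.
    if sp is None or omim is None:
        return None
    if (sp.score >= threshold and omim.score >= threshold
            and sp.descriptor == omim.descriptor):
        return sp.descriptor
    return None


# Illustrative calls with invented scores.
print(combine(None, Match("Noonan Syndrome", exact=True)))
print(combine(Match("Erythrocytosis", 1.5), Match("Erythrocytosis", 2.0)))
print(combine(Match("Cirrhosis", 1.3), Match("Liver Cirrhosis", 1.8)))
```

Requiring agreement between the two sources is what removes wrong partial matches that score above the threshold in only one source, at the price of the lower recall reported above.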
With the threshold set to 0, this last combination showed a
slightly better recall (39% instead of 37%) but no longer a perfect
precision (95%). When we lowered the threshold to –1, the recall
went up to 50% with a precision of 91%. All these results were
satisfying but, as mentioned, firmer conclusions on the differences
between these scores can only be drawn once a larger data set is
evaluated.
Benchmark: 92 disease comment lines, 82 with OMIM

          | Exact match                      | Partial match                    | Total
          | Retrieval | Recall   | Precision | Retrieval | Recall   | Precision | Retrieval | Recall   | Precision
----------|-----------|----------|-----------|-----------|----------|-----------|-----------|----------|----------
SP        | 16 (17%)  | 16 (17%) | 100%      | 20 (22%)  | 16 (17%) | 80%       | 36 (39%)  | 32 (35%) | 89%
OMIM      | 21 (23%)  | 21 (23%) | 100%      | 21 (23%)  | 19 (21%) | 90%       | 42 (46%)  | 40 (43%) | 95%
SP ∩ OMIM | 10 (11%)  | 10 (11%) | 100%      | 8 (9%)    | 8 (9%)   | 100%      | 18 (20%)  | 18 (20%) | 100%
SP ∪ OMIM | 27 (29%)  | 27 (29%) | 100%      | 23 (25%)  | 19 (21%) | 83%       | 50 (54%)  | 46 (50%) | 92%
Table 3. Benchmark results with threshold equal to +1.

Mapping of the whole Swiss-Prot set
The next step of this work consisted in using the program to map
the whole of Swiss-Prot to MeSH. This represented 2966 disease
comment lines, of which 2173 had an OMIM reference, that is 73% of
the Swiss-Prot disease comment lines, which is lower than the
percentage in the benchmark set (89%).
No recall or precision could be calculated, as the mappings were
not validated. Only the retrieval values could be calculated; they
are presented in Table 4 and can be compared to the retrieval
values obtained on the benchmark set (Table 3).
The retrieval percentages of the whole Swiss-Prot were
comparable to the benchmark ones. This suggested that the latter
was a quite representative sample despite its small size. The only
important difference concerned the lower retrieval of OMIM matches
in the whole Swiss-Prot set (37% compared to 46% in the benchmark).
This can be explained by the lower proportion of OMIM references
compared to the benchmark set.
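Whole Swiss-Prot: 2966 disease comment lines, 2173 with OMIM

          | Exact match                     | Partial match                   | Total
          | Retrieval | Recall | Precision | Retrieval | Recall | Precision | Retrieval  | Recall | Precision
----------|-----------|--------|-----------|-----------|--------|-----------|------------|--------|----------
SP        | 483 (16%) | -      | -         | 634 (21%) | -      | -         | 1117 (38%) | -      | -
OMIM      | 610 (21%) | -      | -         | 483 (16%) | -      | -         | 1093 (37%) | -      | -
SP ∩ OMIM | 292 (10%) | -      | -         | 237 (8%)  | -      | -         | 529 (18%)  | -      | -
SP ∪ OMIM | 794 (27%) | -      | -         | 640 (22%) | -      | -         | 1434 (48%) | -      | -
Table 4. Whole Swiss-Prot results with threshold equal to +1.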
The study of some particular cases among the exact matches of the
whole Swiss-Prot set gave more insight into the mapping problems.
Seven cases appeared where the exact matches of SP and OMIM mapped
to different descriptors. These cases are reported in Table 5. Two
of them were due to errors in the Swiss-Prot annotation, which have
since been corrected.
SP disease | OMIM disease | SP-OMIM discrepancy origin | Note
-----------|--------------|----------------------------|-----
Turner syndrome | Noonan syndrome | Human error in Swiss-Prot annotation (corrected) | Turner syndrome is a chromosomal abnormality, cannot be caused by a variant
Hepatoerythropoietic porphyria | Porphyria cutanea tarda | Difference in classification | OMIM: hepatoerythropoietic porphyria included title of porphyria cutanea tarda. MeSH: different descriptors
Multiple carboxylase deficiency | Multiple carboxylase deficiency, neonatal form | Lack of precision in Swiss-Prot annotation (corrected) | Multiple carboxylase deficiency has two forms, early and late onset, involving two types of enzymes. The early onset one corresponds to this entry
Enchondromatosis | Osteochondromatosis | Difference in classification | OMIM: alternative titles. MeSH: different descriptors
Protein C deficiency | Thrombophilia | Automatic mapping: use of second OMIM reference | SP: second link to OMIM in the disease comment line, thrombophilia, corresponding to the type of disease to which protein C deficiency belongs
Cholesteryl ester storage disease | Wolman disease | Difference in classification | Allelic variants. OMIM: alternative titles. MeSH: different descriptors
Cholelithiasis | Cholecystitis | Difference in classification | Not synonyms. OMIM: alternative titles. MeSH: different descriptors
Table 5. Different exact matches between SP and OMIM.
4. Discussion
In this project, we have developed an automatic approach to map
the disease comment lines of Swiss-Prot to the MeSH terminology,
using exact matches and partial matches based on a similarity
score.
In this part, we will discuss the mappings obtained with the
similarity score threshold equal to +1, with the exact matches
obtained from the union of SP and OMIM, and the partial matches
obtained from the intersection of SP and OMIM (see Figure 6, in
Results). This combination obtained a recall of 37% and a
precision of 100%. A table of these detailed mappings is available
on the web (URL:
“http://intranet.isb-sib.ch/pages/viewpageattachments.action?pageId=2590040”).
First, we can notice the absence of false positives. Nevertheless,
many false negatives are present, and some true negatives have
quite high scores. These false negatives were not mapped either
because their score was under the threshold of +1 or because the
corresponding mapping from the other source was under the threshold
or pointed to a different descriptor (see SP ∩ OMIM in Results).
The low recall obtained can be explained by the presence of
these false negatives. They come principally from small differences
of granularity between the disease to map and the corresponding
MeSH term. For example, ‘arthrogryposis, distal, type 7’ correctly
maps to arthrogryposis but with a score of –2.8, and ‘familial
erythrocytosis type 1’ maps to erythrocytosis with a score of –1.6.
Other cases concern diseases imperfectly extracted from the disease
comment lines, such as ‘variety of human tumors’, which maps to
tumors but with a score of –3.3. These problems could be partly
resolved by using a supplementary source to calculate the frequency
of the words used in the similarity score. This would lower the
influence of words such as ‘type’ or ‘variety’ compared to
‘arthrogryposis’, ‘erythrocytosis’ or ‘tumors’.
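One standard way to obtain such word frequencies is inverse document frequency (IDF), as described in information retrieval textbooks. The sketch below only illustrates that idea; it is not the similarity score used in this work, and the small corpus, the tokenizer and the function names are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Lowercase a disease name and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", name.lower())


def idf_weights(names):
    """Inverse document frequency of each word over a list of disease names.

    Words found in many names (e.g. 'type') get a low weight; words found
    in few names (e.g. 'arthrogryposis') get a high weight.
    """
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(tokenize(name)))  # count each word once per name
    total = len(names)
    return {word: math.log(total / df) for word, df in doc_freq.items()}


# Tiny illustrative corpus; a real supplementary source would be much larger.
corpus = [
    "arthrogryposis, distal, type 7",
    "familial erythrocytosis type 1",
    "epidermolysis bullosa simplex, weber-cockayne type",
    "variety of human tumors",
]
weights = idf_weights(corpus)
print(weights["type"], weights["arthrogryposis"])
```

With weights of this kind, a mismatch on ‘type’ would cost less than a mismatch on ‘arthrogryposis’, which is the effect sought in the paragraph above.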
Concerning the true negative mappings, their origin is
principally the absence in MeSH of a close corresponding
descriptor. In these cases a simple similarity score is not
effective because of the lack of common words between the disease
to map and the MeSH descriptor. For example, ‘Autosomal dominant
weill-marchesani syndrome’ has no chance of mapping to
‘Abnormalities, multiple’. Sometimes, even when the correct
descriptor is not very distant, a disease is not mapped to it for
the same reason. For example, ‘gnathodiaphyseal sclerosis’ was
mapped to ‘sclerosis’ instead of ‘osteochondrodysplasias’. Among
these true negative mappings, the interesting ones were those that
obtained good scores. Their presence prevented us from lowering
the threshold further to increase the recall, as this would
inevitably lower the precision. For example, ‘Cirrhosis, familial’
obtains a score of 1.3 on the MeSH descriptor ‘cirrhosis’.
Unfortunately, this descriptor refers to the phenomenon of fibrosis
in general, while the OMIM entry refers to fibrosis of the liver.
The correct descriptor in MeSH is thus ‘liver cirrhosis’. The
automatic mapping is therefore wrong, owing to a difference of
definition between OMIM and MeSH. Another true negative with a
high score, ‘epidermolysis bullosa dystrophica, cockayne-touraine
type’, comes from a difference of classification between OMIM and
MeSH. Indeed, ‘epidermolysis bullosa dystrophica,
cockayne-touraine type’ is classified under a different descriptor
than ‘epidermolysis bullosa simplex weber-cockayne type’, which is
the SP disease, whereas OMIM classifies them in the same entry.
‘Inclusion body myopathy 2’ also raised difficulties during the
manual mapping because of an inconsistency in MeSH (see 3.1. Manual
mapping).
The classification conflicts between OMIM and MeSH seem
impossible to resolve. They probably come from the fundamental
differences between the two resources. Indeed, OMIM is not strictly
a terminology; it is a database. Moreover, its entries formally
correspond to phenotypes. Thus, its disease classification does not
have to be as strict as in a terminology. This can probably explain
why two diseases can be alternative titles in OMIM and different
descriptors in MeSH. A question we could try to answer is whether
the alternative titles of OMIM are really useful to enhance the
recall, or whether they lower the precision too much. Studying the
recall and precision according to the use of principal or
alternative titles of OMIM may answer this question.
Concerning the wrong mappings due to the poor coverage of genetic
diseases in MeSH, different solutions can be considered. For
example, we could use another terminology, such as SNOMED CT, which
could be more complete than MeSH. A mapping to SNOMED CT could
also help map diseases to ICD, as a mapping between SNOMED CT and
ICD-9 already exists. Indeed, ICD can hardly be avoided because of
its widespread use in hospital data. But its lack of granularity,
compared to the other terminologies, will probably make a mapping
from Swiss-Prot and OMIM difficult. If we use another terminology
such as SNOMED CT or ICD, the UMLS tools will probably be useful to
handle lexical variants, which were handled here by MeSH and its
numerous term variants. Another solution would be to wait for MeSH
to be completed with respect to genetic diseases, which is the aim
of an ongoing project. Finally, in order to map the diseases to a
general term such as ‘Abnormalities, multiple’, we could try to use
the hierarchy of MeSH to reach these general descriptors. To
achieve this, we could draw on a work mapping OMIM expression terms
to MeSH [van Driel et al., 2006], which calculated a similarity
score taking into account the MeSH hierarchy.
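As a minimal sketch of how the MeSH hierarchy could be used to reach more general descriptors: MeSH tree numbers are dot-separated, and each ancestor of a descriptor is obtained by dropping trailing segments. The helper names and the tree numbers in the example are illustrative assumptions; they are not taken from this work or from van Driel et al.

```python
def ancestor_tree_numbers(tree_number):
    """List the ancestors of a MeSH tree number, from closest to most general."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def common_ancestor(tree_number_a, tree_number_b):
    """Return the deepest tree number shared by both positions, or None."""
    shared = []
    for seg_a, seg_b in zip(tree_number_a.split("."), tree_number_b.split(".")):
        if seg_a != seg_b:
            break
        shared.append(seg_a)
    return ".".join(shared) or None


# Illustrative tree numbers (not checked against the current MeSH release).
print(ancestor_tree_numbers("C16.131.077"))
print(common_ancestor("C05.116.099", "C05.116.540"))
```

A hierarchy-aware score could then, for instance, reward candidate descriptors whose ancestors overlap with those of descriptors matched by other evidence, instead of relying on shared words alone.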
Whichever solution we adopt, we will most likely have to develop
a specific approach for the diseases for which we will never find
corresponding terms in any terminology. Indeed, there is an
important loss of information when mapping a disease to a general
term. Handling these cases will probably involve mapping the
disease to concepts corresponding to its pathological expressions.
This task will require other data besides the name of the disease,
for example the full text description of the disease, the
expression terms provided by OMIM, etc.
Finally, it is conceivable that a combination of different
methods, including the one presented in this work, will enable a
complete mapping of Swiss-Prot.
5. Conclusion
In this work, we have developed an automatic approach to map
Swiss-Prot disease annotations to the medical terminology MeSH. The
results obtained are encouraging, despite a moderate recall.
Solutions exist for the improvement of our mapping. The choice of
appropriate approaches will depend on the characterization of the
future use of our work.
Appendix
A. Definitions
• Controlled vocabulary: list of terms that have been enumerated
explicitly. Each term should in theory have a specified definition,
although this is not always the case.
• Taxonomy: collection of controlled vocabulary terms organized
into a hierarchical structure. The relations are of parent-child
type (whole-part, genus-species, type-instance). If a term appears
in different places in the taxonomy, it should be the same, with
the same children.
• Thesaurus: networked collection of controlled vocabulary terms.
Uses associative (not hierarchical) relationships in addition to
parent-child relationships.
• Ontology: can refer to many different things, such as
glossaries & data dictionaries, thesauri & taxonomies, schemas &
data models, and formal ontologies & inference. A formal ontology
is a controlled vocabulary expressed in an ontology representation
language. This language has a grammar for using vocabulary terms to
express something meaningful within a specified domain of
interest.
• Meta-model: an explicit model of the constructs and rules
needed to build specific models within a domain of interest. A
valid meta-model is an ontology, but not all ontologies are modeled
explicitly as meta-models.
B. Database
The tables in the upper part of figure A1 contain MeSH
information (all the data used in this work are contained in the
tables Term and Descriptor):
• DESCRIPTOR: descriptor of MeSH
• TREENUMBER: position of a given descriptor in the thesaurus
hierarchy
• CONCEPT: concept included in a descriptor
• CONCEPTUMLS: identifier of the corresponding concept in UMLS
• SEMANTICTYPE: identifier and name of semantic type from UMLS
• CONCEPT_SEMANTICTYPE: association table between concept and
semantictype
The tables in the lower part of figure A1 contain Swiss-Prot
information:
• SWISSPROT: Swiss-Prot entry
• SPDISEASE: Swiss-Prot annotation line. The ‘digit’ key is an
artificial identifier of the disease comment lines, useful because
we map the disease comment lines, which do not have their own
identifier in Swiss-Prot.
• OMIM: OMIM (phenotype and phenotype + gene) entry
• SPDISEASE_OMIM: association table between omim and spdisease
Figure A1.
Bibliography
A. Bairoch, L. Yip, L. Famiglietti. The UniProtKB/Swiss-Prot protein knowledgebase in the context of human molecular medical research. Version of February 2007 (unpublished).
B. Boeckmann, M.-C. Blatter, L. Famiglietti, U. Hinz, L. Lane, B. Roechert, A. Bairoch. Protein variety and functional diversity: Swiss-Prot annotation in its biological context. C.R. Biologies 328:882-899. 2005.
A. Hamosh, A.F. Scott, J.S. Amberger, C.A. Bocchini, V.A. McKusick. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research 33(Database issue):D514-7. January 2005.
A. Savage. Changes in MeSH Data Structure. NLM Tech Bulletin (313):e2. March-April 2000.
S.J. Nelson, D. Johnston, B.L. Humphreys. Relationships in Medical Subject Headings. In: C.A. Bean, R. Green, editors. Relationships in the organization of knowledge. New York: Kluwer Academic Publishers; p.171-184. 2001.
O. Bodenreider. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(Database issue):D267-70. January 2004.
B.L. Humphreys, A.T. McCray, M.L. Cheh. Evaluating the coverage of controlled health data terminologies: report on the results of the NLM/AHCPR large scale vocabulary test. Journal of American Medical Informatics Association 4(6):484-500. November-December 1997.
K.M. O'Keefe, M. Sievert, J.A. Mitchell. Mendelian inheritance in man: diagnoses in the UMLS. Proc Annu Symp Comput Appl Med Care:735-9. 1993.
I.N. Sarkar, M.N. Cantor, R. Gelman, F. Hartel, Y.A. Lussier. Linking Biomedical Language Information and knowledge Resources in the 21st Century: GO and UMLS. Pacific Symposium on Biocomputing 8:439-450. 2003.
Y.A. Lussier, J. Li. Terminological Mapping for High Throughput Comparative Biology of Phenotypes. Pacific Symposium on Biocomputing 9:202-213. 2004.
C.D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. Cambridge University Press. 2007.
M.A. van Driel, J. Bruggeman, G. Vriend, H.G. Brunner, J.A. Leunissen. A text-mining analysis of the human phenome. Eur J Hum Genet 14(5):535-42. May 2006.
URL:
[1] http://www.infobiomed.org/
[2] http://www.ncbi.nlm.nih.gov/omim/
[3] http://www.metamodel.com
[4] http://www.nlm.nih.gov/mesh/
[5] http://www.snomed.org/
[6] http://www.who.int/classifications/icd/en/
[7] http://www.nlm.nih.gov/research/umls/about_umls.html
[8] http://www.orpha.net/