Mapping disease annotation in Swiss-Prot to Medical terminology MeSH · 2007. 4. 18. · in Swiss-Prot to Medical terminology MeSH In the frame of a Master in Proteomics and Bioinformatics


Mapping disease annotation in Swiss-Prot to Medical terminology MeSH

In the frame of a Master in Proteomics and Bioinformatics

University of Geneva 2006/2007

Anaïs Mottaz

Supervised by: Anne-Lise Veuthey, Yum Lina Yip

Swiss-Prot research Group, SIB


Summary

The objective of this work is to find an automatic way to link the Swiss-Prot knowledgebase to the medical thesaurus MeSH. We take advantage of the more than 2000 disease-related annotations present in the database, as well as their references to OMIM. We first mapped to MeSH terms, separately, the disease terms automatically extracted from the annotations and the disease names obtained through the OMIM references. We mapped diseases by exact match when possible; otherwise we used a similarity score developed specifically for this work to retrieve the closest concept in the terminology. We then evaluated the mapping against a set of manually mapped entries that serves as a benchmark. The OMIM references gave better results than the Swiss-Prot (SP) disease terms: a recall of 43% and a precision of 95% with OMIM, against a recall of 35% and a precision of 89% with SP. When we combined the SP and OMIM mappings, we reached a precision of 100% with a recall of 37%. By lowering the similarity score threshold and using another SP and OMIM combination to enhance the mapping coverage, we obtained a recall of 59% with a precision of 89%. The low recall values are mainly due to the absence of MeSH terms for many genetic diseases present in OMIM and Swiss-Prot. This mapping represents a good basis for the development of a complete mapping of Swiss-Prot to medical terminologies.


Table of contents

Summary
1. Introduction
    1.1. Disease annotation in Swiss-Prot
    1.2. OMIM
    1.3. Medical terminologies
    1.4. MeSH
    1.5. Terminology mapping
2. Methods
    2.1. Extraction of disease names
    2.2. Mapping algorithm
    2.3. Benchmark
    2.4. Evaluation
3. Results
    3.1. Manual mapping
    3.2. Disease extraction
    3.3. Automatic mapping
4. Discussion
5. Conclusion
Appendix
Bibliography


1. Introduction

Medicine is in constant evolution. It has changed from exclusively symptom-based diagnosis and treatment to more elaborate care management (from prevention to diagnosis) involving molecular biology data. Treatments are increasingly adapted to the specificity of the patient and of the disease, and this specificity involves, among other things, molecular information. Blood markers that reveal cancer and treatment protocols adapted to the molecular type of cancer are examples of this evolution. Meanwhile, bioinformatics resources, which are devoted to storing, processing and analyzing molecular biology information, contain more and more data related to medicine. These growing interactions require a standardized way of communicating results between the two fields.

The use of a common terminology is required to help the integration of molecular biology data at the clinical level. The UniMed project fits into this context. Its aim is to link the UniProt knowledgebase, a protein database, to disease terminology in order to enhance the interoperability between bioinformatics and medical resources and to provide a link between gene and phenotype. This area is of interest, as indicated by the existence of other projects such as INFOBIOMED [1], whose aim is to develop biomedical informatics, bridging the gap between medical informatics and bioinformatics. The work presented here is a first step toward this goal. We took advantage of the disease-related annotations present in Swiss-Prot, the manually annotated part of UniProtKB, and developed an automatic mapping between these annotations and the biomedical thesaurus MeSH. We tested two different approaches, one using the Swiss-Prot text describing the involvement of the protein in a disease, and the other using the Swiss-Prot references to OMIM, a gene and phenotype database.

We present first the Swiss-Prot database and its data concerning diseases, followed by a presentation of OMIM, then a small introduction on the principal existing medical terminologies, and finally some related works on the mapping of medical terminologies.

1.1. Disease annotation in Swiss-Prot

UniProtKB/Swiss-Prot is a protein sequence database containing high-quality, manually annotated and non-redundant protein sequence records. It is composed of more than 260 000 entries, among which nearly 16 000 are human gene products. Manual annotation consists of the analysis, comparison and merging of all available sequences for a given protein, as well as a critical review of associated data from the literature [Bairoch et al., 2007]. The information concerns many aspects of the protein, such as its function, its structure, its variants and its involvement in disease.

Around 2 600 entries contain annotation relating to the involvement of the protein in one or more diseases, of which more than 2 000 are human.

Medical annotation appears in different fields: the description of associated diseases in the comment lines, the data on variants in the feature lines, the keywords, and the cross-references to OMIM, to DrugBank and to many other databases related to human genes and proteins [Bairoch et al., 2007]. In this work we focus on the disease comment lines and the cross-references to OMIM.


Figure 1. Example of the disease comment lines of one Swiss-Prot entry

Figure 1 presents an example of disease comment lines in Swiss-Prot, which are the Swiss-Prot data used in this work. There can be several disease comment lines per entry. First, more than one disease-causing variant can exist for a protein. Second, a given disease-causing variant can have different phenotypic expressions, depending on the genome and on the environment [Boeckmann et al., 2005]. Note that in this work we mapped the disease comment lines, not the entries.

1.2. OMIM

OMIM is a public knowledgebase of human genes and genetic disorders [Hamosh et al., 2005]. Each OMIM entry has a full-text summary of a genetically determined phenotype and/or gene and has numerous links to other genetic databases. An OMIM entry includes a primary title, which is the principal name of the disease, alternative titles, which are the synonyms, and ‘included’ titles, which are related but not synonymous information [2]. Different types of entries exist:

• Gene of known sequence (*)
• Phenotype that does not represent a unique locus (#)
• Gene of known sequence and a phenotype (+)
• Mendelian phenotype or phenotypic locus for which the underlying molecular basis is not known (%)
• Phenotype for which the mendelian basis, although suspected, has not been clearly established or when the separateness of this phenotype from that in another entry is unclear (no sign)

The references of Swiss-Prot to OMIM in the disease comment lines can be present or absent, depending on the existence of the corresponding entry in OMIM. Among the disease comment lines, about 73% have a reference to an OMIM entry. The entries referenced are always of the phenotype (#) or phenotype and gene (+) type.


Sometimes there is more than one reference to OMIM in a single disease comment line. This is the case when the annotation references a closely related disease to complete the description of the disease. It also happens when the annotation is about a disease whose subtypes are classified under different OMIM entries.

1.3. Medical terminologies

Terminology refers to the set of words employed in a domain. This set of words can be organized in different structures that are referred to as controlled vocabulary, taxonomy, thesaurus, ontology or meta-model. The role of all these structured sets of vocabulary is to help structure, classify, model, and represent the concepts and relationships pertaining to some subject matter and to enable a community to come to agreement and to commit to use the same terms in the same way [3]. Definitions of these terms are provided in the Appendix A.

Here is an introduction to the most common medical vocabularies, except MeSH [4], to which the next section is dedicated.

• SNOMED CT - "Systematized Nomenclature of Medicine Clinical Terms" is developed by the College of American Pathologists. This terminology is designed to deal with clinical information. It covers diseases, clinical findings and procedures [5]. Its coverage of diseases is extensive, as reported in [B.L. Humphreys et al., 1997].

• ICD - "International Classification of Disease" is published by the World Health Organization. It is used worldwide to deal with diagnostic information, for example epidemiological and health management data, particularly in hospital records [6]. The most recent version is ICD-10. Its disease coverage, especially concerning genetic diseases, has been shown to be lower than that of other terminologies [K.M. O’Keefe et al., 1993].

• UMLS - NLM produces and distributes the Unified Medical Language System Knowledge Sources (databases) and associated software tools (programs) [7]. The major component of UMLS is the Metathesaurus, a repository of inter-related biomedical concepts, as shown in Figure 2. Two other knowledge sources in the UMLS are the Semantic Network, which provides high-level categories used to categorize every Metathesaurus concept, and lexical resources including the SPECIALIST lexicon and programs for generating the lexical variants of biomedical terms. These programs include Norm, which generates the normalized form of a term; WordInd, which decomposes strings into words; and Lvg, which generates lexical variants [O. Bodenreider, 2004]. MMTx, a program accessible as a web service, maps text to Metathesaurus concepts.


Figure 2. The various subdomains integrated in the UMLS.

In this work, UMLS was not used because of its complexity, SNOMED CT because it is not freely available, and ICD because of its lack of granularity. All of these resources will nevertheless be used in future work.

1.4. MeSH

The National Library of Medicine (NLM) has produced the Medical Subject Headings since 1960. The MeSH thesaurus is NLM's controlled vocabulary for subject indexing and searching of journal articles in MEDLINE, as well as books, journal titles, and non-print materials in NLM's catalog [Nelson et al., 2001]. It consists of a set of nouns called descriptors, organized in a hierarchical structure that allows searching at different levels of specificity. The last major change to its structure was made in 2000 [Savage A., 2000].

MeSH has three major components:

• Main Headings (Descriptors)
• Subheadings (Qualifiers)
• Supplementary Concept Records (formerly ‘Supplementary Chemical Records’)

This work only used the Main Headings. The other components are described for information only.

Main Headings

They are the index terms used to indicate the main subject and characterize the content. They are also called Descriptors.

Subheadings

Also called qualifiers, they are used in conjunction with descriptors to provide a means of grouping together the citations that are concerned with a particular aspect of a subject. For example, a complex concept that is not pre-coordinated in a single descriptor can be represented by combining a descriptor with a qualifier: in Monoamine Oxidase/deficiency, Monoamine Oxidase is the descriptor and deficiency the qualifier. Each descriptor has a list of allowable qualifiers, in order to prevent wrong associations. Descriptors can also be coordinated with other descriptors, but then there is a risk of wrong associations.

Supplementary Concept Records

They are supplementary headings, each linked to a descriptor. They serve to control the use of substance names during indexing.

As mentioned above, the Descriptors are the component of MeSH used in this work. They are organized in 16 categories, such as Anatomy, Organisms, Diseases, and Chemicals and Drugs. Only the Disease category is used in this work. Descriptors are arranged in a specificity hierarchy containing up to eleven levels. The links are not clearly defined (not strictly ‘is a’ or ‘part of’). A descriptor can appear in several places in the hierarchy, and its children can vary according to its place in the hierarchy.

Each descriptor contains a set of concepts that are not equivalent but not (or not yet) distinct enough to be descriptors themselves. The descriptor itself is one of its concepts, the preferred concept. The other concepts have a defined link to the preferred concept (narrower, broader or related).

Each concept itself contains a set of terms, which are synonyms and lexical variants. The concept is one of its terms, the preferred term.

Figure 3 shows, for one descriptor, its relations to concepts and terms. In this work we only used the terms, to which we mapped the Swiss-Prot annotation lines, and the descriptors, to which we referred for the validation of our automatic mapping.

Figure 3. Relations between terms, concepts and descriptors in MeSH.


Figure 5 shows the descriptor ‘Esophageal Neoplasms’ in one of its places in the ‘Diseases’ tree. Figure 4 shows the entry corresponding to this descriptor, as it appears in the MeSH browser accessible on the web.

Figure 4. Example of MeSH entry as it appears with the web MeSH browser.

Figure 5. The entry of Figure 4 represented in the descriptor hierarchy.


1.5. Terminology mapping

In the domain of terminology mapping, ontology mapping has been the focus of a variety of works originating from diverse communities over a number of years. Indeed, this kind of mapping could provide a common layer through which several ontologies could be accessed and could hence exchange information in a semantically sound manner. Concerning medical terminologies, an important effort has already been made in this direction: UMLS is the result of merging several medical vocabularies, including a small part of OMIM. Current efforts in this field tend to integrate biological terminologies and biological data, as well as phenotypic data and medical terminologies, often within UMLS.

Given the need for a representation of genes and gene products in the UMLS, a method has been proposed to map the Gene Ontology to UMLS [Sarkar et al., 2003]. It uses three mapping approaches: (1) exact string matching, (2) partial matching using two tools provided by UMLS, namely Norm, a lexical tool, and MMTx, which automatically maps text to UMLS concepts, and (3) a mapping based on similarities using a BLAST algorithm. The best results were obtained using exact matches and the Norm tool.

Motivated by the pressing demand for technologies enabling greater integration of phenotypic data and for phenotype-centric discovery tools to facilitate biomedical research, an attempt has been made to map Phenoslim, a phenotype vocabulary developed by the Mouse Genome Database, to SNOMED CT [Lussier and Li, 2004]. The proposed method relies on the decomposition of the Phenoslim concepts and on normalization with Norm.

These works have in common the mapping of terms to concepts in UMLS or in another medical terminology such as SNOMED CT. The mapping problems encountered are thus often the same. Here is a summary of the principal difficulties, with some illustrative examples:

• Synonymy
  Ex: ‘Primary adrenal insufficiency’ and ‘Addison disease’

• Morphology
  - Inflection
    Ex: ‘cancer, esophageal’ and ‘cancers, esophageal’
  - Derivation (laryngeal / larynx)
    Ex: ‘cancer, esophageal’ and ‘cancer, esophagus’

• Orthography
  - Spelling variants
    Ex: ‘cancer, esophageal’ and ‘cancer, oesophageal’

• Syntax
  - Complementation for verbs, nouns and adjectives
    Ex: ‘cancer, esophageal’ and ‘cancer of the esophagus’

• Level of specificity
  Ex: ‘esophageal cancers’ narrower than ‘esophageal neoplasms’ narrower than ‘gastrointestinal neoplasms’

• Context
  Ex: ‘human embryo’ and ‘mammalian embryo’. The embryo is defined in the human context as an organism at stage

• Definition conflict
  Ex: ‘cirrhosis’. Cirrhosis is sometimes defined as a chronic interstitial inflammation of any tissue or organ, and more often as a chronic disease of the liver, the liver cirrhosis.

Some of these problems are resolved in this work by the synonyms and variants that MeSH provides through the different terms grouped under each concept. Syntactic variants are further handled by decomposing the disease names into words to calculate the partial match. Only the context and definition conflicts are not treated here.

2. Methods

All the programs developed in this project are written in Perl (v5.8.5). We extracted diseases from the human entries of Swiss-Prot (version 51.0). This represents 2033 human Swiss-Prot entries with at least one disease comment line, and 2966 disease comment lines, of which 2179 have a cross-reference to an OMIM entry of the phenotype (#) or phenotype and gene (+) type. The OMIM version is that of September 2006. It contains 2306 entries of the phenotype and phenotype and gene types. The entry fields used for mapping are “Title”, “Alternative titles” and “Included titles”.

We used the category ‘Disease’ of the MeSH (version 2006 in XML file format). This category contains 4150 descriptors, 7146 concepts and 38193 terms.

A database was created to store all the necessary information from the three resources (Swiss-Prot, MeSH, OMIM) and to easily retrieve the data required by the mapping program. It is implemented using the PostgreSQL (8.2.3) database management system. The schema of the database is given in the Appendix (C. Graph 1). The information in this database was extracted from the Swiss-Prot and OMIM flat files using Perl regular expressions, and from the MeSH XML file using the Twig Perl library. The database contains more information than is strictly used by the programs: it was designed to hold information that could be useful in the next steps of the project. For example, the semantic types of the MeSH concepts are stored although we do not use them.
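As an illustration of the MeSH extraction step, the following minimal sketch collects (descriptor, concept, term) triples for descriptors of the disease tree. It is written in Python with the standard `xml.etree.ElementTree` module rather than the Perl and Twig code used in the project. The element names are assumptions about the MeSH descriptor XML layout and should be checked against the actual file; the identifier and tree numbers in the sample are placeholders, and the function name `disease_terms` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Heavily simplified sample; identifiers and tree numbers are placeholders.
SAMPLE = """
<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D0000001</DescriptorUI>
    <DescriptorName><String>Esophageal Neoplasms</String></DescriptorName>
    <TreeNumberList><TreeNumber>C00.000</TreeNumber></TreeNumberList>
    <ConceptList>
      <Concept>
        <ConceptName><String>Esophageal Neoplasms</String></ConceptName>
        <TermList>
          <Term><String>Esophageal Neoplasms</String></Term>
          <Term><String>Cancer of Esophagus</String></Term>
        </TermList>
      </Concept>
    </ConceptList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorUI>D0000002</DescriptorUI>
    <DescriptorName><String>Esophagus</String></DescriptorName>
    <TreeNumberList><TreeNumber>A00.000</TreeNumber></TreeNumberList>
    <ConceptList/>
  </DescriptorRecord>
</DescriptorRecordSet>
"""


def disease_terms(xml_text):
    """Return (descriptor, concept, term) triples for disease descriptors.

    Assumption: disease descriptors are those with at least one tree
    number starting with 'C'.
    """
    rows = []
    root = ET.fromstring(xml_text)
    for record in root.iter("DescriptorRecord"):
        trees = [t.text or "" for t in record.findall("TreeNumberList/TreeNumber")]
        if not any(tree.startswith("C") for tree in trees):
            continue  # skip descriptors outside the disease tree
        descriptor = record.findtext("DescriptorName/String")
        for concept in record.findall("ConceptList/Concept"):
            concept_name = concept.findtext("ConceptName/String")
            for term in concept.findall("TermList/Term"):
                rows.append((descriptor, concept_name, term.findtext("String")))
    return rows


rows = disease_terms(SAMPLE)
```

Each triple could then be loaded into the descriptor, concept and term tables of the database.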

2.1. Extraction of disease names

The approach used to map the annotation lines includes a procedure for the extraction of the disease names from the disease comment lines of Swiss-Prot, as well as of the corresponding MIM identifier.

The extraction of the disease names was done using regular expressions. First, by looking at the disease comment lines, we established a list of expressions that are used to indicate the association of a protein with a disease. These expressions act as starters for disease name extraction. A list of specific ‘stop’ words was defined to remove terms that obviously do not belong to the name of the disease. The role of the termination terms is to remove additional information that sometimes follows the name of the disease. These expressions are presented in Table 1. The extraction procedure does not handle cases where several diseases are mentioned in the same disease comment line. These cases are rare, although one is present in the benchmark.

The MIM identifiers were also extracted from the disease comment lines. They were not extracted from the DR lines (cross-reference lines) because the DR lines give no indication of which disease comment line they refer to. This information is important because we map the disease comment lines and not the entries. These references were also extracted using regular expressions, which was made easier by the fact that MIM identifiers are always presented the same way ([MIM:115150]) in the disease comment line. The MIM numbers are then used to retrieve the corresponding title and alternative titles from the database created for this project.

(1) Starter expressions: Cause(s) of /a involved in (can) contribute(s) to associated/association with correlated with responsible for contributor to result(s)/resulting in lead(s) to induce(s) defective in individual(s) with patient(s) with/suffering from reduce(s) influence(s) deleted in down-regulated in found in implicated in predispose(s) to favor antigen of antigen for thought to be an role in could impart mediate(s) candidate (gene)

(2) Specific stop words: susceptibility to development of genetic predisposition for developing pathogenesis of subset of various types of some form of increased risk of

(3) Termination terms: also known as but which an due to in condition(s) such as . [MIM:

Table 1. (1) Expressions used to extract the part of the string containing the disease name. (2) Terms removed from the extracted string. (3) Expressions indicating the end of the disease name.
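The extraction procedure of this section can be sketched as follows. This is a minimal Python illustration, not the Perl code of the project: it uses only a small subset of the Table 1 expressions, and the function names and the example comment line are hypothetical.

```python
import re

# Small illustrative subset of the Table 1 expressions, not the full lists.
STARTERS = ["cause of", "involved in", "associated with", "responsible for"]
STOP_WORDS = ["susceptibility to", "development of"]
TERMINATORS = ["also known as", "due to", "[mim:", "."]

MIM_PATTERN = re.compile(r"\[MIM:(\d+)\]")


def extract_mim_ids(comment):
    """Return all MIM numbers written as [MIM:nnnnnn] in a comment line."""
    return MIM_PATTERN.findall(comment)


def extract_disease(comment):
    """Return the disease name found after the first starter, or None."""
    text = comment.lower()
    hits = [(text.find(s), s) for s in STARTERS if s in text]
    if not hits:
        return None  # no starter expression: nothing is extracted
    pos, starter = min(hits)
    name = text[pos + len(starter):]
    # Cut at the earliest termination term, if any.
    cuts = [name.find(t) for t in TERMINATORS if t in name]
    if cuts:
        name = name[:min(cuts)]
    # Remove stop words that do not belong to the disease name.
    for stop in STOP_WORDS:
        name = name.replace(stop, " ")
    name = " ".join(name.split())
    return name or None


# Hypothetical comment line, written in the style of a disease comment.
line = "Defects in XYZ are the cause of example disease type 2 [MIM:115150]."
disease = extract_disease(line)
mim_ids = extract_mim_ids(line)
```

A line without any starter expression, such as ‘sera from scleroderma patients recognize rpp38.’, yields no disease name in this sketch.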


2.2. Mapping algorithm

The mapping algorithm is divided into two steps, the exact match and the partial match. The partial match is searched for only if no exact match has been found.

Exact match

The exact match consists of searching among the MeSH terms to determine whether the exact same term exists (same length, same word order, but case insensitive). For OMIM, there can be more than one disease term per disease comment line, either because of the synonyms provided by the OMIM entry or because of multiple OMIM references in the comment line. In this case one exact match among all the possibilities is sufficient.
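A minimal Python sketch of the exact match step is given below. The term list is toy data and the function names are hypothetical; the lookup is case insensitive and stops at the first candidate name that matches.

```python
def build_term_index(mesh_terms):
    """Map each lower-cased MeSH term to its descriptor."""
    return {term.lower(): descriptor for term, descriptor in mesh_terms}


def exact_match(candidate_names, index):
    """Return the descriptor of the first candidate found in the index.

    For OMIM, candidate_names would hold the title and its synonyms;
    one exact match among all the candidates is sufficient.
    """
    for name in candidate_names:
        descriptor = index.get(name.strip().lower())
        if descriptor is not None:
            return descriptor
    return None


# Toy data: (term, descriptor) pairs.
index = build_term_index([
    ("Esophageal Neoplasms", "Esophageal Neoplasms"),
    ("Cancer of Esophagus", "Esophageal Neoplasms"),
])
found = exact_match(["oesophageal tumour", "CANCER OF ESOPHAGUS"], index)
```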

Partial match

To evaluate which term can be the corresponding term in MeSH when no exact match is found, a similarity score is calculated between the disease and all terms present in the Disease category of MeSH.

To calculate the similarity score between a disease and a MeSH term, the disease and the MeSH term are first decomposed into words. For this purpose, a word is defined as a token containing only alphanumeric characters, with a length of one or more.

A weight is calculated for all these words. This weight is a function of the frequency and represents the importance of the word in the context of origin of the terms. Its purpose is to give more importance to infrequent words, hemophilia for example, than to frequent words like disease or syndrome.

The score is calculated by adding the weight of each word in common between the disease and the MeSH term, and subtracting the weight of every word of the disease or of the MeSH term that is not in common. The score is then normalized by dividing it by the number of words contained in the disease term:

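\[
\mathrm{score} = \frac{\sum_{cw} \log\left(\frac{1}{\mathrm{freq}(cw)}\right) - \sum_{ncw} \log\left(\frac{1}{\mathrm{freq}(ncw)}\right)}{\mathrm{size}(\mathrm{disease})}
\]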

freq = number of occurrences of the word in all OMIM titles, alternative and included titles, all MeSH terms of the Disease category and all Swiss-Prot disease comment lines, divided by the total number of words in these documents

cw = words in common between the MeSH term and the disease to map
ncw = words not in common between the MeSH term and the disease to map
disease = the disease name to map
size = word count
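The score defined above can be sketched in Python as follows. Several points are assumptions, since the text does not fix them: the natural logarithm is used, repeated words are counted once (set semantics), and the word frequencies are computed from a toy corpus instead of the OMIM, MeSH and Swiss-Prot texts. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a string into lower-cased alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_freq(corpus):
    """Relative frequency of each word over all strings of the corpus."""
    words = [w for text in corpus for w in tokenize(text)]
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}


def similarity(disease, mesh_term, freq):
    """Weighted word overlap, normalized by the size of the disease name."""
    disease_words = tokenize(disease)
    d_set, m_set = set(disease_words), set(tokenize(mesh_term))
    common = d_set & m_set
    not_common = d_set ^ m_set  # words present in only one of the two

    def weight(word):
        return math.log(1.0 / freq[word])  # rare words weigh more

    score = sum(weight(w) for w in common) - sum(weight(w) for w in not_common)
    return score / len(disease_words)


# Toy corpus standing in for the real word sources.
corpus = ["hemophilia a", "hemophilia b", "heart disease",
          "kidney disease", "liver disease"]
freq = build_freq(corpus)
same = similarity("hemophilia a", "hemophilia a", freq)
close = similarity("hemophilia a", "hemophilia b", freq)
```

In this toy corpus, ‘disease’ is the most frequent word and therefore receives the lowest weight, which is the intended effect of the frequency-based weighting.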


This formula is inspired by the IDF, the Inverse Document Frequency used in the Information Retrieval domain [Manning et al., 2007]. In addition, one adjustment was made to the calculation of the score:

Hyphenated terms

Words linked by hyphens are treated specially. They are considered common words only if all the words composing the hyphenated term are common. If at least one of them is not a common word, all of them are considered not common. This avoids false positive matches.
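The hyphen rule can be illustrated by the following Python sketch, which splits the words of a disease name and of a MeSH term into common and not common words. It is one possible reading of the rule: the demotion is applied in a single pass over the hyphenated terms of both strings, and the function name is hypothetical.

```python
import re

# A chunk is a run of alphanumeric words joined by hyphens.
CHUNK = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def split_common(disease, mesh_term):
    """Return (common, not_common) word sets, applying the hyphen rule."""
    d_groups = [c.split("-") for c in CHUNK.findall(disease.lower())]
    m_groups = [c.split("-") for c in CHUNK.findall(mesh_term.lower())]
    d_words = {w for group in d_groups for w in group}
    m_words = {w for group in m_groups for w in group}
    shared = d_words & m_words

    # A hyphenated term counts as common only if all of its words are shared.
    demoted = set()
    for group in d_groups + m_groups:
        if len(group) > 1 and not all(w in shared for w in group):
            demoted.update(group)

    common = shared - demoted
    not_common = (d_words | m_words) - common
    return common, not_common


partial = split_common("long-chain deficiency", "chain deficiency")
full = split_common("x-linked disease", "disease, x-linked")
```

In the first example, ‘chain’ appears in both strings but is treated as not common because ‘long’, its hyphenated partner, is missing from the MeSH term.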

We calculated the partial matching score of each disease name against all MeSH terms. The term obtaining the best score is returned. In the case of OMIM, the score is calculated between every synonym and the MeSH terms, and the MeSH term with the best score is returned. With the use of synonyms, several MeSH terms can obtain the same score; in this case, all MeSH terms with the best score are considered as the result. ‘No match’ refers to disease names that have no word in common with any MeSH term of the Disease category.
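The selection of the best-scoring MeSH terms, including ties and the ‘no match’ case, can be sketched in Python as follows. The scoring function is passed as a parameter; the `shared_word_count` function used here is only a toy stand-in for the similarity score, and all names are hypothetical.

```python
def shared_word_count(name, term):
    """Toy score: number of shared words, or None if there are none."""
    shared = set(name.lower().split()) & set(term.lower().split())
    return len(shared) if shared else None


def best_matches(candidate_names, mesh_terms, score_fn):
    """Return (best score, list of MeSH terms reaching that score).

    candidate_names holds one name for SP, or all synonyms for OMIM.
    A score of None means 'no word in common' and is ignored.
    """
    best_score, best_terms = None, []
    for name in candidate_names:
        for term in mesh_terms:
            score = score_fn(name, term)
            if score is None:
                continue
            if best_score is None or score > best_score:
                best_score, best_terms = score, [term]
            elif score == best_score and term not in best_terms:
                best_terms.append(term)  # keep all terms tied for best
    return best_score, best_terms


terms = ["Esophageal Neoplasms", "Esophagus Cancer", "Liver Neoplasms"]
tie = best_matches(["esophageal cancer"], terms, shared_word_count)
with_synonym = best_matches(
    ["esophageal cancer", "cancer of esophagus"], terms, shared_word_count)
no_match = best_matches(["zzz"], terms, shared_word_count)
```

Returning the score together with the terms allows a threshold to be applied afterwards.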

2.3. Benchmark

In order to evaluate the performance of the mapping program, we performed a manual mapping between a set of Swiss-Prot entries and MeSH terms, which was used as a benchmark. The benchmark was created by taking 92 disease comment lines, present in 43 randomly selected Swiss-Prot entries. To ensure a representative sampling, these entries were chosen according to the results of a preliminary version of the mapping program, so that the benchmark follows the ratio of exact matches, partial matches and no matches obtained with the preliminary program.

Each disease comment line was manually mapped to a single MeSH term whenever possible. Where possible, the mapping was based on the term used in the disease comment line of the Swiss-Prot entry. If necessary, the referenced OMIM entry was used, and sometimes also other medical resources such as Orphanet [8], a freely accessible database of rare diseases and orphan drugs.

2.4. Evaluation

The procedure used to validate the mapping generated by the program consisted of comparing the manually selected descriptor to the automatically retrieved ones. The automatic mapping was considered correct if both descriptors were the same.

If more than one term had been mapped, either manually or automatically, the mapping was considered correct if at least one of the automatically retrieved terms corresponded to one of the manually selected MeSH descriptors.


In this particular context, the terminology used to analyze the results was the following:

• True Positive (TP): number of correct mappings above the threshold
• True Negative (TN): number of wrong mappings below the threshold
• False Positive (FP): number of wrong mappings above the threshold
• False Negative (FN): number of correct mappings below the threshold
• No Result (NR): number of diseases not mapped (no word in common with any MeSH term)
• Retrieved (R): number of mappings (correct or wrong) above the threshold

The measures used to evaluate the global results of the mapping on the benchmark are the recall and the precision. To evaluate the mapping of all Swiss-Prot disease terms, the retrieval was used. The formulas are listed below:

Recall = TP / (TP+TN+FP+FN+NR)
Precision = TP / (TP+FP)
Retrieval = R / (TP+TN+FP+FN+NR)
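The three measures can be computed as in the following Python sketch. The counts in the example are hypothetical and do not correspond to any result of this work; R is taken as TP + FP, following the definition of the retrieved mappings.

```python
def evaluate(tp, tn, fp, fn, nr):
    """Return recall, precision and retrieval as defined above."""
    total = tp + tn + fp + fn + nr   # all disease comment lines evaluated
    retrieved = tp + fp              # R: mappings above the threshold
    return {
        "recall": tp / total,
        "precision": tp / retrieved if retrieved else None,
        "retrieval": retrieved / total,
    }


# Hypothetical counts, for illustration only.
measures = evaluate(tp=8, tn=2, fp=2, fn=4, nr=4)
```

Note that the recall denominator counts every evaluated line, including those without any result, so unmapped diseases lower the recall.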

3. Results

We used two parallel sources to map the Swiss-Prot annotation lines to MeSH. The first source was the disease name extracted from the disease comment line; it will be referred to as “SP”. The second source was the disease names of the corresponding OMIM entry cited in the disease comment line; it will be referred to as “OMIM”.

First, we introduce the manual mapping in order to better understand the aim of the mapping and its difficulties. Then, we look at the first step of the automatic mapping, the automatic extraction of the disease name from the comment lines, and at how well it worked. The automatic mapping itself is then introduced. We discuss the choice of the threshold for the partial matching, and finally we present the results of the program on the benchmark and on the whole of Swiss-Prot, with a comparison of the SP and OMIM sources. Combinations of the SP and OMIM mappings are also presented, together with their goals and their results.

3.1. Manual Mapping

We manually mapped a set of 92 disease comment lines. This manual mapping is of interest as it highlights the difficulties that can help understand the complexity of terminology mapping, automatic or not.

About three quarters of the disease comment lines did not raise any particular problem for the manual mapping. Either the disease had a corresponding entry in MeSH for an identical concept or, if it had not, choosing a corresponding entry was obvious, for example ‘short qt syndrome type 2’ corresponding to ‘arrhythmia’.

About a quarter (26) were more complicated to map, for different reasons. The most frequent reason was that the disease was not present in MeSH (40 cases) and the choice of the corresponding descriptor was not obvious (25 cases).

Among the latter cases, many were diseases or syndromes that affect many different parts of the body (10). The only entries in MeSH representing this kind of disease are ‘Abnormalities, Multiple’ and ‘Genetic Diseases, Inborn’. The problem with this kind of mapping is the loss of information. We looked at the approach chosen by MeSH to classify such diseases by taking one existing syndrome and checking under which descriptors it is classified. The Marfan syndrome is the child of many different descriptors: ‘Abnormalities, Multiple’, ‘Bone Diseases, Developmental’, ‘Heart Defects, Congenital’, ‘Genetic Diseases, Inborn’ and ‘Connective Tissue Diseases’. These descriptors represent the range of expression of the disease, and can be numerous depending on the disease. In this work, as we want to map to a single descriptor as much as possible, we chose to map to one of the two general descriptors (‘Abnormalities, Multiple’ and ‘Genetic Diseases, Inborn’). In this way, we avoided the problems that would arise from mapping to all possible descriptors. We chose ‘Abnormalities, Multiple’ because it conveys that the disease involves several different problems, which is not always the case for ‘Genetic Diseases, Inborn’. This choice was made even though ‘Abnormalities, Multiple’ need not be genetic, but can also be chromosomal or only congenital. The only cases in which we mapped to a descriptor other than ‘Abnormalities, Multiple’ were those where the different abnormalities affect only one system. This made the mapping easier and retains some of the information given by the original disease term.

The other problematic mappings due to the absence of the disease in MeSH (15 cases) arise because a disease can sometimes be classified according to different aspects. For example, ‘long-chain 3-hydroxyl-coa dehydrogenase deficiency’ was mapped to ‘lipid metabolism, inborn error’ but could also have been mapped to ‘mitochondrial diseases’.

The remaining case concerned an inconsistency in MeSH. The disease ‘inclusion body myopathy 2’ is not directly represented in MeSH. The sporadic form is represented (‘inclusion body myopathy, sporadic’) under the descriptor ‘inclusion body myositis’, a child of myositis (inflammation of muscle tissue). However, the familial form, to which ‘inclusion body myopathy 2’ belongs, does not present signs of muscle inflammation. This particularity of the familial form is even mentioned in the definition of the ‘inclusion body myositis’ descriptor, but neither a term nor the classification corresponds to the familial form, so it cannot be classified under this descriptor. We therefore mapped it to ‘myopathy’, which is elsewhere in the hierarchy.

Some disease comment lines were mapped to several descriptors for reasons other than those mentioned above. Two cases concern cancers, which are classified at both the anatomical and the histological level. ‘Squamous cell carcinoma of the head and neck’ and ‘oral squamous cell carcinomas’ were thus both mapped to ‘squamous cell carcinoma’, and to ‘head and neck cancer’ and ‘oral cancer’ respectively. Two other cases of multiple mappings are disease comment lines mentioning several diseases. As explained in the Methods, in cases of multiple mappings, a mapping to a single correct term is considered successful.


3.2. Disease extraction

The automatic approach presented in this work implies the extraction of the disease name from the disease comment line of Swiss-Prot. Given the number of these annotation lines (2966), an automatic extraction has been developed (see Methods 2.1.).

In only seven cases did the program extract nothing from the disease comment lines of Swiss-Prot (Table 2); none of them belongs to the benchmark.

AC Disease comment lines

P01008 at-iii basel, tours/alger/amiens/toyama, rouen-1, -2, -3 and -4 have decreased (or lack) heparin-binding properties.

P22681 can be converted to an oncogenic protein by deletions or mutations that disturb its ability to down-regulate rtks.

Q8WYB5 a chromosomal aberration involving myst4 may be a cause acute myeloid leukemias. translocation t(10;16)(q22;p13) with crebbp.

P35226 cooperates with the myc oncogene to produce b-lymphomas.

P78346 sera from scleroderma patients recognize rpp38.

P78345 sera from scleroderma patients recognize rpp38.

P28908 Most specific hodgkin disease associated antigen.

Table 2. Comment lines from which a disease couldn’t be automatically extracted.

For the rest of Swiss-Prot (2959 disease comment lines), we manually checked whether the diseases were correctly extracted, by visually scanning the 2959 extracted diseases for wrong extractions. We detected cases where the extraction did not work correctly, mainly comment lines mentioning several diseases, which the automatic extraction does not take into account. The number of such cases cannot be given precisely, as the check was not exhaustive. One such line is present in the benchmark (‘KRT16 and KRT17 are coexpressed only in pathological situations such as metaplasias and carcinomas of the uterine cervix and in psoriasis vulgaris.’). On the other hand, some diseases were more easily extracted: those caused by variants. In these cases the protein is the cause of the disease and the expressions used to state it are similar (‘is the cause of’, ‘causes’, etc.).
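To illustrate how such recurring expressions can drive an extraction, the following minimal Python sketch looks for a causal phrase and keeps the text that follows it. It is not the program described in Methods 2.1: the function name `extract_disease`, the regular expression, the added variant ‘are the cause of’ and the example sentence about FBN1 are assumptions made for this illustration. Only the trigger phrases ‘is the cause of’ and ‘causes’ come from the text.

```python
import re

# Hypothetical trigger phrases. 'is the cause of' and 'causes' are quoted in
# the text; the plural 'are the cause of' is an assumption of this sketch.
CAUSE_PATTERN = re.compile(
    r"\b(?:is|are) the cause of\s+(?P<name1>[^.;(]+)"
    r"|\bcauses\s+(?P<name2>[^.;(]+)",
    flags=re.IGNORECASE,
)


def extract_disease(comment_line):
    """Return the text following a causal phrase, lower-cased, or None."""
    match = CAUSE_PATTERN.search(comment_line)
    if match is None:
        return None
    name = match.group("name1") or match.group("name2")
    return name.strip().lower()


# Illustrative sentence written for this sketch (not copied from Swiss-Prot).
print(extract_disease("Defects in FBN1 are the cause of Marfan syndrome (MFS)."))
# A line from Table 2 that contains no causal phrase.
print(extract_disease("sera from scleroderma patients recognize rpp38."))
```

Like the extraction discussed in this section, the sketch returns a single name, so a comment line mentioning several diseases would not be handled correctly.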


3.3. Automatic mapping

The automatic mapping approach consisted of exact and partial matching. We first had to choose a threshold for the similarity score used in the partial match; this choice is presented in the first part of this section. We then evaluated the benchmark mapping with this threshold. We tested different combinations of the results obtained with SP disease and OMIM disease. Finally, we tested the mapping on all the disease comment lines of Swiss-Prot.

Figure 6 presents the most interesting mapping procedure, a combination of the SP and OMIM sources, whose results are discussed later in this section.

Figure 6. One of the automatic mapping procedures. Diagram labels: Swiss-Prot entry - Disease comment line; Disease extracted (SP); OMIM title and alternative titles; Exact match (×2); Partial match (×2); Same Descriptor; MeSH.


Threshold set-up

Almost all disease names of the benchmark provided a mapping to a MeSH term, either exact or partial. We used these results to set a threshold on the score of the partial matches, above which the program considers them correct. This threshold modifies the recall and the precision obtained. In our case, precision is what we have to favor, because the automatic mapping should be reliable. To choose the threshold, we considered the partial matches of SP and OMIM separately and calculated the recall and the precision of the system for different thresholds (Graph 1). For the same threshold, the level of recall for SP and OMIM disease names is not exactly the same. This effect is certainly due to the small number of terms with a partial match in the benchmark, 76 for SP and 61 for OMIM. Therefore, we could not really justify the use of two different thresholds. As we should favor precision over recall, we decided to take the value +1 for the threshold. This threshold represents a good compromise between OMIM and SP, and it ensures a precision above 80% for partial matches. Lowering the threshold to 0, for instance, would enhance the recall without lowering the precision too much. However, this approaches the region of the graph where precision drops sharply. A bigger benchmark set is definitely needed to analyze more precisely the effect of the threshold on the recall and precision.

Graph 1. Recall-Precision functions of the threshold (x-axis: Recall; y-axis: Precision; series: SP and OMIM). The threshold varies from –7 to +7; prominent dots correspond to a threshold of +1.
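The threshold sweep behind such a recall-precision curve can be sketched as follows. This is a minimal illustration, not the program used in this work: the function names, the strict comparison `score > threshold` and the toy scores are assumptions, and the scored matches are invented for the example.

```python
def recall_precision(scored_matches, benchmark_size, threshold):
    """Compute (recall, precision) for one threshold.

    scored_matches: list of (score, is_correct) pairs, one per partial match.
    benchmark_size: number of benchmark entries used as the recall denominator.
    A match is accepted when its score is strictly above the threshold
    (assumption of this sketch).
    """
    accepted = [is_correct for score, is_correct in scored_matches if score > threshold]
    retrieved = len(accepted)
    correct = sum(accepted)
    recall = correct / benchmark_size
    precision = correct / retrieved if retrieved else None  # undefined if nothing is retrieved
    return recall, precision


def sweep(scored_matches, benchmark_size, thresholds):
    """Return {threshold: (recall, precision)} for each threshold."""
    return {t: recall_precision(scored_matches, benchmark_size, t) for t in thresholds}


# Toy data invented for this example: (similarity score, mapping is correct).
toy = [(2.5, True), (1.3, False), (0.4, True), (-1.6, True), (-3.3, False)]
for threshold, (recall, precision) in sweep(toy, 5, range(-7, 8)).items():
    print(threshold, recall, precision)
```

Lowering the threshold accepts more matches, so recall can only stay equal or rise, while precision depends on how many of the newly accepted matches are wrong.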


Evaluation of the benchmark set

Table 3 summarizes the recall and precision obtained on the benchmark using a threshold of +1. The benchmark is composed of 92 disease comment lines, of which 82 have a cross-reference to OMIM (89%). The results of OMIM are slightly better than those of SP, for both exact and partial matches. The results are nevertheless quite similar for the exact matches, with an excellent precision (100%) and a modest recall (around 20%). The partial matches have recall values comparable to those of the exact matches, but it should be noted that they are calculated only on diseases that did not obtain an exact match. This represents 76 SP diseases and 61 OMIM diseases. The precision is better with OMIM (90%), although SP also has a fairly good precision of 80%. After adding the exact and partial matches, interesting results were obtained with OMIM, with a recall of 43% and a precision of 95%.
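As a check on how the percentages of Table 3 relate to the counts, the short Python sketch below recomputes the ‘Total’ columns. It assumes the reading that reproduces the published figures: the retrieval and recall percentages are taken over the 92 benchmark lines, and the precision is the number of correct mappings divided by the number of retrieved mappings. The names `percent` and `summarize` are hypothetical.

```python
BENCHMARK_SIZE = 92  # disease comment lines in the benchmark

# (retrieved, correct) counts from the 'Total' columns of Table 3.
TOTALS = {
    "SP": (36, 32),
    "OMIM": (42, 40),
    "SP ∩ OMIM": (18, 18),
    "SP ∪ OMIM": (50, 46),
}


def percent(numerator, denominator):
    """Percentage rounded to the nearest integer."""
    return round(100 * numerator / denominator)


def summarize(retrieved, correct, size=BENCHMARK_SIZE):
    """Retrieval and recall over the benchmark, precision over retrieved mappings."""
    return {
        "retrieval": percent(retrieved, size),
        "recall": percent(correct, size),
        "precision": percent(correct, retrieved),
    }


for source, (retrieved, correct) in TOTALS.items():
    print(source, summarize(retrieved, correct))
```

For OMIM, for example, this gives a retrieval of 46%, a recall of 43% and a precision of 95%, matching the values quoted in the text.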

Given these results, we tried to combine the results of SP and of OMIM to see the effect on the recall and the precision. We tested two ways of combining them.

1) We considered the intersection of the mappings (SP ∩ OMIM), designed to enhance the confidence of the mapping. We considered only the partial matches where SP and OMIM both had a score above the threshold and where the mapped terms from both sources correspond to the same descriptor. On the exact matches, the recall was lowered without affecting the precision, which was already maximal. On the partial matches, the precision rose to 100%, while the recall was reduced by more than half. Summing exact and partial matches thus gave a low recall (20%) with an excellent precision (100%), as expected.

2) We calculated the union of the mappings (SP ∪ OMIM), aiming to enhance the coverage of the mapping. We considered mappings where at least one of the SP or OMIM matches had obtained a score above the threshold. The results showed an increase in recall for the exact matches, leading to more than a quarter of the benchmark correctly mapped (29%). The partial matching showed, when compared to OMIM alone, the same recall (21%) with a lower precision (83% compared to 90% for OMIM). Adding exact and partial matches for this combination gave a recall of 50% and a precision of 92%, which indeed corresponds to better coverage, at the cost of lower precision.

As the union of SP and OMIM had a precision of 100% for the exact matches, we decided to combine it with the intersection of SP and OMIM for the partial matching, enhancing the coverage with the exact matches and keeping an excellent precision with the partial matching. This association is shown at the beginning of this section (Figure 6). It resulted in a recall of 37%, quite modest, but with a precision of 100%.
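The decision rule of this combined procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in this work: the `Match` structure and the `combine` function are hypothetical, the comparison is taken as strictly above the threshold, the scores in the usage lines are invented, and when both sources give exact matches to different descriptors the sketch arbitrarily returns the SP one.

```python
from typing import NamedTuple, Optional


class Match(NamedTuple):
    descriptor: str  # MeSH descriptor the disease name maps to
    exact: bool      # True for an exact match, False for a partial match
    score: float     # similarity score (not used when exact is True)


def combine(sp: Optional[Match], omim: Optional[Match], threshold: float = 1.0) -> Optional[str]:
    """Union of SP and OMIM on exact matches, intersection on partial matches."""
    # Union: an exact match from either source is accepted on its own.
    for match in (sp, omim):
        if match is not None and match.exact:
            return match.descriptor
    # Intersection: both sources must give a partial match above the
    # threshold, and both must point to the same descriptor.
    if sp is None or omim is None:
        return None
    if sp.score > threshold and omim.score > threshold and sp.descriptor == omim.descriptor:
        return sp.descriptor
    return None


# Invented scores, for illustration only.
print(combine(Match("Marfan Syndrome", True, 0.0), None))
print(combine(Match("Arthrogryposis", False, 1.4), Match("Arthrogryposis", False, 2.0)))
print(combine(Match("Arthrogryposis", False, 1.4), Match("Arthrogryposis", False, 0.5)))
```

Requiring agreement between the two sources for partial matches is what trades recall for precision in this combination.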

With the threshold set to 0, the results obtained with this last combination showed a slightly better recall (39% instead of 37%) but a precision that was no longer perfect (95%). Lowering the threshold to –1 raised the recall to 50% with a precision of 91%. All these results were satisfactory but, as mentioned, firmer conclusions on the differences between these scores require evaluation on a larger data set.


92 disease comment lines, 82 with OMIM

Source | Exact: retrieval | Exact: recall | Exact: precision | Partial: retrieval | Partial: recall | Partial: precision | Total: retrieval | Total: recall | Total: precision
SP | 16 (17%) | 16 (17%) | 100% | 20 (22%) | 16 (17%) | 80% | 36 (39%) | 32 (35%) | 89%
OMIM | 21 (23%) | 21 (23%) | 100% | 21 (23%) | 19 (21%) | 90% | 42 (46%) | 40 (43%) | 95%
SP ∩ OMIM | 10 (11%) | 10 (11%) | 100% | 8 (9%) | 8 (9%) | 100% | 18 (20%) | 18 (20%) | 100%
SP ∪ OMIM | 27 (29%) | 27 (29%) | 100% | 23 (25%) | 19 (21%) | 83% | 50 (54%) | 46 (50%) | 92%

Table 3. Benchmark results with threshold equal to +1.

Mapping of the whole Swiss-Prot set

The next step of this work consisted in using the program to map the whole of Swiss-Prot to MeSH. This represented 2966 disease comment lines, of which 2173 (73%) had an OMIM reference, a lower proportion than in the benchmark set (89%).

No recall or precision could be calculated, given the lack of mapping validation. Only the retrieval values could be calculated and these are presented in Table 4. They can be compared to the retrieval values obtained on the benchmark set (Table 3).

The retrieval percentages for the whole of Swiss-Prot were comparable to those of the benchmark, which suggests that the latter was a fairly representative sample despite its small size. The only important difference was the lower retrieval of OMIM matches in the whole Swiss-Prot set (37% compared to 46% in the benchmark). This can be explained by the lower proportion of OMIM references compared to the benchmark set.
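This explanation can be checked with a short calculation on the published counts. The sketch below, in which the function name `omim_shares` and the dictionary layout are hypothetical, expresses the OMIM retrieval both over all disease comment lines and over only the lines that have an OMIM reference.

```python
def omim_shares(counts):
    """Return OMIM retrieval as a rounded percentage of (all lines, lines with OMIM)."""
    over_all = round(100 * counts["omim_retrieved"] / counts["lines"])
    over_omim = round(100 * counts["omim_retrieved"] / counts["with_omim"])
    return over_all, over_omim


# Counts taken from Tables 3 and 4 and from the text.
benchmark = {"lines": 92, "with_omim": 82, "omim_retrieved": 42}
whole_set = {"lines": 2966, "with_omim": 2173, "omim_retrieved": 1093}

for name, counts in (("benchmark", benchmark), ("whole Swiss-Prot", whole_set)):
    over_all, over_omim = omim_shares(counts)
    print(f"{name}: {over_all}% of all lines, {over_omim}% of lines with an OMIM reference")
```

Restricted to lines with an OMIM reference, the retrieval is about 51% in the benchmark and about 50% in the whole set, so the gap between 46% and 37% largely disappears once the different proportions of OMIM references are taken into account.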

2966 disease comment lines, 2173 with OMIM

Source | Exact: retrieval | Exact: recall | Exact: precision | Partial: retrieval | Partial: recall | Partial: precision | Total: retrieval | Total: recall | Total: precision
SP | 483 (16%) | - | - | 634 (21%) | - | - | 1117 (38%) | - | -
OMIM | 610 (21%) | - | - | 483 (16%) | - | - | 1093 (37%) | - | -
SP ∩ OMIM | 292 (10%) | - | - | 237 (8%) | - | - | 529 (18%) | - | -
SP ∪ OMIM | 794 (27%) | - | - | 640 (22%) | - | - | 1434 (48%) | - | -

Table 4. Whole Swiss-Prot results with threshold equal to +1.


Studying some particular cases among the exact matches of the whole Swiss-Prot set gave more insight into the mapping problems. In seven cases, the exact matches of SP and OMIM mapped to different descriptors. These cases are reported in Table 5. Two of them were errors in the Swiss-Prot annotation, which have already been corrected.

SP disease | OMIM disease | SP-OMIM discrepancy origin | Note
Turner syndrome | Noonan syndrome | Human error in Swiss-Prot annotation (corrected) | Turner syndrome is a chromosomal abnormality, cannot be caused by a variant
Hepatoerythropoietic porphyria | Porphyria cutanea tarda | Difference in classification | OMIM: hepatoerythropoietic porphyria included title of porphyria cutanea tarda. MeSH: different descriptors
Multiple carboxylase deficiency | Multiple carboxylase deficiency, neonatal form | Lack of precision in Swiss-Prot annotation (corrected) | Multiple carboxylase deficiency has two forms, early and late onset, involving two types of enzymes. The early onset one corresponds to this entry
Enchondromatosis | Osteochondromatosis | Difference in classification | OMIM: alternative titles. MeSH: different descriptors
Protein C deficiency | Thrombophilia | Automatic mapping: use of second OMIM reference | SP: second link to OMIM in the disease comment line, thrombophilia, corresponding to the type of disease to which protein C deficiency belongs
Cholesteryl ester storage disease | Wolman disease | Difference in classification | Allelic variants. OMIM: alternative titles. MeSH: different descriptors
Cholelithiasis | Cholecystitis | Difference in classification | Not synonyms. OMIM: alternative titles. MeSH: different descriptors

Table 5. Different exact matches between SP and OMIM.


4. Discussion

In this project, we have developed an automatic approach to map the disease comment lines of Swiss-Prot to the MeSH terminology, using exact matches and partial matches based on a similarity score.

In this part, we discuss the mappings obtained with the similarity score threshold equal to +1, with the exact matches taken from the union of SP and OMIM and the partial matches from the intersection of SP and OMIM (see Figure 6, in Results). This combination obtained a recall of 37% and a precision of 100%. A table of these detailed mappings is available on the web (URL: “http://intranet.isb-sib.ch/pages/viewpageattachments.action?pageId=2590040”). First, we can notice the absence of false positives. Nevertheless, many false negatives are present, and some true negatives have quite high scores. The false negatives were not mapped either because their score was just under the threshold of +1, or because the corresponding mapping from the other source was under the threshold or mapped to a different descriptor (see SP ∩ OMIM in Results).

The low recall can be explained by the presence of false negatives, which come mainly from small differences in granularity between the disease to map and the corresponding MeSH term. For example, ‘arthrogryposis, distal, type 7’ correctly maps to arthrogryposis but with a score of –2.8, and ‘familial erythrocytosis type 1’ maps to erythrocytosis with a score of –1.6. Other cases concern diseases not perfectly extracted from the disease comment lines, such as ‘variety of human tumors’, which maps to tumors but with a score of –3.3. These problems could be partly resolved by using a supplementary source to calculate the frequency of the words used in the similarity score. This would lower the influence of words such as ‘type’ or ‘variety’ compared to ‘arthrogryposis’, ‘erythrocytosis’ or ‘tumors’.
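One common way to obtain such frequency-based weights is inverse document frequency (IDF), in which a word occurring in many terms of a reference collection receives a low weight. The Python sketch below illustrates the idea only. It does not reproduce the similarity score developed in this work; the functions `tokens`, `idf_weights` and `weighted_overlap` are hypothetical, and the small corpus is a toy list written for this example, not an extract of MeSH.

```python
import math
from collections import Counter


def tokens(term):
    """Lower-case a term and split it on whitespace and commas."""
    return term.lower().replace(",", " ").split()


def idf_weights(corpus_terms):
    """Weight each word by log(N / number of terms containing the word)."""
    n_terms = len(corpus_terms)
    doc_freq = Counter()
    for term in corpus_terms:
        doc_freq.update(set(tokens(term)))
    return {word: math.log(n_terms / df) for word, df in doc_freq.items()}


def weighted_overlap(query, candidate, weights, default):
    """Share of the query's total word weight that the candidate also contains."""
    query_words, candidate_words = set(tokens(query)), set(tokens(candidate))
    total = sum(weights.get(w, default) for w in query_words)
    shared = sum(weights.get(w, default) for w in query_words & candidate_words)
    return shared / total if total else 0.0


# Toy reference collection, invented for this illustration.
TOY_CORPUS = [
    "arthrogryposis",
    "erythrocytosis",
    "diabetes mellitus, type 1",
    "diabetes mellitus, type 2",
    "glycogen storage disease type ii",
    "mucopolysaccharidosis type i",
]
weights = idf_weights(TOY_CORPUS)
unseen = math.log(len(TOY_CORPUS))  # weight given to words absent from the corpus

print(weights["type"] < weights["arthrogryposis"])
print(weighted_overlap("arthrogryposis, distal, type 7", "arthrogryposis", weights, unseen))
```

In this toy setting ‘type’ receives a lower weight than ‘arthrogryposis’, so the shared informative word counts for more than it would with equal weights. A larger and more representative word source would be needed for the weights to be meaningful.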

The true negative mappings mainly originate from the absence in MeSH of a close corresponding descriptor. In these cases a simple similarity score is not effective, because the disease to map and the MeSH descriptor have no words in common. For example, ‘Autosomal dominant weill-marchesani syndrome’ has no chance of mapping to ‘Abnormalities, Multiple’. Sometimes, even when the correct descriptor is not very distant, a disease is not mapped to it for the same reason. For example, ‘gnathodiaphyseal sclerosis’ was mapped to ‘sclerosis’ instead of ‘osteochondrodysplasias’. Among these true negative mappings, the interesting ones were those that obtained good scores. Their presence prevented us from lowering the threshold further to increase the recall, as such an attempt would inevitably lower the precision. For example, ‘Cirrhosis, familial’ obtains a score of 1.3 on the MeSH descriptor ‘cirrhosis’. Unfortunately, this descriptor refers to the phenomenon of fibrosis in general, while the OMIM entry refers to fibrosis of the liver. The correct descriptor in MeSH is thus ‘liver cirrhosis’. The automatic mapping is therefore wrong, due to a difference of definition between OMIM and MeSH. Another true negative with a high score, ‘epidermolysis bullosa dystrophica, cockayne-touraine type’, comes from a difference of classification between OMIM and MeSH. In MeSH, ‘epidermolysis bullosa dystrophica, cockayne-touraine type’ is classified under a different descriptor from ‘epidermolysis bullosa simplex weber-cockayne type’, which is the SP disease, whereas OMIM places them in the same entry.


‘Inclusion body myopathy 2’ also raised difficulties during the manual mapping, due to an inconsistency in MeSH (see 3.1. Manual mapping).

The classification conflicts between OMIM and MeSH seem impossible to resolve. They probably stem from fundamental differences between the two resources. OMIM is not strictly a terminology but a database, and its entries formally correspond to phenotypes. Thus, its disease classification does not have to be as strict as that of a terminology. This probably explains why two diseases can be alternative titles in OMIM and different descriptors in MeSH. One question we could try to answer is whether the alternative titles of OMIM are really useful for enhancing the recall, or whether they lower the precision too much. Studying the recall and precision obtained with the principal titles only and with the alternative titles may answer this question.

For the wrong mappings due to the limited coverage of genetic diseases in MeSH, different solutions can be considered. For example, we could use another terminology, such as SNOMED CT, which could be more complete than MeSH. A mapping to SNOMED CT could also help map diseases to ICD, as a mapping between SNOMED CT and ICD-9 already exists. ICD can hardly be avoided given its widespread use in hospital data, but its lack of granularity compared to the other terminologies will probably make a mapping from Swiss-Prot and OMIM difficult. If we use another terminology such as SNOMED CT or ICD, the UMLS tools will probably be useful to handle lexical variants, which were handled here by MeSH and its numerous term variants. Another solution would be to wait for MeSH to be completed for genetic diseases, which is the aim of an ongoing project. Finally, in order to map diseases to a general term such as ‘Abnormalities, Multiple’, we could try to use the hierarchy of MeSH to reach these general descriptors. To achieve this, we could draw on a work mapping OMIM expression terms to MeSH [van Driel et al., 2006], which calculated a similarity score taking into account the MeSH hierarchy.
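As a minimal illustration of how the MeSH hierarchy could be walked to reach more general descriptors, the Python sketch below relies on the fact that MeSH tree numbers are dot-separated, so that the tree number of a parent is a prefix of those of its children. It is not the method of van Driel et al., and it is only one possible reading of the idea above: the functions `ancestors` and `deepest_common_ancestor` are hypothetical, and the tree numbers used in the example are placeholders that have not been checked against MeSH.

```python
def ancestors(tree_number):
    """List the ancestor tree numbers of a node, from nearest parent to root."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def deepest_common_ancestor(tree_a, tree_b):
    """Return the longest shared tree-number prefix of two nodes, or None."""
    common = []
    for part_a, part_b in zip(tree_a.split("."), tree_b.split(".")):
        if part_a != part_b:
            break
        common.append(part_a)
    return ".".join(common) or None


# Placeholder tree numbers, for illustration only.
print(ancestors("C16.131.077"))
print(deepest_common_ancestor("C16.131.077", "C16.320.100"))
print(deepest_common_ancestor("C16.131", "C05.116"))
```

If a disease name partially matches several descriptors, their deepest common ancestor could serve as a candidate general descriptor, at the price of the loss of information already discussed for ‘Abnormalities, Multiple’.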

Whichever solution we adopt, we will most likely have to develop a specific approach for the diseases for which we will never find corresponding terms in any terminology, since there is an important loss of information when mapping a disease to a general term. Handling these cases will probably involve mapping the disease to concepts corresponding to its pathological expressions. This task will require data other than the name of the disease, for example the full text description of the disease or the expression terms provided by OMIM.

Finally, it is conceivable that combining different methods, including the one presented in this work, will enable a complete mapping of Swiss-Prot.


5. Conclusion

In this work, we have developed an automatic approach to map Swiss-Prot disease annotations to the medical terminology MeSH. The results obtained are encouraging, despite a moderate recall. Solutions exist for the improvement of our mapping. The choice of appropriate approaches will depend on the characterization of the future use of our work.


Appendix

A. Definitions

• Controlled vocabulary: list of terms that have been enumerated explicitly, which do not always have a specified definition, even if they theoretically should.

• Taxonomy: collection of controlled vocabulary terms organized into a hierarchical structure. The relations are of parent-child type (whole-part, genus-species, type-instance). If a term appears in different places in the taxonomy, it should be the same, with the same children.

• Thesaurus: network collection of controlled vocabulary terms. Uses associative (not hierarchical) relationships in addition to parent-child relationships.

• Ontology: can refer to many different things, such as glossaries & data dictionaries, thesauri & taxonomies, schemas & data models, and formal ontologies & inference. A formal ontology is a controlled vocabulary expressed in an ontology representation language. This language has a grammar for using vocabulary terms to express something meaningful within a specified domain of interest.

• Meta-model: an explicit model of the constructs and rules needed to build specific models within a domain of interest. A valid meta-model is an ontology, but not all ontologies are modeled explicitly as meta-models.

B. Database

The tables in the upper part of Figure A1 contain information from MeSH (all the data used in this work are contained in the tables Term and Descriptor):

• DESCRIPTOR: descriptor of MeSH
• TREENUMBER: position of a given descriptor in the thesaurus’ hierarchy
• CONCEPT: concept included in a descriptor
• CONCEPTUMLS: identifier of the corresponding concept in UMLS
• SEMANTICTYPE: identifier and name of semantic type from UMLS
• CONCEPT_SEMANTICTYPE: association table between concept and semantictype


The tables in the lower part of Figure A1 contain information from Swiss-Prot:

• SWISSPROT: Swiss-Prot entry
• SPDISEASE: Swiss-Prot annotation line. The ‘digit’ key is an artificial identifier of the disease comment lines, useful because we map the disease comment lines, which have no identifier of their own in Swiss-Prot.
• OMIM: OMIM (phenotype and phenotype + gene) entry
• SPDISEASE_OMIM: association table between omim and spdisease

Figure A1.


Bibliography

A. Bairoch, L. Yip, L. Famiglietti. The UniProtKB/Swiss-Prot protein knowledgebase in the context of human molecular medical research. Version of February 2007 (unpublished).

B. Boeckmann, M.-C. Blatter, L. Famiglietti, U. Hinz, L. Lane, B. Roechert and A. Bairoch. Protein variety and functional diversity: Swiss-Prot annotation in its biological context. C.R. Biologies 328:882-899. 2005.

A. Hamosh, A.F. Scott, J.S. Amberger, C.A. Bocchini, V.A. McKusick. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research 33(Database issue):D514-7. January 2005.

A. Savage. Changes in MeSH Data Structure. NLM Tech Bulletin (313):e2. March-April 2000.

S.J. Nelson, D. Johnston, B.L. Humphreys. Relationships in Medical Subject Headings. In: C.A. Bean, R. Green, editors. Relationships in the organization of knowledge. New York: Kluwer Academic Publishers; p.171-184. 2001.

O. Bodenreider. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(Database issue):D267-70. January 2004.

B.L. Humphreys, A.T. McCray, M.L. Cheh. Evaluating the coverage of controlled health data terminologies: report on the results of the NLM/AHCPR large scale vocabulary test. Journal of American Medical Informatics Association 4(6):484-500. November-December 1997.

K.M. O'Keefe, M. Sievert, J.A. Mitchell. Mendelian inheritance in man: diagnoses in the UMLS. Proc Annu Symp Comput Appl Med Care:735-9. 1993.

I.N. Sarkar, M.N. Cantor, R. Gelman, F. Hartel, Y.A. Lussier. Linking Biomedical Language Information and knowledge Resources in the 21st Century: GO and UMLS. Pacific Symposium on Biocomputing 8:439-450. 2003.

Y.A. Lussier, J. Li. Terminological Mapping for High Throughput Comparative Biology of Phenotypes. Pacific Symposium on Biocomputing 9:202-213. 2004.

C.D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. Cambridge University Press. 2007.

M.A. van Driel, J. Bruggeman, G. Vriend, H.G. Brunner, J.A. Leunissen. A text-mining analysis of the human phenome. Eur J Hum Genet. 14(5):535-42. May 2006.


URL:
[1] http://www.infobiomed.org/
[2] http://www.ncbi.nlm.nih.gov/omim/
[3] http://www.metamodel.com
[4] http://www.nlm.nih.gov/mesh/
[5] http://www.snomed.org/
[6] http://www.who.int/classifications/icd/en/
[7] http://www.nlm.nih.gov/research/umls/about_umls.html
[8] http://www.orpha.net/
