An agent- and ontology-based system for integrating public gene, protein, and disease databases



R. Alonso-Calvo, V. Maojo, H. Billhardt (a), F. Martin-Sanchez (b), M. García-Remesal, D. Pérez-Rey

Biomedical Informatics Group, Artificial Intelligence Laboratory, School of Computer Science, Polytechnic University of Madrid, Boadilla del Monte, 28660 Madrid, Spain

(a) Universidad Rey Juan Carlos, Madrid, Spain

(b) Medical Bioinformatics Department, Institute of Health Carlos III, Majadahonda, Madrid, Spain

Abstract

In this paper, we describe OntoFusion, a database integration system. This system has been

designed to provide unified access to multiple, heterogeneous biological and medical data

sources that are publicly available over the Internet. Many of these databases do not offer a

direct connection, and inquiries must be made via Web forms, returning results as HTML

pages. A special module in the OntoFusion system is needed to integrate these public

‘Web-based’ databases. Domain ontologies are used to do this and provide database

mapping and unification. We have used the system to integrate seven significant and

widely used public biomedical databases: OMIM, PubMed, Enzyme, Prosite and Prosite

documentation, PDB, SNP, and InterPro. A case study is detailed in depth, showing system

performance. We analyze the system’s architecture and methods and discuss its use as a

tool for biomedical researchers.

Keywords

Bioinformatics. Medical Informatics. Heterogeneous databases. Data integration. Genomic

databases.


1. Introduction

At the time of writing this paper, more than 700 biological databases (DBs) were publicly

available [1]. These databases are the result of a large number of biological research

projects that have produced a huge amount of heterogeneous information about genes,

proteins and genetic diseases (e.g., nucleotide polymorphisms, gene mutations, protein

sequences and structures). Public DBs are maintained by different institutions

and research centers that collect these biological data. Often, different public DBs include

related data types; for example, Prosite, Swiss-Prot, and PDB all store protein-related information.

In other cases, different organizations store their own information (e.g., gene

polymorphism and mutation DBs), but this disparate information is not integrated. In

this regard, there is now a need and challenge to integrate information from Web-based

public DBs and other private, local DBs for efficient use in biomedical research. Whether

the publicly available information is integrated or not will have a significant impact on

future clinical applications of genomic research.

One of the barriers to the integration of biological and medical databases is that they are

designed and maintained differently by different organizations, such as the US National

Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI),

the Swiss Institute of Bioinformatics (SIB) and others. Furthermore, not all of the data

sources are directly available. Many of them cannot be accessed transparently, as is

possible, for example, with databases stored in a local database management system (DBMS).

Instead, these remote sources are usually accessed by querying their DBs through

Web-based interfaces (e.g., HTML forms). In this paper, we refer to these remote sources as

“public Web-based DBs”.

In this paper, we present the use of the OntoFusion system for integrating public Web-

based biological and disease-related information databases. OntoFusion is a system for

integrating databases that are either publicly available on the Internet or are directly

accessible through DBMS. It uses a multiagent-based architecture and its integration

approach is founded on the use of ontologies. OntoFusion has been developed within the

INFOGENMED project, with funding from the European Commission [2]. This finished

project aimed to create tools to allow transparent and integrated access to biomedical

information sources. The rest of the paper is organized as follows. Section 2 gives


background on approaches to database integration and extracting information from public

Web-based DBs in the biomedical field. In section 3, we present our approaches to

database integration and to the problem of accessing Web-based DBs. In section 4, we

provide a general overview of the OntoFusion system, and describe the part related to the

integration of public Web-based DBs in more detail. Section 5 presents a case study, where

OntoFusion is used to integrate seven different public Web-based DBs containing

biomedical data. Finally, section 6 provides some discussion and section 7 concludes the

paper.

2. Background

OntoFusion is an information source integration system. Its aim is to provide facilities to

access information from multiple resources – in particular, private databases and public

Web-based databases – in a transparent, integrated, and uniform way. For instance, using

this tool, a biologist can search multiple public biological databases in the same way as

if she were searching a single database on her personal computer. The result of each search

is the compiled set of result instances, which may have been found at different sites.

Obtaining the same information without an information source integration tool

would require searching each database separately and compiling the retrieved

instances manually. The two crucial aspects of the OntoFusion system are database

integration and access to public Web-based databases.

2.1. DB Integration Approaches

An earlier report [3] suggested that three different approaches to DB integration should be

considered: information linkage, data translation, and query translation.

The first approach, information linkage, establishes relationships among different data

sources by using cross-references. It facilitates the navigation over different data sources.

The main drawback of this approach is that it does not actually integrate the information.

This approach is used in many public biological DBs, like MEDLINE, PDB, Prosite, and

others.


The data translation approach intends to create a central data repository containing all the

data from the databases that are integrated. Data from different sources are translated to a

unified conceptual schema and stored at the central repository. Queries are launched to this

repository. This approach can be seen as the creation of a central data warehouse for

multiple DBs. Its advantage is that it provides efficient and transparent access to the data.

However, the effort needed to maintain such a data warehouse is considerable.

Furthermore, any changes to the structure of the integrated DBs may require changes to the

unified conceptual model.

The query translation approach does not maintain a central data repository. Instead, it

divides the user queries into different sub-queries — one for each DB within the system.

Then, mediators or wrappers execute the sub-queries in the respective DBs. Results from

different DBs are gathered and returned to the user. Depending on the data

conceptualization model used, four different categories can be identified: i) pure mediation,

ii) single conceptual schema, iii) multiple conceptual schemas, and iv) hybrid approach.

We have analyzed a number of systems and placed them into these categories.
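The query translation idea described above can be illustrated with a minimal sketch. It is not taken from any of the systems cited here: the `Wrapper` class, the in-memory "sources", and the record fields are hypothetical stand-ins, and a real mediator would also rewrite the query for each source rather than forwarding it unchanged.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class Wrapper:
    """Hypothetical wrapper hiding how one data source is queried."""
    name: str
    execute: Callable[[Dict[str, str]], List[Record]]


def in_memory_source(rows: List[Record]) -> Callable[[Dict[str, str]], List[Record]]:
    """Build a toy sub-query executor over a list of rows (stands in for a real DB)."""
    def execute(criteria: Dict[str, str]) -> List[Record]:
        return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]
    return execute


def run_query(criteria: Dict[str, str], wrappers: List[Wrapper]) -> List[Record]:
    """Send one sub-query to each wrapper and gather the results, tagged by source."""
    gathered: List[Record] = []
    for wrapper in wrappers:
        for record in wrapper.execute(criteria):
            gathered.append({**record, "source": wrapper.name})
    return gathered


# Toy data only; the identifiers are placeholders, not real database entries.
WRAPPERS = [
    Wrapper("db_a", in_memory_source([{"gene": "GENE1", "disease": "disease X"}])),
    Wrapper("db_b", in_memory_source([{"gene": "GENE1", "article": "article 1"},
                                      {"gene": "GENE2", "article": "article 2"}])),
]
```

Calling `run_query({"gene": "GENE1"}, WRAPPERS)` returns one record from each toy source, each labeled with the source it came from.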

Systems based on pure mediation employ wrappers and mediators to execute user queries

(e.g., TSIMMIS [4], DISCO, DIOM, HERMES, BioKleisli, BioDataServer [5]). These

mediators contain all the information needed to retrieve the requested data and to present

them to the user. The systems do not explicitly conceptualize the structure of the accessible

data, and, thus, the approach is less intuitive for users than other approaches based on data

conceptualization.

The single conceptual schema approach uses a global conceptualization model for the data

from all integrated databases. The advantage of this approach is that users can specify their

queries with regard to a single global conceptual schema. However, as with data

warehouses, any changes to the set of integrated DBs may call for modifications to the

global conceptualization model. Examples of such systems are SIMS [6], Pegasus [7],

Garlic [8], TAMBIS [9][10], ARIADNE [11], BACIIS [12], and DiscoveryLink [13].

The multiple conceptual schema approach does not rely on a global conceptualization

model of the data. Instead, each DB is described by an individual conceptual schema.

Additions, modifications and removals of DBs only affect their conceptual schemas, not


the whole system. User queries may be expressed using terms from specific domain

ontologies. However, it cannot be generally assumed that the individual schemas employ

the terms of such domain ontologies, and, thus, some relevant results may not be found

when a query is executed. An example of a system based on multiple conceptual schemas

is OBSERVER [14].

Finally, systems using the hybrid query translation approach use individual conceptual

schemas to describe each database, but ensure that these schemas have been created using a

common global conceptualization or domain ontology. As with the previous approach, the

incorporation of new DBs, or the modification or removal of DBs does not require changes

to the whole system. Moreover, users can specify their queries with respect to the domain

ontology, and these queries are guaranteed to be routed to the correct databases.

Examples of hybrid query translation systems are PICSEL [15], COIN [16], MECOTA

[17], BUSTER [18], and SEMEDA [19].

A different approach to integrating databases that cannot be classified within the above

taxonomy is schema matching. Actually, schema matching is performed as an individual

step in all the approaches included in the above classification —with the exception of

information linkage.

Schema matching identifies conceptually equivalent objects in two or more schemas and

creates a unified schema by specifying mappings between equivalent schema objects.

Schema matching is used in several application domains, such as schema

integration (constructing a global view from two independently developed schemas), data

warehousing, E-commerce (to translate messages from trading partners) and semantic

query processing (in the case of the system presented in this paper).

Since schema matching is mostly carried out manually, it becomes a problem as database

schemas become larger. Therefore, the need for tools and methods to perform schema

matching automatically has grown in recent years. Although there are a lot of methods and

tools addressing this problem, it still remains unsolved. For further details on schema

matching, see [20], which gives a comprehensive survey of schema matching methods, a

classification of these methods, and some examples of systems following the different

schema matching approaches.


As regards OntoFusion, schema matching is performed in the unification process. The

unification process is fully automated, since all the semantic dissimilarities and

inconsistencies have been removed in the mapping phase. Most automatic schema

matching methods currently available deal with these inconsistencies, but the results are

usually not good. It is evident that with the current state of the technology, human

intervention is required to reliably perform database integration tasks.

2.2 Public Web-based biomedical databases

The most common approach for carrying out an information extraction process from a web

page is to use robots and wrappers. Wrappers must have sufficient information to be able to

extract the desired data from target web pages. There are different approaches to building

wrappers and obtaining the necessary information from HTML pages. Systems like Ariadne [11] and

TSIMMIS [4][21] describe web page structures using a declarative language, and wrappers

are able to extract the desired data using this information. In Lixto [22], web pages are

described using XML configuration files that describe the location of relevant information

inside the page. These two approaches entail the intervention of a human supervisor who

has to study and describe the data to be extracted. On the other hand, systems like

RoadRunner [23] and others [24], exploit similarities between different pages at the same

Web site to get web page structures automatically. Related recent examples are [25] and

[26].

An approach presented in [27] introduces a different perspective, proposing patterns for

building web pages. It later improves the information extraction process using wrappers

and robots.

Two kinds of DBs can be integrated with OntoFusion: private DBs, and public Web-based

DBs. In this paper, we consider “private” DBs to be those biomedical data sources that are

created, maintained and used at research centers, hospitals, universities, etc. Access to these

data sources is usually restricted to users who are members of the institution owning the

DB. Private DBs are usually stored using DBMS. Often, they do not provide public

interfaces that can be accessed by anonymous users, and querying these DBs may require


substantial knowledge of the conceptual and physical schema of the DB. On the other hand,

public Web-based DBs are data sources —also usually stored using DBMS— that can be

accessed by external, often anonymous users over the Internet. Examples of such DBs are

OMIM, SwissProt, Prosite and many others. Public Web-based DBs have HTML-based

interfaces, and all that is needed to query these data sources is a Web browser. Commonly,

a user specifies a query by filling in an HTML form, and the results are presented as HTML

pages, XML files or plain text files. However, public DBs present an enormous diversity of

user interfaces both for query specification and result presentation. This feature, together

with the fact that they cannot be accessed directly, but only through HTML pages, makes

such DBs harder to integrate. To actually access the data, wrappers have to be created to

act as connecting points between the integration system and the actual data sources. These

wrappers have to translate user queries to HTTP requests and extract the results from the

HTML pages.

There are numerous molecular biology DBs (over 700). Of these, we

selected seven. All the selected DBs fulfill the following criteria: the DB is maintained by a

reference institution; the DB is freely accessible through the Internet; the content of the DB is

relevant in the context of genomic research; and the DB represents a primary resource for the

type of data it stores. In addition, these seven DBs provide a significant, representative view of

the landscape of public biomedical databases, including data on DNA variations, proteins,

metabolism, disease, and biomedical literature.

| Database | Created by | Type of data | Purpose | Number of entries |
|---|---|---|---|---|
| OMIM, Online Mendelian Inheritance in Man | McKusick, Johns Hopkins University | Human genes and genetic disorders | For use by clinicians, researchers and other professionals or students interested in genetic disorders. | Over 15000 entries |
| Entrez PubMed | National Library of Medicine | Publications and articles | To give access to citations from MEDLINE and other life science journals, including links to full text articles. | Over 15 million citations for biomedical articles dating back to the 1950s |
| ENZYME | The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) | Nomenclature of enzymes | To search recommendations of the International Union of Biochemistry and Molecular Biology (IUBMB) Nomenclature Committee. To find characterized enzymes for which an EC (Enzyme Commission) number has been provided. | Over 4000 entries |
| PROSITE and PROSITE documentation | The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) | Protein families and domains | To find patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. | Over 1200 documentation entries that describe over 1700 different patterns, rules and profiles/matrices |
| PDB, Protein Data Bank | Research Collaboratory for Structural Bioinformatics (RCSB) [the DB is operated by Rutgers University] | 3-D biological macromolecular structure data | To create a single worldwide repository for processing and distributing 3-D biological macromolecular structure data. | Over 27000 entries |
| dbSNP, Single Nucleotide Polymorphism | The National Center for Biotechnology Information | Single Nucleotide Polymorphism | To serve as a central repository for both single base nucleotide substitutions and short deletion and insertion polymorphisms. They could be used by additional laboratories, using the sequence information around the polymorphism and the specific experimental conditions. | Over 1.5 million entries from 27 different organisms |
| InterPro | EMBL-EBI, European Bioinformatics Institute | Protein families, domains and functional sites | To offer identifiable features found in known proteins that can be applied to unknown protein sequences. | 11007 entries, representing 2573 domains, 8166 families, 201 repeats, 26 active sites, 21 binding sites and 20 post-translation modification sites, at the time of writing this paper |

Table 1. Characteristics of seven public Web-based DBs.


3. Methods

In this section we present our approaches to the two fundamental issues in the OntoFusion

system: i) database integration and ii) mechanisms to access public Web-based DBs.

3.1. Database Integration with OntoFusion

The problem of database integration can be subdivided into two subproblems: i) the

technological integration of different data sources, and ii) the conceptual integration of

those sources. OntoFusion addresses the first of these issues by using a multiagent

architecture. Database agents that act as wrappers are used to hide the actual database

access procedures from the rest of the system. Such wrapper agents were created for public

Web-based DBs, as well as for private DBs that are accessible through ODBC or JDBC.

The second issue, the conceptual integration of databases, refers to the need to overcome

data heterogeneity at the schematic level. The data provided by a set of different databases,

each with different database schemas, have to be described through a common conceptual

schema. OntoFusion uses a “hybrid query translation” database integration approach. In

particular, each integrated database is represented by an individual conceptual schema,

which we call a virtual schema. These virtual schemas are generated by means of a mapping

process, in which an administrator assigns the structural elements from databases to

concepts in a domain ontology. Figure 1 shows a schematic representation of this process.


Figure 1. Mapping process in OntoFusion

Elements from the physical database schema are mapped to elements in the domain ontology. A virtual schema for the database is generated from the identified concepts (yellow circles), relationships (green

circles), and attributes (red circles) in the domain ontology.
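As a rough illustration of the mapping step, the sketch below assigns physical schema elements to ontology concepts and attributes, and rejects any assignment that has no counterpart in the ontology. The data structures, the toy ontology, and the `table.column` naming are assumptions made for this example; they are not OntoFusion's internal representation.

```python
from typing import Dict, Set, Tuple

# Toy domain ontology: concept name -> attribute names (hypothetical content).
ONTOLOGY: Dict[str, Set[str]] = {
    "Protein": {"name", "sequence"},
    "Disease": {"name", "description"},
}


def build_virtual_schema(
    mapping: Dict[str, Tuple[str, str]],
    ontology: Dict[str, Set[str]],
) -> Dict[str, Dict[str, str]]:
    """Turn administrator-supplied mappings into a virtual schema.

    `mapping` relates a physical element ("table.column") to an ontology
    (concept, attribute) pair. Every mapped element must exist in the
    ontology, so the resulting schema is a subset of the ontology.
    """
    schema: Dict[str, Dict[str, str]] = {}
    for element, (concept, attribute) in mapping.items():
        if concept not in ontology or attribute not in ontology[concept]:
            raise ValueError(f"{element}: {concept}.{attribute} is not in the ontology")
        schema.setdefault(concept, {})[attribute] = element
    return schema
```

Because only ontology terms can appear in the result, two virtual schemas built against the same ontology necessarily share a vocabulary, which is the property the later unification step depends on.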

The purpose of the domain ontology is to provide a common conceptual framework to

which each integrated database is mapped. The system allows the use of several domain

ontologies such that specific ontologies can be used to map databases with data from a

common application domain. Furthermore, it allows the use of domain ontologies in

several ways. Fixed, pre-existing ontologies or controlled vocabulary resources —e.g., the

Gene Ontology (GO) [28], the HUGO Gene Nomenclature Committee (HGNC)

or the Unified Medical Language System (UMLS) [29][30]— may be used to integrate

biological and clinical DBs [31] [32] [33] [34]. It is also possible to generate domain

ontologies from scratch or to extend predefined ontologies, if necessary, with new concepts

that appear when more databases are integrated. The mapping process ensures that all

structural elements from a database that are reflected in the virtual schema match some

element in the domain ontology used. Thus, different virtual schemas that have been built

with the same domain ontology share the same vocabulary; in fact, each virtual schema is a

subset of the domain ontology used. This is the basis for the next integration step: the

unification of multiple virtual schemas into a virtual unification schema. Such a schema


represents a conceptual description of the data integrated from a set of different databases.

Unification is a completely automatic process that imposes only one constraint: all virtual

schemas to be unified must have been built using the same underlying domain ontology.

The algorithm has been developed by the authors [35]. It strongly relies on the fact that any

semantically identical elements in two different virtual schemas use the same descriptors.

This will be the case if the virtual schemas have been carefully generated using the same

underlying domain ontology. Briefly, the algorithm works as follows. All concepts

appearing in the original virtual schemas are passed to the virtual unification schema. This

way, identical concepts (sharing the same descriptor) or hierarchically related concepts are

unified, i.e., they are represented by a single concept in the new schema. The representative

concept is the most general of a set of hierarchically related concepts. All the attributes of a

concept in the original schemas and all the relationships a concept is involved in are added

to the representative concept in the virtual unification schema. Then, different attributes

and relationships with the same descriptors are unified into single attributes and

relationships, respectively.
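The brief description of unification above can be sketched as follows. This is an illustrative reading, not the published algorithm [35]: it assumes that "hierarchically related" means ancestor/descendant in a single-parent is-a hierarchy, and the schema layout (sets of attribute names and of (relationship, target) pairs) is invented for the example.

```python
from typing import Dict, List, Set

Schema = Dict[str, Dict[str, Set]]


def representative(concept: str, present: Set[str], parent: Dict[str, str]) -> str:
    """Return the most general ancestor of `concept` that occurs in some schema."""
    rep = concept
    node = parent.get(concept)
    while node is not None:
        if node in present:
            rep = node
        node = parent.get(node)
    return rep


def unify(schemas: List[Schema], parent: Dict[str, str]) -> Schema:
    """Merge virtual schemas that were built on the same domain ontology."""
    present = {concept for schema in schemas for concept in schema}
    unified: Schema = {}
    for schema in schemas:
        for concept, parts in schema.items():
            rep = representative(concept, present, parent)
            target = unified.setdefault(rep, {"attributes": set(), "relationships": set()})
            # Sets merge attributes that share the same descriptor.
            target["attributes"] |= parts["attributes"]
            for name, other in parts["relationships"]:
                # Relationship targets are redirected to their representative too.
                target["relationships"].add((name, representative(other, present, parent)))
    return unified
```

With a hierarchy in which a hypothetical concept `Enzyme` is a kind of `Protein`, a schema containing `Protein` and another containing `Enzyme` are unified into a single `Protein` concept carrying the attributes and relationships of both.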

Both types of virtual schemas (schemas for single real databases, output by the

mapping process, and schemas generated by the unification of multiple virtual schemas)

can be considered as virtual repositories. In the first case, these repositories provide access

to single real DBs, whereas in the latter case they provide integrated access to the data

contained in a set of DBs.

The proposed database integration approach, based on “mapping” and “unification”, allows

hierarchies of virtual repositories to be created. Different sets of virtual schemas (e.g., with

similar data) can be unified and their virtual unification schemas can be unified again.

3.2. Accessing public Web-based DBs

Public Web-based DBs can be represented in the same way as private DBs: through virtual

schemas. However, they entail additional difficulties. First, their physical database schema

is usually not known and cannot be easily obtained and, second, their data cannot be

directly accessed (e.g., using query languages like SQL). Because of these characteristics, public

Web-based DBs require special access mechanisms. Instead of using a special module for

each individual database, OntoFusion uses a generic access module for public Web-based

DBs, which can be configured to give access to different such

DBs. The required configuration information has to be specified as XML files.

To integrate a database, OntoFusion first creates its virtual schema. To do this, the mapping

process, which relates elements and structures from the physical database schema to

concepts in the virtual schema, needs to be completed. When a private DB is going to be

mapped, its physical schema (i.e., tables, attributes, attribute types, primary keys, foreign

keys, etc.) can be extracted automatically. Public Web-based DBs do not offer a direct

connection to their DBMS. Therefore, their physical schemas cannot be obtained

automatically. Instead, these schemas have to be created manually. In particular, an XML

file containing the physical schema has to be constructed by an administrator. This task

calls for an in-depth analysis of each public DB, extracting the concepts, attributes and

relationships that appear in the database’s Web interface.

Queries in public Web-based DBs are specified through Web interfaces and correspond to

URLs. The search arguments may be parameters of such URLs (e.g., in

http://www.ebi.ac.uk/interpro/ISearch?query=IPR000028&mode=ipr) or they may be

passed through HTML forms. There are no unified query languages for public Web-based

DBs. This means that the Web interfaces of each public DB must be analyzed to determine

how the URLs for user queries are built. The attributes that appear in Web forms —their

types and names— as well as other features, such as, for instance, grouping values, logical

operators, ranges of values and wildcard symbols, must be identified. An XML file

describing this query language has to be created for each public Web-based DB.
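To make the idea of a per-database query description concrete, the sketch below builds a query URL from such a description. The dictionary layout and the attribute name `accession` are assumptions made for illustration (the paper stores this information in XML); only the base URL and the `query` and `mode` parameters come from the InterPro example quoted above.

```python
from typing import Dict
from urllib.parse import urlencode

# Hypothetical in-memory form of one database's query description.
INTERPRO_QUERY = {
    "base_url": "http://www.ebi.ac.uk/interpro/ISearch",
    "parameters": {"accession": "query"},  # schema attribute -> URL parameter
    "fixed": {"mode": "ipr"},              # parameters sent with every request
}


def build_query_url(description: Dict, criteria: Dict[str, str]) -> str:
    """Translate attribute/value search criteria into the source's query URL."""
    params: Dict[str, str] = {}
    for attribute, value in criteria.items():
        if attribute not in description["parameters"]:
            raise KeyError(f"attribute {attribute!r} cannot be queried at this source")
        params[description["parameters"][attribute]] = value
    params.update(description["fixed"])
    return description["base_url"] + "?" + urlencode(params)
```

Under these assumptions, `build_query_url(INTERPRO_QUERY, {"accession": "IPR000028"})` reproduces the example URL given in the text.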

Once a query has been issued, the results are presented as HTML pages. For most

biomedical data sources, intermediate results pages —containing a list of objects or

instances that meet the search criteria— are returned. Each entry in such a list corresponds

to a hyperlink to the complete description of an individual result instance. Usually, this list

is ordered and split across several pages, allowing users to inspect all the results very

quickly. To extract the results entries from such pages, the data access module requires

another XML file that describes the precise structure of the intermediate HTML pages of a

public Web-based DB.
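One simple way to picture the extraction of result entries from an intermediate page is shown below, using Python's standard `html.parser`. The sample page, the URL pattern, and the host `example.org` are invented for the example; in the system described here the page structure would come from the XML description file rather than from a hard-coded regular expression.

```python
import re
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class ResultLinkParser(HTMLParser):
    """Collect hyperlinks whose href matches the pattern for result entries."""

    def __init__(self, base_url: str, href_pattern: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.href_pattern = re.compile(href_pattern)
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.href_pattern.search(href):
            # Resolve relative links against the URL of the results page.
            self.links.append(urljoin(self.base_url, href))


def extract_result_links(html: str, base_url: str, href_pattern: str) -> List[str]:
    parser = ResultLinkParser(base_url, href_pattern)
    parser.feed(html)
    return parser.links


SAMPLE_PAGE = """
<html><body>
<a href="/help.html">Help</a>
<ul>
<li><a href="/entry?id=A001">Entry A001</a></li>
<li><a href="/entry?id=A002">Entry A002</a></li>
</ul>
<a href="/search?page=2">Next</a>
</body></html>
"""
```

Here `extract_result_links(SAMPLE_PAGE, "http://example.org/search", r"^/entry\?id=")` keeps the two entry links and ignores the help and pagination links; following the pagination link would be a separate step.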


Finally, to get the descriptions of an individual result instance, the system needs to access

the respective HTML page (through the hyperlink extracted in the intermediate pages).

Again, the presentation of results is different for each Web-based DB. Concepts and

attributes can be presented as hyperlinks, plain text, tables or even as images. OntoFusion

parses the results pages and extracts the requested data. To do this, the system needs to

know where the pertinent data is located inside the HTML results pages. Again, this

information has to be specified in an XML file that describes the structure of the results

pages and the precise location of each data item. Some public DBs are able to return

the final results pages as formatted text or even as XML files. In these cases, OntoFusion is

also able to extract the data they contain. The XML configuration files describing such results

pages (text files or XML files) are easier to create than those for HTML results pages, because

the results are better structured.
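For results returned as formatted text, a description of "where each data item is located" can be as simple as one pattern per attribute. The sketch below assumes a line-oriented text format with two-letter line codes; the codes, attribute names, and sample entry are hypothetical and only illustrate the principle.

```python
import re
from typing import Dict, Optional

# Hypothetical location rules: attribute -> pattern with one capture group.
LOCATION_RULES: Dict[str, str] = {
    "identifier": r"^ID\s+(\S+)",
    "description": r"^DE\s+(.+)$",
}

SAMPLE_ENTRY = "ID   X0001\nDE   Example entry used for illustration.\n"


def extract_instance(page_text: str, rules: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply each location rule to the page and collect the matched values."""
    instance: Dict[str, Optional[str]] = {}
    for attribute, pattern in rules.items():
        match = re.search(pattern, page_text, flags=re.MULTILINE)
        instance[attribute] = match.group(1).strip() if match else None
    return instance
```

An attribute whose rule does not match is returned as `None`, so a change in the page layout shows up as missing values rather than as an exception.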

Summarizing, four XML files are needed to integrate a public Web-based DB: i) a file

containing the identified physical schema, ii) a file describing the database’s query system,

iii) a file describing the structure of intermediate results pages, and iv) a file describing the

structure of the pages containing the individual results. The first of these files is used in the

mapping process, the second to translate and execute queries, and the last two provide the

information needed to extract results. All of these files have to be created manually, which

requires a detailed analysis of the Web interface of the database that is going to be

integrated.

4. System Overview

4.1. System Architecture

OntoFusion uses a multiagent architecture based on the JADE multiagent platform. This

makes it possible to run different parts of the system on different machines. Figure 2

presents the four principal system modules: i) graphical interface, ii) vocabulary server

module, iii) mediator module, and iv) DB access module.

Figure 2. Schematic representation of the OntoFusion system. The system contains four main modules (user interface, vocabulary server module, mediator module, and DB access module) that interact with each other. The interaction between the different modules is carried out through a multiagent platform. Dashed arrows represent the use of external resources.

The mediator module is the core system module. It is responsible for querying and

accessing virtual repositories. Each virtual repository (i.e., each virtual schema obtained

through the mapping or unification processes) is assigned to an individual agent. We

consider agents from a software engineering perspective, i.e., as independent and

autonomous software components that carry out specific tasks and can use the services of

other agents. Within OntoFusion, agents provide transparent access to the virtual

repositories. They play two fundamental roles: i) they are able to execute queries issued to

their repository and return the retrieved results, and ii) they can provide their virtual

schema to other agents. Virtual schemas are stored using the DAML+OIL ontology

language. RDQL is used as the query language. The results gathered by agents from their

underlying DBs are returned as instances of the virtual schemas, i.e., as DAML+OIL

instance files. The mediator module is able to divide and propagate user queries through the

agent society. It collects and merges the results of these queries and sends them back to the

user interface.

The graphical interface contains the user interface and the administrator console. The user

interface can be accessed through the Web. It presents two fundamental characteristics.

First, it has been created as an ontology navigator, where the users can explore and

navigate through the virtual schemas of the integrated DBs. Second, it presents exactly

what information is accessible at any one time. A search process is usually done as follows.

The user explores the hierarchy of the virtual repositories that are available in the system.

After selecting an appropriate repository, he or she can navigate through its virtual schema.

Then, he or she can select a concept and issue a query in order to retrieve instances of that

concept. Filter criteria for the concept properties are created by filling in a form that

specifies the query. Then, the user interface automatically creates an RDQL query, which is

sent to the respective agent in the mediator module. After the results have been returned,

they are presented to the user as instances of the selected concept. A value-added feature of

OntoFusion is that the user can navigate from the retrieved instances to related instances of

other concepts following the identified relationships among concepts.
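
The step in which the interface turns a form selection into a query can be sketched as follows. This is a minimal illustration, not the OntoFusion implementation: the function `build_rdql` and the `example.org` namespace are hypothetical, the output merely mirrors the layout of the sample query given in this paper, and it has not been validated against an RDQL engine. How filter criteria are encoded is not shown in the source, so they are omitted here.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def build_rdql(concept, attributes, namespace):
    """Builds an RDQL-style query string that selects `attributes` of `concept`."""
    # One result variable per selected attribute, named after concept and attribute.
    variables = [f"?{concept}.{attribute}" for attribute in attributes]
    # The first pattern restricts ?x to instances of the selected concept.
    patterns = [f"(?x, <rdf:type>, <h:{concept}>)"]
    # One pattern per attribute binds its value to the corresponding variable.
    for attribute, variable in zip(attributes, variables):
        patterns.append(f"(?x, <h:{concept}.{attribute}>, {variable})")
    return (
        "SELECT " + ", ".join(variables)
        + " WHERE " + " ".join(patterns)
        + f" USING rdf for <{RDF_NS}>, h for <{namespace}>"
    )


query = build_rdql("Enzyme", ["Enzyme_ID", "Official_Name"], "http://example.org/vdb#")
print(query)
```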

The administrator interface is used to start different OntoFusion components. It provides

facilities to monitor the system and is used for other administrative tasks like, for example,

the integration of new DBs through the mapping process or the unification of virtual

schemas with the unification tool. The process of mapping DBs to virtual schemas is

supported by the mapping tool. After selecting a domain ontology, an administrator creates

the virtual schema by choosing, from that ontology, the concepts, concept attributes and

relationships that conceptually appear in the database, and maps them to the

structural elements (tables, fields, relationships) in the database’s physical schema. For

private DBs, the physical schema is automatically produced by the mapping tool. In the

case of public Web-based DBs, it is obtained from the respective XML configuration file.

The mappings are stored and are later used to translate user queries into the

database-specific query languages. To unify existing virtual schemas, an administrator simply selects

the schemas to be unified in the unification tool, and the virtual unification schema is

generated automatically.
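
Because all virtual schemas borrow their terms from a domain ontology, unification can be pictured as a merge keyed on concept names. The sketch below works under that assumption; the dictionary-based schema representation, the repository names and the function `unify_schemas` are hypothetical and do not reflect how OntoFusion stores DAML+OIL schemas.

```python
def unify_schemas(schemas):
    """Merges virtual schemas given as {repository: {concept: set_of_attributes}}."""
    unified = {}
    for repository, schema in schemas.items():
        for concept, attributes in schema.items():
            # Concepts with the same standardized name are treated as the same concept.
            entry = unified.setdefault(concept, {"attributes": set(), "sources": []})
            entry["attributes"] |= set(attributes)
            entry["sources"].append(repository)
    return unified


# Attribute spellings follow the sample query in this paper.
schemas = {
    "Enzyme_VR": {
        "Enzyme": {"Enzyme_ID", "Official_Name"},
        "Functional_Domain_Documentation": {"PrositeDoc_ID"},
    },
    "PrositeDoc_VR": {
        "Functional_Domain_Documentation": {"PrositeDoc_ID", "Documetation"},
    },
}

unified = unify_schemas(schemas)
for concept, entry in unified.items():
    print(concept, sorted(entry["attributes"]), entry["sources"])
```

Keeping the list of source repositories for each concept is what later allows a query on the unified schema to be divided among the repositories that can answer it.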

The vocabulary server maintains all the domain ontologies that have been used to integrate

the DBs. It may also contain other ontologies or controlled vocabularies that may be of use

in a given application domain. These ontologies are used to map and unify DBs.

Furthermore, they can be directly accessed by users, for instance, to refine queries, e.g.,

by searching for synonyms or the most frequently used strings.

The DB access module handles communication between the system and the physical DBs. It

contains the wrappers that translate queries from the intermediate query language (RDQL)

into the query languages of each particular DB. After executing a query and retrieving the

results, these are returned as instances of the DB’s virtual schema. The module contains

two parts: the private DBs module and the public Web-based DBs module.

4.2. Query Processing

Once an RDQL query has been generated by the user interface, it is sent to the associated virtual repository (the agent in charge of this repository). For example, when a user asks the interface for the documents containing the term ‘fever’ in the concept ‘Functional_Domain_Documentation’ of a virtual repository of the Prosite Documentation database, the generated RDQL query is shown below:

SELECT ?Functional_Domain_Documentation.PrositeDoc_ID, ?Functional_Domain_Documentation.Documetation
WHERE (?x, <rdf:type>, <h:Functional_Domain_Documentation>)
(?x, <h:Functional_Domain_Documentation.PrositeDoc_ID>, ?Functional_Domain_Documentation.PrositeDoc_ID )
(?x, <h:Functional_Domain_Documentation.Documetation>, ?Functional_Domain_Documentation.Documetation )
(?y, <rdf:type>, <h:Functional_Domain>)
(?x, <h:Functional_Domain_Documentation.Related_To.Functional_Domain>, ?y)
USING rdf for <http://www.w3.org/1999/02/22-rdf-syntax-ns#> , h for http://infomed.dia.fi.upm.es/PrositeDoc_VDB#

Figure 3. An example RDQL query

If this repository is the unification of a set of other virtual repositories, the query is divided and translated into sub-queries, which are sent to those repositories. The process is repeated until the queries reach the repositories that directly match the physical DBs. Repositories for physical DBs translate the queries into the DB-specific query language. For the example stated before, the generated URL was: http://au.expasy.org/cgi-bin/prosite-search-ful?SEARCH=fever&makeWild=on . This translation process uses the correspondence file created in the mapping process to convert the virtual concepts in the RDQL query into elements from the physical database schema. After executing a query, the retrieved results are reconverted into instances of the virtual schema. Then, the results are propagated backwards, as DAML+OIL instance files, until they reach the user interface. During this process, intermediate repositories merge the results received from different sources. Translation may also be necessary at this stage. For instance, if the instances from a source do not provide values for a requested attribute, these values are set to “Without information”. Figure 3 presents a sample query execution scenario. As can be seen, the query is propagated from the user interface through the hierarchy of virtual repositories down to the physical DBs. The results are returned along the same path.
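
The propagation and merging described above can be sketched as a simple recursion. The classes below are hypothetical stand-ins: in OntoFusion the repositories are JADE agents that exchange RDQL queries and DAML+OIL instance files, whereas this sketch uses lists of attribute names and Python dictionaries, and leaves out query translation altogether.

```python
MISSING = "Without information"


class PhysicalRepository:
    """Stand-in for a repository that maps directly onto one physical DB."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts: attribute name -> value

    def query(self, attributes):
        # Return only the requested attributes that this source actually holds.
        return [{a: row[a] for a in attributes if a in row} for row in self.rows]


class UnifiedRepository:
    """Stand-in for a repository obtained by unifying other repositories."""

    def __init__(self, children):
        self.children = children

    def query(self, attributes):
        merged = []
        for child in self.children:  # propagate the query downwards
            for instance in child.query(attributes):
                # Merge on the way back, marking values the source did not provide.
                merged.append({a: instance.get(a, MISSING) for a in attributes})
        return merged


source_a = PhysicalRepository([{"ID": "a1", "Name": "first"}])
source_b = PhysicalRepository([{"ID": "b1"}])  # provides no "Name" attribute
top = UnifiedRepository([UnifiedRepository([source_a]), source_b])

results = top.query(["ID", "Name"])
for instance in results:
    print(instance)
```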

4.3. Components of the public Web-based DB module

Figure 4 shows the components of the public Web-based DB module. This module contains

three principal components: i) the query translator, ii) the result extractor, and iii) the cache

server. The first two of these components constitute the module core, whereas the cache

server provides performance enhancements.

The query translator is the component that translates RDQL queries for the virtual schema

of the database into executable queries for a public Web-based DB. It performs the

translation in two separate steps. First, it translates all the concepts appearing in the query

into terms from the physical database schema using the information stored during the mapping

process. In the second step, it translates the query expressed in RDQL into an appropriate

URL.
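
A minimal sketch of these two steps is given below. It starts from already-parsed search criteria rather than from RDQL text, and the mapping of the virtual attribute to the `SEARCH` form parameter, as well as the treatment of `makeWild` as a fixed parameter, are assumptions inferred from the example URL in Section 4.2, not taken from the actual configuration files.

```python
from urllib.parse import urlencode

# Hypothetical mapping from virtual-schema attributes to Web form parameters.
MAPPING = {"Functional_Domain_Documentation.Documetation": "SEARCH"}
BASE_URL = "http://au.expasy.org/cgi-bin/prosite-search-ful"
FIXED_PARAMS = {"makeWild": "on"}  # assumed to be sent with every query


def translate_to_url(criteria, mapping, base_url, fixed_params):
    """Translates {virtual attribute: value} criteria into a query URL."""
    # Step 1: replace virtual attribute names with physical (form) parameter names.
    physical = {mapping[attribute]: value for attribute, value in criteria.items()}
    # Step 2: serialise the physical criteria as the query string of a URL.
    physical.update(fixed_params)
    return base_url + "?" + urlencode(physical)


url = translate_to_url(
    {"Functional_Domain_Documentation.Documetation": "fever"},
    MAPPING,
    BASE_URL,
    FIXED_PARAMS,
)
print(url)
```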

Figure 4. Public Web-based DB module

Once a query has been translated into a URL, it is sent to the result extractor. This

component is responsible for executing the queries, retrieving the results and transforming

them into instances of the virtual schemas of public Web-based DBs. First, the intermediate

results page is obtained by issuing an HTTP request for the query URL to the server. The

retrieved page is parsed using the XML file that contains the description of intermediate

pages. The result extractor obtains the list of links (URLs) leading to the complete

description of each individual result instance. Afterwards, each individual result instance is

treated as follows. An HTTP request is issued to get the HTML page with the individual

result description. This page is parsed —using the information from the XML file that

describes these pages— and the relevant information is extracted and converted into an

instance of the DB’s physical schema. Finally, the results are converted into instances of

the DB’s virtual schema and sent as a DAML+OIL file to the agent that originated the

query.

Query execution through the public Web-based DB module is time-consuming. This is

because many Web pages have to be parsed and analyzed for each query. To

improve the performance of query executions on public Web-based DBs, a cache server

has been implemented to store the results of past queries.
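
The idea behind such a cache can be sketched as follows. The class `QueryCache` and the fake fetch function are hypothetical; the sketch keys results on the query URL and has no expiry policy, which a real deployment would presumably need because the underlying databases are updated over time.

```python
class QueryCache:
    """Stores the results of past queries, keyed by the query URL."""

    def __init__(self, fetch):
        self._fetch = fetch  # function that actually executes a query URL
        self._store = {}
        self.hits = 0

    def get(self, url):
        if url in self._store:
            self.hits += 1  # served from the cache, no Web page is parsed
            return self._store[url]
        result = self._fetch(url)
        self._store[url] = result
        return result


calls = []


def fake_fetch(url):
    """Stand-in for the costly retrieval and parsing of result pages."""
    calls.append(url)
    return "results for " + url


cache = QueryCache(fake_fetch)
first = cache.get("http://db.example.org/search?q=fever")
second = cache.get("http://db.example.org/search?q=fever")
print(len(calls), cache.hits)
```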

The public Web-based DB module is generic for all public DBs. Consequently, adding a new

public DB or modifying an existing one only requires creating or modifying the respective

XML configuration files.

5. Case study

In this section, we present a sample search on a virtual repository that was produced by

integrating and unifying seven public Web-based DBs —containing biomedical data—

with OntoFusion. The DBs are: OMIM, PubMed, Enzyme, Prosite and Prosite

documentation, PDB, SNP, and InterPro. These databases were selected for this study

because of their importance for biomedical research. Although we have

chosen this set, any other public Web-based DB could be added. Their characteristics are

summarized in Table 1. The reason for integrating these databases was twofold. On the one

hand, we wanted to evaluate the validity of the system. On the other hand, we considered

that providing unified access to these DBs would bring with it substantial benefits for

researchers in the fields of biology, biomedicine and genomics.

After analyzing these seven databases and mapping them to virtual schemas, we unified the

databases into a single virtual repository. The virtual unification schema of this repository

is presented in Figure 5.

Figure 5. Virtual unification schema for public Web-based DBs (see the text for details)

Figure 5 shows all the concepts and attributes integrated in the virtual unification schema, together with the individual DB to which each belongs. None of the concepts belongs to more than one database. Thus, the data from different DBs is not actually unified. However, the unification process establishes links between the different DBs. These links can be used to relate instances of a concept from one DB to instances of another concept from another DB. For example, instances of “Enzyme” from the Enzyme DB can be related to instances of “Functional domain documentation” from the Prosite documentation DB, because the Enzyme database returns a cross-reference to Prosite. This cross-reference, which is the access number in Prosite Doc, is mapped into the virtual repository of Enzyme. This attribute belongs to the “Functional domain documentation” concept, which is mapped into the virtual repositories of both the Enzyme and Prosite Documentation databases. By means of the unification process, when the user follows the relationship from the “Enzyme” concept to the “Functional domain documentation” concept, a query is built automatically. This query uses the cross-reference, and it is launched against both the Enzyme and Prosite Documentation databases. Thus, using unification, we are able to navigate through different databases.
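
The navigation step can be illustrated with a small sketch. Everything in it is hypothetical apart from the concept names and the enzyme identifier: repositories are modeled as in-memory dictionaries, the identifiers `DOC-A` and `DOC-B` are placeholders rather than real Prosite accession numbers, and the function `follow_relationship` stands in for the query that the system builds automatically.

```python
def follow_relationship(instance, xref_attribute, target_concept, repositories):
    """Returns instances of `target_concept` that match the cross-reference."""
    xref = instance[xref_attribute]
    related = []
    # The follow-up query is launched against every repository in the list.
    for repository in repositories:
        for candidate in repository.get(target_concept, []):
            if candidate.get(xref_attribute) == xref:
                related.append(candidate)
    return related


# Hypothetical in-memory repositories: concept name -> list of instances.
enzyme_repo = {"Enzyme": [{"Enzyme_ID": "1.1.1.1", "PrositeDoc_ID": "DOC-A"}]}
prosite_doc_repo = {
    "Functional_Domain_Documentation": [
        {"PrositeDoc_ID": "DOC-A", "Documetation": "text A"},
        {"PrositeDoc_ID": "DOC-B", "Documetation": "text B"},
    ]
}

enzyme = enzyme_repo["Enzyme"][0]
related = follow_relationship(
    enzyme,
    "PrositeDoc_ID",
    "Functional_Domain_Documentation",
    [enzyme_repo, prosite_doc_repo],
)
print(related)
```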

We present a sample search where a user accesses the virtual repository that covers all

seven DBs to retrieve information about a particular enzyme. After entering the graphical

user interface, the hierarchy of virtual repositories is presented to the user. After selecting

the repository he or she wants to access (in our case the repository that covers all seven

DBs), the virtual schema of the selected repository is opened in the ontology navigator.

Figure 6 gives an example. In this case, the virtual schema contains the nine different

concepts that appear in the seven DBs. The next step is to select the concept of interest, e.g.,

“Enzyme”.

When a concept is selected, a form containing all its attributes and relationships is presented. This is illustrated in Figure 6 for the “Enzyme” concept. The user specifies the attributes and relationships of interest (by ticking a box) and can enter search criteria for one or more attributes. In the example, the attributes ‘Enzyme_ID’ and ‘Official_Name’, and the relationship ‘Related_to.Functional_Domain_Documentation’ have been selected and the value ‘1.1.1.1’ has been entered into the field for ‘Enzyme_ID’. Thus, we search for enzymes that contain the string ‘1.1.1.1’ in their identifier.

Figure 6. Performing a search of instances of ‘Enzyme’ containing ‘1.1.1.1’ in its ‘Enzyme_ID’.

After submitting the query, results will be returned from all the DBs that contain the

queried concept (in the example only the Enzyme database). These results are presented in

the user interface as instances of the ‘Enzyme’ concept. In this case, only one enzyme

instance with the identifier 1.1.1.1 was found. It is presented using a form

similar to the one shown in Figure 6, but where the attributes contain the values found.

Relationships are returned as links. Following such a link, the user can inspect the related

instances of other concepts. The user can use this mechanism in our example to inspect the

‘Functional Domain Documentation’ instances of proteins related to the detected enzyme.

From this information he or she can navigate to the related instances of the ‘Functional

Domain’ concept. This is shown in Figure 7.

As shown in Figure 7, some attributes can contain the ‘No Information’ value. This means

that the DB on which the query was run did not contain these values. This may occur, for

example, for attributes that correspond to links to Web pages with additional information.

Such links may not be provided for all instances.

The presented example shows how a user can get information from different public DBs

using a single interface. In the example, the query has gathered results from two different

public Web-based DBs —Enzyme, and Prosite/Prosite Documentation. Thus, OntoFusion

can interconnect databases automatically through relationships among concepts in virtual

unification schemas. Such relationships can be used to navigate among concepts that

belong to different DBs, a feature that is not originally provided by the DBs in question.

However, the virtual schemas of the DBs must be carefully built in order to exploit these

cross-references.

6. Discussion

We have adopted a hybrid query translation approach in OntoFusion. It provides a solution

to the vocabulary problem associated with the multiple virtual conceptual schema approach.

Following this hybrid approach, the conceptual schemas of all the databases to be

integrated are created using terminology borrowed from a domain ontology. This approach

ensures that a given object from one schema and all its semantically equivalent

counterparts from the conceptual schemas of the other databases will have exactly the same

standardized term, facilitating the unification process.

OntoFusion’s use of a query translation approach has a drawback: searching is time-consuming, because many Web pages have to be parsed in the results extraction process. The total time spent on query execution depends on the network bandwidth and the available connection. To reduce this processing time, OntoFusion includes a caching mechanism. The cache server is used to increase efficiency and reduce query execution time.

Figure 7. Navigating through relationships. The top left window shows ‘Functional domain documentation’ instances related to the enzyme with identifier ‘1.1.1.1’. The bottom right window shows two instances of ‘Functional Domain’ that are related to the ‘Functional domain documentation’ instance.

One advantage of the query translation approach, as compared to the data translation

approach, is that it does not need to update the data in some centralized repository.

OntoFusion also allows users to store the results of a query. All the retrieved results can be

downloaded as DAML+OIL files. Such files could be used, for instance, in an XML-based

DB, e.g., the eXist database [36], to facilitate further studies or to carry out a more

detailed data analysis.

Other systems use ontologies for integrating distributed biological databases. Some systems

can integrate public databases by downloading them and storing the obtained data locally,

e.g., SEMEDA. Other systems integrate public databases by developing ad hoc wrappers for

each public database, e.g., TAMBIS. In contrast to TAMBIS, OntoFusion does not use a

single conceptual schema approach. When using this approach, modifications in the global

conceptualization model are needed to add or remove a database from the set of integrated

DBs. In addition, OntoFusion is able to exploit cross-references between DBs allowing

users to navigate across them.

It is worth noting that some of the public biomedical data resources mentioned in

this paper are freely available for downloading at their respective sites. Therefore, it would

be possible to download them and then consider them as private data resources located at

our institution. The reason for continuing to consider these databases as public resources is

that the philosophy of our system is distributed rather than centralized, i.e., our system does

not provide a centralized data warehouse containing information from several remote

sources. Instead, we offer middleware that provides users with tools and methods to access

information stored in different databases and located at remote sites. Besides, our system

does not harvest any data from these databases, since it does not store any data except for

results for the most frequent queries stored in the cache server. These

databases are updated periodically. Therefore, we have considered it

preferable to access these databases through their official websites rather than download

them every time they are updated. This approach ensures that we are always accessing the

latest version of each database. Nevertheless, we can take advantage of the fact that

some databases are available for download. This eases the mapping process,

since we can directly use (or derive, in the worst case) the physical schema of the specific

public database without having to analyze the structure of the HTML pages.

To query and extract the results from public databases, OntoFusion processes the respective

HTML files. Some public databases return the results in an XML format that is easier to

parse than HTML. However, we have adopted the more general approach, since not

every public database provides an XML interface.

OntoFusion integrates not only public Web-based DBs but also private DBs that are stored

on a local host. The integration of private clinical and public genetic databases is similar

to the examples shown in this paper, as long as some information can be shared. In this

sense, the system may also be used to update, enhance or improve the contents of private

DBs by unifying them with more general or more complete public Web-based or other

private DBs. This kind of heterogeneous integration, unifying databases located at different

sites and in different countries, is one of the goals of the European INFOBIOMED Network of

Excellence in Biomedical Informatics, of which the authors are members.

Ontologies are especially well suited for mapping database schemas. Most of the modern

database integration systems follow ontology-based approaches. They ease the

understanding of each domain, providing a framework with more semantic expressiveness

than the entity-relationship model. Thus, an ontology description language was selected to

store the virtual schemas within OntoFusion instead of other semantic models. XML was

used where user interaction was not expected, i.e. to store the mappings between the

database schemas and the virtual schema. DAML+OIL was employed to store virtual

schemas, since it was the most commonly used language at the time of system design.

Some biomedical vocabulary sources, such as UMLS, GO or HGNC, are not considered

real ontologies. They also pose several technical problems that were addressed within the

INFOGENMED project but are beyond the scope of this paper. GO and HGNC, which were

created independently, are now included within the UMLS. OntoFusion provides a tool to

use domain ontologies based on these vocabulary sources, following ontological

foundations and principles.

A recent review paper [37] has suggested the possibility of using a combination of agent

technologies and ontologies for biomedical database integration. OntoFusion has been

designed and implemented to test the feasibility of this idea in practice.

7. Conclusions

In this paper, we have presented OntoFusion, a database integration system capable of

offering unified access to remote heterogeneous databases. OntoFusion can be used to

query different databases through a single interface. The system is based on a query

translation integration approach, which represents each database by means of a virtual

schema representing all the concepts contained in the database. Virtual schemas of different

databases can be unified. This unification relies on the use of domain ontologies that

provide the conceptual framework for establishing semantic links among different

databases.

Information search in OntoFusion is supported by a navigation interface: the user can

navigate through the concepts of virtual schemas by following the relationships between

these concepts. A prototype demo of OntoFusion is publicly available on-line at

http://crick.dia.fi.upm.es:8080/Interface

A number of researchers at the “Carlos III” Institute of Health in Spain have recently begun

to use the system to retrieve integrated and updated data in the context of research into rare

genetic diseases. Some of the OntoFusion methods and tools are also being used for

research purposes within the currently active, European Commission-funded

INFOBIOMED Network of Excellence.

At the time of writing this paper, DAML+OIL is used as the ontology language for

representing virtual schemas. We are currently updating this language to OWL, the ‘de

facto’ ontology representation language standard.

In future research, the agent-based middleware used in OntoFusion could be enhanced

using mobile agents. Agents could move to the servers containing DBs, improving system

performance. GRID technologies and Web services are also being studied to improve the

system’s capabilities. These technologies could provide for the execution of OntoFusion in

different distributed environments, optimizing the available resources.

Acknowledgments

This research has been supported by funding from the EC INFOBIOMED Network of

Excellence, the INFOGENMED project, the INBIOMED project, the Spanish Ministry of

Health and the Spanish Ministry of Education and Science. We want to thank Rachel Elliot

for her editorial assistance.

References

[1] Galperin, M.Y., "The Molecular Biology Database Collection: 2005

update" Nucleic Acids Research, 2005. 33(D4-D25).

[2] INFOGENMED: A Virtual Laboratory for accessing and integrating genetic and

medical information for Health Applications. EC Project IST-2001-39013. 2002-

2004.

[3] Sujansky W. Heterogeneous Database Integration in Biomedicine. Journal of

Biomedical Informatics 2001;34(4):285-298.

[4] Chawathe S, Garcia-Molina H, Hammer J, Ireland K, Papakonstantinou Y, Ullman

J, Widom J. The TSIMMIS Project: Integration of Heterogeneous Information

Sources. Proceedings of IPSJ Conference, Tokyo, Japan, October 1994. p. 7-18.

[5] Freier A, Hofestadt R, Lange M, Scholz U, Stephanik A. BioDataServer: a SQL-based

service for the online integration of life science data. In Silico Biol 2002;2(2):37-57.

[6] Arens Y, Hsu CN, Knoblock CA. Query processing in the SIMS information

mediator. In M. N. Huhns and M.P. Singh (eds.), Readings in Agents. San Francisco:

Morgan Kaufmann: 1998.

[7] Shan MC, Ahmed R, Davis J, Du W, Kent W. Pegasus: a heterogeneous

information management system. Modern databases systems, W. Kim Ed., Chapter

32, ACM Press (Addison-Wesley Publishing Co.). Reading, MA: 1994.

[8] Carey M, Haas LM, Schwarz PM, Arya M, Cody WF, Fagin R, Flickner M,

Luniewski AW, Niblack W, Petkovic D, Thomas J, Williams JH, Wimmers EL.

Towards Heterogeneous Multimedia Information Systems. Proceedings of the 5th

International Workshop on Research Issues in Data Engineering, Taipei, Taiwan,

March 1995, IEEE, New York: 1995.

[9] Baker PG, Brass A, Bechhofer S, Goble C, Paton N, Stevens R. TAMBIS:

Transparent Access to Multiple BioInformatics Information Sources. An Overview.

Proceedings of the Sixth International Conference of Intelligent Systems for

Molecular Biology, ISMB98, Montreal: 1998.

[10] Baker PG, Goble CA, Bechhofer S, Paton NW, Stevens R, Brass A. An Ontology

for BioInformatics Application. BioInformatics 1999;15(6):510-520.

[11] Knoblock CA, Minton S, Ambite JL, Ashish N, Muslea I, Philpot AG, Tejada S.

The Ariadne Approach to Web-based Information Integration. International Journal

of Cooperative Information Systems 2001;10(1-2):145-169.

[12] Miled ZB, Li N, Kellet G, Sipes B, Bukhres O. Complex Life Science

Multidatabase Queries. Proceedings of the IEEE 2002;90(11).

[13] Haas LM, Schwarz PM, Kodali P, Kotlar E, Rice JE, Swope WC. DiscoveryLink: a

system for integrated access to life sciences data sources. IBM Systems Journal

2001:40(2);489-511.

[14] Mena E, Illarramendi A, Kashyap V, Sheth AP. OBSERVER: An approach for

Query processing in global information systems based on interoperation between

pre-existing ontologies. Distributed and parallel Databases 2000:8(2);223-271.

[15] Goasdoué F, Lattes V, Rousset MC. The use of CARIN Language and algorithms

for information integration: The PICSEL Project. International Journal of

Cooperative Information Systems 2000:9(4);383-401.

[16] Goh CH. Representing and Reasoning about semantic conflicts in heterogeneous

information sources. PhD Dissertation, Massachusetts Institute of Technology:

1997.

[17] Wache H, Scholz T, Stieghahn H, König-Ries B. An Integration Method for the

Specification of Rule-oriented Mediators. Proceedings of the International

Symposium on Database Applications in non-traditional Environments, Y.

Kambayashi and H Takura Eds. (EFIS 99), Kühlungsborn, Germany: 1999.

[18] Stuckenschmidt H, van Harmelen F, Fensel D, Klein M, Horrocks I. Catalogue

Integration: A case study in ontology-based semantic translation. Technical report

IR-474, Computer Science Department, Vrije Universiteit Amsterdam: 2000.

[19] Köhler J, Philippi S, Lange M. SEMEDA: ontology based semantic integration of

biological databases. Bioinformatics 2003:19(18);2420-2427.

[20] Rahm E, Bernstein PA. A survey of approaches to automatic schema matching. The

VLDB Journal 2001:10;334-350.

[21] Hammer J, García-Molina H, Cho J, Crespo A, Aranha R. Extracting

Semistructured Information from the Web. In Proceedings of the Workshop on

Management of Semistructured Data, Tucson, Arizona, USA, May 16: 1997. p 18-

25.

[22] Baumgartner R, Flesca S, Gottlob G. Visual Web Information Extraction with Lixto.

In Proceedings of 27th International Conference on Very Large Data Bases (VLDB

2001), Rome, Italy, September 11-14: 2001. p 119-128.

[23] Crescenzi V, Mecca G, Merialdo P. RoadRunner: Towards Automatic Data

Extraction from Large Web Sites. In Proceedings of 27th International Conference

on Very Large Data Bases (VLDB 2001), Rome, Italy, September 11-14: 2001. p

109-118.

[24] Robinson J. Data Extraction from Web Data Sources. 4th International Workshop on

Web Based Collaboration: 2004.

[25] Pistella D, Masseroli M, Pinciroli F. A new web system for automatic retrieval of

biomedical data from multiple internet based resources. Proceedings of Medinfo

2004;1811.

[26] Mougin F, Burgun A, Loreal O, Le Beux P. Towards the automatic generation of

biomedical sources schema. Proceedings of Medinfo 2004:783-7.

[27] Arasu A, Garcia-Molina H. Extracting Structured Data from Web Pages. In

Proceedings of ACM SIGMOD International Conference on Management of Data

(SIGMOD 2003), San Diego, California, USA, June 9-12: 2003.

[28] The Gene Ontology Consortium. Gene Ontology: tool for unification in biology.

Nature Genetics 2000:25;25-29.

[29] Lindberg C. The Unified Medical Language System (UMLS) of National Library of

Medicine. J Am Med Rec Asso 1990:61(5);40-2.

[30] Bodenreider O. The Unified Medical Language System (UMLS): integrating

biomedical terminology. Nucleic Acids Research 32, Jan 1:2004;32.

[31] Pérez-Rey D, Maojo V, García-Remesal M, Alonso-Calvo R. Biomedical

ontologies in post-genomic information systems. IEEE Fourth Symposium on

Bioinformatics and Bioengineering (BIBE2004). Taichung, Taiwan 2004. p. 207-

214.

[32] García-Remesal M, Maojo V, Billhardt H, Crespo J, Alonso-Calvo R, Pérez-Rey D,

Martin F, Sousa A. ARMEDA II: Supporting Genomic Medicine through the

Integration of Medical and Genetic Databases. IEEE Fourth Symposium

on Bioinformatics and Bioengineering (BIBE2004). Taichung, Taiwan 2004. p.

227-236.

[33] Stevens RD, Goble CA, Bechhofer S. Ontology-based knowledge representation for

bioinformatics. Briefings in Bioinformatics 2000:1(4);398-416.

[34] Wache H, Vögele T, Visser U, Stuckenschmidt H, Schuster G, Neumann H, Hübner S.

Ontology-based integration of information – A survey of existing approaches. Proc.

IJCAI-01 Workshop: Ontologies and Information sharing, Seattle, WA: 2001.

[35] Billhardt H, Crespo J, Maojo V, Martin-Sánchez F, Maté JL. A New Method for

Unifying Heterogeneous Databases. Proceedings of ISMDA 2001, p. 54-61.

[36] Meier W. eXist: an open source native XML database. Web, web services and

database systems — NODe 2002 Web and Database related Workshops, vol. 2593,

LNCS, Erfurt, Germany, 2003.

[37] Karasavvas KA, Baldock R, Burger A. Bioinformatics integration and agent

technology. J Biomed Inform. 2004 Jun;37(3):205-19.
