An Agent- and Ontology-based System for Integrating Public Gene, Protein and Disease Databases
R. Alonso-Calvo, V. Maojo, H. Billhardt a, F. Martin-Sanchez b, M. García-Remesal, D. Pérez-Rey
Biomedical Informatics Group, Artificial Intelligence Laboratory, School of Computer Science, Polytechnic University of Madrid, Boadilla del Monte, 28660 Madrid, Spain
a Universidad Rey Juan Carlos. Madrid, Spain
b Medical Bioinformatics Department, Institute of Health Carlos III, Majadahonda. Madrid, Spain
Abstract
In this paper, we describe OntoFusion, a database integration system. This system has been designed to provide unified access to multiple, heterogeneous biological and medical data sources that are publicly available over Internet. Many of these databases do not offer a direct connection, and inquiries must be made via Web forms, returning results as HTML pages. A special module in the OntoFusion system is needed to integrate these public ‘Web-based’ databases. Domain ontologies are used to do this and provide database mapping and unification. We have used the system to integrate seven significant and widely used public biomedical databases: OMIM, PubMed, Enzyme, Prosite and Prosite documentation, PDB, SNP, and InterPro. A case study is detailed in depth, showing system performance. We analyze the system’s architecture and methods and discuss its use as a tool for biomedical researchers.
Keywords
Bioinformatics. Medical Informatics. Heterogeneous databases. Data integration. Genomic databases.
substantial knowledge of the conceptual and physical schema of the DB. On the other hand,
public Web-based DBs are data sources (also usually stored using a DBMS) that can be
accessed by external, often anonymous users over the Internet. Examples of such DBs are
OMIM, SwissProt, Prosite and many others. Public Web-based DBs have HTML-based
interfaces, and all that is needed to query these data sources is a Web browser. Commonly,
a user specifies a query by filling in an HTML form, and the results are presented as HTML
pages, XML files or plain text files. However, public DBs present an enormous diversity of
user interfaces both for query specification and result presentation. This diversity, together
with the fact that the data cannot be accessed directly (only through HTML pages), makes
such DBs harder to integrate. To actually access the data, wrappers have to be created to
act as connecting points between the integration system and the actual data sources. These
wrappers have to translate user queries into HTTP requests and extract the results from the
HTML pages.
There are numerous molecular biology DBs (over 700). Of these, we
selected seven. All the selected DBs fulfill the following criteria: the DB is maintained by a
reference institution, the DB is freely accessible through the Internet, the content of the DB is
relevant in the context of genomic research, and the DB represents a primary resource for the
type of data it stores. In addition, these seven DBs provide a significant, representative view of
the landscape of public biomedical databases, including data on DNA variations, proteins,
metabolism, disease and biomedical literature.
| Database | Created by | Type of data | Purpose | Number of entries |
| --- | --- | --- | --- | --- |
| OMIM, Online Mendelian Inheritance in Man | McKusick, Johns Hopkins University | Human genes and genetic disorders | For use by clinicians, researchers and other professionals or students interested in genetic disorders. | Over 15000 entries |
| Entrez PubMed | National Library of Medicine | Publications and articles | To give access to citations from MEDLINE and other life science journals, including links to full text articles. | Over 15 million citations for biomedical articles dating back to the 1950s |
| ENZYME | The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) | Nomenclature of enzymes | To search recommendations of the International Union of Biochemistry and Molecular Biology (IUBMB) Nomenclature Committee. To find characterized enzymes for which an EC (Enzyme Commission) number has been provided. | Over 4000 entries |
| PROSITE and PROSITE DOCUMENTATION | The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) | Protein families and domains | To find patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. | Over 1200 documentation entries that describe over 1700 different patterns, rules and profiles/matrices |
| PDB, Protein Data Bank | Research Collaboratory for Structural Bioinformatics (RCSB) [the DB is operated by Rutgers University] | 3-D biological macromolecular structure data | To create a single worldwide repository for processing and distributing 3-D biological macromolecular structure data. | Over 27000 entries |
| dbSNP, Single Nucleotide Polymorphism | The National Center for Biotechnology Information | Single Nucleotide Polymorphism | To serve as a central repository for both single base nucleotide substitutions and short deletion and insertion polymorphisms. They could be used by additional laboratories, using the sequence information around the polymorphism and the specific experimental conditions. | Over 1.5 million entries from 27 different organisms |
| InterPro | EMBL-EBI, European Bioinformatics Institute | Protein families, domains and functional sites | To offer identifiable features found in known proteins that can be applied to unknown protein sequences. | 11007 entries, representing 2573 domains, 8166 families, 201 repeats, 26 active sites, 21 binding sites and 20 post-translation modification sites, at the time of writing this paper |
Table 1. Characteristics of seven public Web-based DBs.
3. Methods
In this section we present our approaches to the two fundamental issues in the OntoFusion
system: i) database integration and ii) mechanisms to access public Web-based DBs.
3.1. Database Integration with OntoFusion
The problem of database integration can be subdivided into two subproblems: i) the
technological integration of different data sources, and ii) the conceptual integration of
those sources. OntoFusion addresses the first of these issues by using a multiagent
architecture. Database agents that act as wrappers are used to hide the actual database
access procedures from the rest of the system. Such wrapper agents were created for public
Web-based DBs, as well as for private DBs that are accessible through ODBC or JDBC.
The second issue, the conceptual integration of databases, refers to the need to overcome
data heterogeneity at the schematic level. The data provided by a set of different databases,
each with different database schemas, have to be described through a common conceptual
schema. OntoFusion uses a “hybrid query translation” database integration approach. In
particular, each integrated database is represented by an individual conceptual schema,
which we call a virtual schema. These virtual schemas are generated by means of a mapping
process, in which an administrator assigns the structural elements from databases to
concepts in a domain ontology. Figure 1 shows a schematic representation of this process.
Figure 1. Mapping process in OntoFusion
Elements from the physical database schema are mapped to elements in the domain ontology. A virtual schema for the database is generated from the identified concepts (yellow circles), relationships (green
circles), and attributes (red circles) in the domain ontology.
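The mapping step can be pictured with a small sketch. It is only an illustration under stated assumptions: the ontology is reduced to a flat set of term names, the physical element names (`enzyme_page`, `ec_number`) are hypothetical, and the dictionary representation is ours (OntoFusion itself stores virtual schemas in DAML+OIL).

```python
def build_virtual_schema(ontology_terms, mappings):
    """Build a virtual schema from administrator-supplied mappings.

    `mappings` relates physical schema elements to ontology terms.
    Every mapped term must exist in the domain ontology, so the
    resulting virtual schema is always a subset of that ontology.
    """
    unknown = sorted(t for t in set(mappings.values()) if t not in ontology_terms)
    if unknown:
        raise ValueError(f"terms not in the domain ontology: {unknown}")
    return {"terms": set(mappings.values()), "mappings": dict(mappings)}


# Hypothetical domain ontology: concepts and "Concept.attribute" descriptors.
ONTOLOGY = {
    "Enzyme",
    "Enzyme.Enzyme_ID",
    "Enzyme.Official_Name",
    "Functional_Domain_Documentation",
}

# Hypothetical physical elements of one database, mapped by an administrator.
enzyme_schema = build_virtual_schema(
    ONTOLOGY,
    {"enzyme_page": "Enzyme", "enzyme_page.ec_number": "Enzyme.Enzyme_ID"},
)
```

The stored mappings are kept alongside the schema because, as described later, they are what allows queries on the virtual schema to be translated back into physical terms.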
The purpose of the domain ontology is to provide a common conceptual framework to
which each integrated database is mapped. The system allows the use of several domain
ontologies such that specific ontologies can be used to map databases with data from a
common application domain. Furthermore, it allows domain ontologies to be used in
several ways. Fixed, pre-existing ontologies or controlled vocabulary resources (e.g., the
Gene Ontology (GO) [28], the Human Gene Nomenclature Committee (HGNC)
or the Unified Medical Language System (UMLS) [29][30]) may be used to integrate
biological and clinical DBs [31] [32] [33] [34]. It is also possible to generate domain
ontologies from scratch or to extend predefined ontologies, if necessary, with new concepts
that appear when more databases are integrated. The mapping process ensures that all
structural elements from a database that are reflected in the virtual schema match some
element in the domain ontology used. Thus, different virtual schemas that have been built
with the same domain ontology share the same vocabulary; in fact, each virtual schema is a
subset of the domain ontology used. This is the basis for the next integration step: the
unification of multiple virtual schemas into a virtual unification schema. Such a schema
represents a conceptual description of the data integrated from a set of different databases.
Unification is a completely automatic process that imposes only one constraint: all virtual
schemas to be unified must have been built using the same underlying domain ontology.
The algorithm has been developed by the authors [35]. It strongly relies on the fact that any
semantically identical elements in two different virtual schemas use the same descriptors.
This will be the case if the virtual schemas have been carefully generated using the same
underlying domain ontology. Briefly, the algorithm works as follows. All concepts
appearing in the original virtual schemas are passed to the virtual unification schema. This
way, identical concepts (sharing the same descriptor) or hierarchically related concepts are
unified, i.e., they are represented by a single concept in the new schema. The representative
concept is the most general of a set of hierarchically related concepts. All the attributes of a
concept in the original schemas and all the relationships a concept is involved in are added
to the representative concept in the virtual unification schema. Then, different attributes
and relationships with the same descriptors are unified into single attributes and
relationships, respectively.
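The unification steps described above can be sketched as follows. This is a minimal illustration, not the authors' algorithm [35]: it assumes each virtual schema is a dictionary from concept descriptors to sets of attributes and relationships, that the ontology hierarchy is given as an acyclic child-to-parent map, and that the hierarchy shown (Enzyme as a kind of Protein) is purely hypothetical.

```python
def most_general(concept, parent_of, present):
    """Return the most general ancestor of `concept` (or `concept` itself)
    that occurs in at least one of the schemas being unified."""
    representative = concept
    current = concept
    while current in parent_of:  # assumes the hierarchy has no cycles
        current = parent_of[current]
        if current in present:
            representative = current
    return representative


def unify(schemas, parent_of):
    """Unify virtual schemas that were built from the same domain ontology.

    Each schema maps a concept descriptor to a dict with two sets:
    "attributes" (descriptors) and "relationships" ((name, target) pairs).
    Using sets means elements with the same descriptor collapse into one.
    """
    present = {concept for schema in schemas for concept in schema}
    unified = {}
    for schema in schemas:
        for concept, parts in schema.items():
            rep = most_general(concept, parent_of, present)
            entry = unified.setdefault(
                rep, {"attributes": set(), "relationships": set()}
            )
            entry["attributes"] |= parts["attributes"]
            for name, target in parts["relationships"]:
                entry["relationships"].add(
                    (name, most_general(target, parent_of, present))
                )
    return unified


# Hypothetical schemas and hierarchy, for illustration only.
schema_a = {"Protein": {"attributes": {"Name"},
                        "relationships": {("Related_to", "Enzyme")}}}
schema_b = {"Enzyme": {"attributes": {"Enzyme_ID", "Name"},
                       "relationships": set()}}
unified = unify([schema_a, schema_b], parent_of={"Enzyme": "Protein"})
```

In this example both concepts end up represented by "Protein", which carries the union of the attributes. The sketch relies on the same precondition as the text: semantically identical elements must share a descriptor, otherwise they are not merged.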
Both types of virtual schemas (schemas for single real databases, output by the
mapping process, and schemas generated by the unification of multiple virtual schemas)
can be considered as virtual repositories. In the first case these repositories provide access
to single real DBs, whereas in the latter case they provide integrated access to the data
contained in a set of DBs.
The proposed database integration approach, based on “mapping” and “unification”, allows
hierarchies of virtual repositories to be created. Different sets of virtual schemas (e.g., with
similar data) can be unified and their virtual unification schemas can be unified again.
3.2. Accessing public Web-based DBs
Public Web-based DBs can be represented in the same way as private DBs: through virtual
schemas. However, they entail additional difficulties. First, their physical database schema
is usually not known and cannot be easily obtained and, second, their data cannot be
directly accessed (e.g., using query languages like SQL). Due to this characteristic, public
Web-based DBs require special access mechanisms. Instead of using a special module for
each individual database, OntoFusion uses a generic access module for public Web-based
DBs, which can be configured to give access to different DBs of this kind.
The required configuration information has to be specified as XML files.
To integrate a database, OntoFusion first creates its virtual schema. To do this, the mapping
process, which relates elements and structures from the physical database schema to
concepts in the virtual schema, needs to be completed. When a private DB is going to be
integrated, the elements of its physical schema ([...] keys, etc.) can be extracted automatically. Public Web-based DBs do not offer a direct
connection to their DBMS. Therefore, their physical schemas cannot be obtained
automatically. Instead, these schemas have to be created manually. In particular, an XML
file containing the physical schema has to be constructed by an administrator. This task
calls for an in-depth analysis of each public DB, extracting the concepts, attributes and
relationships that appear in the database’s Web interface.
Queries in public Web-based DBs are specified through Web interfaces and correspond to
URLs. The search arguments may be parameters of such URLs (e.g., in
http://www.ebi.ac.uk/interpro/ISearch?query=IPR000028&mode=ipr) or they may be
passed through HTML forms. There is no unified query language for public Web-based
DBs. This means that the Web interface of each public DB must be analyzed to determine
how the URLs for user queries are built. The attributes that appear in Web forms (their
types and names), as well as other features such as grouping of values, logical
operators, ranges of values and wildcard symbols, must be identified. An XML file
describing this query language has to be created for each public Web-based DB.
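As a rough illustration of how such a description can drive URL construction, the sketch below rebuilds the InterPro example URL from a small configuration. The configuration layout and the field name `InterPro_ID` are assumptions made for this example; in OntoFusion the equivalent information lives in an XML file whose format is not reproduced here.

```python
from urllib.parse import urlencode


def build_query_url(base_url, field_to_param, fixed_params, criteria):
    """Turn search criteria on physical fields into a query URL.

    field_to_param: physical field name -> URL parameter name
    fixed_params:   parameters the Web interface always expects
    criteria:       physical field name -> search value
    """
    params = {}
    for field_name, value in criteria.items():
        params[field_to_param[field_name]] = value
    params.update(fixed_params)
    # urlencode escapes the values and keeps the insertion order.
    return f"{base_url}?{urlencode(params)}"


# Hypothetical configuration for the InterPro search interface.
interpro_url = build_query_url(
    base_url="http://www.ebi.ac.uk/interpro/ISearch",
    field_to_param={"InterPro_ID": "query"},
    fixed_params={"mode": "ipr"},
    criteria={"InterPro_ID": "IPR000028"},
)
```

Features such as logical operators, value ranges and wildcards would need additional configuration entries; they are left out to keep the sketch short.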
Once a query has been issued, the results are presented as HTML pages. For most
biomedical data sources, intermediate results pages, containing a list of objects or
instances that meet the search criteria, are returned. Each entry in such a list corresponds
to a hyperlink to the complete description of an individual result instance. Usually, this list
is ordered and split across several pages, allowing users to inspect all the results very
quickly. To extract the results entries from such pages, the data access module requires
another XML file that describes the precise structure of the intermediate HTML pages of a
public Web-based DB.
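A simplified view of this extraction step is sketched below using Python's standard HTML parser. It assumes that result entries can be recognized by a regular expression on the link target; that pattern stands in for the XML page description used by OntoFusion, and the sample page and URLs are invented for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResultLinkExtractor(HTMLParser):
    """Collect the hyperlinks on an intermediate results page that
    point to individual result instances."""

    def __init__(self, page_url, href_pattern):
        super().__init__()
        self.page_url = page_url
        self.href_pattern = re.compile(href_pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.href_pattern.search(href):
            # Resolve relative links against the page they came from.
            self.links.append(urljoin(self.page_url, href))


def extract_result_links(html, page_url, href_pattern):
    parser = ResultLinkExtractor(page_url, href_pattern)
    parser.feed(html)
    return parser.links


# Invented intermediate page: one navigation link and two result entries.
SAMPLE_PAGE = """
<html><body>
  <a href="/help">Help</a>
  <ul>
    <li><a href="/entry?id=PDOC00001">PDOC00001</a></li>
    <li><a href="/entry?id=PDOC00002">PDOC00002</a></li>
  </ul>
</body></html>
"""
result_links = extract_result_links(
    SAMPLE_PAGE, "http://example.org/search?q=fever", r"^/entry\?id="
)
```

Following the links to further result pages (pagination) would reuse the same mechanism with a second pattern for the "next page" link.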
Finally, to get the descriptions of an individual result instance, the system needs to access
the respective HTML page (through the hyperlink extracted in the intermediate pages).
Again, the presentation of results is different for each Web-based DB. Concepts and
attributes can be presented as hyperlinks, plain text, tables or even as images. OntoFusion
parses the results pages and extracts the requested data. To do this, the system needs to
know where the pertinent data is located inside the HTML results pages. Again, this
information has to be specified in an XML file that describes the structure of the results
pages and the precise location of each data item. Some public DBs are able to return
the final results pages as formatted text or even as XML files. In these cases, OntoFusion is
able to extract the data they contain too. The XML configuration files describing such results
pages (text files or XML files) are easier to create than those for HTML results pages, because
the results are better structured.
In summary, four XML files are needed to integrate a public Web-based DB: i) a file
containing the identified physical schema, ii) a file describing the database’s query system,
iii) a file describing the structure of intermediate results pages, and iv) a file describing the
structure of the pages containing the individual results. The first of these files is used in the
mapping process, the second to translate and execute queries, and the last two provide the
information needed to extract results. All of these files have to be created manually, which
requires a detailed analysis of the Web interface of the database that is going to be
integrated.
4. System Overview
4.1. System Architecture
OntoFusion uses a multiagent architecture based on the JADE multiagent platform. This
makes it possible to execute different parts of the system on different machines. Figure 2
presents the four principal system modules: i) graphical interface, ii) vocabulary server
module, iii) mediator module, and iv) DB access module.
Figure 2. Schematic representation of the OntoFusion system. The system contains four main modules (user interface, vocabulary server module, mediator module, and DB access module) that interact with each other. The interaction between the different modules is carried out through a multiagent platform. Discontinuous arrows represent the use of external resources.
The mediator module is the core system module. It is responsible for querying and
accessing virtual repositories. Each virtual repository (i.e., each virtual schema obtained
through the mapping or unification processes) is assigned to an individual agent. We
consider agents from a software engineering perspective, i.e., as independent and
autonomous software components that carry out special tasks and can use the services of
other agents. Within OntoFusion, agents provide transparent access to the virtual
repositories. They play two fundamental roles: i) they are able to execute queries issued to
their repository and return the retrieved results, and ii) they can provide their virtual
schema to other agents. Virtual schemas are stored using the DAML+OIL ontology
language. RDQL is used as the query language. The results gathered by agents from their
underlying DBs are returned as instances of the virtual schemas, i.e., as DAML+OIL
instance files. The mediator module is able to divide and propagate user queries through the
agent society. It collects and merges the results of these queries and sends them back to the
user interface.
The graphical interface contains the user interface and the administrator console. The user
interface can be accessed through the Web. It presents two fundamental characteristics.
First, it has been created as an ontology navigator, where the users can explore and
navigate through the virtual schemas of the integrated DBs. Second, it presents exactly
what information is accessible at any one time. A search process is usually done as follows.
The user explores the hierarchy of the virtual repositories that are available in the system.
After selecting an appropriate repository, he or she can navigate through its virtual schema.
Then, he or she can select a concept and issue a query in order to retrieve instances of that
concept. Filter criteria for the concept properties are created by filling in a form that
specifies the query. Then, the user interface automatically creates an RDQL query, which is
sent to the respective agent in the mediator module. After the results have been returned,
they are presented to the user as instances of the selected concept. A value-added feature of
OntoFusion is that the user can navigate from the retrieved instances to related instances of
other concepts following the identified relationships among concepts.
The administrator interface is used to start different OntoFusion components. It provides
facilities to monitor the system and is used for other administrative tasks like, for example,
the integration of new DBs through the mapping process or the unification of virtual
schemas with the unification tool. The process of mapping DBs to virtual schemas is
supported by the mapping tool. After selecting a domain ontology, an administrator creates
the virtual schema by selecting from the domain ontology the concepts, concept attributes
and relationships that conceptually appear in the database and mapping them to the
structural elements (tables, fields, relationships) in the database’s physical schema. For
private DBs, the physical schema is automatically produced by the mapping tool. In the
case of public Web-based DBs, it is obtained from the respective XML configuration file.
The mappings are stored and are later used to translate user queries into the database
specific query languages. To unify existing virtual schemas, an administrator simply selects
the schemas to be unified in the unification tool, and the virtual unification schema is
generated automatically.
The vocabulary server maintains all the domain ontologies that have been used to integrate
the DBs. It may also contain other ontologies or controlled vocabularies that may be of use
in a given application domain. These ontologies are used to map and unify DBs.
Furthermore, they can be directly accessed by users, for instance, to refine queries (e.g., by
searching for synonyms, the most used strings, etc.).
The DB access module is in charge of connecting the system to the physical DBs. It
contains the wrappers that translate queries from the intermediate query language (RDQL)
into the query languages of each particular DB. After executing a query and retrieving the
results, these are returned as instances of the DB’s virtual schema. The module contains
two parts, the private DBs module, and the public Web-based DBs module.
4.2. Query Processing
Once an RDQL query has been generated by the user interface, it is sent to the associated
virtual repository (the agent in charge of this repository). For example, when a user asks the
interface for the documents containing the term ‘fever’ in the concept
‘Functional_Domain_Documentation’ of a virtual repository of the Prosite
Documentation database, the generated RDQL query is the following:
SELECT ?Functional_Domain_Documentation.PrositeDoc_ID, ?Functional_Domain_Documentation.Documetation
WHERE (?x, <rdf:type>, <h:Functional_Domain_Documentation>) [...] ?y)
USING rdf for <http://www.w3.org/1999/02/22-rdf-syntax-ns#> , h for http://infomed.dia.fi.upm.es/PrositeDoc_VDB#
Figure 3. An example RDQL query
If this repository is the unification of a set of other virtual repositories, the query is divided
and translated into sub-queries, which are sent to those repositories. The process is repeated
until the queries reach the repositories that directly match the physical DBs. Repositories
for physical DBs translate the queries into the DB specific query language. For the example
above, the generated URL was: http://au.expasy.org/cgi-bin/prosite-search-ful?SEARCH=fever&makeWild=on .
This translation process uses the correspondence file
created in the mapping process to convert the virtual concepts in the RDQL query into
elements from the physical database schema. After executing a query, the retrieved results
are reconverted into instances of the virtual schema. Then, the results are propagated
backwards, as DAML+OIL instance files, until they reach the user interface. During this
process, intermediate repositories merge the results received from different sources. Translation
may also be necessary at this stage. For instance, if the instances from a source
do not provide values for a requested attribute, these values are set to “Without
information”. Figure 3 presents a sample query execution scenario. As can be seen, the
query is propagated from the user interface through the hierarchy of virtual repositories
down to the physical DBs. The results are returned back along the same path.
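The propagation and merging described above can be sketched with two small classes. This is a simplification under stated assumptions: queries are passed down unchanged instead of being divided and translated into sub-queries, instances are plain dictionaries rather than DAML+OIL files, and the class names and sample instances are illustrative only.

```python
MISSING = "Without information"


class PhysicalRepository:
    """Leaf repository: stands in for the wrapper around one physical DB."""

    def __init__(self, instances):
        self.instances = instances  # list of dicts, one per instance

    def query(self, attributes, matches):
        # Return only the requested attributes this source actually provides.
        return [
            {a: inst[a] for a in attributes if a in inst}
            for inst in self.instances
            if matches(inst)
        ]


class UnifiedRepository:
    """Repository produced by unification: forwards the query to its
    children and merges whatever they return."""

    def __init__(self, children):
        self.children = children

    def query(self, attributes, matches):
        merged = []
        for child in self.children:
            for inst in child.query(attributes, matches):
                # Fill in attributes that the source did not provide.
                merged.append({a: inst.get(a, MISSING) for a in attributes})
        return merged


# Illustrative data: two sources, one of which lacks 'Official_Name'.
source_a = PhysicalRepository(
    [{"Enzyme_ID": "1.1.1.1", "Official_Name": "Alcohol dehydrogenase"}]
)
source_b = PhysicalRepository([{"Enzyme_ID": "1.1.1.1"}])
top = UnifiedRepository([UnifiedRepository([source_a]), source_b])

results = top.query(
    ["Enzyme_ID", "Official_Name"],
    lambda inst: "1.1.1.1" in inst.get("Enzyme_ID", ""),
)
```

Because a unified repository only depends on the `query` interface of its children, repositories can be nested to any depth, which mirrors the hierarchies of virtual repositories described in Section 3.1.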
4.3. Components of the public Web-based DB module
Figure 4 shows the components of the public Web-based DB module. This module contains
three principal components: i) the query translator, ii) the result extractor, and iii) the cache
server. The first two of these components constitute the module core, whereas the cache
server provides performance enhancements.
The query translator is the component that translates RDQL queries for the virtual schema
of the database into executable queries for a public Web-based DB. It performs the
translation in two separate steps. First, it translates all the concepts appearing in the query
into terms from the physical database schema using the information stored during the mapping
process. In a second step, it translates the query expressed in RDQL into an appropriate
URL.
Figure 4. Public Web-based DB module
Once a query has been translated into a URL, it is sent to the result extractor. This
component is responsible for executing the queries, retrieving the results and transforming
them into instances of the virtual schemas of public Web-based DBs. First, the intermediate
results page is obtained by issuing an HTTP request for the query URL to the server. The
retrieved page is parsed using the XML file that contains the description of intermediate
pages. The result extractor obtains the list of links (URLs) leading to the complete
description of each individual result instance. Afterwards, each individual result instance is
treated as follows. An HTTP request is issued to get the HTML page with the individual
result description. This page is parsed —using the information from the XML file that
describes these pages— and the relevant information is extracted and converted into an
instance of the DB’s physical schema. Finally, the results are converted into instances of
the DB’s virtual schema and sent as a DAML+OIL file to the agent that originated the
query.
Query execution through the public Web-based DB module is time consuming. This is
because a great many Web pages have to be parsed and analyzed for each query. To
improve the performance of query executions on public Web-based DBs, a cache server
has been implemented to store the results of past queries.
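The idea behind the cache server can be illustrated with a few lines of code. The paper does not say how the cache is keyed or when entries expire, so keying by query URL and never expiring entries are assumptions of this sketch; `fetch` is a hypothetical stand-in for the result extractor.

```python
class QueryCache:
    """Remember the results of past queries, keyed by query URL."""

    def __init__(self, fetch):
        self._fetch = fetch  # function that actually runs the query
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self._store:
            self.hits += 1
        else:
            # Only a miss triggers the slow path of fetching and parsing pages.
            self.misses += 1
            self._store[url] = self._fetch(url)
        return self._store[url]


# Stand-in for the result extractor: records which URLs were really fetched.
fetched_urls = []


def fake_fetch(url):
    fetched_urls.append(url)
    return [f"instance from {url}"]


cache = QueryCache(fake_fetch)
first = cache.get("http://example.org/search?q=fever")
second = cache.get("http://example.org/search?q=fever")
```

A production cache would also need an expiry policy, since the public DBs are updated independently of the integration system.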
The public Web-based DB module is generic for all public DBs. This implies that all it
takes to add a new public DB or modify an existing one is to create or modify the respective
XML configuration files.
5. Case study
In this section, we present a sample search on a virtual repository that was produced by
integrating and unifying seven public Web-based DBs —containing biomedical data—
with OntoFusion. The DBs are: OMIM, PubMed, Enzyme, Prosite and Prosite
documentation, PDB, SNP, and InterPro. These databases were selected in this study
because of their importance to and significance for biomedical research. Although we have
chosen this set, any other public Web-based DB could be added. Their characteristics were
summarized in Table 1. The reason for integrating these databases was twofold. On one
hand, we wanted to evaluate the validity of the system. On the other hand, we considered
that providing unified access to these DBs would bring with it substantial benefits for
researchers in the fields of biology, biomedicine and genomics.
After analyzing these seven databases and mapping them to virtual schemas, we unified the
databases into a single virtual repository. The virtual unification schema of this repository
is presented in Figure 5.
Figure 5 shows all the concepts and attributes integrated in the virtual unification schema,
together with the individual DB to which each belongs. None of the concepts belongs to more than one
database. Thus, the data from different DBs is not actually unified. However, the
unification process establishes links between the different DBs. These links can be used to
relate instances of concepts from one DB to instances of another concept from another DB.
For example, instances of “Enzyme” from the Enzyme DB can be related to instances of
“Functional domain documentation” from the Prosite documentation DB, because the
Enzyme database returns a cross-reference to Prosite. This cross-reference, which is the
access number in Prosite Doc, is mapped into the virtual repository of Enzyme. This
attribute belongs to the “Functional domain documentation” concept, which
is mapped into the virtual repositories of both the Enzyme and Prosite Documentation
databases. By means of the unification process, when we follow the relationship from the
“Enzyme” concept to the “Functional domain documentation” concept, a query is built
automatically. This query uses the cross-reference, and it is launched against both the
Enzyme and Prosite Documentation databases. Thus, using unification, we are able to
navigate through different databases.
Figure 5. Virtual unification schema for public Web-based DBs (see the text for details)
We present a sample search where a user accesses the virtual repository that covers all
seven DBs to retrieve information about a particular enzyme. After entering the graphical
user interface, the hierarchy of virtual repositories is presented to the user. After selecting
the repository he or she wants to access (in our case the repository that covers all seven
DBs), the virtual schema of the selected repository is opened in the ontology navigator.
Figure 6 gives an example. In this case, the virtual schema contains the nine different
concepts that appear in the seven DBs. The next step is to select the concept of interest, e.g.,
“Enzyme”.
When a concept is selected, a form containing all its attributes and relationships is
presented. This is illustrated in Figure 6 for the “Enzyme” concept. The user specifies the
attributes and relationships of interest (by ticking a box) and can enter search criteria for
one or more attributes. In the example, the attributes ‘Enzyme_ID’ and ‘Official_Name’,
and the relationship ‘Related_to.Functional_Domain_Documentation’ have been selected
and the value ‘1.1.1.1’ has been entered into the field for ‘Enzyme_ID’. Thus, we search
for enzymes that contain the string ‘1.1.1.1’ in their identifier.
Figure 6. Performing a search of instances of ‘Enzyme’ containing ‘1.1.1.1’ in its ‘Enzyme_ID’.
After submitting the query, results will be returned from all the DBs that contain the
queried concept (in the example only the Enzyme database). These results are presented in
the user interface as instances of the ‘Enzyme’ concept. In this case, only one enzyme
instance with the identifier 1.1.1.1 has been found. It is presented using a form
similar to the one shown in Figure 6, but where the attributes contain the values found.
Relationships are returned as links. Following such a link, the user can inspect the related
instances of other concepts. The user can use this mechanism in our example to inspect the
‘Functional Domain Documentation’ instances of proteins related to the detected enzyme.
From this information he or she can navigate to the related instances of the ‘Functional
Domain’ concept. This is shown in Figure 7.
As shown in Figure 7, some attributes can contain the ‘No Information’ value. This means
that the DB on which the query was run did not contain these values. This may occur, for
example, for attributes that correspond to links to Web pages with additional information.
Such links may not be provided for all instances.
The presented example shows how a user can get information from different public DBs
using a single interface. In the example, the query has gathered results from two different
public Web-based DBs: Enzyme and Prosite/Prosite Documentation. Thus, OntoFusion
can interconnect databases automatically through relationships among concepts in virtual
unification schemas. Such relationships can be used to navigate among concepts that
belong to different DBs, a feature that is not originally provided by the DBs in question.
However, the virtual schemas of the DBs must be carefully built in order to exploit these
cross-references.
6. Discussion
We have adopted a hybrid query translation approach in OntoFusion. It provides a solution
to the vocabulary problem associated with the multiple virtual conceptual schema approach.
Following this hybrid approach, the conceptual schemas of all the databases to be
integrated are created using terminology borrowed from a domain ontology. This approach
ensures that a given object from one schema and all its semantically equivalent
counterparts from the conceptual schemas of the other databases will have exactly the same
standardized term, facilitating the unification process.
OntoFusion’s query translation approach has a drawback: searching can be
time consuming, because many Web pages have to be parsed during the results
extraction process. The total execution time also depends
on the network bandwidth and the available connection. To reduce this processing time,
OntoFusion includes a caching mechanism. The cache server is used to increase
efficiency and reduce the query execution time.
Figure 7. Navigating through relationships. The top left window shows ‘Functional Domain Documentation’ instances related to the enzyme with identifier ‘1.1.1.1’. The bottom right window shows two instances of ‘Functional Domain’ that are related to the ‘Functional Domain Documentation’ instance.
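The effect of such a cache can be sketched as follows. This is an illustrative in-memory version only: the class `QueryCache`, the time-based expiry and the function `slow_fetch` are assumptions made for the example, and the paper does not describe the cache server's eviction or expiry policy.

```python
import time


class QueryCache:
    """Tiny in-memory cache keyed by (database, query) with time-based expiry.

    The expiry rule is an assumption of this sketch, not a documented
    property of the OntoFusion cache server.
    """

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # (database, query) -> (timestamp, result)

    def get_or_fetch(self, database, query, fetch):
        key = (database, query)
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # fresh entry: skip the remote request entirely
        result = fetch(database, query)
        self._store[key] = (now, result)
        return result


calls = []


def slow_fetch(database, query):
    """Stand-in for submitting a Web form and parsing the returned pages."""
    calls.append((database, query))
    return [f"{database}:{query}"]


cache = QueryCache(ttl_seconds=60.0)
first = cache.get_or_fetch("Enzyme", "1.1.1.1", slow_fetch)
second = cache.get_or_fetch("Enzyme", "1.1.1.1", slow_fetch)
print(len(calls))
```

The second request is served from the cache, so the remote database is contacted only once. The expiry interval trades freshness against speed, which matters because the public DBs are updated periodically.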
One advantage of the query translation approach, as compared to the data translation
approach, is that no centralized repository has to be kept up to date.
OntoFusion also allows users to store the results of a query. All the retrieved results can be
downloaded as DAML+OIL files. Such files could be loaded, for instance, into an XML-based
DB (e.g., the eXist database [36]) to facilitate further studies or to carry out a more
detailed data analysis.
Other systems also use ontologies for integrating distributed biological databases. Some
integrate public databases by downloading them and storing the obtained data locally
(e.g., SEMEDA). Others integrate public databases by developing ad hoc wrappers for
each public database (e.g., TAMBIS). In contrast to TAMBIS, OntoFusion does not use a
single conceptual schema approach. With that approach, the global
conceptualization model has to be modified whenever a database is added to or removed from the set of integrated
DBs. In addition, OntoFusion is able to exploit cross-references between DBs, allowing
users to navigate across them.
It is worth noting that some of the public biomedical data resources mentioned in
this paper are freely available for downloading at their respective sites. Therefore, it would
be possible to download them and then consider them as private data resources located at
our institution. The reason for continuing to consider these databases as public resources is
that the philosophy of our system is distributed rather than centralized, i.e., our system does
not provide a centralized data warehouse containing information from several remote
sources. Instead, we offer middleware that provides users with tools and methods to access
information stored in different databases and located at remote sites. Moreover, our system
does not harvest any data from these databases: it stores nothing except the
results of the most frequent queries, which are kept in the cache server. These
databases are also updated periodically. We have therefore considered it
preferable to access them through their official websites rather than download
them every time they are updated, which ensures that we are always accessing the
latest version of each database. Nevertheless, where a database is available for
download, we can take advantage of this to support the mapping process,
since we can directly use (or, in the worst case, derive) the physical schema of the
public database without having to analyze the structure of its HTML pages.
To query public databases and extract the results, OntoFusion processes the respective
HTML files. Some public databases return their results in an XML format that is easier to
parse than HTML. However, we have adopted the more general approach, since not
every public database provides an XML interface.
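The kind of HTML processing involved can be sketched with Python's standard `html.parser` module. The page layout below (one attribute per table row) and the class name `TableCellExtractor` are hypothetical; real result pages differ from one database to another, and the sketch does not reproduce OntoFusion's own extraction module.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text is kept only while inside a cell; inline markup is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical result page with one attribute/value pair per row.
page = """
<html><body><table>
<tr><td>Identifier</td><td>1.1.1.1</td></tr>
<tr><td>Name</td><td>Alcohol <b>dehydrogenase</b></td></tr>
</table></body></html>
"""

extractor = TableCellExtractor()
extractor.feed(page)
record = {row[0]: row[1] for row in extractor.rows if len(row) == 2}
print(record)
```

An extractor of this kind depends on the layout of each site's pages, which is one reason why an XML interface, where available, is easier to handle.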
OntoFusion integrates not only public Web-based DBs but also private DBs that are stored
on a local host. The integration of private clinical and public genetic databases is similar
to the examples shown in this paper, as long as some information can be shared. In this
sense, the system may also be used to update, enhance or improve the contents of private
DBs by unifying them with more general or more complete public Web-based or other
private DBs. This kind of heterogeneous integration, unifying databases located at different
sites and in different countries, is one of the goals of the European INFOBIOMED Network of
Excellence in Biomedical Informatics, of which the authors are members.
Ontologies are especially well suited for mapping database schemas, and most modern
database integration systems follow ontology-based approaches. Ontologies ease the
understanding of each domain, providing a framework with more semantic expressiveness
than the entity-relationship model. An ontology description language was therefore selected to
store the virtual schemas within OntoFusion instead of other semantic models. XML was
used where user interaction was not expected, i.e., to store the mappings between the
database schemas and the virtual schema. DAML+OIL was employed to store virtual
schemas, since it was the most commonly used language at the time of system design.
Some biomedical vocabulary sources, such as UMLS, GO or HGNC, are not considered
true ontologies. They also pose several technical problems that were addressed within the
INFOGENMED project and are beyond the scope of this paper. GO and HGNC, which were
created independently, are now included within the UMLS. OntoFusion provides a tool for
using domain ontologies based on these vocabulary sources, following ontological
foundations and principles.
A recent review paper [37] has suggested combining agent
technologies and ontologies for biomedical database integration. OntoFusion has been
designed and implemented to test the feasibility of this idea in practice.
7. Conclusions
In this paper, we have presented OntoFusion, a database integration system capable of
offering unified access to remote heterogeneous databases. OntoFusion can be used to
query different databases through a single interface. The system is based on a query
translation integration approach, which represents each database by means of a virtual
schema representing all the concepts contained in the database. Virtual schemas of different
databases can be unified. This unification relies on the use of domain ontologies that
provide the conceptual framework for establishing semantic links among different
databases.
Information search in OntoFusion is supported by a navigation interface: the user can
navigate through the concepts of virtual schemas by following the relationships between
these concepts. A prototype demo of OntoFusion is publicly available on-line at
http://crick.dia.fi.upm.es:8080/Interface
A number of researchers at the “Carlos III” Institute of Health in Spain have recently begun
to use the system to retrieve integrated and updated data in the context of research into rare
genetic diseases. Some of the OntoFusion methods and tools are also being used for
research purposes within the currently active, European Commission-funded
INFOBIOMED Network of Excellence.
At the time of writing, DAML+OIL is used as the ontology language for
representing virtual schemas. We are currently migrating these schemas to OWL, the de
facto standard ontology representation language.
In future research, the agent-based middleware used in OntoFusion could be enhanced
using mobile agents. Agents could move to the servers containing DBs, improving system
performance. GRID technologies and Web services are also being studied to improve the
system’s capabilities. These technologies could allow OntoFusion to run in
different distributed environments, making better use of the available resources.
Acknowledgments
This research has been supported by funding from the EC INFOBIOMED Network of
Excellence, the INFOGENMED project, the INBIOMED project, the Spanish Ministry of
Health and the Spanish Ministry of Education and Science. We want to thank Rachel Elliot
for her editorial assistance.
References
[1] Galperin, M.Y., "The Molecular Biology Database Collection: 2005
update" Nucleic Acids Research, 2005. 33(D4-D25).
[2] INFOGENMED: A Virtual Laboratory for accessing and integrating genetic and
medical information for Health Applications. EC Project IST-2001-39013. 2002-
2004.
[3] Sujansky W. Heterogeneous Database Integration in Biomedicine. Journal of