Top Banner
Journal of Digital Information Management Volume 2 Number 2 June 2004 87 Abstract. Bioinformatics is as a bridge between life science and computer science: computer algorithms are needed to face complexity of biological processes. Bioinformatics applications manage complex biological data accessing distributed heterogeneous databases, and require large computing power. After introducing bioinformatics requirements, we present the architecture of PROTEUS, a Grid-based Problem Solving Environment that integrates ontology and workflow approaches to enhance composition and execution of bioinformatics application on the Grid. A first distributed implementation of PROTEUS on Globus is also described. Keywords: Bioinformatics, Problem Solving Environments, Ontology, Workflow, Grid Reviewed and accepted: 31 Mar. 2004 1. Introduction Research in biological and medical areas, (also known as biomedicine), is always more accurate and requires high performance computer systems and sophisticated software tools to treat large volumes of complex data. Bioinformatics research involves an increasing number of computer scientists designing new algorithms and computational platforms to provide modelling and computing power to biomedical research. Data structures and software tools have been designed to support biomedicine in decoding the entire human genetic information, also known as genome. Today the new challenge is studying the proteome, i.e. the set of proteins encoded by the genome, to define models representing and analyzing the structure of the proteins contained in each cell, and eventually to prevent and cure any possible cell-mutation generating human diseases such that producing cancer-hill cells [2]. However, the high number of possible proteins, as well as the huge number of possible cell-mutations, requires a huge effort in designing software environments able to treat biomedical problems. Moreover, applications often need to access different and heterogeneous data sets, produced by experiments or obtained querying biological databases (e.g. the PDB protein database (16). Bio-informaticians are studying (biomedical-)data models, as well as specialized services and software components for managing biological data. For example, the HUPO (Human Proteome Organization) Proteomics Standards Initiative aims to define standards for data representation in proteomics to facilitate data comparison, exchange and verification [33], while systems as PEDRo introduce and implement systematic approach in proteomic research by modelling, capturing, and disseminating proteomics experimental data [13]. Grid community recognized bioinformatics as an opportunity for distributed high performance computing and collaboration applications [17]. The Life Science Grid Research Group established under the Global Grid Forum [22], believes bioinformatics requirements can be satisfied by Grid services and standards, and is interested in what new services Grids should provide to bioinformatics applications. To face these challenges, some emerging Bioinformatics Grid projects (BioGrids) are appearing. Bio-GRID is developing an access portal for bio-molecular modelling resources[ 14]. Asia Pacific BioGRID is building a customized, self-installing version of the Globus Toolkit[20]. Finally, myGrid is developing open source high-level Grid middleware to support data-intensive bioinformatics on the Grid [31]. Moreover, Grid community is considering workflows for designing applications [34] and ontologies to model semantics of resources and services (e.g., see [21]). We wish to provide an environment allowing biomedical researchers to search and compose bioinformatics software modules for solving biomedical problems. We focus on semantic modelling of the goals and requirements of bioinformatics applications using ontologies, and we employ workflow methodologies and tools for designing, scheduling and controlling bioinformatics applications. Such ideas are combined together using the Problem Solving Environment (PSE) software development approach [18]. We present PROTEUS, a Grid-based PSE allowing modelling, building and executing bioinformatics applications on the Grid. PROTEUS [14] uses ontologies and workflows to describe bioinformatics applications, helping users to build applications without dealing with software and data details. Moreover, Grid infrastructure offers the computational power and security needed by bioinformatics applications. The paper is organized as follows. Section 2 briefly describes main requirements of bioinformatics applications and some related technologies. Section 3 presents the overall architecture of PROTEUS. Section 4 presents a first distributed implementation of PROTEUS on Globus. Section 5 shows how applications are designed and executed with PROTEUS. Finally, Section 6 concludes the paper and outlines future work. 2 . Requirements and Technologies for Bioinformatics Applications Bioinformatics regards the design of advanced algorithms and computational platforms to solve problems in biology and medicine. It also deals with methods for acquiring, storing, retrieving and analyzing biological data obtained by experiments, or by querying databases. Applications in bioinformatics should support: – efficient representation of biological data and databases; – specialized services for data transformation and manipulation such as searching in protein databases, protein structure prediction, biological data mining; – description of the goals and requirements of applications, and expected results; – query and computing optimization, to face large data sets. Bioinformatics applications: (i) are naturally distributed, due to the high number of involved data sets; (ii) require high computing power, due to the large size of data sets and the complexity of basic computations; (iii) may access heterogeneous data, where heterogeneity is in data format, access policy, distribution, etc.; (iv) require a secure software infrastructure, because they could access private data owned by different organizations. Some technologies that can fulfil bioinformatics requirements are: Ontologies to describe semantics of data sources, software components and bioinformatics tasks; Workflow Management Systems to specify in an abstract way complex (distributed) applications, integrating and composing individual simple services; Grid infrastructure, due to its security, distribution, service orientation, and computational power; Problem Solving Environment for the definition and execution of complex applications, hiding software development details. We focus on first two topics. Grid and PSE are not treated here due to space limitations (see [5]). 2.1 Ontologies An ontology is a shared understanding of well defined domains of interest, which is realized as a set of classes (or concepts), properties, Ontology-based Design of Bioinformatics Workflows on PROTEUS Mario Cannataro1, Antonella Guzzo2, Carmela Comito2 and Pierangelo Veltri1 1 University Magna Græcia of Catanzaro, Via T. Campanella 115, Catanzaro, Italy 2 DEIS, University of Calabria, Via P. Bucci 41/c, Rende, Italy {cannataro,veltri}@unicz.it, {guzzo, comito}@si.deis.unical.it
6

Ontology-based Design of Bioinformatics Workflows on PROTEUS

Apr 21, 2023

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Ontology-based Design of Bioinformatics Workflows on PROTEUS

Journal of Digital Information Management Volume 2 Number 2 June 2004 87

Abstract. Bioinformatics is as a bridge between life scienceand computer science: computer algorithms are needed toface complexity of biological processes. Bioinformaticsapplications manage complex biological data accessingdistributed heterogeneous databases, and require largecomputing power. After introducing bioinformaticsrequirements, we present the architecture of PROTEUS, aGrid-based Problem Solving Environment that integratesontology and workflow approaches to enhance compositionand execution of bioinformatics application on the Grid. Afirst distributed implementation of PROTEUS on Globus is alsodescribed.

Keywords: Bioinformatics, Problem Solving Environments, Ontology,Workflow, Grid

Reviewed and accepted: 31 Mar. 2004

1. Introduction

Research in biological and medical areas, (also known asbiomedicine), is always more accurate and requires high performancecomputer systems and sophisticated software tools to treat largevolumes of complex data. Bioinformatics research involves anincreasing number of computer scientists designing new algorithmsand computational platforms to provide modelling and computingpower to biomedical research. Data structures and software tools havebeen designed to support biomedicine in decoding the entire humangenetic information, also known as genome. Today the new challengeis studying the proteome, i.e. the set of proteins encoded by thegenome, to define models representing and analyzing the structureof the proteins contained in each cell, and eventually to prevent andcure any possible cell-mutation generating human diseases such thatproducing cancer-hill cells [2]. However, the high number of possibleproteins, as well as the huge number of possible cell-mutations,requires a huge effort in designing software environments able to treatbiomedical problems. Moreover, applications often need to accessdifferent and heterogeneous data sets, produced by experiments orobtained querying biological databases (e.g. the PDB protein database(16). Bio-informaticians are studying (biomedical-)data models, aswell as specialized services and software components for managingbiological data. For example, the HUPO (Human ProteomeOrganization) Proteomics Standards Initiative aims to define standardsfor data representation in proteomics to facilitate data comparison,exchange and verification [33], while systems as PEDRo introduceand implement systematic approach in proteomic research bymodelling, capturing, and disseminating proteomics experimental data[13]. Grid community recognized bioinformatics as an opportunity fordistributed high performance computing and collaboration applications[17]. The Life Science Grid Research Group established under theGlobal Grid Forum [22], believes bioinformatics requirements can besatisfied by Grid services and standards, and is interested in whatnew services Grids should provide to bioinformatics applications. Toface these challenges, some emerging Bioinformatics Grid projects(BioGrids) are appearing. Bio-GRID is developing an access portalfor bio-molecular modelling resources[ 14]. Asia Pacific BioGRID isbuilding a customized, self-installing version of the Globus Toolkit[20].Finally, myGrid is developing open source high-level Grid middlewareto support data-intensive bioinformatics on the Grid [31]. Moreover,Grid community is considering workflows for designing applications[34] and ontologies to model semantics of resources and services(e.g., see [21]). We wish to provide an environment allowing biomedicalresearchers to search and compose bioinformatics software modules

for solving biomedical problems. We focus on semantic modelling ofthe goals and requirements of bioinformatics applications usingontologies, and we employ workflow methodologies and tools fordesigning, scheduling and controlling bioinformatics applications.Such ideas are combined together using the Problem SolvingEnvironment (PSE) software development approach [18]. We presentPROTEUS, a Grid-based PSE allowing modelling, building andexecuting bioinformatics applications on the Grid. PROTEUS [14] usesontologies and workflows to describe bioinformatics applications,helping users to build applications without dealing with software anddata details. Moreover, Grid infrastructure offers the computationalpower and security needed by bioinformatics applications. The paperis organized as follows. Section 2 briefly describes main requirementsof bioinformatics applications and some related technologies. Section3 presents the overall architecture of PROTEUS. Section 4 presentsa first distributed implementation of PROTEUS on Globus. Section 5shows how applications are designed and executed with PROTEUS.Finally, Section 6 concludes the paper and outlines future work.

2 . Requirements and Technologies for BioinformaticsApplications

Bioinformatics regards the design of advanced algorithms andcomputational platforms to solve problems in biology and medicine.It also deals with methods for acquiring, storing, retrieving andanalyzing biological data obtained by experiments, or by queryingdatabases. Applications in bioinformatics should support:

– efficient representation of biological data and databases;

– specialized services for data transformation and manipulation suchas searching in protein databases, protein structure prediction,biological data mining;

– description of the goals and requirements of applications, andexpected results;

– query and computing optimization, to face large data sets.

Bioinformatics applications: (i) are naturally distributed, due to thehigh number of involved data sets; (ii) require high computing power,due to the large size of data sets and the complexity of basiccomputations; (iii) may access heterogeneous data, whereheterogeneity is in data format, access policy, distribution, etc.; (iv)require a secure software infrastructure, because they could accessprivate data owned by different organizations.

Some technologies that can fulfil bioinformatics requirements are:

– Ontologies to describe semantics of data sources, softwarecomponents and bioinformatics tasks;

– Workflow Management Systems to specify in an abstract waycomplex (distributed) applications, integrating and composingindividual simple services;

– Grid infrastructure, due to its security, distribution, serviceorientation, and computational power;

– Problem Solving Environment for the definition and execution ofcomplex applications, hiding software development details.

We focus on first two topics. Grid and PSE are not treated here dueto space limitations (see [5]).

2.1 Ontologies

An ontology is a shared understanding of well defined domains ofinterest, which is realized as a set of classes (or concepts), properties,

Ontology-based Design of Bioinformatics Workflows on PROTEUS

Mario Cannataro1, Antonella Guzzo2, Carmela Comito2 and Pierangelo Veltri11 University Magna Græcia of Catanzaro, Via T. Campanella 115, Catanzaro, Italy2 DEIS, University of Calabria, Via P. Bucci 41/c, Rende, Italy{cannataro,veltri}@unicz.it, {guzzo, comito}@si.deis.unical.it

Page 2: Ontology-based Design of Bioinformatics Workflows on PROTEUS

Journal of Digital Information Management Volume 2 Number 2 June 2004 88

functions and instances [25]. Additional knowledge can be capturedby logical axioms or rules which derive new facts from the existingones. An inference engine then can draw conclusions based on therules or axioms to create new knowledge and eventually to solveproblems. Ontologies can be used to share common understandingof the structure of information among people or software components[1].

We first experienced ontology development implementing DAMON,a DAML+OIL [10] ontology for the Data Mining domain describingresources and processes of knowledge discovery in databases [4].DAMON can be used to model data mining applications that constitutean important task in the analysis of many bioinformatics data. As anexample, Figure 1 shows a fragment of such ontology modelingclasses and properties related to the Classification Algorithm concept,and properties of the domain (i.e., UsesMethod and PerformsTask).

..<daml:Class

rdf:about=”file:DAMON/#ClassificationAlgorithm”><rdfs:subClassOf><daml:Classrdf:about=”file:DAMON/#Algorithm”/></ rdfs:subClassOf><rdfs:subClassOf><daml:Restriction><daml:onPropertyrdf:resource=”file:DAMON/#UsesMethod”/><daml:toClass><daml:Classrdf:about=”file:DAMON/#ClassificationMethod”/></ daml:toClass></ daml:Restriction></ rdfs:subClassOf><rdfs:subClassOf><daml:Restriction><daml:onPropertyrdf:resource=”file:DAMON/#PerformsTask”/><daml:toClass><daml:Classrdf:about=”file:DAMON/#Classification”/></ daml:toClass></ daml:Restriction></ rdfs:subClassOf></ daml:Class> ...

Figure 1. Modelling of the Classification Algorithm Conceptusing DAML+OIL.

The management of ontology is provided through OnBrowser [6], anontology manager based on the Jena APIs [24], that currently providesbrowsing and querying of DAML+OIL ontologies. Figure 2 depicts ascreenshot of OnBrowser showing the DAMON ontology. In particular,the left panel shows the taxonomy tree, whereas the right panel depictsthe Classification Algorithm concepts described in Figure 1.

The user browses the ontology choosing one of the input point (leftpanel of the frame) representing the taxonomies of the ontology andnavigates visiting the sub tree topics until reaching a concept of inter-est. The concept of interest is shown in the middle of the right panelof the frame and related concepts are displayed around it. The ontol-ogy may be browsed by promoting any of the related concepts to bethe central concept. The new central concept is then linked to all itsrelated concepts.

concepts.

Figure 2. OnBrowser tool showing DAMON

2.2 Workflows

A workflow is a partial or total automation of a business process, inwhich a collection of activities must be executed by some entities(humans or machines), according to certain procedural rules. In thiscontext, Workflow Management Systems (WfMSs) are well establishedtechnological infrastructures, aiming at facilitating the design of anyworkflow, and supporting its enactments, by scheduling differentactivities on available entities. According to the Workflow ManagementCoalition (WfMC) Reference Model [36], the two most relevantcomponents of WfMSs are:

Buildtime component. It is responsible for supporting the definitionof the workflow, by means of some formal description, the most usedof which being the workflow schema, for ensuring its persistentstorage, and for providing additional facilities to simulate workflowexecutions and analyze workflow schema. Specifically, the workflowschema is used for specifying both static and dynamic aspects in aworkflow modelling language, e.g. Web Services Flow Language -WSFL [28] or XML Process Definition Language - XPDL [36] orBusiness Process Execution Language - BPEL [26 ]. It includes twolevel of specification:

(i) control flow level, specifying the dependencies among tasks andtheir execution requirements, through the constructs offered by thelanguage (e.g. sequencing, synchronization, choice, etc); (ii) data flowlevel, specifying the information about processing entities, such asassignment to processing entities and the set of input and outputparameters of each activity.

Runtime component. It consists of a workflow engine (often calledworkflow scheduler) responsible of the enactments, by controlling andcoordinating execution of activities. Moreover, it stores log files aboutworkflow executions which can be queried and analyzed in order tovalidate the workflow schema, identify bottlenecks, etc. Finally, it alsoprovides monitoring tools that keep track of execution progress.

3. PROTEUS: a Grid-Based Problem Solving Environment

Bioinformatics applications require different tools and data sourcesthat usually are not available in a unique software platform. Moreover,composing such basic tools to obtain meaningful applications requiresa deep knowledge of the biomedical domain. Following thesemotivations we designed PROTEUS, a Grid-based Problem SolvingEnvironment [18] for composing and running bioinformaticsapplications on the Grid. We use ontologies for modellingbioinformatics processes and Grid resources, and workflow techniquesfor designing and scheduling bioinformatics applications. Semanticmodelling of Grid resources and workflow-based Grid programmingare emergent trends in Grid community (see [7],[21], and [8]).

Ontologies model bioinformatics knowledge about:

1. biological databases (e.g., Protein Data Bank, [16]);

2. bioinformatics processes (e.g., workflow of “in silico” experiments);

3. bioinformatics tools and software (e.g. EMBOSS [23], BLAST[15]).

The use of standard, off-the-shelf workflow systems will allowconcentrating on the semantic of applications, hiding the details ofGrid allocation and scheduling policies.

.

Page 3: Ontology-based Design of Bioinformatics Workflows on PROTEUS

Journal of Digital Information Management Volume 2 Number 2 June 2004 89

In summary, PROTEUS assists users in:

– formulating bioinformatics solutions, by choosing among differentavailable bioinformatics applications, or by composing a new oneas collection of basic software components;

– running applications on the Grid;

– presenting and analyzing results, and comparing them with pastresults forming the PROTEUS knowledge base.

3.1 Architecture

PROTEUS combines existing open source bioinformatics softwareand public-available biological databases by: i) adding metadata tosoftware; ii)modelling applications through ontologies and workflows,and iii) offering pre-packaged Grid-aware bioinformatics applications.Proteus extends the basic PSE architecture [35], and it is based onthe Knowledge GRID, a system for the design and execution ofdistributed data mining applicationson the Grid [7]. Main componentsof PROTEUS (see Figure 3) are:

Component and Application Library. It contains software tools,databases, user-defined applications which are stored on local filesystem of Grid nodes.

Metadata repository. It contains information about installed softwarecomponents and databases. Such repository is distributed along theGrid nodes. Metadata management is a key issue in distributedsystems and is gaining attention in the Grid community[30].

Figure 3. Proteus Architecture.

Ontology repository. It contains two kinds of ontologies: a domainontology and an application ontology. The former describes andclassifies biological concepts (e.g. proteins), bioinformatics softwaretools (e.g. EMBOSS [23]), or biological data sources (e.g. Swiss-Prot[29]). The latter describes and classifies user-defined bioinformaticsapplications, represented as workflows, and contains past results ofapplication executions, and comments about user experiences. Bothontologies contain references to data stored in the metadata repository.Although ontologies may embed metadata information, we separaterepositories for modularity reasons. Finally, ontologies can bereplicated on each node of the Grid.

Ontology-based Workflow Designer. It allows the design of abioinformatics application as a workflow of distributed components(software and data) selected by searching the PROTEUS ontologiesand combined by exploiting metadata information. It produces anabstract execution plan (workflow schema). It comprises:

–the Ontology-based Assistant, that suggests to the user availableapplications for a given bioinformatics problem/task, or guides theapplication design through a concept-based search of software anddata components. Selected components are composed as workflowsusing graphical interfaces;

– the Workflow User Interface, used with the Ontology-basedAssistant, allows to specify and design applications as workflows. Italso provides visualization facilities for browsing and accessing toexisting workflow schemas to re-use them.

Workflow-model Wrapper. It allows mapping designed workflowsinto a schedulable workflow. Workflows can be specified either by thegraphical interface or by an external tool designer. Such module allowsthe using of different workflow engines. We stress that there existdifferent formalisms to represent workflows such as the BusinessProcess Execution Language (BPEL)[26].

Workflow Engine. It schedules and controls the execution of activities(named enactment, in Figure 3). The scheduling depends on availableGrid nodes and takes into account any execution constraints (e.g., adata source cannot be moved). To optimize application execution orto satisfy constraints, the Workflow Engine may move data or softwarecomponents calling Grid functions, such as those provided by theGlobus Resource Allocation Manager [20].

Workflow Metadata Repository. It contains all the information onthe workflow schema. Information are organized such that staticaspects (e.g., control flow or data used in flow control) are separatedfrom application details (e.g., data used to perform a task). It isorganized as follows:

– WF Schema Metadata represents the static aspects of workflow,i.e., the definition of the control flow describing the precedencerelationships among activities.

– WF Application Metadata, keeps trace of the application statusduring execution. It contains details about the status of Gridnodes: availability, charge, activities status for each node, etc.Such information are provided by the Grid services, e.g. theGlobus Monitoring and Discovery Service [20], or the NetworkWeather Service [37].

Metadata and Ontology management. Metadata in PROTEUS areorganized in a hierarchical schema. At the top layer, ontologiesmodel the bioinformatics applications and needed softwarecomponents, whereas at the bottom layer, the metadata moduleprovides specific metadata about available (i.e., installed)bioinformatics software and data sources. Users are guided byontologies in choosing available software components orapplications (ontology-based application design) [4]. The bottomlayer allows users to access software tools and databases,providing information like installed version, input and output dataformat, parameters, etc. Layers are updated whenever newsoftware tools or data sources are added to the system, or newapplications are developed.

4. A Distributed Implementation of PROTEUS on Globus

Currently, our system architecture allows the user to easily designcomplex workflows, such as those that arise when planning “in silico”experiments. Moreover, it should be noted that ontology integrationis a fundamental approach used to address the challenges associatedwith the possible incompatibilities among biological databases.Exploiting ontologies it is possible to achieve data integration issuesto overcome the semantic heterogeneity among data sources,accessing to inter-related data sources irrespective of their differences.

Finally, mechanisms for browsing and querying ontologies permit tocatch concepts and their underlying relationships, such as the entitiesinvolved in the process and their dependencies. The key elementsimplemented by PROTEUS are a knowledge base modellingcomponents (data sources and software tools described throughontologies and metadata) and complex bioinformatics applications(described through workflows); and a set of tools for composing,executing and controlling such applications on the Grid and formanaging such knowledge base, when new components are addedto the system, or new bioinformatics concepts are introduced.

PROTEUS offers to application programmers and software catalogmanagers the following services: (i) metadata and ontologymanagement; (ii) application composition and execution; and (iii)knowledge base enrichment through result analysis and interpretation.These services are implemented through a set of modules offeringGraphical User Interfaces (GUI), and a set of distributed modulesimplementing the run-time system in a decentralized manner. Suchmodules are organized by preserving both the static distribution of

Page 4: Ontology-based Design of Bioinformatics Workflows on PROTEUS

Journal of Digital Information Management Volume 2 Number 2 June 2004 90

Figure 4. Distributed implementation of PROTEUS on Globus

resources (data, software) and the dynamic execution of the process.In other word, the resources, located in different points of the Grid,are accessed by multiple processes in a transparent way, and in turnseveral processes can be enact on the same or multiple nodes of thenetwork. The distributed implementation of PROTEUS, shown inFigure 4, comprises the elements described in the following, built ontop of the Globus Grid middleware services [20].

The Authoring module comprises both workflow designer andontologies tools that are adapted to the infrastructure (i.e. native or“wrapped” components). In the current implementation of the system,it offers an integrated GUI environment for specifying bioinformaticsapplication as workflow. In particular, the specification includes (i) allthe (predecessor/successor) dependencies between the tasks as wellas (ii) the data objects that are passed among the different tasks and(iii) their ontology definitions provided by querying/browsing theontology. The ontology allows both to enhance problem formulationand to check consistency of application design, by applying constraintsand rules specified by the ontology during components compositions.The ability of translating the workflow specified by the GUI environmentinto an executable workflow specification to be provided to a suitableenactment engine (e.g. Engine Module) is also provided.

Engine and Scheduler module is installed on each Grid node suitablefor workflow activities execution. The cooperation of different Enginemodules allows the execution of the workflow specification. It is in anearly stage of development, and is intended to control workflowexecution by navigating the workflow specification, to submit tasks tospecified Grid nodes by monitoring their status. In making so, theengine has to interpret the workflow specifications passed by theAuthoring Module for next execution at run time and it has to cooperatewith the several engines located in the network for both monitoringand enactment instances. When an engine is started, it examines thetasks that can be resolved and submits them to the local scheduleroffered by the Grid node. Therefore, the engine is responsible forcoordinating the workflow enactment and each task is executed underthe supervision of a local scheduler. Also, using the Grid the enginesexchange information among themselves regarding both the processand resource aspects. The Engine modules manage also the Workflowmetadata. We use Globus functionalities as mechanisms for locatingresources in different point of network, for accessing to multipleprocess in a transparent way, and in turn for enacting them on thesame or multiple nodes of the network. The Globus ResourceAllocation Manager (GRAM) [20] and the Network Weather Service[37] are respectively used for scheduling and monitoring on the Grid.In a first implementation enactment will be implemented using amaster-slave approach, where the master Engine module receivesthe workflow specification by the Authoring module and starts andcoordinates enactment. In the future we will consider to move towards

a fully decentralized model, where each Engine module is responsibleof the enactment of a partition of the workflow graph.

The Ontology Manager module manages the ontology life cycle andprovides (i) a mean to model all the resources involved in acomputation allowing the creation and updating of ontologies, (ii) amean to search and browse in a concept-based way the PROTEUSknowledge base, and finally (iii) a mean to match requests (e.g.analyzing data) with resources (e.g. using a bioinformatics tool),obeying to some constraints (e.g. the tool requires a well defined dataformat). In our first implementation, the creation of ontologies is doneoff-line by using the OilEd editor [3], whereas browsing and searchingis offered by OnBrowser (see Section 2.1). We plan to enhance suchtool providing it with visual editing capabilities. Moreover, the ontologymanager updates ontology whenever new concepts are introduced(e.g. a user-developed application/workflow solves a given problem),or new components, such as software tools or data sources are addedto the system. It should be noted that ontology browsing and queryingfunctions are used in the Authoring Module.

The Metadata Management module(MM Module) is installed oneach Grid node offering bioinformatics components. The cooperationof different Metadata Management modules implements in adecentralized way the management of the Ontology repository andMetadata repository described in Section 3.1. Moreover, such modulesallow to find metadata about components selected in the Authoringmodule through ontology browsing and querying. When the userselects a component in the authoring phase, the ontology returns theURLs of the metadata about all installed versions of the component.Metadata Management modules are in charge to find such metadata,accessing local matadata repositories. Such repositories will be storedusing the Globus Monitoring and Discovery Service (MDS) [20], sothat metadata querying and loading will be conducted using MDSfunctions.

5. Designing Bioinformatics Applications with PROTEUS

The current implementation of PROTEUS contains the DAMONontology and a (yet) rough ontology for the bioinformatics domain.We classify bioinformatics resources by task (e.g. sequence alignment,or similarity search), method, algorithm, software tool (e.g. BLAST, orEMBOSS), data source (e.g. Swiss-Prot, or PDB), and biologicalelement (e.g. Protein, DNA). A taxonomy that specializes each ofthese classification parameters is planned. For example the datasource taxonomy classifies the different databases specifying the kindof biological data stored, the format in which the data is stored, theannotations specifying the biological attributes of the elements, thetype of data source (flat file, relational database, etc), etc. Taxonomiesare linked together via relations or axioms. The ontology can be visitedchoosing a classifying parameters. E.g., exploring the task taxonomy,we can determine what are the available algorithms performing a giventask, which software implement such algorithms, and what biologicaldata sources are involved in the task. The design and execution of anapplication on PROTEUS comprises the following steps.

1. Ontology-based component selection. Search, location andselection of the resources used in applications are conducted on thedomain ontology.

2. Workflow design. Selected components are combined producing awork flow schema that can be translated into a standard language,such as BPEL.

3. Application execution on the Grid. The workflow is scheduled by aworkflow engine on the Grid.

4. Results visualization and storing. After application execution andresult collection, the user can enrich and extend the PROTEUSontology.

In the following we focus on the modelling and designing of applicationworkflow using the Authoring module.

5.1 Building workflows in PROTEUS

In our system, application programmers or catalog managers accessthe system respectively through the Authoring module or the Ontology

User’s Desktop

Engine

ModuleMM

Module

Authoring Module

SoftwareCatalog

WF SchemaEngineModule

Scheduler

Host A

MM

Module

Host B

Scheduler

MM

Module

Host C

ClusterCluster

WF Schema

WF Application

Metadata

OntologyMetadata

Components

Software

Catalog

OntologyMetadata

OntologyManager

Grid InfrastructureGrid Infrastructure

EngineModule

Page 5: Ontology-based Design of Bioinformatics Workflows on PROTEUS

Journal of Digital Information Management Volume 2 Number 2 June 2004 91

Manager, that adopt a client/server interface. Grid nodes containingsuch modules are also said Desktop nodes.

At build time an application programmer interacts with the Authoringmodule, so he/she can specify application workflow, by using theontology facilities and the workflow modelling tool. This modulecommunicates with the workflow metadata repository and retrieves,updates, and stores workflow definitions. When a workflow is specified,it can be submitted to a Engine module for the enactment.

In our architecture, workflow engines are distributed in the Grid andorganized in a hierarchical structure, such that each engine of thehierarchy is associated to workflow domain which can be identifiedby a group of nodes. The distributed nature of the execution is basedon the understanding that each engine is aware of the identity whichpart of workflow must be executed locally and what must be turned atthe other engines. On the termination of the local sub-workflow, thatengine signals the corresponding termination by sending appropriatemessages to its successors. If data flow is defined then the outputdata is written and transferred to other sub-workflows, as specified inthe complex workflow schema. The specified design is stored in theworkflow model repository for reusing purpose and=or submitted toengine module for enactment.

The workflow designer actually available in our system is the EclipseModeling Framework (EMF)[27], an open source environment,comprising graphical utilities to specify workflow as UML activitydiagrams and data object using a UML notation. Our choice hasseveral motivations: first, as OMG standard, UML [32] (as well as allits extensions) is the most widely accepted notation for designingand understanding complex systems; then, it has an intuitive graphicalnotation, and last, but not least, several works have demonstratedthat UML activity diagrams support most of the control flowconstructs[11] and is suitable to model many frequent situations relatedto workflow execution [12]. The EMF tool exports UML specificationinto a BPEL file (i.e. a XML file conform to Business Process ExecutionLanguage [26]) which is actually the standard “de facto” for workflowspecification.

Choosing and defining a language for workflow specification is a criticalissue, because a legacy problem exists when the implementationlanguage of the workflow and the language supported by the buildtime hosting environments are different. Thus, usually additional code,such as a wrapper, is needed to submit the workflow specification toa generic engine. Our approach is to use the same tool to (easily)designing and wrapping automatic specification, in fact in EMF, theUML to BPEL mapping is based on a UML profile which has beendeveloped from IBM Corporation[19]. A full description of the UML-profile and a set of sample UML models are included in[32].

6. Conclusions and Future Work

Novel Bioinformatics applications, and in particular proteomicsapplications, will need semantic modelling and will require largecomputational power offered by Grids. PROTEUS goes along thisdirections, it is a Grid-based Problem Solving Environment forBioinformatics that integrates ontology and workflow technologies.

In this paper we presented a preliminary distributed implementationof PROTEUS on Globus, and we focused on the workflow servicesoffered by the system. Future work will regard the realization of anontology in the proteomics domain, and the full implementation of thesystem. Moreover, we will use PROTEUS for the analysis of massspectrometry data for the early detection of inherited cancer [9].

Acknowledgements

This work has been partially supported by Project “FIRB GRID.IT”funded by MIUR.

References

1. Aguilra, Vincent, Cluet, Sophie, Milo, Tova, Veltri, PierangeloVodislav, Dan (2002). Views in a Large Scale XML Repository. VLDBJournal, 11(3).2. Baldi, P, Brunak, S (1998). Bioinformatics: The Machine LearningApproach. MIT Press.

3. Bechhofer, S, Horrocks, I, Goble, C, Stevens,R (2001). OilEd: areason-able ontology Editor for the SemanticWeb. In ArtificialIntelligence Conference. Springer Verlag.4. Cannataro, M, Comito, C (2003). A DataMining Ontology for GridProgramming. In Workshop on Semantics in Peer-to-Peer and GridComputing (in conj. With WWW2003), Budapest-Hungary, 2003.5. Cannataro, M, Comito, C, Lo Schiavo, F, Veltri, P (2003). Proteus,a Grid based Problem Solving Environment for Bioinformatics:Architecture and Experiments.IEEE Computational IntelligenceBulletin, 3(1). In Print.6.Cannataro, M, Massara, A, Veltri, P (2004). OnBrowser, a FlexibleOntology Manager. Bioinformatics Research Group Tech. Rep. 2004/03, University Magna Graecia of Catanzaro.7. Cannataro, M, Talia, D (2003). KNOWLEDGE GRID An Architecturefor Distributed Knowledge Discovery. Communication of ACM, 46(1).8. Grid Community. Global grid forum. http://www.gridforum.org/9. Cuda, G, Cannataro, M, Quaresima, B, Baudi, F, Casadonte, R,Faniello, M.C, Tagliaferri, P, Veltri, P., Costanzo, F S. Venuta (2003).Proteomic Profiling of Inherited Breast Cancer: Identification ofMolecular Targets for Early Detection, Prognosis and Treatment, andRelated Bioinformatics Tools. In WIRN 2003, LNCS, volume 2859 ofNeural Nets, Vietri sul Mare,. Springer Verlag.10. Daml.org. Daml+oil language. http://www.daml.org/2001/03/daml+oilindex.html11. Dumas, Marlon, . ter Hofstede, Arthur H. M. (2001). UML ActivityDiagrams as a Workflow Specification Language. Lecture Notes inComputer Science, 21855..12. Eshuis, R, Wieringa. R (2002). Verification support for workflowdesign with UML activity graphs. In ICSE02. Springer Verlag..13.Taylor, C.F et al.(2003) A systematic approach to modellingcapturing and disseminating proteomics experimental data. NatureBiotechnology..14. EUROGRID. Biogrid. http://biogrid.icm.edu.pl/.15. NCBI-National Cancer for Biotechnology Information. Blastdatabase. http://www.ncbi.nih.gov/BLAST/.16. Research Collaboratory for Structural Bioinformatics (RCSB).Protein data bank (pdb). http://www.rcsb.org/pdb/.17. Foster, I, Kesselman, C, Tuecke, S (2001). The Anatomy of theGrid: Enabling Scalable Virtual Organizations. SupercomputerApplications, 15(3).18. Gallopoulos, S, Houstis, E.N, Rice, J. (1994). Computer as Thinker/Doer: Problem- Solving Environments for Computational Science. InComputational Science and Engineering. IEEE.19. Gardner, T et al (2003). Profile for Automated Business Processeswith a mapping to the BPEL 1.0. Technical report, IBM alphaWorks.20. The globus project. http://www.globus.org/21. The Semantic Grid. Semanticgrid. http://www.semanticgrid.org/.22. Grid.org. Grid life science group. http://forge.gridforum.org/projects/lsg-rg23. EMBOSS Group. The european molecular biology open softwaresuite. http://www.emboss.org24. EMBOSS Group. Jena apis. http://jena.sourceforge.net25. Gruber, T. R (1993). A translation approach to portable ontologies.Knowledge Acquisition, 5(2):199–220.26. IBM. Business process execution language for web services-bpel4ws. http://www- 106.ibm.com/developerworks/webservices/library/ws-bpel/27.IBM. Emerging technologies toolkit-ibm alphaworks.http://www.alphaworks.ibm.com/tech/ettk/28. IBM. Web services flow language -wsfl. http://www-306.ibm.com/software/solutions/webservices/pdf/WSFL.pdf29. EMBL-European Molecular Biology Laboratory. The swiss-protprotein database. http://www.embl-heidelberg.de/30. Malaika, S (2003). Standards for databases on the grid. ACMSIGMOD Record, 32(3), September.31. University of Manchester. mygrid. http://mygrid.man.ac.uk/32. OMG. Uml- unified modeling language: Extensions for workflowprocess definition. http://www.omg.org/uml/33. Human Proteome Organization. Hupo. http://www.hupo.org/

Page 6: Ontology-based Design of Bioinformatics Workflows on PROTEUS

Journal of Digital Information Management Volume 2 Number 2 June 2004 92

34. Wagstrom, Patrick, Krishnan, Sriram, von Laszewski, Gregor(2000). GSFL : AWorkflow Framework for Grid Services. Firb “grid.it”wp8 working paper, July 2002.35. D. Walker, O. F. Rana, M. Li, M. S.Shields, and Y. Huang. The Software Architecture of a DistributedProblem-Solving Environment. Concurrency: Practice and Experience,12(15).36. WfMC. Workflow management coalition reference model. http://www.wfmc.org/37. Wolski, R, Spring, N, Hayes, J (1999). The network weatherservice: A distributed resource performance forecasting service formetacomputing. FGCS, 15(5-6).