Ph.D defense: Service-oriented architecture for integration of bioinformatic data and applications. Xiaorong Xiang, Department of Computer Science and Engineering, University of Notre Dame
Page 1: 4/30/10 Ph.D defense

23/4/11 Ph.D defense 1

Service-oriented architecture for integration of bioinformatic data and applications

Xiaorong Xiang

Department of Computer Science and Engineering

University of Notre Dame

Page 2: 4/30/10 Ph.D defense

Contributions

• Surveyed research issues and challenges in service-oriented computing (Chapter 2)
• Built an SOA-based system to support bioinformatics research (Chapter 3)
• Explored the deep phylogeny of the plastid with the system (Chapter 4)
• Enhanced the system with semantic web technology and a novel approach to reusing workflows (Chapters 5 & 6)

Page 3: 4/30/10 Ph.D defense

Outline

• Introduction to SOA
• MoG project and MoGServ
• Ontological data and service representation model
• Knowledge and workflow reuse

Page 4: 4/30/10 Ph.D defense

SOA – an architectural style of distributed computing

[Figure: a service provider publishes its interface to a service broker; a service requester discovers the service through the broker and then invokes it on the provider.]

Why SOA
• Reusability
• Interoperability
• Security
• Maintenance
• Cost savings when integrating applications

Adoption of SOA
• e-Business
• e-Science
• e-Government
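
The publish, discovery, and invoke interactions between provider, broker, and requester can be sketched in a few lines of Python. This is only a toy, in-process illustration: `ServiceBroker`, its methods, and the `reverse_complement` service are hypothetical names, and a real deployment would use network protocols rather than direct function calls.

```python
from typing import Callable, Dict


class ServiceBroker:
    """Toy in-memory registry standing in for a service broker (hypothetical)."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[[str], str]] = {}

    def publish(self, name: str, endpoint: Callable[[str], str]) -> None:
        # Provider side: advertise a service under a name.
        self._registry[name] = endpoint

    def discover(self, name: str) -> Callable[[str], str]:
        # Requester side: look the service up by name.
        return self._registry[name]


def reverse_complement(seq: str) -> str:
    """Hypothetical provider-side service implementation."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


broker = ServiceBroker()
broker.publish("reverse_complement", reverse_complement)  # publish
service = broker.discover("reverse_complement")           # discovery
result = service("ATGC")                                  # invoke
```

The requester depends only on the published name and call signature, not on the provider's implementation, which is the loose coupling that the reusability and interoperability bullets refer to.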

Page 5: 4/30/10 Ph.D defense

Web services – one realization of SOA

Web services protocol stack:
• Network transport protocols: TCP/IP, HTTP, SMTP, FTP, etc.
• Meta language: XML
• Services communication: SOAP (Simple Object Access Protocol)
• Service publishing & discovery: UDDI (Universal Description, Discovery and Integration)
• Services description: WSDL (Web Services Description Language)
• Business process execution: BPEL4WS, WFML, WSFL, BizTalk, …

Additional WS* standards: transactions, management, security, …

Page 6: 4/30/10 Ph.D defense

SOA research orientations

[Figure: diagram relating the Semantic Web, Grid Computing, and Service-oriented Architecture (Web Service). Their overlaps are labeled Semantic Web Service, Semantic Grid, Open Grid Service Architecture (OGSA), and Semantic Grid Service, with three numbered research orientations.]

P2P technology plays an important role in increasing the scalability and reliability of the service discovery and workflow execution processes.

Page 7: 4/30/10 Ph.D defense

Bioinformatics today

From the article “Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade” by Folker Meyer, CTWatch Quarterly, August 2006, volume 2, number 3

• Rapidly accumulating data: DNA sequences, contigs, expression data, ontologies, annotations, etc.
• Non-standard, independently developed, heterogeneous data sources
• Data sharing, data integration, and security

Page 8: 4/30/10 Ph.D defense

SOA in Bioinformatics

• Recent exposure of data & analysis tools as services
  • Large public databases
  • Middleware projects: provide infrastructure to compose, manage, execute, and connect the distributed services
• MORE community efforts needed to provide more shared and reliable services
• More demonstration projects needed => best practices, measured utility, feedback to middleware projects, etc.

Page 9: 4/30/10 Ph.D defense

Outline

• Introduction to SOA
• MoG project and MoGServ
• Ontological data and service representation model
• Knowledge and workflow reuse

Page 10: 4/30/10 Ph.D defense

Mother of Green (MoG) project

Biological science
• In collaboration with Prof. Jeanne Romero-Severson, Biological Sciences, University of Notre Dame
• Study the deep phylogeny of the plastid

Computer science
• Provide an environment to support scientists’ investigations
• A case study of using SOA for data and application integration
• A prototype for future research in the service-oriented architecture domain

Page 11: 4/30/10 Ph.D defense

MoG project – one motivation

• Malaria causes 1.5 - 2.7 million deaths every year
• 3,000 children under age five die of malaria every day
• Plasmodium falciparum (P. falciparum) causes human malaria
• Targeted drug design through phylogenomics
• P. falciparum has three genomes: nuclear, mitochondrial, plastid (apicoplast)
• Find the ancestors of the apicoplast to better understand the evolution of the plastid
• Identify genes in the ancestors
• Determine gene function

[Figure: P. falciparum; the apicoplast in P. falciparum]

Page 12: 4/30/10 Ph.D defense

A typical in-silico investigation: data-driven research workflow

A: Query complete genome sequences given a taxon
B: Query protein coding genes for each genome sequence
C: Eliminate vector sequences
D: Sequence alignment
E: Phylogenetic analysis

Page 13: 4/30/10 Ph.D defense

Challenges (time-consuming manual web-based operations)

• Data collection and information gathering
  • Rapid accumulation of raw sequence information
  • Rate of accumulation is increasing
  • Information accumulates faster than analyses finish
  • Information in forms not readily accessible
• Analysis tool usage
• Experimental data recording
• Repetitive experiments for scientific discovery

Page 14: 4/30/10 Ph.D defense

MoGServ System Architecture

[Figure: layered architecture.
• Services access clients: web interface, applications
• MoGServ middle layer: application server, data access services, data analysis services, job manager, job launcher, service/workflow registry, metadata search, local data storage, workflow/SOAP engines
• Data/services providers: NCBI, DDBJ, EMBL, others]

Page 15: 4/30/10 Ph.D defense

Data storage and access services

Local database
• Integrating data from multiple data sources according to scientists’ interests
• Supporting repetitive investigations against several subsets of sequences
• Avoiding network traffic and service failure when retrieving data on the fly from public data sources
• Accessing the data in the local database through services

Page 16: 4/30/10 Ph.D defense

Service and workflow registry

• A table-based description with the necessary properties
  • Text description
  • Service location
  • Input/output
  • Provider
  • Version
  • Algorithm
  • Invocation method
• Not intended to support service discovery or composition at the current stage
• A repository of services and workflows used by local application developers
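
A minimal sketch of what one row of such a table-based registry could look like, assuming one record per service or workflow. The field names follow the properties listed on this slide; the `RegistryEntry` class, the `register` helper, and all sample values (including the example.org URL) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RegistryEntry:
    """One registry row; fields mirror the properties listed on the slide."""
    name: str
    text_description: str
    service_location: str
    inputs: List[str]
    outputs: List[str]
    provider: str
    version: str
    algorithm: str
    invocation_method: str


registry: Dict[str, RegistryEntry] = {}


def register(entry: RegistryEntry) -> None:
    # Plain lookup by name: no discovery or composition logic at this stage.
    registry[entry.name] = entry


# Hypothetical sample values, for illustration only.
register(RegistryEntry(
    name="alignSequences",
    text_description="Multiple sequence alignment of a stored sequence set",
    service_location="http://example.org/services/Align?wsdl",
    inputs=["set_id"],
    outputs=["alignment_file"],
    provider="MoG",
    version="1.0",
    algorithm="ClustalW",
    invocation_method="SOAP",
))
```

Keeping the entry flat matches the stated goal: a lookup table for local application developers rather than a basis for automated discovery.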

Page 17: 4/30/10 Ph.D defense

Indexing and querying metadata

Metadata
• Service and workflow descriptions
• Descriptions of sequence data, in order to track the origin of the data
• Experimental data: output, input, and intermediate data

Indexing and querying with keywords
• Lucene
• Implemented as services
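
The idea behind keyword indexing can be illustrated with a tiny inverted index. This is a plain-Python stand-in, not the Lucene API, and the document identifiers and metadata strings below are invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def tokenize(text: str) -> Set[str]:
    # Lower-case and split on non-alphanumeric characters.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each keyword to the set of metadata records that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return records containing every keyword of the query (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


# Hypothetical metadata records: a service, a sequence set, and a saved job.
docs = {
    "service:clustalw": "Multiple sequence alignment service",
    "set:42": "ATP synthase gene sequences from chloroplast genomes",
    "job:7": "Alignment of ATP gene sequences",
}
index = build_index(docs)
hits = search(index, "ATP sequences")
```

Because matching is purely syntactic, "sequence" and "sequences" are different keywords here, which hints at why a keyword index alone cannot provide semantic search.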

Page 18: 4/30/10 Ph.D defense

Service and workflow enactment

[Figure: enactment flow.
• Input: parameters, task name, timer
• Job Manager: finds the service/workflow definition in the service/workflow registry using the task name, forms a job description, and returns a job ID as output
• Job Launcher: receives the job information and runs it on instances of the workflow/service engines]
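
The enactment flow can be sketched as follows, assuming a registry that maps task names to callables. All class and function names are hypothetical, and the timer input from the figure is left out for brevity.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class JobDescription:
    job_id: int
    task_name: str
    parameters: Dict[str, str]
    definition: Callable[..., str]


class JobManager:
    """Looks up the task definition, forms a job description, returns a job ID."""

    def __init__(self, registry: Dict[str, Callable[..., str]]) -> None:
        self._registry = registry
        self._ids = itertools.count(1)
        self.queue: List[JobDescription] = []

    def submit(self, task_name: str, parameters: Dict[str, str]) -> int:
        definition = self._registry[task_name]  # KeyError for unknown tasks
        job = JobDescription(next(self._ids), task_name, parameters, definition)
        self.queue.append(job)
        return job.job_id


class JobLauncher:
    """Stands in for handing the job to a workflow/service engine instance."""

    def run_next(self, manager: JobManager) -> str:
        job = manager.queue.pop(0)
        return job.definition(**job.parameters)


# Hypothetical registry with a single task.
registry = {"align": lambda set_id: f"alignment of set {set_id}"}
manager = JobManager(registry)
job_id = manager.submit("align", {"set_id": "42"})
output = JobLauncher().run_next(manager)
```

Returning the job ID immediately and running the job separately lets a client poll for results instead of blocking on a long analysis.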

Page 19: 4/30/10 Ph.D defense

Implementation

• Development and deployment: J2EE, JSP, XSLT; Tomcat 5.0.18 / Axis 1_2RC2
• Database: PostgreSQL 8.1
• Index and search of metadata: Apache Lucene library
• Service implementation: Java2WSDL; command line applications wrapped with the JLaunch library
• Workflow: Taverna workbench (part of the myGrid project); Freefluo workflow engine

Page 21: 4/30/10 Ph.D defense


Taverna workbench

Page 22: 4/30/10 Ph.D defense


A more complex workflow

Page 23: 4/30/10 Ph.D defense

Issues with the first prototype

Metadata description
• Solution
  • Index-based (keyword syntactic search)
  • Captures most properties needed to support end-user requirements
  • Supports data provenance
• Limitation
  • Similar to most services in the bioinformatics community
  • Lack of semantic description (goal => semantic search)

Failure tolerance and recovery
• Solution
  • Statically encode alternative services in the workflow to guard against service failure
  • Record the status of service and workflow execution in the database for a possible recovery strategy
  • Deploy multiple workflow engines to guard against hardware or network failure
• Limitation
  • No dynamic service selection (needs more semantic description support) at execution time
  • No fine-grained resource management and monitoring

Security
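
For the failure-tolerance item above, statically encoded alternative services amount to trying a fixed, ordered list of candidates and recording each outcome. The sketch below assumes that reading; the function name, the two stand-in services, and the in-memory status log (in place of a database) are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def invoke_with_alternatives(
    services: Sequence[Tuple[str, Callable[[str], str]]],
    payload: str,
    status_log: List[Tuple[str, str]],
) -> str:
    """Try each statically listed service in order and log every outcome."""
    for name, call in services:
        try:
            result = call(payload)
        except Exception as exc:
            # Recorded so that a later recovery strategy can inspect it.
            status_log.append((name, f"failed: {exc}"))
            continue
        status_log.append((name, "ok"))
        return result
    raise RuntimeError("all alternative services failed")


def primary(_: str) -> str:
    # Hypothetical service that is currently unreachable.
    raise ConnectionError("timeout")


def mirror(seq: str) -> str:
    # Hypothetical alternative that offers the same operation.
    return seq.upper()


log: List[Tuple[str, str]] = []
result = invoke_with_alternatives([("primary", primary), ("mirror", mirror)], "atgc", log)
```

The candidate list is fixed when the workflow is written, which is exactly the limitation noted on the slide: nothing here selects a service dynamically at execution time.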

Page 24: 4/30/10 Ph.D defense

Outline

• Introduction to SOA
• MoG project and MoGServ
• Ontological data and service representation model
• Knowledge and workflow reuse

Page 25: 4/30/10 Ph.D defense

Semantic web

• Semantic web vision
  • Giving meaning (semantics) to web-based information
  • Machine-understandable, such that software agents can process it autonomously
• Two standards: OWL & RDF
  • The Web Ontology Language (OWL): defines common vocabularies for specifying concepts and the relationships among concepts
  • Resource Description Framework (RDF): formal format for encoding web content using defined vocabularies
• Semantic web for bioinformatics
  • UniProt RDF project
• Semantic web for SOA
  • Automated service discovery, composition

Page 26: 4/30/10 Ph.D defense

Resource Description Framework (RDF)

• A graph model of statements, a set of triples: Predicate(Subject, Object)
• Representations: RDF/XML, N-Triples, Turtle
• A standard format to connect web information

[Figure: example RDF graph. Legend: nodes are literals or resources; the URI before # provides the definition of these vocabularies.]

Statements in the example graph:
• http://www.nd.edu/~mog  #hasCreator  #gmadey
• http://www.nd.edu/~mog  #hasTextDescription  "MoG is a … project"
• http://www.nd.edu/~mog  #hasResearchTopic  #bioinformatics
• http://www.nd.edu/~mog  #hasFundedBy  #foundation
• #gmadey  #hasFullName  "Gregory Madey"
• #gmadey  #hasTitle  #professor
• #gmadey  #hasPersonalSite  http://www.nd.edu/~gmadey
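
To make the triple model concrete, the sketch below stores a few statements as (subject, predicate, object) tuples and follows two of them. The statements are one reading of the example graph on this slide, and the `match` helper is a hypothetical stand-in for a real RDF library's query facility.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

MOG = "http://www.nd.edu/~mog"

# One reading of the example graph; literals and resources are both plain
# strings in this simplified model.
triples: List[Triple] = [
    (MOG, "#hasCreator", "#gmadey"),
    (MOG, "#hasResearchTopic", "#bioinformatics"),
    ("#gmadey", "#hasFullName", "Gregory Madey"),
    ("#gmadey", "#hasTitle", "#professor"),
]


def match(
    store: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Follow two edges: project -> creator -> full name.
creator = match(triples, s=MOG, p="#hasCreator")[0][2]
full_name = match(triples, s=creator, p="#hasFullName")[0][2]
```

Chaining pattern matches like this is what an RDF query language expresses declaratively as a path through the graph.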

Page 27: 4/30/10 Ph.D defense

Ontological modules used for semantic description of data, services & workflows

[Figure: software components for annotation use three ontological modules to describe data, services, and workflows, and keep the annotations in an RDF store:
• Generic Service Description Ontology (myGrid/Feta model)
• Service Domain Ontology (myGrid)
• MoGServ Application Domain Ontology (MoGServ)]

Page 28: 4/30/10 Ph.D defense

MoGServ Application Domain Ontology

• To better track the origin of data
• To support the automation of workflow creation
• To better share the data on the web in the future

Example concepts and properties defined in MoGServ

Property        Domain    Range
invokedby       Job       User
isParentOf      Set       Set
isInstanceOf    Job       Service
hasSetName      Set       XML:String

Ontological modules

Ontological module    Number of concepts    Object properties    Datatype properties
MoGServ               12                    9                    7
myGrid                419                   8
myGrid/Feta model     26                    11                   17

Page 29: 4/30/10 Ph.D defense


Sample data annotation – metadata from MoG local database

Displayed by Rdf-Gravity

Page 30: 4/30/10 Ph.D defense

Sample service/workflow annotation

Question: Which service has an operation that accepts nucleotide_sequence as a parameter?

Answer:
Uri: http://www.ebi.ac.uk …/alignment:blastn_ncbi
OperationName: Run

Displayed by Rdf-Gravity

Page 31: 4/30/10 Ph.D defense

Implementation of annotation and query components for data, services & workflows

• Sesame 1.2.6 library
• Supports files, RDBMS, SeRQL

[Figure: annotation components use annotation templates (data) and annotation templates (service) to write to the Sesame RDF store; query components use query templates to query it.]

Example SeRQL query:

Select Y, W, X
from {Y} mg:hasOperation {W} mg:inputParameter {X} rdf:type {mog:set}
using namespace
  rdf = <http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
  mg = <http://www.mygrid.org.uk/ontology#>,
  mog = <http://almond.cse.nd.edu:10000/mog#>

Result:
Service: http://almond.cse.nd.edu:10000/axis/services/ClustalW?wsdl
Operation: runClustalWdf
inputParameter: setid

Page 32: 4/30/10 Ph.D defense

Limitations

• The MoGServ ontology is not complete
  • Contains a small portion of the concepts needed for tracking data provenance
• The service domain ontology is not complete
  • Needs more concepts as more services are published
• Challenges of using the semantic web in general
  • Ontology creation is never complete
  • Accuracy and efficiency of data and service annotation
  • Ontology integration

Page 33: 4/30/10 Ph.D defense

Outline

• Introduction to SOA
• MoG project and MoGServ
• Ontological data and service representation model
• Knowledge and workflow reuse

Page 34: 4/30/10 Ph.D defense

Three user-defined workflows from different views

Question: “Are gene genealogies for ATP subunits α, β, γ different?”

• Workflow A, defined by a less experienced user using the functional definition of services: Retrieving → Aligning
• Workflow B, defined by an intermediate user with executable services: queryGene → clustalW
• Workflow C, defined by an expert user with two extra executable services to ensure accurate output of the biological process: queryGene → setIds → setFilter → clustalW

Page 35: 4/30/10 Ph.D defense

Limitations of current workflow management systems

• Existing workflow management systems and bioinformatics middleware
  • Taverna, Kepler, Triana, Pegasus
  • Design, execute, monitor, re-run
  • Support ad-hoc, semi-automated, and automated service discovery and composition from scratch
• Our approach: reuse verified knowledge and workflows
  • Increase correctness over time
  • Provide more accurate guidelines

Page 36: 4/30/10 Ph.D defense

Enhanced workflow system

[Figure: components of the enhanced workflow system.
• User: creates an abstract workflow using the ontology
• Service annotator: annotates services using the ontology for the semantics-enabled service registry
• Semantics-enabled service discovery and service matchmaking, with a DL reasoner: find appropriate services
• Workflow composer (software agent/experienced users): produces the concrete workflow
• Workflow execution engine
• Data provenance management: collects and manages information about data origination
• Knowledge base management and knowledge discovery]

Page 37: 4/30/10 Ph.D defense

Our hierarchical workflow structure

• Abstract workflow: Task A, Task B
• Concrete workflow: Service A, Service B, Service C, Service D. Encode and convert the high-level definition to a low-level executable one.
• Optimal workflow: Service A, Service B, Service C’, Service D. Replace individual services with their optimal alternatives.
• Workflow instance: input, output. Invoke a workflow with specific input data and record the data provenance and the performance of services and workflows.

Pegasus workflow structure

• Abstract workflow: FFT filea
• Concrete workflow:
  • Data transfer: move filea from host1://home/filea to host2://home/file1
  • /usr/local/bin/fft /home/file1
  • Data registration
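
The first three levels of the hierarchical structure can be sketched as two small transformations. This is an illustration under stated assumptions: the task-to-service mapping, the alternative service `clustalW_mirror`, and the response-time figures used as quality-of-service profiles are all hypothetical.

```python
from typing import Dict, List

# Abstract workflow: an ordered list of tasks.
abstract_workflow: List[str] = ["retrieving", "aligning"]

# Hypothetical mapping from tasks to executable services.
task_to_service: Dict[str, str] = {
    "retrieving": "queryGene",
    "aligning": "clustalW",
}

# Hypothetical QoS profiles: average response time in seconds for each
# service and its known alternatives.
alternatives: Dict[str, Dict[str, float]] = {
    "queryGene": {"queryGene": 1.0},
    "clustalW": {"clustalW": 12.0, "clustalW_mirror": 7.5},
}


def to_concrete(tasks: List[str], mapping: Dict[str, str]) -> List[str]:
    """Abstract -> concrete: bind every task to an executable service."""
    return [mapping[task] for task in tasks]


def to_optimal(services: List[str], alts: Dict[str, Dict[str, float]]) -> List[str]:
    """Concrete -> optimal: swap in the alternative with the best profile."""
    return [min(alts[s], key=alts[s].get) for s in services]


concrete_workflow = to_concrete(abstract_workflow, task_to_service)
optimal_workflow = to_optimal(concrete_workflow, alternatives)
```

A workflow instance would then be one execution of `optimal_workflow` with specific input data, with provenance and performance recorded for later reuse.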

Page 38: 4/30/10 Ph.D defense

Reusable knowledge

• Connectivity
  • Helps to convert an abstract workflow into a concrete workflow
• Alternatives and quality-of-service profiles
  • Help to convert a concrete workflow into an optimal workflow
• Mapping between abstract workflows and concrete workflows
  • Helps to choose reusable workflows

Page 39: 4/30/10 Ph.D defense

Connectivity identification (match detection)

Parameter notation: name(data type, semantic type)

Service: QueryLocal
  Operation: createSet
  performTask: mygrid:retrieving
  inputPara: Settype(String, mog:gene), Queryterm(String, null)
  outputPara: Setid(string, mog:geneset)
  useResource: MoG

Service: ClustalW
  Operation: runClustalWdf
  performTask: mygrid:aligning
  inputPara: Setid(String, mog:set), Sequencetype(String, mog:sequence)
  outputPara: filen(string, mygrid:sequence_alignment_report)
  useResource: EBI

Service: FormatConversion
  Operation: convert
  performTask: mygrid:translating
  inputPara: filen(String, mygrid:sequence_alignment_report)
  outputPara: Out(String, mygrid:nexus_paup_format)
  useResource: MoG

Matching rule: operation_ij → operation_mn if there exists a parameter_k that is an output parameter of operation_ij and a parameter_o that is an input parameter of operation_mn such that data type(parameter_o) = data type(parameter_k) and semantic type(parameter_o) = semantic type(parameter_k).
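
The matching rule can be written directly as a predicate over output and input parameters. The sketch below applies it to the three operations on this slide, under two assumptions that the slide does not state: data types are compared case-insensitively ("string" vs. "String"), and a null semantic type never matches. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Parameter:
    name: str
    data_type: str
    semantic_type: Optional[str]


@dataclass(frozen=True)
class Operation:
    name: str
    inputs: Tuple[Parameter, ...]
    outputs: Tuple[Parameter, ...]


def parameters_match(out_param: Parameter, in_param: Parameter) -> bool:
    # Assumption: data types are compared case-insensitively.
    same_data_type = out_param.data_type.lower() == in_param.data_type.lower()
    # Assumption: a missing (null) semantic type never matches.
    same_semantic_type = (
        out_param.semantic_type is not None
        and out_param.semantic_type == in_param.semantic_type
    )
    return same_data_type and same_semantic_type


def connections(src: Operation, dst: Operation) -> List[Tuple[str, str]]:
    """Return (output name, input name) pairs that satisfy the matching rule."""
    return [
        (o.name, i.name)
        for o in src.outputs
        for i in dst.inputs
        if parameters_match(o, i)
    ]


create_set = Operation(
    "createSet",
    inputs=(Parameter("Settype", "String", "mog:gene"),
            Parameter("Queryterm", "String", None)),
    outputs=(Parameter("Setid", "string", "mog:geneset"),),
)
run_clustalw = Operation(
    "runClustalWdf",
    inputs=(Parameter("Setid", "String", "mog:set"),
            Parameter("Sequencetype", "String", "mog:sequence")),
    outputs=(Parameter("filen", "string", "mygrid:sequence_alignment_report"),),
)
convert = Operation(
    "convert",
    inputs=(Parameter("filen", "String", "mygrid:sequence_alignment_report"),),
    outputs=(Parameter("Out", "String", "mygrid:nexus_paup_format"),),
)

clustalw_to_convert = connections(run_clustalw, convert)
create_set_to_clustalw = connections(create_set, run_clustalw)
```

Under strict equality, runClustalWdf → convert is detected, while createSet → runClustalWdf is not, because mog:geneset and mog:set are different identifiers. Whether the two are meant to be treated as compatible (for example through ontological reasoning) is not stated on the slide.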

Page 40: 4/30/10 Ph.D defense

Need for verified service connectivity: the mismatching problem

Match detection output vs. real match:

                          Real match: Yes    Real match: No
Match detected: Yes       TP                 FP
Match detected: No        FN                 TN

• TP, TN: accurate annotation
• FP, FN: inaccurate annotation, lack of semantic annotation, inaccurate reasoning

Examples
• TN: GenBankService (out: GenBank record) does not connect to Blastp (in: protein sequence). Can be detected automatically.
• FP: DDBJ-XML (out: sequence data record) appears to connect to NCBI blast (in: sequence data record), but the formats differ (fasta format vs. self-defined format). May be detected by experts at design time or after a run.
• Mediator, adaptor, shim

Page 41: 4/30/10 Ph.D defense

Connectivity Graph Implementation

[Figure: during the registration process, the connectivity of a service in the registry is identified automatically and stored in the knowledge base; the workflow translation / service composition process uses the knowledge base to refine, update, and decompose the workflow.]

• connect(service_a, operation_ai, parameter_c, service_b, operation_bi, parameter_d)
• identifyConnect(single service, RDF repository)
• Search at the syntactic level
  • Search for a path between two nodes
  • Search for the next available service
  • Automatic composition based on input and output
• Implementation: Dijkstra's shortest path algorithm
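
A path search over the connectivity graph with Dijkstra's algorithm might look like the sketch below. The graph contents are hypothetical: the node names are borrowed from operations mentioned elsewhere in the talk, the edges are invented, and every edge is given unit weight as an assumption.

```python
import heapq
from typing import Dict, List, Optional

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: edge weight}


def shortest_path(graph: Graph, start: str, goal: str) -> Optional[List[str]]:
    """Dijkstra's algorithm; returns the node sequence or None if unreachable."""
    dist: Dict[str, float] = {start: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    return None


# Hypothetical connectivity graph with unit edge weights.
graph: Graph = {
    "createSet": {"setFilter": 1.0, "runClustalWdf": 1.0},
    "setFilter": {"runClustalWdf": 1.0},
    "runClustalWdf": {"convert": 1.0},
    "convert": {},
}
path = shortest_path(graph, "createSet", "convert")
```

With unit weights the result is simply the composition with the fewest services; weights derived from quality-of-service profiles would bias the search toward better-performing chains.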

Page 42: 4/30/10 Ph.D defense

Experiment

• Used 418 concepts from the domain ontology for semantic types; defined 10 concepts for data types
• Randomly generated service annotations: 1 input, 1 output
• Connectivity graph for 1000 services (figure, right side)
• Intel Pentium mobile, 1.5 GHz

Number of services    Number of matched pairs    Load RDF repository (ms)    Average time of match detection per single service (ms)
200                   10                         1547                        12.02
400                   34                         2346                        13.01
600                   84                         2600                        12.31
800                   138                        3015                        12.35
1000                  225                        3325                        12.51

Connectivity graph
Number of nodes: 724
Number of arcs: 587
Average path search time: less than 1 millisecond
Connectivity graph load time: 220 milliseconds

Paths by length: length 0 = 724, length 1 = 587, length 2 = 448, length 3 = 281, length 4 = 114, length 5 = 71, length 6 = 28, length 7 = 16, length 8 = 4, length 9 = 2

Conclusion: feasible solution.

Page 43: 4/30/10 Ph.D defense

Reuse of workflows

• Reuse of abstract workflows
• Reuse of concrete workflows
• Compare the structural similarity of two workflows
• Implementation: SUBDUE algorithm

Graph view

[Figure: two task nodes linked by hasNext. The first task has an input (hasInput) with parameter query_term (hasParameter) and performs retrieving (performTask); the second task has an output (hasOutput) with parameter multiple_alignment_report (hasParameter) and performs aligning (performTask).]

SUBDUE input format

v 1 input
v 2 output
v 3 task
v 4 task
v 5 query_term
v 6 retrieving
v 7 aligning
v 8 multiple_aligning_report
e 3 4 hasNext
e 3 1 hasInput
e 4 2 hasOutput
e 3 6 performTask
e 4 7 performTask
e 1 5 hasParameter
e 2 8 hasParameter
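
Producing the vertex and edge listing shown on this slide from a workflow graph is mechanical. The sketch below mirrors the `v <id> <label>` and `e <from> <to> <label>` lines exactly as they appear here; the `to_subdue` function and the short vertex keys are hypothetical, and the keys exist only because two vertices share the label "task".

```python
from typing import List, Tuple


def to_subdue(
    vertices: List[Tuple[str, str]],
    edges: List[Tuple[str, str, str]],
) -> str:
    """Serialize (key, label) vertices and (from, to, label) edges."""
    ids = {key: number for number, (key, _) in enumerate(vertices, start=1)}
    lines = [f"v {ids[key]} {label}" for key, label in vertices]
    lines += [f"e {ids[src]} {ids[dst]} {label}" for src, dst, label in edges]
    return "\n".join(lines)


# Vertex keys are internal only; labels are the ones on the slide.
vertices = [
    ("in", "input"), ("out", "output"), ("t1", "task"), ("t2", "task"),
    ("qt", "query_term"), ("ret", "retrieving"), ("ali", "aligning"),
    ("rep", "multiple_aligning_report"),
]
edges = [
    ("t1", "t2", "hasNext"), ("t1", "in", "hasInput"), ("t2", "out", "hasOutput"),
    ("t1", "ret", "performTask"), ("t2", "ali", "performTask"),
    ("in", "qt", "hasParameter"), ("out", "rep", "hasParameter"),
]
listing = to_subdue(vertices, edges)
```

Once two workflows are serialized this way, comparing their structure reduces to a graph matching problem over labeled vertices and edges.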

Page 44: 4/30/10 Ph.D defense

Pro and Con

Pro
• Increases the correctness of the formed workflow over time
  • Avoids incorrect, inaccurate semantic annotations
  • Takes advantage of verified knowledge
  • Avoids the ontological reasoning process
• Better support for semi-automated and automated service composition over time
• Provides more accurate guidelines to users over time

Con
• The connectivity graph can be big
  • Number of parameters
  • Number of services
• Searching for the connectivity of a service when it is registered in the system may take a relatively long time
  • More complex matching rule
  • Number of parameters
• May not have high accuracy at the beginning

Page 45: 4/30/10 Ph.D defense

Summary

• Described the design and implementation of MoGServ
• Explored the ontological representation of data and services
• Described a new approach for reuse of workflows and connectivity of services

Page 46: 4/30/10 Ph.D defense

Future work

• Integrate GridSam into MoGServ for execution and monitoring
• Integrate Grid computing technology for resource allocation
• Refine the MoGServ application domain ontology
• Create an interface for end-user workflow creation
• Create an interface for individual workspaces
• Evaluate the scalability and accuracy of the connectivity graph approach and the graph matching approach with a large number of real workflows and services

Page 47: 4/30/10 Ph.D defense

Acknowledgements

• Dr. Madey
• Dr. Romero-Severson
• Dr. Flynn
• Dr. Striegel
• Dr. Chaudhary
• Dr. Collins
• Mr. Eric Morgan
• Dr. Jean-Christophe Ducom

Partially supported by the Indiana Center for Insect Genomics (ICIG) with funding from the Indiana 21st Century fund

Page 48: 4/30/10 Ph.D defense

Publications

• X. Xiang, G. Madey and J. Romero-Severson, “A Service-oriented Data Integration and Analysis Environment for In-Silico Experiments and Bioinformatics Research”, Proceedings of the 40th Annual Hawaii International Conference on System Sciences (CD-ROM), January 3-6 2007, Computer Society Press.
• Xiaorong Xiang and Greg Madey, “A Semantic Web Services Enabled Web Portal Architecture”, IEEE International Conference on Web Services (ICWS 2004), San Diego, July 2004.
• Xiaorong Xiang and Greg Madey, “Improving the reuse of scientific workflows and their by-products”, International Conference on Web Services (ICWS2007). Under review.
• Xiaorong Xiang and Eric Lease Morgan, “Exploiting "Light-weight" Protocols and Open Source Tools to Implement Digital Library Collections and Services”, D-Lib Magazine, October 2005, Volume 11 Number 10.

Page 49: 4/30/10 Ph.D defense

Publications planned

• One journal paper for BMC Bioinformatics: Chapter 3, Chapter 4, Chapter 5
• Future IEEE ICWS proceedings: Chapter 6
• Biology journal (TBD): results from using MoGServ

Page 50: 4/30/10 Ph.D defense


Thank you

Page 51: 4/30/10 Ph.D defense

Summary

• A practical demonstration of building an SOA-based system
• Applied in a bioinformatics application to study deep phylogeny
  • Easy and rapid extraction of DNA and protein sequences from public databases into a local database, which saves scientists months of repetitive searching, downloading, and data management
  • Painless reformatting of the extracted data for commonly used analytical tools
  • Preliminary data inspection and analysis using these tools within the web-services environment, which permits inspection of many conserved gene candidates and enables the investigator to rapidly determine the suitability of a chosen gene for deep phylogenetic analysis
  • User-specified additions to the local database: users can upload sequences into the local database
  • User-specified additions to the automated queries: a free-text search interface for constructing data sets of interest

Page 52: 4/30/10 Ph.D defense

Ontological modules

• Generic service description ontology
  • Feta model from myGrid
• Service domain ontology
  • myGrid bioinformatics ontology
• MoG application domain ontology
  • Adds more concepts used specifically in the MoG project
  • Small portion of concepts and properties

Page 53: 4/30/10 Ph.D defense

[Figure: a service requester and a service provider communicate over the Internet. The requester sends a request in XML format and the provider returns results in XML format. The provider's software agent implements the service; the requester's software agent has knowledge of the service in terms of the service description, not the implementation.]

Page 54: 4/30/10 Ph.D defense

Adoption of SOA

Why SOA
• Reusability
• Interoperability
• Security
• Maintenance
• Cost savings when integrating applications

Application of SOA
• e-Business
• e-Science
• e-Government

Page 63: 4/30/10 Ph.D defense

Data and services

Services, workflows
• Data collection from remote databases
• Query local database
• Data analysis tools: blast, clustalw
• Data format conversion: readseq
• Management of data sets and jobs
• Download and upload

Data
• Complete genome sequences
• ATP gene sequences
• Sequence sets
• Saved jobs

Page 64: 4/30/10 Ph.D defense

The information gathering problem

• Rapid accumulation of raw sequence information
  • ~100 sequenced chloroplast genomes
  • ~57 sequenced cyanobacterial genomes
  • Rate of accumulation is increasing
  • Information accumulates faster than analyses finish
  • Information in forms not readily accessible
• Solution
  • Semi-automated web services
  • “Smart” web services
  • Semantic web

Page 65: 4/30/10 Ph.D defense

Time-consuming manual web-based operations

• Data collection: copy & paste!
• Analysis tool usage: copy & paste!
• Experiment data recording: copy & paste!
• Repetitive experiments for scientific discovery: copy & paste!

Page 66: 4/30/10 Ph.D defense

MoGServ system architecture

• MoGServ interface
  • Web interface
  • Application interface
• MoGServ middle layer
  • Data access storage
  • Data and analysis services
  • Service and workflow registry
  • Indexing and querying metadata
  • Service and workflow enactment
• Acting in two roles: service requester and service provider

Page 67: 4/30/10 Ph.D defense

MoG project

• Find the ancestors of the apicoplast to better understand the evolution of the plastid
• Identify genes in the ancestors
• Determine gene function
• Look for these genes in the P. falciparum nucleus
• Then study regulatory mechanisms in candidate genes

Page 68: 4/30/10 Ph.D defense

Improvement of the system

• Use existing domain ontologies in the bioinformatics community to describe services, workflows, and data
• Integrate grid computing technologies to address security and resource allocation issues
• Integrate semantic web technology to support end-user workflow creation based on users' knowledge of the scientific domain