23/4/11 Ph.D defense 1
Service-oriented architecture for integration of bioinformatic data and applications
Xiaorong Xiang
Department of Computer Science and Engineering
University of Notre Dame
23/4/11 Ph.D defense 2
Contributions
Surveyed research issues and challenges in service-oriented computing (Chapter 2)
Built an SOA-based system for supporting bioinformatics research (Chapter 3)
Explored the deep phylogeny of the plastid with the system (Chapter 4)
Enhanced the system with semantic web technology and a novel approach to reusing workflows (Chapters 5 & 6)
23/4/11 Ph.D defense 3
Outline
Introduction to SOA
MoG project and MoGServ
Ontological data and service representation model
Knowledge and workflow reuse
23/4/11 Ph.D defense 4
[Figure: Service Requester, Service Broker, and Service Provider, connected by Publish, Discovery, and Invoke interactions (numbered steps 1-5) through the service interface]
SOA – an architectural style of distributed computing
Why SOA
  Reusability
  Interoperability
  Security
  Maintenance
  Save cost when integrating applications
Adoption of SOA
  e-Business
  e-Science
  e-Government
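The publish, discover, and invoke interactions between the three SOA roles can be sketched in a few lines. This is a minimal in-memory illustration, not the mechanism used in the thesis: the `ServiceBroker` class and the `reverse_complement` service are hypothetical names, and a real deployment would use a registry such as UDDI and remote calls rather than local functions.

```python
from typing import Callable, Dict


class ServiceBroker:
    """In-memory stand-in for a service registry (the broker role)."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[..., object]] = {}

    def publish(self, name: str, endpoint: Callable[..., object]) -> None:
        # A provider publishes its interface under a name.
        self._registry[name] = endpoint

    def discover(self, name: str) -> Callable[..., object]:
        # A requester looks the service up by name (KeyError if unknown).
        return self._registry[name]


# Provider side: a hypothetical service that reverse-complements DNA.
def reverse_complement(seq: str) -> str:
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


broker = ServiceBroker()
broker.publish("reverse_complement", reverse_complement)

# Requester side: discover through the broker, then invoke the provider.
service = broker.discover("reverse_complement")
result = service("ATGC")
```

The requester only depends on the published name and call signature, which is the loose coupling that makes services reusable across applications.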
23/4/11 Ph.D defense 5
Web services – one realization of SOA
[Figure: web services protocol stack]
Network transport protocols: TCP/IP, HTTP, SMTP, FTP, etc.
Meta language: XML
Services communication: SOAP (Simple Object Access Protocol)
Service publishing & discovery: UDDI (Universal Description, Discovery and Integration)
Services description: WSDL (Web Services Description Language)
Business process execution: BPEL4WS, WFML, WSFL, BizTalk, …
Additional WS* standards: transactions, management, security, …
23/4/11 Ph.D defense 6
SOA research orientations
[Figure: three research orientations (numbered 1-3) that combine Service-oriented Architecture (Web Service) with the Semantic Web and Grid Computing: Semantic Web Service, Open Grid Service Architecture (OGSA), Semantic Grid, and Semantic Grid Service]
P2P technology plays an important role in increasing the scalability and reliability of the service discovery and workflow execution processes
23/4/11 Ph.D defense 7
Figure source: Folker Meyer, “Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade”, CTWatch Quarterly, volume 2, number 3, August 2006
Bioinformatics today
• Rapidly accumulating data: DNA sequences, contigs, expression data, ontologies, annotations, etc.
• Non-standard, independently developed, heterogeneous data sources
• Data sharing, data integration, and security
23/4/11 Ph.D defense 8
SOA in Bioinformatics
Recent exposure of data & analysis tools as services
  Large public databases
  Middleware projects: provide infrastructure to compose, manage, execute, and connect the distributed services
More community efforts needed to provide more shared and reliable services
More demonstration projects needed => best practices, measured utility, feedback to middleware projects, etc.
23/4/11 Ph.D defense 9
Outline
Introduction to SOA
MoG project and MoGServ
Ontological data and service representation model
Knowledge and workflow reuse
23/4/11 Ph.D defense 10
Mother of Green (MoG) project
Biological science
  In collaboration with Prof. Jeanne Romero-Severson, Biological Sciences, University of Notre Dame
  Study the deep phylogeny of the plastid
Computer science
  Provide an environment to support scientists’ investigations
  A case study of using SOA for data and application integration
  A prototype for future research in the service-oriented architecture domain
23/4/11 Ph.D defense 11
MoG project – one motivation
Malaria causes 1.5 - 2.7 million deaths every year
3,000 children under age five die of malaria every day
Plasmodium falciparum (P. falciparum) causes human malaria
Targeted drug design through phylogenomics
  P. falciparum has three genomes: nuclear, mitochondrial, plastid (apicoplast)
  Find the ancestors of the apicoplast to better understand the evolution of the plastid
  Identify genes in the ancestors
  Determine gene function
[Figure: P. falciparum; apicoplast in P. falciparum]
23/4/11 Ph.D defense 12
A typical in-silico investigation
Data-driven research workflow
A: Query complete genome sequences given a taxon
B: Query protein coding genes for each genome sequence
C: Eliminate vector sequences
D: Sequence alignment
E: Phylogenetic analysis
23/4/11 Ph.D defense 13
Challenges (time-consuming manual web-based operations)
Data collection and information gathering
  Rapid accumulation of raw sequence information
  Rate of accumulation is increasing
  Information accumulates faster than analyses finish
  Information in forms not readily accessible
Analysis tool usage
Experimental data recording
Repetitive experiments for scientific discovery
MoGServ System Architecture
[Figure: layered architecture]
Services access client: web interface, applications
MoGServ middle layer: application server, data access services, data analysis services, job manager, job launcher, service/workflow registry, metadata search, local data storage, workflow/SOAP engines
Data/services providers: NCBI, DDBJ, EMBL, others
23/4/11 Ph.D defense 15
Data storage and access services
Local database
  Integrating data from multiple data sources according to scientists' interests
  Supporting repetitive investigations against several subsets of sequences
  Avoiding network traffic and service failures when retrieving data on-the-fly from public data sources
  Accessing the data in the local database through services
23/4/11 Ph.D defense 16
Service and workflow registry
A table-based description with necessary properties
  Text description
  Service location
  Input/output
  Provider
  Version
  Algorithm
  Invocation method
Not intended to support service discovery or composition at the current stage
A repository of services and workflows used by local application developers
23/4/11 Ph.D defense 17
Indexing and querying metadata
Metadata
  Service and workflow description
  Description of sequence data in order to track the origin of the data
  Experimental data: output, input, and intermediate data
Indexing and querying with keywords
  Lucene
  Implemented as services
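Keyword indexing of metadata can be illustrated with a tiny inverted index. This sketch is plain Python standing in for Lucene, not the Lucene API; the `KeywordIndex` class, the document identifiers, and the sample descriptions are all hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    # Lowercase and split on anything that is not a letter, digit, or underscore.
    return re.findall(r"[a-z0-9_]+", text.lower())


class KeywordIndex:
    """Minimal inverted index: term -> set of document ids."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query: str) -> Set[str]:
        # AND semantics: a document must contain every query term.
        terms = tokenize(query)
        if not terms:
            return set()
        hits = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            hits &= self._postings.get(term, set())
        return hits


index = KeywordIndex()
index.add("service:clustalw", "Multiple sequence alignment service")
index.add("workflow:phylo", "Workflow for alignment and phylogenetic analysis")
index.add("set:atp", "ATP gene sequence set retrieved from a public database")

alignment_hits = index.search("alignment")
narrow_hits = index.search("sequence alignment")
```

Such a search is purely syntactic: it matches the words used in a description, not their meaning.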
23/4/11 Ph.D defense 18
Service and workflow enactment
[Figure: enactment flow]
Input: task name, parameters, timer
Job Manager: finds the service/workflow definition in the Service/Workflow Registry using the task name, forms a job description, and returns a job ID as output
Job Launcher: passes the job information to instances of workflow/service engines
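The enactment flow can be sketched as follows, assuming a dictionary-backed registry and a launcher that only records jobs. All class names, the registry entry, and the example.org location are hypothetical, and the timer input shown on the slide is left out for brevity.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical registry: task name -> service/workflow definition.
REGISTRY: Dict[str, Dict[str, str]] = {
    "align": {"kind": "service", "location": "http://example.org/services/align"},
}


@dataclass
class JobDescription:
    job_id: int
    task_name: str
    definition: Dict[str, str]
    parameters: Dict[str, str]


@dataclass
class JobLauncher:
    launched: List[JobDescription] = field(default_factory=list)

    def launch(self, job: JobDescription) -> None:
        # Stand-in for handing the job to a workflow/service engine instance.
        self.launched.append(job)


class JobManager:
    def __init__(self, registry: Dict[str, Dict[str, str]], launcher: JobLauncher) -> None:
        self._registry = registry
        self._launcher = launcher
        self._ids = itertools.count(1)

    def submit(self, task_name: str, parameters: Dict[str, str]) -> int:
        # Find the definition by task name, form a job description,
        # pass it to the launcher, and return the job ID to the caller.
        definition = self._registry[task_name]
        job = JobDescription(next(self._ids), task_name, definition, dict(parameters))
        self._launcher.launch(job)
        return job.job_id


launcher = JobLauncher()
manager = JobManager(REGISTRY, launcher)
job_id = manager.submit("align", {"setid": "42"})
```

Returning the job ID immediately lets the caller poll for results later instead of blocking on a long-running analysis.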
23/4/11 Ph.D defense 19
Implementation
Development and deployment
  J2EE, JSP, XSLT
  Tomcat 5.0.18 / Axis 1_2RC2
Database
  PostgreSQL 8.1
Index and search of metadata
  Apache Lucene library
Service implementation
  Java2WSDL
  Wrap command-line applications with the JLaunch library
Workflow
  Taverna workbench, part of the myGrid project
  Freefluo workflow engine
23/4/11 Ph.D defense 21
Taverna workbench
23/4/11 Ph.D defense 22
A more complex workflow
23/4/11 Ph.D defense 23
Issues with the first prototype
Metadata description
  Solution
    Index-based (keyword syntactic search)
    Captures most properties needed to support end-user requirements
    Supports data provenance
  Limitation
    Similar to most services in the bioinformatics community
    Lack of semantic description (goal => semantic search)
Failure tolerance and recovery
  Solution
    Statically encode alternative services in the workflow to guard against service failure
    Record the status of service and workflow execution in the database for a possible recovery strategy
    Deploy multiple workflow engines to guard against hardware or network failure
  Limitation
    No dynamic service selection (needs more semantic description support) at execution time
    No fine-grained resource management and monitoring
Security
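The idea of statically encoding alternative services for failure tolerance can be sketched as an ordered fallback list. This is an illustration only, not the Taverna mechanism used in the prototype; `invoke_with_alternatives`, `primary_blast`, and `mirror_blast` are hypothetical names.

```python
from typing import Callable, List, Sequence


def invoke_with_alternatives(services: Sequence[Callable[[str], str]], payload: str) -> str:
    """Try each statically listed service in order; return the first success."""
    errors: List[Exception] = []
    for service in services:
        try:
            return service(payload)
        except Exception as exc:  # any failure moves on to the next alternative
            errors.append(exc)
    raise RuntimeError(f"all {len(services)} alternatives failed: {errors}")


# Hypothetical endpoints: the primary one is down, the mirror works.
def primary_blast(seq: str) -> str:
    raise ConnectionError("primary endpoint unavailable")


def mirror_blast(seq: str) -> str:
    return f"hits for {seq}"


result = invoke_with_alternatives([primary_blast, mirror_blast], "ATGC")
```

Because the list is fixed at design time, this approach cannot pick a better service at run time, which is the limitation noted above.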
23/4/11 Ph.D defense 24
Outline
Introduction to SOA
MoG project and MoGServ
Ontological data and service representation model
Knowledge and workflow reuse
23/4/11 Ph.D defense 25
Semantic web
Semantic web vision
  Giving meaning (semantics) to web-based information
  Machine-understandable, such that software agents can autonomously process it
Two standards: OWL & RDF
  The Web Ontology Language (OWL): defines common vocabularies for specifying concepts and the relationships among concepts
  Resource Description Framework (RDF): formal format for encoding web content using defined vocabularies
Semantic web for bioinformatics
  UniProt RDF project
Semantic web for SOA
  Automated service discovery, composition
23/4/11 Ph.D defense 26
Resource Description Framework (RDF)
[Figure: example RDF graph. The resource http://www.nd.edu/~mog has the properties #hasCreator (#gmadey), #hasTextDescription (literal "MoG is a … project"), #hasResearchTopic (#bioinformatics), and #hasFundedBy (#foundation). The resource #gmadey has the properties #hasFullName (literal "Gregory Madey"), #hasTitle (#professor), and #hasPersonalSite (http://www.nd.edu/~gmadey). Legend: literal, resource; "#" stands for the URI that provides the definition of these vocabularies]
A graph model of statements, a set of triples: Predicate (Subject, Object)
Representations: RDF/XML, N-Triples, Turtle
A standard format to connect web information
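To make the triple model concrete, the statements in the example graph can be held as plain (subject, predicate, object) tuples and queried by pattern. This is a sketch using only the standard library, not an RDF toolkit; the triples below are one reading of the slide's figure, and the `match` helper is a hypothetical name.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

MOG = "http://www.nd.edu/~mog"

# Triples as read from the example graph on the slide (assumed reading).
graph: Set[Triple] = {
    (MOG, "#hasCreator", "#gmadey"),
    (MOG, "#hasResearchTopic", "#bioinformatics"),
    ("#gmadey", "#hasFullName", "Gregory Madey"),
    ("#gmadey", "#hasTitle", "#professor"),
}


def match(
    triples: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Set[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return {
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


# Two-step traversal: who created MoG, and what is their full name?
creators = {o for _, _, o in match(graph, s=MOG, p="#hasCreator")}
names = {o for c in creators for _, _, o in match(graph, s=c, p="#hasFullName")}
```

Following an object of one triple as the subject of the next is what lets RDF connect information published in different places.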
23/4/11 Ph.D defense 27
Ontological modules used for semantic description of data, services & workflows
[Figure: three ontological modules, the generic service description ontology (myGrid/Feta model), the service domain ontology (myGrid), and the MoGServ application domain ontology (MoGServ), used by the software components for annotation to describe data, services, and workflows in the RDF store]
23/4/11 Ph.D defense 28
MoGServ Application Domain Ontology
To better track the data origination
To support the automation of workflow creation
To better share the data on the web in the future
Example concepts and properties defined in MoGServ
Property      Domain  Range
invokedby     Job     User
isParentOf    Set     Set
isInstanceOf  Job     Service
hasSetName    Set     XML:String
Ontological modules
Module             Number of concepts  Object properties  Datatype properties
MoGServ            12                  9                  7
myGrid             419                 8
myGrid/Feta model  26                  11                 17
23/4/11 Ph.D defense 29
Sample data annotation – metadata from MoG local database
Displayed by Rdf-Gravity
23/4/11 Ph.D defense 30
Sample service/workflow annotation
Question: Which service has an operation that accepts nucleotide_sequence as a parameter?
Answer: Uri: http://www.ebi.ac.uk …/alignment:blastn_ncbi, OperationName: Run
Displayed by Rdf-Gravity
23/4/11 Ph.D defense 31
Implementation of annotation and query components for data, services & workflows
Sesame 1.2.6 library
  Supports files, RDBMS, SeRQL
[Figure: annotation components with annotation templates (data, service) and query components with query templates, connected to the Sesame RDF store]
SeRQL query:
Select Y, W, X
from {Y} mg:hasOperation {W} mg:inputParameter {X} rdf:type {mog:set}
using namespace
  rdf = <http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
  mg = <http://www.mygrid.org.uk/ontology#>,
  mog = <http://almond.cse.nd.edu:10000/mog#>
Result:
Service: http://almond.cse.nd.edu:10000/axis/services/ClustalW?wsdl
Operation: runClustalWdf
inputParameter: setid
23/4/11 Ph.D defense 32
Limitations
The MoGServ ontology is not complete
  Contains a small portion of the concepts needed for tracking data provenance
Service domain ontology is not complete
  Needs more concepts as more services are published
Challenges of using the semantic web in general
  Ontology creation, never complete
  Data and service annotation: accuracy, efficiency
  Ontology integration
23/4/11 Ph.D defense 33
Outline
Introduction to SOA
MoG project and MoGServ
Ontological data and service representation model
Knowledge and workflow reuse
23/4/11 Ph.D defense 34
Three user-defined workflows from different views
Question: “Are gene genealogies for ATP subunits α, β, γ different?”
Workflow A: defined by a less experienced user using the functional definition of services (tasks: Retrieving, Aligning)
Workflow B: defined by an intermediate user with executable services (services: queryGene, clustalW)
Workflow C: defined by an expert user with two extra executable services to ensure the accurate output of the biological process (services: queryGene, setIds, setFilter, clustalW)
23/4/11 Ph.D defense 35
Limitations of current workflow management systems
Existing workflow management systems and bioinformatics middleware
  Taverna, Kepler, Triana, Pegasus
  Design, execute, monitor, re-run
  Support ad-hoc, semi-automated, and automated service discovery and composition from scratch
Our approach: reuse the verified knowledge and workflows
  Increase the correctness over time
  Provide more accurate guidelines
23/4/11 Ph.D defense 36
Enhanced workflow system
[Figure: components of the enhanced workflow system]
User: creates an abstract workflow using the ontology
Annotator: annotates services using the ontology
Semantics-enabled service registry
Semantics-enabled service discovery: service matchmaking, backed by a DL reasoner
Workflow composer (software agent/experienced users): finds appropriate services and produces a concrete workflow
Workflow execution engine
Data provenance management: collects and manages information about data origination
Knowledge base management and knowledge discovery
23/4/11 Ph.D defense 37
Our hierarchical workflow structure
Abstract workflow (Task A, Task B)
  Encode and convert the high-level definition to a low-level executable one
Concrete workflow (Service A, Service B, Service C, Service D)
  Replace individual services with their optimal alternatives
Optimal workflow (Service A, Service B, Service C’, Service D)
  Invoke a workflow with specific input data and record the data provenance and the performance of services and workflows
Workflow instance (input, Service A, Service B, Service C’, Service D, output)
Pegasus workflow structure
Abstract workflow: FFT filea
Concrete workflow
  Data transfer: move filea from host1://home/filea to host2://home/file1
  /usr/local/bin/fft /home/file1
  Data registration
23/4/11 Ph.D defense 38
Reusable knowledge
Connectivity
  Helps to convert from abstract workflow to concrete workflow
Alternatives and quality-of-service profiles
  Helps to convert from concrete workflow to optimal workflow
Mapping of abstract workflow and concrete workflow
  Helps to choose reusable workflows
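The two conversions can be sketched with small lookup tables standing in for the knowledge base. Everything here is an assumption for illustration: the table contents, the `clustalW_mirror` alternative, and the use of mean response time as the quality-of-service measure are not taken from the thesis.

```python
from typing import Dict, List

# Hypothetical knowledge base entries.
TASK_TO_SERVICE: Dict[str, str] = {"retrieving": "queryGene", "aligning": "clustalW"}
ALTERNATIVES: Dict[str, List[str]] = {"clustalW": ["clustalW", "clustalW_mirror"]}
MEAN_RESPONSE_MS: Dict[str, float] = {
    "queryGene": 120.0,
    "clustalW": 900.0,
    "clustalW_mirror": 450.0,
}


def to_concrete(abstract: List[str]) -> List[str]:
    """Abstract -> concrete: map each task to an executable service."""
    return [TASK_TO_SERVICE[task] for task in abstract]


def to_optimal(concrete: List[str]) -> List[str]:
    """Concrete -> optimal: swap each service for its fastest known alternative."""
    return [
        min(ALTERNATIVES.get(service, [service]), key=MEAN_RESPONSE_MS.__getitem__)
        for service in concrete
    ]


abstract_workflow = ["retrieving", "aligning"]
concrete_workflow = to_concrete(abstract_workflow)
optimal_workflow = to_optimal(concrete_workflow)
```

As more executions are recorded, the quality-of-service table improves, so the same abstract workflow can resolve to better services over time.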
23/4/11 Ph.D defense 39
Connectivity identification (match detection)
Parameter notation: name(data type, semantic type)
Service: QueryLocal
  Operation: createSet
  performTask: mygrid:retrieving
  inputPara: Settype(String, mog:gene), Queryterm(String, null)
  outputPara: Setid(string, mog:geneset)
  useResource: MoG
Service: ClustalW
  Operation: runClustalWdf
  performTask: mygrid:aligning
  inputPara: Setid(String, mog:set), Sequencetype(String, mog:sequence)
  outputPara: filen(string, mygrid:sequence_alignment_report)
  useResource: EBI
Service: FormatConversion
  Operation: convert
  performTask: mygrid:translating
  inputPara: filen(String, mygrid:sequence_alignment_report)
  outputPara: Out(String, mygrid:nexus_paup_format)
  useResource: MoG
Matching rule: operation_ij → operation_mn if there exists a parameter_k that is an output parameter of operation_ij and a parameter_o that is an input parameter of operation_mn such that data type(parameter_o) = data type(parameter_k) and semantic type(parameter_o) = semantic type(parameter_k)
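The matching rule can be written directly as a nested loop over output and input parameters. This is a minimal sketch, not the thesis implementation: the `Parameter` and `Operation` classes are hypothetical, data types are compared case-insensitively (an assumption, since the slide mixes "String" and "string"), and a null semantic type is treated as matching nothing.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Parameter:
    name: str
    data_type: str
    semantic_type: Optional[str]


@dataclass(frozen=True)
class Operation:
    service: str
    name: str
    inputs: Tuple[Parameter, ...]
    outputs: Tuple[Parameter, ...]


def matches(producer: Operation, consumer: Operation) -> List[Tuple[Parameter, Parameter]]:
    """Return (output, input) pairs whose data and semantic types are equal."""
    pairs = []
    for out in producer.outputs:
        for inp in consumer.inputs:
            same_data = out.data_type.lower() == inp.data_type.lower()
            same_semantic = (
                out.semantic_type is not None
                and out.semantic_type == inp.semantic_type
            )
            if same_data and same_semantic:
                pairs.append((out, inp))
    return pairs


# The three operations listed on the slide.
query_local = Operation(
    "QueryLocal", "createSet",
    inputs=(Parameter("Settype", "String", "mog:gene"),
            Parameter("Queryterm", "String", None)),
    outputs=(Parameter("Setid", "string", "mog:geneset"),),
)
clustalw = Operation(
    "ClustalW", "runClustalWdf",
    inputs=(Parameter("Setid", "String", "mog:set"),
            Parameter("Sequencetype", "String", "mog:sequence")),
    outputs=(Parameter("filen", "string", "mygrid:sequence_alignment_report"),),
)
convert = Operation(
    "FormatConversion", "convert",
    inputs=(Parameter("filen", "String", "mygrid:sequence_alignment_report"),),
    outputs=(Parameter("Out", "String", "mygrid:nexus_paup_format"),),
)

align_to_convert = matches(clustalw, convert)
query_to_align = matches(query_local, clustalw)
```

Under strict equality, ClustalW connects to FormatConversion, but QueryLocal does not connect to ClustalW because mog:geneset and mog:set differ. The slides do not say how that pair is reconciled; treating one concept as a subclass of the other would require an ontology lookup that this sketch omits.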
23/4/11 Ph.D defense 40
Need for verified service connectivity
The mismatching problem
[Figure: 2x2 matrix of match detection output (yes/no) against real match (yes/no)]
                       Real match: yes  Real match: no
Match detected: yes    TP               FP
Match detected: no     FN               TN
TP and TN: accurate annotation
FP and FN: inaccurate annotation, lack of semantic annotation, inaccurate reasoning
Examples (labeled FP and TN in the figure)
  GenBankService (out: GenBank record) -X-> Blastp (in: protein sequence)
  DDBJ-XML (out: sequence data record) -X-> NCBI blast (in: sequence data record); formats differ (fasta format vs. self-defined format), so a mediator, adaptor, or shim is needed
Some mismatches can be detected automatically; others may only be detected by experts at design time or after a run
23/4/11 Ph.D defense 41
Connectivity Graph Implementation
[Figure: during the registration process, connectivity is identified automatically from the registry and stored in the knowledge base; the workflow translation / service composition process uses it to refine, update, and decompose the workflow]
connect (service_a, operation_ai, parameter_c, service_b, operation_bi, parameter_d)
identifyConnect (single service, RDF repository)
Search at the syntactic level
  Search path between two nodes
  Search next available service
  Automatic composition based on input, output
Implementation: Dijkstra's shortest path algorithm
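A path search over the connectivity graph can be sketched with Dijkstra's algorithm. The sketch assumes every verified connection has weight 1 (the slides do not state the edge weights), and the three-node graph below is a hypothetical example built from the service names used earlier in the talk.

```python
import heapq
from typing import Dict, List, Optional

Graph = Dict[str, List[str]]  # operation -> operations it can feed


def shortest_path(graph: Graph, source: str, target: str) -> Optional[List[str]]:
    """Dijkstra's algorithm with unit edge weights; None if unreachable."""
    dist: Dict[str, int] = {source: 0}
    prev: Dict[str, str] = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt in graph.get(node, []):
            candidate = d + 1
            if candidate < dist.get(nxt, float("inf")):
                dist[nxt] = candidate
                prev[nxt] = node
                heapq.heappush(heap, (candidate, nxt))
    if target not in dist:
        return None
    # Walk predecessors back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical connectivity graph of verified connections.
connectivity: Graph = {
    "QueryLocal.createSet": ["ClustalW.runClustalWdf"],
    "ClustalW.runClustalWdf": ["FormatConversion.convert"],
    "FormatConversion.convert": [],
}

path = shortest_path(connectivity, "QueryLocal.createSet", "FormatConversion.convert")
no_path = shortest_path(connectivity, "FormatConversion.convert", "QueryLocal.createSet")
```

A found path is a candidate service composition from the first operation's output to the last operation's input.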
23/4/11 Ph.D defense 42
Experiment
Used 418 concepts from the domain ontology for semantic type; defined 10 concepts for data type
Randomly generated service annotations: 1 input, 1 output
Intel Pentium mobile 1.5 GHz
Number of services  Number of matched pairs  Load RDF repository (ms)  Average match detection time per single service (ms)
200                 10                       1547                      12.02
400                 34                       2346                      13.01
600                 84                       2600                      12.31
800                 138                      3015                      12.35
1000                225                      3325                      12.51
Connectivity graph for 1000 services
[Figure: connectivity graph, shown on the right side of the slide]
Number of nodes: 724
Number of arcs: 587
Average path search time: less than 1 ms
Connectivity graph load time: 220 ms
Number of paths by length: length 0 = 724, length 1 = 587, length 2 = 448, length 3 = 281, length 4 = 114, length 5 = 71, length 6 = 28, length 7 = 16, length 8 = 4, length 9 = 2
Conclusion: feasible solution
23/4/11 Ph.D defense 43
Reuse of workflows
Reuse of abstract workflows
Reuse of concrete workflows
Compare structural similarity of two workflows
Implementation: SUBDUE algorithm
[Figure: graph view of a two-task workflow. Nodes: input, output, task, task, query_term, retrieving, aligning, multiple_alignment_report. Edge labels: hasNext, hasInput, hasOutput, performTask, hasParameter]
SUBDUE input format
v 1 input
v 2 output
v 3 task
v 4 task
v 5 query_term
v 6 retrieving
v 7 aligning
v 8 multiple_aligning_report
e 3 4 hasNext
e 3 1 hasInput
e 4 2 hasOutput
e 3 6 performTask
e 4 7 performTask
e 1 5 hasParameter
e 2 8 hasParameter
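The vertex and edge listing shown on this slide can be produced from a workflow graph with a short serializer. The function below is a hypothetical helper that follows only the "v id label" / "e source target label" layout shown on the slide; it does not claim to cover the full SUBDUE file format.

```python
from typing import List, Tuple


def to_subdue(vertices: List[str], edges: List[Tuple[int, int, str]]) -> str:
    """Serialize a labeled graph as 'v <id> <label>' and 'e <src> <dst> <label>' lines.

    Vertex ids are 1-based positions in the vertices list.
    """
    lines = [f"v {i} {label}" for i, label in enumerate(vertices, start=1)]
    for src, dst, label in edges:
        if not (1 <= src <= len(vertices) and 1 <= dst <= len(vertices)):
            raise ValueError(f"edge ({src}, {dst}) refers to an unknown vertex")
        lines.append(f"e {src} {dst} {label}")
    return "\n".join(lines)


# The workflow graph from the slide.
vertices = [
    "input", "output", "task", "task",
    "query_term", "retrieving", "aligning", "multiple_aligning_report",
]
edges = [
    (3, 4, "hasNext"),
    (3, 1, "hasInput"),
    (4, 2, "hasOutput"),
    (3, 6, "performTask"),
    (4, 7, "performTask"),
    (1, 5, "hasParameter"),
    (2, 8, "hasParameter"),
]

text = to_subdue(vertices, edges)
```

Encoding every workflow the same way is what lets a graph-matching tool compare two workflows by structure rather than by name.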
23/4/11 Ph.D defense 44
Pros and cons
Pro
  Increases the correctness of the formed workflow over time
    Avoids incorrect, inaccurate semantic annotations
    Takes advantage of verified knowledge
    Avoids the ontological reasoning process
  Better support for semi-automated and automated service composition over time
  Provides more accurate guidelines to users over time
Con
  The connectivity graph can be big
    Number of parameters
    Number of services
  Searching for the connectivity of a service when it is registered in the system may take a relatively long time
    More complex matching rule
    Number of parameters
  May not have high accuracy at the beginning
23/4/11 Ph.D defense 45
Summary
Described the design and implementation of MoGServ
Explored the ontological representation of data and services
Described a new approach for the reuse of workflows and the connectivity of services
23/4/11 Ph.D defense 46
Future work
Integrate GridSAM into MoGServ for execution and monitoring
Integrate Grid computing technology for resource allocation
Refine the MoGServ application domain ontology
Create an interface for end-user workflow creation
Create an interface for individual workspaces
Evaluate the scalability and accuracy of the connectivity graph approach and the graph matching approach with a large number of real workflows and services
23/4/11 Ph.D defense 47
Acknowledgements
Dr. Madey, Dr. Romero-Severson, Dr. Flynn, Dr. Striegel, Dr. Chaudhary, Dr. Collins, Mr. Eric Morgan, Dr. Jean-Christophe Ducom
Partially supported by the Indiana Center for Insect Genomics (ICIG) with funding from the Indiana 21st Century fund
23/4/11 Ph.D defense 48
Publications
X. Xiang, G. Madey, and J. Romero-Severson, "A Service-oriented Data Integration and Analysis Environment for In-Silico Experiments and Bioinformatics Research", Proceedings of the 40th Annual Hawaii International Conference on System Sciences (CD-ROM), January 3-6, 2007, Computer Society Press.
X. Xiang and G. Madey, "A Semantic Web Services Enabled Web Portal Architecture", IEEE International Conference on Web Services (ICWS 2004), San Diego, July 2004.
X. Xiang and G. Madey, "Improving the reuse of scientific workflows and their by-products", International Conference on Web Services (ICWS 2007). Under review.
X. Xiang and E. L. Morgan, "Exploiting 'Light-weight' Protocols and Open Source Tools to Implement Digital Library Collections and Services", D-Lib Magazine, Volume 11, Number 10, October 2005.
23/4/11 Ph.D defense 49
Publications planned
One journal paper for BMC Bioinformatics
  Chapter 3, Chapter 4, Chapter 5
Future IEEE ICWS proceedings
  Chapter 6
Biology journal – TBD
  Results from using MoGServ
23/4/11 Ph.D defense 50
Thank you
23/4/11 Ph.D defense 51
Summary
A practical demonstration of building an SOA-based system
Applied in a bioinformatics application to study the deep phylogeny
  Easy and rapid extraction of DNA and protein sequences from public databases to a local database, which saves scientists months of repetitive searching, downloading, and data management.
  Painless reformatting of the extracted data for commonly used analytical tools.
  Preliminary data inspection and analysis using these tools within the web-services environment, which permits inspection of many conserved gene candidates and enables the investigator to rapidly determine the suitability of the chosen gene for deep phylogenetic analysis.
  User-specified additions to the local database, which allow users to upload sequences into the local database.
  User-specified additions to the automated queries, which provide a free-text search interface for constructing data sets of interest.
23/4/11 Ph.D defense 52
Ontological modules
Generic service description ontology
  Feta model from myGrid
Service domain ontology
  myGrid bioinformatics ontology
MoG application domain ontology
  Adds more concepts used specifically in the MoG project
  Small portion of concepts and properties
23/4/11 Ph.D defense 53
[Figure: a Service Requester sends a request in XML format over the Internet to a Service Provider, which returns results in XML format. On the provider side, a software agent implements the service. On the requester side, a software agent has knowledge of the service in terms of the service description, not the implementation]
23/4/11 Ph.D defense 54
Adoption of SOA
Why SOA
  Reusability
  Interoperability
  Security
  Maintenance
  Save cost when integrating applications
Application of SOA
  e-Business
  e-Science
  e-Government
23/4/11 Ph.D defense 63
Data and services
Services, workflows
  Data collection from remote databases
  Query local database
  Data analysis tools: blast, clustalw
  Data format conversion: readseq
  Management of data sets and jobs
  Download and upload
Data
  Complete genome sequences
  ATP gene sequences
  Sequence sets
  Saved jobs
23/4/11 Ph.D defense 64
The information gathering problem
• Rapid accumulation of raw sequence information
  ~100 sequenced chloroplast genomes
  ~57 sequenced cyanobacterial genomes
  Rate of accumulation is increasing
  Information accumulates faster than analyses finish
  Information in forms not readily accessible
• Solution
  Semi-automated web services
  “Smart” web services
  Semantic web
23/4/11 Ph.D defense 65
Time consuming manual web-based operations
Data collection Copy & paste!
Analysis tool usage Copy & paste!
Experiment data recording Copy & paste!
Repetitive experiments for scientific discovery Copy & paste!
23/4/11 Ph.D defense 66
MoGServ system architecture
MoGServ interface
  Web interface
  Application interface
MoGServ middle layer
  Data access and storage
  Data and analysis services
  Service and workflow registry
  Indexing and querying metadata
  Service and workflow enactment
Acting in two roles: service requester and service provider
23/4/11 Ph.D defense 67
MoG project
Find the ancestors of the apicoplast to better understand the evolution of the plastid
Identify genes in the ancestors
Determine gene function
Look for these genes in the P. falciparum nucleus
Then study regulatory mechanisms in candidate genes
23/4/11 Ph.D defense 68
Improvement of the system
Use existing domain ontologies in the bioinformatics community to describe services, workflows, and data
Integrate grid computing technologies to address the security and resource allocation issues
Integrate semantic web technology to support end-user workflow creation based on users' knowledge of the scientific domain