ASU, 7/25/2003
Scientific Data Management: From Data Integration to Analytical Pipelines
Bertram Ludäscher ([email protected])
Data & Knowledge Systems, San Diego Supercomputer Center, University of California, San Diego
Scientific Data Management: From Data Integration to Analytical Pipelines
Data & Knowledge Systems
San Diego Supercomputer Center
University of California, San Diego
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
• Data Integration Approaches:
– Let’s just share data, e.g., link everything from a web page!
– ... or better, put everything into a relational or XML database
– ... and do remote access using the Grid
– ... or just use Web services!
Some BIRNing Data Integration Questions
• Nice try. But:
– “Find the files where the amygdala was segmented.”
– “Which other structures were segmented in the same files?”
– “Did the volume of any of those structures differ much from normal?”
– “What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?”
Towards Shared Conceptualizations: Data Contextualization via Concept Spaces
Rock Classification Ontology
Composition
Genesis
Fabric
Texture
Some enabling operations on “ontology data”
Composition
Concept expansion:
• what else to look for when asking for ‘Mafic’
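Concept expansion can be sketched as a transitive closure over an is-a table: asking for ‘Mafic’ should also retrieve everything classified below it. The rock terms and the child-to-parent dictionary encoding below are illustrative assumptions, not the actual GEON ontology representation.

```python
# Toy is-a hierarchy (child -> parent); terms are illustrative only.
ISA = {
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Mafic": "Igneous",
    "Granite": "Felsic",
    "Felsic": "Igneous",
}

def expand(concept, isa=ISA):
    """Return the concept plus all of its descendants: what else to
    look for when a user asks for, e.g., 'Mafic'."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in isa.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

print(sorted(expand("Mafic")))  # ['Basalt', 'Gabbro', 'Mafic']
```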
Some enabling operations on “ontology data”
Composition
Generalization:
• finding data that is “like” X and Y
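Generalization can likewise be sketched as finding the most specific common ancestor of two concepts in an is-a hierarchy; the hierarchy below is an invented toy, not the real ontology.

```python
# Toy is-a hierarchy (child -> parent); terms are illustrative only.
ISA = {
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Mafic": "Igneous",
    "Granite": "Felsic",
    "Felsic": "Igneous",
}

def ancestors(concept, isa=ISA):
    """Chain from concept up to the root, most specific first."""
    chain = [concept]
    while concept in isa:
        concept = isa[concept]
        chain.append(concept)
    return chain

def generalize(x, y, isa=ISA):
    """Most specific concept subsuming both x and y: data 'like' X and Y."""
    anc_y = set(ancestors(y, isa))
    for a in ancestors(x, isa):
        if a in anc_y:
            return a
    return None

print(generalize("Basalt", "Gabbro"))   # Mafic
print(generalize("Basalt", "Granite"))  # Igneous
```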
Getting Formal: Source Contextualization & Ontology Refinement in Logic
Distributed Query Processing Challenges: Part I, The Basics
GeoSciences Network
• “Scientific data” (BIRN, GEON, ...): a variant of the data integration problem studied by the database CS community
• Given
– a user query against an integrated view
– view-to-source mappings (GAV/LAV)
– sources with limited access patterns
• Compute a distributed query plan P s.t.
– P has a feasible execution order
– P is optimized wrt. time/space/networking complexity
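The “feasible execution order” requirement can be sketched with binding patterns (“b” = the argument must be bound before the call, “f” = free/output): a plan is feasible iff the source calls can be ordered so every bound position is already available. The source names and patterns below are invented for illustration.

```python
def feasible_order(goals, bound=frozenset()):
    """goals: list of (name, pattern, args). Greedily order goals: a
    goal is callable once every 'b' position refers to an already-bound
    variable; calling it binds all of its arguments."""
    bound, order, pending = set(bound), [], list(goals)
    while pending:
        for g in pending:
            name, pattern, args = g
            if all(a in bound for a, p in zip(args, pattern) if p == "b"):
                order.append(name)
                bound.update(args)   # outputs become bound
                pending.remove(g)
                break
        else:
            return None  # no feasible execution order exists
    return order

# Invented sources: bookInfo needs the ISBN bound; priceLookup needs
# both the ISBN and the store bound.
goals = [("priceLookup", "bbf", ("isbn", "store", "price")),
         ("bookInfo", "bf", ("isbn", "store"))]
print(feasible_order(goals, bound={"isbn"}))  # ['bookInfo', 'priceLookup']
```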
ROADNet: Real-time Observatories, Applications, and Data management Network
• Autonomous field sensors
– Seismic, oceanic, climate, ecological, …, video, audio, …
• RT Data Acquisition:
– ANZA Seismic Network (1981-present): 13 Broadband Stations, 3 Borehole Strong Motion Arrays, 5 Infrasound Stations, 1 Bridge Monitoring System; Kyrgyz Seismic Network (1991-present): 10 Broadband Stations; IRIS PASSCAL Transportable Array (1997-present): 15-60 Broadband and Short Period Stations; IDA Global Seismic Network (~1990-present): 38 Broadband Stations
• High Performance Wireless Research Network (HPWREN)
– High performance backbone network: 45 Mbps duplex point-to-point links, backbone nodes at quality locations, network performance monitors at backbone sites; high speed access links: hard-to-reach areas, typically 45 Mbps or 802.11 radios, point-to-point or point-to-multipoint
• Data Grid Technology (SRB)
– collaborative access to distributed heterogeneous data, single sign-on authentication and seamless authorization, data scaling to Petabytes and 100s of millions of files, data replication, etc.
A P2P Problem from ROADNetA P2P Problem from ROADNet
• Networks of ORBs send each other various data streamsNetworks of ORBs send each other various data streams
• Avoid Avoid actualactual loops in the presence of loops in the presence of virtual virtual loops:loops:– A B C
– A: c1B
– B: c2 C
– C: c3 A
– ...
• Idea: L(c1) Idea: L(c1) L(c2) L(c2) L(c3) … = {} L(c3) … = {}• In the real system: unix regexpsIn the real system: unix regexps
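The emptiness test can be sketched directly. The real ROADNet ORBs match streams with unix regexps; here each forwarding rule’s language is approximated by an explicit set of stream names (the names are invented), so the intersection test is just set intersection.

```python
def cycle_is_virtual(label_sets):
    """A cycle of forwarding rules c1, ..., cn is only a *virtual* loop
    iff the intersection of their languages is empty: then no single
    stream can traverse every edge of the cycle and loop forever."""
    return len(set.intersection(*map(set, label_sets))) == 0

c1 = {"anza.seismic", "anza.audio"}      # what A forwards to B
c2 = {"anza.audio", "kyrgyz.seismic"}    # what B forwards to C
c3 = {"kyrgyz.seismic", "anza.seismic"}  # what C forwards to A
print(cycle_is_virtual([c1, c2, c3]))    # True: no actual loop
```

With real regexps the same idea requires an automata-based emptiness test for the intersection, which is decidable for regular languages.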
Scientific Workflows and Analytical Pipelines
Scientific Workflows/Analytical Pipelines over Brain Data
Representation of the workflow for cortical reconstruction using FreeSurfer. Raw anatomical MR images are first pre-processed and then must be manually edited to correct defects in the pre-processing. Once verified for correctness the pre-processed images can then be analyzed. During the processing various “snapshots” of the data are returned to the BIRN Virtual Data Grid.
• Fundamental improvements for researchers: Global access to ecologically relevant data; Rapidly locate and utilize distributed computation; Capture, reproduce, extend analysis process
SEEK: Vision & Overview

[Architecture diagram: the EcoGrid (SRB, KNB, MC; raw data sets wrapped for integration w/ EML, etc.; ECO2 and TaxOn parameter ontologies; species data such as WrpDar), the Semantic Mediation System (SMS: semantic mediation engine, logic rules ECO2-CL, query processing, data binding), and the Analysis and Modeling System (AM: library of analysis steps ASr/ASx/ASy/ASz, pipelines & results, parameters w/ semantics), connected via WSDL/UDDI; an example analytical pipeline “AP0” computes invasive species over time in an execution environment (SAS, MATLAB, FORTRAN, etc.).]

• EcoGrid provides unified access to Distributed Data Stores, Parameter Ontologies, & Stored Analyses, and runtime capabilities via the Execution Environment
• The Semantic Mediation System & Analysis and Modeling System use WSDL/UDDI to access services within the EcoGrid, enabling analytically driven data discovery and integration
• SEEK is the combination of EcoGrid data resources and information services, coupled with advanced semantic and modeling capabilities
SEEK Components
• EcoGrid
– Seamless access to distributed, heterogeneous data: ecological, biodiversity, environmental data
• Analysis and Modeling System
– Capture, reproduce, and extend analysis process
• Declarative means for documenting analysis
• “Pipeline” system for linking generic analysis steps
• Strong version control for analysis steps
– Easy-to-use interface between data and analysis
• Semantic Mediation System:
– “smart” data discovery, “type-correct” pipeline construction & data binding:
– determine whether/how to link analytic steps
– determine how data sets can be combined
– determine whether/how data sets are appropriate inputs for analysis steps
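“Type-correct” pipeline construction can be sketched as checking, for each pair of chained steps, that the output (semantic) type of one step is subsumed by the input type of the next. The type names, is-a table, and step triples below are invented for illustration; they are not SEEK’s actual type system.

```python
# Toy semantic type hierarchy (child -> parent); names are illustrative.
ISA = {"RichnessMatrix": "SpeciesMatrix", "SpeciesMatrix": "Dataset"}

def subtype(t, expected, isa=ISA):
    """True iff t == expected or t is (transitively) is-a expected."""
    while t is not None:
        if t == expected:
            return True
        t = isa.get(t)
    return False

def well_typed(pipeline):
    """pipeline: list of (step_name, in_type, out_type). Each step's
    output must be a subtype of the next step's input."""
    for (_, _, out_t), (_, in_t, _) in zip(pipeline, pipeline[1:]):
        if not subtype(out_t, in_t):
            return False
    return True

steps = [("sample", "Dataset", "RichnessMatrix"),
         ("diversityIndex", "SpeciesMatrix", "Scalar")]
print(well_typed(steps))  # True: RichnessMatrix is-a SpeciesMatrix
```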
AMS Overview
• Objective
– Create a semi-automated system for analyzing data and executing models that provides documentation, archiving, and versioning of the analyses, models, and their outputs (visual programming language?)
• Scope
– Any type of analysis or model in ecology and biodiversity science
– Massively streamline the analysis and modeling process
– Archiving, rerunning analyses in SAS, Matlab, R, SysStat, C(++), …
– …
SMS Requirements from AMS
• ... assist users in determining the appropriateness of combining various analytical steps and data sources based on semantic mediation ...
• Semantic mediation should occur in three areas:
1. determine whether it is appropriate to link together particular analytic steps.
2. mediate between multiple data sets to determine in what ways they can be combined.
3. determine whether the selected data sources are appropriate inputs for the selected analysis.
Some functional requirements
• SMS should have the ability to ...
FR1: recognize data types (XML Schema types!? EML types?) of registered EcoGrid data sets
FR2: recognize semantic types (OWL and/or RDF(S)!?) of registered EcoGrid data sets
FR4: recognize data type signatures (XML Schema? WSDL?) of analytical steps (ASs)
FR5: recognize semantic type signatures of analytical steps
FR6: recognize semantic constraints (OWL? First-order? What syntax? KIF? Prolog?)
Note: data schemas and signatures of analytical steps have those ...
... some functional requirements
• Ability to ...
FR8: check well-typedness (data and semantics) of a data set wrt. an analytical step
FR9: check compatibility of two data sets wrt. "generalized operations" between those data sets (e.g., "semantic" join and union)
FR10: check well-typedness (data and semantics) of chained analytical steps
FR11: introduce data type conversions (e.g., int → float)
FR12: perform and "explain" semantic type substitutions (e.g., if some AS works for Cs and D-isa-C, it also works for Ds)
FR13: [optional] generate type-correct APs from a given schema of desired output and (optionally) input parameters
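FR11/FR12 can be sketched as a lookup of conversion rules between two chained steps: identity when the data types match, a converter when a rule exists, and an error otherwise (mirroring the "error is reported" branch later in the compiler description). The conversion table and type names below are illustrative assumptions.

```python
# Illustrative conversion-rule table: (from_type, to_type) -> converter.
CONVERSIONS = {("int", "float"): float}

def link(out_type, in_type):
    """Return a converter callable linking a step's output type to the
    next step's input type, or raise if no conversion rule applies."""
    if out_type == in_type:
        return lambda x: x
    conv = CONVERSIONS.get((out_type, in_type))
    if conv is None:
        raise TypeError(f"no conversion {out_type} -> {in_type}")
    return conv

print(link("int", "float")(3))  # 3.0
```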
Use Cases
• Clients of the SMS include the AMS, the EcoGrid, and "scientific workflow engineers".
– UC1: Client requests type signature (data and semantic types) of a registered EcoGrid data set (DS)
– UC2: Client requests "other semantic constraints" of a DS.
– UC3: Client requests type signature (data and semantic types) of an analytical step (AS)
– UC4: Client requests "other semantic constraints" of an AS.
– UC5: Client requests type signature of an AP.
– UC6: Client requests type checking of an AP.
– UC7: Client requests registered data sets compatible with the inputs of an AS (e.g., if the AS is scale sensitive, then all data sets must have the same scale; a flag is raised if data needs scaling).
– UC8: Client requests all registered ASs which can produce a given parameter (the latter is part of a registered ontology)
– UC9: Client requests candidate predecessor and successor steps for a given AS.
Planned Components
SW1: Formal language(s) for representing/instantiating data types, semantic types, ontologies, and "other semantic constraints".
SW2: System for data type checking and inference (includes introduction of data type conversion steps)
SW3: System for semantic type checking and inference
SW4: [optional] System for "planning" APs given some of: output parameters, data sets, and input parameters
THE PROBLEM – Reconcile this:

Kahn Process Networks (PN)
• Concurrent processes communicating through one-way FIFO channels with unbounded capacity
• A functional process F maps a set of input sequences into a set of output sequences (sounds like XSM!)
• increasing chain of sets of sequences → outputs may not increase!
• Consider increasing chains (wrt. prefix ordering “<”) of streams
• A PN is continuous if lub(Xs) exists for all increasing chains Xs and
– if Xs < Ys then F(Xs) < F(Ys)
Process Networks (cont’d)
• PN in essence: simultaneous relations between sequences
• A network of functional processes can be described by a mapping
X = F(X, I)
– X denotes all the sequences in the network (inputs I + outputs)
• An X that forms a solution is a fixed point
• Continuity implies exactly one “minimal” fixed point
– minimal in the sense of prefix ordering, for any inputs I
– execution of the network: given I, find the minimal fixed point (works because of the monotonic property)
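The fixed-point semantics can be illustrated by iterating F starting from the empty stream, the bottom of the prefix order; continuity guarantees the iteration converges to the minimal fixed point. The single feedback process below is an invented toy (it emits a 0 followed by each previous output plus one), truncated to five tokens so the iteration terminates.

```python
def F(y, n=5):
    """One feedback process: output a 0, then each previous output + 1,
    truncated to n tokens (so streams stay finite in this sketch)."""
    return ((0,) + tuple(v + 1 for v in y))[:n]

def minimal_fixpoint(F):
    """Kleene iteration: start from the empty stream (bottom) and apply
    F until the stream no longer grows."""
    y = ()
    while True:
        y2 = F(y)
        if y2 == y:
            return y
        y = y2

print(minimal_fixpoint(F))  # (0, 1, 2, 3, 4)
```

Each iterate extends the previous one in the prefix order, which is exactly the monotonicity the slide appeals to.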
Synchronous Data Flow Networks (SDF)
• Special case of PN
• Ptolemy-II SDF overview
– SDF supports efficient execution of dataflow graphs that lack control structures
– with control structures → Process Networks (PN)
– requires that the rates on the ports of all actors be known beforehand
– rates do not change during execution
– in systems with feedback, delays (represented by initial tokens on relations) must be explicitly noted
• SDF uses this rate and delay information to determine the execution sequence of the actors before execution begins.
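The static scheduling that SDF enables rests on solving the balance equations: for each edge, the producer’s firing count times its token rate must equal the consumer’s firing count times its rate. A minimal sketch for a chain of actors (actor names and rates invented; general graphs need a proper traversal and consistency check):

```python
from fractions import Fraction
from math import lcm

# Edges as (producer, tokens produced, consumer, tokens consumed).
edges = [("A", 2, "B", 3), ("B", 1, "C", 2)]

def repetitions(edges):
    """Solve the balance equations rate[prod]*out == rate[cons]*in
    along a chain, then scale to the smallest integer firing counts."""
    rate = {edges[0][0]: Fraction(1)}
    for prod, out_n, cons, in_n in edges:
        rate[cons] = rate[prod] * out_n / in_n
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(r * scale) for a, r in rate.items()}

print(repetitions(edges))  # {'A': 3, 'B': 2, 'C': 1}
```

Here A fires 3 times producing 6 tokens, B consumes 3 per firing and fires twice, and so on; a schedule repeating these counts returns every buffer to its initial state.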
Extended Kahn-MacQueen Process Networks
• A process is considered active from its creation until its termination
• An active process can block when trying to read from a channel (read-blocked), when trying to write to a channel (write-blocked), or when waiting for a queued topology change request to be processed (mutation-blocked)
• A deadlock is when all the active processes are blocked
– real deadlock: all the processes are blocked on a read
– artificial deadlock: all processes are blocked, and at least one process is blocked on a write → increase the capacity of the receiver with the smallest capacity among all the receivers on which a process is blocked on a write. This breaks the deadlock.
– If the increase results in a capacity that exceeds the value of maximumQueueCapacity, then instead of breaking the deadlock, an exception is thrown. This can be used to detect erroneous models that require unbounded queues.
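The artificial-deadlock rule can be sketched as follows. The slide only says the smallest capacity is "increased"; the doubling policy here is an assumption of this sketch, as are the channel names, but the bounded-capacity exception mirrors the maximumQueueCapacity behavior described above.

```python
MAXIMUM_QUEUE_CAPACITY = 8  # illustrative bound

def break_artificial_deadlock(write_blocked_capacities):
    """write_blocked_capacities: dict channel -> current capacity, for
    the channels some process is write-blocked on. Grow the smallest
    one (doubling is this sketch's assumption), or raise if the model
    would need unbounded queues."""
    chan = min(write_blocked_capacities, key=write_blocked_capacities.get)
    new_cap = write_blocked_capacities[chan] * 2
    if new_cap > MAXIMUM_QUEUE_CAPACITY:
        raise RuntimeError("model requires unbounded queues")
    return chan, new_cap

print(break_artificial_deadlock({"c1": 4, "c2": 2}))  # ('c2', 4)
```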
Analytical Pipelines: An Open Source Tool
A commercial tool for Analytical Pipelines
MAP: Data Massaging a la Blue-Titan/Perl MAP: Data Massaging a la Blue-Titan/Perl
Compiling Abstract Scientific Workflows into Web Service Workflows
SSDBM’03
The Problem
• Scientist would like to ...
– create a high-level “abstract” WF and
– not bother about web service URLs, parameter passing, low-level data transformations, ...
• How to go from ...
– a high-level Abstract Workflow (AWF) to
– an Executable (web service) Workflow (EWF)??
• Idea:
– Using nested definitions, express an AWF in terms of other AWFs and EWFs; unfold definitions at compile time
→ Abstract-as-View approach
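The Abstract-as-View idea can be sketched as recursive unfolding: each abstract task is defined as a sequence of other (abstract or executable) tasks, and unfolding stops at executable web-service tasks. All task names and the "ws:" prefix convention below are invented for illustration.

```python
# Illustrative AAV definitions: abstract task -> sequence of sub-tasks.
# Names prefixed "ws:" stand for executable web-service tasks.
AAV = {
    "PromoterWorkflow": ["RetrieveSequence", "ExtractPromoter"],
    "RetrieveSequence": ["ws:getAccession", "ws:getGenomicSeq"],
    "ExtractPromoter": ["ws:cutRegion"],
}

def unfold(task):
    """Recursively unfold an abstract task into executable tasks."""
    if task.startswith("ws:"):
        return [task]
    return [t for sub in AAV[task] for t in unfold(sub)]

print(unfold("PromoterWorkflow"))
# ['ws:getAccession', 'ws:getGenomicSeq', 'ws:cutRegion']
```

This mirrors how a query against a global schema is rewritten into queries against the sources, as the compilation steps below note.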
WF Language Constructs (AWF+EWF)

Edge Type – Explanation
– Data-In: specifies input data of a task
– Conditional Data-In (cond): as before, but data flows only if cond is satisfied
– Data-Out: specifies output of a task
– Conditional Data-Out (cond): as before, but data flows only if cond is satisfied
– Data-Connect: connects output data (of previous steps) to input data (of subsequent steps)
– Conditional Data-Connect (cond): as before, but data flows only if cond is satisfied
– Parameter: specifies a control parameter of a task
Conceptual Workflow

[Workflow diagram with steps: Compute clusters (min. distance); Select gene-set (cluster-level); for each gene: Retrieve transcription factors; Arrange transcription factors; Retrieve matching cDNA; Retrieve genomic sequence; Extract promoter region (begin, end); for each promoter: Compute subsequence labels with all promoter models; Compute joint promoter model; Create consensus sequence; Align promoters]
Abstract Workflow (AWF)
(= chain program over relations with i/o restrictions)
1. Check whether the AWF is well-formed and well-typed; if not, corresponding warnings are issued (a semantic type mismatch may not only be a workflow design error, but often indicates the incompleteness of the underlying ontology).
2. Next the AWF is successively unfolded, using the AAV view definitions. (Compiling an AWF into an EWF using AAV is similar to rewriting a query against a global schema into queries against the sources.)
3. The unfolded logic query plan then undergoes several rewriting steps until a certain normal form (DNF/UCQ) is reached. If the join variables (= the connection edges) are not of the same data type (but at least of compatible semantic types), then the insertion of conversion rules is attempted; if this fails, an error is reported.
4. For each list of conjunctive goals, the system tries to find an executable goal order, i.e., one which satisfies all i/o restrictions imposed by the web service descriptions of executable tasks.
• Implementation: a set of Java and Prolog programs, rules, ontologies and repositories
[Compiler architecture diagram: an AWF passes through the WF-Validator (AWF & EWF validation, emitting validation errors) to a Valid-AWF; query rewriting uses AAV rules from the Abstract Task (AT) Repository and the Data & Parameter Ontologies; semantic type checking and data type conversion use conversion rules, ET schemas, and the Datatype & Conversion Repository; web service matching uses the Executable Task (ET) Repository; execution targets PN, SDF, ..., XPDL/OFBiz-style Ptolemy-II directors.

ET = a web service. AT = a “mini workflow” of ETs (a composition of ETs and ATs), which may itself become a web service if deployed. SciDAC extensions to Ptolemy-II: Web Service plug-in.]
Gene Sequence Processing
Designing PIW in a Scientific Workflow Management System
User-specified parameters:
• The accession numbers, separated by commas,
• The number of promoters to investigate,
• The name of the file to hold the FASTA-format promoter regions.
Looking Inside an Abstract Task: Gene Sequence Processing