ASU, 7/25/2003
Scientific Data Management: From Data Integration to Analytical Pipelines
Bertram Ludäscher ([email protected])
Data & Knowledge Systems, San Diego Supercomputer Center, University of California, San Diego
Scientific Data Management: From Data Integration to Analytical Pipelines
Data & Knowledge Systems
San Diego Supercomputer Center
University of California, San Diego
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
• Data Integration Approaches:
– Let’s just share data, e.g., link everything from a web page!
– ... or better, put everything into a relational or XML database
– ... and do remote access using the Grid
– ... or just use Web services!
Some BIRNing Data Integration Questions
• Nice try. But:
– “Find the files where the amygdala was segmented.”
– “Which other structures were segmented in the same files?”
– “Did the volume of any of those structures differ much from normal?”
– “What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?”
Towards Shared Conceptualizations: Data Contextualization via Concept Spaces
Rock Classification Ontology
Composition
Genesis
Fabric
Texture
Some enabling operations on “ontology data”
Composition
Concept expansion:
• what else to look for when asking for ‘Mafic’
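Concept expansion can be sketched as a transitive closure over an is-a table: asking for ‘Mafic’ should also retrieve everything classified below it. The rock terms and the child-to-parent dictionary encoding below are illustrative assumptions, not the actual GEON ontology representation.

```python
# Toy is-a hierarchy (child -> parent); terms are illustrative only.
ISA = {
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Mafic": "Igneous",
    "Granite": "Felsic",
    "Felsic": "Igneous",
}

def expand(concept, isa=ISA):
    """Return the concept plus all of its descendants: what else to
    look for when a user asks for, e.g., 'Mafic'."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in isa.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

print(sorted(expand("Mafic")))  # ['Basalt', 'Gabbro', 'Mafic']
```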
Some enabling operations on “ontology data”
Composition
Generalization:
• finding data that is “like” X and Y
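Generalization can likewise be sketched as finding the most specific common ancestor of two concepts in an is-a hierarchy; the hierarchy below is an invented toy, not the real ontology.

```python
# Toy is-a hierarchy (child -> parent); terms are illustrative only.
ISA = {
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Mafic": "Igneous",
    "Granite": "Felsic",
    "Felsic": "Igneous",
}

def ancestors(concept, isa=ISA):
    """Chain from concept up to the root, most specific first."""
    chain = [concept]
    while concept in isa:
        concept = isa[concept]
        chain.append(concept)
    return chain

def generalize(x, y, isa=ISA):
    """Most specific concept subsuming both x and y: data 'like' X and Y."""
    anc_y = set(ancestors(y, isa))
    for a in ancestors(x, isa):
        if a in anc_y:
            return a
    return None

print(generalize("Basalt", "Gabbro"))   # Mafic
print(generalize("Basalt", "Granite"))  # Igneous
```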
Getting Formal: Source Contextualization & Ontology Refinement in Logic
Distributed Query Processing Challenges: Part I, The Basics
GeoSciences Network
• “Scientific data” (BIRN, GEON, ...): a variant of the data integration problem studied by the database CS community
• Given
– a user query against an integrated view
– view-to-source mappings (GAV/LAV)
– sources with limited access patterns
• Compute a distributed query plan P s.t.
– P has a feasible execution order
– P is optimized wrt. time/space/networking complexity
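The “feasible execution order” requirement can be sketched with binding patterns (“b” = the argument must be bound before the call, “f” = free/output): a plan is feasible iff the source calls can be ordered so every bound position is already available. The source names and patterns below are invented for illustration.

```python
def feasible_order(goals, bound=frozenset()):
    """goals: list of (name, pattern, args). Greedily order goals: a
    goal is callable once every 'b' position refers to an already-bound
    variable; calling it binds all of its arguments."""
    bound, order, pending = set(bound), [], list(goals)
    while pending:
        for g in pending:
            name, pattern, args = g
            if all(a in bound for a, p in zip(args, pattern) if p == "b"):
                order.append(name)
                bound.update(args)   # outputs become bound
                pending.remove(g)
                break
        else:
            return None  # no feasible execution order exists
    return order

# Invented sources: bookInfo needs the ISBN bound; priceLookup needs
# both the ISBN and the store bound.
goals = [("priceLookup", "bbf", ("isbn", "store", "price")),
         ("bookInfo", "bf", ("isbn", "store"))]
print(feasible_order(goals, bound={"isbn"}))  # ['bookInfo', 'priceLookup']
```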
ROADNet: Real-time Observatories, Applications, and Data management Network
• Autonomous field sensors
– Seismic, oceanic, climate, ecological, …, video, audio, …
• RT Data Acquisition:
– ANZA Seismic Network (1981-present): 13 Broadband Stations, 3 Borehole Strong Motion Arrays, 5 Infrasound Stations, 1 Bridge Monitoring System; Kyrgyz Seismic Network (1991-present): 10 Broadband Stations; IRIS PASSCAL Transportable Array (1997-present): 15-60 Broadband and Short Period Stations; IDA Global Seismic Network (~1990-present): 38 Broadband Stations
• High Performance Wireless Research Network (HPWREN)
– High performance backbone network: 45 Mbps duplex point-to-point links, backbone nodes at quality locations, network performance monitors at backbone sites; high speed access links: hard-to-reach areas, typically 45 Mbps or 802.11 radios, point-to-point or point-to-multipoint
• Data Grid Technology (SRB)
– collaborative access to distributed heterogeneous data, single sign-on authentication and seamless authorization, data scaling to Petabytes and 100s of millions of files, data replication, etc.
A P2P Problem from ROADNetA P2P Problem from ROADNet
• Networks of ORBs send each other various data streamsNetworks of ORBs send each other various data streams
• Avoid Avoid actualactual loops in the presence of loops in the presence of virtual virtual loops:loops:– A B C
– A: c1B
– B: c2 C
– C: c3 A
– ...
• Idea: L(c1) Idea: L(c1) L(c2) L(c2) L(c3) … = {} L(c3) … = {}• In the real system: unix regexpsIn the real system: unix regexps
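The emptiness test can be sketched directly. The real ROADNet ORBs match streams with unix regexps; here each forwarding rule’s language is approximated by an explicit set of stream names (the names are invented), so the intersection test is just set intersection.

```python
def cycle_is_virtual(label_sets):
    """A cycle of forwarding rules c1, ..., cn is only a *virtual* loop
    iff the intersection of their languages is empty: then no single
    stream can traverse every edge of the cycle and loop forever."""
    return len(set.intersection(*map(set, label_sets))) == 0

c1 = {"anza.seismic", "anza.audio"}      # what A forwards to B
c2 = {"anza.audio", "kyrgyz.seismic"}    # what B forwards to C
c3 = {"kyrgyz.seismic", "anza.seismic"}  # what C forwards to A
print(cycle_is_virtual([c1, c2, c3]))    # True: no actual loop
```

With real regexps the same idea requires an automata-based emptiness test for the intersection, which is decidable for regular languages.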
Scientific Workflows and Analytical Pipelines
Scientific Workflows/Analytical Pipelines over Brain Data
Representation of the workflow for cortical reconstruction using FreeSurfer. Raw anatomical MR images are first pre-processed and then must be manually edited to correct defects in the pre-processing. Once verified for correctness the pre-processed images can then be analyzed. During the processing various “snapshots” of the data are returned to the BIRN Virtual Data Grid.
• Fundamental improvements for researchers: Global access to ecologically relevant data; Rapidly locate and utilize distributed computation; Capture, reproduce, extend analysis process
SEEK: Vision & Overview

[Architecture diagram: the EcoGrid (SRB, KNB, MC; raw data sets wrapped for integration w/ EML, etc.; ECO2 and TaxOn parameter ontologies; species data such as WrpDar), the Semantic Mediation System (SMS: semantic mediation engine, logic rules ECO2-CL, query processing, data binding), and the Analysis and Modeling System (AM: library of analysis steps ASr/ASx/ASy/ASz, pipelines & results, parameters w/ semantics), connected via WSDL/UDDI; an example analytical pipeline “AP0” computes invasive species over time in an execution environment (SAS, MATLAB, FORTRAN, etc.).]

• EcoGrid provides unified access to Distributed Data Stores, Parameter Ontologies, & Stored Analyses, and runtime capabilities via the Execution Environment
• The Semantic Mediation System & Analysis and Modeling System use WSDL/UDDI to access services within the EcoGrid, enabling analytically driven data discovery and integration
• SEEK is the combination of EcoGrid data resources and information services, coupled with advanced semantic and modeling capabilities
SEEK Components
• EcoGrid
– Seamless access to distributed, heterogeneous data: ecological, biodiversity, environmental data
• Analysis and Modeling System
– Capture, reproduce, and extend analysis process
• Declarative means for documenting analysis
• “Pipeline” system for linking generic analysis steps
• Strong version control for analysis steps
– Easy-to-use interface between data and analysis
• Semantic Mediation System:
– “smart” data discovery, “type-correct” pipeline construction & data binding:
– determine whether/how to link analytic steps
– determine how data sets can be combined
– determine whether/how data sets are appropriate inputs for analysis steps
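“Type-correct” pipeline construction can be sketched as checking, for each pair of chained steps, that the output (semantic) type of one step is subsumed by the input type of the next. The type names, is-a table, and step triples below are invented for illustration; they are not SEEK’s actual type system.

```python
# Toy semantic type hierarchy (child -> parent); names are illustrative.
ISA = {"RichnessMatrix": "SpeciesMatrix", "SpeciesMatrix": "Dataset"}

def subtype(t, expected, isa=ISA):
    """True iff t == expected or t is (transitively) is-a expected."""
    while t is not None:
        if t == expected:
            return True
        t = isa.get(t)
    return False

def well_typed(pipeline):
    """pipeline: list of (step_name, in_type, out_type). Each step's
    output must be a subtype of the next step's input."""
    for (_, _, out_t), (_, in_t, _) in zip(pipeline, pipeline[1:]):
        if not subtype(out_t, in_t):
            return False
    return True

steps = [("sample", "Dataset", "RichnessMatrix"),
         ("diversityIndex", "SpeciesMatrix", "Scalar")]
print(well_typed(steps))  # True: RichnessMatrix is-a SpeciesMatrix
```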
AMS Overview
• Objective
– Create a semi-automated system for analyzing data and executing models that provides documentation, archiving, and versioning of the analyses, models, and their outputs (visual programming language?)
• Scope
– Any type of analysis or model in ecology and biodiversity science
– Massively streamline the analysis and modeling process
– Archiving, rerunning analyses in SAS, Matlab, R, SysStat, C(++), …
– …
SMS Requirements from AMS
• ... assist users in determining the appropriateness of combining various analytical steps and data sources based on semantic mediation ...
• Semantic mediation should occur in three areas:
1. determine whether it is appropriate to link together particular analytic steps.
2. mediate between multiple data sets to determine in what ways they can be combined.
3. determine whether the selected data sources are appropriate inputs for the selected analysis.
Some functional requirements
• SMS should have the ability to ...
FR1: recognize data types (XML Schema types!? EML types?) of registered EcoGrid data sets
FR2: recognize semantic types (OWL and/or RDF(S)!?) of registered EcoGrid data sets
FR4: recognize data type signatures (XML Schema? WSDL?) of analytical steps (ASs)
FR5: recognize semantic type signatures of analytical steps
FR6: recognize semantic constraints (OWL? First-order? What syntax? KIF? Prolog?)
Note: data schemas and signatures of analytical steps have those ...
... some functional requirements
• Ability to ...
FR8: check well-typedness (data and semantics) of a data set wrt. an analytical step
FR9: check compatibility of two data sets wrt. "generalized operations" between those data sets (e.g., "semantic" join and union)
FR10: check well-typedness (data and semantics) of chained analytical steps
FR11: introduce data type conversions (e.g., int → float)
FR12: perform and "explain" semantic type substitutions (e.g., if some AS works for Cs and D-isa-C, it also works for Ds)
FR13: [optional] generate type-correct APs from a given schema of desired output and (optionally) input parameters
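FR11/FR12 can be sketched as a lookup of conversion rules between two chained steps: identity when the data types match, a converter when a rule exists, and an error otherwise (mirroring the "error is reported" branch later in the compiler description). The conversion table and type names below are illustrative assumptions.

```python
# Illustrative conversion-rule table: (from_type, to_type) -> converter.
CONVERSIONS = {("int", "float"): float}

def link(out_type, in_type):
    """Return a converter callable linking a step's output type to the
    next step's input type, or raise if no conversion rule applies."""
    if out_type == in_type:
        return lambda x: x
    conv = CONVERSIONS.get((out_type, in_type))
    if conv is None:
        raise TypeError(f"no conversion {out_type} -> {in_type}")
    return conv

print(link("int", "float")(3))  # 3.0
```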
Use Cases
• Clients of the SMS include the AMS, the EcoGrid, and "scientific workflow engineers".
– UC1: Client requests type signature (data and semantic types) of a registered EcoGrid data set (DS)
– UC2: Client requests "other semantic constraints" of a DS.
– UC3: Client requests type signature (data and semantic types) of an analytical step (AS)
– UC4: Client requests "other semantic constraints" of an AS.
– UC5: Client requests type signature of an AP.
– UC6: Client requests type checking of an AP.
– UC7: Client requests registered data sets compatible with the inputs of an AS (e.g., if the AS is scale sensitive, then all data sets must have the same scale; a flag is raised if data needs scaling).
– UC8: Client requests all registered ASs which can produce a given parameter (the latter is part of a registered ontology)
– UC9: Client requests candidate predecessor and successor steps for a given AS.
Planned Components
SW1: Formal language(s) for representing/instantiating data types, semantic types, ontologies, and "other semantic constraints".
SW2: System for data type checking and inference (includes introduction of data type conversion steps)
SW3: System for semantic type checking and inference
SW4: [optional] System for "planning" APs given some of: output parameters, data sets, and input parameters
THE PROBLEM – Reconcile this:

Kahn Process Networks (PN)
• Concurrent processes communicating through one-way FIFO channels with unbounded capacity
• A functional process F maps a set of input sequences into a set of output sequences (sounds like XSM!)
• increasing chain of sets of sequences → outputs may not increase!
• Consider increasing chains (wrt. prefix ordering “<”) of streams
• A PN is continuous if lub(Xs) exists for all increasing chains Xs and
– if Xs < Ys then F(Xs) < F(Ys)
Process Networks (cont’d)
• PN in essence: simultaneous relations between sequences
• A network of functional processes can be described by a mapping
X = F(X, I)
– X denotes all the sequences in the network (inputs I + outputs)
• An X that forms a solution is a fixed point
• Continuity implies exactly one “minimal” fixed point
– minimal in the sense of prefix ordering, for any inputs I
– execution of the network: given I, find the minimal fixed point (works because of the monotonic property)
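The fixed-point semantics can be illustrated by iterating F starting from the empty stream, the bottom of the prefix order; continuity guarantees the iteration converges to the minimal fixed point. The single feedback process below is an invented toy (it emits a 0 followed by each previous output plus one), truncated to five tokens so the iteration terminates.

```python
def F(y, n=5):
    """One feedback process: output a 0, then each previous output + 1,
    truncated to n tokens (so streams stay finite in this sketch)."""
    return ((0,) + tuple(v + 1 for v in y))[:n]

def minimal_fixpoint(F):
    """Kleene iteration: start from the empty stream (bottom) and apply
    F until the stream no longer grows."""
    y = ()
    while True:
        y2 = F(y)
        if y2 == y:
            return y
        y = y2

print(minimal_fixpoint(F))  # (0, 1, 2, 3, 4)
```

Each iterate extends the previous one in the prefix order, which is exactly the monotonicity the slide appeals to.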
Synchronous Data Flow Networks (SDF)
• Special case of PN
• Ptolemy-II SDF overview
– SDF supports efficient execution of dataflow graphs that lack control structures
– with control structures → Process Networks (PN)
– requires that the rates on the ports of all actors be known beforehand
– rates do not change during execution
– in systems with feedback, delays (represented by initial tokens on relations) must be explicitly noted
• SDF uses this rate and delay information to determine the execution sequence of the actors before execution begins.
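The static scheduling that SDF enables rests on solving the balance equations: for each edge, the producer’s firing count times its token rate must equal the consumer’s firing count times its rate. A minimal sketch for a chain of actors (actor names and rates invented; general graphs need a proper traversal and consistency check):

```python
from fractions import Fraction
from math import lcm

# Edges as (producer, tokens produced, consumer, tokens consumed).
edges = [("A", 2, "B", 3), ("B", 1, "C", 2)]

def repetitions(edges):
    """Solve the balance equations rate[prod]*out == rate[cons]*in
    along a chain, then scale to the smallest integer firing counts."""
    rate = {edges[0][0]: Fraction(1)}
    for prod, out_n, cons, in_n in edges:
        rate[cons] = rate[prod] * out_n / in_n
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(r * scale) for a, r in rate.items()}

print(repetitions(edges))  # {'A': 3, 'B': 2, 'C': 1}
```

Here A fires 3 times producing 6 tokens, B consumes 3 per firing and fires twice, and so on; a schedule repeating these counts returns every buffer to its initial state.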
Extended Kahn-MacQueen Process Networks
• A process is considered active from its creation until its termination
• An active process can block when trying to read from a channel (read-blocked), when trying to write to a channel (write-blocked), or when waiting for a queued topology change request to be processed (mutation-blocked)
• A deadlock is when all the active processes are blocked
– real deadlock: all the processes are blocked on a read
– artificial deadlock: all processes are blocked, and at least one process is blocked on a write → increase the capacity of the receiver with the smallest capacity among all the receivers on which a process is blocked on a write. This breaks the deadlock.
– If the increase results in a capacity that exceeds the value of maximumQueueCapacity, then instead of breaking the deadlock, an exception is thrown. This can be used to detect erroneous models that require unbounded queues.
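The artificial-deadlock rule can be sketched as follows. The slide only says the smallest capacity is "increased"; the doubling policy here is an assumption of this sketch, as are the channel names, but the bounded-capacity exception mirrors the maximumQueueCapacity behavior described above.

```python
MAXIMUM_QUEUE_CAPACITY = 8  # illustrative bound

def break_artificial_deadlock(write_blocked_capacities):
    """write_blocked_capacities: dict channel -> current capacity, for
    the channels some process is write-blocked on. Grow the smallest
    one (doubling is this sketch's assumption), or raise if the model
    would need unbounded queues."""
    chan = min(write_blocked_capacities, key=write_blocked_capacities.get)
    new_cap = write_blocked_capacities[chan] * 2
    if new_cap > MAXIMUM_QUEUE_CAPACITY:
        raise RuntimeError("model requires unbounded queues")
    return chan, new_cap

print(break_artificial_deadlock({"c1": 4, "c2": 2}))  # ('c2', 4)
```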
Analytical Pipelines: An Open Source Tool
A commercial tool for Analytical Pipelines
MAP: Data Massaging a la Blue-Titan/Perl MAP: Data Massaging a la Blue-Titan/Perl
Compiling Abstract Scientific Workflows into Web Service Workflows
SSDBM’03
The Problem
• Scientist would like to ...
– create a high-level “abstract” WF and
– not bother about web service URLs, parameter passing, low-level data transformations, ...
• How to go from ...
– a high-level Abstract Workflow (AWF) to
– an Executable (web service) Workflow (EWF)??
• Idea:
– Using nested definitions, express an AWF in terms of other AWFs and EWFs; unfold definitions at compile time
→ Abstract-as-View approach
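The Abstract-as-View idea can be sketched as recursive unfolding: each abstract task is defined as a sequence of other (abstract or executable) tasks, and unfolding stops at executable web-service tasks. All task names and the "ws:" prefix convention below are invented for illustration.

```python
# Illustrative AAV definitions: abstract task -> sequence of sub-tasks.
# Names prefixed "ws:" stand for executable web-service tasks.
AAV = {
    "PromoterWorkflow": ["RetrieveSequence", "ExtractPromoter"],
    "RetrieveSequence": ["ws:getAccession", "ws:getGenomicSeq"],
    "ExtractPromoter": ["ws:cutRegion"],
}

def unfold(task):
    """Recursively unfold an abstract task into executable tasks."""
    if task.startswith("ws:"):
        return [task]
    return [t for sub in AAV[task] for t in unfold(sub)]

print(unfold("PromoterWorkflow"))
# ['ws:getAccession', 'ws:getGenomicSeq', 'ws:cutRegion']
```

This mirrors how a query against a global schema is rewritten into queries against the sources, as the compilation steps below note.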
WF Language Constructs (AWF+EWF)

Edge Type – Explanation
– Data-In: specifies input data of a task
– Conditional Data-In (cond): as before, but data flows only if cond is satisfied
– Data-Out: specifies output of a task
– Conditional Data-Out (cond): as before, but data flows only if cond is satisfied
– Data-Connect: connects output data (of previous steps) to input data (of subsequent steps)
– Conditional Data-Connect (cond): as before, but data flows only if cond is satisfied
– Parameter: specifies a control parameter of a task
Conceptual Workflow

[Workflow diagram with steps: Compute clusters (min. distance); Select gene-set (cluster-level); for each gene: Retrieve transcription factors; Arrange transcription factors; Retrieve matching cDNA; Retrieve genomic sequence; Extract promoter region (begin, end); for each promoter: Compute subsequence labels with all promoter models; Compute joint promoter model; Create consensus sequence; Align promoters]
Abstract Workflow (AWF)
(= chain program over relations with i/o restrictions)
1. Check whether the AWF is well-formed and well-typed; if not, corresponding warnings are issued (a semantic type mismatch may not only be a workflow design error, but often indicates the incompleteness of the underlying ontology).
2. Next the AWF is successively unfolded, using the AAV view definitions. (Compiling an AWF into an EWF using AAV is similar to rewriting a query against a global schema into queries against the sources.)
3. The unfolded logic query plan then undergoes several rewriting steps until a certain normal form (DNF/UCQ) is reached. If the join variables (= the connection edges) are not of the same data type (but at least of compatible semantic types), then the insertion of conversion rules is attempted; if this fails, an error is reported.
4. For each list of conjunctive goals, the system tries to find an executable goal order, i.e., one which satisfies all i/o restrictions imposed by the web service descriptions of executable tasks.
• Implementation: a set of Java and Prolog programs, rules, ontologies and repositories
[Compiler architecture diagram: an AWF passes through the WF-Validator (AWF & EWF validation, emitting validation errors) to a Valid-AWF; query rewriting uses AAV rules from the Abstract Task (AT) Repository and the Data & Parameter Ontologies; semantic type checking and data type conversion use conversion rules, ET schemas, and the Datatype & Conversion Repository; web service matching uses the Executable Task (ET) Repository; execution targets PN, SDF, ..., XPDL/OFBiz-style Ptolemy-II directors.

ET = a web service. AT = a “mini workflow” of ETs (a composition of ETs and ATs), which may itself become a web service if deployed. SciDAC extensions to Ptolemy-II: Web Service plug-in.]
Gene Sequence Processing
Designing PIW in a Scientific Workflow Management System
User-specified parameters:
• The accession numbers, separated by commas,
• The number of promoters to investigate,
• The name of the file to hold the FASTA-format promoter regions.
Looking Inside an Abstract Task: Gene Sequence Processing