Jan 15, 2016
Exercises: Visit sites:
• GCMD – http://gcmd.nasa.gov/
• ORNL DAAC – http://www-eosdis.ornl.gov/
• NBII – http://nbii.gov/
• ESA – http://esapubs.org/archive/
Video Hans Rosling
Introduction to SEEK & CI
William Michener
LTER Network Office, University of New Mexico
January 2007
Cyberinfrastructure for Environmental Biology
Environmental sciences increasingly focus on collaboration and synthesis
Cyberinfrastructure supports science by:
• Supporting data access and discovery
• Facilitating the integration of heterogeneous data
• Enabling complex analysis, modeling and forecasting
Outline
• Cyberinfrastructure challenges
• Overview of the Science Environment for Ecological Knowledge (SEEK) architecture: EcoGrid, Kepler, Semantic Mediation System
• Scientific Workflows and Kepler
• Semantics in integration, analysis and modeling
Access, integration, and analysis
Synthesis projects have distinctive needs
• Need to access large numbers of data sets
  – Little of the data are 'theirs', so they know few details about them
  – Studies are 'data limited'
  – Must engage the broader community to really solve the access issue
• Need to integrate those data in a meaningful way
  – Studies not designed to be used together, so many pitfalls
  – Integrated product differs for every synthesis project
• Need to analyze and model the data efficiently and collaboratively
  – Leverage dynamic data loading to increase efficiency
  – Use scientific workflow systems to work collaboratively
Dilemma: no unified model
• No single database suffices
  – Numerous data warehouses exist, but they are not extensible to all data (VegBank, ClimDB, GenBank, PDB, etc.)
  – Data warehouses use federated schemas; any data that do not fit are not captured
  – This is a form of data integration for one purpose
• Custom development for thousands of databases is not feasible
Discovery, Access, and Archive
• Effectively store and archive data
• Effectively locate and access data from dispersed collections
Approaches
• Repositories for some data exist
  – KNB Metacat, SRB, EcoGrid
  – Professional Society registries
• Structured metadata + ontologies
  – Ecological Metadata Language
  – FGDC metadata not sufficient
• Smart search
• Replication
Challenge areas
• Better recall and precision
• Exploiting semantics
• Building effective ontologies
• Need to target individual scientists + students
Data sharing is increasing
• KNB Metacat
  – ESA Data Registry
  – NCEAS
  – LTER (LNO, SBC)
  – PISCO
  – OBFS
  – UC Natural Reserves
  – Pelagic Fisheries DB
  – LITS DB (UK)
  – Kruger Nat. Park (SA)
  – OSU, FIU
[Figure: Data packages in the KNB; cumulative count by year, 2002 to 2006, y-axis 0 to 12,000]
Metadata: an absolute necessity
• Ecological Metadata Language
• FGDC/NBII metadata (good but insufficient)
Loosely coupled data repositories accommodate heterogeneity
Data heterogeneity
Data are heterogeneous: differing formats, logical organization, and interpretation
• Syntax: format of the data (e.g., csv, NetCDF, Excel, etc.)
• Schema: logical model of the data (e.g., relational models, hierarchical models, etc.)
• Semantics: meaning of the data (e.g., conceptual links, formalized methods, interpretation)
Broad array of relevant data sources
• Ecological (population survey, community survey, behavioral, etc.)
• Physical (hydrology, meteorology, chemistry, etc.)
• Social (demographic data, land use patterns, policy information, etc.)
• Economic (economic valuations, demographic data, etc.)
Data Integration
Combining heterogeneous data is necessary for synthesis
Approaches
• Manual
• Semi-automated integration that leverages domain knowledge
Challenges
• Integration constrained by intended analyses as well as data input
  – Not the traditional data warehouse approach
• Difficult to build consistent knowledge base
• Automated reasoning tested on small data sources
• Little semantics support in software tools for domain science
Integration needs to be:
• Ad-hoc (no global view)
• Fast (hours instead of months)
Analysis and Modeling
• Current practices are ad-hoc and non-repeatable
• Model the steps used by researchers during analysis
  – Graphical model of flow of data among processing steps
  – Each step often occurs in different software
  – Refer to these graphs as 'Scientific Workflows'
[Figure: example workflow with the steps Data, Clean, Analyze, Graph]
Science Environment for Ecological Knowledge
SEEK extends informatics approaches to improve analysis and modeling to support broad scale synthesis
Expose ecological, biodiversity, environmental data through a common architecture
Create a framework for executing, preserving and communicating complex quantitative analytical processes
Address myriad challenges associated with integrating heterogeneous data for use in analysis
EcoGrid
Data access to diverse data systems
• Lightweight web service interfaces
• Common query syntax
• Common mechanism to access
  – ecological data (hundreds of field stations)
  – museum specimen data (hundreds of museums)
  – environmental data (data in SRB at SDSC)
  – geological data (GEON portal)
Kepler: Analysis and Modeling
Scientific workflow paradigm
• Models data flow among modular components
• Improved user interface for complex processes
Benefits
• Improves documentation
• Simplifies sharing of custom models with colleagues
• Promotes modular components
• Hierarchical models can hide complexity
• Direct access to data via EcoGrid
• Access common analysis tools (e.g., R, Matlab) from a single framework
Semantic Mediation
• Mediation layer for Kepler and EcoGrid
• Addresses data heterogeneity and integration issues
• Uses a formal reasoning approach for
  – Smart data discovery
  – Semi-automated data integration
  – Workflow design
  – Workflow validation
• Relies upon a good knowledge model, developed by the Knowledge Representation group
Knowledge Representation
• Ecologists and computer scientists together capture critical knowledge about ecological data
• Extensible Observation Ontology (OBOE) captures the semantics of scientific data
  – Semantics of observations and measurements
  – Unit types
  – Observation context
  – Sampling hierarchies
• Used by the Semantic Mediation system
Taxonomic Nomenclature
Evolving collaborations
• Ecological Metadata Language – started in 1997
• KNB/Morpho/Metacat – KDI 1999
• Lifemapper – KDI 1998
• Kepler – SEEK ITR 2002
• Production work – Mellon Foundation 2002
Collaboration…organic and evolving
Kepler Collaboration
• Open-source
• Builds on Ptolemy II from UC Berkeley
• Collaborators
  – SEEK Project
  – SciDAC SDM Center
  – Ptolemy Project
  – GEON Project
  – ROADNet Project
  – Resurgence Project
• Goals
  – Create powerful analytical tools that are useful across disciplines: Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy, …
Broader Cyberinfrastructure Landscape
I. Data and metadata systems
• Physical (NDBC buoy data)
• Molecular bio (GenBank)
• Biological Collections (DiGIR)
• Oceanography (OPeNDAP)
II. Domain Applications/Algorithms
• Sequence processing (BLAST)
• Ecological Niche Modeling (GARP)
• Site selection (Marxan)
III. Analysis and modeling frameworks
• Grid Systems (Globus)
• Workflow systems (Triana)
Analysis and Modeling
Need a framework for designing, executing, preserving, and sharing analyses and models
Approaches
• Scientific workflows: modular, re-usable components, archive-friendly
Challenges
• Incorporating semantics
• Enabling effective model design
• Effective access to grid computing
Scientific workflows
Features of scientific workflows:
• Graphical model of data flow among processing steps
• Inputs and outputs of components are precisely defined
• Components are modular and reusable
• Flow of data controlled by a separate execution model
• Support for hierarchical models
[Figure: example workflow graph of components labeled A through F, with a source (e.g., data), a processor (e.g., regression), and a sink (e.g., display)]
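The features listed for scientific workflows can be illustrated with a small Python sketch. It is not Kepler or Ptolemy II code: the component functions, the `WORKFLOW` table, and the `run` "director" are hypothetical names chosen for illustration. The point it tries to show is that components declare their inputs, and a separate execution step decides the order in which data flows through the graph.

```python
from graphlib import TopologicalSorter

# Hypothetical components: each is a plain function with defined inputs and one output.
def source():
    return [3.0, 1.0, None, 2.0]                    # made-up data with one missing value

def clean(values):
    return [v for v in values if v is not None]     # drop missing values

def analyze(values):
    return sum(values) / len(values)                # arithmetic mean

def sink(mean):
    return f"mean = {mean:.2f}"                     # stands in for a display

# The workflow graph: component name -> (function, names of upstream components).
WORKFLOW = {
    "source": (source, []),
    "clean": (clean, ["source"]),
    "analyze": (analyze, ["clean"]),
    "sink": (sink, ["analyze"]),
}

def run(workflow):
    """A minimal execution model, kept separate from the components themselves."""
    graph = {name: deps for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func(*(results[d] for d in deps))
    return results

results = run(WORKFLOW)
print(results["sink"])
```

Because the components only see their inputs, any one of them could be swapped for another with the same inputs and outputs without touching the rest of the graph.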
Kepler: dynamic data loading
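Data source from EcoGrid (metadata-driven ingestion)
R processing script:
    res <- lm(BARO ~ T_AIR)   # fit a linear regression of BARO on T_AIR
    res                       # print the fitted model
    plot(T_AIR, BARO)         # scatter plot of the two variables
    abline(res)               # overlay the fitted regression line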
Kepler supports dynamic data loading:
• Data sources are discovered via metadata queries
• EML metadata allows arbitrary schemas to be loaded into an embedded database
• Data queries can be performed before data flows downstream
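The three points above can be sketched in Python under stated assumptions. The `metadata` dictionary is a hypothetical stand-in for what EML would describe (it is not EML syntax), the CSV text is made-up sample data, and SQLite plays the role of the embedded database; none of this is Kepler's actual implementation.

```python
import csv
import io
import sqlite3

# Hypothetical, simplified stand-in for the table description EML metadata would provide.
metadata = {
    "table": "weather",
    "attributes": [("T_AIR", "REAL"), ("BARO", "REAL")],
}

# Made-up sample data matching the metadata above.
raw_csv = "T_AIR,BARO\n12.5,1013.2\n18.0,1009.8\n25.1,1005.4\n"

def load(metadata, text, conn):
    """Create a table from the metadata and load the delimited text into it."""
    table = metadata["table"]
    names = [name for name, _ in metadata["attributes"]]
    columns = ", ".join(f'"{name}" {sql_type}' for name, sql_type in metadata["attributes"])
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in names)
    # Values arrive as text; SQLite's column affinity stores numeric text in REAL columns as numbers.
    rows = [tuple(record[name] for name in names) for record in csv.DictReader(io.StringIO(text))]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)

conn = sqlite3.connect(":memory:")
load(metadata, raw_csv, conn)

# Query the data before it flows downstream, e.g. keep only the warmer records.
warm = conn.execute("SELECT T_AIR, BARO FROM weather WHERE T_AIR > 15 ORDER BY T_AIR").fetchall()
print(warm)
```

The schema is never hard-coded in `load`; a different metadata description would produce a different table, which is the sense in which ingestion is metadata-driven here.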
• Statistics (R and Matlab, etc.)
• Logic and math functions
• Graphics and visualization
• Geospatial data processing
• Molecular data processing
• Domain specific models
• Web services
• Grid services
• Data sources and sinks
• Ecology data
• Geology data
• Taxonomic data
• And much, much more…
The local library
Fast access to local components that are developed by the Kepler team and ship with Kepler
Import
Publish
• Components contributed by scientists
• ‘Upload to repository’ function in Kepler
• Saved in repository, explicit versioning
• Can be shared with colleagues
• Can be referenced in published papers
• Components can be downloaded and executed
• Downloaded components can be customized
• Promotes replication of analyses and models
The remote repository
Publish and share custom analyses, models, and components with colleagues.
Kepler Component Library
Active work in Kepler
• Real-time data in Kepler
  – Scientist's view: real-time data accessed like archived data
  – Engineer's view: drill-down to manage sensor network resources
• Semantics
  – Semantic annotation connects models to knowledge
  – Smart Search (data and components)
  – Smart Data Integration
  – Smart Workflow Linking
  * by "smart" we mean these services are informed by metadata and ontology information
• Improved user interfaces for Grid computing
Kepler and Sensor Networks
Collaborators: NCEAS, SDSC, UC Davis, OSU, CENS (UCLA), OPeNDAP
Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System (CEO:P)
Startup: October 2006
Major foci:
• Sensor network management – standardized services model
• Analysis of data from sensors and archives
• Public web view of sensor data
• OPeNDAP and EcoGrid compatibility
Semantics in scientific workflows
Components and their ports typically have:
• Explicit 'structural type', e.g., int, float, string, {double}
• Implicit semantic type: it is unclear whether the stream of values from a port represents 'rainfall' values or 'body size' values
[Figure: ports linking components A and B, labeled with structural types (int, string) and semantic types (rainfall, bodysize)]
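The difference between structural and semantic types can be made concrete with a short Python sketch. The `Port` class and `check_link` function are hypothetical and the type labels are plain strings; a real system would resolve semantic types against an ontology instead of comparing strings for equality.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Port:
    """A component port; both type labels are illustrative strings."""
    name: str
    structural_type: str   # e.g. "int", "string"
    semantic_type: str     # e.g. "rainfall", "bodysize"

def check_link(output_port: Port, input_port: Port) -> list[str]:
    """Return the problems found when connecting output_port to input_port."""
    problems = []
    if output_port.structural_type != input_port.structural_type:
        problems.append(f"structural: {output_port.structural_type} -> {input_port.structural_type}")
    if output_port.semantic_type != input_port.semantic_type:
        problems.append(f"semantic: {output_port.semantic_type} -> {input_port.semantic_type}")
    return problems

a_out = Port("A.out", "int", "rainfall")
b_in = Port("B.in", "int", "bodysize")

# Structurally compatible (int -> int), but the values mean different things.
print(check_link(a_out, b_in))
```

A check on structural types alone would accept this link, which is why the semantic label has to be made explicit before it can be validated.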
Semantic Annotation
• Label data with semantic types
• Label inputs and outputs of analytical components with semantic types
• Grounded at level of measurement and data, avoiding some pitfalls of upper ontologies
• Provide extension points for loading specialized domain ontologies
[Figure: semantic annotations linking Data and Workflow Components to the Ontology]
Goal: generically describe the structure of scientific observation and measurement as found in a data set
Observation ontology
Entities represent real-world objects or concepts that can be measured.
Measurements assign values and units to characteristics of observed entities.
Observations are made about particular entities.
Every measurement has a characteristic, which defines the property of the entity being measured.
Every measurement relates a characteristic to a standard or unit.
Observations can provide context for other observations.
Entities, through observations, can be associated with one or more measured characteristics.
A value is typically a cell in a data set.
Extension points
Measurements have precision.
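The statements above describe a small data model, which can be sketched with Python dataclasses. This is an illustration of the relationships as worded on the slide, not the actual OBOE schema; the class and field names, and the plot and tree example, are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Entity:
    """A real-world object or concept that can be measured."""
    name: str

@dataclass
class Measurement:
    """Assigns a value and unit to a characteristic of the observed entity."""
    characteristic: str
    value: float
    unit: str
    precision: Optional[float] = None

@dataclass
class Observation:
    """An observation of one entity; it may sit in the context of another observation."""
    entity: Entity
    measurements: List[Measurement] = field(default_factory=list)
    context: Optional["Observation"] = None

def context_chain(observation: Observation) -> List[str]:
    """List the entities that provide context for an observation, nearest first."""
    chain = []
    current = observation.context
    while current is not None:
        chain.append(current.entity.name)
        current = current.context
    return chain

# Made-up example: a tree height observed within a plot.
plot_obs = Observation(Entity("Plot"), [Measurement("area", 100.0, "squareMeter")])
tree_obs = Observation(
    Entity("Tree"),
    [Measurement("height", 12.3, "meter", precision=0.1)],
    context=plot_obs,
)
print(context_chain(tree_obs))
```

Each `Measurement.value` corresponds to what the slide calls a value, typically one cell in a data set.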
Semantic annotation
[Figure: Observation Ontology linked to a data set]
Mapping between data and the ontology via semantic annotation
Smart (Data) Integration: Merge
• Discover data of interest
• … connect to merge actor
• … "compute merge"
  – align attributes via annotations
  – open dialog for user refinement
  – store merge mapping in MoML
• … enjoy! … your merged dataset
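The "compute merge" step can be sketched in Python as follows. This is not the Kepler merge actor: the function names, the dataset layout, and the annotation labels are hypothetical, and the sketch ignores unit conversion and the user-refinement dialog. It only shows how columns with different names can be aligned once both carry the same semantic annotation.

```python
# Made-up datasets whose columns are annotated with hypothetical semantic types.
datasets = {
    "station_a": {
        "annotations": {"site": "SiteName", "temp_c": "AirTemperature"},
        "rows": [{"site": "A1", "temp_c": 12.5}],
    },
    "station_b": {
        "annotations": {"location": "SiteName", "airT": "AirTemperature"},
        "rows": [{"location": "B7", "airT": 18.0}],
    },
}

def compute_merge(datasets):
    """Build a mapping: semantic type -> {dataset name: column name}."""
    mapping = {}
    for name, dataset in datasets.items():
        for column, semantic_type in dataset["annotations"].items():
            mapping.setdefault(semantic_type, {})[name] = column
    return mapping

def apply_merge(datasets, mapping):
    """Rewrite every row so its keys are semantic types instead of local column names."""
    merged = []
    for name, dataset in datasets.items():
        for row in dataset["rows"]:
            merged.append({
                semantic_type: row.get(columns.get(name))
                for semantic_type, columns in mapping.items()
            })
    return merged

mapping = compute_merge(datasets)
merged = apply_merge(datasets, mapping)
print(merged)
```

Keeping `mapping` as a separate value mirrors the slide's idea of storing the merge mapping, so a user could inspect or adjust it before it is applied.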
Smart Linking (Workflow Design)
• Statically perform semantic and structural type checking
• Navigate errors and warnings within the workflow
• Search for and insert "adapters" to fix (structural and semantic) errors …
Semantic capabilities
• Answer semantic data queries, e.g., find sites in California where current abundance of molluscs is < 10% of historical abundance in 1900
• Validate semantic correctness of workflows
• Workflow design tools that exploit semantic context
• Data integration that dynamically matches data sources to target schema needed for analysis
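The example query about mollusc abundance can be sketched in Python to show where semantics enters. Everything here is assumed for illustration: the `IS_A` hierarchy, the records, the site names, and the choice of 2006 as the "current" year. The key step is that records labeled with a narrower taxon still count as molluscs because the hierarchy says so.

```python
# Hypothetical fragment of a taxonomic "is a" hierarchy (child -> parent).
IS_A = {
    "Gastropoda": "Mollusca",
    "Bivalvia": "Mollusca",
    "Mollusca": "Animalia",
    "Aves": "Animalia",
}

def is_kind_of(taxon, ancestor):
    """True if taxon equals ancestor or descends from it in IS_A."""
    while taxon is not None:
        if taxon == ancestor:
            return True
        taxon = IS_A.get(taxon)
    return False

# Made-up records: (site, state, taxon, year, abundance).
RECORDS = [
    ("Site1", "California", "Gastropoda", 1900, 200),
    ("Site1", "California", "Gastropoda", 2006, 15),
    ("Site2", "California", "Bivalvia", 1900, 100),
    ("Site2", "California", "Bivalvia", 2006, 40),
    ("Site3", "Oregon", "Gastropoda", 1900, 300),
    ("Site3", "Oregon", "Gastropoda", 2006, 10),
    ("Site4", "California", "Aves", 1900, 50),
    ("Site4", "California", "Aves", 2006, 2),
]

def declined_sites(records, region, group, baseline_year, current_year, threshold):
    """Sites in region where current abundance of group is below threshold * baseline."""
    totals = {}  # (site, year) -> summed abundance of matching taxa
    for site, state, taxon, year, abundance in records:
        if state == region and is_kind_of(taxon, group):
            totals[(site, year)] = totals.get((site, year), 0) + abundance
    sites = sorted({site for site, _ in totals})
    return [
        site for site in sites
        if (site, baseline_year) in totals
        and (site, current_year) in totals
        and totals[(site, current_year)] < threshold * totals[(site, baseline_year)]
    ]

result = declined_sites(RECORDS, "California", "Mollusca", 1900, 2006, 0.10)
print(result)
```

A plain string match on "Mollusca" would find no records in this data, since every row is labeled with a narrower taxon.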
In summary…
• Typical analytical models are complex and difficult to comprehend and maintain
• Scientific workflows provide
  – An intuitive visual model
  – Structure and efficiency in modeling and analysis
  – Abstractions to help deal with complexity
  – Direct access to data
  – Means to publish and share models
• Kepler is an evolving but effective tool for scientists
  – Looking for ways to transition from research prototype to a production software tool
• Scalable data integration is our main challenge
Prototype NEON Portal – “myNEON”
For more information, see:
http://kepler-project.org/
Acknowledgements
The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.
Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
The Andrew W. Mellon Foundation.
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence
Roadmap ahead:
Monday, January 8, 2007: Research Design
  1:15 – 4:30    Scientific workflows – Pennington
Tuesday, January 9, 2007: Data Grids
  8:30 – 10:30   Grid technologies and activity – Servilla and Pennington
  10:30 – 2:15   EML/Metadata best practices/Morpho – Tyburczy
  2:30 – 3:00    QA/QC – Vanderbilt
  3:00 – 4:30    Good practices on storing data – Vanderbilt/White
Wednesday, January 10, 2007: Workflows I: Using pre-built workflows in Kepler
  8:30 – 12:00   Introduction to Kepler with demos – Pennington/Romanello
  1:15 – 2:00    Using Desktop data in Kepler – Higgins
  2:00 – ???     Bosque, dinner at the Socorro Brew Pub
Thursday, January 11, 2007: Workflows II: Tools in Kepler
  8:30 – 10:30   Using R in Kepler (demo + exercise) – Higgins
  10:30 – 12:00  Visualization in Kepler (demo + exercise) – Higgins/Pennington
  1:15 – 2:15    Biodiversity example in Kepler (demo + exercise) – "/"
  2:30 – 4:30    Taxonomic resolution in Kepler – Stewart
Friday, January 12, 2007: Workflows III: Semantic approaches in Kepler
  8:30 – 12:00   Knowledge representation and semantic mediation – Bowers
  1:15 – 2:00    Recapping the week
  2:00 – 3:00    Preparing to use ecoinformatics in the classroom – Katz
  3:00 – 4:00    Roundtable Discussion – Katz