Exercises: Visit sites:

• GCMD – http://gcmd.nasa.gov/
• ORNL DAAC – http://www-eosdis.ornl.gov/
• NBII – http://nbii.gov/
• ESA – http://esapubs.org/archive/

Video: Hans Rosling

Introduction to SEEK & CI

William Michener

LTER Network Office, University of New Mexico

January 2007

Cyberinfrastructure for Environmental Biology

Environmental sciences increasingly focus on collaboration and synthesis

Cyberinfrastructure supports science by:
• Supporting data access and discovery
• Facilitating the integration of heterogeneous data
• Enabling complex analysis, modeling and forecasting

Outline
• Cyberinfrastructure challenges
• Overview of the Science Environment for Ecological Knowledge (SEEK) architecture: EcoGrid, Kepler, Semantic Mediation System
• Scientific Workflows and Kepler
• Semantics in integration, analysis and modeling





Access, integration, and analysis

Synthesis projects have distinctive needs:
• Need to access large numbers of data sets
  – Little of the data are 'theirs', so they know few details about them
  – Studies are 'data limited'
  – Must engage the broader community to really solve the access issue
• Need to integrate those data in a meaningful way
  – Studies not designed to be used together, so many pitfalls
  – Integrated product differs for every synthesis project
• Need to analyze and model the data efficiently and collaboratively
  – Leverage dynamic data loading to increase efficiency
  – Use scientific workflow systems to work collaboratively

Dilemma: no unified model

No single database suffices
• Numerous data warehouses exist, but they are not extensible to all data
  – VegBank, ClimDB, GenBank, PDB, etc.
• Data warehouses use federated schemas
  – Any data that do not fit are not captured
  – This is a form of data integration for one purpose
• Custom development for thousands of databases is not feasible

Discovery, Access, and Archive

• Effectively store and archive data
• Effectively locate and access data from dispersed collections

Approaches
• Repositories for some data exist
  – KNB Metacat, SRB, EcoGrid
  – Professional Society registries
• Structured metadata + ontologies
  – Ecological Metadata Language
  – FGDC metadata not sufficient
• Smart search
• Replication

Challenge areas
• Better recall and precision
  – Exploiting semantics
  – Building effective ontologies
• Need to target individual scientists + students

Data sharing is increasing
• KNB Metacat
  – ESA Data Registry
  – NCEAS
  – LTER (LNO, SBC)
  – PISCO
  – OBFS
  – UC Natural Reserves
  – Pelagic Fisheries DB
  – LITS DB (UK)
  – Kruger Nat. Park (SA)
  – OSU, FIU

[Figure: Data packages in the KNB. Cumulative count by year, 2002–2006; y-axis from 0 to 12,000.]

Metadata Metadata Metadata Metadata

An absolute necessity

• Ecological Metadata Language
• FGDC/NBII metadata (good but insufficient)

Loosely coupled data repositories accommodate heterogeneity

Data heterogeneity

Data are heterogeneous
• Differing formats, logical organization, and interpretation
  – Syntax: format of the data (e.g., csv, NetCDF, Excel, etc.)
  – Schema: logical model of the data (e.g., relational models, hierarchical models, etc.)
  – Semantics: meaning of the data (e.g., conceptual links, formalized methods, interpretation)
• Broad array of relevant data sources
  – Ecological (population survey, community survey, behavioral, etc.)
  – Physical (hydrology, meteorology, chemistry, etc.)
  – Social (demographic data, land use patterns, policy information, etc.)
  – Economic (economic valuations, demographic data, etc.)

Data Integration

Combining heterogeneous data is necessary for synthesis

Approaches
• Manual
• Semi-automated integration that leverages domain knowledge

Challenges
• Integration constrained by intended analyses as well as data input
  – Not the traditional data warehouse approach
• Difficult to build consistent knowledge base
• Automated reasoning tested on small data sources
• Little semantics support in software tools for domain science

Integration needs to be:
• Ad hoc (no global view)
• Fast (hours instead of months)

Analysis and Modeling

Current practices are ad hoc and non-repeatable
• Model the steps used by researchers during analysis
  – Graphical model of flow of data among processing steps
  – Each step often occurs in different software
• Refer to these graphs as ‘Scientific Workflows’

[Figure: example workflow with the steps Data, Clean, Analyze, Graph]





Science Environment for Ecological Knowledge

SEEK extends informatics approaches to improve analysis and modeling to support broad scale synthesis

Expose ecological, biodiversity, environmental data through a common architecture

Create a framework for executing, preserving and communicating complex quantitative analytical processes

Address myriad challenges associated with integrating heterogeneous data for use in analysis

EcoGrid

Data access to diverse data systems
• Lightweight web service interfaces
• Common query syntax
• Common mechanism to access:
  – ecological data (100’s of field stations)
  – museum specimen data (100’s of museums)
  – environmental data (data in SRB at SDSC)
  – geological data (GEON portal)

Kepler: Analysis and Modeling

Scientific workflow paradigm
• Models data flow among modular components
• Improved user interface for complex processes

Benefits
• Improves documentation
• Simplifies sharing of custom models with colleagues
• Promotes modular components
• Hierarchical models can hide complexity
• Direct access to data via EcoGrid
• Access common analysis tools (e.g., R, Matlab) from a single framework


Semantic Mediation

Mediation layer for Kepler and EcoGrid
• Addresses data heterogeneity and integration issues
• Uses a formal reasoning approach for
  – Smart data discovery
  – Semi-automated data integration
  – Workflow design
  – Workflow validation
• Relies upon good knowledge model
  – Developed by Knowledge Representation group


Knowledge Representation

• Ecologists and computer scientists together capture critical knowledge about ecological data
• Extensible Observation Ontology (OBOE) captures the semantics of scientific data
  – Semantics of observations and measurements
  – Unit types
  – Observation context
  – Sampling hierarchies
• Used by the Semantic Mediation system

Knowledge Representation

Taxonomic Nomenclature

Evolving collaborations
• Ecological Metadata Language – started in 1997
• KNB/Morpho/Metacat – KDI 1999
• Lifemapper – KDI 1998
• Kepler – SEEK ITR 2002
• Production work – Mellon Foundation 2002

Collaboration…organic and evolving

Kepler Collaboration
• Open-source
• Builds on Ptolemy II from UC Berkeley

Collaborators
• SEEK Project
• SciDAC SDM Center
• Ptolemy Project
• GEON Project
• ROADNet Project
• Resurgence Project

Goals
• Create powerful analytical tools that are useful across disciplines
  – Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy, …


Broader Cyberinfrastructure Landscape

I. Data and metadata systems
  – Physical (NDBC buoy data)
  – Molecular bio (GenBank)
  – Biological Collections (DiGIR)
  – Oceanography (OPeNDAP)

II. Domain Applications/Algorithms
  – Sequence processing (BLAST)
  – Ecological Niche Modeling (GARP)
  – Site selection (Marxan)

III. Analysis and modeling frameworks
  – Grid Systems (Globus)
  – Workflow systems (Triana)





Analysis and Modeling

Need framework for designing, executing, preserving, and sharing analyses and models

Approaches
• Scientific workflows – modular, re-usable components, archive-friendly

Challenges
• Incorporating semantics
• Enabling effective model design
• Effective access to grid computing

Scientific workflows

Features of scientific workflows
• Graphical model of data flow among processing steps
• Inputs and outputs of components are precisely defined
• Components are modular and reusable
• Flow of data controlled by a separate execution model
• Support for hierarchical models

[Figure: workflow graph of components A through F, with a source (e.g., data), a processor (e.g., regression), and a sink (e.g., display)]
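The features of scientific workflows listed above can be illustrated with a rough Python sketch: a source, a processor, and a sink are declared as modular components, and a separate function plays the role of the execution model. Every name here (`Component`, `run_workflow`, `least_squares`) and the sample points are hypothetical; this is not Kepler's API, which is built on Ptolemy II.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Component:
    """A modular step with named upstream inputs and one output (hypothetical)."""
    name: str
    inputs: List[str]              # names of the components feeding this one
    func: Callable[..., object]    # the work done by this step


def run_workflow(components: List[Component]) -> Dict[str, object]:
    """A simple execution model: fire each component once its inputs are ready."""
    results: Dict[str, object] = {}
    pending = list(components)
    while pending:
        ready = [c for c in pending if all(i in results for i in c.inputs)]
        if not ready:
            raise ValueError("workflow has a cycle or a missing input")
        for c in ready:
            results[c.name] = c.func(*(results[i] for i in c.inputs))
            pending.remove(c)
    return results


def least_squares(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Processor: ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Components are declared in arbitrary order; the execution model sorts it out.
workflow = [
    Component("sink", ["processor"], lambda fit: "slope=%.1f intercept=%.1f" % fit),
    Component("processor", ["source"], least_squares),
    Component("source", [], lambda: [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]),
]

outputs = run_workflow(workflow)
print(outputs["sink"])
```

Because the components only declare what they consume, the processor could be swapped for a different analysis step without touching the source, the sink, or the execution function.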

Kepler: dynamic data loading

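Data source from EcoGrid (metadata-driven ingestion)

R processing script:

res <- lm(BARO ~ T_AIR)   # fit BARO as a linear function of T_AIR
res                       # print the fitted model
plot(T_AIR, BARO)         # scatter plot of the two variables
abline(res)               # overlay the fitted regression line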

Kepler supports dynamic data loading:

• Data sources are discovered via metadata queries

• EML metadata allows arbitrary schemas to be loaded into an embedded database

• Data queries can be performed before data flows downstream
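A minimal Python sketch of the dynamic loading idea above: a metadata description drives the creation of a table in an embedded database, and a query runs before any data moves downstream. The `metadata` dictionary is a hypothetical, heavily simplified stand-in for EML (real EML is an XML document), the sample values are made up, and `load_table` is an illustrative helper rather than anything Kepler provides.

```python
import csv
import io
import sqlite3

# Hypothetical, simplified stand-in for EML attribute metadata.
metadata = {
    "entity_name": "met_station",
    "attributes": [
        {"name": "SITE", "type": "TEXT"},
        {"name": "T_AIR", "type": "REAL"},
        {"name": "BARO", "type": "REAL"},
    ],
}

# Made-up sample values standing in for a data file fetched from a repository.
raw_data = "SITE,T_AIR,BARO\nA,10.5,1012.0\nA,12.0,1010.5\nB,8.0,1015.2\n"


def load_table(conn, meta, text):
    """Create a table from the metadata description and load the rows into it."""
    cols = ", ".join('"%s" %s' % (a["name"], a["type"]) for a in meta["attributes"])
    conn.execute('CREATE TABLE "%s" (%s)' % (meta["entity_name"], cols))
    names = [a["name"] for a in meta["attributes"]]
    placeholders = ", ".join("?" for _ in names)
    rows = [[row[n] for n in names] for row in csv.DictReader(io.StringIO(text))]
    conn.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (meta["entity_name"], placeholders), rows
    )


conn = sqlite3.connect(":memory:")   # embedded, in-memory database
load_table(conn, metadata, raw_data)

# Query the data before it flows to downstream steps.
site_a = conn.execute(
    "SELECT T_AIR, BARO FROM met_station WHERE SITE = ? ORDER BY T_AIR", ("A",)
).fetchall()
print(site_a)
```

Since the schema comes from the metadata rather than from the code, the same loader works for any table the metadata can describe.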

• Statistics (R and Matlab, etc.)

• Logic and math functions

• Graphics and visualization

• Geospatial data processing

• Molecular data processing

• Domain specific models

• Web services

• Grid services

• Data sources and sinks

• Ecology data

• Geology data

• Taxonomic data

• And much, much more…

The local library

Fast access to local components that are developed by the Kepler team and ship with Kepler

Import

Publish

• Components contributed by scientists

• ‘Upload to repository’ function in Kepler

• Saved in repository, explicit versioning

• Can be shared with colleagues

• Can be referenced in published papers

• Components can be downloaded and executed

• Downloaded components can be customized

• Promotes replication of analyses and models

The remote repository

Publish and share custom analyses, models, and components with colleagues.

Kepler Component Library

Active work in Kepler
• Real-time data in Kepler
  – Scientist’s view: real-time data accessed like archived data
  – Engineer’s view: drill-down to manage sensor network resources
• Semantics
  – Semantic annotation connects models to knowledge
  – Smart Search (data and components)
  – Smart Data Integration
  – Smart Workflow Linking
• Improved user interfaces for Grid computing

* by “smart” we mean these services are informed by metadata and ontology information

ORB

Kepler and Sensor Networks

Collaborators: NCEAS, SDSC, UC Davis, OSU, CENS (UCLA), OPeNDAP

Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System (CEO:P)

Startup: October 2006

Major foci:
• Sensor network management – standardized services model
• Analysis of data from sensors and archives
• Public web view of sensor data
• OPeNDAP and EcoGrid compatibility





Semantics in scientific workflows

Components and their ports typically have:
• Explicit ‘structural type’
  – e.g., int, float, string, {double}
• Implicit semantic type
  – It is unclear whether the stream of values from a port represents ‘rainfall’ values or ‘body size’ values

[Figure: components A and B linked through ports labeled with structural types (int, string) and semantic types (rainfall, bodysize)]
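The gap between structural and semantic types can be sketched in a few lines of Python. Two ports may both carry `int` values and still be a poor match if one stream means rainfall and the other body size. The `Port` class and `check_link` function are hypothetical, and plain string equality stands in for the ontology-based reasoning a real mediation system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """A component port with a structural type and an optional semantic type."""
    structural: str        # e.g. "int", "float", "string"
    semantic: str = ""     # e.g. "rainfall", "bodysize"; empty if unannotated


def check_link(out_port: Port, in_port: Port) -> str:
    """Classify a link from an output port to an input port."""
    if out_port.structural != in_port.structural:
        return "structural error"
    if not out_port.semantic or not in_port.semantic:
        return "ok (semantics unknown)"
    if out_port.semantic != in_port.semantic:
        return "semantic error"
    return "ok"


# Without annotations, only the structural check is possible.
print(check_link(Port("int"), Port("int")))
# Same meaning, different structure.
print(check_link(Port("string", "bodysize"), Port("int", "bodysize")))
# Same structure, different meaning: only annotations reveal the problem.
print(check_link(Port("int", "rainfall"), Port("int", "bodysize")))
print(check_link(Port("int", "bodysize"), Port("int", "bodysize")))
```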

Semantic Annotation
• Label data with semantic types
• Label inputs and outputs of analytical components with semantic types
• Grounded at level of measurement and data, avoiding some pitfalls of upper ontologies

[Figure: an ontology linking data and workflow components]

Provide extension points for loading specialized domain ontologies

Goal: generically describe the structure of scientific observation and measurement as found in a data set

Observation ontology

Entities represent real-world objects or concepts that can be measured.

Measurements assign values and units to characteristics of observed entities.

Observations are made about particular entities.

Every measurement has a characteristic, which defines the property of the entity being measured.

Every measurement relates a characteristic to a standard or unit.

Observations can provide context for other observations.

Entities, through observations, can be associated with one or more measured characteristics.

A value is typically a cell in a data set.

Extension points

Measurements have precision.
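One way to picture the statements above is as a small set of Python data classes. This is only an illustrative sketch: the class and field names are assumptions chosen to mirror the slide text, not the actual OBOE definitions, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    name: str                          # real-world object or concept, e.g. "Tree"


@dataclass
class Measurement:
    characteristic: str                # property being measured, e.g. "height"
    standard: str                      # unit or standard, e.g. "meter"
    value: float                       # typically one cell in a data set
    precision: Optional[float] = None  # measurements have precision


@dataclass
class Observation:
    entity: Entity                               # what the observation is about
    measurements: List[Measurement] = field(default_factory=list)
    context: Optional["Observation"] = None      # another observation giving context

    def characteristics(self) -> List[str]:
        """Characteristics associated with the entity through this observation."""
        return [m.characteristic for m in self.measurements]


# Invented example: a tree observed within the context of a plot.
plot_obs = Observation(Entity("Plot"), [Measurement("area", "square meter", 100.0)])
tree_obs = Observation(
    Entity("Tree"),
    [
        Measurement("height", "meter", 12.3, precision=0.1),
        Measurement("diameter", "centimeter", 31.0),
    ],
    context=plot_obs,
)
print(tree_obs.entity.name, tree_obs.characteristics(), tree_obs.context.entity.name)
```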

Semantic annotation

[Figure: a data set linked to the observation ontology]

Mapping between data and the ontology via semantic annotation

Smart (Data) Integration: Merge
• Discover data of interest
• … connect to merge actor
• … “compute merge”
  – align attributes via annotations
  – open dialog for user refinement
  – store merge mapping in MOML
• … enjoy! … your merged dataset
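The "compute merge" step above can be sketched in Python, assuming that each data set carries annotations mapping its column names to ontology terms. The data sets, the term names, and the functions `compute_merge` and `apply_merge` are all hypothetical. The sketch ignores units and user refinement, both of which a real merge would need to handle.

```python
from typing import Dict, List

# Two invented data sets whose column names differ.
dataset_a = [{"site": "S1", "precip_mm": 12.0}, {"site": "S2", "precip_mm": 3.5}]
dataset_b = [{"location": "S3", "rain": 7.25}]

# Semantic annotations: column name -> ontology term (terms are made up here).
annotations_a = {"site": "Site", "precip_mm": "Rainfall"}
annotations_b = {"location": "Site", "rain": "Rainfall"}


def compute_merge(ann_a: Dict[str, str], ann_b: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Align columns that share an ontology term; returns term -> source columns."""
    by_term_b = {term: col for col, term in ann_b.items()}
    return {
        term: {"a": col_a, "b": by_term_b[term]}
        for col_a, term in ann_a.items()
        if term in by_term_b
    }


def apply_merge(mapping, rows_a: List[dict], rows_b: List[dict]) -> List[dict]:
    """Rename aligned columns to their ontology term and stack the rows."""
    merged = [{t: row[m["a"]] for t, m in mapping.items()} for row in rows_a]
    merged += [{t: row[m["b"]] for t, m in mapping.items()} for row in rows_b]
    return merged


mapping = compute_merge(annotations_a, annotations_b)  # a user could refine this
merged = apply_merge(mapping, dataset_a, dataset_b)
print(mapping)
print(merged)
```

Keeping the mapping as a separate value mirrors the idea of storing the merge mapping so that the merge can be inspected, refined, and repeated.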

Smart Linking (Workflow Design)

Navigate errors and warnings within the workflow

Search for and insert “adapters” to fix (structural and semantic) errors …

Statically perform semantic and structural type checking

Semantic capabilities
• Answer semantic data queries:
  – Find sites in California where current abundance of molluscs is < 10% of historical abundance in 1900
• Validate semantic correctness of workflows
• Workflow design tools that exploit semantic context
• Data integration that dynamically matches data sources to target schema needed for analysis
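To make the example query above concrete, here is a small Python sketch that runs it against records that have already been harmonized. The hard part that semantic mediation targets (finding the right data sets and mapping their columns and taxa to shared concepts) is assumed to have happened already. The records are invented for illustration, and `AbundanceRecord` and `declined_sites` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AbundanceRecord:
    site: str
    region: str
    taxon: str
    year: int
    abundance: float


# Invented records for illustration only; not real survey data.
records = [
    AbundanceRecord("Site A", "California", "Mollusca", 1900, 200.0),
    AbundanceRecord("Site A", "California", "Mollusca", 2006, 15.0),
    AbundanceRecord("Site B", "California", "Mollusca", 1900, 120.0),
    AbundanceRecord("Site B", "California", "Mollusca", 2006, 90.0),
    AbundanceRecord("Site C", "Oregon", "Mollusca", 1900, 300.0),
    AbundanceRecord("Site C", "Oregon", "Mollusca", 2006, 10.0),
]


def declined_sites(rows, region, taxon, base_year, current_year, fraction) -> List[str]:
    """Sites where current abundance is below `fraction` of the base-year value."""
    def lookup(year):
        return {
            r.site: r.abundance
            for r in rows
            if r.region == region and r.taxon == taxon and r.year == year
        }

    base, current = lookup(base_year), lookup(current_year)
    return sorted(
        s for s in base
        if s in current and base[s] > 0 and current[s] < fraction * base[s]
    )


result = declined_sites(records, "California", "Mollusca", 1900, 2006, 0.10)
print(result)
```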

In summary…
• Typical analytical models are complex and difficult to comprehend and maintain
• Scientific workflows provide
  – An intuitive visual model
  – Structure and efficiency in modeling and analysis
  – Abstractions to help deal with complexity
  – Direct access to data
  – Means to publish and share models
• Kepler is an evolving but effective tool for scientists
  – Looking for ways to transition from research prototype to a production software tool
• Scalable data integration is our main challenge

Prototype NEON Portal – “myNEON”

For more information, see:

http://kepler-project.org/

Acknowledgements

The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.

Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis

The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.

The Andrew W. Mellon Foundation.

Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, ROADNet, EOL, Resurgence

Roadmap ahead:

Monday, Research Design: January 8, 2007
• 1:15 – 4:30 Scientific workflows – Pennington

Tuesday, Data Grids: January 9, 2007
• 8:30 – 10:30 Grid technologies and activity – Servilla and Pennington
• 10:30 – 2:15 EML/Metadata best practices/Morpho – Tyburczy
• 2:30 – 3:00 QA/QC – Vanderbilt
• 3:00 – 4:30 Good Practices on storing data – Vanderbilt/White

Wednesday, Workflows I: Using pre-built workflows in Kepler, January 10, 2007
• 8:30 – 12:00 Introduction to Kepler with demos – Pennington/Romanello
• 1:15 – 2:00 Using Desktop data in Kepler – Higgins
• 2:00 – ??? Bosque, dinner at the Socorro Brew Pub

Thursday, Workflows II: Tools in Kepler, January 11, 2007
• 8:30 – 10:30 Using R in Kepler (demo + exercise) – Higgins
• 10:30 – 12:00 Visualization in Kepler (demo + exercise) – Higgins/Pennington
• 1:15 – 2:15 Biodiversity example in Kepler (demo + exercise) – “/”
• 2:30 – 4:30 Taxonomic resolution in Kepler – Stewart

Friday, Workflows III: Semantic approaches in Kepler, January 12, 2007
• 8:30 – 12:00 Knowledge representation and semantic mediation – Bowers
• 1:15 – 2:00 Recapping the week
• 2:00 – 3:00 Preparing to use ecoinformatics in the classroom – Katz
• 3:00 – 4:00 Roundtable Discussion – Katz
