
End-to-End eScience

May 10, 2015

Technology

Bill Howe

Invited talk at Microsoft Research, Spring 2009
Transcript
Page 1: End-to-End eScience

End-to-End eScience
Integrating Query, Workflow, Visualization, and Mashups at an Ocean Observatory

Bill Howe, University of Washington

Harrison Green-Fishback, PSU
David Maier, PSU

Erik Anderson, Utah
Emanuele Santos, Utah
Juliana Freire, Utah
Carlos Scheidegger, Utah
Claudio Silva, Utah

Antonio Baptista, OHSU

Peter Lawson, OSU
Renee Bellinger, OSU

http://dev.pacificfishtrax.org/

Page 2: End-to-End eScience

04/11/23 Bill Howe, eScience Institute 2

Outline

eScience
Brief Demo
A Domain-Specific Query Algebra
Mashups

Page 3: End-to-End eScience

Theory

Experiment

Observation

slide: Ed Lazowska

Page 4: End-to-End eScience

Theory

Experiment

Observation

slide: Ed Lazowska

Page 5: End-to-End eScience

Theory

Experiment

Observation

slide: Ed Lazowska

Page 6: End-to-End eScience

Theory

Experiment

Observation

Computational Science

slide: Ed Lazowska

Page 7: End-to-End eScience

Theory

Experiment

Observation

Computational Science

eScience

Page 8: End-to-End eScience

All Science is becoming eScience

Old model: “Query the world” (data acquisition coupled to a specific hypothesis)
New model: “Download the world” (data acquired en masse, independent of hypotheses)
But: acquisition now outpaces analysis

Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Medicine: ubiquitous digital records, MRI, ultrasound
Oceanography: high-resolution models, cheap sensors, satellites
Biology: lab automation, high-throughput sequencing

“Increase Data Collection Exponentially in Less Time, with FlowCAM”

Empirical X Analytical X Computational X X-informatics

Page 9: End-to-End eScience

The long tail is getting fatter:

notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)

[Figure: The Long Tail. y-axis: data inventory; x-axis: ordinal position. Labeled points: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), Ocean Modelers, Seismologists, Microbiologists, <Spreadsheet users>]

Researchers with growing data management challenges but limited resources for cyberinfrastructure

• No dedicated IT staff

• Overreliance on simple tools (e.g., spreadsheets)

“The future is already here. It’s just not very evenly distributed.” -- William Gibson

Page 10: End-to-End eScience

eScience Institute at UW

Mission
Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon these techniques and technologies

Strategy
Increase the sharing of expertise and facilities
Bootstrap a cadre of Research Scientists
Add faculty in key fields
Make the entire University more effective

Launched July 1 with $1 million in permanent funding from the Washington State Legislature
Sought, and need, $2 million

Page 11: End-to-End eScience

Web Services

Facets of Database Research

Query Languages

Storage Management

Visualization; Workflow

Data Integration

Knowledge Extraction, Crawlers

Access Methods

Data Mining, Parallel Programming Models, Provenance

complexity-hiding interfaces

My research: customize and optimize for science

Page 12: End-to-End eScience

The eScience Elephant

eScience

Cloud/Cluster

Workflow

Databases

Visualization Provenance

“flexibility; web services; integration”

“query processing; data independence; algebraic optimization; needles in haystacks”

“Exploratory science; mapping quantitative data to intuition”

“Reproducibility; forensics; sharing/reuse”

“Massive data parallelism”

Mashups

“Rapid Prototyping; Simplified web programming”

Page 13: End-to-End eScience

Some eScience Research

Query Algebra for new Data Type

Scientific Workflow Systems

Science Mashups

“Dataspace” systems

[Howe, Freire, Silva, et al. 2008]

[Howe, Green-Fishback, Maier, 2009]

[Howe, Maier, Rayner, Rucker 2008]

[Howe, Maier. 2004, 2005, 2006]

this talk

Page 14: End-to-End eScience

Outline

eScience
Brief Demo
A Domain-Specific Query Algebra
Science Mashups

Page 15: End-to-End eScience

VisTrails for Computation

Page 16: End-to-End eScience

Spatial Patterns in Fisheries: new techniques, new opportunities for ecosystem-based management

Peter Lawson (1), Lorenzo Cianelli (2), Bobby Ireland (2)

Page 17: End-to-End eScience

Enabling Scientific Discourse between Fishermen and Fisheries Managers

Page 18: End-to-End eScience

Page 19: End-to-End eScience

Page 20: End-to-End eScience

Page 21: End-to-End eScience

VisTrails for Collaboration

Bill Howe @ CMOP computes salt flux using GridFields

Erik Anderson @ Utah adds vector streamlines and adjusts opacity

Bill Howe @ CMOP adds an isosurface of salinity

Peter Lawson adds discussion of the scientific interpretation

Page 22: End-to-End eScience

Outline

eScience
Brief Demo
A Domain-Specific Query Algebra
Mashups

Page 23: End-to-End eScience

CMOP

Page 24: End-to-End eScience

Columbia River Estuary

red = high salinity (~34 psu)

blue = fresh water (~0 psu)

Page 25: End-to-End eScience

Accessing Model Results

CMOP ocean circulation models run in forecast or hindcast mode
Models run serially in ~1/5 real time
On MPICH2, about 10x speedup before overhead dominates
Forecasts kept for 10 days, hindcasts kept indefinitely (40TB + 25TB/year)

Access via a GridFields Web Service
GFServer optimizes and evaluates GF expressions and returns the result

Page 26: End-to-End eScience

Unstructured Grids

“unstructured grids” model complex domains at multiple scales simultaneously

red = high salinity (~34 psu)

blue = fresh water (~0 psu)

Columbia River Estuary

...but complicate processing

Page 27: End-to-End eScience

“Structured” Grids

Simple: isomorphic to multidimensional arrays

But: “structured grids” do a poor job of modeling complex features and complicate multi-scale analysis.

Coastlines are not rectilinear

[Figure: a rectilinear grid with some cells marked “x”]

1) Missing values = wasted effort

Higher resolution = wasted effort in areas of low dynamism

2) Data associated with cells at multiple dimensions

Page 28: End-to-End eScience

Structured grids are easy

The data model (Cartesian products of coordinate variables)

immediately implies a representation (multidimensional arrays),

an API (reading and writing subslabs),

and an efficient implementation (address calculation using array “shape”)

Page 29: End-to-End eScience

Structured grid example

f( i , j ), x( i ), y( j )

for i in [4:6]:
    for j in [1:4]:
        addr = &f + j*|x| + i

= f[4:6, 1:4]

NetCDF, MATLAB, RasDaMan, SciDB (soon), many more
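
The address arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slide: the buffer is flat with x (index i) varying fastest, the grid shape (8 by 5) is made up, and the slice bounds are read as half-open Python ranges.

```python
def subslab(f, nx, i_range, j_range):
    """Read f[i0:i1, j0:j1] from a flat buffer in which index i varies fastest."""
    out = []
    for i in range(*i_range):
        row = []
        for j in range(*j_range):
            addr = j * nx + i          # offset from the start of f, as in j*|x| + i
            row.append(f[addr])
        out.append(row)
    return out


nx, ny = 8, 5                          # hypothetical shape: |x| = 8, |y| = 5
# Each stored value encodes its own (j, i) position so the result is easy to check.
f = [10 * j + i for j in range(ny) for i in range(nx)]

slab = subslab(f, nx, (4, 6), (1, 4))
print(slab)
```

No lookup structure is needed: the array shape alone determines every address, which is the property the slide credits for the efficient implementation.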

Page 30: End-to-End eScience

Unstructured Grids

(E, I) = [Figure: a triangle A with nodes 2, 3, 4 and edges x, y, z]

E0 = {2, 3, 4}

E1 = {x, y, z}

E2 = {A}

I = { (z,2), (z,4), (A,z),
      (x,2), (x,3), (A,x),
      (A,y), (y,4), (y,3) }

…plus the transitive closure
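
A minimal sketch of this representation, assuming the pairs on the slide read as (cell, incident lower-dimensional cell). The function name and the set-of-tuples encoding are illustrative choices, not the GridFields implementation.

```python
# Cells grouped by dimension, as on the slide: nodes, edges, one face.
E = {0: {"2", "3", "4"}, 1: {"x", "y", "z"}, 2: {"A"}}

# Direct incidence pairs listed on the slide.
I = {("z", "2"), ("z", "4"), ("A", "z"),
     ("x", "2"), ("x", "3"), ("A", "x"),
     ("A", "y"), ("y", "4"), ("y", "3")}


def transitive_closure(pairs):
    """Repeatedly chain (a, b) and (b, d) into (a, d) until nothing new appears."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new


closure = transitive_closure(I)
nodes_of_A = sorted(b for (a, b) in closure if a == "A" and b in E[0])
print(nodes_of_A)
```

The closure adds the face-to-node pairs, so the nodes of A can be read off directly without walking through its edges.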

Page 31: End-to-End eScience

Subsetting

Full grid: Eastern Pacific

Subset: mouth of Columbia River

color: bathymetry

Washington

Oregon

California

Page 32: End-to-End eScience

Correctness properties preserved

Grid is well-supported

(no ragged edges)

Page 33: End-to-End eScience

Subset semantics

[Figure: grids whose cells are labeled 0 or 1; panel titles: Input Simple Drop “Exact”]

Cut everything labeled “0”. What should be kept?

Page 34: End-to-End eScience

What about Visualization Libs?

Different C++ classes, each dependent on data characteristics.
Changes to data characteristics require changes to the program
Logical equivalences obscured
No data independence

We want:

in VTK: vtkExtractGeometry, vtkThreshold, vtkExtractGrid, vtkExtractVOI, vtkThresholdPoints

Page 35: End-to-End eScience

GridField Data Model

A GridField with two attributes bound to the 2-cells and four attributes bound to the 0-cells

Attributes bound to the 0-cells:

x     y     salt  temp
13.8  10.6  29.4  12.1
13.9   9.4  29.8  12.5
14.3   9.0  28.0  12.0
13.4   9.0  30.1  13.2

Attributes bound to the 2-cells:

flux  area
11.5  3.3
13.9  5.5
13.1  4.5

Page 36: End-to-End eScience

GridField Operations

Lifted set operations: Union, Intersection, Cross Product

Scan/Bind: Read a grid/attribute

Restrict: Remove cells that do not satisfy a predicate

Accrete: Grow a grid by adding neighbors of cells

Regrid: Map the data of one grid onto another
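
To make the Restrict operator concrete, here is a toy sketch. Everything in it is an assumption for illustration: the `GridField` class, the `restrict_nodes` function, the node-to-face topology, and the rule that a face is dropped as soon as it loses a node (one possible answer to the subset-semantics question raised earlier). The node values are borrowed from the data model slide; none of this is the GridFields library API.

```python
from dataclasses import dataclass, field


@dataclass
class GridField:
    nodes: dict                                     # node id -> attributes bound to 0-cells
    faces: dict                                     # face id -> tuple of node ids (2-cells)
    face_data: dict = field(default_factory=dict)   # face id -> attributes bound to 2-cells


def restrict_nodes(gf, predicate):
    """Keep 0-cells that satisfy the predicate; drop any face that loses a node."""
    kept = {n: a for n, a in gf.nodes.items() if predicate(a)}
    faces = {f: ns for f, ns in gf.faces.items() if all(n in kept for n in ns)}
    data = {f: gf.face_data[f] for f in faces if f in gf.face_data}
    return GridField(kept, faces, data)


gf = GridField(
    nodes={1: {"x": 13.8, "salt": 29.4}, 2: {"x": 13.9, "salt": 29.8},
           3: {"x": 14.3, "salt": 28.0}, 4: {"x": 13.4, "salt": 30.1}},
    faces={"A": (1, 2, 3), "B": (1, 2, 4)},          # hypothetical topology
    face_data={"A": {"area": 3.3}, "B": {"area": 5.5}},
)

salty = restrict_nodes(gf, lambda a: a["salt"] > 29.0)
print(sorted(salty.nodes), list(salty.faces))
```

Dropping face A along with node 3 keeps the result free of ragged edges, at the cost of discarding nodes 1 and 2 from that face's point of view; other policies are possible.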

Page 37: End-to-End eScience

Usage Example (1)

H = Scan(context, "H")

rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)

(Restrict arguments: predicate, dimension, input GridField)

[Figure: H and rH; color: bathymetry]

Page 38: End-to-End eScience

Usage Example (2)

H = Scan(context, "H")

rH = Restrict("h<500", 0, H)

[Figure: H and rH; color: bathymetry]

Page 39: End-to-End eScience

Longer Example

H : (x,y,b)

V : (z)

H ⊗ V

r(z>b): r(H ⊗ V)

b(s): b(r(H ⊗ V))

r(region): r(b(r(H ⊗ V)))

render

Page 40: End-to-End eScience

Optimization

Original plan: H(x,y,b) ⊗ V(z), then r(z>b), b(s), r(region)

Rewritten plan: r(x,y) on H(x,y,b) and r(z) on V(z), then ⊗, r(z>b), b(s)

*Howe, Maier, Algebraic Manipulation of Scientific Datasets. VLDB Journal, 14:4, 2005
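
The rewrite on this slide looks like the familiar "push the restriction below the cross product" rule. The sketch below checks that equivalence on toy data, reading r as Restrict and treating cells as plain dictionaries with no topology; the grids, predicates, and function names are all made up for the example.

```python
from itertools import product

H = [{"x": x, "y": y} for x in range(4) for y in range(3)]   # toy horizontal cells
V = [{"z": z} for z in range(5)]                              # toy vertical levels


def cross(a, b):
    """Pair every cell of a with every cell of b, merging their attributes."""
    return [{**p, **q} for p, q in product(a, b)]


def restrict(pred, cells):
    """Keep only the cells that satisfy the predicate."""
    return [c for c in cells if pred(c)]


def in_xy(c):
    return 1 <= c["x"] <= 2 and c["y"] >= 1      # hypothetical horizontal region


def in_z(c):
    return c["z"] < 3                            # hypothetical vertical range


late = restrict(lambda c: in_xy(c) and in_z(c), cross(H, V))    # restrict after the product
early = cross(restrict(in_xy, H), restrict(in_z, V))            # restrict each input first

print(len(cross(H, V)), len(early), late == early)
```

Both plans give the same cells, but the early plan builds a 12-cell product instead of a 60-cell one. A predicate such as z>b mentions attributes from both inputs, so it cannot be split this way and stays above the product.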

Page 41: End-to-End eScience

Transect (Vertical Slice)

P

Page 42: End-to-End eScience

Transect: Bad Plan

[Plan diagram, operators as extracted: H(x,y,b), V(z), r(z>b), b(s), regrid, P, P ⊗ V]

1) Construct full-size 3D grid
2) Construct 2D transect grid
3) Spatial Join 1) with 2)

Page 43: End-to-End eScience

Transect: Optimized Plan

[Plan diagram, operators as extracted: P ⊗ V, V(z), P, H(x,y,b), regrid, b(s), regrid]

1) Find 2D cells containing points
2) Create “stacks” of 2D cells carrying data
3) Create 2D transect grid
4) Spatial Join 2) with 3)
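
Step 1 of the optimized plan is a point-location problem. A brute-force sketch with a standard same-side triangle test follows; the two triangles, the transect points, and the function names are hypothetical, and the slides do not say how GridFields itself locates points.

```python
def side(p, a, b):
    """Signed area test: which side of the line through a and b the point p lies on."""
    return (p[0] - b[0]) * (a[1] - b[1]) - (a[0] - b[0]) * (p[1] - b[1])


def contains(tri, p):
    """True if p is inside or on the boundary of triangle tri."""
    a, b, c = tri
    d1, d2, d3 = side(p, a, b), side(p, b, c), side(p, c, a)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def locate(points, cells):
    """Map each point index to the first 2D cell that contains it (linear scan)."""
    found = {}
    for k, p in enumerate(points):
        for cell_id, tri in cells.items():
            if contains(tri, p):
                found[k] = cell_id
                break
    return found


cells = {"T0": ((0, 0), (1, 0), (0, 1)), "T1": ((1, 0), (1, 1), (0, 1))}
P = [(0.2, 0.2), (0.8, 0.8)]            # hypothetical transect points
hits = locate(P, cells)
print(hits)
```

Only the cells that receive a hit need to be stacked vertically in step 2, which is why this plan avoids building the full 3D grid.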

Page 44: End-to-End eScience

1) Find cells containing points in P

Page 45: End-to-End eScience

1) Find cells containing points in P

2) Construct “stacks” of cells

4) Join 2) with 3)

Page 46: End-to-End eScience

Transect: Results

[Bar chart: running time in secs (axis from 0 to 45) for vtk(3D), interpolate, simple, interp_o, simple_o]

800 MB dataset

simple = nearest neighbor interpolation

*_o = optimized by restricting to the region of interest

Page 47: End-to-End eScience

Ongoing work

NSF Cluster Exploratory Award: Where the Ocean Meets the Cloud: Ad Hoc Longitudinal Analysis of Massive Mesh Data

Partnership between NSF, IBM, Google
Data-intensive computing: massive queries, not massive simulations

To “Cloud-Enable” GridFields and VisTrails
Goal: 10+-year climatologies at interactive speeds
Parallel implementations of GridField operators via Hadoop (and Dryad!)
Provenance, repeatability, visualization via VisTrails
Connect rich desktop experience

Co-PIs from University of Utah: Claudio Silva and Juliana Freire

Page 48: End-to-End eScience

Outline

eScience
Brief Demo
A Domain-Specific Query Algebra
Scientific Mashups

Page 49: End-to-End eScience

Why Mashups?

Jim Gray: # of datasets scales as N^2
Each pairwise comparison generates a new dataset

Corollary: # of apps scales as N^2
Every pairwise comparison motivates a new mashup

To keep up, we need to entrain new programmers, make existing programmers more productive, or both

Page 50: End-to-End eScience

Satellite Images + Crime Incidence Reports

Page 51: End-to-End eScience

Twitter Feed + Flickr Stream

Page 52: End-to-End eScience

Mashup Frameworks

A bottom-up approach

Start with a GPL, add
    Visual programming
    Interactive type checking

Exploit a corpus of previous examples
    bootstrapping a mashup
    mashup “autocomplete”
    emit warnings

Page 53: End-to-End eScience

Page 54: End-to-End eScience

Page 55: End-to-End eScience

Page 56: End-to-End eScience

Scientific Mashup Characteristics

Turn over more data per operation
Involve subtle visualizations
Must serve a diverse audience

Page 57: End-to-End eScience

A Model for Scientific Mashups

The “Data Product” is the currency of scientific communication with the public

Scientists are already adept at crafting them (consider PowerPoint slides and figures)

We take a top-down approach: take a static data product ensemble, endow it with interactivity, publish it online, and allow others to repurpose it at runtime

Page 58: End-to-End eScience

Data Product Ensemble

Page 59: End-to-End eScience

Mashup

Page 60: End-to-End eScience

CTD: Conductivity, Temperature, Depth

Page 61: End-to-End eScience

Sampling

Page 62: End-to-End eScience

Event Detection: Red Water

Page 63: End-to-End eScience

CTD Cast

Page 64: End-to-End eScience

Flowthrough

Page 65: End-to-End eScience

Mashup

Page 66: End-to-End eScience

Mashup

Page 67: End-to-End eScience

Key Concepts

A mashup is a synchronized ensemble of data products

A data product is a mashable that has been adapted for a particular purpose

A mashable is an arbitrarily-complex computation that returns a relation

An adaptor displays the relation to the user and returns a subset

All adapted mashables accept input

Hence, user controls are modeled as adapted mashables just like “visual” data products
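
These definitions can be sketched as plain functions. The names (`adapt`, `all_casts`, `profile`), the shared `env` dictionary, and the sample rows are all assumptions made for this example; the slides do not show the actual framework interfaces.

```python
def adapt(mashable, adaptor):
    """A data product: a mashable (env -> relation) composed with an adaptor (relation -> subset)."""
    return lambda env: adaptor(mashable(env))


def all_casts(env):
    """Mashable behind a user control: lists the available casts (made-up rows)."""
    return [{"cast": 1}, {"cast": 2}]


def profile(env):
    """Mashable behind a plot: salinity samples for the cast currently selected in env."""
    rows = [{"cast": 1, "depth": 5, "salinity": 12.0},
            {"cast": 1, "depth": 10, "salinity": 25.5},
            {"cast": 2, "depth": 5, "salinity": 8.1}]
    return [r for r in rows if r["cast"] == env.get("cast")]


# The adaptors below stand in for real widgets: the first simulates a user
# picking the first cast, the second simulates a plot that returns all it shows.
cast_picker = adapt(all_casts, lambda rel: rel[:1])
profile_plot = adapt(profile, lambda rel: rel)

env = {}
env.update(cast_picker(env)[0])    # the control's output becomes shared state
shown = profile_plot(env)
print(env, len(shown))
```

Because the control and the plot have the same shape (environment in, subset of a relation out), synchronizing an ensemble reduces to passing one product's output into the environment of the next.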

Page 68: End-to-End eScience

Adapted Mashables

Page 69: End-to-End eScience

Data Flow Graph

Page 70: End-to-End eScience

Inferring Data Flow

provides: {ABC}

requires: {AB}

Page 71: End-to-End eScience

Inferring Data Flow

provides: {AC}

requires: {AB}

provides: {B}

Page 72: End-to-End eScience

Inferring Data Flow

provides: {AC}

requires: {AB}

underspecified mashup

Solution:
1) use defaults
2) root environment
3) hand-specified parameter

Page 73: End-to-End eScience

Inferring Data Flow

provides: {AB}

requires: {AB}

provides: {B}

overspecified mashup

Solution: Break ties:
1) Prefer nodes on longer paths
2) Use layout information
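
The provides/requires matching in the last four slides can be sketched as follows. The function name, the node dictionary layout, and the choice to report ties rather than resolve them are assumptions for illustration; the tie-breaking rules and defaults named on the slides are not implemented here.

```python
def infer_edges(nodes):
    """nodes: name -> {"provides": set, "requires": set}.

    Returns (edges, missing, ties):
      edges   - (provider, consumer, attribute) where exactly one provider exists
      missing - (consumer, attribute) with no provider: an underspecified mashup
      ties    - (consumer, attribute, providers) with several: an overspecified mashup
    """
    edges, missing, ties = [], [], []
    for consumer, spec in nodes.items():
        for attr in sorted(spec["requires"]):
            providers = sorted(name for name, other in nodes.items()
                               if name != consumer and attr in other["provides"])
            if not providers:
                missing.append((consumer, attr))
            elif len(providers) > 1:
                ties.append((consumer, attr, providers))
            else:
                edges.append((providers[0], consumer, attr))
    return edges, missing, ties


# Overspecified case: two nodes provide B.
over = {
    "p1": {"provides": {"A", "B"}, "requires": set()},
    "p2": {"provides": {"B"}, "requires": set()},
    "c": {"provides": set(), "requires": {"A", "B"}},
}
# Underspecified case: nothing provides B.
under = {
    "p": {"provides": {"A", "C"}, "requires": set()},
    "c": {"provides": set(), "requires": {"A", "B"}},
}

print(infer_edges(over))
print(infer_edges(under))
```

A fuller version would feed `ties` into the path-length and layout heuristics and fill `missing` from defaults, the root environment, or a hand-specified parameter.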

Page 74: End-to-End eScience

Audience-Tailored Mashups

K12 students

Experts

Page 75: End-to-End eScience

Conclusions and Future Directions

We want to augment scientists, not programmers
Requires limiting expressiveness -- not yet clear where to draw the line

More work on semi-automatically tailoring a mashup at runtime

Automatically insert “context products”
    See salinity, add a salinity colorbar
    See a time, add a tide chart
    See a location, add a map

Re-skin data products
“Dashboard-style” vs. “Wizard-style” apps

Page 76: End-to-End eScience

http://escience.washington.edu

(retooled website coming soon)

Page 77: End-to-End eScience

Comparison

System | Data Model | Operations | Services
GPL | * | * | Typing, maybe
Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, data parallelism
MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance
MS Dryad | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
MPI | Arrays/Matrices | 70+ ops | data parallelism, full control

Page 78: End-to-End eScience

Mashups serve a diverse audience

student

public

scientist

Page 79: End-to-End eScience

Computational Science

Theory Experiment Observation Simulation (in silico) Analysis (in ferro)

Data acquisition is hypothesis-driven

Data acquisition is technology-driven

Page 80: End-to-End eScience

Explore architectures blending techniques from

• mashups (rapid prototyping),

• visualization (interactivity, richness),

• workflow (data integration, provenance),

• databases (optimization, data independence)

to answer science questions at an Ocean Observatory

Motivation

Page 81: End-to-End eScience

Source: MayaVi website

PLOT3D, GDAL, ShapeFile, OGC, .obj, .vtk, netCDF, HDF5, FITS, others

Optimized for “throwing datasets” and interactivity

Declarative query, interoperability, repeatability generally lacking

Source: http://pogl.wordpress.com/2007/06/

Visualization

Page 82: End-to-End eScience

Workflow

Emphasis on integration, web services, flexibility

Unconstrained boxes-and-arrows
Any operation on any data type

Very expressive, but limited opportunities for static reasoning

Type safety
Task parallelism
Cache safety
Optimization via rewrite rules
Result size / execution time estimation
Transparent data parallelism
Platform portability

To move the earth, you need somewhere to stand

Page 83: End-to-End eScience

Databases

Pre-relational DBMS brittleness: if your data changed, your application broke.

Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.

physical data independence

logical data independence

files and pointers

relations

views

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independent of physical data representation

Page 84: End-to-End eScience

Heterogeneity also drives costs

[Figure: projects plotted by # of bytes (y-axis) against # of data types (x-axis)]

CERN (~15PB/year; particle interactions)

LSST (~100PB; images, objects)

PanSTARRS (~40PB; images, objects, trajectories)

OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)

SDSS (~100TB; images, objects)

Biologists (~10TB; sequences, alignments, annotations, BLAST hits, metadata, phylogeny trees)

Page 85: End-to-End eScience

The eScience Elephant

“Like a snake”

“Like a hand fan”

“Like a wall”

“Like tree trunk”

“Like a spear”

“Like a rope”

Page 86: End-to-End eScience
