C van Ingen, D Agarwal, M Goode, J Gupchup, J Hunt, R Leonardson, M Rodriguez, N Li. Berkeley Water Center, Johns Hopkins University, Lawrence Berkeley Laboratory.

Unifying Diverse Watershed Data to Enable Analysis

C van Ingen, D Agarwal, M Goode, J Gupchup, J Hunt, R Leonardson, M Rodriguez, N Li

Berkeley Water Center
Johns Hopkins University
Lawrence Berkeley Laboratory
Microsoft Research
University of California, Berkeley

Introduction

Over the past year, we’ve been exploring how to build and use a digital watershed in the cloud.

• Our focus is enabling end-user analysis
– Assumes data access will get better (thanks to CUAHSI and others)
• Bottom-up approach: start with the database and build to the tool
• Just-in-time approach: build tools to solve science needs
• In the cloud to free the scientist from any operational issues associated with the technology we use

http://www.berkeley.edu/RussianRiver

Hydrologic Data Analysis Pipeline

Figure labels: Distributed Data Sets; Data Gateway; Data Archive; Data Transformations; Analysis Gateway; Models, Analysis Tools; Knowledge discovery, Hypothesis testing, Water Synthesis; Dissemination.

Challenge is to Connect Data, Resources, and People

Data Flow Pipeline

Agency web site, streaming sensor data, or other source

CSV Files

BWC SQL Server Database

BWC Data Cube

Reports, Excel Pivot Table, MatLab, ArcGIS
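The stages above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the BWC SQL Server database, a plain GROUP BY query stands in for the data cube, and the CSV layout, column names, and sample values are hypothetical rather than taken from the BWC system.

```python
import csv
import io
import sqlite3

# Hypothetical CSV as it might arrive from an agency web site (made-up values).
CSV_TEXT = """site,time,variable,value
Hopland,2007-01-01,discharge,120.5
Hopland,2007-01-02,discharge,118.0
Guerneville,2007-01-01,discharge,950.0
"""


def load_csv(conn, text):
    """Stage CSV rows into one table (stand-in for the SQL database)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data "
        "(site TEXT, time TEXT, variable TEXT, value REAL)"
    )
    rows = [
        (r["site"], r["time"], r["variable"], float(r["value"]))
        for r in csv.DictReader(io.StringIO(text))
    ]
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    return len(rows)


def summarize(conn):
    """Aggregate per site, roughly what a cube would pre-compute."""
    cur = conn.execute(
        "SELECT site, COUNT(*), MAX(value) FROM data "
        "GROUP BY site ORDER BY site"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, CSV_TEXT)
summary = summarize(conn)  # this is what a report or pivot table would read
```

Keeping the CSV stage separate from the database load means a failed ingest can be re-run from the files without going back to the original source.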

Key Schema Abstractions

• Data, ancillary data, and metadata
– Analyses often require combining time series data with fixed, or nearly fixed, ancillary data such as river mile, vegetative cover, or sediment grain size
– Ancillary data used as fixed property, time series, or event time window
– Metadata describing algorithms, measurement techniques, etc.
– Normalized table structure simplifies adding variables and cube building
• Versioning and folder-like collections
– Accommodate algorithm changes, temporal granularity, and derived quantities
– Track derivations through processing pipeline
– Define and track analysis “working set”
• Namespace translation
– Data assembly traverses different repositories, each with its own (useful?) namespace
– Some repositories encode metadata in the variable namespace (e.g. USGS turbidity)
• Any access layer shares the same abstractions
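Namespace translation can be pictured as a lookup from a repository-specific variable name to a common variable name plus the metadata that the source name was encoding. The sketch below is a minimal illustration, not the BWC implementation; the repository labels, source variable names, and metadata fields are all hypothetical.

```python
# Hypothetical translation table: (repository, source name) maps to a common
# variable name plus the metadata that the source name encodes.
TRANSLATION = {
    ("repoA", "turbidity_fnu_infrared"): (
        "turbidity", {"units": "FNU", "method": "infrared"}),
    ("repoA", "turbidity_ntu_white"): (
        "turbidity", {"units": "NTU", "method": "white light"}),
    ("repoB", "Q_daily_mean"): (
        "discharge", {"aggregate": "daily mean"}),
}


def translate(repository, source_name):
    """Return (common_name, metadata) for a repository-specific name."""
    try:
        common, metadata = TRANSLATION[(repository, source_name)]
    except KeyError:
        raise KeyError(
            f"no translation for {source_name!r} in {repository!r}"
        ) from None
    # Copy so callers cannot mutate the shared table.
    return common, dict(metadata)


name, meta = translate("repoA", "turbidity_ntu_white")
```

Two source names that differ only in measurement technique land on the same common variable, with the technique preserved as metadata instead of being buried in the name.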

Database Schema Subset

data: FK1 sitesetid, FK2 siteid, FK3 datumid, value, FK4 time, FK5 exdatumid, FK6 repeat, FK7 offsetid, FK8 qualityid
dataset: PK datasetid, howmade, createTime, lastAppendTime, lastModifyTime, appendOnlyTime, fixTime, deleteTime, FK2 creatorid, name, description
offset: PK offsetid, value, units
site: PK siteid, ..., name, ...
time: PK time
siteset: PK sitesetid, FK1 siteid, createTime, lastAppendTime, lastModifyTime, appendOnlyTime, fixTime, deleteTime, ingestChecksum, FK2 parentSitesetid, FK3 creatorid, name, description, howmade, path
dataset_siteset: FK1 datasetid, FK2 sitesetid
repeat: PK repeat
datumtype: PK datumid, shortname, units, name, offsetunits
exdatumtype: PK exdatumid, debris
quality: PK qualityid, qualityflags, gapflags
investigator: PK investigatorid, ..., name, ...

• Star schema for data similar to CUAHSI ODM
• Ancillary data shredded like data
– Active over a time range
– Numeric or text
– Flows to the data cube as site attribute or time series data
• Two level versioning maps to data sourcing
– Bound into a dataset version with spline filter
– Only the dataset flows to the datacube
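A small sketch may help show how the two versioning levels interact with the star schema. It uses SQLite with a reduced set of the tables and columns listed above; the column types, the sample rows, and the reading that a siteset is one ingest of a site's data that gets bound into a dataset through dataset_siteset are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (siteid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE datumtype (datumid INTEGER PRIMARY KEY, shortname TEXT, units TEXT);
CREATE TABLE siteset (sitesetid INTEGER PRIMARY KEY,
                      siteid INTEGER REFERENCES site(siteid), name TEXT);
CREATE TABLE dataset (datasetid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dataset_siteset (datasetid INTEGER REFERENCES dataset(datasetid),
                              sitesetid INTEGER REFERENCES siteset(sitesetid));
CREATE TABLE data (sitesetid INTEGER REFERENCES siteset(sitesetid),
                   siteid INTEGER REFERENCES site(siteid),
                   datumid INTEGER REFERENCES datumtype(datumid),
                   time TEXT, value REAL);

-- Hypothetical rows: one site, ingested twice (two sitesets).
INSERT INTO site VALUES (1, 'Hopland');
INSERT INTO datumtype VALUES (1, 'discharge', 'cfs');
INSERT INTO siteset VALUES (10, 1, 'first ingest'), (11, 1, 'corrected ingest');
INSERT INTO dataset VALUES (100, 'release');
INSERT INTO dataset_siteset VALUES (100, 11);  -- only the corrected ingest is bound
INSERT INTO data VALUES (10, 1, 1, '2007-01-01', 120.0),
                        (11, 1, 1, '2007-01-01', 121.5),
                        (11, 1, 1, '2007-01-02', 118.0);
""")

# Only rows whose siteset is bound into the chosen dataset reach the cube.
rows = conn.execute("""
    SELECT s.name, dt.shortname, d.time, d.value
    FROM data AS d
    JOIN dataset_siteset AS ds ON ds.sitesetid = d.sitesetid
    JOIN site AS s ON s.siteid = d.siteid
    JOIN datumtype AS dt ON dt.datumid = d.datumid
    WHERE ds.datasetid = ?
    ORDER BY d.time
""", (100,)).fetchall()
```

The superseded ingest stays in the database for provenance, but the join through dataset_siteset keeps it out of anything built from the dataset.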

Figure: count of data points by site and year. Filters: Dataset = USGS Surface Water Data Jan 2007; Datumtype = Mean Discharge; Quality = All. Axes: Count (0 to 30000), Site, Year.

Datacube Basics

• A data cube is a database specifically for data mining (OLAP)
– Organizes data along dimensions such as time, site, or variable type
– Easy to group, filter, and aggregate data in a variety of ways
– Simple aggregations such as sum, min, or max can be pre-computed for speed
– Additional calculations such as median can be computed dynamically
• SQL Server Analysis Services (SSAS) provides the OLAP engine
• SQL Server Business Intelligence Development Studio is used to define and tune the cube
• Excel and other client tools enable simple browsing
– Minimizes the total software burden of writing queries (SQL or MDX)

Discharge and Turbidity variability

Daily Discharge Availability by Site by Year

Each bar is a count of data points, color coded by reporting year. The higher the bar, the more reported data.
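The split between pre-computed and dynamic aggregations can be illustrated with a toy cube over two dimensions, site and year. This is a sketch in plain Python, not SSAS; the class name TinyCube, its methods, and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import median


class TinyCube:
    """Toy cube over (site, year) cells; not a real OLAP engine."""

    def __init__(self, records):
        self._values = defaultdict(list)
        for site, year, value in records:
            self._values[(site, year)].append(value)
        # Simple aggregations are computed once, when the cube is built.
        self.count = {k: len(v) for k, v in self._values.items()}
        self.maximum = {k: max(v) for k, v in self._values.items()}

    def median(self, site, year):
        """Median is computed on demand from the underlying values."""
        return median(self._values[(site, year)])

    def count_by_site(self, site):
        """Roll the pre-computed counts up along the year dimension."""
        return sum(n for (s, _year), n in self.count.items() if s == site)


# Hypothetical records: (site, year, value).
records = [
    ("Hopland", 2006, 10.0),
    ("Hopland", 2006, 30.0),
    ("Hopland", 2006, 20.0),
    ("Hopland", 2007, 5.0),
    ("Ukiah", 2007, 7.0),
]
cube = TinyCube(records)
```

The per-cell counts are the quantity plotted in an availability chart: one bar segment per (site, year) cell.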

Learnings and Observations

• Simplifying data discovery speeds analysis
– Discovery is a necessary precursor step to analysis
– What data, where, when? At what quality?
• Versioning is critical
– Site-variable most naturally maps to analysis patterns
– Dataset too coarse; individual measurement too fine
– Ancillary data must be versioned as well
• Matching the scientific notion of time to commercial tools can be problematic
– Second month of water year has 30 days in US
– MODIS week
– Granularity widely varying
• Plan on decode stage for name, location, time, quality
• Don’t forget historic (non-digital) data
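The water year remark is easy to make concrete. The US water year runs from October 1 through September 30 and is labeled by the calendar year in which it ends, so its second month is November. The helper names below are hypothetical; the sketch only shows the kind of translation a time dimension needs when a commercial tool assumes calendar years.

```python
import calendar
from datetime import date


def water_year(d):
    """US water year containing date d (Oct 1 to Sep 30, named for its end year)."""
    return d.year + 1 if d.month >= 10 else d.year


def water_year_month(d):
    """Month index within the water year: 1 = October ... 12 = September."""
    return (d.month - 10) % 12 + 1


def days_in_water_year_month(wy, wy_month):
    """Number of days in the given month (1 to 12) of water year wy."""
    cal_month = (wy_month + 8) % 12 + 1  # 1 -> October, 4 -> January
    # October through December fall in the calendar year before the label.
    cal_year = wy - 1 if cal_month >= 10 else wy
    return calendar.monthrange(cal_year, cal_month)[1]


first_day = water_year(date(2006, 10, 1))
second_month_days = days_in_water_year_month(2007, 2)
```

A calendar-based time hierarchy would report 28 or 29 days for "month 2", which is why the mapping has to be explicit.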

Figure: Annual Runoff [mm] versus Annual Precipitation [mm], both axes 0 to 2000. Series: Ukiah (100 sq mi), Hopland (362 sq mi), Cloverdale (503 sq mi), Healdsburg (793 sq mi), Guerneville (1338 sq mi).
