
What to do with Scientific Data? Michael Stonebraker.

Mar 26, 2015

Outline

• Science data -- what it looks like

• Software options
    - File system
    - RDBMS
    - SciDB

O(100) petabytes

LSST Data

• Raw imagery
    - 2D arrays of telescope readings

• “Cooked” into observations
    - Image intensity algorithm (data clustering)
    - Spatial data

• Further cooked into “trajectories”
    - Similarity query
    - Constrained by maximum distance

Example LSST Queries

• Re-cook raw imagery with my algorithm

• Find all observations in a spatial region

• Find all trajectories that intersect a cylinder in time
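The last two queries can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slides: an observation is an (x, y, t) tuple, a trajectory is a list of observations, and the "cylinder in time" is a circle in (x, y) extended over a time interval. The function names are hypothetical, and only the sampled points are tested, not the path between them.

```python
from math import hypot

# Assumed representation: an observation is a tuple (x, y, t).

def in_region(obs, xmin, xmax, ymin, ymax):
    """Return the observations whose position lies in an axis-aligned box."""
    return [o for o in obs if xmin <= o[0] <= xmax and ymin <= o[1] <= ymax]

def intersects_cylinder(trajectory, cx, cy, radius, t0, t1):
    """True if any sampled point of the trajectory lies within `radius`
    of (cx, cy) during the time interval [t0, t1]."""
    return any(
        t0 <= t <= t1 and hypot(x - cx, y - cy) <= radius
        for x, y, t in trajectory
    )

observations = [(1.0, 1.0, 0), (5.0, 5.0, 1), (9.0, 2.0, 2)]
trajectories = {
    "a": [(0.0, 0.0, 0), (1.0, 1.0, 1), (2.0, 2.0, 2)],
    "b": [(8.0, 8.0, 0), (9.0, 9.0, 1)],
}

# Query 2: observations in a spatial region.
nearby = in_region(observations, 0, 6, 0, 6)

# Query 3: trajectories that pass through a cylinder in (x, y, t).
hits = [name for name, tr in trajectories.items()
        if intersects_cylinder(tr, 1.0, 1.0, 0.5, 0, 5)]
```

A linear scan like this is fine for a toy data set; at LSST scale the same predicates would need spatial indexing or chunk pruning, which is the point of the talk.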

Chemical Plant Data

• Plant is a directed graph of plumbing

• Sensors at various places
    - 1/sec observations

• Directed graph of time series

• Plant runs “near the edge” to optimize output

• And fails every once in a while -- down for a week

Chemical Plant Data

• Record all data
    - {(time, sensor-1, …, sensor-5000)}

• Look for “interesting events”, i.e. sensor values out of whack

• Cluster events near each other in 5000-dimensional space

• Goal is to identify “near-failure modes”
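A rough sketch of the "find events, then cluster them" steps follows. The slides do not say how "out of whack" is defined or which clustering method is used, so both choices here are assumptions: an event is any row in which a sensor is more than k standard deviations from its own mean, and clustering is a simple greedy pass on Euclidean distance. The example uses 3 sensors rather than 5000.

```python
from math import dist
from statistics import mean, pstdev

def find_events(rows, k=2.0):
    """Flag rows in which any sensor is more than k standard deviations
    from that sensor's mean over the whole recording (assumed rule)."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [
        row for row in rows
        if any(s > 0 and abs(v - m) > k * s
               for v, m, s in zip(row, mus, sigmas))
    ]

def cluster(events, max_dist):
    """Greedy clustering: put each event in the first cluster whose first
    member is within max_dist, otherwise start a new cluster."""
    clusters = []
    for e in events:
        for c in clusters:
            if dist(e, c[0]) <= max_dist:
                c.append(e)
                break
        else:
            clusters.append([e])
    return clusters

# 17 normal readings plus three abnormal ones (time column omitted).
rows = [(10.0, 20.0, 30.0)] * 17 + [
    (50.0, 20.0, 30.0),
    (51.0, 20.0, 30.0),
    (10.0, 20.0, 90.0),
]
events = find_events(rows)
clusters = cluster(events, max_dist=5.0)
```

Each resulting cluster is a candidate "near-failure mode": a group of abnormal readings that look alike across all sensors.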

Snow Cover in the Sierras

Earth Science Data

• Raw imagery – arrays

• “Cooked” into composite images based on multiple passes
    - Cloud-free cells

• “Baked” into snow cover, …

General Model

Sensors → cooking algorithm(s) (pipeline) → derived data

Traditional Wisdom (1)

• Cooking pipeline in hard code or custom hardware

• All data in some file system

Problems

• Can’t find anything

• Metadata often not recorded
    - Sensor parameters
    - Cooking parameters

• Can’t easily share anything

• Can’t easily recook anything

• Everything is custom code

Traditional Wisdom (2)

• Cooking pipeline outside DBMS

• Derived data loaded into DBMS for subsequent querying

Problems with This Approach

• Most of the issues remain

• Better, but stay tuned

My Preference

• Load the raw data into a DBMS

• Cooking pipeline is a collection of user-defined functions (DBMS extensions)

• Activated by triggers or a workflow management system

• ALL data captured in a common system!!!!
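To make the idea concrete, here is a toy sketch of a cooking pipeline expressed as user-defined functions fired by insert triggers. Nothing in it is a real DBMS API: `TinyStore`, `on_insert`, `calibrate` and `detect` are hypothetical names, and the store is just a dictionary of lists. It only illustrates the shape of the design, where raw and derived data end up in one system.

```python
class TinyStore:
    """Toy stand-in for a DBMS: named arrays plus insert triggers."""

    def __init__(self):
        self.arrays = {}     # array name -> list of stored values
        self.triggers = {}   # source array name -> list of (udf, target)

    def on_insert(self, source, udf, target):
        """Register `udf` to run whenever a value is inserted into `source`."""
        self.triggers.setdefault(source, []).append((udf, target))

    def insert(self, name, value):
        self.arrays.setdefault(name, []).append(value)
        for udf, target in self.triggers.get(name, []):
            # Derived data goes back into the same store, so it can
            # trigger the next cooking step in turn.
            self.insert(target, udf(value))

def calibrate(raw):          # hypothetical cooking step 1
    return [v - 1 for v in raw]

def detect(calibrated):      # hypothetical cooking step 2
    return [i for i, v in enumerate(calibrated) if v > 5]

store = TinyStore()
store.on_insert("raw", calibrate, "calibrated")
store.on_insert("calibrated", detect, "observations")
store.insert("raw", [1, 9, 3, 7])
```

After the single insert of raw data, `store.arrays` holds the raw readings, the calibrated readings and the detected observations side by side.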

What DBMS to Use?

• RDBMS (e.g. Oracle)
    - Pretty hopeless on raw data
        - Simulating arrays on top of tables likely to cost a factor of 10-100
    - Not pretty on time series data
        - Find me a sensor reading whose average value over the last 3 days is within 1% of the average value of the adjoining 5 sensors
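The time series query above is easy to state procedurally, which is part of why it is awkward in SQL. The sketch below is one reading of it, with assumptions labelled: each sensor has a list of (time, value) pairs, "adjoining" sensors come from a hypothetical `neighbours` map, and the comparison is against the mean of the neighbours' own windowed averages. The example uses two neighbours instead of five and measures time in days.

```python
from statistics import mean

def window_mean(series, now, window):
    """Mean of the values with timestamps in (now - window, now]."""
    return mean(v for t, v in series if now - window < t <= now)

def matching_sensors(readings, neighbours, now, window, tol=0.01):
    """Sensors whose windowed mean is within `tol` (relative) of the mean
    of their neighbours' windowed means."""
    avg = {s: window_mean(series, now, window)
           for s, series in readings.items()}
    result = []
    for s, adj in neighbours.items():
        ref = mean(avg[n] for n in adj)
        if abs(avg[s] - ref) <= tol * abs(ref):
            result.append(s)
    return result

readings = {
    "s1": [(0, 999.0), (1, 100.0), (2, 100.0), (3, 100.0)],
    "s2": [(1, 100.5), (2, 100.5), (3, 100.5)],
    "s3": [(1, 120.0), (2, 120.0), (3, 120.0)],
}
neighbours = {"s1": ["s2"], "s2": ["s1"], "s3": ["s1", "s2"]}

matches = matching_sensors(readings, neighbours, now=3, window=3)
```

The reading of 999.0 at time 0 falls outside the 3-day window, so it does not affect the average for s1.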

RDBMS Summary

• Wrong data model
    - Arrays, not tables

• Wrong operations
    - Regrid, not join

• Missing features
    - Versions, no-overwrite, provenance, support for uncertain data, …

But your mileage may vary

• SQL Server working well for Sloan Skyserver database

• See paper in CIDR 2009 by Jose Blakeley

How to Do Analytics (e.g. clustering)

• Suck out the data

• Convert to array format

• Pass to Matlab, SAS, R

• Compute

• Return answer to DBMS

Bad News

• Painful

• Slow

• Many analysis platforms are not parallel

• Many are main memory only

My Proposal - SciDB

• Build a commercial-quality array DBMS from the ground up with integrated analytics
    - Open source!!!!

Data Model

• Nested multi-dimensional arrays
    - Cells can be tuples or other arrays

• Time is an automatically supported extra dimension

• Ragged arrays allow each row/column to have a different dimensionality

• Support for multiple flavors of ‘null’
    - Array cells can be ‘EMPTY’
    - User-definable, context-sensitive treatment

Array-oriented storage manager

• Optimized for both dense and sparse array data
    - Different data storage, compression, and access

• Arrays are “chunked” (in multiple dimensions)

• Chunks are partitioned across a collection of nodes

• Chunks have ‘overlap’ to support neighborhood operations

• Replication provides efficiency and back-up

• No overwrite
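Chunking with overlap can be illustrated in one dimension. The sketch below is not SciDB's storage code; it only shows the arithmetic, using the range 0:99999, chunk size 1000 and overlap 10 from the DDL example later in the talk. The function name and the tuple layout are made up for illustration.

```python
def chunk_bounds(lo, hi, chunk, overlap):
    """Split the inclusive index range [lo, hi] into chunks of `chunk` cells.

    Returns one tuple per chunk:
    (core_start, core_end, stored_start, stored_end), where the stored
    range extends the core by `overlap` cells on each side, clipped to
    the array bounds.
    """
    out = []
    for start in range(lo, hi + 1, chunk):
        end = min(start + chunk - 1, hi)
        out.append((start, end,
                    max(lo, start - overlap),
                    min(hi, end + overlap)))
    return out

bounds = chunk_bounds(0, 99999, 1000, 10)
```

Because each stored chunk carries 10 extra cells from its neighbours, an operation that looks at most 10 cells away from any core cell can run on a single node without fetching data from another chunk.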

Architecture

• Shared-nothing cluster
    - 10’s–1000’s of nodes
    - Commodity hardware
    - TCP/IP between nodes
    - Linear scale-up

• Each node has a processor and storage

• Queries refer to arrays as if not distributed

• Query planner optimizes queries for efficient data access & processing

• Query plan runs on a node’s local executor & storage manager

• Runtime supervisor coordinates execution

[Architecture diagram]
    - Application Layer: Application, Language Specific UI
    - Server Layer: Query Interface and Parser, Plan Generator, Runtime Supervisor
    - Storage Layer: Node 1, Node 2, Node 3, each with a Local Executor and Storage Manager
    - Callouts: Doesn’t require JDBC, ODBC. AQL an extension of SQL; also supports UDFs. Java, C++, whatever…

SciDB DDL

CREATE ARRAY Test_Array
    < A: integer NULLS, B: double, C: USER_DEFINED_TYPE >
    [ I=0:99999,1000,10, J=0:99999,1000,10 ]
    PARTITION OVER ( Node1, Node2, Node3 )
    USING block_cyclic();

Attribute names: A, B, C

Index names: I, J

Chunk size: 1000

Overlap: 10
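The DDL fixes which chunk a cell belongs to, and the partitioning clause decides which node stores that chunk. The slide does not spell out what `block_cyclic()` does internally, so the rule below is an assumption for illustration only: chunks are numbered in row-major order and dealt out to the nodes round-robin. The helper names are hypothetical.

```python
def chunk_of(i, j, chunk=1000):
    """Chunk coordinates of cell (i, j) for a chunk size of `chunk`."""
    return (i // chunk, j // chunk)

def node_of(i, j, nodes, chunk=1000, chunks_per_row=100):
    """Assumed block-cyclic rule: number the chunks in row-major order
    and assign them to the nodes round-robin."""
    ci, cj = chunk_of(i, j, chunk)
    return nodes[(ci * chunks_per_row + cj) % len(nodes)]

nodes = ["Node1", "Node2", "Node3"]

first = node_of(0, 0, nodes)            # chunk (0, 0)
right = node_of(0, 1000, nodes)         # chunk (0, 1)
below = node_of(1000, 0, nodes)         # chunk (1, 0)
last = node_of(99999, 99999, nodes)     # chunk (99, 99)
```

With a 100000 x 100000 array and 1000 x 1000 chunks there are 100 chunks per row, which is where the default `chunks_per_row=100` comes from.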

Array Query Language (AQL)

• Array data management
    - e.g. filter, aggregate, join, etc.

• Statistical & linear algebra operations
    - multiply, QR factorization, etc.
    - parallel, disk-oriented

• User-defined operators (Postgres-style)

• Interface to external stat packages (e.g. R)

Array Query Language (AQL)

SELECT Geo-Mean ( T.B )
FROM Test_Array T
WHERE T.I BETWEEN :C1 AND :C2
  AND T.J BETWEEN :C3 AND :C4
  AND T.A = 10
GROUP BY T.I;

Callouts:
    - Geo-Mean ( T.B ): user-defined aggregate on an attribute B in array T
    - BETWEEN clauses: subsample
    - T.A = 10: filter
    - GROUP BY T.I: group-by

So far as SELECT / FROM / WHERE / GROUP BY queries are concerned, there is little logical difference between AQL and SQL
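For readers who think procedurally, the query above corresponds roughly to the Python below. This is not how SciDB evaluates it; it is a sketch that assumes the array is held as a dictionary from (I, J) to an (A, B) pair, with `geo_mean` standing in for the user-defined Geo-Mean aggregate.

```python
from collections import defaultdict
from math import exp, log

def geo_mean(values):
    """Geometric mean of positive numbers (the user-defined aggregate)."""
    return exp(sum(log(v) for v in values) / len(values))

def query(cells, c1, c2, c3, c4):
    """Subsample on I and J, filter on A = 10, group by I, aggregate B."""
    groups = defaultdict(list)
    for (i, j), (a, b) in cells.items():
        if c1 <= i <= c2 and c3 <= j <= c4 and a == 10:
            groups[i].append(b)
    return {i: geo_mean(bs) for i, bs in groups.items()}

# Assumed toy contents of Test_Array: (I, J) -> (A, B)
cells = {
    (0, 0): (10, 2.0),
    (0, 1): (10, 8.0),
    (0, 2): (7, 100.0),   # dropped by the filter A = 10
    (1, 0): (10, 3.0),
    (5, 0): (10, 9.0),    # dropped by the subsample on I
}

result = query(cells, 0, 1, 0, 2)
```

Here `result` has one entry per surviving value of I: about 4.0 for I = 0 (the geometric mean of 2 and 8) and about 3.0 for I = 1.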

Matrix Multiply

• Algorithm sensitive to distribution (range, block-cyclic)

• For range, the smaller of the two arrays is replicated at all nodes
    - Scatter-gather

• Each node does its “core” of the bigger array with the replicated smaller one

• Produces a distributed answer

CREATE ARRAY TS_Data < A1:int32, B1:double >
    [ I=0:99999,1000,0, J=0:3999,100,0 ]

multiply ( project (TS_Data, B1),
           transpose ( project (TS_Data, B1) ) )

Callouts:
    - 10K x 4K
    - Project on one attribute
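The range-partitioned strategy described in the bullets can be mimicked with plain Python lists. This is a single-process sketch under stated assumptions: the bigger matrix is split into contiguous row blocks, one per "node", the smaller matrix is copied to every node, and each node multiplies only its own block. The function names are hypothetical and no real distribution takes place.

```python
def matmul(a, b):
    """Plain (rows x inner) times (inner x cols) matrix product."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]

def range_partition(rows, n_nodes):
    """Split rows into n_nodes contiguous blocks (range partitioning)."""
    size = -(-len(rows) // n_nodes)  # ceiling division
    return [rows[k * size:(k + 1) * size] for k in range(n_nodes)]

def distributed_multiply(big, small, n_nodes):
    """Each 'node' multiplies its block of `big` by a full copy of `small`.
    The result stays partitioned the same way as `big`."""
    return [matmul(block, small) for block in range_partition(big, n_nodes)]

big = [[1, 2], [3, 4], [5, 6], [7, 8]]
small = [[0, 1], [1, 0]]          # swaps the two columns

parts = distributed_multiply(big, small, 2)

# Gathering the pieces reproduces the ordinary product.
gathered = [row for part in parts for row in part]
```

Replicating the smaller operand costs network traffic once, after which no node needs data from any other node, and the answer is already distributed by row range.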

Other Features Science Guys Want
(These could be in an RDBMS, but aren’t)

• Uncertainty
    - Data has error bars
    - Which must be carried along in the computation (interval arithmetic)
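Carrying error bars through a computation with interval arithmetic looks like this in miniature. The `Interval` class is a hypothetical illustration, not a SciDB type, and it ignores floating-point rounding direction, which a rigorous implementation would have to control.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """A value known only to lie between lo and hi (inclusive)."""
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Smallest result: our low minus their high, and vice versa.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # Signs may differ, so check all four endpoint products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

a = Interval(9.0, 11.0)    # "10 plus or minus 1"
b = Interval(-1.0, 2.0)

total = a + b
difference = a - b
product = a * b
```

Every result is itself an interval, so the uncertainty of the inputs is never silently dropped.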

Other Features

• Time travel
    - Don’t fix errors by overwrite
    - i.e. keep all the data – in the extra time dimension

• Named versions
    - Recalibration usually handled this way
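A minimal sketch of no-overwrite storage with time travel and named versions, for a single cell. The class and method names are hypothetical and the timestamps are plain integers; it shows the idea of appending along a time dimension rather than any actual SciDB mechanism.

```python
class VersionedCell:
    """Append-only history of one cell: an update adds an entry and
    never overwrites an earlier one."""

    def __init__(self):
        self.history = []   # list of (timestamp, value), oldest first
        self.names = {}     # version name -> timestamp

    def write(self, timestamp, value):
        if self.history and timestamp <= self.history[-1][0]:
            raise ValueError("timestamps must increase")
        self.history.append((timestamp, value))

    def tag(self, name, timestamp):
        """Give a human-readable name to the state as of `timestamp`."""
        self.names[name] = timestamp

    def read(self, as_of=None):
        """Latest value written at or before `as_of` (a timestamp or a
        version name); the current value if `as_of` is None."""
        if isinstance(as_of, str):
            as_of = self.names[as_of]
        candidates = [v for t, v in self.history
                      if as_of is None or t <= as_of]
        if not candidates:
            raise KeyError("no value at that time")
        return candidates[-1]

cell = VersionedCell()
cell.write(1, 20.5)
cell.tag("original calibration", 1)
cell.write(2, 20.1)   # the correction is stored next to the old value
```

Reading with no argument returns the corrected value, while reading as of time 1 or as of "original calibration" still returns what was recorded first.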

Other Features

• Provenance (lineage)
    - What calibration generated the data
    - What was the “cooking” algorithm
    - In general: repeatability of derived data
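One simple way to picture provenance is a log that records, for every derived item, which step produced it, from which input, and with which parameters, so the item can be recomputed on demand. The sketch below assumes exactly that; the step functions and the `derive` and `replay` helpers are hypothetical names, not part of any real system.

```python
def scale(values, gain):            # hypothetical calibration step
    return [v * gain for v in values]

def threshold(values, cutoff):      # hypothetical "cooking" step
    return [v for v in values if v >= cutoff]

STEPS = {"scale": scale, "threshold": threshold}
data = {"raw": [1, 4, 2, 5]}        # item name -> value
lineage = {}                        # item name -> (step, input name, params)

def derive(name, step, source, **params):
    """Run a step and record how the result was produced."""
    data[name] = STEPS[step](data[source], **params)
    lineage[name] = (step, source, params)

def replay(name):
    """Recompute an item from raw data using only its recorded lineage."""
    if name not in lineage:
        return data[name]           # raw data has no lineage entry
    step, source, params = lineage[name]
    return STEPS[step](replay(source), **params)

derive("calibrated", "scale", "raw", gain=2)
derive("observations", "threshold", "calibrated", cutoff=5)
```

If `replay("observations")` matches the stored `data["observations"]`, the derived data is repeatable from its recorded calibration and cooking steps.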

SciDB status
www.scidb.org

• Global development team (17 people) including many volunteers

• Support/distribution by a VC-backed company
    - Paradigm4 in Waltham

• 0.75 on website now
    - Have at it!!!

• 1.0 release will be on website within a month
    - Very functional, but a little pokey

• Red Queen will be out in October
    - Blazing performance
    - Probably 5X current system

Performance of SciDB 1.0

• Performance on stat operations
    - Comparable to R, degrades in a similar way when data does not fit in main memory
    - But multi-node parallelism provided -- much better scalability

• Performance on SQL
    - Comparable to or better than Postgres, especially on multi-attribute restrictions
    - And multi-node parallelism provided

• And we can do stuff they can’t….