Designing for self-serve science


Daniel Halperin

How much time “handling data” vs “doing science”?


90%

“I sort both my spreadsheets on Gene ID, then I copy matches into a new one”
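In database terms, the sort-and-copy routine in this quote is an equi-join on Gene ID. Below is a minimal sketch of that join using Python's standard-library sqlite3 module. The table names, column names, and sample rows are hypothetical and only illustrate the idea; they do not come from the talk.

```python
import sqlite3

def join_on_gene_id(expression_rows, annotation_rows):
    """Join two 'spreadsheets' on gene_id.

    The tables `expression` and `annotation` and their columns are
    hypothetical stand-ins for the scientist's two spreadsheets.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE expression (gene_id TEXT, level REAL)")
    conn.execute("CREATE TABLE annotation (gene_id TEXT, description TEXT)")
    conn.executemany("INSERT INTO expression VALUES (?, ?)", expression_rows)
    conn.executemany("INSERT INTO annotation VALUES (?, ?)", annotation_rows)
    # One declarative statement replaces "sort both, then copy matches".
    rows = conn.execute(
        "SELECT e.gene_id, e.level, a.description "
        "FROM expression e JOIN annotation a ON e.gene_id = a.gene_id "
        "ORDER BY e.gene_id"
    ).fetchall()
    conn.close()
    return rows

# Made-up sample data: only G1 and G3 appear in both inputs.
matches = join_on_gene_id(
    [("G2", 0.5), ("G1", 1.2), ("G3", 3.4)],
    [("G1", "kinase"), ("G3", "transporter"), ("G4", "unknown")],
)
```

Only the gene IDs present in both inputs survive the join, which is the same result as copying matches by hand.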

We are the problem

[Chart: Benchmark 1 and Benchmark 2, comparing Old system, Your system, and Our system; axis from 0 to 120]

[Chart: Benchmark 1 and Benchmark 2, comparing Old system, Your system, Our system, and What people use; axis from 0 to 10000]

[Figure, four panels: Performance vs. Complexity]

Design for here

What we build / What they need

Steve Jurvetson https://www.flickr.com/photos/jurvetson/7408464122

sutton-images.com http://biser3a.com/formula-1/f1-airboxes-all-you-need-to-know/

terms: http://sutton-images.com/terms.asp


Lowering barrier to entry

Developing a new language

• SQL: 3 great features for science
  • THE language of data management!
  • We know how to scale it
  • Scientists can learn it

• MyriaL is better
  • Imperative & declarative: easy to write
  • Iteration & recursion!
  • Lots of practical extensions

Giving users insight

Diagnosing problems

[Figure: source node vs. destination node]

Automating the ‘CS parts’

• Do work on the user’s behalf (Ratul Mahajan’s Buffet Principle):

• Infer indexes and constraints!

• Aggressively reuse computation

• Speculatively apply queries to data

• Key enabler: science data is (mostly) read-only
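Below is a minimal sketch of the "aggressively reuse computation" idea: because the underlying data is treated as read-only, a query's result can be cached and returned again without re-executing it. The `ReusingExecutor` class, the `reads` table, and the sample rows are hypothetical illustrations, not the system described in the talk.

```python
import sqlite3

class ReusingExecutor:
    """Caches query results, assuming the data never changes (read-only)."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}       # (sql text, params) -> list of result rows
        self.executions = 0   # how many times the database was actually hit

    def query(self, sql, params=()):
        key = (sql.strip(), tuple(params))
        if key not in self.cache:
            self.executions += 1
            self.cache[key] = self.conn.execute(sql, params).fetchall()
        return self.cache[key]

# Hypothetical read-only dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (sample TEXT, count INTEGER)")
conn.executemany("INSERT INTO reads VALUES (?, ?)",
                 [("a", 10), ("a", 5), ("b", 7)])

executor = ReusingExecutor(conn)
sql = "SELECT sample, SUM(count) FROM reads GROUP BY sample ORDER BY sample"
first = executor.query(sql)
second = executor.query(sql)  # served from the cache, no re-execution
```

The cache never needs invalidating here only because of the read-only assumption; with mutable data, each write would have to evict the affected entries.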

Enable authoring & sharing

• “Autocomplete for science” - predict query snippets as users work. (Nodira Khoussainova)

• Natural language interface: queries → English questions → queries, e.g. “Compute the fraction of CGs that are methylated in the oyster genome.”
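To make the example question concrete, here is one way it might map to a query, sketched with Python's standard-library sqlite3 module. The `cg_sites` table, its columns, and the four sample rows are hypothetical; the talk does not give the actual schema or data.

```python
import sqlite3

# Hypothetical schema: one row per CG site, methylated stored as 0 or 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cg_sites (position INTEGER, methylated INTEGER)")
conn.executemany(
    "INSERT INTO cg_sites VALUES (?, ?)",
    [(101, 1), (250, 0), (377, 1), (412, 1)],  # made-up sample rows
)

# "Compute the fraction of CGs that are methylated":
# the mean of a 0/1 column is the fraction of rows where it is 1.
(fraction,) = conn.execute("SELECT AVG(methylated) FROM cg_sites").fetchone()
conn.close()
```

With three of the four sample sites marked as methylated, `fraction` comes out as 0.75.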

Improve their state of the art

• “You just did in 1 minute what took me a week”

• “Replaced 100 lines of Python with 1 line of SQL”

• “That 5-line MyriaL program was 100x faster than my R cluster, and much simpler”

Trust, but Verify (& Support)

