Designing for self-serve science, by Daniel Halperin

Designing for self-serve science

Jun 26, 2015


I gave this talk at the UW Systems, Architecture, & Networking (SANE) retreat in May 2014. I argued that, as a community, big data system-builders may be great at building fast systems, but that these systems DO NOT serve the scientists we work with at the UW eScience Institute. I then offered a few ideas for building services that enable scientists to do their own work, thus "serving themselves".
Transcript
Page 1: Designing for self-serve science

Designing for self-serve science

Daniel Halperin

Page 2: Designing for self-serve science

How much time “handling data” vs “doing science”?

Page 3: Designing for self-serve science

How much time “handling data” vs “doing science”?

90%

Page 4: Designing for self-serve science

“I sort both my spreadsheets on Gene ID, then I copy matches into a new one”
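The manual procedure in this quote is what a database calls a join. As a minimal sketch, assuming two hypothetical tables (`expression` and `annotation`, with invented rows) standing in for the scientist's spreadsheets, the sort-and-copy steps might collapse to one query:

```python
import sqlite3

# Hypothetical rows standing in for the two spreadsheets (not from the talk).
expression = [("g1", 2.5), ("g2", 0.7), ("g4", 1.9)]
annotation = [("g1", "kinase"), ("g3", "transporter"), ("g4", "ligase")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (gene_id TEXT, level REAL)")
conn.execute("CREATE TABLE annotation (gene_id TEXT, role TEXT)")
conn.executemany("INSERT INTO expression VALUES (?, ?)", expression)
conn.executemany("INSERT INTO annotation VALUES (?, ?)", annotation)

# "Sort both on Gene ID, copy the matches" is an inner join on gene_id.
matches = conn.execute(
    "SELECT e.gene_id, e.level, a.role "
    "FROM expression AS e JOIN annotation AS a ON e.gene_id = a.gene_id "
    "ORDER BY e.gene_id"
).fetchall()

print(matches)
```

Only genes present in both tables appear in `matches`; rows without a partner are dropped, just as in the manual copy step.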

Page 5: Designing for self-serve science

We are the problem

Page 6: Designing for self-serve science

[Bar chart: Benchmark 1 and Benchmark 2; axis from 0 to 120; series: Old system, Your system, Our system]

Page 7: Designing for self-serve science

[Bar chart: Benchmark 1 and Benchmark 2; axis from 0 to 10000; series: Old system, Your system, Our system, What people use]

Page 8: Designing for self-serve science

[Chart: axes labeled Performance and Complexity]

Page 9: Designing for self-serve science

[Chart: axes labeled Performance and Complexity]

Page 10: Designing for self-serve science

[Chart: axes labeled Performance and Complexity]

Page 11: Designing for self-serve science

[Chart: axes labeled Performance and Complexity; annotation: "Design for here"]

Page 12: Designing for self-serve science

What we build vs. what they need

Steve Jurvetson https://www.flickr.com/photos/jurvetson/7408464122

sutton-images.com http://biser3a.com/formula-1/f1-airboxes-all-you-need-to-know/

terms: http://sutton-images.com/terms.asp

Page 13: Designing for self-serve science

Lowering barrier to entry

Page 14: Designing for self-serve science

Developing a new language

• SQL: 3 great features for science
  • THE language of data management!
  • We know how to scale it
  • Scientists can learn it

• MyriaL is better
  • Imperative & declarative: easy to write
  • Iteration & recursion!
  • Lots of practical extensions

Page 15: Designing for self-serve science

Giving users insight

Page 16: Designing for self-serve science

Diagnosing problems

[Figure: axes labeled Source node and Destination node]

Page 17: Designing for self-serve science

Automating the ‘CS parts’

• Do work on the user’s behalf (Ratul Mahajan’s Buffet Principle)

• Infer indexes and constraints!

• Aggressively reuse computation

• Speculatively apply queries to data

• Key enabler: science data is (mostly) read-only
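Two of these bullets can be illustrated with a small sketch. This is not Myria's implementation. The `reads` table, its rows, and the helper names `is_unique` and `cached_query` are all hypothetical. It shows one way a service might infer a uniqueness constraint from the data itself, and reuse earlier results on the assumption that the data is read-only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (read_id INTEGER, sample_id TEXT, methylated INTEGER)")
conn.executemany(
    "INSERT INTO reads VALUES (?, ?, ?)",
    [(1, "s1", 1), (2, "s1", 0), (3, "s2", 1), (4, "s2", 1)],
)

def is_unique(table, column):
    """Infer a candidate uniqueness constraint by inspecting the data.

    Identifiers are interpolated directly, so pass trusted names only.
    """
    total, distinct = conn.execute(
        f"SELECT COUNT({column}), COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return total == distinct

_cache = {}
stats = {"executions": 0}

def cached_query(sql):
    """Return a stored result when the same query text was run before.

    Safe only under the read-only assumption: nothing invalidates the cache.
    """
    if sql not in _cache:
        stats["executions"] += 1
        _cache[sql] = conn.execute(sql).fetchall()
    return _cache[sql]

sql = ("SELECT sample_id, AVG(methylated) FROM reads "
       "GROUP BY sample_id ORDER BY sample_id")
first = cached_query(sql)
second = cached_query(sql)  # served from the cache, no second execution
```

A column that passes `is_unique` is a candidate for an automatically built index. Keying the cache on exact query text is the simplest possible form of reuse; matching shared subexpressions across different queries would take considerably more machinery.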

Page 18: Designing for self-serve science

Enable authoring & sharing

• “Autocomplete for science” - predict query snippets as users work. (Nodira Khoussainova)

• Natural language interface: queries → English questions → queries. Example: “Compute the fraction of CGs that are methylated in the oyster genome.”
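To make the autocomplete idea concrete, here is a deliberately simple sketch. It is not the method from Khoussainova's work: it is a plain co-occurrence counter over a hypothetical query log, and the function name `suggest` and the snippets are invented for illustration.

```python
from collections import Counter

# Hypothetical log: each past query reduced to the set of snippets it used.
query_log = [
    {"FROM genes", "WHERE species = 'oyster'", "GROUP BY chromosome"},
    {"FROM genes", "WHERE species = 'oyster'"},
    {"FROM genes", "WHERE species = 'oyster'", "ORDER BY position"},
    {"FROM samples", "WHERE depth > 10"},
]

def suggest(partial, log, k=2):
    """Rank snippets by how often they co-occur with what is typed so far."""
    counts = Counter()
    for past in log:
        if partial <= past:  # the past query contains every typed snippet
            counts.update(past - partial)
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts, key=lambda s: (-counts[s], s))[:k]

suggestions = suggest({"FROM genes"}, query_log)
print(suggestions)
```

A real system would need to weigh context and popularity more carefully, but even this counter shows how a shared query log can turn earlier users' work into suggestions for the next one.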

Page 19: Designing for self-serve science

Improve their state of the art

• “You just did in 1 minute what took me a week”

• “Replaced 100 lines of Python with 1 line of SQL”

• “That 5-line MyriaL program was 100x faster than my R cluster, and much simpler”

Page 20: Designing for self-serve science

Trust, but Verify (& Support)