THE MISSING MANUAL FOR DATA SCIENCE: REMIX. REUSE. REPRODUCE
SPEAKER: Matt Wood, Principal Data Scientist, Amazon Web Services
Monday, April 1, 13
Jan 27, 2015
Hello.
Data.
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Generation challenge.
Linus Bengtsson et al. PLoS Medicine, 2011
Amazing data generators: cell phones tracking cholera in Haiti
You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011
Amazing data generators: social networks tracking influenza
500% return on ad spend
Amazing data generators: web app logs targeting advertising
Chromosome 11 : ACTN3 : rs1815739
Chromosome X : rs6625163
Chromosome 19 : FUT2 : rs601338
Chromosome 2 : rs10427255
TYPE II
Chromosome 10 : rs7903146
+0.25
Chromosome 15 : rs2472297
Generation challenge. [crossed out]
Utility computing.
Remove constraints.
Analytics challenge. [crossed out]
Beautiful, unique.
Impossible to recreate.
Snowflake Data Science
Reproducibility.
Reproducibility scales data science.
Reproduce. Reuse. Remix.
Value++
How do we get from here to there?
5 PRINCIPLES OF REPRODUCIBILITY
1. Data has Gravity
Increasingly large data collections.
Challenging to obtain and manage.
Expensive to experiment.
Large barrier to reproducibility.
Move data to the users. [crossed out]
Move tools to the data.
Place data where it can be consumed by tools.
Place tools where they can access data.
More data, more users, more uses, more locations
Cost
Force multiplier.
Cost and complexity kill reproducibility.
2. Ease of use is a prerequisite
http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html
Help overcome the suck threshold.
Easy to embrace and extend.
Choose the right abstraction for the user.
$ ec2-run-instances
$ starcluster start
Package and automate.
Expert-as-a-service.
1000 Genomes Project
Cloud BioLinux
Illumina BaseSpace
1000 Genomes Project + your genomic data
Amazon S3
http://www.youtube.com/watch?v=oGcZ7WVx6EI
Legacy data warehousing
Cassandra, Aegisthus, Hadoop, Hive, Pig
MicroStrategy, Sting, R
3. Reuse is as important as reproduction
Seven Deadly Sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics
Data scientists are hackers.
They have their own way of working.
Beware the Big Red Button.
Fire-and-forget reproduction is a good first step, but it limits longer-term value.
Monolithic, one-stop-shop.
Work well for intended purpose.
Challenging to install, dependency heavy.
Difficult to grok.
Data scientists are hackers: embrace it.
Small things. Loosely coupled.
Easier to grok, reuse and integrate.
Lower barrier to entry.
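One way to picture "small things, loosely coupled" is a workflow built from tiny steps that share only a plain data interface. The sketch below is illustrative and not from the talk: the step names (drop_blank, lowercase, unique) and the pipeline helper are hypothetical, and the sample input is made up.

```python
from functools import reduce
from typing import Callable, Iterable

# Every step takes an iterable of strings and returns an iterable of strings.
# That shared, minimal interface is the only coupling between steps.
Step = Callable[[Iterable[str]], Iterable[str]]


def drop_blank(lines: Iterable[str]) -> Iterable[str]:
    """Discard empty or whitespace-only lines."""
    return (line for line in lines if line.strip())


def lowercase(lines: Iterable[str]) -> Iterable[str]:
    """Normalize case so later steps compare like with like."""
    return (line.lower() for line in lines)


def unique(lines: Iterable[str]) -> Iterable[str]:
    """Keep the first occurrence of each line, preserving order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


def pipeline(*steps: Step) -> Step:
    """Compose steps left to right into a single reusable step."""
    def run(lines: Iterable[str]) -> Iterable[str]:
        return reduce(lambda data, step: step(data), steps, lines)
    return run


# Hypothetical usage: the same steps can be reordered or reused elsewhere.
clean = pipeline(drop_blank, lowercase, unique)
result = list(clean(["ACTN3", "", "actn3", "FUT2"]))
```

Because each step stands alone, someone else can take unique without adopting the rest of the pipeline, which is the reuse the slides argue for.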
4. Build for collaboration
Workflows are memes.
Reproduction is just the first step.
Bill of materials: code, data, configuration, infrastructure.
Full definition for reproduction.
Utility computing provides a playground for data science.
Code + AMI + custom datasets + public datasets + databases + compute + result data
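The bill of materials could be written down as a small, serializable manifest. The following is a minimal sketch under stated assumptions: the BillOfMaterials class, its field names, and every value in the example (repository reference, AMI ID, instance description, bucket path) are hypothetical placeholders, not anything shown in the talk.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BillOfMaterials:
    """Everything needed to rerun an analysis, mirroring the slide's list."""
    code: str = ""                 # e.g. repository plus commit (placeholder)
    ami: str = ""                  # machine image identifier (placeholder)
    custom_datasets: tuple = ()
    public_datasets: tuple = ()
    databases: tuple = ()
    compute: str = ""              # description of the compute used
    result_data: str = ""          # where the results were written

    def missing(self) -> list:
        """Names of the fields this sketch treats as required but left empty."""
        required = ("code", "ami", "compute", "result_data")
        return [name for name in required if not getattr(self, name)]

    def to_json(self) -> str:
        """Stable, human-readable serialization that can be shared or versioned."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Hypothetical example values, for illustration only.
bom = BillOfMaterials(
    code="example-repo@COMMIT",
    ami="ami-EXAMPLE",
    public_datasets=("1000 Genomes Project",),
    compute="4 x example-instance-type",
    result_data="s3://example-bucket/results/",
)
manifest = bom.to_json()
```

Checking missing() before publishing is one simple way to catch a workflow that cannot be reproduced because part of its definition was never recorded.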
5. Provenance is a first class object
Versioning becomes really important.
Especially in an active community.
Doubly so with loosely coupled tools.
Provenance metadata is a first class entity.
Distributed provenance.
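One possible reading of provenance as a first class, distributed object is a record whose identifier is derived from its own content, so any collaborator can create and verify records without a central registry. This is a sketch of that idea, not the speaker's design: provenance_record, its fields, and the sample step names and versions are all assumptions made for illustration.

```python
import hashlib
import json


def digest(payload: bytes) -> str:
    """Content hash used both for input data and for record identifiers."""
    return hashlib.sha256(payload).hexdigest()


def provenance_record(step, tool_version, inputs, params, parents=()):
    """Build a provenance record for one workflow step.

    inputs maps a name to the raw bytes consumed; parents lists the ids of
    the records this step was derived from. The id is a hash of the record
    body, so identical work always yields an identical id.
    """
    body = {
        "step": step,
        "tool_version": tool_version,
        "inputs": {name: digest(data) for name, data in sorted(inputs.items())},
        "params": params,
        "parents": sorted(parents),
    }
    record_id = digest(json.dumps(body, sort_keys=True).encode("utf-8"))
    return {"id": record_id, **body}


# Hypothetical two-step workflow; names, versions and data are made up.
align = provenance_record("align", "1.0.0", {"reads": b"ACGT"}, {"threads": 4})
call = provenance_record(
    "call-variants", "0.3.1", {"alignment": b"aligned"}, {"min_quality": 20},
    parents=[align["id"]],
)
```

Because the tool version is part of the hashed body, upgrading a loosely coupled tool produces a different record id, which is why versioning matters so much here.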
1. Data has gravity
2. Ease of use is a prerequisite
3. Reuse is as important as reproduction
4. Build for collaboration
5. Provenance is a first class object
Thank you
aws.amazon.com
@mza