Page 1
8/12/2019 QConSF2013-JoshWills-BuildingAProductionMachineLearningInfrastructure
http://slidepdf.com/reader/full/qconsf2013-joshwills-buildingaproductionmachinelearninginfrastructure 1/39
From The Lab to the Factory
Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science
Cloudera
Page 2
About Me
Page 3
What Do Data Scientists Do?
Page 4
What I Think I Do
Page 5
What Other People Think I Do
Page 6
What I Actually Do
Page 7
Data Science In the Lab
Page 8
Data Science as Statistics
Page 9
Investigative Analytics
Page 10
Tools for Investigative Analytics
Page 11
Inputs and Outputs
Page 12
On Actionable Insights
Page 13
Data Science in the Factory
Page 14
Building Data Products
Page 15
A Shift In Perspective
Analytics in the Lab
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine
Analytics in the Factory
• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions
Page 16
Data Science as Decision Engineering
Page 17
All* Products Become Data Products
Page 18
From the Lab to the Factory: First Steps
Page 19
Step 1: Choose a Good Problem
Page 20
Page 21
Step 3: Log Everything
Page 22
Step 4: Hire (More) Data Scientists
Page 23
Workflow Optimization
Page 24
The Data Science Workflow
Page 25
Identifying the Bottlenecks
Page 26
Myrrix
Page 27
Page 28
Generational Thinking
Page 29
Oryx ALS Recommender Demo
Page 30
Rolling to Production
Page 31
The Limits of Our Models
Page 32
Space Exploration
Page 33
Data Science Needs DevOps
Page 34
Introducing Gertrude
• Multivariate Testing
• Define and explore a space of parameters
• Overlapping Experiments
• Tang et al. (2010)
• Runs multiple independent experiments on every request
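
The idea of running multiple independent experiments on every request can be sketched roughly as follows. This is not Gertrude's API; it is a minimal illustration of layered, hash-based diversion in the spirit of overlapping experiments. The layer names, experiment names, bucket count, and the `bucket` and `assign` helpers are all assumptions made for the example.

```python
import hashlib

NUM_BUCKETS = 1000  # hypothetical number of buckets per layer


def bucket(unit_id: str, layer: str) -> int:
    """Hash a diversion id together with a per-layer salt into [0, NUM_BUCKETS)."""
    digest = hashlib.sha256(f"{layer}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


# Hypothetical layers. Each layer maps an experiment name to a half-open
# bucket range, and the ranges within one layer do not overlap.
LAYERS = {
    "ranking": {"ranking_control": (0, 500), "ranking_new_model": (500, 1000)},
    "ui": {"ui_control": (0, 900), "ui_big_button": (900, 1000)},
}


def assign(unit_id: str) -> dict:
    """Return one experiment per layer for this request."""
    assignment = {}
    for layer, experiments in LAYERS.items():
        b = bucket(unit_id, layer)
        for name, (lo, hi) in experiments.items():
            if lo <= b < hi:
                assignment[layer] = name
                break
    return assignment
```

Because each layer salts the hash differently, a request's bucket in one layer tells you nothing useful about its bucket in another, so a single request can take part in a ranking experiment and a UI experiment at the same time.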
Page 35
Simple Conditional Logic
• Declare experiment flags in compiled code
• Settings that can vary per request
• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
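
A rough sketch of the pattern described above: flags are declared in code with defaults, and a separate config supplies rules that compute per-request values. The slides do not show Gertrude's actual config syntax, so the JSON layout, the flag names, and the `flag_values` helper here are hypothetical.

```python
import json

# Flags declared in code with their defaults (names are hypothetical).
FLAG_DEFAULTS = {"num_results": 10, "use_new_ranker": False}

# Stand-in for the config file: ordered rules, first matching rule wins per flag.
CONFIG_TEXT = """
{
  "rules": [
    {"flag": "num_results",
     "when": {"experiment": "more_results"},
     "value": 20},
    {"flag": "use_new_ranker",
     "when": {"country": "US", "experiment": "new_ranker"},
     "value": true}
  ]
}
"""


def flag_values(request: dict, config: dict) -> dict:
    """Compute flag values for one request from defaults plus config rules."""
    values = dict(FLAG_DEFAULTS)
    decided = set()
    for rule in config["rules"]:
        flag = rule["flag"]
        # Ignore rules for undeclared flags and flags already decided.
        if flag in decided or flag not in FLAG_DEFAULTS:
            continue
        # A rule applies when every condition equals the request attribute.
        if all(request.get(k) == v for k, v in rule["when"].items()):
            values[flag] = rule["value"]
            decided.add(flag)
    return values


CONFIG = json.loads(CONFIG_TEXT)
```

Keeping the rules as data means a flag's value can change for a slice of traffic without recompiling or redeploying the code that reads it.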
Page 36
Separate Data Push from Code Push
• Validate config files and push updates to servers
• Zookeeper via Curator
• File-based
• Servers pick up new configs, load them, and update experiment space and flag value calculations
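
The file-based variant of this flow might look something like the sketch below: a server polls a config file, validates any new version, and swaps it in only if validation passes. The `ConfigHolder` class, the `validate` checks, and the JSON layout are assumptions for illustration; the ZooKeeper/Curator path is not shown.

```python
import json
import os


def validate(config) -> None:
    """Raise ValueError if the config is not usable (checks are illustrative)."""
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    rules = config.get("rules")
    if not isinstance(rules, list):
        raise ValueError("'rules' must be a list")
    for rule in rules:
        if not isinstance(rule, dict) or "flag" not in rule or "value" not in rule:
            raise ValueError(f"malformed rule: {rule!r}")


class ConfigHolder:
    """Keeps the last good config; a bad push never replaces it."""

    def __init__(self, path: str):
        self.path = path
        self.config = {"rules": []}
        self._mtime = None

    def poll(self) -> bool:
        """Reload if the file changed and passes validation. True on swap."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self._mtime:
            return False  # nothing new to load
        self._mtime = mtime
        try:
            with open(self.path, encoding="utf-8") as f:
                candidate = json.load(f)
            validate(candidate)
        except (ValueError, OSError):
            return False  # keep serving the previous config
        self.config = candidate
        return True
```

A server would call `poll()` on a timer. Since a config update is data rather than code, it can ship on its own schedule, and an invalid file leaves the running configuration untouched.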
Page 37
The Experiments Dashboard
Page 38
A Few Links I Love
• http://research.google.com/pubs/pub36500.html
• The original paper on the overlapping experiments infrastructure at Google
• http://www.exp-platform.com/
• Collection of all of Microsoft’s papers and presentations on their experimentation platform
• http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/
• Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
Page 39
Josh Wills, Director of Data Science, Cloudera @josh_wills
Thank you!