Cloudera User Group - From the Lab to the Factory
Post on 07-Jul-2015
1
From The Lab to the Factory
Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science
Cloudera
One Other Thing About Me
2
Data Science: Another Definition
3
Data Scientists Build Data Products.
4
A Shift In Perspective
Analytics in the Lab
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine
Analytics in the Factory
• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions
5
All* Products Become Data Products
6
Identifying the Bottlenecks
7
Oryx: Model Building and Serving
• Algorithms
  • ALS Recommenders
  • K-Means Parallel
  • RDF
• Batch model building via MapReduce*
• Server for real-time scoring and updates
• PMML 4.1 Models
8
Oryx Design
9
Generational Thinking
10
The Limits of Our Models
11
Space Exploration
12
Data Science Needs DevOps
13
Introducing Gertrude
• Multivariate Testing
  • Define and explore a space of parameters
• Overlapping Experiments
  • Tang et al. (2010)
  • Runs multiple independent experiments on every request
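The idea of running several independent experiments on one request can be sketched as follows. This is a minimal illustration of layered, hash-based diversion in the spirit of the overlapping experiments approach, not Gertrude's actual implementation; the layer names, arm names, bucket ranges, and the `bucket` and `divert` helpers are all hypothetical.

```python
import hashlib

NUM_BUCKETS = 1000

def bucket(unit_id: str, salt: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a request or user ID to a stable bucket, salted per layer."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Hypothetical layers: each layer owns a disjoint set of parameters and
# splits the full bucket range among its arms as (name, start, end).
LAYERS = {
    "ranking": [("control", 0, 500), ("new_model", 500, 1000)],
    "ui": [("control", 0, 900), ("big_button", 900, 1000)],
}

def divert(unit_id: str) -> dict:
    """Assign one arm per layer; the per-layer salt keeps layers independent."""
    assignment = {}
    for layer, arms in LAYERS.items():
        b = bucket(unit_id, salt=layer)
        for name, start, end in arms:
            if start <= b < end:
                assignment[layer] = name
                break
    return assignment
```

Because each layer hashes the same ID with a different salt, a request's arm in one layer tells you nothing about its arm in another, which is what allows the experiments to overlap.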
14
Simple Conditional Logic
• Declare experiment flags in compiled code
• Settings that can vary per request
• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
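A rough sketch of this split between code and configuration: flags and their defaults are declared in code, and a separate rule list decides their per-request values. The flag names, the rule format (`when` / `set`), and `compute_flags` are assumptions made for illustration and do not reflect Gertrude's real config syntax.

```python
# Flags declared in code, with defaults (hypothetical names).
DEFAULT_FLAGS = {"num_results": 10, "use_new_ranker": False}

# Stand-in for a parsed config file: rules are applied in order, and a
# rule fires when every key in "when" matches the request.
CONFIG = [
    {"when": {"country": "US"}, "set": {"num_results": 20}},
    {"when": {"experiment": "new_model"}, "set": {"use_new_ranker": True}},
]

def compute_flags(request: dict, config=CONFIG, defaults=DEFAULT_FLAGS) -> dict:
    """Return the flag values for one request; later rules override earlier ones."""
    flags = dict(defaults)
    for rule in config:
        if all(request.get(key) == value for key, value in rule["when"].items()):
            for name, value in rule["set"].items():
                if name not in defaults:
                    raise KeyError(f"config sets undeclared flag: {name}")
                flags[name] = value
    return flags
```

For example, `compute_flags({"country": "US", "experiment": "new_model"})` applies both rules, while a request that matches no rule simply gets the defaults.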
15
Separate Data Push from Code Push
• Validate config files and push updates to servers
  • Zookeeper via Curator
  • File-based
• Servers pick up new configs, load them, and update experiment space and flag value calculations
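The validate-then-swap step might look roughly like the sketch below. It leaves out the transport entirely (no ZooKeeper or file watching) and only shows a server refusing a bad config while continuing to serve the last good one; `KNOWN_FLAGS`, `validate`, and `ConfigHolder` are hypothetical names, and the rule format is assumed.

```python
import json
import threading

# Flags the running code knows about, with their expected types (assumed).
KNOWN_FLAGS = {"num_results": int, "use_new_ranker": bool}

def validate(config) -> None:
    """Raise ValueError if the config refers to unknown flags or wrong types."""
    if not isinstance(config, list):
        raise ValueError("config must be a list of rules")
    for rule in config:
        if not isinstance(rule, dict):
            raise ValueError("each rule must be an object")
        for name, value in rule.get("set", {}).items():
            expected = KNOWN_FLAGS.get(name)
            if expected is None:
                raise ValueError(f"unknown flag: {name}")
            if type(value) is not expected:
                raise ValueError(f"bad type for flag {name}")

class ConfigHolder:
    """Holds the active config; a new one replaces it only if it validates."""

    def __init__(self, initial):
        validate(initial)
        self._config = initial
        self._lock = threading.Lock()

    def load(self, raw: str) -> bool:
        """Parse and validate a pushed config; keep the old one on failure."""
        try:
            candidate = json.loads(raw)
            validate(candidate)
        except ValueError:  # also covers json.JSONDecodeError
            return False
        with self._lock:
            self._config = candidate
        return True

    def current(self):
        with self._lock:
            return self._config
```

Since a config push never touches compiled code, a rejected update costs nothing: the server keeps computing flag values from the previous config until a valid one arrives.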
16
The Experiments Dashboard
17
A Few Links I Love
• http://research.google.com/pubs/pub36500.html
  • The original paper on the overlapping experiments infrastructure at Google
• http://www.exp-platform.com/
  • A collection of Microsoft’s papers and presentations on their experimentation platform
• http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/
  • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
18
Josh Wills, Director of Data Science, Cloudera @josh_wills
Thank you!