Cloudera User Group - From the Lab to the Factory

Post on 07-Jul-2015

DESCRIPTION

This is the presentation that Cloudera's senior director of data science, Josh Wills, delivered at the Cloudera User Group (CUG) meetings in Chicago on 12/3/13 and in NYC on 12/5/13.

Transcript

1

From The Lab to the Factory

Building A Production Machine Learning Infrastructure

Josh Wills, Senior Director of Data Science

Cloudera

One Other Thing About Me

2

Data Science: Another Definition

3

Data Scientists Build Data Products.

4

A Shift In Perspective

Analytics in the Lab

• Question-driven

• Interactive

• Ad-hoc, post-hoc

• Fixed data

• Focus on speed and flexibility

• Output is embedded into a report or in-database scoring engine

Analytics in the Factory

• Metric-driven

• Automated

• Systematic

• Fluid data

• Focus on transparency and reliability

• Output is a production system that makes customer-facing decisions

5

All* Products Become Data Products

6

Identifying the Bottlenecks

7

Oryx: Model Building and Serving

• Algorithms

• ALS Recommenders

• K-Means Parallel

• RDF (random decision forests)

• Batch model building via MapReduce*

• Server for real-time scoring and updates

• PMML 4.1 Models
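The split between batch model building and a real-time serving layer can be illustrated with a small sketch. This is not Oryx's API: the names `build_model`, `nearest`, and `ServingLayer` are hypothetical, the batch step uses plain Lloyd's k-means on a single machine (not K-Means Parallel on MapReduce), and the `generation` counter is an assumption about how a server might decide when to swap in a newer model.

```python
import math


def nearest(centers, point):
    """Index of the center closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(centers[i], point))


def build_model(points, k, generation, iterations=10):
    """Batch step: plain Lloyd's k-means, seeded with the first k points."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(centers, p)].append(p)
        # Recompute each center as the mean of its members; keep empty ones.
        centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else center
            for members, center in zip(clusters, centers)
        ]
    return {"generation": generation, "centers": centers}


class ServingLayer:
    """Serving step: scores requests against the newest model it has seen."""

    def __init__(self, model):
        self.model = model

    def score(self, point):
        return nearest(self.model["centers"], point)

    def load(self, model):
        # Hypothetical rule: only ever move forward to a newer generation.
        if model["generation"] > self.model["generation"]:
            self.model = model


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
server = ServingLayer(build_model(points, k=2, generation=1))
print(server.score((1, 1)), server.score((9, 9)))
```

The point of the separation is that the expensive rebuild can run on its own schedule while the server keeps answering requests from the last model it loaded.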

8

Oryx Design

9

Generational Thinking

10

The Limits of Our Models

11

Space Exploration

12

Data Science Needs DevOps

13

Introducing Gertrude

• Multivariate Testing

• Define and explore a space of parameters

• Overlapping Experiments

• Tang et al. (2010)

• Runs multiple independent experiments on every request
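One way to picture running several independent experiments on a single request is layered diversion, in the spirit of Tang et al. (2010). The sketch below is not Gertrude's implementation: the layer names, arm names, and the `bucket` and `divert` functions are hypothetical. It assumes each layer owns a disjoint set of parameters and hashes the request identifier with a layer-specific salt, so that assignments in one layer do not determine assignments in another.

```python
import hashlib

# Hypothetical layers; each layer owns a disjoint set of parameters.
LAYERS = {
    "ranking": ["control", "new_ranker"],
    "ui": ["control", "blue_button", "large_font"],
}


def bucket(request_id, layer, num_buckets):
    """Deterministic bucket for this request within one layer."""
    # Salting the hash with the layer name keeps layers independent.
    digest = hashlib.sha256(f"{layer}:{request_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def divert(request_id):
    """Assign the request to exactly one arm in every layer."""
    return {
        layer: arms[bucket(request_id, layer, len(arms))]
        for layer, arms in LAYERS.items()
    }


print(divert("user-42"))
```

Because the hash is deterministic, the same request identifier always lands in the same arms, which keeps a user's experience stable across requests.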

14

Simple Conditional Logic

• Declare experiment flags in compiled code

• Settings that can vary per request

• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
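A minimal sketch of the idea, assuming a rule format invented for illustration (it is not Gertrude's config syntax): flags are declared in code with defaults, and a separate config supplies rules that override a flag's value when a request matches. The names `DEFAULT_FLAGS`, `CONFIG`, and `flag_values` are hypothetical.

```python
# Flags declared in code, with the default value each one takes.
DEFAULT_FLAGS = {"num_results": 10, "use_new_ranker": False}

# Hypothetical config contents; in practice this would live in a file.
CONFIG = {
    "rules": [
        {"flag": "num_results", "when": {"device": "mobile"}, "value": 5},
        {"flag": "use_new_ranker", "when": {"experiment": "new_ranker"}, "value": True},
    ]
}


def flag_values(request, config=CONFIG):
    """Compute per-request flag values: defaults, then matching rules in order."""
    values = dict(DEFAULT_FLAGS)
    for rule in config["rules"]:
        if rule["flag"] not in DEFAULT_FLAGS:
            raise KeyError(f"rule refers to undeclared flag: {rule['flag']}")
        # A rule applies only if every condition matches the request.
        if all(request.get(key) == want for key, want in rule["when"].items()):
            values[rule["flag"]] = rule["value"]
    return values


print(flag_values({"device": "mobile", "experiment": "new_ranker"}))
```

Keeping the rules this simple means they can be checked mechanically before they reach a server, and the compiled code only ever reads flag values.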

15

Separate Data Push from Code Push

• Validate config files and push updates to servers

• ZooKeeper via Curator

• File-based

• Servers pick up new configs, load them, and update experiment space and flag value calculations
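The file-based variant can be sketched as a watcher that a server polls periodically. This is an illustration under assumptions, not Gertrude's code: `KNOWN_FLAGS`, `validate`, and `ConfigWatcher` are hypothetical names, and the config is assumed to be JSON with a `rules` list. A new file is applied only if it validates; otherwise the server keeps the config it already has.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical set of flags declared in the running code.
KNOWN_FLAGS = {"num_results", "use_new_ranker"}


def validate(config):
    """Reject configs that the running code could not interpret."""
    if not isinstance(config, dict) or not isinstance(config.get("rules"), list):
        raise ValueError("config must be an object with a 'rules' list")
    for rule in config["rules"]:
        if not isinstance(rule, dict) or rule.get("flag") not in KNOWN_FLAGS:
            raise ValueError(f"rule refers to an undeclared flag: {rule!r}")
    return config


class ConfigWatcher:
    """Polls a config file and swaps in new contents once they validate."""

    def __init__(self, path):
        self.path = Path(path)
        self.config = {"rules": []}
        self._seen = None  # hash of the last file contents examined

    def poll(self):
        """Return True if a new, valid config was loaded."""
        raw = self.path.read_bytes()
        digest = hashlib.sha256(raw).hexdigest()
        if digest == self._seen:
            return False
        self._seen = digest
        try:
            new_config = validate(json.loads(raw))
        except ValueError:
            # Invalid push (including malformed JSON): keep the old config.
            return False
        self.config = new_config
        return True
```

A server would call `poll()` on a timer and recompute its experiment space whenever it returns `True`; no code deploy is involved in changing an experiment.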

16

The Experiments Dashboard

17

A Few Links I Love

• http://research.google.com/pubs/pub36500.html

• The original paper on the overlapping experiments infrastructure at Google

• http://www.exp-platform.com/

• Collection of all of Microsoft’s papers and presentations on their experimentation platform

• http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/

• Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies

18

Josh Wills, Senior Director of Data Science, Cloudera @josh_wills

Thank you!
