Automating Machine Learning Workflows: A Report from the Trenches - Jose A. Ortega Ruiz @ PAPIs Connect
Post on 08-Jan-2017
Automating Machine Learning: Features and Workflows
jao@bigml.com
PAPIs Connect Valencia, 2016
Outline
Introduction: ML as a System Service
Feature Engineering Automation
Workflow Automation
Challenges and Outlook
Outline
Introduction: ML as a System Service
Feature Engineering Automation
Workflow Automation
Challenges and Outlook
Machine Learning as a System Service
The goal
Machine Learning as a system-level service
The means
- APIs: ML building blocks
- Abstraction layer over feature engineering
- Abstraction layer over algorithms
- Automation
Machine Learning Workflows
Dr. Natalia Konstantinova (http://nkonst.com/machine-learning-explained-simple-words/)
Machine Learning Workflows for real
Jeannine Takaki, Microsoft Azure Team
Machine Learning Automation Today

from bigml.api import BigML

api = BigML()
project = api.create_project({'name': 'ToyBoost'})
orig_source = api.create_source(source,  # source: path or URL of the input data
                                {"name": "ToyBoost",
                                 "project": project['resource']})
api.ok(orig_source)
orig_dataset = api.create_dataset(orig_source, {"name": "Boost"})
api.ok(orig_dataset)
trainset = api.get_dataset(orig_dataset)
for loop in range(0, 10):
    api.ok(trainset)
    model = api.create_model(trainset, {
        "name": "ToyBoost - Model%d" % loop,
        "objective_fields": ["letter"],
        "excluded_fields": ["weight"],
        "weight_field": "100011"})
    api.ok(model)
    batchp = api.create_batch_prediction(model, trainset, {
        "name": "ToyBoost - Result%d" % loop,
        "all_fields": True,
        "header": True,
        "output_dataset": True})  # materialize predictions as a dataset
    api.ok(batchp)
    batchp = api.get_batch_prediction(batchp)
    batchp_dataset = api.get_dataset(batchp['object']['output_dataset_resource'])
    trainset = api.create_dataset(batchp_dataset, {})
Machine Learning Automation Today
Problems of current solutions
Complexity Lots of details outside the problem domain
Reuse No inter-language compatibility
Scalability Client-side workflows hard to optimize
Not enough abstraction
Machine Learning Automation Tomorrow
Solution: Domain-specific languages
Outline
Introduction: ML as a System Service
Feature Engineering Automation
Workflow Automation
Challenges and Outlook
Domain-specific Expressions (sexps)
(if (missing? "height")
(random-value "height")
(field "height"))
(window "income" 10)
(within-percentiles? "age" 0.5 0.95)
(cond (> (field "score") (mean "score")) "above average"
(= (field "score") (mean "score")) "below average"
"mediocre")
Domain-specific Expressions (JSON)
["if", ["missing?", "height"],
["random-value", "height"],
["field", "height"]]
["window", "income", 10]
["within-percentiles?", "age", 0.5, 0.95]
["cond", [">", ["field", "score"], ["mean", "score"]], "above average",
["=", ["field", "score"], ["mean", "score"]], "below average",
"mediocre"]
Abstraction via the Language
;; (if (missing? "height")
;; (random-value "height")
;; (field "height"))
(ensure-value "height")
(window "income" 10)
(within-percentiles? "age" 0.5 0.95)
;; (cond (> (field "score") (mean "score")) "above average"
;; (= (field "score") (mean "score")) "below average"
;; "mediocre")
(discretize "score" "above above" "below average" "mediocre")
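These shorthand forms can be read as macro expansions into the primitive expressions shown in the comments. A minimal sketch in Python, expanding the hypothetical `ensure-value` sugar over the JSON representation:

```python
def expand_ensure_value(field_name):
    """Expand (ensure-value f) into its primitive form
    (if (missing? f) (random-value f) (field f)), using the
    JSON array representation of Flatline expressions."""
    return ["if", ["missing?", field_name],
                  ["random-value", field_name],
                  ["field", field_name]]

print(expand_ensure_value("height"))
# ['if', ['missing?', 'height'], ['random-value', 'height'], ['field', 'height']]
```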
Abstraction via the User Interface
Remote for efficiency and reuse, local for discoverability
Flatline: A DSL for Feature Engineering
- Domain-specific: new fields from an input sliding window as declarative expressions
- Simple syntax: JSON → s-expressions
- Efficient: full server-side implementation
- Discoverable: in-browser client-side implementation
- Reusable: the same expressions usable from any language binding
- Bonus: applicable to filtering
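For intuition about the sliding-window idea: a form like `(window "income" 10)` derives features for each row from a window of neighbouring values of a column. A rough Python sketch of that idea, here collecting up to the n previous values per row (the exact window semantics are Flatline's, not this code's):

```python
def sliding_window(values, n):
    """Illustrative sliding window: for each row index i, gather up to
    the n preceding values of the column as raw material for features."""
    return [values[max(0, i - n):i] for i in range(len(values))]

incomes = [100, 120, 90, 110, 130]
print(sliding_window(incomes, 2))
# [[], [100], [100, 120], [120, 90], [90, 110]]
```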
Outline
Introduction: ML as a System Service
Feature Engineering Automation
Workflow Automation
Challenges and Outlook
Machine Learning Workflows
A DSL for Machine Learning Workflows?
Machine Learning Workflows
A DSL for Machine Learning Workflows? Absolutely!
Machine Learning Workflows
Same problems, only worse...
Complexity Hairy logic and control flow
Reuse More complex algorithms and behaviour very hard to port to other languages
Scalability Lots of iterations and intermediate resources very hard to make efficient on the client side
Machine Learning Workflows
WhizzML, same solution, only better...
WhizzML: A sexp-based, domain-specific language
(define apple
"https://s3.amazonaws.com/bigml-public/csv/nasdaq_aapl.csv")
(define source (create-and-wait-source {"remote" apple
"name" "whizz"}))
(define dataset (create-and-wait-dataset {"source" source}))
(define anomaly (create-and-wait-anomaly {"dataset" dataset}))
(define input {"Open" 275 "High" 300 "Low" 250})
(define score
(create-and-wait-anomalyscore {"anomaly" anomaly
"input_data" input}))
(get (fetch score) "score")
WhizzML vs Flatline (as languages)
A better language:
- Better data structures (dictionaries, sets...)
- Better control flow: (tail) recursion, iteration, loops
- Better abstraction: procedures
WhizzML: Lambda Abstraction
Abstraction
(define (score-stock name input)
(let (base "https://s3.amazonaws.com/bigml-public/csv"
stock (str base "/" name)
source (create-and-wait-source {"remote" stock})
dataset (create-and-wait-dataset {"source" source})
anomaly (create-and-wait-anomaly {"dataset" dataset}))
(create-and-wait-anomalyscore {"anomaly" anomaly
"input_data" input})))
WhizzML: Reusable Procedures
Abstraction
(score-stock "aapl" {"Open" 275 "High" 300 "Low" 250})
WhizzML: Server-side fortes
A better server-side:
- Better reusability: scripts, executions and libraries as first-class ML resources
- Higher efficiency gains: automatic parallelism
- More opportunities for UI extensions
WhizzML Source Code as a Machine Learning Resource
{"library":{
"imports":["12343addb343f2890f23492d"],
"source_code": "(define (mu2) (mu (g 3 8)))",
"exports": [{"name": "mu2", "signature": []}]}}
{"script":{
"parameters": [{"name": "remote_uri", "type": "string"},
{"name": "timeout", "type": "number",
"default": 10000}],
"source_code":
"(define id (create-source {\"remote\" remote_uri}))
(wait id timeout)",
"outputs": [{"name": "id", "type": "source-id"}]}}
Rich metadata, reuse and shareability of WhizzML code
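Since scripts travel as JSON, the embedded WhizzML source has to be escaped as a JSON string, which is why the quotes above appear as `\"`. A minimal sketch, assuming the payload shape shown, using Python's json module to build a well-formed request body:

```python
import json

# Hypothetical WhizzML source for a script resource; note the inner
# quotes that json.dumps must escape inside the "source_code" string.
source_code = '(define id (create-source {"remote" remote_uri}))'

payload = {"script": {
    "parameters": [{"name": "remote_uri", "type": "string"}],
    "source_code": source_code,
    "outputs": [{"name": "id", "type": "source-id"}]}}

body = json.dumps(payload)
# Round-trips cleanly: the escaped quotes are restored on parsing.
assert json.loads(body)["script"]["source_code"] == source_code
```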
Executions as a Machine Learning Resource
{"execution": {"script_id": "1a2232bf3498f95dde",
"username": "bittwidler",
"tlp": 4,
"resource_limits": {"total": 50,
"source": 10,
"dataset": 5,
"model": 10},
"max_exection_time": 3600,
"max_execution_steps": 10000,
"max_recursion_depth": 1024}}
WhizzML: Client-side fortes
A better client-side:
- Better interactive experience: read-eval-print loop
- Scripts usable from the user's machine
- Interoperability: Java, JavaScript and NodeJS REPLs
- Challenge: behavioural coherence between server and client sides
Outline
Introduction: ML as a System Service
Feature Engineering Automation
Workflow Automation
Challenges and Outlook
Challenges
Solved
- Local REPL and remote shared implementation
- Automatic parallelization
- Error reporting
- Traceability: stack traces and stepwise execution
Open
- Better error management (dynamic typing, type inferencer)
- Resumable workflows
- Data locality: optimizing repeated access to the same datasets