Automating Machine Learning Workflows: A Report from the Trenches - Jose A. Ortega Ruiz @ PAPIs Connect
Post on 08-Jan-2017
Automating Machine Learning: Features and Workflows
jao@bigml.com
PAPIs Connect Valencia, 2016
Outline
Introduction: ML as a System Service
Feature Engineering Automation
Workflow Automation
Challenges and Outlook
Outline
Introduction: ML as a System Service
Feature Engineering Automation
Workflow Automation
Challenges and Outlook
Machine Learning as a System Service
The goal
Machine Learning as a system-level service
The means
- APIs: ML building blocks
- Abstraction layer over feature engineering
- Abstraction layer over algorithms
- Automation
Machine Learning Workflows
Dr. Natalia Konstantinova (http://nkonst.com/machine-learning-explained-simple-words/)
Machine Learning Workflows for real
Jeannine Takaki, Microsoft Azure Team
Machine Learning Automation Today

from bigml.api import BigML

api = BigML()
project = api.create_project({'name': 'ToyBoost'})
orig_source = api.create_source(source,  # source: path or URL of the input data
                                {"name": "ToyBoost",
                                 "project": project['resource']})
api.ok(orig_source)
orig_dataset = api.create_dataset(orig_source, {"name": "Boost"})
api.ok(orig_dataset)
trainset = api.get_dataset(orig_dataset)
for loop in range(0, 10):
    api.ok(trainset)
    model = api.create_model(trainset, {
        "name": "ToyBoost - Model%d" % loop,
        "objective_fields": ["letter"],
        "excluded_fields": ["weight"],
        "weight_field": "100011"})
    api.ok(model)
    batchp = api.create_batch_prediction(model, trainset, {
        "name": "ToyBoost - Result%d" % loop,
        "all_fields": True,
        "header": True,
        "output_dataset": True})  # materialize predictions as a dataset
    api.ok(batchp)
    batchp = api.get_batch_prediction(batchp)
    batchp_dataset = api.get_dataset(batchp['object']['output_dataset_resource'])
    trainset = api.create_dataset(batchp_dataset, {})
Machine Learning Automation Today
Problems of current solutions
Complexity Lots of details outside the problem domain
Reuse No inter-language compatibility
Scalability Client-side workflows hard to optimize
Not enough abstraction
Machine Learning Automation Tomorrow
Solution: Domain-specific languages
Outline
Introduction: ML as a System Service
Feature Engineering Automation
Workflow Automation
Challenges and Outlook
Domain-specific Expressions (sexps)
(if (missing? "height")
(random-value "height")
(field "height"))
(window "income" 10)
(within-percentiles? "age" 0.5 0.95)
(cond (> (field "score") (mean "score")) "above average"
(= (field "score") (mean "score")) "below average"
"mediocre")
Domain-specific Expressions (JSON)
["if", ["missing?", "height"],
["random-value", "height"],
["field", "height"]]
["window", "income", 10]
["within-percentiles?", "age", 0.5, 0.95]
["cond", [">", ["field", "score"], ["mean", "score"]], "above average",
["=", ["field", "score"], ["mean", "score"]], "below average",
"mediocre"]
Abstraction via the Language
;; (if (missing? "height")
;; (random-value "height")
;; (field "height"))
(ensure-value "height")
(window "income" 10)
(within-percentiles? "age" 0.5 0.95)
;; (cond (> (field "score") (mean "score")) "above average"
;; (= (field "score") (mean "score")) "below average"
;; "mediocre")
(discretize "score" "above above" "below average" "mediocre")
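These shorthand forms can be read as macro expansions into the primitive expressions shown in the comments. A minimal sketch in Python, expanding the hypothetical `ensure-value` sugar over the JSON representation:

```python
def expand_ensure_value(field_name):
    """Expand (ensure-value f) into its primitive form
    (if (missing? f) (random-value f) (field f)), using the
    JSON array representation of Flatline expressions."""
    return ["if", ["missing?", field_name],
                  ["random-value", field_name],
                  ["field", field_name]]

print(expand_ensure_value("height"))
# ['if', ['missing?', 'height'], ['random-value', 'height'], ['field', 'height']]
```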
Abstraction via the User Interface
Remote for efficiency and reuse, local for discoverability
Flatline: A DSL for Feature Engineering
- Domain-specific: new fields from an input sliding window as declarative expressions
- Simple syntax: JSON → s-expressions
- Efficient: full server-side implementation
- Discoverable: in-browser client-side implementation
- Reusable: the same expressions usable from any language binding
- Bonus: applicable to filtering
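For intuition about the sliding-window idea: a form like `(window "income" 10)` derives features for each row from a window of neighbouring values of a column. A rough Python sketch of that idea, here collecting up to the n previous values per row (the exact window semantics are Flatline's, not this code's):

```python
def sliding_window(values, n):
    """Illustrative sliding window: for each row index i, gather up to
    the n preceding values of the column as raw material for features."""
    return [values[max(0, i - n):i] for i in range(len(values))]

incomes = [100, 120, 90, 110, 130]
print(sliding_window(incomes, 2))
# [[], [100], [100, 120], [120, 90], [90, 110]]
```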
Outline
Introduction: ML as a System Service
Feature Engineering Automation
Workflow Automation
Challenges and Outlook
Machine Learning Workflows
A DSL for Machine Learning Workflows?
Machine Learning Workflows
A DSL for Machine Learning Workflows? Absolutely!
Machine Learning Workflows
Same problems, only worse...
Complexity Hairy logic and control flow
Reuse More complex algorithms and behaviour very hard to port to other languages
Scalability Lots of iterations and intermediate resources very hard to make efficient on the client side
Machine Learning Workflows
WhizzML, same solution, only better...
WhizzML: A sexp-based, domain-specific language
(define apple
"https://s3.amazonaws.com/bigml-public/csv/nasdaq_aapl.csv")
(define source (create-and-wait-source {"remote" apple
"name" "whizz"}))
(define dataset (create-and-wait-dataset {"source" source}))
(define anomaly (create-and-wait-anomaly {"dataset" dataset}))
(define input {"Open" 275 "High" 300 "Low" 250})
(define score
(create-and-wait-anomalyscore {"anomaly" anomaly
"input_data" input}))
(get (fetch score) "score")
WhizzML vs Flatline (as languages)
A better language:
- Better data structures (dictionaries, sets...)
- Better control flow: (tail) recursion, iteration, loops
- Better abstraction: procedures
WhizzML: Lambda Abstraction
Abstraction
(define (score-stock name input)
(let (base "https://s3.amazonaws.com/bigml-public/csv"
stock (str base "/" name)
source (create-and-wait-source {"remote" stock})
dataset (create-and-wait-dataset {"source" source})
anomaly (create-and-wait-anomaly {"dataset" dataset}))
(create-and-wait-anomalyscore {"anomaly" anomaly
"input_data" input})))
WhizzML: Reusable Procedures
Abstraction
(score-stock "aapl" {"Open" 275 "High" 300 "Low" 250})
WhizzML: Server-side fortes
A better server-side:
- Better reusability: scripts, executions and libraries as first-class ML resources
- Higher efficiency gains: automatic parallelism
- More opportunities for UI extensions
WhizzML Source Code as a Machine Learning Resource
{"library":{
"imports":["12343addb343f2890f23492d"],
"source_code": "(define (mu2) (mu (g 3 8)))",
"exports": [{"name": "mu2", "signature": []}]}}
{"script":{
"parameters": [{"name": "remote_uri", "type": "string"},
{"name": "timeout", "type": "number",
"default": 10000}],
"source_code":
"(define id (create-source {\"remote\" remote_uri}))
(wait id timeout)",
"outputs": [{"name": "id", "type": "source-id"}]}}
Rich metadata, reuse and shareability of WhizzML code
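Since scripts travel as JSON, the embedded WhizzML source has to be escaped as a JSON string, which is why the quotes above appear as `\"`. A minimal sketch, assuming the payload shape shown, using Python's json module to build a well-formed request body:

```python
import json

# Hypothetical WhizzML source for a script resource; note the inner
# quotes that json.dumps must escape inside the "source_code" string.
source_code = '(define id (create-source {"remote" remote_uri}))'

payload = {"script": {
    "parameters": [{"name": "remote_uri", "type": "string"}],
    "source_code": source_code,
    "outputs": [{"name": "id", "type": "source-id"}]}}

body = json.dumps(payload)
# Round-trips cleanly: the escaped quotes are restored on parsing.
assert json.loads(body)["script"]["source_code"] == source_code
```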
Executions as a Machine Learning Resource
{"execution": {"script_id": "1a2232bf3498f95dde",
"username": "bittwidler",
"tlp": 4,
"resource_limits": {"total": 50,
"source": 10,
"dataset": 5,
"model": 10},
"max_exection_time": 3600,
"max_execution_steps": 10000,
"max_recursion_depth": 1024}}
WhizzML: Client-side fortes
A better client-side:
- Better interactive experience: read-eval-print loop
- Scripts usable from the user's machine
- Interoperability: Java, JavaScript and NodeJS REPLs
- Challenge: behavioural coherence between server and client sides
Outline
Introduction: ML as a System Service
Feature Engineering Automation
Workflow Automation
Challenges and Outlook
Challenges
Solved
- Local REPL and remote shared implementation
- Automatic parallelization
- Error reporting
- Traceability: stack traces and stepwise execution
Open
- Better error management (dynamic typing, type inferencer)
- Resumable workflows
- Data locality: optimizing repeated access to the same datasets