AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments
Transcript
How to Make Analytic Operations Look More Like DevOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments
Robert L. Grossman, University of Chicago and Open Data Group
O'Reilly Strata Conference, March 30, 2016
rgrossman.com @bobgrossman
Introduction to AnalyticOps
[Diagram: DevOps spans software development, quality assurance, and operations.]
The goal of DevOps is to establish a culture and an environment where building, testing, releasing, and operating software can happen rapidly, frequently, and more reliably.* *Adapted from Wikipedia, en.wikipedia.org/wiki/DevOps.
[Diagram: by analogy, AnalyticOps spans analytic modeling, quality assurance, and analytic operations.]
The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.
Unlike DevOps, AnalyticOps must manage all three together:
• Software
• Model
• Data
[Diagram: the analytics stack: analytic strategy and planning at the top, then analytic models & algorithms and analytic operations, resting on the analytic infrastructure.]*
*Source: Robert L. Grossman, The Strategy and Practice of Analytics, O'Reilly, 2016, to appear.
A Problem
There are platforms and tools for managing and processing big data (Hadoop) and for building analytics (SAS, SPSS, R, Statistica, Spark, Skytree, Mahout), but few options for deploying analytics into operations or for embedding analytics into products and services.
[Diagram: data scientists develop analytic models & algorithms on top of the analytic infrastructure; enterprise IT deploys analytics into products, services, and operations; deploying analytics is the gap between them.]
More Problems
[Diagram: the same gap between the data scientists and enterprise IT, with two further missing pieces: monitoring operational analytics, and ETL and datamarts for the modelers.]
Case Study 1: Scoring Engines for Critical Systems
Life Cycle of a Predictive Model
[Diagram: the model life cycle as a loop, with analytic modeling (ModelDev) on one side and analytic operations (AnalyticOps) on the other; performance data flows from operations back to modeling.]
ModelDev (analytic modeling):
• Select the analytic problem & approach.
• Exploratory data analysis: get and clean the data.
• Build the model in the dev/modeling environment.
AnalyticOps (analytic operations):
• Deploy the model in operational systems with a scoring application.
• Scale up the deployment.
• Monitor performance and employ a champion-challenger methodology to develop an improved model (a minimal sketch of this pattern follows below).
• Retire the model and deploy the improved model.
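Since the champion-challenger step is the hinge of the whole loop, here is a minimal sketch of the pattern: score everything with the current champion, shadow-score with the challenger, and promote only on evidence. The class, the accuracy metric, and the promotion rule are illustrative assumptions, not from the talk.

```python
class ChampionChallenger:
    """Score every event with the champion; shadow-score with the
    challenger and log both so they can be compared offline."""

    def __init__(self, champion, challenger):
        self.champion = champion      # model currently driving actions
        self.challenger = challenger  # candidate model, shadow mode only
        self.log = []                 # (champion_score, challenger_score)

    def score(self, event):
        c = self.champion(event)
        ch = self.challenger(event)   # computed but not acted on
        self.log.append((c, ch))
        return c                      # only the champion's score drives actions

    def compare(self, metric, labels):
        """Once ground truth arrives, compare both models on the same traffic."""
        champ = metric([c for c, _ in self.log], labels)
        chall = metric([ch for _, ch in self.log], labels)
        return {"champion": champ, "challenger": chall,
                "promote_challenger": chall > champ}

# Toy usage: models are plain functions, accuracy is the metric.
def accuracy(scores, labels):
    preds = [s >= 0.5 for s in scores]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

cc = ChampionChallenger(lambda e: 0.4, lambda e: 0.9)
for event in range(5):
    cc.score(event)
print(cc.compare(accuracy, [True] * 5))  # challenger wins on this toy data
```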
Differences Between the Modeling and Deployment Environments
• Typically, modelers use specialized languages such as SAS, SPSS, or R.
• Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, C++, etc.
• This can result in significant effort moving the model from the modeling environment to the deployment environment.
Ways to Deploy Models into Products/Services/Operations
• Export and import tables of scores.
• Export and import tables of parameters (see the sketch below).
• Have the product/service interact with the model as a web or message service.
• Import the models into a database.
• Embed the model into a product or service.
• Push code.
How quickly can the model be updated?
• Model parameters?
• New features?
• New pre- & post-processing?
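One lightweight version of the parameters route: the modeling side writes the fitted coefficients to a small JSON file, and the deployment side reloads that file whenever it changes. A minimal sketch, assuming a logistic-regression-style scorer; the file name and field layout are illustrative, not from the talk.

```python
import json
import math

# Modeling side: export fitted parameters (names and values are invented).
params = {"intercept": -1.2, "coefficients": {"age": 0.03, "balance": 0.0007}}
with open("model_params.json", "w") as f:
    json.dump(params, f)

# Deployment side: reload the parameter table and score, with no retraining code.
with open("model_params.json") as f:
    p = json.load(f)

def score(record):
    """Logistic score computed from the exported parameter table."""
    z = p["intercept"] + sum(
        coef * record.get(name, 0.0) for name, coef in p["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))

print(score({"age": 42, "balance": 1500.0}))  # ~0.75
```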
What is a Scoring Engine?
• A scoring engine is a component, integrated into products or enterprise IT, that deploys analytic models in operational workflows for products and services.
• A model interchange format is a format that supports the exporting of a model by one application and the importing of that model by another application.
• Model interchange formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats.
• Scoring engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
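The integrate-once, swap-models-freely idea can be illustrated with a toy scoring engine: the engine knows how to read an interchange document and evaluate it, so operations can drop in a new model file without redeploying code. A minimal sketch with a made-up JSON interchange format standing in for PMML or PFA:

```python
import json

# Write a toy "interchange document" (invented format, stands in for PMML/PFA).
with open("model_v1.json", "w") as f:
    json.dump({"type": "linear", "weights": {"age": 0.5, "income": 0.1}}, f)

class ScoringEngine:
    """Toy scoring engine: loads a model from an interchange file and
    scores records. Swapping models is just re-reading a file."""

    def __init__(self, model_path):
        self.load(model_path)

    def load(self, model_path):
        with open(model_path) as f:
            self.model = json.load(f)

    def score(self, record):
        if self.model["type"] == "linear":
            return sum(w * record.get(name, 0.0)
                       for name, w in self.model["weights"].items())
        raise ValueError("unsupported model type: " + self.model["type"])

engine = ScoringEngine("model_v1.json")         # integrate once
print(engine.score({"age": 40, "income": 80}))  # 28.0
# Updating the model in production is one file read, not a re-integration:
# engine.load("model_v2.json")
```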
[Diagram: a model producer working over the analytic infrastructure exports the model; a model consumer in analytic operations imports it. PMML & PFA serve as the interchange formats for deploying analytic models.]
Case Study 2: Scaling Bioinformatics Pipelines for the Genomic Data Commons*
*This case study describes work by the NCI Genomic Data Commons Project and the University of Chicago Center for Data Intensive Science.
TCGA dataset: 1.54 PB, consisting of 577,878 files on 14,052 cases (patients), in 42 cancer types, across 29 primary sites.
2.5+ PB of cancer genomics data, plus Bionimbus data commons technology running multiple community-developed variant-calling pipelines: over 12,000 cores and 10 PB of raw storage in 18+ racks, running for months.
AnalyticOps for the Genomic Data Commons
DevOps:
• Virtualization and the requirement for massive scale-out spawned infrastructure automation ("infrastructure as code").
• The requirement to reduce the time to deploy code created tools for continuous integration and testing.
ModelDev / AnalyticOps:
• Use virtualization/containers, infrastructure automation, and scale-out to support large-scale analytics (see the sketch below).
• Requirement: reduce the time and cost to do high-quality analytics over large amounts of data.
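As an illustration of what container-based scale-out automation can look like in this setting, here is a minimal sketch that fans a queue of genomes out across containerized pipeline runs. The image name, command, and paths are hypothetical, and a production system would use a real workflow engine and scheduler rather than a thread pool:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

GENOMES = ["case_0001.bam", "case_0002.bam", "case_0003.bam"]  # illustrative

def run_pipeline(bam):
    """Run one containerized variant-calling pipeline over one genome.
    The image and mount points are placeholders, not the GDC's actual setup.
    Requires a local docker CLI to actually execute."""
    cmd = [
        "docker", "run", "--rm",
        "-v", "/data:/data",
        "variant-caller:latest",           # hypothetical pipeline image
        "--input", "/data/" + bam,
        "--output", "/data/" + bam + ".vcf",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return bam, result.returncode

# Scale out: run pipelines concurrently and collect failures for triage.
with ThreadPoolExecutor(max_workers=4) as pool:
    for bam, rc in pool.map(run_pipeline, GENOMES):
        print(bam, "ok" if rc == 0 else "FAILED (rc=%d)" % rc)
```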
Genomic Data Commons (GDC) Files Vary Over 9 Orders of Magnitude in Size
GDC Pipelines Are Complex and Are Mostly Written by Others
Computations for a Single Genome Can Take Over a Week
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
System Loads Vary Significantly
Ten Factors Affecting AnalyticOps
• Model quality (confusion matrix; see the sketch below)
• Data quality (six dimensions)
• Lack of ground truth
• Software errors
• Workflow with monitoring
• Scheduling
• Bottlenecks, stragglers, hot spots, etc.
• Analytic configuration problems*
• System failures
• Human errors
*DMS = data-model-system
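For the first factor, a confusion matrix is the standard way to watch classifier quality in operations. A minimal sketch of computing one from logged predictions and eventually-available ground truth; the labels and outcomes are illustrative, not real GDC data:

```python
from collections import Counter

def confusion_matrix(truth, predicted, labels):
    """Count (actual, predicted) pairs; rows = truth, columns = predictions."""
    counts = Counter(zip(truth, predicted))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Invented labels and logged outcomes, for illustration only.
truth     = ["somatic", "somatic", "germline", "germline", "somatic"]
predicted = ["somatic", "germline", "germline", "germline", "somatic"]

for row in confusion_matrix(truth, predicted, ["somatic", "germline"]):
    print(row)
# [[2, 1],   somatic:  2 called correctly, 1 miscalled as germline
#  [0, 2]]   germline: 2 called correctly
```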
Monitor Data Quality and Model Performance and Summarize with Dashboards
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
AnalyticOps Dashboard
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
Data Quality: Batch Effects Can Be Significant
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
Model Quality: Differences in Three Somatic Mutation Detection Algorithms
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
Often Software Must Be Written so that It Can Be Run Efficiently in Automated Environments
• Generally, community software in bioinformatics is designed to be run manually over local clusters.
• Example: we patched one piece of software over 400 times so that it could run over 12,000 genomes. Although only 3.3% of genomes had problems, handling them required significant manual effort.
• AnalyticOps requires operating the software in automated environments (see the retry sketch below).
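One common pattern for running manually-oriented tools unattended is to wrap each invocation with retries, a timeout, and structured logging, so that failures become data for triage instead of manual interventions. A minimal sketch; the command and limits are placeholders:

```python
import logging
import subprocess
import time

logging.basicConfig(level=logging.INFO)

def run_with_retries(cmd, retries=3, timeout_s=3600, backoff_s=60):
    """Run a community tool unattended: bounded retries, a per-run timeout,
    and a log record for every failure so problems can be triaged later."""
    for attempt in range(1, retries + 1):
        try:
            subprocess.run(cmd, check=True, timeout=timeout_s)
            return True
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as e:
            logging.warning("attempt %d/%d failed: %s", attempt, retries, e)
            time.sleep(backoff_s)
    logging.error("giving up on %s", cmd)
    return False

# Placeholder command; a real pipeline step would go here.
run_with_retries(["echo", "variant-calling step"])
```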
Decide What Not to Compute
[Figure: histogram of VarScan processing rate (GB/hour) versus frequency; a tail of runs falls well below the typical rate. Manage these cases carefully.]
Model Expected Performance
[Figure: processing time versus tumor BAM size (GB). Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.]
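The idea behind this slide, modeling expected processing time as a function of input size so that stragglers can be flagged, can be sketched in a few lines. Using a least-squares fit is my assumption, and the numbers are invented:

```python
# Fit expected processing time vs. input size, then flag runs that take
# far longer than the fit predicts. Pure-Python least squares; data invented.
sizes = [10.0, 40.0, 80.0, 150.0, 300.0]   # tumor BAM size (GB)
hours = [2.0, 7.5, 16.0, 31.0, 61.0]       # observed processing time

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(hours) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, hours)) / \
        sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x

def expected_hours(size_gb):
    return intercept + slope * size_gb

def is_straggler(size_gb, observed_hours, factor=2.0):
    """Flag a run taking more than `factor` times its expected time."""
    return observed_hours > factor * expected_hours(size_gb)

print(is_straggler(100.0, 50.0))  # True: far above the fitted trend
```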
Case Study 3: Deploying Gaussian Process Models to the Industrial Internet*
*Thanks to the DMG PMML and PFA Working Groups.
Portable Format for Analytics (PFA) Standard
www.dmg.org
PFA is Based Upon Defining Primitives for Analytic Models
• What would a standard look like that:
– defines primitives for data transformations, data aggregations, and statistical and analytic models;
– supports composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data);
– is extensible;
– is "safe" to deploy in enterprise IT operational environments?
• This philosophy is different from, and complementary to, that of the Predictive Model Markup Language (PMML).
Benefits of PFA
• PFA is based upon JSON and Avro and integrates easily into modern big data environments.
• PFA allows models to be easily chained and composed.
• PFA allows developers and users of analytic systems to pre-process inputs to models and post-process their outputs.
• PFA is easily integrated with Storm, Akka, and other streaming environments.
• PFA can be used to integrate multiple tools and applications within an analytic ecosystem.
Gaussian Process Model
Example of a PFA model:
input: {type: array, items: double}
output: {type: array, items: double}
Calling the method (parameters expressed as JSON):
• input: get the interpolation point from the input
• {cell: table}: get the parameters from a table
• null: no explicit Kriging weight (universal)
• {fcn: …}: kernel function
Example of a PFA model
• This appears declarative, but it is a function call.
– The fourth parameter is another function: m.kernel.rbf (radial basis kernel, a.k.a. squared exponential).
– m.kernel.rbf was intended for SVMs, but is reusable anywhere.
– One argument (gamma) is preapplied so that it fits the signature expected by model.reg.gaussianProcess.
• Any kernel function could be used, including user-defined functions written in PFA "code."
• The Gaussian process could be used anywhere, even as a pre-processing or post-processing step.
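Putting the pieces from these slides together, the scoring part of such a PFA document can be sketched as follows. The overall shape (input/output schemas, a call to model.reg.gaussianProcess with a table cell, a null Kriging weight, and m.kernel.rbf with gamma preapplied) follows the slides; the table contents, record schema, and gamma value are invented placeholders, shown here as a Python dict for readability:

```python
import json

# Sketch of a PFA document for a Gaussian process regressor.
# Structure follows the talk; the training table and gamma are placeholders.
pfa_doc = {
    "input": {"type": "array", "items": "double"},
    "output": {"type": "array", "items": "double"},
    "cells": {
        "table": {
            "type": {"type": "array", "items": "GPRecord"},  # record schema elided
            "init": [],  # fitted training points would go here
        }
    },
    "action": {
        "model.reg.gaussianProcess": [
            "input",             # interpolation point from the input
            {"cell": "table"},   # model parameters from the table cell
            None,                # null: no explicit Kriging weight (universal)
            {"fcn": "m.kernel.rbf", "fill": {"gamma": 2.0}},  # kernel, gamma preapplied
        ]
    },
}

print(json.dumps(pfa_doc, indent=2))  # PFA is just JSON on the wire
```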
1. Team a modeler, a software engineer, and a systems engineer.
2. Instrument and monitor the analytics, software, and systems, and populate an AnalyticOps dashboard.
3. Use an automated testing and deployment environment to improve model quality.
4. Use scoring engines with languages such as PFA & PMML.
5. Put in place a data quality program.
6. For complex workloads, use workflows and schedulers (even if you think you don't need them initially) and model the scale-up.
7. Optimize the end-to-end performance of the AnalyticOps, not individual analytics.
8. Distinguish scores from actions.
9. Identify and eliminate performance hot spots, system stragglers, etc.
10. Invest in root cause analysis of AnalyticOps problems.