End-to-end analytics with Apache Spark
Sandy Ryza

End-to-end analytics with Apache Spark. Sandy Ryza, Data scientist at Cloudera ... Model serving consumes PMML. Most common use is recommendation.

Mar 07, 2018

trankhue

End-to-end analytics with Apache Spark

Sandy Ryza

Me

● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, committing on Apache Hadoop
● Before that, studying combinatorial optimization and distributed systems at Brown

Large Scale Learning

What for?

Detect Things That Will Go Wrong

● Churn prediction
● Detect machine failures

Identify Bad Actors

● Network intruders
● Payment fraudsters
● Adversarial advertisers
● Insurance claim grifters

Provide Recommendations

● Movies to stream
● Music to stream
● Products to buy
● Ads to serve
● People to date

The Lab and the Factory

The Lab
● Question-driven
● Interactive
● Fixed data
● Output -> report or in-database scoring engine

The Factory
● Metric-driven
● Automated
● Fluid data
● Output -> production system that makes customer-facing decisions

What does it mean to productionize your machine learning?

Some models can be safely applied in batch

● Run your churn predictor every day and act on it at night

Most use cases need real time serving

● Catch bad actors before they do bad stuff

● Make recommendations upon site visit

Recommendations need real time updates

Infrastructure

Model Building

Model Serving

Model Updating

Oryx

● https://github.com/cloudera/oryx
● Focused on building real-time applications using machine learning
● Model building and model serving infrastructure
● Model serving consumes PMML
● Most common use is recommendation
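
To make "model serving consumes PMML" concrete, here is a minimal sketch of a serving process that loads a linear model from a PMML document and scores a request. The PMML fragment is hand-written and simplified for illustration; it is not Oryx output. The field names (`tenure`, `support_calls`, `churn_score`) and the helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified PMML fragment (illustrative only, not Oryx output).
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header/>
  <DataDictionary numberOfFields="3">
    <DataField name="tenure" optype="continuous" dataType="double"/>
    <DataField name="support_calls" optype="continuous" dataType="double"/>
    <DataField name="churn_score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="tenure"/>
      <MiningField name="support_calls"/>
      <MiningField name="churn_score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="tenure" coefficient="-0.25"/>
      <NumericPredictor name="support_calls" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# Namespace map for the PMML 4.2 schema used in the fragment above.
NS = {"p": "http://www.dmg.org/PMML-4_2"}


def load_linear_model(pmml_text):
    """Extract (intercept, {feature: coefficient}) from a RegressionModel."""
    root = ET.fromstring(pmml_text)
    table = root.find("./p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        pred.get("name"): float(pred.get("coefficient"))
        for pred in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, features):
    """Apply the linear model to one request's feature values."""
    intercept, coefficients = model
    return intercept + sum(c * features[name] for name, c in coefficients.items())


model = load_linear_model(PMML_DOC)
print(score(model, {"tenure": 2.0, "support_calls": 3.0}))  # 1.5
```

The point of the format is the split it allows: the model builder writes the document, and the server only needs a parser and a scorer, not the training code.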

Oryx 1.0

● Model building
  ○ Custom MapReduce algorithms
● Model update
  ○ Partitioned by user
  ○ Local to each serving daemon
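
A rough sketch of what "partitioned by user, local to each serving daemon" could look like. The slide does not say how Oryx 1.0 assigns users to daemons, so the hash-based routing here is an assumption, and `ServingDaemon`, `daemon_for_user`, and `route_update` are hypothetical names.

```python
import zlib


def daemon_for_user(user_id, num_daemons):
    """Map a user to a daemon index with a stable hash (assumed scheme)."""
    # zlib.crc32 is used instead of hash() because hash() of a str is
    # randomized per process, which would break routing across restarts.
    return zlib.crc32(user_id.encode("utf-8")) % num_daemons


class ServingDaemon:
    """Holds model updates only for the users routed to it."""

    def __init__(self):
        self.user_items = {}

    def update(self, user_id, item_id):
        # The update stays in this daemon's memory; nothing is broadcast.
        self.user_items.setdefault(user_id, set()).add(item_id)


daemons = [ServingDaemon() for _ in range(3)]


def route_update(user_id, item_id):
    """Send one new interaction to the daemon that owns this user."""
    index = daemon_for_user(user_id, len(daemons))
    daemons[index].update(user_id, item_id)
    return index
```

Because every event for a given user reaches the same daemon, that daemon can fold the update into its local state without coordinating with the others.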

Oryx 1.0

Algorithms - one of each

● Recommendation
  ○ Alternating least squares for collaborative filtering
● Classification
  ○ Random decision forests
● Clustering
  ○ K-means

MLLib

Supervised, discrete: Classification
● Logistic regression (and regularized variants)
● Linear SVM
● Naive Bayes
● Random Decision Forests (soon)

Supervised, continuous: Regression
● Linear regression (and regularized variants)

Unsupervised, discrete: Clustering
● K-means

Unsupervised, continuous: Dimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares

Oryx 2.0

● Replace MR algorithms with MLLib
● Replace real-time update with Spark Streaming

“Lambda Architecture”

● Periodically train on whole data
● Incremental updates with new data
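
The two bullets above can be sketched with a deliberately tiny model: a per-item mean rating. This is a stand-in chosen so that both paths fit in a few lines; it is not the ALS model Oryx uses, and `ItemMeanModel` and `train_batch` are hypothetical names.

```python
from collections import defaultdict


class ItemMeanModel:
    """Toy model: predicts an item's mean rating from running totals."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, item, rating):
        # Incremental path: fold one new event into the existing model.
        self.sums[item] += rating
        self.counts[item] += 1

    def predict(self, item):
        n = self.counts.get(item, 0)
        if n == 0:
            return None  # No data for this item yet.
        return self.sums[item] / n


def train_batch(events):
    """Batch path: rebuild the model from the whole event log."""
    model = ItemMeanModel()
    for item, rating in events:
        model.update(item, rating)
    return model


# Periodic batch build over everything seen so far.
event_log = [("a", 4.0), ("a", 2.0), ("b", 5.0)]
model = train_batch(event_log)

# Between batch runs, a new event is logged and applied incrementally.
event_log.append(("a", 6.0))
model.update("a", 6.0)

# The next batch run retrains on the full log and replaces the model.
model = train_batch(event_log)
```

For this toy model the incremental result and the retrained result are identical. With a real model the incremental update is usually an approximation, which is why the periodic full retrain is kept.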

Oryx 2.0

[Architecture diagram. Components shown: Input (Kafka / HDFS); Model Builder (Batch - Spark / MLLib; Updates - Spark Streaming) running on Spark Executors; Kafka; Model Servers.]

What could go into MLLib?

● PMML output
● Model update
● Hyper-parameter tuning
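
As an illustration of the last bullet, here is a minimal grid search over one regularization parameter, scored on held-out data. It uses a one-feature ridge regression with a closed-form fit so the sketch stays self-contained; it does not use or describe any MLLib API, and all function names are hypothetical.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    # Assumes sum(x*x) + lam is non-zero (some non-zero x, or lam > 0).
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, valid, grid):
    """Fit once per candidate value and keep the lowest validation error."""
    best = None
    for lam in grid:
        w = fit_ridge_1d(train[0], train[1], lam)
        err = mse(w, valid[0], valid[1])
        if best is None or err < best[1]:
            best = (lam, err, w)
    return best  # (chosen lam, validation error, fitted weight)


train = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
valid = ([4.0], [8.0])
best_lam, best_err, best_w = grid_search(train, valid, [0.0, 1.0, 10.0])
```

Each candidate is an independent fit, so the loop parallelizes naturally, which is part of why this kind of tuning is a plausible fit for a distributed library.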

Contributions?

● https://github.com/cloudera/oryx