Real-time PMML Scoring over Spark Streaming and Storm

Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus

Feb 23, 2016
Transcript
Page 1: Real-time PMML Scoring over Spark Streaming and Storm


Real-time PMML Scoring over Spark Streaming and Storm

Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D,

Innovation Labs, Impetus

Page 2: Real-time PMML Scoring over Spark Streaming and Storm

Contents

• Big Data Computations
• Berkeley Data Analytics Stack (BDAS): Spark
• BDAS: Discretized Streams
• Real-time analytics with Storm
• PMML Primer
• Naïve Bayes Primer
• PMML Scoring for Naïve Bayes

Page 3: Real-time PMML Scoring over Spark Streaming and Storm

Big Data Computations: Computations/Operations

• Giant 1 (simple statistics) is perfect for Hadoop [1].

• Giants 2 (linear algebra), 3 (N-body), and 4 (optimization): Spark from UC Berkeley is efficient. Examples: logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares. One concrete case is the social-group-based approach to consumer churn analysis [2].

• Interactive/on-the-fly data processing: Storm.

• OLAP (data cube operations): Dremel/Drill.

• Data sets that are not embarrassingly parallel? Deep learning with artificial neural networks, e.g. machine vision from Google and speech analysis from Microsoft.

• Giant 5 (graph processing): GraphLab, Pregel, Giraph.

[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam. Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.

Page 4: Real-time PMML Scoring over Spark Streaming and Storm


Berkeley Big-data Analytics Stack (BDAS)

Page 5: Real-time PMML Scoring over Spark Streaming and Storm

BDAS: Spark

[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.

Transformations/Actions and their descriptions:

map(function f1): Pass each element of the RDD through f1 in parallel and return the resulting RDD.
filter(function f2): Select the elements of the RDD that return true when passed through f2.
flatMap(function f3): Similar to map, but f3 returns a sequence, so a single input can map to multiple outputs.
union(RDD r1): Returns the union of the RDD r1 with self.
sample(flag, p, seed): Returns a randomly sampled (with seed) p percentage of the RDD.
groupByKey(noTasks): Can only be invoked on key-value paired data; returns data grouped by key. The number of parallel tasks is given as an argument (default is 8).
reduceByKey(function f4, noTasks): Aggregates the result of applying f4 to elements with the same key. The number of parallel tasks is the second argument.
join(RDD r2, noTasks): Joins RDD r2 with self; computes all possible pairs for a given key.
groupWith(RDD r3, noTasks): Joins RDD r3 with self and groups by key.
sortByKey(flag): Sorts the self RDD in ascending or descending order based on flag.
reduce(function f5): Aggregates the result of applying f5 to all elements of the self RDD.
collect(): Returns all elements of the RDD as an array.
count(): Counts the number of elements in the RDD.
take(n): Gets the first n elements of the RDD.
first(): Equivalent to take(1).
saveAsTextFile(path): Persists the RDD in a file in HDFS (or another Hadoop-supported file system) at the given path.
saveAsSequenceFile(path): Persists the RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs whose types implement Hadoop's Writable interface or equivalent.
foreach(function f6): Runs f6 in parallel on the elements of the self RDD.
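These operators have the same shape as Scala's own collection API, which is what makes Spark programs read like ordinary Scala. A minimal local sketch with plain collections (no cluster; all values are illustrative):

```scala
// Local analogues of the RDD operations above, on plain Scala collections.
val nums = List(1, 2, 3, 4)
val doubled = nums.map(_ * 2)                      // map
val evens = nums.filter(_ % 2 == 0)                // filter
val words = List("a b", "c").flatMap(_.split(" ")) // flatMap: one input, many outputs

// reduceByKey on a local collection: group by key, then reduce each group.
val pairs = List(("a", 1), ("b", 1), ("a", 1))
val counts = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
```

The same code shape, applied to an RDD instead of a List, runs partitioned across a cluster.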

Page 6: Real-time PMML Scoring over Spark Streaming and Storm

BDAS: Discretized Streams

pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)

• Treats a stream as a series of small time-interval batch computations.
• Event-based APIs for stream handling.
• How to make the interval granularity very low (milliseconds)?
• Built over Spark RDDs: an in-memory distributed cache.
• Fault tolerance is based on RDD lineage, the series of transformations that can be stored and recomputed on failure.
• Parallel recovery: re-computations happen in parallel across the cluster.
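The "small time-interval batch" idea can be sketched with plain Scala collections: slice a timestamped event stream into fixed intervals and run an ordinary batch computation on each slice. (Timestamps, the 1-second interval, and the URLs below are illustrative, not from the deck.)

```scala
// Discretize (timestampMillis, url) events into 1-second batches,
// then count page views per URL within each batch.
val intervalMs = 1000L
val events = List(
  (100L, "/home"), (400L, "/home"), (900L, "/about"), // batch 0
  (1200L, "/home"),                                   // batch 1
  (2500L, "/about"), (2600L, "/about")                // batch 2
)

// The "discretization" step: group events into interval-sized batches.
val batches: Map[Long, List[String]] =
  events.groupBy { case (t, _) => t / intervalMs }
        .map { case (batch, es) => (batch, es.map(_._2)) }

// An ordinary batch computation per slice: count URLs in each batch.
val perBatchCounts: Map[Long, Map[String, Int]] =
  batches.map { case (b, urls) =>
    (b, urls.groupBy(identity).map { case (u, us) => (u, us.size) })
  }
```

In D-Streams each such slice is an RDD, so every batch inherits RDD lineage and parallel recovery for free.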

Page 7: Real-time PMML Scoring over Spark Streaming and Storm

BDAS: D-Streams Streaming Operators

words = sentences.flatMap(s => s.split(" "))
pairs = words.map(w => (w, 1))
counts = pairs.reduceByKey((a, b) => a + b)

• Windowing: pairs.window("5s").reduceByKey(_ + _)
• Incremental aggregation: pairs.reduceByWindow("5s", (a, b) => a + b)
• Time-skewed joins
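A window operator like the ones above reduces over the last few batch results. A local sketch, assuming one-second batches and an illustrative three-batch window (the batch contents are made up):

```scala
// Windowed aggregation: given per-batch word counts, a window is just a
// reduce over the last windowSize batch results.
val windowSize = 3 // batches per window (illustrative)
val batchCounts = List(
  Map("a" -> 1),
  Map("a" -> 2, "b" -> 1),
  Map("b" -> 3),
  Map("a" -> 1)
)

// Merge two count maps by summing values key-wise.
def mergeCounts(x: Map[String, Int], y: Map[String, Int]): Map[String, Int] =
  y.foldLeft(x) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }

// One result per window position, reduced over the sliding window.
val windowed = batchCounts.sliding(windowSize).map(_.reduce(mergeCounts)).toList
```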

Page 8: Real-time PMML Scoring over Spark Streaming and Storm

BDAS: Use Cases

• Ooyala: uses Cassandra for video data personalization. Trade-off between pre-computed aggregates and on-the-fly queries. Moved to Spark for ML and computing views, and to Shark for on-the-fly queries: OLAP aggregate queries took 130 seconds on Cassandra versus 60 ms in Spark.

• Conviva: used Hive for repeatedly running ad-hoc queries on video data. Optimized ad-hoc queries using Spark RDDs and found Spark to be 30 times faster than Hive. Uses ML for connection analysis and video-streaming optimization.

• Yahoo: advertisement targeting with 30K nodes on Hadoop YARN; Hadoop for batch processing, Spark for iterative processing, Storm for on-the-fly processing. Content recommendation via collaborative filtering.

Page 9: Real-time PMML Scoring over Spark Streaming and Storm


Real-time Analytics: R over Storm

Page 10: Real-time PMML Scoring over Spark Streaming and Storm

Real-time Analytics UC1: Internet Traffic Analysis

Page 11: Real-time PMML Scoring over Spark Streaming and Storm

Real-time Analytics UC2: Arrhythmia Detection

Page 12: Real-time PMML Scoring over Spark Streaming and Storm

PMML Primer

Predictive Model Markup Language (PMML)

• Developed by the DMG (Data Mining Group).
• An XML representation of a model.
• PMML offers a standard way to define a model, so that a model generated in tool A can be used directly in tool B.
• May contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.
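Because PMML is plain XML, even the JDK's built-in DOM parser can read a model fragment. A minimal sketch (this is not a scoring engine, and the fragment below is abbreviated from the deck's later slides):

```scala
// Read the field names out of a PMML DataDictionary with the JDK DOM parser.
// A real engine (e.g. one built on a PMML library) would do far more; this
// only demonstrates that the model is ordinary, inspectable XML.
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory

val pmmlFragment =
  """<DataDictionary numberOfFields="2">
    |  <DataField name="Class" optype="categorical" dataType="string"/>
    |  <DataField name="V1" optype="categorical" dataType="string"/>
    |</DataDictionary>""".stripMargin

val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  .parse(new ByteArrayInputStream(pmmlFragment.getBytes("UTF-8")))

val fields = doc.getElementsByTagName("DataField")
val names = (0 until fields.getLength).map(i =>
  fields.item(i).getAttributes.getNamedItem("name").getNodeValue).toList
```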

Page 13: Real-time PMML Scoring over Spark Streaming and Storm

Naïve Bayes Primer

• A simple probabilistic classifier based on Bayes' theorem:

  P(Y | X1, ..., Xn) = P(Y) * P(X1, ..., Xn | Y) / P(X1, ..., Xn)

  i.e. posterior = prior * likelihood / normalization constant.

• Given features X1, X2, ..., Xn, predict a label Y by calculating the probability for all possible values of Y.
Page 14: Real-time PMML Scoring over Spark Streaming and Storm

PMML Scoring for Naïve Bayes

• Wrote a PMML-based scoring engine for the Naïve Bayes algorithm. In principle it can be used from any data-processing framework by invoking its API.
• Deployed a Naïve Bayes PMML model generated from R into the Storm, Spark, and Samza frameworks.
• Real-time predictions with the above APIs.
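The deck does not show the engine's interface. One hypothetical shape for such a framework-agnostic scoring API (all names ours, purely illustrative) might be:

```scala
// Hypothetical framework-agnostic scoring interface: a Storm bolt, a Spark
// Streaming map, or a Samza task would each call score() once per record.
trait PmmlScorer {
  def score(record: Map[String, String]): String
}

// Toy implementation that always predicts one class, only to show the call
// shape; a real engine would evaluate the loaded PMML model.
class ConstantScorer(cls: String) extends PmmlScorer {
  def score(record: Map[String, String]): String = cls
}

val scorer: PmmlScorer = new ConstantScorer("democrat")
val prediction = scorer.score(Map("V1" -> "n", "V2" -> "y", "V3" -> "y"))
```

Keeping the scorer behind a small trait is what lets the same model run unchanged under Storm, Spark, and Samza.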

Page 15: Real-time PMML Scoring over Spark Streaming and Storm

<DataDictionary numberOfFields="4">
  <DataField name="Class" optype="categorical" dataType="string">
    <Value value="democrat"/>
    <Value value="republican"/>
  </DataField>
  <DataField name="V1" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V2" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V3" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
</DataDictionary>

(ctd on the next slide)

PMML Scoring for Naïve Bayes


Page 16: Real-time PMML Scoring over Spark Streaming and Storm

<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
  <MiningSchema>
    <MiningField name="Class" usageType="predicted"/>
    <MiningField name="V1" usageType="active"/>
    <MiningField name="V2" usageType="active"/>
    <MiningField name="V3" usageType="active"/>
  </MiningSchema>
  <Output>
    <OutputField name="Predicted_Class" feature="predictedValue"/>
    <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
    <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
  </Output>
  <BayesInputs>

(ctd on the next page)

PMML Scoring for Naïve Bayes


Page 17: Real-time PMML Scoring over Spark Streaming and Storm

PMML Scoring for Naïve Bayes


<BayesInputs>
  <BayesInput fieldName="V1">
    <PairCounts value="n">
      <TargetValueCounts>
        <TargetValueCount value="democrat" count="51"/>
        <TargetValueCount value="republican" count="85"/>
      </TargetValueCounts>
    </PairCounts>
    <PairCounts value="y">
      <TargetValueCounts>
        <TargetValueCount value="democrat" count="73"/>
        <TargetValueCount value="republican" count="23"/>
      </TargetValueCounts>
    </PairCounts>
  </BayesInput>
  <BayesInput fieldName="V2"> ... </BayesInput>
  <BayesInput fieldName="V3"> ... </BayesInput>
</BayesInputs>
<BayesOutput fieldName="Class">
  <TargetValueCounts>
    <TargetValueCount value="democrat" count="124"/>
    <TargetValueCount value="republican" count="108"/>
  </TargetValueCounts>
</BayesOutput>

Page 18: Real-time PMML Scoring over Spark Streaming and Storm

PMML Scoring for Naïve Bayes

Definition of elements:

• DataDictionary: definitions for the fields as used in mining models (Class, V1, V2, V3).
• NaiveBayesModel: indicates that this is a Naïve Bayes PMML model.
• MiningSchema: lists the fields as used in that model. Class is the "predicted" field; V1, V2, and V3 are "active" predictor fields.
• Output: describes the set of result values that can be returned from the model.

Page 19: Real-time PMML Scoring over Spark Streaming and Storm

PMML Scoring for Naïve Bayes

Definition of elements (continued):

• BayesInputs: for each input field, contains the counts of each field value per output class.
• BayesOutput: contains the counts associated with the values of the target field.

Page 20: Real-time PMML Scoring over Spark Streaming and Storm

Sample input:

Eg1: n y y n y y n n n n n n y y y y
Eg2: n y n y y y n n n n n y y y n y

• The 1st, 2nd, and 3rd columns are the predictor variables (attribute "name" in the MiningField elements).
• Using these, we predict whether the output is democrat or republican (PMML element BayesOutput).

PMML Scoring for Naïve Bayes
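Turning one of the sample lines above into a feature record for scoring can be sketched as follows (the field names come from the model's MiningSchema; the parsing helper is ours):

```scala
// Parse a space-separated vote record and keep the first three columns,
// which correspond to the model's active fields V1, V2, V3.
val eg1 = "n y y n y y n n n n n n y y y y"
val cols = eg1.split(" ").toList
val record: Map[String, String] =
  List("V1", "V2", "V3").zip(cols.take(3)).toMap
```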

Page 21: Real-time PMML Scoring over Spark Streaming and Storm

PMML Scoring for Naïve Bayes

• 3-node Xeon Storm cluster (8 quad-core CPUs, 32 GB RAM, 32 GB swap space; 1 Nimbus, 2 Supervisors)

Number of records (millions) | Time taken (seconds)
0.1 | 4
0.4 | 7
1.0 | 12
2.0 | 21
10  | 129
25  | 310

Page 22: Real-time PMML Scoring over Spark Streaming and Storm

PMML Scoring for Naïve Bayes

• 3-node Xeon Spark cluster (8 quad-core CPUs, 32 GB RAM, 32 GB swap space)

Number of records (millions) | Time taken
0.1 | 1 min 47 sec
0.2 | 3 min 35 sec
0.4 | 6 min 40 sec
1.0 | 35 min 17 sec
10  | more than 3 hrs

Page 23: Real-time PMML Scoring over Spark Streaming and Storm

Thank You!

Mail: [email protected]
LinkedIn: http://in.linkedin.com/in/vijaysrinivasagneeswaran
Blogs: blogs.impetus.com
Twitter: @a_vijaysrinivas

Page 24: Real-time PMML Scoring over Spark Streaming and Storm

Back up slides


Page 25: Real-time PMML Scoring over Spark Streaming and Storm

Representation of an RDD:

Information | HadoopRDD | FilteredRDD | JoinedRDD
Set of partitions | 1 per HDFS block | Same as parent | 1 per reduce task
Set of dependencies | None | 1-to-1 on parent | Shuffle on each parent
Function to compute data set based on parents | Read corresponding block | Compute parent and filter it | Read and join shuffled data
Meta-data on location (preferredLocations) | HDFS block location from namenode | None (same as parent) | None
Meta-data on partitioning (partitioningScheme) | None | None | HashPartitioner
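The dependency and compute-function columns above are what make recovery cheap: an RDD stores only its parent and the function needed to recompute its data. A toy sketch of that idea (class names ours, not Spark's):

```scala
// Toy lineage: an RDD remembers its parent and the transformation, so a lost
// partition can be recomputed from the parent instead of being replicated.
sealed trait ToyRDD { def compute(): List[Int] }

case class SourceRDD(data: List[Int]) extends ToyRDD {
  def compute(): List[Int] = data
}

case class MappedRDD(parent: ToyRDD, f: Int => Int) extends ToyRDD {
  // No materialized data: recomputed from the parent on demand.
  def compute(): List[Int] = parent.compute().map(f)
}

val base = SourceRDD(List(1, 2, 3))
val derived = MappedRDD(MappedRDD(base, _ + 1), _ * 10)
// "Failure recovery": just replay the lineage.
val recovered = derived.compute()
```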

Page 26: Real-time PMML Scoring over Spark Streaming and Storm


Logistic Regression: Spark VS Hadoop

http://spark-project.org

Page 27: Real-time PMML Scoring over Spark Streaming and Storm

Some Spark(ling) examples

Scala code (serial):

var count = 0
for (i <- 1 to 100000) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count / 100000.0)

Sample random points in the unit square and count how many fall inside the unit circle: the fraction is roughly PI/4, which gives an approximate value for PI.

Based on PS/PC = AS/AC = 4/PI (points in and areas of the square and circle), so PI = 4 * (PC/PS).

Page 28: Real-time PMML Scoring over Spark Streaming and Storm

Some Spark(ling) examples

Spark code (parallel):

val spark = new SparkContext(<Mesos master>)
var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 100000, 12)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count / 100000.0)

Notable points:
1. A SparkContext is created; it talks to the Mesos[1] master.
2. count becomes a shared variable: an accumulator.
3. parallelize breaks the Scala range object (1 to 100000) into an RDD of 12 slices.
4. The for loop invokes the foreach method of the RDD, which runs in parallel.

[1] Mesos is an Apache-incubated clustering system: http://mesosproject.org

Page 29: Real-time PMML Scoring over Spark Streaming and Storm

Logistic Regression in Spark: Serial Code

// Read the data file and convert it into Point objects
val lines = scala.io.Source.fromFile("data.txt").getLines()
val points = lines.map(x => parsePoint(x))

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = Vector.zeros(D)
  for (p <- points) {
    val scale = (1 / (1 + Math.exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Result: " + w)
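The scale term in the inner loop is the pointwise gradient of the logistic loss. Writing \sigma(z) = 1/(1 + e^{-z}) with labels y_i in {-1, +1}, the loop accumulates:

```latex
L(w) = \sum_i \log\left(1 + e^{-y_i \, w \cdot x_i}\right)
\qquad
\nabla L(w) = \sum_i \left(\sigma(y_i \, w \cdot x_i) - 1\right) y_i \, x_i
```

Each scale * p.x term is one summand of \nabla L(w), so w -= gradient is one gradient-descent step (with unit step size).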

Page 30: Real-time PMML Scoring over Spark Streaming and Storm

Logistic Regression in Spark

// Read the data file and transform it into Point objects
val spark = new SparkContext(<Mesos master>)
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1 / (1 + Math.exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)

// Run logistic regressionvar w = Vector.random(D)for (i <- 1 to ITERATIONS) { val gradient = spark.accumulator(Vector.zeros(D)) for (p <- points) { val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y gradient += scale * p.x } w -= gradient.value}println("Result: " + w)