Building Machine Learning Applications with Sparkling Water 04/15/2015 Meetup Michal Malohlava and Alex Tellez and H2O.ai


Jul 17, 2015


Sri Ambati
Transcript
Page 1: Sparkling Water Meetup 4.15.15

Building Machine Learning Applications with Sparkling Water

04/15/2015 Meetup

Michal Malohlava and Alex Tellez and H2O.ai

Page 2: Sparkling Water Meetup 4.15.15

TBD Head of Sales

Distributed Systems Engineers Making ML Scale!


Page 3: Sparkling Water Meetup 4.15.15

Scalable Machine Learning

For Smarter Applications

Page 4: Sparkling Water Meetup 4.15.15

Smarter Applications

Page 5: Sparkling Water Meetup 4.15.15

Scalable Applications

Distributed

Able to process huge amounts of data from different sources

Easy to develop and experiment

Powerful machine learning engine inside

Page 6: Sparkling Water Meetup 4.15.15

BUT how to build them?

Page 7: Sparkling Water Meetup 4.15.15

Build an application with …

?

Page 8: Sparkling Water Meetup 4.15.15

…with Spark and H2O

Page 9: Sparkling Water Meetup 4.15.15

Open-source distributed execution platform

User-friendly API for data transformation based on RDD

Platform components - SQL, MLlib, text mining

Multitenancy

Large and active community

Page 10: Sparkling Water Meetup 4.15.15

Open-source scalable machine learning platform

Tuned for efficient computation and memory use

Production ready machine learning algorithms

R, Python, Java, Scala APIs

Interactive UI, robust data parser

Page 11: Sparkling Water Meetup 4.15.15

Supervised Learning

Statistical Analysis

• Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie

• Cox Proportional Hazards Models

• Naïve Bayes

Ensembles

• Distributed Random Forest: Classification or regression models

• Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations

Deep Neural Networks

• Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Unsupervised Learning

Clustering

• K-means: Partitions observations into k clusters/groups of the same spatial size

Dimensionality Reduction

• Principal Component Analysis: Linearly transforms correlated variables to independent components

Anomaly Detection

• Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning

Page 12: Sparkling Water Meetup 4.15.15
Page 13: Sparkling Water Meetup 4.15.15

Sparkling Water Provides

Transparent integration of H2O with Spark ecosystem

Transparent use of H2O data structures and algorithms with Spark API

Excels in existing Spark workflows requiring advanced Machine Learning algorithms

Platform for building Smarter Applications

Page 14: Sparkling Water Meetup 4.15.15

Sparkling Water Design

[Diagram: spark-submit sends the Sparkling App to the Spark Master JVM, which coordinates three Spark Worker JVMs. Each worker runs a Spark Executor JVM with H2O inside; together these executors form the Sparkling Water Cluster.]

Sparkling App: a regular Spark application that also contains Sparkling Water classes

Page 15: Sparkling Water Meetup 4.15.15

Data Distribution

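[Diagram: a Data Source (e.g. HDFS) feeds the Sparkling Water Cluster, which consists of three Spark Executor JVMs, each hosting H2O. Inside an executor, a Spark RDD and an H2O RDD sit side by side; toH2OFrame converts a Spark RDD into H2O data and toRDD converts it back.]

RDDs and DataFrames share the same memory space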

Page 16: Sparkling Water Meetup 4.15.15

Let's build an application!

Page 17: Sparkling Water Meetup 4.15.15

OR

Detect spam text messages

Page 18: Sparkling Water Meetup 4.15.15

Data example

case class SMS(target: String, fv: Vector)

target: ham / spam; fv: feature vector

Page 19: Sparkling Water Meetup 4.15.15

ML Workflow

1. Extract data

2. Transform, tokenize messages

3. Build Tf-IDF

4. Create and evaluate Deep Learning model

5. Score new messages

Goal: For a given text message identify if it is spam or not

Page 20: Sparkling Water Meetup 4.15.15

Application environment

sparkling-shell

Page 21: Sparkling Water Meetup 4.15.15

Lego #1: Data load

import org.apache.spark.rdd.RDD

// Data load
def load(dataFile: String): RDD[Array[String]] = {
  // Split each line on tabs and drop rows with an empty response
  sc.textFile(dataFile).map(l => l.split("\t"))
    .filter(r => !r(0).isEmpty)
}

Produces [response,message]
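To make the [response, message] shape concrete, the following is a minimal sketch of the same split-and-filter logic applied to an in-memory collection, so it can be tried without a Spark context. The object LoadSketch, the helper parseLines, and the sample lines are hypothetical and not part of the original slides.

```scala
object LoadSketch {
  // Same split-and-filter logic as load(), applied to an in-memory Seq
  // instead of sc.textFile(...). All names here are illustrative only.
  def parseLines(lines: Seq[String]): Seq[Array[String]] =
    lines.map(l => l.split("\t"))
         .filter(r => !r(0).isEmpty)

  // Hypothetical sample input in the form "<response>\t<message>"
  val sample: Seq[String] = Seq(
    "ham\tSee you at lunch",
    "spam\tWIN a free prize now",
    "\tline with an empty response"
  )
}
```

Calling LoadSketch.parseLines(LoadSketch.sample) keeps the first two rows and drops the third, whose response field is empty.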

Page 22: Sparkling Water Meetup 4.15.15

Lego #2: Ad-hoc Tokenization

def tokenize(data: RDD[String]): RDD[Seq[String]] = {
  val ignoredWords = Seq("the", "a", …)
  val ignoredChars = Seq(',', ':', …)
  val texts = data.map( r => {
    var smsText = r.toLowerCase
    // Replace ignored characters with spaces
    for( c <- ignoredChars) {
      smsText = smsText.replace(c, ' ')
    }
    // Keep distinct words that are not ignored and are longer than 2 characters
    val words = smsText.split(" ").filter(w => !ignoredWords.contains(w) && w.length>2).distinct
    words.toSeq
  })
  texts
}

Message Bag of words
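As an illustration of what the tokenizer does to one message, here is a small sketch of the same steps on a single string. The slide elides the full ignore lists, so the contents of ignoredWords and ignoredChars below are assumptions chosen only for the example, as are the names TokenizeSketch and tokenizeOne.

```scala
object TokenizeSketch {
  // Assumed, illustrative lists: the slide does not show the full contents.
  val ignoredWords: Seq[String] = Seq("the", "a", "for")
  val ignoredChars: Seq[Char] = Seq(',', ':', '.', '!', '?')

  // Same steps as tokenize(), for one message rather than an RDD:
  // lowercase, blank out ignored characters, split on spaces,
  // drop ignored and short words, and keep each word once.
  def tokenizeOne(msg: String): Seq[String] = {
    val cleaned = ignoredChars.foldLeft(msg.toLowerCase) {
      (text, c) => text.replace(c, ' ')
    }
    cleaned.split(" ")
      .filter(w => !ignoredWords.contains(w) && w.length > 2)
      .distinct
      .toSeq
  }
}
```

With these assumed lists, TokenizeSketch.tokenizeOne("Thank you for the order, the order shipped!") yields the bag of words thank, you, order, shipped.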

Page 23: Sparkling Water Meetup 4.15.15

Lego #3: Tf-IDF

import org.apache.spark.mllib.feature.{HashingTF, IDF, IDFModel}
import org.apache.spark.mllib.linalg.Vector

def buildIDFModel(tokens: RDD[Seq[String]],
                  minDocFreq: Int = 4,
                  hashSpaceSize: Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = {
  // Hash strings into the given space
  val hashingTF = new HashingTF(hashSpaceSize)
  val tf = hashingTF.transform(tokens)
  // Build term frequency-inverse document frequency
  val idfModel = new IDF(minDocFreq = minDocFreq).fit(tf)
  val expandedText = idfModel.transform(tf)
  (hashingTF, idfModel, expandedText)
}

Hash words into large space

Term freq scale

Bag of words → hashing functions → term freq model → feature vectors representing text

Example: "Thank for the order…" becomes […,0,3.5,0,1,0,0.3,0,1.3,0,0,…], where words such as "Thank" and "Order" map to non-zero entries
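The hashing and scaling steps above can be sketched in plain Scala without Spark. This is only an approximation of the idea behind HashingTF and IDF, not MLlib's implementation: the object TfIdfSketch and its helpers are hypothetical, String.hashCode stands in for the hashing function, and the weight log((m + 1) / (df + 1)) follows the IDF formula given in the MLlib documentation.

```scala
object TfIdfSketch {
  // Map a term to a bucket in [0, size) using its hash code.
  def bucket(term: String, size: Int): Int = {
    val raw = term.hashCode % size
    if (raw < 0) raw + size else raw
  }

  // Term frequencies as a sparse map: bucket index -> count.
  def termFreq(tokens: Seq[String], size: Int): Map[Int, Double] =
    tokens.groupBy(t => bucket(t, size))
          .map { case (i, ts) => i -> ts.size.toDouble }

  // IDF weight per bucket: log((m + 1) / (df + 1)), where m is the number
  // of documents and df the number of documents containing the bucket.
  // Buckets seen in fewer than minDocFreq documents get weight 0.
  def idf(docs: Seq[Map[Int, Double]], minDocFreq: Int): Map[Int, Double] = {
    val m = docs.size
    val df = docs.flatMap(_.keys).groupBy(identity)
                 .map { case (i, xs) => i -> xs.size }
    df.map { case (i, d) =>
      i -> (if (d >= minDocFreq) math.log((m + 1.0) / (d + 1.0)) else 0.0)
    }
  }

  // Scale each term frequency by the IDF weight of its bucket.
  def tfIdf(tf: Map[Int, Double], idfW: Map[Int, Double]): Map[Int, Double] =
    tf.map { case (i, f) => i -> f * idfW.getOrElse(i, 0.0) }
}
```

With hashSpaceSize = 1 << 10 as on the slide, every message becomes a vector over 1024 buckets. Distinct words can collide in the same bucket, which is the price paid for not keeping a dictionary.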

Page 24: Sparkling Water Meetup 4.15.15

Lego #4: Build a model

def buildDLModel(train: Frame, valid: Frame,
                 epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
                 hidden: Array[Int] = Array[Int](200, 200))
                (implicit h2oContext: H2OContext): DeepLearningModel = {
  import h2oContext._
  // Build a model
  val dlParams = new DeepLearningParameters()
  dlParams._destination_key = Key.make("dlModel.hex").asInstanceOf[Key[Frame]]
  dlParams._train = train
  dlParams._valid = valid
  dlParams._response_column = 'target
  dlParams._epochs = epochs
  dlParams._l1 = l1
  dlParams._l2 = l2
  dlParams._hidden = hidden
  // Create a job
  val dl = new DeepLearning(dlParams)
  val dlModel = dl.trainModel.get
  // Compute metrics on both datasets
  dlModel.score(train).delete()
  dlModel.score(valid).delete()
  dlModel
}

Deep Learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

H2O model builder API

Do final scoring on train/test data
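To illustrate what "an input layer followed by multiple layers of nonlinear transformations" means, here is a minimal forward pass in plain Scala. It is a sketch of the general technique only, not H2O's Deep Learning implementation: the names FeedForwardSketch, Layer, activate and forward are hypothetical, the tanh activation is an assumption, and there is no training step.

```scala
object FeedForwardSketch {
  // One dense layer: weights(i)(j) connects input j to output unit i.
  final case class Layer(weights: Array[Array[Double]], bias: Array[Double])

  // Affine transform of the input followed by a tanh nonlinearity.
  def activate(layer: Layer, input: Array[Double]): Array[Double] =
    layer.weights.zip(layer.bias).map { case (row, b) =>
      math.tanh(row.zip(input).map { case (w, x) => w * x }.sum + b)
    }

  // Feed the input through every layer in order.
  def forward(layers: Seq[Layer], input: Array[Double]): Array[Double] =
    layers.foldLeft(input)((x, layer) => activate(layer, x))
}
```

In buildDLModel, hidden = Array(200, 200) would correspond to two such hidden layers of 200 units each, sitting between the TF-IDF input vector and the ham/spam output.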

Page 25: Sparkling Water Meetup 4.15.15

Assembly application

// Data load
val data = load(DATAFILE)
// Extract response spam or ham
val hamSpam = data.map( r => r(0))
val message = data.map( r => r(1))
// Tokenize message content
val tokens = tokenize(message)
// Build IDF model
val (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)
// Merge response with extracted vectors
val resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))
val table: DataFrame = resultRDD
// Split table
val keys = Array[String]("train.hex", "valid.hex")
val ratios = Array[Double](0.8)
val frs = split(table, keys, ratios)
val (train, valid) = (frs(0), frs(1))
table.delete()
// Build a model
val dlModel = buildDLModel(train, valid)

Split dataset

Build model

Data munging

Page 26: Sparkling Water Meetup 4.15.15

Data exploration

Page 27: Sparkling Water Meetup 4.15.15

Model evaluation

val trainMetrics = binomialMM(dlModel, train)
val validMetrics = binomialMM(dlModel, valid)

Collect Model Metrics
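The slides do not show what the collected metrics contain. As a rough illustration of the kind of summary a two-class (ham/spam) evaluation produces, the sketch below counts a confusion matrix from actual and predicted labels. It is not the binomialMM helper or H2O's metrics code; BinomialMetricsSketch, Confusion and confusion are hypothetical names, and treating "spam" as the positive class is an assumption.

```scala
object BinomialMetricsSketch {
  // Counts of true/false positives and negatives, with "spam" as positive.
  final case class Confusion(tp: Int, fp: Int, tn: Int, fn: Int) {
    def accuracy: Double = (tp + tn).toDouble / (tp + fp + tn + fn)
    def precision: Double = tp.toDouble / (tp + fp)
    def recall: Double = tp.toDouble / (tp + fn)
  }

  // Pair each actual label with its prediction and count the four outcomes.
  def confusion(actual: Seq[String], predicted: Seq[String]): Confusion = {
    val pairs = actual.zip(predicted)
    Confusion(
      tp = pairs.count { case (a, p) => a == "spam" && p == "spam" },
      fp = pairs.count { case (a, p) => a == "ham" && p == "spam" },
      tn = pairs.count { case (a, p) => a == "ham" && p == "ham" },
      fn = pairs.count { case (a, p) => a == "spam" && p == "ham" }
    )
  }
}
```

Comparing such numbers on the training and validation frames is what reveals overfitting: a model that scores far better on train than on valid has memorized rather than generalized.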

Page 28: Sparkling Water Meetup 4.15.15

Spam predictor

def isSpam(msg: String,
           dlModel: DeepLearningModel,
           hashingTF: HashingTF,
           idfModel: IDFModel,
           hamThreshold: Double = 0.5): Boolean = {
  val msgRdd = sc.parallelize(Seq(msg))
  // Apply the same tokenize, hash and IDF steps used for training
  val msgVector: SchemaRDD = idfModel.transform(
                               hashingTF.transform(
                                 tokenize(msgRdd))).map(v => SMS("?", v))
  val msgTable: DataFrame = msgVector
  msgTable.remove(0) // remove first column
  val prediction = dlModel.score(msgTable)
  prediction.vecs()(1).at(0) < hamThreshold
}

Prepared models

Default decision threshold

Scoring
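The last line of isSpam is the whole decision rule, and it can be isolated as a tiny sketch. The assumption here, suggested by the parameter name hamThreshold, is that the value read from the prediction is the model's probability that the message is ham; ThresholdSketch and isSpamGiven are hypothetical names.

```scala
object ThresholdSketch {
  // pHam is assumed to be the predicted probability of "ham".
  // A message is flagged as spam when that probability falls
  // below the threshold (0.5 by default, as on the slide).
  def isSpamGiven(pHam: Double, hamThreshold: Double = 0.5): Boolean =
    pHam < hamThreshold
}
```

Raising the threshold flags more messages as spam: a message with a ham probability of 0.6 passes at the default 0.5 but is flagged at 0.7.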

Page 29: Sparkling Water Meetup 4.15.15

Predict spam

isSpam("Michal, beer tonight in MV?", dlModel, hashingTF, idfModel)

isSpam("We tried to contact you re your reply to our offer of a Video Handset? 750 anytime any networks mins? UNLIMITED TEXT?", dlModel, hashingTF, idfModel)

Page 30: Sparkling Water Meetup 4.15.15

Interactions with application from R

You need to install H2O’s R-package from:

http://h2o-release.s3.amazonaws.com/h2o-dev/master/1109/index.html#R

Page 31: Sparkling Water Meetup 4.15.15

Use app from R

Collect Model Metrics

Page 32: Sparkling Water Meetup 4.15.15

Where is the code?

https://github.com/h2oai/sparkling-water/blob/master/examples/scripts/

Page 33: Sparkling Water Meetup 4.15.15

Sparkling Water Download

http://h2o.ai/download/

http://h2o-release.s3.amazonaws.com/sparkling-water/master/93/index.html

Page 34: Sparkling Water Meetup 4.15.15

Check out H2O.ai Training Books

http://learn.h2o.ai/

Check out H2O.ai Blog

http://h2o.ai/blog/

Check out H2O.ai Youtube Channel

https://www.youtube.com/user/0xdata

Check out GitHub

https://github.com/h2oai/sparkling-water

Meetups

https://meetup.com/

More info

Page 35: Sparkling Water Meetup 4.15.15

Thank you!

Sparkling Water is an open-source ML application platform combining the power of Spark and H2O

Learn more at h2o.ai

Follow us at @h2oai