Building Machine Learning Applications with Sparkling Water MLConf 2015 NYC Michal Malohlava and Alex Tellez and H2O.ai
Page 1: 2015 03 27_ml_conf

Building Machine Learning Applications with Sparkling Water

MLConf 2015 NYC

Michal Malohlava and Alex Tellez and H2O.ai

Page 2: 2015 03 27_ml_conf

TBD Head of Sales

Distributed Systems Engineers Making ML Scale!


Page 3: 2015 03 27_ml_conf

Scalable Machine Learning

For Smarter Applications

Page 4: 2015 03 27_ml_conf

Smarter Applications

Page 5: 2015 03 27_ml_conf

Scalable Applications

Distributed

Able to process huge amounts of data from different sources

Easy to develop and experiment

Powerful machine learning engine inside

Page 6: 2015 03 27_ml_conf

BUT how to build them?

Page 7: 2015 03 27_ml_conf

Build an application with …

?

Page 8: 2015 03 27_ml_conf

…with Spark and H2O

Page 9: 2015 03 27_ml_conf

Open-source distributed execution platform

User-friendly API for data transformation based on RDD

Platform components - SQL, MLlib, text mining

Multitenancy

Large and active community

Page 10: 2015 03 27_ml_conf

Open-source scalable machine learning platform

Tuned for efficient computation and memory use

Production ready machine learning algorithms

R, Python, Java, Scala APIs

Interactive UI, robust data parser

Page 11: 2015 03 27_ml_conf
Page 12: 2015 03 27_ml_conf

Sparkling Water Provides

Transparent integration of H2O with Spark ecosystem

Transparent use of H2O data structures and algorithms with Spark API

Platform for building Smarter Applications

Excels in existing Spark workflows requiring advanced Machine Learning algorithms

Page 13: 2015 03 27_ml_conf

Let's build an application! (in 10 minutes)

Page 14: 2015 03 27_ml_conf

OR

Detect spam text messages

Page 15: 2015 03 27_ml_conf

Data example

Page 16: 2015 03 27_ml_conf

ML Workflow

1. Extract data

2. Transform, tokenize messages

3. Build Tf-IDF

4. Create and evaluate Deep Learning model

5. Use the model

Goal: For a given text message, identify whether it is spam or not

Page 17: 2015 03 27_ml_conf

Lego #1: Data load

// Data load
def load(dataFile: String): RDD[Array[String]] = {
  sc.textFile(dataFile)
    .map(l => l.split("\t"))
    .filter(r => !r(0).isEmpty)
}
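The split-and-filter step of the loader can be tried without a Spark cluster. The following is a minimal sketch on a local Seq, assuming each input line has the layout "label, tab, message" (which is how the later slides read r(0) and r(1)). The sample lines and the name parseLines are made up for illustration.

```scala
// Hypothetical sample lines in the assumed "label<TAB>message" layout.
val sampleLines = Seq(
  "ham\tSee you at lunch",
  "spam\tWIN a FREE prize now",
  "\tmissing label"
)

// Same split-and-filter logic as load(), applied to a local Seq instead of an RDD.
def parseLines(lines: Seq[String]): Seq[Array[String]] =
  lines
    .map(l => l.split("\t"))      // column 0 = label, column 1 = message
    .filter(r => !r(0).isEmpty)   // drop rows whose label is empty

val parsed = parseLines(sampleLines)
```

The third sample line starts with a tab, so its label is empty and the row is dropped.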

Page 18: 2015 03 27_ml_conf

Lego #2: Ad-hoc Tokenization

def tokenize(data: RDD[String]): RDD[Seq[String]] = {
  val ignoredWords = Seq("the", "a", …)
  val ignoredChars = Seq(',', ':', …)
  val texts = data.map( r => {
    var smsText = r.toLowerCase
    for( c <- ignoredChars) {
      smsText = smsText.replace(c, ' ')
    }
    val words = smsText.split(" ")
      .filter(w => !ignoredWords.contains(w) && w.length > 2)
      .distinct
    words.toSeq
  })
  texts
}
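To see what the ad-hoc tokenizer does to a single message, here is a small local sketch of the same steps (lowercase, replace ignored characters with spaces, split, drop ignored and short words, deduplicate). The slide elides the full word and character lists, so the lists below are illustrative assumptions, and tokenizeLocal is a hypothetical name.

```scala
// Illustrative lists only; the slide does not show the full ones.
val ignoredWords = Set("the", "a", "and", "for")
val ignoredChars = Seq(',', ':', '!', '?', '.')

// Local, Spark-free version of the per-message logic in tokenize().
def tokenizeLocal(text: String): Seq[String] = {
  // Lowercase, then replace every ignored character with a space.
  val cleaned = ignoredChars.foldLeft(text.toLowerCase)((s, c) => s.replace(c, ' '))
  cleaned.split(" ")
    .filter(w => !ignoredWords.contains(w) && w.length > 2) // drop stop words and short tokens
    .distinct                                               // keep each word once
    .toSeq
}

val tokens = tokenizeLocal("Thanks for the order, the ORDER is on the way!")
```

With these lists, the example message reduces to the words thanks, order and way. Because of distinct, a repeated word counts only once per message.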

Page 19: 2015 03 27_ml_conf

Lego #3: Tf-IDF

def buildIDFModel(tokens: RDD[Seq[String]],
                  minDocFreq: Int = 4,
                  hashSpaceSize: Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = {
  // Hash strings into the given space
  val hashingTF = new HashingTF(hashSpaceSize)
  val tf = hashingTF.transform(tokens)
  // Build term frequency-inverse document frequency
  val idfModel = new IDF(minDocFreq = minDocFreq).fit(tf)
  val expandedText = idfModel.transform(tf)
  (hashingTF, idfModel, expandedText)
}

Hash words into large space

Term freq scale

"Thank for the order…" […,0,3.5,0,1,0,0.3,0,1.3,0,0,…]

Thank Order
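The two callouts (hash words into a large space, then scale term frequencies) can be sketched in plain Scala. This is an illustration of the idea only, not MLlib's implementation: bucketing by String.hashCode is an assumption, the names bucket, termFreq, idfWeights and tfidfVector are hypothetical, and the minDocFreq cut-off is left out. The IDF weight uses the formula given in the MLlib documentation, log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term.

```scala
val hashSpaceSize = 1 << 10

// Map a term to a bucket in [0, hashSpaceSize); assumption: plain String.hashCode.
def bucket(term: String): Int = {
  val raw = term.hashCode % hashSpaceSize
  if (raw < 0) raw + hashSpaceSize else raw
}

// Sparse term-frequency vector: bucket index -> number of tokens hashed there.
def termFreq(tokens: Seq[String]): Map[Int, Double] =
  tokens.groupBy(bucket).map { case (idx, ts) => idx -> ts.size.toDouble }

// IDF weight per bucket: log((m + 1) / (df + 1)).
def idfWeights(docs: Seq[Map[Int, Double]]): Map[Int, Double] = {
  val m = docs.size.toDouble
  val docFreq = docs.flatMap(_.keys).groupBy(identity).map { case (i, xs) => i -> xs.size }
  docFreq.map { case (i, df) => i -> math.log((m + 1.0) / (df + 1.0)) }
}

// Scale each term frequency by the IDF weight of its bucket.
def tfidfVector(tf: Map[Int, Double], idf: Map[Int, Double]): Map[Int, Double] =
  tf.map { case (i, f) => i -> f * idf.getOrElse(i, 0.0) }

// Made-up tokenized messages.
val docs = Seq(
  Seq("thanks", "order"),
  Seq("free", "prize", "order"),
  Seq("free", "prize", "now")
)
val tfs = docs.map(termFreq)
val idf = idfWeights(tfs)
val vectors = tfs.map(tfidfVector(_, idf))
```

Terms that occur in many messages get a weight close to zero, while rare terms keep a larger weight. Two different words can land in the same bucket; a larger hashSpaceSize makes that less likely.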

Page 20: 2015 03 27_ml_conf

Lego #4: Build a model

def buildDLModel(train: Frame, valid: Frame,
                 epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
                 hidden: Array[Int] = Array[Int](200, 200))
                (implicit h2oContext: H2OContext): DeepLearningModel = {
  import h2oContext._
  // Build a model
  val dlParams = new DeepLearningParameters()
  dlParams._destination_key = Key.make("dlModel.hex").asInstanceOf[Key[Frame]]
  dlParams._train = train
  dlParams._valid = valid
  dlParams._response_column = 'target
  dlParams._epochs = epochs
  dlParams._l1 = l1
  dlParams._l2 = l2
  dlParams._hidden = hidden
  // Create a job
  val dl = new DeepLearning(dlParams)
  val dlModel = dl.trainModel.get
  // Compute metrics on both datasets
  dlModel.score(train).delete()
  dlModel.score(valid).delete()
  dlModel
}

Deep Learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Page 21: 2015 03 27_ml_conf

Assembly application

// Data load
val data = load(DATAFILE)
// Extract response spam or ham
val hamSpam = data.map( r => r(0))
val message = data.map( r => r(1))
// Tokenize message content
val tokens = tokenize(message)
// Build IDF model
var (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)
// Merge response with extracted vectors
val resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))
val table: DataFrame = resultRDD
// Split table
val keys = Array[String]("train.hex", "valid.hex")
val ratios = Array[Double](0.8)
val frs = split(table, keys, ratios)
val (train, valid) = (frs(0), frs(1))
table.delete()
// Build a model
val dlModel = buildDLModel(train, valid)

Split dataset

Build model

Data munging

Page 22: 2015 03 27_ml_conf

Data exploration

Page 23: 2015 03 27_ml_conf

Model evaluation

val trainMetrics = binomialMM(dlModel, train)
val validMetrics = binomialMM(dlModel, valid)

Collect model metrics

Page 24: 2015 03 27_ml_conf

Spam predictor

def isSpam(msg: String,
           dlModel: DeepLearningModel,
           hashingTF: HashingTF,
           idfModel: IDFModel,
           hamThreshold: Double = 0.5): Boolean = {
  val msgRdd = sc.parallelize(Seq(msg))
  val msgVector: SchemaRDD = idfModel.transform(
      hashingTF.transform(tokenize(msgRdd)))
    .map(v => SMS("?", v))
  val msgTable: DataFrame = msgVector
  msgTable.remove(0) // remove first column
  val prediction = dlModel.score(msgTable)
  prediction.vecs()(1).at(0) < hamThreshold
}
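The last line of the predictor is a plain threshold test. A minimal sketch of that decision rule, assuming that column 1 of the prediction frame holds the predicted probability of the "ham" class (the slide does not state this explicitly); isSpamFromHamProb is a hypothetical helper name.

```scala
// Decision rule only: flag a message as spam when the (assumed) ham
// probability falls below the threshold.
def isSpamFromHamProb(hamProbability: Double, hamThreshold: Double = 0.5): Boolean =
  hamProbability < hamThreshold

val likelySpam = isSpamFromHamProb(0.1)                    // low ham probability
val likelyHam  = isSpamFromHamProb(0.9)                    // high ham probability
val stricter   = isSpamFromHamProb(0.6, hamThreshold = 0.7) // raised threshold
```

Raising hamThreshold flags more messages as spam; lowering it lets more messages through.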

Prepared models

Default decision threshold

Scoring

Page 25: 2015 03 27_ml_conf

Predict spam

isSpam("Michal, beer tonight in MV?")

isSpam("We tried to contact you re your reply to our offer of a Video Handset? 750 anytime any networks mins? UNLIMITED TEXT?")

Page 26: 2015 03 27_ml_conf

Check out H2O.ai Training Books

http://learn.h2o.ai/

Check out H2O.ai Blog

http://h2o.ai/blog/

Check out H2O.ai Youtube Channel

https://www.youtube.com/user/0xdata

Check out GitHub

https://github.com/h2oai/sparkling-water

Meetups

https://meetup.com/

More info

Page 27: 2015 03 27_ml_conf

Learn more at h2o.ai

Follow us at @h2oai

Thank you!

Sparkling Water is an open-source ML application platform combining the power of Spark and H2O