H2O – The Open Source Math Engine: Big Data Science with H2O in R
Jan 26, 2015
4/23/13
H2O – Open Source Math & Machine Learning for Big Data
Anqi Fu, August 2013
Universe is sparse. Life is messy. Data is sparse & messy.
- Lao Tzu
Introduction to Big Data
• There are about as many bits of information in our digital universe as there are stars in our actual universe.
• The process to decode the human genome took 10 years. It can now be done in a week.
• Big data means more than “lots of data”
H2O – The Open Source Math Engine
Better Predictions, Same Interface
Installation
1. Install and run H2O
• Command line: java -Xmx2g -jar h2o.jar
• Pull up http://localhost:54321 in browser
2. Install the R package
• install.packages(c("RCurl", "rjson", "bitops"))
• install.packages("Path/To/Package/h2o_1.2.3.tar.gz", repos = NULL, type = "source")
  (Replace this path with the location of the package file!)
3. In R console, type library(h2o)
• demo(package="h2o")
• demo(h2o.glm)
Always have H2O running first!
Basic R Script
1. Tell R where H2O is running:
localH2O = new("H2OClient", ip="127.0.0.1", port=54321)
2. Check connection:
h2o.checkClient(localH2O)
3. Pass H2OClient as parameter to import:
h2o.importFile(localH2O, path="Path/To/Data", …)
Overview of Objects
• H2OClient: ip=character, port=numeric
• H2OParsedData: h2o=H2OClient, key=character
• H2OGLMModel: key=character, data=H2OParsedData, model=list(coefficients, deviance, aic, etc.)
  Example: myModel@model$coefficients
H2O: key="prostate.hex", key="airlines.hex"
Overview of Methods
Standard R                 | H2O
read.csv, read.table, etc. | h2o.importFile, h2o.importURL
summary                    | summary (limited to data only)
glm, glmnet                | h2o.glm(y, x, data, family, nfolds, alpha, lambda)
kmeans                     | h2o.kmeans(data, centers, cols, iter.max)
randomForest, cforest      | h2o.randomForest(y, x_ignore, data, ntree, depth, classwt)
Demo 1: Basic GLM in H2O through R
Demo 1: Prostate Cancer Data
• Prostate cancer data set from Ohio State University Comprehensive Cancer Center
• N = 380 patients, ages ranging from 43-79
• Goal: Predict presence of tumor from baseline exam of patient (age, race, PSA, total Gleason score, etc.)
Prostate Cancer
Data:
• y = CAPSULE (0 = no tumor, 1 = tumor)
• x = PSA (prostate-specific antigen)
Prostate Cancer
Logistic Regression Fit (Family: Binomial, Link: Logit)
Data:
• y = CAPSULE (0 = no tumor, 1 = tumor)
• x = PSA (prostate-specific antigen)
Goal: Estimate probability CAPSULE = 1
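The logit link named above can be sketched in a few lines of base R. This is only an illustration of how a fitted logistic model turns a PSA value into a probability that CAPSULE = 1. The intercept and slope are hypothetical placeholders, not values fitted to the prostate data, and `predict_capsule` is a name invented for this sketch.

```r
# Inverse-logit link: maps a linear predictor to a probability in (0, 1).
# b0 and b1 are hypothetical placeholders, NOT coefficients fitted to real data.
b0 <- -1.0   # hypothetical intercept
b1 <- 0.05   # hypothetical slope on PSA

predict_capsule <- function(psa, intercept = b0, slope = b1) {
  eta <- intercept + slope * psa   # linear predictor
  1 / (1 + exp(-eta))              # estimated P(CAPSULE = 1 | PSA)
}

psa <- c(0, 10, 20, 50)
p <- predict_capsule(psa)
print(round(p, 3))
```

With a positive slope the estimated probability rises monotonically with PSA; base R's `plogis()` computes the same transformation.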
GLM Parameters
• y = response variable
• x = predictor variables (vector)
• family = binomial (default link = logit)
• data = H2OParsedData object
• nfolds = number of cross-validation folds
• lambda = weight on penalty factor
• alpha = elastic net mixing parameter
  • alpha = 0 is ridge penalty (L2 norm)
  • alpha = 1 is lasso penalty (L1 norm)
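To make the roles of lambda and alpha concrete, here is a minimal sketch of an elastic net penalty in base R. It assumes the glmnet-style convention, where the ridge term is half the squared L2 norm; the slides do not state the exact formula H2O uses, so treat the scaling as an assumption. `elastic_net_penalty` is a hypothetical helper, not part of the h2o package.

```r
# Elastic net penalty, assuming the glmnet-style convention:
#   lambda * ( alpha * sum(|beta|) + (1 - alpha) / 2 * sum(beta^2) )
elastic_net_penalty <- function(beta, lambda, alpha) {
  stopifnot(alpha >= 0, alpha <= 1, lambda >= 0)
  l1 <- sum(abs(beta))   # L1 norm (lasso part)
  l2 <- sum(beta^2)      # squared L2 norm (ridge part)
  lambda * (alpha * l1 + (1 - alpha) / 2 * l2)
}

beta  <- c(0.5, -2, 0)   # illustrative coefficient vector
ridge <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0)    # pure ridge
lasso <- elastic_net_penalty(beta, lambda = 0.1, alpha = 1)    # pure lasso
mixed <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0.5)  # blend of both
print(c(ridge = ridge, lasso = lasso, mixed = mixed))
```

Setting alpha between 0 and 1 blends the two penalties, while lambda scales the overall strength of the regularization.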
Under the Hood: Hacking R for H2O
Under the Hood
[Diagram: Data Scientist, Analyst, etc. | REST API | Data (JSON) | H2O: Import, Parse]
GLM Code Snippet
• Create an object to represent model
setClass("H2OGLMModel", representation(key="character", data="H2OParsedData", model="list"))
• Declare new method for algorithm
setGeneric("h2o.glm", function(x, y, data, family, nfolds = 10, alpha = 0.5, lambda = 1.0e-5) { standardGeneric("h2o.glm") })
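Callouts: in setClass, "H2OGLMModel" is the Name and representation(...) lists the Slots; in setGeneric, each Parameter can be given an Initial Value (e.g. nfolds = 10).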
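The S4 pattern above (a class with named slots, a generic with default parameter values, and a method dispatched on argument types) can be tried without an H2O server. The sketch below uses a hypothetical `ToyModel` class and `toy.fit` generic as stand-ins; they are not part of the h2o package, and the method only builds an object rather than fitting anything.

```r
library(methods)

# Hypothetical toy class standing in for H2OGLMModel: a name plus named slots.
setClass("ToyModel", representation(key = "character", model = "list"))

# Generic with a default (initial) value for one parameter.
# Dispatch is restricted to x and y, an assumption made to keep the sketch simple.
setGeneric("toy.fit", function(x, y, nfolds = 10, ...) {
  standardGeneric("toy.fit")
}, signature = c("x", "y"))

# Method selected when both x and y are character vectors.
setMethod("toy.fit", signature(x = "character", y = "character"),
  function(x, y, nfolds = 10, ...) {
    new("ToyModel",
        key   = paste0(y, ".model"),
        model = list(predictors = x, nfolds = nfolds))
  })

m <- toy.fit(x = c("AGE", "PSA"), y = "CAPSULE")
print(m@key)            # slots are read with @
print(m@model$nfolds)   # list elements inside a slot are read with $
```

Slot access with `@` followed by `$` is the same pattern as `myModel@model$coefficients` in the object overview.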
GLM Code Snippet
setMethod("h2o.glm", signature(x="character", y="character", data="H2OParsedData", …), function(x, y, data, …) {
• Send parameters to GLM.json page (GLM job started)
res = h2o.__remoteSend(data@h2o, h2o.__PAGE_GLM, key = data@key, y = y, x = paste(x, sep="", collapse=","), …)
• Keep polling and wait until job completed
while(h2o.__poll(data@h2o, res$response$redirect_request_args$job) != -1) { Sys.sleep(1) }
• Query Inspect.json page with GLM model key to get results
res = h2o.__remoteSend(data@h2o, h2o.__PAGE_INSPECT, key=res$destination_key)
http://cran.r-project.org/doc/contrib/Genolini-S4tutorialV0-5en.pdf
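The "keep polling until the job completes" step can be illustrated on its own. The sketch below replaces the remote call with a hypothetical local function, `make_fake_job`, that returns -1 once the job is done, mirroring the `!= -1` check in the snippet. It also adds a poll limit, which the original loop does not have. Neither helper is part of the h2o package.

```r
# Hypothetical stand-in for a remote job: reports fractional progress while
# running, then returns -1 once finished (mirroring the "!= -1" check above).
make_fake_job <- function(steps) {
  remaining <- steps
  function() {
    if (remaining <= 0) return(-1)
    remaining <<- remaining - 1
    (steps - remaining) / steps   # fraction complete
  }
}

# Poll until the job reports -1, sleeping between polls.
# max_polls is an added safeguard so the loop cannot run forever.
wait_for_job <- function(poll, interval = 0.01, max_polls = 1000) {
  polls <- 0
  while (poll() != -1) {
    polls <- polls + 1
    if (polls >= max_polls) stop("job did not finish in time")
    Sys.sleep(interval)
  }
  polls
}

n_polls <- wait_for_job(make_fake_job(steps = 3))
print(n_polls)
```

Bounding the number of polls is a design choice for the sketch: a stalled job then raises an error instead of blocking the R session indefinitely.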
Demo 2: Data Munging and Remote H2O
Demo 2: Airlines Data
• Airlines data set 1987-2013 from RITA (25%)
• Goal: Predict if flight’s arrival will be delayed
• Examine slices of data directly
head(airlines.hex, n = 10); tail(airlines.hex)
summary(airlines.hex$DepTime)
• Take a subset of data to play with in R
airlines.small = as.data.frame(airlines.hex[1:1000,])
glm(IsArrDelayed ~ Dest + Origin, family = binomial, data = airlines.small)
http://www.transtats.bts.gov/Fields.asp?Table_ID=236
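Once a subset has been pulled into an ordinary data frame, standard R modeling applies. The sketch below mimics that last step with a small synthetic data frame, since the real airlines data and a running H2O instance are not assumed to be available. The column values are randomly generated for illustration and carry no real-world meaning, and coding IsArrDelayed as 0/1 is an assumption.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for as.data.frame(airlines.hex[1:1000,]); values are random.
airlines.small <- data.frame(
  Dest         = factor(sample(c("SFO", "ORD", "JFK"), n, replace = TRUE)),
  Origin       = factor(sample(c("LAX", "ATL"), n, replace = TRUE)),
  IsArrDelayed = rbinom(n, 1, 0.4)   # assumed 0/1 coding of the response
)

# Inspect a slice, then fit the same formula as in the demo with base R glm.
print(head(airlines.small, n = 10))
fit <- glm(IsArrDelayed ~ Dest + Origin, family = binomial, data = airlines.small)
print(coef(fit))
```

Because the response here is pure noise, the fitted coefficients are not meaningful; the point is only the workflow of subsetting, converting, and modeling locally.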
Connecting to H2O Remotely
• Your slip of paper contains IP/port of your assigned cluster
• Point R to remote H2O client
remoteH2O = new("H2OClient", ip = "192.168.1.161", port = 54321)
• All data operations occur on cluster
h2o.importFile(remoteH2O, path = "Path/On/Remote/Server/To/Data", …)
• Objects/methods operate just like before!
Roadmap
• Long-term Goal: Full H2O/R Integration
• Subset col by name/index: df[,c(1,2)]; df[,"name"]
• Add/Remove cols: df[,-c(1,2)]; df[,3] = df[,2] + 1
• Filter rows: df[df$cName < 5,]
• Combine data frames by row/col: rbind, cbind
• Apply functions: tapply, sapply, lapply
• Support for R libraries (plyr, ggplot2, etc)
• More Algorithms: GBM, PCA, Neural Networks
Questions and Suggestions?