H2O – The Open Source Math Engine: Big Data Science with H2O in R

Jan 26, 2015


Anqi Fu's presentation from the August 20 Meetup on using H2O with R.
Transcript
Page 1: Big datascienceh2oandr

H2O – The Open Source Math Engine

Big Data Science with H2O in R

Page 2: Big datascienceh2oandr

4/23/13

H2O – Open Source Math & Machine Learning for Big Data

Anqi Fu, August 2013

Page 3: Big datascienceh2oandr

Universe is sparse. Life is messy. Data is sparse & messy.

- Lao Tzu

Page 4: Big datascienceh2oandr

Introduction to Big Data

• There are about as many bits of information in our digital universe as there are stars in our actual universe.

• The process to decode the human genome took 10 years. It can now be done in a week.

• Big data means more than “lots of data”

Page 5: Big datascienceh2oandr

H2O – The Open Source Math Engine

Better Predictions

Same Interface

Page 6: Big datascienceh2oandr

Installation

1. Install and run H2O

• Command line: java -Xmx2g -jar h2o.jar

• Pull up http://localhost:54321 in browser

2. Install the R package

• install.packages(c("RCurl", "rjson", "bitops"))

• install.packages("Path/To/Package/h2o_1.2.3.tar.gz", repos = NULL, type = "source")

• Replace Path/To/Package with the actual location of the downloaded package

3. In R console, type library(h2o)

• demo(package="h2o")

• demo(h2o.glm)

Page 7: Big datascienceh2oandr

Always have H2O running first!

Page 8: Big datascienceh2oandr

Basic R Script

1. Tell R where H2O is running:

localH2O = new("H2OClient", ip="127.0.0.1", port=54321)

2. Check connection:

h2o.checkClient(localH2O)

3. Pass H2OClient as parameter to import:

h2o.importFile(localH2O, path="Path/To/Data", …)

Page 9: Big datascienceh2oandr

Overview of Objects

• H2OClient: ip=character, port=numeric

• H2OParsedData: h2o=H2OClient, key=character

• H2OGLMModel: key=character, data=H2OParsedData, model=list(coefficients, deviance, aic, etc.)

Example: myModel@model$coefficients

[Diagram: H2O holding parsed data sets under key="prostate.hex" and key="airlines.hex"]
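The slot layout listed above can be tried out in plain R without an H2O server. The sketch below defines stand-in S4 classes; the names MockClient and MockParsedData are hypothetical and only mirror the slots named on the slide, they are not the package's own classes.

```r
library(methods)

# Hypothetical stand-ins that mirror the slot names and types from the slide.
setClass("MockClient", representation(ip = "character", port = "numeric"))
setClass("MockParsedData", representation(h2o = "MockClient", key = "character"))

# A client object records where the server runs.
client <- new("MockClient", ip = "127.0.0.1", port = 54321)

# A parsed-data object holds a reference to the client plus the key of the data.
prostate <- new("MockParsedData", h2o = client, key = "prostate.hex")

# Slots are read with the @ operator, and can be chained.
print(prostate@key)
print(prostate@h2o@port)

# S4 checks slot types at construction: a numeric ip is refused.
bad <- tryCatch(new("MockClient", ip = 127, port = 54321),
                error = function(e) "rejected")
print(bad)
```

Because the data object carries its client in the h2o slot, a function that receives only the data object still knows which server to talk to.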

Page 10: Big datascienceh2oandr

Overview of Methods

Standard R | H2O

read.csv, read.table, etc. | h2o.importFile, h2o.importURL

summary | summary (limited to data only)

glm, glmnet | h2o.glm(y, x, data, family, nfolds, alpha, lambda)

kmeans | h2o.kmeans(data, centers, cols, iter.max)

randomForest, cforest | h2o.randomForest(y, x_ignore, data, ntree, depth, classwt)

Page 11: Big datascienceh2oandr

Demo 1: Basic GLM in H2O through R

Page 12: Big datascienceh2oandr

Demo 1: Prostate Cancer Data

• Prostate cancer data set from Ohio State University Comprehensive Cancer Center

• N = 380 patients, ages ranging from 43 to 79

• Goal: Predict presence of tumor from baseline exam of patient (age, race, PSA, total Gleason score, etc.)

Page 13: Big datascienceh2oandr
Page 14: Big datascienceh2oandr

Prostate Cancer

Data:

y = CAPSULE

0 = no tumor

1 = tumor

x = PSA (prostate-specific antigen)

Page 15: Big datascienceh2oandr

Prostate Cancer

Logistic Regression Fit

Family: Binomial, Link: Logit

Data:

y = CAPSULE

0 = no tumor

1 = tumor

x = PSA (prostate-specific antigen)

Goal:

Estimate probability CAPSULE = 1
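For a binomial family with a logit link and the single predictor PSA, the fitted probability takes the standard logistic form below. The symbols \beta_0 (intercept) and \beta_1 (PSA coefficient) are notation introduced here, not taken from the slides.

```latex
\[
P(\text{CAPSULE} = 1 \mid \text{PSA})
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{PSA})}},
\qquad
\log\frac{P}{1 - P} = \beta_0 + \beta_1 \cdot \text{PSA}
\]
```

The second form shows why the link is called logit: the log-odds of a tumor are linear in PSA.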

Page 16: Big datascienceh2oandr

GLM Parameters

• y = response variable

• x = predictor variables (vector)

• family = binomial (default link = logit)

• data = H2OParsedData object

• nfolds = number of cross-validation folds

• lambda = weight on the penalty term

• alpha = elastic net mixing parameter

• alpha = 0 is ridge penalty (L2 norm)

• alpha = 1 is lasso penalty (L1 norm)
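How lambda and alpha interact can be sketched with a small function. This assumes the glmnet-style convention, lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared); the slides do not state the exact scaling H2O uses, so treat the formula and the function name elastic_net_penalty as illustrative.

```r
# Elastic net penalty under the glmnet-style convention (an assumption here):
#   lambda * ( alpha * sum|beta| + (1 - alpha) / 2 * sum(beta^2) )
elastic_net_penalty <- function(beta, lambda, alpha) {
  l1 <- sum(abs(beta))   # lasso part (L1 norm)
  l2 <- sum(beta^2)      # ridge part (squared L2 norm)
  lambda * (alpha * l1 + (1 - alpha) / 2 * l2)
}

beta <- c(1, -2)  # made-up coefficients, for illustration only

lasso_only <- elastic_net_penalty(beta, lambda = 1, alpha = 1)    # pure L1
ridge_only <- elastic_net_penalty(beta, lambda = 1, alpha = 0)    # pure L2
mixed      <- elastic_net_penalty(beta, lambda = 1, alpha = 0.5)  # blend of both

print(c(lasso = lasso_only, ridge = ridge_only, mixed = mixed))
```

Under this convention alpha only chooses the blend between the two norms, while lambda scales the whole penalty up or down.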

Page 17: Big datascienceh2oandr

Under the Hood: Hacking R for H2O

Page 18: Big datascienceh2oandr

Under the Hood

[Diagram: Data Scientist, Analyst, etc. <-> REST API (data as JSON) <-> H2O (Import, Parse)]

Page 19: Big datascienceh2oandr

GLM Code Snippet

• Create an object to represent model

setClass("H2OGLMModel", representation(key="character", data="H2OParsedData", model="list"))

• Declare new method for algorithm

setGeneric("h2o.glm", function(x, y, data, family, nfolds = 10, alpha = 0.5, lambda = 1.0e-5) { standardGeneric("h2o.glm") })

[Slide callouts: "Name" and "Slots" label the setClass arguments; "Parameter" and "Initial Value" label the setGeneric defaults]
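The setClass/setGeneric pattern above can be exercised in plain R with a toy example. In this sketch ToyData and toy.fit are hypothetical names, and the column names AGE and PSA are assumed for illustration; nothing here contacts H2O.

```r
library(methods)

# Toy class standing in for a parsed-data handle (hypothetical).
setClass("ToyData", representation(key = "character"))

# Generic with default parameter values; dispatch is restricted to x, y, data.
setGeneric("toy.fit",
           function(x, y, data, nfolds = 10, alpha = 0.5) {
             standardGeneric("toy.fit")
           },
           signature = c("x", "y", "data"))

# Method chosen when x and y are character and data is a ToyData object.
setMethod("toy.fit",
          signature(x = "character", y = "character", data = "ToyData"),
          function(x, y, data, nfolds = 10, alpha = 0.5) {
            list(key = data@key,
                 formula = paste(y, "~", paste(x, collapse = " + ")),
                 nfolds = nfolds,
                 alpha = alpha)
          })

res <- toy.fit(x = c("AGE", "PSA"), y = "CAPSULE",
               data = new("ToyData", key = "prostate.hex"))
print(res$formula)
print(res$nfolds)
```

Callers who omit nfolds and alpha get the initial values from the function definition, which is the role the "Parameter" and "Initial Value" labels play on the slide.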

Page 20: Big datascienceh2oandr

GLM Code Snippet

setMethod("h2o.glm", signature(x="character", y="character", data="H2OParsedData", …), function(x, y, data, …) {

• Send parameters to GLM.json page; the GLM job is started

res = h2o.__remoteSend(data@h2o, h2o.__PAGE_GLM, key = data@key, y = y, x = paste(x, sep="", collapse=","), …)

• Keep polling and wait until job completed

while(h2o.__poll(data@h2o, res$response$redirect_request_args$job) != -1) { Sys.sleep(1) }

• Query Inspect.json page with GLM model key to get results

res = h2o.__remoteSend(data@h2o, h2o.__PAGE_INSPECT, key=res$destination_key)

http://cran.r-project.org/doc/contrib/Genolini-S4tutorialV0-5en.pdf
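The send, poll, inspect flow in the snippet can be imitated without a server. Below is a minimal sketch in which make_mock_poll is a hypothetical stand-in for the job-status call; following the loop on the slide, it returns -1 once the job is done. The column names AGE and PSA are assumed.

```r
# Hypothetical job-status function: reports progress, then -1 when finished.
make_mock_poll <- function(n_steps = 3) {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls > n_steps) -1 else calls / n_steps
  }
}

# Predictor names are sent as one comma-separated string.
x <- c("AGE", "PSA")
x_arg <- paste(x, collapse = ",")
print(x_arg)

# Keep polling until the job reports completion.
poll <- make_mock_poll(n_steps = 3)
n_polls <- 0
while (poll() != -1) {
  n_polls <- n_polls + 1
  Sys.sleep(0.01)  # short pause between polls; the slide uses one second
}
print(n_polls)
```

Polling keeps the R side simple: the client never holds a long-running connection, it just asks again until the server says the model is ready to inspect.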

Page 21: Big datascienceh2oandr

Demo 2: Data Munging and Remote H2O

Page 22: Big datascienceh2oandr

Demo 2: Airlines Data

• Airlines data set 1987-2013 from RITA (25%)

• Goal: Predict if flight’s arrival will be delayed

• Examine slices of data directly

head(airlines.hex, n = 10); tail(airlines.hex)

summary(airlines.hex$DepTime)

• Take a subset of data to play with in R

airlines.small = as.data.frame(airlines.hex[1:1000,])

glm(IsArrDelayed ~ Dest + Origin, family = binomial, data = airlines.small)
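Once a slice has been pulled into an ordinary data frame, everything after that is standard R. The sketch below uses a small synthetic data frame in place of the real airlines subset: the values are randomly generated, the airport codes are arbitrary, and IsArrDelayed is coded 0/1 here as an assumption.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for as.data.frame(airlines.hex[1:1000, ]); not real flight data.
airlines.small <- data.frame(
  Origin  = factor(sample(c("SFO", "ORD", "JFK"), n, replace = TRUE)),
  Dest    = factor(sample(c("LAX", "DEN"), n, replace = TRUE)),
  DepTime = sample(500:2300, n, replace = TRUE)
)
# Made-up delay probabilities, chosen only so the model has something to fit.
airlines.small$IsArrDelayed <-
  rbinom(n, 1, ifelse(airlines.small$Origin == "ORD", 0.6, 0.3))

# Examine the slice.
print(head(airlines.small, n = 10))
print(summary(airlines.small$DepTime))

# Same model call as on the slide, run locally with base R.
fit <- glm(IsArrDelayed ~ Dest + Origin, family = binomial, data = airlines.small)
p <- predict(fit, type = "response")  # fitted probabilities of delay
print(coef(fit))
```

Prototyping on a small local subset like this is cheap; the full-size fit would then be run on the cluster with h2o.glm.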

Page 23: Big datascienceh2oandr
Page 24: Big datascienceh2oandr

http://www.transtats.bts.gov/Fields.asp?Table_ID=236

Page 25: Big datascienceh2oandr

Connecting to H2O Remotely

• Your slip of paper contains IP/port of your assigned cluster

• Point R to remote H2O client

remoteH2O = new("H2OClient", ip = "192.168.1.161", port = 54321)

• All data operations occur on cluster

h2o.importFile(remoteH2O, path = "Path/On/Remote/Server/To/Data", …)

• Objects/methods operate just like before!

Page 26: Big datascienceh2oandr

Roadmap

• Long-term Goal: Full H2O/R Integration

• Subset col by name/index: df[,c(1,2)]; df[,"name"]

• Add/Remove cols: df[,-c(1,2)]; df[,3] = df[,2] + 1

• Filter rows: df[df$cName < 5,]

• Combine data frames by row/col: rbind, cbind

• Apply functions: tapply, sapply, lapply

• Support for R libraries (plyr, ggplot2, etc)

• More Algorithms: GBM, PCA, Neural Networks
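For reference, this is how the data frame operations named in the roadmap behave on an ordinary base R data.frame, which is presumably the behavior the integration aims to match on H2O data. The column names and values are made up for illustration.

```r
df <- data.frame(id = 1:6,
                 x = c(2, 4, 6, 8, 10, 12),
                 cName = c(3, 7, 1, 9, 4, 5))

# Subset columns by index and by name.
first_two <- df[, c(1, 2)]
by_name   <- df[, "cName"]

# Remove columns with negative indices; overwrite a column from another one.
dropped <- df[, -c(1, 2)]
df2 <- df
df2[, 3] <- df2[, 2] + 1

# Filter rows with a logical condition.
filtered <- df[df$cName < 5, ]

# Combine data frames by row and by column.
stacked <- rbind(df, filtered)
widened <- cbind(df, flag = df$cName < 5)

# Apply functions over columns and over groups.
col_means   <- sapply(df, mean)
group_means <- tapply(df$x, df$cName < 5, mean)

print(filtered)
print(group_means)
```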

Page 27: Big datascienceh2oandr


Questions and Suggestions?