val ScAlH2O = Scala ++ H2O San Francisco Data Science

Michal Malohlava presents: Open Source H2O and Scala

Jan 26, 2015


Michal Malohlava discusses the magic behind the math, showing how the open source big data analysis platform H2O uses Scala to get work done, and demos how users can interact with Scala to get the most out of data analysis.
Transcript
Page 1: Michal Malohlava presents: Open Source H2O and Scala

val ScAlH2O = Scala ++ H2O San Francisco Data Science

Page 2

Why Scala & H2O?

● H2O ~ fast, distributed, large-scale computation platform providing rich Java API
– But low-level and for many users too complicated

public class ShuffleTask extends MRTask2<ShuffleTask> {

  @Override public void map(Chunk ic, Chunk oc) {
    if (ic._len==0) return;
    // Each vector is shuffled in the same way
    Random rng = Utils.getRNG(0xe031e74f321f7e29L + (ic.cidx() << 32L));
    oc.set0(0,ic.at0(0));
    for (int row=1; row<ic._len; row++) {
      int j = rng.nextInt(row+1); // inclusive upper bound <0,row>
      if (j!=row) oc.set0(row, oc.at0(j));
      oc.set0(j, ic.at0(row));
    }
  }
}
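The map body above is an inside-out Fisher-Yates shuffle applied to one chunk. As a rough illustration of that loop without any H2O types, here is a minimal sketch in plain Scala on an in-memory array. The names `ShuffleSketch` and `shuffled` are hypothetical, and `scala.util.Random` is only a stand-in for H2O's `Utils.getRNG`.

```scala
import scala.util.Random

object ShuffleSketch {
  // Inside-out Fisher-Yates: fills `out` with a shuffled copy of `in`
  // and leaves `in` untouched. Hypothetical helper, not part of H2O.
  def shuffled(in: Array[Double], seed: Long): Array[Double] = {
    val out = new Array[Double](in.length)
    val rng = new Random(seed) // stand-in for Utils.getRNG
    if (in.nonEmpty) out(0) = in(0)
    for (row <- 1 until in.length) {
      val j = rng.nextInt(row + 1) // uniform over 0..row, inclusive
      if (j != row) out(row) = out(j) // move the current occupant of slot j
      out(j) = in(row)                // place the new element at slot j
    }
    out
  }
}
```

With a fixed seed the result is reproducible, which mirrors the slide's comment that each vector is shuffled in the same way.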

Page 3

What we provide

● ScAlH2O - Scala library providing a DSL
– Abstraction over the low-level H2O API
– Easy data manipulation and distributed computation
– BUT still inside JVM

● Scala REPL integration into H2O
– Console for experimenting with ScAlH2O

Page 4

Basic concepts

● First-class entities
– Scalars
– Frames

● Scala expressions

● Access to H2O algos
– While still preserving access to the low-level H2O API

Page 5

Frame operations

● Parse data
● Basic slicing
– Column/row selectors, append
● Scalar operations
● Support head/tail/ncols/nrows/...
● Cooperation with H2O distributed KV store
– Load/save operations

val f = parse("smalldata/cars.csv")

val f1 = f("name") ++ f(*, 5 to 7)

val f2 = f1("year") + 1900

val g = load("cars.hex")
val g1 = g ++ g("year") > 80
save("cars.hex", g1)
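To make the operator style above concrete, here is a minimal sketch of how column selection, append (`++`) and scalar operations could be modeled on a tiny in-memory frame. This is an assumption-laden illustration, not the ScAlH2O implementation: `FrameSketch`, `Frame`, `data` and `columns` are hypothetical names, all values are doubles, and `>` is assumed to produce a 0/1 column.

```scala
object FrameSketch {
  // Hypothetical in-memory stand-in for a frame: named columns of doubles.
  final case class Frame(names: Vector[String], data: Vector[Vector[Double]]) {
    require(names.length == data.length, "one name per column")

    def ncols: Int = data.length
    def nrows: Int = if (data.isEmpty) 0 else data.head.length

    // Column selection by name, e.g. f("year")
    def apply(name: String): Frame = {
      val i = names.indexOf(name)
      require(i >= 0, s"unknown column: $name")
      Frame(Vector(name), Vector(data(i)))
    }

    // Column selection by index range, e.g. f.columns(5 to 7)
    def columns(r: Range): Frame =
      Frame(r.map(names).toVector, r.map(data).toVector)

    // Append the columns of another frame
    def ++(other: Frame): Frame = Frame(names ++ other.names, data ++ other.data)

    // Scalar operations applied to every value
    def +(x: Double): Frame = copy(data = data.map(_.map(_ + x)))
    def >(x: Double): Frame = copy(data = data.map(_.map(v => if (v > x) 1.0 else 0.0)))
  }
}
```

One detail worth keeping in mind with this style: under Scala's operator precedence, operators starting with `+` bind tighter than those starting with `>`, so `g ++ g("year") > 80` groups as `(g ++ g("year")) > 80`. Parentheses are needed if the comparison is meant to apply to the `year` column only.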

Page 6

Map/filter/collect operations

● Map
– Per value/row

● Filter

● Collect

// Collect all cars with more than 4 cylinders
val ff = f filter ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4
});

// Returns a boolean vector
val fm = f map ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4
});

// Compute sum of the column at index 2
val fc = f collect ( 0.0, new CDOp() {
  def apply(acc:scala.Double, rhs:Array[scala.Double]) = acc + rhs(2)
  def reduce(l:scala.Double, r:scala.Double) = l+r
} )
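The three operators above can be read as row-wise filter, row-wise map, and a fold followed by a merge of partial results. A minimal sketch of those semantics on in-memory rows, with chunks modeled as nested sequences; `RowOpsSketch` and its method names are hypothetical and no H2O classes are involved.

```scala
object RowOpsSketch {
  type Row = Array[Double]

  // filter: keep the rows for which the predicate holds
  def filterRows(rows: Seq[Row])(p: Row => Boolean): Seq[Row] = rows.filter(p)

  // map: one output value per row (here a boolean flag)
  def mapRows(rows: Seq[Row])(p: Row => Boolean): Seq[Boolean] = rows.map(p)

  // collect: fold each chunk starting from `zero`, then merge the partial
  // results pairwise, mirroring the apply/reduce pair of the operator above
  def collect(chunks: Seq[Seq[Row]], zero: Double)(
      step: (Double, Row) => Double,
      merge: (Double, Double) => Double): Double =
    chunks.map(_.foldLeft(zero)(step)).reduceOption(merge).getOrElse(zero)
}
```

Because `zero` seeds every chunk in this sketch, it should be neutral for `merge` (0.0 for a sum), otherwise it would be counted once per chunk.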

Page 7

Internals

It's magic

Page 8

Internals

● No magic, BUT there are key tricks
– connect H2O classloaders with Scala ecosystem
  ● Make sure that all distributed objects are correctly iced
– make translation of Scala code into calls of Java API
  ● Create H2O MR tasks
  ● Pass operations around the cloud
  ● Create new frames
– preserve primitive types
  ● do not introduce overhead of boxing/unboxing

Page 9

Internals – translation to H2O MR tasks

def filter(af: T_A2B_Transf[scala.Double]):T = {
  val f = frame()
  val mrt = new MRTask2() {
    override def map(in:Array[Chunk], out:Array[NewChunk]) = {
      val rlen = in(0)._len
      val tmprow = new Array[scala.Double](in.length)
      for (row:Int <- 0 until rlen ) {
        if (af(Utils.readRow(in,row,tmprow))) {
          for (i:Int <- 0 until in.length) out(i).addNum(tmprow(i))
        }
      }
    }
  }
  mrt.doAll(f.numCols(), f)
  val result = mrt.outputFrame(f.names(), f.domains())
  apply(result) // return the DFrame
}

// Collect all cars with more than 4 cylinders
val f5 = f filter ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4
});

T_A2B_Transf has to be water.Freezable
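The reason the transformation has to be freezable is that the operation object itself travels to the nodes that run the task. As a loose analogy only, the sketch below round-trips a row predicate through standard Java serialization. `java.io.Serializable` stands in for `water.Freezable` here, and `ShippingSketch`, `RowPredicate`, `MoreThan` and `roundTrip` are hypothetical names, not H2O or ScAlH2O API.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ShippingSketch {
  // Hypothetical row predicate; java.io.Serializable stands in for water.Freezable.
  trait RowPredicate extends java.io.Serializable {
    def apply(row: Array[Double]): Boolean
  }

  // A concrete predicate that carries only its own parameters,
  // so it can be written out and read back without extra state.
  final case class MoreThan(col: Int, threshold: Double) extends RowPredicate {
    def apply(row: Array[Double]): Boolean = row(col) > threshold
  }

  // Simulates sending an operation to another node: object -> bytes -> object.
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

For example, `ShippingSketch.roundTrip(ShippingSketch.MoreThan(2, 4.0))` yields a copy of the predicate that behaves like the original. An operation that is not serializable would fail at the `writeObject` step, which is the kind of failure the freezable requirement guards against.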

Page 10

Party demo time!

Page 11

Towards Scalding-like API

● Vision is to provide Scalding-like syntax

● But so far DSL is still ugly

f map ( ('name, 'cylinders) -> ('name, 'moreThan4) ) {
  (n:String, c:Int) => (n, if (c>4) 1 else 0)
}

[Slide callouts: Input scheme, Output scheme, Transformation]

f map (f, ('name, 'cylinders) -> ('name, 'moreThan4) ) {
  new IcedFunctor2to2[Double,Int,Double,Int] {
    def apply(n:Double, c:Int) = (n, if (c>4) 1 else 0)
  }
}
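To illustrate the idea of an input scheme, an output scheme and a transformation between them, here is a minimal sketch over in-memory rows. It is not how ScAlH2O or Scalding implement this: `SchemeMapSketch` and `mapFields` are hypothetical, rows are plain `Map[String, Any]` values, and string field names replace the symbol literals used on the slide.

```scala
object SchemeMapSketch {
  type Row = Map[String, Any]

  // Reads the two input fields of every row, applies `fn`, and writes the
  // results under the two output field names. The casts are unchecked, so
  // the caller is responsible for matching field types to `fn`.
  def mapFields[A, B, C, D](
      rows: Seq[Row],
      scheme: ((String, String), (String, String)))(fn: (A, B) => (C, D)): Seq[Row] = {
    val ((in1, in2), (out1, out2)) = scheme
    rows.map { row =>
      val (c, d) = fn(row(in1).asInstanceOf[A], row(in2).asInstanceOf[B])
      Map[String, Any](out1 -> c, out2 -> d)
    }
  }
}
```

A call then reads close to the envisioned syntax, for instance `SchemeMapSketch.mapFields(rows, ("name", "cylinders") -> ("name", "moreThan4")) { (n: String, c: Int) => (n, if (c > 4) 1 else 0) }`.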

Page 12

Try and contribute!

> git clone git@github.com:0xdata/h2o.git

> git checkout -b h2oscala origin/h2oscala

> cd h2o-scala && ./depl.sh # or sbt compile

=== Welcome to the world of ScAlH2O === Type `help` or `example` to begin...

h2o>

Thank you!