val ScAlH2O = Scala ++ H2O
San Francisco Data Science
Jan 26, 2015
Why Scala & H2O?
● H2O ~ fast, distributed, large-scale computation platform providing a rich Java API
– But low-level and too complicated for many users
public class ShuffleTask extends MRTask2<ShuffleTask> {
  @Override public void map(Chunk ic, Chunk oc) {
    if (ic._len==0) return;
    // Each vector is shuffled in the same way
    Random rng = Utils.getRNG(0xe031e74f321f7e29L + (ic.cidx() << 32L));
    oc.set0(0,ic.at0(0));
    for (int row=1; row<ic._len; row++) {
      int j = rng.nextInt(row+1); // inclusive upper bound <0,row>
      if (j!=row) oc.set0(row, oc.at0(j));
      oc.set0(j, ic.at0(row));
    }
  }
}
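The map body above follows the inside-out Fisher-Yates pattern: each incoming value is dropped into a random slot of the output, and the previous occupant of that slot moves to the end. A minimal sketch of the same loop on a plain array, without any H2O types (ShuffleSketch and shuffled are illustrative names, not part of H2O or ScAlH2O):

```scala
import scala.util.Random

object ShuffleSketch {
  // Inside-out Fisher-Yates: builds a shuffled copy of `in` in one pass.
  // Plain-array stand-in for the Chunk-based loop; not H2O code.
  def shuffled(in: Array[Double], seed: Long): Array[Double] = {
    val out = new Array[Double](in.length)
    if (in.nonEmpty) {
      val rng = new Random(seed)
      out(0) = in(0)
      for (row <- 1 until in.length) {
        val j = rng.nextInt(row + 1)    // uniform in 0..row, inclusive
        if (j != row) out(row) = out(j) // old occupant of slot j moves to the end
        out(j) = in(row)                // new value takes slot j
      }
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val data = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    println(shuffled(data, seed = 42L).mkString(", "))
  }
}
```

The same seed always yields the same permutation, which is what lets every vector be shuffled identically.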
What we provide
● ScAlH2O - Scala library providing a DSL
– Abstraction of the H2O low-level API
– Easy data manipulation and distributed computation
– BUT still inside the JVM
● Scala REPL integration into H2O
– Console for experimenting with ScAlH2O
Basic concepts
● First-class entities
– Scalars
– Frames
● Scala expressions
● Access to H2O algorithms
– While still preserving access to the low-level H2O API
Frame operations
● Parse data
● Basic slicing
– Column/row selectors, append
● Scalar operations
● Support for head/tail/ncols/nrows/...
● Cooperation with the H2O distributed KV store
– Load/save operations
val f = parse("smalldata/cars.csv")
val f1 = f("name") ++ f(*, 5 to 7)
val f2 = f1("year") + 1900
val g = load("cars.hex")
val g1 = g ++ g("year") > 80
save("cars.hex", g1)
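To make the selector, append and scalar operations concrete, here is a toy in-memory model. It is only a sketch: ToyFrame and all of its methods are hypothetical stand-ins written for this example, and the real ScAlH2O frames are distributed and backed by H2O.

```scala
object ToyFrameSketch {
  // Hypothetical stand-in for a frame: named columns of doubles.
  final case class ToyFrame(names: Vector[String], cols: Vector[Vector[Double]]) {
    require(names.length == cols.length, "one name per column")

    def ncols: Int = cols.length
    def nrows: Int = cols.headOption.map(_.length).getOrElse(0)

    // Column selector by name.
    def apply(name: String): ToyFrame = {
      val i = names.indexOf(name)
      require(i >= 0, s"no such column: $name")
      ToyFrame(Vector(name), Vector(cols(i)))
    }

    // Append the columns of another frame.
    def ++(that: ToyFrame): ToyFrame =
      ToyFrame(names ++ that.names, cols ++ that.cols)

    // Scalar operations are applied to every value.
    def +(x: Double): ToyFrame = copy(cols = cols.map(_.map(_ + x)))
    def >(x: Double): ToyFrame =
      copy(cols = cols.map(_.map(v => if (v > x) 1.0 else 0.0)))
  }

  // Made-up sample data.
  val cars: ToyFrame = ToyFrame(
    Vector("cylinders", "year"),
    Vector(Vector(4.0, 8.0, 6.0), Vector(70.0, 82.0, 79.0))
  )

  val fullYear: ToyFrame = cars("year") + 1900
  val withFlag: ToyFrame = cars ++ (cars("year") > 80)
}
```

In plain Scala, `++` binds tighter than `>`, so the sketch parenthesizes the comparison to make the intent explicit.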
Map/filter/collect operations
● Map
– Per value/row
● Filter
● Collect
// Collect all cars with more than 4 cylinders
val ff = f filter ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4
});

// Returns a boolean vector
val fm = f map ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4
});

// Compute the sum of column 2 (zero-based index)
val fc = f collect ( 0.0, new CDOp() {
  def apply(acc:scala.Double, rhs:Array[scala.Double]) = acc + rhs(2)
  def reduce(l:scala.Double, r:scala.Double) = l+r
} )
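For comparison, the same three operations can be written against ordinary Scala collections. This is a local sketch only: the rows are made-up data, and the collect helper is a hypothetical function that mimics the apply/reduce split by folding each partition and then merging the partial results.

```scala
object RowOpsSketch {
  type Row = Array[Double]

  // Made-up rows; index 2 plays the role of the cylinder count.
  val rows: Vector[Row] = Vector(
    Array(18.0, 307.0, 8.0),
    Array(24.0, 113.0, 4.0),
    Array(21.0, 200.0, 6.0)
  )

  val moreThan4: Row => Boolean = row => row(2) > 4

  // filter keeps the matching rows, map yields one boolean per row.
  val kept: Vector[Row] = rows.filter(moreThan4)
  val flags: Vector[Boolean] = rows.map(moreThan4)

  // collect: fold every partition with `step`, then merge partials with `reduce`.
  def collect(parts: Seq[Seq[Row]], zero: Double)(
      step: (Double, Row) => Double,
      reduce: (Double, Double) => Double): Double =
    parts.map(_.foldLeft(zero)(step)).reduceOption(reduce).getOrElse(zero)

  // Two partitions stand in for chunks living on different nodes.
  val cylinderSum: Double =
    collect(rows.grouped(2).toSeq, 0.0)((acc, row) => acc + row(2), _ + _)
}
```

The separate reduce step is what allows partial sums computed on different nodes to be combined.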
Internals
It's magic
Internals
● No magic, BUT there are key tricks
– Connect H2O classloaders with the Scala ecosystem
  ● Make sure that all distributed objects are correctly iced
– Translate Scala code into calls of the Java API
  ● Create H2O MR tasks
  ● Pass operations around the cloud
  ● Create new frames
– Preserve primitive types
  ● Do not introduce overhead of boxing/unboxing
Internals – translation to H2O MR tasks
def filter(af: T_A2B_Transf[scala.Double]):T = {
  val f = frame()
  val mrt = new MRTask2() {
    override def map(in:Array[Chunk], out:Array[NewChunk]) = {
      val rlen = in(0)._len
      val tmprow = new Array[scala.Double](in.length)
      for (row:Int <- 0 until rlen ) {
        if (af(Utils.readRow(in,row,tmprow))) {
          for (i:Int <- 0 until in.length) out(i).addNum(tmprow(i))
        }
      }
    }
  }
  mrt.doAll(f.numCols(), f)
  val result = mrt.outputFrame(f.names(), f.domains())
  apply(result) // return the DFrame
}

// Collect all cars with more than 4 cylinders
val f5 = f filter ( new FAOp {
  def apply(rhs: Array[scala.Double]):Boolean = rhs(2) > 4
});
T_A2B_Transf has to be water.Freezable
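The per-chunk loop can be tried out without a cluster. The sketch below replaces H2O chunks with plain arrays, so Chunk, readRow and filterChunk here are hypothetical local definitions rather than the H2O classes; it keeps the same shape of reading each row into one reusable primitive buffer and appending the surviving rows column by column.

```scala
import scala.collection.mutable.ArrayBuffer

object ChunkFilterSketch {
  // Hypothetical stand-in for a chunk: one column's values for a row range.
  type Chunk = Array[Double]

  // Copy row `row` of the column-oriented input into a reusable buffer.
  def readRow(in: Array[Chunk], row: Int, buf: Array[Double]): Array[Double] = {
    var i = 0
    while (i < in.length) { buf(i) = in(i)(row); i += 1 }
    buf
  }

  // The "map" step for one group of aligned chunks: keep rows accepted by `keep`.
  def filterChunk(in: Array[Chunk], keep: Array[Double] => Boolean): Array[Chunk] = {
    val out = Array.fill(in.length)(new ArrayBuffer[Double])
    if (in.nonEmpty) {
      val tmprow = new Array[Double](in.length) // allocated once, reused per row
      for (row <- 0 until in(0).length) {
        if (keep(readRow(in, row, tmprow))) {
          for (i <- in.indices) out(i) += tmprow(i)
        }
      }
    }
    out.map(_.toArray)
  }

  // Made-up data: two columns, three rows.
  val input: Array[Chunk] = Array(Array(1.0, 2.0, 3.0), Array(8.0, 4.0, 6.0))
  val output: Array[Chunk] = filterChunk(input, row => row(1) > 4)
}
```

In the distributed version the predicate has to travel to every node together with the task, which is why it must be serializable by H2O.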
Party demo time!
Towards Scalding-like API
● Vision is to provide Scalding-like syntax
● But so far the DSL is still ugly
f map ( ('name, 'cylinders) -> ('name, 'moreThan4) ) {
  (n:String, c:Int) => (n, if (c>4) 1 else 0)
}
Here ('name, 'cylinders) is the input scheme, ('name, 'moreThan4) is the output scheme, and the block is the transformation.
f map (f, ('name, 'cylinders) -> ('name, 'moreThan4) ) {
  new IcedFunctor2to2[Double,Int,Double,Int] {
    def apply(n:Double, c:Int) = (n, if (c>4) 1 else 0)
  }
}
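The idea of mapping an input scheme to an output scheme can be sketched on local data. Everything here is an assumption made for illustration: Record, mapFields and the sample rows are hypothetical, fields are named by strings instead of symbols, and all values are doubles to keep the example small.

```scala
object FieldMapSketch {
  // Hypothetical record type: field name -> value.
  type Record = Map[String, Double]

  // Read the two input fields, apply `f`, and emit the two output fields.
  def mapFields(rows: Seq[Record], scheme: ((String, String), (String, String)))(
      f: (Double, Double) => (Double, Double)): Seq[Record] = {
    val ((in1, in2), (out1, out2)) = scheme
    rows.map { r =>
      val (a, b) = f(r(in1), r(in2))
      Map(out1 -> a, out2 -> b)
    }
  }

  // Made-up sample data.
  val cars: Seq[Record] = Seq(
    Map("year" -> 70.0, "cylinders" -> 8.0),
    Map("year" -> 82.0, "cylinders" -> 4.0)
  )

  val flagged: Seq[Record] =
    mapFields(cars, ("year", "cylinders") -> ("year", "moreThan4")) {
      (y, c) => (y, if (c > 4) 1.0 else 0.0)
    }
}
```

Each resulting record contains only the fields named in the output scheme.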
Try it and contribute!
> git clone git@github.com:0xdata/h2o.git
> cd h2o
> git checkout -b h2oscala origin/h2oscala
> cd h2o-scala && ./depl.sh # or sbt compile
=== Welcome to the world of ScAlH2O ===
Type `help` or `example` to begin...
h2o>
Thank you!