Page 1: 2014 09 30_sparkling_water_hands_on

Sparkling Water

“Killer App for Spark”

@hexadata & @mmalohlava present

Page 2: 2014 09 30_sparkling_water_hands_on

Spark and H2O

Several months ago…

Page 3: 2014 09 30_sparkling_water_hands_on

Sparkling Water

Before

Tachyon based

Unnecessary data duplication

Now

Pure H2ORDD

Transparent use of H2O data and algorithms with Spark API

Page 4: 2014 09 30_sparkling_water_hands_on

Sparkling Water

RDD (immutable world) + DataFrame (mutable world)
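The contrast between the two worlds can be sketched in plain Scala, with neither Spark nor H2O on the classpath. This is only an analogy under stated assumptions: an immutable `Vector` stands in for an RDD, where a transformation returns a new collection, and a mutable `Array` stands in for an H2O DataFrame, where values are overwritten in place. All names below are hypothetical.

```scala
object TwoWorlds {
  // Stand-in for the immutable (RDD-like) world: map returns a new collection
  // and leaves the input untouched.
  def immutableStep(xs: Vector[Double]): Vector[Double] = xs.map(_ * 2.0)

  // Stand-in for the mutable (DataFrame-like) world: values are overwritten in place.
  def mutableStep(xs: Array[Double]): Unit = {
    var i = 0
    while (i < xs.length) {
      xs(i) = xs(i) * 2.0
      i += 1
    }
  }
}
```

After `immutableStep` the caller still holds the original values; after `mutableStep` the original values are gone, which is why bridging the two models needs an explicit conversion layer.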

Page 5: 2014 09 30_sparkling_water_hands_on

Sparkling Water

[Diagram: RDD and DataFrame]

Page 6: 2014 09 30_sparkling_water_hands_on

Sparkling Water Design

Sparkling App (jar file), launched via spark-submit

Spark Master JVM

Spark Worker JVM (×3)

Sparkling Water Cluster: Spark Executor JVM + H2O (×3)

Page 7: 2014 09 30_sparkling_water_hands_on

Data Distribution

Sparkling Water Cluster: Spark Executor JVM + H2O (×3)

Data Source (e.g. HDFS)

H2O RDD

Spark RDD

Page 8: 2014 09 30_sparkling_water_hands_on

Hands-on Time

Page 9: 2014 09 30_sparkling_water_hands_on

Example

Load & parse CSV data

Use Spark API, do SQL query

Create Deep Learning model

Use model for prediction

Page 10: 2014 09 30_sparkling_water_hands_on

Requirements

Linux or Mac OS X

Oracle Java 1.7

Virtual image is provided for Windows users

Page 11: 2014 09 30_sparkling_water_hands_on

Download

http://0xdata.com/download/

Page 12: 2014 09 30_sparkling_water_hands_on

Install and Launch

Unpack zip file

or

Open provided virtual image in VirtualBox

and launch h2o-examples/sparkling-shell

Page 13: 2014 09 30_sparkling_water_hands_on

What is Sparkling Shell?

Standard spark-shell

Launch H2O extension

export MASTER="local-cluster[3,2,1024]"
spark-shell \
  --jars shaded.jar \
  --conf spark.extensions=org.apache.spark.executor.H2OPlatformExtension

MASTER: Spark Master address

--jars: JAR containing H2O code

--conf spark.extensions: name of H2O extension provided by JAR

Page 14: 2014 09 30_sparkling_water_hands_on

…more on launching…

‣ By default single JVM, multi-threaded (export MASTER=local[*]) or

‣ export MASTER="local-cluster[3,2,1024]" to launch an embedded Spark cluster or

‣ Launch standalone Spark cluster via sbin/launch-spark-cloud.sh and export MASTER="spark://localhost:7077"
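To make the three MASTER forms above easier to read, here is a minimal plain-Scala sketch that parses them. It is not Spark's own parser. It assumes Spark's convention that the three numbers in `local-cluster[3,2,1024]` are the number of workers, the cores per worker, and the memory per worker in MB. The `MasterUrl` object and its case classes are hypothetical names introduced for illustration.

```scala
object MasterUrl {
  sealed trait Master
  // threads = None means "*", i.e. use all available cores
  case class Local(threads: Option[Int]) extends Master
  case class LocalCluster(workers: Int, coresPerWorker: Int, memoryPerWorkerMb: Int) extends Master
  case class Standalone(host: String, port: Int) extends Master

  private val LocalN        = """local\[(\d+|\*)\]""".r
  private val LocalClusterR = """local-cluster\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]""".r
  private val SparkUrl      = """spark://([^:/]+):(\d+)""".r

  // Returns None for strings that match none of the three forms.
  def parse(master: String): Option[Master] = master match {
    case "local"                => Some(Local(Some(1)))
    case LocalN("*")            => Some(Local(None))
    case LocalN(n)              => Some(Local(Some(n.toInt)))
    case LocalClusterR(w, c, m) => Some(LocalCluster(w.toInt, c.toInt, m.toInt))
    case SparkUrl(h, p)         => Some(Standalone(h, p.toInt))
    case _                      => None
  }
}
```

Under that assumption, `MasterUrl.parse("local-cluster[3,2,1024]")` describes three workers with two cores and 1024 MB each, which matches the three-node H2O cloud the hands-on waits for later.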

Page 15: 2014 09 30_sparkling_water_hands_on

Let's play with Sparkling Shell…

Page 16: 2014 09 30_sparkling_water_hands_on

Create H2O Client

import water.{H2O,H2OClientApp}
H2OClientApp.start()
H2O.waitForCloudSize(3, 10000)
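The last call above appears to block until the H2O cloud has 3 members, giving up after 10000 ms. The general pattern (poll a size until it reaches a target or a deadline passes) can be sketched in plain Scala. This is not H2O's implementation; `CloudWait`, `waitForSize`, and the `currentSize` callback are hypothetical names.

```scala
object CloudWait {
  /** Polls `currentSize` until it reaches `expected` or `timeoutMs` elapses.
    * Returns true if the expected size was reached in time.
    */
  def waitForSize(expected: Int, timeoutMs: Long, pollMs: Long = 100L)
                 (currentSize: () => Int): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    var reached = currentSize() >= expected
    while (!reached && System.currentTimeMillis() < deadline) {
      Thread.sleep(pollMs) // back off between checks instead of spinning
      reached = currentSize() >= expected
    }
    reached
  }
}
```

A caller would pass a function that reports the current member count, for example `CloudWait.waitForSize(3, 10000L)(() => knownNodes.size)`, where `knownNodes` is whatever membership view the application keeps.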

Page 17: 2014 09 30_sparkling_water_hands_on

Is Spark running?

http://localhost:4040

Page 18: 2014 09 30_sparkling_water_hands_on

Is H2O running?

http://localhost:54321/steam/index.html

Page 19: 2014 09 30_sparkling_water_hands_on

Data

Load some data and parse it

import java.io.File
import org.apache.spark.examples.h2o._
import org.apache.spark.h2o._
val dataFile = "../h2o-examples/smalldata/allyears2k_headers.csv.gz"

// Create DataFrame - involves parse of data
val airlinesData = new DataFrame(new File(dataFile))

Page 20: 2014 09 30_sparkling_water_hands_on

Where is the data?

Go to http://localhost:54321/steam/index.html

Page 21: 2014 09 30_sparkling_water_hands_on

Use Spark API

import org.apache.spark.rdd.RDD
// H2O Context provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._

// Create RDD wrapper around DataFrame
val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count

// And use Spark RDD API directly
val flightsOnlyToSF = airlinesTable.filter( f => f.Dest==Some("SFO") || f.Dest==Some("SJC") || f.Dest==Some("OAK") )
flightsOnlyToSF.count
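The filter above compares `f.Dest` against `Some("SFO")`, which suggests that fields of the row type are `Option` values, so that missing entries in the CSV can be represented. The same predicate can be tried on a local collection without Spark. The sketch below assumes a cut-down, hypothetical `Flight` case class with only two fields. It is not the real `Airlines` type.

```scala
object OptionFilter {
  // Hypothetical, cut-down stand-in for the Airlines row type; only two fields are modeled.
  case class Flight(Origin: Option[String], Dest: Option[String])

  private val bayAreaAirports = Set("SFO", "SJC", "OAK")

  // Same result as chaining f.Dest == Some("SFO") || ..., written with Option.exists.
  // A missing destination (None) never matches.
  def toBayArea(flights: Seq[Flight]): Seq[Flight] =
    flights.filter(_.Dest.exists(bayAreaAirports.contains))
}
```

Keeping the airport codes in a `Set` means adding a fourth airport changes one line instead of extending a chain of `||` comparisons.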

Page 22: 2014 09 30_sparkling_water_hands_on

Use Spark SQL

import org.apache.spark.sql.SQLContext
// We need to create SQL context
val sqlContext = new SQLContext(sc)
import sqlContext._
airlinesTable.registerTempTable("airlinesTable")

val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
// Invoke query
val result = sql(query) // Using a registered context and tables
result.count

assert(result.count == flightsOnlyToSF.count)

Page 23: 2014 09 30_sparkling_water_hands_on

Launch H2O Algorithms

import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters
// Setup deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._training_frame = result( 'Year, 'Month, 'DayofMonth, 'DayOfWeek, 'CRSDepTime, 'CRSArrTime,
  'UniqueCarrier, 'FlightNum, 'TailNum, 'CRSElapsedTime, 'Origin, 'Dest,
  'Distance, 'IsDepDelayed)
dlParams.response_column = 'IsDepDelayed.name

// Create a new model builder
val dl = new DeepLearning(dlParams)

val dlModel = dl.train.get

Page 24: 2014 09 30_sparkling_water_hands_on

Make a prediction

// Use model to score data
val prediction = dlModel.score(result)('predict)

// Collect predicted values via RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map ( _.result.getOrElse("NaN") )
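One detail of the last line is worth a small sketch. If `result` is an `Option[Double]` (an assumption about `DoubleHolder`, which the slide does not show), then `getOrElse("NaN")` mixes doubles and strings and the collected array is typed `Array[Any]`. Using `Double.NaN` as the default keeps the values numeric. The `Holder` class below is a hypothetical stand-in, not the real `DoubleHolder`.

```scala
object Predictions {
  // Hypothetical stand-in for DoubleHolder: one optional numeric prediction per row.
  case class Holder(result: Option[Double])

  // getOrElse("NaN") would widen the element type to Any;
  // Double.NaN keeps every element a Double.
  def values(rows: Seq[Holder]): Seq[Double] =
    rows.map(_.result.getOrElse(Double.NaN))
}
```

With numeric results, missing predictions can still be detected afterwards via `isNaN`, and the rest can be aggregated without casts.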

Page 25: 2014 09 30_sparkling_water_hands_on

What is under the hood?

Page 26: 2014 09 30_sparkling_water_hands_on

Spark App Extension

/** Notion of Spark application platform extension. */
trait PlatformExtension extends Serializable {
  /** Method to start extension */
  def start(conf: SparkConf):Unit
  /** Method to stop extension */
  def stop (conf: SparkConf):Unit
  /* Point in Spark infrastructure which will be intercepted by this extension. */
  def intercept: InterceptionPoints = InterceptionPoints.EXECUTOR_LC
  /* User-friendly description of extension */
  def desc:String
  override def toString = s"$desc@$intercept"
}

/** Supported interception points.
  *
  * Currently only Executor life cycle is supported.
  */
object InterceptionPoints extends Enumeration {
  type InterceptionPoints = Value
  val EXECUTOR_LC /* Inject into executor lifecycle */ = Value
}
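To show what an implementation of such a trait might look like, here is a simplified, self-contained sketch that compiles without Spark. It makes two loudly stated simplifications: a plain `Map[String, String]` replaces `SparkConf`, and the interception point is left out. `SimpleExtension` and `LoggingExtension` are hypothetical names, and this is not how the H2O extension itself is written.

```scala
object ExtensionSketch {
  // Simplified stand-in for the trait above: a plain Map replaces SparkConf
  // and the interception point is omitted.
  trait SimpleExtension extends Serializable {
    def start(conf: Map[String, String]): Unit
    def stop(conf: Map[String, String]): Unit
    def desc: String
    override def toString: String = desc
  }

  // Hypothetical extension that only records lifecycle events.
  class LoggingExtension extends SimpleExtension {
    private val events = scala.collection.mutable.ArrayBuffer.empty[String]

    def start(conf: Map[String, String]): Unit = {
      val size = conf.getOrElse("spark.h2o.cluster.size", "?")
      events += s"start(size=$size)"
    }

    def stop(conf: Map[String, String]): Unit = {
      events += "stop"
    }

    def desc: String = "logging-extension"

    // Snapshot of the recorded events, in call order.
    def log: List[String] = events.toList
  }
}
```

The point of the design is that the host only needs to call `start` and `stop` at the right moments of the executor life cycle; everything else stays inside the extension.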

Page 27: 2014 09 30_sparkling_water_hands_on

Using App Extensions

val conf = new SparkConf()
  .setAppName("Sparkling H2O Example")
// Setup expected size of H2O cloud
conf.set("spark.h2o.cluster.size", h2oWorkers)

// Add H2O extension
conf.addExtension[H2OPlatformExtension]

// Create Spark Context from the configuration built above
val sc = new SparkContext(conf)

Page 28: 2014 09 30_sparkling_water_hands_on

Spark Changes

We keep them small (~30 lines of code)

JIRA SPARK-3270 - Platform App Extensions

https://issues.apache.org/jira/browse/SPARK-3270

Page 29: 2014 09 30_sparkling_water_hands_on

You can participate!

Epic PUBDEV-21 aka Sparkling Water

PUBDEV-23 Test HDFS reader

PUBDEV-26 Implement toSchemaRDD

PUBDEV-27 Boolean transfers

PUBDEV-31 Support toRDD[ X <: Numeric]

PUBDEV-32/33 Mesos/YARN support

Page 30: 2014 09 30_sparkling_water_hands_on

More info

Check out the 0xdata blog for tutorials

http://0xdata.com/blog/

Check out the 0xdata YouTube channel

https://www.youtube.com/user/0xdata

Check out GitHub

https://github.com/0xdata/h2o-dev

https://github.com/0xdata/perrier

Page 31: 2014 09 30_sparkling_water_hands_on

Learn more about H2O at 0xdata.com

or

neo> for r in h2o-dev perrier; do
  git clone "[email protected]:0xdata/$r.git"
done

Thank you!

Follow us at @hexadata

