SparkR: Interactive R at scale Shivaram Venkataraman Zongheng Yang

SparkR: Interactive R at scale - Meetup
files.meetup.com/3138542/SparkR-meetup.pdf
SparkR: Interactive R at scale. Shivaram Venkataraman, Zongheng Yang. Fast! Scalable Interactive Statistics

Mar 19, 2018

Transcript
Page 1:

SparkR: Interactive R at scale

Shivaram Venkataraman Zongheng Yang

Page 2:

Fast!

Scalable Interactive

Page 3:

Statistics!

Packages Plots

Page 4:

Fast!

Scalable

Interactive Shell

Statistics!

Plots

Packages

Page 5:

RDD

Transformations: map, filter, groupBy, …

Actions: count, collect, save, …
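
As a rough local analogy (not the Spark API), the split between transformations and actions can be sketched in base R on an ordinary list. Transformations build a new collection from an old one; actions return a plain value to the caller. The variable names are illustrative assumptions and nothing here is distributed or lazy.

```r
# An ordinary R list standing in for a distributed dataset (assumption).
numbers <- as.list(1:10)

# "Transformations": each step yields a new collection.
squares <- Map(function(x) x^2, numbers)            # like map
evens   <- Filter(function(x) x %% 2 == 0, squares) # like filter
groups  <- split(unlist(evens), unlist(evens) > 20) # like groupBy

# "Actions": each step returns a plain value.
n         <- length(evens)  # like count
collected <- unlist(evens)  # like collect
```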

Page 6:

Q: How can I use a loop to [...insert task here...]?
A: Don't. Use one of the apply functions.

From: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/

R
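
A small base R illustration of the advice quoted on this slide: the same computation written first as an explicit loop and then with the apply family. The data is made up for the example.

```r
values <- c(1, 4, 9, 16)

# Explicit loop: preallocate the result, then fill it element by element.
roots_loop <- numeric(length(values))
for (i in seq_along(values)) {
  roots_loop[i] <- sqrt(values[i])
}

# apply family: the function is passed in and applied to every element.
roots_vector <- sapply(values, sqrt)  # simplifies to a numeric vector
roots_list   <- lapply(values, sqrt)  # always returns a list
```

The `lapply` form is the one that carries over to the RDD functions listed on the later slides, since it takes a collection and a function and returns a collection.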

Page 7:

R + RDD = R2D2

Page 8:

R + RDD = RRDD

lapply, lapplyPartition

groupByKey, reduceByKey, sampleRDD

collect, cache, …

broadcast, includePackage

textFile, parallelize

Page 11:

Example: Word Count

    lines <- textFile(sc, args[[2]])
    words <- flatMap(lines,
                     function(line) {
                       strsplit(line, " ")[[1]]
                     })
    wordCount <- lapply(words,
                        function(word) {
                          list(word, 1L)
                        })
    counts <- reduceByKey(wordCount, "+", 2L)
    output <- collect(counts)
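
To try the same steps without a cluster, here is a minimal base R sketch that mimics the word count locally. It is an analogy only: `unlist(lapply(...))` stands in for `flatMap`, and a `split` followed by `Reduce` stands in for `reduceByKey`. The input lines and the variable names are assumptions made for the example.

```r
# Stand-in for the lines of the input file (made-up data).
lines <- c("to be or not to be", "to do is to be")

# flatMap analogue: split each line into words, then flatten.
words <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))

# Same mapping as on the slide: each word becomes a (word, 1L) pair.
word_count <- lapply(words, function(word) list(word, 1L))

# reduceByKey analogue: group the values by key, then fold each group with "+".
keys   <- vapply(word_count, function(pair) pair[[1]], character(1))
values <- vapply(word_count, function(pair) pair[[2]], integer(1))
counts <- vapply(split(values, keys),
                 function(group) Reduce(`+`, group),
                 integer(1))
```

Here `counts` is a named integer vector, so `counts[["to"]]` gives the count for a single word.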

Page 12:

Demo

Page 13:

MNIST

Page 14:

A

b

Minimize ||Ax – b||2
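
The slides do not say how the demo solves this problem. One way a least-squares problem can be split across partitions, assuming the normal-equations approach, is to compute the small matrices t(A_i) %*% A_i and t(A_i) %*% b_i on each block of rows, sum them, and solve once at the end. The sketch below does this locally in base R on made-up data; `row_blocks` and the other names are illustrative assumptions.

```r
# Made-up overdetermined system: 4 equations, 2 unknowns.
A <- cbind(1, c(1, 2, 3, 4))
b <- c(2.1, 3.9, 6.2, 7.8)

# Pretend the rows live in two partitions.
row_blocks <- list(1:2, 3:4)

# Per-partition work: small D x D and D x 1 summaries of each block.
partial <- lapply(row_blocks, function(rows) {
  Ai <- A[rows, , drop = FALSE]
  bi <- b[rows]
  list(AtA = crossprod(Ai), Atb = crossprod(Ai, bi))
})

# Reduce step: add the summaries from all partitions.
total <- Reduce(function(p, q) list(AtA = p$AtA + q$AtA, Atb = p$Atb + q$Atb),
                partial)

# Solve the normal equations once, on the combined summaries.
x_blocks <- solve(total$AtA, total$Atb)

# Reference answer computed on the whole matrix with a QR decomposition.
x_direct <- qr.solve(A, b)
```

The normal equations are simple to distribute but less numerically stable than QR on ill-conditioned data, so this is a sketch of the pattern rather than a recommended solver.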

Page 15:

How does this work?

Page 16:

Dataflow

[Diagram: one Local node and two Worker nodes]

Page 17:

Dataflow

[Diagram: Local node with R; two Worker nodes]

Page 18:

Dataflow

[Diagram: Local node with R Spark Context, JNI, Java Spark Context; two Worker nodes]

Page 19:

Dataflow

[Diagram: Local node with R Spark Context, JNI, Java Spark Context; two Worker nodes, each with Spark Executor, exec, R]

Page 20:
Page 21:

From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/

Page 22:
Page 23:

Dataflow

[Diagram: Local node with R Spark Context, JNI, Java Spark Context; two Worker nodes, each with Spark Executor, exec, R]

Page 24:

Pipelined RDD

    words <- flatMap(lines, …)
    wordCount <- lapply(words, …)

[Diagram: Spark Executor, exec, R; then Spark Executor, exec, R again]

Page 25:

Pipelined RDD

[Diagram, first row: Spark Executor, exec, R; then Spark Executor, exec, R again]

[Diagram, second row: Spark Executor, exec, R, R, Spark Executor]
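
The idea behind pipelining, as these two slides suggest, is that consecutive R functions can run inside one R invocation instead of returning to the executor between steps. A minimal local sketch of that idea in base R follows. It only shows the function composition, not SparkR's actual mechanism; the names `split_words`, `to_pair` and `pipelined` are hypothetical.

```r
# The two per-element steps from the word count example.
split_words <- function(line) strsplit(line, " ")[[1]]
to_pair     <- function(word) list(word, 1L)

# Stand-in for one partition of input lines (made-up data).
partition <- c("to be or not to be", "to do is to be")

# Unpipelined: two separate passes with a materialized intermediate result.
words          <- unlist(lapply(partition, split_words))
pairs_two_pass <- lapply(words, to_pair)

# Pipelined: one function applies both steps to a partition in a single call,
# so the intermediate list never leaves the R process.
pipelined <- function(partition) {
  words <- unlist(lapply(partition, split_words))
  lapply(words, to_pair)
}
pairs_one_pass <- pipelined(partition)
```

Both routes give the same pairs. In a real deployment the difference would show up as one R process launch and one round of serialization per partition instead of two.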

Page 26:

Alpha developer release

One line install!

    install_github("amplab-extras/SparkR-pkg", subdir="pkg")

Page 27:

SparkR Implementation

Very similar to PySpark
Spark is easy to extend

292 lines of Scala code
1694 lines of R code
549 lines of test code in R

Page 28:
Page 29:

EC2 setup
All Spark examples
MNIST demo
Hadoop2, Maven build

Also on GitHub

Page 30:

In the Roadmap

DataFrame support using Catalyst
Calling MLlib from R
Daemon R processes

Page 31:

SparkR

Combine scalability & utility

RDD → distributed lists
Serialize closure
Re-use R packages
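
"Serialize closure" can be illustrated with base R alone: a function together with the variables it captured can be turned into a raw byte vector and rebuilt elsewhere. The sketch below round-trips a closure within one R session using `serialize` and `unserialize`. How SparkR itself ships closures to workers is not shown on the slides, so treat this as an illustration of the concept; `make_scaler` is a hypothetical helper.

```r
# A function factory: the returned function captures `factor` in its environment.
make_scaler <- function(factor) {
  force(factor)  # evaluate now so the value itself is captured
  function(x) x * factor
}
scale_by_3 <- make_scaler(3)

# Serialize the closure (code plus captured environment) to a raw vector.
bytes <- serialize(scale_by_3, connection = NULL)

# Rebuild the function from the bytes and call it.
restored <- unserialize(bytes)
result   <- restored(c(1, 2, 4))
```

The raw vector is the kind of payload that could be sent to another process, which then calls `unserialize` and applies the function to its share of the data.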

Page 32:

SparkR https://github.com/amplab-extras/SparkR-pkg

Shivaram Venkataraman [email protected]

Spark User mailing list [email protected]

Page 33:
Page 35:

Example: Logistic Regression

    pointsRDD <- textFile(sc, "hdfs://myfile")
    weights <- runif(n = D, min = -1, max = 1)

    # Logistic gradient, summed over the points in one partition
    gradient <- function(partition) {
      # Labels are in the first column, features in the remaining columns
      Y <- partition[, 1]; X <- partition[, -1]
      t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
    }

    # Iterate
    weights <- weights - reduce(
      lapplyPartition(pointsRDD, gradient), "+")
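
The same pattern can be run locally in base R, with a list of matrices standing in for the partitions of an RDD. This is a sketch under several assumptions that are not in the slides: the data is synthetic, the labels are -1 or 1 and sit in the first column, and the update uses a fixed step size of 0.005 over 50 iterations (the slide shows a single unscaled update).

```r
set.seed(42)
N <- 200  # number of points (assumption)
D <- 2    # number of features (assumption)

# Synthetic data: label in column 1, features in the remaining columns.
X_all  <- matrix(rnorm(N * D), nrow = N, ncol = D)
Y_all  <- ifelse(as.vector(X_all %*% c(2, -1)) > 0, 1, -1)
points <- cbind(Y_all, X_all, deparse.level = 0)

# Split the rows into 4 blocks that play the role of RDD partitions.
row_blocks <- split(seq_len(N), rep(1:4, length.out = N))
partitions <- lapply(row_blocks, function(rows) points[rows, , drop = FALSE])

# Logistic gradient summed over the points of one partition.
gradient <- function(partition, weights) {
  Y <- partition[, 1]
  X <- partition[, -1, drop = FALSE]
  t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}

# Logistic loss over the full data set, used only to check progress.
loss <- function(weights) {
  sum(log(1 + exp(-Y_all * as.vector(X_all %*% weights))))
}

weights      <- runif(n = D, min = -1, max = 1)
initial_loss <- loss(weights)
step         <- 0.005  # fixed step size (assumption)

for (iter in 1:50) {
  # lapplyPartition analogue, followed by the reduce with "+".
  grads   <- lapply(partitions, gradient, weights = weights)
  weights <- weights - step * as.vector(Reduce(`+`, grads))
}
final_loss <- loss(weights)
```

Because the gradient is a sum over points, adding the per-partition gradients gives the same result as computing it on the full data set, which is what makes the `reduce(..., "+")` step valid.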

Page 36:

How does it work?

[Diagram: R Shell, rJava, Spark Context; two Spark Executors, each running an RScript]

Data: RDD[Array[Byte]]
Func: Array[Byte]