Enabling Exploratory Data Science with Apache Spark and R Hossein Falaki (@mhfalaki)
Apr 16, 2017
About the speaker: Hossein Falaki
Hossein Falaki is a software engineer at Databricks working on the next big thing. Prior to that, he was a data scientist on Apple's personal assistant, Siri. He graduated with a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).
About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data science engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud environments.
We are Databricks, the company behind Apache Spark
• Founded by the creators of Apache Spark in 2013
• 75% share of Spark code contributed by Databricks in 2014
Created Databricks on top of Spark to make big data simple.
Why do we like R?
• Open source
• Highly dynamic
• Interactive environment
• Rich ecosystem of packages
• Powerful visualization infrastructure
• Data frames make data manipulation convenient
• Taught by many schools to stats and computing students
What would be ideal?
Seamless manipulation and analysis of very large data in R:
• R's flexible syntax
• R's rich package ecosystem
• R's interactive environment
• Scalability (scale up and out)
• Integration with distributed data sources / storage
Augmenting R with other frameworks
In practice, data scientists use R in conjunction with other frameworks (Hadoop MapReduce, Hive, Pig, relational databases, etc.).
[Workflow diagram] Framework X (Language Y) works against distributed storage:
1. Load, clean, transform, aggregate, sample
2. Save to local storage
3. Read and analyze in R
Iterate.
What is SparkR?
An R package distributed with Apache Spark:
• Provides an R frontend to Spark
• Exposes Spark DataFrames (inspired by R and pandas)
• Convenient interoperability between R and Spark DataFrames
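As a minimal sketch of that interoperability (assuming `sparkR.init()` and `sparkRSQL.init()` have already been run, as shown later in this talk), a local R data.frame can round-trip through Spark:

```r
# Sketch only: requires a running SparkR session with a sqlContext.
localDF <- data.frame(name = c("a", "b", "c"), value = c(1, 2, 3))

# Promote the local data.frame to a distributed Spark DataFrame ...
sparkDF <- createDataFrame(sqlContext, localDF)

# ... operate on it with Spark, then collect the (small) result back into R.
resultDF <- collect(filter(sparkDF, sparkDF$value > 1))
```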
R (dynamic environment, interactivity, packages, visualization) + Spark (distributed/robust processing, data sources, off-memory data structures)
How does SparkR solve our problems?
• No local storage involved
• Write everything in R
• Use Spark's distributed cache for interactive/iterative analysis at the speed of thought
Example SparkR program
# Loading distributed data
df <- read.df("hdfs://bigdata/logs", source = "json")

# Distributed filtering and aggregation
errors <- subset(df, df$type == "error")
counts <- agg(groupBy(errors, df$code), num = count(df$code))

# Collecting and plotting small data
qplot(code, num, data = collect(counts), geom = "bar", stat = "identity") + coord_flip()
Overview of SparkR API
IO
• read.df / write.df
• createDataFrame / collect

Caching
• cache / persist / unpersist
• cacheTable / uncacheTable

Utility functions
• dim / head / take
• names / rand / sample / ...

MLlib
• glm / kmeans / ...

DataFrame API
• select / subset / groupBy
• head / showDF / unionAll
• agg / avg / column / ...

SQL
• sql / table / saveAsTable
• registerTempTable / tables
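As a sketch of the MLlib bindings listed above (assuming a running SparkR session with a `sqlContext`, and using R's built-in `iris` dataset, whose dotted column names SparkR rewrites with underscores), a Gaussian GLM fit might look like:

```r
# Sketch only: requires a running Spark cluster and SparkR session.
# Ship R's built-in iris data to Spark as a distributed DataFrame.
irisDF <- createDataFrame(sqlContext, iris)

# Fit a Gaussian GLM using R's familiar formula syntax.
model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = irisDF, family = "gaussian")

# Inspect the fitted coefficients, as with a local glm() fit.
summary(model)
```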
Overview of SparkR API :: SQLContext
SQLContext is your interface to Spark functionality in R
• SparkR DataFrames are implemented on top of Spark SQL tables
• All DataFrame operations go through a SQL optimizer (Catalyst)
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
From now on, you don’t need Spark Context (sc) any more.
Overview of SparkR API :: Caching
Controls caching of distributed data:
• persist(sparkDF, storageLevel), where storageLevel is one of:
  o DISK_ONLY
  o MEMORY_AND_DISK
  o MEMORY_AND_DISK_SER
  o MEMORY_ONLY
  o MEMORY_ONLY_SER
  o OFF_HEAP
• cache(sparkDF) == persist(sparkDF, "MEMORY_ONLY")
• cacheTable(sqlContext, "table_name")
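Put together, a typical caching workflow looks like the sketch below (assuming an existing `sqlContext`; the "data/logs" path and its schema are hypothetical):

```r
# Sketch only: requires a running SparkR session; path is hypothetical.
logs <- read.df(sqlContext, "data/logs", source = "json")

# Keep the data in memory, spilling to disk if it does not fit.
persist(logs, "MEMORY_AND_DISK")

# Repeated queries now hit the cache instead of re-reading the source.
head(agg(groupBy(logs, logs$type), num = count(logs$type)))

# Release the cached data when done.
unpersist(logs)
```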
Overview of SparkR API :: DataFrame API
SparkR DataFrames behave similarly to R data.frames:
o sparkDF$newCol <- sparkDF$col + 1
o subsetDF <- sparkDF[, c("date", "type")]
o recentData <- subset(sparkDF, sparkDF$date == "2015-10-24")
o firstRow <- sparkDF[1, ]
o names(subsetDF) <- c("Date", "Type")
o dim(recentData)
o head(collect(count(group_by(subsetDF, "Date"))))
Overview of SparkR API :: SQL
You can register a DataFrame as a table and query it in SQL:
o logs <- read.df(sqlContext, "data/logs", source = "json")
o registerTempTable(logs, "logsTable")
o errorsByCode <- sql(sqlContext, "select count(*) as num, code from logsTable where type = 'error' group by code order by num desc")
o reviewsDF <- table(sqlContext, "reviewsTable")
o registerTempTable(filter(reviewsDF, reviewsDF$rating == 5), "fiveStars")
Mixing R and SQL
Pass a query to SQLContext and get the result back as a DataFrame
# Register DataFrame as a table
registerTempTable(df, "dataTable")

# Complex SQL query, result is returned as another DataFrame
aggCount <- sql(sqlContext, "select count(*) as num, type, date from dataTable group by type, date order by date desc")

qplot(date, num, data = collect(aggCount), geom = "line")
Moving between languages
R:
df <- read.df(...)
wiki <- filter(df, ...)
registerTempTable(wiki, "wiki")

Scala:
val wiki = table("wiki")
val parsed = wiki.map { case Row(_, _, text: String, _, _) => text.split(" ") }
val model = KMeans.train(parsed)
How to get started with SparkR?
• On your computer:
  1. Download the latest version of Spark (2.0)
  2. Build (Maven or sbt)
  3. Run ./install-dev.sh inside the R directory
  4. Start the R shell by running ./bin/sparkR
• Deploy Spark on your cluster
• Sign up for Databricks Community Edition:
https://databricks.com/try-databricks
Summary
1. SparkR is an R frontend to Apache Spark
2. Distributed data resides in the JVM
3. Workers are not running an R process (yet)
4. Distinction between Spark DataFrames and R data frames