Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016

Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc

Spark Data Sources API + Spark-ElasticSearch Connector + Spark-ElasticSearch In Action

Advanced Apache Spark Meetup Thanks Rackspace (Space) and Loggly (Food)!!

Feb 15, 2016

Chris Fregly Principal Data Engineer @ IBM Spark Tech Center

advancedspark.com!

+

Costin Leau Engineer, Elastic

Urvish Mahida Data Platform Engineer, Loggly

Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Who Am I?

2

Streaming Data Engineer Open Source Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer IBM Technology Center

Founder Advanced Apache Meetup

Author Advanced .

Due 2016


Where Am I?

3

Summit East 2016

New York City

Spark-NYC (Co-presenting with Databricks)


Advanced Apache Spark Meetup http://advancedspark.com Meetup Metrics Top 5 Most-active Spark Meetup! ~2600 Members in just 6 mos!! ~2600 Docker image downloads

Meetup Mission Deep-dive into Spark and related open source projects Surface key patterns and idioms Focus on distributed systems, scale, and performance

4


Presentation Outline ①  Partitions, Pruning, Pushdowns

②  Spark Data Sources API ③  DataFrames and DataSets

④  Spark and ElasticSearch In Action 5


Partitions Partition Based on Data Access Patterns /genders.parquet/gender=M/… /gender=F/… <-- Use Case: Access Users by Gender /gender=U/…

Dynamic Partition Creation (Write) Dynamically create partitions on write based on column (ie. Gender) SQL: INSERT TABLE genders PARTITION (gender) SELECT … DF: gendersDF.write.format("parquet").partitionBy("gender")

.save("/genders.parquet")

Partition Discovery (Read) Dynamically infer partitions on read based on paths (ie. /gender=F/…) SQL: SELECT id FROM genders WHERE gender=F DF: gendersDF.read.format("parquet").load("/genders.parquet/").select($"id"). .where("gender=F")

6


Pruning Partition Pruning Filter out rows by partition

SELECT id, gender FROM genders WHERE gender = ‘F’

Column Pruning Filter out columns by column filter Extremely useful for columnar storage formats (ie. Parquet) Skip entire blocks of columns

SELECT id, gender FROM genders

7


Pushdowns aka. Predicate or Filter Pushdowns Predicate returns true or false for given functionFilters rows deep into the data source Reduces number of rows returned Data Source must implement PrunedFilteredScan def buildScan(requiredColumns: Array[String],

filters: Array[Filter]): RDD[Row]

8


Filter Collapse and Pushdown

9

Filter Collapse

Filter is Not Pushed Down

(JSON)

Filter is Pushed Down

(Parquet)


Join Between Partitioned & Unpartitioned

10

Note: JSON supports partitioning, We’re not using it here.


Join Between Partitioned & Partitioned

11


Cartesian Join vs. Inner Join

12


Broadcast Join vs. Normal Shuffle Join

13


Visualizing the Query Plan

14

Effectiveness of Filter

Cost-based Join Optimization

Similar to MapReduce

Map-side Join & DistributedCache

Peak Memory for Joins and Aggs

UnsafeFixedWidthAggregationMapgetPeakMemoryUsedBytes()






Spark Data Sources API Relations (o.a.s.sql.sources.interfaces.scala) BaseRelation (abstract class): Provides schema of data TableScan (impl): Read all data from source PrunedFilteredScan (impl): Column pruning & predicate pushdowns InsertableRelation (impl): Insert/overwrite data based on SaveMode RelationProvider (trait/interface): Handle options, BaseRelation factory

Filters (o.a.s.sql.sources.filters.scala) Filter (abstract class): Handles all filters supported by this source EqualTo (impl) GreaterThan (impl) StringStartsWith (impl)

16


Native Spark SQL Data Sources

17


JSON Data Source DataFrame val ratingsDF = sqlContext.read.format("json")

.load("file:/root/pipeline/datasets/dating/ratings.json.bz2") -- or – val ratingsDF = sqlContext.read.json ("file:/root/pipeline/datasets/dating/ratings.json.bz2")

SQL Code CREATE TABLE genders USING json OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2") 18

json() convenience method


Parquet Data Source Configuration

spark.sql.parquet.filterPushdown=true spark.sql.parquet.mergeSchema=false (unless your schema is evolving) spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable()) spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]

DataFrames val gendersDF = sqlContext.read.format("parquet") .load("file:/root/pipeline/datasets/dating/genders.parquet") gendersDF.write.format("parquet").partitionBy("gender") .save("file:/root/pipeline/datasets/dating/genders.parquet")

SQL CREATE TABLE genders USING parquet OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")

19


ElasticSearch Data Source Github https://github.com/elastic/elasticsearch-hadoop

Maven org.elasticsearch:elasticsearch-spark_2.10:2.2.0

Code

val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>") df.write.format("org.elasticsearch.spark.sql”).mode(SaveMode.Overwrite) .options(esConfig).save("<index>/<document-type>")

20


Creating a Custom Data Source ①  Study existing implementations o.a.s.sql.execution.datasources.jdbc.JDBCRelation ②  Extend base traits & implement required methods o.a.s.sql.sources.{BaseRelation,PrunedFilterScan}

Examples Spark JDBC (o.a.s.sql.execution.datasources.jdbc) class JDBCRelation extends BaseRelation

with PrunedFilteredScan, InsertableRelation

DataStax Cassandra (o.a.s.sql.cassandra) class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan, InsertableRelation!

21

Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc

Demo! Create a Simple Integer Data Source

Predicate (Filter) Pushdowns

https://github.com/fluxcapacitor/pipeline/blob/master/myapps/sql/src/main/scala/com/advancedspark/sql/source/IntegerDataSource.scala

22






DataFrames and DataSets DataFrames Lost compile-time typing from RDD’s Favored untyped o.a.s.sql.Row Code could break at runtime

DataSets Re-introduce compile-time types Requires Custom Encoders/Serializers Tip: Use Kryo Serializer for custom Encoder Check out mapGroups() and flatMapGroups() methods Operate on grouped data

24






advancedspark.com (github and docker) End User ->

ElasticSearch ->

Spark ML ->

Data Scientist -> 26

<- Kafka <- Spark Streaming <- Cassandra, Redis <- Zeppelin, iPython


Thank You!!! Chris Fregly IBM Spark Tech Center (http://spark.tc) San Francisco, California, USA advancedspark.com Sign up for the Meetup and Book Clone, Contribute, Commit on Github Run All Demos using Docker Image 2600 Docker Downloads!!

Find me on LinkedIn, Twitter, Github, Email, Fax 27

Image derived from http://www.duchess-france.org/


Up Next… Costin Leau Developer @ Elastic (45 mins – 1 hour w/ Q & A) Urvish Mahida Data Platform Dev @ Loggly (30 – 45 mins w/ Q & A)

28

Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016

Software