IBM | spark.tc Scotland Data Science Meetup Spark SQL + DataFrames + Catalyst + Data Sources API Chris Fregly, Principal Data Solutions Engineer IBM Spark Technology Center Oct 13, 2015 Power of data. Simplicity of design. Speed of innovation.

Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, DataSources API, Spark Cassandra Connector, ORC, Parquet, JSON, CSV, REST, ElasticSearch, DynamoDB, RedShift,

Jan 07, 2017


Chris Fregly
Page 1

IBM | spark.tc

Scotland Data Science Meetup Spark SQL + DataFrames + Catalyst + Data Sources API

Chris Fregly, Principal Data Solutions Engineer IBM Spark Technology Center

Oct 13, 2015

Power of data. Simplicity of design. Speed of innovation.

Page 2

Announcements

Thanks to TechCube Incubator!

Georgia Boyle
Organizer, London Spark Meetup

Page 3

Who am I?

Streaming Data Engineer: Netflix Open Source Committer
Data Solutions Engineer: Apache Contributor
Principal Data Solutions Engineer: IBM Spark Technology Center
Meetup Organizer: Advanced Apache Spark Meetup
Book Author: Advanced Spark (2016)

Page 4

meetup.com/Advanced-Apache-Spark-Meetup/

Total Spark Experts: 1200+ in only 3 months
#5 most active Spark Meetup in the world

Goals
  Dig deep into the Spark and extended-Spark codebase
  Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.
  Surface and share the patterns and idioms of these well-designed, distributed, big data components

Page 5

Recent Events

Cassandra Summit 2015
  Real-time Advanced Analytics w/ Spark & Cassandra

Strata NYC 2015
  Practical Data Science w/ Spark: Recommender Systems

All slides available on Slideshare: http://slideshare.net/cfregly

Page 6

Upcoming Advanced Apache Spark Meetups

Project Tungsten Data Structs/Algos for CPU/Memory Optimization: Nov 12th, 2015
Text-based Advanced Analytics and Machine Learning: Jan 14th, 2016
ElasticSearch-Spark Connector w/ Costin Leau (Elastic.co) & Me: Feb 16th, 2016
Spark Internals Deep Dive: Mar 24th, 2016
Spark SQL Catalyst Optimizer Deep Dive: Apr 21st, 2016

Page 7

Freg-a-palooza Upcoming World Tour

  London Spark Meetup (Oct 12th)
  Scotland Data Science Meetup (Oct 13th)
  Dublin Spark Meetup (Oct 15th)
  Barcelona Spark Meetup (Oct 20th)
  Madrid Spark/Big Data Meetup (Oct 22nd)
  Paris Spark Meetup (Oct 26th)
  Amsterdam Spark Summit (Oct 27th - Oct 29th)
  Delft Dutch Data Science Meetup (Oct 29th)
  Brussels Spark Meetup (Oct 30th)
  Zurich Big Data Developers Meetup (Nov 2nd)

High probability I'll end up in jail or married!

Page 8

Slides and Videos

Slides
  Links posted in Meetup directly

Videos
  Most talks are live streamed and/or video recorded
  Links posted in Meetup directly

All slides available on Slideshare: http://slideshare.net/cfregly

Page 9

Last Meetup (Spark Wins 100 TB Daytona GraySort)

On-disk only, in-memory caching disabled!
sortbenchmark.org/ApacheSpark2014.pdf

Page 10

Spark SQL + DataFrames

Catalyst + Data Sources API

Page 11

Topics of this Talk

  DataFrames
  Catalyst Optimizer and Query Plans
  Data Sources API
  Creating and Contributing Custom Data Source
  Partitions, Pruning, Pushdowns
  Native + Third-Party Data Source Impls
  Spark SQL Performance Tuning

Page 12

DataFrames

Inspired by R and Pandas DataFrames
Cross-language support
  SQL, Python, Scala, Java, R
Levels performance of Python, Scala, Java, and R
  Generates JVM bytecode vs. serializing/pickling objects to Python
DataFrame is a container for a logical plan
  Transformations are lazy and represented as a tree
  Catalyst Optimizer creates the physical plan
  DataFrame.rdd returns the underlying RDD if needed
Custom UDFs using registerFunction()
New, experimental UDAF support

Use DataFrames instead of RDDs!
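The "lazy transformations represented as a tree" idea can be sketched without Spark. The following is a toy model only: `Plan`, `Source`, `Project`, `Filter`, and `ToyFrame` are hypothetical names invented for this illustration and are not Spark classes. It shows that `select` and `filter` merely grow a plan tree, and that data is touched only when an action (`collect`) runs.

```scala
// Toy model only: NOT Spark classes. All names here are hypothetical.
object LazyFrameSketch {
  type Row = Map[String, Any]

  // The logical plan is a tree of case classes.
  sealed trait Plan
  case class Source(rows: Seq[Row]) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan
  case class Filter(predicate: Row => Boolean, child: Plan) extends Plan

  // A frame is only a handle on a plan; transformations return a larger tree.
  case class ToyFrame(plan: Plan) {
    def select(columns: String*): ToyFrame = ToyFrame(Project(columns, plan))
    def filter(predicate: Row => Boolean): ToyFrame = ToyFrame(Filter(predicate, plan))
    // The action: only here is the tree walked and the data read.
    def collect(): Seq[Row] = execute(plan)
  }

  private def execute(plan: Plan): Seq[Row] = plan match {
    case Source(rows) => rows
    case Project(columns, child) =>
      execute(child).map(row => row.filter { case (name, _) => columns.contains(name) })
    case Filter(predicate, child) => execute(child).filter(predicate)
  }

  val genders: ToyFrame = ToyFrame(Source(Seq(
    Map("id" -> 1, "gender" -> "F"),
    Map("id" -> 2, "gender" -> "M"),
    Map("id" -> 3, "gender" -> "U"))))

  // Nothing is executed on this line; it only builds Project(Filter(Source)).
  val unknownIds: ToyFrame = genders.filter(row => row("gender") == "U").select("id")
}
```

Because the whole query exists as a tree before anything runs, an optimizer gets the chance to rewrite it first, which is the role Catalyst plays for real DataFrames.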

Page 13

Catalyst Optimizer

Converts logical plan to physical plan
Manipulates and optimizes the DataFrame transformation tree
  Subquery elimination: use aliases to collapse subqueries
  Constant folding: replace expression with constant
  Simplify filters: remove unnecessary filters
  Predicate/filter pushdowns: avoid unnecessary data load
  Projection collapsing: avoid unnecessary projections
Hooks for custom rules
  Rules = Scala case classes
    val newPlan = MyFilterRule(analyzedPlan)
  Implements o.a.s.sql.catalyst.rules.Rule
  Apply to any plan stage
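A rule is essentially a function from a tree to a rewritten tree. As a minimal sketch of constant folding, the block below uses a hypothetical toy expression tree (`Expr`, `Literal`, `Column`, `Add`, `Multiply`, `ToyRule` are all invented here); it does not use Catalyst's own `Rule` or expression classes, which carry far more machinery.

```scala
// Toy expression tree and rule; hypothetical names, not Catalyst classes.
object ToyCatalystRules {
  sealed trait Expr
  case class Literal(value: Int) extends Expr
  case class Column(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class Multiply(left: Expr, right: Expr) extends Expr

  // A rule maps a tree to an equivalent, hopefully cheaper, tree.
  trait ToyRule { def apply(expr: Expr): Expr }

  // Constant folding: evaluate sub-trees made only of literals, bottom-up.
  object ConstantFolding extends ToyRule {
    def apply(expr: Expr): Expr = expr match {
      case Add(l, r) => (apply(l), apply(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b)
        case (fl, fr)                 => Add(fl, fr)
      }
      case Multiply(l, r) => (apply(l), apply(r)) match {
        case (Literal(a), Literal(b)) => Literal(a * b)
        case (fl, fr)                 => Multiply(fl, fr)
      }
      case leaf => leaf // literals and columns are left untouched
    }
  }
}
```

Applied to `Add(Column("x"), Add(Literal(1), Literal(2)))`, the rule keeps the column reference and collapses the literal sub-tree, mirroring the `MyFilterRule(analyzedPlan)` call style shown on the slide.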

Page 14

Plan Debugging

    gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)

Requires explain(true) to print all four plans:
  DataFrame.queryExecution.logical
  DataFrame.queryExecution.analyzed
  DataFrame.queryExecution.optimizedPlan
  DataFrame.queryExecution.executedPlan

Page 15

Plan Visualization & Join/Aggregation Metrics

New in Spark 1.5!

Effectiveness of filter
Cost-based optimization is applied
Peak memory for joins and aggs
Optimized, CPU-cache-aware binary format minimizes GC and improves join perf (Project Tungsten)

Page 16

Data Sources API

Relations (o.a.s.sql.sources.interfaces.scala)
  BaseRelation (abstract class): Provides schema of data
  TableScan (trait): Read all data from source, construct rows
  PrunedFilteredScan (trait): Read with column pruning & predicate pushdowns
  InsertableRelation (trait): Insert or overwrite data based on SaveMode enum
  RelationProvider (trait/interface): Handles user options, creates BaseRelation

Execution (o.a.s.sql.execution.commands.scala)
  RunnableCommand (trait/interface)
  ExplainCommand (impl: case class)
  CacheTableCommand (impl: case class)

Filters (o.a.s.sql.sources.filters.scala)
  Filter (abstract class for all filter pushdowns for this data source)
  EqualTo (impl)
  GreaterThan (impl)
  StringStartsWith (impl)
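To make the role of the filter classes concrete, here is a rough sketch of how a SQL-backed source might translate pushed-down filters into a WHERE clause. The case classes below are hypothetical stand-ins that only borrow the names on the slide; they are not the `o.a.s.sql.sources` classes, and the quoting is deliberately simplistic (for example, `%` inside a prefix is not escaped).

```scala
// Hypothetical stand-ins mirroring the slide's filter names; not Spark classes.
object ToyFilterCompiler {
  sealed trait Filter
  case class EqualTo(attribute: String, value: Any) extends Filter
  case class GreaterThan(attribute: String, value: Any) extends Filter
  case class StringStartsWith(attribute: String, value: String) extends Filter

  // Quote string values for SQL, doubling any embedded single quotes.
  private def quote(value: Any): String = value match {
    case s: String => "'" + s.replace("'", "''") + "'"
    case other     => other.toString
  }

  // Translate one filter into a SQL predicate fragment.
  def compile(filter: Filter): String = filter match {
    case EqualTo(attr, v)     => attr + " = " + quote(v)
    case GreaterThan(attr, v) => attr + " > " + quote(v)
    case StringStartsWith(attr, prefix) =>
      val pattern = quote(prefix + "%")
      attr + " LIKE " + pattern
  }

  // All pushed-down filters are ANDed together.
  def whereClause(filters: Seq[Filter]): String =
    if (filters.isEmpty) "" else filters.map(compile).mkString("WHERE ", " AND ", "")
}
```

A source that cannot translate a given filter can simply skip it, since Spark can still evaluate the predicate itself after the scan.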

Page 17

Creating a Custom Data Source

Study Existing Native and Third-Party Data Source Impls

Native: JDBC (o.a.s.sql.execution.datasources.jdbc)
    class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation

Third-Party: Cassandra (o.a.s.sql.cassandra)
    class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation
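For comparison with the two relations above, a much smaller read-only source might look like the sketch below. It assumes Spark 1.5-era APIs on the classpath; the package `com.example.simple`, the class `SimpleGendersRelation`, and the `rows` option are all hypothetical and chosen only for illustration. It implements `TableScan` only, so there is no column pruning or filter pushdown.

```scala
package com.example.simple // hypothetical package name

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Spark looks for a class named DefaultSource inside the package given to format(...).
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // "rows" is a made-up user option for this sketch.
    val numRows = parameters.getOrElse("rows", "3").toInt
    new SimpleGendersRelation(numRows)(sqlContext)
  }
}

// Generates (id, gender) rows instead of reading real storage.
class SimpleGendersRelation(numRows: Int)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("gender", StringType, nullable = false)))

  // TableScan: return every row; Spark applies any filters afterwards.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext
      .parallelize(1 to numRows)
      .map(i => Row(i, if (i % 2 == 0) "M" else "F"))
}
```

Under these assumptions the source would be loaded with `sqlContext.read.format("com.example.simple").option("rows", "5").load()`. Moving from `TableScan` to `PrunedFilteredScan` is what lets the source receive the required columns and filters.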

Page 18

Contributing a Custom Data Source

spark-packages.org
  Managed by Databricks
  Contains links to externally-managed github projects
  Ratings and comments
  Spark version requirements of each package

Examples
  https://github.com/databricks/spark-csv
  https://github.com/databricks/spark-avro
  https://github.com/databricks/spark-redshift

Page 19

Partitions, Pruning, Pushdowns

Page 20

Demo Dataset (from previous Spark After Dark talks)

RATINGS
=======
UserID,ProfileID,Rating (1-10)

GENDERS
=======
UserID,Gender (M,F,U)

Totally anonymous

Page 21

Partitions

Partition based on data usage patterns
    /genders.parquet/gender=M/…
                    /gender=F/…   <-- Use case: access users by gender
                    /gender=U/…

Partition Discovery
  On read, infer partitions from organization of data (i.e. gender=F)

Dynamic Partitions
  Upon insert, dynamically create partitions
  Specify field to use for each partition (i.e. gender)
    SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT …
    DF:  gendersDF.write.format("parquet").partitionBy("gender").save(…)
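The directory convention behind both features is simple enough to sketch in plain Scala. The helper names below (`partitionDir`, `discover`) are hypothetical, and this is not Spark's own discovery code; it only shows how a `column=value` path segment encodes a partition on write and can be decoded again on read.

```scala
// Plain-Scala sketch of Hive-style partition paths; hypothetical helpers.
object PartitionPaths {
  // Dynamic partitioning on write: derive the directory from the row's value.
  def partitionDir(base: String, column: String, value: String): String =
    s"$base/$column=$value"

  // Partition discovery on read: keep only the key=value path segments.
  def discover(path: String): Map[String, String] =
    path.split("/").toSeq.flatMap { segment =>
      segment.split("=", 2) match {
        case Array(key, value) if key.nonEmpty => Some(key -> value)
        case _                                 => None
      }
    }.toMap
}
```

For example, `discover` on a file under `/genders.parquet/gender=F/` yields `gender -> F` without opening the file, which is why the partition column does not need to be stored inside the data files.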

Page 22

Pruning

Partition Pruning
  Filter out entire partitions of rows on partitioned data
    SELECT id, gender FROM genders WHERE gender = 'U'

Column Pruning
  Filter out entire columns for all rows if not required
  Extremely useful for columnar storage formats
    Parquet, ORC
    SELECT id, gender FROM genders
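Both kinds of pruning can be illustrated with an in-memory stand-in for a partitioned table. This is a toy model with hypothetical names (`PruningSketch`, `scan`), not how Parquet or ORC readers are implemented; it returns the number of partitions actually read so the effect of partition pruning is visible.

```scala
// Toy model of partition and column pruning; hypothetical names throughout.
object PruningSketch {
  type Row = Map[String, Any]

  // Partition value (the gender=... directory) -> rows stored under it.
  val gendersByPartition: Map[String, Seq[Row]] = Map(
    "M" -> Seq(Map("id" -> 1, "gender" -> "M")),
    "F" -> Seq(Map("id" -> 2, "gender" -> "F")),
    "U" -> Seq(Map("id" -> 3, "gender" -> "U")))

  // Returns (partitions read, resulting rows).
  def scan(
      partitions: Map[String, Seq[Row]],
      wantedPartitions: Set[String],
      wantedColumns: Set[String]): (Int, Seq[Row]) = {
    // Partition pruning: skip whole partitions that cannot match.
    val kept = partitions.filter { case (value, _) => wantedPartitions.contains(value) }
    // Column pruning: keep only the columns the query needs.
    val rows = kept.values.flatten.toSeq.map { row =>
      row.filter { case (name, _) => wantedColumns.contains(name) }
    }
    (kept.size, rows)
  }
}
```

With `wantedPartitions = Set("U")`, only one of the three partitions is read, corresponding to the `WHERE gender = 'U'` query on the slide.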

Page 23

Pushdowns

Predicate (aka Filter) Pushdowns
  Predicate returns {true, false} for a given function/condition
  Filters rows as deep into the data source as possible
  Data Source must implement PrunedFilteredScan
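The benefit of pushdown is that fewer rows ever leave the source. The sketch below models this with a hypothetical `ToySource` whose `buildScan` takes required columns and filters, loosely echoing the shape of `PrunedFilteredScan.buildScan`; the filter classes and the `id`/`rating` column names are stand-ins invented for this example, not Spark classes.

```scala
// Toy pushdown model; the classes are hypothetical stand-ins, not Spark's.
object PushdownSketch {
  sealed trait Filter
  case class EqualTo(attribute: String, value: Any) extends Filter
  case class GreaterThan(attribute: String, value: Int) extends Filter

  type Row = Map[String, Any]

  class ToySource(rows: Seq[Row]) {
    // How many rows crossed the source boundary on the last scan.
    var rowsReturned: Int = 0

    private def matches(row: Row, filter: Filter): Boolean = filter match {
      case EqualTo(attr, v) => row.get(attr).contains(v)
      case GreaterThan(attr, v) => row.get(attr) match {
        case Some(n: Int) => n > v
        case _            => false
      }
    }

    // Filters are evaluated inside the source, before rows are handed back.
    def buildScan(requiredColumns: Seq[String], filters: Seq[Filter]): Seq[Seq[Any]] = {
      val kept = rows.filter(row => filters.forall(f => matches(row, f)))
      rowsReturned = kept.size
      kept.map(row => requiredColumns.map(name => row(name)))
    }
  }
}
```

Calling `buildScan` with an empty filter list returns every row, which is the situation a plain `TableScan` source is always in.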

Page 24

Native Spark SQL Data Sources

Page 25

Spark SQL Native Data Sources - Source Code

Page 26

JSON Data Source

DataFrame
    val ratingsDF = sqlContext.read.format("json")
      .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
  -- or, using the convenience method --
    val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")

SQL
    CREATE TABLE genders USING json
    OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")

Page 27

JDBC Data Source

Add Driver to Spark JVM System Classpath
    $ export SPARK_CLASSPATH=<jdbc-driver.jar>

DataFrame
    val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
                         "url" -> "jdbc:postgresql://hostname:port/database",
                         "dbtable" -> "schema.tablename")
    sqlContext.read.format("jdbc").options(jdbcConfig).load()

SQL
    CREATE TABLE genders USING jdbc
    OPTIONS (url, dbtable, driver, …)
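Since the JDBC URL format is easy to get wrong, one option is to build the options map from parts. This is a small sketch with hypothetical helper names (`JdbcOptions`, `postgresUrl`, `options`); only the three option keys and the PostgreSQL driver class come from the slide.

```scala
// Hypothetical helper for assembling the JDBC data source options.
object JdbcOptions {
  // PostgreSQL URLs take the form jdbc:postgresql://host:port/database
  def postgresUrl(host: String, port: Int, database: String): String =
    s"jdbc:postgresql://$host:$port/$database"

  def options(host: String, port: Int, database: String, table: String): Map[String, String] =
    Map(
      "driver"  -> "org.postgresql.Driver",
      "url"     -> postgresUrl(host, port, database),
      "dbtable" -> table)
}
```

The resulting map can be passed to `sqlContext.read.format("jdbc").options(...)` in place of the literal `jdbcConfig`.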

Page 28

Parquet Data Source

Configuration
    spark.sql.parquet.filterPushdown=true
    spark.sql.parquet.mergeSchema=true
    spark.sql.parquet.cacheMetadata=true
    spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]

DataFrames
    val gendersDF = sqlContext.read.format("parquet")
      .load("file:/root/pipeline/datasets/dating/genders.parquet")
    gendersDF.write.format("parquet").partitionBy("gender")
      .save("file:/root/pipeline/datasets/dating/genders.parquet")

SQL
    CREATE TABLE genders USING parquet
    OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")

Page 29

ORC Data Source

Configuration
    spark.sql.orc.filterPushdown=true

DataFrames
    val gendersDF = sqlContext.read.format("orc")
      .load("file:/root/pipeline/datasets/dating/genders")
    gendersDF.write.format("orc").partitionBy("gender")
      .save("file:/root/pipeline/datasets/dating/genders")

SQL
    CREATE TABLE genders USING orc
    OPTIONS (path "file:/root/pipeline/datasets/dating/genders")

Page 30

Third-Party Data Sources

spark-packages.org

Page 31

CSV Data Source (Databricks)

Github
  https://github.com/databricks/spark-csv
Maven
  com.databricks:spark-csv_2.10:1.2.0
Code
    val gendersCsvDF = sqlContext.read
      .format("com.databricks.spark.csv")
      .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
      .toDF("id", "gender")

toDF() defines column names

Page 32

Avro Data Source (Databricks)

Github
  https://github.com/databricks/spark-avro
Maven
  com.databricks:spark-avro_2.10:2.0.1
Code
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("file:/root/pipeline/datasets/dating/gender.avro")

Page 33

ElasticSearch Data Source (Elastic.co)

Github
  https://github.com/elastic/elasticsearch-hadoop
Maven
  org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code
    val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>")
    df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
      .options(esConfig).save("<index>/<document>")

Page 34

Cassandra Data Source (DataStax)

Github
  https://github.com/datastax/spark-cassandra-connector
Maven
  com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
    ratingsDF.write
      .format("org.apache.spark.sql.cassandra")
      .mode(SaveMode.Append)
      .options(Map("keyspace"->"<keyspace>", "table"->"<table>")).save(…)

Page 35

Cassandra Pushdown Rules

Determines which filter predicates can be pushed down to Cassandra.
  1. Only push down non-partition key column predicates with =, >, <, >=, <= predicate.
  2. Only push down primary key column predicates with = or IN predicate.
  3. If there are regular columns in the pushdown predicates, they should have
     at least one EQ expression on an indexed column and no IN predicates.
  4. All partition column predicates must be included in the predicates to be pushed down,
     only the last part of the partition key can be an IN predicate. For each partition column,
     only one predicate is allowed.
  5. For cluster column predicates, only last predicate can be non-EQ predicate
     including IN predicate, and preceding column predicates must be EQ predicates.
     If there is only one cluster column predicate, the predicates could be any non-IN predicate.
  6. There is no pushdown predicates if there is any OR condition or NOT IN condition.
  7. We're not allowed to push down multiple predicates for the same column if any of them
     is equality or IN predicate.

spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala
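As a rough illustration of how rules 2 and 4 combine for the partition key, the sketch below checks whether a set of predicates may be pushed down. This is one reading of the rules as listed above, written with hypothetical types (`Predicate`, `Eq`, `In`, `Cmp`); it is not the connector's `PredicatePushDown` implementation and ignores the other rules.

```scala
// One reading of rules 2 and 4 for partition key columns; hypothetical types.
object PartitionKeyPushdownSketch {
  sealed trait Predicate { def column: String }
  case class Eq(column: String, value: Any) extends Predicate
  case class In(column: String, values: Seq[Any]) extends Predicate
  case class Cmp(column: String, op: String, value: Any) extends Predicate // >, <, >=, <=

  // Returns the partition key predicates that may be pushed down, or Nil if none may.
  def pushablePartitionPredicates(
      partitionKey: Seq[String],
      predicates: Seq[Predicate]): Seq[Predicate] = {
    val byColumn = predicates.filter(p => partitionKey.contains(p.column)).groupBy(_.column)
    // Every partition key column must be restricted by exactly one predicate.
    val oneEach = partitionKey.forall(c => byColumn.get(c).exists(_.size == 1))
    if (partitionKey.isEmpty || !oneEach) Nil
    else {
      val ordered = partitionKey.map(c => byColumn(c).head)
      // All but the last column need equality; the last may be equality or IN.
      val leadingAreEq = ordered.init.forall(_.isInstanceOf[Eq])
      val lastIsEqOrIn = ordered.last match {
        case _: Eq | _: In => true
        case _             => false
      }
      if (leadingAreEq && lastIsEqOrIn) ordered else Nil
    }
  }
}
```

For a hypothetical partition key `(country, city)`, `country = 'UK' AND city IN (...)` qualifies, while an IN on `country` or a missing `country` predicate does not.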

Page 36

Special Thanks to DataStax!

Russell Spitzer
@RussSpitzer
(He created the following few slides)

(These guys built a lot of the connector.)

Page 37

Spark-Cassandra Architecture

Page 38

Spark-Cassandra Data Locality

Page 39

Spark-Cassandra Node-specific CQL Queries

http://www.slideshare.net/CesareCugnasco/indexing-3dimensional-trajectories-apache-spark-and-cassandra-integration

Page 40

Spark-Cassandra Configuration: input.page.row.size

Page 41

Spark-Cassandra Configuration: grouping.key

Page 42

Spark-Cassandra Configuration: size.rows/bytes

Page 43

Spark-Cassandra Configuration: batch.buffer.size

Page 44

Spark-Cassandra Configuration: concurrent.writes

Page 45

Spark-Cassandra Configuration: throughput_mb/s

Page 46

Spark-Cassandra Optimizations and Next Steps

By-pass CQL front door
Bulk read/write directly to SSTables
Rumored to be in existence
DataStax Enterprise only?
Closed Source Alert!

Page 47

Redshift Data Source (Databricks)

Github
  https://github.com/databricks/spark-redshift
Maven
  com.databricks:spark-redshift:0.5.0
Code
    val df: DataFrame = sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
      .option("query", "select x, count(*) from my_table group by x")
      .option("tempdir", "s3n://tmpdir")
      .load(...)

Copies to S3 for fast, parallel reads vs. single Redshift master bottleneck

Page 48

Cloudant Data Source (IBM)

Github
  http://spark-packages.org/package/cloudant/spark-cloudant
Maven
  com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
    ratingsDF.write.format("com.cloudant.spark")
      .mode(SaveMode.Append)
      .options(Map("cloudant.host"->"<account>.cloudant.com",
                   "cloudant.username"->"<username>",
                   "cloudant.password"->"<password>"))
      .save("<filename>")

Page 49

DB2 and BigSQL Data Sources (IBM)

Coming Soon!

https://github.com/SparkTC/spark-db2
https://github.com/SparkTC/spark-bigsql

Page 50

REST Data Source (Databricks)

Coming Soon!
https://github.com/databricks/spark-rest?

Michael Armbrust
Spark SQL Lead @ Databricks

Page 51

Simple Data Source (Me and You Guys)

Coming Right Now!

Page 52

SparkSQL Performance Tuning (o.a.s.sql.SQLConf)

spark.sql.inMemoryColumnarStorage.compressed=true
  Automatically selects column codec based on data
spark.sql.inMemoryColumnarStorage.batchSize
  Increase as much as possible without OOM; improves compression and GC
spark.sql.inMemoryColumnarStorage.partitionPruning=true
  Enable partition pruning for in-memory partitions
spark.sql.tungsten.enabled=true
  Code gen for CPU and memory optimizations (Tungsten aka Unsafe Mode)
spark.sql.shuffle.partitions
  Increase from default 200 for large joins and aggregations
spark.sql.autoBroadcastJoinThreshold
  Increase to tune this cost-based, physical plan optimization
spark.sql.hive.metastorePartitionPruning
  Predicate pushdown into the metastore to prune partitions early
spark.sql.planner.sortMergeJoin
  Prefer sort-merge (vs. hash join) for large joins
spark.sql.sources.partitionDiscovery.enabled & spark.sql.sources.parallelPartitionDiscovery.threshold
  Enable automatic partition discovery when loading data
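These settings can be applied programmatically. The fragment below is a sketch assuming a Spark 1.5-era `SQLContext` on the classpath; the object name `SqlTuning` is hypothetical and the values are illustrative placeholders, not recommendations.

```scala
import org.apache.spark.sql.SQLContext

// Hypothetical helper; values are placeholders to be tuned per workload.
object SqlTuning {
  val settings: Map[String, String] = Map(
    // More shuffle partitions than the default 200 for large joins/aggregations.
    "spark.sql.shuffle.partitions" -> "400",
    // Broadcast tables up to roughly 50 MB instead of shuffling them.
    "spark.sql.autoBroadcastJoinThreshold" -> (50L * 1024 * 1024).toString,
    // Larger in-memory columnar batches, as far as memory allows.
    "spark.sql.inMemoryColumnarStorage.batchSize" -> "20000",
    "spark.sql.parquet.filterPushdown" -> "true")

  // Apply every setting to the given context.
  def applyTo(sqlContext: SQLContext): Unit =
    settings.foreach { case (key, value) => sqlContext.setConf(key, value) }
}
```

The same keys can also be set with `SET key=value` in SQL or passed as `--conf` options when submitting the application.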

Page 53

Related Links

https://github.com/datastax/spark-cassandra-connector
http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/
https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
https://databricks.com/blog/
https://www.youtube.com/watch?v=uxuLRiNoDio
http://www.slideshare.net/RussellSpitzer


Page 55

http://spark.tc/datapalooza

IBM Spark Tech Center is Hiring! Only Fun, Collaborative People!

Sign up for our newsletter at

Thank You!

Coming to Your City!
