
Anatomy of Spark SQL Catalyst - Part 2

Jan 08, 2017

Transcript
Page 1: Anatomy of Spark SQL Catalyst - Part 2

Anatomy of Spark Catalyst - Part 2

Journey from DataFrame to RDD

https://github.com/phatak-dev/anatomy-of-spark-catalyst

Page 2: Anatomy of Spark SQL Catalyst - Part 2

● Madhukara Phatak

● Technical Lead at Tellius

● Consultant and Trainer at datamantra.io

● Consults in Hadoop, Spark and Scala

● www.madhukaraphatak.com

Page 3: Anatomy of Spark SQL Catalyst - Part 2

Agenda
● Recap of Part 1
● Query Plan
● Logical Plans
● Analysis
● Optimization
● Spark Plan
● Spark Strategies
● Custom strategies

Page 4: Anatomy of Spark SQL Catalyst - Part 2

Code organization of Spark SQL
● The Spark SQL code is organized into the below projects
  ○ Catalyst
  ○ Core
  ○ Hive
  ○ Hive-Thrift server
  ○ unsafe (external to Spark SQL)
● We are focusing on the Catalyst code in this session
● Core depends on Catalyst
● Catalyst depends on unsafe for the Tungsten code

Page 5: Anatomy of Spark SQL Catalyst - Part 2

Introduction to Catalyst
● An implementation-agnostic framework for manipulating trees of relational operators and expressions
● It defines all the expression, logical plan and optimization APIs for Spark SQL
● The Catalyst API is independent of RDD evaluation, so Catalyst operators and expressions can be evaluated without the RDD abstraction
● Introduced in Spark version 1.3

Page 6: Anatomy of Spark SQL Catalyst - Part 2

Catalyst in Spark SQL

[Diagram: HiveQL goes through the Hive parser to produce Hive queries, Spark SQL goes through the Spark SQL parser to produce Spark SQL queries, and the DataFrame and Dataset DSLs produce DataFrames. All of these feed into Catalyst, which produces Spark RDD code.]

Page 7: Anatomy of Spark SQL Catalyst - Part 2

Recap of Part 1
● Trees
● Expressions
● Data types
● Row and InternalRow
● Eval
● CodeGen
● Janino

Page 8: Anatomy of Spark SQL Catalyst - Part 2

DataFrame Internals
● Each DataFrame is internally represented as a logical plan in Spark
● These logical plans are converted into physical plans that execute on the RDD abstraction
● The building blocks of plans, such as expressions and operators, come from the Catalyst library
● We need to understand the internals of Catalyst in order to understand how a given query is formed and executed
● Ex: DFExample.scala (see the sketch below)
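As a rough sketch of the kind of inspection DFExample.scala in the linked repo performs (the SparkSession setup and sample data here are assumptions, not the talk's actual code), the plans behind a DataFrame are exposed through its queryExecution:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("df-internals").getOrCreate()
import spark.implicits._

// A tiny DataFrame with a filter on top, just to get a non-trivial plan
val df = Seq((1, "a"), (2, "b")).toDF("id", "name").filter($"id" === 1)

// Each stage of the journey from DataFrame to RDD is visible on queryExecution
println(df.queryExecution.logical)        // parsed logical plan
println(df.queryExecution.analyzed)       // resolved logical plan
println(df.queryExecution.optimizedPlan)  // after Catalyst optimizations
println(df.queryExecution.sparkPlan)      // physical (Spark) plan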

Page 9: Anatomy of Spark SQL Catalyst - Part 2

Query Plan API
● Root of all plans
● Plan types - LogicalPlan and SparkPlan
● outputSet signifies the attributes output by this plan
● inputSet signifies the attributes input to this plan
● schema signifies the StructType associated with the output of the plan
● Provides special functions like transformExpressions and transformExpressionsUp to manipulate the expressions in a plan (see the sketch below)
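Continuing the sketch above with the same assumed df, the QueryPlan-level properties listed here can be read off any plan node:

// Query Plan API surface, shown on the analyzed logical plan
val analyzed = df.queryExecution.analyzed

println(analyzed.output)     // attributes produced by this plan
println(analyzed.outputSet)  // the same attributes as an AttributeSet
println(analyzed.inputSet)   // attributes coming in from the child plans
println(analyzed.schema)     // StructType describing the plan's output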

Page 10: Anatomy of Spark SQL Catalyst - Part 2

Logical Plan

Page 11: Anatomy of Spark SQL Catalyst - Part 2

Logical Plan
● A type of query plan which focuses on building plans of Catalyst operators and expressions
● Independent of the RDD abstraction
● Focuses on analysis of the plan for correctness
● Also responsible for resolving the attributes before they are evaluated
● The three default types of logical plans are
  ○ LeafNode, UnaryNode and BinaryNode
● Ex: LogicalPlanExample (see the sketch below)
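A minimal sketch of building a logical plan directly with the Catalyst DSL, roughly in the spirit of LogicalPlanExample (the relation and column names are invented for illustration, and the DSL calls assume the Catalyst test DSL as found in Spark 1.6/2.x):

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

// A leaf node (LocalRelation) with two attributes; no RDDs are involved
val relation = LocalRelation('id.int, 'name.string)

// Filter and Project operators stacked on top of the leaf node
val plan = relation.where('id === 1).select('name)

println(plan.treeString)   // Project -> Filter -> LocalRelation
println(plan.resolved)     // false: 'id and 'name stay unresolved until analysis runs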

Page 12: Anatomy of Spark SQL Catalyst - Part 2

Tree manipulation of Logical Plan
● As we did earlier with expression trees, we can manipulate logical plans using the tree APIs
● All these manipulations are represented as a Rule
● A rule takes a plan and gives you a new plan
● Rather than transform and transformUp, we will be using transformExpressions and transformExpressionsUp for manipulating these trees
● Ex: FilterLogicalPlanManipulation (see the sketch below)
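A hedged sketch of such a rule (reusing the assumed relation from the previous sketch; FilterLogicalPlanManipulation in the repo is the real example). It rewrites an equality against a literal into Literal(true), mirroring the picture on the next slide:

import org.apache.spark.sql.catalyst.expressions.{EqualTo, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A Filter node directly over the relation, as in the diagram
val filterPlan = relation.where('id === 1)

// Illustrative rule: replace any equality-with-a-literal by Literal(true)
object ReplaceEqualityWithTrue extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    plan transformExpressionsUp {
      case EqualTo(_, _: Literal) => Literal(true)
    }
}

val rewritten = ReplaceEqualityWithTrue(filterPlan)
println(rewritten.treeString)   // the Filter condition is now just true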

Page 13: Anatomy of Spark SQL Catalyst - Part 2

Understanding Plan Manipulation

[Diagram: a Filter operator with the condition Equals(AttributeRef(id), Literal(true)) sitting on top of a relation (LR); applying transformExpressionsUp rewrites the condition, leaving Filter(Literal(true)) over the same relation.]

Page 14: Anatomy of Spark SQL Catalyst - Part 2

Analysis of Logical Plan

Page 15: Anatomy of Spark SQL Catalyst - Part 2

Analysing
● Analysis of a logical plan is a step that includes
  ○ Resolving relations (Spark SQL)
  ○ Resolving attributes
  ○ Resolving functions
  ○ Analysing the structure for correctness
● Analysis makes sure all information is extracted before a logical plan can be executed
● Analyzer is the interface that implements the analysis
● Ex: AnalysisExample

Page 16: Anatomy of Spark SQL Catalyst - Part 2

Analysis in SQL
● Whenever we use the SQL API for manipulating DataFrames, we work with UnresolvedRelation
● UnresolvedRelation is a logical relation which needs to be resolved from the catalog
● The catalog is a dictionary of all registered tables
● Part of the analysis is to resolve these unresolved relations and provide the appropriate relation types
● Ex: UnResolvedRelationExample (see the sketch below)
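A rough Spark 2.x-style sketch of what UnResolvedRelationExample shows (the temp view name and the earlier assumed df are illustrative):

// Registering a temp view adds an entry to the catalog
df.createOrReplaceTempView("people")

val sqlDf = spark.sql("select name from people where id = 1")

// The parsed plan still refers to the table through an UnresolvedRelation;
// analysis looks it up in the catalog and substitutes the actual relation
println(sqlDf.queryExecution.logical)
println(sqlDf.queryExecution.analyzed)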

Page 17: Anatomy of Spark SQL Catalyst - Part 2

Logical Plan Optimization

Page 18: Anatomy of Spark SQL Catalyst - Part 2

Optimizer
● One of the important parts of Spark Catalyst is implementing the optimizations on logical plans
● All these optimizations are represented using Rules which transform the logical plans
● All the optimization code resides in the Optimizer.scala file
● In our example, we see how filter push-down works for a logical plan
● For more information on optimization, refer to the anatomy of DataFrame talk [1] from the references
● Ex: PushPredicateExample (see the sketch below)
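A small sketch of observing push-down in the spirit of PushPredicateExample (the column expressions are assumptions; compare the analyzed and optimized plans of the same assumed df used earlier):

// A filter applied after a projection: the optimizer pushes it below the Project
val projected = df.select(($"id" + 1).as("idPlusOne"), $"name")
val filtered  = projected.filter($"name" === "a")

println(filtered.queryExecution.analyzed)       // Filter sits above the Project
println(filtered.queryExecution.optimizedPlan)  // Filter pushed towards the relation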

Page 19: Anatomy of Spark SQL Catalyst - Part 2

Providing custom optimizations
● Till Spark 2.0, we needed to change the Spark source code to change the optimizations
● As Dataset becomes the core abstraction in 2.0, the ability to tweak Catalyst optimizations becomes important
● So from Spark 2.0, Spark exposes the ability to add user-defined rules at run time, which makes the Spark optimizer more configurable (see the sketch below)
● For more information about defining and adding custom rules, refer to the Spark 2.0 talk [2] from the references
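A minimal sketch of registering a user-defined rule at run time (the rule itself is invented for illustration; experimental.extraOptimizations is the Spark 2.0 hook being referred to):

import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Invented rule: rewrite "x * 1" into just "x" anywhere in the plan
object RemoveMultiplyByOne extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    plan transformAllExpressions {
      case Multiply(expr, Literal(1, _)) => expr
    }
}

// From Spark 2.0 onwards the rule can be injected without touching Spark source
spark.experimental.extraOptimizations = Seq(RemoveMultiplyByOne)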

Page 20: Anatomy of Spark SQL Catalyst - Part 2

Spark Plan

Page 21: Anatomy of Spark SQL Catalyst - Part 2

SparkPlan
● The physical plan of Spark SQL, which lives in the core package
● Defines two abstract methods
  ○ doPrepare
  ○ doExecute
● Specifies helper collect methods like
  ○ executeCollect, executeTake
● Three node types: LeafNode, UnaryNode, BinaryNode
● org.apache.spark.sql.execution.SparkPlan

Page 22: Anatomy of Spark SQL Catalyst - Part 2

Logical Plan to Spark Plan
● Let's look at converting our logical plans to Spark plans
● On sqlContext, there is a SparkPlanner which helps us do the conversion
● A single logical plan can result in multiple physical plans
● Every physical plan has an execute method which consumes RDD[InternalRow] and produces RDD[InternalRow]
● Ex: LogicalToPhysicalExample (see the sketch below)
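A brief sketch of the last hop from plan to RDD (reusing the assumed df from earlier; LogicalToPhysicalExample in the repo covers the same ground):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

// The planner's chosen physical plan is exposed on queryExecution;
// calling execute() on it yields the underlying RDD[InternalRow]
val physicalPlan = df.queryExecution.executedPlan
val internalRows: RDD[InternalRow] = physicalPlan.execute()

println(internalRows.count())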

Page 23: Anatomy of Spark SQL Catalyst - Part 2

QueryPlanner
● Interface for converting logical plans to physical plans
● Holds the list of strategies applied for the conversion
● Each strategy has a plan method which is chained, like the rules on the logical plan side
● QueryPlanner works on plans that are TreeNodes, which support all the tree traversals
● PlanLater is a strategy which gives a lazy effect
● org.apache.spark.sql.catalyst.planning.QueryPlanner

Page 24: Anatomy of Spark SQL Catalyst - Part 2

Spark Strategies
● A set of strategies which implement the query planner to turn logical plans into Spark plans
● The different strategies include
  ○ BasicOperators
  ○ Aggregation
  ○ DefaultJoin
● These strategies are executed in sequence to generate the final result
● org.apache.spark.sql.core.phyzicalplan.SparkStrategyExample

Page 25: Anatomy of Spark SQL Catalyst - Part 2

SparkPlanner
● The user-facing API for converting logical plans to Spark plans
● It lists all the strategies to execute on a given logical plan
● Calling the plan method generates all the physical plans using the above strategies
● These physical plans can be executed using the execute method on a physical plan
● org.apache.spark.sql.execution.SparkPlanner

Page 26: Anatomy of Spark SQL Catalyst - Part 2

Understanding Filter Strategy
● All the code for the filter strategy lives in basicOperators.scala
● It uses mapPartitionsInternal for filtering data over RDD[InternalRow]
● A comparison expression is converted to a predicate using the newPredicate method, which uses code generation
● Once we have the predicate, we can use Scala's filter to filter the data from the RDD
● The filtered RDD[InternalRow] is returned from the strategy

Page 27: Anatomy of Spark SQL Catalyst - Part 2

Custom strategy
● Just as we can write custom rules for logical optimization, we can also add custom strategies
● Many connectors, like MemSQL and MongoDB, add custom strategies to optimize reads, filters etc.
● Developers can add custom strategies using sqlContext.experimental.extraStrategies (see the sketch below)
● You can look at a simple custom strategy from MemSQL at the below link
● http://bit.ly/2bwnUxF
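A minimal sketch of the shape of a custom strategy and how it is registered (the strategy below is a deliberately trivial placeholder, not the MemSQL code from the link; the SparkSession-based registration is the 2.x equivalent of the sqlContext call above):

import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// A real strategy would match specific operators (for example a Filter over a
// custom relation) and return specialised SparkPlan nodes; returning Nil tells
// the planner to fall through to the built-in strategies
object MyCustomStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case _ => Nil
  }
}

// Registered at run time, alongside the built-in strategies
spark.experimental.extraStrategies = Seq(MyCustomStrategy)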