Spark SQL

Spark SQL : Relational Data Processing in Spark · Apache Spark is a general-purpose cluster computing engine with APIs in Scala, Java and Python and libraries for streaming, graph

Oct 16, 2019


dariahiddleston
Transcript
Page 1:

Spark SQL

Page 2:

The 8 fastest-growing tech skills worth over $110,000

No. 1: Spark, up 120%, worth $113,214

Page 3:

Do you know how to write code in Spark?

Page 4:

Can you write SQL?

“SQL is a highly sought-after technical skill due to its ability to work with nearly all databases.”

Ibro Palic, CEO of Resumes Templates

Page 5:

History and Evolution of Big Data Technologies

Procedural Programming Interface

Declarative Queries

Automatic Optimization

Page 6:

So Far…

We have established that we need a platform with automatic optimization

Page 7:

What do users want?

1. ETL from different sources

2. Advanced analytics

Page 8:
Page 9:

Introducing

Spark SQL: Relational Data Processing in Spark

Page 10:

Background

Apache Spark is a general-purpose cluster computing engine with APIs in Scala, Java and Python and libraries for streaming, graph processing and machine learning.

RDDs are fault-tolerant, in that the system can recover lost data using the lineage graph of the RDDs (by rerunning operations such as a filter to rebuild missing partitions). They can also explicitly be cached in memory or on disk to support iteration.

Shark modified the Apache Hive system to run on Spark and implemented traditional RDBMS optimizations, such as columnar processing, over the Spark engine.
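
The lineage-based recovery described on this slide can be illustrated with a small single-process sketch in plain Python. It is not Spark code: `ToyRDD`, `persist`, and `store` are hypothetical names chosen for illustration, and partitions are just Python lists. Each derived dataset remembers how to rebuild a partition from its parent, so a lost partition can be recomputed on demand.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's real class)."""
    num_partitions: int
    # Lineage: how to compute partition i (from source data or from a parent).
    compute: Callable[[int], List[int]]
    persisted: bool = False
    store: Dict[int, List[int]] = field(default_factory=dict)

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        # The child only records "filter the parent's partition i with pred".
        parent = self
        return ToyRDD(
            self.num_partitions,
            lambda i: [x for x in parent.partition(i) if pred(x)],
        )

    def persist(self) -> "ToyRDD":
        # Explicit caching: keep computed partitions around for reuse.
        self.persisted = True
        return self

    def partition(self, i: int) -> List[int]:
        if self.persisted and i in self.store:
            return self.store[i]
        data = self.compute(i)  # rebuild from lineage
        if self.persisted:
            self.store[i] = data
        return data

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source_data = [[1, 2, 3], [4, 5, 6]]  # two input partitions
base = ToyRDD(2, lambda i: list(source_data[i]))
evens = base.filter(lambda x: x % 2 == 0).persist()

first = evens.collect()
del evens.store[1]           # simulate losing cached partition 1
recovered = evens.collect()  # partition 1 is recomputed from its lineage
```

In this sketch only the missing partition is rebuilt; partition 0 is still served from the cache.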

Page 11:

Goals for Spark SQL

Support relational processing both within Spark programs and on external data sources

Provide high performance using established DBMS techniques

Easily support new data sources

Enable extension with advanced analytics algorithms such as graph processing and machine learning

Page 12:

Programming Interface

Page 13:

DataFrame API

A DataFrame is a distributed collection of rows with a homogeneous schema

#A Lazy Computation

Keep Track of Hashtags ##
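
To make the "lazy computation" hashtag concrete, here is a minimal sketch in plain Python. It is not the Spark API: `LazyFrame`, `where`, `select`, and `collect` are hypothetical names that only mimic the shape of DataFrame calls. Each operation appends a step to a recorded plan, and nothing is evaluated until `collect()` is called.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Step = Tuple[str, Callable]


class LazyFrame:
    """Toy stand-in for a DataFrame: records a plan, runs it only on demand."""

    def __init__(self, rows: List[Row], plan: Tuple[Step, ...] = ()):
        self._rows = rows
        self.plan = plan

    def where(self, pred: Callable[[Row], bool]) -> "LazyFrame":
        # No rows are touched here; the filter is only added to the plan.
        return LazyFrame(self._rows, self.plan + (("filter", pred),))

    def select(self, *cols: str) -> "LazyFrame":
        project = lambda r: {c: r[c] for c in cols}
        return LazyFrame(self._rows, self.plan + (("project", project),))

    def collect(self) -> List[Row]:
        # Only now is the recorded plan executed, step by step.
        out = self._rows
        for kind, fn in self.plan:
            if kind == "filter":
                out = [r for r in out if fn(r)]
            else:
                out = [fn(r) for r in out]
        return out


users = LazyFrame([{"name": "Ann", "age": 19}, {"name": "Bob", "age": 30}])
young = users.where(lambda r: r["age"] < 21).select("name")

plan_steps = [kind for kind, _ in young.plan]  # the plan exists before any work
result = young.collect()
```

Because the whole plan is known before execution, an optimizer has the chance to rearrange or combine steps first, which is the motivation for laziness in this setting.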

Page 14:

Data Model and DataFrame Operations

Spark SQL uses a nested data model based on Hive

It supports all major SQL data types, including boolean, integer, double, decimal, string, date and timestamp, as well as user-defined types

Example of DataFrame Operations

Page 15:

DataFrame Operations Cont.

#Access DF with DSL or SQL

Page 16:

Real World Problems

#Heterogeneous Data Sources

Page 17:

Schema Inference

Spark SQL can automatically infer the schema of native objects in an RDD using reflection

Scala/Java: extracted from the language’s type system

Python: inferred by sampling the dataset
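
The sampling approach used for Python can be sketched as follows. This is an illustrative plain-Python function, not Spark SQL's actual inference code: `infer_schema`, the `sample_size` parameter, and the widening rules (int plus float becomes float, any other conflict falls back to str) are assumptions made for the example.

```python
from typing import Dict, Iterable


def infer_schema(records: Iterable[Dict[str, object]],
                 sample_size: int = 100) -> Dict[str, str]:
    """Guess a column -> type-name mapping from the first sample_size records."""
    schema: Dict[str, str] = {}
    for n, record in enumerate(records):
        if n >= sample_size:
            break
        for name, value in record.items():
            if value is None:
                continue  # a missing value carries no type information
            seen = type(value).__name__
            previous = schema.get(name)
            if previous is None or previous == seen:
                schema[name] = seen
            elif {previous, seen} == {"int", "float"}:
                schema[name] = "float"  # assumed rule: widen int to float
            else:
                schema[name] = "str"    # assumed rule: fall back to string
    return schema


sample = [
    {"name": "Ann", "age": 19},
    {"name": "Bob", "age": 30.5},
    {"name": "Cy", "age": None, "active": True},
]
schema = infer_schema(sample)
```

Because only a sample is inspected, a record outside the sample can still disagree with the inferred schema; that trade-off is inherent in sampling-based inference.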

Page 18:

In-Memory Caching

#Invoked with .cache()

Page 19:

User-Defined Functions

How are Spark SQL’s user-defined functions different from those in traditional database systems?

Page 20:

Catalyst Optimizer

Catalyst is based on functional programming constructs in Scala

Purposes

Ability to add new optimization techniques and features to Spark SQL

Ability to extend the optimizer

Page 21:

Catalyst Optimization

#Trees

#Rules
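
The two hashtags on this slide, trees and rules, can be illustrated with a short sketch. Catalyst itself is written in Scala and uses pattern matching; the version below is a plain-Python approximation in which the node classes (`Literal`, `Attribute`, `Add`), the bottom-up `transform` helper, and the `fold_constants` rule are all hypothetical names for illustration. A rule is a function that returns a replacement for a subtree it recognizes, or `None` to leave the node alone.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
Rule = Callable[[Expr], Optional[Expr]]


def transform(tree: Expr, rule: Rule) -> Expr:
    """Apply rule bottom-up; nodes the rule does not match stay unchanged."""
    if isinstance(tree, Add):
        tree = Add(transform(tree.left, rule), transform(tree.right, rule))
    replaced = rule(tree)
    return tree if replaced is None else replaced


def fold_constants(node: Expr) -> Optional[Expr]:
    # Rule: Add(Literal(a), Literal(b)) -> Literal(a + b)
    if (isinstance(node, Add)
            and isinstance(node.left, Literal)
            and isinstance(node.right, Literal)):
        return Literal(node.left.value + node.right.value)
    return None


# The expression x + (1 + 2) as a tree
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = transform(tree, fold_constants)
```

Here `optimized` is the tree for x + 3: the constant subtree is folded while the attribute reference is untouched. Trees are immutable in this sketch, so a rule always produces a new tree and never edits one in place.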

Page 22:

Catalyst Optimization Cont.

Rule Based Optimization

Cost Based Optimization

Page 23:

Query Planning in Spark SQL

Page 24:

Extension Points

#Open Source Projects

Page 25:

Extension Points Cont.

Data Sources

Examples :

CSV

Avro

Parquet

JDBC

Page 26:

Extension Points Cont.

User Defined Types (UDTs)

#Useful for Machine Learning
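
One way to picture a user-defined type is as a mapping between an application class and a structure made only of built-in types, so the engine can store and process it like any other column. The sketch below is plain Python, not the Spark UDT API: `Point`, `PointUDT`, `sql_type`, `serialize`, and `deserialize` are hypothetical names used for illustration. A vector type for machine learning would follow the same idea with a different built-in representation.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Point:
    """Application-level class the user wants to store in a column."""
    x: float
    y: float


class PointUDT:
    """Hypothetical mapping between Point and a struct of two doubles."""

    # Built-in representation: a struct with two double fields.
    sql_type = (("x", "double"), ("y", "double"))

    @staticmethod
    def serialize(p: Point) -> Tuple[float, float]:
        # Convert the user object into values of built-in types.
        return (p.x, p.y)

    @staticmethod
    def deserialize(row: Sequence[float]) -> Point:
        # Rebuild the user object from the stored built-in values.
        return Point(row[0], row[1])


stored = PointUDT.serialize(Point(1.0, 2.5))
restored = PointUDT.deserialize(stored)
```

The round trip returns an equal `Point`, which is the property any such mapping needs to preserve.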

Page 27:

Advanced Analytics Features

1. Schema Inference for Semi-structured Data

2. Query Federation to External Databases

Page 28:

Advanced Analytics Features Cont.

3. Integration with Spark’s Machine Learning Library

Page 29:

Evaluation

SQL Performance

Page 30:

Evaluation Cont.

DataFrames vs. Native Spark Code

Page 31:

Pipeline Performance

Page 32:

Applications

Generalized Online Aggregation

Computational Genomics

The list is endless, limited only by your imagination…

Page 33:

Conclusion

Our Final Hash Tags

#A Platform with

#Automatic optimization

#Complex pipelines that mix relational and complex analytics

#Large-scale data analysis

#Semi-structured data

#Data types for machine learning

#Extensible optimizer called Catalyst

#Easy to add Optimization rules, data sources and data types

Page 34: