Structured Streams in Spark 2.0 Long Tran
Structured Streams in Spark 2 - files.meetup.comfiles.meetup.com/19158234/Slides Long Version.pdf · Spark 2.0 is the ALPHA RELEASE of Structured Streaming Structured Streaming provides

May 22, 2020

Structured Streams in Spark 2.0

Long Tran

Overview

• RDDs
• RDD / Scala API
• D-Streams (0.7)
• SQL (1.3)
• DataFrames API
• Catalyst
• Tungsten (1.4)
• Structured Streams (finally!) (2.0)
• File API - future APIs
• Exactly Once semantics

https://spark.apache.org/

RDD

Resilient Distributed Dataset

A parallelized, lazily evaluated, directed acyclic graph of computation.
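
A minimal plain-Python sketch of what "lazily evaluated" means for a dataset like this. It is a toy under stated assumptions, not Spark's API: `LazyDataset`, `flat_map`, `count_by_value`, and `split_words` are hypothetical names chosen for illustration. Transformations only record lineage; nothing runs until an action asks for a result.

```python
from collections import Counter
from itertools import chain


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not a Spark class)."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions  # list of lists, one per partition
        self.lineage = lineage        # recorded transformations, in order

    # Transformations: return a new dataset with a longer lineage.
    def map(self, fn):
        return LazyDataset(self.partitions, self.lineage + (("map", fn),))

    def flat_map(self, fn):
        return LazyDataset(self.partitions, self.lineage + (("flat_map", fn),))

    def _compute_partition(self, part):
        # Replay the recorded lineage over a single partition.
        rows = iter(part)
        for kind, fn in self.lineage:
            if kind == "map":
                rows = map(fn, rows)
            else:
                rows = chain.from_iterable(map(fn, rows))
        return list(rows)

    # Action: this is the point where the work actually happens.
    def count_by_value(self):
        total = Counter()
        for part in self.partitions:
            total.update(self._compute_partition(part))
        return dict(total)


calls = []  # records every time the user function really runs


def split_words(line):
    calls.append(line)
    return line.split()


lines = LazyDataset([["a b", "b c"], ["c c"]])
words = lines.flat_map(split_words).map(str.upper)
calls_before_action = len(calls)  # still 0: only lineage was recorded
counts = words.count_by_value()   # the action triggers the computation
```

Because each partition can be recomputed from its lineage alone, a lost partition can be rebuilt without touching the others, which is the "resilient" part of the name.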

https://dzone.com/refcardz/apache-spark

RDD + Scala Word Count

SCALA

SPARK

Spark Streaming

http://spark.apache.org/docs/latest/streaming-programming-guide.html

DStream API

DStream Word Count
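
The code on this slide did not survive extraction. As a rough stand-in, here is a plain-Python sketch of the micro-batch idea behind a DStream word count, assuming timestamped input lines and a fixed batch interval. `micro_batches`, `word_count`, and the sample `events` are hypothetical and do not use Spark's API.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_interval):
    """Group (arrival_time_seconds, line) pairs into fixed-width batches."""
    buckets = defaultdict(list)
    for arrival_time, line in events:
        batch_start = int(arrival_time // batch_interval) * batch_interval
        buckets[batch_start].append(line)
    return sorted(buckets.items())


def word_count(lines):
    """Count words within a single batch only."""
    return dict(Counter(word for line in lines for word in line.split()))


# Hypothetical input: three lines arriving over roughly 1.5 seconds.
events = [(0.2, "to be"), (0.9, "or not"), (1.4, "to be")]

per_batch = [
    (batch_start, word_count(lines))
    for batch_start, lines in micro_batches(events, batch_interval=1)
]
```

Each batch is counted on its own, so the user has to pick the interval and reason about what falls into which batch.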

SQL & DataFrames

API for computing structured data

DataFrames and SQL Word Count

DataFrames

SQL

Catalyst

https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

Catalyst

Catalyst

Tungsten

www.slideshare.net/databricks/2015-0616-spark-summit

Memory Management and Binary Processing

Tungsten

Cache Aware Computation

Spark 2.0 is the ALPHA RELEASE of Structured Streaming

Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

Classic Streaming

Continuous Applications

https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png

Programming Model
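
The stream-as-a-table picture can be sketched in plain Python, assuming each trigger appends new rows to an unbounded input table and the query keeps a result table up to date. `RunningWordCount`, `on_trigger`, and `batch_word_count` are hypothetical names, not Spark's API. The point of the sketch is that the incremental result always equals what a batch query over the whole input table would return.

```python
from collections import Counter


class RunningWordCount:
    """Toy incremental query: updates its result table from new rows only."""

    def __init__(self):
        self.result = Counter()

    def on_trigger(self, new_rows):
        for line in new_rows:
            self.result.update(line.split())
        # Return the whole result table, loosely like a "complete" output.
        return dict(self.result)


def batch_word_count(all_rows):
    """The same query written as a one-shot batch job over the full table."""
    counts = Counter()
    for line in all_rows:
        counts.update(line.split())
    return dict(counts)


input_table = []            # the conceptually unbounded input table
query = RunningWordCount()
matches_batch = []          # did each trigger agree with the batch answer?

# Hypothetical arrivals, one list per trigger (the last trigger is empty).
for arrived in (["cat dog"], ["dog dog", "owl"], []):
    input_table.extend(arrived)
    incremental = query.on_trigger(arrived)
    matches_batch.append(incremental == batch_word_count(input_table))

final_result = dict(query.result)
```

The user writes the query once, as if over a static table, and the engine is responsible for running it incrementally.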

Structured Stream Word Count

end-to-end exactly once guarantees
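
One common recipe for an end-to-end exactly-once guarantee combines a replayable source, a log of planned offsets written before processing, and an idempotent sink. The plain-Python sketch below illustrates that recipe under those assumptions. `ReplayableSource`, `IdempotentSink`, `run_batch`, and the `log` layout are hypothetical and are not Spark's implementation.

```python
class ReplayableSource:
    """Source that can re-read any offset range, like a file or a log."""

    def __init__(self, records):
        self.records = list(records)

    def read(self, start, end):
        return self.records[start:end]


class IdempotentSink:
    """Sink that ignores a batch id it has already stored."""

    def __init__(self):
        self.committed = {}  # batch_id -> rows
        self.attempts = 0

    def write(self, batch_id, rows):
        self.attempts += 1
        if batch_id in self.committed:
            return False     # duplicate delivery after a replay: no effect
        self.committed[batch_id] = list(rows)
        return True


def run_batch(source, sink, log, batch_id, batch_size, crash_after_write=False):
    # 1. Record the planned offsets before doing any work (write-ahead).
    if batch_id not in log["planned"]:
        start = batch_id * batch_size
        log["planned"][batch_id] = (start, start + batch_size)
    start, end = log["planned"][batch_id]
    # 2. Process exactly the planned range, so a replay sees the same input.
    rows = [record.upper() for record in source.read(start, end)]
    sink.write(batch_id, rows)
    if crash_after_write:
        raise RuntimeError("simulated failure before the commit was logged")
    # 3. Mark the batch as done.
    log["committed"].add(batch_id)


source = ReplayableSource(["a", "b", "c", "d"])
sink = IdempotentSink()
log = {"planned": {}, "committed": set()}

run_batch(source, sink, log, 0, 2)
try:
    run_batch(source, sink, log, 1, 2, crash_after_write=True)
except RuntimeError:
    pass

# Recovery: replay every batch that was planned but never committed.
for batch_id in sorted(set(log["planned"]) - log["committed"]):
    run_batch(source, sink, log, batch_id, 2)

output = [row for batch_id in sorted(sink.committed) for row in sink.committed[batch_id]]
```

The sink is written to three times, yet every record appears in the output once: the replay after the failure is absorbed by the batch id check.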

Conclusions

• DataFrame and SQL for streaming
• Catalyst!
• Tungsten!
• Unified API for batch and streaming (+ ML + GraphFrames)
• BIs, DBAs, Data Scientists can now do streaming!
• Exactly once guarantees
• No need to reason about intervals
• Event Time primitives

Future

• Current support for reading file streams only
• Kafka Integration (2.1)
• Public API for sources and sinks
• Watermarks
• ML Integration - continuously updated models

@LooooongTran

Sources

• https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
• https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
• https://www.youtube.com/watch?v=rl8dIzTpxrI
• https://www.youtube.com/watch?v=fn3WeMZZcCk
• https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
• https://www.oreilly.com/learning/apache-spark-2-0--introduction-to-structured-streaming
• https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
• https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
• https://www.youtube.com/watch?v=1a4pgYzeFwE
• https://www.youtube.com/watch?v=5ajs8EIPWGI
• http://www.kdnuggets.com/2016/05/spark-tungsten-burns-brighter.html
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-whole-stage-codegen.html