Taking Spark Streaming to the Next Level with Datasets and DataFrames


Tathagata “TD” Das (@tathadas)

Strata San Jose 2016

Streaming in Spark

Spark Streaming changed how people write streaming apps

[Diagram: SQL, Streaming, MLlib and GraphX libraries on top of Spark Core]

Functional, concise and expressive

Fault-tolerant state management

Unified stack with batch processing

More than 50% of users consider it the most important part of Spark

Streaming apps are growing more complex

Streaming computations don’t run in isolation

Need to interact with batch data, interactive analysis, machine learning, etc.

Use case: IoT Device Monitoring

[Diagram: IoT devices emit an event stream that feeds the tasks below]

ETL into long term storage
- Prevent data loss
- Prevent duplicates

Status monitoring
- Handle late data
- Process based on event time

Interactively debug issues
- consistency

Anomaly detection
- Learn models offline
- Use online + continuous learning

Pain points with DStreams

1. Processing with event-time, dealing with late data
- DStream API exposes batch time, hard to incorporate event-time

2. Interoperate streaming with batch AND interactive
- RDD/DStream has similar API, but still requires translation
- Hard to interoperate with DataFrames and Datasets

3. Reasoning about end-to-end guarantees
- Requires carefully constructing sinks that handle failures correctly
- Data consistency in the storage while being updated

Structured Streaming

The simplest way to perform streaming analytics is not having to reason about streaming at all

Model

Trigger: every 1 sec

[Diagram: at times 1, 2 and 3 the input table holds data up to 1, 2 and 3; at each trigger the query produces the result for the data seen so far, and part of that result is written as output]

Input: data from source as an append-only table

Trigger: how frequently to check input for new data

Query: operations on input
- usual map/filter/reduce
- new window, session ops

Result: final operated table, updated every trigger interval

Output: what part of result to write to data sink after every trigger

- Complete output: Write full result table every time
- Delta output: Write only the rows that changed in result from previous batch
- Append output: Write only new rows

*Not all output modes are feasible with all queries
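The three output modes can be illustrated with a small plain-Python sketch. This is not Spark code: the result table is modelled as a dict keyed by row key, and the function names are hypothetical, chosen only for this example.

```python
def complete_output(result):
    """Complete output: the full result table, every trigger."""
    return dict(result)


def delta_output(previous, result):
    """Delta output: only rows whose value differs from the previous result."""
    return {key: value for key, value in result.items() if previous.get(key) != value}


def append_output(previous, result):
    """Append output: only rows whose key did not exist in the previous result."""
    return {key: value for key, value in result.items() if key not in previous}


# Result table of a counting query after two consecutive triggers (made-up data).
previous = {"cat": 2, "dog": 1}
result = {"cat": 3, "dog": 1, "cow": 1}

print(complete_output(result))
print(delta_output(previous, result))
print(append_output(previous, result))
```

In this example the append output contains the new "cow" row but not the updated "cat" row, which is one way to read the caveat above: for a query that updates existing rows, append output alone would not reflect the result.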

Dataset, DataFrame: a single API for both
- Static, bounded table
- Streaming, unbounded table

Batch ETL with DataFrame

Read from JSON file:

input = ctxt.read.format("json").load("source-path")

Select some devices:

result = input.select("device", "signal").where("signal > 15")

Write to Parquet file:

result.write.format("parquet").save("dest-path")

Streaming ETL with DataFrame

Read from JSON file stream:

input = ctxt.read.format("json").stream("source-path")

Select some devices:

result = input.select("device", "signal").where("signal > 15")

Write to Parquet file stream:

result.write.format("parquet").outputMode("append").startStream("dest-path")

read…stream() creates a streaming DataFrame, does not start any of the computation

write…startStream() defines where & how to output the data and starts the processing

[Diagram: the result is an append-only table; in append mode, each trigger outputs only the new rows in the result]

Continuous Aggregations

Continuously compute average signal of each type of device

input.groupBy("device-type").avg("signal")

Continuously compute average signal of each type of device in last 10 minutes of event time

input.groupBy(window("event-time", "10min"), "device-type").avg("signal")

- Windowing is just a type of aggregation
- Simple API for event time based windowing
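To make "windowing is just a type of aggregation" concrete, here is a minimal plain-Python sketch, not Spark code. It assumes non-overlapping (tumbling) 10-minute windows and event times given in seconds; the event data and the function name are made up for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 10 * 60  # assumed tumbling 10-minute windows


def windowed_average(events):
    """Average signal per (window start, device type).

    events: iterable of (event_time_seconds, device_type, signal).
    The window is derived from the event time carried by each record,
    not from the order or time of arrival.
    """
    sums = defaultdict(lambda: [0.0, 0])  # group key -> [sum of signals, count]
    for event_time, device_type, signal in events:
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        acc = sums[(window_start, device_type)]
        acc[0] += signal
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


events = [
    (30, "thermostat", 10.0),
    (90, "thermostat", 20.0),
    (620, "thermostat", 40.0),  # belongs to the next window
    (45, "camera", 5.0),
    (15, "thermostat", 30.0),   # arrives late, still grouped by its event time
]
averages = windowed_average(events)
print(averages)
```

The window start is simply one more grouping key next to the device type, so the late record lands in the first window even though it arrives last.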

Query Management

query = result.write.format("parquet").outputMode("append").startStream("dest-path")

query.stop()
query.awaitTermination()
query.exception()

query.sourceStatuses()
query.sinkStatus()

query: a handle to the running streaming computation for managing it

- Stop it, wait for it to terminate
- Get status
- Get error, if terminated

Multiple queries can be active at the same time

Each query has a unique name for keeping track

Logically: DataFrame operations on table (i.e. as easy to understand as batch)

Physically: Spark automatically runs the query in streaming fashion (i.e. incrementally and continuously)

[Diagram: DataFrame -> Logical Plan -> Catalyst optimizer -> Continuous, incremental execution]

Structured Streaming

High-level streaming API built on Spark SQL engine
- Runs the same computation as batch queries in Datasets/DataFrames
- Event time, windowing, sessions, sources & sinks
- End-to-end exactly once semantics

Unifies streaming, interactive and batch queries
- Aggregate data in a stream, then serve using JDBC
- Add, remove, change queries at runtime
- Build and apply ML models

Advantages over DStreams

1. Processing with event-time, dealing with late data

2. Exactly the same API for batch, streaming, and interactive

3. End-to-end exactly-once guarantees from the system

4. Performance through SQL optimizations
- Logical plan optimizations, Tungsten, Codegen, etc.
- Faster state management for stateful stream processing

Underneath the Hood

Batch Execution on Spark SQL

[Diagram: DataFrame/Dataset -> Logical Plan -> Planner -> Execution Plan -> RDD job]

- Logical Plan: abstract representation of query
- Planner: logical plan optimization, execution plan generation
- RDD job: Tungsten, code generation

Continuous Incremental Execution

Planner extended to be aware of the streaming logical plans

Planner generates a continuous series of incremental execution plans, each processing the next chunk of streaming data

[Diagram: DataFrame/Dataset -> Logical Plan -> Planner -> Incremental Execution Plan 1]

Streaming Source

A streaming data source where
- Records can be uniquely identified by an offset
- Arbitrary segment of the stream data can be read based on an offset range
- Files, Kafka, Kinesis supported

More restricted than DStream Receivers

Can ensure end-to-end exactly-once guarantees

[Diagram: Logical Plan containing Sources and Operations]

Streaming Sink

Allows output to be pushed to external storage

Each batch of output should be written to storage atomically and idempotently

Both needed to ensure end-to-end exactly-once guarantees, AND data consistency
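A minimal plain-Python sketch of what "atomically and idempotently" could mean for a sink. The class and method names are hypothetical and an in-memory dict stands in for external storage; the assumption here is that each batch of output carries a batch id that the sink can use to detect replays.

```python
class IdempotentSink:
    """In-memory stand-in for external storage (hypothetical, for illustration)."""

    def __init__(self):
        self.committed = {}  # batch id -> tuple of rows

    def add_batch(self, batch_id, rows):
        # Idempotent: a batch id that is already committed is ignored,
        # so re-sending a batch after a failure does not duplicate rows.
        if batch_id in self.committed:
            return False
        # Atomic in this sketch: the whole batch becomes visible through
        # a single assignment, never row by row.
        self.committed[batch_id] = tuple(rows)
        return True

    def rows(self):
        """All committed rows, in batch order."""
        return [row for batch_id in sorted(self.committed)
                for row in self.committed[batch_id]]


sink = IdempotentSink()
first = sink.add_batch(0, ["a", "b"])
replayed = sink.add_batch(0, ["a", "b"])  # same batch re-sent after a failure
second = sink.add_batch(1, ["c"])
print(sink.rows())
```

Because the replayed batch is dropped, readers of the storage see each row once and never see a half-written batch.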


Incremental Execution

Every trigger interval

- Planner asks source for next chunk of input data

- Generates execution plan with the input

- Generates output data

- Hands it to sink for pushing it out
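The per-trigger steps above can be sketched in plain Python. This is an illustrative model, not Spark's planner: `ListSource`, `run_trigger` and `keep_strong` are hypothetical names, the offset is simply a list index, and the sink is a list that receives one output per trigger.

```python
class ListSource:
    """Hypothetical source whose records are addressed by list index (the offset)."""

    def __init__(self, records):
        self.records = records

    def latest_offset(self):
        return len(self.records)

    def get_batch(self, start, end):
        return self.records[start:end]


def keep_strong(chunk):
    """The 'query': keep signals above 15, mirroring the ETL example."""
    return [signal for signal in chunk if signal > 15]


def run_trigger(source, query, sink, start_offset):
    """One trigger: ask for the next chunk, run the query on it, hand output to the sink."""
    end_offset = source.latest_offset()
    if end_offset == start_offset:
        return start_offset  # no new data this trigger
    chunk = source.get_batch(start_offset, end_offset)
    sink.append(query(chunk))
    return end_offset


source = ListSource([12, 18, 7])
sink = []
offset = run_trigger(source, keep_strong, sink, 0)
source.records.extend([30, 3])                           # new data arrives
offset = run_trigger(source, keep_strong, sink, offset)
offset = run_trigger(source, keep_strong, sink, offset)  # nothing new
print(sink, offset)
```

Each trigger only touches the records between the previous and the latest offset, which is the sense in which execution is incremental.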

[Diagram: the Planner gets new data from the Sources as input, the Incremental Execution Plan generates the output, and the output is pushed out]

Offset Tracking with WAL

Planner saves the next offset range in a write-ahead log (WAL) on HDFS/S3 before incremental execution starts

[Diagram: the Planner saves the in-progress offset range to the Offset WAL, next to the already processed offsets, before the Incremental Execution Plan starts processing]

Recovery from WAL

[Diagram: the restarted Planner reads the processed and in-progress offsets from the Offset WAL, recovers the last offset range and restarts the computation]

After failure, restarted planner recovers last offset range from WAL and restarts failed execution

Output is exactly the same as it would have been without failure
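The offset WAL and its recovery path can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's implementation: the WAL is a local JSON-lines file, offsets are list indices, the sink is a dict keyed by batch id, and a logged batch is treated as "in progress" when its id is missing from the sink. All names are hypothetical.

```python
import json
import os
import tempfile


def read_wal(path):
    """Return the logged offset ranges, oldest first."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f]


def trigger(records, wal_path, sink, crash=False):
    """Run one trigger, logging the offset range before processing it."""
    wal = read_wal(wal_path)
    if wal and (len(wal) - 1) not in sink:
        # Recovery: the last logged range never reached the sink, so rerun
        # exactly that range rather than planning a new one.
        batch_id = len(wal) - 1
        start, end = wal[batch_id]
    else:
        batch_id = len(wal)
        start = wal[-1][1] if wal else 0
        end = len(records)
        if start == end:
            return  # no new data
        with open(wal_path, "a") as f:  # write-ahead: log first, then process
            f.write(json.dumps([start, end]) + "\n")
    if crash:
        raise RuntimeError("simulated failure before the batch completes")
    sink[batch_id] = records[start:end]  # keyed by batch id, so replays overwrite


wal_path = os.path.join(tempfile.mkdtemp(), "offsets.wal")
records, sink = ["a", "b"], {}

trigger(records, wal_path, sink)                   # batch 0 covers offsets [0, 2)
records.append("c")
try:
    trigger(records, wal_path, sink, crash=True)   # batch 1 is logged, then fails
except RuntimeError:
    pass
records.append("d")                                # more data arrives while down
trigger(records, wal_path, sink)                   # restart reruns the logged [2, 3)
trigger(records, wal_path, sink)                   # batch 2 covers [3, 4)
print(sink)
```

Although "d" arrived before the restart, the rerun of batch 1 still covers only the logged range, so the batches match what a failure-free run would have produced.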

Streaming source + Streaming sink + Offset tracking = End-to-end exactly-once guarantees

Stateful Stream Processing

Streaming aggregations require maintaining intermediate "state" data across batches

[Diagram: the batch "Cat, Dog, Cat" produces the state data Cat 2, Dog 1; the next batch "Cat, Cow" is combined with that state to compute the aggregates Cat 3, Dog 1, Cow 1]

State Management

State needs to be fault-tolerant

Both input + previous state are needed to recover from failure

[Diagram: input batches "Cat, Dog, Cat", "Cat, Cow" and "Dog"; the aggregates are Cat 2, Dog 1 after the first batch and Cat 3, Dog 1, Cow 1 after the second]
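The Cat/Dog/Cow counts from the slide can be reproduced with a short plain-Python sketch. `update_state` is a hypothetical helper, not a Spark API; it shows why each batch's aggregates depend on both the new input and the state left by the previous batch.

```python
from collections import Counter


def update_state(previous_state, batch):
    """Return the new running counts given the previous state and one input batch."""
    new_state = Counter(previous_state)  # copy, so the previous state stays intact
    new_state.update(batch)
    return new_state


state_0 = Counter()
state_1 = update_state(state_0, ["Cat", "Dog", "Cat"])
state_2 = update_state(state_1, ["Cat", "Cow"])
print(dict(state_1), dict(state_2))

# If the second batch fails, recomputing it needs both the batch
# ["Cat", "Cow"] and state_1; the input alone gives a different answer.
recovered = update_state(state_1, ["Cat", "Cow"])
```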

Old State Management with DStreams

DStreams (i.e. updateStateByKey, mapWithState) represented state data as RDDs

Leveraged RDD lineage for fault-tolerance and RDD checkpointing to prevent unbounded lineage

RDD checkpointing saves all state data to HDFS/S3
- Inefficient when update rates are low
- No "incremental" checkpointing

New State Management: State Store

State Store API: any versioned key-value store that

- Allows a set of key-value updates to be transactionally committed with a version number (i.e. incremental checkpointing)

- Allows a specific version of the key-value data to be retrieved
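A minimal in-memory sketch of a store with these two properties, in plain Python. The class and method names are hypothetical and do not reflect Spark's actual State Store interface; the sketch assumes versions start at 1, are committed in order, and that each commit records only the keys that changed.

```python
class VersionedStateStore:
    """Toy versioned key-value store that keeps one delta per committed version."""

    def __init__(self):
        self.deltas = {}  # version -> dict of key-value updates committed at that version

    def commit(self, version, updates):
        """Commit a set of updates as the next version, all at once."""
        latest = max(self.deltas, default=0)
        if version != latest + 1:
            raise ValueError(f"expected version {latest + 1}, got {version}")
        # Only the changed keys are stored: an incremental checkpoint.
        self.deltas[version] = dict(updates)

    def get(self, version):
        """Rebuild the key-value data as of `version` by replaying deltas in order."""
        state = {}
        for v in range(1, version + 1):
            state.update(self.deltas[v])
        return state


store = VersionedStateStore()
store.commit(1, {"Cat": 2, "Dog": 1})
store.commit(2, {"Cat": 3, "Cow": 1})  # "Dog" is unchanged, so it is not rewritten
print(store.get(1), store.get(2))
```

Version 2 stores two entries rather than the full three-key state, which is the difference from saving all state data at every checkpoint.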

[Diagram: Incremental Execution 1 running on the Spark cluster, with state store v1 and state store v2]

HDFS-backed State Store

Implementation of State Store API in Spark 2.0: in-memory hashmap backed by files in HDFS

- Each executor has hashmap(s) containing versioned data

- Updates committed as delta files in HDFS/S3

- Delta files periodically collapsed into snapshots to improve recovery

[Diagram: Driver and Executors; each Executor holds a hashmap backed by delta files in HDFS]

Fault recovery of HDFS State Store

State Store version also stored in the WAL

On fault recovery,
- Recover input data from source using last offsets
- Recover state data using last store version

[Diagram: the in-progress WAL entry is used to recover input from the sources and state from the store files before the Incremental Execution reruns]

Fast, fault-tolerant, exactly-once stateful stream processing without having to reason about streaming

Spark 2.0

Basic infrastructure and API
- Event time, windows, aggregations

Files as sources and sinks
- Kafka as source, SQL as sink (?)

Spark 2.1+

More support for late data

Dynamic scaling

Source and sinks public API
- Extends DataSource API

More sources and sinks

ML integrations

Spark XXX

"True streaming" engine
- API already agnostic to micro-batch

Simple and fast real-time analytics
• Develop Productively
• Execute Efficiently
• Update Automatically

Questions?

Learn More: Today, 1:50-2:30, AMA

Follow me @tathadas
