BUILDING REALTIME DATA PIPELINES WITH KAFKA CONNECT AND SPARK STREAMING Ewen Cheslack-Postava Confluent
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ewen Cheslack-Postava

Apr 21, 2017

Data & Analytics

Spark Summit
Transcript
Page 1: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ewen Cheslack-Postava

BUILDING REALTIME DATA PIPELINES WITH KAFKA CONNECT AND SPARK STREAMING

Ewen Cheslack-Postava, Confluent

About Me: Ewen Cheslack-Postava
• Engineer @ Confluent
• Kafka Committer
• Kafka Connect Lead

Traditional ETL

More Data Systems

Stream Processing

Separation of Concerns

Kafka Connect

Large-scale streaming data import/export for Kafka

Separation of Concerns

Tasks - Parallelism
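
As a rough illustration of task parallelism: a connector divides its input (for example tables or topic partitions) into groups, and each group is handled by one task, with the number of tasks capped by the `tasks.max` setting. The sketch below is plain Python, not the Connect API; `group_partitions` is a hypothetical helper and the table names are made up.

```python
def group_partitions(elements, max_tasks):
    """Split `elements` into at most `max_tasks` contiguous, near-equal groups.

    Hypothetical helper for illustration; not part of Kafka Connect.
    """
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    # Never create more groups than there are elements to assign.
    num_groups = min(max_tasks, len(elements))
    groups = []
    start = 0
    for i in range(num_groups):
        # The first (len % num_groups) groups each take one extra element.
        extra = 1 if i < len(elements) % num_groups else 0
        size = len(elements) // num_groups + extra
        groups.append(elements[start:start + size])
        start += size
    return groups


# Made-up input: five tables shared between at most two tasks.
tables = ["users", "orders", "payments", "clicks", "sessions"]
for task_id, assigned in enumerate(group_partitions(tables, max_tasks=2)):
    print(f"task {task_id}: {assigned}")
```

Raising `max_tasks` beyond the number of inputs does not create idle groups here; the sketch simply returns one group per input.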

Execution Model - Standalone

Execution Model - Distributed

Execution Model - Distributed

Execution Model - Distributed

Data Integration as a Service
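
In distributed mode, connectors are created and managed through the workers' REST API rather than through local files, which is what makes the "as a service" framing possible. A minimal sketch of building such a request body follows. The connector name, file path, and topic are assumptions chosen for illustration, and `connector_request` is a hypothetical helper; the block only builds and prints the JSON and does not contact a cluster.

```python
import json


def connector_request(name, connector_class, tasks_max, settings):
    """Build the JSON body for creating a connector via POST /connectors.

    Hypothetical helper; Connect expects config values as strings.
    """
    config = {"connector.class": connector_class, "tasks.max": str(tasks_max)}
    config.update(settings)
    return {"name": name, "config": config}


# Illustrative values only: a file source feeding a topic named "connect-test".
body = connector_request(
    name="local-file-source",
    connector_class="FileStreamSource",
    tasks_max=1,
    settings={"file": "/tmp/test.txt", "topic": "connect-test"},
)
print(json.dumps(body, indent=2))
```

The printed document could then be submitted to a worker, for example with an HTTP POST to `/connectors` on the worker's REST port (8083 by default).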

Delivery Guarantees
• Automatic offset checkpointing and recovery
– Supports at least once
– Exactly once for connectors that support it (e.g. HDFS)
– At most once simply swaps write & commit
– On restart: task checks offsets & rewinds
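
The point about swapping write and commit can be made concrete with a small simulation. This is a minimal sketch, not Kafka Connect code: `run_sink`, `crash_at`, and the in-memory list standing in for the external system are hypothetical names invented for illustration. It models a sink task that crashes once between its two steps and, on restart, rewinds to the last committed offset.

```python
def run_sink(records, mode, crash_at):
    """Simulate a sink task that crashes once while handling offset `crash_at`.

    mode="at_least_once": write to the sink, then commit the offset.
    mode="at_most_once":  commit the offset, then write to the sink.
    The crash always happens between the two steps.
    """
    sink = []        # what the external system has received
    committed = 0    # durably stored "next offset to read"
    crashed = False

    while committed < len(records):
        offset = committed                  # on (re)start: rewind to committed offset
        record = records[offset]
        must_crash = offset == crash_at and not crashed

        if mode == "at_least_once":
            sink.append(record)             # step 1: write
            if must_crash:
                crashed = True
                continue                    # crash before commit: offset not advanced
            committed = offset + 1          # step 2: commit
        elif mode == "at_most_once":
            committed = offset + 1          # step 1: commit
            if must_crash:
                crashed = True
                continue                    # crash before write: record is skipped
            sink.append(record)             # step 2: write
        else:
            raise ValueError(f"unknown mode: {mode}")
    return sink


records = ["a", "b", "c"]
print(run_sink(records, "at_least_once", crash_at=1))  # ['a', 'b', 'b', 'c']
print(run_sink(records, "at_most_once", crash_at=1))   # ['a', 'c']
```

With write-then-commit the crash yields a duplicate; with commit-then-write it yields a gap. Exactly-once delivery needs cooperation from the destination system and is not modelled here.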

Spark Streaming
• Use Direct Kafka streams (1.3+)
– Better integration, more efficient, better semantics
• Spark Kafka Writer
– At least once
– Kafka community is working on improved producer semantics
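
One way to see why direct streams give better semantics: each batch is defined up front as a set of offset ranges, one per Kafka partition, so a failed batch can be recomputed from exactly the same offsets. The following is a simplified, pure-Python model of that planning step, not Spark code. `OffsetRange` mirrors the idea of Spark's class of the same name, while `plan_batch`, `max_per_partition`, and the topic name are hypothetical and exist only in this sketch.

```python
from typing import NamedTuple


class OffsetRange(NamedTuple):
    topic: str
    partition: int
    from_offset: int    # inclusive
    until_offset: int   # exclusive


def plan_batch(topic, consumed, latest, max_per_partition=None):
    """Return one OffsetRange per partition covering records not yet processed.

    consumed: {partition: next offset to read}
    latest:   {partition: offset one past the last record available}
    Partitions missing from `consumed` start at 0 (a simplification).
    """
    ranges = []
    for partition in sorted(latest):
        start = consumed.get(partition, 0)
        end = latest[partition]
        if max_per_partition is not None:
            # Optional cap on how much one batch reads from a partition.
            end = min(end, start + max_per_partition)
        ranges.append(OffsetRange(topic, partition, start, end))
    return ranges


consumed = {0: 100, 1: 250}
latest = {0: 180, 1: 250}
batch = plan_batch("events", consumed, latest, max_per_partition=50)
for offset_range in batch:
    print(offset_range)

# Offsets advance only once the batch's output has been handled.
consumed = {r.partition: r.until_offset for r in batch}
```

Because a batch is fully described by its ranges, replaying it after a failure reads the same records again rather than whatever happens to arrive next.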

Spark Streaming & Kafka Connect
• Increase # of systems Spark Streaming works with, indirectly
• Reduce friction to adopt Spark Streaming
• Reduce need for Spark-specific connectors
• By leveraging Kafka as de facto streaming data storage

Kafka Connect Summary
• Designed for large scale stream or batch data integration
• Community supported and certified way of using Kafka
• Soon, large repository of open source connectors
• Easy data pipelines when combined with Spark & Spark Streaming

THANK YOU.
Follow me on Twitter: @ewencp
Try it out: http://confluent.io/download
More like this, but in blog form: http://confluent.io/blog
