Data Pipelines with Apache Kafka Ben Stopford @confluentinc

Data Pipelines with Apache Kafka

Apr 16, 2017


Ben Stopford
Transcript
Page 1: Data Pipelines with Apache Kafka

Data Pipelines with Apache Kafka

Ben Stopford @confluentinc

Page 2: Data Pipelines with Apache Kafka

Today

• What is Kafka? (High level fluffy stuff)

• What makes it tick? (Low level geeky stuff)

• How can you use it? (Architect oriented stuff)

Page 3: Data Pipelines with Apache Kafka
Page 4: Data Pipelines with Apache Kafka

What is Kafka?

Page 5: Data Pipelines with Apache Kafka

Kafka: a Streaming Platform

[Diagram: The Log, Connectors, Producer, Consumer, Streaming Engine]

Page 6: Data Pipelines with Apache Kafka

The Log: scalable, fault tolerant, concurrent, strongly ordered, stateful

Page 7: Data Pipelines with Apache Kafka

Clients: JVM & C native implementations; Go, Python and many more (OS)

Page 8: Data Pipelines with Apache Kafka

Connectors: plug into your database of choice

Page 9: Data Pipelines with Apache Kafka

Streaming Engine: the declarative power of a database, wrapped into a Kafka client

Page 10: Data Pipelines with Apache Kafka

Today we’ll focus on Kafka: the distributed log

Page 11: Data Pipelines with Apache Kafka

The log is a type of messaging system

Page 12: Data Pipelines with Apache Kafka

What is messaging in essence?

•  Take a message, keep it safe, make it available to consumers.

•  Track what messages have been consumed

Kafka attacks these problems separately

Page 13: Data Pipelines with Apache Kafka

What is a message broker in essence?

Sender Receiver

Broker (the log)

Page 14: Data Pipelines with Apache Kafka

The log is a simple idea

Messages are added at the end of the log

Just think of the log as a file

Old New
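
The "log as a file" idea can be sketched in a few lines. This is a toy in-memory model, not Kafka's implementation; the class name `Log` and its methods are made up for illustration.

```python
class Log:
    """Toy in-memory log: messages are only ever appended at the end."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Add a message at the end and return its offset (its position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        """Scan sequentially from the given offset to the newest message."""
        return self._messages[offset:]


log = Log()
first = log.append("a")   # oldest message
second = log.append("b")
third = log.append("c")   # newest message
```

Nothing is ever inserted in the middle or updated in place, so an offset handed back by `append` keeps pointing at the same message.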

Page 15: Data Pipelines with Apache Kafka

Consumers have a position

[Diagram: Sally, George and Fred each hold their own position in the log, between Old and New, and each scans forward from it]

Page 16: Data Pipelines with Apache Kafka

Only Sequential Access

Read to offset & scan (from old to new)
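
A minimal sketch of "consumers have a position", assuming the log is just a Python list and each consumer remembers the offset of the next message it wants. The `Consumer` class and `poll` method here are illustrative names, not the Kafka client API.

```python
log = ["m0", "m1", "m2", "m3", "m4"]  # oldest first, newest last


class Consumer:
    def __init__(self, name, position=0):
        self.name = name
        self.position = position  # offset of the next message to read

    def poll(self, log, max_messages=2):
        """Read to the stored offset, scan forward, then advance the position."""
        batch = log[self.position:self.position + max_messages]
        self.position += len(batch)
        return batch


sally = Consumer("Sally", position=0)
fred = Consumer("Fred", position=3)
sally_batch = sally.poll(log)  # Sally scans from the old end
fred_batch = fred.poll(log)    # Fred is further along
```

Each consumer moves independently: reading never removes anything from the log, it only moves that consumer's own position.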

Page 17: Data Pipelines with Apache Kafka

No Random Access

Index

Disk

Kafka avoids Indexes by keeping the approach simple (indexes impede scalability in this context)

Page 18: Data Pipelines with Apache Kafka

Topics are Broadcast

[Diagram: the Broker broadcasts to each Consumer]

Page 19: Data Pipelines with Apache Kafka

Can also behave as a queue

Sender Receiver

Page 20: Data Pipelines with Apache Kafka

The problem:

If you built a messaging system for internet scale,

what would it look like?

Page 21: Data Pipelines with Apache Kafka

Shard data to get scalability

Messages are sent to different partitions

[Diagram: Producer (1), Producer (2) and Producer (3) send to a cluster of machines]

Partitions live on different machines
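
One way to picture sharding: hash the message key and take it modulo the number of partitions. This is a sketch under assumptions; `choose_partition` is a hypothetical helper, and `zlib.crc32` is a stand-in chosen for illustration, not the hash Kafka's clients use.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition; the same key always maps to the same partition."""
    return zlib.crc32(key) % num_partitions


# Three partitions, which in a real cluster would live on different machines.
partitions = {p: [] for p in range(3)}
messages = [(b"alice", "order-1"), (b"bob", "order-2"), (b"alice", "order-3")]
for key, value in messages:
    partitions[choose_partition(key, 3)].append((key, value))
```

Because a key always lands on the same partition, messages for that key stay in order relative to each other even though the topic as a whole is spread across machines.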

Page 22: Data Pipelines with Apache Kafka

Replicate to get fault tolerance

[Diagram: (1) a msg is sent to the leader and replicated across Machine A and Machine B; (2) mastership moves machines]
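
A toy model of replication and failover, assuming two machines and a trivial rule for picking the next leader (the first surviving replica). The `Partition` class is hypothetical; real leader election in Kafka is more involved.

```python
class Partition:
    def __init__(self, replicas):
        self.replicas = list(replicas)   # machines holding a copy
        self.alive = set(replicas)
        self.leader = self.replicas[0]
        self.logs = {r: [] for r in self.replicas}

    def write(self, msg):
        """Append the message to the log of every live replica, leader included."""
        for r in self.replicas:
            if r in self.alive:
                self.logs[r].append(msg)

    def fail(self, machine):
        """Mark a machine as failed; if it was the leader, mastership moves."""
        self.alive.discard(machine)
        if machine == self.leader:
            # Assumes at least one replica is still alive.
            self.leader = next(r for r in self.replicas if r in self.alive)


p = Partition(["A", "B"])
p.write("msg")   # (1) message is replicated to both machines
p.fail("A")      # (2) Machine A fails, Machine B takes over
```

After the failure, Machine B already holds the message, so nothing is lost when it becomes leader.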

Page 23: Data Pipelines with Apache Kafka

Kafka goes a step further

A single topic can be spread over multiple consumers

(4 consuming machines process a single topic)
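
Spreading one topic over several consumers comes down to giving each partition to exactly one consumer. The round-robin rule below is an illustrative assumption (`assign` is a made-up helper); Kafka offers several assignment strategies.

```python
def assign(partitions, consumers):
    """Round-robin: each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


# One topic with 8 partitions, processed by 4 consuming machines.
assignment = assign(list(range(8)), ["c1", "c2", "c3", "c4"])
```

Each machine only reads its own partitions, so adding consumers (up to the number of partitions) spreads the load further.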

Page 24: Data Pipelines with Apache Kafka

Linearly Scalable Architecture

Single topic:

- Many producer machines

- Many consumer machines

- Many Broker machines

No Bottleneck!!

Page 25: Data Pipelines with Apache Kafka

Distributed Commit Log: different to a traditional messaging system

Page 26: Data Pipelines with Apache Kafka

Data is replicated

Page 27: Data Pipelines with Apache Kafka

Strong Consistency

Send Message

3 replicas on different machines

•  Only 1 elected leader

•  Only the leader can be written to or read from

Page 28: Data Pipelines with Apache Kafka

Replication provides resiliency

Another replica takes over on machine failure

Page 29: Data Pipelines with Apache Kafka

Replication Protocol

Send Message

Page 30: Data Pipelines with Apache Kafka

Optimistic Write (single machine delivery)

Send Message

Get ack (optimistic)

Page 31: Data Pipelines with Apache Kafka

Pessimistic Write (wait for replication to complete)

Send Message

Get ack (pessimistic)
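
The difference between the two write modes is when the ack is returned. In Kafka's producer configuration this choice corresponds to the `acks` setting (`acks=1` versus `acks=all`). The toy `Leader` class below is an assumption for illustration only and does no real networking.

```python
class Leader:
    def __init__(self, num_followers):
        self.log = []
        self.followers = [[] for _ in range(num_followers)]

    def replicate(self):
        """Bring every follower up to date with the leader's log."""
        for f in self.followers:
            f.extend(self.log[len(f):])

    def send(self, message, pessimistic):
        self.log.append(message)
        if pessimistic:
            self.replicate()  # ack only once all followers hold the message
        return "ack"          # optimistic: ack as soon as the leader has it


optimistic = Leader(num_followers=2)
optimistic_ack = optimistic.send("m1", pessimistic=False)

pessimistic = Leader(num_followers=2)
pessimistic_ack = pessimistic.send("m1", pessimistic=True)
```

Both calls return an ack, but after the optimistic write the followers are still empty, so a leader failure at that moment could lose the message.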

Page 32: Data Pipelines with Apache Kafka

Replication Protocol Writer

Messages can be read only after replication completes

Reader
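
"Readable only after replication completes" can be pictured as a cut-off point in the leader's log, which Kafka calls the high watermark. The sketch below assumes we know how far each replica has got; the function names are made up.

```python
def high_watermark(replica_end_offsets):
    """Offset up to which every replica has the data: the minimum end offset."""
    return min(replica_end_offsets)


def readable(log, replica_end_offsets):
    """Messages a reader may see: only those below the high watermark."""
    return log[:high_watermark(replica_end_offsets)]


leader_log = ["m0", "m1", "m2", "m3"]
end_offsets = [4, 3, 2]  # leader holds 4 messages, followers hold 3 and 2
visible = readable(leader_log, end_offsets)
```

Here `m2` and `m3` are in the leader's log but not yet on every replica, so readers do not see them.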

Page 33: Data Pipelines with Apache Kafka

Replication Protocol

Number of replicas is a soft quorum (set min/max tolerable values)

Writer

Reader
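
A sketch of the "soft quorum" idea, assuming the minimum tolerable value works like Kafka's `min.insync.replicas` topic setting: a write is refused when too few replicas are in sync. The exception class and `accept_write` function are hypothetical names.

```python
class NotEnoughReplicas(Exception):
    """Raised when fewer replicas are in sync than the configured minimum."""


def accept_write(in_sync_replicas, min_in_sync):
    """Accept a write only if enough replicas are currently in sync."""
    if len(in_sync_replicas) < min_in_sync:
        raise NotEnoughReplicas(
            f"{len(in_sync_replicas)} in sync, need at least {min_in_sync}"
        )
    return True


# 3 replicas configured (the maximum), at least 2 must be in sync (the minimum).
healthy = accept_write({"A", "B", "C"}, min_in_sync=2)
degraded = accept_write({"A", "B"}, min_in_sync=2)  # one machine down, still fine
```

The quorum is "soft" in the sense that the cluster keeps accepting writes while the number of in-sync replicas floats anywhere between the minimum and the maximum.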

Page 34: Data Pipelines with Apache Kafka

Replication is used for resiliency. No need to flush to disk synchronously. You can flush if you wish, but no one does.

Page 35: Data Pipelines with Apache Kafka

Advanced Features

Page 36: Data Pipelines with Apache Kafka

Consumers cluster too!

[Diagram: Consumer Group 1, Consumer Group 1]

Page 37: Data Pipelines with Apache Kafka

Consumers cluster too!

Page 38: Data Pipelines with Apache Kafka

Compacted Topics (Tabular View)

[Diagram: on one side, a log holding all versions of each key (Version 1, Version 2, Version 3, ...); on the other, the compacted view holding the latest version of each key only]
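
Compaction can be sketched as "keep the latest record per key". This is a toy version over a list of `(key, value)` pairs; the `compact` function is an illustration, not how the broker's log cleaner is implemented.

```python
def compact(log):
    """Keep only the latest record for each key, preserving the order of survivors."""
    latest = {}
    for offset, (key, _value) in enumerate(log):
        latest[key] = offset  # later offsets overwrite earlier ones
    return [record for offset, record in enumerate(log) if latest[record[0]] == offset]


log = [
    ("k1", "v1"), ("k2", "v1"), ("k1", "v2"),
    ("k3", "v1"), ("k1", "v3"), ("k2", "v2"),
]
compacted = compact(log)
table = dict(compacted)  # the "tabular view": one row per key
```

The full log answers "what happened?", while the compacted one answers "what is the current value for each key?".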

Page 39: Data Pipelines with Apache Kafka

Multi Tenancy

Users isolated using security features

Bandwidth segregated per user

Page 40: Data Pipelines with Apache Kafka

Use Cases

Page 41: Data Pipelines with Apache Kafka

Microservice Backbone

Page 42: Data Pipelines with Apache Kafka

Always on, Event-Driven Services

The Log (streams & tables)

Ingestion Services

Services with polyglot persistence

Simple Services

Streaming Services

Page 43: Data Pipelines with Apache Kafka

Event Buffer

Page 44: Data Pipelines with Apache Kafka

Many producers, small messages

Kafka

Hadoop etc

Page 45: Data Pipelines with Apache Kafka

Stream Processing for enrichment & transformation

Page 46: Data Pipelines with Apache Kafka

Kafka Streams Example

[Diagram: Kafka holds an Orders topic and a compacted Customer topic. Kafka Streams reads them as an Orders Stream and a Customer Stream and joins them; join, aggregate and intermediary state is stored in Kafka. A Dashboard queries the result.]
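
The join on this slide can be pictured as enriching each order with the latest known customer record. The sketch below is plain Python with made-up function and field names; it is not the Kafka Streams API, which is a Java library.

```python
customers = {}  # table: latest customer record by id (like a compacted topic)


def on_customer(customer_id, record):
    """A newer customer record replaces the older one for the same id."""
    customers[customer_id] = record


def on_order(order):
    """Enrich an order with the current customer record, if one exists."""
    customer = customers.get(order["customer_id"])
    if customer is None:
        return None  # no matching customer yet
    return {**order, "customer_name": customer["name"]}


on_customer("c1", {"name": "Sally"})
on_customer("c1", {"name": "Sally Smith"})  # update to the same key
enriched = on_order({"order_id": 1, "customer_id": "c1", "amount": 10})
missing = on_order({"order_id": 2, "customer_id": "c9", "amount": 5})
```

The customer side behaves like a table (latest value per key) and the order side like a stream (every event matters), which is why the customer topic is the compacted one.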

Page 47: Data Pipelines with Apache Kafka

Stream Data Platform (Kappa Architecture)

Page 48: Data Pipelines with Apache Kafka

All your data

Stream Data platform Views

Client

Client

Kafka

Stream processor

Connectors

Page 49: Data Pipelines with Apache Kafka

Kafka: a Streaming Platform

[Diagram: The Log, Connectors, Producer, Consumer, Streaming Engine]

Page 50: Data Pipelines with Apache Kafka

The end

@benstopford http://benstopford.com