Melanga Dissanayake SEPTEMBER 17, 2015 Apache Kafka
Kafka - szjug.github.io

Aug 31, 2019

Page 1

Melanga Dissanayake

SEPTEMBER 17, 2015

Apache Kafka

Page 2

About Me

Melanga Dissanayake

Senior Software Engineer - EPAM Systems (Shenzhen)

• Over 10 years of enterprise application development experience

• Focused on financial domain

Page 3

Big data is like teenage sex: everyone talks about it, nobody knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it… (Dan Ariely)

Page 4

What is Big Data?

Page 5

What is Big Data?

3Vs

• Volume: petabytes, records, transactions, tables, files

• Velocity: batch, near time, real time, streams

• Variety: structured, unstructured, semi-structured, all the above

Page 6

What is Big Data?

• Traditional Queue

• Website Activity Tracking

• Metrics - Operational data

• Log Aggregation

• Stream Processing

• Event Sourcing

• Commit Log

Page 7

Where do we put Big Data?

• Traditional Database?

• Flat files?

Page 8

HDFS

• HDFS - Hadoop Distributed File System

• Highly fault-tolerant

• Java-based file system

• Runs on commodity hardware

• Concurrent data access (Coordinated by YARN)

Page 9

How do we load this Big Data?

• CSV file dump

• ETL

• Other messaging systems (ActiveMQ, RabbitMQ, etc.)

Page 10

Data pipelines

Data pipeline starts like this

[Diagram: one client and one backend. Source: Cloudera]

Page 11

Data pipelines

Then we reuse them

[Diagram: four clients and one backend. Source: Cloudera]

Page 12

Data pipelines

Then we add more backends

[Diagram: four clients and two backends. Source: Cloudera]

Page 13

Data pipelines

Then it starts to look like this

[Diagram: four clients and four backends. Source: Cloudera]

Page 14

Data pipelines

with maybe some of this

[Diagram: four clients and four backends. Source: Cloudera]

Page 15

Data pipelines

ended up having

Source: LinkedIn

Page 16

What is Apache Kafka

Apache Kafka is an open-source message broker rethought as a distributed commit log.

Developed in Scala and heavily influenced by transaction logs.

Page 17

Little bit of history

Apache Kafka was initially developed by LinkedIn to pipeline data across various internal systems.

Developed as an internal project, it was released under an open-source license in early 2011.

Graduated from the Apache Incubator in October 2012.

Page 18

Kafka in a nutshell

[Diagram: three producers, a Kafka cluster, three consumers]

Page 19

Kafka Architecture

[Diagram: two producers, four brokers, ZooKeeper, two consumers]

Page 20

Kafka Architecture

• Communication between all nodes is based on a simple, high-performance binary API over TCP

• Runs as a cluster of brokers, that is, one or more servers

• High-performance low-level APIs for producer/consumer

• REST API via Kafka REST Proxy

Page 21

Topics & Partitions

• A topic is a category or feed name to which messages are published.

• Each topic is split into a partitioned log where messages are kept.

• Partitions are replicated and distributed across the Kafka cluster for high availability and fault tolerance.
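The topic and partition idea can be sketched in plain Python, without a real Kafka client. The `Topic` class, its `publish` method and the topic name are hypothetical, chosen only for illustration: a topic is modelled as a named set of append-only partition logs, and publishing returns the offset at which the message landed. Replication across brokers is not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Toy model of a topic: a name plus a fixed number of append-only logs."""
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        # One independent, ordered log per partition.
        self.partitions = [[] for _ in range(self.num_partitions)]

    def publish(self, partition: int, message: str) -> int:
        """Append a message to one partition and return its offset in that log."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1


topic = Topic("page-views", num_partitions=3)
first = topic.publish(0, "user-1 viewed /home")
second = topic.publish(0, "user-1 viewed /cart")
other = topic.publish(2, "user-7 viewed /home")
```

Offsets are counted per partition, so `first` and `other` can share the same offset value while living in different logs.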

Page 22

Partition

[Diagram: a partition log with offsets 10001 to 10010; a producer sends messages that are written at the end of the log; three consumers read from the log]

• Multiple consumers can read from the same topic at their own pace

• Messages are kept in the log for a predefined period of time

• Consumers maintain their own message offsets
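A minimal sketch of consumers reading one partition at their own pace, assuming a toy in-memory log rather than a real broker. The `PartitionLog` and `Consumer` classes are hypothetical; the base offset 10001 simply mirrors the slide. Time-based retention is not modelled.

```python
class PartitionLog:
    """Toy append-only log; offsets start at 10001 to mirror the slide."""

    def __init__(self, base_offset=10001):
        self.base_offset = base_offset
        self.messages = []

    def append(self, message):
        """Write at the end of the log and return the new message's offset."""
        self.messages.append(message)
        return self.base_offset + len(self.messages) - 1

    def read(self, offset, max_messages):
        """Return up to max_messages starting at offset, without removing them."""
        start = offset - self.base_offset
        return self.messages[start:start + max_messages]


class Consumer:
    """Each consumer remembers its own position; the log does not track it."""

    def __init__(self, log, offset):
        self.log = log
        self.offset = offset

    def poll(self, max_messages=1):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = PartitionLog()
for i in range(10):
    log.append(f"m{i}")

fast = Consumer(log, offset=10001)
slow = Consumer(log, offset=10001)
fast_batch = fast.poll(max_messages=8)
slow_batch = slow.poll(max_messages=2)
```

Because reading never deletes anything, the slow consumer can still reach every message the fast one has already passed.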

Page 23

Partition

• Consumers can go away

[Diagram: the same partition log (offsets 10001 to 10010) with a producer and only two consumers reading]

Page 24

Partition

• and come back

[Diagram: the same partition log (offsets 10001 to 10010) with a producer and three consumers reading again]

Page 25

Kafka Architecture

[Diagram: two producers, four brokers, ZooKeeper, two consumers]

Page 26

Consumers and Consumer Groups

Kafka covers two traditional messaging models

• Queuing

• Publish-subscribe

using a single consumer abstraction, the “consumer group”

Page 27

Consumers and Consumer Groups

Page 28

Consumers and Consumer Groups

[Diagram: a topic with several partitions; Consumer Group A has two consumers, Consumer Group B has one consumer]
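A rough sketch of how a consumer group gives both messaging models at once. The function `assign_partitions`, the group names and the round-robin rule are assumptions made for illustration; real Kafka clients have their own assignment strategies. Within a group the partitions are split between members (queue-like), while every group is assigned all partitions (publish-subscribe).

```python
def assign_partitions(partitions, members):
    """Spread partitions over the members of ONE group, round-robin.

    Every partition gets exactly one owner inside the group.
    """
    assignment = {member: [] for member in members}
    for index, partition in enumerate(partitions):
        owner = members[index % len(members)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
groups = {"A": ["a1", "a2"], "B": ["b1"]}

# Each group is assigned all partitions independently of the other groups.
plan = {name: assign_partitions(partitions, members)
        for name, members in groups.items()}
```

Here group A shares the work between two consumers, while the single consumer in group B reads everything.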

Page 29

Message order and parallelism

TRADITIONAL QUEUE

• Retains messages in order and hands them over to consumers in order.

• Message order is not guaranteed under parallel processing unless there is an exclusive consumer per queue.

KAFKA WAY - THE PARTITION

• Partition on topics

• Each partition is assigned to a consumer group so that each partition is consumed by a single consumer in the group.

• Message order guaranteed on a per-partition basis

• Message key can be used to order explicitly (consumer)

• One consumer instance per partition within the consumer group
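The per-partition ordering guarantee can be sketched as follows, assuming messages carry a key that decides their partition. The helper names and the account events are hypothetical, and CRC32 is an arbitrary stand-in for whatever hash a real client uses. Since every event for one key lands in the same log, a single consumer of that partition sees them in the order they were produced.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 is an arbitrary choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish_all(events, num_partitions):
    """Append (key, value) events to partition logs, keeping arrival order."""
    logs = defaultdict(list)
    for key, value in events:
        logs[partition_for(key, num_partitions)].append((key, value))
    return logs


events = [("acct-1", "open"), ("acct-2", "open"), ("acct-1", "deposit"),
          ("acct-2", "withdraw"), ("acct-1", "close")]
logs = publish_all(events, num_partitions=4)

# Reading any single partition yields each key's events in the order produced.
acct1_partition = logs[partition_for("acct-1", 4)]
acct1_values = [value for key, value in acct1_partition if key == "acct-1"]
```

No ordering is implied between different partitions, which is the price paid for consuming them in parallel.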

Page 30

Replication

• Each partition of a topic has one leader and zero or more replicas

• Partitions are selected “round-robin” to balance the load unless ordering must be maintained

• The leader handles all writes to the partition

[Diagram: three servers holding replicas of partitions A:0, A:1 and B:0, plus a controller]
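A toy sketch of the bullets above, assuming keyless messages and followers that copy from the leader immediately. The `ReplicatedPartition` class is hypothetical and ignores leader election, the controller and network failures; it only shows round-robin partition selection and writes going through the leader.

```python
from itertools import cycle


class ReplicatedPartition:
    """Toy partition with one leader log and zero or more follower copies."""

    def __init__(self, name, replica_count):
        self.name = name
        self.leader = []
        self.followers = [[] for _ in range(replica_count)]

    def write(self, message):
        # All writes go through the leader...
        self.leader.append(message)
        # ...and followers then copy what the leader has (done eagerly here).
        for follower in self.followers:
            follower.append(message)


partitions = [ReplicatedPartition("A:0", 2), ReplicatedPartition("A:1", 2)]
next_partition = cycle(partitions)  # round-robin choice for keyless messages

for message in ["m0", "m1", "m2", "m3", "m4"]:
    next(next_partition).write(message)
```

With five messages and two partitions, the load splits three to two, and every follower ends up with the same content as its leader.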

Page 31

Durability and throughput

• Durability can be configured at the producer level

• Durability is traded off against throughput

• ISR - group of in-sync replicas for a partition

Durability   Behaviour                          Per Event Latency
Highest      ACK once all ISRs have received    Highest
Medium       ACK once the leader has received   Medium
Lowest       No ACK required                    Lowest
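The trade-off in the table can be illustrated with a small latency model. Everything here is an assumption for illustration: the function name, the level labels "none", "leader" and "all", and the millisecond figures are invented, and real latencies depend on the cluster.

```python
def ack_latency_ms(acks, leader_ms, follower_ms):
    """Estimate time until the producer may treat a send as done.

    acks: "none" (no ACK), "leader" (leader has it), "all" (every ISR has it).
    leader_ms: time for the leader to receive the message.
    follower_ms: extra time each in-sync follower needs after the leader.
    """
    if acks == "none":
        return 0                      # fire and forget: lowest durability
    if acks == "leader":
        return leader_ms              # safe unless the leader is lost
    if acks == "all":
        # Followers copy in parallel, so the slowest one sets the pace.
        return leader_ms + max(follower_ms, default=0)
    raise ValueError(f"unknown acks level: {acks!r}")


latencies = {level: ack_latency_ms(level, leader_ms=2, follower_ms=[3, 5])
             for level in ("none", "leader", "all")}
```

The stronger the acknowledgement, the longer each event waits, which is why higher durability costs throughput.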

Page 32

Durability and throughput

Page 33

Use case - LinkedIn

Source: LinkedIn

Traditional jerry-rigged piped architecture

Page 34

Use case - LinkedIn

Source: LinkedIn

Page 35

Use case - LinkedIn

Source: LinkedIn

Stream-centric data architecture

Page 36

What is Stream Processing?

Page 37

Stream Processing

Stream processing is a generalisation of batch processing

Page 38

Stream Processing

[Diagram: an input Kafka topic feeds three transforms, which write to an intermediate Kafka topic; three more transforms write to an output Kafka topic, read by a data store and a consumer]

cat input | grep "foo" | wc
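The shell pipeline on this slide maps naturally onto chained Python generators, which gives a feel for how stream stages compose. This is only an analogy sketched under assumptions: the function names and sample lines are made up, and an in-memory iterator stands in for the input topic.

```python
def grep(lines, pattern):
    """Pass through only the lines containing `pattern` (like `grep`)."""
    for line in lines:
        if pattern in line:
            yield line


def wc(lines):
    """Count lines, words and characters (like `wc`), consuming the stream."""
    line_count = word_count = char_count = 0
    for line in lines:
        line_count += 1
        word_count += len(line.split())
        char_count += len(line) + 1   # +1 for the newline `wc` would see
    return line_count, word_count, char_count


source = iter(["foo bar", "baz", "another foo"])   # stands in for `cat input`
result = wc(grep(source, "foo"))
```

Each stage handles one record at a time and never needs the whole input, which is the property that lets the same shape run over an unbounded stream.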

Page 39

Stream Processing with Kafka

[Image: two logos joined by "+", followed by "= Stream Processing"]

Page 40

Spark

• Data-parallel computation

• Micro-batch processing

• APIs in Java, Scala, Python

• In-memory storage

Page 41

Storm

• Event-parallel computation

• One-at-a-time processing

• Micro-batch processing is possible with Trident

• APIs in Java, Scala, Python, Clojure, Ruby, etc.

• Suitable for processing complex event data

• Transforms unstructured data into the desired format

Page 42

Samza

• One-at-a-time processing

• Based on messages and partitions

• APIs in Java, Scala

• Suitable for processing large amounts of data
