1. no consideration of data on the inside vs outside
2. schema not externally defined
3. same config for every client/topic
4. 128 partitions as default config
5. running on 8 overloaded nodes
#kafkasummit @spjelkavik @audunstrand
mistake: no consideration of data on the inside vs outside
https://flic.kr/p/6MjhUR
why is it a mistake
Everything published on Kafka (0.8.2) is visible to any client that can access it
what is the consequence
direct reads across services/domains are quite normal in legacy and/or enterprise systems
coupling makes it hard to make changes
unknown and unwanted coupling has a cost
Kafka had no security per topic - you must add that yourself
what is the correct solution
Consider what is data on the inside, versus data on the outside
Agree on a convention for what is private data and what is public data
If you want to change your internal representation often, map it before publishing it publicly (anti-corruption layer)
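The mapping step can be sketched as a small translation function between an internal type and a published type. This is a minimal illustration only: the classes InternalListing and PublicListing, their fields, and the "ACT" state code are all hypothetical and do not come from the talk.

```java
// Sketch only: all class, field, and value names here are hypothetical.
class AntiCorruptionSketch {

    // Internal representation: owned by one service and free to change.
    static final class InternalListing {
        final long id;
        final String rawTitle;
        final long priceInMinorUnits; // e.g. 1/100 of a currency unit
        final String stateCode;       // internal code such as "ACT" or "DEL"

        InternalListing(long id, String rawTitle, long priceInMinorUnits, String stateCode) {
            this.id = id;
            this.rawTitle = rawTitle;
            this.priceInMinorUnits = priceInMinorUnits;
            this.stateCode = stateCode;
        }
    }

    // Public representation: the contract that other services consume.
    static final class PublicListing {
        final long listingId;
        final String title;
        final long price;     // whole currency units
        final boolean active;

        PublicListing(long listingId, String title, long price, boolean active) {
            this.listingId = listingId;
            this.title = title;
            this.price = price;
            this.active = active;
        }
    }

    // The anti-corruption step: internal details (state codes, units,
    // untrimmed text) are translated and never leak into the public topic.
    static PublicListing toPublic(InternalListing in) {
        return new PublicListing(
                in.id,
                in.rawTitle.trim(),
                in.priceInMinorUnits / 100,
                "ACT".equals(in.stateCode));
    }

    public static void main(String[] args) {
        PublicListing out = toPublic(new InternalListing(42L, "  Sofa ", 125000L, "ACT"));
        System.out.println(out.listingId + " " + out.title + " " + out.price + " " + out.active);
    }
}
```

With a mapping like this, the internal class can be renamed or restructured without touching consumers, as long as toPublic keeps producing the same public shape.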
what has finn.no done
Decided on a naming convention (e.g. Public.xyzzy) for public topics
Communicates the intention (contract)
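Since Kafka 0.8.2 itself does not enforce anything per topic, a convention like this has to be checked in client code or tooling. A minimal sketch, assuming the prefix is exactly "Public." and is case sensitive (the talk only gives Public.xyzzy as an example); the helper names are hypothetical:

```java
// Sketch only: helper names and the exact prefix rule are assumptions.
class TopicNamingSketch {

    static final String PUBLIC_PREFIX = "Public.";

    // A topic is public when it has the prefix and a non-empty name after it.
    static boolean isPublic(String topic) {
        return topic.startsWith(PUBLIC_PREFIX)
                && topic.length() > PUBLIC_PREFIX.length();
    }

    // Owners may read their own topics; everyone else only public ones.
    static boolean mayRead(String topic, boolean consumerOwnsTopic) {
        return consumerOwnsTopic || isPublic(topic);
    }

    public static void main(String[] args) {
        System.out.println(isPublic("Public.xyzzy"));
        System.out.println(mayRead("xyzzy-internal", false));
    }
}
```

A check like mayRead could sit in a shared client wrapper, so that reading a private topic from another service becomes a deliberate decision instead of an accident.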
mistake: schema not externally defined
why is it a mistake
data and code need separate versioning strategies
version should be part of the data
defining schema in a java library makes it more difficult to access data from non-jvm languages
very little discoverability of data, people chose other means to get their data
difficult to create tools
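To illustrate "version should be part of the data", here is a minimal sketch in which every message carries its own schema version and the reader dispatches on it. The field names (schemaVersion, name, firstName, lastName) and the two versions are hypothetical, and a plain map stands in for whatever serialization format is actually used.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: field names and versions are hypothetical examples.
class VersionedMessageSketch {

    // The reader looks at the version inside the message, not at which
    // version of a shared code library happens to be deployed.
    static String displayName(Map<String, String> message) {
        int version = Integer.parseInt(message.get("schemaVersion"));
        switch (version) {
            case 1:
                return message.get("name");
            case 2:
                return message.get("firstName") + " " + message.get("lastName");
            default:
                throw new IllegalArgumentException("Unsupported schema version: " + version);
        }
    }

    public static void main(String[] args) {
        Map<String, String> v1 = new HashMap<>();
        v1.put("schemaVersion", "1");
        v1.put("name", "Ada Lovelace");

        Map<String, String> v2 = new HashMap<>();
        v2.put("schemaVersion", "2");
        v2.put("firstName", "Ada");
        v2.put("lastName", "Lovelace");

        System.out.println(displayName(v1));
        System.out.println(displayName(v2));
    }
}
```

Because old and new messages can be told apart, producers and consumers no longer have to be deployed at the same moment when the data changes.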
what is the consequence
development speed outside jvm has been slow
change of data needs coordinated deployment
no process for data versioning, like backwards compatibility checks
difficult to create tooling that needs to know data format, like data lake and database sinks
what is the correct solution
The confluent.io platform has a separate schema registry
Apache Avro
multiple compatibility settings and evolution strategies
Kafka Connect
Take complexity out of the applications
what has finn.no done
still using java library, with schemas in builders
confluent platform 2.0 is planned for the next step, not (just) kafka 0.9
mistake: running mixed load with a single, default configuration
https://flic.kr/p/qbarDR
why is it a mistake
Historically - One Big Database with Expensive License
Database world - OLTP and OLAP
Changed with Open Source software and Cloud
Tried to simplify the developer's day with a single config
Kafka supports both very high throughput and high reliability
what is the consequence
Trade off between throughput and degree of reliability
With a single configuration - the last commit wins
Either high throughput, and risk of loss - or potentially too slow
what is the correct solution
Understand your use cases and their needs!
Use proper per-topic configuration
Consider splitting / isolation
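As a rough sketch of what "one configuration per use case" could look like, the two profiles below build separate producer settings instead of sharing one. The keys are from the old (Scala) producer in Kafka 0.8; only request.required.acks is named in this talk, while producer.type and message.send.max.retries are added here as assumptions, and the values are illustrative, not recommendations.

```java
import java.util.Properties;

// Sketch only: profile names and values are illustrative assumptions.
class ProducerProfilesSketch {

    // For data that must not be lost: wait for all in-sync replicas.
    static Properties reliable() {
        Properties p = new Properties();
        p.setProperty("request.required.acks", "-1"); // -1 = all in-sync replicas
        p.setProperty("producer.type", "sync");       // block until acknowledged
        p.setProperty("message.send.max.retries", "10");
        return p;
    }

    // For high-volume data where occasional loss is acceptable.
    static Properties highThroughput() {
        Properties p = new Properties();
        p.setProperty("request.required.acks", "0");  // 0 = do not wait for the broker
        p.setProperty("producer.type", "async");      // batch in the background
        p.setProperty("message.send.max.retries", "0");
        return p;
    }

    public static void main(String[] args) {
        System.out.println("reliable:        " + reliable());
        System.out.println("high throughput: " + highThroughput());
    }
}
```

Keeping the profiles as named methods makes the trade-off an explicit choice per topic or client, so that the last commit to a shared config no longer decides it for everyone.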
what has finn.no done
Defaults that are quite reliable
Exposing configuration variables in the client
Ask the questions:
● at least once delivery
● ordering - if you partition, what must have strict ordering
● 99% delivery - is that good enough?
● what level of throughput is needed
Configuration
Configuration for production
● Partitions
● Replicas (default.replication.factor)
● Minimum ISR (min.insync.replicas)
● Wait for acknowledgement when producing messages (request.required.acks, block.on.buffer.full)
● Retries
● Leader election
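How the replica count and min.insync.replicas interact can be shown with a little arithmetic. The sketch below assumes the producer waits for all in-sync replicas (request.required.acks=-1); under that assumption a partition keeps accepting writes only while at least min.insync.replicas replicas are in sync. The method name is hypothetical.

```java
// Sketch only: a worked calculation, not part of any Kafka API.
class ReplicationMathSketch {

    // How many replicas of a partition can be lost before writes that
    // require all in-sync replicas start being rejected.
    static int tolerableReplicaLoss(int replicationFactor, int minInSyncReplicas) {
        if (minInSyncReplicas < 1 || minInSyncReplicas > replicationFactor) {
            throw new IllegalArgumentException(
                    "min.insync.replicas must be between 1 and the replication factor");
        }
        return replicationFactor - minInSyncReplicas;
    }

    public static void main(String[] args) {
        // 3 replicas with a minimum ISR of 2: one replica may be down.
        System.out.println(tolerableReplicaLoss(3, 2));
        // 3 replicas with a minimum ISR of 3: any single failure blocks writes.
        System.out.println(tolerableReplicaLoss(3, 3));
    }
}
```

The same numbers read the other way around: every acknowledged message exists on at least min.insync.replicas brokers, which is the reliability side of the trade-off.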
Configuration for consumer
● Number of threads
● When to commit (auto.commit.enable vs consumer.commitOffsets)
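Why the commit point matters can be seen in a small simulation that uses no Kafka classes at all. It is a simplification: real auto commit is periodic rather than per message, and the single crash at a fixed position is a hypothetical scenario chosen to make the difference visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: an in-memory simulation, not real consumer code.
class CommitTimingSketch {

    // Simulates a consumer that crashes once while handling the message at
    // crashIndex, restarts from the last committed offset, and then finishes.
    // Returns the messages that were fully processed.
    static List<String> run(List<String> log, int crashIndex, boolean commitAfterProcessing) {
        List<String> processed = new ArrayList<>();
        int committed = 0;
        int position = 0;
        boolean crashed = false;
        while (position < log.size()) {
            if (!commitAfterProcessing) {
                committed = position + 1; // offset stored before the work is done
            }
            if (!crashed && position == crashIndex) {
                crashed = true;
                position = committed;     // restart from the last committed offset
                continue;
            }
            processed.add(log.get(position));
            if (commitAfterProcessing) {
                committed = position + 1; // offset stored only after success
            }
            position++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c");
        System.out.println("commit after processing:  " + run(log, 1, true));
        System.out.println("commit before processing: " + run(log, 1, false));
    }
}
```

Committing after processing gives at-least-once behaviour: nothing is skipped, but a crash between processing and commit (not simulated here) would cause a message to be handled twice. Committing early avoids duplicates at the price of possible loss.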