Stream Processing with Samza Navina Ramesh DDS, Data Infrastructure February 25, 2015
Jul 17, 2015
Outline
• Introduction
• Use Cases at LinkedIn
• Architecture & Concepts
Response latency
• 0 ms: Synchronous
• Milliseconds to minutes: Stream Processing
• Later. Possibly much later.
Use cases @ LinkedIn
• Data standardization platform (Project “Waterloo”)
• Call graph assembly
• Metrics & Monitoring
Call graph assembly
Step | Map-reduce/Hadoop | Samza
Filter/redirect records | Mapper | Repartition job
Process the grouped records | Reduce | Aggregation job
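The two-stage mapping above can be sketched in plain Java. This is only an illustration: CallEvent, CallGraphSketch, and the choice of a request ID as the grouping key are assumptions, not something the slides specify, and no Samza types are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical record type: one service call observed while serving a request.
class CallEvent {
    final String requestId;
    final String service;

    CallEvent(String requestId, String service) {
        this.requestId = requestId;
        this.service = service;
    }
}

class CallGraphSketch {
    // Stage 1 (the "repartition job"): drop unusable records and choose the output
    // partition from the request ID, so all events of one request stay together.
    static List<List<CallEvent>> repartition(List<CallEvent> input, int partitionCount) {
        List<List<CallEvent>> out = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            out.add(new ArrayList<>());
        }
        for (CallEvent e : input) {
            if (e.requestId == null) {
                continue; // filter: records without a request ID cannot be grouped
            }
            out.get(Math.floorMod(e.requestId.hashCode(), partitionCount)).add(e);
        }
        return out;
    }

    // Stage 2 (the "aggregation job"): within one partition, group events by request.
    static Map<String, List<String>> aggregate(List<CallEvent> partition) {
        Map<String, List<String>> callsByRequest = new TreeMap<>();
        for (CallEvent e : partition) {
            callsByRequest.computeIfAbsent(e.requestId, k -> new ArrayList<>()).add(e.service);
        }
        return callsByRequest;
    }
}
```

Because the partition is derived from the key alone, the aggregation stage never needs to look outside its own partition, which is the same property a Hadoop reducer relies on.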
Samza Concepts & Architecture
• Streams
• Tasks
• Jobs
• Stateful Stream Processing
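Streams
[Figure: a stream split into Partition 0, Partition 1, and Partition 2. Each partition is an ordered sequence of messages (1 to 6, 1 to 5, and 1 to 7); the next append goes to the end of a partition.]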
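A stream as pictured above can be modeled as a set of append-only partitions. The sketch below is an in-memory stand-in: PartitionedStream is a hypothetical class, not a Samza or Kafka API, and routing by key hash is an assumption the slide does not state. Offsets start at 0 here, whereas the figure counts from 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a partitioned stream; not a Samza or Kafka class.
class PartitionedStream {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedStream(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Assumption: route by key hash so equal keys always land in the same partition.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends at the end of the chosen partition and returns the message's offset.
    long append(String key, String message) {
        List<String> partition = partitions.get(partitionFor(key));
        partition.add(message);
        return partition.size() - 1;
    }

    // Reads one message by (partition, offset); existing messages are never rewritten.
    String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    // The offset the next appended message will receive in this partition.
    long nextOffset(int partition) {
        return partitions.get(partition).size();
    }
}

class PartitionedStreamDemo {
    public static void main(String[] args) {
        PartitionedStream pageViews = new PartitionedStream(3);
        long first = pageViews.append("home", "view-1");
        long second = pageViews.append("home", "view-2");
        System.out.println("offsets " + first + ", " + second); // offsets 0, 1
    }
}
```

Ordering is only defined within a partition: two messages with the same key can be compared by offset, but messages in different partitions cannot.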
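Tasks
[Figure: Task 1 consumes Partition 0.]
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.avro.generic.GenericRecord;
import org.apache.samza.system.*;
import org.apache.samza.task.*;
class PageKeyViewsCounterTask implements StreamTask {
  // Assumed system and stream names; the slide does not show how countStream is defined.
  private final SystemStream countStream = new SystemStream("kafka", "page-key-view-counts");
  private final Map<String, AtomicInteger> pageKeyViews = new HashMap<>();
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    GenericRecord record = (GenericRecord) envelope.getMessage();
    String pageKey = record.get("page-key").toString();
    // Create the counter the first time a page key is seen, then increment it.
    int newCount = pageKeyViews.computeIfAbsent(pageKey, k -> new AtomicInteger()).incrementAndGet();
    collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
  }
}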
Tasks
[Figure: PageKeyViewsCounterTask consumes messages 1 to 4 from Page Views - Partition 0 and emits to the Output Count Stream (Partition 0, Partition 1).]
Tasks
[Figure: PageKeyViewsCounterTask consumes messages 1 to 4 from Page Views - Partition 0, emits to the Output Count Stream (Partition 0, Partition 1), and records offset 2 in the Checkpoint Stream.]
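The role of the checkpoint stream can be sketched as follows. This is an in-memory model under stated assumptions: CheckpointedReader is a hypothetical class, not Samza's checkpoint API, and the "checkpoint every N messages" policy is chosen only for illustration. A restarted task resumes just after the last checkpointed offset, so anything processed after that checkpoint is processed again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model of checkpointing; not Samza's checkpoint API.
class CheckpointedReader {
    private final List<String> partition;          // input partition, index = offset
    private final List<Integer> checkpointStream;  // appended offsets; the last entry wins
    final List<String> processed = new ArrayList<>();

    CheckpointedReader(List<String> partition, List<Integer> checkpointStream) {
        this.partition = partition;
        this.checkpointStream = checkpointStream;
    }

    // Offset to resume from: one past the last checkpointed offset, or 0 if none exists.
    int startOffset() {
        if (checkpointStream.isEmpty()) {
            return 0;
        }
        return checkpointStream.get(checkpointStream.size() - 1) + 1;
    }

    // Processes up to maxMessages, writing a checkpoint after every checkpointEvery messages.
    void run(int maxMessages, int checkpointEvery) {
        int offset = startOffset();
        for (int n = 1; n <= maxMessages && offset < partition.size(); n++, offset++) {
            processed.add(partition.get(offset));
            if (n % checkpointEvery == 0) {
                checkpointStream.add(offset);
            }
        }
    }
}

class CheckpointDemo {
    public static void main(String[] args) {
        List<String> pageViews = Arrays.asList("m0", "m1", "m2", "m3");
        List<Integer> checkpoints = new ArrayList<>();

        CheckpointedReader first = new CheckpointedReader(pageViews, checkpoints);
        first.run(3, 2); // processes m0..m2, checkpoints offset 1, then "fails"

        CheckpointedReader restarted = new CheckpointedReader(pageViews, checkpoints);
        restarted.run(10, 2); // resumes at offset 2, so m2 is processed a second time
        System.out.println(restarted.processed); // [m2, m3]
    }
}
```

In the demo, m2 is seen by both the failed and the restarted reader, which is why downstream consumers of such a task generally need to tolerate duplicates.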
Jobs
[Figure: the AdViews and AdClicks streams feed Task 1, Task 2, and Task 3, which produce AdClickThroughRate.]
Stream Processing is Hard
• Partitioning
• Re-processing
• Failure semantics
• State
• Joins to services or database
• Non-determinism
Jobs
AdViews, AdClicks → Task 1, Task 2, Task 3 → AdClickThroughRate
SELECT AdViews.id, COUNT(AdViews) views, COUNT(AdClicks) clicks, clicks/views ctr
FROM AdViews
LEFT JOIN AdClicks
ON AdViews.id = AdClicks.id
GROUP BY AdViews.id
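What one task of such a job might keep per ad ID can be sketched in plain Java. ClickThroughRateCounter and its method names are hypothetical, and the sketch assumes both input streams are partitioned by ad ID so that a single task sees every view and click for a given ad.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-task logic for the AdClickThroughRate job; plain Java, no Samza types.
class ClickThroughRateCounter {
    private final Map<String, Integer> views = new HashMap<>();
    private final Map<String, Integer> clicks = new HashMap<>();

    // Called once per message from the AdViews stream.
    void onAdView(String adId) {
        views.merge(adId, 1, Integer::sum);
    }

    // Called once per message from the AdClicks stream.
    void onAdClick(String adId) {
        clicks.merge(adId, 1, Integer::sum);
    }

    // clicks / views for one ad; 0.0 when the ad has not been viewed yet.
    double ctr(String adId) {
        int v = views.getOrDefault(adId, 0);
        if (v == 0) {
            return 0.0;
        }
        return (double) clicks.getOrDefault(adId, 0) / v;
    }
}

class ClickThroughRateDemo {
    public static void main(String[] args) {
        ClickThroughRateCounter counter = new ClickThroughRateCounter();
        for (int i = 0; i < 4; i++) {
            counter.onAdView("ad-7");
        }
        counter.onAdClick("ad-7");
        System.out.println(counter.ctr("ad-7")); // 0.25
    }
}
```

The two maps are exactly the state that is lost if the task fails, which is what makes the job stateful.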
Stateful Tasks
[Figure: Task 1, Task 2, and Task 3 consume Stream A and produce Stream B.]
[Figure: the same tasks, each now also writing to a Changelog Stream.]
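One way to read the changelog picture: every write to a task's local state is also appended to a changelog, and a restarted task rebuilds its state by replaying that log. The sketch below models this in memory; ChangelogBackedStore is a hypothetical class and not Samza's key-value store API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of task-local state backed by a changelog; not a Samza API.
class ChangelogBackedStore {
    private final Map<String, Integer> local = new HashMap<>();
    private final List<Map.Entry<String, Integer>> changelog;

    ChangelogBackedStore(List<Map.Entry<String, Integer>> changelog) {
        this.changelog = changelog;
        // Restore: replay every change in order; later writes overwrite earlier ones.
        for (Map.Entry<String, Integer> change : changelog) {
            local.put(change.getKey(), change.getValue());
        }
    }

    // Writes locally and appends the same change to the changelog.
    void put(String key, int value) {
        local.put(key, value);
        changelog.add(new AbstractMap.SimpleImmutableEntry<>(key, value));
    }

    // Reads are served from local state only; returns null for unknown keys.
    Integer get(String key) {
        return local.get(key);
    }
}

class ChangelogDemo {
    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> changelog = new ArrayList<>();

        ChangelogBackedStore before = new ChangelogBackedStore(changelog);
        before.put("home", 1);
        before.put("home", 2);
        before.put("jobs", 5);

        // Simulated restart: a new store instance sees only the changelog.
        ChangelogBackedStore after = new ChangelogBackedStore(changelog);
        System.out.println(after.get("home") + " " + after.get("jobs")); // 2 5
    }
}
```

Reads never touch the changelog, so lookups stay local and fast; the log is only consulted when state has to be rebuilt.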
Resources
• What’s next?
– Support for SQL operators over streams
– Samza without YARN
• Get involved:
– Apache – http://samza.apache.org
– Dev Mailing List – [email protected]
– JIRA – https://issues.apache.org/jira/browse/SAMZA