Introducing Tupilak, Snowplow’s unified log fabric
Snowplow London Meetup #3, 21 Sep 2016
Quick show of hands
• Batch pipeline: how many here run the Snowplow batch pipeline?
• Real-time pipeline: how many here run the Snowplow RT pipeline?
• Orchestration: how are you running, scaling and monitoring the real-time pipeline?
• Anything else: who here is evaluating Snowplow, or just curious?
From the beginning, Snowplow RT was designed around small, composable workers…
[Diagram from our Feb 2014 Snowplow v0.9.0 release post]
Today, we see a growing number of async micro-services making up Snowplow RT
• Stream Collector
• Stream Enrich
• Kinesis S3
• Kinesis Elasticsearch
• Kinesis Tee (coming soon)
• Redshift dripfeeder (design stage)
• User’s AWS Lambda function
• User’s KCL worker app
• User’s Spark Streaming job
But managing this kind of complexity has some major challenges
• “How do we monitor this topology, and alert if something (data loss; event lag) is going wrong?”
• “How do we scale our streams and micro-services to handle event peaks and troughs smoothly?”
• “How do we re-configure or upgrade our micro-services without breaking things?”
Enter Tupilak!
“A tupilak was an avenging monster fabricated by a shaman by using animal parts (bone, skin, hair, sinew, etc). The creature was given life by ritualistic chants. It was then placed into the sea to seek and destroy a specific enemy.”
Today Tupilak serves 3 key functions for the Snowplow RT pipeline (Managed Service):

Monitoring
• Visualizing the complex stream + worker topology in one place
• Indicating micro-services which are failing or falling behind (“lagging”)

Auto-scaling
• Auto-scaling the number of shards in each Kinesis stream
• Auto-scaling the number of EC2 instances running each micro-service

Alerting
• Notifying our ops team via PagerDuty in the case of a failing or lagging micro-service
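As a rough illustration of the alerting function, here is a minimal Scala sketch of the kind of lag check that could page an ops team. The names, the 5-minute threshold, and the pageOps stub are illustrative assumptions, not Tupilak’s actual API; a real implementation would call the PagerDuty Events API.

// Hypothetical sketch of Tupilak-style lag alerting (names are illustrative).
// A micro-service is "lagging" if its consumer has fallen more than a
// threshold behind the head of its input stream.
case class MicroService(name: String, lagMillis: Long)

object LagAlerter {
  // Assumed threshold: alert once a consumer is > 5 minutes behind.
  val LagThresholdMillis: Long = 5 * 60 * 1000

  // Placeholder for a real PagerDuty Events API call.
  def pageOps(service: MicroService): Unit =
    println(s"PAGE: ${service.name} lagging by ${service.lagMillis} ms")

  def check(services: Seq[MicroService]): Unit =
    services.filter(_.lagMillis > LagThresholdMillis).foreach(pageOps)

  def main(args: Array[String]): Unit =
    check(Seq(
      MicroService("stream-enrich", 12000),
      MicroService("kinesis-s3", 420000) // > 5 min behind -> page
    ))
}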
Let’s look at auto-scaling in particular
[Diagram: Read/write throughput → # Shards in Kinesis Stream → # EC2 Instances]
• We scale the number of shards in each stream based on the read/write throughput we are seeing
• We scale the number of EC2 instances based on some fixed assumptions about the ratio between shards and workers
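To make the rule concrete, here is a minimal Scala sketch of throughput-based shard scaling plus a fixed shard-to-instance ratio. The target utilisation and InstancesPerShard ratio are assumptions for illustration, not Tupilak’s actual values; the 1 MB/s per-shard write limit is the standard Kinesis figure.

// Sketch of the scaling rule described above; thresholds and the
// shard-to-instance ratio are illustrative assumptions.
object ThroughputScaler {
  val ShardWriteCapacityBytesPerSec: Long = 1024 * 1024 // 1 MB/s per shard (Kinesis write limit)
  val TargetUtilisation: Double = 0.7                   // leave headroom for event peaks
  val InstancesPerShard: Double = 0.5                   // fixed assumption: 1 worker per 2 shards

  // Shards needed so observed write throughput stays under target utilisation.
  def desiredShards(writeBytesPerSec: Long): Int =
    math.max(1, math.ceil(
      writeBytesPerSec / (ShardWriteCapacityBytesPerSec * TargetUtilisation)).toInt)

  // EC2 instances derived from the shard count via the fixed ratio.
  def desiredInstances(shards: Int): Int =
    math.max(1, math.ceil(shards * InstancesPerShard).toInt)

  def main(args: Array[String]): Unit = {
    val shards = desiredShards(writeBytesPerSec = 5 * 1024 * 1024) // 5 MB/s observed
    println(s"shards=$shards instances=${desiredInstances(shards)}") // shards=8 instances=4
  }
}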
What’s next for Tupilak?

1. Better auto-scaling

[Diagram: Read/write throughput + micro-service lag + performance metrics relative to stream → # Shards in Kinesis Stream → # EC2 Instances]
• We scale the number of shards in each stream based on the read/write throughput we are seeing, and the lag of any services consuming this stream or downstream of this stream
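A sketch of the lag-aware variant: the same throughput calculation as before, but the shard target is bumped when any consumer of the stream (or of a stream downstream of it) is lagging. The doubling heuristic and all thresholds here are illustrative assumptions only.

// Sketch of lag-aware scaling: throughput sets the baseline, consumer lag
// anywhere downstream can push the shard count higher. Names and thresholds
// are illustrative assumptions.
object LagAwareScaler {
  val ShardWriteCapacityBytesPerSec: Long = 1024 * 1024
  val TargetUtilisation: Double = 0.7
  val LagThresholdMillis: Long = 5 * 60 * 1000

  def throughputShards(writeBytesPerSec: Long): Int =
    math.max(1, math.ceil(
      writeBytesPerSec / (ShardWriteCapacityBytesPerSec * TargetUtilisation)).toInt)

  // If any consumer of this stream (or downstream of it) is behind, add
  // shards so its workers can parallelise catching up.
  def desiredShards(writeBytesPerSec: Long, consumerLagsMillis: Seq[Long]): Int = {
    val base     = throughputShards(writeBytesPerSec)
    val worstLag = if (consumerLagsMillis.isEmpty) 0L else consumerLagsMillis.max
    if (worstLag > LagThresholdMillis) base * 2 else base
  }

  def main(args: Array[String]): Unit = {
    println(desiredShards(3 * 1024 * 1024, Seq(1000L, 600000L))) // lagging -> 10
    println(desiredShards(3 * 1024 * 1024, Seq(1000L, 2000L)))   // healthy -> 5
  }
}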