Taskerman: A Distributed Cluster Task Manager
Raghavendra D Prabhu, [email protected], @randomsurfer
Distributed Systems
Jul 22, 2020
Yelp’s Mission
Connecting people with great local businesses.
Datastore Ecosystem @ Yelp
● Cassandra
● Elasticsearch
● Zookeeper
● PostgreSQL
● Memcached
● Redis
● Spark
● Redshift
● DynamoDB
● PaaStorm
● S3
● And many more...
● Several TB in Cassandra clusters with tens of nodes each
● Close to a million messages/second in the streaming pipeline
● Several TB in Elasticsearch clusters with several hundred nodes each
● Many PB archived to S3 every month
● Multi-AZ, Multi-Region
● And growing…
Distributed Systems
“Need to run a logical backup on a fleet without disrupting ingress traffic”
“Run anti-entropy repair on a Cassandra cluster without spiking read latency”
“Reboot 1000 instances without taking a millennium, but without bringing down the site either”
“Upgrade an Elasticsearch cluster from m3.medium to m3.xlarge safely, without downtime”
Pet vs Cattle
Maintenance Cost
Engineering Efficiency
Scalability
Taskerman
Requirements
● Safe
● Security
● Generic and Extensible
● Distributed
● Loosely coupled
● Cluster awareness
Desirable
● Schedulable
● Reusable
● Auditability
  ○ Not ad hoc
  ○ More declarative, less imperative
  ○ Config management
● Maintainability
● Observability
● Resilience
Safety
● Paramount*
● Serialized execution
  ○ ‘m’ out of ‘n’
  ○ Disjoint jobs
● Avoid cascade
● Privilege escalation
● Pull-based
* Unless oncall is automated too.
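One possible reading of "serialized execution, ‘m’ out of ‘n’" is that at most m of a cluster's n nodes are acted on at a time, and that the run stops as soon as a batch misbehaves so a failure does not cascade. The sketch below illustrates that reading only; the names `batches` and `run_serialized` are hypothetical and do not come from Taskerman.

```python
from typing import Callable, Iterator, List, Sequence


def batches(nodes: Sequence[str], limit: int) -> Iterator[List[str]]:
    """Yield successive groups of at most `limit` nodes (m out of n)."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(nodes), limit):
        yield list(nodes[start:start + limit])


def run_serialized(nodes: Sequence[str], limit: int,
                   action: Callable[[str], bool]) -> List[str]:
    """Run `action` batch by batch; stop after the first batch with a failure.

    Returns the nodes on which the action succeeded.
    """
    succeeded: List[str] = []
    for batch in batches(nodes, limit):
        results = [action(node) for node in batch]
        succeeded.extend(node for node, ok in zip(batch, results) if ok)
        if not all(results):
            break  # avoid cascading the failure to the remaining nodes
    return succeeded
```

With five nodes and a limit of 2, a failure in the second batch means the fifth node is never touched.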
Fallacies of Distributed Systems
● Network is reliable
● Latency is zero
● Bandwidth is infinite
● Network is secure
● There is one administrator
● Transport cost is zero
● Network is homogeneous
● Topology doesn't change
Quotes
“There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.” @secretGeek
“There are only two hard problems in distributed systems:
2. Exactly-once delivery
1. Guaranteed order of messages
2. Exactly-once delivery” @mathiasverraes
Building Blocks
● Scheduler
● Router
● Co-ordinator
● Transport
● Executor
● Error handler
● Configuration
● Monitoring
● Tooling
[Diagram: Flow of task. Labeled elements: Task Scheduler, Workqueue, Router, Queue, Node Queues (Q1, Q2, Q3), Cluster, tasks T1, T2, T3, Lease, Zookeeper, EC2 API, Failure, Retries, Dead Letter Queue.]
#Anatomy of a Taskerman Task
# Restart action for 2 nodes of the geo_counter
# Cassandra cluster owned by gsi
{
    'action': 'cassandra_task:restart',
    'version': 1.2,
    'limit': 2,
    'cluster_name': 'cassandra:geo_counter',
    'discovery': 'aws_tags',
    'owner': 'gsi',
    'task_id': 'abcd-ef123',
    'taskerman_params': {
        'action_args': {'force': True},        # force=True for restart
        'workqueue_args': {'retry_count': 3},  # retry_count for queue
    },
    'nodes': [],      # [a,b,c,d] to skip discovery
    'destnode': '',
}
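A task of this shape can be sanity-checked before it is enqueued. The following is a minimal sketch, assuming the top-level fields in the example are all required and that `action` always has the form `<module>:<verb>`; neither assumption is confirmed by the slides, and `split_action` and `validate_task` are hypothetical helper names.

```python
from typing import Any, Dict, List, Tuple

# Top-level fields taken from the example task; treating them all as
# required is an assumption made for this sketch.
REQUIRED_FIELDS = ("action", "version", "limit", "cluster_name",
                   "discovery", "owner", "task_id")


def split_action(action: str) -> Tuple[str, str]:
    """Split 'cassandra_task:restart' into ('cassandra_task', 'restart')."""
    module, sep, verb = action.partition(":")
    if not (module and sep and verb):
        raise ValueError(f"action must look like '<module>:<verb>': {action!r}")
    return module, verb


def validate_task(task: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the task looks well formed."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in task]
    if "action" in task:
        try:
            split_action(task["action"])
        except ValueError as exc:
            problems.append(str(exc))
    limit = task.get("limit")
    if limit is not None and (not isinstance(limit, int) or limit < 1):
        problems.append("limit must be a positive integer")
    return problems
```

Returning a list of problems rather than raising on the first one lets a scheduler log everything that is wrong with a rejected task in one pass.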
Task Scheduler (PaaSTA)
● Runs on Chronos
● Emits a task
● Enqueues into global queue
● Ad-hoc invocation
● Deployment granularities
● Task tracking
● Yelpsoa-configs
WorkQueue
● AWS SQS
● Best-effort FIFO
● Reliable and cheap
● Low latency
● Properties
  ○ Read without delete
  ○ Visibility timeout
  ○ Retry
  ○ Dead Letter Queue
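The four queue properties listed above interact: a read hides a message for the visibility timeout instead of deleting it, an undeleted message reappears and is retried, and a message that has been received too many times is moved to the dead letter queue. The toy model below imitates that behaviour in memory so the interplay can be followed step by step. It is not the SQS API; `ToyQueue`, `Message`, and the explicit `now` parameter are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    body: str
    receive_count: int = 0
    invisible_until: float = 0.0


@dataclass
class ToyQueue:
    visibility_timeout: float
    max_receive_count: int
    messages: List[Message] = field(default_factory=list)
    dead_letters: List[Message] = field(default_factory=list)

    def send(self, body: str) -> None:
        self.messages.append(Message(body))

    def receive(self, now: float) -> Optional[Message]:
        """Read without delete: the message stays queued but is hidden for a while."""
        for msg in list(self.messages):
            if msg.invisible_until > now:
                continue  # another consumer still holds this message
            if msg.receive_count >= self.max_receive_count:
                # Retries exhausted: park the message in the dead letter queue.
                self.messages.remove(msg)
                self.dead_letters.append(msg)
                continue
            msg.receive_count += 1
            msg.invisible_until = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg: Message) -> None:
        """Called by the consumer only after the task succeeded."""
        self.messages.remove(msg)
```

A consumer that crashes mid-task simply never calls `delete`, so the task becomes visible again after the timeout and another consumer can pick it up.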
Task Router (PaaSTA)
● Stateless Marathon worker
● Routes tasks to clusters
● Custom routing logic
● At-least-once delivery
● ‘DNS’ of Taskerman
● Pluggable discovery
  ○ AWS
  ○ Smartstack
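"Pluggable discovery" suggests the router looks up a discovery backend by the name in the task's `discovery` field, resolves the cluster to nodes, and fans the task out to per-node queues. The sketch below shows one way this could be wired up. The registry, the `route` function, the fake inventory, and the decision to apply `limit` at routing time are all assumptions; the real `aws_tags` backend would query AWS rather than a dictionary.

```python
from typing import Any, Callable, Dict, List

Discovery = Callable[[str], List[str]]
DISCOVERY_BACKENDS: Dict[str, Discovery] = {}

# Stand-in inventory for this sketch; a real backend would call out to AWS
# or Smartstack instead.
FAKE_INVENTORY = {
    "cassandra:geo_counter": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
}


def register_discovery(name: str) -> Callable[[Discovery], Discovery]:
    """Decorator that registers a discovery backend under `name`."""
    def wrap(fn: Discovery) -> Discovery:
        DISCOVERY_BACKENDS[name] = fn
        return fn
    return wrap


@register_discovery("aws_tags")
def discover_by_tags(cluster_name: str) -> List[str]:
    return list(FAKE_INVENTORY.get(cluster_name, []))


def route(task: Dict[str, Any],
          node_queues: Dict[str, List[Dict[str, Any]]]) -> List[str]:
    """Copy the task onto the queue of each chosen node; return those nodes."""
    backend = DISCOVERY_BACKENDS[task["discovery"]]
    # An explicit node list skips discovery, as in the task anatomy.
    nodes = task.get("nodes") or backend(task["cluster_name"])
    chosen = list(nodes)[: task["limit"]]
    for node in chosen:
        node_queues.setdefault(node, []).append({**task, "destnode": node})
    return chosen
```

Because the router keeps no state of its own, the same task routed twice simply lands on the node queues twice, which is why at-least-once delivery pairs naturally with idempotent actions.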
TaskRunner
● The executor of Taskerman
● Dequeues a task and executes it
  ○ Pre-defined, reviewed code
● Cron-ed on the node
● Zookeeper for coordination
● Task deleted upon success
● Dead letter queue upon failed retries
class TestTaskRunner(TaskRunner):
    def __init__(self, task, ..):
        # State mgmt and datastore specific

    def pre_check(self):
        # Is the task safe to execute on this cluster

    def execute_action(self):
        # Actual execution of task:action

    def post_check(self):
        # cluster good after execution or is it on fire
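The three hooks above imply a fixed lifecycle: check, act, check again. The runnable sketch below fills in that lifecycle under stated assumptions. The base class shown here is hypothetical (the slides do not show Taskerman's own base class), as are the `run` method, its return values, and the `EchoTaskRunner` subclass.

```python
from typing import Any, Dict, List


class TaskRunner:
    """Hypothetical base class; only the hook names come from the slides."""

    def __init__(self, task: Dict[str, Any]) -> None:
        self.task = task

    def pre_check(self) -> bool:
        raise NotImplementedError

    def execute_action(self) -> None:
        raise NotImplementedError

    def post_check(self) -> bool:
        raise NotImplementedError

    def run(self) -> str:
        """Drive the hooks in order and report the outcome."""
        if not self.pre_check():
            return "skipped"  # not safe to act on this cluster right now
        self.execute_action()
        return "succeeded" if self.post_check() else "failed"


class EchoTaskRunner(TaskRunner):
    """Toy runner that records the action instead of touching a datastore."""

    def __init__(self, task, healthy_before=True, healthy_after=True):
        super().__init__(task)
        self.healthy_before = healthy_before
        self.healthy_after = healthy_after
        self.executed: List[str] = []

    def pre_check(self) -> bool:
        return self.healthy_before

    def execute_action(self) -> None:
        self.executed.append(self.task["action"])

    def post_check(self) -> bool:
        return self.healthy_after
```

Keeping `run` in the base class means each datastore-specific runner only supplies the three hooks and cannot accidentally skip a check.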
Zookeeper
● Distributed Coordinator
● Non-blocking lease
  ○ Time-based lease
  ○ Global lease
● Ephemeral locks
● Atomic counters
  ○ Statistics
  ○ Circuit breaker
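A non-blocking, time-based lease can be pictured as follows: a caller either gets the lease immediately or is told no, and a lease that is never released expires on its own, which also covers a holder that dies mid-task. The sketch below keeps the lease table in local memory purely to show the logic. It does not use any Zookeeper client API, and `LeaseTable` and its methods are hypothetical names.

```python
from typing import Dict


class LeaseTable:
    """In-memory stand-in for lease state that would live in Zookeeper."""

    def __init__(self, max_holders: int) -> None:
        self.max_holders = max_holders
        self._leases: Dict[str, float] = {}  # holder -> expiry time

    def try_acquire(self, holder: str, ttl: float, now: float) -> bool:
        """Return immediately: True if the lease was granted, False otherwise."""
        # Expired leases are dropped first, so a dead holder cannot block others.
        self._leases = {h: exp for h, exp in self._leases.items() if exp > now}
        if holder in self._leases or len(self._leases) < self.max_holders:
            self._leases[holder] = now + ttl  # grant or renew
            return True
        return False

    def release(self, holder: str) -> None:
        self._leases.pop(holder, None)
```

With `max_holders=1` this behaves like a global lease; a larger value allows a bounded number of concurrent holders.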
Zookeeper: Challenges
● Staleness
  ○ Nodes can go down
● Garbage collection
  ○ Cleanup of ZK data structures
● Composition
● Starvation
● Uptime
Deployment
● Puppet
● Terraform
● Yelpsoa-configs
● PaaSTA
● Jenkins
● AWS Lambda
Failure handling
● Multiple vectors of failure
● Idempotency
● Pessimistic approach
  ○ Job retry
● Separation of state
● Mutability
● Highly available components
● Circuit breakers
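A circuit breaker can be as small as a failure counter with a threshold: once enough consecutive failures accumulate, further calls are refused until someone intervenes. The sketch below keeps the counter in local memory for illustration, whereas the slides suggest Taskerman keeps such counters in Zookeeper. `CircuitBreaker`, `CircuitOpenError`, and the reset-on-success policy are assumptions.

```python
class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to run anything further."""


class CircuitBreaker:
    def __init__(self, max_failures: int) -> None:
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures seen so far

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, fn, *args, **kwargs):
        """Run `fn` unless the breaker is open; count failures, reset on success."""
        if self.is_open:
            raise CircuitOpenError("too many failures; refusing to run more tasks")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0
        return result
```

An open breaker turns a potentially fleet-wide cascade into a bounded number of failed tasks that a human can inspect.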
Debugging
Failure detection
● Heartbeat ping
  ○ End-to-end monitoring
● Dead Letter Queue
  ○ Recycle bin of failed tasks
  ○ Hooks into the human side of monitoring
● Status check
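A heartbeat check reduces to comparing each component's last reported time against a maximum age. The following sketch assumes heartbeats are stored as a mapping from component name to timestamp; that storage format and the function name `stale_components` are hypothetical.

```python
from typing import Dict, List


def stale_components(last_heartbeat: Dict[str, float], now: float,
                     max_age: float) -> List[str]:
    """Return, sorted by name, the components whose heartbeat is older than `max_age`."""
    return sorted(name for name, seen in last_heartbeat.items()
                  if now - seen > max_age)
```

Running such a check from outside the pipeline gives an end-to-end signal: if the scheduler, router, or a runner stops making progress, its heartbeat ages out even when no error was ever logged.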
Monitoring
● End-to-end logging
  ○ Structured and unstructured
● Metrics
  ○ Counters
  ○ Queue lengths
● Aggregation and dashboards
● Staleness checks
● Dead Letter Queue
● Multi-modal alerting
Use cases
● Restarts
● Reboots
● Instance Replacement
● Integration tests
● Kafka config reload
● Failure injection
● Backup and restore
● Search indexing
● ... and many more
Scheduled Backups
● Safety
● Cassandra
● Elasticsearch
● Common issues
● Constraints
  ○ Limit
  ○ Healthcheck
  ○ Mutual exclusion
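The three constraints listed for backups can be combined into a single admission check that a runner evaluates before starting. This is a sketch under assumptions: the function `may_start_backup`, its parameters, and the order in which the constraints are checked are all hypothetical.

```python
from typing import Callable, Set, Tuple


def may_start_backup(node: str, running: Set[str], limit: int,
                     is_healthy: Callable[[str], bool],
                     conflicting_jobs: Set[str]) -> Tuple[bool, str]:
    """Decide whether a backup may start on `node`, with a reason for the decision."""
    if node in running:
        return False, "backup already running on this node"
    if len(running) >= limit:
        return False, "limit reached"  # limit: at most `limit` nodes at once
    if conflicting_jobs:
        return False, "conflicting job in progress"  # mutual exclusion
    if not is_healthy(node):
        return False, "healthcheck failed"
    return True, "ok"
```

Returning the reason alongside the decision makes a refused backup easy to explain in logs and dashboards.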
Secure Infrastructure
$ uptime
 06:52:54 up 99 days, 19:20, 1 user, load average: 0.02, 0.03, 0.07
$ ps -eo pid,cmd,lstart | grep ..
10058 zookeeper Tue Dec 5 05:23:43 2017
Q & A
● Slides will also be uploaded to slideshare.net/slidunder.
❖ Q: What challenges remain with Taskerman?
➢ A:
Further Reading
● https://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html
● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/
● https://martinfowler.com/bliki/TwoHardThings.html
● https://zookeeper.apache.org/
● https://www.terraform.io/
● https://github.com/Yelp/service-principles
● https://en.wikipedia.org/wiki/Law_of_Demeter