Raghavendra D Prabhu, rprabhu@yelp.com, @randomsurfer, Distributed Systems. Taskerman: A Distributed Cluster Task Manager

Jul 22, 2020


Raghavendra D Prabhu
rprabhu@yelp.com
@randomsurfer
Distributed Systems

Taskerman
A Distributed Cluster Task Manager

Yelp’s Mission

Connecting people with great local businesses.

Datastore Ecosystem @ Yelp


Cassandra

Elasticsearch

Zookeeper

PostgreSQL

● Memcached
● Redis
● Spark
● Redshift
● DynamoDB
● PaaStorm
● S3

And many more…

Distributed Systems

● Several TB in Cassandra clusters with tens of nodes each
● Close to a million messages/second in streaming pipeline
● Several TB in Elasticsearch with several hundred nodes in each
● Many PB archived to S3 every month
● Multi-AZ Multi-Region
● And growing…


“Need to run logical backup on a fleet without disruption to ingress traffic”

“Run anti-entropy repair on Cassandra cluster without spiking read latency”

“Reboot 1000 instances without taking a millennium, but without bringing down the site either”

“Upgrade an Elasticsearch cluster from m3.medium to m3.xlarge safely without downtime”


Pet vs Cattle


Maintenance Cost

Engineering Efficiency

Scalability


Taskerman

Requirements

● Safe
● Security
● Generic and Extensible
● Distributed
● Loosely coupled
● Cluster awareness

Desirable

● Schedulable
● Reusable
● Auditability
  ○ Not Ad-hoc
  ○ More Declarative, Less Imperative
  ○ Config Management
● Maintainability
● Observability
● Resilience

Safety

● Paramount*
● Serialized execution
  ○ ‘m’ out of ‘n’
  ○ Disjoint jobs
● Avoid cascade
● Privilege escalation
● Pull-based

* Unless oncall is automated too.
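
The ‘m’ out of ‘n’ bullet can be read as a cap on how many of a cluster's n nodes are acted on at the same time. A minimal Python sketch under that assumption follows; `nodes_allowed_to_start` and its parameters are hypothetical names, not Taskerman's API.

```python
def nodes_allowed_to_start(candidates, in_progress, limit):
    """Return the candidates that may start now without exceeding `limit`.

    candidates:  ordered list of node names waiting to run the task
    in_progress: set of node names currently running the task
    limit:       the 'm' in "m out of n" (assumed meaning)
    """
    # Remaining budget is never negative, even if too many nodes are already busy.
    budget = max(limit - len(in_progress), 0)
    waiting = [node for node in candidates if node not in in_progress]
    return waiting[:budget]
```

With four candidates, one node already busy, and a limit of 2, only one more node is released; the rest wait for a later pass.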

Fallacies of Distributed Systems

● Network is reliable
● Latency is zero
● Bandwidth is infinite
● Network is secure
● One administrator
● Transport cost is zero
● Network is homogeneous
● Topology doesn't change

Quotes

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.
@secretGeek

There are only two hard problems in distributed systems:
2. Exactly-once delivery
1. Guaranteed order of messages
2. Exactly-once delivery
@mathiasverraes

Building Blocks

● Scheduler
● Router
● Co-ordinator
● Transport
● Executor
● Error handler
● Configuration
● Monitoring
● Tooling

[Figure: Flow of task. Labels: Task Scheduler, Workqueue, Router, Queue, Q1, Q2, Q3, T1, T2, T3, Cluster, Node Queues, Lease, Retries, Failure, Dead Letter Queue, Zookeeper, EC2 API]

#Anatomy of a Taskerman Task

# Restart action for 2 nodes of the geo_counter
# Cassandra cluster owned by gsi
{
    'action': 'cassandra_task:restart',
    'version': 1.2,
    'limit': 2,
    'cluster_name': 'cassandra:geo_counter',
    'discovery': 'aws_tags',
    'owner': 'gsi',
    'task_id': 'abcd-ef123',
    'taskerman_params': {
        'action_args': {'force': True},        # force=True for restart
        'workqueue_args': {'retry_count': 3},  # retry_count for queue
    },
    'nodes': [],      # [a,b,c,d] to skip discovery
    'destnode': '',
}
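
The `action` value above looks like `<task type>:<action>`. Assuming that reading, a dispatcher could split the string and look up a handler. The registry, decorator, and handler body below are hypothetical and only illustrate the idea; they are not Taskerman's code.

```python
HANDLERS = {}


def handler(task_type, action):
    """Register a function for a (task type, action) pair."""
    def register(fn):
        HANDLERS[(task_type, action)] = fn
        return fn
    return register


@handler("cassandra_task", "restart")
def restart_cassandra(task):
    # Placeholder body: a real handler would restart the service on this node.
    force = task["taskerman_params"]["action_args"].get("force", False)
    return f"restart {task['cluster_name']} force={force}"


def dispatch(task):
    """Split 'type:action' and call the registered handler."""
    task_type, _, action = task["action"].partition(":")
    try:
        fn = HANDLERS[(task_type, action)]
    except KeyError:
        raise ValueError(f"no handler for action {task['action']!r}") from None
    return fn(task)
```

Rejecting unknown actions with an error keeps execution limited to handlers that were registered ahead of time.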


Task Scheduler

● Runs on Chronos
● Emits a task
● Enqueues into global queue
● Ad-hoc invocation
● Deployment granularities
● Task tracking
● Yelpsoa-configs

PaaSTA


WorkQueue

● AWS SQS
● Best-effort FIFO
● Reliable and cheap
● Low latency
● Properties
  ○ Read without delete
  ○ Visibility timeout
  ○ Retry
  ○ Dead Letter Queue

AWS SQS
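
The queue properties listed here (read without delete, visibility timeout, retry, dead letter queue) can be illustrated with a toy in-memory model. This is not the SQS API: `ToyQueue`, its methods, and the explicit `now` argument are assumptions made so the behaviour is easy to follow and test.

```python
import itertools


class ToyQueue:
    """In-memory stand-in that mimics receive/delete with a visibility timeout."""

    def __init__(self, visibility_timeout, max_receives):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.messages = {}      # id -> [body, visible_at, receive_count]
        self.dead_letters = []  # bodies that exhausted their receives
        self._ids = itertools.count(1)

    def send(self, body, now=0.0):
        msg_id = next(self._ids)
        self.messages[msg_id] = [body, now, 0]
        return msg_id

    def receive(self, now):
        """Return (id, body) of the oldest visible message, or None."""
        for msg_id in sorted(self.messages):
            body, visible_at, count = self.messages[msg_id]
            if visible_at > now:
                continue  # someone else holds it until the timeout expires
            if count >= self.max_receives:
                # Retries exhausted: park the message for a human to inspect.
                self.dead_letters.append(body)
                del self.messages[msg_id]
                continue
            # Reading does not delete; the message is only hidden for a while.
            self.messages[msg_id] = [body, now + self.visibility_timeout, count + 1]
            return msg_id, body
        return None

    def delete(self, msg_id):
        """Called by the consumer after the task succeeded."""
        self.messages.pop(msg_id, None)
```

A consumer that crashes never calls `delete`, so the message reappears after the timeout and is retried; after `max_receives` deliveries it lands in `dead_letters` instead.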


Task Router

● Stateless Marathon worker
● Routes tasks to clusters
● Custom routing logic
● At-least once delivery
● ‘DNS’ of Taskerman
● Pluggable discovery
  ○ AWS
  ○ Smartstack

PaaSTA
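
One way to picture "pluggable discovery" is a table of backend functions keyed by the task's `discovery` field, with the router fanning the task out to one queue per discovered node. The sketch below assumes that design; `route`, the `static_demo` backend, and its inventory are hypothetical stand-ins for real backends such as AWS tags or Smartstack.

```python
from collections import defaultdict


def static_demo(cluster_name):
    # Hypothetical backend with a hard-coded inventory, for illustration only.
    inventory = {"cassandra:geo_counter": ["node-a", "node-b", "node-c"]}
    return inventory.get(cluster_name, [])


def route(task, backends, node_queues):
    """Append the task to the queue of every target node and return the targets.

    An explicit, non-empty task["nodes"] list skips discovery.
    """
    nodes = task.get("nodes") or backends[task["discovery"]](task["cluster_name"])
    for node in nodes:
        node_queues[node].append(task)
    return list(nodes)


backends = {"static_demo": static_demo}
queues = defaultdict(list)
demo_task = {"cluster_name": "cassandra:geo_counter", "discovery": "static_demo", "nodes": []}
targets = route(demo_task, backends, queues)
```

Because the router keeps no state of its own in this sketch, running it twice simply enqueues the task twice, which is why at-least-once delivery pairs well with idempotent tasks.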


TaskRunner

● The executor of Taskerman
● Dequeues a task and executes it
  ○ Pre-defined, reviewed code
● Cron-ed on node
● Zookeeper for coordination
● Task deleted upon success
● Dead letter queue upon failed retries

# TaskRunner is the Taskerman base class (not shown on the slide).
class TestTaskRunner(TaskRunner):
    def __init__(self, task, *args, **kwargs):  # further arguments are elided on the slide
        """State management and datastore-specific setup."""

    def pre_check(self):
        """Is the task safe to execute on this cluster?"""

    def execute_action(self):
        """Actual execution of task:action."""

    def post_check(self):
        """Is the cluster good after execution, or is it on fire?"""
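
The three hooks suggest a template-method flow: check, act, then check again. The sketch below shows one plausible driver; `SketchTaskRunner`, its `run` method, and the returned status strings are assumptions for illustration, not the real Taskerman base class.

```python
class SketchTaskRunner:
    """Hypothetical base class that runs the three hooks in order."""

    def __init__(self, task):
        self.task = task

    def pre_check(self):
        return True

    def execute_action(self):
        raise NotImplementedError

    def post_check(self):
        return True

    def run(self):
        if not self.pre_check():
            return "skipped"  # unsafe to start, nothing was changed
        self.execute_action()
        if not self.post_check():
            return "failed_post_check"  # action ran but the cluster looks unhealthy
        return "succeeded"


class RecordingRunner(SketchTaskRunner):
    """Demo subclass that records which hooks ran."""

    def __init__(self, task, healthy_before=True):
        super().__init__(task)
        self.healthy_before = healthy_before
        self.calls = []

    def pre_check(self):
        self.calls.append("pre_check")
        return self.healthy_before

    def execute_action(self):
        self.calls.append("execute_action")

    def post_check(self):
        self.calls.append("post_check")
        return True
```

A failed `pre_check` returns before `execute_action` is ever called, so an unhealthy cluster is left untouched.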


Zookeeper

● Distributed Coordinator
● Non Blocking Lease
  ○ Time-based lease
  ○ Global lease
● Ephemeral locks
● Atomic Counters
  ○ Statistics
  ○ Circuit breaker
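
A non-blocking, time-based lease can be sketched without Zookeeper by keeping expiry times in a dictionary. The model below is an in-memory stand-in, not the Zookeeper or any client-library API; `LeaseTable`, `max_holders`, and `ttl` are hypothetical names, and the clock is passed in explicitly.

```python
class LeaseTable:
    """Toy lease registry: at most `max_holders` nodes hold a lease at once."""

    def __init__(self, max_holders, ttl):
        self.max_holders = max_holders
        self.ttl = ttl
        self.holders = {}  # node name -> expiry time

    def try_acquire(self, node, now):
        """Return True if `node` holds the lease after this call; never blocks."""
        # Expired leases are dropped first, so a dead node cannot hold one forever.
        self.holders = {n: exp for n, exp in self.holders.items() if exp > now}
        if node in self.holders or len(self.holders) < self.max_holders:
            self.holders[node] = now + self.ttl  # grant or renew
            return True
        return False

    def release(self, node):
        self.holders.pop(node, None)
```

A node that fails to get the lease just returns and tries again on its next run, which fits a runner that is invoked periodically.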

Zookeeper: Challenges

● Staleness
  ○ Nodes can go down
● Garbage collection
  ○ Cleanup of ZK data structures
● Composition
● Starvation
● Uptime

Deployment

● Puppet
● Terraform
● Yelpsoa-configs
● PaaSTA
● Jenkins
● AWS Lambda

PaaSTA

Failure handling

● Multiple vectors of failure
● Idempotency
● Pessimistic approach
  ○ Job retry
● Separation of state
● Mutability
● Highly available components
● Circuit breakers
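
A circuit breaker stops issuing work once failures pile up, which limits cascades. The sketch below counts consecutive failures in memory; the class names and the reset-on-success policy are assumptions, and a shared counter (for example in Zookeeper) would be needed for a breaker that spans nodes.

```python
class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to run a task."""


class CircuitBreaker:
    def __init__(self, failure_threshold):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def call(self, fn, *args, **kwargs):
        """Run fn unless the breaker is open; track consecutive failures."""
        if self.is_open:
            raise CircuitOpenError("too many consecutive failures, refusing to run")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0  # any success closes the gap again
        return result

    def reset(self):
        """Manual reset, e.g. after an operator has investigated."""
        self.consecutive_failures = 0
```

Once open, the breaker stays open until `reset` is called, so a human decides when automated work may resume.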


Debugging

Failure detection

● Heartbeat ping
  ○ End-to-end monitoring
● Dead Letter Queue
  ○ Recycle bin of failed tasks
  ○ Hooks into human side of monitoring
● Status check

Monitoring

● End-to-end logging
  ○ Un/structured
● Metrics
  ○ Counters
  ○ Queue lengths
● Aggregation and dashboards
● Staleness checks
● Dead Letter Queue
● Multi-modal Alerting
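
One reading of "Staleness checks" is an alert on any source whose last successful heartbeat is older than a threshold. A minimal sketch under that assumption follows; `stale_sources` and its arguments are hypothetical names.

```python
def stale_sources(last_heartbeat, now, max_age):
    """Return the names whose last heartbeat is older than `max_age` seconds.

    last_heartbeat: dict mapping a source name to the time of its last heartbeat
    now:            current time, in the same units
    """
    return sorted(
        name for name, seen_at in last_heartbeat.items() if now - seen_at > max_age
    )
```

The result could feed the alerting path directly, since an empty list means every source reported recently.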

Use cases

● Restarts
● Reboots
● Instance Replacement
● Integration tests
● Kafka config reload
● Failure injection
● Backup and restore
● Search indexing
● … and many more.

Scheduled Backups

● Safety
● Cassandra
● Elasticsearch
● Common issues
● Constraints
  ○ Limit
  ○ Healthcheck
  ○ Mutual exclusion

Secure Infrastructure

$ uptime
 06:52:54 up 99 days, 19:20, 1 user, load average: 0.02, 0.03, 0.07

$ ps -eo pid,cmd,lstart | grep ..
10058 zookeeper Tue Dec 5 05:23:43 2017


www.yelp.com/careers/

We're Hiring!


@YelpEngineering

fb.com/YelpEngineers

engineeringblog.yelp.com

github.com/yelp


Q & A

● Slides will also be uploaded to slideshare.net/slidunder.

Q & A

❖ Q: What challenges remain with Taskerman?
  ➢ A:
❖ Q: …
  ➢ A: …


Further Reading

● https://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html
● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/
● https://martinfowler.com/bliki/TwoHardThings.html
● https://zookeeper.apache.org/
● https://www.terraform.io/
● https://github.com/Yelp/service-principles
● https://en.wikipedia.org/wiki/Law_of_Demeter