How Slack Works Keith Adams [email protected] @keithmadams facebook.com/kma
How Slack Works - QCon San Francisco...A: It sort of is, but it also works well. Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get

Jul 14, 2020

How Slack Works
Keith Adams

[email protected] @keithmadams facebook.com/kma

What is Slack?

What is Slack?

Voice Calls! Platform! Something about Bots!!

But first it was a Persistent Group Messaging Service

In this talk

● How Slack works today
  ➞ Application logic
  ➞ Persistence
  ➞ Real-time messaging
  ➞ Deferring work for later
● Problems
● What we’re doing about them

Also in this talk

● Flaws
● Challenges
● Mistakes
● Dead-ends
● Future directions

Slack Scale

● 4M DAU, 5.8M WAU
  Peak simultaneous connected: 2.5M
● > 2H / weekday for each active user
  > 10H / weekday connected
● Half of DAU outside US

Slack House Style

● Conservative technical taste
  Most supporting technologies are >10 years old
● Willing to write a little code
  Choose low coupling, fitness-to-purpose over DRY
● Minimalism
  Choose something we already operate over something new and tailor-made
  Shallow, transparent stack of abstractions

Cartoon Architecture of Slack

[Diagram: WebApp, Message Server, Job Queue, MySQL]

Case Study: Login and Receive Messages

slack.com

POST /api/rtm.start?token=xoxo--&...

Slack’s webapp codebase

● PHP monolith of app logic
  <1M LoC
● Scaled-out LAMP stack app
  Memcache wrapped around sharded MySQL
● Recently migrated to HHVM
  Performance, hacklang

World’s shortest PHP-at-Slack FAQ

● Q: I hear/believe/have experienced PHP to be terrible.
  A: It sort of is, but it also works well.
● Q: I’m skeptical.
  A: You’re in good company! Check out this blog post. But we should probably get on with the talk at hand ...
● Q: Sounds good.
  A: Right-o.

Login and Receive Messages: the “mains”

[Diagram: slack.com, main0, main1]

SELECT db_shard FROM teams WHERE domain = %domain

Login and Receive Messages: the shards

[Diagram: slack.com, the mains (main0, main1), and a shard pair (Shard123a, Shard123b)]

SELECT * FROM channels WHERE team_id = 711 ...

MySQL Shards

● Source of truth for most customer data
  Teams, users, channels, messages, comments, emoji, ...
● Replication across two DCs
  Available for 1-DC failure
● Sharded by team
  For performance, fault isolation, and scalability
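
The two queries shown on the preceding slides (find the team's shard on the mains, then query that shard) can be sketched as below. This is only an illustration under stated assumptions: in-memory SQLite databases stand in for the MySQL mains and shards, and the team domains, shard number 124, and the helper name `channels_for_domain` are hypothetical.

```python
import sqlite3

# Stand-in for the "mains": maps a team domain to its shard (assumed schema).
mains = sqlite3.connect(":memory:")
mains.execute("CREATE TABLE teams (id INTEGER, domain TEXT, db_shard INTEGER)")
mains.executemany(
    "INSERT INTO teams VALUES (?, ?, ?)",
    [(711, "acme", 123), (712, "globex", 124)],
)

# Stand-ins for two team shards; each holds only its own teams' rows.
shards = {}
for shard_id in (123, 124):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE channels (team_id INTEGER, name TEXT)")
    shards[shard_id] = conn
shards[123].executemany(
    "INSERT INTO channels VALUES (?, ?)", [(711, "general"), (711, "random")]
)
shards[124].execute("INSERT INTO channels VALUES (?, ?)", (712, "general"))


def channels_for_domain(domain: str) -> list[str]:
    """Step 1: ask the mains for the shard. Step 2: query that shard only."""
    row = mains.execute(
        "SELECT id, db_shard FROM teams WHERE domain = ?", (domain,)
    ).fetchone()
    if row is None:
        raise KeyError(domain)
    team_id, shard_id = row
    rows = shards[shard_id].execute(
        "SELECT name FROM channels WHERE team_id = ? ORDER BY name", (team_id,)
    ).fetchall()
    return [name for (name,) in rows]
```

Every request for a team touches the mains once and then exactly one shard, which is why a problem on one shard stays isolated to the teams stored there.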

Why MySQL?

● Many, many thousands of server-years of working
● The relational model is a good discipline
● Experience
● Tooling

Not because of ACID, though

Master-Master Replication

[Diagram: www1, www17, Shard123a, Shard123b]

MMR Complications

● Choosing A in CAP terms
● Conflicts are possible
  ➞ Most resolved automatically
  ➞ Some manually, by operator action(!)
● INSERT ON DUPLICATE KEY UPDATE …
● Partitioning by team saves us
  ➞ Team writes cannot overlap
  ➞ Even teams use “left” head, odd teams use “right” head
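
The even/odd rule can be sketched as a small routing function. The function name, the `healthy` parameter, and the fallback to the partner head are assumptions added for illustration; the slide only states which head each team prefers.

```python
def head_for_team(team_id: int, healthy=("left", "right")) -> str:
    """Pick the replication head a team's writes go to.

    Even team IDs prefer the "left" head, odd team IDs the "right" head,
    so one team's writes normally land on a single head. Falling back to
    the partner when the preferred head is unhealthy is an assumption.
    """
    preferred = "left" if team_id % 2 == 0 else "right"
    if preferred in healthy:
        return preferred
    partner = "right" if preferred == "left" else "left"
    if partner in healthy:
        return partner
    raise RuntimeError("no healthy head available for this shard")
```

Because all writes for one team normally go to one head, two heads rarely accept conflicting writes for the same rows, which keeps most conflicts out of the replication stream in the first place.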

Case Study: Login and Receive Messages

slack.com

{ "ok": true, "url": "wss:\/\/ms9.slack-msgs.com\/websocket\/7I5yBpcvk", … }

Rtm.start payload

● Rtm.start returns an image of the whole team
● Architecture of clients
  ➞ Eventually consistent snapshot of whole team
  ➞ Updates trickle in through the web socket
● Guarantees responsive clients
● ...once connection is established
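
A rough sketch of that client model: start from a snapshot, then fold in updates as they arrive. The snapshot fields and the event types (`channel_created`, `channel_renamed`) are hypothetical shapes chosen for the example, not the actual payload format.

```python
import json

# Hypothetical, heavily trimmed snapshot; a real payload would be far larger.
snapshot = json.loads("""
{"ok": true,
 "channels": {"C1": {"id": "C1", "name": "general"}},
 "users": {"U1": {"id": "U1", "name": "alice"}}}
""")


def apply_event(state: dict, event: dict) -> None:
    """Fold one update into the local copy of the team state."""
    kind = event.get("type")
    if kind == "channel_created":
        channel = event["channel"]
        state["channels"][channel["id"]] = channel
    elif kind == "channel_renamed":
        state["channels"][event["channel_id"]]["name"] = event["name"]
    # Unknown event types are ignored so older clients keep working.


# Updates as they might trickle in over the web socket (assumed shapes).
incoming = [
    '{"type": "channel_created", "channel": {"id": "C2", "name": "random"}}',
    '{"type": "channel_renamed", "channel_id": "C1", "name": "main"}',
    '{"type": "something_new"}',
]
for raw in incoming:
    apply_event(snapshot, json.loads(raw))
```

Once the snapshot is loaded, every read is answered from local state, which is where the responsiveness comes from; the cost is paid up front in the size of the snapshot.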

Cartoon Architecture of Slack

[Diagram: WebApp, Message Server, Job Queue, MySQL]

Message Delivery

Persist, broadcast messages

[Diagram: Message Server, WebApp]

Wrinkles in Message Server

● Race between rtm.start and connection to MS
  ➞ Event log mechanism
● Glitches, delays, net partitions while persisting
  ➞ In-memory queue of pending sends
  ➞ Queue depth sensitive barometer of system health
● Most messages are presence
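
The in-memory queue of pending sends might look roughly like this. The class name, the `persist` callback, and the use of `ConnectionError` to signal a failed write are all assumptions made for the sketch.

```python
from collections import deque


class PendingSends:
    """Hold messages in memory until they have been persisted, in order."""

    def __init__(self, persist):
        self._persist = persist  # callable that raises ConnectionError on failure
        self._queue = deque()

    def send(self, message) -> None:
        self._queue.append(message)
        self.flush()

    def flush(self) -> None:
        # Stop at the first failure so ordering is preserved for the retry.
        while self._queue:
            try:
                self._persist(self._queue[0])
            except ConnectionError:
                break
            self._queue.popleft()

    @property
    def depth(self) -> int:
        """Number of messages still waiting; a natural health metric."""
        return len(self._queue)


# Simulated outage: the store is unreachable, then recovers.
store = {"up": False}
persisted = []


def persist(message):
    if not store["up"]:
        raise ConnectionError("store unreachable")
    persisted.append(message)


pending = PendingSends(persist)
pending.send("hello")
pending.send("world")
depth_during_outage = pending.depth
store["up"] = True
pending.flush()
```

A depth that stays near zero means persistence is keeping up; a depth that climbs is an early sign that something downstream is slow or unreachable.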

Deferring Work

● Link unfurling
● Search indexing
● Exports/Imports

[Diagram: WebApp, Job Queue (Redis), Job Workers]
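
The enqueue-now, work-later pattern can be sketched with the standard library. Here `queue.Queue` and a thread stand in for the Redis-backed job queue and the job workers; the job types, payload fields, and handler names are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()  # stand-in for the Redis-backed job queue
results = []

# Hypothetical handlers for two of the deferred tasks named on the slide.
HANDLERS = {
    "unfurl_link": lambda payload: f"unfurled {payload['url']}",
    "index_message": lambda payload: f"indexed message {payload['message_id']}",
}


def enqueue(job_type: str, payload: dict) -> None:
    """Called from the request path: record the work and return immediately."""
    jobs.put({"type": job_type, "payload": payload})


def worker() -> None:
    """Runs outside the request path and drains the queue."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut the worker down
            jobs.task_done()
            break
        try:
            results.append(HANDLERS[job["type"]](job["payload"]))
        finally:
            jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

enqueue("unfurl_link", {"url": "https://example.com"})
enqueue("index_message", {"message_id": 42})
jobs.put(None)
jobs.join()  # wait until every queued job has been processed
```

The request that triggers the work only pays for the enqueue; slow tasks such as fetching a link preview happen later, on the workers.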

Putting it all together

[Diagram: WebApp, Message Server, mains, shards]

Things missing from the cartoon

● Memcache wrapped around many DB accesses
  ➞ Case-by-case
  ➞ Manual
● Computed data service (CDS)
  ➞ Provides ML models via Thrift interface
● Rate-limiting around critical services
● Search!
  ➞ Solr
  ➞ Team-partitioned
  ➞ Fed from job queue workers
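
"Case-by-case" and "manual" caching usually means each call site decides what to cache and when to invalidate it. A minimal read-through sketch follows, with a plain dict standing in for memcache; the class name, key format, and loader function are assumptions.

```python
class CachedChannels:
    """Hand-written read-through cache around one specific DB access."""

    def __init__(self, load_from_db):
        self._load = load_from_db
        self._cache = {}  # stand-in for memcache

    def get(self, team_id: int):
        key = f"channels:{team_id}"
        if key not in self._cache:
            self._cache[key] = self._load(team_id)  # miss: go to the database
        return self._cache[key]

    def invalidate(self, team_id: int) -> None:
        """Must be called by hand wherever the underlying rows change."""
        self._cache.pop(f"channels:{team_id}", None)


db_calls = []


def load_channels(team_id):
    db_calls.append(team_id)
    return ["general", "random"]


cache = CachedChannels(load_channels)
first = cache.get(711)
second = cache.get(711)  # served from the cache, no database call
cache.invalidate(711)
third = cache.get(711)   # reloaded after invalidation
```

The trade-off of the manual approach is visible in `invalidate`: nothing enforces that every write path remembers to call it.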

Slack Today: The Good Parts

● Team-partitioning
  ➞ Easy scaling to lots of teams
  ➞ Isolates failures and perf problems
  ➞ Makes customer complaints easy to field
  ➞ Natural fit for a paid product
● Per-team Message Server
  ➞ Low-latency broadcasts

Some Hard Cases

Hard scenarios

● Mains failures
● Rtm.start on large teams
● Mass reconnects

Mains failure

● 1 master fails, partner takes over
● If both fail?
  ➞ Many users can proceed via memcache
  ➞ For the rest Slack is down
  ➞ Quite possible if failure was load-induced

Rtm.start for large teams

● Returns image of entire team

● Channel membership is O(n²) for n users
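
One way to see the quadratic growth, under loudly stated assumptions: if the number of channels grows in proportion to the number of users, and each channel holds a fixed fraction of the team, then the count of (channel, member) pairs grows with n². The ratios below are invented for illustration only.

```python
def membership_entries(n_users: int, channels_per_user: int = 1,
                       join_percent: int = 50) -> int:
    """Count (channel, member) pairs under two assumed ratios.

    Assumption 1: the team has `channels_per_user` channels per user.
    Assumption 2: each channel contains `join_percent` percent of the team.
    """
    channels = n_users * channels_per_user
    members_per_channel = n_users * join_percent // 100
    return channels * members_per_channel


small = membership_entries(1_000)   # 1,000 channels x 500 members
large = membership_entries(2_000)   # 2,000 channels x 1,000 members
growth = large // small             # doubling the team quadruples the entries
```

Since the snapshot carries membership for every channel, its size follows the same curve.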

Mass reconnects

● A large team loses, then regains, office Internet connectivity

● n users perform O(n²) rtm.start operations

● Can ‘melt’ the team shard

What are we going to do about it?

Scale-out mains

● Replace mains SPOF
● With what? We’re not sure yet
● Kicking the tires carefully on a scary change

Rtm.start for large teams

● Incremental work
  ➞ Current p95, p99: 221ms, 660ms
● Core problem: channel membership is O(n²)
● Change APIs so clients can load channel members lazily
● Much harder than it sounds!

Mass reconnects

● Introducing flannel

● Application-level edge cache

Pre-Flannel

[Diagram: Message Delivery, Message Server, WebApp]

Message Server

Flannel status

● On for a few teams

● Rolling out to you soon with any luck

Phew

Stuff I had to leave out

● Lots of client tech!
● Voice
● Backups
● Data warehouse
● Search
● Deploying code
● Monitoring and alerting

Wrapping up

● Sketch of how Slack works
  ➞ Application Logic
  ➞ Persistence
  ➞ Real-time messaging
  ➞ Asynchronous Work
● Problems
● What we’re doing about them

There is a lot left to do
slack.com/jobs

...

Deployable Message Server

● Channel-sharded message bus

● Flannel discovers Channel servers via Consul
  ➞ Scatters user writes
  ➞ Gathers channel reads

● Failures do not need reconnects
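
A very rough sketch of what "channel-sharded" with scattered writes could mean. Everything concrete here is an assumption: the server names, the static server list (standing in for discovery via Consul), and the CRC32-modulo placement rule are illustrative choices, not the design described in the talk.

```python
import zlib

# Hypothetical server list; in the talk these are discovered via Consul.
CHANNEL_SERVERS = ["cs-0", "cs-1", "cs-2"]


def server_for_channel(channel_id: str, servers=CHANNEL_SERVERS) -> str:
    """Map a channel to one server (assumed rule: CRC32 modulo server count)."""
    return servers[zlib.crc32(channel_id.encode("utf-8")) % len(servers)]


def scatter(writes, servers=CHANNEL_SERVERS) -> dict:
    """Group one user's (channel_id, message) writes by owning server."""
    routed = {}
    for channel_id, message in writes:
        owner = server_for_channel(channel_id, servers)
        routed.setdefault(owner, []).append((channel_id, message))
    return routed


user_writes = [("C1", "hi"), ("C2", "lunch?"), ("C1", "back in 5")]
routed = scatter(user_writes)
```

Under this rule every message for a given channel lands on the same server, so a reader of that channel only needs to gather from one place.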