Adam Hitchcock @NorthIsUp Making DISQUS Realtime
Jul 16, 2015
what is DISQUS?
why do realtime?
๏ getting new data to the user asap
๏ increased engagement
๏ looks awesome
๏ we can sell it
how many of you currently have a realtime component?
realtime
๏ polls memcache
๏ is kinda #failscale
DISQUS sees a lot of traffic
[traffic graph: Google Analytics, May 29 2012 - June 28 2012]
realertime
๏ currently active on all DISQUS 2012 sites
๏ tested ‘dark’ on ~50% of our network
๏ 1.5 million concurrently connected users
๏ 45 thousand new connections per second
๏ 165 thousand messages/second
๏ ~0.2 seconds latency end to end
so, how did we do it?
technology
๏ node.js and mongodb for webscale
technology
๏ just kidding :) we used python
technology
๏ gevent
๏ gunicorn
๏ flask
๏ thoonk (a queue built on redis)
๏ redis
๏ nginx
๏ haproxy
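As a rough sketch of how these pieces fit together (the flags and module path here are illustrative, not DISQUS’s actual deployment): gunicorn loads the Flask WSGI app and runs it on gevent workers, so each long-held connection is a cheap greenlet rather than a thread.

```shell
# hypothetical invocation: Flask app served by gunicorn on gevent workers
gunicorn --worker-class gevent --worker-connections 1000 \
         --workers 4 --bind 0.0.0.0:8080 app:app
```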
architecture overview
[architecture diagram: incoming HTTP requests hit nginx + haproxy in front of the “Frontend” (Gunicorn and Flask), whose Listener and Sub Pool fan messages out to Requests; new posts flow from django into a thoonk queue, and the “Backend” (a gevent server) with its Listener, Formatter, Multiplexer, and Publisher pushes them back out over redis pub/sub to the frontend]
the backend
๏ listens to a Thoonk queue
๏ cleans & formats message
  ๏ this is the final format before http publish
  ๏ compress data now
๏ publish message to pubsub
  ๏ forum:id, thread:id, user:id, post:id
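The steps above can be sketched roughly like this. The field names and helper functions are hypothetical; only the channel scheme (forum:id, thread:id, user:id, post:id) and the “format and compress once, on the backend” idea come from the slides.

```python
import json
import zlib

def format_post(raw_post):
    """Do the expensive work once, on the backend: build the final wire
    format and compress it here, so the frontend can fan the payload out
    to thousands of subscribers without touching it again.
    (Field names are illustrative; the real format isn't in the talk.)"""
    message = json.dumps({
        'id': raw_post['id'],
        'thread': raw_post['thread'],
        'forum': raw_post['forum'],
        'author': raw_post['author'],
        'body': raw_post['body'],
    })
    return zlib.compress(message.encode('utf-8'))

def channels_for(raw_post):
    """One pub/sub channel per way a client might be subscribed."""
    return [
        'forum:%s' % raw_post['forum'],
        'thread:%s' % raw_post['thread'],
        'user:%s' % raw_post['author'],
        'post:%s' % raw_post['id'],
    ]

def publish(redis_client, raw_post):
    # one formatted payload, published to every relevant channel
    payload = format_post(raw_post)
    for channel in channels_for(raw_post):
        redis_client.publish(channel, payload)
```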
[diagram detail: Formatter, Multiplexer, Publisher]
the backend
๏ average processing time is ~0.2 seconds
๏ queue maintenance
  ๏ ACK timeouts (5 seconds-ish)
random redis lessons
๏ separate pub/sub and non pub/sub redis usage by physical node
๏ transactions can be prickly
the backend
# redis key for the 'claimed' zset
claimed = thoonk_worker.feed_claimed

# what jobs to re-queue
too_late = int((time() - MAX_AGE) * 1000)

# get and cancel jobs
job_ids = redis.zrange(claimed, 0, too_late)
if len(job_ids):
    for job_id in job_ids:
        thoonk_worker.cancel(job_id)
gevent is nice
# the code is too big to show here, so just import it
# http://bitly.com/geventspawn
from realertime.lib.spawn import Watchdog
from realertime.lib.spawn import TimeSensitiveBackoff
the frontend
๏ needs to be fast!
๏ pools redis connections
๏ routes messages from pubsub to http
the frontend
๏ new request!
๏ create/register a subscription with the pool
๏ sub pool returns a (python) queue based on the channel
[diagram detail: Listener, Sub Pool, Requests]
the frontend
๏ Listener receives message on a pubsub channel
๏ If that channel has a subscriber, pass it on
๏ subscriber then passes message on to all appropriate requests
[diagram detail: Listener, Sub Pool, Requests]
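The subscription pool described on the last two slides can be sketched like this. The class and method names are assumptions, not DISQUS’s actual code, and plain `queue.Queue` stands in for the gevent queues the real system would use, to keep the sketch dependency-free.

```python
import queue
from collections import defaultdict

class SubPool(object):
    """Illustrative sketch of the frontend's subscription pool."""

    def __init__(self):
        # channel name -> set of per-request queues
        self.subscribers = defaultdict(set)

    def subscribe(self, channel):
        # a new HTTP request registers here, then blocks on its queue
        q = queue.Queue()
        self.subscribers[channel].add(q)
        return q

    def unsubscribe(self, channel, q):
        self.subscribers[channel].discard(q)

    def dispatch(self, channel, data):
        # called by the Listener for each pub/sub message: if the channel
        # has subscribers, pass the message on to every waiting request
        for q in self.subscribers.get(channel, ()):
            q.put(data)
```

In the real system `dispatch` would run in the listener greenlet, and each long-polling request would drain its own queue into the HTTP response.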
long polling-ish
๏ long held http connection ๏ stream JSON over this http connection
long polling-ish

def __subscription_generator(self, q):
    # Returns a generator for the WSGI response
    try:
        to = Timeout(self.timeout_duration)
        to.start()
        while True:
            queue_data = q.get()
            # one per line
            yield queue_data['data'] + '\n'
    except Timeout as t:
        if t is to:
            pass
        else:
            raise t
    finally:
        self.unsubscribe(q)
pooling redis pub/sub
# old way was pretty failscale
def subscribe(redis, channel):
    pubsub = redis.pubsub()
    pubsub.subscribe(channel)
    with Timeout(30):
        while True:
            yield pubsub.listen()
pooling redis pub/sub
pipe = Queue()
pipe.put(('subscribe', 'thread:12345'))
pipe.put(('unsubscribe', 'forum:cnn'))

... elsewhere ...

# new way is
def listener(pubsub, pipe):
    for data in pubsub.listen():
        # handle data here...

        # handle new subscriptions
        if not pipe.empty():
            action, channel = pipe.get_nowait()
            getattr(pubsub, action)(channel)
timeouts?
๏ needless reclaiming of ‘resources’
๏ maximize usage of cheap things
  ๏ connection count
๏ minimize expensive things
  ๏ requests per second
test, measure, repeat
testing
๏ Darktime
  ๏ use existing network to loadtest
  ๏ (user complaints when it didn’t work...)
๏ Darkesttime
  ๏ load testing a single thread
๏ have knobs you can twiddle
stats
๏ measure all the things!
  ๏ especially when the numbers don’t line up
  ๏ is hard in distributed systems
๏ try to express things as +1 and -1 if you can
๏ i used scales from greplin (“metrics for py”)
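The “+1 and -1” idea can be shown with a tiny stand-in for a scales-style stat (the talk used Greplin’s `scales`; this class is just an illustration): if every connection open is +1 and every close is -1, the counter’s current value is always the live connection count, with no separate bookkeeping to drift out of sync.

```python
import threading

class Counter(object):
    """Minimal stand-in for a scales-style stat; illustrative only."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def incr(self, delta=1):
        # express every state change as +1 or -1
        with self._lock:
            self._value += delta

    @property
    def value(self):
        return self._value

connections = Counter()
connections.incr(+1)  # request opened
connections.incr(+1)  # another request opened
connections.incr(-1)  # one request closed
# connections.value is now the number of live connections
```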
lessons
๏ do hard work early
๏ defer work that you might never need
๏ end-to-end acks are good, but expensive
๏ timeouts are not free
๏ greenlets are effectively free
๏ pubsub is effectively free
nginx lessons
location / {
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_redirect off;

    # this line is really important
    proxy_buffering off;

    if (!-f $request_filename) {
        proxy_pass http://app_server;
        break;
    }
}
http://gunicorn.org/deploy.html
slide full o’ links
๏ Gevent (python coroutines and greenlets)
  http://gevent.org/
๏ Gunicorn (python pre-fork WSGI server)
  http://gunicorn.org/
๏ Thoonk (redis queue)
  https://github.com/andyet/thoonk.py
๏ Sentry (log aggregation)
  https://github.com/dcramer/sentry
๏ Scales (in-app metrics)
  https://github.com/Greplin/scales

code.disqus.com
special thanks
๏ the team at DISQUS
๏ especially our dev-ops guys
๏ and jeff who had to review all my code
open questions
๏ best system config for thousands of rps?
๏ how to make the front end faster?
  ๏ something faster than pywsgi? FapWS?
  ๏ libevent -> libev? (i.e. gevent 1.0)
  ๏ dump wsgi for raw sockets? (last resort)
๏ best internal python pub/sub option?
DISQUSsion?
psst, we’re hiring
disqus.com/jobs