Basics of scale and availability
High Scalability
Who am I?
• Jonathan Keebler @keebler keebler.net
• Built video player for all CTV properties
• Worked on news sites like CP24, CTV, TSN
• CTO, Founder of ScribbleLive
• Bootstrapped a high-scalability startup
  – Credit card limit wasn't that high; had to find cheap ways to handle the load of top-tier news sites
Sample load test
• 17 x Windows Server 2008, 2 x Varnish, 4 x nginx, 1 x SQL Server 2008
Scalability vs Availability
• Often talked about separately
• Can't have one without the other
• Let's talk about the basic building blocks
Building blocks
• Content Distribution Network (CDN)
• Load-balancer
• Reverse proxy
• Caching server
• Origin server
Basic hosting structure
• CDN: Akamai, CloudFront, EdgeCast
• Load-balancer: Amazon ELB, F5, HAProxy
• Reverse proxy: nginx
• Caching server: Varnish, Squid, aiCache
• Origin server: LAMP, ASP.NET, node.js
• + Monitoring on every layer
Monitor or die
• If you aren't monitoring your stack, you have NO IDEA what's going on
• Pingdom/WatchMouse/Gomez not enough
  – Don't help you when you're trying to figure out what's going wrong
  – You need actionable metrics
Monitor or die
• Outside monitoring, e.g. Pingdom, Gomez
  – DNS problems, localized problems, SLA
• Inside monitoring, e.g. New Relic, CloudWatch, Server Density
  – High latency, CPU spikes, memory crunch, peek-a-boo servers, rogue processes, SQL queries per second, SQL wait time, SQL locks, disk usage, disk IO performance, page file usage, network traffic, requests per second, active connections, timeouts, sleeping sockets, ...
New Relic
• Dashboard
Alerting
• Don't send alerts to your email
  – Try to work with notifications coming in every second
• PagerDuty
• Don't overdo it = alert fatigue
Basic hosting structure
• Now back to our servers...
Load-balancers
• Bandwidth limits on dedicated boxes are harder to work around
• F5s are great boxes, but have lousy live reporting = you can get into trouble quickly
• Adding/removing servers sucks
• DNS load-balancing sucks for everyone
nginx
• Fantastic at handling a massive number of requests (low CPU, low memory)
• Easy to configure and change on-the-fly
• Gzip, modify headers, host names
• Proxy with error intercept
• No query string or IF-statement* support
Varnish
• Caching server, but so much more
• Fantastic at handling a massive number of requests (low CPU, low memory)
• Easy to configure and change on-the-fly
• Protects your origin servers
• Deals with errors from origin servers
Origin servers
• Whatever tweaks you make will never help enough
  – e.g. if your disk IO is becoming a problem, it's already too late to save you
• Keep them stock so you don't blow your mind; easier to deploy
• Handle any query string hacking in Varnish
Databases
• No silver bullet
• Two options:
  – Shard (split your data between servers)
  – Cluster (many boxes working together as one)
• Shards commonly used today
  – Lots of work at the code level, no incremental IDs
• Clusters have a single point of failure
  – Try upgrading one and tell me they don't
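The deck doesn't show what "lots of work at the code level" looks like; a minimal sketch of hash-based shard routing (the shard names are hypothetical). Hashing the key gives a stable mapping, and it's also why incremental IDs don't work: no single server owns the ID sequence.

```python
import hashlib

SHARDS = ["db-0", "db-1", "db-2", "db-3"]  # hypothetical shard names

def shard_for(user_id: str) -> str:
    """Pick a shard by hashing the key, so the same key always
    routes to the same server."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Note that every query touching this data now has to go through a router like this, and resharding (changing `SHARDS`) moves keys around, which is part of the work the slide alludes to.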
Discussion
• What stack do you use?
• What database do you use?
• SQL vs NoSQL
Content Distribution Networks
Basics
• Worldwide network of DNS load-balanced reverse proxies
• Not magic
• Can achieve 99% offload if you do it right
• Have to understand your requests
Market leaders
• Akamai: market leader, $$$, most options, yearly contracts, pay for GB + request headers
• CloudFront: built on AWS, cheaper, pay-as-you-go, fewer features, new features coming quickly, GB + pay-per-request
• EdgeCast (pay-as-you-go through GoGrid), CloudFlare (optimizer, security, easy!)
Tiered distribution
• More points-of-presence (POPs) = less caching if your traffic is global
• Need to put a layer of servers between POPs and your origin servers
• Sophisticated setups throttle requests
  – if 100 come in at the same time, only 1 gets through
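The "100 come in, only 1 gets through" throttling is request coalescing; a minimal thread-based sketch (function names are my own, not from any particular product). The first caller for a URL becomes the leader and fetches from the origin; concurrent callers wait on its result instead of each hitting the origin.

```python
import threading

_lock = threading.Lock()
_inflight = {}   # url -> Event signalling the in-flight fetch is done
_results = {}    # url -> last fetched body

def fetch(url, origin_get):
    """Collapse concurrent requests for the same URL: one thread hits
    the origin, the rest wait for its result."""
    with _lock:
        event = _inflight.get(url)
        leader = event is None
        if leader:
            event = threading.Event()
            _inflight[url] = event
    if leader:
        try:
            _results[url] = origin_get(url)
        finally:
            with _lock:
                del _inflight[url]
            event.set()
    else:
        event.wait()
    return _results[url]
```

Varnish and some CDNs do this for you at the edge; the same pattern is worth having inside your own code too.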
Cache keys
• Need to have the same query string to get a cached result
• Some CDNs can ignore params
  – important if you need a random number on the query string to prevent browser caching
• Cool options: case sensitive/insensitive, cache differently based on cookie, headers
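What "understanding your requests" means for cache keys can be sketched as a normalizer: drop the params you've told the CDN to ignore and sort the rest so parameter order doesn't fragment the cache (the `IGNORED` names here are hypothetical examples).

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

IGNORED = {"rnd", "cachebuster"}  # hypothetical cache-busting params to drop

def cache_key(url: str) -> str:
    """Normalize a URL into a cache key: lowercase the path, drop
    ignored params, and sort the rest so ordering doesn't matter."""
    parts = urlsplit(url)
    params = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED)
    qs = urlencode(params)
    return parts.path.lower() + ("?" + qs if qs else "")
```

With this, `/News?b=2&a=1&rnd=123` and `/news?a=1&b=2` hit the same cached object instead of two.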
Invalidations suck
• Trying to get a CDN to drop its cache is hard
  – takes a long time to reach all POPs
  – triggers a thundering herd
  – takes out all caching for a bit
• Build the ability to change query strings at the code layer
  – e.g. add a version number to JS/CSS URLs; when you roll out, it breaks the cache
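The version-number trick from the slide can be sketched in a few lines (the version string and helper name are illustrative). Because the query string changes on each rollout, the CDN sees a brand-new cache key and fetches fresh assets with no invalidation needed.

```python
ASSET_VERSION = "2024.06.01"  # hypothetical: bumped on every rollout

def versioned(url: str) -> str:
    """Append the deploy version so each rollout yields fresh CDN cache
    keys instead of requiring an invalidation."""
    sep = "&" if "?" in url else "?"
    return f"{url}{sep}v={ASSET_VERSION}"
```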
How long to cache for?
• As long as you need, but no longer
• Make sure you think about the error case, i.e. what if an error gets cached
  – Some CDNs let you set your own rules for that
  – Remember, invalidations suck
Thundering herds
Thundering herds
• When you roll out or have high latency, all your timeouts align
  – Origins get slammed at a regular interval by POPs
• Random TTLs are your friend
  – Just +/- a few minutes can be a big help
  – TIP: break into C in Varnish
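The random-TTL idea is a one-liner; a sketch (function name mine). Jittering each object's expiry by a couple of minutes spreads re-fetches out instead of having every POP hit the origin at the same instant.

```python
import random

def jittered_ttl(base_seconds: int, jitter_seconds: int = 120) -> int:
    """Spread expirations by +/- jitter_seconds so cached objects
    don't all expire, and re-fetch from the origin, at the same time."""
    return base_seconds + random.randint(-jitter_seconds, jitter_seconds)
```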
Don't build your own*
• You will never be as smart as Akamai/Amazon
• You will never be able to bring on new servers fast enough to scale
• Spend your time building awesome software
• *Do build your own caching layer for the POPs (and, just in case, to protect your origin servers)
Discussion
• What CDN do you use?
• War stories
Caching in Code
Why do I need this?
• You can't cache every request
• You can't cache POST requests
• Protect the database!
• The longer you can go before you have to shard your database, the better
What is it?
• In-process, in-memory caching
• Static variables work great
  – TIP: .NET: static variables are scoped to the thread, WHY?!
• Custom memory stores
• Whatever you want, just not the disk
Isn't that what Memcached is for?
• Memcached is in-memory BUT so is your database
  – Advantages of Memcached over your database:
    • Cheaper to replicate
    • Fast lookups... if your db sucks
  – Disadvantages:
    • Still has network latency, higher than a db lookup (unless your db sucks)
    • IT'S NOT A DATABASE!
Getting started
• Think about your data + classes
• TTLs based on knowledge of your data
• Random TTLs (avoid the thundering herd)
• Use concurrent, thread-safe objects
• Wrap your code in try-catch
  – Caching isn't worth breaking your site for
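The bullets above combine into a small in-process cache; a sketch under those rules (names are mine): thread-safe access, jittered TTLs, and try/except around the cache so a cache failure degrades to a direct lookup rather than an error page.

```python
import random
import threading
import time

_cache = {}                # key -> (value, expires_at)
_cache_lock = threading.Lock()

def cache_get(key, loader, ttl=300, jitter=60):
    """In-process cache: thread-safe, jittered TTLs, and any cache
    failure falls through to a direct lookup."""
    now = time.time()
    try:
        with _cache_lock:
            entry = _cache.get(key)
        if entry and entry[1] > now:
            return entry[0]
    except Exception:
        pass  # a broken cache must never break the site
    value = loader()
    try:
        with _cache_lock:
            _cache[key] = (value, now + ttl + random.randint(-jitter, jitter))
    except Exception:
        pass
    return value
```

In .NET the equivalent would be a `ConcurrentDictionary` behind a static property; the shape is the same.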
Updating cache
• Use semaphores (that Comp Sci degree is finally going to come in handy)
• Semaphores should always unlock on their own
  – Your thread could die/timeout at any time; you don't want to lock forever
• Use a separate thread for the lookup. Why should one user suffer?
• Using a datetime semaphore is usually the best
  – keep a time when the next update will take place
  – 1st thread to hit that time immediately adds some seconds to the time, buying itself enough time to do the lookup
  – Any blocked thread gets cached data. DON'T LOCK
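The datetime-semaphore steps above can be sketched as a class (my own naming, a minimal interpretation of the slide). The "semaphore" is just the deadline: whoever is first past it pushes it forward and refreshes in a background thread, so it self-unlocks if that thread dies, and nobody else ever blocks.

```python
import threading
import time

class RefreshingCache:
    """Datetime-semaphore refresh: first thread past the deadline claims
    the refresh by moving the deadline forward, then reloads in a
    background thread; all other threads get the stale value."""

    def __init__(self, loader, ttl=60, grace=10):
        self.loader = loader
        self.ttl = ttl
        self.grace = grace          # seconds the refresher buys itself
        self.value = loader()       # populate once up front
        self.next_update = time.time() + ttl
        self._lock = threading.Lock()

    def get(self):
        now = time.time()
        claimed = False
        with self._lock:
            if now >= self.next_update:
                # Claim the refresh; if this thread dies, the pushed
                # deadline expires on its own and someone else claims it.
                self.next_update = now + self.grace
                claimed = True
        if claimed:
            threading.Thread(target=self._refresh, daemon=True).start()
        return self.value           # never block; stale data is fine

    def _refresh(self):
        self.value = self.loader()
        self.next_update = time.time() + self.ttl
```

No user request ever waits on the lookup; the only cost is serving slightly stale data during the refresh window.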
Populating cache for the first time
• How do you prevent a thundering herd before the cache is warm?
• OK, you may have to lock. But be smart about it.
• Are you sure your database can't handle it?
• This is where other caching layers help: CDN throttling, Varnish throttling, Memcached, read-only databases
Garbage collection
• Keep counters for metrics, e.g. how many hits to the cached object, datetime of last request for that object
• Every X something, run your garbage collection
  – Use semaphores
  – Don't get rid of the most-used objects
• You are going to collide with running code
  – try-catch is your friend
• Don't be afraid to dump the cache and start over
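A sketch of the bookkeeping and eviction described above (structure and names are mine): each entry carries hit and last-used counters, and collection evicts the least-hit, least-recently-used entries first, with the delete wrapped in try/except since running code may race with it.

```python
import time

cache = {}  # key -> {"value": ..., "hits": int, "last_used": float}

def touch(key, value=None):
    """Record a hit (and optionally store a value) for GC bookkeeping."""
    entry = cache.setdefault(key, {"value": value, "hits": 0, "last_used": 0.0})
    if value is not None:
        entry["value"] = value
    entry["hits"] += 1
    entry["last_used"] = time.time()
    return entry["value"]

def collect(max_size):
    """Evict least-hit, least-recently-used entries until under max_size."""
    if len(cache) <= max_size:
        return
    ranked = sorted(cache, key=lambda k: (cache[k]["hits"], cache[k]["last_used"]))
    for key in ranked[: len(cache) - max_size]:
        try:
            del cache[key]   # may collide with running code
        except KeyError:
            pass             # someone else removed it first; fine
```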
Watch out for references
• If you are storing something in a cache object, you can save a lot of memory by passing a reference to the object
• Don't forget about the reference
• Watch out for garbage collection trying to destroy it
• Updating the cache might mean updating an existing, shared object
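The reference pitfall is easy to demonstrate in Python (the slide's context is .NET, but the hazard is identical): handing out a reference to a cached object means any caller's mutation edits the cache for everyone, so hand out copies when callers are allowed to mutate.

```python
import copy

cache = {"user:1": {"name": "Ada", "plan": "free"}}  # illustrative data

profile = cache["user:1"]      # a reference, not a copy
profile["plan"] = "pro"        # this mutates the cached object too!
assert cache["user:1"]["plan"] == "pro"

# To hand out safe snapshots, copy on read:
snapshot = copy.deepcopy(cache["user:1"])
snapshot["plan"] = "free"
assert cache["user:1"]["plan"] == "pro"   # cache unaffected
```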
The curse
• More servers = more caches = less efficient
• Discipline: you can't throw more servers at the problem
Totally worth it!
[chart: Requests per minute to origin servers]

Totally worth it!
[chart: CPU of 1 x SQL Server 2008 database]
Discussion
• What do you use to cache at the code layer?
• War stories
Thank you!
• Jonathan Keebler
• [email protected]
• @keebler