Google SRE (Site Reliability Engineering) Concepts

Presenters: David Hixson, John Neil, Robert Spier, John Reese, Mikey Dickerson, Nori Heikkinen, Ryan Anderson

Designing Distributed Systems

Limiting Factors
- What limits growth? Resource constraints
- How to design to push past those limits
- Data latency

Failure Modes
- Predict, far into the future, the ways things will fail
- Hope is not a strategy
- Serve in spite of failures; how to serve & grow past failures
- What can you prevent before it starts?

10 Rules for Scale: Scaling Up Safely

Make Good Choices

Constraints
- Every part of a design has limits
- Aggregate capability is probably the minimum of the parts; all capacity above that value is wasted
- The smallest limit is the failure domain
  - Gas tank size: the car will run out of fuel first, so the tank is the failure domain

Understand the Whole Stack
- Components in the computer: disk I/O (IOPS)
- Components in the data center: network ports, rack uplinks
- Components at the data-center level: cooling, WAN connection, power
- Do you concentrate traffic into smaller failure domains?

The Next Most Critical Decision
- Costs and alternatives
- Understand the risks: time spent evaluating is another risk!
- Decide
- Reassess, but not too often; things change fast
- 3 choices, pick 2: cheap, fast, reliable
- Engineering decisions are also driven by things outside engineering: product design limits, management directives

Capacity Planning
- What are the important things to think about?
  - # users, # viewers, # searches
  - Dependent requests & subqueries per request
  - Most popular data, least popular data
- What defines the service & its capacity?
  - Total requests sent to the entire system
  - Total capacity per core; does it change for different types of core?
  - How long does it take to change the system?
  - How much risk of failure is accounted for?
  - How perfect are your load balancers?
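The capacity questions above (total requests to the system, capacity per server) reduce to simple arithmetic. A minimal sketch, with made-up numbers standing in for your own measurements:

```python
import math

# Hypothetical measurements -- substitute your own numbers.
peak_qps = 30_000        # total requests/sec the whole system must serve at peak
qps_per_server = 1_200   # measured capacity of one server for this request mix

# Servers needed just to serve peak traffic (N).
servers_for_peak = math.ceil(peak_qps / qps_per_server)

# Add redundancy so one server can fail while another is down
# for maintenance (the N+2 idea).
servers_to_buy = servers_for_peak + 2

print(servers_for_peak, servers_to_buy)  # 25 27
```

The smallest limit in the stack (disk, uplink, cooling, power) caps `qps_per_server`, so measure the whole stack, not just CPU.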
Planning Cycle
- Estimate, in theory, the cost of the work
- Validate, in practice, the cost of the work
- Monitor demand; monitor the work
- Identify improvements: caching, tuning, better code, product changes
- N+1
  - N is the capacity you need to serve at peak
  - +1 is a shortcut for thinking about disaster capacity
  - Expand on it to anticipate the future from day 1: N+2
- "I need x resources to serve y traffic 99% of the time"
- Like supply chain management: several cycles, cheapest safe choice

Engineering Tricks

1. Dark Launches
- Gain experience without the suffering (new caching? new image replication?)
- Avoid embarrassment
- Build better estimates before public releases
- Identify bottlenecks; work on optimizing them
- Turn on backend monitoring of features before making them visible to end users
- Collect/analyze all the data you would monitor if the feature were live

2. Degraded Failure (Success) Mode
- What choices do you have if the system approaches a critical state?
- Can you reduce load? Serve lower-quality images?
- There is a difference in what work you can do at 1 QPS vs 1 million QPS
- Don't accept work if it will make you fail
  - R2-D2 is offered one more shot of whiskey... program him to kindly say "no, thank you" when he's reached his limit

3. Monitoring
- You can't fix what you can't measure
- Types of monitoring
  - Black box
    - Monitors what the system is supposed to do
    - External monitoring with limited knowledge of "how it works"
    - Responsive
  - White box
    - Predictive of failures ("approaching peak") and of what interventions will fix them
    - Manual interventions (email Sal with instructions) or automated repair responses (beyond garbage collection)
    - Responsive to failures
    - Detailed understanding of the system: identified critical thresholds, warnings when thresholds approach
    - Transparent from day 1
- Failure is not an option... but it's going to happen anyway
- You have to have a way to reason about your system
  - What happens when a piece of your system goes away? What are the implications? What other systems absorb the impact?
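The degraded-success mode described above (reduce load, serve lower-quality images, refuse work that would make you fail) can be sketched as a pair of load thresholds. The threshold values, `handle_request`, and `render_thumbnails` are all hypothetical, not from the talk:

```python
# Sketch of a degraded-success mode. load_fraction would come from a
# white-box gauge (e.g. current QPS / measured peak QPS).

DEGRADE_THRESHOLD = 0.80  # above 80% of capacity, shed quality
REJECT_THRESHOLD = 0.95   # above 95%, refuse new work rather than fail

def handle_request(load_fraction, request):
    if load_fraction >= REJECT_THRESHOLD:
        # Saying "no, thank you" beats crashing under the extra work.
        return {"status": 503, "body": "overloaded, retry later"}
    if load_fraction >= DEGRADE_THRESHOLD:
        # Serve lower-quality images instead of failing entirely.
        return {"status": 200, "body": render_thumbnails(request, quality="low")}
    return {"status": 200, "body": render_thumbnails(request, quality="high")}

def render_thumbnails(request, quality):
    # Stand-in for real image serving.
    return f"{len(request['ids'])} thumbnails at {quality} quality"
```

The key design choice is that degradation is decided per request from a live load signal, so the system backs off smoothly instead of falling over at a cliff.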
- If the system is too big to reason about in your head, you need a tool to visualize it
  - Be able to visualize your system in real time
- If you do something a lot, "really rare" becomes twice a day
- Use good sources of uniqueness
- Clean up temporary files
- Validate your config files before you push them
- Test all layers of a system
  - Humans can't review everything; automated tests are the only way to operate at scale
  - Error paths need to be exercised regularly, even in production
- Always have safety checks for your automated pushes
  - Things that are unthinkable are therefore undocumented; perfectly reasonable code can become a trap
  - Document assumptions in the code; check assumptions when you use a library
  - What % of data is affected by an automated push? If greater than some %, place the push in a holding pattern for review (1% is a whole freaking lot at scale)
- Avoiding synchronization is important
  - Small outages become bigger rapidly
  - On error, don't retry immediately: add exponential wait, add jitter
  - Don't schedule tasks on the hour or half hour; make the schedule random

1. KISS: Keep Servers Simple
- Do one thing and do it well; don't mix request types in a single server
- Growth limiting: my_app_server both handles image uploads and serves image thumbnails
  - The mix of requests can change; capacity is unpredictable for a mix of services
- Growth potential: my_app_upload_server and my_app_thumbnail_server
  - Consistent behavior/capacity per server
  - Easy to understand, even with tons of requests from a variety of systems

2. Smaller & Stateless
- Prefer smaller, stateless servers: many small jobs vs one big job
- Stateful jobs vs stateless jobs
  - Stateful: a stateful server remembers client data (state) from one request to the next.
  - A stateful server is simpler for the client: the client can send less data with each request
  - Stateless: a stateless server keeps no state information
    - More robust: lost connections can't leave a file in an invalid state
    - Rebooting the server does not lose state; rebooting the client does not confuse a stateless server
    - With a stateless file server, the client must specify complete file names in each request, specify the location for reading or writing, and re-authenticate on each request
- Sticky sessions vs stateless sessions
  - Sticky sessions: lock a session to a server to maintain identification of the session and its state
    - The load balancer is forced to send all requests to the original server where the session state was created, even if that server is heavily loaded and a less-loaded server is available to take the request
  - Stateless sessions: the server does not need to store any session state
    - All necessary information is stored in the cookie held by the client
    - Load balancing is easier, as session state does not need to be replicated across multiple front-end servers

Make Failure Domains Smallest & Fewest
- Growth limiting: one giant DB server with all photos on it; a single failure point at 3K QPS
- Growth potential: many smaller sharded storage DB servers
  - Ranges of photo IDs spread across servers; cache document state on servers
  - Failure points of 1K QPS each

3. Retry Safely
- Growth limiting: retry 3 times with a 3-second delay; demand oscillation may occur
- Growth potential: retry with random exponential back-off
  - Random back-off spreads requests out so they don't line up when the system is backed up
  - Make sure retries don't exceed dependent systems' timeouts
- Stateless: ensure clients send identifying info to the server
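The "retry with random exponential back-off" pattern can be sketched as follows; `send_request`, the base delay, and the cap are hypothetical placeholders:

```python
import random
import time

def retry_with_backoff(send_request, max_attempts=5, base=0.1, cap=5.0):
    """Call send_request(), retrying transient errors with jittered back-off."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            # Full jitter: sleep a random amount up to an exponentially
            # growing cap, so retries from many clients don't line up
            # in synchronized waves after an outage.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The cap should stay well under any dependent system's timeout, per the rule above; fixed delays (retry 3 times, 3 seconds apart) are exactly what produces the oscillating demand waves the rule warns about.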
4. Bound Resource Usage: Fail Gracefully
- Growth limiting: load entire objects or docs into RAM → "Error: connection timeout"
- Growth potential: operate on chunks of data (10 thumbnails per page instead of 20)
- Consider your data structures carefully
- Don't buffer user input without a limit
- Reject user requests if overloaded

5. Don't Crash / Assert-Exit
- Never die due to unexpected input; send an exception response, or just throw the request away and ignore it
- Growth limiting: assert(request.size <= 1000)
- Growth potential: if request size > 1000, respond "request too big"

6. Be Transparent
- Jobs should not be a black box
- Keep track of actions taken; make the record available
  - Export it; make it visible via a private URL
- Provide visibility into internal state
- Provide an explicit statement of health
  - Load balancers can use this to send traffic elsewhere
- Export key-value pairs and config files
- Provide debug pages for complex data
- Provide a mechanism for doing health checks
  - Can I read my config file? Is the DB connection up? How much memory and CPU are used? Are errors being sent to the backend?

7. Avoid Lazy Initialization
- Prepare everything you need at startup
- Perform all health checks before accepting requests, including DB connections, loading files from disk, etc.

8. Maintain Flexibility
- Don't change the world all at once
- Canary experimental rollouts
- Release schedule & QA testing
  - Don't release at peak; don't affect users; do it when workers can respond
  - Don't release at midnight
- New features: config-protected, disabled by default
- Percentage rollout; A/B testing

9. Anticipate the Future
- Watch growth trends; have a safety buffer
  - Real disaster: the Thailand floods caused a global hard drive supply delay
- Plan for more capacity if needed; consider time to order and time to implement
- Consider growing peaks and industry changes
  - New technology? Bigger images? New upload bandwidth requirements?
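The percentage rollout in rule 8 is commonly done by hashing each user into a stable bucket; a minimal sketch, with feature names and user IDs of my own invention:

```python
import hashlib

def in_rollout(feature, user_id, percent):
    """Deterministically place user_id into buckets 0-99 for this feature."""
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Usage: ramp a config-protected feature from 1% -> 10% -> 100% of users.
enabled = in_rollout("new_thumbnailer", "user-12345", 10)
```

Because the bucket is a pure function of feature and user, the same user always gets the same answer, and raising the percentage only ever turns the feature on for more users; it never flip-flops between requests, which keeps canary comparisons clean.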
10. Check the User Experience
- Fast & reliable: fast results mean more users; slow performance means a drop in users, measurable week over week
- Probe from off your network; emulate real users; automate it (e.g., with Selenium)
- Account for available bandwidth and latency; don't just check the servers

Mind mapped by Ayori Selassie. Find me on Twitter @iayori. Hosted at blacksintechnology.net