Google SRE (Site Reliability Engineering) Concepts

Presenters: David Hixson, John Neil, Robert Spier, John Reese, Mikey Dickerson, Nori Heikkinen, Ryan Anderson

Designing Distributed Systems

Limiting Factors
- What limits growth? Resource constraints
- How to design to push past those limits
- Data latency

Failure Modes
- Predict, far into the future, the ways things will fail
- Hope is not a strategy
- Serve in spite of failures; how to serve & grow past failures
- What can you prevent before it starts?

10 Rules for Scale: Scaling Up Safely

Make Good Choices

Constraints
- Every part of a design has limits
- Aggregate capability is probably the minimum of the parts; all capacity above that value is wasted
- The smallest limit is the failure domain
  - Gas tank size: the car will run out of fuel first, so the tank is the failure domain

Understand the Whole Stack
- Components in the computer: disk I/O (IOPS)
- Components in the data center: network ports, rack uplinks
- Components at the data-center level: cooling, WAN connection, power
- Do you concentrate traffic into smaller failure domains?

The Next Most Critical Decision
- Costs and alternatives
- Understand the risks: time spent evaluating is another risk!
- Decide
- Reassess, but not too often; things change fast
- 3 choices, pick 2: cheap, fast, reliable
- Engineering decisions are also driven by things outside engineering: product design limits, management directives

Capacity Planning
- What are the important things to think about?
  - # users, # viewers, # searches
  - Dependent requests & subqueries per request
  - Most popular data, least popular data
- What defines the service & its capacity?
  - Total requests sent to the entire system
  - Total capacity per core; does it change for different types of core?
  - How long does it take to change the system?
  - How much risk of failure is accounted for?
  - How perfect are your load balancers?
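The capacity questions above (total requests to the system, capacity per server) reduce to simple arithmetic. A minimal sketch, with made-up numbers standing in for your own measurements:

```python
import math

# Hypothetical measurements -- substitute your own numbers.
peak_qps = 30_000        # total requests/sec the whole system must serve at peak
qps_per_server = 1_200   # measured capacity of one server for this request mix

# Servers needed just to serve peak traffic (N).
servers_for_peak = math.ceil(peak_qps / qps_per_server)

# Add redundancy so one server can fail while another is down
# for maintenance (the N+2 idea).
servers_to_buy = servers_for_peak + 2

print(servers_for_peak, servers_to_buy)  # 25 27
```

The smallest limit in the stack (disk, uplink, cooling, power) caps `qps_per_server`, so measure the whole stack, not just CPU.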
Planning Cycle
- Estimate, in theory, the cost of the work
- Validate, in practice, the cost of the work
- Monitor demand; monitor the work
- Identify improvements: caching, tuning, better code, product changes
- N+1
  - N is the capacity you need to serve at peak
  - +1 is a shortcut for thinking about disaster capacity
  - Expand on it to anticipate the future from day 1: N+2
- "I need x resources to serve y traffic 99% of the time"
- Like supply chain management: several cycles, cheapest safe choice

Engineering Tricks

1. Dark Launches
- Gain experience without the suffering (new caching? new image replication?)
- Avoid embarrassment
- Build better estimates before public releases
- Identify bottlenecks; work on optimizing them
- Turn on backend monitoring of features before making them visible to end users
- Collect/analyze all the data you would monitor if the feature were live

2. Degraded Failure (Success) Mode
- What choices do you have if the system approaches a critical state?
- Can you reduce load? Serve lower-quality images?
- There is a difference in what work you can do at 1 QPS vs 1 million QPS
- Don't accept work if it will make you fail
  - R2-D2 is offered one more shot of whiskey... program him to kindly say "no, thank you" when he's reached his limit

3. Monitoring
- You can't fix what you can't measure
- Types of monitoring
  - Black box
    - Monitors what the system is supposed to do
    - External monitoring with limited knowledge of "how it works"
    - Responsive
  - White box
    - Predictive of failures ("approaching peak") and of what interventions will fix them
    - Manual interventions (email Sal with instructions) or automated repair responses (beyond garbage collection)
    - Responsive to failures
    - Detailed understanding of the system: identified critical thresholds, warnings when thresholds approach
    - Transparent from day 1
- Failure is not an option... but it's going to happen anyway
- You have to have a way to reason about your system
  - What happens when a piece of your system goes away? What are the implications? What other systems absorb the impact?
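The degraded-success mode described above (reduce load, serve lower-quality images, refuse work that would make you fail) can be sketched as a pair of load thresholds. The threshold values, `handle_request`, and `render_thumbnails` are all hypothetical, not from the talk:

```python
# Sketch of a degraded-success mode. load_fraction would come from a
# white-box gauge (e.g. current QPS / measured peak QPS).

DEGRADE_THRESHOLD = 0.80  # above 80% of capacity, shed quality
REJECT_THRESHOLD = 0.95   # above 95%, refuse new work rather than fail

def handle_request(load_fraction, request):
    if load_fraction >= REJECT_THRESHOLD:
        # Saying "no, thank you" beats crashing under the extra work.
        return {"status": 503, "body": "overloaded, retry later"}
    if load_fraction >= DEGRADE_THRESHOLD:
        # Serve lower-quality images instead of failing entirely.
        return {"status": 200, "body": render_thumbnails(request, quality="low")}
    return {"status": 200, "body": render_thumbnails(request, quality="high")}

def render_thumbnails(request, quality):
    # Stand-in for real image serving.
    return f"{len(request['ids'])} thumbnails at {quality} quality"
```

The key design choice is that degradation is decided per request from a live load signal, so the system backs off smoothly instead of falling over at a cliff.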
- If the system is too big to reason about in your head, you need a tool to visualize it
  - Be able to visualize your system in real time
- If you do something a lot, "really rare" becomes twice a day
- Use good sources of uniqueness
- Clean up temporary files
- Validate your config files before you push them
- Test all layers of a system
  - Humans can't review everything; automated tests are the only way to operate at scale
  - Error paths need to be exercised regularly, even in production
- Always have safety checks for your automated pushes
  - Things that are unthinkable are therefore undocumented; perfectly reasonable code can become a trap
  - Document assumptions in the code; check assumptions when you use a library
  - What % of data is affected by an automated push? If greater than some %, place the push in a holding pattern for review (1% is a whole freaking lot at scale)
- Avoiding synchronization is important
  - Small outages become bigger rapidly
  - On error, don't retry immediately: add exponential wait, add jitter
  - Don't schedule tasks on the hour or half hour; make the schedule random

1. KISS: Keep Servers Simple
- Do one thing and do it well; don't mix request types in a single server
- Growth limiting: my_app_server both handles image uploads and serves image thumbnails
  - The mix of requests can change; capacity is unpredictable for a mix of services
- Growth potential: my_app_upload_server and my_app_thumbnail_server
  - Consistent behavior/capacity per server
  - Easy to understand, even with tons of requests from a variety of systems

2. Smaller & Stateless
- Prefer smaller, stateless servers: many small jobs vs one big job
- Stateful jobs vs stateless jobs
  - Stateful: a stateful server remembers client data (state) from one request to the next.
  - A stateful server is simpler for the client: the client can send less data with each request
  - Stateless: a stateless server keeps no state information
    - More robust: lost connections can't leave a file in an invalid state
    - Rebooting the server does not lose state; rebooting the client does not confuse a stateless server
    - With a stateless file server, the client must specify complete file names in each request, specify the location for reading or writing, and re-authenticate on each request
- Sticky sessions vs stateless sessions
  - Sticky sessions: lock a session to a server to maintain identification of the session and its state
    - The load balancer is forced to send all requests to the original server where the session state was created, even if that server is heavily loaded and a less-loaded server is available to take the request
  - Stateless sessions: the server does not need to store any session state
    - All necessary information is stored in the cookie held by the client
    - Load balancing is easier, as session state does not need to be replicated across multiple front-end servers

Make Failure Domains Smallest & Fewest
- Growth limiting: one giant DB server with all photos on it; a single failure point at 3K QPS
- Growth potential: many smaller sharded storage DB servers
  - Ranges of photo IDs spread across servers; cache document state on servers
  - Failure points of 1K QPS each

3. Retry Safely
- Growth limiting: retry 3 times with a 3-second delay; demand oscillation may occur
- Growth potential: retry with random exponential back-off
  - Random back-off spreads requests out so they don't line up when the system is backed up
  - Make sure retries don't exceed dependent systems' timeouts
- Stateless: ensure clients send identifying info to the server
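The "retry with random exponential back-off" pattern can be sketched as follows; `send_request`, the base delay, and the cap are hypothetical placeholders:

```python
import random
import time

def retry_with_backoff(send_request, max_attempts=5, base=0.1, cap=5.0):
    """Call send_request(), retrying transient errors with jittered back-off."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            # Full jitter: sleep a random amount up to an exponentially
            # growing cap, so retries from many clients don't line up
            # in synchronized waves after an outage.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The cap should stay well under any dependent system's timeout, per the rule above; fixed delays (retry 3 times, 3 seconds apart) are exactly what produces the oscillating demand waves the rule warns about.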
4. Bound Resource Usage: Fail Gracefully
- Growth limiting: load entire objects or docs into RAM → "Error: connection timeout"
- Growth potential: operate on chunks of data (10 thumbnails per page instead of 20)
- Consider your data structures carefully
- Don't buffer user input without a limit
- Reject user requests if overloaded

5. Don't Crash / Assert-Exit
- Never die due to unexpected input; send an exception response, or just throw the request away and ignore it
- Growth limiting: assert(request.size <= 1000)
- Growth potential: if request size > 1000, respond "request too big"

6. Be Transparent
- Jobs should not be a black box
- Keep track of actions taken; make the record available
  - Export it; make it visible via a private URL
- Provide visibility into internal state
- Provide an explicit statement of health
  - Load balancers can use this to send traffic elsewhere
- Export key-value pairs and config files
- Provide debug pages for complex data
- Provide a mechanism for doing health checks
  - Can I read my config file? Is the DB connection up? How much memory and CPU are used? Are errors being sent to the backend?

7. Avoid Lazy Initialization
- Prepare everything you need at startup
- Perform all health checks before accepting requests, including DB connections, loading files from disk, etc.

8. Maintain Flexibility
- Don't change the world all at once
- Canary experimental rollouts
- Release schedule & QA testing
  - Don't release at peak; don't affect users; do it when workers can respond
  - Don't release at midnight
- New features: config-protected, disabled by default
- Percentage rollout; A/B testing

9. Anticipate the Future
- Watch growth trends; have a safety buffer
  - Real disaster: the Thailand floods caused a global hard drive supply delay
- Plan for more capacity if needed; consider time to order and time to implement
- Consider growing peaks and industry changes
  - New technology? Bigger images? New upload bandwidth requirements?
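The percentage rollout in rule 8 is commonly done by hashing each user into a stable bucket; a minimal sketch, with feature names and user IDs of my own invention:

```python
import hashlib

def in_rollout(feature, user_id, percent):
    """Deterministically place user_id into buckets 0-99 for this feature."""
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Usage: ramp a config-protected feature from 1% -> 10% -> 100% of users.
enabled = in_rollout("new_thumbnailer", "user-12345", 10)
```

Because the bucket is a pure function of feature and user, the same user always gets the same answer, and raising the percentage only ever turns the feature on for more users; it never flip-flops between requests, which keeps canary comparisons clean.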
10. Check the User Experience
- Fast & reliable: fast results mean more users; slow performance means a drop in users, measurable week over week
- Probe from off your network; emulate real users; automate it (e.g., with Selenium)
- Account for available bandwidth and latency; don't just check the servers

Mind mapped by Ayori Selassie. Find me on Twitter @iayori. Hosted at blacksintechnology.net