SRE From Scratch

Nov 16, 2014

Grier Johnson

How to bootstrap an SRE team into your company: how to hire them, what to have them work on, and how to interact with them as a team. Finally, some thoughts on general practices to consider before your SREs arrive. There are also kitten pictures.
Transcript
Page 1: SRE From Scratch

SRE FROM SCRATCH

Page 2: SRE From Scratch

SITE RELIABILITY ENGINEERING

Page 3: SRE From Scratch

PRODUCTION ENGINEERING

Page 4: SRE From Scratch

DEVOPS?

Page 5: SRE From Scratch

WHAT DO SRE DO?

Page 6: SRE From Scratch

KEEP THE SITE UP

Page 7: SRE From Scratch

KNOW THE PRODUCTION ENVIRONMENT

Page 8: SRE From Scratch

KNOW THEIR PRODUCT

Page 9: SRE From Scratch

LIAISON, ADVISOR, CONSULTANT

Page 10: SRE From Scratch

TOOLING AND AUTOMATION

Page 11: SRE From Scratch

TRIAGE

Page 12: SRE From Scratch

SO? WHY DO I NEED THEM?

Page 13: SRE From Scratch

UPTIME

Page 14: SRE From Scratch

THE ENVIRONMENT IS A PRODUCT

Page 15: SRE From Scratch

THEY’VE DONE THIS BEFORE

Page 16: SRE From Scratch

OK... LET’S HIRE SOME

Page 17: SRE From Scratch

WHAT TO LOOK FOR...

Page 18: SRE From Scratch

SRES!

Page 19: SRE From Scratch

SYSADMINS THAT PROGRAM

Page 20: SRE From Scratch

PROGRAMMERS THAT DO SYSADMIN

Page 21: SRE From Scratch

EXPERIENCE WITH SCALE

Page 22: SRE From Scratch

HOW DO I INTERVIEW THEM?

Page 23: SRE From Scratch

FUNDAMENTALS

Page 24: SRE From Scratch

HARDWARE

Page 25: SRE From Scratch

SYSTEM INTERNALS

Page 26: SRE From Scratch

UNIX ENVIRONMENT

Page 27: SRE From Scratch

NETWORKING

Page 28: SRE From Scratch

APPLICATION SUPPORT

Page 29: SRE From Scratch

OPERATING AT SCALE

Page 30: SRE From Scratch

PROGRAMMING

Page 31: SRE From Scratch

DON’T HIRE HEROES

Page 32: SRE From Scratch

OK, I’VE HIRED SOME, WHAT SHOULD THEY DO?

Page 33: SRE From Scratch

DESIGN REVIEW

Page 34: SRE From Scratch

DATA FLOWS

Page 35: SRE From Scratch

DEPENDENCIES

Page 36: SRE From Scratch

FAILURE CONDITIONS

Page 37: SRE From Scratch

SCALING

Page 38: SRE From Scratch

LAUNCH PREPAREDNESS

Page 39: SRE From Scratch

DOCUMENTATION

Page 40: SRE From Scratch

BUILD INFRASTRUCTURE

Page 41: SRE From Scratch

MONITORING

Page 42: SRE From Scratch

DEPLOYMENT

Page 43: SRE From Scratch

OPERATOR TOOLS

Page 44: SRE From Scratch

CONFIGURATION MANAGEMENT

Page 45: SRE From Scratch

SELF-SERVICE

Page 46: SRE From Scratch

HOW SHOULD THE TEAMS INTERACT...

Page 47: SRE From Scratch

DON’T GIVE ALL THE DAY-TO-DAY TASKS TO THE SRES

Page 48: SRE From Scratch

SHARE THE LOAD

Page 49: SRE From Scratch

HAVE YOUR SRES SIT WITH YOU

Page 50: SRE From Scratch

INCLUDE THEM IN DISCUSSIONS THAT AFFECT THE PRODUCTION ENVIRONMENT

Page 51: SRE From Scratch

SOFTWARE IS NEVER THROWN OVER THE WALL

Page 52: SRE From Scratch

HAND-OFFS

Page 53: SRE From Scratch

SRES SHOULD BLOCK DANGEROUS CHANGES

Page 54: SRE From Scratch

IF YOUR SRES ARE FIGHTING FIRES, THEY’RE NOT BUILDING INFRASTRUCTURE

Page 55: SRE From Scratch

IF YOUR SOFTWARE IS CAUSING FIRES, FIX IT

Page 56: SRE From Scratch

ASK YOUR SRE TO HELP MAKE FLAME-PROOF SOFTWARE

Page 57: SRE From Scratch

DON’T HIDE YOUR PROBLEMS FROM SRE

Page 58: SRE From Scratch

SRE SHOULD BE INVOLVED TO UNDERSTAND THE PROBLEM

Page 59: SRE From Scratch

EVERYONE SHOULD BE WRITING CODE OR MAKING HARD DECISIONS

Page 60: SRE From Scratch

OF COURSE THERE ARE OPTIONS...

Page 61: SRE From Scratch

SRE CAN DO ALL THE SUPPORT

Page 62: SRE From Scratch

SRES ARE A LIMITED RESOURCE

Page 63: SRE From Scratch

SWE CAN SUPPORT PRODUCTS...

Page 64: SRE From Scratch

APP SUPPORT BY SWE, INFRASTRUCTURE SUPPORT BY SRE

Page 65: SRE From Scratch

OR JUST ROTATE AROUND

Page 66: SRE From Scratch

ANY PRODUCTION ADVICE?

Page 67: SRE From Scratch

SELF-SERVICE

Page 68: SRE From Scratch

ALL TOOLS SHOULD BE WRITTEN WITH THE IDEA THAT ROBOTS CAN RUN THEM

Page 69: SRE From Scratch

BEFORE ROBOTS RUN THEM, ANYONE IN THE COMPANY SHOULD BE ABLE TO

Page 70: SRE From Scratch

PEOPLE SHOULD MAKE HARD DECISIONS, NOT PUSH BUTTONS

Page 71: SRE From Scratch

GIVE PEOPLE ACCESS

Page 72: SRE From Scratch

SWE SHOULD HAVE AS MUCH ACCESS AS THEY NEED.

Page 73: SRE From Scratch

SWE ALREADY WRITES CODE THAT HAS ACCESS TO SENSITIVE DATA

Page 74: SRE From Scratch

PRODUCTION DATA STAYS IN PRODUCTION

Page 75: SRE From Scratch

MAKE GOOD SYNTHETIC DATA

Page 76: SRE From Scratch

MAKE GOOD WAYS TO TEST IN PROD

Page 77: SRE From Scratch

CANARY, A/B TEST, ETC.
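
One way to picture the canary idea on this slide: send a small, stable slice of users to the new release and keep everyone else on the old one. The sketch below is only an illustration, not something from the talk. The function name `in_canary`, the `salt` parameter, and the 10,000-bucket granularity are all assumptions made for the example.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "canary-v1") -> bool:
    """Return True if this user falls inside the canary slice.

    Hypothetical helper: `percent` is the share of users (0 to 100) that
    should see the new release. The same user always gets the same answer
    for a given salt, so their experience does not flip between requests.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10000
    # percent * 100 converts a percentage into a bucket count out of 10000.
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10000)]
    canary_users = sum(in_canary(u, 5) for u in users)
    print(f"{canary_users} of {len(users)} users routed to the canary")
```

Changing the salt reshuffles which users land in the slice, which is one way to run a fresh A/B experiment without reusing the previous cohort.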

Page 78: SRE From Scratch

LEARN TO TRIAGE

Page 79: SRE From Scratch

THINGS BREAK, YOU MUST FIX THEM

Page 80: SRE From Scratch

MONITORING, METRICS, OPERATOR TOOLS, FAST BUILD AND DEPLOY

Page 81: SRE From Scratch

TO FIX, YOU NEED TO KNOW IT’S BROKEN

Page 82: SRE From Scratch

MONITORING

Page 83: SRE From Scratch

MONITOR APPLICATIONS

Page 84: SRE From Scratch

MONITOR BEHAVIOR

Page 85: SRE From Scratch

STANDARDIZE YOUR METRICS

Page 86: SRE From Scratch

PUSH METRICS OUT

Page 87: SRE From Scratch

DECOUPLE YOUR SYSTEMS

Page 88: SRE From Scratch

WATCH SYSTEMS AS A FUNCTION OF CAPACITY
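
A rough illustration of what watching a system "as a function of capacity" could look like: compare the current load with a known limit and react to the ratio rather than to the raw number. The names `utilization` and `headroom_status` and the 70% / 90% thresholds are assumptions chosen for this sketch, not values from the talk.

```python
def utilization(current: float, capacity: float) -> float:
    """Return load as a fraction of known capacity (0.5 means half used)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return current / capacity


def headroom_status(current: float, capacity: float,
                    warn_at: float = 0.7, page_at: float = 0.9) -> str:
    """Classify a measurement by how much of the capacity it consumes.

    The thresholds are illustrative defaults; a real system would derive
    them from load tests or past incidents.
    """
    used = utilization(current, capacity)
    if used >= page_at:
        return "page"
    if used >= warn_at:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # 750 requests/s against a measured limit of 1000 requests/s.
    print(headroom_status(750, 1000))
```

The same raw value (say, 750 requests per second) is harmless on a cluster that handles 5,000 and alarming on one that tops out at 800, which is why the ratio is the more useful signal.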

Page 89: SRE From Scratch

ONLY ALERT ON SYSTEM METRICS KNOWN TO HURT YOU

Page 90: SRE From Scratch

DATA STORES

Page 91: SRE From Scratch

BEWARE THE RDBMS

Page 92: SRE From Scratch

LEARN TO SHARD
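
As a minimal sketch of what sharding means in practice, the snippet below picks a shard for a key by hashing it. This is the simplest possible scheme and only an assumption about what the slide has in mind; `shard_for` is a hypothetical name.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to one of `shard_count` shards using a stable hash.

    Python's built-in hash() is randomized per process for strings, so a
    cryptographic digest is used here to get the same answer on every host.
    """
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, "-> shard", shard_for(user, 4))
```

Plain modulo sharding moves most keys when the shard count changes. Schemes such as consistent hashing or a lookup table of key ranges reduce that movement at the cost of more bookkeeping.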

Page 93: SRE From Scratch

DITCH THE DURABILITY WHERE YOU CAN

Page 94: SRE From Scratch

BUT FIGURE OUT HOW TO BOOTSTRAP NON-DURABLE STORES

Page 95: SRE From Scratch

MEMCACHE IS A BLESSING AND A CURSE

Page 96: SRE From Scratch

ALWAYS CONSIDER A SITE-WIDE POWER OUTAGE

Page 97: SRE From Scratch

USE DURABLE AND NON-DURABLE STORES TOGETHER
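
One common way to combine the two kinds of store is a read-through cache: the durable store stays the source of truth and the non-durable store absorbs repeated reads. The sketch below uses plain dictionaries as stand-ins for both, and the class name `CachedStore` and its methods are hypothetical. It also shows the bootstrap concern from the earlier slides: after the cache is wiped (for example by a site-wide power outage), reads still succeed and the cache can be re-warmed.

```python
class CachedStore:
    """Read-through cache over a durable store (both are dict stand-ins)."""

    def __init__(self, durable: dict):
        self.durable = durable    # stand-in for a database on disk
        self.cache = {}           # stand-in for an in-memory cache
        self.durable_reads = 0    # counts how often the slow path is taken

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        # Cache miss: fall back to the durable store and remember the value.
        self.durable_reads += 1
        value = self.durable.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        # Write the durable copy first so a crash never leaves data
        # that exists only in the cache.
        self.durable[key] = value
        self.cache[key] = value

    def lose_cache(self):
        """Simulate losing every non-durable node at once."""
        self.cache.clear()

    def warm(self, keys):
        """Bootstrap the cache by reading a known set of hot keys."""
        for key in keys:
            self.get(key)


if __name__ == "__main__":
    store = CachedStore({})
    store.put("session:1", "alice")
    store.lose_cache()
    print(store.get("session:1"), "durable reads:", store.durable_reads)
```

A cold cache sends every read to the durable store at once, so sizing the durable tier for that moment (or warming the cache before taking traffic) is part of the design.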

Page 98: SRE From Scratch

ASK YOUR SRE FOR MORE INFO

Page 99: SRE From Scratch

DESPITE ALL THIS, YOU CAN STILL FAIL...

Page 100: SRE From Scratch

OBVIOUS FAILURE

Page 101: SRE From Scratch

DOWNTIME

Page 102: SRE From Scratch

DOWNTIME WITHOUT KNOWING

Page 103: SRE From Scratch

NON-OBVIOUS FAILURES

Page 104: SRE From Scratch

HEROIC ACTS

Page 105: SRE From Scratch

WERE YOU UP ALL NIGHT?

Page 106: SRE From Scratch

DID YOU DO THAT SAME TASK ALL DAY?

Page 107: SRE From Scratch

DID A WHOLE TEAM STOP WHAT THEY WERE DOING?

Page 108: SRE From Scratch

THESE ARE HEROIC ACTS, THEY ARE POISON

Page 109: SRE From Scratch

HEROISM = FAILURE

Page 110: SRE From Scratch

COMES FROM LEGACY SYSTEMS, PROCEDURES

Page 111: SRE From Scratch

ALSO FROM PERSONALITY TRAITS...

Page 112: SRE From Scratch

QUESTIONS?

• Grier Johnson

• @grierj

[email protected]