Keeping Pinterest Running, Joe Gordon, 2 February 2016

Keeping Pinterest Runningsysadmin.miniconf.org/2016/lca2016-joe_gordon-keeping...System Service Dependencies Latencies Alerting ELK stack for real time log collection Deployments ...

May 06, 2018


Keeping Pinterest Running

Joe Gordon, 2 February 2016

What is Pinterest?

Software v. Service

● Stable branches
● Drivers and configurations
● Support matrix
● Dependency versions
● Developers support their own service
    ○ On-call rotation
    ○ Aligns incentives
    ○ Monitoring & alerting built in from day one
● Testing against production traffic

SRE at Pinterest

What do SREs focus on?

Operational Maturity

Operational Excellence

Visibility: insight into the system

● Data driven
● Cornerstone for many things we do
    ○ Measure and enforce SLA (Service Level Agreement)
    ○ Debug issues
    ○ Capacity planning
● Time series data - TSDB
● Metrics
    ○ System
    ○ Service
    ○ Dependencies
    ○ Latencies
● Alerting
● ELK stack for real time log collection
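
As a rough illustration of measuring an SLA from service and latency metrics, the sketch below computes a success ratio and a nearest-rank p99 latency over a window of request samples and compares both with targets. The sample format, the function names, and the thresholds (99.9% success, 500 ms p99) are assumptions made for this example, not values from the talk.

```python
import math
from typing import List, NamedTuple


class Sample(NamedTuple):
    ok: bool            # True if the request succeeded
    latency_ms: float   # observed request latency in milliseconds


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list (pct in (0, 100])."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_sla(samples: List[Sample],
              min_success_ratio: float = 0.999,   # assumed target
              max_p99_ms: float = 500.0) -> dict:  # assumed target
    """Summarize a window of samples and flag SLA violations."""
    if not samples:
        raise ValueError("no samples in window")
    success_ratio = sum(s.ok for s in samples) / len(samples)
    p99 = percentile([s.latency_ms for s in samples], 99)
    return {
        "success_ratio": success_ratio,
        "p99_ms": p99,
        "violated": success_ratio < min_success_ratio or p99 > max_p99_ms,
    }
```

In practice the samples would come from the time series database rather than an in-memory list, and the `violated` flag would feed the alerting system.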

Deployments

Deployment Requirements

● No impact to end user
● Change history
● Easy

Staging and Canary

Canary in a coal mine. Rabbit in a sarin gas plant.

Canary vs Staging

Teletraan: deploy system

Teletraan Features

● Rollback
● Hotfix
● Rolling deploy
● Staging and testing
● Visibility & Usability

Teletraan Design

● Client-server model
● PRE/POST-DOWNLOAD
● PRE/POST-RESTART
● RESTART
● RBAC
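
One way to read the stage names above is as hook points that a deploy agent runs in a fixed order on each host. The sketch below illustrates that reading and is not Teletraan's actual agent code: the `STAGES` ordering, the underscore spelling of the stage names, the `run_deploy` function, and the callable hooks are all assumptions introduced for this example.

```python
from typing import Callable, Dict, List

# Stage names taken from the slide; the ordering is an assumption.
STAGES: List[str] = [
    "PRE_DOWNLOAD",
    "POST_DOWNLOAD",
    "PRE_RESTART",
    "RESTART",
    "POST_RESTART",
]

Hook = Callable[[], bool]  # a hook returns True on success


def run_deploy(hooks: Dict[str, Hook]) -> List[str]:
    """Run hooks stage by stage and stop at the first failure.

    Returns the stages that completed successfully. A stage with no
    registered hook is treated as a no-op success.
    """
    completed: List[str] = []
    for stage in STAGES:
        hook = hooks.get(stage)
        if hook is not None and not hook():
            break  # later stages are skipped so the host can be retried or rolled back
        completed.append(stage)
    return completed
```

Stopping at the first failed stage means a host whose pre-restart check fails never restarts into the new build.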

Teletraan Advanced Features

● Pause/Resume
● Acceptance Testing
● Auto Deploy
● Autoscaling

[Diagram: Staging → Canary → Production, with "Test" under each stage and the transitions labeled "Auto Promote" and "Promote"]
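
The staging, canary, production flow can be sketched as a loop that tests a build in each environment and promotes it only when the test passes. This is a simplified illustration, not Teletraan's implementation: the `promote` function and the per-environment test callables are hypothetical.

```python
from typing import Callable, Dict, Optional

ENVIRONMENTS = ["staging", "canary", "production"]  # order from the slide


def promote(build: str,
            tests: Dict[str, Callable[[str], bool]]) -> Optional[str]:
    """Move `build` through each environment in order.

    Returns the last environment whose test passed, or None if the build
    failed in the first environment. A failed test stops promotion, so a
    bad build never reaches the environments after it.
    """
    last_good: Optional[str] = None
    for env in ENVIRONMENTS:
        if not tests[env](build):
            break
        last_good = env
    return last_good
```

For example, a build that passes staging but fails its canary test stays at `"staging"` and is never deployed to production.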

Postmortems: learn from our mistakes

● Blameless
● Incident Manager
● Impact
● Outage Type
● Method of Detection
● Timeline
● Root Cause
● Restoration Details
● Actionable Items

Production Readiness Review: pre-mortem?

● Dependencies
● Define an SLA
● Alerting
● Capacity planning
● Testing
● On-call rotation
● Decider to turn feature off if needed
● Incremental launch plan
● Rate limiting
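
A decider in this sense can be as small as a percentage gate: a feature is on for a stable fraction of users, the fraction is raised step by step during an incremental launch, and setting it to 0 turns the feature off. The sketch below is one possible shape for such a gate. The `Decider` class, its method names, and the hashing scheme are assumptions for illustration and do not describe Pinterest's internal tool.

```python
import hashlib
from typing import Dict


class Decider:
    """In-memory percentage gate (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._percent: Dict[str, int] = {}

    def set_percent(self, feature: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._percent[feature] = percent

    def is_enabled(self, feature: str, user_id: str) -> bool:
        # Hash feature and user together so each user lands in a stable
        # bucket (0-99) that differs from feature to feature.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        # Unknown features default to 0 percent, i.e. off.
        return bucket < self._percent.get(feature, 0)
```

Because each user's bucket is fixed, raising the percentage only adds users; anyone who already had the feature keeps it as the launch widens.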

Public Cloud Issues

● “If you get an InsufficientInstanceCapacity error when you try to launch an instance, AWS does not currently have enough available capacity to service your request.”
    ○ Cloud is not infinite
    ○ Reserved instances
    ○ Capacity planning
● RequestLimitExceeded: “The maximum request rate permitted by the Amazon EC2 APIs has been exceeded for your account.”
    ○ Includes DescribeInstances
    ○ Use internal mirror (powered by Elasticsearch)
● Noisy neighbors
● Rightsizing
● Ownership
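
A common client-side response to RequestLimitExceeded is to retry with exponential backoff and jitter, so that callers do not hit the API again in lockstep. The sketch below shows the pattern in isolation. `RequestLimitExceeded` here is a locally defined stand-in exception and `call_with_backoff` is a hypothetical helper, not part of any AWS SDK; the default delays are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RequestLimitExceeded(Exception):
    """Stand-in for a throttling error returned by a cloud API."""


def call_with_backoff(fn: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 0.5,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on throttling with capped exponential backoff.

    Uses "full jitter": each wait is a random value between 0 and the
    current backoff ceiling. The final failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RequestLimitExceeded:
            if attempt == max_attempts - 1:
                raise
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

Retries only smooth out bursts. The other remedy on the slide, answering DescribeInstances-style lookups from an internal mirror, avoids the API calls in the first place.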

Open Sourced Tools

● mysql_utils
    ○ MySQL Management Tools for the Cloud
● thrift-tools
    ○ thrift-tools is a library and a set of tools to introspect Apache Thrift traffic
● secor
    ○ Secor is a service implementing Kafka log persistence
● pymemcache
    ○ A comprehensive, fast, pure-Python memcached client
● pinrepo
    ○ Artifact Repo
● Teletraan

More at: https://github.com/pinterest
