
Devops Workshop (Section 4)

John Willis @botchagalupe

Section 4 - The Second Way - Feedback

Accelerate Feedback

The Second Way - Amplify Feedback


“3% of the problems have figures, 97% of the problems do not”

- Dr Deming

▪ The Second Way - Goals

▪ Right to Left ▪ Find and Fix Fast ▪ Shorten and Amplify Feedback



The Second Way

▪ Accelerate Feedback

▪ Telemetry ▪ Fault Injection ▪ Safety Culture



▪ Telemetry ▪ Fault Injection ▪ Collaboration ▪ Safety Culture


▪ Telemetry

▪ Monitoring ▪ Logging ▪ Analytics


Source: Gene Kim - itrevolution.com

http://itrevolution.com


monitorama.com: Jason Dixon, John Allspaw, Dr Neil Gunther, Mathias Meyer, John Vincent, Jordan Sissel, Sean Porter, Katherine Daniels, Lindsay Holmwood, Adrian Cockcroft, Bridget Kromhout, Kyle Kingsbury, James Turnbull

http://monitorama.com


▪ Fault Injection

▪ Increase MTBF ▪ Reduce MTTR


▪ Fault Injection

▪ Game Day ▪ Netflix Simian Army ▪ Netflix FIT


▪ Game Day

▪ Increases MTBF ▪ Reduces MTTR


▪ Netflix Simian Army

▪ Chaos Monkey (Hosts) ▪ Chaos Gorilla (Data Center) ▪ Latency Monkey (Inject Latency) ▪ Conformity Monkey (Best Practice) ▪ Security Monkey (Security Violations)


▪ FIT : Failure Injection Testing

▪ Limit the blast radius of the failure ▪ Telemetry on the path of the failure ▪ Dependency telemetry
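As a rough illustration of limiting the blast radius, the sketch below fails only a bounded fraction of calls to one targeted dependency and counts what it injected. `FaultInjector`, its parameters, and the dependency names are hypothetical; this is not the Netflix FIT API.

```python
import random


class InjectedFailure(Exception):
    """Raised instead of calling the real dependency."""


class FaultInjector:
    """Hypothetical helper: fails a bounded fraction of calls to one dependency."""

    def __init__(self, target: str, failure_rate: float, seed: int = 0):
        self.target = target
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so an experiment is repeatable
        self.injected = 0  # telemetry: failures injected so far
        self.passed = 0    # telemetry: calls that went through untouched

    def call(self, dependency: str, func, *args, **kwargs):
        # Only the targeted dependency is affected; all others pass through.
        if dependency == self.target and self.rng.random() < self.failure_rate:
            self.injected += 1
            raise InjectedFailure(f"injected failure calling {dependency}")
        self.passed += 1
        return func(*args, **kwargs)


def run_experiment(injector: FaultInjector, dependency: str, calls: int) -> int:
    """Make `calls` calls through the injector and return how many failed."""
    failed = 0
    for _ in range(calls):
        try:
            injector.call(dependency, lambda: "ok")
        except InjectedFailure:
            failed += 1
    return failed
```

The two counters stand in for the telemetry the slide asks for: they show how many calls the experiment actually touched.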


“In a complex system, doing the same thing twice will not predictably or necessarily lead to the same result.”

Sidney Dekker

Views on Human Error

▪ The Second Way - Right to Left

▪ Creating a Service Reliability Culture ▪ Fast Feedback ▪ Understanding Monitoring ▪ Understanding Complexity


Creating a Service Reliability Culture

▪ Service Reliability Culture is Like a Team Sport

▪ Availability ▪ Latency ▪ Performance ▪ Change Management ▪ Monitoring ▪ Emergency Response ▪ Capacity Planning



▪ Core Conflict “Dev vs Ops”

▪ Operations doesn’t really know the code base ▪ The team that knows the least about the code typically has responsibility for its launch



▪ Understanding Service Levels

▪ Service Level Agreements ▪ Service Level Objectives (Targets) ▪ Service Level Indicators



▪ Service Level Agreements

▪ Between the business and the customer ▪ Typically a financial contract ▪ Can be MTTR or MTBF based ▪ Not all services have an explicit SLA



▪ Service Level Objectives

▪ Typically the basis for SLAs ▪ Between the service and the system ▪ Typically target based ▪ All services should have an SLO ▪ Determine actions to take on missed SLOs ▪ SLOs should be tracked historically



▪ Service Level Objectives - Picking Targets

▪ Try to keep them simple ▪ Don’t over-design ▪ Let them evolve ▪ You will learn over time



▪ Service Level Indicators

▪ Quantitative measure of a service ▪ Used as indicators for the SLOs ▪ Monitor SLIs and compare them to SLOs
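A minimal sketch of comparing an SLI to its SLO, assuming an availability indicator computed from request counts. The `SLO` class, the service name, and the 99.9% target are illustrative assumptions, not values from the workshop.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability_sli(total_requests: int, failed_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return (total_requests - failed_requests) / total_requests


def slo_met(sli: float, slo: SLO) -> bool:
    """Compare the measured indicator against the objective."""
    return sli >= slo.target


slo = SLO(name="checkout-availability", target=0.999)
sli = availability_sli(total_requests=1_000_000, failed_requests=400)
print(f"{slo.name}: SLI={sli:.4%}, target={slo.target:.1%}, met={slo_met(sli, slo)}")
```

Storing each measured SLI alongside its target is one simple way to track SLOs historically.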



▪ Service Level Indicators (Examples)

▪ Latency ▪ Errors ▪ Availability ▪ Throughput



▪ Generalized Indicators

▪ Management by Objectives (MBO) ▪ Key Performance Indicators (KPI) ▪ Objectives and Key Results (OKR)


“Management is doing things right; leadership is doing the right things.” ― Peter F. Drucker



“A production line that never stopped was either extremely good or extremely bad”

- Taiichi Ohno

▪ Understanding Risk and Failure

▪ 100% reliability is a myth ▪ All systems go down ▪ Not all services are equal ▪ Manage risk and failure by service ▪ Managing reliability is about managing risk ▪ Managing risk is about cost



▪ Understanding the Cost of Reliability

▪ High availability systems ▪ Opportunity costs



▪ Understanding the Cost of Reliability

▪ Is it a free service? ▪ Is it a revenue based service?



▪ How Many 9s

▪ One (90%) - 36.5 days of downtime per year ▪ Two (99%) - 3.65 days per year ▪ Three (99.9%) - 8.76 hours per year ▪ Four (99.99%) - 52.56 minutes per year ▪ Five (99.999%) - 5.26 minutes per year ▪ Six (99.9999%) - 31.5 seconds per year



▪ Example: One Million Dollars Per Day

▪ Two (99%) - 3.65 days per year = $3.65M ▪ Three (99.9%) - 8.76 hours per year = $365k ▪ Four (99.99%) - 52.56 minutes per year = $36.5k ▪ Five (99.999%) - 5.26 minutes per year = $3.65k ▪ Six (99.9999%) - 31.5 seconds per year = $365
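The downtime and cost figures above can be reproduced with a few lines of arithmetic. This sketch assumes a 365-day year and that revenue is lost in direct proportion to downtime; the function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds; leap years ignored


def unavailability(nines: int) -> float:
    """Fraction of the year a service may be down, e.g. 3 nines -> 0.001."""
    return 10.0 ** (-nines)


def downtime_seconds_per_year(nines: int) -> float:
    return SECONDS_PER_YEAR * unavailability(nines)


def downtime_cost_per_year(nines: int, revenue_per_day: float) -> float:
    """Assumes revenue is lost in direct proportion to downtime."""
    downtime_days = 365 * unavailability(nines)
    return downtime_days * revenue_per_day


if __name__ == "__main__":
    for n in range(2, 7):
        seconds = downtime_seconds_per_year(n)
        cost = downtime_cost_per_year(n, revenue_per_day=1_000_000)
        print(f"{n} nines: {seconds:,.1f} s/year down, ${cost:,.0f}/year")
```

Each extra nine cuts the allowed downtime, and therefore the exposed revenue, by a factor of ten.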



▪ Google Site Reliability Engineers

▪ Google defined the job title ▪ Google SRE was created in 2003 ▪ No NOC ▪ A team that focuses on reliability

▪ Focus on service ▪ Focus on engineering



▪ Benjamin Treynor Sloss

▪ The number one feature for a product is that it works.

▪ The second most important feature for a product is that it works.

▪ The third most important feature for a product is that it works.



Fast Feedback


“You build it, you run it”

- Werner Vogels




▪ Fast Feedback

▪ Design for failure ▪ Adaptive systems - Feedback loops ▪ Developer managed service ▪ Contingency, peer reviews and pairing ▪ Embedded engineers


▪ Design for Failure

▪ Software-based resiliency is typically better than hardware-based resiliency

▪ Cost ▪ Easier to change (fix, upgrade, replace) ▪ Faster to fix ▪ Easier to experiment


▪ Design for Failure

▪ MTTR over MTBF ▪ Game Days ▪ Chaos Monkey(s) ▪ Fault Injection


▪ Fast Feedback

▪ A/B Testing ▪ Dark Deploys ▪ Inject Deployment Metrics in Monitoring ▪ Developers Wear Pagers ▪ Pair Programming ▪ Peer Reviews


▪ Deploys - Upgrading Live Services

▪ Rolling Upgrades ▪ Canary ▪ Blue-Green Deploys ▪ Feature Toggles
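Canary releases and feature toggles both come down to exposing a change to a controlled share of users. The sketch below shows one common way to do that, a stable hash bucket per user. The function names and the feature name are hypothetical, and real toggle systems differ in detail.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in 0..99 for a given feature."""
    key = f"{feature}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """True if the user falls inside the first `percent` buckets."""
    return rollout_bucket(user_id, feature) < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    canary = [u for u in users if is_enabled(u, "new-checkout", 5)]
    print(f"{len(canary)} of {len(users)} users see the canary")
```

Because the bucket depends only on the user and the feature, raising `percent` widens the rollout without flipping anyone who already has the feature back off.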


▪ Fast Feedback

▪ A/B Testing ▪ Dark Deploys ▪ Inject Deployment Metrics in Monitoring


“Reality is made up of circles but we see straight lines”

- Peter Senge

▪ Peer Reviews - Guidelines

▪ All changes are peer reviewed ▪ Everyone monitors the commit logs ▪ High risk changes should include an SME ▪ Break up larger changes into smaller ones


▪ Pairing

▪ Pair programming for everything ▪ Pair programming is slower but can decrease bugs by as much as 70% to 80% ▪ Spreads knowledge ▪ Great for training ▪ Set up pair times ▪ Need a culture that values pair programming


▪ Embedded Engineers

▪ Operations in development ▪ Development in operations


▪ ChatOps

“Everyone is pairing all the time”

Jesse Newland (GitHub)


▪ ChatOps Definition (Atlassian)

▪ ChatOps is a collaboration model that connects people, tools, process, and automation into a transparent workflow. This flow connects the work needed, the work happening, and the work done in a persistent location staffed by the people, bots, and related tools.


Source: http://blogs.atlassian.com/2016/01/what-is-chatops-adoption-guide/

▪ ChatOps Origins

▪ Originally based on chat bots ▪ GitHub’s use of Hubot ▪ Jesse Newland - ChatOps at GitHub ▪ Putting tools in the middle of the conversation


▪ ChatOps Chat Tools

▪ Slack ▪ Campfire ▪ HipChat


▪ ChatOps Benefits

▪ It’s like a multiuser terminal where everyone can see the conversation and the commands interwoven.

▪ There is a historical record of the commands and the conversation. ▪ Provides a great training tool - teaching by doing. ▪ Great for tactical incident resolution - everyone gets to see the conversation and commands. ▪ Dynamically manage the on-call rotation. ▪ Can manage all aspects of the “devops” practices from one central place. ▪ Mobile operations tool for free.


▪ ChatOps Examples

▪ Run a command ▪ Deploy code ▪ Check logs ▪ Check status from GitHub or Jenkins ▪ Change the on-call rotation ▪ Check Nagios alert ▪ Graph monitoring or alert data ▪ Take a system online or offline ▪ Kill a job or process ▪ Answer help desk questions (ML)


Understanding Monitoring


“It’s not the upfront capital that kills you, it’s the operations and maintenance on the back end.”

- Gene Kim





▪ The Visible Ops Handbook (Kim, Behr, Spafford)

▪ Culture of Causality

▪ 80% of all outages are caused by a change ▪ 80% of restoration time is spent trying to figure out what changed ▪ High performance organizations look for the most recent change first


▪ Advanced Application Monitoring Tools

▪ New Relic ▪ AppDynamics ▪ Dynatrace



▪ SaaS Monitoring Tools

▪ Datadog ▪ Honeycomb ▪ SignalFx


▪ Why Monitor

▪ Alerting ▪ Visualizing ▪ Collecting ▪ Trending ▪ Anomalies ▪ Learning



▪ Google’s Four Golden Signals

▪ Latency ▪ Traffic ▪ Errors ▪ Saturation



▪ Looking at the Service Stack

▪ Business Indicators ▪ Application Indicators ▪ Infrastructure Indicators ▪ User Based Indicators ▪ Deployment Indicators



▪ Other Examples

▪ Resolution times ▪ Abandoned shopping carts ▪ Sales transactions ▪ Churn rate ▪ Deployment promotions ▪ Lead time ▪ Forum posts



▪ Monitoring Deployments



Source: Mike Brittain - Etsy Code as Craft


▪ Werner Vogels - Monitoring Question

▪ We monitor a lot of stuff but there is only one metric we care about. Order rate. We have years of heuristics telling us its upper and lower limits.



▪ Facebook



▪ Components of a monitoring system

▪ Sensing/Measuring ▪ Collecting ▪ Analysis/Computation ▪ Alerting ▪ Escalation ▪ Visualization



Source: Limoncelli - The Practice of Cloud System Administration V2

▪ Black Box vs White Box

▪ Black Box Monitoring ▪ Symptom based ▪ Active Problems ▪ User’s experience

▪ White Box Monitoring ▪ Agents ▪ Logs ▪ Instrumentation



▪ Types of Metrics (Raw)

▪ Gauges ▪ Counters ▪ Timers



▪ Types of Metrics (Derived)

▪ Delta ▪ Rates ▪ Ratios
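To make the raw versus derived distinction concrete, the sketch below derives a delta, a rate, and a ratio from two samples of raw counters. The `Sample` fields and the numbers are illustrative assumptions, and counter resets are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since some fixed start
    requests: int     # raw counter, only ever increases
    errors: int       # raw counter, only ever increases


def delta(prev: int, curr: int) -> int:
    """Delta: change in a counter between two samples."""
    return curr - prev


def request_rate(prev: Sample, curr: Sample) -> float:
    """Rate: requests per second between two samples."""
    elapsed = curr.timestamp - prev.timestamp
    return delta(prev.requests, curr.requests) / elapsed


def error_ratio(prev: Sample, curr: Sample) -> float:
    """Ratio: share of requests in the interval that failed."""
    requests = delta(prev.requests, curr.requests)
    if requests == 0:
        return 0.0
    return delta(prev.errors, curr.errors) / requests


a = Sample(timestamp=0.0, requests=1000, errors=10)
b = Sample(timestamp=60.0, requests=1600, errors=13)
print(delta(a.requests, b.requests), request_rate(a, b), error_ratio(a, b))
```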



▪ Analysis

▪ Real Time ▪ Correlation ▪ Historical ▪ Anomaly Detection ▪ Machine Learning



▪ Statistical Analysis

▪ Mean ▪ Median ▪ Percentiles ▪ Standard Deviation ▪ Median Absolute Deviation



Source: Wikipedia

68–95–99.7 Rule
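For reference, the three percentages in the rule's name can be checked against the standard normal distribution using Python's standard library. The helper name is illustrative.

```python
from statistics import NormalDist


def within_k_sigma(k: float) -> float:
    """Probability that a normally distributed value lies within k standard deviations of the mean."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"within {k} sigma: {within_k_sigma(k):.4f}")
```

The rule only holds when the data is actually normally distributed, which is the assumption behind alerting on "mean plus three sigma".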

▪ Non-Gaussian Distribution Data

▪ Most IT operations and performance data doesn’t have a Gaussian distribution

▪ This can lead to over or under alerting



▪ Median ▪ Median Absolute Deviation
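A small sketch of why the median and median absolute deviation (MAD) suit skewed operations data better than the mean and standard deviation. The latency values and the threshold of 3 are made-up illustrations, not recommendations.

```python
from statistics import mean, median, pstdev


def mad(values):
    """Median absolute deviation: median distance from the median."""
    m = median(values)
    return median([abs(v - m) for v in values])


def sigma_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if sigma > 0 and abs(v - mu) / sigma > threshold]


def mad_outliers(values, threshold=3.0):
    """Flag values more than `threshold` MADs from the median."""
    m, spread = median(values), mad(values)
    return [v for v in values if spread > 0 and abs(v - m) / spread > threshold]


latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100, 5000]
print("sigma rule flags:", sigma_outliers(latencies_ms))
print("MAD rule flags:  ", mad_outliers(latencies_ms))
```

In this sample the single 5000 ms spike inflates the standard deviation so much that the three-sigma rule misses it, while the MAD rule flags it.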



▪ Histograms



▪ Percentiles



▪ Inverse Quantiles

▪ Instead of measuring how slow the slowest transactions are (99th quantile)

▪ Measure how many transactions are too slow
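The two questions can be put side by side in code. This sketch assumes a nearest-rank percentile and a hypothetical 500 ms definition of "too slow"; the sample data is made up.

```python
import math


def percentile(values, p: int):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def fraction_slower_than(values, threshold) -> float:
    """Inverse quantile: share of transactions slower than the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)


latencies_ms = [50] * 90 + [200] * 7 + [900] * 3
print("99th percentile latency:", percentile(latencies_ms, 99), "ms")
print("share slower than 500 ms:", fraction_slower_than(latencies_ms, 500))
```

The percentile answers "how slow is slow", while the inverse quantile answers "how many users were affected", which maps more directly onto a service level target.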



▪ Modality Changes



Source: Theo Schlossnagle http://www.slideshare.net/postwait/adaptive-availability

▪ Aggregate Graphs



Source: datadoghq.com

http://datadoghq.com

▪ Anomaly Detection

▪ Finding patterns in data that do not conform to expected behavior

▪ Can be used for noise reduction



Source: Chandola - Anomaly Detection : A Survey

▪ Anomaly Detection - Research Areas

▪ Statistics ▪ Machine Learning ▪ Information Theory ▪ Data Mining




▪ Anomaly Detection - Characteristics

▪ High Cardinality ▪ Minimizing False Positives ▪ Seasonality ▪ Non-Normal Distributions




▪ Anomaly Detection - Netflix



▪ eBay Case Study



Source: http://www.ebaytechblog.com/2015/08/19/statistical-anomaly-detection/




Understanding Complexity


“I smile and start to count on my fingers: One, people are good. Two, every conflict can be removed. Three, every situation, no matter how complex it initially looks, is exceedingly simple. Four, every situation can be substantially improved; even the sky is not the limit. Five, every person can reach a full life. Six, there is always a win-win solution. Shall I continue to count?”

-Eliyahu M. Goldratt




▪ Complexity


▪ In Search of Certainty

▪ Mark Burgess invented Desired State Configuration Management 20+ years ago

▪ Created Promise Theory 10+ years ago

▪ Uses realms of physics and biology to assert that uncertainty is an inescapable fact of technology.



▪ Cybernetics

▪ Defined by Norbert Wiener in 1948 ▪ Circular Causality ▪ Self Steering Approach ▪ Listen, Calibrate, Change and Adapt ▪ Systemic Approach



▪ Cynefin

▪ Defined by Dave Snowden ▪ Designed to describe the evolutionary nature of complex systems ▪ Draws on research from complex adaptive systems theory, cognitive science, anthropology and psychology



Source: Wikipedia - Cynefin

▪ Cause and Effect is Obvious

▪ Sense ▪ See what’s coming in

▪ Categorise ▪ Make it fit predetermined categories

▪ Respond ▪ Decide what to do




▪ Cause and Effect Requires Analysis

▪ Sense ▪ See what’s coming in

▪ Analyse ▪ Investigate or analyse, using expert knowledge

▪ Respond ▪ Decide what to do




▪ Cause and Effect in Retrospect

▪ Probe ▪ Experimental input

▪ Sense ▪ Failures or successes

▪ Respond ▪ Decide what to do, amplify or dampen




▪ Cause and Effect Undetermined

▪ Act ▪ Attempt to stabilize

▪ Sense ▪ Failures or successes

▪ Respond ▪ Decide what to do next



Source: old.cognitive-edge.com

http://old.cognitive-edge.com

▪ Circuit Breaker Patterns

▪ Wrap a protected function call in a circuit breaker object

▪ Monitors for failures ▪ When a threshold is met, trip the circuit breaker ▪ Calls are then returned with an error
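The steps above can be sketched as a small wrapper class. This is a minimal illustration, not a production implementation or any library's API; the `reset_timeout` trial call after the breaker opens is an added assumption, and all names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: return an error without touching the dependency.
                raise CircuitOpenError("circuit open; call not attempted")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each remote call as `breaker.call(fetch, arg)` and treat `CircuitOpenError` as the signal to fail fast or use a fallback.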



Source: Martin Fowler http://martinfowler.com/bliki/CircuitBreaker.html


▪ Netflix - Circuit Breaker - Hystrix

▪ Give protection from and control over latency and failure from dependencies accessed

▪ Stop cascading failures in a complex distributed system.

▪ Fail fast and rapidly recover. ▪ Fallback and gracefully degrade when possible. ▪ Enable near real-time monitoring, alerting, and operational control.


▪ Netflix - Circuit Breaker - Hystrix

▪ Isolates access points between services ▪ Can set up triggers (trip if 10 calls within 10 seconds take longer than 5 seconds) ▪ Provides fallback options (error, default value, null value, or special error)
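The example trigger above (10 calls within 10 seconds taking longer than 5 seconds) can be sketched as a sliding window of slow calls, with a default value as the fallback. This is plain Python for illustration only; it is not Hystrix's API or configuration format, and every name here is hypothetical.

```python
from collections import deque


class SlowCallTrigger:
    """Trips when too many slow calls finish inside a sliding time window."""

    def __init__(self, max_slow_calls=10, window_seconds=10.0, slow_threshold=5.0):
        self.max_slow_calls = max_slow_calls
        self.window_seconds = window_seconds
        self.slow_threshold = slow_threshold
        self.slow_call_times = deque()  # finish times of recent slow calls

    def _expire(self, now):
        # Drop slow calls that have fallen out of the window.
        while self.slow_call_times and now - self.slow_call_times[0] > self.window_seconds:
            self.slow_call_times.popleft()

    def record(self, finished_at, duration):
        """Record one completed call; only slow calls count toward tripping."""
        if duration > self.slow_threshold:
            self.slow_call_times.append(finished_at)
        self._expire(finished_at)

    def tripped(self, now):
        self._expire(now)
        return len(self.slow_call_times) >= self.max_slow_calls


def call_with_fallback(trigger, now, func, fallback_value):
    """Return the fallback without calling func while the trigger is tripped."""
    if trigger.tripped(now):
        return fallback_value
    return func()
```

Returning a default value is only one of the fallback options listed; raising an error or returning a null value would slot into `call_with_fallback` the same way.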



Source: https://github.com/netflix/hystrix/wiki


▪ Other Users of the Circuit Breaker Pattern

▪ Spring Boot ▪ Nginx Plus ▪ Envoy (Istio)
