Devops Workshop (Section 4) John Willis @botchagalupe
Section 4 - The Second Way - Feedback
Accelerate Feedback
The Second Way - Amplify Feedback
“3% of the problems have figures, 97% of the problems do not”
- Dr Deming
▪ The Second Way - Goals
▪ Right to Left ▪ Find and Fix Fast ▪ Shorten and Amplify Feedback
The Second Way
▪ Accelerate Feedback
▪ Telemetry ▪ Fault Injection ▪ Safety Culture
▪ Accelerate Feedback
▪ Telemetry ▪ Fault Injection ▪ Collaboration ▪ Safety Culture
▪ Telemetry
▪ Monitoring ▪ Logging ▪ Analytics
monitorama.com: Jason Dixon, John Allspaw, Dr Neil Gunther, Mathias Meyer, John Vincent, Jordan Sissel, Sean Porter, Katherine Daniels, Lindsay Holmwood, Adrian Cockcroft, Bridget Kromhout, Kyle Kingsbury, James Turnbull
▪ Fault Injection
▪ Increase MTBF ▪ Reduce MTTR
▪ Fault Injection
▪ Game Day ▪ Netflix Simian Army ▪ Netflix FIT
▪ Game Day
▪ Increases MTBF ▪ Reduces MTTR
▪ Netflix Simian Army
▪ Chaos Monkey (Hosts) ▪ Chaos Gorilla (Data Center) ▪ Latency Monkey (Inject Latency) ▪ Conformity Monkey (Best Practice) ▪ Security Monkey (Security Violations)
▪ FIT : Failure Injection Testing
▪ Limit the blast radius of the failure ▪ Telemetry of the path of the failure ▪ Dependency telemetry
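The failure injection ideas above can be sketched in a few lines. This is only an illustration and not Netflix's implementation: the `FaultInjector` class, its parameters, and the `blast_radius` fraction are hypothetical names chosen for the example.

```python
import random
import time


class FaultInjector:
    """Hypothetical helper: injects latency or failures into a slice of calls."""

    def __init__(self, failure_rate=0.0, latency_seconds=0.0,
                 blast_radius=0.01, rng=None):
        self.failure_rate = failure_rate        # chance an in-scope call fails
        self.latency_seconds = latency_seconds  # delay added to in-scope calls
        self.blast_radius = blast_radius        # fraction of calls in scope
        self.rng = rng if rng is not None else random.Random()
        self.injected = 0                       # telemetry: failures injected

    def call(self, func, *args, **kwargs):
        # Only a limited slice of traffic is ever eligible for injection.
        if self.rng.random() < self.blast_radius:
            if self.latency_seconds:
                time.sleep(self.latency_seconds)
            if self.rng.random() < self.failure_rate:
                self.injected += 1
                raise RuntimeError("injected fault")
        return func(*args, **kwargs)
```

Keeping `blast_radius` small limits how much traffic an experiment can hurt, and the `injected` counter is a stand-in for the telemetry that traces the path of each failure.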
“In a complex system, doing the same thing twice will not predictably or necessarily lead to the same result.”
Sidney Dekker
Views on Human Error
▪ The Second Way - Right to Left
▪ Creating a Service Reliability Culture ▪ Fast Feedback ▪ Understanding Monitoring ▪ Understanding Complexity
Creating a Service Reliability Culture
▪ Service Reliability Culture is Like a Team Sport
▪ Availability ▪ Latency ▪ Performance ▪ Change Management ▪ Monitoring ▪ Emergency Response ▪ Capacity Planning
▪ Core Conflict “Dev vs Ops”
▪ Operations doesn’t really know the code base ▪ The team that knows the least about the code typically has responsibility for its launch
▪ Understanding Service Levels
▪ Service Level Agreements ▪ Service Level Objectives (Targets) ▪ Service Level Indicators
▪ Service Level Agreements
▪ Between the business and the customer ▪ Typically a financial contract ▪ Can be MTTR or MTBF based ▪ Not all services have an explicit SLA
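As a rough illustration of MTTR- and MTBF-based measures, the sketch below derives both from a list of outages. The function name and the `(start_hour, end_hour)` input format are assumptions made for the example.

```python
def mtbf_and_mttr(outages, period_hours):
    """Return (mtbf, mttr, availability) for one observation period.

    outages: list of (start_hour, end_hour) pairs inside the period.
    """
    if not outages:
        raise ValueError("at least one outage is needed to compute the means")
    downtime = sum(end - start for start, end in outages)
    uptime = period_hours - downtime
    mttr = downtime / len(outages)  # mean time to repair
    mtbf = uptime / len(outages)    # mean time between failures
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability


# Three outages in a 30-day (720 hour) month.
mtbf, mttr, availability = mtbf_and_mttr([(10, 12), (100, 101), (500, 503)], 720)
```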
▪ Service Level Objectives
▪ Typically the basis for SLAs ▪ Between the service and the system ▪ Typically target based ▪ All services should have an SLO ▪ Determine actions to take on missed SLOs ▪ SLOs should be tracked historically
▪ Service Level Objectives - Picking Targets
▪ Try and keep them simple ▪ Don’t over design ▪ Let them evolve ▪ Will learn over time
▪ Service Level Indicators
▪ Quantitative measure of a service ▪ Used as indicators for the SLOs ▪ Monitor SLIs and compare to SLOs
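A minimal sketch of comparing an SLI with its SLO, assuming a request-based availability indicator. The function names, the sample request counts, and the 99.9% target are illustrative and not taken from the slides.

```python
def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def check_slo(sli, slo_target):
    """Compare a measured SLI with its SLO target."""
    return {
        "sli": sli,
        "target": slo_target,
        "met": sli >= slo_target,
        "shortfall": max(0.0, slo_target - sli),
    }


report = check_slo(availability_sli(99_950, 100_000), slo_target=0.999)
```

Storing each `report` over time gives the historical SLO tracking mentioned earlier.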
▪ Service Level Indicators (Examples)
▪ Latency ▪ Errors ▪ Availability ▪ Throughput
▪ Generalized Indicators
▪ Management By Objectives (MBO) ▪ Key Performance Indicators (KPI) ▪ Objectives and Key Results (OKR)
“Management is doing things right; leadership is doing the right things.” ― Peter F. Drucker
“A production line that never stopped was either extremely good or extremely bad”
- Taiichi Ohno
▪ Understanding Risk and Failure
▪ 100% reliability is a myth ▪ All systems go down ▪ Not all services are equal ▪ Manage risk and failure by service ▪ Managing reliability is about managing risk ▪ Managing risk is about cost
▪ Understanding the Cost of Reliability
▪ High availability systems ▪ Opportunity costs
▪ Understanding the Cost of Reliability
▪ Is it a free service? ▪ Is it a revenue based service?
▪ How Many 9s
▪ One (90%) - 36.5 days per year ▪ Two (99%) - 3.65 days per year ▪ Three (99.9%) - 8.76 hours per year ▪ Four (99.99%) - 52.56 minutes per year ▪ Five (99.999%) - 5.26 minutes per year ▪ Six (99.9999%) - 31.5 seconds per year
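The downtime figures above follow from simple arithmetic on a 365-day year. A short sketch (the function name is made up for the example):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def downtime_seconds_per_year(nines):
    """Allowed downtime per year for an availability with this many nines."""
    unavailability = 10 ** -nines  # e.g. three nines -> 0.001
    return unavailability * SECONDS_PER_YEAR


for n in range(1, 7):
    seconds = downtime_seconds_per_year(n)
    print(f"{n} nines: {seconds / 3600:.2f} hours ({seconds:.1f} seconds)")
```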
▪ Example: One Million Dollars Per Day
▪ Two (99%) - 3.65 days per year = $3.65M ▪ Three (99.9%) - 8.76 hours per year = $365k ▪ Four (99.99%) - 52.56 minutes per year = $36.5k ▪ Five (99.999%) - 5.26 minutes per year = $3.65k ▪ Six (99.9999%) - 31.5 seconds per year = $365
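The dollar amounts can be reproduced under one simplifying assumption: revenue is lost in direct proportion to downtime. The function name below is hypothetical.

```python
def annual_outage_cost(availability_percent, revenue_per_day):
    """Revenue lost per year, assuming loss is proportional to downtime."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * 365 * revenue_per_day


for availability in (99, 99.9, 99.99, 99.999, 99.9999):
    cost = annual_outage_cost(availability, revenue_per_day=1_000_000)
    print(f"{availability}%: ${cost:,.0f} per year")
```

Each additional nine cuts the exposure by a factor of ten, which is the number to weigh against the cost of engineering that extra nine.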
▪ Google Site Reliability Engineers
▪ Google defined the job title ▪ Google SRE was created in 2003 ▪ No NOC ▪ A team that focuses on reliability
▪ Focus on service ▪ Focus on engineering
▪ Benjamin Treynor Sloss
▪ The number one feature for a product is that it works.
▪ The second most important feature for a product is that it works.
▪ The third most important feature for a product is that it works.
Fast Feedback
“You build it, you run it”
- Werner Vogels
▪ Fast Feedback
▪ Design for failure ▪ Adaptive systems - Feedback loops ▪ Developer managed service ▪ Contingency, peer reviews and pairing ▪ Embedded engineers
▪ Design for Failure
▪ Software-based resiliency is typically better than hardware-based
▪ Cost ▪ Easier to change (fix, upgrade, replace) ▪ Faster to fix ▪ Easier to experiment
▪ Design for Failure
▪ MTTR over MTBF ▪ Game Days ▪ Chaos Monkey(s) ▪ Fault Injection
▪ Fast Feedback
▪ A/B Testing ▪ Dark Deploys ▪ Inject Deployment Metrics in Monitoring ▪ Developers Wear Pagers ▪ Pair Programming ▪ Peer Reviews
▪ Deploys - Upgrading Live Services
▪ Rolling Upgrades ▪ Canary ▪ Blue Green Deploys ▪ Feature Toggles
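Canary releases and feature toggles both need a stable way to decide who sees the new code. One common approach, sketched here with a hypothetical `in_rollout` function, hashes the user and feature name into a bucket from 0 to 99.

```python
import hashlib


def in_rollout(user_id, feature, percent):
    """Return True if this user falls inside the rollout percentage."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


# Start the canary at 5% of users, then raise the number as confidence grows.
canary_users = [u for u in ("alice", "bob", "carol") if in_rollout(u, "new-checkout", 5)]
```

Because the bucket is derived from a hash, a user who is inside the rollout at 5% stays inside it at 50%, and setting the percentage to 0 acts as a kill switch.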
“Reality is made up of circles but we see straight lines”
- Peter Senge
▪ Peer Reviews - Guidelines
▪ All changes are peer reviewed ▪ Everyone monitors the commit logs ▪ High risk changes should include an SME ▪ Break up larger changes into smaller ones
▪ Pairing
▪ Pair programming for everything ▪ Pair programming is slower but decreases bugs by as much as 70% to 80% ▪ Spreads knowledge ▪ Great for training ▪ Set up pair times ▪ Need a culture that values pair programming
▪ Embedded Engineers
▪ Operations in development ▪ Development in operations
▪ ChatOps
“Everyone is pairing all the time”
Jesse Newland (Github)
▪ ChatOps Definition (Atlassian)
▪ ChatOps is a collaboration model that connects people, tools, process, and automation into a transparent workflow. This flow connects the work needed, the work happening, and the work done in a persistent location staffed by the people, bots, and related tools.
Source: http://blogs.atlassian.com/2016/01/what-is-chatops-adoption-guide/
▪ ChatOps Origins
▪ Originally based on chat bots ▪ Github’s use of Hubot ▪ Jesse Newland - ChatOps at Github ▪ Putting tools in the middle of the conversation
▪ ChatOps Chat Tools
▪ Slack ▪ Campfire ▪ HipChat
▪ ChatOps Benefits
▪ It’s like a multiuser terminal where everyone can see the conversation and the commands interwoven.
▪ There is a historical record of the commands and the conversation.
▪ Provides a great training tool - teaching by doing.
▪ Great for tactical incident resolution - everyone gets to see the conversation and commands.
▪ Dynamically manage the on-call rotation.
▪ Can manage all aspects of the “devops” practices from one central place.
▪ Mobile operations tool for free.
▪ ChatOps Examples
▪ Run a command ▪ Deploy code ▪ Check logs ▪ Check status from Github or Jenkins ▪ Change the on call rotation ▪ Check Nagios alert ▪ Graph monitoring or alert data ▪ Take a system online or offline ▪ Kill a job or process ▪ Answer help desk questions (ML)
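To make the idea concrete, here is a toy command dispatcher in the spirit of a chat bot. It is not Hubot's API: the `ChatBot` class, the `deploy` command, and the transcript format are all invented for illustration.

```python
class ChatBot:
    """Toy ChatOps bot: routes chat messages to registered commands."""

    def __init__(self):
        self.commands = {}
        self.transcript = []  # shared record of conversation and commands

    def command(self, name):
        def register(func):
            self.commands[name] = func
            return func
        return register

    def handle(self, user, message):
        self.transcript.append((user, message))
        parts = message.split()
        if not parts:
            return ""
        name, args = parts[0], parts[1:]
        handler = self.commands.get(name)
        reply = handler(*args) if handler else f"unknown command: {name}"
        self.transcript.append(("bot", reply))
        return reply


bot = ChatBot()


@bot.command("deploy")
def deploy(app, env):
    # A real handler would call the deployment tooling; this one only reports.
    return f"deploying {app} to {env}"


reply = bot.handle("alice", "deploy web prod")
```

The `transcript` list is what gives everyone in the room the same view of who ran what and what came back.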
Understanding Monitoring
“It’s not the upfront capital that kills you, it’s the operations and maintenance on the back end.”
- Gene Kim
▪ The Visible Ops Handbook (Kim, Behr, Spafford)
▪ Culture of Causality
▪ 80% of all outages are caused by a change
▪ 80% of restoration time is spent trying to figure out what changed
▪ High performance organizations look for the most recent change first
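"Look for the most recent change first" is easy to automate once changes are recorded somewhere. A small sketch, assuming a change log held as a list of dictionaries with hypothetical `at` and `what` keys and made-up sample entries:

```python
from datetime import datetime


def recent_changes(change_log, incident_time, limit=3):
    """Return the changes made at or before the incident, newest first."""
    prior = [c for c in change_log if c["at"] <= incident_time]
    return sorted(prior, key=lambda c: c["at"], reverse=True)[:limit]


change_log = [
    {"at": datetime(2024, 1, 10, 9, 0), "what": "kernel patch on db hosts"},
    {"at": datetime(2024, 1, 10, 13, 30), "what": "deploy checkout service"},
    {"at": datetime(2024, 1, 10, 15, 0), "what": "rotate TLS certificates"},
]
suspects = recent_changes(change_log, datetime(2024, 1, 10, 14, 0))
```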
▪ Advanced Application Monitoring Tools
▪ New Relic ▪ AppDynamics ▪ Dynatrace
▪ SaaS Monitoring Tools
▪ Datadog ▪ Honeycomb ▪ SignalFx
▪ Why Monitor
▪ Alerting ▪ Visualizing ▪ Collecting ▪ Trending ▪ Anomalies ▪ Learning
▪ Google’s Four Golden Signals
▪ Latency ▪ Traffic ▪ Errors ▪ Saturation
▪ Looking at the Service Stack
▪ Business Indicators ▪ Application Indicators ▪ Infrastructure Indicators ▪ User Based Indicators ▪ Deployment Indicators
▪ Other Examples
▪ Resolution times ▪ Abandoned shopping carts ▪ Sales transactions ▪ Churn rate ▪ Deployment promotions ▪ Lead time ▪ Forum posts
▪ Monitoring Deployments
Source: Mike Brittain - Etsy Code as Craft
▪ Werner Vogels - Monitoring Question
▪ We monitor a lot of stuff but there is only one metric we care about. Order rate. We have years of heuristics telling us its upper and lower limits.
▪ Components of a monitoring system
▪ Sensing/Measuring ▪ Collecting ▪ Analysis/Computation ▪ Alerting ▪ Escalation ▪ Visualization
Source: Limoncelli - The Practice of Cloud System Administration V2
▪ Black Box vs White Box
▪ Black Box Monitoring ▪ Symptom based ▪ Active Problems ▪ User’s experience
▪ White Box Monitoring ▪ Agents ▪ Logs ▪ Instrumentation
▪ Types of Metrics (Raw)
▪ Gauges ▪ Counters ▪ Timers
▪ Types of Metrics (Derived)
▪ Delta ▪ Rates ▪ Ratios
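Derived metrics are computed from successive samples of raw ones. The sketch below uses two readings of a request counter and an error counter taken 60 seconds apart; the function names and sample values are illustrative.

```python
def delta(previous, current):
    """Change in a counter between two samples."""
    return current - previous


def rate(previous, current, seconds):
    """Change per second between two samples."""
    return (current - previous) / seconds


def ratio(part, whole):
    """Share of one quantity in another, e.g. errors per request."""
    return part / whole if whole else 0.0


# Two samples of monotonically increasing counters, 60 seconds apart.
requests_delta = delta(1000, 1600)
requests_per_second = rate(1000, 1600, seconds=60)
error_ratio = ratio(delta(10, 22), requests_delta)
```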
▪ Analysis
▪ Real Time ▪ Correlation ▪ Historical ▪ Anomaly Detection ▪ Machine Learning
▪ Statistical Analysis
▪ Mean ▪ Median ▪ Percentiles ▪ Standard Deviation ▪ Median Absolute Deviation
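All of these statistics are available from Python's standard `statistics` module. A small sketch follows; the `summarize` helper and the sample latencies are made up for the example.

```python
import statistics


def summarize(samples):
    """Summary statistics for a list of numeric samples (needs at least two)."""
    median = statistics.median(samples)
    # 99 percentile cut points; "inclusive" treats the samples as the population.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "mean": statistics.mean(samples),
        "median": median,
        "p95": cuts[94],
        "p99": cuts[98],
        "stdev": statistics.stdev(samples),
        "mad": statistics.median([abs(x - median) for x in samples]),
    }


latencies_ms = [10, 12, 11, 13, 12, 11, 250]
summary = summarize(latencies_ms)
```

With one slow request in the sample, the mean and standard deviation move a long way while the median and median absolute deviation barely change.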
68–95–99.7 Rule
Source: Wikipedia
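For normally distributed data, the share of values within k standard deviations of the mean is erf(k / sqrt(2)), which is where the 68, 95 and 99.7 figures come from. A quick check (the function name is chosen for the example):

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.4f}")
```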
▪ Non-Gaussian Distribution Data
▪ Most IT operations and performance data doesn’t have a Gaussian distribution
▪ This can lead to over or under alerting
▪ Median ▪ Median Absolute Deviation
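A sketch of why the median and median absolute deviation (MAD) hold up better on skewed data than the mean and standard deviation. The 0.6745 scale factor and the 3.5 cutoff are a commonly used "modified z-score" convention, assumed here and not taken from the slides.

```python
import statistics


def sigma_outliers(samples, k=3.0):
    """Flag values more than k standard deviations from the mean."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return [x for x in samples if abs(x - mean) > k * stdev]


def mad_outliers(samples, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    median = statistics.median(samples)
    mad = statistics.median([abs(x - median) for x in samples])
    if mad == 0:
        return []  # no spread around the median, nothing to scale by
    # 0.6745 makes the MAD comparable to a standard deviation for normal data.
    return [x for x in samples if 0.6745 * abs(x - median) / mad > threshold]


latencies_ms = [10, 12, 11, 13, 12, 11, 250, 240]
missed = sigma_outliers(latencies_ms)
found = mad_outliers(latencies_ms)
```

Here the two slow requests inflate the standard deviation so much that the three-sigma test flags nothing, while the MAD-based test catches both. That is the under-alerting case.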
▪ Histograms
▪ Percentiles
▪ Inverse Quantiles
▪ Instead of measuring how slow the slowest transactions are (99th percentile)
▪ Measure what fraction of transactions are too slow (slower than a chosen threshold)
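An inverse quantile turns the question around: pick the latency that counts as too slow and measure the share of transactions beyond it. A minimal sketch with an assumed one-second threshold and made-up sample data:

```python
def fraction_slower_than(latencies, threshold):
    """Share of transactions whose latency exceeds the threshold."""
    if not latencies:
        return 0.0
    return sum(1 for x in latencies if x > threshold) / len(latencies)


latencies_s = [0.1, 0.2, 0.3, 1.5, 2.0]
too_slow = fraction_slower_than(latencies_s, threshold=1.0)
```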
▪ Modality Changes
Source: Theo Schlossnagel http://www.slideshare.net/postwait/adaptive-availability
▪ Anomaly Detection
▪ Finding patterns in data that do not conform to expected behavior
▪ Can be used for noise reduction
Source: Chandola - Anomaly Detection : A Survey
▪ Anomaly Detection - Research Areas
▪ Statistics ▪ Machine Learning ▪ Information Theory ▪ Data Mining
Source: Chandola - Anomaly Detection : A Survey
▪ Anomaly Detection - Characteristics
▪ High Cardinality ▪ Minimizing False Positives ▪ Seasonality ▪ Non-Normal Distributions
Source: Chandola - Anomaly Detection : A Survey
▪ Anomaly Detection - Netflix
▪ eBay Case Study
Source: http://www.ebaytechblog.com/2015/08/19/statistical-anomaly-detection/
Understanding Complexity
“I smile and start to count on my fingers: One, people are good. Two, every conflict can be removed. Three, every situation, no matter how complex it initially looks, is exceedingly simple. Four, every situation can be substantially improved; even the sky is not the limit. Five, every person can reach a full life. Six, there is always a win-win solution. Shall I continue to count?”
-Eliyahu M. Goldratt
▪ Complexity
▪ In Search of Certainty
▪ Mark Burgess invented Desired State Configuration Management 20+ years ago
▪ Created Promise Theory 10+ years ago
▪ Uses realms of physics and biology to assert that uncertainty is an inescapable fact of technology.
▪ Cybernetics
▪ Defined by Norbert Wiener in 1948 ▪ Circular Causality ▪ Self Steering Approach ▪ Listen, Calibrate, Change and Adapt ▪ Systemic Approach
▪ Cynefin
▪ Defined by Dave Snowden ▪ Designed to describe the evolutionary nature of complex systems ▪ Draws on research from complex adaptive systems theory, cognitive science, anthropology and psychology
Source: Wikipedia - Cynefin
▪ Cause and Effect is Obvious (Simple domain)
▪ Sense: see what’s coming in ▪ Categorise: make it fit predetermined categories ▪ Respond: decide what to do
Source: Wikipedia - Cynefin
▪ Cause and Effect Requires Analysis (Complicated domain)
▪ Sense: see what’s coming in ▪ Analyse: investigate or analyse, using expert knowledge ▪ Respond: decide what to do
Source: Wikipedia - Cynefin
▪ Cause and Effect in Retrospect (Complex domain)
▪ Probe: experimental input ▪ Sense: failures or successes ▪ Respond: decide what to do, amplify or dampen
Source: Wikipedia - Cynefin
▪ Cause and Effect Undetermined (Chaotic domain)
▪ Act: attempt to stabilize ▪ Sense: failures or successes ▪ Respond: decide what to do next
Source: Wikipedia - Cynefin
▪ Circuit Breaker Patterns
▪ Wrap a protected function call in a circuit breaker object ▪ Monitors for failures ▪ When a threshold is met, trip the circuit breaker ▪ Calls are then returned with an error
Source: Martin Fowler http://martinfowler.com/bliki/CircuitBreaker.html
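The circuit breaker steps described above can be sketched as a small wrapper class. This is an illustration only: the class and parameter names are hypothetical, and the `reset_timeout` that lets a trial call through after a pause is an assumption added so the breaker can close again.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, func, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.func = func
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call not attempted")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = self.func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Once tripped, callers get an immediate `CircuitOpenError` and the failing dependency is left alone instead of being hit with more requests.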
▪ Netflix - Circuit Breaker - Hystrix
▪ Give protection from and control over latency and failure from dependencies accessed
▪ Stop cascading failures in a complex distributed system.
▪ Fail fast and rapidly recover.
▪ Fallback and gracefully degrade when possible.
▪ Enable near real-time monitoring, alerting, and operational control.
▪ Netflix - Circuit Breaker - Hystrix
▪ Isolates access points between services ▪ Can set up triggers (trip if 10 calls within 10 seconds take longer than 5 seconds) ▪ Provides fallback options (error, default value, null value, or special error)
Source: https://github.com/netflix/hystrix/wiki
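The trigger and fallback ideas from the Hystrix slides can be illustrated without Hystrix itself, which is a Java library. The Python sketch below is not its API: `SlowCallTrigger` and `with_fallback` are hypothetical names, and the defaults mirror the "10 calls within 10 seconds take longer than 5 seconds" example.

```python
from collections import deque


class SlowCallTrigger:
    """Trips when enough slow calls land inside a sliding time window."""

    def __init__(self, max_slow_calls=10, window_seconds=10.0, slow_threshold=5.0):
        self.max_slow_calls = max_slow_calls
        self.window_seconds = window_seconds
        self.slow_threshold = slow_threshold
        self.slow_call_times = deque()

    def record(self, finished_at, duration):
        """Record one call; return True if the breaker should trip."""
        if duration > self.slow_threshold:
            self.slow_call_times.append(finished_at)
        # Forget slow calls that have fallen out of the window.
        while (self.slow_call_times
               and finished_at - self.slow_call_times[0] > self.window_seconds):
            self.slow_call_times.popleft()
        return len(self.slow_call_times) >= self.max_slow_calls


def with_fallback(func, fallback_value):
    """Wrap func so that any failure returns a default value instead."""
    def wrapped(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            return fallback_value
    return wrapped
```

Returning a default value is only one of the fallback options listed above. Re-raising a specific error or returning `None` would follow the same wrapper shape.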
▪ Other Users of the Circuit Breaker Pattern
▪ Spring Boot ▪ Nginx Plus ▪ Envoy (Istio)