Post on 18-Aug-2021
Building an Enterprise SRE Practice with Istio
Hybrid Specialist: Shawn Ho
shawnho@google.com
1. What is SRE?
Product Lifecycle
Concept → Business → Development → Operations → Market
Agile solves this
DevOps solves this
Developers: Agility
Operators: Stability
Dev & Ops’ KPIs aren't Aligned
What is the relationship between DevOps and SRE?
● DevOps is more of an abstract concept: guidelines and disciplines for breaking down silos between development and operations
● SRE is Google's concrete, realized practice of DevOps.
“class SRE implements DevOps”
Self-Service Platform
Monitoring Automation
CI/CD
SRE
Developers
Class SRE = REAL PERSON
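#1. Decision based on data: every decision is grounded in data.
#2. Be user centric: even if every monitoring metric looks normal, the system is unstable as long as customers feel it is unstable.
#3. Blameless culture & shared responsibility: breaking down departmental silos starts with sharing responsibility across teams (Developers, Operators, Leaders).
A system failure is not only the operators' responsibility; code quality, technical debt, and similar factors can all be causes.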
2. How to Implement SRE with Istio/Anthos?
Istio in 2 minutes
[Diagram labels: Service A and Service B, each with a proxy; HTTP, gRPC, TCP; mTLS between proxies; Ingress Gateway and Egress Gateway; JWT + TLS; Local Authz; perimeter security policies; Internal App 1; External App 1. Istio Control Plane (Control Plane API on K8S API Server): Pilot (Routing + Secure Naming); Policy Enforcement + Reporting (logging plugin, monitoring plugin); Citadel (cert issuance, cert authority plugin). Legend: data flow; control + metrics flow.]
What does SRE implement on Platform?
Metrics & monitoring
● SLO ● Dashboard ● Analytics
Capacity planning
● Forecasting ● Demand-driven ● Performance
Emergency response
● Oncall ● Incident analysis ● Postmortems
Change management
● Release process ● Consulting design ● Automations
Culture
● Toil management ● Blamelessness ● Shared responsibility
Monitoring and Incident Management
Understand system architecture
Understand system architecture and deployed topology
System monitoring
Monitor the system by gathering blackbox and whitebox metrics
SLIs and SLOs are extracted from the metrics and logs.
The information is visualized through dashboards
Log handling
Managing planned events (release, maintenance)
Incident handling
Create an incident ticket
Roll back the change to resolve the incident
Investigate the root cause with logging, monitoring metrics, and debugging.
Postmortem
Review the incident in retrospect and prepare a plan to prevent recurrence
What to Monitor?
SLO = SLI + Target
“99% of REST API calls will complete in less than 100 ms every week”
(SLI: REST API calls completing in less than 100 ms; Target: 99% every week)
SLI (service level indicator): a well-defined measure of 'good enough'
• used to specify SLO/SLA
SLO (service level objective): a top-line target for the fraction of good interactions
• specifies goals (SLI + Target)
SLA (service level agreement): consequences
• SLA = (SLO + margin) + consequences = SLI + Target + consequences
Error Budget: product management & SRE define an availability target.
• 100% minus the availability target is a “budget of unreliability” (the error budget).
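The latency SLO quoted above (99% of REST API calls under 100 ms per week) can be checked mechanically. The following is a minimal Python sketch, not something shown in the talk: the function names and the sample latencies are hypothetical, and a real setup would read the samples from a monitoring backend.

```python
def latency_sli(latencies_ms, threshold_ms=100.0):
    """Return the fraction of calls that finished faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, target=0.99):
    """An SLO is an SLI plus a target: the SLO holds when the SLI reaches it."""
    return sli >= target


# Hypothetical week of samples: 99 fast calls and 1 slow call.
week = [20.0] * 99 + [250.0]
sli = latency_sli(week)
print(f"SLI = {sli:.2%}, SLO met: {meets_slo(sli)}")
```

Keeping the indicator (`latency_sli`) separate from the target check (`meets_slo`) mirrors the SLO = SLI + Target split: the same indicator can back both an internal SLO and a looser SLA.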
Error Budget (Availability)
Availability SLO | Allowed unavailability window                 | Error budget (%) at 1% error rate
                 | per year     | per quarter   | per 30 days   |
90%              | 36.5 days    | 9 days        | 3 days        | 90
95%              | 18.25 days   | 4.5 days      | 1.5 days      | 80
99%              | 3.65 days    | 21.6 hours    | 7.2 hours     | 0
99.5%            | 1.83 days    | 10.8 hours    | 3.6 hours     | -100
99.9%            | 8.76 hours   | 2.16 hours    | 43.2 minutes  | -900
99.95%           | 4.38 hours   | 1.08 hours    | 21.6 minutes  | -1900
99.99%           | 52.6 minutes | 12.96 minutes | 4.32 minutes  | -9900
99.999%          | 5.26 minutes | 1.30 minutes  | 25.9 seconds  | -99900
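The figures in the availability table follow from simple arithmetic. As a sketch (the function names are hypothetical, and reading the last column as "budget left when 1% of requests fail" is an interpretation that matches the numbers), they can be reproduced like this:

```python
def allowed_downtime_minutes(slo_percent, window_days):
    """Minutes of unavailability the SLO tolerates over the given window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60


def budget_remaining_percent(slo_percent, error_rate_percent):
    """Share of the error budget still unspent at the observed error rate.

    Negative values mean the budget is overspent.
    """
    budget = 100.0 - slo_percent
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return (budget - error_rate_percent) / budget * 100.0


for slo in (90.0, 99.0, 99.9):
    minutes = allowed_downtime_minutes(slo, window_days=30)
    left = budget_remaining_percent(slo, error_rate_percent=1.0)
    print(f"SLO {slo}%: {minutes:.1f} min per 30 days, {left:.0f}% budget left")
```

For a 99.9% SLO this gives 43.2 minutes per 30 days, and a 1% error rate overspends the budget ninefold, which is why tight SLOs leave so little room for risky changes.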
Demo with Anthos: Monitoring + Incident Mgmt
● Topology
● SLO/SLI Metrics
● Blackbox/Whitebox
● Log Viewer
● Tracing/Tracing Report
[Demo screenshots: Topology, Blackbox, Whitebox, Logging, Tracing]
Error Budget Burn Down Rate
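The slide gives only the title, so the following is a sketch under a stated assumption: it uses the common definition of burn rate as the observed error rate divided by the error budget, where a burn rate of 1 spends the budget exactly over the SLO window. The function names and the 30-day window are hypothetical.

```python
import math


def burn_rate(error_rate, slo_target):
    """Observed error rate relative to the error budget (both as fractions)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return error_rate / budget


def days_until_exhausted(rate, window_days=30.0):
    """Days until the whole budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return math.inf
    return window_days / rate


rate = burn_rate(error_rate=0.01, slo_target=0.999)
print(f"burn rate {rate:.1f}, budget gone in {days_until_exhausted(rate):.1f} days")
```

Alerting on burn rate rather than on raw error rate ties the alert to the SLO: the same 1% error rate is harmless against a 95% SLO and urgent against a 99.9% one.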
Demo with Anthos: Proactively Reducing Error Budget Consumption
● Alert Setting
● Canary Deployment
● Cross-Region Deployment
[Diagram: Clients reach Cloud Load Balancing, which splits traffic between two Kubernetes Engine clusters, Taiwan-1 and Singapore. The split is 10/90 on the first slide and 50/50 on the second.]
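The weighted split in the demo (10/90, then 50/50) can be illustrated with a small routing sketch. This is not how Cloud Load Balancing or Istio implements it; `pick_backend` is a hypothetical helper, and which cluster receives the 10% share is an assumption, since the slides do not say.

```python
import hashlib
from collections import Counter


def pick_backend(request_id, weights):
    """Map a request to a backend in proportion to integer weights.

    Hashing the request id keeps the choice stable for a given request.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    for backend, weight in weights.items():
        if bucket < weight:
            return backend
        bucket -= weight
    raise AssertionError("unreachable: bucket is always below the total weight")


# Assumed assignment: start the new cluster at 10%, then widen to 50%.
for weights in ({"taiwan-1": 10, "singapore": 90}, {"taiwan-1": 50, "singapore": 50}):
    counts = Counter(pick_backend(f"req-{i}", weights) for i in range(10_000))
    print(weights, dict(counts))
```

Starting a new version or region at a small weight limits how much error budget a bad rollout can burn before an alert fires and the weights are moved back.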
Capacity planning
Plan for organic growth
Increased product adoption and usage by customers.
Determine inorganic growth
Sudden jumps in demand due to feature launches, marketing campaigns, etc.
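Organic and inorganic growth can be combined into one rough forecast. The sketch below is illustrative only: the compound-growth model, the 30% headroom, and every number in it are assumptions, not figures from the talk.

```python
import math


def forecast_demand(current_qps, monthly_growth, months, launch_bumps=None):
    """Projected peak QPS: compound organic growth plus one-off launch bumps.

    launch_bumps maps a month number to the extra QPS expected from a
    feature launch or marketing campaign in that month (inorganic growth).
    """
    organic = current_qps * (1.0 + monthly_growth) ** months
    bumps = launch_bumps or {}
    inorganic = sum(extra for month, extra in bumps.items() if month <= months)
    return organic + inorganic


def required_replicas(qps, qps_per_replica, headroom=0.3):
    """Replicas needed to serve qps with spare capacity for spikes."""
    return math.ceil(qps * (1.0 + headroom) / qps_per_replica)


# Hypothetical service: 1000 QPS today, 10% monthly growth, a launch in month 1.
demand = forecast_demand(1000.0, 0.10, months=2, launch_bumps={1: 500.0})
print(f"forecast {demand:.0f} QPS, replicas needed: {required_replicas(demand, 100.0)}")
```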
Change Management
Roughly 70% of outages are due to changes in a live system
Kubernetes Configuration Service: Continuous Deployment
[Diagram labels: Clients; Cloud Source Repositories; Anthos Hub Service; NAT; Kubernetes Engine clusters on GCP (multiple instances) and on-premises (On-Prem1).]
Demo with Anthos: The Power of GitOps
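The core idea of GitOps is that the repository holds the desired state and a controller reconciles each cluster toward it. The following toy reconciler is a sketch of that idea only; it is not the Anthos implementation, and `plan_sync` and the sample resources are hypothetical.

```python
def plan_sync(desired, actual):
    """Compute the actions that move a cluster from actual to desired state.

    Both arguments map a resource name to its spec (any comparable value).
    """
    actions = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Desired state as committed to the repository (hypothetical resources).
desired = {"frontend": {"replicas": 3}, "backend": {"replicas": 2}}
# State currently observed in one cluster.
actual = {"frontend": {"replicas": 1}, "legacy-job": {"replicas": 1}}
print(plan_sync(desired, actual))
```

Because every cluster is compared against the same committed state, a rollback is just a revert in the repository, which fits the point that most outages come from changes to a live system.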
Summary + Call for Action
● SRE has 3 key principles:
○ Decision Based on Data (meaningful monitoring)
○ Be User Centric (black-box testing)
○ Blameless Culture & Shared Responsibility (share the responsibility, work together)
● Kubernetes is a perfect platform to implement SRE
○ SLI + SLO + Error Budget
○ Watch for the Budget Burn Rate
○ Establish CI + CD with GitOps
● Pick a System and Build your SRE Practices
Cover images used with permission. These books can be found on shop.oreilly.com.