Building Enterprise SRE with Istio

Post on 18-Aug-2021


Transcript

Building Enterprise SRE with Istio

Hybrid Specialist: Shawn Ho

shawnho@google.com

1. What is SRE?

Product Lifecycle

Concept Business Development Operations Market

Agile solves this

DevOps solves this

Developers: Agility

Operators: Stability

Dev & Ops’ KPIs aren't Aligned

What is the relationship between DevOps and SRE?

● DevOps is an abstract concept: a set of guidelines and disciplines for breaking down the silos between development and operations.

● SRE is Google's concrete implementation of DevOps practices.

“class SRE implements DevOps”

Self-Service Platform

Monitoring Automation

CI/CD

SRE

Developers

Class SRE = REAL PERSON

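#1. Decisions based on data: every decision is grounded in data.

#2. Be user centric: even if every monitoring metric looks normal, the system is unstable whenever customers perceive it as unstable.

#3. Blameless culture & shared responsibility: breaking down departmental silos starts with sharing responsibility across teams (developers, operators, leaders). A system failure is not the operators' responsibility alone; code quality, technical debt, and other factors can all be causes.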

2. How to Implement SRE with Istio/Anthos?

Istio in 2 minutes

[Diagram: Istio architecture. Service A and Service B each sit behind a sidecar proxy, and proxy-to-proxy traffic (HTTP, gRPC, TCP) is secured with mTLS and local authz. An Ingress Gateway and an Egress Gateway apply perimeter security policies; the labels Internal App 1, External App 1, and JWT + TLS appear at the edges. The Istio control plane shows Galley (control plane API on the K8s API server), Pilot (routing + secure naming), Citadel (cert issuance, cert authority plugin), and policy enforcement + reporting (logging plugin, monitoring plugin). Legend: data flow; control + metrics flow.]

What does SRE implement on Platform?

● Metrics & monitoring: SLO, dashboard, analytics

● Capacity planning: forecasting, demand-driven, performance

● Emergency response: on-call, incident analysis, postmortems

● Change management: release process, consulting design, automation

● Culture: toil management, blamelessness, shared responsibility


Monitoring and Incident Management

Understand system architecture

Understand system architecture and deployed topology

System monitoring

Monitor the system by gathering blackbox and whitebox metrics.

SLIs and SLOs are extracted from the metrics and logs.

The information is visualized through dashboards.

Log handling

Managing planned event (release, maintenance)

Incident handling

Create an incident ticket.

Roll back the change to resolve the incident.

Investigate the root cause with logging, monitoring metrics, and debugging.

Postmortem

Review the incident in retrospect and prepare a plan to prevent recurrence.

What to Monitor?

SLO = SLI + Target

“99% of REST API calls will complete in less than 100 ms every week”

SLI (service level indicator): a well-defined measure of "good enough"

• used to specify the SLO/SLA

SLO (service level objective): a top-line target for the fraction of good interactions

• specifies goals (SLI + Target)

SLA (service level agreement): consequences

• SLA = (SLO + margin) + consequences = SLI + Target + consequences

Error budget: product management and SRE define an availability target.

• 100% minus the availability target is a “budget of unreliability” (the error budget).
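The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SLO` class, the function names, and the one-week latency sample are hypothetical and do not come from the slides; the 100 ms threshold and 99% target are taken from the example SLO.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """An SLO pairs an SLI threshold with a target fraction of good events."""
    latency_threshold_ms: float  # SLI: a call is "good" if it finishes under this
    target: float                # e.g. 0.99 for "99% of calls"


def sli_good_fraction(latencies_ms, threshold_ms):
    """SLI: fraction of calls that completed faster than the threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed in the window")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def error_budget(slo):
    """Error budget = 100% minus the target, expressed as a fraction."""
    return 1.0 - slo.target


def slo_met(latencies_ms, slo):
    """True when the measured SLI reaches the SLO target."""
    return sli_good_fraction(latencies_ms, slo.latency_threshold_ms) >= slo.target


# Hypothetical one-week sample: 990 fast calls and 10 slow ones.
week = [20.0] * 990 + [250.0] * 10
api_slo = SLO(latency_threshold_ms=100.0, target=0.99)
print(sli_good_fraction(week, api_slo.latency_threshold_ms))
print(slo_met(week, api_slo))
print(error_budget(api_slo))
```

An SLA would add a margin and contractual consequences on top of this check, which is outside what code alone can express.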

Availability SLO   Allowed unavailability window                     Error budget
                   per year        per quarter      per 30 days      (error rate 1%)
90%                36.5 days       9 days           3 days           90
95%                18.25 days      4.5 days         1.5 days         80
99%                3.65 days       21.6 hours       7.2 hours        0
99.5%              1.83 days       10.8 hours       3.6 hours        -100
99.9%              8.76 hours      2.16 hours       43.2 minutes     -900
99.95%             4.38 hours      1.08 hours       21.6 minutes     -1900
99.99%             52.6 minutes    12.96 minutes    4.32 minutes     -9900
99.999%            5.26 minutes    1.30 minutes     25.9 seconds     -99900

Error Budget (Availability)
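The figures in the table can be reproduced with a short calculation. The sketch below assumes a 365-day year and a 90-day quarter, which matches the listed windows, and it reads the last column as the percentage of error budget still left when the observed error rate is 1%. That reading is an inference from the numbers, not something the slide states. The function names are hypothetical.

```python
def allowed_downtime_hours(availability_pct, window_days):
    """Hours of unavailability permitted by an availability SLO over a window."""
    return (100.0 - availability_pct) / 100.0 * window_days * 24.0


def budget_remaining_pct(availability_pct, error_rate_pct):
    """Share of the error budget left, in percent; negative means overspent."""
    budget_pct = 100.0 - availability_pct
    return (budget_pct - error_rate_pct) / budget_pct * 100.0


for slo in (90.0, 95.0, 99.0, 99.9, 99.99):
    print(
        f"{slo}%",
        f"{allowed_downtime_hours(slo, 365):.2f} h/year",
        f"{allowed_downtime_hours(slo, 30):.2f} h/30 days",
        f"{round(budget_remaining_pct(slo, 1.0))}% budget left at 1% errors",
    )
```

Under this reading, a service running at a 1% error rate has spent exactly its whole budget against a 99% SLO and has overspent it nine times over against a 99.9% SLO.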

Demo with Anthos: Monitoring + Incident Mgmt

● Topology

● SLO/SLI Metrics

● Blackbox/Whitebox

● Log Viewer

● Tracing/Tracing Report

[Screenshots: Topology, Blackbox, Whitebox, Logging, Tracing]

Error Budget Burn Down Rate
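One common way to quantify how fast an error budget burns down is a burn rate: the observed error rate divided by the budgeted error rate. The sketch below uses that definition as an assumption, since the slide shows only the title. The function names and the 30-day window are hypothetical.

```python
def burn_rate(observed_error_rate, slo_target):
    """How fast the budget is being spent; about 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return observed_error_rate / budget


def days_until_exhausted(observed_error_rate, slo_target, window_days=30.0):
    """Days until a full budget is spent if the current error rate persists."""
    rate = burn_rate(observed_error_rate, slo_target)
    return float("inf") if rate == 0 else window_days / rate


# Hypothetical reading: 2% of requests failing against a 99% SLO.
print(burn_rate(0.02, 0.99))
print(days_until_exhausted(0.02, 0.99))
```

A burn rate well above 1 means the budget will be gone before the window ends, which is a natural condition to alert on.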

Demo with Anthos: Proactively Reducing Error Budget Consumption

● Alert Setting

● Canary Deployment

● Cross-Region Deployment

[Diagram: Clients, Cloud Load Balancing, and two Kubernetes Engine clusters (Taiwan-1 and Singapore); traffic weights 10 / 90]
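The effect of a weighted split such as 10 / 90 can be simulated in a few lines. This is a sketch of the idea only, not the configuration used in the demo: the slide does not say which cluster receives which share, so the assignment below is an assumption, and `pick_backend` and `simulate` are hypothetical names.

```python
import random


def pick_backend(weights, rng):
    """Choose a backend name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate(weights, requests, seed=0):
    """Count how many simulated requests land on each backend."""
    rng = random.Random(seed)
    counts = {name: 0 for name in weights}
    for _ in range(requests):
        counts[pick_backend(weights, rng)] += 1
    return counts


# Assumed assignment of the 10 / 90 weights; the slide does not specify it.
canary_split = {"taiwan-1": 10, "singapore": 90}
print(simulate(canary_split, 10_000))

# Later stage of the rollout shown on the slides: an even split.
print(simulate({"taiwan-1": 50, "singapore": 50}, 10_000))
```

Sending only a small share of traffic to a new release or region limits how much error budget a bad change can consume before it is caught.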

[Diagram: Clients, Cloud Load Balancing, and two Kubernetes Engine clusters (Taiwan-1 and Singapore); traffic weights 50 / 50]


Capacity planning

Plan for organic growth

Increased product adoption and usage by customers.

Determine inorganic growth

Sudden jumps in demand due to feature launches, marketing campaigns, etc.
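Organic and inorganic growth can be combined in a simple projection. The sketch below assumes compound monthly growth for the organic part and a one-off step increase for each launch that then persists. Both are modelling assumptions, and the function name and numbers are hypothetical.

```python
def forecast_demand(current_qps, monthly_growth, months, launches=None):
    """Project demand: compound organic growth plus one-off launch jumps.

    launches maps a 1-based month index to the extra QPS expected from
    that event; the extra load is assumed to persist afterwards.
    """
    launches = launches or {}
    demand = current_qps
    projection = []
    for month in range(1, months + 1):
        demand *= 1.0 + monthly_growth        # organic growth
        demand += launches.get(month, 0.0)    # inorganic jump
        projection.append(demand)
    return projection


# Hypothetical inputs: 1000 QPS today, 5% monthly growth, a launch in month 2.
print(forecast_demand(1000.0, 0.05, 3, launches={2: 500.0}))
```

Capacity is then provisioned against the projected peak plus whatever headroom the SLO requires.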

Change Management

Roughly 70% of outages are due to changes in a live system

[Diagram: Kubernetes configuration service and continuous deployment. Labels: Clients; Cloud Source Repositories; Kubernetes Engine clusters (multiple instances) on GCP and on-premises (On-Prem1); Anthos Hub Service; NAT]

Demo with Anthos: The Power of GitOps

Summary + Call for Action

● SRE has 3 key principles:

○ Decisions based on data (meaningful monitoring)

○ Be user centric (blackbox testing)

○ Blameless culture & shared responsibility (share the responsibility, work together)

● Kubernetes is a perfect platform to implement SRE

○ SLI + SLO + Error Budget

○ Watch the budget burn rate

○ Establish CI + CD with GitOps

● Pick a system and build your SRE practices

Cover images used with permission. These books can be found on shop.oreilly.com.
