© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Coburn Watson, Director of Performance and Reliability, Netflix. October 2015. SPOT302. Availability: The New Kind of Innovator’s Dilemma

(SPOT302) Availability: The New Kind of Innovator’s Dilemma

Feb 14, 2017

Transcript
Page 1: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Coburn Watson, Director of Performance and Reliability, Netflix

October 2015

SPOT302

Availability: The New Kind of Innovator’s Dilemma

Page 2: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

@coburnw

• Cloud performance and reliability @ Netflix

• Reduce time-to-detect and time-to-resolve

• Optimize usage of AWS cloud

• Steer global user traffic and support failover

• Inject chaos into production environment

• Build innovative performance analysis tooling

• Drive operational best practice adoption

Page 3: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

• 67M+ subscribers

• > 50 countries

• > 3 billion hours of video streamed monthly

• Massive cloud footprint

• Homegrown CDN

• Strong Originals slate

Page 4: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Atlas

https://netflix.github.io/

Page 5: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

What to Expect from the Session

• Strategies

• Maximizing engineering velocity in the cloud

• Minimizing risks to availability

Page 6: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

The cloud is a journey

…not a destination*

* Adapted from Ralph Waldo Emerson

Page 7: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

The Netflix Cloud Journey

Timeline: 2008, 2010, 2011, 2013, 2015

Milestones, in slide order:

• Datacenter Failure

• Serving off AWS US-EAST-1

• Three AZ Deployments

• Serving off AWS EU-WEST-1

• Chaos Monkey Unleashed

• Serving from AWS US-WEST-2

• Running Active-Active

• Chaos Kong Unleashed

• Last Application to the Cloud

• Active-Active in three AWS regions

Page 8: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

The Innovator’s Dilemma

vs.

Page 9: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Shifting the Curve @ Netflix

• Maintain or improve availability as engineering velocity increases

Page 10: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Maximize Engineering Velocity

"FA-18 Hornet breaking sound barrier (7 July 1999) - filtered" by Ensign John Gay, U.S. Navy

Page 11: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Infrastructure on Demand

• No procurement process

• “all you can eat” **

• Expose IaaS via Spinnaker

• No passwords, no keys

** please don’t eat all of it

Page 12: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Accelerate Code Deployment

• Commit-to-cloud in minutes

• Across three AWS regions

Page 13: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Decouple Services

• µservice architecture (500+ @Netflix)

• One Auto Scaling group per service

• Independent push schedules (1 day to 4 weeks)

• Communicate via API

• Independent databases (280+ Cassandra clusters)

• Minimize aggregate rate of change

• Update code which needs updating…

Page 14: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Minimize Risks to Availability

“If everything seems under control, you're not going fast enough.”

― Mario Andretti

Page 15: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Maximize Infrastructure Stability

• Run on AWS

• Purchase 3-year EC2 Reserved instances (for failover as well)

• Distribute Auto Scaling groups across 3 Availability Zones per region

Page 16: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Propagate Changes Safely into Production

• Rolling regional “red-black” pushes

• Build pipelines & automated canary analysis

• 30 second time-to-detect on critical metrics
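A "red-black" push, as named on this slide, can be illustrated with a minimal sketch. It assumes a simplified model in which the new server group takes traffic only if it is healthy and the old group is disabled rather than destroyed, so a rollback stays cheap. `ServerGroup` and `red_black_push` are hypothetical names for illustration, not Spinnaker APIs.

```python
from dataclasses import dataclass


@dataclass
class ServerGroup:
    """Hypothetical stand-in for one deployed version of a service."""
    name: str
    healthy: bool
    serving: bool = False


def red_black_push(old: ServerGroup, new: ServerGroup) -> ServerGroup:
    """Return the group that serves traffic after the push attempt."""
    if not new.healthy:
        # Assumption: an unhealthy new group never receives traffic.
        return old
    new.serving = True
    # The old group is only disabled, so it can be re-enabled for rollback.
    old.serving = False
    return new
```

Running the same step one region at a time would give the "rolling regional" behaviour the slide mentions; that loop is omitted here.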

Page 17: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Automated Canary Analysis

• Rigorous quality and performance checks are part of the code push

• Canary score is the gate for push
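The idea of a canary score gating a push could be sketched as follows. This is an assumption-laden illustration, not the scoring used at Netflix: the score is taken to be the percentage of metrics on which the canary stays within a relative tolerance of the baseline, and the metric names, `tolerance`, and `threshold` values are all hypothetical.

```python
def canary_score(baseline: dict, canary: dict, tolerance: float = 0.10) -> float:
    """Percentage of metrics where the canary is within `tolerance` of baseline."""
    passed = 0
    for name, base_value in baseline.items():
        allowed = abs(base_value) * tolerance
        if abs(canary[name] - base_value) <= allowed:
            passed += 1
    return 100.0 * passed / len(baseline)


def push_allowed(score: float, threshold: float = 80.0) -> bool:
    """Hypothetical gate: the push proceeds only at or above the threshold."""
    return score >= threshold


# Illustrative numbers only; they are not measurements from the talk.
baseline = {"error_rate": 0.010, "latency_ms": 120.0, "cpu": 0.50, "rps": 1000.0}
canary = {"error_rate": 0.030, "latency_ms": 125.0, "cpu": 0.52, "rps": 990.0}
score = canary_score(baseline, canary)
```

With these inputs only the error rate falls outside tolerance, so three of four metrics pass and the gate would block the push.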

Page 18: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Cross-Service Resiliency

• Isolate misbehaving services

• Open “circuits” and provide fallback experiences

Normal (personalized)

Degraded (unpersonalized)
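The "open circuit, serve a fallback" pattern on this slide might look like the sketch below. It is a deliberately reduced circuit breaker written for illustration: it counts consecutive failures and, once a threshold is reached, stops calling the misbehaving dependency and returns the degraded response instead. It is not Hystrix, and it omits the timed half-open retry a production breaker would need; `CircuitBreaker` and `failure_threshold` are hypothetical names.

```python
class CircuitBreaker:
    """Toy breaker: opens after N consecutive failures, never auto-closes."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, primary, fallback):
        if self.is_open:
            # Isolate the dependency: do not call it at all.
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            return fallback()
        self.failures = 0  # a success resets the consecutive-failure count
        return result
```

In the slide's terms, `primary` would produce the personalized experience and `fallback` the unpersonalized one.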

Page 19: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Improve Time-To-Detect

• 30 second alerts vs. prior 8 minutes

• Utilize streaming analysis infrastructure at the edge tier
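One way to picture streaming detection over a short window is a sliding-window error-rate check. The sketch assumes request outcomes arrive as timestamped events and that an alert fires when the error rate over the last 30 seconds exceeds a limit. The class name and the 5% limit are hypothetical; only the 30-second figure comes from the slide.

```python
from collections import deque


class SlidingWindowAlert:
    """Fires when the error rate inside the window exceeds `max_error_rate`."""

    def __init__(self, window_seconds: float = 30, max_error_rate: float = 0.05):
        self.window_seconds = window_seconds
        self.max_error_rate = max_error_rate
        self.events = deque()  # (timestamp, is_error), oldest first

    def observe(self, timestamp: float, is_error: bool) -> bool:
        """Record one event and return True if the alert should fire."""
        self.events.append((timestamp, is_error))
        cutoff = timestamp - self.window_seconds
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        errors = sum(1 for _, err in self.events if err)
        return errors / len(self.events) > self.max_error_rate
```

Because the check runs on every event rather than on a periodic batch, detection latency is bounded by the window rather than by a polling interval.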

Page 20: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Dynamically Provision Capacity

• Reactively scale Auto Scaling groups
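Reactive scaling can be illustrated with a simple proportional rule: resize the group so that the observed utilization would land back on a target. This is a sketch under assumed parameters (`target`, `min_size`, `max_size` are hypothetical), not the policy Netflix runs or the exact computation AWS Auto Scaling performs.

```python
import math


def desired_capacity(current: int, cpu_utilization: float, target: float = 0.50,
                     min_size: int = 3, max_size: int = 300) -> int:
    """Instance count that would bring utilization back to `target`."""
    wanted = math.ceil(current * cpu_utilization / target)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, wanted))
```

For example, 10 instances at 75% utilization with a 50% target would scale out to 15, while a nearly idle group would scale in only as far as `min_size`.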

Page 21: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Flexibility in Traffic Management

• Target three primary AWS regions

• Maintain capacity to allow regional evacuation
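The capacity needed for a regional evacuation can be worked through with a little arithmetic. The sketch assumes the evacuated region's traffic is redistributed to the surviving regions in proportion to their current load; the function name and the load figures are hypothetical, and only the region names appear elsewhere in the deck.

```python
def load_after_evacuation(load_by_region: dict, evacuated: str) -> dict:
    """Load each surviving region carries once `evacuated` is drained."""
    moved = load_by_region[evacuated]
    survivors = {r: load for r, load in load_by_region.items() if r != evacuated}
    total = sum(survivors.values())
    # Assumption: displaced traffic is split proportionally to existing load.
    return {r: load + moved * load / total for r, load in survivors.items()}


# Illustrative traffic shares in percent; not figures from the talk.
shares = {"us-east-1": 50.0, "us-west-2": 25.0, "eu-west-1": 25.0}
after = load_after_evacuation(shares, "us-east-1")
```

In this example each surviving region doubles from 25 to 50, which is the headroom that would have to be available (or quickly provisioned) before the evacuation starts.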

Page 22: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Frequently Exercise “Chaos”

• Netflix runs regional failover exercises monthly

• Can you spot the chaos?

Page 23: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Frequently Exercise “Chaos”

• Validates

• Failover correctness

• Capacity

• Failover velocity

• Confidence in usage

(same time window as previous slide)
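As a toy, instance-level analogue of such an exercise, the sketch below removes one randomly chosen instance from a simulated fleet and checks whether the remaining capacity still meets a minimum. Everything here is hypothetical (the instance IDs, `min_healthy`, and the function names); it terminates nothing real and is not the Chaos Monkey or Chaos Kong implementation.

```python
import random


def unleash_chaos(instances: set, rng: random.Random):
    """Pick one victim at random; return it and the surviving instances."""
    victim = rng.choice(sorted(instances))  # sorted for reproducibility
    return victim, instances - {victim}


def survives(instances: set, min_healthy: int, rng: random.Random) -> bool:
    """True if the fleet still meets `min_healthy` after losing one instance."""
    _, survivors = unleash_chaos(instances, rng)
    return len(survivors) >= min_healthy
```

Running a check like this routinely, rather than once, is what turns it into the kind of ongoing validation the slide describes.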

Page 24: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Continually Lower Operational Barriers

• “Production Ready” Program

• Identify operational best practices

• Develop tooling

• Consult with engineering teams

• Identify and address reliability “anti-patterns”

• Example key areas

• Auto Scaling, Hystrix tuning, alerting, automated canary analysis, Apache/Tomcat tuning

Page 25: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

It Works

Page 26: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Regional Isolation

Push-induced failure

Page 27: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Automated Service Fallbacks

• Downstream service issue; fallbacks gracefully applied

Page 28: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

…but what about efficiency?

…That’s a separate talk altogether

Page 29: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Wrapping it Up

• “To the cloud” – a journey

• Abstract complexity via platform

• Don’t be afraid to break things

• Break things intentionally and frequently

• Invest in reliability to support increased innovation

• Hire top talent

Page 30: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Related Sessions

Talk | Speaker | When? | Where?

Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N

Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C

Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B

A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H

Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B

Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F

Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B

Page 31: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Remember to complete your evaluations!

Page 32: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Thank you!