Traditionally, IT organizations have treated infrastructure components like family pets. We name them, we worry about them, and we let them wake us up at 4:00 am. Amazon CTO Werner Vogels has dubbed these behaviors "server hugging" and called them antiquated in today's cloud infrastructures. In this breakout session, we will discuss methods and methodology for getting away from server hugging and concerning ourselves more with the overall status and life of our entire infrastructure. From making use of toss-away-able on-demand infrastructure, to monitoring services and not individual servers, to getting away from naming instances, this session helps you see your infrastructure for what it is: technology that you control.
They Don't Hug Back! Or Why You Need To Stop Worrying About prodweb001 And Start Loving i-98fb9856
Chris Munns, Amazon Web Services
November 13, 2013
Why are we here? Old-school IT practices continue to weigh us down in the cloud. We need a way out.
“Everything now is a programmable resource. There are no physical things anymore. Things that you needed to do by walking to the datacenter, by hugging your servers, and believe me I’ve hugged servers enough in my life. They DO NOT hug you back.”
Waking when they cry:
*** Nagios ***
Notification Type: PROBLEM
Service: Web CPU
Host: web03.example.com
Address: 10.167.10.51
State: CRITICAL
Date/Time: Thu Oct 24 08:14:13 UTC 2013
Additional Info: CRITICAL – CPU LOAD 29
Hugging server babies and you
• Is the site performing worse?
• Are your customers impacted?
• How impacted are they?
• What are the other 20 web instances doing?
• Did I really need to wake up at 4am for this?
• If a server uses 100% of its CPU, should I care?
• If this server is bad, how much work is there in fixing it?
• Is there something custom about this server?
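The questions above point toward paging on the health of the service rather than on any single host. A minimal sketch of that idea, assuming a hypothetical per-instance CPU reading and two made-up thresholds (`cpu_limit` and `max_unhealthy_fraction`); none of these names come from the talk:

```python
from dataclasses import dataclass


@dataclass
class InstanceSample:
    """One CPU reading for one instance (hypothetical shape)."""
    instance_id: str
    cpu_percent: float


def should_page(samples, cpu_limit=90.0, max_unhealthy_fraction=0.25):
    """Return True only when enough of the pool is hot to threaten the service.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if not samples:
        # No instances reporting at all is a service-level problem.
        return True
    hot = [s for s in samples if s.cpu_percent >= cpu_limit]
    return len(hot) / len(samples) > max_unhealthy_fraction


# 21 web instances, one of them pegged at 100% CPU.
pool = [InstanceSample(f"i-{n:08x}", 35.0) for n in range(20)]
pool.append(InstanceSample("i-98fb9856", 100.0))
```

With this pool, `should_page(pool)` stays quiet: one hot instance out of 21 does not cross the fraction threshold, so nobody is woken at 4am for it.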
Server hugging bad practices
• “Pet-ting” – caring about a server’s “name,” its well-being, its individual status
• “Snowflakes” – unique hosts in a common pool
• “Model T-ing” – hand-built one-off servers
• “Names In Stone” – overuse of host names as a source of truth
In short, a lot of old-school, dated habits are being carried into cloud infrastructure. Once you’ve brought them to the cloud, you lose out on many of its benefits, such as:
• Dynamic scale up/down
• Self-healing infrastructure
• Increased flexibility
• Automation
Letting go means moving forward with the best of what AWS offers: its services, and the ways you can combine them.
Letting go and loving the new way
• Using Auto Scaling for everything
• ENIs and EIPs
• Tags are the new DNS
• Deployment tools
• Host-based configuration
• Service registries
• High CPU usage on anything
• High memory usage on anything
• Thread/process exhaustion
• Filled disks
• Software not running
• Failed instances
Metrics:
Common actions taken when paged
1. Look at logs
2. Look at graphs
3. Reboot/restart related application/instance
(Steps 1 and 2: looking at past data)
Why do this manually?
[Figure: Traffic to our site vs. provisioned capacity, provisioned manually. Labels: 76%, 24%.]
[Figure: Traffic to our site vs. provisioned capacity with Auto Scaling.]
STONITH: “Shoot the other node in the head”
Don’t be afraid to kill a node that has something wrong with it as a resolution to failure!
With Auto Scaling it’s fine!
STONITH
[Diagram, built up across ten slides. Base: an AWS Cloud containing a Virtual Private Cloud that spans three Availability Zones, with an Internet Gateway, three ELBs, and three web instances in an Auto Scaling group (min=3). Later builds add, in order: CloudWatch; an alarm and Amazon SNS; Amazon SQS; a watcher instance; the EC2 API. The next build shows only two web instances, and the last shows three again with the alarm gone.]
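The components in the STONITH diagram suggest a flow in which a CloudWatch alarm is published to Amazon SNS, delivered to an Amazon SQS queue, and read by a watcher instance that calls the EC2 API to terminate the bad node, after which Auto Scaling brings the group back to its minimum. That reading is an inference from the diagram, not something the slides spell out. A minimal sketch of the watcher's decision logic, assuming the queue body is the JSON envelope SNS writes (with the alarm as a JSON string in `Message`) and that the alarm is defined on a single `InstanceId` dimension; the function names and the example message are hypothetical:

```python
import json


def instance_to_kill(sqs_body):
    """Return the instance ID named by an ALARM notification, else None.

    Assumes the SQS body is the JSON envelope SNS writes to a queue, whose
    "Message" field holds the CloudWatch alarm as a JSON string, and that the
    alarm was defined on a single InstanceId dimension.
    """
    envelope = json.loads(sqs_body)
    alarm = json.loads(envelope["Message"])
    if alarm.get("NewStateValue") != "ALARM":
        return None  # OK / INSUFFICIENT_DATA transitions need no action
    for dim in alarm.get("Trigger", {}).get("Dimensions", []):
        if dim.get("name") == "InstanceId":
            return dim.get("value")
    return None


def handle_message(sqs_body, terminate):
    """Shoot the node named in the alarm and leave replacement to Auto Scaling.

    `terminate` is any callable taking an instance ID. It is injected so the
    sketch runs without AWS credentials.
    """
    instance_id = instance_to_kill(sqs_body)
    if instance_id is not None:
        terminate(instance_id)
    return instance_id


# Example message, hand-built for illustration.
example_body = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "AlarmName": "web-cpu-high",
        "NewStateValue": "ALARM",
        "Trigger": {
            "MetricName": "CPUUtilization",
            "Dimensions": [{"name": "InstanceId", "value": "i-98fb9856"}],
        },
    }),
})

killed = []
handle_message(example_body, killed.append)
```

In a real watcher, `terminate` could wrap an EC2 API call such as boto3's `terminate_instances`. Queue polling, message deletion, and credentials are left out of the sketch.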
Auto Scaling for everything!
• You can use Auto Scaling for singular instances that don’t scale up or down
  – min = 1, max = 1
• Auto Scaling gives you the ability to specify multiple Availability Zones, even if you only need a single host
  – gives you multi-AZ failover
• Auto Scaling supports notifications on instance creation/termination
  – Useful for configuring other resources, bootstrapping, and provisioning
• Auto Scaling is free!
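The single-instance case above can be sketched as a set of group parameters with min = 1 and max = 1 spread over more than one Availability Zone. The keys below follow the parameter names of boto3's `create_auto_scaling_group`; the helper function, group name, launch configuration name, and zones are all placeholders chosen for illustration:

```python
def singleton_group_params(name, launch_configuration, zones):
    """Build keyword arguments for a one-instance Auto Scaling group.

    The keys follow the parameter names of boto3's
    `create_auto_scaling_group`; all values are placeholders.
    """
    if not zones:
        raise ValueError("at least one Availability Zone is required")
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_configuration,
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        "AvailabilityZones": list(zones),
    }


params = singleton_group_params(
    "bastion-asg", "bastion-lc-v1", ["us-east-1a", "us-east-1b"]
)
```

These arguments could then be passed to an Auto Scaling client, for example `client.create_auto_scaling_group(**params)`. If the one instance fails, the group launches a replacement in one of the listed zones.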
Auto Scaling for everything!
• Make use of the user data or configuration management tools to do things like:
  – Re-attaching an Amazon Elastic Block Store (EBS) volume with application data
  – Re-attaching an Elastic Network Interface (ENI)
  – Updating service registries
  – Updating DNS
  – Notifying other dependent applications about the new host
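One way a freshly launched instance could decide which EBS volume to re-attach is to look for a detached volume carrying a known tag. The sketch below only does the selection, working on dictionaries shaped like the entries EC2 returns when describing volumes; the `Role` tag, the volume IDs, and the function names are assumptions made for illustration:

```python
def tags_of(resource):
    """Flatten an EC2-style tag list into a plain dict."""
    return {t["Key"]: t["Value"] for t in resource.get("Tags", [])}


def pick_volume(volumes, role, zone):
    """Return the ID of a detached volume tagged for `role` in `zone`, or None.

    An EBS volume can only attach to an instance in its own Availability
    Zone, so the zone has to match.
    """
    for vol in volumes:
        if (
            vol.get("State") == "available"
            and vol.get("AvailabilityZone") == zone
            and tags_of(vol).get("Role") == role
        ):
            return vol["VolumeId"]
    return None


volumes = [
    {"VolumeId": "vol-0a", "State": "in-use", "AvailabilityZone": "us-east-1a",
     "Tags": [{"Key": "Role", "Value": "app-data"}]},
    {"VolumeId": "vol-0b", "State": "available", "AvailabilityZone": "us-east-1b",
     "Tags": [{"Key": "Role", "Value": "app-data"}]},
    {"VolumeId": "vol-0c", "State": "available", "AvailabilityZone": "us-east-1a",
     "Tags": [{"Key": "Role", "Value": "app-data"}]},
]
chosen = pick_volume(volumes, "app-data", "us-east-1a")
```

The same pattern would work for picking an ENI to re-attach. The instance never needs to know the previous host's name, only the tag it is responsible for.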
Elastic Network Interfaces/Elastic IPs
ENI:
• Add additional interfaces to an instance
• One or more secondary private IP addresses
• Has its own MAC address
• Can have Security Groups assigned
• Tag-able
• Free
EIP:
• A static public IP address
• Can be assigned to either an instance or an ENI
• Doesn’t replace private IP
• Small hourly charge when not attached to an instance
Elastic Network Interfaces
Attaching multiple network interfaces to an instance is useful when you want to:
• Create a management network.
• Use network and security appliances in your Amazon Virtual Private Cloud (VPC).
• Create dual-homed instances with workloads/roles on distinct subnets.
• Create a low-budget, high-availability solution.
• All more or less accomplish the same things
  – File configuration, package/software installation, user management, run commands, interface with OS, process management
• All have their own syntax that isn’t too dissimilar
• Some rely on agents, some are agentless
• Use HBCM alongside one of the tools from the previous slide
• Spend the time required to learn them
• Can’t scale easily without HBCM
“A service registry is one of the fundamental pieces of service-oriented architecture (SOA) for achieving reuse. It refers to a place in which service providers can impart information about their offered services and potential clients can search for services.” - www.architecturejournal.net, Sept 2009
Service registry workflow
1. A new instance boots.
2. It registers itself with our “service registry.”
3. Changes to the service registry kick off changes on other systems related to the new instance.
4. Other instances now know about our new instance.
5. On instance termination, instance is deregistered, and other instances remove it from use.
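The workflow above can be sketched with a toy in-memory registry: instances register on boot, subscribers are told about every change, and deregistration removes the instance from use. This is an illustration of the pattern only, not the interface of any of the products named in this talk; the class, method names, and addresses are made up:

```python
from collections import defaultdict


class ServiceRegistry:
    """Toy in-memory registry; a real one would be a networked service."""

    def __init__(self):
        self._backends = defaultdict(set)   # service name -> {"host:port"}
        self._listeners = []                # callbacks fired on every change

    def subscribe(self, callback):
        self._listeners.append(callback)

    def _notify(self, service):
        snapshot = sorted(self._backends[service])
        for callback in self._listeners:
            callback(service, snapshot)

    def register(self, service, address):
        self._backends[service].add(address)
        self._notify(service)

    def deregister(self, service, address):
        self._backends[service].discard(address)
        self._notify(service)

    def lookup(self, service):
        return sorted(self._backends.get(service, ()))


events = []
registry = ServiceRegistry()
registry.subscribe(lambda service, backends: events.append((service, backends)))

registry.register("web", "10.0.1.5:8080")      # steps 1-2: boot and register
registry.register("web", "10.0.2.9:8080")
registry.deregister("web", "10.0.1.5:8080")    # step 5: termination
```

Each callback in `events` stands in for steps 3 and 4: a dependent system receiving the new list of backends and adjusting itself.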
Service registry examples:
• Zookeeper
• MuleSoft Anypoint Service Registry
• Netflix Eureka
• IBM WebSphere Service Registry and Repository
• Airbnb SmartStack
Zookeeper “is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.” – zookeeper.apache.org
Airbnb SmartStack Helping you build Service Oriented Architectures
not at Re:Invent
Intros
Igor Serebryany
+ SRE at Airbnb since 2012
+ Built datacenter automation at SingleHop
+ Scientific computing at University of Chicago
+ Hobbies: welding, biking, long walks on the beach
This guy is even more bearded than the last!
Intros
Martin Rhoads
+ SRE at Airbnb
+ User of AWS since 2006
+ First 10 employees at RightScale
+ Previously worked at Cloudscaling deploying OpenStack at Tier1s and Telcos
+ BioInformatics at UCSB
+ Obsessed with making things easier
SmartStack Helping you build SOA
What are you trying to sell me?
Why do I need SOA?
+ The definitive way to scale your architecture
+ Allow different people to work on different code without stepping on toes
+ Separate deployment schedules
+ Separate machine and data requirements
+ Fail separately -- so you can have graceful degradation
How SOA happens When customers love a service very, very much...
Here’s how it ends up A certain kind of fun
To sum up
1 Services help you scale
2 SOA is an architecture style designed around services
+ The same code in the same language is always doing discovery/registration
+ Your application doesn’t know about nerve/synapse -- it only knows about its dependencies
+ Always consistent across your infrastructure
You don’t have to wake up
Automatic Failure Handling
+ Bad backends are automatically taken out of rotation
+ Useful during both problems and routine maintenance/deploys
+ Push-based => very rapid detection; avoid those little blips
+ haproxy even routes around network partitions!
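To make "taken out of rotation" concrete: if the local haproxy configuration is regenerated from the list of backends currently considered healthy, a bad backend disappears simply by not being rendered. The sketch below is an illustration under that assumption and is not Synapse's actual code or output; the function name, local port, and addresses are hypothetical:

```python
def render_listen_block(service, local_port, backends):
    """Render a minimal haproxy `listen` section for one service.

    `backends` is an iterable of (host, port) pairs currently considered
    healthy; anything missing from it simply gets no `server` line.
    """
    lines = [
        f"listen {service}",
        f"    bind 127.0.0.1:{local_port}",
        "    mode http",
    ]
    for host, port in sorted(backends):
        lines.append(f"    server {service}_{host}_{port} {host}:{port} check")
    return "\n".join(lines) + "\n"


healthy = [("10.0.2.9", 8080), ("10.0.1.5", 8080)]
config = render_listen_block("web", 3210, healthy)
```

The application only ever talks to `127.0.0.1:3210`; which machines sit behind that port can change without the application noticing.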
See what’s REALLY going on
Introspection
Leverage the power of haproxy
+ status page that lets you see local state
+ lots of available integrations to gather global state
+ world-class logging for large-scale analysis
No central point of failure
Distributed by Design
+ Traffic flows directly between boxes -- no routing layer
+ Even if SmartStack is stopped or broken, haproxy keeps traffic flowing
+ Zookeeper helps to avoid common pitfalls (like different backends in different network segments)
How SmartStack has changed Airbnb
The Impact
[Figure: impact statistics. Labels: Services using SmartStack; Requests per second; LOC deleted; Engineers using SmartStack. Values: 100+, 2K, 3K, 30.]
Ben: “SmartStack is great! It helped me to discover services – and quit smoking”
Phillippe: “Distributed computing? And all this time I thought everything was running on one machine”
Spike: “Nerve and Synapse have greatly simplified my life as an application developer, and have enabled me to launch our first Node.js services with very little ops overhead.”
Barbara: “I love it!”
Sean: “Smart Stack has made deployment of new java services a matter of beer and 20 lines of ruby”
Our engineers love SmartStack
Future Direction Is this project, like, done...?
1. Better resiliency: more graceful handling of Zookeeper edge cases
2. Better testing: improve on the current integration test suite
3. Dynamic registration: for services running on Mesos et al.
4. A push API for nerve: allow services to communicate coming downtime
5. An auto-scaling layer: use nerve information to determine load levels
Join the AWS Startup Team this evening at the AWS Pub Crawl
When: Wednesday November 13, 5:30pm - 7:30pm
Where: Canaletto at The Venetian, 2nd Floor
Who Will Be There: Startups, the AWS Startup Team, Startup Launch Companies, and AWS re:Invent Hackathon winners
Startup Spotlight Sessions with Dr. Werner Vogels
Thurs. Nov 14, Marcello Room 4406
SPOT 203 – Fireside Chats – Startup Founders, 1:30-2:30pm
– Eliot Horowitz, CTO of MongoDB
– Jeff Lawson, CEO of Twilio
– Valentino Volonghi, Chief Architect of AdRoll
SPOT 204 – Fireside Chats – Startup Influencers, 3:00-4:00pm
– Albert Wenger, Managing Partner at Union Square Ventures
– David Cohen, Founder and CEO of TechStars
SPOT 101 - Startup Launches, 4:15-5:15pm
– 5 companies powered by AWS launching at AWS re:Invent 2013
We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.