Pets vs. Cattle: The Elastic Cloud Story
DevOps Chicago Meetup, February 26, 2014
@randybias
CCA - NoDerivs 3.0 Unported License - Usage OK, no modifications, full attribution*
* All unlicensed or borrowed works retain their original licenses
Jan 16, 2015
A Tale of Two Clouds
Enterprise Computing Approach
• GUI Driven
• Ticket-Based
• Hand-Crafted
• Reserved
• Scale-up
• Smart Hardware
• Proprietary
• Traditional Dev
• …
Cloud Computing Approach
• API Driven
• Self-Service
• Automated
• On-demand
• Scale-out
• Smart Apps
• Open Source
• Agile DevOps
• …
Elastic Cloud Shifts Uptime Responsibility
                  Enterprise Model      Cloud Model
Applications      99.9% (8h46m down)    99.999% (5m down)
Infrastructure    99.999% ($$$$)        99% ($$)
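The downtime figures on this slide follow from simple arithmetic. Below is a minimal sketch, assuming downtime is counted per year and that a year averages 365.25 days (the slide states neither; the function names are illustrative only):

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year, leap days included


def annual_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR * 60.0


def format_downtime(minutes: float) -> str:
    """Render a minute count as e.g. '8h46m', rounded to the nearest minute."""
    hours, mins = divmod(int(round(minutes)), 60)
    return f"{hours}h{mins:02d}m"


for target in (99.0, 99.9, 99.999):
    budget = format_downtime(annual_downtime_minutes(target))
    print(f"{target}% uptime -> {budget} down per year")
```

Under these assumptions, 99.9% works out to about 8h46m per year and 99.999% to about 5 minutes, matching the slide.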
Elastic Cloud Origins
Elastic & Virtualization 2.0 Clouds are very different.
• Different workloads.
• Different architectures.
• Different skills.
• Different economics.

Enterprise Virtualization Private Cloud = Virtual Infrastructure + Standardization, Automation, Chargeback, Self-Service
• Designed for Server Consolidation
• IT Admins manage Infrastructure
• Ticket-based manual provisioning
• Improves virtualization value

≠

Elastic Private Cloud = Elastic Public Cloud + On-premise Deployment
• Designed for Agility
• Cloud Admins manage Services
• Self-service automated provisioning
• Delivers cloud value on-premise
What Companies Care About?
ACCELERATING TIME TO VALUE
Cloud Computing
Agile Development
Business Agility
Operational Discipline
• Continuous Integration
• Continuous Testing & Delivery
• Agile Methodologies
• IaaS / PaaS
• Public / Private / Hybrid
• Big Data / Analytics
• Public APIs
• Continuous Deployment
• DevOps Data Center & App Automation
• Line of Business Enablement
• New App Initiatives (Mobile, SaaS, etc.)
• Data Center Modernization
Elastic Cloud is a Mindset Change
Attribution: Bill Baker, Distinguished Engineer, Microsoft
Pets: bowzer.company.com (scale-up)
Cattle: web001.company.com (scale-out)
(Virtual) Servers *are* cattle
Pets vs. Cattle Takes Off
Microsoft, Cloudscaling, CERN, IBM, Scalr, Rackspace, Red Hat
Scale-out, not UP in Cloud
(Some) Elastic Cloud Patterns
What follows are *some* Elastic Cloud Patterns.
There are many more, but these are mine.
Input, ideas, & other thoughts welcome via Twitter / email.
Big Failure Domains Make Big Craters
Anti-Pattern
Smaller Failure Domains
Would you rather have the whole cloud down, or just a small bit of it for a short time?
Loose Coupling
Synchronous, blocking calls mean cascading failures.
Async, non-blocking calls mean failure in isolation.
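One way to picture failure in isolation is to issue downstream calls concurrently, each with its own timeout and local error handling. This is a sketch only: the service names, delays, and the "degraded" result are hypothetical, and `asyncio.sleep` stands in for real network I/O.

```python
import asyncio


async def call_service(name: str, delay: float, fail: bool = False) -> str:
    """Hypothetical downstream call; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} is down")
    return f"{name}: ok"


async def call_isolated(name: str, delay: float, fail: bool = False,
                        timeout: float = 0.2) -> str:
    """Bound the wait and turn any failure into a local, degraded result."""
    try:
        return await asyncio.wait_for(call_service(name, delay, fail), timeout)
    except (asyncio.TimeoutError, RuntimeError):
        return f"{name}: degraded"


async def render_page() -> list:
    # The three calls run concurrently, so a slow or failing dependency
    # cannot hold up the others or propagate its error to the caller.
    return await asyncio.gather(
        call_isolated("catalog", 0.01),
        call_isolated("reviews", 5.0),  # too slow: hits the timeout
        call_isolated("recommendations", 0.01, fail=True),
    )


results = asyncio.run(render_page())
print(results)
```

In a synchronous, blocking version the slow `reviews` call would stall the whole page for five seconds, and the `recommendations` error would abort it outright.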
Open Source Software
Excessive software taxation is the past.
Black boxes create lock-in.
You can always fork.
Uptime in Software Self-management
Hardware fails. Software fails. People fail.
Only software can measure itself & respond to failure in near real-time.
Applications designed for 99.999% uptime can run anywhere.
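A rough sketch of software measuring itself and responding to failure: a reconcile pass that probes every worker and replaces any that fail. The pool layout, the probe callables, and `launch_replacement` are all hypothetical stand-ins, not an API from the talk.

```python
from typing import Callable, Dict, List

Probe = Callable[[], bool]


def is_healthy(probe: Probe) -> bool:
    """A probe that returns False or raises counts as a failed worker."""
    try:
        return bool(probe())
    except Exception:
        return False


def reconcile(pool: Dict[str, Probe],
              launch_replacement: Callable[[str], Probe]) -> List[str]:
    """Probe every worker once and replace the failed ones in place."""
    replaced = []
    for name in list(pool):
        if not is_healthy(pool[name]):
            pool[name] = launch_replacement(name)  # no human in the loop
            replaced.append(name)
    return replaced


def dead_probe() -> bool:
    raise ConnectionError("no response")


pool = {"web001": lambda: True, "web002": dead_probe, "web003": lambda: False}
first_pass = reconcile(pool, lambda name: (lambda: True))
second_pass = reconcile(pool, lambda name: (lambda: True))
print(first_pass, second_pass)
```

Run on a short interval, a loop like this reacts within seconds, which is the point of the slide: no ticket queue or on-call engineer can respond that quickly.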
Scale Out vs Scale up
Vertical Scaling: Make boxes bigger (usually an HA pair)
Horizontal Scaling: Make more boxes
[Diagram: boxes A and B scaled up versus scaled out to A B C ... N]
Circuit Breaker Pattern
Fallback mechanisms (e.g. cached data) ensure uninterrupted service while giving the failing service time to recover.
When a failing service is detected, stop calling that API and serve fallback responses.
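The pattern above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the failure threshold, the recovery window, and names such as `flaky_api` and `cached_response` are assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3,
                 recovery_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """Open means: skip the failing API and serve fallbacks."""
        if self.opened_at is None:
            return False
        # Once the recovery window has elapsed, let a trial call through.
        return self.clock() - self.opened_at < self.recovery_seconds

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result


calls = {"count": 0}


def flaky_api() -> str:
    calls["count"] += 1
    raise ConnectionError("service unavailable")


def cached_response() -> str:
    return "cached data"


breaker = CircuitBreaker(failure_threshold=3, recovery_seconds=30.0)
responses = [breaker.call(flaky_api, cached_response) for _ in range(5)]
print(responses, calls["count"])
```

Callers always get a response, while the failing API is left alone after the third consecutive error until the recovery window has passed.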
Buy from ODMs
ODMs operate their businesses on 3-10% margins.
AMZN, GOOG, and Facebook buy direct without a middleman.
Only a few enterprise vendors are pivoting to compete.
Less Enterprise “Value” in x86 Servers
Generic servers rule. Full stop. Nothing is better because nothing else is *generic*.
“... a data center full of vanity free servers ... more efficient ... less expensive to build and run ...” - OCP
Fully Routed (L3) Networking
The largest cloud operators all run layer-3 routed networks with no VLANs.
Cloud-ready apps don’t need or want VLANs.
Enterprise apps can be supported on elastic clouds using Software-defined Networking (SDN).
Software-defined Networking (SDN)
• x86 server is the new Linecard
• network switch is the new ASIC
• VXLAN (or NVGRE) is the new Chassis
• SDN Controller is the new SUP Engine
“Network Virtualization”
Flat Networking + SDNs
Flat + SDN co-exist & thrive together
[Diagram labels: Availability Zone, Physical Node, VM, Standard Security Group, VPC Security Group, Virtual Private Cloud Networking, Virtual L2 Network, VPC Gateway, Internet]
RAIS instead of HA Pairs/Clusters
Redundant arrays of inexpensive services (RAIS)
• Load balanced with no state sharing
• Active … active … active … active …
• On failure, connections are lost, but failures are rare
• Rolling upgrades are easier, because each server is an island
• Think: scale-out + fault isolation (sharding)
Ridiculously simple & scalable
• Hardware failures are infrequent & impact a subset of traffic
• (N-F)/N, where N = total, F = failed
• 10 RAIS servers - 1 failure == 90% capacity
• Most things retry anyway
Cascade failures are unlikely and failure domains are small
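The (N-F)/N capacity rule and the "no state sharing" idea can be sketched together. The `ServiceArray` class, its round-robin selection, and the server names are illustrative assumptions; the slides do not say how the load balancing is implemented.

```python
import itertools
from typing import Iterable


class ServiceArray:
    """Stateless, all-active servers behind a simple round-robin picker."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._cycle = itertools.cycle(self.servers)

    def mark_failed(self, server: str) -> None:
        # No failover or state transfer: the server simply leaves rotation.
        self.healthy.discard(server)

    def capacity(self) -> float:
        """(N - F) / N, where N = total servers and F = failed servers."""
        total = len(self.servers)
        failed = total - len(self.healthy)
        return (total - failed) / total

    def pick(self) -> str:
        if not self.healthy:
            raise RuntimeError("no healthy servers left")
        for server in self._cycle:
            if server in self.healthy:
                return server


array = ServiceArray(f"rais{i:02d}" for i in range(1, 11))
array.mark_failed("rais03")
picks = [array.pick() for _ in range(18)]
print(array.capacity())
```

With 10 servers and 1 failure the array keeps 90% of its capacity, and new connections simply land on the remaining nine.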
Service Array (RAIS) Example
Backbone Routers
Cloud Access Switches
AZ (Spine) Switches
RAIS (NAT, LB, VPN)
OSPF Route Announcements
Return Traffic (default or source NAT)
API
Public IP Blocks
Cloud Control Plane
Lots of Inexpensive 1RU Switches
1RU: 6K-30K VMs / AZ
Modular: 40K-200K VMs / AZ
Simple spine-and-leaf flat routed network
[Diagram: Rack 1, Rack 2, Rack 3 ... multiple racks]
Direct-attached Storage (DAS)
Cloud-ready apps manage their own data replication.
DAS is the smallest failure domain possible with reasonable storage I/O.
SAN == massive failure domain.
SSDs will be the great equalizer.
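As an illustration of an application managing its own replication across nodes with direct-attached storage, the sketch below writes each value to three replicas and treats the write as successful once a majority acknowledge it. The quorum rule, the `Replica` class, and the node names are assumptions for the example, not something the slides specify.

```python
from typing import Dict, List


class Replica:
    """Hypothetical node that keeps data on its own local (DAS) disks."""

    def __init__(self, name: str, available: bool = True) -> None:
        self.name = name
        self.available = available
        self.data: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        if not self.available:
            raise OSError(f"{self.name} is unreachable")
        self.data[key] = value


def replicated_write(replicas: List[Replica], key: str, value: str,
                     quorum: int = 2) -> bool:
    """Write to every replica; succeed if at least `quorum` acknowledge."""
    acks = 0
    for replica in replicas:
        try:
            replica.write(key, value)
            acks += 1
        except OSError:
            continue  # one lost disk or node is a small failure domain
    return acks >= quorum


nodes = [Replica("node-a"), Replica("node-b", available=False), Replica("node-c")]
one_down = replicated_write(nodes, "user:42", "alice")

nodes[2].available = False
two_down = replicated_write(nodes, "user:43", "bob")
print(one_down, two_down)
```

Losing one node costs a single copy of the data rather than a whole shared array, which is the failure-domain argument the slide makes against SANs.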
Elastic Block Device Services
EBS/EBD is a crutch
Bigger failure domains (AWS outage anyone?), complex, sets high expectations
Sometimes you need a crutch. When you do, overbuild the network, and make sure you have a smart scheduler.
AWS EBS Outage: http://aws.amazon.com/message/65648/
More Servers == More Storage I/O
>1M writes/second, triple-redundancy w/ Cassandra on AWS
Linear scale-out == linear costs for performance
Hypervisors are a Commodity
Cloud end-users want OS of choice, not HVs.
Level up! Managing iron is for mainframe operators.
… hypervisors are bare metal APIs
Hypervisor of the future is open source, easily modifiable, & extensible.
The Hypervisor of the Future May Be NO Hypervisor
LXC
ironic
Bare Metal Cloud
Quiz Time
Pets or Cattle?
• LACP
• Managing a Server at a Time
• Auto-scaling
• Design-for-Failure
• 100% Uptime Goals
• HA pairs for redundancy
• Shared Nothing Architecture
• Persistent Block Storage
Q & A
Randy Bias
Founder & CEO, Cloudscaling
Director, OpenStack Foundation
@randybias