Moving to structured state management in OpenStack
Yahoo! and NTT Data
Deployer use cases
• As a deployer I want to ensure that an instance is reserved & provisioned without falling back and/or reporting internal OpenStack errors to users.
• As a deployer I want to be able to allocate, schedule and reserve resources before they are consumed so that I can make advanced/complex/custom scheduling decisions using the combination of those resources as a whole.
• I want to convey to my users that OpenStack is a reliable and dependable system that is resilient to API outages, resource failures…
Developer use cases
• I want to be able to add new (and improved!) states to OpenStack and know what the impacts will be on the other states in OpenStack in an easy to understand manner.
• I want to be able to undo (and redo) resource allocation decisions in a transactional and verifiably correct manner on errors or on other ‘smart’ algorithmic placement logic.
• I want to be able to quickly and easily understand an API request from start to finish & I want other developers to have a single place to understand the same.
User use cases
• I want to ensure that my instances are reliably brought up without involving myself to resolve (or raise to support) errors inside of OpenStack.
• I want to ensure that my instances (and associated resources) are optimally scheduled in a reliable and correct manner or not have them scheduled to begin with.
• I want my resources to be fully utilized, and not have zombie resources being ‘locked’ due to the lack of transactional semantics (and recovery) in the underlying code.
The problem
• Hard to [follow, recover from, debug, ensure reliability, correctness, extend, audit…] ad-hoc distributed state transitions.
– Created by continual placement of new features without revisiting the underlying state management system.
• The never ending battle between new hotness vs. stability
– Majority of focus (understandably) on getting OpenStack operational.
– Typical technical debt.
• Acceptable for a new project like OpenStack to get off the ground, but now is the time to focus on features that add stability/scalability...
The problem
• Inter-state ‘cutting’ results in instances which require manual or periodic tasks to recover.
– Distributed systems should always be able to automatically recover from failures, and not require manual/periodic intervention.
• Continually adding local [solutions, fixes, patches]
• Lack of [focus, time, desire] to fix the system as a whole?
• How many inter-state race conditions are hiding underneath the covers??
– Can verification even be done with the current codebase (in a reasonable time period)?
[Diagram: CREATE SERVER API (admin/user). A request passes through nova-api, keystone, glance, MySQL, RabbitMQ, nova-scheduler, nova-compute, Network Service, Volume Service and Libvirt, with arrows numbered 1 to 16.]
Create Server - Transitions and States
ID  Service             Operation              vm_state  task_state            power_state
1   Nova API            Initial State          -         -                     -
2   Keystone            Authenticate user      -         -                     -
3   Nova API/Glance     Show image             -         -                     -
4   Nova API/MySQL      Create entry           BUILDING  SCHEDULING            -
5   Nova API/RabbitMQ   Cast to Scheduler      BUILDING  SCHEDULING            -
6   Scheduler           Received at Scheduler  BUILDING  SCHEDULING            -
7   Scheduler/RabbitMQ  Cast to Compute        BUILDING  SCHEDULING            -
8   Compute             Received at Compute    BUILDING  SCHEDULING            -
9   Compute/Glance      Show image             BUILDING  SCHEDULING            -
10  Compute/MySQL       Update DB              BUILDING  NETWORKING            -
11  Compute/RabbitMQ    Call on Network        BUILDING  NETWORKING            -
12  Network             Allocate Network       BUILDING  NETWORKING            -
13  Compute/Volume      Attach volume          BUILDING  BLOCK_DEVICE_MAPPING  -
14  Compute/MySQL       Update DB              BUILDING  SPAWNING              -
15  Compute/Libvirt     Spawn instance         BUILDING  SPAWNING              -
16  Compute/MySQL       Update DB              ACTIVE    None                  RUNNING
What happens if we cut here??
Or here??
Or here??
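To make the "cut" question concrete, here is a minimal Python sketch that encodes the table as data and reports the state an instance is left in if processing stops right after a given step. The names `Step`, `state_after_cut` and `is_stranded` are hypothetical and are not part of Nova; the rows are copied from the table, with both "-" and "None" represented as Python `None`.

```python
from typing import NamedTuple, Optional, Tuple


class Step(NamedTuple):
    id: int
    service: str
    operation: str
    vm_state: Optional[str]
    task_state: Optional[str]
    power_state: Optional[str]


# Rows copied from the Create Server table; "-" and "None" both become None.
CREATE_SERVER_STEPS = [
    Step(1, "Nova API", "Initial State", None, None, None),
    Step(2, "Keystone", "Authenticate user", None, None, None),
    Step(3, "Nova API/Glance", "Show image", None, None, None),
    Step(4, "Nova API/MySQL", "Create entry", "BUILDING", "SCHEDULING", None),
    Step(5, "Nova API/RabbitMQ", "Cast to Scheduler", "BUILDING", "SCHEDULING", None),
    Step(6, "Scheduler", "Received at Scheduler", "BUILDING", "SCHEDULING", None),
    Step(7, "Scheduler/RabbitMQ", "Cast to Compute", "BUILDING", "SCHEDULING", None),
    Step(8, "Compute", "Received at Compute", "BUILDING", "SCHEDULING", None),
    Step(9, "Compute/Glance", "Show image", "BUILDING", "SCHEDULING", None),
    Step(10, "Compute/MySQL", "Update DB", "BUILDING", "NETWORKING", None),
    Step(11, "Compute/RabbitMQ", "Call on Network", "BUILDING", "NETWORKING", None),
    Step(12, "Network", "Allocate Network", "BUILDING", "NETWORKING", None),
    Step(13, "Compute/Volume", "Attach volume", "BUILDING", "BLOCK_DEVICE_MAPPING", None),
    Step(14, "Compute/MySQL", "Update DB", "BUILDING", "SPAWNING", None),
    Step(15, "Compute/Libvirt", "Spawn instance", "BUILDING", "SPAWNING", None),
    Step(16, "Compute/MySQL", "Update DB", "ACTIVE", None, "RUNNING"),
]


def state_after_cut(last_completed_id: int) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (vm_state, task_state, power_state) if processing stops after this step."""
    for step in CREATE_SERVER_STEPS:
        if step.id == last_completed_id:
            return (step.vm_state, step.task_state, step.power_state)
    raise ValueError(f"unknown step id: {last_completed_id}")


def is_stranded(last_completed_id: int) -> bool:
    """An instance cut while still BUILDING needs some form of recovery."""
    vm_state, _, _ = state_after_cut(last_completed_id)
    return vm_state == "BUILDING"
```

Under this reading of the table, a cut anywhere from step 4 through step 15 leaves a database entry in BUILDING with nothing driving it forward, which is the situation the slide is pointing at.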
Solutions solutions solutions
• Nova has mostly stabilized (code-wise)
– It appears to be a good time to rethink and rework some of the foundations (with as minimal an impact as we can)
– Eventually, as other core components (Quantum) stabilize, a similar analysis can be done there (if needed)
• Prototype a potential solution and discuss next steps with the community.
– That’s why we are here folks
Create request without orchestration
https://docs.google.com/document/d/1xpUszQFEtKmRAf1Wz_XpwyJslhI5X6siM29amPnKifE
Create request with orchestration
https://docs.google.com/document/d/1xpUszQFEtKmRAf1Wz_XpwyJslhI5X6siM29amPnKifE
Key Benefits
• Less scattering of state management
– Makes it easier to understand…
• Less scattering of recovery scenarios
– Clearly defined rollbacks…
• Faster and more dependable resource acquisition
– Compute node will perform initialization and final acquisition of resources.
– Reservations and initial acquisitions will be done before the request to provision instances, hence faster VM spawns.
• Scheduler can make better ‘overall’ scheduling decisions.
– Ex. no need for compute <-> scheduler retry hacks
– Can make advanced scheduling decisions based on volume choices, locality, network choices... When you are able to acquire/release resources before their use, anything is possible…
– No more need for 'hinting'...
• Creates a single place where others can extend or alter nova state transitions to plug in their own ‘custom/internal’ state transitions.
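The "clearly defined rollbacks" and "reserve before use" points can be illustrated with a small Python sketch. It is not the prototype from the talk and not Nova code: `run_with_rollback`, the task names and the log messages are all hypothetical. Each task pairs an apply step with a revert step, and a failure undoes the completed tasks in reverse order.

```python
from typing import Callable, List, Tuple

# Hypothetical task shape: (name, apply, revert).
Task = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_rollback(tasks: List[Task]) -> bool:
    """Apply tasks in order; on failure, revert completed tasks in reverse order."""
    completed: List[Task] = []
    for task in tasks:
        _name, apply, _revert = task
        try:
            apply()
        except Exception:
            # Undo only what actually completed, newest first.
            for _, _, undo in reversed(completed):
                undo()
            return False
        completed.append(task)
    return True


log: List[str] = []


def fail_spawn() -> None:
    # Simulated failure at the final step of the request.
    raise RuntimeError("spawn failed")


tasks: List[Task] = [
    ("reserve_network", lambda: log.append("reserve network"),
     lambda: log.append("release network")),
    ("reserve_volume", lambda: log.append("reserve volume"),
     lambda: log.append("release volume")),
    ("spawn_instance", fail_spawn,
     lambda: log.append("destroy instance")),
]

ok = run_with_rollback(tasks)
```

In this sketch the failed spawn releases the volume and then the network, so no reservation is left dangling. Keeping the apply/revert pairs in one list is also what gives a single place to read, extend or alter the transitions.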
DEMO AND DISCUSSION
https://etherpad.openstack.org/the-future-of-orch