Can you trust Neutron?

A tour of scalability and reliability improvements from Havana to Juno

Salvatore Orlando (@taturiello)
Aaron Rosen (@aaronorosen)

From Havana to Juno

● 12 months
● 1672 commits
● +147765 -70127 lines of code

(excluding changes in neutron/locale/*)

But... did it really get any better?

Measuring scalability - Process

● Goal: Validate agent scalability under varying load
  o In this talk we’ll discuss the L2 agent only, sorry!

● Testbed: single server OpenStack installation

● Methodology: run several experiments, increasing the number of servers created concurrently
  o Number of servers ranging from 1 to 20
  o Every experiment is repeated 20 times
  o For each metric, study mean, median, and variance
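The per-metric analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the talk; the function name `summarize` and the sample values are made up for the example.

```python
import statistics


def summarize(samples):
    """Return mean, median, and sample variance for one metric.

    `samples` holds one measurement per repetition (20 per experiment
    in the methodology above).
    """
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "variance": statistics.variance(samples),
    }


# Made-up values in seconds, for illustration only (not measured data).
example = [2.0, 4.0, 4.0, 6.0]
summary = summarize(example)
```

Looking at the median next to the mean helps spot runs where a few slow outliers skew the average.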

Measuring scalability - Metrics

Instance metrics (t_start = instance created):
● t_active: time until the instance reaches active state
● t_ping: time until the instance can be pinged
● t_allocate_net: time spent configuring networking for the instance

Port metrics (t_start = VIF plugged):
● t_proc: time until the agent starts processing the port
● t_up: time until the port is wired
● t_dhcp: time for adding DHCP info for the new port
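As a rough sketch of how the port metrics above relate to each other, the snippet below derives them from raw timestamps. The class and field names are hypothetical, and it assumes every metric, including t_dhcp, is measured from the moment the VIF is plugged.

```python
from dataclasses import dataclass


@dataclass
class PortTimestamps:
    """Timestamps in seconds for one port; field names are hypothetical."""
    vif_plugged: float    # t_start for the port metrics
    agent_started: float  # agent begins processing the port
    port_wired: float     # port is wired
    dhcp_ready: float     # DHCP info added for the port


def port_metrics(ts):
    """Compute t_proc, t_up, and t_dhcp relative to the VIF plug time."""
    return {
        "t_proc": ts.agent_started - ts.vif_plugged,
        "t_up": ts.port_wired - ts.vif_plugged,
        "t_dhcp": ts.dhcp_ready - ts.vif_plugged,
    }


# Made-up timestamps, for illustration only.
metrics = port_metrics(PortTimestamps(10.0, 10.5, 12.0, 11.0))
```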

Measuring scalability - Results

t_up in Havana and Juno - a rather remarkable difference!

Measuring scalability - Results

t_allocate_net almost constant in Juno

Growth trend is only 15% of the one seen in Havana

Measuring scalability - Results

● VM failure rate analysis
  o Failure == error while creating VM or unable to ping within 3 min timeout
● Juno is decently reliable (Havana not as much…)
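The failure criterion above can be written down as a small predicate. This is only a sketch: the function names and the sample runs are invented for illustration and are not the talk's measured data.

```python
PING_TIMEOUT_S = 180  # the 3-minute ping timeout from the definition above


def is_failure(create_error, t_ping):
    """A run fails if VM creation errored or no ping succeeded in time.

    `t_ping` is the time to the first successful ping in seconds, or
    None if the instance never answered.
    """
    return create_error or t_ping is None or t_ping > PING_TIMEOUT_S


def failure_rate(runs):
    """Fraction of failed runs; `runs` is a list of (create_error, t_ping)."""
    return sum(is_failure(err, t) for err, t in runs) / len(runs)


# Made-up runs, for illustration only.
runs = [(False, 42.0), (True, None), (False, 200.0), (False, 90.0)]
rate = failure_rate(runs)
```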

Analysing progress

[Figure: progress across releases: Folsom, Grizzly, Havana, Icehouse, Juno]

How the software improved

● Boot VMs only once network is wired

● Remove choke points from L2 agents

● Streamline security group RPC

● Better router processing in L3 agents

● Reporting floating IP processing status

● Many others… which unfortunately won’t fit into the time allocated to this talk

More results

● Virtually no improvements in time to ping an instance

- As the tests are executed on a single host, IO contention between instances is the main bottleneck.

- “Time to ping” is slowed down by longer instance boot times

● Instances are slower to go to “ACTIVE” than they were in Havana

- This is actually a desired feature

- Indeed, it’s the reason why the failure rate in Juno is 0 even with 20 concurrent instances

Nova/Neutron Event reporting

Problem: Nova displays cached IPAM info about an instance from Neutron. The cache is updated slowly…

1. Associate floating IP to port (neutron-api)
2. Show me instance! (nova-api)
   Wat? No floating ip?

Nova/Neutron Event reporting

Solution: Neutron sends events to Nova on IPAM changes, causing Nova to update its cache.

1. Associate floating IP to port (neutron-api)
2. network-changed for instance X (neutron-api to nova-api)
3. Dispatch event to compute host (nova-api to nova-compute)
4. update_network cache for instance X (nova-compute)
5. Show me instance! (nova-api)
   I haz floating ip
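To make the notification step more concrete, here is a sketch of the request body Neutron could send to Nova's os-server-external-events API for a network-changed event. The helper name and the server UUID are hypothetical, and the field layout reflects our reading of the API, so check the Nova API reference before relying on it.

```python
import json


def build_external_event(name, server_uuid, tag=None, status="completed"):
    """Build a body for Nova's os-server-external-events API (sketch)."""
    event = {"name": name, "server_uuid": server_uuid, "status": status}
    if tag is not None:
        # For port-related events the tag typically carries the port ID.
        event["tag"] = tag
    return {"events": [event]}


# Hypothetical server UUID, for illustration only.
body = build_external_event("network-changed", "hypothetical-server-uuid")
print(json.dumps(body, indent=2))
```

The same helper could describe a network-vif-plugged event by passing the port ID as `tag`.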

Nova/Neutron Event reporting

Problem: Instances would go active before the network was wired. Some DHCP clients (such as the one in cirros images) don’t keep retrying...

1. Boot instance (nova-api)
   W00T Active!
2. Ready?!?
3. ssh instance…..
   Timeout.. Hrm?!?

Nova/Neutron Event reporting

Solution: Neutron sends events to Nova when the network is ready.

Components: nova-api, nova-scheduler, nova-compute, neutron-api, Neutron Backend, VM

1. Boot instance
2. Allocate network for instance
3. VM started in paused state
1B. Port X active
2B. event: network-vif-plugged: port X
3B. VM unpaused

Enabling/disabling event reporting

Settings in nova.conf

vif_plugging_timeout = 300
vif_plugging_is_fatal = True
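The interplay of the two settings can be illustrated with a toy model. This is not Nova's code: the function, the exception, and the use of `threading.Event` are assumptions made for this sketch, which only mirrors the idea of waiting for network-vif-plugged and then either failing or carrying on.

```python
import threading


class VifPluggingTimeout(Exception):
    """Raised when the network-vif-plugged event never arrives."""


def wait_for_vif_plugged(event, timeout, is_fatal):
    """Wait for the plug event; return True if it arrived in time.

    On timeout, raise if `is_fatal` is set; otherwise return False so the
    caller can boot the instance anyway (networking may not be ready).
    """
    if event.wait(timeout):
        return True
    if is_fatal:
        raise VifPluggingTimeout("no network-vif-plugged within %ss" % timeout)
    return False


plugged = threading.Event()
plugged.set()  # simulate Neutron reporting the port as wired
ok = wait_for_vif_plugged(plugged, timeout=0.1, is_fatal=True)
```

With `is_fatal=False` the instance still boots after the timeout, which is roughly the pre-event-reporting behaviour.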

Speeding up L2 interface processing

Problem - device processing delayed by:
- inefficient server/agent interface
- preemptive behaviour of security group callbacks
- pedantic polling of interfaces on integration bridge
- superficial analysis of devices to process

Solution:
- ovsdb-monitor triggers interface processing only when changes are detected
- Neutron server performs at most 2 RPC calls over AMQP for each API operation
  - only 1 call in most cases
- The L2 agent queries the server only once to retrieve interface details
- Security group updates are processed in the same loop as interfaces, thus avoiding starvation
- The agent only processes interfaces which are ready to be used and, most importantly, processes them only once!
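The "process only on change, and only once" idea can be sketched with plain set arithmetic. The function below is a simplified illustration, not the OVS agent's code; `changes_detected` stands in for a signal from ovsdb-monitor, and the port names are made up.

```python
def scan_ports(current_ports, known_ports, changes_detected):
    """One agent iteration: return (added, removed), or None to skip.

    When `changes_detected` is False the iteration does no work instead
    of polling the integration bridge.
    """
    if not changes_detected:
        return None
    added = current_ports - known_ports    # ports not yet processed
    removed = known_ports - current_ports  # ports that disappeared
    return added, removed


known = {"port-a", "port-b"}
result = scan_ports({"port-b", "port-c"}, known, changes_detected=True)
idle = scan_ports({"port-b", "port-c"}, known, changes_detected=False)
```

Because `port-b` is already known, it is not handed back for processing a second time.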

Streamlining security group RPCs

Problem - exponential complexity
The payload of the RPC call to retrieve security group rules grows exponentially as the number of devices increases.

Solution:
Restructure the format of the payload exchanged between agent and server, removing data redundancy.
With the new payload format, security group rules are no longer repeated.
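The effect of removing redundancy can be illustrated by comparing two simplified payload layouts. Both layouts, the rule fields, and the device names below are invented for this sketch and are not the actual Neutron RPC format; the point is only that rules stored once per group are much smaller than rules copied into every device entry.

```python
import json

# 10 made-up rules shared by every device in the example.
RULES = [{"direction": "ingress", "protocol": "tcp", "port": p}
         for p in range(20, 30)]
DEVICES = ["dev-%d" % i for i in range(50)]

# Old-style layout (simplified): every device carries its own rule copy.
old_payload = {dev: {"security_group_rules": RULES} for dev in DEVICES}

# New-style layout (simplified): rules are stored once per group and
# devices only reference the group by ID.
new_payload = {
    "security_groups": {"sg-1": RULES},
    "devices": {dev: {"security_groups": ["sg-1"]} for dev in DEVICES},
}

old_size = len(json.dumps(old_payload))
new_size = len(json.dumps(new_payload))
```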

Streamlining security group RPCs

Credits: Miguel Angel Ajo Pelayo
http://www.ajo.es/post/95269040924/neutron-security-group-rules-for-devices-rpc-rewrite

[Figures: RPC message payload size vs. number of ports; RPC execution time vs. number of ports]

Reducing router processing times

Problems:
● Router synchronization starves RPC handling
● Not enough parallelism in router and floating IP processing

Solution:
● Router synchronization tasks and RPC messages are added to a priority queue. Items pulled from the queue are processed in separate threads.
● Apply iptables commands in a non-blocking fashion
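A minimal sketch of the priority-queue approach, using only the Python standard library. The priority values, function names, and router IDs are assumptions made for this example (it assumes RPC-triggered updates are served before periodic sync tasks), and `process_router` is a placeholder for the real work.

```python
import itertools
import queue
from concurrent.futures import ThreadPoolExecutor

PRIORITY_RPC = 0   # lower value is served first
PRIORITY_SYNC = 1

_seq = itertools.count()  # tie-breaker keeps FIFO order within a priority


def enqueue(q, priority, router_id):
    q.put((priority, next(_seq), router_id))


def process_router(router_id):
    """Placeholder for the real per-router work."""
    return "processed " + router_id


def drain(q, workers=4):
    """Pull items in priority order and process them in worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        while not q.empty():
            _, _, router_id = q.get()
            futures.append(pool.submit(process_router, router_id))
        return [f.result() for f in futures]


q = queue.PriorityQueue()
enqueue(q, PRIORITY_SYNC, "router-1")  # periodic full sync
enqueue(q, PRIORITY_RPC, "router-2")   # RPC-triggered update
enqueue(q, PRIORITY_SYNC, "router-3")
results = drain(q)
```

Even though `router-2` was queued second, it is pulled first because its RPC priority outranks the sync tasks.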

Know your floating IP status

Problem:
There was no way to know whether your floating IP was ready or not (beyond pinging it, obviously)

Solution:
- Introducing the concept of operational status for floating IPs.
- The L3 agent calls back the server to confirm successful floating IP creation (ACTIVE), or an error (DOWN)
- The state defaults to DOWN. It goes ACTIVE upon floating IP association, and DOWN when the floating IP is disassociated.
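The status transitions described above can be modelled as a tiny state machine. This is a toy model with hypothetical class and method names, not Neutron's implementation; it only encodes the rules stated on the slide.

```python
class FloatingIP:
    """Toy model of floating IP operational status (names hypothetical)."""

    def __init__(self):
        self.status = "DOWN"  # the state defaults to DOWN
        self.port_id = None

    def associate(self, port_id):
        """API call: record the association; status waits for the agent."""
        self.port_id = port_id

    def agent_report(self, success):
        """L3 agent callback: ACTIVE on success, DOWN on error."""
        self.status = "ACTIVE" if success and self.port_id else "DOWN"

    def disassociate(self):
        """API call: drop the association and go back to DOWN."""
        self.port_id = None
        self.status = "DOWN"


fip = FloatingIP()
initial = fip.status
fip.associate("port-x")
fip.agent_report(success=True)
after_association = fip.status
fip.disassociate()
after_disassociation = fip.status
```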

Other enhancements (in brief)

● Multiple REST API workers

● Multiple RPC over AMQP workers

● Better IP address recycling

● Removal of several locking queries

o i.e.: SELECT ... FOR UPDATE statements

● Removal of conditions triggering LOCK WAIT timeout errors

o bug triggered by eventlet yielding within a transaction

Where we are...

● The L2 agent scalability considerably improved over the past 12 months
  o Results measured with OVS only, but the same considerations apply to Linux Bridge as well
● Security groups can now be used even in very large deployments
● Nova/Neutron interface much more reliable
  o Boot a server only when the network for it is wired
  o Faster, less chatty communication
● Some progress on resource status tracking
  o Far from being optimal, but at least now you can know when your floating IP is ready to use...

… and where we want to be

● There is still a lot of room for improvement in the agents
  o E.g.: OVS agent still scans all ports on the integration bridge at each iteration
● The Nova/Neutron interface is better, but still far from ideal
  o Enhanced caching on the Nova side can avoid a lot of round trips to Neutron
● Little to nothing has been done for tracking async operations and resource status. For example:
  o there is no way to know whether DHCP info is ready for a port
  o security group updates are processed asynchronously, but it is impossible to know when processing completes

Final thoughts

● “Much better” is different from “ideal”
  o ≅ 3 seconds for wiring an interface might not be ideal for many applications
  o scalability limits should be addressed even if they involve architectural changes

● What about data plane scalability?

● What about API usability?

Can you trust Neutron?
