Upgrading OpenStack Without Breaking Everything (Including Neutron?)

Jan 22, 2018

clayton_oneill
Transcript
Page 1: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

Today we’re here to talk about upgrading OpenStack.

Ideally we don’t want to break everything.

And the session description promised you we wouldn’t even break Neutron, but we’ll see how that worked out.

Page 2: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

Both principal engineers at TWC on the OpenStack team
Clayton: focus on automation, CI/CD, deployments, etc.
Sean: focus on networking, compute

Page 3: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● Our OpenStack team started with four people about two years ago
● We did our proof of concept implementation on Havana and then after the Atlanta summit decided to switch everything to Icehouse and VXLAN based networking before going to production in the summer
● Since then we’ve done upgrades to Juno and Kilo
● These are the versions of the services we’re currently running
○ This talk will focus on our last round of control node upgrades, which included Nova, Neutron, Glance, Cinder and Heat
○ Since our Kilo upgrade, we’ve moved Heat into a Docker container and upgraded it to Liberty
○ Horizon and Keystone aren’t included because those were already on Kilo.
● There are a few core tenets that we feel are important and that we try to follow regarding OpenStack upgrades.

Page 4: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● The first one is: You really don’t want to fall behind.
● We plan on upgrading every 6 months

We think you should also, even if you want to wait for bug fixes on the stable branch

The primary reason is that it is the only tested path for upgrades

And with rolling upgrades and lazy DB migrations, there are now intermediate steps that have to be done between releases

For example, in Kilo, nova flavor migration must be run before upgrading to Liberty

Page 5: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

Automate everything

If you don’t automate everything then when you start your testing….

Page 6: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

You’re going to feel like this guy

Test it over and over

Get your process down

Upgrades might impact customers, so try to find out what that impact is

Page 7: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● Our team gave an upgrades talk in Vancouver; some of you may have been to that talk also
○ We appreciate anyone that felt like they wanted to hear us talk about OpenStack upgrades twice in one year.
● We’re going to try not to cover too much of the same ground; the Juno talk is on YouTube and it covers our overall approach
○ We’re going to talk more about updates to that approach and issues we ran into while upgrading to Kilo
● So when deciding timing for our Kilo upgrade, there was one major feature we were looking forward to:

Page 8: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● Like most people using OpenStack, we use RabbitMQ as the message broker for all intra-service communications
● Like most people using OpenStack, we’ve had tons of problems with this, although it’s gotten better
● The biggest remaining problem we’ve seen with Juno was that if anything went wrong, OpenStack services would not realize they were disconnected from Rabbit
○ Nova-compute was particularly bad about this.
● AMQP heartbeats are a protocol level feature that let the RabbitMQ server and clients check in on each other regularly
○ If one of them goes missing, everything gets cleaned up and clients can go reconnect in a timely fashion
○ This was added as an experimental feature in Kilo and we’d heard good things.

Page 9: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

Before you start down the path of upgrading, you have to know requirements for acceptable downtime and outage.

This also requires balancing technical capabilities and desires with customer needs.

For instance...

Page 10: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

If you can just forklift upgrade to a new environment or even reinstall the same servers, that’s the easiest approach.

We, as operators, love this. It makes our life operationally easy.

Another option we like is to...

Page 11: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● ...think of the upgrade process as a pit stop…,
● pulling the entire cloud out of the race and swapping workloads over a short period of time.
● It’s a short outage, but a total one.
● The problem is, our customers don’t want _any_ outage

Page 12: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● This is what our customers want. Zero downtime! That’s what we need.
● These guys change two tires on the car in about 5 minutes, while the car is driving down the road the whole time.
● And, unfortunately, we don’t get to change the tires on just one side of the car.

Page 13: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● In the end, our requirements ended up being...
● Our customers are ok with an API outage for say 10 or 15 minutes.
● They’re not ok with any other sort of outage
● This is basically what our requirements were for both our Juno and Kilo upgrades
● For Juno our upgrade weakness was networking.
● Let’s talk about our improvement goals for our Kilo upgrade

Page 14: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

For the Kilo upgrade we also integrated lessons learned from our Juno upgrade. This meant...

We did our Juno upgrade in the early evening and the feedback from our customers was that this was their peak time. For Kilo we changed our upgrade time to be 2am local time. (ugh)

We also realized that we need to test major upgrades using production data from both regions. We did this and thankfully didn’t have any issues there.

The major problem with our Juno upgrade was that we had unexpected network outages when upgrading in production:

The primary reason for this was that we had dramatically more routers in our production environment than we did in dev or staging.

In dev and staging the outage was just not long enough for us to notice it and we weren’t doing good monitoring.

To address this:

We put tooling in place to spin up around 100 virtual networks and routers and an instance behind each one in order to give us a more realistic test environment.

We also put in place high granularity ping monitoring of those instances so we could get good metrics about what was going on during our upgrade testing.

This was really effective in letting us understand what was happening during the testing.
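
As an illustration of what high granularity ping monitoring makes possible, here is a minimal sketch that turns raw ping results into outage windows. It is not the tooling used for the upgrade: the function names and the (timestamp, success) sample layout are assumptions made for this example.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[float, bool]  # (unix timestamp, ping succeeded?)


def outage_windows(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for each run of consecutive failed pings.

    start is the timestamp of the first failed probe in a run; end is the
    timestamp of the first successful probe after it, or of the last probe
    seen if the samples end while the instance is still unreachable.
    """
    windows = []
    start = None
    last_ts = None
    for ts, ok in sorted(samples):
        if not ok and start is None:
            start = ts                    # outage begins
        elif ok and start is not None:
            windows.append((start, ts))   # outage ends
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows


def longest_outage(samples: Iterable[Sample]) -> float:
    """Length in seconds of the longest outage window, 0.0 if there is none."""
    return max((end - start for start, end in outage_windows(samples)), default=0.0)
```

The shorter the probe interval, the shorter the outage this can detect, which is why coarse monitoring in dev and staging can miss a problem that is obvious in production.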

Page 15: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● We talked about how important upgrade automation is before; I just want to touch on that briefly and cover how we handle that
● All of our upgrade automation is done using Ansible to drive changes via Puppet
○ Puppet is responsible for all package management, config changes, service restarts, etc.
○ Ansible does everything else and handles all orchestration and ordering
● This is something we covered in a fair amount of depth in our Vancouver talk if you’re interested in more detail
● When doing our Kilo upgrade, we started with the Juno upgrade automation and we were able to reuse nearly all of it
● So let’s look at what our actual upgrade process looks like

Page 16: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● This is what our starting point looks like for our control cluster.
● We have 3 control nodes.
○ Each node hosts the services we’re going to be upgrading, plus a bunch of virtual routers.
○ They are also all part of a shared MySQL cluster and RabbitMQ cluster.
○ External users talk to these nodes via a hardware load balancer.
○ What’s not shown here is that internal traffic goes through HAProxy
● So let’s walk through the process of the actual upgrade.
○ Keep in mind that all the steps you are seeing were automated with Ansible playbooks.

Page 17: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● The goal here was to take two of the control nodes out of service and then upgrade the first node. Here’s how we got there.
● The first thing we do is shut down and back up the database on two of the nodes
● Next we use L3 agent failover to move all the routers from the first control node to the other two.
○ The issue we’re trying to avoid here is that when the OVS agent is started during the upgrade
■ It will drop all network flows, leading to a loss of network connectivity.
■ We’re going to talk about that more later on
○ To avoid that, we shut down the L3 agent on the first control node
■ After the L3 agents on nodes 2 and 3 detect the “failure” of the L3 agent on the first control node, they’ll start taking over those routers
○ Once all routers are moved, we disable the L3 agent on node 1 via the Neutron API so that when it comes back up during the upgrade, routers don’t move back automatically.
● This leaves us functional, not in an outage, but with a cluster of only one.
● The last thing we do before starting the API outage is get a list of all instances with floating IPs
○ We set up a small script to ping all the floating IPs and report on their status while we proceed with the upgrade

Page 18: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● Start the API outage by turning off the external load balancer
○ We ran into some issues here, but we’re going to cover that later
● Then we shut down all OpenStack services on all 3 control nodes.
○ The goal is to not have Juno services trying to make changes against a Kilo database
○ The routers continue to function because that occurs in the kernel
● Run puppet on the first control node. It upgrades all the packages, updates config file settings and finally restarts all the services
○ We set OS_ENDPOINT_TYPE to internalURL when running Puppet so that it can talk via the internal haproxy load balancer instead of the external endpoints that we’ve disabled
○ This also sets the nova API compat flag so that Juno compute nodes can still talk to the Kilo control services.
● When this is complete, we run a simple smoke test via the CLI clients to verify the services have basic functionality before continuing on

Page 19: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● Once we’ve completed our smoke tests, we want to start getting things back to normal
● We enable the L3 agent on the Kilo control node, and it will detect that the L3 agents on the other two nodes are dead.
● Once it’s given up on them, it will start plumbing out everything needed for the routers on the first control node and they’ll be moved automatically.
○ A little later we’ll talk about the gross workarounds that were needed to make this work well.

Page 20: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● We re-enable the load balancer. We’re out of outage and back to a one node cluster.
○ Length of the API outage is basically the time to move routers, install new packages and run DB migrations
● We can now relax a bit, the worst is mostly over, but we have two more control nodes to upgrade

Page 21: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● The next step is to get the MySQL Galera cluster back up and running.
● When we start the database on the other nodes, Galera replication will ensure the databases on the other nodes are up to date.
○ No more database migrations are needed.

Page 22: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

Then we let puppet run through the other two nodes one by one, upgrading packages to Kilo and restarting services.

Page 23: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

Once all nodes are upgraded we’re nearly done, except one node is hosting all the routers. We have a script that will rebalance the routers evenly across the nodes, while avoiding moving any high profile tenants.
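
The rebalancing script itself is not shown in the talk. The following is a minimal sketch of one way such a plan could be computed, assuming router placement is available as a simple mapping; `plan_rebalance`, its parameters and the greedy strategy are all hypothetical and only illustrate the idea of evening out load while never touching pinned tenants.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def plan_rebalance(
    hosting: Dict[str, str],     # router id -> agent currently hosting it
    agents: List[str],           # all L3 agents that may host routers
    pinned_tenants: Set[str],    # tenants whose routers must not be moved
    tenant_of: Dict[str, str],   # router id -> owning tenant
) -> List[Tuple[str, str, str]]:
    """Return (router, source_agent, target_agent) moves that even out load."""
    by_agent = defaultdict(list)
    for router, agent in hosting.items():
        by_agent[agent].append(router)

    moves = []
    while True:
        busiest = max(agents, key=lambda a: len(by_agent[a]))
        idlest = min(agents, key=lambda a: len(by_agent[a]))
        if len(by_agent[busiest]) - len(by_agent[idlest]) <= 1:
            break  # as even as it can get
        movable = [r for r in by_agent[busiest] if tenant_of[r] not in pinned_tenants]
        if not movable:
            break  # only pinned routers remain on the busiest agent
        router = movable[0]
        by_agent[busiest].remove(router)
        by_agent[idlest].append(router)
        moves.append((router, busiest, idlest))
    return moves
```

Each planned move would then be carried out through the Neutron API, for example with the `neutron l3-agent-router-remove` and `neutron l3-agent-router-add` commands. This greedy sketch stops early if the busiest agent holds only pinned routers, which is a simplification.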

Page 24: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● And now we’re done with control nodes. We do a bunch more testing here, including
○ Live-migrating canary instances on compute nodes
○ Running our regression test suite
○ Checking logs, etc.

Page 25: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● To finish the upgrade, we need to get the compute nodes upgraded
● We live migrate all instances off of a few compute nodes and put canary instances on them
● Upgrade those nodes and do extensive testing on them
○ Live migration, volume attach/detach, etc.
● Proceed with a normal deploy
○ This causes a short outage because the OVS agent drops all flows when it’s restarted.
○ Unfortunately we can’t avoid this for Kilo
● Control and Compute upgrades took less than 3 hours per region, and we did the two regions on separate nights.
● The last thing we did was merge a change to remove the API compat flag on the control nodes and deploy that as part of the next normal deploy

Page 26: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

Overview

● As we mentioned before, a big problem in our Juno upgrade was loss of customer network connectivity during the upgrade
● We tracked this down to several causes:
○ Tunnel MAC learning flows have a default timeout of 5 minutes and require the L2 agent to be running to refresh. If your upgrade takes more than 5 minutes, they’re going to expire and you’re going to drop customer traffic.
○ On startup the OVS L2 agent flushes all flows.
■ Dropping all the flows wouldn’t be too bad, except that rebuilding them on a busy control node is *really* slow
■ Over 10-15 minutes for a complete rebuild of 2500 flows for 50-60 routers.
○ The other issue we ran into was caused by our abuse of Router HA Agent Failover beyond its design.
■ The router on the old control node would continue ARPing for the gateway, and blackholing the traffic
● Here’s how we addressed these...

Page 27: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

Detail

● Early in the upgrade we change the OVS MAC learning flow timeouts on all compute and control nodes from the default of 5 minutes to 30 minutes.
○ The reason we do this is that we know we’re going to have Neutron down long enough during the upgrade that the 5 minute timers will expire and we’ll start dropping traffic
○ There is still the remaining issue that any *new* flows may expire before the upgrade is complete
■ We didn’t observe this being an issue in practice.

Page 28: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

Detail

● First workaround is to avoid ever restarting the OVS agent on a node that is actively passing traffic.
○ On control nodes you just move the routers to a box that’s not actively being upgraded
○ On compute nodes you could do live migration, but we decided not to, since rebuilding flows there is much faster due to lower density.
● We use L3 agent failover to pre-build flows when we move routers.
○ This means that the time to build those flows occurs before we have an outage, instead of during.
● Lastly, the long term fix for this is in Liberty.
○ In Liberty, the OVS agent will tag flows with a cookie so that it can properly identify the flows in the future
○ On restart, instead of rebuilding everything with the brute force approach it used to take, it will synchronize the OVS flow state with what Neutron wants it to be

Page 29: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

Detail

● Lastly, we had to work around this issue with routers not moving properly sometimes
● After moving the routers to the new control node, we cleaned them up on the old hosting control node with the following steps:
○ Delete flows in the integration and tunnel bridges
○ Delete all the router ports
○ Delete the router namespaces
● This is absolutely a brute force approach, but it was very effective in avoiding the ARP issue and we had very few tenants losing networking with this approach.
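
The three cleanup steps can be sketched as a list of commands to run on the old hosting node. This is an illustration only, not the script used in the upgrade: the bridge names `br-int` and `br-tun` and the `qrouter-<uuid>` namespace naming are the usual Neutron defaults and are assumed here, the port names in the example are made up, and deleting every flow on a bridge (as `ovs-ofctl del-flows <bridge>` does) is only reasonable on a node that no longer hosts any routers. The function builds the commands without running them.

```python
from typing import Dict, List


def cleanup_commands(
    routers: Dict[str, List[str]],       # router UUID -> its OVS port names on this node
    integration_bridge: str = "br-int",  # assumed default name
    tunnel_bridge: str = "br-tun",       # assumed default name
) -> List[List[str]]:
    """Build (but do not run) the commands for the three cleanup steps."""
    commands = []
    # Step 1: delete flows in the integration and tunnel bridges.
    for bridge in (integration_bridge, tunnel_bridge):
        commands.append(["ovs-ofctl", "del-flows", bridge])
    for router_id, ports in routers.items():
        # Step 2: delete the router's ports from whichever bridge holds them.
        for port in ports:
            commands.append(["ovs-vsctl", "--if-exists", "del-port", port])
        # Step 3: delete the router's network namespace.
        commands.append(["ip", "netns", "delete", f"qrouter-{router_id}"])
    return commands
```

Printing the returned list first and reviewing it before executing anything is a cheap safeguard for a procedure this destructive.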

Page 30: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● So how did our testing and upgrade go?

Let’s use real-world tropical storm Kilo as a metaphor for our Kilo upgrade

It slowly meandered all over the place and it eventually died out after about 3 weeks.

The tropical storm was the 3rd longest lasting tropical storm in recorded history

We ran into a wide variety of minor and major problems, even with lessons learned from Juno, and we wish our Kilo upgrade had only lasted 3 weeks like the storm did

Partially this was because we put more network testing in place and had to improve our tooling, and that’s a worthwhile investment

But we also ran into a lot more problems with the Kilo upgrade.

Some of that was our own fault, and some of it….was other people’s fault.

Page 31: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● After our upgrade in our second region we realized that cinder-volume was completely broken
○ It was really odd, because we’d done exactly the same thing in the other region and it worked with no issues
● Eventually we tracked it down to this
○ The os_region_name variable is what Nova uses to determine which region’s cinder endpoint it should talk to.
○ If you only have one region, this doesn’t matter at all, there is only one cinder endpoint
■ If you have multiple regions, the libraries pick the endpoint with the lowest UUID
■ So when Nova tried to attach a volume, it was talking to cinder in the wrong data center!
■ So it was dumb luck we ran into this at the second region, instead of the first.
○ The problem is that os_region_name used to be in the DEFAULT section.
○ In Kilo it moved to the [cinder] section, but we didn’t catch that
● DEFAULT/os_region_name was deprecated in Juno, but we apparently ignored that when we did our upgrade
○ There was no mention of the removal of the backwards compatibility in the Kilo release notes
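
A check for this class of problem can be automated before the upgrade. The sketch below is a hypothetical audit, not part of the original automation: it scans the text of a config file for options that are still set only in their old section. The `MOVED_OPTIONS` table contains just the one move described above and would need to be extended by hand from the release notes.

```python
import configparser

# (old section, option) -> new section. Only the move described above is
# listed here; everything else would have to be added from release notes.
MOVED_OPTIONS = {("DEFAULT", "os_region_name"): "cinder"}


def find_stale_options(conf_text):
    """Return (option, old_section, new_section) for options set only in the old place."""
    # default_section is renamed so that [DEFAULT] is treated as an ordinary
    # section; otherwise its values would appear to exist in every section.
    parser = configparser.ConfigParser(
        default_section="__unused__", interpolation=None, strict=False
    )
    parser.read_string(conf_text)
    stale = []
    for (old_section, option), new_section in MOVED_OPTIONS.items():
        if parser.has_option(old_section, option) and not parser.has_option(new_section, option):
            stale.append((option, old_section, new_section))
    return stale
```

Running this against the configuration of every region, not only the first one upgraded, would have flagged the setting regardless of which region happened to have the lowest endpoint UUID.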

Page 32: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● If you have more than 100-200 routers with python-neutronclient 2.3.x, you can run into this issue
○ Returns “Request URI too long”
● This is a bug that had already been fixed upstream, but Canonical packaged the version that was in the global requirements list
● The global requirements list had the Juno version of neutron client until August
● Attempting to downgrade the Neutron client packages to work around this is how we ended up accidentally uninstalling Nova.

Page 33: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● So with the Kilo upgrade, you need to migrate flavor data after the upgrade to get things into the new way of storing that data.
● Once nova is brought up, it starts lazily migrating this data as flavors are accessed
● Shortly after the upgrade in a shared dev environment we *accidentally* uninstalled Nova on all nodes
● We ended up with flavor data that was partially migrated because of this, and that caused Nova to crash on startup.
● We spent hours tracking this down and eventually had to fix it by hand by editing the database entries.
● After this we changed our automation to migrate the flavor data immediately after doing the upgrade, and before we brought API services back online

Page 34: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● In Kilo, Neutron added a new option ‘allow_automatic_dhcp_failover’
○ This provides the ability to have DHCP server health checked regularly, and if one failed, it would automatically be spun up on another DHCP agent.
● Unfortunately, it detects spurious failures pretty regularly, for us multiple times a day
● Unfortunately, when it does fail over, it hits another bug a good percentage of the time that causes the DHCP neutron ports to get stuck in creating status
○ So in effect this was killing good DHCP servers instead of recovering bad ones
● We don’t even need this feature; we run three control nodes, and two DHCP agents per network
● However, it defaults to on, so for about a week after our upgrade we’d have tenants dropping offline because their DHCP server hit this combination of bugs and was dead until we manually cleaned things up
● There was no mention of this feature in the release notes.
● Part of how we discovered that this feature existed and was buggy was by looking at the DHCP code changes on the master branch for neutron and comparing them to the kilo branch
○ We realized this feature had a lot of bugs when we found lots of fixes for it on the master branch.
○ Of the half dozen fixes, only one or two of them were backported.
● We ended up just turning this off

Page 35: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● As I implied before, we ran into issues with validating services while the external endpoints were offline
● Normally the CLI clients get a list of service endpoints from keystone and default to the public one
○ By setting the OS_ENDPOINT_TYPE environment variable or passing the same thing in via a command-line option, you can override this and tell them to use the internalURL, which for us is separate and based on HAProxy
● The issue is that some of the CLI clients, including Neutron and Cinder, were broken and would ignore both of these.
● This broke our Puppet runs during the upgrade and it broke our smoke test scripts
● Unfortunately, because we found this issue very late in the process, we ended up deciding to just leave the external LB enabled for our production upgrades.
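
For clients that do honor it, the override amounts to setting one environment variable before invoking the CLI. The following is a minimal sketch of a smoke test runner built on that idea; the helper names are hypothetical and the commands passed in are whatever smoke tests a deployment uses. As noted above, some clients of that era ignored the setting, so a passing run here does not prove a given client actually used the internal endpoint.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional


def internal_endpoint_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and ask CLI clients to use the internal endpoints."""
    env = dict(os.environ if base is None else base)
    env["OS_ENDPOINT_TYPE"] = "internalURL"
    return env


def run_smoke_tests(commands: List[List[str]], run=subprocess.run) -> List[List[str]]:
    """Run each command with the internal endpoint type; return those that failed."""
    env = internal_endpoint_env()
    failed = []
    for cmd in commands:
        result = run(cmd, env=env)
        if result.returncode != 0:
            failed.append(cmd)
    return failed
```

The `run` parameter exists only so the runner can be exercised without real clients installed; in normal use the default `subprocess.run` is what executes the commands.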

Page 36: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● We also ran into schema problems with Glance.
● In Kilo, Nova started using the V2 Glance API
● The V2 API does schema validation, but the V1 API doesn’t really
○ So it was possible to create images with attributes via the V1 API that the V2 API thought were invalid.
○ Like description being NULL instead of an empty string
○ When that happens, Nova couldn’t do anything with the image, because it would fail schema validation via the V2 API
● There was no way to tell Nova to use the V1 API instead
● Flavio from the Glance team helped us get this fixed very quickly
● Canonical backported it quickly

Page 37: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● We ran into a similar issue with Glance but in the schema file instead of in Glance code
● The attributes this time were kernel_id and ramdisk_id
● We changed the schema file to allow these fields to be nullable
● This has been fixed upstream in the same way.
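
To make the nullable change concrete, here is a deliberately simplified sketch of type-only schema validation. It is not Glance’s validator and the two schemas are cut-down stand-ins for the real schema file; the point is only that a property typed as a string rejects NULL, while one typed as either null or string accepts it.

```python
def validate(image, schema):
    """Return the names of properties whose values violate the schema types."""
    type_checks = {
        "string": lambda v: isinstance(v, str),
        "null": lambda v: v is None,
    }
    errors = []
    for name, rule in schema["properties"].items():
        if name not in image:
            continue  # absent properties are not checked in this sketch
        allowed = rule["type"] if isinstance(rule["type"], list) else [rule["type"]]
        if not any(type_checks[t](image[name]) for t in allowed):
            errors.append(name)
    return errors


# Strings only: an image stored with NULL in these fields fails validation.
STRICT = {"properties": {"kernel_id": {"type": "string"},
                         "ramdisk_id": {"type": "string"}}}

# Nullable: the same image now validates.
NULLABLE = {"properties": {"kernel_id": {"type": ["null", "string"]},
                           "ramdisk_id": {"type": ["null", "string"]}}}
```

An image record such as `{"kernel_id": None, "ramdisk_id": None}` fails against `STRICT` and passes against `NULLABLE`, which mirrors the before and after of the schema file change.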

Page 38: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● When doing the first upgrade in our shared dev environment, we ran into a problem with Nova migrations
● MySQL was failing to run a migration to convert a column from NULL to NOT NULL
● It was failing because MySQL 5.6 has a bug that prevents converting a column to NOT NULL if it has a foreign key constraint
● This didn’t happen in all of our environments, and if we did a mysqldump and restore, the problem went away
● We opened a support case with Percona, waited for them to track it down and got a new build from them that resolved the issue.

Page 39: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● If you see a problem like this when running DB migrations, your problem is probably due to existing database tables not matching the default database sort order, or collation.

What happened for us is that we had some databases using utf8_unicode_ci and the upstream Puppet modules changed the default database collation to utf8_general_ci

That means newly created tables had a different sort order than the existing ones and when adding foreign keys between an old and new table, MySQL would refuse to add them

This could happen for any database in theory, for any migration that changes foreign keys.
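
A mismatch like this can be detected before running migrations by comparing each table’s collation with its database default. The sketch below assumes the query is run with whatever MySQL client library is already in use and that the rows come back as plain tuples; the query uses standard `information_schema` columns, while the helper function is hypothetical.

```python
# Lists every table together with the default collation of its database.
COLLATION_QUERY = """
SELECT t.TABLE_SCHEMA, t.TABLE_NAME, t.TABLE_COLLATION, s.DEFAULT_COLLATION_NAME
FROM information_schema.TABLES AS t
JOIN information_schema.SCHEMATA AS s ON s.SCHEMA_NAME = t.TABLE_SCHEMA
WHERE t.TABLE_COLLATION IS NOT NULL
"""


def mismatched_tables(rows):
    """Filter rows of (schema, table, table_collation, default_collation).

    Returns only the rows where the table's collation differs from the
    default of the database it lives in.
    """
    return [row for row in rows if row[2] != row[3]]
```

Any table this returns is a candidate for a foreign key failure when a migration creates a new table that references it.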

Page 40: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● Keystone middleware that all projects use for token validation was moved into a separate package in Juno, but Juno still supported the old library names.

In Kilo the old names were removed, but this wasn’t mentioned in the Kilo release notes.

The control nodes we had that were upgraded from Icehouse still had the old value

This was an easy fix once we found it.

Issues like this are particularly hard to find, since oslo.config’s normal deprecation mechanisms can’t cover this scenario
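
Since the deprecation machinery cannot warn about this, a plain text scan of the config files is one way to catch it. The sketch below assumes the old value in question is the `keystoneclient.middleware.auth_token` module path, which was replaced by `keystonemiddleware.auth_token`; the talk does not spell out the exact names or which file held them, so treat both constants and the helper functions as assumptions.

```python
OLD_MODULE = "keystoneclient.middleware.auth_token"  # assumed old name
NEW_MODULE = "keystonemiddleware.auth_token"         # assumed new name


def find_old_middleware(config_text):
    """Return (line number, line) pairs that still reference the old module.

    Commented-out lines are ignored.
    """
    return [
        (number, line.strip())
        for number, line in enumerate(config_text.splitlines(), 1)
        if OLD_MODULE in line and not line.lstrip().startswith("#")
    ]


def rewrite_middleware(config_text):
    """Return the config text with the old module path replaced by the new one."""
    return config_text.replace(OLD_MODULE, NEW_MODULE)
```

Scanning every service’s config on long-lived nodes is worthwhile because a node upgraded through several releases can carry values that a freshly installed node never had.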

Page 41: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● Last but not least, we found this problem after completing our first prod upgrade and turning API services back on

New feature in Nova scheduler called “scheduler_tracks_instance_changes”. This can track instance state to allow scheduler filters to make more informed decisions.

This is the commit message for the new feature

On startup the scheduler polls all compute nodes for instance state in batches of 10 at a time

Our experience was that this meant that nova-scheduler was chewing up 100% of a core until this was done and it took forever to finish

RabbitMQ would get disconnected, we believe because heartbeats were failing due to the thread not being scheduled

Even after turning off heartbeats, we still saw instances not being scheduled while this was enabled

We don’t use any scheduler filters and we didn’t need it, so we turned it off

Only vague notions of this in the release notes, and we didn’t understand what was going on until we found this commit message.

DocImpact tag definitely didn’t translate to release note updates in this case.

Page 42: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● After all those issues, this is about how we felt by the time we were done with our prod Kilo upgrades
● If you haven’t seen Groundhog Day, you should, it’s literally a classic.

Page 43: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● So a number of these problems we ran into are because we didn’t pay attention to deprecations in Juno, and when those features were removed in Kilo, we didn’t know, because for our Kilo upgrade we only read the Kilo release notes, not the Juno release notes.
● MySQL has bugs, we’re good at finding them with OpenStack upgrades. Yay?
● Part of the reason we upgrade is that we want new features (and bug fixes), but at least two of the problems we had were because new features were on by default, and they were buggy.
● Buggy services are one thing, but in both cases, there was no real documentation around these features.
○ One of them wasn’t mentioned in the release notes at all, and the other had no detail about what it did
● And to give credit where credit is due, some projects are really good at release notes.
○ The Cinder Kilo release notes were widely credited as being good at the Operator’s Mid-Cycle meetup
○ Looking through the Liberty release notes, the Nova section is really really good. It would be nice if everyone followed their example.

Page 44: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● So with that litany of horrible issues, you may be wondering if we thought upgrading was worthwhile:
● After resolving these issues, overall stability has been improved
● So AMQP heartbeats have increased stability dramatically for us.
○ This has cleared up a lot of intermittent issues for us, and also allowed us to put RabbitMQ behind a load balancer.
○ We wanted to put Rabbit behind a load balancer, because we’re in the process of moving our OpenStack environments to a new network architecture, and this helps us quiesce RabbitMQ before taking it offline.

Page 45: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● To wrap up, let’s talk about our next upgrade
● We’ve started some work on moving to Liberty already
○ We’re on master for all of the Puppet modules now (except keystone)
○ We don’t know what the timing for our Liberty upgrade will be yet, but I’ll be surprised if it’s not before Austin
● We’ve learned that no matter what, we’re going to run into weird problems.
○ For example, we ran into MySQL bugs in both Juno and Kilo upgrades, so apparently we should just assume that will happen and add another two weeks to get that fixed….
● We’re going to continue moving services into containers. We’ve got heat and designate in containers now, and it’s allowed us to upgrade them (or not) independently of other services.
○ This will allow us to avoid having to deal with conflicting dependencies between services
○ It also allows us to stage the new version of a service before the upgrade. Right now a lot of our upgrade time is actually installing packages.
● As we’ve mentioned before, a lot of the complexity in our upgrades has to do with the fact that upgrading the OVS agent causes it to drop all active flows.
○ We’re really looking forward to deleting a bunch of code, assuming this works in Liberty (it’s on by default)

● Lastly, we’re hoping to move to using HA routers once we’re on Liberty, and with that in place we hope to avoid moving any routers around during the upgrades

Page 46: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

○ Hopefully that will help with our Mitaka upgrade

Page 47: Upgrading OpenStack Without Breaking Everything (Including Neutron?)

● That’s all we’ve got, we appreciate everyone coming
● Hopefully we have some time for questions