SCALING PUPPET ENTERPRISE TO 5,000 NODES IN 9 MONTHS
Lessons learned, and how PE makes me think of goats
WHO AM I?
• DevOps and Cloud Admin* at TE Connectivity
• ~9 years of assorted technical operations experience
• ~1 year of PE usage/administration
• Puppet Featured Community Member (for most verbose complaints by a Test Pilot, 2014)
• Puppet Certified Professional 2015 (sample scores: Puppet Language 94%, Console 40%)
• Can’t be bothered to take the internal “Making compelling presentations” training
<= LIAR =>
PE DEPLOYMENT STATS
• 5100 PE licenses
  • Prod => 4157 Agents
  • Dev => 72 Agents
  • 871 licenses purchased for systems of stubborn people.
• 14 supported OSes spanning 7 OS families
• Prod PE deployment consists of 11 servers.
  • 1 CA / Filebucket Server
  • 1 PuppetDB server (using embedded PostgreSQL)
  • 1 Puppet Console
  • 4 Puppet Compile Masters
  • 1 ActiveMQ Hub
  • 3 ActiveMQ Brokers
THE CRUELEST LIES ARE OFTEN TOLD WHEN TRYING TO GET MANAGERS TO BUY THE RIGHT TOOLS
• Compliance reporting (without remediation)
• Application code deployment
• Service discovery
• DNS?!
• Any phrase that includes “I’m sure there is a way Puppet can…”
NO-OP (AKA MY ARCH NEMESIS)
• No-Op is a tool, not a solution.
• No-Op != Operational Intelligence
• Pandora’s Box full of excuses not to embrace change (see also: “brownfield”, “legacy”, “near-EoL”)
• Make sure you enforce enough code to control your agent configuration…
THE FASTEST WAY TO CAUSE 4000 AGENT RUNS TO FAIL
• Custom Facter facts are your friend, until they aren’t.
• The #1 culprit for massive agent failures is a bad confine in a custom fact that was not tested against enough canary nodes.
• “It worked when I tested it, the fact even returns the right value”.
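A minimal sketch of a defensively confined custom fact. The fact name `widget_version`, the binary path `/usr/bin/widget`, and its output format are all hypothetical and not taken from this deployment. The parsing sits in a plain Ruby method so it can be exercised on a workstation without Facter; the fact itself is only registered when Facter has loaded the file.

```ruby
# Hypothetical example: fact name, binary path, and output format are
# assumptions for illustration only.
widget_bin = '/usr/bin/widget'

# Pure parsing helper, kept separate so it can be tested without Facter.
# Returns the dotted version string, or nil when the output is missing
# or does not contain one.
def parse_widget_version(output)
  return nil if output.nil?
  match = output.match(/(\d+(?:\.\d+)+)/)
  match ? match[1] : nil
end

# Facter is already loaded when it evaluates custom fact files; running this
# file with plain `ruby` skips the registration below.
if defined?(Facter)
  Facter.add(:widget_version) do
    # Only attempt resolution on kernels where the binary is expected...
    confine :kernel => 'Linux'
    # ...and only when it is actually installed on this node.
    confine { File.executable?(widget_bin) }

    setcode do
      # :on_fail => nil returns nil instead of raising if the command fails.
      raw = Facter::Core::Execution.execute("#{widget_bin} --version", :on_fail => nil)
      parse_widget_version(raw)
    end
  end
end
```

The point of the two confines is that a node without the binary never reaches `setcode`, so the fact is simply absent there instead of failing the run. Canary nodes from every OS family still need to confirm that.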
Important
# puppet.conf.stub
[main]
server = puppet.example.net
archive_files = true
archive_file_server = puppet.example.net
ca_server = puppet.example.net
# puppetdb.conf.stub
[main]
server = puppet.example.net
# console.conf.stub
[main]
server = puppet.example.net
Evolution of puppet.conf
# puppet.conf.stub
[main]
server = puppet.example.net
archive_files = true
archive_file_server = puppet.example.net
ca_server = puppet.example.net
# puppetdb.conf.stub
[main]
server = puppetdb.example.net
# console.conf.stub
[main]
server = puppetconsole.example.net
Evolution of puppet.conf
# puppet.conf.stub
[main]
# server is now a load balancer
server = puppet.example.net
archive_files = true
archive_file_server = puppetfb.example.net*
ca_server = puppetca.example.net*
# puppetdb.conf.stub
[main]
server = puppetdb.example.net
# console.conf.stub
[main]
server = puppetconsole.example.net
Evolution of puppet.conf
LOAD BALANCING PITFALLS
• Do Load Balance
  • Port 8140 between compile masters
    • If connection stickiness lasts longer than 30 minutes (the default agent run interval), agents will never change masters.
  • Port 61613 between ActiveMQ Brokers
• Don’t Load Balance
  • Puppet CA, or any cert signing requests.
  • File Bucket (archive_file_server)
  • ActiveMQ hub, more split-brain SSL
• Sizing Recommendations Revised
  • PuppetDB needs far more RAM than is recommended once you scale. (Recommended: 30GB. Ours at present: 50GB, and it should be higher.)
  • PostgreSQL best practices call for memory of 3x the database size for best performance. At 4000 nodes, puppetdb is ~50GB and consoledb is ~40GB at 3 days of retention.
  • ConsoleDB needs to be pruned aggressively (reports = nodes * 48 * days of retention). That much information is not useful in the console.
  • The console uses less RAM than expected. (Recommended: 30GB. Ours at present: 10GB.)
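The sizing arithmetic above, worked through as a small Ruby sketch. The method names are made up for illustration, and the 48 is assumed to be the number of agent runs per node per day at the default 30-minute run interval.

```ruby
# Back-of-the-envelope sizing helpers. The formulas come from the bullets
# above; the method names are hypothetical.

# Assumption: one agent run every 30 minutes, i.e. 48 reports per node per day.
RUNS_PER_NODE_PER_DAY = 48

# reports = nodes * 48 * days of retention
def stored_reports(nodes, retention_days)
  nodes * RUNS_PER_NODE_PER_DAY * retention_days
end

# PostgreSQL rule of thumb quoted above: memory of 3x the database size.
def postgres_memory_target_gb(db_size_gb)
  3 * db_size_gb
end

puts stored_reports(4000, 3)        # 576000 reports held in the console DB
puts postgres_memory_target_gb(50)  # 150 GB for a ~50GB puppetdb
puts postgres_memory_target_gb(40)  # 120 GB for a ~40GB consoledb
```

Over half a million reports at only 3 days of retention is why aggressive pruning matters, and the 3x rule puts the memory target well above the 50GB currently allocated to PuppetDB.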
[Chart: Puppet Scaling Experience (highly scientific data). Y-axis: Pain, 0% to 60,000%. Categories: None, Agent Registered, Agent Runs, Agent Classified. Series: PuppetDB, Puppet Console.]
CONSOLE CONFIGURATIONS
• At 4000 nodes we use 8 dashboard workers.
• As the number of nodes grows, the default page of the console can become very sluggish. Edit /opt/puppet/share/puppet-dashboard/config/routes.rb to adjust the root route:
    PuppetDashboard::Application.routes do
      # root :to => 'pages#home'
      root :to => 'reports#index'
JVM TUNING
• Problem: Service stops, logs show Out of Memory exceptions.
• Heap Sizes:
  • puppetserver - 4GB
  • puppetdb - 1GB
  • PE console - 2GB
  • ActiveMQ Hub - 1.5GB
  • ActiveMQ Broker - 1GB
• PuppetDB (server component) has been a JVM service for a while, so most GC settings can be tuned as Puppet parameters.
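For reference, the heap sizes above expressed as standard JVM heap flags. This is only a sketch: the service labels are informal, pinning `-Xms` to the same value as `-Xmx` is a common choice rather than something stated here, and where the flags are actually set depends on the PE version.

```ruby
# Heap sizes from the list above, in megabytes (1GB = 1024MB).
# The service labels are informal names for this sketch.
HEAP_MB = {
  'puppetserver'    => 4096,
  'puppetdb'        => 1024,
  'pe-console'      => 2048,
  'activemq-hub'    => 1536,
  'activemq-broker' => 1024
}

# Render JVM heap flags. Using the same value for the initial (-Xms) and
# maximum (-Xmx) heap is an assumption, not taken from the slide.
def heap_flags(megabytes)
  "-Xms#{megabytes}m -Xmx#{megabytes}m"
end

HEAP_MB.each do |service, mb|
  puts "#{service}: #{heap_flags(mb)}"
end
```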
• Use r10k. Use a Puppetfile. Use Roles and Profiles.
• Learn what nanliu/staging does. Then use it.
• exec { 'horrible_idea':
    command => 'dostuff.sh && touch /tmp/didstuff.proof',
    creates => '/tmp/didstuff.proof',
  }
• PuppetLabs, myself, and most of our profession are absolutely terrible at naming things.
• Problem: (‘Environment’ && ‘Deployment’ && ‘Tier’ && ‘Branches’ && ‘Forks’) => [‘Production’, ‘Dev’, ‘QA’]
• Result: cats.all? { cats.content[:name] == ‘Selso’ } => true
• Proxy Servers are evil. Spaceship Operators have a cool name.
• Problem: universally_respected_proxy_variables.exists? => false
• Solution: Use site.pp + Resource Collection to set top-level resource defaults.
The “read this later” slide
“IF I HAVE SEEN FURTHER IT IS BY STANDING ON YE SHOULDERS OF GIANTS” ~ ISAAC NEWTON
Resources that have gotten me by:
• https://docs.puppetlabs.com/references/latest/type.html
• Puppet Types and Providers by Dan Bode and Nan Liu
• Puppet Practitioner’s Training
• Gary Larizza’s Blog (aka NSFW missing Puppet documentation)
• PuppetLabs Support
• Puppet Professional Services
And most importantly:
• A healthy mixture of ambition, stubbornness and stupidity.
QUESTIONS?
@pwattstbd github.com/Marsupermammal