Scaling to-5000-nodes

SCALING PUPPET ENTERPRISE TO 5,000 NODES IN 9 MONTHS

Lessons learned, and how PE makes me think of goats

WHO AM I?

• DevOps and Cloud Admin* at TE Connectivity

• ~9 years of assorted technical operations experience

• ~1 year of PE usage/administration

• Puppet Featured Community Member (for most verbose complaints by a Test Pilot 2014)

• Puppet Certified Professional 2015 (sample scores: Puppet Language 94%, Console 40%)

• Can’t be bothered to take internal “Making compelling presentations” training

<= LIAR =>

PE DEPLOYMENT STATS

• 5100 PE licenses

  • Prod => 4157 Agents

  • Dev => 72 Agents

  • 871 Licenses purchased for systems of stubborn people.

• 14 supported OSes spanning 7 OS families

• Prod PE deployment consists of 11 servers.

  • 1 CA / Filebucket Server

  • 1 PuppetDB server (using embedded PostgreSQL)

  • 1 Puppet Console

  • 4 Puppet Compile Masters

  • 1 ActiveMQ Hub

  • 3 ActiveMQ Brokers

THE CRUELEST LIES ARE OFTEN TOLD WHEN TRYING TO GET MANAGERS TO BUY THE RIGHT TOOLS

• Compliance reporting (without remediation)

• Application code deployment

• Service discovery

• DNS?!

• Any phrase that includes “I’m sure there is a way puppet can…”

NO-OP (AKA MY ARCH NEMESIS)

• No-Op is a tool, not a solution.

• No-Op != Operational Intelligence

• It is a Pandora’s Box full of excuses not to embrace change (see also: “brownfield”, “legacy”, “near-EoL”).

• Make sure you enforce enough code to control your agent configuration…

THE FASTEST WAY TO CAUSE 4000 AGENT RUNS TO FAIL

• Custom Facter facts are your friend, until they aren’t.

• The #1 culprit for mass agent failures is a bad confine in a custom fact that was not tested against enough canary nodes.

• “It worked when I tested it, the fact even returns the right value”.
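A minimal sketch of a defensively written custom fact, to illustrate the point about confines. The fact name app_version and the path /etc/myapp/version are hypothetical and not from the talk. The parsing logic lives in a plain Ruby method so it can be exercised on a workstation before it ever reaches a canary node, and the Facter registration is skipped when Facter is not loaded.

```ruby
# Hypothetical fact: reports the first line of a version file.
# The fact name and the file path are illustrative assumptions.
VERSION_FILE = '/etc/myapp/version'

# Plain Ruby helper so the logic is testable without Facter.
# Returns nil (the fact is simply absent) instead of raising.
def read_app_version(path = VERSION_FILE)
  return nil unless File.readable?(path)
  line = File.open(path, &:gets)   # first line, or nil for an empty file
  version = line.to_s.strip
  version.empty? ? nil : version
rescue SystemCallError
  nil
end

# Register the fact only when running under Facter.
if defined?(Facter)
  Facter.add(:app_version) do
    # Without this confine the fact would also be evaluated on every
    # non-Linux agent in the fleet.
    confine :kernel => 'Linux'
    setcode { read_app_version }
  end
end
```

The design choice here is that a missing or unreadable file yields no fact at all rather than an exception, so one unexpected platform does not take thousands of agent runs down with it.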


TIME TO SCALE OUT

#puppet.conf.stub
[main]
server = puppet.example.net
archive_files = true
archive_file_server = puppet.example.net
ca_server = puppet.example.net

#puppetdb.conf.stub
[main]
server = puppet.example.net

#console.conf.stub
[main]
server = puppet.example.net

Evolution of puppet.conf






#puppet.conf.stub
[main]
server = puppet.example.net
archive_files = true
archive_file_server = puppet.example.net
ca_server = puppet.example.net

#puppetdb.conf.stub
[main]
server = puppetdb.example.net

#console.conf.stub
[main]
server = puppetconsole.example.net







#puppet.conf.stub
[main]
# server is now an LB
server = puppet.example.net
archive_files = true
archive_file_server = puppetfb.example.net*
ca_server = puppetca.example.net*

#puppetdb.conf.stub
[main]
server = puppetdb.example.net

#console.conf.stub
[main]
server = puppetconsole.example.net







LOAD BALANCING PITFALLS

• Do Load Balance

  • Port 8140 between compile masters

    • If connection stickiness exceeds 30 minutes (the default agent run interval), agents will never change masters.

  • Port 61613 between ActiveMQ Brokers

• Don’t Load Balance

  • Puppet CA, or any cert signing requests.

  • File Bucket (archive_file_server)

  • ActiveMQ hub (more split-brain SSL)
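The stickiness pitfall comes down to comparing two timers. A rough sketch, assuming the load balancer refreshes its sticky timer on every connection and that agents check in at Puppet's default 30-minute run interval (splay and run duration are ignored; the method name is made up for illustration):

```ruby
# Puppet's default runinterval is 30 minutes.
RUN_INTERVAL_MIN = 30

# True when the sticky session is still alive at the agent's next
# check-in, which renews it again, so the agent stays pinned to the
# same compile master indefinitely.
def pinned_forever?(stickiness_min, run_interval_min = RUN_INTERVAL_MIN)
  stickiness_min > run_interval_min
end

[10, 30, 31, 60].each do |minutes|
  puts "stickiness #{minutes}m -> pinned forever: #{pinned_forever?(minutes)}"
end
```

Under this simplified model, any stickiness window longer than the run interval never gets a chance to expire.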

PERFORMANCE ISSUES (You’re looking down.)

• Sizing Recommendations Revised

  • PuppetDB needs far more RAM than is recommended once you scale. (Req 30GB; ours is presently 50GB, and it should be higher.)

  • PostgreSQL best practices call for memory equal to 3x the database size for best performance. At 4000 nodes with 3 days of retention, puppetdb is ~50GB and consoledb is ~40GB.

  • ConsoleDB needs to be pruned aggressively (reports = nodes * 48 * days of retention). That much information is not useful in the console.

  • Console uses less RAM than expected. (Req 30GB; ours is presently 10GB.)
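To put numbers on the sizing rules of thumb, here is a small worked calculation. It only restates the slide's formulas (48 reports per node per day, i.e. one run every 30 minutes, and memory of 3x the database size); the method names are illustrative.

```ruby
RUNS_PER_DAY = 48  # one agent run every 30 minutes

# reports = nodes * 48 * days of retention
def stored_reports(nodes, retention_days, runs_per_day = RUNS_PER_DAY)
  nodes * runs_per_day * retention_days
end

# PostgreSQL rule of thumb quoted on the slide: RAM = 3x database size.
def rule_of_thumb_ram_gb(db_size_gb, factor = 3)
  db_size_gb * factor
end

puts stored_reports(4000, 3)    # 576000 reports at 3 days of retention
puts stored_reports(4157, 3)    # 598608 for the actual prod agent count
puts rule_of_thumb_ram_gb(50)   # 150 GB for a ~50GB puppetdb
puts rule_of_thumb_ram_gb(40)   # 120 GB for a ~40GB consoledb
```

Even at 3 days of retention the console is holding well over half a million reports, which is why aggressive pruning matters.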

[Chart: Puppet Scaling Experience (highly scientific data). Vertical axis: Pain, 0% to 60,000%. Labels: None, Agent Registered, Agent Runs, Agent Classified, PuppetDB, Puppet Console.]

CONSOLE CONFIGURATIONS

• @4000 nodes we use 8 dashboard workers.

• As the number of nodes grows, the default page of the console can become very sluggish. Edit /opt/puppet/share/puppet-dashboard/config/routes.rb to point the root route at the reports index instead:

    PuppetDashboard::Application.routes do
      # root :to => 'pages#home'
      root :to => 'reports#index'

JVM TUNING

• Problem: Service stops, logs show Out of Memory Exceptions.

• Heap Sizes:

  • puppetserver - 4GB

  • puppetdb - 1GB

  • PE console - 2GB

  • ActiveMQ Hub - 1.5GB

  • ActiveMQ Broker - 1GB

• PuppetDB (server component) has been a JVM application for a while, so most GC settings can be tuned as Puppet parameters.
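As a rough illustration, the heap sizes above can be turned into standard JVM heap flags. Setting the initial heap (-Xms) equal to the maximum (-Xmx) is an assumption made for this sketch, not something the slide states, and the hash keys are just labels:

```ruby
# Heap sizes from the slide, in GB.
HEAP_GB = {
  'puppetserver'    => 4,
  'puppetdb'        => 1,
  'pe-console'      => 2,
  'activemq-hub'    => 1.5,
  'activemq-broker' => 1
}

# Render -Xms/-Xmx flags in megabytes (assumes Xms == Xmx).
def heap_flags(gb)
  mb = (gb * 1024).to_i
  "-Xms#{mb}m -Xmx#{mb}m"
end

HEAP_GB.each do |service, gb|
  puts "#{service}: #{heap_flags(gb)}"
end
```

How these flags are delivered to each service (a defaults file or a Puppet class parameter) depends on the PE version, so check the documentation for your release.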

GREAT WISDOMS AND PERSISTING PAINS

• Use R10K. Use Puppetfile. Use Roles and Profiles.

• Learn what nanliu/staging does. Then use it.

• exec { 'horrible_idea':
    command  => 'dostuff.sh && touch /tmp/didstuff.proof',
    creates  => '/tmp/didstuff.proof',
    provider => shell,  # `&&` needs a shell to interpret it
  }

• PuppetLabs, myself, and most of our profession are absolutely terrible at naming things.

• Problem: (‘Environment’ && ‘Deployment’ && ‘Tier’ && ‘Branches’ && ‘Forks’) => [‘Production’, ‘Dev’, ‘QA’]

• Result: cats.all? { cats.content[:name] == ‘Selso’ } => true

• Proxy Servers are evil. Spaceship Operators have a cool name.

• Problem: universally_respected_proxy_variables.exists? => false

• Solution: Use site.pp + Resource Collection to set top level resource defaults.

The “read this later” slide

“IF I HAVE SEEN FURTHER IT IS BY STANDING ON YE SHOULDERS OF GIANTS” ~ ISAAC NEWTON

Resources that have gotten me by:

• https://docs.puppetlabs.com/references/latest/type.html

• Puppet Types and Providers by Dan Bode and Nan Liu

• Puppet Practitioner’s Training

• Gary Larizza’s Blog (aka nsfw missing puppet documentation)

• PuppetLabs Support

• Puppet Professional Services

And most importantly:

• A healthy mixture of ambition, stubbornness and stupidity.

QUESTIONS?

@pwattstbd github.com/Marsupermammal [email protected]

