A tale of disaster recovery
Cfengine everyday, practices and tools
Nicolas Charles <[email protected]>
Jonathan Clarke <[email protected]>
FOSDEM 2011 @Brussels, Belgium
May 17, 2015
About the speakers
Nicolas Charles
Cfengine contributor
Cfengine ”Community Champion” (C3)
Scala Developer
Jonathan Clarke
OpenLDAP committer
Sysadmin
But we get on pretty well! (mostly...)
Agenda
1) Configuration Management 101
2) Our choice of tool
3) A tale of disaster recovery
4) Introducing Cfengine 3
5) Why we love Cfengine 3
A bit about Configuration Management...
Configuration management: What is it?
Configuration Management is a field of management that focuses on establishing and maintaining consistency of a system (..) throughout its life
Software configuration management is the task of tracking and controlling changes in the software
Sources:
http://en.wikipedia.org/wiki/Configuration_management
http://en.wikipedia.org/wiki/Software_configuration_management
Configuration management: Why is it useful?
● Control changes
● Reproduce over time and nodes
● Audit and keep history data
● Repair automatically
What we chose, and why
Configuration Management Tools
Our choice
Back in mid 2009
Needed a configuration management tool
Criteria:
● Open source
● Multi-platform agent (including Windows)
● Resilient
● Non-disruptive
Our choice: candidates
● Cfengine 3
● Puppet
● Chef
Our choice: candidates
Cfengine 3
More on this choice later...
An ill-fated tale from the recent past
Disaster Recovery
(CASE STUDY)
Before the disaster...
Our company's IT infrastructure
Small company: small requirements
● Web site, email
● Git repository, Redmine...
Small company: small budget
● All on one hosted server
Asking for trouble?
● Just one hosted server!
● Critical services!
No, a ”safe” configuration:
● Redundant hardware, 3 disk RAID-5 array
● All services automatically installed and set up using Configuration Management
● Backups: daily (several off-site locations)
● Several VMs to separate services
A critical failure
2 hard drives fail simultaneously
→ RAID-5 array is down
→ Almost all services fail immediately
→ ”The end of the world as we know it”
→ Need to rebuild everything NOW
Recovering
Step 1: Panic!
Step 2: Get a new server
Step 3: Reinstall base OS + virtualization
Step 4: Restore VM configuration... whoops
Step 4: Re-create the VMs manually
Step 5: Reinstall each OS in each VM...
Step 6: Install Configuration Management
Step 7: Sit back and watch all the services coming back online as if by magic!
Step 8: Huh, where's my data?
Step 9: Manually restore backups
Step 10: Make a list of missing data...
Lessons learned
1) Hard disks fail reliably
2) Restoring virtualization setups:
● Backing up the config files would have helped
● Need CM tools to describe the desired state! (Cfengine Nova does this)
3) Configuration Management should tie in to our backup system
4) Backups were lacking some files: always test!
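Lesson 4 can be partly automated. The following is only a minimal sketch, assuming the backups are tar archives and that a list of paths expected in each archive is maintained somewhere (both are assumptions, the talk does not say how the backups were stored). The function name `missing_from_backup` is hypothetical.

```python
import tarfile


def missing_from_backup(archive_path, expected_members):
    """Return the expected member names that are absent from a tar archive.

    Assumption: backups are tar archives and `expected_members` uses the
    same relative names as the archive (for example "etc/passwd").
    """
    with tarfile.open(archive_path) as archive:
        present = set(archive.getnames())
    return sorted(set(expected_members) - present)
```

Run regularly against the latest backup, a non-empty result would flag missing files before a disaster rather than after one. A listing check like this does not replace a real restore test.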
Wishlist and discussion
Integrating Configuration Management tools and backup systems is a crucial step for CM to be efficient for disaster recovery
What do others do?
Provisioning VMs and their resources (disks, network) should be automated too
Cloud providers are one solution
What about ”plain” virtualization?
A bit about Cfengine 3...
Sources: across the Internet
Cfengine: History
Source: http://verticalsysadmin.com/blog/uncategorized/relative-origins-of-cfengine-chef-and-puppet
Cfengine 3: Intro
● Configuration management software
● Written in C
● Two versions:
  ● Community (GPL v3)
  ● Nova (closed source): Community + extra features
● Backed by Cfengine AS – Norway-based company founded in 2009
Cfengine 3: Features
According to the KU Leuven comparative study of configuration management systems:
● Very mature
● Cross platform (*BSD, AIX, HP-UX, Linux, Mac OS X, Solaris, Windows)
● Strongly distributed
● Based on state description and convergence
● Very high scalability (> 10000 nodes)
● Very small footprint
Source: http://distrinet.cs.kuleuven.be/software/sysconfigtools/overview
Cfengine 3: Components
cf-agent
● Runs on all managed hosts
● Applies configuration – this is the heart
● Can connect to cf-serverd to get policies / files
cf-serverd
● Distributes policies and files
● Must be run on policy server(s)
● Usually run on all hosts to enable remote runs
cf-monitord
● Collects statistics on all nodes
Cfengine 3: Promises
● Configuration rules are called promises
● ”Promise” to be in the desired state
● Cfengine agent handles the steps to get there: convergence
Promise theory is based on research done at the University of Oslo
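The idea of a promise converging to a desired state can be illustrated outside Cfengine. The following Python sketch is not Cfengine code and does not reflect how cf-agent is implemented; the function name `promise_file_content` is hypothetical. It only shows the pattern: describe the desired state, compare it with the current state, and repair only when they differ.

```python
import os
import tempfile


def promise_file_content(path, desired):
    """Converge `path` towards `desired` content.

    Returns True if a repair was made, False if the promise was already kept.
    """
    try:
        with open(path, "r", encoding="utf-8") as handle:
            current = handle.read()
    except FileNotFoundError:
        current = None  # a missing file is simply "not in the desired state"

    if current == desired:
        return False  # promise kept: nothing to do

    with open(path, "w", encoding="utf-8") as handle:
        handle.write(desired)
    return True  # promise repaired


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        motd = os.path.join(workdir, "motd")
        print(promise_file_content(motd, "Welcome\n"))  # first run repairs
        print(promise_file_content(motd, "Welcome\n"))  # second run changes nothing
```

Because the function is idempotent, it can be run as often as needed, which is what makes repeated agent runs safe.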
Cfengine 3: Usage examples
● Large companies (Facebook, AMD, …)
● Critical systems: Joint Australian Tsunami Warning Centre
● Personal computers
● Mobile devices: Nokia N900
● Underwater devices: army submarines
● Small and medium companies...
Why we love Cfengine 3...
Sources: our experience and opinions
Memory usage Daemon consumption on managed hosts
Multi-platform
● Define a configuration for all operating systems
  ● Windows, Linux
● Make it ”transparent” (forget about the complexity)
● Existing standard library handling the differences between each OS and distribution
File editing
Only change what you need to
● You like your distribution's defaults?
● You have various different systems already set up and just need to change something?
Search for lines and replace/delete/add them
Only change one field in a file
● /etc/passwd for example...
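To make "only change one field in a file" concrete, here is a small Python sketch. It is not Cfengine syntax. The helper `set_field` and the sample records are hypothetical, and it works on a list of lines in memory rather than on the real /etc/passwd.

```python
def set_field(lines, key, field_index, value, sep=":"):
    """Return a copy of `lines` in which the record whose first field equals
    `key` has field number `field_index` (0-based) set to `value`.
    Every other line, and every other field, is left untouched."""
    result = []
    for line in lines:
        fields = line.split(sep)
        if fields[0] == key and field_index < len(fields):
            fields[field_index] = value
            line = sep.join(fields)
        result.append(line)
    return result


# Hypothetical passwd-style records, for illustration only.
passwd = [
    "root:x:0:0:root:/root:/bin/bash",
    "alice:x:1000:1000:Alice:/home/alice:/bin/sh",
]

# Change only alice's login shell (the seventh field, index 6).
updated = set_field(passwd, "alice", 6, "/bin/bash")
```

Applying the same edit a second time leaves the lines unchanged, so the edit is convergent in the same sense as a promise.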
Complex tasks
Powerful class system to trigger promises
● Based on the node itself
● Based on time
● Based on whatever you might imagine
Complex workflows can be created
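The class mechanism can be pictured as a set of labels describing the node and the current time, with each promise guarded by the labels it requires. The Python sketch below only illustrates that idea under simplified assumptions: the class names (`linux`, `saturday`, `hr09`) and the functions `discover_classes` and `should_run` are hypothetical, not Cfengine's actual class names or API.

```python
import datetime
import platform

DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]


def discover_classes(now=None, system=None, hostname=None):
    """Build a set of classes from the node itself and from the current time."""
    now = now or datetime.datetime.now()
    system = system or platform.system()
    hostname = hostname or platform.node()
    return {
        system.lower(),        # based on the node: operating system
        hostname.lower(),      # based on the node: host name
        DAYS[now.weekday()],   # based on time: day of the week
        "hr%02d" % now.hour,   # based on time: hour of the day
    }


def should_run(required, classes):
    """A promise is triggered only when all of its required classes are defined."""
    return set(required) <= classes


if __name__ == "__main__":
    classes = discover_classes()
    print(sorted(classes))
    print(should_run(["linux"], classes))
```

Combining several required classes per promise is one simple way to build up the kind of workflow the slide mentions.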
FOSDEM 2011 Configuration Management room
Thank you !
And those brave enough to wake up early