Best Practices: Demonstrating Value with BSA - BMC Software · BMC Server Automation (BladeLogic) v8.2 Best Practices Demonstrating Value with BSA (BladeLogic) Sean Berry Lead, Customer
What value does automation bring to the organization?
How is it going to make my job easier?
How is it going to make me look better to my boss?
How is it going to make me and my team more marketable (inside and outside the company)?
Ideally, your resume shouldn’t only list your job descriptions; it should show what you accomplished, and what you will be able to accomplish in the future.
Dollar values and metrics on your resume mean more to a company than a list of tasks like “I installed an agent”.
Be able to:
Talk about your server automation environment in dollars and cents: how much money does good reporting or compliance save your company every day/week/month/year?
Identify the major use cases in your BSA environment, and how they add value
- faster provisioning
- faster reaction to issues
- faster mean time to repair (MTTR)
- lower cost of management
- faster customer response
Identify the next use cases you want your group to take on, and start building a business case for rolling them out
Speak to the costs of automation, and where it makes sense (macros vs. AI)
Speak to the percentage of project (revenue-impacting) vs. maintenance (overhead) work
“It doesn’t need to be pretty or shiny, it just needs to get the job done.”
What does an outage cost your company in dollars per hour?
- Do you have a check for everything that’s ever caused an outage in your environment? Is it built into your build policy? You have a build policy, right?
Getting Value From BladeLogic
- What goes into a server and why does it matter?
- How are data centers built? How do we organize around them? How do servers end up there? What’s a datacenter and why put them there and not under our desks?
- Value comes both as capital-V Value, measured by the CTO, and small-v value, measured by whether you spend the rest of the week cranking on something or get it wrapped up tonight before you go home.
- BSA, in the hands of someone who knows how to use it (either through training or experience), is a force multiplier. We estimated at one customer that a skilled BSA user can be 3-5x more productive than an equivalent UNIX or Windows sysadmin. Being able to take on more tasks in a given window of time (more “project” work vs. maintenance work) adds value.
What does an FTE or contractor cost per hour in your org?
- Base salary vs. fully loaded (with benefits/overhead/cubicle/workstation/VPN/travel)
- $60k salary -> $30/hr base cost, or ~$60-75/hr “loaded”
- 40*52 – vacation = ~2000 working hours (w/o overtime)
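The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the 2.0 and 2.5 overhead multipliers are an assumption inferred from the $30/hr to ~$60-75/hr range on the slide, not figures from any BMC tool.

```python
# Sketch of the loaded-cost arithmetic. The overhead multipliers are
# assumptions inferred from the slide's $30/hr -> ~$60-75/hr range.

def base_hourly_rate(annual_salary, working_hours=2000):
    """Base cost per hour: salary spread over ~2000 working hours a year."""
    return annual_salary / working_hours


def loaded_hourly_rate(annual_salary, overhead_multiplier, working_hours=2000):
    """Hourly cost once benefits, workspace, VPN, travel, etc. are included."""
    return base_hourly_rate(annual_salary, working_hours) * overhead_multiplier


if __name__ == "__main__":
    salary = 60_000
    print(base_hourly_rate(salary))         # 30.0
    print(loaded_hourly_rate(salary, 2.0))  # 60.0
    print(loaded_hourly_rate(salary, 2.5))  # 75.0
```

Substitute your own organization's salary bands and overhead factor; the point is to have one agreed hourly figure to multiply against time saved.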
For a given script execution, audit, compliance run, or software deploy:
- How long would it have taken for an individual to execute this task by hand?
  - Including staging time
  - Including identifying the correct servers
  - Including verifying availability
  - Could a level one or level two resource have done this task?
- How long does it take to run the job once?
- How long does it take to schedule the job once?
Vs:
- How much upkeep is required to maintain the job going forward?
  - Including updating smartgroups (should be marginal or zero)
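One way to turn the questions above into a number is a small cost comparison. The sketch below is a hedged illustration: the `TaskEstimate` fields and the sample figures (8 hours by hand, 4 hours to build the job, monthly runs, $60/hr) are hypothetical placeholders, not measurements from any customer.

```python
from dataclasses import dataclass


@dataclass
class TaskEstimate:
    manual_hours_per_run: float   # by hand, incl. staging, finding servers, verifying availability
    build_hours: float            # one-time effort to create and schedule the job
    job_hours_per_run: float      # operator time per automated run
    upkeep_hours_per_year: float  # ongoing maintenance (ideally marginal or zero)
    runs_per_year: int
    loaded_rate: float            # dollars per hour, fully loaded


def annual_manual_cost(t):
    return t.manual_hours_per_run * t.runs_per_year * t.loaded_rate


def first_year_automated_cost(t):
    hours = t.build_hours + t.job_hours_per_run * t.runs_per_year + t.upkeep_hours_per_year
    return hours * t.loaded_rate


def first_year_savings(t):
    return annual_manual_cost(t) - first_year_automated_cost(t)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    estimate = TaskEstimate(
        manual_hours_per_run=8, build_hours=4, job_hours_per_run=0.25,
        upkeep_hours_per_year=0, runs_per_year=12, loaded_rate=60,
    )
    print(annual_manual_cost(estimate))         # 5760
    print(first_year_automated_cost(estimate))  # 420.0
    print(first_year_savings(estimate))         # 5340.0
```

Keeping the estimate conservative (higher build time, lower manual time) makes the resulting figure easier to defend.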
Most organizations:
- 80% Maintenance / Keep The Lights On
- 20% Project Work (new initiatives, things that bring in revenue)
- Maintenance -> overhead: first place to cut costs
- CIO/CTO: “How can I get more of my projects done this year?”
Easy to see “Job Security” in the maintenance, but once automation becomes standard…
Outsourcing vs. Automation:
- Common to see 10 offshore resources executing patching on 10-15 servers each, manually
- One engineer can commonly execute automated patching against several hundred machines: more automated, fewer human errors.
80% of downtime caused by human error: reduce exposure
What’s a script?
- A series of commands, sometimes including error-checking or conditional flows, to accomplish a specific goal.
Common scripting languages include various shells (Bourne, Korn, C, etc.), DOS/Command, Visual Basic (vbs), PowerShell, Perl, Expect
Many scripts start their lives as “pipelines”: several commands piped together to find a specific item of information or answer a specific question
Scripts are a great tool in the hands of a skilled user, but can sometimes be more difficult to effectively delegate to L1/L2 users
- Power tools: don’t always have safeguards
- Effective testing
- Required options: passing blank arguments or no arguments into scripts that
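As an illustration of the kind of safeguard a hand-written script needs for the blank-argument hazard listed above, here is a minimal Python sketch. The `require_args` helper, the `target_dir` argument, and the "clean up" action are hypothetical names chosen for this example.

```python
import sys


def require_args(argv, names):
    """Return a name -> value dict, or raise ValueError if any argument is missing or blank."""
    if len(argv) != len(names):
        raise ValueError(
            f"expected {len(names)} argument(s) ({', '.join(names)}), got {len(argv)}"
        )
    values = dict(zip(names, argv))
    blank = [name for name, value in values.items() if not value.strip()]
    if blank:
        raise ValueError("blank value for: " + ", ".join(blank))
    return values


def main(argv):
    """Hypothetical cleanup script: refuses to run without a real target directory."""
    try:
        args = require_args(argv, ["target_dir"])
    except ValueError as err:
        print(f"refusing to run: {err}", file=sys.stderr)
        return 2
    # The real work would go here; this sketch only reports what it would do.
    print(f"would clean up {args['target_dir']}")
    return 0
```

Calling `main(sys.argv[1:])` from a command-line entry point gives a script that exits early, with a clear message, instead of acting on an empty path.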
What are Objects?
- The set of “nouns” in BladeLogic, like files, directories, configuration entries, registry keys, software packages (both platform-specific and platform-agnostic), service definitions, virtual guest packages, against which the “verbs” like Audit, Snapshot, Package, Deploy, and Rollback/Undo can be used.
What’s the difference?
- One-off configuration audits, rather than retrieving and parsing config files (or parsing in-place on remote servers), become a matter of identifying the desired configuration and running a fast audit, with clear color-coded callouts of which config is correct, incorrect, or missing.
- No more automation required around “ssh”: transport is taken care of.
- In an hour or two of effort, a domain expert working with a BSA expert can create a “canned” software package and deploy job that correctly installs/upgrades an agent. Afterwards, this process (package + job) can be delegated to L1/L2 users, included in the new server provisioning process, and used as a remediation action by the build compliance process.
What’s the difference? (continued)
- The intelligence about how to talk to different operating systems, parse configuration files, and deploy/rollback software is already either built or templated in. You get to start two steps ahead (process development gets cheaper).
- Since the Objects and Jobs are supported by someone else, you’re not stuck supporting your scripts forever, unable to get promoted because you’re “too critical” to take on new responsibilities.
“Scripty” post in the Optimize IT Blog: https://communities.bmc.com/communities/community/bsm_initiatives/optimize_it/blog/2011/01/14/scripty
Automation in Cooking: https://communities.bmc.com/communities/community/bsm_initiatives/optimize_it/blog/2012/02/24/everything-i-know-about-automation-i-learned-from-my-sous-vide-supreme
For a given script execution, audit, compliance run, or software deploy:
- How long would it have taken for an individual to execute this task by hand?
  - Including staging time
  - Including identifying the correct server
  - Including verifying availability
- How long does it take to run the job once?
- How long does it take to schedule the job once?
- How much upkeep is required to maintain the job going forward?
  - Including updating smartgroups (should be 0)
How often were you running that task?
- Were you only running it occasionally because the overhead of the process was too high to run more often?
How often does that job run now?
- Biannual or quarterly compliance audits vs. weekly or even daily visibility into compliance
- Cost of being out of compliance
- Cost of getting back to a compliant state
At least one large financial institution uses the output from BSA, combined with some custom reports and a couple of good spreadsheets, to demonstrate value delivered with BSA
$10MM++ project
Quarterly Business Reviews / Cost Justifications
Headcount Justification
Metrics are Meaningful & Powerful:
- Hard to argue with facts & numbers
- Easier to argue with interpretation of facts
Conservative estimates always help: better to aim low
Don’t try to do everything in Reporting
Don’t be discouraged if you do have to do some post-processing
BDSSA provides OOTB reports that can help report in terms of dollars and hours: you may end up needing to either create a custom report or do some post-processing in Excel
- There’s still value in being able to generate the underlying stats
- Use what’s available out of the box or with small amounts of work to help support
These use cases assume a fully operational BSA environment. Some require integrations with a Change or Incident system.
The road to implementing these use cases has many steps, and requires:
- Functional process
- Buy-in from all impacted groups
- Working integrations & supported software versions
- A healthy infrastructure environment
- Trained and effective staff
- Ongoing support
Weekly/daily lights-out audits vs. manual or semi-automated quarterly audits
Previously cost-prohibitive
The "invisible" cost: configuration drift between audits and inertia
- Fear of change/risk
- More regular audits: easier enforcement
Operator Initiated Change:
- a change is selected or defined by the operator
- linked into Change Management
- when approved (and maintenance window reached), the approved Change executes
Value:
- Effective Change Process
- Less time spent in Change meetings
- much better change visibility and documentation
- lower total risk
- (morale?)
Functional Bare Metal and/or Virtual Guest Provisioning Environment & Team
- Provisioning
- Virtualization (on all platforms: VMware, Hyper-V, Solaris Zones, IBM LPARs, etc.)
Functional Packaging and Promotion
- BLPackager
- Software Packages (incl. Custom Software Packages)
- NSH Scripts & Jobs
Functional Compliance & Hardening
- Every system should leave the “Server Factory” fully secured & compliant with:
  - Security (CIS, DISA, custom)
  - Regulatory (PCI, HIPAA, GLB, SOX, custom)
  - Build Policies (OS platform, Middleware Platform, Data Center-specific)
Functional Patching & Hardening
- Every system should leave the “Server Factory” fully patched to the current policy (no
Talk track
Skilled admins & subject matter experts (SMEs) usually have the privileges to maintain any component of a server or application; however, agent maintenance & other common tasks are not necessarily a good use of their time.
Agent install/upgrade & other common tasks can be easily packaged by SMEs.
L1/L2 can then execute these tasks whenever needed, as many times as required.
Most inventories
- Static spreadsheets, “stale once emailed”
- Compiled quarterly (or worse)
- Hard to correct/feedback
BladeLogic Customer example
- Automated inventory survey -> report
- Massive power outage
- Used current inventory spreadsheet to build a “restart” plan
Value
- “Date Updated” indicates last contact, currency of data
- Current inventory increases confidence in decisions
- BSA seen as “source of truth” for the data center
- Inventory information used in Smart groups to quickly answer questions like:
  - “How many Windows 2008 Servers do we have in Production?”
  - “How many RHEL 5 in the San Jose data center?”
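The inventory questions above amount to filtering server records on a few attributes. The sketch below is a plain-Python stand-in to show the idea; it is not the BSA smart group mechanism, and the `Server` fields, the `smart_group` helper, and the sample inventory are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    os: str
    environment: str
    data_center: str


def smart_group(servers, **criteria):
    """Return the servers whose attributes match every given criterion."""
    return [
        s for s in servers
        if all(getattr(s, field) == wanted for field, wanted in criteria.items())
    ]


# Hypothetical inventory records for illustration only.
INVENTORY = [
    Server("win01", "Windows 2008", "Production", "San Jose"),
    Server("win02", "Windows 2008", "Production", "Chicago"),
    Server("win03", "Windows 2008", "Test", "San Jose"),
    Server("rhel01", "RHEL 5", "Production", "San Jose"),
    Server("rhel02", "RHEL 5", "Test", "Chicago"),
]

if __name__ == "__main__":
    print(len(smart_group(INVENTORY, os="Windows 2008", environment="Production")))  # 2
    print(len(smart_group(INVENTORY, os="RHEL 5", data_center="San Jose")))          # 1
```

The answers are only as good as the freshness of the records, which is why the “Date Updated” value matters.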
What does an outage cost your company in dollars per hour?
Insurance Company – acquired resources
- Small set of servers, not built by our process
- Remote Data Center: out of sight, out of mind
- DNS, service accounts not set up correctly: when there’s an incident only a couple of people know how to get into these systems
- Response time, service level is poor -> service perception is poor: low value
Datacenter move
- Chicago data center: moving between facilities.
- Significant pre-planning executed, some “invisible” assumptions.
- When Chicago DNS server went offline, so did customer e-servicing
  - “Put it back!” -> delayed move for hours
  - Service unavailable or underperforming for 5 hours
  - Isolated to misconfigured resolv.conf: several sysadmins had looked at that configuration; only caught through scripted comparison.
- Basic build audits could have caught or prevented
- Thousands of dollars of lost revenue
Large financial institution near NYC, casual conversation discovered:
Contractor assigned on a 90-day project to verify & reconcile /etc/resolv.conf entries
Contractor probably billed at least $60/hr: 90*8*60: at least $43K problem
Problem phases:
- What do I have? (Discovery / Inventory)
- Which is correct? (Manual/human interaction & Audit)
- Identify incorrect servers (Snapshot & Audit or live-live Audit)
- Package Changes (from Audit results)
- Change approval (usually an external process)
- Deploy Changes (execute Deploy)
- Rollback in event of issues
Simple audit of /etc/resolv.conf using existing server smartgroups
- < 1 hr “door-to-door”
- Existing “intrinsic” standards become obvious
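To make the idea of a resolv.conf audit concrete, here is a minimal Python sketch that compares the nameserver entries collected from each server against a standard and labels each one correct, incorrect, or missing. It does not use any BSA API: the function names, the expected nameserver list, and the sample file contents (using documentation-range addresses) are assumptions for illustration, and collecting the files from the servers is left out.

```python
def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses in file order, skipping comments and other directives."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#;":
            continue  # blank line or comment
        fields = stripped.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def audit_resolv_conf(expected, configs):
    """Map each server name to 'correct', 'incorrect', or 'missing'."""
    results = {}
    for server, text in configs.items():
        found = parse_nameservers(text)
        if not found:
            results[server] = "missing"
        elif found == expected:
            results[server] = "correct"
        else:
            results[server] = "incorrect"
    return results


# Hypothetical standard and collected file contents, for illustration only.
EXPECTED = ["192.0.2.10", "192.0.2.11"]
SAMPLE = {
    "web01": "nameserver 192.0.2.10\nnameserver 192.0.2.11\n",
    "db01": "search example.com\nnameserver 192.0.2.99\n",
    "app01": "# no resolvers configured\n",
}

if __name__ == "__main__":
    for server, status in audit_resolv_conf(EXPECTED, SAMPLE).items():
        print(server, status)
```

Running this over every server in a group also surfaces the “intrinsic” standard: the nameserver list most machines already share.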
How many places in this process can we cut out cost? Do you want to spend 90 days chasing one fairly basic set of configurations?
“One true build policy”:
- Single OS -> at least a secure and “standard” build
Many servers in a data center -> at least a few common traits per group
Most orgs have some kind of build standard
- scribbled notes on a sheet passed around between admins
- Under-utilized Word doc
- Configurations built into bare metal provisioning system (kickstart/jumpstart/etc.)
Most non-automated build standards aren’t complete, and are rarely updated.
Drift: Standards change over time, “July 2011 build”
6-12 different builds over three years (times the number of different kinds of builds)
Vs. standard RHEL 5 build that changes over time
- Evaluate all servers to that standard regularly
Builds break down into major components: a given set of vertically aligned components is sometimes called a “stack”.
- “SQL Server 2008” stack might be
  - built on Windows 2008 R2,
  - on virtual or on a standard make and model of hardware (HP DL380 G??),
  - have a standard set of agents appropriate for a database server, etc.
These can all be different policies, which only need to apply to the specific servers they’re relevant to. Even a single policy with a few rules can deliver value, and is a great place to start.
Once built, the next time a configuration either causes a problem, or someone remarks on a misconfiguration, create a rule for it.
This is common any time we want to know when something has changed, but once it's changed, we want to use that as the new standard.
Not to be confused with a build audit, where any deviation from standard requires remediation.
Sometimes called a "rolling" audit: this gives visibility into authorized and unauthorized change, and can be used to either verify configuration change, or identify unauthorized change.
Auditing the entire machine (some 100,000 configurations) will generate mostly noise. Filter down to known, managed configuration items.
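The "rolling" audit described above can be sketched as a diff against the previous snapshot that then adopts the current state as the new baseline. This is a plain-Python illustration under stated assumptions, not how BSA Snapshot and Audit jobs are implemented; the function names and the configuration item keys are hypothetical.

```python
def diff_config(baseline, current):
    """Return {item: (old, new)} for every item that differs; None means absent."""
    changes = {}
    for key in sorted(set(baseline) | set(current)):
        old, new = baseline.get(key), current.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


def rolling_audit(baseline, current, managed_items):
    """Report changes to managed items, then return the current state as the new baseline."""
    tracked_old = {k: v for k, v in baseline.items() if k in managed_items}
    tracked_new = {k: v for k, v in current.items() if k in managed_items}
    return diff_config(tracked_old, tracked_new), tracked_new


if __name__ == "__main__":
    # Hypothetical configuration items for illustration only.
    managed = {"resolv.conf:nameserver", "ntp.conf:server"}
    snapshot = {"resolv.conf:nameserver": "192.0.2.10", "ntp.conf:server": "ntp1.example.com", "tmp:scratch": "a"}
    live = {"resolv.conf:nameserver": "192.0.2.99", "ntp.conf:server": "ntp1.example.com", "tmp:scratch": "b"}

    changes, new_baseline = rolling_audit(snapshot, live, managed)
    print(changes)  # only the managed resolv.conf item is reported; tmp:scratch is filtered out

    changes_again, _ = rolling_audit(new_baseline, live, managed)
    print(changes_again)  # nothing new once the change has become the baseline
```

A build audit would differ in the last step: the standard stays fixed, and the reported differences are remediated rather than adopted.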
A basic deployment can consist of something as simple as dropping a tarball on a system, extracting it, and running a command.
However, most deployments worth automating rarely stay so simple.
Now we need to be able to pass a hostname, or a directory to install in, or create a user account for the agent to run under.
Test whether the directory is present, writeable, and has the correct permissions.
Do the right thing if the user account is already present. Handle error conditions.
We also need to train our users to understand the results of this deploy process.
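The directory checks mentioned above might look like the following pre-flight sketch. It is a hedged illustration for a POSIX system: the `preflight` function name and the default 0o755 mode are assumptions for this example, and the user-account check is deliberately left out.

```python
import os
import stat


def preflight(install_dir, required_mode=0o755):
    """Return a list of problems; an empty list means the deploy can proceed."""
    if not os.path.isdir(install_dir):
        return [f"{install_dir}: directory is not present"]

    problems = []
    if not os.access(install_dir, os.W_OK):
        problems.append(f"{install_dir}: directory is not writeable")

    # Compare only the permission bits, not the file-type bits.
    mode = stat.S_IMODE(os.stat(install_dir).st_mode)
    if mode != required_mode:
        problems.append(f"{install_dir}: mode is {mode:o}, expected {required_mode:o}")
    return problems
```

Returning a list of readable problems, rather than failing on the first one, also helps with the training point: L1/L2 users see everything that needs fixing in one pass.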
Directory/file sync: scheduled, logged, auditable
Embed non-NSH script
Anything that consolidates information: remote inventory, cmd or file pickup
Config file audit: resolv.conf, ntp.conf, backup agent config
- Easy to add new config file, new grammar
Basic software deploy: build once, use many times
- Easy for L1/L2 to use via Execution Tasks
- Easy to use in Provisioning
- Audit/Compliance: Use for remediation
What to do when you inherit a BSA installation, including “How to” videos: https://communities.bmc.com/communities/community/bsm_initiatives/optimize_it/blog/2012/06/15/taking-the-reins-server-automation
Initial Install – Database Setup: On BMCdocs YouTube at http://www.youtube.com/watch?v=91FEUDVD6sE
Initial Install – File Server and App Server Installs: On Communities YouTube at http://www.youtube.com/watch?v=m7Y3SY23kuQ
Initial Install – Console GUI and Appserver Config: On Communities YouTube at http://www.youtube.com/watch?v=uwqlj60Lvo0
Compliance Content Install: On BMCdocs YouTube at http://www.youtube.com/watch?v=bXdaogDsCNc
Compliance Quick Audit: On BMCdocs YouTube at http://www.youtube.com/watch?v=i8BLi4WAWEY
BSA 8.2 Patching - Setting Up a Windows Patch Catalog: On Communities YouTube at http://www.youtube.com/watch?v=nfpFpOuub9k
Windows Patch Analysis: On Communities YouTube at http://www.youtube.com/watch?v=ODWhC01uEaQ
Patching in Short Maintenance Windows with BMC BladeLogic Server Automation: On Communities YouTube at
Change Tracking is the most basic form of Build Compliance. It says that something, once configured according to a standard, shouldn't change without authorization.
A typical configuration might be a local account deployed on servers, or DNS Server entries (on UNIX, this is typically in /etc/resolv.conf). There are several more advanced ways to do this (including a really beautiful demonstration of the uses of the Property Dictionary), but the basic use case is easy to set up, and easy to show initial value.