BSA Deployment Best Practice Part 1 - BMC Software · 2020-05-04 · BMC Server Automation (BladeLogic) v8.2 Best Practices Deployment and Configuration Session 1 ... Deploy core
BSM & Automation involves many moving parts and dependencies
- People
- Process
- Products
Understand the environment first
- Business
- Technical
Design and document the initial plan first
- Deployment depends on Implementation Architecture
- Implementation Architecture depends on Scale
- This session is focused on Deployment Architecture
Deployment will be completed in phases.
- Don’t try to do everything at once (“Boil the Ocean”)
- A plan is required
- It’s important to consider dependencies
- Plan for what the product & your environment are capable of today.
Drivers
- Business
  - Specific project goals: compliance deadline, tool replacement, labor savings
  - Other projects
  - Timelines
- Technical
  - What is installed vs. not installed
    – AO? Provisioning? Patching? # of Platforms?
    – ITSM, CMDB?
  - Environment Readiness
    – Hardware
    – Defined policies
    – Access (PXE, DHCP, firewalls, SOCKS, etc.)
    – Choose initial use cases with highest value, greatest potential for success, clearest requirements
- People
  - Vacations, other assignments, etc.
  - Training (use cases, general BSA foundational knowledge, etc.)
  - Team priorities (where are the burning needs)
  - Concurrence
Improve Service Quality – Execute complex changes correctly
Reduce Risk – Comprehensive rollback for changes; define and enforce administrative roles
Reduce Cost – Single change process across all platforms; automate configuration & performance management processes
Phase 1 – Implementation Architecture
- Determine and document implementation architecture
- Depends on Scalability
- Must be done first for every project big or small
Phase 2 – Deploy core infrastructure
- Database
- Application Servers & File Server[s]
- Client Software (RCP Console GUI, NSH, etc.)
- RSCD Agents
- Reporting
- Repeaters (as applicable)
- Order and content will vary a bit depending on the project
Phase 4 – Configure Initial Reporting
- Show value & metrics back to the business early on
Phase 5 – Configure Software Deployment
- Build one or two basic software deployment packages first (start with the easy ones)
- Parameterize the packages as necessary next
- Get comfortable setting options third
Phase 6 – Set Up Patching & Provisioning
- Identify supporting infrastructure & start change requests
- Define build inputs & policies
- Patch: Start with the most generous policy in analyze-only mode
Phase 7 – Set Up Script Execution
- Roll out existing scripted actions, refactor Inventory & Configuration change / Configuration validation where possible
Phase 8 – Set Up Closed Loop Change or Operator-Initiated Change, CMDB sync
Phase 9 – Identify & Measure KPIs
- Patch compliance %, number of servers provisioned per week, number of remediations per month, number of server touches, etc.
Always keep the Business in mind
Solution Acceptance / Perception
- Work to show and prove value early
  - Identify basic compliance up front
  - Inventory/Asset Reporting
  - If it’s not reportable, it is difficult to demonstrate value
Major Mistakes to Avoid
- Incomplete Asset Reporting
  - Inaccurate / unknown state
  - Perceived solution failure
- Incomplete Use Case Deployment / Training
  - Reports look good, but users are not comfortable with basic tasks
  - Perceived solution failure
Agents must be deployed comprehensively, early (90+%)
Don’t do compliance in reporting only: better to define policy than dump bulk info
Target
- A managed server or OS instance running the RSCD agent
Application Server
- An instance of the software that does work in the form of Jobs, communicating with the RSCD agents
File Server
- One or more systems, running the RSCD agent, that make available one shared storage space for payloads and scripts used by the Application Servers
Repeater
- One or more systems located in remote data centers for efficient one-to-many replication of deployment payloads
Console
- An instance of the graphical client, used to interact with the Application Servers and managed Targets
Network Shell
- The command-line tool used to interact with servers
“ALL” type servers include the function of all three of: JOB, CONFIG, and NSH Proxy servers
JOB servers:
- Do the heavy lifting
- Run Jobs and compliance calculations
- Key players in the bulk of use cases
- Very good at using all available resources for fast Job execution
CONFIG servers:
- Receive incoming user connections from the GUI (RCP Console) and the Command Line Interface (CLI or blcli)
- When run only as a CONFIG or CONFIG/NSH Proxy server, give better UI performance
NSH Proxy servers provide:
- Connectivity to managed targets (solves some firewall challenges)
- Authentication for NSH
- A centralized audit point (solves some audit requirements)
- An extra layer of security protecting NSH agents
File Server – can be anything that will run the RSCD agent and has sufficient storage space (local or NAS/SAN). The file server can be on NAS only with Linux/UNIX.
Database – SQL Server or Oracle: this is where all metadata, configurations, change tracking data, Jobs, and anything else that’s not an installable or executable are stored.
Given a good copy/backup of the File Server and the Database, disaster recovery is fairly straightforward. Without one or the other, recovery can be very difficult.
First things first: Make sure your database can handle the load
- Regular cleanup (weekly & historicals daily if necessary)
- Monitoring of growth, I/O, CPU usage, etc.
- Sufficient db connections to support the number of work item threads (WITs), etc.
Vertically scale your application servers
- Increase job server work-item threads and JVM heap size
- Exploit available CPU and memory (typical machine size these days)
Horizontally scale your application servers
- Add job servers as needed
  - Survey spreadsheet-based tool for job server estimation
- Add config servers as needed, with load balancers
  - Load balancer can be avoided if user population naturally partitions
Job and config servers could be hosted on the same physical host
- Rule of thumb: two CPU cores per app server instance
- Ensure sufficient physical memory
Virtualized App Servers – one per OS image (8-12GB RAM, 2 vCPU, 4GB heap recommended)
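The co-hosting rule of thumb can be turned into a rough planning check. The sketch below is only an illustration: the function name and parameters are hypothetical, the two-cores-per-instance figure comes from the rule of thumb above, and the 8 GB per instance default is an assumption borrowed from the low end of the 8-12GB RAM recommendation. It ignores memory needed by the OS and the local RSCD agent.

```python
def max_app_server_instances(cpu_cores: int,
                             physical_ram_gb: float,
                             cores_per_instance: int = 2,
                             ram_per_instance_gb: float = 8.0) -> int:
    """Estimate how many app server instances one host can carry.

    The host is limited by whichever runs out first: CPU cores
    (two per instance by rule of thumb) or physical memory
    (assumed 8 GB per instance here; adjust to your own sizing).
    """
    if cores_per_instance < 1 or ram_per_instance_gb <= 0:
        raise ValueError("per-instance requirements must be positive")
    by_cpu = cpu_cores // cores_per_instance
    by_ram = int(physical_ram_gb // ram_per_instance_gb)
    return max(0, min(by_cpu, by_ram))
```

For example, a host with 16 cores but only 24 GB of RAM is memory-bound under these assumptions: the CPU rule would allow eight instances, the memory assumption only three.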
Number of work item threads per job server is configurable
- Too few threads: jobs could take too long to complete
- Too many threads: JVM could run out of memory
“Lightweight” Work Items (LWIs) can be served out of a separate thread pool
- Some work items (e.g., in deploy jobs) consume very few app server resources, and can be handled with greater parallelism
- Default configuration of zero LWI threads will use normal WITs for LWIs
Asynchronous execution for remote work
- When the work really happens on the target, it’s not necessary to hold up a WIT just to wait for a response
- Used by NSH Script Job (type 3), Patch analysis, Deploy, and SCAP compliance
Configure and Schedule your jobs to distribute load
- Avoid overlaps where possible
- Split targets across multiple jobs if necessary (~4000 servers/job)
- Be careful with “unlimited” job parallelism
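Splitting targets across jobs can be planned outside the product before any jobs are created. The following is a minimal sketch, not a BSA API: `split_targets` is a hypothetical helper that takes a plain list of server names and divides it into the fewest evenly sized batches that stay at or under the ~4000 servers/job guideline.

```python
import math

def split_targets(targets, max_per_job=4000):
    """Divide targets into evenly sized batches, none larger than max_per_job."""
    if max_per_job < 1:
        raise ValueError("max_per_job must be at least 1")
    if not targets:
        return []
    n_jobs = math.ceil(len(targets) / max_per_job)
    # Spread the remainder one server at a time over the first batches.
    base, extra = divmod(len(targets), n_jobs)
    batches, start = [], 0
    for i in range(n_jobs):
        size = base + (1 if i < extra else 0)
        batches.append(targets[start:start + size])
        start += size
    return batches
```

Even batches are used here rather than "fill to 4000, then overflow" so that no single job carries a disproportionate share of the load; each batch would then back its own job, scheduled so the runs do not overlap.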
The app server transfers data from the file server over NSH
- Not as efficient over the network as NFS or CIFS
The file server’s storage is likely to be SAN or NAS already in most environments
Some performance benefit can be realized by mounting the same share on each app server
- Each app server sees the file server as “localhost”
- File storage has to be mounted to the same mount point on each app server
- NSH communication happens over loopback (app server to local agent)
- Data travels over the wire using NFS/CIFS (agent to filer)
The NAS can be clustered/load balanced
- Allows the usual benefits: redundancy, performance
- Usually already have expertise here
If standalone file server, ensure sufficient file handles (>16k) and I/O performance
App Servers, Database, and GUI Clients should remain close together
- High data volume (sensitive to bandwidth)
- High packet volume (sensitive to latency)
- Database links in particular are sensitive to packet loss
Install Citrix Presentation Server to support remote users
- Or Remote Desktop to provide responsive access to remote users
Install BladeLogic standard Repeaters in each remote data center
Install BladeLogic Advanced Repeaters in each remote data center where bandwidth must be constrained (sensitive network links)
Install provisioning infrastructure in each remote data center
- PXE doesn’t perform reliably across many real-world WANs
- Keep OS images as local to the provisioning target as is reasonable
Install SOCKS Proxies in each remote data center where minimal firewall configurations are required or overlapping IP addresses exist
“High Availability” here means that any one failure isn’t catastrophic (no single point of failure)
Database: Clustered database (e.g., SQL Server clustering or Oracle RAC) for highly available data access
File Server: Clustered NAS/SAN server & virtualized fileserver
Multiple Job servers on separate machines
- Job servers can handle their own failover
- Work items in flight on a failed job server will fail
Multiple Config servers and NSH Proxy servers on separate machines, with highly-available load balancer.
- Load balancer handles failover of config servers
Disaster recovery is not “high availability”; it is the recovery of the entire environment after a major catastrophe
Database: Replicate BladeLogic database (e.g., using Oracle DataGuard / GoldenGate / other replication technologies per company standard)
Stand-by infrastructure (job servers, config servers, GUI clients) ready to go at DR site
Failover needs to be “rapid,” not “instantaneous.”
- Failover is manually initiated.
Worst case: with a good copy of the database and file server, the installation can be stood back up by installing a new appserver. So make sure your backups of both are good, and please test them regularly. This is a great way to do upgrade testing in a lab environment.
Use App Server Certificates
- By default, the installation process generates self-signed certificates for app servers.
- Bad guys can self-sign certificates, too.
- Recommend investing the effort in using a proper Certificate Authority, even if internal.
Use Client-Side Certificates
- Each management console should also have a certificate
- App servers should be configured to require client-side certificates.
Use an NSH Proxy
- Adds an extra layer of authentication to NSH communication
Don’t allow log-ins on the hosting servers
- Part of protecting the app server infrastructure is to limit access to the underlying servers.
Treat the BLAdmin and RBACAdmin user accounts like ‘root’
- Each user should have their own account, and switch roles when necessary to
Use ‘exports’ file on each agent to restrict access
- Allow connections only from job servers and NSH Proxies
- Prevents rogue appservers or NSH clients from making direct connections
Use “ACL Push” jobs to manage ‘users’ file on each agent
- Establish “ACL Push” jobs for all agents (except file servers!)
- Schedule ACL Push jobs for regular execution (e.g., weekly)
- PUSH_ACL_NO_USERS_FLAG property: leave set
Add BLAdmins:* to ‘users.local’ file on each agent
- Allows access by BLAdmins role as a backup, in the event of an issue.
- (map as a local administrator)
Set up NSH on an existing shared-access UNIX host
- It’s not uncommon to install NSH (client) on an existing shared UNIX host: this is an easy way to configure NSH where many users can access the shell without involving VPNs
Java Memory
- 32-bit processes have to walk a fine line
- 64-bit processes have more address space, but also use more memory
- Specific values for each supported environment
Work Item Threads
- Max performance usually comes from max number of work item threads that doesn’t run the JVM out of memory
- Diminishing returns for increasing WITs in job servers
- Recommendation: 50 WITs (32-bit), 100 WITs (64-bit)
Database Connections
- Two ‘job’ database connections per work item thread
Hosting Environment   Java Heap Size   Physical Memory   Work Item Threads
32-bit Windows        1GB              4GB               50
32-bit Linux          1.5GB            4GB               50
32-bit Solaris        2GB              4GB               50
64-bit (any)          4-6GB            8-12GB            100
Note that you can get twice as many WorkItemThreads (WITs) per appserver instance on 64-bit OSes with a 50% heap:physical memory ratio, but will need enough memory to support it. Scaling is not necessarily linear, as 64-bit objects require more memory than 32-bit ones. Start with a 4GB heap, and leave room to grow if very active.
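The sizing table and the "two ‘job’ database connections per work item thread" rule can be combined into a quick estimate of how many job connections the database must accept. This is a rough sketch only: the `Sizing` class, the `SIZING` dictionary, and `job_db_connections` are hypothetical names, the values are transcribed from the table above, and any other connection pools the application servers may open are not counted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sizing:
    heap_gb: tuple          # (low, high) Java heap size in GB
    physical_gb: tuple      # (low, high) physical memory in GB
    work_item_threads: int  # recommended WITs per job server

# Values transcribed from the sizing table above.
SIZING = {
    "32-bit Windows": Sizing((1.0, 1.0), (4, 4), 50),
    "32-bit Linux": Sizing((1.5, 1.5), (4, 4), 50),
    "32-bit Solaris": Sizing((2.0, 2.0), (4, 4), 50),
    "64-bit (any)": Sizing((4.0, 6.0), (8, 12), 100),
}

# From the Database Connections guidance: two 'job' connections per WIT.
JOB_DB_CONNECTIONS_PER_WIT = 2

def job_db_connections(environment: str, job_servers: int) -> int:
    """Estimate 'job' database connections for a set of identical job servers."""
    if job_servers < 0:
        raise ValueError("job_servers must not be negative")
    wits = SIZING[environment].work_item_threads
    return job_servers * wits * JOB_DB_CONNECTIONS_PER_WIT
```

Three 64-bit job servers at the recommended 100 WITs each would, by this estimate, need the database to allow 600 job connections, which is worth confirming with the DBA before raising thread counts.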
Task/Maintenance Windows:
- Performance and capacity are often relevant to a specific task or maintenance window: whether it’s an 11PM-7AM Saturday night maintenance window or a 2-hour change window on a week-night, tasks are often time-constrained
Total performance capacity
- Capacity of the environment: number of work items available vs. a given task window
- An environment with 2 app servers with 50 WorkItemThreads (WITs) each will have 100 total WorkItemThreads: 2 APP x 50 MaxWIT = 100 WIT total
- In a given 60 minute window: 100 WIT x 60 minutes = 6000 WIT-minutes (WIT-M)
- So, a task that takes 3 minutes and uses one WIT per server may be able to run across as many as 2000 servers in about an hour at this capacity: 6000 WIT-M / 3 min/svr/WIT = 2000 servers
In practice, there is also time required for setting up a given Job and closing it down, but this is a useful shorthand.
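The WIT-minute shorthand is easy to script for what-if planning. The sketch below simply restates the arithmetic above; the function names are hypothetical, and, as noted, it leaves out job setup and teardown time, so treat the results as upper bounds.

```python
import math

def wit_minutes(app_servers: int, wits_per_server: int, window_minutes: int) -> int:
    """Total capacity of the environment in WIT-minutes for one window."""
    return app_servers * wits_per_server * window_minutes

def max_servers_in_window(app_servers: int, wits_per_server: int,
                          window_minutes: int, minutes_per_server: float) -> int:
    """Upper bound on targets a task can cover, at one WIT per target."""
    if minutes_per_server <= 0:
        raise ValueError("minutes_per_server must be positive")
    capacity = wit_minutes(app_servers, wits_per_server, window_minutes)
    return int(capacity // minutes_per_server)

def window_needed(servers: int, minutes_per_server: float,
                  app_servers: int, wits_per_server: int) -> int:
    """Minutes of window needed to cover the given number of targets."""
    total_wits = app_servers * wits_per_server
    if total_wits <= 0:
        raise ValueError("at least one work item thread is required")
    return math.ceil(servers * minutes_per_server / total_wits)
```

With the figures from the slide, `max_servers_in_window(2, 50, 60, 3)` reproduces the 2000-server result. Running the calculation the other way, 5000 servers at 3 minutes each on the same 100 WITs would need a window of about 150 minutes.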
Work-Item-Thread-Minutes / Time Constraints (cont’d)
Some long-running single-server Jobs are constrained by MaxJobs (like Provisioning), and may run for a few minutes: here total capacity may be less (typically 20 MaxJobs / appserver)
Some types of NSH Script Jobs run parallel (type 1 & 3), while others are single-threaded (type 2 & 4)
JOB_TIMEOUT and JOB_PART_TIMEOUT properties ensure that tasks complete or exit within time constraints, and that they don’t wait for non-responsive hosts or threads.
Make sure to test & become familiar with job performance before using in critical maintenance windows
Agent Health is critical to successful job runs because the appserver is generous when trying to talk to a slow remote agent.
- JOB_PART_TIMEOUT
Agent Health Survey:
- Servers go up and down regularly
- Run the “Update Server Properties” Job regularly, and before a critical job; it updates the AGENT_STATUS property:
  – “Agent is Alive” for hosts that are up, vs.
  – “Agent is Unavailable” for hosts that are down.
- Use AGENT_STATUS in Server Smart Groups to include only available hosts in Jobs
  - Can’t deploy to a host that’s not up
Recovery:
- Re-run Update Server Properties Job more often against a server group that only includes “down” servers
- Use a Server Smart Group to identify hosts that have been out of contact > 30 days
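The three smart-group conditions described above amount to simple filters on server properties. The sketch below models them in plain Python to make the logic concrete; it does not call any BSA API. The server records are hypothetical dictionaries, `AGENT_STATUS` and its two values are taken from the text, and `last_contact` is an assumed field standing in for whatever property records the last successful contact.

```python
from datetime import datetime, timedelta

ALIVE = "Agent is Alive"
UNAVAILABLE = "Agent is Unavailable"

def available_targets(servers):
    """Hosts safe to include in a job: agent reported alive."""
    return [s for s in servers if s["AGENT_STATUS"] == ALIVE]

def down_targets(servers):
    """Hosts to re-check more often with Update Server Properties."""
    return [s for s in servers if s["AGENT_STATUS"] == UNAVAILABLE]

def out_of_contact(servers, now, days=30):
    """Hosts whose last contact (assumed field) is older than `days`."""
    cutoff = now - timedelta(days=days)
    return [s for s in servers if s["last_contact"] < cutoff]
```

The same three conditions would be expressed as Server Smart Group rules in the console; the point of the model is that the "available" group feeds critical jobs, the "down" group feeds the more frequent status refresh, and the "out of contact" group feeds cleanup or investigation.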
What to do when you inherit a BSA installation, including “How to” videos: https://communities.bmc.com/communities/community/bsm_initiatives/optimize_it/blog/2012/06/15/taking-the-reins-server-automation
Initial Install – Database Setup: On BMCdocs YouTube at http://www.youtube.com/watch?v=91FEUDVD6sE
Initial Install – File Server and App Server Installs: On Communities YouTube at http://www.youtube.com/watch?v=m7Y3SY23kuQ
Initial Install – Console GUI and Appserver Config: On Communities YouTube at http://www.youtube.com/watch?v=uwqlj60Lvo0
Compliance Content Install: On BMCdocs YouTube at http://www.youtube.com/watch?v=bXdaogDsCNc
Compliance Quick Audit: On BMCdocs YouTube at http://www.youtube.com/watch?v=i8BLi4WAWEY
BSA 8.2 Patching - Setting Up a Windows Patch Catalog: On Communities YouTube at http://www.youtube.com/watch?v=nfpFpOuub9k
Windows Patch Analysis: On Communities YouTube at http://www.youtube.com/watch?v=ODWhC01uEaQ
Patching in Short Maintenance Windows with BMC BladeLogic Server Automation: On Communities YouTube at http://www.youtube.com/watch?v=o6Lfzbb3JZg
Here's another video I made about basic packaging of a Windows MSI.
Basic Software Packaging: http://www.youtube.com/watch?feature=player_embedded&v=dtOWTTFqsaY