© 2015 IBM Corporation Planning for Catastrophe with IBM WebSphere Application Server & IBM Business Process Manager Tom Alcott STSM Chris Richardson STSM
© 2015 IBM Corporation
Planning for Catastrophe withIBM WebSphere Application Server &IBM Business Process Manager
Tom Alcott STSM
Chris Richardson STSM
This Session
• This session will focus on the architectural and operational issues that need tobe considered when planning and implementing a Disaster Recovery plan withWebSphere Application Server and IBM BPM. Topics will include use ofmultiple data centers, geographic separation constraints, supporting softwarecomponents, disaster recovery and other common deployment issues. Thoughfocused primarily on WebSphere Application Server and IBM BPM this sessionalso applies to IBM Middleware that is deployed on WebSphere ApplicationServer as well as Pure Application System.
• While not a prerequisite, attendees should be familiar with the materialcovered in " Preparing to Fail, Practical WebSphere Application Server HighAvailability”
Introduction
• Why Are We Here?
• To Avoid This
Agenda
• Concepts
• Disaster Recovery• Multiple Cells and Data Centers
• WebSphere Application Server Recovery
• IBM BPM Recovery
• Final Thoughts
Definitions
• Redundancy• The provision of additional or duplicate systems, equipment, etc., that
function in case an operating part or system fails, as in a spacecraft.
• Isolated• Separated from other persons or things; alone; solitary
• Independent• Not dependent; not depending or contingent upon something else for
existence, operation, etc.
• All of the Above are Fundamental for Effective High Availability andDisaster Recovery
Definitions
High Availability (HA)• Ensuring that the system can continue to process work within one
location after routine single component failures
• Usually we assume a single failure
• Usually the goal is very brief disruptions for only some users forunplanned events
Continuous Operations• Ensuring that the system is never unavailable during planned
activities
• E.g., if the application is upgraded to a new version, we do it in away that avoids all downtime
Continuous Availability (CA)
• High Availability coupled with Continuous Operations
• No tolerance for planned downtime
• Little unplanned downtime as possible
• Very expensive
• Note that while achieving CA almost always requires anaggressive DR plan, they are not the same thing
Definitions
Background: High Availability in one picture
7
IP Sprayer
Node Agt
Node1
Server 1Cluster “A”
Web Server
User Registry
DMgr
IHS
IHS
Server 1
Server 2
Node Agt
Node2
Server 2
Server 2
SharedFilesystem
WAS Txn Logs
Database
Storage (SAN)
Server 2
Cluster “B”
Cluster “C”
• Clustered IP Sprayer and Firewalls (notdepicted)
• Clustered HTTP Servers
• WAS-ND Cell with Clustered ApplicationServers
• User Registry (LDAP) Hardware Clustered withShared Disk
• Database Hardware Clustered With SharedDisk
• JMS Provider (Not Depicted)• WAS Messaging Engine with Shared Disks/DB• External JMS with Hardware Cluster and
Shared Disk
• Transaction Logs on Shared File System
• Clusters of “2” Provide High Availability,
• Don’t Forget “Rule of 3”,
• When Using With Clusters of 2
• An Outage (Planned or Unplanned) ReducesCapacity by 50%
• Is No Longer Fault Tolerant
Disaster Recovery (DR)
• Ensuring that the system can be reconstituted and/or activated at another location andcan process work after an unexpected catastrophic failure at one location
• Often multiple single failures (which are normally handled by high availability techniques)is considered catastrophic
• There may or may not be significant downtime as part of a disaster recovery
• This environment may be substantially smaller than the entire production environment, asonly a subset of production applications demand DR
• Normally based on justifiable business need.
• Recovery Time Objective (RTO)
• Service Recovery with little to no interruption
• Recovery Point Objective (RPO)
• Data Recovery and acceptable data loss
Definitions
• Service Levels (SLAs ) cover many things, our focus is availability aspects
• You need a clear set of requirements that define precisely the availabilityrequirements of the system, taking into account
• Components of the system A system has many pieces and business aspects, how do their requirements differ?
‒ Responsiveness and throughput requirements 100% of requests aren't going to work perfectly 100% of the time
• Degraded services requirements Does everything have to meet the responsiveness requirements ALL the time?
• Dependent system requirements What are the implications if a system on which you depend is down?
• Data Loss• Application Data
• Application State (Is this Critical in a Disaster?)
• Maintenance Change occurs, how does that affect availability?
• Disaster Recovery The unimaginable happens, then what?
Definitions
HA Service Level Example
SLA ExternalCommitment
SLA Internal Target
Service Timeframe 7 x 24 7 x 24
ApplicationProcessing Availability
99.5% per month 99.7% per month
Recovery TimeObjective
4 Hours 1 Hour
Maintenance Window Tue-Thurs3:00 - 6:00 am
Tue-Thurs3:00 - 6:00 am
99.5% = 3.60 HoursDowntime/Month
99.7% = 2.16 HoursDowntime/Month
DR Service Level Example
SLA ExternalCommitment
SLA Internal Target
Recovery TimeObjective
16 Hours 4 Hours
Recovery PointObjective
~ 0 (No Data Loss) ~ 0 (No Data Loss)
Note: This is Recovery of an Entire Data Center with 100’s of Servers,Application, Database, Messaging, etc
Agenda
• Concepts
• Disaster Recovery• Multiple Cells and Data Centers
• WebSphere Application Server Recovery
• IBM BPM Recovery
• Final Thoughts
Stage 0 DR – a sound HA strategy
• HA is cheaper and Less Complex Than DR .
• A Robust HA Solution prevents small failures frombecoming disasters
• Don’t let a (relatively) minor failure become acatastrophe
• Eliminate all single points of failure in your primarydatacenter
• Spread Workloads Across Multiple Servers (andHypervisors ! )
• Add 2nd Production WAS-ND Cell
• Consider DB Replication, DB2 HADR, Oracle RAC inconjunction with hardware clustering
• LDAP/Registry Replication
• Otherwise, an HA event could force you to enactyour DR procedure
• Database is only replicated HA in Different DataCenter
13
IP Sprayer
Node Agt
Node1
Server 1Cluster “A”
Web Server
User Registry
DMgr
IHS
IHS
Server 1
Server 2
Node Agt
Node2
Server 2
Server 2
SharedFilesystem
WAS Txn Logs
Database
Storage (SAN)
Server 2
Cluster “B”
Cluster “C”
Multiple Data Center Options (1/4)
• Classic DR
• Active/Passive
• Two Data Centers, one Serving Requests the other Idling
• Independent Cells
• Easier Than Active/Active
• User and Application State Synchronization are Less Critical
• Asynchronous Replication Is Likely Sufficient
• Lower Cost for Network and Hardware Capacity
• From a Capacity perspective One Data Center is Being Underutilized.
• Typically Does Not Incur S/W License Charges When Idle
• If You Don’t Pay for S/W Licenses Is Cost and Underutilization Still a Concern?
• WebSphere License Provides for
o Hot – Processing Requests License Required
o Warm – Started But Not Processing Requests, License Not Required
o Cold – Installed, But Not Started, License Not Required
• DB2 and MQ Require < 100 % of Hot Licenses for Replication
Multiple Data Center Options (2/4)
• “Active/Active” with Single Set of Active Databases
• Two Data Centers
• Independent Application Cells
• Serving Requests for Same Applications
• Database(s) Only Active in One Data Center
– Additional Latency for Application Data Requests from Remote Data Center
– Request Processing Interruption When Data Replica is Promoted to Primary
Multiple Data Center Options (3/4)
• Classic “Active/Active”
• Two Data Centers
• Independent Cells and Synchronized Resource Managers (DBs)
• Serving Requests for Same Applications
• Requires Shared Application Data
o Application Data Consistency is prerequisite to any other planning
o Simultaneous Reads/Writes = Geographic Synchronous Disk Replication
• Additional Hardware and Disk Capacity Required
• e.g. IBM High Availability Geographic Cluster (HAGEO), Sun Cluster Geographic Edition
• Expectation of Continuous Availability and Transparent Failover
o Requires Sharing Application State
• Expectation Seldom Realized
• Outage of One Data Center, Stops Disk Writes in Both, No Longer “Transparent”
• Synchronous Disk Replication Limits Geographic Separation
• Hardest and Costliest to Achieve
Note: Disk Replication only employed for Application Data and Application State, WAS-ND cell configuration, software
updates, and application maintenance should maintained independently in order to insure isolation (and availability)
Multiple Data Center Options (4/4)
• Hybrid “Active/Active” (Partitioned by Applications)
• Two Data Centers• Independent Cells with replicated Resource Managers
• Both DC’s Serving Requests, Both DC’s Configured for All Applications• Running Different Applications (With Different Application Data)
– New Application Tests
– One DC Performing Updates, One DC Performing Inquiry Only (e.g. datawarehouse)
• No Shared Application State, No Shared Application Data– Asynchronous Replication Sufficient
• Global Network Switch Used to Partition/Distribute Traffic
• In the Event of a Disaster• Users failover from one DC to the other
• Likely Some Interruption– As Data Replica is Promoted to Primary
– During Failover Workload Startup
• Provides Most of the Benefits of “Classic Active/Active” without the Cost andComplexity
The CAP Theorem
• In a distributed environment, especially spanning data centersacross LANs and WANs there are three core requirements for aservice:
• Consistency– Either the service works or fails– Traditional ACID of databases provides consistency and isolation
• Availability– Extremely important in web business model– In a large distributed system, one may have to compromise with
consistency for the sake of availability• Partition Tolerance
– Network partition will happen when not all machines are connected– “No set of failures less than the total network failure is allowed to
cause the system to respond incorrectly” – Seth and Lynch– Quorum is used to guard against split brain syndrome
• Brewer’s CAP conjecture states that• One can achieve only two not all three of the above mentioned
requirements
http://en.wikipedia.org/wiki/CAP_theorem
18
Multiple Active Data Centers and the CAP Theorem
• Active/Active requires you to sacrifice either consistency, availability orpartition tolerance.
• All three aren’t possible
• If you choose full availability, then you are going to lose guaranteedconsistency.
• So you need to design with this in mind, and build in mechanisms(typically involving queuing technologies) that enable your system to"tend towards“ consistency.
• Your data is going to be in two places, either partitioned or replicated.
• If the former, what happens when one site is down?
• If the latter, what happens when users hitting each site see slightlydifferent versions of the current state?
• These are very complex problems.
• Which is why I try to steer customers away from active/active and intoan active/passive model with DR from active to passive.
• But they always feel like they are wasting hardware……………!
19
Data Center Utilization Urban Legends
• Legend
• Active/Active Improves Utilization
• Reality
• An Active/Active Topology at 40-50% Utilization in Each DC IsEquivalent to An Active/Passive Datacenter Deployment with OneActive at 80% to 90 % Utilization and the Other Passive
• Running Active/Active at Greater Than 50% Of Total (bothDatacenters) Capacity Can Often Result in a Complete Loss of ServiceWhen a Data Center Outage Occurs
o Insufficient Capacity in Remaining Data Center to Handle > 100% CapacityResults in
• Poor Response Time (at best)
• Network and Server Overload, Resulting in a Complete Crash
Active/Active - What’s Wrong With This Picture ?
A former employer of mine had two data centers, running active/active
at two facilities approximately 2.6 miles (or 4.2 KM) apart.
• Close Proximity Addressed Data Consistency Concerns ……..But………
What Happens When ?
• There’s an earthquake
• There’s a Civil Insurrection
• A Hazardous Chemical Spill Occurs• And The Wind Is Blowing the Chemical Cloud from West to East (or vice versa)
• Your DC May Not Be Located in a Locale Prone to Earthquakes• But what about the other catastrophes ???• They can, *and* will happen !!
• There’s No Substitute For Isolation Between Data Centers
• Data Centers Should Be Sufficiently Distant So That a Single Event Doesn’tImpact Both !!
• This Likely Mandates Asynchronous Replication• Active/Active No Longer Practical
Network Latency and Application Data Consistency – A 3rd
Party Perspective
• Since the latency or round trip time for a network is usually correlated to thelength of the network, or the physical distance between the two end points (inthis case the primary and standby), Maximum Protection and MaximumAvailability modes are not recommended for Data Guard deployments over aWide Area Network (WAN). Note that this recommendation is driven by thelaws of physics (speed of light limitation) - the greater the distance of anetwork, the longer it will take for data packets to traverse the network, andhence the longer it will take for primary database transactions to commit.
• http://www.oracle.com/technology/deploy/availability/htdocs/dataguardnetwork.htm
Multiple Cells and Data Centers
• Your Network Team Assures You That Can (or Have) Constructed a Network LinkBetween Data Centers• For Arguments Sake, We’ll I Agree, It Is possible to construct a network so that latency is
NOT an issue Under Normal Conditions• Even so, WANs are Less Reliable than LANs.
o And Much Harder To Fix !
• But You’re Missing The Point !• Network Interdependency Between Data Centers Means That the Data Centers are Not
Independent
• Question• Do You Want to Have to Explain to Your CIO Why A Problem In One Data
Center Impacted The Other and Resulted in a Outage Because You Didn’t
Have Cells Aligned to Data Center Boundaries ?
How Do I Recover WebSphere Application Server ?
• File System or OS Backup and Recovery• Disk or Tape• WAS backUpConfig/restoreConfig
– WAS_PROFILE/properties, ../etc, WAS_ROOT/java/jre/lib/*properties,WAS_ROOT/java/jre/lib/security,
• Build from Scratcho Only a Realistic Option with Complete Set of Scripts and Rigorous Change Control
• Best Options• File System/OS Backup & Recovery• backupConfig/restoreConfig for Deployment Manager
– WAS V8.0 and above, addNode –asExistingNode can Reconfigure Each Note AfterrestoreConfig of Deployment Manager Configuration
• Both From Last Know Working Production Configuration• Otherwise No Assurance Recovery Will Succeed• Same Concern with “Build From Scratch”‒ If Using Virtualization Consider VM Cloning for Install and Configuration‒ Consider Smart Cloud Orchestrator for Automated Install and Configuration
o Provision Both Primary Site and DR Site in a Consistent Manner with SCOo Note: Don’t Deploy to Backup to DR Site over a WAN !
• May Need to Change Cell and Host Names• Will the Original Data Center Be Restored, Or Is it Gone (for Good)?
WAS Full Profile DR Recovery
• Transaction Recovery on Separate (Physical) Server
• Access the Transaction Logs
– Move/Mount the Transaction Logs to Physical Server Hosting Application Server withAccess to Same Resources (e.g. JDBC, JMS)
– V8.x Optional Use of DB for Transaction logs
• If Recovery Occurs in Different Cell use wsadmin to Configure the Same JAASAlias for Accessing XA Resources
– With adminconsole the node name gets prefixed to the alias.
• No Longer Required to Have Same Hostname and IP Address
– Different IP’s with Multiple Host Alias’s Typicalhttp://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tjta_mvelog.html
• WAS Messaging Engine
• recoverMEConfig AdminTask Command Retrieves MEUUID From PersistentMessage Store and Updates Message Engine Configuration
• Allows Recovery of Stranded Messages After Catastrophic ME Failure in WASV8.5.0 and above
• http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/rjk_recoverme_config.html
• Some Small Capacity in DR Site Needs to be set Aside for Recovery
• In Addition to Production Workload
• Or Production Workload Not Processed Until Recovery is Complete
WAS Liberty Profile DR Recovery
• Transaction Recovery on Separate (Physical) Server
• Access the Transaction Logs
– Move/Mount the Transaction Logs to Physical Server Hosting Application Server withAccess to Same Resources (e.g. JDBC, JMS)
– V8.x Optional Use of DB for Transaction logs (Not supported for Production)
• Restore server.xml Backup (e.g. zip)
• No Longer Required to Have Same Hostname and IP Address
– Different IP’s with Multiple Host Alias’s Typical
• WAS Messaging Engine
• Restore server.xml Backup (e.g zip)
• Point to Copy of Messages on File System (Could also employ zip/unzip)
• Some Small Capacity in DR Site Needs to be set Aside for Recovery
• In Addition to Production Workload
• Or Production Workload Not Processed Until Recovery is Complete
Classic DR for Stateful Applications: full cell replication
2828
IP Sprayer
Node Agt
Node1
Msg.mem1Messaging
Web ServerUser Registry
DMgr
AppTarget
Support
DMgr
IHS
IHS
App.mem1
Sup.mem1
Node Agt
Node2
App.mem2
Sup.mem2
Filesystem (NFS)
Node Agt
Node1
Msg.mem1
DMgr
IHS
IHS
App.mem1
Sup.mem1
Node Agt
Node2
App.mem2
Sup.mem2
WAS Txn Logs WAS Txn Logs
Consistency Group
SAN Replication
for Application Data
File Copy
for Install & Config Data
Database
Storage (SAN)
Primary Datacenter Secondary Datacenter
Msg.mem2 Msg.mem2A P A P
Filesystem (NFS)
User RegistryWeb Server
Database
Storage (SAN)
Consistency Group
IP Sprayer
Messaging
AppTarget
Support
DMgr
DR via Stray Nodes & Database Managed replication
2929
IP Sprayer
Node Agt
Node1
Messaging
Web Server
DMgr
AppTarget
Support
IHS
IHS
App.mem1
Sup.mem1
Node Agt
Node2
App.mem2
Sup.mem2
IP Sprayer
Node3
DMgr
IHS
IHS
App.mem3
Sup.mem3
Node4
App.mem4
Sup.mem4
WAS Txn Logs WAS Txn Logs
User Registry
Primary Datacenter Secondary Datacenter
Database Database
DB-managedReplication for
Application Data
Msg.mem1 Msg.mem2A P
Msg.mem3 Msg.mem4A P
Node AgtNode Agt
User Registry
Web Server
IBM BPM: Licensing Guidance for HA/DR Configurations
Configuration
What’s active?
What licenses are needed for the backup nodes?DB2
WASND
BPM
“Classic” Disaster Recovery(DR): SAN-based replication
off off off• Files in the backup data center are being synchronized automatically by aSAN. But there is no DB2, WAS ND, or BPM program active.• No extra DB2, WAS ND, or BPM licenses needed for backup nodes.
DR configuration with OSreplication for config data & DB2
replication for runtime dataON off off
• Active DB2 HADR Standby setup considered warm standby – licenses for100 DB2 PVUs required to cover warm standby servers• BPM and WAS ND are inactive – no extra WAS ND or BPM licensesneeded for backup nodes.
DR configuration with WAS NDreplication for config data &
DB2 replication for runtime dataON ON off
• Active DB2 HADR Standby setup considered warm standby – licenses for100 DB2 PVUs required to cover warm standby servers• WebSphere process in the backup nodes used for synchronization – WASND licenses are required.• BPM is inactive* – no extra BPM licenses needed for backup nodes.
High Availability (HA) ON ON ON• IBM BPM is active in all nodes – full BPM licenses are required.• WAS ND and DB2 licensing based on Supporting Programs terms of BPM
Note: For any HA or DR configuration, any node(s) running DMGR only will also require a WAS ND license.(*Inactive: BPM server JVMs in the remote datacenter are not started)
REFERENCES• IBM Program License Agreement licensing for Backup Use: This document explains that extra licenses are not required for cold or warm backupnodes. However, if there is a program actively “doing work” to keep the backup node synchronized with the primary site, then that program mustbe licensed. E.g., when the DB2 and WAS ND node agents are actively “doing work” (replication), they must be licensed, but if IBM BPMservices are not active, no BPM licenses are required in backup nodes.
• DeveloperWorks article on “Stray Node” DR Configuration: This article describes a “better” Stray Node DR configuration. It is different from aClassic DR configuration in that it keeps the WAS ND environment up-to-date as well as the DB2 environment, in order to reduce the serverrecovery time after a disaster. In this “better” Stray Node DR configuration, WAS ND node agents are active, but IBM BPM is not active.
•DeveloperWorks article on licensing DB2 10.1 servers in a HA environment
What Your Mother Didn’t Tell YouAbout Disaster Recovery
• Transaction Log replication & Network configuration requirements
• Historically, the WebSphere transaction service required IP Address & Hostname at the target server match thesource
• This is because, for some types of transactions, the server network information is written into the logsthemselves
• As of WAS v8 servers, this requirement is relaxed a bit – IP addresses no longer need to match
• High Availability and Disaster Recovery for the Deployment Manager
• Techniques are available that leverage hardware clustering or replication of the DMgr’s cell configuration to analternate server. These are described in detail at:
https://www.ibm.com/developerworks/websphere/techjournal/1001_webcon/1001_webcon.html
• Beginning in version 8.5.5 WebSphere supports High Availability features for the Deployment Manager, using ashared filesystem. WAS installations (including BPM) running on applicable versions can leverage this feature:
http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.doc/ae/twve_xdsoconfig.html
• Because these HA techniques rely on DMgr replication, they apply unchanged to DR scenarios
– Generally, we recommend recovering application servers before bringing up a replacement DMgr
• A note about logical corruption – data integrity problems that get replicated to the DR environment
• We recommend using storage system tooling (for example, FlashCopy) to periodically copy the system state
• This can be done at the replica, to avoid interfering with normal operations
• If the Primary data and its replica are both corrupted, then state can be restored to a copy made before thecorruption
• For DR purposes, why can’t I just make one WebSphere ND cell with members running in both of my datacenters? Thatway, if one datacenter is lost, another can carry the load
• See ‘Active/Active Antipattern’ discussion on the following slides
31
Active/Active anti-pattern (Cells Spanning Data Centers)
Secondary DatacenterPrimary Datacenter
32
IP Sprayer
Node Agt
Node1
Msg.mem1Messaging
Web Server
User Registry
DMgr
AppTarget
Support
IHS
IHS
App.mem1
Sup.mem1
Node Agt
Node2
App.mem2
Sup.mem2
Node Agt
Node3
DMgr
App.mem3
Sup.mem3
Node Agt
Node4
App.mem4
Sup.mem4
Msg.mem2 Msg.mem3 Msg.mem4A P P P
IP Sprayer
Web Server
IHS
IHS
Filesystem (NFS)
WAS Logs
Consistency Group Consistency Group
SAN Replication
for Application Data
Database
Storage (SAN) Storage (SAN)
User Registry
Filesystem (NFS)
Database
Why is this type of topology considered an Anti-Pattern?
• Active/Active approaches introduce new complexities that undermine the stability of the system
• Issues/Problems Can Propagate From One DC to the Other
• This Compromises Redundancy and Resiliency
• Worst Case a Outage Cascades Across Both Data Centers
• Frequently these negate the advantages that led the customer to consider the approach in the first place• Increased risk of network instability can lead to partitioned network (‘split brain’)
– Independent Transaction “Recovery” in Both Data Centers By HA Manager– The two data centers could move to inconsistent transactional states!!
• Increased network latency can limit system performance during normal operations– Latency between the Application Server and its databases– Latency among cluster members communicating via the WebSphere HA Manager
component• Desire to automate failover increases risk of false failover & rapid cycling• A system more than 50% utilized introduces the risk that losing a single component will
compromise the entire system, turning what could have been a (simple) HA event into a truedisaster
• In practice, many Active/Active topologies do not deliver Disaster Recovery capability at all:• Attempts to limit latency lead to datacenters physically near each other, increasing the risk that
a single disaster will eliminate the entire system• Many disasters arise from human error and data corruption. Tight coupling between DR
resources does not provide protection from this type of failure at all• A WAS-ND Cell Spanning Data Centers will actually interfere with Zero RTO
• Refer to
• http://www.ibm.com/developerworks/websphere/techjournal/0606_col_alcott/0606_col_alcott.html#sec1d
• http://www.ibm.com/developerworks/websphere/techjournal/1004_webcon/1004_webcon.html
33
What are the recommended alternatives?
• Properly plan a High Availability solution distinct from Disaster Recovery
• Eliminate single points of failure through redundancy in network and software components• HA features allow rapid and automatic recovery from loss of a single component. Utilize them!
• Improve RTO by reducing complexity, scripting operational procedures and drill
• Automate Processes for Repeatability and Consistency
– Scripting
– Point and Click” is Not Repeatable
• Discipline and Practice are Essential
• Well Defined Procedures for Every Contingency
– You Do Not Want to Learn During an Outage
– Practice Those Procedures
– Won’t Make Mistakes in Crisis
– Validates that Procedures Actually Work
– Practice Backup and Recovery, System Failures, Disaster Recovery, etc.• Goal: Make Daily Operations Boring
• Improve electricity distribution via Uninterruptable Power Supply
• Utilize application design patterns like loose coupling in order to improve application flexibility
• In cases where RTO between 1 and 4 hours is necessary, without the requirement to process new work,consider the Stray Node pattern
34
Is This Different for the Liberty Profile and Liberty Collectives?
• No• Same Fundamentals for Effective Redundancy and the
Requirements for Isolation and Independence Apply
• Though Liberty May Make It Easier to Ignore or Believe Thatthe Fundamentals Don’t Apply
35
Agenda
• Concepts
• Disaster Recovery• Multiple Cells and Data Centers
• WebSphere Application Server Recovery
• IBM BPM Recovery
• Final Thoughts
Disaster Recovery
• Develop a Disaster Recovery Plan
• Group Business Needs and Associated Applications into Tiers
• Group into tiers based on the hard/soft dollar impact on theorganization
• Categorize by RPO and RTO.
• The top tier likely includes zero data loss and either no downtime orperhaps just a few minutes of down time
• Subsequent tiers have an RTO of 24 hours, then 48 to 72 hours,then perhaps 72 to 96 hour
• Essential Part of Any Plan• Who approves DR move/recovery ?
• Automated site failover is a bad ideao Typically triggering DR is very expensive
o You do not want to trigger a DR by accident because of some transient issue – just makesthe situation worse
Disaster Recovery Objectives
• Recovery Time Objective
• How quickly the system will be able to accept traffic after the disaster
• Shorter times require progressively more expensive techniques
o e.g., a tape backup and restore is relatively inexpensive
o e.g., a fully redundant fully operational data center is very expensive
• One challenge is detection time
• It takes time to determine you are in a disaster state and triggerdisaster procedures
o While you are deciding if you are down, you are probably missing your SLA.
o Does the RTO include detection time?
Disaster Recovery Objectives
• Recovery Point Objective
• How much data you are willing to lose when there is a disaster
• Limiting data loss raises costs
o e.g., restoring from tape is relatively inexpensive but you'll lose everythingsince the last backup
o e.g., asynchronous replication of data and system state requires significantnetwork bandwidth to prevent falling far behind
o e.g., synchronous replication to the backup data center guarantees no dataloss but requires VERY fast and reliable network and will significantly harmperformance
• Warning: results in increased latency which means capacity must be increased atall layers
Disaster Recovery Objectives
• Most RTO and RPO goals will deeply impact application and infrastructurearchitecture and can't be done “after the fact”
• e.g., if data is shared across data centers, your database and applicationdesign will have to be careful to avoid conflicting database updates and/ortolerate them
• e.g., application upgrades have to account for multiple versions of theapplication running at once which can affect user interface design, databaselayout, etc
• Extreme RTO and RPO goals tend to conflict
• e.g., using synchronous disk replication of data gives you a zero RPO but thatmeans the second system can't be operational, which raises RTO
• Trying to Achieve a Zero RTO *and* a Zero RPO is Mutually Exclusive
Disaster Recovery Testing
• The DR hardware Should Be Put Into Actual Production Usage
• Otherwise How Can You Be Sure It Will Work When You REALLY Need It.
• A Corollary of Murphy’s Law
• The larger the numbers, the less likely all Tier 1 machines can be successfullyrestored to operations.
• DR Testing Options
• “Saturday Afternoon Surprise”
– Unannounced DR Test
– Only If You Can Tolerate an Outage
• Progressively More Realistic and Complex Tests
– Startup of Remote Infrastructure
– Remote Startup with Simulated Workload
– Remote Startup with Production Workload Shift
• Other Issues In a Real Disaster
• Will your key staff want to travel?
• Will they be able to travel?
41
Example and (Very High Level) DR Plan
• Executive/Management Approval for Activation of DR
• Isolate Data Centers
• Halt Incoming Network Traffic
– Static “Temporarily Unavailable Web Page”
• Break Disk Synchronization
• Sever Network Links Between Data Centers
• Start and Recovery of Surviving Center
• Restore/Recovery Hardware and Middleware
• Start DB, Messaging and Application Servers
• Examine DB and Message provider logs for pending transactions
• Recover Pending transactions and messages
• Start Accepting New Work in Surviving Data Center
• Enable Network
42
Other Aspects to Consider
• An HA, CA or DR Deployment Architecture is Not a Product Feature.
• WAS and the WebSphere portfolio products
– Provide HA Features and Function
– Can Be Employed in an HA Architecture
– The Appropriate Environment Varies by Customer
– One Size Does Not Fit All !
• Optimizing WebSphere HA Capabilities into a Robust Deployment
• Requires In-depth Understanding
– Of Environment
– Of Applications
– Of Operational Requirements (Service Levels)
• Architectural Advice May Require ISSW Assistance
43
Learn from Your Mistakes
• Mistakes and failures will occur, learn from them• What separates mediocre organizations from the good and great isn't so
much perfection as it is the constant striving to get better – to not repeatmistakes
• After every outage perform• Root cause analysis
– Capture diagnostic information
– Meet as a team including all key players to discuss
– Determine precisely what went wrong• Wrong doesn't mean “Bob made an error.”
• Find the process flaw that led to the problem
– Determine a corrective action that will prevent this from happening again• If you can't, determine what diagnostic information is needed next time this happens and ensure it is collected
– Implement that corrective action• All too often this last step isn't done
• Verify that action corrected problem
• A senior manager must own this process
Indispensable When Planning for Catastrophe
• Think !
Questions?
Notices and DisclaimersCopyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced ortransmitted in any form without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract withIBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has beenreviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBMshall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY,EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OFTHIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFITOR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of theagreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal withoutnotice.
Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples arepresented as illustrations of how those customers have used IBM products and the results they may have achieved. Actualperformance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services does not imply that IBM intends to make such products,programs or services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do notnecessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neitherintended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.
It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legalcounsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’sbusiness and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice orrepresent or warrant that its services or products will ensure that the customer is in compliance with any law.
Notices and Disclaimers (con’t)
Information concerning non-IBM products was obtained from the suppliers of those products, their publishedannouncements or other publicly available sources. IBM has not tested those products in connection with thispublication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBMproducts. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.IBM does not warrant the quality of any third-party products, or the ability of any such third-party products tointeroperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED,INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR APARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under anyIBM patents, copyrights, trademarks or other intellectual property right.
• IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise DocumentManagement System™, Global Business Services ®, Global Technology Services ®, Information on Demand,ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™,PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®,urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks ofInternational Business Machines Corporation, registered in many jurisdictions worldwide. Other product andservice names might be trademarks of IBM or other companies. A current list of IBM trademarks is available onthe Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
Thank YouYour Feedback is
Important!
Access the InterConnect 2015Conference CONNECT AttendeePortal to complete your sessionsurveys from your smartphone,
laptop or conference kiosk.
Backup Slides
51
Shameless Self Promotion
IBM WebSphere Deployment and Advanced ConfigurationBy Roland Barcia, Bill Hines, Tom Alcott and Keys BotzumISBN: 0131468626
52
Another Recommended Book
IBM WebSphere v5.0 System AdministrationBy Leigh Williamson, Lavena Chan,Roger Cundiff, Shawn Lauzon and ChristopherC. Mitchell
ISBN: 0131446045
Licensing Servers as Back Up Servers
From IBM Contracts and Practices Database• The policy is to Charge for HOT, and not for WARM or COLD back ups. The following are definitions of what constitutes HOT-
WARM-COLD backups:
• All programs running in backup mode must be under the customer's control, even if running at another enterprise's location.
• COLD - a copy of the program may be stored for backup purpose machine as long as the program has not been started.
• There is no charge for this copy.
• WARM - a copy of the program may reside for backup purposes on a machine and is started, but is "idling", and is not doing anywork of any kind.
• There is no charge for this copy.
• HOT - a copy of the program may reside for backup purposes on a machine, is started and is doing work. However, thisprogram must be ordered.
• There is a charge for this copy.
• "Doing Work", includes, for example, production, development, program maintenance, and testing. It also could include otheractivities such as mirroring of transactions, updating of files, synchronization of programs, data or other resources (e.g. activelinking with another machine, program, data base or other resource, etc.) or any activity or configurability that would allow anactive hot-switch or other synchronized switch-over between programs, data bases, or other resources to occur
Refer to http://www-03.ibm.com/software/sla/sladb.nsf/pdf/policies/$file/Feb-2003-IPLA-backup.pdf for more information
53