
Failover Scenarios in a BPM HA Environment


Failover Scenarios in a Highly Available WebSphere Business Process Management V7 Production Environment

June 2010

© IBM Corporation, 2010


Disclaimer

This document is subject to change without notice and does not comprehensively cover issues encountered in any customer situation.

The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS.

For updates or newer releases, please contact the service team.

The Team of Authors

This document was written by the WebSphere Business Process Management (BPM) test team in Böblingen. The authors are:

Marco Lezajic, IT Specialist, IBM Software Group, Application and Integration Middleware Software, WebSphere BPM Test

Ekkehard Voesch, Test Architect, IBM Software Group, Application and Integration Middleware Software, WebSphere BPM Test

Eduard Welte, Advisory IT Architect / Managing Consultant, IBM Software Group, Application and Integration Middleware Software

We would like to thank the WebSphere Business Process Management test team for their contributions to this document.


Summary

In case of an unexpected breakdown of one or more components within a complex, highly available WebSphere Business Process Management (hereafter called BPM) production environment, problem determination can become a challenging task. Inconsistent application states due to locked database tables and rows, unfinished transactions, or discarded messages might be the result. Identifying an adequate recovery approach and solving the problems that occurred can be very time consuming, often leaving the administrator searching for a needle in a haystack. Based on a highly available BPM production environment, this paper discusses three common disaster failover scenarios. Each scenario exemplifies the impact an unexpected shutdown of a specific part of the overall environment might have on the components executing at the time of the failure. Based on the case studies and the implications outlined, you will learn about concrete configuration settings that help recover a consistent application state.

Readers should have a good understanding of WebSphere Process Server V7.0.0.2 (hereafter called Process Server) and WebSphere Business Monitor V7.0.0.2 (hereafter called Business Monitor) clustering. In addition, basic knowledge of WebSphere MQ V7.0.1.0 (hereafter called MQ) and Oracle Real Application Cluster V11.1.0.7 (hereafter called Oracle RAC) is required.

The first part of this document introduces the architecture of the BPM production system on which the failover observations are based. In this context the two common approaches used to integrate Process Server and Business Monitor, "Queue based" and "Queue bypass", are explained. In addition you will learn about the benefits of utilizing MQ to establish a scalable and robust messaging layer between Process Server and Business Monitor.

The second part shows how Oracle RAC is adopted to facilitate a highly available event flow. In this context you will learn about several Oracle RAC concepts, such as defining Oracle services, Oracle load balancing and failover, and Oracle distributed transaction processing (hereafter called DTP).


The third part illustrates three typical failover scenarios based on the described architecture and a simple business activity monitoring (hereafter called BAM) application:

Scenario 1: Messaging Engine Failover
The first scenario shows the implications in case one or more active messaging engines (hereafter called MEs) break down unexpectedly during event processing.

Scenario 2: Business Monitor Failover
The second scenario shows the implications in case an active Business Monitor instance breaks down unexpectedly during event processing.

Scenario 3: Oracle RAC Failover
The third scenario shows the implications in case an active Oracle RAC instance breaks down unexpectedly during event processing.

For each case study several questions will be answered: What are the implications of the breakdown? Is the system/application in an inconsistent state after the failover? If yes, what caused the inconsistencies? What needs to be done to recover a consistent system/application state?

The fourth part presents a list of recommended configuration settings that are required to ensure that a BPM production environment is properly configured regarding high availability and recovery. In addition, some general hints and tips are given to narrow down, understand, and solve issues similar to those outlined in this paper.


Additional document sources of interest

IBM WebSphere Business Process Management Version 7.0 information center

http://publib.boulder.ibm.com/infocenter/dmndhelp/v7r0mx/index.jsp?topic=/com.ibm.websphere.wps.doc/welcome_wps.html

IBM WebSphere MQ Version 7.0 information center

http://publib.boulder.ibm.com/infocenter/wmqv7/v7r0/index.jsp

IBM Redbooks

http://www.redbooks.ibm.com/

There are numerous publications available for WebSphere Business Process Management, of which the following are of interest:

• WebSphere Application Server V7 Administration and Configuration Guide (SG24-7615)

• WebSphere Application Server V6.1: System Management and Configuration (SG24-7304)

• WebSphere Application Server V6 System Management & Configuration Handbook (SG24-6451)

• WebSphere Application Server Network Deployment V6: High Availability Solutions (SG24-6688)

• WebSphere Business Process Management V7.0 Production Topologies (SG24-7854)

Other published white papers

Configuring WebSphere Process Server Version 7.0 for a Clustered Environment

http://www.ibm.com/support/docview.wss?uid=swg27018621

WebSphere Process Server Version 7.0 with Oracle 10g & 11g Configuring for XA Recovery

http://www.ibm.com/support/docview.wss?uid=swg27018900


Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries.

A current list of IBM trademarks is available on the Web at

http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

AIX® AlphaBlox® alphaWorks® DataPower® DB2 Universal Database™ DB2® developerWorks® HACMP™ IBM®

IMS™ Lotus® Redbooks® Redbooks(logo) ® Tivoli® WebSphere® Workplace™ Workplace Messaging® z/OS®


The following terms are trademarks of other companies:

Snapshot, and the NetApp logo are trademarks or registered trademarks of NetApp, Inc. in the U.S. and other countries.

SUSE, the Novell logo, and the N logo are registered trademarks of Novell, Inc. in the United States and other countries.

Oracle, JD Edwards, PeopleSoft, Siebel, and TopLink are registered trademarks of Oracle Corporation and/or its affiliates.

SAP NetWeaver, SAP, and SAP logos are trademarks or registered trademarks of SAP AG in Germany and in several other countries.

EJB, Enterprise JavaBeans, J2EE, Java, JavaBeans, JavaServer, JDBC, JMX, JSP, JVM, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.


How to read this document

The intention of this document is to provide information on failover scenarios in a highly available WPS production environment. For a better understanding, further documentation might be referenced.

This document contains numerous illustrations formatted as follows.

• Interaction via the console
Mostly two types are of interest: one for the input requested in a console window (sometimes called command line input) and a second for the output provided on the console.

Console input (sometimes called command line input)
Console output
This might be several rows
In case of large lists important aspects are highlighted in this way

• Listings
Whenever a list of parameters needs to be discussed it will look like this. File content is also formatted in this way. In case of large lists highlighting might be added to put the focus on the major aspects.

Parameter1 = value1
Parameter2 = value2
Parameter3 = string1
etc.

• Hyperlinks
To make it easier to find references, hyperlinks are used and formatted like the following link, which leads to the IBM Redbooks homepage:

http://www.redbooks.ibm.com/

• Notices
To emphasize information, two types of formatted notices are used:

Standard notice – typically used

Important notice – used in special cases


Table of Contents

Chapter 1 BPM V7 production environment
Chapter 2 Establishing an event flow
2.1 Queue based event flow
2.2 Queue bypass event flow
2.3 Establishing an event flow using MQ
Chapter 3 Queue Bypass using a highly available database
3.1 Host selection, Oracle RAC instances and Distributed Transaction Processing
3.2 Oracle RAC Load Balancing
3.3 Oracle RAC Failover
Chapter 4 Failover Scenarios
4.1 Scenario 1: Messaging Engine Failover
4.2 Scenario 2: Business Monitor Failover
4.3 Scenario 3: Oracle RAC Failover
Chapter 5 High availability check list
Chapter 6 Conclusion


Chapter 1 BPM V7 production environment

The BPM V7 production environment described in this paper consists of a Process Server and a Business Monitor golden topology setup. Figure 1 shows a schematic overview of the environment.

Figure 1. BPM V7 production environment

In a production environment the Process Server and Business Monitor infrastructures are usually split up into two separate cells (cross-cell setup). The major advantage of this approach is that the two products/cells can be operated at different versions and/or fix pack levels. For instance, there might be situations where a certain version level must be applied to the Process Server cell only (leaving the Business Monitor as it is). If everything were in one cell, both Process Server and Business Monitor would have to be updated or migrated at once.

With the cross-cell approach each cell typically consists of two or more custom nodes and a deployment manager. Both setups have been created using the Deployment Environment approach, with which golden topology setups (remote messaging and remote support) can be created quickly based on a set of basic information provided by the administrator. As a result, all required artifacts (e.g. servers, clusters, data sources, JMS resources, J2C authentication aliases, etc.) are created automatically by the runtime. After successful creation the following Process Server and Business Monitor clusters exist:


Process Server Cell - Application Target Cluster
Hosts the Business Process Choreographer (hereafter called BPC) components, that is the Business Flow Manager (hereafter called BFM), the Human Task Manager (hereafter called HTM), and Business Process Execution Language (hereafter called BPEL) applications.

Process Server Cell - Support Cluster
Hosts the Common Event Infrastructure (hereafter called CEI), the Business Process Choreographer Explorer (hereafter called BPC Explorer), and the Business Space.

Process Server Cell - Messaging Cluster
Hosts the messaging engines (BPC, CEI, SCA.SYS and SCA.APP). The messaging engines (hereafter called MEs) are operated with a 1-of-N policy; this means that a component's messaging engine is active on one cluster member only.

Business Monitor Cell - Application Target Cluster
Hosts the monitor model application.

Business Monitor Cell - Support Cluster
Hosts the CEI, the action services, and the monitor scheduled services.

Business Monitor Cell - Messaging Cluster
Hosts the MEs (Monitor and CEI). The MEs are operated with a 1-of-N policy; this means that a component's messaging engine is active on one cluster member only.

Business Monitor Cell - Web Application Cluster
Hosts the Business Monitor REST services, the Business Space, and IBM DB2 Alphablox.


Chapter 2 Establishing an event flow

This paper shows how to establish an event flow within a cross-cell environment using MQ. Prior to introducing MQ as messaging provider, the two common means of event transmission ("Queue based" and "Queue bypass") are outlined.

2.1 Queue based event flow

The queue based event flow makes use of the Java Message Service (JMS) and the Service Integration Bus (hereafter called SI Bus). In order to use this approach the configuration of the Process Server cell needs to be extended as follows:

A dedicated monitor bus has to be created within the Process Server cell.

An SI Bus Link has to be established between the monitor bus within the Process Server cell and the monitor bus within the Business Monitor cell.

Figure 2 explains the queue based event flow briefly.

Figure 2. Queue based event flow

1. The Event Producer (e.g. a BPEL process) emits Common Base Events to the local CEI using an event emitter factory. The events are put into the monitor model's event group.

2. The event group routes the events to a destination on the monitor bus within the Process Server cell.

3. The SI Bus Link forwards the events from the destination on the monitor bus within the Process Server cell to a destination on the monitor bus within the Business Monitor cell.

4. The monitor model consumes and processes the events from the destination on the monitor bus within the Business Monitor cell.


2.2 Queue bypass event flow

The queue bypass approach avoids the SI Bus and makes use of a database instead. In order to use this approach the configuration of the Process Server cell needs to be extended as follows:

A dedicated data source has to be created within the Process Server cell that points to the monitor database.

Business Monitor plugins have to be added to the Process Server installation.

Figure 3 explains the queue bypass event flow briefly.

Figure 3. Queue bypass event flow

1. The Event Producer (e.g. a BPEL process) emits events to the local CEI using an event emitter factory. The events are put into the monitor model's event group.

2. The Business Monitor plugins forward the events to the Business Monitor database using the newly created data source.

3. The monitor model consumes and processes the events from the Business Monitor database.


Both approaches have several implications which might not always be desired.

Monitor Database dependency
When using the queue bypass approach, event emission within the Process Server cell depends on the performance and availability of the Business Monitor database. Problems with the Business Monitor database might negatively affect the behavior of the event emitting application. For instance, if a BPEL process emits events and queue bypass is configured, the event emission might fail if the Business Monitor database is not available.

Tight coupling
With both approaches, queue based and queue bypass, the monitor model is always tightly coupled to a dedicated (remote) CEI. In case the event producer (e.g. a BPEL application) moves to another Process Server environment (e.g. from pre-production to production), the CEI configuration of the monitor model needs to be adjusted accordingly. This requires the CEI distribution mode to be set to inactive.

Additional administrative effort
During deployment of a monitor model, the CEI from which the monitor model will retrieve events has to be defined (in our case the CEI within the Process Server cell). In order to establish a connection with the remote CEI (remote deployment manager) during monitor model deployment, a Lightweight Third Party Authentication (LTPA) token needs to be exchanged between both cells. This applies to both the queue based and the queue bypass approach. In case of the queue based approach, signer certificates need to be exchanged additionally (otherwise the SI Bus link between the monitor bus in the Process Server cell and the monitor bus in the Business Monitor cell can't be established).

How these constraints can be avoided is described in the next section.


2.3 Establishing an event flow using MQ

To avoid the constraints outlined in the previous section, both cells have to be fully decoupled from each other. This can be achieved by using a third party messaging provider like MQ. Figure 4 explains the event flow in detail when MQ is used.

Figure 4. Event flow utilizing MQ

1. The Event Producer (e.g. a BPEL process) emits events to the local CEI event service using an event emitter factory. The emitter factory to be used is specified within the Process Server cell’s Application Target Cluster. The CEI event service is usually deployed within the Process Server cell’s Support Cluster.

Figure 5. Common Event Infrastructure destination


2. The CEI event service puts the events into a dedicated event group (“All Events”). Since all events that are emitted within the Process Server cell are supposed to be gathered by one event group, the filter is defined accordingly.

Figure 6. Process Server cell event group

3. The "All Events" event group forwards the events to a distribution queue which maps to an MQ queue. Both the MQ queue and the MQ queue connection factory have to be defined as JMS resources (jms/tema02Event, jms/MQQcf) within the Process Server cell.
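These JMS resources are typically created through the administrative console. As a rough, hedged sketch only (the resource names jms/tema02Event and jms/MQQcf come from the text, while the scope, queue manager details, and flag names are assumptions that should be checked against the WebSphere V7 wsadmin documentation for the WebSphere MQ messaging provider), the equivalent wsadmin/Jython calls might look like:

```jython
# Hypothetical scope: the cluster where the distribution queue resources live
scope = AdminConfig.getid('/ServerCluster:Support/')

# MQ queue connection factory (queue manager name/host/port are placeholders)
AdminTask.createWMQConnectionFactory(scope, [
    '-name', 'MQQcf', '-jndiName', 'jms/MQQcf', '-type', 'CF',
    '-qmgrName', 'QM1', '-qmgrHostname', 'mqhost.example.com',
    '-qmgrPortNumber', '1414', '-qmgrSvrconnChannel', 'SYSTEM.DEF.SVRCONN'])

# MQ queue definition backing the distribution queue (MQ queue name is a placeholder)
AdminTask.createWMQQueue(scope, [
    '-name', 'tema02Event', '-jndiName', 'jms/tema02Event',
    '-queueName', 'TEMA02.EVENT'])

AdminConfig.save()
```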

Figure 7. Event group distribution queue

4. A message driven bean (“WPSEventEmitter”) that is deployed to the Business Monitor cell’s Support Cluster reads the events from the MQ queue. The MDB can either use listener ports (deprecated in V7) or an activation specification.


5. The “WPSEventEmitter” MDB forwards the events to the local CEI event service using an event emitter factory. The Java implementation of the MDB is shown in Listing 1. The common base event is extracted from the JMS message and sent to the local CEI.

Listing 1. Message driven bean implementation

// Fields such as _emitterFactory, _notificationHelperFactoryObject and
// fMessageDrivenCtx are declared elsewhere in the bean class.
public void ejbCreate() {
    try {
        Context ctx = new InitialContext();
        _emitterFactory = (EmitterFactory) ctx.lookup(
            "com/ibm/events/configuration/emitter/Default");
        _notificationHelperFactoryObject = ctx.lookup(
            "com/ibm/events/NotificationHelperFactory");
        ctx.close();
    } catch (NamingException ne) {
        ne.printStackTrace();
    }
}

public void onMessage(javax.jms.Message msg) {
    Emitter emitter = null;
    try {
        emitter = _emitterFactory.getEmitter();
        NotificationHelperFactory nhFactory = (NotificationHelperFactory)
            PortableRemoteObject.narrow(_notificationHelperFactoryObject,
                NotificationHelperFactory.class);
        NotificationHelper notificationHelper = nhFactory.getNotificationHelper();
        EventNotification notifications[] =
            notificationHelper.getEventNotifications(msg);
        // Extract the Common Base Event from the JMS message ...
        CommonBaseEvent cbe = notifications[0].getEvent();
        // ... and send it to the local CEI
        emitter.sendEvent(cbe, 2, 12);
    } catch (Exception e) {
        e.printStackTrace();
        fMessageDrivenCtx.setRollbackOnly();
    }
}


6. Based on the monitor model’s specific event selector string the Business Monitor cell’s CEI event service puts all events that belong to a certain monitor model into a dedicated event group.

Figure 8. Monitor model specific event group

7. To avoid an additional JMS hop and to improve performance the monitor model is deployed using the queue bypass approach. This means that the monitor plugins read the events from the local CEI (event group).

8. The monitor plugins forward the events to the Business Monitor database.

9. The monitor model consumes and processes the events from the Business Monitor database.


Chapter 3 Queue Bypass using a highly available database

When using the queue bypass approach, a highly available Business Monitor database is crucial to maintain the overall event flow. This section shows how Oracle RAC is adopted in order to eliminate the Business Monitor database as a single point of failure. Listing 2 shows the Oracle RAC connection string of the Business Monitor data source (Monitor_<CellName>_Routing_Database) used by the queue bypass approach.

Listing 2. Oracle data source connection string

jdbc:oracle:thin:@(DESCRIPTION=
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)(HOST=A)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=B)(PORT=1521))
    (LOAD_BALANCE=off)
    (FAILOVER=on))
  (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=monitordb)
    (FAILOVER_MODE=(TYPE=select)(METHOD=basic))))

Let’s break down the sections of the connection string to have a closer look at the configuration of Oracle RAC in conjunction with our event flow.

3.1 Host selection, Oracle RAC instances and Distributed Transaction Processing

The ADDRESS_LIST section specifies a list of hosts whose Oracle RAC instances run Oracle RAC services. In our case two Oracle RAC hosts exist, each hosting one Oracle RAC instance:

Host A (Oracle RAC instance mondb1)

Host B (Oracle RAC instance mondb2)

In this context two types of Oracle RAC instances have to be distinguished: preferred and available instances. Oracle RAC services always run on one or more preferred instances. In case a preferred instance fails (e.g. due to a breakdown of the machine hosting the instance), Oracle will relocate the service to one of the remaining available Oracle RAC instances.


Figure 9. Preferred and available Oracle RAC instances

Figure 9 shows our service's (monitordb) preferred (mondb1) and available (mondb2) instances. The following command can be used to list the Oracle RAC services that are defined within a database (mondb) and their preferred instances:

srvctl status service -d mondb

Service monitordb is running on instance(s) mondb1

The question that might arise at this point is: why not have two preferred Oracle RAC instances? In order to prevent two-phase commit related recovery problems in case of a failover, it is recommended to enable Distributed Transaction Processing (hereafter called DTP) on the Oracle RAC service being used (monitordb). DTP ensures that all branches of a two-phase commit transaction are always served by the same Oracle RAC instance (and thus by the same Oracle RAC service).

Figure 10. Enable DTP for Oracle RAC service

Since DTP is a singleton service it can only be enabled on a service with exactly one preferred instance. This means that one of our two instances has to be defined as preferred (mondb1) and the other one as available (mondb2).
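Figure 10 illustrates enabling DTP for the monitordb service. As a hedged sketch only (the service, database, and instance names are taken from the text, and the exact flags should be verified against the srvctl reference for Oracle 11.1), the corresponding srvctl commands might look like this:

```shell
# Create the service with one preferred (mondb1) and one available (mondb2) instance
srvctl add service -d mondb -s monitordb -r mondb1 -a mondb2

# Flag the service as a distributed transaction processing (DTP) service
srvctl modify service -d mondb -s monitordb -x TRUE

# Start the service and verify on which instance it runs
srvctl start service -d mondb -s monitordb
srvctl status service -d mondb
```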


3.2 Oracle RAC Load Balancing

The LOAD_BALANCE setting specifies how the Oracle JDBC driver obtains physical connections from the Oracle RAC services. If LOAD_BALANCE is set to on, the driver will try to obtain connections from all Oracle RAC services whose Oracle RAC instances (hosts) are contained within the address list. When LOAD_BALANCE is set to off, the driver always obtains connections from the service whose instance host occupies the first entry within the address list. Since in our scenario only one instance is running the service (mondb1, due to the enablement of DTP), LOAD_BALANCE needs to be set to off and the host that corresponds to instance mondb1 (Host A) needs to be set as the first entry within the address list.

3.3 Oracle RAC Failover

When instance mondb1 (and thus service 'monitordb') becomes unavailable and FAILOVER is set to on, the Oracle JDBC driver tries to obtain physical connections from the service located on the remaining (available) instance (mondb2). In this context two additional settings contained within FAILOVER_MODE are relevant:

TYPE=select
Enables failover without losing any read-only work in progress.

METHOD=basic
The connection to the backup (available) instance is made only when the primary connection fails. With METHOD=preconnect a connection is made to the primary and the backup instance at the same time. This provides faster failover; however, additional resources have to be spent to keep a backup connection open all the time.


Chapter 4 Failover Scenarios

Based on the depicted environment, this chapter examines three common failover scenarios and exemplarily describes the steps required for problem determination and successful recovery. In general, three different failover types can be distinguished:

Soft failover
A soft failover takes place when one or more cluster members are shut down in a controlled manner (e.g. using the administrative console). Typically this happens for maintenance reasons. Usually all resources are cleaned up, leaving the system in a defined state (e.g. ongoing transactions are resolved, open file handles are closed, etc.).

Hard failover
A hard failover is typically initiated through an immediate software-related shutdown of a randomly selected set of cluster members (e.g. via kill -9 <process_id>). Usually resources are not cleaned up, leaving the system in an undefined state. In order to recover from a hard failover, additional software-related configuration might be required.

Disaster failover
A disaster failover is initiated through an uncontrolled shutdown of a host including all of its cluster members. It is usually caused by external factors like an unexpected power outage or a hardware-related machine crash. In order to recover from a disaster failover, additional software and operating system (hereafter called OS) related configuration might be required.

Among the three failover types, disaster failover usually requires the most effort regarding problem determination and successful recovery.

In the following, three common disaster failover scenarios are described based on the introduced setup and a simple business activity monitoring (BAM) application. The application consists of a BPEL process deployed to the Application Target Cluster within the Process Server cell and a corresponding monitor model deployed to the Application Target Cluster within the Business Monitor cell. The BPEL process emits Common Base Events that are consumed and processed by the monitor model.


4.1 Scenario 1: Messaging Engine Failover

The first scenario shows the implications in case the machine that runs the active, data store based MEs breaks down unexpectedly. By default a "1-of-N" core group policy is used; that means that an ME is in state "Started" on one cluster member and in state "Joined" on the other cluster members. Listing 3 shows the log of the cluster member hosting the active ME (the BPC ME is used as an example). The ME first acquires a lock on its data store and then starts. Listing 4 shows the log of the cluster member hosting the joined BPC ME.

Listing 3. Active BPC ME in cluster member 1

[BPC.fmtc5184Cell01.Bus:wps.Messaging.000-BPC.fmtc5184Cell01.Bus] CWSIS1537I:
The messaging engine, ME_UUID=87C9826972B66E6C, INC_UUID=757475745D2D91A3,
has acquired an exclusive lock on the data store.
…
[BPC.fmtc5184Cell01.Bus:wps.Messaging.000-BPC.fmtc5184Cell01.Bus] CWSID0016I:
Messaging engine wps.Messaging.000-BPC.fmtc5184Cell01.Bus is in state Started.

Listing 4. Joined BPC ME in cluster member 2

[BPC.fmtc5184Cell01.Bus:wps.Messaging.000-BPC.fmtc5184Cell01.Bus] CWSID0016I:
Messaging engine wps.Messaging.000-BPC.fmtc5184Cell01.Bus is in state Joined.

In order for the remaining cluster members to continue executing applications that make use of messaging (e.g. long running BPEL processes, monitor models that make use of the monitor bus etc.) operational (“Started”) MEs are essential. In case the machine hosting the active (“Started”) ME breaks down, the remaining cluster member will take over the ME by starting the (standby) ME whose state changes from “Joined” to “Started”.

Let's assume the following use case: Machine A, hosting the cluster member that runs the active ("Started") MEs, breaks down unexpectedly due to a power outage. The WebSphere High Availability Manager (hereafter called HA manager) will discover that Machine A and the active MEs are inaccessible. It will then attempt to start the MEs on the remaining cluster member (Machine B). In order for an ME to start successfully on Machine B, it first needs to acquire the lock on its data store:

Listing 5. BPC ME attempting to obtain a lock on the data store

[BPC.fmtc5184Cell01.Bus:wps.Messaging.000-BPC.fmtc5184Cell01.Bus] CWSIS1538I: The messaging engine, ME_UUID=87C9826972B66E6C, INC_UUID=62E562E55DDF0D0A, is attempting to obtain an exclusive lock on the data store


If the configuration of the database host (in our case Oracle) is not prepared to handle this kind of disaster failover, the ME will fail to start:

Listing 6. BPC ME fails to obtain a lock on the data store

[BPC.fmtc5184Cell01.Bus:wps.Messaging.000-BPC.fmtc5184Cell01.Bus] CWSIS1593I: The messaging engine, ME_UUID=87C9826972B66E6C, INC_UUID=62E562E55DDF0D0A, has failed to gain an initial lock on the data store.
…
HMGR0129I: The local member of group IBM_hc=wps.Messaging,WSAF_SIB_BUS=BPC.fmtc5184Cell01.Bus,WSAF_SIB_MESSAGING_ENGINE=wps.Messaging.000-BPC.fmtc5184Cell01.Bus,type=WSAF_SIB has been disabled. The reason is disable called internally, the reason is < Messaging Engine wps.Messaging.000-BPC.fmtc5184Cell01.Bus could not be activated

The problem is that the ME that broke down did not release its lock on the data store due to the undefined shutdown. The unreleased lock in turn prevents the ME that is supposed to take over from starting, as illustrated in Figures 11 and 12.

Figure 11. Exclusive lock on the data store

1. As the BPC ME within cluster member 1 (Machine A) starts it acquires an exclusive lock on the data store (see Listing 3).


Figure 12. Failure gaining lock on the data store

1. Machine A breaks down unexpectedly. The ME data store lock previously acquired by the BPC ME within cluster member 1 is not released due to the undefined shutdown.

2. As the BPC ME within cluster member 2 (Machine B) takes over, it attempts to acquire an exclusive lock on the data store as well. This fails since the ME data store is still locked by the BPC ME previously running within cluster member 1 (see Listing 6).

In order to enable the BPC ME within cluster member 2 to gain a lock on the ME data store, the “old” lock needs to be removed. This can be achieved by establishing an appropriate dead connection detection policy on the Oracle host. A connection is considered dead when the communication between the Oracle host and its clients (here, the machines hosting the MEs) has been idle for a defined period of time. In this context the operating-system-specific TCP keepalive parameters become relevant (this article describes the parameters for the AIX operating system):

tcp_keepidle

Determines the length of inactivity before the first keepalive message is sent by the Oracle host to its clients (default value: 14400 half seconds = 2 hours). To check the current setting on AIX, the following command can be used:

no -o tcp_keepidle

tcp_keepintvl

A connection is considered dead after 8 (tcp_keepcnt) unanswered keepalive messages. tcp_keepintvl specifies the time interval between those messages (default value: 150 half seconds = 75 seconds). To check the current setting on AIX, the following command can be used:

no -o tcp_keepintvl


tcp_keepcnt

Specifies the maximum number of keepalive messages to be sent (default value: 8). To check the current setting on AIX, the following command can be used:

no -o tcp_keepcnt

When a dead connection is detected based on the TCP keepalive parameters, Oracle frees all resources associated with that connection, including stale database locks. Using the default TCP keepalive parameters, the BPC ME within cluster member 2 would take 2 hours + 8 × 75 seconds to gain a lock on the data store. Such a long failover period is not desirable in a production environment, so the values should be decreased. To alter the values permanently (e.g. tcp_keepidle = 40 half seconds = 20 seconds, tcp_keepintvl = 20 half seconds = 10 seconds), the following commands have to be used:

no -p -o tcp_keepidle=40
no -p -o tcp_keepintvl=20
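The failover delay implied by these keepalive parameters can be sketched as a small calculation (an illustration only; it assumes every keepalive probe during the outage goes unanswered, and that AIX expresses tcp_keepidle and tcp_keepintvl in half seconds as stated above):

```python
# Worst-case time for the Oracle host to declare a client connection dead,
# derived from the AIX TCP keepalive parameters described in the text.
def dead_connection_detection_seconds(tcp_keepidle, tcp_keepintvl, tcp_keepcnt):
    """Parameters are given in AIX units: half seconds for the intervals."""
    idle_s = tcp_keepidle / 2                     # inactivity before first probe
    probe_s = tcp_keepcnt * (tcp_keepintvl / 2)   # window of unanswered probes
    return idle_s + probe_s

# AIX defaults: 2 hours + 8 * 75 s = 7800 s, matching the text
print(dead_connection_detection_seconds(14400, 150, 8))  # 7800.0

# Tuned values from the text: 20 s + 8 * 10 s = 100 s
print(dead_connection_detection_seconds(40, 20, 8))      # 100.0
```

This makes the trade-off explicit: lower values shorten the ME failover window but increase keepalive traffic on every client connection.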


4.2 Scenario 2: Business Monitor Failover

The second scenario shows the implications of a disaster failover within the Business Monitor cell.

Figure 13. The event flow and global two-phase commit transactions

Figure 13 shows the cohesion of the event flow with the involved components and global two-phase commit (hereafter called 2PC) transactions.

The Transaction Managers (hereafter called TX managers) within the Process Server cell cluster members control TX 1. TX 1 involves the event producer (e.g. a BPEL process), the CEI, and MQ. The TX managers within the Business Monitor cell cluster members control TX 2 and TX 3. TX 2 involves MQ, the Message Driven Bean (MDB), the CEI, the Monitor Plugins, and one Oracle instance (the monitor routing database). TX 3 involves an Oracle instance and the monitor model.

In case one of those components is not able to complete its portion of work due to an error, the corresponding global 2PC transaction (TX 1, TX 2, or TX 3) is rolled back. If a cluster member fails for whatever reason, its TX manager goes down as well. In that case the TX manager within the remaining cluster member might be able to resume the work based on the transaction and recovery logs of the failed TX manager. However, this requires the transaction and recovery logs to be shared.


Let’s assume the following case: Business Monitor Machine A breaks down unexpectedly due to a power outage during event processing of a deployed monitor model.

Figure 14. Breakdown of Business Monitor Machine A

During the breakdown of Business Monitor Machine A the following can be observed:

The event consumption continues since the components on the remaining Business Monitor Machine B (MDB, CEI, Monitor Plugins) are still operational.

The event processing continues since the monitor model on the remaining Business Monitor Machine B is still operational. In case the monitor model moderator was running on Business Monitor Machine A the HA manager will start it on Business Monitor Machine B.

When all emitted events have been consumed (the MQ queue is empty), it might turn out that not all expected events have been processed by the monitor model. For instance, a monitor model instance might still be open even though its corresponding business process instance has already finished. In order to locate the “lost” events, a closer look at how the TX manager handles 2PC transactions in conjunction with MQ is necessary.


Figure 15. TX Manager and the 2PC protocol

Figure 15 depicts how the TX manager handles 2PC commit transactions in conjunction with MQ.

1. The TX manager within cluster member 1 (Business Monitor Machine A) starts a global 2PC transaction by sending xa_start to each transaction participant. Based on that command MQ creates a local transaction. The state of this local transaction is ACTIVE|ASSOC. This information is not stored in the transaction log.

2. As the global 2PC transaction is about to finish, the TX manager sends xa_end to each transaction participant. MQ changes the state of the local transaction from ACTIVE|ASSOC to ACTIVE|VOTED. This information is not stored in the transaction log either.

3. The TX manager sends xa_prepare to each transaction participant. Each participant sends a response once the transaction is prepared (xa_ok). MQ changes the state of the local transaction from ACTIVE|VOTED to PREPARED|VOTED. This information is stored in the transaction log.

4. Depending on the responses of the transaction participants, the TX manager finally sends xa_commit (every transaction participant is able to commit its local transaction) or xa_rollback (at least one transaction participant is not able to commit its local transaction). Based on the command received from the TX manager, MQ either commits or rolls back its local transaction. This information is stored in the transaction log as well.


In case a TX manager within one cluster member fails, the TX manager of the remaining cluster member is able to resume the work based on the information stored in the transaction log. However, since only information about prepared and committed (or rolled back) transactions can be found in the transaction log, the TX manager is not able to resolve ACTIVE|ASSOC and/or ACTIVE|VOTED local transactions. As a result, events associated with these transactions remain stuck.

Figure 16. TX Manager failover

Figure 16 shows how the remaining TX manager resumes the work of the failed TX manager based on the transaction log.

1. The TX manager within cluster member 1 (Business Monitor Machine A) goes down due to an unexpected breakdown of Business Monitor Machine A.

2. The TX manager within the remaining cluster member 2 (Business Monitor Machine B) takes over. Based on the information stored in the transaction log, it is able to resume transactions in state PREPARED|VOTED (indoubt transactions) previously initiated by the failed TX manager (steps 3 and 4). It is not able to process ACTIVE|ASSOC and/or ACTIVE|VOTED transactions since no information about them is stored in the transaction log.
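To make the distinction concrete, the following sketch models the local-transaction states as a tiny state machine (illustrative Python, not MQ internals; the class and function names are invented for this example). Only states that reach the transaction log are recoverable after a crash:

```python
# States that, per the text, are written to the transaction log.
LOGGED_STATES = {"PREPARED|VOTED", "COMMITTED", "ROLLED_BACK"}

class LocalTransaction:
    def __init__(self, xid):
        self.xid = xid
        self.state = None

    def xa_start(self):
        self.state = "ACTIVE|ASSOC"    # step 1: not logged

    def xa_end(self):
        self.state = "ACTIVE|VOTED"    # step 2: not logged

    def xa_prepare(self):
        self.state = "PREPARED|VOTED"  # step 3: written to the transaction log

    def xa_commit(self):
        self.state = "COMMITTED"       # step 4: logged

def recoverable(transactions):
    """Transactions the surviving TX manager can resolve from the log."""
    return [t for t in transactions if t.state in LOGGED_STATES]

# Three transactions crash at different points of the 2PC protocol:
t1, t2, t3 = LocalTransaction("x1"), LocalTransaction("x2"), LocalTransaction("x3")
t1.xa_start()                                 # stuck in ACTIVE|ASSOC
t2.xa_start(); t2.xa_end()                    # stuck in ACTIVE|VOTED
t3.xa_start(); t3.xa_end(); t3.xa_prepare()   # indoubt, but logged

print([t.xid for t in recoverable([t1, t2, t3])])  # ['x3']
```

Only the prepared transaction survives the crash in a resolvable form; the other two must be cleaned up by the queue manager as described next.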


In order to create a snapshot of all local transactions that currently exist within the system, the following MQ command can be used:

amqldmpa -m <QMGR_NAME> -c A -o 2 -f /<file>

Since the TX manager failed unexpectedly during event processing, it is likely that both ACTIVE|ASSOC and ACTIVE|VOTED local transactions were pending at the time of the breakdown. Listings 7 and 8 show how local transactions of both types are reflected in <file>, which was created using the amqldmpa command.

Listing 7. ACTIVE|ASSOC Transaction

State ACTIVE|ASSOC
XID: formatID 1463898948, gtrid_length 36, bqual_length 54
gtrid [00000128D8C7E054000000010000825FBAF8067D587B5C18828A0318725482E669257105]
bqual [00000128D8C7E054000000010000825FBAF8067D587B5C18828A0318725482E669257105000000010000000000000000000000000001]
TranNum 0.64996
Tranid 2.11.11.62e48
hmtxTranAssn.RequestCount 1
hmtxTranData.RequestCount 16574
hmtxTranData.WaitCount 1
hmtxTranData.DepthHWM 1
AssociateTime 2010-05-27 10:01:37.362

Listing 8. ACTIVE|VOTED Transaction

State ACTIVE|VOTED
XID: formatID 1463898948, gtrid_length 36, bqual_length 54
gtrid [00000128D8C7E045000000010000825EBAF8067D587B5C18828A0318725482E669257105]
bqual [00000128D8C7E045000000010000825EBAF8067D587B5C18828A0318725482E669257105000000010000000000000000000000000001]
TranNum 0.64989
Tranid 2.11.11.58228
hmtxTranAssn.RequestCount 1
hmtxTranData.RequestCount 16855
AssociateTime 2010-05-27 10:01:37.364

In order to recover a consistent state, ACTIVE|ASSOC and ACTIVE|VOTED local transactions have to be rolled back. The rollback releases the events associated with the transactions and enables them for consumption again.


Rollback ACTIVE|ASSOC transactions

Local transactions in state ACTIVE|ASSOC remain within the system until the MQ queue manager detects that the TX manager has died. This detection is based on the heartbeat interval property (HBINT) of the server connection channel being used (e.g. SYSTEM.DEF.SVRCONN). Once the MQ queue manager detects that the TX manager has died, it rolls back all ACTIVE|ASSOC local transactions, thus releasing the associated events. The HBINT property of a channel is set as follows:

ALTER CHANNEL(SYSTEM.DEF.SVRCONN) CHLTYPE(SVRCONN) HBINT(300)

Rollback ACTIVE|VOTED transactions

Local transactions in state ACTIVE|VOTED remain within the system unless the MQ queue manager is explicitly configured to roll them back after a specified amount of time. In this context two environment variables have to be set:

AMQ_TRANSACTION_EXPIRY_RESCAN and AMQ_XA_TRANSACTION_EXPIRY.

AMQ_TRANSACTION_EXPIRY_RESCAN specifies the time interval in milliseconds at which the list of unfinished transactions is scanned. In case an ACTIVE|VOTED transaction is found that is older than AMQ_XA_TRANSACTION_EXPIRY, the transaction is rolled back, thus releasing the associated events.

AMQ_TRANSACTION_EXPIRY_RESCAN and AMQ_XA_TRANSACTION_EXPIRY have to be set as follows:

export AMQ_XA_TRANSACTION_EXPIRY=600000
export AMQ_TRANSACTION_EXPIRY_RESCAN=60000

Note that the MQ queue manager needs to be stopped before setting the environment variables.

Example

HBINT is set to 300 s (5 min). AMQ_XA_TRANSACTION_EXPIRY is set to 600000 ms (10 min). AMQ_TRANSACTION_EXPIRY_RESCAN is set to 60000 ms (1 min). 3,000 messages are put on the MQ queue (by an event emitter, e.g. a BPEL process).

After half of the messages have been processed (~1,500), one of the machines making up the cluster breaks down, leaving several transactions in an unfinished state (ACTIVE|ASSOC, ACTIVE|VOTED).

The remaining cluster member continues processing the remaining messages.


Finally, all 3,000 messages have been consumed (the MQ queue is empty); however, only 2,994 messages have actually been processed by the monitor model.

Issuing amqldmpa -m <QMGR_NAME> -c A -o 2 -f /<file> reveals three local transactions in state ACTIVE|ASSOC and three local transactions in state ACTIVE|VOTED.

The AssociateTime of each unfinished transaction is approximately 3:00 pm.

At ~3:05 pm (HBINT = 5 min) the three local transactions in state ACTIVE|ASSOC are rolled back, and the associated events are re-consumed and re-processed.

At ~3:10 pm (after AMQ_XA_TRANSACTION_EXPIRY = 10 min and AMQ_TRANSACTION_EXPIRY_RESCAN = 1 min) the three local transactions in state ACTIVE|VOTED are rolled back, and the associated events are re-consumed and re-processed, finally resulting in 3,000 messages processed by the monitor model.
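The rollback times in this example can be approximated with a small calculation (a sketch under the simplifying assumption that heartbeat detection and the expiry rescan both start counting at the moment of the crash):

```python
import math

def assoc_rollback_time(hbint_s):
    # ACTIVE|ASSOC: rolled back once the queue manager misses the heartbeat
    return hbint_s

def voted_rollback_time(expiry_ms, rescan_ms):
    # ACTIVE|VOTED: rolled back by the first rescan after the expiry elapses
    expiry_s, rescan_s = expiry_ms / 1000, rescan_ms / 1000
    return math.ceil(expiry_s / rescan_s) * rescan_s

print(assoc_rollback_time(300))            # 300 s  (~5 min after the crash)
print(voted_rollback_time(600000, 60000))  # 600.0 s (~10 min after the crash)
```

With these settings, all stuck events are released within roughly ten minutes of the breakdown, which matches the 3:05 pm and 3:10 pm timestamps in the example.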


4.3 Scenario 3: Oracle RAC Failover

The third scenario shows the implications of a disaster failover within Oracle RAC. Let’s assume that Oracle Machine A breaks down due to a power outage during event processing of a deployed monitor model.

Figure 17. Breakdown of Oracle Machine A

Further assume that Oracle Machine A hosted the Oracle service’s (monitordb) preferred instance (mondb1). In that case the following can be observed:

Event processing and consumption continue once the service (monitordb) is relocated to the available instance (mondb2 on Oracle Machine B).

Due to the abrupt database outage, exceptions are thrown by the Monitor Plugins’ code, which causes TX 2 to roll back.

After the instance on Oracle Machine B (mondb2) has taken over and all emitted events have been consumed (the MQ queue is empty), it might turn out that not all expected events have been processed by the monitor model. In order to locate the lost events, a closer look at TX 2 is necessary.


Figure 18. Rollback of the global 2PC transaction TX2

Figure 18 depicts how TX 2 is rolled back due to the outage of an Oracle instance participating in the global 2PC transaction.

1. The Oracle service’s (monitordb) preferred instance (mondb1) breaks down during event processing.

2. The breakdown of the Oracle service causes the TX manager to roll back the enclosing global 2PC transaction TX 2. The TX manager issues an xa_rollback towards the components that were able to execute their portion of work successfully (MQ, MDB, CEI, Monitor Plugins). Thereupon the corresponding local transactions are rolled back.

Due to the rollback of the local transaction, the MQ queue manager treats the associated message as a backout message (a backout message is a message that has not been able to reach its destination). In this context two MQ settings become relevant:

Backout threshold

The backout threshold specifies how often the resubmission of a backout message is allowed. In case the backout threshold is set to 0 and no backout requeue queue is defined (the default), the message is discarded as soon as it fails to reach its destination for the first time.


Backout requeue queue

The backout requeue queue absorbs backout messages that have exceeded the backout threshold. Both settings can be changed in the Storage section of a queue, as shown in Figure 19.

Figure 19. Backout threshold, backout requeue queue

Given the settings from Figure 19, a message may fail to reach its destination up to 1000 times before being put into the backout requeue queue (BRQQ). In order to avoid losing events, it is recommended to define a backout threshold that is large enough to handle backout messages occurring in the timeframe in which the database is temporarily unavailable (the time between the breakdown of the preferred instance and the takeover by the available instance).

If, in addition, listener ports are used for message consumption, each listener port’s maximum retries setting should be at least equal to (or larger than) the backout threshold setting. Otherwise the listener port could be stopped if the same message is delivered more often than allowed by the listener port’s maximum retries setting (“poisoned message”).
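As a rough sizing aid, the relationship between the outage window and the backout threshold can be expressed as follows (a sketch, not an IBM-documented formula; the function name, the retry interval, and the safety factor are assumptions you would need to measure and choose for your own environment):

```python
def required_backout_threshold(failover_window_s, retry_interval_s, safety_factor=2):
    """Estimate a backout threshold large enough to cover the database outage:
    the number of redelivery attempts expected during the failover window,
    multiplied by a safety factor for headroom."""
    return int((failover_window_s / retry_interval_s) * safety_factor)

# e.g. a ~100 s service relocation with a redelivery attempt roughly every
# second suggests a threshold of at least 200 (well under the 1000 in Figure 19):
print(required_backout_threshold(100, 1))  # 200
```

Per the text, the listener port’s maximum retries setting should then be set to at least this same value so the port is not stopped before the threshold is exhausted.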


Chapter 5 High availability check list

Independently of the setup and the failover scenarios described in this article, the following check list might be helpful to ensure that your system is properly configured regarding high availability and recovery. In addition, some hints and tips are given in order to narrow down, understand, and solve issues similar to those outlined in this article:

Each cluster used in the environment needs to enable “failover of transaction log recovery”.

The transaction log of each cluster member has to be stored on a shared network drive. In case of a failover, the cluster member that takes over needs to be able to access the transaction log (partner log) of the cluster member that went down.

The compensation service of each cluster member needs to be enabled. In addition the recovery log of each cluster member has to be stored on a shared network drive. In case of a failover the cluster member that takes over needs to be able to access the recovery log of the cluster member that went down.


When NFS 3 is used as the shared network drive, automated peer recovery might not work due to unreleased transaction log file locks after an abrupt shutdown of a TX manager. In that case transaction log file locking has to be disabled (note that file locking is enabled by default).

Be aware that the behavior of software in terms of cleaning up unused resources might relate to OS specific configuration settings.

In case of an outage of a component or application, examine the local transactions of each component associated with the one that failed. Verify whether unfinished local transactions exist, and try to resolve them manually if applicable.


Chapter 6 Conclusion

First, you learned about the benefits of decoupling a Process Server from the Business Monitor cell by using MQ. In addition, you learned about basic Oracle RAC concepts that facilitate a highly available event flow (“Queue Bypass”).

Building on that, this paper discussed the implications of three common failover scenarios that might occur in a complex BPM production environment, along with concrete configuration settings that can be applied to resolve system inconsistencies and regain a consistent application state.

Finally, you were presented with a check list of recommended configuration settings to ensure that a BPM production environment is properly configured regarding high availability and recovery.
