Understanding high availability with WebSphere MQ
Mark Hiscock, Software Engineer, IBM Hursley Park Lab, United Kingdom
Simon Gormley, Software Engineer, IBM Hursley Park Lab, United Kingdom
May 11, 2005
Copyright International Business Machines Corporation 2005. All
rights reserved. This whitepaper explains how you can easily
configure and achieve high availability using IBM's enterprise
messaging product, WebSphere MQ V5.3 and later. This paper is
intended for:
o Systems architects who make design and purchase decisions for
the IT infrastructure and may need to broaden their designs to
incorporate HA.
o System administrators who wish to implement and configure HA
for their WebSphere MQ environment.
Table of Contents
1. Introduction
2. High availability
3. Implementing high availability with WebSphere MQ
3.1. General WebSphere MQ recovery techniques
3.2. Standby machine - shared disks
3.2.1. HA clustering software
3.2.2. When to use standby machine - shared disks
3.2.3. When not to use standby machine - shared disks
3.2.4. HA clustering active-standby configuration
3.2.5. HA clustering active-active configuration
3.2.6. HA clustering benefits
3.3. z/OS high availability options
3.3.1. Shared queues (z/OS only)
3.4. WebSphere MQ queue manager clusters
3.4.1. Extending the standby machine - shared disk approach
3.4.2. When to use HA WebSphere MQ queue manager clusters
3.4.3. When not to use HA WebSphere MQ queue manager clusters
3.4.4. Considerations for implementation of HA WebSphere MQ queue manager clusters
3.5. HA capable client applications
3.5.1. When to use HA capable client applications
3.5.2. When not to use HA capable client applications
4. Considerations for WebSphere MQ restart performance
4.1. Long running transactions
4.2. Persistent message use
4.3. Automation
4.4. File systems
5. Comparison of generic versus specific failover technology
6. Conclusion
Appendix A. Available SupportPacs
Resources
About the authors
1. Introduction
With an ever-increasing dependence on IT infrastructure to perform critical business processes, the availability of this infrastructure is becoming more important. The failure of an IT infrastructure results in large financial losses, which increase with the length of the outage [5]. The solution to this problem is careful planning to ensure that the IT system is resilient to any hardware, software, local, or system-wide failure. This capability is termed resilience computing, which addresses the following topics:
o High availability
o Fault tolerance
o Disaster recovery
o Scalability
o Reliability
o Workload balancing and stress
This whitepaper addresses the most fundamental concept of
resilience computing, high availability (HA). That is, "An application environment is highly available if it possesses the ability to recover automatically within a prescribed minimal outage window" [7]. Therefore, an IT infrastructure that recovers from a
software or hardware failure, and continues to process existing and
new requests, is highly available.
2. High availability
The HA nature of an IT system is its ability to withstand software or hardware failures so that it is available as much of the time as possible. Ideally, despite any failure which may occur, this would be 100% of the time. However, there are factors, both planned and unplanned, which prohibit this from being a reality for most production IT infrastructures. These factors lead to the unavailability of the infrastructure, so availability (per year) is measured as the percentage of the year for which the system was available. For example:
Figure 1. "Number of 9s" availability per year
Availability (%)    Downtime per year
99                  3.65 days
99.9                8.76 hours
99.99               52.6 minutes
99.999              5.26 minutes
99.9999             30.00 seconds
Figure 1 shows that a 30-second outage per year is called "Six 9s" availability because of the percentage of the year for which the system was available. Factors that cause a system outage and reduce the number of 9s of up time fall into two categories: those that are planned and those that are unplanned. Planned disruptions are either systems management (upgrading software or applying patches) or data management (backup, retrieval, or reorganization of data). Conversely, unplanned disruptions are system failures (hardware or software failures) or data failures (data loss or corruption). Maximizing the availability of an IT system means minimizing the
impact of these failures on the system. The primary method is the
removal of any single point of failure (SPOF) so that should a
component fail, a redundant or backup component is ready to take
over. Also, to ensure enterprise messaging solutions are made
highly available, the software's state and data must be preserved in
the event of a failure and made available again as soon as
possible. The preservation and restoration of this data removes it
as a single point of failure in the system. Some messaging
solutions remove single points of failure, and make software state
and data available, by using replication technologies. These may be
in the form of asynchronous or synchronous replication of data
between instances of the software in a network. However, these
approaches are not ideal as asynchronous replication can cause
duplicated or lost data and synchronous replication incurs a
significant
performance cost as data is being backed up in real time. It is
for these reasons that WebSphere MQ does not use replication
technologies to achieve high availability. The next section
describes methods for making a WebSphere MQ queue manager highly
available. Each method describes a technique for HA and when you
should and should not consider it as a solution.
3. Implementing high availability with WebSphere MQ
This section
discusses the various methods of implementing high availability in
WebSphere MQ. Examples show when you can or cannot use HA.
Standby machine - shared disks and z/OS high availability options
describe HA techniques for distributed and z/OS queue managers,
respectively.
WebSphere MQ Queue Manager clusters describes a technique
available to queue managers on all platforms.
HA capable client applications describes a client-side technique
applicable on all platforms.
By reading each section, you can select the best HA methodology
for your scenario. This paper uses the following terminology:
Machine: A computer running an operating system.
Queue manager: A WebSphere MQ queue manager that contains queue and log data.
Server: A machine that runs a queue manager and other third-party services.
Private message queues: Queues owned by a particular queue manager and only accessible, via WebSphere MQ applications, when the owning queue manager is running. These queues are to be contrasted with shared message queues (explained below), which are a particular type of queue only available on z/OS.
Shared message queues: Queues that reside in a Coupling Facility and are accessible by a number of queue managers that are part of a Queue Sharing Group. These are only available on z/OS and are discussed later.
3.1. General WebSphere MQ recovery techniques
On all platforms,
WebSphere MQ uses the same general techniques for dealing with
recovery of private message queues after a failure of a queue
manager. With the exception of shared message queues (see Shared queues), messages are cached in memory and backed by disk storage if the volume of message data exceeds the available memory cache.
When persistent messaging is used, WebSphere MQ logs messages to
disk storage. Therefore, in the event of a failure, the combination
of the message data on disk plus the queue manager logs can be used
to reconstruct the message queues. This restores the queue manager
to a consistent state at the time just before the failure occurred.
This recovery involves completing normal Unit of Work resolution, with in-flight messages being rolled back, in-commit messages being completed, and in-doubt messages waiting for coordinator resolution.
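To make this recovery contract concrete, here is a minimal sketch using the WebSphere MQ base Java classes: it puts a persistent message under syncpoint, so the message is written to the queue manager's log and can be reconstructed by the restart processing just described. The queue manager name (QM1) and queue name (ORDERS.QUEUE) are illustrative, not taken from this paper.

import java.io.IOException;
import com.ibm.mq.MQC;
import com.ibm.mq.MQException;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQPutMessageOptions;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;

public class PersistentPut {
    public static void main(String[] args) throws MQException, IOException {
        MQQueueManager qmgr = new MQQueueManager("QM1");      // illustrative name
        MQQueue queue = qmgr.accessQueue("ORDERS.QUEUE",
                MQC.MQOO_OUTPUT | MQC.MQOO_FAIL_IF_QUIESCING);

        MQMessage message = new MQMessage();
        message.persistence = MQC.MQPER_PERSISTENT;            // logged to disk, so recoverable
        message.writeString("example order");

        MQPutMessageOptions pmo = new MQPutMessageOptions();
        pmo.options = MQC.MQPMO_SYNCPOINT;                     // put inside a unit of work

        queue.put(message, pmo);
        qmgr.commit();                                         // completes the unit of work

        queue.close();
        qmgr.disconnect();
    }
}

If the queue manager fails before the commit, restart processing rolls the put back; after the commit, the message is rebuilt from the log and is available again once the queue manager restarts.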
The following sections describe how the above general restart
process is used in conjunction with platform specific facilities,
such as HACMP on AIX or ARM on z/OS, to quickly restore message
availability after failures.
WebSphere MQ also provides a mechanism for improving the
availability of new messages by routing messages around a failed
queue manager transparently to the application producing the
messages. This is called WebSphere MQ clustering and is covered in WebSphere MQ Queue Manager clusters. Finally, on z/OS, WebSphere MQ
supports shared message queues that are accessible to a number of
queue managers. Failure of one queue manager still allows the
messages to be accessed by other queue managers. These are covered
in z/OS high availability options.
3.2. Standby machine - shared disks
As described above, when a
queue manager fails, a restart is required to make the private
message queues available again. Until then, the messages stored on
the queue manager will be stranded. Therefore, you cannot access
them until the machine and queue manager are returned to normal
operation. To avoid the stranded messages problem, stored messages
need to be made accessible, even if the hosting queue manager or
machine is inoperable. In the standby machine solution, a second
machine is used to host a second queue manager that is activated
when the original machine or queue manager fails. The standby
machine needs to be an exact replica, at any given point in time, of the master machine, so that when a failure occurs, the standby
machine can start the queue manager correctly. That is, the
WebSphere MQ code on the standby machine should be at the same
level, and the standby machine should have the same security
privileges as the primary machine. A common method for implementing
the standby machine approach is to store the queue manager data
files and logs on an external disk system that is accessible to
both the master and standby machines. WebSphere MQ writes its data
synchronously to disk, which means a shared disk will always
contain the most recent data for the queue manager. Therefore, if
the primary machine fails, the secondary machine can start the
queue manager and resume its last known good state.
Figure 2. An active-standby setup
The standby machine is ready
to read the queue manager data and logs from the shared disk and to
assume the IP address of the primary machine [3].
A shared external disk device is used to provide a resilient
store for queue data and queue manager logs so that replication of messages is avoided. This preserves the once and once only
delivery characteristic of persistent messages. If the data was
replicated to a different system, the messages stored on the queues would be duplicated on the other system, and once and once only delivery could not be guaranteed. For instance, if data was replicated
to a standby server, and the connection between the two servers
fails, the standby assumes that the master has failed, takes over
the master server's role, and starts processing messages. However,
as the master is still operational, messages are processed twice,
hence duplicated messages occur. This is avoided when using a
shared hard disk because the data only exists in one physical
location and concurrent access is not allowed. The external disk
used to store queue manager data should also use a RAID configuration, such as mirroring, to protect against data loss and prevent the disk from being a single point of failure (SPOF) [8]. The disk
device may also have multiple disk controllers and multiple
physical connections to each of the machines, to provide redundant
access channels to the data. In normal operation, the shared disk
is mounted by the master machine, which uses the storage to run the
queue manager in the same way as if it were a local disk, storing
both the queues and the WebSphere
MQ log files on it. The standby machine cannot mount the shared
disk and therefore, cannot start the queue manager because the
queue manager data is not accessible. When a failure is detected,
the standby machine automatically takes on the master machine's
role, and as part of that process, mounts the shared disk and
starts the queue manager. The standby queue manager replays the
logs stored on the shared disk to return the queue manager to the
correct state, and resumes normal operations. Note that messages on
queues that are failed over to another queue manager retain their
order on the queue. This failover operation can also be performed
without the intervention of a server administrator. It does require
external software, known as HA clustering, to detect the failure
and initiate the failover process. Only one machine has access to
the shared disk partition at a time (a more accurate name would be a switchable disk), and only one instance of the
queue manager runs at any one time to protect data integrity of
messages. The objective of the shared disk is to move the storage
of important data (for example, queue data and queue manager logs)
to a location external to the machine, so that when the master
machine fails, another machine may use the data.
3.2.1. HA clustering software
Much of the functionality in the standby machine configuration is provided by external software, often termed HA clustering software [4]. This software addresses
high availability issues using a more holistic approach than single
applications, such as WebSphere MQ, can provide. It also recognizes
that a business application may consist of many software packages
and other resources, all of which need to be highly available. This
is because another complication is introduced when a solution
consists of several applications that have a dependency on each
other. For example, an application may need access to both
WebSphere MQ and a database, and may need to run on the same
physical machine as these services. HA clustering provides the
concept of resource groups, where applications are grouped
together. When a failure occurs in one of the applications in the
group, the entire group is moved to a standby server, satisfying
the dependency of the applications. However, this only occurs if
the HA clustering software fails to restart the application on its
current machine. It is also possible to move the network address
and any other operating system resources with the group so that the
failover is transparent to the client. If an individual software package were responsible for its own availability, it might not be able to transfer to another physical machine and would not be able to move any other resources on which it is dependent. By using HA
clustering to cope with these low level considerations, such as
network address takeover, disk access, and application
dependencies, the higher level applications are relieved of this
complexity. Although there are several vendors providing HA
clustering, each package tends to follow the same basic principles
and provide a similar set of basic functionality. Some solutions,
such as Veritas Cluster Server and SteelEye LifeKeeper, are also
compatible with multiple platforms to provide a similar solution in
heterogeneous environments. In the same way that WebSphere MQ
removed the complexity of application connectivity from the
programmer, HA clustering techniques help provide a simple,
generic solution for HA. This means applications, such as
messaging and data management, can focus on their core competencies
leaving HA clustering to provide a more reliable availability
solution than resource-specific monitors. HA clustering also covers
both hardware and software resources, and is a proven, recognized
technology used in many other HA situations. HA clustering products
are designed to be scalable and extensible to cope with changing
requirements. IBM's AIX HACMP product, SteelEye LifeKeeper, and
Veritas Cluster Server scale up to 32 servers. HACMP, LifeKeeper,
and Cluster Server have extensions available to allow replication
of disks to a remote site for disaster recovery purposes.
3.2.2. When to use standby machine - shared disks
The standby
machine solution is ideal for messages that are delivered once and
only once. For example, in billing and ordering systems, it is
essential that messages are not duplicated so that customers are
not billed twice, or sent two shipments instead of one. As HA
clustering software is a separate product that sits alongside existing applications, this methodology is also suited to converting an existing server, or set of servers, to be highly available, and the conversion can be done gradually. In large installations where there are many servers, HA
clustering is a cost effective choice through the use of an n+1
configuration. In this approach, a single machine is used as a
backup for a number of live servers. Hardware redundancy is reduced
and therefore, cost is reduced, as only one extra machine is
required to provide high availability to a number of active
servers. As already shown, HA clustering software is capable of
converting an existing application and its dependent resources to
be highly available. It is, therefore, suited to situations where
there are several applications or services that need to be made
highly available. If those applications are dependent on each
other, and rely on operating system resources, such as network
addresses to function correctly, HA clustering is ideally
suited.
3.2.3. When not to use standby machine - shared disks
HA clustering is not always necessary when considering an HA solution.
Although the examples given below are served by an HA clustering
method, other solutions would serve just as well and it would be
possible to utilize HA clustering at a later date if required. If
the trapped messages problem is not applicable, for example if there is no need to restart a failed queue manager with its messages intact, then shared disks are not necessary. This occurs if the system is
only used for event messages that will be re-transmitted regularly,
for messages that expire in a relatively short time, or for
non-persistent messages (where an application is not relying on
WebSphere MQ for assured delivery). For these situations, you can
make a system highly available by using WebSphere MQ queue manager
clustering only. This technology load balances messages and routes
around failed servers. See WebSphere MQ Queue Manager clusters for
more information on queue manager clusters.
In situations where it is not important to process the messages
as soon as possible, then HA clustering may provide too much
availability at too much of an expense. For example, if trapped
messages can wait until an administrator restarts the machine, and
hence the queue manager is restarted (using an internal RAID disk
to protect the queue manager data), then HA clustering is
considered too comprehensive a solution. In this situation, it
is possible to allow access for new messages using WebSphere MQ
queue manager clustering, as in the case above. The shared disk
solution requires the machines to be physically close to each
other, as the distance from the shared disk device needs to be
small. This makes it unsuitable for use in a disaster recovery
solution. However, some HA clustering software can provide disaster
recovery functionality. For example, IBM's HACMP package has an
extension called HAGEO, which provides data replication to remote
sites. By backing up data in this fashion, it is possible to
retrieve it if a site wide failure occurs. However, the off-site
data may not be the most up-to-date because the replication is
often delayed by a few minutes. This is because instantaneous
replication of data to an off-site location incurs a significant
performance hit. Therefore, the more important the data, the
smaller the time interval will be, but the greater the performance
impact. Time and performance must be traded against each other when
implementing a disaster recovery solution. Such solutions do not
provide all of the benefits of the shared disk solution and are
beyond the scope of this document. The following sections describe
two possible configurations for HA clustering. These are termed
active-active and active-standby configurations.
3.2.4. HA clustering active-standby configuration
In a generic HA clustering solution, when two machines are used in an active-standby configuration, one machine is running the
applications in a resource group and the other is idle. In addition
to network connections to the LAN, the machines also have a private
connection to each other. This is either in the form of a serial
link or a private Ethernet link. The private link provides a
redundant connection between the machines for the purpose of
detecting a complete failure. As previously mentioned, if a link
between the machines fails, then both machines may try to become
active. Therefore, the redundant link reduces the risk of
communication failure between the two. The machines may also have
two external links to the LAN. Again, this reduces the risk of
external connectivity failure, but also allows the machines to have
their own network address. One of the adapters is used for the
service network address, such as the network address that clients
use to connect to the service, and the other adapter has a network
address associated with the physical machine. The service address
is moved between the machines upon failure to provide HA
transparency to any clients. The standby machine monitors the
master machine via the use of heartbeats. These are periodic checks
by the standby machine to ensure that the master machine is still
responding to requests. The master machine also monitors its disks
and the processes running on it to ensure that no hardware failure
has occurred. For each service running on the machine, a custom
utility is required to inform the HA clustering software that it is
still running. In the case of WebSphere MQ, the SupportPacs
describing HA configurations provide utilities to check the
operation of queue
managers, which can easily be adapted for other HA systems.
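The SupportPac utilities themselves are not reproduced here, but the sketch below illustrates the kind of check an HA clustering monitor can invoke: it attempts a bindings connection to the named queue manager and reports the result through its exit code. The class name, default queue manager name, and exit-code convention are assumptions for illustration, not the SupportPac's actual interface.

import com.ibm.mq.MQException;
import com.ibm.mq.MQQueueManager;

public class QueueManagerCheck {
    public static void main(String[] args) {
        // Queue manager name is normally passed in by the monitor script
        String name = (args.length > 0) ? args[0] : "QM1";
        try {
            MQQueueManager qmgr = new MQQueueManager(name);   // server-bindings connection
            qmgr.disconnect();
            System.exit(0);                                   // queue manager responded
        } catch (MQException e) {
            System.err.println(name + " not responding, reason code " + e.reasonCode);
            System.exit(1);                                   // monitor treats this as a failure
        }
    }
}
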
Details of these SupportPacs are listed in Appendix A. A small
amount of configuration is required for each resource group to
describe what should happen at start-up and shutdown, although in
most cases this is simple. In the case of WebSphere MQ, this could
be a start-up script containing commands to start the queue manager
(for example, strmqm), listener (for example, runmqlsr), or any
other queue manager programs. A corresponding shutdown script is
also needed, and depending on the HA clustering package in use, a
number of other scripts may be required. Samples for WebSphere MQ
are provided with the SupportPacs described in Appendix A. As the
heartbeat mechanism is the primary method of failure detection, if
a heartbeat does not receive a response, the standby machine
assumes that the master server has failed. However, heartbeats may
not receive a response for a number of reasons, such as an overloaded server or a communication failure. There is a possibility that the
master server will resume processing at a later stage, or is still
running. This can lead to duplicate messages in the system and is
not desired. Managing this problem is also the role of the HA
clustering package. For example, Red Hat cluster services and IBM's HACMP work around this problem by having a watchdog timer with a
lower timeout than the cluster. This ensures that the machine
reboots itself before another machine in the cluster takes over its
role. Programmable power supplies are also supported, so other
machines in the cluster can power cycle the affected machine, to
ensure that it is no longer operational before starting the
resource group. Essentially, the machines in the cluster have the
capability to turn the other machines off. Some HA clustering
software suites also provide the capability to detect other types
of failure, such as system resource exhaustion, or process failure,
and try to recover from these failures locally. For WebSphere MQ on AIX, you can use the appropriate SupportPac (see Appendix A) to locally restart a queue manager that is not responding. This can avoid the more time consuming operation of
completely moving the resource group to another server. You should
design the machines used in HA clustering to have identical
configurations to each other. This includes installed software
levels, security configurations, and performance capabilities, to
minimize the possibility of resource group start-up failure. This
ensures that machines in the network all have the capability to
take on another machine's role. Note that for active-standby
configurations, only one instance of an application is running at
any one moment and therefore, software vendors may only charge for
one instance of the application, as is the case for WebSphere
MQ.
3.2.5. HA clustering active-active configuration
It is also possible to run services on the redundant machine in what is termed an active-active configuration. In this mode, the servers are both
actively running programs and acting as backups for each other. If
one server fails, the other continues
to run its own services, as well as those of the failed server. This enables the backup server to be used more effectively, although when a failure does occur, the performance of the system is reduced because it has taken on extra processing. In Figure 3, the
second active machine runs both queue managers if a failure occurs.
Figure 3. An active-active configuration
In larger installations, where several resource groups exist and
more than one server needs to be made highly available, it is
possible to use one backup machine to cover several active servers.
This setup is known as an n+1 configuration, and has the benefit of
reduced redundant hardware costs, because the servers do not have a
dedicated backup machine each. However, if several servers fail at
the same time, the backup machine may become overloaded. These
extra costs must be weighed up against the potential cost of more
than one server failing, and more than one backup machine being
required.
3.2.6. HA clustering benefits
HA clustering software provides
the capability to perform controlled failover of resource groups.
This allows administrators to test the functionality of a
configured system, and also allows machines to be gracefully removed
from an active cluster. This can be for maintenance purposes, such
as hardware and software upgrades or data backup. It also allows
failed servers, once repaired, to be placed back in the cluster and
to resume their services. This is known as fail-back [4]. A
controlled failover operation also results in less downtime because
the cluster does not need to detect the
failure. There is no need to wait for the cluster timeout. Also,
as the applications, such as WebSphere MQ, are stopped in a
controlled manner, the start-up time is reduced because there is no need for log replay. The abstraction of resource groups makes it possible for a service to remain highly available even when the machine that normally runs the service has been removed from the cluster. This is only true as long as the other machines
have comparable software installed and access to the same data,
meaning any machine can run the resource group. The modular nature
of resource groups also helps the gradual uptake of HA clustering
in an existing system and easily allows services to be added at a
later date. This also means that in a large queue manager
installation, you can convert mission critical queue managers to be
highly available first, and later convert the less critical queue
managers, or not at all. Many of the requirements for implementing
HA clustering are also desirable in more bespoke, or
product-centric HA solutions. For example, RAID disk arrays [8],
extra network connections and redundant power supplies all protect
against hardware failure. Therefore, improving the availability of
a server results in additional cost, whether a bespoke or HA
clustering technique is used. HA clustering may require additional
hardware over and above some application-specific HA solutions, but this enables an HA clustering approach to provide a more complete HA
solution. You can easily extend the configuration of HA clustering
to cover other applications running on the machine. The
availability of all services is provided via a standard methodology
and presented through a consistent interface rather than being
implemented separately by each service on the machine. This in turn
reduces complexity and staff training times and reduces errors
being introduced during administration activities. By using one
product to provide an availability solution, you can take a common
approach to decision making. For instance, if a number of the
servers in a cluster are separated from the others by network
failure, a unanimous decision is needed to decide which servers
should remain active in the cluster. If there were several HA
solutions in place (such as each product using its own availability solution), each with separate quorum algorithms (a quorum is the minimum number of members of a deliberative body necessary to conduct the business of that group), then it is
possible that each algorithm has a different outcome. This could
result in an invalid selection of active servers in the cluster
that may not be able to communicate. By having a separate entity,
in the form of the HA clustering software, to decide which part of
the cluster has the quorum, only one outcome is possible, and the
cluster of servers continues to be available.
Summary
The shared
disk solution described above is a robust approach to the problem
of trapped messages, and allows access to stored messages in the
event of a failure. However, there will be a short period of time
where there is no access to the queue manager while the failure is
being detected, and the service is being transferred to the standby
server. It is possible during this time to use WebSphere MQ
clustering to provide access for new messages because its load
balancing capabilities will route
messages around the failed queue manager to another queue
manager in the cluster. How to use HA clustering with WebSphere MQ
clustering is described in When to use HA WebSphere MQ queue manager clusters.
3.3. z/OS high availability options
z/OS provides a facility for
operating system restart of failed queue managers called Automatic
Restart Manager (ARM). It provides a mechanism, via ARM policies,
for a failed queue manager to be restarted in place on the failing
logical partition (LPAR), or, in the case of an LPAR failure, to be started on a different LPAR along with other grouped subsystems and applications, so that the subsystem components that provide the overall business solution can be restarted together. In
addition, with a parallel sysplex, Geographically Dispersed
Parallel Sysplex (GDPS) provides the ability for automatic restart
of subsystems, via remote DASD copying techniques, in the event of
a site failure. The above techniques are restart techniques that
are similar to those discussed earlier for distributed platforms.
We will now look at a capability that maximizes the availability of message queues in the event of queue manager failures and does not require a queue manager restart.
3.3.1. Shared queues (z/OS only)
WebSphere MQ shared queues are
an exploitation of the z/OS-unique Coupling Facility (CF)
technology that provides high-speed access to data across a sysplex
via a rich set of facilities to store and retrieve data. WebSphere
MQ stores shared message queues in the Coupling Facility, and this
in turn, means that unlike private message queues, they are not
owned by any single queue manager. Queue managers are grouped into
Queue Sharing Groups (QSGs), analogous to the Data Sharing Groups used by DB2 data sharing. All queue managers within a QSG can access shared
message queues for putting and getting of messages via the
WebSphere MQ API. This enables multiple putters and getters on the
same shared queue from within the QSG. Also, WebSphere MQ provides peer recovery such that in-flight shared queue messages are
automatically rolled back by another member of the QSG in the event
of a queue manager failure. WebSphere MQ still uses its logs for
capturing persistent message updates so that in the extremely
unlikely event of a CF failure, you can use the normal restart
procedures to restore messages. In addition, z/OS provides system
facilities to automatically duplex the CF structures used by
WebSphere MQ. The combination of these facilities provides
WebSphere MQ shared message queues with extremely high availability
characteristics. Figure 4 shows three queue managers: QM1, QM2 and
QM3 in the QSG GRP1 sharing access to queue A in the coupling
facility. This setup allows all three queue managers to process
messages arriving on queue A.
Figure 4. Three queue managers in a QSG share queue A on a Coupling Facility
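The application code needed to exploit a shared queue is ordinary WebSphere MQ API code; the availability comes from the fact that any member of the QSG can serve the queue. The sketch below, using the base Java classes, puts a persistent message to queue A over a client channel. The host name, channel, and port are illustrative stand-ins for a connection to the group's generic interface, and the empty queue manager name simply accepts whichever member of GRP1 the connection reaches.

import java.io.IOException;
import com.ibm.mq.MQC;
import com.ibm.mq.MQEnvironment;
import com.ibm.mq.MQException;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;

public class SharedQueuePut {
    public static void main(String[] args) throws MQException, IOException {
        // Client connection details for the QSG's generic interface (illustrative values)
        MQEnvironment.hostname = "grp1.example.com";
        MQEnvironment.port = 1414;
        MQEnvironment.channel = "GRP1.SVRCONN";

        // An empty queue manager name accepts whichever QSG member answers
        MQQueueManager qmgr = new MQQueueManager("");
        MQQueue queueA = qmgr.accessQueue("QUEUE.A",
                MQC.MQOO_OUTPUT | MQC.MQOO_FAIL_IF_QUIESCING);

        MQMessage message = new MQMessage();
        message.persistence = MQC.MQPER_PERSISTENT;   // recoverable even if a member fails
        message.writeString("work item for any member of GRP1");
        queueA.put(message);

        queueA.close();
        qmgr.disconnect();
    }
}
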
A further benefit of using shared queues is utilizing shared
channels. You can use shared channels in two different scenarios to
further extend the high availability of WebSphere MQ. First, using
shared channels, an external queue manager can connect to a
specific queue manager in the QSG using channels. It can then put
messages to the shared queue via this queue manager. This allows
for queue managers in a distributed environment to utilize the HA
functionality provided by shared queues. Therefore, the target
application of messages put by the queue manager can be any of
those running on a queue manager in the QSG. Second, you can use a
generic port so that a channel connecting to the QSG could be
connected to any queue manager in the QSG. If the channel loses its
connection (because of a queue manager failure), then it is
possible for the channel to connect to another queue manager in the
QSG by simply reconnecting to the same generic port.
3.3.1.1. Benefits of shared message queues
The main benefit of a shared
queue is its high availability. There are numerous customer
selectable configuration options for CF storage, ranging from
running on standalone processors with their own power supplies to
the Internal Coupling Facility (ICF) that runs on spare processors
within a general zSeries server. Another key factor is that the
Coupling Facility Control Code (CFCC) runs in its own LPAR, where
it is isolated from any application or subsystem code. In addition,
it naturally balances the workload between the queue managers in
the QSG. That is, a queue manager will only request a message from
the shared queue when the application, which is processing
messages, is free to do so. Therefore, the availability of the
messaging service is improved because queue managers are not
flooded by messages directly. Instead, they consume messages from
the shared queue when they are ready to do so. Also, should greater
message processing performance be required, you can add extra queue
managers to the QSG to process more incoming messages. With
persistent messages, both private and shared, the message
processing limit is constrained by the speed of the log. With
shared message queues, each queue manager uses its own log
for updates. Therefore, deploying additional queue managers to
process a shared queue means the total logging cost is spread over a number of queue managers. This provides a highly
scalable solution. Conversely, if a queue manager requires
maintenance, you can remove it from the
QSG, leaving the remaining queue managers to continue processing the messages. Both the addition and removal of queue managers in a QSG can be performed without disrupting the already existing members.
Lastly, should a queue manager fail during the processing of a Unit of Work, the other members of the QSG will spot this and Peer Recovery is initiated. That is, if the unit of work was not completed by the failed queue manager, another queue manager in the QSG will complete the processing. This arbitration of queue manager data is achieved via hardware and microcode on z/OS. This means that the availability of the system is increased as the failure of any one queue manager does not result in trapped messages or inconsistent transactions. This is because Peer Recovery either completes the transaction or rolls it back. For more information on Peer Recovery and how to configure it, see z/OS Systems Administration Guide [6].
The benefits of shared queues are not solely limited to z/OS queue managers. Although you cannot set up shared queues in a distributed environment, it is possible for distributed queue managers to place messages onto them through a member of the QSG. This allows for the QSG to process a distributed application's messages in a z/OS HA environment.
3.3.1.2. Limitations of shared message queues
With WebSphere MQ V5.3, physical shared messages are limited to be less than 63KB in size. Any application that attempts to put a message greater than this limit receives an error on the MQPUT call. However, you can use the message grouping API to construct a logical message greater than 63KB, which consists of a number of physical segments.
The Coupling Facility is a resilient and durable piece of hardware, but it is a single point of failure in this high availability configuration. However, z/OS provides duplexing facilities, where updates to one CF structure are automatically propagated to a second CF. In the unlikely event of failure of the primary CF, z/OS automatically switches access to the secondary, while the primary is being rebuilt. This system-managed duplexing is supported by WebSphere MQ. While the rebuild is taking place, there is no noticeable application effect. However, this duplexing will clearly have an effect on overall performance.
Finally, a queue manager can only belong to one QSG and all queue managers in a QSG must be in the same sysplex. This is a small limitation on the flexibility of QSGs. Also, a QSG can only contain a maximum of 32 queue managers.
For more information on shared queues, see WebSphere MQ for z/OS Concepts and Planning Guide [1].
3.4. WebSphere MQ queue manager clusters
A WebSphere MQ queue manager cluster is a cross-platform workload balancing solution
that allows WebSphere MQ messages to be routed around a failed
queue manager. It allows a queue to be hosted across multiple queue
managers, thus allowing an application to be duplicated across
multiple machines. It provides a highly available messaging service
allowing incoming messages to be forwarded to any queue manager in
the cluster for application processing. Therefore, if any queue
manager in the cluster fails, new incoming messages continue to be
processed by the remaining queue managers. In Figure 5, an
application puts a message to a cluster queue on QM2. This cluster
queue is defined locally on QM1, QM4 and QM5. Therefore, one of
these queue managers will receive the message and process it.
Figure 5. Queue managers 1, 4, and 5 in the cluster receive messages in order
By balancing the workload between QM1, QM4, and QM5, an
application is distributed across multiple queue managers making it
highly available. If a queue manager fails, the incoming messages
are balanced among the remaining queue managers. While WebSphere MQ
clustering provides continuous messaging for new messages, it is
not a complete HA solution because it is unable to handle messages
that have already been delivered to a queue manager for processing.
As we have seen above, if a queue manager fails, these trapped
private messages are only processed when the queue manager is
restarted. However, by combining WebSphere MQ clustering with the
recovery techniques covered above, you can create an HA solution for both new and existing messages. The following section shows
this in action in a distributed shared disk environment.
3.4.1. Extending the standby machine - shared disk approach
By
hosting cluster queue managers on active-standby or active-active
setups, trapped messages, on private or cluster queues, are made
available when the queue manager is failed over to a standby
machine and restarted. The queue manager will be failed over and
will begin processing messages within minutes instead of the longer
amount of time it would take to manually recover and repair the
failed machine or failed queue manager in the cluster. The added
benefit of combining queue manager clusters with HA clustering is
that the high availability nature of the system becomes transparent
to any clients using it. This is because they are putting messages
to a single cluster queue. If a queue manager in the cluster fails,
the client's outstanding requests are processed when the queue
manager is failed over to a backup machine. In the meantime, the
client needs to take no action because its new requests will be
routed around the failure and processed by another queue manager in
the cluster. The client must only tolerate its requests taking slightly longer than normal to be returned in the event of a
failover. Figure 6 shows each queue manager in the cluster in an
active-active, standby machine-shared disk configuration. The
machines are configured with separate shared disks for queue
manager data and logs to decrease the time required to restart the
queue manager. See Considerations for WebSphere MQ restart
performance for more information. Figure 6. Queue managers 1,4, and
5 have active standby machines
Cluster Queue
Application Local Queue
QM 3 QM 1
QM 2 QM QM log log QM 4 QM 6 QM log QM 5 cluster 1
In this example, if queue manager 4 fails, it fails over to the
same machine as queue manager 3, where both queue managers will run
until the failed machine is repaired.
3.4.2. When to use HA WebSphere MQ queue manager clusters
Because this solution is implemented by combining external HA
clustering technology with WebSphere MQ queue manager clusters, it
provides the ultimate high availability configuration for
distributed WebSphere MQ. It makes both incoming and queued
messages available and also fails over not only a queue manager,
but also any other resources running on the machine. For instance,
server applications, databases, or user data can fail over to a
standby machine along with the queue manager. When using HA
WebSphere MQ clustering in an active-standby configuration, it is a
simpler task to apply maintenance or software updates to machines,
queue managers, or applications. This is because you can first
update a standby machine, then a queue manager can fail over to it,
ensuring that the update works correctly. If it is successful, you
can update the primary machine and then the queue manager can fail
back onto it. HA WebSphere MQ queue manager clusters also greatly
reduce the administration of the queue managers within the cluster, which in
turn reduces the risk of administration errors. Queue managers that
are defined in a cluster do not require channel or queue
definitions to be set up for every other member of the cluster. Instead,
the cluster handles these communications and propagates relevant
information to each member of the cluster through a repository. HA
WebSphere MQ queue manager clusters are able to scale applications
linearly because you can add new queue managers to the cluster to
aid in the processing of incoming messages. Conversely, you can
remove queue managers from the cluster for maintenance and the
cluster can still continue to process incoming requests. If the
queue manager's presence in the cluster is required, but the
hardware must be maintained, then you can use this technique in
conjunction with failing the queue manager over to a standby
machine. This frees the machine, but keeps the queue manager
running. It is also possible for administrators to write their own
cluster workload exits. This allows for a finer control of how
messages are delivered to queue managers in the cluster. Therefore,
you can target messages at machines in different ratios based on
the performance capabilities of the machine (rather than in a
simple round robin fashion).
3.4.3. When not to use HA WebSphere MQ queue manager clusters
HA WebSphere MQ queue manager clusters require additional proprietary
HA hardware (shared disks) and external HA clustering software
(such as HACMP). This increases the administration costs of the
environment because you also need to administer the HA components.
This approach also increases the initial implementation costs
because extra hardware and software are required. Therefore,
balance these initial costs with the potential costs incurred if a
queue manager fails and messages become trapped. Note that
non-persistent messages do not survive a queue manager failover.
This is because the queue manager restarts once it has been failed
over to the standby machine, causing it to process its logs and
return to its most recent known state. At
this point, non-persistent messages are discarded. Therefore, if your application requires non-persistent messages, take this into account. If trapped messages are not a problem for the
applications (for example, the response time of the application is
irrelevant or the data is updated frequently), then HA WebSphere MQ
queue manager clusters are probably not required. That is, if the
amount of time required to repair a machine and restart its queue
manager is acceptable, then having a standby machine to take over
the queue manager is not necessary. In this case, it is possible to
implement WebSphere MQ queue manager clusters without any
additional HA hardware or software.
3.4.4. Considerations for implementation of HA WebSphere MQ queue manager clusters
When configuring an active-active or active-standby setup in a cluster, administrators should test to
ensure that the failover of a given node works correctly. Nodes
should be failed over, when and where possible, to backup machines
to ensure the failover processes work as designed and that no
problems are encountered when a failover is actually required.
Perform this procedure at the discretion of the administrators; if failover does not happen smoothly, it may cause problems or outages in a future production environment. As with queue manager clusters,
do not code WebSphere MQ applications as machine or queue manager
specific, such as relying on resources only available to a single
machine. This is because when applications are failed over to a
standby machine, along with the queue manager they are running on,
they may not have access to these resources. To avoid these
administrative problems, machines should be as equal as possible
with respect to software levels, operating system environments, and
security settings. Therefore, any failed over applications should
have no problems running. Avoid message affinities when programming
applications. This is because there is no guarantee that messages
put to the cluster queue will be processed by the same queue
manager every time. It is possible to use the MQ Open Option
BIND_ON_OPEN to ensure an application's messages are always
delivered to the same queue manager in the cluster. However, an
application performing this operation incurs reduced availability
because this queue manager may fail during message processing. In
this case, the application must wait until the queue manager is
failed over to a backup machine before it can begin processing the
application's requests. If affinities had not been used, then no
delay in message processing would be experienced. Another queue
manager in the cluster would continue processing any new requests.
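The following sketch, again using the base Java classes, shows the two open options side by side. The queue manager and queue names are illustrative; the trade-off is exactly the one described above: binding on open preserves an affinity at the cost of waiting for failover, while not fixing the binding lets the cluster route each message to any available instance.

import com.ibm.mq.MQC;
import com.ibm.mq.MQException;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;

public class ClusterBindOptions {
    public static void main(String[] args) throws MQException {
        MQQueueManager qmgr = new MQQueueManager("QM2");   // illustrative gateway queue manager

        // Messages put through this object may be routed to any instance of the
        // cluster queue, so a failed queue manager does not delay new messages.
        MQQueue anyInstance = qmgr.accessQueue("CLUSTER.QUEUE",
                MQC.MQOO_OUTPUT | MQC.MQOO_BIND_NOT_FIXED);

        // Messages put through this object all go to the instance chosen when the
        // queue was opened; use it only when a sequence of messages must be
        // processed by the same queue manager.
        MQQueue fixedInstance = qmgr.accessQueue("CLUSTER.QUEUE",
                MQC.MQOO_OUTPUT | MQC.MQOO_BIND_ON_OPEN);

        anyInstance.close();
        fixedInstance.close();
        qmgr.disconnect();
    }
}
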
Application programmers should avoid long running transactions in
their applications. This is because these will greatly increase the
restart time of the queue manager when it is failed over to a
standby machine. See Considerations for WebSphere MQ restart performance
for more information. When implementing a WebSphere MQ cluster
solution, whether for an HA configuration or for normal workload
balancing, be careful to have at least two full cluster
repositories defined. These repositories should be on machines that
are highly
available. For example, they have redundant power supplies,
network access and hard disks, and are not heavily loaded with
work. Repositories are vital to the cluster because they contain
cluster wide information that is distributed to each cluster
member. If both of these repositories are lost, it is impossible
for the cluster to propagate any cluster changes, such as new
queues or queue managers. However, the cluster continues to
function with each member's partial repository until the full
repositories are restored.
3.5. HA capable client applications
You can achieve high
availability on the client side rather than using HA clustering, HA
WebSphere MQ queue manager clusters, or shared queue server side
techniques as previously described. HA capable clients are an
inexpensive way to implement high availability, but usually this results in a large client with complex logic. This is not ideal and
a server side approach is recommended. However, HA capable clients
are discussed here for completeness. Most occurrences of a queue
manager failure result in a connection failure with the client.
Even if the queue manager is returned to normal operation, the
client disconnects and remains so until the code used to connect
the client to the queue manager is executed again. One possible
solution to the problem of a server failure is to design the client
applications to reconnect, or connect to a different, but
functionally identical, server. The client's application logic has to
detect a failed connection and reconnect to another specified
server. The method of detecting and handling a failed connection
depends on the MQ API in use. MQ JMS, for instance, provides an
exception listener mechanism that allows the programmer to specify
code to be run upon a failure event. The programmer can also use
Java try catch blocks to allow failures to be handled during code
execution. The MQI API reports a failure upon the next function
call that requires communication with the queue manager. In this
scenario, it is the programmers responsibility to resolve the
failure. The management of the failure depends on the type of
application and also, if there are any other high availability
solutions in place. A simple reconnect to the same queue manager
may be attempted, and if successful, the application can resume
processing. You can configure the application with a list of queue
managers that it may connect to. Upon failure, it can reconnect to
the next queue manager in the list. In an HA clustering solution,
clients still experience a failed connection if a server is
failed-over to a different physical machine. This is because it is
not possible to move open network connections between servers. The
client also may need to be configured to perform several reconnect
attempts to the server, and/or wait a period of time to allow time
for the server to restart. If the application is transactional, and
the connection fails mid-transaction, the entire transaction needs
to be re-executed when a new connection is established. This is
because WebSphere MQ queue managers will roll back any uncommitted
work at start-up time. You can supplement many server-side HA
solutions with the use of client side application code designed to
cope with the temporary loss, or need to reconnect to a queue
manager. A client that contains no extra code may need user
intervention, or even need to be completely restarted to resume
full functionality. There is obviously extra effort required to
code the client application to be HA aware, but the end result is a
more autonomous client.
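As a sketch of what an HA capable client can look like, the code below uses the WebSphere MQ JMS classes: an ExceptionListener is registered on the connection and, when it fires, the client walks a configured list of functionally identical queue managers until one accepts a new connection. The queue manager names, host names, port, and channel are illustrative assumptions; re-executing any interrupted transaction after the reconnect, as described above, is left to the surrounding application.

import javax.jms.ExceptionListener;
import javax.jms.JMSException;
import javax.jms.QueueConnection;

import com.ibm.mq.jms.JMSC;
import com.ibm.mq.jms.MQQueueConnectionFactory;

public class ReconnectingClient {

    // Functionally identical queue managers this client may use (illustrative values)
    private static final String[][] ENDPOINTS = {
        { "QM1", "hosta.example.com" },
        { "QM2", "hostb.example.com" },
    };

    private QueueConnection connection;
    private int current = 0;

    public synchronized void connect() throws JMSException {
        for (int attempts = 0; attempts < ENDPOINTS.length; attempts++) {
            String[] endpoint = ENDPOINTS[current];
            try {
                MQQueueConnectionFactory factory = new MQQueueConnectionFactory();
                factory.setTransportType(JMSC.MQJMS_TP_CLIENT_MQ_TCPIP);
                factory.setQueueManager(endpoint[0]);
                factory.setHostName(endpoint[1]);
                factory.setPort(1414);
                factory.setChannel("SYSTEM.DEF.SVRCONN");

                connection = factory.createQueueConnection();

                // Called by JMS when the connection to the queue manager is lost
                connection.setExceptionListener(new ExceptionListener() {
                    public void onException(JMSException broken) {
                        current = (current + 1) % ENDPOINTS.length;  // move to the next entry
                        try {
                            connect();                               // attempt to reconnect
                        } catch (JMSException stillDown) {
                            stillDown.printStackTrace();             // all endpoints unreachable
                        }
                    }
                });

                connection.start();
                return;                                              // connected successfully
            } catch (JMSException failed) {
                current = (current + 1) % ENDPOINTS.length;          // try the next queue manager
            }
        }
        throw new JMSException("No queue manager in the configured list is reachable");
    }
}

A client built this way can reconnect without user intervention; as noted above, uncommitted work is rolled back when the failed queue manager restarts, so any in-flight transaction still has to be re-executed over the new connection.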
3.5.1. When to use HA capable client applications
HA capable
clients are ideally suited when an application has a number of
clients that need to reconnect in the event of a failure and no HA
solution has been implemented on the server side. This allows
clients to connect themselves to alternative services while the
failed service is restored.
3.5.2. When not to use HA capable client applications
When a
robust extensible high availability solution is required, the HA
focus is on the server side rather than the client side. Clients
with complex HA logic become large and must be maintained. Also,
new clients coming onto the system must implement the same logic.
However, a transparent server-side HA solution negates the need to implement this technology. Also, if there is a requirement for a
thin client, then there is no room for bulky HA logic. Therefore,
you must implement the HA solution on the server side.
4. Considerations for WebSphere MQ restart performance
The most important factor in making an IT system highly available is the length of time required to recover from a failure. The methods described for making a WebSphere MQ queue manager highly available all involve situations where a queue manager has failed and must be restarted, either on the same machine or on a standby machine. Therefore, the quicker you can restart a queue manager, the quicker it can complete any outstanding work and begin to process new requests. The quickest approach is to attempt to restart the queue manager on the same machine it failed on. This is only possible if the queue manager did not fail because of a hardware problem (an external HA clustering technology can determine this). Restarting in place results in a much quicker restart and a less disruptive failover, because there is no need to move resources, such as network addresses, queue managers, applications, and shared disks, to the standby machine. If this is not possible, the queue manager must be failed over to a standby machine. In either case, minimizing the amount of start-up processing the queue manager must do to regain its state minimizes the amount of time the queue manager is unavailable. The next sections discuss factors that affect the start-up time of the queue manager.
4.1. Long running transactions
If your client applications have long running transactions that use persistent messages, the amount of time a queue manager takes to start up increases. Design applications to avoid long running transactions, because these affect the amount of log data that must be replayed during recovery. By committing transactions as frequently as possible, the amount of log replay required to recover a transaction is reduced. WebSphere MQ uses automatically generated checkpoints to determine the point from which the log is replayed. A checkpoint is a point at which the log and the queue files/pagesets are consistent (on z/OS, note that pagesets are only consistent at every third checkpoint). If a transaction remains uncommitted across several checkpoints, the size of the log required to recover the queue manager increases accordingly. Therefore, short transaction times reduce the amount of data to be processed when recovering a queue manager. On z/OS, a checkpoint can be forced by archiving the log, and one is taken when the number of log records written matches the LOGLOAD value. Shorter transactions also reduce the possibility of the queue manager exhausting the available log space (and the quantity of log space required). On distributed platforms, exhausting the log space results in the long running transaction being rolled back to release space. On z/OS, the transaction is not rolled back in this situation; instead, it is necessary to access the archive logs if the transaction backs out, which can significantly extend the time the backout takes. Note also that if the transaction backs out and not all of the log records are available, the queue manager terminates.
For instance, if the queue manager has a long running unit of work (UOW), it must scan back over a number of log files to recover it. By introducing frequent commits into the application code, you can avoid long start-up times caused by large UOWs, and you also reduce the number of log files required to recover the queue manager. If the required log files have been archived onto another medium, such as tape, retrieving them significantly increases the restart time of the queue manager.
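To make the effect of frequent commits concrete, the sketch below shows a transacted JMS consumer that commits after every fixed-size batch of messages, so that no single unit of work grows unbounded and the log replay needed at restart stays small. The batch size of 50, the one-second receive timeout, and the process() placeholder are illustrative assumptions, not recommended values.

import javax.jms.*;

// Minimal sketch of keeping units of work short by committing a transacted
// JMS session after every small batch of messages.
public class BatchedConsumer {

    private static final int BATCH_SIZE = 50; // illustrative value

    public static void drain(Connection connection, String queueName) throws JMSException {
        // The first argument 'true' makes the session transacted; the
        // acknowledge mode is ignored for transacted sessions.
        Session session = connection.createSession(true, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(session.createQueue(queueName));

        int inBatch = 0;
        Message message;
        // Receive until the queue is (momentarily) empty.
        while ((message = consumer.receive(1000)) != null) {
            process(message);        // application-specific work
            if (++inBatch >= BATCH_SIZE) {
                session.commit();    // keep each unit of work short
                inBatch = 0;
            }
        }
        if (inBatch > 0) {
            session.commit();        // commit the final partial batch
        }
        consumer.close();
        session.close();
    }

    private static void process(Message message) {
        // Placeholder for real message handling.
    }
}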
4.2. Persistent message use
Persistent messages are first written to the queue manager log (for recovery purposes) and then to the queue file/pageset if the message is not retrieved immediately. The queue manager replays the log during recovery, so reducing the amount of log to be reprocessed reduces the time required for recovery. Non-persistent messages are not written to the log, so they do not increase the queue manager's restart time. However, if an application relies on WebSphere MQ to provide data integrity, you must use persistent messages to ensure message delivery. Also, because non-persistent messages are not logged, they do not survive a queue manager restart. A new class of message service, introduced in WebSphere MQ 5.3 CSD 6, is positioned between persistent and non-persistent messaging. It allows non-persistent messages to survive a queue manager restart, although some messages may be lost because it lacks the message logging that persistent messaging provides. On non-z/OS platforms, you enable this message class by setting the queue parameter NPMCLASS to HIGH. On z/OS, equivalent behavior is an emergent property of shared queues: non-persistent messages are stored in the Coupling Facility and are not removed when the queue manager restarts.
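Because the persistence of a message is chosen by the sending application, the following sketch shows where that choice is expressed with the standard JMS API. The queue name and the needsAssuredDelivery flag are assumptions for the example; whether PERSISTENT or NON_PERSISTENT is appropriate depends on whether the application needs assured delivery or a faster queue manager restart.

import javax.jms.*;

// Minimal sketch showing where a JMS sender selects message persistence.
// Persistent messages are logged by the queue manager and survive a restart;
// non-persistent messages are faster but are not recovered after a failure.
public class PersistenceExample {

    public static void send(Session session, String queueName, String body,
                            boolean needsAssuredDelivery) throws JMSException {
        MessageProducer producer = session.createProducer(session.createQueue(queueName));

        // Choose between assured delivery (with its logging cost at restart)
        // and speed with no recovery guarantee.
        producer.setDeliveryMode(needsAssuredDelivery
                ? DeliveryMode.PERSISTENT
                : DeliveryMode.NON_PERSISTENT);

        producer.send(session.createTextMessage(body));
        producer.close();
    }
}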
4.3. Automation
The detection of the failure, the failover to a standby machine, and the restart of the queue manager (and applications) should all be automated. By reducing operator intervention, the time required to fail over a queue manager to a backup machine is significantly reduced, allowing normal service to be resumed as quickly as possible. You can automate this process using HA clustering software, as described in the section HA clustering software.
4.4. File systems
Use of a journaled file system is recommended on distributed platforms to reduce the time required to recover a file system to a working state. A journaling file system uses a journal to maintain a list of the file transactions being written to the disk. In the event of a failure, the disk structure remains in a consistent state because it can be rebuilt at boot time (recovered from the journal) and used immediately. On a non-journaling file system, the state of the file system after a failure is not known, and a utility such as scandisk or e2fsck must be used to find and fix errors. Because the journal avoids this problem, there is no need to perform a time-consuming file system scan to verify integrity before the file system can be used. Common journaled file systems include Windows NTFS, Linux ext3, ReiserFS, and JFS. On z/OS, WebSphere MQ provides facilities for taking backup copies of the message data while the system is running. You can use these, in conjunction with the logs, to recover WebSphere MQ in the event of media failure. Taking periodic backups is recommended to reduce the amount of log data that needs to be processed at restart. Finally, to decrease the start-up time of a queue manager that has been failed over, store the queue manager log and queue files on separate disks. This speeds recovery of the queue manager from its logs because the log replay does not contend with the queue files for disk access.
5. Comparison of generic versus specific failover technology
The WebSphere MQ high availability methods described earlier, standby machine with shared disks and HA WebSphere MQ queue manager clusters, both rely on external HA clustering software and hardware to monitor hardware resources, application data, and running processes, and to perform a failover if any of these fail. The alternative is to use a product specific HA approach. These solutions provide an out of the box experience, are usually tailored to each software application, and primarily provide data replication to a specified partner so that failover can occur if the primary instance fails.

You should fully investigate a product specific high availability approach before considering its use in a serious HA implementation. The primary reason is that such software may rely on the synchronization of data between product instances. Data replication of this kind is discussed in the section High availability at the beginning of this paper, which notes that these approaches are not ideal: asynchronous replication can cause duplicated or lost data, and synchronous replication incurs a significant performance cost. Therefore, replicating data in this manner is not a good method for achieving high availability.

Another reason to avoid product specific approaches is that they tend to allow only a single software product to be failed over. An external HA clustering solution, by contrast, can fail over and restart interdependent groups of resources, such as other software applications and hardware resources. For instance, it is possible to fail over WebSphere Business Integration Message Brokers together with WebSphere MQ and DB2 using IBM's HACMP technology. This extensibility is vital when considering the wider scope of high availability for all server applications and hardware resources. An external HA clustering approach also utilizes the available machines on the network more effectively: it can dynamically fail over an application and any other resources to a single backup machine shared by a number of queue managers in the network (often called an N+1 solution), so a standby machine is not required for every active machine in the network.

HA clustering technology must also handle subtle failures, such as an unexpected increase in network latency (so that heartbeats are not received) or the primary machine stalling for a short period because of increased I/O. In either of these situations, the secondary machine may believe its primary peer has failed and begin to take over its work. External HA clustering technologies, such as HACMP, handle these complex situations; product specific technologies may not. The result can be that both the primary and the secondary machine believe they are the primary machine, which leads to a split brain problem and duplicate message processing. You can avoid the split brain situation by using external HA clustering technology, which arbitrates all resources in the network and decides which machines have access to the data. Therefore, in the event of a failure, the HA clustering software can provide
access to the shared resources to the standby machine, which is then regarded by all as the primary machine. To conclude, investigate product specific approaches carefully, because they may not be flexible or expandable enough to meet the much wider demands of a highly available IT infrastructure.
6. Conclusion
This paper discussed approaches for implementing high availability solutions using the WebSphere MQ messaging product. Choosing a solution to achieve a highly available system is based on the HA requirements of that system. For instance, is each message important? Can a trapped message wait a few hours until a machine is restarted, or must it be made available as soon as possible? If the former, a simple clustering approach is enough; the latter requires the use of HA clustering software and hardware. Also, are software applications reliant on specific software or hardware resources? If so, an HA clustering solution is critical, because interdependent groups of resources must be failed over together.

The approaches discussed in this paper for implementing high availability with WebSphere MQ all employ common HA principles, and you should adhere to these principles when implementing any highly available IT system.

The first is the use of a single copy of any data. This makes the data much easier to manage: there are no ambiguities about who owns the real data, and there are no issues in reconciling the data if there is a corruption. When a failover occurs, only one instance of the software has access to the real data, avoiding any confusion. The only exception is a disaster recovery solution that moves copies of critical data off site. In this case, the copy is not used to remove a single point of failure or to provide high availability; instead, if a site-wide failure occurs, the backup is used to restore critical data and to resume services (possibly on another site).

Second, always verify that software which stores persistent state on disk performs synchronous writes, to ensure the data is hardened. Asynchronous writes to a disk can result in software believing the data has been hardened to disk when, in fact, it has not. WebSphere MQ always writes persistent data synchronously to disk to ensure it has been hardened, and is therefore recoverable in the event of a queue manager failure.

Third, implementing redundancy at the hard disk level, to remove the disk as a single point of failure, is a simple step that prevents the loss of critical data if a disk fails. Even though synchronous writes ensure the data has been hardened to disk, a disk failure can still destroy that data. Therefore, implement technologies such as RAID to provide disk level redundancy of data.

Fourth, and often overlooked, implement process controls for the administration of production IT systems. It is often administrative errors that cause outages: improperly tested software updates, incorrect parameter settings, or destructive actions performed by administrators. With proper process controls and security restrictions, you can minimize these errors. Also, HA clustering software provides a single administration view of all machines in an HA cluster, which minimizes administration effort.

Lastly, programming applications to avoid affinities between clients and servers, and to avoid long running units of work, is good practice. The first allows applications to be
failed over to any machine and still continue running. The second allows servers to be restarted quickly, so that they do not have large amounts of outstanding work to process.

We can conclude that implementing high availability using an external HA clustering solution can bring large benefits to an IT infrastructure. It allows groups of resources to be failed over, single copies of data to be maintained, and simpler administration of resources. IBM WebSphere MQ, DB2, WebSphere Application Server, and WebSphere Business Integration Message Broker all support high availability through HA clustering software, and all provide resources to make configuration easier. This approach is considerably more flexible than a product specific solution, and you can expand it well beyond its initial scope.

Ultimately, high availability is a combination of implementing the correct server side infrastructure, avoiding single points of failure wherever they may lie (in hardware or software), and being flexible in the HA approach. The cost of implementing HA can initially seem an expensive undertaking, but you must always balance it against the potential cost of losing IT systems or critical data. External HA clustering software can solve many issues of high availability, but high availability is only a small part of resilient computing. You must also address concepts such as disaster recovery, fault tolerance, scalability, and reliability to provide a 24 by 7 solution that is available 100% of the time.
Appendix A. Available SupportPacs
These SupportPacs are provided free of charge by IBM and assist in the setup and configuration of WebSphere MQ with different HA clustering technologies.
MC41 Configuring WebSphere MQ for iSeries High Availability
http://www-1.ibm.com/support/docview.wss?rs=203&uid=swg24006894&loc=en_US&cs=utf-8&lang=en
MC63 WebSphere MQ for AIX Implementing with HACMP
http://www-1.ibm.com/support/docview.wss?rs=203&uid=swg24006416&loc=en_US&cs=utf-8&lang=en
MC68 Configuring WebSphere MQ with Compaq Trucluster for high
availability
http://www-1.ibm.com/support/docview.wss?rs=203&uid=swg24006383&loc=en_US&cs=utf-8&lang=en
MC69 Configuring WebSphere MQ with Sun Cluster 2.X
http://www-1.ibm.com/support/docview.wss?rs=203&uid=swg24000112&loc=en_US&cs=utf-8&lang=en
MC6A Configuring WebSphere MQ for Sun Solaris with Veritas Cluster
Server
http://www-1.ibm.com/support/docview.wss?rs=203&uid=swg24000678&loc=en_US&cs=utf-8&lang=en
MC6B WebSphere MQ for HP-UX Implementing with Multi
Computer/Service Guard
http://www.ibm.com/support/docview.wss?rs=203&uid=swg24004772&loc=en_US&cs=utf-8&lang=en
About the authors
Mark Hiscock joined IBM in 1999 while studying for his Computer Science degree. He has worked in the Hursley Park Laboratory in the United Kingdom testing IBM's middleware suite of applications, from WebSphere MQ Everyplace to WebSphere Business Integration Message Brokers. He now works as a customer scenarios tester for WebSphere MQ and WebSphere Business Integration Message Brokers, basing his testing on real world customer scenarios. You can reach him at [email protected].
Simon Gormley joined IBM in 2000 as a software engineer and works at the Hursley Park Laboratory in the United Kingdom. He is currently working in the WebSphere MQ and WebSphere Business Integration Brokers test team, focusing on recreating customer scenarios to form the basis of tests. You can reach him at [email protected].