© Copyright International Business Machines Corporation 1987, 2009. US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM ILOG ODM Enterprise V3.3

Building a High Availability ODM Enterprise environment, V1.0
Table of Contents

1 Planning a HA ODM Enterprise environment
  A Overview of ODM Enterprise
  B Protecting the system
    1) Protecting the ability to run jobs
    2) Protecting the ability to manage jobs
  C Sample HA topology
  D IBM middleware protection strategies
  E System failure detection
  F Special considerations for ODM Clients
2 Configuring a HA ODM Enterprise environment
  A Introduction
  B Configuring DB2 HADR
  C Configuring WebSphere Application Server and HTTP Server
    1) Overview
    2) Procedure
  D Configuring the ODM Application
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
  A Workload Management capabilities of ODME 3.3.0.1
    1) Job Control and Administrative requests Workload Management
    2) Job solving Workload Management
  B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA
    1) Operations continuity
    2) Operations recovery
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment
  A Job processor fails to extract OPL binaries upon restart
  B Solve cannot recover after WAS job-processor or odmsolver stops
  C Bad error reporting when Optimization Server loses connection to the Repository DB
  D ODME cannot start when WAS administrative security is enabled
  E ODM solver does not start
1 Planning a HA ODM Enterprise environment

This document describes the points to consider when planning to use the IBM® ILOG ODM Enterprise Optimization Server as part of a highly available solution.
A Overview of ODM Enterprise

IBM ILOG ODM Enterprise offers a platform for optimization-based planning and scheduling applications. One of its main components is the Optimization Server, used by planners to perform computations. At the Optimization Server's core is the Optimization Engine, which performs long-running, computationally expensive optimization "solves". Each solve requires access to Scenario Data, both as the input to the solve and as storage for solve results. A Job Processor is an application which runs on the same server as one or more Optimization Engines and initiates new solve jobs based on the contents of the Jobs database. The Jobs database is populated by the Job Manager, which receives requests from clients to schedule and prioritize solve jobs.
1) Databases

The ODME product uses relational databases: a Jobs database and one or more Scenario databases. The Jobs database stores information on pending jobs. The Scenario database holds data used as input to optimization jobs and the results of previously completed solve operations. Failure of the Jobs database causes the failure of both the Job Manager and the Job Processor. Failure of a Scenario database causes the failure of any jobs using that database.
2) Optimization Engines

The Optimization Engines run as separate processes which wrap invocations of a native solve engine. Each engine retrieves scenario data from a database, executes the optimization solve, and finally writes the result back to the Scenario database. Database connectivity is provided by JDBC database drivers. A new Optimization Engine solver process is created for each new solve. Failure of an Optimization Engine solver process means that the optimization in progress cannot complete.
3) Job Processors

A Job Processor initiates new solve jobs based on the contents of the Jobs database. The Job Processor has a fixed number of solver slots (usually as many as there are physical CPU cores on the host). When the Job Processor has a free Optimization Engine slot, it polls the database to check for new jobs; new jobs are solved by a newly launched Optimization Engine solver process. The Job Processor is also responsible for updating the Jobs database with solve progress and final status. Failure of the Job Processor means that new jobs cannot be picked up for solving and that the progress and results of running jobs are no longer recorded. The Optimization Engine solver process maintains contact with the Job Processor while running jobs: if the Job Processor fails, the Optimization Engine stops. The Job Processor runs as a Java EE application; database connectivity is provided by a JDBC data source. The Job Processor responds to queries from the Job Manager using the Java Messaging Service (JMS). JMS is used only for interacting with running jobs, such as cancelling a job or accepting the current solution, and not for regular solving. Multiple Job Processors can use the same Jobs database and will mark jobs as in-progress as they run them.
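The slot-based polling behaviour described above can be sketched as follows. This is an illustration only, not ODME's actual code: class and method names are hypothetical, and the Jobs database is modelled as an in-memory queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of a Job Processor's polling loop: whenever a solver slot is free,
// poll the Jobs queue and launch a solve for each new job, removing it from
// the queue (analogous to marking it in-progress so other processors skip it).
public class JobProcessorSketch {
    static final int SOLVER_SLOTS = 4;   // typically one per physical CPU core
    int busySlots = 0;

    // poll the shared Jobs "database" (modelled here as a queue) for new jobs
    int pollAndLaunch(Queue<String> jobsDb) {
        int launched = 0;
        while (busySlots < SOLVER_SLOTS && !jobsDb.isEmpty()) {
            String job = jobsDb.poll(); // claim the job (mark it in-progress)
            busySlots++;                // a new Optimization Engine process would start here
            launched++;
        }
        return launched;
    }

    public static void main(String[] args) {
        Queue<String> jobsDb = new ArrayDeque<>();
        for (int i = 0; i < 6; i++) jobsDb.add("job" + i);
        JobProcessorSketch p = new JobProcessorSketch();
        System.out.println(p.pollAndLaunch(jobsDb)); // 4: all slots fill
        System.out.println(jobsDb.size());           // 2 jobs remain pending
    }
}
```

Because each processor claims jobs by removing them from the shared queue, two Job Processors sharing one Jobs database naturally split the pending work between them.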
4) Job Manager

The Job Manager receives requests from clients to schedule and solve jobs. Jobs are stored as records in the Jobs relational database. The Job Manager runs as a Java EE application; connectivity to the Jobs DB is provided by a JDBC XA data source. The Job Manager communicates with the Job Processor to interact with running jobs using the Java Messaging Service (JMS). Clients interact with the Job Manager either through a SOAP/HTTP web service for solve job submission, or through a web-based console for solve queue administration. The Job Manager application holds no state information in memory (all state is in the database), so multiple instances of the application can be used. The Job Manager includes a timer which checks the status of running jobs. If a job has not been active (no reported heartbeat) for 120 seconds (the default value, settable via the JobMonitor's ejb/jobActivityTimeout environment variable), the Job Manager marks the job as failed and available to be restarted. Failure of the Job Manager means that:

• No new jobs can be submitted
• Status of jobs cannot be queried
• Running solve jobs cannot be aborted
• Pending jobs can still be run, as long as other components (Job Processor, DBs) have not failed
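The heartbeat-based failure detection described above amounts to a simple timeout check, sketched below. This is an illustration of the rule (no heartbeat for jobActivityTimeout seconds, default 120), not the actual Job Manager implementation.

```java
// Sketch of the Job Manager's failed-job detection: a running job whose last
// heartbeat is older than the activity timeout is marked failed/restartable.
public class JobMonitorSketch {
    static final long JOB_ACTIVITY_TIMEOUT_MS = 120_000; // default 120 seconds

    static boolean isFailed(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs > JOB_ACTIVITY_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        long now = 1_000_000;
        System.out.println(isFailed(now - 130_000, now)); // true: silent for 130 s
        System.out.println(isFailed(now - 30_000, now));  // false: recent heartbeat
    }
}
```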
B Protecting the system

The following diagram shows the dependencies of ODME components on high-level middleware components.

Figure: Component dependency diagram
These dependencies show the key consequences of the failure of any component. For example, if the application server for the Job Processor fails, then so does the Job Processor application and any associated Optimization Engines. If the database manager for the Jobs DB fails, then the Jobs DB, Job Manager, Job Processor, and Optimization Engine all fail. While two logical application servers are depicted, the solution may be deployed on a single physical application server instance; the same is true for the two logical database managers. This would, of course, mean that a failure could have a greater impact.
1) Protecting the ability to run jobs

Assuming job records are available in the Jobs database, the ability to run optimization solve jobs depends on the Job Processor application (with its Optimization Engines), the Jobs database, and the Scenario database.

From a middleware perspective, the following needs to be protected:
• The application servers used by the Job Processor
• The database servers used for the Jobs database
• The database servers used for the Scenario database
• The physical servers and network used by each of the above
2) Protecting the ability to manage jobs

The ability to manage jobs is provided by the following logical components:

• The Job Manager application
• The Jobs database
From a middleware perspective, the following needs to be protected:
• The HTTP server used by the Job Manager
• The application server used by the Job Manager
• The database server used for the Jobs database
• The physical servers and network used by each of the above
C Sample HA topology

Here is a sample topology which can be used to protect the ODME solution.

This logical topology consists of:
Optimization Servers, each running a Java EE application server with the Job Processor and Job Manager applications installed. Each Optimization Server hosts one or more Optimization Engines. Both Optimization Servers can be in operation at the same time: if one Optimization Server fails, the other continues taking jobs from the Jobs database. Any number of Optimization Servers could theoretically be used.
Database server, hosting both the Jobs and Scenario databases. Keeping multiple copies of the same database active and up-to-date would be very difficult, so instead a passive backup should be kept. This backup needs to be up-to-date and ready to become active if the primary database server fails.
Load balancing server, which can route HTTP traffic to either of the two Job Managers. The load balancer also needs to be backed up. This backup can likewise be passive, ready to become active if the primary load balancer fails.
D IBM middleware protection strategies

There are many ways in which IBM middleware can be protected, including (but not limited to):

• Software
  • WebSphere Application Server Network Deployment allows application servers to be clustered, and manages high availability across Java EE applications.
  • DB2® can be used to keep an up-to-date replica of a database on a separate server, using its High Availability Disaster Recovery (HADR) feature.
  • Tivoli System Automation provides advanced clustering support for managing highly available systems.
• Hardware
  • A PowerHA solution provides a cluster of IBM Power servers using shared disks. If one server fails (either software processes or hardware), the other takes over. PowerHA SystemMirror is available for AIX, Linux, and IBM i.
  • Disk technology such as a Redundant Array of Independent Disks (RAID).
For example, the following describes a software-only topology to enable high availability using WebSphere Application Server Network Deployment and DB2 HADR.

Figure 1: ODME software-only HA topology example

This topology consists of:
Optimization Servers running WebSphere Application Server (WAS). Each Optimization Server runs a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster. Each Optimization Server hosts one or more ODM Optimization Engines. A WAS Service Integration Bus (SIBus) is configured to allow JMS communication between Job Managers and Job Processors; the SIBus uses a single messaging engine with an HADR database store, which WAS will automatically move to the other WAS server in the event of a failure. In case of primary database failure, DB2 HADR will switch to the alternate (standby) database server.
Database servers running DB2 Enterprise Server Edition. Both servers run the same software, with one acting as the primary database server. The primary database server replicates all database updates on the standby server using DB2's High Availability Disaster Recovery (HADR) feature. In addition, Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database.
Load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests to one of the WAS servers. In this configuration, the WAS server is chosen on a round-robin basis.
E System failure detection

A key factor in creating a highly available system is how quickly you can recover from a failure. The solution might be able to cope with the failure of one component, but two or more simultaneous failures may be difficult to survive, so detecting and recovering from failures quickly is critical.
It is important to monitor at many levels. A failure could occur in the ODME application, the hosting application server, the operating system, the physical hardware of the server, or in a network connection.
There are many software solutions for monitoring middleware, such as IBM Tivoli Monitoring.
F Special considerations for ODM Clients

ODM client applications, such as ODM Planner Studio, have direct access to the Scenario database defined in the odmapp deployment settings.
Figure: ODM Enterprise development and deployment overview. The ODM Enterprise IDE produces odmapp files used by ODM Studio (Planner and Reviewer Editions), custom clients and batch files, and the Optimization Server, each with read/write access to the ODM Repository (Scenario DB).
The odmapp files generated with the ODME IDE include their own Scenario database access definitions, which are configured independently from the Optimization Server's Jobs database.
When an odmapp is intended to take advantage of HA recovery of its Scenario DB, its data source definitions must be extended with HA-specific settings that enable switching database operations to the alternate DB instance. This takes HA recovery into account both when the odmapp is used from within Planner Studio and when it is used for solving on an Optimization Server.
2 Configuring a HA ODM Enterprise environment

A Introduction

This document describes a sample configuration of IBM ILOG ODM Enterprise as part of a highly available (HA) solution. There are many ways to provide high availability, using various combinations of specialized hardware and software. This document describes a software-based solution using the following products:
• IBM ILOG ODM Enterprise V3.3.0.1
• IBM WebSphere Application Server Network Deployment V6.1.0.25
• IBM HTTP Server V6.1.0.25
• IBM DB2 Enterprise Server Edition V9.5 Fix Pack 3
This document does not provide an exhaustive step-by-step guide, but instead highlights specific considerations for configuring HA with the products listed above. Links are provided to product documentation, articles, and Redbooks which describe the steps in more detail.
The following configuration steps describe how to configure the sample topology depicted in Figure 1 (ODME software-only HA topology example) above. This topology consists of:
Optimization Servers running WebSphere Application Server (WAS), with each Optimization Server running a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster.
Database servers running DB2 Enterprise Server Edition, with both servers running the same software in an active/passive HADR setup, where the primary database server replicates all database updates to the standby server.
Load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests between the WAS servers on a weighted round-robin basis.
Not represented in the previous topology: ODM client applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR, which recovers from a loss of connection to the DB2 server by rerouting the connection to an alternate server.
B Configuring DB2 HADR

DB2 has a feature called High Availability Disaster Recovery (HADR) which provides a high-performance replication system. A DB2 HADR system consists of two database servers: one active and one standby. Any changes made to the active system are also replicated on the standby system. At any point, an administrator can instruct the standby system to "take over" as the primary; after this happens, the roles of the two systems are swapped.
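The takeover semantics described above can be pictured as a tiny role-swap model. This is purely illustrative (host names are hypothetical) and says nothing about DB2's actual implementation; it only captures the rule that a takeover exchanges the primary and standby roles.

```java
// Minimal model of an HADR pair: takeover swaps the primary and standby roles.
public class HadrPair {
    String primary;
    String standby;

    HadrPair(String primary, String standby) {
        this.primary = primary;
        this.standby = standby;
    }

    // An administrator-initiated takeover exchanges the two roles.
    void takeover() {
        String old = primary;
        primary = standby;
        standby = old;
    }

    public static void main(String[] args) {
        HadrPair pair = new HadrPair("db2host1", "db2host2"); // hypothetical hosts
        pair.takeover();
        System.out.println(pair.primary); // db2host2 is now the primary
    }
}
```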
DB2 HADR requirements

Before installing DB2 with HADR as the ODME application datastore, you need to be aware of these basic requirements for both the primary and standby DB2 servers:
• Identical operating system version and patch level
• The primary and standby server machines should be connected with a high-speed TCP/IP connection, and be reachable by TCP/IP from the client applications
• Identical DB2 version and patch level, software bit size (32-bit or 64-bit), and installation path
DB2 HADR setup

1. Install DB2 UDB Enterprise Server Edition on both the primary and standby machines.
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connection between the DB2 HADR primary and standby machines works well.
2. Start the DB2 servers on both machines, if they are not already running.
3. Create your database and the required tables on the primary machine only. The databases on secondary machines will be cloned from the primary machine. (See the DB2 Information Center for detailed installation information.)
• Optimization Server DB: used to store ODME jobs. Use the scripts provided with ODM Enterprise (typically server/database/db2-create-tables.sql) to create the JOBS database tables. Make a note of the user ID that you use to create the tables, because it is used in the table qualifier and schema.
• Scenario database: used to store ODME scenario data. The database tables themselves are initialized when developing the ODM application using the ODME IDE.
• SIBus database: used by the WAS Service Integration Bus.
Tip: The DB2 logs need to be of a sufficient size, especially for the Scenario database, which receives substantial updates. Make sure you set the database logging to archive logging rather than the default circular logging, because otherwise it will not be possible to enable HADR.
4. Configure HADR for each database from the primary machine, using the Set Up HADR wizard.
Tip: The easiest way to create the databases on the secondary machines is to do it during the HADR setup process, by using the backup method. During HADR setup you may be asked for the peer window parameter; you can leave it at the default value of 0.
C Configuring WebSphere Application Server and HTTP Server

1) Overview

WebSphere Application Server Network Deployment allows multiple servers to be clustered together. Installing a Java EE application into the cluster performs the installation on each cluster member.

The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other. To use JMS in a clustered WAS environment, a service integration bus (SIBus) is used, with each server added as a clustered bus member. In our architecture, only one server needs to host a messaging engine; in the event of a failure of that server, WAS moves the messaging engine to another server. To support this, each WAS server must be able to access the SIBus data store, so in this topology the data store is hosted in a DB2 database.
2) Procedure

The following instructions extend the single-server instructions provided with ODME 3.3.0.1, with a focus on differences specific to a clustered deployment.
a) Install WAS 6.1 Network Deployment, as detailed in the "Roadmap: Installing the Network Deployment product" listed under Useful links below.
Tip: Install the deployment manager node first and start it. For the other nodes used in the optimization cluster, select a "Custom" environment in the profile manager wizard, which will add the new node into the cell. The deployment manager and cluster nodes should be created with security disabled.
b) Start and connect to the deployment manager console; create a new cluster in Servers => Clusters and define cluster members for the nodes created earlier.
c) Create a "DB2 Universal JDBC Driver Provider (XA)" provider at the cluster scope in Resources => JDBC => JDBC Providers, and specify database class path information for the cluster nodes.
d) With this provider, create new JDBC data sources at the cluster scope for each HA database used by the Optimization Server cluster: the Jobs and SIBus databases. Create the data sources with all the settings that pertain to the primary DB2 host; alternate (standby) database definitions will be specified through additional DB connection properties.
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connections between the WebSphere Application Server host and the DB2 HADR primary and standby hosts work well.
The JNDI name to use for the Jobs DB should be OptimizationServerDB, which is the default binding name used in the Optimization Server enterprise modules.
e) Set the custom properties of these JDBC data sources:

• currentSchema: the schema used when creating the DB2 database. By default, this schema is the user ID that you used to create the Jobs DB tables.
• clientRerouteAlternateServerName: standby server name for client reroute. This is the HADR standby host name.
• clientRerouteAlternatePortNumber: standby server port number for client reroute.
• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails. A good default is 2.
• retryIntervalForClientReroute: the amount of time (in seconds) to sleep before retrying. A good default is 15.
• fullyMaterializeInputStreams: set to true.
• progressiveStreaming: disable by setting it to a value of 2. This prevents odmapp unpacking issues.
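As an illustration of how these custom properties map onto DB2 JDBC driver settings, the sketch below assembles the same values programmatically. Host name, port, schema, and credentials are hypothetical; in WAS these are entered as data source custom properties in the admin console, not in code.

```java
import java.util.Properties;

// Sketch: the data source custom properties listed above, expressed as the
// equivalent DB2 JDBC driver connection properties (values are illustrative).
public class ClientRerouteProps {
    public static Properties buildProps() {
        Properties props = new Properties();
        props.setProperty("currentSchema", "ODMUSER"); // hypothetical table-creation user ID
        // DB2 automatic client reroute settings (standby host/port are hypothetical)
        props.setProperty("clientRerouteAlternateServerName", "standbyhost.example.com");
        props.setProperty("clientRerouteAlternatePortNumber", "50000");
        props.setProperty("maxRetriesForClientReroute", "2");
        props.setProperty("retryIntervalForClientReroute", "15");
        // settings recommended above for ODME
        props.setProperty("fullyMaterializeInputStreams", "true");
        props.setProperty("progressiveStreaming", "2"); // 2 = disabled
        return props;
    }

    public static void main(String[] args) {
        Properties p = buildProps();
        // A raw JDBC client would pass these to the driver, e.g.:
        // DriverManager.getConnection("jdbc:db2://primaryhost:50000/JOBS", p);
        System.out.println(p.getProperty("clientRerouteAlternateServerName"));
    }
}
```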
f) Create the SIBus, named OptimizationServerBus, in Service integration => Buses, with no security enabled.
g) Add bus members for OptimizationServerBus at the cluster scope, using the data store for the HA SIBus database created earlier. You may need to specify an authentication alias for the SIBus database connection.
h) Create JMS resources in Resources => JMS at the cluster scope, using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section):

• OptimizationServerTopic, named jms/optimserver/Topic in JNDI
• OptimizationServerTopicConnectionFactory, named jms/optimserver/TopicConnectionFactory in JNDI
• OptimizationServerQueueConnectionFactory, named jms/optimserver/QueueConnectionFactory in JNDI
• OptimizationServerTopicSpec, named jms/optimserver/TopicSpec and pointing to topic jms/optimserver/Topic
i) Deploy optimserver-mgmt-ear and optimserver-processor-ear at the cluster scope.
j) Install IBM HTTP Server 6.1.
Tips:

• Note that the HTTP <port> defined during installation is the one that will be used in the Optimization Server connection URL http://server:<port>/optimserver to deploy your developed ODM Application.
• We recommend not installing the WAS plug-in as part of the IBM HTTP Server installation, but rather launching it as a separate installation afterwards, because this makes configuration easier.
k) Install the Web server plug-ins for IBM WebSphere Application Server V6.1. At the beginning of the plug-in installation, select the check box to view the installation roadmap, then click Next. In this roadmap, identify your installation scenario and follow the installation steps.
l) Start the cluster nodes.

m) Start the cluster in Servers => Clusters.

n) Check that the Optimization Server installation is correct by going to http://server:<port>/optimserver/console.
Useful links:

• "IBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Server"
• "Roadmap: Installing the Network Deployment product"
D Configuring the ODM Application

When the ODM Repository relies on a DB2 HADR environment, the ODM Application configuration must be updated to fully benefit from automatic client reroute.
Automatic client reroute is a DB2 feature that enables a DB2 client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server. This connection rerouting occurs automatically.
To fully support this feature, the alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application:

• clientRerouteAlternateServerName: alternate server name for client reroute
• clientRerouteAlternatePortNumber: alternate port number for client reroute
• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails
• retryIntervalForClientReroute: the amount of time (in seconds) to sleep before retrying
Notes:

• This property list can be extended with other DB2 properties to match your needs. The list is passed to the ODM repository and the underlying JDBC driver. Descriptions of additional properties can be found at http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Network Deployment and DB2 HADR.
This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS + WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in active/standby HADR configuration, and ODME 3.3.0.1.
When deploying on a multi-host cluster, the additional benefits fall into two categories: Workload Management (WLM) and High Availability (HA).
WLM is the ability to spread the processing workload across all cluster members, and is a feature brought by WebSphere's Network Deployment edition.
HA is the ability for the system to continue operating when some of its hardware, network, or software components encounter a failure.
A Workload Management capabilities of ODME 3.3.0.1

When running ODME in a multi-node clustered environment, there are two different types of workload processed by the Optimization Server: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.
1) Job Control and Administrative requests Workload Management

Job control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol, and are workload-managed by the regular IHS + WAS HTTP load balancing scheme.
Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS is round-robin, and applies to all job control activity, whether it originates from ODM Studio or the Solve API.
The Optimization Server Admin console is a stateless web application, and is also load balanced in a round-robin fashion by WAS.
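The round-robin distribution of stateless requests can be sketched in a few lines. This is an illustration of the scheme, not of the IHS/WAS plug-in itself, and the cluster member names are hypothetical.

```java
import java.util.List;

// Minimal round-robin selector, sketching how stateless SOAP/HTTP requests
// are distributed across cluster members in turn.
public class RoundRobin {
    private final List<String> servers;
    private int next = 0;

    RoundRobin(List<String> servers) { this.servers = servers; }

    String pick() {
        String s = servers.get(next);
        next = (next + 1) % servers.size(); // advance and wrap around
        return s;
    }

    public static void main(String[] args) {
        RoundRobin rr = new RoundRobin(List.of("was1", "was2")); // hypothetical members
        System.out.println(rr.pick() + " " + rr.pick() + " " + rr.pick());
        // was1 was2 was1
    }
}
```

Because no session state is held between requests, any member can serve any request, which is what makes this simple rotation safe.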
2) Job solving Workload Management

The solver Optimization Engine processes are long-running, and their run duration may vary greatly across job types. They are managed by the Job Processor independently on each node.
Each Job Processor pulls jobs from the solve-pending queue in a first-in/first-out fashion whenever it has solve slots available. The resulting overall load balancing is a first-come/first-served scheme, where solves are processed across the nodes depending on their capacity.
On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained.
A typical timeline of job control and job solves is illustrated in Figure 2 below. The job submit, enquiry, and control requests from the client are directed to the two instances while jobs are picked up for processing.
Note that in the case depicted here, DB2 HADR is set up in active/standby mode, so only one of the two DB2 nodes will be handling DB requests.
A typical balancing of load is illustrated below. The yellow line represents the queue depth, starting at 500 jobs and being consumed until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots. Overall, both processors handle 2 or 3 jobs each until all are processed. The diagram shows the load for short jobs of even solve durations; the X-axis units are events, not linear time.
Figure 3: Typical balance of load on ODME 3.3.0.1
The irregularities towards the end are due to administrator-triggered cleansing of the processed jobs from the log.
B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA

As detailed in the "Protecting the system" section, running ODME 3.3.0.1 in a clustered environment protects the overall system from the failure of some of its components. This gives the system the ability, on one hand, to continue operating across those failures and, on the other hand, to perform some level of recovery of the processing that was in flight at the point of failure.
1) Operations continuity

For ODME 3.3.0.1, operations continuity is the ability of the Optimization Server to keep displaying the Admin console, keep the capacity to accept new job submissions, and continue processing queued jobs.
Operations continuity across WAS failures

Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails: the surviving cluster member continues processing.
Figure 4: Typical ODME 3.3.0.1 operations continuity timeline across WAS node failure
Note that the Optimization Server Admin console will also continue to be served by the remaining server of the cluster.
Operations continuity across DB2 failures

When DB2 HADR has been set up, and the JOBS DB and odmapp data sources have been configured with appropriate alternate server definitions, the same kind of behavior is observed: the Optimization Server switches to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails, and newly picked-up jobs will likewise use the rerouted database connections.
2) Operations recovery

ODME 3.3.0.1 offers some level of recovery of in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving results in the solve being aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on the way the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.
Recovery of failed-and-recoverable jobs is based on the Optimization Server's built-in failed-job detection, which detects a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, marks those jobs as recoverable, and requeues them so that they are solved again.
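The detect-mark-requeue cycle described above can be sketched as follows. This is an illustration only (names are hypothetical), with the in-process and pending job sets modelled as simple queues.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of failed-and-recoverable handling: an in-process job whose solver
// heartbeat has timed out is pulled from the in-process set, marked
// recoverable, and placed back on the solve-pending queue.
public class JobRecoverySketch {
    static String recoverTimedOut(Queue<String> inProcess, Queue<String> pending) {
        String job = inProcess.poll(); // job registered as in process, heartbeat lost
        pending.add(job);              // requeued so that it is solved again
        return job;
    }

    public static void main(String[] args) {
        Queue<String> inProcess = new ArrayDeque<>();
        Queue<String> pending = new ArrayDeque<>();
        inProcess.add("jobA"); // heartbeat for jobA has timed out
        System.out.println(recoverTimedOut(inProcess, pending)); // jobA
    }
}
```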
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment

There are a number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.

Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but become more prevalent when seamless continuous operations and failure recovery are expected.

Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.
A Job processor fails to extract OPL binaries upon restart

Symptoms:

• The optimserver-processor-ear enterprise application is not started on the server, although optimserver-mgmt-ear is running.
• Queued jobs are not processed (they remain in the NOT_STARTED state).
• Only one of the cluster members runs jobs, although the queue is saturated.
• SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmoplodm/bin/power64_aix53_70/libcplex121.a (Text file busy)
Explanation:

The OPL binaries are cached and locked for direct writing by the AIX operating system. The Job Processor EAR module is thus not allowed to extract them again, and fails during its initialization.
Remediation:

Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmoplodm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.

To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), change the mode of the files in the above directory to 750 (instead of the default 755), right after restarting and before any solver instance is started:

chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmoplodm/bin/power64_aix53_70

This forces AIX not to cache the files.
19
B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a solution-found message appears in ODM Studio.

Explanation

In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, while the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve it will fail, because the scenario has released its solve lock.

Remediation

The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.
C Bad error reporting when Optimization Server loses connection to the Repository DB

Symptoms

The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException, params=] when the connection to the JOBS DB is lost.

Explanation

The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.

Remediation

This error is transient: refresh the Optimization Server Admin console after the JOBS DB has recovered.
D ODME cannot start when WAS administrative security is enabled

Symptoms

Although WAS with administrative security is not currently supported by ODME 3.3.0.1, deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled.

This results in an exception being raised during startup of Optimization Server, reported in the SystemOut.log.

Explanation

The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.

Remediation

WAS administrative security may be turned on, but write access to JNDI must then be granted to the everyone group. This is achieved using the WAS Admin console, in the Environment > Naming > CORBA Naming Service Groups section: the group EVERYONE has to be added with Cos Naming Read, Write, Create and Delete authorization.
E ODM solver does not start

Symptoms

Solve jobs all end up in FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running.

Remediation

Run redist\vcredist\vcredist_x86.exe from the ODM Enterprise Developer Edition redist directory on all machines where Optimization Server will execute ODM solve jobs.
Table of Contents

1 Planning a HA ODM Enterprise environment
  A Overview of ODM Enterprise
  B Protecting the system
    1) Protecting the ability to run jobs
    2) Protecting the ability to manage jobs
  C Sample HA topology
  D IBM middleware protection strategies
  E System failure detection
  F Special considerations for ODM Clients
2 Configuring a HA ODM Enterprise environment
  A Introduction
  B Configuring DB2 HADR
  C Configuring WebSphere Application Server and HTTP Server
    1) Overview
    2) Procedure
  D Configuring the ODM Application
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
  A Workload Management capabilities of ODME 3.3.0.1
    1) Job Control and Administrative requests Workload Management
    2) Job solving Workload Management
  B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA
    1) Operations continuity
    2) Operations recovery
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment
  A Job processor fails to extract OPL binaries upon restart
  B Solve cannot recover after WAS job-processor or odmsolver stops
  C Bad error reporting when Optimization Server loses connection to the Repository DB
  D ODME cannot start when WAS administrative security is enabled
  E ODM solver does not start
1 Planning a HA ODM Enterprise environment

This document describes the points to consider when planning to use the IBM® ILOG ODM Enterprise Optimization Server as part of a highly available solution.
A Overview of ODM Enterprise

IBM ILOG ODM Enterprise offers a platform for optimization-based planning and scheduling applications. One of its main components is the Optimization Server, used by planners to perform computations. At the Optimization Server's core is the Optimization Engine, which performs long-running, computationally expensive optimization "solves". Each solve requires access to Scenario Data, both as the input to the solve and as storage for solve results. A Job Processor is an application which runs on the same server as one or more Optimization Engines and initiates new solve jobs based on the contents of the Jobs database. The Jobs database is populated by the Job Manager, which receives requests from clients to schedule and prioritize solve jobs.
1) Databases

The ODME product uses relational databases: a Jobs database and one or more Scenario databases. The Jobs database stores information on pending jobs. The Scenario database holds the data used as input to optimization jobs, together with the results of previously completed solve operations. Failure of the Jobs database causes the failure of both the Job Manager and the Job Processor. Failure of a Scenario database causes the failure of any jobs using that database.
2) Optimization Engines

The Optimization Engines run as separate processes which wrap invocations of a native solve engine. Each engine retrieves scenario data from a database, executes the optimization solve, and finally writes the result back to the Scenario database. Database connectivity is provided by JDBC database drivers. A new instance of an Optimization Engine solver process is created for each new solve. Failure of an Optimization Engine solver process means that the optimization in progress cannot complete.
3) Job Processors

A Job Processor initiates new solve jobs based on the contents of the Jobs database. The Job Processor has a fixed number of solver slots (usually as many as there are physical CPU cores on the host). When the Job Processor has an Optimization Engine slot that is ready, it polls the database to check for new jobs. New jobs are solved by a newly launched Optimization Engine solve process. The Job Processor is also responsible for updating the Jobs database with solve progress and final status. Failure of the Job Processor means that new jobs cannot be picked up for solving, and that progress and results of completed jobs cannot be recorded. The Optimization Engine solver process maintains contact with the Job Processor while running jobs; if the Job Processor fails, then the Optimization Engine will stop. The Job Processor runs as a Java EE application. Database connectivity is provided by a JDBC data source. The Job Processor responds to queries from the Job Manager using the Java Messaging Service (JMS). JMS is used only for interacting with jobs, such as to cancel or accept the current solution, and not for regular solving. Multiple Job Processors can use the same Jobs database and will mark jobs as in-progress as they run them.
4) Job Manager

The Job Manager receives requests from clients to schedule and solve jobs. Jobs are stored as records in a Jobs relational database. The Job Manager runs as a Java EE application. Database connectivity to the Jobs DB is provided by a JDBC XA data source. The Job Manager communicates with the Job Processor to interact with running jobs using the Java Messaging Service (JMS). Clients interact with the Job Manager either through a SOAP/HTTP web service for solve job submission, or through a web-based console for solve queue administration. The Job Manager application holds no state information in memory (all state is in the database), so multiple instances of the application can be used. The Job Manager includes a timer which checks the status of running jobs. If a job has not been active (no reported heartbeat) for 120 seconds (a default value, settable in the JobMonitor's ejb/jobActivityTimeout environment variable), the Job Manager will mark the job as failed and available to be restarted. Failure of the Job Manager means that:

• No new jobs can be submitted.
• The status of jobs cannot be queried.
• Running solve jobs cannot be aborted.
• Pending jobs can still be run, as long as the other components (Job Processor, DBs) have not failed.
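The timer-based failure detection described above can be sketched as follows. This is a minimal illustration only: the function and field names are hypothetical, and the real Job Manager runs this logic inside a Java EE timer against the Jobs database rather than in-memory records.

```python
JOB_ACTIVITY_TIMEOUT = 120  # seconds; cf. the ejb/jobActivityTimeout default

def check_running_jobs(jobs, now, timeout=JOB_ACTIVITY_TIMEOUT):
    """Mark running jobs with a stale heartbeat as failed and recoverable.

    `jobs` is a list of dicts standing in for Jobs DB records; returns the
    ids of the jobs that were marked available to be restarted.
    """
    recovered = []
    for job in jobs:
        if job["state"] == "RUNNING" and now - job["last_heartbeat"] > timeout:
            job["state"] = "FAILED_RECOVERABLE"  # available to be restarted
            recovered.append(job["id"])
    return recovered

jobs = [
    {"id": "A", "state": "RUNNING", "last_heartbeat": 1000.0},
    {"id": "B", "state": "RUNNING", "last_heartbeat": 1110.0},
]
print(check_running_jobs(jobs, now=1130.0))  # job A is 130 s stale -> ['A']
```

Job B, whose heartbeat is only 20 seconds old, is left untouched; only jobs past the timeout are requeued.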
B Protecting the system

The following diagram shows the dependencies of ODME components on high-level middleware components.

Figure: Component dependency diagram

These dependencies show the key consequences of the failure of any component. For example, if the application server for the Job Processor fails, then so does the Job Processor application and any associated Optimization Engines. If the database manager for the Jobs DB fails, then the Jobs DB, Job Manager, Job Processor and Optimization Engine all fail. While two logical application servers are depicted, the solution may be deployed on a single physical application server instance; the same is true for the two logical database managers. This would, of course, mean that a failure could have a greater impact.
1) Protecting the ability to run jobs

Assuming job records are available in the Jobs database, the ability to run optimization solve jobs is based on the Job Processor applications (with their Optimization Engines), the Jobs database and the Scenario database.

From a middleware perspective, the following need to be protected:
• The application servers used by the Job Processor
• The database servers used for the Jobs database
• The database servers used for the Scenario database
• The physical servers and network used by each of the above

2) Protecting the ability to manage jobs

The ability to manage jobs is provided by the following logical components:

• Job Manager application
• Jobs database

From a middleware perspective, the following need to be protected:
• The HTTP server used by the Job Manager
• The application server used by the Job Manager
• The database server used for the Jobs database
• The physical servers and network used by each of the above
C Sample HA topology

Here is a sample topology which can be used to protect the ODME solution.

This logical topology consists of:

• Optimization Servers, each running a Java EE application server with the Job Processor and Job Manager applications installed. Each Optimization Server will host one or more Optimization Engines. Both Optimization Servers can be in operation at the same time; if one Optimization Server fails, the other will continue taking jobs from the Jobs database. Any number of Optimization Servers could theoretically be used.

• A database server hosting both the Jobs and Scenario databases. Keeping multiple copies of the same database active and up to date could be very difficult, so a passive backup should be kept instead. This backup needs to be up to date and ready to become active if the primary database server fails.

• A load balancing server which can route HTTP traffic to either of the two Job Managers. The load balancer also needs to be backed up. This backup can also be passive, ready to become active if the primary load balancer fails.
D IBM middleware protection strategies

There are many ways in which IBM middleware can be protected, including (but not limited to):

• Software
  • WebSphere Application Server Network Deployment allows application servers to be clustered, and manages high availability across Java EE applications.
  • DB2® can be used to keep an up-to-date replica of a database on a separate server, using its High Availability Disaster Recovery (HADR) feature.
  • Tivoli System Automation provides advanced clustering support for managing highly available systems.
• Hardware
  • A PowerHA solution provides a cluster of IBM Power servers using shared disks. If one server (either software processes or hardware) fails, the other takes over. PowerHA SystemMirror is available for AIX, Linux and IBM i.
  • Disk technology such as a Redundant Array of Independent Disks (RAID).
For example, the following describes a software-only topology to enable high availability using WebSphere Application Server Network Deployment and DB2 HADR.

Figure 1: ODME Software-only HA Topology example

This topology consists of:

• Optimization Servers running WebSphere Application Server (WAS). Each Optimization Server runs a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster. Each Optimization Server hosts one or more ODM Optimization Engines. A WAS Service Integration Bus (SIBus) is configured to allow the JMS communication between Job Managers and Job Processors; the SIBus uses a single messaging engine with an HADR database store, which WAS will automatically move to the other WAS server in the event of a failure. In case of primary database failure, DB2 HADR will switch to the alternate standby database server.

• Database servers running DB2 Enterprise Server Edition. Both servers run the same software, with one acting as the primary database server. The primary database server replicates all database updates on the standby server using DB2's High Availability Disaster Recovery (HADR) feature. In addition, Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database.

• A load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests to one of the WAS servers. In this configuration the WAS server is chosen on a round-robin basis.
E System failure detection

A key factor in creating a highly available system is how quickly you can recover from a failure. The solution might be able to cope with the failure of one component, but coping with two or more may be difficult, so detecting and recovering from failures quickly is critical.

It is important to monitor at many levels: a failure could occur in the ODME application, the hosting application server, the operating system, the physical hardware of the server, or a network connection.

There are many software solutions for monitoring middleware, such as IBM Tivoli Monitoring.
F Special considerations for ODM Clients

ODM client applications, such as ODM Planner Studio, have direct access to the Scenario database defined in the .odmapp deployment settings.

[Figure: ODM application development and deployment. The ODM Enterprise IDE, used by IT developers with the Java Development Tools, OPL Studio and the ODM Editors, produces the ODM application configuration (.odmapp). At deployment time, the .odmapp is used by ODM Studio (Planner and Reviewer Editions), by the Optimization Server, and by custom clients and batch files, all of which read and write the ODM Repository (SCENARIO DB); solves are delegated to the Optimization Server.]
The .odmapp files generated with the ODME IDE include their own Scenario database access definitions, which are configured independently from the Optimization Server JOBS DB.

When an .odmapp is intended to take advantage of HA recovery of its Scenario DB, its Data Source definitions must be enhanced with HA-specific settings that enable switching database operations to the alternate DB instance. This takes HA recovery into account both when the application is used from within Planner Studio and when it is used for solving on an Optimization Server.
2 Configuring a HA ODM Enterprise environment
A Introduction

This document describes a sample configuration of IBM ILOG ODM Enterprise as part of a highly available (HA) solution. There are many ways to provide high availability, using various combinations of specialized hardware and software. This document describes a software-based solution using the following products:

• IBM ILOG ODM Enterprise V3.3.0.1
• IBM WebSphere Application Server Network Deployment V6.1.0.25
• IBM HTTP Server V6.1.0.25
• IBM DB2 Enterprise Server Edition V9.5 Fix Pack 3

This document does not provide an exhaustive step-by-step guide, but instead highlights specific considerations for configuring HA with the products listed above. Links are provided to product documentation, articles and Redbooks which describe the steps in more detail.
The following configuration steps describe how to configure the sample topology depicted in Figure 1 (ODME Software-only HA Topology example) above. This topology consists of:

• Optimization Servers running WebSphere Application Server (WAS), each Optimization Server running a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster.

• Database servers running DB2 Enterprise Server Edition, both servers running the same software in an active/passive HADR setup, where the primary database server replicates all database updates to the standby server.

• A load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests between the WAS servers on a weighted round-robin basis.

Not represented in the previous topology: ODM Client Applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR, which recovers from a loss of connection to the DB2 server by rerouting the connection to an alternate server.
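The weighted round-robin routing mentioned above can be illustrated with a small sketch. This is an approximation only: the actual WAS plug-in algorithm also accounts for session affinity and marked-down servers, and the server names and weights here are placeholders.

```python
from itertools import cycle

def weighted_round_robin(servers):
    """Yield server names in proportion to their configured weights.

    `servers` is a list of (name, weight) pairs, loosely analogous to the
    LoadBalanceWeight values in the plug-in configuration (illustrative).
    """
    expanded = [name for name, weight in servers for _ in range(weight)]
    return cycle(expanded)

rr = weighted_round_robin([("was1", 2), ("was2", 1)])
print([next(rr) for _ in range(6)])
# ['was1', 'was1', 'was2', 'was1', 'was1', 'was2']
```

With equal weights this degenerates to the plain round-robin scheme used for the Job Manager HTTP traffic.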
B Configuring DB2 HADR

DB2 has a feature called High Availability Disaster Recovery (HADR), which provides a high performance replication system. A DB2 HADR system consists of two database servers, one active and one standby. Any changes made to the active system are also replicated on the standby system. At any point an administrator can instruct the standby system to "take over" as the primary; after this happens, the roles of the two systems are swapped.

DB2 HADR Requirements

Before installing DB2 with HADR as the ODME application datastore, you need to be aware of these basic requirements for both the primary and standby DB2 servers:

• Identical operating system version and patch level
• The primary and standby server machines should be connected with a high-speed TCP/IP connection and be reachable by TCP/IP from the client application
• Identical DB2 version and patch level, software bit size (32-bit or 64-bit) and installation path
DB2 HADR Setup

1. Install DB2 UDB Enterprise Server Edition on both the primary and standby machines.

Tip: Before testing the DB2 HADR takeover behaviour, verify that the connection between the DB2 HADR primary and standby machines works well.

2. Start the DB2 servers on both machines if they are not already running.

3. Create your database and the required tables on the primary machine only. The databases on secondary machines will be cloned from the primary machine. (See the DB2 Information Center for detailed installation information.)

• Optimization Server DB – used to store ODME jobs. Use the scripts provided with ODM Enterprise (typically server/database/db2-create-tables.sql) to create the JOBS database tables. Make a note of the user ID that you use to create the tables, because it is used in the table qualifier and schema.
• Scenario database – used to store ODME scenario data. The database tables themselves will be initialized when developing the ODM application using the ODME IDE.
• SIBus database – used by the WAS Service Integration Bus.

Tip: The DB2 logs need to be of a sufficient size, especially for the scenario database, in which there are large updates. Make sure you set the database logging to archive logging rather than the default circular logging, because otherwise it will not be possible to enable HADR.

4. Configure HADR for each database from the primary machine, using the Set Up HADR wizard.

Tip: The easiest way to create the databases on the secondary machines is to do it during the HADR setup process by using the backup method. During HADR setup you may be asked for the peer window parameter; you can leave it at the default value of 0.
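The primary/standby replication and takeover behaviour configured above can be pictured with a toy model. This is illustrative only: a real takeover is issued with DB2's TAKEOVER HADR command on the standby, and replication ships log records, not dictionary writes.

```python
class HadrPair:
    """Toy model of a DB2 HADR pair: every update on the primary is
    replayed on the standby, and a takeover swaps the two roles."""

    def __init__(self, primary, standby):
        self.roles = {"primary": primary, "standby": standby}
        # One key/value store per server, standing in for the databases.
        self.data = {primary: {}, standby: {}}

    def update(self, key, value):
        # Writes go to the primary and are shipped to the standby,
        # so both copies stay identical.
        for copy in self.data.values():
            copy[key] = value

    def takeover(self):
        # Issued on the standby (e.g. after the primary host fails):
        # the roles of the two systems are swapped.
        r = self.roles
        r["primary"], r["standby"] = r["standby"], r["primary"]

pair = HadrPair("db2a", "db2b")
pair.update("job:1", "QUEUED")
pair.takeover()
print(pair.roles["primary"], pair.data["db2b"])
# db2b {'job:1': 'QUEUED'}
```

The point of the model is that after takeover the former standby already holds every committed update, which is why clients can be rerouted to it transparently.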
C Configuring WebSphere Application Server and HTTP Server

1) Overview

WebSphere Application Server Network Deployment allows multiple servers to be clustered together. Installing a Java EE application into the cluster performs the installation on each cluster member.

The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other. To use JMS in a clustered WAS environment, a service integration bus (SIBus) is used, with each server added as a clustered bus member. In our architecture only one server needs to host a messaging engine; in the event of a failure on that server, WAS will move the messaging engine to another server. To support this, each WAS server must be able to access the SIBus data store, so in this topology the data store will be hosted in a DB2 database.

2) Procedure

The following instructions extend the single-server instructions provided with ODME 3.3.0.1, with a focus on differences specific to a clustered deployment.
a) Install WAS 6.1 Network Deployment, as detailed in the "Roadmap: Installing the Network Deployment product" documentation.

Tip: Install the deployment manager node first and start it. For the other nodes used in the optimization cluster, select a "Custom" environment in the profile manager wizard, which will add the new node into the cell. The deployment manager and cluster nodes should be created with security disabled.

b) Start and connect to the deployment manager console, create a new cluster in Servers > Clusters, and define cluster members for the nodes created earlier.

c) Create a "DB2 Universal JDBC Driver Provider (XA)" provider at the cluster scope in Resources > JDBC > JDBC Providers, and specify database class path information for the cluster nodes.

d) With this provider, create new JDBC data sources at the cluster scope for each HA database used by the optimization server cluster: the Jobs and the SIBus databases. Create the data sources with all the settings that pertain to the primary DB2 host; alternate (standby) database definitions will be specified through additional DB connection properties.

Tip: Before testing the DB2 HADR takeover behaviour, verify that the connections between the WebSphere Application Server host and the DB2 HADR primary and standby hosts work well.

The JNDI name to use for the Jobs DB should be OptimizationServerDB, which is the default binding name used in the Optimization Server enterprise modules.
e) Set the custom properties of these JDBC data sources:

• currentSchema – the schema used when creating the DB2 database. This schema is by default the user ID that you used to create the Jobs DB tables.
• clientRerouteAlternateServerName – standby server name for client reroute. This is the HADR standby host name.
• clientRerouteAlternatePortNumber – standby server port number for client reroute.
• maxRetriesForClientReroute – limits the number of retries if the primary connection to the server fails. A good default can be 2.
• retryIntervalForClientReroute – amount of time (in seconds) to sleep before retrying again. A good default can be 15.
• fullyMaterializeInputStreams – set to true.
• progressiveStreaming – disable by setting a value of 2. This prevents odmapp unpacking issues.
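Collected in one place, the custom properties above look like the sketch below. The host name, port and schema are placeholders; in WAS these values are entered one by one in the Admin console (or scripted through wsadmin), not built as a Python dict.

```python
def reroute_datasource_properties(schema, standby_host, standby_port):
    """Return the custom properties for an HADR-aware JDBC data source.

    Values mirror the list in step e); the defaults for the retry
    properties are the suggested values, not mandated ones.
    """
    return {
        "currentSchema": schema,  # user ID that created the Jobs DB tables
        "clientRerouteAlternateServerName": standby_host,
        "clientRerouteAlternatePortNumber": standby_port,
        "maxRetriesForClientReroute": 2,      # suggested default
        "retryIntervalForClientReroute": 15,  # seconds, suggested default
        "fullyMaterializeInputStreams": "true",
        "progressiveStreaming": 2,  # 2 = disabled; avoids odmapp unpacking issues
    }

props = reroute_datasource_properties("ODMUSER", "db2standby.example.com", 50001)
print(props["clientRerouteAlternateServerName"])  # db2standby.example.com
```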
f) Create the SIBus named OptimizationServerBus in Service integration > Buses, with no security enabled.

g) Set bus members for OptimizationServerBus at the cluster scope, and use the data store for the HA SIBus database created earlier. You may need to specify an authentication alias for the SIBus database connection.

h) Create the JMS resources in Resources > JMS at the cluster scope, using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section):

• OptimizationServerTopic, named jms/optimserverTopic in JNDI
• OptimizationServerTopicConnectionFactory, named jms/optimserverTopicConnectionFactory in JNDI
• OptimizationServerQueueConnectionFactory, named jms/optimserverQueueConnectionFactory in JNDI
• OptimizationServerTopicSpec, named jms/optimserverTopicSpec and pointing to topic jms/optimserverTopic

i) Deploy optimserver-mgmt-ear and optimserver-processor-ear at the cluster scope.

j) Install IBM HTTP Server 6.1.
Tip:

• Note that the HTTP <port> defined during install is the one that will be used in the Optimization Server connection URL http://<server>:<port>/optimserver to deploy your developed ODM Application.
• We recommend not installing the WAS plug-in as part of the IBM HTTP Server install, but launching it as a separate installation afterwards, because it makes configuration easier.

k) Install the Web server plug-ins for IBM WebSphere Application Server V6.1. At the beginning of the plug-in installation, select the check box to view the installation roadmap, then click Next. In this roadmap, identify your installation scenario and follow the installation steps.

l) Start the cluster nodes.

m) Start the cluster in Servers > Clusters.

n) Check that the Optimization Server installation is correct by going to http://<server>:<port>/optimserver/console.
Useful links:

• "IBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Server"
• "Roadmap: Installing the Network Deployment product"
D Configuring the ODM Application

When the ODM Repository relies on a DB2 HADR environment, the ODM Application configuration must be updated to fully benefit from automatic client reroute.

Automatic client reroute is a DB2 feature that enables a DB2 client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server. This rerouting occurs automatically.
To fully support this feature, the alternate server name and port should be specified with additional repository properties in the deployment settings file (.odmsds) of your ODM Application. Example:

• clientRerouteAlternateServerName – alternate server names for client reroute
• clientRerouteAlternatePortNumber – alternate port numbers for client reroute
• maxRetriesForClientReroute – limits the number of retries if the primary connection to the server fails
• retryIntervalForClientReroute – amount of time (in seconds) to sleep before retrying again

Note: this property list can be extended with other DB2 properties to match your needs. The list is passed on to the ODM repository and the underlying JDBC driver. Descriptions of additional properties can be found at http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html
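The two retry properties together bound how long a client sleeps before the reroute attempt is abandoned. A quick, simplified calculation (the actual DB2 driver also alternates attempts between the primary and the alternate server, so this is only the sleep-time component):

```python
def worst_case_reroute_sleep(max_retries, retry_interval):
    """Upper bound, in seconds, on time spent sleeping between reroute
    attempts before the driver gives up (simplified view)."""
    return max_retries * retry_interval

# With the suggested defaults (2 retries, 15 s apart), a client sleeps
# at most about 30 s before the reroute attempt is abandoned.
print(worst_case_reroute_sleep(2, 15))  # 30
```

Tuning these values is a trade-off: larger values ride out longer takeovers, smaller values surface hard failures to the planner sooner.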
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Network Deployment and DB2 HADR.

This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS + WAS ND 6.1.0.25 cluster, DB2 9.5 FP1 in an active/standby HADR configuration, and ODME 3.3.0.1.

When deploying on a multi-host cluster, the additional benefits fall into two categories: Workload Management (WLM) and High Availability (HA).

WLM is the ability to spread the processing workload across all cluster members, and is a feature brought by WebSphere's Network Deployment version.

HA is the ability for the system to continue operating when some of its hardware, network or software components encounter a failure.
A Workload Management capabilities of ODME 3.3.0.1

When running ODME in a multi-node clustered environment, two different types of workload are processed by the Optimization Server: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.

1) Job Control and Administrative requests Workload Management

Job control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol, and are workload managed by the regular IHS + WAS HTTP load balancing scheme.

Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS will be round-robin, and will apply to all job control activity, whether it originates from ODM Studio or the Solve API.

The Optimization Server Admin console is a stateless web application and will also be load balanced in a round-robin fashion by WAS.
2) Job solving Workload Management

The solver Optimization Engine processes are long running, and their run duration may vary a lot across job types. They are managed by the Job Processor independently on each node.

Each Job Processor will pull jobs from the solve-pending queue in a first-in/first-out fashion whenever there are solve slots available. The resulting overall load balancing is a first-come/first-served scheme, where solves are processed across the nodes depending on their capacity.

On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained.
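The draining behavior described above can be simulated with a toy model. Slot counts and job ordering are illustrative; the real Job Processors pull asynchronously from the Jobs database rather than in lockstep rounds.

```python
from collections import deque

def drain_queue(jobs, processors):
    """Simulate Job Processors pulling queued jobs FIFO into free slots.

    `processors` maps a node name to its number of solve slots (typically
    the number of physical CPU cores). Each simulated round, every node
    with free slots pulls the oldest pending jobs. Returns a map of
    node -> list of job ids that node processed.
    """
    queue = deque(jobs)
    ran = {node: [] for node in processors}
    while queue:
        for node, slots in processors.items():
            for _ in range(slots):
                if not queue:
                    break
                ran[node].append(queue.popleft())
    return ran

# Saturated queue: both nodes share the work.
print(drain_queue(range(6), {"node1": 2, "node2": 2}))
# {'node1': [0, 1, 4, 5], 'node2': [2, 3]}

# Light load: only the first node ever appears active.
print(drain_queue(range(1), {"node1": 2, "node2": 2}))
# {'node1': [0], 'node2': []}
```

The two runs reproduce the observation in the text: under light load a single node absorbs everything, and only a saturated queue spreads solves across the cluster.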
A typical timeline of job control and job solves is illustrated below: the job submit, enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing.

Note that in the case depicted here, DB2 HADR is set up in Active/Standby, so only one of the two DB2 nodes will be handling DB requests.

A typical balancing of load is illustrated in Figure 3 below. The yellow line represents the queue depth, starting at 500 jobs and draining until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots. Overall, both processors will be handling 2 or 3 jobs until all are processed. The diagram shows load for short jobs of even solve durations; the X axis units are events, not linear time.
[Chart: queue depth (starting at 500 jobs and draining to 0) and the number of running solves on server1 and server2 (0 to 7) over time.]

Figure 3: Typical balance of load on ODME 3.3.0.1
The irregularities towards the end are due to some administrator-triggered cleansing of the processed jobs from the log.

B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA

As detailed in the "Protecting the system" section, running ODME 3.3.0.1 in a clustered environment protects the overall system from the failure of some of its components. On one hand, the system can continue operating across those failures; on the other hand, it can perform some level of recovery on the processing that was in flight at the point of failure.
1) Operations continuity

For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to keep displaying the Admin console, keep the capacity to accept new job submissions, and continue processing queued jobs.

Operations continuity across WAS failures

When one of the WAS cluster members is stopped or otherwise fails, the surviving cluster member will continue processing, as illustrated in the typical timeline of Figure 4.
Figure 4: Typical ODME 3.3.0.1 Operations Continuity timeline across a WAS node failure

[Timeline: clients submit jobs A and B over SOAP/HTTP through the IHS WLM plug-in. WAS 1 (mgmt) creates job A, which is stored in the JOBS DB and solved by a processor while status is read back; when WAS 1 is stopped, WAS 2 creates and solves job B, answers the job A status queries, and reports progress, while the DB2 standby remains hot.]
Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster.

Operations continuity across DB2 failures

When DB2 HADR has been set up, and the JOBS DB and odmapp data sources have been configured with the appropriate alternate server definitions, the same kind of behavior will be observed: the Optimization Server will switch to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails. Newly picked-up jobs will then run against the alternate database instance.
2) Operations recovery

ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving will result in the solve being aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on the way the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.

Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-job detection, which detects a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, marks the jobs as recoverable, and requeues them so that they are solved again.
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment
There are a number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.
Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but become more prevalent when seamless continuous operation and failure recovery are expected.
Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.
A Job processor fails to extract OPL binaries upon restart
Symptoms
• The optimserver-processor-ear enterprise application is not started on the server, although optimserver-mgmt-ear is running.
• Queued jobs are not processed (they remain in the NOT_STARTED state).
• Only one of the cluster members runs jobs, although the queue is saturated.
• SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70/libcplex121.a (Text file busy)
Explanation
The OPL binaries are cached and locked against direct writing by the AIX operating system. The job processor EAR module is therefore not allowed to extract them again, and fails during its initialization.
Remediation
Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.
To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), change the mode of the files in the above directory to 750 (instead of the default 755), right after restarting and before any solver instance is started: chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70/*. This forces AIX not to cache the files.
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms
When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.
Explanation
In some circumstances, the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, while the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock.
Remediation
The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.
C Bad error reporting when the Optimization Server loses its connection to the Repository DB
Symptoms
The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException, params=] when the connection to the JOBS DB is lost.
Explanation
The JOBS DB connection is lost, and the Optimization Server Admin console cannot retrieve the jobs queue status for display.
Remediation
This error is transient: refresh the Optimization Server Admin console after the JOBS DB has recovered.
D ODME cannot start when WAS administrative security is enabled
Symptoms
Although WAS with administrative security enabled is not currently supported by ODME 3.3.0.1, deployers of the Optimization Server in a clustered WAS environment may need to deploy it with security enabled.
This results in an exception being raised during startup of the Optimization Server, reported in SystemOut.log.
Explanation
The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.
Remediation
WAS administrative security may be turned on, but write access to JNDI must then be granted to the EVERYONE group. This is achieved using the WAS Admin console, in the Environment > Naming > CORBA Naming Service Groups section: the group EVERYONE has to be added with the Cos Naming Read, Write, Create and Delete authorizations.
E ODM solver does not start
Symptoms
Solve jobs all end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.
Explanation
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.
Remediation
Run vcredist_x86.exe from the redist\vcredist directory of the ODM Enterprise Developer Edition on all machines where the Optimization Server will execute ODM solve jobs.
1 Planning a HA ODM Enterprise environment
This document describes the points to consider when planning to use the IBM® ILOG ODM Enterprise Optimization Server as part of a highly available solution.
A Overview of ODM Enterprise
IBM ILOG ODM Enterprise offers a platform for optimization-based planning and scheduling applications. One of its main components is the Optimization Server, used by planners to perform computations. At the Optimization Server's core is the Optimization Engine, which performs long-running, computationally expensive optimization "solves". Each solve requires access to Scenario Data, both as the input to the solve and as storage for solve results. A Job Processor is an application which runs on the same server as one or more Optimization Engines and initiates new solve jobs based on the contents of the Jobs database. The Jobs database is populated by the Job Manager. The Job Manager receives requests from clients to schedule and prioritize solve jobs.
1) Databases
The ODME product uses relational databases: a Jobs database and one or more Scenario databases. The Jobs database stores information on pending jobs. The Scenario database holds the data used as input to optimization jobs, and the results of previously completed solve operations. Failure of the Jobs database causes the failure of both the Job Manager and the Job Processor. Failure of a Scenario database causes the failure of any jobs using that database.
2) Optimization Engines
The Optimization Engines run as separate processes which wrap invocations of a native solve engine. Each engine retrieves scenario data from a database, executes the optimization solve, and finally writes the result back to the Scenario database. Database connectivity is provided by JDBC database drivers. A new instance of an Optimization Engine solver process is created for each new solve. Failure of an Optimization Engine solver process means that the optimization in progress cannot complete.
3) Job Processors
A Job Processor initiates new solve jobs based on the contents of the Jobs database. The Job Processor has a fixed number of solver slots (usually as many as there are physical CPU cores on the host). When the Job Processor has an Optimization Engine slot that is ready, it polls the database to check for new jobs. New jobs are solved by a newly launched Optimization Engine solve process. The Job Processor is also responsible for updating the Jobs database with solve progress and final status. Failure of the Job Processor means that new jobs cannot be picked for solving, and that progress and results of completed jobs cannot be recorded. The Optimization Engine solver process maintains contact with the Job Processor while running jobs; if the Job Processor fails, then the Optimization Engine will stop.
The Job Processor runs as a Java EE application. Database connectivity is provided by a JDBC data source. The Job Processor responds to queries from the Job Manager using the Java Messaging Service (JMS). JMS is used only for interacting with jobs, such as to cancel or accept the current solution, and not for regular solving. Multiple Job Processors can use the same Jobs database, and will mark jobs as in-progress as they run them.
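The "multiple Job Processors sharing one Jobs database" behavior described above can be sketched with a conditional update that claims a job atomically, so that a job is marked in-progress by exactly one processor. This is an illustrative model only; the table, column and status names are invented, not the actual ODME schema.

```python
# Sketch of FIFO job claiming against a shared Jobs database (invented schema).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
db.executemany("INSERT INTO jobs (id, status) VALUES (?, 'NOT_STARTED')",
               [(i,) for i in range(1, 6)])
db.commit()

def claim_next_job(conn):
    """Claim the oldest NOT_STARTED job (FIFO by id), or return None."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE status = 'NOT_STARTED' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # Conditional update: succeeds only if no other processor claimed it first.
    cur = conn.execute(
        "UPDATE jobs SET status = 'IN_PROGRESS' "
        "WHERE id = ? AND status = 'NOT_STARTED'", (row[0],))
    conn.commit()
    return row[0] if cur.rowcount == 1 else None

first = claim_next_job(db)   # as if taken by processor 1
second = claim_next_job(db)  # as if taken by processor 2
print(first, second)  # 1 2
```

Because the claim is a single conditional UPDATE, two processors polling concurrently can never both run the same job.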
4) Job Manager
The Job Manager receives requests from clients to schedule and solve jobs. Jobs are stored as records in a relational Jobs database. The Job Manager runs as a Java EE application. Database connectivity to the Jobs DB is provided by a JDBC XA data source. The Job Manager communicates with the Job Processor to interact with running jobs, using the Java Messaging Service (JMS). Clients interact with the Job Manager using either a SOAP/HTTP web service for solve job submission, or a web-based console for solve queue administration. The Job Manager application holds no state information in memory (all state is in the database), so multiple instances of the application can be used.
The Job Manager includes a timer which checks the status of running jobs. If a job has not been active (no reported heartbeat) for 120 seconds (the default value, settable in the JobMonitor's ejb/jobActivityTimeout environment variable), the Job Manager will mark the job as failed and available to be restarted. Failure of the Job Manager means that:
• No new jobs can be submitted
• Status of jobs cannot be queried
• Running solve jobs cannot be aborted
• Pending jobs can still be run, as long as other components (Job Processor, DBs) have not failed
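The failed-job timer described above can be sketched as follows. This is an illustrative model, not ODME source code; the job records are invented, and only the 120-second default comes from the text.

```python
# Sketch of the Job Manager's failed-job detection: a running job whose
# heartbeat is older than the activity timeout is marked failed and
# becomes eligible to be restarted.
import time

JOB_ACTIVITY_TIMEOUT = 120  # seconds; default jobActivityTimeout value

jobs = {
    "job-A": {"status": "RUNNING", "last_heartbeat": time.time()},        # healthy
    "job-B": {"status": "RUNNING", "last_heartbeat": time.time() - 300},  # stale
}

def check_running_jobs(jobs, now=None, timeout=JOB_ACTIVITY_TIMEOUT):
    """Mark running jobs with a stale heartbeat as failed-and-recoverable."""
    now = time.time() if now is None else now
    failed = []
    for job_id, job in jobs.items():
        if job["status"] == "RUNNING" and now - job["last_heartbeat"] > timeout:
            job["status"] = "FAILED_RECOVERABLE"  # will be requeued and solved again
            failed.append(job_id)
    return failed

print(check_running_jobs(jobs))  # ['job-B']
```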
B Protecting the system
This diagram shows the dependencies of ODME components on high-level middleware components.
Figure: Component dependency diagram
These dependencies show the key consequences of the failure of any component. For example, if the application server for the Job Processor fails, then so does the Job Processor application and any associated Optimization Engines. If the database manager for the Jobs DB fails, then the Jobs DB, Job Manager, Job Processor and Optimization Engine all fail. While two logical application servers are depicted, the solution may be deployed on a single physical application server instance; the same is true for the two logical database managers. This would of course mean that a failure could have a greater impact.
1) Protecting the ability to run jobs
Assuming job records are available in the Jobs database, the ability to run optimization solve jobs is based on the following components.
From a middleware perspective, the following need to be protected:
• The application servers used by the Job Processor
• The database servers used for the Jobs database
• The database servers used for the Scenario database
• The physical servers and network used by each of the above
2) Protecting the ability to manage jobs
The ability to manage jobs is provided by the following logical components:
• The Job Manager application
• The Jobs database
From a middleware perspective, the following need to be protected:
• The HTTP server used by the Job Manager
• The application server used by the Job Manager
• The database server used for the Jobs database
• The physical servers and network used by each of the above
C Sample HA topology
Here is a sample topology which can be used to protect the ODME solution.
This logical topology consists of:
Optimization Servers, each running a Java EE application server with the Job Processor and Job Manager applications installed. Each Optimization Server will host one or more Optimization Engines. Both Optimization Servers can be in operation at the same time; if one Optimization Server fails, the other will continue taking jobs from the Jobs database. Any number of Optimization Servers could theoretically be used.
A database server hosting both the Jobs and Scenario databases. Keeping multiple copies of the same database active and up to date could be very difficult, so instead a passive backup should be kept. This backup needs to be up to date and ready to become active if the primary database server fails.
A load balancing server which can route HTTP traffic to either of the two Job Managers. The load balancer also needs to be backed up. This backup can also be passive, ready to become active if the primary load balancer fails.
D IBM middleware protection strategies
There are many ways in which IBM middleware can be protected, including (but not limited to):
• Software
  • WebSphere Application Server Network Deployment allows application servers to be clustered, and manages high availability across Java EE applications.
  • DB2® can be used to keep an up-to-date replica of a database on a separate server, using its High Availability Disaster Recovery (HADR) feature.
  • Tivoli System Automation provides advanced clustering support for managing highly available systems.
• Hardware
  • A PowerHA solution provides a cluster of IBM Power servers using shared disks. If one server (either software processes or hardware) fails, the other takes over. PowerHA SystemMirror is available for AIX, Linux and IBM i.
  • Disk technology such as a Redundant Array of Independent Disks (RAID).
For example, the following describes a software-only topology to enable high availability, using WebSphere Application Server Network Deployment and DB2 HADR.
Figure 1: ODME Software-only HA Topology example
This topology consists of:
Optimization Servers running WebSphere Application Server (WAS). Each Optimization Server runs a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster. Each Optimization Server hosts one or more ODM Optimization Engines. A WAS Service Integration Bus (SIBus) is configured to allow the JMS communication between Job Managers and Job Processors; the SIBus uses a single messaging engine with an HADR database store, which WAS will automatically move to the other WAS server in the event of a failure. In case of primary database failure, DB2 HADR will switch to the alternate standby database server.
Database servers running DB2 Enterprise Server Edition. Both servers run the same software, with one acting as the primary database server. The primary database server replicates all database updates on the standby server using DB2's High Availability Disaster Recovery (HADR) feature. In addition, Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database.
A load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests to one of the WAS servers. In this configuration the WAS server is chosen on a round-robin basis.
E System failure detection
A key factor in creating a highly available system is how quickly you can recover from a failure. The solution might be able to cope with the failure of one component, but coping with two or more may be difficult, so detecting and recovering from failures is critical.
It is important to monitor at many levels. A failure could occur in the ODME application, the hosting application server, the operating system, the physical hardware of the server, or in a network connection.
There are many software solutions for monitoring middleware, such as IBM Tivoli Monitoring.
F Special considerations for ODM Clients
ODM client applications, such as ODM Planner Studio, have direct access to the Scenario database defined in the odmapp deployment settings.
[Figure: ODM application configuration (odmapp) diagram. On the development side, an IT developer uses the ODM Enterprise IDE (Java development tools, OPL Studio, ODM Editors) to produce the odmapp configuration. On the deployment side, ODM Studio (Planner and Reviewer Editions), the Optimization Server, and custom clients and batch files all use their odmapp settings to solve, and to read/write the ODM Repository (Scenario DB).]
The odmapp files generated with the ODME IDE include their own Scenario database access definitions, which are configured independently from the Optimization Server's JOBS DB.
When an odmapp is intended to take advantage of HA recovery of its Scenario DB, its data source definitions must be enhanced with HA-specific settings that enable switching database operations to the alternate DB instance. This takes HA recovery into account both when the odmapp is used from within Planner Studio, and when it is used for solving on an Optimization Server.
2 Configuring a HA ODM Enterprise environment
A Introduction
This document describes a sample configuration of IBM ILOG ODM Enterprise as part of a highly available (HA) solution. There are many ways to provide high availability, using various combinations of specialized hardware and software. This document describes a software-based solution using the following products:
• IBM ILOG ODM Enterprise V3.3.0.1
• IBM WebSphere Application Server Network Deployment V6.1.0.25
• IBM HTTP Server V6.1.0.25
• IBM DB2 Enterprise Server Edition V9.5 Fix Pack 3
This document does not provide an exhaustive step-by-step guide, but instead highlights specific considerations for configuring HA with the products listed above. Links are provided to product documentation, articles and Redbooks which describe the steps in more detail.
The following configuration steps describe how to configure the sample topology depicted in Figure 1 (ODME Software-only HA Topology example) above. This topology consists of:
Optimization Servers running WebSphere Application Server (WAS), with each Optimization Server running a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster.
Database servers running DB2 Enterprise Server Edition, with both servers running the same software in an active/passive HADR setup, where the primary database server replicates all database updates to the standby server.
A load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests between the WAS servers on a weighted round-robin basis.
Not represented in the previous topology, ODM client applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR, which recovers from a loss of connection to the DB2 server by rerouting the connection to an alternate server.
B Configuring DB2 HADR
DB2 has a feature called High Availability Disaster Recovery (HADR) which provides a high-performance replication system. A DB2 HADR system consists of two database servers, one active and one standby. Any changes made to the active system are also replicated on the standby system. At any point an administrator can instruct the standby system to "take over" as the primary; after this happens, the roles of the two systems are swapped.
DB2 HADR Requirements
Before installing DB2 with HADR as the ODME application datastore, you need to be aware of these basic requirements for both the primary and standby DB2 servers:
• Identical operating system version and patch level
• The primary and standby server machines should be connected with a high-speed TCP/IP connection, and be reachable by TCP/IP from the client application
• Identical DB2 version and patch level, software bit size (32-bit or 64-bit), and installation path
DB2 HADR Setup
1 Install DB2 UDB Enterprise Server Edition on both the primary and standby machines.
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connection between the DB2 HADR primary and standby machines works well.
2 Start the DB2 servers on both machines, if they are not already running.
3 Create your database and the required tables on the primary machine only. The databases on secondary machines will be cloned from the primary machine. (See the DB2 Information Center for detailed installation information.)
• Optimization Server DB, used to store ODME jobs. Use the scripts provided with ODM Enterprise (typically server/database/db2-create-tables.sql) to create the JOBS database tables. Make a note of the user ID that you use to create the tables, because it is used in the table qualifier and schema.
• Scenario database, used to store ODME scenario data. The database tables themselves will be initialized when developing the ODM application using the ODME IDE.
• SIBus database, used by the WAS Service Integration Bus.
Tip: The DB2 logs need to be of a sufficient size, especially for the Scenario database, on which there are large updates. Make sure you set the database logging to archive logging rather than the default circular logging, because otherwise it will not be possible to enable HADR.
4 Configure HADR for each database from the primary machine, using the Setup HADR wizard.
Tip: The easiest way to create the databases on the secondary machine is to do it during the HADR setup process, by using the backup method. During HADR setup you may be asked for the peer window parameter; you can leave it at the default value of 0.
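For reference, the configuration that the Setup HADR wizard performs can also be sketched from the DB2 command line. This is a hedged sketch under assumptions: the database name JOBS comes from this document, but the host names (db2primary/db2standby), instance name, service ports and paths are example values, not values from this document.

```shell
# Run on the primary machine. 1) HADR requires archive logging,
# not the default circular logging:
db2 "UPDATE DB CFG FOR JOBS USING LOGARCHMETH1 DISK:/db2/archlogs"

# 2) Point each side at its HADR partner (mirror these values on the standby):
db2 "UPDATE DB CFG FOR JOBS USING HADR_LOCAL_HOST db2primary HADR_LOCAL_SVC 55001 \
     HADR_REMOTE_HOST db2standby HADR_REMOTE_SVC 55002 HADR_REMOTE_INST db2inst1"

# 3) Clone the database to the standby via backup/restore (the "backup method"):
db2 "BACKUP DB JOBS TO /db2/backup"
# ...copy the image to the standby machine, then on the standby:
# db2 "RESTORE DB JOBS FROM /db2/backup"
# db2 "START HADR ON DB JOBS AS STANDBY"

# 4) Finally, back on the primary:
db2 "START HADR ON DB JOBS AS PRIMARY"
```

The same sequence would be repeated for the Scenario and SIBus databases.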
C Configuring WebSphere Application Server and HTTP Server
1) Overview
WebSphere Application Server Network Deployment allows multiple servers to be clustered together. Installing a Java EE application into the cluster performs the installation on each cluster member.
The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other. To use JMS in a clustered WAS environment, a service integration bus (SIBus) is used, with each server added as a clustered bus member. In our architecture only one server needs to host a messaging engine; in the event of a failure in that server, WAS will move the messaging engine to another server. To support this, each WAS server must be able to access the SIBus data store, so in this topology the data store will be hosted in a DB2 database.
2) Procedure
The following instructions extend the single-server instructions provided with ODME 3.3.0.1, with a focus on differences specific to a clustered deployment.
a) Install WAS 6.1 Network Deployment.
Tip: Install the deployment manager node first and start it. For the other nodes used in the optimisation cluster, select a "Custom" environment in the profile manager wizard, which will add the new node into the cell. The deployment manager and cluster nodes should be created with security disabled.
b) Start and connect to the deployment manager console, create a new cluster in Servers > Clusters, and define cluster members for the nodes created earlier.
c) Create a "DB2 Universal JDBC Driver Provider (XA)" provider at the cluster scope in Resources > JDBC > JDBC Providers, and specify database class path information for the cluster nodes.
d) With this provider, create new JDBC data sources at the cluster scope for each HA database used by the optimisation server cluster: the Jobs and SIBus databases. Create the data sources with all the settings that pertain to the primary DB2 host; alternate (standby) database definitions will be specified through additional DB connection properties.
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connections between the WebSphere Application Server host and the DB2 HADR primary and standby hosts work well.
The JNDI name to use for the Jobs DB should be OptimizationServerDB, which is the default binding name used in the Optimization Server enterprise modules.
e) Set the custom properties of these JDBC data sources:
• currentSchema: the schema used when creating the DB2 database. By default this is the user ID that you used to create the Jobs DB tables.
• clientRerouteAlternateServerName: standby server name for client reroute. This is the HADR standby host name.
• clientRerouteAlternatePortNumber: standby server port number for client reroute.
• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails. A good default is 2.
• retryIntervalForClientReroute: amount of time (in seconds) to sleep before retrying. A good default is 15.
• fullyMaterializeInputStreams: set to true.
• progressiveStreaming: disable by setting a value of 2. This will prevent odmapp unpacking issues.
f) Create the SIBus named OptimizationServerBus in Service integration > Buses, with no security enabled.
g) Set bus members for OptimizationServerBus at the cluster scope, and use the data store for the HA SIBus database created earlier. You may need to specify an authentication alias for the SIBus database connection.
h) Create the JMS resources in Resources > JMS at the cluster scope, using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section):
• OptimizationServerTopic, named jms/optimserverTopic in JNDI
• OptimizationServerTopicConnectionFactory, named jms/optimserverTopicConnectionFactory in JNDI
• OptimizationServerQueueConnectionFactory, named jms/optimserverQueueConnectionFactory in JNDI
• OptimizationServerTopicSpec, named jms/optimserverTopicSpec and pointing to topic jms/optimserverTopic
i) Deploy optimserver-mgmt-ear and optimserver-processor-ear at the cluster scope.
j) Install IBM HTTP Server 6.1.
Tip:
• Note that the HTTP <port> defined during install is the one that will be used in the Optimisation Server connection URL, http://<server>:<port>/optimserver, to deploy your developed ODM Application.
• We recommend not installing the WAS plug-in as part of the IBM HTTP Server install, but rather launching it as a separate installation afterwards, because it makes configuration easier.
k) Install the web server plug-ins for IBM WebSphere Application Server V6.1. At the beginning of the plug-in installation, select the check box to view the installation roadmap, then click Next. In this roadmap, identify your installation scenario and follow the installation steps.
l) Start the cluster nodes.
m) Start the cluster in Servers > Clusters.
n) Check that the Optimization Server installation is correct by going to http://<server>:<port>/optimserver/console.
Useful links:
• "IBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Server"
• "Roadmap: Installing the Network Deployment product"
D Configuring the ODM Application
When the ODM Repository relies on a DB2 HADR environment, the ODM Application configuration must be updated to fully benefit from automatic client reroute.
Automatic client reroute is a DB2 feature that enables a DB2 client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server. This connection rerouting occurs automatically.
To fully support this feature, the alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application. Example:
• clientRerouteAlternateServerName: alternate server names for client reroute
• clientRerouteAlternatePortNumber: alternate port numbers for client reroute
• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails
• retryIntervalForClientReroute: amount of time (in seconds) to sleep before retrying
Notes:
• This property list can be extended with other DB2 properties to match your needs. The list is then passed to the ODM repository and the underlying JDBC driver. Descriptions of additional properties can be found at http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html
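The retry semantics that maxRetriesForClientReroute and retryIntervalForClientReroute control can be modeled as follows. This is a toy model of the behavior, not the DB2 JCC driver itself, and the host names are invented for illustration.

```python
# Toy model of automatic client reroute: try the primary once, then retry
# the alternate server a bounded number of times, sleeping between attempts.
import time

def connect_with_reroute(connect, primary, alternate,
                         max_retries=2, retry_interval=15, sleep=time.sleep):
    """Return a connection, rerouting to the alternate server on failure."""
    try:
        return connect(primary)
    except ConnectionError:
        pass  # primary is down: reroute to the alternate
    for attempt in range(max_retries):
        try:
            return connect(alternate)
        except ConnectionError:
            if attempt < max_retries - 1:
                sleep(retry_interval)  # retryIntervalForClientReroute
    raise ConnectionError("client reroute exhausted")

# Simulated servers: the primary is down, the standby answers.
def fake_connect(host):
    if host == "db2primary":
        raise ConnectionError("primary down")
    return f"connected:{host}"

conn = connect_with_reroute(fake_connect, "db2primary", "db2standby",
                            sleep=lambda s: None)
print(conn)  # connected:db2standby
```

With the defaults suggested above (2 retries, 15 seconds apart), a takeover window of roughly 30 seconds can be bridged without surfacing an error to the application.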
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Network Deployment and DB2 HADR.
This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS + WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in active/standby HADR configuration, and ODME 3.3.0.1.
When deploying on a multi-host cluster, the additional benefits fall in two categories: Workload Management (WLM) and High Availability (HA).
WLM is the ability to spread the processing workload across all cluster members, and is a feature brought by WebSphere's Network Deployment version.
HA is the ability of the system to continue operating when some of its hardware, network or software components encounter a failure.
A Workload Management capabilities of ODME 3.3.0.1
When running ODME in a multi-node clustered environment, two different types of workload are processed by the Optimization Server: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.
1) Job Control and Administrative requests Workload Management
Job control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol, and will be workload-managed by the regular IHS + WAS HTTP load balancing scheme.
Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS will be round-robin, and will apply to all job control activity, whether it originates from ODM Studio or the Solve API.
The Optimization Server Admin console is a stateless web application, and will also be load balanced in a round-robin fashion by WAS.
2) Job solving Workload Management
The solver Optimization Engine processes are long-running, and their run duration may vary a lot across job types. They are managed by the Job Processor independently on each node.
Each Job Processor will pull jobs from the solve-pending queue in a first-in/first-out fashion whenever it has solve slots available. The resulting overall load balancing is a first-come/first-served scheme, where solves are processed across the nodes depending on their capacity.
On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained.
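The first-come/first-served draining described above can be reproduced with a small deterministic simulation. The processor count, slot count and job duration below are illustrative parameters, not measurements from this document.

```python
# Deterministic toy simulation of two Job Processors with three solve slots
# each, pulling jobs FIFO from a shared queue: once demand exceeds one
# node's capacity, the work spreads evenly across both nodes.
from collections import deque

def drain(queue_size, processors=2, slots=3, job_duration=4):
    queue = deque(range(queue_size))
    running = [[] for _ in range(processors)]  # remaining ticks per slot
    solved = [0] * processors                  # jobs completed per processor
    while queue or any(running):
        for p in range(processors):            # free slots pull jobs, FIFO
            while queue and len(running[p]) < slots:
                queue.popleft()
                running[p].append(job_duration)
        for p in range(processors):            # one time step of solving
            running[p] = [t - 1 for t in running[p]]
            solved[p] += sum(1 for t in running[p] if t == 0)
            running[p] = [t for t in running[p] if t > 0]
    return solved

print(drain(30))  # [15, 15]: an even split between the two nodes
```

As in the load chart below, while the queue is non-empty each processor keeps its slots full, so jobs end up roughly evenly distributed without any central scheduler.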
A typical timeline of job control and job solves is illustrated in Figure 2 below. The job submit, enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing.
Note that in the case depicted here, DB2 HADR is set up in active/standby mode, so only one of the two DB2 nodes will be handling DB requests.
A typical balancing of load is illustrated below. The yellow line represents the queue depth, starting at 500 jobs and consuming the load until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots. Overall, both processors will be handling 2 or 3 jobs until all are processed. The diagram shows the load for short jobs of even solve durations; the X axis units are events, not linear time.
[Chart: queue depth (from 500 down to 0) and running jobs per server (server1 and server2, 0 to 7) plotted over time.]
Figure 3: Typical balance of load on ODME 3.3.0.1
The irregularities towards the end are due to some administrator-triggered cleansing of the processed jobs from the log.
B High Availablity capabilities of ODME 3301 on WAS-NDampDB2 HA As detailed in the Protecting the system section running ODME 3301 in a clustered environment allows protection of the overall system from failure of some of its components This provides the ability for the system on one hand to continue operating across those failures and on the other hand to perform some level of recovery on the processing that was inflight at the point of failure
1) Operations continuity For ODME 3301 operations continuity is the ability for the Optimization Server to display the Admin console keep the capacity to accept new jobs submissions and continue processing queued jobs
Operations continuity across WAS failures Figure 2 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails the surviving cluster member will continue processing
Figure 4: Typical ODME 3.3.0.1 Operations Continuity timeline across WAS node failure
Note that the Optimization Server Admin console will also continue to be served by the remaining server of the cluster.
Operations continuity across DB2 failures: when DB2 HADR has been set up and the JOBS DB and odmapp datasources have been given appropriate alternate server definitions, the same kind of behavior will be observed: the Optimization Server will switch to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails. Newly picked up jobs will
2) Operations recovery
ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving will result in the solve being aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on how the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.
Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-job detection, which detects a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, marks those jobs as recoverable, and requeues them so that they are solved again.
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment
There are a certain number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.
Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but become more prevalent when seamless continuous operation and failure recovery are expected.
Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.
A Job processor fails to extract OPL binaries upon restart
Symptoms
The optimserver-processor-ear enterprise application is not started on the server, although optimserver-mgmt-ear is running.
Queued jobs are not processed (they remain in the NOT_STARTED state).
Only one of the cluster members runs jobs, although the queue is saturated.
SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70/libcplex121.a (Text file busy)
Explanation
The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is therefore not allowed to extract them again and fails during its initialization.
Remediation
Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.
To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), right after restarting and before any solver instance is started, change the mode of the files in the above directory to 750 (instead of the default 755): chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70 – this will force AIX not to cache the files.
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms
When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.
Explanation
In some circumstances, the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, while the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case, the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock.
Remediation
The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.
C Bad error reporting when the Optimization Server loses its connection to the Repository DB
Symptoms
The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException, params=] when the connection to the JOBS DB is lost.
Explanation
The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.
Remediation
This error is transient; refresh the Optimization Server Admin console after the JOBS DB has recovered.
D ODME cannot start when WAS administrative security is enabled
Symptoms
Although WAS with administrative security is not currently supported by ODME 3.3.0.1, deployers of the Optimization Server in a clustered WAS environment may need to deploy it with security enabled.
This results in an exception being raised during startup of the Optimization Server, reported in SystemOut.log.
Explanation
The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.
Remediation
WAS administrative security may be turned on, but then write access to JNDI should be granted to the EVERYONE group. This is achieved using the WAS Admin console, in the Environment > Naming > CORBA Naming Service Groups section: the group EVERYONE has to be added with Cos Naming Read, Write, Create, and Delete authorization.
E ODM solver does not start
Symptoms
All solve jobs end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.
Explanation
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.
Remediation
Run vcredist_x86.exe from the ODM Enterprise Developer Edition redist directory on all machines where the Optimization Server will execute ODM solve jobs.
3) Job Processors
A Job Processor initiates new solve jobs based on the contents of the Jobs database. The Job Processor has a fixed number of solver slots (usually as many as there are physical CPU cores on the host). When the Job Processor has an Optimization Engine slot free, it polls the database to check for new jobs; new jobs are solved by a newly launched Optimization Engine solve process. The Job Processor is also responsible for updating the Jobs database with solve progress and final status. Failure of the Job Processor means that new jobs cannot be picked up for solving and that progress or results of completed jobs cannot be recorded. The Optimization Engine solver process maintains contact with the Job Processor while running jobs – if the Job Processor fails, the Optimization Engine stops. The Job Processor runs as a Java EE application; database connectivity is provided by a JDBC data source. The Job Processor responds to queries from the Job Manager using the Java Messaging Service (JMS). JMS is used only for interacting with jobs, such as to cancel or accept the current solution, and not for regular solving. Multiple Job Processors can use the same Jobs database and will mark jobs as in-progress as they run them.
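The polling cycle described above can be pictured with a short Python sketch. This is an illustrative model, not ODME code: the class and method names are invented for the example.

```python
import queue

class JobProcessor:
    """Illustrative model of the Job Processor's polling cycle."""
    def __init__(self, slots):
        self.slots = slots    # fixed number of solver slots (~ CPU cores)
        self.running = []     # jobs currently held by Optimization Engines

    def poll(self, jobs_db):
        """Pull new jobs from the Jobs DB while a solver slot is free."""
        while len(self.running) < self.slots:
            try:
                job = jobs_db.get_nowait()
            except queue.Empty:
                break                    # no pending jobs in the Jobs DB
            self.running.append(job)     # mark the job as in-progress
        return list(self.running)

# Four queued jobs, a processor with three slots: one job stays pending.
jobs_db = queue.Queue()
for j in ("A", "B", "C", "D"):
    jobs_db.put(j)

p = JobProcessor(slots=3)
print(p.poll(jobs_db))   # -> ['A', 'B', 'C']; job 'D' stays queued
```

Because multiple Job Processors share the same Jobs database, the in-progress marking step is what keeps two processors from picking up the same job.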
4) Job Manager
The Job Manager receives requests from clients to schedule and solve jobs. Jobs are stored as records in a Jobs relational database. The Job Manager runs as a Java EE application; database connectivity to the Jobs DB is provided by a JDBC XA data source. The Job Manager communicates with the Job Processor to interact with running jobs using the Java Messaging Service (JMS). Clients interact with the Job Manager using either a SOAP/HTTP web service for solve job submission or a web-based console for solve queue administration. The Job Manager application holds no state information in memory (all state is in the database), so multiple instances of the application can be used. The Job Manager includes a timer which checks the status of running jobs: if a job has not been active (no reported heartbeat) for 120 seconds (the default value, settable through the jobActivityTimeout environment variable of the JobMonitors EJB), the Job Manager will mark the job as failed and available to be restarted. Failure of the Job Manager means that:
• No new jobs can be submitted
• Status of jobs cannot be queried
• Running solve jobs cannot be aborted
• Pending jobs can still be run, as long as other components (Job Processor, DBs) have not failed
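A minimal sketch of that failure-detection timer follows. The job-record field names are invented for the example; only the 120-second default comes from the text.

```python
JOB_ACTIVITY_TIMEOUT = 120   # seconds; the documented default

def check_running_jobs(jobs, now):
    """Mark running jobs with no recent heartbeat as failed and
    available to be restarted, as the Job Manager's timer does."""
    timed_out = []
    for job in jobs:
        if job["state"] == "RUNNING" and now - job["heartbeat"] > JOB_ACTIVITY_TIMEOUT:
            job["state"] = "RECOVERABLE"   # will be requeued and solved again
            timed_out.append(job["id"])
    return timed_out

jobs = [
    {"id": 1, "state": "RUNNING", "heartbeat": 0},    # silent for too long
    {"id": 2, "state": "RUNNING", "heartbeat": 100},  # recently active
]
print(check_running_jobs(jobs, now=130))   # -> [1]
```

Because all of this state lives in the Jobs database rather than in the Job Manager's memory, any surviving Job Manager instance can perform the detection.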
B Protecting the system
This diagram shows the dependencies of ODME components on high-level middleware components:
Figure: Component dependency diagram
These dependencies show the key consequences of the failure of any component. For example, if the application server for the Job Processor fails, then so does the Job Processor application and any associated Optimization Engines. If the database manager for the Jobs DB fails, then the Jobs DB, Job Manager, Job Processor, and Optimization Engine all fail. While two logical application servers are depicted, the solution may be deployed on a single physical application server instance; the same is true for the two logical database managers. This would of course mean that a failure could have a greater impact.
1) Protecting the ability to run jobs
Assuming job records are available in the Jobs database, the ability to run optimization solve jobs is based on the following components.
From a middleware perspective, the following needs to be protected:
• Application servers used by the Job Processor
• The database servers used for the Jobs database
• The database servers used for the Scenario database
• The physical servers and network used by each of the above
2) Protecting the ability to manage jobs
The ability to manage jobs is provided by the following logical components:
• Job Manager application
• Jobs database
From a middleware perspective, the following needs to be protected:
• HTTP server used by the Job Manager
• Application server used by the Job Manager
• The database server used for the Jobs database
• The physical servers and network used by each of the above
C Sample HA topology
Here is a sample topology which can be used to protect the ODME solution.
This logical topology consists of:
Optimization Servers, each running a Java EE application server with the Job Processor and Job Manager applications installed. Each Optimization Server will host one or more Optimization Engines. Both Optimization Servers can be in operation at the same time – if one Optimization Server fails, the other will continue taking jobs from the Jobs database. Any number of Optimization Servers could theoretically be used.
A database server hosting both the Jobs and Scenario databases. Keeping multiple copies of the same database active and up-to-date could be very difficult, so instead a passive backup should be kept. This backup needs to be up-to-date and ready to become active if the primary database server fails.
A load balancing server which can route HTTP traffic to either of the two Job Managers. The load balancer also needs to be backed up. This backup can also be passive, ready to become active if the primary load balancer fails.
D IBM middleware protection strategies
There are many ways in which IBM middleware can be protected, including (but not limited to):
• Software
• WebSphere Application Server Network Deployment allows application servers to be clustered, and manages high availability across Java EE applications
• DB2® can be used to keep an up-to-date replica of a database on a separate server using its High Availability Disaster Recovery (HADR) feature
• Tivoli System Automation provides advanced clustering support for managing highly available systems
• Hardware
• A PowerHA solution provides a cluster of IBM Power servers using shared disks. If one server (either software processes or hardware) fails, the other takes over. PowerHA SystemMirror is available for AIX, Linux, and IBM i
• Disk technology such as a Redundant Array of Independent Disks (RAID)
For example, the following describes a software-only topology to enable high availability using WebSphere Application Server Network Deployment and DB2 HADR.
Figure 1: ODME Software-only HA Topology example
This topology consists of:
Optimization Servers running WebSphere Application Server (WAS). Each Optimization Server runs a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster. Each Optimization Server hosts one or more ODM Optimization Engines. A WAS Service Integration Bus (SIBus) is configured to allow the JMS communication between Job Managers and Job Processors – the SIBus uses a single messaging engine with a HADR database store, which WAS will automatically move to the other WAS server in the event of a failure. In case of primary database failure, DB2 HADR will switch to the alternate standby database server.
Database servers running DB2 Enterprise Server Edition. Both servers run the same software, with one acting as the primary database server. The primary database server replicates all database updates on the standby server using DB2's High Availability Disaster Recovery (HADR) feature. In addition, Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database.
A load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests to one of the WAS servers. In this configuration, the WAS server is chosen on a round-robin basis.
E System failure detection
A key factor in creating a highly available system is how quickly you can recover from a failure. The solution might be able to cope with the failure of one component, but two or more may be difficult, so detecting and recovering from failures is critical.
It is important to monitor at many levels: a failure could occur in the ODME application, the hosting application server, the operating system, the physical hardware of the server, or a network connection.
There are many software solutions for monitoring middleware, such as IBM Tivoli Monitoring.
F Special considerations for ODM Clients
ODM client applications, such as ODM Planner Studio, have direct access to the Scenario database defined in the odmapp deployment settings.
Figure: ODM application (odmapp) configuration flow – the odmapp files produced with the ODM Enterprise IDE are used by ODM Studio (Planner and Reviewer Editions), custom clients and batch files, and the Optimization Server, all of which read and write the ODM Repository (Scenario DB).
The odmapp files generated with the ODME IDE include their own Scenario database access definitions, which are configured independently from the Optimization Server's JOBS DB.
When an odmapp is intended to take advantage of HA recovery of its Scenario DB, its data source definitions must be enhanced with HA-specific settings that enable switching database operations to the alternate DB instance. This enables HA recovery both when the odmapp is used from within Planner Studio and when it is used for solving on an Optimization Server.
2 Configuring a HA ODM Enterprise environment
A Introduction
This document describes a sample configuration of IBM ILOG ODM Enterprise as part of a highly available (HA) solution. There are many ways to provide high availability using various combinations of specialized hardware and software. This document describes a software-based solution using the following products:
• IBM ILOG ODM Enterprise V3.3.0.1
• IBM WebSphere Application Server Network Deployment V6.1 FP 0.25
• IBM HTTP Server V6.1 FP 0.25
• IBM DB2 Enterprise Server Edition V9.5 FixPack 3
This document does not provide an exhaustive step-by-step guide, but instead highlights specific considerations for configuring HA with the products listed above. Links are provided to product documentation, articles, and Redbooks which describe the steps in more detail.
The next configuration steps describe how to configure the sample topology depicted in Figure 1, ODME Software-only HA Topology example, above. This topology consists of:
Optimization Servers running WebSphere Application Server (WAS), where each Optimization Server runs a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster.
Database servers running DB2 Enterprise Server Edition, with both servers running the same software in an active/passive HADR setup where the primary database server replicates all database updates to the standby server.
A load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests between the WAS servers on a weighted round-robin basis.
Not represented in the previous topology, ODM client applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR, recovering from a loss of connection to the DB2 server by rerouting the connection to an alternate server.
B Configuring DB2 HADR
DB2 has a feature called High Availability Disaster Recovery (HADR) which provides a high-performance replication system. A DB2 HADR system consists of two database servers, one active and one standby. Any changes made to the active system are also replicated on the standby system. At any point, an administrator can instruct the standby system to "take over" as the primary – after this happens, the roles of the two systems are swapped.
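The takeover semantics can be pictured with a toy model. This is illustrative only – a real takeover is performed with DB2's TAKEOVER HADR command issued on the standby, and the host names here are invented:

```python
class HADRPair:
    """Toy model of a DB2 HADR pair: one primary, one standby."""
    def __init__(self, primary, standby):
        self.primary, self.standby = primary, standby

    def takeover(self):
        # After a takeover, the roles of the two systems are swapped
        self.primary, self.standby = self.standby, self.primary

pair = HADRPair("dbhost1", "dbhost2")
pair.takeover()
print(pair.primary, pair.standby)   # -> dbhost2 dbhost1
```

The swap is symmetric: a second takeover restores the original roles, which is what allows fail-back after the failed server is repaired.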
DB2 HADR Requirements
Before installing DB2 with HADR as the ODME application datastore, you need to be aware of these basic requirements for both the primary and standby DB2 servers:
• Identical operating system version and patch level
• The primary and standby server machines should be connected with a high-speed TCP/IP connection and be reachable over TCP/IP from the client application
• Identical DB2 version and patch level, software bit size (32-bit or 64-bit), and installation path
DB2 HADR Setup
1. Install DB2 UDB Enterprise Server Edition on both the primary and standby machines.
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connection between the DB2 HADR primary and standby machines works well.
2. Start the DB2 servers on both machines if they are not already running.
3. Create your database and the required tables on the primary machine only; the databases on the secondary machine will be cloned from the primary machine. (See the DB2 Information Center for detailed installation information.)
• Optimization Server DB – used to store ODME jobs. Use the scripts provided with ODM Enterprise (typically server/database/db2-create-tables.sql) to create the JOBS database tables. Make a note of the user ID that you use to create the tables, because it is used in the table qualifier and schema.
• Scenario database – used to store ODME scenario data. The database tables themselves will be initialized when developing the ODM application using the ODME IDE.
• SIBus database – used by the WAS Service Integration Bus.
Tip: The DB2 logs need to be of a sufficient size, especially for the scenario database, which receives large updates. Make sure you set the database logging to archive logging rather than the default circular logging, because otherwise it will not be possible to enable HADR.
4. Configure HADR for each database from the primary machine using the Setup HADR wizard, as presented in
Tip: The easiest way to create the databases on the secondary machines is to do it during the HADR setup process, using the backup method. During HADR setup you may be asked for the peer window parameter; you can leave it at the default value of 0.
C Configuring WebSphere Application Server and HTTP Server
1) Overview
WebSphere Application Server Network Deployment allows multiple servers to be clustered together; installing a Java EE application into the cluster performs the installation on each cluster member.
The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other. To use JMS in a clustered WAS environment, a service integration bus (SIBus) is used, with each server added as a clustered bus member. In our architecture, only one server needs to host a messaging engine – in the event of a failure of that server, WAS will move the messaging engine to another server. To support this, each WAS server must be able to access the SIBus data store, so in this topology the data store will be hosted in a DB2 database.
2) Procedure
The following instructions extend the single-server instructions provided with ODME 3.3.0.1, with a focus on differences specific to a clustered deployment.
a) Install WAS 6.1 Network Deployment, as detailed in
Tip: Install the deployment manager node first and start it. For the other nodes used in the optimisation cluster, select a "Custom" environment in the profile manager wizard, which will add the new node into the cell. The deployment manager and cluster nodes should be created with security disabled.
b) Start and connect to the deployment manager console, create a new cluster in Servers => Clusters, and define cluster members for the nodes created earlier.
c) Create a "DB2 Universal JDBC Driver Provider (XA)" provider at the cluster scope in Resources => JDBC => JDBC Providers, and specify database class path information for the cluster nodes.
d) With this provider, create new JDBC data sources at the cluster scope for each HA database used by the optimisation server cluster: the Jobs and SIBus databases. Create the data sources with all the settings that pertain to the primary DB2 host; alternate (standby) database definitions will be specified through additional DB connection properties.
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connections between the WebSphere Application Server hosts and the DB2 HADR primary and standby hosts work well.
The JNDI name to use for the Jobs DB should be OptimizationServerDB, which is the default binding name used in the Optimization Server enterprise modules.
e) Set the custom properties of these JDBC data sources:
• currentSchema – the schema used when creating the DB2 database. This schema is by default the user ID that you used to create the Jobs DB tables
• clientRerouteAlternateServerName – the standby server name for client reroute. This is the HADR standby host name
• clientRerouteAlternatePortNumber – the standby server port number for client reroute
• maxRetriesForClientReroute – limits the number of retries if the primary connection to the server fails. A good default is 2
• retryIntervalForClientReroute – the amount of time (in seconds) to sleep before retrying. A good default is 15
• fullyMaterializeInputStreams – set to true
• progressiveStreaming – disable by setting it to a value of 2. This prevents odmapp unpacking issues
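The reroute behaviour that these properties configure can be approximated as follows. This is a sketch of the semantics, not the DB2 JDBC driver's actual logic, and `fake_connect` is an invented stand-in for a real connection call:

```python
import time

def connect_with_reroute(connect, primary, alternate,
                         max_retries=2, retry_interval=0):
    """Try the primary first, retrying up to max_retries times, then
    fall back to the alternate (standby) server the same way."""
    last_error = None
    for target in (primary, alternate):
        for _ in range(max_retries + 1):
            try:
                return connect(target)
            except OSError as exc:
                last_error = exc
                time.sleep(retry_interval)  # retryIntervalForClientReroute
    raise OSError(f"no server reachable: {last_error}")

def fake_connect(host):
    """Simulated driver call: the primary is down, the standby answers."""
    if host == "db-primary":
        raise OSError("connection refused")
    return f"connected:{host}"

print(connect_with_reroute(fake_connect, "db-primary", "db-standby"))
# -> connected:db-standby
```

With the defaults suggested above (2 retries, 15-second interval), a client would spend up to about 45 seconds on the failed primary before switching to the standby.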
f) Create the SIBus, named OptimizationServerBus, in Service integration => Buses, with no security enabled.
g) Set the bus members for OptimizationServerBus at the cluster scope, and use a data store backed by the HA SIBus database created earlier. You may need to specify an authentication alias for the SIBus database connection.
h) Create the JMS resources in Resources => JMS at the cluster scope, using the service integration bus named OptimizationServerBus created earlier (in the Bus name field of the Connection section):
• OptimizationServerTopic, named jms/optimserverTopic in JNDI
• OptimizationServerTopicConnectionFactory, named jms/optimserverTopicConnectionFactory in JNDI
• OptimizationServerQueueConnectionFactory, named jms/optimserverQueueConnectionFactory in JNDI
• OptimizationServerTopicSpec, named jms/optimserverTopicSpec and pointing to topic jms/optimserverTopic
i) Deploy optimserver-mgmt-ear and optimserver-processor-ear at the cluster scope.
j) Install IBM HTTP Server 6.1.
Tip:
• Note that the HTTP <port> defined during the install is the one that will be used in the Optimisation Server connection URL http://<server>:<port>/optimserver to deploy your developed ODM application.
• We recommend not installing the WAS plug-in as part of the IBM HTTP Server install, but rather launching it as a separate installation afterwards, because it makes configuration easier.
k) Install the Web server plug-ins for IBM WebSphere Application Server V6.1. At the beginning of the plug-in installation, select the check box to view the installation roadmap, then click Next. In this roadmap, identify your installation scenario and follow the installation steps.
l) Start the cluster nodes.
m) Start the cluster in Servers => Clusters.
n) Check that the Optimization Server installation is correct by going to http://<server>:<port>/optimserver/console.
Useful links:
• "IBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Server"
• "Roadmap: Installing the Network Deployment product"
D Configuring the ODM Application
When the ODM Repository relies on a DB2 HADR environment, the ODM Application configuration must be updated to fully benefit from automatic client reroute.
Automatic client reroute is a DB2 feature that enables a DB2 client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server; this rerouting occurs automatically.
To fully support this feature, the alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application. Example:
• clientRerouteAlternateServerName – alternate server names for client reroute
• clientRerouteAlternatePortNumber – alternate port numbers for client reroute
• maxRetriesForClientReroute – limits the number of retries if the primary connection to the server fails
• retryIntervalForClientReroute – the amount of time (in seconds) to sleep before retrying
Notes:
• This property list can be extended with other DB2 properties to match your needs. The list is then passed to the ODM repository and the underlying JDBC driver. Additional property descriptions can be found at http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Network Deployment and DB2 HADR.
This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS + WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in an active/standby HADR configuration, and ODME 3.3.0.1.
When deploying on a multi-host cluster, the additional benefits fall into two categories: Workload Management (WLM) and High Availability (HA).
WLM is the ability to spread the processing workload across all cluster members, and is a feature brought by WebSphere's Network Deployment edition.
HA is the ability for the system to continue operating when some of its hardware, network, or software components encounter a failure.
A Workload Management capabilities of ODME 3.3.0.1
When running ODME in a multi-node clustered environment, two different types of workload are processed by the Optimization Server: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.
1) Job Control and Administrative requests Workload Management
Job control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol and will be workload-managed by the regular IHS + WAS HTTP load balancing scheme.
Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS will be round-robin, and will apply to all job control activity, whether it originates from ODM Studio or the Solve API.
The Optimization Server Admin console is a stateless web application and will also be load balanced in a round-robin fashion by WAS.
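Round-robin dispatch of stateless requests is simple to picture. A sketch with invented member names (the IHS/WAS plug-in additionally supports per-member weights):

```python
import itertools

# Two cluster members receiving stateless SOAP/HTTP or console requests
members = ["was1", "was2"]
dispatch = itertools.cycle(members)

routed = [next(dispatch) for _ in range(5)]
print(routed)   # -> ['was1', 'was2', 'was1', 'was2', 'was1']
```

Because no session state lives on a particular member, it does not matter that consecutive requests from the same client may land on different servers.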
2) Job solving Workload Management
The solver Optimization Engine processes are long-running, and their run duration may vary considerably across job types. They are managed by the Job Processor independently on each node.
Each Job Processor will pull jobs from the solve-pending queue in a first-in/first-out fashion whenever solve slots are available. The resulting overall load balancing is a first-come/first-served scheme, where solves are processed across the nodes depending on their capacity.
On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained.
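This queue-draining behaviour can be simulated with a small event-driven sketch. The assumptions are illustrative: two processors with three slots each and equal solve durations, matching the idealized conditions of the load diagram.

```python
def drain(queue_depth, processors=2, slots=3):
    """First-come-first-served draining: each processor refills its free
    slots from the shared queue, then one job per busy processor completes
    at every event tick. Returns a (queue_depth, per-node load) history."""
    load = [0] * processors
    history = []
    while queue_depth or any(load):
        for i in range(processors):
            while load[i] < slots and queue_depth:
                load[i] += 1            # pick a pending job from the queue
                queue_depth -= 1
        history.append((queue_depth, tuple(load)))
        load = [max(0, l - 1) for l in load]   # one solve completes per node
    return history

history = drain(12)
print(history[0])    # -> (6, (3, 3)): both nodes saturated from the start
print(history[-1])   # -> (0, (1, 1)): the tail drains evenly
```

With equal durations the per-node load never differs by more than one job, which is the even balance the description above anticipates; real workloads with uneven solve times will drift from this ideal.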
Remediation
Delete the files in usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 before starting the WAS server where the Optimization Server is deployed
In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance) right after restarting and before any solver instance is started change the mod of the files in the above directory to 750 (instead of the default 755) chmod 750 usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 this will force AIX not to cache the files
19
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms
When a database failure occurs a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable although the solve has completed successfully and a solution found message appears in ODM Studio
Explanation
In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB but the Job Processor is not able to update the Jobs DB This may happen when the JOBSDB store operations occur during database connection recovery In this case the solve job is eventually detected as timed‐out by the Optimization Server and marked for recovery but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock
Remediation
The scenario is actually solved although it is not properly reported as such by the Optimization Server The business user will see the scenario as solved from within the ODM Studio and the corresponding job can safely be cleared from the Optimization Server Admin console
C Bad error reporting when Optimization Server loses connection from the Repository DB
Symptoms
The Optimization Server Admin console displays an Error 500 [code=javaxtransactionRollbackExceptionparams=] when connection to the JOBS DB is lost
Explanation
The JOBS DB connection is lost and The Optimization Server Admin console cannot extract the jobs queue status for display
Remediation
This error is transient refresh the Optimization Server Admin console after the JOBS DB will have recovered
D ODME cannot start when WAS administrative security is enabled
Symptoms
Although WAS with administrative security is not currently supported by ODME 3301 deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled
This results in an exception being raised during startup of Optimization Server reported in the
20
SystemOutlog
Explanation
The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree
Remediation
WAS administrative security may be turned on but then write access to JNDI should be granted to the everyone group This is achieved using the WAS Admin console in the Environment‐gtNaming‐gtCORBA Naming Service Group section Group EVERYONE has to be added with Cos Naming Read Write Create Delete authorization
E ODM solver does not start

Symptoms

Solve jobs all end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.

Explanation

The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.

Remediation

Run vcredist_x86.exe from the redist directory of the ODM Enterprise Developer edition on all machines where the Optimization Server will execute ODM solve jobs.
B Protecting the system

This diagram shows the dependencies of ODME components on high-level middleware components.

Figure: Component dependency diagram

These dependencies show the key consequences of the failure of any component. For example, if the application server for the Job Processor fails, then so does the Job Processor application and any associated Optimization Engines. If the database manager for the Jobs DB fails, then the Jobs DB, Job Manager, Job Processor and Optimization Engine all fail. While two logical application servers are depicted, the solution may be deployed on a single physical application server instance; the same is true for the two logical database managers. This would of course mean that a failure could have a greater impact.

1) Protecting the ability to run jobs

Assuming job records are available in the Jobs database, the ability to run optimization solve jobs depends, from a middleware perspective, on the following being protected:

• The application servers used by the Job Processor
• The database servers used for the Jobs database
• The database servers used for the Scenario database
• The physical servers and network used by each of the above

2) Protecting the ability to manage jobs

The ability to manage jobs is provided by the following logical components:

• Job Manager application
• Jobs database

From a middleware perspective, the following needs to be protected:

• The HTTP server used by the Job Manager
• The application server used by the Job Manager
• The database server used for the Jobs database
• The physical servers and network used by each of the above
C Sample HA topology

Here is a sample topology which can be used to protect the ODME solution. This logical topology consists of:

• Optimization Servers, each running a Java EE application server with the Job Processor and Job Manager applications installed. Each Optimization Server will host one or more Optimization Engines. Both Optimization Servers can be operational at the same time: if one Optimization Server fails, the other will continue taking jobs from the Jobs database. Any number of Optimization Servers could theoretically be used.

• A database server hosting both the Jobs and Scenario databases. Keeping multiple copies of the same database active and up-to-date could be very difficult, so instead a passive backup should be kept. This backup needs to be up-to-date and ready to become active if the primary database server fails.

• A load balancing server, which can route HTTP traffic to either of the two Job Managers. The load balancer also needs to be backed up. This backup can also be passive, ready to become active if the primary load balancer fails.
D IBM middleware protection strategies

There are many ways in which IBM middleware can be protected, including (but not limited to):

• Software
  • WebSphere Application Server Network Deployment allows application servers to be clustered, and manages high availability across Java EE applications.
  • DB2® can be used to keep an up-to-date replica of a database on a separate server using its High Availability Disaster Recovery (HADR) feature.
  • Tivoli System Automation provides advanced clustering support for managing highly available systems.
• Hardware
  • A PowerHA solution provides a cluster of IBM Power servers using shared disks; if one server (either software processes or hardware) fails, the other takes over. PowerHA SystemMirror is available for AIX, Linux and IBM i.
  • Disk technology such as a Redundant Array of Independent Disks (RAID).
For example, the following describes a software-only topology to enable high availability using WebSphere Application Server Network Deployment and DB2 HADR.

Figure 1 ODME Software-only HA Topology example

This topology consists of:

• Optimization Servers running WebSphere Application Server (WAS). Each Optimization Server runs a single WAS server as part of a cluster, and the ODM Job Manager and Job Processor applications are installed into the cluster. Each Optimization Server hosts one or more ODM Optimization Engines. A WAS Service Integration Bus (SIBus) is configured to allow JMS communication between Job Managers and Job Processors; the SIBus uses a single messaging engine with an HADR database store, which WAS will automatically move to the other WAS server in the event of a failure. In case of primary database failure, DB2 HADR will switch to the alternate standby database server.

• Database servers running DB2 Enterprise Server Edition. Both servers run the same software, with one acting as the primary database server. The primary database server replicates all database updates on the standby server using DB2's High Availability Disaster Recovery (HADR) feature. In addition, Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database.

• A load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests to one of the WAS servers. In this configuration, the WAS server is chosen on a round-robin basis.
E System failure detection

A key factor in creating a highly available system is how quickly you can recover from a failure. The solution might be able to cope with the failure of one component, but coping with two or more may be difficult, so detecting and recovering from failures is critical.

It is important to monitor at many levels: a failure could occur in the ODME application, the hosting application server, the operating system, the physical hardware of the server, or a network connection.

There are many software solutions for monitoring middleware, such as IBM Tivoli Monitoring.
F Special considerations for ODM Clients

ODM client applications, such as ODM Planner Studio, have direct access to the Scenario database defined in the odmapp deployment settings.
Figure: ODM application configuration (odmapp) overview. On the development side, the ODM Enterprise IDE (Java Development Tools, OPL Studio, ODM Editors) produces the odmapp configuration; on the deployment side, ODM Studio (Planner and Reviewer Editions), custom clients and batch files, and the Optimization Server all use odmapp files to read/write the ODM Repository (Scenario DB) and run solves.
The odmapp files generated with the ODME IDE include their own Scenario database access definitions, which are configured independently from the Optimization Server's Jobs database. When an odmapp is intended to take advantage of HA recovery of its Scenario DB, its data source definitions must be enhanced with HA-specific settings that enable switching database operations to the alternate DB instance. This takes HA recovery into account both when the odmapp is used from within Planner Studio and when it is used for solving on an Optimization Server.
2 Configuring a HA ODM Enterprise environment

A Introduction

This document describes a sample configuration of IBM ILOG ODM Enterprise as part of a highly available (HA) solution. There are many ways to provide high availability using various combinations of specialized hardware and software; this document describes a software-based solution using the following products:

• IBM ILOG ODM Enterprise V3.3.0.1
• IBM WebSphere Application Server Network Deployment V6.1 Fix Pack 25
• IBM HTTP Server V6.1 Fix Pack 25
• IBM DB2 Enterprise Server Edition V9.5 Fix Pack 3

This document does not provide an exhaustive step-by-step guide, but instead highlights specific considerations for configuring HA with the products listed above. Links are provided to product documentation, articles and Redbooks which describe the steps in more detail.
The following configuration steps describe how to configure the sample topology depicted in Figure 1, "ODME Software-only HA Topology example", above. This topology consists of:

• Optimization Servers running WebSphere Application Server (WAS), with each Optimization Server running a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster.

• Database servers running DB2 Enterprise Server Edition, with both servers running the same software in an active/passive HADR setup where the primary database server replicates all database updates to the standby server.

• A load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests between the WAS servers on a weighted round-robin basis.

Not represented in the previous topology: ODM client applications will be configured to benefit from the automatic client reroute offered by DB2 HADR, which recovers from a loss of connection to the DB2 server by rerouting the connection to an alternate server.
B Configuring DB2 HADR

DB2 has a feature called High Availability Disaster Recovery (HADR) which provides a high-performance replication system. A DB2 HADR system consists of two database servers, one active and one standby. Any changes made to the active system will also be replicated on the standby system. At any point, an administrator can instruct the standby system to "take over" as the primary; after this happens, the roles of the two systems are swapped.

DB2 HADR Requirements

Before installing DB2 with HADR as the ODME application datastore, you need to be aware of these basic requirements for both the primary and standby DB2 servers:

• Identical operating system version and patch level.
• The primary and standby server machines should be connected with a high-speed TCP/IP connection, and be reachable by TCP/IP from the client application.
• Identical DB2 version and patch level, software bit size (32-bit or 64-bit), and installation path.
DB2 HADR Setup

1 Install DB2 UDB Enterprise Server Edition on both the primary and standby machines.

Tip: Before testing the DB2 HADR takeover behaviour, verify that the connection between the DB2 HADR primary and standby machines works well.

2 Start the DB2 servers on both machines if they are not already running.

3 Create your database and the required tables on the primary machine only; the databases on secondary machines will be cloned from the primary machine. (See the DB2 Information Center for detailed installation information.)

• Optimization Server DB: used to store ODME jobs. Use the scripts provided with ODM Enterprise (typically server/database/db2-create-tables.sql) to create the JOBS database tables. Make a note of the user ID that you use to create the tables, because it is used in the table qualifier and schema.

• Scenario database: used to store ODME scenario data. The database tables themselves will be initialized when developing the ODM application using the ODME IDE.

• SIBus database: used by the WAS Service Integration Bus.

Tip: The DB2 logs need to be of a sufficient size, especially for the scenario database, which receives substantial updates. Make sure you set the database logging to archive logging rather than the default circular logging, because otherwise it will not be possible to enable HADR.

4 Configure HADR for each database from the primary machine using the Setup HADR wizard.

Tip: The easiest way to create the databases on the secondary machines is to do it during the HADR setup process by using the backup method. During HADR setup you may be asked for the peer window parameter; you can leave it at the default value of 0.
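As an illustration, step 4 can also be performed from the DB2 command line instead of the wizard. The outline below is a minimal sketch for one database; the host names (db2primary, db2standby), service ports, instance name and paths are placeholders, not values from this document — adapt them to your environment and consult the DB2 Information Center for the full procedure.

```shell
# On the primary: enable archive logging (required for HADR), then back up
db2 update db cfg for JOBSDB using LOGARCHMETH1 DISK:/db2/archivelogs
db2 backup db JOBSDB to /db2/backup

# On the standby: restore the backup to clone the database
db2 restore db JOBSDB from /db2/backup

# On each machine: point the local side at the remote side
# (shown here for the standby; swap hosts/ports on the primary)
db2 update db cfg for JOBSDB using HADR_LOCAL_HOST db2standby HADR_LOCAL_SVC 55001 \
    HADR_REMOTE_HOST db2primary HADR_REMOTE_SVC 55000 HADR_REMOTE_INST db2inst1

# Start HADR on the standby first, then on the primary
db2 start hadr on db JOBSDB as standby
db2 start hadr on db JOBSDB as primary

# Later, to swap roles (issued on the standby):
db2 takeover hadr on db JOBSDB
```

Repeat the same sequence for the Scenario and SIBus databases.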
C Configuring WebSphere Application Server and HTTP Server

1) Overview

WebSphere Application Server Network Deployment allows multiple servers to be clustered together. Installing a Java EE application into the cluster performs the installation on each cluster member.

The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other. To use JMS in a clustered WAS environment, a service integration bus (SIBus) is used, with each server added as a clustered bus member. In our architecture, only one server needs to host a messaging engine; in the event of a failure in that server, WAS will move the messaging engine to another server. To support this, each WAS server must be able to access the SIBus data store, so in this topology the data store will be hosted in a DB2 database.
2) Procedure

The following instructions extend the single-server instructions provided with ODME 3.3.0.1, with a focus on differences specific to a clustered deployment.

a) Install WAS 6.1 Network Deployment.

Tip: Install the deployment manager node first and start it. For the other nodes used in the optimization cluster, select a "Custom" environment in the profile manager wizard, which will add the new node into the cell. The deployment manager and cluster nodes should be created with security disabled.

b) Start and connect to the deployment manager console, create a new cluster in Servers => Clusters, and define cluster members for the nodes created earlier.

c) Create a "DB2 Universal JDBC Driver Provider (XA)" provider at the cluster scope in Resources => JDBC => JDBC Providers, and specify database class path information for the cluster nodes.

d) With this provider, create new JDBC data sources at the cluster scope for each HA database used by the Optimization Server cluster: the Jobs and SIBus databases. Create the data sources with all the settings that pertain to the primary DB2 host; alternate (standby) database definitions will be specified through additional DB connection properties.
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connections between the WebSphere Application Server hosts and the DB2 HADR primary and standby hosts work well.

The JNDI name to use for the Jobs DB should be OptimizationServerDB, which is the default binding name used in the Optimization Server enterprise modules.

e) Set the custom properties of these JDBC data sources:

• currentSchema: the schema used when creating the DB2 database. This schema is by default the user ID that you used to create the Jobs DB tables.
• clientRerouteAlternateServerName: standby server name for client reroute. This is the HADR standby host name.
• clientRerouteAlternatePortNumber: standby server port number for client reroute.
• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails. A good default can be 2.
• retryIntervalForClientReroute: amount of time (in seconds) to sleep before retrying again. A good default can be 15.
• fullyMaterializeInputStreams: set to true.
• progressiveStreaming: disable by setting it to a value of 2. This will prevent odmapp unpacking issues.
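For reference, the same client-reroute properties can also be expressed in a standalone DB2 JDBC (jcc) connection URL, which can be handy when testing reroute behaviour outside WAS. The property names are those listed above; the host names and port here are placeholders:

```
jdbc:db2://db2primary.example.com:50000/JOBSDB:clientRerouteAlternateServerName=db2standby.example.com;clientRerouteAlternatePortNumber=50000;maxRetriesForClientReroute=2;retryIntervalForClientReroute=15;fullyMaterializeInputStreams=true;progressiveStreaming=2;
```

In the WAS topology described here, these values are set as data source custom properties rather than in the URL.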
f) Create the SIBus, named OptimizationServerBus, in Service integration => Buses, with no security enabled.

g) Set bus members for OptimizationServerBus at the cluster scope, and use the data store for the HA SIBus database created earlier. You may need to specify an authentication alias for the SIBus database connection.

h) Create JMS resources in Resources => JMS for the cluster scope, using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section):

• OptimizationServerTopic, named jms/optimserverTopic in JNDI
• OptimizationServerTopicConnectionFactory, named jms/optimserverTopicConnectionFactory in JNDI
• OptimizationServerQueueConnectionFactory, named jms/optimserverQueueConnectionFactory in JNDI
• OptimizationServerTopicSpec, named jms/optimserverTopicSpec and pointing to topic jms/optimserverTopic

i) Deploy optimserver-mgmt-ear and optimserver-processor-ear at the cluster scope.

j) Install IBM HTTP Server 6.1.
Tip:

• Note that the HTTP <port> defined during install is the one that will be used in the Optimization Server connection URL http://server:<port>/optimserver to deploy your developed ODM application.

• We recommend not installing the WAS plug-in as part of the IBM HTTP Server install, but rather launching it as a separate installation afterwards, because it makes configuration easier.

k) Install the Web server plug-ins for IBM WebSphere Application Server V6.1. At the beginning of the plug-in installation, select the check box to view the installation roadmap, then click Next. In this roadmap, identify your installation scenario and follow the installation steps.

l) Start the cluster nodes.

m) Start the cluster in Servers => Clusters.

n) Check that the Optimization Server installation is correct by going to http://server:<port>/optimserver/console.
Useful links:

• "IBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Server"
• "Roadmap: Installing the Network Deployment product"
D Configuring the ODM Application

When the ODM Repository relies on a DB2 HADR environment, the ODM Application configuration must be updated to fully benefit from automatic client reroute.

Automatic client reroute is a DB2 feature that enables a DB2 client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server. This connection rerouting occurs automatically.

To fully support this feature, the alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application. Example:

• clientRerouteAlternateServerName: alternate server name for client reroute
• clientRerouteAlternatePortNumber: alternate port number for client reroute
• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails
• retryIntervalForClientReroute: amount of time (in seconds) to sleep before retrying again

Notes:

• This property list can be extended with other DB2 properties to match your needs. The list is then passed to the ODM repository and the underlying JDBC driver. Additional property descriptions can be found at http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp (HADR/automatic client reroute documentation, topic c0011976).
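A minimal sketch of such repository properties follows. The host name and port are placeholders, and the exact key/value syntax depends on the deployment settings (odmsds) schema — check the ODM Enterprise deployment documentation for your release:

```
clientRerouteAlternateServerName=db2standby.example.com
clientRerouteAlternatePortNumber=50000
maxRetriesForClientReroute=2
retryIntervalForClientReroute=15
```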
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster

This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Application Server Network Deployment and DB2 HADR.

This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS + WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in an active/standby HADR configuration, and ODME 3.3.0.1.

When deploying on a multi-host cluster, the additional benefits fall into two categories: Workload Management (WLM) and High Availability (HA).

WLM is the ability to spread the processing workload across all cluster members, and is a feature brought by WebSphere's Network Deployment edition.

HA is the ability for the system to continue operating when some of its hardware, network or software components encounter a failure.
A Workload Management capabilities of ODME 3.3.0.1

When running ODME in a multi-node clustered environment, there are two different types of workload being processed by the Optimization Server: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.

1) Job Control and Administrative requests Workload Management

Job control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol, and will be workload-managed by the regular IHS + WAS HTTP load balancing scheme.

Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS will be round-robin, and will apply to all job control activity, whether it originates from ODM Studio or the Solve API.

The Optimization Server Admin console is a stateless web application and will also be load balanced in a round-robin fashion by WAS.
2) Job solving Workload Management

The solver Optimization Engine processes are long-running, and their run duration may vary a lot across job types. They are managed by the Job Processor independently on each node.

Each Job Processor will pull jobs from the solve-pending queue in a first-in/first-out fashion whenever it has solve slots available. The resulting overall load balancing is a first-come/first-served scheme, where solves will be processed across the nodes depending on their capacity.

On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained.

A typical timeline of job control and job solves is illustrated in Figure 2 below: the job submit, enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing.

Note that in the case depicted here, DB2 HADR is set up in active/standby mode, so only one of the two DB2 nodes will be handling DB requests.

A typical balancing of load is illustrated below. The yellow line represents the queue depth, starting at 500 jobs and draining until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots. Overall, both processors will be handling 2 or 3 jobs until all are processed. The diagram shows the load for short jobs of even solve durations; the X axis units are events, not linear time.
Figure 3 Typical balance of load on ODME 3.3.0.1 (the queue depth drains from 500 to 0 while server1 and server2 each run 2 or 3 jobs)

The irregularities towards the end are due to some administrator-triggered cleansing of the processed jobs from the log.
B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA

As detailed in the "Protecting the system" section, running ODME 3.3.0.1 in a clustered environment protects the overall system from the failure of some of its components. This gives the system the ability, on one hand, to continue operating across those failures, and on the other hand, to perform some level of recovery of the processing that was in flight at the point of failure.

1) Operations continuity

For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to display the Admin console, keep the capacity to accept new job submissions, and continue processing queued jobs.
Operations continuity across WAS failures

Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails: the surviving cluster member continues processing.

Figure 4 Typical ODME 3.3.0.1 Operations Continuity timeline across WAS node failure (the IHS WLM plug-in routes the client's SOAP/HTTP requests across WAS 1 and WAS 2; job A is created on WAS 1 and stored in the JOBS DB; after WAS 1 stops, WAS 2 creates and solves job B, reads job A's queued status from the JOBS DB, and solves job A through to completion, with the standby reporting read status and progress)
Note that the Optimization Server Admin console will also continue to be served by the remaining server of the cluster.
Operations continuity across DB2 failures

When DB2 HADR has been set up, and the JOBS DB and odmapp datasources have been given appropriate alternate server definitions, the same kind of behavior will be observed: the Optimization Server will switch to the alternate DB instance for jobs control and the Admin console (JOBS DB) when the primary one fails. Newly picked-up jobs will likewise have their solves use the alternate instance through the odmapp datasource definitions.
2) Operations recovery

ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving will result in the solve being aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on the way the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.

Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-jobs detection, which detects a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, marks the jobs as recoverable, and requeues them so that they are solved again.
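The failed-jobs detection principle described above can be sketched as follows. This is a toy illustration of the heartbeat-timeout idea, not ODME's actual implementation; the 60-second timeout and the variable names are assumptions.

```shell
# Toy sketch: a job whose last heartbeat is older than TIMEOUT seconds
# is considered failed, marked recoverable, and requeued.
TIMEOUT=60                       # assumed detection threshold (seconds)
now=$(date +%s)
last_heartbeat=$((now - 120))    # example: heartbeat stopped 120 s ago

if [ $((now - last_heartbeat)) -gt "$TIMEOUT" ]; then
  status="recoverable"           # mark the job and put it back on the queue
else
  status="in-process"
fi
echo "job status: $status"
```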
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment

There are a certain number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.

Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but become more prevalent when seamless continuous operation and failure recovery is expected.

Whenever possible, we provide some troubleshooting tips to alleviate or circumvent the issues.
A Job processor fails to extract OPL binaries upon restart

Symptoms

• The optimserver-processor-ear enterprise application is not started on the server, although optimserver-mgmt-ear is running.
• Queued jobs are not processed (they remain in the NOT_STARTED state).
• Only one of the cluster members runs jobs, although the queue is saturated.
• SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70/libcplex121.a (Text file busy)

Explanation

The OPL binaries are cached and locked against direct writing by the AIX operating system. The job processor EAR module is thus not allowed to extract them again, and fails during its initialization.
Remediation

Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.

To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), right after restarting and before any solver instance is started, change the mode of the files in the above directory to 750 (instead of the default 755):

chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70/*

This will force AIX not to cache the files.
B Solve cannot recover after WAS job-processor or odmsolver stops

Symptoms

When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.

Explanation

In some circumstances, the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, but the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case, the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock.

Remediation

The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.
C Bad error reporting when Optimization Server loses connection from the Repository DB
Symptoms
The Optimization Server Admin console displays an Error 500 [code=javaxtransactionRollbackExceptionparams=] when connection to the JOBS DB is lost
Explanation
The JOBS DB connection is lost and The Optimization Server Admin console cannot extract the jobs queue status for display
Remediation
This error is transient refresh the Optimization Server Admin console after the JOBS DB will have recovered
D ODME cannot start when WAS administrative security is enabled
Symptoms
Although WAS with administrative security is not currently supported by ODME 3301 deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled
This results in an exception being raised during startup of Optimization Server reported in the
20
SystemOutlog
Explanation
The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree
Remediation
WAS administrative security may be turned on but then write access to JNDI should be granted to the everyone group This is achieved using the WAS Admin console in the Environment‐gtNaming‐gtCORBA Naming Service Group section Group EVERYONE has to be added with Cos Naming Read Write Create Delete authorization
E ODM solver does not start
Symptoms
Solve jobs all end up in FAILED state and the log contains a line starting with javaioIOException CreateProcess and ending in error=14001
Explanation
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running
Remediation
Run redistvcredistvcredist_x86exe from the ODM Enterprise Developer edition redist directory on all machines where Optimization Server will execute ODM solve jobs
21
From a middleware perspective, the following needs to be protected:
• HTTP server used by the Job Manager
• Application server used by the Job Manager
• The database server used for the Jobs database
• The physical servers and network used by each of the above
C Sample HA topology
Here is a sample topology which can be used to protect the ODME solution.

This logical topology consists of:

Optimization Servers, each running a Java EE application server with the Job Processor and Job Manager applications installed. Each Optimization Server will host one or more Optimization Engines. Both Optimization Servers can be in operation at the same time - if one Optimization Server fails, the other will continue taking jobs from the Jobs Database. Any number of Optimization Servers could theoretically be used.

Database server hosting both the Jobs and Scenario databases. Keeping multiple copies of the same database active and up-to-date could be very difficult, so instead a passive backup should be kept. This backup needs to be up-to-date and ready to become active if the primary database server fails.

Load balancing server, which can route HTTP traffic to either of the two Job Managers. The load balancer also needs to be backed up. This backup can also be passive, ready to become active if the primary load balancer fails.
D IBM middleware protection strategies
There are many ways in which IBM middleware can be protected, including (but not limited to):

• Software
  • WebSphere Application Server Network Deployment allows application servers to be clustered, and manages high availability across Java EE applications.
  • DB2® can be used to keep an up-to-date replica of a database on a separate server using its High Availability Disaster Recovery (HADR) feature.
  • Tivoli System Automation provides advanced clustering support for managing highly available systems.
• Hardware
  • A PowerHA solution provides a cluster of IBM Power servers using shared disks. If one server (either software processes or hardware) fails, the other takes over. PowerHA SystemMirror is available for AIX, Linux and IBM i.
  • Disk technology such as a Redundant Array of Independent Disks (RAID).
For example, the following describes a software-only topology to enable high availability using WebSphere Application Server Network Deployment and DB2 HADR.
Figure 1 ODME Software‐only HA Topology example
This topology consists of:

Optimization Servers running WebSphere Application Server (WAS). Each Optimization Server runs a single WAS server as part of a cluster, and the ODM Job Manager and Job Processor applications are installed into the cluster. Each Optimization Server hosts one or more ODM Optimization Engines. A WAS Service Integration Bus (SIBus) is configured to allow the JMS communication between Job Managers and Job Processors - the SIBus uses a single messaging engine with an HADR database store, which WAS will automatically move to the other WAS server in the event of a failure. In case of primary database failure, DB2 HADR will switch to the alternate (standby) database server.

Database servers running DB2 Enterprise Server Edition. Both servers run the same software, with one acting as the primary database server. The primary database server replicates all database updates on the standby server using DB2's High Availability Disaster Recovery (HADR) feature. In addition, Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database.
Load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests to one of the WAS servers. In this configuration, the WAS server is chosen on a round-robin basis.
E System failure detection
A key factor in creating a highly available system is how quickly you can recover from a failure. The solution might be able to cope with the failure of one component, but two or more may be difficult, so detecting and recovering from failures is critical.

It is important to monitor at many levels: a failure could occur in the ODME application, the hosting application server, the operating system, the physical hardware for the server, or with a network connection.

There are many software solutions for monitoring middleware, such as IBM Tivoli Monitoring.
F Special considerations for ODM Clients
ODM client applications such as ODM Planner Studio have direct access to the Scenario database defined in the odmapp deployment settings.
[Figure: ODM application configuration (odmapp). On the development side, an IT developer uses the ODM Enterprise IDE (Java Development Tools, OPL Studio, ODM Editors) to produce the odmapp configuration. On the deployment side, ODM Studio (Planner and Reviewer Editions), the Optimization Server, and custom clients and batch files consume the odmapp, solve, and read/write the ODM Repository (Scenario DB).]
The odmapp files generated with the ODME IDE include their own Scenario database access definitions, which are configured independently from the Optimization Server JOBS database.

When an odmapp is intended to take advantage of HA recovery of its Scenario DB, its Data Source definitions must be enhanced with HA-specific settings that enable switching database operations to the alternate DB instance. This enables HA recovery to be taken into account both when the odmapp is used from within Planner Studio and when it is used for solving on an Optimization Server.
2 Configuring a HA ODM Enterprise environment
A Introduction
This document describes a sample configuration of IBM ILOG ODM Enterprise as part of a highly available (HA) solution. There are many ways to provide high availability using various combinations of specialized hardware and software. This document describes a software-based solution using the following products:
• IBM ILOG ODM Enterprise V3.3.0.1
• IBM WebSphere Application Server Network Deployment V6.1.0.25
• IBM HTTP Server V6.1.0.25
• IBM DB2 Enterprise Server Edition V9.5 Fix Pack 3
This document does not provide an exhaustive step-by-step guide, but instead highlights specific considerations for configuring HA with the products listed above. Links are provided to product documentation, articles and Redbooks which describe the steps in more detail.
The following configuration steps describe how to configure the sample topology depicted in Figure 1 (ODME Software-only HA Topology example) above. This topology consists of:
Optimization Servers running WebSphere Application Server (WAS). Each Optimization Server runs a single WAS server as part of a cluster, and the ODM Job Manager and Job Processor applications are installed into the cluster.
Database servers running DB2 Enterprise Server Edition, with both servers running the same software in an active/passive HADR setup, where the primary database server replicates all database updates to the standby server.
Load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests between the WAS servers on a weighted round-robin basis.
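The weighted round-robin behavior of the plug-in can be illustrated with a small sketch. This is a simplified model, not the plug-in's actual implementation; the server names and weights are hypothetical (the real plug-in reads them from plugin-cfg.xml):

```python
def weighted_round_robin(servers):
    """Yield server names in proportion to their configured weights,
    cycling forever (a simplified model of weighted round-robin)."""
    while True:
        credits = dict(servers)  # remaining requests per server this pass
        while any(c > 0 for c in credits.values()):
            for name in credits:
                if credits[name] > 0:
                    credits[name] -= 1
                    yield name

# Hypothetical cluster members; equal weights give plain round-robin.
rr = weighted_round_robin([("was1", 2), ("was2", 2)])
first_four = [next(rr) for _ in range(4)]
print(first_four)  # alternates between was1 and was2

# Unequal weights skew the distribution towards the heavier member.
rr2 = weighted_round_robin([("was1", 2), ("was2", 1)])
six = [next(rr2) for _ in range(6)]
print(six)  # was1 receives two requests for every one sent to was2
```

With equal weights, as in the topology above, this degenerates to the plain round-robin scheme mentioned earlier.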
Not represented in the previous topology: ODM Client Applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR, which recovers from a loss of connection to the DB2 server by rerouting the connection to an alternate server.
B Configuring DB2 HADR
DB2 has a feature called High Availability Disaster Recovery (HADR), which provides a high-performance replication system. A DB2 HADR system consists of two database servers, one active and one standby. Any changes made to the active system will also be replicated on the standby system. At any point, an administrator can instruct the standby system to "take over" as the primary - after this happens, the roles of the two systems are swapped.
DB2 HADR Requirements
Before installing DB2 with HADR as the ODME application datastore, you need to be aware of these basic requirements for both the primary and standby DB2 servers:
• Identical operating system version and patch level.
• The primary and standby server machines should be connected with a high-speed TCP/IP connection, and be reachable by TCP/IP from the client application.
• Identical DB2 version and patch level, software bit size (32-bit or 64-bit), and installation path.
DB2 HADR Setup
1 Install DB2 UDB Enterprise Server Edition on both the primary and standby machines.

Tip: Before testing the DB2 HADR takeover behaviour, you need to verify that the connection between the DB2 HADR primary and standby machines works well.

2 Start the DB2 servers on both machines if they are not already running.
3 Create your database and the required tables on the primary machine only. The databases on secondary machines will be cloned from the primary machine. (See the DB2 Information Center for detailed installation information.)

• Optimization Server DB - used to store ODME jobs. Use the scripts provided with ODM Enterprise (typically server/database/db2-create-tables.sql) to create the JOBS database tables. Make a note of the user ID that you use to create the tables, because it is used in the table qualifier and schema.
• Scenario database - used to store ODME scenario data. The database tables themselves will be initialized when developing the ODM application using the ODME IDE.
• SIBus database - used by the WAS Service Integration Bus.

Tip: The DB2 logs need to be of a sufficient size, especially for the scenario database, to which there are important updates. Make sure you set the database logging to archive logging rather than the default circular logging, because otherwise it will not be possible to enable HADR.
4 Configure HADR for each database from the primary machine, using the Setup HADR wizard.

Tip: The easiest way to create the databases on the secondary machines is to do it during the HADR setup process, by using the backup method. During HADR setup you may be asked for the peer window parameter; you can leave it at the default value of 0.
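For reference, the configuration that the Setup HADR wizard applies can also be issued through the DB2 command line. The sketch below builds such a command sequence for one database; the hostnames, ports, instance name, and archive log path are placeholders for your environment, not values taken from this document:

```python
def hadr_commands(db, local_host, local_port, remote_host, remote_port,
                  remote_inst, role):
    """Build the DB2 CLP commands that configure HADR for one database.
    All host/port/instance/path values are environment-specific placeholders."""
    return [
        # Archive logging is a prerequisite for HADR (see the Tip above);
        # the DISK: path here is purely illustrative.
        f"db2 update db cfg for {db} using LOGARCHMETH1 DISK:/db2/archlogs",
        f"db2 update db cfg for {db} using HADR_LOCAL_HOST {local_host}",
        f"db2 update db cfg for {db} using HADR_LOCAL_SVC {local_port}",
        f"db2 update db cfg for {db} using HADR_REMOTE_HOST {remote_host}",
        f"db2 update db cfg for {db} using HADR_REMOTE_SVC {remote_port}",
        f"db2 update db cfg for {db} using HADR_REMOTE_INST {remote_inst}",
        f"db2 start hadr on db {db} as {role}",
    ]

# Example: the JOBS database on a hypothetical primary machine.
for cmd in hadr_commands("JOBSDB", "dbhost1", 55001, "dbhost2", 55002,
                         "db2inst1", "primary"):
    print(cmd)
```

In practice, HADR is started on the standby machine first (`db2 start hadr ... as standby`, after restoring the backup there), and only then on the primary.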
C Configuring WebSphere Application Server and HTTP Server
1) Overview
WebSphere Application Server Network Deployment allows multiple servers to be clustered together. Installing a Java EE application into the cluster will perform the installation on each cluster member.
The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other. To use JMS in a clustered WAS environment, a service integration bus (SIBus) is used, with each server added as a clustered bus member. In our architecture, only one server needs to host a messaging engine - in the event of a failure in that server, WAS will move the messaging engine to another server. To support this, each WAS server must be able to access the SIBus data store, so in this topology the data store will be hosted in a DB2 database.
2) Procedure
The following instructions extend the single-server instructions provided with ODME 3.3.0.1, with a focus on differences specific to a clustered deployment.
a) Install WAS 6.1 Network Deployment (see the "Roadmap: Installing the Network Deployment product" link below).

Tip: Install the deployment manager node first, and start it. For the other nodes used in the optimization cluster, select a "Custom" environment in the profile manager wizard, which will add the new node into the cell. The deployment manager and cluster nodes should be created with security disabled.
b) Start and connect to the deployment manager console, then create a new cluster in Servers => Clusters and define cluster members for the nodes created earlier.
c) Create a "DB2 Universal JDBC Driver Provider (XA)" provider at the cluster scope in Resources => JDBC => JDBC Providers, and specify database class path information for the cluster nodes.
d) With this provider, create new JDBC Data sources at the cluster scope for each HA database used by the optimization server cluster: the Jobs and the SIBus databases. Create the data sources with all the settings that pertain to the primary DB2 host; alternate (standby) database definitions will be specified through additional DB connection properties.
Tip: Before testing the DB2 HADR takeover behaviour, you need to verify that the connections between the WebSphere Application Server hosts and the DB2 HADR primary and standby hosts work well.
The JNDI name to use for the Jobs DB should be OptimizationServerDB, which is the default binding name used in the Optimization Server enterprise modules.

e) Set the custom properties of these JDBC data sources:

• currentSchema - the schema used when creating the DB2 database. This is by default the user ID that you used to create the Jobs DB tables.
• clientRerouteAlternateServerName - standby server name for client reroute. This is the HADR standby host name.
• clientRerouteAlternatePortNumber - standby server port number for client reroute.
• maxRetriesForClientReroute - limits the number of retries if the primary connection to the server fails. A good default can be 2.
• retryIntervalForClientReroute - amount of time (in seconds) to sleep before retrying again. A good default can be 15.
• fullyMaterializeInputStreams - set to true.
• progressiveStreaming - disable by setting to a value of 2. This will prevent odmapp unpacking issues.
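The reroute-related properties interact roughly as follows: on a connection failure, the driver retries across the primary/alternate pair, sleeping retryIntervalForClientReroute seconds between rounds, up to maxRetriesForClientReroute rounds. The sketch below is a simplified model of that behavior (the real logic lives inside the DB2 JCC driver; the hostnames and the connect stub are hypothetical):

```python
import time

def connect_with_reroute(connect, primary, alternate,
                         max_retries=2, retry_interval=15, sleep=time.sleep):
    """Simplified model of DB2 automatic client reroute: try the primary,
    fall back to the alternate, and retry the pair up to max_retries more
    times with retry_interval seconds between rounds."""
    last_error = None
    for attempt in range(max_retries + 1):
        for host in (primary, alternate):
            try:
                return connect(host)
            except ConnectionError as exc:
                last_error = exc
        if attempt < max_retries:
            sleep(retry_interval)
    raise last_error

# Stub "driver": the primary is down, the standby answers.
def fake_connect(host):
    if host == "dbhost1":
        raise ConnectionError("primary down")
    return f"connected:{host}"

print(connect_with_reroute(fake_connect, "dbhost1", "dbhost2",
                           sleep=lambda s: None))  # -> connected:dbhost2
```

This is why the defaults suggested above (2 retries, 15 seconds) bound the worst-case connection delay during an HADR takeover.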
f) Create the SIBus named OptimizationServerBus in Service integration => Buses, with no security enabled.
g) Set Bus members for OptimizationServerBus at the cluster scope, and use the Data store for the HA SIBus database created earlier. You may need to specify an authentication alias for the SIBus database connection.
h) Create the JMS resources in Resources => JMS at the cluster scope, using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section):

• OptimizationServerTopic, named jms/optimserverTopic in JNDI
• OptimizationServerTopicConnectionFactory, named jms/optimserverTopicConnectionFactory in JNDI
• OptimizationServerQueueConnectionFactory, named jms/optimserverQueueConnectionFactory in JNDI
• OptimizationServerTopicSpec, named jms/optimserverTopicSpec and pointing to topic jms/optimserverTopic
i) Deploy optimserver-mgmt-ear and optimserver-processor-ear at the cluster scope.

j) Install IBM HTTP Server 6.1.
Tip:
• Note that the HTTP <port> defined during the install is the one that will be used in the Optimization Server connection URL http://<server>:<port>/optimserver to deploy your developed ODM Application.
• We recommend not installing the WAS plug-in as part of the IBM HTTP Server install, but rather launching it as a separate installation afterwards, because it makes configuration easier.
k) Install the Web server plug-ins for IBM WebSphere Application Server V6.1. At the beginning of the plug-in installation, select the check box to view the installation roadmap, then click Next. In this roadmap, identify your installation scenario and follow the installation steps.
l) Start the cluster nodes.

m) Start the cluster in Servers => Clusters.

n) Check that the Optimization Server installation is correct by going to http://<server>:<port>/optimserver/console.
Useful links:
• "IBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Server"
• "Roadmap: Installing the Network Deployment product"
D Configuring the ODM Application
When the ODM Repository relies on a DB2 HADR environment, the ODM Application configuration must be updated to fully benefit from automatic client reroute.
Automatic client reroute is a DB2 feature that enables a DB2 client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server. The rerouting occurs without any client intervention.
To fully support this feature, the alternate server name and port should be specified with additional repository properties in the deployment settings file (.odmsds) of your ODM Application. Example:

• clientRerouteAlternateServerName - alternate server names for client reroute
• clientRerouteAlternatePortNumber - alternate port numbers for client reroute
• maxRetriesForClientReroute - limits the number of retries if the primary connection to the server fails
• retryIntervalForClientReroute - amount of time (in seconds) to sleep before retrying again
Notes:
• This property list can be extended with other DB2 properties to match your needs. The list is then passed to the ODM repository and the underlying JDBC driver. Descriptions of additional properties can be found at http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Network Deployment and DB2 HADR.

This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS + WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in an active/standby HADR configuration, and ODME 3.3.0.1.
When deploying on a multi-host cluster, the additional benefits fall into two categories: Workload Management (WLM) and High Availability (HA).

WLM is the ability to spread the processing workload across all cluster members, and is a feature brought by WebSphere's Network Deployment version.

HA is the ability for the system to continue operating when some of its hardware, network or software components encounter a failure.
A Workload Management capabilities of ODME 3.3.0.1
When running ODME in a multi-node clustered environment, there are two different types of workload processed by the Optimization Server: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.
1) Job Control and Administrative requests Workload Management
Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol, and will be workload-managed by the regular IHS + WAS HTTP load balancing scheme.

Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS will be round-robin, and will apply to all Job Control activity, whether it originates from ODM Studio or the SolveAPI.

The Optimization Server Admin console is a stateless web application and will also be load balanced in a round-robin fashion by WAS.
2) Job solving Workload Management
The solver (Optimization Engine) processes are long-running, and their run duration may vary a lot across job types. They are managed by the Job Processor independently on each node.

Each Job Processor will pull jobs from the solve-pending queue in a first-in/first-out fashion whenever there are solve slots available. The resulting overall load balancing is a first-come/first-served scheme, where solves are processed across the nodes depending on their capacity.
On lightly loaded Optimization Server clusters, where the jobs processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained.
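This first-come/first-served scheme can be sketched as a small simulation: two job processors, each with three solve slots, pull from a shared FIFO queue. The node names and slot count are illustrative (three slots matches the load chart discussed below); the real scheduling is done by the Job Processors against the JOBS DB:

```python
from collections import deque

def assign_jobs(queue, processors, slots_per_processor=3):
    """Each processor pulls jobs FIFO whenever it has a free solve slot.
    Returns the initial assignment {processor: [jobs]} for a full pass."""
    assignment = {p: [] for p in processors}
    free = {p: slots_per_processor for p in processors}
    while queue and any(free.values()):
        for p in processors:
            if queue and free[p] > 0:
                assignment[p].append(queue.popleft())
                free[p] -= 1
    return assignment

jobs = deque(f"job{i}" for i in range(10))  # queue deeper than total capacity
result = assign_jobs(jobs, ["node1", "node2"])
print(result)     # each node fills its 3 slots from the front of the queue
print(len(jobs))  # 4 jobs remain queued until slots free up
```

As in the text above: when the queue is deeper than the total slot capacity, both nodes end up evenly loaded; when only one or two jobs are queued, a single node may absorb them all.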
A typical timeline of job control and job solves is illustrated in Figure 2 below: the job submit, enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing.

Note that in the case depicted here, DB2 HADR is set up in active/standby, so only one of the two DB2 nodes will be handling DB requests.

A typical balancing of load is illustrated below. The yellow line represents the queue depth, starting at 500 jobs and consuming the load until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots. Overall, both processors will be handling 2 or 3 jobs until all are processed. The diagram shows the load for short jobs of even solve durations; the X-axis units are events, not linear time.
Figure 3 Typical balance of load on ODME 3.3.0.1
The irregularities towards the end are due to some administrator-triggered cleansing of the processed jobs from the log.

B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA
As detailed in the "Protecting the system" section, running ODME 3.3.0.1 in a clustered environment protects the overall system from failure of some of its components. This gives the system the ability, on one hand, to continue operating across those failures and, on the other hand, to perform some level of recovery on the processing that was in flight at the point of failure.

1) Operations continuity
For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to display the Admin console, keep the capacity to accept new job submissions, and continue processing queued jobs.
Operations continuity across WAS failures
Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails: the surviving cluster member will continue processing.
Figure 4 Typical ODME 3.3.0.1 Operations Continuity timeline across WAS node failure
Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster.

Operations continuity across DB2 failures
When DB2 HADR has been set up, and the JOBS DB and odmapp datasources have been set up with appropriate alternate server definitions, the same kind of behavior will be observed: the Optimization Server will switch to the alternate DB instance for Jobs control and the Admin console (JOBS DB) when the primary one fails. Newly picked-up jobs will then be solved against the alternate database instance.
2) Operations recovery
ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving will result in the solve being aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on the way the failure happens. Cases where the jobs cannot be recovered are documented in the next chapter.

Failed-and-recoverable jobs recovery is based on the Optimization Server's built-in failed jobs detection, which will basically detect a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, mark the jobs as recoverable, and requeue them so that they are solved again.
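The detection described above can be sketched as follows. The timeout value and job structure are hypothetical illustrations; the actual detection is internal to the Optimization Server:

```python
def find_recoverable(jobs, now, heartbeat_timeout=60.0):
    """Return the ids of in-process jobs whose last heartbeat is older
    than the timeout; these would be marked recoverable and requeued."""
    return [
        job_id
        for job_id, (state, last_heartbeat) in jobs.items()
        if state == "PROCESSING" and now - last_heartbeat > heartbeat_timeout
    ]

now = 1000.0
jobs = {
    "jobA": ("PROCESSING", 990.0),   # recent heartbeat: solver still healthy
    "jobB": ("PROCESSING", 900.0),   # stale heartbeat: solver likely died
    "jobC": ("NOT_STARTED", 0.0),    # queued jobs have no heartbeat to check
}
print(find_recoverable(jobs, now))  # -> ['jobB']
```

Troubleshooting case B in the next chapter is precisely a situation where this timeout-based detection fires even though the solve actually completed.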
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment
There are a certain number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.

Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but become more prevalent when seamless continuous operations and failure recovery are expected.

Whenever possible, we provide some troubleshooting tips to alleviate or circumvent the issues.
A Job processor fails to extract OPL binaries upon restart

Symptoms

The optimserver-processor-ear Enterprise Application is not started on the server, although optimserver-mgmt-ear is running.

Queued jobs are not processed (they remain in the NOT_STARTED state).

Only one of the cluster members runs jobs, although the queue is saturated.

SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix5.3_7.0/libcplex121.a (Text file busy)
Explanation

The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is thus not allowed to extract them again, and fails during its initialization.
Remediation

Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix5.3_7.0 before starting the WAS server where the Optimization Server is deployed.

To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance): right after restarting, and before any solver instance is started, change the mode of the files in the above directory to 750 (instead of the default 755): chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix5.3_7.0. This will force AIX not to cache the files.
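The warm-restart part of this remediation (forcing mode 750 on the extracted binaries) could be scripted, for example as below. The deployment directory is passed as a parameter, since the real path depends on your installation:

```python
import os
from pathlib import Path

def restrict_opl_binaries(bin_dir):
    """Remediation sketch: set the extracted OPL binaries to mode 0750
    (instead of the default 0755) so that AIX does not cache and lock them,
    allowing a later warm restart of the job processor EAR to re-extract."""
    path = Path(bin_dir)
    if not path.is_dir():
        return 0
    changed = 0
    for f in path.iterdir():
        if f.is_file():
            os.chmod(f, 0o750)  # drop world permissions, as per the chmod above
            changed += 1
    return changed

# Demonstration on a throwaway directory standing in for the deployment path.
import tempfile
with tempfile.TemporaryDirectory() as d:
    demo = Path(d) / "libcplex121.a"
    demo.write_bytes(b"")
    restrict_opl_binaries(d)
    print(oct(demo.stat().st_mode & 0o777))  # -> 0o750
```

Run this (or the equivalent chmod command) right after WAS restarts and before any solver instance starts, as described above.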
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms
When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.
Explanation
In some circumstances, the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, but the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case, the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve will fail, because the scenario has released its solve lock.
Remediation
The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.
C Bad error reporting when the Optimization Server loses its connection to the Repository DB
Symptoms
The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException; params=] when the connection to the JOBS DB is lost.
Explanation
The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.
Remediation
This error is transient; refresh the Optimization Server Admin console after the JOBS DB has recovered.
D ODME cannot start when WAS administrative security is enabled
Symptoms
Although WAS with administrative security is not currently supported by ODME 3.3.0.1, deployers of the Optimization Server in a clustered WAS environment may need to deploy the Optimization Server with security enabled.
This results in an exception being raised during startup of the Optimization Server, reported in SystemOut.log.
Explanation
The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.
Remediation
WAS administrative security may be turned on, but then write access to JNDI should be granted to the everyone group. This is achieved using the WAS Admin console, in the Environment > Naming > CORBA Naming Service Groups section: the group EVERYONE has to be added with Cos Naming Read, Write, Create and Delete authorization.
E ODM solver does not start
Symptoms
Solve jobs all end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.
Explanation
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.
Remediation
Run redist\vcredist\vcredist_x86.exe from the ODM Enterprise Developer edition redist directory on all machines where the Optimization Server will execute ODM solve jobs.
available systems bull Hardware
bull A PowerHA solution provides a cluster of IBM Power servers using shared disks If one server (either software processes or hardware) fails the other takes over Power HA SystemMirror is available for AIX Linux and IBM i
bull Disk technology such as a Redundant Array of Independent Disks (RAID)
For example the following describes a software‐only topology to enable high availability using WebSphere Application Server‐Network Deployment and DB2 HADR
Figure 1 ODME Software‐only HA Topology example
This topology consists of:
• Optimization Servers running WebSphere Application Server (WAS). Each Optimization Server runs a single WAS server as part of a cluster, and the ODM Job Manager and Job Processor applications are installed into the cluster. Each Optimization Server hosts one or more ODM Optimization Engines. A WAS Service Integration Bus (SIBus) is configured to allow JMS communication between Job Managers and Job Processors; the SIBus uses a single messaging engine with a HADR database store, which WAS will automatically move to the other WAS server in the event of a failure. In case of primary database failure, DB2 HADR will switch to the alternate standby database server.
• Database servers running DB2 Enterprise Server Edition. Both servers run the same software, with one acting as the primary database server. The primary database server replicates all database updates on the standby server using DB2's High Availability Disaster Recovery (HADR) feature. In addition, Tivoli System Automation could be used on these servers to detect a failure and instruct the standby server to take over as the primary database.
• Load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests to one of the WAS servers. In this configuration the WAS server is chosen on a round-robin basis.
E System failure detection
A key factor in creating a highly available system is how quickly you can recover from a failure. The solution might be able to cope with the failure of one component, but two or more may be difficult to survive, so detecting and recovering from failures quickly is critical.
It is important to monitor at many levels: a failure could occur in the ODME application, the hosting application server, the operating system, the physical hardware of the server, or in a network connection.
There are many software solutions for monitoring middleware, such as IBM Tivoli Monitoring.
F Special considerations for ODM Clients
ODM client applications, such as ODM Planner Studio, have direct access to the Scenario database defined in the odmapp deployment settings.
[Diagram: ODM application configuration (odmapp) flow. In development, an IT developer uses the ODM Enterprise IDE (Java development tools, OPL Studio, ODM Editors) to produce the odmapp. In deployment, ODM Studio Planner and Reviewer Editions, the Optimization Server, and custom clients and batch files use the odmapp to solve and to read/write the ODM Repository (Scenario DB).]
The odmapp files generated with the ODME IDE include their own Scenario database access definitions, which are configured independently from the Optimization Server JOBS database.
When an odmapp is intended to take advantage of HA recovery of its Scenario DB, its Data Source definitions must be extended with HA-specific settings that enable switching database operations to the alternate DB instance. This lets you take HA recovery into account both when the odmapp is used from within Planner Studio and when it is used for solving on an Optimization Server.
2 Configuring a HA ODM Enterprise environment
A Introduction
This document describes a sample configuration of IBM ILOG ODM Enterprise as part of a highly available (HA) solution. There are many ways to provide high availability using various combinations of specialized hardware and software; this document describes a software-based solution using the following products:
• IBM ILOG ODM Enterprise V3.3.0.1
• IBM WebSphere Application Server Network Deployment V6.1 FP25
• IBM HTTP Server V6.1 FP25
• IBM DB2 Enterprise Server Edition V9.5 FixPack 3
This document does not provide an exhaustive step-by-step guide, but instead highlights specific considerations for configuring HA with the products listed above. Links are provided to product documentation, articles, and Redbooks which describe the steps in more detail.
The following configuration steps describe how to configure the sample topology depicted in Figure 1, ODME Software-only HA Topology example, above. This topology consists of:
• Optimization Servers running WebSphere Application Server (WAS), with each Optimization Server running a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster.
• Database servers running DB2 Enterprise Server Edition, with both servers running the same software in an active/passive HADR setup where the primary database server replicates all database updates to the standby server.
• Load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests between the WAS servers on a weighted round-robin basis.
Not represented in the previous topology, ODM Client Applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR, which recovers from a loss of connection to the DB2 server by rerouting the connection to an alternate server.
B Configuring DB2 HADR
DB2 has a feature called High Availability Disaster Recovery (HADR), which provides a high-performance replication system. A DB2 HADR system consists of two database servers, one active and one standby. Any changes made to the active system are also replicated on the standby system. At any point an administrator can instruct the standby system to "take over" as the primary; after this happens, the roles of the two systems are swapped.
DB2 HADR Requirements
Before installing DB2 with HADR as the ODME application datastore, be aware of these basic requirements for both the primary and standby DB2 servers:
• Identical operating system version and patch level.
• The primary and standby server machines should be connected by a high-speed TCP/IP connection and be reachable by TCP/IP from the client application.
• Identical DB2 version and patch level, software bit size (32-bit or 64-bit), and installation path.
DB2 HADR Setup
1 Install DB2 UDB Enterprise Server Edition on both the primary and standby machines.
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connection between the DB2 HADR primary and standby machines works correctly.
2 Start the DB2 servers on both machines if they are not already running.
3 Create your database and the required tables on the primary machine only; the databases on secondary machines will be cloned from the primary machine. (See the DB2 Information Center for detailed installation information.)
• Optimization Server DB – used to store ODME jobs. Use the scripts provided with ODM Enterprise (typically server/database/db2-create-tables.sql) to create the JOBS database tables. Make a note of the userID that you use to create the tables, because it is used in the table qualifier and schema.
• Scenario database – used to store ODME scenario data. The database tables themselves will be initialized when developing the ODM application using the ODME IDE.
• SIBus database – used by the WAS Service Integration Bus.
Tip: The DB2 logs need to be of a sufficient size, especially for the scenario database, which receives large updates. Make sure you set the database logging to archive logging rather than the default circular logging, because otherwise it will not be possible to enable HADR.
4 Configure HADR for each database from the primary machine, using the Setup HADR wizard.
Tip: The easiest way to create the databases on the secondary machines is to do it during the HADR setup process by using the backup method. During HADR setup you may be asked for the peer window parameter; you can leave it at the default value of 0.
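As a rough sketch, the archive-logging and HADR settings from steps 3 and 4 can also be applied from the DB2 command line instead of the wizard. The database name JOBSDB, hosts db2host1/db2host2, service ports 55001/55002, instance name db2inst1, and paths below are all placeholder values for this sample topology:

```shell
# -- On the primary --
# Switch to archive logging (required before HADR can be enabled), then back up
db2 update db cfg for JOBSDB using LOGARCHMETH1 "DISK:/db2/archivelogs"
db2 backup db JOBSDB to /db2/backup

# HADR settings on the primary
db2 update db cfg for JOBSDB using HADR_LOCAL_HOST db2host1 HADR_LOCAL_SVC 55001 \
    HADR_REMOTE_HOST db2host2 HADR_REMOTE_SVC 55002 HADR_REMOTE_INST db2inst1 \
    HADR_SYNCMODE NEARSYNC

# -- On the standby: restore the backup, mirror the HADR settings, start HADR --
db2 restore db JOBSDB from /db2/backup
db2 update db cfg for JOBSDB using HADR_LOCAL_HOST db2host2 HADR_LOCAL_SVC 55002 \
    HADR_REMOTE_HOST db2host1 HADR_REMOTE_SVC 55001 HADR_REMOTE_INST db2inst1 \
    HADR_SYNCMODE NEARSYNC
db2 start hadr on db JOBSDB as standby

# -- Back on the primary --
db2 start hadr on db JOBSDB as primary

# To test takeover behaviour later, run on the standby:
# db2 takeover hadr on db JOBSDB
```

The same sequence applies to the Scenario and SIBus databases.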
C Configuring WebSphere Application Server and HTTP Server
1) Overview
WebSphere Application Server Network Deployment allows multiple servers to be clustered together. Installing a Java EE application into the cluster performs the installation on each cluster member.
The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other. To use JMS in a clustered WAS environment, a service integration bus (SIBus) is used, with each server added as a clustered bus member. In our architecture only one server needs to host a messaging engine; in the event of a failure in that server, WAS moves the messaging engine to another server. To support this, each WAS server must be able to access the SIBus data store, so in this topology the data store is hosted in a DB2 database.
2) Procedure
The following instructions extend the single-server instructions provided with ODME 3.3.0.1, with a focus on differences specific to a clustered deployment.
a) Install WAS 6.1 Network Deployment (see "Roadmap: Installing the Network Deployment product" under Useful links below).
Tip: Install the deployment manager node first and start it. For the other nodes used in the optimization cluster, select a "Custom" environment in the profile manager wizard, which will add each new node into the cell. The deployment manager and cluster nodes should be created with security disabled.
b) Start and connect to the deployment manager console, create a new cluster in Servers => Clusters, and define cluster members for the nodes created earlier.
c) Create a "DB2 Universal JDBC Driver Provider (XA)" provider at the cluster scope in Resources => JDBC => JDBC Providers, and specify the database class path information for the cluster nodes.
d) With this provider, create new JDBC data sources at the cluster scope for each HA database used by the Optimization Server cluster: the Jobs and the SIBus databases. Create the data sources with all the settings that pertain to the primary DB2 host; alternate (standby) database definitions will be specified through additional DB connection properties.
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connections between the WebSphere Application Server hosts and the DB2 HADR primary and standby hosts work correctly.
The JNDI name to use for the Jobs DB should be OptimizationServerDB, which is the default binding name used in the Optimization Server enterprise modules.
e) Set the custom properties of these JDBC data sources:
• currentSchema – the schema used when creating the DB2 database. By default this schema is the userID that you used to create the Jobs DB tables.
• clientRerouteAlternateServerName – standby server name for client reroute. This is the HADR standby host name.
• clientRerouteAlternatePortNumber – standby server port number for client reroute.
• maxRetriesForClientReroute – limits the number of retries if the primary connection to the server fails. A good default is 2.
• retryIntervalForClientReroute – amount of time (in seconds) to sleep before retrying. A good default is 15.
• fullyMaterializeInputStreams – set to true.
• progressiveStreaming – disable by setting it to a value of 2. This prevents odmapp unpacking issues.
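For reference, a Jobs DB data source configured as in step e) would carry custom properties along these lines; the schema, host name, and port are placeholder values for this sample topology:

```properties
currentSchema=DB2ADMIN
clientRerouteAlternateServerName=db2standby.example.com
clientRerouteAlternatePortNumber=50000
maxRetriesForClientReroute=2
retryIntervalForClientReroute=15
fullyMaterializeInputStreams=true
progressiveStreaming=2
```

In the WAS Admin console these are entered one by one under the data source's Custom properties panel rather than as a file.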
f) Create the SIBus, named OptimizationServerBus, in Service integration => Buses, with no security enabled.
g) Set bus members for OptimizationServerBus at the cluster scope, and use a data store backed by the HA SIBus database created earlier. You may need to specify an authentication alias for the SIBus database connection.
h) Create the JMS resources in Resources => JMS at the cluster scope, using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section):
• OptimizationServerTopic, named jmsoptimserverTopic in JNDI
• OptimizationServerTopicConnectionFactory, named jmsoptimserverTopicConnectionFactory in JNDI
• OptimizationServerQueueConnectionFactory, named jmsoptimserverQueueConnectionFactory in JNDI
• OptimizationServerTopicSpec, named jmsoptimserverTopicSpec and pointing to topic jmsoptimserverTopic
i) Deploy optimserver-mgmt-ear and optimserver-processor-ear at the cluster scope.
j) Install IBM HTTP Server 6.1.
Tips:
• Note that the HTTP <port> defined during install is the one that will be used in the Optimization Server connection URL, http://server:<port>/optimserver, to deploy your developed ODM Application.
• We recommend not installing the WAS plug-in as part of the IBM HTTP Server install, but rather launching it as a separate installation afterwards, because it makes configuration easier.
k) Install the Web server plug-ins for IBM WebSphere Application Server V6.1. At the beginning of the plug-in installation, select the check box to view the installation roadmap, then click Next. In this roadmap, identify your installation scenario and follow the installation steps.
l) Start the cluster nodes.
m) Start the cluster in Servers => Clusters.
n) Check that the Optimization Server installation is correct by going to http://server:<port>/optimserver/console.
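The check in step n) can also be scripted; ihs-host and port 80 below are placeholders for your IBM HTTP Server host and the HTTP port chosen at install time:

```shell
# Expect HTTP 200 from the Optimization Server console through the IHS load balancer;
# curl -f makes the command fail if the server returns an error status.
curl -fsS -o /dev/null -w "%{http_code}\n" "http://ihs-host:80/optimserver/console"
```

Run it twice while stopping one WAS cluster member to confirm that the plug-in reroutes requests to the surviving server.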
Useful links:
• "IBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Server"
• "Roadmap: Installing the Network Deployment product"
D Configuring the ODM Application
When the ODM Repository relies on a DB2 HADR environment, the ODM Application configuration must be updated to fully benefit from automatic client reroute.
Automatic client reroute is a DB2 feature that enables a DB2 client to recover from a loss of connection to the DB2 server by automatically rerouting the connection to an alternate server.
To fully support this feature, the alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application, for example:
• clientRerouteAlternateServerName – alternate server name for client reroute
• clientRerouteAlternatePortNumber – alternate port number for client reroute
• maxRetriesForClientReroute – limits the number of retries if the primary connection to the server fails
• retryIntervalForClientReroute – amount of time (in seconds) to sleep before retrying
Note: This property list can be extended with other DB2 properties to match your needs. The list is passed to the ODM repository and the underlying JDBC driver. Descriptions of additional properties can be found in the DB2 Information Center (http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html).
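As a sketch, the repository section of the deployment settings could then carry entries like the following; the host name and port are placeholder values, and the exact property syntax should be checked against the odmsds reference for your ODME release:

```properties
clientRerouteAlternateServerName=db2standby.example.com
clientRerouteAlternatePortNumber=50000
maxRetriesForClientReroute=2
retryIntervalForClientReroute=15
```

With these in place, both Planner Studio sessions and Optimization Server solves against the Scenario DB survive a HADR takeover.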
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Network Deployment and DB2 HADR.
This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS + WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in an active/standby HADR configuration, and ODME 3.3.0.1.
When deploying on a multi-host cluster, the additional benefits fall into two categories: Work Load Management (WLM) and High Availability (HA).
WLM is the ability to spread the processing workload across all cluster members, and is a feature brought by WebSphere's Network Deployment edition.
HA is the ability for the system to continue operating when some of its hardware, network, or software components encounter a failure.
A Workload Management capabilities of ODME 3.3.0.1
When running ODME in a multi-node clustered environment, there are two different types of workload processed by the Optimization Server: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.
1) Job Control and Administrative requests Workload Management
Job control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol, and are workload-managed by the regular IHS + WAS HTTP load balancing scheme.
Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS is round-robin, and applies to all job control activity, whether it originates from ODM Studio or the Solve API.
The Optimization Server Admin console is a stateless web application and is also load balanced in a round-robin fashion by WAS.
2) Job solving Workload Management
The solver Optimization Engine processes are long-running, and their run durations may vary greatly across job types. They are managed by the Job Processor independently on each node.
Each Job Processor pulls jobs from the solve-pending queue in a first-in/first-out fashion whenever it has solve slots available. The resulting overall load balancing is a first-come, first-served scheme where solves are processed across the nodes depending on their capacity.
On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained.
A typical timeline of job control and job solves is illustrated in Figure 2 below: the job submit, enquiry, and control requests from the client are directed to the two instances while jobs are picked up for processing.
Note that in the case depicted here, DB2 HADR is set up in active/standby mode, so only one of the two DB2 nodes handles DB requests.
A typical balancing of load is illustrated below. The yellow line represents the queue depth, starting at 500 jobs and draining until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots each. Overall, both processors handle 2 or 3 jobs until all are processed. The diagram shows the load for short jobs of even solve durations; the X axis units are events, not linear time.
Figure 3: Typical balance of load on ODME 3.3.0.1
[Chart: running solves per server (server1, server2, 0–7 slots, left axis) and queue depth (500 down to 0, right axis) over time.]
The irregularities towards the end are due to some administrator-triggered cleansing of the processed jobs from the log.
B High Availability capabilities of ODME 3.3.0.1 on WAS-ND and DB2 HA
As detailed in the Protecting the system section, running ODME 3.3.0.1 in a clustered environment protects the overall system from failure of some of its components. This gives the system the ability, on one hand, to continue operating across those failures and, on the other hand, to perform some level of recovery on the processing that was in flight at the point of failure.
1) Operations continuity
For ODME 3.3.0.1, operations continuity is the ability of the Optimization Server to display the Admin console, keep the capacity to accept new job submissions, and continue processing queued jobs.
Operations continuity across WAS failures
Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails: the surviving cluster member continues processing.
Figure 4: Typical ODME 3.3.0.1 Operations Continuity timeline across WAS node failure
[Diagram: the IHS WLM plug-in balances SOAP/HTTP requests between WAS 1 and WAS 2, each running mgmt and proc modules against the JOBS DB (with a hot standby). The client submits job A, which is created and stored by WAS 1; after WAS 1 stops, job A's queued status is read from the JOBS DB and job A is solved on WAS 2, which also creates and solves a newly submitted job B, reporting status and progress.]
Note that the Optimization Server Admin console will also continue to be served by the remaining server of the cluster.
Operations continuity across DB2 failures
When DB2 HADR has been set up, and the JOBS DB and odmapp data sources have been configured with appropriate alternate server definitions, the same kind of behavior is observed: the Optimization Server switches to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails. Newly picked up jobs will
2) Operations recovery
ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving will cause the solve to be aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on how the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.
Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-job detection, which detects a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, marks the jobs as recoverable, and requeues them so that they are solved again.
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment
There are a certain number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.
Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but become more prevalent when seamless continuous operations and failure recovery are expected.
Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.
A Job processor fails to extract OPL binaries upon restart
Symptoms:
• The optimserver-processor-ear enterprise application is not started on the server, although optimserver-mgmt-ear is running.
• Queued jobs are not processed (they remain in NOT_STARTED state).
• Only one of the cluster members runs jobs, although the queue is saturated.
• SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70/libcplex121.a (Text file busy)
Explanation:
The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is thus not allowed to extract them again, and fails during its initialization.
Remediation:
Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.
To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), change the mode of the files in the above directory to 750 (instead of the default 755) right after restarting and before any solver instance is started: chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70/*. This forces AIX not to cache the files.
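The permission change behind the warm-restart remediation can be sketched as follows; the scratch directory and file name are stand-ins for the real power64_aix53_70 binaries directory and its contents:

```shell
# Stand-in for /usr/IBM/ILOG/ODME3301/Deployment/.../power64_aix53_70
BIN_DIR=$(mktemp -d)
touch "$BIN_DIR/libcplex121.a"      # stand-in for an extracted OPL binary
chmod 755 "$BIN_DIR/libcplex121.a"  # default mode after extraction

# Drop the world read/execute bits so AIX will not cache and lock the file
chmod 750 "$BIN_DIR"/*
```

On the real system, run the chmod right after WAS restarts and before any solver instance starts.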
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms:
When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.
Explanation:
In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, but the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve it will fail, because the scenario has released its solve lock.
Remediation:
The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.
C Bad error reporting when Optimization Server loses connection to the Repository DB
Symptoms:
The Optimization Server Admin console displays an Error 500 ([code=javax.transaction.RollbackException params=]) when the connection to the JOBS DB is lost.
Explanation:
The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.
Remediation:
This error is transient; refresh the Optimization Server Admin console after the JOBS DB has recovered.
D ODME cannot start when WAS administrative security is enabled
Symptoms:
Although WAS with administrative security is not currently supported by ODME 3.3.0.1, deployers of the Optimization Server in a clustered WAS environment may need to deploy it with security enabled.
This results in an exception being raised during startup of the Optimization Server, reported in SystemOut.log.
Explanation:
The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.
Remediation:
WAS administrative security may be turned on, but write access to JNDI must then be granted to the EVERYONE group. This is achieved using the WAS Admin console, in the Environment => Naming => CORBA Naming Service Groups section: the group EVERYONE has to be added with Cos Naming Read, Write, Create, and Delete authorization.
E ODM solver does not start
Symptoms:
Solve jobs all end up in FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.
Explanation:
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.
Remediation:
Run vcredist_x86.exe from the redist\vcredist directory of the ODM Enterprise Developer edition on all machines where the Optimization Server will execute ODM solve jobs.
to one of the WAS servers In this configuration the WAS server is chosen on a round‐robin basis
E System failure detection A key factor in creating a highly available system is how quickly you can recover from a failure The solution might be able to cope with the failure of one component but two or more may be difficult so detecting and recovering from failures is critical
It is important to monitor at many levels A failure could occur in the ODME application the hosting application server the operating system the physical hardware for the server or with a network connection
There are many software solutions for monitoring middleware such as IBM Tivoli Monitoring
F Special considerations for ODM Clients ODM client applications such as ODM Planner Studio have direct access to the Scenario database defined in the odmapp deployment settings
ODM application
configuration (odmapp
ODM Enterprise IDE
ODMRepositorySCENARIO
DB
Development Deployment
IT developer
Java Development
Tools
OPL Studio
ODM Editors
ODM Studio -Planner and
Reviewer Editions
Optimization Server
Custom Clients and Batch Files
odmapp
odmapp
odmapp
odmapp
solve
solve
Readwrite
Rea
dwr
ite
Readwrite
The odmapp files generated with ODME IDE include their own Scenario database access definitions which are configured independently from the the Optimization Server JOBSs
When an odmapp is intended to take advantage of HA recovery of its Scenario DB its Data Source definitions must be enhanced with HA‐specific settings that will enable switching database operations to the alternate DB instance This will enable you to take HA recovery into account both when used from within the Planner Studio and when used for solving on an Optimization Server
8
2 Configuring a HA ODM Enterprise environment
A Introduction This document describes a sample configuration of the IBM ILOG ODM Enterprise as part of a highly available (HA) solution There are many ways to provide high availability using various combinations of specialized hardware and software This document describes a software‐based solution using the following products
bull IBM ILOG ODM Enterprise V3301
bull IBM WebSphere Application Server Network Deployment V61 FP 025
bull IBM HTTP Server V61 FP 025
bull IBM DB2 Enterprise Server Edition V95 FixPack 3
This document does not provide an exhaustive step‐by‐step guide but instead highlights specific considerations for configuring HA with the products listed above Links are provided to product documentation articles and Redbooks which describe the steps in more details
Next configuration steps describe how to configure the sample topology depicted in Figure 1 ODME Software‐only HA Topology example above This topology consists of
Optimization Servers running WebSphere Application Server (WAS) with each Optimization Server will run a single WAS server as part of a cluster The ODM Job Manager and Job Processor applications are installed into the cluster
Database servers running DB2 Enterprise Server Edition with both servers running the same software in a activepassive HADR setup where the primary database server replicates all database updates to the standby server
Load balancing server running IBM HTTP Server and the WAS plug‐in which routes HTTP requests between the WAS servers on a weighted round‐robin basis
Not represented in the previous topology ODM Client Applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server
B Configuring DB2 HADR DB2 has a feature called High Availability Disaster Recovery (HADR) which provides a high performance replication system A DB2 HADR system consists of two database servers one active and one standby Any changes made to the active system will also be replicated in the standby system At any point an administrator can instruct the standby system to ldquotake overrdquo as the primary ndash after this happens the roles of the two systems are swapped
DB2 HADR Requirements
Before installing DB2 with HADR as the ODME application datastore you need to be aware of these basic requirements for both the primary and standby DB2 servers
9
bull Identical operating system version and patch level
bull The primary and standby server machine should be connected with high‐speed TCPIP connection and reachable by TCPIP from the client application
bull Identical DB2 version and patch level software bit size (32‐bit or 64‐bit) and installation path
DB2 HADR Setup
1 Install DB2 UDB Enterprise Server Edition on both the primary and standby machines
Tip Before testing the DB2 HADR takeover behaviour you need to verify that the connection between DB2 HADR primary and stand‐by machine works well
2 Start the DB2 servers on both machines if they are not already running
3 Create your database and the required tables on the primary machine only The databases on secondary machines will be cloned from the primary machine (See the DB2 Information Center for detailed installation information)
bull Optimization Server DB ndash used to store ODME jobs Use the scripts provided with ODM Enterprise (typically serverdatabasedb2-create-tablessql) to create the JOBS database tables Make a note of the userID that you use to create the tables because it is used in the table qualifier and schema
bull Scenario database ndash used to store ODME scenario data The database tables themselves will be initialized when developing the ODM application using the ODME IDE
bull SIBus database ndash used by the WAS Service Integration Bus
Tip The DB2 logs need to be of a sufficient size especially the scenario database in which there are important updates Make sure you set the database logging to archive logging rather than the default circular logging because otherwise it will not be possible to enable HADR
4 Configure HADR for each database from the primary machine using the Setup HADR wizard as presented in
Tip The easiest way to create the databases on the secondary machines is to do it during the HADR setup process by using the backup method During HADR setup you may be asked for the peer window parameter you can leave it at the default value of 0
C Configuring WebSphere Application Server and HTTP Server
1) Overview WebSphere Application Server Network Deployment allows multiple servers to be clustered together Installing a Java EE application into the cluster will perform the installation on each cluster member
The ODM Enterprise Job Manager and Job Processor applications use the Java Messaging Service (JMS) to communicate with each other To use JMS in a clustered environment in WAS a service integration bus (SIBus) is used with each server added as a clustered bus member In our architecture only one server needs to host a messaging engine ndash in the event of a failure in that server WAS will move the messaging engine to another server To support this each WAS server must be able to access the SIBus data store so in this topology the data store will be hosted in a DB2 database
2) Procedure The following instructions extend the single‐server instructions provided with ODME 3301 with a focus on differences specific to a clustered deployment
a) Install WAS 61 Network Deployment as detailed in
Tip Install the deployment manager node first and start it For the other nodes used in optimisation cluster select a ldquoCustomrdquo environment in the profile manager wizard which will add the new node into the cell Deployment manager and cluster nodes should be created with security disabled
b) Start and connect to the deployment manager console create a new cluster in Servers =gt
Clusters and define cluster members for nodes created earlier
c) Create a ldquoDB2 Universal JDBC Driver Provider (XA)rdquo provider for the cluster scope in Resources =gt JDBC =gt JDBC Providers and specify database class path information for cluster nodes
d) With this provider create new JDBC Data sources in the cluster scope for each HA database used by the optimisation server cluster the Jobs and the SIBus databases Create the data sources with all the settings that pertain to the primary DB2 host alternate (standby) database definitions will be specified through additional DB connection properties
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connections between the WebSphere Application Server hosts and both the DB2 HADR primary and standby hosts work well.
The JNDI name to use for the Jobs DB should be OptimizationServerDB, which is the default binding name used in the Optimization Server enterprise modules.
e) Set the custom properties of these JDBC data sources:
• currentSchema – the schema used when creating the DB2 database. By default this is the user ID that you used to create the Jobs DB tables.
• clientRerouteAlternateServerName – standby server name for client reroute. This is the HADR standby host name.
• clientRerouteAlternatePortNumber – standby server port number for client reroute.
• maxRetriesForClientReroute – limits the number of retries if the primary connection to the server fails. A good default is 2.
• retryIntervalForClientReroute – amount of time (in seconds) to sleep before retrying. A good default is 15.
• fullyMaterializeInputStreams – set to true.
• progressiveStreaming – disable by setting it to a value of 2. This prevents odmapp unpacking issues.
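Assuming a standard key/value listing (the schema, host name, and port below are placeholders, not values from this document), the custom properties for the Jobs DB data source could look like:

```
currentSchema=ODMUSER
clientRerouteAlternateServerName=db2standby.example.com
clientRerouteAlternatePortNumber=50000
maxRetriesForClientReroute=2
retryIntervalForClientReroute=15
fullyMaterializeInputStreams=true
progressiveStreaming=2
```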
f) Create the SIBus named OptimizationServerBus in Service integration => Buses, with no security enabled.
g) Add bus members for OptimizationServerBus at the cluster scope, using a data store backed by the HA SIBus database created earlier. You may need to specify an authentication alias for the SIBus database connection.
h) Create JMS resources in Resources => JMS for the cluster scope, using the service integration bus OptimizationServerBus created earlier (in the Bus Name field of the Connection section):
• OptimizationServerTopic, named jms/optimserverTopic in JNDI
• OptimizationServerTopicConnectionFactory, named jms/optimserverTopicConnectionFactory in JNDI
• OptimizationServerQueueConnectionFactory, named jms/optimserverQueueConnectionFactory in JNDI
• OptimizationServerTopicSpec, named jms/optimserverTopicSpec and pointing to the topic jms/optimserverTopic
i) Deploy optimserver-mgmt-ear and optimserver-processor-ear at the cluster scope.
j) Install IBM HTTP Server 6.1.
Tip:
• Note that the HTTP <port> defined during the install is the one used in the Optimization Server connection URL http://server:<port>/optimserver to deploy your developed ODM Application.
• We recommend not installing the WAS plug-in as part of the IBM HTTP Server install, but launching it as a separate installation afterwards, because it makes configuration easier.
k) Install the Web server plug-ins for IBM WebSphere Application Server V6.1. At the beginning of the plug-in installation, select the check box to view the installation roadmap, then click Next. In this roadmap, identify your installation scenario and follow the installation steps.
l) Start the cluster nodes.
m) Start the cluster in Servers => Clusters.
n) Check that the Optimization Server installation is correct by going to http://server:<port>/optimserver/console.
Useful links:
• "IBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Server"
• "Roadmap: Installing the Network Deployment product"
D Configuring the ODM Application
When the ODM Repository relies on a DB2 HADR environment, the ODM Application configuration must be updated to fully benefit from automatic client reroute.
Automatic client reroute is a DB2 feature that enables a DB2 client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server, without any intervention from the client application.
To fully support this feature, the alternate server name and port should be specified with additional repository properties in the deployment settings file (.odmsds) of your ODM Application. Example:
• clientRerouteAlternateServerName – alternate server name for client reroute
• clientRerouteAlternatePortNumber – alternate port number for client reroute
• maxRetriesForClientReroute – limits the number of retries if the primary connection to the server fails
• retryIntervalForClientReroute – amount of time (in seconds) to sleep before retrying
Notes:
• This property list can be extended with other DB2 properties to match your needs. The list is passed to the ODM repository and the underlying JDBC driver. Descriptions of additional properties can be found at http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html
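The retry properties above bound the failover window to roughly maxRetriesForClientReroute × retryIntervalForClientReroute seconds. The sketch below is a conceptual simulation of that behaviour, not the actual JCC driver logic; every name in it is illustrative:

```python
# Conceptual sketch of DB2 automatic client reroute: try the primary host once,
# then retry the alternate up to max_retries times, sleeping retry_interval
# seconds between attempts. This is an illustration, not the real driver.

def reroute_connect(connect, primary, alternate,
                    max_retries=2, retry_interval=15, sleep=lambda s: None):
    """`connect` is a callable host -> connection that raises OSError on failure.

    `sleep` is injectable so the sketch can be exercised without waiting.
    """
    try:
        return connect(primary)          # normal case: primary answers
    except OSError:
        pass                             # primary down: start rerouting
    for attempt in range(max_retries):
        try:
            return connect(alternate)    # standby after HADR takeover
        except OSError:
            if attempt == max_retries - 1:
                raise                    # retries exhausted
            sleep(retry_interval)
    raise OSError("no connection attempts made")
```

With the defaults suggested above (2 retries, 15 seconds), a client stalls for at most about 30 seconds before the connection either reaches the standby or fails.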
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Network Deployment and DB2 HADR.
This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS + WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in an active/standby HADR configuration, and ODME 3.3.0.1.
When deploying on a multi-host cluster, the additional benefits fall into two categories: Workload Management (WLM) and High Availability (HA).
WLM is the ability to spread the processing workload across all cluster members, and is a feature brought by WebSphere's Network Deployment edition.
HA is the ability for the system to continue operating when some of its hardware, network, or software components encounter a failure.
A Workload Management capabilities of ODME 3.3.0.1
When running ODME in a multi-node clustered environment, two different types of workload are processed by OptimServer: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.
1) Job Control and Administrative requests Workload Management
Job control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol, and are workload managed by the regular IHS + WAS HTTP load balancing scheme.
Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS is round-robin, and it applies to all job control activity, whether it originates from ODM Studio or the SolveAPI.
The Optimization Server Admin console is a stateless web application and is also load balanced in a round-robin fashion by WAS.
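As a toy illustration of this scheme (not the actual plug-in code, which also honours member weights and failover), round-robin spraying of stateless requests can be sketched as:

```python
from itertools import cycle

# Toy illustration of round-robin request spraying across cluster members,
# as done for stateless SOAP/HTTP traffic. Weights and failover are omitted.

def round_robin(members):
    """Return a balancer that yields cluster members in strict rotation."""
    return cycle(members)

def route(balancer, requests):
    """Assign each request to the next member in the rotation."""
    return [(req, next(balancer)) for req in requests]
```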
2) Job solving Workload Management
The solver (Optimization Engine) processes are long-running, and their run duration may vary greatly across job types. They are managed by the Job Processor independently on each node.
Each Job Processor pulls jobs from the solve-pending queue in a first-in/first-out fashion whenever it has solve slots available. The resulting overall load balancing is a first-come/first-served scheme where solves are processed across the nodes depending on their capacity.
On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained.
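As a toy model of this scheme, the simulation below has each node's Job Processor pull from a shared FIFO queue whenever a solve slot is free; the slot count, tick-based clock, and equal job durations are simplifying assumptions, not ODME internals:

```python
from collections import deque

# Toy model of solve workload management: each node greedily fills its free
# solve slots from the front of a shared FIFO queue, then time advances one
# tick and completed solves free their slots.

def run_queue(num_jobs, nodes=2, slots=3, duration=4):
    """Simulate equal-length jobs; return per-node completed-job counts."""
    queue = deque(range(num_jobs))
    running = [[] for _ in range(nodes)]   # remaining ticks per busy slot
    done = [0] * nodes
    while queue or any(running):
        for n in range(nodes):             # fill free slots, FIFO order
            while queue and len(running[n]) < slots:
                queue.popleft()
                running[n].append(duration)
        for n in range(nodes):             # advance one tick
            running[n] = [t - 1 for t in running[n]]
            done[n] += sum(1 for t in running[n] if t == 0)
            running[n] = [t for t in running[n] if t > 0]
    return done
```

With only 2 jobs, the first node takes both and the second stays idle, matching the lightly loaded behaviour described in the text; with 12 jobs, the work spreads evenly.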
A typical timeline of job control and job solves is illustrated in Figure 4 below. The job submit, enquiry, and control requests from the client are directed to the two instances while jobs are picked up for processing.
Note that in the case depicted here, DB2 HADR is set up in active/standby, so only one of the two DB2 nodes will be handling DB requests.
A typical balancing of load is illustrated below. The yellow line represents the queue depth, starting at 500 jobs and draining until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots. Overall, both processors handle 2 or 3 jobs each until all are processed. The diagram shows the load for short jobs of even solve durations; the X axis units are events, not linear time.
[Chart: queue depth (500 down to 0) plotted together with the number of running solves on server1 and server2 against event time]
Figure 3: Typical balance of load on ODME 3.3.0.1
The irregularities towards the end are due to some administrator-triggered cleansing of the processed jobs from the log.
B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA
As detailed in the Protecting the system section, running ODME 3.3.0.1 in a clustered environment protects the overall system from the failure of some of its components. This gives the system the ability, on one hand, to continue operating across those failures and, on the other hand, to perform some level of recovery on the processing that was in flight at the point of failure.
1) Operations continuity
For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to display the Admin console, keep accepting new job submissions, and continue processing queued jobs.
Operations continuity across WAS failures
Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails: the surviving cluster member continues processing.
Figure 4: Typical ODME 3.3.0.1 Operations Continuity timeline across a WAS node failure
[Timeline diagram: the IHS WLM plug-in load balances the client's SOAP/HTTP requests across WAS 1 and WAS 2; WAS 1 creates and stores job A in the JOBS DB and is then stopped; WAS 2 subsequently creates and solves job B, reads job A's status, and solves job A, with the second DB2 instance in hot standby]
Note that the Optimization Server Admin console will also continue to be served by the remaining server of the cluster.
Operations continuity across DB2 failures
When DB2 HADR has been set up, and the JOBS DB and odmapp data sources have been configured with the appropriate alternate server definitions, the same kind of behavior will be observed: the Optimization Server switches to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails. Newly picked up jobs will likewise be processed against the alternate instance.
2) Operations recovery
ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving will cause the solve to be aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on the way the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.
Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-jobs detection, which detects a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, marks the jobs as recoverable, and requeues them so that they are solved again.
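The detection loop can be sketched as follows; the state names and job record shape are assumptions for illustration, not the actual Optimization Server schema:

```python
# Sketch of failed-jobs detection: an in-process job whose last heartbeat is
# older than the timeout is marked recoverable and put back on the solve queue.
# State names and record layout are illustrative assumptions.

def recover_stale_jobs(jobs, queue, now, timeout):
    """Mark timed-out IN_PROCESS jobs RECOVERABLE and requeue them.

    `jobs` maps job id -> {"state": ..., "heartbeat": last-report time}.
    Returns the ids that were requeued.
    """
    requeued = []
    for job_id, job in jobs.items():
        if job["state"] == "IN_PROCESS" and now - job["heartbeat"] > timeout:
            job["state"] = "RECOVERABLE"
            queue.append(job_id)       # solved again on the next free slot
            requeued.append(job_id)
    return requeued
```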
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment
There are a number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.
Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but become more prevalent when seamless continuous operation and failure recovery are expected.
Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.
A Job processor fails to extract OPL binaries upon restart
Symptoms
The optimserver-processor-ear enterprise application is not started on the server, although optimserver-mgmt-ear is running.
Queued jobs are not processed (they remain in the NOT_STARTED state).
Only one of the cluster members runs jobs, although the queue is saturated.
SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70/libcplex121.a (Text file busy)
Explanation
The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is thus not allowed to extract them again, and fails during its initialization.
Remediation
Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.
To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), change the mode of the files in the above directory to 750 (instead of the default 755) right after restarting and before any solver instance is started: chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70. This forces AIX not to cache the files.
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms
When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.
Explanation
In some circumstances, the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, but the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case the solve job is eventually detected as timed-out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve it will fail because the scenario has released its solve lock.
Remediation
The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.
C Bad error reporting when the Optimization Server loses its connection to the Repository DB
Symptoms
The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException params=] when the connection to the JOBS DB is lost.
Explanation
The JOBS DB connection is lost and the Optimization Server Admin console cannot extract the jobs queue status for display.
Remediation
This error is transient; refresh the Optimization Server Admin console after the JOBS DB has recovered.
D ODME cannot start when WAS administrative security is enabled
Symptoms
Although WAS with administrative security enabled is not currently supported by ODME 3.3.0.1, deployers of the Optimization Server in a clustered WAS environment may need to deploy it with security enabled.
This results in an exception being raised during startup of the Optimization Server, reported in SystemOut.log.
Explanation
The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.
Remediation
WAS administrative security may be turned on, but write access to JNDI should then be granted to the EVERYONE group. This is achieved using the WAS Admin console, in the Environment => Naming => CORBA Naming Service Groups section: the group EVERYONE has to be added with the Cos Naming Read, Write, Create, and Delete authorizations.
E ODM solver does not start
Symptoms
Solve jobs all end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.
Explanation
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.
Remediation
Run redist\vcredist\vcredist_x86.exe from the ODM Enterprise Developer edition redist directory on all machines where the Optimization Server will execute ODM solve jobs.
2 Configuring a HA ODM Enterprise environment
A Introduction
This document describes a sample configuration of IBM ILOG ODM Enterprise as part of a highly available (HA) solution. There are many ways to provide high availability using various combinations of specialized hardware and software; this document describes a software-based solution using the following products:
• IBM ILOG ODM Enterprise V3.3.0.1
• IBM WebSphere Application Server Network Deployment V6.1.0.25
• IBM HTTP Server V6.1.0.25
• IBM DB2 Enterprise Server Edition V9.5 FixPack 3
This document does not provide an exhaustive step-by-step guide, but instead highlights specific considerations for configuring HA with the products listed above. Links are provided to product documentation, articles, and Redbooks which describe the steps in more detail.
The next configuration steps describe how to configure the sample topology depicted in Figure 1, "ODME Software-only HA Topology example", above. This topology consists of:
• Optimization Servers running WebSphere Application Server (WAS), with each Optimization Server running a single WAS server as part of a cluster. The ODM Job Manager and Job Processor applications are installed into the cluster.
• Database servers running DB2 Enterprise Server Edition, with both servers running the same software in an active/passive HADR setup, where the primary database server replicates all database updates to the standby server.
• A load balancing server running IBM HTTP Server and the WAS plug-in, which routes HTTP requests between the WAS servers on a weighted round-robin basis.
Not represented in the previous topology: ODM Client Applications will be configured to benefit from the automatic client rerouting offered by DB2 HADR, recovering from a loss of connection to the DB2 server by rerouting the connection to an alternate server.
B Configuring DB2 HADR
DB2 has a feature called High Availability Disaster Recovery (HADR) which provides a high performance replication system. A DB2 HADR system consists of two database servers, one active and one standby. Any changes made to the active system are also replicated on the standby system. At any point an administrator can instruct the standby system to "take over" as the primary; after this happens, the roles of the two systems are swapped.
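For reference, the role swap is issued from the standby with the DB2 TAKEOVER HADR command; the database name JOBSDB below is a placeholder:

```
db2 takeover hadr on database JOBSDB            (planned role swap, run on the standby)
db2 takeover hadr on database JOBSDB by force   (only when the primary is lost)
```

Note that BY FORCE bypasses synchronization with the failed primary and can lose unreplicated transactions, so it should be reserved for genuine primary failures.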
DB2 HADR Requirements
Before installing DB2 with HADR as the ODME application datastore, you need to be aware of these basic requirements for both the primary and standby DB2 servers:
• Identical operating system version and patch level.
• The primary and standby server machines should be connected with a high-speed TCP/IP connection, and be reachable by TCP/IP from the client applications.
• Identical DB2 version and patch level, software bit size (32-bit or 64-bit), and installation path.
DB2 HADR Setup
1. Install DB2 UDB Enterprise Server Edition on both the primary and standby machines.
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connection between the DB2 HADR primary and standby machines works well.
2. Start the DB2 servers on both machines if they are not already running.
3. Create your database and the required tables on the primary machine only; the databases on the secondary machine will be cloned from the primary machine. (See the DB2 Information Center for detailed installation information.)
• Optimization Server DB – used to store ODME jobs. Use the scripts provided with ODM Enterprise (typically server/database/db2-create-tables.sql) to create the JOBS database tables. Make a note of the user ID that you use to create the tables, because it is used as the table qualifier and schema.
• Scenario database – used to store ODME scenario data. The database tables themselves will be initialized when the ODM application is developed using the ODME IDE.
• SIBus database – used by the WAS Service Integration Bus.
Tip: The DB2 logs need to be of a sufficient size, especially for the scenario database, which receives large updates. Make sure you set the database logging to archive logging rather than the default circular logging, because otherwise it will not be possible to enable HADR.
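As a sketch, switching a database from circular to archive logging could look like the following DB2 commands (the database name and paths are placeholders; changing to archive logging puts the database in backup-pending state, so a backup is required afterwards):

```
db2 update db cfg for SCENDB using LOGARCHMETH1 DISK:/db2/archivelogs
db2 backup database SCENDB to /db2/backups
```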
4. Configure HADR for each database from the primary machine using the Set up HADR wizard, as presented in the DB2 Information Center.
Tip: The easiest way to create the databases on the secondary machine is to do it during the HADR setup process by using the backup method. During HADR setup you may be asked for the peer window parameter; you can leave it at the default value of 0.
Remediation
Delete the files in usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 before starting the WAS server where the Optimization Server is deployed
In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance) right after restarting and before any solver instance is started change the mod of the files in the above directory to 750 (instead of the default 755) chmod 750 usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 this will force AIX not to cache the files
19
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms
When a database failure occurs a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable although the solve has completed successfully and a solution found message appears in ODM Studio
Explanation
In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB but the Job Processor is not able to update the Jobs DB This may happen when the JOBSDB store operations occur during database connection recovery In this case the solve job is eventually detected as timed‐out by the Optimization Server and marked for recovery but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock
Remediation
The scenario is actually solved although it is not properly reported as such by the Optimization Server The business user will see the scenario as solved from within the ODM Studio and the corresponding job can safely be cleared from the Optimization Server Admin console
C Bad error reporting when Optimization Server loses connection from the Repository DB
Symptoms
The Optimization Server Admin console displays an Error 500 [code=javaxtransactionRollbackExceptionparams=] when connection to the JOBS DB is lost
Explanation
The JOBS DB connection is lost and The Optimization Server Admin console cannot extract the jobs queue status for display
Remediation
This error is transient refresh the Optimization Server Admin console after the JOBS DB will have recovered
D ODME cannot start when WAS administrative security is enabled
Symptoms
Although WAS with administrative security is not currently supported by ODME 3301 deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled
This results in an exception being raised during startup of Optimization Server reported in the
20
SystemOutlog
Explanation
The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree
Remediation
WAS administrative security may be turned on but then write access to JNDI should be granted to the everyone group This is achieved using the WAS Admin console in the Environment‐gtNaming‐gtCORBA Naming Service Group section Group EVERYONE has to be added with Cos Naming Read Write Create Delete authorization
E ODM solver does not start
Symptoms
Solve jobs all end up in FAILED state and the log contains a line starting with javaioIOException CreateProcess and ending in error=14001
Explanation
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running
Remediation
Run redistvcredistvcredist_x86exe from the ODM Enterprise Developer edition redist directory on all machines where Optimization Server will execute ODM solve jobs
21
• Identical operating system version and patch level.
• The primary and standby server machines should be connected by a high-speed TCP/IP connection and be reachable over TCP/IP from the client application.
• Identical DB2 version and patch level, software bit size (32-bit or 64-bit), and installation path.
DB2 HADR Setup
1. Install DB2 UDB Enterprise Server Edition on both the primary and standby machines.
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connection between the DB2 HADR primary and standby machines works well.
2. Start the DB2 servers on both machines if they are not already running.
3. Create your database and the required tables on the primary machine only; the databases on the secondary machines will be cloned from the primary machine. (See the DB2 Information Center for detailed installation information.)
• Optimization Server DB – used to store ODME jobs. Use the scripts provided with ODM Enterprise (typically server/database/db2-create-tables.sql) to create the JOBS database tables. Make a note of the userID that you use to create the tables, because it is used in the table qualifier and schema.
• Scenario database – used to store ODME scenario data. The database tables themselves will be initialized when developing the ODM application using the ODME IDE.
• SIBus database – used by the WAS Service Integration Bus.
Tip: The DB2 logs need to be of a sufficient size, especially for the scenario database, which receives substantial updates. Make sure you set the database logging to archive logging rather than the default circular logging; otherwise it will not be possible to enable HADR.
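As an illustration, creating the JOBS database and switching it to archive logging could look like the following from the DB2 command line (the database name JOBSDB, the user, and the archive path are placeholder values):

```
db2 create database JOBSDB
db2 update db cfg for JOBSDB using LOGARCHMETH1 DISK:/db2/archivelogs
db2 connect to JOBSDB user odmuser
db2 -tf db2-create-tables.sql
db2 connect reset
```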
4. Configure HADR for each database from the primary machine using the Setup HADR wizard.
Tip: The easiest way to create the databases on the secondary machine is to do it during the HADR setup process by using the backup method. During HADR setup you may be asked for the peer window parameter; you can leave it at the default value of 0.
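Underneath the Setup HADR wizard, the configuration amounts to DB2 commands along these lines (hostnames, service ports, instance name, and backup path are placeholder values; repeat for each of the three databases):

```
# On the primary: back up the database and declare the standby
db2 backup database JOBSDB to /shared/backup
db2 update db cfg for JOBSDB using HADR_LOCAL_HOST primaryhost HADR_LOCAL_SVC 55001
db2 update db cfg for JOBSDB using HADR_REMOTE_HOST standbyhost HADR_REMOTE_SVC 55002 HADR_REMOTE_INST db2inst1

# On the standby: clone the database from the backup and mirror the settings
db2 restore database JOBSDB from /shared/backup
db2 update db cfg for JOBSDB using HADR_LOCAL_HOST standbyhost HADR_LOCAL_SVC 55002
db2 update db cfg for JOBSDB using HADR_REMOTE_HOST primaryhost HADR_REMOTE_SVC 55001 HADR_REMOTE_INST db2inst1

# Start HADR on the standby first, then on the primary
db2 start hadr on database JOBSDB as standby
db2 start hadr on database JOBSDB as primary
```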
C Configuring WebSphere Application Server and HTTP Server
1) Overview
WebSphere Application Server Network Deployment allows multiple servers to be clustered together. Installing a Java EE application into the cluster performs the installation on each cluster member.
The ODM Enterprise Job Manager and Job Processor applications use the Java Message Service (JMS) to communicate with each other. To use JMS in a clustered WAS environment, a service integration bus (SIBus) is used, with each server added as a clustered bus member. In our architecture only one server needs to host a messaging engine; in the event of a failure in that server, WAS will move the messaging engine to another server. To support this, each WAS server must be able to access the SIBus data store, so in this topology the data store is hosted in a DB2 database.
2) Procedure
The following instructions extend the single-server instructions provided with ODME 3.3.0.1, focusing on the differences specific to a clustered deployment.
a) Install WAS 6.1 Network Deployment.
Tip: Install the deployment manager node first and start it. For the other nodes used in the optimisation cluster, select a "Custom" environment in the profile manager wizard, which will add the new node into the cell. The deployment manager and cluster nodes should be created with security disabled.
b) Start and connect to the deployment manager console; create a new cluster in Servers => Clusters and define cluster members for the nodes created earlier.
c) Create a "DB2 Universal JDBC Driver Provider (XA)" provider at the cluster scope in Resources => JDBC => JDBC providers, and specify database class path information for the cluster nodes.
d) With this provider, create new JDBC data sources at the cluster scope for each HA database used by the optimisation server cluster: the Jobs and the SIBus databases. Create the data sources with all the settings that pertain to the primary DB2 host; alternate (standby) database definitions will be specified through additional DB connection properties.
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connections between the WebSphere Application Server hosts and the DB2 HADR primary and standby hosts work well.
The JNDI name to use for the Jobs DB should be OptimizationServerDB, which is the default binding name used in the Optimization Server enterprise modules.
e) Set the custom properties of these JDBC data sources:
• currentSchema – the schema used when creating the DB2 database. This is by default the userID that you used to create the Jobs DB tables.
• clientRerouteAlternateServerName – standby server name for client reroute. This is the HADR standby host name.
• clientRerouteAlternatePortNumber – standby server port number for client reroute.
• maxRetriesForClientReroute – limits the number of retries if the primary connection to the server fails. A good default is 2.
• retryIntervalForClientReroute – amount of time (in seconds) to sleep before retrying. A good default is 15.
• fullyMaterializeInputStreams – set to true.
• progressiveStreaming – disable by setting it to a value of 2. This will prevent odmapp unpacking issues.
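For reference, the same reroute properties can appear directly in a DB2 JCC connection URL as name=value pairs after the database name (the hosts and ports below are placeholders; the data source custom properties above remain the supported way to set them in WAS):

```
jdbc:db2://primaryhost:50000/JOBSDB:clientRerouteAlternateServerName=standbyhost;clientRerouteAlternatePortNumber=50000;maxRetriesForClientReroute=2;retryIntervalForClientReroute=15;
```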
f) Create the SIBus named OptimizationServerBus in Service integration => Buses, with no security enabled.
g) Set bus members for OptimizationServerBus at the cluster scope, and use the data store for the HA SIBus database created earlier. You may need to specify an authentication alias for the SIBus database connection.
h) Create the JMS resources in Resources => JMS at the cluster scope, using the service integration bus OptimizationServerBus created earlier (in the Bus name field of the Connection section):
• OptimizationServerTopic, named jms/optimserverTopic in JNDI
• OptimizationServerTopicConnectionFactory, named jms/optimserverTopicConnectionFactory in JNDI
• OptimizationServerQueueConnectionFactory, named jms/optimserverQueueConnectionFactory in JNDI
• OptimizationServerTopicSpec, named jms/optimserverTopicSpec and pointing to topic jms/optimserverTopic
i) Deploy optimserver-mgmt-ear and optimserver-processor-ear at the cluster scope.
j) Install IBM HTTP Server 6.1.
Tip:
• Note that the HTTP <port> defined during the install is the one that will be used in the Optimization Server connection URL http://server:<port>/optimserver when deploying your developed ODM Application.
• We recommend not installing the WAS plugin as part of the IBM HTTP Server install, but rather launching it as a separate installation afterwards, because this makes configuration easier.
k) Install the Web server plug-ins for IBM WebSphere Application Server V6.1. At the beginning of the plugin installation, select the check box to view the installation roadmap, then click Next. In this roadmap, identify your installation scenario and follow the installation steps.
l) Start the cluster nodes.
m) Start the cluster in Servers => Clusters.
n) Check that the Optimization Server installation is correct by going to http://server:<port>/optimserver/console.
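This check can also be scripted, for instance with curl against the IHS front end and each cluster member (server and <port> are the values from your installation):

```
curl -s -o /dev/null -w "%{http_code}\n" http://server:<port>/optimserver/console
```

An HTTP 200 (or a redirect toward the console page) indicates that the application is being served.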
Useful links:
• "IBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Server"
• "Roadmap: Installing the Network Deployment product"
D Configuring the ODM Application
When the ODM Repository relies on a DB2 HADR environment, the ODM Application configuration must be updated to fully benefit from automatic client reroute.
Automatic client reroute is a DB2 feature that enables a DB2 client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server, without any intervention from the client application.
To fully support this feature, the alternate server name and port should be specified with additional repository properties in the deployment settings file (odm.sds) of your ODM Application. Example:
• clientRerouteAlternateServerName – alternate server name for client reroute
• clientRerouteAlternatePortNumber – alternate port number for client reroute
• maxRetriesForClientReroute – limits the number of retries if the primary connection to the server fails
• retryIntervalForClientReroute – amount of time (in seconds) to sleep before retrying
Note: this property list can be extended with other DB2 properties to match your needs; the list is passed through to the ODM repository and the underlying JDBC driver. Additional property descriptions can be found at http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html
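Assuming the repository properties take simple name=value form (the exact odm.sds syntax may differ; the host and port below are placeholders), the additions could look like:

```
clientRerouteAlternateServerName=standbyhost
clientRerouteAlternatePortNumber=50000
maxRetriesForClientReroute=2
retryIntervalForClientReroute=15
```

With 2 retries at 15-second intervals, a broken connection is retried for roughly 30 seconds before the reroute attempt is abandoned.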
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Network Deployment and DB2 HADR.
This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS+WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in an active/standby HADR configuration, and ODME 3.3.0.1.
When deploying on a multi-host cluster, the additional benefits fall into two categories: Workload Management (WLM) and High Availability (HA).
WLM is the ability to spread the processing workload across all cluster members, and is a feature brought by WebSphere's Network Deployment edition.
HA is the ability for the system to continue operating when some of its hardware, network, or software components encounter a failure.
A Workload Management capabilities of ODME 3.3.0.1
When running ODME in a multi-node clustered environment, there are two different types of workload processed by the Optimization Server: job control (solve, abort, ...) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.
1) Job Control and Administrative requests Workload Management
Job control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol and are workload-managed by the regular IHS+WAS HTTP load-balancing scheme.
Since SOAP/HTTP sessions are stateless, the load-balancing scheme used by WAS is round-robin, and it applies to all job control activity, whether it originates from ODM Studio or the Solve API.
The Optimization Server Admin console is a stateless web application and is also load balanced in a round-robin fashion by WAS.
2) Job solving Workload Management
The solver (Optimization Engine) processes are long-running, and their run duration may vary a lot across job types. They are managed by the Job Processor independently on each node.
Each Job Processor pulls jobs from the solve-pending queue in a first-in/first-out fashion whenever it has solve slots available. The resulting overall load balancing is a first-come-first-served scheme, where solves are processed across the nodes depending on their capacity.
On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained.
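This pull-based scheme can be illustrated with a small simulation (an illustrative sketch, not ODME code): two job processors, each with three solve slots, draining a shared first-in/first-out queue of equal-length jobs.

```python
from collections import deque

def simulate(num_jobs, nodes, slots_per_node, solve_time=1):
    """Simulate nodes pulling jobs from a shared FIFO queue whenever a solve slot is free."""
    queue = deque(range(num_jobs))
    running = [[] for _ in range(nodes)]  # per-node list of (job, finish_time)
    t = 0
    load_history = []                     # per-step tuple of each node's load
    while queue or any(running):
        for slots in running:             # retire jobs that finished by time t
            slots[:] = [(job, f) for (job, f) in slots if f > t]
        for slots in running:             # greedily pull while a slot is free
            while queue and len(slots) < slots_per_node:
                slots.append((queue.popleft(), t + solve_time))
        load_history.append(tuple(len(slots) for slots in running))
        t += 1
    return load_history

hist = simulate(num_jobs=10, nodes=2, slots_per_node=3)
# hist == [(3, 3), (3, 1), (0, 0)]
```

Both nodes run at their full three-slot capacity while the queue is deep, then share the remainder first-come-first-served, which is the balance shown in Figure 3.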
A typical timeline of job control and job solves is illustrated in Figure 2 below. The job submit, enquiry, and control requests from the client are directed to the two instances while jobs are picked up for processing.
Note that in the case depicted here, DB2 HADR is set up in Active/Standby, so only one of the two DB2 nodes will be handling DB requests.
A typical balancing of load is illustrated below. The yellow line represents the queue depth, starting at 500 jobs and draining until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots. Overall, both processors handle 2 or 3 jobs until all are processed. The diagram shows load for short jobs of even solve durations; the X-axis units are events, not linear time.
Figure 3: Typical balance of load on ODME 3.3.0.1
The irregularities towards the end are due to some administrator-triggered cleansing of the processed jobs from the log.
B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA
As detailed in the Protecting the system section, running ODME 3.3.0.1 in a clustered environment allows protection of the overall system from failure of some of its components. This gives the system the ability, on the one hand, to continue operating across those failures and, on the other hand, to perform some level of recovery on the processing that was in flight at the point of failure.
1) Operations continuity
For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to display the Admin console, keep the capacity to accept new job submissions, and continue processing queued jobs.
Operations continuity across WAS failures
Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails: the surviving cluster member continues processing.
Figure 4: Typical ODME 3.3.0.1 Operations Continuity timeline across WAS node failure
[Timeline diagram: SOAP/HTTP requests from the client are routed through the IHS WLM plugin to WAS 1 and WAS 2. WAS 1 (mgmt) creates job A and stores it in the JOBS DB; WAS 1 is then stopped. WAS 2 creates and solves job B, picks up job A from the JOBS DB, solves it, and continues serving job status requests; the DB2 standby remains available as a hot standby.]
Note that the Optimization Server Admin console will also continue to be served by the remaining server of the cluster.
Operations continuity across DB2 failures
When DB2 HADR has been set up, and the JOBS DB and odmapp datasources have been configured with appropriate alternate server definitions, the same kind of behavior is observed: the Optimization Server switches to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails. Newly picked up jobs will
2) Operations recovery
ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and cannot store intermediate synchronization points, so a failure of a solver process while solving will cause the solve to be aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on how the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.
Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-job detection, which detects a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process, marks those jobs as recoverable, and requeues them so that they are solved again.
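The recovery mechanics can be sketched as follows (the state names, record fields, and timeout value here are illustrative assumptions, not ODME internals):

```python
RECOVERY_TIMEOUT = 60  # seconds without a heartbeat before a solve is presumed dead (assumed value)

def detect_failed_jobs(jobs, now, timeout=RECOVERY_TIMEOUT):
    """Return in-process jobs whose solver process has stopped reporting heartbeats."""
    return [j for j in jobs
            if j["state"] == "PROCESSING" and now - j["last_heartbeat"] > timeout]

def requeue(job):
    """Mark a stalled job as recoverable and put it back in the solve-pending queue."""
    job["state"] = "NOT_STARTED"
    job["recoverable"] = True
    return job

jobs = [
    {"id": "A", "state": "PROCESSING", "last_heartbeat": 100},  # solver died; no heartbeat since t=100
    {"id": "B", "state": "PROCESSING", "last_heartbeat": 155},  # healthy; recent heartbeat
    {"id": "C", "state": "COMPLETE",   "last_heartbeat": 90},   # already finished; never considered
]
stalled = [requeue(j) for j in detect_failed_jobs(jobs, now=170)]
# stalled contains only job A, now marked NOT_STARTED and recoverable
```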
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment
There are a number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.
Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but they become more prevalent when seamless continuous operation and failure recovery are expected.
Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.
A Job processor fails to extract OPL binaries upon restart
Symptoms
• The optimserver-processor-ear Enterprise Application is not started on the server, although optimserver-mgmt-ear is running.
• Queued jobs are not processed (they remain in NOT_STARTED state).
• Only one of the cluster members runs jobs, although the queue is saturated.
• SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70/libcplex121.a (Text file busy)
Explanation
The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is thus not allowed to extract them again and fails during its initialization.
Remediation
Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.
To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), right after restarting and before any solver instance is started, change the mode of the files in the above directory to 750 (instead of the default 755): chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70/*. This forces AIX not to cache the files.
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms
When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.
Explanation
In some circumstances the odmsolver may complete solving a scenario and manage to store the solve result in the Scenario DB, while the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock.
Remediation
The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.
C Bad error reporting when the Optimization Server loses its connection to the Repository DB
Symptoms
The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException, params=] when the connection to the JOBS DB is lost.
Explanation
The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.
Remediation
This error is transient; refresh the Optimization Server Admin console after the JOBS DB has recovered.
D ODME cannot start when WAS administrative security is enabled
Symptoms
Although WAS with administrative security is not currently supported by ODME 3.3.0.1, deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled.
This results in an exception being raised during startup of Optimization Server, reported in SystemOut.log.
Explanation
The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.
Remediation
WAS administrative security may be turned on, but write access to JNDI should then be granted to the everyone group. This is achieved using the WAS Admin console, in the Environment => Naming => CORBA Naming Service Groups section: the group EVERYONE has to be added with Cos Naming Read, Write, Create, and Delete authorization.
E ODM solver does not start
Symptoms
Solve jobs all end up in FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.
Explanation
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.
Remediation
Run redist\vcredist\vcredist_x86.exe from the ODM Enterprise Developer Edition redist directory on all machines where the Optimization Server will execute ODM solve jobs.
• IBM Redbooks: http://www.redbooks.ibm.com/abstracts/SG247363.html
• DB2 InfoCenter: "Automatic client reroute description and setup"
On lightly loaded Optimization Server clusters where the jobs processing load is below capacity and jobs are picked as soon as they are queued there will be no outstanding jobs pending in the queue and only one of the two nodes may seem active Once the load grows above solving
15
capacity of one node outstanding jobs will start to be processed evenly by the two nodes until the queue is drained
A typical timeline of job control and job solves is illustrated in Figure 2 below The job submit enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing
Note that in the case depicted here instance DB2 HADR is setup in ActiveStandby so only one of the two DB2 nodes will be handling DB requests
A typical balancing of load is illustrated below The yellow line represents the queue depth starting at 500 jobs and consuming the load until it reaches 0 Green and cyan lines represent the current processing load of each of the job processors which have 3 solve processing slots Overall both processors will be handling 2 or 3 jobs until all are processed The diagram shows load for short jobs of even solve durations the X axis unit are events not linear time
0
1
2
3
4
5
6
7
time
0
100
200
300
400
500
600
running
server1
server2Queue Depth
Figure 3 Typical balance of load on ODME 3301
16
The irregularities towards the end are due to some administrator‐triggered cleansing of the processed jobs from the log
B High Availablity capabilities of ODME 3301 on WAS-NDampDB2 HA As detailed in the Protecting the system section running ODME 3301 in a clustered environment allows protection of the overall system from failure of some of its components This provides the ability for the system on one hand to continue operating across those failures and on the other hand to perform some level of recovery on the processing that was inflight at the point of failure
1) Operations continuity For ODME 3301 operations continuity is the ability for the Optimization Server to display the Admin console keep the capacity to accept new jobs submissions and continue processing queued jobs
Operations continuity across WAS failures Figure 2 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails the surviving cluster member will continue processing
Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure
Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure
stopWAS 1
WAS 1mgmt Create job A
Client submit job A
proc
JOBSDB
WAS 2
store
running
complete
mgmt
proc
submit job B
Create job B
Solve job B
hot
standby
job A Q status
job A read status
Solve job A
readStatus
progress
IHSWLMplugin WLM WLM
SO
AP
HTTP
SOA
PH
TTP
SOAP
HTTP
17
Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster
Operations continuity across DB2 failures When DB2 HADR has been setup and the JOBS DB and odmapp datasources have been set up with appropriate alternate server definitions the same kind of behavior will be observed where the Optimization Server will switch to the alternate DB instance for Jobs control and Admin console (JOBS DB) when the primary one fails Newly picked up jobs will
2) Operations recovery ODME 3301 offers some level of recovery for inflight jobs through WAS or DB2 failures The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points so a failure of a solver process while solving will result in the solve to be aborted and eventually marked as either failed‐and‐recoverable or unrecoverable depending on the way the failure happens Cases when the jobs cannot be recovered are documented in the next chapter
Failed‐and‐recoverable jobs recovery is based on the Optimization Servers built‐in failed jobs detection which will basically detect a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process mark the jobs as recoverable and requeue so that they are solved again
18
4 Troubleshooting and limitations of ODME 3301 operating in a clustered environment
There are a certain number of cases where ODME 3301 will not be able to ensure a full recovery after the failure of one of the components involved in operations These cases may be addressed in subsequent fix packs of ODME
Those cases are listed below Note that most of the issues are not directly due to ODME being deployed in a clustered configuration but become more prevalent when seamless continuous operations and failure recovery is expected
Whenever possible we provide some troubleshooting tips to alleviate or circumvent the issues
A Job processor fails to extract OPL binaries upon restart
Symptoms
optimserver‐processor‐ear Enterprise Application is not started on the server although the optimserver‐mgmt‐ear is running
Queued jobs are not processed (remain in NOT_STARTED state)
Only one of the cluster members runs jobs although the queue is saturated
SystemErrlog contains an exception similar to javaioFileNotFoundException usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70libcplex121a (Text file busy)
Explanation
The OPL binaries are cached and locked for direct writing by the AIX operating system The job processor EAR module is thus not allowed to extract them again and fails during its initialization
Remediation
Delete the files in usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 before starting the WAS server where the Optimization Server is deployed
In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance) right after restarting and before any solver instance is started change the mod of the files in the above directory to 750 (instead of the default 755) chmod 750 usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 this will force AIX not to cache the files
19
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms
When a database failure occurs a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable although the solve has completed successfully and a solution found message appears in ODM Studio
Explanation
In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB but the Job Processor is not able to update the Jobs DB This may happen when the JOBSDB store operations occur during database connection recovery In this case the solve job is eventually detected as timed‐out by the Optimization Server and marked for recovery but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock
Remediation
The scenario is actually solved although it is not properly reported as such by the Optimization Server The business user will see the scenario as solved from within the ODM Studio and the corresponding job can safely be cleared from the Optimization Server Admin console
C Bad error reporting when Optimization Server loses connection from the Repository DB
Symptoms
The Optimization Server Admin console displays an Error 500 [code=javaxtransactionRollbackExceptionparams=] when connection to the JOBS DB is lost
Explanation
The JOBS DB connection is lost and The Optimization Server Admin console cannot extract the jobs queue status for display
Remediation
This error is transient refresh the Optimization Server Admin console after the JOBS DB will have recovered
D ODME cannot start when WAS administrative security is enabled
Symptoms
Although WAS with administrative security is not currently supported by ODME 3301 deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled
This results in an exception being raised during startup of Optimization Server reported in the
20
SystemOutlog
Explanation
The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree
Remediation
WAS administrative security may be turned on but then write access to JNDI should be granted to the everyone group This is achieved using the WAS Admin console in the Environment‐gtNaming‐gtCORBA Naming Service Group section Group EVERYONE has to be added with Cos Naming Read Write Create Delete authorization
E ODM solver does not start
Symptoms
Solve jobs all end up in FAILED state and the log contains a line starting with javaioIOException CreateProcess and ending in error=14001
Explanation
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running
Remediation
Run redistvcredistvcredist_x86exe from the ODM Enterprise Developer edition redist directory on all machines where Optimization Server will execute ODM solve jobs
21
Tip: Before testing the DB2 HADR takeover behaviour, verify that the connections between the WebSphere Application Server host and both the DB2 HADR primary and standby hosts work correctly.
The JNDI name to use for the Jobs DB should be OptimizationServerDB, which is the default binding name used in the Optimization Server enterprise modules.
e) Set the custom properties of these JDBC data sources:
• currentSchema: the schema used when creating the DB2 database. By default this is the user ID that you used to create the Jobs DB tables.
• clientRerouteAlternateServerName: the standby server name for client reroute. This is the HADR standby host name.
• clientRerouteAlternatePortNumber: the standby server port number for client reroute.
• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails. A good default is 2.
• retryIntervalForClientReroute: the amount of time (in seconds) to sleep before retrying. A good default is 15.
• fullyMaterializeInputStreams: set to true.
• progressiveStreaming: disable it by setting it to a value of 2. This prevents odmapp unpacking issues.
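Taken together, the reroute properties above define a simple retry loop: on loss of the primary connection, the driver retries against the alternate (standby) server up to maxRetriesForClientReroute times, sleeping retryIntervalForClientReroute seconds between attempts. The sketch below models that behavior purely for illustration; the real logic lives inside the DB2 JDBC driver, and the connect function here is hypothetical.

```python
import time

def reroute_connect(connect, primary, alternate,
                    max_retries=2, retry_interval=15, sleep=time.sleep):
    """Sketch of DB2 automatic client reroute semantics.

    `connect` is a hypothetical function that returns a connection or
    raises ConnectionError; the driver implements the real behavior.
    """
    try:
        return connect(primary)
    except ConnectionError:
        pass
    # Primary failed: retry against the alternate (HADR standby) server,
    # sleeping retry_interval seconds between consecutive attempts.
    for attempt in range(max_retries):
        try:
            return connect(alternate)
        except ConnectionError:
            if attempt < max_retries - 1:
                sleep(retry_interval)
    raise ConnectionError("client reroute exhausted all retries")
```

With the suggested defaults (2 retries, a 15-second interval), a client spends at most one 15-second sleep, plus connection timeouts, before the failure is surfaced.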
f) Create the SIBus named OptimizationServerBus in Service integration => Buses, with no security enabled.
g) Set the Bus members for OptimizationServerBus at the cluster scope, and use the data store for the HA SIBus database created earlier. You may need to specify an authentication alias for the SIBus database connection.
h) Create the JMS resources in Resources => JMS at the cluster scope, using the service integration bus named OptimizationServerBus created earlier (in the Bus Name field of the Connection section):
• OptimizationServerTopic, named jms/optimserverTopic in JNDI
• OptimizationServerTopicConnectionFactory, named jms/optimserverTopicConnectionFactory in JNDI
• OptimizationServerQueueConnectionFactory, named jms/optimserverQueueConnectionFactory in JNDI
• OptimizationServerTopicSpec, named jms/optimserverTopicSpec and pointing to topic jms/optimserverTopic
i) Deploy optimserver-mgmt-ear and optimserver-processor-ear at the cluster scope.
j) Install IBM HTTP Server 6.1.
Tip:
• Note that the HTTP <port> defined during install is the one that will be used in the Optimization Server connection URL http://server:<port>/optimserver to deploy your developed ODM Application.
• We recommend not installing the WAS plug-in as part of the IBM HTTP Server install, but rather launching it as a separate installation afterwards, because this makes configuration easier.
k) Install the Web server plug-ins for IBM WebSphere Application Server V6.1. At the beginning of the plug-in installation, select the check box to view the installation roadmap, then click Next. In this roadmap, identify your installation scenario and follow the installation steps.
l) Start the cluster nodes.
m) Start the cluster in Servers => Clusters.
n) Check that the Optimization Server installation is correct by going to http://server:<port>/optimserver/console.
Useful links:
• “IBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Server”
• “Roadmap: Installing the Network Deployment product”
D Configuring the ODM Application
When the ODM Repository relies on a DB2 HADR environment, the ODM Application configuration must be updated to fully benefit from automatic client reroute.
Automatic client reroute is a DB2 feature that enables a DB2 client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server, without user intervention.
To fully support this feature, the alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application. Example:
• clientRerouteAlternateServerName: alternate server names for client reroute
• clientRerouteAlternatePortNumber: alternate port numbers for client reroute
• maxRetriesForClientReroute: limits the number of retries if the primary connection to the server fails
• retryIntervalForClientReroute: the amount of time (in seconds) to sleep before retrying
Notes:
• This property list can be extended with other DB2 properties to match your needs. The list is then passed to the ODM repository and the underlying JDBC driver. Descriptions of additional properties can be found at http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0011976.html
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
This section describes the additional capabilities that are enabled when ODME's Optimization Server is deployed onto a HA cluster built using WebSphere Network Deployment and DB2 HADR.
This currently pertains to the HA configuration depicted in the sections above, built with a 2-node symmetrical IHS + WAS-ND 6.1.0.25 cluster, DB2 9.5 FP1 in an active/standby HADR configuration, and ODME 3.3.0.1.
When deploying on a multi-host cluster, the additional benefits fall into two categories: Workload Management (WLM) and High Availability (HA).
WLM is the ability to spread the processing workload across all cluster members, and is a feature brought by WebSphere's Network Deployment edition.
HA is the ability for the system to continue operating when some of its hardware, network, or software components encounter a failure.
A Workload Management capabilities of ODME 3.3.0.1
When running ODME in a multi-node clustered environment, two different types of workload are processed by OptimServer: job control (solve, abort, …) and administrative requests on one side, and job solves performed by the Optimization Engines on the other.
1) Job Control and Administrative requests Workload Management
Job control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAP/HTTP protocol, and are workload managed by the regular IHS + WAS HTTP load balancing scheme.
Since SOAP/HTTP sessions are stateless, the load balancing scheme used by WAS is round-robin, and it applies to all job control activity, whether it originates from ODM Studio or the SolveAPI.
The Optimization Server Admin console is a stateless web application and is also load balanced in a round-robin fashion by WAS.
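Because the sessions are stateless, the dispatch rule reduces to handing each new request to the next cluster member in turn. The toy sketch below illustrates round-robin dispatch; it is not the IHS WLM plug-in's actual algorithm, which also handles member weights and failover.

```python
from itertools import cycle

def round_robin_dispatcher(members):
    """Return a dispatch function that assigns each request to the next
    cluster member in strict rotation (simplified round-robin)."""
    ring = cycle(members)
    def dispatch(request):
        member = next(ring)
        return member, request
    return dispatch

# Example: four stateless requests alternate between the two WAS members.
dispatch = round_robin_dispatcher(["was1", "was2"])
targets = [dispatch("job-control-%d" % i)[0] for i in range(4)]
# targets is now ["was1", "was2", "was1", "was2"]
```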
2) Job solving Workload Management
The solver (Optimization Engine) processes are long-running, and their run duration may vary greatly across job types. They are managed by the Job Processor independently on each node.
Each Job Processor pulls jobs from the solve-pending queue in a first-in/first-out fashion whenever it has solve slots available. The resulting overall load balancing is a first-come-first-served scheme in which solves are processed across the nodes depending on their capacity.
On lightly loaded Optimization Server clusters, where the job processing load is below capacity and jobs are picked up as soon as they are queued, there will be no outstanding jobs pending in the queue, and only one of the two nodes may seem active. Once the load grows above the solving capacity of one node, outstanding jobs start to be processed evenly by the two nodes until the queue is drained.
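This first-come-first-served behavior can be modeled with a toy simulation: each processor takes the oldest pending job whenever it has a free slot, and under equal job durations the queue drains evenly across the nodes. The model below is illustrative only; the slot count mirrors the 3-slot processors described above, and a step-based clock stands in for real solve time.

```python
from collections import deque

def drain(queue, processors, slots_per_processor=3):
    """Toy model of FIFO job pulling: at each step, every processor with a
    free slot takes the oldest pending job, and one job finishes per busy
    processor per step (equal durations). Returns {processor: jobs_done}."""
    pending = deque(queue)
    running = {p: [] for p in processors}
    done = {p: 0 for p in processors}
    while pending or any(running.values()):
        # Each processor pulls jobs while it has free slots (FIFO order).
        for p in processors:
            while pending and len(running[p]) < slots_per_processor:
                running[p].append(pending.popleft())
        # One job per busy processor completes during this step.
        for p in processors:
            if running[p]:
                running[p].pop(0)
                done[p] += 1
    return done
```

For example, drain(list(range(12)), ["node1", "node2"]) returns {"node1": 6, "node2": 6}, while an odd load of 7 jobs splits 4/3, matching the even draining described above.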
A typical timeline of job control and job solves is illustrated in Figure 4 below: the job submit, enquiry, and control requests from the client are directed to the two instances while jobs are picked up for processing.
Note that in the case depicted here, DB2 HADR is set up in active/standby, so only one of the two DB2 nodes handles DB requests.
A typical balancing of load is illustrated below. The yellow line represents the queue depth, starting at 500 jobs and draining until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots each. Overall, both processors handle 2 or 3 jobs until all are processed. The diagram shows the load for short jobs of even solve durations; the X-axis units are events, not linear time.
[Figure 3: Typical balance of load on ODME 3.3.0.1. The plot shows the queue depth (0 to 600) and the number of running jobs on server1 and server2 (0 to 7) against event time.]
The irregularities towards the end are due to administrator-triggered cleanup of processed jobs from the log.
B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA
As detailed in the “Protecting the system” section, running ODME 3.3.0.1 in a clustered environment protects the overall system from failure of some of its components. This enables the system, on one hand, to continue operating across those failures and, on the other hand, to perform some level of recovery of the processing that was in flight at the point of failure.
1) Operations continuity
For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to display the Admin console, keep accepting new job submissions, and continue processing queued jobs.
Operations continuity across WAS failures: Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails; the surviving cluster member continues processing.
[Figure 4: Typical ODME 3.3.0.1 operations continuity timeline across a WAS node failure. The client's SOAP/HTTP requests are routed by the IHS WLM plug-in to WAS 1 and WAS 2; job A is created on WAS 1 and its status stored in the JOBS DB; when WAS 1 stops, WAS 2 reads job A's status from the JOBS DB and solves it alongside a newly submitted job B, while the DB2 hot standby remains ready.]
Note that the Optimization Server Admin console also continues to be served by the remaining server of the cluster.
Operations continuity across DB2 failures: when DB2 HADR has been set up, and the JOBS DB and odmapp datasources have been configured with appropriate alternate server definitions, the same kind of behavior is observed: the Optimization Server switches to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails. Newly picked up jobs will […]
2) Operations recovery
ODME 3.3.0.1 offers some level of recovery of in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process operates mainly in memory and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving results in the solve being aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on how the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.
Recovery of failed-and-recoverable jobs is based on the Optimization Server's built-in failed-job detection, which detects a timeout on the solve process (no heartbeat reported) for jobs registered as in process, marks those jobs as recoverable, and requeues them so that they are solved again.
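The detection just described amounts to comparing each in-process job's last heartbeat against a timeout and requeuing the stale ones. A minimal sketch follows, with illustrative field names and timeout value; ODME's actual schema and threshold are not documented here.

```python
def recover_stale_jobs(in_process, pending_queue, now, timeout=120.0):
    """Move jobs whose solver stopped heartbeating back to the pending queue.

    `in_process` maps job_id -> last heartbeat timestamp (seconds);
    stale jobs are treated as recoverable and requeued for another solve.
    Returns the list of recovered job ids.
    """
    recovered = []
    for job_id, last_beat in list(in_process.items()):
        if now - last_beat > timeout:
            del in_process[job_id]        # no heartbeat: assume the solver died
            pending_queue.append(job_id)  # requeue so the job is solved again
            recovered.append(job_id)
    return recovered
```

For example, with a 120-second timeout, a job last heard from 150 seconds ago is requeued, while one heard from 50 seconds ago is left running.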
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment
There are a certain number of cases where ODME 3.3.0.1 cannot ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.
Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but become more prevalent when seamless continuous operation and failure recovery are expected.
Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.
A Job processor fails to extract OPL binaries upon restart
Symptoms:
• The optimserver-processor-ear Enterprise Application is not started on the server, although optimserver-mgmt-ear is running.
• Queued jobs are not processed (they remain in NOT_STARTED state).
• Only one of the cluster members runs jobs, although the queue is saturated.
• SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70/libcplex121.a (Text file busy)
Explanation:
The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is therefore not allowed to extract them again, and fails during its initialization.
Remediation:
Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.
To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), right after restarting and before any solver instance is started, change the mode of the files in the above directory to 750 (instead of the default 755): chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix53_70/*; this forces AIX not to cache the files.
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms:
When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a “solution found” message appears in ODM Studio.
Explanation:
In some circumstances, the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, while the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case, the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve it fail because the scenario has released its solve lock.
Remediation:
The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.
C Bad error reporting when the Optimization Server loses its connection to the Repository DB
Symptoms:
The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException, params=] when the connection to the JOBS DB is lost.
Explanation:
The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.
Remediation:
This error is transient; refresh the Optimization Server Admin console after the JOBS DB has recovered.
D ODME cannot start when WAS administrative security is enabled
Symptoms:
Although WAS with administrative security is not currently supported by ODME 3.3.0.1, deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled.
This results in an exception being raised during the startup of Optimization Server, reported in SystemOut.log.
Explanation:
The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.
Remediation:
WAS administrative security may be turned on, but write access to JNDI must then be granted to the EVERYONE group. This is done in the WAS Admin console, in the Environment -> Naming -> CORBA Naming Service Groups section: the group EVERYONE has to be added with Cos Naming Read, Write, Create, and Delete authorization.
E ODM solver does not start
Symptoms:
Solve jobs all end up in FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.
Explanation:
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running.
Remediation:
Run redist\vcredist\vcredist_x86.exe from the ODM Enterprise Developer Edition redist directory on all machines where Optimization Server will execute ODM solve jobs.
Tip
bull Note the HTTP ltportgt defined during install is the one that will be used in Optimisation Server connection URL httpserverltport gtoptimserver to deploy your developed ODM Application
bull We recommend not to install WAS plugin as part of the IBM HTTP Server install but rather to launch as a separate installation afterwards because it makes configuration easier
k) Install Web server plug‐ins for IBM WebSphere Application Server V61 At the beginning of
plugin installation select the check box to view the installation roadmap then click Next In this roadmap identify your installation scenario and follow the installation steps
l) Start cluster nodes
m) Start the cluster in Servers =gt Clusters
n) Check that the Optimization Server installation is correct by going to httpserverltportgtoptimserverconsole
Useful links
bull ldquoIBM ILOG ODM Enterprise Optimization Server Installation Guide for WebSphere Application Serverrdquo
bull ldquoRoadmap Installing the Network Deployment productrdquo
D Configuring the ODM Application When ODM Repository relies on a DB2 HADR environment ODM Application configuration must be updated to fully benefit from automatic client reroute
13
Automatic client reroute is a DB2 feature that enables a DB2 Client to recover from a loss of connection to the DB2 server by rerouting the connection to an alternate server This automatic connection rerouting occurs automatically
To fully support this feature alternate server name and port should be specified with additional repository properties in the deployment settings file (odmsds) of your ODM Application Example
bull clientRerouteAlternateServerName alternate server names for client reroute bull clientRerouteAlternatePortNumber alternate port numbers for client reroute bull maxRetriesForClientReroute limits the number of retries if the primary connection to the
server fails bull retryIntervalForClientReroute amount of time (in seconds) to sleep before retrying again
Notes bull This property list can be extended with other DB2 properties to match your
needs This list is then passed to the ODM repository and underlying JDBC driver Additional properties description can be found at httppublibboulderibmcominfocenterdb2luwv9r5indexjsptopic=comibmdb2luwadminhadocdocc0011976html
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
This section describes the additional capabilities that are enabled when ODMEs Optimization Server is deployed onto a HA cluster built using WebSphere‐Network Deployment and DB2 HADR
This currently pertains to the HA configuration as depicted in the sections above built with a 2‐node symetrical IHS+WAS‐ND 61025 cluster DB2 95 FP1 in activestandby HADR config and ODME 3301
When deploying on a multi‐host cluster the additional benefits fall in two categories Work Load Management (WLM) and High Availability (HA)
WLM is the ability to spread the processing workload across all cluster members and is a feature brought by WebSpheres NetWork‐Deployment version
HA is the ability for the system to continue operating continuously when some of its hardware network or software components encounter a failure
A Workload Management capabilities of ODME 3301 When running ODME in a multi‐node clustered environment there are two different types of workload being processed by OptimServer job control (solve abort ) and administrative requests on one side and job solves performed by the Optimization Engines
1) Job Control and Administrative requests Workload Management Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAPHTTP protocol and will be workload managed by the regular IHS+WAS HTTP load balancing scheme
Since SOAPHTTP sessions are stateless the load balancing scheme used by WAS will be round‐robin and will apply to all Job Control activity whether it is originating from ODM Studio or the SolveAPI
The Optimization Server Admin console is a stateless web application and will be also be load balanced in a round‐robin fashion by WAS
2) Job solving Workload Management The solver Optimization Engines processes are long‐running and their run duration may vary a lot across job types They are managed by the Job Processor independently on each node
Each Job Processor will pull jobs from the solve‐pending queue in a first‐infirst‐out fashion whenever there are solve slots available The resulting overall load balancing is a first‐come‐first‐serve scheme where solves will be processed across the nodes depending on their capacity
On lightly loaded Optimization Server clusters where the jobs processing load is below capacity and jobs are picked as soon as they are queued there will be no outstanding jobs pending in the queue and only one of the two nodes may seem active Once the load grows above solving
15
capacity of one node outstanding jobs will start to be processed evenly by the two nodes until the queue is drained
A typical timeline of job control and job solves is illustrated in Figure 2 below The job submit enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing
Note that in the case depicted here, DB2 HADR is set up in Active/Standby mode, so only one of the two DB2 nodes will be handling DB requests.
A typical balancing of load is illustrated below. The yellow line represents the queue depth, starting at 500 jobs and draining until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve-processing slots each. Overall, both processors will be handling 2 or 3 jobs until all are processed. The diagram shows the load for short jobs of even solve durations; the X-axis units are events, not linear time.
Figure 3: Typical balance of load on ODME 3.3.0.1 (X axis: time in events; Y axis: queue depth, 0-500, and running solves on server1 and server2)
The irregularities towards the end are due to administrator-triggered cleansing of the processed jobs from the log.
B High Availability capabilities of ODME 3.3.0.1 on WAS-ND & DB2 HA
As detailed in the "Protecting the system" section, running ODME 3.3.0.1 in a clustered environment protects the overall system from failure of some of its components. This gives the system the ability, on one hand, to continue operating across those failures and, on the other hand, to perform some level of recovery on the processing that was in flight at the point of failure.
1) Operations continuity
For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to keep displaying the Admin console, keep accepting new job submissions, and continue processing queued jobs.
Operations continuity across WAS failures
Figure 4 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails: the surviving cluster member continues processing.
Figure 4: Typical ODME 3.3.0.1 Operations Continuity timeline across WAS node failure (client SOAP/HTTP requests are balanced by the IHS WLM plugin across WAS 1 and WAS 2; after WAS 1 stops, WAS 2 creates and solves job B, picks up job A, and stores status updates in the JOBS DB, whose HADR standby remains hot)
Note that the Optimization Server Admin console will also continue to be served by the remaining server of the cluster.
Operations continuity across DB2 failures
When DB2 HADR has been set up, and the JOBS DB and odmapp datasources have been configured with appropriate alternate server definitions, the same kind of behavior will be observed: the Optimization Server will switch to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails, and newly picked-up jobs will be processed against the alternate instance as well.
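From the datasource's point of view, the failover amounts to retrying the primary and then falling back to the configured alternate server. The sketch below illustrates that logic only; the function and its signature are illustrative, not a DB2 driver API, and the retry count is an assumed value:

```python
def connect_with_reroute(primary, alternate, connect, max_retries=3):
    """Illustrative client-reroute logic: try the primary server first,
    then the configured alternate, retrying each a bounded number of
    times. `connect` stands in for the real driver's connection call."""
    last_error = None
    for host in (primary, alternate):
        for _ in range(max_retries):
            try:
                return connect(host)
            except ConnectionError as exc:
                last_error = exc   # remember the failure, try again
    raise last_error
```

The real rerouting happens inside the DB2 JDBC driver, transparently to the Optimization Server, which is why only the alternate server definitions need to be configured.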
2) Operations recovery
ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and has no ability to store intermediate synchronization points, so a failure of a solver process while solving will cause the solve to be aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on the way the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.
Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-job detection, which detects a timeout on the solve process (no heartbeat reported) for jobs registered as in process, marks those jobs as recoverable, and requeues them so that they are solved again.
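The detection step can be sketched as a periodic sweep over in-process jobs; the 30-second timeout below is an assumed value for illustration, not ODME's actual setting:

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # assumed value; ODME's real timeout may differ

def sweep_failed_jobs(in_process, last_heartbeat, pending_queue, now=None):
    """Mark in-process jobs whose solver stopped reporting heartbeats as
    recoverable and requeue them on the solve-pending queue."""
    now = time.time() if now is None else now
    recovered = []
    for job_id in list(in_process):
        if now - last_heartbeat.get(job_id, 0.0) > HEARTBEAT_TIMEOUT:
            in_process.remove(job_id)       # no longer considered running
            pending_queue.append(job_id)    # will be solved again on next pull
            recovered.append(job_id)
    return recovered
```

A requeued job then flows through the normal first-in/first-out pull described earlier, so recovery needs no dedicated scheduling path.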
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment
There are a certain number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.
Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but become more prevalent when seamless continuous operations and failure recovery are expected.
Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.
A Job processor fails to extract OPL binaries upon restart
Symptoms
The optimserver-processor-ear Enterprise Application is not started on the server, although optimserver-mgmt-ear is running.
Queued jobs are not processed (they remain in the NOT_STARTED state).
Only one of the cluster members runs jobs although the queue is saturated
SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix5.3_7.0/libcplex121.a (Text file busy)
Explanation
The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is thus not allowed to extract them again and fails during its initialization.
Remediation
Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix5.3_7.0 before starting the WAS server where the Optimization Server is deployed.
To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance), change the mode of the files in the above directory to 750 (instead of the default 755) right after restarting and before any solver instance is started:
chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odm/opl/odm/bin/power64_aix5.3_7.0
This forces AIX not to cache the files.
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms
When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.
Explanation
In some circumstances, the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, while the Job Processor is not able to update the JOBS DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case, the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve it will fail because the scenario has released its solve lock.
Remediation
The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.
C Bad error reporting when Optimization Server loses its connection to the Repository DB
Symptoms
The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException params=] when the connection to the JOBS DB is lost.
Explanation
The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.
Remediation
This error is transient: refresh the Optimization Server Admin console after the JOBS DB has recovered.
D ODME cannot start when WAS administrative security is enabled
Symptoms
WAS with administrative security is not currently supported by ODME 3.3.0.1; however, deployers of the Optimization Server in a clustered WAS environment may need to deploy it with security enabled.
This results in an exception being raised during startup of the Optimization Server, reported in SystemOut.log.
Explanation
The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.
Remediation
WAS administrative security may be turned on, but write access to JNDI must then be granted to the EVERYONE group. This is achieved using the WAS Admin console, in the Environment -> Naming -> CORBA Naming Service Groups section: the group EVERYONE has to be added with the Cos Naming Read, Write, Create, and Delete authorizations.
E ODM solver does not start
Symptoms
Solve jobs all end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.
Explanation
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.
Remediation
Run redist\vcredist\vcredist_x86.exe from the ODM Enterprise Developer edition redist directory on all machines where the Optimization Server will execute ODM solve jobs.
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
This section describes the additional capabilities that are enabled when ODMEs Optimization Server is deployed onto a HA cluster built using WebSphere‐Network Deployment and DB2 HADR
This currently pertains to the HA configuration as depicted in the sections above built with a 2‐node symetrical IHS+WAS‐ND 61025 cluster DB2 95 FP1 in activestandby HADR config and ODME 3301
When deploying on a multi‐host cluster the additional benefits fall in two categories Work Load Management (WLM) and High Availability (HA)
WLM is the ability to spread the processing workload across all cluster members and is a feature brought by WebSpheres NetWork‐Deployment version
HA is the ability for the system to continue operating continuously when some of its hardware network or software components encounter a failure
A Workload Management capabilities of ODME 3301 When running ODME in a multi‐node clustered environment there are two different types of workload being processed by OptimServer job control (solve abort ) and administrative requests on one side and job solves performed by the Optimization Engines
1) Job Control and Administrative requests Workload Management Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAPHTTP protocol and will be workload managed by the regular IHS+WAS HTTP load balancing scheme
Since SOAPHTTP sessions are stateless the load balancing scheme used by WAS will be round‐robin and will apply to all Job Control activity whether it is originating from ODM Studio or the SolveAPI
The Optimization Server Admin console is a stateless web application and will be also be load balanced in a round‐robin fashion by WAS
2) Job solving Workload Management The solver Optimization Engines processes are long‐running and their run duration may vary a lot across job types They are managed by the Job Processor independently on each node
Each Job Processor will pull jobs from the solve‐pending queue in a first‐infirst‐out fashion whenever there are solve slots available The resulting overall load balancing is a first‐come‐first‐serve scheme where solves will be processed across the nodes depending on their capacity
On lightly loaded Optimization Server clusters where the jobs processing load is below capacity and jobs are picked as soon as they are queued there will be no outstanding jobs pending in the queue and only one of the two nodes may seem active Once the load grows above solving
15
capacity of one node outstanding jobs will start to be processed evenly by the two nodes until the queue is drained
A typical timeline of job control and job solves is illustrated in Figure 2 below The job submit enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing
Note that in the case depicted here instance DB2 HADR is setup in ActiveStandby so only one of the two DB2 nodes will be handling DB requests
A typical balancing of load is illustrated below The yellow line represents the queue depth starting at 500 jobs and consuming the load until it reaches 0 Green and cyan lines represent the current processing load of each of the job processors which have 3 solve processing slots Overall both processors will be handling 2 or 3 jobs until all are processed The diagram shows load for short jobs of even solve durations the X axis unit are events not linear time
0
1
2
3
4
5
6
7
time
0
100
200
300
400
500
600
running
server1
server2Queue Depth
Figure 3 Typical balance of load on ODME 3301
16
The irregularities towards the end are due to some administrator‐triggered cleansing of the processed jobs from the log
B High Availablity capabilities of ODME 3301 on WAS-NDampDB2 HA As detailed in the Protecting the system section running ODME 3301 in a clustered environment allows protection of the overall system from failure of some of its components This provides the ability for the system on one hand to continue operating across those failures and on the other hand to perform some level of recovery on the processing that was inflight at the point of failure
1) Operations continuity For ODME 3301 operations continuity is the ability for the Optimization Server to display the Admin console keep the capacity to accept new jobs submissions and continue processing queued jobs
Operations continuity across WAS failures Figure 2 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails the surviving cluster member will continue processing
Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure
Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure
stopWAS 1
WAS 1mgmt Create job A
Client submit job A
proc
JOBSDB
WAS 2
store
running
complete
mgmt
proc
submit job B
Create job B
Solve job B
hot
standby
job A Q status
job A read status
Solve job A
readStatus
progress
IHSWLMplugin WLM WLM
SO
AP
HTTP
SOA
PH
TTP
SOAP
HTTP
17
Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster
Operations continuity across DB2 failures When DB2 HADR has been setup and the JOBS DB and odmapp datasources have been set up with appropriate alternate server definitions the same kind of behavior will be observed where the Optimization Server will switch to the alternate DB instance for Jobs control and Admin console (JOBS DB) when the primary one fails Newly picked up jobs will
2) Operations recovery ODME 3301 offers some level of recovery for inflight jobs through WAS or DB2 failures The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points so a failure of a solver process while solving will result in the solve to be aborted and eventually marked as either failed‐and‐recoverable or unrecoverable depending on the way the failure happens Cases when the jobs cannot be recovered are documented in the next chapter
Failed‐and‐recoverable jobs recovery is based on the Optimization Servers built‐in failed jobs detection which will basically detect a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process mark the jobs as recoverable and requeue so that they are solved again
18
4 Troubleshooting and limitations of ODME 3301 operating in a clustered environment
There are a certain number of cases where ODME 3301 will not be able to ensure a full recovery after the failure of one of the components involved in operations These cases may be addressed in subsequent fix packs of ODME
Those cases are listed below Note that most of the issues are not directly due to ODME being deployed in a clustered configuration but become more prevalent when seamless continuous operations and failure recovery is expected
Whenever possible we provide some troubleshooting tips to alleviate or circumvent the issues
A Job processor fails to extract OPL binaries upon restart
Symptoms
optimserver‐processor‐ear Enterprise Application is not started on the server although the optimserver‐mgmt‐ear is running
Queued jobs are not processed (remain in NOT_STARTED state)
Only one of the cluster members runs jobs although the queue is saturated
SystemErrlog contains an exception similar to javaioFileNotFoundException usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70libcplex121a (Text file busy)
Explanation
The OPL binaries are cached and locked for direct writing by the AIX operating system The job processor EAR module is thus not allowed to extract them again and fails during its initialization
Remediation
Delete the files in usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 before starting the WAS server where the Optimization Server is deployed
In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance) right after restarting and before any solver instance is started change the mod of the files in the above directory to 750 (instead of the default 755) chmod 750 usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 this will force AIX not to cache the files
19
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms
When a database failure occurs a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable although the solve has completed successfully and a solution found message appears in ODM Studio
Explanation
In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB but the Job Processor is not able to update the Jobs DB This may happen when the JOBSDB store operations occur during database connection recovery In this case the solve job is eventually detected as timed‐out by the Optimization Server and marked for recovery but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock
Remediation
The scenario is actually solved although it is not properly reported as such by the Optimization Server The business user will see the scenario as solved from within the ODM Studio and the corresponding job can safely be cleared from the Optimization Server Admin console
C Bad error reporting when Optimization Server loses connection from the Repository DB
Symptoms
The Optimization Server Admin console displays an Error 500 [code=javaxtransactionRollbackExceptionparams=] when connection to the JOBS DB is lost
Explanation
The JOBS DB connection is lost and The Optimization Server Admin console cannot extract the jobs queue status for display
Remediation
This error is transient refresh the Optimization Server Admin console after the JOBS DB will have recovered
D ODME cannot start when WAS administrative security is enabled
Symptoms
Although WAS with administrative security is not currently supported by ODME 3301 deployers of Optimization Server in a clustered WAS environment may need to deploy Optimization Server with security enabled
This results in an exception being raised during startup of Optimization Server reported in the
20
SystemOutlog
Explanation
The Optimization Server needs to update some shared variables through JNDI during its startup and thus needs write access to the WAS JNDI tree
Remediation
WAS administrative security may be turned on but then write access to JNDI should be granted to the everyone group This is achieved using the WAS Admin console in the Environment‐gtNaming‐gtCORBA Naming Service Group section Group EVERYONE has to be added with Cos Naming Read Write Create Delete authorization
E ODM solver does not start
Symptoms
Solve jobs all end up in FAILED state and the log contains a line starting with javaioIOException CreateProcess and ending in error=14001
Explanation
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where Optimization Server is running
Remediation
Run redistvcredistvcredist_x86exe from the ODM Enterprise Developer edition redist directory on all machines where Optimization Server will execute ODM solve jobs
21
3 ODM Enterprise Capabilities in a HA WAS and DB2 cluster
This section describes the additional capabilities that are enabled when ODMEs Optimization Server is deployed onto a HA cluster built using WebSphere‐Network Deployment and DB2 HADR
This currently pertains to the HA configuration as depicted in the sections above built with a 2‐node symetrical IHS+WAS‐ND 61025 cluster DB2 95 FP1 in activestandby HADR config and ODME 3301
When deploying on a multi‐host cluster the additional benefits fall in two categories Work Load Management (WLM) and High Availability (HA)
WLM is the ability to spread the processing workload across all cluster members and is a feature brought by WebSpheres NetWork‐Deployment version
HA is the ability for the system to continue operating continuously when some of its hardware network or software components encounter a failure
A Workload Management capabilities of ODME 3301 When running ODME in a multi‐node clustered environment there are two different types of workload being processed by OptimServer job control (solve abort ) and administrative requests on one side and job solves performed by the Optimization Engines
1) Job Control and Administrative requests Workload Management Job Control and administrative requests are submitted to the WAS Optimization Server cluster through the SOAPHTTP protocol and will be workload managed by the regular IHS+WAS HTTP load balancing scheme
Since SOAPHTTP sessions are stateless the load balancing scheme used by WAS will be round‐robin and will apply to all Job Control activity whether it is originating from ODM Studio or the SolveAPI
The Optimization Server Admin console is a stateless web application and will be also be load balanced in a round‐robin fashion by WAS
2) Job solving Workload Management The solver Optimization Engines processes are long‐running and their run duration may vary a lot across job types They are managed by the Job Processor independently on each node
Each Job Processor will pull jobs from the solve‐pending queue in a first‐infirst‐out fashion whenever there are solve slots available The resulting overall load balancing is a first‐come‐first‐serve scheme where solves will be processed across the nodes depending on their capacity
On lightly loaded Optimization Server clusters where the jobs processing load is below capacity and jobs are picked as soon as they are queued there will be no outstanding jobs pending in the queue and only one of the two nodes may seem active Once the load grows above solving
15
capacity of one node outstanding jobs will start to be processed evenly by the two nodes until the queue is drained
A typical timeline of job control and job solves is illustrated in Figure 2 below The job submit enquiry and control requests from the client are directed to the two instances while jobs are picked up for processing
Note that in the case depicted here instance DB2 HADR is setup in ActiveStandby so only one of the two DB2 nodes will be handling DB requests
A typical balancing of load is illustrated below The yellow line represents the queue depth starting at 500 jobs and consuming the load until it reaches 0 Green and cyan lines represent the current processing load of each of the job processors which have 3 solve processing slots Overall both processors will be handling 2 or 3 jobs until all are processed The diagram shows load for short jobs of even solve durations the X axis unit are events not linear time
0
1
2
3
4
5
6
7
time
0
100
200
300
400
500
600
running
server1
server2Queue Depth
Figure 3 Typical balance of load on ODME 3301
16
The irregularities towards the end are due to some administrator‐triggered cleansing of the processed jobs from the log
B High Availablity capabilities of ODME 3301 on WAS-NDampDB2 HA As detailed in the Protecting the system section running ODME 3301 in a clustered environment allows protection of the overall system from failure of some of its components This provides the ability for the system on one hand to continue operating across those failures and on the other hand to perform some level of recovery on the processing that was inflight at the point of failure
1) Operations continuity For ODME 3301 operations continuity is the ability for the Optimization Server to display the Admin console keep the capacity to accept new jobs submissions and continue processing queued jobs
Operations continuity across WAS failures Figure 2 illustrates the typical timeline when one of the WAS cluster members is stopped or otherwise fails the surviving cluster member will continue processing
Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure
Figure 4 Typical ODME 3301 Operations Continuity timeline across WAS node failure
stopWAS 1
WAS 1mgmt Create job A
Client submit job A
proc
JOBSDB
WAS 2
store
running
complete
mgmt
proc
submit job B
Create job B
Solve job B
hot
standby
job A Q status
job A read status
Solve job A
readStatus
progress
IHSWLMplugin WLM WLM
SO
AP
HTTP
SOA
PH
TTP
SOAP
HTTP
17
Note that the Optimization Server Admin console will also continue to be handled by the remaining server of the cluster
Operations continuity across DB2 failures When DB2 HADR has been setup and the JOBS DB and odmapp datasources have been set up with appropriate alternate server definitions the same kind of behavior will be observed where the Optimization Server will switch to the alternate DB instance for Jobs control and Admin console (JOBS DB) when the primary one fails Newly picked up jobs will
2) Operations recovery ODME 3301 offers some level of recovery for inflight jobs through WAS or DB2 failures The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points so a failure of a solver process while solving will result in the solve to be aborted and eventually marked as either failed‐and‐recoverable or unrecoverable depending on the way the failure happens Cases when the jobs cannot be recovered are documented in the next chapter
Failed‐and‐recoverable jobs recovery is based on the Optimization Servers built‐in failed jobs detection which will basically detect a timeout on the solve process (no heartbeat reported) for jobs that are registered as in process mark the jobs as recoverable and requeue so that they are solved again
18
4 Troubleshooting and limitations of ODME 3301 operating in a clustered environment
There are a certain number of cases where ODME 3301 will not be able to ensure a full recovery after the failure of one of the components involved in operations These cases may be addressed in subsequent fix packs of ODME
Those cases are listed below Note that most of the issues are not directly due to ODME being deployed in a clustered configuration but become more prevalent when seamless continuous operations and failure recovery is expected
Whenever possible we provide some troubleshooting tips to alleviate or circumvent the issues
A Job processor fails to extract OPL binaries upon restart
Symptoms
optimserver‐processor‐ear Enterprise Application is not started on the server although the optimserver‐mgmt‐ear is running
Queued jobs are not processed (remain in NOT_STARTED state)
Only one of the cluster members runs jobs although the queue is saturated
SystemErrlog contains an exception similar to javaioFileNotFoundException usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70libcplex121a (Text file busy)
Explanation
The OPL binaries are cached and locked for direct writing by the AIX operating system The job processor EAR module is thus not allowed to extract them again and fails during its initialization
Remediation
Delete the files in usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 before starting the WAS server where the Optimization Server is deployed
In order to allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after it has been stopped (for failure or maintenance) right after restarting and before any solver instance is started change the mod of the files in the above directory to 750 (instead of the default 755) chmod 750 usrIBMILOGODME3301Deploymentappsruntimesodmoplodmbinpower64_aix53_70 this will force AIX not to cache the files
19
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms
When a database failure occurs a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable although the solve has completed successfully and a solution found message appears in ODM Studio
Explanation
In some circumstances the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB but the Job Processor is not able to update the Jobs DB This may happen when the JOBSDB store operations occur during database connection recovery In this case the solve job is eventually detected as timed‐out by the Optimization Server and marked for recovery but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock
Remediation
capacity of one node, outstanding jobs will start to be processed evenly by the two nodes until the queue is drained.
A typical timeline of job control and job solves is illustrated in Figure 2 below. The job submit, enquiry, and control requests from the client are directed to the two instances while jobs are picked up for processing.
Note that in the case depicted here, the DB2 HADR instance is set up in Active/Standby mode, so only one of the two DB2 nodes will be handling DB requests.
A typical balancing of load is illustrated below. The yellow line represents the queue depth, starting at 500 jobs and consuming the load until it reaches 0. The green and cyan lines represent the current processing load of each of the job processors, which have 3 solve processing slots. Overall, both processors will be handling 2 or 3 jobs until all are processed. The diagram shows the load for short jobs of even solve durations; the X axis unit is events, not linear time.
Figure 3: Typical balance of load on ODME 3.3.0.1 (queue depth, starting at 500, and the running-job counts of server1 and server2, plotted against solve events)
The irregularities towards the end are due to some administrator-triggered cleansing of the processed jobs from the log.
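The balanced drain described above can be reproduced with a small simulation. This is an illustrative sketch only, not ODME code; the processor count, slot count, and uniform job durations are assumptions taken from the diagram's description.

```python
# Illustrative sketch (not ODME code): two job processors, each with
# 3 solve slots, drain a shared queue of 500 equal-length jobs.
from collections import deque

def drain(queue_depth=500, processors=2, slots=3, job_ticks=1):
    queue = deque(range(queue_depth))
    running = [[] for _ in range(processors)]   # remaining ticks per active job
    history = []                                # (queue depth, load per processor)
    while queue or any(running):
        # Each processor picks up jobs while it has free solve slots.
        for r in running:
            while len(r) < slots and queue:
                queue.popleft()
                r.append(job_ticks)
        history.append((len(queue), [len(r) for r in running]))
        # Advance one event tick: jobs whose time is up leave their slot.
        for i, r in enumerate(running):
            running[i] = [t - 1 for t in r if t > 1]
    return history

hist = drain()
# Both processors stay at their 3-job capacity until the queue is
# nearly drained, matching the flat green and cyan lines in Figure 3.
```

With even job durations, the load stays perfectly balanced; uneven solve times would produce the 2-or-3-jobs oscillation the text mentions.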
B High Availability capabilities of ODME 3.3.0.1 on WAS ND & DB2 HA
As detailed in the Protecting the system section, running ODME 3.3.0.1 in a clustered environment protects the overall system from the failure of some of its components. This allows the system, on the one hand, to continue operating across those failures and, on the other hand, to perform some level of recovery on the processing that was in flight at the point of failure.
1) Operations continuity
For ODME 3.3.0.1, operations continuity is the ability for the Optimization Server to keep displaying the Admin console, retain the capacity to accept new job submissions, and continue processing queued jobs.
Operations continuity across WAS failures
Figure 4 illustrates the typical timeline: when one of the WAS cluster members is stopped or otherwise fails, the surviving cluster member continues processing.
Figure 4: Typical ODME 3.3.0.1 operations continuity timeline across a WAS node failure (the client's SOAP/HTTP requests are balanced across WAS 1 and WAS 2 by the IHS WLM plug-in; jobs A and B are created and stored in the JOBS DB, with DB2 in hot standby; when WAS 1 stops, WAS 2 picks up job A from the queue and completes the solve)
Note that the Optimization Server Admin console will also continue to be served by the remaining server of the cluster.
Operations continuity across DB2 failures
When DB2 HADR has been set up, and the JOBS DB and odmapp datasources have been configured with appropriate alternate server definitions, the same kind of behavior will be observed: the Optimization Server will switch to the alternate DB instance for job control and the Admin console (JOBS DB) when the primary one fails. Newly picked up jobs will then be processed against the alternate instance.
2) Operations recovery
ODME 3.3.0.1 offers some level of recovery for in-flight jobs through WAS or DB2 failures. The Optimization Engine solver process itself operates mainly in memory and does not have the ability to store intermediate synchronization points, so a failure of a solver process while solving will result in the solve being aborted and eventually marked as either failed-and-recoverable or unrecoverable, depending on how the failure happens. Cases where jobs cannot be recovered are documented in the next chapter.
Failed-and-recoverable job recovery is based on the Optimization Server's built-in failed-job detection, which detects a timeout on the solve process (no heartbeat reported) for jobs registered as in process, marks those jobs as recoverable, and requeues them so that they are solved again.
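The detection principle can be sketched as follows. This is an illustrative model only, not the actual ODME implementation; the function, state names, and timeout value are hypothetical.

```python
# Illustrative sketch of failed-job detection (hypothetical names, not
# the ODME implementation): jobs registered as in process whose solver
# has stopped reporting heartbeats are marked recoverable and requeued
# so they are solved again.
import time

HEARTBEAT_TIMEOUT = 60.0   # assumed: seconds without a heartbeat before a solve is presumed dead

def recover_failed_jobs(jobs, queue, now=None):
    """jobs maps job_id -> {'state': ..., 'last_heartbeat': epoch seconds}."""
    now = time.time() if now is None else now
    recovered = []
    for job_id, job in jobs.items():
        stale = now - job['last_heartbeat'] > HEARTBEAT_TIMEOUT
        if job['state'] == 'PROCESSING' and stale:
            job['state'] = 'RECOVERABLE'   # mark for re-solve
            queue.append(job_id)           # requeue for the next free processor
            recovered.append(job_id)
    return recovered
```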
4 Troubleshooting and limitations of ODME 3.3.0.1 operating in a clustered environment
There are a number of cases where ODME 3.3.0.1 will not be able to ensure a full recovery after the failure of one of the components involved in operations. These cases may be addressed in subsequent fix packs of ODME.
Those cases are listed below. Note that most of the issues are not directly due to ODME being deployed in a clustered configuration, but they become more prevalent when seamless continuous operation and failure recovery are expected.
Whenever possible, we provide troubleshooting tips to alleviate or circumvent the issues.
A Job processor fails to extract OPL binaries upon restart
Symptoms
The optimserver-processor-ear Enterprise Application is not started on the server, although optimserver-mgmt-ear is running.
Queued jobs are not processed (they remain in the NOT_STARTED state).
Only one of the cluster members runs jobs, although the queue is saturated.
SystemErr.log contains an exception similar to: java.io.FileNotFoundException: /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70/libcplex121.a (Text file busy)
Explanation
The OPL binaries are cached and locked for direct writing by the AIX operating system. The job processor EAR module is thus not allowed to extract them again and fails during its initialization.
Remediation
Delete the files in /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70 before starting the WAS server where the Optimization Server is deployed.
To allow a subsequent automated warm restart of WAS and its Optimization Server EAR modules after a stop (for failure or maintenance), change the mode of the files in the above directory to 750 (instead of the default 755) right after restarting and before any solver instance is started: chmod 750 /usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70/* This forces AIX not to cache the files.
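The two remediation steps can be wrapped in a small helper script. This is a convenience sketch, not part of ODME; the helper names are hypothetical, and BIN_DIR should be adjusted to the actual deployment path of your installation.

```python
# Convenience sketch of the remediation above (hypothetical helpers):
# delete the extracted OPL binaries before a cold start, or chmod them
# to 750 after a restart so AIX does not cache-lock them.
import os
import stat

# Assumed path, taken from the error message; adjust for your install.
BIN_DIR = "/usr/IBM/ILOG/ODME3301/Deployment/apps/runtimes/odmopl/odm/bin/power64_aix53_70"

def clean_binaries(bin_dir=BIN_DIR):
    """Remove extracted binaries so the job processor can re-extract them."""
    for name in os.listdir(bin_dir):
        os.remove(os.path.join(bin_dir, name))

def relax_caching(bin_dir=BIN_DIR):
    """chmod 750 every file so AIX does not hold a text-file lock on it."""
    for name in os.listdir(bin_dir):
        os.chmod(os.path.join(bin_dir, name),
                 stat.S_IRWXU | stat.S_IRGRP | stat.S_IXGRP)   # mode 0o750
```

Run clean_binaries() before a cold start, or relax_caching() right after a restart and before any solver instance is started.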
B Solve cannot recover after WAS job-processor or odmsolver stops
Symptoms
When a database failure occurs, a scenario solve job may be marked in the Optimization Server Admin console as failed and unrecoverable, although the solve has completed successfully and a "solution found" message appears in ODM Studio.
Explanation
In some circumstances, the odmsolver may complete solving a scenario and be able to store the solve result in the Scenario DB, while the Job Processor is not able to update the Jobs DB. This may happen when the JOBS DB store operations occur during database connection recovery. In this case, the solve job is eventually detected as timed out by the Optimization Server and marked for recovery, but subsequent attempts by the Job Processor to solve will fail because the scenario has released its solve lock.
Remediation
The scenario is actually solved, although it is not properly reported as such by the Optimization Server. The business user will see the scenario as solved from within ODM Studio, and the corresponding job can safely be cleared from the Optimization Server Admin console.
C Bad error reporting when the Optimization Server loses its connection to the Repository DB
Symptoms
The Optimization Server Admin console displays an Error 500 [code=javax.transaction.RollbackException params=] when the connection to the JOBS DB is lost.
Explanation
The JOBS DB connection is lost, and the Optimization Server Admin console cannot extract the jobs queue status for display.
Remediation
This error is transient; refresh the Optimization Server Admin console after the JOBS DB has recovered.
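For unattended setups, the refresh can be automated by polling the console until the transient error clears. This is a hypothetical client-side helper; the URL, attempt count, and delay are examples only, not part of ODME.

```python
# Hypothetical helper (URL and timings are examples, not ODME defaults):
# poll the Admin console until the transient Error 500 clears once the
# JOBS DB has recovered.
import time
import urllib.error
import urllib.request

def wait_until_console_recovers(url, attempts=10, delay=5.0):
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status == 200:
                    return True                # console is back
        except urllib.error.HTTPError as err:
            if err.code != 500:                # only the transient rollback error is expected
                raise
        except urllib.error.URLError:
            pass                               # the server itself may be failing over
        time.sleep(delay)
    return False
```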
D ODME cannot start when WAS administrative security is enabled
Symptoms
Although WAS with administrative security enabled is not currently supported by ODME 3.3.0.1, deployers of the Optimization Server in a clustered WAS environment may need to deploy it with security enabled.
This results in an exception being raised during startup of the Optimization Server, reported in SystemOut.log.
Explanation
The Optimization Server needs to update some shared variables through JNDI during its startup, and thus needs write access to the WAS JNDI tree.
Remediation
WAS administrative security may be turned on, but write access to JNDI must then be granted to the EVERYONE group. This is achieved using the WAS Admin console, in the Environment > Naming > CORBA Naming Service Groups section: the group EVERYONE has to be added with Cos Naming Read, Write, Create, and Delete authorization.
E ODM solver does not start
Symptoms
Solve jobs all end up in the FAILED state, and the log contains a line starting with java.io.IOException: CreateProcess and ending in error=14001.
Explanation
The Microsoft Visual C++ Redistributable libraries have not been installed on the WAS host where the Optimization Server is running.
Remediation
Run redist\vcredist\vcredist_x86.exe from the ODM Enterprise Developer Edition installation on all machines where the Optimization Server will execute ODM solve jobs.