System Monitoring Best Practices For BCM 6.0 & 7.0 version 1.0
System monitoring Practices for BCM 6.0 & 7.0

May 06, 2017

basis445
System Monitoring Best Practices

For

BCM 6.0 & 7.0 version 1.0

Contents

OVERVIEW

KEY AREAS TO MONITOR
  Automatically triggered events/reports
  Daily manual check
  Weekly manual check
  Monthly manual check
  General good practices

LOGS IN BCM 6.0
  Generally
  CEM
  Call Dispatcher
  Media Routing Server
  Federation Bridge
  WEB
  Connection Server
  H.323 bridge
  High Availability Controller
  Data Collector
  SIP bridge
  Detailed CEM errors for timeouts between CoS & CDT

LOGS IN BCM 7.0
  Error types
  Features
  Logging Categories and Locations

SOLUTIONS IN USE BY CUSTOMERS
  Call Robot

OVERVIEW

Once a BCM system is installed and running in live production, it is recommended to monitor your system on a regular basis. This helps to ensure the system continues to run smoothly, and any potential issues are identified early. Many problems are preventable, such as a server running out of disk space. By proactively monitoring your system, you can prevent problems from escalating and potentially causing system downtime.

This document is designed to highlight the key areas to monitor in your BCM environment, and suggest some best practices.

These are just guidelines, which each customer and partner must review and adapt for their own environments. The named third-party solutions are only examples; you may have other solutions for the same purposes.

Please also see the infrastructure_guide.pdf.

KEY AREAS TO MONITOR

Note that the frequency of checking is only a suggestion.

Automatically triggered events/reports

Critical server elements: A server monitoring program should be set up to monitor items such as hard disk space, CPU usage and available memory on the production servers. Solutions such as Microsoft Operations Manager, HP OpenView, WhatsUp or Nagios can monitor the system components and create reports. Typically, partners already have an established hardware monitoring system, which they adapt for the BCM servers.

These monitors should raise automatic alerts when one of these elements passes a specified limit, such as less than 10 GB of hard disk space available.
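As an illustration of such a limit check, the following minimal Python sketch compares free disk space against a threshold. It is not part of BCM or of any of the named monitoring products; the function names and the 10 GiB default are assumptions chosen to mirror the example limit above.

```python
import shutil

GIB = 1024 ** 3  # bytes in one gibibyte


def free_space_alert(free_bytes, min_free_gib=10):
    """Return an alert text if free space is below the limit, else None."""
    free_gib = free_bytes / GIB
    if free_gib < min_free_gib:
        return f"Low disk space: {free_gib:.1f} GiB free (limit {min_free_gib} GiB)"
    return None


def check_drive(path, min_free_gib=10):
    """Read the free space of the drive holding `path` and apply the limit."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return free_space_alert(usage.free, min_free_gib)
```

A scheduled task could call check_drive() for each production drive and pass any returned text to whatever alerting channel is already in use.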

SNMP traps from alarm server: The Alarm Server, included in the BCM software package, can be configured to send SNMP traps. For example, the Alarm Server can send alerts to third-party software, which can then send emails, SMS messages or SNMP alerts to notify you of any failure. Other customers have implemented a “ping check” that constantly polls the virtual unit’s IP address to see if it is reachable.

SNMP (Simple Network Management Protocol) is part of the Internet Protocol Suite as defined by the Internet Engineering Task Force (IETF). SNMP is used in network management systems to monitor network-attached devices for conditions that warrant administrative attention.

Several SNMP management systems exist on the market and there is no particular preference for any of them.

Basically SNMP defines the following message types:

- SNMP GET REQUEST
- SNMP SET REQUEST
- SNMP TRAP

SAP BCM sends only SNMP traps. SNMP GET and SET messages can be used for managing other SAP BCM infrastructure devices supporting SNMP.

SNMP management systems typically include sophisticated functions, such as the ability to:

- Filter the most relevant messages.
- Send e-mail and SMS alerts initiated by events in the monitored system.
- Log events for further investigation.
- Export event data to files, sheets or syslog servers for reporting purposes.

Network traffic monitoring: BCM software relies on stable network connections, so it is vital to have network traffic monitoring in place. It can raise an alarm if network connectivity between the sites encounters severe delays or connection breaks. This is particularly important when there are two or more active sites.

Daily manual check

IA – Open IA and check that the processes are running, and that they are running on the correct server. Running IA on a large display visible in the office makes it easy to monitor the current status of all virtual units.

Windows Event Viewer

Weekly manual check

SQL jobs – There are SQL jobs in the BCM system which are scheduled to run on a regular basis. These include the job that creates the reports.

BCM Logs – Check that no errors appear (see later section). After making any major configuration changes, it is recommended to check the logs for the related components immediately after the change, and also the following day, in addition to normal testing after the change.

Reports – Check that the BCM reports are created correctly, and that the Monitoring application displays data.

Monthly manual check

Windows/SQL – Checking for the latest Windows & SQL updates helps keep your system secure. Currently Microsoft releases updates on the second Tuesday of each month.

BCM updates – Keeping BCM updated with the latest fixes is important. Critical fixes are contained in hotfixes released when needed. Service Packs are released infrequently. They contain a complete build. It is important to keep the Service Pack level up to date, as typically hotfixes are only released for the latest Service Pack version.

Check for updated BCM documentation on SAP Service Marketplace.

Reports – Check that the BCM reports exist and contain data.

Windows Performance Monitor – log onto the server(s) to check CPU & memory usage, and available hard disk space.

On the SQL server, check the database and transaction log size, and available disk space. Compare with last month’s sizes to plan ahead.
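The month-to-month comparison can be turned into a rough forecast. The sketch below assumes linear growth, which is a simplification, and uses hypothetical figures; it only shows the arithmetic, not a BCM or SQL Server function.

```python
def months_until_full(size_last_month_gb, size_now_gb, free_now_gb):
    """Estimate how many months of free space remain at the current growth rate.

    Assumes the database keeps growing by the same amount each month.
    Returns None if the database did not grow.
    """
    growth_per_month = size_now_gb - size_last_month_gb
    if growth_per_month <= 0:
        return None
    return free_now_gb / growth_per_month


# Hypothetical figures: database grew from 40 GB to 44 GB, 20 GB still free.
remaining = months_until_full(40, 44, 20)
```

With these figures the database grows 4 GB per month, so the 20 GB of free space would last about five months.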

Remove old files from the mail server. Old mails can also be deleted from the mail server by setting an appropriate value for the parameter DeleteOldMailsAfterDays. See the System Administration Guide for more information about application parameters.

Remove old log files. Log file directories are defined during the installation. Define appropriate paths and logging levels, and regularly check the amount of collected data. The amount of data depends on the system size, configuration and the tasks it is used for.
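Finding candidates for removal can be scripted. The following Python sketch lists files older than a given number of days under a log folder; the folder path and the retention period are assumptions to be set per installation, and the sketch deliberately only lists files rather than deleting them.

```python
import os
import time

SECONDS_PER_DAY = 86400


def is_expired(mtime, now, max_age_days):
    """True if a file last modified at `mtime` is older than the retention period."""
    return mtime < now - max_age_days * SECONDS_PER_DAY


def find_old_files(folder, max_age_days, now=None):
    """Return a sorted list of files under `folder` older than `max_age_days`."""
    now = time.time() if now is None else now
    old_files = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if is_expired(os.path.getmtime(path), now, max_age_days):
                old_files.append(path)
    return sorted(old_files)
```

Reviewing the returned list before archiving or deleting anything keeps the clean-up safe while the retention period is still being tuned.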

General good practices

Maintenance: Scheduled daily backups of at least CEM, CPM, VWU & WDU databases.

BCM training – it is important that there are key staff who are trained on BCM. Whenever new versions are released, it is a good idea to attend a partner training session to keep up to date with the BCM software. Having a minimum of two BCM trained experts per system is a good practice.

Document any arising issues – If issues are recorded in one place, related issues can be linked, for example several agents reporting the same problem, or calls from a certain gateway having audio issues.

Good communication between teams. Having good communication between IT, telephony, and users is essential so that any issues are reported in a timely manner, and that the correct people are informed.

LOGS IN BCM 6.0

Check that there are no errors in the logs (for example the CEM log). Normally this is a manual process, which could be done for example on a weekly basis. Using WV to highlight “ERR>|error|WRN>|EXC>” makes this easier, as does a batch file with the command:

findstr /r /s "ERR> error WRN> EXC>" "[VU_logs_folder]\*.*"

The “log grepping” could be automated with third-party applications. We are not aware of what monitoring tools other partners use to examine the logs themselves, and at the moment we do not have ready templates. Creating templates that best suit each installation is a learning process, so expect to refine what is searched for as false positives occur.
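For sites that prefer a script to a third-party tool, the same search can be sketched in Python. This is only an illustration: the pattern is the highlight expression quoted above, while the function names and the UTF-8 file encoding are assumptions.

```python
import os
import re

# Same markers as the highlight expression above (case-sensitive).
PATTERN = re.compile(r"ERR>|error|WRN>|EXC>")


def matching_lines(lines):
    """Yield (line_number, text) for every line containing one of the markers."""
    for number, text in enumerate(lines, start=1):
        if PATTERN.search(text):
            yield number, text.rstrip("\n")


def scan_folder(folder):
    """Walk a log folder and return {file path: [(line_number, text), ...]}."""
    results = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            # Encoding is an assumption; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as handle:
                hits = list(matching_lines(handle))
            if hits:
                results[path] = hits
    return results
```

Calling scan_folder() with the virtual unit's log folder returns only the files that contain at least one marker, which keeps a weekly review short.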

Generally: ERR>, EXC>, WRN>
CEM: ERR>, EXC>, WRN>, Failed
Call Dispatcher: ERR>, EXC>, WRN>
Media Routing Server: ERR>, EXC>, WRN>, Terminating WCFPCore
Federation Bridge: ERR>, EXC>, WRN>
WEB: error, odbc error (ignore case)
Connection Server: fail (ignore case; this will include words such as Fail, Failure, Failed)
H.323 bridge: EXC>
High Availability Controller: Inactive, Inactivating, "no such process"

Data Collector: ERR>, "Failed to write events into database transaction buffer" (This would mean that either the place where Data Collector is running is out of space, or it has no connection to the database and its buffer has run out.)

SIP bridge: (below are actual errors reported in SIP Bridge logs)

Received command to unknown session
Message handler failed
Couldn't parse message from
Caught unknown exception while handling message from
Unknown protocol
Socket create error
Socket bind error
getsockopt error
Socket WSAAsyncSelect error
Socket recv error
ReceiveAsync error:
OnConnectFailure:
OnReceive error:
failed with error
failed. Reason=
DnsQuery failed getting SRV record for
failed. Error
send error
socket listen error
socket accept error
socket select error
disconnected with code
disconnected on receive timeout
can't send - not connected yet
SetSecurityOptions() failed.
Expired transaction

WRN> entries are not usually critical, but can be included just to see how many appear.

We would advise just monitoring these to start with, to see how many occurrences they produce. Prior to implementing these in the tool, it would be good to manually check, for example, your last hour’s or day’s worth of logs, in case these are already present. For example, an email channel with a wrong username can cause errors and exceptions.
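To get a baseline of how many occurrences each search string produces, a count per component can be sketched as follows. The search strings are copied from the list above (only a subset is shown); the dictionary layout and function name are assumptions made for this illustration.

```python
from collections import Counter

# Search strings per component, copied from the list above (subset only).
COMPONENT_PATTERNS = {
    "CEM": ["ERR>", "EXC>", "WRN>", "Failed"],
    "Media Routing Server": ["ERR>", "EXC>", "WRN>", "Terminating WCFPCore"],
    "H.323 bridge": ["EXC>"],
    "Connection Server": ["fail"],
    "WEB": ["error", "odbc error"],
}

# Components whose strings are marked "ignore case" in the list above.
CASE_INSENSITIVE = {"Connection Server", "WEB"}


def count_occurrences(component, lines):
    """Count how many lines contain each search string of the component."""
    ignore_case = component in CASE_INSENSITIVE
    counts = Counter()
    for line in lines:
        haystack = line.lower() if ignore_case else line
        for pattern in COMPONENT_PATTERNS[component]:
            needle = pattern.lower() if ignore_case else pattern
            if needle in haystack:
                counts[pattern] += 1
    return counts
```

Running this over the last hour's or day's logs before enabling alerts shows which strings already occur during normal operation.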

Detailed CEM errors for timeouts between CoS & CDT (i.e. the server and the agent’s workstation)

1) These lines occur when there are no messages from the client for about 30 seconds. If there are no other messages coming, the client should send at least a keep-alive message. The client is set to “paused” so no calls are sent. At this point, the CDT might show “Opening connection”.

08:53:44.126 3284 ERR> LIBIPCServer : REC_Timeout : Timeout for received messages (796B6A0E92E67148907C40A5FE3076AF) : ('i109',)

2) If the connection recovers between 30 seconds and 2 minutes, “Opening connection” disappears:

08:31:47.063 3284 ERR> LIBIPCServer : REC_Timeout : Timeout for received messages (8C32315692BD8F4A871C68E42F7B60A3) : ('i109',)

08:32:10.352 7616 REC> _CHANNEL_RECOVERY_ = {'_EVT': '_CHANNEL_RECOVERY_', '_ID': 'i109', '_SAP_ID': 'CONTROL', '_REP_ID': 'i109'}

3) If the connection between CoS & client is still not reconnected after 2 minutes, the CEM stops the current session, from its point of view. Note that the “Closing login session” line contains the affected extension number and user (here '1234' and 'John Smith'):

08:55:06.826 4420 INF> LIBIPCServer.OnLostChannel : ('796B6A0E92E67148907C40A5FE3076AF', 'i109', '796B6A0E92E67148907C40A5FE3076AF')

08:55:44.448 1976 INF> Closing login session due to timeout : ('1234', 'John Smith')

08:55:44.448 1976 INF> LIBIPCServer.CloseChannel : ('796B6A0E92E67148907C40A5FE3076AF', 'i109', '796B6A0E92E67148907C40A5FE3076AF')

4) If there is a network break of over two minutes, from the user’s point of view the CDT will just show “Opening connection” until the connection is recovered, and then automatically recover.

08:25:31.438 6248 INF> {'CallType': 'In', 'ANumber': '987654321', '_SAP_ID': 'CALL_CONTROL', '_CMD': '_UNKNOWN_REP', 'BNumber': '1234', '_EVT': 'Disconnect', 'CALL_ID': 'CI_101L', '_MGR': 'UI_MGR', '_CLS': 'UI', '_REP_ID': 'REP_20324', '_ID': 'BRW_20324', 'DATA': 'ERROR'} : ('REJ>',)

Possible causes for these timeouts in the logs:

- CDT/atl is somehow terminated so that it does not log out properly.
- The workstation is so busy that it does not send the keep-alive messages in time. Then the session is probably restored within 2 minutes.
- Workstation restart: a new session should be imminent. In this case at least number 1234 did not reconnect (to CEM, at least), so this might be a network problem.
- Network problems. Then the session might be restored within 2 minutes.
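The three stages described in steps 1 to 3 (timeout, recovery, closed session) can be recognised in a CEM log with a few regular expressions. The sketch below is based only on the sample lines quoted above, with each entry on a single line; the expressions and the function name are assumptions and may need adjusting to real logs.

```python
import re

# Patterns derived from the sample CEM log lines above (assumed to be stable).
TIMEOUT = re.compile(r"ERR> LIBIPCServer : REC_Timeout .*\('(?P<channel>[^']+)',\)")
RECOVERY = re.compile(r"REC> _CHANNEL_RECOVERY_ .*'_ID':\s*'(?P<channel>[^']+)'")
CLOSED = re.compile(
    r"INF> Closing login session due to timeout : "
    r"\('(?P<extension>[^']+)', '(?P<user>[^']+)'\)"
)


def classify(line):
    """Return (event, details) for a CEM log line, or None if it is unrelated."""
    match = TIMEOUT.search(line)
    if match:
        return "timeout", {"channel": match.group("channel")}
    match = RECOVERY.search(line)
    if match:
        return "recovered", {"channel": match.group("channel")}
    match = CLOSED.search(line)
    if match:
        return "session_closed", match.groupdict()
    return None
```

Counting "timeout" events that are never followed by "recovered" for the same channel gives a quick view of how many agents actually lost their session.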

LOGS IN BCM 7.0

(initial version)

In BCM 7.0, logs can be configured in a versatile way to increase relevancy and avoid collecting unnecessary files. For example, it is possible to set the logging level of a location or category to the debug level while other parts of the code are logged at error level only. In addition to the BCM legacy log file format, two new formats are added: SAP Generic Log File (GLF) and List Log. The GLF format enables using SAP Log Viewer and SAP Solution Manager for analyzing and managing logs.

Error types

Almost all modules will use the above logging mechanism, file formats and configuration style (including CEM, MRS, CD etc. since 7.0 SP1).

Thus, all errors in BCM-formatted logs are marked ERR> (EXC> should disappear and ERR> is printed instead). If you choose GLF formatting, the marker is error. Warnings are marked WRN> in BCM format and warning in GLF.

Features

Configure the logging function via the Windows registry with a value name starting with Log followed by attributes delimited with periods, such as Log<ObjectType>.<ObjectIdentifier>.<Attribute>.

Modules collect information in log files that are named with the syntax:

<module-name>_<computer-name>_<virtual-unit-name>_yyyymmdd[_hh][_nnn].log

Log files are collected in the folder that is defined during installation as the Log File Directory installation variable of the Virtual Unit. The default value is $VU_HOME$\logs.

Logging Categories and Locations

Using logging categories means that, for example, database administrators only receive database-related log files, and network administrators follow the network logs.
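The log file naming syntax given under Features can be split into its parts with a regular expression, for example to group files by module or date before archiving. This is a sketch only: it assumes that module and computer names contain no underscores, which the syntax itself does not guarantee, and the file names used with it are hypothetical.

```python
import re

# <module-name>_<computer-name>_<virtual-unit-name>_yyyymmdd[_hh][_nnn].log
# Assumption: module and computer names do not contain underscores.
LOG_NAME = re.compile(
    r"(?P<module>[^_]+)_(?P<computer>[^_]+)_(?P<virtual_unit>.+?)"
    r"_(?P<date>\d{8})(?:_(?P<hour>\d{2}))?(?:_(?P<sequence>\d{3}))?\.log"
)


def parse_log_name(file_name):
    """Return the name parts as a dict, or None if the name does not match."""
    match = LOG_NAME.fullmatch(file_name)
    return match.groupdict() if match else None
```

Optional parts that are absent from a file name are returned as None, so callers can tell a daily file from an hourly or numbered one.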

Log Levels

Level – Log Writing Includes

always – Messages that should always be printed, such as process startup notifications.

error – An unrecoverable error has occurred. Some or all functions of the module are inoperable or perform incorrectly. Often indicates malfunctioning hardware, major misconfiguration, or a bug in the product.

warning – Warning messages indicate problems that are somehow recoverable or are in less important functions. For example, minor misconfigurations, performance warnings, temporary unavailability of a service.

info – Informative messages about business level or service level objects.

trace – Messages to trace the execution of the code in more detail. Mostly contains information useful for developers only, showing how execution of the code occurred.

debug – Most detailed messages about the internal execution logic of the code. Mostly contains information useful for developers only. Often used for tracing rare issues or bugs in the product.

The global log level is the root level and cannot be set to inherit. Log targets (log files, console etc.) and logging modules (locations and categories) are by default set to inherit the global log level, but they can also be assigned their own level setting. Some logging modules may be a child module of another logging module; the parent filter is then the level filter of the parent logging module. Threads are by default set to inherit, so that they follow the other effective level filters, but they may also be given their own thread-local log level filter setting. By combining the above level filters and settings, one may control which specific parts of the code, threads or types of events produce log entries.

Example

A server module that prints debug-level log to file and error-level log to console. Also, entries from the LibIpc code location are narrowed down to error level. The registry of the module (the module's own registry key) would have the following registry value names and value data (shown as ValueName=ValueData):

LogLevel=debug
LogConsoleLevel=error
LogModule.LibIpc.Level=error
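The inheritance in this example can be modelled in a few lines of Python to reason about which entries end up where. The model is an assumption, not BCM's implementation: it orders the levels as in the Log Levels list, treats a missing setting as "inherit the global LogLevel", and assumes an entry is written only if it passes both the module filter and the target filter. The module name "Database" in the usage lines is hypothetical.

```python
# Levels from least to most verbose, in the order of the Log Levels list (assumed).
LEVELS = ["always", "error", "warning", "info", "trace", "debug"]

# Registry values from the example above.
SETTINGS = {
    "LogLevel": "debug",
    "LogConsoleLevel": "error",
    "LogModule.LibIpc.Level": "error",
}


def passes(message_level, filter_level):
    """A message passes a filter if it is no more verbose than the filter level."""
    return LEVELS.index(message_level) <= LEVELS.index(filter_level)


def effective_level(settings, value_name):
    """Use the explicit setting if present, otherwise inherit the global level."""
    if value_name is None:
        return settings["LogLevel"]
    return settings.get(value_name, settings["LogLevel"])


def is_written(settings, message_level, module, target_value_name=None):
    """Assumed rule: the entry must pass the module filter and the target filter."""
    module_level = effective_level(settings, f"LogModule.{module}.Level")
    target_level = effective_level(settings, target_value_name)
    return passes(message_level, module_level) and passes(message_level, target_level)


# The file target has no own setting here, so it inherits the global level.
libipc_debug_to_file = is_written(SETTINGS, "debug", "LibIpc")
other_debug_to_file = is_written(SETTINGS, "debug", "Database")
other_info_to_console = is_written(SETTINGS, "info", "Database", "LogConsoleLevel")
```

Under these assumptions, debug entries from LibIpc are suppressed everywhere, debug entries from other code reach the file, and only error-level (or always) entries reach the console.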

SOLUTIONS IN USE BY CUSTOMERS

One large German customer uses the WhatsUp program, although this just collects SNMP traps and monitors active virtual IPs.

One Swedish customer uses an application called Pingplotter (www.pingplotter.com) that pings the VUs every 10 seconds. It sends an email according to different rules that have been set up. They also use Pingplotter for monitoring (pinging) Internet, servers and WAN servers. By using it on the WAN (or LAN) you can see historical graphs of ping times and jitter.
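The kind of rule such a ping monitor applies can be sketched as follows. This is not how Pingplotter works internally; it assumes round-trip times have already been collected by some probe, defines jitter as the mean absolute difference between consecutive round-trip times (one common simple definition), and uses hypothetical alert limits.

```python
def ping_stats(rtts_ms):
    """Summarise probe results; each entry is a round-trip time in ms or None if lost."""
    received = [rtt for rtt in rtts_ms if rtt is not None]
    lost = len(rtts_ms) - len(received)
    loss_pct = 100.0 * lost / len(rtts_ms) if rtts_ms else 0.0
    avg_ms = sum(received) / len(received) if received else None
    # Jitter: mean absolute difference between consecutive received samples.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = sum(diffs) / len(diffs) if diffs else 0.0
    return {"loss_pct": loss_pct, "avg_ms": avg_ms, "jitter_ms": jitter_ms}


def should_alert(stats, max_loss_pct=20.0, max_avg_ms=150.0):
    """Hypothetical rule: alert on high loss, no replies at all, or slow replies."""
    if stats["avg_ms"] is None:
        return True
    return stats["loss_pct"] > max_loss_pct or stats["avg_ms"] > max_avg_ms
```

Evaluating the rule over a window of samples, rather than on a single lost ping, avoids alerts for one-off packet loss.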

Solutions from BaseN & Noval have been used for monitoring the network infrastructure and to spot possible problems with connections and/or quality. They can also simulate RTP traffic and, if necessary, even monitor the connections from certain sites/workstations.

One Finnish partner has been using test robots (built by Goodsign, partly based on our previous WSAM) which use the ClientCore phone for making test calls from remote sites every x minutes and at the same time test logon/logoff/simple query over HTTP. They also measure the time taken, so that if delays suddenly grow they can alert someone to have a look even before “everything explodes”. With the test calls they can validate “the whole chain”, i.e. a call from a softphone through a gateway to a service pool or voicemail box.

One Finnish partner uses an application called CastleRock. Two screenshots below show examples of the interface & display.

CastleRock SNMP network view:

CastleRock SNMP SAP BCM servers view:

Call Robot

One option would be to use Call Robots to check service availability as experienced by phone users, e.g. by utilizing the ClientCore interface of BCM. The robot could make automated calls with a given pattern and confirm that the system responds as expected. If any anomalies were found, the robot could generate an alert to notify the administrators.

In a more advanced scenario the robot could, in addition to calling, make simple test queries to the web server to confirm that the web sites and databases respond without unexpected delays.

The design of the automated robots can vary depending on what you want to monitor and measure. An important factor is not to create too much load with the testing.
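The timing and alerting part of such a robot can be sketched independently of how the test call itself is made. In the Python sketch below, probe is a hypothetical callable standing in for one test step (a test call, a logon, a web query); nothing here uses the ClientCore interface, and the time limit is an assumption.

```python
import time


def run_check(probe, max_seconds):
    """Run one test step and report success, duration and whether to alert.

    `probe` is any callable returning a truthy value on success; an
    exception counts as a failure.
    """
    start = time.perf_counter()
    try:
        ok = bool(probe())
    except Exception:
        ok = False
    elapsed = time.perf_counter() - start
    return {
        "ok": ok,
        "seconds": elapsed,
        "alert": (not ok) or elapsed > max_seconds,
    }
```

Running the checks every few minutes, rather than continuously, keeps the extra load on the system small while still catching growing delays early.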

© Copyright 2012 SAP AG. All rights reserved.
