

How to…

    Analyze Performance Problems

    SAP ENTERPRISE PORTAL 6.0

    PUBLIC

    BEST PRACTICES TO ISOLATE CAUSES FOR SUBOPTIMAL PERFORMANCE

    ASAP “How to…” Paper

    Applicable Releases: EP 6.0 SP2 June 2004

    ©2004 SAP AG 1

Table of Contents

1 INTRODUCTION
1.1 Structure of the Document and Roadmap
1.2 Prerequisites
1.3 Related Documents

2 HIGH-LEVEL (USER-LEVEL) ANALYSIS
2.1 Starting Point
2.2 Experience and Document the Problem
2.3 Check for Logged Problems
2.4 Reduce the Complexity
2.4.1 Connect Directly to the Portal
2.4.2 Portal Server Configuration
2.4.3 Client Location
2.4.4 Load Balancing
2.4.5 Client Configuration
2.5 What Content and Activity is Affected?
2.5.1 Logon Procedure
2.5.2 Analyze the Content of a Portal Page
2.5.3 Analyze Expensive iViews

3 MONITORING SETUP
3.1 Checklist
3.2 OS-Independent
3.2.1 Java VM Garbage Collection Trace
3.2.2 Enable Creation of Full Thread Dumps
3.2.3 J2EE Engine 6.20 Monitoring
3.2.4 Monitor Response Times and Throughput per URL
3.2.5 SAP Enterprise Portal Monitoring
3.3 Windows-Specific
3.3.1 Collection of Configuration Data
3.3.2 CPU and Memory Performance Counters
3.3.3 Network Performance Counters
3.3.4 Disk Performance Counters
3.4 Solaris-Specific
3.4.1 Collect Configuration Data
3.4.2 System Level CPU and Memory Performance Counters
3.4.3 Per-Process CPU and Memory Performance Counters
3.4.4 Network Performance Counters
3.4.5 Disk Performance Counters
3.5 HP-UX Specific
3.5.1 Collect Configuration Data
3.5.2 CPU and Memory Performance Counters
3.5.3 Network Performance Counters
3.5.4 Disk Performance Counters
3.6 AIX-Specific
3.6.1 Collect Configuration Data
3.6.2 CPU and Memory Performance Counters
3.6.3 Network Performance Counters
3.6.4 Disk Performance Counters

4 PORTAL HOST ANALYSIS
4.1 Snapshot Monitoring
4.1.1 Windows
4.1.2 Unix
4.2 Analysis Worksheets
4.2.1 Healthy System Worksheet
4.2.2 Operating System Memory Worksheet
4.2.3 CPU Worksheet
4.2.4 Network Worksheet
4.2.5 Disk Worksheet
4.2.6 Aging System Worksheet

5 PORTAL WORKLOAD AND HOTSPOT ANALYSIS
5.1 Portal Workload Analysis
5.1.1 JARM Basics
5.1.2 Further Details on the Portal Runtime Request Cycle
5.1.3 Overview of Portal Monitoring iViews
5.1.4 Identifying Expensive Content with the Component Overview
5.1.5 Request Overview: Most Expensive Request Executions
5.1.6 Snapshot Monitoring with the Thread Overview
5.1.7 Active Users
5.1.8 Workload Distribution
5.2 Single Activity Trace
5.2.1 Creation of the Trace
5.2.2 Understanding the Trace
5.2.3 Identify Performance Problems from the Trace
5.3 Portal Infrastructure Analysis with J2EE Tools
5.3.1 J2EE Monitoring
5.3.2 J2EE Detailed Analysis Tools
5.4 Specific Analysis Tools
5.4.1 Portal Content Directory (PCD)
5.4.2 User Management Engine (UME)
5.4.3 CM-Specific Tools
5.4.4 TREX Search Requests

6 JAVA VM THREAD ANALYSIS
6.1 Benefit from Full Thread Dumps
6.2 How to Create Valuable Full Thread Dumps
6.2.1 Obtaining Per-Thread CPU Usage
6.2.2 Portal Monitoring
6.2.3 Automated Creation of Thread Dumps
6.3 Thread Dump Structure
6.3.1 Thread Names and Thread Groups
6.3.2 Thread States
6.3.3 Thread Stacks
6.4 Find Deadlocks
6.4.1 VM-Detected Deadlocks
6.4.2 Deadlocks not Detected by the Java VM
6.5 Find CPU Consuming Java Threads
6.6 Summary: How to Analyze Full Thread Dumps
6.7 More Examples
6.7.1 Example: Heavy Logging
6.7.2 Example: Custom Coding
6.7.3 Example: Backend Access
6.8 Tools for Thread Dump Analysis
6.8.1 JEdit
6.8.2 ThreadDumpScan

7 JAVA VM MEMORY ANALYSIS
7.1 Understanding the Memory Model of the Java VM
7.2 Interpreting the verbose:gc Output
7.3 More GC Examples
7.4 VM Memory Worksheet
7.4.1 Narrow Down Top Memory Consumer
7.4.2 Sherlok
7.5 Understanding an OutOfMemory Error
7.6 Other Analysis Tools
7.7 Profilers

8 HTTP REQUEST ANALYSIS
8.1 Basics for HTTP Protocol Analysis
8.2 Creating an HTTP Trace
8.3 General Analysis of the HTTP Trace
8.3.1 Understand the Content of a Request
8.3.2 Response Times
8.3.3 Transmitted Bytes
8.3.4 HTTP Status Codes
8.3.5 Client-Side Caching
8.3.6 Number of Physical HTTP Connections (Keep-Alive)
8.3.7 Content Length Header Field
8.3.8 HTTP Chunks
8.3.9 Compression
8.3.10 Other
8.4 Understanding URL Patterns

9 CLIENT ANALYSIS

10 BACKEND ANALYSIS
10.1 Filesystem Accesses
10.2 LDAP Accesses
10.3 Database Accesses
10.4 RFC Access to SAP Backend Systems
10.4.2 JAVA Connectivity (JCO) Tracing
10.4.3 RFC Trace in the Backend System
10.5 ITS
10.6 Network Analysis
10.6.1 Network Diagnosis with NIPING
10.6.2 Solaris Network Configuration Check
10.6.3 Generic Network Analysis (Packet Tracing)

11 REFERENCES

12 APPENDIX
12.1 Thread-Specific CPU Usage
12.1.1 Solaris: prstat and Alternate Thread Library
12.1.2 HP-UX: glance
12.1.3 AIX

1 Introduction

This document is intended as a roadmap and hands-on guide for analyzing performance problems in SAP Enterprise Portal 6.0 SP2. The target audience includes everybody who faces the task of handling performance problems. The main target groups are SAP support teams and consultants, but also portal administrators on the customer side. The goal is to provide an overview of all the available tools that have proven to be useful, and to show how to use these tools effectively to drill down into performance problems. The document focuses on customer installations that are either live or being load-tested, where a substantial load triggers performance problems. Many of the techniques presented here can also be used for systems without any load.

This guide describes a top-down approach for addressing performance problems and for finding the reason for the poor performance of an SAP Enterprise Portal installation. Starting from the end-user perspective, the guide iteratively narrows down the problem. In each of these iterations, information (e.g. monitoring data) is collected and correlated, and further steps for analysis are defined. In each iteration, you may repeat measurements with a different setup in order to isolate specific components or to collect specific kinds of monitoring data. After each iteration, you have to decide whether your activities or suspicions point in the right direction and whether they help to narrow down the problem. In practice, you often collect information that points to the same cause of a problem. In general, the goal is to isolate the problem as clearly as possible. For this reason, we try to reduce the complexity of the landscape during the analysis, that is, to reduce the EP cluster to a single node and to bypass reverse proxies. This, however, is for analysis purposes only; the original cluster must be used once the problem has been identified.

Since the portal is modular and contains many different components (each of which can be customized and enhanced by the customer), it is essential to identify the reason for performance problems as precisely as possible. Unfortunately, it is difficult to identify bottlenecks because SAP Enterprise Portal is part of a heterogeneous software environment (containing databases, backend systems, software load balancers, etc.). The hardware configuration is also complex because the setup of an SAP portal usually consists of dozens of machines, so network problems, poor backend response times, problems with interfaces to third-party systems, etc. could also affect the performance. In a limited number of situations, this document may directly help to resolve the performance problem. However, since identifying the reason for poor performance is a mandatory step before any countermeasures can be taken, the main emphasis of this document is on helping you to identify the problem.

    This document cannot provide a complete collection of all possible tools and procedures for analyzing performance problems. Instead, it contains best practices that can be used to analyze such problems.

1.1 Structure of the Document and Roadmap

This document is structured to guide you through a top-down analysis of a performance problem. The first analysis steps require little effort and are not time-consuming. The further you proceed in the document, the greater the expertise and time required.


The above diagram gives an overview of various analysis tasks. Most of the analysis activities focus on the Portal Server (shown in the yellow box on the left-hand side of the diagram). Several of these activities are only useful if there is some workload on the system, but other analysis tasks can be performed on the client side in “single user mode”. In this case, no other workload is required on the system in order to proceed with the analysis. The white boxes are not discussed in this document.

The suggested analysis path starts with a high-level analysis of the problem (Chapter 2). These analysis steps can be completed without the help of external tools and require only very little time for an experienced portal administrator. Chapter 3 (Monitoring Setup) describes preparatory steps to initiate the collection of several kinds of monitoring data for deeper analysis. An analysis at operating system level is performed in Chapter 4. Chapter 5 uses portal and J2EE tools to analyze the overall portal workload and to identify hotspots, e.g. expensive iViews. Chapters 6 and 7 describe deeper analysis techniques at Java VM level using full thread dumps and the verbose:gc trace.
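The two Java VM traces mentioned above are cheap to enable up front. The sketch below shows how they are typically switched on for the VMs of this era; the exact option names, the startup-script mechanism, and the process ID depend on your JVM vendor, version, and installation, so treat this as an illustration only.

```shell
# Sketch: enable the garbage collection trace by adding the standard
# option to the Java VM startup parameters (the variable name JAVA_OPTS
# is an assumption; your startup scripts may differ).
JAVA_OPTS="$JAVA_OPTS -verbose:gc"

# Sketch: on Unix, a full thread dump of a running VM is requested by
# sending SIGQUIT; the dump is written to the VM's standard output.
PID=12345            # hypothetical process ID of the portal's Java VM
kill -QUIT "$PID"
```

On Windows, the usual equivalent of SIGQUIT is pressing Ctrl+Break in the console window of the VM.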

The right-hand side of the diagram lists several techniques for analyzing parts of the infrastructure other than the portal itself. Most of the techniques can be applied in “single user analysis mode”, i.e. only one user session is analyzed while the productive load may still hit the system. The analysis of the HTTP protocol stream (Chapter 8) enhances the Portal Server analysis described in Chapters 3-7. It is often advisable to perform an HTTP analysis early in the analysis. Chapters 9 and 10 list some tools that may be useful for client and backend analysis. Details will be added in a later version of this document.

    We recommend that users do not adhere too strictly to the roadmap. The roadmap is a generic recommendation that has proven to be useful in most cases. Depending on the circumstances and the information available, it may be advisable to proceed in a different order. For example, if you already have sufficient details about the problem, you may want to start directly with the Java VM analysis. Feel free to skip individual sections if you think this saves time.


1.2 Prerequisites

If you fulfill the following requirements, you will save time during the analysis, and the chance of a successful analysis will be higher:

    You should have a good understanding of the technical processes in the portal. See for example the Portal Runtime Technology Guide [2].

You should have performed a GoingLive Analysis (GA) session. The GA sessions take a different approach: a bottom-up analysis (complete configuration check) is performed during the GoingLive Analysis remote service. As a result, it delivers a report containing several configuration recommendations that target performance improvements and stable system operation. Before starting the analysis, check whether the recommendations given in the GA report have been implemented.

You should generally apply the latest patches and hotfixes provided on the SAP Service Marketplace. Otherwise, you may waste a lot of time analyzing a problem that is well known and already fixed. On the other hand, patches should be implemented carefully, starting on test systems. Furthermore, the performance problem to be analyzed may be completely unrelated to anything handled by patches. As a minimum essential requirement, check the SAP Notes corresponding to patches that have not yet been applied for any known problems.

1.3 Related Documents

The official SAP Enterprise Portal documentation [7] describes standard administration and configuration tasks. Revisit the administration guide for suggested configuration changes and standard administration activities.

    The Troubleshooting Guide [1] lists almost all the tools available for troubleshooting SAP Enterprise Portal 6, including tools that are used in this guide. For details on where to obtain and how to use the tools, please refer to the troubleshooting guide.

Once performance problems have been identified, the tuning recommendations described in “How to… Fine-Tune Performance of SAP Enterprise Portal 6.0” [3] may help to resolve them.

    Several kinds of performance problems only occur when the portal is under load. If the portal is not already live, it may help to generate a synthetic load with a load generator tool. The HowTo guide “Basic Load Testing / OpenSTA” [8] explains how to generate a load with a freely available tool. The best practice document “Enterprise Portal Volume Testing” [5] covers more generic aspects of performing load tests.
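Independent of the full-featured tools referenced above, a very crude smoke load can sometimes be scripted directly. The sketch below uses curl in a shell loop; the URL is a placeholder for your portal's entry point, and such a loop is no substitute for a proper load test with realistic sessions and think times.

```shell
#!/bin/sh
# Crude smoke load: 10 parallel clients, 100 sequential requests each.
# The URL is a placeholder - substitute your portal's entry point.
URL="http://portal.example.com/irj/portal"

for client in $(seq 1 10); do
  (
    for req in $(seq 1 100); do
      curl -s -o /dev/null "$URL"   # discard the body; we only want load
    done
  ) &                               # run each client in the background
done
wait                                # block until all clients have finished
```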

The use of CCMS-based monitoring and SAP Solution Manager monitoring is not covered in this document. Please refer to the document Enterprise Portal 6.0 SP2 System Landscape Monitoring [9].


2 High-Level (User-Level) Analysis

This chapter describes possible initial steps for analyzing the problem. Depending on the nature of the problem, some steps will already have been completed (e.g. by the customer). Other steps might not be necessary or not applicable for certain kinds of problems. All the analysis steps in this chapter are aimed at a high-level analysis without requiring special tools. The analysis is not very time-consuming and does not involve the installation of any additional tools. However, a good initial analysis will save time in the later analysis steps and immediately guide you in the right direction.

2.1 Starting Point

When you start a performance analysis, you often receive a complaint about unsatisfactory portal behavior in one of the following forms:

    portal is slow in general: every action/click has a very high response time with / without much CPU utilization

    portal gets slower the more users are in the system / does not scale, cluster scalability is not linear

response times from a periodic heartbeat increase to an unacceptable value

    volume test performance / errors / timeouts: The performance requirements for the number of users, response times, transactions per second, etc. could not be met.

    response time is not too poor, but there is a high load on the machine although there is hardly any EP traffic

    high response time across LAN / Internet / modem connections

portal freezes under very high load / after some time without much traffic; portal hangs / user session hangs

    action in the portal is slow: logon, portal startup, search, some iViews, top-level navigation

    certain iViews seem to time out

    As a result of the analysis steps in this chapter, you should verify the diagnosis and be able to precisely describe the problem to be analyzed. Be careful not to combine several problems if they can be analyzed separately.

2.2 Experience and Document the Problem

Reproduce the problem: measure the performance from an end-user perspective. At first, just use a wrist watch or the clock of your PC to measure the time.

    What activities on the portal are affected?

    Is it really a problem? Compare the performance with other SAP Enterprise Portal installations.

    Define expectations: What is the expected behavior? What is the expected or at least accepted response time? What is the observed response time? Document these numbers to have a well-defined goal. Otherwise your requirements might increase at the same pace as the improvements you implement. Under what circumstances are the response times expected (high load, single user, …)? What throughput is expected?
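The end-user measurements above can be captured with a small script rather than a wrist watch. This is a minimal sketch, assuming Python is available on a client PC; the portal URL in the comment is a hypothetical placeholder, and the numbers produced are wall-clock timings suitable for documenting a baseline and a goal:

```python
import time
import urllib.request

def measure_response_time(url, repetitions=3):
    """Fetch a URL several times and return the observed times in seconds.

    A rough end-user measurement only: it ignores browser rendering and
    client-side caching, but is sufficient to document a baseline.
    """
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as response:
            response.read()  # read the full body, as a browser would
        timings.append(time.perf_counter() - start)
    return timings

def summarize(timings):
    """Reduce the raw timings to (min, avg, max) for documentation."""
    return min(timings), sum(timings) / len(timings), max(timings)

# Hypothetical portal URL -- adjust host and port for your installation:
# print(summarize(measure_response_time("http://portalhost:50000/irj/portal")))
```

Recording min/avg/max per action gives well-defined numbers to compare against the documented expectations.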


Since when has the problem occurred? What has been changed since then? Take note of all activities, even if they appear to be absolutely unrelated to the problem. Were newly developed customer components uploaded to the portal? Were any patches or other changes applied around the time the problem first occurred? What other activities took place in that time frame (for example, installation of a new Windows Service Pack, deployment of new portal content, …)?

Example: A customer told us that the only change was switching off one (of many) Windows domain servers. At first sight, this should definitely not have caused any problems, since more than ten domain servers were left. But it turned out that one of the portals always tried to contact exactly this domain server, and only tried another one (successfully) after a timeout.

    Find out when the problem occurs. If it is not a permanent problem and cannot be safely reproduced, find out as many details as possible about the pattern of occurrence. Note down exact time stamps for occurrences of the problem. This helps to match log entries to the problem. Here are some examples of possible time observations:

    o visible permanently

    o periodically, every nth repetition

    o occasionally, not really reproducible

    o depends on some other activity in the network / on the Portal Server

    o after a certain uptime of the portal

    o at a certain time of day

    o only under high load

    o on first use (first login etc.)

    o at a specific point of time. In this case, what were the activities in this time frame?

What is the observed behavior? Only a slow response (but with a correct result), or additional symptoms like a timeout message or wrong or incomplete results?

2.3 Check for Logged Problems

Error situations often have an impact on performance. For example, if content management repositories are not configured correctly, a timeout may occur when accessing the repository, resulting in a very slow startup of the portal.

    Check the portal log files for errors and exceptions with stack traces that may indicate configuration problems, and try to resolve them. Creating and logging exceptions is expensive.

    Check the log levels of portal logging. Heavy logging (at INFO or DEBUG level) may lead to very poor performance.

    Check the file sizes in log directories (e.g. …/server/managers/log/portal/logs): Is there a large amount of log data generated while there is a load on the system? This is a clear indicator of wrong log configuration.

    Review the system error log for critical errors that may lead to poor performance (e.g. looking up the network address of some backend system fails repeatedly).

    Windows: Are there any critical events in the Windows Event Log in the time frame when the performance is bad?

    Unix: Check the system error log for messages.
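The log checks above can be partly automated. A minimal sketch, assuming Python is available on the server; the error keywords and the size threshold are illustrative choices, not values defined by the portal:

```python
import os
import re

# Illustrative error markers; extend with patterns seen in your portal logs.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|Exception)\b")

def oversized_logs(path, size_limit_mb=50):
    """Return names of log files above size_limit_mb.

    The threshold is an illustrative choice: a large amount of log data
    generated under load hints at a wrong log level configuration.
    """
    oversized = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full) and os.path.getsize(full) > size_limit_mb * 1024 * 1024:
            oversized.append(name)
    return oversized

def count_error_lines(lines):
    """Count lines that contain an error marker or exception keyword."""
    return sum(1 for line in lines if ERROR_PATTERN.search(line))
```

Pointing `oversized_logs` at the log directory and `count_error_lines` at a log file quickly shows whether heavy logging or repeated errors coincide with the slow time frame.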


2.4 Reduce the Complexity

To minimize the number of factors that may have an impact on the response time, keep the environment as simple as possible. Compare different configurations to get an idea of the impact of individual factors on performance. For example, comparing direct J2EE dispatcher access with access through all firewalls, reverse proxies, etc. shows their performance impact. An HTTP analysis (see Chapter 7) is often the adequate tool for this kind of measurement.

    2.4.1 Connect Directly to the Portal

    For performance measurements, try to connect directly to the J2EE Engine dispatcher. Bypass proxy servers, reverse proxies and Web servers, load balancers, and other network components for a performance measurement. If you observe good performance when bypassing these network components, this indicates a problem either in these components or in the interaction between the portal and network components / configuration.

    Example: Establishing a network connection for HTTP requests is expensive for one of the network components. Analysis shows that removing this component yields good response times. But further analysis later on reveals that no persistent (keep-alive) connection was used for HTTP requests, causing a high number of established connections. So the root cause was not necessarily the network component, but some portal configuration that prohibited reuse of the HTTP connection. Nonetheless, getting two measurements, one with good and one with bad response times, is a very good starting point for analysis: What is the difference between these two scenarios?

    It may also help to simplify the portal cluster. Reduce it to one J2EE application node, one state controller, and one J2EE dispatcher node. Shut down all other application nodes if possible, but be careful on a live system: Can the remaining node handle the productive load?

    2.4.2 Portal Server Configuration

    If possible, try using different configurations on the Portal Server:

    If HTTPS is used, try to reproduce the same problem with plain HTTP. If the problem persists, you can be sure that it is not related to HTTPS. Proceed with the analysis of HTTP requests since they are easier to investigate. However, you cannot be sure that the problem is directly related to HTTPS / SSL if it occurs only with HTTPS.

    Use a different authentication method. For example, if NTLM (integrated Windows) authentication is used, switch to Basic authentication. If this makes a difference you may want to skip directly to Section 8 and analyze the HTTP traffic.

    Try static documents from the J2EE Engine (e.g. from /irj/portalapps). Is the download time for static content ok?

Disable any software that is not related to the portal and stop background services that are not required:

    o heartbeat processes

    o monitoring processes (if not needed for performance analysis)

    o antivirus software (We noticed that it is not sufficient to disable Netshield; it should be completely removed from the machine)

    o any other background activity

    2.4.3 Client Location

    On what client PCs does the problem occur? Does it occur on all hosts?

only PCs at certain locations are affected: check the network of the affected group of PCs and perform a network analysis (see Chapter 10).


  • o network bandwidth?

o HTTP trace to find out the transfer volume for the actions on the portal

    o check network latency and real bandwidth with niping, ping

    o proxy configuration different? Routing table?

o Compare the network configuration of unaffected hosts with that of affected hosts. What is different?

    problem only when accessing via Internet (not if directly in intranet)?

    o Access the portal directly from SAP via Internet

o Use an HTTP trace (see Chapter 7) to check the transfer volume. Compare the transfer volume with the response time that is technically possible for the available network bandwidth.

    well-performing and badly performing hosts are on the same network, but they still behave differently.

o Check other items of the network configuration: DNS server, proxy server, Windows domain, etc.

    o Integrated Windows authentication used? Check which domain controller is really contacted (Network sniffer)

    o Compare software versions: W2K version, SP, hotfixes, IE config, Windows network configuration, proxy configuration

problem is present on all hosts except the Portal Server itself

problem is present on all hosts, including the Portal Server
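The latency checks with ping/niping can be complemented by measuring the plain TCP connect time from an affected client. A hedged sketch, assuming Python on the client; it measures only the TCP handshake, so real HTTP round trips will be slower:

```python
import socket
import time

def tcp_connect_latency(host, port, samples=5):
    """Measure TCP connect times (in milliseconds) to a host and port.

    A crude complement to ping/niping: comparing the results from an
    affected and an unaffected client isolates pure network latency
    from portal processing time.
    """
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=10):
            pass  # close immediately; only the handshake is timed
        results.append((time.perf_counter() - start) * 1000.0)
    return results

# Hypothetical usage against a portal dispatcher:
# print(tcp_connect_latency("portalhost", 50000))
```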

    2.4.4 Load Balancing

If a clustered portal installation with load balancing is used: connect directly to a Portal Server instead of using the load-balancing host name. Try each Portal Server – some might behave properly while others have performance problems. Does the problem occur in the same way when one of the load-balanced hosts is contacted directly? Contact all portal hosts directly and compare the response times. If the hosts behave differently, compare the configurations of all Portal Servers.

    2.4.5 Client Configuration

    Restart the browser and clear the browser cache. Is the problem still present?

    Check the proxy settings

    Are there Java applets on the page? It may take a few seconds to start the Java VM that runs the applet.

2.5 What Content and Activity Is Affected?

Often only a few actions on the portal are slower than expected. In this case, the performance problem must be narrowed down as far as possible to simplify the analysis. The intention here is to measure (or “feel”) the user experience in the browser and to use the browser to isolate single pages or iViews. Chapter 5 explains workload analysis for identifying expensive content in greater detail.


2.5.1 Logon Procedure

    The logon procedure is a complex process that involves several steps. It is common for several HTTP requests to be sent by the browser to complete the logon and loading of the start page. A first guess on the slow part of the logon procedure is a frame that appears delayed in the browser. If all frames but one appear almost immediately and only one is delayed, this frame should be analyzed. Try to reproduce the problem by opening a new browser window for this frame only. Copy the URL by using right-click, properties in IE.

    The logon procedure consists of the following steps:

    1. Logon + authentication (including access to the user management repository: LDAP, R/3, DB or external)

    2. Load the framework page (including top-level navigation and PCD access (loading roles))

3. Load the inner page (including detailed navigation)

    4. Load the content area (including content iViews)

    To identify the step that consumes the response time, assign a different (if possible empty) start page to the user for testing. Is the behavior different? You can go directly to a specific page by using the request parameter NavigationTarget: Use a request of the form http://server:port/irj/servlet/prt/portal/prtroot/com.sap.portal.navigation.portallauncher.default?NavigationTarget=MyPageName to temporarily change the start page for the current logon. If the logon is fast with an empty start page, the performance problem is probably caused by the content of the start page (Steps 3 or 4). Go to the next section to analyze the page content.

    If the performance for an empty start page is still slow, Steps 1 or 2 are the reason. Check the performance for repeated execution of the logon for the same user account (closing the browser completely after each execution).

    If only the first logon to the empty page is slow, access to the UM repository or PCD repository may be too time-consuming.

If repeated logons are also slow, caching may be suboptimal or the creation of the framework page too slow. Try a different user account that contains only a single role (e.g. eu_role). If this is faster, the poor performance might be due to a too-large number of roles assigned to the user.

    For further analysis, perform a user activity trace (Section 5.2).

    2.5.2 Analyze the Content of a Portal Page

    If you identified a single page that has much higher loading time than other pages, you should examine the page content and page parameters.

    Use the Portal Content Studio to check which iViews are on the page. Click the Preview button to display each iView in a separate window. If you can identify a single iView as slow, proceed to the next section “Expensive iViews”.

    To examine the iView response times, remove iViews from the page or create a copy of the original page and remove iViews from the copy. Can the problem be reproduced with the new page? If not, one reason for the difference might be the personalization of the original page. Check the personalization data of the original page. Server-side caching (shared cache) and changing the isolation method might also help.

    For a systematic breakdown of the page content, use the single activity trace (Section 5.2) to find out the processing time for each iView. You can also use the HTTP trace to identify time-consuming requests for iViews with load method “isolated” (separate HTTP request per iView).



2.5.3 Analyze Expensive iViews

    If a single iView is identified as the source of poor performance, use the preview iView button in the Portal Content Studio to display the iView in a separate browser window. Find out the URL to the iView using the Browser menu. This allows you to invoke the iView directly.

For testing purposes, change the load method of the iView in question to URL mode. Does this change anything? For URL iViews: load the target URL directly in the browser.

    To confirm your analysis, create a new iView with the same properties as the iView in question. Does the same behavior occur?

    As a shortcut, check if server-side caching (shared cache) can be activated for the iView.

    Further analysis of the iView depends on the iView functionality. General tools covered in this guide are user activity tracing (Section 5.2) and full thread dumps (if the system is under load, Section 6). Further analysis techniques are available for specific functionality:

CM iViews and custom iViews with CM repository access: CM cache monitor (Section 5.4.3)

iViews accessing a backend system like R/3 via RFC / JCo: backend analysis (Section 10.4)

iViews accessing other backend systems: backend analysis (Chapter 10)

Custom iViews: a code review or profiling might be necessary

Content from the Internet or from external applications displayed slowly? Check the URLs and consider server-side caching.


3 Monitoring Setup

If the details of a performance problem are not yet known, it is advisable to activate the collection of monitoring data with several tools. The steps to activate the recommended set of monitoring tools are described in this section.

All operating systems maintain counters to keep track of critical system resources. Reports based on these counters indicate how the major subsystems are performing. SAP Enterprise Portal, the J2EE Engine 6.20, and the Java VM also maintain performance information for important resources in their own scope. These two types of counters form the monitoring setup described here.

    In this guide, we focus on operating system-based tools since these are always available. SAP also offers monitoring integration into the Solution Manager. The setup of enterprise portal monitoring for the Solution Manager is described in a separate document (see [9]).

The setup of the monitoring infrastructure is exemplified for all platforms supported by EP 6.0 SP2. Only common OS tools are described, as well as monitoring tools provided by the portal and the J2EE Engine. However, experienced users may also use tools of their own choice as long as those tools collect the data described in this section. Where available on the customer side, commercial or freeware tools for automated system performance analysis can be used as well.

    In general, the monitoring setup should not only be established for the portal machines, but – as far as applicable – also for all kinds of backend systems like database machines, SAP WAS-based backend systems, custom storage systems, customer specific backend systems, etc.

This section is split into one part that describes operating-system-independent steps to activate monitoring and one subsection per supported operating system that covers the collection of configuration and monitoring data specific to that operating system.

All important counters defined in this chapter can be collected automatically in a productive customer environment with no risk and minimal performance degradation. We recommend performing regular system monitoring by applying this monitoring setup even if the system is currently working fine; the collected data then serves as a baseline for comparison. We also recommend that customers keep the monitoring setup activated as long as possible.

    3.1 Checklist

    As a quick reference, check that the following monitoring tools are in place. They are explained in detail in the following sections of this chapter:

    Collect performance data on operating system level (CPU consumption etc.)

    Collect configuration data on operating system level (CPU power, memory etc.)

    Activate -verbose:gc Java VM option

    Enable creation of full thread dumps for J2EE Engine nodes

    Start J2EE monitoring

    Make sure that portal monitoring is active (is active by default).

    Activate the extended HTTP log to record response time and response size


3.2 OS-Independent

The following monitoring tools are available on all supported operating systems.

    3.2.1 Java VM Garbage Collection Trace

The Java VM can be instructed to write a garbage collection trace. Add –verbose:gc as a Java VM parameter in the startup script (e.g. go or go.bat) of each application node, state controller, and dispatcher node. This parameter works with all Java VM versions supported by EP6 SP2. The location of the VM parameter settings depends on the mode you use to start the J2EE nodes.
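The resulting trace can be evaluated offline. The sketch below assumes the classic HotSpot 1.3/1.4 line format shown in the comment; the exact format varies by VM vendor and version, so the regular expression may need adjustment:

```python
import re

# Matches classic HotSpot -verbose:gc lines such as
#   [GC 8192K->4096K(65536K), 0.0123456 secs]
#   [Full GC 60000K->30000K(65536K), 1.2345678 secs]
GC_LINE = re.compile(
    r"\[(Full GC|GC)\s+(\d+)K->(\d+)K\((\d+)K\),\s+([\d.]+)\s+secs\]"
)

def parse_gc_line(line):
    """Return (kind, before_kb, after_kb, heap_kb, pause_secs) or None."""
    match = GC_LINE.search(line)
    if not match:
        return None
    kind, before, after, heap, secs = match.groups()
    return kind, int(before), int(after), int(heap), float(secs)

def total_pause_time(lines):
    """Sum the GC pause times: a rough measure of time lost to collection."""
    parsed = (parse_gc_line(line) for line in lines)
    return sum(p[4] for p in parsed if p)
```

Frequent Full GC lines or a large total pause time point to heap sizing or memory problems that warrant deeper analysis.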

    3.2.2 Enable Creation of Full Thread Dumps

    Depending on the startup mode of the J2EE Engine nodes, preparatory steps may be necessary to trigger full thread dumps and to capture the VM output that contains these dumps. See the Troubleshooting guide [1] for details on the setup.

    3.2.3 J2EE Engine 6.20 Monitoring

The J2EE Engine Monitor Server provides measurements for the J2EE Engine runtime environment and can be included in the monitoring setup defined here because of its negligible performance impact. SAP Note 498179 explains several alternatives for starting the Monitor Server automatically on system start, e.g. running the Monitor Server as a Windows service.

A single Monitor Server is sufficient for the whole cluster. The Monitor Server can be located on any machine with a fast network connection to the machines on which the J2EE Engine cluster nodes are installed. The Monitor Server automatically registers or unregisters cluster nodes that have just started or dropped out.

SAP Enterprise Portal is delivered to customers either as a standalone product or accompanied by a Solution Manager system. If a CCMS system is part of the infrastructure on the customer side (i.e. the customer maintains an R/3 system), make sure to configure the Monitor Server to report to it (see [9] and SAP Note 498179). The Monitor Server architecture allows it to report simultaneously to more than one destination with no additional overhead, so it can report to CCMS and to the file system at the same time. Both the file system reporter and the CCMS reporter are configured in the /tools/monitorServer.properties file.

Property                     | Default Value   | Setup Recommendation
monitor.system.FS.mode       | Off             | On
monitor.system.FS.fileFormat | Html            | Xls
monitor.system.FS.filesize   | 64              | 2048
monitor.system.FS.path       | monitoring_html | monitoring_xls
monitor.report.gap           | 2000            | 120000
monitor.system.CCMS.mode     | Off             | On (if CCMS connected) / Off (if no CCMS available)

The default values are convenient for a setup of one dispatcher and one server node, because the reported data is visible in a browser in a compact view. For bigger clusters, the data needs to be saved in XLS format and then sorted and analyzed with MS Excel or a similar tool that can read tab-separated entries. The data collection interval is 2 seconds by default – too frequent for productive systems. For a 24-hour run, a granularity of 2 minutes (120 000 milliseconds) is sufficient.
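Instead of Excel, any tool that reads tab-separated entries can evaluate the report. A minimal sketch, assuming Python; the column names in the report depend on the configured counters and are treated as opaque strings here:

```python
import csv

def read_monitor_report(lines):
    """Parse tab-separated Monitor Server report lines into row dicts.

    Column names are taken from the report header; they depend on the
    configured counters, so they are not hard-coded here.
    """
    return list(csv.DictReader(lines, delimiter="\t"))

def sort_by_column(rows, column, descending=True):
    """Sort rows numerically by one column, e.g. a response-time counter."""
    return sorted(rows, key=lambda r: float(r[column]), reverse=descending)

# Hypothetical usage with a report file from the configured FS path:
# rows = read_monitor_report(open("monitoring_xls/report.xls"))
```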


Never use the Visual Administrator to look at the data collected by the Monitor Server, although this is possible and convenient – the overhead of the Visual Administrator when it is connected to the server is too high.

    3.2.4 Monitor Response Times and Throughput per URL

The best and easiest way to narrow down a slow response time to a single slow URL is to use the http service logs on the server nodes. At log level “INFO”, the http service writes information about the URL that was accessed, together with the response code, body length, and response time for this URL. The log is located in the file

    /cluster/server/services/http/log/INFO.log.

The option to write the response times to the log is activated in the file server/services/http/properties. Use LogRequestTime=enableall (and EnableLoging=true) to record the response time for each URL that was called. A sample log line looks like this:

    2004-07-05 14:10:58 | http | INFO | | 127.0.0.1 | 127.0.0.1 - - [05/Jul/2004:14:10:58 +0100] "GET /irj/servlet/prt/portal/prtroot/com.sap.portal.navigation.portallauncher.default" 200 2878 [5318] |

The number in square brackets [] is the time in milliseconds spent on the request on the server node. The number immediately before the execution time is the number of bytes sent to the dispatcher node as the body of the reply.
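Lines of this form can be aggregated per URL to find the slowest requests. A sketch based on the sample line above; the regular expression assumes the field layout shown and may need adjustment for other log formats:

```python
import re
from collections import defaultdict

# Extracts method, URL, response code, body bytes, and server time (ms)
# from extended http service log lines like the sample above.
REQUEST = re.compile(r'"(\S+)\s+(\S+)[^"]*"\s+(\d+)\s+(\d+)\s+\[(\d+)\]')

def parse_http_log(lines):
    """Aggregate request count, body bytes, and server milliseconds per URL."""
    stats = defaultdict(lambda: {"count": 0, "bytes": 0, "ms": 0})
    for line in lines:
        match = REQUEST.search(line)
        if not match:
            continue
        _method, url, _code, size, ms = match.groups()
        entry = stats[url]
        entry["count"] += 1
        entry["bytes"] += int(size)
        entry["ms"] += int(ms)
    return dict(stats)

def slowest_urls(stats, top=10):
    """URLs ordered by average server time, slowest first."""
    return sorted(stats, key=lambda u: stats[u]["ms"] / stats[u]["count"],
                  reverse=True)[:top]
```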

    Evaluation of the log is covered in Chapter 7 (HTTP Analysis).

    3.2.5 SAP Enterprise Portal Monitoring

SAP Enterprise Portal monitoring must be active for problem analysis. You can check this by navigating to System Administration → Monitoring → Portal → Component Overview. If the displayed table is not empty, monitoring is active and no further steps are necessary.

If monitoring is disabled, reactivate it: navigate to System Administration → System Configuration → Monitoring Configuration and select the checkbox Collect monitoring data.

3.3 Windows-Specific

Windows provides a powerful infrastructure for monitoring the operating system and applications over long periods. The Windows performance monitor tool (perfmon.exe) uses this infrastructure to collect performance counters. Configure the perfmon tool to record at least all recommended counters in log files. Activate logging under ‘Performance Logs and Alerts -> Counter Logs’. We recommend logging the counters in text format (either CSV or TSV), since this simplifies the evaluation of the results in a spreadsheet application. All measurements refer mainly to portal-related processes.

In performance monitor terminology, counters are logically grouped by performance objects (for example ‘Processor’ or ‘Thread’) and can be correlated to object instances (for example, all counters of the performance object Process can be assigned to one or more dedicated OS processes, such as java or cmd). It is recommended to store all counters in the same log file for two reasons: only one file needs to be maintained, and when visualized in the performance monitor, the data of the different counters is correlated automatically for the measured time slice.
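A counter log saved in CSV format can also be evaluated outside a spreadsheet. A minimal sketch, assuming Python; perfmon CSV files carry the timestamp in the first column and one column per counter, with the counter path in the header:

```python
import csv

def load_perfmon_csv(path):
    """Load a perfmon counter log saved in CSV format.

    Returns (header, samples): the header row with counter paths and the
    remaining rows with one sample per logging interval.
    """
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

def average_counter(header, samples, column_substring):
    """Average all samples of the first counter column whose header
    contains column_substring, e.g. '% Processor Time'.

    Empty samples (perfmon writes blanks for missed intervals) are skipped.
    """
    idx = next(i for i, name in enumerate(header) if column_substring in name)
    values = [float(row[idx]) for row in samples if row[idx].strip()]
    return sum(values) / len(values)
```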

    3.3.1 Collection of Configuration Data

We recommend using the application msinfo32.exe to get an overview of the system configuration. The second invocation below collects more data, but may run for several minutes.

    msinfo32 /report MSD-Summary.txt /categories +SystemSummary


msinfo32 /report MSD-Full.txt /categories +all

    3.3.2 CPU and Memory Performance Counters

    On Windows the following CPU and memory-related counters need to be observed.

Performance Object | Performance Counter  | Notes
System             | Context Switches/sec | Context switches are the main cause of both latency and CPU load.
Processor          | % Privileged Time    | Select it for all available processors. Also known as system (kernel) time. Can also be checked in the Task Manager if you select “Show kernel times”.
Processor          | % User Time          | Select it for all available processors. Most of the process time should be spent in user mode.
Processor          | % Processor Time     | Select it for all available processors. Overall CPU usage for all processes on the machine.
Memory             | Pages Output/sec     | The number of pages written to disk to free up space in physical memory. A high rate of pages output might indicate a memory shortage.
Memory             | Available MBytes     | Free memory on the machine. A value of 0 is a problem.

    The following process-specific counters should be activated for all enterprise portal specific processes. This includes at least all J2EE nodes (Java processes).

Performance Object | Performance Counter | Notes
Process            | Thread Count        | Number of threads currently active in this process.
Process            | % Processor Time    | Consumed processor time. 200% = 2 CPUs are fully occupied by this process (may exceed 100%!).
Process            | % Privileged Time   | Consumed processor time spent in the kernel. 100% = 1 CPU is fully occupied by this process for kernel activities (may exceed 100%!).
Process            | Working Set         | The working set is the set of memory pages touched recently by the threads in the process. If free memory in the computer is above a threshold, pages are left in the working set of a process even if they are not in use. This counter is of interest because it shows the amount of memory the process really uses, not only what is allocated.

    3.3.3 Network Performance Counters

    Again use the performance monitor tool to get the requested counters:

Performance Object | Performance Counter                                | Notes
TCP                | Connections Failures, Active, Established, Reset   | Characterizes the dynamics in the system. Too many connections created and destroyed per interval could explain performance degradation. Connection failure complaints can also be investigated with the help of this information.
Network Interface  | Bytes Sent/sec, Bytes Received/sec                 | Characterizes the throughput of the system.
Network Interface  | Packets Sent/sec, Packets Received/sec             | Characterizes the throughput of the system.
Network Interface  | Output Queue Length                                | Length of the output packet queue (in packets).
Network Interface  | Packets Outbound Errors                            | Explains some occasional functionality errors affecting EP.

    3.3.4 Disk Performance Counters

    With recent versions of Windows, the disk information is visible in the performance monitor tool in “Physical Disk” Performance Object. If not available there, use the information provided by “System” Performance Object.

Performance Object | Performance Counter                                           | Notes
Physical Disk      | % Disk Time, % Disk Read Time, % Disk Write Time, % Idle Time | Disk time is the busy time spent on both reading and writing.
Physical Disk      | Disk Read Bytes/sec, Disk Write Bytes/sec                     |
Physical Disk      | Disk Reads/sec, Disk Writes/sec                               | Indicator of disk load.
Physical Disk      | Current Disk Queue Length                                     | A long queue indicates a busy disk and a performance bottleneck.
System             | File read operations per second, File write operations per second |
System             | File reads per second, File writes per second                 | Also includes reads and writes from the cache, not only physical disk operations.

3.4 Solaris-Specific

For monitoring system resources, Solaris provides a set of standard commands out of the box: sar (system activity report), iostat (input/output statistics), vmstat (virtual memory statistics), mpstat (multi-processor statistics), and uptime (high-level load overview).

    There are also two graphical monitoring tools available by default (CDE versions):

    /usr/dt/bin/sdtprocess – Lists and sorts all processes. It is possible to look further into process properties, terminate processes

    /usr/dt/bin/sdtperfmeter – Draws the vmstat-data as a bar or line chart. A minimized version of this tool is displayed in the CDE front panel.


Both tools allow you to log the performance counters as well as define thresholds for them, and are therefore appropriate candidates for basic monitoring tasks.

    3.4.1 Collect Configuration Data

As an introduction to monitoring the system, use prtdiag to get an overview of the hardware configuration. The command /usr/platform/`uname -i`/sbin/prtdiag -v prints information about the number and type of CPUs, RAM, extension cards, etc.

    The command prtconf –v is an alternative / complement to prtdiag.

    Additionally on multiprocessor systems, psrinfo –v gives information about the number, status and clock frequency of the processors.

    3.4.2 System Level CPU and Memory Performance Counters

Use sar –u 120 720 > solaris_overallcpu.log to measure the overall CPU utilization across all processes (720 samples with a period of 120 s = a measurement timeframe of 24 hours).

Performance Counter | Notes
%usr                | Percentage of time the processor is in user mode (that is, executing code requested by a user).
%sys                | Percentage of time the processor is in system mode servicing system calls. Users can cause this percentage to rise above normal levels by using system calls inefficiently.
%idle               | Percentage of time the processor is idle. If the percentage is high while the system has a heavy load, there is probably a memory or an I/O problem.
%wio                | Percentage of time the processor is waiting for completion of I/O from disk, NFS, or RFS. If the percentage is regularly high, check the I/O subsystems for inefficiencies.
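The sar output can be scanned automatically for suspicious samples. A hedged sketch, assuming Python and the usual sar -u column layout, which can differ between Solaris releases:

```python
def parse_sar_u(lines):
    """Parse 'sar -u' output into dicts keyed by the header column names.

    Assumes the usual layout: a header line containing %usr/%sys/%wio/%idle
    after a timestamp column, followed by one sample per line.
    """
    samples = []
    columns = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if "%usr" in fields:
            columns = fields[1:]  # drop the timestamp column
            continue
        if columns and len(fields) == len(columns) + 1:
            try:
                values = [float(v) for v in fields[1:]]
            except ValueError:
                continue  # skip averages or malformed lines
            samples.append(dict(zip(columns, values), time=fields[0]))
    return samples

def high_wio_samples(samples, threshold=20.0):
    """Samples with I/O wait above the threshold percent: possible I/O bottleneck."""
    return [s for s in samples if s.get("%wio", 0.0) > threshold]
```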

To look at the overall memory-related counters on the system, use sar –gpr 120 720 > solaris_memory.log

Performance Counter | Notes
pgout/s             | Page-out requests per second
ppgout/s            | Pages paged out per second
pgfree/s            | Pages placed on the free list per second by the page scanner
pgscan/s            | Pages scanned per second by the page scanner
%ufs_ipf            | Percentage of cached file system pages taken off the free list; these pages are flushed and cannot be reclaimed
atch/s              | Page faults per second that are satisfied by reclaiming a page from the free list (attach)
pgin/s              | Page-in requests per second
ppgin/s             | Pages paged in per second
pflt/s              | Faults per second caused by protection errors (copy-on-write faults)
freemem             | Average amount of free memory
freeswap            | Number of disk blocks available in paging space

    3.4.3 Per-Process CPU and Memory Performance Counters

A common tool for system and process monitoring is top, which provides a list of the most CPU-intensive processes running on a system, as well as additional statistics such as the load average, the number of processes, and the amount of memory and paging space in use. Unfortunately, top itself tends to consume considerable processor time (it often appears at the top of the consumer list).

    Solaris comes with a similar tool, called prstat, which provides similar output using considerably less processor time.

    Use

prstat -m -p <list of Java process PIDs> 120 720 > solaris_prstat.log

where <list of Java process PIDs> is a comma-separated list of the IDs of the interesting Java processes (all server and dispatcher nodes) and also database processes or other processes related to EP6.
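A small sketch for building that comma-separated PID list. In real use the PIDs would come from e.g. pgrep java or a ps listing; the two PIDs below are hypothetical placeholders.

```shell
# Join PIDs into the comma-separated list expected by `prstat -p`.
# In real use the PIDs would come from e.g. `pgrep java`; the two
# PIDs below are hypothetical placeholders.
pids=$(printf '%s\n' 1234 5678 | paste -sd, -)
cmd="prstat -m -p $pids 120 720 > solaris_prstat.log"
echo "$cmd"
```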

    The interesting counters produced by the command are defined below:

    Performance Counter

    Notes

    USR Percentage of time the process spent in user mode

    SYS Percentage of time the process spent in system mode

    TRP Percentage of time the process spent in processing system traps

    TFL Percentage of time the process spent processing text page faults

    DFL Percentage of time the process spent processing data page faults

    LCK Percentage of time the process spent waiting for user locks

    SLP Percentage of time the process spent sleeping

    LAT Percentage of time the process spent waiting for CPU

    VCX Number of voluntary context switches

    ICX Number of involuntary context switches

    SCL Number of system calls

    SIG Number of signals received

If you omit the -m option of prstat, you'll get different output that includes memory sizes.

A more detailed view of the memory status per process can be obtained with pmap, which also shows the amounts of shared and private memory.

pmap -x <pid> - Print memory requirements of a process

pmap -S <pid> - Swap reservation information per mapping

    To do this periodically, create a script and define a job redirecting the output of the commands to a log file.
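Such a collection script could look like the following sketch. The interval and sample count here are tiny demo values, and the current shell stands in for a real PID; a production run would use the PIDs of the Java server and dispatcher nodes and e.g. a 300-second interval.

```shell
#!/bin/sh
# Periodically append per-process memory maps to a log file (sketch).
# Demo values: PIDS is the current shell, INTERVAL/SAMPLES are tiny;
# a real run would use the Java process PIDs and e.g. INTERVAL=300,
# SAMPLES=288 for a 24-hour window.
PIDS="$$"
INTERVAL=1
SAMPLES=2
i=0
while [ "$i" -lt "$SAMPLES" ]; do
    date >> solaris_pmap.log
    for pid in $PIDS; do
        pmap -x "$pid" >> solaris_pmap.log 2>&1
    done
    i=$((i + 1))
    sleep "$INTERVAL"
done
```

Each sample is prefixed with a timestamp so that it can later be correlated with the sar data collected over the same period.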

A few Solaris-specific points need to be kept in mind: the size of the free list is not an indicator of a memory shortage, because Solaris consumes any unused memory for caching recently used files. Additionally, the number of page-ins or page-outs per second is a poor metric, because Solaris handles all file system I/O by means of the paging mechanism. Thousands of kilobytes paged in or out just mean that the system is working, so consider a relatively high paging rate normal rather than critical behavior. It is more important to calculate the memory that is really needed by all relevant processes: for optimal performance, the sum of all memory needed should not exceed the available physical memory size.
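The "memory really needed" check can be scripted. A minimal sketch, where the sample RSS values (in KB) are hypothetical and would in real use come from a ps listing of the portal-relevant processes:

```shell
# Sum resident set sizes (KB) of all portal-relevant processes.
# The three sample values are hypothetical; in real use, pipe the
# output of `ps -u <portal user> -o rss=` into the awk script.
total_kb=$(awk '{ sum += $1 } END { print sum }' <<'EOF'
524288
262144
131072
EOF
)
echo "total RSS: $total_kb KB"
```

Compare the resulting total against the installed physical memory of the host.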

    3.4.4 Network Performance Counters

To monitor network activity, use the netstat command in two ways.

netstat -i 30 > solaris_netstat.log

reports counters related to the intensity of network traffic, such as the number of packets in the system, with a period of 30 seconds between the reports. The task runs indefinitely, so observe the growth of the log file.

    The second way to use the command is related to collecting TCP connection performance counters. The TCP connection status overview is very important for EP6.0 performance analysis.

Create a script that periodically (every 300 seconds) executes the command netstat -P tcp >> solaris_tcpnetstat.log, appending each sample to the log.

The interesting counters are presented in the following table:

    Performance Counter

    Notes

    Swind/ Rwind Send and receive window size

    Send-Q/ Recv-Q Send and receive queue size

    State (internal state of the protocol)

    ESTABLISHED - Connection has been established.

Various stages of close:

    CLOSING - Closed, then remote shutdown; awaiting acknowledgment.

    CLOSE_WAIT - Remote shutdown; waiting for the socket to close.

    FIN_WAIT_1 - Socket closed; shutting down connection.

    FIN_WAIT_2 - Socket closed; waiting for shutdown from remote.

    LAST_ACK - Remote shutdown, then closed; awaiting acknowledgment.
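To get a quick overview of how many connections are in each state, the collected connection lines can be tallied with awk. A sketch with hypothetical sample lines; the state is assumed to be the last field of each connection line.

```shell
# Count TCP connections per state from netstat-style output.
# The sample connection lines are hypothetical.
states=$(awk '{ count[$NF]++ }
              END { for (s in count) print s, count[s] }' <<'EOF' | sort
10.0.0.1.50001 10.0.0.2.80 49640 0 49640 0 ESTABLISHED
10.0.0.1.50002 10.0.0.2.80 49640 0 49640 0 ESTABLISHED
10.0.0.1.50003 10.0.0.2.80 49640 0 49640 0 CLOSE_WAIT
EOF
)
echo "$states"
```

A growing number of CLOSE_WAIT or FIN_WAIT_2 entries between samples is a typical warning sign worth investigating.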

The following should also be checked:

Routing - Check with netstat -r. The default router (or standard gateway) can be entered with its IP address in the file /etc/defaultrouter. To delete all current routes, use route flush.

Error rates with netstat -i - Duplicate IP addresses and other host interface misconfigurations can cause various problems, and cables may fail or begin to generate errors. The command also displays the MTU size, input and output packets, and collisions.

    3.4.5 Disk Performance Counters

Common disk performance problems are I/O skew (big load differences between different disks), disk overloading caused by paging, and inexplicably high service times on idle drives.

    Use iostat -xPmnz 30 > solaris_iostat.log

    to find the busiest disks and those with the highest response times.

    Performance Counter

    Notes

    r/s Number of read operations /s

    w/s Number of write operations /s

kr/s Number of KB read /s

kw/s Number of KB written /s

wait Average number of transactions waiting to be serviced

    actv Number of requests currently being serviced

    wsvc_t Average service time in wait queue

    asvc_t Average service time for active transactions in ms

    %w % of time that transactions were waiting to be serviced

    %b % of time that the disk was actively serving transactions

The response time for a disk operation is hidden in the asvc_t performance counter: it is actually the time between a user process issuing a read, for example, and the completion of that read. This is often critical for user response times.
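As a sketch of how to scan a collected iostat log for slow disks, the awk filter below flags devices whose asvc_t exceeds 30 ms. The field layout mimics iostat -xn output and the sample lines are hypothetical; adjust the column index to the actual layout of your log.

```shell
# Flag disks with an average active service time (asvc_t) above 30 ms.
# Field layout mimics `iostat -xn` (r/s w/s kr/s kw/s wait actv
# wsvc_t asvc_t %w %b device); the sample lines are hypothetical.
busy=$(awk '$8 > 30 { print $NF }' <<'EOF'
  0.5  10.2  4.0  80.0  0.0  0.4  0.1  35.1  0  12 c0t0d0
  1.0   2.0  8.0  16.0  0.0  0.1  0.2   5.3  0   3 c0t1d0
EOF
)
echo "slow disks: $busy"
```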

Alternatively, similar information can be obtained with the sar -d command.

    Also check the growth of the log files with df -k and see if it influences load behavior.

3.5 HP-UX Specific

A well-known and often-used tool for performance monitoring on HP-UX is Glance. Since it is a commercial product and may not be available, we discuss monitoring of system resources on HP-UX with a standard set of tools available on every HP-UX installation. All tools are discussed separately, but we recommend writing a script that starts all of them at once. For example:

sar -d 10 5 > hp_sar_disk.log & vmstat 10 5 > hp_vmstat.log &

    3.5.1 Collect Configuration Data

HP-UX includes a tool called cfg2html that generates a very complete report on the system configuration. Enter the following commands to trigger report generation:

cd /opt/cfg2html
cfg2html_hpux.sh

This generates a system configuration report (an .html and a .txt file) in the directory /opt/cfg2html.

    3.5.2 CPU and Memory Performance Counters

Use vmstat -d 120 720 > hp_vmstat.log

This ensures collection of data over a 24-hour period. The log file will grow to approximately 300 KB.

    Performance Counter

    Notes

us % CPU time spent processing user requests/commands

sy % CPU time spent for system-specific tasks

id % CPU idle time

cs Kernel thread context switches

sy System calls per second

pi Pages paged in from paging space

po Pages paged out to paging space

avm Active virtual pages

free Size of free list

r In run queue

b Blocked for resources (I/O, paging, etc.)

w Short sleeper (< 20 secs) but swapped

The data collected with the vmstat command gives the overall status of the machine. Counters that present each running Java process separately are also needed. The top command reports statistics about all running processes on the machine. As these processes may be too numerous, apply a filter to show just those processes started by the EP administrator user account:

Use top -s120 -d720 -u -f hp_top.log

Another option is to filter only the Java processes.

Counters of interest for EP6.0-related monitoring are

    Performance Counter

    Notes

    PRI Current priority of the process

    SIZE Total size of the process in kilobytes including text, data, and stack

    RES Resident size of the process in kilobytes; approximate value

    %WCPU Weighted CPU percentage

%CPU Raw CPU percentage; this field is used to sort the top processes

    3.5.3 Network Performance Counters

Use the netstat command in two ways.

netstat 30 > hp_netstat.log

gives you an impression of the intensity of network traffic, in number of packets on the system, with a period of 30 seconds between the reports. The task runs indefinitely, so observe the growth of the log file.

netstat -p tcp > hp_tcpnetstat.log

Create a script that periodically (every 300 seconds) invokes this TCP connection statistics command.
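Note that the counters reported by netstat -p tcp are cumulative since boot, so per-interval rates must be derived by subtracting successive samples. A minimal sketch with hypothetical counter values:

```shell
# Derive a per-second rate from two successive samples of a cumulative
# counter (e.g. "connections established" from `netstat -p tcp`).
# The counter values and the interval are hypothetical examples.
prev=15000
curr=15420
interval=300
rate=$(( (curr - prev) / interval ))
echo "established per second: $rate"
```

The same subtraction applies to the drop counters, where any steady growth between samples deserves attention.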

    Performance Counter

    Notes

    Connection requests

    Connection accepts

    Connections established (including accepts)

    Connections closed (including drops)

    Keep-alive timeouts

    Connect requests dropped due to full queue

    Connect requests dropped due to no listener

The connection status overview in the report of the netstat -p tcp command is very important for EP6.0 performance analysis, because the tasks of handling customer connections and processing customer business logic are closely related in the dispatcher/server architecture of the J2EE Engine 6.20 (which lies beneath the portal).

3.5.4 Disk Performance Counters

Use the iostat command in the following way:

iostat -t 120 720 > hp_iostat.log

    Performance Counter

    Notes

Device Name of the specific disk drive (usually more than one on HP systems)

bps Kilobytes transferred per second

sps Number of seeks per second

msps Milliseconds per average seek

The CPU time reported due to the -t option is necessary to map the disk I/O information to the current CPU activity. The us, sy, and id columns are the important ones; the output of the tty group can be ignored.

If available, the sar tool should be used in the format sar -d 120 720 > hp_sar_disk.log

The output generated by sar fits better to the counters we need to analyze: %busy, avque, r+w/s, blks/s, and so on.

    Performance Counter

    Notes

%busy Percentage of time when the device was busy servicing a request

avque Average number of requests outstanding for the device

r+w/s Number of (read and write) data transfers per second from and to the device

blks/s Number of bytes transferred (in 512-byte units) from and to the device

avwait Average time (in milliseconds) that transfer requests waited idly for the device in the queue

avserv Average time (in milliseconds) to service each transfer request (includes seek, rotational latency, and data transfer times) for the device

3.6 AIX-Specific

The AIX OS comes with a good set of commands for monitoring system performance. These commands are discussed individually, but it is recommended that you create a script to start all of them at once. For example:

vmstat -t -I 120 720 > aix_vmstat.log & ps v > aix_ps.log &

    3.6.1 Collect Configuration Data

lscfg

lsattr -EH -l somedevice

    3.6.2 CPU and Memory Performance Counters

vmstat is a powerful command for getting statistics related to CPU, memory, and disk I/O. The vmstat command is located in /usr/bin/vmstat. Use the command in the following format:

vmstat -t -I 120 720 > aix_vmstat.log

This configuration performs 24-hour monitoring with a granularity of 2 minutes. Slices of 2 minutes are sufficient for long-running and productive situations. The command parameter -t is necessary to put timestamps on each of the lines of information collected and stored. The parameter -I adds information about disk I/O to the output file.

The output of the command can be filtered so that overly detailed metrics are excluded. The columns of the vmstat output that are most important for our specific measurements with EP6 are listed in the following table.

    Performance Counter

    Notes

us % CPU time spent processing user requests/commands

sy % CPU time spent for system-specific tasks

id % CPU idle time

wa % CPU idle time due to outstanding I/O requests. This should be interpreted keeping the number of CPUs on the system in mind: on a 4-way system, if 1 thread is doing I/O non-stop, this is visualized as 25% I/O, while on an 8-way machine the percentage of I/O for this one thread will be 12%.

cs Kernel thread context switches

sy System calls

pi Pages paged in from paging space

po Pages paged out to paging space

avm Active virtual pages

fre Size of the free list

r Number of running threads

b Number of waiting threads (waiting for I/O or other resources)

p Number of threads waiting for actual physical I/O (only available with the -I option)

In addition to the overall statistics measured on AIX, a detailed view of the system resources used by the Java processes is needed.

Use ps v | grep java > aix_ps.log

Performance Counter

    Notes

    PGIN Number of disk I/Os resulting from references by the process to pages not loaded in core

    SIZE Virtual size in kilobytes of the data section of the process

    RSS Real-memory size of the process in kilobytes

    %CPU Percentage of time the process has used the CPU since the process started. The value is computed by dividing the time the process uses the CPU by the elapsed time of the process. In a multi-processor environment, the value is further divided by the number of available CPUs since several threads in the same process can run on different CPUs at the same time. (Because the time base over which this data is computed varies, the sum of all %CPU fields can exceed 100%.)

    %MEM Percentage of real memory used by this process
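The %CPU semantics described above can be illustrated with a worked example. All values are hypothetical: a process that consumed 20 minutes of CPU time over a 40-minute lifetime on a 4-way machine reports 12% rather than 50%.

```shell
# Worked example of the AIX `ps v` %CPU semantics: CPU time divided by
# elapsed time, further divided by the number of CPUs.
# All values are hypothetical.
cpu_seconds=1200   # CPU time consumed by the process
elapsed=2400       # elapsed wall-clock time of the process
ncpu=4             # CPUs in the machine
pct=$(( 100 * cpu_seconds / elapsed / ncpu ))
echo "%CPU: $pct"
```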

    3.6.3 Network Performance Counters

The AIX command for monitoring network activities offers numerous options and outputs. In contrast to vmstat, netstat performs continuous checks and report cycles; therefore only the interval needs to be specified. Make sure to configure the same intervals as for vmstat.

Use netstat 30 > aix_netstat.log

and netstat -p tcp > aix_tcpnetstat.log

Look at the following counters:

    Performance Counter

    Notes

    connection requests

    connection accepts

connections established (including accepts)

    connections closed (including drops)

    keep-alive timeouts

The connection status overview in the report of the netstat -p tcp command is very important for EP6.0 performance analysis, since the engine also directly handles customer HTTP connections.

    3.6.4 Disk Performance Counters

    Another useful command in the AIX standard command set is iostat, which reports disk activities.

Use it in the format iostat 120 720 > aix_iostat.log

The interval (120) and the number of measurements (720) need to be synchronized with the intervals and counts of the commands described above, so that the information can be mapped and correlated at the end of the measurement.

    For EP6 related measurements, the following counters need to be tracked.

    Performance Counter

    Notes

Disks The name of the specific disk drive (normally there is more than one disk on AIX machines)

% tm_act Percentage of time the physical disk was active (bandwidth utilization for the drive)

Kbps Amount of data transferred (read or written) to the drive in KB per second

tps Number of transfers per second issued to the disk

Kb_read Total number of KB read from the disk

Kb_wrtn Total number of KB written to the disk

If present, the sar tool should be used in the format sar -d 120 720 > aix_sar_disk.log

The output in the form of %busy, avque, etc. counters fits better into the analysis templates.

Column Name Semantic

%busy Percentage of time the device was busy servicing a transfer request

avque Average number of requests outstanding during that time

read/s Number of read transfers per second from the device

write/s Number of write transfers per second to the device

blks/s Number of bytes transferred, in 512-byte units

The avwait and avserv columns are always 0 on AIX.


4 Portal Host Analysis

This chapter describes analysis procedures that check the sanity of the portal server machines at the operating system level. These procedures are most useful if you do not yet have a clear understanding of the performance problem, or if you feel that almost everything related to the portal is slow. On the other hand, if you have already identified that only some specific areas or some content of the portal are slow, it may be useful to perform only the first few sanity checks from this section and then quickly proceed to the next chapter, Portal Workload Analysis.

The analysis in this chapter is based on the monitoring data that should be collected as described in the previous chapter (monitoring setup). Only parts of the analysis can be done with the snapshot and ad-hoc analysis tools explained in this chapter. Furthermore, the monitoring setup should not be restricted to the portal server alone: basic monitoring should be set up (if not already in place) for all servers involved in the portal landscape. These servers may include Web servers, backend servers (database, LDAP, WebAS-based systems), etc. WebAS-based systems include a powerful monitoring infrastructure by default (e.g. transaction OS07 for OS monitoring).

4.1 Snapshot Monitoring

If you did not have the chance to set up monitoring, or if you want to monitor with a higher sampling rate than the two minutes proposed in the previous section, snapshot monitoring tools are a valuable supplement to long-term monitoring tools. A summary of these tools is listed in the table below, including some tools already discussed in the previous section.

(Windows / Solaris / HP-UX / AIX)

Graphical tool for system level / process level snapshot monitoring: Taskmgr Performance / Taskmgr Processes (Windows); sdtperfmeter / sdtprocess (Solaris); glm (HP-UX); ? (AIX)

Text tool for system level / process level snapshot monitoring: (Pstools) (Windows); uptime, *stat / top, prstat (Solaris); uptime, *stat / glance (HP-UX); uptime, *stat / topas (-P) (AIX)

Monitoring data collection: perfmon (Windows); sar (Solaris); sar (HP-UX); sar (AIX)

CPU usage per thread: pslist (Windows); prstat (Solaris); glance (HP-UX); ps -mo THREAD (AIX)

Configuration info: msinfo32 (Windows); prtdiag (Solaris); cfg2html (HP-UX); lscfg (AIX)

    4.1.1 Windows

    For Windows, the tool of choice for snapshot monitoring is the Task Manager.

On the Performance tab, check the overall system load. Activate kernel times via the menu (View > Show Kernel Times).

On the Processes tab, sort by CPU and find out which processes consume most of the CPU time. Note that on Windows a 100% CPU load means that all CPUs of a multi-CPU server are fully loaded. Furthermore, the number of CPUs that taskmgr recognizes is relevant for performance calculations: a server containing four CPUs with hyperthreading appears as an 8-CPU machine in taskmgr (you see eight boxes in the Performance view). The per-process CPU load is calculated based on this number: assuming again an eight-CPU machine, 13% CPU usage by a single process means that one single CPU (not necessarily always the same one) is fully used by the process.

    Do you see any CPU consuming process that is not directly related to the enterprise portal?

Activate additional columns in the display (View > Select Columns). You can thus also monitor paging (PF Delta), I/O, the number of threads, handles, etc. Use the analysis worksheet below.

    4.1.2 Unix

First check the average load:

uptime
11:54am up 103 day(s), 23:21, 1 user, load average: 5.23, 6.13, 7.01

The load average is the average number of runnable jobs (the sum of the run queue length and the number of currently running jobs) over the last one, five, and fifteen minutes. It gives a quick estimate of whether the system is heavily loaded, as well as a possible indication of performance degradation over time.

To assess the load average correctly, you need to know the number of processors in the system. On an eight-CPU system, a load average of 7 means that at least one CPU is still idle (7 runnable jobs for 8 CPUs). On the other hand, a four-CPU machine would be quite overloaded with a load average of 7. Note that this load calculation differs from the Windows calculation.
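The rule of thumb above can be expressed as a tiny script. The CPU count and the (integer-rounded) load are hypothetical example values; in practice they come from platform tools such as psrinfo and uptime.

```shell
# Compare the load average against the CPU count (sketch).
# ncpu and load are hypothetical example values; an integer load is
# used for simplicity.
ncpu=8
load=7
if [ "$load" -lt "$ncpu" ]; then
    verdict="at least one CPU idle"
else
    verdict="system fully loaded or overloaded"
fi
echo "$verdict"
```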

    For process-level monitoring, each Unix brand has its own monitoring tool (see the first two rows in the table above and the tools mentioned in the previous chapter Monitoring Setup). Similar to Windows, a CPU usage of 30% by a single process on an eight-way machine means that 2-3 CPUs are busy due to this process.

4.2 Analysis Worksheets

The following analysis worksheets guide you through the interpretation of the collected performance counters. In many cases, however, the OS-level performance data is not sufficient to derive conclusions; it only provides indicators that must be correlated with data from other parts of the monitoring setup (e.g. the verbose:gc trace, portal monitoring, etc.). For this reason, the analysis below already includes monitoring data from these other sources and directs you to other chapters of this guide.

    4.2.1 Healthy System Worksheet

All machines related to the Enterprise Portal installation are in good health if their metrics follow the "Healthy System Template".

Performance Counter Thresholds

Total CPU time (user + system (kernel) time) (related to CPU): 85%

For Unix systems, where the load is relative to the number n of CPUs (for example, 4 CPUs fully loaded = 400% CPU), use the formula load average < 85% * n

    User time related to system time

    (related to CPU)

    System time

Several architectural facts about EP 6.0 SP2 and J2EE 6.20 are considered below.

    4.2.1.1 CPU

The targeted CPU ratio of system to user time can only be achieved if dispatcher activities are not intensive. For a J2EE dispatcher node, the system CPU time is almost equal to the user CPU time. For J2EE server nodes, the described ratio can be achieved with a proper thread count configuration (normally, reducing the client threads to fewer than 10 and keeping the system threads around 20 improves the ratio of user to system CPU usage).

    Low CPU usage (or alternating quite high and quite low CPU usage) in combination with an overall slow portal may also be caused by frequent full garbage collections. During garbage collection, only a single CPU is used by a Java VM. Referring to the 8-way portal server in the snapshot monitoring example above, this means you observe a 13% CPU usage of the Java VM during garbage collection. To investigate in detail, check the verbose:gc trace and see Section 7 VM Memory Analysis.

Two possible reasons for high system CPU time are heavy I/O activity (disk or network) and lock contention. Lock contention means that many Java threads are blocked from execution because they are waiting for access to some resource. To investigate lock contention, see the section Java VM Thread Analysis.

    4.2.1.2 Memory

    The total amount of memory allocated by all Java processes and other processes relevant to the portal (e.g. database processes if running on the same host) has to fit into main memory.

In particular, the sum of the heap sizes (-Xmx, -Xms) and the maximum PermSize over all Java processes must be configured to fit completely into main memory (RAM), since the effect of paging on a Java VM is much more severe than for most other programs. When calculating the sizes, also consider the memory needed by all other active processes and by the operating system itself.
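A back-of-the-envelope version of this sizing rule; all numbers are hypothetical examples in MB:

```shell
# Memory budget check (sketch): sum of heap (-Xmx) plus MaxPermSize
# over all Java nodes, plus other processes, must fit into physical
# RAM. All values are hypothetical examples, in MB.
ram=4096
heap_per_node=1024
permsize=256
nodes=2
other=512                       # database, OS, other processes
needed=$(( nodes * (heap_per_node + permsize) + other ))
echo "needed $needed MB of $ram MB RAM"
if [ "$needed" -le "$ram" ]; then echo "fits"; else echo "does not fit"; fi
```

With these example values, two nodes of 1024 MB heap plus 256 MB PermSize and 512 MB for other processes need 3072 MB, which fits into 4096 MB of RAM.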

    4.2.1.3 Disk I/O

The portal and the J2EE Engine 6.20 can easily (and unintentionally) be configured to collect a lot of information in log files. If you observe high disk I/O, carefully review the log configuration to reduce the log levels to a minimum. Most of the logs are designed for troubleshooting only, not for being permanently active on a high log level during productive use.