
Performance Measurement and Visualization on the Cray XT3

Luiz DeRose
Programming Environment Director
Cray Inc.
ldr@cray.com

09/26-28/2006
LLNL, Livermore, CA

09/26-28/2006 Luiz DeRose (ldr@cray.com) @ Cray Inc.

The Cray Tools Strategy

• Must be easy to use
  • Automatic program instrumentation
    • no source code or makefile modification needed
• Integrated performance tools solution
  • Multiple platforms
  • Multiple functionality
    • MPI, I/O, Heap, HW Counters
• Strategy based on the three main steps normally used for application optimization and tuning:
  • Debug application
  • Single processor and vector optimization
  • Parallel processing and I/O optimization
• Close interaction with user for feedback targeting functionality enhancements


Cray Performance Analysis Infrastructure

• CrayPat
  • pat_hwpc: for whole program measurement
  • pat_build: Utility for application instrumentation
    • No source code modification required
  • run-time library for measurements
    • transparent to the user
  • pat_report:
    • Performance reports
    • Performance visualization file
  • libhwpc
  • pat_help
• Cray Apprentice2
  • Graphical performance analysis and visualization tool
  • Can be used off-line on Linux system


Performance Data Collection

• Two dimensions
  • When performance collection is triggered
    • External agent (asynchronous)
      • Sampling
        » timer interrupt
        » hardware counters overflow
    • Internal agent (synchronous)
      • Code instrumentation (event trace)
        » Automatic instrumentation
        » Hand instrumentation
  • How performance data is recorded
    • Profile (runtime summary)
    • Trace file
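The second dimension, how data is recorded, can be illustrated with a small C sketch. This is not CrayPat code: the types `Summary`, `TraceEvent`, and `TraceBuf` and the functions `summary_add` and `trace_add` are hypothetical names made up for the illustration. A profile keeps one running summary per region, so its memory use stays constant; a trace appends one record per event, so it grows with the run but preserves the order of events.

```c
#include <stddef.h>

/* Hypothetical illustration only; none of these names come from CrayPat. */

/* Profile (runtime summary): one fixed-size record per region. */
typedef struct {
    long   calls;       /* number of times the region was entered */
    double total_time;  /* accumulated time in seconds */
} Summary;

/* Trace: one record per event, kept in order of occurrence. */
typedef struct {
    int    region;      /* which region the event belongs to */
    double duration;    /* time spent in this single event */
} TraceEvent;

typedef struct {
    TraceEvent *events;    /* caller-provided storage */
    size_t      capacity;  /* number of slots in events */
    size_t      count;     /* number of slots used so far */
} TraceBuf;

/* Fold one measured event into the running summary. */
void summary_add(Summary *s, double duration)
{
    s->calls += 1;
    s->total_time += duration;
}

/* Append one event to the trace; returns 0 on success, -1 if full. */
int trace_add(TraceBuf *t, int region, double duration)
{
    if (t->count >= t->capacity)
        return -1;
    t->events[t->count].region = region;
    t->events[t->count].duration = duration;
    t->count += 1;
    return 0;
}
```

The summary never needs more than one `Summary` per region, whereas the trace buffer must be sized (or flushed) for the number of events expected.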


Single Processor Optimization

• Answer the following questions:
  • Do I have a performance problem at all?
    • pat_hwpc
      • Provides overall view of the program execution
        » Time / Resource / Hardware Counters measurement
  • Where are the main bottlenecks?
    • CrayPat provides profilers
      • Based on sampling and runtime summarization
        » XT3 sampling support under development
      • Flat, Call graph, Function, …
  • Why is it there?
    • CrayPat contains:
      • HW counters based Instrumentation library
      • API for lower level instrumentation


Parallel Processing, I/O and Memory Optimization

• Answer the following questions:
  • Do I have communication/synchronization problems?
    • CrayPat addresses:
      • Communication Profiler
      • Load balance profile
  • Do I have I/O or Memory problems?
    • CrayPat addresses (will address):
      • I/O Profiler
      • Heap profiler
  • Why?
    • Tracing
      • CrayPat tracing library
    • Cray PAT Visualization GUI
      • Cray Apprentice2


Six Steps for Performance Analysis

1. Load CrayPat module
2. Build application
   • No makefile modification needed
3. Instrument application with pat_build
     % pat_build [-g group] [-u] [options] a.out
   • Groups: mpi, io, heap, user function (-u) …
   • Automatic instrumentation at group (function) level
     • No source code modification needed
   • API provided for instrumentation at a finer granularity
4. Run instrumented application
5. Generate performance file (.ap2) with pat_report
     % pat_report -f ap2 [options] <.xf file>
6. Performance analysis and visualization with CrayPat and Cray Apprentice2


Application Instrumentation with Pat_Build

• No source code or makefile modification required
  • API available for fine grain instrumentation
  • Compiler flag -Mprof=func needed for some Fortran 90 programs with modules
    • This problem has been addressed with PGI 6.1.4
      • In certain cases the flag is still needed
• Performs binary rewrite
  • Relink application
  • Requires object files
  • Generates a stand alone instrumented program
• Runtime environment variable defines if profile or trace file will be generated
  • PAT_RT_SUMMARY
  • Default is 1 (for runtime summarization)


CrayPat API

• CrayPat performs automatic instrumentation at function level
• The CrayPat API can be used for fine grain instrumentation
  • Fortran
      call PAT_region_begin(id, "label", ierr)
      DO Work
      call PAT_region_end(id, ierr)
  • C
      #include <pat_api.h>
      …
      ierr = PAT_region_begin(id, "label");
      DO_Work();
      ierr = PAT_region_end(id);
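As a hedged illustration of the C calls listed above, the sketch below wraps a loop in a CrayPat region. Compiling with `-DUSE_CRAYPAT` on a system that provides `pat_api.h` uses the real header; otherwise two stand-in stubs (an assumption of this sketch, not CrayPat code) let the example build anywhere. The function name `sum_squares`, the region ID 1, and the label are invented for the example.

```c
#ifdef USE_CRAYPAT
#include <pat_api.h>   /* real CrayPat API header, as named on the slide */
#else
/* Stand-in stubs so the sketch builds without CrayPat. They do nothing, and
 * their return value is a placeholder, not the value CrayPat returns. */
static int PAT_region_begin(int id, const char *label)
{
    (void)id;
    (void)label;
    return 0;
}
static int PAT_region_end(int id)
{
    (void)id;
    return 0;
}
#endif

/* Hypothetical work routine: sums i*i for i in [0, n). */
long sum_squares(int n)
{
    long total = 0;
    int ierr;

    /* Region ID 1 is arbitrary; it must not exceed PAT_RT_REGION_MAX. */
    ierr = PAT_region_begin(1, "sum_squares_loop");
    (void)ierr;   /* a real program would check the status */

    for (int i = 0; i < n; i++)
        total += (long)i * i;

    ierr = PAT_region_end(1);
    (void)ierr;
    return total;
}
```

Only the loop between the two calls is attributed to the region, which is the finer granularity the slide refers to; everything else in `sum_squares` is still covered by function-level instrumentation.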


Additional API Functions

• int PAT_profiling_state (int state)
• int PAT_record (int state)
• int PAT_sampling_state (int state)
• int PAT_tracing_state (int state)
• int PAT_trace_function (const void *addr, int state)
• State can have one of the following:
  • PAT_STATE_ON
  • PAT_STATE_OFF
  • PAT_STATE_QUERY
• int PAT_flush_buffer (void)
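One plausible use of the state functions is to exclude a setup phase from measurement. The sketch below does this with `PAT_record`. Without `-DUSE_CRAYPAT` it relies on a stub whose constant values and return convention (the state in effect after the call) are assumptions of this sketch; the slide does not say what the real functions return. `run_phases`, `setup_phase`, and `solve_phase` are hypothetical names.

```c
#ifdef USE_CRAYPAT
#include <pat_api.h>
#else
/* Stand-ins so the sketch builds without CrayPat. The numeric values and
 * the return convention are assumptions of this sketch, not CrayPat's. */
enum { PAT_STATE_OFF = 0, PAT_STATE_ON = 1, PAT_STATE_QUERY = 2 };
static int stub_state = PAT_STATE_ON;
static int PAT_record(int state)
{
    if (state != PAT_STATE_QUERY)
        stub_state = state;
    return stub_state;
}
#endif

/* Hypothetical application phases; they only count their invocations. */
int setup_calls = 0;
int solve_calls = 0;
void setup_phase(void) { setup_calls++; }
void solve_phase(void) { solve_calls++; }

/* Hypothetical driver: skip measurement during setup, measure the solver.
 * Returns whatever the final state query reports. */
int run_phases(void (*setup)(void), void (*solve)(void))
{
    PAT_record(PAT_STATE_OFF);   /* do not record the setup phase */
    setup();
    PAT_record(PAT_STATE_ON);    /* record only the phase of interest */
    solve();
    return PAT_record(PAT_STATE_QUERY);
}
```

A call such as `run_phases(setup_phase, solve_phase)` keeps initialization cost out of the report, which can make the profile of the timed phase easier to read.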


Runtime Environment Variables

• The following runtime environment variables affect how the data is collected:
  • PAT_RT_REGION_MAX
    • Specifies the largest numerical ID that may be used as an argument to the CrayPat API functions PAT_region_begin and PAT_region_end
      • The default is 100
  • PAT_RT_SUMMARY
    • Enables run-time summarization
      • Includes the aggregation of data during run-time
      • Runtime summarization is enabled by default
  • PAT_RT_HWPC <set #>
    • Activate collection of hardware performance counters
      • There are 9 sets on the XT3
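Because region IDs above PAT_RT_REGION_MAX are not usable, an application that picks IDs dynamically may want to check them first. The following is a minimal sketch of such a check. The helper names `region_max_from` and `region_id_allowed` are hypothetical, and falling back to the default of 100 for missing or malformed settings is a choice made here; the source does not describe how CrayPat itself parses the variable.

```c
#include <stdlib.h>

/* Default taken from the slide: "The default is 100". */
#define REGION_MAX_DEFAULT 100

/* Hypothetical helper: interpret a PAT_RT_REGION_MAX setting.
 * Falls back to the default when the text is missing or is not a
 * positive decimal number (an assumption of this sketch). */
long region_max_from(const char *setting)
{
    char *end;
    long value;

    if (setting == NULL || *setting == '\0')
        return REGION_MAX_DEFAULT;
    value = strtol(setting, &end, 10);
    if (*end != '\0' || value <= 0)
        return REGION_MAX_DEFAULT;
    return value;
}

/* Hypothetical check: returns 1 if id does not exceed the configured
 * maximum, else 0. */
int region_id_allowed(long id, const char *setting)
{
    return id <= region_max_from(setting);
}
```

In a program this would typically be called as `region_id_allowed(id, getenv("PAT_RT_REGION_MAX"))` before passing `id` to `PAT_region_begin`.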


pat_report Options

• Reformatting the performance file (Cray Apprentice2 input)
  • pat_report [-V] [-i dir|instrprog] [-o output_file] -f ap2|txt|xml data_directory | data_file.xf
• Generating performance reports
  • pat_report [-V] [-i dir|instrprog] [-o output_file] [-O keyword] [-b b-opts] [-d d-opts] [-s key=value] [-P] [-T] data_directory | data_file.xf | data_file.ap2
• Main options:
  • -i is needed only if the instrumented program has a different name or is in a different directory path than when it was executed
  • -O provides shortcuts for common reports
  • -b, -d, -s can be used to further customize the report


Pat_report Output

CrayPat/X: Version 3.1 Revision 398 (xf 305) 09/21/06 16:23:52

Experiment: trace

Experiment data file: /lus/nid00007/ldr/L_Apps/sweep3d/sweep3d+all+1375td.xf (RTS)

Current path to data file: /ufs/home/users/ldr/P+sweep+48p.ap2 (RTS)

Original program: /lus/nid00007/ldr/L_Apps/sweep3d/sweep3d

Instrumented program: /lus/nid00007/ldr/L_Apps/sweep3d/./sweep3d+all

Program invocation: ./sweep3d+all

Number of PEs: 48

Exit Status: 0 PEs: 0-47

Runtime environment variables: PAT_RT_SUMMARY=1

Report time environment variables: PAT_ROOT=/home/users/homer/opt/xt-tools/craypat/craypat/cpatx

Report command line options: <none>

Host name and type: perch x86_64 2400 MHz

Operating system: catamount 1.0 2.0

Traced functions:
  MAIN_          .../ldr/L_Apps/sweep3d/driver.f
  MPI_Abort      ==NA==
  MPI_Allgather  ==NA==
  MPI_Allreduce  ==NA==
  MPI_Attr_put   ==NA==
  . . .

All profiles have this information, which is helpful to link the data with a particular execution


Table 1: Flat Profile (Default)

Notes for table 1:

  High level option: -O profile
  Low level options: -d ti%@0.05,ti,imb_ti,imb_ti%,tr \
                     -b exp,gr,fu,pe=HIDE

  This table shows only lines with Time% > 0.05.

  Percentages at each level are relative
  (for absolute percentages, specify: -s percent=a).

Table 1: Profile by Function Group and Function

 Time % |     Time | Imb. Time |   Imb. |  Calls |Experiment=1
        |          |           | Time % |        |Group
        |          |           |        |        | Function
        |          |           |        |        |  PE='HIDE'

 100.0% | 3.798177 |        -- |     -- | 579653 |Total
|------------------------------------------------------------
|  70.9% | 2.692783 |        -- |     -- | 245380 |USER
||-----------------------------------------------------------
||  97.1% | 2.615916 |  0.137362 |   5.1% |    576 |sweep_
||   1.7% | 0.046263 |  0.001465 |   3.1% |    576 |source_
||   0.4% | 0.010300 |  0.001399 |  12.2% | 118080 |snd_real_
||   0.3% | 0.009208 |  0.000261 |   2.8% |    576 |flux_err_
||   0.2% | 0.004303 |  0.000726 |  14.7% | 118080 |rcv_real_
||   0.1% | 0.002387 |  0.001871 |  44.9% |     48 |MAIN_

By default, the report will only show functions with at least 0.05% of the time


Table 1: Flat Profile (Continuation)

||   0.1% | 0.001785 |  0.000046 |   2.6% |     48 |initialize_
||   0.1% | 0.001399 |  0.000065 |   4.6% |     48 |initxs_
||===========================================================
|  28.8% | 1.092307 |        -- |     -- | 238224 |MPI
||-----------------------------------------------------------
||  76.1% | 0.831311 |  0.234766 |  22.5% | 118080 |mpi_recv_
||  12.0% | 0.131277 |  0.124790 |  49.8% |   1536 |mpi_allreduce_
||   4.9% | 0.053189 |  0.009683 |  15.7% | 118080 |mpi_send_
||   4.1% | 0.044522 |  0.001153 |   2.6% |    144 |mpi_barrier_
||   2.9% | 0.032002 |  0.002497 |   7.4% |    192 |mpi_bcast_
||===========================================================
|   0.2% | 0.007329 |        -- |     -- |  95597 |HEAP
||-----------------------------------------------------------
||  61.1% | 0.004481 |  0.001403 |  24.3% |  47861 |malloc
||  38.9% | 0.002848 |  0.000900 |  24.5% |  47735 |free
||===========================================================
|   0.2% | 0.005758 |        -- |     -- |    452 |IO
||-----------------------------------------------------------
||  81.3% | 0.004679 |  0.072462 |  95.9% |    309 |fwrite
||   9.8% | 0.000566 |  0.026590 | 100.0% |     68 |getc
||   8.6% | 0.000498 |  0.023405 | 100.0% |      8 |fputc
||   0.1% | 0.000007 |  0.000330 | 100.0% |      2 |fopen
||   0.1% | 0.000005 |  0.000000 |   5.8% |     48 |setlinebuf
|============================================================
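The slides do not define the Imb. Time and Imb. Time % columns of Table 1. The sketch below uses a definition that reproduces the printed values, on the assumption that the Time column is the average over the 48 PEs: imbalance time is the maximum per-PE time minus the average, and the percentage divides that by the maximum and scales by N/(N-1), so that 100% means a single PE did all the work. The function names are hypothetical, and the formula is an inference from the numbers, not a statement of CrayPat's implementation.

```c
/* Imbalance time: how far the slowest PE is above the average. */
double imb_time(double max_time, double avg_time)
{
    return max_time - avg_time;
}

/* Imbalance percentage, scaled by N/(N-1) so that 100% corresponds to one
 * PE doing all the work. Assumed formula; it matches the table's values. */
double imb_time_pct(double max_time, double avg_time, int npes)
{
    if (npes < 2 || max_time <= 0.0)
        return 0.0;
    return 100.0 * (max_time - avg_time) / max_time * npes / (npes - 1);
}
```

For sweep_, the per-PE load balance table later in the deck lists a maximum of 2.753279 s on pe.0 against the average of 2.615916 s shown here; `imb_time` then gives about 0.137363 s and `imb_time_pct` with 48 PEs gives about 5.1%, in line with the row above.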


Table 2: Load Balance (Default)

Notes for table 2:

  High level option: -O load_balance_sm
  Low level options: -d ti%@0.05,ti,sc,sm,sz -b exp,gr,pe=[mmm]

Table 2: Load Balance with MPI Sent Message Stats

 Time % |     Time |   Sent |    Sent Msg | Avg Sent |Experiment=1
        |          |    Msg | Total Bytes | Msg Size |Group
        |          |  Count |             |          | PE[mmm]

 100.0% | 3.798177 | 118080 |  1244160000 | 10536.59 |Total
|----------------------------------------------------------------
|  70.9% | 2.692783 |     -- |          -- |       -- |USER
||---------------------------------------------------------------
||   2.2% | 2.833001 |     -- |          -- |       -- |pe.0
||   2.1% | 2.717019 |     -- |          -- |       -- |pe.12
||   2.0% | 2.597093 |     -- |          -- |       -- |pe.43
||===============================================================
|  28.8% | 1.092307 | 118080 |  1244160000 | 10536.59 |MPI
||---------------------------------------------------------------
||   2.3% | 1.188383 |   2160 |    21081600 |  9760.00 |pe.43
||   2.0% | 1.069314 |   2880 |    30412800 | 10560.00 |pe.7
||   1.6% | 0.859333 |   1440 |    15206400 | 10560.00 |pe.0
||===============================================================
|   0.2% | 0.007329 |     -- |          -- |       -- |HEAP
||---------------------------------------------------------------
||   2.7% | 0.009363 |     -- |          -- |       -- |pe.12
||   2.2% | 0.007614 |     -- |          -- |       -- |pe.40
||   0.6% | 0.002062 |     -- |          -- |       -- |pe.0
||===============================================================
|   0.2% | 0.005758 |     -- |          -- |       -- |IO
||---------------------------------------------------------------
||  46.6% | 0.128685 |     -- |          -- |       -- |pe.0
||   1.1% | 0.003144 |     -- |          -- |       -- |pe.47
||   0.6% | 0.001644 |     -- |          -- |       -- |pe.29
|================================================================
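The Avg Sent Msg Size column in Table 2 is consistent with total bytes divided by message count. A small sketch of that arithmetic follows; the function name `avg_msg_size` is hypothetical.

```c
/* Average sent message size in bytes; returns 0 when nothing was sent. */
double avg_msg_size(long long total_bytes, long long msg_count)
{
    if (msg_count <= 0)
        return 0.0;
    return (double)total_bytes / (double)msg_count;
}
```

With the Total row, `avg_msg_size(1244160000LL, 118080LL)` gives about 10536.59 bytes; with the pe.43 row, `avg_msg_size(21081600LL, 2160LL)` gives 9760 bytes.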


Table 3: MPI Send Stats by Bucket

Notes for table 3:

  High level option: -O mpi
  Low level options: -d sc@,mb1..7 -b exp,fu,ca,pe=[mmm]

  This table shows only lines with Sent Msg Count > 0.

Table 3: MPI Sent Messages Stats by Bucket

   Sent | 256B<= |Experiment=1
    Msg |  MsgSz |Function
  Count |   <4KB | Caller
        |        |  PE[mmm]

 118080 | 118080 |Total
|-----------------------------
| 118080 | 118080 |mpi_send_
|        |        | snd_real_
|        |        |  sweep_
|        |        |   inner_
|        |        |    inner_auto_
|        |        |     MAIN_
|||||||-----------------------
|||||||   2880 |   2880 |pe.33
|||||||   2160 |   2160 |pe.2
|||||||   1440 |   1440 |pe.5
|=============================


Table 4: Heap Usage

Notes for table 4:

  High level option: -O heap_program
  Low level options: -d IU,IF,NF,FM -b exp,pe=[mmm]

Table 4: Heap Usage at Start and End of Main Program

 MB Heap |  MB Heap |  Heap |  Max Free |Experiment=1
 Used at |  Free at |   Not | Object at |PE[mmm]
   Start |    Start | Freed |       End |
         |          |    MB |           |

  91.180 | 1834.820 | 0.001 |  1834.799 |Total
|---------------------------------------------------
|  92.754 | 1833.246 | 1.623 |  1833.211 |pe.0
|  91.147 | 1834.853 | 0.000 |  1834.833 |pe.42
|  91.147 | 1834.853 | 0.001 |  1834.833 |pe.5
|===================================================


Table 5: Heap Statistics

Notes for table 5:

  High level option: -O heap_hiwater
  Low level options: -d am@,ub,ta,ua,tf,ac,ab -b exp,pe=[mmm]

  This table shows only lines with Tracked Heap HiWater MBytes > 0.

Table 5: Heap Stats during Main Program

 Tracked |  MBytes |  Total |  Allocs | Total | Tracked | Tracked |Experiment=1
    Heap |     Not | Allocs |     Not | Frees | Objects |  MBytes |PE[mmm]
 HiWater | Tracked |        | Tracked |       |     Not |     Not |
  MBytes |         |        |         |       |   Freed |   Freed |

   8.793 |   0.000 |    997 |       0 |   995 |       3 |   0.010 |Total
|------------------------------------------------------------------------------
|   8.927 |   0.000 |    418 |       0 |   386 |      33 |   0.030 |pe.0
|   8.907 |   0.000 |    945 |       0 |   943 |       2 |   0.010 |pe.23
|   8.445 |   0.000 |   1471 |       0 |  1469 |       2 |   0.010 |pe.43
|==============================================================================


Table 6: Heap Leaks

Table 6: Heap Leaks during Main Program

 Tracked | Tracked | Tracked |Experiment=1
  MBytes |  MBytes | Objects |Caller
     Not |     Not |     Not | PE[mmm]
 Freed % |   Freed |   Freed |

  100.0% |   0.010 |       2 |Total
|-----------------------------------------
|  95.0% |   0.010 |       1 |(N/A)
||----------------------------------------
||   2.1% |   0.010 |       1 |pe.33
||   2.1% |   0.010 |       1 |pe.42
||   2.1% |   0.010 |       1 |pe.5
|=========================================


Table 7: I/O (Read) Statistics

Notes for table 7:

  High level option: -O read_stats
  Low level options: -d rt,rb,rR,rd@,rC -b exp,fi,pe=[mmm],fd

  This table shows only lines with Reads > 0.

Table 7: File Input Stats by Filename

  Read |  Read MB |  Read Rate | Reads |   Read |Experiment=1
  Time |          |     MB/sec |       | B/Call |File Name
       |          |            |       |        | PE[mmm]
       |          |            |       |        |  File Desc

 0.000 | 0.000065 | 153.038002 |    68 |   1.00 |Total
|-------------------------------------------------------------
| 0.000 | 0.000065 | 153.038002 |    68 |   1.00 |input
||------------------------------------------------------------
|| 0.000 | 0.000065 |   3.189272 |    68 |   1.00 |pe.0
||       |          |            |       |        | fd.6
|| 0.000 |       -- |         -- |    -- |     -- |pe.22
|| 0.000 |       -- |         -- |    -- |     -- |pe.5
|=============================================================


Table 8: I/O (Write) Statistics

Notes for table 8:

  High level option: -O write_stats
  Low level options: -d wt,wb,wR,wr@,wC -b exp,fi,pe=[mmm],fd

  This table shows only lines with Writes > 0.

Table 8: File Output Stats by Filename

 Write | Write MB |  Write Rate | Writes |  Write |Experiment=1
  Time |          |      MB/sec |        | B/Call |File Name
       |          |             |        |        | PE[mmm]
       |          |             |        |        |  File Desc

 0.000 | 0.002596 |  708.859207 |    317 |   8.59 |Total
|--------------------------------------------------------------
| 0.000 | 0.002001 |  621.128045 |    269 |   7.80 |stdout
||-------------------------------------------------------------
|| 0.000 | 0.002001 |   12.940342 |    269 |   7.80 |pe.0
||       |          |             |        |        | fd.1
|| 0.000 |       -- |          -- |     -- |     -- |pe.22
|| 0.000 |       -- |          -- |     -- |     -- |pe.5
||=============================================================
| 0.000 | 0.000595 | 1349.926896 |     48 |  13.00 |stderr
||-------------------------------------------------------------
|| 0.000 | 0.000012 |   23.392012 |      1 |  13.00 |pe.30
||       |          |             |        |        | fd.2
|| 0.000 | 0.000012 |   26.878626 |      1 |  13.00 |pe.10
||       |          |             |        |        | fd.2
|| 0.000 | 0.000012 |   56.567754 |      1 |  13.00 |pe.0
||       |          |             |        |        | fd.2
|==============================================================
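The Write B/Call column can be approximately recovered from the Write MB and Writes columns. The sketch below assumes 1 MB = 2^20 bytes, which is an inference from the rounded values in the table rather than something the slides state; the function name `bytes_per_call` is hypothetical.

```c
/* Bytes per I/O call from a size in MB, assuming 1 MB = 2^20 bytes.
 * Returns 0 when there were no calls. */
double bytes_per_call(double megabytes, long calls)
{
    if (calls <= 0)
        return 0.0;
    return megabytes * 1048576.0 / (double)calls;
}
```

For the Total row, `bytes_per_call(0.002596, 317)` comes out near 8.59; for stdout, `bytes_per_call(0.002001, 269)` comes out near 7.80. Because the MB column is itself rounded, the result is only approximate.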


Table 9: Wall Clock Time

Notes for table 9:

  High level option: -O program_time
  Low level options: -d pt -b exp,pe=[mmm]

Table 9: Program Wall Clock Time

  Process |Experiment=1
     Time |PE[mmm]

 7.034316 |Total
|----------------------
| 7.468644 |pe.0
| 7.030273 |pe.47
| 6.585806 |pe.29
|======================


Call Tree Profile (Top Down)

Notes for table 1:

  High level option: -O calltree
  Low level options: -d ti%@0.05,cum_ti%,ti,tr -b exp,ct,pe=HIDE \
                     -s show_ca='fu,so,li' -s source_limit='1'

  This table shows only lines with Time% > 0.05.

  Percentages at each level are relative
  (for absolute percentages, specify: -s percent=a).

Table 1: Calltree View with Callsite Line Numbers

 Time % |   Cum. |      Time |     Calls |Experiment=1
        | Time % |           |           |Calltree
        |        |           |           | PE='HIDE'

 100.0% | 100.0% | 90.217759 | 637231917 |Total
|-----------------------------------------------------
| 100.0% | 100.0% | 90.175202 | 637205576 |MAIN_
||----------------------------------------------------
||  99.7% |  99.7% | 89.922750 | 637194666 |runhyd_
|||---------------------------------------------------
|||  15.4% |  15.4% | 13.864217 | 106169040 |zysweep_
||||--------------------------------------------------
||||  87.3% |  87.3% | 12.097038 | 106168320 |sppm2_
|||||-------------------------------------------------
|||||  49.4% |  49.4% |  5.980766 |  11796480 |sppm2_(exclusive)
|||||  24.1% |  73.6% |  2.920440 |  11796480 |difuze_
|||||  19.0% |  92.6% |  2.296747 |  58982400 |interf_
|||||   7.4% | 100.0% |  0.899084 |  23592960 |dintrf_
|||||=================================================
||||  12.7% | 100.0% |  1.767180 |       720 |zysweep_(exclusive)
||||==================================================
|||  15.4% |  30.8% | 13.854807 | 106169040 |xysweep_
||||--------------------------------------------------
||||  87.0% |  87.0% | 12.049373 | 106168320 |sppm2_
|||||-------------------------------------------------
|||||  49.5% |  49.5% |  5.970403 |  11796480 |sppm2_(exclusive)
|||||  24.0% |  73.6% |  2.894189 |  11796480 |difuze_
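The note that percentages at each level are relative means each row's Time % is measured against its parent row, not against Total. The sketch below shows both readings using one helper; the name `pct_of` is hypothetical, and treating "absolute" as "relative to the Total row" is an interpretation of the `-s percent=a` note.

```c
/* Percentage that part represents of whole; returns 0 for an empty whole.
 * Pass the parent row's time for a relative percentage, or the Total
 * row's time for an absolute one. */
double pct_of(double part, double whole)
{
    return whole > 0.0 ? 100.0 * part / whole : 0.0;
}
```

For sppm2_ under zysweep_, `pct_of(12.097038, 13.864217)` is roughly 87.3, the relative value printed in the table, while `pct_of(12.097038, 90.217759)` is roughly 13.4, its share of the whole run.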


Callers Profile (Bottom Up)

Notes for table 1:

  High level option: -O callers
  Low level options: -d ti%@0.05,cum_ti%,ti,tr -b exp,gr,fu,ca,pe=HIDE

  This table shows only lines with Time% > 0.05.

Table 1: Profile by Function and Callers

 Time % |   Cum. |      Time |     Calls |Experiment=1
        | Time % |           |           |Group
        |        |           |           | Function
        |        |           |           |  Caller
        |        |           |           |   PE='HIDE'

 100.0% | 100.0% | 90.217759 | 637231917 |Total
|-----------------------------------------------------
|  92.3% |  92.3% | 83.265853 | 637033288 |USER
||----------------------------------------------------
||  43.1% |  43.1% | 35.864107 |  70778880 |sppm2_
|||---------------------------------------------------
|||  16.7% |  16.7% |  5.986173 |  11796480 |yxsweep_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||  16.7% |  33.4% |  5.980851 |  11796480 |yzsweep_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||  16.7% |  50.0% |  5.980766 |  11796480 |zysweep_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||  16.7% |  66.7% |  5.973496 |  11796480 |zzsweep_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||  16.7% |  83.4% |  5.972417 |  11796480 |xxsweep_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||  16.6% | 100.0% |  5.970403 |  11796480 |xysweep_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||===================================================
||  21.0% |  64.0% | 17.447719 |  70778880 |difuze_
||        |        |           |           | sppm2_


Callers Profile – MPI (Cont.)

||====================================================
|   7.7% |  99.9% |  6.906194 |    106344 |MPI
||----------------------------------------------------
||  70.2% |  70.2% |  4.851312 |     51840 |mpi_wait_
|||---------------------------------------------------
|||  41.2% |  41.2% |  1.997854 |     17280 |zbdrys_
||||--------------------------------------------------
||||        |        |           |           |runhyd_
|||||-------------------------------------------------
|||||        |        |           |           |MAIN_
||||==================================================
|||  34.3% |  75.5% |  1.664276 |     17280 |ybdrys_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||  24.5% | 100.0% |  1.189183 |     17280 |xbdrys_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||===================================================
||  29.7% |  99.9% |  2.048254 |      2232 |mpi_allreduce_
|||---------------------------------------------------
|||  96.6% |  96.6% |  1.978537 |       720 |glblmax_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||   3.4% | 100.0% |  0.069717 |      1512 |glbldsum_
||||--------------------------------------------------
||||  98.5% |  98.5% |  0.068700 |       792 |trace_
||||        |        |           |           | MAIN_
||||   1.5% | 100.0% |  0.001017 |       720 |runhyd_
||||        |        |           |           | MAIN_
|||===================================================
||   0.1% | 100.0% |  0.004263 |     25920 |mpi_isend_
|||---------------------------------------------------
|||  33.7% |  33.7% |  0.001436 |      8640 |xbdrys_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||  33.2% |  66.9% |  0.001416 |      8640 |zbdrys_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||  33.1% | 100.0% |  0.001410 |      8640 |ybdrys_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|=====================================================


Callers Profile with Line Numbers

% pat_report -O ca+src <performance file>

  Time% | Cum.Time% |        Time |     Calls |Function
        |           |             |           | Caller

 100.0% |    100.0% | 2647.397216 | 283222164 |Total
|--------------------------------------------------------
|  40.5% |     40.5% | 1071.748077 |  31457280 |sppm2_
||-------------------------------------------------------
||   6.8% |      6.8% |  179.125866 |   5242880 |yzsweep_:/scratch/derose/sppm2/sweeps.F:line.518
||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1056
||        |           |             |           |  MAIN_:/scratch/derose/sppm2/main.F:line.226
||        |           |             |           |   main:NA:line.0
||   6.8% |     13.5% |  178.901135 |   5242880 |zzsweep_:/scratch/derose/sppm2/sweeps.F:line.812
||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1064
||        |           |             |           |  MAIN_:/scratch/derose/sppm2/main.F:line.226
||        |           |             |           |   main:NA:line.0
||   6.8% |     20.3% |  178.709675 |   5242880 |yxsweep_:/scratch/derose/sppm2/sweeps.F:line.1400
||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1080
||        |           |             |           |  MAIN_:/scratch/derose/sppm2/main.F:line.226
||        |           |             |           |   main:NA:line.0
||   6.7% |     27.0% |  178.614169 |   5242880 |zysweep_:/scratch/derose/sppm2/sweeps.F:line.1106
||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1072
||        |           |             |           |  MAIN_:/scratch/derose/sppm2/main.F:line.226
||        |           |             |           |   main:NA:line.0
||   6.7% |     33.8% |  178.273669 |   5242880 |xxsweep_:/scratch/derose/sppm2/sweeps.F:line.1694
||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1088
||        |           |             |           |  MAIN_:/scratch/derose/sppm2/main.F:line.226
||        |           |             |           |   main:NA:line.0
||   6.7% |     40.5% |  178.123564 |   5242880 |xysweep_:/scratch/derose/sppm2/sweeps.F:line.219
||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1048
||        |           |             |           |  MAIN_:/scratch/derose/sppm2/main.F:line.226
||        |           |             |           |   main:NA:line.0
||=======================================================
|  20.2% |     60.7% |  534.482687 |  31457280 |difuze_
|        |           |             |           | sppm2_:/scratch/derose/sppm2/sppm.F:line.630
|||------------------------------------------------------
|||   3.4% |     43.9% |   89.361514 |   5242880 |zzsweep_:/scratch/derose/sppm2/sweeps.F:line.812
|||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1064
|||        |           |             |           |  MAIN_:/scratch/derose/sppm2/main.F:line.226
|||        |           |             |           |   main:NA:line.0
|||   3.4% |     47.2% |   89.178333 |   5242880 |zysweep_:/scratch/derose/sppm2/sweeps.F:line.1106


Load Balancing Function per PE

Notes for table 1:

  High level option: -O load_balance_program
  Low level options: -d ti%@0.05,cum_ti%,ti,tr -b exp,pe

  This table shows only lines with Time% > 0.05.

  Percentages at each level are relative
  (for absolute percentages, specify: -s percent=a).

Table 1: Load Balance across PE's

 Time % |   Cum. |     Time |  Calls |Experiment=1
        | Time % |          |        |PE

 100.0% | 100.0% | 3.798177 | 579653 |Total
|-------------------------------------------------
|   2.1% |   2.1% | 3.823080 |   7160 |pe.0
|   2.1% |   4.2% | 3.799148 |  13753 |pe.8
| ...
|   2.1% |  97.9% | 3.796151 |   7683 |pe.5
|   2.1% | 100.0% | 3.796144 |  10431 |pe.29
|=================================================
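In the per-PE rows, Time % appears to be each PE's share of the time summed over all PEs, with the Total row holding the per-PE average. Under that assumption, which fits the numbers shown but is not stated on the slide, the share can be sketched as follows. The function name `pe_share_pct` is hypothetical.

```c
/* Share of the summed time of all PEs that one PE accounts for, in percent.
 * avg_time is taken to be the Total row, assumed to be the average over
 * npes PEs. */
double pe_share_pct(double pe_time, double avg_time, int npes)
{
    double sum = avg_time * npes;
    return sum > 0.0 ? 100.0 * pe_time / sum : 0.0;
}
```

With 48 PEs a perfectly balanced run would give every PE 100/48, about 2.08%. `pe_share_pct(3.823080, 3.798177, 48)` gives about 2.1% for pe.0, so this program is close to balanced at the whole-program level.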


Table 2: LB Across PE's by Group

Notes for table 2:

  High level option: -O load_balance_group
  Low level options: -d ti%@0.05,cum_ti%,ti,tr -b exp,gr,pe

  . . .

Table 2: Load Balance across PE's by FunctionGroup

 Time % |   Cum. |     Time |  Calls |Experiment=1
        | Time % |          |        |Group
        |        |          |        | PE

 100.0% | 100.0% | 3.798177 | 579653 |Total
|-------------------------------------------------
|  70.9% |  70.9% | 2.692783 | 245380 |USER
||------------------------------------------------
||   2.2% |   2.2% | 2.833001 |   3076 |pe.0
|| ...
||   2.0% | 100.0% | 2.597093 |   4512 |pe.43
|=================================================
|  28.8% |  99.7% | 1.092307 | 238224 |MPI
||------------------------------------------------
||   2.3% |   2.3% | 1.188383 |   4363 |pe.43
|| ...
||   1.6% | 100.0% | 0.859333 |   2923 |pe.0
||================================================
|   0.2% |  99.8% | 0.007329 |  95597 |HEAP
||------------------------------------------------
||   2.7% |   2.7% | 0.009363 |   2482 |pe.12
|| ...
||   0.6% | 100.0% | 0.002062 |    803 |pe.0
||================================================
|   0.2% | 100.0% | 0.005758 |    452 |IO
||------------------------------------------------
||  46.6% |  46.6% | 0.128685 |    358 |pe.0
|| ...
||   0.6% | 100.0% | 0.001644 |      2 |pe.29
|=================================================


Table 3: LB Across PE's by Function

Notes for table 3:

  High level option: -O load_balance_function
  Low level options: -d ti%@0.05,cum_ti%,ti,tr -b exp,gr,fu,pe

  This table shows only lines with Time% > 0.05.

  Percentages at each level are relative
  (for absolute percentages, specify: -s percent=a).

Table 3: Load Balance across PE's by Function

 Time % |   Cum. |     Time |  Calls |Experiment=1
        | Time % |          |        |Group
        |        |          |        | Function
        |        |          |        |  PE

 100.0% | 100.0% | 3.798177 | 579653 |Total
|-------------------------------------------------
|  70.9% |  70.9% | 2.692783 | 245380 |USER
||------------------------------------------------
||  97.1% |  97.1% | 2.615916 |    576 |sweep_
|||-----------------------------------------------
|||   2.2% |   2.2% | 2.753279 |     12 |pe.0
|||   2.1% |   4.3% | 2.654725 |     12 |pe.5
||| . . .
|||   2.0% |  98.0% | 2.525587 |     12 |pe.43
|||   2.0% | 100.0% | 2.523325 |     12 |pe.37
|||===============================================
. . .
|||===============================================
||   0.4% |  99.2% | 0.010300 | 118080 |snd_real_
|||-----------------------------------------------
|||   2.4% |   2.4% | 0.011699 |   2880 |pe.26
|||   2.3% |   4.7% | 0.011475 |   2880 |pe.27
||| . . .
|||   1.5% |  98.6% | 0.007266 |   1440 |pe.0
|||   1.4% | 100.0% | 0.006907 |   1440 |pe.5
|||===============================================


Table 3 (Cont.)

||================================================
|  28.8% |  99.7% | 1.092307 | 238224 |MPI
||------------------------------------------------
||  76.1% |  76.1% | 0.831311 | 118080 |mpi_recv_
|||-----------------------------------------------
|||   2.7% |   2.7% | 1.066077 |   1440 |pe.47
|||   2.6% |   5.3% | 1.034307 |   2160 |pe.41
||| . . .
|||   1.8% |  98.6% | 0.700970 |   2160 |pe.1
|||   1.4% | 100.0% | 0.573420 |   1440 |pe.0
|||===============================================
. . .
||================================================
|   0.2% |  99.8% | 0.007329 |  95597 |HEAP
||------------------------------------------------
||  61.1% |  61.1% | 0.004481 |  47861 |malloc
|||-----------------------------------------------
|||   2.7% |   2.7% | 0.005884 |   1242 |pe.12
|||   2.6% |   5.4% | 0.005658 |   1226 |pe.19
||| . . .
|||   1.3% |  99.5% | 0.002827 |    618 |pe.34
|||   0.5% | 100.0% | 0.001164 |    417 |pe.0
|||===============================================
||  38.9% | 100.0% | 0.002848 |  47735 |free
|||-----------------------------------------------
|||   2.7% |   2.7% | 0.003748 |   1422 |pe.37
|||   2.7% |   5.5% | 0.003706 |   1469 |pe.43
||| . . .
|||   1.4% |  99.3% | 0.001867 |    616 |pe.34
|||   0.7% | 100.0% | 0.000896 |    385 |pe.0
||================================================
|   0.2% | 100.0% | 0.005758 |    452 |IO
||------------------------------------------------
||  81.3% |  81.3% | 0.004679 |    309 |fwrite
|||-----------------------------------------------
|||  34.3% |  34.3% | 0.077141 |    262 |pe.0
|||   2.1% |  36.4% | 0.004615 |      1 |pe.8
||| . . .

Load Balance: Max, Median, Min

Notes for table 1:

High level option: -O load_balance_program
Low level options: -d ti%@0.05,cum_ti%,ti,tr -b exp,pe=[mmm]

This table shows only lines with Time% > 0.05.

Percentages at each level are relative
(for absolute percentages, specify: -s percent=a).

Table 1: Load Balance across PE's

Time % | Cum. | Time | Calls |Experiment=1
| Time % | | |PE[mmm]

100.0% | 100.0% | 3.798177 | 579653 |Total
|-------------------------------------------------
| 2.1% | 2.1% | 3.823080 | 7160 |pe.0
| 2.1% | 52.1% | 3.797671 | 10695 |pe.3
| 2.1% | 100.0% | 3.796144 | 10431 |pe.29
|=================================================
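
The three rows of Table 1 (maximum, median, and minimum PE) are enough for a quick estimate of how far apart the PEs are. The following is a minimal sketch using the times from the table; the measure `(max - min) / max` is an illustrative choice made here, not necessarily the definition pat_report uses for its own imbalance columns.

```python
# Times in seconds for the max, median, and min PEs, copied from Table 1.
pe_times = {"pe.0": 3.823080, "pe.3": 3.797671, "pe.29": 3.796144}


def spread(times):
    """Return (max - min) / max, an illustrative imbalance measure."""
    hi = max(times.values())
    lo = min(times.values())
    return (hi - lo) / hi


imbalance = spread(pe_times)
print(f"max-min spread across PEs: {imbalance:.2%}")
```

A spread well under one percent, as here, suggests the program as a whole is balanced; the per-group and per-function tables are where imbalance inside MPI or I/O shows up.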

LB [MMM] Table 2
Notes for table 2:

High level option: -O load_balance_group
Low level options: -d ti%@0.05,cum_ti%,ti,tr \
  -b exp,gr,pe=[mmm]

This table shows only lines with Time% > 0.05.

Percentages at each level are relative
(for absolute percentages, specify: -s percent=a).

Table 2: Load Balance across PE's by FunctionGroup

Time % | Cum. | Time | Calls |Experiment=1
| Time % | | |Group
| | | | PE[mmm]

100.0% | 100.0% | 3.798177 | 579653 |Total
|-------------------------------------------------
| 70.9% | 70.9% | 2.692783 | 245380 |USER
||------------------------------------------------
|| 2.2% | 2.2% | 2.833001 | 3076 |pe.0
|| 2.1% | 52.8% | 2.717019 | 4512 |pe.12
|| 2.0% | 100.0% | 2.597093 | 4512 |pe.43
||================================================
| 28.8% | 99.7% | 1.092307 | 238224 |MPI
||------------------------------------------------
|| 2.3% | 2.3% | 1.188383 | 4363 |pe.43
|| 2.0% | 53.7% | 1.069314 | 5803 |pe.7
|| 1.6% | 100.0% | 0.859333 | 2923 |pe.0
||================================================
| 0.2% | 99.8% | 0.007329 | 95597 |HEAP
||------------------------------------------------
|| 2.7% | 2.7% | 0.009363 | 2482 |pe.12
|| 2.2% | 59.6% | 0.007614 | 2192 |pe.40
|| 0.6% | 100.0% | 0.002062 | 803 |pe.0
||================================================
| 0.2% | 100.0% | 0.005758 | 452 |IO
||------------------------------------------------
|| 46.6% | 46.6% | 0.128685 | 358 |pe.0
|| 1.1% | 80.3% | 0.003144 | 2 |pe.47
|| 0.6% | 100.0% | 0.001644 | 2 |pe.29
|=================================================

LB [MMM] Table 3
Notes for table 3:

High level option: -O load_balance_function
Low level options: -d ti%@0.05,cum_ti%,ti,tr \
  -b exp,gr,fu,pe=[mmm]

This table shows only lines with Time% > 0.05.

Percentages at each level are relative
(for absolute percentages, specify: -s percent=a).

Table 3: Load Balance across PE's by Function

Time % | Cum. | Time | Calls |Experiment=1
| Time % | | |Group
| | | | Function
| | | | PE[mmm]

100.0% | 100.0% | 3.798177 | 579653 |Total
|-------------------------------------------------
| 70.9% | 70.9% | 2.692783 | 245380 |USER
|| 97.1% | 97.1% | 2.615916 | 576 |sweep_
|||-----------------------------------------------
||| 2.2% | 2.2% | 2.753279 | 12 |pe.0
||| 2.1% | 52.8% | 2.638898 | 12 |pe.16
||| 2.0% | 100.0% | 2.523325 | 12 |pe.37
|||===============================================
. . .
||================================================
| 28.8% | 99.7% | 1.092307 | 238224 |MPI
||------------------------------------------------
|| 76.1% | 76.1% | 0.831311 | 118080 |mpi_recv_
|||-----------------------------------------------
||| 2.7% | 2.7% | 1.066077 | 1440 |pe.47
||| 2.0% | 56.9% | 0.801256 | 2880 |pe.21
||| 1.4% | 100.0% | 0.573420 | 1440 |pe.0
|||===============================================
. . .
| 0.2% | 99.8% | 0.007329 | 95597 |HEAP
||------------------------------------------------
. . .
||================================================
| 0.2% | 100.0% | 0.005758 | 452 |IO
. . .

pat_report -O Keywords
• profile
• callers or ca; ca+src
• calltree or ct; ct+src
• load_balance or lb; load_balance_all or lb_all
  • load_balance_program
  • load_balance_group
  • load_balance_function
• mpi
• heap
  • heap_program
  • heap_hiwater
  • heap_leaks

Multiple values can be specified in a comma-separated list.
By default, all reports show only the PEs having the maximum, median, and minimum values.

AMD Opteron Processor

[Figure: AMD Opteron block diagram. Labels in the diagram: L1 instruction cache (64 KB); L1 data cache (64 KB, 2-way assoc); L2 cache (1 MB, 16-way assoc); 44-entry load/store queue; crossbar; memory controller; HyperTransport; system request queue; bus unit; fetch; branch prediction; scan/align; fastpath; microcode engine; instruction control unit (72 entries); int decode & rename; FP decode & rename; µOPs; Res (x3); AGU and ALU (x3); MULT; 36-entry FP scheduler; FADD, FMUL, FMISC.]

• 9-way out-of-order execution
• 16 instruction bytes fetched per cycle
• 36-entry FPU instruction scheduler
• 64-bit/80-bit FP realized throughput (1 Mul + 1 Add)/cycle: 1.9 FLOPs/cycle
• 32-bit FP realized throughput (2 Mul + 2 Add)/cycle: 3.4+ FLOPs/cycle

Simplified memory hierarchy on the AMD Opteron

• Registers: 16 SSE2 128-bit registers, 16 64-bit registers
  • Registers to L1 data cache: 2 x 8 bytes per clock, i.e., either 2 loads, 1 load and 1 store, or 2 stores (38 GB/s at 2.4 GHz)
• L1 data cache: 64-byte cache line
  • Complete data cache lines are loaded from main memory if not in the L2 cache
  • If the L1 data cache needs to be refilled, evicted lines are stored back to the L2 cache
  • L1 data cache to L2 cache: 8 bytes per clock
• L2 cache: 64-byte cache line
  • Write-back cache: data evicted from the L1 data cache are stored here first, until they are flushed out to main memory
  • L2 cache to main memory: 16-byte-wide data bus => 6.4 GB/s for DDR400
• Main memory
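
The bandwidth figures on this slide follow from bytes per transfer times transfer rate. A minimal sketch of that arithmetic is below; the 2.4 GHz clock and the DDR400 rate of 400 million transfers per second are taken from the slide, while the L1-to-L2 figure is a derived value that assumes the "8 bytes per clock" path runs at the core clock.

```python
CLOCK_HZ = 2.4e9          # core clock from the slide (2.4 GHz)
DDR400_TRANSFERS = 400e6  # DDR400: 400 million transfers per second

# Registers <-> L1 data cache: two 8-byte accesses per clock.
reg_l1_bw = 2 * 8 * CLOCK_HZ
# L1 <-> L2: 8 bytes per clock (assumed here to be the core clock).
l1_l2_bw = 8 * CLOCK_HZ
# L2 <-> main memory: 16-byte-wide data bus at the DDR400 transfer rate.
mem_bw = 16 * DDR400_TRANSFERS

for name, bw in [("registers <-> L1", reg_l1_bw),
                 ("L1 <-> L2", l1_l2_bw),
                 ("L2 <-> memory", mem_bw)]:
    print(f"{name}: {bw / 1e9:.1f} GB/s")
```

The roughly sixfold gap between the register-to-L1 rate and the memory rate is why the cache hit ratios reported by the hardware counter sets matter so much for single-processor tuning.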

Hardware Performance Counters
• AMD Opteron Hardware Performance Counters
  • Four 48-bit performance counters
  • Each counter can monitor a single event
    • Count specific processor events
      » the processor increments the counter when it detects an occurrence of the event
      » (e.g., cache misses)
    • Duration of events
      » the processor counts the number of processor clocks it takes to complete an event
      » (e.g., the number of clocks it takes to return data from memory after a cache miss)
• Time Stamp Counters (TSC)
  • Cycles (user time)
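
To get a feel for what a 48-bit counter width means, the sketch below estimates how long a counter could run before wrapping if it were incremented once per clock. The 2.4 GHz clock is an assumption taken from the 2400 MHz value reported in the PAPI listings in this deck; events that fire less than once per cycle would take proportionally longer.

```python
COUNTER_BITS = 48
CLOCK_HZ = 2.4e9  # assumed: one increment per cycle at 2400 MHz

max_count = 2 ** COUNTER_BITS           # number of distinct counter values
seconds_to_wrap = max_count / CLOCK_HZ  # worst case: event fires every cycle

print(f"counter range: {max_count}")
print(f"time to wrap at one event per cycle: {seconds_to_wrap / 3600:.1f} hours")
```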

XT3 Hardware Counters Interface
• The Performance API (PAPI) is provided by default on the XT3 software stack
• The XT3 Unicos/lc hardware counters interface was developed by Sandia National Laboratories
• Based on the perfctr kernel patch developed by Mikael Pettersson from Uppsala University
  • No hardware counters access in any major Linux distribution
    • Kernel patch needed for user-level access to hardware counters
  • Provides system-level access to the x86 and x86-64 performance counters
  • Provides per-process 64-bit memory-mapped virtual counters
  • Provides per-process virtual Time Stamp Counter (TSC)

PAPI Predefined Events
• Common set of events deemed relevant and useful for application performance tuning
  • papiStdEventDefs.h
  • Accesses to the memory hierarchy, cache coherence protocol events, cycle and instruction counts, functional unit and pipeline status
• PAPI “avail” utility shows which predefined events are available on the system
• PAPI also provides access to native events
  • PAPI “native_avail” utility lists all AMD native events available on the system

PAPI Preset Listing
(derose@jaguar1) 184% yod -sz=1 /opt/xt-tools/papi/3.0.8.1/bin/avail
LibLustre: NAL NID: 0005dc02 (2)
Lustre: OBD class driver Build Version: 1, [email protected]
Test case avail.c: Available events and hardware information.
-------------------------------------------------------------------------
Vendor string and code : AuthenticAMD (2)
Model string and code : AMD K8 (13)
CPU Revision : 1.000000
CPU Megahertz : 2400.000000
CPU's in this Node : 1
Nodes in this System : 1
Total CPU's : 1
Number Hardware Counters : 4
Max Multiplex Counters : 32
-------------------------------------------------------------------------
Name Code Avail Deriv Description (Note)
PAPI_L1_DCM 0x80000000 Yes Yes Level 1 data cache misses ()
PAPI_L1_ICM 0x80000001 Yes Yes Level 1 instruction cache misses ()
PAPI_L2_DCM 0x80000002 Yes No Level 2 data cache misses ()
PAPI_L2_ICM 0x80000003 Yes No Level 2 instruction cache misses ()
PAPI_L3_DCM 0x80000004 No No Level 3 data cache misses ()
PAPI_L3_ICM 0x80000005 No No Level 3 instruction cache misses ()
PAPI_L1_TCM 0x80000006 Yes Yes Level 1 cache misses ()
PAPI_L2_TCM 0x80000007 Yes Yes Level 2 cache misses ()
PAPI_L3_TCM 0x80000008 No No Level 3 cache misses ()
. . .

PAPI avail utility
% avail -h
This is the PAPI avail program.
It provides availability and detail information
for PAPI preset and native events. Usage:

avail [options] [event name]
avail TESTS_QUIET

Options:

-a display only available PAPI preset events
-d display PAPI preset event info in detailed format
-e EVENTNAME display full detail for named preset or native event
-h print this help message
-t display PAPI preset event info in tabular format (default)

Example: avail -e PAPI_L1_TCM
Event name: PAPI_L1_TCM
Event Code: 0x80000006
Number of Native Events: 4
Short Description: |L1 cache misses|
Long Description: |Level 1 cache misses|
Developer's Notes: ||
Derived Type: |DERIVED_ADD|
Postfix Processing String: ||

|Native Code[0]: 0x40001e1c DC_SYS_REFILL_MOES|
|Number of Register Values: 2|
|Register[0]: 0x20f P3 Ctr Mask|
|Register[1]: 0x1e43 P3 Ctr Code|
|Native Event Description: |Refill from system. Cache bits: Modified Owner Exclusive Shared|

|Native Code[1]: 0x40000037 IC_SYS_REFILL|
|Number of Register Values: 2|
|Register[0]: 0xf P3 Ctr Mask|
|Register[1]: 0x83 P3 Ctr Code|
|Native Event Description: |Refill from system|

|Native Code[2]: 0x40000036 IC_L2_REFILL|
|Number of Register Values: 2|
|Register[0]: 0xf P3 Ctr Mask|
|Register[1]: 0x82 P3 Ctr Code|
|Native Event Description: |Refill from L2|

|Native Code[3]: 0x40001e1b DC_L2_REFILL_MOES|
|Number of Register Values: 2|
|Register[0]: 0x20f P3 Ctr Mask|
|Register[1]: 0x1e42 P3 Ctr Code|
|Native Event Description: |Refill from L2. Cache bits: Modified Owner Exclusive Shared|
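
The listing reports PAPI_L1_TCM as a DERIVED_ADD event built from four native events, which means the preset value is the sum of the four native counts. The sketch below illustrates that relationship; the counts are hypothetical numbers chosen for the example, not measurements.

```python
# Hypothetical raw counts for the four native events behind PAPI_L1_TCM.
native_counts = {
    "DC_SYS_REFILL_MOES": 1_000,  # data cache refills from system
    "IC_SYS_REFILL": 10,          # instruction cache refills from system
    "IC_L2_REFILL": 40,           # instruction cache refills from L2
    "DC_L2_REFILL_MOES": 4_500,   # data cache refills from L2
}

# DERIVED_ADD: the preset event is the sum of its native events.
papi_l1_tcm = sum(native_counts.values())
print(f"PAPI_L1_TCM = {papi_l1_tcm}")
```

Because this one preset occupies all four hardware counters, it cannot be combined with other events in the same run without multiplexing.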

PAPI native_avail Utility
(derose@jaguar1) 187% yod -sz=1 /opt/xt-tools/papi/3.0.8.1/bin/native_avail |more
LibLustre: NAL NID: 0005dc05 (5)
Lustre: OBD class driver Build Version: 1, [email protected]
Test case NATIVE_AVAIL: Available native events and hardware information.
-------------------------------------------------------------------------
Vendor string and code : AuthenticAMD (2)
Model string and code : AMD K8 (13)
CPU Revision : 1.000000
CPU Megahertz : 2400.000000
CPU's in this Node : 1
Nodes in this System : 1
Total CPU's : 1
Number Hardware Counters : 4
Max Multiplex Counters : 32
-------------------------------------------------------------------------
The following correspond to fields in the PAPI_event_info_t structure.
Symbol Event Code Count
|Short Description|
|Long Description|
|Derived|
|PostFix|

The count field indicates whether it is a) available (count >= 1) and b) derived (count > 1)

FP_ADD_PIPE 0x40000000
|Dispatched FPU ops - Revision B and later revisions - Speculative add pipe ops excluding junk ops|
|Register Value[0]: 0xf P3 Ctr Mask|
|Register Value[1]: 0x100 P3 Ctr Code|

FP_MULT_PIPE 0x40000001
|Dispatched FPU ops - Revision B and later revisions - Speculative multiply pipe ops excluding junk ops|
|Register Value[0]: 0xf P3 Ctr Mask|
|Register Value[1]: 0x200 P3 Ctr Code|

. . .
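
In the listings above, every preset event code has the 0x80000000 bit set and every native event code has the 0x40000000 bit set. The sketch below uses that pattern to tell the two kinds apart. The mask names and the `classify` helper are hypothetical names introduced for this illustration; they are not part of the avail or native_avail utilities.

```python
PRESET_MASK = 0x80000000  # high bit seen on all preset codes in the listing
NATIVE_MASK = 0x40000000  # bit seen on all native codes in the listing


def classify(code):
    """Return (kind, index) for an event code, based on its high bits."""
    if code & PRESET_MASK:
        return ("preset", code & (PRESET_MASK - 1))
    if code & NATIVE_MASK:
        return ("native", code & (NATIVE_MASK - 1))
    return ("unknown", code)


# Codes copied from the avail and native_avail output.
for name, code in [("PAPI_L1_TCM", 0x80000006),
                   ("FP_ADD_PIPE", 0x40000000),
                   ("DC_SYS_REFILL_MOES", 0x40001E1C)]:
    kind, index = classify(code)
    print(f"{name}: {kind}, index {index:#x}")
```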

Hardware Counters Selection
• PAT_RT_HWPC <set number> | <event list>
  • Specifies hardware counter events to be monitored
  • A set number can be used to select a group of predefined hardware counter events (recommended)
    • CrayPat provides 9 sets on the Cray XT3
  • Alternatively, a list of hardware performance counter event names can be used
    • Maximum of 4 events
  • Both formats can be specified at the same time, with later definitions overriding previous definitions
  • By default, no hardware performance counter events are monitored during tracing experiments
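
The two limits stated above (9 predefined sets, at most 4 named events) can be checked before a job is submitted. The sketch below is a hypothetical pre-flight helper, not part of CrayPat. It assumes that set numbers run from 1 to 9 and that an event list is comma-separated; neither detail is confirmed by the slide, and the mixed form (set number plus event names) is not handled.

```python
MAX_EVENTS = 4  # "Maximum of 4 events"
NUM_SETS = 9    # "CrayPat provides 9 sets on the Cray XT3"


def check_hwpc(value):
    """Validate a candidate PAT_RT_HWPC value.

    Returns ("set", n) or ("events", [names]); raises ValueError otherwise.
    The comma separator and the 1..9 numbering are assumptions.
    """
    value = value.strip()
    if value.isdigit():
        n = int(value)
        if not 1 <= n <= NUM_SETS:
            raise ValueError(f"set number {n} outside 1..{NUM_SETS}")
        return ("set", n)
    names = [name.strip() for name in value.split(",") if name.strip()]
    if not names:
        raise ValueError("no events given")
    if len(names) > MAX_EVENTS:
        raise ValueError(f"{len(names)} events given, maximum is {MAX_EVENTS}")
    return ("events", names)


print(check_hwpc("2"))
print(check_hwpc("PAPI_FP_OPS, PAPI_L1_DCA"))
```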

Accuracy Issues

• Granularity of the measured code
  • If not sufficiently large, overhead of the counter interfaces may dominate
• Pay attention to what is not measured:
  • Out-of-order processors
  • Speculation
  • Lack of standard on what is counted
• Microbenchmarks can help determine accuracy of the hardware counters
• For more information on AMD counters:
  • architecture manuals:
    • http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26049.PDF

[Figure: layers of the counter interface, top to bottom: user interface, kernel, hardware counters]

Hardware Performance Counters (PAT_RT_HWPC=1)
PAPI_TLB_DM Data translation lookaside buffer misses
PAPI_L1_DCA Level 1 data cache accesses
PAPI_FP_OPS Floating point operations
DC_MISS Data Cache Miss
User_Cycles Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
Time% 96.9%
Time 2.619020
Imb.Time 0.141776
Imb.Time% 5.2%
Calls 576
PAPI_TLB_DM 9816474 misses
PAPI_L1_DCA 125098572621 ops
PAPI_FP_OPS 86987557635 ops
DC_MISS 4515210161 ops
User time 6292017317.66667 cycles
Utilization rate 100.0%
HW FP Ops / Cycles 13.83 ops/cycle
HW FP Ops / User time 86987557635 ops 14.4%peak
HW FP Ops / WCT
Computation intensity 0.70 ops/ref
LD & ST per TLB miss 12743.74 ops/miss
LD & ST per D1 miss 27.71 ops/miss
D1 cache hit ratio 96.4%
% TLB misses / cycle 0.0%

Slide annotations mark three parts of the report: flat profile data, hard counts, and derived metrics.
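
The derived metrics in this report can be reproduced from the hard counts. The sketch below does so with formulas inferred by matching the reported values; they are not taken from CrayPat documentation. Two further assumptions are labeled in the code: the run used 48 PEs (the load balance tables show pe.0 through pe.47), and peak is 2 floating point operations per cycle (1 Mul + 1 Add, from the Opteron slide). The ops/cycle figure exceeds 2 because the counts appear to be summed across PEs while the cycle count looks like a per-PE average.

```python
# Hard counts copied from the PAT_RT_HWPC=1 report for USER / sweep_.
counts = {
    "PAPI_TLB_DM": 9816474,
    "PAPI_L1_DCA": 125098572621,
    "PAPI_FP_OPS": 86987557635,
    "DC_MISS": 4515210161,
}
user_cycles = 6292017317.66667

NUM_PES = 48               # assumption: PEs are numbered pe.0 .. pe.47
PEAK_FLOPS_PER_CYCLE = 2   # assumption: 1 Mul + 1 Add per cycle

fp_ops_per_cycle = counts["PAPI_FP_OPS"] / user_cycles
derived = {
    "HW FP Ops / Cycles": fp_ops_per_cycle,
    "Fraction of peak": fp_ops_per_cycle / (NUM_PES * PEAK_FLOPS_PER_CYCLE),
    "Computation intensity": counts["PAPI_FP_OPS"] / counts["PAPI_L1_DCA"],
    "LD & ST per TLB miss": counts["PAPI_L1_DCA"] / counts["PAPI_TLB_DM"],
    "LD & ST per D1 miss": counts["PAPI_L1_DCA"] / counts["DC_MISS"],
    "D1 cache hit ratio": 1 - counts["DC_MISS"] / counts["PAPI_L1_DCA"],
}

for name, value in derived.items():
    print(f"{name:24s} {value:.4f}")
```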

PAT_RT_HWPC=2 (Cache Info)
PAPI_L1_DCA Level 1 data cache accesses
DC_L2_REFILL_MOESI Refill from L2. Cache bits: Modified Owner Exclusive Shared Invalid
DC_SYS_REFILL_MOESI Refill from system. Cache bits: Modified Owner Exclusive Shared Invalid
BU_L2_REQ_DC Internal L2 request - DC fill
User_Cycles Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
Time% 96.9%
Time 2.619454
Imb.Time 0.142876
Imb.Time% 5.3%
Calls 576
PAPI_L1_DCA 125116346746 ops
DC_L2_REFILL_MOESI 4519351614 ops
DC_SYS_REFILL_MOESI 1023533083 ops
BU_L2_REQ_DC 4729707701 req
User time 6292985840.77083 cycles
Utilization rate 100.0%
L1 Data cache misses 5542884697 misses
LD & ST per D1 miss 22.57 ops/miss
D1 cache hit ratio 95.6%
LD & ST per D2 miss 122.24 ops/miss
D2 cache hit ratio 78.4%
L2 cache hit ratio 81.5%
Memory to D1 refill 1023533083 lines
Memory to D1 bandwidth 65506117312 bytes
L2 to Dcache bandwidth 289238503296 bytes
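
The cache metrics in this report are simple combinations of the refill counts and the 64-byte cache line size. The sketch below reproduces them; as with the previous set, the formulas are inferred from the reported numbers rather than taken from CrayPat documentation.

```python
CACHE_LINE_BYTES = 64  # line size stated on the memory hierarchy slide

# Hard counts copied from the PAT_RT_HWPC=2 report for USER / sweep_.
l1_dca = 125116346746     # PAPI_L1_DCA
l2_refill = 4519351614    # DC_L2_REFILL_MOESI
sys_refill = 1023533083   # DC_SYS_REFILL_MOESI
l2_req_dc = 4729707701    # BU_L2_REQ_DC

# Every L1 data miss is refilled either from L2 or from the system.
l1_misses = l2_refill + sys_refill

# Each refill moves one cache line.
mem_to_d1_bytes = sys_refill * CACHE_LINE_BYTES
l2_to_d1_bytes = l2_refill * CACHE_LINE_BYTES

d1_hit_ratio = 1 - l1_misses / l1_dca
d2_hit_ratio = 1 - sys_refill / l2_req_dc
l2_hit_ratio = l2_refill / l1_misses

print(f"L1 data cache misses   {l1_misses}")
print(f"Memory to D1 bandwidth {mem_to_d1_bytes} bytes")
print(f"L2 to Dcache bandwidth {l2_to_d1_bytes} bytes")
print(f"D1 / D2 / L2 hit ratio {d1_hit_ratio:.1%} / {d2_hit_ratio:.1%} / {l2_hit_ratio:.1%}")
```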

PAT_RT_HWPC=3 (L1 & L2 BW)
PAPI_L1_DCM Level 1 data cache misses
PAPI_L1_DCA Level 1 data cache accesses
DC_L2_REFILL_MOES Refill from L2. Cache bits: Modified Owner Exclusive Shared
DC_COPYBACK_MOES Copyback. Cache bits: Modified Owner Exclusive Shared
User_Cycles Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
Time% 96.9%
Time 2.619887
Imb.Time 0.140539
Imb.Time% 5.2%
Calls 576
PAPI_L1_DCM 4517129259 misses
PAPI_L1_DCA 125119230870 ops
DC_L2_REFILL_MOES 3493981416 ops
DC_COPYBACK_MOES 5538788237 ops
User time 6294208767.6875 cycles
Utilization rate 100.0%
LD & ST per D1 miss 27.70 ops/miss
D1 cache hit ratio 96.4%
Memory to D1 refill 1023147843 lines
Memory to D1 bandwidth 65481461952 bytes
L2 to Dcache bandwidth 223614810624 bytes
Dcache to L2 bandwidth 354482447168 bytes

PAT_RT_HWPC=4 (FP Mix)
PAPI_FML_INS Floating point multiply instructions
PAPI_FAD_INS Floating point add instructions
PAPI_FP_OPS Floating point operations
FP_FAST_FLAG Dispatched FPU ops that use the fast flag interface
User_Cycles Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
Time% 96.9%
Time 2.619768
Imb.Time 0.142962
Imb.Time% 5.3%
Calls 576
PAPI_FML_INS 40176676793 instr
PAPI_FAD_INS 46825416366 instr
PAPI_FP_OPS 87002093159 ops
FP_FAST_FLAG 2449277001 ops
User time 6293787713.72917 cycles
Utilization rate 100.0%
HW FP Ops / Cycles 13.82 ops/cycle
HW FP Ops / User time 87002093159 ops 14.4%peak
HW FP Ops / WCT
FP Multiply / FP Ops 46.2%
FP Add / FP Ops 53.8%

PAT_RT_HWPC=5 (Vectorization)
FR_FPU_X87 Retired FPU instructions - x87 instructions
FR_FPU_MMX_3D Retired FPU instructions - Combined MMX and 3DNow! instructions
FR_FPU_SSE_SSE2_PACKED Retired FPU instructions - Combined packed SSE and SSE 2 instructions
FR_FPU_SSE_SSE2_SCALAR Retired FPU instructions - Combined scalar SSE and SSE 2 instructions
User_Cycles Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
Time% 96.9%
Time 2.619615
Imb.Time 0.139317
Imb.Time% 5.2%
Calls 576
FR_FPU_X87 0 instr
FR_FPU_MMX_3D 0 instr
FR_FPU_SSE_SSE2_PACKED 24032397312 instr
FR_FPU_SSE_SSE2_SCALAR 101185460710 instr
User time 6293357456.66667 cycles
Utilization rate 100.0%

When compiled without fastsse:

========================================================================
USER / sweep_
------------------------------------------------------------------------
Time% 97.5%
Time 3.128695
Imb.Time 0.166962
Imb.Time% 5.2%
Calls 576
FR_FPU_X87 0 instr
FR_FPU_MMX_3D 0 instr
FR_FPU_SSE_SSE2_PACKED 0 instr
FR_FPU_SSE_SSE2_SCALAR 138996966424 instr
User time 7515016198.60417 cycles
Utilization rate 100.0%
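
Comparing the two reports shows what vectorization bought for sweep_. The sketch below computes the share of retired SSE/SSE2 instructions that were packed, and the ratio of the two run times. The "packed fraction" is an illustrative measure chosen here; it is not a metric printed by pat_report.

```python
# Values copied from the two PAT_RT_HWPC=5 reports for USER / sweep_.
with_fastsse = {"time": 2.619615, "packed": 24032397312, "scalar": 101185460710}
without_fastsse = {"time": 3.128695, "packed": 0, "scalar": 138996966424}


def packed_fraction(run):
    """Share of retired SSE/SSE2 FPU instructions that were packed."""
    total = run["packed"] + run["scalar"]
    return run["packed"] / total


speedup = without_fastsse["time"] / with_fastsse["time"]

print(f"packed fraction with fastsse:    {packed_fraction(with_fastsse):.1%}")
print(f"packed fraction without fastsse: {packed_fraction(without_fastsse):.1%}")
print(f"time ratio (without / with):     {speedup:.2f}x")
```

With fastsse the total number of retired FPU instructions also drops, since each packed instruction does the work of more than one scalar instruction.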

PAT_RT_HWPC=6 (Stalls / Resources Idle)
PAPI_FPU_IDL Cycles floating point units are idle
PAPI_STL_ICY Cycles with no instruction issue
PAPI_RES_STL Cycles stalled on any resource
IC_FETCH_STALL Instruction fetch stall
User_Cycles Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
Time% 96.9%
Time 2.619334
Imb.Time 0.141884
Imb.Time% 5.2%
Calls 576
PAPI_FPU_IDL 500570926.75 cycles
PAPI_STL_ICY 70217803.2916667 cycles
PAPI_RES_STL 4140098264.75 cycles
IC_FETCH_STALL 4631824703.22917 cycles
User time 6292743345.83333 cycles
Utilization rate 100.0%
Total time stalled 4140098264.75 cycles
Time I Fetch Stalled 4631824703.22917 cycles
Avg Time FPUs idle 250285463.375 cycles
Time Decoder empty 70217803.2916667 cycles

PAT_RT_HWPC=7 (Stalls / Resources Full)
FR_DECODER_EMPTY Nothing to dispatch - decoder empty
FR_DISPATCH_STALLS Dispatch stalls - D2h or DAh combined
FR_DISPATCH_STALLS_FULL_FPU Dispatch stall when FPU is full
FR_DISPATCH_STALLS_FULL_LS Dispatch stall when LS is full
User_Cycles Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
Time% 97.0%
Time 2.618878
Imb.Time 0.142062
Imb.Time% 5.3%
Calls 576
FR_DECODER_EMPTY 3360773456 ops
FR_DISPATCH_STALLS 4139586865.875 cycles
FR_DISPATCH_STALLS_FULL_FPU 2683961106.04167 cycles
FR_DISPATCH_STALLS_FULL_LS 1050422214.02083 cycles
User time 6291691461.4375 cycles
Utilization rate 100.0%
Total time stalled 4139586865.875 cycles
Avg Time FPUs stalled 1341980553.02083 cycles
Avg Time LSs stalled 525211107.010417 cycles
Time Decoder empty 3360773456 cycles
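
Stall counts are easier to interpret as a share of user cycles. The sketch below expresses each count from this report as a fraction of the user time in cycles. This normalization is an illustrative choice made here, not a pat_report metric, and the categories can overlap, so the fractions are not expected to sum to one.

```python
# Values copied from the PAT_RT_HWPC=7 report for USER / sweep_.
user_cycles = 6291691461.4375
stall_cycles = {
    "FR_DISPATCH_STALLS": 4139586865.875,
    "FR_DISPATCH_STALLS_FULL_FPU": 2683961106.04167,
    "FR_DISPATCH_STALLS_FULL_LS": 1050422214.02083,
    "FR_DECODER_EMPTY": 3360773456,
}

# Fraction of user cycles attributed to each stall category.
fractions = {name: cycles / user_cycles for name, cycles in stall_cycles.items()}

for name, frac in fractions.items():
    print(f"{name:30s} {frac:.1%}")
```

In the report itself, "Avg Time FPUs stalled" and "Avg Time LSs stalled" equal the corresponding full-FPU and full-LS counts divided by two.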

PAT_RT_HWPC Other Sets
-------------------------------------------------------------------
Set 8: Branches
PAPI_BR_TKN Conditional branch instructions taken
PAPI_BR_MSP Conditional branch instructions mispredicted
PAPI_TOT_INS Instructions completed
IC_MISS IC Miss
User_Cycles Virtual Cycles
-------------------------------------------------------------------
Set 9: Instructions
PAPI_L2_ICM Level 2 instruction cache misses
PAPI_L1_ICA Level 1 instruction cache accesses
IC_MISS IC Miss
IC_L2_REFILL Refill from L2
User_Cycles Virtual Cycles

LIBHWPC
• Instrumentation API for Fortran, C, and C++
• Event profiler with runtime summarization
• For each instrumented section provides:
  • Total count & duration (user & wall clock time)
  • Hardware performance counters information
  • Derived metrics
• Supports:
  • Multiple instrumentation sections
  • Nested instrumentation
  • Multiple calls to an instrumented section
• Requires:
  • Hand instrumentation
  • Linking with libhwpc.a

LIBHWPC
• Declaration:
  • Fortran:
    • #include "hwpcf.h"
  • C & C++:
    • #include "hwpc.h"
• Four calls:
  • PAT_hwpc_init(taskID, PrgName)
  • PAT_hwpc_finalize(taskID)
  • PAT_event_begin(instID, label)
  • PAT_event_end(instID)
• Compiling and linking:
  • my.x : my.f
      $(FTN) $(FLAGS) my.f -o my.x -lhwpc -lpapi -lm

LIBHWPC Example (Set 1)
Hardware Info : CPU: AMD K8 (1.0), 2400 MHz
Memory Hierarchy: 2 cache levels:
L1 I Cache: 65536 Bytes, Line size: 64 Bytes, 2-way
L1 D Cache: 65536 Bytes, Line size: 64 Bytes, 2-way
L2 Cache: 1048576 Bytes, Line size: 64 Bytes, 16-way
I TLB: 512 entries, 4-way
D TLB: 512 entries, 4-way

Wall Clock time of instrumented section: 19.032111 seconds
User time of instrumented section : 19.0317893620833 seconds
User Cycles : 45676294469

Section 10 (Calc1): file swim_seq.F, lines 94 <--> 98
Number of calls: 500
Wall Clock Time: 4.52193 seconds
Average WCT : 0.00904386 seconds
Std Deviation : 1.04463e-05
Exclusive time : 0.055326 seconds
User time (exc): 0.0556901 seconds (133656144 cycles)
User time (inc): 4.52169 seconds (10852054936 cycles)

PAPI_FP_OPS (FP operations) : 3146878906
PAPI_L1_DCA (L1 Data accesses): 2234140346
DC_MISS (Total L1 Data misses): 117004940
PAPI_TLB_DM (D TLB misses) : 108844312

User time : 4.522 seconds
Utilization rate : 99.995 %
HW FP Ops / Cycles : 0.290
HW FP Ops / User time : 695.952 M HW FP Ops/s
HW FP Ops / WCT : 695.915 M HW FP Ops/WCT
Computation intensity : 1.409
LD & ST per TLB miss : 20.526
LD & ST per D1 miss : 19.094
D1 cache hit ratio : 94.763 %
% TLB misses/cycle : 1.003 %

pat_help & Documentation
• The pat_help utility is an interactive viewer used to access information about and examples of using CrayPat
  • pat_help [topic [subtopic...]]
• See also man pages:
  • craypat
  • pat
  • pat_build
  • pat_report
  • pat_help
  • hwpc
  • papi_counters

pat_help Example
% pat_help

The top level CrayPat/X help topics are listed below.
A good place to start is:

overview

If a topic has subtopics, they are displayed under the heading
"Additional topics", as below. To view a subtopic, you need
only enter as many initial letters as required to distinguish
it from other items in the list. To see a table of contents
including subtopics of those subtopics, etc., enter:

toc

To produce the full text corresponding to the table of contents,
specify "all", but preferably in a non-interactive invocation:

pat_help all . > all_pat_help
pat_help report all . > all_report_help

Additional topics:

API execute
balance experiment
build first_example
counters overview
demos report
environment run

pat_help (.=quit ,=back ^=up /=top ~=search)=>

Cray Apprentice2

• Call graph profile
• Communication statistics
• Time-line view
  • Communication
  • I/O
• Activity view
• Pair-wise communication statistics
• Text reports
• Source code mapping

• Cray Apprentice2 is targeted to help identify and correct:
  • Load imbalance
  • Excessive communication
  • Network contention
  • Excessive serialization
  • I/O problems

Statistics Overview
• New feature: switch Overview display

Function Profile

Load Balance View (Aggregated)
• Min, Avg, and Max values
• -1, +1 Std Dev marks

Call Graph View

Call Graph View
• Zoom

Call Graph View - Zoom
• Width: inclusive time
• Height: exclusive time
• Load balance overview:
  • Height: Max time
  • Left bar: Average time
  • Right bar: Min time

Call Graph View - Zoom
• Function List

Call Graph View - Function List
• Function List off
• Mouse right click: hide node, hide children

Call Graph Hide Children
• hidden children

Call Graph Unhide One Level

Call Graph Unhide One Level (2)

Call Graph Unhide One Level (3)

Call Graph Unhide All Children

Load Balance View (from Call Graph)
• -1, +1 Std Dev marks
• Min, Avg, and Max values

Source Mapping from Call Graph

Function Profile

Distribution by PE, by Call, & by Time

Environment & Execution Details

Time Line View (Sweep3D)

Time Line View (Zoom)
• User Functions, MPI & SHMEM Line
• I/O Line

Time Line View (Fine Grain Zoom)

Activity View

Pair-wise Communication

I/O Overview

I/O Traffic Report

I/O Rates

Hardware Counters Overview

Hardware Counters Time Line

Controlling Trace File Size
• Several environment variables are available to limit trace files to a reasonable size:
  • PAT_RT_CALLSTACK
    • Limit the depth to trace the call stack
  • PAT_RT_HWPC
    • Avoid collecting hardware counters (unset)
  • PAT_RT_RECORD_PE
    • Collect trace for a subset of the PEs
  • PAT_RT_TRACE_FUNCTION_ARGS
    • Limit the number of function arguments to be traced
  • PAT_RT_TRACE_FUNCTION_LIMITS
    • Avoid tracing indicated functions
  • PAT_RT_TRACE_FUNCTION_MAX
    • Limit the maximum number of traces generated for all functions for a single process
• Use the limit built-in command for ksh(1) or csh(1) to control how much disk space the trace file can consume

Questions / Comments

Thank You!