09/26-28/2006, LLNL, Livermore, CA
Performance Measurement and Visualization on the Cray XT3
Luiz DeRose, Programming Environment Director
Cray Inc. [email protected]
09/26-28/2006 Luiz DeRose ([email protected] ) @ Cray Inc. 2
The Cray Tools Strategy
• Must be easy to use
  • Automatic program instrumentation
    • No source code or makefile modification needed
• Integrated performance tools solution
  • Multiple platforms
  • Multiple functionality
    • MPI, I/O, Heap, HW Counters
• Strategy based on the three main steps normally used for application optimization and tuning:
  • Debug application
  • Single processor and vector optimization
  • Parallel processing and I/O optimization
• Close interaction with users for feedback targeting functionality enhancements
Cray Performance Analysis Infrastructure
• CrayPat
  • pat_hwpc: for whole program measurement
  • pat_build: utility for application instrumentation
    • No source code modification required
  • Run-time library for measurements
    • Transparent to the user
  • pat_report:
    • Performance reports
    • Performance visualization file
  • libhwpc
  • pat_help
• Cray Apprentice2
  • Graphical performance analysis and visualization tool
  • Can be used off-line on a Linux system
Performance Data Collection
• Two dimensions
  • When performance collection is triggered
    • External agent (asynchronous)
      • Sampling
        » Timer interrupt
        » Hardware counters overflow
    • Internal agent (synchronous)
      • Code instrumentation (event trace)
        » Automatic instrumentation
        » Hand instrumentation
  • How performance data is recorded
    • Profile (runtime summary)
    • Trace file
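The two recording modes can be contrasted with a small Python sketch. This is not CrayPat code: the event tuples and function names are made up for illustration. A trace keeps every event, while a runtime summary keeps only per-function aggregates.

```python
from collections import defaultdict

# Hypothetical instrumentation events: (function name, elapsed seconds).
events = [("sweep", 0.50), ("mpi_recv", 0.20), ("sweep", 0.25), ("mpi_recv", 0.05)]

def record_trace(events):
    """Trace mode: keep every event in order (size grows with run length)."""
    return list(events)

def record_profile(events):
    """Profile mode: aggregate call count and time per function at run time."""
    summary = defaultdict(lambda: {"calls": 0, "time": 0.0})
    for name, elapsed in events:
        summary[name]["calls"] += 1
        summary[name]["time"] += elapsed
    return dict(summary)

trace = record_trace(events)
profile = record_profile(events)
print(len(trace), "trace records;", len(profile), "profile entries")
```

The profile stays the same size no matter how long the program runs, which is why it is the cheaper of the two; the trace preserves ordering, which a timeline view needs.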
Single Processor Optimization
• Answer the following questions:
  • Do I have a performance problem at all?
    • pat_hwpc
      • Provides overall view of the program execution
        » Time / Resource / Hardware Counters measurement
  • Where are the main bottlenecks?
    • CrayPat provides profilers
      • Based on sampling and runtime summarization
        » XT3 sampling support under development
      • Flat, Call graph, Function, …
  • Why is it there?
    • CrayPat contains:
      • HW counters based instrumentation library
      • API for lower level instrumentation
Parallel Processing, I/O and Memory Optimization
• Answer the following questions:
  • Do I have communication/synchronization problems?
    • CrayPat addresses:
      • Communication profiler
      • Load balance profile
  • Do I have I/O or memory problems?
    • CrayPat addresses (will address):
      • I/O profiler
      • Heap profiler
  • Why?
    • Tracing
      • CrayPat tracing library
    • Cray PAT Visualization GUI
      • Cray Apprentice2
Six Steps for Performance Analysis
1. Load CrayPat module
2. Build application
   • No makefile modification needed
3. Instrument application with pat_build
   % pat_build [-g group] [-u] [options] a.out
   • Groups: mpi, io, heap, user function (-u) …
   • Automatic instrumentation at group (function) level
     • No source code modification needed
   • API provided for instrumentation at a finer granularity
4. Run instrumented application
5. Generate performance file (.ap2) with pat_report
   % pat_report -f ap2 [options] <.xf file>
6. Performance analysis and visualization with CrayPat and Cray Apprentice2
Application Instrumentation with pat_build
• No source code or makefile modification required
  • API available for fine grain instrumentation
  • Compiler flag "-Mprof=func" needed for some Fortran 90 programs with modules
    • This problem has been addressed with PGI 6.1.4
    • In certain cases the flag is still needed
• Performs binary rewrite
  • Relinks the application
  • Requires object files
  • Generates a stand-alone instrumented program
• A runtime environment variable defines whether a profile or a trace file will be generated
  • PAT_RT_SUMMARY
  • Default is 1 (for runtime summarization)
CrayPat API
• CrayPat performs automatic instrumentation at function level
• The CrayPat API can be used for fine grain instrumentation
• Fortran
    call PAT_region_begin(id, "label", ierr)
    call DO_Work()
    call PAT_region_end(id, ierr)
• C
    #include <pat_api.h>
    …
    ierr = PAT_region_begin(id, "label");
    DO_Work();
    ierr = PAT_region_end(id);
Additional API Functions
• int PAT_profiling_state (int state)
• int PAT_record (int state)
• int PAT_sampling_state (int state)
• int PAT_tracing_state (int state)
• int PAT_trace_function (const void *addr, int state)
• State can have one of the following values:
  • PAT_STATE_ON
  • PAT_STATE_OFF
  • PAT_STATE_QUERY
• int PAT_flush_buffer (void)
Runtime Environment Variables
• The following runtime environment variables affect how the data is collected:
  • PAT_RT_REGION_MAX
    • Specifies the largest numerical ID that may be used as an argument to the CrayPat API functions PAT_region_begin and PAT_region_end
    • The default is 100
  • PAT_RT_SUMMARY
    • Enables run-time summarization
    • Includes the aggregation of data during run-time
    • Runtime summarization is enabled by default
  • PAT_RT_HWPC <set #>
    • Activates collection of hardware performance counters
    • There are 9 sets on the XT3
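How the defaults resolve can be sketched in a few lines of Python. The variable names and default values are the ones stated on the slides; the parsing logic, the helper names, and the reading that PAT_RT_SUMMARY=0 selects a trace file are illustrative assumptions.

```python
import os

# Defaults as stated on the slides; everything else here is an assumption.
DEFAULTS = {"PAT_RT_SUMMARY": "1", "PAT_RT_REGION_MAX": "100"}

def effective_settings(environ):
    """Return the value each variable takes for a given environment mapping."""
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}

def output_kind(settings):
    """Assumed reading: summarization on gives a profile, off gives a trace."""
    return "profile" if settings["PAT_RT_SUMMARY"] != "0" else "trace"

settings = effective_settings(os.environ)
print(settings, output_kind(settings))
```

Passing a plain dictionary instead of `os.environ` makes it easy to check what a given job script would select before submitting it.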
pat_report Options
• Reformatting the performance file (Cray Apprentice2 input)
  • pat_report [-V] [-i dir|instrprog] [-o output_file] -f ap2|txt|xml data_directory | data_file.xf
• Generating performance reports
  • pat_report [-V] [-i dir|instrprog] [-o output_file] [-O keyword] [-b b-opts] [-d d-opts] [-s key=value] [-P] [-T] data_directory | data_file.xf | data_file.ap2
• Main options:
  • -i is needed only if the instrumented program has a different name or is in a different directory path than when it was executed
  • -O provides shortcuts for common reports
  • -b, -d, -s can be used to further customize the report
pat_report Output
CrayPat/X: Version 3.1 Revision 398 (xf 305) 09/21/06 16:23:52
Experiment: trace
Experiment data file:
  /lus/nid00007/ldr/L_Apps/sweep3d/sweep3d+all+1375td.xf (RTS)
Current path to data file: /ufs/home/users/ldr/P+sweep+48p.ap2 (RTS)
Original program: /lus/nid00007/ldr/L_Apps/sweep3d/sweep3d
Instrumented program: /lus/nid00007/ldr/L_Apps/sweep3d/./sweep3d+all
Program invocation: ./sweep3d+all
Number of PEs: 48
Exit Status: 0 PEs: 0-47
Runtime environment variables: PAT_RT_SUMMARY=1
Report time environment variables:
  PAT_ROOT=/home/users/homer/opt/xt-tools/craypat/craypat/cpatx
Report command line options: <none>
Host name and type: perch x86_64 2400 MHz
Operating system: catamount 1.0 2.0
Traced functions:
  MAIN_          .../ldr/L_Apps/sweep3d/driver.f
  MPI_Abort      ==NA==
  MPI_Allgather  ==NA==
  MPI_Allreduce  ==NA==
  MPI_Attr_put   ==NA==
  . . .

All profiles have this information, which is helpful to link the data with a particular execution
Table 1: Flat Profile (Default)
Notes for table 1:
  High level option: -O profile
  Low level options: -d ti%@0.05,ti,imb_ti,imb_ti%,tr \
                     -b exp,gr,fu,pe=HIDE
  This table shows only lines with Time% > 0.05.
  Percentages at each level are relative
  (for absolute percentages, specify: -s percent=a).

Table 1: Profile by Function Group and Function

  Time % |     Time | Imb. Time |   Imb. |  Calls |Experiment=1
         |          |           | Time % |        |Group
         |          |           |        |        | Function
         |          |           |        |        |  PE='HIDE'

  100.0% | 3.798177 |        -- |     -- | 579653 |Total
|------------------------------------------------------------
|  70.9% | 2.692783 |        -- |     -- | 245380 |USER
||-----------------------------------------------------------
|| 97.1% | 2.615916 |  0.137362 |   5.1% |    576 |sweep_
||  1.7% | 0.046263 |  0.001465 |   3.1% |    576 |source_
||  0.4% | 0.010300 |  0.001399 |  12.2% | 118080 |snd_real_
||  0.3% | 0.009208 |  0.000261 |   2.8% |    576 |flux_err_
||  0.2% | 0.004303 |  0.000726 |  14.7% | 118080 |rcv_real_
||  0.1% | 0.002387 |  0.001871 |  44.9% |     48 |MAIN_

By default, the report will only show functions with at least 0.05% of the time
Table 1: Flat Profile (Continuation)

||  0.1% | 0.001785 |  0.000046 |   2.6% |     48 |initialize_
||  0.1% | 0.001399 |  0.000065 |   4.6% |     48 |initxs_
||===========================================================
|  28.8% | 1.092307 |        -- |     -- | 238224 |MPI
||-----------------------------------------------------------
|| 76.1% | 0.831311 |  0.234766 |  22.5% | 118080 |mpi_recv_
|| 12.0% | 0.131277 |  0.124790 |  49.8% |   1536 |mpi_allreduce_
||  4.9% | 0.053189 |  0.009683 |  15.7% | 118080 |mpi_send_
||  4.1% | 0.044522 |  0.001153 |   2.6% |    144 |mpi_barrier_
||  2.9% | 0.032002 |  0.002497 |   7.4% |    192 |mpi_bcast_
||===========================================================
|   0.2% | 0.007329 |        -- |     -- |  95597 |HEAP
||-----------------------------------------------------------
|| 61.1% | 0.004481 |  0.001403 |  24.3% |  47861 |malloc
|| 38.9% | 0.002848 |  0.000900 |  24.5% |  47735 |free
||===========================================================
|   0.2% | 0.005758 |        -- |     -- |    452 |IO
||-----------------------------------------------------------
|| 81.3% | 0.004679 |  0.072462 |  95.9% |    309 |fwrite
||  9.8% | 0.000566 |  0.026590 | 100.0% |     68 |getc
||  8.6% | 0.000498 |  0.023405 | 100.0% |      8 |fputc
||  0.1% | 0.000007 |  0.000330 | 100.0% |      2 |fopen
||  0.1% | 0.000005 |  0.000000 |   5.8% |     48 |setlinebuf
|============================================================
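Two pieces of arithmetic behind the flat profile can be checked with a short Python sketch. The first converts a relative percentage into an absolute one. The second uses an assumed definition of the imbalance columns (imbalance time = maximum minus average over PEs, imbalance % = imbalance time / maximum, scaled by N/(N-1)); that definition is inferred from the numbers on these slides, not taken from CrayPat documentation. The maximum time for sweep_ (2.753279 s on pe.0) comes from the per-PE load balance table for the same run.

```python
def absolute_percent(function_time, total_time):
    """Share of the whole program's time, as with -s percent=a."""
    return 100.0 * function_time / total_time

def imbalance(avg_time, max_time, num_pes):
    """Assumed definition: imbalance time = max - avg, and
    imbalance % = imbalance time / max, scaled by N / (N - 1)."""
    imb_time = max_time - avg_time
    imb_pct = 100.0 * imb_time / max_time * num_pes / (num_pes - 1)
    return imb_time, imb_pct

# sweep_ is 97.1% of the USER group, which is 70.9% of the total:
# 2.615916 s out of 3.798177 s overall.
print(round(absolute_percent(2.615916, 3.798177), 1))   # about 68.9

# sweep_ on 48 PEs: average 2.615916 s, slowest PE 2.753279 s.
imb_time, imb_pct = imbalance(2.615916, 2.753279, 48)
print(round(imb_time, 6), round(imb_pct, 1))
```

Under this assumed definition, a value of 100% (as for getc and fputc) corresponds to all of the time being spent on a single PE.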
Table 2: Load Balance (Default)
Notes for table 2:
  High level option: -O load_balance_sm
  Low level options: -d ti%@0.05,ti,sc,sm,sz -b exp,gr,pe=[mmm]

Table 2: Load Balance with MPI Sent Message Stats

  Time % |     Time |   Sent |    Sent Msg | Avg Sent |Experiment=1
         |          |    Msg | Total Bytes | Msg Size |Group
         |          |  Count |             |          | PE[mmm]

  100.0% | 3.798177 | 118080 |  1244160000 | 10536.59 |Total
|----------------------------------------------------------------
|  70.9% | 2.692783 |     -- |          -- |       -- |USER
||---------------------------------------------------------------
||  2.2% | 2.833001 |     -- |          -- |       -- |pe.0
||  2.1% | 2.717019 |     -- |          -- |       -- |pe.12
||  2.0% | 2.597093 |     -- |          -- |       -- |pe.43
||===============================================================
|  28.8% | 1.092307 | 118080 |  1244160000 | 10536.59 |MPI
||---------------------------------------------------------------
||  2.3% | 1.188383 |   2160 |    21081600 |  9760.00 |pe.43
||  2.0% | 1.069314 |   2880 |    30412800 | 10560.00 |pe.7
||  1.6% | 0.859333 |   1440 |    15206400 | 10560.00 |pe.0
||===============================================================
|   0.2% | 0.007329 |     -- |          -- |       -- |HEAP
||---------------------------------------------------------------
||  2.7% | 0.009363 |     -- |          -- |       -- |pe.12
||  2.2% | 0.007614 |     -- |          -- |       -- |pe.40
||  0.6% | 0.002062 |     -- |          -- |       -- |pe.0
||===============================================================
|   0.2% | 0.005758 |     -- |          -- |       -- |IO
||---------------------------------------------------------------
|| 46.6% | 0.128685 |     -- |          -- |       -- |pe.0
||  1.1% | 0.003144 |     -- |          -- |       -- |pe.47
||  0.6% | 0.001644 |     -- |          -- |       -- |pe.29
|================================================================
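The "Avg Sent Msg Size" column is the total bytes sent divided by the message count. A short Python check using the rows of the table above (the helper name is illustrative):

```python
def avg_msg_size(total_bytes, msg_count):
    """Average sent message size in bytes."""
    return total_bytes / msg_count

# (message count, total bytes) as listed in the table.
rows = {
    "Total": (118080, 1244160000),
    "pe.43": (2160, 21081600),
    "pe.7": (2880, 30412800),
    "pe.0": (1440, 15206400),
}

for name, (count, nbytes) in rows.items():
    print(name, round(avg_msg_size(nbytes, count), 2))
```

The per-PE message counts differ (1440, 2160, 2880) while the message sizes stay close to 10 KB, so the differences in MPI time between PEs are not explained by message size alone.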
Table 3: MPI Send Stats by Bucket
Notes for table 3:
  High level option: -O mpi
  Low level options: -d sc@,mb1..7 -b exp,fu,ca,pe=[mmm]
  This table shows only lines with Sent Msg Count > 0.

Table 3: MPI Sent Messages Stats by Bucket

          Sent | 256B<= |Experiment=1
           Msg |  MsgSz |Function
         Count |   <4KB | Caller
               |        |  PE[mmm]

        118080 | 118080 |Total
|-----------------------------
|       118080 | 118080 |mpi_send_
|              |        | snd_real_
|              |        |  sweep_
|              |        |   inner_
|              |        |    inner_auto_
|              |        |     MAIN_
|||||||-----------------------
|||||||   2880 |   2880 |pe.33
|||||||   2160 |   2160 |pe.2
|||||||   1440 |   1440 |pe.5
|=============================
Table 4: Heap Usage
Notes for table 4:
  High level option: -O heap_program
  Low level options: -d IU,IF,NF,FM -b exp,pe=[mmm]

Table 4: Heap Usage at Start and End of Main Program

 MB Heap |  MB Heap |  Heap |  Max Free |Experiment=1
 Used at |  Free at |   Not | Object at |PE[mmm]
   Start |    Start | Freed |       End |
         |          |    MB |           |

  91.180 | 1834.820 | 0.001 |  1834.799 |Total
|---------------------------------------------------
|  92.754 | 1833.246 | 1.623 |  1833.211 |pe.0
|  91.147 | 1834.853 | 0.000 |  1834.833 |pe.42
|  91.147 | 1834.853 | 0.001 |  1834.833 |pe.5
|===================================================
Table 5: Heap Statistics
Notes for table 5:
  High level option: -O heap_hiwater
  Low level options: -d am@,ub,ta,ua,tf,ac,ab -b exp,pe=[mmm]
  This table shows only lines with Tracked Heap HiWater MBytes > 0.

Table 5: Heap Stats during Main Program

 Tracked |  MBytes |  Total |  Allocs | Total | Tracked | Tracked |Experiment=1
    Heap |     Not | Allocs |     Not | Frees | Objects |  MBytes |PE[mmm]
 HiWater | Tracked |        | Tracked |       |     Not |     Not |
  MBytes |         |        |         |       |   Freed |   Freed |

   8.793 |   0.000 |    997 |       0 |   995 |       3 |   0.010 |Total
|------------------------------------------------------------------------------
|  8.927 |   0.000 |    418 |       0 |   386 |      33 |   0.030 |pe.0
|  8.907 |   0.000 |    945 |       0 |   943 |       2 |   0.010 |pe.23
|  8.445 |   0.000 |   1471 |       0 |  1469 |       2 |   0.010 |pe.43
|==============================================================================
Table 6: Heap Leaks

Table 6: Heap Leaks during Main Program

 Tracked | Tracked | Tracked |Experiment=1
  MBytes |  MBytes | Objects |Caller
     Not |     Not |     Not | PE[mmm]
 Freed % |   Freed |   Freed |

  100.0% |   0.010 |       2 |Total
|-----------------------------------------
|  95.0% |   0.010 |       1 |(N/A)
||----------------------------------------
||  2.1% |   0.010 |       1 |pe.33
||  2.1% |   0.010 |       1 |pe.42
||  2.1% |   0.010 |       1 |pe.5
|=========================================
Table 7: I/O (Read) Statistics
Notes for table 7:
  High level option: -O read_stats
  Low level options: -d rt,rb,rR,rd@,rC -b exp,fi,pe=[mmm],fd
  This table shows only lines with Reads > 0.

Table 7: File Input Stats by Filename

    Read |  Read MB |  Read Rate | Reads |   Read |Experiment=1
    Time |          |     MB/sec |       | B/Call |File Name
         |          |            |       |        | PE[mmm]
         |          |            |       |        |  File Desc

   0.000 | 0.000065 | 153.038002 |    68 |   1.00 |Total
|-------------------------------------------------------------
|  0.000 | 0.000065 | 153.038002 |    68 |   1.00 |input
||------------------------------------------------------------
|| 0.000 | 0.000065 |   3.189272 |    68 |   1.00 |pe.0
||       |          |            |       |        | fd.6
|| 0.000 |       -- |         -- |    -- |     -- |pe.22
|| 0.000 |       -- |         -- |    -- |     -- |pe.5
|=============================================================
Table 8: I/O (Write) Statistics
Notes for table 8:
  High level option: -O write_stats
  Low level options: -d wt,wb,wR,wr@,wC -b exp,fi,pe=[mmm],fd
  This table shows only lines with Writes > 0.

Table 8: File Output Stats by Filename

   Write | Write MB |  Write Rate | Writes |  Write |Experiment=1
    Time |          |      MB/sec |        | B/Call |File Name
         |          |             |        |        | PE[mmm]
         |          |             |        |        |  File Desc

   0.000 | 0.002596 |  708.859207 |    317 |   8.59 |Total
|--------------------------------------------------------------
|  0.000 | 0.002001 |  621.128045 |    269 |   7.80 |stdout
||-------------------------------------------------------------
|| 0.000 | 0.002001 |   12.940342 |    269 |   7.80 |pe.0
||       |          |             |        |        | fd.1
|| 0.000 |       -- |          -- |     -- |     -- |pe.22
|| 0.000 |       -- |          -- |     -- |     -- |pe.5
||=============================================================
|  0.000 | 0.000595 | 1349.926896 |     48 |  13.00 |stderr
||-------------------------------------------------------------
|| 0.000 | 0.000012 |   23.392012 |      1 |  13.00 |pe.30
||       |          |             |        |        | fd.2
|| 0.000 | 0.000012 |   26.878626 |      1 |  13.00 |pe.10
||       |          |             |        |        | fd.2
|| 0.000 | 0.000012 |   56.567754 |      1 |  13.00 |pe.0
||       |          |             |        |        | fd.2
|==============================================================
Table 9: Wall Clock Time
Notes for table 9:
  High level option: -O program_time
  Low level options: -d pt -b exp,pe=[mmm]

Table 9: Program Wall Clock Time

  Process |Experiment=1
     Time |PE[mmm]

 7.034316 |Total
|----------------------
| 7.468644 |pe.0
| 7.030273 |pe.47
| 6.585806 |pe.29
|======================
Call Tree Profile (Top Down)
Notes for table 1:
  High level option: -O calltree
  Low level options: -d ti%@0.05,cum_ti%,ti,tr -b exp,ct,pe=HIDE \
                     -s show_ca='fu,so,li' -s source_limit='1'
  This table shows only lines with Time% > 0.05.
  Percentages at each level are relative
  (for absolute percentages, specify: -s percent=a).

Table 1: Calltree View with Callsite Line Numbers

     Time % |   Cum. |      Time |     Calls |Experiment=1
            | Time % |           |           |Calltree
            |        |           |           | PE='HIDE'

     100.0% | 100.0% | 90.217759 | 637231917 |Total
|-----------------------------------------------------
|    100.0% | 100.0% | 90.175202 | 637205576 |MAIN_
||----------------------------------------------------
||    99.7% |  99.7% | 89.922750 | 637194666 |runhyd_
|||---------------------------------------------------
|||   15.4% |  15.4% | 13.864217 | 106169040 |zysweep_
||||--------------------------------------------------
||||  87.3% |  87.3% | 12.097038 | 106168320 |sppm2_
|||||-------------------------------------------------
||||| 49.4% |  49.4% |  5.980766 |  11796480 |sppm2_(exclusive)
||||| 24.1% |  73.6% |  2.920440 |  11796480 |difuze_
||||| 19.0% |  92.6% |  2.296747 |  58982400 |interf_
|||||  7.4% | 100.0% |  0.899084 |  23592960 |dintrf_
|||||=================================================
||||  12.7% | 100.0% |  1.767180 |       720 |zysweep_(exclusive)
||||==================================================
|||   15.4% |  30.8% | 13.854807 | 106169040 |xysweep_
||||--------------------------------------------------
||||  87.0% |  87.0% | 12.049373 | 106168320 |sppm2_
|||||-------------------------------------------------
||||| 49.5% |  49.5% |  5.970403 |  11796480 |sppm2_(exclusive)
||||| 24.0% |  73.6% |  2.894189 |  11796480 |difuze_
Callers Profile (Bottom Up)
Notes for table 1:
  High level option: -O callers
  Low level options: -d ti%@0.05,cum_ti%,ti,tr -b exp,gr,fu,ca,pe=HIDE
  This table shows only lines with Time% > 0.05.

Table 1: Profile by Function and Callers

   Time % |   Cum. |      Time |     Calls |Experiment=1
          | Time % |           |           |Group
          |        |           |           | Function
          |        |           |           |  Caller
          |        |           |           |   PE='HIDE'

   100.0% | 100.0% | 90.217759 | 637231917 |Total
|-----------------------------------------------------
|   92.3% |  92.3% | 83.265853 | 637033288 |USER
||----------------------------------------------------
||  43.1% |  43.1% | 35.864107 |  70778880 |sppm2_
|||---------------------------------------------------
||| 16.7% |  16.7% |  5.986173 |  11796480 |yxsweep_
|||       |        |           |           | runhyd_
|||       |        |           |           |  MAIN_
||| 16.7% |  33.4% |  5.980851 |  11796480 |yzsweep_
|||       |        |           |           | runhyd_
|||       |        |           |           |  MAIN_
||| 16.7% |  50.0% |  5.980766 |  11796480 |zysweep_
|||       |        |           |           | runhyd_
|||       |        |           |           |  MAIN_
||| 16.7% |  66.7% |  5.973496 |  11796480 |zzsweep_
|||       |        |           |           | runhyd_
|||       |        |           |           |  MAIN_
||| 16.7% |  83.4% |  5.972417 |  11796480 |xxsweep_
|||       |        |           |           | runhyd_
|||       |        |           |           |  MAIN_
||| 16.6% | 100.0% |  5.970403 |  11796480 |xysweep_
|||       |        |           |           | runhyd_
|||       |        |           |           |  MAIN_
|||===================================================
||  21.0% |  64.0% | 17.447719 |  70778880 |difuze_
||        |        |           |           | sppm2_
Callers Profile – MPI (Cont.)

||====================================================
|     7.7% |  99.9% |  6.906194 |    106344 |MPI
||----------------------------------------------------
||   70.2% |  70.2% |  4.851312 |     51840 |mpi_wait_
|||---------------------------------------------------
|||  41.2% |  41.2% |  1.997854 |     17280 |zbdrys_
||||--------------------------------------------------
||||       |        |           |           |runhyd_
|||||-------------------------------------------------
|||||      |        |           |           |MAIN_
||||==================================================
|||  34.3% |  75.5% |  1.664276 |     17280 |ybdrys_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||  24.5% | 100.0% |  1.189183 |     17280 |xbdrys_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||===================================================
||   29.7% |  99.9% |  2.048254 |      2232 |mpi_allreduce_
|||---------------------------------------------------
|||  96.6% |  96.6% |  1.978537 |       720 |glblmax_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||   3.4% | 100.0% |  0.069717 |      1512 |glbldsum_
||||--------------------------------------------------
|||| 98.5% |  98.5% |  0.068700 |       792 |trace_
||||       |        |           |           | MAIN_
||||  1.5% | 100.0% |  0.001017 |       720 |runhyd_
||||       |        |           |           | MAIN_
|||===================================================
||    0.1% | 100.0% |  0.004263 |     25920 |mpi_isend_
|||---------------------------------------------------
|||  33.7% |  33.7% |  0.001436 |      8640 |xbdrys_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||  33.2% |  66.9% |  0.001416 |      8640 |zbdrys_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|||  33.1% | 100.0% |  0.001410 |      8640 |ybdrys_
|||        |        |           |           | runhyd_
|||        |        |           |           |  MAIN_
|=====================================================
Callers Profile with Line Numbers
% pat_report -O ca+src <performance file>

    Time% | Cum.Time% |        Time |     Calls |Function
          |           |             |           | Caller

   100.0% |    100.0% | 2647.397216 | 283222164 |Total
|--------------------------------------------------------
|   40.5% |     40.5% | 1071.748077 |  31457280 |sppm2_
||-------------------------------------------------------
||   6.8% |      6.8% |  179.125866 |   5242880 |yzsweep_:/scratch/derose/sppm2/sweeps.F:line.518
||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1056
||        |           |             |           | MAIN_:/scratch/derose/sppm2/main.F:line.226
||        |           |             |           | main:NA:line.0
||   6.8% |     13.5% |  178.901135 |   5242880 |zzsweep_:/scratch/derose/sppm2/sweeps.F:line.812
||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1064
||        |           |             |           | MAIN_:/scratch/derose/sppm2/main.F:line.226
||        |           |             |           | main:NA:line.0
||   6.8% |     20.3% |  178.709675 |   5242880 |yxsweep_:/scratch/derose/sppm2/sweeps.F:line.1400
||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1080
||        |           |             |           | MAIN_:/scratch/derose/sppm2/main.F:line.226
||        |           |             |           | main:NA:line.0
||   6.7% |     27.0% |  178.614169 |   5242880 |zysweep_:/scratch/derose/sppm2/sweeps.F:line.1106
||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1072
||        |           |             |           | MAIN_:/scratch/derose/sppm2/main.F:line.226
||        |           |             |           | main:NA:line.0
||   6.7% |     33.8% |  178.273669 |   5242880 |xxsweep_:/scratch/derose/sppm2/sweeps.F:line.1694
||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1088
||        |           |             |           | MAIN_:/scratch/derose/sppm2/main.F:line.226
||        |           |             |           | main:NA:line.0
||   6.7% |     40.5% |  178.123564 |   5242880 |xysweep_:/scratch/derose/sppm2/sweeps.F:line.219
||        |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1048
||        |           |             |           | MAIN_:/scratch/derose/sppm2/main.F:line.226
||        |           |             |           | main:NA:line.0
||=======================================================
|   20.2% |     60.7% |  534.482687 |  31457280 |difuze_
|         |           |             |           | sppm2_:/scratch/derose/sppm2/sppm.F:line.630
|||------------------------------------------------------
|||  3.4% |     43.9% |   89.361514 |   5242880 |zzsweep_:/scratch/derose/sppm2/sweeps.F:line.812
|||       |           |             |           | runhyd_:/scratch/derose/sppm2/main.F:line.1064
|||       |           |             |           | MAIN_:/scratch/derose/sppm2/main.F:line.226
|||       |           |             |           | main:NA:line.0
|||  3.4% |     47.2% |   89.178333 |   5242880 |zysweep_:/scratch/derose/sppm2/sweeps.F:line.1106
Load Balancing Function per PE
Notes for table 1:
  High level option: -O load_balance_program
  Low level options: -d ti%@0.05,cum_ti%,ti,tr -b exp,pe
  This table shows only lines with Time% > 0.05.
  Percentages at each level are relative
  (for absolute percentages, specify: -s percent=a).

Table 1: Load Balance across PE's

   Time % |   Cum. |     Time |  Calls |Experiment=1
          | Time % |          |        |PE

   100.0% | 100.0% | 3.798177 | 579653 |Total
|-------------------------------------------------
|    2.1% |   2.1% | 3.823080 |   7160 |pe.0
|    2.1% |   4.2% | 3.799148 |  13753 |pe.8
| ...
|    2.1% |  97.9% | 3.796151 |   7683 |pe.5
|    2.1% | 100.0% | 3.796144 |  10431 |pe.29
|=================================================
Table 2: LB Across PE's by Group
Notes for table 2:
  High level option: -O load_balance_group
  Low level options: -d ti%@0.05,cum_ti%,ti,tr -b exp,gr,pe
  . . .

Table 2: Load Balance across PE's by FunctionGroup

   Time % |   Cum. |     Time |  Calls |Experiment=1
          | Time % |          |        |Group
          |        |          |        | PE

   100.0% | 100.0% | 3.798177 | 579653 |Total
|-------------------------------------------------
|   70.9% |  70.9% | 2.692783 | 245380 |USER
||------------------------------------------------
||   2.2% |   2.2% | 2.833001 |   3076 |pe.0
|| ...
||   2.0% | 100.0% | 2.597093 |   4512 |pe.43
||================================================
|   28.8% |  99.7% | 1.092307 | 238224 |MPI
||------------------------------------------------
||   2.3% |   2.3% | 1.188383 |   4363 |pe.43
|| ...
||   1.6% | 100.0% | 0.859333 |   2923 |pe.0
||================================================
|    0.2% |  99.8% | 0.007329 |  95597 |HEAP
||------------------------------------------------
||   2.7% |   2.7% | 0.009363 |   2482 |pe.12
|| ...
||   0.6% | 100.0% | 0.002062 |    803 |pe.0
||================================================
|    0.2% | 100.0% | 0.005758 |    452 |IO
||------------------------------------------------
||  46.6% |  46.6% | 0.128685 |    358 |pe.0
|| ...
||   0.6% | 100.0% | 0.001644 |      2 |pe.29
|=================================================
Table 3: LB Across PE's by Function
Notes for table 3:
  High level option: -O load_balance_function
  Low level options: -d ti%@0.05,cum_ti%,ti,tr -b exp,gr,fu,pe
  This table shows only lines with Time% > 0.05.
  Percentages at each level are relative
  (for absolute percentages, specify: -s percent=a).

Table 3: Load Balance across PE's by Function

   Time % |   Cum. |     Time |  Calls |Experiment=1
          | Time % |          |        |Group
          |        |          |        | Function
          |        |          |        |  PE

   100.0% | 100.0% | 3.798177 | 579653 |Total
|-------------------------------------------------
|   70.9% |  70.9% | 2.692783 | 245380 |USER
||------------------------------------------------
||  97.1% |  97.1% | 2.615916 |    576 |sweep_
|||-----------------------------------------------
|||  2.2% |   2.2% | 2.753279 |     12 |pe.0
|||  2.1% |   4.3% | 2.654725 |     12 |pe.5
||| . . .
|||  2.0% |  98.0% | 2.525587 |     12 |pe.43
|||  2.0% | 100.0% | 2.523325 |     12 |pe.37
|||===============================================
. . .
|||===============================================
||   0.4% |  99.2% | 0.010300 | 118080 |snd_real_
|||-----------------------------------------------
|||  2.4% |   2.4% | 0.011699 |   2880 |pe.26
|||  2.3% |   4.7% | 0.011475 |   2880 |pe.27
||| . . .
|||  1.5% |  98.6% | 0.007266 |   1440 |pe.0
|||  1.4% | 100.0% | 0.006907 |   1440 |pe.5
|||===============================================
Table 3 (Cont.)

||================================================
|   28.8% |  99.7% | 1.092307 | 238224 |MPI
||------------------------------------------------
||  76.1% |  76.1% | 0.831311 | 118080 |mpi_recv_
|||-----------------------------------------------
|||  2.7% |   2.7% | 1.066077 |   1440 |pe.47
|||  2.6% |   5.3% | 1.034307 |   2160 |pe.41
||| . . .
|||  1.8% |  98.6% | 0.700970 |   2160 |pe.1
|||  1.4% | 100.0% | 0.573420 |   1440 |pe.0
|||===============================================
. . .
||================================================
|    0.2% |  99.8% | 0.007329 |  95597 |HEAP
||------------------------------------------------
||  61.1% |  61.1% | 0.004481 |  47861 |malloc
|||-----------------------------------------------
|||  2.7% |   2.7% | 0.005884 |   1242 |pe.12
|||  2.6% |   5.4% | 0.005658 |   1226 |pe.19
||| . . .
|||  1.3% |  99.5% | 0.002827 |    618 |pe.34
|||  0.5% | 100.0% | 0.001164 |    417 |pe.0
|||===============================================
||  38.9% | 100.0% | 0.002848 |  47735 |free
|||-----------------------------------------------
|||  2.7% |   2.7% | 0.003748 |   1422 |pe.37
|||  2.7% |   5.5% | 0.003706 |   1469 |pe.43
||| . . .
|||  1.4% |  99.3% | 0.001867 |    616 |pe.34
|||  0.7% | 100.0% | 0.000896 |    385 |pe.0
||================================================
|    0.2% | 100.0% | 0.005758 |    452 |IO
||------------------------------------------------
||  81.3% |  81.3% | 0.004679 |    309 |fwrite
|||-----------------------------------------------
||| 34.3% |  34.3% | 0.077141 |    262 |pe.0
|||  2.1% |  36.4% | 0.004615 |      1 |pe.8
||| . . .
Load Balance: Max, Median, Min
Notes for table 1:
  High level option: -O load_balance_program
  Low level options: -d ti%@0.05,cum_ti%,ti,tr -b exp,pe=[mmm]
  This table shows only lines with Time% > 0.05.
  Percentages at each level are relative
  (for absolute percentages, specify: -s percent=a).

Table 1: Load Balance across PE's

   Time % |   Cum. |     Time |  Calls |Experiment=1
          | Time % |          |        |PE[mmm]

   100.0% | 100.0% | 3.798177 | 579653 |Total
|-------------------------------------------------
|    2.1% |   2.1% | 3.823080 |   7160 |pe.0
|    2.1% |  52.1% | 3.797671 |  10695 |pe.3
|    2.1% | 100.0% | 3.796144 |  10431 |pe.29
|=================================================
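The pe=[mmm] selection can be sketched in Python as picking three entries from the per-PE times sorted in descending order. The choice of index n // 2 as the median is an assumption (the slides do not say how ties or even PE counts are handled), and the helper name is illustrative. The sample data are only the five per-PE times printed on these slides, not all 48 PEs of the run.

```python
def max_median_min(times_by_pe):
    """Return the (pe, time) pairs with the maximum, median, and minimum time.

    Assumption: the entry at index n // 2 of the descending order is
    reported as the median.
    """
    ranked = sorted(times_by_pe.items(), key=lambda item: item[1], reverse=True)
    return ranked[0], ranked[len(ranked) // 2], ranked[-1]

# Only the five per-PE times shown on the slides; the run had 48 PEs.
times = {"pe.0": 3.823080, "pe.8": 3.799148, "pe.3": 3.797671,
         "pe.5": 3.796151, "pe.29": 3.796144}

highest, median, lowest = max_median_min(times)
print(highest[0], median[0], lowest[0])   # pe.0 pe.3 pe.29
```

Showing only these three rows keeps the report short for large PE counts while still exposing the spread between the slowest and fastest PE.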
LB [MMM] Table 2
Notes for table 2:
  High level option: -O load_balance_group
  Low level options: -d ti%@0.05,cum_ti%,ti,tr \
                     -b exp,gr,pe=[mmm]
  This table shows only lines with Time% > 0.05.
  Percentages at each level are relative
  (for absolute percentages, specify: -s percent=a).

Table 2: Load Balance across PE's by FunctionGroup

   Time % |   Cum. |     Time |  Calls |Experiment=1
          | Time % |          |        |Group
          |        |          |        | PE[mmm]

   100.0% | 100.0% | 3.798177 | 579653 |Total
|-------------------------------------------------
|   70.9% |  70.9% | 2.692783 | 245380 |USER
||------------------------------------------------
||   2.2% |   2.2% | 2.833001 |   3076 |pe.0
||   2.1% |  52.8% | 2.717019 |   4512 |pe.12
||   2.0% | 100.0% | 2.597093 |   4512 |pe.43
||================================================
|   28.8% |  99.7% | 1.092307 | 238224 |MPI
||------------------------------------------------
||   2.3% |   2.3% | 1.188383 |   4363 |pe.43
||   2.0% |  53.7% | 1.069314 |   5803 |pe.7
||   1.6% | 100.0% | 0.859333 |   2923 |pe.0
||================================================
|    0.2% |  99.8% | 0.007329 |  95597 |HEAP
||------------------------------------------------
||   2.7% |   2.7% | 0.009363 |   2482 |pe.12
||   2.2% |  59.6% | 0.007614 |   2192 |pe.40
||   0.6% | 100.0% | 0.002062 |    803 |pe.0
||================================================
|    0.2% | 100.0% | 0.005758 |    452 |IO
||------------------------------------------------
||  46.6% |  46.6% | 0.128685 |    358 |pe.0
||   1.1% |  80.3% | 0.003144 |      2 |pe.47
||   0.6% | 100.0% | 0.001644 |      2 |pe.29
|=================================================
LB [MMM] Table 3
Notes for table 3:
  High level option: -O load_balance_function
  Low level options: -d ti%@0.05,cum_ti%,ti,tr \
                     -b exp,gr,fu,pe=[mmm]
  This table shows only lines with Time% > 0.05.
  Percentages at each level are relative
  (for absolute percentages, specify: -s percent=a).

Table 3: Load Balance across PE's by Function

   Time % |   Cum. |     Time |  Calls |Experiment=1
          | Time % |          |        |Group
          |        |          |        | Function
          |        |          |        |  PE[mmm]

   100.0% | 100.0% | 3.798177 | 579653 |Total
|-------------------------------------------------
|   70.9% |  70.9% | 2.692783 | 245380 |USER
||  97.1% |  97.1% | 2.615916 |    576 |sweep_
|||-----------------------------------------------
|||  2.2% |   2.2% | 2.753279 |     12 |pe.0
|||  2.1% |  52.8% | 2.638898 |     12 |pe.16
|||  2.0% | 100.0% | 2.523325 |     12 |pe.37
|||===============================================
. . .
||================================================
|   28.8% |  99.7% | 1.092307 | 238224 |MPI
||------------------------------------------------
||  76.1% |  76.1% | 0.831311 | 118080 |mpi_recv_
|||-----------------------------------------------
|||  2.7% |   2.7% | 1.066077 |   1440 |pe.47
|||  2.0% |  56.9% | 0.801256 |   2880 |pe.21
|||  1.4% | 100.0% | 0.573420 |   1440 |pe.0
|||===============================================
. . .
|    0.2% |  99.8% | 0.007329 |  95597 |HEAP
||------------------------------------------------
. . .
||================================================
|    0.2% | 100.0% | 0.005758 |    452 |IO
. . .
pat_report -O Keywords
• profile
• callers or ca; ca+src
• calltree or ct; ct+src
• load_balance or lb; load_balance_all or lb_all
    • load_balance_program
    • load_balance_group
    • load_balance_function
• mpi
• heap
    • heap_program
    • heap_hiwater
    • heap_leaks

Multiple values can be specified in a comma-separated list.
By default, all reports show only the PEs having the maximum, median, and minimum values.
AMD Opteron Processor
[Block diagram. Labels: Fetch; Branch Prediction; Scan/Align; Fastpath; Microcode Engine; Int Decode & Rename; FP Decode & Rename; µOPs; Instruction Control Unit (72 entries); Res (x3); AGU (x3); ALU (x3); MULT; 36-entry FP scheduler; FADD, FMUL, FMISC; 44-entry Load/Store Queue; L1 Instruction Cache 64KB; L1 Data Cache 64KB, 2-way assoc; L2 Cache 1 MB, 16-way assoc; Bus Unit; System Request Queue; Crossbar; Memory Controller; HyperTransport(TM)]
• 9-way Out-Of-Order execution
• 16 instruction bytes fetched per cycle
• 36 entry FPU instruction scheduler
• 64-bit/80-bit FP Realized throughput (1 Mul + 1 Add)/cycle: 1.9 FLOPs/cycle
• 32-bit FP Realized throughput (2 Mul + 2 Add)/cycle: 3.4+ FLOPs/cycle
Simplified memory hierarchy on the AMD Opteron
• registers
    • 16 SSE2 128-bit registers
    • 16 64-bit registers
• registers to L1 data cache: 2 x 8 Bytes per clock, i.e., either 2 loads, 1 load 1 store, or 2 stores (38 GB/s on 2.4 GHz)
• L1 data cache
    • 64 Byte cache line
    • complete data cache lines are loaded from main memory, if not in L2 cache
    • if L1 data cache needs to be refilled, then storing back to L2 cache
• L1 data cache to L2 cache: 8 Bytes per clock
• L2 cache
    • 64 Byte cache line
    • write back cache: data offloaded from L1 data cache are stored here first until they are flushed out to main memory
• to Main memory: 16 Bytes wide data bus => 6.4 GB/s for DDR400
• Main memory
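The bandwidth figures quoted on this slide follow from width times rate. The sketch below reproduces that arithmetic, assuming a 2.4 GHz core clock and treating DDR400 as 400 million transfers per second; the function and variable names are illustrative only.

```python
# Peak-bandwidth arithmetic for the figures quoted on the slide.
CLOCK_HZ = 2_400_000_000        # 2.4 GHz core clock (from the slide)
DDR400_TRANSFERS = 400_000_000  # assumption: DDR400 = 400 million transfers/s


def peak_bandwidth(bytes_per_transfer, transfers_per_second):
    """Return peak bandwidth in bytes per second."""
    return bytes_per_transfer * transfers_per_second


# Registers <-> L1 data cache: 2 x 8 bytes per clock.
l1_bw = peak_bandwidth(2 * 8, CLOCK_HZ)
# Main memory: 16-byte-wide data bus at DDR400.
mem_bw = peak_bandwidth(16, DDR400_TRANSFERS)

print(f"L1 data cache: {l1_bw / 1e9:.1f} GB/s")
print(f"Main memory:   {mem_bw / 1e9:.1f} GB/s")
```

These are peak numbers; the slide rounds the first one down to 38 GB/s.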
Hardware Performance Counters
• AMD Opteron Hardware Performance Counters
    • Four 48-bit performance counters
    • Each counter can monitor a single event
        Count specific processor events
          » the processor increments the counter when it detects an occurrence of the event
          » (e.g., cache misses)
        Duration of events
          » the processor counts the number of processor clocks it takes to complete an event
          » (e.g., the number of clocks it takes to return data from memory after a cache miss)
    • Time Stamp Counters (TSC)
        • Cycles (user time)
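To give a feel for what a 48-bit counter width means in practice, here is a small sketch. It assumes the 2.4 GHz clock used elsewhere in these slides and an event that fires once per cycle; the helper names are hypothetical and do not come from any counter library.

```python
COUNTER_BITS = 48
CLOCK_HZ = 2_400_000_000  # assumption: 2.4 GHz, one event per cycle


def seconds_until_wrap(events_per_second, bits=COUNTER_BITS):
    """Time for a counter of the given width to wrap around once."""
    return (1 << bits) / events_per_second


def counter_delta(start, end, bits=COUNTER_BITS):
    """Events between two raw readings, tolerating a single wraparound."""
    return (end - start) % (1 << bits)


wrap_hours = seconds_until_wrap(CLOCK_HZ) / 3600
print(f"A cycle counter wraps after about {wrap_hours:.1f} hours")

# A reading taken just before the wrap and one taken just after it.
print(counter_delta((1 << 48) - 10, 5))
```

The modulo in `counter_delta` is the usual way to difference two readings of a fixed-width counter; it is correct only if the counter wrapped at most once between them.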
XT3 Hardware Counters Interface
• The Performance API (PAPI) is provided by default on the XT3 software stack
• The XT3 Unicos/lc hardware counters interface was developed by Sandia National Laboratories
• Based on the perfctr kernel patch developed by Mikael Pettersson from Uppsala University
    • No hardware counters access in any major Linux distribution
        Kernel patch needed for user level access to hardware counters
    • Provides system level access to the x86 and x86-64 performance counters
    • Provides per-process 64-bit memory-mapped virtual counters
    • Provides per-process virtual Time Stamp Counter (TSC)
PAPI Predefined Events
• Common set of events deemed relevant and useful for application performance tuning
    • papiStdEventDefs.h
    • Accesses to the memory hierarchy, cache coherence protocol events, cycle and instruction counts, functional unit and pipeline status
• PAPI "avail" utility shows which predefined events are available on the system
• PAPI also provides access to native events
    • PAPI "native_avail" utility lists all AMD native events available on the system
PAPI Preset Listing
(derose@jaguar1) 184% yod -sz=1 /opt/xt-tools/papi/3.0.8.1/bin/avail
LibLustre: NAL NID: 0005dc02 (2)
Lustre: OBD class driver Build Version: 1, [email protected]
Test case avail.c: Available events and hardware information.
-------------------------------------------------------------------------
Vendor string and code   : AuthenticAMD (2)
Model string and code    : AMD K8 (13)
CPU Revision             : 1.000000
CPU Megahertz            : 2400.000000
CPU's in this Node       : 1
Nodes in this System     : 1
Total CPU's              : 1
Number Hardware Counters : 4
Max Multiplex Counters   : 32
-------------------------------------------------------------------------
Name         Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  Yes    Yes    Level 1 data cache misses ()
PAPI_L1_ICM  0x80000001  Yes    Yes    Level 1 instruction cache misses ()
PAPI_L2_DCM  0x80000002  Yes    No     Level 2 data cache misses ()
PAPI_L2_ICM  0x80000003  Yes    No     Level 2 instruction cache misses ()
PAPI_L3_DCM  0x80000004  No     No     Level 3 data cache misses ()
PAPI_L3_ICM  0x80000005  No     No     Level 3 instruction cache misses ()
PAPI_L1_TCM  0x80000006  Yes    Yes    Level 1 cache misses ()
PAPI_L2_TCM  0x80000007  Yes    Yes    Level 2 cache misses ()
PAPI_L3_TCM  0x80000008  No     No     Level 3 cache misses ()
. . .
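The event codes in this listing all start at 0x80000000, while the native event codes shown on the later slides start at 0x40000000. A minimal sketch of how such a code can be split into a kind and an index, assuming the two high bits act as the preset and native markers (the constant names below mirror PAPI's mask names, but the `classify` helper is purely illustrative):

```python
PRESET_MASK = 0x80000000  # high bit set: PAPI preset event
NATIVE_MASK = 0x40000000  # next bit set: native (processor-specific) event


def classify(code):
    """Split an event code into (kind, index) based on its marker bit."""
    if code & PRESET_MASK:
        return ("preset", code & (PRESET_MASK - 1))
    if code & NATIVE_MASK:
        return ("native", code & (NATIVE_MASK - 1))
    return ("unknown", code)


for name, code in [("PAPI_L1_DCM", 0x80000000),
                   ("PAPI_L1_TCM", 0x80000006),
                   ("FP_ADD_PIPE", 0x40000000)]:
    kind, index = classify(code)
    print(f"{name:12s} {code:#010x} {kind:7s} index {index}")
```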
PAPI avail utility
% avail -h
This is the PAPI avail program.
It provides availability and detail information
for PAPI preset and native events. Usage:

    avail [options] [event name]
    avail TESTS_QUIET

Options:

    -a            display only available PAPI preset events
    -d            display PAPI preset event info in detailed format
    -e EVENTNAME  display full detail for named preset or native event
    -h            print this help message
    -t            display PAPI preset event info in tabular format (default)
Example: avail -e PAPI_L1_TCM
Event name:                PAPI_L1_TCM
Event Code:                0x80000006
Number of Native Events:   4
Short Description:         |L1 cache misses|
Long Description:          |Level 1 cache misses|
Developer's Notes:         ||
Derived Type:              |DERIVED_ADD|
Postfix Processing String: ||

|Native Code[0]: 0x40001e1c DC_SYS_REFILL_MOES|
|Number of Register Values: 2|
|Register[0]: 0x20f P3 Ctr Mask|
|Register[1]: 0x1e43 P3 Ctr Code|
|Native Event Description: |Refill from system. Cache bits: Modified Owner Exclusive Shared|

|Native Code[1]: 0x40000037 IC_SYS_REFILL|
|Number of Register Values: 2|
|Register[0]: 0xf P3 Ctr Mask|
|Register[1]: 0x83 P3 Ctr Code|
|Native Event Description: |Refill from system|

|Native Code[2]: 0x40000036 IC_L2_REFILL|
|Number of Register Values: 2|
|Register[0]: 0xf P3 Ctr Mask|
|Register[1]: 0x82 P3 Ctr Code|
|Native Event Description: |Refill from L2|

|Native Code[3]: 0x40001e1b DC_L2_REFILL_MOES|
|Number of Register Values: 2|
|Register[0]: 0x20f P3 Ctr Mask|
|Register[1]: 0x1e42 P3 Ctr Code|
|Native Event Description: |Refill from L2. Cache bits: Modified Owner Exclusive Shared|
PAPI native_avail Utility
(derose@jaguar1) 187% yod -sz=1 /opt/xt-tools/papi/3.0.8.1/bin/native_avail |more
LibLustre: NAL NID: 0005dc05 (5)
Lustre: OBD class driver Build Version: 1, [email protected]
Test case NATIVE_AVAIL: Available native events and hardware information.
-------------------------------------------------------------------------
Vendor string and code   : AuthenticAMD (2)
Model string and code    : AMD K8 (13)
CPU Revision             : 1.000000
CPU Megahertz            : 2400.000000
CPU's in this Node       : 1
Nodes in this System     : 1
Total CPU's              : 1
Number Hardware Counters : 4
Max Multiplex Counters   : 32
-------------------------------------------------------------------------
The following correspond to fields in the PAPI_event_info_t structure.
Symbol Event Code Count
|Short Description|
|Long Description|
|Derived|
|PostFix|

The count field indicates whether it is a) available (count >= 1) and b) derived (count > 1)

FP_ADD_PIPE 0x40000000
|Dispatched FPU ops - Revision B and later revisions - Speculative add pipe ops excluding junk ops|
|Register Value[0]: 0xf P3 Ctr Mask|
|Register Value[1]: 0x100 P3 Ctr Code|

FP_MULT_PIPE 0x40000001
|Dispatched FPU ops - Revision B and later revisions - Speculative multiply pipe ops excluding junk ops|
|Register Value[0]: 0xf P3 Ctr Mask|
|Register Value[1]: 0x200 P3 Ctr Code|

. . .
Hardware Counters Selection
• PAT_RT_HWPC <set number> | <event list>
    • Specifies hardware counter events to be monitored
    • A set number can be used to select a group of predefined hardware counters events (recommended)
        CrayPat provides 9 sets on the Cray XT3
    • Alternatively a list of hardware performance counter event names can be used
        Maximum of 4 events
    • Both formats can be specified at the same time, with later definitions overriding previous definitions
    • By default, no hardware performance counter events are monitored during tracing experiments
Accuracy Issues
• Pay attention to what is not measured:
• Out-of-order processors
• Speculation
• Lack of standard on what is counted
• Microbenchmarks can help determine accuracy of the hardware counters
• For more information on AMD counters:
    • architecture manuals:
        • http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26049.PDF
• Granularity of the measured code
    • If not sufficiently large, overhead of the counter interfaces may dominate
[Diagram labels: user; interface; Kernel; Hardware counters]
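The granularity point can be illustrated with a small microbenchmark. The sketch below is only an analogy: it uses Python's software timer (`time.perf_counter_ns`) as a stand-in for a hardware counter interface, and the function names are made up for this example. It estimates the cost of one begin/end measurement pair and compares it with a tiny and a large measured region.

```python
import time


def timer_overhead_ns(samples=10_000):
    """Smallest observed cost of a begin/end pair with nothing in between."""
    best = None
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        delta = t1 - t0
        if best is None or delta < best:
            best = delta
    return best


def measure_ns(func):
    """Elapsed nanoseconds for a single call of func."""
    t0 = time.perf_counter_ns()
    func()
    return time.perf_counter_ns() - t0


overhead = timer_overhead_ns()
small = measure_ns(lambda: sum(range(10)))
large = measure_ns(lambda: sum(range(1_000_000)))

print(f"measurement overhead: ~{overhead} ns")
print(f"small region: {small} ns, large region: {large} ns")
```

When the measured region is only a few times longer than the overhead, the reading says more about the measurement interface than about the code.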
Hardware Performance Counters
PAT_RT_HWPC=1
  PAPI_TLB_DM  Data translation lookaside buffer misses
  PAPI_L1_DCA  Level 1 data cache accesses
  PAPI_FP_OPS  Floating point operations
  DC_MISS      Data Cache Miss
  User_Cycles  Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
  Time%                    96.9%
  Time                     2.619020
  Imb.Time                 0.141776
  Imb.Time%                5.2%
  Calls                    576
  PAPI_TLB_DM              9816474 misses
  PAPI_L1_DCA              125098572621 ops
  PAPI_FP_OPS              86987557635 ops
  DC_MISS                  4515210161 ops
  User time                6292017317.66667 cycles
  Utilization rate         100.0%
  HW FP Ops / Cycles       13.83 ops/cycle
  HW FP Ops / User time    86987557635 ops  14.4%peak
  HW FP Ops / WCT
  Computation intensity    0.70 ops/ref
  LD & ST per TLB miss     12743.74 ops/miss
  LD & ST per D1 miss      27.71 ops/miss
  D1 cache hit ratio       96.4%
  % TLB misses / cycle     0.0%

[Slide callouts: Flat profile data; Hard counts; Derived metrics]
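Several of the derived metrics in this report can be reproduced from the hard counts with simple ratios. The formulas in the sketch below are inferred from the reported values (they match to the printed precision) rather than taken from CrayPat documentation, so treat them as an assumption.

```python
# Hard counts for USER / sweep_ as shown on the slide (PAT_RT_HWPC=1).
counts = {
    "PAPI_TLB_DM": 9_816_474,
    "PAPI_L1_DCA": 125_098_572_621,
    "PAPI_FP_OPS": 86_987_557_635,
    "DC_MISS": 4_515_210_161,
}

# Inferred formulas (assumptions, checked against the reported values).
computation_intensity = counts["PAPI_FP_OPS"] / counts["PAPI_L1_DCA"]
ld_st_per_tlb_miss = counts["PAPI_L1_DCA"] / counts["PAPI_TLB_DM"]
ld_st_per_d1_miss = counts["PAPI_L1_DCA"] / counts["DC_MISS"]
d1_hit_ratio = 1 - counts["DC_MISS"] / counts["PAPI_L1_DCA"]

print(f"Computation intensity  {computation_intensity:.2f} ops/ref")
print(f"LD & ST per TLB miss   {ld_st_per_tlb_miss:.2f} ops/miss")
print(f"LD & ST per D1 miss    {ld_st_per_d1_miss:.2f} ops/miss")
print(f"D1 cache hit ratio     {d1_hit_ratio:.1%}")
```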
PAT_RT_HWPC=2 (Cache Info)
  PAPI_L1_DCA          Level 1 data cache accesses
  DC_L2_REFILL_MOESI   Refill from L2. Cache bits: Modified Owner Exclusive Shared Invalid
  DC_SYS_REFILL_MOESI  Refill from system. Cache bits: Modified Owner Exclusive Shared Invalid
  BU_L2_REQ_DC         Internal L2 request - DC fill
  User_Cycles          Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
  Time%                    96.9%
  Time                     2.619454
  Imb.Time                 0.142876
  Imb.Time%                5.3%
  Calls                    576
  PAPI_L1_DCA              125116346746 ops
  DC_L2_REFILL_MOESI       4519351614 ops
  DC_SYS_REFILL_MOESI      1023533083 ops
  BU_L2_REQ_DC             4729707701 req
  User time                6292985840.77083 cycles
  Utilization rate         100.0%
  L1 Data cache misses     5542884697 misses
  LD & ST per D1 miss      22.57 ops/miss
  D1 cache hit ratio       95.6%
  LD & ST per D2 miss      122.24 ops/miss
  D2 cache hit ratio       78.4%
  L2 cache hit ratio       81.5%
  Memory to D1 refill      1023533083 lines
  Memory to D1 bandwidth   65506117312 bytes
  L2 to Dcache bandwidth   289238503296 bytes
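The cache metrics in this report can be rebuilt from the four hard counts. As before, the formulas are inferred from the numbers on the slide (each one reproduces the printed value) and are an assumption, not CrayPat's documented definitions; the 64-byte line size comes from the memory hierarchy slide.

```python
LINE_BYTES = 64  # cache line size from the memory hierarchy slide

l1_dca = 125_116_346_746     # PAPI_L1_DCA
l2_refill = 4_519_351_614    # DC_L2_REFILL_MOESI
sys_refill = 1_023_533_083   # DC_SYS_REFILL_MOESI
l2_req_dc = 4_729_707_701    # BU_L2_REQ_DC

# Inferred formulas (assumptions, checked against the reported values).
l1_misses = l2_refill + sys_refill        # refills from L2 plus from memory
d1_hit = 1 - l1_misses / l1_dca
d2_hit = 1 - sys_refill / l2_req_dc       # relative to L2 requests from DC
l2_hit = 1 - sys_refill / l1_misses       # relative to L1 data misses
mem_to_d1_bytes = sys_refill * LINE_BYTES
l2_to_d1_bytes = l2_refill * LINE_BYTES

print(f"L1 data cache misses    {l1_misses}")
print(f"D1 / D2 / L2 hit ratio  {d1_hit:.1%} / {d2_hit:.1%} / {l2_hit:.1%}")
print(f"Memory to D1 bandwidth  {mem_to_d1_bytes} bytes")
print(f"L2 to Dcache bandwidth  {l2_to_d1_bytes} bytes")
```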
PAT_RT_HWPC=3 (L1 & L2 BW)
  PAPI_L1_DCM        Level 1 data cache misses
  PAPI_L1_DCA        Level 1 data cache accesses
  DC_L2_REFILL_MOES  Refill from L2. Cache bits: Modified Owner Exclusive Shared
  DC_COPYBACK_MOES   Copyback. Cache bits: Modified Owner Exclusive Shared
  User_Cycles        Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
  Time%                    96.9%
  Time                     2.619887
  Imb.Time                 0.140539
  Imb.Time%                5.2%
  Calls                    576
  PAPI_L1_DCM              4517129259 misses
  PAPI_L1_DCA              125119230870 ops
  DC_L2_REFILL_MOES        3493981416 ops
  DC_COPYBACK_MOES         5538788237 ops
  User time                6294208767.6875 cycles
  Utilization rate         100.0%
  LD & ST per D1 miss      27.70 ops/miss
  D1 cache hit ratio       96.4%
  Memory to D1 refill      1023147843 lines
  Memory to D1 bandwidth   65481461952 bytes
  L2 to Dcache bandwidth   223614810624 bytes
  Dcache to L2 bandwidth   354482447168 bytes
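The traffic figures at the bottom of this report are line counts multiplied by the 64-byte cache line size, with the memory refill count obtained by subtraction. The sketch below shows that arithmetic; the formulas are inferred from the slide's numbers and are an assumption rather than a documented CrayPat definition.

```python
LINE_BYTES = 64  # cache line size from the memory hierarchy slide

l1_dcm = 4_517_129_259     # PAPI_L1_DCM
l2_refill = 3_493_981_416  # DC_L2_REFILL_MOES
copyback = 5_538_788_237   # DC_COPYBACK_MOES

# L1 misses not satisfied from L2 are taken to be refilled from memory.
mem_to_d1_lines = l1_dcm - l2_refill

traffic_bytes = {
    "Memory to D1": mem_to_d1_lines * LINE_BYTES,
    "L2 to Dcache": l2_refill * LINE_BYTES,
    "Dcache to L2": copyback * LINE_BYTES,
}

print(f"Memory to D1 refill: {mem_to_d1_lines} lines")
for path, nbytes in traffic_bytes.items():
    print(f"{path}: {nbytes} bytes")
```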
PAT_RT_HWPC=4 (FP Mix)
  PAPI_FML_INS  Floating point multiply instructions
  PAPI_FAD_INS  Floating point add instructions
  PAPI_FP_OPS   Floating point operations
  FP_FAST_FLAG  Dispatched FPU ops that use the fast flag interface
  User_Cycles   Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
  Time%                    96.9%
  Time                     2.619768
  Imb.Time                 0.142962
  Imb.Time%                5.3%
  Calls                    576
  PAPI_FML_INS             40176676793 instr
  PAPI_FAD_INS             46825416366 instr
  PAPI_FP_OPS              87002093159 ops
  FP_FAST_FLAG             2449277001 ops
  User time                6293787713.72917 cycles
  Utilization rate         100.0%
  HW FP Ops / Cycles       13.82 ops/cycle
  HW FP Ops / User time    87002093159 ops  14.4%peak
  HW FP Ops / WCT
  FP Multiply / FP Ops     46.2%
  FP Add / FP Ops          53.8%
PAT_RT_HWPC=5 (Vectorization)
  FR_FPU_X87              Retired FPU instructions - x87 instructions
  FR_FPU_MMX_3D           Retired FPU instructions - Combined MMX and 3DNow! instructions
  FR_FPU_SSE_SSE2_PACKED  Retired FPU instructions - Combined packed SSE and SSE 2 instructions
  FR_FPU_SSE_SSE2_SCALAR  Retired FPU instructions - Combined scalar SSE and SSE 2 instructions
  User_Cycles             Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
  Time%                    96.9%
  Time                     2.619615
  Imb.Time                 0.139317
  Imb.Time%                5.2%
  Calls                    576
  FR_FPU_X87               0 instr
  FR_FPU_MMX_3D            0 instr
  FR_FPU_SSE_SSE2_PACKED   24032397312 instr
  FR_FPU_SSE_SSE2_SCALAR   101185460710 instr
  User time                6293357456.66667 cycles
  Utilization rate         100.0%

When compiled without fastsse:

========================================================================
USER / sweep_
------------------------------------------------------------------------
  Time%                    97.5%
  Time                     3.128695
  Imb.Time                 0.166962
  Imb.Time%                5.2%
  Calls                    576
  FR_FPU_X87               0 instr
  FR_FPU_MMX_3D            0 instr
  FR_FPU_SSE_SSE2_PACKED   0 instr
  FR_FPU_SSE_SSE2_SCALAR   138996966424 instr
  User time                7515016198.60417 cycles
  Utilization rate         100.0%
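One way to read the two reports side by side is to compute the share of packed SSE instructions and the time ratio between the two builds. The sketch below does this with the numbers from the slide; the choice of these two summary ratios is ours, not something CrayPat reports.

```python
# USER / sweep_ with fastsse (first report on the slide).
packed = 24_032_397_312     # FR_FPU_SSE_SSE2_PACKED
scalar = 101_185_460_710    # FR_FPU_SSE_SSE2_SCALAR
time_fastsse = 2.619615     # seconds

# USER / sweep_ without fastsse (second report on the slide).
scalar_only = 138_996_966_424
time_plain = 3.128695       # seconds

packed_fraction = packed / (packed + scalar)
instr_ratio = (packed + scalar) / scalar_only
speedup = time_plain / time_fastsse

print(f"Packed share of SSE/SSE2 instructions: {packed_fraction:.1%}")
print(f"Retired FPU instructions vs. scalar build: {instr_ratio:.1%}")
print(f"Time ratio (without / with fastsse): {speedup:.2f}x")
```

With about a fifth of the FPU instructions packed, sweep_ retires roughly 10% fewer FPU instructions and runs about 1.19 times faster in this measurement.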
PAT_RT_HWPC=6 (Stalls / Resources Idle)
  PAPI_FPU_IDL    Cycles floating point units are idle
  PAPI_STL_ICY    Cycles with no instruction issue
  PAPI_RES_STL    Cycles stalled on any resource
  IC_FETCH_STALL  Instruction fetch stall
  User_Cycles     Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
  Time%                    96.9%
  Time                     2.619334
  Imb.Time                 0.141884
  Imb.Time%                5.2%
  Calls                    576
  PAPI_FPU_IDL             500570926.75 cycles
  PAPI_STL_ICY             70217803.2916667 cycles
  PAPI_RES_STL             4140098264.75 cycles
  IC_FETCH_STALL           4631824703.22917 cycles
  User time                6292743345.83333 cycles
  Utilization rate         100.0%
  Total time stalled       4140098264.75 cycles
  Time I Fetch Stalled     4631824703.22917 cycles
  Avg Time FPUs idle       250285463.375 cycles
  Time Decoder empty       70217803.2916667 cycles
PAT_RT_HWPC=7 (Stalls / Resources Full)
  FR_DECODER_EMPTY             Nothing to dispatch - decoder empty
  FR_DISPATCH_STALLS           Dispatch stalls - D2h or DAh combined
  FR_DISPATCH_STALLS_FULL_FPU  Dispatch stall when FPU is full
  FR_DISPATCH_STALLS_FULL_LS   Dispatch stall when LS is full
  User_Cycles                  Virtual Cycles

========================================================================
USER / sweep_
------------------------------------------------------------------------
  Time%                        97.0%
  Time                         2.618878
  Imb.Time                     0.142062
  Imb.Time%                    5.3%
  Calls                        576
  FR_DECODER_EMPTY             3360773456 ops
  FR_DISPATCH_STALLS           4139586865.875 cycles
  FR_DISPATCH_STALLS_FULL_FPU  2683961106.04167 cycles
  FR_DISPATCH_STALLS_FULL_LS   1050422214.02083 cycles
  User time                    6291691461.4375 cycles
  Utilization rate             100.0%
  Total time stalled           4139586865.875 cycles
  Avg Time FPUs stalled        1341980553.02083 cycles
  Avg Time LSs stalled         525211107.010417 cycles
  Time Decoder empty           3360773456 cycles
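Stall counts are easier to judge as a fraction of user cycles. The sketch below computes those fractions from the slide's numbers. The divisor of 2 used for the "Avg Time ... stalled" lines is inferred from the reported values and is an assumption; the slide does not explain it.

```python
user_cycles = 6_291_691_461.4375
dispatch_stalls = 4_139_586_865.875   # FR_DISPATCH_STALLS
fpu_full = 2_683_961_106.04167        # FR_DISPATCH_STALLS_FULL_FPU
ls_full = 1_050_422_214.02083         # FR_DISPATCH_STALLS_FULL_LS
decoder_empty = 3_360_773_456         # FR_DECODER_EMPTY

stalled_fraction = dispatch_stalls / user_cycles
decoder_empty_fraction = decoder_empty / user_cycles

# Assumption: the report's averages equal the raw count divided by 2.
avg_fpu_stalled = fpu_full / 2
avg_ls_stalled = ls_full / 2

print(f"Dispatch stalled: {stalled_fraction:.1%} of user cycles")
print(f"Decoder empty:    {decoder_empty_fraction:.1%} of user cycles")
print(f"Avg Time FPUs stalled: {avg_fpu_stalled:.2f} cycles")
print(f"Avg Time LSs stalled:  {avg_ls_stalled:.2f} cycles")
```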
PAT_RT_HWPC Other Sets
-------------------------------------------------------------------
Set 8: Branches
  PAPI_BR_TKN   Conditional branch instructions taken
  PAPI_BR_MSP   Conditional branch instructions mispredicted
  PAPI_TOT_INS  Instructions completed
  IC_MISS       IC Miss
  User_Cycles   Virtual Cycles
-------------------------------------------------------------------
Set 9: Instructions
  PAPI_L2_ICM   Level 2 instruction cache misses
  PAPI_L1_ICA   Level 1 instruction cache accesses
  IC_MISS       IC Miss
  IC_L2_REFILL  Refill from L2
  User_Cycles   Virtual Cycles
LIBHWPC
• Instrumentation API for Fortran, C, and C++
• Event profiler with runtime summarization
• For each instrumented section provides:
    • Total count & duration (user & wall clock time)
    • Hardware performance counters information
    • Derived metrics
• Supports:
    • Multiple instrumentation sections
    • Nested instrumentation
    • Multiple calls to an instrumented section
• Requires
    • Hand instrumentation
    • Linking with libhwpc.a
LIBHWPC
• Declaration:
    • Fortran:
        • #include "hwpcf.h"
    • C & C++:
        • #include "hwpc.h"
• Four calls:
    • PAT_hwpc_init(taskID, PrgName)
    • PAT_hwpc_finalize(taskID)
    • PAT_event_begin(instID, label)
    • PAT_event_end(instID)
• Compiling and Linking
    • my.x : my.f
          $(FTN) $(FLAGS) my.f -o my.x -lhwpc -lpapi -lm
LIBHWPC Example (Set 1)
Hardware Info   : CPU: AMD K8 (1.0), 2400 MHz
Memory Hierarchy: 2 cache levels:
  L1 I Cache: 65536 Bytes, Line size: 64 Bytes, 2-way
  L1 D Cache: 65536 Bytes, Line size: 64 Bytes, 2-way
  L2 Cache: 1048576 Bytes, Line size: 64 Bytes, 16-way
  I TLB: 512 entries, 4-way
  D TLB: 512 entries, 4-way

Wall Clock time of instrumented section: 19.032111 seconds
User time of instrumented section      : 19.0317893620833 seconds
User Cycles                            : 45676294469

Section 10 (Calc1): file swim_seq.F, lines 94 <--> 98
  Number of calls: 500
  Wall Clock Time: 4.52193 seconds
  Average WCT    : 0.00904386 seconds
  Std Deviation  : 1.04463e-05
  Exclusive time : 0.055326 seconds
  User time (exc): 0.0556901 seconds (133656144 cycles)
  User time (inc): 4.52169 seconds (10852054936 cycles)

  PAPI_FP_OPS (FP operations)   : 3146878906
  PAPI_L1_DCA (L1 Data accesses): 2234140346
  DC_MISS (Total L1 Data misses): 117004940
  PAPI_TLB_DM (D TLB misses)    : 108844312

  User time             : 4.522 seconds
  Utilization rate      : 99.995 %
  HW FP Ops / Cycles    : 0.290
  HW FP Ops / User time : 695.952 M HW FP Ops/s
  HW FP Ops / WCT       : 695.915 M HW FP Ops/WCT
  Computation intensity : 1.409
  LD & ST per TLB miss  : 20.526
  LD & ST per D1 miss   : 19.094
  D1 cache hit ratio    : 94.763 %
  % TLB misses/cycle    : 1.003 %
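The user times and floating point rates in this output follow from the cycle counts and the 2400 MHz clock reported in the hardware info. A minimal sketch of that conversion, with formulas inferred from the printed values (an assumption, not libhwpc's documented definitions):

```python
CLOCK_HZ = 2_400_000_000  # "AMD K8 (1.0), 2400 MHz" from the hardware info

total_cycles = 45_676_294_469   # User Cycles of the instrumented section
inc_cycles = 10_852_054_936     # Section 10, inclusive
exc_cycles = 133_656_144        # Section 10, exclusive
fp_ops = 3_146_878_906          # PAPI_FP_OPS
tlb_misses = 108_844_312        # PAPI_TLB_DM


def cycles_to_seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to seconds at a fixed clock rate."""
    return cycles / clock_hz


inc_seconds = cycles_to_seconds(inc_cycles)
exc_seconds = cycles_to_seconds(exc_cycles)
total_seconds = cycles_to_seconds(total_cycles)
fp_per_cycle = fp_ops / inc_cycles
mflops = fp_ops / inc_seconds / 1e6
tlb_miss_pct = 100 * tlb_misses / inc_cycles

print(f"User time (inc): {inc_seconds:.5f} s, (exc): {exc_seconds:.7f} s")
print(f"User time of instrumented section: {total_seconds:.4f} s")
print(f"HW FP Ops / Cycles: {fp_per_cycle:.3f}")
print(f"HW FP Ops / User time: {mflops:.3f} M HW FP Ops/s")
print(f"% TLB misses/cycle: {tlb_miss_pct:.3f} %")
```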
pat_help & Documentation
• The pat_help utility is an interactive viewer used to access information about and examples of using CrayPat
    • pat_help [topic [subtopic...]]
• See also man pages:
    • craypat
    • pat
    • pat_build
    • pat_report
    • pat_help
    • hwpc
    • papi_counters
pat_help Example
% pat_help

The top level CrayPat/X help topics are listed below.
A good place to start is:

    overview

If a topic has subtopics, they are displayed under the heading
"Additional topics", as below. To view a subtopic, you need
only enter as many initial letters as required to distinguish
it from other items in the list. To see a table of contents
including subtopics of those subtopics, etc., enter:

    toc

To produce the full text corresponding to the table of contents,
specify "all", but preferably in a non-interactive invocation:

    pat_help all . > all_pat_help
    pat_help report all . > all_report_help

Additional topics:

    API             execute
    balance         experiment
    build           first_example
    counters        overview
    demos           report
    environment     run

pat_help (.=quit ,=back ^=up /=top ~=search)=>
Cray Apprentice2
• Call graph profile
• Communication statistics
• Time-line view
    • Communication
    • I/O
• Activity view
• Pair-wise communication statistics
• Text reports
• Source code mapping

• Cray Apprentice2 is targeted to help identify and correct:
    • Load imbalance
    • Excessive communication
    • Network contention
    • Excessive serialization
    • I/O Problems
Statistics Overview
[Screenshot callout: New feature: Switch Overview display]
Function Profile
Load Balance View (Aggregated)
[Screenshot callouts: Min, Avg, and Max Values; -1, +1 Std Dev marks]
Call Graph View
Call Graph View
Zoom
Call Graph View - Zoom
Width: inclusive time
Height: exclusive time
Load balance overview:
    Height: Max time
    Left bar: Average time
    Right bar: Min time
Call Graph View - Zoom
[Screenshot callout: Function List]
Call Graph View - Function List
[Screenshot callouts: Function List off; Mouse right click: hide node, hide children]
Call Graph Hide Children
hidden children
Call Graph Unhide One Level
Call Graph Unhide One Level (2)
Call Graph Unhide One Level (3)
Call Graph Unhide All Children
Load Balance View (from Call Graph)
-1, +1 Std Dev marks
Min, Avg, and Max Values
Source Mapping from Call Graph
Function Profile
Distribution by PE, by Call, & by Time
Environment & Execution Details
Time Line View (Sweep3D)
Time Line View (Zoom)
[Screenshot callouts: User Functions, MPI & SHMEM Line; I/O Line]
Time Line View (Fine Grain Zoom)
Activity View
Pair-wise Communication
I/O Overview
I/O Traffic Report
I/O Rates
Hardware Counters Overview
Hardware Counters Time Line
Controlling Trace File Size
• Several environment variables are available to limit trace files to a reasonable size:
    • PAT_RT_CALLSTACK
        • Limit the depth to trace the call stack
    • PAT_RT_HWPC
        • Avoid collecting hardware counters (unset)
    • PAT_RT_RECORD_PE
        • Collect trace for a subset of the PEs
    • PAT_RT_TRACE_FUNCTION_ARGS
        • Limit the number of function arguments to be traced
    • PAT_RT_TRACE_FUNCTION_LIMITS
        • Avoid tracing indicated functions
    • PAT_RT_TRACE_FUNCTION_MAX
        • Limit the maximum number of traces generated for all functions for a single process
• Use the limit built-in command for ksh(1) or csh(1) to control how much disk space the trace file can consume
09/26-28/2006LLNL Livermore, CA
Performance Measurement and Visualization on the Cray XT3
Questions / Comments
Thank You!