PAPI and Dynaprof: Application Signatures and Performance Analysis of Scientific Applications. Philip J. Mucci, Innovative Computing Laboratory, UTK; Performance Evaluation Research Center, LBL. 11/17/02. http://icl.cs.utk.edu/~mucci/dynaprof/snapshots/sc2002.ppt
11/17/02
1
PAPI and Dynaprof
Application Signatures and Performance Analysis of Scientific Applications
Philip J. Mucci
Innovative Computing Laboratory, UTK
● Understanding the behavior of the application
– Identification of bottlenecks.
– Usage of the hardware resources.
– Effects of that usage on performance.
● Using Dynaprof to achieve that goal
– Command line usage
– 3 Dynaprof probes
● LD_LIBRARY_PATH: Colon-separated list of directories to search for shared libraries. It must locate:
– DynInst library
– PAPI library
– Any dependencies of the above (libperfctr.so, libcpc.so)
● DYNINSTAPI_RT_LIB: Full pathname of the DynInst runtime library.
● No settings are necessary for the AIX/DPCL port.
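The two settings above can be sketched as a small helper that prepares an environment before launching the tool. This is only an illustration: the install directories and the runtime library file name below are hypothetical placeholders, not paths taken from the Dynaprof distribution.

```python
import os


def build_dynaprof_env(lib_dirs, dyninst_rt_lib, base_env=None):
    """Return a copy of the environment with the two variables set.

    lib_dirs       -- directories holding the DynInst and PAPI libraries
                      and their dependencies (hypothetical paths here)
    dyninst_rt_lib -- full pathname of the DynInst runtime library
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # New directories go first so they win over any older entries.
    parts = list(lib_dirs) + ([existing] if existing else [])
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    env["DYNINSTAPI_RT_LIB"] = dyninst_rt_lib
    return env


# Hypothetical install locations, for illustration only.
env = build_dynaprof_env(
    ["/opt/dyninst/lib", "/opt/papi/lib"],
    "/opt/dyninst/lib/libdyninstAPI_RT.so.1",
    base_env={"LD_LIBRARY_PATH": "/usr/local/lib"},
)
print(env["LD_LIBRARY_PATH"])
print(env["DYNINSTAPI_RT_LIB"])
```

The returned dictionary could be passed as the `env` argument of `subprocess.run` when starting `dynaprof` from a script.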
Running Dynaprof
● Usage:
dynaprof [-d] [serial_application]
● -d enables debugging output
● Specifying an application automatically loads it into the tool immediately after initialization.
Command Line Interface
● Uses GNU Readline library for input
● Full featured Command Line Editing
– File and command completion: <Tab>
– History: <Up>/<Down>
● Settings, macros and aliases in ~/.inputrc
● Allows Emacs or VI style bindings
– set editing-mode emacs
– set editing-mode vi
● See man page, TexInfo file or home page.
Load command
● Starts the application and stops it at the first instruction.
● Usage:
load <application> [args]
> dynaprof
(dynaprof) load tests/fpsx
Poeload command
● For use with MPI applications on AIX and DPCL.
– DPCL < 3.2.5 requires full path
● Usage:
poeload <application> [args]
(dynaprof) poeload tests/swim -procs 2
Mpiload command
● For use with MPI applications.
● Stops the application after it calls PMPI_Init().
● Mostly useful for script driven execution of MPI jobs
● Usage:
mpiload <application> [args]
(dynaprof) mpiload tests/mpicount
Attach command
● Attaches to a running application (or poe process) and stops it.
● Usage:
attach <application> <pid>
(dynaprof) ^Z
> tests/fspx &
[2] 17500
> fg
(dynaprof) attach tests/fspx 17500
Poeattach Command
● For use with MPI applications on AIX and DPCL.
– DPCL < 3.2.5 requires full path
● Usage:
poeattach <application> <pid_of_poe>
(dynaprof) ^Z
> poe ex19 -da_grid_x 56 -da_grid_y 56 -procs 2 &
[2] 17500
> fg
(dynaprof) poeattach ex19 17500
List command
● list
– List all modules in process
● list <pattern>
– List all matching modules
● list <module>
– List all functions in module
● list <module> <pattern>
– List all matching functions in module
● list <module> <function>
(dynaprof) use [probe [args]]
● Use by itself displays current probe.
● To change options, respecify probe.
● 4 probes in this release
– Wallclock: Real time clock
– PAPI: Hardware metrics
– Perfometer: Real-time visualization of streaming hardware metrics
Instr command
● instr
– List all instrumented functions
● instr module <pattern> [arg]
– Instrument all functions in modules matching pattern
● instr function <module> <pattern> [arg]
– Instrument all functions matching pattern in module
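To see which functions a pattern would select before instrumenting, the matching can be sketched as below. This assumes patterns behave like shell-style wildcards, as `calc*` in the SWIM example suggests; Dynaprof's actual matching rules may differ. The `calc*_` names come from the SWIM session in these slides, while `inital_` and `MAIN__` are hypothetical extras.

```python
from fnmatch import fnmatchcase


def matching_functions(functions, pattern):
    """Return the function names selected by a shell-style pattern."""
    return [name for name in functions if fnmatchcase(name, pattern)]


# calc*_ names appear in the SWIM session; the last two are made up.
functions = ["calc1_", "calc2_", "calc3_", "calc3z_", "inital_", "MAIN__"]

print(matching_functions(functions, "calc*"))
print(matching_functions(functions, "calc?_"))
```

A narrower pattern such as `calc?_` leaves out `calc3z_`, which is one way to keep the number of instrumented functions down.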
Threads and Dynaprof Probes
● For threaded code, use the same probe!
● Dynaprof detects threads and loads a special version of the probe library.
● Each probe specifies what to do when a new thread is discovered.
● Each thread gets the same instrumentation.
Probe Warning
● Instrumentation is not free.
● Consider granularity of region being measured.
● Overhead for PAPI 2.3 is O(100) cycles.
– Between 500 and 2000 cycles for a 2 counter read.
● Overhead for Wallclock is O(100) cycles.
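The granularity point can be made concrete with a rough estimate of how much of a measurement is probe overhead. This is a simplified model that assumes the probe cost simply adds to the region's own cycles on every call; the region sizes are hypothetical, and 2000 cycles is the upper bound quoted above for a 2 counter read.

```python
def overhead_fraction(region_cycles, probe_cycles):
    """Fraction of the measured cycles spent in the probe itself."""
    if region_cycles < 0 or probe_cycles < 0:
        raise ValueError("cycle counts must be non-negative")
    total = region_cycles + probe_cycles
    return probe_cycles / total if total else 0.0


PROBE_CYCLES = 2000  # upper bound from the slide for a 2 counter read

# Hypothetical region sizes, from a tiny function to a long loop nest.
for region in (10_000, 100_000, 1_000_000):
    pct = 100.0 * overhead_fraction(region, PROBE_CYCLES)
    print(f"{region:>9} cycles per call: {pct:.2f}% overhead")
```

Under this model a 10,000 cycle function loses about a sixth of its measured time to the probe, while a million cycle region is barely perturbed.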
Wallclock Probe
● High resolution, low latency timer
● Usage:
use wallclockprobe
● Reports time in microseconds (1.0 x 10^-6 s).
PAPI Probe
● Count PAPI Presets or Native Events
● Usage:
use papiprobe [event,event,...]
● Default argument is PAPI_FP_INS, or PAPI_TOT_INS if the architecture doesn't support PAPI_FP_INS.
● Available events can be obtained by using:
papi_avail -a
PAPI Probe and Multiplexing
● Requesting more metrics than the number of physical counters automatically enables multiplexing.
● Minimum runtime of instrumented regions must be observed, such that all virtual counters get a chance to run at least once.
run-time_min = num_events * 0.01 s
● Automatic warning functionality is being rolled into PAPI.
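The minimum-runtime rule above is simple enough to check in a few lines. The sketch below uses the slide's formula with its 0.01 s slice per event; the function names and the example of 6 events on 2 physical counters are illustrative assumptions, not part of PAPI or Dynaprof.

```python
MULTIPLEX_SLICE_SECONDS = 0.01  # per-event slice from the slide's formula


def needs_multiplexing(num_events, physical_counters):
    """True when more events are requested than counters exist."""
    return num_events > physical_counters


def min_region_runtime(num_events, slice_seconds=MULTIPLEX_SLICE_SECONDS):
    """Shortest region runtime that lets every event run at least once."""
    if num_events < 1:
        raise ValueError("need at least one event")
    return num_events * slice_seconds


# Hypothetical case: 6 events requested on a CPU with 2 counters.
events, counters = 6, 2
if needs_multiplexing(events, counters):
    print(f"multiplexing on, regions should run >= "
          f"{min_region_runtime(events):.2f} s")
```

Regions shorter than this bound may leave some events unsampled, so their counts should be treated with caution.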
PAPI Native Events
● Look in the PAPI distribution
● See the README file for your architecture in the src directory
● See the example program tests/native.c in the src/tests directory
Power 3 Events
PAPI_L1_DCM Yes Level 1 data cache misses (PM_LD_MISS_L1,PM_ST_L1MISS)
PAPI_L1_ICM No Level 1 instruction cache misses (PM_IC_MISS)
PAPI_L1_TCM Yes Level 1 cache misses (PM_IC_MISS,PM_LD_MISS_L1,PM_ST_L1MISS)
PAPI_CA_SNP No Requests for a snoop (PM_SNOOP)
PAPI_CA_SHR No Requests for exclusive access to shared cache line (PM_SNOOP_E_TO_S)
PAPI_CA_ITV No Requests for cache line intervention (PM_SNOOP_PUSH_INT)
PAPI_BRU_IDL No Cycles branch units are idle (PM_BRU_IDLE)
PAPI_FXU_IDL No Cycles integer units are idle (PM_FXU_IDLE)
PAPI_FPU_IDL No Cycles floating point units are idle (PM_FPU_IDLE)
PAPI_LSU_IDL No Cycles load/store units are idle (PM_LSU_IDLE)
PAPI_TLB_TL No Total translation lookaside buffer misses (PM_TLB_MISS)
PAPI_L1_LDM No Level 1 load misses (PM_LD_MISS_L1)
PAPI_L1_STM No Level 1 store misses (PM_ST_L1MISS)
PAPI_L2_LDM No Level 2 load misses (PM_LD_MISS_EXCEED_L2)
PAPI_L2_STM No Level 2 store misses (PM_ST_MISS_EXCEED_L2)
PAPI_BTAC_M No Branch target address cache misses (PM_BTAC_MISS)
PAPI_PRF_DM No Data prefetch cache misses (PM_PREF_MATCH_DEM_MISS)
PAPI_TLB_SD No Translation lookaside buffer shootdowns (PM_TLBSYNC_RERUN)
PAPI_CSR_FAL No Failed store conditional instructions (PM_ST_COND_FAIL)
PAPI_CSR_SUC No Successful store conditional instructions (PM_RESRV_CMPL)
PAPI_CSR_TOT No Total store conditional instructions (PM_RESRV_RQ)
PAPI_MEM_SCY Yes Cycles Stalled Waiting for memory accesses (PM_CMPLU_WT_LD,PM_CMPLU_WT_ST)
PAPI_MEM_RCY No Cycles Stalled Waiting for memory Reads (PM_CMPLU_WT_LD)
PAPI_MEM_WCY No Cycles Stalled Waiting for memory writes (PM_CMPLU_WT_ST)
PAPI_STL_ICY No Cycles with no instruction issue (PM_0INST_DISP)
PAPI_STL_CCY No Cycles with no instructions completed (PM_0INST_CMPL)
PAPI_BR_CN No Conditional branch instructions (PM_CBR_DISP)
PAPI_BR_MSP No Conditional branch instructions mispredicted (PM_MPRED_BR_CAUSED_GC)
PAPI_BR_PRC No Conditional branch instructions correctly predicted (PM_BR_PRED)
Power 3 Events 2
PAPI_FMA_INS No FMA instructions completed (PM_EXEC_FMA)
PAPI_TOT_IIS No Instructions issued (PM_INST_DISP)
PAPI_TOT_INS No Instructions completed (PM_INST_CMPL)
PAPI_INT_INS Yes Integer instructions (PM_FXU0_PROD_RESULT,PM_FXU1_PROD_RESULT,PM_FXU2_PROD_RESULT)
PAPI_FP_INS Yes Floating point instructions (PM_FPU0_CMPL,PM_FPU1_CMPL)
PAPI_LD_INS No Load instructions (PM_LD_CMPL)
PAPI_SR_INS No Store instructions (PM_ST_CMPL)
PAPI_BR_INS No Branch instructions (PM_BR_CMPL)
PAPI_FLOPS Yes Floating point instructions per second (PM_CYC,PM_FPU0_CMPL,PM_FPU1_CMPL)
PAPI_TOT_CYC No Total cycles (PM_CYC)
PAPI_IPS Yes Instructions per second (PM_CYC,PM_INST_CMPL)
PAPI_LST_INS Yes Load/store instructions completed (PM_LD_CMPL,PM_ST_CMPL)
PAPI_SYC_INS No Synchronization instructions completed (PM_SYNC)
PAPI_FDV_INS No Floating point divide instructions (PM_FPU_FDIV)
PAPI_FSQ_INS No Floating point square root instructions (PM_FPU_FSQRT)
Power 4 Events
PAPI_L1_DCM Yes Level 1 data cache misses (PM_LD_MISS_L1,PM_ST_MISS_L1)
PAPI_FXU_IDL No Cycles integer units are idle (PM_FXU_IDLE)
PAPI_TLB_DM No Data translation lookaside buffer misses (PM_DTLB_MISS)
PAPI_TLB_IM No Instruction translation lookaside buffer misses (PM_ITLB_MISS)
PAPI_TLB_TL Yes Total translation lookaside buffer misses (PM_DTLB_MISS,PM_ITLB_MISS)
PAPI_L1_LDM No Level 1 load misses (PM_LD_MISS_L1)
PAPI_L1_STM No Level 1 store misses (PM_ST_MISS_L1)
PAPI_STL_ICY No Cycles with no instruction issue (PM_0INST_FETCH)
PAPI_HW_INT No Hardware interrupts (PM_EXT_INT)
PAPI_FMA_INS No FMA instructions completed (PM_FPU_FMA)
PAPI_TOT_IIS No Instructions issued (PM_INST_DISP)
PAPI_TOT_INS No Instructions completed (PM_INST_CMPL)
PAPI_INT_INS No Integer instructions (PM_FXU_FIN)
PAPI_FP_INS No Floating point instructions (PM_FPU_FIN)
PAPI_FLOPS Yes Floating point instructions per second (PM_CYC,PM_FPU_FIN)
PAPI_TOT_CYC No Total cycles (PM_CYC)
PAPI_IPS Yes Instructions per second (PM_CYC,PM_INST_CMPL)
PAPI_L1_DCA Yes Level 1 data cache accesses (PM_LD_REF_L1,PM_ST_REF_L1)
PAPI_L1_DCR No Level 1 data cache reads (PM_LD_REF_L1)
PAPI_L1_DCW No Level 1 data cache writes (PM_ST_REF_L1)
PAPI_FDV_INS No Floating point divide instructions (PM_FPU_FDIV)
PAPI_FSQ_INS No Floating point square root instructions (PM_FPU_FSQRT)
Pentium III Events
PAPI_L1_DCM No Level 1 data cache misses (0x45,0x45)
PAPI_L1_ICM No Level 1 instruction cache misses (0xf28,0xf28)
PAPI_L2_ICM No Level 2 instruction cache misses (0x68,0x68)
PAPI_L1_TCM No Level 1 cache misses (0xf2e,0xf2e)
PAPI_L2_TCM No Level 2 cache misses (0x24,0x24)
PAPI_CA_SHR No Requests for exclusive access to shared cache line (0x22e,0x22e)
PAPI_CA_CLN No Requests for exclusive access to clean cache line (0x66,0x66)
PAPI_CA_INV No Requests for cache line invalidation (0x69,0x69)
PAPI_CA_ITV No Requests for cache line intervention (0x4007b,0x4007b)
PAPI_TLB_IM No Instruction translation lookaside buffer misses (0x85,0x85)
PAPI_L1_LDM No Level 1 load misses (0xf29,0xf29)
PAPI_L1_STM No Level 1 store misses (0xf2a,0xf2a)
PAPI_L2_LDM Yes Level 2 load misses (0x24,0x25)
PAPI_L2_STM No Level 2 store misses (0x25,0x25)
PAPI_BTAC_M No Branch target address cache misses (0xe2,0xe2)
PAPI_HW_INT No Hardware interrupts (0xc8,0xc8)
PAPI_BR_CN No Conditional branch instructions (0xc4,0xc4)
PAPI_BR_TKN No Conditional branch instructions taken (0xc9,0xc9)
PAPI_BR_NTK Yes Conditional branch instructions not taken (0xc4,0xc9)
PAPI_BR_MSP No Conditional branch instructions mispredicted (0xc5,0xc5)
PAPI_BR_PRC Yes Conditional branch instructions correctly predicted (0xc4,0xc5)
PAPI_TOT_IIS No Instructions issued (0xd0,0xd0)
PAPI_TOT_INS No Instructions completed (0xc0,0xc0)
PAPI_FP_INS No Floating point instructions (0xc1,0x0)
PAPI_BR_INS No Branch instructions (0xc4,0xc4)
PAPI_VEC_INS No Vector/SIMD instructions (0xb0,0xb0)
PAPI_FLOPS Yes Floating point instructions per second (0xc1,0x79)
Intel Pentium IV Events
PAPI_L1_DCM No Level 1 data cache misses (0x0003b000/0x12000204@0x8000000c)
PAPI_L2_DCM No Level 2 data cache misses (0x0003b000/0x12000204@0x8000000c)
PAPI_L1_LDM No Level 1 load misses (0x0003b000/0x12000204@0x8000000c)
PAPI_L1_STM No Level 1 store misses (0x0003b000/0x12000204@0x8000000c)
PAPI_L2_LDM No Level 2 load misses (0x0003b000/0x12000204@0x8000000c)
PAPI_L2_STM No Level 2 store misses (0x0003b000/0x12000204@0x8000000c)
PAPI_TOT_INS No Instructions completed (0x00039000/0x04000204@0x8000000c)
PAPI_FP_INS No Floating point instructions
PAPI_TOT_CYC No Total cycles (0x00ff9000/0x7e000004@0x8000000d)
(Arguments to perfex -e from PerfCtr distribution)
Sun UltraSparc II Events
PAPI_L1_ICM Yes Level 1 instruction cache misses (0x8,0x8)
PAPI_L2_TCM Yes Level 2 cache misses (0xc,0xc)
PAPI_CA_SNP No Requests for a snoop (-1,0xe)
PAPI_CA_INV No Requests for cache line invalidation (0xe,-1)
PAPI_L1_LDM Yes Level 1 load misses (0x9,0x9)
PAPI_L1_STM Yes Level 1 store misses (0xa,0xa)
PAPI_BR_MSP No Conditional branch instructions mispredicted (-1,0x2)
PAPI_TOT_IIS No Instructions issued (-1,0x1)
PAPI_TOT_INS No Instructions completed (-1,0x1)
PAPI_LD_INS No Load instructions (0x9,-1)
PAPI_SR_INS No Store instructions (0xa,-1)
PAPI_TOT_CYC No Total cycles (0x0,0x0)
PAPI_IPS Yes Instructions per second (0x0,0x1)
PAPI_L1_DCR No Level 1 data cache reads (0x9,-1)
PAPI_L1_DCW No Level 1 data cache writes (0xa,-1)
PAPI_L1_ICH No Level 1 instruction cache hits (-1,0x8)
PAPI_L2_ICH No Level 2 instruction cache hits (-1,0xf)
PAPI_L1_ICA No Level 1 instruction cache accesses (0x8,-1)
PAPI_L2_TCH No Level 2 total cache hits (-1,0xc)
PAPI_L2_TCA No Level 2 total cache accesses (0xc,-1)
Sun UltraSparc III Events
PAPI_L1_ICM No Level 1 instruction cache misses (-1,0x8)
PAPI_L2_ICM No Level 2 instruction cache misses (-1,0xf)
PAPI_L2_TCM No Level 2 cache misses (-1,0xc)
PAPI_TLB_DM No Data translation lookaside buffer misses (-1,0x12)
PAPI_TLB_IM No Instruction translation lookaside buffer misses (-1,0x11)
PAPI_L1_LDM No Level 1 load misses (-1,0x9)
PAPI_L1_STM No Level 1 store misses (-1,0xa)
PAPI_BR_MSP No Conditional branch instructions mispredicted (-1,0x2)
PAPI_TOT_IIS No Instructions issued (0x1,0x1)
PAPI_TOT_INS No Instructions completed (0x1,0x1)
PAPI_FP_INS Yes Floating point instructions (0x18,0x27)
PAPI_TOT_CYC No Total cycles (0x0,0x0)
PAPI_IPS Yes Instructions per second (0x0,0x1)
PAPI_L1_DCR No Level 1 data cache reads (0x9,-1)
PAPI_L1_DCW No Level 1 data cache writes (0xa,-1)
PAPI_L1_ICH No Level 1 instruction cache hits (0x8,-1)
PAPI_L1_ICA Yes Level 1 instruction cache accesses (0x8,0x8)
PAPI_L2_TCH Yes Level 2 total cache hits (0xc,0xc)
PAPI_L2_TCA No Level 2 total cache accesses (0xc,-1)
PAPI_FML_INS No Floating point multiply instructions (-1,0x27)
PAPI_FAD_INS No Floating point add instructions (0x18,-1)
MIPS R12K Events
PAPI_L1_DCM No Level 1 data cache misses (25)
PAPI_L1_ICM No Level 1 instruction cache misses (9)
PAPI_L2_DCM No Level 2 data cache misses (26)
PAPI_L2_ICM No Level 2 instruction cache misses (10)
PAPI_L1_TCM Yes Level 1 cache misses (9,25)
PAPI_L2_TCM Yes Level 2 cache misses (10,26)
PAPI_CA_SHR No Requests for exclusive access to shared cache line (31)
PAPI_CA_INV No Requests for cache line invalidation (13)
PAPI_CA_ITV No Requests for cache line intervention (12)
PAPI_TLB_TL No Total translation lookaside buffer misses (23)
PAPI_PRF_DM No Data prefetch cache misses (17)
PAPI_CSR_FAL No Failed store conditional instructions (5)
PAPI_CSR_SUC Yes Successful store conditional instructions (20,5)
PAPI_CSR_TOT No Total store conditional instructions (20)
PAPI_BR_CN No Conditional branch instructions (6)
PAPI_BR_MSP No Conditional branch instructions mispredicted (24)
PAPI_BR_PRC Yes Conditional branch instructions correctly predicted (6,24)
PAPI_TOT_IIS No Instructions issued (1)
PAPI_TOT_INS No Instructions completed (15)
PAPI_FP_INS No Floating point instructions (21)
PAPI_LD_INS No Load instructions (18)
PAPI_SR_INS No Store instructions (19)
PAPI_FLOPS Yes Floating point instructions per second (0,21)
PAPI_TOT_CYC No Total cycles (0)
PAPI_IPS Yes Instructions per second (0,15)
PAPI_LST_INS Yes Load/store instructions completed (18,19)
Alpha/DADD 21264 Events
PAPI_L1_ICM No Level 1 instruction cache misses (0x3)
PAPI_L2_TCM No Level 2 cache misses (0x1)
PAPI_TLB_DM No Data translation lookaside buffer misses (0x2)
PAPI_BR_UCN No Unconditional branch instructions (0x15)
PAPI_BR_CN No Conditional branch instructions (0x16)
PAPI_BR_NTK No Conditional branch instructions not taken (0x18)
PAPI_BR_MSP No Conditional branch instructions mispredicted (0x19)
PAPI_BR_PRC No Conditional branch instructions correctly predicted (0x1a)
PAPI_TOT_IIS No Instructions issued (0x7)
PAPI_TOT_INS No Instructions completed (0x8)
PAPI_INT_INS No Integer instructions (0x9)
PAPI_FP_INS No Floating point instructions (0x14)
PAPI_LD_INS No Load instructions (0xa)
PAPI_SR_INS No Store instructions (0xb)
PAPI_TOT_CYC No Total cycles (0x0)
PAPI_LST_INS No Load/store instructions completed (0xc)
PAPI_SYC_INS No Synchronization instructions completed (0xd)
PAPI_FML_INS No Floating point multiply instructions (0x11)
PAPI_FAD_INS No Floating point add instructions (0x10)
PAPI_FDV_INS No Floating point divide instructions (0x12)
PAPI_FSQ_INS No Floating point square root instructions (0x13)
Perfometer Probe
● Sends a stream of performance data every N seconds to the Perfometer GUI.
● Functions can be colored at instrumentation time.
– Default color is white, 0xFFFFFF
● Usage:
use perfometerprobe [0xRRGGBB]
instr <args> <0xRRGGBB>
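The 0xRRGGBB argument packs three 8-bit channels into one hexadecimal number. A minimal sketch of how such a value decomposes is shown below; `parse_rgb` is a hypothetical helper written for this illustration, not part of Dynaprof or Perfometer.

```python
def parse_rgb(color):
    """Split a 0xRRGGBB string into (red, green, blue) components."""
    value = int(color, 16)  # base 16 accepts an optional 0x prefix
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError(f"not a 24-bit color: {color}")
    red = (value >> 16) & 0xFF
    green = (value >> 8) & 0xFF
    blue = value & 0xFF
    return red, green, blue


# Colors used for calc1_, calc2_ and calc3_ in the SWIM session,
# followed by the default white.
for color in ("0xff0000", "0x00ff00", "0x0000ff", "0xFFFFFF"):
    print(color, parse_rgb(color))
```

Checking a color this way before passing it to `instr` can catch a mistyped value such as one with too many digits.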
Perfometer Probe 2
● Perfometer GUI is NOT launched automatically.
● showrgb in X11 lists colors and names.
● Run the Java GUI
– java -jar Perfometer.jar
● Connect up to the specified hostname and port.
Instrumenting SWIM with perfometerprobe
Module perfometerprobe.so was loaded.
Module libperfometer.so was loaded.
Module libpapi.so was loaded.
(dynaprof) instr function swim.F calc1_ 0xff0000
swim.F, inserted 1 instrumentation points
(dynaprof) instr function swim.F calc2_ 0x00ff00
swim.F, inserted 1 instrumentation points
(dynaprof) instr function swim.F calc3_ 0x0000ff
swim.F, inserted 1 instrumentation points
(dynaprof) run
Module libnss_files.so.2 was loaded.
Module libnss_nisplus.so.2 was loaded.
Module libnsl.so.1 was loaded.
Module libnss_dns.so.2 was loaded.
Module libresolv.so.2 was loaded.
Perfometer client awaiting connection on port #33733
Instrumenting FSPX for Instructions Per Cycle
(dynaprof) use probes/papiprobe PAPI_TOT_CYC, PAPI_TOT_INS
Module papiprobe.so was loaded.
Module libpapi.so was loaded.
Module libperfctr.so was loaded.
(dynaprof) instr module update.F
update.F, inserted 3 instrumentation points
(dynaprof) instr module pde.F
(dynaprof) instr
proflux_
flux_
pde_
(dynaprof) instr module phase.F
phase.F, inserted 1 instrumentation points
(dynaprof) instr
proflux_
flux_
pde_
phase_
Instrumenting SWIM for Instructions Per Cycle
(dynaprof) use probes/papiprobe PAPI_TOT_CYC, PAPI_TOT_INS
Module papiprobe.so was loaded.
Module libpapi.so was loaded.
Module libperfctr.so was loaded.
(dynaprof) instr function swim.F calc*
swim.F, inserted 3 instrumentation points
(dynaprof) instr
calc1_
calc2_
calc3_
calc3z_
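Once PAPI_TOT_INS and PAPI_TOT_CYC have been collected for each function, instructions per cycle is just their ratio. The sketch below shows that arithmetic only; the counts are made-up numbers for illustration, not measurements from SWIM, and the dictionary layout is an assumption rather than the probe's actual report format.

```python
def instructions_per_cycle(tot_ins, tot_cyc):
    """IPC from PAPI_TOT_INS and PAPI_TOT_CYC counts for one region."""
    if tot_cyc <= 0:
        raise ValueError("cycle count must be positive")
    return tot_ins / tot_cyc


# Hypothetical (PAPI_TOT_INS, PAPI_TOT_CYC) pairs per function.
counts = {
    "calc1_": (1_200_000, 2_000_000),
    "calc2_": (3_000_000, 2_000_000),
    "calc3_": (500_000, 2_000_000),
}

for name, (tot_ins, tot_cyc) in counts.items():
    print(f"{name:8s} IPC = {instructions_per_cycle(tot_ins, tot_cyc):.2f}")
```

Functions with a low ratio are the natural candidates for a closer look with cache or stall events.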
Reporting Probe Data
● The wallclock and PAPI probes produce very similar data.
● Both use a parsing script written in Perl.
– wallclockrpt <file>
– papiproberpt <file>
● Binary distribution for 4 Platforms on the website
– AIX 3.x / DPCL 3.2.5 on Power 3
– Linux / DynInst 3.0 on Pentium <= III
– Solaris 2.8 / DynInst 3.0 on UltraSparc II/III
– IRIX / DynInst 3.0 on MIPS R10/12/14k
– Power 4 and Pentium 4 are coming...