Introduction to the Intel tracing tools
R. Bader (LRZ)
M. Deilmann (Intel)
Help - my parallel program doesn’t scale!
Isolate performance issues in parallel programs
... especially those in large MPI programs
inefficient MPI programming
bottlenecks
latency-dominated performance
load imbalance
deadlock (hmmm ...)
... but also those via subroutine time consumption
Existing facilities:
gprof for subroutines
PMPI profiling interface (MPI standard!) + upshot
... have problems with large parallel programs: Trace files cannot be easily analyzed
Intel Tracing Tools
Solve the above problem:
integration of subroutine and MPI profiling
originally developed by ZI für Mathematik / FZ Jülich
marketed by Pallas, now Intel
Platforms:
x86 (Solaris & Linux)
IPF, (EM64T) (Linux)
Future development on non-Intel platforms unclear
Vampir NG (FZ Jülich, ZIH Dresden)
Supercomputers: Altix 3700 and 4700, NEC SX-4/5/6/8, CRAY
Version installed at LRZ:
campus license
GUI version 7.1
tracing library version 7.1
At RRZE: not available (trace libraries are rather expensive)
Documentation and resources
Intel web site: http://www.intel.com/cd/software/products/asmo-na/eng/cluster/tanalyzer
Note the name mapping:
VAMPIR → Intel Trace Analyzer
VAMPIRtrace → Intel Trace Collector
LRZ web site: http://www.lrz-muenchen.de/services/software/parallel/vampir
see especially the links to the user’s guides at the end of this document
specific Usage instructions for each platform the tracing libraries are available on
Basic usage of tracing libraries and GUI
Two Components of ITT
1. Instrumentation
MPI library calls: divert to the profiling interface
achieved by specifying the -vtrace switch to the mpif90/mpicc commands
this is an LRZ-specific setting to make things work uniformly across platforms
Intel MPI: -t=log instead
In most cases, relinking is sufficient
Additional functionality: subroutine tracing, switching tracing on and off
requires source code changes and recompilation
2. Visualization
during/after program run tracefile is written to disk
may be visualized on any platform using the ITA GUI
many views of the trace data available via a transparent menu structure
hundreds of MPI processes viewable with good performance
STF (Structured Trace File) format: still lots of data
Step 1: Instrument your Code
Example: MPI heat conduction program
uses non-blocking sends and receives as well as reduction operations
sets up MPI data types and a topology (MPI_CART_CREATE)
set up environment: module load mpi_tracing
recompile code (for inclusion of subroutine tracing):
mpif90 -vtrace -c <lots_of_options> <source_file>
relink binary:
mpif90 -vtrace -o heat_mpi.exe <lots_of_objects>
(heat_mpi.exe is the name of our application)
Warning: do not specify MPI libraries at your own discretion; this ruins the link sequence set up by -vtrace!
Step 2: Prepare Configuration File
Edit a file vt_heat to contain the following lines:
# Log file
LOGFILE-NAME heat.stf
LOGFILE-FORMAT STF
# disable all MPI activity
ACTIVITY MPI OFF
# enable all bcasts, recvs and sends
SYMBOL MPI_WAITALL ON
SYMBOL MPI_IRECV ON
SYMBOL MPI_ISEND ON
SYMBOL MPI_BARRIER ON
SYMBOL MPI_ALLREDUCE ON
# enable all activities in the Application class
ACTIVITY Application ON
Step 3: Run the application
Set up environment:
module load mpi_tracing   (if not yet done)
export VT_CONFIG=vt_heat
Run the executable:
mpirun -np 4 ./heat_mpi.exe
Upon completion you should get the message
[0] Intel Trace Collector INFO: Writing tracefile heat.stf in /home/cluster/a2832ba/Kurse/ParProg_2005/vt/heat
and a number of files heat.stf* should exist.
To save disk space, convert to the STFSINGLE format:
stftool heat.stf --convert - --logfile-format STFSINGLE | \
gzip -c > heat.stfsingle.gz
Do not use this format for viewing!
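If you later want to look at such an archived trace again, it must be unpacked and converted back to STF first; a minimal sketch, assuming stftool accepts the STFSINGLE file as input and a file name as the --convert target (please check the stftool documentation):
gunzip heat.stfsingle.gz
stftool heat.stfsingle --convert heat.stf --logfile-format STF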
Step 4: Start the GUI
X11R5 or higher is needed
Is X Authentication working (Use ssh –X ...)?
Now enter
traceanalyzer pmatmul.stf
The Qt-based Trace Analyzer GUI should now start up, showing the following widgets:
Main window
inside the main window: a panel referring to the presently analyzed tracefile
there can be more than one panel/tracefile in the main window at any time
And this is what things look like at start:
Chart: Function Profile (flat profile)
Overall load balance
times are inclusive of traced subroutine calls
Resolving the MPI calls: right-click → Ungroup Group MPI
can sort by category
Times for each MPI routine specified in the config file
Call tree
Notes:
call tree can only resolve activities switched on for tracing
By default, user code is not resolvable unless automatic or manual subroutine tracing is compiled in
(automatic subroutine tracing not yet available on Intel)
Call tree split up into user processes
Select “Children of Group All_Processes”
can also select expansion into various groups, at least two of which are defined:
MPI
Application
(click at the dotted circle)
... call up function group editor
Select e.g., one of the major function groups and press OK
You will then essentially filter out all other groups from the view ...
... like so:
Now a different view: timeline of all MPI processes
Select Charts → Event Timeline
this will open an additional pane in the tracing subwindow
then zoom in to some region of interest by using the left mouse button
do this repeatedly to obtain ...
... this small, 0.0001 second trace section
red parts (different shades): the various MPI activities
blue parts: user code
black lines: communication; left-click for the context menu → Message Properties ... to obtain further information about this message
Note that the profile window (below) adjusts itself to the selected time slice!
Next chart option: Qualitative Timeline
This gives co-synchronous information on, e.g., transfer rates
other quantities selectable via the context menu:
transfer duration
transfer volume
selectable events can be:
function events
messages (as here)
collective operations
Quantitative Timeline: Accumulated measure of activities
This answers the question:
How many CPUs are presently engaged in each activity?
Example:
The yellow arrow shows an interval where the application does communication exclusively
Note:
can remove activities via the context menu if too cluttered
The final chart type: Message Profile
gives you metrics:
message volume
throughput
time
count
for senders (horizontal) vs.
receivers (vertical)
Note: color codes enable you to easily find sore points
non-dense communication patterns are good
Some advanced features:
The Vampirtrace API
Controlling tracefile output
MPI message checking
User-level instrumentation: The ITC API (1)
Subroutine library:
control the profiling process from within the program
define your own activity classes and their members
define performance counters (not discussed here)
Include files:
Fortran: VT.inc
C: VT.h
Fortran calls differ from C calls
this presentation refers to the Fortran calls
refer to the user's guide for further information
The VT API has changed over time
existing instrumentation may need updating for newer releases
User-level instrumentation: The ITC API (2)
initialize / finalize:
VTINIT(ierr) is automatically called from MPI_INIT(...)
for tracing of non-MPI programs:
an explicit call is required
use libVTsp.a if available
VTFINI(ierr) is called from MPI_FINALIZE(...)
control:
VTTRACEON(), VTTRACEOFF(): switch tracing on/off
VTFLUSH(ierr): write memory buffers to the flush file
user-defined states:
group subroutine calls into a class of activities ...
call VTCLASSDEF('mylib', mylib_handle, ierr)
... and then add symbols for each state (subroutine?)
call VTFUNCDEF('mystate1', mylib_handle, mystate1_handle, ierr)
(etc.) ... until the group is complete
(the handles are default integers)
User-level instrumentation: The ITC API (3)
Actual measurement: start with
call VTBEGIN(mystate1_handle, ierr)
stop with
call VTEND(mystate1_handle, ierr)
nested calls are possible ... but no overlaps!
program prog
  include 'VT.inc'
  ...
  [ call vttraceoff() ]
  call mpi_init(...)
  call vtclassdef('mylib', mylib_handle, ierr)
  call vtfuncdef('name', mylib_handle, name_handle, ierr)
  [ call vttraceon(ierr) ]
  ...

subroutine name(....)
  include 'VT.inc'
  (declarations)
  call vtbegin(name_handle, ierr)
  (executable statements)
  call vtend(name_handle, ierr)

(name_handle: global variable)
Controlling tracefile output (1)
Tracefile production: trace data are stored in memory buffers, possibly controlled by configuration entries:
MEM-BLOCKSIZE: size of the buffers (default 64 kByte)
MEM-MAXBLOCKS: maximum number of memory buffers (0 = unlimited)
VT should not use too much memory (may disrupt the application!)
Exhaustion of the memory buffers can happen after some time ⇒ what to do?
Applications which hang or crash ⇒ ?
Environment variables:
VT_CONFIG: name of the configuration file
VT_CONFIG_RANK: rank of the MPI process reading the configuration file
Writing the trace file:
the default is MPI process ID 0
always specify name and format of the log file in the configuration file!
change the rank of the process writing the trace file via a LOGFILE-RANK <number> entry in the configuration file
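For illustration, a configuration sketch combining these entries with the Step 2 example (the numeric values are arbitrary examples, and the unit of MEM-BLOCKSIZE is assumed here to be bytes; consult the user's guide for the exact value syntax):
# log file name and format, as in vt_heat
LOGFILE-NAME heat.stf
LOGFILE-FORMAT STF
# let MPI rank 1 write the trace file instead of the default rank 0
LOGFILE-RANK 1
# memory buffers: 32 blocks of 256 kByte each (example values)
MEM-BLOCKSIZE 262144
MEM-MAXBLOCKS 32
The file is again selected at run time via export VT_CONFIG=<config file name>.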
Controlling tracefile output (2)
coping with buffer overrun
a) Flush data to disk: default action (AUTOFLUSH is on)
possibly use suitable MEM-FLUSHBLOCKS value to trigger background flushing
b) Overwrite from the beginning: only the last part of the tracefile is written
AUTOFLUSH off
MEM-OVERWRITE on
c) Stop trace collection: only the first part of the tracefile is written
AUTOFLUSH off
MEM-OVERWRITE off (default)
coping with tracefile size
Long runs may produce tens (or hundreds) of GBytes!
STF reduces the difficulties with visualizing this
Frame definition (not discussed here)
Reduction of size (a combined configuration sketch is shown below):
insert VTTRACEON() and VTTRACEOFF() calls
use activity/symbol filtering via the configuration file (e.g., no administrative MPI calls)
use b) or c) shown at left
trace only a subset of processes:
PROCESS 0:N off
PROCESS 0:N:3 on
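As an illustration, several of the size-reduction measures named above could be combined in one configuration file like this (a sketch assembled from entries shown on this and the previous slides; whether the combination is sensible depends on the application):
# on buffer overrun, keep only the last part of the trace (variant b)
AUTOFLUSH off
MEM-OVERWRITE on
# filter out administrative MPI calls, keep only selected ones
ACTIVITY MPI OFF
SYMBOL MPI_ISEND ON
SYMBOL MPI_IRECV ON
SYMBOL MPI_WAITALL ON
# trace only every third process
PROCESS 0:N off
PROCESS 0:N:3 on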
Controlling tracefile output (3): Treatment of crashing applications
In this case: replace the -vtrace switch by -vtrace_fs ("failsafe tracing")
presently available only for Fortran on the Altix
(replacement of the tracing library)
Failures handled:
signals (SIGINT, SIGTERM); SIGKILL is not caught
premature exit of processes without MPI_Finalize
MPI errors (communication problems, wrong parameters)
What is done?
freezes the MPI processes
writing is done via TCP sockets
SIGINT is sent after the data have been written
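For the heat example from Step 1, switching to failsafe tracing simply means relinking with the alternative switch (a sketch based on the earlier link line):
mpif90 -vtrace_fs -o heat_mpi.exe <lots_of_objects>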
Controlling tracefile output (4): Treatment of hanging applications
Deadlock detection is performed automatically:
if ITC observes no progress for a certain amount of time in any process, it assumes a deadlock,
stops the application and writes a trace file.
Timeout is configurable via DEADLOCK-TIMEOUT
"no progress" is defined as "inside the same MPI call"; obviously this is just a heuristic approach and may fail:
if all processes remain in MPI for a long time, e.g. due to a long data transfer, the timeout might be reached ⇒ premature abort
default timeout is 5 minutes
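If long-running transfers cause such premature aborts, the timeout can be increased via the configuration file; a sketch (the value syntax shown here is an assumption; see the user's guide for the exact format):
DEADLOCK-TIMEOUT 20m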
Further functionality
Recording of source locations:
enables you to dive into the source from a message line or function in the GUI
potentially very large performance overhead
presently only for GCC-based MPI
Automatic subroutine tracing:
the class Application can be ungrouped
no manual instrumentation needed
use the -tcollect switch in addition to -vtrace (see the sketch below)
Intel compiler 10.0 or higher is needed
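For the heat example, automatic subroutine tracing then amounts to rebuilding with both switches (a sketch based on the Step 1 compile and link lines; whether -tcollect is also required at link time may depend on the installation):
mpif90 -vtrace -tcollect -c <lots_of_options> <source_file>
mpif90 -vtrace -tcollect -o heat_mpi.exe <lots_of_objects>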
MPI message checking
Error detection for MPI code
also makes use of the MPI profiling interface
supported module stack on LRZ systems: see the module commands below
completely recompile the application
no static linkage
Run executable with LD_PRELOAD set:
the report is written to stderr; check lines marked ERROR or WARNING
Further environment variables for execution control:
VT_DEADLOCK_TIMEOUT (default 60 sec)
VT_DEADLOCK_WARNING (default 300 s)
VT_CHECK_MAX_ERRORS (default 1)
VT_CHECK_MAX_REPORTS (default 0 = unlimited)
use -genv to propagate these (see the example below)
module unload mpi_tracing
module unload mpi.parastation mpi.altix   # may need to unload further modules
module load mpi.intel
module load mpi_tracing
mpiexec -genv LD_PRELOAD libVTmc.so \
  -n [No. of tasks] ./myprog.exe
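The execution-control variables listed above can be propagated to all tasks in the same way; a sketch with arbitrary example values:
mpiexec -genv LD_PRELOAD libVTmc.so \
  -genv VT_CHECK_MAX_ERRORS 10 -genv VT_CHECK_MAX_REPORTS 100 \
  -n [No. of tasks] ./myprog.exe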
Message checking: Example output for insufficient buffering
[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
... many more info lines
[0] ERROR: LOCAL:BUFFER:INSUFFICIENT_BUFFER: error
[0] ERROR: Buffer [0x6000000000056920, 0x600000000005698e[ of size 110 cannot store message of size 111.
[0] ERROR: Free space [0x6000000000056920, 0x600000000005698e[, 110 bytes.
[0] ERROR: Check buffer handling (use larger buffer in MPI_Buffer_attach(),
[0] ERROR: receive the oldest message(s) to free up space before buffering new ones,
[0] ERROR: check for race conditions between buffering and receiving messages, etc).
[0] ERROR: Note that message sizes are calculated using the worst-case scenario that
[0] ERROR: the application has to be prepared for: MPI_Pack_size() + MPI_BSEND_OVERHEAD.
[0] ERROR: New message of size 111 was to be sent by:
[0] ERROR: MPI_Bsend(*buf=0x6006480, count=16, datatype=MPI_CHAR, dest=1, tag=100, comm=MPI_COMM_WORLD)
[0] ERROR: testpairs (/home/cluster/a2832ba/size.c:85)
[0] ERROR: wrongbuffer (/home/cluster/a2832ba/size.c:358)
[0] ERROR: main (/home/cluster/a2832ba/size.c:378)
[0] ERROR: __libc_start_main (/lib/libc-2.4.so)
[0] ERROR: _start (/home/cluster/a2832ba/size)
[0] INFO: 1 error, limit CHECK-MAX-ERRORS reached => aborting
points out source lines and faulty MPI call
provides recommendations on fixing the problem
MPI checking: Final remarks
further possible environment settings: deduce from <program>.prot
deadlock detection for a large number of MPI tasks:
possibly need to readjust the timeouts,
else large messages may provoke false positives (spurious deadlock reports)
Documentation: check the Intel Trace Collector User's Guide
on x86: Valgrind can also be integrated for memory checking