Programming for Performance
Prof. Dr. Michael Gerndt
Lehrstuhl für Rechnertechnik und
Rechnerorganisation/Parallelrechnerarchitektur
2
Speedup Limited by Overheads
Speedup < Sequential Work / Max (Work + Synch Time + Comm Cost + Extra Work)
3
Load Balance
• Limit on speedup:
• Work includes data access and other costs
• Not just equal work, but must be busy at same time
• Four parts to load balance
1. Identify enough concurrency
2. Decide how to manage it
3. Determine the granularity at which to exploit it
4. Reduce serialization
Speedup(p) ≤ Sequential Work / Max (Work on any processor)
4
Reducing Synch Time
• Reduce wait time due to load imbalance
• Reduce synchronization overhead
5
Reducing Synchronization Overhead
• Event synchronization
• Reduce use of conservative synchronization
– e.g. point-to-point instead of barriers, or granularity of pt-to-pt
• But fine-grained synch more difficult to program, more synch
ops.
• Mutual exclusion
• Separate locks for separate data
– e.g. locking records in a database: lock per process, record, or
field
– lock per task in task queue, not per queue
– finer grain => less contention/serialization, more space, less
reuse
• Smaller, less frequent critical sections
– don’t do reading/testing in critical section, only modification
– e.g. searching for task to dequeue in task queue, building tree
6
Implications of Load Balance/Synchronization
• Extends speedup limit expression to:
• Generally, responsibility of software
• Architecture can support task stealing and
synchronization efficiently
• Fine-grained communication, low-overhead access to queues
– efficient support allows smaller tasks, better load balance
• Accessing shared data in the presence of task stealing
– need to access data of stolen tasks
– Hardware shared address space advantageous
Speedup(p) ≤ Sequential Work / Max (Work + Synch Wait Time)
7
Reducing Inherent Communication
• Communication is expensive!
• Measure: communication to computation ratio
• Focus here on inherent communication
• Determined by assignment of tasks to processes
• Actual communication can be greater
• Assign tasks that access same data to same process
• Solving communication and load balance NP-hard in
general case
• But simple heuristic solutions work well in practice
• Applications have structure!
Speedup < Sequential Work / Max (Work + Synch Time + Comm Cost)
8
Reducing Extra Work
• Common sources of extra work:
• Computing a good partition
• Using redundant computation to avoid communication
• Task, data and process management overhead
– applications, languages, runtime systems, OS
• Imposing structure on communication
– coalescing messages, allowing effective naming
• Architectural implications:
• Reduce need by making communication and orchestration
efficient
Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
9
A Lot Depends on Sizes
• Application parameters and no. of procs affect
inherent properties
• Load balance, communication, extra work, temporal and
spatial locality
• Memory hierarchy
• Interactions with organization parameters of extended
memory hierarchy affect artifactual communication and
performance
• Effects often dramatic, sometimes small: application-
dependent
10
A Lot Depends on Sizes
[Figure: speedup (0–30) over number of processors (1–31). Left: Ocean with problem sizes N = 130, 258, 514, 1,026. Right: Barnes-Hut on Origin (16 K, 64 K, 512 K particles) and Challenge (16 K, 512 K particles).]
11
Measuring Performance
• Absolute performance
• Performance = Work / Time
• Most important to end user
• Performance improvement due to parallelism
• Speedup(p) = Performance(p) / Performance(1)
• Both should be measured
• Work is determined by input configuration of the problem
• If work is fixed, can measure performance as 1/Time
– Or retain explicit work measure (e.g. transactions/sec, bonds/sec)
– Still w.r.t particular configuration, and still what’s measured is
time
• Speedup(p) = Time(1) / Time(p)  or  Operations Per Second(p) / Operations Per Second(1)
12
Scaling: Why Worry?
• Fixed problem size is of limited usefulness
• Too small a problem:
• May be appropriate for small machine
• Parallelism overheads begin to dominate benefits for larger
machines
– Load imbalance
– Communication to computation ratio
• May even achieve slowdowns
• Doesn’t reflect real usage, and inappropriate for large
machines
– Can exaggerate benefits of architectural improvements,
especially when measured as percentage improvement in
performance
• Too large a problem
• Difficult to measure improvement (next)
13
Too Large a Problem
• Suppose problem realistically large for big machine
• May not “fit” in small machine
• Can’t run
• Thrashing to disk
• Working set doesn’t fit in cache
• Fits at some p, leading to superlinear speedup
• Finally, users want to scale problems as machines
grow
• Can help avoid these problems
14
Demonstrating Scaling Problems
• Small Ocean and big equation solver problems on SGI
Origin2000
[Figure: speedup over number of processors (1–31) on the Origin2000. Left: Ocean 258 x 258 against the ideal line, speedup up to 30. Right: grid solver 12 K x 12 K against the ideal line, speedup up to 50.]
15
Questions in Scaling
• Under what constraints to scale the application?
• What are the appropriate metrics for performance
improvement?
– work is not fixed any more, so time not enough
• How should the application be scaled?
• Definitions:
• Scaling a machine: Can scale power in many ways
– Assume adding identical nodes, each bringing memory
• Problem size: Vector of input parameters, e.g. N = (n, q, Δt)
– Determines work done
– Distinct from data set size and memory usage
– Start by assuming it’s only one parameter n, for simplicity
16
Scaling Models
• Problem constrained (PC)
• Memory constrained (MC)
• Time constrained (TC)
17
Problem Constrained Scaling
• User wants to solve same problem, only faster
• Video compression
• Computer graphics
• VLSI routing
• But limited when evaluating larger machines
Speedup_PC(p) = Time(1) / Time(p)
18
Time Constrained Scaling
• Execution time is kept fixed as system scales
• Example: User has fixed time to use machine or wait for result
• Performance = Work/Time as usual, and time is fixed,
so
• How to measure work(p)?
• Execution time on a single processor? (thrashing problems)
• The work metric should be easy to measure, ideally analytical.
• Should scale linearly with sequential complexity
– Or ideal speedup will not be linear in p (e.g. no. of rows, no. of points, no. of operations in matrix program)
• If we cannot find an intuitive application measure, as often
true, measure execution time with ideal memory system on a
uniprocessor.
Speedup_TC(p) = Work(p) / Work(1)
19
Memory Constrained Scaling (1)
• Scale so memory usage per processor stays fixed
• Speedup cannot be defined as Time(1) / Time(p) for the scaled-up problem, since Time(1) is hard to measure and inappropriate
• Inserting performance = work/time into the speedup formula gives
Speedup_MC(p) = [Work(p) / Time(p)] / [Work(1) / Time(1)] = Increase in Work / Increase in Time
20
Memory Constrained Scaling (2)
• MC scaling can lead to large increases in execution
time
• If work grows faster than linearly in memory usage
• e.g. matrix factorization with complexity n³
– 10,000-by-10,000 matrix takes 800 MB and 1 hour on a uniprocessor
– With 1,000 processors, can run 320K-by-320K matrix, but ideal
parallel time grows to 32 hours!
– With 10,000 processors, 100 hours ...
21
Scaling Down Problem Parameters
• Some parameters don’t affect parallel performance
much, but do affect runtime, and can be scaled down
• Common example is no. of time-steps in many scientific
applications
– need a few to allow settling down, but don’t need more
– may need to omit cold-start when recording time and statistics
• First look for such parameters
• But many application parameters affect key
characteristics
• Scaling them down requires scaling down no. of processors
too
• Otherwise can obtain highly unrepresentative behavior
22
Difficulties in Scaling N, p Representatively
• Want to preserve many aspects of full-scale scenario
• Distribution of time in different phases
• Key behavioral characteristics
• Scaling relationships among application parameters
• Contention and communication patterns
• Can’t really hope for full representativeness, but can
• Cover range of realistic operating points
• Avoid unrealistic scenarios
• Gain insights and estimates of performance
23
Performance Analysis Process
[Diagram: development cycle — Coding, followed by a performance-analysis loop of Measurement → Analysis → Ranking → Refinement (program tuning), until the application goes into Production.]
24
Performance Prediction and Benchmarking
• Performance analysis determines the performance on
a given machine.
• Performance prediction allows evaluating programs for a hypothetical machine. It is based on:
• runtime data of an actual execution
• machine model of the target machine
• analytical techniques
• simulation techniques
• Benchmarking determines the performance of a
computer system on the basis of a set of typical
applications.
25
Overhead Analysis
• How to decide whether a code performs well:
• Comparison of measured MFLOPS with peak performance
• Comparison with a sequential version
• Estimate distance to ideal
time via overhead classes
– tmem
– tcomm
– tsync
– tred
– ...
speedup(p) = t_s / t_p
[Figure: measured speedup over the number of processors compared to the ideal line; the gap to ideal is broken down into the overhead classes t_mem, t_comm, t_red.]
26
The Basics
• Successful tuning is a combination of
• right algorithms and libraries
• compiler flags and directives
• thinking!
• Measurement is better than guessing:
• to determine performance problems
• to validate tuning decisions and optimization
• Measurement should be repeated after each significant code modification and optimization
27
The Basics
• Do I have a performance problem at all?
• Compare MFlops/MOps to typical rate
• Speedup measurements
• What are the hot code regions?
• Flat profiling
• Is there a bottleneck in those regions?
• Single node: Hardware counter profiling
• Parallel: Synchronization and communication analysis profiling
• Does the bottleneck vary over time or processor space?
• Profiling individual processes and/or threads
• Tracing
• Does the code behave similarly for different configurations?
• Analyze runs with different processor counts
• Analyze different input configurations
28
Performance Analysis
[Diagram: iterative analysis loop — Instrumentation → Execution → Analysis, with refinement feeding back. Inputs and outputs: requirements, current hypotheses, performance data, detected bottlenecks.]
Instr: Dat → ISPEC
Info: Hyp → Dat
Prove: Hyp × Dat → {T, F}
Refine: Hyp → P(Hyp)
29
Performance Measurement Techniques
• Event model of the execution
• Events occur at a processor at a specific point in time
• Events belong to event types
– clock cycles
– cache misses
– remote references
– start of a send operation
– ...
• Profiling: Recording accumulated performance data for
events
• Sampling: Statistical approach
• Instrumentation: Precise measurement
• Tracing: Recording performance data of individual
events
30
Statistical Sampling
[Diagram: statistical sampling of a program with functions Main, Asterix, and Obelix. The CPU provides a program counter and event counters (clock cycles, cache misses, flops). An interrupt every 10 ms maps the program counter to the currently executing function, adds the counter values to that function's entry in the function table, and resets the counters.]
31
Instrumentation and Monitoring
[Diagram: calls to monitor("Obelix", "enter") and monitor("Obelix", "exit") are inserted at function entry and exit. On each call, the monitor reads the CPU's counters (e.g. the cache-miss counter) and, depending on enter or exit, starts or accumulates the measurement in the function-table entries for Main, Asterix, and Obelix.]
32
Instrumentation Techniques
• Source code instrumentation
• done by the compiler, source-to-source tool, or manually
– portability
– link back to source code easy
– re-compile necessary when instrumentation is changed
– difficult to instrument mixed-code applications
– cannot instrument system or 3rd party libraries or executables
• Binary instrumentation
• "patching" the executable to insert hooks (like a debugger)
– inverse pros/cons
• Offline
• Online
33
Instrumentation Tools
• Standard compilers
• Add callbacks for profiling functions
• Typically at function level
• Be careful of overhead for frequently called functions
• gcc, for example, adds calls if -finstrument-functions
is given.
• OPARI
• Jülich Supercomputing Center
• OpenMP for C and FORTRAN
• Source-level instrumentation see exercise
• PMPI interface
• Library interposition
• Link own library before real library, e.g. frequently used for
own malloc function.
34
Instrumentation Tools
• TAU Generic Instrumenter
• Parsers for C++, FORTRAN, UPC,…
• Creation of a program database via PDT (Program Database Toolkit)
• Approach
– Specify which string to insert before and after certain regions
– Use provided variables to access file and line information
• Limited program region types
• tau.oregon.edu
• OMPT
• Proposal for profiling API
• Based on callbacks
35
Source Code Transformation Tools
• Rose
• rosecompiler.org, LLNL
• LLVM
• Language independent code optimizer
and code generator
• http://www.llvm.org/, Univ. Illinois
• Clang C frontend for LLVM, http://clang.llvm.org/
• C/C++ and Objective C/C++
• Open64
• www.open64.net
• Compiler infrastructure based originally on the SGI compiler.
• Interprocedural and loop optimizations
36
Binary Instrumentation Tools
• Dyninst
• Dynamic instrumentation on binary level
• Context of the Paradyn project
• Univ. of Wisconsin-Madison and Univ. of Maryland
• Barton Miller, Jeff Hollingsworth
• Intel Pin
• Intel for x86
• Online instrumentation of binaries
• Valgrind
• Dynamic instrumentation
• Based on emulation of x86 machine instructions
37
Tracing
[Diagram: processes 0 to n-1 each write their own trace (Trace P0 ... Trace Pn-1). The application's functions are instrumented with monitor("Obelix", "enter") / monitor("Obelix", "exit"); the MPI library is wrapped so that MPI_Send calls monitor("MPI_Send", "enter"), then PMPI_Send(...), then monitor("MPI_Send", "exit"). Example trace of P0:]
10.4 P0 Obelix enter
10.6 P0 MPI_Send enter
10.8 P0 MPI_Send exit
38
Merging
[Diagram: the per-process traces P0 to Pn-1 are combined by a merge process into one time-sorted global trace:]
10.4 P0 Obelix enter
10.5 P1 Obelix enter
10.6 P0 MPI_Send enter
10.7 P1 MPI_Recv enter
10.8 P0 MPI_Send exit
11.0 P1 MPI_Recv exit
39
Visualization of Dynamic Behaviour
[Figure: timeline visualization of the merged trace. Horizontal bars for P0 and P1 over the time axis 10.4–11.0 show Obelix and MPI_Send on P0 and Obelix and MPI_Recv on P1.]
40
Profiling vs Tracing
• Profiling
• recording summary information (time, #calls,#misses...)
• about program entities (functions, objects, basic blocks)
• very good for quick, low cost overview
• points out potential bottlenecks
• implemented through sampling or instrumentation
• moderate amount of performance data
• Tracing
• recording information about events
• trace record typically consists of timestamp, processid, ...
• output is a trace file with trace records sorted by time
• can be used to reconstruct the dynamic behavior
• creates huge amounts of data
• needs selective instrumentation
41
Program Monitors
• Each PA tool has its own monitor
• Score-P
• Over the last years, Score-P has been developed jointly by the tools groups of
Scalasca, Vampir, and Periscope.
• Provides support for
– MPI, OpenMP, CUDA
– Profiling and tracing
– Callpath profiles
– Online Access Interface
• Cube 4 profiling data format
• OTF2 (Open Trace Format)
42
Performance Analysis
[Diagram: iterative analysis loop — Instrumentation → Execution → Analysis, with refinement feeding back. Inputs and outputs: requirements, current hypotheses, performance data, detected bottlenecks.]
Instr: Dat → ISPEC
Info: Hyp → Dat
Prove: Hyp × Dat → {T, F}
Refine: Hyp → P(Hyp)
43
Common Performance Problems with MPI
• Single node performance
• Excessive number of 2nd-level cache misses
• Low number of issued instructions
• IO
• High data volume
• Sequential IO due to IO subsystem or sequentialization in the
program
• Excessive communication
• Frequent communication
• High data volume
44
Common Performance Problems with MPI
• Frequent synchronization
• Reduction operations
• Barrier operations
• Load balancing
• Wrong data decomposition
• Dynamically changing load
45
Common Performance Problems with SM
• Single node performance
• ...
• IO
• ...
• Excessive communication
• Large number of remote memory accesses
• False sharing
• False data mapping
• Frequent synchronization
• Implicit synchronization of parallel constructs
• Barriers, locks, ...
• Load balancing
• Uneven scheduling of parallel loops
• Uneven work in parallel sections
46
Analysis Techniques
• Offline vs Online Analysis
• Offline: first generate data then analyze
• Online: generate and analyze data while application is running
• Online requires automation → limited to standard bottlenecks
• Offline suffers more from size of measurement information
• Three techniques to support user in analysis
• Source-level presentation of performance data
• Graphical visualization
• Ranking of high-level performance properties
47
Statistical Profiling based Tools
• Gprof – GNU profiling tool
• Time profiling
• Inclusive and exclusive time
• Flat profile
• Call graph profile
• Time profile based on statistical sampling; call counts based on
compiler instrumentation of function entry
• Records where each call is coming from
48
Statistical Profiling based Tools
• Allinea MAP
• Annotations to the application source code.
• Based on time series of profiles
• For parallel applications it indicates outlying processes.
49
Profiling Tools based on Instrumentation
• TAU (Tuning and Analysis
Utilities)
• Measurements are based on
instrumentation
• Visualization via paraprof
– Graphical displays of aggregated data and of data per node, context, or thread
– Topology views of performance data
• Scalasca
• Cube performance visualizer
• Profiles based on Score-P
• Call-path profiling
50
Trace-based Analysis Tools
• Vampir
• Graphical views presenting
events and summary data
• Flexible scrolling and
zooming features
• OTF2 trace format
generated by Score-P
• Commercial license
• www.vampir.eu
51
Trace-based Analysis Tools
• Paraver
• Barcelona Supercomputing
Center
• MPI, OMP, pthreads, OmpSs,
CUDA
• http://www.bsc.es/computer-sciences/performance-tools/paraver
• Clustering of program phases, i.e. segments between MPI calls
• Recently tracking of clusters in time series of profiles based on
object tracking
52
Automatic Analysis Tools
• Paradyn
• University of Wisconsin Madison
• Periscope
• TU München
• Automatic detection of formalized performance properties
• Profile data
• Distributed online tool
• Scalasca
• Search for performance patterns in traces
• Post-mortem on parallel resources of the application
• Visualization of patterns in CUBE