Programming for Performance Prof. Dr. Michael Gerndt Lehrstuhl für Rechnertechnik und Rechnerorganisation/Parallelrechnerarchitektur

May 14, 2020

Transcript
Page 1: Programming for Performance

Prof. Dr. Michael Gerndt
Lehrstuhl für Rechnertechnik und Rechnerorganisation/Parallelrechnerarchitektur

Page 2: Speedup Limited by Overheads

Speedup < Sequential Work / Max (Work + Synch Time + Comm Cost + Extra Work)

Page 3: Load Balance

• Limit on speedup:

• Work includes data access and other costs

• Not just equal work, but must be busy at same time

• Four parts to load balance

1. Identify enough concurrency

2. Decide how to manage it

3. Determine the granularity at which to exploit it

4. Reduce serialization

Speedup(p) ≤ Sequential Work / Max (Work on any Processor)
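The load-balance bound above is easy to evaluate numerically; a minimal sketch (function name and work figures are illustrative, not from the slides):

```python
def speedup_bound(work_per_processor):
    # Sequential work divided by the maximum work assigned to any
    # one processor: the busiest processor limits the speedup.
    sequential_work = sum(work_per_processor)
    return sequential_work / max(work_per_processor)

# Perfect balance: 4 processors with 25 units each -> bound of 4.
print(speedup_bound([25, 25, 25, 25]))  # 4.0

# Imbalance: one processor holds 40 of 100 units -> bound of 2.5.
print(speedup_bound([40, 20, 20, 20]))  # 2.5
```

Note that equal total work is not sufficient in practice; as the slide says, processors must also be busy at the same time.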

Page 4: Reducing Synch Time

• Reduce wait time due to load imbalance

• Reduce synchronization overhead

Page 5: Reducing Synchronization Overhead

• Event synchronization

• Reduce use of conservative synchronization

– e.g. point-to-point instead of barriers, or granularity of pt-to-pt

• But fine-grained synch more difficult to program, more synch ops.

• Mutual exclusion

• Separate locks for separate data

– e.g. locking records in a database: lock per process, record, or field

– lock per task in task queue, not per queue

– finer grain => less contention/serialization, more space, less reuse

• Smaller, less frequent critical sections

– don’t do reading/testing in critical section, only modification

– e.g. searching for task to dequeue in task queue, building tree
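The "separate locks for separate data" idea can be sketched with one lock per record instead of a single global lock; the critical section covers only the modification (class name and sizes are illustrative):

```python
import threading

class Records:
    """Toy record store with one lock per record rather than one
    global lock (finer grain => less contention, more lock storage)."""
    def __init__(self, n):
        self.values = [0] * n
        self.locks = [threading.Lock() for _ in range(n)]  # lock per record

    def add(self, rec, delta):
        # Only the modification happens inside the critical section.
        with self.locks[rec]:
            self.values[rec] += delta

db = Records(4)
# Four threads updating four different records never contend.
threads = [threading.Thread(target=lambda r=r: [db.add(r, 1) for _ in range(10000)])
           for r in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(db.values)  # [10000, 10000, 10000, 10000]
```

With a single global lock the result would be the same, but all four threads would serialize on every update.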

Page 6: Implications of Load Balance/Synchronization

• Extends speedup limit expression to:

• Generally, responsibility of software

• Architecture can support task stealing and synchronization efficiently

• Fine-grained communication, low-overhead access to queues

– efficient support allows smaller tasks, better load balance

• Accessing shared data in the presence of task stealing

– need to access data of stolen tasks

– Hardware shared address space advantageous

Speedup(p) ≤ Sequential Work / Max (Work + Synch Time)

Page 7: Reducing Inherent Communication

• Communication is expensive!

• Measure: communication to computation ratio

• Focus here on inherent communication

• Determined by assignment of tasks to processes

• Actual communication can be greater

• Assign tasks that access same data to same process

• Solving communication and load balance is NP-hard in the general case

• But simple heuristic solutions work well in practice

• Applications have structure!

Speedup < Sequential Work / Max (Work + Synch Time + Comm Cost)
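For a structured case, the communication-to-computation ratio can be written down directly. A sketch for an n x n nearest-neighbour grid partitioned into p square blocks (the formula is the standard one for this decomposition, not taken from the slides):

```python
import math

def comm_to_comp_ratio(n, p):
    # Each process computes n^2/p interior points per step and
    # exchanges its 4 block edges of n/sqrt(p) points each.
    side = n / math.sqrt(p)
    comm = 4 * side        # boundary points exchanged per step
    comp = n * n / p       # points computed per step
    return comm / comp     # = 4*sqrt(p)/n

# The ratio grows with p and shrinks with n: problem size matters.
print(comm_to_comp_ratio(1026, 16))
print(comm_to_comp_ratio(130, 16))
```

This is inherent communication only; artifactual communication from the memory hierarchy comes on top.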

Page 8: Reducing Extra Work

• Common sources of extra work:

• Computing a good partition

• Using redundant computation to avoid communication

• Task, data and process management overhead

– applications, languages, runtime systems, OS

• Imposing structure on communication

– coalescing messages, allowing effective naming

• Architectural implications:

• Reduce need by making communication and orchestration efficient

Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)

Page 9: A Lot Depends on Sizes

• Application parameters and no. of procs affect inherent properties

• Load balance, communication, extra work, temporal and spatial locality

• Memory hierarchy

• Interactions with organization parameters of extended memory hierarchy affect artifactual communication and performance

• Effects often dramatic, sometimes small: application-dependent

Page 10: A Lot Depends on Sizes

[Figure: Speedup vs. number of processors (1-31) for Ocean (N = 130, 258, 514, 1,026) and Barnes-Hut (Origin with 16 K, 64 K, 512 K bodies; Challenge with 16 K, 512 K bodies).]

Page 11: Measuring Performance

• Absolute performance

• Performance = Work / Time

• Most important to end user

• Performance improvement due to parallelism

• Speedup(p) = Performance(p) / Performance(1)

• Both should be measured

• Work is determined by input configuration of the problem

• If work is fixed, can measure performance as 1/Time

– Or retain explicit work measure (e.g. transactions/sec, bonds/sec)

– Still w.r.t. particular configuration, and still what's measured is time

• Speedup(p) = Time(1) / Time(p)  or  Operations Per Second (p) / Operations Per Second (1)
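When work is fixed, the two speedup definitions above give the same number; a trivial sketch (the figures are made up):

```python
def speedup_time(time_1, time_p):
    # Speedup(p) = Time(1) / Time(p), for fixed work.
    return time_1 / time_p

def speedup_rate(ops_per_sec_p, ops_per_sec_1):
    # Equivalent form with an explicit work rate (e.g. transactions/sec).
    return ops_per_sec_p / ops_per_sec_1

# 64 s sequentially, 4 s in parallel -> speedup 16.
print(speedup_time(64.0, 4.0))    # 16.0
# 8000 bonds/sec in parallel vs 500 sequentially -> same speedup.
print(speedup_rate(8000, 500))    # 16.0
```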

Page 12: Scaling: Why Worry?

• Fixed problem size is of limited usefulness

• Too small a problem:

• May be appropriate for small machine

• Parallelism overheads begin to dominate benefits for larger machines

– Load imbalance

– Communication to computation ratio

• May even achieve slowdowns

• Doesn't reflect real usage, and inappropriate for large machines

– Can exaggerate benefits of architectural improvements, especially when measured as percentage improvement in performance

• Too large a problem

• Difficult to measure improvement (next)

Page 13: Too Large a Problem

• Suppose problem realistically large for big machine

• May not “fit” in small machine

• Can’t run

• Thrashing to disk

• Working set doesn’t fit in cache

• Fits at some p, leading to superlinear speedup

• Finally, users want to scale problems as machines grow

• Can help avoid these problems

Page 14: Demonstrating Scaling Problems

• Small Ocean and big equation solver problems on SGI Origin2000

[Figure: Speedup vs. number of processors (1-31), with ideal curves, for Ocean (258 x 258) and a grid solver (12 K x 12 K).]

Page 15: Questions in Scaling

• Under what constraints to scale the application?

• What are the appropriate metrics for performance improvement?

– work is not fixed any more, so time not enough

• How should the application be scaled?

• Definitions:

• Scaling a machine: Can scale power in many ways

– Assume adding identical nodes, each bringing memory

• Problem size: Vector of input parameters, e.g. N = (n, q, Δt)

– Determines work done

– Distinct from data set size and memory usage

– Start by assuming it’s only one parameter n, for simplicity

Page 16: Scaling Models

• Problem constrained (PC)

• Memory constrained (MC)

• Time constrained (TC)

Page 17: Problem Constrained Scaling

• User wants to solve same problem, only faster

• Video compression

• Computer graphics

• VLSI routing

• But limited when evaluating larger machines

Speedup_PC(p) = Time(1) / Time(p)

Page 18: Time Constrained Scaling

• Execution time is kept fixed as system scales

• Example: User has fixed time to use machine or wait for result

• Performance = Work/Time as usual, and time is fixed, so

Speedup_TC(p) = Work(p) / Work(1)

• How to measure work(p)?

• Execution time on a single processor? (thrashing problems)

• The work metric should be easy to measure, ideally analytical.

• Should scale linearly with sequential complexity

– Or ideal speedup will not be linear in p (e.g. no. of rows, no. of points, no. of operations in matrix program)

• If we cannot find an intuitive application measure, as is often true, measure execution time with an ideal memory system on a uniprocessor.

Page 19: Memory Constrained Scaling (1)

• Scale so memory usage per processor stays fixed

• Speedup cannot be defined as Time(1) / Time(p) for the scaled-up problem, since Time(1) is hard to measure and inappropriate

• Inserting performance = work/time in the speedup formula gives

Speedup_MC(p) = (Work(p) / Time(p)) / (Work(1) / Time(1)) = (Work(p) / Work(1)) × (Time(1) / Time(p)) = Increase in Work / Increase in Time

Page 20: Memory Constrained Scaling (2)

• MC scaling can lead to large increases in execution time

• If work grows faster than linearly in memory usage

• e.g. matrix factorization with complexity n³

– 10,000-by-10,000 matrix takes 800 MB and 1 hour on a uniprocessor

– With 1,000 processors, can run a 320K-by-320K matrix, but ideal parallel time grows to 32 hours!

– With 10,000 processors, 100 hours ...
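The factorization numbers above can be reproduced in a few lines, assuming O(n²) memory and O(n³) work as in the example (320K is the slide's rounding of the ~316K computed here):

```python
def mc_scaled_time(p, n1=10_000, t1_hours=1.0):
    # p processors bring p times the memory; with O(n^2) memory usage
    # the matrix side grows by sqrt(p).
    n_p = n1 * p ** 0.5
    work_growth = (n_p / n1) ** 3          # O(n^3) work
    ideal_time = t1_hours * work_growth / p  # perfectly parallel time
    return n_p, ideal_time

print(mc_scaled_time(1_000))    # n ~ 316K, ideal time ~ 32 hours
print(mc_scaled_time(10_000))   # n = 1M, ideal time = 100 hours
```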

Page 21: Scaling Down Problem Parameters

• Some parameters don't affect parallel performance much, but do affect runtime, and can be scaled down

• A common example is the no. of time-steps in many scientific applications

– need a few to allow settling down, but don't need more

– may need to omit cold-start when recording time and statistics

• First look for such parameters

• But many application parameters affect key characteristics

• Scaling them down requires scaling down the no. of processors too

• Otherwise can obtain highly unrepresentative behavior

Page 22: Difficulties in Scaling N, p Representatively

• Want to preserve many aspects of full-scale scenario

• Distribution of time in different phases

• Key behavioral characteristics

• Scaling relationships among application parameters

• Contention and communication patterns

• Can’t really hope for full representativeness, but can

• Cover range of realistic operating points

• Avoid unrealistic scenarios

• Gain insights and estimates of performance

Page 23: Performance Analysis Process

[Diagram: Program tuning cycle - Coding feeds into Performance Analysis (Measurement → Analysis → Ranking → Refinement), which leads to Production.]

Page 24: Performance Prediction and Benchmarking

• Performance analysis determines the performance on a given machine.

• Performance prediction allows programs to be evaluated for a hypothetical machine. It is based on:

• runtime data of an actual execution

• machine model of the target machine

• analytical techniques

• simulation techniques

• Benchmarking determines the performance of a computer system on the basis of a set of typical applications.

Page 25: Overhead Analysis

• How to decide whether a code performs well:

• Comparison of measured MFLOPS with peak performance

• Comparison with a sequential version

• Estimate distance to ideal time via overhead classes

– tmem

– tcomm

– tsync

– tred

– ...

speedup(p) = ts / tp

[Figure: speedup vs. #processors; the gap between measured and ideal speedup is broken down into tmem, tcomm, and tred.]

Page 26: The Basics

• Successful tuning is a combination of

• right algorithms and libraries

• compiler flags and directives

• thinking!

• Measurement is better than guessing:

• to determine performance problems

• to validate tuning decisions and optimization

• Measurement should be repeated after each significant code modification and optimization

Page 27: The Basics

• Do I have a performance problem at all?

• Compare MFlops/MOps to typical rate

• Speedup measurements

• What are the hot code regions?

• Flat profiling

• Is there a bottleneck in those regions?

• Single node: Hardware counter profiling

• Parallel: Synchronization and communication analysis profiling

• Does the bottleneck vary over time or processor space?

• Profiling individual processes and/or threads

• Tracing

• Does the code behave similarly for different configurations?

• Analyze runs with different processor counts

• Analyze different input configurations

Page 28: Performance Analysis

[Diagram: Analysis loop - Instrumentation → Execution → Analysis, with refinement back to instrumentation. Current hypotheses determine the instrumentation requirements; execution yields performance data; analysis yields detected bottlenecks.]

Instr: Dat → ISPEC
Info: Hyp → Dat
Prove: Hyp × Dat → {T,F}
Refine: Hyp → P(Hyp)

Page 29: Performance Measurement Techniques

• Event model of the execution

• Events occur at a processor at a specific point in time

• Events belong to event types

– clock cycles

– cache misses

– remote references

– start of a send operation

– ...

• Profiling: Recording accumulated performance data for events

• Sampling: Statistical approach

• Instrumentation: Precise measurement

• Tracing: Recording performance data of individual events

Page 30: Statistical Sampling

[Diagram: Statistical sampling - a program with routines Main, Asterix, and Obelix runs on a CPU with a program counter, cycle counter, cache miss counter, and flop counter. An interrupt fires every 10 ms; the program counter is mapped to the current routine in a function table, and the counters are added to that routine's entry and reset.]
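The mechanism can be imitated in a few lines of Python: a profiling timer interrupts every 10 ms and the handler charges the sample to whatever function is executing. This is a POSIX-only sketch (signal.ITIMER_PROF counts CPU time), not a real hardware-counter sampler:

```python
import collections
import signal
import time

samples = collections.Counter()

def handler(signum, frame):
    # Charge the sample to the currently executing function, like
    # mapping the program counter through a function table.
    samples[frame.f_code.co_name] += 1

def busy(seconds):
    deadline = time.time() + seconds
    x = 0
    while time.time() < deadline:   # burn CPU so ITIMER_PROF advances
        x += 1
    return x

signal.signal(signal.SIGPROF, handler)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)  # interrupt every 10 ms
busy(0.3)
signal.setitimer(signal.ITIMER_PROF, 0, 0)        # stop sampling
print(samples.most_common())  # busy() should dominate the profile
```

The statistical nature is visible here: short-lived functions may receive no samples at all, and the counts are only proportional to the time spent.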

Page 31: Instrumentation and Monitoring

[Diagram: Function Obelix is instrumented with call monitor("Obelix", "enter") and call monitor("Obelix", "exit"). The monitor routine reads the CPU's cache miss counter on enter and exit and accumulates the per-routine difference in a function table (Main, Asterix, Obelix).]

Page 32: Instrumentation Techniques

• Source code instrumentation

• done by the compiler, source-to-source tool, or manually

– portability

– link back to source code easy

– re-compile necessary when instrumentation is changed

– difficult to instrument mixed-code applications

– cannot instrument system or 3rd party libraries or executables

• Binary instrumentation

• „patching“ the executable to insert hooks (like a debugger)

– inverse pros/cons

• Offline

• Online

Page 33: Instrumentation Tools

• Standard compilers

• Add callbacks for profiling functions

• Typically at function level

• Be careful of overhead for frequently called functions

• gcc, for example, adds calls if -finstrument-functions is given.

• OPARI

• Jülich Supercomputing Center

• OpenMP for C and FORTRAN

• Source-level instrumentation (see exercise)

• PMPI interface

• Library interposition

• Link own library before the real library, e.g. frequently used for own malloc function.
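As an analogy to the enter/exit hooks gcc inserts with -finstrument-functions, Python exposes comparable callbacks through sys.setprofile. This illustrates the hook idea only; it is not the gcc mechanism:

```python
import collections
import sys

calls = collections.Counter()

def profiler(frame, event, arg):
    # Invoked by the interpreter at every function entry, analogous
    # to the __cyg_profile_func_enter hook that gcc-compiled code calls.
    if event == "call":
        calls[frame.f_code.co_name] += 1

def helper():
    return 42

def main():
    return sum(helper() for _ in range(5))

sys.setprofile(profiler)
result = main()
sys.setprofile(None)       # remove the hook: instrumented runs pay overhead
print(calls["helper"])     # 5
```

The overhead warning from the slide applies directly: the hook fires on every call, so frequently called small functions distort the measurement most.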

Page 34: Instrumentation Tools

• TAU Generic Instrumenter

• Parsers for C++, FORTRAN, UPC,…

• Creation of PDT (Program Database Toolkit)

• Approach

– Specify which string to insert before and after certain regions

– Use provided variables to access file and line information

• Limited program region types

• tau.uoregon.edu

• OMPT

• Proposal for profiling API

• Based on callbacks

Page 35: Source Code Transformation Tools

• Rose

• rosecompiler.org, LLNL

• LLVM

• Language independent code optimizer and code generator

• http://www.llvm.org/, Univ. Illinois

• Clang C frontend for LLVM, http://clang.llvm.org/

• C/C++ and Objective C/C++

• Open64

• www.open64.net

• Compiler infrastructure based originally on the SGI compiler.

• Interprocedural and loop optimizations

Page 36: Binary Instrumentation Tools

• Dyninst

• Dynamic instrumentation on binary level

• Context of the Paradyn project

• Univ. Wisconsin-Madison, Maryland

• Bart Miller, Jeff Hollingsworth

• Intel Pin

• Intel for x86

• Online instrumentation of binaries

• Valgrind

• Dynamic instrumentation

• Based on emulation of x86 machine instructions

Page 37: Tracing

Function Obelix (...)
  call monitor("Obelix", "enter")
  ...
  call monitor("Obelix", "exit")
end Obelix

MPI Library:
Function MPI_send (...)
  call monitor("MPI_send", "enter")
  ...
  call PMPI_send(...)
  call monitor("MPI_send", "exit")
end MPI_send

[Diagram: Processes 0 to n-1 each write their own trace file (Trace P0 ... Trace Pn-1).]

Trace P0:

10.4 P0 Obelix enter

10.6 P0 MPI_Send enter

10.8 P0 MPI_Send exit

Page 38: Merging

[Diagram: A merge process combines the per-process traces (Trace P0 ... Trace Pn-1) into one time-sorted trace for P0 - Pn-1:]

10.4 P0 Obelix enter

10.5 P1 Obelix enter

10.6 P0 MPI_Send enter

10.7 P1 MPI_Recv enter

10.8 P0 MPI_Send exit

11.0 P1 MPI_Recv exit
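The merge step is a k-way merge on timestamps; since each per-process trace is already time-sorted, heapq.merge reproduces the merged trace above without re-sorting everything:

```python
import heapq

# Per-process traces, each already sorted by timestamp (as in the slide).
trace_p0 = [(10.4, "P0", "Obelix", "enter"),
            (10.6, "P0", "MPI_Send", "enter"),
            (10.8, "P0", "MPI_Send", "exit")]
trace_p1 = [(10.5, "P1", "Obelix", "enter"),
            (10.7, "P1", "MPI_Recv", "enter"),
            (11.0, "P1", "MPI_Recv", "exit")]

# The merge process: a k-way merge on the timestamp field keeps the
# combined trace sorted by time.
merged = list(heapq.merge(trace_p0, trace_p1, key=lambda rec: rec[0]))
for rec in merged:
    print(*rec)
```

With n processes, the same call takes n trace iterators; this is why trace records carry a timestamp and process id in the first place.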

Page 39: Visualization of Dynamic Behaviour

P0 - Pn-1:

10.4 P0 Obelix enter

10.5 P1 Obelix enter

10.6 P0 MPI_Send enter

10.7 P1 MPI_Recv enter

10.8 P0 MPI_Send exit

11.0 P1 MPI_Recv exit

[Figure: Timeline visualization - horizontal bars per process over the time axis 10.4-11.0: P0 runs Obelix, then MPI_Send, then Obelix; P1 runs Obelix, then MPI_Recv, then Obelix.]

Page 40: Profiling vs Tracing

• Profiling

• recording summary information (time, #calls, #misses, ...)

• about program entities (functions, objects, basic blocks)

• very good for quick, low cost overview

• points out potential bottlenecks

• implemented through sampling or instrumentation

• moderate amount of performance data

• Tracing

• recording information about events

• trace record typically consists of timestamp, processid, ...

• output is a trace file with trace records sorted by time

• can be used to reconstruct the dynamic behavior

• creates huge amounts of data

• needs selective instrumentation

Page 41: Program Monitors

• Each PA tool has its own monitor

• Score-P

• In recent years, Score-P was developed by the tools groups of Scalasca, Vampir and Periscope.

• Provides support for

– MPI, OpenMP, CUDA

– Profiling and tracing

– Callpath profiles

– Online Access Interface

• Cube 4 profiling data format

• OTF2 (Open Trace Format)

Page 42: Performance Analysis

[Diagram: Analysis loop - Instrumentation → Execution → Analysis, with refinement back to instrumentation. Current hypotheses determine the instrumentation requirements; execution yields performance data; analysis yields detected bottlenecks.]

Instr: Dat → ISPEC
Info: Hyp → Dat
Prove: Hyp × Dat → {T,F}
Refine: Hyp → P(Hyp)

Page 43: Common Performance Problems with MPI

• Single node performance

• Excessive number of 2nd-level cache misses

• Low number of issued instructions

• IO

• High data volume

• Sequential IO due to IO subsystem or sequentialization in the program

• Excessive communication

• Frequent communication

• High data volume

Page 44: Common Performance Problems with MPI

• Frequent synchronization

• Reduction operations

• Barrier operations

• Load balancing

• Wrong data decomposition

• Dynamically changing load

Page 45: Common Performance Problems with SM

• Single node performance

• ...

• IO

• ...

• Excessive communication

• Large number of remote memory accesses

• False sharing

• False data mapping

• Frequent synchronization

• Implicit synchronization of parallel constructs

• Barriers, locks, ...

• Load balancing

• Uneven scheduling of parallel loops

• Uneven work in parallel sections

Page 46: Analysis Techniques

• Offline vs Online Analysis

• Offline: first generate data then analyze

• Online: generate and analyze data while application is running

• Online requires automation → limited to standard bottlenecks

• Offline suffers more from size of measurement information

• Three techniques to support user in analysis

• Source-level presentation of performance data

• Graphical visualization

• Ranking of high-level performance properties

Page 47: Statistical Profiling based Tools

• Gprof – GNU profiling tool

• Time profiling

• Inclusive and exclusive time

• Flat profile

• Call graph profile

• Based on instrumentation of function entry and exit

• Records where the call is coming from.

Page 48: Statistical Profiling based Tools

• Allinea MAP

• Annotations to the application source code.

• Based on time series of profiles

• For parallel applications it indicates outlying processes.

Page 49: Profiling Tools based on Instrumentation

• TAU (Tuning and Analysis Utilities)

• Measurements are based on instrumentation

• Visualization via paraprof

– Graphical display of data aggregated and per node, context, or thread

– Topology views of performance data

• Scalasca

• Cube performance visualizer

• Profiles based on Score-P

• Call-path profiling

Page 50: Trace-based Analysis Tools

• Vampir

• Graphical views presenting events and summary data

• Flexible scrolling and zooming features

• OTF2 trace format generated by Score-P

• Commercial license

• www.vampir.eu

Page 51: Trace-based Analysis Tools

• Paraver

• Barcelona Supercomputing Center

• MPI, OMP, pthreads, OmpSs, CUDA

• http://www.bsc.es/computer-sciences/performance-tools/paraver

• Clustering of program phases, i.e. segments between MPI calls

• Recently, tracking of clusters in time series of profiles based on object tracking

Page 52: Automatic Analysis Tools

• Paradyn

• University of Wisconsin Madison

• Periscope

• TU München

• Automatic detection of formalized performance properties

• Profile data

• Distributed online tool

• Scalasca

• Search for performance patterns in traces

• Post-mortem on parallel resources of the application

• Visualization of patterns in CUBE