Programming for Performance
Prof. Dr. Michael Gerndt
Lehrstuhl für Rechnertechnik und
Rechnerorganisation/Parallelrechnerarchitektur
2
Speedup Limited by Overheads
Speedup < Sequential Work / Max (Work + Synch Time + Comm Cost + Extra Work)
3
Load Balance
• Limit on speedup:
• Work includes data access and other costs
• Not just equal work, but must be busy at same time
• Four parts to load balance
1. Identify enough concurrency
2. Decide how to manage it
3. Determine the granularity at which to exploit it
4. Reduce serialization
Speedup(p) ≤ Sequential Work / Max (Work on any processor)
4
Reducing Synch Time
• Reduce wait time due to load imbalance
• Reduce synchronization overhead
5
Reducing Synchronization Overhead
• Event synchronization
• Reduce use of conservative synchronization
– e.g. point-to-point instead of barriers, or granularity of pt-to-pt
• But fine-grained synch more difficult to program, more synch
ops.
• Mutual exclusion
• Separate locks for separate data
– e.g. locking records in a database: lock per process, record, or
field
– lock per task in task queue, not per queue
– finer grain => less contention/serialization, more space, less
reuse
• Smaller, less frequent critical sections
– don’t do reading/testing in critical section, only modification
– e.g. searching for task to dequeue in task queue, building tree
6
Implications of Load Balance/Synchronization
• Extends speedup limit expression to:
• Generally, responsibility of software
• Architecture can support task stealing and
synchronization efficiently
• Fine-grained communication, low-overhead access to queues
– efficient support allows smaller tasks, better load balance
• Accessing shared data in the presence of task stealing
– need to access data of stolen tasks
– Hardware shared address space advantageous
Speedup(p) ≤ Sequential Work / Max (Work + Synch Wait Time)
7
Reducing Inherent Communication
• Communication is expensive!
• Measure: communication to computation ratio
• Focus here on inherent communication
• Determined by assignment of tasks to processes
• Actual communication can be greater
• Assign tasks that access same data to same process
• Solving communication and load balance NP-hard in
general case
• But simple heuristic solutions work well in practice
• Applications have structure!
Speedup < Sequential Work / Max (Work + Synch Time + Comm Cost)
8
Reducing Extra Work
• Common sources of extra work:
• Computing a good partition
• Using redundant computation to avoid communication
• Task, data and process management overhead
– applications, languages, runtime systems, OS
• Imposing structure on communication
– coalescing messages, allowing effective naming
• Architectural implications:
• Reduce need by making communication and orchestration
efficient
Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
9
A Lot Depends on Sizes
• Application parameters and no. of procs affect
inherent properties
• Load balance, communication, extra work, temporal and
spatial locality
• Memory hierarchy
• Interactions with organization parameters of extended
memory hierarchy affect artifactual communication and
performance
• Effects often dramatic, sometimes small: application-
dependent
10
A Lot Depends on Sizes
[Figure: speedup (0–30) over number of processors (1–31). Left: Ocean with problem sizes N = 130, 258, 514, 1,026. Right: Barnes-Hut on Origin (16 K, 64 K, 512 K particles) and Challenge (16 K, 512 K particles).]
11
Measuring Performance
• Absolute performance
• Performance = Work / Time
• Most important to end user
• Performance improvement due to parallelism
• Speedup(p) = Performance(p) / Performance(1)
• Both should be measured
• Work is determined by input configuration of the problem
• If work is fixed, can measure performance as 1/Time
– Or retain explicit work measure (e.g. transactions/sec, bonds/sec)
– Still w.r.t particular configuration, and still what’s measured is
time
• Speedup(p) = Time(1) / Time(p)  or  Operations Per Second(p) / Operations Per Second(1)
12
Scaling: Why Worry?
• Fixed problem size is of limited usefulness
• Too small a problem:
• May be appropriate for small machine
• Parallelism overheads begin to dominate benefits for larger
machines
– Load imbalance
– Communication to computation ratio
• May even achieve slowdowns
• Doesn’t reflect real usage, and inappropriate for large
machines
– Can exaggerate benefits of architectural improvements,
especially when measured as percentage improvement in
performance
• Too large a problem
• Difficult to measure improvement (next)
13
Too Large a Problem
• Suppose problem realistically large for big machine
• May not “fit” in small machine
• Can’t run
• Thrashing to disk
• Working set doesn’t fit in cache
• Fits at some p, leading to superlinear speedup
• Finally, users want to scale problems as machines
grow
• Can help avoid these problems
14
Demonstrating Scaling Problems
• Small Ocean and big equation solver problems on SGI
Origin2000
[Figure: speedup over number of processors (1–31) on the Origin2000. Left: Ocean 258 x 258 against the ideal line, speedup up to 30. Right: grid solver 12 K x 12 K against the ideal line, speedup up to 50.]
15
Questions in Scaling
• Under what constraints to scale the application?
• What are the appropriate metrics for performance
improvement?
– work is not fixed any more, so time not enough
• How should the application be scaled?
• Definitions:
• Scaling a machine: Can scale power in many ways
– Assume adding identical nodes, each bringing memory
• Problem size: Vector of input parameters, e.g. N = (n, q, Δt)
– Determines work done
– Distinct from data set size and memory usage
– Start by assuming it’s only one parameter n, for simplicity
16
Scaling Models
• Problem constrained (PC)
• Memory constrained (MC)
• Time constrained (TC)
17
Problem Constrained Scaling
• User wants to solve same problem, only faster
• Video compression
• Computer graphics
• VLSI routing
• But limited when evaluating larger machines
Speedup_PC(p) = Time(1) / Time(p)
18
Time Constrained Scaling
• Execution time is kept fixed as system scales
• Example: User has fixed time to use machine or wait for result
• Performance = Work/Time as usual, and time is fixed,
so
• How to measure work(p)?
• Execution time on a single processor? (thrashing problems)
• The work metric should be easy to measure, ideally analytical.
• Should scale linearly with sequential complexity
– Or ideal speedup will not be linear in p (e.g. no. of rows, no. of points, no. of operations in matrix program)
• If we cannot find an intuitive application measure, as often
true, measure execution time with ideal memory system on a
uniprocessor.
Speedup_TC(p) = Work(p) / Work(1)
19
Memory Constrained Scaling (1)
• Scale so memory usage per processor stays fixed
• Speedup cannot be defined as Time(1) / Time(p) for the scaled-up problem, since Time(1) is hard to measure and inappropriate
• Inserting performance = work/time into the speedup formula gives
Speedup_MC(p) = [Work(p) / Time(p)] / [Work(1) / Time(1)] = Increase in Work / Increase in Time
20
Memory Constrained Scaling (2)
• MC scaling can lead to large increases in execution
time
• If work grows faster than linearly in memory usage
• e.g. matrix factorization with complexity n³
– 10,000-by-10,000 matrix takes 800 MB and 1 hour on a uniprocessor
– With 1,000 processors, can run 320K-by-320K matrix, but ideal
parallel time grows to 32 hours!
– With 10,000 processors, 100 hours ...
21
Scaling Down Problem Parameters
• Some parameters don’t affect parallel performance
much, but do affect runtime, and can be scaled down
• Common example is no. of time-steps in many scientific
applications
– need a few to allow settling down, but don’t need more
– may need to omit cold-start when recording time and statistics
• First look for such parameters
• But many application parameters affect key
characteristics
• Scaling them down requires scaling down no. of processors
too
• Otherwise can obtain highly unrepresentative behavior
22
Difficulties in Scaling N, p Representatively
• Want to preserve many aspects of full-scale scenario
• Distribution of time in different phases
• Key behavioral characteristics
• Scaling relationships among application parameters
• Contention and communication patterns
• Can’t really hope for full representativeness, but can
• Cover range of realistic operating points
• Avoid unrealistic scenarios
• Gain insights and estimates of performance
23
Performance Analysis Process
[Diagram: development cycle — Coding, followed by a performance-analysis loop of Measurement → Analysis → Ranking → Refinement (program tuning), until the application goes into Production.]
24
Performance Prediction and Benchmarking
• Performance analysis determines the performance on
a given machine.
• Performance prediction allows evaluating programs for a hypothetical machine. It is based on:
• runtime data of an actual execution
• machine model of the target machine
• analytical techniques
• simulation techniques
• Benchmarking determines the performance of a
computer system on the basis of a set of typical
applications.
25
Overhead Analysis
• How to decide whether a code performs well:
• Comparison of measured MFLOPS with peak performance
• Comparison with a sequential version
• Estimate distance to ideal
time via overhead classes
– tmem
– tcomm
– tsync
– tred
– ...
speedup(p) = t_s / t_p
[Figure: measured speedup over the number of processors compared to the ideal line; the gap to ideal is broken down into the overhead classes t_mem, t_comm, t_red.]
26
The Basics
• Successful tuning is a combination of
• right algorithms and libraries
• compiler flags and directives
• thinking!
• Measurement is better than guessing:
• to determine performance problems
• to validate tuning decisions and optimization
• Measurement should be repeated after each significant code modification and optimization
27
The Basics
• Do I have a performance problem at all?
• Compare MFlops/MOps to typical rate
• Speedup measurements
• What are the hot code regions?
• Flat profiling
• Is there a bottleneck in those regions?
• Single node: Hardware counter profiling
• Parallel: Synchronization and communication analysis profiling
• Does the bottleneck vary over time or processor space?
• Profiling individual processes and/or threads
• Tracing
• Does the code behave similarly for different configurations?
• Analyze runs with different processor counts
• Analyze different input configurations
28
Performance Analysis
[Diagram: iterative analysis loop — Instrumentation → Execution → Analysis, with refinement feeding back. Inputs and outputs: requirements, current hypotheses, performance data, detected bottlenecks.]
Instr: Dat → ISPEC
Info: Hyp → Dat
Prove: Hyp × Dat → {T, F}
Refine: Hyp → P(Hyp)
29
Performance Measurement Techniques
• Event model of the execution
• Events occur at a processor at a specific point in time
• Events belong to event types
– clock cycles
– cache misses
– remote references
– start of a send operation
– ...
• Profiling: Recording accumulated performance data for
events
• Sampling: Statistical approach
• Instrumentation: Precise measurement
• Tracing: Recording performance data of individual
events
30
Statistical Sampling
[Diagram: statistical sampling of a program with functions Main, Asterix, and Obelix. The CPU provides a program counter and event counters (clock cycles, cache misses, flops). An interrupt every 10 ms maps the program counter to the currently executing function, adds the counter values to that function's entry in the function table, and resets the counters.]
31
Instrumentation and Monitoring
[Diagram: calls to monitor("Obelix", "enter") and monitor("Obelix", "exit") are inserted at function entry and exit. On each call, the monitor reads the CPU's counters (e.g. the cache-miss counter) and, depending on enter or exit, starts or accumulates the measurement in the function-table entries for Main, Asterix, and Obelix.]
32
Instrumentation Techniques
• Source code instrumentation
• done by the compiler, source-to-source tool, or manually
– portability
– link back to source code easy
– re-compile necessary when instrumentation is changed
– difficult to instrument mixed-code applications
– cannot instrument system or 3rd party libraries or executables
• Binary instrumentation
• "patching" the executable to insert hooks (like a debugger)
– inverse pros/cons
• Offline
• Online
33
Instrumentation Tools
• Standard compilers
• Add callbacks for profiling functions
• Typically at function level
• Be careful of overhead for frequently called functions
• gcc, for example, adds calls if -finstrument-functions
is given.
• OPARI
• Jülich Supercomputing Center
• OpenMP for C and FORTRAN
• Source-level instrumentation see exercise
• PMPI interface
• Library interposition
• Link own library before real library, e.g. frequently used for
own malloc function.
34
Instrumentation Tools
• TAU Generic Instrumenter
• Parsers for C++, FORTRAN, UPC,…
• Creation of a program database via PDT (Program Database Toolkit)
• Approach
– Specify which string to insert before and after certain regions
– Use provided variables to access file and line information
• Limited program region types
• tau.oregon.edu
• OMPT
• Proposal for profiling API
• Based on callbacks
35
Source Code Transformation Tools
• Rose
• rosecompiler.org, LLNL
• LLVM
• Language independent code optimizer
and code generator
• http://www.llvm.org/, Univ. Illinois
• Clang C frontend for LLVM, http://clang.llvm.org/
• C/C++ and Objective C/C++
• Open64
• www.open64.net
• Compiler infrastructure based originally on the SGI compiler.
• Interprocedural and loop optimizations
36
Binary Instrumentation Tools
• Dyninst
• Dynamic instrumentation on binary level
• Context of the Paradyn project
• Univ. of Wisconsin-Madison and Univ. of Maryland
• Barton Miller, Jeff Hollingsworth
• Intel Pin
• Intel for x86
• Online instrumentation of binaries
• Valgrind
• Dynamic instrumentation
• Based on emulation of x86 machine instructions
37
Tracing
[Diagram: processes 0 to n-1 each write their own trace (Trace P0 ... Trace Pn-1). The application's functions are instrumented with monitor("Obelix", "enter") / monitor("Obelix", "exit"); the MPI library is wrapped so that MPI_Send calls monitor("MPI_Send", "enter"), then PMPI_Send(...), then monitor("MPI_Send", "exit"). Example trace of P0:]
10.4 P0 Obelix enter
10.6 P0 MPI_Send enter
10.8 P0 MPI_Send exit
38
Merging
[Diagram: the per-process traces P0 to Pn-1 are combined by a merge process into one time-sorted global trace:]
10.4 P0 Obelix enter
10.5 P1 Obelix enter
10.6 P0 MPI_Send enter
10.7 P1 MPI_Recv enter
10.8 P0 MPI_Send exit
11.0 P1 MPI_Recv exit
39
Visualization of Dynamic Behaviour
[Figure: timeline visualization of the merged trace. Horizontal bars for P0 and P1 over the time axis 10.4–11.0 show Obelix and MPI_Send on P0 and Obelix and MPI_Recv on P1.]
40
Profiling vs Tracing
• Profiling
• recording summary information (time, #calls,#misses...)
• about program entities (functions, objects, basic blocks)
• very good for quick, low cost overview
• points out potential bottlenecks
• implemented through sampling or instrumentation
• moderate amount of performance data
• Tracing
• recording information about events
• trace record typically consists of timestamp, processid, ...
• output is a trace file with trace records sorted by time
• can be used to reconstruct the dynamic behavior
• creates huge amounts of data
• needs selective instrumentation
41
Program Monitors
• Each PA tool has its own monitor
• Score-P
• Over the last years, Score-P has been developed jointly by the tools groups of
Scalasca, Vampir, and Periscope.
• Provides support for
– MPI, OpenMP, CUDA
– Profiling and tracing
– Callpath profiles
– Online Access Interface
• Cube 4 profiling data format
• OTF2 (Open Trace Format)
42
Performance Analysis
[Diagram: iterative analysis loop — Instrumentation → Execution → Analysis, with refinement feeding back. Inputs and outputs: requirements, current hypotheses, performance data, detected bottlenecks.]
Instr: Dat → ISPEC
Info: Hyp → Dat
Prove: Hyp × Dat → {T, F}
Refine: Hyp → P(Hyp)
43
Common Performance Problems with MPI
• Single node performance
• Excessive number of 2nd-level cache misses
• Low number of issued instructions
• IO
• High data volume
• Sequential IO due to IO subsystem or sequentialization in the
program
• Excessive communication
• Frequent communication
• High data volume
44
Common Performance Problems with MPI
• Frequent synchronization
• Reduction operations
• Barrier operations
• Load balancing
• Wrong data decomposition
• Dynamically changing load
45
Common Performance Problems with SM
• Single node performance
• ...
• IO
• ...
• Excessive communication
• Large number of remote memory accesses
• False sharing
• False data mapping
• Frequent synchronization
• Implicit synchronization of parallel constructs
• Barriers, locks, ...
• Load balancing
• Uneven scheduling of parallel loops
• Uneven work in parallel sections
46
Analysis Techniques
• Offline vs Online Analysis
• Offline: first generate data then analyze
• Online: generate and analyze data while application is running
• Online requires automation → limited to standard bottlenecks
• Offline suffers more from size of measurement information
• Three techniques to support user in analysis
• Source-level presentation of performance data
• Graphical visualization
• Ranking of high-level performance properties
47
Statistical Profiling based Tools
• Gprof – GNU profiling tool
• Time profiling
• Inclusive and exclusive time
• Flat profile
• Call graph profile
• Time profile based on statistical sampling; call counts based on
compiler instrumentation of function entry
• Records where each call is coming from
48
Statistical Profiling based Tools
• Allinea MAP
• Annotations to the application source code.
• Based on time series of profiles
• For parallel applications it indicates outlying processes.
49
Profiling Tools based on Instrumentation
• TAU (Tuning and Analysis
Utilities)
• Measurements are based on
instrumentation
• Visualization via paraprof
– Graphical displays of aggregated data and of data per node, context, or thread
– Topology views of performance data
• Scalasca
• Cube performance visualizer
• Profiles based on Score-P
• Call-path profiling
50
Trace-based Analysis Tools
• Vampir
• Graphical views presenting
events and summary data
• Flexible scrolling and
zooming features
• OTF2 trace format
generated by Score-P
• Commercial license
• www.vampir.eu
51
Trace-based Analysis Tools
• Paraver
• Barcelona Supercomputing
Center
• MPI, OMP, pthreads, OmpSs,
CUDA
• http://www.bsc.es/computer-sciences/performance-tools/paraver
• Clustering of program phases, i.e. segments between MPI calls
• Recently tracking of clusters in time series of profiles based on
object tracking
52
Automatic Analysis Tools
• Paradyn
• University of Wisconsin Madison
• Periscope
• TU München
• Automatic detection of formalized performance properties
• Profile data
• Distributed online tool
• Scalasca
• Search for performance patterns in traces
• Post-mortem on parallel resources of the application
• Visualization of patterns in CUBE