Code Tuning and Parallelization on Boston University's Scientific Computing Facility
Doug Sondak, sondak@bu.edu
Boston University Scientific Computing and Visualization
Page 1:

Code Tuning and Parallelization on Boston University's Scientific Computing Facility

Doug Sondak
sondak@bu.edu

Boston University Scientific Computing and Visualization

Page 2:

Outline

• Introduction
• Timing
• Profiling
• Cache
• Tuning
• Timing/profiling exercise
• Parallelization

Page 3:

Introduction

• Tuning
  – Where is most time being used?
  – How to speed it up

• Often as much art as science

• Parallelization
  – After serial tuning, try parallel processing
  – MPI
  – OpenMP

Page 4:

Timing

Page 5:

Timing

• When tuning/parallelizing a code, need to assess effectiveness of your efforts

• Can time whole code and/or specific sections

• Some types of timers
  – unix time command
  – function/subroutine calls
  – profiler

Page 6:

CPU or Wall-Clock Time?

• both are useful
• for parallel runs, really want wall-clock time, since CPU time will be about the same or even increase as the number of procs. is increased
• CPU time doesn't account for wait time
• wall-clock time may not be accurate if sharing processors
  – wall-clock timings should always be performed in batch mode

Page 7:

Unix Time Command

• easiest way to time code
• simply type time before your run command
• output differs between C-type shells (csh, tcsh) and Bourne-type shells (bsh, bash, ksh)

Page 8:

Unix Time Command (cont'd)

• tcsh results

twister:~ % time mycode
1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w

  – 1.570u: user CPU time (s)
  – 0.010s: system CPU time (s)
  – 0:01.77: wall-clock time
  – 89.2%: (u+s)/wall-clock
  – 75+1450k: avg. shared + unshared text space
  – 0+0io: input + output operations
  – 64pf+0w: page faults + no. times proc. was swapped

Page 9:

Unix Time Command (3)

• bsh results

$ time mycode
Real 1.62
User 1.57
System 0.03

  – Real: wall-clock time (s)
  – User: user CPU time (s)
  – System: system CPU time (s)

Page 10:

Function/Subroutine Calls

• often need to time part of code
• timers can be inserted in source code
• language-dependent

Page 11:

cpu_time

• intrinsic subroutine in Fortran
• returns user CPU time (in seconds)
  – no system time is included
• 0.01 sec. resolution on p-series

real :: t1, t2
call cpu_time(t1)
... do stuff to be timed ...
call cpu_time(t2)
print*, 'CPU time = ', t2-t1, ' sec.'

Page 12:

system_clock

• intrinsic subroutine in Fortran
• good for measuring wall-clock time
• on p-series:
  – resolution is 0.01 sec.
  – max. time is 24 hr.

Page 13:

system_clock (cont’d)

• t1 and t2 are tic counts
• count_rate is optional argument containing tics/sec.

integer :: t1, t2, count_rate
call system_clock(t1, count_rate)
... do stuff to be timed ...
call system_clock(t2)
print*, 'wall-clock time = ', &
        real(t2-t1)/real(count_rate), ' sec'

Page 14:

times

• can be called from C to obtain CPU time
• 0.01 sec. resolution on p-series
• can also get system time with tms_stime

#include <sys/times.h>
#include <unistd.h>
void main(){
  int tics_per_sec;
  float tic1, tic2;
  struct tms timedat;
  tics_per_sec = sysconf(_SC_CLK_TCK);
  times(&timedat);
  tic1 = timedat.tms_utime;
  ... do stuff to be timed ...
  times(&timedat);
  tic2 = timedat.tms_utime;
  printf("CPU time = %5.2f\n",
         (float)(tic2-tic1)/(float)tics_per_sec);
}

Page 15:

gettimeofday

• can be called from C to obtain wall-clock time
• μsec resolution on p-series

#include <sys/time.h>
void main(){
  struct timeval t;
  double t1, t2;
  gettimeofday(&t, NULL);
  t1 = t.tv_sec + 1.0e-6*t.tv_usec;
  ... do stuff to be timed ...
  gettimeofday(&t, NULL);
  t2 = t.tv_sec + 1.0e-6*t.tv_usec;
  printf("wall-clock time = %5.3f\n", t2-t1);
}

Page 16:

MPI_Wtime

• convenient wall-clock timer for MPI codes

• μsec resolution on p-series

Page 17:

MPI_Wtime (cont’d)

• Fortran

double precision t1, t2
t1 = mpi_wtime()
... do stuff to be timed ...
t2 = mpi_wtime()
print*, 'wall-clock time = ', t2-t1

• C

double t1, t2;
t1 = MPI_Wtime();
... do stuff to be timed ...
t2 = MPI_Wtime();
printf("wall-clock time = %5.3f\n", t2-t1);

Page 18:

omp_get_wtime

• convenient wall-clock timer for OpenMP codes

• resolution available by calling omp_get_wtick()

• 0.01 sec. resolution on p-series

Page 19:

omp_get_wtime (cont’d)

• Fortran

double precision t1, t2, omp_get_wtime
t1 = omp_get_wtime()
... do stuff to be timed ...
t2 = omp_get_wtime()
print*, 'wall-clock time = ', t2-t1

• C

double t1, t2;
t1 = omp_get_wtime();
... do stuff to be timed ...
t2 = omp_get_wtime();
printf("wall-clock time = %5.3f\n", t2-t1);

Page 20:

Timer Summary

           CPU         Wall
Fortran    cpu_time    system_clock
C          times       gettimeofday
MPI                    MPI_Wtime
OpenMP                 omp_get_wtime

Page 21:

Profiling

Page 22:

Profilers

• profile tells you how much time is spent in each routine

• various profilers available, e.g.
  – gprof (GNU)
  – pgprof (Portland Group)
  – Xprofiler (AIX)

Page 23:

gprof

• compile with -pg

• file gmon.out will be created when you run

• gprof executable > myprof

• for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof

Page 24:

gprof (cont'd)

ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                        called/total      parents
index  %time    self  descendents    called+self    name          index
                                        called/total      children

                0.00      340.50         1/1        .__start [2]
[1]     78.3    0.00      340.50         1          .main [1]
                2.12      319.50        10/10       .contrl [3]
                0.04        7.30        10/10       .force [34]
                0.00        5.27         1/1        .initia [40]
                0.56        3.43         1/1        .plot3da [49]
                0.00        1.27         1/1        .data [73]

Page 25:

gprof (3)

ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

  %    cumulative    self                 self     total
 time    seconds    seconds      calls   ms/call  ms/call  name
 20.5      89.17      89.17         10   8917.00 10918.00  .conduct [5]
  7.6     122.34      33.17        323    102.69   102.69  .getxyz [8]
  7.5     154.77      32.43                                .__mcount [9]
  7.2     186.16      31.39     189880      0.17     0.17  .btri [10]
  7.2     217.33      31.17                                .kickpipes [12]
  5.1     239.58      22.25  309895200      0.00     0.00  .rmnmod [16]
  2.3     249.67      10.09        269     37.51    37.51  .getq [24]

Page 26:

pgprof

• compile with Portland Group compiler
  – pgf95 (pgf90, etc.)
  – pgcc
  – -Mprof=func
• similar to -pg
  – run code
• pgprof -exe executable
• pops up window with flat profile

Page 27:

pgprof (cont’d)

Page 28:

pgprof (3)

• line-level profiling
  – -Mprof=line
• optimizer will re-order lines
  – profiler will lump lines in some loops or other constructs
  – may want to compile without optimization, may not
• in flat profile, double-click on function

Page 29:

pgprof (4)

Page 30:

xprofiler

• AIX (twister) has a graphical interface to gprof
• compile with -g -pg -Ox
  – Ox represents whatever level of optimization you're using (e.g., O5)
• run code
  – produces gmon.out file
• type xprofiler mycode
  – mycode is your code run command

Page 31:

xprofiler (cont’d)

Page 32:

xprofiler (3)

• filled boxes represent functions or subroutines
• "fences" represent libraries
• left-click a box to get function name and timing information
• right-click on box to get source code or other information

Page 33:

xprofiler (4)

• can also get same profiles as from gprof by using menus
  – report → flat profile
  – report → call graph profile

Page 34:

Cache

Page 35:

Cache

• Cache is a small chunk of fast memory between the main memory and the registers

  registers ↔ primary cache ↔ secondary cache ↔ main memory

Page 36:

Cache (cont’d)

• Variables are moved from main memory to cache in lines
  – L1 cache line sizes on our machines
    • Opteron (katana cluster): 64 bytes
    • Power4 (p-series): 128 bytes
    • PPC440 (Blue Gene): 32 bytes
    • Pentium III (linux cluster): 32 bytes

• If variables are used repeatedly, code will run faster since cache memory is much faster than main memory
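For example, a 64-byte line holds 8 double-precision (8-byte) values or 16 single-precision (4-byte) values, so stepping through a double array in order brings in a new line from main memory only once every 8 elements.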

Page 37:

Cache (cont’d)

• Why not just make the main memory out of the same stuff as cache?
  – Expensive
  – Runs hot
  – This was actually done in Cray computers
    • Liquid cooling system

Page 38:

Cache (cont’d)

• Cache hit
  – Required variable is in cache
• Cache miss
  – Required variable not in cache
  – If cache is full, something else must be thrown out (sent back to main memory) to make room
  – Want to minimize number of cache misses

Page 39:

Cache example

for(i=0; i<10; i++) x[i] = i;

• Main memory holds x[0] through x[9]
• "mini" cache holds 2 lines (a and b), 4 words each

Page 40:

Cache example (cont'd)

for(i=0; i<10; i++) x[i] = i;

• We will ignore i for simplicity
• need x[0], not in cache → cache miss
• load line from memory into cache (line a now holds x[0]–x[3])
• next 3 loop indices result in cache hits

Page 41:

Cache example (cont'd)

• need x[4], not in cache → cache miss
• load line from memory into cache (line b now holds x[4]–x[7])
• next 3 loop indices result in cache hits

Page 42:

Cache example (cont'd)

for(i=0; i<10; i++) x[i] = i;

• need x[8], not in cache → cache miss
• load line from memory into cache
• no room in cache!
• replace old line (x[0]–x[3] are evicted; that line now holds x[8], x[9])
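In total, this ten-iteration loop produces 3 cache misses (at x[0], x[4], and x[8]) and 7 cache hits in the mini cache.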

Page 43:

Cache (cont’d)

• Contiguous access is important
• In C, multidimensional array is stored in memory as a[0][0] a[0][1] a[0][2]

Page 44:

Cache (cont’d)

• In Fortran and Matlab, multidimensional array is stored the opposite way:

a(1,1) a(2,1) a(3,1)

Page 45:

Cache (cont’d)

• Rule: Always order your loops appropriately
  – will usually be taken care of by optimizer
  – suggestion: don't rely on optimizer!

C:
for(i=0; i<N; i++){
  for(j=0; j<N; j++){
    a[i][j] = 1.0;
  }
}

Fortran:
do j = 1, n
  do i = 1, n
    a(i,j) = 1.0
  enddo
enddo

Page 46:

Tuning Tips

Page 47:

Tuning Tips

• Some of these tips will be taken care of by compiler optimization
  – It's best to do them yourself, since compilers vary

Page 48:

Tuning Tips (cont’d)

• Access arrays in contiguous order
  – For multi-dimensional arrays, rightmost index varies fastest for C and C++, leftmost for Fortran and Matlab

Bad:
for(j=0; j<N; j++){
  for(i=0; i<N; i++){
    a[i][j] = 1.0;
  }
}

Good:
for(i=0; i<N; i++){
  for(j=0; j<N; j++){
    a[i][j] = 1.0;
  }
}

Page 49:

Tuning Tips (3)

• Eliminate redundant operations in loops (a more typical case is sketched below)

Bad:
for(i=0; i<N; i++){
  x = 10;
  ...
}

Good:
x = 10;
for(i=0; i<N; i++){
  ...
}
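A common real-world instance of the same idea (an illustrative sketch; the variables a, b, x, y, and ab are made up for this example) is hoisting a loop-invariant expression out of the loop:

/* assume: double a, b, ab, x[N], y[N]; a and b do not change in the loop */

/* Bad: the product a*b is recomputed on every iteration */
for(i=0; i<N; i++)
    y[i] = a*b*x[i];

/* Good: compute the invariant product once, outside the loop */
ab = a*b;
for(i=0; i<N; i++)
    y[i] = ab*x[i];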

Page 50:

Tuning Tips (4)

• Eliminate if statements within loops
• They may inhibit pipelining

for(i=0; i<N; i++){
  if(i==0)
    perform i=0 calculations
  else
    perform i>0 calculations
}

Page 51:

Tuning Tips (5)

• Better way (a concrete sketch follows below)

perform i=0 calculations
for(i=1; i<N; i++){
  perform i>0 calculations
}
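As a concrete illustration (hypothetical arrays x and y, not from the original slides), a boundary value handled with an if inside the loop versus peeled out of it:

/* assume: double x[N], y[N]; x[0] is a boundary value */

/* if inside the loop may inhibit pipelining */
for(i=0; i<N; i++){
    if(i == 0)
        x[i] = 0.0;                    /* boundary value  */
    else
        x[i] = 0.5*(y[i-1] + y[i]);    /* interior points */
}

/* better: peel the i=0 case out of the loop */
x[0] = 0.0;
for(i=1; i<N; i++)
    x[i] = 0.5*(y[i-1] + y[i]);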

Page 52:

Tuning Tips (6)

• Divides cost far more than multiplies or adds
  – Often order of magnitude difference!

Bad:
for(i=0; i<N; i++)
  x[i] = y[i]/scalarval;

Good:
qs = 1.0/scalarval;
for(i=0; i<N; i++)
  x[i] = y[i]*qs;

Page 53:

Tuning Tips (7)

• There is overhead associated with a function call

Bad:
for(i=0; i<N; i++)
  myfunc(i);

Good:
myfunc();

void myfunc(){
  for(int i=0; i<N; i++){
    do stuff
  }
}

Page 54:

Tuning Tips (8)

• There is overhead associated with a function call
• Minimize calls to math functions

Bad:
for(i=0; i<N; i++)
  z[i] = log(x[i]) + log(y[i]);

Good:
for(i=0; i<N; i++)
  z[i] = log(x[i]*y[i]);

Page 55:

Tuning Tips (9)

• recasting may be costlier than you think

Bad:
sum = 0.0;
for(i=0; i<N; i++)
  sum += (float) i;

Good:
isum = 0;
for(i=0; i<N; i++)
  isum += i;
sum = (float) isum;

Page 56:

Parallelization

Page 57:

Parallelization

• Introduction

• MPI & OpenMP

• Performance metrics

• Amdahl’s Law

Page 58:

Introduction

• Divide and conquer!
  – divide operations among many processors
  – perform operations simultaneously
  – if serial run takes 10 hours and we hit the problem with 5000 processors, it should take about 7 seconds to complete, right?
    • not so easy, of course

Page 59:

Introduction (cont’d)

• problem: some calculations depend upon previous calculations
  – can't be performed simultaneously
  – sometimes tied to the physics of the problem, e.g., time evolution of a system
• want to maximize amount of parallel code
  – occasionally easy
  – usually requires some work

Page 60:

Introduction (3)

• method used for parallelization may depend on hardware
  – distributed memory: proc0–proc3, each with its own memory (mem0–mem3)
  – shared memory: proc0–proc3 all attached to a single memory (mem)
  – mixed memory: pairs of processors share a memory (proc0/proc1 on mem0, proc2/proc3 on mem1)

Page 61:

Introduction (4)

• distributed memory
  – e.g., katana, Blue Gene
  – each processor has own address space
  – if one processor needs data from another processor, must be explicitly passed
• shared memory
  – e.g., p-series IBM machines
  – common address space
  – no message passing required

Page 62:

Introduction (5)

• MPI
  – for both distributed and shared memory
  – portable
  – freely downloadable
• OpenMP
  – shared memory only
  – must be supported by compiler (most do)
  – usually easier than MPI
  – can be implemented incrementally

Page 63:

MPI

• Computational domain is typically decomposed into regions
  – One region assigned to each processor
• Separate copy of program runs on each processor (see the sketch below)
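A minimal sketch of this "one copy per processor" model (illustrative only; the real solver and the data exchange between regions are omitted):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv){
  int rank, nprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which copy am I?         */
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* how many copies running? */
  /* each rank would now work on its own region of the domain */
  printf("rank %d of %d handling region %d\n", rank, nprocs, rank);
  MPI_Finalize();
  return 0;
}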

Page 64:

MPI (cont’d)

• Discretized domain to solve flow over airfoil

• System of coupled PDEs solved at each point

Page 65:

MPI (3)

• Decomposed domain for 4 processors

Page 66:

MPI (4)

• Since points depend on adjacent points, must transfer information after each iteration

• This is done with explicit calls in the source code


Page 67:

MPI (5)

• Diminishing returns
  – Sending messages can get expensive
  – Want to maximize ratio of computation to communication

Page 68:

OpenMP

• Usually loop-level parallelization
• An OpenMP directive is placed in the source code before the loop (see the sketch below)
  – Assigns subset of loop indices to each processor
  – No message passing since each processor can "see" the whole domain

for(i=0; i<N; i++){
  do lots of stuff
}
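A minimal compilable sketch of what that directive looks like in C (illustrative loop body; build with your compiler's OpenMP flag, e.g. -fopenmp for GNU compilers):

#define N 1000

int main(void){
  int i;
  double a[N];
  /* the directive assigns a subset of the loop indices to each thread */
  #pragma omp parallel for
  for(i=0; i<N; i++){
    a[i] = 2.0*i;   /* stands in for "do lots of stuff" */
  }
  return 0;
}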

Page 69:

OpenMP (cont’d)

• Can't guarantee order of operations

for(i = 0; i < 7; i++)
  a[i] = 1;
for(i = 1; i < 7; i++)
  a[i] = 2*a[i-1];

Parallelize the second loop on 2 processors (Proc. 0 gets i = 1-3, Proc. 1 gets i = 4-6). This is an example of how to do it wrong!

i    a[i] (serial)    a[i] (parallel)
0          1                1
1          2                2    (Proc. 0)
2          4                4    (Proc. 0)
3          8                8    (Proc. 0)
4         16                2    (Proc. 1)
5         32                4    (Proc. 1)
6         64                8    (Proc. 1)

Page 70:

Quantify performance

• Two common methods
  – parallel speedup
  – parallel efficiency

Page 71:

Parallel Speedup

Sn = parallel speedup
n = number of processors
T1 = time on 1 processor
Tn = time on n processors

Sn = T1 / Tn

Page 72:

Parallel Speedup (2)

Page 73:

Parallel Efficiency

ηn = parallel efficiency
T1 = time on 1 processor
Tn = time on n processors
n = number of processors

ηn = T1 / (n · Tn) = Sn / n
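For example (illustrative numbers, not from the slides): if a run takes T1 = 100 s on one processor and T4 = 30 s on four, then S4 = 100/30 ≈ 3.3 and η4 = 3.3/4 ≈ 0.83.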

Page 74:

Parallel Efficiency (2)

Page 75:

Parallel Efficiency (3)

• What is a “reasonable” level of parallel efficiency?

• depends on
  – how much CPU time you have available
  – when the paper is due
• can think of (1-η) as "wasted" CPU time
• my personal rule of thumb: η ~ 60%

Page 76:

Parallel Efficiency (4)

• Superlinear speedup
  – parallel efficiency > 1.0
  – sometimes quoted in the literature
  – generally attributed to cache issues
    • subdomains fit entirely in cache, entire domain does not
• this is very problem-dependent
• be suspicious!

Page 77:

Amdahl’s Law

• let fraction of code that can execute in parallel be denoted p

• let fraction of code that must execute serially be denoted s

• let T = time, n = number of processors

Tn = T1 (s + p/n)

Page 78:

Amdahl’s Law (2)

• Noting that p = (1-s), parallel speedup is (don't confuse Sn with s)

Sn = T1 / Tn = 1 / (s + (1-s)/n)     (Amdahl's Law)
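As an illustrative example (numbers chosen arbitrarily): with s = 0.1 and n = 100 processors, Sn = 1 / (0.1 + 0.9/100) ≈ 9.2, so even 100 processors give less than a 10x speedup when 10% of the code is serial.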

Page 79:

Amdahl’s Law (3)

• can also be expressed as parallel efficiency by dividing by n

ηn = Sn / n = 1 / (s(n-1) + 1)     (Amdahl's Law)

Page 80:

Amdahl's Law (4)

• suppose s = 0 => linear speedup

ηn = 1 / (s(n-1) + 1) = 1
Sn = n ηn = n

Page 81:

Amdahl’s Law (5)

• suppose s = 1 => no speedup

ηn = 1 / (s(n-1) + 1) = 1/n
Sn = n ηn = 1

Page 82:

Amdahl’s Law (6)

Page 83:

Amdahl’s Law (7)

• Should we despair?
  – No!
  – bigger machines → bigger computations → smaller value of s
• if you want to run on a large number of processors, try to minimize s

Page 84:

Recommendations

Page 85:

Recommendations

• Add timers to your code
  – As you make changes and/or run new cases, they may give you an indication of a problem
• Profile your code
  – Sometimes results are surprising
  – Review "tuning tips"
  – See if you can speed up functions that are consuming the most time
• Try highest levels of compiler optimization

Page 86:

Recommendations (cont'd)

• Once you're comfortable that you're getting reasonable serial performance, parallelize
• If portability is an issue, MPI is a good choice
• If you'll always be running on a shared-memory machine (e.g., multicore PC), consider OpenMP
• For parallel code, plot parallel efficiency vs. number of processors
  – Choose appropriate number of processors