Page 1

Optimization techniques 

Carlo Cavazzoni, HPC department, CINECA

Page 2

Modern node architecture

(Diagram: a compute node with its CPU, small and fast instruction and data caches, RAM, and disk.)

Page 3

Cache

Hierarchy: registers, L1, L2, L3, RAM. L1 holds both instructions and data. Size increases from L1 to Ln; speed decreases from L1 to Ln.

The CPU looks for data in L1: if the data is there, it is an L1 cache hit; if not (L1 cache miss), it looks in L2, and so on.

A cache miss carries a penalty of many clock cycles.
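To see the miss penalty in practice, here is a minimal Fortran sketch (the program name, array size and strides are illustrative assumptions; absolute timings depend on the machine): a unit-stride sweep reuses every element of each cache line, while a stride of 16 doubles touches a different line at every access, so the second loop does one sixteenth of the work yet can take a comparable time.

   program cache_stride
      implicit none
      integer, parameter :: n = 16*1024*1024      ! assumed larger than the last-level cache
      real(kind=8), allocatable :: a(:)
      real(kind=8) :: s
      integer :: i, iclk1, iclk2, nclk

      allocate( a(n) )
      a = 1.0d0

      ! Unit stride: every element of each loaded cache line is used
      call system_clock(iclk1, count_rate=nclk)
      s = 0.0d0
      do i = 1, n
         s = s + a(i)
      end do
      call system_clock(iclk2)
      print *, 'stride 1  time = ', REAL(iclk2-iclk1)/nclk, s

      ! Stride 16: each access touches a different cache line, so almost every access misses
      call system_clock(iclk1)
      s = 0.0d0
      do i = 1, n, 16
         s = s + a(i)
      end do
      call system_clock(iclk2)
      print *, 'stride 16 time = ', REAL(iclk2-iclk1)/nclk, s
   end program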

Page 4

Cache: direct mapped

(Diagram: a 32 Kbyte direct-mapped cache; memory is divided into consecutive 32 Kbyte chunks starting at 0, 32 K, 64 K, 96 K, ..., and every chunk maps onto the same 32 Kbyte cache, so addresses that are a multiple of the cache size apart compete for the same cache line.)

Page 5

Cache set associative

(Diagram: a 2-way set associative cache, e.g. a 32 Kbyte cache split into two 16 Kbyte ways; each memory address can be placed in either way of its set. When a set is full, a line is evicted according to a replacement policy: Least Recently Used, Round Robin, or Random.)

Page 6

Loop optimization

Page 7

Loop fusion: locality in time

Original loops:

   do i=1, n
      a(i) = b(i) + 1.0
   enddo
   do i=2, n
      c(i) = sqrt(a(i-1))
   enddo

Fused version (a(1) is peeled out of the loop):

   a(1) = b(1) + 1.0
   do i=2, n
      a(i) = b(i) + 1.0
      c(i) = sqrt(a(i-1))
   enddo

In the first version, if n is large enough, a is loaded into cache, evicted, and loaded again by the second loop. The fused loop reuses a(i) while it is still in cache.

Page 8

Loop interchange: locality in space

Original loop order:

   do i=1, n
      do j=1, n
         a(i,j) = b(i,j) + 1.0
      enddo
   enddo

Cache lines are loaded, but only one element of each line is used before the line is replaced.

Interchanged loop order:

   do j=1, n
      do i=1, n
         a(i,j) = b(i,j) + 1.0
      enddo
   enddo

Cache lines are loaded and all of their elements are used before they are replaced.

(Diagram: memory layout of a and b; consecutive addresses 0x00, 0x01, 0x02, 0x03 run along the first index i.)

Page 9

Cache thrashing

   real, dimension (1024) :: a, b
   COMMON /my_com/ a, b
   do i = 1, 1024
      a(i) = b(i) + 1.0
   enddo

With a direct-mapped cache of size 4*1024 bytes and a, b contiguous in the COMMON block, a(i) and b(i) map to the same cache location, so every iteration evicts the line it has just loaded: cache thrashing. An array size that is a multiple of the cache size is a possible source of cache thrashing.

Padding: shifting the arrays with respect to the cache by an offset removes the problem. Avoid powers of 2 for array dimensions.

   integer, parameter :: offset = cache_line / SIZE_OF_REAL   ! cache line size / size of a REAL
   real, dimension (1024+offset) :: a, b
   COMMON /my_com/ a, b
   do i = 1, 1024
      a(i) = b(i) + 1.0
   enddo

Set-associative caches help reduce thrashing.

Page 10

Loop unrolling

   do j=1, n
      do i=1, (n-1)
         a(i,j) = b(i,j) + b(i+1,j) + 1.0
      enddo
   enddo

Equivalent loops.
Fewer jumps.
Fewer dependencies.
Fill pipelines and vector units.

   do j=1, n
      do i=1, (n-1), 2
         a(i,j)   = b(i,j)   + b(i+1,j) + 1.0
         a(i+1,j) = b(i+1,j) + b(i+2,j) + 1.0
      enddo
   enddo

Page 11

Optimize with numerical libraries

Less coding

Tested and (almost) bug free

Standard

Efficient implementation

Optimized

Page 12

BLAS
Basic Linear Algebra Subprograms, Parallel BLAS (PBLAS) and Basic Linear Algebra Communication Subprograms (BLACS) (www.netlib.org)

• Level 1 BLAS: vector-vector operations (serial only).
• Level 2 BLAS, PBLAS: matrix-vector operations (serial and parallel).
• Level 3 BLAS, PBLAS: matrix-matrix operations (serial and parallel).
• Level 1 and 2 BLACS: vector reductions, vector and matrix communications.
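As a concrete illustration of a Level 1 routine, here is a minimal sketch calling DDOT, the BLAS dot product (it assumes a BLAS library is installed and linked, e.g. with -lblas; the program name is illustrative):

   program test_ddot
      implicit none
      integer, parameter :: n = 1000
      real(kind=8) :: x(n), y(n), res
      real(kind=8), external :: ddot      ! Level 1 BLAS: dot product of two vectors

      x = 2.0d0
      y = 3.0d0
      ! ddot(n, dx, incx, dy, incy): incx and incy are the strides between elements
      res = ddot(n, x, 1, y, 1)
      print *, 'dot product = ', res      ! expected: 6000.0
   end program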

Page 13

LAPACK and ScaLAPACK: Linear Algebra PACKage and Scalable LAPACK (www.netlib.org)

Matrix decompositions, solution of linear systems, eigenvalue and eigenvector problems, linear least squares solutions.
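A minimal sketch of solving a small linear system with the LAPACK driver DGESV (assuming LAPACK and BLAS are linked, e.g. with -llapack -lblas; the matrix values are illustrative): the routine factorizes A with partial pivoting and overwrites b with the solution.

   program test_dgesv
      implicit none
      integer, parameter :: n = 3
      real(kind=8) :: a(n,n), b(n)
      integer :: ipiv(n), info

      ! Simple diagonally dominant system; the solution should be (1, 1, 1)
      a = 1.0d0
      a(1,1) = 4.0d0; a(2,2) = 4.0d0; a(3,3) = 4.0d0
      b = 6.0d0

      ! dgesv(n, nrhs, a, lda, ipiv, b, ldb, info): on exit b holds the solution
      call dgesv(n, 1, a, n, ipiv, b, n, info)
      if (info /= 0) print *, 'dgesv failed, info = ', info
      print *, 'solution = ', b
   end program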

Page 14

MKL, ESSL, ACML, CUBLAS, MAGMA, PLASMA

Page 15

MASS (IBM)
• Accelerated versions of SQRT, SIN, COS, EXP, LOG, etc. Scalar and vector versions.

Page 16

VML

Equivalent to MASS (vector versions only), for Intel processors.

Accelerated versions of: sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y

Page 17

VML

   do i = 1, n
      r = r + sin( a(i) )
   end do

becomes

   call vdsin( n, a, y )
   do i = 1, n
      r = r + y( i )
   end do

General form: CALL vml_subroutine( n, a, y )

Page 18

BLAS: matrix multiplication

DGEMM (transa, transb, l, n, m, alpha, a, lda, b, ldb, beta, c, ldc)

c = alpha * op( a ) * op( b ) + beta * c, where op(x) = x or x^T:

C(l,m) = sum_n A(l,n) B(n,m) + C(l,m)
C(l,m) = sum_n A(n,l) B(n,m) + C(l,m)    (A transposed)
C(l,m) = sum_n A(l,n) B(m,n) + C(l,m)    (B transposed)
C(l,m) = sum_n A(n,l) B(m,n) + C(l,m)    (A and B transposed)

real*8 a(lda,*), b(ldb,*), c(ldc,*)
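The leading dimensions lda, ldb and ldc are the declared first dimensions of the arrays, not the sizes of the multiplied blocks, so DGEMM can work directly on submatrices. A small sketch (array sizes are illustrative, not from the slides):

   program dgemm_submatrix
      implicit none
      real(kind=8) :: a(4,4), b(4,4), c(4,4)

      a = 1.0d0
      b = 2.0d0
      c = 0.0d0

      ! Multiply only the leading 2x2 blocks: l = n = m = 2,
      ! while lda = ldb = ldc = 4 (the declared leading dimension).
      call dgemm('N', 'N', 2, 2, 2, 1.0d0, a, 4, b, 4, 0.0d0, c, 4)

      print *, c(1:2,1:2)     ! each element is 1*2 + 1*2 = 4
   end program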

Page 19

Profiling with gprof

Compiler flag "-pg" or "-p" (depending on the compiler):

   gcc -pg mio.c             (compile and link with profiling enabled)
   ./a.out                   (the run produces gmon.out)
   gprof ./a.out gmon.out

Page 20

gcc -pg -funroll-loops -O2 dotprod.c -static

[cineca@rfxoff1 Carlo]$ ./a.out
 d = 1000000.000000

  %   cumulative   self              self      total
 time   seconds   seconds    calls  us/call   us/call  name
 68.57      0.05      0.05        2 23437.50  23437.50  set_vector
 31.43      0.07      0.02        1 21484.38  21484.38  dot_product
  0.00      0.07      0.00        1     0.00  68359.38  main

gprof

Page 21

Profiling "by hand"

Find the "hot spots" in your application.

Use timing functions:

CALL SYSTEM_CLOCK(iclk1, count_rate=nclk)

CALL critical_subroutine( …… )

CALL SYSTEM_CLOCK(iclk2)

PRINT *,REAL(iclk2-iclk1)/nclk

t1 = cclock()

CALL critical_subroutine( …… )

t2 = cclock()

PRINT *, (t2-t1)

CALL CPU_TIME( t3 )

CALL critical_subroutine( …… )

CALL CPU_TIME( t4 )

PRINT *, (t4-t3)

Page 22

Measure performance

   #include <stdio.h>
   #include <time.h>
   #include <ctype.h>
   #include <sys/types.h>
   #include <sys/time.h>

   double cclock_()
   {
      /* Returns the system clock value in seconds */
      struct timeval tmp;
      double sec;
      gettimeofday( &tmp, (struct timezone *)0 );
      sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;
      return sec;
   }

Page 23

Explicit triple loop:

   PROGRAM test_dgemm
     IMPLICIT NONE
     INTEGER, PARAMETER :: dim = 1000
     REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:)
     INTEGER :: i, j, k
     REAL*8 :: t1, t2
     REAL*8 :: cclock
     EXTERNAL :: cclock
     ALLOCATE( x( dim, dim ), y( dim, dim ) )
     ALLOCATE( z( dim, dim ) )
     y = 1.0d0
     z = 1.0d0 / DBLE( dim )
     x = 0.0d0
     t1 = cclock()
     do j = 1, dim
        do i = 1, dim
           do k = 1, dim
              x(i,j) = x(i,j) + y(i,k) * z(k,j)
           end do
        end do
     end do
     t2 = cclock()
     write(*,*) ' Matrix sum = ', sum(x)
     write(*,*) ' time (seconds) ', t2-t1
     DEALLOCATE( x, y, z )
   END PROGRAM

Using the DGEMM library routine:

   PROGRAM test_dgemm
     IMPLICIT NONE
     INTEGER, PARAMETER :: dim = 1000
     REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:)
     INTEGER :: i, j, k
     REAL*8 :: t1, t2
     REAL*8 :: cclock
     EXTERNAL :: cclock
     ALLOCATE( x( dim, dim ), y( dim, dim ), z( dim, dim ) )
     y = 1.0d0
     z = 1.0d0 / DBLE( dim )
     x = 0.0d0
     t1 = cclock()
     ! x = matmul( y, z )
     call dgemm('N', 'N', dim, dim, dim, 1.0d0, y, dim, z, dim, 0.0d0, x, dim)
     t2 = cclock()
     write(*,*) ' Matrix sum = ', sum(x)
     write(*,*) ' time (seconds) ', t2-t1
     DEALLOCATE( x, y, z )
   END PROGRAM

Page 24

ATLAS

BLAS compatible

Automatically Tuned Linear Algebra Software

http://sourceforge.net/
http://math-atlas.sourceforge.net/devel/

Page 25

http://www.fftw.org

Fast Fourier Transform

FFT complex to complex
FFT complex to real
Parallel FFT
Multi-threaded FFT
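A minimal sketch of a 1-D complex-to-complex transform using the legacy FFTW Fortran interface (an assumption for illustration: FFTW 3 is installed, its fftw3.f header is on the include path, and the program is linked with -lfftw3):

   program test_fftw
      implicit none
      include 'fftw3.f'                 ! defines FFTW_FORWARD, FFTW_ESTIMATE, ...
      integer, parameter :: n = 8
      integer(kind=8) :: plan           ! FFTW plans are stored in an integer of kind 8
      complex(kind=8) :: in(n), out(n)
      integer :: i

      do i = 1, n
         in(i) = cmplx( i, 0.0d0, kind=8 )
      end do

      call dfftw_plan_dft_1d( plan, n, in, out, FFTW_FORWARD, FFTW_ESTIMATE )
      call dfftw_execute( plan )
      call dfftw_destroy_plan( plan )

      print *, 'first coefficient (sum of the input) = ', out(1)   ! expected: (36.0, 0.0)
   end program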

Page 26

Advanced techniques

Page 27

Case study: matrix transposition

   do i=1,n
      do j=1,m
         y(j,i) = x(i,j)
      enddo
   enddo

Think Fortran: consecutive elements in memory run along the first index.
(Diagram: matrices x and y with their access patterns.)

Page 28

What happens with the cache

Suppose a 2-way set associative cache. (Diagram: how the data of x and y map into the cache; y "allocates" the 1st way.)

For each element of x a whole cache line must be loaded:
1) x "allocates" the 2nd way.
2) There is a risk of thrashing.
3) When the cache is full, the processor starts to overwrite cache lines.

Page 29

What happens with the cache (cont.)

Suppose a 2-way set associative cache. (Diagram: data mapped in cache; y "allocates" the 1st way.)

As before, for each element of x a whole cache line must be loaded. A lot of data is loaded into the cache but never used!

Page 30

Block algorithm

Suppose a 2-way set associative cache. (Diagram: x and y processed block by block.)

Load a block of data into cache.
Transpose (swap) the data while it is in cache.
Write the data back to memory.

Page 31

Original loops:

   do i=1,n
      do j=1,m
         y(j,i) = x(i,j)
      enddo
   enddo

Solution: block algorithm

   bsiz = block size
   nb = n / bsiz
   mb = m / bsiz

   do ib = 1, nb
      ioff = (ib-1) * bsiz
      do jb = 1, mb
         joff = (jb-1) * bsiz
         do j = 1, bsiz
            do i = 1, bsiz
               buf(i,j) = x(i+ioff, j+joff)
            enddo
         enddo
         do j = 1, bsiz
            do i = 1, j-1
               bswp = buf(i,j)
               buf(i,j) = buf(j,i)
               buf(j,i) = bswp
            enddo
         enddo
         do i = 1, bsiz
            do j = 1, bsiz
               y(j+joff, i+ioff) = buf(j,i)
            enddo
         enddo
      enddo
   enddo

You still need to handle the remainders when MOD(n, bsiz) /= 0 or MOD(m, bsiz) /= 0.

Page 32

Whole block transpose:

   do ib = 1, nb
      ioff = (ib-1) * bsiz
      do jb = 1, mb
         joff = (jb-1) * bsiz
         do j = 1, bsiz
            do i = 1, bsiz
               buf(i,j) = x(i+ioff, j+joff)
            enddo
         enddo
         do j = 1, bsiz
            do i = 1, j-1
               bswp = buf(i,j)
               buf(i,j) = buf(j,i)
               buf(j,i) = bswp
            enddo
         enddo
         do i = 1, bsiz
            do j = 1, bsiz
               y(j+joff, i+ioff) = buf(j,i)
            enddo
         enddo
      enddo
   enddo

Remainder rows (1), executed when MOD(n, bsiz) > 0:

   IF( MIN( 1, MOD(n,bsiz) ) .GT. 0 ) THEN
      ioff = nb * bsiz
      do jb = 1, mb
         joff = (jb-1) * bsiz
         do j = 1, bsiz
            do i = 1, MIN(bsiz, n-ioff)
               buf(i,j) = x(i+ioff, j+joff)
            enddo
         enddo
         do i = 1, MIN(bsiz, n-ioff)
            do j = 1, bsiz
               y(j+joff, i+ioff) = buf(i,j)
            enddo
         enddo
      enddo
   END IF

Remainder columns (2), executed when MOD(m, bsiz) > 0:

   IF( MIN( 1, MOD(m,bsiz) ) .GT. 0 ) THEN
      joff = mb * bsiz
      do ib = 1, nb
         ioff = (ib-1) * bsiz
         do j = 1, MIN(bsiz, m-joff)
            do i = 1, bsiz
               buf(i,j) = x(i+ioff, j+joff)
            enddo
         enddo
         do i = 1, bsiz
            do j = 1, MIN(bsiz, m-joff)
               y(j+joff, i+ioff) = buf(i,j)
            enddo
         enddo
      enddo
   END IF

Corner block (3), executed when both remainders are non-zero:

   IF( MIN(1,MOD(n,bsiz)).GT.0 .AND. &
       MIN(1,MOD(m,bsiz)).GT.0 ) THEN
      joff = mb * bsiz
      ioff = nb * bsiz
      do j = 1, MIN(bsiz, m-joff)
         do i = 1, MIN(bsiz, n-ioff)
            buf(i,j) = x(i+ioff, j+joff)
         enddo
      enddo
      do i = 1, MIN(bsiz, n-ioff)
         do j = 1, MIN(bsiz, m-joff)
            y(j+joff, i+ioff) = buf(i,j)
         enddo
      enddo
   END IF

Page 33

Performance tuning and analysis: user codes

(Plot: matrix transposition, matrix size 2048x2048. Execution time in seconds (0.00 to 0.50) versus block size (0 to 120), comparing the straightforward implementation with the block implementation.)

Page 34

Parameter-dependent code & unrolling

General code:

   DO l = 1, nphase
      IF( au1(l,l) /= 0.D0 ) THEN
         lp1 = l + 1
         div = 1.D0 / au1(l,l)
         DO lj = lp1, nphase
            au1(l,lj) = au1(l,lj) * div
         END DO
         bu1(l) = bu1(l) * div
         au1(l,l) = 0.D0
         DO li = 1, nphase
            amul = au1(li,l)
            DO lj = lp1, nphase
               au1(li,lj) = au1(li,lj) - amul * au1(l,lj)
            END DO
            bu1(li) = bu1(li) - amul * bu1(l)
         END DO
      END IF
   END DO

Fully unrolled version for nphase = 3:

   IF( a(1,1) /= 0.D0 ) THEN
      div = 1.D0 / a(1,1)
      a(1,2) = a(1,2) * div
      a(1,3) = a(1,3) * div
      b(1) = b(1) * div
      a(1,1) = 0.D0
      ! li = 2
      amul = a(2,1)
      a(2,2) = a(2,2) - amul * a(1,2)
      a(2,3) = a(2,3) - amul * a(1,3)
      b(2) = b(2) - amul * b(1)
      ! li = 3
      amul = a(3,1)
      a(3,2) = a(3,2) - amul * a(1,2)
      a(3,3) = a(3,3) - amul * a(1,3)
      b(3) = b(3) - amul * b(1)
   END IF

   IF( a(2,2) /= 0.D0 ) THEN
      div = 1.D0 / a(2,2)
      a(2,3) = a(2,3) * div
      b(2) = b(2) * div
      a(2,2) = 0.D0
      ! li = 1
      amul = a(1,2)
      a(1,3) = a(1,3) - amul * a(2,3)
      b(1) = b(1) - amul * b(2)
      ! li = 3
      amul = a(3,2)
      a(3,3) = a(3,3) - amul * a(2,3)
      b(3) = b(3) - amul * b(2)
   END IF

   IF( a(3,3) /= 0.D0 ) THEN
      div = 1.D0 / a(3,3)
      b(3) = b(3) * div
      a(3,3) = 0.D0
      ! li = 1
      amul = a(1,3)
      b(1) = b(1) - amul * b(3)
      ! li = 2
      amul = a(2,3)
      b(2) = b(2) - amul * b(3)
   END IF

For a given set of (input) parameters, every loop can be eliminated, optimizing cache use and the execution pipelines.

Page 35

Debugging (post mortem)

   program hello_bug
      real(kind=8) :: a( 10 )
      call clearv( a, 10000 )
      print *, SUM( a )
   end program

   subroutine clearv( a, n )
      real(kind=8) :: a( * )
      integer :: n
      integer :: i
      do i = 1, n
         a( n ) = 0.0     ! bug: writes a(n) with n = 10000, far beyond the 10 elements of a
      end do
   end subroutine

gfortran -g hello_bug.f90

Remove the core size limit:
ulimit -c unlimited

./a.out
Segmentation fault (core dumped)

gdb ./a.out core

Page 36

Debugging (in vivo)

gfortran -g hello_bug.f90

gdb ./a.out

GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-32.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /plx/userinternal/acv0/a.out...done.

(gdb) run
Starting program: /plx/userinternal/acv0/a.out
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000

Program received signal SIGSEGV, Segmentation fault.
0x0000000000400833 in clearv (a=0x7fffffffe2f0, n=@0x400968) at hello_bug.f90:12
12          a( n ) = 0.0

Page 37

Link Fortran and C

C file (cclock.c):

   #include <sys/types.h>
   #include <sys/time.h>

   double cclock_()
   {
      /* Returns the system clock value in seconds */
      struct timeval tmp;
      double sec;
      gettimeofday( &tmp, (struct timezone *)0 );
      sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;
      return sec;
   }

Fortran file (matmul_prof.f90):

   program mat_mul
      integer, parameter :: n = 100
      real*8 :: a(n,n), b(n,n), c(n,n)
      real*8 :: t1, t2
      real*8 :: cclock
      external cclock
      a = 1.0d0
      b = 1.0d0
      t1 = cclock()
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n )
      t2 = cclock()
      write(*,*) SUM(c), t2-t1
   end program

1) gcc -c cclock.c
2) f95 matmul_prof.f90 cclock.o -L. -lblas

Page 38

Link Fortran and C

Link a C subroutine with a Fortran program.

rand.f90:

   program rand
      real(kind=8) :: a
      external crand
      call crand( a )
      print *, 'this is random ', a
   end program

crand.c:

   #include <stdlib.h>
   #include <time.h>

   void crand_( double * x )
   {
      (*x) = ( (double)random()/(double)RAND_MAX );
   }

Fortran passes arguments by reference, while C passes them by value; note the trailing underscore in the C function name crand_.

Page 39

Link a Fortran subroutine with a C program

cvec.c:

   #include <stdio.h>

   int main()
   {
      int n;
      double a[10], d;

      n = 10;
      d = 1.0;
      setv_( a, &d, &n );
      printf("%lf\n", a[0]);
   }

vset.f90:

   subroutine setv( a, d, n )
      real(kind=8) :: a( * )
      real(kind=8) :: d
      integer :: n
      integer :: i
      do i = 1, n
         a( i ) = d
      end do
   end subroutine

Note that the C code calls setv_ (with a trailing underscore) and passes the scalar arguments by address.

Page 40

Make Command

If a code is large and/or it shares subroutines with other codes, it is useful to split the source into many files, which can be placed in different directories.

In Fortran 90 there are dependencies among program units: modules must be compiled before the program units that use them. Therefore there is a well-defined order for compiling the source files.

To avoid compiling the sources by hand in the proper order, the make command can be used.

Page 41

Make Command

The make command can be programmed to do the job for you, using a file containing instructions and directives.

By default the make command looks in the current directory for a file called Makefile or makefile.

Page 42

A simple makefile

   # this is a comment within the makefile

   myprog.x : modules.o main.o
   	f90 -o myprog.x modules.o main.o

   modules.o : modules.f90
   	f90 -c modules.f90

   main.o : modules.o main.f90
   	f90 -c main.f90

The first rule tells the make command that myprog.x depends on modules.o and main.o; make executes the link command only after modules.o and main.o have been built. (In a real makefile, each command line must start with a tab character.)

To compile the code, from the console the programmer issues the command:
> make

Page 43

A less simple makefile

   # this is a comment within the makefile

   myprog.x : modules.o main.o
   	f90 -o myprog.x modules.o main.o

   main.o : modules.o

   .f90.o :
   	f90 -c $<

The ".f90.o" rule is an implicit (suffix) rule: it states that all ".o" files depend on, and should be generated from, the corresponding ".f90" files. "$<" is a make macro, expanded to the proper ".f90" file name.

In the above example, make tries to build myprog.x but realizes that main.o and modules.o must be generated first. It then looks for a rule to make the ".o" files, finds that main.o depends on modules.o, and builds an internal hierarchy for compilation in which modules.o comes before main.o. Finally, make applies the implicit rule and starts compiling the sources.