Optimization techniques
Carlo Cavazzoni, HPC department, CINECA
www.cineca.it
Mar 27, 2015
Modern node architecture
CPU
RAM
Disk
Cache (instruction and data): small and fast
Cache
Hierarchy: register, L1, L2, L3, RAM
L1: instruction and data
Size: increases from L1 to Ln
Speed: decreases from L1 to Ln
The CPU looks for data in L1. If the data is there, it is an L1 cache hit; if not, it is an L1 cache miss and the CPU looks in L2, and so on.
The cache miss penalty is measured in clock cycles.
Direct-mapped cache
[Figure: main memory drawn as consecutive 32 Kbyte blocks (addresses 0, 32 K, 64 K, ... 128 K), all mapped onto a single 32 Kbyte cache]
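Set-associative cache
[Figure: example of a 2-way set-associative cache. Main memory is drawn as consecutive 32 Kbyte blocks (address labels 0, 16 K, 32 K, 48 K, 64 K, ... 128 K); the cache is made of two 16 Kbyte ways]
Replacement policies: Least Recently Used, Round Robin, Random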
Loop optimization
Loop fusion
Locality in time

Separate loops:
do i=1, n
  a(i) = b(i) + 1.0
enddo
do i=2, n
  c(i) = sqrt(a(i-1))
enddo
If n is big enough, a is loaded into cache, evicted, and loaded again.

Fused loop:
a(1) = b(1) + 1.0
do i=2, n
  a(i) = b(i) + 1.0
  c(i) = sqrt(a(i-1))
enddo
The elements of a are reused while they are still in cache.
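The same transformation can be sketched in C. This is only an illustration under a few assumptions: arrays are 0-based, the function names `split_loops` and `fused_loop` are invented for this example, and the square root of the slide is replaced by a multiplication so that the sketch does not depend on the math library.

```c
#include <stddef.h>

/* Two separate loops: a[] is traversed twice, so for large n it is
   streamed through the cache two times. */
void split_loops(size_t n, double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + 1.0;
    for (size_t i = 1; i < n; i++)
        c[i] = 2.0 * a[i - 1];   /* stands in for sqrt(a(i-1)) */
}

/* Fused loop: a[i-1] was written one iteration earlier, so it is
   still in cache (or in a register) when it is read. */
void fused_loop(size_t n, double *a, const double *b, double *c)
{
    if (n == 0)
        return;
    a[0] = b[0] + 1.0;           /* peeled first iteration */
    for (size_t i = 1; i < n; i++) {
        a[i] = b[i] + 1.0;
        c[i] = 2.0 * a[i - 1];
    }
}
```

Note that the peeled assignment to `a[0]` has to come before the fused loop, because the first fused iteration already reads it.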
Loop interchange
Locality in space

do i=1, n
  do j=1, n
    a(i,j) = b(i,j) + 1.0
  enddo
enddo
Elements are loaded into cache lines and only one of them is used before the line is replaced with new elements.

do j=1, n
  do i=1, n
    a(i,j) = b(i,j) + 1.0
  enddo
enddo
Elements are loaded into cache and all of them are used before they are replaced with new elements.

[Figure: memory layout of a and b (addresses 0x00, 0x01, 0x02, 0x03, ...), with the indices i and j marked on each array]
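In C the situation is mirrored, because C stores arrays row by row while Fortran stores them column by column. The sketch below assumes a fixed illustrative size `N` and invented function names; it only shows which loop order walks memory contiguously in C.

```c
#include <stddef.h>

#define N 64  /* illustrative size, not taken from the slides */

/* Cache-unfriendly in C: the inner loop walks down a column, so
   consecutive accesses are N doubles apart in memory. */
void add_one_column_order(double a[N][N], double b[N][N])
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] = b[i][j] + 1.0;
}

/* Cache-friendly in C: the inner loop walks along a row, so
   consecutive accesses touch adjacent memory locations. */
void add_one_row_order(double a[N][N], double b[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = b[i][j] + 1.0;
}
```

Both functions compute the same result; only the order in which memory is touched differs.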
Cache thrashing

real, dimension (1024) :: a, b
COMMON /my_com/ a, b
do i=1, 1024
  a(i) = b(i) + 1.0
enddo
Cache size = 4*1024 bytes, direct mapped, a and b contiguous: cache thrashing.
Array size = multiple of the cache size: possible source of cache thrashing.
A set-associative cache helps to reduce thrashing problems.

Padding
integer offset = (linea_cache)/SIZE(REAL)      (linea_cache = cache line size)
real, dimension (1024+offset) :: a, b
COMMON /my_com/ a, b
do i=1, 1024
  a(i) = b(i) + 1.0
enddo
The offset shifts the arrays with respect to the cache: no more problems.
Avoid powers of 2 for matrix dimensions.
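The effect of the padding can be checked with a little address arithmetic. The sketch below assumes 4-byte reals, a direct-mapped cache of 4*1024 bytes, and a 32-byte cache line; the line size and the helper names are assumptions made for this example, not values from the slides.

```c
#include <stddef.h>

#define CACHE_SIZE 4096u  /* bytes: 4*1024 */
#define LINE_SIZE    32u  /* bytes per cache line: an assumption */

/* Index of the cache line that a byte address maps to in a
   direct-mapped cache. */
unsigned cache_line_index(size_t byte_address)
{
    return (unsigned)((byte_address % CACHE_SIZE) / LINE_SIZE);
}

/* Byte address of element i (0-based) of the second array b when
   two float arrays of `len` elements are stored back to back
   starting at address 0, like a and b in the COMMON block. */
size_t address_of_b(size_t len, size_t i)
{
    return (len + i) * sizeof(float);
}
```

With `len = 1024`, `a[i]` (at byte `4*i`) and `b[i]` map to the same cache line for every `i`, so each access evicts the line the other array needs. With `len = 1024 + 8` (one cache line of padding) the two always land on different lines.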
Loop unrolling

do j=1, n
  do i=1, (n-1)
    a(i,j) = b(i,j) + b(i+1,j) + 1.0
  enddo
enddo

do j=1, n
  do i=1, (n-1), 2
    a(i,j)   = b(i,j)   + b(i+1,j) + 1.0
    a(i+1,j) = b(i+1,j) + b(i+2,j) + 1.0
  enddo
enddo

Equivalent loops (provided n-1 is even; otherwise the last iteration must be handled separately).
Fewer jumps.
Fewer dependencies.
Fill pipelines and vector units.
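When the trip count is not a multiple of the unrolling factor, a clean-up iteration is needed. A minimal one-dimensional C sketch, with invented function names and 0-based indexing:

```c
#include <stddef.h>

/* Reference loop: a[i] = b[i] + b[i+1] + 1 for i = 0 .. n-2. */
void stencil(size_t n, double *a, const double *b)
{
    for (size_t i = 0; i + 1 < n; i++)
        a[i] = b[i] + b[i + 1] + 1.0;
}

/* Same computation unrolled by 2. The main loop only runs while two
   full iterations fit; a single clean-up iteration handles an odd
   trip count, so b is never read out of bounds. */
void stencil_unrolled(size_t n, double *a, const double *b)
{
    size_t i = 0;
    for (; i + 2 < n; i += 2) {
        a[i]     = b[i]     + b[i + 1] + 1.0;
        a[i + 1] = b[i + 1] + b[i + 2] + 1.0;
    }
    if (i + 1 < n)                      /* remainder iteration */
        a[i] = b[i] + b[i + 1] + 1.0;
}
```

Modern compilers often perform this transformation themselves (for example with `-funroll-loops` in gcc), so manual unrolling is worth measuring before adopting.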
Optimize with numerical libraries
Less coding
Tested and (almost) bug free
Standard
Efficient implementation
Optimized
BLAS
Basic Linear Algebra Subprograms, Parallel BLAS and Basic Linear Algebra Communication Subprograms (www.netlib.org)
• Level 1 BLAS: vector-vector operations (scalar only).
• Level 2 BLAS, PBLAS: matrix-vector operations (scalar and parallel).
• Level 3 BLAS, PBLAS: matrix-matrix operations (scalar and parallel).
• Level 1 and 2 BLACS: vector reduction, vector and matrix communications.
LAPACK and ScaLAPACK
Linear Algebra Package and Scalable LAPACK (www.netlib.org)
Matrix decomposition.
Solution of linear systems.
Eigenvalues and eigenvectors.
Linear least squares solutions.
MKL, ESSL, ACML, CUBLAS, MAGMA, PLASMA
MASS (IBM)
• Accelerated versions of SQRT, SIN, COS, EXP, LOG, etc.
• Scalar and vector
VML
Equivalent to MASS (vector version only), for Intel processors.
Accelerated versions of: sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y
VML
General form: CALL vml_subroutine( n, a, y )

do i = 1, n
  r = r + sin( a(i) )
end do

becomes

call vdsin( n, a, y )
do i = 1, n
  r = r + y( i )
end do
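BLAS
Matrix multiplication
DGEMM (transa, transb, l, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
c = alpha * op( a ) * op( b ) + beta * c
Here op( x ) is x or its transpose, c is l x m, and n is the summed dimension. With alpha = beta = 1 the four combinations of transa and transb give:
$C_{lm} = \sum_n A_{ln} B_{nm} + C_{lm}$
$C_{lm} = \sum_n (A^T)_{ln} B_{nm} + C_{lm}$
$C_{lm} = \sum_n A_{ln} (B^T)_{nm} + C_{lm}$
$C_{lm} = \sum_n (A^T)_{ln} (B^T)_{nm} + C_{lm}$
real*8 a(lda,*), b(ldb,*), c(ldc,*)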
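To make the meaning of the arguments concrete, here is a plain C reference for the non-transposed case only. It is a sketch for checking results, not an optimized kernel; the name `ref_dgemm_nn` is invented, and the index names follow the equations on the slide (c is l x m, the sum runs over n).

```c
#include <stddef.h>

/* Reference version of c = alpha*a*b + beta*c for the 'N','N' case.
   Matrices are column-major with leading dimensions lda, ldb, ldc:
   element (i,j) of a is a[i + j*lda].
   a is l x n, b is n x m, c is l x m. */
void ref_dgemm_nn(size_t l, size_t m, size_t n, double alpha,
                  const double *a, size_t lda,
                  const double *b, size_t ldb,
                  double beta, double *c, size_t ldc)
{
    for (size_t j = 0; j < m; j++) {
        /* Scale column j of c; beta == 0 overwrites without reading. */
        for (size_t i = 0; i < l; i++)
            c[i + j * ldc] = (beta == 0.0) ? 0.0 : beta * c[i + j * ldc];
        /* Accumulate alpha * a(:,k) * b(k,j); the innermost loop runs
           down a column, which is contiguous in column-major storage. */
        for (size_t k = 0; k < n; k++) {
            double bkj = alpha * b[k + j * ldb];
            for (size_t i = 0; i < l; i++)
                c[i + j * ldc] += a[i + k * lda] * bkj;
        }
    }
}
```

The leading dimension is the distance in memory between two consecutive columns, which is why it can be larger than the number of rows actually used.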
Profiling with gprof
Compiler flag "-pg" or "-p" (depends on the compiler)
gcc -pg -c mio.c
gcc -pg mio.o
./a.out                  (running the program writes gmon.out)
gprof ./a.out gmon.out
gcc -pg -funroll-loops -O2 dotprod.c -static
[cineca@rfxoff1 Carlo]$ ./a.out
d = 1000000.000000

gprof output:
  %   cumulative   self               self      total
 time   seconds   seconds    calls   us/call   us/call  name
 68.57      0.05     0.05        2  23437.50  23437.50  set_vector
 31.43      0.07     0.02        1  21484.38  21484.38  dot_product
  0.00      0.07     0.00        1      0.00  68359.38  main
Profiling "by hand"
Find the "hot spots" in your application
Use timing functions
CALL SYSTEM_CLOCK(iclk1, count_rate=nclk)
CALL critical_subroutine( …… )
CALL SYSTEM_CLOCK(iclk2)
PRINT *,REAL(iclk2-iclk1)/nclk
t1 = cclock()
CALL critical_subroutine( …… )
t2 = cclock()
PRINT *, (t2-t1)
CALL CPU_TIME( t3 )
CALL critical_subroutine( …… )
CALL CPU_TIME( t4 )
PRINT *, (t4-t3)
Measure performance
#include <stdio.h>
#include <time.h>
#include <ctype.h>
#include <sys/types.h>
#include <sys/time.h>

double cclock_()
{
  /* Returns the value of the system clock in seconds */
  struct timeval tmp;
  double sec;
  gettimeofday( &tmp, (struct timezone *)0 );
  sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;
  return sec;
}
PROGRAM test_dgemm
  IMPLICIT NONE
  INTEGER, PARAMETER :: dim = 1000
  REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:)
  INTEGER :: i, j, k
  REAL*8 :: t1, t2
  REAL*8 :: cclock
  EXTERNAL :: cclock
  ALLOCATE( x( dim, dim ), y( dim, dim ) )
  ALLOCATE( z( dim, dim ) )
  y = 1.0d0
  z = 1.0d0 / DBLE( dim )
  x = 0.0d0
  t1 = cclock( )
  ! hand-written matrix product
  do j = 1, dim
    do i = 1, dim
      do k = 1, dim
        x(i,j) = x(i,j) + y(i,k) * z(k,j)
      end do
    end do
  end do
  t2 = cclock()
  write(*,*) ' Matrix sum = ', sum(x)
  write(*,*) ' time (seconds) ', t2-t1
  DEALLOCATE( x, y, z )
END PROGRAM

PROGRAM test_dgemm
  IMPLICIT NONE
  INTEGER, PARAMETER :: dim = 1000
  REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:)
  INTEGER :: i, j, k
  REAL*8 :: t1, t2
  REAL*8 :: cclock
  EXTERNAL :: cclock
  ALLOCATE( x( dim, dim ), y( dim, dim ), z( dim, dim ) )
  y = 1.0d0
  z = 1.0d0 / DBLE( dim )
  x = 0.0d0
  t1 = cclock()
  ! same product computed by the library; alternative: x = matmul( y, z )
  call dgemm('N', 'N', dim, dim, dim, 1.0d0, y, &
             dim, z, dim, 0.0d0, x, dim)
  t2 = cclock()
  write(*,*) ' Matrix sum = ', sum(x)
  write(*,*) ' time (seconds) ', t2-t1
  DEALLOCATE( x, y, z )
END PROGRAM
ATLAS
Automatically Tuned Linear Algebra Software
BLAS compatible
http://sourceforge.net/
http://math-atlas.sourceforge.net/devel/
http://www.fftw.org
Fast Fourier Transform
FFT complex to complex
FFT complex to real
Parallel FFT
Multi-thread FFT
Advanced techniques
Case study: matrix transposition
do i=1,n
  do j=1,m
    y(j,i) = x(i,j)
  enddo
enddo
[Figure: layout of y and x in memory]
Think Fortran: elements that are consecutive along the first index are contiguous in memory.
What happens with the cache
Suppose a 2-way set-associative cache.
[Figure: data of y and x mapped into the cache]
y "allocates" the 1st way.
For each value of x a whole cache line has to be loaded into cache:
1) x "allocates" the 2nd way.
2) Risk of thrashing.
3) When the cache is full, the processor starts to overwrite cache lines.

What happens with the cache, cont.
Suppose a 2-way set-associative cache.
[Figure: data of y and x mapped into the cache]
y "allocates" the 1st way.
As before, for each value of x a whole cache line has to be loaded into cache.
A lot of data is loaded into the cache but never used!

Block algorithm
Suppose a 2-way set-associative cache.
[Figure: blocks of y and x mapped into the cache]
1) Load a block of data into cache
2) Swap data in cache
3) Write data back to memory
Solution: block algorithm

Straightforward version:
do i=1,n
  do j=1,m
    y(j,i) = x(i,j)
  enddo
enddo

Block version:
bsiz = block size
nb = n / bsiz
mb = m / bsiz

do ib = 1, nb
  ioff = (ib-1) * bsiz
  do jb = 1, mb
    joff = (jb-1) * bsiz
    ! load a block of x into the buffer
    do j = 1, bsiz
      do i = 1, bsiz
        buf(i,j) = x(i+ioff, j+joff)
      enddo
    enddo
    ! transpose the buffer in place
    do j = 1, bsiz
      do i = 1, j-1
        bswp = buf(i,j)
        buf(i,j) = buf(j,i)
        buf(j,i) = bswp
      enddo
    enddo
    ! write the transposed block to y
    do i=1,bsiz
      do j=1,bsiz
        y(j+joff, i+ioff) = buf(j,i)
      enddo
    enddo
  enddo
enddo

You need to handle the cases MOD(n, bsiz) /= 0 or MOD(m, bsiz) /= 0.
Whole block transpose
do ib = 1, nb
  ioff = (ib-1) * bsiz
  do jb = 1, mb
    joff = (jb-1) * bsiz
    do j = 1, bsiz
      do i = 1, bsiz
        buf(i,j) = x(i+ioff,j+joff)
      enddo
    enddo
    do j = 1, bsiz
      do i = 1, j-1
        bswp = buf(i,j)
        buf(i,j) = buf(j,i)
        buf(j,i) = bswp
      enddo
    enddo
    do i=1,bsiz
      do j=1,bsiz
        y(j+joff,i+ioff) = buf(j,i)
      enddo
    enddo
  enddo
enddo

! Case 1: leftover rows of x (n is not a multiple of bsiz)
IF( MOD(n, bsiz) .GT. 0 ) THEN
  ioff = nb * bsiz
  do jb = 1, mb
    joff = (jb-1) * bsiz
    do j = 1, bsiz
      do i = 1, MIN(bsiz, n-ioff)
        buf(i,j) = x(i+ioff, j+joff)
      enddo
    enddo
    do i = 1, MIN(bsiz, n-ioff)
      do j = 1, bsiz
        y(j+joff,i+ioff) = buf(i,j)
      enddo
    enddo
  enddo
END IF

! Case 2: leftover columns of x (m is not a multiple of bsiz)
IF( MOD(m, bsiz) .GT. 0 ) THEN
  joff = mb * bsiz
  do ib = 1, nb
    ioff = (ib-1) * bsiz
    do j = 1, MIN(bsiz, m-joff)
      do i = 1, bsiz
        buf(i,j) = x(i+ioff, j+joff)
      enddo
    enddo
    do i = 1, bsiz
      do j = 1, MIN(bsiz, m-joff)
        y(j+joff,i+ioff) = buf(i,j)
      enddo
    enddo
  enddo
END IF

! Case 3: the corner block left over in both directions
IF( MOD(n, bsiz) .GT. 0 .AND. MOD(m, bsiz) .GT. 0 ) THEN
  joff = mb * bsiz
  ioff = nb * bsiz
  do j = 1, MIN(bsiz, m-joff)
    do i = 1, MIN(bsiz, n-ioff)
      buf(i,j) = x(i+ioff, j+joff)
    enddo
  enddo
  do i = 1, MIN(bsiz, n-ioff)
    do j = 1, MIN(bsiz, m-joff)
      y(j+joff,i+ioff) = buf(i,j)
    enddo
  enddo
END IF
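The three remainder cases can also be folded into the main loops by clamping the tile bounds. The C sketch below is a variant of the slide's algorithm rather than a translation: it omits the explicit `buf` array and relies on each tile being small enough to stay in cache. The function names are invented, and storage is column-major to match the Fortran layout.

```c
#include <stddef.h>

static size_t min_size(size_t p, size_t q)
{
    return p < q ? p : q;
}

/* y = transpose(x). x is n x m and y is m x n, both column-major:
   x(i,j) is x[i + j*n] and y(j,i) is y[j + i*m].
   The matrices are processed in tiles of at most bsiz x bsiz
   elements (bsiz must be > 0). Edge tiles are clamped, so n and m
   do not have to be multiples of bsiz. */
void transpose_blocked(size_t n, size_t m, size_t bsiz,
                       const double *x, double *y)
{
    for (size_t ioff = 0; ioff < n; ioff += bsiz) {
        size_t iend = min_size(ioff + bsiz, n);
        for (size_t joff = 0; joff < m; joff += bsiz) {
            size_t jend = min_size(joff + bsiz, m);
            /* Transpose one tile: reads of x are contiguous in i,
               writes to y stay within the same small tile. */
            for (size_t j = joff; j < jend; j++)
                for (size_t i = ioff; i < iend; i++)
                    y[j + i * m] = x[i + j * n];
        }
    }
}
```

The best value of `bsiz` depends on the cache size and has to be found by measurement.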
Performance tuning and analysis: user codes
Matrix transposition, matrix size: 2048x2048
[Figure: execution time (axis from 0.00 to 0.50) versus block size (axis from 0 to 120), comparing the straightforward implementation with the block implementation]
Parameter-dependent code and unrolling
For a given set of (input) parameters, every loop can be eliminated, optimizing the use of the cache and of the execution pipeline.

General version:
DO l=1,nphase
  IF(au1(l,l) /= 0.D0) THEN
    lp1=l+1
    div=1.D0/au1(l,l)
    DO lj=lp1,nphase
      au1(l,lj)=au1(l,lj)*div
    END DO
    bu1(l)=bu1(l)*div
    au1(l,l)=0.D0
    DO li=1,nphase
      amul=au1(li,l)
      DO lj=lp1,nphase
        au1(li,lj)=au1(li,lj)-amul*au1(l,lj)
      END DO
      bu1(li)=bu1(li)-amul*bu1(l)
    END DO
  END IF
END DO

Version unrolled for nphase = 3:
IF( a(1,1) /= 0.D0 ) THEN
  div = 1.D0 / a(1,1)
  a(1,2) = a(1,2) * div
  a(1,3) = a(1,3) * div
  b(1) = b(1) * div
  a(1,1) = 0.D0
  !li=2
  amul = a(2,1)
  a(2,2) = a(2,2) - amul * a(1,2)
  a(2,3) = a(2,3) - amul * a(1,3)
  b(2) = b(2) - amul * b(1)
  !li=3
  amul = a(3,1)
  a(3,2) = a(3,2) - amul * a(1,2)
  a(3,3) = a(3,3) - amul * a(1,3)
  b(3) = b(3) - amul * b(1)
END IF
IF( a(2,2) /= 0.D0 ) THEN
  div=1.D0/a(2,2)
  a(2,3)=a(2,3)*div
  b(2)=b(2)*div
  a(2,2)=0.D0
  !li=1
  amul=a(1,2)
  a(1,3)=a(1,3)-amul*a(2,3)
  b(1)=b(1)-amul*b(2)
  !li=3
  amul=a(3,2)
  a(3,3)=a(3,3)-amul*a(2,3)
  b(3)=b(3)-amul*b(2)
END IF
IF( a(3,3) /= 0.D0 ) THEN
  div=1.D0/a(3,3)
  b(3)=b(3)*div
  a(3,3)=0.D0
  !li=1
  amul=a(1,3)
  b(1)=b(1)-amul*b(3)
  !li=2
  amul=a(2,3)
  b(2)=b(2)-amul*b(3)
END IF
Debugging (post mortem)
program hello_bug
  real(kind=8) :: a( 10 )
  call clearv( a, 10000 )
  print *, SUM( a )
end program

subroutine clearv( a, n )
  real(kind=8) :: a( * )
  integer :: n
  integer :: i
  do i = 1, n
    a( n ) = 0.0
  end do
end subroutine

gfortran -g hello_bug.f90
Remove the core size limit:
ulimit -c unlimited
./a.out
Segmentation fault (core dumped)
gdb ./a.out core
Debugging (in vivo)
gfortran -g hello_bug.f90
gdb ./a.out
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-32.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /plx/userinternal/acv0/a.out...done.
(gdb) run
Starting program: /plx/userinternal/acv0/a.out
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400833 in clearv (a=0x7fffffffe2f0, n=@0x400968) at hello_bug.f90:12
12          a( n ) = 0.0
Link Fortran and C

cclock.c:
#include <sys/types.h>
#include <sys/time.h>

double cclock_()
{
  /* Returns the value of the system clock in seconds */
  struct timeval tmp;
  double sec;
  gettimeofday( &tmp, (struct timezone *)0 );
  sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;
  return sec;
}

matmul_prof.f90:
program mat_mul
  integer, parameter :: n = 100
  real*8 :: a(n,n), b(n,n), c(n,n)
  real*8 :: t1, t2
  real*8 :: cclock
  external cclock
  a = 1.0d0
  b = 1.0d0
  t1 = cclock()
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n )
  t2 = cclock()
  write(*,*) SUM(c), t2-t1
end program

1) gcc -c cclock.c
2) f95 matmul_prof.f90 cclock.o -L. -lblas
Link Fortran and C
Link a C subroutine with a Fortran program

rand.f90:
program rand
  real(kind=8) :: a
  external crand
  call crand( a )
  print *, 'this is random ', a
end program

crand.c:
#include <stdlib.h>
#include <time.h>

void crand_( double * x )
{
  (*x) = ( (double)random() / (double)RAND_MAX );
}

Fortran passes arguments by reference, C passes them by value.
Link Fortran and C
Link a Fortran subroutine with a C program

cvec.c:
#include <stdio.h>

/* Fortran subroutine setv: all arguments are passed by reference */
void setv_( double *a, double *d, int *n );

int main()
{
  int n;
  double a[10], d;
  n = 10;
  d = 1.0;
  setv_( a, &d, &n );
  printf("%lf\n", a[0]);
  return 0;
}

vset.f90:
subroutine setv( a, d, n )
  real(kind=8) :: a( * )
  real(kind=8) :: d
  integer :: n
  integer :: i
  do i = 1, n
    a( i ) = d
  end do
end subroutine
Make command
If a code is large and/or it shares subroutines with other codes, it is useful to split the source into many files, which may be placed in different directories.
In F90 there are dependencies among program units, i.e. modules must be compiled before any program unit that uses them. Therefore there is a well-defined order for compiling the source files.
To avoid compiling the sources by hand in the proper order, the make command can be used.
Make command
The make command can be programmed to do the job for you, using a file containing instructions and directives.
By default the make command looks in the current directory for a file called Makefile or makefile.
A simple makefile
# this is a comment within the makefile
myprog.x : modules.o main.o
	f90 -o myprog.x modules.o main.o
modules.o : modules.f90
	f90 -c modules.f90
main.o : modules.o main.f90
	f90 -c main.f90

The first rule tells the make command that myprog.x depends on modules.o and main.o.
make executes the link command only when modules.o and main.o have been built.
To compile the code, the programmer issues the following command from the console:
> make
A less simple makefile
# this is a comment within the makefile
myprog.x : modules.o main.o
	f90 -o myprog.x modules.o main.o
main.o : modules.o
# declare the suffixes used by the implicit rule below
.SUFFIXES : .f90 .o
.f90.o :
	f90 -c $<

".f90.o" is an implicit rule: it states that all ".o" files depend on, and should be generated from, the corresponding ".f90" files.
"$<" is a make macro, and it is expanded to the proper ".f90" filename.
In the above example, make tries to build myprog.x but realizes that main.o and modules.o must be generated first. It then looks for a rule to make the ".o" files and finds that main.o depends on modules.o, so it builds an internal hierarchy for the compilation in which modules.o comes before main.o. Finally make finds the implicit rule and starts compiling the sources.