237 // Program Design 10.3 Assessing parallel programs
11 Parallel programming models
Many different models for expressing parallelism in programming languages
• Actor model
– Erlang
– Scala
• Coordination languages
– Linda
• CSP-based (Communicating Sequential Processes)
– FortranM
– Occam
• Dataflow
– SISAL (Streams and Iteration in a Single Assignment Language)
• Distributed
– Sequoia
– Bloom
• Event-driven and hardware description
– Verilog hardware description language (HDL)
• Functional
– Concurrent Haskell
• GPU languages
– CUDA
– OpenMP
– OpenACC
– OpenHMPP (HMPP for Hybrid Multicore Parallel Programming)
• Logic programming
– Parlog
• Multi-threaded
– Clojure
• Object-oriented
– Charm++
– Smalltalk
• Message-passing
– MPI
– PVM
• Partitioned global address space (PGAS)
– High Performance Fortran (HPF)
11.1 HPF
Partitioned global address space parallel programming model
Fortran90 extension
• SPMD (Single Program Multiple Data) model
• each process operates with its own part of data
• HPF commands specify which processor gets which part of the data
• Concurrency is defined by HPF commands based on Fortran90
IMPLICIT NONE
INTEGER, PARAMETER :: N = 100
INTEGER, DIMENSION (N,N) :: A, B, C
INTEGER :: i, j
!HPF$ PROCESSORS square(2,2)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
!HPF$ ALIGN A(i,*) WITH C(i,*)
! replicate copies of row A(i,*) onto proc.s which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
! replicate copies of col. B(*,j) onto proc.s which compute C(i,j)
A = 1
B = 2
C = 0
DO i = 1, N
  DO j = 1, N
    ! All the work is local due to ALIGNs
    C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
  END DO
END DO
WRITE(*,*) C
END
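To make the effect of DISTRIBUTE (BLOCK,BLOCK) more concrete, the following C sketch computes which processor of a processor grid would own a given element C(i,j). It assumes the usual BLOCK rule (contiguous blocks of ceil(N/P) indices per dimension) and a column-by-column numbering of the grid; the function names block_size, block_owner and grid_owner are hypothetical and not part of HPF.

```c
#include <assert.h>

/* Block length used by a BLOCK distribution of n indices over p processors:
 * the ceiling of n / p (assumed here to follow the usual HPF rule). */
int block_size(int n, int p)
{
    assert(n > 0 && p > 0);
    return (n + p - 1) / p;
}

/* 0-based coordinate (along one grid dimension) of the processor that owns
 * the 1-based index i. */
int block_owner(int i, int n, int p)
{
    assert(i >= 1 && i <= n);
    return (i - 1) / block_size(n, p);
}

/* Rank of the owner of C(i,j) on a prow x pcol processor grid. Numbering the
 * grid column by column is an illustrative choice, not something HPF fixes. */
int grid_owner(int i, int j, int n, int prow, int pcol)
{
    return block_owner(i, n, prow) + prow * block_owner(j, n, pcol);
}
```

For N = 100 on the 2 x 2 grid, indices 1..50 map to coordinate 0 and 51..100 to coordinate 1 in each dimension, so every processor holds one 50 x 50 block of C.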
HPF programming methodology
• Need to find a balance between concurrency and communication
• The more processes, the more communication
• Aim for
– a balanced load under the owner-computes rule (the processor that owns a data element performs the computations that update it)
– data locality
Writing a program in HPF is easy, but achieving good efficiency is difficult.
Programming in HPF proceeds more or less like this:
1. Write a correctly working serial program; test and debug it
2. Add distribution directives that introduce as little communication as possible
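The owner-computes idea behind this methodology can be traced without an HPF compiler. The C sketch below is only an illustration under stated assumptions: the rows of C are split into contiguous blocks, each simulated process updates only the rows it owns, and the number of rows it handles serves as a crude measure of its load. The names first_row and compute_owned_rows and the size N = 8 are hypothetical.

```c
#include <assert.h>

#define N 8   /* matrix order (small, for illustration) */

/* First (0-based) row owned by process `rank` when N rows are split into
 * contiguous blocks of ceil(N / nprocs) rows. */
int first_row(int rank, int nprocs)
{
    int bs = (N + nprocs - 1) / nprocs;
    int lo = rank * bs;
    return lo < N ? lo : N;
}

/* Work done by one process under the owner-computes rule: it updates only
 * the rows of c that it owns. Returns the number of rows it computed. */
int compute_owned_rows(int rank, int nprocs,
                       int a[N][N], int b[N][N], int c[N][N])
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    int lo = first_row(rank, nprocs);
    int hi = (rank + 1 < nprocs) ? first_row(rank + 1, nprocs) : N;
    for (int i = lo; i < hi; ++i)
        for (int j = 0; j < N; ++j) {
            int sum = 0;
            for (int k = 0; k < N; ++k)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    return hi - lo;
}
```

Calling compute_owned_rows once per rank fills all of C. With 4 processes every rank gets 2 rows, while with 3 processes the split is 3, 3 and 2 rows, a small example of the load imbalance that the distribution directives have to keep in check.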
11.2 OpenMP
OpenMP Example: Matrix Multiplication
! http://www.llnl.gov/computing/tutorials/openMP/exercise.html
C******************************************************************************
C OpenMp Example - Matrix Multiply - Fortran Version
C FILE: omp_mm.f
C DESCRIPTION:
C   Demonstrates a matrix multiply using OpenMP. Threads share row iterations
C   according to a predefined chunk size.
C LAST REVISED: 1/5/04 Blaise Barney
C******************************************************************************

      PROGRAM MATMULT

      INTEGER NRA, NCA, NCB, TID, NTHREADS, I, J, K, CHUNK,
     +        OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
C     number of rows in matrix A
      PARAMETER (NRA=62)
C     number of columns in matrix A
      PARAMETER (NCA=15)
C     number of columns in matrix B
      PARAMETER (NCB=7)
      REAL*8 A(NRA,NCA), B(NCA,NCB), C(NRA,NCB)

C     Set loop iteration chunk size
      CHUNK = 10

C     Spawn a parallel region explicitly scoping all variables
!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(TID,I,J,K)
      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Starting matrix multiply example with', NTHREADS,
     +           'threads'
        PRINT *, 'Initializing matrices'
      END IF

C     Initialize matrices
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 30 I=1, NRA
        DO 30 J=1, NCA
          A(I,J) = (I-1)+(J-1)
  30  CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 40 I=1, NCA
        DO 40 J=1, NCB
          B(I,J) = (I-1)*(J-1)
  40  CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 50 I=1, NRA
        DO 50 J=1, NCB
          C(I,J) = 0
  50  CONTINUE

C     Do matrix multiply sharing iterations on outer loop
C     Display who does which iterations for demonstration purposes
      PRINT *, 'Thread', TID, 'starting matrix multiply...'
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 60 I=1, NRA
        PRINT *, 'Thread', TID, 'did row', I
        DO 60 J=1, NCB
          DO 60 K=1, NCA
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
  60  CONTINUE

C     End of parallel region
!$OMP END PARALLEL

C     Print results
      PRINT *, '******************************************************'
      PRINT *, 'Result Matrix:'
      DO 90 I=1, NRA
C     ...
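For comparison, the same row-sharing scheme can be sketched in C. This is a minimal sketch, not a translation of the full listing: it keeps the matrix sizes and initial values of the Fortran example (shifted to 0-based indices) and uses a single combined parallel for directive; the function names omp_matmul and init_matrices are hypothetical. A compiler without OpenMP enabled ignores the pragma and produces the same result serially.

```c
#include <assert.h>

#define NRA 62   /* number of rows in matrix A */
#define NCA 15   /* number of columns in matrix A */
#define NCB 7    /* number of columns in matrix B */

/* Multiply a (NRA x NCA) by b (NCA x NCB) into c. Rows of c are handed out
 * to threads in blocks of `chunk` iterations; j, k and sum are declared
 * inside the loop body and are therefore private to each thread. */
void omp_matmul(double a[NRA][NCA], double b[NCA][NCB],
                double c[NRA][NCB], int chunk)
{
    assert(chunk > 0);
    #pragma omp parallel for schedule(static, chunk)
    for (int i = 0; i < NRA; ++i)
        for (int j = 0; j < NCB; ++j) {
            double sum = 0.0;
            for (int k = 0; k < NCA; ++k)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* Same initial values as the Fortran listing, with 0-based indices. */
void init_matrices(double a[NRA][NCA], double b[NCA][NCB])
{
    for (int i = 0; i < NRA; ++i)
        for (int j = 0; j < NCA; ++j)
            a[i][j] = i + j;
    for (int i = 0; i < NCA; ++i)
        for (int j = 0; j < NCB; ++j)
            b[i][j] = i * j;
}
```

With CHUNK = 10 and a static schedule, rows are dealt out to the threads in blocks of ten in round-robin order, which is the behaviour the "did row" messages of the Fortran program make visible.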
11.3 GPGPU
CUDA Fortran Matrix Multiplication example (for more information, see http://www.pgroup.com/lit/articles/insider/v1n3a2.htm)
Host (CPU) code:
subroutine mmul( A, B, C )
  use cudafor
  real, dimension(:,:) :: A, B, C
  integer :: N, M, L
  real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
  type(dim3) :: dimGrid, dimBlock
  N = size(A,1) ; M = size(A,2) ; L = size(B,2)
  allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )
  ! copy the operands to device memory
  Adev = A(1:N,1:M)
  Bdev = B(1:M,1:L)
  ! one 16x16 thread block per 16x16 tile of C
  ! (N, M and L are taken to be multiples of 16)
  dimGrid = dim3( N/16, L/16, 1 )
  dimBlock = dim3( 16, 16, 1 )
  call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )
  ! copy the N x L result back to the host
  C(1:N,1:L) = Cdev
  deallocate( Adev, Bdev, Cdev )
end subroutine mmul
GPU code:
attributes(global) subroutine MMUL_KERNEL( A, B, C, N, M, L )
  real, device :: A(N,M), B(M,L), C(N,L)
  integer, value :: N, M, L
  integer :: i, j, kb, k, tx, ty
  real, shared :: Ab(16,16), Bb(16,16)
  real :: Cij
  tx = threadidx%x ; ty = threadidx%y
  i = (blockidx%x-1) * 16 + tx
  j = (blockidx%y-1) * 16 + ty
  Cij = 0.0
  do kb = 1, M, 16
    ! Fetch one element each into Ab and Bb; note that 16x16 = 256
    ! threads in this thread-block are fetching separate elements
    ! of Ab and Bb
    Ab(tx,ty) = A(i,kb+ty-1)
    Bb(tx,ty) = B(kb+tx-1,j)
    ! Wait until all elements of Ab and Bb are filled
    call syncthreads()
    do k = 1, 16
      Cij = Cij + Ab(tx,k) * Bb(k,ty)
    enddo
    ! Wait until all threads in the thread-block finish with
    ! this iteration's Ab and Bb
    call syncthreads()
  enddo
  C(i,j) = Cij
end subroutine
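The tiling scheme of the kernel can be traced on the CPU. The following C sketch is an illustration only, assuming row-major arrays (the Fortran code is column-major) and dimensions that are multiples of 16: the bi/bj loops stand in for the grid of thread blocks, the ti/tj loops for the 256 threads of one block, and the local arrays ab and bb for the shared-memory tiles. The function name tiled_matmul is hypothetical.

```c
#include <assert.h>

#define TILE 16   /* tile edge, matching the 16 x 16 thread block */

/* Tiled product c = a * b for row-major a (n x m), b (m x l), c (n x l).
 * n, m and l are assumed to be multiples of TILE, as in the kernel. */
void tiled_matmul(const float *a, const float *b, float *c,
                  int n, int m, int l)
{
    assert(n % TILE == 0 && m % TILE == 0 && l % TILE == 0);
    for (int bi = 0; bi < n; bi += TILE)
        for (int bj = 0; bj < l; bj += TILE) {
            /* One accumulator per "thread" of the block (Cij in the kernel). */
            float acc[TILE][TILE] = {{0.0f}};
            for (int kb = 0; kb < m; kb += TILE) {
                float ab[TILE][TILE], bb[TILE][TILE];
                /* Fetch phase: each thread loads one element of each tile. */
                for (int ti = 0; ti < TILE; ++ti)
                    for (int tj = 0; tj < TILE; ++tj) {
                        ab[ti][tj] = a[(bi + ti) * m + (kb + tj)];
                        bb[ti][tj] = b[(kb + ti) * l + (bj + tj)];
                    }
                /* The kernel synchronizes here; then each thread adds its
                 * partial dot product over the current pair of tiles. */
                for (int ti = 0; ti < TILE; ++ti)
                    for (int tj = 0; tj < TILE; ++tj)
                        for (int k = 0; k < TILE; ++k)
                            acc[ti][tj] += ab[ti][k] * bb[k][tj];
            }
            /* Each thread writes its single element of c. */
            for (int ti = 0; ti < TILE; ++ti)
                for (int tj = 0; tj < TILE; ++tj)
                    c[(bi + ti) * l + (bj + tj)] = acc[ti][tj];
        }
}
```

On the GPU the point of the tiles is that each element of A and B is read from device memory once per block and then reused 16 times from fast shared memory; the serial sketch reproduces only the access pattern, not the speed-up.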
11.3.2 OpenCL
Open Computing Language (OpenCL):
• for heterogeneous systems consisting of
– CPUs (central processing units)
– GPUs (graphics processing units)
– DSPs (digital signal processors)
– FPGAs (field-programmable gate arrays)
– and other processors
• language (based on C99)
– kernels (functions that execute on OpenCL devices)
– plus application programming interfaces (APIs)
– (FortranCL: an OpenCL interface for Fortran)
• parallel computing using
– task-based and data-based parallelism
– GPGPU
• open standard by Khronos Group (Apple, AMD, IBM, Intel and Nvidia)
– also adopted by Altera, Samsung, Vivante and ARM Holdings
Example: Matrix Multiplication
/* kernel.cl
 * Matrix multiplication: C = A * B. (Device code)
 * (http://gpgpu-computing4.blogspot.co.uk/2009/09/matrix-multiplication-2-opencl.html)
 */
// OpenCL Kernel
__kernel void
matrixMul(__global float* C,
          __global float* A,
          __global float* B,
          int wA, int wB)
{
    // 2D Thread ID
    // Old CUDA code
    // int tx = blockIdx.x * TILE_SIZE + threadIdx.x;
    // int ty = blockIdx.y * TILE_SIZE + threadIdx.y;
    int tx = get_global_id(0);
    int ty = get_global_id(1);
    // value stores the element that is
    // computed by the thread
    float value = 0;
    for (int k = 0; k < wA; ++k)
    {
        float elementA = A[ty * wA + k];
        float elementB = B[k * wB + tx];
        value += elementA * elementB;
    }
    // Write the matrix to device memory; each
    // thread writes one element.
    // C has wB columns, so its row stride is wB
    // (wA gives the same result only for square matrices).
    C[ty * wB + tx] = value;
}
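How the kernel is spread over work-items can be emulated in plain C. In the sketch below, which is an illustration and not OpenCL host code, two nested loops take the place of the two-dimensional global range, and tx and ty stand in for get_global_id(0) and get_global_id(1). The matrices are row-major, and since C has wB columns the sketch indexes it with row stride wB. The function names matrix_mul_item and matrix_mul_range are hypothetical.

```c
#include <assert.h>

/* Body of the kernel for one work-item. a is hA x wA, b is wA x wB and
 * c is hA x wB, all row-major. */
void matrix_mul_item(float *c, const float *a, const float *b,
                     int wA, int wB, int tx, int ty)
{
    float value = 0.0f;
    for (int k = 0; k < wA; ++k)
        value += a[ty * wA + k] * b[k * wB + tx];
    c[ty * wB + tx] = value;   /* row stride of c is wB */
}

/* Serial stand-in for running the kernel over a wB x hA global range:
 * one call per work-item, i.e. one call per element of c. */
void matrix_mul_range(float *c, const float *a, const float *b,
                      int hA, int wA, int wB)
{
    assert(hA > 0 && wA > 0 && wB > 0);
    for (int ty = 0; ty < hA; ++ty)
        for (int tx = 0; tx < wB; ++tx)
            matrix_mul_item(c, a, b, wA, wB, tx, ty);
}
```

Because every work-item writes a distinct element of C and only reads A and B, the calls are independent of each other, which is what allows an OpenCL device to run them in any order or all at once.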
11.3.3 OpenACC
• standard (Cray, CAPS, Nvidia and PGI) to simplify parallel programming of heterogeneous CPU/GPU systems
• uses annotations (compiler directives, like OpenMP) in C, C++ and Fortran source code
• annotated code is started on both CPU and GPU automatically
• OpenACC to merge into OpenMP in a future release of OpenMP
! A simple OpenACC kernel for Matrix Multiplication
!$acc kernels
do k = 1, n1
  do i = 1, n3
    c(i,k) = 0.0
    do j = 1, n2
      c(i,k) = c(i,k) + a(i,j) * b(j,k)
    enddo
  enddo
enddo
!$acc end kernels
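The same annotation style applies to C. The sketch below is a hedged C counterpart of the loop nest above, assuming row-major arrays passed as pointers; the explicit copyin/copyout data clauses and the function name acc_matmul are additions for illustration. A compiler without OpenACC support ignores the pragma and runs the loops serially.

```c
#include <assert.h>

/* c (n3 x n1) = a (n3 x n2) * b (n2 x n1), all row-major. The data clauses
 * give the array sections as [start:length], since the compiler cannot
 * deduce the extent of a bare pointer. */
void acc_matmul(int n1, int n2, int n3,
                const float *restrict a, const float *restrict b,
                float *restrict c)
{
    assert(n1 > 0 && n2 > 0 && n3 > 0);
    #pragma acc kernels copyin(a[0:n3*n2], b[0:n2*n1]) copyout(c[0:n3*n1])
    for (int k = 0; k < n1; ++k)
        for (int i = 0; i < n3; ++i) {
            float sum = 0.0f;   /* scalar accumulator, private per (i,k) */
            for (int j = 0; j < n2; ++j)
                sum += a[i * n2 + j] * b[j * n1 + k];
            c[i * n1 + k] = sum;
        }
}
```

Accumulating into a local scalar rather than into c directly is a common choice here, because it makes the innermost loop a plain sum that the compiler can recognize as a reduction.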
! http://www.bu.edu/tech/research/computation/linux-cluster/katana-cluster/gpu-computing/openacc-fortran/matrix-multiply-fortran/
program matrix_multiply
  use omp_lib
  use openacc
  implicit none
  integer :: i, j, k, myid, m, n, compiled_for, option
  integer, parameter :: fd = 11
  integer :: t1, t2, dt, count_rate, count_max
  real, allocatable, dimension(:,:) :: a, b, c
  real :: tmp, secs

  open(fd, file='wallclocktime', form='formatted')
  option = compiled_for(fd) ! 1-serial, 2-OpenMP, 3-OpenACC, 4-both

!$omp parallel
!$ myid = OMP_GET_THREAD_NUM()
!$ if (myid .eq. 0) then
!$   write(fd,"('Number of procs is ',i4)") OMP_GET_NUM_THREADS()
!$ endif
!$omp end parallel

  call system_clock(count_max=count_max, count_rate=count_rate)

  do m=1,4 ! compute for different size matrix multiplies
    call system_clock(t1)
    n = 1000*2**(m-1) ! 1000, 2000, 4000, 8000
    allocate( a(n,n), b(n,n), c(n,n) )

    ! Initialize matrices
    do j=1,n
      do i=1,n
        a(i,j) = real(i + j)
        b(i,j) = real(i - j)
      enddo
    enddo

    ! tmp is a per-iteration scratch value, so it is made private;
    ! a variable may not appear in both SHARED and REDUCTION
!$omp parallel do shared(a,b,c,n) private(i,k,tmp)
!$acc data copyin(a,b) copy(c)
!$acc kernels
    ! Compute matrix multiplication.
    do j=1,n
      do i=1,n
        tmp = 0.0 ! enables ACC parallelism for k-loop
        do k=1,n
          tmp = tmp + a(i,k) * b(k,j)
        enddo
        c(i,j) = tmp
      enddo
    enddo
!$acc end kernels
!$acc end data
!$omp end parallel do

    call system_clock(t2)
    dt = t2-t1
    secs = real(dt)/real(count_rate)
    write(fd,"('For n=',i4,', wall clock time is ',f12.2,' seconds')") &
      n, secs
    deallocate(a, b, c)
  enddo
  close(fd)
end program matrix_multiply

integer function compiled_for(fd)
  implicit none
  integer :: fd
#if defined _OPENMP && defined _OPENACC
  compiled_for = 4
  write(fd,"('This code is compiled with OpenMP & OpenACC')")
#elif defined _OPENACC
  compiled_for = 3
  write(fd,"('This code is compiled with OpenACC')")
#elif defined _OPENMP
  compiled_for = 2
  write(fd,"('This code is compiled with OpenMP')")
#else
  compiled_for = 1
  write(fd,"('This code is compiled for serial operations')")
#endif
end function compiled_for
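The compile-time detection used by compiled_for carries over directly to C, because the _OPENMP and _OPENACC macros are defined by compilers when the respective support is switched on. The sketch below mirrors the return codes of the Fortran function; the helper mode_name is a hypothetical addition for printing.

```c
#include <assert.h>

/* Returns 1 (serial), 2 (OpenMP), 3 (OpenACC) or 4 (both), depending on
 * which of the _OPENMP and _OPENACC macros the compiler defined. */
int compiled_for(void)
{
#if defined(_OPENMP) && defined(_OPENACC)
    return 4;
#elif defined(_OPENACC)
    return 3;
#elif defined(_OPENMP)
    return 2;
#else
    return 1;
#endif
}

/* Human-readable label for a mode returned by compiled_for(). */
const char *mode_name(int mode)
{
    static const char *names[] = { "serial", "OpenMP", "OpenACC",
                                   "OpenMP & OpenACC" };
    assert(mode >= 1 && mode <= 4);
    return names[mode - 1];
}
```

Which branch is taken depends only on the compiler flags, so the same source file can be built in all four variants and report which one is running, as the Fortran program does in its timing file.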