Programming with CellSs BSC
Jan 12, 2016
Programming with CellSs
BSC
ScicomP15, Cell tutorial, May 18th 2009
Outline
•StarSs Programming Model
•CellSs runtime
•CellSs syntax
•CellSs compiler
•Programming examples
•Performance analysis using Paraver
•Conclusions
StarSs programming model
Basic idea
...
for (i=0; i<N; i++) {
    T1 (data1, data2);
    T2 (data4, data5);
    T3 (data2, data5, data6);
    T4 (data7, data8);
    T5 (data6, data8, data9);
}
...
[Figure: a sequential application is turned into a task graph (T1_0, T2_0, T3_0, T4_0, T5_0, T1_1, T2_1, T3_1, T4_1, T5_1, T1_2, …) whose tasks are mapped onto Resource 1, Resource 2, Resource 3, …, Resource N]
• Task selection + parameter directions (input, output, inout)
• Task graph creation based on data precedence
• Scheduling, data transfer, task execution
• Synchronization, results transfer
• Parallel resources (multicore, SMP, cluster, grid)
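The task-graph step can be pictured with a minimal C sketch. This is not the CellSs runtime: `Dir`, `Param`, `Task` and `depends_on` are hypothetical names introduced for illustration, and dependences are detected by comparing raw addresses only, which is an assumption about how parameters are identified.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical direction tags mirroring the input/output/inout clauses. */
typedef enum { DIR_IN, DIR_OUT, DIR_INOUT } Dir;

/* One task parameter: the address it refers to and its direction. */
typedef struct { const void *addr; Dir dir; } Param;

#define MAX_PARAMS 4
typedef struct { Param p[MAX_PARAMS]; size_t n; } Task;

static bool writes(Dir d) { return d == DIR_OUT || d == DIR_INOUT; }
static bool reads(Dir d)  { return d == DIR_IN  || d == DIR_INOUT; }

/* True when `later` must wait for `earlier` because of a true
   (read-after-write) dependence: `earlier` writes an address that
   `later` reads. False dependences are deliberately not reported. */
static bool depends_on(const Task *later, const Task *earlier)
{
    for (size_t i = 0; i < earlier->n; i++)
        for (size_t j = 0; j < later->n; j++)
            if (earlier->p[i].addr == later->p[j].addr &&
                writes(earlier->p[i].dir) && reads(later->p[j].dir))
                return true;
    return false;
}
```

Calling `depends_on` for every pair of tasks in program order yields the edges of a task graph; a real runtime would track the last writer of each address instead of testing all pairs.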
CellSs: Syntax example - matrix multiply
int main (int argc, char **argv) {
    int i, j, k;
    …
    initialize(A, B, C);
    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                block_addmultiply (C[i][j], A[i][k], B[k][j]);
}

static void block_addmultiply (float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;
    /* Loop bounds use the block size BS; B is the name of a matrix argument */
    for (i=0; i < BS; i++)
        for (j=0; j < BS; j++)
            for (k=0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}
[Figure: each matrix is NB x NB blocks; each block is BS x BS floats]
CellSs: Syntax example - matrix multiply
int main (int argc, char **argv) {
    int i, j, k;
    …
    initialize(A, B, C);
    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                block_addmultiply( C[i][j], A[i][k], B[k][j]);
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;
    /* Loop bounds use the block size BS; B is the name of a matrix argument */
    for (i=0; i < BS; i++)
        for (j=0; j < BS; j++)
            for (k=0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}
[Figure: each matrix is NB x NB blocks; each block is BS x BS floats]
CellSs: Runtime
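[Figure: structure of the CellSs runtime]
• PPE: user main program linked with the CellSs PPU lib
 • Main thread: data dependence analysis, data renaming
 • Helper thread: scheduling, work assignment, synchronization
• SPE0, SPE1, SPE2, ...: original task code linked with the CellSs SPU lib
 • Per task: DMA in, task execution, DMA out, synchronization
• Memory: user data, task control buffer, renaming table
• Exchanged between PPE, SPEs and memory: tasks, stage in/out data, finalization signal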
CellSs: Runtime - argument renaming
•False dependences (WaW and WaR) are removed with dynamic renaming of arguments
for (i=0; i<N; i++) {
    T1 (…,…, block1);
    T2 (block1, …, block2[i]);
    T3 (block2[i],…,…);
}
• block1 is output from task T1
• block1 is input to task T2
[Figure: task graph T1_1, T2_1, T3_1, T1_2, T2_2, T3_2, …, T1_N, T2_N, T3_N. Every iteration uses the same block1, so WaR and WaW edges link consecutive iterations]
CellSs: Runtime - argument renaming
[Figure: the same task graph after renaming. Iterations use separate instances block1_1, block1_2, …, block1_N, so the WaR and WaW edges no longer serialize the iterations]
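The renaming idea can be sketched in a few lines of C. This is an illustration under stated assumptions, not the CellSs renaming table: `Rename`, `rename_for_output` and `lookup_for_input` are hypothetical names, and the sketch never frees or copies back old versions, which a real runtime would have to do.

```c
#include <stdlib.h>

/* Hypothetical renaming-table entry: maps an address named in the program
   (for example block1) to the storage holding its most recent version. */
typedef struct {
    const void *user_addr;  /* address used in the user's source code */
    void       *current;    /* storage of the latest version */
    int         version;    /* 0 = original storage, 1.. = renamed copies */
} Rename;

/* A task that only writes the object receives fresh storage, so it does not
   have to wait for readers (WaR) or writers (WaW) of the previous version. */
static void *rename_for_output(Rename *r, size_t bytes)
{
    r->current = malloc(bytes);   /* error handling omitted in this sketch */
    r->version++;
    return r->current;
}

/* A task that reads the object is directed to the latest version; this is
   the true (RaW) dependence that remains after renaming. */
static void *lookup_for_input(const Rename *r)
{
    return r->current;
}
```

With this scheme, T1 of iteration 2 writes a new instance while T2 of iteration 1 may still be reading the previous one.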
CellSs: Runtime – scheduling
•Scheduling strategy
• Critical path
• Locality
[Figure: kinds of task bundles]
 • Bundle of dependent tasks: data locality in SPE
 • Bundle of independent tasks
 • Mixed bundle
CellSs: Runtime
•Paraver view of the runtime behavior
[Figure: Paraver timeline with one bundle marked]
 • Main thread: runs user code and adds tasks to and removes tasks from the task graph
 • SPEs: execute the tasks' code
 • Helper thread: schedules tasks and synchronizes with the SPEs
CellSs: Runtime – specific SPE library features
•Data dependence analysis, data renaming and task scheduling are performed in the CellSs PPE runtime library
•The CellSs SPE runtime library implements specific features that assist the CellSs PPE runtime library but operate independently of it
• Early callback
• Minimal stage-out
• Software cache in the SPE Local Store
• Double buffering
CellSs: Runtime – specific SPE library features
•Early callback
 • Initially, task completion is communicated on a per-bundle basis
 • In some cases this limits the application
 • Task A in the example
 • An early callback after the limiting task enables the scheduling of new bundles
 • Condition: the task has more than one outgoing dependency
CellSs: Runtime – specific SPE library features
•Minimal stage-out
 • The outputs of each task in a bundle are written back to main memory
 • If a later task in the same bundle rewrites the same output, the earlier write-back is not needed
 • The case in the figure cannot happen!
 • Thanks to renaming
 • Example: matmul
 • C[i][j] += A[i][k]*B[k][j]
[Figure: three variants of a bundle with tasks X, Y, X, Z. Two tasks write A and one reads A; in the first variant one of the writes is labeled A']
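The minimal stage-out rule can be stated as a short C sketch. The function name `needs_stage_out` and the representation of a bundle as an array of output addresses are assumptions made for illustration; they are not taken from the CellSs SPE library.

```c
#include <stdbool.h>
#include <stddef.h>

/* out[i] is the address written by the i-th task of a bundle, in execution
   order. Returns true when task i must stage its output out to main memory,
   that is, when no later task in the same bundle rewrites that address. */
static bool needs_stage_out(const void *const out[], size_t n, size_t i)
{
    for (size_t later = i + 1; later < n; later++)
        if (out[later] == out[i])
            return false;   /* a later task overwrites it: skip write-back */
    return true;
}
```

For a matmul bundle in which several tasks accumulate into the same C[i][j], only the last of them would report true.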
CellSs: Runtime – specific SPE library features
•Software cache in the SPE Local Store
• Maintained by the SPE runtime
• LRU replacement strategy
• PPE scheduling is not aware of this behavior
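An LRU-managed software cache of this kind can be sketched as follows. The sketch assumes a fixed number of block slots (`SLOTS`) and models only the bookkeeping; the names `LruCache` and `lru_access` are hypothetical, and the DMA transfer that would fill a slot on a miss is not modelled.

```c
#include <stddef.h>

#define SLOTS 4   /* hypothetical number of block slots in the Local Store */

typedef struct {
    const void   *tag[SLOTS];       /* main-memory address held by each slot */
    unsigned long last_use[SLOTS];  /* logical time of the latest access */
    unsigned long clock;            /* incremented on every access */
} LruCache;

/* Returns 1 on a hit and 0 on a miss. On a miss the least recently used
   slot is reassigned to `addr`. A zero-initialized cache is empty; `addr`
   is assumed to be non-NULL. */
static int lru_access(LruCache *c, const void *addr)
{
    size_t victim = 0;
    c->clock++;
    for (size_t s = 0; s < SLOTS; s++) {
        if (c->tag[s] == addr) {
            c->last_use[s] = c->clock;
            return 1;
        }
        if (c->last_use[s] < c->last_use[victim])
            victim = s;
    }
    c->tag[victim] = addr;
    c->last_use[victim] = c->clock;
    return 0;
}
```

Because the PPE scheduler does not see this state, a hit here saves a DMA get without changing any scheduling decision.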
CellSs: Runtime – specific SPE library features
...
#pragma css task input(A, B) inout(C)
block_addmultiply( C[i][j], A[i][k], B[k][j])
[Figure: blocks A[i][k] and B[k][j] feeding block C[i][j]]
• For each operation, two blocks of data are transferred (get) from PPE memory to the SPE local store
• Clusters of dependent tasks are scheduled to the same SPE
• The inout block is kept in the local store and put in PPE memory only once (reuse)
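The saving from keeping the inout block in the local store can be made concrete with a small counting sketch. It assumes a chain of k dependent block_addmultiply tasks (k >= 1) on the same C block, and it assumes the C block is fetched once at the start of the chain; `DmaCount`, `with_reuse` and `without_reuse` are hypothetical names.

```c
/* Number of block-sized DMA transfers for a chain of k dependent tasks. */
typedef struct { unsigned gets, puts; } DmaCount;

/* Reuse: A and B blocks are fetched per task, C is fetched once and
   written back once at the end of the chain. */
static DmaCount with_reuse(unsigned k)
{
    return (DmaCount){ 2 * k + 1, 1 };
}

/* No reuse: every task fetches A, B and C and writes C back. */
static DmaCount without_reuse(unsigned k)
{
    return (DmaCount){ 3 * k, k };
}
```

For a chain of 8 tasks this gives 17 gets and 1 put with reuse, against 24 gets and 8 puts without it.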
CellSs: Runtime - specific SPE library features
•Double buffering
• CellSs overlaps DMA transfers with computations
[Figure: SPE timeline for one bundle (Task 1 in bundle, Task 2 in bundle, ..., Task N in bundle)]
 • DMA programming: reading task control buffer
 • Waiting for DMA transfer
 • DMA programming: reading data
 • Task execution overlapped with data transfers
 • DMA programming: writing data
 • Synchronization with helper thread
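The buffer alternation behind double buffering can be sketched in portable C. This is only a structural illustration: `fetch` uses `memcpy` as a stand-in for an asynchronous DMA get, so nothing actually overlaps here, and `BLOCK`, `fetch`, `compute` and `process_all` are hypothetical names rather than CellSs functions.

```c
#include <string.h>
#include <stddef.h>

#define BLOCK 4   /* hypothetical block length, in floats */

/* Stand-in for an asynchronous DMA get. A real SPE program would start the
   transfer here and wait for its completion later; memcpy finishes at once. */
static void fetch(float *local, const float *remote)
{
    memcpy(local, remote, BLOCK * sizeof(float));
}

/* Stand-in for the task body: sums one block. */
static float compute(const float *local)
{
    float sum = 0.0f;
    for (size_t i = 0; i < BLOCK; i++)
        sum += local[i];
    return sum;
}

/* Processes n contiguous blocks, requesting block k+1 into the idle buffer
   before computing on block k. */
static float process_all(const float *remote, size_t n)
{
    float buf[2][BLOCK];
    float total = 0.0f;
    if (n == 0)
        return total;
    fetch(buf[0], remote);                         /* prologue: first block */
    for (size_t k = 0; k < n; k++) {
        if (k + 1 < n)                             /* request the next block */
            fetch(buf[(k + 1) % 2], remote + (k + 1) * BLOCK);
        total += compute(buf[k % 2]);              /* work on current block */
    }
    return total;
}
```

With real asynchronous transfers, the request for block k+1 would proceed while `compute` runs on block k, which is the overlap shown in the timeline.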
CellSs: Runtime - specific SPE library features
•Double buffering: Paraver view
[Figures: four zooms of a Paraver timeline, annotated in turn with:]
 • SPE reads data; SPE executes task
 • DMA programming; SPE waits for DMA in
 • DMA out programming; DMA in programming; SPE waits for DMA in
 • DMA out programming; SPE waits for DMA out (all)
CellSs: Syntax
• pragmas' syntax:
#pragma css task [input (<input parameters>)] \
[output (<output parameters>)] \
[inout (<input/output parameters>)] \
[highpriority]
void task(<parameters>) { ...
#pragma css wait on(<data address>)
#pragma css barrier
#pragma css start
#pragma css finish
CellSs: Syntax
• Examples: task selection
#pragma css task input(A, B) inout(C)
void block_addmultiply( float C[N][N], float A[N][N], float B[N][N] ) { ...

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
void block_addmultiply( float *C, float *A, float *B ) { ...

#pragma css task input(A[BS][BS], B[BS][BS], BS) inout(C[BS][BS])
void block_addmultiply( float *C, float *A, float *B, int BS ) { ...
• Examples: waiting for data
#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse) { ...
...
are_blocks_equal (X[ii][jj], Y[ii][jj], &sq_error);
/* wait on takes the address of the data, as in the pragma syntax */
#pragma css wait on (&sq_error)
if (sq_error > 0.0000001)
CellSs: Syntax
• Examples: synchronization
for (i=0; i < NB; i++)
    for (j=0; j < NB; j++)
        for (k=0; k < NB; k++)
            block_addmultiply( C[i][j], A[i][k], B[k][j]);
#pragma css barrier
• Examples: prioritization
#pragma css task input(lefthalo[32], tophalo[32], righthalo[32], \
        bottomhalo[32]) inout(A[32][32]) highpriority
void jacobi (float *lefthalo, float *tophalo, float *righthalo,
             float *bottomhalo, float *A) { ... }
CellSs: Syntax
• Examples: CellSs program boundary
#pragma css start
for (i=0; i < NB; i++)
    for (j=0; j < NB; j++)
        for (k=0; k < NB; k++)
            block_addmultiply( C[i][j], A[i][k], B[k][j]);
#pragma css finish
CellSs: Syntax in Fortran
subroutine example()
   ...
   interface
      !$CSS TASK
      subroutine block_add_multiply(C, A, B, BS)
         implicit none
         integer, intent (in) :: BS
         real, intent (in) :: A(BS,BS), B(BS,BS)
         real, intent (inout) :: C(BS,BS)
      end subroutine
   end interface
   ...
   !$CSS START
   ...
   call block_add_multiply(C, A, B, BLOCK_SIZE)
   ...
   !$CSS FINISH
   ...
end subroutine

!$CSS TASK
subroutine block_add_multiply(C, A, B, BS)
   ...
end subroutine
CellSs compiler: Compiler phase
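[Figure: compiler phase, driven by CELLSS-CC]
 • app.c → code translation (mcc) → cellss-spu-cc_app.c, cellss-ppu-cc_app.c, app.tasks (tasks list)
 • cellss-spu-cc_app.c → SPE compiler → cellss-spu-cc_app.o
 • cellss-ppu-cc_app.c → PPE compiler → cellss-ppu-cc_app.o
 • cellss-spu-cc_app.o, cellss-ppu-cc_app.o, app.tasks → pack → app.o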
CellSs compiler: Compiler phase
•Files
• app.c: User code, with CellSs annotations
• cellss-spu-cc_app.c: specific code generated for the spu (tasks code)
• cellss-ppu-cc_app.c: specific code generated for the ppu (main program)
• app.tasks: list of annotated tasks
•Compilation steps
 • mcc: Mercurium compiler (BSC), source-to-source compiler
• SPE compiler: Generic SPE compiler (IBM SDK)
• PPE compiler: Generic PPE compiler (IBM SDK)
• pack: Specific CellSs module that combines objects (BSC)
CellSs compiler: Linker phase
[Figure: linker phase, driven by CELLSS-CC]
 • app.o → unpack → cellss-spu-cc_app.o, cellss-ppu-cc_app.o, app.tasks
 • app.tasks → glue code generator → exec-adapters.c, exec-registration.c
 • exec-adapters.c → SPE compiler → exec-adapters.o
 • exec-adapters.o, cellss-spu-cc_app.o, libCellSS-spu.a → SPE linker → exec-spu
 • exec-spu → SPE embedder → exec-spu.o
 • exec-registration.c → PPE compiler → exec-registration.o
 • exec-registration.o, cellss-ppu-cc_app.o, exec-spu.o, libCellSS.so → PPE linker → exec
CellSs compiler: Linker phase
•Files
• exec-adapters.c: code generated for each of the annotated tasks to uniformly
call them (“stubs”).
• exec-registration.c: code generated to register the annotated tasks
• Linker steps
• unpack: unpacks objects
• glue code generator: from all the *.tasks files of an application generates a
single “adapters” file and a single “registration” file per executable
• SPE, PPE compilers and linkers and SPE embedder (IBM SDK)
CellSs: Programming examples
•Cholesky factorization
•Common matrix operation used to solve the normal equations in linear least squares problems.
•Calculates a triangular matrix (L) from a symmetric positive definite matrix A.
Cholesky(A) = L
L · L^T = A
•Different implementations are possible, depending on how the matrix is traversed (by rows, by columns, left-looking, right-looking)
•It can be decomposed into block operations
CellSs: Programming examples
• In each iteration red and blue blocks are updated
 • SPOTRF: Computes the Cholesky factorization of the diagonal block
 • STRSM: Computes the column panel
 • SSYRK: Computes the row panel
 • SGEMM: Updates the rest of the matrix
[Figure: blocked matrix with the updated blocks colored; label: block_syrk]
CellSs: Programming examples
main (){
...
    for (int j = 0; j < DIM; j++){
        for (int k= 0; k< j; k++){
            for (int i = j+1; i < DIM; i++){
                // A[i,j] = A[i,j] - A[i,k] * (A[j,k])^t
                css_sgemm_tile( A[i][k], A[j][k], A[i][j] );
            }
        }
        for (int i = 0; i < j; i++){
            // A[j,j] = A[j,j] - A[j,i] * (A[j,i])^t
            css_ssyrk_tile(A[j][i],A[j][j]);
        }
        // Cholesky Factorization of A[j,j]
        css_spotrf_tile( A[j][j] );
        for (int i = j+1; i < DIM; i++){
            // A[i,j] <- A[i,j] = X * (A[j,j])^t
            css_strsm_tile( A[j][j], A[i][j] );
        }
    }
...
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
#pragma css wait on (A[i][j])
            print_block(A[i][j]);
        }
    }
...
}

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C)

#pragma css task input (T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B)

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C)

#pragma css task inout(A[64][64])
void spotrf_tile(float *A)
[Figure: Cholesky factorization of a matrix of DIM x DIM blocks, each 64 x 64 floats]
CellSs: Programming examples
•Sparse LU
 • More general factorization than Cholesky
 • Deals with non-symmetric matrices
 • Calculates one lower triangular matrix (L) and one upper triangular matrix (U) whose product equals a row permutation of the original matrix
Perm(A)=L*U
 • Difficult to program for Cell, since some operations work on columns (not blocks)
 • The example shown here is a simplified version (without pivoting) based on an initial sparse matrix
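As an illustration of what the diagonal-block kernel of such a simplified factorization might look like, here is a minimal C sketch of an in-place LU factorization without pivoting. The name `lu0_block`, the row-major layout and the explicit size argument are assumptions; the tutorial does not show the body of its own lu0 task.

```c
#include <stddef.h>

/* In-place LU factorization without pivoting of an n x n block stored
   row-major. Afterwards the strict lower triangle holds L (its unit
   diagonal is implied) and the upper triangle, diagonal included, holds U.
   Assumes every pivot a[k][k] is nonzero. */
static void lu0_block(float *a, size_t n)
{
    for (size_t k = 0; k < n; k++)
        for (size_t i = k + 1; i < n; i++) {
            a[i * n + k] /= a[k * n + k];            /* multiplier l_ik */
            for (size_t j = k + 1; j < n; j++)       /* update row i */
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
}
```

Skipping pivoting keeps every operation local to a block, which is what makes the block-task decomposition straightforward; it also means the routine can fail on matrices that a pivoting LU would handle.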
CellSs: Programming examples
int main(int argc, char **argv) {
    int ii, jj, kk;
    …
    for (kk=0; kk<NB; kk++) {
        lu0(A[kk][kk]);
        for (jj=kk+1; jj<NB; jj++)
            if (A[kk][jj] != NULL)
                fwd(A[kk][kk], A[kk][jj]);
        for (ii=kk+1; ii<NB; ii++)
            if (A[ii][kk] != NULL) {
                bdiv (A[kk][kk], A[ii][kk]);
                for (jj=kk+1; jj<NB; jj++)
                    if (A[kk][jj] != NULL) {
                        if (A[ii][jj]==NULL)
                            A[ii][jj]=allocate_clean_block();
                        bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
                    }
            }
    }
}

void lu0(float *diag);
void bdiv(float *diag, float *row);
void bmod(float *row, float *col, float *inner);
void fwd(float *diag, float *col);
[Figure: Sparse LU on a matrix of NB x NB blocks, each B x B floats; empty blocks are NULL]
CellSs: Programming examples
Sparse LU: dynamic main memory allocation, data dependent parallelism
#pragma css task inout(diag[B][B]) highpriority
void lu0(float *diag);

#pragma css task input(diag[B][B]) inout(row[B][B])
void bdiv(float *diag, float *row);

#pragma css task input(row[B][B],col[B][B]) inout(inner[B][B])
void bmod(float *row, float *col, float *inner);

#pragma css task input(diag[B][B]) inout(col[B][B])
void fwd(float *diag, float *col);
CellSs: Programming examples
Checking LU
int main(int argc, char* argv[]){
    ...
    copy_mat (A, origA);
    LU (A);
    split_mat (A, L, U);
    clean_mat (A);
    sparse_matmult (L, U, A);
    compare_mat (origA, A);
}

#pragma css task input(Src) output(Dst)
void copy_block (float Src[BS][BS], float Dst[BS][BS]);

void copy_mat (float *Src,float *Dst){
    ...
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            ...
            copy_block(Src[ii][jj],block);
    ...
}

#pragma css task input(A) output(L,U)
void split_block (float A[BS][BS], float L[BS][BS], float U[BS][BS]);

void split_mat (float *LU[NB][NB],float *L[NB][NB],float *U[NB][NB]){
    ...
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++){
            ...
            split_block (LU[ii][ii],L[ii][ii],U[ii][ii]);
            ...
        }
}
CellSs: Programming examples
Checking LU: clean_mat
Version that frees the blocks:
void clean_mat (p_block_t Src[NB][NB]){
    int ii, jj;
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            if (Src[ii][jj] != NULL) {
                free (Src[ii][jj]);
                Src[ii][jj]=NULL;
            }
}
Version that clears the blocks with a task:
#pragma css task output(Dst)
void clean_block (float Dst[BS][BS] );

void clean_mat (p_block_t Src[NB][NB]){
    int ii, jj;
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            if (Src[ii][jj] != NULL) {
                clean_block(Src[ii][jj]);
            }
}
CellSs: Programming examples
Checking LU: sparse_matmult
void sparse_matmult (float *A[NB][NB], float *B[NB][NB], float *C[NB][NB]){
    int ii, jj, kk;
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            for (kk=0; kk<NB; kk++)
                if ((A[ii][kk]!= NULL) && (B[kk][jj] !=NULL )) {
                    if (C[ii][jj] == NULL)
                        C[ii][jj] = allocate_clean_block();
                    block_matmul (A[ii][kk], B[kk][jj], C[ii][jj]);
                }
}

#pragma css task input(a,b) inout(c)
void block_matmul(float a[BS][BS], float b[BS][BS], float c[BS][BS]){
    int i, j, k;
    for (i=0; i<BS; i++)
        for (j=0; j<BS; j++)
            for (k=0; k<BS; k++)
                c[i][j] += a[i][k]*b[k][j];
}
CellSs: Programming examples
Checking LU: compare_mat, synchronizing with finish
#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse);

void compare_mat (p_block_t X[NB][NB], p_block_t Y[NB][NB], struct timeval *stop){
    ...
    Zero_block = allocate_clean_block();
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++) {
            if (X[ii][jj] == NULL)
                if (Y[ii][jj] == NULL)
                    sq_error[ii][jj] = 0.0f;
                else
                    are_blocks_equal(Zero_block, Y[ii][jj], &sq_error[ii][jj]);
            else
                are_blocks_equal(X[ii][jj], Y[ii][jj], &sq_error[ii][jj]);
        }
#pragma css finish
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++)
            if (sq_error[ii][jj] > 0.0000001L) {
                printf ("block [%d, %d]: detected mse = % 20lf\n", ii, jj, sq_error[ii][jj]);
                some_difference = TRUE;
            }
    if ( some_difference == FALSE)
        printf ("matrices are identical\n");
}
CellSs: Programming examples
Checking LU: compare_mat, synchronizing with wait on
#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse);

void compare_mat (p_block_t X[NB][NB], p_block_t Y[NB][NB], struct timeval *stop){
    ...
    Zero_block = allocate_clean_block();
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++) {
            if (X[ii][jj] == NULL)
                if (Y[ii][jj] == NULL)
                    sq_error[ii][jj] = 0.0f;
                else
                    are_blocks_equal(Zero_block, Y[ii][jj], &sq_error[ii][jj]);
            else
                are_blocks_equal(X[ii][jj], Y[ii][jj], &sq_error[ii][jj]);
        }
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++)
#pragma css wait on (&sq_error[ii][jj])
            if (sq_error[ii][jj] > 0.0000001L) {
                printf ("block [%d, %d]: detected mse = % 20lf\n", ii, jj, sq_error[ii][jj]);
                some_difference = TRUE;
            }
    if ( some_difference == FALSE)
        printf ("matrices are identical\n");
}
CellSs: Programming examples
[Figure: Behavior, Checking LU. The phases copy_mat (A, origA); LU (A); split_mat (A, L, U); clean_mat (A); sparse_matmult (L, U, A); compare_mat (origA, A) shown without CellSs and with CellSs (for an NB=4 matrix)]
CellSs: Programming examples
[Figure: Behavior, Checking LU. Task legend:]
 0: are_blocks_equal
 1: bdiv_adapte
 2: block_mpy_add
 3: bmod
 4: clean_block
 5: copy_block
 6: fwd
 7: lu0
 8: split_block
CellSs: Programming examples
•Molecular dynamics: Argon simulation
 • Simulates the mobility of argon atoms in the gas state, in a constant volume at T=300K
 • All electrostatic forces exerted on each atom by the others are considered (F_i)
 • Newton's second law is then applied to each atom: F_i = m*a_i
 • The initial velocities are random but reasonable for argon atoms at 300K
 • To keep the temperature constant throughout the process, the Berendsen algorithm is applied
CellSs: Programming examples
Molecular dynamics: Argon simulation
program argon
...
!$CSS START
   do step=1,niter
      do ii=1, N, BSIZE
         do jj=1, N, BSIZE
            call velocity(BSIZE, ii, jj, x(ii), y(ii), &
                          z(ii), x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))
         enddo
      enddo
      do jj=1, N, BSIZE
         call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))
      enddo
!$CSS BARRIER
      tins=0.e0
      do i=1,N
         tins=mkg*v(i)**2/3.e0/kb+tins
      enddo
      tins=tins/N
      lam1=sqrt(t/tins)
      do ii=1, N, BSIZE
         call update_position(BSIZE, lam1, vx(ii), vy(ii), &
                              vz(ii), x(ii), y(ii), z(ii))
      enddo
   enddo
!$CSS FINISH
end

program argon
...
   interface
      !$CSS TASK
      subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, &
                          vx, vy, vz)
         implicit none
         integer, intent(in) :: BSIZE, ii, jj
         real, intent(in), dimension(BSIZE) :: xi, yi, zi, xj, yj, zj
         real, intent(inout), dimension(BSIZE) :: vx, vy, vz
      end subroutine
      !$CSS TASK
      subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
         implicit none
         integer, intent(in) :: BSIZE
         real, intent(in) :: lam1
         real, intent(inout), dimension(BSIZE) :: vx, vy, vz
         real, intent(inout), dimension(BSIZE) :: x, y, z
      end subroutine
      !$CSS TASK
      subroutine v_mod(BSIZE, v, vx, vy, vz)
         implicit none
         integer, intent(in) :: BSIZE
         real, intent(out) :: v(BSIZE)
         real, intent(in), dimension(BSIZE) :: vx, vy, vz
      end subroutine
   end interface
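The temperature control computed between the barrier and the update_position calls can be written compactly. The formula below is read off the listing, with m for mkg, k_B for kb, T for t, T_ins for tins and λ for lam1; reading it as the Berendsen rescaling with the coupling time equal to the time step is an interpretation, not something the slides state.

```latex
T_{\mathrm{ins}} = \frac{1}{N}\sum_{i=1}^{N}\frac{m\,v_i^{2}}{3\,k_B},
\qquad
\lambda = \sqrt{\frac{T}{T_{\mathrm{ins}}}}
```

The sum over all N atoms is why the barrier is needed: every v_mod task must have finished before T_ins can be computed on the main thread.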
CellSs: Programming examples
Molecular dynamics: Argon simulation
!$CSS TASK
subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
   ! subroutine code
end subroutine

!$CSS TASK
subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
   ! subroutine code
end subroutine

!$CSS TASK
subroutine v_mod(BSIZE, v, vx, vy, vz)
   ! subroutine code
end subroutine
CellSs: Performance Analysis with Paraver
•Paraver
• Flexible performance visualization and analysis tool that can be used to analyze:
• MPI, OpenMP, MPI+OpenMP
• Java
• Hardware counters profile
• Operating system activity
• ... and many other things you may think of
• Generally it uses external trace file generators. Example for MPI:
> mpitrace mpirun -n 10 my_mpi-binary
• For CellSs, the libraries have been instrumented.
• When installing the distribution, two libraries are generated: normal and instrumented
• Flag -t links with instrumented version
• Available for free from the BSC website: www.bsc.es/paraver
CellSs: Performance Analysis with Paraver
•Running paraver
paraver tracefile-0001.prv
CellSs: Performance Analysis with Paraver
•Configuration files
Configuration file: Feature shown
 • 2dh inbw.cfg: Histogram of the bandwidth achieved by individual DMA IN transfers.
 • 2dh inbytes.cfg: Histogram of bytes read by the stage in DMA transfers.
 • 2dh outbw.cfg: Histogram of the bandwidth achieved by individual DMA OUT transfers.
 • 2dh outbytes.cfg: Histogram of bytes written by the stage out DMA transfers.
 • 3dh duration phase.cfg: Histogram of duration for each of the runtime phases.
 • 3dh duration tasks.cfg: Histogram of duration of SPU tasks.
 • DMA bw.cfg: DMA (in + out) bandwidth per SPU.
 • DMA bytes.cfg: Bytes being DMAed (in + out) by each SPU.
 • execution phases.cfg: Profile of percentage of time spent by each thread at each of the major phases.
CellSs: Performance Analysis with Paraver
•Configuration files
Configuration file: Feature shown
 • flushing.cfg: Intervals (dark blue) where each SPU is flushing its local trace buffer to main memory.
 • general.cfg: Mix of timelines.
 • stage in out phase.cfg: Identification of DMA in (grey) and out phases (green).
 • task.cfg: Outlined function being executed by each SPU.
 • task distance histogram.cfg: Histogram of task distance between dependent tasks.
 • task number.cfg: Number of the task being executed by each SPU.
 • Task profile.cfg: Time (microseconds) each SPU spent executing the different tasks.
 • task repetitions.cfg: Shows which SPU executed each task and the number of times that the task was executed.
 • Total DMA bw.cfg: Total DMA (in+out) bandwidth to memory.
CellSs: Performance Analysis with Paraver
[Figure: Paraver timeline with the main thread, the helper thread and the SPEs. Annotations: clustering (group of 8 tasks, 23 us); block size: 64x64 floats; DMA in/out; data re-use]
CellSs: Performance Analysis with Paraver
•Demo
• matmul application
• first view, explain what is seen
• show cfgs, explain that they are in the distribution and where
• $(CELLSS_HOME)/share/cellss/paraver_cfgs/
• matmul
• show execution phases, tasks (and task type), task number
• show flushing
• size of DMA in
• another cholesky
• execution phases, tasks, task number, order of tasks that copy data from
memory
CellSs: Performance Analysis with Paraver
Another Cholesky
CellSs performance evolution & scalability
[Figure: Matmul performance. x axis: #SPUs (0 to 9); y axis: GFlops (0 to 180); series: March 2007, July 2007, Nov 2007, December 2008, May 2009]
[Figure: Cholesky performance evolution. x axis: matrix size (0 to 4500); y axis: GFlops (0 to 160); series: Apr 2007, Jul 2007, Jul 2007, Sep 2008, May 2009]
[Figure: Cholesky scalability. x axis: # SPUs (0 to 9); y axis: speedup (0 to 9); series: matrix sizes 1024, 2048, 4096]
[Figure: Cholesky performance. x axis: # SPUs (0 to 9); y axis: GFlops (0 to 160); series: matrix sizes 1024, 2048, 4096]
CellSs: issues and ongoing efforts
• CellSs programming model
• Array regions, subobject accesses
• Blocks larger than Local Store
• Access to global memory by tasks
• CellSs runtime system
• Further optimization of overheads (insert task and remove task)
• By-passing (SPE to SPE transfers)
• Scheduling algorithms: overhead, locality
• Lazy renaming
• Other members of the family: SMPSs, GPUSs, hierarchical (SMPSs + CellSs)
• Convergence with OpenMP 3.0
Conclusions
•The road for new chips with multi and many cores is open
•New programming models that can deal with the complexity of the
hardware are now more needed than ever
•StarSs
• Simple
• Portable
• Enough performance
• Enabled for different architectures: CellSs, SMPSs, GPUSs
CellSs and SMPSs websites
•CellSs
• www.bsc.es/cellsuperscalar
•SMPSs
• www.bsc.es/smpsuperscalar
•Both available for download (open source, GPL and LGPL)