Parallel Object Programming in POP-C++: a case study for sparse matrix vector multiplication
Clovis Dongmo Jiogo, Pierre Manneback (Faculté polytechnique de Mons)
Pierre Kuonen (University of Fribourg)
Purpose of this work
Test POP-C++ for some scientific computations on Grids:
- Present the parallel programming model POP-C++
- Evaluate its performance in a Grid environment
- Show how POP-C++ can improve matrix computations
Agenda
- Overview of POP-C++
- Sparse Matrix/Vector product
- Programming in POP-C++
- Experimental results
- Future work
POP: Parallel Object Programming

[Figure: an object-oriented application, made of distributed objects (heterogeneous, dynamic), executes on a Grid environment (heterogeneous, large scale, unstructured, with a dynamic and unknown topology).]
Approach of POP-C++
- Service-oriented approach
- Resource allocation driven by object requirements
- Various invocation semantics
- Object-oriented parallel programming paradigm (parallel objects)
- Object-oriented programming system
POP-C++ Programming Model
- Extension of the C++ language
- Data transmission via shared objects
- Two levels of parallelism:
  - Inter-object parallelism
  - Intra-object parallelism
- Transparent and dynamic object allocation guided by the object's resource needs
- Capacity to glue to Grid toolkits
Invocation semantics: interface side

Two ways to call a method:
- Synchronous: the method returns when the execution is finished (same semantics as a sequential invocation)
- Asynchronous: the method returns immediately, which allows parallelism, but no value can be returned
[Figure: a synchronous call from Object 1 blocks until Object 2 finishes; an asynchronous call returns immediately, so both objects execute in parallel.]
Method call semantics: definition

1. An arriving concurrent call can be executed concurrently (time sharing) as soon as it arrives, unless mutex calls are pending or executing; in that case it is executed after completion of all previously arrived mutex calls.
2. An arriving sequential call is executed after completion of all previously arrived sequential and mutex calls.
3. An arriving mutex call is executed after completion of all previously arrived calls.
Method call semantics: example

All calls are asynchronous.

[Figure: O1 issues a stream of asynchronous calls on O2 (Mseq, Mconc, Mmut, Mconc, Mseq, ...); the Mconc calls execute in parallel, the Mseq calls are serialized among themselves, and the Mmut call is delayed until all previously arrived calls complete while delaying every call that arrives after it.]
POP-C++ Syntax
POP-C++ is an implementation of the parallel object model as an extension of C++ with six new keywords:
- parclass: declares a parallel class
- async: asynchronous method call
- sync: synchronous method call
- conc: concurrent method execution
- seq: sequential method execution
- mutex: mutex method execution
POP-C++ architecture
- A multi-layer architecture
- New middleware can be integrated into the system in a plug-and-play fashion

[Figure: layered architecture. POP-C++ programming sits on top of the POP-C++ essential service abstractions; below are customizable service implementations (POP-C++ services for Globus, for XtremWeb, for testing, and other customizable services), which map onto the computational environments: Globus Toolkit (Grid), XtremWeb (Web computing), standalone POP-C++ (testing distributed environment) and other toolkits (other distributed environments).]
Requirement-driven objects
- Each parallel object has a user-specified object description (OD)
- The OD describes the requirements of the parallel object
- The OD is used as a guideline for resource allocation and object migration
- The OD can be expressed in terms of:
  - Maximum computing power (e.g. MFlops)
  - Communication bandwidth with its interface
  - Memory needed
- The OD can be parameterized on each parallel object (based on the actual input)
Object description example
parclass Matrix {
  Matrix(int n) @{
    od.power(300, 100);
    od.memory(n*n*sizeof(double)/1E6);
    od.protocol("socket http");
  }
  ...
};

The creation of an object of the Matrix parallel class requires:
- A computing power of 300 MFlops, but 100 MFlops are acceptable
- A memory capacity of n*n*sizeof(double)/1E6 MBytes
- Socket or HTTP as communication protocol
Agenda
- Overview of POP-C++
- Sparse Matrix/Vector product
- Programming in POP-C++
- Experimental results
- Future work
Sparse storage format : CRS
The CRS data structure stores a sparse matrix in three vectors. Example (1-based indices):

    11  0 14  0  0
     0 22  0  0  0
     0  0  0  0  0
    14  0  0  0 45
    15  0  0 45  0

Row_ptr[*] = [1; 3; 4; 4; 6; 8]
Col_ind[*] = [1; 3; 2; 1; 5; 1; 4]
Mat_val[*] = [11; 14; 22; 14; 45; 15; 45]

(The empty third row is why Row_ptr repeats the value 4.)
Sparse Matrix/vector partitioning
[Figure: the sparse matrix is split row-wise into blocks R1, R2, R3; each block multiplies the full vector to produce its part of the result.]
Sparse matrix is partitioned according to the resource power
[Figure: a sparse matrix A is partitioned into row blocks A1 to A4 of unequal size; which partition yields the minimal total execution time?]
Distribution model
Find a matrix partitioning which minimizes the total execution time.

Objectives:
- Load balancing:
  - Fast: linear computing time
  - Efficient: ε << 1
Balancing Heuristic
W_i ≈ (1 + ε_i) · (p_i / Σ_k p_k) · W

where W = Σ_i W_i is the total workload (number of nonzeros), p_i the computing power of processor i, and ε_i << 1 the tolerated imbalance.
Agenda
- Overview of POP-C++
- Sparse Matrix/Vector product
- Programming in POP-C++
- Experimental results
- Future work
The parallel class SparseMatrix

parclass SparseMatrix {
public:
  SparseMatrix(int wanted, int min) @{ od.power(wanted, min); };
  seq async void Init([in, size=n+1] double *row_ptr, int n, ...);
  seq async void MvMultiply([in, size=n] double *vector, int n);
  mutex sync int GetResult([out, size=m] double *V, int m);
private:
  double *mat_val, *vect_res;
  int *col_ind, *row_ptr;
  ...
};
The object requirements are defined by the constructor
Minimal extension of C++
POP-C++:

parclass Foo {
  ...
  Foo(...) @{ power = 100; };
  conc async void Mymethod(...);
};

C++:

class Foo {
  ...
  Foo(...);
  void Mymethod(...);
};

Constructor: Foo::Foo(...) { ... }
Method: void Foo::Mymethod(...) { ... }
Shared implementation
Methods are implemented in C++:

...
void SparseMatrix::MvMultiply(double *vector, int n) {
  for (int i = 0; i < n; i++) {
    vect_res[i] = 0.0;
    for (int j = row_ptr[i]; j < row_ptr[i+1]; j++)
      vect_res[i] += mat_val[j] * vector[col_ind[j]];
  }
}
...
Execution steps

[Figure: the data files are read (SetMatVarData, MatDist); the matrix (row_ptr, mat_val, col_ind) and the vector are partitioned into blocks R1 to R4 according to the processor powers in power_ptr (PartitionMatrix); each SparseMatrix object is initialized (Init), the product is computed (MvMultiply, ComputeResult) and the results are collected (GetResult).]
Agenda
- Overview of POP-C++
- Sparse Matrix/Vector product
- Programming in POP-C++
- Experimental results
- Future work
Experimental Platform
PC properties:
- AMD Athlon, 2 GHz
- 256 MB of RAM
- Fast Ethernet

Cluster properties:
- Sun Fire V20 cluster, 10 bi-Opteron nodes
- 1.8 GHz
- 1 GB of RAM
- Gigabit Ethernet
Test matrices
Matrix              Application domain        Size (n)   NZ
(a) fidap           Finite element modeling   16614      1091362
(b) poisson3Db      Finite element modeling   85623      2374949
(c) Stanford-web    Web crawling              281903     2382912
(d) Stanford-w.b.   Web crawling              685230     8006115

Matrices are stored in Matrix Market format: one nonzero per line, as <i> <j> <Aij>.
Experimental results
Total execution time (s) for 1000 iterations:

Matrix   Type       #1      #2      #4     #8     #12
(b)      POP-C++    108.2   62.8    31.4   22.9   22.7
(b)      LAM/MPI    96.5    52.6    39.2   20.7   16.9
(c)      POP-C++    230.3   120.0   63.3   41.4   36.4
(c)      LAM/MPI    215.6   111.6   73.8   43.2   33.6
(d)      POP-C++    267.7   112.4   80.5   49.2   48.4
(d)      LAM/MPI    173.5   101.3   64.5   46.2   46.8
Experimental results
[Figure: total execution time (s) vs. number of processors (#1 to #12) for POP-C++ and LAM/MPI; left: matrix (b), right: matrix (d).]
Agenda
- Overview of POP-C++
- Sparse Matrix/Vector product
- Programming in POP-C++
- Experimental results
- Future work
Future work
- Improve performance by coupling POP-C++ with MPI
- Set up a scheduler for task assignment
- Implement iterative methods in a Grid environment based on the load-balancing heuristic
- Evaluate POP-C++ performance in a Globus environment