Parallel Object Programming in POP-C++: a case study for sparse matrix vector multiplication
Clovis Dongmo Jiogo, Pierre Manneback (Faculté polytechnique de Mons)
Pierre Kuonen (University of Fribourg)
Purpose of this work
Test POP-C++ for some scientific computations on Grids:
- Present the parallel programming model POP-C++
- Evaluate its performance in a Grid environment
- Show how POP-C++ can improve matrix computations
Agenda
- Overview of POP-C++
- Sparse Matrix/Vector product
- Programming in POP-C++
- Experimental results
- Future work
POP: Parallel Object Programming

[Figure: an object-oriented application, made of distributed objects (heterogeneous, dynamic), executes on a Grid environment (heterogeneous, large scale, unstructured, with a dynamic and unknown topology).]
Approach of POP-C++
- Service-oriented approach
- Resource allocation driven by object requirements
- Various invocation semantics
- Object-oriented parallel programming paradigm (parallel objects)
- Object-oriented programming system
POP-C++ Programming Model
- Extension of the C++ language
- Data transmission via shared objects
- Two levels of parallelism:
  - Inter-object parallelism
  - Intra-object parallelism
- Transparent and dynamic object allocation guided by the object's resource needs
- Capacity to glue to Grid toolkits
Invocation semantics: interface side

Two ways to call a method:
- Synchronous: the method returns when the execution is finished (same semantics as a sequential invocation)
- Asynchronous: the method returns immediately, which allows parallelism, but no value can be returned
[Figure: a synchronous call from Object 1 blocks until Object 2 finishes; an asynchronous call returns immediately, so both objects execute in parallel.]
Method call semantics: definition

1. An arriving concurrent call can be executed concurrently (time sharing) as soon as it arrives, unless mutex calls are pending or executing; in that case it is executed after completion of all previously arrived mutex calls.
2. An arriving sequential call is executed after completion of all previously arrived sequential and mutex calls.
3. An arriving mutex call is executed after completion of all previously arrived calls.
Method call semantics: example

All calls are asynchronous.

[Figure: O1 issues a stream of asynchronous calls on O2 (Mseq, Mconc, Mmut, Mconc, Mseq, ...); the Mconc calls execute in parallel, the Mseq calls are serialized among themselves, and the Mmut call is delayed until all previously arrived calls complete while delaying every call that arrives after it.]
POP-C++ Syntax
POP-C++ is an implementation of the parallel object model as an extension of C++ with six new keywords:
- parclass: declares a parallel class
- async: asynchronous method call
- sync: synchronous method call
- conc: concurrent method execution
- seq: sequential method execution
- mutex: mutex method execution
POP-C++ architecture
- A multi-layer architecture
- New middleware can be integrated into the system in a plug-and-play fashion

[Figure: layered architecture. POP-C++ programming sits on top of the POP-C++ essential service abstractions; below are customizable service implementations (POP-C++ services for Globus, for XtremWeb, for testing, and other customizable services), which map onto the computational environments: Globus Toolkit (Grid), XtremWeb (Web computing), standalone POP-C++ (testing distributed environment) and other toolkits (other distributed environments).]
Requirement-driven objects
- Each parallel object has a user-specified object description (OD)
- The OD describes the requirements of the parallel object
- The OD is used as a guideline for resource allocation and object migration
- The OD can be expressed in terms of:
  - Maximum computing power (e.g. MFlops)
  - Communication bandwidth with its interface
  - Memory needed
- The OD can be parameterized on each parallel object (based on the actual input)
Object description example
parclass Matrix {
  Matrix(int n) @{
    od.power(300, 100);
    od.memory(n*n*sizeof(double)/1E6);
    od.protocol("socket http");
  }
  ...
};

The creation of an object of the Matrix parallel class requires:
- A computing power of 300 MFlops, but 100 MFlops are acceptable
- A memory capacity of n*n*sizeof(double)/1E6 MBytes
- Socket or HTTP as communication protocol
Agenda
- Overview of POP-C++
- Sparse Matrix/Vector product
- Programming in POP-C++
- Experimental results
- Future work
Sparse storage format : CRS
The CRS data structure stores a sparse matrix in three vectors. Example (1-based indices):

    11  0 14  0  0
     0 22  0  0  0
     0  0  0  0  0
    14  0  0  0 45
    15  0  0 45  0

Row_ptr[*] = [1; 3; 4; 4; 6; 8]
Col_ind[*] = [1; 3; 2; 1; 5; 1; 4]
Mat_val[*] = [11; 14; 22; 14; 45; 15; 45]

(The empty third row is why Row_ptr repeats the value 4.)
Sparse Matrix/vector partitioning
[Figure: the sparse matrix is split row-wise into blocks R1, R2, R3; each block multiplies the full vector to produce its part of the result.]
Sparse matrix is partitioned according to the resource power
[Figure: a sparse matrix A is partitioned into row blocks A1 to A4 of unequal size; which partition yields the minimal total execution time?]
Distribution model
Find a matrix partitioning which minimizes the total execution time.

Objectives:
- Load balancing:
  - Fast: linear computing time
  - Efficient: ε << 1
Balancing Heuristic
W_i ≈ (1 + ε_i) · (p_i / Σ_k p_k) · W

where W = Σ_i W_i is the total workload (number of nonzeros), p_i the computing power of processor i, and ε_i << 1 the tolerated imbalance.
Agenda
- Overview of POP-C++
- Sparse Matrix/Vector product
- Programming in POP-C++
- Experimental results
- Future work
The parallel class SparseMatrix

parclass SparseMatrix {
public:
  SparseMatrix(int wanted, int min) @{ od.power(wanted, min); };
  seq async void Init([in, size=n+1] double *row_ptr, int n, ...);
  seq async void MvMultiply([in, size=n] double *vector, int n);
  mutex sync int GetResult([out, size=m] double *V, int m);
private:
  double *mat_val, *vect_res;
  int *col_ind, *row_ptr;
  ...
};
The object requirements are defined by the constructor
Minimal extension of C++
POP-C++:

parclass Foo {
  ...
  Foo(...) @{ power = 100; };
  conc async void Mymethod(...);
};

C++:

class Foo {
  ...
  Foo(...);
  void Mymethod(...);
};

Constructor: Foo::Foo(...) { ... }
Method: void Foo::Mymethod(...) { ... }
Shared implementation
Methods are implemented in C++:

...
void SparseMatrix::MvMultiply(double *vector, int n) {
  for (int i = 0; i < n; i++) {
    vect_res[i] = 0.0;
    for (int j = row_ptr[i]; j < row_ptr[i+1]; j++)
      vect_res[i] += mat_val[j] * vector[col_ind[j]];
  }
}
...
Execution steps

[Figure: the data files are read (SetMatVarData, MatDist); the matrix (row_ptr, mat_val, col_ind) and the vector are partitioned into blocks R1 to R4 according to the processor powers in power_ptr (PartitionMatrix); each SparseMatrix object is initialized (Init), the product is computed (MvMultiply, ComputeResult) and the results are collected (GetResult).]
Agenda
- Overview of POP-C++
- Sparse Matrix/Vector product
- Programming in POP-C++
- Experimental results
- Future work
Experimental Platform
PC properties:
- AMD Athlon, 2 GHz
- 256 MB of RAM
- Fast Ethernet

Cluster properties:
- Sun Fire V20 cluster, 10 bi-Opteron nodes
- 1.8 GHz
- 1 GB of RAM
- Gigabit Ethernet
Test matrices
Matrix              Application domain        Size (n)   NZ
(a) fidap           Finite element modeling   16614      1091362
(b) poisson3Db      Finite element modeling   85623      2374949
(c) Stanford-web    Web crawling              281903     2382912
(d) Stanford-w.b.   Web crawling              685230     8006115

Matrices are stored in Matrix Market format: one nonzero per line, as <i> <j> <Aij>.
Experimental results
Total execution time (s) for 1000 iterations:

Matrix   Type       #1      #2      #4     #8     #12
(b)      POP-C++    108.2   62.8    31.4   22.9   22.7
(b)      LAM/MPI    96.5    52.6    39.2   20.7   16.9
(c)      POP-C++    230.3   120.0   63.3   41.4   36.4
(c)      LAM/MPI    215.6   111.6   73.8   43.2   33.6
(d)      POP-C++    267.7   112.4   80.5   49.2   48.4
(d)      LAM/MPI    173.5   101.3   64.5   46.2   46.8
Experimental results
[Figure: total execution time (s) vs. number of processors (#1 to #12) for POP-C++ and LAM/MPI; left: matrix (b), right: matrix (d).]
Agenda
- Overview of POP-C++
- Sparse Matrix/Vector product
- Programming in POP-C++
- Experimental results
- Future work
Future work
- Improve performance by coupling POP-C++ with MPI
- Set up a scheduler for task assignment
- Implement iterative methods in a Grid environment based on the load-balancing heuristic
- Evaluate POP-C++ performance in a Globus environment