Page 1

COSC 6374

Parallel Computation

Remote Direct Memory Access

Edgar Gabriel

Fall 2015

Communication Models

[Figure: three communication models between processes P0 and P1 with buffers A and B: the Message Passing Model (explicit send and receive), the Remote Memory Access model (one-sided put of A into B), and the Shared Memory Model (direct assignment A = B).]

Page 2

Data Movement

[Figure: data movement between two nodes, each consisting of CPU, memory, and NIC.]

Message Passing Model:

• Two-sided communication

Remote Memory Access:

• One-sided communication

Remote Direct Memory Access

• Direct Memory Access (DMA) allows data to be sent directly from an attached device to the memory on the computer's motherboard.

• The CPU is freed from involvement with the data transfer, thus speeding up overall computer operation.

• Remote Direct Memory Access (RDMA): two or more computers communicate directly from the main memory of one system to the main memory of another.

Page 3

One-sided communication in MPI

• MPI-2 defines one-sided communication:

– A process can put data into the main memory of another process (MPI_Put)
– A process can get data from the main memory of another process (MPI_Get)
– A process can perform an operation on a data item in the main memory of another process (MPI_Accumulate)

• The target process is not actively involved in the communication

RDMA in MPI

• Problems:

– How can a process define which part of its main memory is available for RDMA?
– How can a process define when this part of the main memory is available for RDMA?
– How can a process define who is allowed to access its memory?
– How can a process define which elements in a remote memory it wants to access?

Page 4

The window concept of MPI-2 (I)

• An MPI_Win defines the group of processes allowed to access a certain memory area

• Arguments:

– base: starting address of the public memory region
– size: size of the public memory region in bytes
– disp_unit: displacement unit in bytes; target displacements in RMA calls are multiplied by this value
– info: hint to the MPI library on how the window will be used (e.g. only reading or only writing)
– comm: communicator defining the group of processes allowed to access the memory window

MPI_Win_create (void *base, MPI_Aint size, int disp_unit,
                MPI_Info info, MPI_Comm comm, MPI_Win *win);
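As a minimal illustration (not from the original slides; the buffer name and size are assumptions), a process could expose a local array of 100 doubles to all ranks of MPI_COMM_WORLD like this:

/* Sketch with assumed names: expose a local array to all ranks. */
double  buf[100];
MPI_Win win;

MPI_Win_create (buf,                    /* base of the exposed region              */
                100 * sizeof(double),   /* size of the region in bytes             */
                sizeof(double),         /* disp_unit: displacements count doubles  */
                MPI_INFO_NULL,          /* no usage hints                          */
                MPI_COMM_WORLD,         /* every rank may access the window        */
                &win);

/* ... access/exposure epochs and RMA calls ... */

MPI_Win_free (&win);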

The window concept of MPI-2 (II)

• Definition of a temporal window:

– Access Epoch: time slot in which a process accesses the remote memory of another process
– Exposure Epoch: time slot in which a process allows access to its memory window by other processes

• Does a process have control over when other processes access its memory window?

– yes: active target communication
– no: passive target communication

Page 5

Active Target Communication (I)

• Synchronization of all operations within a window

– collective across all processes of win
– no difference between access and exposure epoch
– starts or closes an access and exposure epoch

• Arguments:

– assert: hint to the library on the usage (default: 0)

MPI_Win_fence (int assert, MPI_Win win);

Data exchange (I)

• A single process controls the data parameters of both processes

• MPI_Put puts the data described by (oaddr, ocount, otype) into the main memory of the process with rank rank, at the position (base + disp*disp_unit, tcount, ttype) within the window win

– base and disp_unit have been defined in MPI_Win_create
– the values of base and disp_unit are not known by the process calling MPI_Put!

MPI_Put (void *oaddr, int ocount, MPI_Datatype otype,
         int rank, MPI_Aint disp, int tcount,
         MPI_Datatype ttype, MPI_Win win);
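A minimal active-target sketch (not from the original slides) may make the argument mapping concrete; it assumes the window from the earlier MPI_Win_create sketch with disp_unit = sizeof(double), and the variable rank holding the calling process's rank:

/* Sketch with assumed setup: rank 0 writes one double into element 5
 * of rank 1's exposed buffer during a fence epoch. */
double val = 3.14;

MPI_Win_fence (0, win);                 /* open access/exposure epoch              */
if (rank == 0) {
    MPI_Put (&val, 1, MPI_DOUBLE,       /* origin: (oaddr, ocount, otype)          */
             1,                         /* target rank                             */
             5,                         /* disp, scaled by the target's disp_unit  */
             1, MPI_DOUBLE,             /* target: (tcount, ttype)                 */
             win);
}
MPI_Win_fence (0, win);                 /* close epoch; the transfer is complete   */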

Page 6

Example: Ghost-cell update

Parallel matrix-vector multiply for band matrices:

[Figure: the band matrix and the vectors x and rhs are distributed block-wise across Process 0 and Process 1. Because of the band structure, Process 0 needs x3, which is held by Process 1, and Process 1 needs x2, which is held by Process 0.]

Example: Ghost-cell update (II)

• Ghost cells: (read-only) copy of elements held by another process

• Ghost-cells for 2-D matrices: additional row of data

[Figure: ghost cells for the 1-D vector distribution of x across Process 0 and Process 1, and a 2-D matrix distributed across Processes 0, 1, and 2, each process holding a block of size nxlocal by ny plus an additional ghost row.]

Page 7

Example: Ghost-cell update (III)

• Data structure: u[i][j] is stored in a matrix

• nxlocal: number of local data points in x direction
  ny: number of data points in y direction

• Extent of variable u: u[nxlocal+2][ny], with u[1:nxlocal][0:ny-1] containing the local data (the two extra rows hold the ghost cells)

Example: Ghost-cell update (IV)

MPI_Win_create (u, (nxlocal+2)*ny*sizeof(double),
                1,                  /* disp_unit = 1: displacements in bytes */
                MPI_INFO_NULL,
                MPI_COMM_WORLD,     /* communicator (assumed): all processes */
                &win);

MPI_Win_fence (0, win);

MPI_Put (&u[1][0], ny, MPI_DOUBLE, rank-1,
         (nxlocal+1)*ny*sizeof(double), ny, MPI_DOUBLE, win);

MPI_Put (&u[nxlocal][0], ny, MPI_DOUBLE, rank+1,
         0, ny, MPI_DOUBLE, win);

MPI_Win_fence (0, win);

MPI_Win_free (&win);

Page 8

Comments on the example

• Modifications to the data items may only become visible after closing the corresponding epochs

– There is no guarantee whether the data item is actually transferred during MPI_Put or during MPI_Win_fence

• If multiple processes modify the very same memory address at the very same process, no guarantee is given on which data item will be visible

– It is the responsibility of the user to get this right
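For concurrent updates to the same location, MPI_Accumulate (mentioned earlier) is the safer tool: concurrent accumulates that use the same predefined operation are permitted, unlike concurrent MPI_Put calls to the same address. A minimal sketch (not from the original slides; the window setup and the variable partial_sum are assumptions):

/* Sketch with assumed setup: every rank adds its contribution into
 * element 0 of rank 0's window; with MPI_SUM the result is the sum of
 * all contributions, in contrast to overlapping MPI_Put calls. */
double contrib = partial_sum;

MPI_Win_fence (0, win);
MPI_Accumulate (&contrib, 1, MPI_DOUBLE,
                0,                      /* target rank                  */
                0,                      /* displacement of the sum cell */
                1, MPI_DOUBLE,
                MPI_SUM, win);
MPI_Win_fence (0, win);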

Passive Target Communication

• MPI_Win_lock starts an access epoch to access the main memory of the process with rank rank

• All RDMA operations between a lock/unlock pair appear atomic

• lock_type: MPI_LOCK_EXCLUSIVE or MPI_LOCK_SHARED

• Updates to the local memory exposed through the MPI window should also happen using MPI_Win_lock/MPI_Put

– Otherwise the access order between the local update and the RDMA access is undefined (race condition)

MPI_Win_lock (int lock_type, int rank, int assert, MPI_Win win);
MPI_Win_unlock (int rank, MPI_Win win);
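A minimal sketch of such a local update (not from the original slides; the window win, the displacement, and the variable names are assumptions):

/* Sketch with assumed names: the owner locks its own window before
 * updating exposed memory, so the local update is ordered with respect
 * to concurrent RDMA accesses by other ranks. */
double   newval = 42.0;
MPI_Aint disp   = 3;        /* element to update, in units of disp_unit */

MPI_Win_lock (MPI_LOCK_EXCLUSIVE, myrank, 0, win);
MPI_Put (&newval, 1, MPI_DOUBLE,
         myrank,            /* target rank: the local process itself */
         disp, 1, MPI_DOUBLE, win);
MPI_Win_unlock (myrank, win);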

Page 9

Example: Ghost-cell update (V)

MPI_Win_create (u, (nxlocal+2)*ny*sizeof(double),
                1,                  /* disp_unit = 1: displacements in bytes */
                MPI_INFO_NULL,
                MPI_COMM_WORLD,     /* communicator (assumed): all processes */
                &win);

MPI_Win_lock (MPI_LOCK_EXCLUSIVE, rank-1, 0, win);
MPI_Put (&u[1][0], ny, MPI_DOUBLE, rank-1,
         (nxlocal+1)*ny*sizeof(double), ny, MPI_DOUBLE, win);
MPI_Win_unlock (rank-1, win);

MPI_Win_lock (MPI_LOCK_EXCLUSIVE, rank+1, 0, win);
MPI_Put (&u[nxlocal][0], ny, MPI_DOUBLE, rank+1,
         0, ny, MPI_DOUBLE, win);
MPI_Win_unlock (rank+1, win);

One-sided vs. Two-sided communication

• One-sided communication doesn't need

– message matching
– unexpected message queues

• Only one processor is actively involved

→ potentially faster!

• One-sided communication in MPI can potentially optimize

– multiple transactions
– between multiple processes

Page 10

Limitations of the MPI-2 model

• Synchronization costs (e.g. MPI_Win_fence) can be significant

• Static model

– The size of a memory window cannot be altered after creating an MPI_Win
– Difficult to support dynamic data structures such as a linked list

• The passive target model has limited usability

– But that is what most other RDMA libraries focus on

• In MPI-3:

– Introduction of dynamic windows
– Extended functionality for passive target operations

Use case: distributed linked list

• A linked list maintained across multiple processes

– e.g. after a global sort operation of all elements
– e.g. having fixed rules for the keys:
  rank 0: keys which start with 'a' to 'd'
  rank 1: keys which start with 'e' to 'h', ...

[Figure: list elements distributed across Rank 0, Rank 1, and Rank 2, with links crossing process boundaries.]

Page 11

Use case: Distributed linked list

typedef struct {
    char      key[MAX_KEY_SIZE];
    char      value[MAX_VALUE_SIZE];
    MPI_Aint  next_disp;     // displacement of the next element in its window
    int       next_rank;     // rank that holds the next element
    void     *next_local;    // next local element
} ListElem;

// Create an MPI data type describing this structure using
// MPI_Type_create_struct. Not shown here for brevity.

The pair (next_rank, next_disp) is the equivalent of the next pointer in a non-distributed linked list.
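One possible way to build that datatype (a sketch, not from the original slides; it registers only the fields that are transferred and assumes the ListElem layout above):

/* Sketch: build an MPI datatype matching ListElem so a whole element
 * can be fetched with a single MPI_Get. next_local is a purely local
 * pointer and is covered only by the resized extent. */
MPI_Datatype ListElem_type;
int          blocklens[4] = { MAX_KEY_SIZE, MAX_VALUE_SIZE, 1, 1 };
MPI_Datatype types[4]     = { MPI_CHAR, MPI_CHAR, MPI_AINT, MPI_INT };
MPI_Aint     displs[4], base;
ListElem     dummy;

MPI_Get_address (&dummy,           &base);
MPI_Get_address (&dummy.key,       &displs[0]);
MPI_Get_address (&dummy.value,     &displs[1]);
MPI_Get_address (&dummy.next_disp, &displs[2]);
MPI_Get_address (&dummy.next_rank, &displs[3]);
for (int i = 0; i < 4; i++)
    displs[i] -= base;

MPI_Type_create_struct (4, blocklens, displs, types, &ListElem_type);

/* Resize so the extent matches sizeof(ListElem), then commit. */
MPI_Datatype tmp = ListElem_type;
MPI_Type_create_resized (tmp, 0, sizeof(ListElem), &ListElem_type);
MPI_Type_free (&tmp);
MPI_Type_commit (&ListElem_type);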

Traversing a distributed linked list

ListElem local_copy, *current;
ListElem *head;              // assumed to be already set
int found = 0;               // the search key is in the variable 'key'

current = head;
MPI_Win_lock_all (0, win);                        /* shared lock on all ranks of win */
while (!found) {
    if (current->next_rank != myrank) {
        MPI_Get (&local_copy, 1, ListElem_type,
                 current->next_rank, current->next_disp,
                 1, ListElem_type, win);
        MPI_Win_flush (current->next_rank, win);  /* complete the MPI_Get */
        current = &local_copy;
    } else {
        current = current->next_local;
    }
    if (strcmp (current->key, key) == 0)
        break;
}
MPI_Win_unlock_all (win);

• MPI_Win_lock_all gets a shared (read-only) lock on all processes that are part of win.

• MPI_Win_flush enforces the completion of all pending operations to a process without having to release the lock(s).

Page 12

Inserting elements into a linked list

• Assuming that only the local process is allowed to insert an element (e.g. after a global sort operation)

– Remote processes are only allowed to read elements on other processes

• Requires dynamically allocating memory and extending a memory region

• A dynamic window defines only the participating group of processes

– More than one memory region can be attached to a single window

MPI_Win_create_dynamic (MPI_Info info, MPI_Comm comm, MPI_Win *win);

MPI_Win_attach (MPI_Win win, void *base, MPI_Aint size);

Inserting elements into a linked list (II)

// create the window instance once
MPI_Win_create_dynamic (MPI_INFO_NULL, comm, &win);

// insert each element into the memory window
t = (ListElem *) malloc (sizeof(ListElem));
strncpy (t->key,   key,   MAX_KEY_SIZE);
strncpy (t->value, value, MAX_VALUE_SIZE);

current = find_prev_element (head, key, value);
t2 = current->next_local;
current->next_local = t;
t->next_local = t2;

MPI_Win_attach (win, t, sizeof(ListElem));

// add another element
t = (ListElem *) malloc (sizeof(ListElem));
MPI_Win_attach (win, t, sizeof(ListElem));

MPI_Barrier (comm);

Similarly, next_rank and next_disp on current and t need to be updated.
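A possible way to publish that link information for the newly attached element (a sketch, not from the original slides; my_disp and myrank are placeholder names):

/* Sketch: for a dynamic window the displacement to publish is the
 * address of the attached memory, obtained with MPI_Get_address. */
MPI_Aint my_disp;
MPI_Get_address (t, &my_disp);
current->next_disp = my_disp;    /* remote readers pass this to MPI_Get */
current->next_rank = myrank;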