Transcript
Page 1: Open shmem

OpenSHMEM

Ehsan Alirezaei

[email protected]


Page 2: Open shmem

OPENSHMEM

PREREQUISITES

Knowledge of C/Fortran

Familiarity with parallel computing

Linux/UNIX command line

Useful for hands-on work:

◦ 64-bit Linux (native, VM, or remote), e.g. Fedora, CentOS, ...

http://www.virtualbox.org/

◦ Download of GASNet, OpenSHMEM library, test-suite & demo programs

http://gasnet.lbl.gov/#download

◦ Installation of GASNet and the OpenSHMEM library

2

Page 3: Open shmem

Introducing OpenSHMEM

Supported by

◦ Oak Ridge National Laboratory

◦ University of Houston

◦ Open Source Software Solutions

◦ Department of Defense

◦ U.S. Department of Energy

◦ Extreme Scale Systems Center

SHared MEMory

SHMEM is a 1-sided communications library

◦ C and Fortran PGAS programming model

◦ Point-to-point and collective routines

◦ Synchronizations

◦ Atomic operations

Can take advantage of hardware offload

◦ Performance benefits

3

Page 4: Open shmem

One-sided communication

4

Page 5: Open shmem

Introduction

Address spaces

◦ Global vs. distributed

◦ OpenMP has global (shared) space

◦ MPI has partitioned space: private data is exchanged via messages

◦ SHMEM is a “partitioned global address space” (PGAS) model

◦ Has private and shared data; shared data is accessible directly by other processes

Used for programs that

◦ perform computations in separate address spaces and

◦ explicitly pass data to and from different processes in the program.

The processes participating in shared memory applications are referred to as processing elements (PEs).

5

Page 6: Open shmem

Introduction

All processors see symmetric variables

◦ Global Address Space

All processors have their own view of symmetric variables

◦ Partitioned Global Address Space

OpenSHMEM supports Single Program Multiple Data (SPMD) style of programming.

The OpenSHMEM Specification is an effort to create a standardized SHMEM library API for C, C++, and Fortran

SGI’s SHMEM API is the baseline for the OpenSHMEM Specification 1.0

The specification is open to the community for reviews and contributions.

6

Page 7: Open shmem

Implementation layers

7

Page 8: Open shmem

OpenSHMEM Features

Use of symmetric variables and point-to-point “put” and “get” operations.

Remote direct memory access enables one-sided operations, leading to performance benefits.

A standard API for different hardware provides network hardware independence and renders support to different network layer technologies.

Along with data transfer operations the library provides synchronization mechanisms (barrier, fence, quiet, wait), collective operations (broadcast, collection, reduction), and atomic memory operations (swap, add, increment).

Open to the community for reviews and contributions.

8

Page 9: Open shmem

OpenSHMEM HISTORY AND IMPLEMENTATIONS

Cray

◦ SHMEM first introduced by Cray Research Inc. in 1993 for the Cray T3D

◦ Platforms: Cray T3D, T3E, PVP, XT series

SGI

◦ Owns the “rights” for SHMEM

◦ Baseline for OpenSHMEM development (Altix)

Quadrics (company out of business)

◦ Optimized API for QsNet

◦ Platform: Linux cluster with QsNet interconnect

Others

◦ HP SHMEM, IBM SHMEM

◦ GPSHMEM (cluster with ARMCI & MPI support, old)

9

Page 10: Open shmem

Symmetric Variables

Arrays or variables that exist with the same size, type, and relative address on all PEs.

◦ The following kinds of data objects are symmetric:

Fortran data objects in common blocks or with the SAVE attribute.

Non-stack C and C++ variables.

Fortran arrays allocated with shpalloc

C and C++ data allocated by shmalloc
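To make the distinction concrete, here is a minimal C sketch of which objects are symmetric. This is an illustration under stated assumptions: an OpenSHMEM 1.0-style installation providing shmem.h, start_pes(), _my_pe(), _num_pes(), shmalloc() and shfree(), compiled with the implementation's wrapper (typically oshcc) and launched with its job launcher.

```c
#include <shmem.h>   /* OpenSHMEM 1.0-era header (assumed installed) */
#include <stdio.h>

int global_counter;        /* symmetric: non-stack (global) C variable */
static double table[100];  /* symmetric: static storage duration       */

int main(void) {
    start_pes(0);

    int local = 42;        /* NOT symmetric: lives on this PE's stack  */

    /* Symmetric: every PE executes the same shmalloc() call, so the
       block exists at the same relative address on all PEs.           */
    long *heap_buf = (long *)shmalloc(100 * sizeof(long));

    table[0] = 0.0;        /* symmetric objects are usable locally too */

    /* Remote PEs may target global_counter, table and heap_buf with
       put/get; they cannot target 'local'.                            */
    printf("PE %d of %d ready (local=%d)\n", _my_pe(), _num_pes(), local);

    shfree(heap_buf);
    return 0;
}
```

Because allocation from the symmetric heap is collective, a mismatched shmalloc sequence across PEs would break the symmetry.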

10

Page 11: Open shmem

Memory Model

11

Page 12: Open shmem

Memory Model

12

Page 13: Open shmem

Memory Model

13

Page 14: Open shmem

Memory Model

14

Page 15: Open shmem

Memory Model

15

Page 16: Open shmem

FREQUENTLY USED OPENSHMEM API (C, C++)

OpenSHMEM Library Call: Functionality

start_pes(): Initializes the OpenSHMEM library

my_pe(): Returns an integer identification of the PE

num_pes(): Returns the number of PEs executing the application

shmem_TYPE_get(*dest, *src, size_t len, int pe): Returns with the same-type value of the symmetric src from the remote PE to the symmetric dest on the local PE

shmem_TYPE_put(*dest, *src, size_t len, int pe): Returns when the same-type value of the symmetric src on the calling PE is on its way to the symmetric dest on the remote PE

shmem_barrier_all(): Returns when all PEs reach the same barrier call and have completed all outstanding communication
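Combined, these calls already make a complete program. The following is a minimal sketch assuming the 1.0-era API (spelled start_pes/_my_pe/_num_pes in that generation; later specifications rename them shmem_init/shmem_my_pe/shmem_n_pes):

```c
#include <shmem.h>
#include <stdio.h>

long src;    /* symmetric source      */
long dest;   /* symmetric destination */

int main(void) {
    start_pes(0);              /* initialize the OpenSHMEM library */
    int me   = _my_pe();       /* this PE's identifier, 0..npes-1  */
    int npes = _num_pes();     /* number of PEs in the job         */

    src = me;

    /* Ring pattern: each PE writes its id into 'dest' on the next PE. */
    shmem_long_put(&dest, &src, 1, (me + 1) % npes);

    /* Returns when every PE has reached the barrier AND all
       outstanding communication (including the put) is complete.      */
    shmem_barrier_all();

    printf("PE %d received %ld from PE %d\n",
           me, dest, (me - 1 + npes) % npes);
    return 0;
}
```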

16

Page 17: Open shmem

Execution Model

1. Initialization

◦ start_pes()

Set up the symmetric heap

Assign PE numbers (identifiers for the PEs)

2. Data transfer

3. Query routines

4. Finalization

17

Page 18: Open shmem

Communication Operations

Data Transfers

a. One-sided puts

b. One-sided gets

Synchronization Mechanisms

a. Fence

b. Quiet

c. Barrier

Collective Communication

a. Broadcast

b. Collection

c. Reduction

Address Manipulation

a. Allocating and deallocating memory blocks in the symmetric space.

Locks

a. Implementation of mutual exclusion

Atomic Memory Operations

a. Swap, Conditional Swap, Add and Increment

Data Cache control
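As an example of the collective operations listed above, the sketch below broadcasts a value and performs a sum reduction. It assumes the 1.0-era collectives API, in which the caller supplies symmetric pSync/pWrk scratch arrays sized by library constants:

```c
#include <shmem.h>
#include <stdio.h>

/* Symmetric scratch space required by the 1.0 collectives API. */
long pSync_b[_SHMEM_BCAST_SYNC_SIZE];
long pSync_r[_SHMEM_REDUCE_SYNC_SIZE];
int  pWrk[_SHMEM_REDUCE_MIN_WRKDATA_SIZE];

int root_data, bcast_buf;   /* symmetric broadcast source/target */
int value, sum;             /* symmetric reduction source/target */

int main(void) {
    start_pes(0);
    int me = _my_pe(), npes = _num_pes();

    for (int i = 0; i < _SHMEM_BCAST_SYNC_SIZE; i++)
        pSync_b[i] = _SHMEM_SYNC_VALUE;
    for (int i = 0; i < _SHMEM_REDUCE_SYNC_SIZE; i++)
        pSync_r[i] = _SHMEM_SYNC_VALUE;
    shmem_barrier_all();   /* pSync must be initialized everywhere first */

    /* Broadcast: PE 0's root_data lands in bcast_buf on PEs 1..npes-1. */
    root_data = 123;
    shmem_broadcast32(&bcast_buf, &root_data, 1, 0, 0, 0, npes, pSync_b);

    /* Reduction: the sum of every PE's 'value' arrives on all PEs. */
    value = me;
    shmem_int_sum_to_all(&sum, &value, 1, 0, 0, npes, pWrk, pSync_r);

    shmem_barrier_all();
    printf("PE %d: sum of all PE ids = %d\n", me, sum);
    return 0;
}
```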

18

Page 19: Open shmem

OpenSHMEM ATOMIC OPERATIONS

What does “atomic” mean anyway?

◦ Indivisible operation on a symmetric variable

◦ No other operation can interpose during the update

◦ But “no other operation” actually means…?

No other atomic operation

Can’t do anything about other mechanisms interfering

E.g. a thread outside of the SHMEM program

A non-atomic SHMEM operation

Why this restriction? Implementation in hardware

19

Page 20: Open shmem

OpenSHMEM ATOMIC OPERATIONS

Atomic Swap

◦ Unconditional

◦ Conditional

Arithmetic

◦ Increment

◦ Fetch-and-increment & fetch-and-add

◦ Return the previous value at the target on the remote PE

Locks

◦ Symmetric variables

◦ Acquired and released to define mutual-exclusion execution regions

◦ Initialize the lock to 0; after that it is managed by the above API

◦ Can be used for updating distributed data structures
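A short sketch of these atomics and locks in use (same 1.0-era API assumptions; the lock variable is a symmetric long initialized to 0, as required):

```c
#include <shmem.h>
#include <stdio.h>

int  counter;      /* symmetric atomic target; updated on PE 0 */
long lock = 0;     /* symmetric lock variable, must start as 0 */

int main(void) {
    start_pes(0);
    int me = _my_pe();

    /* Fetch-and-add: atomically add 1 to counter on PE 0 and get the
       previous value back; no other *atomic* can interpose.          */
    int old = shmem_int_fadd(&counter, 1, 0);

    /* Conditional swap: set counter on PE 0 to 99 only if it currently
       equals the given condition value (illustrative).               */
    shmem_int_cswap(&counter, old + 1, 99, 0);

    /* Mutual-exclusion region: waiting PEs acquire the lock in
       first-come, first-served order.                                */
    shmem_set_lock(&lock);
    printf("PE %d in critical section (fadd saw %d)\n", me, old);
    shmem_clear_lock(&lock);

    shmem_barrier_all();
    return 0;
}
```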

20

Page 21: Open shmem

OpenSHMEM LOOKING FOR OVERLAPS

How to identify overlap opportunities

◦ Put is not an indivisible operation: send local, reuse local, on-wire, stored

Can do useful work on other data in between

◦ General principle: identify independent tasks/data

Initiate action as early as possible (put/barrier/collective)

Interpose independent work

Synchronize as late as possible

Divide the application into distinct communication and computation phases to minimize synchronization points

Use point-to-point synchronization as opposed to collective synchronization
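The "initiate early, interpose work, synchronize late" recipe can be sketched as follows (hypothetical buffers; blocking put semantics mean the source buffer is reusable as soon as the call returns):

```c
#include <shmem.h>

#define N 1024
double halo[N];    /* symmetric target for the neighbor's data */

int main(void) {
    start_pes(0);
    int me = _my_pe(), npes = _num_pes();

    double work[N];                 /* local, independent data */
    for (int i = 0; i < N; i++) work[i] = (double)me;

    /* 1. Initiate communication as early as possible. */
    shmem_double_put(halo, work, N, (me + 1) % npes);

    /* 2. Interpose independent work: the transfer may still be in
          flight, but 'work' is already reusable on this PE.       */
    for (int i = 0; i < N; i++)
        work[i] = work[i] * 2.0 + 1.0;

    /* 3. Synchronize as late as possible; after the barrier each
          PE's 'halo' holds its neighbor's contribution.           */
    shmem_barrier_all();
    return 0;
}
```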

21

Page 22: Open shmem

Benchmarks and Tools

OpenSHMEM NPB: the NAS Parallel Benchmarks ported to use the OpenSHMEM library.

The OpenSHMEM benchmarks’ performance was evaluated against their MPI versions on three distinctly different platforms with different OpenSHMEM/SHMEM library implementations.

On mature library implementations with hardware support for RMA (SGI), the benchmarks using the OpenSHMEM library have comparable or better performance than MPI.

The OpenSHMEM Analyzer (OSA), developed in collaboration with ORNL, allows for better error checking at compile time.

22

Page 23: Open shmem

Performance of Implementing BT and SP benchmarks

23

Page 24: Open shmem

COMPARING MPI-2 AND OpenSHMEM

MPI window semantics

◦ All processes which intend to use the window must participate in window creation

◦ Many or all of the local allocations/objects should be coalesced within a single window creation.

SHMEM semantics

◦ All global and static data are by default accessible to all processes.

◦ Local allocations/objects can be made SHMEM-accessible using shmalloc instead of malloc

24

Page 25: Open shmem

COMPARING MPI-2 AND OpenSHMEM

MPI_Win_fence

◦ Fence is a collective call.

◦ Needs 2 fence calls: one to separate and another to complete.

◦ So it mostly functions like a barrier

shmem_fence

◦ shmem_fence is meant only for ordering of puts.

◦ It does not separate the processes, nor does it imply completion

◦ Ensures there are no pending puts to be delivered to the same target before the next put
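The difference is easiest to see next to shmem_quiet. A hedged sketch (1.0-era API; shmem_fence orders puts to a given target without blocking anyone, shmem_quiet waits for completion, and only shmem_barrier_all is collective like a pair of MPI_Win_fence calls):

```c
#include <shmem.h>

int a, b;   /* symmetric */

int main(void) {
    start_pes(0);
    if (_my_pe() == 0) {
        int x = 1, y = 2;

        shmem_int_put(&a, &x, 1, 1);
        shmem_fence();   /* ordering only: the put of 'a' is delivered
                            to PE 1 before the put of 'b'; no process
                            is separated or blocked                    */
        shmem_int_put(&b, &y, 1, 1);

        shmem_quiet();   /* completion: all outstanding puts from this
                            PE are delivered before quiet returns      */
    }
    shmem_barrier_all(); /* collective, barrier-like synchronization   */
    return 0;
}
```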

25

Page 26: Open shmem

COMPARING MPI-2 AND OpenSHMEM

Point-to-point synchronization (MPI-2)

◦ The sender does Start and waits for Post from the receiver

◦ The receiver does Post and waits for the data.

◦ The sender Puts the data and signals completion to the receiver

Point-to-point synchronization (OpenSHMEM)

◦ The receiver can directly wait for the data using shmem_wait on an event flag.

◦ The sender puts the data and sets the event flag to signal the receiver.

◦ Both post and complete are implicit inside the wait and put operations.
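The OpenSHMEM side of this pattern can be sketched as a put plus an event flag (hypothetical variable names; shmem_int_wait blocks until the flag differs from the given value, and the fence keeps the data ahead of the flag):

```c
#include <shmem.h>
#include <stdio.h>

int data;       /* symmetric payload    */
int flag = 0;   /* symmetric event flag */

int main(void) {
    start_pes(0);

    if (_my_pe() == 0) {              /* sender */
        int payload = 7, one = 1;
        shmem_int_put(&data, &payload, 1, 1);  /* put the data...       */
        shmem_fence();                         /* ...make it land first */
        shmem_int_put(&flag, &one, 1, 1);      /* ...then raise flag    */
    } else if (_my_pe() == 1) {       /* receiver */
        shmem_int_wait(&flag, 0);     /* block until flag != 0          */
        printf("PE 1 received %d\n", data);
    }
    shmem_barrier_all();
    return 0;
}
```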

26

Page 27: Open shmem

COMPARING MPI-2 AND OpenSHMEM

MPI-2 locks

◦ No mutual exclusion

◦ Lock is not a real lock, but “begin RMA”

◦ Unlock means “end RMA”

◦ Only the source calls lock

OpenSHMEM locks

◦ Enforce mutual exclusion

◦ The PE which acquires the lock does the put

◦ A waiting PE gets the lock on a first-come, first-served basis

27

Page 28: Open shmem

OpenSHMEM in use

Uses GASNet, so it works on many devices (including mobile)

It has been tested with standard libraries in controlled situations

No test results available for grid environments

It is comparable with CUDA, MPI and OpenMP

28

Page 29: Open shmem

GASNet

GASNet is a language-independent, low-level networking layer that provides network-independent, high-performance communication primitives tailored for implementing parallel global address space SPMD languages and libraries such as UPC, Co-Array Fortran, SHMEM, Cray Chapel, and Titanium.

The interface is primarily intended as a compilation target and for use by runtime library writers (as opposed to end users), and the primary goals are high performance, interface portability, and expressiveness. GASNet stands for "Global-Address Space Networking".

The design of GASNet is partitioned into two layers to maximize porting ease without sacrificing performance:

◦ The lower-level GASNet core API: its design is based heavily on Active Messages, and it is implemented directly on top of each individual network architecture.

◦ The upper-level GASNet extended API, which provides high-level operations such as remote memory access and various collective operations.

Operating systems: Linux, FreeBSD, NetBSD, Tru64, AIX, IRIX, HPUX, Solaris, MSWindows-Cygwin, MacOSX, Unicos, SuperUX, Catamount, BLRTS, MTX

Architectures: x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, PA-RISC, SPARC, Cray T3E, Cray X-1, Cray XT, Cray XE, Cray XK, Cray XC30, SX-6, IBM BlueGene/L, IBM BlueGene/P, IBM BlueGene/Q, IBM Power 775, SiCortex, PlayStation3

Compilers: GCC, Portland Group C, Pathscale C, Intel C, SunPro C, Compaq C, HP C, MIPSPro C, IBM VisualAge C, Cray C, NEC C, MTA C, LLVM Clang, Open64

[System diagram showing GASNet layers]

29

Page 30: Open shmem

Hardware offload

A host-side network adapter that offloads most of the networking “work”

◦ The server’s main CPU(s) don’t have to do it

◦ OS-bypass techniques

You can have dedicated (read: very fast/optimized) hardware do the heavy lifting

◦ The rest of the server’s resources are free

It’s not just processor cycles that are saved

◦ Caches — both instruction and data — are likely not to be thrashed

◦ Interrupts may be fired less frequently

◦ There may be (slightly) less data transferred across internal buses

30

Page 31: Open shmem

References

Using OpenCL: Programming Massively Parallel Computers, edited by Janusz Kowalik and Tadeusz Puźniakowski

OpenSHMEM Application Programming Interface, Version 1.0 FINAL

OpenSHMEM Tutorial, presenters: Tony Curtis and Swaroop Pophale

OpenSHMEM Performance and Potential: A NPB Experimental Study, Swaroop Pophale

OpenSHMEM BOF, SC2011, TCC-303, T. Curtis and S. Poole

31