Transcript
Page 1: Unified Parallel C

Unified Parallel C

Kathy Yelick, EECS, U.C. Berkeley and NERSC/LBNL

NERSC Team: Dan Bonachea, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Christian Bell

Page 2: Unified Parallel C

Outline

• Global Address Space Languages in General
  – Programming models

• Overview of Unified Parallel C (UPC)
  – Programmability advantage
  – Performance opportunity

• Status
  – Next step

• Related projects

Page 3: Unified Parallel C

Programming Model 1: Shared Memory

• Program is a collection of threads of control.
  – Many languages allow threads to be created dynamically.
• Each thread has a set of private variables, e.g., local variables on the stack.
• Collectively, the threads also have a set of shared variables, e.g., static variables, shared common blocks, the global heap.
  – Threads communicate implicitly by writing/reading shared variables.
  – Threads coordinate using synchronization operations on shared variables.

[Figure: threads P0, P1, ..., Pn each have private variables and read/write a common shared region (e.g., x = ...; y = ..x...).]

Page 4: Unified Parallel C

Programming Model 2: Message Passing

• Program consists of a collection of named processes.
  – Usually fixed at program startup time
  – Thread of control plus local address space -- NO shared data
  – Logically shared data is partitioned over the local processes
• Processes communicate by explicit send/receive pairs
  – Coordination is implicit in every communication event
  – MPI is the most common example

[Figure: processes P0, P1, ..., Pn, each with its own address space holding X and Y; data moves only through matching operations such as "send P0, X" and "recv Pn, Y".]

Page 5: Unified Parallel C

Tradeoffs Between the Models

• Shared memory
  + Programming is easier
    • Can build large shared data structures
  – Machines don't scale
    • SMPs typically < 16 processors (Sun, DEC, Intel, IBM)
    • Distributed shared memory < 128 (SGI)
  – Performance is hard to predict and control

• Message passing
  + Machines are easier to build from commodity parts
  + Can scale (given a sufficient network)
  – Programming is harder
    • Distributed data structures exist only in the programmer's mind
    • Tedious packing/unpacking of irregular data structures

Page 6: Unified Parallel C

Global Address Space Programming

• Intermediate point between message passing and shared memory

• Program consists of a collection of processes.
  – Fixed at program startup time, like MPI

• Local and shared data, as in the shared memory model
  – But shared data is partitioned over the local processes
  – Remote data stays remote on distributed memory machines
  – Processes communicate by reads/writes to shared variables

• Examples are UPC, Titanium, CAF, Split-C

• Note: these are not data-parallel languages
  – Heroic compilers are not required

Page 7: Unified Parallel C

GAS Languages on Clusters of SMPs

• Clusters of SMPs (CLUMPs)
  – IBM SP: 16-way SMP nodes
  – Berkeley Millennium: 2-way and 4-way nodes

• What is an appropriate programming model?
  – Use message passing throughout
    • Most common model
    • Unnecessary packing/unpacking overhead
  – Hybrid models
    • Write 2 parallel programs (MPI + OpenMP or threads)
  – Global address space
    • Only adds a test (on/off node) before each local read/write

Page 8: Unified Parallel C

Support for GAS Languages

• Unified Parallel C (UPC)
  – Funded by the NSA
  – Compaq compiler for Alpha/Quadrics
  – HP, Sun, and Cray compilers under development
  – Gcc-based compiler for SGI (Intrepid)
  – Gcc-based compiler (SRC) for the Cray T3E
  – MTU and Compaq effort for an MPI-based compiler
  – LBNL compiler based on Open64

• Co-Array Fortran (CAF)
  – Cray compiler
  – Rice and UMN effort based on Open64

• SPMD Java (Titanium)
  – UCB compiler available for most machines

Page 9: Unified Parallel C

Parallelism Model in UPC

• UPC uses an SPMD model of parallelism (see the example below)
  – A set of THREADS threads working independently

• Two compilation models
  – THREADS may be fixed at compile time, or
  – Dynamically set at program startup time

• MYTHREAD specifies the thread index (0..THREADS-1)

• Basic synchronization mechanisms
  – Barriers (normal and split-phase), locks

• What UPC does not do automatically:
  – Determine data layout
  – Load balance – move computations
  – Caching – move data

• These are intentionally left to the programmer
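
A minimal SPMD sketch (not from the slides; any standard UPC compiler is assumed) showing THREADS, MYTHREAD, and a barrier:

    /* hello.upc */
    #include <upc.h>
    #include <stdio.h>

    int main(void) {
        /* Every thread runs the same program (SPMD). */
        printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);

        upc_barrier;                 /* all threads synchronize here */

        if (MYTHREAD == 0)
            printf("All %d threads passed the barrier\n", THREADS);
        return 0;
    }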

Page 10: Unified Parallel C

UPC Pointers

• Pointers may point to shared or private variables
• Same syntax for use, just add a qualifier:

shared int *sp;

int *lp;

• sp is a pointer to an integer residing in the shared memory space.

• sp is called a shared pointer (somewhat sloppy); lp is an ordinary private pointer (usage sketch after the figure).

[Figure: the global address space, divided into a shared region (holding x: 3) and per-thread private regions; each thread has its own copies of sp and lp.]
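
A short usage sketch (the shared scalar x and the value written to it are illustrative) of the shared and private pointers above:

    #include <upc.h>

    shared int x;                  /* a shared int, with affinity to thread 0 */

    int main(void) {
        shared int *sp = &x;       /* shared pointer: may reference remote memory */
        int local = 0;
        int *lp = &local;          /* private pointer: valid only on this thread */

        if (MYTHREAD == 0)
            *sp = 3;               /* one thread writes the shared value */
        upc_barrier;
        *lp = *sp;                 /* every thread reads it into private memory */
        return 0;
    }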

Page 11: Unified Parallel C

Shared Arrays in UPC

• Shared array elements are spread across the threads (see the upc_forall sketch below):

shared int x[THREADS];      /* One element per thread */
shared int y[3][THREADS];   /* 3 elements per thread */
shared int z[3*THREADS];    /* 3 elements per thread, cyclic */

• In the pictures below
  – Assume THREADS = 4
  – Elements with affinity to processor 0 are red

[Figure: layouts of x, y (blocked), and z (cyclic) across the four threads; y is, of course, really a 2D array.]
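
A small sketch (using only the cyclic array z declared above) of iterating over a shared array with the standard upc_forall loop, whose affinity expression makes each thread execute only the iterations whose elements it owns:

    #include <upc.h>

    shared int z[3*THREADS];

    int main(void) {
        int i;
        /* The fourth clause (&z[i]) is the affinity expression: iteration i
           runs on the thread with affinity to z[i], so every access is local. */
        upc_forall (i = 0; i < 3*THREADS; i++; &z[i])
            z[i] = MYTHREAD;
        upc_barrier;
        return 0;
    }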

Page 12: Unified Parallel C

Overlapping Communication in UPC

• Programs with fine-grained communication require overlap for performance

• The UPC compiler does this automatically for "relaxed" accesses (see the sketch below).
  – Accesses may be designated as strict, relaxed, or unqualified (the default).
  – There are several ways of designating the ordering type:
    • A type qualifier, strict or relaxed, can be used to affect all variables of that type.
    • Labels strict or relaxed can be used to control the accesses within a statement:

          strict : { x = y; z = y+1; }

    • A strict or relaxed cast can be used to override the current label or type qualifier.
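
A sketch of the type-qualifier form (the variables data and flag are illustrative, and at least two threads are assumed): relaxed accesses may be reordered and overlapped by the compiler, while strict accesses stay ordered, which makes the familiar producer/consumer flag pattern work.

    #include <upc.h>
    #include <upc_relaxed.h>       /* unqualified shared accesses default to relaxed */
    #include <stdio.h>

    relaxed shared int data;       /* relaxed: may be overlapped/reordered */
    strict  shared int flag;       /* strict: ordered with respect to other accesses */

    int main(void) {
        if (MYTHREAD == 0) {
            data = 42;             /* relaxed write */
            flag = 1;              /* strict write: not visible before data is */
        } else if (MYTHREAD == 1) {
            while (flag == 0)      /* strict read: spin until thread 0 signals */
                ;
            printf("data = %d\n", (int)data);   /* guaranteed to see 42 */
        }
        return 0;
    }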

Page 13: Unified Parallel C

Performance of UPC

• Reasons why UPC may be slower than MPI
  – Shared array indexing is expensive
  – Small messages are encouraged by the model

• Reasons why UPC may be faster than MPI
  – MPI encourages synchrony
  – Buffering is required for many MPI calls
  – A remote read/write of a single word may require very little overhead
    • e.g., Cray T3E, Quadrics interconnect (next version)

• Assuming overlapped communication, the real issue is overhead: how much time does it take to issue a remote read/write?

Page 14: Unified Parallel C

UPC vs. MPI: Sparse MatVec Multiply

• Short-term goal:
  – Evaluate the language and compilers using small applications

• Longer term: identify a large application

[Chart: Sparse Matrix-Vector Multiply on the T3E; Mflops (0-250) versus number of processors (1-32) for UPC + Prefetch, MPI (Aztec), UPC Bulk, and UPC Small.]

• Shows the advantage of the T3E network model and of UPC

• Performance on the Compaq machine is worse, due to:
  – Serial code
  – Communication performance
  – New compiler just released

(A code sketch of the fine-grained variant follows.)
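
A rough sketch (not the benchmark code; the vector length and CSR arrays are illustrative) of the fine-grained "UPC Small" style, in which each element of the shared source vector is read individually and may be remote; the "UPC Bulk" variant would instead copy the needed pieces of x into private memory with upc_memget before the loop.

    #include <upc.h>

    #define N (64*THREADS)           /* illustrative vector length */

    shared double x[N];              /* shared source vector, cyclic layout */

    /* Multiply this thread's locally stored rows (CSR format) by the shared
       vector x; each x[col[k]] may be a remote read. */
    void local_spmv(int nrows, const int *rowptr, const int *col,
                    const double *val, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i+1]; k++)
                sum += val[k] * x[col[k]];
            y[i] = sum;
        }
    }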

Page 15: Unified Parallel C

UPC versus MPI for Edge detection

[Charts: (a) execution time in seconds at N=512 versus number of processors (up to 20) for UPC O1+O2 and MPI; (b) speedup at N=512 versus number of processors for UPC O1+O2, MPI, and linear speedup.]

• Performance is from the Cray T3E
• Benchmark developed by El Ghazawi's group at GWU

Page 16: Unified Parallel C

UPC versus MPI for Matrix Multiplication

[Charts: (a) execution time in seconds versus number of processors (up to 20) for UPC O1+O2 and MPI; (b) speedup versus number of processors for UPC O1+O2, MPI, and linear speedup.]

• Performance is from the Cray T3E
• Benchmark developed by El Ghazawi's group at GWU

Page 17: Unified Parallel C

Implementing UPC

• UPC extensions to C are small
  – < 1 person-year to implement in an existing compiler

• Simplest approach
  – Reads and writes through shared pointers become small-message puts/gets
  – UPC has the "relaxed" keyword for nonblocking communication
  – Small-message performance is key

• Advanced optimizations include conversion to bulk communication (see the sketch below), by either
  – The application programmer
  – The compiler
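
A brief illustration (the array, block size, and function are invented for this example) of the programmer-directed conversion from fine-grained accesses to bulk communication, using the standard upc_memget library call:

    #include <upc.h>

    #define B 256
    shared [B] double src[B*THREADS];    /* B elements per thread (blocked layout) */

    /* Fetch thread t's block into a private buffer.
       Fine-grained version: B separate remote reads,
           for (int i = 0; i < B; i++) buf[i] = src[t*B + i];
       Bulk version: a single upc_memget moves the whole block at once. */
    void fetch_block(int t, double *buf) {
        upc_memget(buf, &src[t*B], B * sizeof(double));
    }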

Page 18: Unified Parallel C

Overview of NERSC Compiler

1) Compiler
   – Portable compiler infrastructure (UPC -> C)
   – Explore optimizations: communication, shared pointers
   – Based on Open64; plan to release sources

2) Runtime systems for multiple compilers
   – Allow use by other languages (Titanium and CAF)
   – And by other UPC compilers, e.g., Intrepid
   – Performance of small-message put/get is key
   – Designed to be easily ported, then tuned
   – Also designed for low overhead (macros, inline functions)

Page 19: Unified Parallel C

Compiler and Runtime Status

• Basic parsing and type-checking complete

• Generates code for small serial kernels
  – Still testing and debugging
  – Needs the runtime for complete testing

• UPC runtime layer
  – Initial implementation should be done this month
  – Based on processes (not threads), on top of GASNet

• GASNet
  – Initial specification complete
  – Reference implementation done on MPI
  – Working on Quadrics and IBM (LAPI...)

Page 20: Unified Parallel C

Benchmarks for GAS Languages

• EEL – end-to-end latency, or the time spent sending a short message between two processes

• BW – large-message network bandwidth

• Parameters of the LogP model
  – L – "latency", or time spent on the network
    • During this time, the processor can be doing other work
  – o – "overhead", or processor busy time on the sending or receiving side
    • During this time, the processor cannot be doing other work
    • We distinguish between "send" and "recv" overhead
  – g – "gap", the minimum interval between messages, i.e., the rate at which messages can be pushed onto the network
  – P – the number of processors

Page 21: Unified Parallel C

LogP Parameters: Overhead & Latency

• Non-overlapping overhead: EEL = osend + L + orecv

• Send and recv overhead can overlap: EEL = f(osend, L, orecv) (numerical example below)

[Figure: two timing diagrams between P0 and P1; in the first, osend, L, and orecv occur back to back, while in the second the send and receive overheads overlap the network latency.]
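
As a worked example (with invented numbers, not measurements from the talk): if osend = 2 usec, L = 5 usec, and orecv = 2 usec, the non-overlapping case gives EEL = 2 + 5 + 2 = 9 usec, while with full overlap EEL could approach max(osend, L, orecv) = 5 usec.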

Page 22: Unified Parallel C

Benchmarks

• Designed to measure the network parameters
  – Also provide: gap as a function of queue depth
  – Measured for the "best case" in general

• Implemented once in MPI (see the ping-pong sketch below)
  – For portability and for comparison to the target-specific layer

• Implemented again in each target-specific communication layer:
  – LAPI
  – ELAN
  – GM
  – SHMEM
  – VIPL
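
A minimal sketch (buffer size, iteration count, and timing details are simplified relative to any real benchmark suite) of the MPI version of an EEL measurement: a round-trip ping-pong between ranks 0 and 1, with half the averaged round-trip time reported as the end-to-end latency.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        const int iters = 10000;
        char msg = 0;
        int rank;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* EEL is roughly half the average round-trip time */
            printf("EEL ~ %.2f usec\n", 1e6 * (t1 - t0) / (2.0 * iters));

        MPI_Finalize();
        return 0;
    }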

Page 23: Unified Parallel C

Results: EEL and Overhead

[Chart: send overhead (alone), send & recv overhead, recv overhead (alone), and added latency, in usec (0-25), for T3E/MPI, T3E/Shmem, T3E/E-Reg, IBM/MPI, IBM/LAPI, Quadrics/MPI, Quadrics/Put, Quadrics/Get, M2K/MPI, M2K/GM, Dolphin/MPI, and Giganet/VIPL.]

Page 24: Unified Parallel C

Results: Gap and Overhead

[Chart: gap, send overhead, and receive overhead per platform, in usec (values range from about 0.2 to 17.8; y-axis 0-20).]

Page 25: Unified Parallel C

Send Overhead Over Time

• Overhead has not improved significantly; the T3D was best
  – Lack of integration; lack of attention in software

[Chart: send overhead in usec (0-14) versus approximate year (1990-2002) for machines including NCube/2, CM5, Meiko, Paragon, T3D, T3E, Cenju4, SCI, Dolphin, Myrinet, Myrinet2K, SP3, and Compaq.]

Page 26: Unified Parallel C

Summary

• Global address space languages offer an alternative to MPI for large machines
  – Easier to use: shared data structures
  – Recover users left behind on shared memory?
  – Performance tuning is still possible

• Implementation
  – Small compiler effort, given lightweight communication
  – Portable communication layer: GASNet
  – Difficulty with small-message performance on the IBM SP platform

