Unified Parallel C
Kathy Yelick, EECS, U.C. Berkeley and NERSC/LBNL
NERSC Team: Dan Bonachea, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Christian Bell
Outline
• Global Address Space Languages in General
  – Programming models
• Overview of Unified Parallel C (UPC)
  – Programmability advantage
  – Performance opportunity
• Status
  – Next step
• Related projects
Programming Model 1: Shared Memory
• Program is a collection of threads of control.
  – Many languages allow threads to be created dynamically.
• Each thread has a set of private variables, e.g., local variables on the stack.
• Each thread also has access to a set of shared variables, e.g., static variables, shared common blocks, the global heap.
  – Threads communicate implicitly by writing/reading shared variables.
  – Threads coordinate using synchronization operations on shared variables.
[Figure: threads P0 … Pn, each with private variables, reading and writing a common shared address space]
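For concreteness, here is a minimal POSIX-threads sketch of this model (an added illustration, not from the original slides): counter is a shared variable visible to all threads, while each thread's argument and loop index are private.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    int counter = 0;                      /* shared: visible to every thread */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *work(void *arg) {
        int id = *(int *)arg;             /* private: lives on this thread's stack */
        pthread_mutex_lock(&lock);        /* coordinate via synchronization on shared data */
        counter += id;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int id[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            id[i] = i;
            pthread_create(&t[i], NULL, work, &id[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %d\n", counter);  /* reads the shared result: 0+1+2+3 = 6 */
        return 0;
    }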
Programming Model 2: Message Passing
• Program consists of a collection of named processes.
  – Usually fixed at program startup time
  – Thread of control plus local address space: NO shared data
  – Logically shared data is partitioned over the local processes
• Processes communicate by explicit send/receive pairs
  – Coordination is implicit in every communication event
  – MPI is the most common example
[Figure: processes P0 … Pn, each with a private address space; data X is sent by one process and received into Y by another]
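A minimal MPI sketch of an explicit send/receive pair (an added illustration with made-up values; assumes at least two processes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs, x;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (rank == 0) {
            x = 42;                       /* data lives only in P0's address space */
            MPI_Send(&x, 1, MPI_INT, nprocs - 1, 0, MPI_COMM_WORLD);
        } else if (rank == nprocs - 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("P%d received %d\n", rank, x);   /* matching receive completes the transfer */
        }
        MPI_Finalize();
        return 0;
    }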
Tradeoffs Between the Models
• Shared memory
  + Programming is easier
    • Can build large shared data structures
  – Machines don't scale
    • SMPs typically < 16 processors (Sun, DEC, Intel, IBM)
    • Distributed shared memory < 128 (SGI)
  – Performance is hard to predict and control
• Message passing
  + Machines are easier to build from commodity parts
  + Can scale (given a sufficient network)
  – Programming is harder
    • Distributed data structures exist only in the programmer's mind
    • Tedious packing/unpacking of irregular data structures
Global Address Space Programming
• An intermediate point between message passing and shared memory
• Program consists of a collection of processes.
  – Fixed at program startup time, like MPI
• Local and shared data, as in the shared memory model
  – But shared data is partitioned over the local processes
  – Remote data stays remote on distributed memory machines
  – Processes communicate by reads/writes to shared variables
• Examples: UPC, Titanium, CAF, Split-C
• Note: these are not data-parallel languages; heroic compilers are not required
GAS Languages on Clusters of SMPs
• Clusters of SMPs (CLUMPs)
  – IBM SP: 16-way SMP nodes
  – Berkeley Millennium: 2-way and 4-way nodes
• What is an appropriate programming model?
  – Use message passing throughout
    • Most common model
    • Unnecessary packing/unpacking overhead
  – Hybrid models
    • Write 2 parallel programs (MPI + OpenMP or threads)
  – Global address space
    • Only adds a test (on/off node) before each local read/write
Support for GAS Languages
• Unified Parallel C (UPC)
  – Funded by the NSA
  – Compaq compiler for Alpha/Quadrics
  – HP, Sun, and Cray compilers under development
  – Gcc-based compiler for SGI (Intrepid)
  – Gcc-based compiler (SRC) for Cray T3E
  – MTU and Compaq effort for an MPI-based compiler
  – LBNL compiler based on Open64
• Co-Array Fortran (CAF)
  – Cray compiler
  – Rice and UMN effort based on Open64
• SPMD Java (Titanium)
  – UCB compiler available for most machines
Parallelism Model in UPC
• UPC uses an SPMD model of parallelism
  – A set of THREADS threads working independently
• Two compilation models
  – THREADS may be fixed at compile time, or
  – dynamically set at program startup time
• MYTHREAD specifies the thread index (0..THREADS-1)
• Basic synchronization mechanisms
  – Barriers (normal and split-phase), locks
• What UPC does not do automatically:
  – Determine data layout
  – Load balance (move computations)
  – Caching (move data)
• These are intentionally left to the programmer
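A minimal UPC sketch of this SPMD model (an added illustration, not from the slides): every thread executes main, identifies itself with MYTHREAD, and synchronizes with a barrier.

    #include <upc.h>
    #include <stdio.h>

    shared int sum[THREADS];              /* one element with affinity to each thread */

    int main(void) {
        sum[MYTHREAD] = MYTHREAD;         /* each thread writes the element it owns */
        upc_barrier;                      /* wait until every thread has written */
        if (MYTHREAD == 0) {
            int total = 0;
            for (int i = 0; i < THREADS; i++)
                total += sum[i];          /* thread 0 reads the (possibly remote) elements */
            printf("THREADS = %d, total = %d\n", THREADS, total);
        }
        return 0;
    }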
UPC Pointers
• Pointers may point to shared or private variables
• Same syntax for use; just add the qualifier
    shared int *sp;
    int *lp;
• sp is a pointer to an integer residing in the shared memory space.
• sp is called a shared pointer (somewhat sloppy, since sp itself is a private pointer to shared data).
[Figure: global address space, with sp in each thread's private area pointing to the shared value x = 3, and lp pointing to private data]
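A small sketch of how these pointers might be used (an added illustration, not from the slides):

    #include <upc.h>

    shared int x;                         /* shared scalar; has affinity to thread 0 */

    int main(void) {
        int y = 0;                        /* private: every thread has its own y */
        shared int *sp = &x;              /* (private) pointer to shared data */
        int *lp = &y;                     /* ordinary C pointer to private data */

        if (MYTHREAD == THREADS - 1)
            *sp = 3;                      /* a remote write whenever this thread is not thread 0 */
        upc_barrier;
        *lp = *sp;                        /* every thread reads x, possibly remotely, into its private y */
        return 0;
    }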
Shared Arrays in UPC
• Shared array elements are spread across the threads
    shared int x[THREADS];     /* One element per thread */
    shared int y[3][THREADS];  /* 3 elements per thread */
    shared int z[3*THREADS];   /* 3 elements per thread, cyclic */
• In the pictures below
  – Assume THREADS = 4
  – Elements with affinity to processor 0 are red
[Figure: layouts of x, y (blocked), and z (cyclic) across 4 threads; y is, of course, really a 2D array]
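As an added illustration (not from the slides), the standard upc_forall loop can keep every access to such an array local by using an affinity expression:

    #include <upc.h>

    shared int z[3*THREADS];              /* default block size 1, i.e., the cyclic layout above */

    int main(void) {
        int i;
        /* the affinity expression &z[i] runs each iteration on the thread that
           owns z[i], so every assignment below touches only local memory */
        upc_forall (i = 0; i < 3*THREADS; i++; &z[i])
            z[i] = MYTHREAD;
        upc_barrier;
        return 0;
    }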
Overlapping Communication in UPC
• Programs with fine-grained communication require overlap for performance
• The UPC compiler does this automatically for "relaxed" accesses.
  – Accesses may be designated as strict, relaxed, or unqualified (the default).
  – There are several ways of designating the ordering type.
    • A type qualifier, strict or relaxed, can be used to affect all variables of that type.
    • A label, strict or relaxed, can be used to control the accesses within a statement:
          strict : { x = y; z = y + 1; }
    • A strict or relaxed cast can be used to override the current label or type qualifier.
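A small producer/consumer sketch of the type-qualifier approach (an added illustration, assuming the standard upc_relaxed.h header; not from the slides):

    #include <upc_relaxed.h>              /* unqualified shared accesses default to relaxed */

    strict  shared int flag;              /* strict: accesses are globally ordered */
    relaxed shared int data;              /* relaxed: the compiler may reorder/overlap accesses */

    int main(void) {
        if (MYTHREAD == 0) {
            data = 99;                    /* relaxed write: may be overlapped with other work... */
            flag = 1;                     /* ...but must be complete before this strict write */
        } else if (MYTHREAD == THREADS - 1) {
            while (flag == 0)             /* strict read: spin until thread 0 signals */
                ;
            int v = data;                 /* guaranteed to observe 99 */
            (void)v;
        }
        return 0;
    }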
Performance of UPC
• Reasons why UPC may be slower than MPI
  – Shared array indexing is expensive
  – Small messages are encouraged by the model
• Reasons why UPC may be faster than MPI
  – MPI encourages synchrony
  – Buffering is required for many MPI calls
  – A remote read/write of a single word may require very little overhead
    • Cray T3E, Quadrics interconnect (next version)
• Assuming overlapped communication, the real issue is overhead: how much time does it take to issue a remote read/write?
UPC vs. MPI: Sparse MatVec Multiply
• Short-term goal:
  – Evaluate the language and compilers using small applications
• Longer term: identify a large application
[Figure: Sparse Matrix-Vector Multiply (T3E): Mflops vs. number of processors (1–32) for UPC + Prefetch, MPI (Aztec), UPC Bulk, and UPC Small]
• Shows the advantage of the T3E network model and UPC
• Performance on the Compaq machine is worse:
  – Serial code
  – Communication performance
  – New compiler just released
UPC versus MPI for Edge detection
[Figure: edge detection, N = 512: (a) execution time in seconds and (b) speedup vs. number of processors (up to 20) for UPC O1+O2 and MPI, with linear speedup for reference]
• Performance from the Cray T3E
• Benchmark developed by El Ghazawi's group at GWU
UPC versus MPI for Matrix Multiplication
[Figure: matrix multiplication: (a) execution time in seconds and (b) speedup vs. number of processors (up to 20) for UPC O1+O2 and MPI, with linear speedup for reference]
• Performance from the Cray T3E
• Benchmark developed by El Ghazawi's group at GWU
Implementing UPC
• UPC extensions to C are small
  – < 1 person-year to implement in an existing compiler
• Simplest approach
  – Reads and writes through shared pointers become small-message puts/gets
  – UPC has the "relaxed" keyword for nonblocking communication
  – Small-message performance is key
• Advanced optimizations include conversion to bulk communication, by either
  – the application programmer, or
  – the compiler
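Schematically, each shared read compiles into an affinity test plus a runtime call. The sketch below is a toy model of that idea; MY_NODE, toy_get, and the struct layout are all made-up placeholders, not the actual NERSC runtime or GASNet interfaces:

    #include <string.h>

    typedef struct {
        int   thread;                     /* which thread has affinity to the data */
        void *addr;                       /* address within that thread's memory */
    } toy_shared_ptr;

    extern int  MY_NODE;                                          /* hypothetical: this node's id */
    extern void toy_get(void *dst, int node, void *src, int n);   /* hypothetical small-message get */

    static void toy_read_int(int *dst, toy_shared_ptr p) {
        if (p.thread == MY_NODE)
            memcpy(dst, p.addr, sizeof(int));            /* on-node: an ordinary load */
        else
            toy_get(dst, p.thread, p.addr, sizeof(int)); /* off-node: a small-message get */
    }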
Overview of NERSC Compiler
1) Compiler
  – Portable compiler infrastructure (UPC -> C)
  – Explore optimizations: communication, shared pointers
  – Based on Open64; plan to release sources
2) Runtime systems for multiple compilers
  – Allow use by other languages (Titanium and CAF)
  – And in other UPC compilers, e.g., Intrepid
  – Performance of small-message put/get is key
  – Designed to be easily ported, then tuned
  – Also designed for low overhead (macros, inline functions)
Compiler and Runtime Status
• Basic parsing and type-checking complete
• Generates code for small serial kernels
  – Still testing and debugging
  – Needs the runtime for complete testing
• UPC runtime layer
  – Initial implementation should be done this month
  – Based on processes (not threads) on top of GASNet
• GASNet
  – Initial specification complete
  – Reference implementation done on MPI
  – Working on Quadrics and IBM (LAPI, ...)
Benchmarks for GAS Languages
• EEL: end-to-end latency, or the time spent sending a short message between two processes
• BW: large-message network bandwidth
• Parameters of the LogP model
  – L: "latency", or time spent on the network
    • During this time, the processor can be doing other work
  – o: "overhead", or processor busy time on the sending or receiving side
    • During this time, the processor cannot be doing other work
    • We distinguish between "send" and "recv" overhead
  – g: "gap", the minimum interval between messages pushed onto the network (its inverse is the message rate)
  – P: the number of processors
LogP Parameters: Overhead & Latency
• Non-overlapping overhead: EEL = osend + L + orecv
• Send and recv overhead can overlap the network latency: EEL = f(osend, L, orecv)
[Figure: timelines of P0 sending to P1 in the two cases]
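• Example (hypothetical numbers): with osend = 2 usec, L = 5 usec, and orecv = 2 usec, the non-overlapping case gives EEL = 2 + 5 + 2 = 9 usec, while fully overlapping the overheads with the network time lets EEL approach max(osend, L, orecv) = 5 usec.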
Benchmarks
• Designed to measure the network parameters
  – Also provide the gap as a function of queue depth
  – Measured for the "best case" in general
• Implemented once in MPI
  – For portability and for comparison to the target-specific layer
• Implemented again in each target-specific communication layer:
  – LAPI
  – ELAN
  – GM
  – SHMEM
  – VIPL
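As an illustration of how EEL is typically measured, here is a minimal ping-pong sketch (not the actual benchmark code):

    #include <mpi.h>
    #include <stdio.h>

    #define NITER 10000

    int main(int argc, char **argv) {
        int rank; char byte = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < NITER; i++) {    /* bounce a 1-byte message between ranks 0 and 1 */
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)                       /* half the round-trip time approximates EEL */
            printf("EEL ~ %g usec\n", (t1 - t0) / (2.0 * NITER) * 1e6);
        MPI_Finalize();
        return 0;
    }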
Results: EEL and Overhead
[Figure: send overhead, receive overhead, added latency, and combined send & receive overhead (usec) measured on T3E/MPI, T3E/Shmem, T3E/E-Reg, IBM/MPI, IBM/LAPI, Quadrics/MPI, Quadrics/Put, Quadrics/Get, M2K/MPI, M2K/GM, Dolphin/MPI, and Giganet/VIPL]
Results: Gap and Overhead
[Figure: gap, send overhead, and receive overhead (usec) for the same platforms and communication layers]
Send Overhead Over Time
• Overhead has not improved significantly; the T3D was best
  – Lack of integration; lack of attention in software
[Figure: send overhead (usec) vs. year, roughly 1990–2002, for machines including the nCUBE/2, CM-5, Meiko, Paragon, T3D, T3E, Cenju-4, SP3, SCI, Dolphin, Myrinet, Myrinet 2000, and Compaq]
Summary
• Global address space languages offer an alternative to MPI for large machines
  – Easier to use: shared data structures
  – Recover users left behind on shared memory?
  – Performance tuning is still possible
• Implementation
  – Small compiler effort given lightweight communication
  – Portable communication layer: GASNet
  – Difficulty with small-message performance on the IBM SP platform