Towards Optimized UPC Implementations

Tarek A. El-Ghazawi
The George Washington University
tarek@gwu.edu
IBM T.J. Watson, UPC: Unified Parallel C, 02/22/05
Agenda
Background
UPC Language Overview
Productivity
Performance Issues
Automatic Optimizations
Conclusions
Parallel Programming Models

What is a programming model?
An abstract machine which outlines the view perceived by the programmer of data and execution
Where architecture and applications meet
A non-binding contract between the programmer and the compiler/system

Good programming models should:
Allow efficient mapping on different architectures
Keep programming easy

Benefits:
Application: independence from architecture
Architecture: independence from applications
Programming Models

Message Passing: MPI
Shared Memory: OpenMP
DSM/PGAS: UPC

[Figure: how processes/threads map onto address spaces under each model]
Programming Paradigms: Expressivity of Parallelism and Locality

Implicit parallelism, implicit locality: Sequential (e.g. C, Fortran, Java)
Implicit parallelism, explicit locality: Data Parallel (e.g. HPF, C*)
Explicit parallelism, implicit locality: Shared Memory (e.g. OpenMP)
Explicit parallelism, explicit locality: Distributed Shared Memory/PGAS (e.g. UPC, CAF, and Titanium)
What is UPC?
Unified Parallel C
An explicit parallel extension of ISO C
A distributed shared memory/PGAS parallel programming language
Why not message passing?

Performance: high penalty for short transactions; cost of calls; two-sided; excessive buffering
Ease-of-use: explicit data transfers; domain decomposition does not maintain the original global application view; more code and conceptual difficulty
Why DSM/PGAS?

Performance: no calls; efficient short transfers; locality
Ease-of-use: implicit transfers; consistent global application view; less code and conceptual difficulty
Why DSM/PGAS: New Opportunities for Compiler Optimizations

[Figure: a Sobel operator over an image striped across Thread 0 through Thread 3, with ghost zones at the partition boundaries]

The DSM programming model exposes sequential remote accesses at compile time, an opportunity for compiler-directed prefetching.
History

Initial technical report from IDA in collaboration with LLNL and UCB in May 1999
UPC consortium of government, academia, and HPC vendors, coordinated by GWU, IDA, and DoD
The participants currently are: IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U. Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, …
Status

Specification v1.0 completed in February 2001; v1.1.1 in October 2003; v1.2 will add collectives and UPC-IO
Benchmarking suites: STREAM, GUPS, RandomAccess, NPB suite, Splash-2, and others
Testing suite v1.0, v1.1
Short courses and tutorials in the US and abroad
Research exhibits at SC 2000-2004
UPC web site: upc.gwu.edu
UPC book by mid-2005 from John Wiley and Sons
Manual(s)
Hardware Platforms

UPC implementations are available for:
SGI O2000/3000: Intrepid (32- and 64-bit GCC), UCB (32-bit GCC)
Cray T3D/E
Cray X-1
HP AlphaServer SC, Superdome
Berkeley UPC compiler: Myrinet, Quadrics, and InfiniBand clusters
Beowulf reference implementation (MPI-based, MTU)
New ongoing efforts by IBM and Sun
UPC Execution Model

A number of threads working independently in a SPMD fashion
MYTHREAD specifies the thread index (0..THREADS-1)
Number of threads specified at compile time or run time
Process and data synchronization when needed: barriers and split-phase barriers; locks and arrays of locks; fence; memory consistency control
UPC Memory Model

Shared space with thread affinity, plus private spaces
A pointer-to-shared can reference all locations in the shared space
A private pointer may reference only addresses in its private space or addresses in its portion of the shared space
Static and dynamic memory allocations are supported for both shared and private memory

[Figure: one shared space spanning Thread 0 through Thread THREADS-1, with a partition affine to each thread, above the disjoint spaces Private 0 through Private THREADS-1]
UPC Pointers

How to declare them?

int *p1;               /* private pointer pointing locally */
shared int *p2;        /* private pointer pointing into the shared space */
int *shared p3;        /* shared pointer pointing locally */
shared int *shared p4; /* shared pointer pointing into the shared space */

You may find many using "shared pointer" to mean a pointer pointing to a shared object, e.g. equivalent to p2, but it could be p4 as well.
UPC Pointers

[Figure: p1 and p2 reside in each thread's private space, while p3 and p4 reside in the shared space with affinity to Thread 0; p1 points into private space, p3 into the local shared partition, and p2 and p4 anywhere in the shared space]
Synchronization - Barriers

No implicit synchronization among the threads
UPC provides the following synchronization mechanisms: barriers, locks, memory consistency control, and fence
Memory Consistency Models
Has to do with ordering of shared operations, and when a change of a shared object by a thread becomes visible to others
Consistency can be strict or relaxed
Under the relaxed consistency model, the shared operations can be reordered by the compiler / runtime system
The strict consistency model enforces sequential ordering of shared operations. (No operation on shared can begin before the previous ones are done, and changes become visible immediately)
Memory Consistency Models

The user specifies the memory model through: declarations; pragmas for a particular statement or sequence of statements; use of barriers and global operations
Programmers are responsible for using the correct consistency model
UPC and Productivity

Metrics:
Lines of 'useful' code: indicates the development time as well as the maintenance cost
Number of 'useful' characters: an alternative way to measure development and maintenance effort
Conceptual complexity: function level, keyword usage, number of tokens, max loop depth, …
Manual Effort – NPB Example

              SEQ    UPC    SEQ    MPI    UPC Effort (%)  MPI Effort (%)
NPB-CG #line    665    710    506   1046       6.77         106.72
       #char  16145  17200  16485  37501       6.53         127.49
NPB-EP #line    127    183    130    181      44.09          36.23
       #char   2868   4117   4741   6567      43.55          38.52
NPB-FT #line    575   1018    665   1278      77.04          92.18
       #char  13090  21672  22188  44348      65.56          99.87
NPB-IS #line    353    528    353    627      49.58          77.62
       #char   7273  13114   7273  13324      80.31          83.20
NPB-MG #line    610    866    885   1613      41.97          82.26
       #char  14830  21990  27129  50497      48.28          86.14

UPC effort (%) = (#UPC - #SEQ) / #SEQ × 100; MPI effort (%) = (#MPI - #SEQ) / #SEQ × 100, each relative to its own sequential baseline
Manual Effort – More Examples

                SEQ    MPI    SEQ    UPC    MPI Effort (%)  UPC Effort (%)
GUPS      #line    41     98     41     47      139.02          14.63
          #char  1063   2979   1063   1251      180.02          17.68
Histogram #line    12     30     12     20      150.00          66.67
          #char   188    705    188    376      275.00         100.00
N-Queens  #line    86    166     86    139       93.02          61.63
          #char  1555   3332   1555   2516      124.28          61.80

UPC effort (%) = (#UPC - #SEQ) / #SEQ × 100; MPI effort (%) = (#MPI - #SEQ) / #SEQ × 100
Conceptual Complexity – HISTOGRAM

UPC (overall score: 22)
                                     Work Distr.  Data Distr.  Comm.  Synch. & Consist.  Misc. Ops  Sum
#Parameters                               5            4         0            3              0       12
#Function calls                           0            0         0            4              0        4
#References to THREADS and MYTHREAD       2            1         0            0              0        3
#UPC constructs & UPC types               0            2         0            1              0        3
Notes: work distribution: 2 if, 1 for; data distribution: 2 shared decl.; synch.: 1 lock decl., 1 lock/unlock, 2 barriers

MPI (overall score: 47)
                                     Work Distr.  Data Distr.  Comm.  Synch. & Consist.  Misc. Ops  Sum
#Parameters                               5            0        15            0              6       26
#Function calls                           0            0         2            2              4        8
#References to myrank and nprocs          3            0         2            0              2        5
#MPI types                                0            0         6            0              2        8
Notes: work distribution: 2 if, 1 for; comm.: 1 Scatter, 1 Reduce; synch.: (implicit w. collective); misc.: 1 Init/Finalize, 2 Comm
Conceptual Complexity – GUPS

UPC (overall score: 43)
                                     Work Distr.  Data Distr.  Comm.  Synch. & Consist.  Misc. Ops  Sum
#Parameters                              21            6         0            0              0       27
#Function calls                           0            4         0            2              0        6
#References to THREADS and MYTHREAD       3            4         0            0              0        7
#UPC constructs & UPC types               3            0         0            0              0        3
Notes: work distribution: 3 forall, 2 for, 3 if; data distribution: 5 shared, 2 all_alloc, 2 free; synch.: 2 barriers

MPI (overall score: 136)
                                     Work Distr.  Data Distr.  Comm.  Synch. & Consist.  Misc. Ops  Sum
#Parameters                              18           17        38            1              6       80
#Function calls                           0            7         6            3              6       22
#References to myrank and nprocs          3            5        13            1              4       26
#MPI types                                0            6         2            0              0        8
Notes: work distribution: 5 for, 3 if; data distribution: 2 mem alloc, 2 mem free, 3 window; comm.: 2 one-sided, 4 collective; synch.: (implicit w. collective and WinFence), 1 barrier; misc.: Init, Finalize, comm_rank, comm_size, 2 Wtime (6 error handle)
UPC Optimization Issues

Particular challenges: avoiding address translation; the cost of address translation
Special opportunities: locality-driven compiler-directed prefetching; aggregation
General: low-level optimized libraries (e.g. collectives); backend optimizations; overlapping of remote accesses and synchronization with other work
Showing Potential Optimizations Through Emulated Hand-Tunings

Different hand-tuning levels:
Unoptimized UPC code, referred to as UPC.O0
Privatized UPC code, referred to as UPC.O1
Prefetched UPC code: a hand-optimized variant using block get/put to mimic the effect of prefetching, referred to as UPC.O2
Fully hand-tuned UPC code: a hand-optimized variant integrating privatization, aggregation of remote accesses, as well as prefetching, referred to as UPC.O3

T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 30th IEEE International Conference on Parallel Processing (ICPP'01), 2001, pp. 365-372
Address Translation Cost and Local Space Privatization – Cluster

STREAM Benchmark (MB/s), results gathered on a Myrinet cluster:

MB/s          Put      Get      Scale    Sum
CC            N/A      N/A      1565.04  5409.3
UPC Private   N/A      N/A      1687.63  1776.81
UPC Local     1196.51  1082.89  54.22    82.7
UPC Remote    241.43   237.51   0.09     0.16

MB/s          Copy (arr)  Copy (ptr)  Memcpy   Memset
CC            1340.99     1488.02     1223.86  2401.26
UPC Private   1383.57     433.45      1252.47  2352.71
UPC Local     47.2        90.67       1202.8   2398.9
UPC Remote    0.09        0.20        1197.22  2360.59
Address Translation and Local Space Privatization – DSM Architecture

STREAM Benchmark (MB/s); Memory copy, Block Get, and Block Put are bulk operations, the rest are element-by-element:

MB/s                                  Memory copy  Block Get  Block Put  Array Set  Array Copy  Sum  Scale
GCC                                       127         N/A        N/A        175        106      223   108
UPC Private                               127         N/A        N/A        173        106      215   107
UPC Local Shared                          139         140        136         26         14       31    13
UPC Remote Shared (within SMP node)       130         129        136         26         13       30    13
UPC Remote Shared (beyond SMP node)       112         117        136         24         12       28    12
Aggregation and Overlapping of Remote Shared Memory Accesses

[Figures: execution time (sec) vs. number of threads/processes (up to 32) for UPC N-Queens and UPC Sobel Edge, each comparing UPC NO OPT. against UPC FULL OPT.; SGI O2000]

The benefit of hand-optimizations is greatly application dependent:
N-Queens does not perform any better, mainly because it is an embarrassingly parallel program
The Sobel Edge Detector does get a speedup of one order of magnitude after hand-optimizing, and scales almost perfectly linearly
Impact of Hand-Optimizations on NPB.CG

[Figure: computation time (sec) vs. number of processors (1-32) for UPC.O0, UPC.O1, UPC.O3, and GCC; Class A on SGI Origin 2000]
Shared Address Translation Overhead

Address translation overhead is quite significant: more than 70% of the work for a local-shared memory access
Demonstrates the real need for optimization

[Figure: a private memory access pays only the actual access time, while a local-shared access adds address-translation overhead on top of it: 144 ns of memory access time, 247 ns of address calculation, and 123 ns of UPC put/get function-call overhead (SGI Origin 2000, GCC-UPC)]
Shared Address Translation Overheads for Sobel Edge Detection

[Figure: execution time (sec) of UPC.O0 vs. UPC.O3 on 1, 2, 4, 8, and 16 processors, broken down into processing + memory access, address function call, and address calculation]

UPC.O0: unoptimized UPC code; UPC.O3: hand-optimized UPC code. Ox notations from:
T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 2001 International Conference on Parallel Processing, Valencia, September 2001
Reducing Address Translation Overheads via Translation Look-Aside Buffers

F. Cantonnet, T. El-Ghazawi, P. Lorenz, and J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers", IPDPS'05, Denver, CO, April 2005

Use look-up Memory Model Translation Buffers (MMTB) to perform fast translations
Two alternative methods are proposed to create and use MMTBs:
FT: a basic method using direct addressing
RT: an advanced method using indexed addressing
Prototyped as a compiler-enabled optimization; no modifications to actual UPC codes are needed
Different Strategies – Full-Table (FT)

shared int array[8];  /* distributed across 4 THREADS */

[Figure: array[0..7] is distributed round-robin, so array[0] and array[4] have affinity to TH0, array[1] and array[5] to TH1, array[2] and array[6] to TH2, and array[3] and array[7] to TH3. Every thread stores a full MMTB look-up table FT[0..7] holding the virtual address of each element, e.g. FT[0] = 57FF8040, FT[4] = 57FF8048, FT[1] = 5FFF8040, FT[5] = 5FFF8048, FT[2] = 67FF8040, FT[6] = 67FF8048, FT[3] = 6FFF8040, FT[7] = 6FFF8048]

Pros: direct mapping; no address calculation
Cons: large memory required; can lead to competition over caches and main memory

Consider shared [B] int array[8].
To initialize FT:  ∀i ∈ [0,7], FT[i] = _get_vaddr(&array[i])
To access array[]: ∀i ∈ [0,7], array[i] = _get_value_at(FT[i])
Different Strategies – Reduced-Table: Infinite Block Size

RT strategy: only one table entry in this case; the address calculation step is simple

[Figure: with BLOCKSIZE = infinite, array[0..3] all reside contiguously on THREAD0; each thread's MMTB holds the single entry RT[0]. Only the address of the first element needs to be saved, since all array data is contiguous]

Consider shared [] int array[4].
To initialize RT:  RT[0] = _get_vaddr(&array[0])
To access array[]: ∀i ∈ [0,3], array[i] = _get_value_at(RT[0] + i)
Different Strategies – Reduced-Table: Default Block Size

RT strategy: less memory required than FT, since the MMTB has THREADS entries; the address calculation step is slightly costly, but much cheaper than in current implementations

[Figure: with BLOCKSIZE = 1, array[0], array[4], array[8], array[12] reside on THREAD0; array[1], array[5], array[9], array[13] on THREAD1; array[2], array[6], array[10], array[14] on THREAD2; and array[3], array[7], array[11], array[15] on THREAD3. Each thread's MMTB RT[0..3] saves only the first address on each thread, since each thread's data is contiguous]

Consider shared [1] int array[16].
To initialize RT:  ∀i ∈ [0,THREADS-1], RT[i] = _get_vaddr(&array[i])
To access array[]: ∀i ∈ [0,15], array[i] = _get_value_at(RT[i mod THREADS] + (i / THREADS))
Different Strategies – Reduced-Table: Arbitrary Block Size

RT strategy: less memory required than for FT, but more than in the previous cases; the address calculation step is more costly than in the previous cases

[Figure: with block size 2, array[0], array[1], array[8], array[9] reside on THREAD0; array[2], array[3], array[10], array[11] on THREAD1; array[4], array[5], array[12], array[13] on THREAD2; and array[6], array[7], array[14], array[15] on THREAD3. Each thread's MMTB RT[0..7] saves only the first address of each block, since the data within a block is contiguous]

Consider shared [2] int array[16].
To initialize RT:  ∀i ∈ [0,7], RT[i] = _get_vaddr(&array[i * blocksize(array)])
To access array[]: ∀i ∈ [0,15], array[i] = _get_value_at(RT[i / blocksize(array)] + (i mod blocksize(array)))
Performance Impact of the MMTB – Sobel Edge

FT and RT perform around 6 to 8 times better than the regular basic UPC version (O0)
The RT strategy is slower than FT, since the address calculation (arbitrary block size case) becomes more complex
FT, on the other hand, performs almost as well as the hand-tuned versions (O3 and MPI)

[Figures: Sobel Edge (N=2048) execution time (sec) vs. number of threads (1-16), comparing O0, O0.FT, O0.RT, O3, and MPI, shown with and without O0]
Performance Impact of the MMTB – Matrix Multiplication

FT strategy: an increase in L1 data cache misses due to the large table size
RT strategy: L1 misses are kept low, but an increase in the number of loads and stores is observed, showing increased computation (arbitrary block size used)

[Figures: Matrix Multiplication (N=256): execution time (sec) vs. number of threads (1-16) for UPC.O0, UPC.O0.FT, UPC.O0.RT, UPC.O3, and MPI; and a per-variant hardware-profiling breakdown at 1, 2, 4, 8, and 16 threads showing computation time, L1 and L2 data cache misses, TLB misses, graduated loads and stores, and decoded branches]
Time and Storage Requirements of the Address Translation Methods for the Matrix Multiply Microkernel

The number of loads and stores can increase with arithmetic operators.

For a shared array of N elements with block size B (E: element size in bytes, P: pointer size in bytes):

            Storage per              # memory accesses per  # arithmetic operations per
            shared array             shared memory access   shared memory access
UPC.O0      E·N                      more than 25           more than 5
UPC.O0.FT   THREADS·P·N + E·N        1                      0
UPC.O0.RT   THREADS·P·(N/B) + E·N    1                      up to 3
UPC Work-sharing Construct Optimizations

By thread/index number (upc_forall integer):

    upc_forall(i=0; i<N; i++; i)
        loop body;

By the address of a shared variable (upc_forall address):

    upc_forall(i=0; i<N; i++; &shared_var[i])
        loop body;

By thread/index number (for optimized):

    for(i=MYTHREAD; i<N; i+=THREADS)
        loop body;

By thread/index number (for integer):

    for(i=0; i<N; i++) {
        if(MYTHREAD == i%THREADS)
            loop body;
    }

By the address of a shared variable (for address):

    for(i=0; i<N; i++) {
        if(upc_threadof(&shared_var[i]) ==
           MYTHREAD)
            loop body;
    }
Performance of Equivalent upc_forall and for Loops

[Figure: time (sec) vs. number of processors (1-16) for the upc_forall address, upc_forall integer, for address, for integer, and for optimized variants]
Performance Limitations Imposed by Sequential C Compilers – STREAM

NUMA (MB/s):
     BULK                            Element-by-Element
     memcpy  memset  Struct cp      Copy (arr)  Copy (ptr)  Set     Sum     Scale  Add    Triad
F    291.21  163.90  N/A            291.59      N/A         159.68  135.37  246.3  235.1  303.82
C    231.20  214.62  158.86         120.57      152.77      147.70  298.38  133.4  13.86  20.71

Vector (MB/s):
     BULK                            Element-by-Element
     memcpy  memset  Struct cp      Copy (arr)  Copy (ptr)  Set     Sum     Scale  Add    Triad
F    14423   11051   N/A            14407       N/A         11015   17837   14423  10715  16053
C    18850   5307    7882           7972        7969        10576   18260   7865   3874   5824
Loopmark – SET/ADD Operations

Let us compare the loopmarks for each F / C operation.
Fortran:

MEMSET (bulk set)
146. 1 t = mysecond(tflag)
147. 1 V M--<><> a(1:n) = 1.0d0
148. 1 t = mysecond(tflag) - t
149. 1 times(2,k) = t
SET
158. 1 arrsum = 2.0d0;
159. 1 t = mysecond(tflag)
160. 1 MV------< DO i = 1,n
161. 1 MV c(i) = arrsum
162. 1 MV arrsum = arrsum + 1
163. 1 MV------> END DO
164. 1 t = mysecond(tflag) - t
165. 1 times(4,k) = t
ADD
180. 1 t = mysecond(tflag)
181. 1 V M--<><> c(1:n) = a(1:n) + b(1:n)
182. 1 t = mysecond(tflag) - t
183. 1 times(7,k) = t
C:

MEMSET (bulk set)
163. 1 times[1][k] = mysecond_();
164. 1 memset(a, 1, NDIM*sizeof(elem_t));;
165. 1 times[1][k] = mysecond_() - times[1][k];
SET
217. 1 set = 2;
220. 1 times[5][k] = mysecond_();
222. 1 MV--< for (i=0; i<NDIM; i++)
223. 1 MV {
224. 1 MV c[i] = (set++);
225. 1 MV--> }
227. 1 times[5][k] = mysecond_() - times[5][k];
ADD
283. 1 times[10][k]= mysecond_();
285. 1 Vp--< for (j=0; j<NDIM; j++)
286. 1 Vp {
287. 1 Vp c[j] = a[j] + b[j];
288. 1 Vp--> }
290. 1 times[10][k] = mysecond_() - times[10][k];
Legend: V: Vectorized – M: Multistreamed – p: conditional, partial and/or computed
UPC vs. CAF Using the NPB Workloads

In general, UPC is slower than CAF, mainly due to:

Point-to-point vs. barrier synchronization
Better scalability is obtained with proper collective operations
Program writers can do point-to-point synchronization using current constructs

Scalar performance of source-to-source translated code
Alias analysis (C pointers): can highlight the need to explicitly use restrict to help several compiler backends
Lack of support for multi-dimensional arrays in C: can prevent high-level loop transformations and software pipelining, causing a 2x slowdown in SP for UPC
Need for exhaustive C compiler analysis: a failure to perform proper loop fusion and alignment in the critical section of MG can lead to 51% more loads for UPC than CAF; a failure to adequately unroll the sparse matrix-vector multiplication in CG can lead to more cycles in UPC
Conclusions

UPC is a locality-aware parallel programming language
With proper optimizations, UPC can outperform MPI in random short accesses and can otherwise perform as well as MPI
UPC is very productive, and UPC applications result in much smaller and more readable code than MPI
UPC compiler optimizations are still lagging, in spite of the fact that substantial progress has been made
For future architectures, UPC has the unique opportunity of very efficient implementations, as most of the pitfalls and obstacles have been revealed along with adequate solutions
Conclusions

In general, there are four types of optimizations:
Optimizations to exploit the locality consciousness and other unique features of UPC
Optimizations to keep the overhead of UPC low
Optimizations to exploit architectural features
Standard optimizations that are applicable to all system compilers
Conclusions

Optimizations are possible at three levels:
A source-to-source translator acting during the compilation phase and incorporating most UPC-specific optimizations
C backend compilers that compete with Fortran
A strong run-time system that can work effectively with the operating system
Selected Publications

T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick, UPC: Distributed Shared Memory Programming, John Wiley & Sons, New York, June 2005. ISBN 0-471-22048-5.
T. El-Ghazawi, F. Cantonnet, Y. Yao, S. Annareddy, and A. Mohamed, "Benchmarking Parallel Compilers for Distributed Shared Memory Languages: A UPC Case Study", Journal of Future Generation Computer Systems, North-Holland (accepted).
T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 30th IEEE International Conference on Parallel Processing (ICPP'01), 2001, pp. 365-372.
T. El-Ghazawi and F. Cantonnet, "UPC Performance and Potential: A NPB Experimental Study", Supercomputing 2002 (SC2002), Baltimore, November 2002.
F. Cantonnet, T. El-Ghazawi, P. Lorenz, and J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers", IPDPS'05, Denver, CO, April 2005.
Additional publications in CUG and PPoPP proceedings.