Towards Optimized UPC Implementations

Tarek A. El-Ghazawi
The George Washington University
tarek@gwu.edu
IBM T.J. Watson, UPC: Unified Parallel C, 02/22/05
Agenda
Background
UPC Language Overview
Productivity
Performance Issues
Automatic Optimizations
Conclusions
Parallel Programming Models

What is a programming model?
An abstract machine which outlines the view perceived by the programmer of data and execution
Where architecture and applications meet
A non-binding contract between the programmer and the compiler/system

Good programming models should:
Allow efficient mapping on different architectures
Keep programming easy

Benefits:
Application: independence from architecture
Architecture: independence from applications
Programming Models

Message Passing: MPI
Shared Memory: OpenMP
DSM/PGAS: UPC

[Figure: how processes/threads map onto address spaces under each model]
Programming Paradigms: Expressivity of Parallelism and Locality

Implicit parallelism, implicit locality: Sequential (e.g. C, Fortran, Java)
Implicit parallelism, explicit locality: Data Parallel (e.g. HPF, C*)
Explicit parallelism, implicit locality: Shared Memory (e.g. OpenMP)
Explicit parallelism, explicit locality: Distributed Shared Memory/PGAS (e.g. UPC, CAF, and Titanium)
What is UPC?
Unified Parallel C
An explicit parallel extension of ISO C
A distributed shared memory/PGAS parallel programming language
Why not message passing?

Performance: high penalty for short transactions; cost of calls; two-sided; excessive buffering
Ease-of-use: explicit data transfers; domain decomposition does not maintain the original global application view; more code and conceptual difficulty
Why DSM/PGAS?

Performance: no calls; efficient short transfers; locality
Ease-of-use: implicit transfers; consistent global application view; less code and conceptual difficulty
Why DSM/PGAS: New Opportunities for Compiler Optimizations

[Figure: a Sobel operator over an image striped across Thread 0 through Thread 3, with ghost zones at the partition boundaries]

The DSM programming model exposes sequential remote accesses at compile time, an opportunity for compiler-directed prefetching.
History

Initial technical report from IDA in collaboration with LLNL and UCB in May 1999
UPC consortium of government, academia, and HPC vendors, coordinated by GWU, IDA, and DoD
The participants currently are: IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U. Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, …
Status

Specification v1.0 completed in February 2001; v1.1.1 in October 2003; v1.2 will add collectives and UPC-IO
Benchmarking suites: STREAM, GUPS, RandomAccess, NPB suite, Splash-2, and others
Testing suite v1.0, v1.1
Short courses and tutorials in the US and abroad
Research exhibits at SC 2000-2004
UPC web site: upc.gwu.edu
UPC book by mid-2005 from John Wiley and Sons
Manual(s)
Hardware Platforms

UPC implementations are available for:
SGI O2000/3000: Intrepid (32- and 64-bit GCC), UCB (32-bit GCC)
Cray T3D/E
Cray X-1
HP AlphaServer SC, Superdome
Berkeley UPC compiler: Myrinet, Quadrics, and InfiniBand clusters
Beowulf reference implementation (MPI-based, MTU)
New ongoing efforts by IBM and Sun
UPC Execution Model

A number of threads working independently in a SPMD fashion
MYTHREAD specifies the thread index (0..THREADS-1)
Number of threads specified at compile time or run time
Process and data synchronization when needed: barriers and split-phase barriers; locks and arrays of locks; fence; memory consistency control
UPC Memory Model

Shared space with thread affinity, plus private spaces
A pointer-to-shared can reference all locations in the shared space
A private pointer may reference only addresses in its private space or addresses in its portion of the shared space
Static and dynamic memory allocations are supported for both shared and private memory

[Figure: one shared space spanning Thread 0 through Thread THREADS-1, with a partition affine to each thread, above the disjoint spaces Private 0 through Private THREADS-1]
UPC Pointers

How to declare them?

int *p1;               /* private pointer pointing locally */
shared int *p2;        /* private pointer pointing into the shared space */
int *shared p3;        /* shared pointer pointing locally */
shared int *shared p4; /* shared pointer pointing into the shared space */

You may find many using "shared pointer" to mean a pointer pointing to a shared object, e.g. equivalent to p2, but it could be p4 as well.
UPC Pointers

[Figure: p1 and p2 reside in each thread's private space, while p3 and p4 reside in the shared space with affinity to Thread 0; p1 points into private space, p3 into the local shared partition, and p2 and p4 anywhere in the shared space]
Synchronization - Barriers

No implicit synchronization among the threads
UPC provides the following synchronization mechanisms: barriers, locks, memory consistency control, and fence
Memory Consistency Models
Has to do with ordering of shared operations, and when a change of a shared object by a thread becomes visible to others
Consistency can be strict or relaxed
Under the relaxed consistency model, the shared operations can be reordered by the compiler / runtime system
The strict consistency model enforces sequential ordering of shared operations. (No operation on shared can begin before the previous ones are done, and changes become visible immediately)
Memory Consistency Models

The user specifies the memory model through: declarations; pragmas for a particular statement or sequence of statements; use of barriers and global operations
Programmers are responsible for using the correct consistency model
UPC and Productivity

Metrics:
Lines of 'useful' code: indicates the development time as well as the maintenance cost
Number of 'useful' characters: an alternative way to measure development and maintenance effort
Conceptual complexity: function level, keyword usage, number of tokens, max loop depth, …
Manual Effort – NPB Example

              SEQ    UPC    SEQ    MPI    UPC Effort (%)  MPI Effort (%)
NPB-CG #line    665    710    506   1046       6.77         106.72
       #char  16145  17200  16485  37501       6.53         127.49
NPB-EP #line    127    183    130    181      44.09          36.23
       #char   2868   4117   4741   6567      43.55          38.52
NPB-FT #line    575   1018    665   1278      77.04          92.18
       #char  13090  21672  22188  44348      65.56          99.87
NPB-IS #line    353    528    353    627      49.58          77.62
       #char   7273  13114   7273  13324      80.31          83.20
NPB-MG #line    610    866    885   1613      41.97          82.26
       #char  14830  21990  27129  50497      48.28          86.14

UPC effort (%) = (#UPC - #SEQ) / #SEQ × 100; MPI effort (%) = (#MPI - #SEQ) / #SEQ × 100, each relative to its own sequential baseline
Manual Effort – More Examples

                SEQ    MPI    SEQ    UPC    MPI Effort (%)  UPC Effort (%)
GUPS      #line    41     98     41     47      139.02          14.63
          #char  1063   2979   1063   1251      180.02          17.68
Histogram #line    12     30     12     20      150.00          66.67
          #char   188    705    188    376      275.00         100.00
N-Queens  #line    86    166     86    139       93.02          61.63
          #char  1555   3332   1555   2516      124.28          61.80

UPC effort (%) = (#UPC - #SEQ) / #SEQ × 100; MPI effort (%) = (#MPI - #SEQ) / #SEQ × 100
Conceptual Complexity – HISTOGRAM

UPC (overall score: 22)
                                     Work Distr.  Data Distr.  Comm.  Synch. & Consist.  Misc. Ops  Sum
#Parameters                               5            4         0            3              0       12
#Function calls                           0            0         0            4              0        4
#References to THREADS and MYTHREAD       2            1         0            0              0        3
#UPC constructs & UPC types               0            2         0            1              0        3
Notes: work distribution: 2 if, 1 for; data distribution: 2 shared decl.; synch.: 1 lock decl., 1 lock/unlock, 2 barriers

MPI (overall score: 47)
                                     Work Distr.  Data Distr.  Comm.  Synch. & Consist.  Misc. Ops  Sum
#Parameters                               5            0        15            0              6       26
#Function calls                           0            0         2            2              4        8
#References to myrank and nprocs          3            0         2            0              2        5
#MPI types                                0            0         6            0              2        8
Notes: work distribution: 2 if, 1 for; comm.: 1 Scatter, 1 Reduce; synch.: (implicit w. collective); misc.: 1 Init/Finalize, 2 Comm
Conceptual Complexity – GUPS

UPC (overall score: 43)
                                     Work Distr.  Data Distr.  Comm.  Synch. & Consist.  Misc. Ops  Sum
#Parameters                              21            6         0            0              0       27
#Function calls                           0            4         0            2              0        6
#References to THREADS and MYTHREAD       3            4         0            0              0        7
#UPC constructs & UPC types               3            0         0            0              0        3
Notes: work distribution: 3 forall, 2 for, 3 if; data distribution: 5 shared, 2 all_alloc, 2 free; synch.: 2 barriers

MPI (overall score: 136)
                                     Work Distr.  Data Distr.  Comm.  Synch. & Consist.  Misc. Ops  Sum
#Parameters                              18           17        38            1              6       80
#Function calls                           0            7         6            3              6       22
#References to myrank and nprocs          3            5        13            1              4       26
#MPI types                                0            6         2            0              0        8
Notes: work distribution: 5 for, 3 if; data distribution: 2 mem alloc, 2 mem free, 3 window; comm.: 2 one-sided, 4 collective; synch.: (implicit w. collective and WinFence), 1 barrier; misc.: Init, Finalize, comm_rank, comm_size, 2 Wtime (6 error handle)
UPC Optimization Issues

Particular challenges: avoiding address translation; the cost of address translation
Special opportunities: locality-driven compiler-directed prefetching; aggregation
General: low-level optimized libraries (e.g. collectives); backend optimizations; overlapping of remote accesses and synchronization with other work
Showing Potential Optimizations Through Emulated Hand-Tunings

Different hand-tuning levels:
Unoptimized UPC code, referred to as UPC.O0
Privatized UPC code, referred to as UPC.O1
Prefetched UPC code: a hand-optimized variant using block get/put to mimic the effect of prefetching, referred to as UPC.O2
Fully hand-tuned UPC code: a hand-optimized variant integrating privatization, aggregation of remote accesses, as well as prefetching, referred to as UPC.O3

T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 30th IEEE International Conference on Parallel Processing (ICPP'01), 2001, pp. 365-372
Address Translation Cost and Local Space Privatization – Cluster

STREAM Benchmark (MB/s), results gathered on a Myrinet cluster:

MB/s          Put      Get      Scale    Sum
CC            N/A      N/A      1565.04  5409.3
UPC Private   N/A      N/A      1687.63  1776.81
UPC Local     1196.51  1082.89  54.22    82.7
UPC Remote    241.43   237.51   0.09     0.16

MB/s          Copy (arr)  Copy (ptr)  Memcpy   Memset
CC            1340.99     1488.02     1223.86  2401.26
UPC Private   1383.57     433.45      1252.47  2352.71
UPC Local     47.2        90.67       1202.8   2398.9
UPC Remote    0.09        0.20        1197.22  2360.59
Address Translation and Local Space Privatization – DSM Architecture

STREAM Benchmark (MB/s); Memory copy, Block Get, and Block Put are bulk operations, the rest are element-by-element:

MB/s                                  Memory copy  Block Get  Block Put  Array Set  Array Copy  Sum  Scale
GCC                                       127         N/A        N/A        175        106      223   108
UPC Private                               127         N/A        N/A        173        106      215   107
UPC Local Shared                          139         140        136         26         14       31    13
UPC Remote Shared (within SMP node)       130         129        136         26         13       30    13
UPC Remote Shared (beyond SMP node)       112         117        136         24         12       28    12
Aggregation and Overlapping of Remote Shared Memory Accesses

[Figures: execution time (sec) vs. number of threads/processes (up to 32) for UPC N-Queens and UPC Sobel Edge, each comparing UPC NO OPT. against UPC FULL OPT.; SGI O2000]

The benefit of hand-optimizations is greatly application dependent:
N-Queens does not perform any better, mainly because it is an embarrassingly parallel program
The Sobel Edge Detector does get a speedup of one order of magnitude after hand-optimizing, and scales almost perfectly linearly
Impact of Hand-Optimizations on NPB.CG

[Figure: computation time (sec) vs. number of processors (1-32) for UPC.O0, UPC.O1, UPC.O3, and GCC; Class A on SGI Origin 2000]
Shared Address Translation Overhead

Address translation overhead is quite significant: more than 70% of the work for a local-shared memory access
Demonstrates the real need for optimization

[Figure: a private memory access pays only the actual access time, while a local-shared access adds address-translation overhead on top of it: 144 ns of memory access time, 247 ns of address calculation, and 123 ns of UPC put/get function-call overhead (SGI Origin 2000, GCC-UPC)]
Shared Address Translation Overheads for Sobel Edge Detection

[Figure: execution time (sec) of UPC.O0 vs. UPC.O3 on 1, 2, 4, 8, and 16 processors, broken down into processing + memory access, address function call, and address calculation]

UPC.O0: unoptimized UPC code; UPC.O3: hand-optimized UPC code. Ox notations from:
T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 2001 International Conference on Parallel Processing, Valencia, September 2001
Reducing Address Translation Overheads via Translation Look-Aside Buffers

F. Cantonnet, T. El-Ghazawi, P. Lorenz, and J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers", IPDPS'05, Denver, CO, April 2005

Use look-up Memory Model Translation Buffers (MMTB) to perform fast translations
Two alternative methods are proposed to create and use MMTBs:
FT: a basic method using direct addressing
RT: an advanced method using indexed addressing
Prototyped as a compiler-enabled optimization; no modifications to actual UPC codes are needed
Different Strategies – Full-Table (FT)

shared int array[8];  /* distributed across 4 THREADS */

[Figure: array[0..7] is distributed round-robin, so array[0] and array[4] have affinity to TH0, array[1] and array[5] to TH1, array[2] and array[6] to TH2, and array[3] and array[7] to TH3. Every thread stores a full MMTB look-up table FT[0..7] holding the virtual address of each element, e.g. FT[0] = 57FF8040, FT[4] = 57FF8048, FT[1] = 5FFF8040, FT[5] = 5FFF8048, FT[2] = 67FF8040, FT[6] = 67FF8048, FT[3] = 6FFF8040, FT[7] = 6FFF8048]

Pros: direct mapping; no address calculation
Cons: large memory required; can lead to competition over caches and main memory

Consider shared [B] int array[8].
To initialize FT:  ∀i ∈ [0,7], FT[i] = _get_vaddr(&array[i])
To access array[]: ∀i ∈ [0,7], array[i] = _get_value_at(FT[i])
Different Strategies – Reduced-Table: Infinite Block Size

RT strategy: only one table entry in this case; the address calculation step is simple

[Figure: with BLOCKSIZE = infinite, array[0..3] all reside contiguously on THREAD0; each thread's MMTB holds the single entry RT[0]. Only the address of the first element needs to be saved, since all array data is contiguous]

Consider shared [] int array[4].
To initialize RT:  RT[0] = _get_vaddr(&array[0])
To access array[]: ∀i ∈ [0,3], array[i] = _get_value_at(RT[0] + i)
Different Strategies – Reduced-Table: Default Block Size

RT strategy: less memory required than FT, since the MMTB has THREADS entries; the address calculation step is slightly costly, but much cheaper than in current implementations

[Figure: with BLOCKSIZE = 1, array[0], array[4], array[8], array[12] reside on THREAD0; array[1], array[5], array[9], array[13] on THREAD1; array[2], array[6], array[10], array[14] on THREAD2; and array[3], array[7], array[11], array[15] on THREAD3. Each thread's MMTB RT[0..3] saves only the first address on each thread, since each thread's data is contiguous]

Consider shared [1] int array[16].
To initialize RT:  ∀i ∈ [0,THREADS-1], RT[i] = _get_vaddr(&array[i])
To access array[]: ∀i ∈ [0,15], array[i] = _get_value_at(RT[i mod THREADS] + (i / THREADS))
Different Strategies – Reduced-Table: Arbitrary Block Size

RT strategy: less memory required than for FT, but more than in the previous cases; the address calculation step is more costly than in the previous cases

[Figure: with block size 2, array[0], array[1], array[8], array[9] reside on THREAD0; array[2], array[3], array[10], array[11] on THREAD1; array[4], array[5], array[12], array[13] on THREAD2; and array[6], array[7], array[14], array[15] on THREAD3. Each thread's MMTB RT[0..7] saves only the first address of each block, since the data within a block is contiguous]

Consider shared [2] int array[16].
To initialize RT:  ∀i ∈ [0,7], RT[i] = _get_vaddr(&array[i * blocksize(array)])
To access array[]: ∀i ∈ [0,15], array[i] = _get_value_at(RT[i / blocksize(array)] + (i mod blocksize(array)))
Performance Impact of the MMTB – Sobel Edge

FT and RT perform around 6 to 8 times better than the regular basic UPC version (O0)
The RT strategy is slower than FT, since the address calculation (arbitrary block size case) becomes more complex
FT, on the other hand, performs almost as well as the hand-tuned versions (O3 and MPI)

[Figures: Sobel Edge (N=2048) execution time (sec) vs. number of threads (1-16), comparing O0, O0.FT, O0.RT, O3, and MPI, shown with and without O0]
Performance Impact of the MMTB – Matrix Multiplication

FT strategy: an increase in L1 data cache misses due to the large table size
RT strategy: L1 misses are kept low, but an increase in the number of loads and stores is observed, showing increased computation (arbitrary block size used)

[Figures: Matrix Multiplication (N=256): execution time (sec) vs. number of threads (1-16) for UPC.O0, UPC.O0.FT, UPC.O0.RT, UPC.O3, and MPI; and a per-variant hardware-profiling breakdown at 1, 2, 4, 8, and 16 threads showing computation time, L1 and L2 data cache misses, TLB misses, graduated loads and stores, and decoded branches]
Time and Storage Requirements of the Address Translation Methods for the Matrix Multiply Microkernel

The number of loads and stores can increase with arithmetic operators.

For a shared array of N elements with block size B (E: element size in bytes, P: pointer size in bytes):

            Storage per              # memory accesses per  # arithmetic operations per
            shared array             shared memory access   shared memory access
UPC.O0      E·N                      more than 25           more than 5
UPC.O0.FT   THREADS·P·N + E·N        1                      0
UPC.O0.RT   THREADS·P·(N/B) + E·N    1                      up to 3
UPC Work-sharing Construct Optimizations

By thread/index number (upc_forall integer):

    upc_forall(i=0; i<N; i++; i)
        loop body;

By the address of a shared variable (upc_forall address):

    upc_forall(i=0; i<N; i++; &shared_var[i])
        loop body;

By thread/index number (for optimized):

    for(i=MYTHREAD; i<N; i+=THREADS)
        loop body;

By thread/index number (for integer):

    for(i=0; i<N; i++) {
        if(MYTHREAD == i%THREADS)
            loop body;
    }

By the address of a shared variable (for address):

    for(i=0; i<N; i++) {
        if(upc_threadof(&shared_var[i]) ==
           MYTHREAD)
            loop body;
    }
Performance of Equivalent upc_forall and for Loops

[Figure: time (sec) vs. number of processors (1-16) for the upc_forall address, upc_forall integer, for address, for integer, and for optimized variants]
Performance Limitations Imposed by Sequential C Compilers – STREAM

NUMA (MB/s):
     BULK                            Element-by-Element
     memcpy  memset  Struct cp      Copy (arr)  Copy (ptr)  Set     Sum     Scale  Add    Triad
F    291.21  163.90  N/A            291.59      N/A         159.68  135.37  246.3  235.1  303.82
C    231.20  214.62  158.86         120.57      152.77      147.70  298.38  133.4  13.86  20.71

Vector (MB/s):
     BULK                            Element-by-Element
     memcpy  memset  Struct cp      Copy (arr)  Copy (ptr)  Set     Sum     Scale  Add    Triad
F    14423   11051   N/A            14407       N/A         11015   17837   14423  10715  16053
C    18850   5307    7882           7972        7969        10576   18260   7865   3874   5824
Loopmark – SET/ADD Operations

Let us compare the loopmarks for each F / C operation.
Fortran:

MEMSET (bulk set)
146. 1 t = mysecond(tflag)
147. 1 V M--<><> a(1:n) = 1.0d0
148. 1 t = mysecond(tflag) - t
149. 1 times(2,k) = t
SET
158. 1 arrsum = 2.0d0;
159. 1 t = mysecond(tflag)
160. 1 MV------< DO i = 1,n
161. 1 MV c(i) = arrsum
162. 1 MV arrsum = arrsum + 1
163. 1 MV------> END DO
164. 1 t = mysecond(tflag) - t
165. 1 times(4,k) = t
ADD
180. 1 t = mysecond(tflag)
181. 1 V M--<><> c(1:n) = a(1:n) + b(1:n)
182. 1 t = mysecond(tflag) - t
183. 1 times(7,k) = t
C:

MEMSET (bulk set)
163. 1 times[1][k] = mysecond_();
164. 1 memset(a, 1, NDIM*sizeof(elem_t));;
165. 1 times[1][k] = mysecond_() - times[1][k];
SET
217. 1 set = 2;
220. 1 times[5][k] = mysecond_();
222. 1 MV--< for (i=0; i<NDIM; i++)
223. 1 MV {
224. 1 MV c[i] = (set++);
225. 1 MV--> }
227. 1 times[5][k] = mysecond_() - times[5][k];
ADD
283. 1 times[10][k]= mysecond_();
285. 1 Vp--< for (j=0; j<NDIM; j++)
286. 1 Vp {
287. 1 Vp c[j] = a[j] + b[j];
288. 1 Vp--> }
290. 1 times[10][k] = mysecond_() - times[10][k];
Legend: V: Vectorized – M: Multistreamed – p: conditional, partial and/or computed
UPC vs. CAF Using the NPB Workloads

In general, UPC is slower than CAF, mainly due to:

Point-to-point vs. barrier synchronization
Better scalability is obtained with proper collective operations
Program writers can do point-to-point synchronization using current constructs

Scalar performance of source-to-source translated code
Alias analysis (C pointers): can highlight the need to explicitly use restrict to help several compiler backends
Lack of support for multi-dimensional arrays in C: can prevent high-level loop transformations and software pipelining, causing a 2x slowdown in SP for UPC
Need for exhaustive C compiler analysis: a failure to perform proper loop fusion and alignment in the critical section of MG can lead to 51% more loads for UPC than CAF; a failure to adequately unroll the sparse matrix-vector multiplication in CG can lead to more cycles in UPC
Conclusions

UPC is a locality-aware parallel programming language
With proper optimizations, UPC can outperform MPI in random short accesses and can otherwise perform as well as MPI
UPC is very productive, and UPC applications result in much smaller and more readable code than MPI
UPC compiler optimizations are still lagging, in spite of the fact that substantial progress has been made
For future architectures, UPC has the unique opportunity of very efficient implementations, as most of the pitfalls and obstacles have been revealed along with adequate solutions
Conclusions

In general, there are four types of optimizations:
Optimizations to exploit the locality consciousness and other unique features of UPC
Optimizations to keep the overhead of UPC low
Optimizations to exploit architectural features
Standard optimizations that are applicable to all system compilers
Conclusions

Optimizations are possible at three levels:
A source-to-source translator acting during the compilation phase and incorporating most UPC-specific optimizations
C backend compilers that compete with Fortran
A strong run-time system that can work effectively with the operating system
Selected Publications

T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick, UPC: Distributed Shared Memory Programming, John Wiley & Sons, New York, June 2005. ISBN 0-471-22048-5.
T. El-Ghazawi, F. Cantonnet, Y. Yao, S. Annareddy, and A. Mohamed, "Benchmarking Parallel Compilers for Distributed Shared Memory Languages: A UPC Case Study", Journal of Future Generation Computer Systems, North-Holland (accepted).
T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 30th IEEE International Conference on Parallel Processing (ICPP'01), 2001, pp. 365-372.
T. El-Ghazawi and F. Cantonnet, "UPC Performance and Potential: A NPB Experimental Study", Supercomputing 2002 (SC2002), Baltimore, November 2002.
F. Cantonnet, T. El-Ghazawi, P. Lorenz, and J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers", IPDPS'05, Denver, CO, April 2005.
Additional publications in CUG and PPoPP proceedings.