cpeg421-10-F/Topic-3-II-EARTH
Topic 2 -- II: Compilers and Runtime Technology: Optimization Under Fine-Grain Multithreading - The EARTH Model (in more detail)
Guang R. Gao, ACM Fellow and IEEE Fellow, Endowed Distinguished Professor, Electrical & Computer Engineering, University of Delaware
Note: How are loop-carried dependencies handled, and what are the implications for cross-core software pipelining?
[Figure: threads T1, T2, T3]
Main Features of EARTH
• Fast thread context switching *
• Efficient parallel function invocation
• Good support of fine-grain dynamic load balancing
• Efficient support of split-phase transactions and fibers *
* Features unique to the EARTH model in comparison to the Cilk model
Outline
• Overview
• Fine-grain multithreading
• Compiling for fine-grain multithreading
• The power of fine-grain synchronization - SSB
• The percolation model and its applications
• Summary
Compiling C for EARTH: Objectives
• Design simple high-level extensions for C that allow programmers to write programs that will run efficiently on multi-threaded architectures. (EARTH-C)
• Develop compiler techniques to automatically translate programs written in EARTH-C to multi-threaded programs. (EARTH-C, Threaded-C)
• Determine if EARTH-C + compiler can compete with hand-coded Threaded-C programs.
Summary of EARTH-C Extensions
• Explicit Parallelism
– Parallel versus sequential statement sequences
– Forall loops
• Locality Annotation
– Local versus remote memory references (global, local, replicate, …)
• Dynamic Load Balancing
– Basic versus remote function and invocation sites
Matrix Multiplication: Sequential Version
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        sum = 0;
        for (k = 0; k < N; k++)
            sum = sum + a[i][k] * b[k][j];
        c[i][j] = sum;
    }
The Inner Product Example
    BLKMOV_SYNC(a, row_a, N, slot_1);
    BLKMOV_SYNC(b, column_b, N, slot_1);
    sum = 0;
    END_THREAD();
THREAD-1:
    for (i = 0; i < N; i++)
        sum = sum + (row_a[i] * column_b[i]);
    DATA_RSYNC(sum, result, done);
    END_THREAD();
END_FUNCTION
[Figure: fiber diagram of inner, labeled a, b, result, and done, with sync slot values 0 0 and 2 2]
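The way slot_1 gates THREAD-1 in the inner product example can be imitated in plain C. The sketch below is a sequential simulation, not Threaded-C: the sync_slot struct, signal_slot, blkmov_sync, and the inner_frame layout are hypothetical names introduced here, and the count of 2 is an assumption based on the two block moves that signal the slot. Because the simulation runs each "asynchronous" copy immediately, sum is zeroed before the copies are issued rather than after.

```c
#include <string.h>

#define N 4

/* Hypothetical sync slot: a countdown plus the fiber it enables. */
typedef struct sync_slot {
    int count;                    /* signals still outstanding */
    int reset;                    /* count restored once the fiber fires */
    void (*fiber)(void *frame);   /* fiber to run when count reaches zero */
    void *frame;                  /* function frame shared by the fibers */
} sync_slot;

typedef struct {
    double row_a[N], column_b[N]; /* local copies of the operands */
    double sum;
    double result;                /* stands in for the remote result location */
    int done;                     /* stands in for the caller's done slot */
} inner_frame;

/* Deliver one signal; run the fiber when the last one arrives. */
static void signal_slot(sync_slot *s)
{
    if (--s->count == 0) {
        s->count = s->reset;
        s->fiber(s->frame);
    }
}

/* Models BLKMOV_SYNC: copy n elements, then signal the slot. */
static void blkmov_sync(const double *src, double *dst, int n, sync_slot *s)
{
    memcpy(dst, src, (size_t)n * sizeof *src);
    signal_slot(s);
}

/* Models THREAD-1: runs only after both block moves have signaled. */
static void thread_1(void *p)
{
    inner_frame *f = p;
    for (int i = 0; i < N; i++)
        f->sum += f->row_a[i] * f->column_b[i];
    f->result = f->sum;           /* models DATA_RSYNC(sum, result, done) */
    f->done = 1;
}

/* Models the first fiber: issue both copies, then end. */
double inner(const double *a, const double *b)
{
    inner_frame f = {0};          /* sum = 0 */
    sync_slot slot_1 = {2, 2, thread_1, &f};
    blkmov_sync(a, f.row_a, N, &slot_1);     /* count 2 -> 1 */
    blkmov_sync(b, f.column_b, N, &slot_1);  /* count 1 -> 0, fires thread_1 */
    return f.done ? f.result : 0.0;
}
```

Calling inner on two 4-element arrays returns their dot product. In a real EARTH runtime the copies would complete asynchronously and the enabled fiber would be queued for execution rather than called directly.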
EARTH-C to Threaded-C (Thread Generation)
Given a sequence of statements s1, s2, …, sn, we wish to create threads so as to:
– Maximize thread length (minimize thread switching overhead)
– Retain sufficient parallelism
– Issue remote memory requests as early as possible (prefetching)
– Compile split-phase remote memory operations and remote function calls correctly
An Example
int f(int *x, int i, int j)
{
    int a, b, sum, prod, fact;
    int r1, r2, r3;
    a = x[i];
    fact = 1;
    fact = fact * a;
    b = x[j];
    sum = a + b;
    prod = a * b;
    r1 = g(sum);
    r2 = g(prod);
    r3 = g(fact);
    return (r1 + r2 + r3);
}
Example Partitioned into Four Fibers
    a = x[i];
    fact = 1;
    fact = fact * a;
    b = x[j];
    sum = a + b;
    prod = a * b;
    r1 = g(sum);
    r2 = g(prod);
    r3 = g(fact);
    return (r1 + r2 + r3);
[Figure: the statements above divided, in program order, among Fiber-0, Fiber-1, Fiber-2, and Fiber-3]
Better Strategy Using List Scheduling
• Put each instruction in the earliest possible thread.
• Within a thread, the remote operations are executed as early as possible.
Build a Data Dependence Graph (DDG), and use a list scheduling strategy, where the selection of instructions is guided by Earliest Thread Number and Statement Type.
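A minimal sketch of this strategy, applied to the statements of f from the earlier example, might look as follows. It assumes that x[i], x[j], and the calls to g are the split-phase (remote) operations, and that a consumer of a remote result must sit in a later thread than its producer; the stmt table, the dependence lists, and the function names are hypothetical and were written by hand for this illustration.

```c
#define NSTMT 10
#define MAXDEP 3

typedef struct {
    const char *text;
    int remote;          /* 1 if assumed to be a split-phase operation */
    int ndeps;
    int deps[MAXDEP];    /* indices of statements this one depends on */
} stmt;

/* Hand-built DDG for f(); every dependence points to an earlier statement. */
static const stmt prog[NSTMT] = {
    {"a = x[i]",            1, 0, {0}},
    {"fact = 1",            0, 0, {0}},
    {"fact = fact * a",     0, 2, {0, 1}},
    {"b = x[j]",            1, 0, {0}},
    {"sum = a + b",         0, 2, {0, 3}},
    {"prod = a * b",        0, 2, {0, 3}},
    {"r1 = g(sum)",         1, 1, {4}},
    {"r2 = g(prod)",        1, 1, {5}},
    {"r3 = g(fact)",        1, 1, {2}},
    {"return r1 + r2 + r3", 0, 3, {6, 7, 8}},
};

/* Earliest thread number: one thread later than any remote producer. */
static void earliest_threads(int etn[NSTMT])
{
    for (int s = 0; s < NSTMT; s++) {
        etn[s] = 0;
        for (int k = 0; k < prog[s].ndeps; k++) {
            int d = prog[s].deps[k];
            int t = etn[d] + (prog[d].remote ? 1 : 0);
            if (t > etn[s])
                etn[s] = t;
        }
    }
}

/* List scheduling: within each thread, pick ready statements and
 * prefer remote operations. Fills order[] and returns the thread count. */
int list_schedule(int order[NSTMT], int etn[NSTMT])
{
    int done[NSTMT] = {0};
    int n = 0, nthreads = 0;

    earliest_threads(etn);
    for (int t = 0; n < NSTMT; t++) {
        nthreads = t + 1;
        for (;;) {
            int pick = -1;
            for (int s = 0; s < NSTMT; s++) {
                int ready = !done[s] && etn[s] == t;
                for (int k = 0; ready && k < prog[s].ndeps; k++)
                    ready = done[prog[s].deps[k]];
                if (ready && (pick < 0 || (prog[s].remote && !prog[pick].remote)))
                    pick = s;
            }
            if (pick < 0)
                break;               /* nothing left for thread t */
            done[pick] = 1;
            order[n++] = pick;
        }
    }
    return nthreads;
}
```

Under these assumptions the two remote loads are hoisted to the front of the first thread, and the function needs three threads. A production compiler would derive the dependence graph and the remote/local classification from its own analyses instead of a fixed table.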
• Associate a “state” with each memory location (fine granularity). Fine-grain synchronization on that location is realized through transitions of this state.
I-Structure state transition [ArvindEtAl89 @ TOPLAS]
[Figure: state diagram with states Empty, Full, and Deferred; transitions labeled read, write, and reset]
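The three I-structure states can be sketched as a small C data type. This is a single-threaded model that follows the usual I-structure rules (a read of an empty cell is deferred, a write fills the cell and satisfies deferred reads, a second write is an error); the icell type, the fixed-size waiting list, and the return codes are assumptions made for this illustration.

```c
#include <stddef.h>

#define MAX_DEFERRED 8

typedef enum { EMPTY, DEFERRED, FULL } istate;

typedef struct {
    istate state;
    int value;
    int *waiting[MAX_DEFERRED];   /* destinations of deferred reads */
    size_t nwaiting;
} icell;

/* reset: any state -> Empty */
void icell_reset(icell *c)
{
    c->state = EMPTY;
    c->nwaiting = 0;
}

/* read: Full -> Full (value delivered at once);
 * Empty or Deferred -> Deferred (request queued).
 * Returns 1 if delivered, 0 if deferred, -1 if the queue is full. */
int icell_read(icell *c, int *dest)
{
    if (c->state == FULL) {
        *dest = c->value;
        return 1;
    }
    if (c->nwaiting == MAX_DEFERRED)
        return -1;
    c->waiting[c->nwaiting++] = dest;
    c->state = DEFERRED;
    return 0;
}

/* write: Empty or Deferred -> Full, satisfying every deferred read.
 * Returns 0 on success, -1 on a write to a Full cell. */
int icell_write(icell *c, int v)
{
    if (c->state == FULL)
        return -1;
    c->value = v;
    for (size_t k = 0; k < c->nwaiting; k++)
        *c->waiting[k] = v;
    c->nwaiting = 0;
    c->state = FULL;
    return 0;
}
```

A reader that arrives before the writer gets 0 from icell_read and finds its destination filled once icell_write runs. A hardware or runtime implementation would resume the waiting thread at that point instead of only storing the value.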
With Memory Based Fine-Grain Sync
• A single atomic operation completes a synchronized write/read directly in memory
• No need to implement synchronization with other resources, e.g., shared memory.
For computation with more complicated data dependencies, memory-based fine-grain synchronization is more effective and efficient. [ArvindEtAl89 @ TOPLAS]
A Question!
Key Observation:
Solution:
What is SSB?
• A small hardware buffer attached to the memory controller of each memory bank.
• Records and manages the states of actively synchronized data units.
• Hardware Cost
– Each SSB is a small look-up table: easy to implement
– Each SSB is independent: hardware cost increases only linearly with the number of memory banks
SSB on Many-Core (IBM C64)
IBM Cyclops-64, Designed by Monty Denneau.
SSB Synchronization Functionalities
Data Synchronization: Enforce RAW data dependencies• Support word-level
– Two single-writer-single-reader (SWSR) modes– One single-writer-multiple-reader (SWMR) mode
Fine-Grain Locking: Enforce mutual exclusion• Support word-level
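The idea of a small table that holds state only for words under active synchronization can be illustrated with a software model of word-level locking. This is a single-threaded sketch, not the SSB hardware protocol: the table size, the ssb and ssb_entry types, and the try-lock/unlock interface are all assumptions introduced here.

```c
#include <stddef.h>

#define SSB_ENTRIES 4

typedef struct {
    const void *addr;   /* word being locked */
    int owner;          /* thread id holding the lock */
    int valid;          /* 1 while the entry is in use */
} ssb_entry;

typedef struct {
    ssb_entry e[SSB_ENTRIES];
} ssb;

void ssb_init(ssb *b)
{
    for (size_t k = 0; k < SSB_ENTRIES; k++)
        b->e[k].valid = 0;
}

static ssb_entry *ssb_find(ssb *b, const void *addr)
{
    for (size_t k = 0; k < SSB_ENTRIES; k++)
        if (b->e[k].valid && b->e[k].addr == addr)
            return &b->e[k];
    return NULL;
}

/* Returns 1 if the lock was acquired, 0 if another thread holds it,
 * -1 if the table has no free entry. */
int ssb_try_lock(ssb *b, const void *addr, int tid)
{
    if (ssb_find(b, addr))
        return 0;
    for (size_t k = 0; k < SSB_ENTRIES; k++) {
        if (!b->e[k].valid) {
            b->e[k].addr = addr;
            b->e[k].owner = tid;
            b->e[k].valid = 1;
            return 1;
        }
    }
    return -1;
}

/* Returns 1 if released, 0 if tid does not hold a lock on addr.
 * Releasing frees the entry, so idle words cost no table space. */
int ssb_unlock(ssb *b, const void *addr, int tid)
{
    ssb_entry *e = ssb_find(b, addr);
    if (!e || e->owner != tid)
        return 0;
    e->valid = 0;
    return 1;
}
```

The point of the design is that table size tracks the number of words being synchronized at one moment, not the size of memory. What the real hardware does when the buffer overflows is outside the scope of this sketch.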
• What is percolation? Dynamic, adaptive computation/data movement, migration, and transformation, in place or on the fly, to keep system resources usefully busy
• Features of percolation
– Both data and threads may percolate
– Computation reorganization and data layout reorganization
– Asynchronous invocation
An Example of percolation—Cannon’s Algorithm
[Figure: memory hierarchy of HTMT-like architectures, with data percolating between levels. Level 0: fast CPU; Level 1: PIM; Level 2: PIM; Level 3]
Cannon’s nearest-neighbor data transfer
Data layout reorganization during percolation
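For readers unfamiliar with Cannon's algorithm, the nearest-neighbor transfer pattern can be shown with a sequential C simulation. The sketch assumes a P x P grid in which each node holds a single matrix element (real implementations hold sub-blocks), and it models each neighbor-to-neighbor transfer as a cyclic shift of a row or column. The function names and the grid size are illustrative choices, not taken from the slides.

```c
#define P 3

/* Cyclically shift row i of m to the left by `by` positions. */
static void shift_row_left(double m[P][P], int i, int by)
{
    double tmp[P];
    for (int j = 0; j < P; j++)
        tmp[j] = m[i][(j + by) % P];
    for (int j = 0; j < P; j++)
        m[i][j] = tmp[j];
}

/* Cyclically shift column j of m upward by `by` positions. */
static void shift_col_up(double m[P][P], int j, int by)
{
    double tmp[P];
    for (int i = 0; i < P; i++)
        tmp[i] = m[(i + by) % P][j];
    for (int i = 0; i < P; i++)
        m[i][j] = tmp[i];
}

/* c = a_in * b_in, computed with Cannon's shift-multiply-accumulate steps. */
void cannon(double a_in[P][P], double b_in[P][P], double c[P][P])
{
    double a[P][P], b[P][P];

    for (int i = 0; i < P; i++)
        for (int j = 0; j < P; j++) {
            a[i][j] = a_in[i][j];
            b[i][j] = b_in[i][j];
            c[i][j] = 0.0;
        }

    /* Initial alignment: skew row i of A by i, column j of B by j. */
    for (int i = 0; i < P; i++)
        shift_row_left(a, i, i);
    for (int j = 0; j < P; j++)
        shift_col_up(b, j, j);

    for (int step = 0; step < P; step++) {
        /* Every node multiplies the pair of elements it currently holds. */
        for (int i = 0; i < P; i++)
            for (int j = 0; j < P; j++)
                c[i][j] += a[i][j] * b[i][j];
        /* Nearest-neighbor transfer: A moves one left, B moves one up. */
        for (int i = 0; i < P; i++)
            shift_row_left(a, i, 1);
        for (int j = 0; j < P; j++)
            shift_col_up(b, j, 1);
    }
}
```

The initial skew is a data layout reorganization of the kind listed above, and each later step moves data only between adjacent nodes.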
Performance of SSCA2 Kernel 4

#threads    C64         SMPs       MTA2
4           2917082     5369740    752256
8           5513257     2141457    619357
16          9799661     915617     488894
32          17349325    362390     482681

Metric: TEPS (Traversed Edges Per Second)
SMPs: 4-way Xeon dual-core, 2MB L2 cache
• Reasonable scalability
– Scales well with # threads
– Linear speedup for #threads < 32
• Commodity SMPs have poor performance
• Competitive vs. MTA-2
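To check the scalability claims against the table, speedup and parallel efficiency can be computed directly from the C64 column. The sketch below copies the four C64 TEPS values from the table and assumes higher TEPS is better; the array and function names are illustrative.

```c
/* Thread counts and C64 TEPS values copied from the table. */
static const int threads[] = {4, 8, 16, 32};
static const double c64_teps[] = {2917082, 5513257, 9799661, 17349325};

/* Speedup of run `to` over run `from` (indices into the arrays above). */
double speedup(int from, int to)
{
    return c64_teps[to] / c64_teps[from];
}

/* Parallel efficiency: speedup divided by the growth in thread count. */
double efficiency(int from, int to)
{
    return speedup(from, to) * threads[from] / threads[to];
}
```

For example, speedup(0, 3) compares the 32-thread run with the 4-thread run, and efficiency(0, 3) normalizes that ratio by the eightfold increase in threads.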