VTT
1

[Figure: Two multiprocessor organizations. Top: processors P1...Pp with low-overhead multi-threading and local memories L1...Lp, connected to distributed memories M1...Mp over a high-bandwidth synchronous network, under a common clock or independent clocks. Bottom: single-threaded cores P1...Pp with coherent caches C1...Cp, connected to distributed memories M1...Mp over an asynchronous cache coherence/memory network, under a common clock or independent clocks.]

Massively parallel computing—Heterogeneous/homogeneous and programming applications

Martti Forsell, Chief Research Scientist
VTT—Technical Research Center of Finland

Communication Platforms Annual Seminar 2009
October 6, 2009, VTT, Oulu, Finland

Focus on MP-CMPs
2
Massively parallel computing—Heterogeneous/homogeneous and programming applications
Martti Forsell, VTT, Finland

Abstract: So far, massively parallel computing has been successfully applied to scientific, database, and data center applications, as well as to computer-aided design and manufacturing. Now the major processor manufacturers and SOC designers are also projecting to increase the number of processor cores per chip beyond one thousand in 10 to 15 years, yielding massively parallel computing on a chip. In the same time frame we are also likely to see a new milestone: chips with more than a (US) trillion (10^12) transistors, called tera-device chips.

In this presentation we take an early look at the architectural and methodological challenges that massively parallel computing on a chip may face. In particular, we analyze the strengths and weaknesses of heterogeneous and homogeneous approaches to tera-device chips and their programmability. It turns out that radically new solutions are needed for both architectures and programming methodologies before these chips can become as efficient and useful as sequential computers are nowadays. The biggest challenges are related to implementing an easy-to-use computational model that provides performance and power efficiency and portability of applications for such machines.
3
Parallelism—a resource for efficient computation
Forget sequential computing!

• Almost all computational problems are parallel by their nature
• It is unnecessary and even harmful to solve them with a sequential algorithm and then "parallelize" it
  - extra work
  - optimization is hard due to artificial dependencies
  - do not count on so-called "automatic parallelization"
• Sequential solutions execute slowly and with low utilization on massively parallel machines

Parallelism should be seen as a resource for efficient computation!
[Figure: A sequential loop vs. the (ideal) parallel version. Loop overhead per iteration: updating two pointers plus a compare and branch; this can be eliminated with loop unrolling or architectural ILP support. In the parallel version, no looping is needed.]
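The loop-overhead point can be made concrete with a small sketch (Python used purely for illustration; the comprehension stands in for the one-thread-per-element step of an ideal parallel machine):

```python
def vector_add_sequential(a, b):
    """Sequential version: one element per iteration, plus loop
    overhead (index update, bound compare, branch) on every step."""
    result = [0] * len(a)
    i = 0
    while i < len(a):            # compare and branch each iteration
        result[i] = a[i] + b[i]
        i += 1                   # pointer/index update
    return result


def vector_add_parallel(a, b):
    """PRAM-style formulation: logically, thread i computes
    result[i] = a[i] + b[i] in a single step, no looping needed.
    Simulated here with a comprehension standing in for the threads."""
    return [ai + bi for ai, bi in zip(a, b)]
```

Both produce the same result; only the sequential one pays the per-iteration control overhead the figure refers to.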
4
Data Center Total Concurrency

[Figure: Total concurrency (log scale, 1.E+03 to 1.E+10) versus time (1996–2020) for the Top 10 systems, the Top System, and the Top 1 trend, together with Historical data and Exa Strawman, Evolutionary Light Node, and Evolutionary Heavy Node projections. Milestones marked at thousand-way, million-way, and billion-way concurrency. Courtesy of Peter Kogge, UND, and Thomas Sterling, LSU.]

Where are we now?
Data Center Performance Projections

[Figure: Performance in GFlops (log scale, 1.E+04 to 1.E+10) versus time (2004–2020) for Heavyweight and Lightweight system projections, reaching the Exascale level around 2020. Courtesy of Peter Kogge, UND, and Thomas Sterling, LSU.]

But not at 20 MW!
[Trade-offs: homogeneity offers programmability, flexibility, and replication; heterogeneity offers optimality for fixed applications and reuse of existing IP blocks.]

Heterogeneity:
+ Better performance/area and performance/power for matching applications
+ Natural partitioning for matching applications
- Partitioning problems with non-matching applications
- Programmability/methodological complexity: APP1+APP2+...+APPN = N x methodology!
- More than N x risk of architectural changes requiring partial software rewrite
- Inflexibility, waste of area (and power) for non-matching applications
- Reuse-based design complexity reduction may not be as efficient as replication-based
10
Performance scalability of current approaches

With current approaches, some well-behaving cases in certain domains (DSP) scale well as the number of cores per chip increases.

Unfortunately, there seems to be a huge gap between the current approaches and an ideal parallel machine in demanding applications.

Are we doomed to the fate of diminishing returns as the number of cores per chip increases (as happened with ILP in the early 2000s)?
[Figure: Execution time (clock cycles, log scale, 1.00E+06 to 1.00E+13) as a function of the number of cores (4 to 65536) for Ideal PRAM, CC-NUMA, NUMA, SMP, VC, and MP, shown with and without P-core PRAM+ILP chaining. A huge gap separates the current approaches from the ideal. Early estimations, not verified results!]

Performance as a function of the number of processors P in demanding applications; Fs=0.1, Fdp=0.45, Pcm=0.1, Pcsd=0.5, Pg=0.2, Tb=10^6, F=4, Fnv=0.2; preliminary results from [Forsell10a].
11
Programmability vs. current trends

[Figure: Design points plotted by programmability (easy to difficult) along two axes: synchronicity (synchronous vs. asynchronous) and complexity/abstraction/heterogeneity (simplicity/high abstraction/homogeneity vs. complexity/low abstraction/heterogeneity). Points include: μcontroller + SRAM, μprocessor + memory hierarchy, homogeneous MP-SOC, heterogeneous NOC, PRAM, BSP, vector, clustered vector, vector–MPP hybrid, upscaled application-specific tera-device, and upscaled general-purpose tera-device.]
12
Application programming

[Figure: Single-threaded cores P1...Pp with coherent caches C1...Cp and distributed memories M1...Mp, connected by an asynchronous cache coherence/memory network, under a common clock or independent clocks.]
The main obstacle preventing parallel computing from becoming the mainstream of application development is

• Poor and error-prone programmability of current parallel architectures

Other problems are

• Architecture dependency
• Architectural inefficiency of key primitives of parallel computing
• The huge existing base of sequential code

With the advent of massively parallel computing on a chip, these problems are getting even worse!
13
Secrets of the success of the sequential computing paradigm—The right type of abstraction: RAM

• Captures the essential properties of the underlying machine
• Good portability of programs between machines with substantially different properties
• Idealized properties of the computational model, e.g.
  - access time = 1 cycle
  - sequential operation
  can be emulated well enough even with a speculative superscalar architecture with virtual memory
• Easy to learn and use
• A theory of algorithms exists and helps programming & analysis
• Efficient teaching (all universities teach it)
[Figure: The RAM model: a processor P executing instructions sequentially against a memory M under a common clock, with latency = 1.]
14
Secrets of the success of the sequential computing paradigm—Synchronization, techniques, locality

Simple concept of synchronization
• Subtasks of originally parallel computational problems are executed synchronously due to deterministically sequential execution at the program level
• Looping is used to process multiple data elements
• Multitasking is used to emulate parallelism (this introduces problems: atomic operations, locking, transactions, etc. are needed)
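The kind of machinery the last bullet alludes to can be sketched in a few lines (an illustrative Python example; the counter and helper names are invented for the sketch): as soon as threads emulate parallelism over shared state, the read-modify-write must be made atomic with a lock.

```python
import threading


def parallel_count(n_threads, n_increments):
    """Increment a shared counter from several threads. The lock makes
    the read-modify-write atomic; without it, concurrent updates could
    race and increments would be lost."""
    counter = {"value": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(n_increments):
            with lock:                 # atomic section
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

With the lock, the result is deterministic: n_threads * n_increments.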
Flexible performance enhancement techniques
• Low-level parallelism (static or dynamic superscalar execution)
• Speculation (out-of-order execution, data speculation, etc.)
• Locality exploitation (caching, buffering, interleaving)

A trivial concept of locality that is in line with memory hierarchies
[Figure: The RAM model: processor P, memory M, common clock, latency = 1.]
15
Problems of the current parallel computing approach—Abstraction

An inefficient, too low, and inappropriate abstraction of the underlying machine
• A programmer needs to take care of mapping, partitioning, synchronization (in most models) and even low-level communication (in message passing)
• Portability between machines with different properties is often limited (rewriting of software is too often needed if the machine changes or is updated)
• Quite difficult to learn and use, error-prone
• The generality/usefulness of the theory of parallel algorithms is limited due to architecture dependency
• Teaching at the elementary level is virtually nonexistent (advanced-level courses do exist in universities)
16
Problems of the current parallel computing approach—Synchronization and techniques

Synchronization
• Asynchronous operation (missing low-level synchronicity)
• High-level synchronization is very expensive (barriers typically take hundreds of cycles)
• This greatly limits applicability and rules out fine-grained parallel algorithms

Performance enhancement techniques that are not particularly innovative, flexible, suitable, or well linked (mostly copied from sequential computing)
• Low-level parallelism (superscalar execution, not well linked to TLP execution; complex compiling)
• Speculation (multithreading)
• Locality exploitation (NUMA, coherent caching)
• Inefficient implementation of, or missing support for, important primitives of parallel computing, e.g. exploitation of virtual ILP, concurrent memory access, multioperations
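To make the multioperation primitive concrete, here is an illustrative simulation (plain Python, not an actual architecture interface): all references to the same memory location issued in one step are combined in the memory system, so each location is written once per step instead of the accesses being serialized through locks or a coherence protocol.

```python
from collections import defaultdict


def multioperation_add(memory, requests):
    """Simulate one multioperation step. requests is a list of
    (address, value) pairs issued in the same step; contributions to
    the same address are combined first (the combining network's job),
    so each location receives exactly one write per step."""
    combined = defaultdict(int)
    for addr, value in requests:
        combined[addr] += value
    for addr, total in combined.items():
        memory[addr] = memory.get(addr, 0) + total
    return memory
```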
17
Problems of the current parallel computing approach—Locality

Most experts agree that locality is needed to perform parallel computation efficiently.

Locality = an area/neighborhood in which a certain special technique is possible

Problems:
- The concept of locality is too narrowly understood as locality to a single processor
- There is no way to maximize data locality with respect to processor cores in the general case

In fact, there are different kinds of (or a hierarchy of) localities, including e.g.:
- VLSI locality (range of a wire in one clock cycle)
- Synchronous area in VLSI design (beyond which one must use the GALS paradigm)
- Chip locality (everything can be fitted onto a single chip)
- Processor core locality (locality to a processor core)
- Synchronous shared memory locality (beyond which one must use asynchronous/hybrid shared memory)
- Machine locality (everything can be put onto a single machine)
- Grid locality (everything fits into a grid)
18
VLSI locality vs. chip locality

[Figure: Range of a wire in one clock cycle across technology generations. Source: Saman Amarasinghe, MIT, 2007.]

Interestingly, these localities also tend to change both absolutely and with respect to each other (sometimes quite dramatically) as time goes on.
• Proper TLP synchronization (exploit the synchronous shared memory locality)
• Correct abstraction (e.g. a synchronous shared memory machine)
• New kinds of performance enhancement techniques (not those borrowed from sequential computing, but e.g. chaining, concurrent memory access, multioperations)
• Algorithmics for fine-grained synchronous TLP (existing PRAM algorithmics fits well)
• Teaching, further research, etc.
20
Main idea—Synchronous Shared Memory (SSM)

Instead of providing asynchronous operation and hard-to-solve cache coherence problems:

- Drop the caches next to processors! Use throughput computing (parallel slackness) to hide the latency of the memory system
- Synchronize the steps of execution by using a synchronization wave technique [Ranade91]
- Decrease the probability of congestion by randomizing traffic via hashing of memory locations
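The randomizing hash can be sketched in a couple of lines (illustrative Python; this multiplicative hash is a stand-in, not the hash family actually used in [Ranade91] or Total Eclipse): consecutive addresses are scattered over the memory modules, so bursts of accesses to neighboring locations do not pile up on one module.

```python
def module_of(address, log2_p, a=0x9E3779B1):
    """Map an address to one of 2**log2_p memory modules using a
    32-bit multiplicative (Fibonacci-style) hash. Nearby addresses
    land on different modules, lowering the chance of congestion."""
    h = (address * a) & 0xFFFFFFFF   # multiply, keep the low 32 bits
    return h >> (32 - log2_p)        # top bits select the module
```

For example, with 8 modules (log2_p = 3), addresses 0, 1, 2, 3 land on four different modules rather than filling module 0 first.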
[Figure: The two organizations again. SSM: multi-threaded processors P1...Pp with local memories L1...Lp and distributed memories M1...Mp on a high-bandwidth synchronous network. Conventional CMP: single-threaded cores with coherent caches C1...Cp and distributed memories M1...Mp on an asynchronous cache coherence/memory network.]
21
Our approach [Forsell09]

[Figure: Three views of the approach. Idealized programmer's view: PRAM, with processors P1...P4 issuing read/write operations to a word-wise accessible shared memory under a common clock. Computational model: SSM, with multi-threaded processors, local memories, and distributed memory modules on a high-bandwidth synchronous network, under a common clock or independent clocks. Architectural realization: Total Eclipse, a grid of switches (S) connecting processor (P), instruction memory (I), data memory (M), local memory (L), and I/O resources.]
22
SSM—High-bandwidth network

Goal: sufficient throughput and low enough latency for random communication patterns (with high probability); see e.g. [Forsell05]

• A variant of the two-dimensional mesh: an Mc-way, double acyclic two-dimensional sparse mesh, or multi-mesh
• To maximize throughput for read-intensive code, a separate network for reply messages
• A simple XY-routing algorithm
• Deadlocks are not possible
• Fast; lower-power alternatives exist
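A minimal sketch of the XY-routing idea on a 2D mesh (illustrative code, not the actual Eclipse router): a packet first travels along the X dimension until its column matches the destination, then along Y. Because every packet fully resolves X before Y, the channel dependency graph is acyclic, which is why deadlock cannot occur.

```python
def xy_route(src, dst):
    """Return the hop-by-hop path from src to dst on a 2D mesh,
    correcting the X coordinate first, then Y (dimension-order routing)."""
    (x, y), (dst_x, dst_y) = src, dst
    path = [(x, y)]
    while x != dst_x:                 # phase 1: move along X only
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:                 # phase 2: move along Y only
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path
```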
[Figure: The multi-mesh network: processor/memory nodes with separate request and reply network interfaces (PM rni, M rni), connected by a grid of switches (S).]
23
Potential of SSMs in demanding applications

According to our investigations, with SSM it is possible to get close to the ideal PRAM even in demanding applications.

As for all new academic architectures, a number of questions remain open, related to

• practical off-chip memory bandwidth
• virtual memory
• implementation efficiency
• power and area concerns
[Figure: The same execution time vs. number of cores comparison as before (Ideal PRAM, CC-NUMA, NUMA, SMP, VC, MP, and P-core PRAM+ILP chaining), with the SSM potential marked close to the ideal PRAM curve. Early estimations, not verified results!]

Performance as a function of the number of processors P in demanding applications; Fs=0.1, Fdp=0.45, Pcm=0.1, Pcsd=0.5, Pg=0.2, Tb=10^6, F=4, Fnv=0.2; preliminary results from [Forsell10a].
24
Two-level/hybrid model

[Figure: A chip partitioned into maximum-sized SSM blocks, each within the SSM locality.]

If the size of forthcoming tera-device chips turns out to exceed the SSM locality, then one may need to use a two-level/hybrid model providing natural support for the blocking used in many parallel algorithms.

A machine for such a model would consist of maximum-sized SSM blocks that are interconnected and support e.g. the BSP model [Valiant90] for inter-SSM-block computation.

This kind of compromise, however, would make programming parallel algorithms without natural blocking more difficult.
25
Programming perspectives for the two-level model [Forsell10b]

If the computation fits in a block, do it with the PRAM model; otherwise, use a blocking algorithm and organize the block-wise computation into supersteps so that:

• During a superstep, computation is local to blocks (the PRAM model is available), but they may use local copies of shared variables.
• Between supersteps, the shared variables are updated accordingly and the execution of blocks is synchronized.
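The superstep discipline above can be sketched as follows (an illustrative Python skeleton, not the actual interface of [Forsell10b]): within a superstep each block computes on a private copy of the shared variables, and only at the barrier are the updates merged and made visible to all blocks.

```python
def run_supersteps(blocks, shared, steps):
    """blocks: list of functions, each taking a local copy of the
    shared state and returning its updates. Within a superstep each
    block sees only its private snapshot (PRAM-style computation
    inside the block); updates become globally visible at the barrier."""
    for _ in range(steps):
        # superstep: every block computes on its own snapshot
        updates = [block(dict(shared)) for block in blocks]
        # barrier: synchronize and merge updates into the shared state
        for u in updates:
            shared.update(u)
    return shared
```

For instance, two blocks each incrementing their own counter for three supersteps leave both counters at 3.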
Conclusions

Massively parallel computing is expected to reach single-chip processing engines in 10 to 15 years, with designs integrating more than 1000 processor cores.

This will have a tremendous impact on architectures and programming in order to guarantee

• ease of use/programming
• portability of code
• performance and power efficiency for general-purpose applications

Our insight points to a homogeneous approach that addresses latency hiding and multiple levels of parallelism, and provides a good abstraction of the underlying machine with explicit parallelism, synchrony, and conceptual simplicity, i.e. the SSM being investigated at VTT.

If the size of tera-device chips turns out to exceed the SSM locality, one needs to use a two-level/hybrid model providing natural support for blocking algorithms.
28
References

[Forsell02a] M. Forsell, Architectural differences of efficient sequential and parallel computers, Journal of Systems Architecture 47, 13 (July 2002), 1017-1041.
[Forsell02b] M. Forsell, A Scalable High-Performance Computing Solution for Network on Chips, IEEE Micro 22, 5 (September-October 2002), 46-55.
[Forsell05] M. Forsell, V. Leppänen, High-Bandwidth on-chip Communication Architecture for General Purpose Computing, In the Proceedings of the 9th World Multiconference on Systemics, Cybernetics and Informatics (WMSCI 2005), Volume IV, July 10-13, 2005, Orlando, USA, 1-6.
[Forsell09] M. Forsell, TOTAL ECLIPSE—An Efficient Architectural Realization of the Parallel Random Access Machine, to appear in "Parallel and Distributed Computing", IN-TECH, Vienna, 2009. (ISBN 978-3-902613-45-5)
[Forsell10a] M. Forsell and J. Träff, Modeling the relative Efficiency of certain Parallel Computing Architectures, manuscript, to be submitted to ACM Transactions on Architecture and Code Optimization, 2010.
[Forsell10b] M. Forsell and C. Kessler, Ultimately scalable general purpose computing platform, under preparation, 2010.
[ITRS07] International Technology Roadmap for Semiconductors, Semiconductor Industry Association, 2007; http://www.itrs.net.
[Ranade91] A. Ranade, How to Emulate Shared Memory, Journal of Computer and System Sciences 42 (1991), 307-326.
[Valiant90] L. G. Valiant, A Bridging Model for Parallel Computation, Communications of the ACM 33, 8 (1990), 103-111.