VTT
1

[Figure: Two multiprocessor organizations. Top: processors P1...Pp with low-overhead multi-threading and local memories L1...Lp, connected to distributed memories M1...Mp over a high-bandwidth synchronous network, under a common clock or independent clocks. Bottom: single-threaded cores P1...Pp with coherent caches C1...Cp, connected to distributed memories M1...Mp over an asynchronous cache coherence/memory network, under a common clock or independent clocks.]

Massively parallel computing—Heterogeneous/homogeneous and programming applications

Martti Forsell, Chief Research Scientist
VTT—Technical Research Center of Finland

Communication Platforms Annual Seminar 2009
October 6, 2009, VTT, Oulu, Finland

Focus on MP-CMPs
2
Massively parallel computing—Heterogeneous/homogeneous and programming applications
Martti Forsell, VTT, Finland

Abstract: So far, massively parallel computing has been successfully applied to scientific, database, and data center applications, as well as to computer-aided design and manufacturing. Now the major processor manufacturers and SOC designers are also projecting to increase the number of processor cores per chip beyond one thousand in 10 to 15 years, yielding massively parallel computing on a chip. In the same time frame we are also likely to see a new milestone: chips with more than a (US) trillion (10^12) transistors, called tera-device chips.

In this presentation we take an early look at the architectural and methodological challenges that massively parallel computing on a chip may face. In particular, we analyze the strengths and weaknesses of heterogeneous and homogeneous approaches to tera-device chips and their programmability. It turns out that radically new solutions are needed for both architectures and programming methodologies before these chips can become as efficient and useful as sequential computers are nowadays. The biggest challenges are related to implementing an easy-to-use computational model that provides performance and power efficiency and portability of applications for such machines.
3
Parallelism—a resource for efficient computation
Forget sequential computing!

• Almost all computational problems are parallel by their nature
• It is unnecessary and even harmful to solve them with a sequential algorithm and then "parallelize" it
  - extra work
  - optimization is hard due to artificial dependencies
  - do not count on so-called "automatic parallelization"
• Sequential solutions execute slowly and with low utilization on massively parallel machines

Parallelism should be seen as a resource for efficient computation!
[Figure: A sequential loop vs. the (ideal) parallel version. Loop overhead per iteration: updating two pointers plus a compare and branch; this can be eliminated with loop unrolling or architectural ILP support. In the parallel version, no looping is needed.]
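The loop-overhead point can be made concrete with a small sketch (Python used purely for illustration; the comprehension stands in for the one-thread-per-element step of an ideal parallel machine):

```python
def vector_add_sequential(a, b):
    """Sequential version: one element per iteration, plus loop
    overhead (index update, bound compare, branch) on every step."""
    result = [0] * len(a)
    i = 0
    while i < len(a):            # compare and branch each iteration
        result[i] = a[i] + b[i]
        i += 1                   # pointer/index update
    return result


def vector_add_parallel(a, b):
    """PRAM-style formulation: logically, thread i computes
    result[i] = a[i] + b[i] in a single step, no looping needed.
    Simulated here with a comprehension standing in for the threads."""
    return [ai + bi for ai, bi in zip(a, b)]
```

Both produce the same result; only the sequential one pays the per-iteration control overhead the figure refers to.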
4
Data Center Total Concurrency

[Figure: Total concurrency (log scale, 1.E+03 to 1.E+10) versus time (1996–2020) for the Top 10 systems, the Top System, and the Top 1 trend, together with Historical data and Exa Strawman, Evolutionary Light Node, and Evolutionary Heavy Node projections. Milestones marked at thousand-way, million-way, and billion-way concurrency. Courtesy of Peter Kogge, UND, and Thomas Sterling, LSU.]

Where are we now?
Data Center Performance Projections

[Figure: Performance in GFlops (log scale, 1.E+04 to 1.E+10) versus time (2004–2020) for Heavyweight and Lightweight system projections, reaching the Exascale level around 2020. Courtesy of Peter Kogge, UND, and Thomas Sterling, LSU.]

But not at 20 MW!
[Trade-offs: homogeneity offers programmability, flexibility, and replication; heterogeneity offers optimality for fixed applications and reuse of existing IP blocks.]

Heterogeneity:
+ Better performance/area and performance/power for matching applications
+ Natural partitioning for matching applications
- Partitioning problems with non-matching applications
- Programmability/methodological complexity: APP1+APP2+...+APPN = N x methodology!
- More than N x risk of architectural changes requiring partial software rewrite
- Inflexibility, waste of area (and power) for non-matching applications
- Reuse-based design complexity reduction may not be as efficient as replication-based
10
Performance scalability of current approaches

With current approaches, some well-behaving cases in certain domains (DSP) scale well as the number of cores per chip increases.

Unfortunately, there seems to be a huge gap between the current approaches and an ideal parallel machine in demanding applications.

Are we doomed to the fate of diminishing returns as the number of cores per chip increases (as happened with ILP in the early 2000s)?
[Figure: Execution time (clock cycles, log scale, 1.00E+06 to 1.00E+13) as a function of the number of cores (4 to 65536) for Ideal PRAM, CC-NUMA, NUMA, SMP, VC, and MP, shown with and without P-core PRAM+ILP chaining. A huge gap separates the current approaches from the ideal. Early estimations, not verified results!]

Performance as a function of the number of processors P in demanding applications; Fs=0.1, Fdp=0.45, Pcm=0.1, Pcsd=0.5, Pg=0.2, Tb=10^6, F=4, Fnv=0.2; preliminary results from [Forsell10a].
11
Programmability vs. current trends

[Figure: Design points plotted by programmability (easy to difficult) along two axes: synchronicity (synchronous vs. asynchronous) and complexity/abstraction/heterogeneity (simplicity/high abstraction/homogeneity vs. complexity/low abstraction/heterogeneity). Points include: μcontroller + SRAM, μprocessor + memory hierarchy, homogeneous MP-SOC, heterogeneous NOC, PRAM, BSP, vector, clustered vector, vector–MPP hybrid, upscaled application-specific tera-device, and upscaled general-purpose tera-device.]
12
Application programming

[Figure: Single-threaded cores P1...Pp with coherent caches C1...Cp and distributed memories M1...Mp, connected by an asynchronous cache coherence/memory network, under a common clock or independent clocks.]
The main obstacle preventing parallel computing from becoming the mainstream of application development is

• Poor and error-prone programmability of current parallel architectures

Other problems are

• Architecture dependency
• Architectural inefficiency of key primitives of parallel computing
• The huge existing base of sequential code

With the advent of massively parallel computing on a chip, these problems are getting even worse!
13
Secrets of the success of the sequential computing paradigm—The right type of abstraction: RAM

• Captures the essential properties of the underlying machine
• Good portability of programs between machines with substantially different properties
• Idealized properties of the computational model, e.g.
  - access time = 1 cycle
  - sequential operation
  can be emulated well enough even with a speculative superscalar architecture with virtual memory
• Easy to learn and use
• A theory of algorithms exists and helps programming & analysis
• Efficient teaching (all universities teach it)
[Figure: The RAM model: a processor P executing instructions sequentially against a memory M under a common clock, with latency = 1.]
14
Secrets of the success of the sequential computing paradigm—Synchronization, techniques, locality

Simple concept of synchronization
• Subtasks of originally parallel computational problems are executed synchronously due to deterministically sequential execution at the program level
• Looping is used to process multiple data elements
• Multitasking is used to emulate parallelism (this introduces problems: atomic operations, locking, transactions, etc. are needed)
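The kind of machinery the last bullet alludes to can be sketched in a few lines (an illustrative Python example; the counter and helper names are invented for the sketch): as soon as threads emulate parallelism over shared state, the read-modify-write must be made atomic with a lock.

```python
import threading


def parallel_count(n_threads, n_increments):
    """Increment a shared counter from several threads. The lock makes
    the read-modify-write atomic; without it, concurrent updates could
    race and increments would be lost."""
    counter = {"value": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(n_increments):
            with lock:                 # atomic section
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

With the lock, the result is deterministic: n_threads * n_increments.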
Flexible performance enhancement techniques
• Low-level parallelism (static or dynamic superscalar execution)
• Speculation (out-of-order execution, data speculation, etc.)
• Locality exploitation (caching, buffering, interleaving)

A trivial concept of locality that is in line with memory hierarchies
[Figure: The RAM model: processor P, memory M, common clock, latency = 1.]
15
Problems of the current parallel computing approach—Abstraction

An inefficient, too low, and inappropriate abstraction of the underlying machine
• A programmer needs to take care of mapping, partitioning, synchronization (in most models) and even low-level communication (in message passing)
• Portability between machines with different properties is often limited (rewriting of software is too often needed if the machine changes or is updated)
• Quite difficult to learn and use, error-prone
• The generality/usefulness of the theory of parallel algorithms is limited due to architecture dependency
• Teaching at the elementary level is virtually nonexistent (advanced-level courses do exist in universities)
16
Problems of the current parallel computing approach—Synchronization and techniques

Synchronization
• Asynchronous operation (missing low-level synchronicity)
• High-level synchronization is very expensive (barriers typically take hundreds of cycles)
• This greatly limits applicability and rules out fine-grained parallel algorithms

Performance enhancement techniques that are not particularly innovative, flexible, suitable, or well linked (mostly copied from sequential computing)
• Low-level parallelism (superscalar execution, not well linked to TLP execution; complex compiling)
• Speculation (multithreading)
• Locality exploitation (NUMA, coherent caching)
• Inefficient implementation of, or missing support for, important primitives of parallel computing, e.g. exploitation of virtual ILP, concurrent memory access, multioperations
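To make the multioperation primitive concrete, here is an illustrative simulation (plain Python, not an actual architecture interface): all references to the same memory location issued in one step are combined in the memory system, so each location is written once per step instead of the accesses being serialized through locks or a coherence protocol.

```python
from collections import defaultdict


def multioperation_add(memory, requests):
    """Simulate one multioperation step. requests is a list of
    (address, value) pairs issued in the same step; contributions to
    the same address are combined first (the combining network's job),
    so each location receives exactly one write per step."""
    combined = defaultdict(int)
    for addr, value in requests:
        combined[addr] += value
    for addr, total in combined.items():
        memory[addr] = memory.get(addr, 0) + total
    return memory
```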
17
Problems of the current parallel computing approach—Locality

Most experts agree that locality is needed to perform parallel computation efficiently.

Locality = an area/neighborhood in which a certain special technique is possible

Problems:
- The concept of locality is too narrowly understood as locality to a single processor
- There is no way to maximize data locality with respect to processor cores in the general case

In fact, there are different kinds of (or a hierarchy of) localities, including e.g.:
- VLSI locality (range of a wire in one clock cycle)
- Synchronous area in VLSI design (beyond which one must use the GALS paradigm)
- Chip locality (everything can be fitted onto a single chip)
- Processor core locality (locality to a processor core)
- Synchronous shared memory locality (beyond which one must use asynchronous/hybrid shared memory)
- Machine locality (everything can be put onto a single machine)
- Grid locality (everything fits into a grid)
18
VLSI locality vs. chip locality

[Figure: Range of a wire in one clock cycle across technology generations. Source: Saman Amarasinghe, MIT, 2007.]

Interestingly, these localities also tend to change both absolutely and with respect to each other (sometimes quite dramatically) as time goes on.
• Proper TLP synchronization (exploit the synchronous shared memory locality)
• Correct abstraction (e.g. a synchronous shared memory machine)
• New kinds of performance enhancement techniques (not those borrowed from sequential computing, but e.g. chaining, concurrent memory access, multioperations)
• Algorithmics for fine-grained synchronous TLP (existing PRAM algorithmics fits well)
• Teaching, further research, etc.
20
Main idea—Synchronous Shared Memory (SSM)

Instead of providing asynchronous operation and hard-to-solve cache coherence problems:

- Drop the caches next to processors! Use throughput computing (parallel slackness) to hide the latency of the memory system
- Synchronize the steps of execution by using a synchronization wave technique [Ranade91]
- Decrease the probability of congestion by randomizing traffic via hashing of memory locations
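The randomizing hash can be sketched in a couple of lines (illustrative Python; this multiplicative hash is a stand-in, not the hash family actually used in [Ranade91] or Total Eclipse): consecutive addresses are scattered over the memory modules, so bursts of accesses to neighboring locations do not pile up on one module.

```python
def module_of(address, log2_p, a=0x9E3779B1):
    """Map an address to one of 2**log2_p memory modules using a
    32-bit multiplicative (Fibonacci-style) hash. Nearby addresses
    land on different modules, lowering the chance of congestion."""
    h = (address * a) & 0xFFFFFFFF   # multiply, keep the low 32 bits
    return h >> (32 - log2_p)        # top bits select the module
```

For example, with 8 modules (log2_p = 3), addresses 0, 1, 2, 3 land on four different modules rather than filling module 0 first.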
[Figure: The two organizations again. SSM: multi-threaded processors P1...Pp with local memories L1...Lp and distributed memories M1...Mp on a high-bandwidth synchronous network. Conventional CMP: single-threaded cores with coherent caches C1...Cp and distributed memories M1...Mp on an asynchronous cache coherence/memory network.]
21
Our approach [Forsell09]

[Figure: Three views of the approach. Idealized programmer's view: PRAM, with processors P1...P4 issuing read/write operations to a word-wise accessible shared memory under a common clock. Computational model: SSM, with multi-threaded processors, local memories, and distributed memory modules on a high-bandwidth synchronous network, under a common clock or independent clocks. Architectural realization: Total Eclipse, a grid of switches (S) connecting processor (P), instruction memory (I), data memory (M), local memory (L), and I/O resources.]
22
SSM—High-bandwidth network

Goal: sufficient throughput and low enough latency for random communication patterns (with high probability); see e.g. [Forsell05]

• A variant of the two-dimensional mesh: an Mc-way, double acyclic two-dimensional sparse mesh, or multi-mesh
• To maximize throughput for read-intensive code, a separate network for reply messages
• A simple XY-routing algorithm
• Deadlocks are not possible
• Fast; lower-power alternatives exist
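A minimal sketch of the XY-routing idea on a 2D mesh (illustrative code, not the actual Eclipse router): a packet first travels along the X dimension until its column matches the destination, then along Y. Because every packet fully resolves X before Y, the channel dependency graph is acyclic, which is why deadlock cannot occur.

```python
def xy_route(src, dst):
    """Return the hop-by-hop path from src to dst on a 2D mesh,
    correcting the X coordinate first, then Y (dimension-order routing)."""
    (x, y), (dst_x, dst_y) = src, dst
    path = [(x, y)]
    while x != dst_x:                 # phase 1: move along X only
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:                 # phase 2: move along Y only
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path
```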
[Figure: The multi-mesh network: processor/memory nodes with separate request and reply network interfaces (PM rni, M rni), connected by a grid of switches (S).]
23
Potential of SSMs in demanding applications

According to our investigations, with SSM it is possible to get close to the ideal PRAM even in demanding applications.

As for all new academic architectures, a number of questions remain open, related to

• practical off-chip memory bandwidth
• virtual memory
• implementation efficiency
• power and area concerns
[Figure: The same execution time vs. number of cores comparison as before (Ideal PRAM, CC-NUMA, NUMA, SMP, VC, MP, and P-core PRAM+ILP chaining), with the SSM potential marked close to the ideal PRAM curve. Early estimations, not verified results!]

Performance as a function of the number of processors P in demanding applications; Fs=0.1, Fdp=0.45, Pcm=0.1, Pcsd=0.5, Pg=0.2, Tb=10^6, F=4, Fnv=0.2; preliminary results from [Forsell10a].
24
Two-level/hybrid model

[Figure: A chip partitioned into maximum-sized SSM blocks, each within the SSM locality.]

If the size of forthcoming tera-device chips turns out to exceed the SSM locality, then one may need to use a two-level/hybrid model providing natural support for the blocking used in many parallel algorithms.

A machine for such a model would consist of maximum-sized SSM blocks that are interconnected and support e.g. the BSP model [Valiant90] for inter-SSM-block computation.

This kind of compromise, however, would make programming parallel algorithms without natural blocking more difficult.
25
Programming perspectives for the two-level model [Forsell10b]

If the computation fits in a block, do it with the PRAM model; otherwise, use a blocking algorithm and organize the block-wise computation into supersteps so that:

• During a superstep, computation is local to blocks (the PRAM model is available), but they may use local copies of shared variables.
• Between supersteps, the shared variables are updated accordingly and the execution of blocks is synchronized.
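The superstep discipline above can be sketched as follows (an illustrative Python skeleton, not the actual interface of [Forsell10b]): within a superstep each block computes on a private copy of the shared variables, and only at the barrier are the updates merged and made visible to all blocks.

```python
def run_supersteps(blocks, shared, steps):
    """blocks: list of functions, each taking a local copy of the
    shared state and returning its updates. Within a superstep each
    block sees only its private snapshot (PRAM-style computation
    inside the block); updates become globally visible at the barrier."""
    for _ in range(steps):
        # superstep: every block computes on its own snapshot
        updates = [block(dict(shared)) for block in blocks]
        # barrier: synchronize and merge updates into the shared state
        for u in updates:
            shared.update(u)
    return shared
```

For instance, two blocks each incrementing their own counter for three supersteps leave both counters at 3.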
Conclusions

Massively parallel computing is expected to reach single-chip processing engines in 10 to 15 years, with designs integrating more than 1000 processor cores.

This will have a tremendous impact on architectures and programming in order to guarantee

• ease of use/programming
• portability of code
• performance and power efficiency for general-purpose applications

Our insight points to a homogeneous approach that addresses latency hiding and multiple levels of parallelism, and provides a good abstraction of the underlying machine with explicit parallelism, synchrony, and conceptual simplicity, i.e. the SSM being investigated at VTT.

If the size of tera-device chips turns out to exceed the SSM locality, one needs to use a two-level/hybrid model providing natural support for blocking algorithms.
28
References

[Forsell02a] M. Forsell, Architectural differences of efficient sequential and parallel computers, Journal of Systems Architecture 47, 13 (July 2002), 1017-1041.
[Forsell02b] M. Forsell, A Scalable High-Performance Computing Solution for Network on Chips, IEEE Micro 22, 5 (September-October 2002), 46-55.
[Forsell05] M. Forsell, V. Leppänen, High-Bandwidth on-chip Communication Architecture for General Purpose Computing, In the Proceedings of the 9th World Multiconference on Systemics, Cybernetics and Informatics (WMSCI 2005), Volume IV, July 10-13, 2005, Orlando, USA, 1-6.
[Forsell09] M. Forsell, TOTAL ECLIPSE—An Efficient Architectural Realization of the Parallel Random Access Machine, to appear in "Parallel and Distributed Computing", IN-TECH, Vienna, 2009. (ISBN 978-3-902613-45-5)
[Forsell10a] M. Forsell and J. Träff, Modeling the relative Efficiency of certain Parallel Computing Architectures, manuscript, to be submitted to ACM Transactions on Architecture and Code Optimization, 2010.
[Forsell10b] M. Forsell and C. Kessler, Ultimately scalable general purpose computing platform, under preparation, 2010.
[ITRS07] International Technology Roadmap for Semiconductors, Semiconductor Industry Association, 2007; http://www.itrs.net.
[Ranade91] A. Ranade, How to Emulate Shared Memory, Journal of Computer and System Sciences 42 (1991), 307-326.
[Valiant90] L. G. Valiant, A Bridging Model for Parallel Computation, Communications of the ACM 33, 8 (1990), 103-111.