Source: webee.technion.ac.il/~ran/papers/PluralArchitectureMarch2015.pdf


The Plural Architecture: Shared Memory Many-core with Hardware Scheduling

Ran Ginosar

Technion, Israel

March 2015


Outline

• Motivation: Programming model

• Plural architecture

• Plural implementation

• Plural programming model

• Plural programming examples

• ManyFlow for the Plural architecture

• Scaling the Plural architecture

• Mathematical model of the Plural architecture


Many-cores

• A many-core is:
  • a single chip
  • with many (how many?) cores and on-chip memory
  • running one (parallel) program at a time, solving one problem
  • an accelerator
• A many-core is NOT:
  • a “normal” multi-core
  • running an OS
• Contending many-core architectures:
  • Shared memory (the Plural architecture, XMT)
  • Tiled (Tilera, Godson-T)
  • Clustered (Rigel)
  • GPU (Nvidia)
• Contending programming models

Plural shared memory architecture

[Figure: code mapping. Several applications (labels: Rx Phy, Tx Phy, MAC, LINUX) are mapped as one parallel program onto the shared memory architecture]

Context

• Plural: homogeneous acceleration for heterogeneous systems

[Figure: a host (OS, I/O, network, peripherals) next to the Plural accelerator; label: streaming]

One (parallel) program?

• Best formal approach to parallel programming is the PRAM model
• Manages
  • all cores as a single shared resource
  • all memory as a single shared resource
  • and more…

PRAM matrix-vector multiply

The PRAM algorithm ($i$ is the core index AND the slice index):

Begin
  $y_i = A_i x$
End

• A, x, y are in shared memory (Concurrent Read of x)
• Temporaries are in private memories (e.g. computing actual addresses given $i$)

[Figure: $Ax = y$ split into slices; each core $i$ (Core 1 to Core 5 shown) computes $y_i = A_i \times x$]
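The PRAM matrix-vector algorithm can be sketched in C. This is only an illustration under assumptions that are not in the slides: each slice $A_i$ is a single row, the matrix is stored row-major, and the function names `matvec_slice` and `matvec_all` are hypothetical. The outer loop is a sequential stand-in for the cores, which on a PRAM would all run at once.

```c
#include <assert.h>
#include <stddef.h>

/* Work of one PRAM "core" i: y_i = A_i x.
   Assumption: A is an n x n matrix stored row-major, one row per slice. */
void matvec_slice(const float *A, const float *x, float *y, size_t n, size_t i)
{
    assert(i < n);
    float sum = 0.0f;                 /* private temporary */
    for (size_t j = 0; j < n; j++)
        sum += A[i * n + j] * x[j];   /* x is read concurrently by all cores */
    y[i] = sum;                       /* each core writes its own y_i */
}

/* Sequential stand-in for n cores executing matvec_slice in parallel. */
void matvec_all(const float *A, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        matvec_slice(A, x, y, n, i);
}
```

Because every instance writes a different element of y and only reads A and x, the order in which the slices run does not matter.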

PRAM logarithmic sum

The PRAM algorithm:

// Sum vector A(*)
Begin
  B(i) := A(i)
  For h = 1 : log(n)
    if i ≤ n/2^h then
      B(i) = B(2i-1) + B(2i)
End
// B(1) holds the sum

[Figure: summation tree over a1 … a8. The bottom level sets B(i)=A(i); levels h=1, h=2, h=3 apply "if (..) B(i)=B(2i-1)+B(2i)"]
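A sequential C simulation of the logarithmic sum may help in following the index arithmetic. It is a sketch under stated assumptions: n is a power of two, the scratch array `b` is 1-based to match the slide's B(1) … B(n), and the function name `pram_log_sum` is hypothetical. On a PRAM the inner loop would be executed by all active cores in the same step.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential simulation of the PRAM logarithmic sum.
   a: n input values (0-based); b: scratch array of n + 1 doubles, used 1-based.
   Assumption: n is a power of two. */
double pram_log_sum(const double *a, double *b, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);

    for (size_t i = 1; i <= n; i++)        /* B(i) := A(i) */
        b[i] = a[i - 1];

    /* Step h keeps n / 2^h cores active. */
    for (size_t active = n / 2; active >= 1; active /= 2) {
        /* Increasing i is safe here: iteration i reads B(2i-1) and B(2i),
           and earlier iterations wrote only to indices below i. */
        for (size_t i = 1; i <= active; i++)
            b[i] = b[2 * i - 1] + b[2 * i];
    }
    return b[1];                           /* B(1) holds the sum */
}
```

For n = 8 the outer loop runs three times, with 4, 2 and 1 active cores, matching the three levels h = 1, 2, 3 of the slide.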

PRAM SoP: Concurrent Write

• Boolean X = a_1 b_1 + a_2 b_2 + …
• The PRAM algorithm:

Begin
  if (a_i b_i) X = 1
End

All cores that write into X write the same value
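The concurrent-write sum of products can be imitated sequentially in C. The sketch assumes that X is cleared to 0 before the cores start, which the slide does not state, and the function name `sum_of_products` is hypothetical. Each loop iteration stands for one core.

```c
#include <assert.h>
#include <stddef.h>

/* Boolean X = a_1 b_1 + a_2 b_2 + ... with "common" concurrent write.
   Assumption: X starts at 0 before any core runs. */
int sum_of_products(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int X = 0;
    for (size_t i = 0; i < n; i++)   /* iteration i = core i */
        if (a[i] && b[i])
            X = 1;                   /* every writer stores the same value */
    return X;
}
```

Since all writers store the value 1, the result is the same whichever write the memory system keeps.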


The Plural Architecture: Part I

• Many small processor cores
  • Small private memories (stack, L1)
• Fast NoC to memory
  • Multistage Interconnection Network
  • NoC resolves conflicts
• SHARED memory, many banks
  • ~Equi-distant from cores (2-3 cycles)
  • “Anti-local” address interleaving
  • Negligible conflicts

[Figure: processors (P) connected through the P-to-M resolving NoC to the shared memory, with external memory and I/O]

The Plural Architecture: Part II

• Hardware scheduler / dispatcher / synchronizer
  • Low (zero) latency parallel scheduling enables fine granularity
  • Reached through the P-to-S scheduling NoC
• Many small processor cores
  • Small private memories (stack, L1)
• Fast NoC to memory
  • Multistage Interconnection Network
  • NoC resolves conflicts
• SHARED memory, many banks
  • ~Equi-distant from cores (2-3 cycles)
  • “Anti-local” address interleaving
  • Negligible conflicts

[Figure: as in Part I, plus a scheduler connected to the processors (P) through the P-to-S scheduling NoC; external memory and I/O below the shared memory]
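The slides do not define the “anti-local” address interleaving precisely. One common reading is low-order interleaving, where consecutive words fall into consecutive banks. The C sketch below illustrates that reading only: the mapping itself, the 4-byte word size and the helper names are assumptions, and the bank count of 128 is borrowed from the example layout shown later in the deck.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: 128 banks (from the example layout), 4-byte words,
   low-order interleaving. None of this is confirmed by the slides. */
enum { NUM_BANKS = 128, WORD_BYTES = 4 };

/* Bank that holds the word at byte_addr. */
unsigned bank_of(uint32_t byte_addr)
{
    return (byte_addr / WORD_BYTES) % NUM_BANKS;
}

/* Number of distinct banks touched by `count` word accesses that start
   at `base` and are `stride_words` words apart. */
unsigned banks_hit(uint32_t base, uint32_t stride_words, unsigned count)
{
    unsigned char seen[NUM_BANKS] = {0};
    unsigned distinct = 0;
    assert(count <= NUM_BANKS);
    for (unsigned k = 0; k < count; k++) {
        unsigned bank = bank_of(base + k * stride_words * WORD_BYTES);
        if (!seen[bank]) {
            seen[bank] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Under this mapping, 64 cores reading 64 consecutive words hit 64 different banks, whereas a stride equal to the number of banks would send every access to the same bank.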


What does the P-to-M NoC look like?

• Full bi-partite connectivity required
• But a full cross-bar is not required: minimize conflicts and allow stalls/re-starts

[Figure: processors (P) on one side, memory banks (M) on the other]

Logarithmic multistage interconnection network

[Figure: processors (P) connected to memory banks (M) through a logarithmic multistage network; legend: pipeline stage (registers), combinational switches]

Floorplans

[Figure: floorplans, with an example of one route]

• The dual floor plan. Why?

Access sequence: fixed latency (when successful)

[Figure: timing diagram (time axis, in cycles) of a read request passing from the processors through pipeline stages 1, 2 and 3 to the memory]

Example floorplan + layout

• 40nm GP
• 4×4 mm
• 64 cores
• 16 FPU
• 2 MB D$ in 128 banks
• 128 kB I$
• 400 MHz
• 1 Watt

[Figure: floorplan and layout (PLURALITY) with two 1 MByte data memories, two 64 kB instruction memories, the sync/scheduler block and 64 cores]


The Plural task-oriented programming model

• Programmer generates TWO parts:
  • Task-dependency-graph = ‘task map’
  • Sequential task codes
• Task maps loaded into scheduler
• Tasks loaded into memory

Task template:

regular / duplicable taskName ( instance_id )
{
    … instance_id ….
    // instance_id is instance number
    …..
}

[Figure: processors (P), P-to-M resolving NoC, shared memory, scheduler and P-to-S scheduling NoC]


Fine Grain Parallelization

Convert (independent) loop iterations

for ( i=0; i<10000; i++ ) { a[i] = b[i]*c[i]; }

into parallel tasks:

set_task_quota(doLargeLoop, 10000);   // one instance per iteration

void doLargeLoop(int id)              // id is the instance number
{ a[id] = b[id]*c[id]; }

[Task map: a single duplicable task, doLargeLoop]
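To try the loop-to-task conversion on an ordinary machine, the hardware scheduler can be replaced by a plain loop. The sketch below is such an emulation, not the Plural runtime: `run_duplicable` and `fine_grain_demo` are hypothetical helpers, and the instances run one after another instead of on free cores.

```c
#include <assert.h>

typedef void (*task_fn)(int instance_id);

/* Hypothetical stand-in for the hardware scheduler: runs `quota`
   instances of a duplicable task, one after another. */
void run_duplicable(task_fn task, int quota)
{
    assert(quota >= 0);
    for (int id = 0; id < quota; id++)
        task(id);
}

enum { LOOP_N = 10000 };
static float a[LOOP_N], b[LOOP_N], c[LOOP_N];   /* "shared memory" */

/* Task code from the slide: one loop iteration per instance. */
void doLargeLoop(int id) { a[id] = b[id] * c[id]; }

/* Fills the inputs, runs all instances and returns the last result. */
float fine_grain_demo(void)
{
    for (int i = 0; i < LOOP_N; i++) {
        b[i] = (float)i;
        c[i] = 2.0f;
    }
    run_duplicable(doLargeLoop, LOOP_N);
    return a[LOOP_N - 1];
}
```

The instances share no data except through the arrays and never touch the same element, so any execution order, including a fully parallel one, gives the same result.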

Task map example (2D FFT)

[Figure: task map of a 2D FFT; legend: duplicable task, singular task, condition, join / fork]

Another task map (linear solver)


Linear Solver: Simulation snap-shots


Plural Task Oriented Programming Model: Task Rules 1

• Tasks are sequential
• All ready tasks, or any subset, can be executed in parallel on any number of cores
• All computing is organized in tasks. All code lines belong to tasks
• Tasks use shared data in shared memory
  • May employ local private memory
  • Its contents disappear once a task completes
• Precedence relations among tasks:
  • Described in the task map
  • Managed by the scheduler: it receives task completion messages and schedules dependent tasks
• Nested task spawning is easy and natural

Plural Task Oriented Programming Model: Task Rules 2

• 2 types of tasks:
  • Regular task (executes once)
  • Duplicable task
    • Duplicated into quota = d independent concurrent instances
    • Identified by entry point (same for all d instances) and by unique instance number
• The task quota is actually a variable. Reading it is the only reason for the synchronizer to access data memory
• Conditions on tasks are executed by the scheduler
• Tasks are not functions
  • No arguments, no inputs, no outputs
  • Share data only in shared memory
• No synchronization points other than task completion
  • No BSP, no barriers
• No locks, no access control in tasks
  • Conflicts are designed into the algorithm (they are no surprise)
  • Resolved only by the NoC

Example: Matrix Multiplication

set_task_quota(mm, N*N);       // create N×N task instances

extern float A[], B[], C[];    // A, B, C in shared mem (assumed N×N, row-major)

void mm(int id)                // id = instance number, 0 … N*N-1
{
    int i = id % N;            // row number
    int k = id / N;            // column number
    float sum = 0;
    for (int m = 0; m < N; m++) {
        sum += A[i*N + m] * B[m*N + k];   // read row & column from shared mem
    }
    C[i*N + k] = sum;          // store result in shared mem
}

[Task map: a single duplicable task, MM]

What if parallelism is limited?

• So far, examples were highly parallel
• What if the algorithm CANNOT be parallelized?
  • Execute many (serial) instances in parallel
  • Each instance on different data
• What if the algorithm is a mixture of serial / parallel segments?
  • Use ManyFlow


Stream Processing

• Data arrives in a sequence of blocks

• In parallel:

• Process current block (K)

• Output results of previous block (K-1)

• Input next block (K+1)

[Figure: timeline. While block K is processed, block K-1 is output and block K+1 is input; the same pattern repeats for blocks K-1 and K+1. One data block cycle time covers Process in parallel with Out+In]

PIPELINED stream processing

• For faster data & slower processing

[Figure: timeline of pipelined stream processing. Several blocks (K, K+1, K+2, …) are in process at the same time, each for several data block cycle times, while one block is input and one block is output in every cycle]

PIPELINED stream processing: ManyFlow

• Parallel execution of pipelined stream processing on the shared-memory many-core Plural architecture
• Flexible, dynamic, out-of-order, task-oriented execution

Example: A DWT image compression algorithm

• Five stages, A to E: three serial stages, plus DWT (highly parallel *) and bit-plane encoding (highly parallel *)
• Max 64 cores
• Low utilization: only 65%
• Image compression time: 160 (relative time units)

[Figure: number of cores utilized versus time for stages A, B, C, D, E]

Speed it up with a pipeline?

• Sequential: 25 + 50 + 18 + 54 + 13 = 160
• Pipeline: 54, 54, 54, 54, 54 (every step takes as long as the slowest stage)
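The pipeline arithmetic can be checked with a few lines of C. The stage times 25, 50, 18, 54, 13 come from the slide. Assigning 64 cores to the second and fourth stages and one core to the others is an assumption, based on the later remark that two stages use 64 cores each and three use one core each. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Time for one image when the stages run back to back. */
int sequential_time(const int *stage_time, size_t n)
{
    assert(n > 0);
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += stage_time[i];
    return total;
}

/* Step time of a lock-step pipeline: the slowest stage sets the pace. */
int pipeline_step_time(const int *stage_time, size_t n)
{
    assert(n > 0);
    int worst = stage_time[0];
    for (size_t i = 1; i < n; i++)
        if (stage_time[i] > worst)
            worst = stage_time[i];
    return worst;
}

/* Useful work in core-time units: stage time multiplied by cores used. */
int total_work(const int *stage_time, const int *stage_cores, size_t n)
{
    assert(n > 0);
    int work = 0;
    for (size_t i = 0; i < n; i++)
        work += stage_time[i] * stage_cores[i];
    return work;
}
```

With these inputs the sequential time is 160 and the pipeline step is 54. The work adds up to 6712 core-time units, which is about 65% of the 160 × 64 units available on 64 cores, in line with the utilization quoted for the sequential schedule.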

Hardware-like Pipeline

• Needs 5 stages: two with 64 cores each, three with one core each (total 131 cores)
• If only 64 cores, time / step = 64x2 + 25 = 153 (how? What is the utilization?)
• Hard to program, inefficient, inflexible, fixed task per core. Need to store 5 images

[Figure: pipeline schedule over time steps i … i+7 for images k+1 … k+7]

Parallel / pipelined “ManyFlow”

• All 5 stages are independent (order does not matter)
• Can run concurrently
• Scheduler will dispatch most efficiently
• Still need to store 5 images (and their temporary storage)

[Figure: task map for step i. Stages A, B, C, D, E work on images k … k+4, followed by a pipeline stage sync]

Parallel / pipelined “ManyFlow”

• Task map for continuous execution
• Includes two more pipe stages, for I/O of images
• Now need to store 7 images (and their temporary storage)

[Figure: task map with the stages input raw image, A, B, C, D, E, output compressed image, followed by a pipeline stage sync]

Parallel / pipelined “ManyFlow” (automatically scheduled)

• Higher utilization: 99%
• Image compression time (piped): 95

[Figure: number of cores utilized versus time, with stages A, B, C, D, E of different images interleaved]

The code

PROGRAM

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 1000

int round_counter = 0;

void program_start() {
    set_task_quota(BB, N);
    set_task_quota(DD, N);
}

void AA (void) { set_task_runtime(25); }
void BB (void) { set_task_runtime(3); }
void CC (void) { set_task_runtime(20); }
void DD (void) { set_task_runtime(3); }
void EE (void) { set_task_runtime(10); }

int task_manager(void) {
    round_counter++;
    if (round_counter < 5)
        return(0);
    else
        return(1);
}

void program_end(void) { }

TASK MAP

regular task program_start()
regular task AA (program_start || task_manager==0)
regular task CC (program_start || task_manager==0)
regular task EE (program_start || task_manager==0)
duplicable task BB (program_start || task_manager==0)
duplicable task DD (program_start || task_manager==0)
regular task task_manager (AA && BB && CC && DD && EE)
regular task program_end (task_manager==1)

(for simplicity, real task code replaced by indication of duration)
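A rough lower bound on the time of one round of the listing above follows from dividing the total work by the number of cores. The sketch assumes that `set_task_runtime(3)` applies to each of the 1000 instances of BB and DD, and that 64 cores are available; neither is stated next to the code. The helper names are hypothetical.

```c
#include <assert.h>

enum { CORES = 64, QUOTA = 1000 };

/* Core-time units of one round: AA = 25, CC = 20, EE = 10,
   BB and DD = 3 per instance (assumed), QUOTA instances each. */
int round_work(void)
{
    return 25 + 3 * QUOTA + 20 + 3 * QUOTA + 10;
}

/* Lower bound on the round time with perfect packing: ceil(work / cores). */
int ideal_round_time(int work, int cores)
{
    assert(work >= 0 && cores > 0);
    return (work + cores - 1) / cores;
}
```

Under these assumptions the work is 6055 units and the bound is 95 time units on 64 cores, at roughly 99% utilization. That agrees with the figures quoted for the automatically scheduled ManyFlow run, although the slides do not say how those figures were obtained.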

Challenges

• What if on-chip memory is limited?
  • Input & output to/from same area
  • Process smaller data blocks
  • Decompose algorithm to fewer steps
• Beware of combining serial and parallel code segments in the same pipe stage
  • Stages may be serial, highly parallel, or limited parallel

Example: JPEG compression algorithm using ManyFlow

[Figure: JPEG block diagram. RGB pixel stream → Convert RGB to YCrCb → Compress color 4:2:0 → DCT 8X8 → Quantization → DPCM (DC) and ZigZag Scan (AC) → DC Huffman and AC Huffman → Combine Bit Stream → compressed bit stream. Data labels between the blocks: 16X16 pl (Y, Cr, Cb); 8X8 pl (4Y 1Cr 1Cb); 8X8 Coeff (4Y 1Cr 1Cb); variable length code (4Y 1Cr 1Cb)]

JPEG compression: ManyFlow


JPEG compression: Task Allocation


JPEG compression: Most cores active


Example: JPEG2000 Encoder

• Image: 1K × 1K 8b pixels; core frequency $F_1 = 250$ MHz
• Parallel fraction $f = 95\%$
• Stages:
  • A, serial: 220 msec
  • B, parallel: 1280/64 = 20 msec
  • C, serial: 60 msec
  • D, parallel: 1920/64 = 30 msec
  • E, serial: 70 msec
• Serial time $T_1 = 3.55$ sec
• Parallel time $T_{64} = 400$ msec
• Speed-up: $SU(64) = T_1 / T_{64} \approx 9$
• Efficiency: $E(64) = SU(64)/64 = 0.14$

[Figure: number of busy cores versus time (×10 msec) for stages A to E]

Non-ManyFlow RIGID Multi-Job Scheduling

• Run multiple serial sections in parallel

• Run a single parallel section at a time

[Figure: schedule of eight jobs, each with stages A, B, C, D, E. Serial sections of different jobs overlap; parallel sections run one at a time]

Non-ManyFlow RIGID Multi-Job Scheduling

• Fixed number of cores $p = 64$
• Job with fraction $f$ parallel, $(1 - f)$ serial
  • Time of parallel section: $f T_1 / p$
• Variable number of jobs $J = 1, 2, \ldots$
• Schedule:
  • $J$ serial sections in parallel, time $T_{PS} = (1 - f) T_1$
  • $J$ parallel sections in series, time $T_{PP} = J \times f T_1 / p$
• Serial time $T_S(J) = J \times T_1$
• Parallel time $T_P(J) = T_{PS} + T_{PP}$

[Figure: schedules for JPEG2000 with $J = 1$ ($f = 95\%$) and with $J = 16$]
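The rigid multi-job model can be evaluated directly. The C sketch below uses the JPEG2000 stage timings given earlier in the deck (350 msec of serial time and 3200 msec of single-core parallel work per job); the struct and function names are hypothetical. Note that these timings correspond to a parallel fraction of about 0.90 rather than the quoted 95%, so the sketch takes the two times as inputs instead of $f$.

```c
#include <assert.h>

/* One job: serial time (1 - f) T1 and single-core parallel work f T1, in msec. */
typedef struct {
    double serial_ms;
    double parallel_work_ms;
} job_t;

/* T_P(J) = T_PS + T_PP: serial sections overlap, parallel sections run in series. */
double rigid_parallel_time(job_t job, int J, int p)
{
    assert(J >= 1 && p >= 1);
    double t_ps = job.serial_ms;                  /* (1 - f) T1 */
    double t_pp = J * job.parallel_work_ms / p;   /* J f T1 / p */
    return t_ps + t_pp;
}

/* Speed-up T_S(J) / T_P(J), with T_S(J) = J T1. */
double rigid_speedup(job_t job, int J, int p)
{
    double t_s = J * (job.serial_ms + job.parallel_work_ms);
    return t_s / rigid_parallel_time(job, J, p);
}
```

For one job on 64 cores this gives 400 msec and a speed-up of about 8.9. For 16 jobs it gives 1150 msec and a speed-up of about 49, an efficiency near 0.77, close to the rounded values of 50 and 0.8 reported for $J = 16$.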

Non-ManyFlow RIGID Multi-Job Scheduling

• Memory-limited
• 8MB (¼ max memory) enables:
  • J=16 jobs
  • Speed-up 50 (cf. 9)
  • 0.8 efficiency (cf. 0.14)
• ManyFlow works better!

[Figure: schedules for JPEG2000 with J=1 and with J=16]


Possible Full-Chip Plan

• 64 cores with 2 MB L1: 2-3 cycle access
• Multi-bank 256 MB L2: 10-30 cycle access
• Mem Cntl, SERDES I/O

[Figure: full-chip floorplan; dimension labels 20 mm and 4 mm]

But does it scale (more processors)?

Three chip configurations:
• 64 cores, 2 MB shared L1, 256 MB shared L2
• 256 cores, 8 MB shared L1, 192 MB shared L2
• 1024 cores, 32 MB shared L1, 64 MB shared L2

Two ways to build the larger configurations:
• One large shared L1 (8 MB for 256 cores, 32 MB for 1024 cores): long, high energy access to larger shared memory
• Clusters of 64 cores with 2 MB each (4 clusters with 192 MB L2, 16 clusters with 64 MB L2): a cluster, not a shared memory

[Figure: floorplans of the three configurations in both variants]

Compare with “tiled” CMP using mesh NOC

• 20×20 mm
• 64 tiles
• 32 kB L1 x64 = 2 MB
• 4 MB L2 x64 = 256 MB
• Directory: all L2’s = L3
• Access latency: L1 1 cycle, L2 7 cycles, L3 70 cycles

[Figure: 20 mm chip tiled with 64 tiles, each holding a processor (P), an L1 and an L2]

Other proposed NOC-based manycores

[Figure: several floorplans that arrange processors with L1 (P, L1) and L2 banks in different ways on a NoC]

GPU: Yet another manycore


EXAMPLE

256 cores

memory banks

1 MB x256

= 256 MB

Another idea: SIMD

I/O

I/O

I/O

I/O

P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P

P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P

P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P

P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P

control

Page 57: Ran Ginosar Technion, Israel March 2015webee.technion.ac.il/~ran/papers/PluralArchitectureMarch2015.pdf · Ran Ginosar Technion, Israel March 2015 1. Outline • Motivation: Programming


The many-core research question

• Given fixed area, into how many processor cores should we divide it?
  • Analysis can be based on Pollack’s rule
• Other good questions (not dealt with here):
  • Given fixed power, how many cores? Which cores?
  • Given fixed energy, how many cores? Which cores?
  • Given target performance, how many? Which?

The history at the basis of Pollack’s analysis

[Figure: processors P1 … P5 plotted against technology generations G1 … G5. One kind of arrow marks shrink / scaling to the next generation, the other marks a new architecture in the same process]

Q: On the red arrows, how much more performance for how much more area?

Pollack’s rule for processors: Area or Power vs. Performance

• Pollack (& Borkar & Ronen, Micro 1999) observed many years of (Intel) architecture
• In each Intel technology node, they compared:
  • Old uArch (shrink from previous node)
  • New uArch (faster clock and/or higher IPC)
• They noted:
  • New uArch used 2-3X larger area
  • New uArch achieved 1.5-1.7X higher performance
    • Resulting from both higher frequency and higher IPC
• They did not consider power increase
  • Who thought about power in 1999?
• Observation: Performance ~ $\sqrt{\text{area}}$

The many-core fixed-total-area model

• Assume fixed chip area (typically 300-500 mm²)
• Split chip area $A = A_{cores} + A_{mem}$
  • Memory size addressed by other math models
• Divide $A_{cores}$ into $m$ cores. How many?
  • Area of each core: $a = A_{cores}/m$. Thus, $m \sim 1/a$
• [Pollack’s]: core area determines core performance. Select IPC and frequency $f$ so that:
  • Performance (core) $= IPC \times f \sim \sqrt{a}$. Thus, $a \sim IPC^2 f^2$, $m \sim 1/(IPC^2 f^2)$
  • Power (core) $\sim a \times f \sim IPC^2 f^3$
• Assume perfect parallelism (at least as upper bound)
  • Performance ($m$ cores) $= IPC \times f \times m \sim \frac{IPC \cdot f}{IPC^2 f^2} = \frac{1}{IPC \cdot f} \sim \frac{IPC \cdot m}{IPC \sqrt{m}} = \sqrt{m}$
  • Power ($m$ cores) $= a \times f \times m \sim \frac{IPC^2 f^3}{IPC^2 f^2} = f \sim \frac{1}{IPC \sqrt{m}}$

Summary: Performance $\sim 1/f \sim \sqrt{m}$, Power $\sim 1/\sqrt{m} \sim f$, $m \sim 1/f^2$
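The summary relations of the fixed-total-area model can be turned into a small calculator. This C sketch assumes a fixed IPC and normalizes everything to a baseline design at relative frequency 1; the struct and function names are hypothetical, and only the proportionalities come from the slides.

```c
#include <assert.h>

/* All values are relative to a baseline design (frequency scale 1). */
typedef struct {
    double cores;        /* m           ~ 1 / f^2 */
    double performance;  /* performance ~ 1 / f   */
    double power;        /* power       ~ f       */
} scaling_t;

/* Fixed IPC and fixed total core area; frequency scaled by s (> 0). */
scaling_t scale_frequency(double s)
{
    assert(s > 0.0);
    scaling_t r;
    r.cores = 1.0 / (s * s);
    r.performance = 1.0 / s;
    r.power = s;
    return r;
}
```

For example, halving the frequency allows four times as many cores in the same area, doubles the performance bound and halves the core power. This holds only under the perfect-parallelism assumption and before memory access power is taken into account.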

[Plot: Performance (core) $= IPC \times f$]

[Plot: $a \sim IPC^2 f^2$. For each IPC curve, $a \sim f^2$]

[Plot: $m \sim 1/(IPC^2 f^2)$. For each IPC curve, $m \sim 1/f^2$]

[Plot: Performance $\sim 1/f \sim \sqrt{m}$; Power $\sim f \sim 1/\sqrt{m}$]

$\frac{\text{Performance}}{\text{Power}} \sim \frac{1/f}{f} = \frac{1}{f^2} \sim \frac{\sqrt{m}}{1/\sqrt{m}} = m$

Analysis of the results so far:

• Slower frequency and lower IPC lead to higher performance and lower power
• Thanks to Pollack’s square rule

But this changes when we also consider memory power…

Now add memory

• So far, only computing power
  • Including power to access local cache/memory in each core
  • Only small private memory is local in the SM Plural architecture
• But we also need to access not-so-local shared memory
  • Access rate to memory: once every $r_m$ instructions
  • About every 20 instructions in the SM Plural architecture
  • Ignore cache misses, assume using only on-chip memory
• Need to add memory access power to the computing power
  • Relative energy: assume access is 10x higher than exec.

[Plots: performance and power versus $m$ and $f$ once memory access power is added. The surviving labels include terms of the form $1/f + f$; the exact expressions are not recoverable from this transcript]

Summary of the model

• Considering only cores, the fixed-total-area model implies: for highest performance and lowest power, use
  • smallest / weakest cores (lowest IPC)
  • lowest frequency
• Adding on-chip access to memory leads to a different conclusion: for lowest power and highest performance/power ratio, use
  • strongest cores (high IPC)
  • but stay with lowest frequency
    • Lower frequency means lower access rate to global memory

The Plural Architecture: Some benefits

• Shared, uniform (~equi-distant) memory

• no worry which core does what

• no advantage to any core because it already holds the data

• Many-bank memory + fast P-to-M NoC

• low latency

• no bottleneck accessing shared memory

• Fast scheduling of tasks to free cores (many at once)

• enables fine grain data parallelism

• harder in other architectures due to:

• task scheduling overhead

• data locality

• Any core can do any task equally well on short notice

• scales well

• Programming model:

• intuitive to programmers

• “easy” for automatic parallelizing compiler (?)


On-going Research

• Mathematical model incl. memories

• Scaling: full chip, multiple chips

• Plural algorithms and Plural programming

• FPGA versions

• Better NoC to shared memory

• Better scheduler and NoC to scheduler

• Near/sub-threshold for extremely low energy/power

• Using asynchronous logic design

• 3D for larger ‘on-chip’ memory

• Converting large message-passing programs to shared-memory plus message passing codes

Summary

• Simple many-core architecture

• Inspired by PRAM

• Hardware scheduling

• Task-based programming model

• Designed to achieve the goal of ‘more cores, less power’
• Developing model to illuminate / investigate