FORMLESS: Scalable Utilization of Embedded Manycores in Streaming Applications
Matin Hashemi (Sharif University of Technology), Mohammad H. Foroozannejad (University of California, Davis), Christoph Etzel (University of Augsburg), Soheil Ghiasi (University of California, Davis)
Streaming Applications
- Widespread: cell phones, mp3 players, video conferencing, real-time encryption, graphics, HDTV editing, hyperspectral imaging, cellular base stations
- Definition: an infinite sequence of data items; at any given time the application operates on a small window of this sequence and moves forward in data space
[Figure: a small window sliding over the input stream produces the output stream]
// 53° rotation around the z axis
const float R[3][3] = {
  { 0.6, -0.8, 0.0 },
  { 0.8,  0.6, 0.0 },
  { 0.0,  0.0, 1.0 }
};
Rotation3D {
  for (i = 0; i < 3; i++)
    for (j = 0; j < 3; j++)
      B[i] += R[i][j] * A[j];
}
Application Model: Dataflow Task Graph
- Vertices (actors): functions, computations
- Edges: data dependencies, communication between actors
- Execution model: any actor can perform its computation whenever all necessary input data are available on its incoming edges
- Synchronous dataflow (SDF) is a special case that is statically schedulable [Lee '87]
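Static schedulability of SDF comes from the balance equations: the firing counts per period can be solved at compile time. A minimal sketch for a single edge, with helper names of my own (not from the paper), where the producer writes p tokens per firing and the consumer reads c:

```c
#include <assert.h>

/* Greatest common divisor, used to solve the SDF balance equation. */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* For an edge producer --p:c--> consumer, the balance equation
 * q_prod * p == q_cons * c has the minimal integer solution
 * q_prod = c/g, q_cons = p/g with g = gcd(p, c).                 */
static void sdf_repetitions(int p, int c, int *q_prod, int *q_cons) {
    int g = gcd(p, c);
    *q_prod = c / g;
    *q_cons = p / g;
}
```

For example, a producer emitting 3 tokens per firing feeding a consumer reading 2 must fire 2 and 3 times per period, respectively, so the buffer returns to its initial state.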
Example
[Figure: Vocoder task graph built from duplicate splitters, round-robin splitters/joiners, DFT, FIR smoothing, identity, deconvolve, linear interpolator, decimator, multiplier, phase unwrapper, and constant multiplier actors]
Vocoder task graph from StreamIt: http://www.cag.csail.mit.edu/streamit
Manycore Model
- Distributed memory
- Interconnect network for sending/receiving messages
- Examples: Tilera TILE64, UC Davis AsAP
Software Synthesis
- Compile the high-level dataflow specification into parallel software modules
- Task assignment: assign tasks to processors
- Task scheduling: schedule the tasks assigned to the same processor for periodic sequential execution on that processor
Baseline Software Synthesis

// msort.h
void sort(int* x, int n) { ... }
void merge(int* x, int* y, int* z, int n) { ... }

// core1.c
#include "msort.h"
int x[100], x1[25];
while (1) {
  for i=1:100  x[i] = read(in);
  for i=1:25   x1[i] = x[i];
  for i=1:25   write(x[i+25], 2);
  for i=1:25   write(x[i+50], 3);
  for i=1:25   write(x[i+75], 3);
  sort(x1, 25);
  for i=1:25   write(x1[i], 2);
}

// core2.c
#include "msort.h"
int x1[25], x2[25], y1[50];
while (1) {
  for i=1:25   x2[i] = read(1);
  sort(x2, 25);
  for i=1:25   x1[i] = read(1);
  merge(x1, x2, y1, 25);
  for i=1:50   write(y1[i], 4);
}

// core3.c
#include "msort.h"
int x3[25], x4[25];
while (1) {
  for i=1:25   x3[i] = read(1);
  for i=1:25   x4[i] = read(1);
  sort(x3, 25);
  sort(x4, 25);
  for i=1:25   write(x3[i], 4);
  for i=1:25   write(x4[i], 4);
}

// core4.c
#include "msort.h"
int x3[25], x4[25], y1[50], y2[50], y3[100];
while (1) {
  for i=1:25   x3[i] = read(3);
  for i=1:25   x4[i] = read(3);
  merge(x3, x4, y2, 25);
  for i=1:50   y1[i] = read(2);
  merge(y1, y2, y3, 50);
  for i=1:100  write(y3[i], out);
}
[Figure: merge-sort task graph (scatter X, sorts S1-S4, merges M1-M3; token rates 25, 50, 100) and its mapping onto cores 1-4 over the interconnect network]
[Flow: Task Graph → Task Assignment → Task Scheduling → Backend Optimizations → Code Generation]
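The per-core pseudocode above computes one period of a four-way merge sort. A minimal single-process sketch of the same computation in plain C, with the FIFO reads/writes replaced by memcpy (helper names are my own, not the synthesized code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* msort.h counterparts: sort a chunk in place, merge two sorted runs. */
static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}
static void sort_run(int *x, int n) { qsort(x, n, sizeof(int), cmp_int); }
static void merge_runs(const int *x, const int *y, int *z, int n) {
    int i = 0, j = 0, k = 0;
    while (i < n && j < n) z[k++] = (x[i] <= y[j]) ? x[i++] : y[j++];
    while (i < n) z[k++] = x[i++];
    while (j < n) z[k++] = y[j++];
}

/* One period of the four-core pipeline, run sequentially: scatter 100
 * inputs into 4 chunks of 25, sort each, merge pairwise, merge again. */
static void msort_period(const int *in, int *out) {
    int x1[25], x2[25], x3[25], x4[25], y1[50], y2[50];
    memcpy(x1, in,      sizeof x1);
    memcpy(x2, in + 25, sizeof x2);
    memcpy(x3, in + 50, sizeof x3);
    memcpy(x4, in + 75, sizeof x4);
    sort_run(x1, 25); sort_run(x2, 25);
    sort_run(x3, 25); sort_run(x4, 25);
    merge_runs(x1, x2, y1, 25);
    merge_runs(x3, x4, y2, 25);
    merge_runs(y1, y2, out, 50);
}
```

In the synthesized version each memcpy becomes a stream of write()/read() calls over a FIFO link, which is what lets the four stages run concurrently on four cores.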
Observation
- In principle, "behavior" and "implementation" are separated
- Nevertheless, some inflexible structure is dictated by the designer
- Implementations on a few vs. many processors look different
[Figure: merge-sort task graphs for input size N, with sort tasks of size N/8 feeding a merge tree]
Motivating Example
- Automatic task assignment to 7 processors: execution period = 107 (×1000 clk @ 100 MHz)
- Experiment platform: FPGA-prototyped multiprocessor, Nios II/f @ 200 MHz, 32 KB data cache, 8 KB instruction cache; inter-processor connections are FIFO buffers of depth 1024; off-chip DDR2-800 main memory
[Figure: merge-sort task graph with sort tasks of size N/8; workloads: sort = 18.3, merges = 9.2, 36.6, 47.7]
Motivating Example (Cont'd)
- Programmer manually constructs the task graph for 7 processors
- Automatic task assignment to 7 processors: execution period = 110
[Figure: task graph with 6 sort tasks of size N/6, merge-2 and merge-3 actors; workloads 24.8, 36.6, 66.0]
Motivating Example (Cont'd)
- Automatically generated task graph for 7 processors
- Automatic task assignment to 7 processors: execution period = 94
[Figure: generated task graph mixing sort tasks of sizes n/9 and n/6; workloads 12.2, 16.5, 41.9, 49.4, 66.0]
Observation
- In principle, "behavior" and "implementation" are separated
- Nevertheless, some inflexible structure is dictated by the designer, and implementations on a few vs. many processors look different
- Solution: raise the abstraction level in the specification so that it is functionally consistent while admitting transformations, i.e., structurally malleable
Higher-Level Specification
Functionally-consistent Structurally-Malleable Streaming Specification (FORMLESS)
- Tasks' functionality, ports, rates, and their composition are governed by forming parameters.

task ActorName ( // list of parameters
  Type1 ParamName1,
  Type2 ParamName2,
  ...
) {
  interface {
    // list of input and output ports
    input  InputPortName1 ( PortRate );
    input  InputPortName2 ( PortRate );
    ...
    output OutputPortName1 ( PortRate );
    ...
  }
  function {
    // data transformation function
  }
}

application AppName ( // list of parameters
  ...
) {
  interface {
    // list of input and output ports
    ...
  }
  composition {
    // actors:
    instantiate ActorName ActorID ( ParamValue1, ... );
    ...
    // channels:
    connect ( ActorID.PortName, ActorID.PortName );
    ...
  }
}
Case Study 1: Merge Sort (FORMLESS Task Graph)
Actors:
  scatter2 ( r[], x[], y[], n )
  scatter3 ( r[], x[], y[], z[], n )
  sort ( x[], s[], n )
  merge2 ( x[], y[], r[], n )
  merge3 ( x[], y[], z[], r[], n )
Parameters:
  φ1: number of parallel sort tasks
  φ2: fan-in degree of merge and fan-out degree of scatter tasks
[Figure: instantiated graphs for Φ=(1,1) (a single sort of size N), Φ=(3,3) (sorts of size N/3), and Φ=(8,2) (sorts of size N/8)]
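The forming vector fully determines the shape of the instantiated graph. A sketch (my own helper, not part of the FORMLESS toolchain) that counts the actors implied by Φ=(φ1, φ2) for this merge-sort specification:

```c
#include <assert.h>

/* Number of actors in the instantiated merge-sort graph for a forming
 * vector (phi1, phi2): phi1 parallel sorts, a scatter tree with fan-out
 * phi2, and a mirror-image merge tree with fan-in phi2.
 * Requires phi2 >= 2 whenever phi1 > 1, or the tree never closes.     */
static int msort_actor_count(int phi1, int phi2) {
    assert(phi1 >= 1 && (phi1 == 1 || phi2 >= 2));
    int merges = 0;
    for (int n = phi1; n > 1; ) {
        int next = (n + phi2 - 1) / phi2;  /* ceil(n / phi2) merges */
        merges += next;
        n = next;
    }
    return phi1 + 2 * merges;  /* sorts + merge tree + scatter tree */
}
```

Under these assumptions, Φ=(8,2) yields 8 sorts plus 7 merges and 7 scatters, while Φ=(3,3) needs only a single scatter3 and merge3 around 3 sorts.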
Case Study 2: Matrix Multiply
Matrix multiply: A(m×n) × B(n×p) = C(m×p); e.g., element C[7][5] is the dot product of row 7 of A and column 5 of B.
Parallelized matrix multiply: A is split into row blocks A1..A3 and B into column blocks B1, B2; e.g., A2 × B1 = C21, and the blocks C11..C32 assemble C.
[Figure: task graph for Φ=(3,2): a scatter actor splits A, copy actors replicate B1 and B2, matrix-multiply actors compute the blocks, and a gather actor assembles C]
Parameters: φ1 and φ2: number of divisions in the rows of A and the columns of B.
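The block decomposition above is easy to check sequentially: each of the φ1×φ2 tasks computes one block of C independently. A minimal sketch (fixed small sizes and helper names are my own; the real actors stream their blocks over FIFOs):

```c
#include <assert.h>

#define M 4
#define N 3
#define P 4

/* One matrix-multiply task: block (bi, bj) of C, where A is split into
 * phi1 row blocks and B into phi2 column blocks (phi1 | M, phi2 | P). */
static void mm_task(const int A[M][N], const int B[N][P], int C[M][P],
                    int phi1, int phi2, int bi, int bj) {
    int rows = M / phi1, cols = P / phi2;
    for (int i = bi * rows; i < (bi + 1) * rows; i++)
        for (int j = bj * cols; j < (bj + 1) * cols; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* scatter/copy/gather collapsed: run all phi1*phi2 tasks in sequence. */
static void mm_all(const int A[M][N], const int B[N][P], int C[M][P],
                   int phi1, int phi2) {
    for (int bi = 0; bi < phi1; bi++)
        for (int bj = 0; bj < phi2; bj++)
            mm_task(A, B, C, phi1, phi2, bi, bj);
}
```

Since no task writes another task's block, any Φ gives the same C, which is exactly the functional consistency the forming parameters must preserve.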
Case Study 3: Fast Fourier Transform
Parameters:
  φ1: radix of the butterfly tasks
  φ2: number of butterfly tasks grouped together
[Figure: instantiated graphs for Φ=(2,1) (radix-2 butterflies), Φ=(4,1) (radix-4 butterflies), and Φ=(2,4)]
Case Study 4: Advanced Encryption Standard (AES)
Basic operations:
  substitute byte (sub): data parallel across elements
  shift row (shf): data parallel across rows
  add round key (ark): data parallel across elements
  mix column (mix): data parallel across columns
φ1...φ4 represent the number of parallel sub, shf, ark, and mix tasks.
[Figure: for Φ=(1,1,1,1), the pipeline ark → sub → shf → mix → ark, repeated 9 times, then sub → shf → ark, with 16-byte tokens; for Φ=(4,2,1,1), sub is split into 4 parallel tasks and shf into 2, with correspondingly smaller tokens]
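To see why shf is data parallel across rows, here is ShiftRows on the 4×4 AES state as defined in FIPS-197 (a reference sketch, not the paper's actor code; the column-major indexing follows the standard):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* AES ShiftRows on the 4x4 byte state (column-major, s[r + 4c] as in
 * FIPS-197): row r is rotated left by r bytes.  Each row is computed
 * independently, so a shf actor can be split across rows (phi2).      */
static void shift_rows(uint8_t s[16]) {
    uint8_t t[16];
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            t[r + 4 * c] = s[r + 4 * ((c + r) % 4)];
    memcpy(s, t, 16);
}
```

Splitting shf with φ2=2 would simply hand rows 0-1 to one task and rows 2-3 to another, with no data exchanged between them.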
Case Study 5: Low Density Parity Check (LDPC)
H matrix (the Tanner graph is constructed by adding an edge for every cell of value 1 in the matrix):
      V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
C1:    0  0  1  0  1  0  0  1  0  1   0   0
C2:    1  0  0  0  1  0  0  0  1  0   1   0
C3:    0  1  0  0  0  1  1  0  0  0   1   0
C4:    0  0  1  1  0  0  1  0  0  0   0   1
C5:    1  0  0  0  0  1  0  1  0  0   0   1
C6:    0  1  0  1  0  0  0  0  1  1   0   0
The task graph is constructed by condensing all C nodes (C1-6) into one and all V nodes (V1-12) into one, and then unrolling the graph (here, 6 times) to remove the bidirectional edges.
Row-split LDPC with split factor φ=2, based on the work by Mohsenin et al., splits the check nodes into C'1-6 and C''1-6 (fed by V1-6 and V7-12); the task graph is constructed from the Tanner graph as before.
[Figure: Tanner graphs and the corresponding condensed, unrolled task graphs]
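The slide's construction rule ("an edge per 1-entry") can be sketched directly from the H matrix shown above (helper names are my own; the sketch only counts edges and node degrees rather than building the full adjacency lists):

```c
#include <assert.h>

/* The 6x12 parity-check matrix from the slide. */
static const int H[6][12] = {
    {0,0,1,0,1,0,0,1,0,1,0,0},  /* C1 */
    {1,0,0,0,1,0,0,0,1,0,1,0},  /* C2 */
    {0,1,0,0,0,1,1,0,0,0,1,0},  /* C3 */
    {0,0,1,1,0,0,1,0,0,0,0,1},  /* C4 */
    {1,0,0,0,0,1,0,1,0,0,0,1},  /* C5 */
    {0,1,0,1,0,0,0,0,1,1,0,0},  /* C6 */
};

/* Walk H, adding one Tanner-graph edge per 1-entry; return the edge
 * count and record the degree of every check and variable node.      */
static int tanner_edges(int check_deg[6], int var_deg[12]) {
    int edges = 0;
    for (int c = 0; c < 6; c++) check_deg[c] = 0;
    for (int v = 0; v < 12; v++) var_deg[v] = 0;
    for (int c = 0; c < 6; c++)
        for (int v = 0; v < 12; v++)
            if (H[c][v]) { edges++; check_deg[c]++; var_deg[v]++; }
    return edges;
}
```

For this H, every check node has degree 4 and every variable node degree 2, which is what makes the condensed two-node task graph well balanced.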
Benchmark Summary
Benchmark  Vector Φ         Δ = Domain of Φ                           |Δ|
AES        (φ1,φ2,φ3,φ4)    δ1=δ2=δ3=δ4={1,2,4}                        81
FFT        (φ1,φ2)          δ1={2,4,8,16}, δ2={1,2,4,...,128}          32
SORT       (φ1,φ2)          δ1={1,...,100}, δ2={1,...,10}, φ1=φ2^n     26
MMUL       (φ1,φ2)          δ1=δ2={1,...,16}                          256
LDPC       (φ1)             δ1={1,2,4,8,16}                             5
FORMLESS Design Flow
- Programmer specifies the application in the proposed malleable format.
- The tool explores the design space to hammer out a specific task graph from the malleable specification, based on the target architecture.
[Flow: Malleable Specification → Design Space Exploration (Task Graph Formation → Task Assignment → Local Scheduling → Throughput Estimation, fed by Task Profiling, with a repeat? loop) → Instantiated Task Graph → Baseline Software Synthesis (Task Assignment → Backend Optimization → Code Generation) → .c files → SEAM Simulator / FPGA-prototyped platform]
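The exploration loop in the flow above amounts to: for each candidate forming vector, form the graph, assign, schedule, estimate throughput, and keep the best. A skeletal sketch over a 2-parameter domain (the estimator below is a toy stand-in of my own; the real flow profiles tasks and simulates the schedule):

```c
#include <assert.h>

/* Estimate throughput for one forming vector; stands in for the
 * form-graph + assign + schedule + estimate pipeline.              */
typedef double (*estimator_fn)(int phi1, int phi2);

/* Exhaustive DSE: scan the domain d1 x d2, keep the best vector.   */
static void dse(const int *d1, int n1, const int *d2, int n2,
                estimator_fn estimate, int *best1, int *best2) {
    double best = -1.0;
    for (int i = 0; i < n1; i++)
        for (int j = 0; j < n2; j++) {
            double thr = estimate(d1[i], d2[j]);
            if (thr > best) { best = thr; *best1 = d1[i]; *best2 = d2[j]; }
        }
}

/* Toy estimator: throughput peaks when phi1 matches a hypothetical
 * core count of 8, with a small bonus for larger fan-in.  Purely
 * illustrative, not a real performance model.                      */
static double toy_estimate(int phi1, int phi2) {
    int dist = (phi1 > 8) ? phi1 - 8 : 8 - phi1;
    return 1.0 / (1 + dist) + 0.01 * phi2;
}
```

Exhaustive scanning is feasible here because the domains are small (|Δ| between 5 and 256 in the benchmark summary); the coverage results later in the deck show that even a fraction of Δ usually suffices.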
Experiment Setup
- Emulated multicore systems with Nios soft processors on FPGA.
- Cores: Nios II/f @ 200 MHz with a floating point unit. Memory: 8 KB instruction and 32 KB data caches, DDR2 main memory. Network: adjacent cores connected by point-to-point FIFOs of depth 1024.
- Area constraint: only 8 cores fit on the FPGA. For more cores we use the Sequential Execution Abstraction Model (SEAM): simulate the entire system with ModelSim, and simplify the sequential sections (no inter-core communication) into wait functions to speed up the simulation. Our previous experiments show the error is small [Huang '08].
Sequential Execution Abstraction Model (SEAM)
[Figure: task graph mapped onto processors P1-P5]

// code3.c
while (1) {
  for i=1..n  S[i] = readf()
  for i=1..n  X[i] = S[i] + S[n-i]
  for i=1..n  Y[i] = readf()
  for i=1..n  Z[i] = X[i] * Y[n-i]
  for i=1..n  writef(Z[i])
}

System-level behavior extraction replaces the compute loops with waits:

// code3.v
while (1) begin
  read ( N1 )
  #W1
  read ( N2 )
  #W2
  write( N3 )
end
Experiment Results
- Application throughput on manycore platforms, normalized with respect to single-core throughput.
- DSE-instantiated task graphs have higher throughput than fixed task graphs.
[Figure: normalized throughput vs. number of cores (up to 100) for Φopt vs. rigid graphs, with FPGA measurement points: AES (rigid Φ=(2,2,1,2)), FFT (rigid Φ=(2,16)), SORT (rigid Φ=(3,3)), MMUL (rigid Φ=(5,5)), LDPC (rigid Φ=(4))]
Number of Parameters
[Figure: coverage (%) and throughput degradation (%) vs. the number of forming parameters considered: AES (81 total vectors; coverage 89%; msq 11%, max 41%), FFT (32 total; 88%; msq 6.1%, max 36%), SORT (26 total; 96%; msq 23%, max 12%), MMUL (256 total; 93%; msq 33%, max 9.8%), LDPC (5 total; 98%; msq 6.0%)]
- Coverage and throughput degradation vs. the number of forming vectors considered by the DSE.
- On average, 15% of forming vectors are enough to form task graphs with at most 10% throughput degradation.