FORMLESS: Scalable Utilization of Embedded Manycores in Streaming Applications
Matin Hashemi (Sharif University of Technology), Mohammad H. Foroozannejad (University of California, Davis), Christoph Etzel (University of Augsburg), Soheil Ghiasi (University of California, Davis)
Streaming Applications
- Widespread: cell phones, mp3 players, video conferencing, real-time encryption, graphics, HDTV editing, hyperspectral imaging, cellular base stations
- Definition: an infinite sequence of data items; at any given time the application operates on a small window of this sequence and moves forward in data space
[Figure: a small window sliding over the input stream produces the output stream]
// 53° rotation around the z axis
const float R[3][3] = {
  { 0.6, -0.8, 0.0 },
  { 0.8,  0.6, 0.0 },
  { 0.0,  0.0, 1.0 }
};
Rotation3D {
  for (i = 0; i < 3; i++)
    for (j = 0; j < 3; j++)
      B[i] += R[i][j] * A[j];
}
Application Model: Dataflow Task Graph
- Vertices (actors): functions, computations
- Edges: data dependencies, communication between actors
- Execution model: any actor can perform its computation whenever all necessary input data are available on its incoming edges
- Synchronous dataflow (SDF) is a special case that is statically schedulable [Lee '87]
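Static schedulability of SDF comes from the balance equations: the firing counts per period can be solved at compile time. A minimal sketch for a single edge, with helper names of my own (not from the paper), where the producer writes p tokens per firing and the consumer reads c:

```c
#include <assert.h>

/* Greatest common divisor, used to solve the SDF balance equation. */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* For an edge producer --p:c--> consumer, the balance equation
 * q_prod * p == q_cons * c has the minimal integer solution
 * q_prod = c/g, q_cons = p/g with g = gcd(p, c).                 */
static void sdf_repetitions(int p, int c, int *q_prod, int *q_cons) {
    int g = gcd(p, c);
    *q_prod = c / g;
    *q_cons = p / g;
}
```

For example, a producer emitting 3 tokens per firing feeding a consumer reading 2 must fire 2 and 3 times per period, respectively, so the buffer returns to its initial state.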
Example
[Figure: Vocoder task graph built from duplicate splitters, round-robin splitters/joiners, DFT, FIR smoothing, identity, deconvolve, linear interpolator, decimator, multiplier, phase unwrapper, and constant multiplier actors]
Vocoder task graph from StreamIt: http://www.cag.csail.mit.edu/streamit
Manycore Model
- Distributed memory
- Interconnect network for sending/receiving messages
- Examples: Tilera TILE64, UC Davis AsAP
Software Synthesis
- Compile the high-level dataflow specification into parallel software modules
- Task assignment: assign tasks to processors
- Task scheduling: schedule the tasks assigned to the same processor for periodic sequential execution on that processor
Baseline Software Synthesis

// msort.h
void sort(int* x, int n) { ... }
void merge(int* x, int* y, int* z, int n) { ... }

// core1.c
#include "msort.h"
int x[100], x1[25];
while (1) {
  for i=1:100  x[i] = read(in);
  for i=1:25   x1[i] = x[i];
  for i=1:25   write(x[i+25], 2);
  for i=1:25   write(x[i+50], 3);
  for i=1:25   write(x[i+75], 3);
  sort(x1, 25);
  for i=1:25   write(x1[i], 2);
}

// core2.c
#include "msort.h"
int x1[25], x2[25], y1[50];
while (1) {
  for i=1:25   x2[i] = read(1);
  sort(x2, 25);
  for i=1:25   x1[i] = read(1);
  merge(x1, x2, y1, 25);
  for i=1:50   write(y1[i], 4);
}

// core3.c
#include "msort.h"
int x3[25], x4[25];
while (1) {
  for i=1:25   x3[i] = read(1);
  for i=1:25   x4[i] = read(1);
  sort(x3, 25);
  sort(x4, 25);
  for i=1:25   write(x3[i], 4);
  for i=1:25   write(x4[i], 4);
}

// core4.c
#include "msort.h"
int x3[25], x4[25], y1[50], y2[50], y3[100];
while (1) {
  for i=1:25   x3[i] = read(3);
  for i=1:25   x4[i] = read(3);
  merge(x3, x4, y2, 25);
  for i=1:50   y1[i] = read(2);
  merge(y1, y2, y3, 50);
  for i=1:100  write(y3[i], out);
}
[Figure: merge-sort task graph (scatter X, sorts S1-S4, merges M1-M3; token rates 25, 50, 100) and its mapping onto cores 1-4 over the interconnect network]
[Flow: Task Graph → Task Assignment → Task Scheduling → Backend Optimizations → Code Generation]
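The per-core pseudocode above computes one period of a four-way merge sort. A minimal single-process sketch of the same computation in plain C, with the FIFO reads/writes replaced by memcpy (helper names are my own, not the synthesized code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* msort.h counterparts: sort a chunk in place, merge two sorted runs. */
static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}
static void sort_run(int *x, int n) { qsort(x, n, sizeof(int), cmp_int); }
static void merge_runs(const int *x, const int *y, int *z, int n) {
    int i = 0, j = 0, k = 0;
    while (i < n && j < n) z[k++] = (x[i] <= y[j]) ? x[i++] : y[j++];
    while (i < n) z[k++] = x[i++];
    while (j < n) z[k++] = y[j++];
}

/* One period of the four-core pipeline, run sequentially: scatter 100
 * inputs into 4 chunks of 25, sort each, merge pairwise, merge again. */
static void msort_period(const int *in, int *out) {
    int x1[25], x2[25], x3[25], x4[25], y1[50], y2[50];
    memcpy(x1, in,      sizeof x1);
    memcpy(x2, in + 25, sizeof x2);
    memcpy(x3, in + 50, sizeof x3);
    memcpy(x4, in + 75, sizeof x4);
    sort_run(x1, 25); sort_run(x2, 25);
    sort_run(x3, 25); sort_run(x4, 25);
    merge_runs(x1, x2, y1, 25);
    merge_runs(x3, x4, y2, 25);
    merge_runs(y1, y2, out, 50);
}
```

In the synthesized version each memcpy becomes a stream of write()/read() calls over a FIFO link, which is what lets the four stages run concurrently on four cores.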
Observation
- In principle, "behavior" and "implementation" are separated
- Nevertheless, some inflexible structure is dictated by the designer
- Implementations on a few vs. many processors look different
[Figure: merge-sort task graphs for input size N, with sort tasks of size N/8 feeding a merge tree]
Motivating Example
- Automatic task assignment to 7 processors: execution period = 107 (×1000 clk @ 100 MHz)
- Experiment platform: FPGA-prototyped multiprocessor, Nios II/f @ 200 MHz, 32 KB data cache, 8 KB instruction cache; inter-processor connections are FIFO buffers of depth 1024; off-chip DDR2-800 main memory
[Figure: merge-sort task graph with sort tasks of size N/8; workloads: sort = 18.3, merges = 9.2, 36.6, 47.7]
Motivating Example (Cont'd)
- Programmer manually constructs the task graph for 7 processors
- Automatic task assignment to 7 processors: execution period = 110
[Figure: task graph with 6 sort tasks of size N/6, merge-2 and merge-3 actors; workloads 24.8, 36.6, 66.0]
Motivating Example (Cont'd)
- Automatically generated task graph for 7 processors
- Automatic task assignment to 7 processors: execution period = 94
[Figure: generated task graph mixing sort tasks of sizes n/9 and n/6; workloads 12.2, 16.5, 41.9, 49.4, 66.0]
Observation
- In principle, "behavior" and "implementation" are separated
- Nevertheless, some inflexible structure is dictated by the designer, and implementations on a few vs. many processors look different
- Solution: raise the abstraction level in the specification so that it is functionally consistent while admitting transformations, i.e., structurally malleable
Higher-Level Specification
Functionally-consistent Structurally-Malleable Streaming Specification (FORMLESS)
- Tasks' functionality, ports, rates, and their composition are governed by forming parameters.

task ActorName ( // list of parameters
  Type1 ParamName1,
  Type2 ParamName2,
  ...
) {
  interface {
    // list of input and output ports
    input  InputPortName1 ( PortRate );
    input  InputPortName2 ( PortRate );
    ...
    output OutputPortName1 ( PortRate );
    ...
  }
  function {
    // data transformation function
  }
}

application AppName ( // list of parameters
  ...
) {
  interface {
    // list of input and output ports
    ...
  }
  composition {
    // actors:
    instantiate ActorName ActorID ( ParamValue1, ... );
    ...
    // channels:
    connect ( ActorID.PortName, ActorID.PortName );
    ...
  }
}
Case Study 1: Merge Sort (FORMLESS Task Graph)
Actors:
  scatter2 ( r[], x[], y[], n )
  scatter3 ( r[], x[], y[], z[], n )
  sort ( x[], s[], n )
  merge2 ( x[], y[], r[], n )
  merge3 ( x[], y[], z[], r[], n )
Parameters:
  φ1: number of parallel sort tasks
  φ2: fan-in degree of merge and fan-out degree of scatter tasks
[Figure: instantiated graphs for Φ=(1,1) (a single sort of size N), Φ=(3,3) (sorts of size N/3), and Φ=(8,2) (sorts of size N/8)]
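The forming vector fully determines the shape of the instantiated graph. A sketch (my own helper, not part of the FORMLESS toolchain) that counts the actors implied by Φ=(φ1, φ2) for this merge-sort specification:

```c
#include <assert.h>

/* Number of actors in the instantiated merge-sort graph for a forming
 * vector (phi1, phi2): phi1 parallel sorts, a scatter tree with fan-out
 * phi2, and a mirror-image merge tree with fan-in phi2.
 * Requires phi2 >= 2 whenever phi1 > 1, or the tree never closes.     */
static int msort_actor_count(int phi1, int phi2) {
    assert(phi1 >= 1 && (phi1 == 1 || phi2 >= 2));
    int merges = 0;
    for (int n = phi1; n > 1; ) {
        int next = (n + phi2 - 1) / phi2;  /* ceil(n / phi2) merges */
        merges += next;
        n = next;
    }
    return phi1 + 2 * merges;  /* sorts + merge tree + scatter tree */
}
```

Under these assumptions, Φ=(8,2) yields 8 sorts plus 7 merges and 7 scatters, while Φ=(3,3) needs only a single scatter3 and merge3 around 3 sorts.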
Case Study 2: Matrix Multiply
Matrix multiply: A(m×n) × B(n×p) = C(m×p); e.g., element C[7][5] is the dot product of row 7 of A and column 5 of B.
Parallelized matrix multiply: A is split into row blocks A1..A3 and B into column blocks B1, B2; e.g., A2 × B1 = C21, and the blocks C11..C32 assemble C.
[Figure: task graph for Φ=(3,2): a scatter actor splits A, copy actors replicate B1 and B2, matrix-multiply actors compute the blocks, and a gather actor assembles C]
Parameters: φ1 and φ2: number of divisions in the rows of A and the columns of B.
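The block decomposition above is easy to check sequentially: each of the φ1×φ2 tasks computes one block of C independently. A minimal sketch (fixed small sizes and helper names are my own; the real actors stream their blocks over FIFOs):

```c
#include <assert.h>

#define M 4
#define N 3
#define P 4

/* One matrix-multiply task: block (bi, bj) of C, where A is split into
 * phi1 row blocks and B into phi2 column blocks (phi1 | M, phi2 | P). */
static void mm_task(const int A[M][N], const int B[N][P], int C[M][P],
                    int phi1, int phi2, int bi, int bj) {
    int rows = M / phi1, cols = P / phi2;
    for (int i = bi * rows; i < (bi + 1) * rows; i++)
        for (int j = bj * cols; j < (bj + 1) * cols; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* scatter/copy/gather collapsed: run all phi1*phi2 tasks in sequence. */
static void mm_all(const int A[M][N], const int B[N][P], int C[M][P],
                   int phi1, int phi2) {
    for (int bi = 0; bi < phi1; bi++)
        for (int bj = 0; bj < phi2; bj++)
            mm_task(A, B, C, phi1, phi2, bi, bj);
}
```

Since no task writes another task's block, any Φ gives the same C, which is exactly the functional consistency the forming parameters must preserve.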
Case Study 3: Fast Fourier Transform
Parameters:
  φ1: radix of the butterfly tasks
  φ2: number of butterfly tasks grouped together
[Figure: instantiated graphs for Φ=(2,1) (radix-2 butterflies), Φ=(4,1) (radix-4 butterflies), and Φ=(2,4)]
Case Study 4: Advanced Encryption Standard (AES)
Basic operations:
  substitute byte (sub): data parallel across elements
  shift row (shf): data parallel across rows
  add round key (ark): data parallel across elements
  mix column (mix): data parallel across columns
φ1...φ4 represent the number of parallel sub, shf, ark, and mix tasks.
[Figure: for Φ=(1,1,1,1), the pipeline ark → sub → shf → mix → ark, repeated 9 times, then sub → shf → ark, with 16-byte tokens; for Φ=(4,2,1,1), sub is split into 4 parallel tasks and shf into 2, with correspondingly smaller tokens]
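To see why shf is data parallel across rows, here is ShiftRows on the 4×4 AES state as defined in FIPS-197 (a reference sketch, not the paper's actor code; the column-major indexing follows the standard):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* AES ShiftRows on the 4x4 byte state (column-major, s[r + 4c] as in
 * FIPS-197): row r is rotated left by r bytes.  Each row is computed
 * independently, so a shf actor can be split across rows (phi2).      */
static void shift_rows(uint8_t s[16]) {
    uint8_t t[16];
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            t[r + 4 * c] = s[r + 4 * ((c + r) % 4)];
    memcpy(s, t, 16);
}
```

Splitting shf with φ2=2 would simply hand rows 0-1 to one task and rows 2-3 to another, with no data exchanged between them.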
Case Study 5: Low Density Parity Check (LDPC)
H matrix (the Tanner graph is constructed by adding an edge for every cell of value 1 in the matrix):
      V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
C1:    0  0  1  0  1  0  0  1  0  1   0   0
C2:    1  0  0  0  1  0  0  0  1  0   1   0
C3:    0  1  0  0  0  1  1  0  0  0   1   0
C4:    0  0  1  1  0  0  1  0  0  0   0   1
C5:    1  0  0  0  0  1  0  1  0  0   0   1
C6:    0  1  0  1  0  0  0  0  1  1   0   0
The task graph is constructed by condensing all C nodes (C1-6) into one and all V nodes (V1-12) into one, and then unrolling the graph (here, 6 times) to remove the bidirectional edges.
Row-split LDPC with split factor φ=2, based on the work by Mohsenin et al., splits the check nodes into C'1-6 and C''1-6 (fed by V1-6 and V7-12); the task graph is constructed from the Tanner graph as before.
[Figure: Tanner graphs and the corresponding condensed, unrolled task graphs]
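The slide's construction rule ("an edge per 1-entry") can be sketched directly from the H matrix shown above (helper names are my own; the sketch only counts edges and node degrees rather than building the full adjacency lists):

```c
#include <assert.h>

/* The 6x12 parity-check matrix from the slide. */
static const int H[6][12] = {
    {0,0,1,0,1,0,0,1,0,1,0,0},  /* C1 */
    {1,0,0,0,1,0,0,0,1,0,1,0},  /* C2 */
    {0,1,0,0,0,1,1,0,0,0,1,0},  /* C3 */
    {0,0,1,1,0,0,1,0,0,0,0,1},  /* C4 */
    {1,0,0,0,0,1,0,1,0,0,0,1},  /* C5 */
    {0,1,0,1,0,0,0,0,1,1,0,0},  /* C6 */
};

/* Walk H, adding one Tanner-graph edge per 1-entry; return the edge
 * count and record the degree of every check and variable node.      */
static int tanner_edges(int check_deg[6], int var_deg[12]) {
    int edges = 0;
    for (int c = 0; c < 6; c++) check_deg[c] = 0;
    for (int v = 0; v < 12; v++) var_deg[v] = 0;
    for (int c = 0; c < 6; c++)
        for (int v = 0; v < 12; v++)
            if (H[c][v]) { edges++; check_deg[c]++; var_deg[v]++; }
    return edges;
}
```

For this H, every check node has degree 4 and every variable node degree 2, which is what makes the condensed two-node task graph well balanced.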
Benchmark Summary
Benchmark  Vector Φ         Δ = Domain of Φ                           |Δ|
AES        (φ1,φ2,φ3,φ4)    δ1=δ2=δ3=δ4={1,2,4}                        81
FFT        (φ1,φ2)          δ1={2,4,8,16}, δ2={1,2,4,...,128}          32
SORT       (φ1,φ2)          δ1={1,...,100}, δ2={1,...,10}, φ1=φ2^n     26
MMUL       (φ1,φ2)          δ1=δ2={1,...,16}                          256
LDPC       (φ1)             δ1={1,2,4,8,16}                             5
FORMLESS Design Flow
- Programmer specifies the application in the proposed malleable format.
- The tool explores the design space to hammer out a specific task graph from the malleable specification, based on the target architecture.
[Flow: Malleable Specification → Design Space Exploration (Task Graph Formation → Task Assignment → Local Scheduling → Throughput Estimation, fed by Task Profiling, with a repeat? loop) → Instantiated Task Graph → Baseline Software Synthesis (Task Assignment → Backend Optimization → Code Generation) → .c files → SEAM Simulator / FPGA-prototyped platform]
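The exploration loop in the flow above amounts to: for each candidate forming vector, form the graph, assign, schedule, estimate throughput, and keep the best. A skeletal sketch over a 2-parameter domain (the estimator below is a toy stand-in of my own; the real flow profiles tasks and simulates the schedule):

```c
#include <assert.h>

/* Estimate throughput for one forming vector; stands in for the
 * form-graph + assign + schedule + estimate pipeline.              */
typedef double (*estimator_fn)(int phi1, int phi2);

/* Exhaustive DSE: scan the domain d1 x d2, keep the best vector.   */
static void dse(const int *d1, int n1, const int *d2, int n2,
                estimator_fn estimate, int *best1, int *best2) {
    double best = -1.0;
    for (int i = 0; i < n1; i++)
        for (int j = 0; j < n2; j++) {
            double thr = estimate(d1[i], d2[j]);
            if (thr > best) { best = thr; *best1 = d1[i]; *best2 = d2[j]; }
        }
}

/* Toy estimator: throughput peaks when phi1 matches a hypothetical
 * core count of 8, with a small bonus for larger fan-in.  Purely
 * illustrative, not a real performance model.                      */
static double toy_estimate(int phi1, int phi2) {
    int dist = (phi1 > 8) ? phi1 - 8 : 8 - phi1;
    return 1.0 / (1 + dist) + 0.01 * phi2;
}
```

Exhaustive scanning is feasible here because the domains are small (|Δ| between 5 and 256 in the benchmark summary); the coverage results later in the deck show that even a fraction of Δ usually suffices.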
Experiment Setup
- Emulated multicore systems with Nios soft processors on FPGA.
- Cores: Nios II/f @ 200 MHz with a floating point unit. Memory: 8 KB instruction and 32 KB data caches, DDR2 main memory. Network: adjacent cores connected by point-to-point FIFOs of depth 1024.
- Area constraint: only 8 cores fit on the FPGA. For more cores we use the Sequential Execution Abstraction Model (SEAM): simulate the entire system with ModelSim, and simplify the sequential sections (no inter-core communication) into wait functions to speed up the simulation. Our previous experiments show the error is small [Huang '08].
Sequential Execution Abstraction Model (SEAM)
[Figure: task graph mapped onto processors P1-P5]

// code3.c
while (1) {
  for i=1..n  S[i] = readf()
  for i=1..n  X[i] = S[i] + S[n-i]
  for i=1..n  Y[i] = readf()
  for i=1..n  Z[i] = X[i] * Y[n-i]
  for i=1..n  writef(Z[i])
}

System-level behavior extraction replaces the compute loops with waits:

// code3.v
while (1) begin
  read ( N1 )
  #W1
  read ( N2 )
  #W2
  write( N3 )
end
Experiment Results
- Application throughput on manycore platforms, normalized with respect to single-core throughput.
- DSE-instantiated task graphs have higher throughput than fixed task graphs.
[Figure: normalized throughput vs. number of cores (up to 100) for Φopt vs. rigid graphs, with FPGA measurement points: AES (rigid Φ=(2,2,1,2)), FFT (rigid Φ=(2,16)), SORT (rigid Φ=(3,3)), MMUL (rigid Φ=(5,5)), LDPC (rigid Φ=(4))]
Number of Parameters
[Figure: coverage (%) and throughput degradation (%) vs. the number of forming parameters considered: AES (81 total vectors; coverage 89%; msq 11%, max 41%), FFT (32 total; 88%; msq 6.1%, max 36%), SORT (26 total; 96%; msq 23%, max 12%), MMUL (256 total; 93%; msq 33%, max 9.8%), LDPC (5 total; 98%; msq 6.0%)]
- Coverage and throughput degradation vs. the number of forming vectors considered by the DSE.
- On average, 15% of forming vectors are enough to form task graphs with at most 10% throughput degradation.