System Design Using Kahn Process Networks: The Compaan/Laura Approach
Bart Kienhuis, Assistant Professor
LIACS, Leiden University, The Netherlands
DSP Performance Requirements
[Figure: required performance in billion MAC/s (axis 0 to 2500) per year (2000 to 2006). Market requirements (2.5G, Voice over IP, Video over IP, MPEG4, HDTV, 3G Wireless/WCDMA, future broadband standards) rise faster than general-purpose DSP/CPU performance, leaving an increasing gap.]
Applications have a ferocious appetite for more programmable compute power
Source: TI, Xilinx – 1 MAC = 8 bit Multiply-Accumulate
Embedded DSP Architectures
[Figure: weakly coupled processing elements (CPUs, RPUs, IP cores, microprocessors, and memories) connected by a programmable interconnect (NoC).]
• CPU: a simple microprocessor
• RPU: Reconfigurable Processing Unit
• IPcore: dedicated accelerator block
• NoC: Network on a Chip
Programming Problem
  for j = 1:1:N,
    [x(j)] = Source1( );
  end
  for i = 1:1:K,
    [y(i)] = Source2( );
  end
  for j = 1:1:N,
    for i = 1:1:K,
      [y(i), x(j)] = F( y(i), x(j) );
    end
  end
  for i = 1:1:K,
    [Out(i)] = Sink( y(i) );
  end
Sequential Application Specification
• EASY to specify
• DIFFICULT to map
Parallel Application Specification
• EASY to map
• DIFFICULT to specify
[Figure: Compaan converts the sequential application into a parallel specification, a process network with Source, S1, P1 to P4, and Sink. Laura programs that network onto the platform of CPUs, RPUs, IP cores, microprocessors, and memories connected by a programmable interconnect (NoC).]
Outline
• The programming problem
• Kahn Process Networks
• System Design: Compaan/Laura Approach
• Case-study M-JPEG
• Conclusions
Embedded DSP Architectures
• Distributed Control
• Distributed Memory
To satisfy the computational requirements, these architectures have to exploit:
• Task-level Parallelism
• Instruction Parallelism
[Figure: the same platform of CPUs, RPUs, IP cores, microprocessors, and memories on a programmable interconnect (NoC), annotated with the two kinds of parallelism.]
QR Algorithm (smart antennas)
Matlab Code (QR Algorithm):
  %parameter N 8 16;
  %parameter K 100 1000;
  for k = 1:1:K,
    for j = 1:1:N,
      [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
      for i = j+1:1:N,
        [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
      end
    end
  end
• Sequentially ordered
• Matrices are located in a big global memory
• QR is a simple program, but it keeps your CPU very busy
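A minimal sketch of what the two functions in this loop nest might compute, assuming Vectorize derives a Givens rotation that zeroes x(k,j) against r(j,j) and Rotate applies that rotation to the remaining entries of the row. The slides only name the functions; the bodies, the (c, s) pair used for t, and the helper `qr_update` are assumptions for illustration.

```python
import math

def vectorize(r_jj, x_kj):
    """Assumed behavior: find the rotation that zeroes x_kj against r_jj."""
    norm = math.hypot(r_jj, x_kj)
    if norm == 0.0:
        return 0.0, 0.0, (1.0, 0.0)      # nothing to rotate
    t = (r_jj / norm, x_kj / norm)       # (cosine, sine) of the rotation
    return norm, 0.0, t

def rotate(r_ji, x_ki, t):
    """Assumed behavior: apply the rotation t to one (r, x) pair."""
    c, s = t
    return c * r_ji + s * x_ki, -s * r_ji + c * x_ki, t

def qr_update(samples, n):
    """Same loop nest as the Matlab code, with 0-based indices."""
    r = [[0.0] * n for _ in range(n)]
    for row in samples:                  # loop over k
        x = list(row)
        for j in range(n):
            r[j][j], x[j], t = vectorize(r[j][j], x[j])
            for i in range(j + 1, n):
                r[j][i], x[i], t = rotate(r[j][i], x[i], t)
    return r
```

Every iteration reads and writes r and x in place, which is why the sequential version depends on one large shared memory.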
Solution
• Change the model of computation so that it better fits the model of architecture.
• Make sure the data type has precisely the format that fits the architecture (e.g., streams).
• Which model of computation fits this description for Digital Signal Processing (DSP), imaging, and multimedia applications?
Kahn Process Networks
Kahn Process Network (KPN)
Kahn Process Networks [Kahn 1974] [Parks&Lee 95]
– Processes run autonomously
– Communicate via unbounded FIFOs
– Synchronize via blocking read
A process is either
– executing (execute)
– communicating (send/get)
Properties
– Deterministic
– Distributed Control
– Distributed Memory
[Figure: processes A, B, and C, executing functions Fa, Fb, and Fc, are connected by FIFOs; each process repeats a cycle of get, execute, and send.]
Kahn Process Network (KPN)
[Figure: Process A, Process B, and Process C, connected by FIFOs, are mapped onto FPGA A, FPGA B, and CPU 1.]
• Autonomously operating processes; no global schedule needed
• Blocking read is simple to realize in hardware
• Buffer sizes of the FIFOs are quite often very small
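The KPN rules can be sketched in a few lines of Python, assuming threads stand in for processes and `queue.Queue` for the unbounded FIFOs. The three-process pipeline, the functions applied, and the `None` end-of-stream marker are assumptions of this sketch, not part of the Kahn model or of Compaan.

```python
import queue
import threading

def producer(out_fifo, n):
    for value in range(n):
        out_fifo.put(value)              # send
    out_fifo.put(None)                   # end-of-stream marker (sketch only)

def worker(in_fifo, out_fifo, func):
    while True:
        token = in_fifo.get()            # blocking read: waits until data arrives
        if token is None:
            out_fifo.put(None)
            return
        out_fifo.put(func(token))        # execute, then send

def run_network(n):
    a_to_b, b_to_c, results = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=producer, args=(a_to_b, n)),
        threading.Thread(target=worker, args=(a_to_b, b_to_c, lambda v: v * v)),
        threading.Thread(target=worker, args=(b_to_c, results, lambda v: v + 1)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    output = []
    while True:
        token = results.get()
        if token is None:
            return output
        output.append(token)
```

No scheduler appears anywhere: each thread runs on its own and only stalls on an empty FIFO. The output is the same on every run regardless of thread interleaving, which is the determinism property listed above.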
The Compaan Tool Chain
Matlab Program:
  %parameter N 8 16;
  %parameter K 100 1000;
  for k = 1:1:K,
    for j = 1:1:N,
      [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
      for i = j+1:1:N,
        [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
      end
    end
  end
[Figure: the Compaan tool chain. MatParser turns the Matlab program into SAC, DgParser turns SAC into a Polyhedral Reduced Dependence Graph (PRDG), and Panda turns the PRDG into a Kahn Process Network. The chain performs data dependency analysis and linearization. For the Matlab application shown, the graph contains the nodes inputSamples, initialR, Vectorize, Rotate, and outputR; the resulting process network is drawn with Source, S1, P1 to P3, and Sink.]
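Data Dependency Analysis
  for i = 1:1:N,
    for j = 1:1:N,
      [ a(i+j) ] = funcA( a(i+j) );
    end
  end
The for-next loops define an Iteration Domain, a polytope of the form Ax >= b.
[Figure: the iteration domain for N = 6, drawn over the axes i and j. The diagonal i+j = 6 is marked; the neighboring iteration (i+1, j-1) lies on the same diagonal and accesses the same element a(i+j).]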
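A minimal sketch of this analysis for the loop nest above. It enumerates the iteration domain and records, for each iteration, which earlier iteration last wrote the element a(i+j) it reads. Compaan derives such dependences symbolically at compile time; the brute-force enumeration and the function names here are assumptions for illustration only.

```python
def in_polytope(point, n):
    """Check A x >= b for the square domain 1 <= i <= n, 1 <= j <= n."""
    a = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    b = [1, -n, 1, -n]
    i, j = point
    return all(row[0] * i + row[1] * j >= rhs for row, rhs in zip(a, b))

def iteration_domain(n):
    """Integer points of the polytope, in sequential execution order."""
    return [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]

def flow_dependences(n):
    """Map each iteration to the iteration that produced the a(i+j) it reads."""
    last_writer = {}
    deps = {}
    for i, j in iteration_domain(n):
        index = i + j
        deps[(i, j)] = last_writer.get(index)   # None: value of the initial array
        last_writer[index] = (i, j)
    return deps
```

For this program every interior iteration (i, j) depends on (i-1, j+1), so the dependence runs along the diagonals i+j = constant.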
Polyhedral Reduced Dependence Graph
Matlab Code (QR Algorithm):
  %parameter N 8 16;
  %parameter K 100 1000;
  for k = 1:1:K,
    for j = 1:1:N,
      [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
      for i = j+1:1:N,
        [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
      end
    end
  end
[Figure: dependence graph of the QR algorithm over the axes k, j, and i, with Vectorize (Vec) and Rotate (Rot) nodes.]
Polyhedral Reduced Dependence Graph
[Figure: a PRDG with nodes A, B, C, D, and E; the polytopes of nodes C and D are shown.]
Linearization
• Linearization is the process of mapping high-order data structures (e.g., matrices) onto a 1-D stream.
• We replace the indexing of the variables x(j,i) and x(n-1,m) by relative put and get operations on a FIFO buffer (unboxing).
• Is this always possible?
Before linearization, producer and consumer communicate through global memory.
Producer:
  for j = 1:1:5,
    for i = j:1:5,
      [ x(j,i) ] = F1();
    end
  end
Consumer:
  for j = 1:1:5,
    for i = j:1:5,
      F2( x(n-1,m) );
    end
  end
After linearization, producer and consumer communicate through a FIFO.
Producer:
  for j = 1:1:5,
    for i = j:1:5,
      [ out ] = F1();
      FIFO.put(out);
    end
  end
Consumer:
  for j = 1:1:5,
    for i = j:1:5,
      in = FIFO.get();
      F2(in);
    end
  end
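A small sketch of the question "is this always possible?", assuming a plain FIFO is enough exactly when the consumer asks for elements in the order the producer wrote them. The column-wise consumer is a hypothetical counterexample, not taken from the slides; it illustrates the out-of-order communication case addressed in the Turjan et al. papers listed under Publications.

```python
from collections import deque

def producer_order():
    """Write order of x(j,i) in the producer loop nest (j outer, i from j to 5)."""
    return [(j, i) for j in range(1, 6) for i in range(j, 6)]

def consumer_order_columnwise():
    """Hypothetical consumer walking the same triangle column by column."""
    return [(j, i) for i in range(1, 6) for j in range(1, i + 1)]

def fifo_is_sufficient(writes, reads):
    """True if every get() returns exactly the element the consumer needs next."""
    fifo = deque(writes)
    for wanted in reads:
        if not fifo or fifo.popleft() != wanted:
            return False
    return not fifo
```

With identical loop nests on both sides the check succeeds, so put/get can replace the indexing. With the column-wise consumer it fails, so extra reordering memory would be needed.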
KPN Hand Off
[Figure: the Kahn Process Network of the QR algorithm, with processes (a) rotate and (b) vectorize, is handed off along two paths.]
• Laura: synthesizable VHDL for an FPGA
• Functional simulation in Ptolemy II (Ptolemy actor models in Java) or in C++/YAPI
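The Laura Tool
[Figure: the Laura flow from a KPN to synthesizable code.]
Platform independent
– KPN
– KPNToArchitecture produces a Network of Virtual Processors
Platform dependent
– Mapping, using a library of IP cores, produces a Network of Synthesizable Processors
– Output in VHDL, Verilog, or SystemC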
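The Laura Tool
[Figure: KPNToArchitecture converts a Kahn Process Network (processes P1, P2, and P3 connected by channels Ch1, Ch2, and Ch3) into an abstract architecture (virtual processors VP1, VP2, and VP3 connected by FIFO1, FIFO2, and FIFO3). The input ports IP1 and IP2 and output ports OP1 and OP2 of P2 reappear on VP2.]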
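The Laura Tool
Structure of an individual processor
[Figure: data flows from the input FIFOs through a MUX into the Read Unit, then through the Execution Unit (IP core) into the Write Unit, and through a DeMUX to the output FIFOs. A Controller with control tables steers the Read Unit and the Write Unit.]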
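The processor structure can be sketched as a small software model, assuming the control tables simply list, per firing, which input FIFO to read and which output FIFO to write. The class name, the table format, and the `fire` method are hypothetical; Laura generates hardware descriptions, not Python.

```python
from collections import deque

class VirtualProcessor:
    """Read unit + execute unit + write unit, steered by two control tables."""

    def __init__(self, core, read_table, write_table, n_inputs, n_outputs):
        self.core = core                      # stands in for the IP core
        self.read_table = read_table          # input FIFO index per firing
        self.write_table = write_table        # output FIFO index per firing
        self.inputs = [deque() for _ in range(n_inputs)]
        self.outputs = [deque() for _ in range(n_outputs)]
        self.step = 0

    def fire(self):
        """One read/execute/write cycle; False means the processor stalls or is done."""
        if self.step >= len(self.read_table):
            return False
        source = self.inputs[self.read_table[self.step]]   # MUX selection
        if not source:
            return False                      # blocking read: wait for data
        result = self.core(source.popleft())
        self.outputs[self.write_table[self.step]].append(result)  # DeMUX selection
        self.step += 1
        return True
```

A firing that finds its selected FIFO empty leaves the state untouched, which mirrors the blocking-read behavior of a Kahn process.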
System Design Flow
The tools in action
– M-JPEG example based on the original C code of the Portable Video Research Group, Stanford University.
– Simple target platform
  • Common PC platform
  • With FPGA board
Motion JPEG encoder
[Figure: a video stream (4:2:2 YUV format), a sequence of T frames of size dimH by dimV, passes through JPEG encoding and leaves as the M-JPEG encoded video stream; the output is labeled with the observed bitrate.]
Target Platform
[Figure: a Pentium IV microprocessor is connected over the PCI bus to an ADM-XRCII board carrying a Virtex-II 2V6000 FPGA. On the board, a host interface and a multiplexer connect the HW design to the memory banks through address, data, control, select, and status signals.]
System Design Flow
[Figure: the application in Matlab is compiled by the Compaan compiler into a KPN. HW/SW partitioning (workload analysis) splits the KPN into a software path and a hardware path.]
Software Path
– Software processes (YAPI)
– GCC/V++ SW compiler
– Object code
– SW programming of the Pentium IV microprocessor
Hardware Path
– Hardware processes (Matlab)
– Compaan/Laura HW compiler
– Hardware description in VHDL
– HW programming of the Virtex-II 2V6000 FPGA on the ADM-XRCII board
M-JPEG Specification in Matlab
Parameterized:
  %parameter NumFrames 1 1000;
  %parameter VNumBlocks 16 256;
  %parameter HNumBlocks 8 256;

  [ QTables, HuffTables, TablesInfo, EndOfFrame ] = P2_l_DefaultTables( );
  for k = 1:1:NumFrames,
    [ HeaderInfo ] = P1_l_VideoInInit( );
    for j = 1:1:VNumBlocks,
      for i = 1:1:HNumBlocks,
        [ Block( j, i ) ] = P1_l_VideoInMain( );
      end
    end
    for j = 1:1:VNumBlocks,
      for i = 1:1:HNumBlocks,
        [ Block( j, i ) ] = DCT( Block( j, i ) );
      end
    end
    for j = 1:1:VNumBlocks,
      for i = 1:1:HNumBlocks,
        [ Block( j, i ) ] = Q( Block( j, i ), QTables );
        [ Packets, StatisticsB ] = VLE( Block( j, i ), EndOfFrame, HuffTables );
        [ BitRate, StatisticsF, EndOfFrame ] = CtrlF1( StatisticsB );
        [ ] = VideoOut( HeaderInfo, TablesInfo, Packets );
      end
    end
    [ QTables, HuffTables, TablesInfo ] = P2_l_CtrlF2( BitRate, StatisticsF, QTables, HuffTables, TablesInfo );
  end
Deriving a KPN
[Figure: the Compaan compiler turns the application in Matlab into a KPN. The KPN is used for functional verification in the Ptolemy II PN domain and, as YAPI/C++, for the workload analysis that drives the HW/SW partitioning.]
The KPN of M-JPEG
[Figure: the processes P1, DCT, Q, VLE, and VOut form a pipeline that passes Blocks from P1 to VLE and Packets from VLE to VOut. The processes CtrlF1 and P2 handle control. The remaining channels carry HeaderInfo, TablesInfo, QTables, HuffTables, StatisticsB, StatisticsF, BitRate, and EndOfFrame.]
  struct Block {
    int Y1[64]; /* block 8x8 pixels */
    int Y2[64]; /* block 8x8 pixels */
    int U[64];  /* block 8x8 pixels */
    int V[64];  /* block 8x8 pixels */
  };
Interface Code DCT
• The DCT process is selected to move to hardware.
• Interface code is needed to run with the software processes.
• Observe that ‘Blocks’ are being moved to the FPGA and from the FPGA.
  void DCT::main() {
    // NumFrames = 100;
    // VNumBlocks = 16;
    // HNumBlocks = 8;
    for ( int k=1; k <= NumFrames; k++ ) {
      for ( int j=1; j <= VNumBlocks; j++ ) {
        for ( int i=1; i <= HNumBlocks; i++ ) {
          read( inPort, inBlock );
          outBlock = DCT( inBlock );
          write( outPort, outBlock );
        }
      }
    }
  }
Matlab of the DCT Process
  for k = 1:1:4,
    for j = 1:1:64,
      [ Pixel( k, j ) ] = Source( inBlock );
    end
  end
  for k = 1:1:4,
    if k <= 2,
      for j = 1:1:64,
        [ Pixel( k, j ) ] = PreShift( Pixel( k, j ) );
      end
    end
    for j = 1:1:64,
      [ Block ] = P_l_PixelsToBlock( Pixel( k, j ) );
    end
    [ Block ] = P_l_2D_dct( Block );
    for j = 1:1:64,
      [ Pixel( k, j ) ] = P_l_BlockToPixels( Block );
    end
  end
  for k = 1:1:4,
    for j = 1:1:64,
      [ outBlock ] = Sink( Pixel( k, j ) );
    end
  end
Of the DCT process, we make a new Matlab program
– This exposes more parallelism at a finer level.
– Automatic conversion from Blocks to Stream (Linearization)
Compaan produces by default a process per function call.
– However, using a preamble such as ‘P_l_’ (as in P_l_2D_dct) we can group function calls into one process.
KPN Sub network DCT
[Figure: hierarchical subnet of DCT. Inside the DCT process of the M-JPEG KPN, Source receives inBlock and sends Pixel streams to PreShift and to P; PreShift sends its Pixels to P; P sends Pixels to Sink, which produces outBlock. The surrounding M-JPEG network (P1, P2, DCT, Q, VLE, CtrlF1, VOut) is unchanged.]
Programming the CPU
[Figure: M-JPEG specified in YAPI is compiled by a C++ compiler into a YAPI executable, which runs in the YAPI multithreading environment on the Pentium IV.]
Laura DCT Hardware Model
Abstract Hardware Model: Network of Virtual Processors
[Figure: one-to-one mapping of the DCT subnet onto virtual processors VP1 (Source), VP2 (PreShift), VP3 (P), and VP4 (Sink), connected by FIFO1 to FIFO4. VP3 has input ports IP1 and IP2 and output port OP1.]
• Sink/Source do the type conversion
Laura DCT Hardware Model
• To get the functionality of the Virtual Processor, we integrated an IP core.
• We have taken the core (2D-DCT) from the Xilinx website.
• Make the processor specific for a platform
  – Determine the bit width
  – Determine the FIFO sizes
  – Take into account the clock
  – Determine the control tables for the switches
[Figure: virtual processor VP3 (P) in detail. The Read Unit uses a MUX to select between input ports IP1 and IP2, which are fed by FIFOs; the Execute Unit is the IP core implementing the 2D-DCT (Xilinx), with ports in_0 and out_0; the Write Unit drives output port OP1 into FIFO4. A Control Unit with a control table and synchronization logic steers the Read Unit and the Write Unit.]
Hw/Sw Solution for M-JPEG
[Figure: the target platform with the software processes running in the YAPI multithreading environment on the Pentium IV, connected over the PCI bus to the hardware on the Virtex-II 2V6000 FPGA of the ADM-XRCII board (host interface, multiplexer, memory banks).]
The way it is programmed, the CPU and FPGA run in parallel.
Processing Time M-JPEG
Step                    Compaan   Laura     Other tools  Manually  Total
M-JPEG -> KPN           00:00:22  --        --           00:30:00  00:30:22
Software Compilation    --        --        00:00:35     --        00:00:35
DCT Subnet Compilation  00:00:08  --        --           --        00:00:08
Laura                   --        00:00:07  --           03:00:00  03:00:07
Synthesis to FPGA       --        --        00:13:10     --        00:13:10
Overall                 00:00:30  00:00:07  00:13:45     03:30:00  03:44:22
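The Overall row can be re-derived from the individual entries. The sketch below only restates the table; the dictionary layout and the function names are choices made for this illustration.

```python
def to_seconds(hms):
    """Convert an hh:mm:ss string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds

# One entry per table row; empty cells ("--") are left out.
STEPS = {
    "M-JPEG -> KPN": {"Compaan": "00:00:22", "Manually": "00:30:00"},
    "Software Compilation": {"Other tools": "00:00:35"},
    "DCT Subnet Compilation": {"Compaan": "00:00:08"},
    "Laura": {"Laura": "00:00:07", "Manually": "03:00:00"},
    "Synthesis to FPGA": {"Other tools": "00:13:10"},
}

def column_total(column):
    return sum(to_seconds(row[column]) for row in STEPS.values() if column in row)

def overall_total():
    return sum(to_seconds(t) for row in STEPS.values() for t in row.values())

def manual_share():
    """Fraction of the total design time spent on manual work."""
    return column_total("Manually") / overall_total()
```

The tools themselves account for well under a minute; manual work makes up roughly 94% of the 03:44:22 total.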
Device Utilization for the DCT
FPGA resource          Utilization        %
Number of MULT18x18s   8 out of 144       5%
Number of RAMB16s      4 out of 144       2%
Number of SLICEs       2367 out of 33792  7%
Number of BUFGMUXs     2 out of 16        12%
Virtex-II 2V6000 FPGA (taking 4% of the FPGA)
Real-time performance M-JPEG
Throughput of the system
– 10.5 CIF frames (128x128) per second
  • Running Windows 2000
  • Simple compiler
  • Simple multithreading architecture
Required is 25 frames per second
– Communication between FPGA and CPU is too slow (PCI)
However,
– 64 bit PCI
– Running at 66 MHz
– 4 times increase in performance
Then 25 frames per second (128x128) is not a problem
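The factor of 4 can be reproduced with a back-of-the-envelope calculation. It assumes that the measured system used a 32-bit, 33 MHz PCI bus (the slides do not state the baseline) and that throughput scales linearly with peak bus bandwidth, which holds only if the PCI transfer is the sole bottleneck.

```python
def pci_peak_bandwidth(bus_bits, clock_mhz):
    """Theoretical peak in MB/s: bytes per transfer times million transfers per second."""
    return bus_bits / 8 * clock_mhz

def scaled_frame_rate(measured_fps, old_bus, new_bus):
    """Scale a measured frame rate by the ratio of peak bus bandwidths."""
    speedup = pci_peak_bandwidth(*new_bus) / pci_peak_bandwidth(*old_bus)
    return measured_fps * speedup

# Assumed baseline (32 bit, 33 MHz) versus the bus named on the slide (64 bit, 66 MHz).
estimate = scaled_frame_rate(10.5, old_bus=(32, 33), new_bus=(64, 66))
```

Under these assumptions the estimate is 42 frames per second, above the required 25.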
Exploration
  for j = 1:1:N,
    [x(j)] = Source1( );
  end
  for i = 1:1:K,
    [y(i)] = Source2( );
  end
  for j = 1:1:N,
    for i = 1:1:K,
      [y(i), x(j)] = F( y(i), x(j) );
    end
  end
  for i = 1:1:K,
    [Out(i)] = Sink( y(i) );
  end
[Figure: Generate, Map, Explore. From the Matlab program, alternative application instances KPN_1 to KPN_5 are generated. They all have the sources S1 and S2 and a Sink but differ in the number of processes between them, from four processes (P1 to P4) down to a single process P. Each instance is mapped onto the platform of CPUs, RPUs, IP cores, microprocessors, and memories on a programmable interconnect (NoC) and explored.]
Unrolling/Unfolding
  %parameter N 100 1000;
  %parameter K 8 48;
  for j = 1:1:N,
    for i = 1:1:K,
      [y(i), x(j)] = F(y(i), x(j));
    end
  end
[Figure: dependence graph of F over the axes j and i, with inputs x(1) to x(4) and y(1) to y(3), shown before and after unfolding. Labels in the figure: Compaan, U = [ N, K ], "Difficult to derive".]
After MatTransform:
  for j = 1:1:N,
    if mod( j, 2 ) = 1,
      for i = 1:1:K,
        [y(i), x(j)] = F(y(i), x(j));
      end
    end
    if mod( j, 2 ) = 0,
      for i = 1:1:K,
        [y(i), x(j)] = F(y(i), x(j));
      end
    end
  end
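A minimal check that the unfolded program computes the same result as the original. F is replaced by an arbitrary stand-in function, and the labels F_odd and F_even are hypothetical names for the two copies of the call, which Compaan would turn into separate processes.

```python
def f(y_i, x_j):
    """Stand-in for F; any function with two inputs and two outputs works here."""
    return y_i + x_j, x_j + 1

def original(n, k):
    x, y, trace = list(range(n)), [0] * k, []
    for j in range(1, n + 1):
        for i in range(1, k + 1):
            y[i - 1], x[j - 1] = f(y[i - 1], x[j - 1])
            trace.append(("F", j, i))
    return x, y, trace

def unfolded(n, k):
    x, y, trace = list(range(n)), [0] * k, []
    for j in range(1, n + 1):
        if j % 2 == 1:                        # first copy handles odd j
            for i in range(1, k + 1):
                y[i - 1], x[j - 1] = f(y[i - 1], x[j - 1])
                trace.append(("F_odd", j, i))
        if j % 2 == 0:                        # second copy handles even j
            for i in range(1, k + 1):
                y[i - 1], x[j - 1] = f(y[i - 1], x[j - 1])
                trace.append(("F_even", j, i))
    return x, y, trace
```

Both versions visit the same iterations in the same order; the unfolded one only splits the calls to F over two call sites, each taking half of the work.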
Retiming/Skewing
Original program (difficult for Compaan to derive the skewed network from directly):
  %parameter N 100 1000;
  %parameter K 8 48;
  for j = 1:1:N,
    for i = 1:1:K,
      [y(i), x(j)] = F(y(i), x(j));
    end
  end
Skewing matrix, applied to the index vector (j, i):
$M = \begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$
Skewed program:
  for j = 2:1:N+K,
    for i = max(1, j-N):1:min(j-1, K),
      [y(i), x(j-i)] = F(y(i), x(j-i));
    end
  end
Unfolding vector U = [ u1, u2 ] = [2, 1]
Skewed and unfolded program:
  for j = 2:1:N+K,
    if mod( j, 2 ) = 1,
      for i = max(1, j-N):1:min(j-1, K),
        [y(i), x(j-i)] = F(y(i), x(j-i));
      end
    end
    if mod( j, 2 ) = 0,
      for i = max(1, j-N):1:min(j-1, K),
        [y(i), x(j-i)] = F(y(i), x(j-i));
      end
    end
  end
[Figure: dependence graph of F over the axes j and i, with inputs x(1) to x(4) and y(1) to y(3), shown for the original, the skewed, and the skewed and unfolded program.]
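A short sketch that checks the skewed loop bounds, assuming the new outer index equals the original j plus i. It maps every point of the skewed nest back to the original (j, i) pair; the function names are hypothetical.

```python
def original_domain(n, k):
    return {(j, i) for j in range(1, n + 1) for i in range(1, k + 1)}

def wavefronts(n, k):
    """One list of original (j, i) points per value of the skewed outer index."""
    fronts = []
    for js in range(2, n + k + 1):
        lower, upper = max(1, js - n), min(js - 1, k)
        fronts.append([(js - i, i) for i in range(lower, upper + 1)])
    return fronts

def skewed_order(n, k):
    """Execution order of the skewed program, expressed in original indices."""
    return [point for front in wavefronts(n, k) for point in front]
```

The skewed nest covers exactly the original iteration domain. Each call to F needs y(i) from iteration (j-1, i) and x(j) from iteration (j, i-1), and both lie on the previous wavefront, so the points within one wavefront are independent of each other.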
Design Space Exploration
[Figure: exploration loop. The Matlab application and the initial parameter values enter retiming/unrolling, which produces Matlab code for the Compaan compiler. The resulting process network is mapped with Laura/XFT onto a Xilinx Virtex-II. Performance analysis yields performance numbers, which lead to new parameter values for the next iteration.]
Conclusions
• To satisfy tomorrow’s applications, we will see hierarchical multiprocessor systems with a number of CPUs, memories, IP cores, and RPUs.
• Programming these systems will be difficult unless the MoC is changed to take concurrency into account. The key items will be
  – Distributed Memory
  – Distributed Control
• Kahn Process Networks seem to be a very promising programming formalism for tomorrow’s HW/SW codesign platforms.
Conclusions
• We showed a proof of concept with a case in which we convert M-JPEG into a KPN whose processes are mapped onto either hardware or software.
• In the M-JPEG case, the hardware and software were running concurrently, exploiting task-level parallelism.
• Having good tools, we can start from (legacy) code in Matlab, C, or other imperative languages.
• The results are just the beginning. There is more to achieve when more mature / commercial products are used (RTOS, Compiler, Target Platform, Virtex Pro).
Publications
• Bart Kienhuis, Edwin Rijpkema, and Ed F. Deprettere, ``Compaan: Deriving Process Networks from Matlab for Embedded Signal Processing Architectures'', 8th International Workshop on Hardware/Software Codesign (CODES'2000), May 3-5, 2000, San Diego, CA, USA.
• Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, ``A compile time based approach for solving out-of-order communication in Kahn Process Networks'', in Proceedings of the IEEE 13th International Conference on Application-specific Systems, Architectures and Processors (ASAP'2002), San Jose, CA, USA, July 17-19, 2002.
• Tim Harriss, Richard Walke, Bart Kienhuis, and Ed Deprettere, ``Compilation from Matlab to Process Networks Realized in FPGA'', Design Automation of Embedded Systems, Kluwer, Vol. 7, Issue 4, 2002.
• Todor Stefanov, Bart Kienhuis, and Ed Deprettere, ``Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances'', in Proceedings of the Tenth International Symposium on Hardware/Software Codesign (CODES'2002), Stanley Hotel, Estes Park, Colorado, USA, May 6-8, 2002.
• Edwin Rijpkema, ``Modeling Task Level Parallelism in Piece-wise Regular Programs'', PhD thesis, Leiden University, Leiden Institute of Advanced Computer Science (LIACS), The Netherlands, Sept 2002.
• Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, ``Solving out-of-order communication in Kahn Process Networks'', submitted for publication in Journal on VLSI Signal Processing-Systems for Signal, Image, and Video Technology, Kluwer Academic Publishers, 2003.
• Claudiu Zissulescu, Todor Stefanov, Bart Kienhuis, and Ed Deprettere, ``Laura: Leiden Architecture Research and Exploration Tool'', submitted to The International Conference on Field Programmable Logic and Applications, September 1-3, 2003, Lisbon, Portugal.
• Todor Stefanov, Claudiu Zissulescu, Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, ``System Design using Kahn Process Networks: The Compaan/Laura Approach'', submitted for review to ICCAD, November 9-13, 2003, San Jose, CA, USA.