Source: ptolemy.eecs.berkeley.edu/~kienhuis/ftp/ESDtalk2003.pdf

System Design Using Kahn Process Networks: The Compaan/Laura Approach

Bart Kienhuis, Assistant Professor, LIACS, Leiden University, The Netherlands

DSP Performance Requirements

[Figure: required performance in billion MAC/s (axis 0 to 2500) over the years 2000 to 2006. Applications shown: Voice over IP, 2.5G, Video over IP, 3G Wireless/WCDMA, MPEG4, HDTV, future broadband standards. Market requirements increase faster than general-purpose DSP/CPU performance, leaving a gap.]

Applications have a ferocious appetite for more programmable compute power.

Source: TI, Xilinx. 1 MAC = 8-bit multiply-accumulate.

Embedded DSP Architectures

[Figure: weakly coupled processing elements (CPU, RPU, IPcore, Micro Processor, Memory, ...) connected by a Programmable Interconnect (NoC).]

• CPU: a simple microprocessor
• RPU: Reconfigurable Processing Unit
• IPcore: dedicated accelerator block
• NoC: Network on a Chip

Programming Problem

Sequential application specification: EASY to specify, DIFFICULT to map.

  for j = 1:1:N,
    [x(j)] = Source1( );
  end
  for i = 1:1:K,
    [y(i)] = Source2( );
  end
  for j = 1:1:N,
    for i = 1:1:K,
      [y(i), x(j)] = F( y(i), x(j) );
    end
  end
  for i = 1:1:K,
    [Out(i)] = Sink( y(i) );
  end

Parallel application specification: EASY to map, DIFFICULT to specify.

[Figure: a process network (Source, S1, P1, P2, P3, P4, Sink) is programmed onto the architecture (CPU, RPU, IPcore, Micro Processor, and Memory on a Programmable Interconnect (NoC)). Compaan takes the sequential application to the process network; Laura takes the process network to the architecture.]

Outline

• The programming problem
• Kahn Process Networks
• System Design: Compaan/Laura Approach
• Case-study M-JPEG
• Conclusions

Embedded DSP Architectures

To satisfy the computational requirements, these architectures have to exploit:
• Task-level Parallelism
• Instruction Parallelism

• Distributed Control
• Distributed Memory

[Figure: CPU, RPU, IPcore, Micro Processor, and Memory blocks on a Programmable Interconnect (NoC).]

QR Algorithm (smart antennas)

Matlab Code (QR Algorithm):

  %parameter N 8 16;
  %parameter K 100 1000;

  for k = 1:1:K,
    for j = 1:1:N,
      [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
      for i = j+1:1:N,
        [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
      end
    end
  end

• The statements are sequentially ordered.
• The matrices are located in a big global memory.

QR is a simple program, but it keeps your CPU very busy.

Solution

• Change the model of computation in such a way that it better fits the model of architecture.
• Make sure the data type is of precisely the format that fits the architecture (e.g., streams).
• What model of computation fits this description for Digital Signal Processing (DSP), imaging, and multimedia applications?

Kahn Process Networks

Kahn Process Network (KPN)

Kahn Process Networks [Kahn 1974] [Parks & Lee 95]
– Processes run autonomously
– Communicate via unbounded FIFOs
– Synchronize via blocking read

A process is either
– executing (execute)
– communicating (send/get)

Properties
– Deterministic
– Distributed Control
– Distributed Memory

[Figure: processes A, B, and C connected by FIFOs. Each process wraps a function (Fa, Fb, Fc) in a repeating sequence of get, execute, and send steps.]
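The three rules above (autonomous processes, unbounded FIFOs, blocking reads) can be sketched with threads and queues. This is a minimal illustration, not code from Compaan or Laura: the topology (A feeds B and C, B feeds C) and the functions computed in each process are assumptions chosen only to make the example concrete.

```python
import queue
import threading


def process_a(out_ab, out_ac, n):
    # Source process: execute, then send the token to both B and C.
    for k in range(n):
        out_ab.put(k)
        out_ac.put(k)


def process_b(in_ab, out_bc, n):
    # get, execute, send. queue.Queue.get() blocks until a token arrives,
    # which is the only synchronization in the network.
    for _ in range(n):
        token = in_ab.get()
        out_bc.put(token * token)


def process_c(in_ac, in_bc, results, n):
    # Two blocking reads per firing, in a fixed order.
    for _ in range(n):
        a = in_ac.get()
        b = in_bc.get()
        results.append(a + b)


def run_network(n):
    # queue.Queue() without a maxsize models an unbounded FIFO.
    ab, ac, bc = queue.Queue(), queue.Queue(), queue.Queue()
    results = []
    threads = [
        threading.Thread(target=process_a, args=(ab, ac, n)),
        threading.Thread(target=process_b, args=(ab, bc, n)),
        threading.Thread(target=process_c, args=(ac, bc, results, n)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every read blocks and no process can test a FIFO for emptiness, `run_network(n)` returns the same list on every run, whatever order the operating system schedules the threads in. That is the determinism property listed on the slide.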

Kahn Process Network (KPN)

[Figure: Process A, Process B, and Process C connected by FIFOs and mapped onto CPU 1, FPGA A, and FPGA B.]

• Autonomously operating processes; no global schedule is needed
• A blocking read is simple to realize in hardware
• The buffer sizes of the FIFOs are quite often very small

The Compaan Tool Chain

Stages: Matlab Program → MatParser → SAC → DgParser → Polyhedral Reduced Dependence Graph (PRDG) → Panda → Kahn Process Network

Key steps: Data Dependency Analysis and Linearization.

[Figure: the Matlab program is the QR algorithm. The Matlab application is drawn with inputSamples, initialR, Vectorize, Rotate, and outputR; the resulting process network with Source, S1, P1, P2, P3, and Sink.]

Data Dependency Analysis

  for i = 1:1:N,
    for j = 1:1:N,
      [ a(i+j) ] = funcA( a(i+j) );
    end
  end

[Figure: the iterations drawn as a grid of points over the axes i and j for N = 6. The line i+j = 6 is marked, with the label a(i+1,j-1). The grid is bounded by a polytope Ax >= b.]

The for-next loops define an Iteration Domain.
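To make the polytope description concrete, the sketch below writes the loop bounds of the example (1 <= i <= N, 1 <= j <= N) as a system Ax >= b and checks that the integer points inside it are exactly the loop iterations. It then finds, for every iteration, which earlier iteration last wrote the element a(i+j). The function names are hypothetical and the brute-force search only illustrates the idea; it is not how Compaan performs its analysis.

```python
def iteration_domain(n):
    # Rows encode i >= 1, -i >= -n, j >= 1, -j >= -n for x = (i, j).
    a_matrix = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    b_vector = [1, -n, 1, -n]
    return a_matrix, b_vector


def in_domain(a_matrix, b_vector, point):
    # True when every row of A x >= b holds for the given point.
    return all(
        row[0] * point[0] + row[1] * point[1] >= rhs
        for row, rhs in zip(a_matrix, b_vector)
    )


def loop_points(n):
    # Iterations in execution order: i is the outer loop, j the inner loop.
    return [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]


def last_writer(n):
    # For each iteration, the earlier iteration that produced a(i+j),
    # or None when the value still comes from the initial array.
    producer = {}
    last = {}
    for point in loop_points(n):
        i, j = point
        producer[point] = last.get(i + j)
        last[i + j] = point
    return producer
```

For N = 6 this reports that iteration (3, 3) reads the value written by (2, 4): both lie on the line i+j = 6. In general an iteration (i, j) with i > 1 and j < N depends on (i-1, j+1), so the data dependencies run along the anti-diagonals of the iteration domain.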

Polyhedral Reduced Dependence Graph

Matlab Code (QR Algorithm):

  for k = 1:1:K,
    for j = 1:1:N,
      [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
      for i = j+1:1:N,
        [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
      end
    end
  end

[Figure: Dependence Graph of the QR algorithm over the axes k, j, and i, with Vectorize (Vec) and Rotate (Rot) nodes.]

Polyhedral Reduced Dependence Graph

[Figure: a graph with nodes A, B, C, D, and E. Nodes C and D are annotated with their polytopes (Polytope “C” and Polytope “D”).]

Linearization

• Linearization is the process of mapping higher-order data structures (e.g., matrices) onto a 1-D stream.
• We replace the indexing of the variables x(j,i) and x(n-1,m) by relative put and get operations on a FIFO buffer (unboxing).
• Is this always possible?

Before linearization, the Producer and the Consumer communicate through Global Memory:

  for j = 1:1:5,
    for i = j:1:5,
      [ x(j,i) ] = F1();
    end
  end

  for j = 1:1:5,
    for i = j:1:5,
      F2( x(n-1,m) );
    end
  end

After linearization, the Producer and the Consumer communicate through a FIFO:

  for j = 1:1:5,
    for i = j:1:5,
      [ out ] = F1();
      FIFO.put(out);
    end
  end

  for j = 1:1:5,
    for i = j:1:5,
      in = FIFO.get();
      F2(in);
    end
  end
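The producer/consumer pair above can be replayed in a few lines to show what linearization buys. The sketch assumes the simplest case: the consumer reads every element exactly once and in the order the producer wrote it. `f1` is a hypothetical stand-in for F1, and `fifo_is_sufficient` is an illustrative helper, not part of Compaan.

```python
from collections import deque


def triangular_domain(n=5):
    # Iterations of "for j = 1:1:n, for i = j:1:n" in execution order.
    return [(j, i) for j in range(1, n + 1) for i in range(j, n + 1)]


def f1(j, i):
    # Hypothetical producer function; any deterministic value works.
    return 10 * j + i


def run_with_global_memory(n=5):
    # The producer stores into an indexed array, the consumer reads by index.
    x = {}
    for j, i in triangular_domain(n):
        x[(j, i)] = f1(j, i)
    return [x[(j, i)] for j, i in triangular_domain(n)]


def run_with_fifo(n=5):
    # Same computation with the indexing replaced by put and get.
    fifo = deque()
    for j, i in triangular_domain(n):
        fifo.append(f1(j, i))  # FIFO.put(out)
    return [fifo.popleft() for _ in triangular_domain(n)]  # FIFO.get()


def fifo_is_sufficient(write_order, read_order):
    # A plain FIFO works when tokens are read once, in the order written.
    return list(write_order) == list(read_order)
```

Both versions deliver the same 15 values to the consumer. The question on the slide ("Is this always possible?") corresponds to cases where `fifo_is_sufficient` is false, for example a consumer that walks the same triangle column by column; such out-of-order communication needs more than a plain FIFO.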

KPN Hand Off

The Kahn Process Network is handed off to:
• Laura, which produces synthesizable VHDL for an FPGA
• Functional simulation in Ptolemy II (Ptolemy actor models in Java)
• C++/YAPI

[Figure: Kahn Process Network of the QR algorithm with processes (a) rotate and (b) vectorize.]

The Laura Tool

Stages: KPN → KPNToArchitecture → Network of Virtual Processors (platform independent) → Mapping, using a library of IP cores → Network of Synthesizable Processors (platform dependent) → VHDL, Verilog, SystemC

[Figure: KPNToArchitecture turns a Kahn Process Network (processes P1, P2, P3 and channels Ch1, Ch2, Ch3) into an Abstract Architecture (virtual processors VP1, VP2, VP3 and FIFO1, FIFO2, FIFO3). VP2 has input ports IP1 and IP2 and output ports OP1 and OP2.]

[Figure: data flow through one processor. Input FIFOs → MUX → Read Unit → Execution Unit (IP core) → Write Unit → DeMUX → output FIFOs, with a Controller and Control Tables on the read side and the write side.]

Structure of an individual processor
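The read/execute/write structure can be sketched as a small class. Everything here is an assumption made for illustration: the class name, the idea that a control table holds one port index per firing, and the stall behaviour. The slides only state that a processor has a read unit, an execution unit wrapping an IP core, a write unit, and control tables.

```python
from collections import deque


class VirtualProcessor:
    """Hypothetical model of one processor: MUX, execute, DeMUX."""

    def __init__(self, func, inputs, outputs, read_table, write_table):
        self.func = func                # stands in for the IP core
        self.inputs = inputs            # list of input FIFOs (deque)
        self.outputs = outputs          # list of output FIFOs (deque)
        self.read_table = read_table    # input port to select per firing
        self.write_table = write_table  # output port to select per firing

    def fire(self, step):
        # Read unit: the control table selects the input port (MUX).
        source = self.inputs[self.read_table[step]]
        if not source:
            return False  # blocking read: stall until a token is present
        token = source.popleft()
        # Execute unit: apply the function of the IP core.
        result = self.func(token)
        # Write unit: the control table selects the output port (DeMUX).
        self.outputs[self.write_table[step]].append(result)
        return True
```

A short usage example: with `read_table = [0, 1, 0]` and `write_table = [0, 0, 1]`, three firings take tokens from input 0, input 1, and input 0 again, and route the results to output 0, output 0, and output 1. A firing on an empty input returns `False`, which models the processor waiting on its blocking read.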

System Design Flow

The tools in action
– M-JPEG example based on the original C code of the Portable Video Research Group, Stanford University
– Simple target platform
  • Common PC platform
  • With FPGA board

Motion JPEG encoder

[Figure: a video stream (a sequence of T frames of dimH by dimV pixels, 4:2:2 YUV format) goes through JPEG encoding and becomes an M-JPEG encoded video stream with an observed bitrate.]

Target Platform

[Figure: a Pentium IV microprocessor connected over the PCI bus (control, address, data) to an ADM-XRCII board with a Virtex-II 2V6000 FPGA. The FPGA holds the host interface, control, status, select, address multiplexer, memory banks, and the HW design.]

System Design Flow

Application in Matlab → Compaan Compiler → KPN → HW/SW partitioning (Workload Analysis)

Software path (SW programming): Software Processes (YAPI) → GCC/V++ SW Compiler → Object Code → Pentium IV microprocessor

Hardware path (HW programming): Hardware Processes (Matlab) → Compaan/Laura HW Compiler → Hardware Description (VHDL) → HW design on the FPGA

[Figure: the target platform, with the Pentium IV connected over the PCI bus to the FPGA board.]

M-JPEG Specification in Matlab

Parameterized:

  %parameter NumFrames 1 1000;
  %parameter VNumBlocks 16 256;
  %parameter HNumBlocks 8 256;

  [ QTables, HuffTables, TablesInfo, EndOfFrame ] = P2_l_DefaultTables( );
  for k = 1:1:NumFrames,
    [ HeaderInfo ] = P1_l_VideoInInit( );
    for j = 1:1:VNumBlocks,
      for i = 1:1:HNumBlocks,
        [ Block( j , i ) ] = P1_l_VideoInMain( );
      end
    end
    for j = 1:1:VNumBlocks,
      for i = 1:1:HNumBlocks,
        [ Block( j , i ) ] = DCT( Block( j , i ) );
      end
    end
    for j = 1:1:VNumBlocks,
      for i = 1:1:HNumBlocks,
        [ Block( j , i ) ] = Q( Block( j , i ), QTables );
        [ Packets, StatisticsB ] = VLE( Block( j , i ), EndOfFrame, HuffTables );
        [ BitRate, StatisticsF, EndOfFrame ] = CtrlF1( StatisticsB );
        [ ] = VideoOut( HeaderInfo, TablesInfo, Packets );
      end
    end
    [ QTable, HuffTables, TablesInfo ] = P2_l_CtrlF2( BitRate, StatisticsF, QTables, HuffTables, TablesInfo );
  end

Highlighted in the code: Block( j , i )

Deriving a KPN

Application in Matlab → Compaan Compiler → KPN, which is used for
• Functional verification (Ptolemy II, PN domain)
• Workload analysis to do the HW/SW partitioning (YAPI/C++)

The KPN of M-JPEG

[Figure: process network with the processes P1, P2, DCT, Q, VLE, CtrlF1, and VOut. The channels carry HeaderInfo, Block, Packets, BitRate, QTables, HuffTables, StatisticsB, StatisticsF, EndOfFrame, and TablesInfo.]

  struct Block {
    int Y1[64]; /* block 8x8 pixels */
    int Y2[64]; /* block 8x8 pixels */
    int U[64];  /* block 8x8 pixels */
    int V[64];  /* block 8x8 pixels */
  };

Interface Code DCT

• The DCT process is selected to move to hardware.
• Interface code is needed to run with the software processes.
• Observe that ‘Blocks’ are being moved to the FPGA and from the FPGA.

  void DCT::main() {
    // NumFrames = 100;
    // VNumBlocks = 16;
    // HNumBlocks = 8;

    for ( int k=1; k <= NumFrames; k++ ) {
      for ( int j=1; j <= VNumBlocks; j++ ) {
        for ( int i=1; i <= HNumBlocks; i++ ) {
          read( inPort, inBlock );
          outBlock = DCT( inBlock );
          write( outPort, outBlock );
        }
      }
    }
  }

Matlab of the DCT Process

  for k = 1:1:4,
    for j = 1:1:64,
      [ Pixel( k , j ) ] = Source( inBlock );
    end
  end
  for k = 1:1:4,
    if k <= 2,
      for j = 1:1:64,
        [ Pixel( k , j ) ] = PreShift( Pixel( k , j ) );
      end
    end
    for j = 1:1:64,
      [ Block ] = P_l_PixelsToBlock( Pixel( k , j ) );
    end
    [ Block ] = P_l_2D_dct( Block );
    for j = 1:1:64,
      [ Pixel( k , j ) ] = P_l_BlockToPixels( Block );
    end
  end
  for k = 1:1:4,
    for j = 1:1:64,
      [ outBlock ] = Sink( Pixel( k , j ) );
    end
  end

Of the DCT process, we make a new Matlab program
– This exposes more parallelism at a finer level.
– Automatic conversion from Blocks to Stream (Linearization)

Compaan produces by default a process per function call.
– However, using the prefix ‘P_l_’ (as in P_l_2D_dct) we can group processes.

KPN Subnetwork DCT

[Figure: the M-JPEG KPN (P1, P2, DCT, Q, VLE, CtrlF1, VOut) with the DCT process expanded into a hierarchical subnet. The subnet consists of Source, PreShift, P, and Sink, connected by Pixel channels between inBlock and outBlock.]

Hierarchical Subnet of DCT

Programming the CPU

M-JPEG specified in YAPI → C++ Compiler → YAPI Executable → Pentium IV (YAPI Multithreading Environment)

Laura DCT Hardware Model

Abstract Hardware Model: Network of Virtual Processors

One-to-one mapping from the DCT subnet (Source, PreShift, P, Sink):
• VP1 (Source)
• VP2 (PreShift)
• VP3 (P), with input ports IP1 and IP2 and output port OP1
• VP4 (Sink)

The Virtual Processors are connected by FIFO1, FIFO2, FIFO3, and FIFO4. Sink and Source do the type conversion.

Laura DCT Hardware Model

• To get the functionality of the Virtual Processor, we integrated an IP core.
• We have taken the core (2D-DCT) from the Xilinx website.
• Make the processor specific for a platform
  – Determine the bit width
  – Determine the FIFO sizes
  – Take into account the clock
  – Determine the control tables for the switches

[Figure: VP3 (P) with input ports IP1 and IP2 and output port OP1, connected to FIFO1, FIFO3, and FIFO4. Inside the processor: a Read Unit with a MUX (inputs 0 and 1, signal in_0) and synchronization logic, an Execute Unit (IP core implementing the 2D-DCT from Xilinx), a Write Unit (signal out_0) with synchronization logic, and a Control Unit with a control table.]

Hw/Sw Solution for M-JPEG

[Figure: the Pentium IV runs the YAPI Multithreading Environment and is connected over the PCI bus (control, address, data) to the FPGA board with host interface, control, status, select, address multiplexer, and memory banks.]

The way it is programmed, the CPU and FPGA run in parallel.

Processing Time M-JPEG (hh:mm:ss)

  Step                     Compaan   Laura     Other tools  Manually  Total
  M-JPEG -> KPN            00:00:22  --        --           00:30:00  00:30:22
  Software Compilation     --        --        00:00:35     --        00:00:35
  DCT Subnet Compilation   00:00:08  --        --           --        00:00:08
  Laura                    --        00:00:07  --           03:00:00  03:00:07
  Synthesis to FPGA        --        --        00:13:10     --        00:13:10
  Overall                  00:00:30  00:00:07  00:13:45     03:30:00  03:44:22
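The totals in the table can be re-derived from the individual cells. The sketch below transcribes the table (the dictionary layout and function names are only a convenience of this example) and recomputes the row totals, the column totals, and the share of the time that is manual work.

```python
def to_seconds(hms):
    # "hh:mm:ss" -> seconds
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def to_hms(seconds):
    # seconds -> "hh:mm:ss"
    return "%02d:%02d:%02d" % (seconds // 3600, seconds % 3600 // 60, seconds % 60)


# Columns: Compaan, Laura, Other tools, Manually. None marks a "--" cell.
ROWS = {
    "M-JPEG -> KPN": ["00:00:22", None, None, "00:30:00"],
    "Software Compilation": [None, None, "00:00:35", None],
    "DCT Subnet Compilation": ["00:00:08", None, None, None],
    "Laura": [None, "00:00:07", None, "03:00:00"],
    "Synthesis to FPGA": [None, None, "00:13:10", None],
}


def row_total(cells):
    return to_hms(sum(to_seconds(c) for c in cells if c is not None))


def column_totals(rows):
    sums = [0, 0, 0, 0]
    for cells in rows.values():
        for index, cell in enumerate(cells):
            if cell is not None:
                sums[index] += to_seconds(cell)
    return [to_hms(s) for s in sums]


def manual_share(rows):
    # Fraction of the overall time spent on manual steps.
    manual = sum(to_seconds(c[3]) for c in rows.values() if c[3] is not None)
    overall = sum(to_seconds(t) for t in column_totals(rows))
    return manual / overall
```

The recomputed totals agree with the Overall row (00:00:30, 00:00:07, 00:13:45, 03:30:00, summing to 03:44:22). Manual work accounts for roughly 94% of the overall time, while the Compaan and Laura runs together take 37 seconds.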

Device Utilization for the DCT

  FPGA resource          Utilization         %
  Number of MULT18x18s   8 out of 144        5%
  Number of RAMB16s      4 out of 144        2%
  Number of SLICEs       2367 out of 33792   7%
  Number of BUFGMUXs     2 out of 16         12%

Virtex-II 2V6000 FPGA (taking 4% of the FPGA)

Real-time performance M-JPEG

Throughput of the system
– 10.5 CIF frames (128x128) per second
  • Running Windows 2000
  • Simple compiler
  • Simple multithreading architecture

Required is 25 frames per second
– Communication between FPGA and CPU is too slow (PCI)

However, with
– 64-bit PCI
– Running at 66 MHz
– 4 times increase in performance

Then 25 frames per second (128x128) is not a problem.
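The arithmetic behind that last statement is short enough to spell out. The sketch assumes that the frame rate scales linearly with the PCI speed-up, which holds only if the PCI link is the sole bottleneck; the slides imply this but do not measure it.

```python
MEASURED_FPS = 10.5   # measured throughput from the slide
REQUIRED_FPS = 25.0   # real-time requirement from the slide
PCI_SPEEDUP = 4.0     # "4 times increase in performance" from the slide


def projected_fps(measured, speedup):
    # Assumes throughput is limited only by the FPGA/CPU communication.
    return measured * speedup


def required_speedup(measured, required):
    # Smallest speed-up factor that reaches the required frame rate.
    return required / measured
```

Under that assumption the faster bus would yield 42 frames per second, while a speed-up of about 2.4 would already be enough to reach 25 frames per second.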

Exploration

Generate, Map, Explore: from a single Matlab program, alternative application instances are generated and mapped onto the architecture.

  for j = 1:1:N,
    [x(j)] = Source1( );
  end
  for i = 1:1:K,
    [y(i)] = Source2( );
  end
  for j = 1:1:N,
    for i = 1:1:K,
      [y(i), x(j)] = F( y(i), x(j) );
    end
  end
  for i = 1:1:K,
    [Out(i)] = Sink( y(i) );
  end

[Figure: Alternative Application Instances KPN_1 to KPN_5. Each network has the sources S1 and S2 and a Sink; the number of processes between them varies, from a single process P (KPN_5) to four processes P1 to P4 (KPN_1). The instances are mapped onto the architecture (CPU, RPU, IPcore, Micro Processor, and Memory on a Programmable Interconnect (NoC)).]

Unrolling/Unfolding

Original program:

  for j = 1:1:N,
    for i = 1:1:K,
      [y(i), x(j)] = F(y(i), x(j));
    end
  end

Program after unrolling with MatTransform:

  for j = 1:1:N,
    if mod( j , 2 ) = 1,
      for i = 1:1:K,
        [y(i), x(j)] = F(y(i), x(j));
      end
    end
    if mod( j , 2 ) = 0,
      for i = 1:1:K,
        [y(i), x(j)] = F(y(i), x(j));
      end
    end
  end

[Figure: dependence graphs of F over j (x(1) to x(4)) and i (y(1) to y(3)), before and after unrolling. Labels: Compaan, U = [ N, K ], Difficult to derive, MatTransform.]
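The unrolling transformation splits the iterations of j into an odd and an even group without changing what is computed. The sketch below replays both versions with a hypothetical function `f` standing in for F and counts how many iterations land in each group; when Compaan creates one process per function call, the two call sites would become two processes that share the work.

```python
def f(y, x):
    # Hypothetical stand-in for F; returns the new (y, x) pair.
    return y + 2 * x, x + y


def original(n, k, xs, ys):
    x, y = list(xs), list(ys)
    for j in range(1, n + 1):
        for i in range(1, k + 1):
            y[i - 1], x[j - 1] = f(y[i - 1], x[j - 1])
    return x, y


def unrolled(n, k, xs, ys):
    x, y = list(xs), list(ys)
    work = {"odd": 0, "even": 0}
    for j in range(1, n + 1):
        if j % 2 == 1:  # first copy of the loop body: odd j
            for i in range(1, k + 1):
                y[i - 1], x[j - 1] = f(y[i - 1], x[j - 1])
                work["odd"] += 1
        if j % 2 == 0:  # second copy of the loop body: even j
            for i in range(1, k + 1):
                y[i - 1], x[j - 1] = f(y[i - 1], x[j - 1])
                work["even"] += 1
    return x, y, work
```

For N = 4 and K = 3 both versions return identical x and y, and each of the two copies of F executes 6 of the 12 iterations.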

Retiming/Skewing

Skewing matrix: M = [ m11 m12 ; m21 m22 ]. In the skewed program below, the new j equals the old j + i and i is unchanged.

Original program:

  for j = 1:1:N,
    for i = 1:1:K,
      [y(i), x(j)] = F(y(i), x(j));
    end
  end

Program after skewing:

  for j = 2:1:N+K,
    for i = max(1, j-N):1:min(j-1, K),
      [y(i), x(j-i)] = F(y(i), x(j-i));
    end
  end

Program after skewing and unfolding, with unfolding vector U = [ u1, u2 ] = [2, 1]:

  for j = 2:1:N+K,
    if mod( j , 2 ) = 1,
      for i = max(1, j-N):1:min(j-1, K),
        [y(i), x(j-i)] = F(y(i), x(j-i));
      end
    end
    if mod( j , 2 ) = 0,
      for i = max(1, j-N):1:min(j-1, K),
        [y(i), x(j-i)] = F(y(i), x(j-i));
      end
    end
  end

[Figure: dependence graphs of F over j (x(1) to x(4)) and i (y(1) to y(3)) for the original, skewed, and unfolded programs. Labels: Compaan, Difficult to derive.]
Design Space Exploration

[Figure: exploration loop. Matlab Application and Initial Values of Parameters → Retiming/Unrolling → Matlab Code → Compaan Compiler → Process Network → Mapping (Laura/XFT) onto Xilinx Virtex-II → Performance Analysis → Performance Numbers → New Values of Parameters, which feed back into Retiming/Unrolling.]

Conclusions

• To satisfy tomorrow’s applications, we will see hierarchical multiprocessor systems with a number of CPUs, memories, IP cores, and RPUs.
• Programming these systems will be difficult unless the MoC is changed to take concurrency into account. The key items will be
  – Distributed Memory
  – Distributed Control
• Kahn Process Networks seem to be a very promising programming formalism for tomorrow’s HW/SW codesign platforms.
• We showed a proof of concept with a case in which we convert M-JPEG into a KPN whose processes are mapped onto either hardware or software.
• In the M-JPEG case, the hardware and software were running concurrently, exploiting task-level parallelism.
• Given good tools, we can start from (legacy) code in Matlab, C, or other imperative languages.
• The results are just the beginning. There is more to achieve when more mature or commercial products are used (RTOS, Compiler, Target Platform, Virtex Pro).

Publications

• Bart Kienhuis, Edwin Rijpkema, and Ed F. Deprettere, “Compaan: Deriving Process Networks from Matlab for Embedded Signal Processing Architectures”, 8th International Workshop on Hardware/Software Codesign (CODES'2000), May 3-5, 2000, San Diego, CA, USA.
• Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, “A compile time based approach for solving out-of-order communication in Kahn Process Networks”, in Proceedings of the IEEE 13th International Conference on Application-specific Systems, Architectures and Processors (ASAP'2002), San Jose, CA, USA, July 17-19, 2002.
• Tim Harriss, Richard Walke, Bart Kienhuis, and Ed Deprettere, “Compilation from Matlab to Process Networks Realized in FPGA”, Design Automation of Embedded Systems, Kluwer, Vol. 7, Issue 4, 2002.
• Todor Stefanov, Bart Kienhuis, and Ed Deprettere, “Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances”, in Proceedings of the Tenth International Symposium on Hardware/Software Codesign (CODES'2002), Stanley Hotel, Estes Park, Colorado, USA, May 6-8, 2002.
• Edwin Rijpkema, “Modeling Task Level Parallelism in Piece-wise Regular Programs”, PhD thesis, Leiden University, Leiden Institute of Advanced Computer Science (LIACS), The Netherlands, Sept 2002.
• Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, “Solving out-of-order communication in Kahn Process Networks”, submitted for publication in Journal on VLSI Signal Processing-Systems for Signal, Image, and Video Technology, Kluwer Academic Publishers, 2003.
• Claudiu Zissulescu, Todor Stefanov, Bart Kienhuis, and Ed Deprettere, “Laura: Leiden Architecture Research and Exploration Tool”, submitted to The International Conference on Field Programmable Logic and Applications, September 1-3, 2003, Lisbon, Portugal.
• Todor Stefanov, Claudiu Zissulescu, Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, “System Design using Kahn Process Networks: The Compaan/Laura Approach”, submitted for review to ICCAD, November 9-13, 2003, San Jose, CA, USA.
