Page 1: System Design Using Kahn Process Networks: The Compaan/Laura Approach

Bart Kienhuis, Assistant Professor
LIACS, Leiden University, The Netherlands

Source: ptolemy.eecs.berkeley.edu/~kienhuis/ftp/ESDtalk2003.pdf

Page 2: DSP Performance Requirements

[Figure: required performance in billion MAC/s (vertical axis, 0 to 2500) against year (horizontal axis, 2000 to 2006). Labels: Voice over IP, 2.5G, Video over IP, 3G Wireless/WCDMA, MPEG4, HDTV, Future Broadband Standards, General Purpose DSP/CPU, Market Requirements Increasing, Gap.]

Applications have a ferocious appetite for more programmable compute power.

Source: TI, Xilinx. 1 MAC = 8-bit multiply-accumulate.

Page 3: Embedded DSP Architectures

[Figure: CPU, RPU, IPcore, Micro Processor, and Memory blocks connected by a Programmable Interconnect (NoC).]

CPU: a simple microprocessor
RPU: Reconfigurable Processing Unit
IPcore: dedicated accelerator block
NoC: Network on a Chip

Weakly coupled processing elements

Page 4: Programming Problem

Sequential Application Specification: EASY to specify, DIFFICULT to map

    for j = 1:1:N,
      [x(j)] = Source1( );
    end
    for i = 1:1:K,
      [y(i)] = Source2( );
    end
    for j = 1:1:N,
      for i = 1:1:K,
        [y(i), x(j)] = F( y(i), x(j) );
      end
    end
    for i = 1:1:K,
      [Out(i)] = Sink( y(i) );
    end

Parallel Application Specification: EASY to map, DIFFICULT to specify

[Figure: a process network (Source, S1, P1, P2, P3, P4, Sink) next to the target architecture (CPU, RPU, IPcore, Micro Processor, and Memory blocks on a Programmable Interconnect (NoC)). Labels: Application, Compaan, Laura, Programming.]

Page 5: Outline

The programming problem
Kahn Process Networks
System Design: Compaan/Laura Approach
Case-study M-JPEG
Conclusions

Page 6: Embedded DSP Architectures

To satisfy the computational requirements, these architectures have to exploit:
• Distributed Control
• Distributed Memory

[Figure: CPU, RPU, IPcore, Micro Processor, and Memory blocks on a Programmable Interconnect (NoC), annotated with Task-level Parallelism and Instruction Parallelism.]

Page 7: QR Algorithm (smart antennas)

Matlab Code (QR Algorithm):

    %parameter N 8 16;
    %parameter K 100 1000;

    for k = 1:1:K,
      for j = 1:1:N,
        [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
        for i = j+1:1:N,
          [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
        end
      end
    end

Sequentially ordered
Matrices are located in Big Global Memory

QR is a simple program, but it keeps your CPU very busy.
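The QR loop nest can be run as ordinary sequential code to see what it computes. The sketch below is a Python transcription, not the author's implementation. The slides do not define Vectorize and Rotate, so it assumes the usual QR-update reading: `vectorize` computes a Givens rotation `t = (c, s)` that zeroes the sample against the diagonal element, and `rotate` applies that rotation to the remaining elements of the row. The function names and the zero initial value of `r` are assumptions made for illustration.

```python
import math

def vectorize(r, x):
    """Assumed behaviour: compute a Givens rotation t = (c, s) that zeroes x against r."""
    h = math.hypot(r, x)
    if h == 0.0:
        return 0.0, 0.0, (1.0, 0.0)   # nothing to rotate
    c, s = r / h, x / h
    return h, 0.0, (c, s)

def rotate(r, x, t):
    """Assumed behaviour: apply the rotation t to the pair (r, x)."""
    c, s = t
    return c * r + s * x, -s * r + c * x, t

def qr_update(samples, n):
    """Sequential loop nest from the slide, with 0-based indices."""
    r = [[0.0] * n for _ in range(n)]            # assumed initial R: all zeros
    x = [[float(v) for v in row] for row in samples]
    for k in range(len(x)):                      # for k = 1:1:K
        for j in range(n):                       #   for j = 1:1:N
            r[j][j], x[k][j], t = vectorize(r[j][j], x[k][j])
            for i in range(j + 1, n):            #     for i = j+1:1:N
                r[j][i], x[k][i], t = rotate(r[j][i], x[k][i], t)
    return r
```

Under these assumptions the result is an upper-triangular `r` whose Gram matrix equals that of the input samples, which is the defining property of the R factor. Every call reads and writes `r` and `x` in one shared memory, which is the global-memory pattern the slide points at.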

Page 8: Solution

Change the model of computation in such a way that it better fits the model of architecture.
Make sure the data type is of precisely the format that fits the architecture (e.g., Streams).
What model of computation would fit this description, when looking at Digital Signal Processing (DSP) applications, Imaging, and Multimedia?

Kahn Process Networks

Page 9: Kahn Process Network (KPN)

Kahn Process Networks [Kahn 1974] [Parks&Lee 95]
– Processes run autonomously
– Communicate via unbounded FIFOs
– Synchronize via blocking read

A process is either
– executing (execute)
– communicating (send/get)

Deterministic
Distributed Control
Distributed Memory

[Figure: processes A, B, and C, running functions Fa, Fb, and Fc, connected by FIFOs. Each process repeats a sequence of get, execute, and send steps.]
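The three KPN rules (autonomous processes, unbounded FIFOs, blocking read) can be tried out with threads and queues. The following is a minimal Python sketch and not part of Compaan or Laura. The process names, the squaring function, and the `None` end-of-stream marker are assumptions added so that the example terminates; Kahn processes themselves need no such marker.

```python
import queue
import threading

def source(out_fifo, values):
    for v in values:
        out_fifo.put(v)            # send: never blocks, the FIFO is unbounded
    out_fifo.put(None)             # hypothetical end-of-stream marker

def worker(in_fifo, out_fifo, func):
    while True:
        token = in_fifo.get()      # get: blocking read, the only synchronization
        if token is None:
            out_fifo.put(None)
            return
        out_fifo.put(func(token))  # execute, then send

def sink(in_fifo, results):
    while True:
        token = in_fifo.get()
        if token is None:
            return
        results.append(token)

def run_network(values):
    """Three processes in a chain, connected by two unbounded FIFOs."""
    a_to_b, b_to_c = queue.Queue(), queue.Queue()
    results = []
    threads = [
        threading.Thread(target=source, args=(a_to_b, values)),
        threading.Thread(target=worker, args=(a_to_b, b_to_c, lambda v: v * v)),
        threading.Thread(target=sink, args=(b_to_c, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

However the threads are interleaved, `run_network([1, 2, 3])` yields the same list, which illustrates the determinism property: the output depends only on the input streams, not on the schedule.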

Page 10: Kahn Process Network (KPN)

[Figure: Process A, Process B, and Process C connected by FIFOs and placed on FPGA A, FPGA B, and CPU 1.]

• Autonomously operating processes; no global schedule needed
• Blocking read is simple to realize in hardware
• Buffer sizes of the FIFOs are quite often very small

Page 11: The Compaan Tool Chain

Matlab Program:

    %parameter N 8 16;
    %parameter K 100 1000;

    for k = 1:1:K,
      for j = 1:1:N,
        [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
        for i = j+1:1:N,
          [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
        end
      end
    end

[Figure: tool flow Matlab Program, MatParser, SAC, DgParser, Polyhedral Reduced Dependence Graph (PRDG), Panda, Kahn Process Network. Annotations: Data Dependency Analysis, Linearization. The graph for the Matlab application has nodes inputSamples, initialR, Vectorize, Rotate, and outputR; the resulting process network runs from a Source through processes to a Sink.]

Page 12: Data Dependency Analysis

    for i = 1:1:N,
      for j = 1:1:N,
        [ a(i+j) ] = funcA( a(i+j) );
      end
    end

[Figure: the iteration space for N = 6, with axes i and j each running from 1 to 6. Labels: i+j=6, a(i+1,j-1).]

The for-next loops define an Iteration Domain: a polytope Ax >= b.
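To make the polytope view concrete, the loop bounds 1 <= i <= N and 1 <= j <= N can be written as four inequalities Ax >= b with x = (i, j). The Python sketch below is an illustration under that encoding, not Compaan's analysis; the helper names are hypothetical. It enumerates the integer points of the domain and lists, in execution order, the iterations that touch one array element a(i+j).

```python
def iteration_domain(n):
    """Loop bounds 1 <= i <= n, 1 <= j <= n written as A x >= b, x = (i, j)."""
    A = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    b = [1, -n, 1, -n]
    return A, b

def in_domain(point, A, b):
    """True when every inequality row . point >= rhs holds."""
    return all(sum(a * x for a, x in zip(row, point)) >= rhs
               for row, rhs in zip(A, b))

def enumerate_domain(n):
    """Integer points of the polytope, in the loop's execution order (i outer, j inner)."""
    A, b = iteration_domain(n)
    return [(i, j)
            for i in range(0, n + 2)      # scan a box slightly larger than the domain
            for j in range(0, n + 2)
            if in_domain((i, j), A, b)]

def accesses(element, n):
    """Iterations that read and write a(element), in execution order."""
    return [(i, j) for (i, j) in enumerate_domain(n) if i + j == element]
```

For N = 6, `accesses(6, 6)` lists the points on the line i + j = 6, and each one is followed by (i+1, j-1). That is the dependence the figure marks: the value written at (i, j) is next consumed at (i+1, j-1).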

Page 13: Polyhedral Reduced Dependence Graph

Matlab Code (QR Algorithm):

    %parameter N 8 16;
    %parameter K 100 1000;

    for k = 1:1:K,
      for j = 1:1:N,
        [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
        for i = j+1:1:N,
          [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
        end
      end
    end

[Figure: Dependence Graph of the QR algorithm over the axes k, j, and i, with Vec and Rot nodes.]

Page 14: Polyhedral Reduced Dependence Graph

[Figure: a graph with nodes A, B, C, D, and E; two of the nodes are annotated as Polytope "C" and Polytope "D".]

Page 15: Linearization

Linearization is the process of mapping higher-order data structures (e.g., matrices) onto a 1-D stream.
We replaced the indexing of the variables x(j,i) and x(n-1,m) by relative put and get operations on a FIFO buffer (unboxing).
Is this always possible?

Before linearization (Global Memory)

Producer:

    for j = 1:1:5,
      for i = j:1:5,
        [ x(j,i) ] = F1();
      end
    end

Consumer:

    for j = 1:1:5,
      for i = j:1:5,
        F2( x(n-1,m) );
      end
    end

After linearization (FIFO)

Producer:

    for j = 1:1:5,
      for i = j:1:5,
        [ out ] = F1();
        FIFO.put(out);
      end
    end

Consumer:

    for j = 1:1:5,
      for i = j:1:5,
        in = FIFO.get();
        F2(in);
      end
    end
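The question "Is this always possible?" comes down to whether the consumer needs the tokens in the order the producer emits them. The Python sketch below compares the global-memory version with the FIFO version of the triangular loop nest. It is an illustration only: the slide's `F1()` takes no arguments, so the two-argument `f1(j, i)` is a hypothetical stand-in that makes tokens distinguishable, and `in_order` is a helper invented for this example.

```python
from collections import deque

def triangle(n=5):
    """Iteration order of the slide's loop nest: for j = 1:n, for i = j:n."""
    return [(j, i) for j in range(1, n + 1) for i in range(j, n + 1)]

def via_global_memory(f1, n=5):
    """Producer writes x(j,i) into a shared array; consumer indexes it."""
    x = {}
    for (j, i) in triangle(n):
        x[(j, i)] = f1(j, i)
    return [x[index] for index in triangle(n)]

def via_fifo(f1, n=5):
    """Producer puts tokens on a FIFO; consumer gets them, with no indexing."""
    fifo = deque()
    for (j, i) in triangle(n):
        fifo.append(f1(j, i))                     # FIFO.put(out)
    return [fifo.popleft() for _ in triangle(n)]  # in = FIFO.get()

def in_order(producer_order, consumer_order):
    """A plain FIFO suffices only when both sides visit the data in the same order."""
    return list(producer_order) == list(consumer_order)
```

When both loop nests traverse the triangle in the same order, the two versions deliver identical streams. If the consumer instead walked the triangle column by column, `in_order` would return `False` and a plain FIFO would hand it the wrong tokens; that is the out-of-order communication case.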

Page 16: KPN Hand Off

[Figure: the Kahn Process Network, with processes (a) rotate and (b) vectorize, is handed off to:]
– Laura: Synthesizable VHDL for the FPGA
– Functional Simulation in Ptolemy II: Ptolemy Actor models in Java
– C++/YAPI

Page 17: The Laura Tool

[Figure: Laura flow. KPN, then KPNtoArchitecture, then Network of Virtual Processors (platform independent), then Mapping with a Library of IP cores, then Network of Synthesizable Processors (platform dependent), emitted as VHDL, Verilog, or SystemC.]

Page 18: The Laura Tool

[Figure: KPNToArchitecture converts a Kahn Process Network (processes P1, P2, P3; channels Ch1, Ch2, Ch3; input ports IP1, IP2 and output ports OP1, OP2 on P2) into an Abstract Architecture (virtual processors VP1, VP2, VP3; FIFO1, FIFO2, FIFO3; the same ports on VP2).]

Page 19: The Laura Tool

Structure of an individual processor

[Figure: data flow from input FIFOs through a MUX into the Read Unit, on to the Execution Unit (IP Core), then the Write Unit, and through a DeMUX to output FIFOs. A Controller and Control Tables drive the Read Unit and the Write Unit.]

Page 20: System Design Flow

The Tools in action
– M-JPEG example based on the original C code of the Portable Video Research Group, Stanford University.
– Simple Target Platform
  • Common PC platform
  • With FPGA board

Page 21: Motion JPEG encoder

[Figure: a video stream (4:2:2 YUV format), shown as a sequence of T frames of size dimH by dimV, enters JPEG encoding and leaves as an M-JPEG encoded video stream. Label: observed bitrate.]

Page 22: Target Platform

[Figure: a Pentium IV Microprocessor connected over the PCI bus to an ADM-XRCII board with a Virtex-II 2V6000 FPGA. Blocks on the board: Host Interface, Status, Control, HW Design, Multiplexer, Memory Banks. Signals: Control, Select, Address, Data.]

Page 23: System Design Flow

[Figure: Application in Matlab, then Compaan Compiler, then KPN, then HW/SW partitioning (Workload Analysis), which splits the flow into two paths.]

Software Path (SW Programming)
– Software Processes (YAPI)
– GCC/V++ SW Compiler
– Object Code, run on the Pentium IV Microprocessor

Hardware Path (HW Programming)
– Hardware Processes (Matlab)
– Compaan/Laura HW Compiler
– Hardware Description (VHDL), loaded as the HW Design on the Virtex-II 2V6000 FPGA (ADM-XRCII board)

The microprocessor and the FPGA board are connected by the PCI bus.

Page 24: M-JPEG Specification in Matlab

Parameterized:

    %parameter NumFrames 1 1000;
    %parameter VNumBlocks 16 256;
    %parameter HNumBlocks 8 256;

    [ QTables, HuffTables, TablesInfo, EndOfFrame ] = P2_l_DefaultTables( );
    for k = 1:1:NumFrames,
      [ HeaderInfo ] = P1_l_VideoInInit( );
      for j = 1:1:VNumBlocks,
        for i = 1:1:HNumBlocks,
          [ Block( j, i ) ] = P1_l_VideoInMain( );
        end
      end
      for j = 1:1:VNumBlocks,
        for i = 1:1:HNumBlocks,
          [ Block( j, i ) ] = DCT( Block( j, i ) );
        end
      end
      for j = 1:1:VNumBlocks,
        for i = 1:1:HNumBlocks,
          [ Block( j, i ) ] = Q( Block( j, i ), QTables );
          [ Packets, StatisticsB ] = VLE( Block( j, i ), EndOfFrame, HuffTables );
          [ BitRate, StatisticsF, EndOfFrame ] = CtrlF1( StatisticsB );
          [ ] = VideoOut( HeaderInfo, TablesInfo, Packets );
        end
      end
      [ QTables, HuffTables, TablesInfo ] = P2_l_CtrlF2( BitRate, StatisticsF, QTables, HuffTables, TablesInfo );
    end

The data passed between the stages is Block( j, i ).

Page 25: Deriving a KPN

[Figure: Application in Matlab, then Compaan Compiler, with two outputs:]
– Functional Verification: Ptolemy II PN Domain
– Workload Analysis to do the HW/SW Partitioning: YAPI/C++

Page 26: Deriving a KPN

Page 27: The KPN of M-JPEG

[Figure: process network with processes P1, DCT, Q, VLE, VOut, CtrlF1, and P2. Block tokens pass along P1, DCT, Q, and VLE; Packets go from VLE to VOut. Other channels: HeaderInfo, TablesInfo, QTables, HuffTables, StatisticsB, StatisticsF, BitRate, EndOfFrame.]

    struct Block {
      int Y1[64]; /* block 8x8 pixels */
      int Y2[64]; /* block 8x8 pixels */
      int U[64];  /* block 8x8 pixels */
      int V[64];  /* block 8x8 pixels */
    };

Page 28: Interface Code DCT

The DCT process is selected to move to hardware.
Interface code is needed to run with the software processes.
Observe that 'Blocks' are being moved to the FPGA and from the FPGA.

    void DCT::main() {
      // NumFrames = 100;
      // VNumBlocks = 16;
      // HNumBlocks = 8;

      for ( int k=1; k <= NumFrames; k++ ) {
        for ( int j=1; j <= VNumBlocks; j++ ) {
          for ( int i=1; i <= HNumBlocks; i++ ) {
            read( inPort, inBlock );
            outBlock = DCT( inBlock );
            write( outPort, outBlock );
          }
        }
      }
    }

Page 29: Matlab of the DCT Process

    for k = 1:1:4,
      for j = 1:1:64,
        [ Pixel( k, j ) ] = Source( inBlock );
      end
    end
    for k = 1:1:4,
      if k <= 2,
        for j = 1:1:64,
          [ Pixel( k, j ) ] = PreShift( Pixel( k, j ) );
        end
      end
      for j = 1:1:64,
        [ Block ] = P_l_PixelsToBlock( Pixel( k, j ) );
      end
      [ Block ] = P_l_2D_dct( Block );
      for j = 1:1:64,
        [ Pixel( k, j ) ] = P_l_BlockToPixels( Block );
      end
    end
    for k = 1:1:4,
      for j = 1:1:64,
        [ outBlock ] = Sink( Pixel( k, j ) );
      end
    end

Of the DCT process, we make a new Matlab program.
– This exposes more parallelism at a finer level.
– Automatic conversion from Blocks to Stream (Linearization).

Compaan produces by default a process per function call.
– However, using the prefix 'P_l_' we can group processes.

Page 30: KPN Subnetwork DCT

Hierarchical Subnet of DCT

[Figure: the M-JPEG KPN (P1, DCT, Q, VLE, VOut, CtrlF1, P2, with the channels Block, Packets, HeaderInfo, TablesInfo, QTables, HuffTables, StatisticsB, StatisticsF, BitRate, EndOfFrame), with the DCT process expanded into a subnetwork. In the subnetwork, Source receives inBlock, Sink emits outBlock, and Source, PreShift, P, and Sink are connected by Pixel channels.]

Page 31: Programming the CPU

[Figure: M-JPEG specified in YAPI, then C++ Compiler, then YAPI Executable, running in the YAPI Multithreading Environment on the Pentium IV.]

Page 32: Laura DCT Hardware Model

Abstract Hardware Model: Network of Virtual Processors

[Figure: One-to-One Mapping of the DCT subnetwork (Source, PreShift, P, Sink, connected by Pixel channels between inBlock and outBlock) onto virtual processors VP1 (Source), VP2 (PreShift), VP3 (P), and VP4 (Sink), connected by FIFO1, FIFO2, FIFO3, and FIFO4. VP3 has input ports IP1 and IP2 and output port OP1.]

Sink/Source do the type conversion.

Page 33: Laura DCT Hardware Model

To get the functionality of the Virtual Processor, we integrated an IPcore.
We have taken the core (2D-DCT) from the Xilinx website.
Make the processor specific for a platform
– Determine the bit width
– Determine the FIFO sizes
– Take into account the clock
– Determine the control tables for the switches

[Figure: virtual processor VP3 (P) with FIFO1 and FIFO3 on input ports IP1 and IP2 and FIFO4 on output port OP1. Inside: a Read Unit (a MUX selecting between IP1 and IP2, with Synch. Logic), an Execute Unit (the IP Core implementing the 2D-DCT (Xilinx), with ports in_0 and out_0), a Write Unit (OP1, with Synch. Logic), and a Control Unit with a Control Table.]

Page 34: Hw/Sw Solution for M-JPEG

[Figure: the YAPI Multithreading Environment on the Pentium IV, connected over the PCI bus to the ADM-XRCII board with the Virtex-II 2V6000 FPGA. Blocks on the board: Host Interface, Status, Control, Multiplexer, Memory Banks. Signals: Control, Select, Address, Data.]

The way it is programmed, the CPU and FPGA run in parallel.

Page 35: Processing Time M-JPEG

Times are given as hh:mm:ss.

Step                     Compaan    Laura      Other tools  Manually   Total
M-JPEG -> KPN            00:00:22   --         --           00:30:00   00:30:22
Software Compilation     --         --         00:00:35     --         00:00:35
DCT Subnet Compilation   00:00:08   --         --           --         00:00:08
Laura                    --         00:00:07   --           03:00:00   03:00:07
Synthesis to FPGA        --         --         00:13:10     --         00:13:10
Overall                  00:00:30   00:00:07   00:13:45     03:30:00   03:44:22
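The "Overall" row and the "Total" column of the processing-time table can be re-derived from the individual cells. The short Python sketch below does that arithmetic; the dictionary layout and function names are choices made for this example, and the cell values are copied from the table.

```python
# Cell values copied from the processing-time table (hh:mm:ss); "--" cells are omitted.
ROWS = {
    "M-JPEG -> KPN":          {"Compaan": "00:00:22", "Manually": "00:30:00"},
    "Software Compilation":   {"Other tools": "00:00:35"},
    "DCT Subnet Compilation": {"Compaan": "00:00:08"},
    "Laura":                  {"Laura": "00:00:07", "Manually": "03:00:00"},
    "Synthesis to FPGA":      {"Other tools": "00:13:10"},
}

def to_seconds(hms):
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds

def to_hms(total):
    return "%02d:%02d:%02d" % (total // 3600, total % 3600 // 60, total % 60)

def column_total(column):
    """Sum one column (one tool) over all steps, in seconds."""
    return sum(to_seconds(cells[column])
               for cells in ROWS.values() if column in cells)

def overall_total():
    """Sum every cell of the table, in seconds."""
    return sum(to_seconds(value)
               for cells in ROWS.values() for value in cells.values())
```

The sums reproduce the table's totals. They also show where the time goes: the manual steps account for about 94% of the 03:44:22 overall, while Compaan and Laura together take 37 seconds.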

Page 36: Device Utilization for the DCT

FPGA resource            Used                 Utilization
Number of MULT18x18s     8 out of 144         5%
Number of RAMB16s        4 out of 144         2%
Number of SLICEs         2367 out of 33792    7%
Number of BUFGMUXs       2 out of 16          12%

Virtex-II 2V6000 FPGA (taking 4% of the FPGA)

Page 37: Real-time performance M-JPEG

Throughput of the system
– 10.5 CIF frames (128x128) per second
  • Running Windows 2000
  • Simple Compiler
  • Simple Multithreading architecture

Required is 25 frames per second
– Communication between FPGA and CPU is too slow (PCI)

However,
– 64-bit PCI
– Running at 66 MHz
– 4 times increase in performance

Then 25 frames per second (128x128) is not a problem.
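The "4 times" figure and the conclusion that 25 frames per second becomes reachable can be checked with a few lines of arithmetic. The Python sketch below rests on two assumptions that the slide does not state: that the measured system used a 32-bit, 33 MHz PCI bus, and that frame rate scales linearly with bus bandwidth because the bus is the bottleneck.

```python
def pci_bandwidth_ratio(width_bits, clock_mhz,
                        base_width_bits=32, base_clock_mhz=33):
    """Bandwidth relative to an assumed 32-bit, 33 MHz baseline bus."""
    return (width_bits / base_width_bits) * (clock_mhz / base_clock_mhz)

def projected_fps(measured_fps, speedup):
    """Assumes throughput is limited by the bus and scales linearly with it."""
    return measured_fps * speedup

speedup = pci_bandwidth_ratio(64, 66)    # twice the width, twice the clock
estimate = projected_fps(10.5, speedup)  # from the measured 10.5 frames per second
```

Under these assumptions the speedup is 4 and the estimate is 42 frames per second, comfortably above the required 25.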

Page 38: Exploration

    for j = 1:1:N,
      [x(j)] = Source1( );
    end
    for i = 1:1:K,
      [y(i)] = Source2( );
    end
    for j = 1:1:N,
      for i = 1:1:K,
        [y(i), x(j)] = F( y(i), x(j) );
      end
    end
    for i = 1:1:K,
      [Out(i)] = Sink( y(i) );
    end

[Figure: Generate derives Alternative Application Instances KPN_1 to KPN_5 from the program. Each instance connects sources S1 and S2 to a Sink through a different number and arrangement of processes, from a single process P (KPN_5) to four processes P1 to P4. Map places an instance on the architecture (CPU, RPU, IPcore, Micro Processor, and Memory blocks on a Programmable Interconnect (NoC)); Explore closes the loop.]

Page 39: Unrolling/Unfolding

Original program:

    %parameter N 100 1000;
    %parameter K 8 48;

    for j = 1:1:N,
      for i = 1:1:K,
        [y(i), x(j)] = F(y(i), x(j));
      end
    end

Program after MatTransform (unfolded by 2 along j):

    for j = 1:1:N,
      if mod( j, 2 ) = 1,
        for i = 1:1:K,
          [y(i), x(j)] = F(y(i), x(j));
        end
      end
      if mod( j, 2 ) = 0,
        for i = 1:1:K,
          [y(i), x(j)] = F(y(i), x(j));
        end
      end
    end

[Figure: dependence graphs of F nodes over x(1) to x(4) (axis j) and y(1) to y(3) (axis i), before and after unfolding. Labels: Compaan, MatTransform, U = [ N, K ], Difficult to derive.]
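Unfolding with mod( j, 2 ) duplicates the function call without changing which iterations run or in what order; it only assigns alternate values of j to different copies of F, which can then become separate processes. The Python sketch below checks this on the loop nest's iteration trace. It models only the iteration order, not Compaan or MatTransform, and the `factor` parameter and function names are hypothetical.

```python
def original_order(n, k):
    """Iterations of: for j = 1:n, for i = 1:k, F(y(i), x(j))."""
    return [(j, i) for j in range(1, n + 1) for i in range(1, k + 1)]

def unfolded_order(n, k, factor=2):
    """Same loop nest with the body split into `factor` guarded copies.

    Copy 0 handles mod(j, factor) == 1, copy 1 handles the next residue,
    and so on, matching the order of the if-statements for factor = 2.
    """
    trace = []
    for j in range(1, n + 1):
        for copy in range(factor):
            if j % factor == (copy + 1) % factor:
                for i in range(1, k + 1):
                    trace.append((copy, j, i))
    return trace

def per_copy(trace):
    """Number of iterations executed by each copy of F."""
    counts = {}
    for copy, _, _ in trace:
        counts[copy] = counts.get(copy, 0) + 1
    return counts
```

For N = 4 and K = 3, dropping the copy tag from `unfolded_order(4, 3)` gives exactly `original_order(4, 3)`, and each of the two copies executes 6 of the 12 iterations.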

Page 40: Retiming/Skewing

Original program:

    %parameter N 100 1000;
    %parameter K 8 48;

    for j = 1:1:N,
      for i = 1:1:K,
        [y(i), x(j)] = F(y(i), x(j));
      end
    end

Skewing matrix, applied to the index vector (j, i), so that the new outer index is j + i and the inner index i is unchanged:

\[
M = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix}
  = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}
\]

Program after skewing:

    for j = 2:1:N+K,
      for i = max(1, j-N):1:min(j-1, K),
        [y(i), x(j-i)] = F(y(i), x(j-i));
      end
    end

Unfolding vector U = [ u1, u2 ] = [2, 1]

Program after skewing and unfolding:

    for j = 2:1:N+K,
      if mod( j, 2 ) = 1,
        for i = max(1, j-N):1:min(j-1, K),
          [y(i), x(j-i)] = F(y(i), x(j-i));
        end
      end
      if mod( j, 2 ) = 0,
        for i = max(1, j-N):1:min(j-1, K),
          [y(i), x(j-i)] = F(y(i), x(j-i));
        end
      end
    end

[Figure: dependence graphs of F nodes over x(1) to x(4) (axis j) and y(1) to y(3) (axis i) for the original, skewed, and unfolded programs. Labels: Compaan, Difficult to derive.]
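The skewed loop bounds max(1, j-N) and min(j-1, K) are easy to get wrong, so it is worth checking that the skewed nest executes the same calls F(y(i), x(m)) as the original one. The Python sketch below enumerates both iteration sets. It is a check on the loop bounds only, with hypothetical function names, and does not model Compaan.

```python
def original_calls(n, k):
    """(i, m) pairs for the calls F(y(i), x(m)) in the original nest."""
    return {(i, m) for m in range(1, n + 1) for i in range(1, k + 1)}

def skewed_calls(n, k):
    """Calls made by the skewed nest, in execution order, tagged with the new j."""
    calls = []
    for j in range(2, n + k + 1):                          # for j = 2:1:N+K
        for i in range(max(1, j - n), min(j - 1, k) + 1):  # max(1,j-N) .. min(j-1,K)
            calls.append((j, i, j - i))                    # F(y(i), x(j-i))
    return calls

def wavefront(calls, j):
    """Calls that share the same skewed outer index j."""
    return [(i, m) for (jj, i, m) in calls if jj == j]
```

For N = 4 and K = 3, the skewed nest makes the same 12 calls, each exactly once. Within one value of the new j, no two calls share a y(i) or an x(m), so the calls on a wavefront are independent of each other.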

Page 41: Design Space Exploration

[Figure: exploration loop. The Matlab Application and the Initial Values of Parameters enter Retiming/Unrolling, which produces Matlab Code. The Compaan Compiler turns this into a Process Network, which Mapping (Laura/XFT) places on a Xilinx Virtex-II. Performance Analysis yields Performance Numbers, which lead to New Values of Parameters for the next round.]

Page 42: Conclusions

To satisfy tomorrow's applications, we will see hierarchical multiprocessor systems with a number of CPUs, memories, IPcores, and RPUs.
Programming these systems will be difficult unless the MoC is changed to take concurrency into account. The key items will be
– Distributed Memory
– Distributed Control

Kahn Process Networks seem to be a very promising programming formalism for tomorrow's HW/SW codesign platforms.

Page 43: Conclusions

We showed a proof of concept with a case in which we convert M-JPEG into a KPN whose processes are mapped onto either hardware or software.
In the M-JPEG case, the hardware and software were running concurrently, exploiting task-level parallelism.
Having good tools, we can start from (legacy) code in Matlab, C, or other imperative languages.
The results are just the beginning. There is more to achieve when more mature or commercial products are used (RTOS, Compiler, Target Platform, Virtex Pro).

Page 44: Publications

Bart Kienhuis, Edwin Rijpkema, and Ed F. Deprettere, "Compaan: Deriving Process Networks from Matlab for Embedded Signal Processing Architectures", 8th International Workshop on Hardware/Software Codesign (CODES'2000), May 3-5, 2000, San Diego, CA, USA.

Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, "A compile time based approach for solving out-of-order communication in Kahn Process Networks", in Proceedings of the IEEE 13th International Conference on Application-specific Systems, Architectures and Processors (ASAP'2002), San Jose, CA, USA, July 17-19, 2002.

Tim Harriss, Richard Walke, Bart Kienhuis, and Ed Deprettere, "Compilation from Matlab to Process Networks Realized in FPGA", Design Automation of Embedded Systems, Kluwer, Vol. 7, Issue 4, 2002.

Todor Stefanov, Bart Kienhuis, and Ed Deprettere, "Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances", in Proceedings of the Tenth International Symposium on Hardware/Software Codesign (CODES'2002), Stanley Hotel, Estes Park, Colorado, USA, May 6-8, 2002.

Edwin Rijpkema, "Modeling Task Level Parallelism in Piece-wise Regular Programs", PhD thesis, Leiden University, Leiden Institute of Advanced Computer Science (LIACS), The Netherlands, Sept 2002.

Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, "Solving out-of-order communication in Kahn Process Networks", submitted for publication in Journal on VLSI Signal Processing-Systems for Signal, Image, and Video Technology, Kluwer Academic Publishers, 2003.

Claudiu Zissulescu, Todor Stefanov, Bart Kienhuis, and Ed Deprettere, "Laura: Leiden Architecture Research and Exploration Tool", submitted to The International Conference on Field Programmable Logic and Applications, September 1-3, 2003, Lisbon, Portugal.

Todor Stefanov, Claudiu Zissulescu, Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, "System Design using Kahn Process Networks: The Compaan/Laura Approach", submitted for review to ICCAD, November 9-13, 2003, San Jose, CA, USA.