ECE 565 High-Level Synthesis—An Introduction Shantanu Dutt ECE Dept., UIC.

Post on 21-Dec-2015



HLS Flow

• Code/Algorithm → Architecture (interconnected functional units (FUs) and memory units (MUs), connected via muxes, demuxes, tristate buffers, buses, and dedicated interconnects)

Classically, these 3 stages were performed sequentially; they are now performed together, which leads to better optimization.

HLS Flow (contd)


Allocation: Simple counting of FUs after the above 2 stages

(Binding)

Simple HLS Examples


Simple HLS Examples (contd)

2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+), w/ an X delay of 2 clock cycles (cc's) and a + delay of 1 cc

Operations per iteration: c1: [x ← a X b], c2: [y ← c+d], c3: [z ← x+y]

i) Non-overlapped pipelined scheduling

(a) Scheduling

[Figure: schedule over cc's 1-6. X row: c1(1), c1(2). + row: c2(1), c3(1), c2(2), c3(2).]

(b) Arch. Synthesis

[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), one X unit and one + unit, two muxes (mux1, mux2; inputs I0, I1) and one demux (outputs O0, O1).]

(c) Controller FSM Synthesis

[Figure: controller FSM entered on Reset, with states labeled cc 3i, cc 3(i+1), cc 3(i+2). Control-signal groups shown:]

• lda=1, ldb=1, ldc=1, ldd=1

• ldx=1 [x ← a X b] (c1)

• mux1=0, mux2=0, demux=0, ldy=1 [y ← c+d] (c2)

• mux1=1, mux2=1, demux=1, ldz=1 [z ← x+y] (c3)

lda = 1 ⇒ reg. "a" is loaded

Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted.

Note: Unspecified control signals have either an inactive value or, if such a concept doesn't exist for the control signal, the don't-care value.
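The non-overlapped controller can be sketched as a small table-driven simulation. This is only an illustration under stated assumptions: the assignment of control-signal groups to the three states, the choice of mux input 0 for (c, d) and input 1 for (x, y), and the loading of a, b, c, d before the first state are all guesses at what the lost figure showed, and the names CONTROL and simulate are hypothetical.

```python
# Assumed control word per state; signals not listed are inactive (0).
CONTROL = {
    0: {"mux1": 0, "mux2": 0, "demux": 0, "ldy": 1},  # c2: y <- c + d (X starts a*b)
    1: {"ldx": 1},                                     # c1: x <- a * b (X finishes)
    2: {"mux1": 1, "mux2": 1, "demux": 1, "ldz": 1},  # c3: z <- x + y
}

def simulate(samples):
    """Run one 3-state pass per (a, b, c, d) sample; return (z values, cc count)."""
    regs = {name: 0 for name in "abcdxyz"}
    results, cc = [], 0
    for a, b, c, d in samples:
        # Assumption: input registers are already loaded when state 0 begins.
        regs.update(a=a, b=b, c=c, d=d)
        for state in (0, 1, 2):
            sig = {"mux1": 0, "mux2": 0, "demux": 0, "ldx": 0, "ldy": 0, "ldz": 0}
            sig.update(CONTROL[state])
            # Combinational part: muxes pick the adder operands.
            in1 = regs["x"] if sig["mux1"] else regs["c"]
            in2 = regs["y"] if sig["mux2"] else regs["d"]
            add_out = in1 + in2
            mul_out = regs["a"] * regs["b"]  # valid by the end of the 2nd cc
            # Registers take their new values at the next clock edge.
            nxt = dict(regs)
            if sig["ldx"]:
                nxt["x"] = mul_out
            if sig["ldy"] and sig["demux"] == 0:
                nxt["y"] = add_out
            if sig["ldz"] and sig["demux"] == 1:
                nxt["z"] = add_out
            regs = nxt
            cc += 1
        results.append(regs["z"])
    return results, cc
```

For example, simulate([(2, 3, 4, 5)]) computes 2*3 + (4+5) in 3 cc's; each additional sample adds 3 more cc's, which matches the non-overlapped timing.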

Simple HLS Examples (contd)

2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) (cont’d)

ii) Overlapped pipelined scheduling

(a) Scheduling

[Figure: schedule over cc's 1-6. X row: c1(1), c1(2). + row: c2(1), c3(1), c2(2), c3(2). Successive iterations overlap.]

(b) Arch. Synthesis

[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), one X unit and one + unit, two muxes (mux1, mux2; inputs I0, I1) and one demux.]

(c) Controller FSM Synthesis

[Figure: controller FSM entered on Reset, with states labeled cc 3i and cc 3(i+1). Control-signal groups shown:]

• lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1 [y ← c+d, x ← a X b] (c1, c2)

• ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1 [z ← x+y] (c3)

• For 4 iterations, the overlapped schedule takes 9 cc's versus 12 cc's for the non-overlapped schedule.

• Overlapped schedule: time for n iterations = 2n+1 cc's; throughput = n/(2n+1) ~ 0.5 outputs/cc

• Non-overlapped schedule: time for n iterations = 3n cc's; throughput = n/3n ~ 0.33 outputs/cc

• For large n, the overlapped schedule thus needs ~33% fewer cc's, i.e., a ~50% throughput improvement (0.5 vs. 0.33 outputs/cc)
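The timing formulas above can be checked with a few lines of arithmetic. The sketch below uses the slide's closed forms (3n and 2n+1); the concrete slot assignment in overlapped_schedule is an assumption about how the overlapped schedule fills the adder, chosen so that it reproduces the 2n+1 total, and all function names are hypothetical.

```python
from fractions import Fraction

def non_overlapped_time(n):
    """cc's for n iterations when each iteration takes its own 3 cc's."""
    return 3 * n

def overlapped_time(n):
    """cc's for n iterations when iteration i+1 starts 2 cc's after iteration i."""
    return 2 * n + 1

def overlapped_schedule(n):
    """Assumed start cc of each operation; c1 occupies X for 2 cc's."""
    sched = {}
    for i in range(1, n + 1):
        sched[f"c1({i})"] = 2 * i - 1  # X busy in cc 2i-1 and 2i
        sched[f"c2({i})"] = 2 * i      # + computes c+d while X finishes
        sched[f"c3({i})"] = 2 * i + 1  # + computes x+y once x is loaded
    return sched

def speedup(n):
    """Throughput ratio overlapped / non-overlapped for n iterations."""
    return Fraction(non_overlapped_time(n), overlapped_time(n))
```

speedup(4) is 12/9 = 4/3, and the ratio approaches 3/2 as n grows, which is the ~50% throughput gain.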

Simple HLS Examples (contd)

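• Some DFG control operation nodes:

Distributor: routes its input (in) to one of two outputs (out1, out2), labeled T and F, according to a condition (T/F).

Selector: passes one of its two inputs (in1, in2), labeled T and F, to its output (out), according to a condition (T/F).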

• Conditional code: If (a > b) then c ← a-b; Else c ← b-a;

• Possible DFGs corresponding to the above conditional code:
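Since the DFG figures are not reproduced here, the following sketch shows two plausible readings of the conditional code in terms of the control nodes. It is an assumption, not the slide's actual DFGs: one version computes both branches and selects, the other distributes the operands so that only one subtraction fires. None is used as a hypothetical marker for "no token on this output".

```python
def distributor(cond, value):
    """Return (T output, F output); the output not chosen carries no token (None)."""
    return (value, None) if cond else (None, value)

def selector(cond, in_t, in_f):
    """Pass the T input when cond is true, otherwise the F input."""
    return in_t if cond else in_f

def abs_diff_select(a, b):
    """Both subtractions are computed; a selector picks the result."""
    return selector(a > b, a - b, b - a)

def abs_diff_distribute(a, b):
    """Operands are distributed first, so only one subtraction receives tokens."""
    cond = a > b
    a_t, a_f = distributor(cond, a)
    b_t, b_f = distributor(cond, b)
    c_t = a_t - b_t if a_t is not None else None  # fires only on the T path
    c_f = b_f - a_f if a_f is not None else None  # fires only on the F path
    return selector(cond, c_t, c_f)
```

The select-only version needs two subtractors (or two cc's on one), while the distribute version activates a single subtraction per evaluation.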

Simple HLS Examples (contd)

• Iterative code: while (a > b) a ← a-b;

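[Figure: DFG for the loop, built from a comparison (>), a subtraction (-), a distributor (dist) and a selector (sel), each with T and F ports; inputs a and b, output "final a"; one annotation reads "Initialized to F".]

(a) Scheduling (using only 1 adder/sub)

[Figure: scheduling & binding over cc's c1 and c2 on the single + unit.]

(b) Arch. Synthesis

[Figure: datapath with one adder (+) with cin = 1, so that b' + 1 is the 2's complement of b (i.e., -b); registers r1, a, b and "final a" (load signals ldr1, lda, ldb, ldfina); a mux and a demux; the signal s xor ovfl (= 1: -ve result, = 0: +ve result) goes to the FSM.]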
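A minimal sketch of the loop on a single adder/subtractor, assuming a 2-cc iteration (c1: compare, c2: subtract) and a fixed word width. Subtraction is done as x + y' + 1 with cin = 1, and the sign of the true difference is taken as sign xor overflow. The comparison here computes b - a so that a strictly negative result means a > b; the slide's datapath may wire this differently, and the function names and the 8-bit default are hypothetical.

```python
def sub_flags(x, y, bits=8):
    """Compute x - y as x + ~y + 1; return (result bits, true difference is negative)."""
    mask = (1 << bits) - 1
    not_y = ~y & mask
    res = ((x & mask) + not_y + 1) & mask      # cin = 1 supplies the +1
    sign = res >> (bits - 1)
    sx = (x & mask) >> (bits - 1)
    sny = not_y >> (bits - 1)
    overflow = int(sx == sny and sign != sx)   # addends agree in sign, result differs
    return res, bool(sign ^ overflow)

def to_signed(value, bits=8):
    """Interpret a bits-wide pattern as a 2's-complement integer."""
    return value - (1 << bits) if value >> (bits - 1) else value

def while_sub(a, b, bits=8):
    """Run while (a > b) a <- a - b; return (final a, cc's used)."""
    cycles = 0
    while True:
        _, a_gt_b = sub_flags(b, a, bits)      # c1: b - a < 0 means a > b
        cycles += 1
        if not a_gt_b:
            return a, cycles
        res, _ = sub_flags(a, b, bits)         # c2: a <- a - b
        a = to_signed(res, bits)
        cycles += 1
```

As with the original code, the loop only terminates for suitable inputs (for example, positive b); while_sub(17, 5) steps through 12, 7, 2 and stops when the comparison fails.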

Delay Nodes in DFGs

A delay node is generally implemented as a register; a delay node thus becomes a state variable.

Delay Nodes in DFGs (contd)

[Figure panels: "Transformation in the DFG" and "Mapping to the architecture"; the delay node is mapped to a register.]
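To illustrate a delay node acting as a state variable, here is a small sketch for a hypothetical DFG y[n] = x[n] + y[n-1], which is not taken from the slides. The delay on y becomes a register that is read in the current cc and loaded with the new value at the next clock edge.

```python
def accumulate(xs, init=0):
    """Simulate y[n] = x[n] + y[n-1]; the delay node is held in `reg`."""
    reg = init              # register implementing the delay node
    outputs = []
    for x in xs:
        y = x + reg         # adder reads the previous output from the register
        outputs.append(y)
        reg = y             # register is loaded for the next iteration
    return outputs
```

For example, accumulate([1, 2, 3]) yields the running sums 1, 3, 6.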

Detailed HLS Example

Detailed HLS Example (contd)

The synthesized architecture

Note: It is not clear how register allocation was done here. It is sub-optimal (4 non-primary i/p regs. needed).

(a) Scheduling w/ one X (2 cc’s) & one + (1 cc); goal: min. latency

Different paths (i/p → o/p) in the DFG

(b) Reg. alloc. for o/p of operations

(c) Arch. synthesis

For WAR constraint

Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to the o/p is the highest. Break ties in favor of the operation u whose "sibling" o/ps (o/ps that feed the same children), among those that are available or will be available at u's earliest finish time, have the largest lifetime at that point.
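The primary criterion of this heuristic (highest delay to the o/p first) can be sketched as a small list scheduler. This is a simplified illustration, not the slide's exact procedure: ties are broken by operation name rather than by sibling lifetimes, the DFG is assumed acyclic, and the input format and names (DELAY, list_schedule) are hypothetical.

```python
DELAY = {"mul": 2, "add": 1}  # cc's per operation kind (X = 2 cc's, + = 1 cc)

def list_schedule(ops, units=None):
    """ops: {name: (kind, [predecessor names])}. Returns {name: start cc}, 1-based."""
    units = units or {"mul": 1, "add": 1}
    succs = {n: [] for n in ops}
    for n, (_, preds) in ops.items():
        for p in preds:
            succs[p].append(n)

    prio = {}
    def delay_to_output(n):
        # Longest path from the start of n to any output, in cc's.
        if n not in prio:
            prio[n] = DELAY[ops[n][0]] + max(
                (delay_to_output(s) for s in succs[n]), default=0)
        return prio[n]
    for n in ops:
        delay_to_output(n)

    start, finish = {}, {}
    busy_until = {kind: [0] * count for kind, count in units.items()}
    cc = 1
    while len(start) < len(ops):
        # An operation is available once all predecessors finished before this cc.
        ready = [n for n in ops if n not in start
                 and all(p in finish and finish[p] < cc for p in ops[n][1])]
        ready.sort(key=lambda n: (-prio[n], n))  # name order stands in for tie-breaking
        for n in ready:
            kind = ops[n][0]
            for u, until in enumerate(busy_until[kind]):
                if until < cc:                   # this FU is free in cc
                    start[n] = cc
                    finish[n] = cc + DELAY[kind] - 1
                    busy_until[kind][u] = finish[n]
                    break
        cc += 1
    return start
```

On the earlier example z = a X b + (c + d), with operations m (mul), s (add) and z (add of m and s), the scheduler starts m and s in cc 1 and z in cc 3.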

Detailed HLS Example (contd)

Detailed HLS Example—Register Allocation

3 non-primary i/p regs. needed

Detailed HLS Example—Register Allocation (contd)

• In the conflict graph (one per FU), there is an edge between 2 variable nodes if their lifetimes overlap (indicating that different registers need to be allocated to them)

• Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is NP-hard in general

• The above type of conflict graph is called an interval graph (derived from 1-dimensional intervals, here the lifetimes)

• Min. graph coloring can be solved optimally in linear time for interval graphs (using the left-edge algorithm that we will see later for channel routing)
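Register allocation on such an interval graph can be sketched as follows. This is a first-fit variant of the left-edge idea, not necessarily the formulation used later for channel routing: variables are visited in order of lifetime start and each goes into the first register that is already free. Lifetimes are assumed to be half-open intervals [start, end), the variable names are hypothetical, and this simple version is not written to achieve the linear-time bound.

```python
def left_edge(lifetimes):
    """lifetimes: {var: (start, end)}, half-open. Returns {var: register index}."""
    assignment = {}
    free_at = []  # free_at[r] = end of the last lifetime placed in register r
    for var, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for r, free in enumerate(free_at):
            if free <= start:          # register r is no longer live: reuse it
                assignment[var] = r
                free_at[r] = end
                break
        else:
            assignment[var] = len(free_at)  # all registers live: open a new one
            free_at.append(end)
    return assignment
```

With lifetimes A = (1, 3), B = (2, 5), C = (3, 6), D = (5, 7), at most two variables are live at once, and the sketch uses two registers: A and C share one, B and D share the other.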


Detailed HLS Example—Register Allocation (contd)

3 non-primary i/p regs. needed

Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties arbitrarily. Here B's lifetime increases, but D's (D is a dependent of B) decreases similarly, so the heuristic should be based on more global information.
