A Synthesizable Datapath-Oriented Programmable Logic Core

Steven J.E. Wilton, Chun Hok Ho, Philip Leong, Wayne Luk, Brad Quinton

University of British Columbia and Imperial College

Embedded Programmable Logic Cores

Embed a small amount of programmable logic onto an ASIC:
- Postpone some decisions until late in the design cycle
- Fast upgrade path for products
- Embedded debug

Soft Programmable Logic Cores

[Figure: the soft PLC goes through the ASIC design flow: RTL of soft PLC, RTL simulation, synthesis, scan insertion, gate-level simulation, floorplanning, placement, clock tree generation, routing and timing verification, physical verification.]

Soft Programmable Logic Cores

Advantages:
- Easy to integrate, reduces design time
- Very flexible, can create the exact required core
- Easy to migrate to smaller technologies

Disadvantages:
- Inefficient compared to hard cores

Our thought:
- Makes sense if you only want a small core (a few hundred gates)

This talk:

A new architecture for a synthesizable programmable logic core that supports datapath (bus-based) circuits

Previous Synthesizable PLC’s

Kim Bozman and Noha Kafafi:

LUT-Based

Unique Directional Routing Fabric

[Figure: array of 3-LUTs between INPUTS and OUTPUTS, connected by the directional routing fabric. All inputs are fed into a multiplexer.]

Synthesizable Cores

Observation 1: To make it truly synthesizable, must avoid combinational loops in the unprogrammed fabric

Observation 2: Each tile need not be identical
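Observation 1 can be illustrated with a small check. This is a minimal sketch, assuming the unprogrammed fabric is modelled as a graph of every connection a routing multiplexer could select, and that "directional" means a LUT may only read from LUTs in earlier columns. The grid model and the names `directional_fabric` and `has_combinational_loop` are hypothetical, not from the talk.

```python
from graphlib import TopologicalSorter, CycleError


def directional_fabric(columns, rows):
    """Map each LUT to every LUT it could read from (earlier columns only)."""
    deps = {}
    for col in range(columns):
        for row in range(rows):
            deps[(col, row)] = {(c, r) for c in range(col) for r in range(rows)}
    return deps


def has_combinational_loop(deps):
    """True if the graph of potential connections contains a cycle."""
    try:
        # static_order() raises CycleError when no topological order exists
        tuple(TopologicalSorter(deps).static_order())
    except CycleError:
        return True
    return False


# A directional fabric is loop-free before any configuration is loaded.
print(has_combinational_loop(directional_fabric(columns=3, rows=2)))

# Two blocks that can each read the other form a potential loop.
print(has_combinational_loop({"x": {"y"}, "y": {"x"}}))
```

Because the check runs on potential connections rather than on one programmed configuration, it reflects what a synthesis tool sees when it processes the unprogrammed core.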

Previous Synthesizable PLC’s

Andy Yan:

Product-term Based Logic Block

Unique Directional Routing Fabric

Supported Sequential Circuits

[Figure: product-term blocks (PTBs) arranged in levels between INPUTS and OUTPUTS, connected by Level 1, Level 2, and Level 3 interconnect switches and an output interconnect switch.]

Our Architecture

Use it when the PLC is connected to a bus:

[Figure: a PLC connected to buses on both sides.]

Observation: These connections are permanently tied to the bus signals, and we know this when the ASIC is designed

Logic Architecture

[Figure: a wordblock made up of bitblocks for bit 0 to bit N-1. Each bitblock contains two 4-LUTs and a register, with signals labelled A, B, C, Cin, Cout, k1, and s.]

Logic Architecture

Key point:

- All bitblocks within a wordblock share the same set of configuration bits
- This means all bitblocks implement the same function
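The effect of shared configuration bits can be sketched in a few lines. This is a minimal model, assuming each bitblock is two 4-input LUTs (one producing the result bit, one producing the carry) indexed by A, B, C, and Cin. The LUT roles, the input ordering, and the names `make_lut` and `wordblock` are illustrative assumptions, not details from the slides.

```python
def make_lut(fn):
    """Build a 16-entry truth table for a function of (a, b, c, cin)."""
    return [fn(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1) & 1
            for i in range(16)]


def wordblock(result_lut, carry_lut, a, b, c, n, cin=0):
    """Evaluate an n-bit wordblock; every bitblock uses the same two LUTs."""
    result = 0
    carry = cin
    for bit in range(n):
        idx = (((a >> bit) & 1)
               | (((b >> bit) & 1) << 1)
               | (((c >> bit) & 1) << 2)
               | (carry << 3))
        result |= result_lut[idx] << bit
        carry = carry_lut[idx]          # ripples into the next bitblock
    return result, carry


# One configuration, reused by all bits: ripple-carry add of A and B (C unused).
ADD_RESULT = make_lut(lambda a, b, c, cin: a ^ b ^ cin)
ADD_CARRY = make_lut(lambda a, b, c, cin: (a & b) | (a & cin) | (b & cin))

# Another configuration: bitwise AND of A and B, carry chain unused.
AND_RESULT = make_lut(lambda a, b, c, cin: a & b)
NO_CARRY = make_lut(lambda a, b, c, cin: 0)

print(wordblock(ADD_RESULT, ADD_CARRY, 200, 100, 0, n=8))
print(wordblock(AND_RESULT, NO_CARRY, 0b1100, 0b1010, 0, n=4))
```

In this model the configuration is 32 bits for the whole wordblock regardless of N, which is the saving that sharing configuration bits is meant to provide.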


Routing Architecture

Key point: Signals are routed as buses

[Figure: N-bit buses carrying bit 0 to bit N-1, routed as a group.]


Key point:
- Linear array of wordblocks
- Buses get wider as we go to the right
- Number of buses goes up as we go to the right

[Figure: linear array of wordblocks, each spanning bit 0 to bit N-1, with more buses toward the right.]
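The growth in bus count can be made concrete with a rough model. This sketch assumes each wordblock can select any primary input bus plus the output of every wordblock to its left, and that each wordblock has three bus inputs. The selection rule, the one-output-per-wordblock default, and the function names are assumptions for illustration only.

```python
def selectable_buses(position, num_input_buses, outputs_per_wordblock=1):
    """Buses a wordblock at `position` (0 = leftmost) can choose from."""
    return num_input_buses + position * outputs_per_wordblock


def routing_summary(num_wordblocks, num_input_buses, inputs_per_wordblock=3):
    """Per-wordblock bus counts and the total number of bus-wide mux inputs."""
    per_block = [selectable_buses(i, num_input_buses)
                 for i in range(num_wordblocks)]
    total_mux_inputs = inputs_per_wordblock * sum(per_block)
    return per_block, total_mux_inputs


print(routing_summary(num_wordblocks=4, num_input_buses=2))
```

Under this model the total number of mux inputs grows roughly quadratically with the number of wordblocks, and no wordblock can read from a wordblock to its right.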

Datapath Architecture

[Figure: linear array of three wordblocks (bit 0 to bit N-1), each with a register (D/Q) and a SHIFT block.]

Multipliers

[Figure: linear array of wordblocks (bit 0 to bit N-1) with registers (D/Q) and SHIFT blocks, in which some positions hold Multiply blocks.]

Two inputs instead of three

Two output buses (MSB, LSB)
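The two output buses follow from the width of a product: multiplying two N-bit values gives up to 2N bits. A minimal sketch, assuming unsigned operands and a hypothetical function name `multiply_block`:

```python
def multiply_block(a, b, n):
    """Multiply two n-bit unsigned values; return (msb_bus, lsb_bus)."""
    mask = (1 << n) - 1
    product = (a & mask) * (b & mask)   # at most 2n bits wide
    return product >> n, product & mask


msb, lsb = multiply_block(0xFF, 0xFF, n=8)
print(hex(msb), hex(lsb))
```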

Add a Control Block

[Figure: overall architecture. A control block with a status mux and a control mux connects to Wordblock 0 through Wordblock D-1. Each wordblock spans bit 0 to bit N-1, has control and status signals, and is followed by a shifter. Input buses (M), constant registers (C), and feedback registers (F) feed the wordblocks; an output mux drives the output buses (R) and a feedback mux drives the feedback registers.]

Control block is based on P-term fine-grained synthesizable core

Example Mapping

Monitor two buses:
- Count the number of times each bus matches a mask (includes don’t care bits)
- Count the number of times both buses match the mask at the same time

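[Figure: mapping of the bus monitor. Two input buses and two constants feed two MASK wordblocks; three ADD wordblocks with feedback registers hold the counts and drive the output buses; the control block and a reset register (D/Q) control the counters.]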
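The behaviour being mapped can be written as a short software reference. This is a sketch, assuming a single pattern shared by both buses and a separate `care` word whose 0 bits mark don't-care positions. The encoding of don't-care bits and the names `matches` and `monitor` are assumptions, not details from the slides.

```python
def matches(word, pattern, care):
    """True when word equals pattern on every bit where care is 1."""
    return ((word ^ pattern) & care) == 0


def monitor(samples, pattern, care):
    """Count matches on bus A, on bus B, and on both in the same cycle."""
    count_a = count_b = count_both = 0
    for bus_a, bus_b in samples:
        hit_a = matches(bus_a, pattern, care)
        hit_b = matches(bus_b, pattern, care)
        count_a += hit_a
        count_b += hit_b
        count_both += hit_a and hit_b
    return count_a, count_b, count_both


# Match any value whose upper nibble is 0xA; the lower nibble is don't-care.
cycles = [(0xA5, 0xA0), (0xA7, 0x15), (0x00, 0xAF)]
print(monitor(cycles, pattern=0xA0, care=0xF0))
```

The three counters correspond to the three accumulating adders, and the two comparisons correspond to the two mask blocks.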

Interesting Questions:

1. How do the various architectural parameters affect density?

2. How does this compare to a fine-grained architecture?

Architectural Parameters

D Number of Wordblocks (incl. multipliers)

N Bit Width

M Number of Input Buses

R Number of Output Buses

F Number of Feedback Paths

C Number of Constant Registers

A Number of Multipliers

P Number of Product-Term Blocks
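For experimenting with these parameters, it can help to group them in one record. A minimal sketch using the letters from the list above; the class name, the validation rules, and the `plain_wordblocks` helper are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchParams:
    D: int  # number of wordblocks, including multipliers
    N: int  # bit width
    M: int  # number of input buses
    R: int  # number of output buses
    F: int  # number of feedback paths
    C: int  # number of constant registers
    A: int  # number of multipliers
    P: int  # number of product-term blocks

    def __post_init__(self):
        if self.D < 1 or self.N < 1:
            raise ValueError("D and N must be at least 1")
        if min(self.M, self.R, self.F, self.C, self.P) < 0:
            raise ValueError("counts cannot be negative")
        # D counts multipliers too, so A can never exceed D
        if not 0 <= self.A <= self.D:
            raise ValueError("A must be between 0 and D")

    @property
    def plain_wordblocks(self):
        """Wordblocks that are not multipliers."""
        return self.D - self.A


params = ArchParams(D=16, N=16, M=2, R=2, F=2, C=2, A=4, P=9)
print(params.plain_wordblocks)
```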

Impact of Number of Word-blocks and bit-width

Key Result: Both bit-width and number of wordblocks have a significant impact on area.

[Figure: cell area (x 10^6 µm^2, 0 to 1.6) versus number of wordblocks (D = 1 to 8), with one curve each for N = 8, 16, 24, and 32.]

Impact of the Number of Multipliers

Key result: Area increase due to more buses in the routing

[Figure: cell area (x 10^6 µm^2, 0.70 to 0.90) versus number of multipliers (A = 0, 1, 2, 4, 8, 16, 32), for N=16, D=32.]

Impact of the Size of the Control Block

Key result: The control block can dominate if it becomes too big

[Figure: cell area (x 10^6 µm^2, 0 to 0.8) versus number of product-term blocks in the control block (P = 4, 6, 9, 12, 16), with one curve each for D=8 and D=16.]

Benchmark  Datapath (ours)  Fine-Grain (PTerm)  ASIC    Fine-Grain/Datapath  Datapath/ASIC
fbly       68,190           132,339,335         9,300   1940                 7.33
dotv3      34,119           65,534,780          6,575   1921                 5.19
dscg       72,178           116,271,968         9,473   1611                 7.62
fir4       76,213           130,971,120         9,843   1718                 7.74
egcd       1,225,231        22,776,474          10,420  18.6                 117
momul      294,135          11,448,589          7,097   38.9                 41
median     142,172          10,733,962          4,420   75.5                 32
debug1     87,265           1,302,928           3,484   14.9                 25




Key result 1: Significantly better than fine-grained architecture





Key result 2: Overhead roughly the same as FPGA/ASIC
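The two ratio columns in the table can be rederived from the three raw columns. The sketch below copies the figures from the table; the names `ROWS` and `ratio_columns` are hypothetical.

```python
# benchmark: (datapath, fine-grain PTerm, ASIC), copied from the table
ROWS = {
    "fbly":   (68_190, 132_339_335, 9_300),
    "dotv3":  (34_119, 65_534_780, 6_575),
    "dscg":   (72_178, 116_271_968, 9_473),
    "fir4":   (76_213, 130_971_120, 9_843),
    "egcd":   (1_225_231, 22_776_474, 10_420),
    "momul":  (294_135, 11_448_589, 7_097),
    "median": (142_172, 10_733_962, 4_420),
    "debug1": (87_265, 1_302_928, 3_484),
}


def ratio_columns(datapath, fine_grain, asic):
    """Return (fine-grain / datapath, datapath / ASIC)."""
    return fine_grain / datapath, datapath / asic


for name, row in ROWS.items():
    fine_over_dp, dp_over_asic = ratio_columns(*row)
    print(f"{name:8s} {fine_over_dp:10.1f} {dp_over_asic:8.2f}")
```

The first ratio backs key result 1 (the gain over the fine-grained core), and the second is the overhead relative to an ASIC implementation referred to in key result 2.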

But these results aren’t fair:

- For each benchmark, we found the optimum set of architectural parameters.
- We need an architecture that works for a variety of circuits

Architecture Construction

Our thought:

- The number of inputs/outputs is fixed by the SoC
- The designer has an idea of the size of the programmable logic (number of wordblocks)

Fix all other parameters (as a function of # of wordblocks)
- e.g. fixed ratio between number of multipliers and wordblocks, fixed ratio between control logic and datapath logic, etc.

We arbitrarily chose fixed ratios based on our experience
- A full architecture study is left as future work!
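The construction rule above can be sketched as a function that takes the quantities fixed by the SoC and the designer (D, N, M, R) and derives the rest. The slides do not give the ratios that were used, so every ratio below is a made-up placeholder, and the function and argument names are hypothetical.

```python
def derive_parameters(num_wordblocks, bit_width, input_buses, output_buses,
                      wordblocks_per_multiplier=4,    # placeholder ratio
                      wordblocks_per_pterm_block=2,   # placeholder ratio
                      wordblocks_per_feedback=4,      # placeholder ratio
                      wordblocks_per_constant=4):     # placeholder ratio
    """Fill in A, P, F, and C from D using fixed, hypothetical ratios."""
    def scaled(divisor):
        # D divided by the ratio, rounded down, but never fewer than one
        return max(1, num_wordblocks // divisor)

    return {
        "D": num_wordblocks,
        "N": bit_width,
        "M": input_buses,
        "R": output_buses,
        "A": scaled(wordblocks_per_multiplier),
        "P": scaled(wordblocks_per_pterm_block),
        "F": scaled(wordblocks_per_feedback),
        "C": scaled(wordblocks_per_constant),
    }


print(derive_parameters(num_wordblocks=16, bit_width=16,
                        input_buses=2, output_buses=2))
```

With this shape, a designer only chooses the number of wordblocks, and the same rule produces one architecture that all benchmarks must share.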



Benchmark  Datapath (ours)  Fine-Grain (PTerm)  ASIC    Fine-Grain/Datapath  Datapath/ASIC
fbly       332,091          132,339,335         9,300   399                  35.7
dotv3      225,518          65,534,780          6,575   291                  34.3
dscg       325,029          116,271,968         9,473   358                  34.3
fir4       307,154          130,971,120         9,843   426                  31.2
egcd       3,778,611        22,776,474          10,420  6.02                 363
momul      486,654          11,448,589          7,097   23.5                 68.5
median     194,654          10,733,962          4,420   55.1                 44
debug1     119,286          1,302,928           3,484   10.9                 34





Key result 2: Overhead roughly the same as FPGA/ASIC
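Two figures can be read off by combining this table with the earlier one: the range of the gain over the fine-grained core, and how much larger the datapath core gets when the parameters are no longer tuned per benchmark. The sketch below assumes the second table reports the fixed-ratio architecture and the first reports the per-benchmark optimum; the names are hypothetical.

```python
# benchmark: (datapath with fixed ratios, fine-grain PTerm,
#             datapath with per-benchmark optimum parameters)
RESULTS = {
    "fbly":   (332_091, 132_339_335, 68_190),
    "dotv3":  (225_518, 65_534_780, 34_119),
    "dscg":   (325_029, 116_271_968, 72_178),
    "fir4":   (307_154, 130_971_120, 76_213),
    "egcd":   (3_778_611, 22_776_474, 1_225_231),
    "momul":  (486_654, 11_448_589, 294_135),
    "median": (194_654, 10_733_962, 142_172),
    "debug1": (119_286, 1_302_928, 87_265),
}


def fine_grain_ratio_range(results):
    """Smallest and largest fine-grain / datapath ratio."""
    ratios = [fine / fixed for fixed, fine, _ in results.values()]
    return min(ratios), max(ratios)


def fixed_ratio_penalty(results):
    """How many times larger the fixed-ratio core is than the tuned one."""
    return {name: fixed / best for name, (fixed, _, best) in results.items()}


low, high = fine_grain_ratio_range(RESULTS)
print(f"fine-grain / datapath: {low:.2f} to {high:.1f}")
for name, penalty in fixed_ratio_penalty(RESULTS).items():
    print(f"{name:8s} {penalty:.2f}x")
```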

[Figure: layout measuring 625 µm by 625 µm.]

Conclusions

Our architecture is 6 to 426x more efficient than the fine-grained architecture

But this is only for datapath-oriented circuits.

However, this is OK:
- In an SoC, we know, when the chip is designed, whether the inputs are buses or bits
- If there are buses, use this architecture
- If there are no buses, use Andy’s PTerm architecture

Final thought: using this architecture, the overhead is similar to that of a normal FPGA. People already accept this!
