A Synthesizable Datapath-Oriented Programmable Logic Core
Steven J.E. Wilton, Chun Hok Ho, Philip Leong, Wayne Luk, Brad Quinton
University of British Columbia and Imperial College
Jan 18, 2016
Embedded Programmable Logic Cores
Embed a small amount of programmable logic onto an ASIC:
- Postpone some decisions until late in the design cycle
- Fast upgrade path for products
- Embedded debug
Soft Programmable Logic Cores
Design flow:
1. RTL of soft PLC
2. RTL simulation
3. Synthesis
4. Scan insertion
5. Gate-level simulation
6. Floorplanning
7. Placement
8. Clock tree generation
9. Routing and timing verification
10. Physical verification
Soft Programmable Logic Cores
Advantages:
- Easy to integrate, reduces design time
- Very flexible, can create the exact required core
- Easy to migrate to smaller technologies
Disadvantages:
- Inefficient compared to hard cores
Our thought:
- Makes sense if you only want a small core (a few hundred gates)
This talk:
A new architecture for a synthesizable programmable logic core that supports datapath (bus-based) circuits
Previous Synthesizable PLCs
Kim Bozman and Noha Kafafi:
LUT-Based
Unique Directional Routing Fabric
[Figure: INPUTS feed an array of 3-LUTs that drive the OUTPUTS. All inputs are fed into a multiplexer.]
Synthesizable Cores
Observation 1: To make it truly synthesizable, must avoid
combinational loops in the unprogrammed fabric
Observation 2: Each tile need not be identical
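Observation 1 can be checked mechanically. The sketch below is an illustration only: it models a fabric as a graph of every connection that could be programmed, then tests that graph for cycles. The tile model and function names are assumptions, not part of the original cores. A directional fabric, where a tile is driven only by tiles to its left, can never contain a loop whatever the configuration bits hold.

```python
from graphlib import TopologicalSorter, CycleError


def has_combinational_loop(fanin):
    """Return True if the graph {node: set of possible drivers} contains a cycle."""
    try:
        # static_order() raises CycleError when no topological order exists.
        tuple(TopologicalSorter(fanin).static_order())
        return False
    except CycleError:
        return True


def directional_fabric(num_tiles):
    """Each tile may be driven only by tiles to its left (hypothetical model)."""
    return {tile: set(range(tile)) for tile in range(num_tiles)}


def bidirectional_fabric(num_tiles):
    """Each tile may be driven by any other tile (hypothetical model)."""
    return {tile: set(range(num_tiles)) - {tile} for tile in range(num_tiles)}


print(has_combinational_loop(directional_fabric(4)))    # False
print(has_combinational_loop(bidirectional_fabric(4)))  # True
```

Because the check runs over all possible connections rather than one programmed circuit, it reflects what a synthesis tool sees in the unprogrammed fabric.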
Previous Synthesizable PLCs
Andy Yan:
Product-term Based Logic Block
Unique Directional Routing Fabric
Supported Sequential Circuits
[Figure: PTBs (product-term blocks) arranged in levels between INPUTS and OUTPUTS, with Level 1, Level 2 and Level 3 interconnect switches and an output interconnect switch.]
Our Architecture
Use it when the PLC is connected to a bus:
[Figure: the PLC sits between two buses.]
Observation: These connections are permanently tied to the bus signals, and we know this when the ASIC is designed
Logic Architecture
[Figure: a wordblock is a column of N bitblocks (bit 0 to bit N-1). Each bitblock contains two 4-LUTs and a register, with signals labelled A, B, C, Cin, Cout, k1 and s.]
Logic Architecture
Key point:
- All bitblocks within a wordblock share same set of configuration bits
- Means all bitblocks implement the same function
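A small software model may make the shared-configuration idea concrete. In the sketch below, one pair of 16-bit truth tables is applied at every bit position, so configuring the pair once as a full adder yields an N-bit adder. How the two 4-LUTs in a bitblock are actually wired is not stated on the slides; the assumption here is that one LUT produces the output bit, the other produces the carry, and both see A, B, C and the incoming carry. All names are illustrative.

```python
def lut4(truth_table, a, b, c, cin):
    """Evaluate a 4-input LUT stored as a 16-bit truth table."""
    index = a | (b << 1) | (c << 2) | (cin << 3)
    return (truth_table >> index) & 1


def make_table(function):
    """Build a 16-bit truth table from a Python function of (a, b, c, cin)."""
    table = 0
    for index in range(16):
        a, b, c, cin = index & 1, (index >> 1) & 1, (index >> 2) & 1, (index >> 3) & 1
        table |= (function(a, b, c, cin) & 1) << index
    return table


def wordblock(out_table, carry_table, a, b, c, width, carry_in=0):
    """Evaluate `width` bitblocks that all share the same two truth tables."""
    result = 0
    carry = carry_in
    for i in range(width):
        bit_a, bit_b, bit_c = (a >> i) & 1, (b >> i) & 1, (c >> i) & 1
        result |= lut4(out_table, bit_a, bit_b, bit_c, carry) << i
        carry = lut4(carry_table, bit_a, bit_b, bit_c, carry)
    return result, carry


# One configuration, shared by every bit: a full adder on A and B (C is ignored).
SUM = make_table(lambda a, b, c, cin: a ^ b ^ cin)
CARRY = make_table(lambda a, b, c, cin: (a & b) | (cin & (a ^ b)))

print(wordblock(SUM, CARRY, 0x3C, 0x0F, 0, width=8))  # (75, 0)
```

The configuration cost is two truth tables per wordblock regardless of N, which is where the density advantage over a per-bit configuration comes from.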
Routing Architecture
Key point: Signals are routed as buses
[Figure: N-bit buses (bit 0 to bit N-1) routed between wordblocks.]
Routing Architecture
Key point:
- Linear array of wordblocks
- Buses get wider as we go to the right
[Figure: one wordblock (bit 0 to bit N-1) in the linear array.]
Routing Architecture
Key point:
- Linear array of wordblocks
- Number of buses goes up as we go to the right
[Figure: three wordblocks (bit 0 to bit N-1) in a linear array.]
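One way to see why the bus count grows: if each wordblock may read the input buses, the constant and feedback registers, and the outputs of every wordblock to its left, then the number of selectable source buses increases by one per position. The sketch below counts them under that assumption. The function names and the exact set of sources are illustrative and not taken from the slides.

```python
def selectable_buses(position, input_buses, constants, feedback):
    """Source buses visible to the wordblock at `position` (0 = leftmost).

    Assumes every earlier wordblock contributes exactly one output bus.
    """
    return input_buses + constants + feedback + position


def bus_profile(wordblocks, input_buses, constants, feedback):
    """Selectable bus count for each wordblock, left to right."""
    return [selectable_buses(i, input_buses, constants, feedback)
            for i in range(wordblocks)]


print(bus_profile(wordblocks=4, input_buses=2, constants=1, feedback=1))
# [4, 5, 6, 7]
```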
Datapath Architecture
[Figure: three wordblocks (bit 0 to bit N-1) in a linear array, each paired with a SHIFT block and a D/Q register.]
Multipliers
[Figure: linear array of wordblocks with Multiply blocks added alongside the SHIFT blocks and D/Q registers.]
Two inputs instead of three
Two output buses (MSB, LSB)
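The two output buses follow from the arithmetic: the product of two N-bit words needs 2N bits, so it is split into an upper and a lower N-bit word. A minimal sketch, assuming unsigned operands (the slides do not say whether the multiplier is signed):

```python
def multiply_block(a, b, width):
    """Multiply two unsigned `width`-bit words; return (msb_word, lsb_word)."""
    mask = (1 << width) - 1
    product = (a & mask) * (b & mask)
    return product >> width, product & mask


print(multiply_block(0xFF, 0xFF, width=8))  # (254, 1), i.e. 0xFE01
```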
Add a Control Block
[Figure: block diagram showing the Control Block (Status Mux, Control Mux, register), Wordblock 0 to Wordblock D-1 (each bit 0 to bit N-1, with control and status signals), shifters, Constant Registers (C), Input Buses (M), Feedback Registers (F), Feedback Mux, Output Mux, and Output Buses (R).]
Control block is based on P-term fine-grained synthesizable core
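A product-term block computes an OR of AND terms over its inputs. As a rough illustration of how such a block could turn wordblock status bits into a control bit, the sketch below evaluates a sum of products. The `(ones, zeros)` encoding and the example signal assignment are assumptions made for illustration, not the control block's actual configuration format.

```python
def product_term(inputs, ones, zeros):
    """AND term: true when every bit in `ones` is 1 and every bit in `zeros` is 0."""
    return (inputs & ones) == ones and (inputs & zeros) == 0


def pterm_block(inputs, terms):
    """OR of product terms; `terms` is a list of (ones, zeros) bit masks."""
    return any(product_term(inputs, ones, zeros) for ones, zeros in terms)


# Hypothetical control bit: (status0 AND NOT status1) OR status2.
TERMS = [(0b001, 0b010), (0b100, 0b000)]

print([pterm_block(status, TERMS) for status in range(8)])
# [False, True, False, False, True, True, True, True]
```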
Example Mapping
Monitor two buses:
- Count the number of times each bus matches a mask (the mask includes don’t-care bits)
- Count the number of times both buses match the mask at the same time
[Figure: two input buses and two constants feed two MASK wordblocks and three ADD wordblocks, with three feedback paths, the output buses, and a control block containing a register with reset.]
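The behaviour being mapped can be written as a short software reference model. This is a sketch of the function only, not of the mapping onto wordblocks: the `care` word marks the bits that must match, and every other bit is a don't-care. The function names and sample values are made up for illustration.

```python
def matches(word, mask, care):
    """True when `word` equals `mask` on every bit set in `care`."""
    return ((word ^ mask) & care) == 0


def monitor(samples_a, samples_b, mask, care):
    """Return (matches on bus A, matches on bus B, simultaneous matches)."""
    count_a = count_b = count_both = 0
    for word_a, word_b in zip(samples_a, samples_b):
        hit_a = matches(word_a, mask, care)
        hit_b = matches(word_b, mask, care)
        count_a += hit_a
        count_b += hit_b
        count_both += hit_a and hit_b
    return count_a, count_b, count_both


# Match an upper nibble of 0xA; the lower nibble is don't-care.
bus_a = [0xA3, 0x51, 0xAF, 0xA0]
bus_b = [0xA0, 0xA1, 0x00, 0xAA]
print(monitor(bus_a, bus_b, mask=0xA0, care=0xF0))  # (3, 3, 2)
```

In the architecture, the two comparisons would sit in the MASK wordblocks and the three running counts in the ADD wordblocks with feedback registers.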
Interesting Questions:
1. How do the various architectural parameters affect density?
2. How does this compare to a fine-grained architecture?
Architectural Parameters
D Number of Wordblocks (incl. multipliers)
N Bit Width
M Number of Input Buses
R Number of Output Buses
F Number of Feedback Paths
C Number of Constant Registers
A Number of Multipliers
P Number of Product-Term Blocks
Impact of Number of Wordblocks and Bit Width
Key Result: Both bit-width and number of wordblocks have a significant impact on area.
[Figure: cell area (x10^6 µm^2, 0 to 1.6) versus number of wordblocks (D = 1 to 8), with curves for N = 8, 16, 24 and 32.]
Impact of the Number of Multipliers
Key result: Area increases because multipliers add more buses to the routing
[Figure: cell area (x10^6 µm^2, 0.70 to 0.90) versus number of multipliers (A = 0, 1, 2, 4, 8, 16, 32), for N=16, D=32.]
Impact of the Size of the Control Block
Key result: The control block can dominate if it becomes too big
[Figure: cell area (x10^6 µm^2, 0 to 0.8) versus number of product-term blocks in the control block (P = 4, 6, 9, 12, 16), with curves for D=8 and D=16.]
Benchmark  Datapath (ours)  Fine-Grain (PTerm)    ASIC  Fine-Grain/Datapath  Datapath/ASIC
fbly                68,190         132,339,335   9,300                 1940           7.33
dotv3               34,119          65,534,780   6,575                 1921           5.19
dscg                72,178         116,271,968   9,473                 1611           7.62
fir4                76,213         130,971,120   9,843                 1718           7.74
egcd             1,225,231          22,776,474  10,420                 18.6            117
momul              294,135          11,448,589   7,097                 38.9             41
median             142,172          10,733,962   4,420                 75.5             32
debug1              87,265           1,302,928   3,484                 14.9             25
Key result 1: Significantly better than fine-grained architecture
Key result 2: Overhead roughly the same as FPGA/ASIC
But these results aren’t fair:
- For each benchmark, we found the optimum set of
architectural parameters.
- We need an architecture that works for a variety of
circuits
Architecture Construction
Our thought:
- The number of inputs/outputs is fixed by the SoC
- The designer has an idea of the size of the programmable
logic (number of wordblocks)
Fix all other parameters (as a function of # of wordblocks)
- e.g. a fixed ratio between the number of multipliers and wordblocks,
  a fixed ratio between control logic and datapath logic, etc.
We arbitrarily chose fixed ratios based on our experience
- A full architecture study is left as future work!
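The construction rule can be sketched as a function that takes the SoC-fixed values (M, R, N) and the designer's size estimate (D) and derives everything else. The ratio values below are placeholders chosen for illustration; the slides do not give the ratios that were actually used, and the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PlcParameters:
    wordblocks: int          # D
    bit_width: int           # N
    input_buses: int         # M
    output_buses: int        # R
    feedback_paths: int      # F
    constant_registers: int  # C
    multipliers: int         # A
    pterm_blocks: int        # P


def construct(wordblocks, bit_width, input_buses, output_buses,
              multiplier_ratio=0.25, feedback_ratio=0.25,
              constant_ratio=0.25, pterm_ratio=0.5):
    """Derive F, C, A and P from D using fixed (placeholder) ratios."""
    def scaled(ratio):
        # Round up and keep at least one of each resource.
        return max(1, math.ceil(ratio * wordblocks))

    return PlcParameters(
        wordblocks=wordblocks,
        bit_width=bit_width,
        input_buses=input_buses,
        output_buses=output_buses,
        feedback_paths=scaled(feedback_ratio),
        constant_registers=scaled(constant_ratio),
        multipliers=scaled(multiplier_ratio),
        pterm_blocks=scaled(pterm_ratio),
    )


core = construct(wordblocks=16, bit_width=16, input_buses=2, output_buses=2)
print(core.multipliers, core.feedback_paths, core.constant_registers, core.pterm_blocks)
# 4 4 4 8
```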
Benchmark  Datapath (ours)  Fine-Grain (PTerm)    ASIC  Fine-Grain/Datapath  Datapath/ASIC
fbly               332,091         132,339,335   9,300                  399           35.7
dotv3              225,518          65,534,780   6,575                  291           34.3
dscg               325,029         116,271,968   9,473                  358           34.3
fir4               307,154         130,971,120   9,843                  426           31.2
egcd             3,778,611          22,776,474  10,420                 6.02            363
momul              486,654          11,448,589   7,097                 23.5           68.5
median             194,654          10,733,962   4,420                 55.1             44
debug1             119,286           1,302,928   3,484                 10.9             34
Key result 1: Significantly better than fine-grained architecture
Key result 2: Overhead roughly the same as FPGA/ASIC
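The two ratio columns are derived from the three measured columns, so they can be recomputed as a check. The sketch below copies the fixed-architecture values from the table above and reports the smallest and largest improvement over the fine-grained core. The variable and function names are illustrative.

```python
# (datapath, fine-grain PTerm, ASIC) values copied from the fixed-architecture table.
AREAS = {
    "fbly":   (332_091, 132_339_335, 9_300),
    "dotv3":  (225_518, 65_534_780, 6_575),
    "dscg":   (325_029, 116_271_968, 9_473),
    "fir4":   (307_154, 130_971_120, 9_843),
    "egcd":   (3_778_611, 22_776_474, 10_420),
    "momul":  (486_654, 11_448_589, 7_097),
    "median": (194_654, 10_733_962, 4_420),
    "debug1": (119_286, 1_302_928, 3_484),
}


def ratios(datapath, fine_grain, asic):
    """Return (fine-grain / datapath, datapath / ASIC)."""
    return fine_grain / datapath, datapath / asic


GAINS = {name: ratios(*areas)[0] for name, areas in AREAS.items()}
OVERHEADS = {name: ratios(*areas)[1] for name, areas in AREAS.items()}

worst = min(GAINS, key=GAINS.get)
best = max(GAINS, key=GAINS.get)
print(f"{worst}: {GAINS[worst]:.1f}x, {best}: {GAINS[best]:.1f}x")
# egcd: 6.0x, fir4: 426.4x
```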
[Figure: dimensions labelled 625 µm by 625 µm.]
Conclusions
Our architecture is 6 to 426 times more efficient than the fine-grained architecture.
But this is only for datapath-oriented circuits.
However, this is OK:
- In an SoC, we know, when the chip is designed, whether
the inputs are buses or bits
- If there are buses, use this architecture
- If there are not buses, use Andy’s PTerm architecture
Final thought: using this architecture, the overhead is similar to
that of a normal FPGA. People already accept this!