Initial System Description (Floating point MATLAB/Simulink)
Determine Flexibility Requirements
Description with Hardware Constraints (Fixed point Simulink, FSM Control in Stateflow)
Real-time Emulation (FPGA Array)
Automated ASIC Generation (Chip-in-a-day Flow)
14.3
Simulink Based Chip Design: Direct Mapping
Result: An architecture that can be implemented rapidly
[Figure: datapath blocks Mult1, Mult2, Mac1, Mac2, S reg, X reg, and Add/Sub/Shift]
Directly map diagram into hardware since there is a one-for-one relationship for each of the blocks
[1] W. R. Davis, et al., "A Design Environment for High Throughput, Low Power Dedicated Signal Processing Systems," IEEE J. Solid-State Circuits, vol. 37, no. 3, pp. 420-431, Mar. 2002.
Custom tool 2: I/O components for logic verification
[2] K. Kuusilinna, et al., "Real Time System-on-a-Chip Emulation," in Winning the SoC Revolution, by H. Chang, G. Martin, Kluwer Academic Publishers, 2003.
[2]
14.5
[Figure: energy-delay-area trade-off; axes: Energy, Delay, Area; labels: VDD scaling, intl, fold, optimal design; levels: datapath, block-level]
Energy-Area-Delay Optimization
Energy-Area-Delay space for architecture comparison
Set architecture optimization parameter (e.g. P = 2)
Parallel design Parallel option (P = 2)
14.18
5/20/2012
10
Transformed Simulink Architecture
P = 2 parallel Input streams
P = 2 parallel Output streams
Parallel Adder core
Parallel Multiplier core
Automatically generated scheduled architecture pops up
14.19
Range of Architecture Tuning Parameters
[Figure: Energy vs. Tclk for fixed VDD and for VDD scaling between VDDmin and VDDmax, with VDD* marked; arrows show increasing P, R and increasing N; limits: throughput max, latency max]
Pipeline: R
Parallel: P
Time mux: N
[6] R. Nanda, C.-H. Yang, and D. Marković, "DSP Architecture Optimization in MATLAB/Simulink Environment," in Proc. Int. Symp. VLSI Circuits, June 2008, pp. 192-193.
[6]
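The tuning range above rests on trading clock period for supply voltage. The sketch below illustrates that trade under loudly stated assumptions that are not taken from the slides: an alpha-power-law delay model, a threshold voltage of 0.3 V, alpha = 1.5, switching energy proportional to VDD squared, and no overhead from the added parallel or pipeline hardware. The function names are hypothetical.

```python
def gate_delay(vdd, vth=0.3, alpha=1.5):
    """Relative gate delay under the alpha-power law (assumed model)."""
    return vdd / (vdd - vth) ** alpha


def scaled_vdd(p, vdd_ref=1.0, vdd_min=0.32, step=0.001):
    """Lowest VDD on a 1 mV grid at which logic is at most p times slower than at vdd_ref."""
    budget = p * gate_delay(vdd_ref)      # P-way parallelism relaxes the delay budget by P
    vdd = vdd_ref
    while vdd - step >= vdd_min and gate_delay(vdd - step) <= budget:
        vdd -= step
    return vdd


def relative_energy(p, vdd_ref=1.0):
    """Energy per operation relative to the reference, assuming E ~ VDD^2."""
    return (scaled_vdd(p, vdd_ref) / vdd_ref) ** 2


if __name__ == "__main__":
    for p in (1, 2, 4):
        print(p, round(scaled_vdd(p), 3), round(relative_energy(p), 3))
```

With these assumed numbers, the energy per operation keeps dropping as P grows until VDD reaches the lower bound, which mirrors the VDDmin limit in the figure.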
14.20
Energy-Area-Performance Map
Each point on the surface is an optimal architecture automatically generated in Simulink after modified ILP scheduling and retiming
[Figure: surface of valid architectures within the constraints, with direct mapping as the reference point; Area axis shown, axes normalized from 0.2 to 1]
System designer can choose from many feasible (optimal) solutions
It is not just about functionality, but how good a solution is, and how many alternatives exist
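One way to read the map is as a set of non-dominated design points. The following is a minimal sketch of such a filter, not the tool's actual algorithm; the candidate tuples (energy, area, clock period, all normalized, lower is better) are made-up illustrative values.

```python
def dominates(a, b):
    """True if a is no worse than b in every metric and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical (energy, area, clock period) triples for candidate architectures.
    candidates = [(1.0, 1.0, 1.0), (0.5, 1.2, 1.0), (0.6, 1.3, 1.0), (0.4, 0.9, 2.0)]
    print(pareto_front(candidates))
```

Every surviving point is a feasible alternative; the designer then picks among them according to the constraints.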
14.21
E-A-P Space
RTL, switching activity
Energy-area-performance estimate
Simulink Synthesis
An Optimization Result
[Figure: surface of valid architectures within the constraints, with direct mapping as the reference point; Area axis shown, axes normalized from 0.2 to 1]
Time-mux
Retiming
Pipelining
Parallelism
Gate sizing
Carry save
Fine pipe
IP cores
14.22
Architecture Tuning Result: MDL
[Figure: reference architecture built from multipliers (×), adders (+), and delays (D), with In and Out ports]
Reference
Time-multiplex (N = 2): Lower Area
– N = 2 multiplier core
– N = 2 adder core
– input mux
– controller
Parallel (P = 4): Higher Throughput or Lower Energy
– 4-way multiplier core
– 4-way adder core
– input de-mux
– output mux
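To make the time-multiplexed option concrete, here is a small behavioral sketch of a FIR filter whose multipliers are each reused for N taps. It is an illustration under assumptions (integer data, a tap count divisible by N, hypothetical function names), not the generated Simulink model.

```python
def fir_direct(x, h):
    """Reference: one multiplier per tap, one output per clock cycle."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_time_mux(x, h, n_fold):
    """Time-multiplexed FIR: len(h) / n_fold multipliers, each serving n_fold taps."""
    assert len(h) % n_fold == 0
    n_mult = len(h) // n_fold
    y, cycles = [], 0
    for n in range(len(x)):
        acc = 0
        for phase in range(n_fold):           # one clock cycle per phase
            for m in range(n_mult):           # multipliers that work side by side
                k = m * n_fold + phase        # tap handled by multiplier m in this phase
                if n - k >= 0:
                    acc += h[k] * x[n - k]
            cycles += 1
        y.append(acc)
    return y, n_mult, cycles


if __name__ == "__main__":
    x, h = [1, 2, 3, 4, 5], [1, 2, 3, 4]
    print(fir_time_mux(x, h, 2), fir_direct(x, h))
```

With N = 2 the multiplier count halves while each output takes two clock cycles, which is the lower-area side of the trade-off.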
14.23
Pipelining Strategy
[Figure: pipelining strategy; axes: Latency, Cycle Time, Energy; blocks: mult, add; annotations: VDD scaling, VDDref, gate sizing, TClk @ VDDref, TClk @ VDDopt]
Library blocks / macros synthesized @ VDDref
Pipeline logic scaling
FO4 inv simulation
Speed, Power, Area
[7] D. Marković, B. Nikolić, and R.W. Brodersen, "Power and Area Efficient VLSI Architectures for Communication Signal Processing," in Proc. Int. Conf. on Communications, June 2006, vol. 7, pp. 3223-3228.
[7]
14.24
Optimization Results: 16-tap FIR Filter
Design variables: CSA, fine R (f-R), VDD (0.32 V to 1 V), pipelining
[Figure: Energy (pJ, log scale) vs. Area (μm², ×10^4, from 1.8 to 3.4) in 90 nm CMOS; curves: Ref. no CSA, Ref. CSA, Ref. CSA f-R, Pip + R, Pip + R + f-R; label: VDD; points are annotated with throughput from 166 M to 623 M, where M = MS/s]
14.25
16-tap FIR: Architecture Parallelism (Unfolding)
Parallelism improves throughput or energy efficiency
– About 10x range in efficiency from VDD scaling (0.32 V – 1 V)
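A behavioral sketch of P-way parallelism for a 16-tap FIR follows. It only shows that P filter copies produce P outputs per clock cycle with unchanged results; the coefficients and input are arbitrary integers chosen for the example, and the function names are hypothetical.

```python
def fir_direct(x, h):
    """Reference: one output per clock cycle."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_parallel(x, h, p):
    """Block processing: p filter copies compute p outputs in each clock cycle."""
    y = [0] * len(x)
    cycles = 0
    for base in range(0, len(x), p):          # one clock cycle per block of p samples
        for lane in range(p):                 # p copies of the filter work side by side
            n = base + lane
            if n < len(x):
                y[n] = sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        cycles += 1
    return y, cycles


if __name__ == "__main__":
    h = list(range(1, 17))                    # 16 arbitrary integer taps
    x = list(range(32))
    y, cycles = fir_parallel(x, h, 2)
    print(y == fir_direct(x, h), cycles)
```

Because the same 32 samples finish in 16 cycles with P = 2, the clock can either stay fixed for twice the throughput or slow down by 2x so that VDD can be lowered.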
Approach: use Simulink test bench (TB) for ASIC verification
– Develop custom interface blocks (I/O)
– Place I/O and ASIC RTL into TB model
[Figure: TB + I/O + ASIC = TB model containing the I/O and ASIC blocks]
Simulink implicitly provides the test bench
Additional requirements from the FPGA test platform
– As general purpose as possible (large memories, fast I/O)
– Use embedded CPU to provide high-level interface to FPGA
[8] D. Marković et al., "ASIC Design and Verification in an FPGA Environment," in Proc. Custom Integrated Circuits Conf., Sep. 2007, pp. 737-740.
[8]
14.31
Design Environment: Xilinx System Generator
Custom interface blocks
– Regs, FIFOs, BRAMs
– GPIO ports
– Analog subsystems
– Debugging
1-click compile
[9] C. Chang, Design and Applications of a Reconfigurable Computing System for High Performance Digital Signal Processing, Ph.D. Thesis, University of California, Berkeley, 2005.
[9]
14.32
Simulink Test Model
[Figure: Simulink test model. A Simulink hardware model (software_reg, sim_rst, reg0, BRAM_IN, BRAM_FPGA, BRAM_ASIC, logic, constant blocks) connects through gpio blocks to the ASIC board; BRAM ports: IN, OUT, ADDR, WE; signals: in, out, rst, reset, clk]
14.33
Example: SVD Test Model
Emulation-based ASIC I/O test
14.34
FPGA Based ASIC Test Setup
Test bench model on the FPGA board
Block read / write operation
– Custom read_xps, write_xps commands
FPGA board
Client PC
ASIC board
PC to FPGA interface
– UART RS232 (slow, limited applicability)
– Ethernet (with an FPGA operating system support)
FPGA-ASIC interface
– GPIO (electrically limited to ~130 Mbps)
– High-speed ZDOK+ differential-line link (~500 Mbps, fclk limited)
14.35
Low Data-Rate Test Setup
IBOB FPGA board
ASIC board
Limitations: Speed of RS232 (~kb/s) & GPIO interface (~130 Mbps)
The trend is towards fully embedding logic analysis on FPGA, including OS support for remote access
14.43
Further Extensions
Design recommendations
– Send source-synchronous clock with returned data
– Send synchronization information with returned data
● “Vector warning” or frame start, data valid
KATCP: communication protocol interfacing to BORPH
– Can be implemented over TCP telnet connection
– Libraries and clients for C, Python
– KATCP MATLAB client (replaces read_xps, write_xps)
● Can program FPGA directly from MATLAB – no more JTAG cable!
● Provides byte-level read/write granularity
● Increases speed from ~Kb/s to ~Mb/s (Room for improvement; currently high protocol overhead)
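For orientation, the sketch below formats and parses KATCP-style text lines. It is a simplified illustration based on my understanding of the protocol's line format (requests start with "?", replies with "!", arguments are separated by spaces, and a few characters are backslash-escaped); the request name read, the register name reg0, and the argument layout are assumptions made for the example, and received arguments are not unescaped.

```python
# Subset of escape sequences, as assumed for this sketch.
ESCAPES = {"\\": "\\\\", " ": "\\_", "\n": "\\n", "\t": "\\t"}


def escape(arg):
    """Escape one argument; an empty argument is written as \\@."""
    if arg == "":
        return "\\@"
    return "".join(ESCAPES.get(ch, ch) for ch in arg)


def format_request(name, *args):
    """Build one request line, e.g. '?read reg0 0 4' followed by a newline."""
    return " ".join(["?" + name] + [escape(str(a)) for a in args]) + "\n"


def parse_reply(line):
    """Split a reply line such as '!read ok abcd' into (name, status, remaining fields)."""
    fields = line.strip().split(" ")
    if not fields[0].startswith("!"):
        raise ValueError("not a reply message")
    status = fields[1] if len(fields) > 1 else ""
    return fields[0][1:], status, fields[2:]


if __name__ == "__main__":
    print(repr(format_request("read", "reg0", 0, 4)))
    print(parse_reply("!read ok abcd\n"))
```

Since each message is a single line of text, the same exchange can be typed by hand over a telnet session, which is what makes the protocol easy to wrap in C, Python, or MATLAB clients.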
Towards streaming
– Transition to TCP/IP-based protocols facilitates streaming
– Ethernet streaming w/o going through shared memory
14.44
Summary
MATLAB/Simulink is an environment for algorithm modeling and optimized hardware implementation
– Bit-true cycle-accurate model can be used for functional verification and mapping to FPGA/ASIC hardware
– The environment is suited for automated architecture exploration using high-level scheduling and retiming
– Test vectors used in algorithm development can also be used for functional verification of fabricated ASIC
Enhancements to traditional FPGA-based verification
– Operating system can be hosted on an FPGA for remote access and software-like execution of hardware processes
– Test vectors can be hosted on FPGA for real-time data streaming (for data-limited or high-performance applications)
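Reusing test vectors for verification relies on bit-true comparison. A minimal sketch of that idea is shown below, assuming signed fixed-point words with saturation and Python's built-in rounding (ties to even); the word lengths and helper names are hypothetical and would have to match the actual hardware model.

```python
def to_fixed(value, frac_bits, word_bits):
    """Quantize a real value to a signed fixed-point integer, saturating on overflow."""
    scaled = int(round(value * (1 << frac_bits)))
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def first_mismatch(expected, actual):
    """Index of the first differing sample, or None if the vectors match bit for bit."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


if __name__ == "__main__":
    golden = [to_fixed(v, 8, 16) for v in (0.5, -0.25, 200.0)]
    print(golden, first_mismatch(golden, golden))
```

The same comparison can run against simulation output, FPGA emulation output, or data captured from the fabricated ASIC.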
14.45
ILP Models for Scheduling and Retiming
Basic ILP Model for Scheduling and Retiming
Minimize: $\sum_{p} c_p \cdot M_p$, where $M_p$ is the # of PEs of type $p$
Subject to:
– Resource constraint: $\sum_{u=1}^{|V|} x_{uj} \le M_p$
– Each node is scheduled once: $\sum_{j=1}^{N} x_{uj} = 1$
– Precedence constraints: $w_f = N \cdot w - d + A \cdot p + N \cdot A \cdot r \ge 0$ (scheduling: $A \cdot p$; retiming: $N \cdot A \cdot r$)
Case 1: r = 0 (scheduling only, no retiming): suboptimal
Case 2: r ≠ 0 (scheduling with retiming): exponential run time
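The r = 0 case can be illustrated on a toy graph. The sketch below swaps the ILP solver for an exhaustive search, which is only practical for a handful of nodes. It checks the precedence condition N·w − d(u) + p(v) − p(u) ≥ 0 on each edge u → v and minimizes the PE cost, taking the PE count of each type as the peak number of same-type nodes in one time slot. The graph, the costs, and the pipeline depths are hypothetical.

```python
from itertools import product


def schedule(nodes, edges, n_slots, cost, delay):
    """nodes: {name: type}; edges: [(u, v, w)]; returns (cost, slots, pe_count) or None."""
    names = list(nodes)
    best = None
    for slots in product(range(n_slots), repeat=len(names)):
        p = dict(zip(names, slots))
        # Precedence with r = 0: N*w - d(u) + p(v) - p(u) >= 0 on every edge.
        if any(n_slots * w - delay[nodes[u]] + p[v] - p[u] < 0 for u, v, w in edges):
            continue
        # PE count per type = peak number of same-type nodes sharing one time slot.
        m = {}
        for t in set(nodes.values()):
            m[t] = max(sum(1 for u in names if nodes[u] == t and p[u] == j)
                       for j in range(n_slots))
        total = sum(cost[t] * m[t] for t in m)
        if best is None or total < best[0]:
            best = (total, p, m)
    return best


if __name__ == "__main__":
    nodes = {"m1": "mul", "m2": "mul", "a1": "add", "a2": "add"}
    edges = [("m1", "a1", 0), ("m2", "a2", 1), ("a1", "a2", 1)]   # (u, v, registers)
    print(schedule(nodes, edges, 2, {"mul": 4, "add": 1}, {"mul": 1, "add": 0}))
```

In this toy case a folding factor of N = 2 lets one multiplier and one adder serve all four operations. The search space grows as N^|V|, which is why a real flow needs an ILP solver and why adding the retiming variables makes the run time worse still.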
14.47
Time-Efficient ILP Model for Scheduling & Retiming
Minimize: $\sum_{p} c_p \cdot M_p$, where $M_p$ is the # of PEs of type $p$
Subject to (Integer Linear Programming):
– Resource constraint: $\sum_{u=1}^{|V|} x_{uj} \le M_p$
– Each node is scheduled once: $\sum_{j=1}^{N} x_{uj} = 1$
– Loop constraints to ensure feasibility of retiming: $B \cdot (w + (A \cdot p - d)/N) \ge 0$
Then (Bellman-Ford):
– Retiming inequalities solved by Bellman-Ford (B-F) Algorithm: $-A \cdot r \le w + (A \cdot p - d)/N$
Precedence constraints: $w_f = N \cdot w - d + A \cdot p + N \cdot A \cdot r \ge 0$
Simulink Arch.Opt
Feasible CPU runtime (polynomial complexity of B-F algorithm)
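Retiming inequalities of the form −A·r ≤ b are difference constraints, so Bellman-Ford applies directly. The sketch below assumes the usual convention that, for an edge u → v, the entry of A·r is r(v) − r(u), so each inequality reads r(u) − r(v) ≤ b. It also assumes the right-hand sides have already been reduced to integers (for example by taking a floor), which is my assumption rather than something stated on the slide.

```python
def solve_retiming(n_nodes, constraints):
    """constraints: list of (u, v, b) meaning r[u] - r[v] <= b. Returns r, or None if infeasible."""
    # Constraint graph: edge v -> u with weight b. Starting all distances at 0
    # is equivalent to a virtual source connected to every node with weight 0.
    dist = [0] * n_nodes
    for _ in range(n_nodes):
        changed = False
        for u, v, b in constraints:
            if dist[v] + b < dist[u]:         # relax edge v -> u
                dist[u] = dist[v] + b
                changed = True
        if not changed:
            return dist                       # shortest distances satisfy every constraint
    return None                               # still relaxing: negative cycle, no valid retiming


if __name__ == "__main__":
    print(solve_retiming(3, [(0, 1, 1), (1, 2, -1), (2, 0, 0)]))
    print(solve_retiming(2, [(0, 1, -1), (1, 0, 0)]))
```

The run time is proportional to the number of nodes times the number of constraints, in line with the polynomial complexity noted above. A negative cycle in the constraint graph corresponds to a violated loop constraint, which is what the ILP step rules out beforehand.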
14.48
Example: Wave Digital Filter
Method 3 (Sch. + B-F retiming):
Power & Area optimal
Reduced CPU runtime
Method 3 yields optimal solution with feasible CPU runtime
[Figure: three panels vs. Folding Factor: CPU Runtime (sec, log scale from 10^-2 to 10^6), Area (Norm.), and Power (Norm.); CPU runtime curves: Method 1, Method 2 (*), Method 3; area and power curves: scheduling, scheduling + retiming; markers: Optimal, Suboptimal, No solution]
Architecture ILP scheduling:
Method 1: Scheduling
Method 2: Scheduling + retiming
Method 3: Sched. + Bellman Ford
(*) reported CPU runtime for Method 2 is very optimistic (bounded retiming variables)
Goal: architecture optimization in area-power-performance space
14.49
References (1/2)
W.R. Davis et al., "A Design Environment for High Throughput, Low Power Dedicated Signal Processing Systems," IEEE J. Solid-State Circuits, vol. 37, no. 3, pp. 420-431, Mar. 2002.
K. Kuusilinna, et al., "Real Time System-on-a-Chip Emulation," in Winning the SoC Revolution, by H. Chang, G. Martin, Kluwer Academic Publishers, 2003.
D. Marković, A Power/Area Optimal Approach to VLSI Signal Processing, Ph.D. Thesis, University of California, Berkeley, 2006.
R. Nanda, DSP Architecture Optimization in Matlab/Simulink Environment, M.S. Thesis, University of California, Los Angeles, 2008.
R. Nanda, C.-H. Yang, and D. Marković, "DSP Architecture Optimization in Matlab/Simulink Environment," in Proc. Int. Symp. VLSI Circuits, June 2008, pp. 192-193.
14.50
References (2/2)
D. Marković, B. Nikolić, and R.W. Brodersen, "Power and Area Efficient VLSI Architectures for Communication Signal Processing," in Proc. Int. Conf. Communications, June 2006, vol. 7, pp. 3223-3228.
D. Marković, et al., "ASIC Design and Verification in an FPGA Environment," in Proc. Custom Integrated Circuits Conf., Sept. 2007, pp. 737-740.
C. Chang, Design and Applications of a Reconfigurable Computing System for High Performance Digital Signal Processing, Ph.D. Thesis, University of California, Berkeley, 2005.
H. So, A. Tkachenko, and R.W. Brodersen, "A Unified Hardware/Software Runtime Environment for FPGA-Based Reconfigurable Computers using BORPH," in Proc. Int. Conf. Hardware/Software Codesign and System Synthesis, 2008, pp. 259-264.