Contents Timing optimization Area optimization Additional readings
Budapest University of Technology and Economics
RTL Optimization Techniques
Péter Horváth
Department of Electron Devices
March 30, 2016
Péter Horváth RTL Optimization Techniques 1 / 20
Contents
timing optimization concepts and design techniques
    throughput, latency, local datapath delay
    loop unrolling, removing pipeline registers, register balancing
area optimization concepts and design techniques
    resource requirement metrics in standard cell ASIC and FPGA
    control-based logic reuse, priority encoders, considering technology primitives
additional readings
Timing optimization
Computation performance concepts
There are three important concepts related to computation performance.
    throughput: The amount of data processed per unit of time. It is counted per clock cycle and usually reported in bits per second.
    latency: The time elapsed between data input and processed data output (clock cycles).
    local datapath delay: The delay of the logic between storage elements (nanoseconds). It determines the maximum clock frequency.
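The relation between these three quantities can be sketched with a simplified timing model. The symbols are assumptions introduced here, not taken from the slides: t_cq is the clock-to-output delay of a register, t_setup its setup time, t_logic,max the slowest local datapath delay, w the data width in bits, and N_in the number of clock cycles between two accepted inputs. Clock skew and routing effects are ignored, and the 100 MHz clock is only an illustrative value for a 32-bit design.

```latex
\begin{align*}
T_{\mathrm{clk,min}} &= t_{\mathrm{cq}} + t_{\mathrm{logic,max}} + t_{\mathrm{setup}},
  \qquad f_{\mathrm{max}} = \frac{1}{T_{\mathrm{clk,min}}} \\
\text{throughput} &= \frac{w \cdot f_{\mathrm{clk}}}{N_{\mathrm{in}}}
  \quad \text{(bits per second)} \\
N_{\mathrm{in}} = 1:\quad
  & \frac{32 \cdot 100\ \mathrm{MHz}}{1} = 3.2\ \mathrm{Gbit/s} \\
N_{\mathrm{in}} = 3:\quad
  & \frac{32 \cdot 100\ \mathrm{MHz}}{3} \approx 1.07\ \mathrm{Gbit/s}
\end{align*}
```

Under these assumptions, a design that accepts a new input every cycle delivers three times the throughput of one that accepts an input every third cycle, even if both need three cycles of latency per result.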
Timing optimization techniques
High throughput – loop unrolling (pipeline)
In high-throughput optimization the time required to process a single data item is irrelevant; instead, the time elapsed between two input reads is minimized.
Data n+1 is read while data n is still being processed.
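architecture iterative of pow3 is
begin
    process (clk) begin
        if (rising_edge(clk)) then
            if (start = '1') then
                -- load the operand; two multiplications remain
                count <= 2;
                pow <= x;
            elsif (stop = '0') then
                -- one multiplication per clock cycle, reusing one multiplier
                count <= count - 1;
                pow <= pow * x;
            end if;
        end if;
    end process;
    -- stop marks the end of the iteration
    stop <= '1' when count = 0 else '0';
end architecture;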
Low latency – removing pipeline registers
[Figure: two 32-bit pow3 datapaths. The first uses the registers x1, x2, pow1 and pow (latency: 3 cycles). The second keeps only the output register pow (latency: 1 cycle).]
The objective of the low-latency optimization is to pass the data from the input to the output with minimal internal processing delay.
A low-latency design uses parallelism and removes pipeline registers.
architecture pipelined of pow3 is
begin
    process (clk) begin
        if (rising_edge(clk)) then
            -- stage 1
            x1 <= x;
            -- stage 2
            x2 <= x1;
            pow1 <= x1 * x1;
            -- stage 3
            pow <= pow1 * x2;
        end if;
    end process;
end architecture;
latency: 3 cycles
architecture async of pow3 is
begin
    process (x) begin
        x1 <= x;
    end process;
    process (x1) begin
        x2 <= x1;
        pow1 <= x1 * x1;
    end process;
    pow <= pow1 * x2;
end architecture;
latency: 1 cycle (with an additional output register)
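The one-cycle variant with an output register can be sketched as follows. This is a minimal illustration, not the author's code: the entity name pow3_lowlat, the architecture name and the unsigned port types are assumptions, and the product is truncated to its 32 least significant bits so that the port widths match.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity; the slides show only the architectures.
entity pow3_lowlat is
    port (
        clk : in  std_logic;
        x   : in  unsigned(31 downto 0);
        pow : out unsigned(31 downto 0)
    );
end entity;

architecture registered_output of pow3_lowlat is
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- Both multiplications sit in one combinational path.
            -- The 96-bit product is truncated to its 32 low bits.
            pow <= resize(x * x * x, 32);
        end if;
    end process;
end architecture;
```

The price of the single-cycle latency is a local datapath that contains two multipliers in series, so the achievable clock frequency is lower than in the pipelined version.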
Minimizing logic delay – register layers
[Figure: two 32-bit FIR datapaths computing y from x and the coefficients A, B and C. Without product registers, the local datapaths contain 1 adder and 1 multiplier. With the registers prod1, prod2 and prod3 inserted, the local datapaths contain 1 adder or 1 multiplier.]
The logic between two sequential elements is called a local datapath.
The delay of the slowest local datapath determines the maximum clock frequency.
The local datapath delay can be reduced by inserting additional register layers.
architecture single_cycle of fir is
begin
    process (clk) begin
        if (rising_edge(clk)) then
            if (valid = '1') then
                x1 <= x;
                x2 <= x1;
                y <= A*x + B*x1 + C*x2;
            end if;
        end if;
    end process;
end architecture;

architecture multi_cycle of fir is
begin
    process (clk) begin
        if (rising_edge(clk)) then
            if (valid = '1') then
                x1 <= x;
                x2 <= x1;
                prod1 <= A * x;
                prod2 <= B * x1;
                prod3 <= C * x2;
                y <= prod1 + prod2 + prod3;
            end if;
        end if;
    end process;
end architecture;
Minimizing logic delay – register balancing
[Figure: two 32-bit three-input adders. In the unbalanced version the inputs in_a, in_b and in_c are registered first and both adders precede the sum register (local datapaths: 2 adders). In the balanced version the registers reg_ab_sum and reg_c separate the two adders (local datapaths: 1 adder).]
During register balancing the logic between registers is redistributed in order to minimize the worst-case delay between any pair of registers.
architecture not_balanced of add3 is
begin
    process (clk) begin
        if (rising_edge(clk)) then
            reg_a <= in_a;
            reg_b <= in_b;
            reg_c <= in_c;
            sum <= reg_a + reg_b + reg_c;
        end if;
    end process;
end architecture;

architecture balanced of add3 is
begin
    process (clk) begin
        if (rising_edge(clk)) then
            reg_ab_sum <= in_a + in_b;
            reg_c <= in_c;
            sum <= reg_ab_sum + reg_c;
        end if;
    end process;
end architecture;
Area optimization
Area concepts
The resource requirement is the number of basic functional primitives needed to implement the described functionality.
The basic functional primitives in standard cell ASICs are the standard cells, which can be simple logic gates and flip-flops but also more complex arithmetic-logic functions or memories.
The basic logic elements (BLE) of FPGAs consist of a logic function (the number of inputs depends on the vendor and the device family), a flip-flop and a multiplexer. There are special-purpose resources as well, such as memory blocks and signal processing elements (multipliers).
Area optimization techniques
Minimizing area – control-based logic reuse
Control-based logic reuse can be considered the opposite of loop unrolling. A pipeline requires internal data storage resources and additional logic to implement parallel operation. These resources can be reused at the cost of reduced throughput.
[Figure: two 32-bit datapaths summing in1, in2, in3 and in4 into the register acc. The first uses three adders and the pipeline registers plr1 and plr2. The second uses a single adder, an input multiplexer controlled by sel_input and an FSM that also generates ce_acc.]
Control-based logic reuse requires an FSM to generate control signals.
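A single-adder version of the four-input sum could be sketched as below. This is an illustrative reconstruction under stated assumptions, not the code behind the figure: the entity name add4_shared, the start and done handshake signals, the state names and the synchronous reset are all hypothetical, and the inputs are assumed to stay stable for the four accumulation cycles.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: sums four inputs with one shared adder.
entity add4_shared is
    port (
        clk, reset         : in  std_logic;
        start              : in  std_logic;  -- assumed handshake input
        in1, in2, in3, in4 : in  unsigned(31 downto 0);
        acc                : out unsigned(31 downto 0);
        done               : out std_logic   -- assumed: '1' while the FSM is idle
    );
end entity;

architecture shared of add4_shared is
    type state_t is (idle, add1, add2, add3, add4);
    signal state   : state_t := idle;
    signal acc_reg : unsigned(31 downto 0) := (others => '0');
    signal operand : unsigned(31 downto 0);
    signal sum     : unsigned(31 downto 0);
begin
    -- Input multiplexer; the FSM state plays the role of sel_input.
    with state select
        operand <= in1 when add1,
                   in2 when add2,
                   in3 when add3,
                   in4 when add4,
                   (others => '0') when others;

    -- The only adder in the design, reused in every accumulation state.
    sum <= acc_reg + operand;

    process (clk)
    begin
        if rising_edge(clk) then
            if reset = '1' then
                state   <= idle;
                acc_reg <= (others => '0');
            else
                case state is
                    when idle =>
                        if start = '1' then
                            acc_reg <= (others => '0');  -- clear the accumulator
                            state   <= add1;
                        end if;
                    when add1 => acc_reg <= sum; state <= add2;
                    when add2 => acc_reg <= sum; state <= add3;
                    when add3 => acc_reg <= sum; state <= add4;
                    when add4 => acc_reg <= sum; state <= idle;
                end case;
            end if;
        end if;
    end process;

    acc  <= acc_reg;
    done <= '1' when state = idle else '0';
end architecture;
```

In this sketch a new set of inputs can be accepted only every fifth clock cycle, which is the throughput cost of replacing three adders with one adder, a multiplexer and the FSM.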
Minimizing area – priority encoders
The resource requirement can be reduced if the mutual exclusion of the conditions is exploited. The elsif statement should be used only if a priority encoder is required, that is, when the conditions are not mutually exclusive.
architecture priority of logic is
begin
    process (clk) begin
        if (rising_edge(clk)) then
            if (ctrl(0) = '1') then
                output(0) <= input;
            elsif (ctrl(1) = '1') then
                output(1) <= input;
            elsif (ctrl(2) = '1') then
                output(2) <= input;
            elsif (ctrl(3) = '1') then
                output(3) <= input;
            end if;
        end if;
    end process;
end architecture;

architecture not_priority of logic is
begin
    process (clk) begin
        if (rising_edge(clk)) then
            if (ctrl(0) = '1') then
                output(0) <= input;
            end if;
            if (ctrl(1) = '1') then
                output(1) <= input;
            end if;
            if (ctrl(2) = '1') then
                output(2) <= input;
            end if;
            if (ctrl(3) = '1') then
                output(3) <= input;
            end if;
        end if;
    end process;
end architecture;
[Figure: the 32-bit registers output_a to output_d, each loaded from input through a multiplexer selected by the 4-bit ctrl signal. Without exploiting mutual exclusion, the select logic of each register depends on its own ctrl bit and on the ctrl bits of higher priority. With mutual exclusion exploited, each register is selected by a single ctrl bit.]
Minimizing area – considering technology primitives
With an appropriate HDL coding style, more efficient logic synthesis can be achieved. Synthesis tool vendors usually provide coding guidelines to improve the resource requirement or the timing parameters of the design. The proposed coding style takes the unique characteristics of the technology primitives into consideration.
    utilizing block RAM modules in FPGAs: Block RAM modules do not have any reset inputs and their outputs are synchronous to a clock signal. Only HDL models with these properties can be implemented in block RAMs.
    utilizing high quality DSP units: The DSP slices in FPGAs have synchronous outputs. This restriction has to be taken into account when writing the HDL model.
architecture FFS of RAM is
begin
    -- reset is asynchronous, so it must be in the sensitivity list
    process (clk, reset) begin
        if (reset = '1') then
            content <= (others=>(others=>'0'));
        elsif (rising_edge(clk)) then
            if (write = '1') then
                content(address) <= data_in;
            end if;
        end if;
    end process;
    -- asynchronous read
    data_out <= content(address);
end architecture;

Because of the asynchronous output this model cannot be implemented in block RAM. The reset function hinders the LUT implementation as well.

architecture BRAM of RAM is
begin
    process (clk) begin
        if (rising_edge(clk)) then
            if (write = '1') then
                content(address) <= data_in;
            end if;
            -- synchronous read
            data_out <= content(address);
        end if;
    end process;
end architecture;

This model can be implemented as flip-flops, LUT RAM and block RAM as well.
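The architectures above omit the entity and the declaration of content. A self-contained version of the synchronous-read model might look like the following sketch. The entity name ram_sync, the generics, the unsigned address type and the array type ram_t are assumptions made for illustration; whether a tool actually maps the model to block RAM depends on the synthesis tool and the target device.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity; names and generics are illustrative.
entity ram_sync is
    generic (
        ADDR_WIDTH : positive := 10;
        DATA_WIDTH : positive := 32
    );
    port (
        clk      : in  std_logic;
        write    : in  std_logic;
        address  : in  unsigned(ADDR_WIDTH - 1 downto 0);
        data_in  : in  std_logic_vector(DATA_WIDTH - 1 downto 0);
        data_out : out std_logic_vector(DATA_WIDTH - 1 downto 0)
    );
end entity;

architecture bram_style of ram_sync is
    type ram_t is array (0 to 2**ADDR_WIDTH - 1)
        of std_logic_vector(DATA_WIDTH - 1 downto 0);
    signal content : ram_t;  -- no reset on the memory content
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if write = '1' then
                content(to_integer(address)) <= data_in;
            end if;
            -- Synchronous read: on a write, the old word is returned.
            data_out <= content(to_integer(address));
        end if;
    end process;
end architecture;
```

Because the read is registered, a word written in one clock cycle becomes visible on data_out one cycle later at the earliest.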
Additional readings
Steve Kilts – Advanced FPGA Design, Architecture, Implementation, and Optimization
David Money Harris, Sarah L. Harris – Digital Design and Computer Architecture
Peter J. Ashenden – Digital Design – An Embedded System Approach Using VHDL
M. Morris Mano, Charles R. Kime – Logic and Computer Design Fundamentals
Pong P. Chu – RTL Hardware Design Using VHDL
Peter Wilson – Design Recipes for FPGAs