
Budapest University of Technology and Economics

RTL Optimization Techniques

Péter Horváth

Department of Electron Devices

March 30, 2016


Contents

timing optimization concepts and design techniques
    throughput, latency, local datapath delay
    loop unrolling, removing pipeline registers, register balancing

area optimization concepts and design techniques
    resource requirement metrics in standard cell ASIC and FPGA
    control-based logic reuse, priority encoders, considering technology primitives

additional readings



Timing optimization



Computation performance concepts


There are three important concepts related to computation performance.

throughput: The amount of data processed per clock cycle (bits per cycle; multiplying by the clock frequency gives bits per second).
latency: The time elapsed between a data input and the corresponding processed data output (clock cycles).
local datapath delay: The delay of the logic between two storage elements (nanoseconds). It determines the maximum clock frequency.
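As a worked illustration of the throughput definition above, the relation can be written out as follows. The symbols W (word width in bits), N (clock cycles between two consecutive input reads) and f_clk, as well as the 100 MHz clock, are assumptions introduced only for this sketch.

```latex
\[
  \text{throughput}_{\text{cycle}} = \frac{W}{N}\ \text{bits/cycle},
  \qquad
  \text{throughput}_{\text{second}} = \frac{W}{N}\, f_{\text{clk}}\ \text{bits/s}
\]
% Example with W = 32:
%   N = 3 (a new input every third cycle): 32/3 \approx 10.7 bits/cycle
%   N = 1 (a new input every cycle):       32/1 = 32 bits/cycle
% At an assumed f_clk = 100 MHz the second case gives
\[
  32\ \text{bits/cycle} \times 100\ \text{MHz} = 3.2\ \text{Gbit/s}
\]
```

Latency does not appear in this relation: two designs with the same latency can differ in throughput by the factor N.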




Timing optimization techniques

High throughput – loop unrolling (pipeline)

[Figure: iterative pow3 datapath. Signals: x[31:0], start, clk, pow[31:0]; register pow; 32-bit buses. Throughput: 32/3 = 10.7 bits/cycle; latency: 3 cycles.]

[Figure: pipelined pow3 datapath. Signals: x[31:0], clk, pow[31:0]; registers x1, x2, pow1 and pow; 32-bit buses. Throughput: 32/1 = 32 bits/cycle; latency: 3 cycles.]

In high-throughput optimization the time required to process a single data item is irrelevant; the time elapsed between two input reads is what gets minimized.
Data n+1 is read while data n is still being processed.

    architecture iterative of pow3 is
    begin
        process (clk)
        begin
            if (rising_edge(clk)) then
                if (start = '1') then
                    -- load the operand; two multiplications remain
                    count <= 2;
                    pow   <= x;
                elsif (stop = '0') then
                    -- one multiplication per clock cycle
                    count <= count - 1;
                    pow   <= pow * x;
                end if;
            end if;
        end process;

        stop <= '1' when count = 0 else '0';
    end architecture;

throughput: 32/3 = 10.7 bits/cycle; latency: 3 cycles

    architecture pipelined of pow3 is
    begin
        process (clk)
        begin
            if (rising_edge(clk)) then
                -- stage 1
                x1   <= x;
                -- stage 2
                x2   <= x1;
                pow1 <= x1 * x1;
                -- stage 3
                pow  <= pow1 * x2;
            end if;
        end process;
    end architecture;

throughput: 32/1 = 32 bits/cycle; latency: 3 cycles




Low latency – removing pipeline registers

[Figure: pipelined pow3 datapath with registers x1, x2, pow1 and pow; 32-bit buses. Latency: 3 cycles.]

[Figure: pow3 datapath with the internal pipeline registers removed; only the output register pow remains; 32-bit buses. Latency: 1 cycle.]

The objective of low-latency optimization is to pass the data from the input to the output with minimal internal processing delay.
A low-latency design uses parallelism and removes pipeline registers.

    architecture pipelined of pow3 is
    begin
        process (clk)
        begin
            if (rising_edge(clk)) then
                -- stage 1
                x1   <= x;
                -- stage 2
                x2   <= x1;
                pow1 <= x1 * x1;
                -- stage 3
                pow  <= pow1 * x2;
            end if;
        end process;
    end architecture;

latency: 3 cycles

    architecture async of pow3 is
    begin
        -- purely combinational: no clock, no internal registers
        process (x)
        begin
            x1 <= x;
        end process;

        process (x1)
        begin
            x2   <= x1;
            pow1 <= x1 * x1;
        end process;

        pow <= pow1 * x2;
    end architecture;

latency: 1 cycle (with an additional output register)




Minimizing logic delay – register layers

[Figure: FIR datapath with registers x1, x2 and y; inputs x[31:0], A[31:0], B[31:0], C[31:0]; output y[31:0]. Local datapaths: 1 adder and 1 multiplier.]

[Figure: FIR datapath with the additional register layer prod1, prod2, prod3 between the multipliers and the adder. Local datapaths: 1 adder or 1 multiplier.]

The logic between two sequential elements is called a local datapath.
The delay of the slowest local datapath determines the maximum clock frequency.
The local datapath delay can be reduced by inserting additional register layers.
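The dependence of the clock frequency on the slowest local datapath can be sketched with the usual register-to-register timing relation. The symbols t_cq (clock-to-output delay of the source register), t_logic,max (delay of the slowest local datapath) and t_su (setup time of the destination register) and all numeric values are assumptions for illustration; clock skew and routing effects are ignored.

```latex
\[
  T_{\min} = t_{cq} + t_{\text{logic,max}} + t_{su},
  \qquad
  f_{\max} = \frac{1}{T_{\min}}
\]
% Assumed values: t_cq = 0.5 ns, t_su = 1.0 ns
%   multiplier and adder in one local datapath, t_logic,max = 8.5 ns:
%     T_min = 0.5 + 8.5 + 1.0 = 10 ns  ->  f_max = 100 MHz
%   after inserting a register layer, t_logic,max = 4.5 ns:
%     T_min = 0.5 + 4.5 + 1.0 = 6 ns   ->  f_max \approx 166.7 MHz
```

The register overhead t_cq + t_su is paid in every stage, so halving the logic delay does not double the clock frequency.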

    architecture single_cycle of fir is
    begin
        process (clk)
        begin
            if (rising_edge(clk)) then
                if (valid = '1') then
                    x1 <= x;
                    x2 <= x1;
                    -- multipliers and adder in the same local datapath
                    y  <= A*x + B*x1 + C*x2;
                end if;
            end if;
        end process;
    end architecture;

    architecture multi_cycle of fir is
    begin
        process (clk)
        begin
            if (rising_edge(clk)) then
                if (valid = '1') then
                    x1    <= x;
                    x2    <= x1;
                    -- additional register layer after the multipliers
                    prod1 <= A * x;
                    prod2 <= B * x1;
                    prod3 <= C * x2;
                    y     <= prod1 + prod2 + prod3;
                end if;
            end if;
        end process;
    end architecture;




Minimizing logic delay – register balancing

[Figure: unbalanced three-input adder with input registers reg_a, reg_b, reg_c and output register sum; inputs in_a[31:0], in_b[31:0], in_c[31:0]; output sum[31:0]. Local datapaths: 2 adders.]

[Figure: balanced three-input adder with registers reg_ab_sum, reg_c and sum; one adder before and one adder after the first register layer. Local datapaths: 1 adder.]

During register balancing the logic between registers is redistributed in order to minimize the worst-case delay between any pair of registers.

    architecture not_balanced of add3 is
    begin
        process (clk)
        begin
            if (rising_edge(clk)) then
                reg_a <= in_a;
                reg_b <= in_b;
                reg_c <= in_c;
                -- both additions sit between two register layers
                sum   <= reg_a + reg_b + reg_c;
            end if;
        end process;
    end architecture;

    architecture balanced of add3 is
    begin
        process (clk)
        begin
            if (rising_edge(clk)) then
                -- first addition moved in front of the first register layer
                reg_ab_sum <= in_a + in_b;
                reg_c      <= in_c;
                sum        <= reg_ab_sum + reg_c;
            end if;
        end process;
    end architecture;



Area optimization



Area concepts


The resource requirement is the number of basic functional primitives required to implement the described functionality.
The basic functional primitives in standard cell ASICs are the standard cells, which can be simple logic gates and flip-flops, but also more complex arithmetic-logic functions or memories.
The basic logic elements (BLE) of FPGAs consist of a logic function (the number of inputs depends on the vendor and the device family), a flip-flop and a multiplexer. There are special-purpose resources as well, such as memory blocks, signal processing elements (multipliers) etc.



Area optimization techniques

Minimizing area – control-based logic reuse

Control-based logic reuse can be considered the opposite of loop unrolling. A pipeline requires internal data storage resources and additional logic to operate in parallel. These resources can instead be reused over several clock cycles at the cost of a reduced throughput.

[Figure: pipelined adder tree summing in1, in2, in3 and in4 with three adders, pipeline registers plr1 and plr2 and the accumulator register acc; 32-bit buses.]

[Figure: shared-adder version. A multiplexer controlled by sel_input selects one of in1, in2, in3, in4; a single adder feeds the accumulator register acc, enabled by ce_acc; an FSM generates the control signals; 32-bit buses.]

Control-based logic reuse requires an FSM to generate control signals.
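A minimal sketch of such a shared-adder design is given below. It is not taken from the slides: the entity name sum4_shared, the ports start and done, the state names and the use of unsigned from numeric_std are assumptions made for illustration. It also assumes that in1..in4 are held stable while the FSM is running.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names, ports and types are assumptions.
entity sum4_shared is
    port (
        clk, reset         : in  std_logic;
        start              : in  std_logic;
        in1, in2, in3, in4 : in  unsigned(31 downto 0);
        acc                : out unsigned(31 downto 0);
        done               : out std_logic
    );
end entity;

architecture reuse of sum4_shared is
    type state_t is (idle, add1, add2, add3, add4);
    signal state   : state_t;
    signal operand : unsigned(31 downto 0);
    signal acc_reg : unsigned(31 downto 0);
    signal ce_acc  : std_logic;
begin
    -- input multiplexer, selected by the FSM state
    with state select operand <=
        in1             when add1,
        in2             when add2,
        in3             when add3,
        in4             when add4,
        (others => '0') when others;

    -- the accumulator is enabled in every state except idle
    ce_acc <= '0' when state = idle else '1';

    process (clk)
    begin
        if rising_edge(clk) then
            done <= '0';
            if reset = '1' then
                state   <= idle;
                acc_reg <= (others => '0');
            else
                -- the single shared adder
                if ce_acc = '1' then
                    acc_reg <= acc_reg + operand;
                end if;

                -- control FSM: one operand per clock cycle
                case state is
                    when idle =>
                        if start = '1' then
                            acc_reg <= (others => '0');
                            state   <= add1;
                        end if;
                    when add1 => state <= add2;
                    when add2 => state <= add3;
                    when add3 => state <= add4;
                    when add4 =>
                        state <= idle;
                        done  <= '1';
                end case;
            end if;
        end if;
    end process;

    acc <= acc_reg;
end architecture;
```

In this sketch one adder replaces the three adders of the pipelined tree, but a new set of four inputs can be accepted only once every five clock cycles (one idle cycle plus four additions) instead of every cycle.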




Minimizing area – priority encoders

The resource requirement can be reduced if the mutual exclusion of the conditions is exploited. An if/elsif chain implies priority logic, so the elsif statement should be used only if a priority encoder is really required, i.e. the conditions are not mutually exclusive.

    architecture priority of logic is
    begin
        process (clk)
        begin
            if (rising_edge(clk)) then
                -- each branch also depends on all preceding conditions
                if (ctrl(0) = '1') then
                    output(0) <= input;
                elsif (ctrl(1) = '1') then
                    output(1) <= input;
                elsif (ctrl(2) = '1') then
                    output(2) <= input;
                elsif (ctrl(3) = '1') then
                    output(3) <= input;
                end if;
            end if;
        end process;
    end architecture;

    architecture not_priority of logic is
    begin
        process (clk)
        begin
            if (rising_edge(clk)) then
                -- independent conditions: each output depends on one ctrl bit
                if (ctrl(0) = '1') then
                    output(0) <= input;
                end if;
                if (ctrl(1) = '1') then
                    output(1) <= input;
                end if;
                if (ctrl(2) = '1') then
                    output(2) <= input;
                end if;
                if (ctrl(3) = '1') then
                    output(3) <= input;
                end if;
            end if;
        end process;
    end architecture;




[Figure: four output registers output_a..output_d, each loaded from input[31:0] through a 2:1 multiplexer. The select logic of each register depends on its own ctrl bit and on all higher-priority ctrl bits (ctrl[0]; ctrl[0..1]; ctrl[0..2]; ctrl[0..3]). Caption: without exploiting mutual exclusion.]

[Figure: the same four output registers, each multiplexer selected by a single ctrl bit (ctrl[0], ctrl[1], ctrl[2], ctrl[3]). Caption: with exploiting mutual exclusion.]




Minimizing area – considering technology primitives

With an appropriate HDL coding style a more efficient logic synthesis can be achieved. Synthesis tool vendors usually provide coding guidelines that improve the resource requirement or the timing parameters of the design. The proposed coding style takes the unique characteristics of the technology primitives into consideration.

utilizing block RAM modules in FPGAs: Block RAM modules typically provide no reset for the memory contents, and their outputs are synchronous to a clock signal. Only HDL models with these properties can be implemented in block RAMs.
utilizing dedicated DSP units: The DSP slices in FPGAs have synchronous outputs. This restriction has to be taken into account when writing the HDL model.
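The DSP restriction can be illustrated with a multiplier whose product is registered. This is a minimal sketch, not taken from the slides: the entity name mult_reg, the port names and the 18-bit operand width are assumptions, and whether a given synthesis tool actually maps the model to a DSP slice depends on the vendor, the device family and the tool settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: name, ports and operand widths are assumptions.
entity mult_reg is
    port (
        clk  : in  std_logic;
        a, b : in  signed(17 downto 0);
        p    : out signed(35 downto 0)
    );
end entity;

architecture rtl of mult_reg is
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- registered product: synchronous output, no asynchronous reset
            p <= a * b;
        end if;
    end process;
end architecture;
```

The product of two 18-bit signed operands is 36 bits wide in numeric_std, so the output port is declared with the full width and no truncation is needed.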




    architecture FFS of RAM is
    begin
        -- reset is asynchronous, so it belongs in the sensitivity list
        process (clk, reset)
        begin
            if (reset = '1') then
                content <= (others => (others => '0'));
            elsif (rising_edge(clk)) then
                if (write = '1') then
                    content(address) <= data_in;
                end if;
            end if;
        end process;

        -- asynchronous read
        data_out <= content(address);
    end architecture;

Because of the asynchronous output this model cannot be implemented in block RAM. The reset function hinders the LUT RAM implementation as well.

    architecture BRAM of RAM is
    begin
        process (clk)
        begin
            if (rising_edge(clk)) then
                if (write = '1') then
                    content(address) <= data_in;
                end if;
                -- synchronous read, no reset
                data_out <= content(address);
            end if;
        end process;
    end architecture;

This model can be implemented as flip-flops, as LUT RAM and as block RAM as well.



Additional readings

Steve Kilts – Advanced FPGA Design: Architecture, Implementation, and Optimization
David Money Harris, Sarah L. Harris – Digital Design and Computer Architecture
Peter J. Ashenden – Digital Design: An Embedded Systems Approach Using VHDL
M. Morris Mano, Charles R. Kime – Logic and Computer Design Fundamentals
Pong P. Chu – RTL Hardware Design Using VHDL
Peter Wilson – Design Recipes for FPGAs

