Asynchronous FIFO: Simulation using Modelsim
Note: Diagram numbers are continued from the previous post.
The test bench strategy is to exercise all corner conditions, such as full and empty. Simulation
waveforms are shown in Figure (11) to Figure (13). These waveforms were generated using the test
bench program provided in the previous article. A 50 MHz read clock and a 10 MHz write clock
are generated using initial procedural statements:
initial begin #10 r_clk=0; forever #10 r_clk=~r_clk; end
initial begin #5 w_clk=0; forever #50 w_clk=~w_clk; end
Both clocks have a 50% duty cycle: r_clk toggles every 10 ns (20 ns period) and w_clk toggles every 50 ns (100 ns period).
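The relation between the delay values and the clock frequencies can be checked with a small self-contained sketch. The module name clk_gen_sketch, the parameters HALF_R and HALF_W, and the variables r_t0 and w_t0 are illustrative assumptions and are not part of the original test bench; a 1 ns time unit is assumed.

```verilog
`timescale 1ns/1ps
// Illustrative sketch; every name other than r_clk and w_clk is an assumption.
module clk_gen_sketch;
  parameter HALF_R = 10; // r_clk half-period: 20 ns period, 50 MHz
  parameter HALF_W = 50; // w_clk half-period: 100 ns period, 10 MHz
  reg  r_clk, w_clk;
  time r_t0, w_t0;

  // Same clock generators as in the test bench, with named half-periods.
  initial begin #10 r_clk = 0; forever #HALF_R r_clk = ~r_clk; end
  initial begin #5  w_clk = 0; forever #HALF_W w_clk = ~w_clk; end

  // Measure one full period of each clock between two rising edges.
  initial begin
    @(posedge r_clk) r_t0 = $time;
    @(posedge r_clk) $display("r_clk period = %0d ns", $time - r_t0);
  end
  initial begin
    @(posedge w_clk) w_t0 = $time;
    @(posedge w_clk) $display("w_clk period = %0d ns", $time - w_t0);
  end

  initial #500 $finish;
endmodule
```

With these values the sketch reports periods of 20 ns and 100 ns, which correspond to 50 MHz and 10 MHz.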
Other stimulus signals, such as w_en and r_en, are generated using the statements below:
initial begin
  d_in=1;
  @(posedge w_en);
  repeat(20) @(posedge w_clk) d_in=d_in+2;
  repeat(20) @(posedge w_clk) d_in=d_in-1;
end
initial begin reset=1; #30 reset=0; end
initial begin fork #50 w_en=1; #1800 w_en=0; #2500 w_en=1; join end
initial begin fork #50 r_en=0; #1850 r_en=1; #2400 r_en=0; #2500 r_en=1; join end
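In the above two statements, ‘fork’ and ‘join’ are used so that every delay is measured from the
start of the block, i.e. from zero simulation time, rather than accumulating from one statement to
the next. The reset signal is active for the first 30 ns and is then deactivated. During the first
5 ns reset is active while neither clock has started (w_clk is first driven at 5 ns and r_clk at
10 ns), which tests the asynchronous reset condition.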
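The effect of ‘fork’ and ‘join’ on the delays can be seen by running the same three assignments once sequentially and once in parallel. This is only a sketch: the module name fork_join_sketch and the signals a and b are hypothetical and do not appear in the test bench.

```verilog
`timescale 1ns/1ps
// Illustrative sketch; a and b are hypothetical signals.
module fork_join_sketch;
  reg a, b;

  // Sequential begin...end: each delay starts when the previous statement
  // finishes, so a changes at 50 ns, 1850 ns and 4350 ns.
  initial begin
    #50 a = 1; #1800 a = 0; #2500 a = 1;
  end

  // fork...join: all statements start together at time zero, so the delays
  // act as absolute times and b changes at 50 ns, 1800 ns and 2500 ns.
  initial fork
    #50 b = 1; #1800 b = 0; #2500 b = 1;
  join

  initial $monitor("t=%0d a=%b b=%b", $time, a, b);
  initial #5000 $finish;
endmodule
```

Writing the enable stimulus inside fork...join therefore lets the test bench state each transition as a simulation time, which is easier to match against the waveform figures.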
Under the asynchronous reset condition, all variables except d_out are initialized to their
default states, including f_empty_flag. r_ptr, w_ptr and ptr_diff are initialized to zero, and all
FIFO status flags take their default values. Since the reset signal is connected to the binary
counters, resetting the counters resets both the read and write pointers, the pointer difference
and all status flags. This can be observed in the simulated waveform shown in Figure (11). The
Verilog code still has to be improved to provide a complete asynchronous reset that includes
d_out. When I tried to add an asynchronous reset to the RTL code of the dual port RAM, the
synthesizer did not infer a dual port RAM. Instead, it inferred a set of registers. This problem
has to be sorted out.
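One commonly used way around this kind of inference problem is to keep the memory array itself free of any reset and to reset only the output register. The following is a minimal sketch under that assumption; it is not the a_fifo5 code, the module name ram_reset_sketch and the ports w_addr and r_addr are hypothetical, and whether a particular synthesizer infers a dual port RAM from it has to be confirmed in its synthesis report.

```verilog
// Illustrative sketch; module name and address port names are assumptions.
module ram_reset_sketch #(
  parameter f_width     = 8,
  parameter f_depth     = 16,
  parameter f_ptr_width = 4
)(
  input  wire                   w_clk, r_clk, reset,
  input  wire                   w_en, r_en,
  input  wire [f_ptr_width-1:0] w_addr, r_addr,
  input  wire [f_width-1:0]     d_in,
  output reg  [f_width-1:0]     d_out
);
  reg [f_width-1:0] f_memory [f_depth-1:0];

  // Write port: no reset term touches the array.
  always @(posedge w_clk)
    if (w_en) f_memory[w_addr] <= d_in;

  // Read port: only the output register is reset asynchronously.
  always @(posedge r_clk or posedge reset)
    if (reset)     d_out <= {f_width{1'b0}};
    else if (r_en) d_out <= f_memory[r_addr];
endmodule
```

The memory contents are left undefined after reset in this sketch, which is harmless as long as the pointers and flags are reset so that no stale location is ever read.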
Thus the overall performance of the designed FIFO resembles that of the FIFO IP core provided
by Xilinx, although the algorithm and methodology used are entirely different in the two
designs. The IP core uses an acknowledgement signal to confirm read and write operations, but
the proposed design has no such mechanism. It is assumed that the data sending and receiving
hardware takes care of the data once the FIFO full and empty conditions are asserted.
Figure (13) Simulation waveform 3
Asynchronous FIFO: Synthesis using Xilinx ISE and Spartan 3
Synthesizing the design with two different optimization goals changes both the logic cell usage and the maximum operating frequency of the design. With ‘speed’ as the optimization goal, the maximum achievable frequency is 113.830 MHz.
----------------------------------------------
Timing Summary:
Speed Grade: -5
Minimum period: 8.785ns (Maximum Frequency: 113.830MHz)
Minimum input arrival time before clock: 4.692ns
Maximum output required time after clock: 12.049ns
Maximum combinational path delay: No path found
----------------------------------------------
With ‘area’ as the optimization goal, the maximum achievable frequency is 90.212 MHz.
----------------------------------------------
Timing Summary:
Speed Grade: -5
Minimum period: 11.085ns (Maximum Frequency: 90.212MHz)
Minimum input arrival time before clock: 4.574ns
Maximum output required time after clock: 13.375ns
Maximum combinational path delay: No path found
----------------------------------------------
The difference in operating frequency can be attributed to the delay in the adder-subtractor
circuit. Dual port distributed RAM is used for the memory. The output data d_out is
registered (the RTL schematic is shown in Figure (14)), which is one of the advantages of this
design.
Figure (14) Registered output
The part of the synthesis report generated by Xilinx ISE shown below lists the inferred
hardware, which remains the same for both optimization goals.
----------------------------------------------------------------------
Synthesizing Unit .
Related source file is a_fifo5.v.
Found 16x8-bit dual-port distributed RAM for signal .
-----------------------------------------------------------------------
| aspect ratio  | 16-word x 8-bit            |      |
| clock         | connected to signal        | rise |
| write enable  | connected to internal node | high |
| address       | connected to signal        |      |
| dual address  | connected to signal        |      |
| data in       | connected to signal        |      |
| data out      | not connected              |      |
| dual data out | connected to internal node |      |
| ram_style     | Auto                       |      |
-----------------------------------------------------------------------
INFO:Xst:1442 - HDL ADVISOR - The RAM contents appears to be read asynchronously. A synchronous read would allow you to take advantage of available block RAM resources, for optimized device usage and improved timings. Please refer to your documentation for coding guidelines.
Found 8-bit register for signal .
Found 4-bit addsub for signal <$n0003>.
Found 4-bit comparator greater for signal <$n0007> created at line 60.
Found 4-bit comparator less for signal <$n0008> created at line 62.
Found 4-bit adder for signal <$n0009> created at line 64.
Found 4 1-bit 2-to-1 multiplexers.
Summary:
inferred 1 RAM(s).
inferred 8 D-type flip-flop(s).
inferred 2 Adder/Subtracter(s).
inferred 2 Comparator(s).
inferred 4 Multiplexer(s).
Unit synthesized.
----------------------------------------------------------------------
The low level synthesis report gives the device utilization summary. Device
utilization with ‘speed’ as the optimization goal is as follows:
-----------------------------------------
Device utilization summary:
-----------------------------------------
Selected Device: 3s200ft256-5
 Number of Slices:            36 out of 1920    1%
 Number of Slice Flip Flops:  20 out of 3840    0%
 Number of 4 input LUTs:      50 out of 3840    1%
 Number of bonded IOBs:       24 out of 173    13%
 Number of GCLKs:              2 out of 8      25%
-----------------------------------------
Device utilization with ‘area’ as the optimization goal is as follows:
-----------------------------------------
Device utilization summary:
-----------------------------------------
Selected Device: 3s200ft256-5
 Number of Slices:            34 out of 1920    1%
 Number of Slice Flip Flops:  16 out of 3840    0%
 Number of 4 input LUTs:      47 out of 3840    1%
 Number of bonded IOBs:       24 out of 173    13%
 Number of GCLKs:              2 out of 8      25%
-----------------------------------------
With ‘area’ as the goal, slices are reduced by 2 (36 to 34), slice flip-flops by 4 (20 to 16) and 4 input LUTs by 3 (50 to 47). In the present design the read clock runs at 50 MHz, so operating speed matters and the optimization goal is kept as ‘speed’. Other than package pin constraints, no area or timing constraints are applied to the design during synthesis. Since the design uses only a small fraction of the FPGA resources, area is not an important factor. Package pin constraints are given so that the design can be implemented on the Spartan 3 development board. This is done for the fifo_top.v code, which includes the clock generation code for the design.
Asynchronous FIFO Design
Asynchronous FIFOs are used as buffers between two asynchronous clock
domains to exchange data safely. Data is written into the FIFO from one clock domain
and read from another clock domain. This requires a memory architecture in which
two memory ports are available: one for the input (write or push) operation and
another for the output (read or pop) operation. Generally FIFOs are used where the write
operation is faster than the read operation. However, even with the different speeds and
access types, the average rate of data transfer remains constant. FIFO pointers keep track
of the FIFO memory locations read and written, and the corresponding control logic
prevents the FIFO from either underflowing or overflowing. FIFO architectures
inherently face the challenge of synchronizing each pointer with the pointer logic of the
other clock domain so that reads and writes of FIFO memory locations are controlled
safely. A detailed and careful analysis of the synchronizer circuit along with the pointer
logic is required to understand how the two FIFO pointer logic circuits, which access the
FIFO read and write ports independently under different clocks, are synchronized.
Why Synchronization?
It is very important to understand signal stability in multi clock domain designs, since
to a signal crossing over, the new clock domain appears asynchronous. If the signal is
not synchronized to the new clock, the first storage element of the new clock domain may
enter a metastable state, and in the worst case the resolution time cannot be predicted.
The metastable value can propagate through the new clock domain, resulting in functional
failure. To prevent such failures, the setup time and hold time specifications have to be
obeyed in the design. Manufacturers provide statistics on the probability of flip-flop
failure due to metastability in terms of MTBF (Mean Time Between Failures). Synchronizers
are used to prevent the downstream logic from entering a metastable state in multi clock
domain designs with multibit data values.
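A widely used synchronizer is a chain of two flip-flops clocked by the receiving domain. The sketch below is an assumption for illustration only: the module name sync_2ff_sketch and its ports are hypothetical, and the text above does not state which synchronizer a_fifo5 uses.

```verilog
// Illustrative two flip-flop synchronizer; all names are assumptions.
module sync_2ff_sketch #(
  parameter WIDTH = 1
)(
  input  wire             clk,     // clock of the receiving domain
  input  wire             reset,   // asynchronous, active high
  input  wire [WIDTH-1:0] d_async, // signal arriving from the other domain
  output reg  [WIDTH-1:0] q_sync   // version that is safe to use in this domain
);
  reg [WIDTH-1:0] meta; // first stage, the one that may go metastable

  always @(posedge clk or posedge reset)
    if (reset) begin
      meta   <= {WIDTH{1'b0}};
      q_sync <= {WIDTH{1'b0}};
    end else begin
      meta   <= d_async; // may violate setup or hold time
      q_sync <= meta;    // gets one full clock period to resolve
    end
endmodule
```

The second stage gives the first flip-flop one clock period to settle before downstream logic sees the value. For a multibit value such as a pointer, this is only safe when at most one bit changes per source clock, which is why pointers are usually Gray coded before they cross domains.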
Issues in Designing Asynchronous FIFO
It has been mentioned that designing the FIFO pointers for efficient operation is
the key issue in designing a FIFO architecture. Let us look closely at the FIFO
read and write pointers. On reset, both read and write pointers point to the starting
location of the FIFO. This location is the first location where data has to be written,
and it is also the first read location. Therefore, in general we can say that the read
pointer always points to the word to be read and the write pointer always points
to the next location to which data has to be written.
Now let us examine the data write operation. When both read and write pointers
point to the first location of the FIFO, the empty flag is asserted, indicating that the
FIFO is empty. Data can now be written. Data is written to the location to which the
write pointer points, and after the write operation the write pointer is incremented
to point to the next location to be written. At the same time, the empty flag is deasserted,
which indicates that the FIFO is not empty and some data is available. One notable point
regarding the read pointer is that, while the empty flag is active, the data it points to is
always invalid. When the first data word is written and the empty flag is cleared (i.e. the
empty flag is inactive), the read pointer logic immediately drives the data from the location
to which it was pointing onto the read port of the dual port RAM, ready to be read by the
read logic. The biggest advantage of this implementation of the read logic is that only one
clock pulse is required to read from the read port, since the previous clock cycle has already
incremented the read pointer and driven the data to the read port. This helps reduce
latency in detecting the empty and full flag status. The empty flag can be asserted in one
more condition. If n data write operations are followed by the same number n of read
operations, both pointers are equal again. Hence, if the pointers “catch up” with each
other, the empty flag is asserted.
Now let us examine the FIFO full status. When the write pointer reaches the top of
the FIFO, it points to the last location that can be written. No read operation has been
performed yet, and the read pointer still points to the first location. One method is to
generate the FIFO full condition at this point. However, if the full flag is asserted when
the write pointer reaches the top of the FIFO, it is not the actual FIFO full
condition; it is only ‘almost full’, as there is still one location that can be written.
An almost empty condition can exist in the FIFO in a similar way. Now a write operation
writes that location and increments the write pointer. Since the location was the last one,
the write pointer wraps around to the first location. Both read and write pointers are now
equal, and hence the empty flag is asserted instead of the full flag, which is a fatal
mistake. Hence a wrap around of the write pointer may indicate a FIFO full condition.
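The ambiguity described above can be reproduced with the flag equations of a_fifo5. In the sketch below the flag expressions and parameter names follow the a_fifo5 listing, but the way ptr_diff is computed is an assumption (a modulo-16 subtraction of the pointers), as is the module name ptr_diff_sketch.

```verilog
// Illustrative sketch; the ptr_diff computation is an assumption.
module ptr_diff_sketch;
  parameter f_depth = 16, f_ptr_width = 4;
  reg  [f_ptr_width-1:0] w_ptr, r_ptr;

  // Assumed occupancy: pointer difference, wrapping modulo 16.
  wire [f_ptr_width-1:0] ptr_diff = w_ptr - r_ptr;
  wire f_empty_flag = (ptr_diff == 0);
  wire f_full_flag  = (ptr_diff == (f_depth-1));

  initial begin
    w_ptr = 0; r_ptr = 0;          // after reset
    #1 $display("diff=%0d empty=%b full=%b", ptr_diff, f_empty_flag, f_full_flag);
    w_ptr = 15;                    // 15 writes, no reads
    #1 $display("diff=%0d empty=%b full=%b", ptr_diff, f_empty_flag, f_full_flag);
    w_ptr = 0;                     // a 16th write wraps the write pointer
    #1 $display("diff=%0d empty=%b full=%b", ptr_diff, f_empty_flag, f_full_flag);
  end
endmodule
```

The three lines printed are diff=0 empty=1 full=0, then diff=15 empty=0 full=1, then diff=0 empty=1 full=0. The last case shows the fatal mistake: a FIFO holding 16 words looks empty. Asserting full at f_depth-1, as the a_fifo5 flag assignment does, avoids it at the cost of leaving one location unused.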
Suppose that, after data has been written to the FIFO (the write pointer is at the top of the
FIFO), some data has been read and the read pointer is somewhere in the middle of the FIFO.
One more write operation causes the write pointer to wrap. Note that even though the write
pointer now points to the first location of the FIFO, this is NOT a FIFO full condition, since
the read pointer has moved up from the first location. Further data writing pushes the write
pointer up. Imagine that the read pointer also wraps around after some more read operations.
The present condition is that both pointers have wrapped around but there is neither a FIFO
full nor a FIFO empty condition. Data can be written to the FIFO or read from it. Given this
situation, how are the full and empty conditions identified and generated? How are the two
pointers synchronized and compared to generate the full and empty status? While
synchronizing, how are a possible metastable state and ‘pessimistic reporting’ (i.e. a
harmless wrong report; will be discussed later) avoided?
These are some of the key issues in designing an asynchronous FIFO.
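One commonly used answer to the first question is to give each pointer one extra most significant bit that toggles on every wrap around. This is a sketch of that general technique and not the method of a_fifo5, which compares ptr_diff against thresholds; the module name ptr_compare_sketch and the parameter ADDR_W are hypothetical, and both pointers are compared directly here, ignoring the synchronization between clock domains.

```verilog
// Illustrative sketch of the extra wrap bit technique; names are assumptions.
module ptr_compare_sketch;
  parameter ADDR_W = 4;                 // 16 locations
  reg [ADDR_W:0] w_ptr, r_ptr;          // MSB records wrap around

  // Empty: same address and same wrap count parity.
  wire empty = (w_ptr == r_ptr);
  // Full: same address, but the write pointer has wrapped once more.
  wire full  = (w_ptr[ADDR_W] != r_ptr[ADDR_W]) &&
               (w_ptr[ADDR_W-1:0] == r_ptr[ADDR_W-1:0]);

  initial begin
    w_ptr = 0;  r_ptr = 0;              // after reset
    #1 $display("empty=%b full=%b", empty, full);
    w_ptr = 16;                         // 16 writes, no reads
    #1 $display("empty=%b full=%b", empty, full);
    w_ptr = 18; r_ptr = 5;              // write pointer wrapped, 13 words stored
    #1 $display("empty=%b full=%b", empty, full);
    r_ptr = 18;                         // read pointer catches up
    #1 $display("empty=%b full=%b", empty, full);
  end
endmodule
```

The four cases print empty=1 full=0, empty=0 full=1, empty=0 full=0 and empty=1 full=0, so equal addresses are no longer ambiguous and all 16 locations can be used.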
Verilog code for asynchronous FIFO
Verilog code for asynchronous FIFO is given below. The module “a_fifo5” should be used for Modelsim (or any other HDL simulator) simulation. The module “fifo_top” is used to synthesize the design in Spartan 3 board. Before invoking this module in ISE you should add Digital Clock Manager (DCM) code to your project. This can be done using Xilinx ISE wizard.
//==================================================
//a_fifo5.v; verilog code for asynchronous FIFO
//This module describes FIFO
//===================================================
module a_fifo5(d_out,f_full_flag,f_half_full_flag,f_empty_flag,f_almost_full_flag,f_almost_empty_flag,d_in,r_en,w_en,r_clk,w_clk,reset);
parameter f_width=8; //FIFO width
parameter f_depth=16; //FIFO depth
parameter f_ptr_width=4; //because depth =16;
parameter f_half_full_value=8;
parameter f_almost_full_value=14;
parameter f_almost_empty_value=2;
output [f_width-1:0] d_out; reg [f_width-1:0] d_out; //outputs
output f_full_flag,f_half_full_flag,f_almost_full_flag,f_empty_flag,f_almost_empty_flag;
input [f_width-1:0] d_in;
input r_en,w_en,r_clk,w_clk;
input reset;
//internal registers,wires
wire [f_ptr_width-1:0] r_ptr,w_ptr;
reg r_next_en,w_next_en;
reg [f_ptr_width-1:0] ptr_diff;
reg [f_width-1:0] f_memory[f_depth-1:0];
assign f_full_flag=(ptr_diff==(f_depth-1)); //assign FIFO status
assign f_empty_flag=(ptr_diff==0);
assign f_half_full_flag=(ptr_diff==f_half_full_value);
assign f_almost_full_flag=(ptr_diff==f_almost_full_value);
assign f_almost_empty_flag=(ptr_diff==f_almost_empty_value);
//===========================================================
//fifo_top.v; top level verilog code of FIFO
//To be used with Xilinx ISE-simulation and synthesis
//For functional simulation this module is not necessary
//============================================================
module fifo_top(x,y,z,d_out,f_full_flag,f_half_full_flag,f_empty_flag,f_almost_full_flag,f_almost_empty_flag,d_in,r_en,w_en,CLKIN_IN,RST_IN,reset);
parameter f_width=8;
parameter f_depth=16;
parameter f_ptr_width=4;
parameter f_half_full_value=8;
parameter f_almost_full_value=14;
parameter f_almost_empty_value=2;
output [f_width-1:0] d_out; //reg [f_width-1:0] d_out; //outputs
output f_full_flag,f_half_full_flag,f_almost_full_flag,f_empty_flag,f_almost_empty_flag;
output x,y,z;
input [f_width-1:0] d_in;
input r_en,w_en,CLKIN_IN,RST_IN;
input reset;
a_fifo5 a_fifo55(d_out,f_full_flag,f_half_full_flag,f_empty_flag,f_almost_full_flag,f_almost_empty_flag,d_in,r_en,w_en,CLK0_OUT,CLKDV_OUT,reset); //instantiate fifo
dcm_fifo dcm_fifo1(CLKIN_IN,RST_IN,CLKDV_OUT,CLKFX_OUT,CLKIN_IBUFG_OUT,CLK0_OUT, LOCKED_OUT); //instantiate DCM
assign x=CLKIN_IBUFG_OUT; //simply to avoid error
assign y=LOCKED_OUT;
assign z=CLKFX_OUT;
endmodule
Verilog Test Bench for Asynchronous FIFO
Here is the verilog test bench for the asynchronous FIFO code already published. Simulation results of the asynchronous FIFO will be discussed in coming articles.
//===========================================================
//ta_fifo5.v; test bench for the module a_fifo5
//===========================================================
module ta_fifo5();
initial #5000 $stop;
initial begin #10 r_clk=0; forever #10 r_clk=~r_clk; end
initial begin #5 w_clk=0; forever #50 w_clk=~w_clk; end
//initial begin #10 r_clk=0; forever #50 r_clk=~r_clk; end //test for synchronous operation
//initial begin #10 w_clk=0; forever #50 w_clk=~w_clk; end //test for synchronous operation