Design, Development, and Simulation/Experimental Validation of a
Crossbar Interconnection Network for a Single-Chip Shared Memory
Multiprocessor Architecture
Master’s Project Report
June, 2002
Venugopal Duvvuri
Department of Electrical and Computer Engineering
University Of Kentucky
Under the Guidance of
Dr. J. Robert Heath
Associate Professor
Department of Electrical and Computer Engineering
University of Kentucky
Table of Contents
ABSTRACT
Chapter 1: Introduction, Background, and Positioning of Research
Chapter 2: Types of Interconnect Systems
Chapter 3: Multistage Interconnection Systems Complexity
Chapter 4: Design of the Crossbar Interconnect Network
Chapter 5: VHDL Design Capture, Simulation, Synthesis and Implementation Flow
Chapter 6: Design Validation via Post-Implementation Simulation Testing
Chapter 7: Experimental Prototype Development, Testing, and Validation
the location address within the memory block and ‘rw_mem’ for those memory blocks
become ‘1’ indicating ‘write’ operation.
Scenario 6:
Input stimulus:
data_in_prc <= x"CCCC" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"6666" ;
qdep <= x"5555" ;
ctrl <= x"E" ;
rw <= x"0" ;
In this case, processor ‘1’ (read), processor ‘2’ (read) and processor ‘3’ (read) request access to the same memory location ‘2’ in memory block ‘1’. Since all three requesting processors have the same queue depth (‘qdep’ = ‘5555’), processor ‘3’ gets priority because it has the greatest processor ID. Hence processor ‘3’ reads 'D' from the hex digit on the ‘data_in_mem’ bus corresponding to memory block '1', and this value is observed on the ‘data_out_prc’ bus on simulation tracer ‘5’, shown in Figure 6.8, for ‘addr_prc’ = ‘6666’. ‘flag’ is ‘1’ for processor ‘3’ only, ‘addr_mem’ corresponding to memory block ‘1’ becomes ‘0’, indicating the location address within the memory block, and ‘rw_mem’ for that memory block becomes ‘0’, indicating a ‘read’ operation.
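The tie-break described for this scenario can be sketched in isolation. The following self-checking testbench is a minimal illustration, not the code of the appendices: the names ‘arb_sketch_tb’ and ‘pick_winner’ are hypothetical, it handles a single memory block whose requesters have already been decoded into ‘req’, and it uses IEEE.numeric_std rather than the packages used in the report.

```vhdl
-- Illustrative sketch only; 'arb_sketch_tb' and 'pick_winner' are
-- hypothetical names that do not appear in the report's code.
library IEEE ;
use IEEE.std_logic_1164.all ;
use IEEE.numeric_std.all ;

entity arb_sketch_tb is
end entity arb_sketch_tb ;

architecture sim of arb_sketch_tb is

  -- Returns a one-hot grant for the processors in 'req' that target the
  -- same memory block. The largest queue depth wins; on a tie the highest
  -- processor ID wins, because '>=' lets a later (higher) index replace
  -- an earlier one of equal depth.
  function pick_winner (req  : std_logic_vector(3 downto 0) ;
                        qdep : std_logic_vector(15 downto 0))
    return std_logic_vector is
    variable best   : unsigned(3 downto 0) := (others => '0') ;
    variable winner : integer range 0 to 3 := 0 ;
    variable found  : boolean := false ;
    variable grant  : std_logic_vector(3 downto 0) := (others => '0') ;
  begin
    for k in 0 to 3 loop
      if req(k) = '1' and unsigned(qdep(4*k + 3 downto 4*k)) >= best then
        best   := unsigned(qdep(4*k + 3 downto 4*k)) ;
        winner := k ;
        found  := true ;
      end if ;
    end loop ;
    if found then
      grant(winner) := '1' ;
    end if ;
    return grant ;
  end function pick_winner ;

begin

  check : process is
  begin
    -- Scenario 6: processors 1, 2 and 3 request the same block with
    -- equal queue depths (qdep = x"5555"); processor 3 should win.
    assert pick_winner("1110", x"5555") = "1000"
      report "scenario 6 tie-break failed" severity failure ;
    -- Unequal depths: processor 0 (depth 6) beats the others (depth 5).
    assert pick_winner("1111", x"5556") = "0001"
      report "queue-depth priority failed" severity failure ;
    -- No requesters: nobody is granted.
    assert pick_winner("0000", x"0011") = "0000"
      report "idle case failed" severity failure ;
    wait ;
  end process check ;

end architecture sim ;
```

Running the testbench in any VHDL-93 simulator should complete without assertion failures if the sketch matches the priority rule described above.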
Scenario 7:
Input stimulus:
data_in_prc <= x"FFFF" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"EEAE" ;
qdep <= x"0011" ;
ctrl <= x"0" ;
rw <= x"0" ;
In this case, all processors are in the idle state and no transactions are performed through the interconnect. Hence no changes are observed in any of the memory locations or on ‘data_bus’, as can be seen on simulation tracer ‘5’, shown in Figure 6.8, for ‘addr_prc’ = ‘EEAE’.
The simulation tracers ‘4’ and ‘5’ are shown in Figures 6.7 and 6.8.
Figure 6.7: Simulation tracer 4
Figure 6.8: Simulation tracer 5
The post-implementation simulation tracers shown in Figures 6.4, 6.5 and 6.6 show that the interconnect network module ‘main’, described in Appendix A, performed correctly from a functional standpoint for all scenarios. The post-implementation simulation tracers shown in Figures 6.7 and 6.8 show that the interconnect network module ‘main_ic’, described in Appendix B, performed correctly from a functional standpoint for all scenarios that were tested successfully on the interconnect module ‘main’.
Chapter 7
Experimental Prototype Development, Testing and Validation Results
The VHDL-coded interconnect network system and interfaced memory modules described in the code of Appendix A were synthesized, implemented, and programmed into a Xilinx Spartan XL FPGA chip using the Xilinx Foundation 3.1i CAD tool set [7]. The FPGA chip used as the target for downloading the bit stream generated from the VHDL code was the previously mentioned Xilinx XCS10PC84 from the Spartan XL family. Since the module ‘main’ has a large number of I/O pins, a testcase ‘top’ was developed and described in VHDL. It generates stimulus for the inputs of ‘main’ under various scenarios and routes a selected set of outputs of ‘main’ to the LEDs of the prototype board for observation and evaluation. Figure 7.1 shows the block diagram of ‘top’.
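Figure 7.1: Top Level Block Diagram of Testcase ‘top’.
[Block diagram: primary inputs ‘clk’, ‘rst’, ‘scnr’ (3 bits) and ‘pid’ (3 bits); blocks ‘Stimulus generator’, ‘main’ and ‘display’, connected by the signals ‘Input’ and ‘Output’; primary outputs ‘addr’ (4 bits) and ‘data’ (4 bits).]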
The function of ‘top’ is to generate stimulus for the input pins of module ‘main’. The 3-bit primary input ‘scnr’ (scenario) to the testcase ‘top’ determines which set of inputs is given to the module ‘main’. The signal ‘Input’ shown in Figure 7.1 is the stimulus given to all the input ports of module ‘main’. The signal ‘Output’ is the output of module ‘main’. Both ‘Input’ and ‘Output’ are inputs to the block ‘display’. There are 8 different scenarios that can be tested experimentally.
There are 8 LEDs on the proto-board, which are used to display and verify the inputs and outputs of the module under test, ‘main’. Another function of ‘top’ is to display the required set of inputs or outputs depending upon the value of the 3-bit primary input ‘pid’ to the testcase ‘top’. The 4-bit ‘addr’ and 4-bit ‘data’ are the primary outputs of the testcase ‘top’, and ‘pid’ determines the set of signals that are displayed on the 8 LEDs. For example, a ‘000’ value of ‘pid’ displays the 4-bit input queue depth ‘qdep’ of processor ‘0’ on the 4-bit ‘data’, ‘flag’ of processor ‘0’ on ‘addr’ bits ‘3’ and ‘2’, ‘rw’ of processor ‘0’ on ‘addr’ bit ‘1’ and ‘ctrl’ of processor ‘0’ on ‘addr’ bit ‘0’. The ‘001’ value of ‘pid’ displays the 4-bit ‘addr_bus’ of processor ‘0’ on ‘addr’ and the 4-bit data of processor ‘0’ on the ‘data’ pins (‘data_in’ for a write, ‘data_out’ for a read). In a similar fashion, the ‘010’ and ‘011’ values of ‘pid’ display the corresponding signals of processor ‘1’, the ‘100’ and ‘101’ values display the corresponding signals of processor ‘2’, and the ‘110’ and ‘111’ values display the corresponding signals of processor ‘3’. A 50 MHz clock, which is generated internally on the FPGA chip, is another primary input to the testcase ‘top’ and hence to module ‘main’.
During the implementation stage, the inputs and outputs are assigned to the I/O pins of the FPGA chip. This information is entered in the ‘top.ucf’ file. The scenario select bits scnr(0), scnr(1) and scnr(2) are assigned to pins ‘p25’, ‘p26’ and ‘p27’, which are switches 4, 3 and 2 on the proto-board. The display select bits pid(0), pid(1) and pid(2) are assigned to pins ‘p19’, ‘p20’ and ‘p23’, which are switches 8, 7 and 6 on the proto-board. The 4-bit ‘addr’ is assigned to pins ‘p66’, ‘p67’, ‘p68’ and ‘p69’, which are LEDs 4, 3, 2 and 1 on the proto-board, and the 4-bit ‘data’ is assigned to pins ‘p60’, ‘p61’, ‘p62’ and ‘p65’, which are LEDs 8, 7, 6 and 5 on the proto-board. The reset signal ‘rst’ is assigned to ‘p28’, which is switch 1 on the proto-board. By changing switches 4, 3 and 2 on the proto-board, different sets of inputs are given to the module ‘main’, and the corresponding interconnection network behavior and outputs of module ‘main’ can be seen on the 8 LEDs for any processor by selection through switches 8, 7 and 6 on the proto-board.
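The pin assignments listed above could be entered in the constraints file roughly as follows. This is a sketch only: the mapping of the individual ‘addr’ and ‘data’ bits to pins is assumed to follow the listed order (bit 0 first), as it does for ‘scnr’ and ‘pid’; the bus-bit naming (‘<0>’) depends on the synthesis tool; and the clock pin is omitted because the report does not state it.

```ucf
# Hypothetical top.ucf fragment; the bit order of addr/data is an assumption
NET "rst"     LOC = "P28" ;   # switch 1
NET "scnr<0>" LOC = "P25" ;   # switch 4
NET "scnr<1>" LOC = "P26" ;   # switch 3
NET "scnr<2>" LOC = "P27" ;   # switch 2
NET "pid<0>"  LOC = "P19" ;   # switch 8
NET "pid<1>"  LOC = "P20" ;   # switch 7
NET "pid<2>"  LOC = "P23" ;   # switch 6
NET "addr<0>" LOC = "P66" ;   # LED 4
NET "addr<1>" LOC = "P67" ;   # LED 3
NET "addr<2>" LOC = "P68" ;   # LED 2
NET "addr<3>" LOC = "P69" ;   # LED 1
NET "data<0>" LOC = "P60" ;   # LED 8
NET "data<1>" LOC = "P61" ;   # LED 7
NET "data<2>" LOC = "P62" ;   # LED 6
NET "data<3>" LOC = "P65" ;   # LED 5
```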
All the scenarios that were successfully tested during post-implementation simulations were also tested on the proto-board. The scenarios are described in Chapter 6 (Pages 43 – 48) and in Appendix ‘A’ (Pages 68, 69). The simulation tracers for these scenarios are shown in Figures 6.4, 6.5 and 6.6. The post-implementation simulation results and the experiments performed on the proto-board were consistent, showing that the interconnect network performed correctly from a functional standpoint for all scenarios.
Interconnect Network Performance:
The design was tested on the proto-board at 50 MHz. However, setup time and hold time violations are observed on the post-synthesis simulation tracer at a 50 MHz clock frequency in some cases where the inputs change frequently. These violations are not observed on the simulation tracer at a 20 MHz clock frequency. The total delay between any processor and any memory block, which includes connection establishment and network latency, is about 10 ns. It is about one-half the clock period when operated at 20 MHz. Hence a processor can access data from two different memory blocks on two successive falling edges of ‘clk’. A similar scenario was tested in both the post-implementation simulations and the experimental testing on the proto-board. This can be observed on simulation tracer 1, shown in Figure 6.4, where processor ‘1’ reads ‘3’ from memory location ‘3’ in memory block ‘3’ (shown for ‘scnr’ = ‘1’) and reads ‘0’ from memory location ‘2’ in memory block ‘1’ (shown for ‘scnr’ = ‘2’). On the simulation tracer, the scenario does not change (‘scnr’ from ‘1’ to ‘2’) on successive clocks, but it is possible for similar data transfers between any processor and any memory block to happen on two successive clocks.
At this frequency, the maximum bandwidth possible for data transfers through the interconnect is 40 MB/s. This is the case when all the processors access different memory blocks without conflict at any memory block. On every falling edge of ‘clk’, 4 bits of data can be transferred between any processor and any memory block. Hence in the best case, when all processors access different memory blocks, 16 bits (2 bytes) of data are transferred through the interconnect network on every falling edge of ‘clk’, which at 20 MHz amounts to 40 MB/s. The connection establishment between any processor and any memory block does not take more than one-half of a clock period.
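As a rough check on the peak figure, the quantities quoted above can be multiplied out. The symbols $N_p$ (number of processors), $w$ (bits transferred per processor per falling edge) and $f_{clk}$ (clock frequency) are introduced here for illustration only and do not appear in the report.

```latex
\[
B_{\max} = N_p \cdot w \cdot f_{clk}
         = 4 \times 4\,\text{bits} \times 20 \times 10^{6}\,\text{s}^{-1}
         = 320\ \text{Mbit/s}
         = 40\ \text{MB/s}
\]
```

This assumes one conflict-free transfer per processor on every falling edge of ‘clk’ at 20 MHz, which is the best case described in the text.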
Shown below is the resource utilization of testcase ‘top’ implemented in a Spartan XL FPGA (XCS10PC84) chip. It lists the number of Configurable Logic Blocks (CLBs), flip flops, latches and Look Up Tables (LUTs), and the equivalent gate count, used in implementing the testcase. This data is generated during the post-implementation process.
Resource Utilization Summary of Spartan XL FPGA (XCS10PC84) Chip for Testcase ‘top’ of Figure 7.1 Programmed to Chip:
Number of CLBs: 196 out of 196 (100%)
CLB Flip Flops: 96
CLB Latches: 0
4 input LUTs: 354
3 input LUTs: 130
Total equivalent gate count for design: 3273
Chapter 8
Conclusions
A modular and scalable architecture and design for a crossbar interconnect network of a HDCA single-chip multiprocessor system was first presented. Its development, simulation validation, and experimental hardware prototype validation were also successfully accomplished in this project. The design capture, simulation (pre-synthesis, post-synthesis and post-implementation) and implementation were done in VHDL using Xilinx Foundation PC-based CAD software. The design was implemented as a prototype in a PROM-based Xilinx Spartan XL FPGA chip. The FPGA chip used as the target for downloading the bit stream generated from the VHDL code was the XCS10PC84 from the Xilinx Spartan XL family. The pre-synthesis, post-synthesis and post-implementation VHDL functional simulation results obtained from the designed interconnect network matched the experimental results obtained from the FPGA hardware prototype for all test scenarios, and all original design specifications were met.
References
1. George Broomell and J. Robert Heath, “Classification Categories and Historical Development of Circuit Switching Topologies”, ACM Computing Surveys, Vol. 15, No. 2, pp. 95-133, June 1983.
2. J. Robert Heath, Paul Maxwell, Andrew Tan, and Chameera Fernando, “Modeling, Design, and Experimental Verification of Major Functional Units of a Scalable Run-Time Reconfigurable and Dynamic Hybrid Data/Command Driven Single-Chip Multiprocessor Architecture and Development and Testing of a First-Phase Prototype”, Private Communication, 2002.
3. M. L. Bos, “Design of a Chip Set for a Parallel Computer based on the Crossbar Interconnection Principle”, Proceedings Circuits and Systems, 1995, Proceedings of the 38th Midwest Symposium, Vol. 2, pp. 752-756, 1996.
4. Enrique Coen-Alfaro and Gregory W. Donohoe, “A Comparison of Circuits for On-
Appendix A
Interconnect Network and Memory VHDL Code (Version 1)
The VHDL Code for the Crossbar Interconnect Network of the HDCA System and Its Shared Memory Organization as Structured in Figures 7.1 and 6.1.
-- This VHDL code describes the Crossbar interconnect network
-- and shared memory organization with
-- only one interface, on processor side.
-- Equivalent Block diagram for module 'top' is shown in Figure 7.1
-- Equivalent Block diagram for module 'main' is shown in Figure 6.1
library IEEE ;
use IEEE.std_logic_1164.all ;
use IEEE.std_logic_arith.all ;
use IEEE.std_logic_unsigned.all ;

entity top is
  port (clk:  in std_logic ;
        rst:  in std_logic ;
        scnr: in std_logic_vector(2 downto 0) ;
        pid:  in std_logic_vector(2 downto 0) ;
        data: out std_logic_vector(3 downto 0) ;
        addr: out std_logic_vector(3 downto 0)
       ) ;
end top ;

architecture test_top of top is

  component main is
    port (ctrl:     in std_logic_vector(3 downto 0) ;
          qdep:     in std_logic_vector(15 downto 0) ;
          addr_bus: in std_logic_vector(15 downto 0) ;
          data_in:  in std_logic_vector(15 downto 0) ;
          rw:       in std_logic_vector(3 downto 0) ;
          clk:      in std_logic ;
          rst:      in std_logic ;
          flag:     inout std_logic_vector(3 downto 0) ;
          data_out: out std_logic_vector(15 downto 0)
         ) ;
  end component main ;

  signal ctrl:     std_logic_vector(3 downto 0) ;
  signal flag:     std_logic_vector(3 downto 0) ;
  signal qdep:     std_logic_vector(15 downto 0) ;
  signal addr_bus: std_logic_vector(15 downto 0) ;
  signal data_in:  std_logic_vector(15 downto 0) ;
  signal data_out: std_logic_vector(15 downto 0) ;
  signal rw:       std_logic_vector(3 downto 0) ;

begin

  stim_gen: process (scnr) is
  begin
    -- This is equivalent to Stimulus generator in the figure 7.1
    -- All these scenarios are discussed in Chapter 6
    case scnr (2 downto 0) is
      when "000" =>
        -- all processors write to memory locations in different memory blocks
        data_in  <= x"4321" ;
        addr_bus <= x"FB73" ;
        qdep     <= x"1234" ;
        ctrl     <= x"F" ;
        rw       <= x"F" ;
      when "001" =>
        -- all processors read from memory locations in different memory blocks
        -- in the reverse order
        data_in  <= x"26FE" ;
        addr_bus <= x"37BF" ;
        qdep     <= x"1234" ;
        ctrl     <= x"F" ;
        rw       <= x"0" ;
      when "010" =>
        -- processors 2 (write) and 3 (read) to different memory locations in the same memory block
        -- processor 2 gets priority as its qdepth is greater
        -- processors 0 (write) and 1 (read) to different memory locations in the same memory block
        -- processor 1 gets priority as its processor id is greater
        data_in  <= x"AAAA" ;
        addr_bus <= x"CD56" ;
        qdep     <= x"EFFF" ;
        ctrl     <= x"F" ;
        rw       <= x"5" ;
      when "011" =>
        -- processors 2 (write) and 3 (read) to different memory locations in the same memory block
        -- processor 3 gets priority as its processor id is greater
        -- processors 0 (write) and 1 (read) to different memory locations in the same memory block
        -- processor 0 gets priority as its qdepth is greater
        data_in  <= x"5555" ;
        addr_bus <= x"DC65" ;
        qdep     <= x"4434" ;
        ctrl     <= x"F" ;
        rw       <= x"5" ;
      when "100" =>
        data_in  <= x"9999" ; -- processors 2 (write) and 3 (write) to same memory location
        addr_bus <= x"3399" ; -- processor 2 gets priority as its qdepth is greater
        qdep     <= x"4EE4" ; -- processors 0 (read) and 1 (read) from same memory location
        ctrl     <= x"F" ;    -- processor 1 gets priority as its qdepth is greater
        rw       <= x"C" ;
      when "101" =>
        -- processors 0 (write), 1 (read), 2 (read) and 3 (read) to different memory locations in the same memory block
        data_in  <= x"EEEE" ; -- processor 0 gets priority as its qdepth is greater
        addr_bus <= x"7654" ;
        qdep     <= x"5556" ;
        ctrl     <= x"F" ;
        rw       <= x"1" ;
      when "110" =>
        -- processors 1 (read), 2 (read) and 3 (read) to same memory location (proc 0 is idle)
        data_in  <= x"CCCC" ; -- processor 3 gets priority as its processor id is greater
        addr_bus <= x"6666" ;
        qdep     <= x"5555" ;
        ctrl     <= x"E" ;
        rw       <= x"0" ;
      when others =>
        -- all processors in idle state
        data_in  <= x"FFFF" ; -- flag is "0000"
        addr_bus <= x"EEAE" ;
        qdep     <= x"0011" ;
        ctrl     <= x"0" ;
        rw       <= x"0" ;
    end case ;
  end process stim_gen ;

  INST: main port map (clk => clk, rst => rst, data_in => data_in, qdep => qdep,
                       addr_bus => addr_bus, rw => rw, ctrl => ctrl, flag => flag,
                       data_out => data_out ) ;

  -- This is equivalent to 'display' block in figure 7.1
  disply: process (ctrl, scnr, pid) is
  begin
    case pid (2 downto 0) is
      when "000" =>
        data <= qdep(3 downto 0) ;
        addr(0) <= ctrl(0) ;
        addr(1) <= rw(0) ;
        addr(2) <= flag(0) ;
        addr(3) <= flag(0) ;
      when "001" =>
        if (rw(0) = '0') then
          data <= data_out(3 downto 0) ;
        else
          data <= data_in(3 downto 0) ;
        end if ;
        addr <= addr_bus(3 downto 0) ;
      when "010" =>
        data <= qdep(7 downto 4) ;
        addr(0) <= ctrl(1) ;
        addr(1) <= rw(1) ;
        addr(2) <= flag(1) ;
        addr(3) <= flag(1) ;
      when "011" =>
        if (rw(1) = '0') then
          data <= data_out(7 downto 4) ;
        else
          data <= data_in(7 downto 4) ;
        end if ;
        addr <= addr_bus(7 downto 4) ;
      when "100" =>
        data <= qdep(11 downto 8) ;
        addr(0) <= ctrl(2) ;
        addr(1) <= rw(2) ;
        addr(2) <= flag(2) ;
        addr(3) <= flag(2) ;
      when "101" =>
        if (rw(2) = '0') then
          data <= data_out(11 downto 8) ;
        else
          data <= data_in(11 downto 8) ;
        end if ;
        addr <= addr_bus(11 downto 8) ;
      when "110" =>
        data <= qdep(15 downto 12) ;
        addr(0) <= ctrl(3) ;
        addr(1) <= rw(3) ;
        addr(2) <= flag(3) ;
        addr(3) <= flag(3) ;
      when others =>
        if (rw(3) = '0') then
          data <= data_out(15 downto 12) ;
        else
          data <= data_in(15 downto 12) ;
        end if ;
        addr <= addr_bus(15 downto 12) ;
    end case ;
  end process disply ;

end architecture test_top ;

-- The main interconnect module
library IEEE ;
use IEEE.std_logic_1164.all ;
use IEEE.std_logic_arith.all ;
use IEEE.std_logic_unsigned.all ;

entity main is
  port (ctrl:     in std_logic_vector(3 downto 0) ;
        qdep:     in std_logic_vector(15 downto 0) ;
        addr_bus: in std_logic_vector(15 downto 0) ;
        data_in:  in std_logic_vector(15 downto 0) ;
        rw:       in std_logic_vector(3 downto 0) ;
        clk:      in std_logic ;
        rst:      in std_logic ;
        flag:     inout std_logic_vector(3 downto 0) ;
        data_out: out std_logic_vector(15 downto 0)
       ) ;
end main ;

architecture main_arch of main is

  type qd is array (3 downto 0) of std_logic_vector(3 downto 0) ;
  type data_array is array (3 downto 0) of std_logic_vector(3 downto 0) ;
  type addr_array is array (3 downto 0) of std_logic_vector(3 downto 0) ;
  type mb is array (3 downto 0) of std_logic_vector(1 downto 0) ;
  type mem_array is array (15 downto 0) of std_logic_vector(3 downto 0) ;

  -- This function does the priority logic for all the memory blocks
  -- This is schematically equivalent to Figures 4.5 and 4.6 in the report
  -- This can work for any number of processors and memory blocks
  -- by changing 'i' and 'j' values
  function flg (qdep, addr_bus, ctrl: std_logic_vector) return std_logic_vector is
    variable qdvar:   std_logic_vector(3 downto 0) ;
    variable flag:    std_logic_vector(3 downto 0) ;
    variable qdv:     std_logic_vector(3 downto 0) ;
    variable gnt:     std_logic ;
    variable a:       integer ;
    variable b:       integer ;
    variable memaddr: mb ;
    variable qd_arr:  qd ;
  begin
    qd_arr(0) := qdep(3 downto 0) ;
    qd_arr(1) := qdep(7 downto 4) ;
    qd_arr(2) := qdep(11 downto 8) ;
    qd_arr(3) := qdep(15 downto 12) ;
    memaddr(0) := addr_bus(3 downto 2) ;
    memaddr(1) := addr_bus(7 downto 6) ;
    memaddr(2) := addr_bus(11 downto 10) ;
    memaddr(3) := addr_bus(15 downto 14) ;
    L1: for i in 0 to 3 loop
      L2: for j in 0 to 3 loop
        if (ctrl(j) = '0') then
          flag(j) := '0' ;
          qdv(j) := '0' ;
        elsif (memaddr(j) = i) then
          qdv(j) := '1' ;
        else
          qdv(j) := '0' ;
        end if ;
      end loop L2 ;
      qdvar := "0000" ;
      gnt := '0' ;
      L3: for k in 0 to 3 loop
        if qdv(k) = '1' then
          if qdvar <= qd_arr(k) then
            qdvar := qd_arr(k) ;
            a := k ;
            gnt := '1' ;
          else
            flag(k) := '0' ;
          end if ;
        end if ;
      end loop L3 ;
      if (gnt = '1') then
        flag(a) := '1' ;
      end if ;
    end loop L1 ;
    return (flag) ;
  end flg ;

  signal memory: mem_array ;

begin

  P1 : process (ctrl, clk, qdep, addr_bus, rst, data_in) is
  begin
    if (rst = '1') then
      flag <= "0000" ;
    else
      flag <= flg(qdep, addr_bus, ctrl) ;
      -- Memory transaction
      -- The conditional statements make sure that the connection is established
      -- before memory transaction
      -- Equivalent to Figures 4.4 and 4.1 after the completion of priority logic operation.
      -- This routine is to be repeated for each addition of processor
      if (clk'event and clk = '0') then
        if (flag(0) = '1') then
          if (rw(0) = '1') then
            memory(conv_integer(addr_bus(3 downto 0))) <= data_in(3 downto 0) ;
            data_out(3 downto 0) <= (others => 'Z') ;
          else
            data_out(3 downto 0) <= memory(conv_integer(addr_bus(3 downto 0))) ;
          end if ;
        end if ;
        if (flag(1) = '1') then
          if (rw(1) = '1') then
            memory(conv_integer(addr_bus(7 downto 4))) <= data_in(7 downto 4) ;
            data_out(7 downto 4) <= (others => 'Z') ;
          else
            data_out(7 downto 4) <= memory(conv_integer(addr_bus(7 downto 4))) ;
          end if ;
        end if ;
        if (flag(2) = '1') then
          if (rw(2) = '1') then
            memory(conv_integer(addr_bus(11 downto 8))) <= data_in(11 downto 8) ;
            data_out(11 downto 8) <= (others => 'Z') ;
          else
            data_out(11 downto 8) <= memory(conv_integer(addr_bus(11 downto 8))) ;
          end if ;
        end if ;
        if (flag(3) = '1') then
          if (rw(3) = '1') then
            memory(conv_integer(addr_bus(15 downto 12))) <= data_in(15 downto 12) ;
            data_out(15 downto 12) <= (others => 'Z') ;
          else
            data_out(15 downto 12) <= memory(conv_integer(addr_bus(15 downto 12))) ;
          end if ;
        end if ;
      end if ;
    end if ;
  end process P1 ;

end main_arch ;
Appendix B
Interconnect Network VHDL Code (Version 2)
The VHDL Code for the Crossbar Interconnect Network as Structured in
Figure 6.2.
-- This VHDL code is described for Crossbar interconnect network with
-- two interfaces, one on processor side and the other on shared memory side
-- Equivalent Block diagram for module 'main_ic' is shown in Figure 6.2
library IEEE ;
use IEEE.std_logic_1164.all ;
use IEEE.std_logic_arith.all ;
use IEEE.std_logic_unsigned.all ;

entity main_ic is
  port (ctrl:         in std_logic_vector(3 downto 0) ;
        qdep:         in std_logic_vector(15 downto 0) ;
        addr_prc:     in std_logic_vector(15 downto 0) ;
        data_in_prc:  in std_logic_vector(15 downto 0) ;
        data_in_mem:  in std_logic_vector(15 downto 0) ;
        rw:           in std_logic_vector(3 downto 0) ;
        clk:          in std_logic ;
        rst:          in std_logic ;
        flag:         inout std_logic_vector(3 downto 0) ;
        addr_mem:     out std_logic_vector(7 downto 0) ;
        rw_mem:       out std_logic_vector(3 downto 0) ;
        data_out_prc: out std_logic_vector(15 downto 0) ;
        data_out_mem: out std_logic_vector(15 downto 0)
       ) ;
end main_ic ;

architecture main_arch of main_ic is

  type qd is array (3 downto 0) of std_logic_vector(3 downto 0) ;
  type data_array is array (3 downto 0) of std_logic_vector(3 downto 0) ;
  type addr_array is array (3 downto 0) of std_logic_vector(3 downto 0) ;
  type mb is array (3 downto 0) of std_logic_vector(1 downto 0) ;

  -- This function does the priority logic for all the memory blocks
  -- This is schematically equivalent to Figures 4.5 and 4.6 in the report
  -- This can work for any number of processors and memory blocks
  -- by changing 'i' and 'j' values
  function flg (qdep, addr_prc, ctrl: std_logic_vector) return std_logic_vector is
    variable qdvar:   std_logic_vector(3 downto 0) ;
    variable flag:    std_logic_vector(3 downto 0) ;
    variable qdv:     std_logic_vector(3 downto 0) ;
    variable gnt:     std_logic ;
    variable a:       integer ;
    variable b:       integer ;
    variable memaddr: mb ;
    variable qd_arr:  qd ;
  begin
    qd_arr(0) := qdep(3 downto 0) ;
    qd_arr(1) := qdep(7 downto 4) ;
    qd_arr(2) := qdep(11 downto 8) ;
    qd_arr(3) := qdep(15 downto 12) ;
    memaddr(0) := addr_prc(3 downto 2) ;
    memaddr(1) := addr_prc(7 downto 6) ;
    memaddr(2) := addr_prc(11 downto 10) ;
    memaddr(3) := addr_prc(15 downto 14) ;
    L1: for i in 0 to 3 loop
      L2: for j in 0 to 3 loop
        if (ctrl(j) = '0') then
          flag(j) := '0' ;
          qdv(j) := '0' ;
        elsif (memaddr(j) = i) then
          qdv(j) := '1' ;
        else
          qdv(j) := '0' ;
        end if ;
      end loop L2 ;
      qdvar := "0000" ;
      gnt := '0' ;
      L3: for k in 0 to 3 loop
        if qdv(k) = '1' then
          if qdvar <= qd_arr(k) then
            qdvar := qd_arr(k) ;
            a := k ;
            gnt := '1' ;
          else
            flag(k) := '0' ;
          end if ;
        end if ;
      end loop L3 ;
      if (gnt = '1') then
        flag(a) := '1' ;
      end if ;
    end loop L1 ;
    return (flag) ;
  end flg ;

  signal data_in_mem_2d:  data_array ;
  signal data_out_mem_2d: data_array ;
  signal addr_mem_2d:     mb ;

begin

  data_in_mem_2d(0) <= data_in_mem(3 downto 0) ;
  data_in_mem_2d(1) <= data_in_mem(7 downto 4) ;
  data_in_mem_2d(2) <= data_in_mem(11 downto 8) ;
  data_in_mem_2d(3) <= data_in_mem(15 downto 12) ;

  P1 : process (rst, addr_prc, ctrl) is
  begin
    if (rst = '1') then
      flag <= "0000" ;
      -- Clear the data nibble of each of the four processors
      -- (the original listing repeated the (3 downto 0) slice four times)
      data_out_prc (3 downto 0)   <= (others => '0') ;
      data_out_prc (7 downto 4)   <= (others => '0') ;
      data_out_prc (11 downto 8)  <= (others => '0') ;
      data_out_prc (15 downto 12) <= (others => '0') ;
    else
      flag <= flg(qdep, addr_prc, ctrl) ;
      -- Memory transaction
      -- The conditional statements make sure that the connection is established
      -- before memory transaction
      -- Equivalent to Figures 4.4 and 4.1 after the completion of priority logic operation.
      -- This routine is to be repeated for each addition of processor
      if (clk'event and clk = '1') then
        if (flag(0) = '1') then
          addr_mem_2d (conv_integer(addr_prc(3 downto 2))) <= addr_prc(1 downto 0) ;
          rw_mem (conv_integer(addr_prc(3 downto 2))) <= rw(0) ;
          if (rw(0) = '1') then
            data_out_mem_2d (conv_integer(addr_prc(3 downto 2))) <= data_in_prc(3 downto 0) ;
            data_out_prc (3 downto 0) <= (others => 'Z') ;
          else
            data_out_prc (3 downto 0) <= data_in_mem_2d(conv_integer(addr_prc(3 downto 2))) ;
          end if ;
        end if ;
        if (flag(1) = '1') then
          addr_mem_2d (conv_integer(addr_prc(7 downto 6))) <= addr_prc(5 downto 4) ;
          rw_mem (conv_integer(addr_prc(7 downto 6))) <= rw(1) ;
          if (rw(1) = '1') then
            data_out_mem_2d (conv_integer(addr_prc(7 downto 6))) <= data_in_prc(7 downto 4) ;
            data_out_prc (7 downto 4) <= (others => 'Z') ;
          else
            data_out_prc (7 downto 4) <= data_in_mem_2d(conv_integer(addr_prc(7 downto 6))) ;
          end if ;
        end if ;
        if (flag(2) = '1') then
          addr_mem_2d (conv_integer(addr_prc(11 downto 10))) <= addr_prc(9 downto 8) ;
          rw_mem (conv_integer(addr_prc(11 downto 10))) <= rw(2) ;
          if (rw(2) = '1') then
            data_out_mem_2d (conv_integer(addr_prc(11 downto 10))) <= data_in_prc(11 downto 8) ;
            data_out_prc (11 downto 8) <= (others => 'Z') ;
          else
            data_out_prc (11 downto 8) <= data_in_mem_2d(conv_integer(addr_prc(11 downto 10))) ;
          end if ;
        end if ;
        if (flag(3) = '1') then
          addr_mem_2d (conv_integer(addr_prc(15 downto 14))) <= addr_prc(13 downto 12) ;
          rw_mem (conv_integer(addr_prc(15 downto 14))) <= rw(3) ;
          if (rw(3) = '1') then
            data_out_mem_2d (conv_integer(addr_prc(15 downto 14))) <= data_in_prc(15 downto 12) ;
            data_out_prc (15 downto 12) <= (others => 'Z') ;
          else
            data_out_prc (15 downto 12) <= data_in_mem_2d(conv_integer(addr_prc(15 downto 14))) ;
          end if ;
        end if ;
      end if ;
      addr_mem(1 downto 0) <= addr_mem_2d(0) ;
      addr_mem(3 downto 2) <= addr_mem_2d(1) ;
      addr_mem(5 downto 4) <= addr_mem_2d(2) ;
      addr_mem(7 downto 6) <= addr_mem_2d(3) ;
      data_out_mem(3 downto 0)   <= data_out_mem_2d(0) ;
      data_out_mem(7 downto 4)   <= data_out_mem_2d(1) ;
      data_out_mem(11 downto 8)  <= data_out_mem_2d(2) ;
      data_out_mem(15 downto 12) <= data_out_mem_2d(3) ;
    end if ;
  end process P1 ;

end main_arch ;