Degree Project in Computer Science and Engineering, specializing in
Embedded
Second cycle 30 HP
Hardware Support for FPGA
Resource Elasticity
FIDAN ALIYEVA
Stockholm, Sweden 2022
Hardware Support for FPGA
Resource Elasticity
FIDAN ALIYEVA
Master of Science in Embedded Systems
Date: April 19, 2022
Industrial Supervisor: Ahsan Javed Awan
Examiner: Johnny Öberg
School of Electrical Engineering and Computer Science
Host company: Ericsson AB
Abstract
FPGAs are commonly used in cloud computing because they can be programmed as processors serving a specific purpose, achieving high performance at low power. On
the other hand, FPGAs offer a large amount of resources, which are wasted if the device hosts a
single application or serves a single user's request. Partial reconfiguration technology
enables FPGAs to divide their resources into different regions and then dynamically
reprogram those regions with various applications during runtime. It is therefore
considered a good solution to the resource underutilization problem.
Nevertheless, the sizes of these regions are static; they cannot be increased or decreased once
they are defined, which in turn leads to underutilization of the reconfigurable regions' resources.
This thesis addresses that problem, i.e., how to dynamically increase or decrease partially
reconfigurable FPGA resources to match an application's needs. Our solution enables
expanding and contracting the FPGA resources allocated to an application by 1) expressing the
application's acceleration requirements as multiple smaller modules, which are configured into
multiple reconfigurable regions assigned to the application dynamically, and 2) providing a
low-area-overhead, configurable, and isolated communication mechanism among those
reconfigurable regions by adapting a crossbar interconnect and the WISHBONE interface.
Keywords: FPGA, Elasticity, Partial reconfiguration, Crossbar, WISHBONE, Multicast
Svensk Sammanfattning
FPGAs can be programmed as processors serving a specific purpose, which is why they are
often used in cloud solutions. This yields high performance with low power consumption.
On the other hand, FPGAs have large amounts of resources, which are wasted if they are only
used by one application or serve only one user's request. Partial reconfiguration
technology allows FPGAs to distribute their resources among different regions and then
dynamically reprogram those regions with different applications at runtime. Partial
reconfiguration is therefore regarded as a good solution for minimizing the underutilization
of resources. However, the sizes of the regions are static and cannot be changed once defined,
which leads to underutilization of the reconfigurable regions' resources. This thesis
addresses the problem of dynamically allocating partially reconfigurable FPGA resources
according to the application's needs. Our solution enables increasing and decreasing the FPGA
resources allocated to an application by 1) expressing the application's acceleration
requirements as several smaller modules configured into dynamically assigned reconfigurable
regions, and 2) providing an efficient, configurable, and isolated communication mechanism
by adapting the crossbar interconnect and the WISHBONE interface among the
reconfigurable regions.
Keywords: FPGA, Elasticity, Partial reconfiguration, Crossbar, WISHBONE, Multicast
Acknowledgments
Firstly, I would like to thank my examiner and professor at KTH, Johnny Öberg, who
referred me as a summer intern to Ericsson, and thus, my journey here began. I am also
grateful to my supervisor at Ericsson, Ahsan Javed Awan, for being academically and
morally supportive during my thesis.
Secondly, I would like to thank my caring family – my parents and brothers, who always
supported my dreams and education. To my father, who always stood behind me during my
education and inspired me to come to Sweden, thank you. Big thanks to my best friends,
Galib and Aydin, who always believed in me and were there for me. Finally, I want to thank
A. G. Pettersson, who constantly motivated me and cared for me during my thesis journey.
List of Abbreviations
API Application Program Interface
AXI Advanced eXtensible Interface
AXI-MM AXI Memory Mapped
AXI-ST AXI Streaming
BITMAN Bitstream Manipulation
BRAM Block Random Access Memory
CAD Computer-Aided Design
CLB Configurable Logic Block
CPU Central Processing Unit
C2H Card to Host
DMA Direct Memory Access
E-WB Encapsulated WISHBONE
FF Flip Flop
FIFO First-In-First-Out
FPGA Field Programmable Gate Array
GT Gigatransfer
HDL Hardware Description Language
H2C Host to Card
ICAP Internal Configuration Access Port
INDRA Integrated Design flow for Reconfigurable Architectures
IP Intellectual Property
JTAG Joint Test Action Group
LUT Look-Up Table
MPSoC Multiprocessor Systems-on-Chip
NoC Network-on-Chip
OpenCL Open Computing Language
OS Operating System
PC Personal Computer
PCIe Peripheral Component Interconnect express
PR Partially Reconfigurable
P2P Point-to-Point
QSPI Quad Serial Peripheral Interface
RMB Reconfigurable Multiple Bus
RMBoC RMB on Chip
SDM Space Division Multiplexing
SoC System-on-Chip
TDM Time Division Multiplexing
VC Virtual Channel
VHDL VHSIC HDL
VHSIC Very High-Speed Integrated Circuit
VR Virtual Region
XDMA Xilinx DMA
WB WISHBONE
Table of Contents
Chapter 1. Introduction .........................................................................................................................12
1.1. Background .................................................................................................................................12
1.2. Problem Statement .....................................................................................................................12
1.3. Thesis Goals ................................................................................................................................13
1.4. Thesis Contributions ...................................................................................................................13
1.5. Structure of the Thesis ................................................................................................................13
Chapter 2. Background ..........................................................................................................................14
2.1. FPGAs in the Cloud......................................................................................................................14
2.2. Multiprocessor Interconnection Methods .................................................................................14
2.2.1 Network-on-Chip ..................................................................................................................15
2.2.2 Crossbar Switch Interconnection ..........................................................................................15
2.3. PCIe Bus ......................................................................................................................................16
2.4. XDMA IP Core and Drivers ..........................................................................................................16
2.4.1 XDMA IP Core [3] ..................................................................................................................16
2.4.2 XDMA Drivers [4] ..................................................................................................................17
2.5. WISHBONE Interconnection Architecture ..................................................................................18
2.5.1 Bus Interface [7]....................................................................................................................19
2.5.2 WISHBONE Interconnection Modes [7] ................................................................................21
Chapter 3. Related Work .......................................................................................................................22
3.1. Architecture Support for FPGA Multi-tenancy in the Cloud [9] .................................................22
3.2. Resource Elastic Virtualization for FPGAs using OpenCL [10] ....................................................23
3.3. A Design Methodology for Communication Infrastructures on .................................................24
Partially Reconfigurable FPGAs [12] ..................................................................................................24
3.4. A Practical Approach for Circuit Routing on Dynamic Reconfigurable Devices [13] ..................25
3.5. Enhancement of Wishbone Protocol with Broadcasting and Multicasting [15] ........................26
3.6. Limitations of Related Works .....................................................................................................26
3.6.1 Limitations of Works on Resource Elasticity ........................................................................26
3.6.2 Limitations of the Work on Multicast Communication Method ..........................................27
Chapter 4. Design and Implementation ................................................................................................28
4.1. Design Tools and Environment ...................................................................................................28
4.1.1 Target Device ........................................................................................................................28
4.1.2 Tools and Design Guidelines .................................................................................................28
4.2. Solution Description and its Advantages ....................................................................................29
4.3. System Design Architecture ........................................................................................................31
4.4. Crossbar Switch Architecture .....................................................................................................36
4.4.1 Slave Port ..............................................................................................................................37
4.4.2 Master Port ...........................................................................................................................44
4.5. WISHBONE Interfaces and Computation Modules .....................................................................46
4.5.1 The Usage of WISHBONE ......................................................................................................47
4.5.3 The Design and Implementation of the WB Master Interface .............................................50
4.5.4 The Design and Implementation of the Slave Interface .......................................................53
4.5.5 A computation Module and its Template .............................................................................55
4.6. Extending the WISHBONE Crossbar Interconnection to Support Multicast Communication ....58
Chapter 5. Design Verification and Results ...........................................................................................62
5.1. Design Verification ......................................................................................................................62
5.1.1 Verification in the Simulation environment .........................................................................62
5.1.2 Verification in the FPGA device ............................................................................................63
5.2. Results .........................................................................................................................................63
5.2.1 The System Features .............................................................................................................63
5.2.2 Demonstration of Some Features.........................................................................................63
5.2.3 Communication Overhead ....................................................................................................65
5.2.4 Area usage ............................................................................................................................66
5.2.5 Power Consumption .............................................................................................................67
Chapter 6. Discussion and Future Work ................................................................................................69
6.1. Discussion ...................................................................................................................................69
6.2. Conclusion and Future Work ......................................................................................................71
6.2.1 Conclusion .............................................................................................................................71
6.2.2 Future Work ..........................................................................................................................71
References .............................................................................................................................................72
APPENDIX A ............................................................................................................................................74
Register File Description ....................................................................................................................74
List of Figures
Figure 2.1. 3x3 NoC with Mesh Topology ..............................................................................................15
Figure 2.2. 4x4 Crossbar Switch Interconnection ..................................................................................16
Figure 2.3. The WISHBONE master and slave connected point-to-point [7] ........................................19
Figure 2.4. The Waveform of Standard Write Communication.............................................................20
Figure 2.5. The Waveform of Pipelined Write Communication ............................................................21
Figure 3.1. VR and Router Architectures and Their Communication Structure [9] ...............................23
Figure 3.2. 5-layer Communication Model Presented by [12] ..............................................................24
Figure 3.3. Example of Parallel Transmission using Multiple Bus Segments in [13] .............................25
Figure 4.1. High-Level View of the Proposed Communication Technique ............................................30
Figure 4.2. Overall System Design Architecture View ...........................................................................32
Figure 4.3. PR Reconfiguration and Communication Flow ....................................................................35
Figure 4.4. Block Diagram of the Proposed Crossbar Switch Interconnection......................................36
Figure 4.5. Slave Port with the Black Box View of Slave Side ................................................................37
Figure 4.6. The Slave Side Unit with a Black Box View of the WRR and Output Port ...........................38
Figure 4.7. Block Diagram of Implemented Weighted Round Robin Arbiter ........................................40
Figure 4.8. Control Logic FSM of the Weighted Round Robin Arbiter...................................................41
Figure 4.9. Flow Chart of Master Grant Process ....................................................................................42
Figure 4.10. Block Diagram of Output Port ...........................................................................................43
Figure 4.11. Master Port with Black Box View of Input Port .................................................................44
Figure 4.12. Block Diagram of Input Port ..............................................................................................45
Figure 4.13. The Black Box View of the WISHBONE Master Interface ..................................................46
Figure 4.14. Black Box View of WISHBONE Slave Interface ...................................................................47
Figure 4.15. A Computation Module having Both Slave and Master Interfaces ...................................47
Figure 4.16. Communication Flow between a Computation Module and its Interfaces ......................49
Figure 4.17. Implementation Block Diagram of WB Master Interface ..................................................50
Figure 4.18. FSM of WB Master Control Logic .......................................................................................51
Figure 4.19. FSM of Timeout Logic ........................................................................................................52
Figure 4.20. The Implementation Block Diagram of the WB Slave Interface ........................................53
Figure 4.21. The FSM of the WB Slave Control Logic.............................................................................54
Figure 4.22. The Computation Module Black Box Template .................................................................55
Figure 4.23. The Block Diagram of a Computation Module ..................................................................56
Figure 4.24. The FSM of a Computation Module ..................................................................................57
Figure 4.25. The Master Port adjusted to support Multicast Communication .....................................59
Figure 4.26. The WB Timeout FSM adjusted to Support Multicast Communication ............................60
Figure 5.1. The Comparison of Execution Time with Different Case Scenarios ....................................64
Figure 5.2. Communication Isolation Demonstration ...........................................................................65
List of Tables
Table 4-1. KCU 1500 board's Resources ................................................................................................28
Table 5-1. The Communication Overhead Summary ............................................................................66
Table 5-2. The Area Usage of the WISHBONE Crossbar together with three Computation Modules ..67
Table 5-3. The Area Usage of all Components ......................................................................................67
Table 5-4. The Power Usage by Categories ...........................................................................................68
Table 5-5. The Power Usage by Hardware Components .......................................................................68
Table 6-1. The Comparison of the Resource Usages between the Developed Crossbar and the Existing Previous Art NoC routers .......................................................................................................................69
Table 6-2. Comparison of Power Consumption between the Crossbar System and Routers in [9] .....70
Table 6-3. The Comparison of Resource Usages between the Crossbar System and Communication Infrastructures in [12] ............................................................................................................................70
Table A-1. Register File Description .......................................................................................................74
Chapter 1. Introduction
1.1. Background
As technology improves, high expectations are placed on the speed and bandwidth of newly
available devices. These devices include edge-computing acceleration devices that host
different applications from several users. On the one hand, machine learning applications
have become more common as a way to improve the user experience. On the other hand, the
data sets that need to be processed grow as the number of users increases. Fast processors
are therefore required to keep up with the required rate of data processing. FPGAs are
commonly used for this purpose because they can be programmed as different special-purpose
processors to host such applications, thus achieving high performance.
However, FPGAs have a lot of resources available, and when they are configured statically to
host a single application, a huge amount of those resources is wasted. Reconfigurable FPGA
architectures are therefore considered a good solution, since they can be dynamically
reconfigured at run time, which enables hosting different applications when needed. This is
achieved through the partial reconfiguration feature of FPGAs. Partial reconfiguration
allows the FPGA resources to be divided into several regions, called partially
reconfigurable (PR) regions, and each region to be programmed with a different application.
The PR regions are dynamically reprogrammable, meaning that the application an FPGA hosts
can be changed at run time without repeating the full design flow.
1.2. Problem Statement
PR regions are fixed in size; it is not possible to dynamically increase or decrease the resources
of a PR region once it is defined. This characteristic can lead either to underutilization of
resources or to PR regions that are not large enough to host the application. In other words, a
PR region might contain many more resources than the running application requires, so some
resources are wasted. Conversely, a running application might require more resources than the
PR region has, so the region cannot host it. The question, therefore, is how to dynamically
adjust PR region resources according to the application's needs.
1.3. Thesis Goals
The goals of this thesis are the following:
1. To develop hardware support for FPGA resource elasticity that:
a) enables increasing or decreasing FPGA PR resources according to application
requirements
b) provides communication isolation; in other words, no two different
applications can access each other's data or FPGA region
2. To suggest a solution that enables multicast communication among the PR regions of the FPGA.
1.4. Thesis Contributions
This work addresses the problem by providing an area-efficient and low-power communication
method among PR regions. The chosen communication method is a WISHBONE crossbar
interconnection, and we explain its benefits and the implementation details that meet the thesis
goals. Additionally, the WISHBONE bus interface implementation is presented, together with a
hardware module template for reusing these interfaces with different modules. Moreover, we
suggest a solution that enables multicast communication among PR regions using the
WISHBONE crossbar. Finally, the overall system design architecture using the suggested
solution is presented.
1.5. Structure of the Thesis
The thesis is structured as follows: Chapter 2 gives the background information necessary to
follow the thesis contents. Chapter 3 summarizes related prior work and explains the
advantages of our solution. Chapter 4 presents the proposed solution, its design, and its
implementation. Chapter 5 presents the results. Chapter 6 compares our results with the prior
art as a discussion and suggests future work on our solution.
Chapter 2. Background
2.1. FPGAs in the Cloud
Owing to their reconfigurable and reprogrammable architectures, FPGAs have become popular
for deployment in cloud systems. They are used to host special-purpose accelerators and are
deployed in the cloud using several different techniques, including attaching the FPGA to the
system by a PCIe link, by an Ethernet link, or within an MPSoC that combines ARM CPUs and
programmable logic. Industry examples include Amazon Web Services (AWS), which connects
FPGAs to servers over an Ethernet link and uses them to host accelerator requests, and IBM,
which communicates with FPGA accelerators through a PCIe link.
On the other hand, FPGAs have a lot of logic resources, and when a device is dedicated to a
single user, underutilization of FPGA resources is inevitable. As a solution to this problem,
partial reconfiguration is suggested to enable multi-tenancy on FPGAs. The FPGA resources
are statically divided into multiple partitions, each of which can be separately and dynamically
reconfigured during runtime to serve a different purpose. The bitstreams used in this case can
be generated at the RTL level, at the HLS level, or using OpenCL.
2.2. Multiprocessor Interconnection Methods
The traditional ways of connecting multiple processors or chips into a single system
architecture are the shared-bus method and the point-to-point (P2P) method. However, these
methods have several limitations. The point-to-point method only allows two processors to be
connected and communicate with each other, whereas complex architectures need to involve
more than two processors. Shared-bus architectures, on the other hand, allow multiple
processors to communicate, but they suffer from limited bandwidth and latency, since only one
processor can access the bus at a time. Considering the drawbacks of shared-bus and P2P
interconnections, the crossbar and Network-on-Chip methods were introduced as more
effective interconnection schemes. The following subsections briefly describe them.
2.2.1 Network-on-Chip
Network-on-Chip (NoC) solutions apply the general idea of computer network topologies to
multi-chip systems as a communication method. Each module is considered a node and has a
router; the routers are connected following a network topology: mesh, torus, butterfly,
etc. Figure 2.1 illustrates a 3x3 NoC with a mesh topology. The blue boxes
represent routers, while the red ones represent computation nodes.
Figure 2.1. 3x3 NoC with Mesh Topology
The communication flow is again similar to computer networks. A node directs its data to
its router, and the router divides the data into packets and forwards them to the next router.
A packet thus travels through several routers until it reaches its destination. Routers can use
routing algorithms (random, weighted random, adaptive, etc.) or routing tables to decide which
path to follow [1].
Since multiple paths and routers are available in a network topology, parallel transmissions
can happen simultaneously. This method therefore alleviates the latency and bandwidth
problems of the shared-bus and P2P methods.
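As an illustration of deterministic routing in a mesh, the following sketch (not taken from the thesis; the coordinates and function name are hypothetical) implements XY dimension-ordered routing, one of the simple algorithms a mesh NoC router may use: a packet first corrects its X coordinate, then its Y coordinate.

```python
# Illustrative sketch: XY (dimension-ordered) routing in a 2D mesh NoC.
# A packet travels along the X axis first, then along the Y axis.

def xy_route(src, dst):
    """Return the list of (x, y) router coordinates a packet visits."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:          # correct the X coordinate first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:          # then correct the Y coordinate
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# In a 3x3 mesh like Figure 2.1, a packet from node (0, 0) to node (2, 1)
# crosses two routers horizontally, then one vertically:
print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because the route is fully determined by source and destination, such a scheme is deadlock-free in a mesh, at the cost of not adapting to congestion the way the adaptive algorithms mentioned above can.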
2.2.2 Crossbar Switch Interconnection
The crossbar switch interconnection consists of switches arranged in matrix form and
connects a source module to a destination. As shown in Figure 2.2, all modules are
connected to a common set of bus lines; however, communications are allowed or prevented
by the switches, illustrated as blue boxes. A communication request is initiated by a master.
During communication, switches are enabled to create a physical connection between the
master and the destination slave. The switches can be implemented using either transistors or
multiplexers.
16 | C h a p t e r 2 . B a c k g r o u n d
Figure 2.2. 4x4 Crossbar Switch Interconnection
Since there is a separate bus line for each destination, parallel transmissions are possible in a
crossbar interconnection. The only limitation is that source devices must not target the same
destination; when they do, arbitration logic regulates the access order.
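The grant behavior just described can be sketched as follows (an illustrative model, not the thesis implementation, which uses a weighted round-robin arbiter described in Chapter 4; here a simple fixed-priority choice stands in for the arbiter):

```python
# Illustrative sketch: per-slave arbitration in a crossbar. Each destination
# has its own bus line, so requests to different slaves are granted in
# parallel; only requests colliding on the same slave are serialized.

def crossbar_grant(requests):
    """requests: dict master_id -> target slave. Returns dict slave -> granted master."""
    grants = {}
    for master in sorted(requests):    # lower master id = higher priority here
        slave = requests[master]
        if slave not in grants:        # slave line still free this cycle
            grants[slave] = master
    return grants

# Masters 0 and 1 target different slaves, so both are granted in the same
# cycle; master 2 loses the conflict on slave "B" and must retry next cycle.
print(crossbar_grant({0: "A", 1: "B", 2: "B"}))  # {'A': 0, 'B': 1}
```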
2.3. PCIe Bus
The PCIe bus is a serial interconnect that provides the physical communication link between
the CPU and the FPGA [2]. It has several generations; Gen3 is used in this project with a
Xilinx UltraScale+ device. A PCIe Gen3 link to the FPGA can comprise up to 16 lanes, each
of which is a serial link signaling at 8 GT/s. Inside the FPGA, the PCIe bus is connected to
the GTH transceivers.
2.4. XDMA IP Core and Drivers
Xilinx provides a DMA IP core [3] and the corresponding XDMA drivers [4] for communicating
with the UltraScale+, UltraScale, and Virtex FPGA device series from a host computer. The
FPGA device is attached to the host PC by a GTH-PCIe link, which is connected to an XDMA
IP core inside the FPGA. User data is then delivered to the FPGA by running the XDMA driver
applications on the host. The following subsections briefly explain both the IP core and the drivers.
2.4.1 XDMA IP Core [3]
As mentioned before, the IP core uses a GTH transceiver and a PCIe link as the communication
link between a host PC and the FPGA. This link provides eight separate channels for a single
XDMA core to exchange data: Host-to-Card (H2C) channels, which transmit data from the
host computer to the FPGA, and Card-to-Host (C2H) channels, which transmit data from the
FPGA to the host computer. A single XDMA IP core can support four H2C and four C2H
channels.
Additionally, two options are provided as communication interfaces for delivering data
from the XDMA to the user logic: AXI Memory Mapped (AXI-MM) and AXI Streaming
(AXI-ST). When the AXI-MM interface is chosen, all H2C and C2H channels are connected
to a single interface. The AXI-ST interface, on the other hand, provides a separate
interface for each channel. Both interfaces can operate with a bit width of up to 256 bits.
Furthermore, the XDMA IP core has two other channels. The first enables a host to access the
user logic's status and configuration registers through the AXI4-Lite Master
interface; similarly, the user logic can access the DMA's status and configuration registers
through the AXI4-Lite Slave interface. These interfaces support either 32-bit or 64-bit
read/write transactions. The other channel bypasses the DMA core and enables a host
to communicate directly with the user logic inside the FPGA; the bus widths supported here
are 32-bit and 64-bit as well.
2.4.2 XDMA Drivers [4]
The XDMA drivers provide a simple way to deliver user data to the XDMA IP core from a host
PC. They expose a separate device node for each channel at the kernel level, making the
channels convenient to use. For instance, the xdma0_h2c_0 device transmits user data over the
H2C-0 channel, while the xdma0_c2h_0 device retrieves data from the FPGA over the C2H-0
channel. Additionally, the xdma0_user device accesses the AXI4-Lite interface of the XDMA
IP core, and the xdma0_bypass device bypasses the XDMA. Finally, a debugging channel,
listed as the xdma0_control device, reports channel status (busy, completed transactions,
etc.) by reading the control and status registers of the XDMA IP core. More information about
these channels and debugging can be found in [4].
Furthermore, different utilities are used to access each type of channel. Data is transferred to
the FPGA device through one of the H2C channels with the dma_to_device utility. Its general
invocation is:
./dma_to_device -d device -f file_name -s transaction_size_bytes -c number_of_transactions
For example, the following command sends 1024 bytes from the file data.bin two
times:
./dma_to_device -d /dev/xdma0_h2c_0 -f data.bin -s 1024 -c 2
Correspondingly, dma_from_device retrieves data from the FPGA device:
./dma_from_device -d device -f file_name -s transaction_size_bytes -c number_of_transactions
The next command reads 512 bytes from the FPGA through the first C2H channel into the file
data_back.bin.
./dma_from_device -d /dev/xdma0_c2h_0 -f data_back.bin -s 512 -c 1
Finally, to read from or write to the user logic's registers over the AXI4-Lite interface,
the reg_rw command is used:
./reg_rw device register_addr access_type
./reg_rw /dev/xdma0_user 0x0001 r
This command reads the data from the register at address 0x0001.
XDMA drivers can operate in two different modes: polling mode and interrupt mode. In polling
mode, the host OS periodically keeps checking if there is any response from the XDMA IP
core. On the other hand, in the interrupt mode, whenever there is a response from the XDMA
IP core, an interrupt happens, which the OS kernel then serves. The performance of the XDMA
Drivers and the XDMA IP Core in both modes is reported in [5]. The XDMA Drivers are available
at [6].
2.5. WISHBONE Interconnection Architecture
The WISHBONE Interconnection Architecture was developed by OpenCores to provide an effective
communication method for IP cores on a SoC [7]. According to [7], the purpose is to increase
the reusability of different IP cores in different communication scenarios by providing a
standard data exchange protocol. Since different IP cores have different I/O ports, building
a single design architecture that connects them is not always fast or simple. Using a
standard bus interface and protocol for communication, on the other hand, makes this process
easier.
The WISHBONE Interconnection Architecture mainly consists of two parts: a bus protocol and
the communication interconnection. The bus protocol is a set of rules about exchanging data
among IPs and must be followed when designing bus interfaces. The communication
interconnection, on the other hand, is the way the IP cores are connected.
OpenCores does not specify the implementation of bus interfaces or interconnection
architectures. Instead, they provide a set of rules that should be followed in the design to be
suitable for a WISHBONE Interconnection architecture. Thus, implementation details are up
to a designer as long as the design obeys these rules. The following sections provide a brief
background about the bus protocol and communication interconnections.
2.5.1 Bus Interface [7]
The WISHBONE interface operates with master-slave logic. A WISHBONE master can
initiate either a read or a write request to a slave. Figure 2.3 illustrates a WISHBONE master
and slave connected using the point-to-point form.
Here is a brief description of necessary interface signals:
Common Signals:
• DAT_I – incoming data from a master or a slave.
• DAT_O – outgoing data from a master or a slave.
Figure 2.3. The WISHBONE master and slave connected point-to-point [7]
Other Signals:
• CYC_O (master) & CYC_I (slave) – an indication of a request. A master asserts CYC_O
signal high and deasserts it when the request is complete. The signal is processed by
the crossbar control and arbitration logic and directed to a slave in the case of a grant.
• STB_O (master) & STB_I (slave) – indication of an active request. STB_O signal is
asserted together with CYC_O; however, if a master interface stalls the communication
for a brief time, it deasserts this signal, informing the slave that it needs to wait.
• WE_O (master) & WE_I (slave) – identifies whether a master’s request is a read or write
request.
• SEL_O (master) & SEL_I (slave) – byte-lane select; indicates which parts of
DAT_O/DAT_I carry valid data during a transfer.
• ADR_O (master) & ADR_I (slave) – address of a target slave requested by a master.
• GNT_I (master) – indicates that a master is granted a request. This signal comes from
the interconnection.
• ERR_I (master) & ERR_O (slave) – this signal indicates that the communication with
the slave has failed.
• ACK_O (slave) & ACK_I (master) – when ACK_O, and thus ACK_I, is high, it indicates that
the slave has registered the written data or returned the requested data to the master.
• STALL_O (slave) & STALL_I (master) – when STALL_O, and thus STALL_I, is high, it indicates
that the slave has temporarily suspended the communication; the master needs to wait.
The WISHBONE interface has two operation modes: standard mode and pipelined mode. An example
waveform of a standard-mode write request is illustrated in Figure 2.4. In standard
mode, before sending new data, a master waits for an acknowledgment signal from the slave
for the previously sent data. As shown in the figure, the master sends the next data word
only after receiving an acknowledgment. For instance, since the master has not received an
acknowledgment for DAT_3 in clock cycle 6, it holds the data until the acknowledgment
arrives. After sending the last data word and receiving the last acknowledgment, all output
lines go low.
Figure 2.4. The Waveform of Standard Write Communication
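The two-cycles-per-word behavior described above can be sketched as a small illustrative
model (not part of the thesis implementation; it assumes exactly one cycle of acknowledgment
latency and no stalls, as in the figure):

```python
def standard_write_trace(words):
    """Cycle-by-cycle trace of a standard-mode WISHBONE write.

    Returns one (DAT_O, ACK_I) pair per clock cycle. The slave
    acknowledges each word one cycle after it is presented, so the
    master must hold every data word on DAT_O for two cycles.
    """
    trace = []
    for w in words:
        trace.append((w, 0))  # cycle N:   master presents the word, no ACK yet
        trace.append((w, 1))  # cycle N+1: ACK arrives; master may now advance
    return trace

trace = standard_write_trace(["DAT_0", "DAT_1"])
# [("DAT_0", 0), ("DAT_0", 1), ("DAT_1", 0), ("DAT_1", 1)]
```

The trace shows why standard mode halves the effective throughput: every word occupies the
bus for two cycles even when the slave is never stalled.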
In pipelined mode, on the other hand, a master does not wait for an acknowledgment signal;
instead, it keeps sending data as long as the slave has not asserted stall, and then waits
for the acknowledgment signals from the slave. In other words, as long as the STALL_I signal
is low, the master continues to send data. This process is illustrated in Figure 2.5; in
each clock cycle, the master sends the next data word because the STALL_I signal is low. In
clock cycle 3, however, the master receives a high stall signal; thus, it holds DAT_3 until
the stall goes low. After sending all data in clock cycle 6, it deasserts all signals except
CYC_O to wait for the last acknowledgment signal. Finally, after receiving the last
acknowledgment, the request signal CYC_O goes low, too. For more information about the
WISHBONE interface, see [7].
According to [7], higher throughput can be obtained with this mode. To illustrate, consider
eight data words to be sent to a slave. In standard mode, this would take 16 clock cycles,
because the master sends a data word in one cycle and receives its acknowledgment in the
next cycle before proceeding with the next word. In pipelined mode, however, since the
master sends data as long as STALL_I is low, it would take only nine clock cycles to
complete the transaction: eight clock cycles to send the data and one more clock cycle to
wait for the last acknowledgment signal.
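The cycle counts in this example can be written down directly (a sketch based on the figures,
assuming one cycle of acknowledgment latency and no stalls):

```python
def standard_mode_cycles(n_words: int) -> int:
    # Each word needs one cycle to send and one cycle to wait for ACK_I.
    return 2 * n_words

def pipelined_mode_cycles(n_words: int) -> int:
    # One word per cycle while STALL_I stays low, plus one final cycle
    # to collect the last acknowledgment.
    return n_words + 1

print(standard_mode_cycles(8))   # 16 cycles
print(pipelined_mode_cycles(8))  # 9 cycles
```

For long bursts the pipelined mode therefore approaches one word per cycle, which is where
the throughput advantage claimed in [7] comes from.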
Figure 2.5. The Waveform of Pipelined Write Communication.
The following arguments list the benefits of WISHBONE as a bus interface:
• It has a non-hierarchical bus architecture with a single-level bus line; hence, it saves
area [7] and consumes less energy [8].
• It is easy to adapt hardware modules to work with the WISHBONE interface [7].
• It has low design complexity [7].
• It can operate at high frequencies [7].
• It has a built-in handshaking protocol, which eliminates the need to implement an extra
technique to ensure transmission safety.
• It can support different types of transactions: block read/write, single read/write, and
read-modify-write (RMW).
2.5.2 WISHBONE Interconnection Modes [7]
Four different interconnection modes are provided by [7]: point-to-point, shared bus,
data flow, and crossbar interconnection. In data-flow interconnection, each module has both
a slave and a master interface, and the modules are connected sequentially; the master
interface of the first IP core is connected to the slave of the second IP core, the master
of the second IP core is connected to the slave of the third one, and so on. Thus, data is
exchanged sequentially from one IP core to the next.
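A data-flow chain like the one described can be sketched as a list of processing stages,
each stage's master interface feeding the next stage's slave (a purely illustrative model;
the stage functions are made up):

```python
def dataflow_chain(stages, data):
    """Pass data through IP cores connected master-to-slave in sequence."""
    for stage in stages:
        data = stage(data)  # stage i's master interface feeds stage i+1's slave
    return data

# Hypothetical three-core pipeline: scale, offset, clamp.
stages = [lambda x: x * 2, lambda x: x + 3, lambda x: min(x, 100)]
result = dataflow_chain(stages, 10)  # 10 -> 20 -> 23 -> 23
```

The key property is that each core only ever talks to its immediate neighbor, which keeps
the interconnect simple but fixes the communication pattern at design time.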
Chapter 3. Related Work
Previous work on FPGA resource elasticity mainly solves the problem by employing NoC
connections. In addition, there are works based on bitstream manipulation as well as
shared-bus interconnection. The following sections highlight one work for each solution
category and summarize it. Additionally, a solution for a multicasting technique for the
selected bus interface is summarized as well. The limitations of those works are listed in
the final subsection.
3.1. Architecture Support for FPGA Multi-tenancy in the Cloud [9]
This paper explores the problem of elasticity on FPGAs using the NoC solution. The authors
suggest that the resources of the FPGA can be divided into virtual regions (VRs), which are
partially reconfigurable. These regions can then communicate with one another using NoC
communication when needed. Initially, the authors suggest a 3x3 mesh topology, with each
node hosting one module – a partially reconfigurable region.
To decrease the area usage of the routers, they apply optimization techniques to the routers
and the overall network. These optimizations include, but are not restricted to,
implementing routers without any buffers; instead, buffers are allocated inside the VRs.
Additionally, the routers are simplified so that they do not contain any virtual channels
(VCs) either, since those are resource costly. Moreover, 3-port routers are used wherever
4-port ones are not needed.
Each node, called a VR, contains a reconfigurable region, a communication interface, an
access monitor, wrappers, and some registers to store destination IDs, as shown in
Figure 3.1. The communication interface has buffers to store waiting data that is to be sent
by the routers. The access monitor helps to unwrap data from additional information such as
IDs. Additionally, the registers store destination information such as the destination's ID
and its router's ID.
Figure 3.1. VR and Router Architectures and Their Communication Structure [9]
The prototype of the proposed system was implemented as a 2x3 mesh with six modules running.
According to their results, the optimizations have a clear impact on area usage: 3-port
routers take 305 LUTs, while 4-port routers take 495 LUTs. Moreover, since buffers have been
eliminated, the routers do not consume any BRAM resources. Overall, their NoC, together with
the PR regions, takes 1.71% of the CLB resources on a Xilinx Virtex UltraScale+.
3.2. Resource Elastic Virtualization for FPGAs using OpenCL [10]
The researchers here approach the reconfiguration concept from an area-time perspective. The
authors implement a resource manager which observes the currently available FPGA resources
and the demanded applications, then decides when to run which application, how many
instances of it to run, and which version of it. To illustrate, each application has
different implementations taking different amounts of resources in the reconfigurable area.
Moreover, when all resources are occupied and there are additional requests, time-division
multiplexing (TDM) is applied.
Firstly, to increase or decrease the FPGA resources allocated to a specific PR region, they
use the BITMAN tool implemented by [11]. BITMAN takes the bitstream of the overall design
with the initially allocated application resources and then manipulates the bitstream to
change a module's location, increase or decrease allocated resources, and so on. Moreover,
according to the authors, the tool does not require any additional CAD tools; thus, it can
run on general-purpose CPUs and microcontrollers.
Each application has different versions with different sizes and instances implemented in
OpenCL, referred to as kernels. The resource manager knows which applications are running
and which are requested to run. Then, the built-in scheduler decides which kernel and which
version of it to run based on both SDM and TDM. The decision flow is as follows: kernels and
their implementations are selected by the resource manager. Then, the resource allocation
fairness for each application is calculated based on specific equations, which consider
several parameters such as the completion time of kernels, the FPGA reconfiguration time,
the size of the bitstreams, etc. Finally, the decision is made, which leads to a
reconfiguration of the FPGA with the selected kernel versions and number of instances.
With the resource manager prototype implemented, the system utilizes FPGA resources 2.3x
better and shows 49% better performance. Moreover, with the help of the scheduling
algorithms, the average waiting times for kernels to be run are decreased, too.
3.3. A Design Methodology for Communication Infrastructures on
Partially Reconfigurable FPGAs [12]
In this paper, the authors introduce the specific design flow for reconfigurable architectures
called INDRA and implement the communication layer model to make reconfigurable regions
communicate. The details of INDRA are out of the scope of this work; thus, only the
presented communication method is covered here.
In the presented system architecture, a standard bus interface connects the reconfigurable
regions to a common shared bus. Since the implemented shared bus is a pipelined bus
architecture, the WISHBONE bus interface of the PR regions is wrapped and adapted to support
the pipelined bus; this is called the Encapsulated WISHBONE protocol (E-WB). They then
deploy a 5-layer communication protocol to create a flexible communication infrastructure
among PR modules over the common shared bus. The five layers of the protocol are grouped
into two classifications: top-down and bottom-up.
Firstly, the top-down implementation phase includes the application layer and the protocol
mapping layer. The application layer provides a standardized communication protocol for PR
modules; in this case, the WISHBONE protocol. The protocol mapping layer, on the other hand,
covers up the differences between the pipelined shared bus and the WISHBONE bus
requirements. Therefore, an intermediate protocol is used, which they refer to as
Encapsulated WISHBONE (E-WB). The bridge which wraps WISHBONE into E-WB consists of FIFOs,
registers, and control logic.
Figure 3.2. 5-layer Communication Model Presented by [12]
Secondly, the bottom-up implementation phase consists of the physical layer, the routing
layer, and the binding layer. The physical layer comprises the available physical resources
of the FPGA device to be used by the routing layer. The routing layer selects the resources
available to implement a shared bus architecture. Finally, the binding layer is the logical
realization of a shared bus
architecture. Here they consider whether the bus architecture should be implemented using a
tri-state-buffer (TB-based) method or a slice-based method.
In the implementation part, the authors compare the binding layer methods and summarize the
performance of the E-WB mapping. According to the results, slice-based communication takes
more area, leaving less space for PR modules, compared to TB-based communication. As for
E-WB, it does not necessarily degrade WISHBONE's original performance, thanks to its
pipelined architecture. The results show that the protocol operates at 70 MHz for block
transfers larger than 1 KB with a 32-bit bus architecture.
3.4. A Practical Approach for Circuit Routing on Dynamic
Reconfigurable Devices [13]
In this paper, a circuit routing method is investigated as a communication method among
reconfigurable areas. The proposed technique relies on the Reconfigurable Multiple Bus
(RMB), which was originally developed for communication in multiprocessor systems [14];
here, it is adapted to be used by 1-D and 2-D NoCs. The authors want to eliminate the packet
communication method used by NoCs, since it introduces extra communication overhead:
dividing data into packets, then combining and recovering the original data at the
destination. Thus, they introduce the Reconfigurable Multiple Bus on Chip (RMBoC), which
provides a physical communication link between a source and a destination among the NoC's
nodes. Consequently, once a physical channel is built, a source can send its data directly
to the destination without communication overhead.
Initially, a single bus line is divided into multiple bus segments, as shown in Figure 3.3,
so that parallel communication among different modules can happen. The bus line connecting
all modules is divided into sections, and the links between these sections are gated by
transistors. The transistors thus act as switches that establish a physical communication
link between a source and a destination.
Figure 3.3. Example of Parallel Transmission using Multiple Bus Segments in [13]
Further, a bus controller is deployed for each section to enable or disable the transistors,
thereby creating or tearing down communication links. When a module wants to send data, it
sends a request to its bus controller, and the bus controller delivers this request to the
destination through the other bus controllers that sit on the path. If the destination is
available, it sends a confirmation reply, and this reply again propagates through all the
bus controllers. When each bus controller receives this reply, it turns on its transistors
to establish the communication link. Each bus controller takes eight clock cycles to process
the request. Once the link is established, the latency of sending a message is only one
clock cycle.
The implementation is done on a Virtex-II 6000 FPGA device with four nodes and hence four
bus controllers. According to the results, in the 1-D NoC, the area overhead of RMBoC varies
between 4% and 15%, and the maximum reachable frequency is above 120 MHz. Additionally, the
2-D NoC version also achieves a high frequency (up to 95 MHz) but takes 50% more area.
3.5. Enhancement of Wishbone Protocol with Broadcasting and
Multicasting [15]
This paper is reviewed because the WISHBONE protocol is included in our solution and we
propose a multicasting communication method for it.
The solution suggested by the authors targets adding multicasting and broadcasting features to
WISHBONE shared bus interconnection. To do that, when a master is introduced to the system,
it assigns group IDs to the slaves participating in multicasting, and the slaves store their
group IDs in registers. Since registers are used to store the group IDs, the number of
groups a slave can belong to is limited.
When multicasting happens, a master sends the group's ID as the address so that the slaves
can check whether they should participate in the communication. Then, the acknowledgment
signals of all slaves are ANDed together and sent to the master. In this scheme, the slaves
which do not participate in the communication assert their acknowledgment signals
unconditionally.
Eight slaves and four masters are implemented on the Nexys-4 board. The overall system
occupies 2566 LUTs, and the achieved operating frequency is 207 MHz.
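The acknowledgment scheme of [15] can be sketched as follows: slaves in the addressed group
respond when ready, non-participants assert their ACK lines unconditionally, and the master
sees the AND of all lines (an illustrative software model; the actual scheme is combinational
hardware):

```python
def combined_ack(group_id, slave_groups, ready):
    """AND all slave ACK lines for a multicast to `group_id`.

    slave_groups: the group ID stored in each slave's register.
    ready:        whether each slave has latched the multicast data.
    Non-participants drive ACK high unconditionally, so only the
    addressed slaves can hold the combined ACK low.
    """
    acks = []
    for g, r in zip(slave_groups, ready):
        if g == group_id:
            acks.append(r)      # participant: ACK reflects readiness
        else:
            acks.append(True)   # non-participant: ACK forced high
    return all(acks)

# Group 2 = slaves 1 and 3; slave 3 is not ready yet, so the master waits.
print(combined_ack(2, [1, 2, 1, 2], [True, True, True, False]))  # False
```

This also makes the overhead criticized in Section 3.6.2 visible: even slaves that have
nothing to do with the transfer must actively drive their ACK lines.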
3.6. Limitations of Related Works
3.6.1 Limitations of Works on Resource Elasticity
NoC and Shared Bus Methods
The Network-on-Chip method [1][16][17] and the shared bus method [12] have the disadvantages
listed below:
• The shared bus method is not a flexible or scalable method of communication due to its
bandwidth limitations. Here, flexibility refers to an interconnection's ability to support
different communication patterns, and scalability refers to its ability to be extended to
support a larger number of modules. Although the NoC architecture is flexible and scalable,
its area usage and power consumption are its big disadvantages [8][18].
• The shared bus cannot support parallel transmissions. Although NoC provides parallel
transmission, its routers have a small number of physical channels, so extra protocol
overhead is required to handle access to them.
• NoC has a large network protocol overhead. This includes dividing a packet into different
segments, sending them separately, and recovering the original sequence at the destination.
Additionally, if any segment is lost, the source must send the whole packet again.
Bitstream manipulation methods
The bitstream manipulation method for resource elasticity is not flexible. In other words,
it is very device-specific, because different FPGA devices have different bitstream file
formats. Moreover, the BITMAN tool developed in [11] targets specific devices, too.
Additionally, bitstream manipulation requires a very good understanding of the given
device's architecture and its resources to be error-free. Finally, it is not assured that a
bitstream file would even be available to a developer; some vendors might prefer to provide
an encrypted bitstream format, for instance.
3.6.2 Limitations of the Work on Multicast Communication Method
We use the WISHBONE Crossbar Interconnection as the communication method among PR regions.
The WISHBONE datasheet does not suggest any way of multicasting for the WISHBONE protocol.
According to our research, no other work addresses it except [15]. Nevertheless, the
solution proposed by the authors of that paper is valid for shared bus interconnection; it
would not work for crossbar interconnection, as the grant process for the requested slaves
is not the same. Additionally, this method has a multicasting setup overhead, such as
assigning group IDs and forcing non-participant slaves to assert acknowledgment signals,
which leads to extra traffic on the communication path and disturbs unrelated slaves
unnecessarily.
Chapter 4. Design and Implementation
4.1. Design Tools and Environment
4.1.1 Target Device
The implementation is done on the Xilinx KCU1500 acceleration development board, which
contains the Kintex UltraScale XCKU115 FPGA device. The logic resources of this board are
summarized in the following table:
Table 4-1. KCU 1500 board's Resources
Resource Type Amount
BRAMs 2160
DSPs 5520
FFs 1326720
I/O pin count 2104
LUTs 663360
Moreover, the board has PCIe Gen3 with x16 lane, allowing either bifurcated access by two x8
lanes or non-bifurcated access by a single x8 lane [19].
The board contains dual QSPI flash memory, with a total size of 1 Gb, to store user code in
non-volatile memory [19]. Additionally, since it is a dual-memory system, the FPGA can load
the program faster. Therefore, instead of programming the board with a bitstream file, the
flash memories are programmed with .mcs files through a JTAG cable.
4.1.2 Tools and Design Guidelines
The implementation of the system architecture was done with the Xilinx Vivado tool, version
2018.3. The implemented system architecture runs on the KCU1500 board connected to a server
running Ubuntu 20.04. The version of the XDMA drivers used was 2020.1.8 [6].
Additionally, we make use of Xilinx-developed IP cores: the XDMA core, ICAP, and FIFOs. The
crossbar interconnection and the other hardware components (the AXI-to-WB and WB-to-AXI
modules, the register file, the bus interfaces, and the computation modules in the overall
system architecture) are developed in VHDL. Those components are made available in the user
repository of Vivado by custom IP packaging.
The suggested system architecture contains the ICAP module to enable the partial
reconfiguration feature of the FPGA device; however, it is not used in the implementation.
Instead, the developed crossbar interconnection, bus interfaces, and their features are
tested using statically allocated modules. Enabling the reconfiguration feature of the FPGA
and reprogramming it dynamically through the PCIe link and the ICAP module, as suggested in
the solution, has been done by [5].
4.2. Solution Description and its Advantages
The main idea of the proposed solution is to divide an application's request into small
computational modules and map those modules onto small-sized PR regions. These modules then
communicate with one another to exchange computation results using the proposed
communication technique. Ultimately, since an application is divided into small
computational modules and the PR regions are small-sized, the underutilization of resources
is decreased. On the other hand, allocating extra PR regions when needed, and enabling them
to communicate, can increase the number of resources allocated to the application.
Nevertheless, PR regions are originally isolated from each other; therefore, the challenge
is how to make them communicate.
The proposed method for providing communication among reconfigurable regions is based on
the WISHBONE Crossbar Switch Interconnection. Figure 4.1 illustrates the high-level view of
the interconnection considering a four-by-four crossbar architecture. Since only a master
interface can initiate a request, it is proposed that all modules, which are reconfigurable regions,
have both master and slave interfaces, thus preventing any communication limitations. The
crossbar switch architecture, its arbitration logic, and the communication flow are
explained in more detail in the upcoming sections.
Figure 4.1. High-Level View of the Proposed Communication Technique
Using a crossbar architecture has the below-mentioned benefits:
• A crossbar switch interconnection is more flexible and scalable than a shared bus
architecture, though less so than a NoC. Both scalability and flexibility come with an area
overhead; the more flexible or scalable, the more area is used. A crossbar interconnection
uses more area than a shared bus [7][8][20] but less than a NoC. In other words, it sits in
the middle of the area versus flexibility/scalability tradeoff.
• A crossbar interconnection's main area usage comes from its arbitration logic [20]. The
arbiter of a crossbar interconnection must observe multiple channels, thus requiring more
complex logic. Fortunately, it is possible to decrease the area usage of the crossbar by
optimizing its arbitration logic [21]; we use the method suggested in [21] plus further
design optimizations to implement an area-friendly arbiter.
• A crossbar interconnection can support multiple transmissions in parallel while not
having the overhead problem to handle physical channels as in NoC.
• A crossbar interconnection, in contrast to a NoC, has no network protocol overhead once
the communication channel is established. In a crossbar architecture, when the physical
communication channel is enabled between modules, all data can be transmitted over this
channel. In a NoC architecture, however, different packets of the same data follow different
paths and must be reassembled in the correct order at the destination.
• A crossbar interconnection can demonstrate good performance in different types of
scenarios. According to [22], the performance of a communication architecture depends on
the communication pattern. The pattern can be pipelined, in which modules send data to the
next one sequentially, or global, in which each module shares data with any of the other
modules. A crossbar architecture shows adequate latency and throughput results in both
scenarios.
Additionally, the main reason for using the WISHBONE bus interface is to increase the
reusability of the crossbar architecture. In other words, the goal is to make different IP
cores easily adaptable to the interconnection through a standard bus interface. Combining
the advantages of the crossbar and the WISHBONE interface, and adjusting and modifying their
features to support communication among PR regions, our proposed solution:
• enables dynamically increasing/decreasing FPGA PR region resources allocated to
the application.
• provides an easy way of managing the communication isolation for different user
requests.
• provides a simpler way of handling dynamic bandwidth allocation inside the FPGA
device.
• has a low area usage due to the optimized way of implementation.
• has low communication protocol overhead.
• is flexible to be used with different hardware modules due to the usage of the
standard bus interface.
4.3. System Design Architecture
This section describes how the proposed solution will be used as a part of the overall system
architecture with PR regions. Figure 4.2 displays the system architecture that will be
implemented. The purposes of other essential components follow:
FPGA Elastic Resource Manager
User requests are sent to the FPGA Elastic Resource Manager, which keeps track of the PR
regions. It knows which PR regions are available and which PR regions a specific user's
application is using. Therefore, when it receives a request, it analyzes the request in
terms of the PR regions required to handle it and then programs the FPGA accordingly.
Furthermore, it also sends user data to the PR regions, provides configuration information,
and retrieves status information from the register file. The FPGA Elastic Resource Manager
achieves these tasks using the XDMA Driver, the PCIe link, and the XDMA IP Core. The
workflow of the resource manager in different scenarios is explained in the Communication
Flow section and illustrated in Figure 4.3.
The XDMA IP Core
The AXI-ST interface of the XDMA IP core is used because it simplifies access to the ICAP
module. Since AXI-ST makes each channel of the XDMA individually usable, it is possible to
dedicate a separate link to the ICAP module. On the other hand, the AXI-Lite bypass link of
the XDMA IP is used to access the register file.
ICAP
The ICAP module performs the programming of the PR regions. To decrease the reconfiguration
overhead, in other words, to enable a fast reconfiguration process, the ICAP module is
connected through a dedicated link, H2C-3, from the XDMA IP core. Additionally, the ICAP
module operates at a clock frequency of 125 MHz, while the rest of the system runs at
250 MHz. Therefore, to prevent loss of information due to the different clock speeds, a FIFO
buffer is added before the ICAP module.
Register File
The register file plays an important role in providing configuration data and storing
necessary status information. Firstly, when the ICAP reconfigures PR regions, it stores
status data in the register file indicating whether the reconfiguration process succeeded
or failed. Secondly, the WB crossbar switch and the PR regions are served by the register
file, too. For the crossbar, the purpose of the register file is to provide applications
with their requested bandwidth and to enable communication isolation. The crossbar operates
with a weighted round-robin technique; thus, it needs to know how many packets each module
may send to another. Furthermore, the crossbar must know which modules are allowed to
communicate with each other. Each computation module, on the other hand, must know its
destination address. This kind of information is stored in the register file as
configuration data. Moreover, when a computation module's attempt to communicate fails,
either due to a wrong destination
address or a timeout due to an unresponsive destination, those error codes are registered
in the register file as status data.
Figure 4.2. Overall System Design Architecture View
Finally, before PR regions are reprogrammed, they should be reset
in order to prevent them from making any possible communication with the other modules
during the reconfiguration process. These reset signals are provided through the register file,
too.
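The weighted round-robin idea can be sketched as follows: the register file supplies each
requester a packet budget per arbitration round, and the arbiter grants requesters in turn
until the budgets are spent (a purely illustrative software model; the actual arbiter is
hardware, and the budgets shown are hypothetical register-file values):

```python
def weighted_round_robin(requests, weights):
    """Return the grant order for one arbitration round.

    requests: requester IDs that currently assert a request.
    weights:  packet budget per requester, as configured in the
              register file (hypothetical values).
    """
    budget = {r: weights[r] for r in requests}
    order = []
    while any(budget.values()):
        for r in requests:          # cycle through requesters in turn
            if budget[r] > 0:
                order.append(r)     # grant one packet transfer
                budget[r] -= 1
    return order

# Module 0 may send two packets per round, module 1 only one.
print(weighted_round_robin([0, 1], {0: 2, 1: 1}))  # [0, 1, 0]
```

Writing different budgets into the register file thus changes the bandwidth share of each
module without touching the arbiter itself, which is the point of keeping this data in
configuration registers.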
User-space access to the register file, either to write configuration data or to read status data,
is done through the AXI-Lite interface of the XDMA IP core. This interface was chosen
because it is a bypass interface, which enables simple, direct access to the register file.
Moreover, it prevents interference between register-file data and other user data used for
computations, since they travel through different channels. The contents and addresses of the
register file are given in Table A.1 in Appendix A.
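In software terms, the register file can be viewed as a small model like the following. This is an illustrative sketch only; the field names and grouping are made up here, and the actual contents and addresses are those of Table A.1 in Appendix A.

```python
class RegisterFileModel:
    """Illustrative model of the register file's configuration and
    status contents for three PR regions (real offsets: Table A.1)."""

    def __init__(self, num_regions=3):
        self.config = {
            "allowed_mask": [0] * num_regions,  # one-hot mask of reachable slaves
            "dest_addr":    [0] * num_regions,  # destination of each module
            "max_packages": [0] * num_regions,  # weighted round-robin weight
            "reset":        [0] * num_regions,  # local reset per PR region
        }
        self.status = {
            "icap_done":  0,                    # 1 = last reconfiguration succeeded
            "comm_error": [0] * num_regions,    # error code reported per module
        }

rf = RegisterFileModel()
rf.config["max_packages"][0] = 8  # master 0 may send 8 packages per grant
rf.config["reset"][1] = 1         # isolate PR region 1 before ICAP reprograms it
```

The split into `config` and `status` mirrors the two roles described above: the host writes configuration data (weights, destinations, resets) and reads back status data (ICAP result, communication error codes).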
AXI-to-WB and WB-to-AXI
Finally, in order to deliver user data for computations, one of the crossbar's ports is used
together with the AXI-to-WB and WB-to-AXI modules. First, user data is stored in FIFO
buffers together with its application ID through the H2C-0 to H2C-2 channels; each channel
has a dedicated FIFO. The allocation of channels is not static, meaning that any application's
data can be sent through any H2C channel and, similarly, read back through any C2H channel.
The AXI-to-WB module serves each FIFO periodically. Since the FIFO channels have AXI
interfaces, the WISHBONE master interface implemented here also has an AXI interface;
hence the name AXI-to-WB. The same principle holds for the WISHBONE slave interface
side, too. When it receives computation results from the modules, it sends the results to one of
the C2H channels through the AXI-Streaming interface, accessing each channel periodically,
one by one. This module is referred to as WB-to-AXI in Figure 4.2.
In the AXI-to-WB module, the WISHBONE master delivers user data to the destined PR region
based on the application ID; it learns the destination module of each application from the
register file. The reason for using the application ID instead of a direct destination address is
to prevent applications from accessing invalid locations. To clarify, the crossbar port that
delivers and reads back user data is allowed to access every PR region, since it must provide
the computation data of all user applications and read back their results. Thus, an application
could attempt to access an invalid location by providing a wrong destination address. By using
the application ID instead, the WISHBONE master directs the data to a valid destination with
the help of the register file. It should be noted that, since the implemented crossbar has four
ports and one of them is dedicated to the PCIe link, the number of PR regions here is three.
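The application-ID lookup can be sketched as a small function. The function name and the mapping values below are illustrative; `dest_table` stands in for the register-file entries that map each application to its target module.

```python
def route_user_data(app_id, dest_table):
    """Resolve the destination PR region for an application ID.

    dest_table models the register-file entries mapping each
    application to its target module; the values are illustrative.
    A direct (possibly wrong) destination address from the user is
    never consulted, so data can only reach a valid location."""
    if app_id not in dest_table:
        raise ValueError("unknown application ID")
    return dest_table[app_id]

# App 0's first module sits behind crossbar port 1, app 1's behind port 3.
dest_table = {0: 1, 1: 3}
assert route_user_data(0, dest_table) == 1
assert route_user_data(1, dest_table) == 3
```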
Clock System
The ICAP module operates at a 125 MHz frequency, while the rest of the modules operate at
250 MHz. The 100 MHz PCIe reference clock is buffered and fed into the XDMA IP core,
which then generates the 250 MHz clock connected to the rest of the modules. Since this
implementation uses statically allocated modules rather than PR modules, this method is
applicable here.
Reset System
All computation modules, as well as the crossbar interconnection, have both local and global
resets. The global reset resets the whole system architecture at once; to do so, the asynchronous
reset signal of the XDMA IP core is buffered and connected to the modules. The local reset
signals, on the other hand, are provided by the register file. Each local reset signal is sent to a
computation module and to the crossbar port to which that module is connected, enabling a
specific resource and its configuration to be reset. Local resets are also useful when the ICAP
performs reconfiguration: the module can be isolated from the rest of the system, and the
crossbar port is prevented from making any grant decisions.
Communication Flow
Figure 4.3 shows a flow diagram of the whole process of reconfiguration and of providing
elastic resources to the application. To sum up, a user sends a request to the FPGA Elastic
Resource Manager. The manager allocates the available PR regions to the application's
computation modules through the ICAP. If there are not enough PR regions to host all modules,
part of them runs on the server (referred to as on-server modules from now on), meaning they
are not accelerated on the FPGA. Then, the manager provides configuration data to the PR
regions and the crossbar (allowed modules, destination modules, and allowed number of
packages). In this phase, the last placed module's destination address is set back to the server,
so that the server receives its results and continues the computation. Afterward, the manager
sends user data to start the computation process. When an on-server module finishes its
computation, the FPGA manager checks again whether any PR regions have been released so
that it can run the on-server module on the FPGA as well. If so, it reprograms the available PR
region with the on-server module and updates the other modules' destination addresses so that
they communicate with the newly available module. Thus, resource elasticity is achieved; the
resources allocated to the user are increased through the communication interconnection
among PR regions.
Figure 4.3. PR Reconfiguration and Communication Flow
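The allocation and destination-chaining step above can be sketched in software. This is a behavioral sketch only: the function name, data shapes, and module names are illustrative, not the manager's real API.

```python
SERVER = "server"

def allocate(modules, free_regions):
    """Sketch of the elastic allocation step: place as many modules as
    fit into free PR regions, keep the rest on the server, and chain
    destinations so that each placed module forwards to the next one's
    region and the last placed module sends results back to the server."""
    placement = dict(zip(modules, free_regions))
    on_server = [m for m in modules if m not in placement]
    placed = [m for m in modules if m in placement]
    dest = {}
    for i, m in enumerate(placed):
        if i + 1 < len(placed):
            dest[m] = placement[placed[i + 1]]  # forward to the next module
        else:
            dest[m] = SERVER                    # last module reports back
    return placement, dest, on_server

# Three chained modules but only two free PR regions: "enc" stays on the server.
placement, dest, on_server = allocate(["dec", "mul", "enc"], [0, 1])
assert dest["dec"] == 1        # "dec" forwards to "mul"'s region
assert dest["mul"] == SERVER   # last placed module reports back to the server
assert on_server == ["enc"]
```

When a PR region is later released, rerunning the same chaining over the now-placed modules reproduces the destination update described in the text.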
4.4. Crossbar Switch Architecture
This section explains how the proposed crossbar architecture has been designed and
implemented. The novelty of this section lies in designing a crossbar that provides
communication isolation and dynamic bandwidth allocation, and in applying the suggested
optimized way of designing a round-robin arbiter to implement a weighted round-robin
scheme.
As can be observed from Figure 4.1, each crossbar port consists of two parts, colored yellow
and purple; these are called the master and slave ports, respectively. Detailed block diagrams
of these ports are displayed in Figure 4.4. Please note that all implementation details are
explained and illustrated assuming a four-by-four crossbar interconnection.
Figure 4.4. Block Diagram of the Proposed Crossbar Switch Interconnection
Firstly, a master port consists of multiplexers and an input port. The input port handles the
master's requests, while the multiplexers act as switches between the master and the slaves.
The input port receives a communication request from the master interface together with the
destination slave's address. If the destination address is invalid, it prevents the communication
and returns an error signal. Otherwise, it directs the request to the slave port and waits for a
grant. If a grant is given, it connects the target slave's data lines (DAT_I and STALL_I in this
case) to the master interface through the multiplexers.
A slave port, on the other hand, is responsible for giving grants based on requests coming
from master ports. It also keeps track of the number of packages exchanged between the slave
and a master. Additionally, it sends grant signals back to a master and enables the slave to
initiate communication. Finally, it connects the granted master's data signals (DAT_I, SEL_I,
STB_I, and WR_I in this case) to the slave interface through multiplexers. To achieve these
tasks, a slave port consists of an arbitration logic and an output port module, together called
the slave side, plus multiplexers. This also implies that the arbitration logic in this crossbar
architecture is decentralized: each slave has its own arbiter to serve the masters. The benefits
of this design choice and each component are explained in detail in the upcoming sections.
4.4.1 Slave Port
A slave port consists of multiplexers, which act as the crossbar's switches, and the slave side
unit, which controls these multiplexers in addition to handling incoming master requests and
slave acknowledgment signals. Figure 4.5 illustrates the general view of a slave port.
Figure 4.5. Slave Port with the Black Box View of Slave Side
Slave Side
The slave side of a slave port consists of two main components: an output port and the arbiter
logic, whose black-box views are displayed in Figure 4.6. They are interconnected: the output
port provides the output signals from the slave side, while the weighted round-robin logic
deals with master requests.

Firstly, the arbiter logic, in other words the weighted round-robin, makes grant decisions and
passes them to the output port. Based on the decision, the output port enables the slave to
communicate and handles the acknowledgment signals coming from it. Moreover, the output
port informs a master that it has been granted access while also enabling the multiplexers,
thus providing the physical connection.
Figure 4.6. The Slave Side Unit with a Black Box View of the WRR and Output Port
Arbitration Logic – Weighted Round-Robin
The arbitration logic plays an important role in the performance of the crossbar
interconnection since it decides the bandwidth allocation for masters. Different accelerators
have different bandwidth requirements depending on their applications: some applications
perform better when each master has equal bandwidth grants, while others require specific
masters to have higher bandwidth grants since they provide or access data more frequently.

FPGA sharing is intended to host diverse applications from different users, so it becomes
important to ensure that the system meets the requirements of different applications.
Consequently, a weighted round-robin arbiter was chosen as the arbitration logic for this
purpose.
A weighted round-robin makes it possible to ensure fair bandwidth allocation as well as an
unfair one. Note that in this implementation, the weighted round-robin tracks the number of
packages rather than a time period. To achieve this, it uses a package counter; when the
maximum number of packages is reached, it switches to the next master and starts counting
its packages. Similarly, if a master sends fewer packages than allowed, or none at all, the
grant mechanism switches to the next master without wasting a cycle; this is covered in more
detail in the upcoming sections.
The number of packages each master may send is stored in the register file, which serves the
weighted round-robin counter; in other words, the counter looks up those registers to learn
the maximum number. Moreover, those registers are user-configurable: when a reconfigurable
region is programmed, its allowed package number is stored in the configuration register.
Therefore, it is possible to assign different amounts to the masters, or the same amount, in
which case the weighted round-robin operates as a plain round-robin arbiter.
The arbitration logic is decentralized, meaning that each slave interface has its own arbiter;
this approach provides several benefits. First of all, the arbitration logic becomes simpler:
multiple communications can happen simultaneously between different slaves and masters in
a crossbar interconnection, and one large, centralized arbiter would require complex logic to
observe and handle all those parallel communications. Having small, decentralized arbiters
that each handle the requests of one slave rather than all slaves decreases the design
complexity. Furthermore, this also enables managing multicast data transmission in an
uncomplicated way: since each slave has a dedicated arbiter, input ports can issue different
requests to different destination slaves, and each arbiter then handles its own requests. This is
explained in detail in section 4.6, Extending WISHBONE Crossbar Interconnection to Support
Multicast Communication.
The area usage of the arbitration logic is considered as well. The conventional implementation
of a round-robin arbiter uses priority encoders. Nevertheless, according to [21], using Leading
Zero Counters (LZCs) effectively decreases the area usage of the arbitration logic. The reason
is that priority encoders report the ID of the granted master as a one-hot-coded address, which
requires extra encoder logic to convert it to a binary address. LZCs, by contrast, report the
output directly as a binary address, eliminating the extra encoder logic. Additionally, as
reported by the same paper, LZCs enable an arbiter to operate at higher frequencies. Please
refer to [23] for the implementation details of LZCs.
Altogether, the proposed logic schematic of the weighted round-robin arbiter is displayed in
Figure 4.7. This block diagram is developed considering four masters.
As seen in Figures 4.6 and 4.7, the arbiter logic has three output signals sent to the output
port: giving_grant_dec, grant_bits, and granted. Firstly, grant_bits indicates the ID
of the granted master, while the granted signal represents the validity of the grant. On the
other hand, giving_grant_dec informs the output port that the arbiter logic is in the
decision-making phase; this prevents the output port from sending a grant bit to a master
while the registers still hold the values of the previous grant.
The grant decision process relies on masked and unmasked request bits and, thus, masked and
unmasked grants. Firstly, the Thermo Logic block generates a mask vector based on the
previous grant decision, and the request bits are masked with it. Then, LZC0 calculates the
masked grant while LZC1 gives its decision based on the unprocessed request bits. If there is
a masked grant, it is used as the final decision; otherwise, the unmasked grant decision is
taken. Inputs to the LZCs are provided in reversed form.
Figure 4.7. Block Diagram of Implemented Weighted Round Robin Arbiter
Furthermore, after the grant decision is made, the allowed package amount is loaded into a
down counter. This counter is enabled only when it receives an acknowledgment signal from
the slave through the output port. An acknowledgment signal indicates that a master has
written or read one package to or from the slave; thus, each time the arbitration logic receives
this signal, it decreases the remaining package amount. Eventually, when the counter reaches
zero, or the master currently holding the bus withdraws its request, the control FSM enables
the Thermo Logic block and the LZCs to make a new decision. The FSM's states are shown
in Figure 4.8. For the flowchart of the whole process, please refer to Figure 4.9.
Figure 4.8. Control Logic FSM of the Weighted Round Robin Arbiter
Consider the following example where master 0 and master 2 make requests.
1. The FSM switches from “IDLE” to the “GRANTING” state, enabling the LZCs.
2. The output of the thermo logic is initially zero; thus, the masked request and the input
of LZC0 are zero. On the other hand, the input of LZC1 is “1010” (the reversed form
of “0101”).
3. LZC1 gives the grant to master 0 since there are zero leading zeros in the provided
input; thus, the grant bits become “00” and are stored in DFFs.
4. The grant bits select an input load value for the counter through a 4-to-1 mux, and the
counter is started. Moreover, the slave is enabled by the output port (explained in the
next section).
Figure 4.9. Flow Chart of Master Grant Process
5. As the counter receives “ACK” signals from the slave, it decreases its value; when it
reaches zero, it signals the FSM to switch to the “GRANTING” phase again to make a
new grant decision. Alternatively, the current master can withdraw its request, again
making the FSM switch to the “GRANTING” phase.
6. This time, since the thermo logic has the input “01”, which is an increment of the
previous grant “00”, it generates the mask vector “1110”. The thermo logic generally
generates vectors whose LSBs are zeros and whose MSBs are ones; the input decides
the number of zeros.
7. LZC0’s input becomes “0010”, and it outputs “2” in this case, which means master 2
is granted. Meanwhile, LZC1 again decides on master 0; however, since LZC0 has a
valid grant this time, its decision is taken instead of LZC1’s. Thus, master 0 is not
granted again, although it still requests the slave. Furthermore, no cycle is wasted
checking master 1 since it has no request.
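The grant decision illustrated in the walkthrough above can be sketched in software. This is a behavioral sketch only; the four-master width and the function names are illustrative, and the bit conventions follow the figures.

```python
N = 4  # number of masters, as in the four-by-four crossbar

def lowest_set_bit(req):
    """Models an LZC applied to the bit-reversed request vector: the
    leading-zero count of the reversed bits equals the index of the
    lowest-numbered requesting master (None if no request)."""
    for i in range(N):
        if req & (1 << i):
            return i
    return None

def thermo(n):
    """Thermo Logic: a mask whose n low bits are zero and whose upper
    bits are one, e.g. thermo(1) == 0b1110 for N == 4."""
    return ((1 << N) - 1) & ~((1 << n) - 1)

def wrr_grant(requests, prev_grant):
    """One grant decision: the masked grant (LZC0) wins if valid;
    otherwise the unmasked grant (LZC1) is taken."""
    mask = 0 if prev_grant is None else thermo(prev_grant + 1)
    g = lowest_set_bit(requests & mask)   # LZC0 path (masked requests)
    if g is None:
        g = lowest_set_bit(requests)      # LZC1 path (raw requests)
    return g

# Walkthrough from the text: masters 0 and 2 request ("0101").
assert wrr_grant(0b0101, prev_grant=None) == 0  # first grant: master 0
assert wrr_grant(0b0101, prev_grant=0) == 2     # next grant: master 2, master 1 skipped
```

The down counter that loads the allowed package amount and the request-withdrawal condition are omitted here; they only decide when `wrr_grant` is invoked again.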
Output Port
The main components of an output port are a demultiplexer and logic gates, as shown in
Figure 4.10. Firstly, the demultiplexer directs the grant signal to the granted master based on
the select signal. The demultiplexer is disabled while the arbiter is in the “GRANTING” phase,
as indicated by giving_grant_dec. Additionally, it considers the request bits as an enable
signal, since a master can withdraw its request.
Secondly, the output port enables the slave via the CYC_I signal, which is the valid grant
signal coming from the arbiter; in other words, when there is a valid grant to any master, the
slave is enabled. Additionally, acknowledgment signals coming from the slave are directed
both to the arbiter, so that it can track the package number, and to the master. Lastly, the
output port directs the grant bits to the multiplexer to select the granted master when the grant
is valid.
Figure 4.10. Block Diagram of Output Port
4.4.2 Master Port
A master port, as displayed in Figure 4.11, includes a switch for the data lines and an input
port that manages a master's requests, grant signals, and acknowledgment signals while also
providing communication isolation. The switch is a 4-to-1 multiplexer that selects among the
incoming slave data lines; the input port provides the multiplexer's select and enable signals.
There could have been separate multiplexers to select among the grant and acknowledgment
signals, too, instead of directing them to the input port. However, since multicast
communication will be supported, handling those signals inside the input port is more
appropriate.
Input Ports
As mentioned above, an input port is in charge of handling a master's requests. First of all, it
checks whether the request is valid, meaning that the master does not try to access a forbidden
slave (communication isolation); only then does the request proceed further. Figure 4.12
shows the implemented block diagram of an input port. The implemented version does not
include a multicasting technique; however, the suggested solution to enable multicasting is
explained in the upcoming section 4.6, Extending WISHBONE Crossbar Interconnection to
Support Multicast Communication.
Figure 4.11. Master Port with Black Box view of Input Port
Figure 4.12. Block Diagram of Input Port
The Communication Isolation
Communication isolation is done with the help of the configuration registers, which provide
an input port with the allowed slaves. The register has high bits at the positions of allowed
slaves and low bits for the non-allowed ones. For instance, if the configuration register holds
the bits “0110”, the master can access slaves 1 and 2. Slave addresses are also sent by a master
in one-hot encoded form; for instance, if it wants to access slave 1, it sends “0010”. This eases
the isolation check: to determine whether an access is valid, the sent slave address and the
allowed addresses are ANDed; if the result is 0, the master has sent an invalid slave address.
In that case, the input port sends an error signal to the master and does not issue any request
to the slave. Using one-hot encoding to address the slaves has further benefits for multicast
communication, too, which is explained in section 4.6, Extending WISHBONE Crossbar
Interconnection to Support Multicast Communication.
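The AND-based check above amounts to a single bitwise operation; a minimal sketch (the function name is illustrative):

```python
def access_valid(allowed_mask, dest_onehot):
    """Input-port isolation check: AND the master's one-hot destination
    address against the allowed-slave bits from the configuration
    register; a zero result means a forbidden slave."""
    return (allowed_mask & dest_onehot) != 0

# Configuration "0110": the master may access slaves 1 and 2.
assert access_valid(0b0110, 0b0010)       # slave 1 -> request proceeds
assert access_valid(0b0110, 0b0100)       # slave 2 -> request proceeds
assert not access_valid(0b0110, 0b1000)   # slave 3 -> ERR, no request issued
```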
This method has several further advantages as well. First, the allowed destination addresses
for each module are kept in the register file. If the validity of a request were checked at the
destination end, then each time a PR region is configured, the registers serving every port of
the crossbar, including ports that do not serve the same application's modules, would have to
be updated, because each slave would need to be aware of the newly arrived module. Since
the check is done in the input port instead, when a new module is introduced, only the registers
of the ports serving this module and the modules belonging to the same application need to be
updated. This reduces the overhead of the reconfiguration process.
Secondly, consider a master that initiates a request to a forbidden destination address, gets a
grant, and only then has the slave interface check whether this master is supposed to
communicate with it. In this scenario, the arbiter logic would waste extra clock cycles making
a grant decision for an invalid master. Additionally, the slave would unnecessarily spend time
providing an error signal for an invalid communication while other valid requests are pending.
Thus, providing this feature at the input end reduces the possible communication traffic at the
slave end.
Evaluating a Request
If the master’s request is valid, the input port forwards this request to the destination slaves.
Once the master is granted access, the input port enables the multiplexer to link the
destination slave’s data line to the master’s data input port. Moreover, it directs
acknowledgment signals of slaves to the master as long as the master has a grant.
4.5. WISHBONE Interfaces and Computation Modules
This section explains how the WISHBONE interfaces and computation modules are designed
and implemented. The contributions of this part include modifying WISHBONE features to
meet the needs of elastic resource management and, additionally, offering a more flexible and
less costly implementation.
The black-box block diagrams of the implemented master and slave interfaces are shown in
Figures 4.13 and 4.14, respectively. For this work, the pipelined mode is chosen due to its
high throughput.
Figure 4.13. The Black Box View of the WISHBONE Master Interface
Figure 4.14. Black Box View of WISHBONE Slave Interface
4.5.1 The Usage of WISHBONE
In the proposed solution, all accelerators have both slave and master interfaces. This prevents
any limitation in communication, giving each accelerator the ability to both initiate and accept
a request. Figure 4.15 illustrates how the slave and master interfaces are connected to a
computation module. It should be noted that these interfaces must be designed specifically to
meet a module's requirements: the WISHBONE datasheet does not provide a standard design
for implementing them, so it is up to the designer to design them based on the modules' and
the system's needs. For instance, in this work, some features of the WB master interfaces, both
in the AXI and computation modules, are adjusted to support the elastic resource management
requirements. This is explained further in the upcoming sections.
Additionally, in this project, the designed computation modules process incoming data and
deliver results to a destination; a master interface initiates only write requests to a slave to
deliver the current computation's results. Therefore, it has only an incoming data port
(DAT_I_acc) from an accelerator (a computation module) to read data and deliver it to a
slave. Conversely, a slave interface has only an outgoing data port (DAT_O_acc) to an
accelerator.
Figure 4.15. A Computation Module having Both Slave and Master Interfaces
Moreover, there are two reset signals: global_reset and reset. The global_reset signal
is the overall reset for the whole system architecture, while reset is the reset signal of the
specific PR region, provided from the register file. This signal isolates a region from the rest
of the system during the configuration process and brings it to the “IDLE” state after the
configuration is complete.
Furthermore, a master interface has ports for a destination slave address and a request
indicator from the computation module. Thus, when a module finishes a computation, it
asserts the request signal high and provides a destination address. The master interface
processes the request and, if it is successful, reads data from the module's registers and
forwards it to the slave. Initially, a module can process eight 32-bit data words at once; this
number can easily be increased or decreased as it is generic. A master can forward one data
word at a time, since the bandwidth of the crossbar is 32 bits. When it is done sending all
data, it informs the module via the buff_empty port. On the other hand, if the request fails
due to an invalid destination address or a timed-out response, it provides the module with an
error code.
Next, a slave interface informs the accelerator that it has new data to deliver via the
slave_buf_full signal. When the module registers this new data, it acknowledges it via the
data_read_by_acc port, and the slave interface then continues to accept new data.
Otherwise, the slave interface waits, stalling the master until the module reads the data. The
general computation-module template for reusing these implemented interfaces is explained
in the next sections.
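The buffer-full handshake described above can be modeled as a toy simulation. Only the signal names follow the text; the class, its methods, and the return values are illustrative simplifications.

```python
class SlaveInterfaceModel:
    """Toy model of the slave-side handshake: the interface holds new
    data, stalling the master, until the accelerator has read it."""

    def __init__(self):
        self.buf = None
        self.slave_buf_full = False

    def write(self, word):
        """A master write: stalled while the accelerator has not read."""
        if self.slave_buf_full:
            return "STALL"
        self.buf = word
        self.slave_buf_full = True
        return "ACK"

    def data_read_by_acc(self):
        """The accelerator reads the word and clears the full flag."""
        self.slave_buf_full = False
        return self.buf

s = SlaveInterfaceModel()
assert s.write(0xAB) == "ACK"
assert s.write(0xCD) == "STALL"   # accelerator has not read yet
assert s.data_read_by_acc() == 0xAB
assert s.write(0xCD) == "ACK"     # buffer freed, new data accepted
```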
The overall communication flow between a computation module and its slave and master
interfaces is shown in Figure 4.16. Please note that the orange dashed lines indicate informing
signals that trigger the appropriate condition to evaluate true.
Figure 4.16. Communication Flow between a Computation Module and its Interfaces
Initially, a Hamming-code encoder, a Hamming-code decoder, and a multiplier have been
implemented as a single application's computation modules. Assuming a user has all three PR
regions available to run all modules on the FPGA, the communication flow among them is as
follows:
1. Encoded user data is delivered to the AXI-WB module through the XDMA IP core
using the PCIe link.
2. The WB master interface initiates a request, which the crossbar handles, to deliver the
data to the Hamming decoder.
3. The Hamming decoder accepts the encoded data via its slave interface, processes it,
and forwards the results to the multiplier via its master interface, initiating a request
through the crossbar interconnection.
4. The same process repeats until the final data reaches the Hamming encoder after
multiplication. The encoder sends the results to the WB-AXI module using its master
interface through the crossbar.
5. The WB-AXI module sends the results to the user over PCIe through the XDMA IP core.
4.5.3 The Design and Implementation of the WB Master Interface
The implementation block diagram of the WB master interface is provided in Figure 4.17. It
consists of the main control unit (the Master FSM), two timeout-detection FSMs, the
memory-select controller, and a multiplexer. Firstly, the multiplexer selects which data word
is transferred to a slave, as there are eight 32-bit data words coming from a module. The
selection is decided by the memory-select controller, which is enabled by the Master FSM.
Secondly, the timeout FSMs, the ACK Timeout FSM and the GNT Timeout FSM, detect
whether an acknowledgment response from a slave or a grant decision from the crossbar takes
too long. Finally, the Master FSM is triggered by the request signal coming from a
computation module and manages the other units to handle the request overall. This includes
processing response signals from the crossbar (GNT_I, ERR_I) or a slave (ACK_I or STALL_I),
enabling the other FSMs, and acting on their results.
Figure 4.17. Implementation Block Diagram of WB Master Interface
Figure 4.18 displays the states of the Master FSM. Firstly, upon receiving a request signal
from the module, it initiates the request and provides the destination address to the crossbar.
Then, it switches to the “WAIT_GRANT” state, enabling the GNT Timeout FSM to start counting
clock cycles. However, if the provided destination address is invalid, it receives an error
signal and goes back to the “IDLE” state, providing the error code back to the module. The
same happens if the destination address is valid but the wait for the grant signal times out.
Otherwise, if the master is granted access to the slave, it switches to the “WR_REQ” state to
provide data words together with their register addresses. Providing the register address is
done by the memory-select controller. As long as the Master FSM is in the “WR_REQ” state, it
keeps this counter enabled, providing the next word every cycle.
Figure 4.18. FSM of WB Master Control Logic
Next, if the slave cannot currently serve the request and hence asserts the STALL signal high,
the master switches to the “STALLED” state. Here it disables the other counters as well and
waits until the signal is deasserted to go back to the “WR_REQ” state. However, if the
destination slave does not respond within a defined period, a timeout error occurs. Otherwise,
when all data is sent successfully, the FSM switches to the “WAIT_ACK” state, where it waits
until it receives acknowledgment signals for all the data it has sent; in this case, it must receive
eight acknowledgment signals since there are eight data words. The WISHBONE datasheet
does not specify what should be done in the case of error signals; thus, it is up to the designer.
In our work, if the master does not receive enough acknowledgment signals within the
provided time limit, the error code, which acts as an internal ERR signal, is forwarded to the
module, and the request is completed by switching to the “REQ_COMP” state. Otherwise, if the
slave sends all acknowledgment signals, the request is completed with no error code. In both
cases, the master interface informs the computation module by pulling the
out_buff_emptied signal high. Note that different error codes are assigned to different
types of errors:
• Code “00” – no error; the request is successful
• Code “01” – the response from a slave timed out
• Code “10” – an invalid destination slave address
• Code “11” – the grant response from the crossbar timed out
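For reference, the error codes can be captured as a small lookup; the constant names below are illustrative, only the two-bit codes come from the text.

```python
# Error codes returned by the WB master interface to a computation module.
ERR_NONE, ERR_ACK_TIMEOUT, ERR_BAD_DEST, ERR_GNT_TIMEOUT = 0b00, 0b01, 0b10, 0b11

ERROR_TEXT = {
    ERR_NONE:        "no error; the request is successful",
    ERR_ACK_TIMEOUT: "the response from a slave timed out",
    ERR_BAD_DEST:    "invalid destination slave address",
    ERR_GNT_TIMEOUT: "the grant response from the crossbar timed out",
}

assert ERROR_TEXT[0b10] == "invalid destination slave address"
```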
The states of the GNT and ACK Timeout FSMs are shown in Figure 4.19. The figure
illustrates the acknowledgment-signal case in particular; however, the same logic applies to
the grant signal. Providing a watchdog timer for unresponsive signals is recommended by the
WISHBONE datasheet; however, the datasheet recommends generating it from the
interconnection, in this case the crossbar. To keep the crossbar's design simple and efficient,
leaving more resources for PR regions, this feature is instead implemented in the master
interface. The designer can then decide whether this feature is necessary for a specific
computation module.
Figure 4.19. FSM of Timeout Logic
Additionally, an external ERR signal, referred to as the ERR_I signal, is used as well.
According to the WISHBONE datasheet, the ERR_I signal is usually provided by a slave
interface, except for timeout errors, which should be provided by the interconnect [7]. Here,
however, it is used as the crossbar's response to an invalid destination address. As explained
in section 4.4.2, Master Port, the crossbar's input port knows which destinations a master can
access and generates an error in the case of a disallowed slave. This signal returns to the
master interface as ERR_I.
In the AXI-WB module, the WB master interface follows the same overall design and logic;
however, it has some modifications to improve transfer latency and to ensure communication
isolation on the PCIe port of the crossbar. The first modification, providing communication
isolation on the PCIe port, is fetching the target slave address from the register file. As
mentioned in the 4.3 Overall System Design Architecture section, user data is identified by the
application ID, and the register file stores the target PR region of each application. Hence, the
WB master interface gets the slave end's address from the register file and directs the data to
the correct destination.
Secondly, the AXI interface side of a module receives one 32-bit data word each cycle from
FIFOs. For example, in the scenario of having eight data words, it would take eight clock
cycles to receive complete user data. When a master delivers data to the slave end, it also
delivers one data word for each clock cycle. The grant latency, the number of clock cycles
that it takes for a master to get a grant from the time it initiates a request, is three clock cycles
in the best-case scenario. Eventually, to deliver user data from a FIFO buffer to a computation
module, it would take 19 clock cycles. Therefore, instead of waiting until the AXI-side buffer
becomes full before sending a request, the master initiates a request when the buffer is half
full. By the time the AXI side has received all the data, the WB master interface has already
obtained a grant and started to send the first data word. Overlapping the three clock cycles of
grant latency and the one clock cycle of sending the first data word with the second half of the
AXI-side receive decreases the overall latency to 15 clock cycles.
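The latency arithmetic above can be checked with a small model (a sketch under the stated assumptions: eight data words, one word per clock cycle on each side, and a best-case grant latency of three cycles):

```python
WORDS = 8          # data words per user data set
GRANT_LATENCY = 3  # best-case cycles from request to grant

# Naive scheme: request only once the AXI-side buffer is full.
receive_done = WORDS                                  # cycle 8
naive_total = receive_done + GRANT_LATENCY + WORDS    # 8 + 3 + 8 = 19

# Optimized scheme: request at half full, overlapping the grant latency
# and the first data word with the second half of the AXI-side receive.
request_at = WORDS // 2                               # cycle 4
first_word_sent = request_at + GRANT_LATENCY + 1      # cycle 8
optimized_total = max(receive_done, first_word_sent) + (WORDS - 1)  # 15

print(naive_total, optimized_total)  # 19 15
```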
4.5.4 The Design and Implementation of the Slave Interface
The slave interface mainly consists of a few logical gates, the address decoder, registers, and
the Slave FSM control logic, as is illustrated in Figure 4.20.
Figure 4.20. The Implementation Block Diagram of the WB Slave Interface
First of all, registers store the incoming data while the decoder sends the enable signal to the
requested register. Since the implemented computation modules process and transfer eight
32-bit data words, there exist eight registers. It should be noted that the last bit of each data
word is set high to indicate valid data. The Hamming encoders/decoders use only 31 bits
(26 bits of original data and five parity bits), leaving the last bit unused. Hence, this last bit is
used to indicate that the data is valid when it is high. Consequently, if each register's last data
bit is high, all registers are full; thus, the slave_buffer_full signal is driven high,
informing the module.
Alternatively, a master might have fewer than eight data words to deliver. In this case, it is
checked whether at least one register's last bit is set while CYC_O is negated: if any register
contains data but the request signal is low, the master has delivered all of its data.
Figure 4.21 displays the FSM states of the slave control logic.
Figure 4.21. The FSM of the WB Slave Control Logic
Initially, it is in the “IDLE” state and is triggered by the request signals coming from a master.
In the case of a valid request, if a slave’s registers currently do not contain any unread data,
then it switches to the “WR_REQ” state. In this state, the registers are enabled to store incoming
data, and acknowledgment signals are sent back to the master. When the registers become full,
and the master still wants to send data, e.g., a master has two consecutive user data sets to be
delivered, the control logic switches to the “STALL” state. In this state, the slave interface
asserts the STALL_O signal high and the ACK_O signal low, informing the master that it must
wait before sending new data, and disables its registers.
Meanwhile, the slave computation module is informed about the presence of new data by the
slave_buffer_full signal, and the control logic waits on the “STALL” state until it receives
the data_read signal from the module. When the module has registered the new data, it sends
the data_read bit to the slave interface, which causes it to reset its registers and change the
state back to “WR_REQ” to receive new data. A switch to the “STALL” state can also happen
from the “IDLE” state in a similar scenario: when the previously registered data has not yet
been read by the module and a new request arrives.
Additionally, in either state, “WR_REQ” or “STALL”, the state can change back to “IDLE” by
the deassertion of the request signal. This might happen due to three reasons:
1. A master completes its request having no more data to send. This can happen
when a slave is in the “WR_REQ” state.
2. A master has already sent the number of packages allowed by the weighted
round-robin, so the slave port disables its connection to the master through
multiplexers. This can happen when a slave is in the “WR_REQ” state.
3. A master raises a timeout error after waiting too long on a low acknowledgment
signal while the slave is in the “STALL” state.
Originally, the ADR_O line coming from a master should be directed to a slave as the ADR_I
line. However, since a valid destination is guaranteed (the request is routed by the input
port and the register file, as explained in the 4.4.2 Master Port section), there is no need to
direct this signal to a slave and check it there as well. Doing so would add extra logic to the
slave interface and waste an extra clock cycle.
The implementation is the same for the WB-AXI module with slight modifications. In the WB-
AXI module, there are three AXI interfaces targeting each C2H channel, and they are not
statically allocated to user applications. Therefore, the WB Slave interface sends computation
results via a different AXI interface each time. The interface selection is based on a three-bit
shift register with only one bit enabled at a time. The output of the shift register
decides to which channel the slave_buff_full signal is sent and from which channel the
data_read signal is expected. Consequently, each channel is targeted in a round-robin
fashion.
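The channel selection can be sketched as a rotating one-hot value (an illustrative Python model of the three-bit shift register described above):

```python
def rotate_one_hot(sel, width=3):
    """Rotate the one-hot channel selector left by one position."""
    mask = (1 << width) - 1
    return ((sel << 1) | (sel >> (width - 1))) & mask

sel = 0b001            # start with C2H channel 0 selected
visited = []
for _ in range(6):
    visited.append(sel)
    sel = rotate_one_hot(sel)
# visited == [0b001, 0b010, 0b100, 0b001, 0b010, 0b100]: round-robin order
```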
4.5.5 A Computation Module and its Template
Computation modules do the computational operations requested by a user. Currently, we
have implemented three computation modules, the Hamming decoder, the multiplier, and the
Hamming encoder, all able to interact with the implemented WISHBONE interfaces. They
all have the standard input/output ports and implementations; hence, they can use the same
interfaces. This section suggests the standard template for developing such computation
modules to reuse the same interfaces. Please note that depending on application requirements
or the nature of computation modules, the implementation can be different, which means
interfaces should be adapted to operate with it. Figure 4.22 illustrates the black box view of
the suggested computation module template.
Figure 4.22. The Computation Module Black Box Template
The design and implementation of a computation module are shown in Figure 4.23. The main
components are the input and output registers, the error status register, computation logics,
and the control logic. First of all, the input registers, when enabled, read and store data
coming from a slave interface. Secondly, computation logics do arithmetic operations on
input data: decoding, multiplying, or encoding in this case. In this work, the computation is
done in parallel for all data words since there are enough FPGA resources available.
Alternatively, the number of computation logics can be decreased, and each data word
forwarded to them one by one with the help of a multiplexer and the control logic.
Figure 4.23. The Block Diagram of a Computation Module
Furthermore, since the first data word here indicates application ID, it should not be
processed; thus, it is directly forwarded to the output register. Next in order, the output
registers hold output results from the computation logics. On the other hand, the error
registers hold the status of an error signal for the last request provided by the master interface.
Finally, the main control logic handles all communication flow. Figure 4.24 shows the states
of the main control logic for the hamming decoder. These states are the same for all modules.
Figure 4.24. The FSM of a Computation Module
Initially, the FSM is in the “IDLE” state, and the switch to the next state, “REGISTER_DATA”,
happens when it receives the slave_buffer_full signal from the slave interface. In this
state, the input registers are enabled to save incoming data, and the data_read signal is sent
back to the slave interface to signal that the data has been read, so that the slave interface can
register further incoming data. Next in order comes the “DECODE” state, where the output
registers are enabled to store the computation logic's output results. The computation logics are
fully combinational in the implemented modules, so they do not consume an extra clock cycle.
However, if there are fewer computation logics than data words, so that a computation takes
several clock cycles, the FSM should wait in the “DECODE” state until the output data is
ready. Finally, after the output
results are ready, the control logic jumps to the “MAKE_REQ” state. It sends the request signal
to the master interface and presents the output results and a destination address.
Afterward, when a computation module receives the request-completed signal from the master
interface, the control logic enables the error register to save error status (0 for the successful
request), resets output registers, and then jumps to the next state. The next state can be either
“IDLE” or “REGISTER_DATA”, depending on the slave interface’s notification signal. If a slave
interface has new data, it registers new data; otherwise, it becomes idle.
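The state transitions in Figure 4.24 can be summarized with a small transition function (a simplified sketch; the state and signal names follow the text, but the function itself is illustrative):

```python
def next_state(state, slave_buffer_full, output_ready, req_completed):
    """One step of the computation module's control-logic FSM."""
    if state == "IDLE":
        return "REGISTER_DATA" if slave_buffer_full else "IDLE"
    if state == "REGISTER_DATA":
        return "DECODE"                     # data latched, start computing
    if state == "DECODE":
        # fully combinational logic is ready at once; otherwise wait here
        return "MAKE_REQ" if output_ready else "DECODE"
    if state == "MAKE_REQ":
        if not req_completed:
            return "MAKE_REQ"               # wait for the master interface
        # after completion: register again if new data waits, else go idle
        return "REGISTER_DATA" if slave_buffer_full else "IDLE"
    raise ValueError(f"unknown state {state}")
```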
Furthermore, the error status is forwarded as an input to the register file; hence, the FPGA
elastic resource manager can see whether the last request was successful. Although the error
status returned from a master interface is only two bits wide, eight bits are allocated for each
PR region for this purpose in the register file. The remaining bits are reserved for future
implementations of multicasting, to hold the error status of multicast communication as well;
for instance, which slave failed to respond.
4.6. Extending the WISHBONE Crossbar Interconnection to
Support Multicast Communication
This section proposes a solution to support multicast communication in the WISHBONE
Crossbar Interconnection. The novelty consists of improving the WISHBONE master
interface and the crossbar interconnection to support sending data to different slaves
simultaneously while still ensuring communication isolation. The multicast communication
pattern is needed when a computation module sends its results to multiple computation
modules to get a different result from each. For instance, a picture file can be sent
to two computation modules where one module identifies people in the picture, and the other
filters vehicles present there. In this case, instead of generating two different requests for each
module, it is more efficient if data can be sent to both simultaneously.
The main challenges to consider are providing the communication isolation and managing
response signals – GNT, ACK, and STALL coming from different slave sides simultaneously.
Our solution of multicasting has the following benefits:
• This is the first solution for multicast communication using the WISHBONE
Crossbar Interconnection.
• Addressing multiple slaves is done in a much simpler manner, reducing the multicast
overhead.
• The slave interface logic remains unchanged and unrelated slaves are not disturbed,
leading to a less costly slave interface.
• Meanwhile, the communication isolation is still achieved.
Figure 4.25 illustrates the proposed solution for handling multicast communication. Since the
master port of the crossbar generally provides the signals mentioned above and the
communication isolation, the main modifications for multicast purposes should be done there,
too. Please note that the figure illustrates the main design considerations. Enabling data signals
coming from slaves can be done in the same way as shown in Figure 4.11.
Communication Isolation
First of all, communication isolation is provided with the help of the register file. Here,
requesting slaves by bit-addressing simplifies addressing several slaves at once. As
mentioned in the 4.4.2 Master Port section, slaves are addressed by one-hot encoding; the
same logic applies here. However, instead of only one bit, several bits can be high, each bit
representing one slave. For instance, if the ADR_O signal is “0111”, a master wants to send
data to the slaves connected to ports 0, 1, and 2. On the other hand, the register file stores the
allowed slaves for a master. Suppose it is “0110”, meaning the master is allowed to
communicate with slaves 1 and 2. Then, ADR_O and allowed_slaves are ANDed, resulting
in “0110”, and the result is forwarded as request bits; thus, port 0 is safe while the request is
sent only to valid slaves. On the other hand, if all of a master's destinations are invalid, the
valid_access signal evaluates to false, and ERR_I is returned to the master.
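The masking step above reduces to a single AND per bit; the following sketch reproduces the worked example (the function name is illustrative, not taken from the RTL):

```python
def route_request(adr_o, allowed_slaves):
    """Mask the requested slaves with the register file's allowed mask."""
    request_bits = adr_o & allowed_slaves
    valid_access = request_bits != 0   # at least one permitted destination
    return request_bits, valid_access

# The example from the text: slaves 0-2 requested, slaves 1 and 2 allowed.
bits, ok = route_request(0b0111, 0b0110)
assert (bits, ok) == (0b0110, True)   # slave 0 (the PCIe port) is protected

# All destinations invalid: valid_access is false, ERR_I is returned.
_, ok = route_request(0b0001, 0b0110)
assert ok is False
```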
This method has the following advantages. Firstly, the length of the ADR_O line is decreased.
If a master sent the addresses of the slaves separately, using two bits for each (for instance,
“11_10_00” for slaves 3, 2, and 0), it would have to use six bits, and additional decoders and
logic would be needed to convert them to request bits. Alternatively, only two bits of ADR_O
could be used, but the addresses would then be sent sequentially, taking three clock cycles.
With the proposed method, using simple gates, fewer bits, and low latency, the same purpose
is achieved.
Moreover, as mentioned, the slaves use a decentralized arbiter: each slave handles its own
requests. This is beneficial for multicast communication. Since request bits are sent to each
slave separately, the slave ends can still respond individually to a request as a usual write
request, without being aware of the multicast. Additionally, slaves do not need to perform any
extra address or group-ID checking, thus eliminating the overhead of introducing the multicast
by sending group IDs to the slaves, as in [15]. All of this results in simple and area-friendly
slave logic.
Management of Slave Response Signals
Since a master sends a request to different slaves, it also has to receive all grant and
acknowledgment signals. However, it is not guaranteed that a master receives all grants at the
same time or receives all grants at all. For example, for some slaves, it might take a longer time
to give a grant, or some slaves might not respond. Similar issues are valid for acknowledgment
signals, too. The “Grant Handle Logic” and the “ACK Handle Logic” are introduced to solve
these issues. Only the “Grant Handle Logic” implementation details are displayed in the figure;
the same approach is also used for the acknowledgment signals.
Figure 4.25. The Master Port adjusted to support Multicast Communication
To begin with, each grant signal passes through a two-input multiplexer. This forces a high
signal for unrequested slaves, so that when the final grant signal, GNT_I, is generated, it
depends only on the requested slaves. For example, suppose slave 3 is requested while slave 0
is not. Then a multiplexer directs slave 3's grant signal to the final AND gate. On the other
hand, since slave 0 is not requested, the master does not need its response; thus, a high signal
is passed instead of the GNT_0 signal, so only the requested slave's response is evaluated.
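Modeled in software, the multiplexer-and-AND reduction looks as follows (a sketch only; the real logic is combinational hardware inside the master port):

```python
def combined_grant(grants, request_bits, n_slaves=4):
    """Final GNT_I: AND of per-slave grants, with a constant 1
    substituted for every unrequested slave."""
    gnt_i = True
    for i in range(n_slaves):
        if (request_bits >> i) & 1:        # requested: use the real grant
            gnt_i &= bool((grants >> i) & 1)
        # unrequested slaves contribute a forced 1, i.e. change nothing
    return gnt_i
```

With this reduction, `combined_grant(0b1000, 0b1000)` is true even though slave 0's grant is low, because slave 0 was never requested.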
Secondly, not all slaves might grant the request at the same time, and some of them might not
respond at all. In this scenario, a master must either wait for the grant from all slaves or, should
there be unresponsive slaves, send data to the responsive ones while also knowing which slaves
are unresponsive. To achieve this, the any_GNT signal, indicating whether the master has a
grant from at least one slave, is introduced, and modifications are made to the Grant Timeout
FSM inside the master interface. The modified FSM states are shown in
Figure 4.26.
Figure 4.26. The WB Timeout FSM adjusted to Support Multicast Communication
When a master makes a request and does not have any grants, the FSM switches to the
“WAIT_GNT_ANY” state, waiting for a grant from at least one of the slaves. If it does not receive
any grant in this state, it switches to “ALL_GNT_TIMEDOUT”, meaning there has not been a
single grant. In this case, the request would be marked as completed, and an error signal would
be returned to a computation module. On the other hand, if a master receives at least a single
grant, then the FSM switches to the “WAIT_GNT_ALL” state, where it waits for grants from
all slaves, remaining there while the final grant signal is low. If it receives grants from all slaves,
then the next state is “IDLE”, and the Master FSM starts to send its data.
On the other hand, if the waiting time reaches its maximum value in the “WAIT_GNT_ALL”
state, the next state becomes “SOME_GNT_TIMEDOUT”. In this state, a master interface sends
the gnt_time_out signal to the input port, which means it cannot wait any longer for the
grant from the other slaves and will proceed with the responsive slaves only. The
gnt_time_out signal enables registers inside the Grant Handle Logic that latch the negated
incoming grant signals from the slaves. Low grant signals are thus registered as high bits,
indicating which slaves did not
respond to the request. For instance, if slaves 0 and 3 are requested, slave 3 is granted, but slave
0 is not, then the output of the registers inside becomes “0001”. This is directed as
unresponsive slave addresses to a master interface.
Moreover, the outputs of the Grant Handle Logic's registers are negated and ANDed with the
outgoing request bits, canceling the request signals to the unresponsive slaves and setting the
multiplexer select signals for those slaves to 0 accordingly; thus, only the responsive slaves are
proceeded with. Since the request to the timed-out slave is canceled, GNT_O evaluates to true,
causing the timeout FSM to switch to the “IDLE” state. Next in order, the master interface
registers the unresponsive slave IDs and forwards them to the computation module with the
error code so that the status can be registered in the register file. For example, if slave 0 did not
give a grant, a master would forward “0001_11” – “slave0_grant-timeout”. Afterward, a
master interface starts to send data to granted slaves.
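The timeout bookkeeping can be condensed to two bit operations (an illustrative sketch of the register and masking behavior described above; the function name is an assumption):

```python
def handle_grant_timeout(request_bits, grants):
    """On gnt_time_out: latch which requested slaves never granted, and
    keep only the responsive slaves in the outgoing request bits."""
    unresponsive = request_bits & ~grants           # requested, not granted
    remaining_requests = request_bits & ~unresponsive
    return unresponsive, remaining_requests

# The example from the text: slaves 0 and 3 requested, only slave 3 granted.
unresp, req = handle_grant_timeout(0b1001, 0b1000)
assert unresp == 0b0001   # slave 0 timed out
assert req == 0b1000      # proceed with slave 3 only
```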
The same approach applies to ACK Handle Logic; however, there are some additional issues
and assumptions to consider here. First of all, since a master waits for all grant signals before
sending data, it means all destined slaves become available at the same time; thus, all
acknowledgment signals come simultaneously too, causing the ACK_O signal to be evaluated to
true. On the other hand, two scenarios might occur in this case that cause low
acknowledgment signals from some slaves. First, some slaves might have less buffer space
than others, causing STALL and a low acknowledgment signal. If at least one of the slaves
asserts the STALL signal, the master FSM switches to the “STALLED” state as shown in
Figure 4.18 and hence waits without sending data to the other slaves. Meanwhile, the
acknowledgment timeout FSM, which has the same states and logic as in Figure 4.19, starts to
keep track as well. If the slave that has asserted STALL does not respond within the defined
time limit, the ack_time_out signal is sent to the input port, which cancels the request to the
timed-out slave, registers its address in ack_unresponsive_slaves, and again proceeds
with the responsive slaves.
The other scenario where ACK_O might evaluate to false is when a slave is stuck and does not
operate. In this case, the same procedure applies again, and a master proceeds with
responsive slaves. Additionally, if none of the slaves sends an acknowledgment signal, the
request is marked as complete with an error status.
Furthermore, the STALL_I signals pass through two-input multiplexers, so that only the
requested slaves' signals are considered. As mentioned, when even a single slave asserts a high
STALL signal, the master must wait for it. Therefore, the final STALL_O output directed to the
master interface is evaluated with an OR gate.
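The two reductions differ only in the combining gate, which the following sketch makes explicit (an illustrative model; as in the multiplexer stage above, only requested slaves' signals are considered):

```python
def reduce_responses(request_bits, stalls, acks, n_slaves=4):
    """STALL is ORed (any stalled slave pauses the master), while ACK,
    like GNT, is ANDed over the requested slaves only."""
    stall = any((request_bits >> i) & (stalls >> i) & 1
                for i in range(n_slaves))
    ack = all(((acks >> i) & 1) or not ((request_bits >> i) & 1)
              for i in range(n_slaves))
    return stall, ack
```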
Finally, since multicast communication only sends data to multiple slaves, the DAT_I signals
from the slaves, handled as shown in Figure 4.11, are not used; therefore, this handling does
not need to be modified, and this part of the input port can stay the same.
Chapter 5. Design Verification and Results
5.1. Design Verification
All self-developed modules were tested and verified using testbenches and simulation before
being integrated. The final system architecture was tested in simulation, as well as in the
FPGA device.
5.1.1 Verification in the Simulation environment
The simulation environment is provided by Xilinx Vivado Tool 2018.3. The input signals to
hardware modules were randomly generated using functions in testbenches.
The WISHBONE Crossbar
The crossbar's submodules (the WRR, input/output ports, etc.) were tested separately first.
Then they were integrated together and verified in two steps. In the first step, only a single
port was tested; in the second step, all four ports were tested together.
The WB interfaces and computation modules
Firstly, each computation module’s functionality was tested, as well as master and slave
interfaces. Then, the encoder and decoders were connected point-to-point using interfaces.
Their inputs and outputs were compared to check the overall flow between computation
modules and bus interfaces.
The AXI-WB and WB-AXI modules
To verify these modules' functionalities, they were connected in a point-to-point fashion.
Moreover, they were also tested separately with AXI-ST FIFO modules provided by Xilinx.
The Integrated system
The modules were integrated and verified step-by-step. First, WB computation modules were
connected to the crossbar interconnection, and the AXI/WB module sides together with
FIFOs were simulated separately using randomized input signals. Then, AXI/WB modules
and the register file were added to the crossbar system and tested. After the results were
verified, the XDMA IP core was added to the system architecture, and the resulting design
was tested on the KCU1500 board.
5.1.2 Verification in the FPGA device
For verification in the actual device, different scenarios were followed, and the results were
verified by referring to the register file in addition to the input/output files. A couple of scenarios are
provided in the 5.2 Results section to demonstrate the dynamic resource allocation and
communication isolation features.
5.2. Results
In this section, the results of the final system architecture are assessed. This includes
summarizing system features and demonstrating the important ones on the FPGA device.
Additionally, the area usage and the power consumption of the system are presented, as well.
5.2.1 The System Features
The features of the implemented system architecture are summarized below:
1. The configurable four-port WISHBONE Crossbar communication interconnection which:
• enables increasing/decreasing the number of resources allocated to an application
• allows dynamic bandwidth allocation for different applications
• provides the communication isolation
2. The XDMA IP Core with 6 AXI-ST channels to exchange user data
3. Three different statically implemented computation modules (the multiplier, the Hamming
encoder, and the Hamming decoder), together with the WISHBONE master and slave
interfaces
4. The AXI-WB and WB-AXI modules
5. The register file:
• to serve computation modules and the crossbar with configuration data
• to save status data from computations and the AXI-WB and WB-AXI modules.
5.2.2 Demonstration of Some Features
This section briefly displays the important features of the developed system on the KCU1500
board. In all those experiments, the PCIe link is connected to crossbar port 0, the multiplier to
port 1, the encoder to port 2, and the decoder to port 3.
Elastic Resource Allocation and Dynamic Bandwidth
To demonstrate dynamic elastic resource allocation, we ran an experiment showing how
elasticity improves the application's execution time. For this purpose, 16 KB of data was sent to be
processed by the multiplier, the encoder, and the decoder sequentially. Three cases were
considered in this experiment:
Case 1: The multiplier is available on the FPGA, so the multiplication is done on the FPGA,
and the encoding and decoding are done on the CPU.
Case 2: The encoder becomes available; the multiplication and encoding are done on the
FPGA while the decoding process is still on the CPU.
Case 3: The decoder becomes available; hence, it is allocated to serve the application as well,
having all computations running on the FPGA.
For each case, the experiment was repeated ten times, and the average execution time was
calculated. The graph in Figure 5.1 shows how the execution time improves as the user gets
more resources from the FPGA due to resource elasticity. For case 1, the average execution
time is 16.9 ms; as the user gets more resources towards case 3, the average execution time
decreases to 10.87 ms.
Figure 5.1. The Comparison of Execution Time with Different Case Scenarios
Additionally, the experiment was repeated considering two different bandwidth allocation
numbers. Firstly, the above case scenarios were run allocating only 16 packages (one package
being 4 bytes) per computation module each time it gets a grant. Secondly, this number was
increased to 128 packages per computation module. According to the results shown in
Figure 5.1, the execution time improves as the computation modules get more bandwidth.
As the FPGA partial reconfiguration process itself was not performed, the partial
reconfiguration time was not assessed. Nevertheless, reconfiguration times in different
scenarios, including the proposed technique of using the ICAP module through the XDMA IP
core, are reported in [5].
Communication Isolation
To demonstrate the communication isolation, i.e., to display that the computation modules
cannot access resources they are not allowed to, consider the following scenario. The decoder
targets the multiplier as its destination; however, it is not allowed to do so, because the two
modules belong to entirely different applications.
In Figure 5.2, the decoder's allowed slave address is shown in a green box. It can access slave
number 0 only. However, with the first command, the decoder's destination address is
configured to be the slave connected to port 1, which is the multiplier. Then, the data is sent to
the FPGA device to be computed.
Figure 5.2. Communication Isolation Demonstration
After sending data, the error status can be read from the register file, shown in the orange
box. The figure shows that the error status for computation module number 3 has been
changed to “02” (binary “10”), which is the invalid-address error code. At the same time,
when the data is read back by the last command shown in Figure 5.2, an error is thrown since
there is no data to read back: the decoder dropped the data when the error occurred. As a
result, when the module tried to access an invalid address, it was prevented from doing so,
and the error status was registered. Thus, communication isolation was achieved.
5.2.3 Communication Overhead
The communication overhead here refers to the “time-to-grant”: the number of clock cycles
from when a computation module initiates a request to when the master interface starts to
send the first data word. The best-case time-to-grant is four clock cycles, occurring when the
slave is not currently serving any request and the master gets the grant immediately. It takes one clock cycle
for a module’s request to reach the master interface, and the master interface initiates a
request in the next clock cycle. On the other hand, the slave side takes two clock cycles to
grant the request and enable the slave interface. As a result, if a computation module has
eight packages to deliver, then it takes 13 clock cycles in the best case to complete the
request. Here, the last clock cycle is used to register the error status of the transaction.
On the other hand, the worst-case time-to-grant occurs when all three computation modules
target the fourth one simultaneously. The master served last would have to wait for the first
two masters to be served. In this case, the 13th clock cycle of each previous master module
can be ignored, because a master interface releases the bus as soon as it completes sending
its packages; registering the error code happens only on the master side. Consequently, the
last computation module would have to wait 28 clock cycles (12 clock cycles for each
previous master plus four clock cycles of time-to-grant) to send its first data, and 37 clock
cycles to complete its request.
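The clock-cycle arithmetic above can be reproduced directly (a sketch of the best- and worst-case figures, assuming eight packages per request):

```python
TIME_TO_GRANT = 4   # best case: module request + crossbar grant + enable
WORDS = 8           # packages delivered per request
ERROR_CYCLE = 1     # final cycle registers the transaction's error status

best_completion = TIME_TO_GRANT + WORDS + ERROR_CYCLE               # 13
# Worst case: two masters are served first; their error-status cycle
# overlaps bus release, so each occupies the bus for only 12 cycles.
worst_time_to_grant = 2 * (TIME_TO_GRANT + WORDS) + TIME_TO_GRANT   # 28
worst_completion = worst_time_to_grant + WORDS + ERROR_CYCLE        # 37
```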
In summary, the assessment of the communication overhead is listed below. Please note that
these numbers assume a master has eight data words to deliver to the slave:
Table 5-1. The Communication Overhead Summary

Category                                    Clock Cycles
The best-case time-to-grant                 4
The worst-case time-to-grant                28
The best-case request completion latency    13
The worst-case request completion latency   37
5.2.4 Area usage
To assess the area usage of the system architecture and its submodules, the Xilinx Vivado
Tool's utilization report is used after running the implementation. As reported by the results,
the overall LUT usage of the WISHBONE Crossbar together with the computation modules
and bus interfaces is 1278, while the FF usage is 1429. This means they use only 0.19% of the
LUT resources and 0.11% of the FF resources. The WB crossbar interconnection itself uses
only 375 LUTs and 60 FFs, corresponding to 0.06% and 0.004% usage, respectively. On the
other hand, the WB master interfaces' LUT usages are around 0.03%, while the slaves' are
around 0.02%. The detailed list can be found in Table 5-2 in hierarchical form.
In addition, the LUT usage of the whole system architecture is 5.46%, out of which 5.04%
comes from the XDMA IP Core. Correspondingly, the overall FF utilization is 2.75%, of
which the XDMA IP contributes 2.32%. Moreover, the overall system has 4.12% BRAM
usage as well. A detailed list of all hardware components' resource usage is given in Table 5-3.
Table 5-2. The Area Usage of the WISHBONE Crossbar together with three Computation Modules

Hardware component              CLB LUT   CLB LUT usage %   CLB FF   CLB FF usage %
WB Crossbar with 3 WB
computation modules             1278      0.19              1429     0.11
WB Crossbar                     475       0.07              60       0.004
WB Hamming Decoder              432       0.07              646      0.05
  WB Master Interface           213       0.03              27       <0.01
  WB Slave Interface            115       0.02              220      0.02
  Hamming Decoder               104       0.02              399      0.03
WB Hamming Encoder              233       0.04              99       0.01
  WB Master Interface           191       0.03              34       <0.01
  WB Slave Interface            26        <0.01             38       <0.01
  Hamming Encoder               16        <0.01             27       <0.01
WB Multiplier                   138       0.06              624      0.05
  WB Master Interface           184       0.03              27       <0.01
  WB Slave Interface            115       0.02              213      0.03
  Multiplier                    96        0.01              384      0.02
Table 5-3. The Area Usage of all Components

  Hardware component              CLB LUT   CLB LUT %   CLB FF    CLB FF %    BRAM      BRAM %
  1. XDMA IP Core                 33441     5.04        30843     2.32        62        2.87
  2. WB Crossbar with the WB
     computation modules          1278      0.19        1429      0.11        0         0
  3. AXI-WB – FIFO System         975       0.15        1842      0.14        13.5      0.62
  3.1. WB module                  413       0.06        822       0.06        0         0
  3.2. FIFO system                3 x 188   3 x 0.02    3 x 340   3 x ~0.02   3 x 4.5   3 x 0.21
  4. WB-AXI – FIFO System         389       0.06        2274      0.17        13.5      0.62
  4.1. WB-AXI module              254       0.04        1161      0.09        0         0
  4.2. FIFO system                3 x 45    3 x <0.01   3 x 371   3 x ~0.03   3 x 4.5   3 x 0.21
  5. Register File                265       0.04        560       0.04        0         0
  Total                           36348     5.47        36948     2.79        89        4.12
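The utilization percentages in Tables 5-2 and 5-3 can be reproduced from the raw counts, assuming the KCU1500's XCKU115 device totals of 663,360 CLB LUTs, 1,326,720 CLB FFs, and 2,160 BRAM blocks; these device totals are not stated in the text and should be checked against the board documentation.

```python
# Reproduce the utilization percentages from the raw resource counts.
# Device totals below are assumed for the KCU1500's XCKU115 FPGA.
TOTAL_LUT, TOTAL_FF, TOTAL_BRAM = 663_360, 1_326_720, 2_160

def pct(used, total):
    """Utilization as a percentage, rounded as in the tables."""
    return round(used / total * 100, 2)

print(pct(1278, TOTAL_LUT))    # crossbar + computation modules: 0.19 %
print(pct(1429, TOTAL_FF))     # 0.11 %
print(pct(33441, TOTAL_LUT))   # XDMA IP core: 5.04 %
print(pct(30843, TOTAL_FF))    # 2.32 %
print(pct(89, TOTAL_BRAM))     # overall BRAM usage: 4.12 %
```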
5.2.5 Power Consumption
According to the implementation results and the power analysis performed by the Xilinx
Vivado tool, the WB crossbar interconnection consumes 1 mW. Each master interface
likewise consumes 1 mW, while a single slave interface consumes less than 1 mW. The power
consumption of the overall system architecture is 5.03 W, of which 44 % comes from the
GTH transceivers.
The power consumption categories are summarized in the following table:
Table 5-4. The Power Usage by Categories

  Category              Power Consumption (W)
  GTH                   2.215
  Hard IP               0.298
  Dynamic Power         1.139
  Device Static Power   1.379
  Total                 5.031
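As a quick sanity check, the category figures sum to the reported total, and the GTH share matches the 44 % quoted above:

```python
# Sum the power categories from Table 5-4 and check the GTH share.
categories = {"GTH": 2.215, "Hard IP": 0.298,
              "Dynamic Power": 1.139, "Device Static Power": 1.379}
total = round(sum(categories.values()), 3)
gth_share = round(categories["GTH"] / total * 100)
print(total, gth_share)  # 5.031 W in total, GTH share of 44 %
```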
The power consumption of the individual hardware modules is listed below:
Table 5-5. The Power Usage by Hardware Components

  Hardware Component                 Power Consumption (W)
  1. XDMA IP Core                    3.583
  2. WISHBONE Crossbar               0.001
  3. WB Hamming Encoder              0.001
  3.1. Hamming Encoder               <0.001
  3.2. WB Master Interface           0.001
  3.3. WB Slave Interface            <0.001
  4. WB Hamming Decoder              0.005
  4.1. Hamming Decoder               0.003
  4.2. WB Master Interface           0.001
  4.3. WB Slave Interface            <0.001
  5. WB Multiplier                   0.001
  5.1. Multiplier                    <0.001
  5.2. WB Master Interface           0.001
  5.3. WB Slave Interface            <0.001
  6. AXI-WB module – FIFO System     0.028
  6.1. AXI-WB module                 0.006
  6.2. AXI FIFOs                     3 x 0.007
  7. WB-AXI module – FIFO System     0.03
  7.1. WB-AXI module                 0.008
  7.2. AXI FIFOs                     3 x ~0.007
  8. Register File                   0.004
  Total Power Usage                  3.653
Chapter 6. Discussion and Future Work
6.1. Discussion
First and foremost, the resulting 32-bit configurable crossbar interconnection enables
dynamic resource allocation, dynamic bandwidth configuration, and communication
isolation, thus achieving the targeted features. Moreover, it occupies a very small area,
475 LUTs, 60 FFs, and no BRAMs, while consuming 1 mW of power. For comparison, a
single 32-bit router in the NoC architecture of [9] requires between 305 and 495 LUTs.
The resulting crossbar connects four modules; a NoC connecting four modules, i.e., the
2x2 NoC of [9], would occupy 1220 LUTs using four 3-port routers.
Table 6-1. The Comparison of the Resource Usages between the Developed Crossbar and the Previous Art NoC Routers

  Resource Type   4x4 WB Crossbar   2x2 NoC 3-port routers [9]
  LUTs            475               1220
  FFs             60                1240
Table 6-1 shows a side-by-side comparison of our crossbar and the 2x2 NoC of [9], each
connecting four modules. These numbers have been extracted from Figure 8 of the original
paper, considering their 3-port router implementation. Consequently, the resulting WB
crossbar takes 61 % fewer LUTs and 95 % fewer FFs than the equivalent NoC architecture.
This comparison does not include the register file or the master and slave interfaces, since
the numbers in [9] are given for routers without routing tables, bus interfaces, and extra
buffers, which are components additional to the main computation module, as shown in
Figure 3.1.
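The relative savings quoted above can be reproduced directly from Table 6-1; the 1220-LUT NoC figure corresponds to four 3-port routers at the 305-LUT lower bound reported in [9].

```python
# Relative area savings of the 4x4 WB crossbar over the 2x2 NoC of [9].
crossbar_lut, crossbar_ff = 475, 60
noc_lut, noc_ff = 4 * 305, 1240   # four 3-port routers at 305 LUTs each

lut_saving = round((noc_lut - crossbar_lut) / noc_lut * 100)
ff_saving = round((noc_ff - crossbar_ff) / noc_ff * 100)
print(lut_saving, ff_saving)  # 61 % fewer LUTs, 95 % fewer FFs
```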
The power consumption is also compared with [9], following the same method as above with
a 2x2 NoC router connection. Table 6-2 lists the power consumption numbers side-by-side.
Table 6-2. Comparison of Power Consumption between the Crossbar System and the Routers in [9]

  Architecture                 Power Consumption (mW)
  4x4 WB Crossbar              1
  2x2 NoC 3-port routers [9]   80
On the other hand, the register file and the master and slave interfaces should also be
accounted for in our crossbar interconnection, since they are components additional to an
actual computation module. The area usage of the interfaces varies with the computation
module, as shown in Table 5-2, even though all of them share the same implementation;
this is likely due to optimizations applied by the Xilinx Vivado tool. On average, however,
the four master interfaces take 196 LUTs each (117 FFs in total) and the four slave
interfaces 85 LUTs each (628 FFs in total). Theoretically, then, a 4x4 WISHBONE crossbar
interconnection with the register file and the WB master and slave interfaces would take,
on average, 1864 LUTs and 1365 FFs, i.e., 0.28 % and 0.1 % of the KCU1500's LUT and FF
resources. These numbers can be compared to the area usage of the E-WB shared-bus
communication infrastructure provided by [12]. For this comparison, however, the register
file's area usage is subtracted, since the numbers in [12] cover master and slave interfaces
only. Taking their most area-efficient single master-slave communication infrastructure and
multiplying it by four, the following results are obtained:
Table 6-3. The Comparison of Resource Usages between the Crossbar System and the Communication Infrastructures in [12]

  Resource Type   4x4 WB Crossbar Interconnection System   4 Communication Infrastructures in [12]
  LUTs            1599                                     1076
  FFs             796                                      1484
Consequently, our WISHBONE crossbar interconnection occupies 48.6 % more LUTs and
46.4 % fewer FFs than the shared-bus solution presented in [12]. Higher LUT usage was
expected, since crossbars generally occupy more area than shared-bus architectures.
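The accounting behind these comparisons can be sketched as follows, under the interpretation that the averaged FF figures (117 and 628) are totals over the four master and four slave interfaces respectively; the 1599-LUT figure in Table 6-3 is the 1864-LUT total minus the 265-LUT register file.

```python
# Area accounting for the full 4x4 crossbar interconnection system,
# assembled from the averages and Table 5-3 figures quoted in the text.
crossbar = (475, 60)        # (LUTs, FFs)
reg_file = (265, 560)
masters = (4 * 196, 117)    # four masters: avg 196 LUTs each; 117 FFs total
slaves = (4 * 85, 628)      # four slaves:  avg  85 LUTs each; 628 FFs total

total_lut = crossbar[0] + reg_file[0] + masters[0] + slaves[0]
total_ff = crossbar[1] + reg_file[1] + masters[1] + slaves[1]
print(total_lut, total_ff)        # 1864 LUTs, 1365 FFs
print(total_lut - reg_file[0])    # 1599 LUTs without the register file

# Comparison against four shared-bus infrastructures from [12].
lut_overhead = round((1599 - 1076) / 1076 * 100, 1)   # % more LUTs
ff_saving = round((1484 - 796) / 1484 * 100, 1)       # % fewer FFs
print(lut_overhead, ff_saving)    # 48.6 % more LUTs, 46.4 % fewer FFs
```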
Additionally, considering all hardware modules, the overall system architecture uses 5.47 %
of the LUT and 2.79 % of the FF resources, of which 5.04 % (LUTs) comes from the XDMA
IP core. Nevertheless, the suggested architecture is only one example of using and
demonstrating the proposed communication interconnection; in real applications, data can be
delivered to the computation modules by other means.
Regarding the communication overhead results, the best-case time-to-grant is four clock
cycles, the best-case latency is 13 clock cycles, and the worst-case time-to-grant is 28 clock
cycles. While the best-case time-to-grant is fixed, the worst-case time-to-grant and the
latency may increase or decrease with the bandwidth allocation of the other modules. The
current numbers assume that each module has eight packages to send and has uninterrupted
access to the bus once granted.
To compare the communication overhead with existing work: [13] reports that eight clock
cycles are required to process a single command in the best case. Two commands are
exchanged between modules to establish a communication channel, so 16 clock cycles pass
before data can be sent. If we again consider eight data words, the request completion latency
becomes 32 clock cycles, because one final command is sent to the destination to signal the
end of the communication and tear down the channel. In this case, our solution is four times
faster than [13] in terms of time-to-grant and 2.4 times faster in terms of latency.
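This comparison reduces to the following arithmetic: 8 cycles per command, two setup commands, eight data words, and one teardown command; the 32/13 ratio of roughly 2.46 is rounded down to 2.4 above.

```python
# Overhead model for [13]: 8 cycles per command, two commands to set up
# the channel, one cycle per data word, one command to tear it down.
CMD_CYCLES, DATA_WORDS = 8, 8
grant_13 = 2 * CMD_CYCLES                        # 16 cycles before data
latency_13 = grant_13 + DATA_WORDS + CMD_CYCLES  # 32 cycles in total

print(grant_13 // 4)              # time-to-grant speedup vs. our 4 cycles
print(round(latency_13 / 13, 2))  # latency speedup vs. our 13 cycles
```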
On the other hand, some considerations are needed to compare the latency with [9]. According
to the paper, a flit takes two clock cycles to traverse a router; since the architecture is
pipelined, subsequent flits take one clock cycle each [9]. According to [24], a network packet
contains a head flit and a tail flit, which initiate and terminate the transaction respectively,
plus body flits. Sending eight data words, as in our case, would therefore require ten flits.
In the best case, where flits traverse only the source and destination routers, completing the
request would take 22 clock cycles, while in our implementation it takes 13 clock cycles.
6.2. Conclusion and Future Work
6.2.1 Conclusion
In conclusion, the goal of designing a low-area, low-overhead configurable
communication interconnection that enables FPGA resource elasticity is achieved. The
resulting interconnection provides the intended features, making it possible to dynamically
increase or decrease the FPGA PR resources allocated to an application while preserving
communication isolation. Dynamic bandwidth allocation to applications inside the FPGA is
another important feature of the resulting system. As predicted, the resulting crossbar
occupies much less area than the existing NoC solution and slightly more than the shared-bus
one. In addition, a solution for enabling a multicast communication pattern among the PR
regions was suggested for the implemented crossbar interconnection.
6.2.2 Future Work
First of all, due to time constraints, the work could not be integrated with the PR regions and
the ICAP module; testing the system architecture with real PR regions therefore remains
future work. Furthermore, the multicast feature can be added to the system using the
suggested multicast method. The work can also be further improved and integrated with
OpenStack/Kubernetes, at which point the FPGA elastic resource manager can be
implemented as well, as given in the description.
References
[1] W. J. Dally and B. Towles, "Chapter 11. Routing Mechanics," in Principles and Practices of Interconnection
Networks. Elsevier, 2004, p. 203.
[2] "PCI Express," Xilinx, 2021. [Online]. Available: https://www.xilinx.com/products/technology/pci-
express.html#versalpcie. [Accessed: Jul. 22, 2021].
[3] DMA/Bridge Subsystem for PCI Express, Product Guide PG195, v4.1, Xilinx, Apr. 29, 2019. Accessed: Jul. 22,
2021. [Online]. Available:
https://www.xilinx.com/support/documentation/ip_documentation/xdma/v4_1/pg195-pcie-dma.pdf
[4] Xilinx Answer 71435: DMA Subsystem for PCI Express - Driver and IP Debug Guide, Xilinx, 2018.
Accessed: Jul. 22, 2021. [Online]. Available:
https://www.xilinx.com/Attachment/Xilinx_Answer_71435_XDMA_Debug_Guide.pdf
[5] P. Fallah, "FPGA Virtualization," M.S. thesis, School of Electrical Engineering and Computer Science,
KTH Royal Institute of Technology, Stockholm, 2019. [Online]. Available: http://kth.diva-
portal.org/smash/get/diva2:1412396/FULLTEXT01.pdf
[6] "Xilinx DMA IP Reference drivers," GitHub, 2021. [Online]. Available:
https://github.com/Xilinx/dma_ip_drivers. [Accessed: Jul. 22, 2021].
[7] W. D. Peterson, WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores,
Revision B4. OpenCores, 2010.
[8] A. S. Lee and N. W. Bergmann, "On-chip interconnect schemes for reconfigurable system-on-chip,"
in Proceedings of SPIE - The International Society for Optical Engineering, Mar. 2004.
[9] J. M. Mbongue, A. Shuping, P. Bhowmik, and C. Bobda, "Architecture Support for FPGA Multi-tenancy in
the Cloud," in 2020 IEEE 31st International Conference on Application-specific Systems, Architectures and
Processors (ASAP), 2020, pp. 125-132, doi: 10.1109/ASAP49362.2020.00030.
[10] A. Vaishnav, K. D. Pham, D. Koch, and J. Garside, "Resource Elastic Virtualization for FPGAs Using
OpenCL," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL), 2018,
pp. 111-1117, doi: 10.1109/FPL.2018.00028.
[11] K. D. Pham, E. Horta, and D. Koch, "BITMAN: A tool and API for FPGA bitstream manipulations," in
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 894-897, doi:
10.23919/DATE.2017.7927114.
[12] J. Hagemeyer, B. Kettelhoit, M. Koester, and M. Porrmann, "A Design Methodology for Communication
Infrastructures on Partially Reconfigurable FPGAs," in 2007 International Conference on Field Programmable
Logic and Applications, 2007, pp. 331-338, doi: 10.1109/FPL.2007.4380668.
[13] A. Ahmadinia et al., "A practical approach for circuit routing on dynamic reconfigurable devices," in 16th
IEEE International Workshop on Rapid System Prototyping (RSP'05), 2005, pp. 84-90, doi:
10.1109/RSP.2005.7.
[14] H. A. ElGindy, A. K. Somani, H. Schroeder, H. Schmeck, and A. Spray, "RMB - A Reconfigurable Multiple
Bus Network," in Proceedings of the Second International Symposium on High-Performance Computer
Architecture (HPCA-2), Feb. 1996, pp. 108-117.
[15] G. Agarwal and Ramesh Kini M., "Enhancement of Wishbone protocol with broadcasting and
multicasting," in 2015 IEEE International Conference on Electronics, Computing and Communication
Technologies (CONECCT), 2015, pp. 1-5, doi: 10.1109/CONECCT.2015.7383856.
[16] C. Bobda and A. Ahmadinia, "Dynamic interconnection of reconfigurable modules on reconfigurable
devices," IEEE Design & Test of Computers, vol. 22, no. 5, pp. 443-451, Sept.-Oct. 2005, doi:
10.1109/MDT.2005.109.
[17] T. Pionteck, R. Koch, and C. Albrecht, "Applying Partial Reconfiguration to Networks-On-Chips," in 2006
International Conference on Field Programmable Logic and Applications, 2006, pp. 1-6, doi:
10.1109/FPL.2006.311208.
[18] T. S. T. Mak, P. Sedcole, P. Y. K. Cheung, and W. Luk, "On-FPGA Communication Architectures and
Design Factors," in 2006 International Conference on Field Programmable Logic and Applications, 2006, pp.
1-8, doi: 10.1109/FPL.2006.311209.
[19] KCU1500 Board User Guide, UG1260, v1.4, Xilinx, Apr. 12, 2018. Accessed: Jul. 22, 2021.
[20] V. Lahtinen, E. Salminen, K. Kuusilinna, and T. Hamalainen, "Comparison of synthesized bus and crossbar
interconnection architectures," in Proceedings of the 2003 International Symposium on Circuits and Systems
(ISCAS '03), 2003, doi: 10.1109/ISCAS.2003.1206307.
[21] G. Dimitrakopoulos, C. Kachris, and E. Kalligeros, "Scalable Arbiters and Multiplexers for On-FPGA
Interconnection Networks," in 2011 21st International Conference on Field Programmable Logic and
Applications, 2011, pp. 90-96, doi: 10.1109/FPL.2011.26.
[22] Kyeong Keol Ryu, Eung Shin, and V. J. Mooney, "A comparison of five different multiprocessor SoC bus
architectures," in Proceedings of the Euromicro Symposium on Digital Systems Design, 2001, pp. 202-209, doi:
10.1109/DSD.2001.952283.
[23] V. G. Oklobdzija, "An algorithmic and novel design of a leading zero detector circuit: comparison with
logic synthesis," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 1, pp. 124-128,
Mar. 1994, doi: 10.1109/92.273153.
[24] W. J. Dally and B. Towles, "Chapter 12. Flow Control Basics," in Principles and Practices of
Interconnection Networks. Elsevier, 2004, p. 224.
APPENDIX A
Register File Description
This register file was prepared considering three PR regions connected to ports 1, 2, and 3 and
the PCIe slot connected to port 0, as given in the implementation description. Access types are
given from the host's perspective; only the FPGA logic can update the registers that have
read-only access.
Table A-1. Register File Description

  N   Address   Information Stored                                  Default      Access   Notes
  0   0x0       FPGA device ID                                      0xFACE0103   R
  1   0x4       PR region 1 destination address                     0x1          R/W      By default, all PR regions send
  2   0x8       PR region 2 destination address                     0x1          R/W      data to the PCIe slot
  3   0xC       PR region 3 destination address                     0x1          R/W
  4   0x10      Reset PR regions and ports [3:0]                    0xFFFFFFFF   R/W      Resets are active low
  5   0x14      Allowed addresses of port 0 master                  0xF          R/W      The PCIe slot can access all ports
  6   0x18      Allowed addresses of port 1 master                  0x1          R/W      By default, the PR regions can
  7   0x1C      Allowed addresses of port 2 master                  0x1          R/W      send data to the PCIe slot only
  8   0x20      Allowed addresses of port 3 master                  0x1          R/W
  9   0x24      Package numbers allowed in port 0 for ports [3:0]   0x08080808   R/W      Eight bits are reserved for each
                                                                                          port-to-port pair; e.g., port 1
                                                                                          can send eight packages (08) to
                                                                                          port 0, but none (00) to ports 1-3
  10  0x28      Package numbers allowed in port 1 for ports [3:0]   0x00000008   R/W
  11  0x2C      Package numbers allowed in port 2 for ports [3:0]   0x00000008   R/W
  12  0x30      Package numbers allowed in port 3 for ports [3:0]   0x00000008   R/W
  13  0x34      Application ID 0 destination address                0x1          R/W      By default, all data sent to the
  14  0x38      Application ID 1 destination address                0x1          R/W      XDMA is looped back to the server
  15  0x3C      Application ID 2 destination address                0x1          R/W
  16  0x40      Application ID 3 destination address                0x1          R/W
  17  0x44      PR region [3:1] last transaction error status       0x0          R        Eight bits are reserved for each
                                                                                          PR region, i.e., bits [7:0] for
                                                                                          PR 1, [15:8] for PR 2
  18  0x48      App. ID [3:0] last transaction error status         0x0          R        Eight bits are reserved for each
                                                                                          application ID
  19  0x4C      ICAP status                                         0x0          R        Indicates whether the ICAP module
                                                                                          completed reconfiguration or gave
                                                                                          an error: bit 0 high means done,
                                                                                          bit 1 high means error; the
                                                                                          remaining bits are unused
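A host-side driver could expose this register map roughly as sketched below. The offsets, defaults, and the eight-bits-per-port packing of registers 9-12 follow Table A-1; everything else (the constant names, the helpers) is illustrative and not part of the implementation.

```python
# Illustrative host-side view of part of the register file in Table A-1.
# Offsets and defaults follow the table; the names are hypothetical.
REG_DEVICE_ID = 0x00     # R,   default 0xFACE0103
REG_PR1_DEST = 0x04      # R/W, default 0x1 (PCIe slot)
REG_RESET = 0x10         # R/W, active-low resets, default 0xFFFFFFFF
REG_PKGS_PORT0 = 0x24    # R/W, package budgets into port 0
REG_ICAP_STATUS = 0x4C   # R,   bit 0 = done, bit 1 = error

def pack_package_budgets(per_port):
    """Pack per-source package budgets into one 32-bit word:
    eight bits per source port, port 0 in bits [7:0]."""
    word = 0
    for port, count in enumerate(per_port):
        assert 0 <= count <= 0xFF and port < 4
        word |= count << (8 * port)
    return word

def icap_done(status):   # bit 0 high: reconfiguration finished
    return bool(status & 0x1)

def icap_error(status):  # bit 1 high: reconfiguration failed
    return bool(status & 0x2)

# Defaults from Table A-1: every port may send 8 packages to port 0 ...
print(hex(pack_package_budgets([8, 8, 8, 8])))  # 0x8080808
# ... while only port 0 may send 8 packages to ports 1-3.
print(hex(pack_package_budgets([8, 0, 0, 0])))  # 0x8
```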