Memory and Network Interface Virtualization for Multi-Tenant Reconfigurable Compute Devices
by
Daniel Rozhko
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

The Edward S. Rogers Sr. Department of Electrical & Computer Engineering
University of Toronto

© Copyright 2018 by Daniel Rozhko
The use of reconfigurable computing devices in mainstream datacentre applications is growing, with
more companies and institutions using such devices to accelerate compute workloads. Specific examples
of particular note include the work done to integrate Field Programmable Gate Array (FPGA) devices
into the Microsoft Bing search engine [1] [2] [3], the introduction of FPGA devices in Amazon’s AWS
cloud offering [4], and the work done at IBM Research to integrate FPGA devices into the cloud [5].
Many academic works have also explored the deployment of FPGA devices in datacentre environments;
clearly the datacentre deployment of FPGA devices is an emergent and popular solution to expand the
compute capabilities of datacentres.
For CPU-based compute nodes in datacentres, it is common to use virtualization technologies to
enable multi-tenant use, i.e. multiple resident virtual compute nodes on a single physical system. This
increases the efficiency of the datacentres by increasing the effective utilization of those compute nodes.
In addition, this enables cloud-based service models as a single physical system can be shared by the
multiple customers of a cloud provider. This thesis explores the tradeoffs in designing analogous virtual-
ization technologies for reconfigurable compute devices, particularly the shared memory and networking
interfaces of FPGA devices.
1.1 Motivation
Reconfigurable compute devices, and in particular FPGA devices, have the potential to accelerate many
compute operations. In fact, FPGA devices have been shown to accelerate encryption [6], compression [7],
packet processing [8], and even machine learning applications such as neural networks [9]. This has no
doubt motivated their continued deployment in datacentre applications.
Microsoft has successfully demonstrated their Catapult platform (both v1 [1] and v2 [2] versions),
which utilizes FPGAs to accelerate Bing search. Microsoft has since expanded their deployment of
FPGAs to their Azure offering [10], using FPGAs to compress network traffic for their cloud customers.
Amazon has deployed FPGA devices to their own cloud offering, AWS [4]. A single AWS instance can
be created with up to eight FPGA devices, completely programmable by the cloud users. The interest
in using FPGAs in the datacentre will likely only increase in the future, making the efficient design of
the platform that enables their deployment of key importance.
The benefits of virtualization, and thus the motivation for exploring analogous virtualization
technologies for FPGAs, are well established. Virtualization as a technology dates back to the
1960s [11], and its use in datacentres today is widespread. Simply put, virtualization allows for what
would have been multiple physical servers to run co-located as virtual servers on a single physical server
node. As long as the single physical server has enough resources to run the workloads of all of the
virtual servers along with the overhead of the virtualization itself, the virtualized deployment saves
on the need for more physical server nodes. The same argument motivates virtualization for FPGA
devices: if multiple FPGA bound applications can be co-located on a single physical FPGA, and that
physical FPGA has enough resources (i.e. area and interface bandwidth) to accommodate the original
applications and the overhead of the virtualization platform itself, the FPGA deployment can be made
with fewer total compute nodes.
In this thesis we specifically address the implementation of virtualization from the perspective of
performance, data, and network domain isolation, while also considering that FPGAs have a limited
amount of area. We present the implementation of various components to achieve these domains of
isolation for an FPGA targeting a multi-tenant deployment.
1.2 Contributions
The contribution of this thesis is a detailed analysis and exploration of the design tradeoffs involved
in virtualizing FPGA devices, and the design of a virtualization platform that considers these tradeoffs. The
major components of the contribution are as follows:
• The introduction and formalization of the concepts of a “hard shell” and “soft shell”, easing the
design and analysis of virtualization technologies targeting reconfigurable compute devices
• A functioning HDL implementation of memory virtualization hardware cores, targeting single and
multi-channel memory platforms
• A functioning HDL implementation of network virtualization hardware cores, exploring multiple
deployment scenarios depending on the network infrastructure
• An analysis of the area overheads incurred by the virtualization technologies considering the design
tradeoffs discussed
1.3 Overview
The remainder of this thesis is organized as follows: Chapter 2 will provide background information
and will review previous work on the virtualization of reconfigurable compute devices. Chapter 3 will
provide an analysis of various deployment models for FPGAs in the datacentre and introduce concepts
used in the design of virtualization technologies for reconfigurable compute devices. Chapter 4 and
Chapter 5 will present the design and analysis of memory interface virtualization and network interface
virtualization respectively, considering the analyses of Chapter 3. Finally, Chapter 6 will examine
avenues for future work and conclude the thesis.
Chapter 2
Background
This chapter introduces some concepts, technologies, and definitions used throughout the thesis. It also
examines some related work, providing context for the thesis.
2.1 Reconfigurable Compute Devices
Reconfigurable computing describes a general class of devices in which the primary circuit elements can
be reconfigured into a user-defined hardware configuration; the array of these reconfigurable elements
is often referred to as a reconfigurable fabric. The device as a whole performs some computation, often
interfacing with some external network or memory. This section describes such devices, particularly the
FPGA, which is the focus of the implementations presented in Chapters 4 and 5, as well as technologies
that enable the use of reconfigurable computing devices.
2.1.1 Field Programmable Gate Arrays
The most common reconfigurable fabric used in reconfigurable compute applications is the FPGA. An
FPGA is an integrated circuit designed such that its functionality can be reprogrammed after its manu-
facture, according to some specification of the user. Users specify some hardware configuration using a
Hardware Description Language (HDL), a class of programming languages that can be used to describe
digital hardware. The user’s HDL application can be synthesized into a bitstream (used to program the
FPGA) using a set of Computer Aided Design (CAD) tools. For example, Xilinx FPGA devices can be
targeted for synthesis using the Vivado CAD tool set [12]. The bitstream generated describes the desired
configuration of the FPGA’s various building blocks, which can be configured to implement most digital
circuit designs.
The main building block of an FPGA is the Look-up Table (LUT), which can implement a four-,
five-, or six-input combinational logic function. The LUTs are grouped with flip flops inside logic blocks,
which are connected to a series of switches and connection blocks implementing a programmable routing
fabric. In addition to the LUTs, modern FPGAs include dedicated Digital Signal Processing (DSP)
blocks to perform multiplication, Block Random Access Memory (BRAM) for local memory storage,
and often even PCIe [13] or Network controllers [14]. Note that hardware components implemented as
dedicated silicon on an FPGA chip, rather than through a configuration and interconnection of LUTs
to implement the logic, are often referred to as “hardened”, in contrast to digital circuits synthesized
on the programmable fabric, which are referred to as “soft”; the PCIe and Network controllers included
on FPGA chips are examples of such hardened components. See [15] for more information on FPGA
architecture.
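To make the LUT abstraction concrete, a minimal Python model is sketched below (an illustrative model only, not taken from any vendor documentation); a k-input LUT is simply a table of 2^k configuration bits, analogous to the SRAM cells programmed by the bitstream, and the majority function chosen here is an arbitrary example.

```python
# Behavioural sketch of a k-input LUT: the 2^k configuration bits play the role
# of the SRAM cells programmed by the bitstream, and the inputs index into them.

def make_lut(truth_table):
    """truth_table: list of 2^k output bits, indexed by the packed input vector."""
    k = len(truth_table).bit_length() - 1
    assert len(truth_table) == 1 << k, "table length must be a power of two"

    def lut(*inputs):
        assert len(inputs) == k
        index = 0
        for bit in inputs:                 # pack the inputs into an index, MSB first
            index = (index << 1) | (bit & 1)
        return truth_table[index]

    return lut

# Example: configure a 3-input LUT as a majority function (an arbitrary choice).
majority = make_lut([0, 0, 0, 1, 0, 1, 1, 1])
assert majority(1, 0, 1) == 1
assert majority(0, 1, 0) == 0
```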
2.1.2 Coarse-Grained Reconfigurable Architectures
Another relatively new reconfigurable compute device is the Coarse Grained Reconfigurable Architecture
(CGRA) device, which as the name suggests, includes much coarser building blocks than the LUTs of the
FPGA. These blocks can be configured to perform some larger arithmetic functions, such as additions
or shifts, and are often connected using some routing fabric [16]. As there are no general purpose LUTs
to implement arbitrary hardware, these devices are often programmed using a different method (i.e., in
contrast to the HDLs used for FPGAs). Also, any communication to the external world (e.g., memory
controllers or network controllers) would need to be implemented as hard elements. The content of this
thesis focuses on the FPGA, though similar memory and network interface designs as presented for the
FPGA class of devices could be implemented as hard elements if virtualization were to be considered for
CGRAs.
2.1.3 Computer Aided Design Tools
As mentioned in Section 2.1.1, CAD tools are used to take a user’s description of the hardware they
intend to implement on the device and create a description of how the elements of the device should be
organized and configured to achieve the specified implementation. This final configuration descriptor file
is generally referred to as a bitstream, in reference to the fact that the descriptor represents the states to
be programmed to the bits of Static Random Access Memory (SRAM) cells that drive the configurable
portions of the LUTs and routing fabric. Figure 2.1 shows the steps in a typical FPGA design flow;
further details of these design flow steps, and their particular functions, can be found in [17].
Figure 2.1: A Typical FPGA CAD Design Flow, adapted from [17]
For FPGA devices, the user’s description of the hardware to be implemented is often in the form
of HDL, but other descriptions are available. High Level Synthesis (HLS) has increased in popularity
in recent years due to the fact that complex digital hardware circuits can be described using simpler
software-based programming languages, such as C, C++, or OpenCL [18]. Some examples of HLS-based
CAD tools include LegUp [19], Xilinx’s Vivado HLS [20], and Intel’s FPGA SDK for OpenCL [21].
HLS lowers the barrier to adopting reconfigurable computing platforms, thereby increasing the viability
(financial or otherwise) of datacentre deployments of FPGAs. This motivates further research into such
datacentre deployments and this virtualization work more specifically.
One CAD-based innovation for FPGA devices that is particularly important in enabling datacentre
deployments of FPGAs is Partial Reconfiguration (PR). PR allows for the FPGA device to be partitioned
and for these partitions to be reconfigured independently, such that one portion of the FPGA can be
actively running some circuit while another portion is reconfigured without its operation being stalled
or affected. These techniques are described in works by both major FPGA vendors [22] [23]. In PR-
based FPGA CAD design flows, the portion of the FPGA which is not reconfigured after the initial
configuration is termed the static region, while the portions of the FPGA that are reconfigured live are
called the dynamic or PR regions (note, an FPGA can typically have multiple PR regions). These PR-based FPGA CAD design flows generate PR bitstreams that can be programmed through the traditional Joint Test Action Group (JTAG) boundary scan methods, or often using internal connections driven by the FPGA fabric directly, such as the Xilinx-based ICAP connection [24].
2.2 Virtualization
Virtualization is a widely used term in many sub-fields of computer architecture, computer science, and
digital hardware. With regard to the term virtualization, the focus of this thesis is desktop virtualization
(often also termed server virtualization), which is the virtualization of a server compute node. This
section describes this type of virtualization in more detail.
2.2.1 Desktop Virtualization
Desktop virtualization is the set of technologies used to enable the deployment of multiple virtual servers
on a single physical server. In other terms, virtualization enables a single physical server to be seen
by its multiple tenant virtual servers as multiple unique and wholly independent hardware instances.
It was originally envisioned by IBM to partition their mainframe computers and allow multiple virtual
workloads to run on a single physical mainframe [11]. The main benefit of virtualization was the increase
in the efficiency of the mainframe computers, since single workloads would not use 100% of the physical
server’s resources at all times.
Essentially, virtualization software seeks to emulate multiple instances of the physical server, such that
each tenant can run on the emulated physical server without modification. Virtualization software is
commonly referred to as a Virtual Machine Monitor (VMM), or as a hypervisor. Without virtualization,
modern servers already support context switching between independent processes. Processes cannot
access the environment of other processes without sufficiently elevated privileges. For the VMM to
emulate multiple physical servers, it must only intercept these privileged calls from the virtual systems,
most often termed a Virtual Machine (VM), and ensure they only access parts of the system assigned to
them. For example, memory is allocated on a page basis to the VM and a translation action is needed
every time a VM attempts to access memory. Similarly, I/O devices are either emulated, or assigned
wholly to a single VM and attempts to query the system about the I/O devices are intercepted by the
VMM. Only emulated and physical I/O devices assigned to a VM will be discoverable.
Multiple different types of virtualization software are available today. Two main categories of VMM software are Type I and Type II virtualization [25]. For Type I, the VMM software runs directly on the physical server itself, becoming the main operating system of the physical machine. For Type II, the VMM software runs atop a traditional operating system, and the guest VMs run on the presented virtualization layer. In addition, paravirtualized VMMs of both types exist, which decrease the overhead of virtualization by allowing the VMs to install drivers that are virtualization-aware, bypassing some of the overheads associated with virtualization [11]. Type I non-paravirtualized VMMs are the main focus of this work, as it is not evident that there is an analogue to Type II software or paravirtualization for FPGA devices.
Two key goals to consider for virtualization solutions are data isolation and performance isolation.
Data isolation ensures that the data of one VM cannot be accessed, modified, or otherwise tampered with
by other VMs on the same VMM. Performance isolation refers to the idea that the performance of a
VM should not be impacted by the transient activity of other VMs running on the same VMM. While
processor time scheduled to a VM and memory allocated to a VM can be strictly controlled, the memory
access patterns and cache usage patterns of other VMs can affect the performance of a VM [26].
2.2.2 Containerization
Containerization is a virtualization technology that aims to reduce the overhead imposed on a physical
server running a traditional VMM. In a containerized environment, each virtualized server (termed
containers in the containerization context rather than VMs) shares an operating system kernel, but has
its own execution environment and middleware setup [27]. For example, Linux-based containers can be
created using control groups (cgroups) and Linux namespaces.
By sharing a kernel, and thereby a single application scheduling environment and memory allocation
scheme, overhead is reduced and resources can be more effectively distributed. The system’s process
scheduler is fully aware of not only the containers running on the system, but all of the processes within those containers.
The system’s memory allocation scheme is similarly fully aware of the memory requirements of each
process, and can allocate memory more efficiently. For traditional virtualization solutions, there would
be two layers of process scheduling and memory allocation, first at the VMM level and then again at the
VM’s guest operating system level. Figure 2.2 shows a visual comparison of the virtualization techniques
discussed.
Figure 2.2: (a) Type I virtualization, (b) Type II virtualization, (c) Containerization

2.2.3 Operating Systems

Operating systems are not generally considered virtualization technologies, though previous works on FPGAs that create Hardware Hypervisors and Hardware Operating Systems are very similar in that they present an abstracted environment for multiple hardware tasks to run on a single FPGA. From this hardware analogue, it is also easy to see how traditional software operating systems are similar to VMMs. While VMMs allow for multiple guest operating systems to run in environments that seem completely independent, operating systems allow for multiple user applications to run in environments that seem completely independent. This is mainly accomplished through context switching and virtual memory, i.e., memory accesses from applications are translated before being serviced.
Memory virtualization works by dividing the memory region into pages of some pre-determined size and then allocating the pages to the applications as they are needed. Each application sees a zero-based address for its memory space; accesses to the virtual memory space are intercepted and handled by a translation mechanism at the operating system level. The least significant bits of the virtual address, those that index memory within a page, remain unchanged, while the most significant bits are remapped to the actual physical memory location assigned to that application. Page mappings are stored in a page table in memory and cached in a structure known as a Translation Lookaside Buffer (TLB) [28].
Figure 2.3 depicts the memory translation scheme. Note, VMM environments must have two levels of
translation: at the guest operating system level, and then again at the VMM level.
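As a rough behavioural sketch of this translation (illustrative only, not code from the thesis; the page-table contents are invented), the following Python fragment splits a 32-bit virtual address into a 20-bit virtual page number and a 12-bit page offset, as in Figure 2.3, and remaps only the upper bits through a per-process table:

```python
# Behavioural sketch of page-based address translation (4 KB pages, 32-bit
# addresses as in Figure 2.3): the low 12 bits pass through unchanged, while
# the upper 20 bits are remapped through a per-process page table.

PAGE_OFFSET_BITS = 12
PAGE_MASK = (1 << PAGE_OFFSET_BITS) - 1

# Hypothetical page tables, keyed by process ID, then by virtual page number.
page_tables = {
    7: {0x00000: 0x12345, 0x00001: 0x0BEEF},
}

def translate(pid, vaddr):
    vpn = vaddr >> PAGE_OFFSET_BITS        # virtual page number (bits 31..12)
    offset = vaddr & PAGE_MASK             # page offset (bits 11..0), unchanged
    ppn = page_tables[pid].get(vpn)        # page-table (or cached TLB) lookup
    if ppn is None:
        raise KeyError("page fault: no mapping for this virtual page")
    return (ppn << PAGE_OFFSET_BITS) | offset

# In a VMM environment, the guest "physical" address produced above would be
# translated a second time through the hypervisor's own table.
assert translate(7, 0x00001ABC) == 0x0BEEFABC
```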
2.3 FPGAs in the Cloud and FPGA Virtualization
Considerable effort has gone into investigating the deployment of FPGA devices in datacentres and cloud
environments. This is an important area of consideration for this work since virtualization technologies
are often used in cloud settings, and most of these FPGA cloud and datacentre deployment works include
some version of virtualization.
Figure 2.3: Virtual to physical memory address translation. (a) Translation in a standard operating environment, (b) Translation in a virtualized environment
2.3.1 Related Work
The related work surveyed here establishes the context in which this thesis is presented.
Microsoft Catapult
Microsoft introduced FPGAs into their Bing Search datacentres to accelerate their search algorithms
using specialized hardware implementations of those algorithms. The original implementation, dubbed
Catapult v1, was published in 2014 [1]. The Catapult implementation included FPGAs installed as
PCIe add-on cards within processor-based servers. These FPGAs are controlled by and receive their data from the processor system, essentially set up in a master-slave configuration. Multiple FPGAs are
connected together using a dedicated interconnection network, configured in a torus arrangement. This
allows for the FPGAs to communicate with each other and enables multi-FPGA applications. The work
characterizes the hardware application in their platform as the “Role” and the surrounding abstraction
layer as the “Shell”. While this shell does not enable sharing of the FPGA between multiple tenant
applications, it does provide an abstraction layer and might be considered analogous to a Hardware
Operating System [1].
This is a good place to discuss the nomenclature used to describe the various components of FPGA
platforms. What the Microsoft authors (Putnam et al.) term the Role is often called the “Hardware
small Hardware Applications themselves. Table 4.5 shows a breakdown of the area utilization needed
to implement coarse-grained MMUs with various page sizes. As the page size decreases, the amount of FPGA area resources needed increases in turn, namely the amount of LUTRAMs needed for the solution. This makes intuitive sense, since reducing the page size primarily increases the storage needed for the page table. In any case, we find that even as the page size decreases to about 4 MB, the total utilization by the shell components is not greatly affected.
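The storage argument can be made concrete with some simple arithmetic; the sketch below is illustrative only (the 16 GB memory size and 32-bit entry width are assumed for the example and are not figures from the thesis), but it shows why halving the page size doubles the number of page-table entries, and hence the LUTRAM storage, that the MMU must hold on-chip.

```python
# Rough arithmetic behind the trend in Table 4.5: smaller pages mean more
# page-table entries for the same addressable memory.
# The 16 GB memory size and 32-bit entry width are illustrative assumptions.

MEMORY_BYTES = 16 * 2**30          # assumed addressable off-chip memory
ENTRY_BITS = 32                    # assumed storage per page-table entry

for page_mb in (64, 16, 4):
    page_bytes = page_mb * 2**20
    entries = MEMORY_BYTES // page_bytes
    kbits = entries * ENTRY_BITS / 1024
    print(f"{page_mb:3d} MB pages -> {entries:5d} entries, ~{kbits:7.1f} Kb of table storage")
```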
4.5 Multi-Channel Memory Considerations
The memory virtualization solutions discussed thus far have only considered a single independent memory channel. Many FPGA platforms include multiple memory channels to increase the total effective bandwidth of external memory. This section considers the design decisions to be made in extending the previous concepts of this chapter to a multi-memory-channel platform, introducing a few different paradigms for including multiple memory channels. Note, the Alpha Data 8k5 FPGA board used in this work includes two separate DDR4 memory channels.
4.5.1 Separately Managed Channels
The simplest way to virtualize multiple memory channels is to separate the channels and attach each to some fraction of the Hardware Application Regions, i.e., each Hardware Application Region is connected
to a single memory channel. This solution is depicted in Figure 4.13. Separately managed memory
channels do not introduce any increased complexity over the solutions presented earlier in this chapter,
since each memory channel would simply have the performance and data isolation of a single-channel
system. This solution is not explicitly evaluated in this thesis, but it is included here for completeness.
Figure 4.13: Multi-Channel Organization with Separately Managed Channels
Figure 4.14: Multi-Channel Organization with Single Shared MMU
4.5.2 Single Shared MMU
A shared MMU system is depicted in Figure 4.14. In this system, each Hardware Application has a
single top-level memory interface (i.e., the interface that is at the PR region boundary) that connects to
protocol decouplers and verifiers, bandwidth throttlers, and a single MMU, just as in the single-channel
solutions. The difference is that the single MMU is connected at its master (request issuing) side to
an interconnect that can route requests to any of the memory channels (two channels are depicted in
Figure 4.14). In other words, the memory spaces of the memory channels are logically concatenated and
the single MMU serves this concatenated memory space.
If the data width of the interface presented to the Hardware Applications is equal to the width
of a single memory channel, the post-MMU interconnect can be connected directly to the memory
controllers for each of the memory channels; however, this limits the system to just a fraction of the
total memory bandwidth available (e.g., one half for two memory channels and one quarter for four
memory channels). To use the entire available bandwidth, the data width of the interface presented to
the Hardware Applications must be at least the number of channels multiplied by the data width of a
single memory channel. In this case, there must be a data width converter inserted between the post-MMU interconnect and the memory channels, specifically a data width downsizer for write data
received and a data width upsizer for read data returned.
For the write data interface, a downsizer on its own would exert back-pressure on the interconnect
preventing it from sending data at the full bandwidth speed (since a downsizer cannot accept a new
data beat every cycle). To prevent write requests from throttling the performance of the entire system,
a write data buffer must be included for each memory channel. The post-MMU interconnect can simply
write data to these buffers and not be throttled by the back-pressure of the data width downsizers. For
the read channels, the data width upsizers would not have new data available every cycle, as they would
have to wait for multiple read data beats to pack into one larger read data beat. The AXI4 protocol,
however, allows for read data to be interleaved, i.e., the interconnect can interchangeably read data from
different channels and send them upstream out of order. There is no buffering requirement for the read
data channel. These data width converters and write data buffers are shown in Figure 4.14.
This shared MMU solution is simple and requires relatively few changes from the single memory
channel system, but it does present some potential problems. Memory controllers implemented on
FPGAs tend to have wide data widths already because of the slower clock speeds achievable in FPGA fabric relative to the ASIC devices (e.g., CPUs) for which off-chip memory solutions are generally designed. The data width must be increased at the same ratio that the clock speed is reduced between the memory device itself and the FPGA fabric clock (e.g., if the clock is reduced to one quarter, the data width must be increased four-fold). Introducing multiple memory channels widens that data width even
further, and that might present timing challenges to the Hardware Applications. For example, the Xilinx
memory controller in [47] requires a four-fold data width increase, resulting in a native data width of 256
bits for the memory controller, which would increase to a 512-bit data width in the AXI interconnect
for two memory channels and a 1024-bit data width for four memory channels.
A further complication limits the effectiveness of the performance isolation in a shared MMU
solution. The bandwidth throttlers operate on the AXI interface presented to the Hardware Application
itself, with no knowledge of the future memory channel that the request will eventually target. If all of the
Hardware Applications try to target the same memory channel (assuming the MMU assignment allows
such), the memory bus's bandwidth would be effectively limited to the bandwidth of that single memory channel. The only way to ensure performance isolation would be to isolate each Hardware Application to a single memory channel, which would somewhat defeat the purpose of the multi-channel memory solution. If any memory channel has memory that is assigned to two or more Hardware Applications, there is a potential for malicious, or even unintentional, bandwidth limiting for those Hardware Applications.

Figure 4.15: Multi-Channel Organization with Parallel Shared MMUs
4.5.3 Parallel MMUs with a Single Port
To overcome the problem in performance isolation for a shared MMU solution, each Hardware Applica-
tion can have a first stage MMU that simply indicates which memory channel the request is to access,
and this information can be used to reroute that request to the correct memory channel. An interconnect
can follow this first stage MMU and map requests to the memory channel indicated by the first stage
MMU. Separate bandwidth throttlers can then be instantiated at the output of this first interconnection
network, which would effectively throttle the bandwidth between each Hardware Application and memory
channel pairing individually. This arrangement is depicted in Figure 4.15.
If the system uses a base-and-bounds MMU design, the first stage MMU would simply be a table
indexed by the VIID of the requester interface and indicate which memory channel that VIID is mapped
to. The most significant bits of the address, which indicate the memory channel, would be replaced with
this stored value. If the system uses a coarse-grained paged MMU, this first stage MMU’s page-table
would be indexed by the same bits of the address (in addition to the VIID) as the later stage MMU,
containing the same number of entries as the portion of the later stage MMU’s page table assigned to
that specific Hardware Application Region. However, the mapping value stored in the page-table simply
indicates the memory channel that page is mapped to, so only the most significant bits of the address,
that indicate the memory channel, would be replaced with this stored value. The remainder of the
mapping would be stored in the second stage MMU’s page table.
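A small behavioural model may help clarify the division of labour between the two stages for the base-and-bounds case; the field widths and table contents below are illustrative assumptions, not values from the implemented hardware.

```python
# Behavioural sketch of the two-stage MMU for a base-and-bounds design:
# stage 1 checks the bound and decides which memory channel a VIID maps to,
# stage 2 adds the base offset within that channel.
# All widths and table contents here are illustrative assumptions.

CHANNEL_BITS = 1                       # two memory channels in this sketch
ADDR_BITS = 32                         # address width presented to the application
CHANNEL_SHIFT = ADDR_BITS - CHANNEL_BITS

# Stage 1: per-VIID channel assignment and allocation bound.
stage1 = {0: {"channel": 0, "bound": 0x10000000},
          1: {"channel": 1, "bound": 0x08000000}}

# Stage 2: per-VIID base address within the assigned channel.
stage2 = {0: {"base": 0x00000000},
          1: {"base": 0x20000000}}

def route_and_translate(viid, addr):
    entry = stage1[viid]
    if addr >= entry["bound"]:                    # reject out-of-bounds accesses early,
        raise ValueError("out-of-bounds access")  # before they consume bandwidth credits
    channel = entry["channel"]                    # stage 1: pick the memory channel
    local = addr & ((1 << CHANNEL_SHIFT) - 1)
    return channel, stage2[viid]["base"] + local  # stage 2: add the base component

assert route_and_translate(1, 0x00001000) == (1, 0x20001000)
```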
This first MMU and interconnect can also handle out-of-bounds accesses, freeing downstream components from wasting bandwidth on useless transactions. This would also mean that bandwidth credits in
the downstream bandwidth throttler are not consumed by out-of-bounds accesses. For a coarse-grained
paged MMU system, the first stage MMU would indicate the validity of a page mapping and act on
4k boundary crossing errors, while the second stage MMU could safely assume all mappings are valid
and ignore 4k boundary crossings. For a base-and-bounds MMU system, the first stage MMU would
deal with the bound check and 4k boundary crossing errors in addition to storing the mapped memory
channel for each VIID, and the second stage could safely ignore any errors and simply add the base
component.
In this MMU arrangement, since requests are already separated by a targeted memory channel to
perform performance isolation, those separated request streams need only be forwarded to that memory
channel. Thus, each memory channel can have a separate MMU that only handles requests targeting
that memory channel; we term this MMU system “Parallel MMUs with a Single Port” because each
memory channel has an individual MMU and each Hardware Application Region has a single port.
In the single shared MMU approach, the data width of the memory interface presented to the
Hardware Application has to be wider to allow for the full memory bandwidth of the attached memory
to be realized. In this case, there is no bottleneck at a single MMU, so the interface width can be
smaller. In fact, the interface width presented to the Hardware Applications would only limit the
maximum amount of bandwidth that could be assigned to the Hardware Application, and the total
system bandwidth might not be impacted. For example, if a system has two memory controllers with a
256-bit data width, and each memory interface port at the PR boundary to two Hardware Applications
also has a 256-bit data-width, the full bandwidth of the system could still be used as long as the access
patterns of the Hardware Applications are efficiently mapped across the memory channels. Note, a wider
data interface at the Hardware Application would still be required if the system might want to assign
more than the bandwidth of a single memory channel to any Hardware Application.
The arrangement depicted in Figure 4.15 includes a wider memory access interface and thus also
includes data width converters and write channel buffers before the memory channels. This arrangement
would require all of the MMUs and interconnects to have larger data widths as well. These data
width converters and write channel buffers could however be included immediately before the bandwidth
throttlers, as indicated in the modified arrangement shown in Figure 4.16. This would reduce the size
of the downstream interconnect and MMUs, but would require data width converters for each Hardware Application Region. We term this arrangement “Parallel MMUs with a Single Port (modified)”.

Figure 4.16: Multi-Channel Organization with Parallel Shared MMUs (modified)
4.5.4 Parallel MMUs with Multiple Ports
Looking at the modified parallel MMUs arrangement, much of the infrastructure located before the
bandwidth throttlers could be included in a soft shell implementation and need not necessarily be
included in the static hard shell implementation. In essence, this would implement a parallel-MMU system with multiple ports presented at the PR interface to the Hardware Application Region.
Each port would correspond to a separate memory channel. This is shown in Figure 4.17, which is
essentially the same as Figure 4.16 except with the protocol decouplers and verifiers duplicated and
moved to just before the bandwidth throttlers, and the other components moved inside the soft shell.
The first stage MMU would then be connected to the management framework through the management
connection of the soft shell.
The advantage of this arrangement is that the interconnect instantiated within the soft shell can
be made just large enough to accommodate the largest memory interface needed inside the soft shell.
For example, if a particular Hardware Application needed only memory interfaces of width 64-bits, that
interconnect and first stage MMU could be limited to 64-bits with a data width upsizer included at the
memory interface. Note, since the bandwidth throttler included in this thesis penalizes requesters for gaps in data transmission and acceptance, and a data-width upsizer would induce such gaps, some buffering of write requests until enough data has been received would be needed to preserve bandwidth allocations.
Another advantage of this system is that if the memory interfaces within the soft shell use fewer address
bits than the system memory (i.e., they do not need large memory allocations), the first stage MMU's depth can be reduced, and the total FPGA area utilization of the shell (cumulative hard and soft shell utilization) would also be reduced.

Figure 4.17: Multi-Channel Organization with Parallel MMUs and Multiple Ports
These multi-channel solutions are presented here as a conceptual discussion. The implementation of
such solutions is left to future work.
4.5.5 Multi-Memory Channel Implementations in Previous Works
Most of the previous works described in Chapter 2 include only a single memory channel, similar to the
exploration presented in this thesis. One notable exception is the SDAccel platform created by Xilinx [30].
Specifically, the SDAccel Platform Reference Design described in [65] shows that the Shell implemented
for the SDAccel platform includes four separate memory channels. In that reference platform, the
connections to the off-chip memory are not abstracted through the Shell, but instead presented directly
to the PR region. Since the Shell presented in that work does not have multiple applications, even the
memory controller itself is meant to be implemented within the PR region. That work therefore does not present a multi-memory channel solution with any kind of virtualization.
One relevant prior work that considers both multiple applications and multi-memory channel de-
ployments is the work presented by Yazdanshenas and Betz [35]. That work explores the overheads associated with a multi-tenant Shell, and multiple memory channels are considered explicitly. The way that those memory channels are presented to the hardware applications is consistent with the theoretical solution presented in Section 4.5.4. More specifically, each of the memory channels is accessed through a separate interface within each application (i.e., each application has a memory access interface corresponding to each memory channel). However, that work
does not explicitly consider isolation and therefore would not include the parallel MMUs described in
Section 4.5.4.
Chapter 5
Network Interfaces
In this chapter, we switch the focus to securing the sharing of the network interface, which is required for
the direct-connected FPGA deployment model. Network interfaces, particularly Ethernet connectivity,
are provided on many FPGA boards and are often directly supported by FPGA vendors. The Alpha
Data 8k5 FPGA board used in this work for example includes 10 Gbps Ethernet connectivity [46]. Xilinx
provides support for Ethernet ports, including the 10 Gbps port on the Alpha Data device, through its
10G Ethernet Subsystem IP Core [48]. The interface provided to the user for this Xilinx Ethernet controller is an AXI-Stream interface; while the work presented in this thesis targets the AXI-Stream
interface, this interface is generic enough such that these methods could be applied to other interface
types as well.
In contrast to the solutions that aim at securing memory, presented in Chapter 4, network interfaces are connected to the datacentre infrastructure itself, which means activity propagated over these connections could impact applications beyond the multi-tenant device. In this chapter we analyze the domain isolation solutions needed to address this problem, as well as discuss how performance isolation solutions can be extended to network interfaces.
5.1 Network Interface Performance Isolation
To institute performance isolation for the network interface, similar stages to those implemented for the
memory channel are required: protocol decoupling, protocol verification, and interconnect bandwidth
throttling. To illustrate the intention of the work described in this section, see Figure 5.1.
Figure 5.1: Adding Performance Isolation for Networking to the Shell. (a) Shell without isolation, (b) Shell with added isolation

In part (a) of the figure, we depict an unsecured shell organization that includes only network connectivity as an external resource. The multiple Hardware Application Regions include an AXI-Stream output port that connects to an AXI-Stream interconnect to arbitrate access to the Xilinx
Ethernet controller. In addition, the Hardware Application Regions include AXI-Stream input ports that
are driven by the output of another AXI-Stream interconnect. Packets that arrive from the Ethernet
controller pass through a component that takes the least significant bits of the packet’s MAC address
as the VIID to determine which interface to route to in the AXI-Stream Interconnect that drives the
AXI-Stream Input ports of the Hardware Application Regions. This component is called a “Simple
NMU” since it manages which interface to route input packets to, though it does not implement any
security features like the NMUs described in Section 5.3. As with the shell for the memory connectivity,
PCIe is included such that a host computer can manage the shell, though in the case of the unsecured
shell there is nothing to manage outside the soft shell components.
Part (b) of the figure indicates how the various performance isolation components modify the simple
unsecured shell depicted in part (a); each of the AXI-Stream input and output ports pass through
AXI-Stream Protocol Verifier-Decoupler components. The AXI-Stream output ports are connected to
bandwidth throttlers, so that the access to the Ethernet output port can be fairly shared amongst the Hardware Application Regions. Note, no bandwidth regulation is done for input packets since the shell cannot effectively assert backpressure on the Ethernet input port. All of these performance isolation
components are connected to the PCIe management network.
Figure 5.2: Network Interface Decoupler
5.1.1 AXI-Stream Decoupling
As stated in Section 4.1.3, decouplers are needed so that the Hardware Application Region can be
effectively disconnected from the shared interconnect and Ethernet connection. This could be done to
reprogram the PR region in which the Hardware Application is resident for those deployments where PR
is enabled, or to pause the Hardware Applications for some other reason, such as to prevent packets from being sent from a particular Hardware Application.
The network connectivity is provided by an AXI-Stream interface, which is a fairly generic interface providing only a data field (with a strobe value indicating valid bytes), a LAST signal to indicate the end of a packet, and some handshaking signals. The simple Xilinx decoupler [54] cannot be used in this case because it might decouple packets midway through transmission, which could cause downstream components to lock up waiting for the last data beat of a packet. Thus, decoupling activity must be gated with an indication of whether or not a packet is midway through its transmission. Figure 5.2 shows the implementation of this AXI-Stream decoupler. Outstanding packet trackers (implemented using procedural HDL code) are used to track whether there is a mid-stream packet for both the input and output stream directions.
For input packets, the decoupled READY signal is held high so that all packets are seen as accepted by the downstream interconnect, to prevent the backpressure from locking up the interconnect. Since packets are pseudo-accepted in this way, there might be a problem if the input AXI-Stream port is un-decoupled midway through one of these pseudo-accepted packets, as the Hardware Application would see a partial packet with no way of knowing whether or not it is a complete packet. To prevent this, the decoupler signal must also be tied to a sticky decouple signal, which keeps the decoupling enabled even after the decouple signal has been de-asserted until the pseudo-accepted packet's transmission is complete.
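The following Python fragment is a cycle-level behavioural model of the ingress half of this decoupler (a model for exposition, not the HDL itself); the signal names loosely follow Figure 5.2, and the logic shown is the pseudo-accept of beats plus the sticky bit that keeps decoupling active until a pseudo-accepted packet has drained.

```python
# Behavioural model (not the HDL) of the ingress side of the AXI-Stream
# decoupler: while decoupled, READY is forced high so upstream beats are
# pseudo-accepted, VALID is hidden from the application, and a sticky bit keeps
# the interface decoupled until a pseudo-accepted packet has fully drained.

class IngressDecoupler:
    def __init__(self):
        self.sticky = False                # set while a pseudo-accepted packet is in flight

    def cycle(self, decouple, tvalid_in, tlast_in, tready_from_app):
        decoupling = decouple or self.sticky

        if decoupling:
            tready_out = 1                 # pseudo-accept every beat upstream
            tvalid_to_app = 0              # the application sees nothing
            if tvalid_in and not tlast_in:
                self.sticky = True         # a packet is now partially consumed
            elif tvalid_in and tlast_in:
                self.sticky = False        # packet fully drained; safe to un-decouple
        else:
            tready_out = tready_from_app
            tvalid_to_app = tvalid_in

        return tready_out, tvalid_to_app

# Example: decouple is dropped mid-packet, but the sticky bit keeps draining it.
d = IngressDecoupler()
d.cycle(decouple=True, tvalid_in=1, tlast_in=0, tready_from_app=0)    # beat 1 pseudo-accepted
out = d.cycle(decouple=False, tvalid_in=1, tlast_in=1, tready_from_app=0)
assert out == (1, 0)    # still decoupled until the packet's last beat drains
```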
5.1.2 AXI-Stream Protocol Verification
Unlike the AXI4 Protocol for memory interfaces, the AXI-Stream protocol does not include protocol
assertions that must be met to confirm protocol compliance, because the AXI-Stream interface has little
disallowed behaviour. The only protocol assertion of note that could be inferred for the AXI-Stream
interface is the Handshaking check, i.e., once the VALID signal is asserted, all the signals must maintain
their value until the READY signal indicates that the data beat has been accepted. The Xilinx Ethernet
controller does however impose some additional protocol restrictions, namely that the KEEP signal (the
strobe signal that indicates which data bytes are valid) must be held all high before the last beat is
transferred and that packet transmission cannot include any gaps (i.e., once a packet is started, VALID
cannot be de-asserted until the last beat is transferred). Invalid KEEP values are ignored by the Xilinx
Ethernet controller, but gaps in the transmission can cause packets to be dropped [48]. Finally, the
components implemented in future sections require that the packet be held to a maximum size (this is
often called the Maximum Transmission Unit (MTU)), so packets must assert the LAST signal before
the packet exceeds this size.
The total list of assertions that must be met to ensure that packet transmission is not interrupted
is: the Handshaking check, no gaps in the transmission, and packet size limited to the MTU. Note, the no-gaps requirement applies to input packets as well, indicating that there can be no gaps in the acceptance of a packet transmission, but the other assertions apply to the output direction only.
The AXI-Stream protocol verifier design is depicted in Figure 5.3. An outstanding packet tracker
is used to track whether any outgoing packets are midstream; this value is used to override the VALID
signal to ensure there are no gaps in the transmission. Next, a counter is used to keep track of the number
of beats sent for outgoing packets; once the count is equal to one less than the maximum packet size, the
LAST signal is forced high to end the packet. Note, both of these changes could corrupt the packet sent,
but the purpose of the protocol verifier is simply to prevent malformed requests from propagating to
the downstream interconnect, so this is only of concern to the Hardware Application sending malformed
packets. Finally, the AXI-Stream outputs are registered so the values are not captured if they change
after the VALID signal has been asserted. For input packets, the only change required to ensure protocol compliance is the overriding of the READY signal to a high value; the input port must accept all packets when they arrive and cannot ever assert backpressure that could lock up the interconnect.
Figure 5.3: Network Interface Protocol Verifier
5.1.3 Network Interface Bandwidth Throttling
Network bandwidth throttling can be implemented by again using a modified version of the credit-
based accounting system presented in [58]. Since network transmissions cannot be interrupted mid-
transmission, the number of credits needed to initiate the transmission must be the total number of beats that the transmission might need to use. This is equal to the number of beats that make up
an MTU packet. It is not possible to tell the size of the packet before it has been transmitted, so the
total credits deducted on a new packet transmission acceptance must be equal to the MTU. Once the
end of the packet is reached, credits can be redeposited based on how much shorter the packet is than
the MTU. As a reminder, the original credit accounting mechanism was as follows:
credits(t+1) = credits(t) + ρ − 1,   if the interface has bus access
credits(t+1) = credits(t) + ρ,       if there is no access but there are pending requests
credits(t+1) = σ,                    if there are no pending requests

where ρ is a decimal value (less than or equal to one) that represents the proportion of bandwidth assigned to that interface, and σ represents the burstiness accepted from that interface. This formulation can be adjusted to implement the changes needed for the network interface as follows:

credits(t+1) = credits(t) + ρ − cr_new + cr_last,   if there is a pending packet or data to send
credits(t+1) = σ,                                   if there is no pending packet or data

where:

cr_new  = MAX_BEATS_MTU,   if a new packet transmission is accepted
cr_new  = 0,               otherwise
cr_last = unsent,          if the last data beat is accepted
cr_last = 0,               otherwise

unsent(t+1) = MAX_BEATS_MTU − 1,   if TLAST and TREADY are high, or on reset
unsent(t+1) = unsent(t) − 1,       if TREADY and TVALID are high
unsent(t+1) = unsent(t),           if TREADY is not asserted
The bandwidth throttler implemented based on this formulation is shown in Figure 5.4. As mentioned at the beginning of this chapter, bandwidth throttling only affects the output AXI-Stream interface. Again, an outstanding packet tracker is included that prevents decoupling based on the credit count
once a packet has started transmission. The credit count is compared to the MTU to determine whether
or not that interface should be decoupled. The credit update system updates the amount of credits
stored in the credit register whenever a new packet transmission is accepted and/or the last beat of a
transmission is sent.
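A compact behavioural model of this credit update (again a model for exposition, not the HDL; the ρ, σ, and MTU values are arbitrary illustrative choices) may help tie the formulation above to the hardware of Figure 5.4.

```python
# Behavioural model (not the HDL) of the network bandwidth throttler's credit
# accounting for one egress interface. rho is the fraction of bandwidth granted
# per cycle, sigma the permitted burstiness, MTU_BEATS the number of beats in an
# MTU-sized packet. All numeric values here are arbitrary illustrative choices.

MTU_BEATS = 4

class CreditThrottler:
    def __init__(self, rho=0.5, sigma=8.0):
        self.rho, self.sigma = rho, sigma
        self.credits = sigma
        self.unsent = MTU_BEATS - 1        # beats of the current packet not yet sent

    def allow_new_packet(self):
        # A new packet may start only if a full MTU's worth of credits is banked.
        return self.credits >= MTU_BEATS

    def cycle(self, pending, new_packet_accepted, beat_accepted, tlast):
        cr_new = MTU_BEATS if new_packet_accepted else 0
        cr_last = self.unsent if (beat_accepted and tlast) else 0   # refund unused beats
        if pending:
            self.credits += self.rho - cr_new + cr_last
        else:
            self.credits = self.sigma
        if beat_accepted and tlast:
            self.unsent = MTU_BEATS - 1
        elif beat_accepted:
            self.unsent -= 1

# Example: a 2-beat packet is charged a full MTU up front, then refunds 2 beats at TLAST.
t = CreditThrottler()
assert t.allow_new_packet()
t.cycle(pending=True, new_packet_accepted=True, beat_accepted=True, tlast=False)
t.cycle(pending=True, new_packet_accepted=False, beat_accepted=True, tlast=True)
assert abs(t.credits - (8.0 + 2 * 0.5 - MTU_BEATS + 2)) < 1e-9
```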
Unlike the bandwidth throttling for the memory interface, the total bandwidth available on the
shared network interface is not dependent on the network access pattern. The total bandwidth available
should only be limited by the downstream datacentre switching infrastructure. As such, the network
bandwidth throttling system does not need a bandwidth conserving system like the one introduced for
the memory bandwidth in Section 4.2.3. Instead, the sum of the ρ values should simply be set to the
total bandwidth available in the system.
5.2 Network Security Background
Virtualized FPGA deployments must consider security in the way that Hardware Applications are allowed
to access the shared network. As already mentioned in the introduction to this chapter, these security measures are required not only to isolate the Hardware Application Regions from each other, but also to isolate the rest of the network from any unwanted accesses from the Hardware Applications themselves. This is what
was termed Domain Isolation in Section 3.2.3. The need for Domain Isolation is not restricted to FPGAs
deployed in the cloud; this security consideration is necessary also for software VMs installed on CPU-based datacentre nodes. In this section, we discuss the solutions used to provide domain isolation in other parts of the datacentre.

Figure 5.4: Network Interface Bandwidth Throttler
5.2.1 Software Analogues
In the software domain, the National Institute of Standards and Technology (NIST) details some common
methodologies used to secure access to a shared network by VMs in a virtualized environment [66]. The
main methodology presented is the virtual switch, a fully functional switch implemented in software that
switches traffic from the virtual network connections to the physical network interface and the next-level
physical switch. Distributed virtual switches extend the virtual switch concept by provisioning and
managing virtual switches on multiple physical nodes simultaneously, an avenue that could be explored
for hardware NMU solutions in future work.
Another common network security methodology, according to NIST, is the firewall: devices and/or
security layers within switches or software that filter traffic such that only allowed connections are left to
pass through to the network. The set of allowed connections is often specified in what are termed Access
Control Lists (ACLs), or alternatively Network Access Control Lists (NACLs). Firewall functionality
can be provisioned using physical appliances installed in the network, through ACLs implemented in
the physical switches of the network, or through firewalls implemented in the virtual switch solutions
mentioned earlier.
For multi-tenant environments, the pushing of ACLs to a physical firewall appliance or the next-level
physical switch is often termed hairpinning, since traffic from the VM is first routed to the physical
appliance and then to its final destination. Note, for such a firewall implementation to work, some
level of source semantics enforcement must be done before routing to the firewall appliance such that
the traffic is uniquely identifiable. Such hairpinning techniques are considered here in this thesis for
analogous hardware solutions.
As a final consideration, virtual networking subdivides the physical network into virtual networks that can be provisioned to different users and isolated from each other. The simplest form of virtual networking is the Virtual Local Area Network (VLAN) tag, IEEE 802.1Q [67]. The VLAN tag includes a 12-bit virtual network ID that allows switches to identify which virtual network a packet belongs to and to isolate traffic between virtual networks. Such tagging can often be done by the switches themselves at ingress to the network. Additionally, network virtualization can be provided using encapsulation-based methods such as VXLAN [68] or NVGRE [69]; Virtual Tunnel Endpoints (VTEPs), often implemented within virtual switches, perform the encapsulation and de-encapsulation.
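For reference, the 4-byte 802.1Q tag that such VLAN tagging inserts is easy to state in code. The sketch below shows the insertion a virtual switch (or a tagging NMU) might perform on an Ethernet frame; the TPID value 0x8100 and the 12-bit VID field follow the standard, while the function name and buffer handling are illustrative assumptions.

#include <stdint.h>
#include <string.h>

/* Insert an IEEE 802.1Q VLAN tag into an Ethernet frame.  The 4-byte
 * tag (TPID 0x8100 followed by the 16-bit TCI holding the 12-bit VID)
 * goes between the source MAC address and the EtherType.  'out' is
 * assumed to be at least in_len + 4 bytes. */
size_t insert_vlan_tag(const uint8_t *in, size_t in_len,
                       uint8_t *out, uint16_t vid, uint8_t pcp)
{
    if (in_len < 14)                          /* not a full Ethernet header */
        return 0;

    uint16_t tci = (uint16_t)((pcp & 0x7) << 13) | (vid & 0x0FFF);

    memcpy(out, in, 12);                      /* dest MAC + src MAC         */
    out[12] = 0x81; out[13] = 0x00;           /* TPID = 0x8100              */
    out[14] = (uint8_t)(tci >> 8);            /* TCI: PCP / DEI / VID       */
    out[15] = (uint8_t)(tci & 0xFF);
    memcpy(out + 16, in + 12, in_len - 12);   /* EtherType + payload        */
    return in_len + 4;
}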
5.2.2 OpenFlow Switching Hardware
In addition to virtual switches implemented on software nodes, hardware network switches can also
be used to implement security for network connected devices. One of the most ubiquitous Hardware
Switch standards is the OpenFlow standard [70]. The OpenFlow standard was specifically introduced as an open Software Defined Networking (SDN) solution; SDN describes network deployment and management solutions that separate the data forwarding plane from the control plane. In reference to
security, OpenFlow is relevant because it introduces a format for rules to influence how packets are
forwarded or dropped when processed by the OpenFlow switch containing those rules; these rules can include ACLs to implement security measures on an OpenFlow switch.

Figure 5.5: Example Implementation of an OpenFlow Capable Switch
Complete OpenFlow switch solutions have been implemented on FPGAs [71] [72], and they can
provide the same level of security afforded to software systems through the use of rules that target
security, such as ACLs, adherence to routing protocols and stateful inspection. However, they consume
significant resources, on the order of 15-36 percent of LUTs and 45-62 percent of BRAMs for the devices used. This high area overhead indicates that full switch solutions implemented on FPGAs are likely too large to implement in conjunction with a shell and multiple Hardware Application Regions; alternative solutions must be sought that minimize the area overhead.
One possible OpenFlow switch solution is shown in Figure 5.5. The packets flow in from the network
inputs to the outputs after they have been processed. When packets arrive at the network input, they
are parsed for key network fields that are compared against the rules stored in the OpenFlow tables. The kinds of fields parsed out from a packet include source and destination MAC addresses, source and destination IP addresses, port numbers, etc. Once the packet has been parsed, the parsed fields are sent to a queue, where they wait to be processed by the OpenFlow tables, and the packet itself
is sent to a buffer until its eventual destination is determined.
The OpenFlow table processor pulls parsed packet data from one of the queues waiting to be processed
and compares the fields to the expected field data in each of the OpenFlow rules. OpenFlow rules are
stored in OpenFlow tables, which are implemented as Ternary Content Addressable Memories (TCAMs). If the parsed packet data matches a rule stored in the OpenFlow table TCAM, that rule has an associated action that is used to modify the packet, modify the parsed fields, update some internal switch metrics, or add some metadata to the parsed fields. The parsed packet data is forwarded through a series of these OpenFlow tables, matching up to one rule per OpenFlow table. Once the packet has passed through all of the OpenFlow tables, the action list it has accumulated is implemented by modifying the
packet in the ways specified (e.g. removing a VLAN tag field, or updating some IP field value), and/or
dropping/forwarding the packet to the specified output interface. Note, an OpenFlow switch can send
modified parsed packet data back to the queue to be reprocessed by the OpenFlow tables.
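As a rough illustration of the match-and-accumulate flow just described (not the OpenFlow specification's exact pipeline), the sketch below models each table as an array of value/mask rules, as a TCAM would hold them, and collects at most one action per table; all type and function names are invented for the example.

#include <stdint.h>

/* One ternary rule: a field matches when all bits selected by the mask
 * equal the expected value (mask bit 0 = don't care). */
typedef struct {
    uint64_t value;   /* expected field bits                */
    uint64_t mask;    /* 1 = bit must match, 0 = don't care */
    int      action;  /* e.g. FORWARD, DROP, SET_VLAN, ...  */
} flow_rule_t;

/* Returns the index of the first matching rule, or -1 for a table miss. */
int table_lookup(const flow_rule_t *table, int n_rules, uint64_t field)
{
    for (int i = 0; i < n_rules; i++)
        if ((field & table[i].mask) == (table[i].value & table[i].mask))
            return i;
    return -1;
}

/* Walk a pipeline of tables, accumulating one action per matching table;
 * the caller applies the accumulated action list to the buffered packet. */
int process_packet(const flow_rule_t *const *tables, const int *sizes,
                   int n_tables, const uint64_t *parsed_fields,
                   int *actions_out)
{
    int n_actions = 0;
    for (int t = 0; t < n_tables; t++) {
        int hit = table_lookup(tables[t], sizes[t], parsed_fields[t]);
        if (hit >= 0)
            actions_out[n_actions++] = tables[t][hit].action;
    }
    return n_actions;
}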
Some works modify this basic structure to implement reduced versions of the OpenFlow standard. For
example, the work presented in [72] modifies the OpenFlow table structure such that each rule matched
can have multiple actions associated with it, and then does not include multiple OpenFlow tables (the
work has multiple tables, but it is best interpreted as a single OpenFlow table that is pipelined). This
solution limits the flexibility of the OpenFlow standard, but also reduces the overall area needed for the hardware switch implementation.
From this description we can glean why the full OpenFlow Hardware solutions might take up such
a significant amount of FPGA hardware resources. While the OpenFlow switch can implement network
security, it also has a great deal of overhead that is included to deal with other networking needs, such
as packet forwarding and VLAN tagging. Also, the queuing structure for parsed packet data forces the
parsed data and the packets themselves to be buffered. The need for buffer space would be determined by
the maximum number of packets the switch needs to hold while they to wait to be processed, which can be
significant depending on the network speed of the Ethernet interface, the number of network interfaces,
and the average time it takes to process a single packet. All of this added buffering, the inclusion of
multiple OpenFlow tables, and the need to sometimes reprocess a packet through the OpenFlow tables
also can add a significant amount of latency to the processing of a packet. Solutions that target security
exclusively can omit some of the overhead of a full OpenFlow switch implementation to reduce this area
overhead need and alleviate this long packet processing latency.
5.3 The Network Management Unit
The software analogues demonstrate some of the needs of network security, namely the enforcement of
access control (either directly or by hairpinning such functionality to the next-level physical switch or
some hardware appliance), and the ability to route traffic between logical interfaces on the same FPGA.
In traditional software virtual environments, VMs share memory and I/O connections. The memory
sharing is generally provisioned by hardware means; specifically, data isolation is provided through the employment of an MMU [28]. As an analogy to the MMU, which provides memory data isolation, we propose the creation of an NMU, which provides network domain isolation. Based on the related work and the trends we identified, we contend that the NMU is required to enable the secure deployment of direct-connected FPGAs in multi-user or multi-tenant datacentres and cloud deployments.
Similar to the software analogues presented in the previous section, there can be many potential
ways to secure the network interface for shared use of the network resources. For example, in
Chapter 2, several works were presented that had some kind of network security guarantees. The work
presented by Byma et al. [33] policed outgoing traffic by replacing the source MAC address with the
one assigned to the sender; the work presented by Tarafdar et al. [34] encapsulated data within a MAC
packet; and the work presented by Microsoft Research, specifically Catapult v2 [3], encapsulated data in a custom transport-layer protocol called the Lightweight Transport Layer (LTL). In this section, some of the considerations that might be needed for network security are presented, and a nomenclature is developed
to refer to these NMUs.
Note, the exact requirements of the NMU design will always depend on the specific deployment
details of the datacentre in which the FPGAs are to be deployed. For this reason, we do not present a
single NMU that we posit meets the requirements for domain isolation of networking interfaces. Instead,
a number of potential NMU designs are presented, which represent a series of deployment scenarios that
we claim meet the domain isolation needs of many common FPGA deployments.
5.3.1 Access Control Level
We note from the software analogues that ACLs are one important way in which network connectivity
should be secured. Access control functionality can be done within the NMU, or hairpinned to the next-level switch. The first criterion by which we categorize potential NMU designs is the level of access control done within the NMU rather than pushed to the next-level switch.
Un-Inspected Networking (Type A)
At the lowest level, we have NMUs that do not inspect outgoing packets at all and push all access control
functionality to the next-level switch (and potentially a further firewall appliance); we call these Type A
NMUs. Of course, for the next-level switch to be able to uniquely identify separate logical interfaces,
some methodology must be employed to mark outgoing packets as originating from a particular logical
interface. Two recent IEEE standards could be used to this end.
The Edge Virtual Bridging standard (802.1Qbg) [73] allows for a single physical port of a switch to be
treated as multiple logical ports by associating each logical connection with a specific Service VLAN tag.
Similarly, the Bridge Port Extension standard (802.1BR) [74] allows for a single physical port on a switch
to be expanded into multiple individually managed connections using a custom tag structure. Thus,
a Type A NMU should employ such tagging to push both routing and access control to the next-level
switch.
The simplicity of Type A NMUs lends itself to simple hardware realizations, but these NMUs require all ACLs to be implemented at the next-level switch, tightly coupling the hardware application to the
switch configuration, which is not desirable (the datacentre management framework must manage ACLs
in multiple places with multiple update and management procedures).
Source Semantics Enforcement (Type B)
The next level of access control is source semantics enforcement, i.e., ACLs that ensure the sender
addresses in the packets are correct and no other device addresses are spoofed; we term these Type B
NMUs. This is the type of NMU applied in the work presented by Byma et al. [33]. If source semantics
are enforced on the FPGA, further access controls can be applied at the next-level switch without the
configuration complexity of the Type A NMUs. Also, the Type B NMU does not rely on relatively new
IEEE standards that may have limited adoption. While the configuration complexity is reduced, most
access controls must still be implemented on the next-level switch; Type B NMU solutions remain tightly
coupled to the switch configuration.
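A minimal sketch of what Type B enforcement amounts to is shown below, assuming one assigned MAC address per logical interface. Whether a violating frame is dropped or has its source address overwritten (as in the MAC-replacement approach mentioned above) is a policy choice; the function name and signature are illustrative rather than taken from the implemented design.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Source semantics enforcement for one outgoing Ethernet frame: the
 * source MAC (bytes 6..11) must equal the address assigned to the
 * sending logical interface.  Returns true if the frame may be sent. */
bool enforce_source_mac(uint8_t *frame, size_t frame_len,
                        const uint8_t assigned_mac[6], bool rewrite)
{
    if (frame_len < 14)
        return false;                       /* runt frame: drop        */

    if (memcmp(frame + 6, assigned_mac, 6) == 0)
        return true;                        /* source address is valid */

    if (rewrite) {                          /* police by rewriting     */
        memcpy(frame + 6, assigned_mac, 6);
        return true;
    }
    return false;                           /* spoofed source: drop    */
}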
Destination Rule Enforcement (Type C)
We define Type C NMUs as those that perform both sender and destination based access controls on
the FPGA. The full scope of what might constitute access control could be quite wide, and in fact might
include the full implementation of a switch on the FPGA. As discussed in the previous section, such an
implementation is likely infeasible or carries too high an overhead. Instead, we narrow the definition of
access controls.
Some previous works have shown FPGA datacentre deployments that rely solely on static point-to-point links between the FPGAs. Limiting the NMU's access control to a single destination field per logical network interface would allow for some access control to be implemented in the Type C NMU at relatively low cost. Moreover, multiple logical network interfaces can be provided to each hardware
application to implement point-to-multipoint connectivity. Other simple destination-based rules can also
be included, such as limiting the ability to send multicast packets, and limiting IP traffic to a specific
subnet. We contend that these simple access controls are powerful enough for many tasks.
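The sketch below illustrates how narrow such per-interface destination rules can be, assuming one permitted unicast destination, a multicast enable flag, and a single IPv4 subnet; the structure and names are illustrative, not the rule format of the implemented NMU.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simple per-logical-interface destination rules for a Type C style NMU. */
typedef struct {
    uint8_t  allowed_dst_mac[6];  /* single permitted destination       */
    bool     allow_multicast;     /* permit multicast/broadcast frames  */
    uint32_t subnet;              /* permitted IPv4 subnet (host order) */
    uint32_t subnet_mask;
} dest_rules_t;

bool dest_allowed(const dest_rules_t *r,
                  const uint8_t dst_mac[6], uint32_t dst_ip)
{
    bool is_multicast = (dst_mac[0] & 0x01) != 0;  /* group bit set     */

    if (is_multicast)
        return r->allow_multicast;

    if (memcmp(dst_mac, r->allowed_dst_mac, 6) != 0)
        return false;                              /* wrong destination */

    return (dst_ip & r->subnet_mask) == (r->subnet & r->subnet_mask);
}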
The Type C NMU adds complexity in the hardware implementation, and thus area overhead; however, it removes the tight coupling between the hardware application and the network infrastructure, which should greatly ease deployment. Of course, this is limited: if the point-to-point access controls are not sufficient to isolate the network accesses, more powerful ACLs from the next-level switch
would be needed.
Packet Encapsulation (Type E)1
Finally, Type E NMUs eliminate the need for access controls by moving packet encapsulation into the
NMU itself; instead of users performing network packetization within their own Hardware Applications,
they simply send the payload to the NMU, which encapsulates it within the appropriate network packet.
This is the methodology imposed in the implementation by Tarafdar et al. [34], and implied as an option
in the Catapult v2 work with the introduction of the LTL protocol [3].
Type E NMU solutions can be quite simple in terms of the hardware required to implement them, and
there is no tight coupling between the hardware application deployment and the network configuration.
Type E NMUs are however the least flexible, as they impose point-to-point only connectivity. Type E
NMUs also share network encapsulation hardware between the hardware applications, reducing area
utilization, but they also require Hardware Applications to be rewritten to target the encapsulation-
based NMU scheme.
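The sketch below illustrates the Type E division of labour under the assumption of a fixed Layer 2 encapsulation per logical interface: the application supplies only a payload, and the NMU supplies every header byte from per-interface configuration. The configuration fields and function are illustrative; a real deployment could just as well encapsulate at Layer 3/4, as the LTL work does with a custom transport protocol.

#include <stdint.h>
#include <string.h>

/* Per-logical-interface encapsulation configuration for a Type E NMU. */
typedef struct {
    uint8_t  dst_mac[6];   /* fixed peer for this logical interface */
    uint8_t  src_mac[6];   /* address owned by this interface       */
    uint16_t ethertype;    /* e.g. a reserved experimental value    */
} encap_cfg_t;

/* Build an Ethernet frame around a raw payload; 'frame_out' is assumed
 * to be at least payload_len + 14 bytes. */
size_t encapsulate(const encap_cfg_t *cfg,
                   const uint8_t *payload, size_t payload_len,
                   uint8_t *frame_out)
{
    memcpy(frame_out,      cfg->dst_mac, 6);
    memcpy(frame_out + 6,  cfg->src_mac, 6);
    frame_out[12] = (uint8_t)(cfg->ethertype >> 8);
    frame_out[13] = (uint8_t)(cfg->ethertype & 0xFF);
    memcpy(frame_out + 14, payload, payload_len);
    return 14 + payload_len;
}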
5.3.2 Internal Routing
Another functionality that might be required is the routability of traffic between logical network interfaces
located on the same FPGA. In general, hairpin routing to the next-level switch and back is not possible
since the IEEE switch specifications explicitly forbid the re-routing of packets to the interface on which
the packet was received. The Edge Virtual Bridging [73] and the Bridge Port Extensions [74] protocols
are exceptions, so the Type A NMUs based on these standards enable routability by default.
For other NMU types, routability between the logical network interfaces can only be provided by
including routing functionality directly in the NMU; we term such NMUs Type *R NMUs. Note, routability does not necessarily need to be provided, though this would impose on the cloud management framework the limitation that two applications that need to communicate with each other must be provisioned on different FPGAs; this might be an onerous limitation. Not providing routability is the methodology employed by Byma et al. [33], for example.
1 Type D is intentionally unused and reserved for NMUs with a richer set of access controls (such as fully implemented switches on FPGAs, stateful access controls, or OpenFlow flow tables), left for future work
5.3.3 VLAN Networking Support
From the NIST publication, another common way to ensure network security is by encapsulating packets
within a virtual network, such as a VLAN or a VXLAN. A VLAN-based NMU would tag packets from each logical network interface with the appropriate VLAN tag without having to parse the packet itself, and as such
we classify it as a Type A NMU (Types Av and ARv). A VXLAN-based NMU would encapsulate the
whole packet within a VXLAN delivery packet, and as such we classify it as a Type E NMU (Types Ev
and ERv).
5.3.4 Layer of Network Virtualization
Routing functionality and access control can be implemented at various levels of the network protocol
stack, depending on the desired abstraction to present to the hardware application. For example, the
hardware applications might have their own MAC addresses, or they might share a MAC/IP address
and differ only on the Layer 4 port number. NMUs can be designed to process packets at a specific layer
of the network protocol stack: MAC-only NMU, MAC/IP NMU, and MAC/IP/Layer4 NMU.
5.3.5 NMU Nomenclature
The previous subsections have presented many different features that could be implemented to provide an
effective network security solution. For simplicity, all of these NMU types and features are summarized
by the nomenclature presented in Table 5.1. The Type of the NMU is determined by the level of access
control that it supports. In addition, an R or a v can be added to indicate that the NMU supports
routing between Hardware Applications on the same FPGA and that the NMU specifically targets a
virtualized network technology, respectively. Finally, the layer of the network stack at which the NMU operates is appended to the end of the name. As a final note, a Universal NMU is used to refer to an NMU
that is designed to support all of the potential features; a Universal NMU can be parameterized by the
FPGA management framework at runtime to determine which of the modes to implement for each of
the Hardware Applications and the network access ports indicated by their VIID.
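As an illustration of what "parameterized at runtime" might look like from the management framework's point of view, the following hypothetical per-VIID configuration record collects the knobs implied by the nomenclature above; the field names and encodings are assumptions for the example, not the register map of the implemented design.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical runtime configuration for one logical network interface
 * (one VIID) of a Universal NMU, written by the management framework
 * (e.g. over PCIe). */
typedef enum { NMU_TYPE_A, NMU_TYPE_B, NMU_TYPE_C, NMU_TYPE_E } nmu_type_t;
typedef enum { NMU_L2, NMU_L3, NMU_L4 } nmu_layer_t;

typedef struct {
    nmu_type_t  type;             /* access control level (A/B/C/E)     */
    nmu_layer_t layer;            /* protocol layer the NMU operates at */
    bool        internal_routing; /* "*R": route between co-resident    */
                                  /* hardware applications on-chip      */
    bool        virtualized;      /* "*v": VLAN/VXLAN virtual network   */
    uint16_t    vlan_vid;         /* VLAN ID when virtualized at L2     */
    uint8_t     assigned_mac[6];  /* source address owned by this VIID  */
    uint32_t    allowed_subnet;   /* Type C style destination limit     */
    uint32_t    allowed_subnet_mask;
} nmu_viid_cfg_t;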
5.4 Network Management Unit Hardware Design
The Network Management Unit was introduced conceptually in the previous section. This section
discusses the actual hardware implementation of the NMU for synthesis into a shell design. To illustrate
the intention of the work described in this section, see Figure 5.6. In part (a) of the figure, a shell
with just the performance isolation components is shown. In part (b) of the figure, the Simple NMU of part (a), which in itself provides no network security, is replaced with a more complex NMU based on the descriptions in the previous section. This complex NMU is connected to the PCIe-based management framework such that the parameters of the NMU can be set at runtime.

Table 5.1: NMU Nomenclature Summary

Naming scheme: Type (A|B|C|E) [R] [v] - [L2|L3|L4]

Type A      No access controls provided within the FPGA; some tagging such that ACLs can be applied at the next-level physical switch (hairpinning)
Type B      Source semantics enforcement for all outgoing traffic from hardware applications, allowing ACLs at the next-level switch while eliminating spoofing
Type C      Source semantics enforcement and some simple destination-based access controls (e.g. restricting to a single destination, or restricting multicast and/or broadcast)
Type E      Encapsulation: hardware applications send payload without generating packet headers; network packet generation is done in the NMU itself
Type *R     Routing between hardware applications on the same FPGA done inside of the NMU (no hairpinning)
Type *v     Virtualized networking environment supported
[L2|L3|L4]  Network protocol stack layer the NMU operates with respect to (L2 = MAC, L3 = IP, L4 = Transport)

E.g. Type A-vepa, Type A-etag, Type Av, Type ARv, Type B-L2, Type B-L3, Type B-L4, Type BR-L2, Type BR-L3, Type BR-L4, Type C-L2, Type C-L3, Type C-L4, Type CR-L2, Type CR-L3, Type CR-L4, Type E-L2, Type E-L3, Type E-L4, Type ER-L2, Type ER-L3, Type ER-L4, Type ERv-vxlan, Type ERv-nvgre, Type Ev-vxlan, Type Ev-nvgre, Universal
5.4.1 Reusable Sub-Components
To implement the functionality required of the NMUs, we need packet processing components that can
examine the packets and pull out the relevant header information, as well as modify the packets by
inserting and removing headers/fields. These components can be designed as reusable sub-components to reduce the complexity of deploying the various types of NMUs.
Packet Parser-Processor
Packet parsers are used to pull out header information from a packet. This header information is
then generally compared to some ACLs or a routing table. Previous works doing packet processing
on FPGAs range from complex programmable parser designs [75], to simpler parsers generated from
domain specific languages [76]. One of the focuses of our solution is to minimize the hardware overhead of network security for virtualized FPGAs, so we focus on the simpler designs.

Figure 5.6: Adding NMU to the Shell (a) Shell without NMU (b) Shell with added NMU
The simple parser architectures include parsers for each part of the network protocol stack, cascading the parsers and accumulating the parsed information. For example, parsers could be created and
connected in a cascade for MAC-parsing, IPv4-parsing, ARP-parsing, etc. The parsers themselves are
generally simple, including a counter that counts the current position within the packet stream, and
specialized field extractors that look for particular offsets within the packet for the field to be extracted.
Note, the position that the field extractor must look for to find the field can change based on previous
packet fields extracted, and so cannot necessarily be hard-coded.
We employed a similar parser design in our work. Figure 5.7 shows a number of Field Extraction
Sequencers that each extract a particular field in the packet. Traditional packet parsing systems pull
out all the fields of interest, through some series of packet parsers, and then pass those fields en masse
to some routing table or flow table structure to be analyzed and processed (e.g., like the OpenFlow
standard switch implementations). A key difference in our design is the inclusion of the Access Control
and Routing CAM logic for a particular field directly within the parser responsible for extracting that
field. This design allows for the cascaded parsers to simply pass along the cumulative routing and ACL
status instead of the entire field (which could otherwise contribute to high register utilization in highly pipelined
designs). This direct inclusion in the parser also eliminates the need for buffering of packets and queuing
of parsed packet data for processing, since all the parsers by necessity must operate at line rate. Access
Control and Routing CAM components can be excluded if not needed for a particular NMU type.
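A behavioural sketch of one such Field Extraction Sequencer is given below, assuming a 64-bit beat for illustration: a beat counter locates the field, the field bytes are captured as the packet streams past, and the access-control comparison happens inside the parser so that only a pass/fail status is carried forward. The data structure and the choice to run the compare at the end of the packet are simplifications of the pipelined hardware.

#include <stdbool.h>
#include <stdint.h>

/* Behavioural sketch of one Field Extraction Sequencer (Figure 5.7),
 * processing one 8-byte beat per call. */
typedef struct {
    uint32_t field_offset;  /* byte offset of the field in the packet  */
    uint32_t field_len;     /* field length in bytes (at most 8 here)  */
    uint64_t expected;      /* value the ACL compare checks against    */
    uint32_t beat_count;    /* index of the current 8-byte beat        */
    uint64_t captured;      /* field bytes captured so far             */
} field_seq_t;

/* Consume one beat; returns true (ACL error) only on the last beat of
 * a packet whose extracted field does not match the expected value. */
bool field_seq_beat(field_seq_t *s, const uint8_t beat[8], bool tlast)
{
    uint32_t base = s->beat_count * 8;         /* byte offset of beat   */
    for (uint32_t i = 0; i < 8; i++) {
        uint32_t pos = base + i;
        if (pos >= s->field_offset && pos < s->field_offset + s->field_len)
            s->captured = (s->captured << 8) | beat[i];
    }
    bool acl_error = false;
    if (tlast) {
        acl_error     = (s->captured != s->expected);
        s->beat_count = 0;                     /* reset for next packet */
        s->captured   = 0;
    } else {
        s->beat_count++;
    }
    return acl_error;
}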
Figure 5.7: Packet Parser Architecture
Tagger/Encapsulator
The tagger and encapsulation components are used to insert bytes at the beginning or in the middle
of a packet, to support the Type A and Type E NMUs respectively. To insert bytes into a packet, the
incoming packet stream must first be divided into segments that can be read and pushed to the output
individually. This is accomplished using a segmented FIFO, where the segments form multiple FIFO
outputs. The segmentation is done on a 16-bit basis, since all network headers at Layer 4 and below are
aligned to 16-bit boundaries.
Figure 5.8 shows the implemented tagger/encapsulation core, with the input driving a segmented
FIFO. The output stream is generated by using multiplexers to select from the segments of the input
FIFO, and the tag/encapsulation data to be inserted into the packet. A Packet Output Sequencer,
implemented as a Finite State Machine, sequences the input and the bytes of the data to construct the
output packet. The stream VIID from the input is used to determine which logical network interface
sent the packet that is currently being processed. This VIID is used to index into the Tag/Encap Data
register file to access the tag data to be inserted into packets specifically from that logical interface.
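The following byte-level sketch captures what the output sequencer accomplishes, assuming the tag data is looked up by VIID and inserted at a 16-bit aligned offset; the software copy stands in for the segmented FIFO and multiplexer structure and is not meant to mirror the RTL.

#include <stdint.h>
#include <string.h>

/* Per-VIID tag data, as held in the Tag/Encap Data register file. */
#define MAX_VIIDS     32
#define MAX_TAG_BYTES 8

typedef struct {
    uint8_t tag[MAX_VIIDS][MAX_TAG_BYTES];  /* per-interface tag bytes      */
    uint8_t tag_len[MAX_VIIDS];             /* tag length, multiple of 2    */
} tag_regfile_t;

/* Insert the tag for 'viid' at a 16-bit aligned offset inside the packet;
 * the caller guarantees insert_off <= in_len and a large enough 'out'. */
size_t tag_packet(const tag_regfile_t *rf, uint8_t viid,
                  const uint8_t *in, size_t in_len,
                  size_t insert_off, uint8_t *out)
{
    size_t tlen = rf->tag_len[viid];
    memcpy(out, in, insert_off);                       /* leading segments  */
    memcpy(out + insert_off, rf->tag[viid], tlen);     /* inserted tag      */
    memcpy(out + insert_off + tlen, in + insert_off,   /* trailing segments */
           in_len - insert_off);
    return in_len + tlen;
}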
De-Tagger/De-Encapsulator
The de-tagger and de-encapsulation components perform the opposite task to the tagger and encapsulator.
For packets coming in from the network, these components can be used to strip some bytes from the
packet that are not needed in the downstream hardware applications, like various tag information or
As with the memory protections introduced into the shell, we evaluate our shell design based on the
area overhead of its implementation.
5.5.1 Shell Design
Most of the components of the network securitization part of the shell are similar in nature to the solutions presented for the memory. We evaluate their overhead in a similar way, incrementally adding the isolation components to an existing shell design. The results are summarized in Table 5.2 and Table 5.3, with the second table giving the percentage of available resources on the FPGA that the shell uses. These evaluations are similarly performed on the Kintex Ultrascale XCKU115.
We note that adding the performance isolation components has very little impact on the total area utilization of the shell. This makes intuitive sense, since the amount of logic needed to decouple and verify the protocol assertions on the network interface was fairly small. The NMU, however, adds a great deal of overhead to the system: LUT usage increases by 62 percent, LUTRAM usage by 74 percent, and flip-flop usage by 46 percent. The NMU is considerably more logic intensive than the other components of the design, so this also makes sense. Even so, the total utilization of the modified shell does not exceed 9 percent of the whole FPGA. Considering the functionality that is possible using the Universal NMU, it is a worthwhile inclusion in any FPGA deployed in the datacentre. More detailed analysis of the NMU follows.

Figure 5.11: NMU Varieties (a) Universal NMU, with components labeled and marked with symbols to be used as the legend for sub-figures (b) Type A NMUs (c) Type B NMUs (d) Type C NMUs (e) Type BR NMUs (f) Type CR NMUs (g) Type E NMUs (h) Type ER NMUs
5.5.2 NMU Overhead
The NMU designs were tested on an Alpha Data 8k5 FPGA add-in board with a 10Gb Ethernet connection; the FPGA on that board is a Xilinx Kintex Ultrascale XCKU115. All tests were done using the Xilinx Vivado 2018.1 software, and the associated versions of the PCIe Subsystem and Ethernet
Subsystem cores.
The NMU was placed in a system with four hardware applications, each connected to the ingress and
egress ports of the Ethernet Controller through an AXI Stream Switch. Each application is provided
eight logical network connections, so the NMUs evaluated support 32 total logical connections. The
Ethernet controller has a datapath width of 64 bits and operates at 156.25 MHz, which is the clock used
for the whole test platform (except for the PCIe Controller). The applications themselves simply include
a Block RAM that stores packet data, a DMA device to send that packet data out to the network, and a
DMA engine that receives data from the network to store to Block RAM. Each of the applications is controlled
through PCIe by a Host PC that manages the test setup. The Host is also responsible for configuring
the NMU. Figure 5.12 shows the architecture of the test platform.
To evaluate the various NMUs based on the previous descriptions, each of those design decisions
is compared on an area utilization and unloaded latency basis. Note, such designs would generally be
evaluated in terms of throughput as well, but all of the packet processing components used in this work
operate at the 10Gbps line-rate of the Ethernet controller. All of the results are shown in Table 5.4.
Access Control
Part (b) of Table 5.4 shows the area and latency results of the four different types of NMUs. The Type A
NMU, as expected, has the lowest area and latency, though this is likely because the Type A NMU does
not need on-FPGA switching to allow the Hardware Applications to communicate (The Bridge Port
Extensions E-tag standard allows for hairpin routing). The encapsulation-based NMU has a slightly lower utilization, indicating that Type E NMUs might be preferable to reduce area utilization, though this comes at the cost of slightly increased latency caused by the segmented FIFOs included in the packet path. Finally, we note that the added overhead of implementing some destination-based access controls (i.e., Type C NMUs) is fairly minimal.

Figure 5.12: Multi-Application Test Setup for Networking
Virtualization
The results of the evaluation for the two virtualized networking NMUs are shown in Part (c) of Table 5.4.
The VLAN-based virtualization solution uses about the same amount of resources as the Type B and Type C NMUs from Part (b), though there is added latency from the tagging functionality. The VXLAN-based solution has a much higher utilization because it must first parse a full Layer 4 packet before identifying the virtual ID and routing the packet. The modest area overhead relative to the other NMUs might be worth it considering the ease of deployment, and ubiquity, of virtual networking solutions.
Routability
Dropping the requirement that there be routability between co-resident hardware applications cuts the
area utilization in half for the Type B and Type C NMUs, and nearly in half for the other NMUs, as
shown in Part (d) of Table 5.4. There is also a drop in latency from removing the Switching. Note,
(AXI4 Protocol Write Address Channel Assertions, continued)

AXI_ERRM_AWADDR_BOUNDARY: A write burst cannot cross a 4KB boundary.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Error: causes out of bounds access.

AXI_ERRM_AWADDR_WRAP_ALIGN: A write transaction with burst type WRAP has an aligned address.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Error: undefined behaviour.

AXI_ERRM_AWBURST: A value of 2'b11 on AWBURST is not permitted when AWVALID is High.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Defaults to INCR burst type, no error.

AXI_ERRM_AWLEN_LOCK: Exclusive access transactions cannot have a length greater than 16 beats.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.

AXI_ERRM_AWCACHE: If not cacheable, AWCACHE = 2'b00.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Signal unused, no error.

AXI_ERRM_AWLEN_FIXED: Transactions of burst type FIXED cannot have a length greater than 16 beats.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: FIXED burst type unsupported, defaults to INCR type, no error.

AXI_ERRM_AWLEN_WRAP: A write transaction with burst type WRAP has a length of 2, 4, 8, or 16.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Error: undefined behaviour.

AXI_ERRM_AWSIZE: The size of a write transfer does not exceed the width of the data interface.
    Interconnect response [53]: Error: data width converters may not operate correctly. Mem controller response [47]: Error: interconnect error.

AXI_ERRM_AWVALID_RESET: AWVALID is Low for the first cycle after ARESETn goes High.
    Interconnect response [53]: PR reset and static region reset are not asserted at the same time, no error. Mem controller response [47]: PR reset and static region reset are not asserted at the same time, no error.

AXI_ERRM_AWxxxxx_STABLE: Handshake check: AWxxxxx must remain stable when AWVALID is asserted and AWREADY is Low.
    Interconnect response [53]: Error: changing signals may affect interconnect functionality. Mem controller response [47]: Error: interconnect error.

AXI_ERRM_AWREADY_MAX_WAIT: Recommended that AWREADY is asserted within MAXWAITS cycles of AWVALID being asserted.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.
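To make a couple of the write address channel checks concrete, the sketch below evaluates the 4KB boundary rule (for INCR bursts) and the WRAP length/alignment rules in software, using the AXI encodings of AWLEN (beats minus one) and AWSIZE (log2 of bytes per beat). The function names are illustrative; the actual protocol verifier in the shell is a hardware block.

#include <stdbool.h>
#include <stdint.h>

/* AXI_ERRM_AWADDR_BOUNDARY: an INCR write burst must not cross a 4KB
 * address boundary. */
bool awaddr_boundary_ok(uint64_t awaddr, uint8_t awlen, uint8_t awsize)
{
    uint64_t total_bytes = (uint64_t)(awlen + 1) << awsize;
    return (awaddr % 4096) + total_bytes <= 4096;
}

/* AXI_ERRM_AWLEN_WRAP / AXI_ERRM_AWADDR_WRAP_ALIGN: a WRAP burst must be
 * 2, 4, 8 or 16 beats long, and its start address must be aligned to the
 * size of each transfer. */
bool wrap_burst_ok(uint64_t awaddr, uint8_t awlen, uint8_t awsize)
{
    unsigned beats = (unsigned)awlen + 1;
    bool len_ok    = (beats == 2 || beats == 4 || beats == 8 || beats == 16);
    bool align_ok  = (awaddr % ((uint64_t)1 << awsize)) == 0;
    return len_ok && align_ok;
}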
Table A.2: AXI4 Protocol Write Data Channel Assertions

AXI_ERRM_WDATA_NUM: The number of write data items matches AWLEN for the corresponding address. This is triggered when any of the following occurs: write data arrives, WLAST is set, and the WDATA count is not equal to AWLEN; write data arrives, WLAST is not set, and the WDATA count is equal to AWLEN; ADDR arrives, WLAST is already received, and the WDATA count is not equal to AWLEN.
    Interconnect response [53]: Error: may cause interconnect to hang. Mem controller response [47]: Error: interconnect error.

AXI_ERRM_WSTRB: Write strobes are only asserted for the valid byte lanes of the transfer.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Protocol error ignored.

AXI_ERRM_WVALID_RESET: WVALID is Low for the first cycle after ARESETn goes High.
    Interconnect response [53]: PR reset and static region reset are not asserted at the same time, no error. Mem controller response [47]: PR reset and static region reset are not asserted at the same time, no error.

AXI_ERRM_Wxxxxx_STABLE: Handshake check: Wxxxxx must remain stable when WVALID is asserted and WREADY is Low.
    Interconnect response [53]: Error: changing signals may affect interconnect functionality. Mem controller response [47]: Error: interconnect error.

AXI_ERRM_WREADY_MAX_WAIT: Recommended that WREADY is asserted within MAXWAITS cycles of WVALID being asserted.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.
(AXI4 Protocol Read Address Channel Assertions)

AXI_ERRM_ARADDR_BOUNDARY: A read burst cannot cross a 4KB boundary.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Error: causes out of bounds access.

AXI_ERRM_ARADDR_WRAP_ALIGN: A read transaction with burst type WRAP has an aligned address.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Error: undefined behaviour.

AXI_ERRM_ARBURST: A value of 2'b11 on ARBURST is not permitted when ARVALID is High.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Defaults to INCR burst type, no error.

AXI_ERRM_ARLEN_LOCK: Exclusive access transactions cannot have a length greater than 16 beats.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.

AXI_ERRM_ARCACHE: If not cacheable, ARCACHE = 2'b00.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Signal unused, no error.

AXI_ERRM_ARLEN_FIXED: Transactions of burst type FIXED cannot have a length greater than 16 beats.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: FIXED burst type unsupported, defaults to INCR type, no error.

AXI_ERRM_ARLEN_WRAP: A read transaction with burst type WRAP has a length of 2, 4, 8, or 16.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Error: undefined behaviour.

AXI_ERRM_ARSIZE: The size of a read transfer does not exceed the width of the data interface.
    Interconnect response [53]: Error: data width converters may not operate correctly. Mem controller response [47]: Error: interconnect error.

AXI_ERRM_ARVALID_RESET: ARVALID is Low for the first cycle after ARESETn goes High.
    Interconnect response [53]: PR reset and static region reset are not asserted at the same time, no error. Mem controller response [47]: PR reset and static region reset are not asserted at the same time, no error.

AXI_ERRM_ARxxxxx_STABLE: Handshake check: ARxxxxx must remain stable when ARVALID is asserted and ARREADY is Low.
    Interconnect response [53]: Error: changing signals may affect interconnect functionality. Mem controller response [47]: Error: interconnect error.

AXI_ERRM_ARREADY_MAX_WAIT: Recommended that ARREADY is asserted within MAXWAITS cycles of ARVALID being asserted.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.
Table A.5: AXI4 Protocol Read Data Channel Assertions

AXI_ERRM_RLAST_ALL_DONE_EOS: All outstanding read bursts must have completed.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.

AXI_ERRM_RDATA_NUM: The number of read data items must match the corresponding ARLEN.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.

AXI_ERRM_RID: The read data must always follow the address that it relates to. If IDs are used, RID must also match ARID of an outstanding address read transaction. This violation can also occur when RVALID is asserted with no preceding AR transfer.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.

AXI_ERRM_RRESP_EXOKAY: An EXOKAY read response can only be given to an exclusive read access.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.

AXI_ERRM_RVALID_RESET: RVALID is Low for the first cycle after ARESETn goes High.
    Interconnect response [53]: PR reset and static region reset are not asserted at the same time, no error. Mem controller response [47]: PR reset and static region reset are not asserted at the same time, no error.

AXI_ERRM_Rxxxxx_STABLE: Handshake check: Rxxxxx must remain stable when RVALID is asserted and RREADY is Low.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.

AXI_ERRM_RREADY_MAX_WAIT: Recommended that RREADY is asserted within MAXWAITS cycles of RVALID being asserted.
    Interconnect response [53]: Error: not accepting response will cause interconnect to hang.

AXI_ERRM_EXCL_ALIGN: The address of an exclusive access is aligned to the total number of bytes in the transaction.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.

AXI_ERRM_EXCL_LEN: The number of bytes to be transferred in an exclusive access burst is a power of 2, that is, 1, 2, 4, 8, 16, 32, 64, or 128 bytes.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.

AXI_ERRM_EXCL_MATCH: Recommended that the address, size, and length of an exclusive write with a given ID is the same as the address, size, and length of the preceding exclusive read with the same ID.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.

AXI_ERRM_EXCL_MAX: 128 is the maximum number of bytes that can be transferred in an exclusive burst.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.

AXI_ERRM_EXCL_PAIR: Recommended that every exclusive write has an earlier outstanding exclusive read with the same ID.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.