Memory and Network Interface Virtualization for Multi-Tenant Reconfigurable Compute Devices
by
Daniel Rozhko
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

The Edward S. Rogers Sr. Department of Electrical & Computer Engineering
University of Toronto

© Copyright 2018 by Daniel Rozhko
The use of reconfigurable computing devices in mainstream datacentre applications is growing, with
more companies and institutions using such devices to accelerate compute workloads. Specific examples
of particular note include the work done to integrate Field Programmable Gate Array (FPGA) devices
into the Microsoft Bing search engine [1] [2] [3], the introduction of FPGA devices in Amazon’s AWS
cloud offering [4], and the work done at IBM Research to integrate FPGA devices into the cloud [5].
Many academic works have also explored the deployment of FPGA devices in datacentre environments;
clearly the datacentre deployment of FPGA devices is an emergent and popular solution to expand the
compute capabilities of datacentres.
For CPU-based compute nodes in datacentres, it is common to use virtualization technologies to
enable multi-tenant use, i.e. multiple resident virtual compute nodes on a single physical system. This
increases the efficiency of the datacentres by increasing the effective utilization of those compute nodes.
In addition, this enables cloud-based service models as a single physical system can be shared by the
multiple customers of a cloud provider. This thesis explores the tradeoffs in designing analogous virtual-
ization technologies for reconfigurable compute devices, particularly the shared memory and networking
interfaces of FPGA devices.
1.1 Motivation
Reconfigurable compute devices, and in particular FPGA devices, have the potential to accelerate many
compute operations. In fact, FPGA devices have been shown to accelerate encryption [6], compression [7],
packet processing [8], and even machine learning applications such as neural networks [9]. This has no
doubt motivated their continued deployment in datacentre applications.
Microsoft has successfully demonstrated their Catapult platform (both v1 [1] and v2 [2] versions),
which utilizes FPGAs to accelerate Bing search. Microsoft has since expanded their deployment of
FPGAs to their Azure offering [10], using FPGAs to compress network traffic for their cloud customers.
Amazon has deployed FPGA devices to their own cloud offering, AWS [4]. A single AWS instance can
be created with up to eight FPGA devices, completely programmable by the cloud users. The interest
in using FPGAs in the datacentre will likely only increase in the future, making the efficient design of
the platform that enables their deployment of key importance.
The benefits of virtualization, and thus the motivation for exploring analogous virtualization
technologies for FPGAs, are well established. Virtualization as a technology dates back to the
1960s [11], and its use in datacentres today is widespread. Simply put, virtualization allows for what
would have been multiple physical servers to run co-located as virtual servers on a single physical server
node. As long as the single physical server has enough resources to run the workloads of all of the
virtual servers along with the overhead of the virtualization itself, the virtualized deployment saves
on the need for more physical server nodes. The same argument motivates virtualization for FPGA
devices: if multiple FPGA bound applications can be co-located on a single physical FPGA, and that
physical FPGA has enough resources (i.e. area and interface bandwidth) to accommodate the original
applications and the overhead of the virtualization platform itself, the FPGA deployment can be made
with fewer total compute nodes.
In this thesis we specifically address the implementation of virtualization from the perspective of
performance, data, and network domain isolation, while also considering that FPGAs have a limited
amount of area. We present the implementation of various components to achieve these domains of
isolation for an FPGA targeting a multi-tenant deployment.
1.2 Contributions
The contribution of this thesis is a detailed analysis and exploration of the design tradeoffs involved
in virtualizing FPGA devices, and the design of a virtualization platform that considers these tradeoffs. The
major components of the contribution are as follows:
• The introduction and formalization of the concepts of a “hard shell” and “soft shell”, easing the
design and analysis of virtualization technologies targeting reconfigurable compute devices
• A functioning HDL implementation of memory virtualization hardware cores, targeting single and
multi-channel memory platforms
• A functioning HDL implementation of network virtualization hardware cores, exploring multiple
deployment scenarios depending on the network infrastructure
• An analysis of the area overheads incurred by the virtualization technologies considering the design
tradeoffs discussed
1.3 Overview
The remainder of this thesis is organized as follows: Chapter 2 will provide background information
and will review previous work on the virtualization of reconfigurable compute devices. Chapter 3 will
provide an analysis of various deployment models for FPGAs in the datacentre and introduce concepts
used in the design of virtualization technologies for reconfigurable compute devices. Chapter 4 and
Chapter 5 will present the design and analysis of memory interface virtualization and network interface
virtualization respectively, considering the analyses of Chapter 3. Finally, Chapter 6 will examine
avenues for future work and conclude the thesis.
Chapter 2
Background
This chapter introduces some concepts, technologies, and definitions used throughout the thesis. It also
examines some related work, providing context for the thesis.
2.1 Reconfigurable Compute Devices
Reconfigurable computing describes a general class of devices in which the primary circuit elements can
be reconfigured into a user-defined hardware configuration; the array of these reconfigurable elements
is often referred to as a reconfigurable fabric. The device as a whole performs some computation, often
interfacing with some external network or memory. This section describes such devices, particularly the
FPGA, which is the focus of the implementations presented in Chapters 4 and 5, as well as technologies
that enable the use of reconfigurable computing devices.
2.1.1 Field Programmable Gate Arrays
The most common reconfigurable fabric used in reconfigurable compute applications is the FPGA. An
FPGA is an integrated circuit designed such that its functionality can be reprogrammed after its manu-
facture, according to some specification of the user. Users specify some hardware configuration using a
Hardware Description Language (HDL), a class of programming languages that can be used to describe
digital hardware. The user’s HDL application can be synthesized into a bitstream (used to program the
FPGA) using a set of Computer Aided Design (CAD) tools. For example, Xilinx FPGA devices can be
targeted for synthesis using the Vivado CAD tool set [12]. The bitstream generated describes the desired
configuration of the FPGA’s various building blocks, which can be configured to implement most digital
circuit designs.
The main building block of an FPGA is the Look-up Table (LUT), which can implement a four-,
five-, or six-input combinational logic function. The LUTs are grouped with flip flops inside logic blocks,
which are connected to a series of switches and connection blocks implementing a programmable routing
fabric. In addition to the LUTs, modern FPGAs include dedicated Digital Signal Processing (DSP)
blocks to perform multiplication, Block Random Access Memory (BRAM) for local memory storage,
and often even PCIe [13] or Network controllers [14]. Note that hardware components implemented as
dedicated silicon on an FPGA chip, rather than through a configuration and interconnection of LUTs
to implement the logic, are often referred to as “hardened”, in contrast to digital circuits synthesized
on the programmable fabric, which are referred to as “soft”; the PCIe and Network controllers included
on FPGA chips are examples of such hardened components. See [15] for more information on FPGA
architecture.
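To make the LUT abstraction concrete, a minimal Python model is sketched below (an illustrative model only, not taken from any vendor documentation); a k-input LUT is simply a table of 2^k configuration bits, analogous to the SRAM cells programmed by the bitstream, and the majority function chosen here is an arbitrary example.

```python
# Behavioural sketch of a k-input LUT: the 2^k configuration bits play the role
# of the SRAM cells programmed by the bitstream, and the inputs index into them.

def make_lut(truth_table):
    """truth_table: list of 2^k output bits, indexed by the packed input vector."""
    k = len(truth_table).bit_length() - 1
    assert len(truth_table) == 1 << k, "table length must be a power of two"

    def lut(*inputs):
        assert len(inputs) == k
        index = 0
        for bit in inputs:                 # pack the inputs into an index, MSB first
            index = (index << 1) | (bit & 1)
        return truth_table[index]

    return lut

# Example: configure a 3-input LUT as a majority function (an arbitrary choice).
majority = make_lut([0, 0, 0, 1, 0, 1, 1, 1])
assert majority(1, 0, 1) == 1
assert majority(0, 1, 0) == 0
```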
2.1.2 Coarse-Grained Reconfigurable Architectures
Another relatively new reconfigurable compute device is the Coarse Grained Reconfigurable Architecture
(CGRA) device, which as the name suggests, includes much coarser building blocks than the LUTs of the
FPGA. These blocks can be configured to perform some larger arithmetic functions, such as additions
or shifts, and are often connected using some routing fabric [16]. As there are no general purpose LUTs
to implement arbitrary hardware, these devices are often programmed using a different method (i.e., in
contrast to the HDLs used for FPGAs). Also, any communication to the external world (e.g., memory
controllers or network controllers) would need to be implemented as hard elements. The content of this
thesis focuses on the FPGA, though similar memory and network interface designs as presented for the
FPGA class of devices could be implemented as hard elements if virtualization were to be considered for
CGRAs.
2.1.3 Computer Aided Design Tools
As mentioned in Section 2.1.1, CAD tools are used to take a user’s description of the hardware they
intend to implement on the device and create a description of how the elements of the device should be
organized and configured to achieve the specified implementation. This final configuration descriptor file
is generally referred to as a bitstream, in reference to the fact that the descriptor represents the states to
be programmed to the bits of Static Random Access Memory (SRAM) cells that drive the configurable
portions of the LUTs and routing fabric. Figure 2.1 shows the steps in a typical FPGA design flow;
further details of these design flow steps, and their particular functions, can be found in [17].
Figure 2.1: A Typical FPGA CAD Design Flow, adapted from [17]
For FPGA devices, the user’s description of the hardware to be implemented is often in the form
of HDL, but other descriptions are available. High Level Synthesis (HLS) has increased in popularity
in recent years due to the fact that complex digital hardware circuits can be described using simpler
software-based programming languages, such as C, C++, or OpenCL [18]. Some examples of HLS-based
CAD tools include LegUp [19], Xilinx’s Vivado HLS [20], and Intel’s FPGA SDK for OpenCL [21].
HLS lowers the barrier to adopting reconfigurable computing platforms, thereby increasing the viability
(financial or otherwise) of datacentre deployments of FPGAs. This motivates further research into such
datacentre deployments and this virtualization work more specifically.
One CAD-based innovation for FPGA devices that is particularly important in enabling datacentre
deployments of FPGAs is Partial Reconfiguration (PR). PR allows for the FPGA device to be partitioned
and for these partitions to be reconfigured independently, such that one portion of the FPGA can be
actively running some circuit while another portion is reconfigured without its operation being stalled
or affected. These techniques are described in works by both major FPGA vendors [22] [23]. In PR-
based FPGA CAD design flows, the portion of the FPGA which is not reconfigured after the initial
configuration is termed the static region, while the portions of the FPGA that are reconfigured live are
called the dynamic or PR regions (note, an FPGA can typically have multiple PR regions). These PR-based FPGA CAD design flows generate PR bitstreams that can be programmed through the traditional Joint Test Action Group (JTAG) boundary scan methods, or often using internal connections driven by the FPGA fabric directly, such as the Xilinx-based ICAP connection [24].
2.2 Virtualization
Virtualization is a widely used term in many sub-fields of computer architecture, computer science, and
digital hardware. With regard to the term virtualization, the focus of this thesis is desktop virtualization
(often also termed server virtualization), which is the virtualization of a server compute node. This
section describes this type of virtualization in more detail.
2.2.1 Desktop Virtualization
Desktop virtualization is the set of technologies used to enable the deployment of multiple virtual servers
on a single physical server. In other terms, virtualization enables a single physical server to be seen
by its multiple tenant virtual servers as multiple unique and wholly independent hardware instances.
It was originally envisioned by IBM to partition their mainframe computers and allow multiple virtual
workloads to run on a single physical mainframe [11]. The main benefit of virtualization was the increase
in the efficiency of the mainframe computers, since single workloads would not use 100% of the physical
server’s resources at all times.
Essentially, virtualization software seeks to emulate multiple instances of the physical server, such that
each tenant can run on the emulated physical server without modification. Virtualization software is
commonly referred to as a Virtual Machine Monitor (VMM), or as a hypervisor. Without virtualization,
modern servers already support context switching between independent processes. Processes cannot
access the environment of other processes without sufficiently elevated privileges. For the VMM to
emulate multiple physical servers, it must only intercept these privileged calls from the virtual systems,
most often termed a Virtual Machine (VM), and ensure they only access parts of the system assigned to
them. For example, memory is allocated on a page basis to the VM and a translation action is needed
every time a VM attempts to access memory. Similarly, I/O devices are either emulated, or assigned
wholly to a single VM and attempts to query the system about the I/O devices are intercepted by the
VMM. Only emulated and physical I/O devices assigned to a VM will be discoverable.
Multiple different types of virtualization software are available today. Two main categories of VMM software are Type I and Type II virtualization [25]. For Type I, the VMM software runs directly on the physical server itself, becoming the main operating system of the physical machine. For Type II, the VMM software runs atop a traditional operating system, and the guest VMs run on the presented virtualization layer. In addition, paravirtualized VMMs of both types exist, which decrease the overhead of virtualization by allowing the VMs to install drivers that are virtualization-aware, bypassing some of the overheads associated with virtualization [11]. Type I non-paravirtualized VMMs are the main focus of this work, as it is not evident that there is an analogue to Type II software or paravirtualization for FPGA devices.
Two key goals to consider for virtualization solutions are data isolation and performance isolation.
Data isolation ensures that the data of one VM cannot be accessed, modified, or otherwise tampered with
by other VMs on the same VMM. Performance isolation refers to the idea that the performance of a
VM should not be impacted by the transient activity of other VMs running on the same VMM. While
processor time scheduled to a VM and memory allocated to a VM can be strictly controlled, the memory
access patterns and cache usage patterns of other VMs can affect the performance of a VM [26].
2.2.2 Containerization
Containerization is a virtualization technology that aims to reduce the overhead imposed on a physical
server running a traditional VMM. In a containerized environment, each virtualized server (termed
containers in the containerization context rather than VMs) shares an operating system kernel, but has
its own execution environment and middleware setup [27]. For example, Linux-based containers can be
created using control groups (cgroups) and Linux namespaces.
By sharing a kernel, and thereby a single application scheduling environment and memory allocation
scheme, overhead is reduced and resources can be more effectively distributed. The system’s process
scheduler is fully aware of not only the containers running on the system, but all of the processes within those containers.
The system’s memory allocation scheme is similarly fully aware of the memory requirements of each
process, and can allocate memory more efficiently. For traditional virtualization solutions, there would
be two layers of process scheduling and memory allocation, first at the VMM level and then again at the
VM’s guest operating system level. Figure 2.2 shows a visual comparison of the virtualization techniques
discussed.
Figure 2.2: (a) Type I virtualization, (b) Type II virtualization, (c) Containerization

2.2.3 Operating Systems

Operating systems are not generally considered virtualization technologies, though previous works on FPGAs that create Hardware Hypervisors and Hardware Operating Systems are very similar in that they present an abstracted environment for multiple hardware tasks to run on a single FPGA. From this hardware analogue, it is also easy to see how traditional software operating systems are similar to VMMs. While VMMs allow for multiple guest operating systems to run in environments that seem completely independent, operating systems allow for multiple user applications to run in environments that seem completely independent. This is mainly accomplished through context switching and virtual memory, i.e., memory accesses from applications are translated before being serviced.
Memory virtualization works by dividing the memory region into pages of some pre-determined size and then allocating the pages to the applications as they are needed. Each application sees a zero-based address for its memory space; accesses to the virtual memory space are intercepted and handled by a translation mechanism at the operating system level. The least significant bits of the virtual address, those that index memory within a page, remain unchanged, while the most significant bits are remapped to the actual physical memory location assigned to that application. Page mappings are stored in a page table in memory and cached in a structure known as a Translation Lookaside Buffer (TLB) [28].
Figure 2.3 depicts the memory translation scheme. Note, VMM environments must have two levels of
translation: at the guest operating system level, and then again at the VMM level.
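As a rough behavioural sketch of this translation (illustrative only, not code from the thesis; the page-table contents are invented), the following Python fragment splits a 32-bit virtual address into a 20-bit virtual page number and a 12-bit page offset, as in Figure 2.3, and remaps only the upper bits through a per-process table:

```python
# Behavioural sketch of page-based address translation (4 KB pages, 32-bit
# addresses as in Figure 2.3): the low 12 bits pass through unchanged, while
# the upper 20 bits are remapped through a per-process page table.

PAGE_OFFSET_BITS = 12
PAGE_MASK = (1 << PAGE_OFFSET_BITS) - 1

# Hypothetical page tables, keyed by process ID, then by virtual page number.
page_tables = {
    7: {0x00000: 0x12345, 0x00001: 0x0BEEF},
}

def translate(pid, vaddr):
    vpn = vaddr >> PAGE_OFFSET_BITS        # virtual page number (bits 31..12)
    offset = vaddr & PAGE_MASK             # page offset (bits 11..0), unchanged
    ppn = page_tables[pid].get(vpn)        # page-table (or cached TLB) lookup
    if ppn is None:
        raise KeyError("page fault: no mapping for this virtual page")
    return (ppn << PAGE_OFFSET_BITS) | offset

# In a VMM environment, the guest "physical" address produced above would be
# translated a second time through the hypervisor's own table.
assert translate(7, 0x00001ABC) == 0x0BEEFABC
```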
2.3 FPGAs in the Cloud and FPGA Virtualization
Considerable effort has gone into investigating the deployment of FPGA devices in datacentres and cloud
environments. This is an important area of consideration for this work since virtualization technologies
are often used in cloud settings, and most of these FPGA cloud and datacentre deployment works include
some version of virtualization.
Figure 2.3: Virtual to physical memory address translation. (a) Translation in a standard operating environment, (b) Translation in a virtualized environment
2.3.1 Related Work
The related work surveyed here establishes the context in which this thesis is presented.
Microsoft Catapult
Microsoft introduced FPGAs into their Bing Search datacentres to accelerate their search algorithms
using specialized hardware implementations of those algorithms. The original implementation, dubbed
Catapult v1, was published in 2014 [1]. The Catapult implementation included FPGAs installed as
PCIe add-on cards within processor-based servers. These FPGAs are controlled by and receive their data from the processor system, essentially set up in a master-slave configuration. Multiple FPGAs are
connected together using a dedicated interconnection network, configured in a torus arrangement. This
allows for the FPGAs to communicate with each other and enables multi-FPGA applications. The work
characterizes the hardware application in their platform as the “Role” and the surrounding abstraction
layer as the “Shell”. While this shell does not enable sharing of the FPGA between multiple tenant
applications, it does provide an abstraction layer and might be considered analogous to a Hardware
Operating System [1].
This is a good place to discuss the nomenclature used to describe the various components of FPGA
platforms. What the Microsoft authors (Putnam et al.) term the Role is often called the “Hardware
small Hardware Applications themselves. Table 4.5 shows a breakdown of the area utilization needed
to implement coarse-grained MMUs with various page sizes. As the page size decreases, the amount of FPGA area resources needed increases in turn, namely the amount of LUTRAMs needed for the solution. This makes intuitive sense, since reducing the page size primarily increases the storage needed for the page table. In any case, we find that even as the page size decreases to about 4 MB, the total utilization by the shell components is not greatly affected.
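The storage argument can be made concrete with some simple arithmetic; the sketch below is illustrative only (the 16 GB memory size and 32-bit entry width are assumed for the example and are not figures from the thesis), but it shows why halving the page size doubles the number of page-table entries, and hence the LUTRAM storage, that the MMU must hold on-chip.

```python
# Rough arithmetic behind the trend in Table 4.5: smaller pages mean more
# page-table entries for the same addressable memory.
# The 16 GB memory size and 32-bit entry width are illustrative assumptions.

MEMORY_BYTES = 16 * 2**30          # assumed addressable off-chip memory
ENTRY_BITS = 32                    # assumed storage per page-table entry

for page_mb in (64, 16, 4):
    page_bytes = page_mb * 2**20
    entries = MEMORY_BYTES // page_bytes
    kbits = entries * ENTRY_BITS / 1024
    print(f"{page_mb:3d} MB pages -> {entries:5d} entries, ~{kbits:7.1f} Kb of table storage")
```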
4.5 Multi-Channel Memory Considerations
The memory virtualization solutions discussed thus far have only considered a single independent memory channel. Many FPGA platforms include multiple memory channels to increase the total effective bandwidth of external memory. This section considers the design decisions to be made in extending the previous concepts of this chapter to a multi-memory-channel platform, introducing a few different paradigms for including multiple memory channels. Note, the Alpha Data 8k5 FPGA board used in this work includes two separate DDR4 memory channels.
4.5.1 Separately Managed Channels
The simplest way to virtualize multiple memory channels is to separate the channels and attach each to some fraction of the Hardware Application Regions, i.e., each Hardware Application Region is connected
to a single memory channel. This solution is depicted in Figure 4.13. Separately managed memory
channels do not introduce any increased complexity over the solutions presented earlier in this chapter,
since each memory channel would simply have the performance and data isolation of a single-channel
system. This solution is not explicitly evaluated in this thesis, but it is included here for completeness.
Figure 4.13: Multi-Channel Organization with Separately Managed Channels
Figure 4.14: Multi-Channel Organization with Single Shared MMU
4.5.2 Single Shared MMU
A shared MMU system is depicted in Figure 4.14. In this system, each Hardware Application has a
single top-level memory interface (i.e., the interface that is at the PR region boundary) that connects to
protocol decouplers and verifiers, bandwidth throttlers, and a single MMU, just as in the single-channel
solutions. The difference is that the single MMU is connected at its master (request issuing) side to
an interconnect that can route requests to any of the memory channels (two channels are depicted in
Figure 4.14). In other words, the memory spaces of the memory channels are logically concatenated and
the single MMU serves this concatenated memory space.
If the data width of the interface presented to the Hardware Applications is equal to the width
of a single memory channel, the post-MMU interconnect can be connected directly to the memory
controllers for each of the memory channels; however, this limits the system to just a fraction of the
total memory bandwidth available (e.g., one half for two memory channels and one quarter for four
memory channels). To use the entire available bandwidth, the data width of the interface presented to
the Hardware Applications must be at least the number of channels multiplied by the data width of a
single memory channel. In this case, there must be a data width converter inserted between the post-MMU interconnect and the memory channels, specifically a data width downsizer for write data
received and a data width upsizer for read data returned.
For the write data interface, a downsizer on its own would exert back-pressure on the interconnect
preventing it from sending data at the full bandwidth speed (since a downsizer cannot accept a new
data beat every cycle). To prevent write requests from throttling the performance of the entire system,
a write data buffer must be included for each memory channel. The post-MMU interconnect can simply
write data to these buffers and not be throttled by the back-pressure of the data width downsizers. For
the read channels, the data width upsizers would not have new data available every cycle, as they would
have to wait for multiple read data beats to pack into one larger read data beat. The AXI4 protocol,
however, allows for read data to be interleaved, i.e., the interconnect can interchangeably read data from
different channels and send them upstream out of order. There is no buffering requirement for the read
data channel. These data width converters and write data buffers are shown in Figure 4.14.
This shared MMU solution is simple and requires relatively few changes from the single memory
channel system, but it does present some potential problems. Memory controllers implemented on
FPGAs tend to have wide data widths already because of the slower clock speeds achievable in FPGA fabric relative to the ASIC devices (e.g., CPUs) for which off-chip memory solutions are generally designed. The data width must be increased at the same ratio that the clock speed is reduced between the memory device itself and the FPGA fabric clock (e.g., if the clock is reduced to one quarter, the data width must be increased four-fold). Introducing multiple memory channels widens that data width even
further, and that might present timing challenges to the Hardware Applications. For example, the Xilinx
memory controller in [47] requires a four-fold data width increase, resulting in a native data width of 256
bits for the memory controller, which would increase to a 512-bit data width in the AXI interconnect
for two memory channels and a 1024-bit data width for four memory channels.
A further complication limits the effectiveness of the performance isolation in a shared MMU
solution. The bandwidth throttlers operate on the AXI interface presented to the Hardware Application
itself, with no knowledge of the future memory channel that the request will eventually target. If all of the
Hardware Applications try to target the same memory channel (assuming the MMU assignment allows
such), the memory bus's bandwidth would be effectively limited to the bandwidth of that single memory channel. The only way to ensure performance isolation would be to isolate each Hardware Application to a single memory channel, which would somewhat defeat the purpose of the multi-channel memory solution. If any memory channel has memory that is assigned to two or more Hardware Applications, there is a potential for malicious, or even unintentional, bandwidth limiting for those Hardware Applications.

Figure 4.15: Multi-Channel Organization with Parallel Shared MMUs
4.5.3 Parallel MMUs with a Single Port
To overcome the problem in performance isolation for a shared MMU solution, each Hardware Applica-
tion can have a first stage MMU that simply indicates which memory channel the request is to access,
and this information can be used to reroute that request to the correct memory channel. An interconnect
can follow this first stage MMU and map requests to the memory channel indicated by the first stage
MMU. Separate bandwidth throttlers can then be instantiated at the output of this first interconnection
network, which would effectively throttle the bandwidth between each Hardware Application and memory
channel pairing individually. This arrangement is depicted in Figure 4.15.
If the system uses a base-and-bounds MMU design, the first stage MMU would simply be a table
indexed by the VIID of the requester interface and indicate which memory channel that VIID is mapped
to. The most significant bits of the address, which indicate the memory channel, would be replaced with
this stored value. If the system uses a coarse-grained paged MMU, this first stage MMU’s page-table
would be indexed by the same bits of the address (in addition to the VIID) as the later stage MMU,
containing the same number of entries as the portion of the later stage MMU’s page table assigned to
that specific Hardware Application Region. However, the mapping value stored in the page-table simply
indicates the memory channel that page is mapped to, so only the most significant bits of the address,
that indicate the memory channel, would be replaced with this stored value. The remainder of the
mapping would be stored in the second stage MMU’s page table.
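A small behavioural model may help clarify the division of labour between the two stages for the base-and-bounds case; the field widths and table contents below are illustrative assumptions, not values from the implemented hardware.

```python
# Behavioural sketch of the two-stage MMU for a base-and-bounds design:
# stage 1 checks the bound and decides which memory channel a VIID maps to,
# stage 2 adds the base offset within that channel.
# All widths and table contents here are illustrative assumptions.

CHANNEL_BITS = 1                       # two memory channels in this sketch
ADDR_BITS = 32                         # address width presented to the application
CHANNEL_SHIFT = ADDR_BITS - CHANNEL_BITS

# Stage 1: per-VIID channel assignment and allocation bound.
stage1 = {0: {"channel": 0, "bound": 0x10000000},
          1: {"channel": 1, "bound": 0x08000000}}

# Stage 2: per-VIID base address within the assigned channel.
stage2 = {0: {"base": 0x00000000},
          1: {"base": 0x20000000}}

def route_and_translate(viid, addr):
    entry = stage1[viid]
    if addr >= entry["bound"]:                    # reject out-of-bounds accesses early,
        raise ValueError("out-of-bounds access")  # before they consume bandwidth credits
    channel = entry["channel"]                    # stage 1: pick the memory channel
    local = addr & ((1 << CHANNEL_SHIFT) - 1)
    return channel, stage2[viid]["base"] + local  # stage 2: add the base component

assert route_and_translate(1, 0x00001000) == (1, 0x20001000)
```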
This first MMU and interconnect can also handle out-of-bounds accesses, freeing downstream components from wasting bandwidth on useless transactions. This would also mean that bandwidth credits in
the downstream bandwidth throttler are not consumed by out-of-bounds accesses. For a coarse-grained
paged MMU system, the first stage MMU would indicate the validity of a page mapping and act on
4k boundary crossing errors, while the second stage MMU could safely assume all mappings are valid
and ignore 4k boundary crossings. For a base-and-bounds MMU system, the first stage MMU would
deal with the bound check and 4k boundary crossing errors in addition to storing the mapped memory
channel for each VIID, and the second stage could safely ignore any errors and simply add the base
component.
In this MMU arrangement, since requests are already separated by a targeted memory channel to
perform performance isolation, those separated request streams need only be forwarded to that memory
channel. Thus, each memory channel can have a separate MMU that only handles requests targeting
that memory channel; we term this MMU system “Parallel MMUs with a Single Port” because each
memory channel has an individual MMU and each Hardware Application Region has a single port.
In the single shared MMU approach, the data width of the memory interface presented to the
Hardware Application has to be wider to allow for the full memory bandwidth of the attached memory
to be realized. In this case, there is no bottleneck at a single MMU, so the interface width can be
smaller. In fact, the interface width presented to the Hardware Applications would only limit the
maximum amount of bandwidth that could be assigned to the Hardware Application, and the total
system bandwidth might not be impacted. For example, if a system has two memory controllers with a
256-bit data width, and each memory interface port at the PR boundary to two Hardware Applications
also has a 256-bit data-width, the full bandwidth of the system could still be used as long as the access
patterns of the Hardware Applications are efficiently mapped across the memory channels. Note, a wider
data interface at the Hardware Application would still be required if the system might want to assign
more than the bandwidth of a single memory channel to any Hardware Application.
The arrangement depicted in Figure 4.15 includes a wider memory access interface and thus also
includes data width converters and write channel buffers before the memory channels. This arrangement
would require all of the MMUs and interconnects to have larger data widths as well. These data
width converters and write channel buffers could however be included immediately before the bandwidth
throttlers, as indicated in the modified arrangement shown in Figure 4.16. This would reduce the size
of the downstream interconnect and MMUs, but would require data width converters for each Hardware Application Region. We term this arrangement “Parallel MMUs with a Single Port (modified)”.

Figure 4.16: Multi-Channel Organization with Parallel Shared MMUs (modified)
4.5.4 Parallel MMUs with Multiple Ports
Looking at the modified parallel MMUs arrangement, much of the infrastructure located before the
bandwidth throttlers could be included in a soft shell implementation and need not necessarily be
included in the static hard shell implementation. In essence, this would implement a parallel-MMU system with multiple ports presented at the PR interface to the Hardware Application Region.
Each port would correspond to a separate memory channel. This is shown in Figure 4.17, which is
essentially the same as Figure 4.16 except with the protocol decouplers and verifiers duplicated and
moved to just before the bandwidth throttlers, and the other components moved inside the soft shell.
The first stage MMU would then be connected to the management framework through the management
connection of the soft shell.
The advantage of this arrangement is that the interconnect instantiated within the soft shell can
be made just large enough to accommodate the largest memory interface needed inside the soft shell.
For example, if a particular Hardware Application needed only memory interfaces of width 64-bits, that
interconnect and first stage MMU could be limited to 64-bits with a data width upsizer included at the
memory interface. Note, since the bandwidth throttler included in this thesis penalizes requesters for gaps in data transmission and acceptance, and a data-width upsizer would induce such gaps, some buffering of write requests until enough data has been received would be needed to preserve bandwidth allocations.
Another advantage of this system is that if the memory interfaces within the soft shell use fewer address
bits than the system memory (i.e., they do not need large memory allocations), the first stage MMU's depth can be reduced, and the total FPGA area utilization of the shell (cumulative hard and soft shell utilization) would also be reduced.

Figure 4.17: Multi-Channel Organization with Parallel MMUs and Multiple Ports
These multi-channel solutions are presented here as a conceptual discussion. The implementation of
such solutions is left to future work.
4.5.5 Multi-Memory Channel Implementations in Previous Works
Most of the previous works described in Chapter 2 include only a single memory channel, similar to the
exploration presented in this thesis. One notable exception is the SDAccel platform created by Xilinx [30].
Specifically, the SDAccel Platform Reference Design described in [65] shows that the Shell implemented
for the SDAccel platform includes four separate memory channels. In that reference platform, the
connections to the off-chip memory are not abstracted through the Shell, but instead presented directly
to the PR region. Since the Shell presented in that work does not have multiple applications, even the
memory controller itself is meant to be implemented within the PR region. That work therefore does not present a multi-memory channel solution with any kind of virtualization.
One relevant prior work that considers both multiple applications and multi-memory channel de-
ployments is the work presented by Yazdanshenas and Betz [35]. That work explores the overheads associated with a multi-tenant Shell, and multiple memory channels are considered explicitly. The way that those memory channels are presented to the hardware applications is consistent with the theoretical solution presented in Section 4.5.4. More specifically, each of the memory channels is accessed through a separate interface within each application (i.e., each application has a memory access interface corresponding to each memory channel). However, that work
does not explicitly consider isolation and therefore would not include the parallel MMUs described in
Section 4.5.4.
Chapter 5
Network Interfaces
In this chapter, we switch the focus to securing the sharing of the network interface, which is required for
the direct-connected FPGA deployment model. Network interfaces, particularly Ethernet connectivity,
are provided on many FPGA boards and are often directly supported by FPGA vendors. The Alpha
Data 8k5 FPGA board used in this work for example includes 10 Gbps Ethernet connectivity [46]. Xilinx
provides support for Ethernet ports, including the 10 Gbps port on the Alpha Data device, through its
10G Ethernet Subsystem IP Core [48]. The interface provided to the user for this Xilinx Ethernet controller is an AXI-Stream interface; while the work presented in this thesis targets the AXI-Stream
interface, this interface is generic enough such that these methods could be applied to other interface
types as well.
In contrast to the solutions that aim at securing memory, presented in Chapter 4, network interfaces are connected to the datacentre infrastructure itself, which means activity propagated over these connections could impact applications beyond the multi-tenant device. In this chapter we analyze the domain isolation solutions needed to address this problem, as well as discuss how performance isolation solutions can be extended to network interfaces.
5.1 Network Interface Performance Isolation
To institute performance isolation for the network interface, similar stages to those implemented for the
memory channel are required: protocol decoupling, protocol verification, and interconnect bandwidth
throttling. To illustrate the intention of the work described in this section, see Figure 5.1.
Figure 5.1: Adding Performance Isolation for Networking to the Shell. (a) Shell without isolation, (b) Shell with added isolation

In part (a) of the figure, we depict an unsecured shell organization that includes only network connectivity as an external resource. The multiple Hardware Application Regions include an AXI-Stream output port that connects to an AXI-Stream interconnect to arbitrate access to the Xilinx
Ethernet controller. In addition, the Hardware Application Regions include AXI-Stream input ports that
are driven by the output of another AXI-Stream interconnect. Packets that arrive from the Ethernet
controller pass through a component that takes the least significant bits of the packet’s MAC address
as the VIID to determine which interface to route to in the AXI-Stream Interconnect that drives the
AXI-Stream Input ports of the Hardware Application Regions. This component is called a “Simple
NMU” since it manages which interface to route input packets to, though it does not implement any
security features like the NMUs described in Section 5.3. As with the shell for the memory connectivity,
PCIe is included such that a host computer can manage the shell, though in the case of the unsecured
shell there is nothing to manage outside the soft shell components.
Part (b) of the figure indicates how the various performance isolation components modify the simple
unsecured shell depicted in part (a); each of the AXI-Stream input and output ports pass through
AXI-Stream Protocol Verifier-Decoupler components. The AXI-Stream output ports are connected to
bandwidth throttlers, so that the access to the Ethernet output port can be fairly shared amongst the Hardware Application Regions. Note, no bandwidth regulation is done for input packets since the shell cannot effectively assert backpressure on the Ethernet input port. All of these performance isolation
components are connected to the PCIe management network.
Figure 5.2: Network Interface Decoupler
5.1.1 AXI-Stream Decoupling
As stated in Section 4.1.3, decouplers are needed so that the Hardware Application Region can be
effectively disconnected from the shared interconnect and Ethernet connection. This could be done to
reprogram the PR region in which the Hardware Application is resident for those deployments where PR
is enabled, or to pause the Hardware Applications for some other reason, such as to prevent packets from being sent from a particular Hardware Application.
The network connectivity is provided by an AXI-Stream interface, which is a fairly generic interface providing only a data field (with a strobe value indicating valid bytes), a LAST signal to indicate the end of a packet, and some handshaking signals. The simple Xilinx decoupler [54] cannot be used in this case because it might decouple packets midway through transmission, which could cause downstream components to lock up waiting for the last data beat of a packet. Thus, decoupling activity must be gated with an indication of whether or not a packet is midway through its transmission. Figure 5.2 shows the implementation of this AXI-Stream decoupler. Outstanding packet trackers (implemented using procedural HDL code) are used to track whether there is a mid-stream packet for both the input and output stream directions.
For input packets, the decoupled READY signal is held high so that all packets are seen as accepted by the downstream interconnect, to prevent the backpressure from locking up the interconnect. Since packets are pseudo-accepted in this way, there might be a problem if the input AXI-Stream port is un-decoupled midway through one of these pseudo-accepted packets, as the Hardware Application would see a partial packet with no way of knowing whether or not it is a complete packet. To prevent this, the decoupler signal must also be tied to a sticky decouple signal, which keeps the decoupling enabled even after the decouple signal has been de-asserted until the pseudo-accepted packet's transmission is complete.
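The following Python fragment is a cycle-level behavioural model of the ingress half of this decoupler (a model for exposition, not the HDL itself); the signal names loosely follow Figure 5.2, and the logic shown is the pseudo-accept of beats plus the sticky bit that keeps decoupling active until a pseudo-accepted packet has drained.

```python
# Behavioural model (not the HDL) of the ingress side of the AXI-Stream
# decoupler: while decoupled, READY is forced high so upstream beats are
# pseudo-accepted, VALID is hidden from the application, and a sticky bit keeps
# the interface decoupled until a pseudo-accepted packet has fully drained.

class IngressDecoupler:
    def __init__(self):
        self.sticky = False                # set while a pseudo-accepted packet is in flight

    def cycle(self, decouple, tvalid_in, tlast_in, tready_from_app):
        decoupling = decouple or self.sticky

        if decoupling:
            tready_out = 1                 # pseudo-accept every beat upstream
            tvalid_to_app = 0              # the application sees nothing
            if tvalid_in and not tlast_in:
                self.sticky = True         # a packet is now partially consumed
            elif tvalid_in and tlast_in:
                self.sticky = False        # packet fully drained; safe to un-decouple
        else:
            tready_out = tready_from_app
            tvalid_to_app = tvalid_in

        return tready_out, tvalid_to_app

# Example: decouple is dropped mid-packet, but the sticky bit keeps draining it.
d = IngressDecoupler()
d.cycle(decouple=True, tvalid_in=1, tlast_in=0, tready_from_app=0)    # beat 1 pseudo-accepted
out = d.cycle(decouple=False, tvalid_in=1, tlast_in=1, tready_from_app=0)
assert out == (1, 0)    # still decoupled until the packet's last beat drains
```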
5.1.2 AXI-Stream Protocol Verification
Unlike the AXI4 Protocol for memory interfaces, the AXI-Stream protocol does not include protocol
assertions that must be met to confirm protocol compliance, because the AXI-Stream interface has little
disallowed behaviour. The only protocol assertion of note that could be inferred for the AXI-Stream
interface is the Handshaking check, i.e., once the VALID signal is asserted, all the signals must maintain
their value until the READY signal indicates that the data beat has been accepted. The Xilinx Ethernet
controller does however impose some additional protocol restrictions, namely that the KEEP signal (the
strobe signal that indicates which data bytes are valid) must be held all high before the last beat is
transferred and that packet transmission cannot include any gaps (i.e., once a packet is started, VALID
cannot be de-asserted until the last beat is transferred). Invalid KEEP values are ignored by the Xilinx
Ethernet controller, but gaps in the transmission can cause packets to be dropped [48]. Finally, the
components implemented in future sections require that the packet be held to a maximum size (this is
often called the Maximum Transmission Unit (MTU)), so packets must assert the LAST signal before
the packet exceeds this size.
The total list of assertions that must be met to ensure that packet transmission is not interrupted
is: the Handshaking check, no gaps in the transmission, and packet size limited to the MTU. Note, the no-gaps requirement applies to input packets as well, indicating that there can be no gaps in the acceptance of a packet transmission, but the other assertions apply to the output direction only.
The AXI-Stream protocol verifier design is depicted in Figure 5.3. An outstanding packet tracker
is used to track whether any outgoing packets are midstream; this value is used to override the VALID
signal to ensure there are no gaps in the transmission. Next, a counter is used to keep track of the number
of beats sent for outgoing packets; once the count is equal to one less than the maximum packet size, the
LAST signal is forced high to end the packet. Note, both of these changes could corrupt the packet sent,
but the purpose of the protocol verifier is simply to prevent malformed requests from propagating to
the downstream interconnect, so this is only of concern to the Hardware Application sending malformed
packets. Finally, the AXI-Stream outputs are registered so the values are not captured if they change
after the VALID signal has been asserted. For input packets, the only change required to ensure protocol compliance is the overriding of the READY signal to a high value; the input port must accept all packets when they arrive and cannot ever assert backpressure that could lock up the interconnect.
Figure 5.3: Network Interface Protocol Verifier
5.1.3 Network Interface Bandwidth Throttling
Network bandwidth throttling can be implemented by again using a modified version of the credit-
based accounting system presented in [58]. Since network transmissions cannot be interrupted mid-
transmission, the number of credits needed to initiate the transmission must be the total number of beats that the transmission might need to use. This is equal to the number of beats that make up
an MTU packet. It is not possible to tell the size of the packet before it has been transmitted, so the
total credits deducted on a new packet transmission acceptance must be equal to the MTU. Once the
end of the packet is reached, credits can be redeposited based on how much shorter the packet is than
the MTU. As a reminder, the original credit accounting mechanism was as follows:
credits(t+1) = credits(t) + ρ − 1,   if the interface has bus access
credits(t+1) = credits(t) + ρ,       if there is no access but there are pending requests
credits(t+1) = σ,                    if there are no pending requests

where ρ is a decimal value (less than or equal to one) that represents the proportion of bandwidth assigned to that interface, and σ represents the burstiness accepted from that interface. This formulation can be adjusted to implement the changes needed for the network interface as follows:

credits(t+1) = credits(t) + ρ − cr_new + cr_last,   if there is a pending packet or data to send
credits(t+1) = σ,                                   if there is no pending packet or data

where:

cr_new  = MAX_BEATS_MTU,   if a new packet transmission is accepted
cr_new  = 0,               otherwise
cr_last = unsent,          if the last data beat is accepted
cr_last = 0,               otherwise

unsent(t+1) = MAX_BEATS_MTU − 1,   if TLAST and TREADY are high, or on reset
unsent(t+1) = unsent(t) − 1,       if TREADY and TVALID are high
unsent(t+1) = unsent(t),           if TREADY is not asserted
The bandwidth throttler implemented based on this formulation is shown in Figure 5.4. As mentioned at the beginning of this chapter, bandwidth throttling only affects the output AXI-Stream interface. Again, an outstanding packet tracker is included that prevents decoupling based on the credit count
once a packet has started transmission. The credit count is compared to the MTU to determine whether
or not that interface should be decoupled. The credit update system updates the amount of credits
stored in the credit register whenever a new packet transmission is accepted and/or the last beat of a
transmission is sent.
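A compact behavioural model of this credit update (again a model for exposition, not the HDL; the ρ, σ, and MTU values are arbitrary illustrative choices) may help tie the formulation above to the hardware of Figure 5.4.

```python
# Behavioural model (not the HDL) of the network bandwidth throttler's credit
# accounting for one egress interface. rho is the fraction of bandwidth granted
# per cycle, sigma the permitted burstiness, MTU_BEATS the number of beats in an
# MTU-sized packet. All numeric values here are arbitrary illustrative choices.

MTU_BEATS = 4

class CreditThrottler:
    def __init__(self, rho=0.5, sigma=8.0):
        self.rho, self.sigma = rho, sigma
        self.credits = sigma
        self.unsent = MTU_BEATS - 1        # beats of the current packet not yet sent

    def allow_new_packet(self):
        # A new packet may start only if a full MTU's worth of credits is banked.
        return self.credits >= MTU_BEATS

    def cycle(self, pending, new_packet_accepted, beat_accepted, tlast):
        cr_new = MTU_BEATS if new_packet_accepted else 0
        cr_last = self.unsent if (beat_accepted and tlast) else 0   # refund unused beats
        if pending:
            self.credits += self.rho - cr_new + cr_last
        else:
            self.credits = self.sigma
        if beat_accepted and tlast:
            self.unsent = MTU_BEATS - 1
        elif beat_accepted:
            self.unsent -= 1

# Example: a 2-beat packet is charged a full MTU up front, then refunds 2 beats at TLAST.
t = CreditThrottler()
assert t.allow_new_packet()
t.cycle(pending=True, new_packet_accepted=True, beat_accepted=True, tlast=False)
t.cycle(pending=True, new_packet_accepted=False, beat_accepted=True, tlast=True)
assert abs(t.credits - (8.0 + 2 * 0.5 - MTU_BEATS + 2)) < 1e-9
```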
Unlike the bandwidth throttling for the memory interface, the total bandwidth available on the
shared network interface is not dependent on the network access pattern. The total bandwidth available
should only be limited by the downstream datacentre switching infrastructure. As such, the network
bandwidth throttling system does not need a bandwidth conserving system like the one introduced for
the memory bandwidth in Section 4.2.3. Instead, the sum of the ρ values should simply be set to the
total bandwidth available in the system.
5.2 Network Security Background
Virtualized FPGA deployments must consider security in the way that Hardware Applications are allowed
to access the shared network. As already mentioned in the introduction to this chapter, these security measures are required not only to isolate the Hardware Application Regions from each other, but also to isolate the rest of the network from any unwanted accesses from the Hardware Applications themselves. This is what
was termed Domain Isolation in Section 3.2.3. The need for Domain Isolation is not restricted to FPGAs
deployed in the cloud; this security consideration is necessary also for software VMs installed on CPU-based datacentre nodes. In this section, we discuss the solutions used to provide domain isolation in other parts of the datacentre.

Figure 5.4: Network Interface Bandwidth Throttler
5.2.1 Software Analogues
In the software domain, the National Institute of Standards and Technology (NIST) details some common
methodologies used to secure access to a shared network by VMs in a virtualized environment [66]. The
main methodology presented is the virtual switch, a fully functional switch implemented in software that
switches traffic from the virtual network connections to the physical network interface and the next-level
physical switch. Distributed virtual switches extend the virtual switch concept by provisioning and
managing virtual switches on multiple physical nodes simultaneously, an avenue that could be explored
for hardware NMU solutions in future work.
Another common network security methodology, according to NIST, is the firewall: devices and/or
security layers within switches or software that filter traffic such that only allowed connections are left to
pass through to the network. The set of allowed connections is often specified in what are termed Access
Control Lists (ACLs), or alternatively Network Access Control Lists (NACLs). Firewall functionality
can be provisioned using physical appliances installed in the network, through ACLs implemented in
the physical switches of the network, or through firewalls implemented in the virtual switch solutions
mentioned earlier.
For multi-tenant environments, the pushing of ACLs to a physical firewall appliance or the next-level
physical switch is often termed hairpinning, since traffic from the VM is first routed to the physical
appliance and then to its final destination. Note, for such a firewall implementation to work, some
level of source semantics enforcement must be done before routing to the firewall appliance such that
the traffic is uniquely identifiable. Such hairpinning techniques are considered here in this thesis for
analogous hardware solutions.
As a final consideration, virtual networking subdivides the physical network into virtual networks that can be provisioned to different users and isolated from each other. The simplest form of virtual networking is the Virtual Local Area Network (VLAN) tag, IEEE 802.1Q [67]. The VLAN tag includes a 12-bit virtual network ID that allows switches to identify which virtual network a packet belongs to and to isolate traffic between virtual networks. Such tagging can often be done by the switches themselves at ingress to the network. Additionally, network virtualization can be provided using encapsulation-based methods such as VXLAN [68] or NVGRE [69]; Virtual Tunnel Endpoints (VTEPs), often implemented within virtual switches, perform the encapsulation and de-encapsulation.
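For reference, the 4-byte 802.1Q tag that such VLAN tagging inserts is easy to state in code. The sketch below shows the insertion a virtual switch (or a tagging NMU) might perform on an Ethernet frame; the TPID value 0x8100 and the 12-bit VID field follow the standard, while the function name and buffer handling are illustrative assumptions.

#include <stdint.h>
#include <string.h>

/* Insert an IEEE 802.1Q VLAN tag into an Ethernet frame.  The 4-byte
 * tag (TPID 0x8100 followed by the 16-bit TCI holding the 12-bit VID)
 * goes between the source MAC address and the EtherType.  'out' is
 * assumed to be at least in_len + 4 bytes. */
size_t insert_vlan_tag(const uint8_t *in, size_t in_len,
                       uint8_t *out, uint16_t vid, uint8_t pcp)
{
    if (in_len < 14)                          /* not a full Ethernet header */
        return 0;

    uint16_t tci = (uint16_t)((pcp & 0x7) << 13) | (vid & 0x0FFF);

    memcpy(out, in, 12);                      /* dest MAC + src MAC         */
    out[12] = 0x81; out[13] = 0x00;           /* TPID = 0x8100              */
    out[14] = (uint8_t)(tci >> 8);            /* TCI: PCP / DEI / VID       */
    out[15] = (uint8_t)(tci & 0xFF);
    memcpy(out + 16, in + 12, in_len - 12);   /* EtherType + payload        */
    return in_len + 4;
}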
5.2.2 OpenFlow Switching Hardware
In addition to virtual switches implemented on software nodes, hardware network switches can also
be used to implement security for network connected devices. One of the most ubiquitous Hardware
Switch standards is the OpenFlow standard [70]. The OpenFlow standard was specifically introduced as an open Software Defined Networking (SDN) solution; SDN describes network deployment and management solutions that separate the data forwarding plane from the control plane. In reference to
security, OpenFlow is relevant because it introduces a format for rules to influence how packets are
forwarded or dropped when processed by the OpenFlow switch containing those rules; these rules can include ACLs to implement security measures on an OpenFlow switch.

Figure 5.5: Example Implementation of an OpenFlow Capable Switch
Complete OpenFlow switch solutions have been implemented on FPGAs [71] [72], and they can
provide the same level of security afforded to software systems through the use of rules that target
security, such as ACLs, adherence to routing protocols and stateful inspection. However, they consume
significant resources, on the order of 15-36 percent of LUTs and 45-62 percent of BRAMs for the devices used. This high area overhead indicates that full switch solutions implemented on FPGAs are likely too large to implement in conjunction with a shell and multiple Hardware Application Regions; alternative solutions must be sought that minimize the area overhead.
One possible OpenFlow switch solution is shown in Figure 5.5. The packets flow in from the network
inputs to the outputs after they have been processed. When packets arrive at the network input, they
are parsed for key network fields that are compared against the rules stored in the OpenFlow tables. The kinds of fields parsed out from a packet include source and destination MAC addresses, source and destination IP addresses, port numbers, etc. Once the packet has been parsed, the parsed fields are sent to a queue, where they wait to be processed by the OpenFlow tables, and the packet itself
is sent to a buffer until its eventual destination is determined.
The OpenFlow table processor pulls parsed packet data from one of the queues waiting to be processed
and compares the fields to the expected field data in each of the OpenFlow rules. OpenFlow rules are
stored in OpenFlow tables, which are implemented as Ternary Content Addressable Memories (TCAMs). If the parsed packet data matches a rule stored in the OpenFlow table TCAM, that rule has an associated action that is used to modify the packet, modify the parsed fields, update some internal switch metrics, or add some metadata to the parsed fields. The parsed packet data is forwarded through a series of these OpenFlow tables, matching up to one rule per OpenFlow table. Once the packet has passed through all of the OpenFlow tables, the action list it has accumulated is implemented by modifying the
packet in the ways specified (e.g. removing a VLAN tag field, or updating some IP field value), and/or
dropping/forwarding the packet to the specified output interface. Note, an OpenFlow switch can send
modified parsed packet data back to the queue to be reprocessed by the OpenFlow tables.
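As a rough illustration of the match-and-accumulate flow just described (not the OpenFlow specification's exact pipeline), the sketch below models each table as an array of value/mask rules, as a TCAM would hold them, and collects at most one action per table; all type and function names are invented for the example.

#include <stdint.h>

/* One ternary rule: a field matches when all bits selected by the mask
 * equal the expected value (mask bit 0 = don't care). */
typedef struct {
    uint64_t value;   /* expected field bits                */
    uint64_t mask;    /* 1 = bit must match, 0 = don't care */
    int      action;  /* e.g. FORWARD, DROP, SET_VLAN, ...  */
} flow_rule_t;

/* Returns the index of the first matching rule, or -1 for a table miss. */
int table_lookup(const flow_rule_t *table, int n_rules, uint64_t field)
{
    for (int i = 0; i < n_rules; i++)
        if ((field & table[i].mask) == (table[i].value & table[i].mask))
            return i;
    return -1;
}

/* Walk a pipeline of tables, accumulating one action per matching table;
 * the caller applies the accumulated action list to the buffered packet. */
int process_packet(const flow_rule_t *const *tables, const int *sizes,
                   int n_tables, const uint64_t *parsed_fields,
                   int *actions_out)
{
    int n_actions = 0;
    for (int t = 0; t < n_tables; t++) {
        int hit = table_lookup(tables[t], sizes[t], parsed_fields[t]);
        if (hit >= 0)
            actions_out[n_actions++] = tables[t][hit].action;
    }
    return n_actions;
}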
Some works modify this basic structure to implement reduced versions of the OpenFlow standard. For
example, the work presented in [72] modifies the OpenFlow table structure such that each rule matched
can have multiple actions associated with it, and then does not include multiple OpenFlow tables (the
work has multiple tables, but it is best interpreted as a single OpenFlow table that is pipelined). This
solution limits the flexibility of the OpenFlow standard, but also reduces the overall area needed for the hardware switch implementation.
From this description we can glean why the full OpenFlow Hardware solutions might take up such
a significant amount of FPGA hardware resources. While the OpenFlow switch can implement network
security, it also has a great deal of overhead that is included to deal with other networking needs, such
as packet forwarding and VLAN tagging. Also, the queuing structure for parsed packet data forces the
parsed data and the packets themselves to be buffered. The need for buffer space would be determined by
the maximum number of packets the switch needs to hold while they to wait to be processed, which can be
significant depending on the network speed of the Ethernet interface, the number of network interfaces,
and the average time it takes to process a single packet. All of this added buffering, the inclusion of
multiple OpenFlow tables, and the need to sometimes reprocess a packet through the OpenFlow tables
also can add a significant amount of latency to the processing of a packet. Solutions that target security
exclusively can omit some of the overhead of a full OpenFlow switch implementation to reduce this area
overhead need and alleviate this long packet processing latency.
5.3 The Network Management Unit
The software analogues demonstrate some of the needs of network security, namely the enforcement of
access control (either directly or by hairpinning such functionality to the next-level physical switch or
some hardware appliance), and the ability to route traffic between logical interfaces on the same FPGA.
In traditional software virtual environments, VMs share memory and I/O connections. The memory
sharing is generally provisioned by hardware means; specifically, data isolation is provided through the employment of an MMU [28]. As an analogy to the MMU, which provides memory data isolation, we propose the creation of an NMU, which provides network domain isolation. Based on the related work and the trends we identified, we contend that the NMU is required to enable the secure deployment of direct-connected FPGAs in multi-user or multi-tenant datacentres and cloud deployments.
Similar to the software analogues presented in the previous section, there can be many potential
ways to secure the network interface for shared use of the network resources. For example, in
Chapter 2, several works were presented that had some kind of network security guarantees. The work
presented by Byma et al. [33] policed outgoing traffic by replacing the source MAC address with the
one assigned to the sender; the work presented by Tarafdar et al. [34] encapsulated data within a MAC
packet; and the work presented by Microsoft Research, specifically Catapult v2 [3], encapsulated data in a custom transport-layer protocol called the Lightweight Transport Layer (LTL). In this section, some of the considerations that might be needed for network security are presented, and a nomenclature is developed
to refer to these NMUs.
Note, the exact requirements of the NMU design will always depend on the specific deployment
details of the datacentre in which the FPGAs are to be deployed. For this reason, we do not present a
single NMU that we posit meets the requirements for domain isolation of networking interfaces. Instead,
a number of potential NMU designs are presented, which represent a series of deployment scenarios that
we claim meet the domain isolation needs of many common FPGA deployments.
5.3.1 Access Control Level
We note from the software analogues that ACLs are one important way in which network connectivity
should be secured. Access control functionality can be done within the NMU, or hairpinned to the next-level switch. The first criterion by which we categorize potential NMU designs is the level of access control done within the NMU rather than pushed to the next-level switch.
Un-Inspected Networking (Type A)
At the lowest level, we have NMUs that do not inspect outgoing packets at all and push all access control
functionality to the next-level switch (and potentially a further firewall appliance); we call these Type A
NMUs. Of course, for the next-level switch to be able to uniquely identify separate logical interfaces,
some methodology must be employed to mark outgoing packets as originating from a particular logical
interface. Two recent IEEE standards could be used to this end.
The Edge Virtual Bridging standard (802.1Qbg) [73] allows for a single physical port of a switch to be
treated as multiple logical ports by associating each logical connection with a specific Service VLAN tag.
Similarly, the Bridge Port Extension standard (802.1BR) [74] allows for a single physical port on a switch
to be expanded into multiple individually managed connections using a custom tag structure. Thus,
a Type A NMU should employ such tagging to push both routing and access control to the next-level
switch.
The simplicity of Type A NMUs lends itself to simple hardware realizations, but these NMUs require all ACLs to be implemented at the next-level switch, tightly coupling the hardware application to the
switch configuration, which is not desirable (the datacentre management framework must manage ACLs
in multiple places with multiple update and management procedures).
Source Semantics Enforcement (Type B)
The next level of access control is source semantics enforcement, i.e., ACLs that ensure the sender
addresses in the packets are correct and no other device addresses are spoofed; we term these Type B
NMUs. This is the type of NMU applied in the work presented by Byma et al. [33]. If source semantics
are enforced on the FPGA, further access controls can be applied at the next-level switch without the
configuration complexity of the Type A NMUs. Also, the Type B NMU does not rely on relatively new
IEEE standards that may have limited adoption. While the configuration complexity is reduced, most
access controls must still be implemented on the next-level switch; Type B NMU solutions remain tightly
coupled to the switch configuration.
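A minimal sketch of what Type B enforcement amounts to is shown below, assuming one assigned MAC address per logical interface. Whether a violating frame is dropped or has its source address overwritten (as in the MAC-replacement approach mentioned above) is a policy choice; the function name and signature are illustrative rather than taken from the implemented design.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Source semantics enforcement for one outgoing Ethernet frame: the
 * source MAC (bytes 6..11) must equal the address assigned to the
 * sending logical interface.  Returns true if the frame may be sent. */
bool enforce_source_mac(uint8_t *frame, size_t frame_len,
                        const uint8_t assigned_mac[6], bool rewrite)
{
    if (frame_len < 14)
        return false;                       /* runt frame: drop        */

    if (memcmp(frame + 6, assigned_mac, 6) == 0)
        return true;                        /* source address is valid */

    if (rewrite) {                          /* police by rewriting     */
        memcpy(frame + 6, assigned_mac, 6);
        return true;
    }
    return false;                           /* spoofed source: drop    */
}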
Destination Rule Enforcement (Type C)
We define Type C NMUs as those that perform both sender and destination based access controls on
the FPGA. The full scope of what might constitute access control could be quite wide, and in fact might
include the full implementation of a switch on the FPGA. As discussed in the previous section, such an
implementation is likely infeasible or carries too high an overhead. Instead, we narrow the definition of
access controls.
Some previous works have shown FPGA datacentre deployments that rely solely on static point-to-point links between the FPGAs. Limiting the NMU's access control to a single destination field per logical network interface would allow for some access control to be implemented in the Type C NMU at relatively low cost. Moreover, multiple logical network interfaces can be provided to each hardware
application to implement point-to-multipoint connectivity. Other simple destination-based rules can also
be included, such as limiting the ability to send multicast packets, and limiting IP traffic to a specific
subnet. We contend that these simple access controls are powerful enough for many tasks.
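The sketch below illustrates how narrow such per-interface destination rules can be, assuming one permitted unicast destination, a multicast enable flag, and a single IPv4 subnet; the structure and names are illustrative, not the rule format of the implemented NMU.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simple per-logical-interface destination rules for a Type C style NMU. */
typedef struct {
    uint8_t  allowed_dst_mac[6];  /* single permitted destination       */
    bool     allow_multicast;     /* permit multicast/broadcast frames  */
    uint32_t subnet;              /* permitted IPv4 subnet (host order) */
    uint32_t subnet_mask;
} dest_rules_t;

bool dest_allowed(const dest_rules_t *r,
                  const uint8_t dst_mac[6], uint32_t dst_ip)
{
    bool is_multicast = (dst_mac[0] & 0x01) != 0;  /* group bit set     */

    if (is_multicast)
        return r->allow_multicast;

    if (memcmp(dst_mac, r->allowed_dst_mac, 6) != 0)
        return false;                              /* wrong destination */

    return (dst_ip & r->subnet_mask) == (r->subnet & r->subnet_mask);
}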
The Type C NMU adds complexity in the hardware implementation, and thus area overhead; however, it removes the tight coupling between the hardware application and the network infrastructure, which should greatly ease deployment. Of course, this is limited: if the point-to-point access controls are not sufficient to isolate the network accesses, more powerful ACLs from the next-level switch
would be needed.
Packet Encapsulation (Type E)1
Finally, Type E NMUs eliminate the need for access controls by moving packet encapsulation into the
NMU itself; instead of users performing network packetization within their own Hardware Applications,
they simply send the payload to the NMU, which encapsulates it within the appropriate network packet.
This is the methodology imposed in the implementation by Tarafdar et al. [34], and implied as an option
in the Catapult v2 work with the introduction of the LTL protocol [3].
Type E NMU solutions can be quite simple in terms of the hardware required to implement them, and
there is no tight coupling between the hardware application deployment and the network configuration.
Type E NMUs are however the least flexible, as they impose point-to-point only connectivity. Type E
NMUs also share network encapsulation hardware between the hardware applications, reducing area
utilization, but they also require Hardware Applications to be rewritten to target the encapsulation-
based NMU scheme.
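The sketch below illustrates the Type E division of labour under the assumption of a fixed Layer 2 encapsulation per logical interface: the application supplies only a payload, and the NMU supplies every header byte from per-interface configuration. The configuration fields and function are illustrative; a real deployment could just as well encapsulate at Layer 3/4, as the LTL work does with a custom transport protocol.

#include <stdint.h>
#include <string.h>

/* Per-logical-interface encapsulation configuration for a Type E NMU. */
typedef struct {
    uint8_t  dst_mac[6];   /* fixed peer for this logical interface */
    uint8_t  src_mac[6];   /* address owned by this interface       */
    uint16_t ethertype;    /* e.g. a reserved experimental value    */
} encap_cfg_t;

/* Build an Ethernet frame around a raw payload; 'frame_out' is assumed
 * to be at least payload_len + 14 bytes. */
size_t encapsulate(const encap_cfg_t *cfg,
                   const uint8_t *payload, size_t payload_len,
                   uint8_t *frame_out)
{
    memcpy(frame_out,      cfg->dst_mac, 6);
    memcpy(frame_out + 6,  cfg->src_mac, 6);
    frame_out[12] = (uint8_t)(cfg->ethertype >> 8);
    frame_out[13] = (uint8_t)(cfg->ethertype & 0xFF);
    memcpy(frame_out + 14, payload, payload_len);
    return 14 + payload_len;
}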
5.3.2 Internal Routing
Another functionality that might be required is the routability of traffic between logical network interfaces
located on the same FPGA. In general, hairpin routing to the next-level switch and back is not possible
since the IEEE switch specifications explicitly forbid the re-routing of packets to the interface on which
the packet was received. The Edge Virtual Bridging [73] and the Bridge Port Extensions [74] protocols
are exceptions, so the Type A NMUs based on these standards enable routability by default.
For other NMU types, routability between the logical network interfaces can only be provided by
including routing functionality directly in the NMU; we term such NMUs Type *R NMUs. Note, routability does not necessarily need to be provided, though this would impose on the cloud management framework the limitation that two applications that need to communicate with each other must be provisioned on different FPGAs; this might be an onerous limitation. Not providing routability is the methodology employed by Byma et al. [33], for example.
1 Type D is intentionally unused and reserved for NMUs with a richer set of access controls (such as fully implemented switches on FPGAs, stateful access controls, or OpenFlow flow tables), left for future work
5.3.3 VLAN Networking Support
From the NIST publication, another common way to ensure network security is by encapsulating packets
within a virtual network, such as a VLAN or a VXLAN. A VLAN-based NMU would tag packets from each logical network interface with the appropriate VLAN tag without having to parse the packet itself, and as such
we classify it as a Type A NMU (Types Av and ARv). A VXLAN-based NMU would encapsulate the
whole packet within a VXLAN delivery packet, and as such we classify it as a Type E NMU (Types Ev
and ERv).
5.3.4 Layer of Network Virtualization
Routing functionality and access control can be implemented at various levels of the network protocol
stack, depending on the desired abstraction to present to the hardware application. For example, the
hardware applications might have their own MAC addresses, or they might share a MAC/IP address
and differ only on the Layer 4 port number. NMUs can be designed to process packets at a specific layer
of the network protocol stack: MAC-only NMU, MAC/IP NMU, and MAC/IP/Layer4 NMU.
5.3.5 NMU Nomenclature
The previous subsections have presented many different features that could be implemented to provide an
effective network security solution. For simplicity, all of these NMU types and features are summarized
by the nomenclature presented in Table 5.1. The Type of the NMU is determined by the level of access
control that it supports. In addition, an R or a v can be added to indicate that the NMU supports
routing between Hardware Applications on the same FPGA and that the NMU specifically targets a
virtualized network technology, respectively. Finally, the layer of the network stack at which the NMU operates is appended to the end of the name. As a final note, a Universal NMU is used to refer to an NMU
that is designed to support all of the potential features; a Universal NMU can be parameterized by the
FPGA management framework at runtime to determine which of the modes to implement for each of
the Hardware Applications and the network access ports indicated by their VIID.
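As an illustration of what "parameterized at runtime" might look like from the management framework's point of view, the following hypothetical per-VIID configuration record collects the knobs implied by the nomenclature above; the field names and encodings are assumptions for the example, not the register map of the implemented design.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical runtime configuration for one logical network interface
 * (one VIID) of a Universal NMU, written by the management framework
 * (e.g. over PCIe). */
typedef enum { NMU_TYPE_A, NMU_TYPE_B, NMU_TYPE_C, NMU_TYPE_E } nmu_type_t;
typedef enum { NMU_L2, NMU_L3, NMU_L4 } nmu_layer_t;

typedef struct {
    nmu_type_t  type;             /* access control level (A/B/C/E)     */
    nmu_layer_t layer;            /* protocol layer the NMU operates at */
    bool        internal_routing; /* "*R": route between co-resident    */
                                  /* hardware applications on-chip      */
    bool        virtualized;      /* "*v": VLAN/VXLAN virtual network   */
    uint16_t    vlan_vid;         /* VLAN ID when virtualized at L2     */
    uint8_t     assigned_mac[6];  /* source address owned by this VIID  */
    uint32_t    allowed_subnet;   /* Type C style destination limit     */
    uint32_t    allowed_subnet_mask;
} nmu_viid_cfg_t;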
5.4 Network Management Unit Hardware Design
The Network Management Unit was introduced conceptually in the previous section. This section
discusses the actual hardware implementation of the NMU for synthesis into a shell design. To illustrate
the intention of the work described in this section, see Figure 5.6. In part (a) of the figure, a shell
with just the performance isolation components is shown. In part (b) of the figure, the Simple NMU of part (a), which in itself provides no network security, is replaced with a more complex NMU based on the descriptions in the previous section. This complex NMU is connected to the PCIe-based management framework such that the parameters of the NMU can be set at runtime.

Table 5.1: NMU Nomenclature Summary

Naming scheme: Type (A|B|C|E) [R] [v] - [L2|L3|L4]

Type A      No access controls provided within the FPGA; some tagging such that ACLs can be applied at the next-level physical switch (hairpinning)
Type B      Source semantics enforcement for all outgoing traffic from hardware applications, allowing ACLs at the next-level switch while eliminating spoofing
Type C      Source semantics enforcement and some simple destination-based access controls (e.g. restricting to a single destination, or restricting multicast and/or broadcast)
Type E      Encapsulation: hardware applications send payload without generating packet headers; network packet generation is done in the NMU itself
Type *R     Routing between hardware applications on the same FPGA done inside of the NMU (no hairpinning)
Type *v     Virtualized networking environment supported
[L2|L3|L4]  Network protocol stack layer the NMU operates with respect to (L2 = MAC, L3 = IP, L4 = Transport)

E.g. Type A-vepa, Type A-etag, Type Av, Type ARv, Type B-L2, Type B-L3, Type B-L4, Type BR-L2, Type BR-L3, Type BR-L4, Type C-L2, Type C-L3, Type C-L4, Type CR-L2, Type CR-L3, Type CR-L4, Type E-L2, Type E-L3, Type E-L4, Type ER-L2, Type ER-L3, Type ER-L4, Type ERv-vxlan, Type ERv-nvgre, Type Ev-vxlan, Type Ev-nvgre, Universal
5.4.1 Reusable Sub-Components
To implement the functionality required of the NMUs, we need packet processing components that can
examine the packets and pull out the relevant header information, as well as modify the packets by
inserting and removing headers/fields. These components can be designed as reusable sub-components to reduce the complexity of deploying the various types of NMUs.
Packet Parser-Processor
Packet parsers are used to pull out header information from a packet. This header information is
then generally compared to some ACLs or a routing table. Previous works doing packet processing
on FPGAs range from complex programmable parser designs [75], to simpler parsers generated from
domain specific languages [76]. One of the focuses of our solution is to minimize the hardware overhead of network security for virtualized FPGAs, so we focus on the simpler designs.

Figure 5.6: Adding NMU to the Shell (a) Shell without NMU (b) Shell with added NMU
The simple parser architectures include parsers for each part of the network protocol stack, cascading the parsers and accumulating the parsed information. For example, parsers could be created and
connected in a cascade for MAC-parsing, IPv4-parsing, ARP-parsing, etc. The parsers themselves are
generally simple, including a counter that counts the current position within the packet stream, and
specialized field extractors that look for particular offsets within the packet for the field to be extracted.
Note, the position that the field extractor must look for to find the field can change based on previous
packet fields extracted, and so cannot necessarily be hard-coded.
We employed a similar parser design in our work. Figure 5.7 shows a number of Field Extraction
Sequencers that each extract a particular field in the packet. Traditional packet parsing systems pull
out all the fields of interest, through some series of packet parsers, and then pass those fields en masse
to some routing table or flow table structure to be analyzed and processed (e.g., like the OpenFlow
standard switch implementations). A key difference in our design is the inclusion of the Access Control
and Routing CAM logic for a particular field directly within the parser responsible for extracting that
field. This design allows for the cascaded parsers to simply pass along the cumulative routing and ACL
status instead of the entire field (which could otherwise contribute to high register utilization in highly pipelined
designs). This direct inclusion in the parser also eliminates the need for buffering of packets and queuing
of parsed packet data for processing, since all the parsers by necessity must operate at line rate. Access
Control and Routing CAM components can be excluded if not needed for a particular NMU type.
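A behavioural sketch of one such Field Extraction Sequencer is given below, assuming a 64-bit beat for illustration: a beat counter locates the field, the field bytes are captured as the packet streams past, and the access-control comparison happens inside the parser so that only a pass/fail status is carried forward. The data structure and the choice to run the compare at the end of the packet are simplifications of the pipelined hardware.

#include <stdbool.h>
#include <stdint.h>

/* Behavioural sketch of one Field Extraction Sequencer (Figure 5.7),
 * processing one 8-byte beat per call. */
typedef struct {
    uint32_t field_offset;  /* byte offset of the field in the packet  */
    uint32_t field_len;     /* field length in bytes (at most 8 here)  */
    uint64_t expected;      /* value the ACL compare checks against    */
    uint32_t beat_count;    /* index of the current 8-byte beat        */
    uint64_t captured;      /* field bytes captured so far             */
} field_seq_t;

/* Consume one beat; returns true (ACL error) only on the last beat of
 * a packet whose extracted field does not match the expected value. */
bool field_seq_beat(field_seq_t *s, const uint8_t beat[8], bool tlast)
{
    uint32_t base = s->beat_count * 8;         /* byte offset of beat   */
    for (uint32_t i = 0; i < 8; i++) {
        uint32_t pos = base + i;
        if (pos >= s->field_offset && pos < s->field_offset + s->field_len)
            s->captured = (s->captured << 8) | beat[i];
    }
    bool acl_error = false;
    if (tlast) {
        acl_error     = (s->captured != s->expected);
        s->beat_count = 0;                     /* reset for next packet */
        s->captured   = 0;
    } else {
        s->beat_count++;
    }
    return acl_error;
}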
Figure 5.7: Packet Parser Architecture
Tagger/Encapsulator
The tagger and encapsulation components are used to insert bytes at the beginning or in the middle
of a packet, to support the Type A and Type E NMUs respectively. To insert bytes into a packet, the
incoming packet stream must first be divided into segments that can be read and pushed to the output
individually. This is accomplished using a segmented FIFO, where the segments form multiple FIFO
outputs. The segmentation is done on a 16-bit basis, since all network headers at Layer 4 and below are
aligned to 16-bit boundaries.
Figure 5.8 shows the implemented tagger/encapsulation core, with the input driving a segmented
FIFO. The output stream is generated by using multiplexers to select from the segments of the input
FIFO, and the tag/encapsulation data to be inserted into the packet. A Packet Output Sequencer,
implemented as a Finite State Machine, sequences the input and the bytes of the data to construct the
output packet. The stream VIID from the input is used to determine which logical network interface
sent the packet that is currently being processed. This VIID is used to index into the Tag/Encap Data
register file to access the tag data to be inserted into packets specifically from that logical interface.
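The following byte-level sketch captures what the output sequencer accomplishes, assuming the tag data is looked up by VIID and inserted at a 16-bit aligned offset; the software copy stands in for the segmented FIFO and multiplexer structure and is not meant to mirror the RTL.

#include <stdint.h>
#include <string.h>

/* Per-VIID tag data, as held in the Tag/Encap Data register file. */
#define MAX_VIIDS     32
#define MAX_TAG_BYTES 8

typedef struct {
    uint8_t tag[MAX_VIIDS][MAX_TAG_BYTES];  /* per-interface tag bytes      */
    uint8_t tag_len[MAX_VIIDS];             /* tag length, multiple of 2    */
} tag_regfile_t;

/* Insert the tag for 'viid' at a 16-bit aligned offset inside the packet;
 * the caller guarantees insert_off <= in_len and a large enough 'out'. */
size_t tag_packet(const tag_regfile_t *rf, uint8_t viid,
                  const uint8_t *in, size_t in_len,
                  size_t insert_off, uint8_t *out)
{
    size_t tlen = rf->tag_len[viid];
    memcpy(out, in, insert_off);                       /* leading segments  */
    memcpy(out + insert_off, rf->tag[viid], tlen);     /* inserted tag      */
    memcpy(out + insert_off + tlen, in + insert_off,   /* trailing segments */
           in_len - insert_off);
    return in_len + tlen;
}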
De-Tagger/De-Encapsulator
The de-tagger and de-encapsulation components perform the opposite task to the tagger and encapsulator.
For packets coming in from the network, these components can be used to strip some bytes from the
packet that are not needed in the downstream hardware applications, like various tag information or
As with the memory protections introduced into the shell, we evaluate our shell design based on the
area overhead of its implementation.
5.5.1 Shell Design
Most of the components of the network securitization part of the shell are similar in nature to the solutions presented for the memory. We evaluate their overhead in a similar way, incrementally adding the isolation components to an existing shell design. The results are summarized in Table 5.2 and Table 5.3, with the second table giving the percentage of available resources on the FPGA that the shell uses. These evaluations are similarly performed on the Kintex Ultrascale XCKU115.
We note that adding the performance isolation components has very little impact on the total area utilization of the shell. This makes intuitive sense, since the amount of logic needed to decouple and verify the protocol assertions on the network interface was fairly small. The NMU, however, adds a great deal of overhead to the system: LUT usage increases by 62 percent, LUTRAM usage by 74 percent, and flip-flop usage by 46 percent. The NMU is considerably more logic intensive than the other components of the design, so this also makes sense. Even so, the total utilization of the modified shell does not exceed 9 percent of the whole FPGA. Considering the functionality that is possible using the Universal NMU, it is a worthwhile inclusion in any FPGA deployed in the datacentre. More detailed analysis of the NMU follows.

Figure 5.11: NMU Varieties (a) Universal NMU, with components labeled and marked with symbols to be used as the legend for sub-figures (b) Type A NMUs (c) Type B NMUs (d) Type C NMUs (e) Type BR NMUs (f) Type CR NMUs (g) Type E NMUs (h) Type ER NMUs
5.5.2 NMU Overhead
The NMU designs were tested on an Alpha Data 8k5 FPGA add-in board with a 10Gb Ethernet connection; the FPGA on that board is a Xilinx Kintex Ultrascale XCKU115. All tests were done using the Xilinx Vivado 2018.1 software, and the associated versions of the PCIe Subsystem and Ethernet
Subsystem cores.
The NMU was placed in a system with four hardware applications, each connected to the ingress and
egress ports of the Ethernet Controller through an AXI Stream Switch. Each application is provided
eight logical network connections, so the NMUs evaluated support 32 total logical connections. The
Ethernet controller has a datapath width of 64 bits and operates at 156.25 MHz, which is the clock used
for the whole test platform (except for the PCIe Controller). The applications themselves simply include
a Block RAM that stores packet data, a DMA device to send that packet data out to the network, and a
DMA engine that receives data from the network to store to Block RAM. Each of the applications is controlled
through PCIe by a Host PC that manages the test setup. The Host is also responsible for configuring
the NMU. Figure 5.12 shows the architecture of the test platform.
To evaluate the various NMUs based on the previous descriptions, each of those design decisions
is compared on an area utilization and unloaded latency basis. Note, such designs would generally be
evaluated in terms of throughput as well, but all of the packet processing components used in this work
operate at the 10Gbps line-rate of the Ethernet controller. All of the results are shown in Table 5.4.
Access Control
Part (b) of Table 5.4 shows the area and latency results of the four different types of NMUs. The Type A
NMU, as expected, has the lowest area and latency, though this is likely because the Type A NMU does
not need on-FPGA switching to allow the Hardware Applications to communicate (The Bridge Port
Extensions E-tag standard allows for hairpin routing). The encapsulation-based NMU has a slightly lower utilization, indicating that Type E NMUs might be preferable to reduce area utilization, though this comes at the cost of slightly increased latency caused by the segmented FIFOs included in the packet path. Finally, we note that the added overhead of implementing some destination-based access controls (i.e., Type C NMUs) is fairly minimal.

Figure 5.12: Multi-Application Test Setup for Networking
Virtualization
The results of the evaluation for the two virtualized networking NMUs are shown in Part (c) of Table 5.4.
The VLAN-based virtualization solution uses about the same amount of resources as the Type B and Type C NMUs from Part (b), though there is added latency from the tagging functionality. The VXLAN-based solution has a much higher utilization because it must first parse a full Layer 4 packet before identifying the virtual ID and routing the packet. The modest area overhead relative to the other NMUs might be worth it considering the ease of deployment, and ubiquity, of virtual networking solutions.
Routability
Dropping the requirement that there be routability between co-resident hardware applications cuts the
area utilization in half for the Type B and Type C NMUs, and nearly in half for the other NMUs, as
shown in Part (d) of Table 5.4. There is also a drop in latency from removing the Switching. Note,
(AXI4 Protocol Write Address Channel Assertions, continued)

AXI_ERRM_AWADDR_BOUNDARY: A write burst cannot cross a 4KB boundary.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Error: causes out of bounds access.

AXI_ERRM_AWADDR_WRAP_ALIGN: A write transaction with burst type WRAP has an aligned address.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Error: undefined behaviour.

AXI_ERRM_AWBURST: A value of 2'b11 on AWBURST is not permitted when AWVALID is High.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Defaults to INCR burst type, no error.

AXI_ERRM_AWLEN_LOCK: Exclusive access transactions cannot have a length greater than 16 beats.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.

AXI_ERRM_AWCACHE: If not cacheable, AWCACHE = 2'b00.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Signal unused, no error.

AXI_ERRM_AWLEN_FIXED: Transactions of burst type FIXED cannot have a length greater than 16 beats.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: FIXED burst type unsupported, defaults to INCR type, no error.

AXI_ERRM_AWLEN_WRAP: A write transaction with burst type WRAP has a length of 2, 4, 8, or 16.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Error: undefined behaviour.

AXI_ERRM_AWSIZE: The size of a write transfer does not exceed the width of the data interface.
    Interconnect response [53]: Error: data width converters may not operate correctly. Mem controller response [47]: Error: interconnect error.

AXI_ERRM_AWVALID_RESET: AWVALID is Low for the first cycle after ARESETn goes High.
    Interconnect response [53]: PR reset and static region reset are not asserted at the same time, no error. Mem controller response [47]: PR reset and static region reset are not asserted at the same time, no error.

AXI_ERRM_AWxxxxx_STABLE: Handshake check: AWxxxxx must remain stable when AWVALID is asserted and AWREADY is Low.
    Interconnect response [53]: Error: changing signals may affect interconnect functionality. Mem controller response [47]: Error: interconnect error.

AXI_ERRM_AWREADY_MAX_WAIT: Recommended that AWREADY is asserted within MAXWAITS cycles of AWVALID being asserted.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.
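To make a couple of the write address channel checks concrete, the sketch below evaluates the 4KB boundary rule (for INCR bursts) and the WRAP length/alignment rules in software, using the AXI encodings of AWLEN (beats minus one) and AWSIZE (log2 of bytes per beat). The function names are illustrative; the actual protocol verifier in the shell is a hardware block.

#include <stdbool.h>
#include <stdint.h>

/* AXI_ERRM_AWADDR_BOUNDARY: an INCR write burst must not cross a 4KB
 * address boundary. */
bool awaddr_boundary_ok(uint64_t awaddr, uint8_t awlen, uint8_t awsize)
{
    uint64_t total_bytes = (uint64_t)(awlen + 1) << awsize;
    return (awaddr % 4096) + total_bytes <= 4096;
}

/* AXI_ERRM_AWLEN_WRAP / AXI_ERRM_AWADDR_WRAP_ALIGN: a WRAP burst must be
 * 2, 4, 8 or 16 beats long, and its start address must be aligned to the
 * size of each transfer. */
bool wrap_burst_ok(uint64_t awaddr, uint8_t awlen, uint8_t awsize)
{
    unsigned beats = (unsigned)awlen + 1;
    bool len_ok    = (beats == 2 || beats == 4 || beats == 8 || beats == 16);
    bool align_ok  = (awaddr % ((uint64_t)1 << awsize)) == 0;
    return len_ok && align_ok;
}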
Table A.2: AXI4 Protocol Write Data Channel Assertions

AXI_ERRM_WDATA_NUM: The number of write data items matches AWLEN for the corresponding address. This is triggered when any of the following occurs: write data arrives, WLAST is set, and the WDATA count is not equal to AWLEN; write data arrives, WLAST is not set, and the WDATA count is equal to AWLEN; ADDR arrives, WLAST is already received, and the WDATA count is not equal to AWLEN.
    Interconnect response [53]: Error: may cause interconnect to hang. Mem controller response [47]: Error: interconnect error.

AXI_ERRM_WSTRB: Write strobes are only asserted for the valid byte lanes of the transfer.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Protocol error ignored.

AXI_ERRM_WVALID_RESET: WVALID is Low for the first cycle after ARESETn goes High.
    Interconnect response [53]: PR reset and static region reset are not asserted at the same time, no error. Mem controller response [47]: PR reset and static region reset are not asserted at the same time, no error.

AXI_ERRM_Wxxxxx_STABLE: Handshake check: Wxxxxx must remain stable when WVALID is asserted and WREADY is Low.
    Interconnect response [53]: Error: changing signals may affect interconnect functionality. Mem controller response [47]: Error: interconnect error.

AXI_ERRM_WREADY_MAX_WAIT: Recommended that WREADY is asserted within MAXWAITS cycles of WVALID being asserted.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.
(AXI4 Protocol Read Address Channel Assertions)

AXI_ERRM_ARADDR_BOUNDARY: A read burst cannot cross a 4KB boundary.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Error: causes out of bounds access.

AXI_ERRM_ARADDR_WRAP_ALIGN: A read transaction with burst type WRAP has an aligned address.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Error: undefined behaviour.

AXI_ERRM_ARBURST: A value of 2'b11 on ARBURST is not permitted when ARVALID is High.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Defaults to INCR burst type, no error.

AXI_ERRM_ARLEN_LOCK: Exclusive access transactions cannot have a length greater than 16 beats.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.

AXI_ERRM_ARCACHE: If not cacheable, ARCACHE = 2'b00.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Signal unused, no error.

AXI_ERRM_ARLEN_FIXED: Transactions of burst type FIXED cannot have a length greater than 16 beats.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: FIXED burst type unsupported, defaults to INCR type, no error.

AXI_ERRM_ARLEN_WRAP: A read transaction with burst type WRAP has a length of 2, 4, 8, or 16.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Error: undefined behaviour.

AXI_ERRM_ARSIZE: The size of a read transfer does not exceed the width of the data interface.
    Interconnect response [53]: Error: data width converters may not operate correctly. Mem controller response [47]: Error: interconnect error.

AXI_ERRM_ARVALID_RESET: ARVALID is Low for the first cycle after ARESETn goes High.
    Interconnect response [53]: PR reset and static region reset are not asserted at the same time, no error. Mem controller response [47]: PR reset and static region reset are not asserted at the same time, no error.

AXI_ERRM_ARxxxxx_STABLE: Handshake check: ARxxxxx must remain stable when ARVALID is asserted and ARREADY is Low.
    Interconnect response [53]: Error: changing signals may affect interconnect functionality. Mem controller response [47]: Error: interconnect error.

AXI_ERRM_ARREADY_MAX_WAIT: Recommended that ARREADY is asserted within MAXWAITS cycles of ARVALID being asserted.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.
Table A.5: AXI4 Protocol Read Data Channel Assertions

AXI_ERRM_RLAST_ALL_DONE_EOS: All outstanding read bursts must have completed.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.

AXI_ERRM_RDATA_NUM: The number of read data items must match the corresponding ARLEN.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.

AXI_ERRM_RID: The read data must always follow the address that it relates to. If IDs are used, RID must also match ARID of an outstanding address read transaction. This violation can also occur when RVALID is asserted with no preceding AR transfer.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.

AXI_ERRM_RRESP_EXOKAY: An EXOKAY read response can only be given to an exclusive read access.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.

AXI_ERRM_RVALID_RESET: RVALID is Low for the first cycle after ARESETn goes High.
    Interconnect response [53]: PR reset and static region reset are not asserted at the same time, no error. Mem controller response [47]: PR reset and static region reset are not asserted at the same time, no error.

AXI_ERRM_Rxxxxx_STABLE: Handshake check: Rxxxxx must remain stable when RVALID is asserted and RREADY is Low.
    Interconnect response [53]: Signals from static region don't need to be checked, no error. Mem controller response [47]: Signals from static region don't need to be checked, no error.

AXI_ERRM_RREADY_MAX_WAIT: Recommended that RREADY is asserted within MAXWAITS cycles of RVALID being asserted.
    Interconnect response [53]: Error: not accepting response will cause interconnect to hang.

AXI_ERRM_EXCL_ALIGN: The address of an exclusive access is aligned to the total number of bytes in the transaction.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.

AXI_ERRM_EXCL_LEN: The number of bytes to be transferred in an exclusive access burst is a power of 2, that is, 1, 2, 4, 8, 16, 32, 64, or 128 bytes.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.

AXI_ERRM_EXCL_MATCH: Recommended that the address, size, and length of an exclusive write with a given ID is the same as the address, size, and length of the preceding exclusive read with the same ID.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.

AXI_ERRM_EXCL_MAX: 128 is the maximum number of bytes that can be transferred in an exclusive burst.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.

AXI_ERRM_EXCL_PAIR: Recommended that every exclusive write has an earlier outstanding exclusive read with the same ID.
    Interconnect response [53]: Protocol error ignored. Mem controller response [47]: Exclusive access not supported, ignored, no error.