A Secure and Formally Verified Commodity Multiprocessor Hypervisor Shih-Wei Li Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy under the Executive Committee of the Graduate School of Arts and Sciences COLUMBIA UNIVERSITY 2021
We verify SeKVM by decomposing the KCore codebase into 34 layered modules. Figure 2.2
shows the KCore layered architecture. The top layer is TrapHandler, which defines KCore’s
interface to KServ and VMs, such as KServ hypercalls and VM exit handlers. Exceptions caused
by KServ and VMs cause a context switch to KCore, calling CtxtSwitch to save CPU register
state to memory, then TrapDispatcher or FaultHandler to handle the respective exception. On
a KServ hypercall, TrapDispatcher calls VCPUOps to handle the VM ENTER hypercall to execute a
VM, and MemHandler, BootOps, and SmmuOps to run their respective hypercall handlers. On a VM
exit, TrapDispatcher calls functions at lower layers if the exception can be handled directly by
KCore; otherwise, CtxtSwitch is called again, protecting VM CPU data and switching to KServ
to handle the exception. On other KServ exceptions, FaultHandler calls MemOps to handle KServ
stage 2 page faults and SmmuOps to handle any KServ accesses to SMMU hardware. FaultHandler
also calls MemOps to handle VM GRANT_MEM and REVOKE_MEM hypercalls. KCore implements
basic page table operations in the layers in MMU PT, including page table walks, mapping or
unmapping a page in a page table, and page table allocation. KCore implements ownership tracking
for each 4KB regular page in PageMgmt, PageIndex, and Memblock for memory access control.
MemOps and MemAux provide memory protection APIs to other layers. KCore provides SMMU page
table operations in the layers in SMMU PT. KCore provides VM boot protection in BootOps, BootAux,
and BootCore. We ported the verified Hacl* [66] library to SeKVM. BootOps calls the Ed25519
library from Hacl* to authenticate signed VM boot images. BootOps and MemOps call the AES
implementation from Hacl* to encrypt or decrypt VM data to support VM management.
We first describe the API of KCore’s top layer, TrapHandler. Appendix A.1 details the API
for the intermediate layers.
2.1.2 TrapHandler API
TrapHandler includes the hypercalls and exception handlers. It provides HypSec’s hypercall
API, shown in Figure 1.1. Table 2.1 lists the hypercalls for supporting the VM CREATE and VM
BOOT APIs, allowing KServ to allocate VMs and participate in secure VM boot. Table 2.2 lists the
hypercalls for the VM ENTER and GET VM STATE APIs, allowing KServ to execute a VCPU and
export and import encrypted VM data. Table 2.3 lists the hypercalls for the IOMMU OPS API, the
clear_vm hypercall for KServ to reclaim memory from a terminated VM, and a timer hypercall
to set timers. Table 2.4 lists the hypercalls used by VMs. Finally, Table 2.5 lists the exception
handlers.
register_vm: Used by KServ to request KCore to create new VMs. KCore allocates a unique VM identifier, which it returns to KServ. It also allocates the per-VM metadata VMInfo (See Section 3.3.1) and a stage 2 page table root for the VM.
register_vcpu: Used by KServ to request KCore to initialize a new VCPU for a specified VM. KCore allocates the VCPUContext (See Section 3.3.1) data structure for the VCPU.
set_boot_info: Used by KServ to pass VM boot image information, such as the image size and load guest physical address, to KCore. An example of a boot image is the VM kernel binary. The images will be verified and loaded to VM memory when the VM boots.
remap_boot_image_page: Used by KServ to pass one page of the VM boot image to KCore. KCore remaps all pages of a VM image to a contiguous range of memory in its address space so it can later authenticate the image.
verify_vm_image: Used by KServ to request KCore to authenticate a VM boot image. KCore authenticates each binary of the boot image, and refuses to boot the VM if authentication fails. Before authenticating a VM image, KCore unmaps all its pages from KServ's stage 2 page table to guarantee that the verified image cannot be later altered by KServ. If authentication succeeds, KCore maps the boot image to the VM's stage 2 page table.
Table 2.1: TrapHandler API: KServ Hypercall Handlers (VM CREATE and VM BOOT)
run_vcpu: Used by KServ to request KCore to run a VM's VCPU on the current physical CPU. KServ passes the VM and VCPU identifiers to KCore. KCore context switches to the VCPU and resolves prior VM exceptions before returning to the VM.
encrypt_vcpu: Used by KServ to request KCore to export encrypted VM CPU data; for VM migration and snapshots.
decrypt_vcpu: Used by KServ to request KCore to import encrypted VM CPU data; for VM migration and snapshots. KCore copies the data to a private buffer before decrypting it.
encrypt_vm_mem: Used by KServ to request KCore to export encrypted VM memory data; for VM migration and snapshots.
decrypt_vm_mem: Used by KServ to request KCore to import encrypted VM memory data; for VM migration and snapshots. KCore copies the data to a private buffer before decrypting it.
Table 2.2: TrapHandler API: KServ Hypercall Handlers (VM ENTER and GET VM STATE)
smmu_alloc_unit: Used by KServ to request KCore to allocate an SMMU translation unit for a given device. KCore sets the owner of the SMMU translation unit to the owner of the device.
smmu_free_unit: Used by KServ to request KCore to deallocate an SMMU translation unit previously used by a device. If a device was owned by a VM, KCore ensures that deallocation can succeed when the VM is powered off.
smmu_map: Used by KServ to request KCore to map a 4KB page in a device's SMMU page table, from a device virtual address (iova) to the hPA of the page. KCore rejects the request if the owner of the page is different from the owner of the device. KServ is allowed to map a page to the SMMU page table of a VM-owned device before the VM boots.
smmu_unmap: Used by KServ to request KCore to unmap an iova in a device's SMMU page table. KServ is only allowed to do so after the VM that owns the device is powered off.
smmu_iova_to_phys: Used by KServ to request KCore to walk a device's SMMU page table. Given an iova, KCore returns the corresponding physical address.
clear_vm: Used by KServ to request KCore to reclaim pages from a terminated VM. KCore will scrub all pages of the terminated VM and set their ownership to KServ.
set_timer: Used by KServ to request KCore to update a privileged EL2 timer register with a timer counter offset; for timer virtualization, since SeKVM offloads timer virtualization to KServ.
Table 2.3: TrapHandler API: KServ Hypercall Handlers (IOMMU OPS)
GRANT_MEM: Used by a VM to grant KServ access to its data in a memory region. KCore sets the share field in the S2Page (See Section 3.3.1) structure for each page in the memory region.
REVOKE_MEM: Used by a VM to revoke KServ's access to a previously shared memory region. KCore clears the share field in the S2Page structure for each of the pages in the memory region and unmaps the pages from KServ.
psci_power: Used by a VM to request KCore to configure VM power states via Arm's PSCI power management interface [67]; the request is passed to KServ for power management emulation. It calls VMPower to update or retrieve the VM's power state.
Table 2.4: TrapHandler API: VM Hypercall Handlers
host_page_fault: Handles stage 2 page faults for KServ. KCore builds the mapping for the faulted address in KServ's stage 2 page table if the access is allowed. An identity mapping is used for hPAs in KServ's stage 2 page table, allowing KServ to implicitly manage all free physical memory. KCore also handles the page faults caused by KServ's MMIO accesses to the SMMU.
vm_page_fault: Handles stage 2 page faults for a VM, which occur when a VM accesses unmapped memory or its MMIO devices. KCore context switches to KServ for further exception processing, to allocate memory for the VM and emulate VM MMIO accesses. KCore copies the I/O data from KServ to the VM's GPRs on MMIO reads, and vice versa on MMIO writes.
handle_irq: Handles physical interrupts that result in VM exits; KCore context switches to KServ for the interrupt handling.
handle_wfx: Handles VM exits due to WFI/WFE instructions. KCore context switches to KServ to handle the exception.
handle_sysreg: Handles VM exits due to accesses to privileged system registers, handled directly by KCore.
Table 2.5: TrapHandler API: Exception Handlers
2.2 Layered Methodology for Verifying SeKVM
We take the first steps toward verifying the correctness of SeKVM’s TCB, KCore, by combin-
ing the layered implementation of the TCB with a layered hardware model. We start with a bottom
machine model that supports comprehensive multiprocessor hardware features such as multi-level
page tables, tagged TLBs, and a coherent cache hierarchy with bypass support. We use layers to
gradually refine the detailed low-level machine model to a higher-level and simpler abstract model.
Finally, we verify each layer of software by matching it with the simplest level of machine model
abstraction, reducing the proof burden to make it possible to verify commodity software using
these hardware features.
We leverage Certified Concurrent Abstraction Layers (CCALs) [18, 68] to verify the correct-
ness of the multiprocessor TCB. Each abstraction layer consists of three components: the underlay
interface, the layer’s module implementation, and its overlay interface. Each interface exposes
abstract primitives, encapsulating the implementation of lower-level routines, so that each layer’s
implementation may invoke the primitives of the underlay interface as part of its execution. We
allow primitives from multiple lower layers to be passed to a single given layer’s underlay interface.
For each layer I of KCore’s implementation, we prove that I running on top of the underlay
interface L refines its (overlay) specification S, written I@L ⊑ S. Because the layer refinement relation
⊑ is transitive, we can incrementally refine KCore's entire implementation as a stack of layer
specifications. For example, given a system consisting of layer implementations M3, M2, and M1,
their respective layer specifications S3, S2, and S1, and a base machine model specified by S0, we
prove M1@S0 ⊑ S1, M2@S1 ⊑ S2, and M3@S2 ⊑ S3. In other words, once a module of the
implementation is proven to refine its layer specification, we can use that simpler specification,
instead of the complex module implementation, to prove other modules that depend on it. We
compose these layers to obtain (M3 ⊕ M2 ⊕ M1)@S0 ⊑ S3, proving that the behavior of the
system’s linked modules together refines the top-level specification S3.
[Figure 2.3: Refinement of Machine Models. (a) The machine model of the bottom layer AbsMachine; (b) the machine model after the machine refinement.]

All KCore interface specifications and refinement proofs are manually written in Coq, which
we trust, with 34 interface specifications matching the layers in Figure 2.2. We use CompCert [69]
to parse each layer of the C implementation into Clight representation, an abstract syntax tree
(AST) defined in Coq; the same is done manually for assembly code. We then use that Coq
representation to prove that the layer implementation refines its respective interface specification
at the C and assembly level. Note that the C functions that we verify may invoke primitives
implemented in assembly and introduced in the bottom machine model. We enforce that these
assembly primitives do not violate C calling conventions and that parameters are correctly passed. For
example, we verify the correctness of TLB maintenance code, which is implemented in C, but
invokes primitives implemented in assembly.
2.2.1 AbsMachine: Abstract Multiprocessor Machine Model
Each of KCore’s layer modules successively builds upon AbsMachine, our multiprocessor ma-
chine model. This abstract hardware model constitutes the foundation of our correctness proof. As
shown in Fig. 2.3, AbsMachine includes multiple CPUs and a shared main memory. AbsMachine
models general purpose and system registers for each CPU. It also models Arm hardware features
relevant to modern hypervisor implementation, including the multi-level stage 1, stage 2, and
SMMU page tables, a physically indexed, physically tagged (PIPT) shared data cache, and TLBs.
In multiprocessor settings, page tables are usually shared by multiple CPUs. For instance, a mul-
tiprocessor VM running on different CPUs uses the same copy of the stage 2 page table. We must
account for the shared multi-level page tables when proving the correctness of SeKVM’s multipro-
cessor TCB. In this chapter, we discuss the correctness proof of the TCB in managing multi-level
page tables. We present the correctness proof of the TCB in managing shared multi-level page
tables in Chapter 3. The shared data cache is semantically equivalent to Arm’s multi-level cache
hierarchy with coherent caches. KCore uses stage 2 page tables to translate guest physical ad-
dresses to actual physical addresses on the host, and uses its EL2 stage 1 page table to translate its
virtual addresses to physical addresses. AbsMachine models the particular hardware configuration
of KCore that we verify. For example, although Arm supports 1GB, 2MB, and 4KB mappings in
stage 2 page tables, KCore only uses 4KB and 2MB mappings in stage 2 page tables, since 1GB
mappings result in fragmentation. Thus, we model a VM’s memory load and store accesses in
AbsMachine over stage 1 and stage 2 page tables using 4KB and 2MB mappings. AbsMachine
models concurrent executions of multiple CPUs. Further details are discussed in Section 3.1.1.
Although the abstract machine model is specified in the bottom layer of our proof, each suc-
cessive layer implicitly has a machine model used to express how events at that layer affect the
machine state. For example, each layer has some notion of memory to support memory load and
store primitives. For many layers, most primitives and their effect on the machine model at the
overlay interface are the same as those at the underlay interface. These pass-through primitives
and their effects on the machine state do not need to be re-specified for each higher layer. On the
other hand, each layer may define new primitives based on a higher-level machine model, so long
as a refinement can be proven between the layer’s implementation over the underlay interface and
the overlay interface.
A key aspect of our proofs is to abstract away the low-level details of the machine model, layer
by layer, by proving refinement between the software implementation using a lower-level machine
model and its specification based on a higher-level machine model. For example, we verify that
the TLBs are maintained correctly by lower hypervisor layers, such that the TLB behavior exposed
by AbsMachine is encapsulated by the hypervisor implementation at lower layers, and is thus
abstracted from the higher layer specifications.
2.2.2 Page Table Management
As shown in Figure 2.3, AbsMachine includes multi-level page tables. Like Arm hardware,
a page table can include up to four levels, referred to using Linux terminology as pgd, pud, pmd,
and pte. AbsMachine models both regular (4KB) and huge (2MB) page table mappings, as used
by KVM and also employed by KCore. KCore maintains stage 2 page tables — one per VM and
one for KServ — and its EL2 stage 1 page table, all modeled by AbsMachine. KCore associates a
unique VMID identifier for each VM and KServ, identifying the respective stage 2 page table.
The functions for KCore to manipulate page tables are implemented and verified at the four
layers of the MMU PT module, shown in Figure 2.2. The PTAlloc layer dynamically allocates page
table levels, e.g., pud, pmd, and pte. The PTWalk layer provides helper functions for walking an
individual level of the page table, e.g., walk_pgd, walk_pud, etc. The NPTWalk layer uses PTWalk’s
primitives to perform a full page table walk. The NPTOps layer grabs and releases page table locks
to perform page table operations. For instance, the map_page function maps a VM’s guest physical
frame number (gfn) to a physical frame number (pfn) by calling the set_s2pt function in the
NPTWalk layer to create a new mapping in the VM’s stage 2 page table:
Since TLBs can be refilled from page table contents, the page observers through the TLBs remain
the same after the TLB flush. The subsequent page unmapping does not invalidate the TLBs, so
the sequence of page observer groups through the TLB for this insecure implementation is as follows:
pfn: kserv, pfn: kserv n
which is different from the one in Eq. (2.1), meaning that more information can be released through
the TLBs than through the page tables.
2.2.4 Cache Management
As shown in Figure 2.3, AbsMachine includes physically-addressed writeback caches. Arm
adopts MESI/MOESI cache coherence protocols, guaranteeing that all levels of the cache are consistent:
the hardware retrieves the same contents from caches located at different levels, and updates
to the cache are synchronized across all levels. Arm's multi-level
caches can be modeled by AbsMachine as a uniform global cache. To model hardware that will
invalidate and write back cached entries unbeknownst to software, for example, due to cache line
replacement, AbsMachine exposes a cache-sync primitive that randomly evicts a cache entry
and writes it back to memory. In KCore’s specification, memory load and store operations call
cache-sync before the actual memory accesses to account for all possible cache eviction policies.
While caches are coherent, Arm hardware does not guarantee that cached data is always coherent
with main memory; caches may write back dirty lines at any time. Like other modern architectures,
Arm provides cache maintenance instructions to allow the software to flush cache lines to ensure
what is stored in main memory is up-to-date with what is stored in the cache. AbsMachine provides
a cache-flush primitive that models Arm’s clean and invalidate instruction. The primitive takes a
[Figure 2.4: Attack based on Mismatched Memory Attributes]
pfn as an argument, copies the val of pfn from the cache to main memory if the entry is present in
the cache, then removes pfn’s entry from the cache. Cache mismanagement could result in security
vulnerabilities, so hypervisors must use these instructions to ensure that data accesses across all of
its cores remain coherent, preventing stale data leaks.
Figure 2.4 shows how a malicious VM could leverage cache mismanagement on Arm hardware
to potentially obtain confidential data of another VM from main memory. Suppose the hypervisor
decides to evict a VM1’s page pfn. It unmaps the page from VM1 and scrubs the page by zeroing
out any residual data. Since the page can no longer be used by VM1, the hypervisor is free to
reassign it to another VM, VM2, by mapping pfn to VM2’s stage 2 page tables (S2PT). Arm hard-
ware guarantees the scrubbing is synchronized across all CPU caches, but does not guarantee it is
written back to the main memory. Arm allows the software to mark whether a page is cacheable or
not by setting the memory attributes in the respective page table entry. When stage 2 translation is
enabled, Arm combines memory attribute settings in stage 1 and stage 2 page tables. For a given
mapping, caching is only enabled when both stages of page tables enable caching. Hypervisors
allow VMs to manage their stage 1 page tables for performance reasons. Although KCore always
enables caching in stage 2 page tables, an attacker in VM2 could disable caching for the mapping
to pfn in its stage 1 page table, bypassing the caches and directly accessing pfn in main memory,
which could contain confidential VM1 data. To protect VM memory against this attack, the hy-
pervisor should invalidate pfn’s associated cache line after scrubbing the page to ensure that the
changes are written back to the main memory. This ensures VM2 can never retrieve VM1’s secret
in the main memory.
To ensure that KCore correctly manages caches, we verify it over AbsMachine, which models
writeback caches and cache bypass. AbsMachine models both cache and main memory as partial
maps pfn ↦ val, where val is the content stored in a given pfn. As a pfn moves between cache and
main memory, AbsMachine propagates its content with it. For example, on a cacheable memory
access, AbsMachine checks if the cache contains a mapping for pfn. If it does not, AbsMachine
populates the cache with val from main memory. It then returns val for memory loads and updates
the cached value for memory stores. Similarly, on a cache-flush or cache-sync, AbsMachine
flushes the pfn to the main memory, populating the main memory with the respective val from the
cache.
Using AbsMachine, we prove that KCore always sets the memory attributes in the page tables
that it manages to enable caching, maximizing performance. We then prove that KCore flushes
caches in the primitives that can change page ownership, verifying that the KCore implementation
refines its specification. Finally, we use KCore’s specification to prove that KCore’s cache man-
agement has no security vulnerabilities and does not compromise VM data. We discuss the first
two proofs here but defer the latter proof to Section 3.3.2.
We first prove that KCore always sets the memory attributes in the stage 2 page tables for VMs
and KServ to enable caching. KCore updates stage 2 page table entries by calling the verified
map_page primitive, as discussed in Section 2.2.2. map_page is passed the attr parameter to set
the page table entry attributes. We verify that the primitives that call map_page pass in the correct
attr to enable caching. Specifically, we verify that the implementations of map_pfn_vm and
map_pfn_host in the MemAux layer, which call map_page to map a pfn to a VM's and KServ's stage 2
page table respectively, refine specifications that pass an attr value with caching enabled to map_page.
We also prove that KCore always sets the memory attributes in its own EL2 stage 1 page tables to
enable caching. Similar to map_page, NPTOps provides a map_page_core primitive for updating
EL2 stage 1 page tables, which in turn calls set_s1pt in NPTWalk to update the multi-level page
tables — we proved the correctness of these primitives similarly to the proofs for map_page and
set_s2pt. We then verify that the primitives that call map_page_core pass in the correct attr to
enable caching.
We then prove that KCore correctly flushes cache lines in the primitives that change page
ownership. Specifically, we prove the correctness of assign_pfn_vm and clear_vm_page in the
MemAux layer. assign_pfn_vm unmaps pfn from KServ and assigns the owner of a newly allocated
pfn to a VM. clear_vm_page reclaims a pfn from a VM upon the VM’s termination, scrubs the
pfn, and assigns the owner of the pfn to KServ. We prove that the implementations of both
primitives refine their corresponding specifications that call cache-flush.
2.2.5 SMMU Management
As shown in Figure 2.3, AbsMachine models Arm’s SMMU, which supports a shared SMMU
TLB and SMMU multi-level page tables, which can be allocated for each device devk. The TLB is
tagged, and page tables can support up to four levels of paging with regular and huge page support,
similar to the page tables and TLBs discussed in Sections 2.2.2 and 2.2.3. Unlike memory accesses
from CPUs, there are no caches involved in memory accesses through the SMMU. For simplicity,
we only describe the SMMU stage 2 page tables, used by the SMMU implementation [70] on
the Arm Seattle server hardware we used for evaluation. AbsMachine also provides dev_load
and dev_store operations to model memory accesses of DMA-capable devices attached to the
SMMU.
KCore controls the SMMU and maintains the SMMU TLB and SMMU page tables for each
devk. TLB entries are tagged by VMID. The parts of KCore that manipulate page tables are the
four layers of SMMU PT shown in Figure 2.2. Similar to how we refine multi-level page tables in
NPTWalk as discussed in Section 2.2.2, we refine the SMMU multi-level page table and its multi-
level page table walk in MmioSPTWalk in SMMU PT into a layer specification with a partial map that
maps an input page frame from device address space, devfn ↦ (pfn, size, attr), where size is
the size of the page, 4KB or 2MB, and attr encompasses attributes of the page. Once we prove
this refinement, higher layers that depend on SMMU page tables can be verified against the abstract
page table, enabling us to prove the correctness of KCore’s SMMU page table management.
Similar to how we refine CPU TLBs as discussed in Section 2.2.3, we refine the SMMU TLB in
MmioSPTOps so that it is abstracted away from higher layers. We model the SMMU TLB as a set of
partial maps, each map identified by VMID and mapping devfn ↦ (pfn, size, attr). AbsMachine
models SMMU TLB invalidation by exposing a smmu-tlb-flush primitive to flush all entries
associated with a VMID. We prove the correctness of KCore with the SMMU TLB by verifying
it correctly flushes entries to ensure consistency with the SMMU page tables, then abstract away
the TLB by proving that the MmioSPTOps implementation using the SMMU TLB refines a simpler,
higher-level specification without the SMMU TLB. We prove unmap_spt in MmioSPTOps calls
smmu-tlb-flush after unmapping a pfn from the SMMU page table.
2.3 Summary
In this chapter, we presented SeKVM, a KVM hypervisor retrofitted based on the HypSec de-
sign. To verify SeKVM’s TCB on a multiprocessor machine with realistic hardware features, we
introduced a layered hypervisor architecture and verification methodology. First, we leveraged
HypSec to split KVM into two layers, a higher layer consisting of a large set of untrusted hy-
pervisor services, and a lower layer consisting of the hypervisor TCB. Next, we used layers to
modularize the implementation and proof of the TCB to reduce the proof effort, modeling mul-
tiprocessor hardware features at different levels of abstraction tailored to each layer of software.
Using this approach, we have taken the first steps to verify the correctness of SeKVM’s TCB, using
a novel layered machine model that accounts for widely-used multiprocessor hardware features,
including multi-level page tables, tagged TLBs, and multi-level caches with cache bypass support.
Chapter 3: Verifying Security Guarantees
Although functional correctness can be verified by showing that the software implementation
running on the machine model refines its specification, such a correctness proof may not guarantee
that the desired properties of the software are satisfied. Specifications may themselves be buggy
and not provide the desired properties. When verifying a hypervisor, in addition to verifying the
functional correctness of its TCB, it is also crucial to verify the security properties of the entire
hypervisor.
In this chapter, we verify the security guarantees of SeKVM, demonstrating that it protects
VM data confidentiality and integrity on multiprocessor hardware. We do this in two steps. First,
we build on the proofs described in Chapter 2 to verify the functional correctness of SeKVM’s
TCB, KCore, showing that its implementation refines the specification. Then, we use KCore’s
specification to prove the security properties of the entire system. Because the specification is
easier to use for higher-level reasoning, it becomes possible to prove security properties that would
be intractable if attempted directly on the implementation. A vital requirement of this approach is
that we must ensure the specification soundly captures all behaviors of KCore’s implementation,
so the security guarantees proven over the specification hold on the implementation. However,
refinement may not preserve security properties, such as data confidentiality and integrity in a
multiprocessor setting [26, 27] because intermediate updates to shared data within critical sections
can be hidden by refinement, yet visible to concurrently running code on different CPUs.
To reason about KCore in a multiprocessor setting, I introduce security-preserving layers to ex-
press KCore’s specification as a stack of layers, so that each module of its implementation may be
incrementally proven to refine its layered specification and preserve security properties. Security-
preserving layers are a drop-in replacement for CCALs in the SeKVM layered architecture that
provide the additional benefit of ensuring that the refinement of multiprocessor code does not hide
information release. We use security-preserving layers to verify, for the first time, the functional
correctness of a multiprocessor system with shared page tables. Using security-preserving lay-
ers, we gradually refine detailed hardware and software behaviors at lower layers into simpler
abstract specifications at higher layers. We ensure that the composition of layers embodied by
KCore’s top-level specification reflects all intermediate updates to shared data across the entire
KCore implementation. We can then use the abstract top-level specification to prove the system’s
information-flow security properties and ensure those properties hold for the implementation.
Next, we use KCore’s specification to prove that any malicious behavior of the untrusted KServ
using KCore’s interface cannot violate the desired security properties. We prove VM confidential-
ity and integrity using KCore’s specification, formulating the guarantees in terms of noninterfer-
ence [28] to verify that there is no information leakage between VMs and KServ. However, a strict
noninterference guarantee is incompatible with commodity hypervisor features, including KVM’s.
For example, a VM may send encrypted data via shared I/O devices virtualized via untrusted hyper-
visor services, thereby not leaking private VM data. This kind of intentional information release,
known as declassification [71], should be distinguished from unintentional information release.
We incorporate data oracles [25] to mask the intentional information flow and distinguish it from
unintentional data leakage. After this masking, any outstanding information flow is unintentional
and must be prevented, or it will affect the behavior of KServ or VMs. To show the absence of
unintentional information flow, we prove noninterference assertions hold for any behavior by the
untrusted KServ and VMs, interacting with KCore’s top layer specification. The noninterference
assertions are verified over this specification, for any implementation of KServ, but since KCore’s
implementation refines its specification via security-preserving layers, unintentional information
flow is guaranteed to be absent for the entire KVM implementation.
3.1 Security-preserving Refinement
Using KCore’s C and assembly code implementation to prove SeKVM’s security properties
is impractical, as we would be inundated by implementation details and concurrent interleavings.
Instead, we show that the implementation of the multiprocessor KCore incrementally refines a
high-level Coq specification. We then prove that any implementation of KServ or VMs interacting
with the top-level specification satisfies the desired security properties, ensuring that the entire
SeKVM system is secure regardless of the behavior of any principal. To guarantee that proven
top-level security properties reflect the behavior of the implementation of KCore, we must ensure
that each level of refinement fully preserves higher-level security guarantees.
To enable incremental and modular verification, I introduce security-preserving layers:
Definition 1 (Security-preserving layer). A layer is security-preserving if and only if its specifi-
cation captures all information released by the layer implementation.
Security-preserving layers build on our initial layered architecture based on CCALs to verify cor-
rectness of multiprocessor code. Security-preserving layers retain the composability of CCALs,
but unlike CCALs, ensure refinement preserves security guarantees in a multiprocessor setting. We
prove that KCore’s implementation refines a stack of security-preserving layers, such that the top
layer specifies the entire system by its functional behavior over its machine state.
To simplify proof effort, we create a set of proof libraries and helper functions, including a
security-preserving layer library that soundly abstracts away complications arising from potential
concurrent interference, so that we may leverage sequential reasoning to simplify
layer refinement proofs. The key challenge is handling objects shared across multiple CPUs, as
we must account for how concurrent operations interact with them while reasoning about the local
execution of any given CPU.
Example 1 (Simple page table). We illustrate this problem using a simplified NPT example.
In our example, the NPT of a VM is allocated from its own page table pool. The pool consists of
page table entries, whose implementation is encapsulated by a lower layer interface that exposes
functions pt_load(vmid, ofs) to read the value at offset ofs from the page table pool of VM
vmid and pt_store(vmid, ofs, val) to write the value val at offset ofs. To keep the example
simple, we use a simplified version of the actual NPT we verified, ignore dynamic allocation and
permission bits, and assume two-level paging denoted with pgd and pte. Consider the following
two implementations, which map gfn to pfn in VM vmid's NPT:

void set_npt(uint vmid, uint gfn, uint pfn)
{
    acq_lock_npt(vmid);
    // load the pte base address
    uint pte = pt_load(vmid, pgd_offset(gfn));
    pt_store(vmid, pte_offset(pte, gfn), pfn);
    rel_lock_npt(vmid);
}
Fixpoint replay (l: Log) (st: SharedObj) :=
  match l with
  | e :: l' =>
      match replay l' st with
      | Some (st', _) => replay_event e st'
      | None => None
      end
  | _ => Some st
  end.
The replay function recursively traverses the log to reconstruct the state of shared objects, invok-
ing replay_event to handle each event and update shared object state; this update may fail (i.e.,
return None) if the event is not valid with respect to the current state. For example, the replay
function returns the load result for a page table pool load event P_LD, but the event is only allowed
if the lock is held by the current CPU:

Definition replay_event (e: Event) (obj: SharedObj) :=
  match e with
  | (P_LD vmid ofs, cpu) =>
      match ZMap.get vmid (pt_locks obj) with
      | Some cpu' => (* the pt lock of vmid is held by cpu' *)
          if cpu =? cpu' (* if cpu = cpu' *)
          then let pool := ZMap.get vmid (pt_pool obj) in
               Some (obj, Some (ZMap.get ofs pool))
          else None (* fails when held by a different cpu *)
      | None => None (* fails if not already held *)
      end
  | ... (* handle other events *)
  end.
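The replay idea can be sketched in C under simplifying assumptions (hypothetical event kinds, a single lock per object, and no page table pool): a load event is valid only when the lock is held by the CPU that emitted it, so replaying a log fails exactly when the log records a racy access.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds mirroring the Coq example:
   lock acquire/release and a page table pool load. */
typedef enum { EV_ACQ, EV_REL, EV_LD } EvKind;

typedef struct { EvKind kind; int cpu; } Event;

typedef struct {
    int lock_holder;   /* -1 when the lock is free */
} SharedObj;

/* Replay the log oldest-first, rebuilding the shared state.
   Returns 1 on success, 0 if some event was invalid, e.g.
   a load by a CPU that does not hold the lock. */
int replay(const Event *log, size_t n, SharedObj *st) {
    st->lock_holder = -1;
    for (size_t i = 0; i < n; i++) {
        const Event *e = &log[i];
        switch (e->kind) {
        case EV_ACQ:
            if (st->lock_holder != -1) return 0; /* already held */
            st->lock_holder = e->cpu;
            break;
        case EV_REL:
            if (st->lock_holder != e->cpu) return 0;
            st->lock_holder = -1;
            break;
        case EV_LD:
            if (st->lock_holder != e->cpu) return 0; /* data race */
            break;
        }
    }
    return 1;
}
```

Unlike the Coq version, which recurses over the newest event first, this sketch walks the log oldest-first; the validity condition it enforces is the same.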
Our abstract machine is formalized as a transition system, where each step models some atomic
computation taking place on a single CPU; concurrency is realized by the nondeterministic inter-
leaving of steps across all CPUs [73]. To simplify reasoning about all possible interleaving, we lift
multiprocessor execution to a CPU-local model, which distinguishes execution taking place on a
particular CPU from its concurrent environment [18].
All effects coming from the environment are encapsulated by and conveyed through an event
oracle, which yields events emitted by other CPUs when queried. How the event oracle synchro-
nizes these events is left abstract, its behavior constrained only by rely-guarantee conditions [74].
CPUs need only query the event oracle before interacting with shared objects, since its private state
is not affected by these events. Querying the event oracle will result in a composite event trace of
the events from other CPUs interleaved with events from the local CPU.
For example, the pt_load’s specification shown below queries event oracle o to obtain events
from other CPUs, appends them to the logical log (producing l0), checks the validity of the load
and calculates the load result using the replay function, then appends a load event to the log
(producing l1):

(* Event Oracle takes the current log and produces
   a sequence of events generated by other CPUs *)
Definition EO := Log -> Log.

Definition pt_load_spec (o: EO) (st: AbsSt) (vmid ofs: Z) :=
  let l0 := o (log st) ++ log st in (* query event oracle *)
  (* produce the P_LD event *)
  let l1 := (P_LD vmid ofs, cid st) :: l0 in
  match replay l1 with
  (* log is valid and P_LD event returns r *)
  | Some (_, Some r) => Some (st {log: l1}, r)
  | _ => None
  end.
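The query-move/CPU-local-move structure of such specifications can be sketched in C, with all types and the one_env_event oracle as hypothetical stand-ins: one specification step first drains environment events from the oracle into the log, then appends its own event.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOG 64

typedef struct { int id; int cpu; } Event;

typedef struct { Event ev[MAX_LOG]; size_t len; } Log;

/* An event oracle: given the current log, emit up to `max`
   environment events into `out`, returning how many. */
typedef size_t (*EventOracle)(const Log *log, Event *out, size_t max);

/* One specification step on CPU `cpu`: first the query move,
   then the CPU-local move appending event `id`. */
void spec_step(Log *log, EventOracle oracle, int cpu, int id) {
    Event env[MAX_LOG];
    size_t n = oracle(log, env, MAX_LOG);       /* query move */
    for (size_t i = 0; i < n && log->len < MAX_LOG; i++)
        log->ev[log->len++] = env[i];
    if (log->len < MAX_LOG) {                   /* CPU-local move */
        log->ev[log->len].id = id;
        log->ev[log->len].cpu = cpu;
        log->len++;
    }
}

/* A degenerate oracle for testing: the environment always
   emits one fixed event from CPU 1. */
size_t one_env_event(const Log *log, Event *out, size_t max) {
    (void)log;
    if (max == 0) return 0;
    out[0].id = 100;
    out[0].cpu = 1;
    return 1;
}
```

Because the oracle is a parameter, nothing in spec_step depends on any particular interleaving: any oracle satisfying the rely-guarantee conditions may be plugged in.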
Since the interleaving of events is left abstract, our proofs do not rely on any particular
interleaving of events and therefore hold for all possible concurrent interleavings. A CPU captures
the effects of its concurrent environment by querying the event oracle, a query move, before its own
CPU step, a CPU-local move.
Figure 3.2 illustrates query move and CPU-local move in the context of the event trace produced
by set_npt’s implementation to refine its specification. The bottom trace shows events produced
by set_npt’s implementation as it interacts with the shared lock and page table pool it uses. The
query move before ACQ_LK yields all events from other CPUs prior to ACQ_LK; the query move
before P_LD yields all events from other CPUs since the last query up until P_LD. The end result of
its execution is a composite event trace of the events from other CPUs, interleaved with the events
from the local CPU.
Interleaving query and CPU-local moves still complicates reasoning about set_npt’s imple-
mentation. However, if we can guarantee that events from other CPUs do not interfere with the
shared objects used by set_npt, we can safely shuffle events from other CPUs to the beginning or
end of its critical section. For example, if we could prove that set_npt’s implementation is DRF,
then other CPUs will not produce events within set_npt’s critical section that interact with the
locked NPT. We would then only need to make a query move before the critical section, not within
the critical section, allowing us to sequentially reason about set_npt's critical section as an atomic
operation.

Figure 3.2: Querying the Event Oracle to Refine set_npt
Unfortunately, as shown by set_npt_insecure, even if set_npt correctly uses locks to pre-
vent concurrent NPT accesses within KCore’s own code, it is not DRF because KServ or VMs
executing on other CPUs may indirectly read the contents of their NPTs through the MMU hard-
ware. This prevents us from soundly shuffling event queries outside of the critical section and
employing sequential reasoning to refine the critical section to an atomic step. If set_npt cannot
be treated as an atomic primitive, sequential reasoning would then be problematic to use for any
layer that uses set_npt, making their refinement difficult. Without sequential reasoning, verifying
a large system like KCore is not feasible.
3.1.2 Transparent Trace Refinement
We observe that information leakage can be modeled by read events that occur arbitrarily
throughout critical sections, without regard for locks. To ensure that refinement does not hide
this information leakage, transparent trace refinement treats read and write events separately. We
view shared objects as write data-race-free (WDRF) objects—shared objects with unavoidable
concurrent observers. For these objects, we treat their locks as write-locks, meaning that query
moves that yield write events may be safely shuffled to the beginning of the critical section. Query
moves in the critical section may then only yield read events from those concurrent readers.
Figure 3.3: Transparent Trace Refinement of Insecure and Secure set_npt Implementations
To determine when read events may also be safely shuffled, each WDRF object must define
an event observer function, which designates what concurrent CPUs may observe: they take the
current machine state as input, and produce some observed result, with consecutive identical event
observations constituting an event observer group. Event observer groups thus represent all pos-
sible intermediate observations by concurrent readers. Since the event observations are the same
in an event observer group, read events from other CPUs will read the same values anywhere in
the group and can be safely shuffled to the beginning, or end, of an event observer group, reducing
the verification effort of dealing with interleaving. Our security-preserving layers enforce that any
refinement of WDRF objects must satisfy the following condition:
Definition 2 (Transparency condition). The list of event observer groups of an implementation
must be a sublist of that generated by its specification. That is, the implementation reveals at most
as much information as its specification.
This condition ensures that the possibility of concurrent readers and information release is
preserved through each layer refinement proof. In particular, if a critical section has at most two
distinct event observer groups, read events can be safely shuffled to the beginning or end of the crit-
ical section. Query moves are no longer needed during the critical section, but can be made before
or after the critical section for both read and write events, making it possible to employ sequential
reasoning to refine the critical section. Transparent trace refinement can thereby guarantee that
events from other CPUs do not interfere with shared objects in critical sections. Figure 3.3 illus-
trates how this technique fares against our earlier counterexample, as well as to our original, secure
implementation. Each node in Figure 3.3 represents an event observation. Nodes of the same color
constitute an event observer group. The insecure example shows that set_npt_insecure does not
satisfy the transparency condition. There is an intermediate observation (shown in red) that cannot
map to any group in the specification. set_npt_insecure has three event observer groups that
can observe three different values, before the first pt_store, between the first and second pt_-
store, and after the second pt_store. Read events after the first pt_store cannot be shuffled
before the critical section. On the other hand, set_npt has only two event observer groups, one
that observes the value before pt_store, and one that observes the value after pt_store, so query
moves are not needed during the critical section. The implementation can therefore be refined to
an atomic set_npt specification. Refinement proofs for higher layers that use set_npt can then
treat set_npt as an atomic primitive, simplifying those proofs since set_npt can be viewed as just
one atomic computation step instead of many CPU-local moves with intervening query moves.
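The transparency condition itself is mechanical enough to sketch in C with hypothetical helpers: collapse a trace of observed values into event observer groups (runs of identical observations), then check that the implementation's groups form an order-preserving sublist of the specification's.

```c
#include <assert.h>
#include <stddef.h>

/* Collapse consecutive duplicate observations into observer
   groups; returns the number of groups written to `groups`. */
size_t observer_groups(const int *obs, size_t n, int *groups, size_t max) {
    size_t g = 0;
    for (size_t i = 0; i < n; i++) {
        if (g > 0 && groups[g - 1] == obs[i]) continue;
        if (g < max) groups[g++] = obs[i];
    }
    return g;
}

/* Is `a` (length n) an order-preserving sublist of `b` (length m)? */
int is_sublist(const int *a, size_t n, const int *b, size_t m) {
    size_t i = 0;
    for (size_t j = 0; j < m && i < n; j++)
        if (b[j] == a[i]) i++;
    return i == n;
}
```

In this sketch, a secure set_npt exposes observations {old, old, new}, which collapse to the two groups {old, new} of its atomic specification, while an insecure variant exposing an intermediate value produces three groups and fails the sublist check.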
Example Proofs: Transparent Trace Refinement. We expand the simple set_npt example
discussed in Section 3.1 with a layered implementation to demonstrate SeKVM’s layer refinement
proof. We use a simplified version of the actual layers in KCore. set_npt is analogous to the
map_page primitive in KCore's actual implementation. Both acquire and release the per-principal
page table lock to protect access to the shared NPTs. However, unlike
Based on the generated Coq AST and the primitives specified by the underlay interface, our tool
can automatically infer the following operational Coq specification for set_npt_low, replacing
function calls with primitives to interact with the underlay state.

Definition set_npt_low_spec (o: EO) (st: AbsSt) (gfn pfn vmid: Z) :=
  match acq_lock_npt_spec vmid st with
  | Some st1 =>
      let pgd_off := pgd_offset gfn in
      let pte := pt_load_spec o st1 vmid pgd_off in
      let pte_off := pte_offset pte gfn in
      let st2 := pt_store_spec o st1 vmid pte_off pfn in
      rel_lock_npt_spec vmid st2
  | None => None
  end.
    else res = 0; // pfn is not owned by KSERV
    rel_lock_s2pg();
    return res;

// Primitive provided by TrapHandler
void run_vcpu(uint vmid)
{
    ...
    assign_to_vm(vmid, gfn, pfn);
    ...
}
Layer 1: NPTWalk. This layer specifies a set of verified pass-through primitives, and an abstract
state upon which they act. We extend the abstract state from the earlier example to model accesses
to the shared memory and the shared S2Page metadata, and a map from vmid to each VM’s local
state. We update the list of events from our earlier definition as follows:

Inductive Event :=
| ACQ_NPT (vmid: Z) | REL_NPT (vmid: Z)
| P_LD (vmid ofs: Z) | P_ST (vmid ofs val: Z)
| ACQ_S2PG | REL_S2PG
| GET_OWNER (pfn: Z) | SET_OWNER (pfn owner: Z)
| SET_MEM (pfn val: Z).
We also update NPTWalk's abstract state to define a VM's local state:

(* VM local state *)
Record LocalState := {
  data_oracle: ZMap.t Z;      (* data oracle for the VM *)
  doracle_counter: Z;         (* data oracle query counter *)
  ...
}.

(* Abstract state *)
Record AbsSt := {
  log: Log;
  cid: Z;                     (* local CPU identifier *)
  vid: Z;                     (* vmid of the running principal on CPU cid *)
  lstate: ZMap.t LocalState;  (* per-VM local state *)
}.
We extend the definition of the shared objects from the earlier example to model the shared mem-
ory, an array of S2Page metadata, and the lock to protect shared accesses to the S2Page array. The
latter, as mentioned earlier, is used by KCore to enforce memory access control:

(* Shared objects constructed using replay function *)
Record SharedObj := {
  mem: ZMap.t Z;                 (* maps addresses to values *)
  s2pg_lock: option Z;           (* s2pg lock holder *)
  pt_locks: ZMap.t (option Z);   (* pt lock holders *)
  pt_pool: ZMap.t (ZMap.t Z);    (* per-VM page table pool *)
  (* s2pg_array maps pfn to (owner, share, gfn) *)
  s2pg_array: ZMap.t (Z * Z * Z);
}.
We update NPTWalk's layer interface to include the newly exposed primitives as follows:

Definition NPTWalk: Layer AbsSt :=
  acq_lock_npt ↦ csem acq_lock_npt_spec
  ⊕ rel_lock_npt ↦ csem rel_lock_npt_spec
  ⊕ pt_load ↦ csem pt_load_spec
  ⊕ pt_store ↦ csem pt_store_spec
  ⊕ acq_lock_s2pg ↦ csem acq_lock_s2pg_spec
  ⊕ rel_lock_s2pg ↦ csem rel_lock_s2pg_spec
  ⊕ get_s2pg_owner ↦ csem get_s2pg_owner_spec
  ⊕ set_s2pg_owner ↦ csem set_s2pg_owner_spec.
Data oracles can be used for primitives that declassify data, as discussed in Section 3.2. For
example, set_s2pg_owner changes the ownership of a page. When the owner is changed from
KServ to a VM vmid, the page contents owned by KServ are declassified to VM vmid, so a data
oracle is used in the specification of set_s2pg_owner to mask the declassified contents:

Definition set_s2pg_owner_spec (o: EO) (st: AbsSt) (pfn vmid: Z) :=
  let l0 := o (log st) ++ log st in
  let l1 := (SET_OWNER pfn vmid, cid st) :: l0 in
  match replay l1 with
  | Some _ => (* log is valid and lock is held *)
      let st' := st {log: l1} in
      if (vid st =? KSERV) && (vmid != KSERV)
      then (* pfn is transferred from KServ to a VM *)
        mask_with_doracle st' vmid pfn
      else Some st'
  | _ => None
  end.
We introduce an auxiliary Coq definition mask_with_doracle to encapsulate the masking behavior:

Definition mask_with_doracle (st: AbsSt) (vmid pfn: Z) :=
  let local := ZMap.get vmid (lstate st) in
  let n := doracle_counter local in
  let val := data_oracle local n in
  let l := (SET_MEM pfn val, cid st) :: log st in
  let local' := local {doracle_counter: n+1} in
  st {log: l, lstate: ZMap.set vmid local' (lstate st)}.
mask_with_doracle queries the local data oracle of VM vmid with a local query counter, generates
an event to mask the declassified content with the query result, then updates the local counter. Since
each principal has its own data oracle based on its own local state, the behavior of other principals
cannot affect the query result. set_s2pg_owner_spec only queries the data oracle when the owner
is changed from KServ to a VM. When the owner is changed from a VM to KServ, the page is
being freed, and KCore must zero out the page before recycling it; masking is not allowed. We also
introduce auxiliary definitions to mask other declassified data, such as page indices and scheduling
decisions proposed by KServ, which are not shown in this simplified example.
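The data-oracle discipline can be sketched in C under simplifying assumptions: an integer-valued oracle and hypothetical names throughout. Each VM's masked values come from its own deterministic sequence indexed by a private counter, so they cannot depend on any other principal's behavior.

```c
#include <assert.h>

#define MAX_VMS 4

/* Per-VM local state: only the oracle query counter matters
   for this sketch (doracle_counter in the Coq example). */
typedef struct {
    int counter;
} LocalState;

static LocalState lstate[MAX_VMS];

/* The oracle itself: any deterministic function of the VM id
   and the query index works; this one is arbitrary. */
static int data_oracle(int vmid, int n) {
    return vmid * 1000 + n;
}

/* Mask a declassified value for VM `vmid`: query the VM's own
   oracle at its current counter, then bump the counter. */
int mask_with_doracle(int vmid) {
    int n = lstate[vmid].counter++;
    return data_oracle(vmid, n);
}
```

Two runs that start from indistinguishable states for a VM have the same counter and the same oracle, so they produce identical masked values, which is exactly what the noninterference proofs rely on.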
Layer 2: NPTOps. As shown in the example in Section 3.1.2, we prove the implementation of
set_npt transparently refines its specification that specifies a logical map. Primitives related to
page table pools are removed from the layer interface. Other primitives are passed through.
Layer 3: MemOps. This layer introduces the assign_to_vm primitive to transfer a page from
KServ to a VM, and hides NPTOps primitives:

Definition MemOps: Layer AbsSt :=
  assign_to_vm ↦ csem assign_to_vm_spec.
assign_to_vm’s specification has a precondition that it must be invoked by KServ and the vmid
must be valid:

Definition assign_to_vm_spec (o: EO) (st: AbsSt) (vmid gfn pfn: Z) :=
  if (vid st =? KSERV) && (vmid != KSERV)
  then
    let l0 := o (log st) ++ log st in
    let l1 := (ASG_TO_VM vmid gfn pfn, cid st) :: l0 in
    match replay l1 with
    | Some (_, Some res) => (* res is the return value *)
        let st' := st {log: l1} in (* update the log *)
        if res =? 1 (* if pfn is owned by KSERV *)
        then mask_with_doracle st' vmid pfn
        else Some st' (* return without masking the page *)
    | _ => None
    end
  (* get stuck if it's not transferred from KServ to a VM *)
  else None.
It transfers a page from KServ to a VM via set_s2pg_owner, so the contents are declassified and
masked using the data oracle.
Layer 4: TrapHandler. This top layer interface introduces run_vcpu, which invokes assign_-
to_vm and context switches from KServ to the VM. We first prove that run_vcpu does not violate
the precondition of assign_to_vm. We then prove noninterference as discussed in Section 3.3.2.
Here, we can see why the page content will be masked with the same data oracle query results in
the proof of Lemma 3 for run_vcpu in Section 3.3.2. Two indistinguishable states will have the
same VM local states, and therefore the same local data oracle and counter. Thus, the data oracle
query results must be the same.
3.4 Summary
We have formally verified the guarantees of VM data confidentiality and integrity for the mul-
tiprocessor Linux KVM implementation, SeKVM. First, we built on the software and hardware
layered architecture presented in Chapter 2 to prove the functional correctness of the hypervisor
core. I introduced security-preserving layers to incrementally prove that each layer of the core
implementation refines its layered specification and preserves security guarantees on real multi-
processor hardware with multi-level page tables shared across multiple CPUs. We have used the
core’s specification to verify the security guarantees of the entire multiprocessor KVM hypervisor,
even in the presence of information sharing needed for commodity hypervisor features.
Chapter 4: Implementation and Evaluation
We use microverification to verify, for the first time, the guarantees of VM confidentiality and
integrity of the multiprocessor Linux KVM hypervisor. As shown in Figure 4.1, we use microveri-
fication to first retrofit the Arm implementation of KVM into a small hypervisor core, KCore, that
serves as the TCB, and a rich set of the untrusted hypervisor services, KServ, that encapsulates
the rest of the KVM implementation, including the host Linux kernel. We verify SeKVM on a
multiprocessor machine model that accounts for shared multi-level page tables, tagged TLBs, and
writeback caches. We prove that the multiprocessor core refines a stack of security-preserving lay-
ers, such that the top layer specifies the entire system by its functional behavior over its machine
state. We then use KCore’s top layer specification to prove that VM confidentiality and integrity are
protected for any KServ implementation interacting with KCore, thereby proving that the security
guarantees hold for the entire SeKVM hypervisor.
In this chapter, we first present the effort required to retrofit the Arm implementation of KVM
into SeKVM. We then detail the functionality supported by SeKVM and the principles adopted
by the SeKVM implementation to simplify verification. Next, we discuss the verification effort
of SeKVM and the bugs that we discovered in our initial implementation. Finally, we present a
performance evaluation of multiple versions of SeKVM, and an evaluation of practical attacks.
4.1 Retrofitting KVM on Arm
4.1.1 SeKVM Retrofitting Effort and KServ Modifications
Retrofitting Effort. We use microverification to retrofit KVM/ARM [29, 30] into SeKVM, given
Arm’s increasing popularity in server systems [31, 32, 33]. Table 4.1 shows the effort required for
retrofitting mainline KVM in Linux 4.18, measured by LOC in C and assembly. Upon retrofitting,
Figure 4.1: Microverification of the Linux KVM Hypervisor
Retrofitting Component       LOC
QEMU additions               70
KVM changes in KServ         1.5K
HACL in KCore                10.1K
KVM C in KCore               0.2K
KVM assembly in KCore        0.3K
Other C in KCore             3.2K
Other assembly in KCore      0.1K
Total                        15.5K
Table 4.1: SeKVM Retrofitting Effort in LOC
SeKVM's KCore ends up consisting of 3.8K LOC (3.4K LOC in C and 400 LOC in assembly), of
which 0.5K LOC were in the existing KVM code that we verified. In addition, 10.1K LOC were
added for the implementation of Ed25519 and AES in the ported HACL* library. 1.5K LOC were
modified in existing KVM code, a tiny portion of the codebase, such as adding calls to KCore
hypercalls. 70 LOC were also added to QEMU to support secure VM boot and VM migration.
We also retrofitted and verified various other versions of KVM in Linux 4.20, 5.0, 5.1, 5.2, 5.3,
5.4, and 5.5, which involved reusing much of the same code required for the 4.18 Linux kernel. For
instance, less than 100 LOC needed to be changed in KServ going from Linux 4.18 to 5.4, mostly
to support installing and initializing KCore on a different codebase before KCore starts running in
EL2. No code changes were required in KCore in going from Linux 4.18 to all other versions. The
initial retrofit for KVM in Linux 4.18 took one person-year. The port from KVM in Linux 4.18
to another kernel version took less than one person-month. These results indicate that the changes
needed to retrofit a widely-used, commodity hypervisor, so it can be verified and integrated with
multiple versions of a commodity host kernel, were modest overall.
KServ Modifications. We modified KServ in KVM to support SeKVM. We categorize the re-
quired modifications as follows. First, we updated Linux’s linker script to reserve memory regions
private to KCore at the end of the kernel data section to accommodate the page table pools, KCore’s
private metadata, and the intermediate state structures. The boot loader on some Arm hardware
loads the device tree to a fixed memory location that overlaps with KCore’s reserved memory re-
gion. To reconcile the conflict, we allocated a memory buffer in the data section for storing the
overlapped device tree. Second, we modified KServ to initialize KCore’s metadata during boot.
Third, we updated KServ’s code that is in charge of building page table mappings to allocate the
EL2 stage 1 page table from KCore’s private memory pool. Fourth, we updated KServ to map
KCore’s metadata, the intermediate state structures, and all physical memory to the EL2 stage 1
page table. We used 2MB mappings in the EL2 stage 1 page table to map the physical memory,
reducing the amount of page tables required to fulfill the mappings. Fifth, we changed KServ to
allocate large EL2 stack frames for each CPU to support HACL*. HACL* uses a large local array
from the stack in its Ed25519 implementation. Sixth, we instrumented KServ with hypercalls to
support SeKVM. Specifically, we add hypercalls to KServ to install KCore, verify boot images,
safely boot VMs, and import and export encrypted VM data. Finally, we changed the SMMU
driver in KServ to use the IOMMU OPS API.
4.1.2 Virtualization Feature Support
Table 4.2 compares features provided by commodity hypervisors with the SeKVM implementation.
It shows that SeKVM can improve the overall security of KVM without compromising its
hypervisor features. Four KVM features
are not yet fully implemented in SeKVM, namely same page merging in Linux (KSM), swapping,
VM live migration, and checkpoint/restart. These features require additional changes to QEMU
and KServ. For example, appropriate GET VM STATE hypercalls need to be made in KServ to
export and import encrypted VM data for these features.
Table 4.2: Hypervisor Feature Support Comparison for Xen, KVM, and SeKVM (features compared include Secure Boot, Secure VM Boot, and VM Symmetric Multiprocessing (SMP))
Table 4.4: Proof Effort for SeKVM’s Security Proofs in LOC
primitives used by higher layers were passed through to those layers, then verified as part of each
layer. We did not link HACL’s F* proofs with our Coq proofs, or our Coq proofs for C code with
those for Arm assembly code. The latter requires a verified compiler for Arm multiprocessor code;
no such compiler exists. No changes were required to the proofs used to verify KVM in different
Linux versions.
Table 4.4 shows the verification effort for SeKVM’s security properties, measured by LOC in
Coq. The security proofs, including the invariant and noninterference proofs, consist of 4.8K LOC.
1.1K LOC were used to verify the isolation invariants mentioned in Section 3.3.2 for the MMU
and SMMU page tables. The rest of the 3.7K LOC were noninterference proofs for KCore’s top-
level primitives; for example, these proofs involved proving state indistinguishability with respect
to caches.
Among the 4.8K LOC required for the security proofs, 0.4K LOC were needed for defining
the PDLs and auxiliary lemmas we mentioned in Section 3.3.2 for the noninterference proof. The
PDLs and lemmas specify the desired security properties of SeKVM and therefore have to be
trusted. Compared with the rest of the proof effort, they are kept simple because their definition
is orthogonal to KCore’s concrete implementation. Instead, we construct the PDLs and lemmas
straightforwardly over the abstract machine state according to SeKVM’s security policies (See
Section 1.4). For example, consider the memory isolation policy as follows: for a given principal
p, for all memory owned by p, the memory contents remain indistinguishable between a pair of
executions before and after p takes an active step. Our Coq implementation for the policy queries
the S2Page metadata to retrieve the set of memory owned by p for the PDL by comparing the owner
in each of the S2Page against p. The implementation does not depend on, or require knowledge of,
the actual KCore implementation that manages S2Page.
The Coq development effort for KCore’s functional correctness and security proofs took two
person-years. These results show that microverification of a commodity hypervisor can be accom-
plished with modest proof effort.
4.2.2 Bugs Found During Verification.
While verifying KCore, we found various bugs in our initial implementation. Most bugs were
discovered as part of our noninterference proofs, demonstrating a limitation of verification ap-
proaches that only prove functional correctness via refinement alone: the high-level specifications
may themselves be insecure. In other words, these bugs were not detected by just verifying that
the implementation satisfies its specification, but by ensuring that the specification guarantees the
desired security properties of the system.
Overwrite page table mapping. KCore initially did not check if a gfn was mapped before
updating a VM’s stage 2 page tables, making it possible to overwrite existing mappings. For
example, suppose two VCPUs of a VM trap upon accessing the same unmapped gfn. Since KCore
updates a VM’s stage 2 page table whenever a VCPU traps on accessing unmapped memory, the
same page table entry will be updated twice, the latter replacing the former. A compromised KServ
could leverage this bug and allocate two different physical pages, breaking VM data integrity.
We fixed this bug by changing KCore to update stage 2 page tables only when a mapping was
previously empty.
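The fix can be sketched as a check-before-write. This is a minimal sketch with hypothetical names and a flat single-level table; in KCore the check runs under the per-VM page table lock.

```c
#include <assert.h>

#define NPT_SIZE 16

static unsigned long npt[NPT_SIZE];  /* 0 means unmapped */

/* Install gfn -> pfn only if the entry is currently empty, so
   two VCPUs faulting on the same gfn cannot install two
   different physical pages. Returns 1 if the mapping was
   installed, 0 if a mapping already existed. */
int map_if_unmapped(unsigned long gfn, unsigned long pfn) {
    unsigned long *pte = &npt[gfn % NPT_SIZE];
    if (*pte != 0)
        return 0;   /* keep the existing mapping */
    *pte = pfn;
    return 1;
}
```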
Huge page ownership. When KServ allocated a 2MB page for a VM, KCore initially only
validated the ownership of the first 4KB page rather than all the 512 4KB pages, leaving a loophole
for KServ to access VM memory. We fixed this bug by accounting for this edge case in our
validation logic.
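A minimal sketch of the corrected validation, with hypothetical names and a flat ownership array: before accepting a 2MB block, every one of the 512 covered 4KB pages must have the expected owner, not just the first.

```c
#include <assert.h>

#define PAGES 1024
#define KSERV 0    /* owner id for KServ in this sketch */

static int owner[PAGES];  /* pfn -> owner vmid, 0-initialized */

/* Validate that all 512 4KB pages of a 2MB block starting at
   first_pfn are owned by expected_owner. Returns 1 if the
   whole block may be assigned, 0 otherwise. */
int can_assign_2mb(unsigned long first_pfn, int expected_owner) {
    for (unsigned long i = 0; i < 512; i++) {
        if (first_pfn + i >= PAGES)
            return 0;  /* block extends past tracked memory */
        if (owner[first_pfn + i] != expected_owner)
            return 0;
    }
    return 1;
}
```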
No SMMU TLB flush after unmapping. We found a TLB management bug in which SeKVM
did not flush the SMMU TLB after unmapping a page from the SMMU page tables. We fixed the
bug in KCore by adding a SMMU TLB flush after the unmap.
Page table update race. When proving invariants for page ownership used in the noninterference
proofs, we identified a race condition in stage 2 page table updates. When allocating a
physical page to a VM, KCore removes it from KServ’s page table, assigns ownership of the page
to the VM, then maps it in the VM’s page table. However, if KCore is processing a KServ’s stage
2 page fault on another CPU, it could check the ownership of the same page before it was assigned
to the VM, think it was not assigned to any VM, and map the page in KServ’s page table. This
race could lead to both KServ and the VM having a memory mapping to the same physical page,
violating VM memory isolation. We fixed this bug by expanding the critical section and holding
the S2Page array lock, not just while checking and assigning ownership of the page, but until the
page table mapping is completed.
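The shape of the fix can be sketched in C with hypothetical names. A simple flag stands in for the S2Page array lock (KCore uses a real multiprocessor lock): the ownership change and the map update happen inside one critical section, so a concurrent KServ fault handler can never observe the page as unowned after it has been promised to a VM.

```c
#include <assert.h>

#define PAGES 8
#define KSERV 0

static int s2pg_locked = 0;
static int owner[PAGES];        /* pfn -> owner vmid (0 = KServ) */
static int kserv_mapped[PAGES]; /* pfn present in KServ's page table? */

static void lock_s2pg(void)   { assert(!s2pg_locked); s2pg_locked = 1; }
static void unlock_s2pg(void) { s2pg_locked = 0; }

/* Transfer pfn to vm: the check, the ownership change, and the
   unmap from KServ all sit in one critical section (the fix). */
void assign_page_to_vm(int pfn, int vm) {
    lock_s2pg();
    if (owner[pfn] == KSERV) {
        owner[pfn] = vm;
        kserv_mapped[pfn] = 0;  /* remove KServ's mapping */
    }
    unlock_s2pg();
}

/* KServ stage 2 fault handler: may map only KServ-owned pages,
   and checks ownership under the same lock. */
int handle_kserv_fault(int pfn) {
    int mapped = 0;
    lock_s2pg();
    if (owner[pfn] == KSERV) {
        kserv_mapped[pfn] = 1;
        mapped = 1;
    }
    unlock_s2pg();
    return mapped;
}
```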
Multiple I/O devices using same physical page. KCore initially did not manage memory
ownership correctly when a physical page was mapped to multiple KServ SMMU page tables,
with each page table controlling DMA access for a different I/O device, allowing KServ devices to
access memory already assigned to VMs. We fixed this bug by having KCore only map a physical
page to a VM’s stage 2 or SMMU page tables when it is not already mapped to an SMMU page
table used by KServ’s devices.
SMMU static after VM boot. KCore initially did not ensure that mappings in SMMU page
tables remain static after VM boot. This could allow KServ to modify SMMU page table mappings
to compromise VM data. We fixed this bug by modifying KCore to check the state of the VM that
owned the device before updating its SMMU page tables, and only allow updates before VM boot.
No cache flush after loading VM boot images. We found a cache management bug in SeKVM
in which a VM boot image may be cached when loaded from the file system but not written back
to the main memory. As VMs are booted with paging and caching disabled, it is possible that the
VMs access the page content in memory, thereby not using the correct VM images. We fixed the
bug in KCore by flushing the corresponding cache lines for memory that contain the pre-loaded
VM image before booting the VM, ensuring the use of the correct VM image loaded in memory.
4.3 Evaluation
4.3.1 Experimental Setup
We evaluate the performance of unmodified KVM versus the verified SeKVM implementa-
tion across different software and hardware configurations. We ran KVM and SeKVM in both
Linux 4.18 and 5.4 on two different Armv8 hardware configurations: (1) an HP Moonshot m400
server with an 8-core 64-bit Armv8-A 2.4 GHz Applied Micro Atlas SoC, 64 GB of RAM, a
120 GB SATA3 SSD, and a Dual-port Mellanox ConnectX-3 10GbE NIC, and (2) an AMD Seat-
tle (Rev.B0) server with an 8-core 64-bit Armv8-A 2 GHz AMD Opteron A1100 SoC, 16 GB
of RAM, a 512 GB SATA3 HDD for storage, an IOMMU (SMMU-401) to support control over
DMA devices and direct device assignment, and an AMD XGBE 10 GbE NIC. For client-server
workloads, clients ran on another m400 machine when using the m400 server, and ran on an x86
machine with 24 Intel Xeon CPU 2.20 GHz cores and 96 GB RAM when using the Seattle server,
in all cases connected via 10 GbE.
We used different software configurations across the servers to demonstrate the performance
of the verified KVM across multiple software and VM configurations. We used Ubuntu 18.04
and QEMU 3.0 for the m400 server and its VMs, Ubuntu 16.04 and QEMU 2.3.50 for the Seattle
server and its VMs. Furthermore, we used small and large SMP VM configurations, the former
on the m400 server with 2 CPUs and 256 MB RAM and the latter on the Seattle server with 4
CPUs and 12 GB of RAM. A smaller VM configuration was also used in part to show results for
running many SMP VM instances given the RAM limits of the m400 server. We also measured
performance natively on the servers with the host OS capped at using the same number of CPUs
and amount of RAM as the respective VM configuration. KVM was configured with its standard
vhost virtio network, and with cache=none for its virtual block storage devices [82, 83, 84]. All
VMs used paravirtualized I/O, typical of cloud infrastructure deployments. For the single VM
measurements, we pinned each VCPU to a specific physical CPU and ensured that no other work
was scheduled on that CPU [80, 61, 85, 86]. To quantify the hypervisor's ability in scheduling
multiple VMs, we did not pin VCPUs for VMs in our multi-VM measurements.

Hypercall: Transition from the VM to the hypervisor OS and return to the VM without doing any work in the hypervisor. Measures bidirectional base transition cost of hypervisor operations.
I/O Kernel: Trap from the VM to the emulated interrupt controller in the hypervisor OS kernel, and then return to the VM. Measures a frequent operation for many device drivers and a baseline for accessing I/O devices supported by the hypervisor OS kernel.
I/O User: Trap from the VM to the emulated UART in QEMU and then return to the VM. Measures base cost of operations that access I/O devices emulated in the hypervisor OS user space.
Virtual IPI: Issue a virtual IPI from a VCPU to another VCPU running on a different PCPU. Measures time between sending the virtual IPI until the receiving VCPU handles it.

Table 4.5: Microbenchmarks
4.3.2 Microbenchmark Results
We measured the set of microbenchmarks listed in Table 4.5 on SeKVM. Table 4.6 shows the
results measured in cycles for the unmodified KVM and verified KVM in Linux 4.18 for each
hardware configuration. Verified KVM overhead compared to unmodified KVM is much higher
on the m400 server versus the Seattle server because the m400 CPUs have a tiny TLB [87]
compared to the Seattle CPUs. Although KCore supports huge pages for stage 2 page tables for
VMs, the current implementation maps regular 4 KB pages in KServ's stage 2 page tables, so
microbenchmark workloads that spend most of their time running in KServ require more TLB
entries to cache address translations, increasing TLB capacity misses. Newer Arm CPUs have
more reasonable TLB sizes, similar to or greater than the Seattle CPUs, so the Seattle
measurements are more reflective of typical Arm server performance.

Kernbench: Compilation of the Linux kernel using allnoconfig for Arm; m400 compiled v4.18 with GCC 7.5.0, Seattle compiled v4.9 with GCC 5.4.0.
Hackbench: hackbench [88] using Unix domain sockets and process groups running in 500 loops; m400 used 20 groups, Seattle used 100 groups.
Netperf: netperf v2.6.0 [89] running netserver on the server and the client with its default parameters in three modes: TCP_STREAM (throughput), TCP_MAERTS (throughput), and TCP_RR (latency).
Apache: Apache server handling concurrent requests from a remote ApacheBench [90] v2.3 client, serving the index.html of the GCC manual; m400 used v2.4.29 serving the 7.5.0 manual, Seattle used v2.4.18 serving the 5.4.0 manual.
Memcached: memcached v1.4.25 using the memtier benchmark v1.2.3 with its default parameters.
MySQL: MySQL v14.14 (distrib 5.7.26) running SysBench v0.4.12 using the default configuration with 200 parallel transactions.
MongoDB: MongoDB server handling requests from a remote YCSB [91] v0.17.0 client running workload A with 16 concurrent threads; m400 used v3.6.3 with recordcount=10000 and operationcount=50000, Seattle used v4.0.20 with recordcount=500000 and operationcount=100000.
Redis: Redis server handling requests from a remote YCSB v0.17.0 client running workload A; m400 used v4.0.9, Seattle used v3.0.6.

Table 4.7: Application Benchmarks

For Seattle, verified KVM only incurs 17% to 28%
overhead over KVM, with the added benefit of verified VM protection. On Seattle, the overhead is
highest for the simplest operations because the relatively fixed cost of KCore protecting VM data
is a higher percentage of the work that must be done. These results provide a conservative measure
of overhead since real hypervisor operations will invoke actual KServ functions, not just measure
overhead for a null hypercall. The results show that the verified implementation introduces
modest overhead compared to the unverified implementation.
4.3.3 Application Workload Results
Single-VM Performance. We evaluated performance using real application workloads listed in
Table 4.7. To evaluate VM performance with end-to-end I/O protection, full disk encryption (FDE)
was enabled for Seattle VMs but not m400 VMs, given the limited memory assigned to m400 VMs.
We compared unmodified and verified KVM for all application benchmarks. In all cases, even when running 32 concurrent VMs,
verified KVM has no worse than 10% overhead compared to unmodified KVM, demonstrating that
verified KVM has performance scalability similar to that of unmodified KVM. In other words, the use of
locks in verified KVM to protect shared memory accesses does not adversely affect its performance
scalability in running multiple multiprocessor VMs on Arm relaxed memory hardware.
4.3.4 Evaluation of Practical Attacks
We evaluated SeKVM’s effectiveness against a compromised KServ by analyzing CVEs and
identifying the cases where SeKVM protects VM data despite any compromise, assuming an equiv-
alent implementation of SeKVM for x86 platforms. We analyzed CVEs related to Linux/KVM,
which are listed in Tables 4.8 and 4.9. The CVEs consider two cases: a malicious VM that ex-
ploits KVM functions supported by KServ, and an unprivileged host user who exploits bugs in
Linux/KVM. Among the selected CVEs, 16 are x86-specific; one is specific to Arm, while the rest
are independent of architecture. An attacker's goal is to exploit these CVEs to obtain KServ
privileges and compromise VM data. The CVEs related to our threat model could result in
information leakage, privilege escalation, code execution, and memory corruption in Linux/KVM.

Each entry lists whether KVM and SeKVM protect VM data against the bug.

CVE-2015-4036 (KVM: No, SeKVM: Yes): Memory Corruption: Array index error in KServ.
CVE-2013-0311 (KVM: No, SeKVM: Yes): Privilege Escalation: Improper handling of descriptors in vhost driver.
CVE-2017-17741 (KVM: No, SeKVM: Yes): Info Leakage: Stack out-of-bounds read in KServ.
CVE-2010-0297 (KVM: No, SeKVM: Yes): Code Execution: Buffer overflow in I/O virtualization code.
CVE-2014-0049 (KVM: No, SeKVM: Yes): Code Execution: Buffer overflow in I/O virtualization code.
CVE-2013-1798 (KVM: No, SeKVM: Yes): Info Leakage: Improper handling of invalid combination of operations for virtual IOAPIC.
CVE-2016-4440 (KVM: No, SeKVM: Yes): Code Execution: Mishandling of virtual APIC state.
CVE-2016-9777 (KVM: No, SeKVM: Yes): Privilege Escalation: Out-of-bounds array access using VCPU index in interrupt virtualization code.
CVE-2015-3456 (KVM: No, SeKVM: Yes): Code Execution: Memory corruption in virtual floppy driver allows VM user to execute arbitrary code in KServ.
CVE-2011-2212 (KVM: No, SeKVM: Yes): Privilege Escalation: Buffer overflow in the virtio subsystem allows guest to gain privileges to the host.
CVE-2011-1750 (KVM: No, SeKVM: Yes): Privilege Escalation: Buffer overflow in the virtio subsystem allows guest to gain privileges to the host.
CVE-2015-3214 (KVM: No, SeKVM: Yes): Code Execution: Out-of-bounds memory access in QEMU leads to memory corruption.
CVE-2012-0029 (KVM: No, SeKVM: Yes): Code Execution: Buffer overflow allows VM users to execute arbitrary code in QEMU.
CVE-2017-1000407 (KVM: No, SeKVM: No): Denial-of-Service: VMs crash KServ by flooding the I/O port with write requests.
CVE-2017-1000252 (KVM: No, SeKVM: No): Denial-of-Service: Out-of-bounds value causes assertion failure and hypervisor crash.
CVE-2014-7842 (KVM: No, SeKVM: No): Denial-of-Service: Bug in KVM allows guest users to crash their own OS.
CVE-2018-1087 (KVM: No, SeKVM: No): Privilege Escalation: Improper handling of exception allows guest users to escalate their privileges in their own OS.

Table 4.8: Selected Set of Analyzed CVEs - from VM

While KVM
does not protect VM data against any of these compromises, SeKVM protects against all of them.
SeKVM does not guarantee availability and cannot protect against CVEs that allow VMs or host
users to cause denial of service in KServ. Vulnerabilities that allow unprivileged guest users to
attack their own VMs like CVE-2014-7842 and CVE-2018-1087 are unrelated to our threat model.
We also executed attacks representative of information leakage to show that SeKVM protects
VM data even if an attacker has full control of KServ. First, we simulated an attacker trying to read
or modify VMs’ memory pages. We added a hook to KVM, which modifies a page that a targeted
gVA maps to. As expected, the compromised mainline KVM successfully modified the VM page.
In SeKVM, the same attack causes a trap to KCore, which rejects the invalid memory access.
CVE-2009-3234 (KVM: No, SeKVM: Yes): Privilege Escalation: Kernel stack buffer overflow resulting in ret2usr [93].
CVE-2010-2959 (KVM: No, SeKVM: Yes): Code Execution: Integer overflow resulting in function pointer overwrite.
CVE-2010-4258 (KVM: No, SeKVM: Yes): Privilege Escalation: Improper handling of get_fs value resulting in kernel memory overwrite.
CVE-2009-3640 (KVM: No, SeKVM: Yes): Privilege Escalation: Improper handling of APIC state in KServ.
CVE-2009-4004 (KVM: No, SeKVM: Yes): Privilege Escalation: Buffer overflow in KServ.
CVE-2013-1943 (KVM: No, SeKVM: Yes): Privilege Escalation, Info Leakage: Mishandling of memory slot allocation allows host users to access KServ memory.
CVE-2016-10150 (KVM: No, SeKVM: Yes): Privilege Escalation: Use-after-free in KServ.
CVE-2013-4587 (KVM: No, SeKVM: Yes): Privilege Escalation: Array index error in KServ.
CVE-2018-18021 (KVM: No, SeKVM: Yes): Privilege Escalation: Mishandling of VM register state allows host users to redirect KServ execution.
CVE-2016-9756 (KVM: No, SeKVM: Yes): Info Leakage: Improper initialization in code segment resulting in information leakage in KServ stack.
CVE-2019-14821 (KVM: No, SeKVM: Yes): Privilege Escalation: Host users cause out-of-bounds memory access in KServ.
CVE-2019-6974 (KVM: No, SeKVM: Yes): Privilege Escalation: Use-after-free in KServ.
CVE-2013-6368 (KVM: No, SeKVM: Yes): Privilege Escalation: Mishandling of APIC state in KServ.
CVE-2015-4692 (KVM: No, SeKVM: Yes): Memory Corruption: Mishandling of APIC state in KServ.
CVE-2013-4592 (KVM: No, SeKVM: No): Denial-of-Service: Host users cause memory leak in KServ.

Table 4.9: Selected Set of Analyzed CVEs - from Host User

Second, we simulated a host that tries to tamper with a VM's nested page table by redirecting a
gPA's NPT mapping to host-owned pages. This is in contrast to the prior attack of modifying VM
pages, but shares the same goal of accessing VM data in memory. We added a hook to the stage
2 page fault handler in KVM/ARM; the hook allocates a new zero page in the host OS’s address
space, which could contain arbitrary code or data in a real attack. The hook associates a range of a
VM’s gPAs with this zero page. As expected, this attack succeeds in KVM but fails in SeKVM.
In SeKVM, the attacker in KServ has no access to the VM's stage 2 page table walked by the MMU.
KCore never uses the page tables maintained by the untrusted KServ.
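Both attacks fail for the same reason: KCore consults its page ownership tracking before installing any mapping requested on behalf of KServ. A minimal sketch of such a guard, with invented structures and names rather than KCore's actual code, might look like:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PFN     32
#define OWNER_NONE  0
#define OWNER_KSERV 1
/* owners >= 2 are VM identifiers */

static int owner[MAX_PFN];   /* illustrative ownership table */

/* KCore builds KServ's stage 2 page table itself: a request to map a
 * pfn into KServ's address space succeeds only for free pages or pages
 * KServ already owns, never for VM-owned pages. */
static bool kserv_map_page(size_t pfn)
{
    if (pfn >= MAX_PFN || owner[pfn] >= 2)
        return false;            /* VM-owned: reject the mapping */
    owner[pfn] = OWNER_KSERV;
    return true;
}
```

Because the MMU only ever walks page tables KCore built, a compromised KServ cannot bypass this guard by editing its own tables.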
4.4 Summary
In this chapter, we have presented the effort required to retrofit KVM into SeKVM and verify
its TCB as well as its overall security properties. SeKVM is the first commodity multiprocessor
hypervisor that has been formally verified. We achieved this through microverification, retrofitting
KVM with a small core that can enforce data access controls on the rest of KVM. We showed
that microverification required only modest KVM modifications, yet resulted in a verified
hypervisor that retains KVM's commodity features, including support for multiprocessor VMs,
paravirtualized and passthrough I/O devices with IOMMU protection against direct memory ac-
cess (DMA) attacks, and compatibility with Linux device drivers for broad Arm hardware support.
We have formally verified the correctness of SeKVM’s TCB and the security guarantees of the en-
tire hypervisor. We have retrofitted and verified multiple versions of KVM using Coq. Finally, we
showed that SeKVM incurs modest overhead compared to unmodified KVM for real application
workloads and similar scalability when running multiple VMs.
My research has demonstrated that modest changes to a commodity system like the Linux
KVM hypervisor can reduce the required proof effort, making it possible to verify the security
properties of the entire hypervisor, such as the protection of VM confidentiality and integrity,
while retaining KVM’s commodity feature set and performance. Our work is the first machine-
checked security proof for a commodity multiprocessor hypervisor. Our work is also the first
machine-checked correctness proof of a multiprocessor system with shared page tables.
6.2 Future Work
My research has investigated using microverification to verify that the commodity KVM hy-
pervisor protects VM confidentiality and integrity on Arm multiprocessor hardware. I believe there
are many other opportunities for future work to apply microverification to various other systems
for different deployment scenarios to verify their security properties.
One area of future work is to explore how to use microverification to verify whether other
commodity hypervisors protect VM confidentiality and integrity. For example, to verify Xen, one
could use the HypSec design to retrofit Xen into a KServ that includes Dom0’s kernel and Xen’s
codebase that provides resource management, scheduling, and interrupt virtualization, and a KCore
that could include the rest of Xen to provide CPU virtualization and page table management while
protecting VM data. One could then reason over KCore to prove Xen’s protection of VM data.
As many of the commodity hypervisors are deployed to x86 based server hardware, HypSec could
leverage x86’s virtualization hardware extensions to simplify the retrofit required for x86 hyper-
visors. For example, one could leverage Intel x86’s Virtual Machine Extensions (VMX) [157] to
deprivilege KServ in the VMX non-root operation and run KCore in the VMX root operation to
protect VM data. To interpose on VM exits, KCore could manage the x86 Virtual Machine Con-
trol Structure (VMCS) to trap VM exits to itself, so it could protect VM data before switching to
KServ. Hardware extensions on x86, including VMX, provide support for context switching VM
CPU state. Thus, multiplexing the hardware between VMs and KServ on x86 hardware should not
incur significant VM performance overhead. To protect VM memory, HypSec could use VMX's
NPTs, namely Extended Page Tables (EPTs), and the IOMMU. As presented in Section 4.3, using NPTs
on x86 server hardware with reasonable TLB capacity should result in negligible VM performance
overhead. Finally, to verify x86 hypervisors, one could potentially model the multiprocessor ex-
ecution and memory management features of the x86 hardware similar to how we model these
respective features for SeKVM’s Arm based hardware. However, it may require additional effort
to model x86’s systems registers and virtualization extensions and detailed semantics for the x86
instructions used by the hypervisor to manage these hardware features.
Another area of future work is to explore how to use microverification to verify that hypervisors
guarantee other security properties, such as availability. SeKVM focuses only on data con-
fidentiality and integrity; it makes no guarantees about availability. Confidentiality and integrity
may be primary concerns in the context of cloud computing, but in other contexts, availability may
be of much greater importance. For example, virtualization has been increasingly adopted by secu-
rity critical systems such as automotive systems to isolate security critical components into VMs.
Unlike the cloud deployment scenario, in which administrators can simply terminate malfunction-
ing VMs to prevent them from affecting others, guaranteeing VM availability in security critical
systems is of key importance because the whole system could fail if a given VM component mal-
functions. To verify that the hypervisor protects VM availability, one could potentially incorporate
a hypervisor core that takes charge of scheduling VMs, allocating VM resources, and confining
faulted hypervisor components or VMs. One could then first show that the core implementation
refines its specification and prove the availability guarantee over the specification.
Moreover, various systems rely on a full commodity hypervisor to protect software compo-
nents running within VMs, either to protect the integrity of the guest kernel [122, 123, 125] or
applications in the VMs from a malicious guest kernel [60, 64]. Attackers who exploit vulnera-
bilities in the large hypervisor codebase could compromise the hypervisor’s security guarantees
to VMs. Microverification can be potentially applied to these systems to improve their security.
For instance, to prove that the hypervisor protects the confidentiality and integrity of applications
within a given VM against an untrusted guest OS, one could first leverage HypSec to retrofit the
hypervisor, and extend the retrofitted hypervisor core to protect the VM applications, then prove
noninterference over the core to demonstrate that the hypervisor enforces its security guarantees.
For example, to protect the confidentiality and integrity of the guest applications, the core must
mediate all interactions between the applications and the OS in the VM, such as system calls, page
faults, and interrupts from the userspace, at the guest kernel’s interface. Further investigation is
required to design the hypervisor core to support the commodity OS functionality while protecting
application data.
Exploring microverification of other commodity systems, such as commodity OS kernels, is
another interesting direction for further research. For example, microverification could be poten-
tially applied to verify the integrity of a given OS kernel, guaranteeing that the kernel is immune to
code injection attacks. One could potentially decompose the OS kernel into a large set of untrusted
kernel services and a small TCB. Similar to how KCore manages KServ’s NPT, the TCB could
manage the NPT for the untrusted kernel services using an identity map, and configure the mem-
ory access attributes in the page table entries to protect kernel memory. For instance, to prevent the
attackers from modifying existing kernel memory or page tables to load and execute arbitrary code,
the TCB could set the NPT entries that map to the kernel text section and page tables read-only, and
set the entries that map to the kernel data section non-executable. To verify kernel integrity, one
could prove noninterference and show that the contents of the executable kernel memory regions
remain unchanged throughout the kernel execution. However, a strict noninterference guarantee
may be incompatible with the commodity kernel feature set. For instance, commodity OS kernels
support dynamic kernel module loading, which requires updating the kernel page table to map to
the newly allocated executable memory for loading the kernel modules at runtime. Further investi-
gation is needed in designing and proving the TCB’s security policies for ensuring kernel integrity
protection while retaining the commodity OS kernel’s functionality and performance. An alterna-
tive avenue to explore microverification is to prove a given OS kernel protects the confidentiality
and integrity of user data in its hosted applications or containers. One could explore relying on a
small TCB that interposes at the kernel interface to applications or containers to protect the user
data, similar to how the hypervisor core could protect applications from the kernel in a VM as we
discussed earlier, then reasoning over the TCB to prove the desired security properties.
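The kernel-integrity policy sketched earlier, mapping the kernel text and page tables read-only and the data section non-executable in the NPT, could be modeled as follows; the attribute bits and names are invented for illustration, not a real Arm descriptor format:

```c
#include <stdint.h>

/* Illustrative NPT entry attribute bits (not a real descriptor layout). */
#define PTE_READONLY (1u << 0)
#define PTE_NOEXEC   (1u << 1)

enum region { REGION_TEXT, REGION_PAGE_TABLES, REGION_DATA };

/* The TCB picks attributes for an identity-mapped NPT entry: text and
 * page tables are mapped read-only so existing code cannot be modified,
 * and data is mapped non-executable so injected code cannot run. */
static uint32_t npt_attrs_for(enum region r)
{
    switch (r) {
    case REGION_TEXT:
    case REGION_PAGE_TABLES:
        return PTE_READONLY;
    case REGION_DATA:
        return PTE_NOEXEC;
    }
    return PTE_READONLY | PTE_NOEXEC;  /* default: most restrictive */
}
```

Dynamic module loading would require the TCB to relax this policy for specific pages at runtime, which is exactly the design tension the text above identifies.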
It is potentially possible to further reduce the efforts required for verifying commodity systems
based on microverification. For instance, to protect guest applications from an untrusted guest OS,
one could extend SeKVM’s Arm implementation to trap the exceptions from EL0 to EL2, allowing
KCore to protect application data before entering the guest kernel in EL1. KCore could potentially
program the Trap General Exceptions (TGE) bit provided by Arm to configure the CPU to route
all exceptions from EL0 directly to EL2. However, setting this bit also disables virtual memory in
EL0, which is problematic for real applications. It remains to be explored whether existing hardware features
on Arm or novel software design could be employed to simplify the retrofit needed for interposing
on EL0 exceptions. Furthermore, although we have built tools to automate many parts of the proofs
for SeKVM, it still requires manual effort to write formal specifications in Coq for the assembly
code and C functions that include complex program logic, such as loops, and to solve the proof goals
defined over the specifications. To further simplify the development and maintenance of formally
verified software systems, a promising direction of research is to explore technologies that support
programming verified software systems directly at scale. This could potentially be accomplished
by incorporating a novel proof framework that fully automates verification with a verified compiler
to produce trusted binaries from the formally verified source code.
Finally, although microverification enables formal verification of security properties for com-
modity systems, proving an existing commodity system is functionally correct in its entirety re-
mains a grand challenge. Further research in modularizing the monolithic codebase of commodity
systems into smaller and verifiable components, then verifying the smaller components and linking
their proofs at scale, could enable the first steps toward proving the functional correctness of entire
commodity systems.
References
[1] S. J. Vaughan-Nichols, “Hypervisors: The cloud’s potential security Achilles heel,” ZDNet, Mar. 2014.
[2] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, “KVM: The Linux Virtual Machine Monitor,” in Proceedings of the 2007 Ottawa Linux Symposium (OLS 2007), Ottawa, ON, Canada, Jun. 2007.
[3] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, “Xen and the Art of Virtualization,” in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003), Bolton Landing, NY, Oct. 2003, pp. 164–177.
[16] Confidential VM and Compute Engine, https://cloud.google.com/compute/confidential-vm/docs/about-cvm, Google, May 2021.
[17] G. Klein, K. Elphinstone, G. Heiser, J. Andronick, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt, R. Kolanski, M. Norrish, T. Sewell, H. Tuch, and S. Winwood, “seL4: Formal Verification of an OS Kernel,” in Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP 2009), Big Sky, MT, Oct. 2009, pp. 207–220.
[18] R. Gu, Z. Shao, H. Chen, X. N. Wu, J. Kim, V. Sjöberg, and D. Costanzo, “CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels,” in Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016), Savannah, GA, Nov. 2016, pp. 653–669.
[19] A. Vasudevan, S. Chaki, L. Jia, J. McCune, J. Newsome, and A. Datta, “Design, Implementation and Verification of an eXtensible and Modular Hypervisor Framework,” in Proceedings of the 2013 IEEE Symposium on Security and Privacy (SP 2013), San Francisco, CA, May 2013, pp. 430–444.
[20] D. Costanzo, Z. Shao, and R. Gu, “End-to-End Verification of Information-Flow Security for C and Assembly Programs,” in Proceedings of the 37th ACM Conference on Programming Language Design and Implementation (PLDI 2016), Santa Barbara, CA, Jun. 2016, pp. 648–664.
[21] H. Sigurbjarnarson, L. Nelson, B. Castro-Karney, J. Bornholt, E. Torlak, and X. Wang, “Nickel: A Framework for Design and Verification of Information Flow Control Systems,” in Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2018), Carlsbad, CA, Oct. 2018, pp. 287–305.
[22] A. Ferraiuolo, A. Baumann, C. Hawblitzel, and B. Parno, “Komodo: Using verification to disentangle secure-enclave hardware from software,” in Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP 2017), Shanghai, China, Oct. 2017, pp. 287–305.
[23] L. Nelson, J. Bornholt, R. Gu, A. Baumann, E. Torlak, and X. Wang, “Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 2019), Huntsville, Ontario, Canada, Oct. 2019, pp. 225–242.
[24] T. Murray, D. Matichuk, M. Brassil, P. Gammie, and G. Klein, “Noninterference for Operating System Kernels,” in Proceedings of the 2nd International Conference on Certified Programs and Proofs (CPP 2012), Kyoto, Japan, Dec. 2012, pp. 126–142.
[25] S.-W. Li, X. Li, R. Gu, J. Nieh, and J. Z. Hui, “A Secure and Formally Verified Linux KVM Hypervisor,” in Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP 2021), May 2021.
[26] J. Graham-Cumming and J. W. Sanders, “On the Refinement of Non-interference,” in Proceedings of Computer Security Foundations Workshop IV, Franconia, NH, Jun. 1991, pp. 35–42.
[27] D. Stefan, A. Russo, P. Buiras, A. Levy, J. C. Mitchell, and D. Mazieres, “Addressing Covert Termination and Timing Channels in Concurrent Information Flow Systems,” in Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming (ICFP 2012), ser. ACM SIGPLAN Notices, vol. 47, Sep. 2012, pp. 201–214.
[28] J. A. Goguen and J. Meseguer, “Unwinding and Inference Control,” in Proceedings of the 1984 IEEE Symposium on Security and Privacy (SP 1984), Oakland, CA, Apr. 1984, pp. 75–86.
[29] C. Dall and J. Nieh, “KVM/ARM: Experiences Building the Linux ARM Hypervisor,” Department of Computer Science, Columbia University, Technical Report CUCS-010-13, Jun. 2013.
[30] ——, “KVM/ARM: The Design and Implementation of the Linux ARM Hypervisor,” in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2014), Salt Lake City, UT, Mar. 2014, pp. 333–347.
[31] “Cloud companies consider Intel rivals after the discovery of microchip security flaws,” CNBC, Jan. 2018.
[32] C. Williams, “Microsoft: Can’t wait for ARM to power MOST of our cloud data centers! Take that, Intel! Ha! Ha!” The Register, Mar. 2017.
[33] Introducing Amazon EC2 A1 Instances Powered By New Arm-based AWS Graviton Processors, https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-amazon-ec2-a1-instances, Amazon Web Services, Nov. 2018.
[34] The Coq Proof Assistant, https://coq.inria.fr [Accessed: Dec 16, 2020].
[35] S. Landau, “Making Sense from Snowden: What’s Significant in the NSA Surveillance Revelations,” IEEE Security and Privacy, vol. 11, no. 4, pp. 54–63, Jul. 2013.
[37] Google, HTTPS encryption on the web – Google Transparency Report, https://transparencyreport.google.com/https/overview, Apr. 2018.
[38] Business Wire, Research and Markets: Global Encryption Software Market (Usage, Vertical and Geography) - Size, Global Trends, Company Profiles, Segmentation and Forecast, 2013 - 2020, https://www.businesswire.com/news/home/20150211006369/en/Research-Markets-Global-Encryption-Software-Market-Usage, Feb. 2015.
[39] J. H. Saltzer, D. P. Reed, and D. D. Clark, “End-to-end Arguments in System Design,” ACM Transactions on Computer Systems (TOCS), vol. 2, no. 4, pp. 277–288, Nov. 1984.
[40] “ARM Security Technology: Building a Secure System Using TrustZone Technology,” ARM Ltd., Whitepaper PRD29-GENC-009492C, Apr. 2009.
[41] International Organization for Standardization and International Electrotechnical Commission, ISO/IEC 11889-1:2015 - Information technology – Trusted platform module library, https://www.iso.org/standard/66510.html, Sep. 2016.
[42] J. A. Halderman, S. D. Schoen, N. Heninger, W. Clarkson, W. Paul, J. A. Calandrino, A. J. Feldman, J. Appelbaum, and E. W. Felten, “Lest We Remember: Cold Boot Attacks on Encryption Keys,” in Proceedings of the 17th USENIX Security Symposium (USENIX Security 2008), San Jose, CA, Jul. 2008, pp. 45–60.
[43] “Google Cloud Security and Compliance Whitepaper - How Google protects your data,” Google Cloud, pp. 6–7, https://static.googleusercontent.com/media/gsuite.google.com/en//files/google-apps-security-and-compliance-whitepaper.pdf [Accessed: Dec 16, 2020].
[44] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage, “Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-party Compute Clouds,” in Proceedings of the 2009 ACM Conference on Computer and Communications Security (CCS 2009), Chicago, IL, Nov. 2009, pp. 199–212.
[45] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Cross-VM Side Channels and Their Use to Extract Private Keys,” in Proceedings of the 2012 ACM Conference on Computer and Communications Security (CCS 2012), Raleigh, NC, Oct. 2012, pp. 305–316.
[46] G. Irazoqui, T. Eisenbarth, and B. Sunar, “S$A: A Shared Cache Attack That Works Across Cores and Defies VM Sandboxing – and Its Application to AES,” in Proceedings of the 2015 IEEE Symposium on Security and Privacy (SP 2015), San Jose, CA, May 2015, pp. 591–604.
[47] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Cross-Tenant Side-Channel Attacks in PaaS Clouds,” in Proceedings of the 2014 ACM Conference on Computer and Communications Security (CCS 2014), Scottsdale, AZ, Nov. 2014, pp. 990–1003.
[48] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, “Last-Level Cache Side-Channel Attacks Are Practical,” in Proceedings of the 2015 IEEE Symposium on Security and Privacy (SP 2015), San Jose, CA, May 2015, pp. 605–622.
[49] M. Backes, G. Doychev, and B. Kopf, “Preventing Side-Channel Leaks in Web Traffic: A Formal Approach,” in 20th Annual Network and Distributed System Security Symposium (NDSS 2013), San Diego, CA, Feb. 2013.
[50] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young, “Mach: A new kernel foundation for UNIX development,” in Proceedings of the Summer USENIX Conference (USENIX Summer 1986), Atlanta, GA, Jun. 1986, pp. 93–112.
[51] B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. E. Fiuczynski, D. Becker, C. Chambers, and S. Eggers, “Extensibility Safety and Performance in the SPIN Operating System,” in Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP 1995), Copper Mountain, CO, Dec. 1995, pp. 267–283.
[52] J. Liedtke, “On Micro-kernel Construction,” in Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP 1995), Copper Mountain, CO, Dec. 1995, pp. 237–250.
[53] ArchWiki, dm-crypt, https://wiki.archlinux.org/index.php/dm-crypt [Accessed: Jan 10, 2021].
[55] Amazon Web Services, Inc., AWS Key Management Service (KMS), https://aws.amazon.com/kms [Accessed: Jan 10, 2021].
[56] Microsoft Azure, Key Vault - Microsoft Azure, https://azure.microsoft.com/en-in/services/key-vault [Accessed: Jan 10, 2021].
[57] P. Stewin and I. Bystrov, “Understanding DMA Malware,” in Proceedings of the 9th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA 2012), Heraklion, Crete, Greece, Jul. 2013, pp. 21–41.
[58] C. A. Waldspurger, “Memory Resource Management in VMware ESX Server,” in Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI 2002), Boston, MA, Dec. 2002, pp. 181–194.
[59] K. Adams and O. Agesen, “A Comparison of Software and Hardware Techniques for x86 Virtualization,” in Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2006), San Jose, CA, Oct. 2006, pp. 2–13.
[60] X. Chen, T. Garfinkel, E. C. Lewis, P. Subrahmanyam, C. A. Waldspurger, D. Boneh, J. Dwoskin, and D. R. Ports, “Overshadow: A Virtualization-based Approach to Retrofitting Protection in Commodity Operating Systems,” in Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2008), Seattle, WA, Mar. 2008, pp. 2–13.
[61] J. T. Lim, C. Dall, S.-W. Li, J. Nieh, and M. Zyngier, “NEVE: Nested Virtualization Extensions for ARM,” in Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP 2017), Shanghai, China, Oct. 2017, pp. 201–217.
[62] J. Corbet, KAISER: hiding the kernel from user space, https://lwn.net/Articles/738975, Nov. 2017.
[64] O. S. Hofmann, S. Kim, A. M. Dunn, M. Z. Lee, and E. Witchel, “InkTag: Secure Applications on an Untrusted Operating System,” in Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2013), Houston, TX, Mar. 2013, pp. 265–278.
[65] ARM System Memory Management Unit Architecture Specification - SMMU architecture version 2.0, ARM Ltd., Jun. 2016.
[66] J.-K. Zinzindohoué, K. Bhargavan, J. Protzenko, and B. Beurdouche, “HACL*: A Verified Modern Cryptographic Library,” in Proceedings of the 2017 ACM Conference on Computer and Communications Security (CCS 2017), Dallas, TX, Oct. 2017, pp. 1789–1806.
[67] “ARM Power State Coordination Interface,” ARM Ltd., ARM DEN 0022D, Apr. 2017.
[68] R. Gu, Z. Shao, J. Kim, X. N. Wu, J. Koenig, V. Sjöberg, H. Chen, D. Costanzo, and T. Ramananandro, “Certified Concurrent Abstraction Layers,” in Proceedings of the 39th ACM Conference on Programming Language Design and Implementation (PLDI 2018), Philadelphia, PA, Jun. 2018, pp. 646–661.
[70] ARM Ltd., ARM CoreLink MMU-401 System Memory Management Unit Technical Refer-ence Manual, Jul. 2014.
[71] A. Sabelfeld and A. C. Myers, “A Model for Delimited Information Release,” in Proceed-ings of the 2nd International Symposium on Software Security (ISSS 2003), Tokyo, Japan,Nov. 2003, pp. 174–191.
[72] R. Gu, J. Koenig, T. Ramananandro, Z. Shao, X. N. Wu, S.-C. Weng, and H. Zhang, “DeepSpecifications and Certified Abstraction Layers,” in Proceedings of the 42nd ACM Sympo-sium on Principles of Programming Languages (POPL 2015), Mumbai, India, Jan. 2015,pp. 595–608.
[73] R. Keller, “Formal Verification of Parallel Programs,” Communications of the ACM, vol. 19,pp. 371–384, Jul. 1976.
[74] C. Jones, “Tentative Steps Toward a Development Method for Interfering Programs.,” ACMTransactions on Programming Languages and Systems (TOPLAS), vol. 5, pp. 596–619,Oct. 1983.
[75] N. Lynch and F. Vaandrager, “Forward and Backward Simulations,” Information and Com-putation, vol. 128, no. 1, 1–25, Jul. 1996.
[76] K. J. Biba, “Integrity Considerations for Secure Computer Systems,” MITRE, TechnicalReport MTR-3153, Jun. 1975.
[77] T. Murray, D. Matichuk, M. Brassil, P. Gammie, T. Bourke, S. Seefried, C. Lewis, X. Gao,and G. Klein, “SeL4: From General Purpose to a Proof of Information Flow Enforcement,”in Proceedings of the 2013 IEEE Symposium on Security and Privacy (SP 2013), SanFrancisco, CA, May 2013, pp. 415–429.
[78] R. Tao, J. Yao, S.-W. Li, X. Li, J. Nieh, and R. Gu, “Verifying a Multiprocessor Hyper-visor on Arm Relaxed Memory Hardware,” Department of Computer Science, ColumbiaUniversity, Technical Report CUCS-005-21, Jun. 2021.
[79] OP-TEE, Open Portable Trusted Execution Environment, https://www.op-tee.org/,[Accessed Jan 12, 2012].
[80] C. Dall, S.-W. Li, J. T. Lim, J. Nieh, and G. Koloventzos, “ARM Virtualization: Perfor-mance and Architectural Implications,” in Proceedings of the 43rd International Sympo-sium on Computer Architecture (ISCA 2016), Seoul, South Korea, Jun. 2016, pp. 304–316.
[81] C. Dall, S.-W. Li, and J. Nieh, “Optimizing the Design and Implementation of the LinuxARM Hypervisor,” in Proceedings of the 2017 USENIX Annual Technical Conference(USENIX ATC 2017), Santa Clara, CA, Jul. 2017, pp. 221–234.
[82] Tuning KVM, http://www.linux-kvm.org/page/Tuning_KVM [Accessed: Dec 16,2020].
[83] “Disk Cache Modes,” in SUSE Linux Enterprise Server 12 SP5 Virtualization Guide,SUSE, Dec. 2020, ch. 15.
[84] S. Hajnoczi, “An Updated Overview of the QEMU Storage Stack,” in LinuxCon Japan2011, Yokohama, Japan, Jun. 2011.
[85] C. Dall, S.-W. Li, J. T. Lim, and J. Nieh, “ARM Virtualization: Performance and Archi-tectural Implications,” ACM SIGOPS Operating Systems Review, vol. 52, no. 1, pp. 45–56,Jul. 2018.
[86] J. T. Lim and J. Nieh, “Optimizing Nested Virtualization Performance Using Direct VirtualHardware,” in Proceedings of the 25th International Conference on Architectural Supportfor Programming Languages and Operating Systems (ASPLOS 2020), Lausanne, Switzer-land, Mar. 2020, pp. 557–574.
[88] R. Russell, Z. Yanmin, I. Molnar, and D. Sommerseth, Improve hackbench, http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c, Linux KernelMailing List (LKML), Jan. 2008.
[89] R. Jones, Netperf, https://github.com/HewlettPackard/netperf [Accessed: Dec16, 2020].
[90] ab - Apache HTTP server benchmarking tool, https://httpd.apache.org/docs/2.4/programs/ab.html [Accessed: Dec 16, 2020].
[91] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “BenchmarkingCloud Serving Systems with YCSB,” in Proceedings of the 1st ACM Symposium on CloudComputing (SoCC 2010), Indianapolis, IN, Jun. 2010, pp. 143–154.
[93] V. P. Kemerlis, G. Portokalidis, and A. D. Keromytis, “KGuard: Lightweight Kernel Pro-tection against Return-to-User Attacks,” in Proceedings of the 21st USENIX Security Sym-
posium (USENIX Security 2012), Bellevue, WA, Aug. 2012, pp. 459–474, ISBN: 978-931971-95-9.
[94] N. Dautenhahn, T. Kasampalis, W. Dietz, J. Criswell, and V. Adve, “Nested Kernel: AnOperating System Architecture for Intra-Kernel Privilege Separation,” in Proceedings ofthe 20th International Conference on Architectural Support for Programming Languagesand Operating Systems (ASPLOS 2015), Istanbul, Turkey, Mar. 2015, pp. 191–206.
[95] P. Colp, M. Nanavati, J. Zhu, W. Aiello, G. Coker, T. Deegan, P. Loscocco, and A. Warfield,“Breaking Up is Hard to Do: Security and Functionality in a Commodity Hypervisor,” inProceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP 2011),Cascais, Portugal, Oct. 2011, pp. 189–202.
[96] F. Zhang, J. Chen, H. Chen, and B. Zang, “CloudVisor: Retrofitting Protection of VirtualMachines in Multi-tenant Cloud with Nested Virtualization,” in Proceedings of the 23rdACM Symposium on Operating Systems Principles (SOSP 2011), Cascais, Portugal, Oct.2011, pp. 203–216.
[97] D. G. Murray, G. Milos, and S. Hand, “Improving Xen Security Through Disaggregation,”in Proceedings of the 4th ACM SIGPLAN/SIGOPS International Conference on VirtualExecution Environments (VEE 2008), Seattle, WA, Mar. 2008, pp. 151–160.
[98] S. Butt, H. A. Lagar-Cavilla, A. Srivastava, and V. Ganapathy, “Self-service Cloud Com-puting,” in Proceedings of the 2012 ACM Conference on Computer and CommunicationsSecurity (CCS 2012), Raleigh, NC, Oct. 2012, pp. 253–264.
[99] U. Steinberg and B. Kauer, “NOVA: A Microhypervisor-based Secure Virtualization Ar-chitecture,” in Proceedings of the 5th European Conference on Computer Systems (EuroSys2010), Paris, France, Apr. 2010, pp. 209–222.
[100] G. Heiser and B. Leslie, “The OKL4 Microvisor: Convergence Point of Microkernels andHypervisors,” in Proceedings of the 1st ACM Asia-pacific Workshop on Workshop on Sys-tems (APSys 2010), New Delhi, India, Aug. 2010, pp. 19–24.
[101] T. Shinagawa, H. Eiraku, K. Tanimoto, K. Omote, S. Hasegawa, T. Horie, M. Hirano,K. Kourai, Y. Oyama, E. Kawai, K. Kono, S. Chiba, Y. Shinjo, and K. Kato, “BitVisor: AThin Hypervisor for Enforcing I/O Device Security,” in Proceedings of the 2009 ACM SIG-PLAN/SIGOPS International Conference on Virtual Execution Environments (VEE 2009),Washington, DC, Mar. 2009, pp. 121–130.
[102] A. Nguyen, H. Raj, S. Rayanchu, S. Saroiu, and A. Wolman, “Delusional Boot: SecuringHypervisors Without Massive Re-engineering,” in Proceedings of the 7th ACM EuropeanConference on Computer Systems (EuroSys 2012), Bern, Switzerland, Apr. 2012, pp. 141–154.
132
[103] E. Keller, J. Szefer, J. Rexford, and R. B. Lee, “NoHype: Virtualized Cloud InfrastructureWithout the Virtualization,” in Proceedings of the 37th Annual International Symposiumon Computer Architecture (ISCA 2010), Saint-Malo, France, Jun. 2010, pp. 350–361.
[105] Z. Wang, C. Wu, M. Grace, and X. Jiang, “Isolating Commodity Hosted Hypervisors withHyperLock,” in Proceedings of the 7th ACM European Conference on Computer Systems(EuroSys 2012), Bern, Switzerland, Apr. 2012, pp. 127–140.
[106] C. Wu, Z. Wang, and X. Jiang, “Taming Hosted Hypervisors with (Mostly) DeprivilegedExecution.,” in 20th Annual Network and Distributed System Security Symposium (NDSS2013), San Diego, CA, Feb. 2013.
[107] L. Shi, Y. Wu, Y. Xia, N. Dautenhahn, H. Chen, B. Zang, and J. Li, “Deconstructing Xen,”in 24th Annual Network and Distributed System Security Symposium (NDSS 2017), SanDiego, CA, Feb. 2017.
[109] A. Baumann, M. Peinado, and G. Hunt, “Shielding Applications from an Untrusted Cloudwith Haven,” in Proceedings of the 11th USENIX Symposium on Operating Systems Designand Implementation (OSDI 2014), Broomfield, CO, Oct. 2014, pp. 267–283.
[110] M.-W. Shih, M. Kumar, T. Kim, and A. Gavrilovska, “S-NFV: Securing NFV States byUsing SGX,” in Proceedings of the 2016 ACM International Workshop on Security in Soft-ware Defined Networks & Network Function Virtualization (SDN-NFV Security 2016),New Orleans, LA, Mar. 2016, pp. 45–48.
[111] M. Zhu, B. Tu, W. Wei, and D. Meng, “HA-VMSI: A Lightweight Virtual Machine Iso-lation Approach with Commodity Hardware for ARM,” in Proceedings of the 13th ACMSIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE 2017),Xi’an, China, Apr. 2017, pp. 242–256.
[112] Z. Hua, J. Gu, Y. Xia, H. Chen, B. Zang, and Haibing, “VTZ: Virtualizing ARM Trust-zone,” in Proceedings of the 26th USENIX Security Symposium (USENIX Security 2017),Vancouver, BC, Canada, Aug. 2017, pp. 541–556.
[113] S. Jin, J. Ahn, S. Cha, and J. Huh, “Architectural Support for Secure Virtualization Under aVulnerable Hypervisor,” in Proceedings of the 44th Annual IEEE/ACM International Sym-posium on Microarchitecture (MICRO 2011), Porto Alegre, Brazil, Dec. 2011, pp. 272–283.
[114] J. Szefer and R. B. Lee, “Architectural Support for Hypervisor-secure Virtualization,” inProceedings of the 17th International Conference on Architectural Support for Program-ming Languages and Operating Systems (ASPLOS 2012), London, England, UK, Mar.2012, pp. 437–450.
[115] Y. Xia, Y. Liu, and H. Chen, “Architecture Support for Guest-transparent VM Protectionfrom Untrusted Hypervisor and Physical Attacks,” in Proceedings of the 19th IEEE In-ternational Symposium on High Performance Computer Architecture (HPCA 2013), Shen-zhen, China, Feb. 2013, pp. 246–257.
[117] Advanced Micro Devices, Secure Encrypted Virtualization API Version 0.16, http://developer.amd.com/wordpress/media/2017/11/55766_SEV-KM-API_Specification.pdf, Feb. 2018.
[118] Y. Wu, Y. Liu, R. Liu, H. Chen, B. Zang, and H. Guan, “Comprehensive VM Protec-tion Against Untrusted Hypervisor Through Retrofitted AMD Memory Encryption,” inProceedings of the 24th IEEE International Symposium on High Performance ComputerArchitecture (HPCA 2018), Vienna, Austria, Feb. 2018, pp. 441–453.
[119] J. Yang and K. G. Shin, “Using Hypervisor to Provide Data Secrecy for User Applica-tions on a Per-page Basis,” in Proceedings of the 4th ACM SIGPLAN/SIGOPS Interna-tional Conference on Virtual Execution Environments (VEE 2008), Seattle, WA, Mar. 2008,pp. 71–80.
[120] J. M. McCune, Y. Li, N. Qu, Z. Zhou, A. Datta, V. Gligor, and A. Perrig, “TrustVisor:Efficient TCB Reduction and Attestation,” in Proceedings of the 2010 IEEE Symposiumon Security and Privacy (SP 2010), Oakland, CA, May 2010, pp. 143–158.
[121] S. Chhabra, B. Rogers, Y. Solihin, and M. Prvulovic, “SecureME: A Hardware-softwareApproach to Full System Security,” in Proceedings of the 2011 International Conferenceon Supercomputing (ICS 2011), Tucson, Arizona, USA, May 2011, pp. 108–119.
[122] Z. Wang, X. Jiang, W. Cui, and P. Ning, “Countering Kernel Rootkits with LightweightHook Protection,” in Proceedings of the 16th ACM Conference on Computer and Commu-nications Security (CCS 2009), Chicago, IL, Nov. 2009, pp. 545–554.
[123] R. Riley, X. Jiang, and D. Xu, “Guest-Transparent Prevention of Kernel Rootkits withVMM-Based Memory Shadowing,” in Proceedings of the 11th International Symposiumon Recent Advances in Intrusion Detection (RAID 2008), Cambridge, MA, Sep. 2008,pp. 1–20.
[124] A. Seshadri, M. Luk, N. Qu, and A. Perrig, “SecVisor: A Tiny Hypervisor to Provide Life-time Kernel Code Integrity for Commodity OSes,” in Proceedings of 21st ACM SIGOPSSymposium on Operating Systems Principles (SOSP 2007), Stevenson, WA, Oct. 2007,pp. 335–350.
[125] X. Wang, Y. Chen, Z. Wang, Y. Qi, and Y. Zhou, “SecPod: A Framework for Virtualization-based Security Systems,” in Proceedings of the 2015 USENIX Annual Technical Confer-ence (USENIX ATC 2015), Santa Clara, CA, Jul. 2015, pp. 347–360.
[126] Z. Zhou, M. Yu, and V. D. Gligor, “Dancing with Giants: Wimpy Kernels for On-DemandIsolated I/O,” in Proceedings of the 2014 IEEE Symposium on Security and Privacy (SP2014), San Jose, CA, May 2014, pp. 308–323.
[127] G. Klein, J. Andronick, M. Fernandez, I. Kuz, T. Murray, and G. Heiser, “Formally Veri-fied Software in the Real World,” Communications of the ACM, vol. 61, no. 10, pp. 68–77,Sep. 2018.
[128] T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh, “Terra: A Virtual Machine-based Platform for Trusted Computing,” in Proceedings of the 19th ACM Symposium onOperating Systems Principles (SOSP 2003), Bolton Landing, NY, Oct. 2003, pp. 193–206.
[129] R. Strackx and F. Piessens, “Fides: Selectively Hardening Software Application Compo-nents Against Kernel-level or Process-level Malware,” in Proceedings of the 2012 ACMConference on Computer and Communications Security (CCS 2012), Raleigh, NC, Oct.2012, pp. 2–13.
[130] R. Ta-Min, L. Litty, and D. Lie, “Splitting Interfaces: Making Trust Between Applica-tions and Operating Systems Configurable,” in Proceedings of the 7th USENIX Symposiumon Operating Systems Design and Implementation (OSDI 2006), Seattle, WA, Nov. 2006,pp. 279–292.
[131] Y. Liu, T. Zhou, K. Chen, H. Chen, and Y. Xia, “Thwarting Memory Disclosure with Effi-cient Hypervisor-enforced Intra-domain Isolation,” in Proceedings of the 2015 ACM Con-ference on Computer and Communications Security (CCS 2015), Denver, CO, Oct. 2015,pp. 1607–1619.
[132] G. Klein, J. Andronick, K. Elphinstone, T. Murray, T. Sewell, R. Kolanski, and G. Heiser,“Comprehensive Formal Verification of an OS Microkernel,” ACM Transactions on Com-puter Systems, vol. 32, no. 1, 2:1–70, Feb. 2014.
[133] R. Gu, Z. Shao, H. Chen, J. Kim, J. Koenig, X. Wu, V. Sjöberg, and D. Costanzo, “BuildingCertified Concurrent OS Kernels,” Communications of the ACM, vol. 62, no. 10, pp. 89–99, Sep. 2019.
[135] J. Oberhauser, R. L. de Lima Chehab, D. Behrens, M. Fu, A. Paolillo, L. Oberhauser, K.Bhat, Y. Wen, H. Chen, J. Kim, and V. Vafeiadis, “VSync: Push-Button Verification andOptimization for Synchronization Primitives on Weak Memory Models,” in Proceedings ofthe 26th International Conference on Architectural Support for Programming Languagesand Operating Systems (ASPLOS 2021), Detroit, MI, Apr. 2021.
[136] seL4 Reference Manual Version 11.0.0, Data61, Nov. 2019.
[137] Frequently Asked Questions on seL4, https://docs.sel4.systems/projects/sel4/frequently-asked-questions.html [Accessed: Dec 16, 2020].
[138] E. Cohen, M. Dahlweid, M. Hillebrand, D. Leinenbach, M. Moskal, T. Santen, W. Schulte,and S. Tobies, “VCC: A Practical System for Verifying Concurrent C,” in Proceedings ofthe 22nd International Conference on Theorem Proving in Higher Order Logics (TPHOLs2009), Munich, Germany, Aug. 2009, pp. 23–42.
[139] D. Leinenbach and T. Santen, “Verifying the Microsoft Hyper-V hypervisor with VCC,” inProceedings of the 16th International Symposium on Formal Methods (FM 2009), Eind-hoven, The Netherlands, Nov. 2009, pp. 806–809.
[140] A. Vasudevan, S. Chaki, P. Maniatis, L. Jia, and A. Datta, “überSpark: Enforcing VerifiableObject Abstractions for Automated Compositional Security Analysis of a Hypervisor,” inProceedings of the 25th USENIX Security Symposium (USENIX Security 2016), Austin,TX, Aug. 2016, pp. 87–104.
[141] “Creating a Trusted Embedded Platform for MLS Application,” Green Hills Software,Whitepaper v0520, May 2020.
[142] National Information Assurance Partnership, Separation Kernels on Commodity Worksta-tions, http://www.niap-ccevs.org/announcements/Separation%20Kernels%20on%20Commodity%20Workstations.pdf, Mar. 2010.
[143] C. Hawblitzel, J. Howell, J. R. Lorch, A. Narayan, B. Parno, D. Zhang, and B. Zill, “Iron-clad Apps: End-to-End Security via Automated Full-System Verification,” in Proceedingsof the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI2014), Broomfield, CO, Oct. 2014, pp. 165–181.
[144] D. Jang, Z. Tatlock, and S. Lerner, “Establishing Browser Security Guarantees through For-mal Shim Verification,” in Proceedings of the 21st USENIX Security Symposium (USENIXSecurity 2012), Bellevue, WA, Aug. 2012, pp. 113–128.
[145] C. Baumann, M. Näslund, C. Gehrmann, O. Schwarz, and H. Thorsen, “A High AssuranceVirtualization Platform for ARMv8,” in Proceedings of the 2016 European Conference onNetworks and Communications (EuCNC 2016), Athens, Greece, Jun. 2016, pp. 210–214.
[146] C. Baumann, O. Schwarz, and M. Dam, “On the verification of system-level informationflow properties for virtualized execution platforms,” Journal of Cryptographic Engineer-ing, vol. 9, no. 3, pp. 243–261, May 2019.
[147] T. Murray, R. Sison, E. Pierzchalski, and C. Rizkallah, “Compositional Verification andRefinement of Concurrent Value-Dependent Noninterference,” in Proceedings of the 29thIEEE Computer Security Foundations Symposium (CSF 2016), Lisbon, Portugal, Jun. 2016,pp. 417–431.
[148] T. Murray, R. Sison, and K. Engelhardt, “COVERN: A Logic for Compositional Verifica-tion of Information Flow Control,” in Proceedings of the 2018 IEEE European Conferenceon Security and Privacy (EuroS&P 2018), London, United Kingdom, Apr. 2018, pp. 16–30.
[149] G. Ernst and T. Murray, “SecCSL: Security Concurrent Separation Logic,” in Proceedingsof the 31st International Conference (CAV 2019), New York, NY, Jul. 2019, pp. 208–230.
[150] D. Schoepe, T. Murray, and A. Sabelfeld, “VERONICA: Expressive and Precise Concur-rent Information Flow Security,” in Proceedings of the 33rd IEEE Computer Security Foun-dations Symposium (CSF 2020), Boston, MA, Jun. 2020, pp. 79–94.
[151] H. Tuch and G. Klein, “Verifying the L4 virtual memory subsystem,” in Proceedings ofthe NICTA Foraml Methods Workshop on OS Verification, Sydney, Australia, Oct. 2004,pp. 73–97.
[152] O. Schwarz and M. Dam, “Formal verification of secure user mode device executionwith DMA,” in Proceedings of the 10th International Haifa Verification Conference (HVC2014), Haifa, Israel, Nov. 2014, pp. 236–251.
[153] Y. Zhao and D. Sanán, “Rely-Guarantee Reasoning About Concurrent Memory Manage-ment in Zephyr RTOS,” in Proceedings of the 31st International Conference (CAV 2019),New York, NY, Jul. 2019, pp. 515–533.
[154] S. H. Taqdees and G. Klein, “Reasoning about Translation Lookaside Buffers,” in Proceed-ings of the 21st International Conference on Logic for Programming, Artificial Intelligenceand Reasoning (LPAR 2017), Maun, Botswana, May 2017, pp. 490–508.
[155] H. T. Syeda and G. Klein, “Program verification in the presence of cached address transla-tion,” in Proceedings of the 2018 International Conference on Interactive Theorem Proving(ITP 2018), Oxford, United Kingdom, Jul. 2018, pp. 542–559.
137
[156] A. Fox, “Formal specification and verification of arm6,” in International Conference onTheorem Proving in Higher Order Logics (TPHOLs 2003), Rome, Italy, Sep. 2003, pp. 25–40.
[157] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer’s Manual, 325462-044US, Aug. 2012.
138
Appendix A: KCore API
Figure A.1 shows the APIs of all KCore layers. As discussed earlier, our layered verification
approach only allows higher layer primitives to call lower layer primitives. Black arrows
indicate that a primitive in one layer calls a primitive in the next lower layer. For example,
walk_pgd from PTWalk calls alloc_pgd in PTAlloc. White boxes show primitives that are only
used in the layer in which they are defined. Colored boxes show primitives that are passed
through to other layers. For example, map_pfn_vm from MemAux calls map_page, which is passed
through from NPTOps to PageMgmt. The figure does not include the primitives passed through
from the abstract machine, such as the memory load and store primitives. White arrows indicate
that all primitives in a given layer use specific lower layer primitives. For instance, all
primitives in BootOps use the acquire_lock_vm and release_lock_vm primitives. Empty white
boxes are used to group primitives into a set; a black arrow from a primitive in a higher
layer to an empty white box indicates that the higher layer primitive uses all primitives in
that set.
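The layered call discipline can be sketched in C. This is a minimal illustration, not KCore's code: the pool size, the per-VM index array, and the simplified signatures of walk_pgd and alloc_pgd are all hypothetical; the point is only that a PTWalk primitive calls downward into PTAlloc, never the reverse.

```c
#include <stdint.h>

/* Illustrative only: real KCore pools and signatures differ. */
#define PGD_POOL_SIZE 8

/* --- PTAlloc layer: page table pool allocation --- */
static int pgd_next = 0;

/* Returns the index of a freshly allocated pgd slot, or -1 when full. */
static int alloc_pgd(void) {
    if (pgd_next >= PGD_POOL_SIZE)
        return -1;
    return pgd_next++;
}

/* --- PTWalk layer: may call PTAlloc, never the other way around --- */
/* Looks up the pgd slot for a vmid, allocating one on first use. */
static int walk_pgd(int vmid, int *pgd_of_vm) {
    if (pgd_of_vm[vmid] < 0)
        pgd_of_vm[vmid] = alloc_pgd();  /* downward call only */
    return pgd_of_vm[vmid];
}
```

Because every call crosses layers in one direction, each layer can be verified against the specification of the layer below it.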
Tables A.1 to A.33 list the APIs of KCore’s 33 intermediate layers shown in Figure 2.2.
kserv_hvc_handler: Called by TrapHandler to handle a given hypercall made by KServ. It first calls CtxtSwitch to context switch to KCore, then calls TrapDispatcher to handle the hypercall. Finally, it calls CtxtSwitch to context switch back to KServ.

kserv_s2pt_fault_handler: Called by TrapHandler to handle a given stage 2 page fault for KServ. It first calls CtxtSwitch to context switch to KCore, then calls FaultHandler to handle the page fault. Finally, it calls CtxtSwitch to context switch back to KServ.

vm_exit_handler: Called by TrapHandler to handle a given VM exit. It first calls CtxtSwitch to context switch to KCore. It then calls VCPUOps to handle the VM exit. If the exit requires KServ's functionality, it calls CtxtSwitch to context switch to KServ; otherwise, it handles the exit directly and returns to the VM.
Table A.1: TrapHandlerRaw API
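The control flow of kserv_hvc_handler described in Table A.1 can be sketched as follows. The trace buffer, the step constants, and the simplified function signatures are hypothetical stand-ins for the real CtxtSwitch and TrapDispatcher primitives; the sketch only shows the save/dispatch/restore ordering.

```c
/* Hypothetical sketch of kserv_hvc_handler's control flow.
 * Step names and the trace buffer are illustrative only. */
enum { TO_KCORE, DISPATCH, TO_KSERV };

static int trace[8];
static int ntrace = 0;

static void ctxt_switch_to_kcore(void) { trace[ntrace++] = TO_KCORE; }
static void trap_dispatcher(int hvc)   { (void)hvc; trace[ntrace++] = DISPATCH; }
static void ctxt_switch_to_kserv(void) { trace[ntrace++] = TO_KSERV; }

/* Mirrors Table A.1: save KServ state and enter KCore, route the
 * hypercall, then restore KServ state before returning. */
static void kserv_hvc_handler(int hvc_num) {
    ctxt_switch_to_kcore();   /* CtxtSwitch: KServ -> KCore */
    trap_dispatcher(hvc_num); /* TrapDispatcher: route the hypercall */
    ctxt_switch_to_kserv();   /* CtxtSwitch: KCore -> KServ */
}
```

The symmetric structure matters for the proofs: every entry into KCore is bracketed by a context switch in and a context switch out, so KServ never observes KCore's intermediate register state.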
kserv_hvc_dispatcher: Dispatches a given hypercall made by KServ to the respective hypercall handler based on the hypercall number.

vm_exit_dispatcher: Checks if KCore can handle a given VM exit; if so, it handles the exit directly. For example, it calls FaultHandler to handle the GRANT_MEM and REVOKE_MEM hypercalls.
Table A.2: TrapDispatcher API
handle_kserv_s2pt_fault: Handles a given stage 2 page fault for KServ. It either calls SmmuOps to handle the SMMU access, or calls MemAux to resolve the page fault.

handle_pvops: Calls MemOps to handle a given GRANT_MEM or REVOKE_MEM hypercall.
Table A.3: FaultHandler API
clear_vm_mem_range: Handles the clear_vm hypercall for KServ. It first calls VMPower to ensure a given target VM is powered off, then calls MemOps to reclaim pages from the target VM.

__smmu_map: Handles the smmu_map hypercall for KServ by calling SmmuRaw and SmmuOps.
Table A.4: MemHandler API
save_kserv_gprs: Saves KServ's general purpose registers from the hardware to memory.
restore_kserv_gprs: Restores KServ's general purpose registers from memory to the hardware.
save_vm_gprs: Saves a given VM's general purpose registers from the hardware to memory.
restore_vm_gprs: Restores a given VM's general purpose registers from memory to the hardware.
save_core_gprs: Saves KCore's general purpose registers from the hardware to memory.
restore_core_gprs: Restores KCore's general purpose registers from memory to the hardware.
save_kserv_sysregs: Saves KServ's system registers from the hardware to memory.
restore_kserv_sysregs: Restores KServ's system registers from memory to the hardware.
save_vm_sysregs: Saves a given VM's system registers from the hardware to memory.
restore_vm_sysregs: Restores a given VM's system registers from memory to the hardware.
Table A.5: CtxtSwitch API
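The save/restore pattern in Table A.5 can be sketched as below. This is a simplified illustration: the register-file array standing in for CPU hardware, the cpu_context struct, and the merged save_gprs/restore_gprs signatures are all hypothetical (KCore keeps separate per-principal primitives, as the table shows).

```c
#include <stdint.h>
#include <string.h>

/* Arm provides 31 general purpose registers, x0..x30. */
#define NR_GPRS 31

/* Stand-in for the physical CPU register file. */
static uint64_t hw_gprs[NR_GPRS];

/* Per-principal register save area in memory. */
struct cpu_context { uint64_t gprs[NR_GPRS]; };

/* save_*_gprs: hardware -> per-principal memory */
static void save_gprs(struct cpu_context *ctxt) {
    memcpy(ctxt->gprs, hw_gprs, sizeof(hw_gprs));
}

/* restore_*_gprs: per-principal memory -> hardware */
static void restore_gprs(const struct cpu_context *ctxt) {
    memcpy(hw_gprs, ctxt->gprs, sizeof(hw_gprs));
}
```

Saving one principal's registers before restoring another's is what prevents KServ from reading a VM's register state after a VM exit.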
proc_vm_exit: Calls VCPUOpsAux to handle a VM exit.

proc_vm_enter: Calls VCPUOpsAux to handle the VM ENTER hypercall for KServ, to either copy data from the intermediate state to VCPUContext or resolve a VM's stage 2 page fault. It also resets the VCPU registers on the first VM enter.
Table A.6: VCPUOps API
reset_gp_regs: Resets the general purpose registers for a given VCPU.

reset_sysregs: Resets the system registers for a given VCPU.

sync_intr_to_vcpu: Copies data from the intermediate state to VCPUContext. For example, it copies MMIO read data from the intermediate state to VCPUContext.

prep_wfx: Handles a given VM exit caused by executing Arm's WFE/WFI instructions.

prep_psci: Handles a given VM exit caused by a VM making PSCI hypercalls. It copies the VM power states stored in a given VM's general purpose registers from VCPUContext to the intermediate state, so the PSCI power hypercall handler in KServ can access the data to handle the hypercall.

prep_abort: Handles the VM exits caused by memory access aborts, including MMIO accesses and faults on regular memory accesses. To handle an MMIO write, it copies the MMIO write data from VCPUContext to the intermediate state; to handle an MMIO read, it sets a dirty flag so that KCore can later copy the read data from the intermediate state to VCPUContext. To handle a stage 2 page fault on a regular memory access, it sets a flag in the data structure shared with KServ to notify KServ to allocate memory.

update_excpt_regs: Updates the general purpose registers stored in VCPUContext to inject an exception into a given VM.

handle_s2pt_fault: Calls MemOps to resolve a given VM's stage 2 page fault.
Table A.7: VCPUOpsAux API
handle_mmio: Calls SmmuAux to handle a fault caused by KServ's MMIO access to the SMMU.

__smmu_alloc_unit: Handles the smmu_alloc_unit hypercall for KServ. It calls BootOps to allocate an SMMU translation unit for a given device.

__smmu_free_unit: Handles the smmu_free_unit hypercall for KServ to deallocate an SMMU translation unit.

smmu_map_page: Calls BootOps to map a given iova to an hPA in a given device's SMMU page table.

smmu_unmap_page: Handles the smmu_unmap hypercall for KServ. It calls MmioSPTOps to unmap a given iova from a given device's SMMU page table.

__smmu_iova_to_phys: Handles the smmu_iova_to_phys hypercall for KServ. It calls MmioSPTOps to walk a given device's SMMU page table using an input iova.
Table A.8: SmmuOps API
check_smmu_address: Checks if a given faulted physical address falls within a hardware memory region that belongs to the SMMU.

handle_smmu_access: Calls SmmuCore to handle KServ's MMIO access to the SMMU.
Table A.9: SmmuAux API
handle_smmu_write: Calls SmmuCoreAux to handle KServ's write access to the SMMU. It calls SmmuRaw to get the MMIO write data used to program the SMMU.

handle_smmu_read: Calls SmmuCoreAux to handle KServ's read access to the SMMU.
Table A.10: SmmuCore API
__handle_smmu_write: Programs the SMMU hardware to carry out KServ's SMMU write.

__handle_smmu_read: Reads the SMMU hardware to carry out KServ's SMMU read.

handle_global_access: Validates KServ's access to the SMMU global registers. For example, it rejects KServ's attempts to disable the SMMU page tables by programming the SMMU_CBAR register.

handle_cb_access: Calls SmmuRaw to first locate the SMMU translation unit, then validates KServ's access to the bank registers of the given SMMU translation unit. For instance, KCore forbids any write access to the SMMU page table base register (TTBR0) that would cause a given translation unit to use a malicious SMMU page table.
Table A.11: SmmuCoreAux API
get_mmio_data: Returns the MMIO write data stored in KServ's register for a given SMMU write.

init_smmu_pte: Takes a given hPA and formulates the resulting value to store to an entry in the SMMU page tables.

get_smmu_unit: Translates a given input physical address to the index of the corresponding SMMU translation unit, which KCore uses to manage the SMMU.
Table A.12: SmmuRaw API
search_ld_info: Checks if a given guest physical address is within a memory region that contains a VM image.

set_vcpu_active: Specifies that a given VCPU is active on the current physical CPU. Used by KCore to ensure the same VCPU cannot be run concurrently on another physical CPU.

set_vcpu_inactive: Specifies that a given VCPU is inactive on the current physical CPU.

__register_vcpu: Handles the register_vcpu hypercall for KServ.

__register_vm: Handles the register_vm hypercall for KServ. It calls BootCore to allocate a new VM identifier.

__set_boot_info: Handles the set_boot_info hypercall for KServ. It stores the information of a given VM boot image to VMInfo, and then calls BootCore to allocate a memory buffer in KCore's address space to remap the VM image.

remap_image_page: Handles the remap_boot_image_page hypercall for KServ. It calls MemAux to map a given physical page containing the VM image to the EL2 stage 1 page table.

verify_and_load_images: Handles the verify_vm_image hypercall for KServ. It loops over the list of boot images loaded for a given VM and calls HACL* to authenticate each image. If an image is authenticated, it calls BootAux to map the image to the VM's stage 2 page table.

alloc_smmu: Checks if a given VM has booted; if not, it allocates an SMMU translation unit to the VM's device, and calls MmioSPTOps to initialize the respective SMMU page table.

map_smmu: Checks if a given VM has booted; if not, it calls MemAux to map an iova to an hPA in the SMMU page table of the VM's device.

clear_smmu: Checks if a given VM has booted; if not, it calls MemAux to unmap an iova from the SMMU page table of the VM's device.

__encrypt_vcpu: Handles the encrypt_vcpu hypercall for KServ. It encrypts the data stored in the VCPUContext of a given VCPU and copies the encrypted data to KServ's memory.

__decrypt_vcpu: Handles the decrypt_vcpu hypercall for KServ. It copies the encrypted CPU data from KServ memory to a private buffer, and decrypts the data stored in that buffer.

__encrypt_vm_mem: Handles the encrypt_vm_mem hypercall for KServ. It calls MemOps to encrypt the data stored at a given physical address and copies the encrypted data to an output buffer owned by KServ.

__decrypt_vm_mem: Handles the decrypt_vm_mem hypercall for KServ. It copies the encrypted data stored at a given physical address to a private buffer, then calls MemOps to decrypt it.
Table A.13: BootOps API
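The loop structure of verify_and_load_images can be sketched as below. The vm_image struct, the checksum comparison standing in for HACL* authentication, and the load_vm_image_stub helper are all hypothetical; only the shape of the loop, authenticate each image and map only those that pass, follows Table A.13.

```c
#include <stdint.h>

/* Illustrative image record; real KCore metadata differs. */
struct vm_image {
    uint32_t checksum;  /* stand-in for a cryptographic hash */
    uint32_t expected;  /* stand-in for the signed reference value */
    int mapped;
};

/* Stand-in for HACL*-based image authentication. */
static int authenticated(const struct vm_image *img) {
    return img->checksum == img->expected;
}

/* Stand-in for BootAux's load_vm_image. */
static void load_vm_image_stub(struct vm_image *img) { img->mapped = 1; }

/* Returns how many images were mapped to the VM's stage 2 page table. */
static int verify_and_load_images(struct vm_image *imgs, int n) {
    int mapped = 0;
    for (int i = 0; i < n; i++) {
        if (authenticated(&imgs[i])) {   /* only verified images */
            load_vm_image_stub(&imgs[i]);
            mapped++;
        }
    }
    return mapped;
}
```

An image that fails authentication is simply never mapped, so an untrusted KServ cannot boot a VM with a tampered image.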
load_vm_image: Maps a given VM's authenticated boot image to the VM's stage 2 page table.
Table A.14: BootAux API
gen_vmid: Allocates a new VM identifier.

alloc_remap_addr: Allocates a contiguous buffer from KCore's address space for remapping a given VM image.
Table A.15: BootCore API
set_vm_power: Sets a given VM's power state.
get_vm_power: Returns a given VM's power state.
Table A.16: VMPower API
clear_vm_range: Loops over a range of physical memory and calls MemAux on each 4KB page to reclaim memory.

prot_map_vm_s2pt: Loops over a range of physical memory, and calls MemAux on each 4KB page to transfer the page to a given VM and map the page to the VM's stage 2 page table.

grant_vm_pages: Loops over a range of guest physical memory and calls MemAux on each 4KB page to grant KServ access to the page. Used by FaultHandler to handle the GRANT_MEM hypercall.

revoke_vm_pages: Loops over a range of guest physical memory and calls MemAux on each 4KB page to revoke KServ's access to the page. Used by FaultHandler to handle the REVOKE_MEM hypercall.

encrypt_mem: Encrypts the contents of a memory buffer using HACL*.
decrypt_mem: Decrypts the contents of a memory buffer using HACL*.
Table A.17: MemOps API
map_pfn_kserv Handles a stage 2 page fault for KServ. It first validates KServ's memory access to the faulted hPA. If the access is permitted, it calls NPTOps to resolve the page fault.
map_pfn_vm Calls NPTOps to map a given guest physical address to a hPA in a given VM's stage 2 page table.
clear_vm_page Scrubs a given physical page and calls PageMgmt to assign the page to KServ.
assign_pfn_vm Checks if a given physical page is owned by KServ. If yes, it calls PageMgmt to assign the page to a target VM.
grant_vm_page Calls PageMgmt to update the sharing status of a given physical page to grant KServ access to the page.
revoke_vm_page Calls PageMgmt to update the sharing status of a given physical page to revoke KServ's access to the page. It then calls NPTOps to unmap the page from KServ's stage 2 page table.
map_pfn_smmu Calls PageMgmt to assign a given physical page to a target principal and calls MmioSPTOps to create a mapping to the page in the SMMU page table for the target principal's device.
unmap_pfn_smmu Calls MmioSPTOps to unmap a given physical page from a given device's SMMU page table.
Table A.18: MemAux API
get_pfn_owner Calls PageIndex to get the index to the S2Page array for a given physical page, and returns owner from the page's respective S2Page.
set_pfn_owner Calls PageIndex to get the index to the S2Page array for a given physical page, and updates owner in the page's respective S2Page.
get_pfn_share Calls PageIndex to get the index to the S2Page array for a given physical page, and returns share from the page's respective S2Page.
set_pfn_share Calls PageIndex to get the index to the S2Page array for a given physical page, and updates share in the page's respective S2Page.
Table A.19: PageMgmt API
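The ownership-tracking pattern behind the PageMgmt and PageIndex layers can be sketched in C as follows. The struct layout, array size, and constants here are illustrative stand-ins, not KCore's actual definitions; only the function names and the S2Page owner/share fields come from the tables above.

```c
#include <stdint.h>

/* Hypothetical per-page metadata: one S2Page per 4KB RAM page. */
#define MAX_S2PAGES 16
#define INVALID_IDX (-1)

struct s2page {
    int owner;   /* principal (KServ or a VM id) that owns this page */
    int share;   /* nonzero if the page is shared with KServ */
};

static struct s2page s2pages[MAX_S2PAGES];

/* PageIndex layer: map a physical address to its S2Page array index,
 * returning INVALID_IDX for non-RAM addresses (Memblock check elided). */
static int get_s2page_index(uint64_t pa)
{
    uint64_t idx = pa >> 12;            /* one entry per 4KB page */
    return idx < MAX_S2PAGES ? (int)idx : INVALID_IDX;
}

/* PageMgmt layer: read and update ownership through the index layer. */
int get_pfn_owner(uint64_t pa)
{
    int idx = get_s2page_index(pa);
    return idx == INVALID_IDX ? -1 : s2pages[idx].owner;
}

void set_pfn_owner(uint64_t pa, int owner)
{
    int idx = get_s2page_index(pa);
    if (idx != INVALID_IDX)
        s2pages[idx].owner = owner;
}
```

The layering mirrors the proof decomposition: PageMgmt never touches physical addresses directly, so its correctness can be argued against the PageIndex specification alone.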
get_s2page_index Calls Memblock to check if a given physical address belongs to RAM. If yes, it returns the index respective to the address in the S2Page array.
Table A.20: PageIndex API
search_memblock Checks if a given input address is within a physical address region thatbelongs to RAM.
Table A.21: Memblock API
Primitive Description
get_s2pt_size Acquires the page table lock and calls get_npt_size in NPTWalk.
walk_s2pt Acquires the page table lock and calls walk_npt in NPTWalk.
walk_pt Acquires the page table lock and calls walk_s1pt in NPTWalk.
map_page Acquires the page table lock and calls set_s2pt in NPTWalk.
map_page_core Acquires the page table lock and calls set_s1pt in NPTWalk.
unmap_pfn_kserv Acquires the page table lock and calls unset_s2pt in NPTWalk.
Table A.22: NPTOps API
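Every NPTOps primitive follows the same lock-then-delegate pattern around its NPTWalk counterpart. A minimal C sketch of two of them, with toy stand-ins for the Locks and NPTWalk layers (the arrays, sizes, and trivial lock words are hypothetical placeholders, not KCore's data structures):

```c
#include <stdint.h>

/* Placeholder lower layers: a per-principal lock word (Locks) and a
 * flat toy "stage 2 page table" per principal (NPTWalk). */
static int npt_lock[4];
static uint64_t fake_npt[4][16];

static void acquire_lock_npt(int vmid) { npt_lock[vmid] = 1; }
static void release_lock_npt(int vmid) { npt_lock[vmid] = 0; }

static uint64_t walk_npt(int vmid, uint64_t gpa)
{
    return fake_npt[vmid][(gpa >> 12) & 15];
}

static void set_s2pt(int vmid, uint64_t gpa, uint64_t hpa)
{
    fake_npt[vmid][(gpa >> 12) & 15] = hpa;
}

/* NPTOps: identical logic to NPTWalk, but serialized by the
 * per-principal page table lock. */
uint64_t walk_s2pt(int vmid, uint64_t gpa)
{
    acquire_lock_npt(vmid);
    uint64_t hpa = walk_npt(vmid, gpa);
    release_lock_npt(vmid);
    return hpa;
}

void map_page(int vmid, uint64_t gpa, uint64_t hpa)
{
    acquire_lock_npt(vmid);
    set_s2pt(vmid, gpa, hpa);
    release_lock_npt(vmid);
}
```

Separating locking (NPTOps) from page table logic (NPTWalk) lets the lower layer be specified and verified as purely sequential code.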
get_npt_size Walks a given principal's stage 2 page table using a given address and returns the page size, either 4KB or 2MB, used in the page table to map the address.
walk_npt Walks a given principal's stage 2 page table using a given address and returns the hPA that the address maps to.
set_s2pt Maps a given address addr to a hPA in a given principal's stage 2 page table.
unset_s2pt Unmaps a given address addr from a given principal's stage 2 page table.
set_s1pt Maps a given address addr to a hPA in the EL2 stage 1 page table.
walk_s1pt Walks the EL2 stage 1 page table using a given address and returns the hPA that the address maps to.
Table A.23: NPTWalk API
walk_pgd Walks the pgd table using a given input address and returns the pgd entry. It calls PTAlloc to allocate a new page for the next level page table if the pgd entry is unmapped.
walk_pud Walks the pud table using a given input address and returns the pud entry. It calls PTAlloc to allocate a new page for the next level page table if the pud entry is unmapped.
walk_pmd Walks the pmd table using a given input address and returns the pmd entry. It calls PTAlloc to allocate a new page for the next level page table if the pmd entry is unmapped.
walk_pte Walks the pte table using a given input address and returns the pte entry.
set_pmd Sets the entry in the pmd table that corresponds to a given input address to an input value.
set_pte Sets the entry in the pte table that corresponds to a given input address to an input value.
Table A.24: PTWalk API
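The allocate-on-demand behavior of the walk_* primitives can be illustrated with a toy two-level walk. The table sizes, index widths, and fixed-pool allocator below are illustrative; only the allocate-if-unmapped structure matches the PTWalk/PTAlloc description above.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy two-level page table: a fixed page pool stands in for PTAlloc. */
#define POOL_PAGES 8
#define ENTRIES 16

static uint64_t pool[POOL_PAGES][ENTRIES];
static int pool_next;

static uint64_t *alloc_pte(void)        /* PTAlloc stand-in */
{
    return pool_next < POOL_PAGES ? pool[pool_next++] : NULL;
}

static uint64_t *pgd[ENTRIES];          /* top-level table */

/* walk_pgd: return the next-level table for addr, allocating a new
 * page from the pool if the pgd entry is unmapped. */
static uint64_t *walk_pgd(uint64_t addr)
{
    unsigned idx = (addr >> 16) & (ENTRIES - 1);
    if (!pgd[idx])
        pgd[idx] = alloc_pte();
    return pgd[idx];
}

/* set_pte / walk_pte: write and read the leaf entry for addr. */
void set_pte(uint64_t addr, uint64_t val)
{
    uint64_t *pte = walk_pgd(addr);
    if (pte)
        pte[(addr >> 12) & (ENTRIES - 1)] = val;
}

uint64_t walk_pte(uint64_t addr)
{
    uint64_t *pte = walk_pgd(addr);
    return pte ? pte[(addr >> 12) & (ENTRIES - 1)] : 0;
}
```

Drawing next-level tables from a per-principal pool, as PTAlloc does, keeps each principal's page table memory disjoint, which is what the memory isolation proof relies on.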
alloc_pud Allocates a pud page table from a given principal's page table pool.
alloc_pmd Allocates a pmd page table from a given principal's page table pool.
alloc_pte Allocates a pte page table from a given principal's page table pool.
Table A.25: PTAlloc API
walk_spt Acquires the SMMU page table lock and calls walk_smmu_pt in MmioSPTWalk.
mmap_spt Acquires the SMMU page table lock and calls set_smmu_pt in MmioSPTWalk.
unmap_spt Acquires the SMMU page table lock and calls unset_smmu_pt in MmioSPTWalk.
init_spt Acquires the SMMU page table lock and scrubs the SMMU page table for a given SMMU translation unit before allocating the unit to a new device.
Table A.26: MmioSPTOps API
walk_smmu_pt Walks a given device's SMMU page table using a given iova, and returns the hPA that the iova maps to.
set_smmu_pt Maps a given iova to a hPA in a given device's SMMU page table.
unset_smmu_pt Unmaps a given iova from a given device's SMMU page table.
Table A.27: MmioSPTWalk API
walk_smmu_pgd Walks the SMMU pgd table using a given iova and returns the pgd entry. It calls MmioPTAlloc to allocate a new page for the next level page table if the pgd entry is unmapped.
walk_smmu_pmd Walks the SMMU pmd table using a given iova and returns the pmd entry. It calls MmioPTAlloc to allocate a new page for the next level page table if the pmd entry is unmapped.
walk_smmu_pte Walks the SMMU pte table using a given iova and returns the pte entry.
set_smmu_pte Sets the entry in the SMMU pte table that corresponds to a given iova to an input value.
Table A.28: MmioPTWalk API
alloc_smmu_pmd Allocates a SMMU pmd page table from a given device's page table pool.
alloc_smmu_pte Allocates a SMMU pte page table from a given device’s page table pool.
Table A.29: MmioPTAlloc API
acquire_lock_npt Acquires the per-principal page table lock used to protect a given principal's page table.
acquire_lock_s2page Acquires the S2Page lock used to protect the shared S2Page array.
acquire_lock_core Acquires the core lock used to protect the shared resources managed by KCore, such as the VM identifiers.
acquire_lock_spt Acquires the SMMU page pool lock used to protect the page table used by a given SMMU translation unit.
acquire_lock_smmu Acquires the SMMU lock used to protect the SMMU configuration.
acquire_lock_vm Acquires the lock used to protect accesses to the global VMInfo array.
release_lock_npt Releases the per-principal page table lock used to protect a given principal's page table.
release_lock_s2page Releases the S2Page lock used to protect the shared S2Page array.
release_lock_core Releases the core lock used to protect the shared resources managed by KCore.
release_lock_spt Releases the SMMU page pool lock used to protect the page table used by a given SMMU translation unit.
release_lock_smmu Releases the SMMU lock used to protect the SMMU configuration.
release_lock_vm Releases the lock used to protect accesses to the global VMInfo array.
Table A.30: Locks API
wait_hlock Helper function for acquiring locks. Used for lock verification.
pass_hlock Helper function for releasing locks. Used for lock verification.
Table A.31: LockOpsH API
wait_qlock Helper function for acquiring locks. Used for lock verification.
pass_qlock Helper function for releasing locks. Used for lock verification.
Table A.32: LockOpsQ API
wait_lock Provides the implementation for acquiring KCore's spinlock.
pass_lock Provides the implementation for releasing KCore's spinlock.
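A conventional ticket spinlock illustrates the kind of primitive wait_lock and pass_lock provide. This is a generic C11 sketch under the assumption of a ticket design, not KCore's verified implementation, whose details (memory ordering, barriers) may differ.

```c
#include <stdatomic.h>

/* Ticket spinlock sketch: threads take a ticket and spin until the
 * owner counter reaches it, giving FIFO acquisition order. */
struct spinlock {
    atomic_uint next;    /* next ticket to hand out */
    atomic_uint owner;   /* ticket currently allowed to enter */
};

void wait_lock(struct spinlock *lk)
{
    unsigned ticket = atomic_fetch_add(&lk->next, 1);
    while (atomic_load(&lk->owner) != ticket)
        ;   /* spin until our ticket is served */
}

void pass_lock(struct spinlock *lk)
{
    /* Advance the owner counter, admitting the next waiter. */
    atomic_fetch_add(&lk->owner, 1);
}
```

The wait/pass naming matches the table: wait_lock blocks until the caller holds the lock, and pass_lock hands it to the next waiter.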