A Secure and Formally Verified Commodity Multiprocessor Hypervisor Shih-Wei Li Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy under the Executive Committee of the Graduate School of Arts and Sciences COLUMBIA UNIVERSITY 2021
We verify SeKVM by decomposing the KCore codebase into 34 layered modules. Figure 2.2
shows the KCore layered architecture. The top layer is TrapHandler, which defines KCore’s
interface to KServ and VMs, such as KServ hypercalls and VM exit handlers. Exceptions caused
by KServ and VMs cause a context switch to KCore, calling CtxtSwitch to save CPU register
state to memory, then TrapDispatcher or FaultHandler to handle the respective exception. On
a KServ hypercall, TrapDispatcher calls VCPUOps to handle the VM ENTER hypercall to execute a
VM, and MemHandler, BootOps, and SmmuOps to run their respective hypercall handlers. On a VM
exit, TrapDispatcher calls functions at lower layers if the exception can be handled directly by
KCore; otherwise, CtxtSwitch is called again, protecting VM CPU data and switching to KServ
to handle the exception. On other KServ exceptions, FaultHandler calls MemOps to handle KServ
stage 2 page faults and SmmuOps to handle any KServ accesses to SMMU hardware. FaultHandler
also calls MemOps to handle VM GRANT_MEM and REVOKE_MEM hypercalls. KCore implements
basic page table operations in the layers in MMU PT, including page table walks, mapping or
unmapping a page in a page table, and page table allocation. KCore implements ownership tracking
for each 4KB regular page in PageMgmt, PageIndex, and Memblock for memory access control.
MemOps and MemAux provide memory protection APIs to other layers. KCore provides SMMU page
table operations in the layers in SMMU PT. KCore provides VM boot protection in BootOps, BootAux,
and BootCore. We ported the verified Hacl* [66] library to SeKVM. BootOps calls the Ed25519
library from Hacl* to authenticate signed VM boot images. BootOps and MemOps call the AES
implementation from Hacl* to encrypt or decrypt VM data to support VM management.
We first describe the API of KCore’s top layer, TrapHandler. Appendix A.1 details the API
for the intermediate layers.
2.1.2 TrapHandler API
TrapHandler includes the hypercalls and exception handlers. It provides HypSec’s hypercall
API, shown in Figure 1.1. Table 2.1 lists the hypercalls for supporting the VM CREATE and VM
BOOT APIs, allowing KServ to allocate VMs and participate in secure VM boot. Table 2.2 lists the
hypercalls for the VM ENTER and GET VM STATE APIs, allowing KServ to execute a VCPU and
export and import encrypted VM data. Table 2.3 lists the hypercalls for the IOMMU OPS API, the
clear_vm hypercall for KServ to reclaim memory from a terminated VM, and a timer hypercall
to set timers. Table 2.4 lists the hypercalls used by VMs. Finally, Table 2.5 lists the exception
handlers.
register_vm: Used by KServ to request KCore to create new VMs. KCore allocates a unique VM identifier, which it returns to KServ. It also allocates the per-VM metadata VMInfo (See Section 3.3.1) and a stage 2 page table root for the VM.
register_vcpu: Used by KServ to request KCore to initialize a new VCPU for a specified VM. KCore allocates the VCPUContext (See Section 3.3.1) data structure for the VCPU.
set_boot_info: Used by KServ to pass VM boot image information, such as the image size and load guest physical address, to KCore. An example of a boot image is the VM kernel binary. The images will be verified and loaded to VM memory when the VM boots.
remap_boot_image_page: Used by KServ to pass one page of the VM boot image to KCore. KCore remaps all pages of a VM image to a contiguous range of memory in its address space so it can later authenticate the image.
verify_vm_image: Used by KServ to request KCore to authenticate a VM boot image. KCore authenticates each binary of the boot image, and refuses to boot the VM if authentication fails. Before authenticating a VM image, KCore unmaps all its pages from KServ's stage 2 page table to guarantee that the verified image cannot be later altered by KServ. If authentication succeeds, KCore maps the boot image to the VM's stage 2 page table.
Table 2.1: TrapHandler API: KServ Hypercall Handlers (VM CREATE and VM BOOT)
run_vcpu: Used by KServ to request KCore to run a VM's VCPU on the current physical CPU. KServ passes the VM and VCPU identifiers to KCore. KCore context switches to the VCPU and resolves prior VM exceptions before returning to the VM.
encrypt_vcpu: Used by KServ to request KCore to export encrypted VM CPU data; for VM migration and snapshots.
decrypt_vcpu: Used by KServ to request KCore to import encrypted VM CPU data; for VM migration and snapshots. KCore copies the data to a private buffer before decrypting it.
encrypt_vm_mem: Used by KServ to request KCore to export encrypted VM memory data; for VM migration and snapshots.
decrypt_vm_mem: Used by KServ to request KCore to import encrypted VM memory data; for VM migration and snapshots. KCore copies the data to a private buffer before decrypting it.
Table 2.2: TrapHandler API: KServ Hypercall Handlers (VM ENTER and GET VM STATE)
smmu_alloc_unit: Used by KServ to request KCore to allocate an SMMU translation unit for a given device. KCore sets the owner of the SMMU translation unit to the owner of the device.
smmu_free_unit: Used by KServ to request KCore to deallocate an SMMU translation unit previously used by a device. If a device was owned by a VM, KCore ensures that deallocation can succeed when the VM is powered off.
smmu_map: Used by KServ to request KCore to map a 4KB page in a device's SMMU page table, from a device virtual address (iova) to the hPA of the page. KCore rejects the request if the owner of the page is different from the owner of the device. KServ is allowed to map a page to the SMMU page table of a VM-owned device before the VM boots.
smmu_unmap: Used by KServ to request KCore to unmap an iova in a device's SMMU page table. KServ is only allowed to do so after the VM that owns the device is powered off.
smmu_iova_to_phys: Used by KServ to request KCore to walk a device's SMMU page table. Given an iova, KCore returns the corresponding physical address.
clear_vm: Used by KServ to request KCore to reclaim pages from a terminated VM. KCore will scrub all pages of the terminated VM and set their ownership to KServ.
set_timer: Used by KServ to request KCore to update a privileged EL2 timer register with a timer counter offset; for timer virtualization, since SeKVM offloads timer virtualization to KServ.
Table 2.3: TrapHandler API: KServ Hypercall Handlers (IOMMU OPS)
GRANT_MEM: Used by a VM to grant KServ access to its data in a memory region. KCore sets the share field in the S2Page (See Section 3.3.1) structure for each page in the memory region.
REVOKE_MEM: Used by a VM to revoke KServ's access to a previously shared memory region. KCore clears the share field in the S2Page structure for each of the pages in the memory region and unmaps the pages from KServ.
psci_power: Used by a VM to request KCore to configure VM power states via Arm's PSCI power management interface [67]; the request is passed to KServ for power management emulation. It calls VMPower to update or retrieve the VM's power state.
Table 2.4: TrapHandler API: VM Hypercall Handlers
host_page_fault: Handles stage 2 page faults for KServ. KCore builds the mapping for the faulted address in KServ's stage 2 page table if the access is allowed. An identity mapping is used for hPAs in KServ's stage 2 page table, allowing KServ to implicitly manage all free physical memory. KCore also handles the page faults caused by KServ's MMIO accesses to the SMMU.
vm_page_fault: Handles stage 2 page faults for a VM, which occur when a VM accesses unmapped memory or its MMIO devices. KCore context switches to KServ for further exception processing, to allocate memory for the VM and emulate VM MMIO accesses. KCore copies the I/O data from KServ to the VM's GPRs on MMIO reads, and vice versa on MMIO writes.
handle_irq: Handles physical interrupts that result in VM exits; KCore context switches to KServ for the interrupt handling.
handle_wfx: Handles VM exits due to WFI/WFE instructions. KCore context switches to KServ to handle the exception.
handle_sysreg: Handles VM exits due to accesses to privileged system registers, handled directly by KCore.
Table 2.5: TrapHandler API: Exception Handlers
2.2 Layered Methodology for Verifying SeKVM
We take the first steps toward verifying the correctness of SeKVM’s TCB, KCore, by combin-
ing the layered implementation of the TCB with a layered hardware model. We start with a bottom
machine model that supports comprehensive multiprocessor hardware features such as multi-level
page tables, tagged TLBs, and a coherent cache hierarchy with bypass support. We use layers to
gradually refine the detailed low-level machine model to a higher-level and simpler abstract model.
Finally, we verify each layer of software by matching it with the simplest level of machine model
abstraction, reducing the proof burden to make it possible to verify commodity software using
these hardware features.
We leverage Certified Concurrent Abstraction Layers (CCALs) [18, 68] to verify the correct-
ness of the multiprocessor TCB. Each abstraction layer consists of three components: the underlay
interface, the layer’s module implementation, and its overlay interface. Each interface exposes
abstract primitives, encapsulating the implementation of lower-level routines, so that each layer’s
implementation may invoke the primitives of the underlay interface as part of its execution. We
allow primitives from multiple lower layers to be passed to a single given layer’s underlay interface.
For each layer I of KCore’s implementation, we prove that I running on top of the underlay
interface L refines its (overlay) specification S, written I@L ⊑ S. Because the layer refinement relation
⊑ is transitive, we can incrementally refine KCore's entire implementation as a stack of layer
specifications. For example, given a system consisting of layer implementations M3, M2, and M1,
their respective layer specifications S3, S2, and S1, and a base machine model specified by S0, we
prove M1@S0 ⊑ S1, M2@S1 ⊑ S2, and M3@S2 ⊑ S3. In other words, once a module of the
implementation is proven to refine its layer specification, we can use that simpler specification,
instead of the complex module implementation, to prove other modules that depend on it. We
compose these layers to obtain (M3 ⊕ M2 ⊕ M1)@S0 ⊑ S3, proving that the behavior of the
system’s linked modules together refines the top-level specification S3.
[Figure 2.3: Refinement of Machine Models. (a) The machine model of the bottom layer AbsMachine; (b) the machine model after the machine refinement.]

All KCore interface specifications and refinement proofs are manually written in Coq, which
we trust, with 34 interface specifications matching the layers in Figure 2.2. We use CompCert [69]
to parse each layer of the C implementation into Clight representation, an abstract syntax tree
(AST) defined in Coq; the same is done manually for assembly code. We then use that Coq
representation to prove that the layer implementation refines its respective interface specification
at the C and assembly level. Note that the C functions that we verify may invoke primitives
implemented in assembly and introduced in the bottom machine model. We enforce that these
assembly primitives do not violate C calling conventions and that parameters are correctly passed. For
example, we verify the correctness of TLB maintenance code, which is implemented in C, but
invokes primitives implemented in assembly.
2.2.1 AbsMachine: Abstract Multiprocessor Machine Model
Each of KCore’s layer modules successively builds upon AbsMachine, our multiprocessor ma-
chine model. This abstract hardware model constitutes the foundation of our correctness proof. As
shown in Fig. 2.3, AbsMachine includes multiple CPUs and a shared main memory. AbsMachine
models general purpose and system registers for each CPU. It also models Arm hardware features
relevant to modern hypervisor implementation, including the multi-level stage 1, stage 2, and
SMMU page tables, a physically indexed, physically tagged (PIPT) shared data cache, and TLBs.
In multiprocessor settings, page tables are usually shared by multiple CPUs. For instance, a mul-
tiprocessor VM running on different CPUs uses the same copy of the stage 2 page table. We must
account for the shared multi-level page tables when proving the correctness of SeKVM’s multipro-
cessor TCB. In this chapter, we discuss the correctness proof of the TCB in managing multi-level
page tables. We present the correctness proof of the TCB in managing shared multi-level page
tables in Chapter 3. The shared data cache is semantically equivalent to Arm’s multi-level cache
hierarchy with coherent caches. KCore uses stage 2 page tables to translate guest physical ad-
dresses to actual physical addresses on the host, and uses its EL2 stage 1 page table to translate its
virtual addresses to physical addresses. AbsMachine models the particular hardware configuration
of KCore that we verify. For example, although Arm supports 1GB, 2MB, and 4KB mappings in
stage 2 page tables, KCore only uses 4KB and 2MB mappings in stage 2 page tables, since 1GB
mappings result in fragmentation. Thus, we model a VM’s memory load and store accesses in
AbsMachine over stage 1 and stage 2 page tables using 4KB and 2MB mappings. AbsMachine
models concurrent executions of multiple CPUs. Further details are discussed in Section 3.1.1.
Although the abstract machine model is specified in the bottom layer of our proof, each suc-
cessive layer implicitly has a machine model used to express how events at that layer affect the
machine state. For example, each layer has some notion of memory to support memory load and
store primitives. For many layers, most primitives and their effect on the machine model at the
overlay interface are the same as those at the underlay interface. These pass-through primitives
and their effects on the machine state do not need to be re-specified for each higher layer. On the
other hand, each layer may define new primitives based on a higher-level machine model, so long
as a refinement can be proven between the layer’s implementation over the underlay interface and
the overlay interface.
A key aspect of our proofs is to abstract away the low-level details of the machine model, layer
by layer, by proving refinement between the software implementation using a lower-level machine
model and its specification based on a higher-level machine model. For example, we verify that
the TLBs are maintained correctly by lower hypervisor layers, such that the TLB behavior exposed
by AbsMachine is encapsulated by the hypervisor implementation at lower layers, and is thus
abstracted from the higher layer specifications.
2.2.2 Page Table Management
As shown in Figure 2.3, AbsMachine includes multi-level page tables. Like Arm hardware,
a page table can include up to four levels, referred to using Linux terminology as pgd, pud, pmd,
and pte. AbsMachine models both regular (4KB) and huge (2MB) page table mappings, as used
by KVM and also employed by KCore. KCore maintains stage 2 page tables — one per VM and
one for KServ — and its EL2 stage 1 page table, all modeled by AbsMachine. KCore associates a
unique VMID identifier for each VM and KServ, identifying the respective stage 2 page table.
The functions for KCore to manipulate page tables are implemented and verified at the four
layers of the MMU PT module, shown in Figure 2.2. The PTAlloc layer dynamically allocates page
table levels, e.g., pud, pmd, and pte. The PTWalk layer provides helper functions for walking an
individual level of the page table, e.g., walk_pgd, walk_pud, etc. The NPTWalk layer uses PTWalk’s
primitives to perform a full page table walk. The NPTOps layer grabs and releases page table locks
to perform page table operations. For instance, the map_page function maps a VM’s guest physical
frame number (gfn) to a physical frame number (pfn) by calling the set_s2pt function in the
NPTWalk layer to create a new mapping in the VM’s stage 2 page table:
Since TLBs can be refilled from page table contents, the page observers through the TLBs remain
the same after the TLB flush. The subsequent page unmapping does not invalidate the TLBs, so
the sequence of page observer groups through the TLB for this insecure implementation is as follows:
pfn: kserv, pfn: kserv n
which is different from the one in Eq. (2.1), meaning that more information can be released through
the TLBs than through the page tables.
2.2.4 Cache Management
As shown in Figure 2.3, AbsMachine includes physically-addressed writeback caches. Arm
adopts MESI/MOESI cache coherence protocols, guaranteeing that all levels of the cache are consistent:
the hardware retrieves the same contents from caches located at different levels, and updates
to the cache are synchronized across all levels. Arm's multi-level
caches can be modeled by AbsMachine as a uniform global cache. To model hardware that will
invalidate and write back cached entries unbeknownst to software, for example, due to cache line
replacement, AbsMachine exposes a cache-sync primitive that randomly evicts a cache entry
and writes it back to memory. In KCore’s specification, memory load and store operations call
cache-sync before the actual memory accesses to account for all possible cache eviction policies.
While caches are coherent, Arm hardware does not guarantee that cached data is always coherent
with main memory; caches may write back dirty lines at any time. Like other modern architectures,
Arm provides cache maintenance instructions to allow the software to flush cache lines to ensure
what is stored in main memory is up-to-date with what is stored in the cache. AbsMachine provides
a cache-flush primitive that models Arm’s clean and invalidate instruction. The primitive takes a
[Figure 2.4: Attack based on Mismatched Memory Attributes]
pfn as an argument, copies the val of pfn from the cache to main memory if the entry is present in
the cache, then removes pfn’s entry from the cache. Cache mismanagement could result in security
vulnerabilities, so hypervisors must use these instructions to ensure that data accesses across all of
its cores remain coherent, preventing stale data leaks.
Figure 2.4 shows how a malicious VM could leverage cache mismanagement on Arm hardware
to potentially obtain confidential data of another VM from main memory. Suppose the hypervisor
decides to evict a VM1’s page pfn. It unmaps the page from VM1 and scrubs the page by zeroing
out any residual data. Since the page can no longer be used by VM1, the hypervisor is free to
reassign it to another VM, VM2, by mapping pfn to VM2’s stage 2 page tables (S2PT). Arm hard-
ware guarantees the scrubbing is synchronized across all CPU caches, but does not guarantee it is
written back to the main memory. Arm allows the software to mark whether a page is cacheable or
not by setting the memory attributes in the respective page table entry. When stage 2 translation is
enabled, Arm combines memory attribute settings in stage 1 and stage 2 page tables. For a given
mapping, caching is only enabled when both stages of page tables enable caching. Hypervisors
allow VMs to manage their stage 1 page tables for performance reasons. Although KCore always
enables caching in stage 2 page tables, an attacker in VM2 could disable caching for the mapping
to pfn in its stage 1 page table, bypassing the caches and directly accessing pfn in main memory,
which could contain confidential VM1 data. To protect VM memory against this attack, the hy-
pervisor should invalidate pfn’s associated cache line after scrubbing the page to ensure that the
changes are written back to the main memory. This ensures VM2 can never retrieve VM1’s secret
in the main memory.
To ensure that KCore correctly manages caches, we verify it over AbsMachine, which models
writeback caches and cache bypass. AbsMachine models both cache and main memory as partial
maps pfn ↦ val, where val is the content stored in a given pfn. As a pfn moves between cache and
main memory, AbsMachine propagates its content with it. For example, on a cacheable memory
access, AbsMachine checks if the cache contains a mapping for pfn. If it does not, AbsMachine
populates the cache with val from main memory. It then returns val for memory loads and updates
the cached value for memory stores. Similarly, on a cache-flush or cache-sync, AbsMachine
flushes the pfn to the main memory, populating the main memory with the respective val from the
cache.
Using AbsMachine, we prove that KCore always sets the memory attributes in the page tables
that it manages to enable caching, maximizing performance. We then prove that KCore flushes
caches in the primitives that can change page ownership, verifying that the KCore implementation
refines its specification. Finally, we use KCore’s specification to prove that KCore’s cache man-
agement has no security vulnerabilities and does not compromise VM data. We discuss the first
two proofs here but defer the latter proof to Section 3.3.2.
We first prove that KCore always sets the memory attributes in the stage 2 page tables for VMs
and KServ to enable caching. KCore updates stage 2 page table entries by calling the verified
map_page primitive, as discussed in Section 2.2.2. map_page is passed the attr parameter to set
the page table entry attributes. We verify that the primitives that call map_page pass in the correct
attr to enable caching. Specifically, we verify that the implementations of map_pfn_vm and
map_pfn_host in the MemAux layer, which call map_page to map a pfn to a VM's and KServ's stage 2
page table respectively, refine specifications that pass an attr value with caching enabled to map_page.
We also prove that KCore always sets the memory attributes in its own EL2 stage 1 page tables to
enable caching. Similar to map_page, NPTOps provides a map_page_core primitive for updating
EL2 stage 1 page tables, which in turn calls set_s1pt in NPTWalk to update the multi-level page
tables — we proved the correctness of these primitives similarly to the proofs for map_page and
set_s2pt. We then verify that the primitives that call map_page_core pass in the correct attr to
enable caching.
We then prove that KCore correctly flushes cache lines in the primitives that change page
ownership. Specifically, we prove the correctness of assign_pfn_vm and clear_vm_page in the
MemAux layer. assign_pfn_vm unmaps pfn from KServ and assigns the owner of a newly allocated
pfn to a VM. clear_vm_page reclaims a pfn from a VM upon the VM’s termination, scrubs the
pfn, and assigns the owner of the pfn to KServ. We prove that the implementations of both
primitives refine their corresponding specifications that call cache-flush.
2.2.5 SMMU Management
As shown in Figure 2.3, AbsMachine models Arm’s SMMU, which supports a shared SMMU
TLB and SMMU multi-level page tables, which can be allocated for each device devk. The TLB is
tagged, and page tables can support up to four levels of paging with regular and huge page support,
similar to the page tables and TLBs discussed in Sections 2.2.2 and 2.2.3. Unlike memory accesses
from CPUs, there are no caches involved in memory accesses through the SMMU. For simplicity,
we only describe the SMMU stage 2 page tables, used by the SMMU implementation [70] on
the Arm Seattle server hardware we used for evaluation. AbsMachine also provides dev_load
and dev_store operations to model memory accesses of DMA-capable devices attached to the
SMMU.
KCore controls the SMMU and maintains the SMMU TLB and SMMU page tables for each
devk. TLB entries are tagged by VMID. The parts of KCore that manipulate page tables are the
four layers of SMMU PT shown in Figure 2.2. Similar to how we refine multi-level page tables in
NPTWalk as discussed in Section 2.2.2, we refine the SMMU multi-level page table and its multi-
level page table walk in MmioSPTWalk in SMMU PT into a layer specification with a partial map that
maps an input page frame from device address space, devfn ↦ (pfn, size, attr), where size is
the size of the page, 4KB or 2MB, and attr encompasses attributes of the page. Once we prove
this refinement, higher layers that depend on SMMU page tables can be verified against the abstract
page table, enabling us to prove the correctness of KCore’s SMMU page table management.
Similar to how we refine CPU TLBs as discussed in Section 2.2.3, we refine the SMMU TLB in
MmioSPTOps so that it is abstracted away from higher layers. We model the SMMU TLB as a set of
partial maps, each map identified by VMID and mapping devfn ↦ (pfn, size, attr). AbsMachine
models SMMU TLB invalidation by exposing a smmu-tlb-flush primitive to flush all entries
associated with a VMID. We prove the correctness of KCore with the SMMU TLB by verifying
it correctly flushes entries to ensure consistency with the SMMU page tables, then abstract away
the TLB by proving that the MmioSPTOps implementation using the SMMU TLB refines a simpler,
higher-level specification without the SMMU TLB. We prove unmap_spt in MmioSPTOps calls
smmu-tlb-flush after unmapping a pfn from the SMMU page table.
2.3 Summary
In this chapter, we presented SeKVM, a KVM hypervisor retrofitted based on the HypSec de-
sign. To verify SeKVM’s TCB on a multiprocessor machine with realistic hardware features, we
introduced a layered hypervisor architecture and verification methodology. First, we leveraged
HypSec to split KVM into two layers, a higher layer consisting of a large set of untrusted hy-
pervisor services, and a lower layer consisting of the hypervisor TCB. Next, we used layers to
modularize the implementation and proof of the TCB to reduce the proof effort, modeling mul-
tiprocessor hardware features at different levels of abstraction tailored to each layer of software.
Using this approach, we have taken the first steps to verify the correctness of SeKVM’s TCB, using
a novel layered machine model that accounts for widely-used multiprocessor hardware features,
including multi-level page tables, tagged TLBs, and multi-level caches with cache bypass support.
Chapter 3: Verifying Security Guarantees
Although functional correctness can be verified by showing that the software implementation
running on the machine model refines its specification, such a correctness proof may not guarantee
that the desired properties of the software are satisfied. Specifications may themselves be buggy
and not provide the desired properties. When verifying a hypervisor, in addition to verifying the
functional correctness of its TCB, it is also crucial to verify the security properties of the entire
hypervisor.
In this chapter, we verify the security guarantees of SeKVM, demonstrating that it protects
VM data confidentiality and integrity on multiprocessor hardware. We do this in two steps. First,
we build on the proofs described in Chapter 2 to verify the functional correctness of SeKVM’s
TCB, KCore, showing that its implementation refines the specification. Then, we use KCore’s
specification to prove the security properties of the entire system. Because the specification is
easier to use for higher-level reasoning, it becomes possible to prove security properties that would
be intractable if attempted directly on the implementation. A vital requirement of this approach is
that we must ensure the specification soundly captures all behaviors of KCore’s implementation,
so the security guarantees proven over the specification hold on the implementation. However,
refinement may not preserve security properties, such as data confidentiality and integrity in a
multiprocessor setting [26, 27] because intermediate updates to shared data within critical sections
can be hidden by refinement, yet visible to concurrently running code on different CPUs.
To reason about KCore in a multiprocessor setting, I introduce security-preserving layers to ex-
press KCore’s specification as a stack of layers, so that each module of its implementation may be
incrementally proven to refine its layered specification and preserve security properties. Security-
preserving layers are a drop-in replacement for CCALs in the SeKVM layered architecture that
provide the additional benefit of ensuring that the refinement of multiprocessor code does not hide
information release. We use security-preserving layers to verify, for the first time, the functional
correctness of a multiprocessor system with shared page tables. Using security-preserving lay-
ers, we gradually refine detailed hardware and software behaviors at lower layers into simpler
abstract specifications at higher layers. We ensure that the composition of layers embodied by
KCore’s top-level specification reflects all intermediate updates to shared data across the entire
KCore implementation. We can then use the abstract top-level specification to prove the system’s
information-flow security properties and ensure those properties hold for the implementation.
Next, we use KCore’s specification to prove that any malicious behavior of the untrusted KServ
using KCore’s interface cannot violate the desired security properties. We prove VM confidential-
ity and integrity using KCore’s specification, formulating the guarantees in terms of noninterfer-
ence [28] to verify that there is no information leakage between VMs and KServ. However, a strict
noninterference guarantee is incompatible with commodity hypervisor features, including KVM’s.
For example, a VM may send encrypted data via shared I/O devices virtualized via untrusted hyper-
visor services, thereby not leaking private VM data. This kind of intentional information release,
known as declassification [71], should be distinguished from unintentional information release.
We incorporate data oracles [25] to mask the intentional information flow and distinguish it from
unintentional data leakage. After this masking, any outstanding information flow is unintentional
and must be prevented, or it will affect the behavior of KServ or VMs. To show the absence of
unintentional information flow, we prove noninterference assertions hold for any behavior by the
untrusted KServ and VMs, interacting with KCore’s top layer specification. The noninterference
assertions are verified over this specification, for any implementation of KServ, but since KCore’s
implementation refines its specification via security-preserving layers, unintentional information
flow is guaranteed to be absent for the entire KVM implementation.
3.1 Security-preserving Refinement
Using KCore’s C and assembly code implementation to prove SeKVM’s security properties
is impractical, as we would be inundated by implementation details and concurrent interleavings.
Instead, we show that the implementation of the multiprocessor KCore incrementally refines a
high-level Coq specification. We then prove that any implementation of KServ or VMs interacting
with the top-level specification satisfies the desired security properties, ensuring that the entire
SeKVM system is secure regardless of the behavior of any principal. To guarantee that proven
top-level security properties reflect the behavior of the implementation of KCore, we must ensure
that each level of refinement fully preserves higher-level security guarantees.
To enable incremental and modular verification, I introduce security-preserving layers:
Definition 1 (Security-preserving layer). A layer is security-preserving if and only if its specifi-
cation captures all information released by the layer implementation.
Security-preserving layers build on our initial layered architecture based on CCALs to verify cor-
rectness of multiprocessor code. Security-preserving layers retain the composability of CCALs,
but unlike CCALs, ensure refinement preserves security guarantees in a multiprocessor setting. We
prove that KCore’s implementation refines a stack of security-preserving layers, such that the top
layer specifies the entire system by its functional behavior over its machine state.
To simplify proof effort, we create a set of proof libraries and helper functions, including a
security-preserving layer library that soundly abstracts away complications arising from potential
concurrent interference, so that we may leverage sequential reasoning to simplify
layer refinement proofs. The key challenge is handling objects shared across multiple CPUs, as
we must account for how concurrent operations interact with them while reasoning about the local
execution of any given CPU.
Example 1 (Simple page table). We illustrate this problem using a simplified NPT example.
In our example, the NPT of a VM is allocated from its own page table pool. The pool consists of
page table entries, whose implementation is encapsulated by a lower layer interface that exposes
functions pt_load(vmid, ofs) to read the value at offset ofs from the page table pool of VM
vmid and pt_store(vmid, ofs, val) to write the value val at offset ofs. To keep the example
simple, we use a simplified version of the actual NPT we verified, ignore dynamic allocation and
permission bits, and assume two-level paging denoted with pgd and pte. Consider the following
two implementations, which map gfn to pfn in VM vmid's NPT:

void set_npt(uint vmid, uint gfn, uint pfn)
{
    acq_lock_npt(vmid);
    // load the pte base address
    uint pte = pt_load(vmid, pgd_offset(gfn));
    pt_store(vmid, pte_offset(pte, gfn), pfn);
    rel_lock_npt(vmid);
}
Fixpoint replay (l: Log) (st: SharedObj) :=
  match l with
  | e :: l' =>
      match replay l' st with
      | Some (st', _) => replay_event e st'
      | None => None
      end
  | _ => Some st
  end.
The replay function recursively traverses the log to reconstruct the state of shared objects, invok-
ing replay_event to handle each event and update shared object state; this update may fail (i.e.,
return None) if the event is not valid with respect to the current state. For example, the replay
function returns the load result for a page table pool load event P_LD, but the event is only allowed
if the lock is held by the current CPU:

Definition replay_event (e: Event) (obj: SharedObj) :=
  match e with
  | (P_LD vmid ofs, cpu) =>
      match ZMap.get vmid (pt_locks obj) with
      | Some cpu' => (* the pt lock of vmid is held by cpu' *)
          if cpu =? cpu' (* if cpu = cpu' *)
          then let pool := ZMap.get vmid (pt_pool obj) in
               Some (obj, Some (ZMap.get ofs pool))
          else None (* fails when held by a different cpu *)
      | None => None (* fails if not already held *)
      end
  | ... (* handle other events *)
  end.
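The replay idea can be sketched in C under simplifying assumptions (hypothetical event kinds, a single lock per object, and no page table pool): a load event is valid only when the lock is held by the CPU that emitted it, so replaying a log fails exactly when the log records a racy access.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds mirroring the Coq example:
   lock acquire/release and a page table pool load. */
typedef enum { EV_ACQ, EV_REL, EV_LD } EvKind;

typedef struct { EvKind kind; int cpu; } Event;

typedef struct {
    int lock_holder;   /* -1 when the lock is free */
} SharedObj;

/* Replay the log oldest-first, rebuilding the shared state.
   Returns 1 on success, 0 if some event was invalid, e.g.
   a load by a CPU that does not hold the lock. */
int replay(const Event *log, size_t n, SharedObj *st) {
    st->lock_holder = -1;
    for (size_t i = 0; i < n; i++) {
        const Event *e = &log[i];
        switch (e->kind) {
        case EV_ACQ:
            if (st->lock_holder != -1) return 0; /* already held */
            st->lock_holder = e->cpu;
            break;
        case EV_REL:
            if (st->lock_holder != e->cpu) return 0;
            st->lock_holder = -1;
            break;
        case EV_LD:
            if (st->lock_holder != e->cpu) return 0; /* data race */
            break;
        }
    }
    return 1;
}
```

Unlike the Coq version, which recurses over the newest event first, this sketch walks the log oldest-first; the validity condition it enforces is the same.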
Our abstract machine is formalized as a transition system, where each step models some atomic
computation taking place on a single CPU; concurrency is realized by the nondeterministic inter-
leaving of steps across all CPUs [73]. To simplify reasoning about all possible interleaving, we lift
multiprocessor execution to a CPU-local model, which distinguishes execution taking place on a
particular CPU from its concurrent environment [18].
All effects coming from the environment are encapsulated by and conveyed through an event
oracle, which yields events emitted by other CPUs when queried. How the event oracle synchro-
nizes these events is left abstract, its behavior constrained only by rely-guarantee conditions [74].
CPUs need only query the event oracle before interacting with shared objects, since its private state
is not affected by these events. Querying the event oracle will result in a composite event trace of
the events from other CPUs interleaved with events from the local CPU.
For example, the pt_load’s specification shown below queries event oracle o to obtain events
from other CPUs, appends them to the logical log (producing l0), checks the validity of the load
and calculates the load result using the replay function, then appends a load event to the log
(producing l1):

(* Event Oracle takes the current log and produces
   a sequence of events generated by other CPUs *)
Definition EO := Log -> Log.

Definition pt_load_spec (o: EO) (st: AbsSt) (vmid ofs: Z) :=
  let l0 := o (log st) ++ log st in (* query event oracle *)
  (* produce the P_LD event *)
  let l1 := (P_LD vmid ofs, cid st) :: l0 in
  match replay l1 with
  (* log is valid and P_LD event returns r *)
  | Some (_, Some r) => Some (st {log: l1}, r)
  | _ => None
  end.
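The query-move/CPU-local-move structure of such specifications can be sketched in C, with all types and the one_env_event oracle as hypothetical stand-ins: one specification step first drains environment events from the oracle into the log, then appends its own event.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOG 64

typedef struct { int id; int cpu; } Event;

typedef struct { Event ev[MAX_LOG]; size_t len; } Log;

/* An event oracle: given the current log, emit up to `max`
   environment events into `out`, returning how many. */
typedef size_t (*EventOracle)(const Log *log, Event *out, size_t max);

/* One specification step on CPU `cpu`: first the query move,
   then the CPU-local move appending event `id`. */
void spec_step(Log *log, EventOracle oracle, int cpu, int id) {
    Event env[MAX_LOG];
    size_t n = oracle(log, env, MAX_LOG);       /* query move */
    for (size_t i = 0; i < n && log->len < MAX_LOG; i++)
        log->ev[log->len++] = env[i];
    if (log->len < MAX_LOG) {                   /* CPU-local move */
        log->ev[log->len].id = id;
        log->ev[log->len].cpu = cpu;
        log->len++;
    }
}

/* A degenerate oracle for testing: the environment always
   emits one fixed event from CPU 1. */
size_t one_env_event(const Log *log, Event *out, size_t max) {
    (void)log;
    if (max == 0) return 0;
    out[0].id = 100;
    out[0].cpu = 1;
    return 1;
}
```

Because the oracle is a parameter, nothing in spec_step depends on any particular interleaving: any oracle satisfying the rely-guarantee conditions may be plugged in.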
Since the interleaving of events is left abstract, our proofs do not rely on any particular
interleaving of events and therefore hold for all possible concurrent interleavings. A CPU captures
the effects of its concurrent environment by querying the event oracle, a query move, before its own
CPU step, a CPU-local move.
Figure 3.2 illustrates query move and CPU-local move in the context of the event trace produced
by set_npt’s implementation to refine its specification. The bottom trace shows events produced
by set_npt’s implementation as it interacts with the shared lock and page table pool it uses. The
query move before ACQ_LK yields all events from other CPUs prior to ACQ_LK; the query move
before P_LD yields all events from other CPUs since the last query up until P_LD. The end result of
its execution is a composite event trace of the events from other CPUs, interleaved with the events
from the local CPU.
Interleaving query and CPU-local moves still complicates reasoning about set_npt’s imple-
mentation. However, if we can guarantee that events from other CPUs do not interfere with the
shared objects used by set_npt, we can safely shuffle events from other CPUs to the beginning or
end of its critical section. For example, if we could prove that set_npt’s implementation is DRF,
then other CPUs will not produce events within set_npt’s critical section that interact with the
locked NPT. We would then only need to make a query move before the critical section, not within
the critical section, allowing us to sequentially reason about set_npt's critical section as an atomic
operation.

Figure 3.2: Querying the Event Oracle to Refine set_npt
Unfortunately, as shown by set_npt_insecure, even if set_npt correctly uses locks to pre-
vent concurrent NPT accesses within KCore’s own code, it is not DRF because KServ or VMs
executing on other CPUs may indirectly read the contents of their NPTs through the MMU hard-
ware. This prevents us from soundly shuffling event queries outside of the critical section and
employing sequential reasoning to refine the critical section to an atomic step. If set_npt cannot
be treated as an atomic primitive, sequential reasoning would then be problematic to use for any
layer that uses set_npt, making their refinement difficult. Without sequential reasoning, verifying
a large system like KCore is not feasible.
3.1.2 Transparent Trace Refinement
We observe that information leakage can be modeled by read events that occur arbitrarily
throughout critical sections, without regard for locks. To ensure that refinement does not hide
this information leakage, transparent trace refinement treats read and write events separately. We
view shared objects as write data-race-free (WDRF) objects—shared objects with unavoidable
concurrent observers. For these objects, we treat their locks as write-locks, meaning that query
moves that yield write events may be safely shuffled to the beginning of the critical section. Query
moves in the critical section may then only yield read events from those concurrent readers.
Figure 3.3: Transparent Trace Refinement of Insecure and Secure set_npt Implementations
To determine when read events may also be safely shuffled, each WDRF object must define
an event observer function, which designates what concurrent CPUs may observe: they take the
current machine state as input, and produce some observed result, with consecutive identical event
observations constituting an event observer group. Event observer groups thus represent all pos-
sible intermediate observations by concurrent readers. Since the event observations are the same
in an event observer group, read events from other CPUs will read the same values anywhere in
the group and can be safely shuffled to the beginning, or end, of an event observer group, reducing
the verification effort of dealing with interleaving. Our security-preserving layers enforce that any
refinement of WDRF objects must satisfy the following condition:
Definition 2 (Transparency condition). The list of event observer groups of an implementation
must be a sublist of that generated by its specification. That is, the implementation reveals at most
as much information as its specification.
This condition ensures that the possibility of concurrent readers and information release is
preserved through each layer refinement proof. In particular, if a critical section has at most two
distinct event observer groups, read events can be safely shuffled to the beginning or end of the crit-
ical section. Query moves are no longer needed during the critical section, but can be made before
or after the critical section for both read and write events, making it possible to employ sequential
reasoning to refine the critical section. Transparent trace refinement can thereby guarantee that
events from other CPUs do not interfere with shared objects in critical sections. Figure 3.3 illus-
trates how this technique fares against our earlier counterexample, as well as to our original, secure
implementation. Each node in Figure 3.3 represents an event observation. Nodes of the same color
constitute an event observer group. The insecure example shows that set_npt_insecure does not
satisfy the transparency condition. There is an intermediate observation (shown in red) that cannot
map to any group in the specification. set_npt_insecure has three event observer groups that
can observe three different values, before the first pt_store, between the first and second pt_-
store, and after the second pt_store. Read events after the first pt_store cannot be shuffled
before the critical section. On the other hand, set_npt has only two event observer groups, one
that observes the value before pt_store, and one that observes the value after pt_store, so query
moves are not needed during the critical section. The implementation can therefore be refined to
an atomic set_npt specification. Refinement proofs for higher layers that use set_npt can then
treat set_npt as an atomic primitive, simplifying those proofs since set_npt can be viewed as just
one atomic computation step instead of many CPU-local moves with intervening query moves.
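The transparency condition itself is mechanical enough to sketch in C with hypothetical helpers: collapse a trace of observed values into event observer groups (runs of identical observations), then check that the implementation's groups form an order-preserving sublist of the specification's.

```c
#include <assert.h>
#include <stddef.h>

/* Collapse consecutive duplicate observations into observer
   groups; returns the number of groups written to `groups`. */
size_t observer_groups(const int *obs, size_t n, int *groups, size_t max) {
    size_t g = 0;
    for (size_t i = 0; i < n; i++) {
        if (g > 0 && groups[g - 1] == obs[i]) continue;
        if (g < max) groups[g++] = obs[i];
    }
    return g;
}

/* Is `a` (length n) an order-preserving sublist of `b` (length m)? */
int is_sublist(const int *a, size_t n, const int *b, size_t m) {
    size_t i = 0;
    for (size_t j = 0; j < m && i < n; j++)
        if (b[j] == a[i]) i++;
    return i == n;
}
```

In this sketch, a secure set_npt exposes observations {old, old, new}, which collapse to the two groups {old, new} of its atomic specification, while an insecure variant exposing an intermediate value produces three groups and fails the sublist check.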
Example Proofs: Transparent Trace Refinement. We expand the simple set_npt example
discussed in Section 3.1 with a layered implementation to demonstrate SeKVM’s layer refinement
proof. We use a simplified version of the actual layers in KCore. set_npt is analogous to the
map_page primitive in KCore's actual implementation. Both acquire and release the per-principal
page table lock to protect access to the shared NPTs. However, unlike
Based on the generated Coq AST and the primitives specified by the underlay interface, our tool
can automatically infer the following operational Coq specification for set_npt_low, replacing
function calls with primitives to interact with the underlay state.

Definition set_npt_low_spec (o: EO) (st: AbsSt) (gfn pfn vmid: Z) :=
  match acq_lock_npt_spec vmid st with
  | Some st1 =>
      let pgd_off := pgd_offset gfn in
      let pte := pt_load_spec o st1 vmid pgd_off in
      let pte_off := pte_offset pte gfn in
      let st2 := pt_store_spec o st1 vmid pte_off pfn in
      rel_lock_npt_spec vmid st2
  | None => None
  end.
    else res = 0; // pfn is not owned by KSERV
    rel_lock_s2pg();
    return res;

// Primitive provided by TrapHandler
void run_vcpu(uint vmid)
{
    ...
    assign_to_vm(vmid, gfn, pfn);
    ...
}
Layer 1: NPTWalk. This layer specifies a set of verified pass-through primitives, and an abstract
state upon which they act. We extend the abstract state from the earlier example to model accesses
to the shared memory and the shared S2Page metadata, and a map from vmid to each VM’s local
state. We update the list of events from our earlier definition as follows:

Inductive Event :=
| ACQ_NPT (vmid: Z) | REL_NPT (vmid: Z)
| P_LD (vmid ofs: Z) | P_ST (vmid ofs val: Z)
| ACQ_S2PG | REL_S2PG
| GET_OWNER (pfn: Z) | SET_OWNER (pfn owner: Z)
| SET_MEM (pfn val: Z).
We also update NPTWalk's abstract state to define a VM's local state:

(* VM local state *)
Record LocalState := {
  data_oracle: ZMap.t Z;      (* data oracle for the VM *)
  doracle_counter: Z;         (* data oracle query counter *)
  ...
}.

(* Abstract state *)
Record AbsSt := {
  log: Log;
  cid: Z;                     (* local CPU identifier *)
  vid: Z;                     (* vmid of the running principal on CPU cid *)
  lstate: ZMap.t LocalState;  (* per-VM local state *)
}.
We extend the definition of the shared objects from the earlier example to model the shared mem-
ory, an array of S2Page metadata, and the lock to protect shared accesses to the S2Page array. The
latter, as mentioned earlier, is used by KCore to enforce memory access control:

(* Shared objects constructed using replay function *)
Record SharedObj := {
  mem: ZMap.t Z;                 (* maps addresses to values *)
  s2pg_lock: option Z;           (* s2pg lock holder *)
  pt_locks: ZMap.t (option Z);   (* pt lock holders *)
  pt_pool: ZMap.t (ZMap.t Z);    (* per-VM page table pool *)
  (* s2pg_array maps pfn to (owner, share, gfn) *)
  s2pg_array: ZMap.t (Z * Z * Z);
}.
We update NPTWalk's layer interface to include the newly exposed primitives as follows:

Definition NPTWalk: Layer AbsSt :=
  acq_lock_npt ↦ csem acq_lock_npt_spec
  ⊕ rel_lock_npt ↦ csem rel_lock_npt_spec
  ⊕ pt_load ↦ csem pt_load_spec
  ⊕ pt_store ↦ csem pt_store_spec
  ⊕ acq_lock_s2pg ↦ csem acq_lock_s2pg_spec
  ⊕ rel_lock_s2pg ↦ csem rel_lock_s2pg_spec
  ⊕ get_s2pg_owner ↦ csem get_s2pg_owner_spec
  ⊕ set_s2pg_owner ↦ csem set_s2pg_owner_spec.
Data oracles can be used for primitives that declassify data, as discussed in Section 3.2. For
example, set_s2pg_owner changes the ownership of a page. When the owner is changed from
KServ to a VM vmid, the page contents owned by KServ are declassified to VM vmid, so a data
oracle is used in the specification of set_s2pg_owner to mask the declassified contents:

Definition set_s2pg_owner_spec (o: EO) (st: AbsSt) (pfn vmid: Z) :=
  let l0 := o (log st) ++ log st in
  let l1 := (SET_OWNER pfn vmid, cid st) :: l0 in
  match replay l1 with
  | Some _ => (* log is valid and lock is held *)
      let st' := st {log: l1} in
      if (vid st =? KSERV) && (vmid != KSERV)
      then (* pfn is transferred from KServ to a VM *)
        mask_with_doracle st' vmid pfn
      else Some st'
  | _ => None
  end.
We introduce an auxiliary Coq definition mask_with_doracle to encapsulate the masking behavior:

Definition mask_with_doracle (st: AbsSt) (vmid pfn: Z) :=
  let local := ZMap.get vmid (lstate st) in
  let n := doracle_counter local in
  let val := data_oracle local n in
  let l := (SET_MEM pfn val, cid st) :: log st in
  let local' := local {doracle_counter: n+1} in
  st {log: l, lstate: ZMap.set vmid local' (lstate st)}.
mask_with_doracle queries the local data oracle of VM vmid with a local query counter, generates
an event to mask the declassified content with the query result, then updates the local counter. Since
each principal has its own data oracle based on its own local state, the behavior of other principals
cannot affect the query result. set_s2pg_owner_spec only queries the data oracle when the owner
is changed from KServ to a VM. When the owner is changed from a VM to KServ, the page is
being freed, and KCore must zero out the page before recycling it; masking is not allowed. We also
introduce auxiliary definitions to mask other declassified data, such as page indices and scheduling
decisions proposed by KServ, which are not shown in this simplified example.
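The data-oracle discipline can be sketched in C under simplifying assumptions: an integer-valued oracle and hypothetical names throughout. Each VM's masked values come from its own deterministic sequence indexed by a private counter, so they cannot depend on any other principal's behavior.

```c
#include <assert.h>

#define MAX_VMS 4

/* Per-VM local state: only the oracle query counter matters
   for this sketch (doracle_counter in the Coq example). */
typedef struct {
    int counter;
} LocalState;

static LocalState lstate[MAX_VMS];

/* The oracle itself: any deterministic function of the VM id
   and the query index works; this one is arbitrary. */
static int data_oracle(int vmid, int n) {
    return vmid * 1000 + n;
}

/* Mask a declassified value for VM `vmid`: query the VM's own
   oracle at its current counter, then bump the counter. */
int mask_with_doracle(int vmid) {
    int n = lstate[vmid].counter++;
    return data_oracle(vmid, n);
}
```

Two runs that start from indistinguishable states for a VM have the same counter and the same oracle, so they produce identical masked values, which is exactly what the noninterference proofs rely on.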
Layer 2: NPTOps. As shown in the example in Section 3.1.2, we prove the implementation of
set_npt transparently refines its specification that specifies a logical map. Primitives related to
page table pools are removed from the layer interface. Other primitives are passed through.
Layer 3: MemOps. This layer introduces the assign_to_vm primitive to transfer a page from
KServ to a VM, and hides NPTOps primitives:

Definition MemOps: Layer AbsSt :=
  assign_to_vm ↦ csem assign_to_vm_spec.
assign_to_vm’s specification has a precondition that it must be invoked by KServ and the vmid
must be valid:

Definition assign_to_vm_spec (o: EO) (st: AbsSt) (vmid gfn pfn: Z) :=
  if (vid st =? KSERV) && (vmid != KSERV)
  then
    let l0 := o (log st) ++ log st in
    let l1 := (ASG_TO_VM vmid gfn pfn, cid st) :: l0 in
    match replay l1 with
    | Some (_, Some res) => (* res is the return value *)
        let st' := st {log: l1} in (* update the log *)
        if res =? 1 (* if pfn is owned by KSERV *)
        then mask_with_doracle st' vmid pfn
        else Some st' (* return without masking the page *)
    | _ => None
    end
  (* get stuck if it's not transferred from KServ to a VM *)
  else None.
It transfers a page from KServ to a VM via set_s2pg_owner, so the contents are declassified and
masked using the data oracle.
Layer 4: TrapHandler. This top layer interface introduces run_vcpu, which invokes assign_-
to_vm and context switches from KServ to the VM. We first prove that run_vcpu does not violate
the precondition of assign_to_vm. We then prove noninterference as discussed in Section 3.3.2.
Here, we can see why the page content will be masked with the same data oracle query results in
the proof of Lemma 3 for run_vcpu in Section 3.3.2. Two indistinguishable states will have the
same VM local states, and therefore the same local data oracle and counter. Thus, the data oracle
query results must be the same.
3.4 Summary
We have formally verified the guarantees of VM data confidentiality and integrity for the mul-
tiprocessor Linux KVM implementation, SeKVM. First, we built on the software and hardware
layered architecture presented in Chapter 2 to prove the functional correctness of the hypervisor
core. I introduced security-preserving layers to incrementally prove that each layer of the core
implementation refines its layered specification and preserves security guarantees on real multi-
processor hardware with multi-level page tables shared across multiple CPUs. We have used the
core’s specification to verify the security guarantees of the entire multiprocessor KVM hypervisor,
even in the presence of information sharing needed for commodity hypervisor features.
Chapter 4: Implementation and Evaluation
We use microverification to verify, for the first time, the guarantees of VM confidentiality and
integrity of the multiprocessor Linux KVM hypervisor. As shown in Figure 4.1, we use microveri-
fication to first retrofit the Arm implementation of KVM into a small hypervisor core, KCore, that
serves as the TCB, and a rich set of the untrusted hypervisor services, KServ, that encapsulates
the rest of the KVM implementation, including the host Linux kernel. We verify SeKVM on a
multiprocessor machine model that accounts for shared multi-level page tables, tagged TLBs, and
writeback caches. We prove that the multiprocessor core refines a stack of security-preserving lay-
ers, such that the top layer specifies the entire system by its functional behavior over its machine
state. We then use KCore’s top layer specification to prove that VM confidentiality and integrity are
protected for any KServ implementation interacting with KCore, thereby proving that the security
guarantees hold for the entire SeKVM hypervisor.
In this chapter, we first present the effort required to retrofit the Arm implementation of KVM
into SeKVM. We then detail the functionality supported by SeKVM and the principles adopted
by the SeKVM implementation to simplify verification. Next, we discuss the verification effort
of SeKVM and the bugs that we discovered in our initial implementation. Finally, we present a
performance evaluation of multiple versions of SeKVM, and an evaluation of practical attacks.
4.1 Retrofitting KVM on Arm
4.1.1 SeKVM Retrofitting Effort and KServ Modifications
Retrofitting Effort. We use microverification to retrofit KVM/ARM [29, 30] into SeKVM, given
Arm’s increasing popularity in server systems [31, 32, 33]. Table 4.1 shows the effort required for
retrofitting mainline KVM in Linux 4.18, measured by LOC in C and assembly. Upon retrofitting,
Figure 4.1: Microverification of the Linux KVM Hypervisor
Retrofitting Component       LOC
QEMU additions               70
KVM changes in KServ         1.5K
HACL in KCore                10.1K
KVM C in KCore               0.2K
KVM assembly in KCore        0.3K
Other C in KCore             3.2K
Other assembly in KCore      0.1K
Total                        15.5K
Table 4.1: SeKVM Retrofitting Effort in LOC
SeKVM's KCore ends up consisting of 3.8K LOC (3.4K LOC in C and 400 LOC in assembly), of
which 0.5K LOC were in the existing KVM code that we verified. In addition, 10.1K LOC were
added for the implementation of Ed25519 and AES in the ported HACL* library. 1.5K LOC were
modified in existing KVM code, a tiny portion of the codebase, such as adding calls to KCore
hypercalls. 70 LOC were also added to QEMU to support secure VM boot and VM migration.
We also retrofitted and verified various other versions of KVM in Linux 4.20, 5.0, 5.1, 5.2, 5.3,
5.4, and 5.5, which involved reusing much of the same code required for the 4.18 Linux kernel. For
instance, less than 100 LOC needed to be changed in KServ going from Linux 4.18 to 5.4, mostly
to support installing and initializing KCore on a different codebase before KCore starts running in
EL2. No code changes were required in KCore in going from Linux 4.18 to all other versions. The
initial retrofit for KVM in Linux 4.18 took one person-year. The port from KVM in Linux 4.18
to another kernel version took less than one person-month. These results indicate that the changes
needed to retrofit a widely-used, commodity hypervisor, so it can be verified and integrated with
multiple versions of a commodity host kernel, were modest overall.
KServ Modifications. We modified KServ in KVM to support SeKVM. We categorize the re-
quired modifications as follows. First, we updated Linux’s linker script to reserve memory regions
private to KCore at the end of the kernel data section to accommodate the page table pools, KCore’s
private metadata, and the intermediate state structures. The boot loader on some Arm hardware
loads the device tree to a fixed memory location that overlaps with KCore’s reserved memory re-
gion. To reconcile the conflict, we allocated a memory buffer in the data section for storing the
overlapped device tree. Second, we modified KServ to initialize KCore’s metadata during boot.
Third, we updated KServ’s code that is in charge of building page table mappings to allocate the
EL2 stage 1 page table from KCore’s private memory pool. Fourth, we updated KServ to map
KCore’s metadata, the intermediate state structures, and all physical memory to the EL2 stage 1
page table. We used 2MB mappings in the EL2 stage 1 page table to map the physical memory,
reducing the amount of page tables required to fulfill the mappings. Fifth, we changed KServ to
allocate large EL2 stack frames for each CPU to support HACL*. HACL* uses a large local array
from the stack in its Ed25519 implementation. Sixth, we instrumented KServ with hypercalls to
support SeKVM. Specifically, we add hypercalls to KServ to install KCore, verify boot images,
safely boot VMs, and import and export encrypted VM data. Finally, we changed the SMMU
driver in KServ to use the IOMMU OPS API.
4.1.2 Virtualization Feature Support
Table 4.2 compares features provided by commodity hypervisors with the SeKVM implementation.
It shows that SeKVM can improve the overall security of KVM without compromising its
hypervisor features. Four KVM features
are not yet fully implemented in SeKVM, namely same page merging in Linux (KSM), swapping,
VM live migration, and checkpoint/restart. These features require additional changes to QEMU
and KServ. For example, appropriate GET VM STATE hypercalls need to be made in KServ to
export and import encrypted VM data for these features.
Table 4.2: Hypervisor Feature Support Comparison for Xen, KVM, and SeKVM (features compared include Secure Boot, Secure VM Boot, and VM Symmetric Multiprocessing (SMP))
Table 4.4: Proof Effort for SeKVM’s Security Proofs in LOC
primitives used by higher layers were passed through to those layers, then verified as part of each
layer. We did not link HACL’s F* proofs with our Coq proofs, or our Coq proofs for C code with
those for Arm assembly code. The latter requires a verified compiler for Arm multiprocessor code;
no such compiler exists. No changes were required to the proofs used to verify KVM in different
Linux versions.
Table 4.4 shows the verification effort for SeKVM’s security properties, measured by LOC in
Coq. The security proofs, including the invariant and noninterference proofs, consist of 4.8K LOC.
1.1K LOC were used to verify the isolation invariants mentioned in Section 3.3.2 for the MMU
and SMMU page tables. The rest of the 3.7K LOC were noninterference proofs for KCore’s top-
level primitives; for example, these proofs involved proving state indistinguishability with respect
to caches.
Among the 4.8K LOC required for the security proofs, 0.4K LOC were needed for defining
the PDLs and auxiliary lemmas we mentioned in Section 3.3.2 for the noninterference proof. The
PDLs and lemmas specify the desired security properties of SeKVM and therefore have to be
trusted. Compared with the rest of the proof effort, they are kept simple because their definition
is orthogonal to KCore’s concrete implementation. Instead, we construct the PDLs and lemmas
straightforwardly over the abstract machine state according to SeKVM’s security policies (See
Section 1.4). For example, consider the memory isolation policy as follows: for a given principal
p, for all memory owned by p, the memory contents remain indistinguishable between a pair of
executions before and after p takes an active step. Our Coq implementation for the policy queries
the S2Page metadata to retrieve the set of memory owned by p for the PDL by comparing the owner
in each of the S2Page against p. The implementation does not depend on, or require knowledge of,
the actual KCore implementation that manages S2Page.
The Coq development effort for KCore’s functional correctness and security proofs took two
person-years. These results show that microverification of a commodity hypervisor can be accom-
plished with modest proof effort.
4.2.2 Bugs Found During Verification.
While verifying KCore, we found various bugs in our initial implementation. Most bugs were
discovered as part of our noninterference proofs, demonstrating a limitation of verification ap-
proaches that only prove functional correctness via refinement alone: the high-level specifications
may themselves be insecure. In other words, these bugs were not detected by just verifying that
the implementation satisfies its specification, but by ensuring that the specification guarantees the
desired security properties of the system.
Overwrite page table mapping. KCore initially did not check if a gfn was mapped before
updating a VM’s stage 2 page tables, making it possible to overwrite existing mappings. For
example, suppose two VCPUs of a VM trap upon accessing the same unmapped gfn. Since KCore
updates a VM’s stage 2 page table whenever a VCPU traps on accessing unmapped memory, the
same page table entry will be updated twice, the latter replacing the former. A compromised KServ
could leverage this bug and allocate two different physical pages, breaking VM data integrity.
We fixed this bug by changing KCore to update stage 2 page tables only when a mapping was
previously empty.
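The fix can be sketched as a check-before-write. This is a minimal sketch with hypothetical names and a flat single-level table; in KCore the check runs under the per-VM page table lock.

```c
#include <assert.h>

#define NPT_SIZE 16

static unsigned long npt[NPT_SIZE];  /* 0 means unmapped */

/* Install gfn -> pfn only if the entry is currently empty, so
   two VCPUs faulting on the same gfn cannot install two
   different physical pages. Returns 1 if the mapping was
   installed, 0 if a mapping already existed. */
int map_if_unmapped(unsigned long gfn, unsigned long pfn) {
    unsigned long *pte = &npt[gfn % NPT_SIZE];
    if (*pte != 0)
        return 0;   /* keep the existing mapping */
    *pte = pfn;
    return 1;
}
```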
Huge page ownership. When KServ allocated a 2MB page for a VM, KCore initially only
validated the ownership of the first 4KB page rather than all the 512 4KB pages, leaving a loophole
for KServ to access VM memory. We fixed this bug by accounting for this edge case in our
validation logic.
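A minimal sketch of the corrected validation, with hypothetical names and a flat ownership array: before accepting a 2MB block, every one of the 512 covered 4KB pages must have the expected owner, not just the first.

```c
#include <assert.h>

#define PAGES 1024
#define KSERV 0    /* owner id for KServ in this sketch */

static int owner[PAGES];  /* pfn -> owner vmid, 0-initialized */

/* Validate that all 512 4KB pages of a 2MB block starting at
   first_pfn are owned by expected_owner. Returns 1 if the
   whole block may be assigned, 0 otherwise. */
int can_assign_2mb(unsigned long first_pfn, int expected_owner) {
    for (unsigned long i = 0; i < 512; i++) {
        if (first_pfn + i >= PAGES)
            return 0;  /* block extends past tracked memory */
        if (owner[first_pfn + i] != expected_owner)
            return 0;
    }
    return 1;
}
```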
No SMMU TLB flush after unmapping. We found a TLB management bug in which SeKVM
did not flush the SMMU TLB after unmapping a page from the SMMU page tables. We fixed the
bug in KCore by adding a SMMU TLB flush after the unmap.
Page table update race. When proving invariants for page ownership used in the noninterference
proofs, we identified a race condition in stage 2 page table updates. When allocating a
physical page to a VM, KCore removes it from KServ’s page table, assigns ownership of the page
to the VM, then maps it in the VM’s page table. However, if KCore is processing a KServ’s stage
2 page fault on another CPU, it could check the ownership of the same page before it was assigned
to the VM, think it was not assigned to any VM, and map the page in KServ’s page table. This
race could lead to both KServ and the VM having a memory mapping to the same physical page,
violating VM memory isolation. We fixed this bug by expanding the critical section and holding
the S2Page array lock, not just while checking and assigning ownership of the page, but until the
page table mapping is completed.
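The shape of the fix can be sketched in C with hypothetical names. A simple flag stands in for the S2Page array lock (KCore uses a real multiprocessor lock): the ownership change and the map update happen inside one critical section, so a concurrent KServ fault handler can never observe the page as unowned after it has been promised to a VM.

```c
#include <assert.h>

#define PAGES 8
#define KSERV 0

static int s2pg_locked = 0;
static int owner[PAGES];        /* pfn -> owner vmid (0 = KServ) */
static int kserv_mapped[PAGES]; /* pfn present in KServ's page table? */

static void lock_s2pg(void)   { assert(!s2pg_locked); s2pg_locked = 1; }
static void unlock_s2pg(void) { s2pg_locked = 0; }

/* Transfer pfn to vm: the check, the ownership change, and the
   unmap from KServ all sit in one critical section (the fix). */
void assign_page_to_vm(int pfn, int vm) {
    lock_s2pg();
    if (owner[pfn] == KSERV) {
        owner[pfn] = vm;
        kserv_mapped[pfn] = 0;  /* remove KServ's mapping */
    }
    unlock_s2pg();
}

/* KServ stage 2 fault handler: may map only KServ-owned pages,
   and checks ownership under the same lock. */
int handle_kserv_fault(int pfn) {
    int mapped = 0;
    lock_s2pg();
    if (owner[pfn] == KSERV) {
        kserv_mapped[pfn] = 1;
        mapped = 1;
    }
    unlock_s2pg();
    return mapped;
}
```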
Multiple I/O devices using same physical page. KCore initially did not manage memory
ownership correctly when a physical page was mapped to multiple KServ SMMU page tables,
with each page table controlling DMA access for a different I/O device, allowing KServ devices to
access memory already assigned to VMs. We fixed this bug by having KCore only map a physical
page to a VM’s stage 2 or SMMU page tables when it is not already mapped to an SMMU page
table used by KServ’s devices.
SMMU static after VM boot. KCore initially did not ensure that mappings in SMMU page
tables remain static after VM boot. This could allow KServ to modify SMMU page table mappings
to compromise VM data. We fixed this bug by modifying KCore to check the state of the VM that
owned the device before updating its SMMU page tables, and only allow updates before VM boot.
No cache flush after loading VM boot images. We found a cache management bug in SeKVM
in which a VM boot image may be cached when loaded from the file system but not written back
to the main memory. As VMs are booted with paging and caching disabled, it is possible that the
VMs access the page content in memory, thereby not using the correct VM images. We fixed the
bug in KCore by flushing the corresponding cache lines for memory that contain the pre-loaded
VM image before booting the VM, ensuring the use of the correct VM image loaded in memory.
4.3 Evaluation
4.3.1 Experimental Setup
We evaluate the performance of unmodified KVM versus the verified SeKVM implementa-
tion across different software and hardware configurations. We ran KVM and SeKVM in both
Linux 4.18 and 5.4 on two different Armv8 hardware configurations: (1) an HP Moonshot m400
server with an 8-core 64-bit Armv8-A 2.4 GHz Applied Micro Atlas SoC, 64 GB of RAM, a
120 GB SATA3 SSD, and a Dual-port Mellanox ConnectX-3 10GbE NIC, and (2) an AMD Seat-
tle (Rev.B0) server with an 8-core 64-bit Armv8-A 2 GHz AMD Opteron A1100 SoC, 16 GB
of RAM, a 512 GB SATA3 HDD for storage, an IOMMU (SMMU-401) to support control over
DMA devices and direct device assignment, and an AMD XGBE 10 GbE NIC. For client-server
workloads, clients ran on another m400 machine when using the m400 server, and ran on an x86
machine with 24 Intel Xeon CPU 2.20 GHz cores and 96 GB RAM when using the Seattle server,
in all cases connected via 10 GbE.
We used different software configurations across the servers to demonstrate the performance
of the verified KVM across multiple software and VM configurations. We used Ubuntu 18.04
and QEMU 3.0 for the m400 server and its VMs, Ubuntu 16.04 and QEMU 2.3.50 for the Seattle
server and its VMs. Furthermore, we used small and large SMP VM configurations, the former
on the m400 server with 2 CPUs and 256 MB RAM and the latter on the Seattle server with 4
CPUs and 12 GB of RAM. A smaller VM configuration was also used in part to show results for
running many SMP VM instances given the RAM limits of the m400 server. We also measured
performance natively on the servers with the host OS capped at using the same number of CPUs
and amount of RAM as the respective VM configuration. KVM was configured with its standard
vhost virtio network, and with cache=none for its virtual block storage devices [82, 83, 84]. All
VMs used paravirtualized I/O, typical of cloud infrastructure deployments. For the single VM
measurements, we pinned each VCPU to a specific physical CPU and ensured that no other work
was scheduled on that CPU [80, 61, 85, 86]. To quantify the hypervisor's ability in scheduling
multiple VMs, we did not pin VCPUs for VMs in our multi-VM measurements.

Hypercall: Transition from the VM to the hypervisor OS and return to the VM without doing any work in the hypervisor. Measures bidirectional base transition cost of hypervisor operations.
I/O Kernel: Trap from the VM to the emulated interrupt controller in the hypervisor OS kernel, and then return to the VM. Measures a frequent operation for many device drivers and a baseline for accessing I/O devices supported by the hypervisor OS kernel.
I/O User: Trap from the VM to the emulated UART in QEMU and then return to the VM. Measures base cost of operations that access I/O devices emulated in the hypervisor OS user space.
Virtual IPI: Issue a virtual IPI from a VCPU to another VCPU running on a different PCPU. Measures time between sending the virtual IPI until the receiving VCPU handles it.

Table 4.5: Microbenchmarks
4.3.2 Microbenchmark Results
We measured the set of microbenchmarks listed in Table 4.5 on SeKVM. Table 4.6 shows the
results measured in cycles for the unmodified KVM and verified KVM in Linux 4.18 for each
hardware configuration. Verified KVM overhead compared to unmodified KVM is much higher
on the m400 server versus the Seattle server because the m400 CPUs have a tiny TLB [87]
compared to the Seattle CPUs. Although KCore supports huge pages for stage 2 page tables for
VMs, the current implementation maps regular 4 KB pages in KServ's stage 2 page tables, so
microbenchmark workloads that spend most of their time running in KServ require more TLB
entries to cache address translations, increasing TLB capacity misses. Newer Arm CPUs have
more reasonable TLB sizes, similar to or greater than the Seattle CPUs, so the Seattle
measurements are more reflective of typical Arm server performance.

Kernbench: Compilation of the Linux kernel using allnoconfig for Arm; m400 compiled v4.18 with GCC 7.5.0, Seattle compiled v4.9 with GCC 5.4.0.
Hackbench: hackbench [88] using Unix domain sockets and process groups running in 500 loops; m400 used 20 groups, Seattle used 100 groups.
Netperf: netperf v2.6.0 [89] running netserver on the server and the client with its default parameters in three modes: TCP_STREAM (throughput), TCP_MAERTS (throughput), and TCP_RR (latency).
Apache: Apache server handling concurrent requests from a remote ApacheBench [90] v2.3 client, serving the index.html of the GCC manual; m400 used v2.4.29 serving the 7.5.0 manual, Seattle used v2.4.18 serving the 5.4.0 manual.
Memcached: memcached v1.4.25 using the memtier benchmark v1.2.3 with its default parameters.
MySQL: MySQL v14.14 (distrib 5.7.26) running SysBench v0.4.12 using the default configuration with 200 parallel transactions.
MongoDB: MongoDB server handling requests from a remote YCSB [91] v0.17.0 client running workload A with 16 concurrent threads; m400 used v3.6.3 with recordcount=10000 and operationcount=50000, Seattle used v4.0.20 with recordcount=500000 and operationcount=100000.
Redis: Redis server handling requests from a remote YCSB v0.17.0 client running workload A; m400 used v4.0.9, Seattle used v3.0.6.

Table 4.7: Application Benchmarks

For Seattle, verified KVM only incurs 17% to 28%
overhead over KVM, with the added benefit of verified VM protection. On Seattle, the overhead is
highest for the simplest operations because the relatively fixed cost of KCore protecting VM data
is a higher percentage of the work that must be done. These results provide a conservative measure
of overhead since real hypervisor operations will invoke actual KServ functions, not just measure
overhead for a null hypercall. The results show that the verified implementation introduces
modest overhead compared to the unverified implementation.
4.3.3 Application Workload Results
Single-VM Performance. We evaluated performance using real application workloads listed in
Table 4.7. To evaluate VM performance with end-to-end I/O protection, full disk encryption (FDE)
was enabled for Seattle VMs but not m400 VMs, given the limited memory assigned to m400 VMs.
We compared unmodified and verified KVM for all application benchmarks. In all cases, even when running 32 concurrent VMs,
verified KVM has no worse than 10% overhead compared to unmodified KVM, demonstrating that
verified KVM has performance scalability similar to that of unmodified KVM. In other words, the use of
locks in verified KVM to protect shared memory accesses does not adversely affect its performance
scalability in running multiple multiprocessor VMs on Arm relaxed memory hardware.
4.3.4 Evaluation of Practical Attacks
We evaluated SeKVM’s effectiveness against a compromised KServ by analyzing CVEs and
identifying the cases where SeKVM protects VM data despite any compromise, assuming an equiv-
alent implementation of SeKVM for x86 platforms. We analyzed CVEs related to Linux/KVM,
which are listed in Tables 4.8 and 4.9. The CVEs consider two cases: a malicious VM that ex-
ploits KVM functions supported by KServ, and an unprivileged host user who exploits bugs in
Linux/KVM. Among the selected CVEs, 16 are x86-specific; one is specific to Arm, while the rest
are independent of architecture. An attacker's goal is to exploit these CVEs to obtain KServ
privileges and compromise VM data. The CVEs related to our threat model could result in
information leakage, privilege escalation, code execution, and memory corruption in Linux/KVM.

Each entry lists whether KVM and SeKVM protect VM data against the bug.

CVE-2015-4036 (KVM: No, SeKVM: Yes): Memory Corruption: Array index error in KServ.
CVE-2013-0311 (KVM: No, SeKVM: Yes): Privilege Escalation: Improper handling of descriptors in vhost driver.
CVE-2017-17741 (KVM: No, SeKVM: Yes): Info Leakage: Stack out-of-bounds read in KServ.
CVE-2010-0297 (KVM: No, SeKVM: Yes): Code Execution: Buffer overflow in I/O virtualization code.
CVE-2014-0049 (KVM: No, SeKVM: Yes): Code Execution: Buffer overflow in I/O virtualization code.
CVE-2013-1798 (KVM: No, SeKVM: Yes): Info Leakage: Improper handling of invalid combination of operations for virtual IOAPIC.
CVE-2016-4440 (KVM: No, SeKVM: Yes): Code Execution: Mishandling of virtual APIC state.
CVE-2016-9777 (KVM: No, SeKVM: Yes): Privilege Escalation: Out-of-bounds array access using VCPU index in interrupt virtualization code.
CVE-2015-3456 (KVM: No, SeKVM: Yes): Code Execution: Memory corruption in virtual floppy driver allows VM user to execute arbitrary code in KServ.
CVE-2011-2212 (KVM: No, SeKVM: Yes): Privilege Escalation: Buffer overflow in the virtio subsystem allows guest to gain privileges to the host.
CVE-2011-1750 (KVM: No, SeKVM: Yes): Privilege Escalation: Buffer overflow in the virtio subsystem allows guest to gain privileges to the host.
CVE-2015-3214 (KVM: No, SeKVM: Yes): Code Execution: Out-of-bounds memory access in QEMU leads to memory corruption.
CVE-2012-0029 (KVM: No, SeKVM: Yes): Code Execution: Buffer overflow allows VM users to execute arbitrary code in QEMU.
CVE-2017-1000407 (KVM: No, SeKVM: No): Denial-of-Service: VMs crash KServ by flooding the I/O port with write requests.
CVE-2017-1000252 (KVM: No, SeKVM: No): Denial-of-Service: Out-of-bounds value causes assertion failure and hypervisor crash.
CVE-2014-7842 (KVM: No, SeKVM: No): Denial-of-Service: Bug in KVM allows guest users to crash their own OS.
CVE-2018-1087 (KVM: No, SeKVM: No): Privilege Escalation: Improper handling of exception allows guest users to escalate their privileges in their own OS.

Table 4.8: Selected Set of Analyzed CVEs - from VM

While KVM
does not protect VM data against any of these compromises, SeKVM protects against all of them.
SeKVM does not guarantee availability and cannot protect against CVEs that allow VMs or host
users to cause denial of service in KServ. Vulnerabilities that allow unprivileged guest users to
attack their own VMs like CVE-2014-7842 and CVE-2018-1087 are unrelated to our threat model.
We also executed attacks representative of information leakage to show that SeKVM protects
VM data even if an attacker has full control of KServ. First, we simulated an attacker trying to read
or modify VMs’ memory pages. We added a hook to KVM, which modifies a page that a targeted
gVA maps to. As expected, the compromised mainline KVM successfully modified the VM page.
In SeKVM, the same attack causes a trap to KCore, which rejects the invalid memory access.
CVE-2009-3234 (KVM: No, SeKVM: Yes): Privilege Escalation: Kernel stack buffer overflow resulting in ret2usr [93].
CVE-2010-2959 (KVM: No, SeKVM: Yes): Code Execution: Integer overflow resulting in function pointer overwrite.
CVE-2010-4258 (KVM: No, SeKVM: Yes): Privilege Escalation: Improper handling of get_fs value resulting in kernel memory overwrite.
CVE-2009-3640 (KVM: No, SeKVM: Yes): Privilege Escalation: Improper handling of APIC state in KServ.
CVE-2009-4004 (KVM: No, SeKVM: Yes): Privilege Escalation: Buffer overflow in KServ.
CVE-2013-1943 (KVM: No, SeKVM: Yes): Privilege Escalation, Info Leakage: Mishandling of memory slot allocation allows host users to access KServ memory.
CVE-2016-10150 (KVM: No, SeKVM: Yes): Privilege Escalation: Use-after-free in KServ.
CVE-2013-4587 (KVM: No, SeKVM: Yes): Privilege Escalation: Array index error in KServ.
CVE-2018-18021 (KVM: No, SeKVM: Yes): Privilege Escalation: Mishandling of VM register state allows host users to redirect KServ execution.
CVE-2016-9756 (KVM: No, SeKVM: Yes): Info Leakage: Improper initialization in code segment resulting in information leakage in KServ stack.
CVE-2019-14821 (KVM: No, SeKVM: Yes): Privilege Escalation: Host users cause out-of-bounds memory access in KServ.
CVE-2019-6974 (KVM: No, SeKVM: Yes): Privilege Escalation: Use-after-free in KServ.
CVE-2013-6368 (KVM: No, SeKVM: Yes): Privilege Escalation: Mishandling of APIC state in KServ.
CVE-2015-4692 (KVM: No, SeKVM: Yes): Memory Corruption: Mishandling of APIC state in KServ.
CVE-2013-4592 (KVM: No, SeKVM: No): Denial-of-Service: Host users cause memory leak in KServ.

Table 4.9: Selected Set of Analyzed CVEs - from Host User

Second, we simulated a host that tries to tamper with a VM's nested page table by redirecting a
gPA's NPT mapping to host-owned pages. This is in contrast to the prior attack of modifying VM
pages, but shares the same goal of accessing VM data in memory. We added a hook to the stage
2 page fault handler in KVM/ARM; the hook allocates a new zero page in the host OS’s address
space, which could contain arbitrary code or data in a real attack. The hook associates a range of a
VM’s gPAs with this zero page. As expected, this attack succeeds in KVM but fails in SeKVM.
In SeKVM, the attacker in KServ has no access to the VM's stage 2 page table walked by the MMU.
KCore never uses the page tables maintained by the untrusted KServ.
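Both attacks fail for the same reason: KCore consults its page ownership tracking before installing any mapping requested on behalf of KServ. A minimal sketch of such a guard, with invented structures and names rather than KCore's actual code, might look like:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PFN     32
#define OWNER_NONE  0
#define OWNER_KSERV 1
/* owners >= 2 are VM identifiers */

static int owner[MAX_PFN];   /* illustrative ownership table */

/* KCore builds KServ's stage 2 page table itself: a request to map a
 * pfn into KServ's address space succeeds only for free pages or pages
 * KServ already owns, never for VM-owned pages. */
static bool kserv_map_page(size_t pfn)
{
    if (pfn >= MAX_PFN || owner[pfn] >= 2)
        return false;            /* VM-owned: reject the mapping */
    owner[pfn] = OWNER_KSERV;
    return true;
}
```

Because the MMU only ever walks page tables KCore built, a compromised KServ cannot bypass this guard by editing its own tables.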
4.4 Summary
In this chapter, we have presented the effort required to retrofit KVM into SeKVM and verify
its TCB as well as its overall security properties. SeKVM is the first commodity multiprocessor
hypervisor that has been formally verified. We achieved this through microverification, retrofitting
KVM with a small core that can enforce data access controls on the rest of KVM. We showed
that microverification required only modest KVM modifications, yet resulted in a verified
hypervisor that retains KVM's commodity features, including support for multiprocessor VMs,
paravirtualized and passthrough I/O devices with IOMMU protection against direct memory ac-
cess (DMA) attacks, and compatibility with Linux device drivers for broad Arm hardware support.
We have formally verified the correctness of SeKVM’s TCB and the security guarantees of the en-
tire hypervisor. We have retrofitted and verified multiple versions of KVM using Coq. Finally, we
showed that SeKVM incurs modest overhead compared to unmodified KVM for real application
workloads and similar scalability when running multiple VMs.
My research has demonstrated that modest changes to a commodity system like the Linux
KVM hypervisor can reduce the required proof effort, making it possible to verify the security
properties of the entire hypervisor, such as the protection of VM confidentiality and integrity,
while retaining KVM’s commodity feature set and performance. Our work is the first machine-
checked security proof for a commodity multiprocessor hypervisor. Our work is also the first
machine-checked correctness proof of a multiprocessor system with shared page tables.
6.2 Future Work
My research has investigated using microverification to verify that the commodity KVM hy-
pervisor protects VM confidentiality and integrity on Arm multiprocessor hardware. I believe there
are many other opportunities for future work to apply microverification to various other systems
for different deployment scenarios to verify their security properties.
One area of future work is to explore how to use microverification to verify whether other
commodity hypervisors protect VM confidentiality and integrity. For example, to verify Xen, one
could use the HypSec design to retrofit Xen into a KServ that includes Dom0’s kernel and Xen’s
codebase that provides resource management, scheduling, and interrupt virtualization, and a KCore
that could include the rest of Xen to provide CPU virtualization and page table management while
protecting VM data. One could then reason over KCore to prove Xen’s protection of VM data.
As many of the commodity hypervisors are deployed to x86 based server hardware, HypSec could
leverage x86’s virtualization hardware extensions to simplify the retrofit required for x86 hyper-
visors. For example, one could leverage Intel x86’s Virtual Machine Extensions (VMX) [157] to
deprivilege KServ in the VMX non-root operation and run KCore in the VMX root operation to
protect VM data. To interpose on VM exits, KCore could manage the x86 Virtual Machine Con-
trol Structure (VMCS) to trap VM exits to itself, so it could protect VM data before switching to
KServ. Hardware extensions on x86, including VMX, provide support for context switching VM
CPU state. Thus, multiplexing the hardware between VMs and KServ on x86 hardware should not
incur significant VM performance overhead. To protect VM memory, HypSec could use VMX's
NPTs, namely Extended Page Tables (EPTs), and the IOMMU. As presented in Section 4.3, using NPTs
on x86 server hardware with reasonable TLB capacity should result in negligible VM performance
overhead. Finally, to verify x86 hypervisors, one could potentially model the multiprocessor ex-
ecution and memory management features of the x86 hardware similar to how we model these
respective features for SeKVM’s Arm based hardware. However, it may require additional effort
to model x86’s systems registers and virtualization extensions and detailed semantics for the x86
instructions used by the hypervisor to manage these hardware features.
Another area of future work is to explore how to use microverification to verify that hypervisors
guarantee other security properties, such as availability. SeKVM focuses only on data con-
fidentiality and integrity; it makes no guarantees about availability. Confidentiality and integrity
may be primary concerns in the context of cloud computing, but in other contexts, availability may
be of much greater importance. For example, virtualization has been increasingly adopted by secu-
rity critical systems such as automotive systems to isolate security critical components into VMs.
Unlike the cloud deployment scenario, in which administrators can simply terminate malfunction-
ing VMs to prevent them from affecting others, guaranteeing VM availability in security critical
systems is of key importance because the whole system could fail if a given VM component mal-
functions. To verify that the hypervisor protects VM availability, one could potentially incorporate
a hypervisor core that takes charge of scheduling VMs, allocating VM resources, and confining
faulted hypervisor components or VMs. One could then first show that the core implementation
refines its specification and prove the availability guarantee over the specification.
Moreover, various systems rely on a full commodity hypervisor to protect software compo-
nents running within VMs, either to protect the integrity of the guest kernel [122, 123, 125] or
applications in the VMs from a malicious guest kernel [60, 64]. Attackers who exploit vulnera-
bilities in the large hypervisor codebase could compromise the hypervisor’s security guarantees
to VMs. Microverification can be potentially applied to these systems to improve their security.
For instance, to prove that the hypervisor protects the confidentiality and integrity of applications
within a given VM against an untrusted guest OS, one could first leverage HypSec to retrofit the
hypervisor, and extend the retrofitted hypervisor core to protect the VM applications, then prove
noninterference over the core to demonstrate that the hypervisor enforces its security guarantees.
For example, to protect the confidentiality and integrity of the guest applications, the core must
mediate all interactions between the applications and the OS in the VM, such as system calls, page
faults, and interrupts from the userspace, at the guest kernel’s interface. Further investigation is
required to design the hypervisor core to support the commodity OS functionality while protecting
application data.
Exploring microverification of other commodity systems, such as commodity OS kernels, is
another interesting direction for further research. For example, microverification could be poten-
tially applied to verify the integrity of a given OS kernel, guaranteeing that the kernel is immune to
code injection attacks. One could potentially decompose the OS kernel into a large set of untrusted
kernel services and a small TCB. Similar to how KCore manages KServ’s NPT, the TCB could
manage the NPT for the untrusted kernel services using an identity map, and configure the mem-
ory access attributes in the page table entries to protect kernel memory. For instance, to prevent the
attackers from modifying existing kernel memory or page tables to load and execute arbitrary code,
the TCB could set the NPT entries that map to the kernel text section and page tables read-only, and
set the entries that map to the kernel data section non-executable. To verify kernel integrity, one
could prove noninterference and show that the contents of the executable kernel memory regions
remain unchanged throughout the kernel execution. However, a strict noninterference guarantee
may be incompatible with the commodity kernel feature set. For instance, commodity OS kernels
support dynamic kernel module loading, which requires updating the kernel page table to map to
the newly allocated executable memory for loading the kernel modules at runtime. Further investi-
gation is needed in designing and proving the TCB’s security policies for ensuring kernel integrity
protection while retaining the commodity OS kernel’s functionality and performance. An alterna-
tive avenue to explore microverification is to prove a given OS kernel protects the confidentiality
and integrity of user data in its hosted applications or containers. One could explore relying on a
small TCB that interposes at the kernel interface to applications or containers to protect the user
data, similar to how the hypervisor core could protect applications from the kernel in a VM as we
discussed earlier, then reasoning over the TCB to prove the desired security properties.
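The kernel-integrity policy sketched earlier, mapping the kernel text and page tables read-only and the data section non-executable in the NPT, could be modeled as follows; the attribute bits and names are invented for illustration, not a real Arm descriptor format:

```c
#include <stdint.h>

/* Illustrative NPT entry attribute bits (not a real descriptor layout). */
#define PTE_READONLY (1u << 0)
#define PTE_NOEXEC   (1u << 1)

enum region { REGION_TEXT, REGION_PAGE_TABLES, REGION_DATA };

/* The TCB picks attributes for an identity-mapped NPT entry: text and
 * page tables are mapped read-only so existing code cannot be modified,
 * and data is mapped non-executable so injected code cannot run. */
static uint32_t npt_attrs_for(enum region r)
{
    switch (r) {
    case REGION_TEXT:
    case REGION_PAGE_TABLES:
        return PTE_READONLY;
    case REGION_DATA:
        return PTE_NOEXEC;
    }
    return PTE_READONLY | PTE_NOEXEC;  /* default: most restrictive */
}
```

Dynamic module loading would require the TCB to relax this policy for specific pages at runtime, which is exactly the design tension the text above identifies.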
It is potentially possible to further reduce the efforts required for verifying commodity systems
based on microverification. For instance, to protect guest applications from an untrusted guest OS,
one could extend SeKVM’s Arm implementation to trap the exceptions from EL0 to EL2, allowing
KCore to protect application data before entering the guest kernel in EL1. KCore could potentially
program the Trap General Exceptions (TGE) bit provided by Arm to configure the CPU to route
all exceptions from EL0 directly to EL2. However, setting this bit also disables virtual memory in
EL0, which is problematic for real applications. It remains to be explored whether existing hardware features
on Arm or novel software design could be employed to simplify the retrofit needed for interposing
on EL0 exceptions. Furthermore, although we have built tools to automate many parts of the proofs
for SeKVM, it still requires manual effort to write formal specifications in Coq for the assembly
code and C functions that include complex program logic, such as loops, and to solve the proof goals
defined over the specifications. To further simplify the development and maintenance of formally
verified software systems, a promising direction of research is to explore technologies that support
programming verified software systems directly at scale. This could potentially be accomplished
by incorporating a novel proof framework that fully automates verification with a verified compiler
to produce trusted binaries from the formally verified source code.
Finally, although microverification enables formal verification of security properties for com-
modity systems, proving an existing commodity system is functionally correct in its entirety re-
mains a grand challenge. Further research in modularizing the monolithic codebase of commodity
systems into smaller and verifiable components, then verifying the smaller components and linking
their proofs at scale, could enable the first steps toward proving the functional correctness of entire
commodity systems.
References
[1] S. J. Vaughan-Nichols, “Hypervisors: The cloud’s potential security Achilles heel,” ZDNet, Mar. 2014.
[2] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, “KVM: The Linux Virtual Machine Monitor,” in Proceedings of the 2007 Ottawa Linux Symposium (OLS 2007), Ottawa, ON, Canada, Jun. 2007.
[3] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, “Xen and the Art of Virtualization,” in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003), Bolton Landing, NY, Oct. 2003, pp. 164–177.
[16] Confidential VM and Compute Engine, https://cloud.google.com/compute/confidential-vm/docs/about-cvm, Google, May 2021.
[17] G. Klein, K. Elphinstone, G. Heiser, J. Andronick, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt, R. Kolanski, M. Norrish, T. Sewell, H. Tuch, and S. Winwood, “seL4: Formal Verification of an OS Kernel,” in Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP 2009), Big Sky, MT, Oct. 2009, pp. 207–220.
[18] R. Gu, Z. Shao, H. Chen, X. N. Wu, J. Kim, V. Sjöberg, and D. Costanzo, “CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels,” in Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016), Savannah, GA, Nov. 2016, pp. 653–669.
[19] A. Vasudevan, S. Chaki, L. Jia, J. McCune, J. Newsome, and A. Datta, “Design, Implementation and Verification of an eXtensible and Modular Hypervisor Framework,” in Proceedings of the 2013 IEEE Symposium on Security and Privacy (SP 2013), San Francisco, CA, May 2013, pp. 430–444.
[20] D. Costanzo, Z. Shao, and R. Gu, “End-to-End Verification of Information-Flow Security for C and Assembly Programs,” in Proceedings of the 37th ACM Conference on Programming Language Design and Implementation (PLDI 2016), Santa Barbara, CA, Jun. 2016, pp. 648–664.
[21] H. Sigurbjarnarson, L. Nelson, B. Castro-Karney, J. Bornholt, E. Torlak, and X. Wang, “Nickel: A Framework for Design and Verification of Information Flow Control Systems,” in Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2018), Carlsbad, CA, Oct. 2018, pp. 287–305.
[22] A. Ferraiuolo, A. Baumann, C. Hawblitzel, and B. Parno, “Komodo: Using verification to disentangle secure-enclave hardware from software,” in Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP 2017), Shanghai, China, Oct. 2017, pp. 287–305.
[23] L. Nelson, J. Bornholt, R. Gu, A. Baumann, E. Torlak, and X. Wang, “Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 2019), Huntsville, Ontario, Canada, Oct. 2019, pp. 225–242.
[24] T. Murray, D. Matichuk, M. Brassil, P. Gammie, and G. Klein, “Noninterference for Operating System Kernels,” in Proceedings of the 2nd International Conference on Certified Programs and Proofs (CPP 2012), Kyoto, Japan, Dec. 2012, pp. 126–142.
[25] S.-W. Li, X. Li, R. Gu, J. Nieh, and J. Z. Hui, “A Secure and Formally Verified Linux KVM Hypervisor,” in Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP 2021), May 2021.
[26] J. Graham-Cumming and J. W. Sanders, “On the Refinement of Non-interference,” in Proceedings of Computer Security Foundations Workshop IV, Franconia, NH, Jun. 1991, pp. 35–42.
[27] D. Stefan, A. Russo, P. Buiras, A. Levy, J. C. Mitchell, and D. Mazieres, “Addressing Covert Termination and Timing Channels in Concurrent Information Flow Systems,” in Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming (ICFP 2012), ser. ACM SIGPLAN Notices, vol. 47, Sep. 2012, pp. 201–214.
[28] J. A. Goguen and J. Meseguer, “Unwinding and Inference Control,” in Proceedings of the 1984 IEEE Symposium on Security and Privacy (SP 1984), Oakland, CA, Apr. 1984, pp. 75–86.
[29] C. Dall and J. Nieh, “KVM/ARM: Experiences Building the Linux ARM Hypervisor,” Department of Computer Science, Columbia University, Technical Report CUCS-010-13, Jun. 2013.
[30] ——, “KVM/ARM: The Design and Implementation of the Linux ARM Hypervisor,” in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2014), Salt Lake City, UT, Mar. 2014, pp. 333–347.
[31] “Cloud companies consider Intel rivals after the discovery of microchip security flaws,” CNBC, Jan. 2018.
[32] C. Williams, “Microsoft: Can’t wait for ARM to power MOST of our cloud data centers! Take that, Intel! Ha! Ha!” The Register, Mar. 2017.
[33] Introducing Amazon EC2 A1 Instances Powered By New Arm-based AWS Graviton Processors, https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-amazon-ec2-a1-instances, Amazon Web Services, Nov. 2018.
[34] The Coq Proof Assistant, https://coq.inria.fr [Accessed: Dec 16, 2020].
[35] S. Landau, “Making Sense from Snowden: What’s Significant in the NSA Surveillance Revelations,” IEEE Security and Privacy, vol. 11, no. 4, pp. 54–63, Jul. 2013.
[37] Google, HTTPS encryption on the web – Google Transparency Report, https://transparencyreport.google.com/https/overview, Apr. 2018.
[38] Business Wire, Research and Markets: Global Encryption Software Market (Usage, Vertical and Geography) - Size, Global Trends, Company Profiles, Segmentation and Forecast, 2013 - 2020, https://www.businesswire.com/news/home/20150211006369/en/Research-Markets-Global-Encryption-Software-Market-Usage, Feb. 2015.
[39] J. H. Saltzer, D. P. Reed, and D. D. Clark, “End-to-end Arguments in System Design,” ACM Transactions on Computer Systems (TOCS), vol. 2, no. 4, pp. 277–288, Nov. 1984.
[40] “ARM Security Technology: Building a Secure System Using TrustZone Technology,” ARM Ltd., Whitepaper PRD29-GENC-009492C, Apr. 2009.
[41] International Organization for Standardization and International Electrotechnical Commission, ISO/IEC 11889-1:2015 - Information technology – Trusted platform module library, https://www.iso.org/standard/66510.html, Sep. 2016.
[42] J. A. Halderman, S. D. Schoen, N. Heninger, W. Clarkson, W. Paul, J. A. Calandrino, A. J. Feldman, J. Appelbaum, and E. W. Felten, “Lest We Remember: Cold Boot Attacks on Encryption Keys,” in Proceedings of the 17th USENIX Security Symposium (USENIX Security 2008), San Jose, CA, Jul. 2008, pp. 45–60.
[43] “Google Cloud Security and Compliance Whitepaper - How Google protects your data,” Google Cloud, pp. 6–7, https://static.googleusercontent.com/media/gsuite.google.com/en//files/google-apps-security-and-compliance-whitepaper.pdf [Accessed: Dec 16, 2020].
[44] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage, “Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-party Compute Clouds,” in Proceedings of the 2009 ACM Conference on Computer and Communications Security (CCS 2009), Chicago, IL, Nov. 2009, pp. 199–212.
[45] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Cross-VM Side Channels and Their Use to Extract Private Keys,” in Proceedings of the 2012 ACM Conference on Computer and Communications Security (CCS 2012), Raleigh, NC, Oct. 2012, pp. 305–316.
[46] G. Irazoqui, T. Eisenbarth, and B. Sunar, “S$A: A Shared Cache Attack That Works Across Cores and Defies VM Sandboxing – and Its Application to AES,” in Proceedings of the 2015 IEEE Symposium on Security and Privacy (SP 2015), San Jose, CA, May 2015, pp. 591–604.
[47] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Cross-Tenant Side-Channel Attacks in PaaS Clouds,” in Proceedings of the 2014 ACM Conference on Computer and Communications Security (CCS 2014), Scottsdale, AZ, Nov. 2014, pp. 990–1003.
[48] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, “Last-Level Cache Side-Channel Attacks Are Practical,” in Proceedings of the 2015 IEEE Symposium on Security and Privacy (SP 2015), San Jose, CA, May 2015, pp. 605–622.
[49] M. Backes, G. Doychev, and B. Kopf, “Preventing Side-Channel Leaks in Web Traffic: A Formal Approach,” in 20th Annual Network and Distributed System Security Symposium (NDSS 2013), San Diego, CA, Feb. 2013.
[50] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young, “Mach: A new kernel foundation for UNIX development,” in Proceedings of the Summer USENIX Conference (USENIX Summer 1986), Atlanta, GA, Jun. 1986, pp. 93–112.
[51] B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. E. Fiuczynski, D. Becker, C. Chambers, and S. Eggers, “Extensibility Safety and Performance in the SPIN Operating System,” in Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP 1995), Copper Mountain, CO, Dec. 1995, pp. 267–283.
[52] J. Liedtke, “On Micro-kernel Construction,” in Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP 1995), Copper Mountain, CO, Dec. 1995, pp. 237–250.
[53] ArchWiki, dm-crypt, https://wiki.archlinux.org/index.php/dm-crypt [Accessed: Jan 10, 2021].
[55] Amazon Web Services, Inc., AWS Key Management Service (KMS), https://aws.amazon.com/kms [Accessed: Jan 10, 2021].
[56] Microsoft Azure, Key Vault - Microsoft Azure, https://azure.microsoft.com/en-in/services/key-vault [Accessed: Jan 10, 2021].
[57] P. Stewin and I. Bystrov, “Understanding DMA Malware,” in Proceedings of the 9th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA 2012), Heraklion, Crete, Greece, Jul. 2013, pp. 21–41.
[58] C. A. Waldspurger, “Memory Resource Management in VMware ESX Server,” in Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI 2002), Boston, MA, Dec. 2002, pp. 181–194.
[59] K. Adams and O. Agesen, “A Comparison of Software and Hardware Techniques for x86 Virtualization,” in Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2006), San Jose, CA, Oct. 2006, pp. 2–13.
[60] X. Chen, T. Garfinkel, E. C. Lewis, P. Subrahmanyam, C. A. Waldspurger, D. Boneh, J. Dwoskin, and D. R. Ports, “Overshadow: A Virtualization-based Approach to Retrofitting Protection in Commodity Operating Systems,” in Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2008), Seattle, WA, Mar. 2008, pp. 2–13.
[61] J. T. Lim, C. Dall, S.-W. Li, J. Nieh, and M. Zyngier, “NEVE: Nested Virtualization Extensions for ARM,” in Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP 2017), Shanghai, China, Oct. 2017, pp. 201–217.
[62] J. Corbet, KAISER: hiding the kernel from user space, https://lwn.net/Articles/738975, Nov. 2017.
[64] O. S. Hofmann, S. Kim, A. M. Dunn, M. Z. Lee, and E. Witchel, “InkTag: Secure Applications on an Untrusted Operating System,” in Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2013), Houston, TX, Mar. 2013, pp. 265–278.
[65] ARM System Memory Management Unit Architecture Specification - SMMU architecture version 2.0, ARM Ltd., Jun. 2016.
[66] J.-K. Zinzindohoué, K. Bhargavan, J. Protzenko, and B. Beurdouche, “HACL*: A Verified Modern Cryptographic Library,” in Proceedings of the 2017 ACM Conference on Computer and Communications Security (CCS 2017), Dallas, TX, Oct. 2017, pp. 1789–1806.
[67] “ARM Power State Coordination Interface,” ARM Ltd., ARM DEN 0022D, Apr. 2017.
[68] R. Gu, Z. Shao, J. Kim, X. N. Wu, J. Koenig, V. Sjöberg, H. Chen, D. Costanzo, and T. Ramananandro, “Certified Concurrent Abstraction Layers,” in Proceedings of the 39th ACM Conference on Programming Language Design and Implementation (PLDI 2018), Philadelphia, PA, Jun. 2018, pp. 646–661.
[70] ARM Ltd., ARM CoreLink MMU-401 System Memory Management Unit Technical Refer-ence Manual, Jul. 2014.
[71] A. Sabelfeld and A. C. Myers, “A Model for Delimited Information Release,” in Proceed-ings of the 2nd International Symposium on Software Security (ISSS 2003), Tokyo, Japan,Nov. 2003, pp. 174–191.
[72] R. Gu, J. Koenig, T. Ramananandro, Z. Shao, X. N. Wu, S.-C. Weng, and H. Zhang, “DeepSpecifications and Certified Abstraction Layers,” in Proceedings of the 42nd ACM Sympo-sium on Principles of Programming Languages (POPL 2015), Mumbai, India, Jan. 2015,pp. 595–608.
[73] R. Keller, “Formal Verification of Parallel Programs,” Communications of the ACM, vol. 19,pp. 371–384, Jul. 1976.
[74] C. Jones, “Tentative Steps Toward a Development Method for Interfering Programs.,” ACMTransactions on Programming Languages and Systems (TOPLAS), vol. 5, pp. 596–619,Oct. 1983.
[75] N. Lynch and F. Vaandrager, “Forward and Backward Simulations,” Information and Com-putation, vol. 128, no. 1, 1–25, Jul. 1996.
[76] K. J. Biba, “Integrity Considerations for Secure Computer Systems,” MITRE, TechnicalReport MTR-3153, Jun. 1975.
[77] T. Murray, D. Matichuk, M. Brassil, P. Gammie, T. Bourke, S. Seefried, C. Lewis, X. Gao,and G. Klein, “SeL4: From General Purpose to a Proof of Information Flow Enforcement,”in Proceedings of the 2013 IEEE Symposium on Security and Privacy (SP 2013), SanFrancisco, CA, May 2013, pp. 415–429.
[78] R. Tao, J. Yao, S.-W. Li, X. Li, J. Nieh, and R. Gu, “Verifying a Multiprocessor Hyper-visor on Arm Relaxed Memory Hardware,” Department of Computer Science, ColumbiaUniversity, Technical Report CUCS-005-21, Jun. 2021.
[79] OP-TEE, Open Portable Trusted Execution Environment, https://www.op-tee.org/,[Accessed Jan 12, 2012].
[80] C. Dall, S.-W. Li, J. T. Lim, J. Nieh, and G. Koloventzos, “ARM Virtualization: Perfor-mance and Architectural Implications,” in Proceedings of the 43rd International Sympo-sium on Computer Architecture (ISCA 2016), Seoul, South Korea, Jun. 2016, pp. 304–316.
[81] C. Dall, S.-W. Li, and J. Nieh, “Optimizing the Design and Implementation of the LinuxARM Hypervisor,” in Proceedings of the 2017 USENIX Annual Technical Conference(USENIX ATC 2017), Santa Clara, CA, Jul. 2017, pp. 221–234.
[82] Tuning KVM, http://www.linux-kvm.org/page/Tuning_KVM [Accessed: Dec 16,2020].
[83] “Disk Cache Modes,” in SUSE Linux Enterprise Server 12 SP5 Virtualization Guide,SUSE, Dec. 2020, ch. 15.
[84] S. Hajnoczi, “An Updated Overview of the QEMU Storage Stack,” in LinuxCon Japan2011, Yokohama, Japan, Jun. 2011.
[85] C. Dall, S.-W. Li, J. T. Lim, and J. Nieh, “ARM Virtualization: Performance and Archi-tectural Implications,” ACM SIGOPS Operating Systems Review, vol. 52, no. 1, pp. 45–56,Jul. 2018.
[86] J. T. Lim and J. Nieh, “Optimizing Nested Virtualization Performance Using Direct VirtualHardware,” in Proceedings of the 25th International Conference on Architectural Supportfor Programming Languages and Operating Systems (ASPLOS 2020), Lausanne, Switzer-land, Mar. 2020, pp. 557–574.
[88] R. Russell, Z. Yanmin, I. Molnar, and D. Sommerseth, Improve hackbench, http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c, Linux KernelMailing List (LKML), Jan. 2008.
[89] R. Jones, Netperf, https://github.com/HewlettPackard/netperf [Accessed: Dec16, 2020].
[90] ab - Apache HTTP server benchmarking tool, https://httpd.apache.org/docs/2.4/programs/ab.html [Accessed: Dec 16, 2020].
[91] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “BenchmarkingCloud Serving Systems with YCSB,” in Proceedings of the 1st ACM Symposium on CloudComputing (SoCC 2010), Indianapolis, IN, Jun. 2010, pp. 143–154.
[93] V. P. Kemerlis, G. Portokalidis, and A. D. Keromytis, “KGuard: Lightweight Kernel Pro-tection against Return-to-User Attacks,” in Proceedings of the 21st USENIX Security Sym-
posium (USENIX Security 2012), Bellevue, WA, Aug. 2012, pp. 459–474, ISBN: 978-931971-95-9.
[94] N. Dautenhahn, T. Kasampalis, W. Dietz, J. Criswell, and V. Adve, “Nested Kernel: AnOperating System Architecture for Intra-Kernel Privilege Separation,” in Proceedings ofthe 20th International Conference on Architectural Support for Programming Languagesand Operating Systems (ASPLOS 2015), Istanbul, Turkey, Mar. 2015, pp. 191–206.
[95] P. Colp, M. Nanavati, J. Zhu, W. Aiello, G. Coker, T. Deegan, P. Loscocco, and A. Warfield,“Breaking Up is Hard to Do: Security and Functionality in a Commodity Hypervisor,” inProceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP 2011),Cascais, Portugal, Oct. 2011, pp. 189–202.
[96] F. Zhang, J. Chen, H. Chen, and B. Zang, “CloudVisor: Retrofitting Protection of VirtualMachines in Multi-tenant Cloud with Nested Virtualization,” in Proceedings of the 23rdACM Symposium on Operating Systems Principles (SOSP 2011), Cascais, Portugal, Oct.2011, pp. 203–216.
[97] D. G. Murray, G. Milos, and S. Hand, “Improving Xen Security Through Disaggregation,”in Proceedings of the 4th ACM SIGPLAN/SIGOPS International Conference on VirtualExecution Environments (VEE 2008), Seattle, WA, Mar. 2008, pp. 151–160.
[98] S. Butt, H. A. Lagar-Cavilla, A. Srivastava, and V. Ganapathy, “Self-service Cloud Com-puting,” in Proceedings of the 2012 ACM Conference on Computer and CommunicationsSecurity (CCS 2012), Raleigh, NC, Oct. 2012, pp. 253–264.
[99] U. Steinberg and B. Kauer, “NOVA: A Microhypervisor-based Secure Virtualization Ar-chitecture,” in Proceedings of the 5th European Conference on Computer Systems (EuroSys2010), Paris, France, Apr. 2010, pp. 209–222.
[100] G. Heiser and B. Leslie, “The OKL4 Microvisor: Convergence Point of Microkernels andHypervisors,” in Proceedings of the 1st ACM Asia-pacific Workshop on Workshop on Sys-tems (APSys 2010), New Delhi, India, Aug. 2010, pp. 19–24.
[101] T. Shinagawa, H. Eiraku, K. Tanimoto, K. Omote, S. Hasegawa, T. Horie, M. Hirano,K. Kourai, Y. Oyama, E. Kawai, K. Kono, S. Chiba, Y. Shinjo, and K. Kato, “BitVisor: AThin Hypervisor for Enforcing I/O Device Security,” in Proceedings of the 2009 ACM SIG-PLAN/SIGOPS International Conference on Virtual Execution Environments (VEE 2009),Washington, DC, Mar. 2009, pp. 121–130.
[102] A. Nguyen, H. Raj, S. Rayanchu, S. Saroiu, and A. Wolman, “Delusional Boot: SecuringHypervisors Without Massive Re-engineering,” in Proceedings of the 7th ACM EuropeanConference on Computer Systems (EuroSys 2012), Bern, Switzerland, Apr. 2012, pp. 141–154.
132
[103] E. Keller, J. Szefer, J. Rexford, and R. B. Lee, “NoHype: Virtualized Cloud InfrastructureWithout the Virtualization,” in Proceedings of the 37th Annual International Symposiumon Computer Architecture (ISCA 2010), Saint-Malo, France, Jun. 2010, pp. 350–361.
[105] Z. Wang, C. Wu, M. Grace, and X. Jiang, “Isolating Commodity Hosted Hypervisors withHyperLock,” in Proceedings of the 7th ACM European Conference on Computer Systems(EuroSys 2012), Bern, Switzerland, Apr. 2012, pp. 127–140.
[106] C. Wu, Z. Wang, and X. Jiang, “Taming Hosted Hypervisors with (Mostly) DeprivilegedExecution.,” in 20th Annual Network and Distributed System Security Symposium (NDSS2013), San Diego, CA, Feb. 2013.
[107] L. Shi, Y. Wu, Y. Xia, N. Dautenhahn, H. Chen, B. Zang, and J. Li, “Deconstructing Xen,”in 24th Annual Network and Distributed System Security Symposium (NDSS 2017), SanDiego, CA, Feb. 2017.
[109] A. Baumann, M. Peinado, and G. Hunt, “Shielding Applications from an Untrusted Cloudwith Haven,” in Proceedings of the 11th USENIX Symposium on Operating Systems Designand Implementation (OSDI 2014), Broomfield, CO, Oct. 2014, pp. 267–283.
[110] M.-W. Shih, M. Kumar, T. Kim, and A. Gavrilovska, “S-NFV: Securing NFV States byUsing SGX,” in Proceedings of the 2016 ACM International Workshop on Security in Soft-ware Defined Networks & Network Function Virtualization (SDN-NFV Security 2016),New Orleans, LA, Mar. 2016, pp. 45–48.
[111] M. Zhu, B. Tu, W. Wei, and D. Meng, “HA-VMSI: A Lightweight Virtual Machine Iso-lation Approach with Commodity Hardware for ARM,” in Proceedings of the 13th ACMSIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE 2017),Xi’an, China, Apr. 2017, pp. 242–256.
[112] Z. Hua, J. Gu, Y. Xia, H. Chen, B. Zang, and Haibing, “VTZ: Virtualizing ARM Trust-zone,” in Proceedings of the 26th USENIX Security Symposium (USENIX Security 2017),Vancouver, BC, Canada, Aug. 2017, pp. 541–556.
[113] S. Jin, J. Ahn, S. Cha, and J. Huh, “Architectural Support for Secure Virtualization Under aVulnerable Hypervisor,” in Proceedings of the 44th Annual IEEE/ACM International Sym-posium on Microarchitecture (MICRO 2011), Porto Alegre, Brazil, Dec. 2011, pp. 272–283.
[114] J. Szefer and R. B. Lee, “Architectural Support for Hypervisor-secure Virtualization,” inProceedings of the 17th International Conference on Architectural Support for Program-ming Languages and Operating Systems (ASPLOS 2012), London, England, UK, Mar.2012, pp. 437–450.
[115] Y. Xia, Y. Liu, and H. Chen, “Architecture Support for Guest-transparent VM Protectionfrom Untrusted Hypervisor and Physical Attacks,” in Proceedings of the 19th IEEE In-ternational Symposium on High Performance Computer Architecture (HPCA 2013), Shen-zhen, China, Feb. 2013, pp. 246–257.
[117] Advanced Micro Devices, Secure Encrypted Virtualization API Version 0.16, http://developer.amd.com/wordpress/media/2017/11/55766_SEV-KM-API_Specification.pdf, Feb. 2018.
[118] Y. Wu, Y. Liu, R. Liu, H. Chen, B. Zang, and H. Guan, “Comprehensive VM Protec-tion Against Untrusted Hypervisor Through Retrofitted AMD Memory Encryption,” inProceedings of the 24th IEEE International Symposium on High Performance ComputerArchitecture (HPCA 2018), Vienna, Austria, Feb. 2018, pp. 441–453.
[119] J. Yang and K. G. Shin, “Using Hypervisor to Provide Data Secrecy for User Applica-tions on a Per-page Basis,” in Proceedings of the 4th ACM SIGPLAN/SIGOPS Interna-tional Conference on Virtual Execution Environments (VEE 2008), Seattle, WA, Mar. 2008,pp. 71–80.
[120] J. M. McCune, Y. Li, N. Qu, Z. Zhou, A. Datta, V. Gligor, and A. Perrig, “TrustVisor:Efficient TCB Reduction and Attestation,” in Proceedings of the 2010 IEEE Symposiumon Security and Privacy (SP 2010), Oakland, CA, May 2010, pp. 143–158.
[121] S. Chhabra, B. Rogers, Y. Solihin, and M. Prvulovic, “SecureME: A Hardware-softwareApproach to Full System Security,” in Proceedings of the 2011 International Conferenceon Supercomputing (ICS 2011), Tucson, Arizona, USA, May 2011, pp. 108–119.
[122] Z. Wang, X. Jiang, W. Cui, and P. Ning, “Countering Kernel Rootkits with LightweightHook Protection,” in Proceedings of the 16th ACM Conference on Computer and Commu-nications Security (CCS 2009), Chicago, IL, Nov. 2009, pp. 545–554.
[123] R. Riley, X. Jiang, and D. Xu, “Guest-Transparent Prevention of Kernel Rootkits withVMM-Based Memory Shadowing,” in Proceedings of the 11th International Symposiumon Recent Advances in Intrusion Detection (RAID 2008), Cambridge, MA, Sep. 2008,pp. 1–20.
[124] A. Seshadri, M. Luk, N. Qu, and A. Perrig, “SecVisor: A Tiny Hypervisor to Provide Life-time Kernel Code Integrity for Commodity OSes,” in Proceedings of 21st ACM SIGOPSSymposium on Operating Systems Principles (SOSP 2007), Stevenson, WA, Oct. 2007,pp. 335–350.
[125] X. Wang, Y. Chen, Z. Wang, Y. Qi, and Y. Zhou, “SecPod: A Framework for Virtualization-based Security Systems,” in Proceedings of the 2015 USENIX Annual Technical Confer-ence (USENIX ATC 2015), Santa Clara, CA, Jul. 2015, pp. 347–360.
[126] Z. Zhou, M. Yu, and V. D. Gligor, “Dancing with Giants: Wimpy Kernels for On-DemandIsolated I/O,” in Proceedings of the 2014 IEEE Symposium on Security and Privacy (SP2014), San Jose, CA, May 2014, pp. 308–323.
[127] G. Klein, J. Andronick, M. Fernandez, I. Kuz, T. Murray, and G. Heiser, “Formally Veri-fied Software in the Real World,” Communications of the ACM, vol. 61, no. 10, pp. 68–77,Sep. 2018.
[128] T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh, “Terra: A Virtual Machine-based Platform for Trusted Computing,” in Proceedings of the 19th ACM Symposium onOperating Systems Principles (SOSP 2003), Bolton Landing, NY, Oct. 2003, pp. 193–206.
[129] R. Strackx and F. Piessens, “Fides: Selectively Hardening Software Application Compo-nents Against Kernel-level or Process-level Malware,” in Proceedings of the 2012 ACMConference on Computer and Communications Security (CCS 2012), Raleigh, NC, Oct.2012, pp. 2–13.
[130] R. Ta-Min, L. Litty, and D. Lie, “Splitting Interfaces: Making Trust Between Applica-tions and Operating Systems Configurable,” in Proceedings of the 7th USENIX Symposiumon Operating Systems Design and Implementation (OSDI 2006), Seattle, WA, Nov. 2006,pp. 279–292.
[131] Y. Liu, T. Zhou, K. Chen, H. Chen, and Y. Xia, “Thwarting Memory Disclosure with Effi-cient Hypervisor-enforced Intra-domain Isolation,” in Proceedings of the 2015 ACM Con-ference on Computer and Communications Security (CCS 2015), Denver, CO, Oct. 2015,pp. 1607–1619.
[132] G. Klein, J. Andronick, K. Elphinstone, T. Murray, T. Sewell, R. Kolanski, and G. Heiser,“Comprehensive Formal Verification of an OS Microkernel,” ACM Transactions on Com-puter Systems, vol. 32, no. 1, 2:1–70, Feb. 2014.
[133] R. Gu, Z. Shao, H. Chen, J. Kim, J. Koenig, X. Wu, V. Sjöberg, and D. Costanzo, “BuildingCertified Concurrent OS Kernels,” Communications of the ACM, vol. 62, no. 10, pp. 89–99, Sep. 2019.
[135] J. Oberhauser, R. L. de Lima Chehab, D. Behrens, M. Fu, A. Paolillo, L. Oberhauser, K.Bhat, Y. Wen, H. Chen, J. Kim, and V. Vafeiadis, “VSync: Push-Button Verification andOptimization for Synchronization Primitives on Weak Memory Models,” in Proceedings ofthe 26th International Conference on Architectural Support for Programming Languagesand Operating Systems (ASPLOS 2021), Detroit, MI, Apr. 2021.
[136] seL4 Reference Manual Version 11.0.0, Data61, Nov. 2019.
[137] Frequently Asked Questions on seL4, https://docs.sel4.systems/projects/sel4/frequently-asked-questions.html [Accessed: Dec 16, 2020].
[138] E. Cohen, M. Dahlweid, M. Hillebrand, D. Leinenbach, M. Moskal, T. Santen, W. Schulte,and S. Tobies, “VCC: A Practical System for Verifying Concurrent C,” in Proceedings ofthe 22nd International Conference on Theorem Proving in Higher Order Logics (TPHOLs2009), Munich, Germany, Aug. 2009, pp. 23–42.
[139] D. Leinenbach and T. Santen, “Verifying the Microsoft Hyper-V hypervisor with VCC,” inProceedings of the 16th International Symposium on Formal Methods (FM 2009), Eind-hoven, The Netherlands, Nov. 2009, pp. 806–809.
[140] A. Vasudevan, S. Chaki, P. Maniatis, L. Jia, and A. Datta, “überSpark: Enforcing VerifiableObject Abstractions for Automated Compositional Security Analysis of a Hypervisor,” inProceedings of the 25th USENIX Security Symposium (USENIX Security 2016), Austin,TX, Aug. 2016, pp. 87–104.
[141] “Creating a Trusted Embedded Platform for MLS Application,” Green Hills Software,Whitepaper v0520, May 2020.
[142] National Information Assurance Partnership, Separation Kernels on Commodity Worksta-tions, http://www.niap-ccevs.org/announcements/Separation%20Kernels%20on%20Commodity%20Workstations.pdf, Mar. 2010.
[143] C. Hawblitzel, J. Howell, J. R. Lorch, A. Narayan, B. Parno, D. Zhang, and B. Zill, “Iron-clad Apps: End-to-End Security via Automated Full-System Verification,” in Proceedingsof the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI2014), Broomfield, CO, Oct. 2014, pp. 165–181.
[144] D. Jang, Z. Tatlock, and S. Lerner, “Establishing Browser Security Guarantees through For-mal Shim Verification,” in Proceedings of the 21st USENIX Security Symposium (USENIXSecurity 2012), Bellevue, WA, Aug. 2012, pp. 113–128.
[145] C. Baumann, M. Näslund, C. Gehrmann, O. Schwarz, and H. Thorsen, “A High AssuranceVirtualization Platform for ARMv8,” in Proceedings of the 2016 European Conference onNetworks and Communications (EuCNC 2016), Athens, Greece, Jun. 2016, pp. 210–214.
[146] C. Baumann, O. Schwarz, and M. Dam, “On the verification of system-level informationflow properties for virtualized execution platforms,” Journal of Cryptographic Engineer-ing, vol. 9, no. 3, pp. 243–261, May 2019.
[147] T. Murray, R. Sison, E. Pierzchalski, and C. Rizkallah, “Compositional Verification andRefinement of Concurrent Value-Dependent Noninterference,” in Proceedings of the 29thIEEE Computer Security Foundations Symposium (CSF 2016), Lisbon, Portugal, Jun. 2016,pp. 417–431.
[148] T. Murray, R. Sison, and K. Engelhardt, “COVERN: A Logic for Compositional Verifica-tion of Information Flow Control,” in Proceedings of the 2018 IEEE European Conferenceon Security and Privacy (EuroS&P 2018), London, United Kingdom, Apr. 2018, pp. 16–30.
[149] G. Ernst and T. Murray, “SecCSL: Security Concurrent Separation Logic,” in Proceedingsof the 31st International Conference (CAV 2019), New York, NY, Jul. 2019, pp. 208–230.
[150] D. Schoepe, T. Murray, and A. Sabelfeld, “VERONICA: Expressive and Precise Concur-rent Information Flow Security,” in Proceedings of the 33rd IEEE Computer Security Foun-dations Symposium (CSF 2020), Boston, MA, Jun. 2020, pp. 79–94.
[151] H. Tuch and G. Klein, “Verifying the L4 virtual memory subsystem,” in Proceedings ofthe NICTA Foraml Methods Workshop on OS Verification, Sydney, Australia, Oct. 2004,pp. 73–97.
[152] O. Schwarz and M. Dam, “Formal verification of secure user mode device executionwith DMA,” in Proceedings of the 10th International Haifa Verification Conference (HVC2014), Haifa, Israel, Nov. 2014, pp. 236–251.
[153] Y. Zhao and D. Sanán, “Rely-Guarantee Reasoning About Concurrent Memory Manage-ment in Zephyr RTOS,” in Proceedings of the 31st International Conference (CAV 2019),New York, NY, Jul. 2019, pp. 515–533.
[154] S. H. Taqdees and G. Klein, “Reasoning about Translation Lookaside Buffers,” in Proceed-ings of the 21st International Conference on Logic for Programming, Artificial Intelligenceand Reasoning (LPAR 2017), Maun, Botswana, May 2017, pp. 490–508.
[155] H. T. Syeda and G. Klein, “Program verification in the presence of cached address transla-tion,” in Proceedings of the 2018 International Conference on Interactive Theorem Proving(ITP 2018), Oxford, United Kingdom, Jul. 2018, pp. 542–559.
137
[156] A. Fox, “Formal specification and verification of arm6,” in International Conference onTheorem Proving in Higher Order Logics (TPHOLs 2003), Rome, Italy, Sep. 2003, pp. 25–40.
[157] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer’s Manual, 325462-044US, Aug. 2012.
138
Appendix A: KCore API
Figure A.1 shows the APIs of all KCore layers. As discussed earlier, our layered verification
approach only allows higher layer primitives to call lower layer primitives. Black arrows
indicate that a primitive in one layer calls a primitive in the next lower layer. For example,
walk_pgd from PTWalk calls alloc_pgd in PTAlloc. White boxes show primitives that are only
used in the layer in which they are defined. Colored boxes show primitives that are passed
through to other layers. For example, map_pfn_vm from MemAux calls map_page, which is passed
through from NPTOps to PageMgmt. The figure does not include the primitives passed through
from the abstract machine, such as the memory load and store primitives. White arrows indicate
that all primitives in a given layer use specific lower layer primitives. For instance, all
primitives in BootOps use the acquire_lock_vm and release_lock_vm primitives. Empty white
boxes are used to group primitives into a set; a black arrow from a primitive in a higher
layer to an empty white box indicates that the higher layer primitive uses all primitives in
that set.
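The layered call discipline can be sketched in C. This is a minimal illustration, not KCore's code: the pool size, the per-VM index array, and the simplified signatures of walk_pgd and alloc_pgd are all hypothetical; the point is only that a PTWalk primitive calls downward into PTAlloc, never the reverse.

```c
#include <stdint.h>

/* Illustrative only: real KCore pools and signatures differ. */
#define PGD_POOL_SIZE 8

/* --- PTAlloc layer: page table pool allocation --- */
static int pgd_next = 0;

/* Returns the index of a freshly allocated pgd slot, or -1 when full. */
static int alloc_pgd(void) {
    if (pgd_next >= PGD_POOL_SIZE)
        return -1;
    return pgd_next++;
}

/* --- PTWalk layer: may call PTAlloc, never the other way around --- */
/* Looks up the pgd slot for a vmid, allocating one on first use. */
static int walk_pgd(int vmid, int *pgd_of_vm) {
    if (pgd_of_vm[vmid] < 0)
        pgd_of_vm[vmid] = alloc_pgd();  /* downward call only */
    return pgd_of_vm[vmid];
}
```

Because every call crosses layers in one direction, each layer can be verified against the specification of the layer below it.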
Tables A.1 to A.33 list the APIs of KCore’s 33 intermediate layers shown in Figure 2.2.
kserv_hvc_handler: Called by TrapHandler to handle a given hypercall made by KServ. It first calls CtxtSwitch to context switch to KCore, then calls TrapDispatcher to handle the hypercall. Finally, it calls CtxtSwitch to context switch back to KServ.

kserv_s2pt_fault_handler: Called by TrapHandler to handle a given stage 2 page fault for KServ. It first calls CtxtSwitch to context switch to KCore, then calls FaultHandler to handle the page fault. Finally, it calls CtxtSwitch to context switch back to KServ.

vm_exit_handler: Called by TrapHandler to handle a given VM exit. It first calls CtxtSwitch to context switch to KCore. It then calls VCPUOps to handle the VM exit. If the exit requires KServ's functionality, it calls CtxtSwitch to context switch to KServ; otherwise, it handles the exit directly and returns to the VM.
Table A.1: TrapHandlerRaw API
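The control flow of kserv_hvc_handler described in Table A.1 can be sketched as follows. The trace buffer, the step constants, and the simplified function signatures are hypothetical stand-ins for the real CtxtSwitch and TrapDispatcher primitives; the sketch only shows the save/dispatch/restore ordering.

```c
/* Hypothetical sketch of kserv_hvc_handler's control flow.
 * Step names and the trace buffer are illustrative only. */
enum { TO_KCORE, DISPATCH, TO_KSERV };

static int trace[8];
static int ntrace = 0;

static void ctxt_switch_to_kcore(void) { trace[ntrace++] = TO_KCORE; }
static void trap_dispatcher(int hvc)   { (void)hvc; trace[ntrace++] = DISPATCH; }
static void ctxt_switch_to_kserv(void) { trace[ntrace++] = TO_KSERV; }

/* Mirrors Table A.1: save KServ state and enter KCore, route the
 * hypercall, then restore KServ state before returning. */
static void kserv_hvc_handler(int hvc_num) {
    ctxt_switch_to_kcore();   /* CtxtSwitch: KServ -> KCore */
    trap_dispatcher(hvc_num); /* TrapDispatcher: route the hypercall */
    ctxt_switch_to_kserv();   /* CtxtSwitch: KCore -> KServ */
}
```

The symmetric structure matters for the proofs: every entry into KCore is bracketed by a context switch in and a context switch out, so KServ never observes KCore's intermediate register state.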
kserv_hvc_dispatcher: Dispatches a given hypercall made by KServ to the respective hypercall handler based on the hypercall number.

vm_exit_dispatcher: Checks if KCore can handle a given VM exit; if so, it handles the exit directly. For example, it calls FaultHandler to handle the GRANT_MEM and REVOKE_MEM hypercalls.
Table A.2: TrapDispatcher API
handle_kserv_s2pt_fault: Handles a given stage 2 page fault for KServ. It either calls SmmuOps to handle the SMMU access, or calls MemAux to resolve the page fault.

handle_pvops: Calls MemOps to handle a given GRANT_MEM or REVOKE_MEM hypercall.
Table A.3: FaultHandler API
clear_vm_mem_range: Handles the clear_vm hypercall for KServ. It first calls VMPower to ensure a given target VM is powered off, then calls MemOps to reclaim pages from the target VM.

__smmu_map: Handles the smmu_map hypercall for KServ by calling SmmuRaw and SmmuOps.
Table A.4: MemHandler API
save_kserv_gprs: Saves KServ's general purpose registers from the hardware to memory.
restore_kserv_gprs: Restores KServ's general purpose registers from memory to the hardware.
save_vm_gprs: Saves a given VM's general purpose registers from the hardware to memory.
restore_vm_gprs: Restores a given VM's general purpose registers from memory to the hardware.
save_core_gprs: Saves KCore's general purpose registers from the hardware to memory.
restore_core_gprs: Restores KCore's general purpose registers from memory to the hardware.
save_kserv_sysregs: Saves KServ's system registers from the hardware to memory.
restore_kserv_sysregs: Restores KServ's system registers from memory to the hardware.
save_vm_sysregs: Saves a given VM's system registers from the hardware to memory.
restore_vm_sysregs: Restores a given VM's system registers from memory to the hardware.
Table A.5: CtxtSwitch API
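The save/restore pattern in Table A.5 can be sketched as below. This is a simplified illustration: the register-file array standing in for CPU hardware, the cpu_context struct, and the merged save_gprs/restore_gprs signatures are all hypothetical (KCore keeps separate per-principal primitives, as the table shows).

```c
#include <stdint.h>
#include <string.h>

/* Arm provides 31 general purpose registers, x0..x30. */
#define NR_GPRS 31

/* Stand-in for the physical CPU register file. */
static uint64_t hw_gprs[NR_GPRS];

/* Per-principal register save area in memory. */
struct cpu_context { uint64_t gprs[NR_GPRS]; };

/* save_*_gprs: hardware -> per-principal memory */
static void save_gprs(struct cpu_context *ctxt) {
    memcpy(ctxt->gprs, hw_gprs, sizeof(hw_gprs));
}

/* restore_*_gprs: per-principal memory -> hardware */
static void restore_gprs(const struct cpu_context *ctxt) {
    memcpy(hw_gprs, ctxt->gprs, sizeof(hw_gprs));
}
```

Saving one principal's registers before restoring another's is what prevents KServ from reading a VM's register state after a VM exit.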
proc_vm_exit: Calls VCPUOpsAux to handle a VM exit.

proc_vm_enter: Calls VCPUOpsAux to handle the VM ENTER hypercall for KServ, to either copy data from the intermediate state to VCPUContext or resolve a VM's stage 2 page fault. It also resets the VCPU registers on the first VM enter.
Table A.6: VCPUOps API
reset_gp_regs: Resets the general purpose registers for a given VCPU.

reset_sysregs: Resets the system registers for a given VCPU.

sync_intr_to_vcpu: Copies data from the intermediate state to VCPUContext. For example, it copies MMIO read data from the intermediate state to VCPUContext.

prep_wfx: Handles a given VM exit caused by executing Arm's WFE/WFI instructions.

prep_psci: Handles a given VM exit caused by a VM making PSCI hypercalls. It copies the VM power states stored in a given VM's general purpose registers from VCPUContext to the intermediate state, so the PSCI power hypercall handler in KServ can access the data to handle the hypercall.

prep_abort: Handles the VM exits caused by memory access aborts, including MMIO accesses and faults on regular memory accesses. To handle an MMIO write, it copies the MMIO write data from VCPUContext to the intermediate state; to handle an MMIO read, it sets a dirty flag so that KCore can later copy the read data from the intermediate state to VCPUContext. To handle a stage 2 page fault on a regular memory access, it sets a flag in the data structure shared with KServ to notify KServ to allocate memory.

update_excpt_regs: Updates the general purpose registers stored in VCPUContext to inject an exception into a given VM.

handle_s2pt_fault: Calls MemOps to resolve a given VM's stage 2 page fault.
Table A.7: VCPUOpsAux API
handle_mmio: Calls SmmuAux to handle a fault caused by KServ's MMIO access to the SMMU.

__smmu_alloc_unit: Handles the smmu_alloc_unit hypercall for KServ. It calls BootOps to allocate an SMMU translation unit for a given device.

__smmu_free_unit: Handles the smmu_free_unit hypercall for KServ to deallocate an SMMU translation unit.

smmu_map_page: Calls BootOps to map a given iova to an hPA in a given device's SMMU page table.

smmu_unmap_page: Handles the smmu_unmap hypercall for KServ. It calls MmioSPTOps to unmap a given iova from a given device's SMMU page table.

__smmu_iova_to_phys: Handles the smmu_iova_to_phys hypercall for KServ. It calls MmioSPTOps to walk a given device's SMMU page table using an input iova.
Table A.8: SmmuOps API
check_smmu_address: Checks if a given faulted physical address falls within a hardware memory region that belongs to the SMMU.

handle_smmu_access: Calls SmmuCore to handle KServ's MMIO access to the SMMU.
Table A.9: SmmuAux API
handle_smmu_write: Calls SmmuCoreAux to handle KServ's write access to the SMMU. It calls SmmuRaw to get the MMIO write data used to program the SMMU.

handle_smmu_read: Calls SmmuCoreAux to handle KServ's read access to the SMMU.
Table A.10: SmmuCore API
__handle_smmu_write: Programs the SMMU hardware to carry out KServ's SMMU write.

__handle_smmu_read: Reads the SMMU hardware to carry out KServ's SMMU read.

handle_global_access: Validates KServ's access to the SMMU global registers. For example, it rejects KServ's attempts to disable the SMMU page tables by programming the SMMU_CBAR register.

handle_cb_access: Calls SmmuRaw to first locate the SMMU translation unit, then validates KServ's access to the bank registers of the given SMMU translation unit. For instance, KCore forbids any write access to the SMMU page table base register (TTBR0) that would cause a given translation unit to use a malicious SMMU page table.
Table A.11: SmmuCoreAux API
get_mmio_data: Returns the MMIO write data stored in KServ's register for a given SMMU write.

init_smmu_pte: Takes a given hPA and formulates the resulting value to store to an entry in the SMMU page tables.

get_smmu_unit: Translates a given input physical address to the index of the corresponding SMMU translation unit, which KCore uses to manage the SMMU.
Table A.12: SmmuRaw API
search_ld_info: Checks if a given guest physical address is within a memory region that contains a VM image.

set_vcpu_active: Specifies that a given VCPU is active on the current physical CPU. Used by KCore to ensure the same VCPU cannot be run concurrently on another physical CPU.

set_vcpu_inactive: Specifies that a given VCPU is inactive on the current physical CPU.

__register_vcpu: Handles the register_vcpu hypercall for KServ.

__register_vm: Handles the register_vm hypercall for KServ. It calls BootCore to allocate a new VM identifier.

__set_boot_info: Handles the set_boot_info hypercall for KServ. It stores the information of a given VM boot image to VMInfo, and then calls BootCore to allocate a memory buffer in KCore's address space to remap the VM image.

remap_image_page: Handles the remap_boot_image_page hypercall for KServ. It calls MemAux to map a given physical page containing the VM image to the EL2 stage 1 page table.

verify_and_load_images: Handles the verify_vm_image hypercall for KServ. It loops over the list of boot images loaded for a given VM and calls HACL* to authenticate each image. If an image is authenticated, it calls BootAux to map the image to the VM's stage 2 page table.

alloc_smmu: Checks if a given VM has booted; if not, it allocates an SMMU translation unit to the VM's device, and calls MmioSPTOps to initialize the respective SMMU page table.

map_smmu: Checks if a given VM has booted; if not, it calls MemAux to map an iova to an hPA in the SMMU page table of the VM's device.

clear_smmu: Checks if a given VM has booted; if not, it calls MemAux to unmap an iova from the SMMU page table of the VM's device.

__encrypt_vcpu: Handles the encrypt_vcpu hypercall for KServ. It encrypts the data stored in the VCPUContext of a given VCPU and copies the encrypted data to KServ's memory.

__decrypt_vcpu: Handles the decrypt_vcpu hypercall for KServ. It copies the encrypted CPU data from KServ memory to a private buffer, and decrypts the data stored in that buffer.

__encrypt_vm_mem: Handles the encrypt_vm_mem hypercall for KServ. It calls MemOps to encrypt the data stored at a given physical address and copies the encrypted data to an output buffer owned by KServ.

__decrypt_vm_mem: Handles the decrypt_vm_mem hypercall for KServ. It copies the encrypted data stored at a given physical address to a private buffer, then calls MemOps to decrypt it.
Table A.13: BootOps API
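The loop structure of verify_and_load_images can be sketched as below. The vm_image struct, the checksum comparison standing in for HACL* authentication, and the load_vm_image_stub helper are all hypothetical; only the shape of the loop, authenticate each image and map only those that pass, follows Table A.13.

```c
#include <stdint.h>

/* Illustrative image record; real KCore metadata differs. */
struct vm_image {
    uint32_t checksum;  /* stand-in for a cryptographic hash */
    uint32_t expected;  /* stand-in for the signed reference value */
    int mapped;
};

/* Stand-in for HACL*-based image authentication. */
static int authenticated(const struct vm_image *img) {
    return img->checksum == img->expected;
}

/* Stand-in for BootAux's load_vm_image. */
static void load_vm_image_stub(struct vm_image *img) { img->mapped = 1; }

/* Returns how many images were mapped to the VM's stage 2 page table. */
static int verify_and_load_images(struct vm_image *imgs, int n) {
    int mapped = 0;
    for (int i = 0; i < n; i++) {
        if (authenticated(&imgs[i])) {   /* only verified images */
            load_vm_image_stub(&imgs[i]);
            mapped++;
        }
    }
    return mapped;
}
```

An image that fails authentication is simply never mapped, so an untrusted KServ cannot boot a VM with a tampered image.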
load_vm_image: Maps a given VM's authenticated boot image to the VM's stage 2 page table.
Table A.14: BootAux API
gen_vmid: Allocates a new VM identifier.

alloc_remap_addr: Allocates a contiguous buffer from KCore's address space for remapping a given VM image.
Table A.15: BootCore API
set_vm_power: Sets a given VM's power state.
get_vm_power: Returns a given VM's power state.
Table A.16: VMPower API
clear_vm_range: Loops over a range of physical memory and calls MemAux on each 4KB page to reclaim memory.

prot_map_vm_s2pt: Loops over a range of physical memory, and calls MemAux on each 4KB page to transfer the page to a given VM and map the page to the VM's stage 2 page table.

grant_vm_pages: Loops over a range of guest physical memory and calls MemAux on each 4KB page to grant KServ access to the page. Used by FaultHandler to handle the GRANT_MEM hypercall.

revoke_vm_pages: Loops over a range of guest physical memory and calls MemAux on each 4KB page to revoke KServ's access to the page. Used by FaultHandler to handle the REVOKE_MEM hypercall.

encrypt_mem: Encrypts the contents of a memory buffer using HACL*.
decrypt_mem: Decrypts the contents of a memory buffer using HACL*.
Table A.17: MemOps API
map_pfn_kserv Handles a stage 2 page fault for KServ. It first validates KServ's memory access to the faulted hPA. If the access is permitted, it calls NPTOps to resolve the page fault.
map_pfn_vm Calls NPTOps to map a given guest physical address to a hPA in a given VM's stage 2 page table.
clear_vm_page Scrubs a given physical page and calls PageMgmt to assign the page to KServ.
assign_pfn_vm Checks if a given physical page is owned by KServ. If yes, it calls PageMgmt to assign the page to a target VM.
grant_vm_page Calls PageMgmt to update the sharing status of a given physical page to grant KServ access to the page.
revoke_vm_page Calls PageMgmt to update the sharing status of a given physical page to revoke KServ's access to the page. It then calls NPTOps to unmap the page from KServ's stage 2 page table.
map_pfn_smmu Calls PageMgmt to assign a given physical page to a target principal and calls MmioSPTOps to create a mapping to the page in the SMMU page table for the target principal's device.
unmap_pfn_smmu Calls MmioSPTOps to unmap a given physical page from a given device's SMMU page table.
Table A.18: MemAux API
get_pfn_owner Calls PageIndex to get the index to the S2Page array for a given physical page, and returns owner from the page's respective S2Page.
set_pfn_owner Calls PageIndex to get the index to the S2Page array for a given physical page, and updates owner in the page's respective S2Page.
get_pfn_share Calls PageIndex to get the index to the S2Page array for a given physical page, and returns share from the page's respective S2Page.
set_pfn_share Calls PageIndex to get the index to the S2Page array for a given physical page, and updates share in the page's respective S2Page.
Table A.19: PageMgmt API
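The ownership-tracking pattern behind the PageMgmt and PageIndex layers can be sketched in C as follows. The struct layout, array size, and constants here are illustrative stand-ins, not KCore's actual definitions; only the function names and the S2Page owner/share fields come from the tables above.

```c
#include <stdint.h>

/* Hypothetical per-page metadata: one S2Page per 4KB RAM page. */
#define MAX_S2PAGES 16
#define INVALID_IDX (-1)

struct s2page {
    int owner;   /* principal (KServ or a VM id) that owns this page */
    int share;   /* nonzero if the page is shared with KServ */
};

static struct s2page s2pages[MAX_S2PAGES];

/* PageIndex layer: map a physical address to its S2Page array index,
 * returning INVALID_IDX for non-RAM addresses (Memblock check elided). */
static int get_s2page_index(uint64_t pa)
{
    uint64_t idx = pa >> 12;            /* one entry per 4KB page */
    return idx < MAX_S2PAGES ? (int)idx : INVALID_IDX;
}

/* PageMgmt layer: read and update ownership through the index layer. */
int get_pfn_owner(uint64_t pa)
{
    int idx = get_s2page_index(pa);
    return idx == INVALID_IDX ? -1 : s2pages[idx].owner;
}

void set_pfn_owner(uint64_t pa, int owner)
{
    int idx = get_s2page_index(pa);
    if (idx != INVALID_IDX)
        s2pages[idx].owner = owner;
}
```

The layering mirrors the proof decomposition: PageMgmt never touches physical addresses directly, so its correctness can be argued against the PageIndex specification alone.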
get_s2page_index Calls Memblock to check if a given physical address belongs to RAM. If yes, it returns the index respective to the address in the S2Page array.
Table A.20: PageIndex API
search_memblock Checks if a given input address is within a physical address region thatbelongs to RAM.
Table A.21: Memblock API
Primitive Description
get_s2pt_size Acquires the page table lock and calls get_npt_size in NPTWalk.
walk_s2pt Acquires the page table lock and calls walk_npt in NPTWalk.
walk_pt Acquires the page table lock and calls walk_s1pt in NPTWalk.
map_page Acquires the page table lock and calls set_s2pt in NPTWalk.
map_page_core Acquires the page table lock and calls set_s1pt in NPTWalk.
unmap_pfn_kserv Acquires the page table lock and calls unset_s2pt in NPTWalk.
Table A.22: NPTOps API
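Every NPTOps primitive follows the same lock-then-delegate pattern around its NPTWalk counterpart. A minimal C sketch of two of them, with toy stand-ins for the Locks and NPTWalk layers (the arrays, sizes, and trivial lock words are hypothetical placeholders, not KCore's data structures):

```c
#include <stdint.h>

/* Placeholder lower layers: a per-principal lock word (Locks) and a
 * flat toy "stage 2 page table" per principal (NPTWalk). */
static int npt_lock[4];
static uint64_t fake_npt[4][16];

static void acquire_lock_npt(int vmid) { npt_lock[vmid] = 1; }
static void release_lock_npt(int vmid) { npt_lock[vmid] = 0; }

static uint64_t walk_npt(int vmid, uint64_t gpa)
{
    return fake_npt[vmid][(gpa >> 12) & 15];
}

static void set_s2pt(int vmid, uint64_t gpa, uint64_t hpa)
{
    fake_npt[vmid][(gpa >> 12) & 15] = hpa;
}

/* NPTOps: identical logic to NPTWalk, but serialized by the
 * per-principal page table lock. */
uint64_t walk_s2pt(int vmid, uint64_t gpa)
{
    acquire_lock_npt(vmid);
    uint64_t hpa = walk_npt(vmid, gpa);
    release_lock_npt(vmid);
    return hpa;
}

void map_page(int vmid, uint64_t gpa, uint64_t hpa)
{
    acquire_lock_npt(vmid);
    set_s2pt(vmid, gpa, hpa);
    release_lock_npt(vmid);
}
```

Separating locking (NPTOps) from page table logic (NPTWalk) lets the lower layer be specified and verified as purely sequential code.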
get_npt_size Walks a given principal's stage 2 page table using a given address and returns the page size, either 4KB or 2MB, used in the page table to map the address.
walk_npt Walks a given principal's stage 2 page table using a given address and returns the hPA that the address maps to.
set_s2pt Maps a given address addr to a hPA in a given principal's stage 2 page table.
unset_s2pt Unmaps a given address addr from a given principal's stage 2 page table.
set_s1pt Maps a given address addr to a hPA in the EL2 stage 1 page table.
walk_s1pt Walks the EL2 stage 1 page table using a given address and returns the hPA that the address maps to.
Table A.23: NPTWalk API
walk_pgd Walks the pgd table using a given input address and returns the pgd entry. It calls PTAlloc to allocate a new page for the next level page table if the pgd entry is unmapped.
walk_pud Walks the pud table using a given input address and returns the pud entry. It calls PTAlloc to allocate a new page for the next level page table if the pud entry is unmapped.
walk_pmd Walks the pmd table using a given input address and returns the pmd entry. It calls PTAlloc to allocate a new page for the next level page table if the pmd entry is unmapped.
walk_pte Walks the pte table using a given input address and returns the pte entry.
set_pmd Sets the entry in the pmd table that corresponds to a given input address to an input value.
set_pte Sets the entry in the pte table that corresponds to a given input address to an input value.
Table A.24: PTWalk API
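The allocate-on-demand behavior of the walk_* primitives can be illustrated with a toy two-level walk. The table sizes, index widths, and fixed-pool allocator below are illustrative; only the allocate-if-unmapped structure matches the PTWalk/PTAlloc description above.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy two-level page table: a fixed page pool stands in for PTAlloc. */
#define POOL_PAGES 8
#define ENTRIES 16

static uint64_t pool[POOL_PAGES][ENTRIES];
static int pool_next;

static uint64_t *alloc_pte(void)        /* PTAlloc stand-in */
{
    return pool_next < POOL_PAGES ? pool[pool_next++] : NULL;
}

static uint64_t *pgd[ENTRIES];          /* top-level table */

/* walk_pgd: return the next-level table for addr, allocating a new
 * page from the pool if the pgd entry is unmapped. */
static uint64_t *walk_pgd(uint64_t addr)
{
    unsigned idx = (addr >> 16) & (ENTRIES - 1);
    if (!pgd[idx])
        pgd[idx] = alloc_pte();
    return pgd[idx];
}

/* set_pte / walk_pte: write and read the leaf entry for addr. */
void set_pte(uint64_t addr, uint64_t val)
{
    uint64_t *pte = walk_pgd(addr);
    if (pte)
        pte[(addr >> 12) & (ENTRIES - 1)] = val;
}

uint64_t walk_pte(uint64_t addr)
{
    uint64_t *pte = walk_pgd(addr);
    return pte ? pte[(addr >> 12) & (ENTRIES - 1)] : 0;
}
```

Drawing next-level tables from a per-principal pool, as PTAlloc does, keeps each principal's page table memory disjoint, which is what the memory isolation proof relies on.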
alloc_pud Allocates a pud page table from a given principal's page table pool.
alloc_pmd Allocates a pmd page table from a given principal's page table pool.
alloc_pte Allocates a pte page table from a given principal's page table pool.
Table A.25: PTAlloc API
walk_spt Acquires the SMMU page table lock and calls walk_smmu_pt in MmioSPTWalk.
mmap_spt Acquires the SMMU page table lock and calls set_smmu_pt in MmioSPTWalk.
unmap_spt Acquires the SMMU page table lock and calls unset_smmu_pt in MmioSPTWalk.
init_spt Acquires the SMMU page table lock and scrubs the SMMU page table for a given SMMU translation unit before allocating the unit to a new device.
Table A.26: MmioSPTOps API
walk_smmu_pt Walks a given device's SMMU page table using a given iova, and returns the hPA that the iova maps to.
set_smmu_pt Maps a given iova to a hPA in a given device's SMMU page table.
unset_smmu_pt Unmaps a given iova from a given device's SMMU page table.
Table A.27: MmioSPTWalk API
walk_smmu_pgd Walks the SMMU pgd table using a given iova and returns the pgd entry. It calls MmioPTAlloc to allocate a new page for the next level page table if the pgd entry is unmapped.
walk_smmu_pmd Walks the SMMU pmd table using a given iova and returns the pmd entry. It calls MmioPTAlloc to allocate a new page for the next level page table if the pmd entry is unmapped.
walk_smmu_pte Walks the SMMU pte table using a given iova and returns the pte entry.
set_smmu_pte Sets the entry in the SMMU pte table that corresponds to a given iova to an input value.
Table A.28: MmioPTWalk API
alloc_smmu_pmd Allocates a SMMU pmd page table from a given device's page table pool.
alloc_smmu_pte Allocates a SMMU pte page table from a given device’s page table pool.
Table A.29: MmioPTAlloc API
acquire_lock_npt Acquires the per-principal page table lock used to protect a given principal's page table.
acquire_lock_s2page Acquires the S2Page lock used to protect the shared S2Page array.
acquire_lock_core Acquires the core lock used to protect the shared resources managed by KCore, such as the VM identifiers.
acquire_lock_spt Acquires the SMMU page pool lock used to protect the page table used by a given SMMU translation unit.
acquire_lock_smmu Acquires the SMMU lock used to protect the SMMU configuration.
acquire_lock_vm Acquires the lock used to protect accesses to the global VMInfo array.
release_lock_npt Releases the per-principal page table lock used to protect a given principal's page table.
release_lock_s2page Releases the S2Page lock used to protect the shared S2Page array.
release_lock_core Releases the core lock used to protect the shared resources managed by KCore.
release_lock_spt Releases the SMMU page pool lock used to protect the page table used by a given SMMU translation unit.
release_lock_smmu Releases the SMMU lock used to protect the SMMU configuration.
release_lock_vm Releases the lock used to protect accesses to the global VMInfo array.
Table A.30: Locks API
wait_hlock Helper function for acquiring locks. Used for lock verification.
pass_hlock Helper function for releasing locks. Used for lock verification.
Table A.31: LockOpsH API
wait_qlock Helper function for acquiring locks. Used for lock verification.
pass_qlock Helper function for releasing locks. Used for lock verification.
Table A.32: LockOpsQ API
wait_lock Provides the implementation for acquiring KCore's spinlock.
pass_lock Provides the implementation for releasing KCore's spinlock.
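A conventional ticket spinlock illustrates the kind of primitive wait_lock and pass_lock provide. This is a generic C11 sketch under the assumption of a ticket design, not KCore's verified implementation, whose details (memory ordering, barriers) may differ.

```c
#include <stdatomic.h>

/* Ticket spinlock sketch: threads take a ticket and spin until the
 * owner counter reaches it, giving FIFO acquisition order. */
struct spinlock {
    atomic_uint next;    /* next ticket to hand out */
    atomic_uint owner;   /* ticket currently allowed to enter */
};

void wait_lock(struct spinlock *lk)
{
    unsigned ticket = atomic_fetch_add(&lk->next, 1);
    while (atomic_load(&lk->owner) != ticket)
        ;   /* spin until our ticket is served */
}

void pass_lock(struct spinlock *lk)
{
    /* Advance the owner counter, admitting the next waiter. */
    atomic_fetch_add(&lk->owner, 1);
}
```

The wait/pass naming matches the table: wait_lock blocks until the caller holds the lock, and pass_lock hands it to the next waiter.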