User-Space Process Virtualization in the Context of
Checkpoint-Restart and Virtual Machines
A dissertation presented
by
Kapil Arya
to the Faculty of the Graduate School
of the College of Computer and Information Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

Northeastern University
Boston, Massachusetts
August 2014

Copyright © August 2014 by Kapil Arya
NORTHEASTERN UNIVERSITY
GRADUATE SCHOOL OF COMPUTER SCIENCE
Ph.D. THESIS APPROVAL FORM
THESIS TITLE: User-Space Process Virtualization in the Context of Checkpoint-Restart and Virtual Machines
AUTHOR: Kapil Arya
Ph.D. Thesis approved to complete all degree requirements for the Ph.D. degree in Computer Science
Distribution: Once completed, this form should be scanned and attached to the front of the electronic dissertation document (page 1). An electronic version of the document can then be uploaded to the Northeastern University-UMI website.
Abstract
Checkpoint-Restart is the ability to save a set of running processes to a checkpoint image on disk, and to later restart them from the disk. In addition to its traditional use in fault tolerance, recovering from a system failure, it has numerous other uses, such as application debugging and save/restore of the workspace of an interactive problem-solving environment. Transparent checkpointing operates without modifying the underlying application program, but it implicitly relies on a "Closed World Assumption": the world (including file system, network, etc.) will look the same upon restart as it did at the time of checkpoint. This is not valid for more complex programs. Until now, checkpoint-restart packages have adopted ad hoc solutions for each case where the environment changes upon restart.

This dissertation presents user-space process virtualization to decouple application processes from external subsystems. A thin virtualization layer is introduced between the application and each external subsystem. It provides the application with a consistent view of the external world and allows checkpoint-restart to succeed. The ever growing number of external subsystems makes it harder to deploy and maintain virtualization layers in a monolithic checkpoint-restart system. To address this, an adaptive plugin-based approach is used to implement the virtualization layers, allowing the checkpoint-restart system to grow organically.

The principle of decoupling the external subsystem through process virtualization is also applied in the context of virtual machines, providing a solution to the long-standing double-paging problem. Double-paging occurs when the guest attempts to page out memory that has previously been swapped out by the hypervisor; it leads to long delays for the guest as the contents are read back into machine memory only to be written out again. Performance rapidly drops as a result of the significant lengthening of the time to complete the guest I/O request.
Acknowledgments
No dissertation is accomplished without the support of many people and I
can only begin to thank all those who have helped me in completing it.
I am indebted to my advisor, Gene Cooperman, for his patience, encour-
agement, support, and guidance over the years. It is because of Gene that
I decided to go for a Ph.D., while I was a Master’s student at Northeastern.
Gene taught me how to do research and how to distinguish the ideas that
only I would find interesting from the ideas that are important. I could not
have asked for a better teacher and without him, this document would not
exist.
I am thankful to Panagiotis (Pete) Manolios, Alan Mislove and William
Robertson for serving on my committee and for providing their insightful
input and constructive criticism. I resoundingly thank Peter Desnoyers for
always being available to discuss ideas and for providing constructive feed-
back on several occasions.
I also want to thank the International Student and Scholar Institute (ISSI)
team and Bryan Lackaye for helping with the administrative matters during
my stay at Northeastern.
I was fortunate to be mentored by Alex Garthwaite during the summer
internships at VMware. His guidance and encouragement are always there
and never seem to fade away. Alex agreed to be the external member on
my committee and I am thankful for his feedback and thoughtful comments
that have not only improved the quality of this dissertation, but also pro-
vided ideas for future directions. His dictum that a good dissertation is a
completed one became my mantra during the last two years.
I also want to thank Yury Baskakov for all the help that I received while
working on the Tesseract project. He never got tired of my random specula-
tions and was always there to provide further insights and also to cover my
blind spots. A special thanks goes to Jerri-Ann Meyer and Joyce Spencer for
their continued support of the project. Finally, I want to thank Ron Mann
for his continued advice and guidance that have helped me become a better
engineer.
I am grateful to Alok Singh Gehlot for his friendship, all the advice he
provided me over the years, and for his constant reminder that it’s not done
until it’s done. He was always available for me and without his guidance, I
would not have been at Northeastern for my Master’s and later, Ph.D.
I want to thank Rohan Garg and Jaideep Ramachandran for going through
the thesis drafts and sitting through my practice talks and for providing valu-
able feedback. Over the years, I have had the support of a lot of friends and
I want to thank Jiajun Cao, Harsh Raju Chamarthi, Tyler Denniston, Anand
with database servers, networks of virtual machines, hybrid computations
using CPU accelerators (e.g., GPU and Xeon Phi), Hadoop-style computa-
tions, a broader variety of network models (TCP sockets, InfiniBand, the
SCIF network for the Intel Xeon Phi), competing implementations of Infini-
Band libraries (QLogic/PSM versus InfiniBand OpenIB verbs), and so on.
These complex applications have created a dilemma. A system for pure
transparent checkpointing has no knowledge of the application’s external
world, and an application-level checkpointing system would require the
writer of the target application to insert code that adapts to the modified
external environment after restart. This conflict is the core problem being
solved.
1.2 Double-Paging Anomaly
Hypervisors often overcommit memory to achieve higher VM consolidation
on the physical host. When overcommitting host physical memory, guest
memory is paged in and out from a hypervisor-level swap file to reclaim
host memory. Further, guests running in the virtual machines manage their
own physical address space and may overcommit memory as needed.
Double-paging is an often-cited problem in multi-level scheduling of mem-
ory between virtual machines (VMs) and the hypervisor. This problem oc-
curs when both a virtualized guest and the hypervisor overcommit their re-
spective physical address-spaces. When the guest pages out memory previ-
ously swapped out by the hypervisor, it initiates an expensive sequence of
steps causing the contents to be read in from the hypervisor-level swapfile
only to be written out again, significantly lengthening the time to complete
the guest I/O request. As a result, performance rapidly drops.
1.3 Process Virtualization
Often, application processes violate the closed-world assumption. When
restarting from a checkpoint image, the recreated objects derived from ex-
ternal systems/services may not be the same as their pre-checkpoint version.
This is due to the changing execution environment across a checkpoint-
restart boundary. In order to successfully restart an application process, we
need to virtualize these objects in such a way that the application view of
the objects does not change across checkpoint and restart.
Definition: The application surface of a running application is a set of code
and associated data that includes all application-specific objects (code+data)
and excludes all opaque objects derived from any outside systems/services.
(An opaque object is an object for which the application knows nothing
about the internal structure. The opaque object is only accessible through
an identifying handle.)

Figure 1.1: Application surface of a running process. The virtual names lie inside the application surface, whereas the real names lie outside the surface.
Definition: User-space process virtualization finds a surface that is at least as
large as the application surface, such that any virtualized view of an object
lies inside this surface and any real view lies outside this surface (see Fig-
ure 1.1). On restart, the opaque objects are recreated to provide semanti-
cally equivalent functionality to their pre-checkpoint version. Process virtu-
alization then links these opaque objects with their virtualized view inside
the application surface (through the identifying handles).
There can be more than one possible application surface. Typically one
chooses an application surface close to a well known API for the sake of
stability and maintainability. A wrapper around any call to the API will
update both the virtual and the real view in a consistent manner.
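As a concrete illustration, the following is a minimal, hypothetical sketch of such a wrapper for the getpid() call. The helper names are illustrative only; Chapter 4 describes the actual wrapper mechanism.

#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helpers: REAL_getpid() is the underlying libc call, and
 * real_to_virt() consults the plugin's pid translation table. */
extern pid_t REAL_getpid(void);
extern pid_t real_to_virt(pid_t real_pid);

/* Interposed wrapper: the application only ever sees the virtual pid,
 * which remains stable across checkpoint-restart. */
pid_t getpid(void) {
    return real_to_virt(REAL_getpid());
}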
Remarks:
1. In virtualizing a pid, we will see that libc will retain the real pid known
to the kernel. Thus libc is outside the application surface. But the ap-
plication knows only the virtual pid that resides inside the application
surface.
2. In the case of a shadow device driver, the user-space memory of the
application may contain both some opaque objects (e.g., InfiniBand
queues) and their virtualized views. In this case the application surface
excludes parts of the user-space memory of the application process.
3. Because daemons and the kernel are opaque to the application, they
always lie outside the application surface.
4. An application may create an auxiliary child process (or even dis-
tributed processes in the case of MPI). In this case, the application
surface includes these auxiliary processes.
The goal of user-space process virtualization is to break the tight coupling
between the application process and an external subsystem not under the
control of the application process. In effect, each API is designed to provide
a stable interface to a single system service over the lifetime of a process.
This thesis will demonstrate the ability to find an application surface and
a corresponding API, for which a software translation layer can be built,
enabling the application process to continue to receive the corresponding
system service from an alternative external subsystem. This decouples the
application process from the external subsystem.
1.4 Thesis Statement
User-space process virtualization can be used to decouple application pro-
cesses from external subsystems to allow checkpoint-restart without enforc-
ing a strict “closed-world assumption”. The method of decoupling subsys-
tems applies beyond checkpointing, as seen in a solution to the long-standing
double-paging problem.
1.5 Contributions
This dissertation shows that a checkpointing system can “adapt” to the ex-
ternal environment, one subsystem at a time, by using the user-space process
virtualization technique. To that end, this work introduces a plugin archi-
tecture based on adaptive plugins to virtualize these external subsystems. A
plugin is responsible for virtualizing and checkpointing exactly one external
subsystem to allow the application to adapt to the modified external subsys-
tem.
The plugin architecture allows us to do selective (or partial) virtualiza-
tion of the underlying resources for efficiency purposes. Plugins can be load-
ed/unloaded to suit application requirements. Further, it allows the check-
pointing system to be extended organically, in a non-monolithic manner.
1.5.1 Process Virtualization through Plugins
To demonstrate the strength of the plugin architecture for user-space pro-
cess virtualization, this work presents principled techniques for the follow-
ing problems, which have resisted successful checkpoint-restart solutions for
at least a decade (these plugins are original with this dissertation):
• The PID plugin (§5.2) virtualizes the process and thread identifiers
assigned by the kernel.
• The System V IPC plugin (§5.2) virtualizes the shared memory, semaphore,
and message queue identifiers assigned by the kernel.
• The Timer plugin (§5.2) virtualizes posix timers as well as clock
identifiers assigned by the kernel.
• The SSH plugin (§5.4) virtualizes the underlying SSH connection be-
tween two processes to allow recreation on restart.
• The IB2TCP plugin (§5.11) virtualizes the InfiniBand device driver to
allow a computation to be checkpointed on InfiniBand hardware and
restarted on a TCP-only network.
Notice that the Zap [86] system virtualized the kernel resource identi-
fiers such as pids and System V IPC ids in kernel space. However, the work
of this dissertation virtualizes entirely in user space without any applica-
tion or kernel modifications or kernel modules. Further, this work extends
the notion of user-space virtualization to processes/services outside the ker-
nel such as SSH connections, network daemons and device drivers. This
is achieved either through interposing library calls or by creating shadow
agents/processes for the external resources.
1.5.2 Application-Specific Plugins
Next, we show that plugins can be used for application-specific adapta-
tions, providing the benefits of application-level checkpointing without hav-
ing to modify the base application. The following application-specific plug-
ins (§5.3) are original with this dissertation:
• Malloc plugin virtualizes access to the underlying memory allocation
library (e.g., libc malloc, tcmalloc, etc.).
• DL plugin is used to ensure atomicity for dlopen/dlsym functions with
respect to checkpoint-restart.
• CkptFile plugin provides heuristics for checkpointing open files. It also
helps the file plugin to locate files on restart.
• Uniq-Ckpt plugin is used to control the checkpoint file names, loca-
tions, etc.
1.5.3 Third-Party Plugins
Finally, the success of the plugin architecture can also be seen in third party
plugins. We show that third parties can write orthogonal customized plugins
to fit their needs. The following demonstrates original work due to plugins
created by third party contributors (this dissertation is not claiming these
results):
• Ptrace plugin [127] virtualizes the ptrace system call to allow check-
pointing of an entire gdb session for reversible debugging.
• Record-replay plugin [126] provides a light-weight deterministic re-
play mechanism by recording library calls for reversible debugging.
• KVM plugin [44] is used for checkpointing the KVM/Qemu virtual ma-
chine.
• Tun plugin [44] is used for checkpointing the Tun/Tap network inter-
face for checkpointing a network of virtual machines.
• RM plugin [93] is used for checkpointing in a batch-queue environ-
ment and can handle multiple batch-queue systems.
• InfiniBand plugin [27] provides the first non MPI-specific transparent
checkpoint-restart of InfiniBand network.
• OpenGL plugin [62] uses a record-prune-replay technique for check-
pointing 3D graphics (OpenGL 2.0 and beyond).
1.5.4 Solving the Double-Paging Problem
The process virtualization principles are also applied in the context of vir-
tual machines. The double-paging problem is directly and transparently ad-
dressed by applying the decoupling principle [11]. The guest and hyper-
visor I/O operations are tracked to detect redundancy and are modified to
create indirections to existing disk blocks containing the page contents. The
indirection is created by introducing a thin virtualization layer to virtualize
access to the guest disk blocks. Further, the virtualization is done completely
in user space.
1.6 Organization
The remainder of this dissertation is organized as follows.
A literature review is presented in Chapter 2 and various checkpoint-
restart mechanisms are discussed. The review also includes various virtual-
ization schemes in the context of checkpointing. (Literature for the double-
paging problem is reviewed in Chapter 6.)
Chapter 3 provides several examples to motivate the need for virtualiz-
ing the execution environment. This chapter then uses this motivation to
outline two basic requirements for virtualizing the execution environment.
It is argued there that an adaptive plugin based approach is well suited for
process virtualization.
Chapter 4 describes the design of adaptive plugins and presents the plu-
gin architecture. The proposed plugin architecture is shown to meet the vir-
tualization requirements laid out in Chapter 3. This is followed by a design
recipe for developing new plugins. Dependencies among multiple plugins
are also discussed and an approach to dependency resolution is provided.
Finally, some implementation challenges involved in designing plugins are
presented.
Chapter 5 provides some case studies involving various plugins. In-
cluded there are seven plugins that provide novel checkpointing solutions
of their corresponding subsystems. Some application-specific plugins are
also demonstrated along with several plugins that provide virtualization of
kernel resource identifiers in user space.
Chapter 6 then turns to the double-paging problem. Like the core issue
in checkpoint-restart, here too one is presented with distinct subsystems
must be combined in a unified virtualization scheme. The core problem is
described and motivated, and a design and implementation of a solution is
presented. We also discuss some of the side-effects of the proposed solution
and finally present an evaluation.
Chapter 7 provides some new directions and applications of checkpoint-
restart to non-traditional use-cases that can be pursued based on this disser-
tation, with a conclusion presented in Chapter 8.
Finally, a plugin tutorial is presented in Appendix A, thus providing a
concrete view of the plugin API.
CHAPTER 2
Concepts Related to
Checkpoint-Restart and
Virtualization
This dissertation intersects with four broad areas. The first is that of checkpoint-
restart at the process level. The second concerns system/library call inter-
positioning for modifying process behavior. The third concerns process level
virtualization. The fourth concerns the double-paging problem in the con-
text of virtual machines. The literature for the first three areas is reviewed
here, whereas the related work for the double-paging problem is discussed
in Chapter 6. Since this work builds on the DMTCP software package, a brief
overview of the legacy DMTCP software (DMTCP version 1) is also provided.
2.1 Checkpoint-Restart
Checkpoint-restart has a long history with several mechanisms proposed
over the years [90, 97, 98, 35]. It is often used for process migration,
for load balancing, for fault tolerance, and so on [34]. The work of Milo-
jicic et al. [81] provides a review of the field of process migration. Egwu-
tuoha et al. [35] provides a survey of various checkpoint/restart implemen-
tations in high performance computing. The website checkpointing.org
also lists several checkpoint-restart systems. There are three primary ap-
proaches to checkpointing: virtual machine snapshotting, application-level
checkpointing, and transparent checkpointing.
Virtual machine snapshotting
Virtual machine (VM) snapshotting is a form of checkpointing for virtual
machines and is often used for virtual machine migration. A complex appli-
cation is treated as a black box, and its application surface is expanded to
include the entire guest physical memory, operating system state, devices,
etc. Checkpointing an application involves saving everything inside
the application surface (i.e. the entire virtual machine). While this tech-
nique is general and has been discussed quite extensively [80], it is also
slower and produces larger checkpoint images because the checkpoint mod-
ule is unable to exclude unnecessary parts of guest physical memory from
the application surface. Hence, it is not commonly used as a mechanism
for checkpoint-restart.
Application-level checkpointing
Application-level checkpointing is the simplest form of checkpointing. The
developer of the application inserts checkpointing code directly inside the
application to save the process state, such as data structures, to a file on disk
that is later used to resume the computation. This is application-specific and
requires extensive knowledge of the application. The knowledge of the ap-
plication internals provides complete flexibility, but places a larger burden
on the end user. There are several techniques [129] and frameworks that
provide tools to assist in application-level checkpointing. Examples include
pickling for Python [120] and Boost serialization [108] for C++. A some-
what lighter mode of application-level checkpointing is the save/restore
Table 2.1: Comparison of various checkpointing systems. The other resource virtualization refers to the ability to virtualize protocols, device drivers, etc.
2.1.3 Fault Tolerance
Fault tolerance [70, 58] is a broader concept not discussed here. It enables
a system to continue operating properly in the event of a failure of one
of its components. Several strategies can be deployed to make a system
fault tolerant such as: redundancy, partial re-execution, atomic transactions,
instrumentation of data, and so on.
2.2 System Call Interpositioning
The concept of wrappers, as implemented in DMTCP, has a long and inde-
pendent history under the more general heading of interposition. Interpo-
sition techniques have been used for a wide variety of purposes [123, 136,
65]. See especially [123] for a survey of a wide variety of interposition tech-
niques. The work of Garfinkel [42] discusses practical problems associated
with system call interpositioning. The packages PIN [88] and DynInst [124]
are two examples of software packages that provide interposition techniques
at the level of binary instrumentation.
2.3 Virtualization
Virtualization is the process of allowing unmodified source code or an un-
modified binary to transparently run under varied external environments
(different CPU, different network, different graphics server (e.g., X11-server),
etc.). Most of the original checkpointing packages [73, 74, 26, 31, 71] ig-
nored these issues and concentrated on homogeneous checkpointing.
Virtualization techniques have been developed since the 1960s. Since
then, systems have implemented different flavors of virtualization. In this
section, we discuss the four types of virtualization techniques in common
use today that are closest in spirit to this work.
2.3.1 Language-Specific Virtual Machines
A language-specific virtual machine, sometimes also known as an applica-
tion virtual machine, a runtime environment, or a process virtual machine,
allows an application to execute on any platform without having to write any
platform-specific code. This is achieved by creating a platform-independent
programming environment that abstracts the details of the underlying hard-
ware or operating system. This abstraction is provided at the level of a
high-level programming language. Notable examples include Java Virtual
Machine (JVM) [75], .NET framework [122], and Android virtual machines
(Dalvik) [20, 36].
Language-specific virtual machines are often implemented using an in-
terpreter, with an option of using just-in-time compilation for performance
close to that of a compiled language [32].
2.3.2 Process Virtualization
Process virtualization allows a process to be migrated or restarted in a new
external environment, while preserving the process’s view of the external
world. For example, a kernel may assign to a restarted process a different
pid than the original pid at the time of checkpoint. The earliest checkpoint-
ing packages had assumed that the targeted user process would not save
the value of the pid of a peer process, but rather would re-discover that
pid on each use. As software complexity grew, this assumption became
unreliable. More recent packages either modified the Linux kernel (e.g.,
BLCR [52]), or ran inside a Linux Container, a lightweight virtual machine
(e.g., CRIU [111]).
Process virtualization (as exemplified by this work) has been considered
intensively in the context of checkpointing only recently. Nevertheless, it has
important forerunners in process hijacking [136] and in the checkpointing
packages [76, 135] used in Condor’s Standard Universe. Similarly, there are
connections of process virtualization with dynamic instrumentation (e.g.,
Paradyn/DynInst [124], PIN [88]).
2.3.3 Lightweight O/S-based Virtual Machines
O/S virtualization allows several isolated execution environments to run
within a single operating system kernel. This technique exhibits better per-
formance and density compared to virtual machines. On the downside, it
cannot host a guest operating system different from the host operating sys-
tem, or a different guest kernel (a different Linux distribution is fine). Some
examples include FreeBSD Jail [61], Solaris Zones [96], Linux Containers
(LXC) [117], Linux-VServer [116], OpenVZ [118] and Virtuozzo [119].
Linux Containers are a kernel-level tool for providing a type of virtual-
ization in the form of namespaces for process spaces and network spaces.
This provides an alternative approach for such tasks as that of pid virtu-
alization. The CRIU [111] checkpointing system uses LXC namespaces to
virtualize kernel resource identifiers within the container. The namespaces
avoid the problem of name conflicts for kernel resource identifiers during
process migration.
Although process-level virtualization and Library OS [6, 95, 107] both
operate in user space without special privileges, the goal of Library OS
is quite different. A Library OS modifies or extends the system services
provided by the operating system kernel. For example, Drawbridge [95]
presents a Windows 7 personality, so as to run Windows 7 applications un-
der newer versions of Windows. Similarly, the original exokernel operating
system [37] provided additional operating system services beyond those of
a small underlying operating system kernel, and this was argued to often be
more efficient than having a larger kernel directly provide those services.
2.3.4 Virtual Machines
Hardware virtualization uses an abstract computing platform: the host
software hides the underlying hardware platform, and a virtual machine
(the guest software) runs on top of it. The guest software executes as
if it were running directly on the physical hardware, with a few restrictions,
such as the network access, display, keyboard, and disk storage. Examples
of virtual machines include VMware, Qemu/KVM [114], Xen [15], Virtu-
alBox [130], and Lguest [115]. The virtual machines often run a set of
tools inside the guest operating system to inspect and control its behavior.
Further, in some cases the guest operating system is modified to provide
additional support/features and the technique is referred to as paravirtu-
alization. Some notable examples of paravirtualization are Xen [15] and
Microsoft Hyper-V [125].
One could also include binary instrumentation techniques such as PIN [88]
and DynInst [124] in a discussion of virtualization, but this tends not to be
used much with checkpointing.
The work of this thesis introduces process virtualization for abstractions
beyond the traditional kernel resource identifiers in order to virtualize nu-
merous external subsystems such as SSH connections, InfiniBand network,
KVM and Tun/Tap interfaces, SLURM and Torque batch queues, and GPU
drivers. The modular approach to virtualize these external subsystems al-
lows the checkpointing system to grow organically (see Chapter 4). By vir-
tualizing these external environments, this work enabled some projects to
be the “first” to support checkpointing.
2.4 DMTCP Version 1
DMTCP (Distributed MultiThreaded CheckPointing) is free, open source soft-
ware (http://dmtcp.sourceforge.net, LGPL license) and traces its
roots to early 2005 [30]. The DMTCP approach has always insisted on not
making modifications to the kernel, and not requiring any root (administra-
tive) privileges. While this was sometimes more difficult than an approach
with full privileges inside the kernel, it integrates better with complex cyber
infrastructures. DMTCP’s lack of administrative privilege provides a level of
security assurance.
As a side effect of working completely in user space, DMTCP relies
only on the published APIs (e.g., POSIX and the Linux proc filesystem) to
perform checkpoint-restart. Thanks to the highly stable kernel API, the same
DMTCP software can be used on Linux kernels ranging from the latest
bleeding-edge release to Linux 2.6.5 (released in April, 2004). In this section,
we provide only a brief overview of the checkpoint-restart mechanisms of
DMTCP. More details can be found in Ansel et al. [7].
Using DMTCP with an application is as simple as:
dmtcp_launch ./myapp arg1 ...
# From a second terminal window:
dmtcp_command --checkpoint
dmtcp_restart ckpt_myapp_*.dmtcp
This checkpoint image contains a complete standalone image of the ap-
kernel knows about is looked up in the translation table and passed on to
the kernel. Figure 3.1 shows a simple schematic of a translation layer be-
tween the user processes and the operating system kernel along with a pid
translation table to convert between virtual and real pids. At each restart,
the translation table is refreshed to update the real pids.
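As a minimal sketch of such a table (the structure and helper names here are hypothetical, not DMTCP's actual data structures):

#include <sys/types.h>

/* Hypothetical fixed-size translation table: virtual pids are what the
 * application sees; real pids are what the kernel currently knows. */
typedef struct { pid_t virt; pid_t real; } pid_entry_t;
static pid_entry_t pid_table[1024];
static int num_pids = 0;

pid_t virt_to_real(pid_t virt) {
    for (int i = 0; i < num_pids; i++)
        if (pid_table[i].virt == virt)
            return pid_table[i].real;
    return virt;  /* pass unknown ids through unchanged */
}

/* Called at restart, once a process/thread has been recreated with a
 * new real pid; its virtual pid is preserved from the checkpoint image. */
void refresh_real_pid(pid_t virt, pid_t new_real) {
    for (int i = 0; i < num_pids; i++)
        if (pid_table[i].virt == virt) {
            pid_table[i].real = new_real;
            return;
        }
}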
3.1.2 SSH Connection: Virtualizing a Protocol
Pid virtualization is a classic example of virtualizing low level kernel re-
source identifiers using a translation layer. However, the same solution
doesn’t suffice for higher level abstractions, such as an SSH connection.
Figure 3.2: SSH connection: ssh Node2 app2. The user process, app1, forks a child SSH client process (ssh) that connects over a socket to the SSH server (sshd) on the remote node, Node2, to create a remote peer process, app2.
Recall that the ssh command operates by connecting across the net-
work to a remote SSH daemon, sshd, as shown in Figure 3.2. Since the
SSH daemon is privileged, it is not possible for the unprivileged user-space
checkpointing system to start a new SSH daemon during restart. The issue
becomes even more complicated when the client and server processes are
restarted at entirely different network addresses on different hosts.
For virtualizing an SSH connection, it doesn't suffice to virtualize just the
network address. Instead, the entire SSH client-server connection must be
virtualized. In essence, the SSH daemon represents a privileged process running
a certain protocol. Regardless of whether the protocol is an explicit standard
or a de facto standard internal to the subsystem, process virtualization must
virtualize that protocol. Checkpointing and restarting the privileged SSH
daemon is not an option.
Figure 3.3: Virtualizing an SSH connection: ssh Node2 app2. The call to launch an SSH client process is intercepted to launch virtual SSH client (virt_ssh) and server (virt_sshd) processes. virt_ssh and virt_sshd are unprivileged processes.
Process virtualization provides a principled and robust algorithm for trans-
parently checkpointing an SSH connection. As shown in Figure 3.3, the SSH
connection is virtualized by creating virt_ssh and virt_sshd helper pro-
cesses that shadow the SSH client and server processes respectively. The
virt_ssh and virt_sshd processes are owned by the user and are placed
under checkpoint control. The ssh and sshd processes are not check-
pointed.
On restart, the user processes are restored along with virt_ssh and
virt_sshd processes (without the underlying SSH connection) on new
hosts. The virt_ssh process then recreates a new SSH connection (see Sec-
tion 5.4).
3.1.3 InfiniBand: Virtualizing a Device Driver
Both ssh for a traditional TCP network and the new InfiniBand network
are intimately connected with high performance implementations of MPI
(Message Passing Interface). An implementation usually retains ssh and
TCP in addition to InfiniBand support, since typical MPI implementations
bootstrap their operation through ssh in order to create additional MPI
processes (MPI ranks), and to exchange InfiniBand addresses among peers.
InfiniBand virtualization has been a particular challenge both due to its
complexity [134, 63, 16] and due to the fact that much of the state is hid-
den either within a proprietary device driver or within the hardware itself.
The solution here is to use a shadow device driver approach [106]. The
InfiniBand plugin (§5.10) maintains a replica of the device driver and hard-
ware state by intercepting and recording the InfiniBand library calls. On
restart, this replica is used to recreate and restore the state of the InfiniBand
connection.
3.1.4 OpenGL: A Record/Replay Approach to Virtualizing
a Device Driver
Scientific visualization is yet another example that requires a different kind
of virtualization solution. Some graphics computations are extremely GPU-
intensive. Further, most scientific visualizations today use OpenGL for 3D-
graphics. If a scientist walks away from a visualization and needs to restart
it the next day, time will be wasted reproducing it. Further, switch-
ing between multiple scientific visualizations becomes extremely inefficient.
Hence, checkpoint-restart is a critical technology. However, it is difficult
to checkpoint, because much of the graphics state is encapsulated into a
vendor-proprietary hardware GPU chip.
The OpenGL plugin (§5.9) achieves checkpoint-restart of 3-D graphics
by using a process virtualization strategy of record (record all OpenGL calls),
prune (prune any calls not needed to reproduce the most recent graphics state),
and replay (replay the calls during restart in order to place the GPU into a
semantically equivalent state to the state that existed prior to checkpoint).
3.1.5 POSIX Timers: Adapting to Application
Requirements
A posix timer is an external resource maintained within the kernel and has
an associated kernel resource identifier known as timer id. As with pid virtu-
alization, the timer-id needs to be virtualized as well and can use the same
strategy.
Consider a process that is checkpointed while a timer is still armed, i.e.
the timeout specified with the timer has not expired yet. On restart, what
is the desired behavior? Should the timer expire immediately or should it
expire after exhausting the remaining timeout period? There is no single
correct answer, as the desired result is application dependent. For an appli-
cation that is waiting for a response from a web server, it is desirable for the
timer to expire on restart. However, for an application process that is moni-
toring a peer process for potential deadlocks, the timer should run for the
remaining timeout period.
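A sketch of how a plugin might support both behaviors is shown below; the record structure and policy flag are illustrative assumptions, not the actual Timer plugin interface. At checkpoint time the remaining timeout would be captured with timer_gettime(); on restart, the recreated timer is re-armed according to the chosen policy.

#include <time.h>
#include <string.h>

/* Hypothetical per-timer record kept by a timer plugin. */
struct timer_rec {
    timer_t id;                  /* real timer id after recreation */
    struct itimerspec remaining; /* saved via timer_gettime() at checkpoint */
    int expire_on_restart;       /* application-dependent policy */
};

/* Restart-time handler: re-arm the timer per policy (errors omitted). */
void restore_timer(struct timer_rec *t) {
    struct itimerspec its;
    if (t->expire_on_restart) {
        memset(&its, 0, sizeof(its));
        its.it_value.tv_nsec = 1;  /* fire (almost) immediately */
    } else {
        its = t->remaining;        /* run for the remaining timeout */
    }
    timer_settime(t->id, 0, &its, NULL);
}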
3.2 Virtualizing the Execution Environment
As seen in the previous section, it is imperative to virtualize the external
resources in order to fully support checkpoint restart for any application. In
order to be successful, virtualization should be done transparently to the ap-
plication. This assumes that the application is interacting with the external
resource through a fixed set of APIs. Two basic requirements for virtualizing
an external resource for checkpointing are:
1. Virtualize external subsystems.
2. Capture/restore the state of external resources.
Next, we talk about each of these requirements and elaborate on their im-
portance and discuss what additional features are required for a complete
virtualization solution.
3.2.1 Virtualize Access to External Resources
Since external resources may change between checkpoint and restart, we
need to virtualize them. This can be achieved through a translation layer
between the application process and the resource. Virtualizing a resource
may be as simple as translating between virtual and real identifiers such
as pid-virtualization (Section 3.1.1) or it may involve more sophisticated
mechanisms like shadow device drivers (Section 3.1.3). Depending upon the
external resource, the translation may be active throughout the computation
(e.g., for pids) or only during the restart procedure (for SSH).
Further, the translation layer should ensure that the access to a resource
is atomic with respect to checkpoint-restart, i.e., a checkpoint shouldn't be
allowed while the process is in the middle of manipulating/accessing the re-
source. Not doing this may result in an inconsistent state at restart. Consider
pid virtualization where a thread tries to send a signal to another thread us-
ing the virtual tid (thread id). The pid virtualization layer translates the
virtual tid to the real tid and sends the signal using real tid. Further con-
sider that the process is checkpointed after the translation from virtual to
real, but before the signal is actually sent. On restart, the process will re-
sume and will try to send the signal with the old real tid, which of course is
not valid now.
Share the virtualized view with peers
Virtualizing access to external resources gets complicated in a distributed
environment. Processes communicate with their peers. This demands a
consistent virtualization layer across all involved parties. It becomes more
evident after restart, when the translation table is updated to reflect the
current view of the external resource. These updates must be shared with all
the peer processes to allow them to update their own translation tables. For
example, in case of network address virtualization, each process must inform
its peers of its new network address on restart to allow them to restore socket
connections.
3.2.2 Capture/Restore the State of External Resources
When restarting a process from a previous checkpoint, we need to restore
the process view of the external resource. We need to identify the relevant
information that would be required to restore/recreate the external resource
during restart. This information should be gathered at the time of check-
point and should be saved as part of the checkpoint image. This information
can then be read from the checkpoint image on restart.
Quiesce the external resource
During checkpoint, the external resources should be quiesced to ensure a
consistent state. For example, an asynchronous disk read operation must be
allowed to finish before writing the process memory to the checkpoint image
to avoid inconsistent data due to ongoing memory updates (DMA).
Consistency of the computation state
As discussed above, a virtualization scheme should be transparent to the
user application. Thus, the application view of the external resource should
be consistent before and after checkpoint. Similarly, the application process
should not observe any change in its own state before and after checkpoint.
This involves preserving the state of the running process (e.g., threads, mem-
ory layout, and file descriptors) between checkpoint and restart.
Note that it is acceptable to alter the process state and/or the state of the
external resource while performing checkpoint-restart. However, such changes
should be reverted and the pre-checkpoint view of the application should
be restored before the application process is allowed to resume executing
application code.
3.3 Adaptive Plugins as a Synthesis of
System-Level and Application-Level
Checkpointing
So far we have discussed the motivation for virtualizing the execution envi-
ronment along with the basic requirements for achieving the same. In this
section we will discuss possible design choices.
There are two basic approaches for achieving the goals discussed in Sec-
tion 3.2. One is to use application-specific checkpointing by having the ap-
plication developer write extra code for supporting checkpointing. However,
as discussed in Section 2.1, this is not an ideal solution as it requires knowl-
edge of the internals of the applications and puts a burden on the developer.
The second approach is to use an existing monolithic checkpointing system
such as DMTCP version 1 and insert the virtualization code in it along with
a large number of heuristics to satisfy a variety of application needs (e.g.,
heuristics for posix timers as discussed in Section 3.1.5). However, there is
no universal set of heuristics that can be used with all applications as each
application requires specific heuristics to cater to its needs.
In this work, we present adaptive plugins as an ideal compromise be-
tween these two extreme approaches to meet the virtualization require-
ments. An adaptive plugin is responsible for virtualizing a single external
resource. By basing plugins on top of a transparent checkpointing package
such as DMTCP, the simplicity of transparent checkpointing is maintained.
With plugins, no target application code is ever modified, yet they enable
application-specific fine tuning for checkpoint-restart. We have already seen
examples where the external resource needs to be virtualized in previous
sections. The posix timer plugin is an example of an application-specific
heuristic plugin. A memory cutout plugin to reduce the memory footprint of the
process for reducing checkpoint image size would be yet another example of
an application-specific plugin.
CHAPTER 4
The Design of Plugins
In the previous chapter, we discussed several use cases that require virtual-
ization of external resources in order to support checkpoint-restart. External
resources may include, but are not limited to, kernel resource identifiers,
protocols, and hardware device drivers. We further listed the two basic re-
quirements for virtualizing an external resource and discussed how a design
based on adaptive plugins is well suited for such tasks.
Section 4.1 introduces a basic framework of a plugin architecture that pro-
vides the same set of services for virtualizing external resources that were
introduced informally in Chapter 3. A plugin is an implementation of the
process virtualization abstraction. In process virtualization, an external sub-
system is virtualized by a plugin. All software layers above the layer of that
plugin see a modified subsystem.
Section 4.2 then uses these requirements to provide a design recipe for
virtualization through plugins. Section 4.3 then takes into account the is-
sue of dependencies among multiple plugins within the same application
process. Section 4.4 extends that design recipe to multiple processes, in-
cluding distributed processes on multiple hosts. Section 4.5 describes three
special-purpose plugins that are required for checkpointing all processes.
This chapter concludes with Section 4.6, containing some implementation
challenges.
Figure 4.1: Plugin Architecture. The target application (program and data) sits above internal and third-party plugin libraries, each providing library wrappers to virtualize a resource and to capture/restore its state. Beneath them lie the base plugins (including the thread, memory, and coordinator interface plugins), the plugin engine, the runtime libraries (libc, etc.), and the operating system kernel.
4.1 Plugin Architecture
An application consists of program and data. It interacts with the execution
environment through various libraries. For example, the libc runtime library
provides access to the kernel resources, a device driver library may provide
access to the underlying device hardware, and so on. Thus one can imagine
virtualizing the execution environment by intercepting the relevant library
calls. This allows us to inspect and modify the behavior of the underlying
subsystem as seen by the application.
Figure 4.1 shows a high level view of the plugin architecture. It has
two main components: (1) plugins, and (2) the plugin engine. Plugins
and the plugin engine are implemented as separate dynamic libraries. They
are loaded into the application using the LD_PRELOAD feature of the Linux
loader.
Plugin
A plugin is a checkpoint subsystem that virtualizes a single external resource
or subsystem with the help of function wrappers (§4.1.1). It saves and re-
stores the state of the external subsystem. Examples of external subsystems
are: process ids, network sockets, InfiniBand, etc. Application processes are
treated as independent, and inter-process communication through pids,
sockets, etc. is handled through plugins. Further, a plugin is transpar-
ent to the target application and can be enabled/disabled for the application
as needed. Finally, third parties can write orthogonal customized plugins to
fit their needs.
Plugin Engine
The plugin engine provides event notification services (§4.1.2) to assist plug-
ins in capturing/restoring the state of their specific external resources. It further
interacts with a coordinator interface plugin to provide publish/subscribe
services (§4.1.3) to enable plugins to interact with each other and share the
translation tables for resource virtualization.
4.1.1 Virtualization through Function Wrappers
Since the underlying resources provided by the operating system may change
between checkpoint and restart, there is a need to virtualize them. The plu-
gin virtualizes the external resources by putting wrappers around interesting
library calls, which interpose when the target application makes such a call.
In the case of pids, the virtualization can be done using a simple table translat-
ing between virtual and real pids as shown in Listing 4.1. The arguments
passed to the library call are modified to replace the virtual pid with the real
pid. Similarly, the return value can also be modified as required. The virtual
pid column of this table is saved as part of the checkpoint image, and at restart
time the real pid column is populated as processes/threads are recreated.
int kill(pid_t pid, int sig) {
    /* Block checkpointing while translating the pid and calling the
     * real function, so that the translation remains valid. */
    disable_checkpoint();
    pid_t real_pid = virt_to_real(pid);  /* virtual -> real pid */
    int ret = REAL_kill(real_pid, sig);  /* the underlying libc call */
    enable_checkpoint();
    return ret;
}

Listing 4.1: A simple wrapper for kill().
As seen in the above listing, a function wrapper is implemented by defin-
ing a function of the same name as the call it is going to wrap. The real func-
tion here refers to the function with the same signature in a later plugin or a run-
time library. It is possible for multiple plugins to create wrappers around a
single library function. The order of execution of wrappers is determined
by a plugin hierarchy corresponding to the order in which the plugins are
invoked (Section 4.3).
Capture/Restore state of external resource
Wrappers are also used to “spy” on the parameters used by an application to
create a system resource, in order to assist in creating a semantically equiv-
alent copy on restart. At the time of checkpoint, a plugin saves the current
state of its underlying resources into the process memory. The state can be
obtained from a number of places such as the process environment and the
operating system kernel. In some cases, the function wrappers can also be
used to gather the information about the external resources. For example, in
the “socket” wrapper (Listing 4.2), the socket plugin will save the associated
domain, type, and protocol information along with the socket identifier.
int socket(int domain, int type, int protocol) {
    /* Create and record the socket atomically with respect to checkpoint. */
    disable_checkpoint();
    int ret = REAL_socket(domain, type, protocol);
    if (ret != -1) {
        /* Record the creation parameters so that a semantically
         * equivalent socket can be recreated on restart. */
        register_new_socket(ret, domain, type, protocol);
    }
    enable_checkpoint();
    return ret;
}

Listing 4.2: Wrapper for socket() to record socket state.
Atomic transactions
Plugins may have to perform atomic operations that must not be interrupted
by a checkpoint. For example, the translation and the call to the real function should
be done atomically with respect to checkpoint-restart. Otherwise, there is a
possibility of checkpointing after the translation but before the real function
is called. In that case, on restart, the translated value is no longer valid
and can impact the correctness of the program. The plugin engine provides
disable_checkpoint and enable_checkpoint services for enclosing the critical
section as seen in Listing 4.1.
The disable_checkpoint and enable_checkpoint services are implemented
using a modified write-biased reader-writer lock. The modification allows a
recursive reader lock even if the writer is queued and waiting for the lock.
The checkpoint thread must acquire the writer lock before it can quiesce the
user threads. On the other hand, the user threads acquire and release the
reader lock as part of a call to disable_checkpoint and enable_checkpoint
respectively. If a checkpoint request arrives while a user thread is in the
middle of a critical section, the checkpoint thread will wait until the user
thread comes out of the critical section and releases the reader lock. A user
thread is not allowed to acquire a reader lock if the checkpoint thread is
already waiting for the writer lock to prevent checkpoint starvation.
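A minimal sketch of such a lock, using a pthread mutex and condition variable, is shown below. The names, the single broadcast condition variable, and the thread-local recursion counter are illustrative assumptions rather than DMTCP's actual implementation, and a single checkpoint (writer) thread is assumed.

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int readers = 0;            /* user threads inside critical sections */
static int writer_waiting = 0;     /* checkpoint thread queued or active */
static __thread int my_depth = 0;  /* per-thread recursion depth (GCC __thread) */

void disable_checkpoint(void) {    /* reader lock */
    pthread_mutex_lock(&m);
    /* Write-biased: new readers block once the checkpoint thread is
     * waiting, but a thread already holding the lock may re-enter. */
    while (writer_waiting && my_depth == 0)
        pthread_cond_wait(&cv, &m);
    if (my_depth++ == 0)
        readers++;
    pthread_mutex_unlock(&m);
}

void enable_checkpoint(void) {     /* reader unlock */
    pthread_mutex_lock(&m);
    if (--my_depth == 0 && --readers == 0)
        pthread_cond_broadcast(&cv);  /* wake the checkpoint thread */
    pthread_mutex_unlock(&m);
}

void acquire_ckpt_writer_lock(void) {  /* called by the checkpoint thread */
    pthread_mutex_lock(&m);
    writer_waiting = 1;
    while (readers > 0)
        pthread_cond_wait(&cv, &m);
    pthread_mutex_unlock(&m);      /* readers remain blocked until release */
}

void release_ckpt_writer_lock(void) {
    pthread_mutex_lock(&m);
    writer_waiting = 0;
    pthread_cond_broadcast(&cv);   /* let blocked readers proceed */
    pthread_mutex_unlock(&m);
}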
Atomicity is especially important for wrappers that create or destroy a
resource instance. For example, when creating a network socket, if the
checkpoint is taken right after the socket is created but before the socket
plugin has a chance to register it, the socket may not be recreated at restart, as
no record exists of the socket. Thus one must atomically create and record
socket state as shown in Listing 4.2.
Wrappers can be considered the most basic of all virtualization tools. A
flexible, robust implementation of wrapper functions turns out to be surpris-
ingly subtle and is discussed in more detail in Section 4.6.1.
4.1.2 Event Notifications
Event notifications are used to inform other plugins (within the same pro-
cess) of interesting events. Any plugin can generate notifications. The plugin
engine then delivers these notifications to all available plugins in a sequential
fashion. The order of delivery of notifications depends on the plugin hier-
archy as discussed in Section 4.3. Plugins must declare an event hook in
order to receive event notifications. A plugin may decide to ignore any or all
notifications.
Figure 4.2 shows the “write-ckpt” and “restart” events generated by the
coordinator interface plugin which are then delivered to all other plugins by
the plugin engine.
Figure 4.2: Event notifications for write-ckpt and restart events. The numbers in parentheses indicate the order in which messages are sent. Notice that the restart event notification is delivered in the opposite order of the write-ckpt event.
Some of the interesting notifications are:
• Initialize: generated during the process initialization phase (even be-
fore main() is called). The plugins can initialize data structures, etc. A
plugin may choose to register an exit-handler using atexit() which will
be called when the process is terminating.
• Write-Ckpt: each plugin saves the state of the external resources into
process’s memory. The memory plugin(s) then create the checkpoint
image.)
• Resume: generated at the end of a checkpoint, when the process re-
sumes normal execution.
• Restart: generated during restart phase.
• AtFork: generated during a fork; it works similarly to the libc function
pthread_atfork.
void dmtcp_event_hook(int is_pre_process, int type, void *data) {
if (is_pre_process) {
switch (type) {
case Initialize:
myInit(); break;
case Write_Ckpt:
myWriteCkpt(); break;
...
}
}
if (!is_pre_process) {
switch (type) {
case Resume:
myResume(); break;
case Restart:
myRestart(); break;
...
}
}
}

Listing 4.3: An event hook inside a plugin.
The Resume and Restart notifications are sent to plugins in the oppo-
site order from the Write-Checkpoint notification (see Listing 4.3 and Fig-
ure 4.2b). This is to ensure that any dependencies of a plugin are restored
before the plugin itself is restored. For example, the memory plugin (re-
sponsible for writing out or reading back the checkpoint image) is always
the lowest layer (see Figure 4.1). This is so that other plugins may save data
in the process’s memory during checkpoint, and find it again at the same
address during restart.
Figure 4.3: Publish/Subscribe example for sockets. The socket plugins on Node 1 and Node 2 exchange their current local and remote addresses through the coordinator.
4.1.3 Publish/Subscribe Service
In a distributed environment, a publish/subscribe service is needed so that a
given type of plugin may communicate with its peers in different processes.
Typically, on restart, once the process resources have been recreated, the
plugins publish their virtual ids along with the corresponding real ids using
the publish/subscribe service. Next they subscribe for updates from other
processes and update their translation tables accordingly. This was seen
for the pid virtualization plugin (Section 3.1.1). Similarly, when a parallel
computation is restarted on a new cluster, the socket plugin must exchange
socket addresses among peers.
At the heart of the publish/subscribe services is a key-value database
whose key corresponds to the virtual name and whose value corresponds to
the real name of the underlying resource. The database is populated when
plugins publish the key-value pairs. Once the plugin has published all of
the relevant key-value pairs, it may now subscribe by sending queries to the
database. The plugins are notified as soon as a match for the queried key is
available. Typically, the key-value database is used only at restart time, as it
does not need to be preserved across checkpoint-restart.
Figure 4.3 shows an example of the socket plugins exchanging their cur-
rent network address with their peers. During the Write-Checkpoint phase,
the socket peers agree on using a unique key (see Section 4.4.1) to iden-
tify the connection. While restarting, this unique key is used to publish the
current network address.
It is possible to have multiple publish/subscribe APIs that differ accord-
ing to scope. It is left to the plugins to choose the scope best suited for their
needs. Two trivial scopes are node-private and cluster-wide. Node-private
publish/subscribe API is sufficient for plugins dealing with resources limited
to a single node, such as pseudo-terminals, shared-memory, and message-
queues. In contrast, plugins dealing with resources that may span multiple
nodes, such as sockets and InfiniBand, should use the cluster-wide publish/
subscribe API.
The node-private publish/subscribe service may be implemented using
shared-memory while the cluster-wide publish/subscribe service must be
provided by some centralized resource such as the DMTCP coordinator.
4.2 Design Recipe for Virtualization through
Plugins
So far we have seen the plugin architecture and the services provided by
it. We have also seen how these services suffice to meet the virtualization
requirements. We use this information to create a typical recipe for writing a
new plugin to virtualize an “external resource”. One is usually given a name
or id (identifier) to provide a link to the external resource. The id may be for
an InfiniBand queue pair, for a graphics window, for a database connection,
for a connection from a guest virtual machine to its host/hypervisor, and so
on.
In all of these cases, the recipe is:
1. Intercept communication to the external resource (usually by interposing
on library calls), and translate between any real ids from the external
resource and virtual ids that are passed to the application software.
A plugin maintains this translation table of virtual/real ids (a sketch
of this step follows the list).
2. Quiesce the external resource (or wait until the external resource has
itself reached a quiescent state).
3. Interrogate the state of the external resource sufficiently to be able to
reconstruct a semantically equivalent resource at restart time.
4. Checkpoint the application. The checkpoint will include state infor-
mation about the external resource, as well as a translation table of
virtual/real ids.
5. At restart time, the state information for the external resource is used
to create a semantically equivalent copy of the external resource. The
translation table is then updated to maintain the same virtual ids,
while replacing the real ids of the original external resource with the
real ids of the newly created copy of the external resource.
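As the sketch promised in step 1, the following wraps a hypothetical resource-creating library call and maintains the virtual-to-real translation table. All names here are illustrative; a real plugin installs such wrappers through the plugin's function-wrapping mechanism.

    /* Toy translation table of virtual/real ids kept by the plugin. */
    #define MAX_IDS 1024
    static int real_ids[MAX_IDS];   /* indexed by virtual id */
    static int next_virt = 1;

    static int virtualize(int real_id) {
      int virt = next_virt++;
      real_ids[virt] = real_id;
      return virt;                  /* only virtual ids reach the app */
    }

    /* Hypothetical library calls being interposed on. */
    extern int real_open_resource(const char *name);
    extern int real_use_resource(int real_id);

    /* Wrappers that the application calls in place of the library. */
    int open_resource(const char *name) {
      return virtualize(real_open_resource(name));
    }

    int use_resource(int virt_id) {
      return real_use_resource(real_ids[virt_id]);
    }

At restart (step 5), the plugin recreates the resource and merely overwrites real_ids[virt] with the new real id; the virtual ids held by the application remain valid.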
It is not always efficient to quiesce and save the state of an external
resource. The many disks used by Hadoop are a good example of this. The
data in an external database server is another example. It is not practical to
drain and save all of the external data in secondary storage.
There are two potential approaches. The first approach is to delay the
checkpoint during a critical phase. In the case of Hadoop, one would delay
the checkpoint until the Hadoop computation has executed a reduce oper-
ation, in order to not overly burden the resources of the Hadoop back end.
A similar approach can be taken for NVIDIA GPUs. In many cases, there
are also strategies for plugins to transparently detect this critical phase and
delay the checkpoint until that time.
The second approach is to allow for a partial closed-world assumption
in which some state (data/contents) is assumed to be compatible across
checkpoint and restart. In the case of the external database server, the external
data already lies in fault-tolerant storage and is compatible across checkpoint
and restart. Thus the solution is to maintain a virtual id that identifies the
external storage of the server. That virtual id is used at restart time to restore
the connection to the database server.
4.3 Plugin Dependencies
Some plugins may have dependencies on other plugins. For example, the
File plugin depends on the Pid plugin to restore file descriptors pointing to
“/proc/PID/maps” and so on. Each plugin provides the list of dependencies
which must be satisfied to successfully load the given plugin. The depen-
dency declaration also affects the level of parallelism that can be achieved
when performing phases such as Checkpoint, Resume and Restart.
Subject to the dependencies among plugins, this design provides end
users with the possibility of selective virtualization. Selectively including only
some plugins is advantageous for three reasons: (i) performance reasons
(some end-user plugins might have high overhead); (ii) software mainte-
nance (other plugins can be removed while debugging a particular plugin);
and (iii) portability (platform-specific plugins need not be loaded on other
platforms).
4.3.1 Dependency Resolution
Similar in spirit to modern software package formats such as RPM and deb,
a plugin provides a list of features/services that it provides, depends on,
or conflicts with. For example, the socket plugin may provide services for
“TCP”, “UDS” (Unix Domain Sockets), and “Netlink” socket types and de-
pends on the “File” plugin (to restore file system based unix domain sock-
ets).
The dmtcp_launch program, which is used to launch an application under
checkpoint control, compiles a list of all available plugins by looking at
various environment variables, such as LD_LIBRARY_PATH. A user-defined
list of plugins can also be specified to be loaded into the application. The
dmtcp_launch program examines this plugin list and creates a partial or-
der of dependencies among the plugins. The list of available plugins is
searched to fulfill any missing dependencies for the user-defined plugins.
If a match is found, plugins are loaded automatically. Otherwise an error is
reported. If two or more plugins provide the same feature/service, a conflict
is recorded and the user is provided with the conflicting plugins.
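The sketch below shows one way such a partial order can be computed: a provider is located for each declared dependency, and a topological sort (Kahn's algorithm) yields a load order in which every plugin follows its providers. The data layout is illustrative, not dmtcp_launch's actual implementation.

    #include <string.h>

    #define MAXP 32

    typedef struct {
      const char *name;
      const char *provides[4];   /* features offered (NULL-terminated) */
      const char *depends[4];    /* features required from other plugins */
    } Plugin;

    /* Index of a plugin providing `feature`, or -1 (unmet dependency). */
    static int provider_of(const Plugin *p, int n, const char *feature) {
      for (int i = 0; i < n; i++)
        for (int j = 0; p[i].provides[j]; j++)
          if (strcmp(p[i].provides[j], feature) == 0)
            return i;
      return -1;
    }

    /* Emit a load order in which each plugin follows all of its
     * providers.  Returns 0 on success; -1 on an unmet dependency
     * or a dependency cycle. */
    static int load_order(const Plugin *p, int n, int order[]) {
      int indeg[MAXP] = {0}, deg[MAXP] = {0}, adj[MAXP][MAXP];

      for (int i = 0; i < n; i++)
        for (int j = 0; p[i].depends[j]; j++) {
          int prov = provider_of(p, n, p[i].depends[j]);
          if (prov < 0) return -1;        /* report missing dependency */
          adj[prov][deg[prov]++] = i;     /* edge: provider -> dependent */
          indeg[i]++;
        }

      int queue[MAXP], head = 0, tail = 0, k = 0;
      for (int i = 0; i < n; i++)
        if (indeg[i] == 0) queue[tail++] = i;
      while (head < tail) {
        int i = queue[head++];
        order[k++] = i;
        for (int j = 0; j < deg[i]; j++)
          if (--indeg[adj[i][j]] == 0) queue[tail++] = adj[i][j];
      }
      return (k == n) ? 0 : -1;           /* k < n implies a cycle */
    }

For example, given a "file" plugin providing "File" and a "socket" plugin providing "TCP" and "UDS" while depending on "File", load_order() places the File plugin before the socket plugin. Conflict detection (two providers for one feature) would extend provider_of() to report duplicates rather than return the first match.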
Figure 6.2: An example of double-paging. (Final steps: (4) memory allocation and swap-in; (5) establish PPN-to-MPN mapping; (6) write block to guest disk; (7) zero the new MPN for reuse.)
PPN that now maps to it can be used for a new virtual mapping in step 7.
A hypervisor has no control over when a virtualized guest may page
memory out to disk, and may even employ reclamation techniques like bal-
looning [128] in addition to hypervisor-level swapping. Ballooning is a tech-
nique that co-opts the guest into choosing pages to release back to the plat-
form. It employs a guest driver or agent to allocate, and often pin, pages
in the guest’s physical address-space. Ballooning is not a reliable solution in
overcommitted situations since it requires guest execution to choose pages
and release memory and the guest is unaware of which pages are backed
by MPNs. Hypervisors that do not also page risk running out of memory.
While preferring ballooning, VMware uses hypervisor swapping to guarantee
progress. Because levels of overcommitment vary over time, hypervisor
swapping may interleave with guest paging, which is itself under pressure
from ballooning. This can lead to double-paging.
The double-paging problem also impacts hypervisor design. Citing the
potential effects of double-paging, some [82] have advocated avoiding the
use of hypervisor-level swapping completely. Others have attempted to mit-
igate the likelihood through techniques such as employing random page
selection for hypervisor-level swapping [128] or employing some form of
paging-aware paravirtualized interface [48, 47]. For example, VMware’s
scheduler uses heuristics to find “warm” pages to avoid paging out what
the guest may also choose to page out. These heuristics have extended ef-
fects, for example, on the ability to provide large (2MB) mappings to the
guest. Our goals are to address the double-paging problem in a manner
that is transparent to the guest running in the VM, to identify and elide
unnecessary intermediate steps (such as steps 4, 5, and 6 in Figure 6.2),
and to simplify hypervisor scheduling policies. Although we do not demon-
strate that double-paging is a problem in real workloads, we do show how
its effects can be mitigated.
6.3 Design
We now describe our prototype’s design. First, we describe how we extended
the hosted platform to behave more like VMware’s server platform, ESX.
Next, we outline how we identify and eliminate redundant I/Os. Finally, we
describe the design of the hypervisor swap subsystem and the extensions to
the virtual disks to support indirections.
6.3.1 Extending The Hosted Platform To Be Like ESX
VMware supports two kinds of hypervisors: the hosted platform in which
the hypervisor cooperatively runs on top of an unmodified host operating
system such as Windows or Linux, and ESX where the hypervisor runs as
the platform kernel, the vmkernel. Two key differences between these two
platforms are how memory is allocated and mapped to a VM, and where the
network and storage stacks execute.
In the existing hosted platform, each VM’s device support is managed in
the vmx, a user-level process running on the host operating system. Privi-
leged services are mediated by the vmmon device driver loaded into the host
kernel, and control is passed between the vmx and the VMM and its guest
via vmmon. An advantage of the hosted approach is that the virtualization
of I/O devices is handled by libraries in the vmx and these benefit from the
device support of the underlying host OS. Guest memory is mmapped into
the address space of the vmx. Memory pages are exposed to the VMM and guest
by using the vmmon device driver to pin the pages in the host kernel and
return the MPNs to the VMM. By backing the mmapped region for guest
memory with a file, hypervisor swapping is a simple matter of invalidating
all mappings for the pages to be released in the VMM, marking, if necessary,
those pages as dirty in the vmx’s address space, and unpinning the pages on
the host.
In ESX, network and storage virtual devices are managed in the vmker-
nel. Likewise, the hypervisor manages per-VM pools of memory for backing
guest memory. To page memory out to the VM’s swap file, the VMM and
vmkernel simply invalidate any guest mappings and schedule the pages’ con-
tents to be written out. Because ESX explicitly manages the swap state for
a VM including its swap file, it is able to employ a number of optimizations
unavailable on the current hosted platform. These optimizations include the
capturing of writes to entire pages of memory [4], and the cancellation of
swap-ins for swapped-out guest PPNs that are targets for disk read requests.
The first optimization is triggered when the guest accesses an unmapped
or write-protected page and faults into the VMM. At this point, the guest’s
instruction stream is analyzed. If the page is shared [128] and the effect
of the write does not change the content of the page, page-sharing is not
broken. Instead, the guest’s program counter is advanced past the write and
it is allowed to continue execution. If the guest’s write is overwriting an
entire page, one or both of two actions are taken. If the written pattern is
a known value, such as repeated 0x00, the guest may be mapped a shared
page. This technique is used, for example, on Windows guests because Win-
dows zeroes physical pages as they are placed on the freelist. Linux, which
zeroes on allocation of a physical page, is simply mapped a writeable zeroed
MPN. Separately, any pending swap-in for that PPN is cancelled. Since the
most common case is the mapping of a shared zeroed-page to the guest, this
optimization is referred to as the PShareZero optimization.
The second optimization is triggered by interposition on guest disk read
requests. If a read request will overwrite whole PPNs, any pending swap-ins
associated with those PPNs are deferred during write-preparation, the pages
are pinned for the I/O, and the swap-ins are cancelled on successful I/O
completion.
We have extended Tesseract so that its guest-memory and swap mecha-
nisms behave more like those of ESX. Instead of mmapping a pagefile to pro-
vide memory for the guest, Tesseract’s vmx process mmaps an anonymously-
backed region of its address space, uses madvise to mark the range as NOT-
NEEDED, and explicitly pins pages as they are accessed by either the vmx or
by the VMM. Paging by the hypervisor becomes an explicit operation, read-
ing from or writing to an explicit swap file. In this way, we are able to also
employ the above optimizations on the hosted platform. We consider these
as part of our baseline implementation.
6.3.2 Reconciling Redundant I/Os
Tesseract addresses the double-paging problem transparently to the guest,
allowing our solution to be applied to unmodified guests. To achieve this goal,
we employ two forms of interposition. The first tracks writes to PPNs by the
guest and is extended to include a mechanism to track valid relationships
between guest memory pages and disk blocks that contain the same state.
The second exploits the fact that the hypervisor interposes on guest I/O re-
quests in order to transform the requests’ scatter-gather lists. In addition,
we modify the structure of the guest VMDKs and the hypervisor swap file,
extending the former to support indirections from the VMDKs into the hy-
pervisor swap disk. Finally, when the guest reallocates the PPN and zeroes
its contents, we apply the PShareZero optimization in step 7 in Figure 6.2.
In order to track which pages have writable mappings in the guest, MPNs
are initially mapped into the guest read-only. When written by the guest, the
resulting page-fault allows the hypervisor to track that the guest page has
been modified. We extend this same tracking mechanism to also track when
guest writes invalidate associations between guest pages in memory and
blocks on disk. The task is simpler when the hypervisor, itself, modifies guest
memory since it can remove any associations for the modified guest pages.
Likewise, virtual device operations into guest pages can create associations
between the source blocks and pages. In addition, the device operations may
remove prior associations when the underlying disk blocks are written. This
approach, employed for example to speed the live migration of VMs from
one host to another [87], can efficiently track which guest pages in memory
have corresponding valid copies of their contents on disks.
The second form of interposition occurs in the handling of virtualized
guest I/O operations. The basic I/O path can be broken down into three
stages. The basic data structure describing an I/O request is the scatter-
gather list, a structure that maps one or more possibly discontiguous mem-
ory extents to a contiguous range of disk sectors. In the preparation stage,
the guest’s scatter-gather list is examined and a new request is constructed
that will be sent to the underlying physical device. It is here that the unmod-
ified hypervisor handles the faulting in of swapped out pages as shown in
steps 4 and 5 of Figure 6.2. Once the new request has been constructed, it is
issued asynchronously and some time later there is an I/O completion event.
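For concreteness, a scatter-gather list can be represented roughly as follows. This layout is illustrative, not VMware's actual structure.

    #include <stdint.h>

    /* One memory extent participating in the I/O. */
    typedef struct {
      uint64_t addr;        /* page address (e.g., PPN << 12, or a host VA) */
      uint32_t length;      /* extent length in bytes */
    } SgElem;

    /* Maps possibly discontiguous memory extents onto one contiguous
     * range of disk sectors. */
    typedef struct {
      uint64_t start_sector;  /* first sector of the contiguous disk range */
      uint32_t num_elems;
      SgElem   elems[256];    /* extents, in disk order */
    } SgList;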
To support the elimination of I/Os to and from virtual disks and the hy-
pervisor block-swap store (or BSST), each guest VMDK has been extended
to maintain a mapping structure allowing its virtual block identifiers to refer
to blocks in other VMDKs. Likewise, the hypervisor BSST has been extended
with per-block reference counts to track whether blocks in the swap file are
accessible from other VMDKs or from guest memory.
The tracking of associations and interposition on guest I/Os allows four
kinds of I/O elisions:
swap - guest-I/O a guest I/O follows the hypervisor swapping out a page’s
contents (Figures 6.1a and 6.1d)
swap - swap a page is repeatedly swapped out to the BSST with no inter-
vening modification
guest-I/O - swap the case in which the hypervisor can take advantage of
prior guest reads or writes to avoid writing redundant contents to the
BSST (Figure 6.1c)
guest-I/O - guest-I/O the case in which guest I/Os can avoid redundant
operations based on prior guest operations whose results are known to
reside in memory (for reads) or in a guest VMDK (for writes)
For simplicity, Tesseract focuses on the first two cases since these capture the
case of double-paging. Because Tesseract does not introspect on the guest,
it cannot distinguish guest I/Os related to memory paging from other kinds
of guest I/O. But the technique is general enough to support a wider set
of optimizations such as disk deduplication for content streamed through a
guest. It also complements techniques that eliminate redundant read I/Os
across VMs [82].
Figure 6.3: Double-paging with Tesseract. (Diagram: the guest view and hypervisor view of guest physical memory (PPNs) backed by host memory (MPNs), with a block indirection layer linking the guest disk and the BSST.)
6.3.3 Tesseract’s Virtual Disk and Swap Subsystems
Figure 6.3 shows our approach embodied in Tesseract. The hypervisor swaps
guest memory to a block-swap store (BSST) VMDK, which manages a map
from guest PPNs to blocks in the BSST, a per-block reference-counting mech-
anism to track indirections from guest virtual disks, and a pool of 4KB disk
blocks. When the guest OS writes out a memory page that happens to be
swapped out by the hypervisor, the disk subsystem detects this condition
while preparing to issue the write request. Rather than bringing memory
contents for the swapped-out page back into memory, the hypervisor updates
the appropriate reference counts in the BSST, issues the I/O, and updates
metadata in the guest VMDK, adding a reference to the corresponding disk
block in the BSST.
Figure 6.4 shows timelines for the scenario in which the guest OS is paging out
an already-swapped page, with and without Tesseract. With Tesseract we are
able to eliminate the overheads of a new page allocation and a disk read.
To achieve this, Tesseract modifies the I/O preparation and I/O comple-
tion steps. For write requests, the memory pages in the scatter-gather list are
Figure 6.4: Write I/O and hypervisor swapping. (a) Baseline (without Tesseract): VMM swap-out; a guest write I/O forces memory allocation and a synchronous swap-in, followed by a zero write and PTE update. (b) With Tesseract: VMM swap-out; the guest write proceeds with a metadata write, PShareZero, and PTE update.
checked for valid associations to blocks in the BSST. If these are found, the
target VMDK’s mapping structure is updated for those pages’ corresponding
virtual disk blocks to reference the appropriate blocks in the BSST and the
reference counts of these referenced blocks in the BSST are incremented. For
read requests, the guest I/O request may be split into multiple I/O requests
depending on where the source disk blocks reside.
Consider the state of a guest VMDK and the BSST as shown in Fig-
ure 6.5a. Here, a guest write operation wrote five disk blocks, of which
two were previously swapped to the BSST. In this example, block 2 still con-
tains the swapped contents of some PPN and has a reference count reflecting
this fact and the guest write. Hence, its state has “swapped” as true and a
reference count of 2. Similarly, block 4 only has a nonzero reference count
because the PPN whose swapped contents originally created the disk block
has since been accessed and its contents paged back in. Hence, its state has
“swapped” as false and a reference count of 1. To read these blocks from
the guest VMDK now requires three read operations: one against the guest
VMDK and two against the BSST. The results of these read operations must
then be coalesced in the read completion path.
One can view the primary cost of double-paging in an unmodified hy-
pervisor as impacting the write-preparation time for guest I/Os. Likewise,
one can view the primary cost of these cases in Tesseract as impacting the
read-completion time. To mitigate these effects, we consider two forms of
defragmentation. Both strategies make two assumptions:
Figure 6.5: Examples of reference counts with Tesseract and with defragmentation. (a) With Tesseract: guest VMDK blocks redirect to two BSST blocks, one with swapped: true and refcnt: 2, the other with swapped: false and refcnt: 1. (b) With Tesseract and BSST defragmentation: the referenced blocks are copied into a contiguous sequence in the BSST, and the guest VMDK indirections are updated to the new copies. (c) With Tesseract and guest VMDK defragmentation: the referenced blocks are copied back to their original locations in the guest VMDK.
• the original guest write I/O request (represented in blue) captures the
guest’s notion of expected locality, and
• the guest is unlikely to immediately read the same disk blocks back
into memory.
Based on these assumptions, we extended Tesseract to asynchronously reor-
ganize the referenced state in the BSST. In Figure 6.5b, we copy the refer-
enced blocks into a contiguous sequence in the BSST and update the guest
VMDK indirections to refer to the new sequence. This approach reduces
the number of split read operations. In Figure 6.5c, we copy the referenced
blocks back to the locations in the original guest VMDK where the guest
expects them. With this approach, the typical read operation need not be
split. In effect, Tesseract asynchronously performs the expensive work that
occurred in steps 4, 5, and 6 of Figure 6.2, eliminating its cost to the guest.
6.4 Implementation
Our prototype extends VMware Workstation as described in Section 6.3.1.
Here, we provide more detail.
6.4.1 Explicit Management of Hypervisor Swapping
VMware Workstation relies on the host OS to handle much of the work as-
sociated with swapping guest memory. A pagefile is mapped into the vmx’s
address space and calls to the vmmon driver are used to lock MPNs backing
this memory as needed by the guest. When memory is released through hy-
pervisor swapping, the pages are dirtied, if necessary, in the vmx’s address
space and unlocked by vmmon. Should the host OS need to reclaim the
backing memory, it does so as if the vmx were any other process: it writes
out the state to the backing pagefiles and repurposes the MPN.
For Tesseract, we modified Workstation to support explicit swapping of
guest memory. First, we eliminated the pagefile and replaced it with a spe-
cial VMDK, the block swap store (BSST) into which swapped-out contents
are written. The BSST maintains a partial mapping from PPNs to disk blocks
tracking the contents of currently swapped-out PPNs. In addition, BSST
maintains a table of reference counts on the blocks in the BSST referenced
by other guest VMDKs.
Second, we split the process for selecting pages for swapping from the
process for actually writing out contents to the BSST and unlocking the back-
ing memory. This split is motivated by the fact that having eliminated dupli-
cate I/Os between hypervisor swapping and guest paging, the system should
benefit by both levels of scheduling choosing the same set of pages. The se-
lected swap candidates are placed in a victim cache to “cool down”. Only
the coldest pages are eventually written out to disk. This victim cache is
maintained as a percentage of locked memory by the guest—for our study,
10%. Should the guest access a page in the pool, it is removed from the pool
without being unlocked.
When the guest pages out memory, it does so to repurpose a given guest
physical page for a new linear mapping. Since this new use will access that
guest physical page, one may be concerned that this access will force the
page to be swapped in from the BSST first. However, because the guest will
either zero the contents of that page or read into it from disk and because the
VMM can detect that the whole page will be overwritten before it is visible
to the guest, the vmx is able to cancel the swap-in and complete the page
locking operation.
6.4.2 Tracking Memory Pages and Disk Blocks
There are two steps to maintaining a mapping between disk blocks and pages
in memory. The first is recognizing the pages read and written in guest and
hypervisor I/O operations. By examining scatter-gather lists of each I/O,
one can identify when the contents in memory and on disk match. While
we plan to maintain this mapping for all associations between guest disks
and guest memory, we currently only track the associations between blocks
in the BSST and main memory.
The second step is to track when these associations are broken. For guest
memory, this event happens when the guest modifies a page of memory. The
VMM tracks when this happens by trapping the fact that a writable mapping
is required, and this information is communicated to the vmx. For device
accesses, on the other hand, this event is tracked either through explicit
checks in the module which provides devices the access to guest memory, or
by examining page-lists for I/O operations that read contents into memory
pages.
6.4.3 I/O Paths
When the guest OS is running inside a virtual machine, guest I/O requests
are intercepted by the VMM, which is responsible for storage adaptor virtu-
alization, and then passed to the hypervisor, where further I/O virtualization
occurs.
Figure 6.6 identifies the primary modules in VMware Workstation’s I/O
stack. The guest operating system generates scatter-gather lists for I/O (1).
Tesseract inspects scatter-gather lists of incoming guest I/O requests in the
SCSI Disk Device layer, where a request to the guest VMDK may be updated
(2). Any extra I/O requests to the BSST may be issued (3) as shown in
Table 6.2. The asynchronous I/O manager issues the I/O requests to
the host file system (4). On completion, the asynchronous I/O manager
generates completion events (5). Waiting for the completion of all the I/O
requests needed to service the original guest I/O request is isolated to the
SCSI Disk Device layer as well (6). When running with defragmentation
enabled (see Section 6.5), Tesseract allocates a pool of worker threads for
handling defragmentation requests.
Guest Write I/Os
Guest I/O requests have PPNs in scatter-gather lists. The vmx rewrites the
scatter-gather list, replacing guest PPNs with virtual pages from its address
Figure 6.6: VMware Workstation I/O stack. (Diagram: guest I/O requests flow from the guest operating system through the VMM into the vmx, passing the SCSI Disk Device layer, the Block Indirection Layer, the Asynchronous I/O Manager, and the Host File Layer.) (1) S/G list received from guest. (2) Tesseract updates the S/G list (write: swapped pages are removed; read: guest VMDK indirections are looked up). (3) I/O request dispatched (write: a single request with holes; read: one request to the guest VMDK and one or more requests to the BSST). (4) Asynchronous I/O scheduled; the I/O takes place asynchronously. (5) Completion events generated for each dispatched I/O. (6) Guest notified of completion (write: guest-to-BSST indirections created; read: wait for all requests and merge results).
space before passing it further to the physical device. Normally, for write
I/O requests, if a page was previously swapped out, so that its PPN does not
have a backing MPN, the hypervisor allocates a new MPN and brings the
page's contents in from disk.
With Tesseract, we check if the PPNs are already swapped out to BSST
blocks by querying the PPN-to-BSST-block mapping. We then use the virtual
address of a special dummy page in the scatter-gather list for each page that
resides in the BSST. On completion of the I/O, metadata associated with the
guest VMDK is updated to reflect the fact that the contents of guest disk
blocks for BSST-resident pages are in the BSST. This sequence allows the
guest to page out memory without inducing double-paging.
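The sketch below condenses this write-preparation step; bsst_lookup(), bsst_ref_inc(), dummy_page_addr, and the SgElem layout are illustrative stand-ins for Tesseract's internal interfaces.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t addr; uint32_t length; } SgElem;

    extern bool bsst_lookup(uint64_t ppn, uint64_t *block); /* PPN-to-BSST map */
    extern void bsst_ref_inc(uint64_t block);
    extern uint64_t dummy_page_addr;   /* placeholder page for swapped PPNs */

    /* Prepare a guest write: substitute the dummy page for BSST-resident
     * pages and record which BSST blocks need indirections on completion. */
    void prepare_guest_write(SgElem sg[], int n,
                             uint64_t pending[], int *npending) {
      *npending = 0;
      for (int i = 0; i < n; i++) {
        uint64_t ppn = sg[i].addr >> 12, block;
        if (bsst_lookup(ppn, &block)) {
          bsst_ref_inc(block);            /* block must not be repurposed */
          sg[i].addr = dummy_page_addr;   /* avoids the synchronous swap-in */
          pending[(*npending)++] = block;
        }
      }
      /* On I/O completion, a guest-VMDK-to-BSST indirection is created for
       * each pending block, so later reads find the contents in the BSST. */
    }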
Figure 6.7: The pages swapped out to the BSST are replaced with a dummy page to avoid double-paging; indirections are created for the corresponding guest disk blocks. (a) Scatter-gather list prepared by the guest OS for a disk write: pages 1–8. (b) Modified scatter-gather list: pages 1, 3, 5, and 8 remain in host memory; the pages swapped out to the BSST are replaced by the dummy page.
Figure 6.7 illustrates how write I/O requests to the guest VMDK are han-
dled by Tesseract. Tesseract recognizes that contents for pages 2, 4, 6 and 7
in the scatter-gather list provided by the guest OS reside in the BSST (Fig-
ure 6.7a). When a new scatter-gather list to be passed to the physical device
is formed, a dummy page is used for each BSST-resident page (Figure 6.7b).
Guest Read I/Os and Guest Disk Fragmentation
Recognizing that data may reside in both the guest VMDK and the BSST is a
double-edged sword. On the guest write path it allows us to dismiss pages
that are already present in the BSST and thus avoid swapping them in just to
be written out to the guest VMDK. However, when it comes to guest reads,
the otherwise single I/O request might have to be split into multiple I/Os.
This happens when some of the data needed by the I/O is located in the
BSST.
Since data that has to be read from the BSST may not be contiguous on
disk, the number of extra I/O requests to the BSST may be as high as the
number of data pages in the original I/O request that reside in the BSST. We
refer to a collection of pages in the original I/O request for which a separate
I/O request to the BSST must be issued as a hole. Read I/O requests to the
guest VMDK which have holes are called fragmented.
We modify a fragmented request so that all pages that should be filled
in with the data from the BSST are replaced with a dummy page, which
serves as a placeholder and receives throwaway data from the guest VMDK
read. In the end, for each fragmented read request, we issue one modified I/O
request to the guest VMDK and N requests to the BSST, where N is the
number of holes. After all the issued I/Os are completed, we signal the
completion of the originally issued guest read I/O request.
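The completion protocol can be sketched with a simple counter: the guest is signaled only after the guest-VMDK read and all N reads to the BSST finish. The structures below are illustrative.

    #include <stdatomic.h>

    typedef struct GuestRead GuestRead;
    struct GuestRead {
      atomic_int outstanding;          /* 1 (guest VMDK) + N (BSST holes) */
      void (*signal_guest)(GuestRead *r);
    };

    /* Initialize before dispatching the 1 + N sub-requests. */
    static void fragmented_read_init(GuestRead *r, int nholes) {
      atomic_store(&r->outstanding, 1 + nholes);
    }

    /* Shared completion callback for the guest-VMDK read and each BSST
     * read; the last sub-request to finish signals the original guest I/O. */
    static void subrequest_done(GuestRead *r) {
      if (atomic_fetch_sub(&r->outstanding, 1) == 1)
        r->signal_guest(r);            /* all sub-reads merged */
    }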
Figure 6.8: Original guest read request split into multiple read requests due to holes in the guest VMDK. (Pages 1, 3, 5, and 8 are read from the guest VMDK, with the dummy page standing in for pages 2, 4, 6, and 7, which are read from the BSST.)
In Figure 6.8, the guest read I/O request finds disk blocks for pages 2, 4,
6 and 7 located on the BSST, where they occupy non-contiguous space.
Tesseract issues one read request to the guest VMDK to get data for pages 1,
3, 5 and 8. In the scatter-gather list sent to the physical device, a dummy
page is used as a read target for pages 2, 4, 6 and 7. Together with that one
read I/O request to the guest VMDK, four read I/O requests are issued to
the BSST. Each of those four requests reads data from one of the four disk
blocks in the BSST.
Optimization of Repeated Swaps
In addition to addressing the double-paging anomaly by tracking guest I/Os
whose contents exist in the BSST, we also implemented an optimization for
back-to-back swap-out requests for a memory page whose contents remain
clean. If a page’s contents are written out to the BSST, and later swapped
back in, we continue to track the old block in the BSST as a form of victim
cache. If the same page is chosen to be swapped out again and there has
been no intervening modification of the contents of the page in memory, we
simply adjust the reference count (see Section 6.4.4) for the block copy that
is already in the BSST.
6.4.4 Managing Block Indirection Metadata
Tesseract keeps in-memory metadata for tracking PPN-to-BSST block map-
pings and for recording block indirections between guest and BSST VMDKs.
The PPN-to-BSST block mapping is stored as key-value pairs in a hash
table. Indirections between guest and BSST VMDKs are tracked in a similar
manner.
Tesseract also keeps reference counts for the BSST blocks. When a new
PPN-to-BSST mapping is created, the reference count for the corresponding
BSST block is set to 1. The reference count is incremented in the write
prepare stage for PPNs found to have PPN-to-BSST block mappings. This
ensures that such BSST blocks are not repurposed while the guest write
is still in progress. Later, on the write completion path, the guest-VMDK-
to-BSST indirection is created. The reference count of the BSST blocks is
decremented during hypervisor swap in operation. It is also decremented
when the guest VMDK block is overwritten by new contents and the previous
guest block indirection is invalidated. Blocks with zero reference counts are
considered free and reclaimable.
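These transitions can be summarized in code; the helpers below are an illustrative condensation of the rules just described, not Tesseract's actual interfaces.

    #include <stdint.h>

    extern uint32_t refcnt[];   /* per-BSST-block reference counts */

    /* Swap-out creates a new PPN-to-BSST mapping. */
    void on_swap_out(uint64_t block)        { refcnt[block] = 1; }

    /* Write-prepare found a PPN mapped to this block: pin it so it is not
     * repurposed while the guest write is in flight (the indirection is
     * added later, on the write-completion path). */
    void on_write_prepare(uint64_t block)   { refcnt[block]++; }

    /* Hypervisor swap-in: the PPN no longer references the block. */
    void on_swap_in(uint64_t block)         { refcnt[block]--; }

    /* The guest overwrote a VMDK block, invalidating its indirection. */
    void on_indirection_invalidated(uint64_t block) { refcnt[block]--; }

    /* Blocks with a zero reference count are free and reclaimable. */
    int  block_is_free(uint64_t block)      { return refcnt[block] == 0; }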
Metadata Consistency
While updating metadata in memory is faster than updating it on the disk,
it poses consistency issues. What if the system crashes before the metadata
is synced back to persistent storage? To reduce the likelihood of such prob-
lems, Tesseract periodically synchronizes the metadata to disk on the same
schedule used by the VMDK management library for virtual disk state. How-
ever, because reference counts in the BSST and block-indirections in VMDKs
are written at different stages in an I/O request, crashes must be detected
and an fsck-like repair process run.
Entanglement of guest VMDKs and BSST
Once indirections are created between guest and BSST VMDK, it becomes
impossible to move just the guest VMDK. To disentangle the guest VMDK,
we must copy back from the BSST each block for which the guest VMDK
holds an indirection. This can be done both online and offline. More details
about the online process are in Section 6.5.2.
6.5 Guest Disk Fragmentation
As mentioned in Section 6.4.3, when running with Tesseract, guest read I/O
requests might be fragmented in the sense that some of the data the guest
is asking for in a single request may reside in both the BSST and the guest
VMDK.
The fragmentation level depends on the nature of the workload, the
guest OS, and swap activity at the guest and the hypervisor level. Our ex-
periments with SPECjbb2005 [103] showed that even for a moderate level of
memory pressure, as much as 48% of all read I/O requests had at least one
hole.
By solving the double-paging problem, Tesseract significantly reduces the
write-prepare time of guest I/O requests, since synchronous swap-in requests
no longer cause delays. However, a non-trivial overhead was added to read-
completion. Indeed, instead of waiting for a single read I/O request to the
guest VMDK, the hypervisor may now have to wait for several extra read
I/O requests to the BSST to complete before reporting the completion to the
guest.
To address these overheads, Tesseract was extended with a defragmen-
tation mechanism that improves read I/O access locality and thus reduces
read-completion time. We investigated two approaches to implementing
defragmentation: BSST defragmentation and guest VMDK defragmentation.
While defragmentation is intended to help reduce read-completion time, it
has its own cost. Defragmentation requests are asynchronous and reduce the
time to complete affected guest I/Os, but, at the same time, they contribute
to a higher disk load and in the extreme cases may have an impact on read-
prepare times. The defragmentation activity can be throttled on detecting
performance bottlenecks due to higher disk load. ESX, for example, pro-
vides a mechanism, SIOC, that measures latencies to detect overload and
enforce proportional-share fairness [50]. The defragmentation mechanism
could participate in this protocol.
6.5.1 BSST Defragmentation
BSST defragmentation uses guest write I/O requests as a hint of which BSST
blocks might be accessed together in a single I/O read request in the future.
Given that information we then group together the identified blocks in the
BSST.
Figure 6.9 shows a scatter-gather list of the write I/O request that goes
to the guest VMDK. In that request, the contents of pages 2, 4, 6 and 7 are
already present in the BSST. As soon as these blocks are identified, a worker
thread picks up a reallocation job that will allocate a new set of contiguous
blocks in BSST and will copy the contents of BSST blocks for pages 2, 4,
6 and 7 into that new set of blocks. This copying allows those blocks to be
read later as a single I/O request issued by the guest and reflects its own
expectation of the locality of these blocks.
BSST defragmentation is not perfect. If multiple guest VMDK writes cre-
ate indirections to the same BSST blocks, multiple copies of those blocks
Figure 6.9: Defragmenting the BSST. (Blocks 2, 4, 6, and 7, scattered in the BSST, are copied into a contiguous set of BSST blocks; the guest VMDK holds blocks 1, 3, 5, and 8.)
may be made in the BSST. Further, since blocks are still present in both the
guest VMDK and the BSST, extra I/O requests to the BSST cannot be en-
tirely eliminated. In addition, BSST defragmentation tries to predict read
access locality from write access locality, and the boundaries of read requests
will not, in general, match the boundaries of the write requests. So
each read I/O request that without defragmentation would have required
reads from both the guest VMDK and the BSST will still be split into the one
which goes to the guest VMDK and one or more requests to the BSST. All
this contributes to longer read completion times as shown in Table 6.4.
However, it is relatively easy to implement BSST defragmentation with-
out worrying too much about data races with the I/O going to the guest
VMDK. It can significantly reduce the number of extra I/Os that have to be
issued to the BSST to service the guest I/O request as shown in Table 6.3.
If a guest read I/O request preserves the locality observed at the time
of guest writes, we need more than one read I/O request from the BSST
only when it hits more than one group of blocks created during BSST de-
fragmentation. Although this is entirely dependent on the workload, one can
expect read requests to typically be smaller than write requests, so that in
many cases the number of extra I/O requests to the BSST is reduced to one
(the read fits into one defragmented area) or two (it crosses the boundary
between two defragmented areas).
Figure 6.10: Defragmenting the guest VMDK. (BSST blocks 2, 4, 6, and 7 are copied back to their locations in the guest VMDK, restoring the contiguous sequence 1–8.)
6.5.2 Guest VMDK Defragmentation
Like BSST defragmentation, guest VMDK defragmentation uses the scatter-
gather lists of write I/O requests to identify BSST blocks that must be copied.
But unlike BSST defragmentation, these blocks are copied to the guest VMDK.
The goal is to restore the guest VMDK to the state it would have had with-
out Tesseract. Tesseract with guest VMDK defragmentation replaces swap-in
operations with asynchronous copying from the BSST to the guest VMDK.
For example, in Figure 6.10, blocks 2, 4, 6 and 7 are copied to the relevant
locations on the guest VMDK by a worker thread.
We enqueue a defragmentation request as soon as the scatter-gather list
of the guest write I/O request is processed and blocks to be asynchronously
fetched to the guest VMDK are identified. The defragmentation requests are
organized as a priority queue. If a guest read I/O request needs to read
data from the block that has not been copied from the BSST, the priority of
the defragmentation request that refers to the block is raised to highest and
the guest read I/O request is blocked until copying of all the missing blocks
finishes.
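The escalation logic can be sketched as follows; the structures and helpers (DefragReq, defrag_find(), defrag_wait()) are illustrative, not Tesseract's actual interfaces.

    #include <stdint.h>

    #define PRIO_MAX 255

    typedef struct {
      uint64_t guest_block;   /* destination block in the guest VMDK */
      uint64_t bsst_block;    /* source block still in the BSST */
      int      priority;      /* worker threads pop the highest priority */
    } DefragReq;

    extern DefragReq *defrag_find(uint64_t guest_block); /* NULL if copied */
    extern void defrag_wait(DefragReq *req); /* block until copy completes */

    /* Called on the guest read path for each block the read touches. */
    void ensure_block_resident(uint64_t guest_block) {
      DefragReq *req = defrag_find(guest_block);
      if (req != NULL) {
        req->priority = PRIO_MAX;  /* jump the queue */
        defrag_wait(req);          /* read proceeds once copying finishes */
      }
    }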
While Tesseract with guest defragmentation can have an edge over Tesser-
act without defragmentation, it is not always a win. With guest defragmen-
tation, before a guest I/O read request has a chance to be issued to the
guest VMDK, it may become blocked waiting for a defragmentation request
to complete. This may end up being slower than issuing requests to the
BSST and the guest VMDK in parallel.
Disentanglement of Guest and BSST VMDKs.
Guest defragmentation has an added benefit of removing the entanglement
between guest and BSST VMDK. Once there are no block indirections be-
tween guest and BSST VMDK, the guest VMDK can be moved easily. This
also allows us to disable Tesseract’s double-paging optimization on-the-fly.
6.6 Evaluation
We ran our experiments on an AMD Opteron 6168 (Magny-Cours) with 12
1.9 GHz cores, 1.5 GB of memory and a 1 TB 7200rpm Seagate SATA drive, a
1 TB 7200rpm Western Digital SATA drive, and a 128 GB Samsung SSD drive.
We used OpenSUSE 11.4 as the host OS and a 6 VCPU 700 MB VM running
Ubuntu 11.04. We used Jenkins [113] to monitor and manage execution of
the test cases.
To ensure the same test conditions for all test runs, we created a fresh copy
of the guest virtual disk from backup before each run. For the evaluation,
we ran SPECjbb2005 [103], modified to emit instantaneous scores
every second. It was run with 6 warehouses for 120 seconds. The heap size
was set to 450 MB. The SPECjbb benchmark creates several warehouses and
processes transactions for each of them.
We induced hypervisor-level swapping by setting a maximum limit on
the pages the VM can lock. The BSST VMDK was preallocated. Swap-out
victim cache size was chosen to be 10% of the VM’s memory size.
All experiments except the one with SSD represent results from five trial
runs. The SSD experiment represents results from three trial runs.
6.6.1 Inducing Double-Paging Activity
To control hypervisor swapping, we set a hypervisor-imposed limit on the
machine memory available for the VM. Guest paging was induced by running
the SPECjbb benchmark with a working set larger than the available guest
memory.
To induce double-paging, the guest must page out the pages that were
already swapped by the hypervisor. Since the hypervisor would choose only
cold pages from the guest memory, we employed a custom memhog that
would lock some pages in the guest memory for a predetermined amount
of time inside the guest. While the pages were locked by this memhog, a
different memhog would repeatedly touch the rest of available guest pages
making them “hot”. At this point the pages locked by the first memhog are
considered “cold” and swapped out by the hypervisor.
Next, memhog unlocks all its memory and the SPECjbb benchmark is
started inside the guest. Once the warehouses have been created by SPECjbb,
the memory pressure increases inside the guest. The guest is forced to find
and page out “cold pages”. The pages unlocked by memhog are good candi-
dates as they have not been touched in the recent past.
We used memhog and memory locking in our setup to make the experiments
more repeatable. In the real world, the conditions we were simulating could
be observed, for example, when an application undergoes an execution phase
shift, or when an application that caches a lot of data in memory without
actively using it is descheduled and another memory-intensive application
is woken up by the guest.
As a baseline we ran with Tesseract disabled. This effectively disabled
analysis and rewriting of guest I/O commands so that all pages affected by
an I/O command that happened to be swapped out by the hypervisor had to
be swapped back in before the command could be issued to disk.
6.6.2 Application Performance
While it is hard to control and measure the direct impact of individual
double-paging events, we use the pauses or gaps observed in the logged
instantaneous scores of each SPECjbb run to characterize the application be-
havior. Depending upon the amount of double-paging activity, the pauses
can be as big as 60 seconds in a 120 second run and negatively affect the
final score. Often the pauses are associated with garbage collection activity.
Figure 6.13: Scores and total pause times for SPECjbb runs with varying host overcommitment and 60 MB memhog. (Panel (d), 20% host overcommitment: SPECjbb score versus total SPECjbb blockage time in seconds, for baseline and Tesseract runs.)
Varying Levels of Host Memory Pressure
To study the effect of increasing memory pressure by the hypervisor, we
ran the application with various levels of host overcommitment with 60 MB
memhog inside the guest.
Figure 6.13 shows the effect of increasing host memory pressure on the
application scores and total pause times. For lower host pressure (0% and
5%), the score and pause times for the baseline and Tesseract are about the
same. However, for higher memory pressure there is a significant difference
in the performance. For example, in the 20% case, the baseline observes
total pauses in the range of 80–110 seconds. Tesseract, on the other hand,
observes total pauses in a much lower range of 30–60 seconds.
Figure 6.14: Comparing maximum single pauses for SPECjbb under various defragmentation schemes with varying host memory overcommitment and 60 MB memhog. (Max pause/blockage time in seconds versus host memory overcommitment of 0–20%, for no-defrag, guest-defrag, bsst-defrag, and baseline.)
Figure 6.14 focuses on the maximum pauses seen by the application as
host memory pressure grows. While the maximum pauses are insignificant
at lower memory pressure, with higher pressure Tesseract clearly outperforms
the baseline.
Table 6.2: Holes in read I/O requests for Tesseract without defragmentation for varying levels of host overcommitment and 60 MB memhog inside the guest.
Without host memory pressure there is no hypervisor level swapping and
all 5,152 guest read I/O requests can be satisfied without going to the BSST.
At higher levels of memory pressure, the hypervisor starts swapping pages
to disk. Tesseract detects pages in guest write I/O requests that are already
in the BSST to avoid swap-in requests for such pages. The amount of work
saved by Tesseract on the write I/O path is quantified in the final column of
Table 6.4: Average read and write prepare/completion times in microseconds for baseline and Tesseract with and without defragmentation. Host overcommitment was 10%; memhog size was 60 MB.
6.6.7 Overheads
I/O Path Overhead
Table 6.4 presents Tesseract overheads on I/O paths. The average overhead
per I/O is on the order of microseconds. Read prepare time for guest defrag-
mentation is higher than the others due to contention on the guest VMDK
during defragmentation. At the same time, the read completion time for
guest defragmentation case is much lower than the other two cases as there
are no extra reads going to the BSST. On the write I/O path, the defrag-
mentation schemes have larger overhead. This is due to the background
defragmentation of the disks which is kicked off as soon as the write I/O is
scheduled.
Memory Overhead
Per Section 6.4.4, Tesseract maintains in-memory metadata for three pur-
poses: tracking (a) associations between PPN and BSST blocks; (b) refer-
ence counts for BSST blocks; and (c) indirections between guest VMDK and
BSST VMDK. We use 64 bits to store a (4 KB) block number. To track asso-
ciations between PPN and BSST blocks, we re-use the MPN field in the page
frames maintained by the hypervisor, so there is no extra memory overhead
here. In the general case, where associations between PPNs and blocks in the
guest VMDK have to be tracked, we will need a separate memory structure with a maxi-
mum overhead of 0.2% of VM’s memory size. Each BSST block’s reference
count requires 4 bytes per disk block. To optimize the lookup for free/avail-
able BSST blocks, a bitmap is also maintained with one bit for each block.
The guest VMDK to BSST VMDK indirection metadata requires 24 bytes for
each guest VMDK block for which there is a valid indirection to BSST. A
bitmap similar to that for BSST is maintained for guest VMDK blocks to de-
termine if an indirection to BSST exists for a given guest VMDK block.
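As a quick check of that figure: one 64-bit (8-byte) block number per 4 KB (4096-byte) page amounts to 8/4096 ≈ 0.195%, which rounds to the 0.2% quoted above.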
6.7 Related Work
This work intersects three areas. The first is that of uncooperative hypervisor
swapping and the double-paging problem. The second concerns the tracking
of associations between guest memory and disk state. The third concerns
memory and I/O deduplication.
6.7.1 Hypervisor Swapping and Double Paging
Concurrent work by Amit et al. [5] systematically explores the behavior of
uncooperative hypervisor swapping and implements an improved swap sub-
system for KVM called VSwapper. The main components of their imple-
mentation are the Swap Mapper and the False Reader Preventer. The paper
identifies five primary causes for performance degradation, studies each, and
offers solutions to address them. The first, “silent swap writes”, corresponds
to our notion of guest-I/O–swap optimization which we do not yet support
because we do not support reference-counting on blocks in guest VMDKs.
The second and third, “stale swap reads” and “false swap reads”, and their
solutions are similar to the existing ESX optimizations that cancel swap-ins
for memory pages that are either overwritten by disk I/O or by the guest.
For “silent swap writes” and “stale swap reads”, the Swap Mapper uses the
same techniques Tesseract does to track valid associations between pages in
guest memory and blocks on disk. Their solution to “false swap reads”, the
False Reader Preventer, is more general, however, because it supports the
accumulation of successive guest writes in a temporary buffer to identify
whether a page is entirely overwritten before the next read. The last two, “decayed swap
sequentiality” and “false page anonymity”, are not issues we consider. In
their investigation, they did not observe double-paging to have much impact
on performance. This is likely due to the fact that they followed guidelines
from VMware and provisioned guests with enough VRAM that guest pag-
ing was uncommon and most of the experiments were run with a persistent
level of overcommitment. Tesseract allows for optimizing operations involv-
ing guest I/O followed by another guest I/O with either same pages or disk
blocks. This is not possible with VSwapper. Also, vswapper doesn’t allow for
defragmentation or disk deduplication.
The double-paging problem was first identified in the context of virtual
machines running on VM/370 [46, 101]. Goldberg and Hassinger [46] dis-
cuss the impact of increased paging when the virtual machine's address space
exceeds the memory with which it is backed. Seawright and MacKinnon [101] mention
the use of handshaking between the VMM and operating system to address
the issue but do not offer details.
The Cellular Disco project at Stanford describes the problem of paging
in the guest and swapping in the hypervisor [48, 47]. They address this
double-paging or redundant paging problem by introducing a virtual paging
device in the guest. The paging device allows the hypervisor to track the
paging activity of the guest and reconcile it with its own. Like our approach,
the guest paging device identifies already swapped-out blocks and creates
indirections to these blocks that are already persistent on disk. There is no
mention of the fact that these indirections destroy expected locality and may
impact subsequent guest read I/Os.
Subsequent papers on scheduling memory for virtual machines also refer
in passing to the general problem. Waldspurger [128], for example, men-
tions the impact of double-paging and advocates random selection of pages
by the hypervisor as a simple way to minimize overlap with page-selection
by the guest. Other projects, such as the Satori project [82], use double-
paging to advocate against any mechanism to swap guest pages from the
hypervisor.
Our approach differs from these efforts in several ways. First, we have
a system in which we can—for the first time—measure the extent to which
double-paging occurs. Second, we have an approach that directly addresses
the problem of double-paging in a manner transparent to the guest. Finally,
our techniques change the relationship between the two levels of scheduling:
by reconciling and eliding redundant I/Os, Tesseract encourages the two
schedulers to choose the same pages to be paged out.
6.7.2 Associations Between Memory and Disk State
Tracking the associations between guest memory and guest disks has been
used to improve memory management and working-set estimation for vir-
tual machines. The Geiger project [60], for example, uses paravirtualization
and intimate knowledge of the guest disks to implement a secondary cache
for guest buffer-cache pages. Lu et al. [78] implement a similar form of
victim cache for the Xen hypervisor.
Park et al. [87] describe a set of techniques to speed live-migration of
VMs. One of these techniques is to track associations between pages in
memory and blocks on disks whose contents are shared between the source
and destination machines. In cases where the contents are known to be
resident on disk, the block information is sent to the destination in place
of the memory contents. In the paper, the authors describe techniques for
maintaining this mapping both through paravirtualization and through the
use of read-only mappings for fully virtualized guests.
6.7.3 I/O and Memory Deduplication
The Satori project [82] also tracks the association between disk blocks and
pages in memory. It extends the Xen hypervisor to exploit these associations,
allowing it to elide repeated I/Os that read the same blocks from disk across
VMs, immediately sharing these pages of memory across those guests.
Originally inspired by the Cellular Disco and Geiger projects, Tesseract
shares much in common with these approaches. Like many of them, it tracks
valid associations between memory pages and disk blocks that contain iden-
tical content. Like Park et al., it employs techniques that are fully transpar-
ent to the guest, allowing it to be applied in a wider set of contexts. Unlike
the Satori project, which focused on eliminating redundant read operations
across VMs, Tesseract uses that mapping information to deduplicate I/Os
from a specific guest and its hypervisor. As such, our approach complements
and extends these others.
6.8 Observations
Our experience in this project has led us to question the existing interface
for issuing I/O requests with scatter-gather lists. Given that the underly-
ing physical organization of the disk blocks can differ significantly from the
virtual disk structure, it makes little sense for a scatter-gather list to require
that the target blocks on disk be contiguous. Having a more flexible structure
may allow I/Os to be expressed more succinctly and to be more effective at
communicating expected relationships or locality among those disk blocks.
Further, one can think of generalizing I/O scatter-gather lists and espe-
cially virtual disks to just be indirection tables into a large sea-of-blocks. This
allows for a natural application surface for block indirection.
CHAPTER 7
Impact for the Future
In this chapter, we discuss some of the future directions that can be pursued
based on this dissertation.
7.1 Compiled Code In Scripting Languages:
Fast-Slow Paradigm
For many scripting languages (Python, R, Matlab, etc.), the interpreted lan-
guage was developed first, and researchers developed an efficient compiler
after the fact. As a result, we often have fast compiled functions that run
inside the interpreted language. The compiled code makes assumptions to
generate efficient code. Unusual user applications may violate these assump-
tions, causing the compiled code to silently return an incorrect answer. So, a
user must choose between reliable, interpreted (slow) code, and unreliable
compiled (fast) code.
Checkpointing provides an interesting third alternative. One splits the
computation into segments. For concreteness, we will give an example with
ten segments, and we will assume that ten additional “checking” hosts (or
ten additional CPU cores) are available to run in parallel.
Initially, the compiled code is run. At the beginning of each of the ten
segments, one takes a checkpoint and copies it to a different “checking”
computer. That computer runs the next segment in interpreted mode. At
the end of that segment, the data from the corresponding checkpoint of the
compiled segment is compared with the data at the end of the interpreted
segment for correctness.
At the end, either the ten “checking” hosts (or ten “checking” CPU cores)
report that the computation is correct, or else they report that the compu-
tation must switch to interpreted mode for correctness at the beginning of
a particular segment (after which, one can return to compiled operation as
described above).
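The sketch below outlines this checking loop. Every helper is a placeholder
stub standing in for application-specific code: checkpointing (e.g., via
DMTCP), shipping images to the checking hosts, and comparing final states.

    #include <stdbool.h>
    #include <stdio.h>

    #define NSEGMENTS 10

    /* Placeholder stubs; a real system would write a DMTCP checkpoint,
     * copy it to a checking host, and compare serialized state there. */
    static void take_checkpoint(int seg)          { (void)seg; }
    static void send_to_checking_host(int seg)    { (void)seg; }
    static void run_compiled_segment(int seg)     { (void)seg; }
    static bool interpreted_run_matches(int seg)  { (void)seg; return true; }

    int main(void) {
        for (int seg = 0; seg < NSEGMENTS; seg++) {
            take_checkpoint(seg);        /* state at start of segment */
            send_to_checking_host(seg);  /* checking host re-runs it
                                            in interpreted mode */
            run_compiled_segment(seg);   /* main host stays in compiled mode */
        }
        /* Each checking host reports whether the interpreted re-execution
         * of its segment reproduced the compiled result. */
        for (int seg = 0; seg < NSEGMENTS; seg++) {
            if (!interpreted_run_matches(seg)) {
                fprintf(stderr, "segment %d diverged: switch to interpreted "
                        "mode from its checkpoint, then resume compiled\n",
                        seg);
                return 1;
            }
        }
        printf("all %d segments verified\n", NSEGMENTS);
        return 0;
    }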
Wester et al. [131] implemented a speculation mechanism in the operat-
ing system. It provided coordination across all applications and kernel state
while the speculation policy was left up to the applications. A scheme similar
to this was employed using DMTCP by Ghoshal et al. [45] in an application
to MPI, and by Arya and Cooperman [9] to support the Python scripting
language.
7.2 Support for Hadoop-style Big Data
Hadoop [39] and Spark [40] support a map-reduce paradigm in which the
size of intermediate data may increase during a “map” phase and may de-
crease during a “reduce” phase. Thus, the best place to checkpoint is at
the end of a “reduce” phase. With the right hooks added to Hadoop (or
Spark), a plugin could instruct it to move back-end data to
longer-term storage. On restart, the plugin would use those hooks to move
the longer-term storage back to active storage, and the front end would re-
connect.
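As a sketch, such a plugin might follow the DMTCP 2.x event-hook style used
elsewhere in this dissertation. The two hadoop_* functions are hypothetical
stand-ins for the hooks that would have to be added to Hadoop or Spark.

    #include "dmtcp.h"

    /* Hypothetical hooks to be added to Hadoop (or Spark): */
    extern void hadoop_move_to_long_term_storage(void);
    extern void hadoop_restore_active_storage(void);

    void dmtcp_event_hook(DmtcpEvent_t event, DmtcpEventData_t *data)
    {
      switch (event) {
      case DMTCP_EVENT_WRITE_CKPT:
        /* Before the image is written (ideally at the end of a "reduce"
         * phase), ask Hadoop to move back-end data to long-term storage. */
        hadoop_move_to_long_term_storage();
        break;
      case DMTCP_EVENT_RESTART:
        /* On restart, move the data back so the front end can reconnect. */
        hadoop_restore_active_storage();
        break;
      default:
        break;
      }
      DMTCP_NEXT_EVENT_HOOK(event, data);
    }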
7.3 Cybersecurity
Section 5.8 described the ability to checkpoint a network of virtual machines
using plugins [44]. This can be combined with DMTCP plugins to monitor
and modify the operation of a guest virtual machine. In particular, if mal-
ware uses any external services (from gettimeofday to calling back to a con-
troller on the Internet), this can be intercepted by a suitable DMTCP plugin,
and even replayed, in order to more closely examine the malware. See Visan
et al. [127] and Arya et al. [10] for examples of using record-replay through
DMTCP plugins. (While some malware tries to detect if it is running inside
a virtual machine, malware will often continue to run in this situation. Oth-
erwise, virtual machines would provide a good defense against malware.)
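As an illustration, the sketch below wraps gettimeofday in the style of a
DMTCP plugin. The replay_mode flag and the logging are hypothetical;
NEXT_FNC is DMTCP's macro for invoking the next definition of a wrapped
function.

    #include <sys/time.h>
    #include "dmtcp.h"

    /* Hypothetical flag, toggled by the analysis harness:
     * 0 = record the real timeline, 1 = replay a recorded one. */
    static int replay_mode = 0;

    int gettimeofday(struct timeval *tv, struct timezone *tz)
    {
      if (replay_mode) {
        /* Replay: return a previously recorded timestamp so the malware
         * observes the same timeline on every run (log lookup omitted). */
        tv->tv_sec  = 1400000000;
        tv->tv_usec = 0;
        return 0;
      }
      /* Record: call the real gettimeofday, then log the result. */
      int ret = NEXT_FNC(gettimeofday)(tv, tz);
      /* ... append (tv, ret) to the record log here ... */
      return ret;
    }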
7.4 Algorithmic debugging
Algorithmic debugging [102, 13, 94, 83, 84, 79] is a well-developed tech-
nique that was especially explored in the 1990s. Roughly, the idea is that
an algorithmic debugger keeps a trace of the computation, and shows the
user the input and output of various subprocedures. Through a series of
questions and answers (similar to the game of 20 questions), the software
determines which low-level subprocedure caused the bug. This tended to
be used in functional languages and declarative languages such as Prolog,
because of the ease of capturing the input and output of a subprocedure.
The use of checkpoints allows one to apply this same technique to main-
stream languages including C/C++, Python, and others. Instead of en-
capsulating a small input and output, a traditional debugger (e.g., GDB,
Python pdb) would be used to allow the programmer to fully explore the
global state at the beginning and end of the subprocedure. In the case of a
failed step, checkpoint-restart would allow us to restart from the last valid
step instead of rerunning the program from the beginning.
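The sketch below shows the "20 questions" search over checkpointed steps.
The helpers are placeholder stubs, and the search assumes that once the
program state is corrupted, it stays corrupted.

    #include <stdbool.h>
    #include <stdio.h>

    #define NSTEPS 20

    /* Placeholder stubs; a real tool would restore a checkpoint, run one
     * subprocedure under GDB or pdb, and ask the user about its output. */
    static void restore_checkpoint(int step)        { (void)step; }
    static void run_step(int step)                  { (void)step; }
    static bool user_says_output_correct(int step)  { (void)step; return true; }

    int main(void) {
        int lo = 0, hi = NSTEPS - 1, first_bad = -1;
        while (lo <= hi) {            /* binary search for first bad step */
            int mid = (lo + hi) / 2;
            restore_checkpoint(mid);  /* state at the start of step mid */
            run_step(mid);
            if (user_says_output_correct(mid)) {
                lo = mid + 1;         /* the bug lies in a later step */
            } else {
                first_bad = mid;      /* the bug is here or earlier */
                hi = mid - 1;
            }
        }
        if (first_bad >= 0)
            printf("first failing step: %d; restart its checkpoint "
                   "under a debugger\n", first_bad);
        else
            printf("all steps reported correct\n");
        return 0;
    }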
7.5 Reversible Debugging
Reversible debugging, also known as time-travel debugging, has a long
history [19, 38, 64, 72]. Checkpointing provides an obvious approach in this area. Some
parts of this approach have already been developed within the context of
DMTCP (decomposing debugging histories for replay [127] and reverse ex-
pression watchpoints [10]).
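The core trick can be sketched in a few lines; the helpers below are
hypothetical, and deterministic re-execution (e.g., via record-replay) is
assumed.

    #include <stdio.h>

    /* Hypothetical helpers: checkpoints are assumed every 1000 events. */
    static long latest_checkpoint_before(long n) { return (n / 1000) * 1000; }
    static void restore_checkpoint_at(long c)    { (void)c; }
    static void execute_forward(long nevents)    { (void)nevents; }

    /* "Reverse step": to move backwards from event n, restore the most
     * recent checkpoint at count c <= n-1 and replay forward n-1-c events. */
    static void reverse_step(long n) {
        long target = n - 1;                        /* where to land */
        long c = latest_checkpoint_before(target);  /* nearest checkpoint */
        restore_checkpoint_at(c);
        execute_forward(target - c);                /* deterministic replay */
    }

    int main(void) {
        reverse_step(123456);
        printf("now positioned just before event 123456\n");
        return 0;
    }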
7.6 Android-Based Mobile Computing
Huang and Cheng have already demonstrated the use of DMTCP to check-
point processes under Android [53]. This provides the potential for truly
pervasive mobile apps, which can checkpoint and migrate themselves to
other platforms. This can also improve software sustainability, in the
software-engineering sense, by saving the entire mobile app instead of the
current practice of saving an app's state and re-loading that state whenever
the app is re-launched.
7.7 Cloud Computing
Cloud computing provides on-demand self-service and rapid elasticity of re-
sources for applications. These characteristics are similar to those of the
old-style mainframes of the 1960s through the 1980s. However, to make the
analogy complete, we need a scheduler for the Cloud. This scheduler must
support parallel applications in addition to single-process applications. A
scheduler can use DMTCP to suspend or migrate jobs. The ca-
pabilities of DMTCP contributing to this goal include providing checkpoint